GitHub Pages: Jekyll Archive Pages
The popular jekyll-archives plugin isn’t available on GitHub Pages, yet its functionality is frequently asked for: it creates listings of blog posts grouped by category, tag or year of posting. Luckily, this functionality can be duplicated with a little ingenuity, some code and a GitHub Action.
Credit
Credit for this solution goes to Kannan Suresh, who published a walkthrough on his website, aneejian.com, and hosts example code in his GitHub repo, jekyll-blog-archive-workflow.
Overview
Archives
In the context of Jekyll sites, or other blog sites, the term “archives” means a listing of all blog posts, often grouped by some metadata. In this case we are going to group blog posts by:
- Categories
- Tags
- Year (of blog post)
It is assumed that all blog posts in the site will have Front Matter YAML which includes both category and tag metadata for the post.
High-level design
In order to make this work, a few things are needed in your Jekyll site, and in your GitHub Repo:
1. A data file (JSON) listing all categories, tags and years of blog posts
This will be used by a GitHub Action to loop through each category, tag and year associated with the site’s blog posts.
2. Layout templates to be used for each grouping (categories, tags, years)
Three layout templates will define the appearance of each listing of blog pages. In other words, these are templates for the pages that show, for example, a list of blog pages that match a given category.
If you don’t need different formatting for these listings it would be possible to change this solution to have just one template, but other changes would be needed, especially in the Python script used by the GitHub Action.
3. A GitHub Action workflow
The GitHub Action includes steps that will generate new pages in your site for each category, tag and publication year of your posts, listing the blog posts relating to that metadata.
4. Create index pages for each category, tag and year
Not included in the solution on aneejian.com are pages that hold indexes of each category, tag or year by which the blogs are grouped, with links to each page of grouped content.
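As a rough illustration of how the data file from item 1 could drive the rest of the process, the following Python sketch parses a JSON document and yields one (grouping, value) pair per archive page to be generated. The JSON keys (categories, tags, years) and the sample values are assumptions made for illustration; the real structure is whatever the archivedata file in the published solution produces, and the function name iter_archive_entries is hypothetical.

```python
import json

# Hypothetical sample of the data file's content; the key names and
# values are assumptions, not taken from the published solution.
SAMPLE_ARCHIVE_DATA = """
{
  "categories": ["jekyll", "github"],
  "tags": ["archives", "liquid"],
  "years": ["2023", "2024"]
}
"""


def iter_archive_entries(raw_json):
    """Yield (grouping, value) pairs, one per archive page to create."""
    data = json.loads(raw_json)
    for grouping in ("categories", "tags", "years"):
        # A missing key is treated as "nothing to generate" for that grouping.
        for value in data.get(grouping, []):
            yield grouping, value


entries = list(iter_archive_entries(SAMPLE_ARCHIVE_DATA))
for grouping, value in entries:
    print(f"{grouping}: {value}")
```

Each pair corresponds to one generated page, so the sample data above would lead to six archive pages.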
Implementation
It is assumed that you are reading/following the published solution on aneejian.com.
A rough outline of the solution, plus a few changes that were needed in my own implementation, is given here:
_archives folder
The solution suggests putting the archivedata.txt file in a folder named _archives. The name of this folder seems to be optional, but if you want to change it you’ll need to adjust some of the configuration that follows.
The contents of this folder are published as a Collection, defined in the _config.yml file:
collections:
  archives:
    output: true
    permalink: /archives/:path/
Thus, the archivedata.txt file is published under the /archives/archivedata/ path in your site, which is important as it is needed later on when the GitHub Action fires.
NOTE: Although other folders and files will (eventually) sit under the _archives folder, e.g. /_archives/years/2024.md, they aren’t published under the /archives/ path. This is because each page has its own permalink defined in its Front Matter, overriding the path that the Collection would have created.
Example Front Matter for /_archives/years/2024.md
---
title: 2024
year: "2024"
layout: archive-years
permalink: "year/2024"
---
_layouts folder
Eventually, a number of new files will be generated by the GitHub Action under the _archives folder, e.g. /_archives/years/2024.md. These files will have Front Matter data that defines the layout (i.e. template) to be used to render that page. See above for example Front Matter for such a page.
The published solution defines three layouts, one each for posts grouped by category, tag or year of publication. The layout names are effectively hard-coded in the Python script used in the GitHub Action so, unless you want to rewrite the script, it’s easiest to create the three layout files as suggested.
WARNING: The example layout files are themselves based on a default layout (template) for the site. In the example this template is named default but in my Jekyll Minima site the default template has been renamed as base, so the Front Matter for the layout files needed to be changed:
---
layout: base
---
Also, I just wanted a simple listing of the blog pages, with links to each, rather than the more advanced formatting used in the solution (which used an include), so I changed the content of my layout files accordingly.
GitHub Action
The GitHub Action (workflow) is triggered by changes made under the _posts folder, and it can also be triggered manually from the Actions tab of your repo on GitHub. As noted above, it performs the following steps:
- Check out your Git repo.
- Run the jekyll-blog-archive code:
  - Spin up a Docker container
  - Run a Python script
  - Create new pages for each category, tag and year in the JSON data file
- Configure git (CLI)
- Push the new pages back to your repo
Trigger path
The provided solution will run the workflow when changes are made within the _posts path, but this assumes that folder is in the root of the repo. In my case the Jekyll site was within a docs folder, so this needed to be fixed:
on:
  workflow_dispatch:
  push:
    paths:
      - "docs/_posts/**"
Python script
I made the changes required to specify the location of the archivedata file and the output path for the archive files. The following snippet of the add_archives.yml file shows these changes, as part of the step where the original repo is used to define the core of the Action:
- name: Generate Jekyll Archives
  uses: kannansuresh/jekyll-blog-archive-workflow@master
  with:
    archive_url: "https://dev.joynt.co.uk/archives/archivedata"
    archive_folder_path: "docs/_archives"
From what I can tell, this looks for the action.yml file in that repo, which in turn builds a Docker image using a Dockerfile from the same repo. The Docker image includes a copy of the Python script that creates the many output files for each category, tag and year.
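To make the idea of "many output files" more concrete, here is a minimal Python sketch of how such pages could be written, one Markdown stub per category, tag or year. It is not the script from the published solution. The function name write_archive_pages is hypothetical, and only the layout, key and permalink used for years are confirmed by the example page earlier in this article; the equivalent values for categories and tags are assumptions.

```python
import tempfile
from pathlib import Path

# Singular names used in the permalink and Front Matter key for each grouping.
# Only the "years" row is confirmed by the example page in this article;
# the other two rows are assumptions.
SINGULAR = {"categories": "category", "tags": "tag", "years": "year"}


def write_archive_pages(entries, out_dir):
    """Write one Markdown stub per (grouping, value) pair; return the paths."""
    written = []
    for grouping, value in entries:
        singular = SINGULAR[grouping]
        folder = Path(out_dir) / grouping
        folder.mkdir(parents=True, exist_ok=True)
        page = folder / f"{value}.md"
        page.write_text(
            "---\n"
            f"title: {value}\n"
            f'{singular}: "{value}"\n'
            f"layout: archive-{grouping}\n"
            f'permalink: "{singular}/{value}"\n'
            "---\n",
            encoding="utf-8",
        )
        written.append(page)
    return written


# Demonstrate against a throwaway folder instead of a real docs/_archives.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_archive_pages([("years", "2024"), ("tags", "jekyll")], tmp)
    relative = [p.relative_to(tmp).as_posix() for p in paths]
    year_page = paths[0].read_text(encoding="utf-8")

print(relative)
print(year_page)
```

The pages contain only Front Matter because the layout named in each one is responsible for listing the matching posts.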
Snags
When I first tried to run this action I got errors saying that it had failed to update my repo. There were two causes. First, the workflow was using master as the branch name (mine is main). Second, restrictive permissions did not allow unauthenticated updates. So I made some more changes:
master vs. main branch
The original code used the master branch:
git push origin master
I needed to change this to main (the newer default name for the first branch of a new GitHub repo). However I needed to make further changes to this line; see below.
Adding a permissions section
I added the following, although I cannot be certain it was needed:
permissions:
  contents: write
Using a Personal Access Token for git push
My repo doesn’t allow pushes from just anyone, so I created a new Personal Access Token (PAT) and saved this as a secret in my repo. I changed the final line of the workflow to use this token in the git push command:
git push https://x-access-token:$@github.com/$.git HEAD:main || echo "No changes to push."
Conditional echo statements
It is worth noting, as they originally confused me, how the trailing echo statements worked, e.g.
primary_command || echo "some text"
If the primary_command fails, i.e. gives a non-zero exit value, then the command following the || (OR operator) runs.
Thus, if there were no changes that needed to be made to the repo then the logs for the Action would show this explicitly.
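The behaviour of || can be checked outside a workflow. The Python sketch below runs two tiny shell commands, using the standard true and false commands as stand-ins for a primary command that succeeds or fails. It assumes a POSIX shell is available, and the helper name run_with_fallback is hypothetical.

```python
import subprocess


def run_with_fallback(primary_command):
    """Run `primary_command || echo "fallback ran"` in a POSIX shell.

    Returns the overall exit status and whatever was printed.
    """
    result = subprocess.run(
        f'{primary_command} || echo "fallback ran"',
        shell=True,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


print(run_with_fallback("true"))   # primary succeeds, so echo is skipped
print(run_with_fallback("false"))  # primary fails, so echo runs
```

Both calls end with exit status 0. In the second case the status comes from echo, so a workflow step written this way is not marked as failed when the primary command fails.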
Index pages for categories and tags
It may be that some Jekyll themes automatically include templates that will create index pages for categories, but these pages weren’t created in my case. To fix this I created two new pages (I wasn’t so bothered about indexing by year):
/_pages/categories.md
/_pages/tags.md
These needed to list all the categories/tags in the site.posts list, and to provide links to the “archive” page for each category/tag. This would have been easy, except that these archive pages had names generated by the Python script.
Sanitized category and tag names
The script sanitizes the names of the files it creates, which are based on the names of the categories and tags included in the blog posts. Thus, the names of the pages these indexes need to link to cannot be taken directly from the original category or tag names. The following Liquid code was used to convert the names of the categories and tags to match the output of the script:
{% assign value_escaped = category[0] | replace: ' ', '-' | replace: '.', '-' %}
{% assign value_escaped = value_escaped | replace: '#', 'sharp' %}
{% assign value_escaped = value_escaped | downcase %}
{% assign value_escaped = value_escaped | replace_regex: '[^a-z0-9_-]', '-' %}
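For comparison, the same chain of replacements can be written in Python. This is a sketch that mirrors the Liquid filters above step by step; it is not copied from the script in the published solution, and the function name sanitize_name is hypothetical.

```python
import re


def sanitize_name(name):
    """Mirror the Liquid filter chain used for category and tag names."""
    value = name.replace(" ", "-").replace(".", "-")  # spaces and dots become hyphens
    value = value.replace("#", "sharp")               # e.g. "C#" becomes "Csharp"
    value = value.lower()
    # Anything other than a-z, 0-9, underscore or hyphen becomes a hyphen.
    return re.sub(r"[^a-z0-9_-]", "-", value)


for original in ("Web Dev", "Node.js", "C# .NET", "C++"):
    print(f"{original!r} -> {sanitize_name(original)!r}")
```

Note that consecutive replaced characters are not collapsed, so a name such as "C# .NET" ends up with a double hyphen.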
Example index page for Categories
---
layout: page
title: "Categories"
permalink: /categories/
---
<ul>
  {% assign sorted_categories = site.categories | sort %}
  {% for category in sorted_categories %}
    {% assign value_escaped = category[0] | replace: ' ', '-' | replace: '.', '-' %}
    {% assign value_escaped = value_escaped | replace: '#', 'sharp' %}
    {% assign value_escaped = value_escaped | downcase %}
    {% assign value_escaped = value_escaped | replace_regex: '[^a-z0-9_-]', '-' %}
    <li>
      <a href="/category/{{ value_escaped }}">{{ category[0] }}</a> ({{ category[1].size }} posts)
    </li>
  {% endfor %}
</ul>