GitHub Pages: Jekyll Archive Pages

The popular jekyll-archives plugin isn’t available on GitHub Pages, yet the functionality it provides is frequently asked for: listings of blog posts grouped by category, tag or year of posting. Luckily, this functionality can be duplicated with a little ingenuity, some code and a GitHub Action.

Credit

Credit for this solution must go to Kannan Suresh, who published a walkthrough on his website, aneejian.com, and hosts example code in his GitHub repo, jekyll-blog-archive-workflow.

Overview

High-level design

In order to make this work, a few things are needed in your Jekyll site, and in your GitHub Repo:

1. A data file (JSON) listing all categories, tags and years of blog posts

This will be used by a GitHub Action to loop through each category, tag and year associated with the site’s blog posts.
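As a rough illustration of that loop, the sketch below parses a data file and walks each grouping. The JSON shape (top-level `categories`, `tags` and `years` arrays), the sample values and the function name `iter_archive_entries` are assumptions made for this example; check the output of your own data file before relying on them.

```python
import json

# Assumed shape of the published data file; the real file may differ.
SAMPLE_ARCHIVE_DATA = """
{
  "categories": ["Jekyll", "GitHub Pages"],
  "tags": ["liquid", "automation"],
  "years": ["2023", "2024"]
}
"""


def iter_archive_entries(raw_json):
    """Yield (grouping, value) pairs for every category, tag and year."""
    data = json.loads(raw_json)
    for grouping in ("categories", "tags", "years"):
        # A missing grouping is treated as empty rather than as an error.
        for value in data.get(grouping, []):
            yield grouping, str(value)


entries = list(iter_archive_entries(SAMPLE_ARCHIVE_DATA))
for grouping, value in entries:
    print(f"{grouping}: {value}")
```

Each pair corresponds to one archive page that would later be generated.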

2. Layout templates to be used for each grouping (categories, tags, years)

Three layout templates will define the appearance of each listing of blog pages. In other words, these are templates for the pages that show, for example, a list of blog pages that match a given category.

If you don’t need different formatting for these listings it would be possible to change this solution to have just one template, but other changes would be needed, especially in the Python script used by the GitHub Action.

3. A GitHub Action workflow

The GitHub Action includes steps that will generate new pages in your site for each category, tag and publication year of your posts, listing the blog posts relating to that metadata.

4. Create index pages for each category, tag and year

Not included in the solution on aneejian.com are pages that hold indexes of each category, tag or year by which the blogs are grouped, with links to each page of grouped content.

Implementation

It is assumed that you are reading/following the published solution on aneejian.com.

What follows is a rough elaboration of that solution, plus details of a few changes that were needed in my own implementation:

_archives folder

The solution suggests putting the archivedata.txt file in a folder named _archives. The name of this folder seems to be optional, but if you want to change it you’ll need to adjust some of the configuration that follows.

The contents of this folder are published as a Collection, defined in the _config.yml file:

collections:
  archives:
    output: true
    permalink: /archives/:path/

Thus, the archivedata.txt file is published under the /archives/archivedata/ path in your site, which is important as it is needed later on when the GitHub Action fires.

NOTE: Despite there (eventually) being other folders and files under this path, e.g. /_archives/years/2024.md, they aren’t published under the /archives/ path. This is because each page has its own permalink defined in its Front Matter, overriding the path that the Collection would have created.

Example Front Matter for `/_archives/years/2024.md`

---
title: 2024
year: "2024"
layout: archive-years
permalink: "year/2024"
---
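To make the structure of that Front Matter concrete, here is a minimal Python sketch that builds it for a given grouping. It is not the script used by the Action. Only the `years` row of the mapping is taken from the example above; the keys, layout names and URL prefixes for categories and tags, and the function name `build_front_matter`, are assumptions.

```python
# Hypothetical mapping: grouping -> (front matter key, layout, URL prefix).
# Only the "years" row comes from the example above; the rest are assumed.
GROUPINGS = {
    "years": ("year", "archive-years", "year"),
    "categories": ("category", "archive-categories", "category"),
    "tags": ("tag", "archive-tags", "tag"),
}


def build_front_matter(grouping, value):
    """Return the Front Matter text for one archive page.

    Values are inserted as-is; a real generator would need to escape
    characters that are special in YAML.
    """
    key, layout, url_prefix = GROUPINGS[grouping]
    lines = [
        "---",
        f"title: {value}",
        f'{key}: "{value}"',
        f"layout: {layout}",
        f'permalink: "{url_prefix}/{value}"',
        "---",
    ]
    return "\n".join(lines) + "\n"


front_matter = build_front_matter("years", "2024")
print(front_matter)
```

Because the permalink is set per page, the published URL depends on this value rather than on the Collection's own permalink pattern.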

_layouts folder

Eventually, a number of new files will be generated by the GitHub Action under the _archives folder, e.g. /_archives/years/2024.md. These files will have Front Matter data that defines the layout (i.e. template) to be used to render that page. See above for example Front Matter for such a page.

The published solution defines three layouts depending on whether posts are being grouped by category, tag or year of publication. These layouts are effectively hard-coded in the Python script used in the GitHub action so, unless you want to rewrite the script, it’s easiest to create the three layout files as suggested.

Example layout files

WARNING: The example layout files are themselves based on a default layout (template) for the site. In the example this template is named default but in my Jekyll Minima site the default template has been renamed as base, so the Front Matter for the layout files needed to be changed:

---
layout: base
---

Also, I just wanted a simple listing of the blog pages, with links to each, rather than the more advanced formatting used in the solution (which used an include), so I changed the content of my layout files accordingly.

GitHub Action

The GitHub Action (workflow) is triggered by changes made under the _posts folder, and it can also be triggered manually from the Actions tab of your repo on GitHub. As noted above, it performs the following steps:

Checkout your Git repo.
Run the jekyll-blog-archive code:
- Spin up a docker container
- Run a Python script
- Create new pages for each category, tag and year listed in the JSON data file
Configure git (CLI)
Push the new pages back to your repo
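The page-creation step can be sketched in Python as follows. This is an illustration of the idea only, not the contents of the real script: the function name `write_archive_pages`, the simplified Front Matter and the "skip unchanged files" behaviour are all assumptions.

```python
import tempfile
from pathlib import Path


def write_archive_pages(archive_dir, entries):
    """Create one stub page per (grouping, value); return the paths written."""
    written = []
    for grouping, value in entries:
        page = Path(archive_dir) / grouping / f"{value}.md"
        content = f"---\ntitle: {value}\n---\n"  # simplified Front Matter
        # Skip pages that already hold the same content, so that a rerun
        # leaves nothing new for git to commit.
        if page.exists() and page.read_text(encoding="utf-8") == content:
            continue
        page.parent.mkdir(parents=True, exist_ok=True)
        page.write_text(content, encoding="utf-8")
        written.append(page)
    return written


with tempfile.TemporaryDirectory() as tmp:
    entries = [("years", "2024"), ("tags", "jekyll")]
    first_run = write_archive_pages(tmp, entries)
    second_run = write_archive_pages(tmp, entries)
    first_names = [p.relative_to(tmp).as_posix() for p in first_run]
    print(first_names)
    print(len(second_run))
```

In this sketch the second run writes nothing, which corresponds to the case where the workflow finds no changes to push.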

Trigger path

The provided solution will run the workflow when changes are made within the _posts path, but this assumes that folder is in the root of the repo. In my case the Jekyll site was within a docs folder, so this needed to be fixed:

on:
  workflow_dispatch:
  push:
    paths:
      - "docs/_posts/**"

Python script

I made the required changes to specify the location of the archivedata file and the output path for the archive files. The following snippet of the add_archives.yml file shows these changes, as part of the step where the original repo is used to define the core of the Action:

  - name: Generate Jekyll Archives
    uses: kannansuresh/jekyll-blog-archive-workflow@master
    with:
      archive_url: "https://dev.joynt.co.uk/archives/archivedata"
      archive_folder_path: "docs/_archives"

From what I can tell, this looks for the action.yml file in that repo, which then builds a Docker image using a Dockerfile from the same repo. The Docker image includes a copy of the Python script that creates the many output files for each category, tag and year.

/dist/_create-archive-files.py

Snags

When I first ran this action, it reported errors saying that it had failed to update my repo. There were two causes: the workflow was using master as the branch name (mine is main), and restrictive permissions were blocking unauthenticated updates. So I made some more changes:

master vs. main branch

The original code used the master branch:

git push origin master

I needed to change this to main (the newer default name for the first branch of a new GitHub repo). However, I needed to make further changes to this line; see below.

Adding a permissions section

I added the following, although I cannot be certain it was needed:

permissions:
  contents: write

Using a Personal Access Token for `git push`

My repo doesn’t allow pushes from just anyone, so I created a new Personal Access Token (PAT) and saved this as a secret in my repo. I changed the final line of the workflow to use this token in the git push command:

git push https://x-access-token:$@github.com/$.git HEAD:main || echo "No changes to push."

Conditional echo statements

It is worth noting, as they originally confused me, how the trailing echo statements worked, e.g.

primary_command || echo "some text"

If the primary_command fails, i.e. gives a non-zero exit value, then the command following the || (OR operator) runs.

Thus, if there were no changes that needed to be made to the repo then the logs for the Action would show this explicitly.
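The same "run the fallback only on failure" logic can be expressed in Python, which may make the control flow easier to see. This is a sketch for illustration; the function name `run_with_fallback` and the stand-in commands are hypothetical and are not part of the workflow.

```python
import subprocess
import sys


def run_with_fallback(command, fallback_message):
    """Run command; print fallback_message only if it exits non-zero."""
    result = subprocess.run(command)
    if result.returncode != 0:
        print(fallback_message)
        return False
    return True


# Stand-ins for a failing and a succeeding primary_command.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
succeeding = [sys.executable, "-c", "pass"]

failed_ok = run_with_fallback(failing, "No changes to push.")
succeeded_ok = run_with_fallback(succeeding, "never printed")
```

One consequence of the shell form is worth keeping in mind: the exit status of the whole line is that of the echo, so the step is reported as successful even when the primary command failed for some other reason.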

Index pages for categories and tags

It may be that some Jekyll themes automatically include templates that will create index pages for categories, but these pages weren’t created in my case. To fix this I created two new pages (I wasn’t so bothered about indexing by year):

/_pages/categories.md
/_pages/tags.md

These needed to list all the categories/tags in the site.posts list, and to provide links to the “archive” page for each category/tag. This would have been easy, except that these archive pages had names generated by the Python script.

Sanitized category and tag names

The script sanitizes the names of the files it creates, which are based on the names of the categories and tags included in the blog posts. Thus, the names of the pages these indexes need to link to cannot be taken immediately from the original category or tag names. The following Liquid code was used to update the names of the categories and tags to match the output of the script:

{% assign value_escaped = category[0] | replace: ' ', '-' | replace: '.', '-' %}
{% assign value_escaped = value_escaped | replace: '#', 'sharp' %}
{% assign value_escaped = value_escaped | downcase %}
{% assign value_escaped = value_escaped | replace_regex: '[^a-z0-9_-]', '-' %}
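For comparison, the same chain of substitutions can be written as a small Python function. This mirrors the Liquid filters shown above, in the same order; it is not copied from the Action's script, and the function name `sanitize_name` is hypothetical.

```python
import re


def sanitize_name(name):
    """Apply the same substitutions, in the same order, as the Liquid filters."""
    value = name.replace(" ", "-").replace(".", "-")
    value = value.replace("#", "sharp")
    value = value.lower()
    # Anything outside a-z, 0-9, underscore and hyphen becomes a hyphen.
    return re.sub(r"[^a-z0-9_-]", "-", value)


for original in ("GitHub Pages", "C#", "Node.js"):
    print(original, "->", sanitize_name(original))
```

Running a few of your own category and tag names through a function like this is a quick way to check that the index links match the generated file names.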

Example index page for Categories

---
layout: page
title: "Categories"
permalink: /categories/
---

<ul>
{% assign sorted_categories = site.categories | sort %}
{% for category in sorted_categories %}
{% assign value_escaped = category[0] | replace: ' ', '-' | replace: '.', '-' %}
{% assign value_escaped = value_escaped | replace: '#', 'sharp' %}
{% assign value_escaped = value_escaped | downcase %}
{% assign value_escaped = value_escaped | replace_regex: '[^a-z0-9_-]', '-' %}
    <li>
        <a href="/category/{{ value_escaped }}">{{ category[0] }}</a> ({{ category[1].size }} posts)
    </li>
{% endfor %}
</ul>