Generate RSS Feed for All Languages in Hugo

By Yong-Siang Shih / Mon 16 July 2018 / In categories Notes

Hugo, Python, RSS

Hugo supports multilingual sites natively and generates a separate RSS feed for each language sub-site. However, you may also want a global feed that includes the articles from every sub-site.

One possibility is to define a custom output format for the home page in config.toml:

[outputs]
home = ["HTML", "RSS", "FEED"]

[mediaTypes]
[mediaTypes."application/rss"]
suffixes = ["xml"]

[outputFormats]
[outputFormats.FEED]
mediatype = "application/rss"
baseName = "feed"

In layouts/index.feed.xml, we then use the following range loop to iterate through all pages in all languages:

{{ range .Site.AllPages }}
{{ if .IsPage }}
  {{/* emit the <item> element for this page here */}}
{{ end }}{{ end }}

Of course, the generated location would still be under the language prefix lang-code/. So to make it appear “global” you might need to manually copy it to the root directory.
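The manual copy can be scripted as part of the build. The following is a minimal sketch, assuming the site is published to a directory such as public and the feed is named feed.xml as in the configuration above. The helper name copy_feed_to_root and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def copy_feed_to_root(public_dir, lang_code, feed_name="feed.xml"):
    """Copy <public_dir>/<lang_code>/<feed_name> to <public_dir>/<feed_name>."""
    source = Path(public_dir) / lang_code / feed_name
    target = Path(public_dir) / feed_name
    # copyfile overwrites an existing target, so the step can be re-run after each build.
    shutil.copyfile(source, target)
    return target
```

For example, copy_feed_to_root("public", "en") would place a copy of public/en/feed.xml at public/feed.xml, where "en" stands in for whichever language code the feed was generated under.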

In addition, I also wrote a simple Python script to combine RSS feeds from different languages.

Basically, it grabs all feeds and merges them into a single feed.

First, I use lxml to load and parse the RSS feeds. Here root_dir is the root directory of the published site, usually a directory called public, and paths holds the relative paths of the RSS files, usually {LANG_CODE}/index.xml.

import os

from lxml import etree

def load_feeds(root_dir, paths):
    feeds = []
    for path in paths:
        # Open in binary mode so lxml can honor the XML encoding declaration.
        with open(os.path.join(root_dir, path), 'rb') as infile:
            feeds.append(etree.parse(infile))
    return feeds

We assume there are only two sub-sites to simplify the logic, but it is easy to extend the script to handle more. This is left as an exercise for the reader.
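As a starting point for that exercise, here is a rough sketch of merging any number of feeds. To keep it self-contained it uses the standard library's xml.etree.ElementTree instead of lxml, takes the root rss elements directly, and leaves out the lastBuildDate and link handling. The function name merge_feeds is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

D_FORMAT = '%a, %d %b %Y %H:%M:%S %z'


def merge_feeds(feeds):
    """Move the <item> elements of every feed into the first feed's channel."""
    main_channel = feeds[0].find('channel')
    all_items = []
    for feed in feeds:
        channel = feed.find('channel')
        items = channel.findall('item')
        all_items.extend(items)
        # Detach the items so they can be re-inserted in sorted order.
        for item in items:
            channel.remove(item)
    # Newest entries first, compared as timezone-aware datetimes.
    all_items.sort(
        key=lambda item: datetime.strptime(item.findtext('pubDate'), D_FORMAT),
        reverse=True)
    main_channel.extend(all_items)
    return feeds[0]
```

Each element of feeds would be obtained with something like ET.parse(path).getroot(). The first feed supplies the channel metadata, just as main_feed does in the two-feed version.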

Second, we use the toml package to load the configuration file config.toml and obtain the baseURL of the site, which is used to set the location of the global RSS feed.

In addition, we obtain all entries from each feed and sort the entries by their published date.

Finally, we inject all items into an RSS file.

import os
from datetime import datetime

import toml

NAMESPACES = {
    'atom': 'http://www.w3.org/2005/Atom',
}
D_FORMAT = '%a, %d %b %Y %H:%M:%S %z'

def process_feeds(main_feed, alt_feed, output_path, config_path):
    with open(config_path) as infile:
        config = toml.load(infile)
    base_url = config['baseURL'].rstrip('/') + '/'

    link_node = main_feed.xpath('//rss/channel/link')[0]
    link_node.text = base_url
    atom_node = main_feed.xpath(
        '//rss/channel/atom:link', namespaces=NAMESPACES)[0]
    atom_node.attrib['href'] = os.path.join(base_url, output_path)
    last_build_node = main_feed.xpath('//rss/channel/lastBuildDate')[0]
    last_build_alt_node = alt_feed.xpath('//rss/channel/lastBuildDate')[0]

    if datetime.strptime(last_build_node.text, D_FORMAT) < datetime.strptime(
            last_build_alt_node.text, D_FORMAT):
        last_build_node.text = last_build_alt_node.text

    all_items = []

    main_items = main_feed.xpath('//rss/channel/item')

    all_items.extend(main_items)
    all_items.extend(alt_feed.xpath('//rss/channel/item'))
    all_items.sort(
        key=lambda x: datetime.strptime(x.xpath('pubDate')[0].text, D_FORMAT),
        reverse=True)

    channel = main_feed.xpath('//rss/channel')[0]
    for item in main_items:
        channel.remove(item)

    for item in all_items:
        # Appending is equivalent to inserting at the end of the channel.
        channel.append(item)
    return main_feed
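The sort above parses each pubDate with D_FORMAT rather than comparing the raw strings. The small sketch below, using made-up dates, illustrates why this matters when sub-sites publish with different UTC offsets: the %z directive yields timezone-aware datetimes, so entries are ordered by the actual instant they were published.

```python
from datetime import datetime

D_FORMAT = '%a, %d %b %Y %H:%M:%S %z'

# Hypothetical publication dates with three different UTC offsets.
pub_dates = [
    'Mon, 16 Jul 2018 08:00:00 +0800',  # 00:00 UTC on 16 Jul
    'Mon, 16 Jul 2018 01:00:00 +0000',  # 01:00 UTC on 16 Jul
    'Sun, 15 Jul 2018 23:30:00 -0700',  # 06:30 UTC on 16 Jul
]

# Sort newest first by the parsed, timezone-aware datetime.
newest_first = sorted(
    pub_dates,
    key=lambda text: datetime.strptime(text, D_FORMAT),
    reverse=True)

for text in newest_first:
    print(text)
```

Here the entry dated Sunday in local time is in fact the most recent one, which a plain string comparison would not reveal.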

The command-line options can be handled by the following code.

import argparse


if __name__ == "__main__":
    # Parsing arguments
    parser = argparse.ArgumentParser(description='Merge RSS feeds.')
    parser.add_argument(
        '--root-dir', required=True, help='publish root directory')
    parser.add_argument(
        '-o', '--output', required=True, help='output rss file')
    parser.add_argument(
        '-i', '--input', required=True, nargs='+', help='path of the feeds')
    parser.add_argument(
        '-c', '--config', required=True, help='path of the config')
    args = parser.parse_args()
    assert len(args.input) == 2

    feeds = load_feeds(args.root_dir, args.input)
    feed = process_feeds(
        *feeds, output_path=args.output, config_path=args.config)

    feed.write(
        os.path.join(args.root_dir, args.output),
        pretty_print=True,
        encoding='utf-8')

Running this script produces a combined RSS feed at the output path. This is also how the current global RSS feed for City of Wings is generated.

