Jekyll static site generator

Static Site Generators (SSGs), which have arguably been popularised by GitHub’s integration of Jekyll into GitHub Pages, have become dominant over the past decade. I continued using Jekyll to automate the finer aspects of my personal sites’ navigation.

Jekyll is used to generate the links to this text from https://adrian.geek.nz/docs & https://adrian.geek.nz/docs.atom. There are other SSGs, but since I use Jekyll that’s the one I’ll discuss now.


Once the commandline arguments & config file are parsed, the main driver is the Site class & its process method, which consists of these major steps:

  1. Clearing all relevant properties.
  2. Read
  3. Generate
  4. Render
  5. Cleanup
  6. Write
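As a rough sketch of that driver: the MiniSite class below is a hypothetical stand-in, not Jekyll’s actual Site class, and only the ordering of the phases is taken from the list above. Each phase body is a placeholder which just records that it ran.

```ruby
# Hypothetical stand-in for Jekyll's Site class. Only the order of the
# phases comes from the text; the phase bodies are placeholders.
class MiniSite
  PHASES = %i[reset read generate render cleanup write].freeze

  attr_reader :log

  def initialize
    @log = []
  end

  # Runs every phase in order & returns the site for chaining.
  def process
    PHASES.each { |phase| send(phase) }
    self
  end

  # A real implementation would do actual work in each of these.
  PHASES.each do |phase|
    define_method(phase) { @log << phase }
  end
  private(*PHASES)
end
```

Calling MiniSite.new.process.log returns the phases in the order they ran.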

The first thing Jekyll reads in is the layouts directory. Then it finds all the blogposts, then recurses into the plugins, dynamic pages, & static files from the relevant directories. Include/exclude configs are applied, & these files are sorted.

Having read all the pages, it now reads in the YAML, CSV, or TSV data, configured collections (I’m a heavy user of this!) & sorts them, and any theme assets (personally I avoid this in favour of something more custom).

If a maximum number of blogposts is configured, any more are chopped off the list.


Next it runs all subclasses of Generator that may be provided by loaded plugins. A search isn’t turning up any builtin ones.
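To illustrate how “run all subclasses” can work, here’s a minimal sketch using Ruby’s inherited hook. The Generator class, its registry, TagIndexGenerator, & the hash-based site are all hypothetical names for illustration; Jekyll’s real base class is Jekyll::Generator & its discovery mechanism may well differ.

```ruby
# Hypothetical base class which remembers every subclass defined later.
class Generator
  @registry = []

  class << self
    attr_reader :registry

    def inherited(klass)
      super
      Generator.registry << klass
    end
  end

  def generate(_site)
    raise NotImplementedError, "subclasses implement generate"
  end
end

# An example plugin: adds a page listing all tags (made-up behaviour).
class TagIndexGenerator < Generator
  def generate(site)
    site[:pages] << { path: "tags/index.html", content: site[:tags].sort.join(", ") }
  end
end

# The "Generate" step: instantiate & run each registered subclass.
def run_generators(site)
  Generator.registry.each { |klass| klass.new.generate(site) }
  site
end
```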


Next it iterates over every page in every collection, then every page outside a collection, running the renderer attached to each of those pages after assigning a “payload” to it.

By default this payload involves collecting data for the page, its paginator, a current_document property, syntax highlighting options, & any data from the template. That data can then optionally be sent through Liquid templates for the page itself & its layout(s), between which it runs any attached “converters” (including markdown parsing with optional fanciness).
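The ordering here (template the page, convert it, then wrap it in each layout) can be sketched as follows. Everything in this block is a toy under my own assumptions: expand only substitutes {{ name }} from a flat hash rather than being Liquid, convert is a stand-in for a markdown converter, & the hash shapes are made up.

```ruby
# Toy stand-in for Liquid: substitutes {{ name }} from a flat payload hash.
def expand(template, payload)
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { payload.fetch(Regexp.last_match(1), "").to_s }
end

# Toy stand-in for a converter: wraps blank-line-separated paragraphs in <p>.
def convert(text)
  text.split(/\n{2,}/).map { |para| "<p>#{para.strip}</p>" }.join("\n")
end

# Template the page body, convert it, then wrap it in its chain of layouts.
def render_page(page, layouts, site_data)
  payload = site_data.merge(page[:data])
  output = convert(expand(page[:content], payload))
  layout = layouts[page[:data]["layout"]]
  while layout
    output = expand(layout[:content], payload.merge("content" => output))
    layout = layouts[layout[:parent]]
  end
  output
end
```

Note the converter runs between templating the page & templating its layouts, so layouts receive already-converted HTML as their content variable.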


Cleanup deletes any obsolete files left in the destination directory which this build won’t regenerate, and (unless building incrementally) the Regenerator’s metadata cache.


Writing involves iterating over the site’s pages, static_files, & collections properties for Documents to write, each to a new file if the Regenerator indicates it’s out of date based on a cache file it previously wrote.

Throughout this process events are triggered for plugins to hook into.

Liquid

A central component of any static site generator, or server-side (and even client-side) web framework, is a templating language. I recently described Blaze as used with Happstack. But since the whole point of Jekyll is to avoid writing Ruby (or whatever) code, Shopify’s Liquid, which it uses, is a Django-like templating language rather than a DSL-like module.

Similar to my own “Prosody” which I’ve also described.

It uses a {{ variable }} & {% tag %} syntax embedded in the output text.


Liquid’s central class is Template. Upon calling its parse method with some source code, it extracts passed configuration, runs a regex for a tokenizer, then the parser by repeatedly dispatching to the named tag which was previously registered in a mapping & adds it to a list. This parser method may be called recursively upon handling block tags like {% if %} & {% for %}.

If the tag is a variable or includes one, that involves its own lexer (using yet more regexes), scanner, & parser to handle identifiers & literals with following |filters. All to make the parser more lenient than what Django itself does.
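A minimal sketch of that tokenize-then-dispatch shape follows. The regex is my own approximation rather than Liquid’s actual one, & BLOCK_TAGS is a hypothetical two-entry registry standing in for the real tag mapping.

```ruby
# Approximate tokenizer regex (an assumption, not copied from Liquid).
TOKEN_RE = /(\{%.*?%\}|\{\{.*?\}\})/m

# Hypothetical registry mapping block tags to their closing tag.
BLOCK_TAGS = { "if" => "endif", "for" => "endfor" }.freeze

# split keeps the captured delimiters, so text & tags interleave.
def tokenize(source)
  source.split(TOKEN_RE).reject(&:empty?)
end

def classify(token)
  case token
  when /\A\{\{\s*(.*?)\s*\}\}\z/m then [:variable, $1]
  when /\A\{%\s*(\w+)\s*(.*?)\s*%\}\z/m then [:tag, $1, $2]
  else [:text, token]
  end
end

# Consumes tokens into a list of nodes, recursing for block tags until
# the matching end tag is seen.
def parse(tokens, until_tag = nil)
  nodes = []
  while (token = tokens.shift)
    kind, name, args = classify(token)
    if kind == :tag && name == until_tag
      return nodes
    elsif kind == :tag && BLOCK_TAGS.key?(name)
      nodes << [:block, name, args, parse(tokens, BLOCK_TAGS[name])]
    else
      nodes << [kind, name, args].compact
    end
  end
  nodes
end
```

For example parse(tokenize("Hi {{ user.name | upcase }}{% if admin %}!{% endif %}")) nests the "!" text node inside the if block.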


To “render” a template it first converts everything passed in into a single Context object holding a stack of mappings, some of which may be Drop subclasses implementing some Ruby magic to make it easier to pass custom classes into Liquid templates. Then it recursively calls a render_to_output_buffer method on the parsed Abstract Syntax Tree!

This method takes an input prefix string & concatenates on its computed output string, possibly recursing into child elements.


To evaluate a variable it looks its name up in the current context (stack of mappings), then iterates over each filter, recursively evaluating its arguments before applying it. It then tells the context to apply a global filter callback.

|filters are methods on a class referenced by the template’s Context. A usually noop to_liquid method is called on the result.
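Here’s a small sketch of that lookup-then-filter evaluation. The Context class, the FILTERS table of lambdas, & the naive splitting on | and , are all simplifying assumptions of mine; Liquid’s real classes are considerably more involved (this sketch would break on quoted strings containing those characters, for one).

```ruby
# Hypothetical filter table; Liquid instead uses methods on a class.
FILTERS = {
  "upcase" => ->(input) { input.to_s.upcase },
  "append" => ->(input, suffix) { "#{input}#{suffix}" }
}.freeze

# A stack of mappings, innermost scope first.
class Context
  def initialize(*scopes)
    @scopes = scopes
  end

  def lookup(name)
    scope = @scopes.find { |s| s.key?(name) }
    scope && scope[name]
  end
end

# Arguments are either double-quoted literals or further variable lookups.
def literal_or_lookup(arg, context)
  arg =~ /\A"(.*)"\z/ ? $1 : context.lookup(arg)
end

# Looks the variable up, then threads the value through each filter in turn.
def evaluate(expression, context)
  name, *filters = expression.split("|").map(&:strip)
  filters.reduce(context.lookup(name)) do |value, filter|
    filter_name, raw_args = filter.split(":", 2)
    args = raw_args.to_s.split(",").map { |arg| literal_or_lookup(arg.strip, context) }
    FILTERS.fetch(filter_name.strip).call(value, *args)
  end
end
```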

I’m not discussing all the standard tags & filters bundled with Liquid. Or added by Jekyll.

Kramdown

One nice feature of Jekyll is that you can give it markdown files & it’ll convert them to HTML & incorporate them into your Liquid templates. This is implemented by today’s topic Kramdown.

Kramdown is a wannabe Pandoc: it parses & serializes a handful of doc formats to/from a common abstract syntax tree. But I’ll focus on Kramdown-flavour markup input & HTML output.


Parsing markup involves configuration, parsing, restructuring, abbreviation correction & replacement, & footnote steps.

Configuration involves converting a list of enabled block & span parsers into both a mapping, referring to a global mapping, & a single regexp.


To prepare the text for parsing it converts into UTF-8 & removes some extraneous whitespace. To parse it wraps the input in a Scanner & iterates over all those parsers calling a configured method on the successful one. This yields block-level Elements, onto which extraneous text is appended.
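The scanner-plus-parser-table approach can be sketched with Ruby’s StringScanner. The three regexps, the Element struct, & the BlockParser class are hypothetical simplifications of mine; Kramdown’s own parser list & element types are much richer.

```ruby
require "strscan"

Element = Struct.new(:type, :value, :options)

# Hypothetical, heavily simplified block parser.
class BlockParser
  # Each entry pairs a start regexp with the method to call on a match.
  PARSERS = [
    [/([#]{1,6}) +(.*)\n?/, :parse_header],
    [/\n+/,                 :parse_blank],
    [/(?:[^\n]+\n?)+/,      :parse_paragraph]
  ].freeze

  def initialize(source)
    # Normalise encoding & line endings before scanning.
    @scanner = StringScanner.new(source.encode("UTF-8").gsub(/\r\n?/, "\n"))
    @elements = []
  end

  def parse
    until @scanner.eos?
      # The first parser whose regexp matches at the current position wins.
      _, handler = PARSERS.find { |regexp, _| @scanner.scan(regexp) }
      send(handler)
    end
    @elements
  end

  private

  def parse_header
    @elements << Element.new(:header, @scanner[2].strip, { level: @scanner[1].length })
  end

  def parse_blank
    @elements << Element.new(:blank, @scanner.matched, {})
  end

  def parse_paragraph
    @elements << Element.new(:p, @scanner.matched.strip, {})
  end
end
```

The paragraph text is left unparsed here; a later pass over these Elements would handle the inline markup.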


Restructuring involves traversing over the parsed AST to parse any inline Elements & e.g. collapse consecutive blank lines. Parsing those inlines works much like parsing block elements, except the consolidated regexp is used to capture any unparsed text into the restructured AST.


It then postprocesses any parsed abbreviations, so that it can traverse the tree again to mark up any occurrences of those acronyms.


Then it parses inlines within, marks up references to, & validates footnotes.


Outputting the HTML involves The Visitor Pattern over all the Elements parsed from Kramdown’s markdown flavour (or whatever else), to concatenate together strings of raw HTML. And regexp-escape any <, >, &, & possibly " in the raw text.
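A minimal sketch of such a visitor, dispatching on element type & escaping raw text with a regexp. The hash-based tree shape & the HtmlConverter class here are my own assumptions for illustration, not Kramdown’s actual Element or converter classes.

```ruby
ESCAPES = { "<" => "&lt;", ">" => "&gt;", "&" => "&amp;", '"' => "&quot;" }.freeze

# Escapes <, >, &, & optionally " via a single regexp substitution.
def escape_html(text, quotes: false)
  text.gsub(quotes ? /[<>&"]/ : /[<>&]/, ESCAPES)
end

# Hypothetical visitor: one convert_<type> method per element type.
class HtmlConverter
  def convert(element)
    send("convert_#{element[:type]}", element)
  end

  private

  # Converts & concatenates all children of an element.
  def inner(element)
    element.fetch(:children, []).map { |child| convert(child) }.join
  end

  def convert_root(element)
    inner(element)
  end

  def convert_p(element)
    "<p>#{inner(element)}</p>\n"
  end

  def convert_em(element)
    "<em>#{inner(element)}</em>"
  end

  def convert_text(element)
    escape_html(element[:value])
  end
end
```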

There’s syntax highlighting integration, which Jekyll for some reason reimplements.

LibYAML

Behind a couple of wrappers, one of which has plenty of (apparently unsafe, which Jekyll bypasses) Ruby magic, Jekyll uses LibYAML. Or for JRuby SnakeYAML. YAML is the syntax you use to configure Jekyll, prepend metadata to page bodies, & possibly define other input data.
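To make “prepend metadata to page bodies” concrete, here’s a rough sketch of splitting such a file using Ruby’s bundled YAML library. The regexp & the split_front_matter helper are my own simplifications, not what Jekyll actually uses, & choosing YAML.safe_load is an assumption on my part.

```ruby
require "yaml"

# Simplified pattern: a leading --- line, the YAML, then a closing --- line.
FRONT_MATTER = /\A---[ \t]*\n(.*?\n?)^---[ \t]*$\n?/m

# Returns [metadata, body]; files without front matter get empty metadata.
def split_front_matter(source)
  match = FRONT_MATTER.match(source)
  return [{}, source] unless match

  # safe_load refuses to instantiate arbitrary Ruby objects.
  data = YAML.safe_load(match[1]) || {}
  [data, match.post_match]
end
```

So a file starting with a title: Hello line between two --- lines yields a hash containing that title plus the remaining text as the page body.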

YAML isn’t the most widely-used syntax, and has a reputation for having lots of irrelevant edge cases. But it looks pretty. Like my complaints regarding HTML error correction!


LibYAML is implemented as a state machine in C, with a lexer & scanners above & below those tokens. It uses Unicode Byte Order Marks to determine which text encoding it should decode to UTF-8. You can specify other callbacks (than the read() syscall) to provide more data to the parser.

The lexer maintains a stack of keys, with a hotpath for single-line ones. Whitespace is ignored apart from leading indentation whose tokens need to be specially generated & closed with the input stream.


Once the lexer has tackled all that, it considers whether it needs to parse (depending on its name) & emit a %directive token. Then come document start (denoted ---) & end (denoted ...) indicators, followed by the [, {, ], }, ,, -, ?, & : punctuation, which may be lexed slightly differently. For *, &, !, |, >, ', & " it also parses the subsequent text to get the token’s value.

Otherwise it lexes unquoted literals.


Upon that relatively complex lexer is a pushdown state machine parser emitting “events” (so arguably it’s another lexer, leading me to concur that maybe YAML is too complex), starting from the STREAM_START then (IMPLICIT_)DOCUMENT_START states. (IMPLICIT_)DOCUMENT_START handles certain pragmas which must precede the rest of the document, where IMPLICIT adds a little more leniency. It ends at DOCUMENT_END to close any open objects.

DOCUMENT_CONTENT is the central state dispatching to most of the others.
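Those events can be observed from Ruby through Psych, the standard library’s wrapper around LibYAML, by handing its parser a handler object. The EventLogger class & yaml_events helper are names I made up for this sketch; only the handler callbacks themselves are Psych’s.

```ruby
require "psych"

# Records each parser event as it's emitted (only some callbacks shown).
class EventLogger < Psych::Handler
  attr_reader :events

  def initialize
    super()
    @events = []
  end

  def start_stream(_encoding)
    @events << [:start_stream]
  end

  def start_document(_version, _tag_directives, implicit)
    @events << [:start_document, implicit]
  end

  def end_document(implicit)
    @events << [:end_document, implicit]
  end

  def start_mapping(_anchor, _tag, _implicit, _style)
    @events << [:start_mapping]
  end

  def end_mapping
    @events << [:end_mapping]
  end

  def scalar(value, _anchor, _tag, _plain, _quoted, _style)
    @events << [:scalar, value]
  end

  def end_stream
    @events << [:end_stream]
  end
end

def yaml_events(source)
  handler = EventLogger.new
  Psych::Parser.new(handler).parse(source)
  handler.events
end
```

The implicit flag recorded on the document events reflects whether the --- & ... indicators were written out or merely inferred.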


There’s some symmetrical code to output YAML back to text.

Syntax Highlighters e.g. Rouge

I’ve been reading a variety of syntax highlighters, the software which colours e.g. strings, numbers, & keywords differently in your text editor, webpages, terminal, PDFs, etc. in the hopes this makes the code easier to skim.

There are 3 main subcomponents in a syntax highlighter which are kept, except for editors, loosely coupled: the lexer, formatter, & styles. The obvious way to do styles for the web is CSS. These components can be hotswapped for different results.


Most if not all lexers in a syntax highlighter predominantly involve repeatedly applying a regular expression to the text. Though it’s not uncommon for a regular expression to require switching to a new state with different regexes. Or gasp to require pushing & popping that state to/from a stack.

Maybe this simplicity helps explain why there seems to be a syntax highlighter implemented in every language with regexes!

This yields slices of the input text matching each token’s regex…
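A sketch of such a lexer, including the state stack, using Ruby’s StringScanner. The RULES table describes a made-up toy language of mine, & isn’t taken from Rouge or any other highlighter.

```ruby
require "strscan"

# Hypothetical rules: [regexp, token type, action]. The action is nil,
# :pop, or the name of a state to push.
RULES = {
  root: [
    [/"/,                  :string_delim, :string],
    [/\d+/,                :number,       nil],
    [/(?:if|else|end)\b/,  :keyword,      nil],
    [/\w+/,                :name,         nil],
    [/\s+/,                :whitespace,   nil],
    [/[^\s\w"]+/,          :punctuation,  nil]
  ],
  string: [
    [/"/,       :string_delim, :pop],
    [/\\./,     :escape,       nil],
    [/[^"\\]+/, :string,       nil]
  ]
}.freeze

def lex(source)
  scanner = StringScanner.new(source)
  stack = [:root]
  tokens = []
  until scanner.eos?
    # Try the current state's regexes in order at the current position.
    rule = RULES.fetch(stack.last).find { |regexp, _, _| scanner.scan(regexp) }
    if rule.nil?
      # Nothing matched: consume one character so we always make progress.
      tokens << [:error, scanner.getch]
      next
    end
    _, type, action = rule
    tokens << [type, scanner.matched]
    case action
    when :pop then stack.pop
    when Symbol then stack.push(action)
    end
  end
  tokens
end
```

Joining the second element of every token reproduces the input exactly, which is a handy property to check when writing new rules.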


This makes it easy for a “formatter” to surround each lexed token with some notation (e.g. HTML tags) derived from the specified colour theme, and even more trivial to insert formatting directives in a side table like Pango’s. Or you could, like Pygments (implemented in Python), output e.g. images!
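For instance, a bare-bones HTML formatter could look like the following, assuming tokens arrive as [type, text] pairs. The CSS_CLASSES mapping & its short class names are hypothetical; a real theme would supply the matching CSS rules.

```ruby
# Hypothetical mapping from token types to CSS class names.
CSS_CLASSES = { keyword: "k", string: "s", number: "m" }.freeze

HTML_ESCAPES = { "<" => "&lt;", ">" => "&gt;", "&" => "&amp;", '"' => "&quot;" }.freeze

# Wraps each styled token in a <span>; unstyled tokens are only escaped.
def format_html(tokens)
  body = tokens.map do |type, text|
    escaped = text.gsub(/[<>&"]/, HTML_ESCAPES)
    css_class = CSS_CLASSES[type]
    css_class ? %(<span class="#{css_class}">#{escaped}</span>) : escaped
  end
  "<pre><code>#{body.join}</code></pre>"
end
```

Swapping in a different formatter (terminal escape codes, say) wouldn’t require touching the lexer at all, which is the loose coupling mentioned earlier.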

Most of the code in a syntax highlighter is there to make it trivial to add more (regexp-based) lexers & possibly themes.

Some syntax highlighters like highlight.js (JavaScript) include logic for merging these token streams.


Things get a little more complex in code editors (like CodeMirror [JavaScript] or GtkSourceView [GObject C]) in their attempt to minimize how much work they’re constantly redoing. They save the lexer’s state at known points in the text being edited & highlighted so it can resume from there rather than the beginning, and generally operate line-by-line because the editor does too.

Finally, Skylighting (Haskell) takes an interesting approach in that it compiles the lexers (sourced from KDE) to a dispatch table to traverse at runtime implementing a combined regexp.
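A toy sketch of that state-saving trick: cache the lexer state at the end of every line, & after an edit only re-lex until a line ends in the same state it did before. Both lex_line (whose only state is “inside a /* */ comment or not”) & IncrementalHighlighter are hypothetical, & far simpler than what CodeMirror or GtkSourceView do.

```ruby
# Toy line lexer: returns [tokens, state at the end of the line].
def lex_line(line, state)
  tokens = []
  rest = line
  until rest.empty?
    if state == :comment
      text, closer, rest = rest.partition("*/")
      tokens << [:comment, text + closer]
      state = :code unless closer.empty?
    else
      text, opener, rest = rest.partition("/*")
      tokens << [:code, text] unless text.empty?
      next if opener.empty?
      tokens << [:comment, opener]
      state = :comment
    end
  end
  [tokens, state]
end

class IncrementalHighlighter
  attr_reader :tokens, :relexed

  def initialize(lines)
    @lines = lines.dup
    @tokens = []
    @end_states = [] # cached lexer state at the end of each line
    relex_from(0)
  end

  def replace_line(index, text)
    @lines[index] = text
    relex_from(index)
  end

  private

  # Re-lexes from the edited line, resuming from the previous line's cached
  # state, & stops once a line's end state matches what was cached.
  def relex_from(index)
    state = index.zero? ? :code : @end_states[index - 1]
    @relexed = 0
    (index...@lines.size).each do |i|
      @tokens[i], new_state = lex_line(@lines[i], state)
      @relexed += 1
      unchanged = @end_states[i] == new_state
      @end_states[i] = new_state
      break if unchanged
      state = new_state
    end
  end
end
```

Stopping early is safe because each later line’s tokens depend only on its own (unchanged) text & the state it starts in.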