Haskell Build Tools

I’ll get back to describing how various software I use works, namely the Haskell build tools, starting with the testrunner I use.

HSpec

Testrunners are very trivial tools whose task is to count up the fraction of tests which have failed, & to focus on presenting a nice (terminal) UI & API to encourage developers to actually use them.

In Haskell there’s HUnit, based on Java’s JUnit, & HSpec, based on Ruby’s RSpec. HSpec’s the more featureful of the two!


HSpec features concurrency, shuffling, seeded randomness, filtering, timing, rerunning only failures, & configuration. Most of the code goes into reading that configuration.

When you call describe & it you construct a tree which you can pass to the hspec function. The filtering, randomness, & concurrency are applied to that tree.

it adds code into the tree for catching the errors thrown by shouldBe, etc., which hspec counts up for the reports printed to the terminal!
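
Here’s roughly what that looks like in practice (a minimal sketch; the code under test is my own):

    import Test.Hspec

    main :: IO ()
    main = hspec $                  -- runs the spec tree built below
      describe "reverse" $ do       -- describe groups related specs into a subtree
        it "reverses a list" $      -- it adds a single spec to the tree
          reverse [1, 2, 3] `shouldBe` ([3, 2, 1] :: [Int])
        it "is its own inverse" $
          reverse (reverse "hello") `shouldBe` "hello"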

file-embed

Correction: This is implemented amongst the desugarers, but is actually run during the renaming & typechecking phases, adding a lot of complications to that code & making me wish for syntax GHC found easier to implement.

In developing Rhapsode I maintain its useragent stylesheet and the about: pages as separate files, but I compile them into Rhapsode’s executable so those files don’t have to travel with it & I don’t need to know where they currently are.

This is achieved with the file-embed module, which in turn builds upon Template Haskell!

Template Haskell runs during the desugaring stage to evaluate whatever code you tell it to, which’ll return an AST for it to splice in.


This capability makes Template Haskell extremely powerful, though I, unlike others, have absolutely no interest in doing anything crazy with it. Behind the scenes that code is fully compiled into machine code, to be cast to a function pointer & called like in a JIT.

Template Haskell also provides syntax to help you construct that AST.
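
For instance (a tiny sketch of that syntax), the [| … |] quotation brackets build the AST of the quoted expression, & $( … ) splices an AST back into your program:

    {-# LANGUAGE TemplateHaskell #-}
    import Language.Haskell.TH (Exp, Q)

    -- [| ... |] quotes the lambda into its AST (a Q Exp) rather than evaluating it
    double :: Q Exp
    double = [| \x -> x * 2 |]

    -- in another module, $(double) 21 splices that AST back in & evaluates to 42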

All file-embed then does is hook Template Haskell up to Haskell’s file/directory reading APIs, run at compiletime!
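
Embedding Rhapsode’s stylesheet might then look something like this (a sketch; the filepath is hypothetical):

    {-# LANGUAGE TemplateHaskell #-}
    import Data.ByteString (ByteString)
    import Data.FileEmbed (embedFile)

    -- read at compile time & baked into the executable as a ByteString literal
    useragentCSS :: ByteString
    useragentCSS = $(embedFile "useragent.css")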

Cabal

To build Haskell projects I use a tool called “Cabal” which gets bundled alongside GHC. Cabal also allows me to install additional dependencies from Hackage (a topic for later!) or elsewhere, which I mostly use so I don’t have to write parsers I’m not interested in.

(Relatedly I won’t describe Cabal’s Parsec-based parser)

Package Management

The cabal update command downloads ./00-index.tar.gz to $CACHE/00-index.tar.gz. Seems straightforward, but Cabal needed to implement its own URL dispatcher (like my own HURL) to handle (ETag-based) caching & file: URLs.

For an HTTP client it can configurably use curl, wget, Windows PowerShell, or a Haskell HTTP library (a different one than I use), each of which gets its own dispatch table. Meanwhile my HURL only supports GET.

Cabal checks whether to ask you to cabal update upon each run.


The corresponding command cabal install is essentially split into three steps behind-the-scenes.

The first resolves the requested hackages & their dependencies. The second runs Haskell’s build system over them in an appropriate temp directory. The final step copies the built libraries &/or executables into the appropriate filepaths.

All surrounded by plenty of I/O handling.


The 00-index.tar.gz file is decompressed & parsed into a size-balanced Binary Search Tree for package lookups.
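
Haskell’s usual size-balanced BST is Data.Map, so a lookup into such an index might look something like this sketch (the types & names here are mine, not Cabal’s):

    import qualified Data.Map.Strict as Map

    type PackageName  = String
    type Version      = [Int]
    type PackageIndex = Map.Map PackageName [Version]   -- Data.Map is a size-balanced BST

    latestVersion :: PackageName -> PackageIndex -> Maybe Version
    latestVersion name index = case Map.lookup name index of
      Just versions@(_:_) -> Just (maximum versions)
      _                   -> Nothing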

As for resolving dependencies, I’m finding that code a bit convoluted to read (perhaps due to optimization avoiding redoing previous plans) but it appears to be the same facility used for planning in which order files should be compiled.

It ultimately converts the data into a generic graph structure and hands it to an appropriately-optimized “Solver”.


These solvers would implement the Partial Ordering algorithm of “just keep doing what you can until you’ve done everything”.
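
In other words, something like this sketch (my own simplified formulation, not Cabal’s actual code):

    import qualified Data.Map.Strict as Map
    import qualified Data.Set as Set

    -- "just keep doing whatever has all its dependencies done, until nothing's left"
    processInOrder :: Ord k => Map.Map k (Set.Set k) -> [k]
    processInOrder = go Set.empty
      where
        go done remaining
          | Map.null remaining = []
          | null ready         = error "dependency cycle"
          | otherwise          = ready ++ go (foldr Set.insert done ready)
                                             (foldr Map.delete remaining ready)
          where ready = [ k | (k, deps) <- Map.toList remaining
                            , deps `Set.isSubsetOf` done ]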

At a higher level the “planner” is split into four phases: parse the dependency graph with the given build flags, reduce to requested targets, check for needed rebuilds whilst downloading everything, & remove up-to-date packages from the plan.

Installing Dependencies

There’s three phases to a build: prebuild (consisting of initialize, select targets, check what needs rebuilding, & remove up-to-date packages as discussed above), build, & postbuild.


For the rebuild stage it starts by configuring whichever Haskell compiler it’s running, after which it can verify certain configurations. Then it finds that compiler’s module directory & resolves dependencies within it (with any build flags), keeping the user informed.

At which point it chooses a “solver” (though that choice now seems to be dead code) & runs it with default settings to resolve the dependencies.

Then it traverses the full graph resolving any hackages & asking each file for its dependencies.


That’s immediately followed up with yet another traversal over the dependency graph, but I’m not clear what that’s for.

The result of all this is cached as a JSON file. All the while it’s been making sure its source files don’t change out from under it, rerunning the code if necessary.

Then it drops installed packages from the plan, at which point it calls the plan “rebuilt”.

A callback is run to reduce the plan to the desired targets, then packages are extracted (possibly downloaded) where needed.


If it’s not a “dry run”, the build phase starts by deciding whether/how much to parallelise, taking some locks, and creating some directories.

Having things set up, it begins asynchronously downloading needed hackages before semi-asynchronously iterating over the build plan, for each package:

  1. Waiting for the download to finish
  2. Extracting the tarball
  3. Recursively calling configure, build, & maybe haddock
  4. Maybe copying files to the installation dir.

In the post-build phase (unless it’s a dry run), it first updates a cache file so it can avoid redoing that work next time, followed by a GHC configuration file, & reports any errors to you on the commandline.

I think I’ve dug into this enough for today, tomorrow I’ll discuss Cabal’s “solvers” & “install plans”.

Install plans

Continuing on from yesterday: the concept of “Install Plans” in Haskell’s build system Cabal appears to be little more than a (possibly generic) graph with a list of “goals” attached & an implementation of the Partial Ordering Algorithm.

Install Plans don’t appear to do anything with their goals themselves…

It primarily represents the graph as a Mapping (Binary Search Tree) with a typeclass for getting the keys & neighbours, though it also caches conversions to edgelists.
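
Something along these lines (a rough sketch of that typeclass & graph; the names are mine & the keys are simplified to Ints):

    import qualified Data.Map.Strict as Map

    -- each node knows its own key & the keys of its neighbours
    class IsNode a where
      nodeKey        :: a -> Int
      nodeNeighbours :: a -> [Int]

    -- the graph itself is just a Map from keys to nodes
    type Graph a = Map.Map Int a

    neighboursOf :: IsNode a => a -> Graph a -> [a]
    neighboursOf node graph =
      [ n | k <- nodeNeighbours node, Just n <- [Map.lookup k graph] ]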

Solvers

As for that concept of “solvers”, they don’t implement Partial Ordering like I thought (far too much code for that!). Instead their job is to normalize all the inputs regarding dependencies between modules.

There’s hints that there used to be multiple solvers but now there’s only one, which runs the following steps:

  1. Convert condition trees into an internal format & evaluate all possibilities; checking for conflicts.
  2. Validate no cyclic dependencies.
  3. Optimize for likely goals.
  4. Enforce various validation rules.
  5. Prune the graph to possibly avoid reinstalls, possibly complain about uninstalled dependencies, & possibly something about something called “constraints”, all using the same infrastructure as for heuristically selecting goals.
  6. Convert to a “build tree” with added “link nodes”.
  7. Either report an error, or convert to the graph used for install plans.

Build

Cabal’s main job is to provide a suite of commands to compile all your Haskell and maybe C files & maybe run them, or maybe its testsuite (which may be written using HSpec). You’d never know it from using Cabal, but behind the curtain it doesn’t work much differently from Make!

cabal configure extracts & caches information about the dependencies between files (not clear where exactly this info’s coming from), so that cabal build can use that to determine which files it needs to rebuild, in which order, & with which commands.

Interestingly there’s an extensible build system integrated into GHC itself which is presumably there to aid continued bootstrapping from one of the less production-quality implementations. Combining that with wget/curl & tar commands via a shellscript reimplements enough of Cabal to bootstrap it!

Haddock

Haddock is Haskell’s tool for generating reference documentation. The easiest type of documentation to largely automate, even if I often wish developers would write other types as well!


Once it’s parsed the commandline input for itself & (called as a library) GHC, it runs GHC to generate binary header files it can more readily parse, which it’ll postprocess to construct a mapping, with timings & coverages reported to stdout.

From there it keeps just the public APIs, rewrites the AST so “instances” are associated with their datastructure or typeclass, lightly processes links, & runs a renamer optionally reporting warnings of what’s missing.

The renamer covers exported items, docs (seemingly twice, not clear what they are), args, & orphaned “typeclass instances”.

These files will be cached for later reuse.


After a bit more processing (mostly commandline parsing), it checks which format(s) it should convert the parsed Haskell interfaces to.

This may be an index page listing all the APIs (catalogued by initial letter if there’s more than 150), a page showing the signatures & module hierarchies, pages for every module each possibly with a table of contents, a JSON index for JS-based search, indexes to be searched via Hoogle, a Latex document, and/or a hyperlinked version of the source code.


To create that hyperlinked version of the source code it runs GHC’s lexer (postprocessed) over each file, referring to the previously parsed API definition to figure out where each type or function should link & #anchor. Some trivial JavaScript is added so hovering over a link highlights all others to the same destination, a.k.a. referencing the same variable.

The Latex & Hoogle outputs are serialized manually, whilst Haddock implements its own JSON serializer. GHC bundles an XHTML serializer which Haddock uses, but I’d rather describe Blaze (elsewhere).


Generating nice HTML for each type signature gets quite involved, and for each item Haddock can attach links to: this item, the hyperlinked source code, and/or a wiki for reader comments.

The documentation may be written in a special Markdown variant which Haddock can parse & serialize via a method table telling it how to write the equivalent Latex or (un)linked XHTML.
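
That “method table” is essentially a record of rendering callbacks, one per markup construct, so each output format just supplies its own record, something like this sketch (the field names & the trivial backend are mine):

    -- one record of rendering callbacks per output format
    data DocRenderer out = DocRenderer
      { renderParagraph  :: out -> out
      , renderCodeBlock  :: String -> out
      , renderIdentifier :: String -> out       -- may or may not become a hyperlink
      , renderAppend     :: out -> out -> out
      , renderEmpty      :: out
      }

    -- e.g. a trivial plain-text backend
    plainText :: DocRenderer String
    plainText = DocRenderer
      { renderParagraph  = (++ "\n\n")
      , renderCodeBlock  = id
      , renderIdentifier = id
      , renderAppend     = (++)
      , renderEmpty      = ""
      }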

You may select which CSS theme it should link in, and it may add JS to select an alternate stylesheet, search the JSON index, or add more options regarding expanding & collapsing <details> elements.


These scripts are built using Preact & minified, degrade very nicely, & at least one represents a feature browsers should implement!

Hackage

Hackage is a Happstack webapp for cataloging Haskell modules; the official instance which Cabal defaults to is at hackage.haskell.org. It makes it easy to find & reuse 3rd (and 1st) party Haskell code, some of which is used to develop Cabal & Hackage themselves.

Happstack is a webframework (like Django or Ruby on Rails), though Hackage wraps it in its own internal abstractions primarily serving to split Hackage into largely-independent “features”.


The features Hackage is split into primarily include a fairly-generic user accounts system, a “core” exposing the tarballs Cabal requires to install these hackages, & a nice HTML UI upon everything else!

Other features include:


The data tracked by each of these features is stored as ordinary Haskell values in Acidic databases, except for the tarballs, which are stored in a content-addressed store referenced from one of those Acidic databases.

The .cabal file from those uploaded tarballs will be extracted & stored alongside its reference to optimize:


The Package List feature serves to parse the .cabal metadata for the Search & Reverse Index features. Cabal & Haddock hackages are imported to aid parsing this & other files from the tarball.

The search index is entirely in-memory, & uses several bog-standard heuristics to compute search results. jQuery DataTables is used to allow visitors to sort search results clientside, a feature I believe browsers should provide to all pages whether or not they implement it themselves.

Several features include backup/restore utilities upon their Acidic databases.

Hoogle

Over the past week or so I described, in part, how Haddock (typically run via Cabal) converts the APIs described by GHC’s abstract syntax tree into a more machine-readable format without the implementation details, to be aggregated by Hackage & harvested by handy API search engines like Hoogle & Hayoo. Anyone noticing a naming convention here?

I’d heavily recommend Hoogle to anyone picking up Haskell! It takes great advantage of Haskell’s design!


You can use Hoogle as a (self-hostable) webservice or an offline (once its database is downloaded) commandline utility. As a webservice it uses Happstack’s Blaze for HTML templating, though it uses the simpler WAI (Web Application Interface)/Warp for an HTTP server.

The search syntax is specifically designed for its domain of Haskell APIs (which I want to see more of!), and once parsed queries are run in 3 phases which the rest of this thread will focus on. Lexing is mostly done by calling words.


The first pass is to apply (and parse) [+/-][category:]mod syntax, or parse namespaces as such. Whilst this is parsed, it corrects the results from its lexer.

To apply these it first figures out what the different keys mean (finishing parsing) & completes any prefixes you haven’t finished typing.

Then this phase returns the corrected query for transparency, whether the is:exact tag was found, a filtering callback, & a new callback for when this is the entirety of the search via DB lookups.


Distinguishing the next two phases syntactically (unless they’re both present) is a little tricky, so Hoogle says “if it’s all alphanumeric or all symbols, it’s search-by-name ignoring parens.”

The actual search is written in C for speed, case-insensitively (save for the leading character) calling strstr() for each name in the haystack multistring & needles array before quicksorting the scores.

That haystack, amongst other arrays, is stored in a version-numbered & section-headered binary file.


The most useful & distinctive feature of Hoogle is the ability to search by type signature, a very powerful way to express what you need in Haskell! These are parsed, after some preprocessing, using a fork of GHC’s parser.

After applying some leniency, it computes a “fingerprint” it can look up in a first-pass linear scan, consisting of the rarest types in that signature, its “arity” (number of arguments), & number of terms.

A 2nd pass resolves generics in the search or type signature…
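
So a fingerprint is roughly this sort of record (a sketch summarizing the description above; the field names are mine):

    -- a cheap summary of a type signature, compared in a linear scan
    -- before any expensive matching
    data Fingerprint = Fingerprint
      { rarestTypes :: [String]   -- a few of the least-common type names mentioned
      , arity       :: Int        -- number of arguments the signature takes
      , termCount   :: Int        -- total number of terms in the signature
      } deriving (Eq, Show)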


When testing if type signatures match the search query, they don’t have to match perfectly. Failing to find an exact match, it’ll try swapping the positions in a two-argument function and/or wrapping the result in a “Maybe” type. But also there’s generics.

When comparing two type signatures, if one or both has a generic in a specific spot it’ll “unify” them by assigning a new value to that spot via an STRef. Each unification is penalized in the search rankings.
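
A toy sketch of that sort of unification (ignoring occurs checks & other real-world details; the types & names here are mine):

    import Control.Monad.ST
    import Data.STRef

    -- a type is either a constructor applied to arguments, or a generic
    -- variable which may get bound to another type via its STRef
    data Ty s = Con String [Ty s]
              | Var (STRef s (Maybe (Ty s)))

    -- returns Nothing on mismatch, or Just the penalty (one per variable bound)
    unify :: Ty s -> Ty s -> ST s (Maybe Int)
    unify (Var ref) t = do
      bound <- readSTRef ref
      case bound of
        Just t' -> unify t' t                          -- already bound: follow it
        Nothing -> writeSTRef ref (Just t) >> pure (Just 1)
    unify t v@(Var _) = unify v t                      -- symmetric case
    unify (Con a as) (Con b bs)
      | a == b && length as == length bs =
          fmap (fmap sum . sequence) (mapM (uncurry unify) (zip as bs))
      | otherwise = pure Nothing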

Foreign Function Interface

When writing code in a new-ish language you will inevitably need to call code written in another programming language, typically C. Usually for I/O. Even Haskell’s standard Prelude needs to do this!

My Rhapsode auditory web browser needs to do this for the sake of text-to-speech & vice versa (via eSpeak NG & CMU Sphinx)!


To start: GHC provides a language feature whereby you can generate bindings to or from C, or other supported calling conventions, for functions with limited types for args & returns.

This is mostly resolved during the desugaring phase (alongside, e.g. Template Haskell expressions), generating:

If an imported FFI function returns IO a (where a is an FFI-compatible type), GHC’s optimizer will be told these calls must be correctly ordered. Or in terms it understands, that the function returns a mutated RealWorld State#! These types compile to absolutely nothing.
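
For example, a couple of imports from libc, one pure & one side-effecting:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.Types (CDouble (..), CInt (..))

    -- no side effects, so no IO: GHC is free to reorder or share calls
    foreign import ccall unsafe "math.h cos"
      c_cos :: CDouble -> CDouble

    -- side-effecting, so it returns IO CInt & calls stay ordered
    foreign import ccall "stdlib.h rand"
      c_rand :: IO CInt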


If the APIs you want to import/export so happen to only use the FFI-supported types (which isn’t unlikely), that makes writing bindings as trivial as typing their type signature.

When that’s not the case, the Foreign.* modules provide various FFI-compatible pointer types, each treated differently by the GC. These abstract “primitive” functions (understood specially by GHC’s backend) behind typeclasses, upon which it implements converters to Haskell’s own List, String, & error types.

Or you can use the ByteString abstraction to keep the pointer’s data where it is for greater efficiency.
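
Marshalling a Haskell String through those pointer types might look like this (the C function here is hypothetical):

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.String (CString, peekCString, withCString)

    -- hypothetical C function: char *greet(const char *name);
    foreign import ccall "greet" c_greet :: CString -> IO CString

    greet :: String -> IO String
    greet name =
      withCString name $ \cname ->      -- allocate a temporary NUL-terminated copy
        c_greet cname >>= peekCString   -- call C, then copy the result back out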


This covers most of what you might want to hand to/from C, even if you now need to convert many of them to the appropriate types before calling into C. But what about pointers to structures?

There’s several strategies for addressing this, but GHC’s implementation of its standard libraries uses a build tool called hsc2hs, which converts a Haskell file with added C macros into a C program that prints out pure Haskell, then runs that program, using GCC’s code for parsing & laying out C structures.
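
For a taste of what such a file looks like, here’s a sketch of an .hsc binding for struct timeval (the #-macros are expanded by hsc2hs into real sizes & offsets):

    -- Timeval.hsc, preprocessed by hsc2hs before GHC ever sees it
    #include <sys/time.h>

    import Foreign.C.Types (CLong)
    import Foreign.Storable (Storable (..))

    data Timeval = Timeval { tvSec :: CLong, tvUsec :: CLong }

    instance Storable Timeval where
      sizeOf    _ = #{size struct timeval}
      alignment _ = #{alignment struct timeval}
      peek ptr = Timeval <$> #{peek struct timeval, tv_sec} ptr
                         <*> #{peek struct timeval, tv_usec} ptr
      poke ptr (Timeval s u) = do
        #{poke struct timeval, tv_sec}  ptr s
        #{poke struct timeval, tv_usec} ptr u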

Parser Generators

GHC itself uses these to implement its parser.

Alex

Alex takes a file listing what amounts to more human-readable regexes, each with Haskell expressions indicating which token should be returned upon match.
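
For a flavour of Alex’s syntax, a minimal sketch using its “basic” wrapper (the token type is my own):

    {
    module Lexer (Token (..), alexScanTokens) where
    }

    %wrapper "basic"

    $digit = 0-9
    $alpha = [a-zA-Z]

    tokens :-
      $white+                      ;
      $digit+                      { \s -> TInt (read s) }
      $alpha [$alpha $digit \_]*   { \s -> TIdent s }

    {
    data Token = TInt Int | TIdent String deriving (Eq, Show)
    }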

This file is parsed using a self-hosted Alex & Happy, and is translated to a heavily-optimized Haskell file for GHC, or another implementation, to compile with the rest of your codebase.

I thought libs like Parsec were close enough to BNF that Happy & Alex wouldn’t be needed, but I guess sometimes you really don’t want syntactic noise!


Everything except the regexes is transliterated (if not copy/pasted) directly into the output file with added support code, so I’ll spend the rest of this thread focused exclusively on the regexes.

During parsing they are converted/normalized into a non-consecutive sorted list of ranges, consulting a parse-time dictionary. These can be combined in sequence or in parallel with added *, +, or ? repetition operators using the RExp abstract syntax tree type.


This “scanner” is converted into a Nondeterministic Finite Automaton, which is a labelled directed graph. To evaluate an NFA, starting with a start node, you traverse all “epsilon” edges & all edges labelled with each character of the input string in turn. Or rather each byte in turn, because the UTF-8 or Latin1 charset is decoded here.

Those nodes are ID’d & stored in a mapping to their actions, labelled edges, & epsilon edges, which is constructed using a monad to help assign IDs.
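
Roughly like so (a sketch of that node table & the epsilon traversal; the names here are mine):

    import qualified Data.IntMap.Strict as IntMap
    import qualified Data.IntSet as IntSet
    import Data.Word (Word8)

    -- a node's outgoing edges: some consume a byte, epsilon edges consume nothing
    data Node = Node
      { byteEdges    :: [(Word8, Int)]
      , epsilonEdges :: [Int]
      }

    type NFA = IntMap.IntMap Node

    -- follow epsilon edges until no new nodes turn up
    epsilonClosure :: NFA -> IntSet.IntSet -> IntSet.IntSet
    epsilonClosure nfa = go
      where
        go current
          | next `IntSet.isSubsetOf` current = current
          | otherwise                        = go (IntSet.union current next)
          where
            next = IntSet.fromList
              [ n' | n <- IntSet.toList current
                   , Just node <- [IntMap.lookup n nfa]
                   , n' <- epsilonEdges node ]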


In an NFA, if it wasn’t clear, you can be in multiple nodes at once (hence “nondeterministic”), so this is fixed in a conversion to a “Deterministic Finite Automaton” merging relevant nodes together.

It queues up each nodeset to be merged together and adds them as a single node to the DFA if not already present, expanding the unioned byte ranges into one edge per possible byte, after which the nodes are renamed to numeric IDs again.


That DFA brings the pattern into an optimal form for the computer to evaluate, but maybe the generated code’s more complex than necessary? In case it is, Alex uses Hopcroft’s Algorithm for DFA minimization.

In short, Hopcroft’s Algorithm (incorrectly) assumes nodes are duplicate code until it discovers a contradiction; that is, it finds a node that transitions to only a subset of them.

Finally the DFA is arranged into bytearrays as an optimally-sparse graph.


That bytearray serves as a lookup from state# + inputByte to a new state#, where the lookup arrays for each state may overlap in the hopes that this table fits entirely in the CPU cache. A “check” table stores the expected inputByte for each slot, so that if that check fails a default transition from another table can be taken - thus allowing the graph to be sparse & saving memory.

Another table maps state numbers to offsets in the main table, whilst there’s a list for code to run in response.
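
In other words, the transition lookup is roughly this (my own sketch, with Alex’s real table names simplified):

    import Data.Array (Array, (!))

    -- base:  offset into the big table for each state
    -- check: which input byte each slot actually belongs to
    -- table: the next state stored in each slot
    -- deflt: fallback state to retry from when the check fails (-1 = no transition)
    nextState :: Array Int Int -> Array Int Int -> Array Int Int -> Array Int Int
              -> Int -> Int -> Int
    nextState base check table deflt = go
      where
        go state byte
          | state < 0            = -1              -- ran out of fallbacks
          | check ! slot == byte = table ! slot    -- the slot really is ours: take it
          | otherwise            = go (deflt ! state) byte
          where slot = base ! state + byte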


If my understanding of how modern CPUs work is correct, this would be the most efficient code possible for a lexer. At least once Alex has copy/pasted in the code for evaluating those lookup tables.

Normally lexers play havoc with a CPU’s branch predictors, but here most of those branches have been moved to data lookups which should already be present in the datacache. Instead all the branch predictors see is a tight loop.

Furthermore Alex has a special GHC mode to use its intrinsics directly! Thus getting rid of most everything that makes Haskell Haskell…

Happy

Happy parses (using Haskell it generated itself) syntactic grammar rules expressed via BNF with Haskell code to return upon match & outputs a Haskell file you can compile with the rest of your code.

Interestingly its lexer is handwritten; I guess Happy came before Alex.

BNF differs from regexps in that it uses a callstack & often does without repetition ops, e.g.:

    rules: rules rule | rule
    rule:  id “:” prods “;”
    prods: prod “|” prods | prod
    prod:  term | prods term
    term:  id | nonterminal
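
And here’s a minimal sketch of what a Happy input file looks like (the token names are mine, assuming a matching lexer):

    {
    module Parser (parseExpr) where
    import Lexer (Token (..))   -- hypothetical lexer providing TInt & TPlus
    }

    %name parseExpr
    %tokentype { Token }
    %error { parseError }

    %token
      int  { TInt $$ }
      '+'  { TPlus }

    %%

    Expr : Expr '+' Term   { $1 + $3 }
         | Term            { $1 }

    Term : int             { $1 }

    {
    parseError :: [Token] -> a
    parseError _ = error "parse error"
    }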


The first steps are to:

  1. parse the input grammar file using a grammar not dissimilar to the above,
  2. “monomorphize” any parameterized rules so the rest of the code doesn’t need to know about that language feature I haven’t seen used,
  3. gather rule names whilst verifying there’s no (nonconsecutive) duplicates,
  4. & rename the rules to have numeric IDs whilst validating those references.

These IDs are allocated such that error < %dummy < starts < nonterminals < terminals < %eof.


The main trick to optimizing BNF rules is to figure out what the conditions actually are for each branch, rather than relying on the overhead of backtracking like usual!

The first step of which is to figure out which tokens (“terminals”) each rule may validly start with, by looking up the first term (if it isn’t already a terminal) of each alternative for each rule until fixpoint. In the process of which it decomposes each rule to form a state machine.
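
That fixpoint computation looks roughly like this (a simplified sketch of computing the starting terminals, ignoring empty productions; the representation is mine):

    import qualified Data.Map.Strict as Map
    import qualified Data.Set as Set

    -- a grammar maps each nonterminal to its alternatives, each a list of symbols
    type Grammar = Map.Map String [[String]]

    -- which terminals can each nonterminal start with? iterate until nothing changes
    firstSets :: Set.Set String -> Grammar -> Map.Map String (Set.Set String)
    firstSets terminals grammar = go (Map.map (const Set.empty) grammar)
      where
        go acc
          | acc' == acc = acc
          | otherwise   = go acc'
          where
            acc' = Map.map (Set.unions . map firstOf) grammar
            firstOf []        = Set.empty
            firstOf (sym : _)
              | sym `Set.member` terminals = Set.singleton sym
              | otherwise = Map.findWithDefault Set.empty sym acc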

Another two passes similarly compute all the starting nonterminals for each rule.

However, in the face of left-recursion branching on just the current token yields an infinite loop, so the places where we need to look ahead are marked with dummy nonterminals. That way another optimization pass may determine which tokens it should scan ahead for, using both the mapping to starting nonterminals & the one to starting terminals.

The results of which are finalized with some array processing before being merged into the state table computed alongside the starting (per-rule) nonterminals.


From those “closures” it converts the BNF grammar into a state machine with “goto” & “actions” tables.

The goto table maps a state & an input (nonterminal or popped terminal) to a new state. The actions table consists of Fail, Shift, Accept, & Reduce opcodes for each state. These tables are merged together into one function per state pattern matching on the input number. Or Alex-like tables if told to.

Code which it generates after optionally outputting a bunch of debugging info.


A shift moves the current token onto the stack before having a specified callback (a.k.a. the new state) pattern match upon the next token once converted into its non-terminal number. A goto just calls the new state without advancing the tokens or updating the stack. A “reduction” pops a given number of tokens off the stack passing the values to some minimally modified Haskell code from the input file. And accept returns the top-of-stack value.


A function is generated to convert tokens into state numbers via pattern matching from the input file (also present in stack pops), and a lookup table is generated for reporting which tokens were expected upon error.

As far as Haskell’s typesystem is concerned the stack doesn’t end, instead the bottom-of-stack is marked with a lazily-evaluated expression that throws an error safely crashing the parser if it’s ever accessed. Cutting down on branching I presume…