Haskell Text & Web Processing

This page describes the dependencies I use to parse & process webpages in Rhapsode.

unordered-containers

http-client

As a web browser, Rhapsode’s job is to download & render webpages. I’ve written a blogpost describing how it does the rendering, but to download the pages it predominantly uses the http-client package from Hackage. Here I’ll at least start describing how that works.

I also implement optional support for gemini: (via openssl-streams), file:, data: (via base64-bytestring), & any other scheme the apps on your free desktop offer to handle for me.

The first step of using http-client is to construct a ManagerSettings object containing some magic numbers around handling connections; callbacks to wrap new network connections, requests, & responses; & callbacks to configure error handling & retrieve HTTP proxies. All of these have simple defaults, which’ll read proxy configuration from the system.

The second is to construct a Manager from those ManagerSettings.
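
Putting those two steps together looks something like this (a minimal sketch; the comments describe what could be overridden rather than anything Rhapsode actually sets):

	import Network.HTTP.Client

	mkManager :: IO Manager
	mkManager = do
	  -- Step 1: ManagerSettings; the simple defaults read proxy configuration
	  -- from the system, & fields (connection counts, timeouts, wrappers) can
	  -- be overridden with record-update syntax.
	  let settings = defaultManagerSettings
	  -- Step 2: turn those settings into a Manager.
	  newManager settings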

There are separate modules to add OpenSSL or Cryptonite encryption.

Upon creating a Manager, it turns the callbacks for reading HTTP proxy configuration into a modifier over HTTP request objects, & combines all the network connection constructors into a single callback, which it then wraps into a “KeyedPool”.

The KeyedPool is crucial for managing network-level concurrency when fetching all the resources that make up a webpage. Not that there’s many relevant to speech browsers like Rhapsode.

The KeyedPool is an Atomic Mapping of already-open sockets (possibly multiple timestamped per host) that may be reused (as indicated by the server) for new HTTP requests without the overhead of opening a new TLS socket. Newer versions of HTTP are specifically designed to aid this optimization, but I don’t know whether they require it.

It’s got logic for closing old sockets in the background & upon dealloc. Connection (de)allocators & limits to the number of open sockets are caller-provided.

As for how it manages atomicity, that’s simple: a TVar. And the Mapping implementation is provided by Data.Map.
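
As a toy illustration of that idea (not http-client’s actual KeyedPool internals, just the same TVar-plus-Data.Map shape):

	import Control.Concurrent.STM (TVar, atomically, modifyTVar', readTVar, writeTVar)
	import qualified Data.Map as M

	-- A toy keyed pool: a Map from host keys to idle connections, kept inside
	-- a TVar so that check-out & check-in are atomic.
	type Pool key conn = TVar (M.Map key [conn])

	checkOut :: Ord key => Pool key conn -> key -> IO conn -> IO conn
	checkOut pool key open = do
	  reused <- atomically $ do
	    m <- readTVar pool
	    case M.lookup key m of
	      Just (c:cs) -> writeTVar pool (M.insert key cs m) >> pure (Just c)
	      _           -> pure Nothing
	  maybe open pure reused    -- reuse an idle socket, or open a fresh one

	checkIn :: Ord key => Pool key conn -> key -> conn -> IO ()
	checkIn pool key conn =
	  atomically $ modifyTVar' pool (M.insertWith (++) key [conn])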


Gathering all the information required for an HTTP request starts with a call to requestFromURI, which parses the server address, port number, whether to encrypt, URI path, query string, & authentication parameters from a given URI. It then sets defaults, some of which I override!

Next up: A simple call to httpLbs, which in turn is a trivial wrapper around withResponse that converts the response body into a lazy bytestring. Making sure to use tail recursion.
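
A hedged usage sketch of those two calls (the fetch helper & the truncated output are mine):

	import Network.HTTP.Client
	import Network.URI (parseURI)
	import qualified Data.ByteString.Lazy.Char8 as L

	fetch :: Manager -> String -> IO ()
	fetch manager url = case parseURI url of
	  Nothing  -> putStrLn "invalid URI"
	  Just uri -> do
	    req  <- requestFromURI uri    -- host, port, TLS?, path, query, auth
	    resp <- httpLbs req manager   -- thin wrapper around withResponse
	    L.putStrLn (L.take 200 (responseBody resp))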

I’ll describe bytestrings later…

withResponse is split into two parts: responseOpen & responseClose hooked together with Haskell’s bracket error handling function. With responseClose dispatching to a destructor on the Response object.
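
Roughly how that hangs together (a sketch, not the exact source):

	import Control.Exception (bracket)
	import Network.HTTP.Client

	withResponse' :: Request -> Manager -> (Response BodyReader -> IO a) -> IO a
	withResponse' req man = bracket (responseOpen req man) responseClose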

responseOpen meanwhile is where things get interesting!

responseOpen starts by iterating over all headers attached to the request (I add some for content negotiation, and soon caching) to validate their values don’t contain newlines.

Then it allows the request to swap out which manager is used, and the manager to apply modifications to the request via one of its callbacks.

It then initializes some error handling, within which it allows those overrides yet again. Upon a redirect response, the main logic (next toots) will be repeated up to N times.

After that initial validation, those callbacks, & (during redirect following), it calls httpRaw', which starts by calling the manager’s configured callback for rewriting the request to be intercepted by an HTTP proxy.

Then if there’s a cookie jar (which in Rhapsode’s case, there isn’t yet) it’ll serialize all matching cookies to a string, and attach it as a request header “Cookie”. The (de)serialization is implemented in another module.

Then it looks up an open socket from the Atomic Mapping (a.k.a. “KeyedPool”) I described yesterday, which’ll open a new connection if necessary. This socket is then wrapped to enforce a configurable timeout.

At which point it’s ready to serialize the request onto that socket & parse the response headers! I’ll describe how that works tomorrow.

If it sees a content-length response header, it’ll wrap the response socket with a function that truncates its reads, or returns nothing once that many bytes have been read.
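
A sketch of what such a truncating wrapper could look like (my own illustration, not http-client’s code):

	import qualified Data.ByteString as B
	import Data.IORef (newIORef, readIORef, writeIORef)

	-- Wrap a raw read callback so it never yields more than the declared
	-- content-length, & yields "" (meaning "end of body") after that.
	limitReads :: Int -> IO B.ByteString -> IO (IO B.ByteString)
	limitReads total rawRead = do
	  remaining <- newIORef total
	  pure $ do
	    left <- readIORef remaining
	    if left <= 0
	      then pure B.empty
	      else do
	        chunk <- rawRead
	        let used = min left (B.length chunk)
	        writeIORef remaining (left - used)
	        pure (B.take used chunk)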

If the “transfer-encoding” response header is “chunked” (regardless of content-length) the socket will be wrapped in a callback to parse the headers for multiple response bodies. And if “content-encoding” is “gzip” that callback will be further wrapped in calls to the Haskell bindings for the popular ZLib (de)compression library.

Callers can then repeatedly call that function to read the response body, or consult what else has been parsed!


Now I want to discuss how it serializes the HTTP request to that socket, & deserializes the response from it.

Let’s start with cookies (which I’ve disabled)! Their (de)serialization is implemented in a separately-packaged module & involves little more than splitting by “;” & “=”.
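
To illustrate, a simplified take on that splitting (the real cookie package is more careful about whitespace, quoting, & attributes):

	import qualified Data.ByteString.Char8 as B

	-- e.g. parseCookiePairs "a=1; b=2" == [("a","1"),("b","2")]
	parseCookiePairs :: B.ByteString -> [(B.ByteString, B.ByteString)]
	parseCookiePairs = map pair . B.split ';'
	  where
	    pair s = let (k, v) = B.break (== '=') (B.dropWhile (== ' ') s)
	             in (k, B.drop 1 v)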

http-client manages the cookie jar itself from data that the separate cookie module parses out of the Set-Cookie response header using the break function. That may include dates, which cookie calls upon the time package to parse.

Then for requests http-client will extract just the keys & values to hand to cookie to serialize. Serialization is generally simpler than deserialization, but here another module blaze-builder is pulled in to speed things up.

I’ll discuss Blaze later…

To serialize the HTTP request to the socket, it starts by analyzing whatever datastructure (if any) you use to store the request body. Taking special care to allow for the “Expect: 100-continue” request header.

Blaze, specially configured, is used to once again speed up this serialization.

The first line contains the request method, the filepath (or, for proxies, full URL), & HTTP version (either 1.0 or 1.1). Here the hostname (if applicable), path, & query params are stored separately.

After that are the “key: value” headers, for which http-client prepends:

After an extra newline the body is appended, decoded by Blaze from a choice of datastructures.
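
To illustrate that wire format, a hedged sketch using the bytestring Builder described later on this page (renderRequest is my own name, not http-client’s):

	import Data.ByteString.Builder
	import qualified Data.ByteString.Char8 as B

	renderRequest :: B.ByteString -> B.ByteString
	              -> [(B.ByteString, B.ByteString)] -> B.ByteString -> Builder
	renderRequest method path headers body =
	     byteString method <> char8 ' ' <> byteString path <> string8 " HTTP/1.1\r\n"
	  <> mconcat [ byteString k <> string8 ": " <> byteString v <> string8 "\r\n"
	             | (k, v) <- headers ]
	  <> string8 "\r\n"    -- the extra newline before the body
	  <> byteString body

	-- e.g. toLazyByteString (renderRequest (B.pack "GET") (B.pack "/") [] B.empty)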

On deserializing the response, it first makes sure to give the caller a chance to write chunked data to the request if appropriate. From there it reads one line at a time.

On the first line (possibly preceded by at most 3 blank lines) it expects to read the HTTP version, response code, & its corresponding message (e.g. Not Found); otherwise it errors. Status code 100 skips this header & waits for the next batch.

On subsequent non-empty lines it expects to find HTTP headers, parsed simply with a call to break (== ':') & strip. The body is left for the caller to read via header-configured wrapper functions.
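
A sketch of that break-&-strip step on a single header line (my helper, not the library’s):

	import qualified Data.ByteString.Char8 as B

	-- e.g. parseHeaderLine "Content-Type: text/html" == ("Content-Type", "text/html")
	parseHeaderLine :: B.ByteString -> (B.ByteString, B.ByteString)
	parseHeaderLine line =
	  let (name, rest) = B.break (== ':') line
	  in (strip name, strip (B.drop 1 rest))
	  where strip = B.dropWhile (== ' ') . fst . B.spanEnd (== ' ')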

An external module defines the AST for this.

The sockets it reads from & writes to are wrapped in an http-client-specific abstraction. This takes read, write, & close callbacks (defaulting to bindings to the standard C networking libraries, integrated into GHC’s mainloop), adding a flag for whether the connection’s already been closed & a stack to power an un-read method. That unread method is predominantly used within the readLine function.
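
Roughly the shape of that abstraction (simplified; the real record lives in an internal http-client module, & the closed-flag & un-read stack get baked into these callbacks when the connection is constructed):

	import qualified Data.ByteString as B

	data Connection = Connection
	  { connectionRead   :: IO B.ByteString          -- read a chunk ("" at EOF)
	  , connectionUnread :: B.ByteString -> IO ()    -- push a chunk back to re-read
	  , connectionWrite  :: B.ByteString -> IO ()    -- write a chunk
	  , connectionClose  :: IO ()                    -- close the socket
	  }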

Other modules may use this to add crypto, and special allowances are made for connecting to HTTP proxies.

Array processing

bytestring

I’ll introduce you to the concept of a ByteString, as defined in Haskell’s libraries. Put simply: It’s an array of bytes, a slice of RAM.

Sticking with the core Haskell language or its standard Prelude, you’d be tempted to use a linked list, but that’s not what http-client receives from the network (nor what the parsers I call expect), so it’d be slow. It receives mostly-consecutive bytes, so it’s best to stick with that representation, if Haskell can find a way to do so!

For the sake of C bindings, Haskell introduced a ForeignPtr type, compiled specially by GHC’s backend, for use in imperative-style code in the IO monad. bytestring builds upon this, insisting it’s safe to use its abstraction in any “lazy” code & for GHC to apply its optimizations.

A variety of utilities are implemented upon it, including a Bucket Sort & a well-optimized findSubstrings that may compare machine words or rolling hashes.

Occasionally it’ll call down into C…

As discussed in the context of unordered-containers, Haskell’s very keen on making copies. To avoid that cost, bytestring tracks a length & offset (a “slice”) alongside its arrays. It shouldn’t take much imagination to see that, for many of ByteString’s operations, these can be altered instead of the array itself.
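
A sketch of that idea (the real constructor lives in Data.ByteString.Internal, & newer bytestring releases fold the offset into the pointer itself):

	import Data.Word (Word8)
	import Foreign.ForeignPtr (ForeignPtr)

	data ByteString = BS !(ForeignPtr Word8)  -- the shared array
	                     !Int                 -- offset of this slice
	                     !Int                 -- length of this slice

	-- take & drop only need to adjust the slice, never copy the array:
	takeBS, dropBS :: Int -> ByteString -> ByteString
	takeBS n (BS fp off len) = BS fp off       (min n len)
	dropBS n (BS fp off len) = BS fp (off + m) (len - m) where m = min n len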

Though when it does need to “modify” a ByteString or create a new one, it does need to allocate a new array, which apparently GHC optimizes specially…


I stated that the data received from the network is stored in mostly-consecutive memory, because it comes from the network in “packets”. Yet ByteStrings, as I described them, store their data in entirely-consecutive memory.

This gap is filled by “lazy ByteStrings”, that is, a linked list of the “strict” ByteStrings I described yesterday. Here we can start to embrace, rather than fight, Haskell’s more interesting features…
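
That amounts to something like the following (modulo strictness annotations & the invariant that no chunk is empty):

	import qualified Data.ByteString as B

	data LazyByteString
	  = Empty
	  | Chunk B.ByteString LazyByteString    -- a strict chunk, then the rest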

Performance-wise this means that small amounts of data that fit in the CPU memory cache (say, 16 or 32 kilobytes) are processed in one go using the strict ByteStrings I described yesterday. But for something larger, it’ll process whatever data it already has before it gets to the next chunk, which might not have been computed yet, and might never need to be. In theory, a great balance!

Laziness isn’t perfect, but it can apply some surprisingly high-level optimizations!

On the other end of the spectrum (which I don’t use) is ShortByteString, a wrapper around the ByteArray# “intrinsic”, which is compiled specially by GHC’s backend. I don’t know what makes ByteArray# different from ForeignPtr (apparently the GC can compact them), but ShortByteStrings do not track slices like normal/strict ByteStrings do.

And ShortByteString does offer significantly fewer APIs compared to lazy or strict ByteString.


To finish off describing Haskell’s bytestring module, I’ll describe Builders.

Usually when you’re going to output a bytestring (over a network socket or whatever channel), you’re constructing pieces from other datastructures and concatenating them together. A ByteString Builder makes this fast, & the Blaze module, originally developed to serialize HTML for Haskell webapps, makes it faster!

But how does it make it fast?

A Builder is a function which writes to a specified buffer segment and, for rapid concatenation, takes as input the next function to run. Various Builders are provided for basic types including numbers & the different types of Bytestrings.
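
A quick usage sketch with bytestring’s own Builder (Blaze’s API looks much the same):

	import Data.ByteString.Builder (Builder, intDec, stringUtf8, toLazyByteString)
	import qualified Data.ByteString.Lazy.Char8 as L

	statusLine :: Int -> String -> Builder
	statusLine code msg =
	  stringUtf8 "HTTP/1.1 " <> intDec code <> stringUtf8 " " <> stringUtf8 msg

	main :: IO ()
	main = L.putStrLn (toLazyByteString (statusLine 404 "Not Found"))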

For speed the buffer range is specified as raw Ptr’s to memory.

If the Builder has filled the specified buffer segment, it’ll return BufferFull with a callback to be run at the caller’s leisure, & how much of a buffer to possibly preallocate for it.

Alternatively a Builder may return a strict ByteString to be concatenated onto the end of the output stream, because that case is very common & usually trivial to optimize! Though it may copy instead if the ByteString’s small enough.

Put actions are Builders which also return a second value. Most Builders are marked for inlining to reveal more optimizations to GHC.

Builders usually write to a new lazy Bytestring (via an intermediate format), but may write directly to a buffered file descriptor.

You may also (via a minor abstraction) have the Builder call your own code rather than write to a socket or construct a lazy ByteString.

To serialize numbers, lists, etc via a Builder, concepts of “Fixed” & “Bounded” primitives are introduced which know the (maximum) amount of space they require. These combined requirements can then be checked by a Builder before serializing them.

text

Over the past several days I described Haskell’s bytestring module, but it’s more common to process sequences of text than it is to process raw bits. Even if everything is ultimately raw bits under the hood.

For that Rhapsode uses the text module.

Which works extremely similarly to the bytestring module, except that each item/character takes 16 or 32 bits (UTF-16 encoding), and for some reason it’s built on different intrinsic types.

I won’t discuss what text shares with bytestring. Its Builders are slightly different in that they return a list of strict Texts to be turned into a lazy Text.
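
For example, using the names from Data.Text.Lazy.Builder:

	import qualified Data.Text as T
	import qualified Data.Text.Lazy as TL
	import Data.Text.Lazy.Builder (fromText, singleton, toLazyText)

	greeting :: T.Text -> TL.Text
	greeting name =
	  toLazyText (fromText (T.pack "Hello, ") <> fromText name <> singleton '!')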

At the simplest level the (strict) Text type has an Iter (moving in either direction) to aid deciding where to slice the original Text.

For more complex cases it uses a “fusion” system, whereby an iterator function returns an updated state alongside each word/char to be passed to the next call. Wrappers around this are inlined away.

This internal “Fusion” API provides “Stream” functions for:

This is heavily used to build the higher-level APIs upon, though the text’s stored as UTF-16 & needs no help for splitting or combining.
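
A simplified sketch of the shape of that machinery (the real types in text’s internals also carry size hints & strictness annotations):

	{-# LANGUAGE ExistentialQuantification #-}

	data Step s a = Done
	              | Skip s       -- update the state without emitting anything
	              | Yield a s    -- emit a character & the next state

	data Stream a = forall s. Stream (s -> Step s a) s   -- step function + initial state

	-- A map over the stream builds no intermediate Text; GHC inlines the
	-- wrappers so chains of such functions "fuse" into a single loop.
	mapS :: (a -> b) -> Stream a -> Stream b
	mapS f (Stream step s0) = Stream step' s0
	  where step' s = case step s of
	          Done       -> Done
	          Skip s'    -> Skip s'
	          Yield x s' -> Yield (f x) s'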

Parsing

css-syntax

One very useful step for parsing is called “lexing”. This splits the input text into “tokens” (words, numbers, symbols, etc) so the main parsing step has less to worry about as it determines the relationships between those tokens.

In my case, where I want to help others parse their own selection of CSS properties, lexing is in fact vital! For that I could reuse Haskell CSS Syntax, which I’ll describe today.

I run Haskell CSS Syntax by calling its tokenize function, passing it the downloaded Text I’ve decoded as UTF-8 unless @charset says otherwise. This will be preprocessed into a new Text to mark NUL characters as invalid whilst replacing \x0D\x0A, \x0D, & \x0C with \x0A.

Next it establishes a new Text object as an arena to copy names/units/strings/etc into, before iterating over all the characters using :. (defined as a nicer operator-based syntax upon Text’s uncons).

Various tokens are absolutely trivial to lex, like:

Lexing whitespace (which is only significant in CSS selectors, making it a bit of a nuisance to deal with) & comments are almost as trivial.

It attempts to lex numbers (which may be followed by a % or a name for a “unit”) before names in case of ambiguity.

Lexing a number involves looking for an optional sign character, followed by decimal digits with an optional . in the middle, followed by an optional lower- or upper-case E and more digits. It’ll parse the integer component in blocks of 40 characters, deferring the rest to lazy evaluation, and use the scientific module to parse floating points/exponents.
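
As a simplified sketch of that shape (assuming the lexer has already peeked that a number starts here; the real code also handles exponent signs & hands the digits to scientific):

	{-# LANGUAGE OverloadedStrings #-}
	import Data.Char (isDigit)
	import qualified Data.Text as T

	-- e.g. spanNumber "12.5e3px" == ("12.5e3", "px")
	spanNumber :: T.Text -> (T.Text, T.Text)
	spanNumber input = (T.take (T.length input - T.length rest) input, rest)
	  where
	    t0 = dropSign input
	    t1 = T.dropWhile isDigit t0                                    -- integer digits
	    t2 | "." `T.isPrefixOf` t1 = T.dropWhile isDigit (T.drop 1 t1) -- optional fraction
	       | otherwise             = t1
	    rest | Just (e, r) <- T.uncons t2, e `elem` ("eE" :: String)
	         , T.any isDigit (T.take 1 r) = T.dropWhile isDigit r      -- optional exponent
	         | otherwise                  = t2
	    dropSign t | T.take 1 t `elem` ["+", "-"] = T.drop 1 t
	               | otherwise                    = t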

The tricky part of CSS identifiers (beyond what can start or continue one) is escape codes: hex, newline suppression, or ignore rules.

The lexed CSS identifiers are written into the preallocated arena with any escape codes resolved. If one’s followed by an open paren “(“ it’ll parse to a different token, and if that name was “url” (in either case) it’ll check whether its argument is unquoted, indicating it needs special casing.

Strings also need this escape parsing, and are terminated with the quote they started with.

Names may also be prefixed with @ or #; the latter is more lenient, which it’ll flag.

Whenever it finishes lexing any of those cases, it’ll yield a value of its Token type for me or others to pattern match upon. Other characters are labelled Delim, and this lazy token list ends whenever the input text does. UTF-16 decoding is largely skipped as irrelevant to selecting branches.

Another function iterates over the tokens, determines whether they need a comment to separate them, and converts them back to text, properly escaped. Numbers store their textual representation for performance.

scientific

I mentioned that Haskell CSS Syntax uses a separate module to parse any floating point numbers.

Decoding numbers involves converting to the computer’s native base 2 from our base 10. For integers this is a simple multiply-and-add loop, as implemented by Haskell CSS Syntax.

But for floating point we still have the same issue. “Scientific Notation” is coefficient*10^exponent, whereas computers use coefficient*2^exponent. So how does this work?

Haskell CSS Syntax yields to my code either unbounded Integers or Scientific values at its convenience, both of which I convert to CPU-native floats. The Scientific type in turn stores a fixed-bitsize base-10 exponent and an unfixed-bitsize coefficient.

To perform the conversion for small enough exponents, Scientific consults a lookup table to decide what it should multiply (for positive exponents) or divide (for negative) the coefficient by.

The lookup table starts with 1 & 10, followed by table[i, i+1] = [x*x, 10*x*x] where x = table[i/2]. You can do math directly on a Scientific value, but it may not be secure or performant.

If the exponent is too large for that table (> 324) it may compute the additional power itself, or more optimally it’ll first check whether it’s outside the exponents representable in the float’s bitsize. Both approaches are implemented.
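
Here’s roughly how that conversion gets consumed from my side, via scientific’s toRealFloat (which, as I understand it, does the table lookup described above & clamps out-of-range exponents to 0 or infinity instead of looping):

	import Data.Scientific (Scientific, scientific, toRealFloat)

	cssNumber :: Scientific -> Double
	cssNumber = toRealFloat

	-- e.g. cssNumber (scientific 15 (-1)) == 1.5   (coefficient 15, exponent -1)
	-- Naively this is fromInteger coefficient * 10 ^^ exponent; the table just
	-- avoids recomputing those powers of 10.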

Converting from a real number to a Scientific amounts to long division, with some logic to prevent repeating digits from becoming infinite loops. Converting to/from text is easy since a Scientific hasn’t yet been fully converted to base 2.

For an unfixed-bitsize integer it uses Haskell’s builtin Integer type, which GHC may implement either as bindings to LibGMP or in pure Haskell. In the latter case it does most of its work by iterating over each fixed-bitsize “digit” in the number.

XML Conduit

Here I’ll describe the module I use to parse XML documents for Rhapsode, which is built upon Conduit as a parsing framework (queue that up for a toot thread!), following a detectUtf .| dropBOM .| tokenize .| toEventC .| addBeginEnd .| fromEvents pipeline. Conduit, I believe, focuses on enabling such pipelines!

In doing so, I’ll assume you know how parsers normally work and I will be using that terminology.


The AST XML Conduit returns to me for processing is wrapped in a Document type holding any text/comments/doctypes preceding and following the element tree itself.

Elements hold a name, attributes, and children, where each child may be another element, an “instruction”, text, or a comment. The Text type described earlier is used throughout, and Names may also have a resolved & unresolved namespace.
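
Roughly the shape of that AST, simplified from xml-conduit’s Text.XML types (Misc here is a placeholder; the real prologue/epilogue & instruction types carry more detail):

	import Data.Map (Map)
	import Data.Text (Text)

	type Misc = Text    -- placeholder for the comments/doctype around the root

	data Document = Document [Misc] Element [Misc]       -- before, root, after
	data Element  = Element Name (Map Name Text) [Node]  -- name, attributes, children
	data Node     = NodeElement Element
	              | NodeInstruction Text                 -- processing instruction (simplified)
	              | NodeContent Text
	              | NodeComment Text
	data Name     = Name Text (Maybe Text) (Maybe Text)  -- local name, namespace, prefix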

ParseSettings are given to the parser for how to decode &entities, namespaces, & illegal chars.

To store attributes (& resolve XML namespaces and entities) it uses the size-balanced Binary Search Tree provided by the containers package shipped with GHC.
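
For illustration, an attribute lookup against such a tree (keyed here by plain Text for brevity, where the real map is keyed by Name):

	import qualified Data.Map as M
	import Data.Text (Text, pack)

	hrefOf :: M.Map Text Text -> Maybe Text
	hrefOf = M.lookup (pack "href")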


If I’m handing XML Conduit a ByteString as input (which I am, unless you specify encoding= in the MIME type), it’ll start by decoding that text.

First it’ll check for a preceding Byte Order Mark character in any of the text encodings it supports, before checking for a <?xml ?> tag in UTF-8 for its encoding attribute. It then tells Conduit to do the text decoding.

Then it’ll remove any Byte Order Marks from the text once decoded.


The tokenizer is a little unusual in that it’s split into two passes, and in the first pass parses the additional structure of attributes on open tags & <?xml ?>. This first pass will also decode any “entities” (as configured) inside an element or its attributes.

The second pass strips off the XML declaration, resolves any XML namespaces (with the aid of a stack), does more entity decoding, and otherwise re-encodes “tokens” into “events”.


To parse it adds Begin & End Document events to be skipped in the next & final pass. It’ll look for instruction, comment, or content events before or after parsing the root element & preceding DOCTYPE.

Beyond that parsing involves just asserting that start and end tags match as it restructures the list of events into a tree.

HTML Conduit

I described how Rhapsode parses XML using XML Conduit. XML’s easy for them to implement, but few webpages adhere to XML’s rigidity. So they glued an HTML lexer onto their XML parser that can handle the more common HTML leniencies, whilst returning the same AST to my code so I don’t have to care much which format I’ve downloaded.

Disclaimer: HTML Conduit is not standards-compliant; this is not how your web browser parses HTML. It does, however, work successfully.


HTML Conduit does not do any text-encoding sniffing like XML Conduit or my CSS parser; it just assumes everything’s UTF-8. (Might this be a problem for me?)

In its first pass over the text, HTML Conduit lexes tokens for close, comment, special, or (with attributes) “normal” tags; or text. Entities are parsed slightly later, and the main reason to reimplement this step is to allow attribute values to be unquoted or absent. <script> is special-cased to not have any child elements.


The code to parse & decode entities is taken directly from the XML parser, and the next step is to infer closing tags where they’re missing. This is normally a complicated affair, but the shortcut taken here is to maintain a stack of open tags and:


Finally, before handing off to XML Conduit’s top-level parser, it adds surrounding <html> tags. XML Conduit will then turn this into its Abstract Syntax Tree.

network-uri

I’ve discussed the libraries I use in Rhapsode to parse HTTP, CSS, XML, & HTML. So today I’ll finish off that trend by discussing the library I use to parse & serialize URIs. And how it resolves relative URLs into absolute URLs.

To make serialization fast, it ensures every character from the original text is preserved, so all it needs to do is concatenate them, which is done right-associatively to minimize laziness’s overhead.

With a trailing ++ to speed up further concatenation.
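
That’s the same trick as the Prelude’s ShowS: every piece is a function that takes the rest of the string, so the ++ calls associate to the right. A tiny sketch:

	type URIBuilder = String -> String

	piece :: String -> URIBuilder
	piece = (++)

	render :: String -> String -> String -> URIBuilder
	render scheme auth path = piece scheme . piece "//" . piece auth . piece path

	-- render "https:" "example.com" "/index.html" "" == "https://example.com/index.html"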


To parse or syntax-validate a URI it uses Parsec. I’ll discuss Parsec tomorrow.

There are separate parsers for relative & absolute URLs, & for whether they may have a fragment or not. These call out to the exact same subparsers, closely matching the BNF from the IETF standard. Not that these parsers’ Abstract Syntax Tree is a tree at all.

Escape codes are not decoded during parsing; you can call those functions later.

These parsers make up the vast majority of network-uri’s code.


To resolve a relative URI, it fills in any empty fields with those of the parsed “base” URI, and/or concatenates their paths, dropping the filename (after the last “/”) from the base URI.

After that merge, it processes each /-separated component of the URI’s path: “./” is removed, “../” pops the stack, & anything else is pushed onto the stack. The resulting stack is reversed and turned back into a path.
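
A sketch of that stack-based cleanup over already-split path segments (network-uri’s real code is fussier about preserving leading & trailing slashes):

	-- e.g. normaliseSegments ["a", "b", "..", ".", "c"] == ["a", "c"]
	normaliseSegments :: [String] -> [String]
	normaliseSegments = reverse . foldl step []
	  where
	    step stack "."  = stack            -- "./" is removed
	    step stack ".." = drop 1 stack     -- "../" pops the stack
	    step stack seg  = seg : stack      -- anything else is pushed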

Another set of functions does the reverse of this, and others normalize the URI’s characters.

Parser Frameworks

Parsec

Having described several parsers written in Haskell, I’ll now describe the parsing frameworks upon which they are built. The dominant one is Parsec, which I’ll discuss today.

I’ll just describe the core of each of these frameworks without describing their extensive APIs.

As for why Haskell needs parsing frameworks: “eating” characters is not permitted to be a side effect. So for anything non-trivial you want a framework to thread through the trailing text returned from each subparser.


Every subparser in Parsec is represented by a function that takes as arguments the functions to call next depending on whether it consumed input or not & whether it parsed successfully or not, allowing these subparser functions to be composed until you get your full parser.

To run a parser you pass it callbacks telling it to return which one was actually called. To run two in sequence (the >> operator) you create a subparser which runs the second in the first’s success callbacks.


The parser is defined such that, upon chaining those parsers (the >>= operator), the success callbacks can take another argument, so the results from this parsing can be captured by the caller. In Haskell the >> & >>= operators have special syntactic sugar (do notation) which looks like imperative programming.

To compose two subparsers as alternatives (the <|> operator), you configure one to call the other on failure.
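
Here’s a heavily simplified sketch of that style; real Parsec threads four continuations (consumed/empty × ok/error) plus user state, but the shape is the same:

	type Parser r a = String                  -- remaining input
	               -> (a -> String -> r)      -- success continuation
	               -> (String -> r)           -- failure continuation (error message)
	               -> r

	item :: Parser r Char
	item input ok err = case input of
	  (c:rest) -> ok c rest
	  []       -> err "unexpected end of input"

	andThen :: Parser r a -> (a -> Parser r b) -> Parser r b    -- essentially >>=
	andThen p f input ok err = p input (\x rest -> f x rest ok err) err

	orElse :: Parser r a -> Parser r a -> Parser r a            -- essentially <|>
	orElse p q input ok err = p input ok (\_msg -> q input ok err)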

Those are a few examples of many, I’ll leave the rest up to your imaginations!

Conduit

Conduit is a system for defining pipelines that can switch between push & pull.

A “Conduit” is a function which takes a trailing function to call with some argument and returns a “Pipe”. A Pipe is essentially a linked list with a “final result”, though it has several other interesting features.


A Pipe may alternatively hold callbacks in case this is a pull pipeline rather than a push one, it may return leftover input, and it indicates when side effects are required so they don’t hamper optimized processing of the rest of the pipe.
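
Roughly the shape of conduit’s internal Pipe type (simplified; I’m eliding strictness details, & older versions also carried finalizers):

	data Pipe l i o u m r
	  = HaveOutput (Pipe l i o u m r) o     -- an output, then the rest (push)
	  | NeedInput (i -> Pipe l i o u m r)   -- ask upstream for input (pull)…
	              (u -> Pipe l i o u m r)   -- …or accept upstream's final result
	  | Done r                              -- the final result
	  | PipeM (m (Pipe l i o u m r))        -- a side effect must run to continue
	  | Leftover (Pipe l i o u m r) l       -- hand back unconsumed input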

To “fuse” (the .| operator) two Conduits, it iterates over the downstream Pipe, partially echoing it in the result with leftover tokens fed back into it. Upon requesting more input it’ll iterate over the upstream until it has a result or another value.

To run a Conduit you maintain a stack to push upon Leftovers & pop upon NeedInput, but otherwise you just iterate over it until you reach the result.

“await” calls the Conduit’s continuation function via a response from NeedInput. “yield” returns a HaveOutput with the continuation function.

Concatenating two Conduits involves splicing the second into the continuation for the first.

Again I’ll have to leave the rest of its APIs up to your imagination, with one exception for tomorrow.

attoparsec

attoparsec is used for lexing by XML/HTML Conduit.

attoparsec is better optimized for parsing (UTF-8) bytestrings, but its main feature is that it’s designed to parse input streams like a network response or stdin. Something I wish eSpeak-NG’s XML parser could handle correctly; that’s been a cause of some bugs early on.

Conduit Extra provides an integration between attoparsec’s parsers & Conduit’s streams.


An attoparsec parser may return, in addition to success or failure, a Partial holding a callback to supply with more input. Before many parsers it’ll check how much is left of the current chunk, and request more if needed.

And instead of having separate callbacks for whether any input was consumed like Parsec does, when a fetch for more input fails attoparsec checks whether requesting more input from the caller provides it. Otherwise it works pretty much the same, with pass and fail callbacks.
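
A hedged usage sketch of that incremental interface (names from Data.Attoparsec.ByteString.Char8; the example parser & inputs are mine):

	import qualified Data.Attoparsec.ByteString.Char8 as A
	import qualified Data.ByteString.Char8 as B

	-- A Partial result holds the callback asking the caller for more input.
	demo :: A.Result [B.ByteString]
	demo = A.parse (A.sepBy (A.takeWhile1 (/= ',')) (A.char ',')) (B.pack "foo,ba")
	         `A.feed` B.pack "r,baz"
	         `A.feed` B.empty    -- the empty chunk signals end of input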


To make concatenating strict bytestrings faster, attoparsec defines a Buffer with extra memory allocated past the end of the bytestring to be written into, like you’d do in imperative languages.

To make testing whether a character is in a given set fast attoparsec may use either a sorted array or a bitmask depending on how large that set is. Either way it’s stored in a bytestring.

Or you could use its “Zepto” parser generators which may or may not be faster.


An attoparsec Zepto (sub)parser returns the trailing text alongside its result for the next subparser to decide what to do with. But that doesn’t support streaming input, and is slow at handling loops/recursion and backtracking.

conduit-extra provides a converter that translates attoparsec’s Partials into NeedInputs, and the others into the input to the next Conduit. Which is usually part of a loop which yields the results from repeatedly rerunning the parsers.

Prelude

Let me ask: How does this compare to the standard Prelude? A.k.a. the code I don’t need to import? A.k.a. the naive (mostly) pure-Haskell approach?

Specifically I’ve discussed:


I’ll start with numbers, because they’re the most separate!

They could in theory be implemented in pure Haskell using “Church Numerals”, but that would be incredibly wasteful! So instead the Prelude uses “primitive” functions & types which GHC’s backend will compile into specific opcodes.

Around these primitives, boxed types are defined to allow for laziness. The Prelude declares the math operators & typeclasses for operator overloading. GHC’s frontend has hardly any concept of numbers; it’s all in the Prelude!
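
For instance, GHC’s boxed Int is just a lazy wrapper around the unboxed, machine-width Int# primitive, & addition unboxes, applies the +# primop, & reboxes:

	{-# LANGUAGE MagicHash #-}
	import GHC.Exts (Int (I#), (+#))

	plusInt :: Int -> Int -> Int
	plusInt (I# x) (I# y) = I# (x +# y)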


The Prelude defines linked lists, which it uses for everything, via the code data [] a = [] | a : [a]. GHC is really good at optimizing the processing of these, but not always the storage. That’s what Text & ByteString are for! ([a] is syntactic sugar for [] a)

A Haskell String is a linked list of Unicode codepoints ([Char]) with hardly any APIs of its own: type String = [Char]. Char is defined amongst the other numeric types.

Text encodings are stripped away in the I/O APIs, which provide buffering, typeclasses, etc around C’s <stdio.h>.


While GHC is bundled with containers implementing a size-balanced Binary Search Tree, the Prelude just uses a linked-list of key-value pairs (Eq k => [(k,v)]) by providing the function:

	lookup _key [] =  Nothing
	lookup  key ((x,y):xys) | key == x =  Just y
		| otherwise =  lookup key xys

(P.S. “otherwise” is part of the Prelude, not the parser. otherwise = True)

I’ve previously mentioned arbitrary precision Integers implemented using LibGMP or pure Haskell. These are imported into the Prelude.


The Prelude provides a read :: Read a => String -> a function making it trivial to parse any of its types. It’s not trivial for the Prelude though: it defines its own parsing framework behind the scenes to implement this.

At its simplest it defines a synonym for String -> [(a, String)] returning all possible parses. Or it may hide the character iterator from the parsing code by having it yield Get, Look, or possible Results before its ultimate Fail or Final.
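
For example, with that simple API & the wrappers built upon it:

	import Text.Read (readMaybe)

	parses :: [(Int, String)]
	parses = reads "42 and the rest"    -- [(42," and the rest")]

	safe :: Maybe Int
	safe = readMaybe "42"               -- Just 42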

Upon the parser combinators built from those Get/Look operators (where branches may be merged) it implements a rudimentary pull-based Haskell lexer.

The Prelude uses that parser combinator framework and lexer to implement the Read typeclass, allowing you to parse values with a call to read(s)Prec given an operator precedence value to determine whether parentheses are needed. Don’t worry: GHC will offer to generate an implementation for you, and you can stick to the simpler String -> [(a, String)] API.

read/readMaybe/readEither/etc call readPrec.