Regular Expressions/YARR

Odysseus’s “view source” page and several of it’s error screens use highlight.js for syntax highlighting. highlight.js in turn wraps JavaScript’s regular expressions, navigating some surrounding datastructures and outputting the results to a newly formed/merged DOM structure.

Today I’ll cover how JavaScriptCore implements these regular expressions.

There’s a handwritten JavaScript binding, but all the real logic is in a subcomponent called “yarr”. The first step here is to parse the “pattern”.

The parser itself is actually quite simple, there’s not too much for it to deal with. But converting many of these constructs to “character ranges” proves to be trickier, thereby outputting a flattened tree of “pattern terms”.

Internally yarr’s RegularExpression object just wraps the interpreter, but JavaScriptCore may try it’s JIT first.

The interpreter extracts “disjunctions” (control flow branches) from these “pattern terms” and will interpret them using an explicit stack whilst jumping between code for forward execution and backtracking.

For each kind of pattern term it has a method to interpret it, check if it occurs the minimum of times, and allow it to match up to the maximum of times no matter whether it’s “greedy” or not. The latter logic is repeated for each one, but looks near identical everywhere.

The JIT works in much the same as the interpreter, except it outputs machine code to do it’s real work with specific registers hardcoded to specific tasks. Also: