Text Layout & Rendering

Any text GTK renders is sent through Pango & its transitive dependencies to handle richtext, linewrapping, internationalization, & more. Pango directly addresses the first 2, and I will be reimplementing Pango in Haskell. Or ideally paying someone else to do so if I can manage a decent income stream from my hobby…


If you’ve ever written some XML markup to style your text in GTK, that is Pango! Btw unlike HTML there’s no inline-styling engine here, that’s the difference.

The public API mostly consists of a Pango Context, a GObject class.

It tracks & wraps a base direction, base & resolved “gravity”, a gravity hint, a serial number to invalidate iterators, language (2 properties), fontmap, whether to round glyph positions, a font description, & a matrix transform. The main entrypoint is its get_metrics method!

After normalizing its parameters, get_metrics considers returning its cached value. Otherwise it loads the fonts from the map, iterates over them retrieving each one’s metrics keeping the last, & retrieves a language-specific sample string to render (usually a localized “I can eat glass and it doesn’t hurt me”, or something or other about a fox & a lazy dog) to pass to the true main entrypoints before cleaning up.

I’ll discuss pango_itemize_with_font later! The result of which is iterated over…

For each item it iterates over, it attempts to look up the font & its metrics, to conditionally copy some attributes over to the overall metrics & inform some postprocessing via Harfbuzz (with results tweaked, & extra abstraction). A width running-sum is computed.

Attribute lists are a sorted array of subclassable slices implicitly of the same string representing different styling options.
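A minimal Python sketch of that structure (names are mine, not Pango’s; the real thing subclasses attributes per styling option):

```python
import bisect

class Attr:
    """One styling option applied to a slice [start, end) of the shared text."""
    def __init__(self, start, end, kind, value):
        self.start, self.end, self.kind, self.value = start, end, kind, value

class AttrList:
    """Attributes kept sorted by start index, mirroring Pango's sorted array."""
    def __init__(self):
        self.attrs = []

    def insert(self, attr):
        # Keep the array sorted by start index upon insertion.
        keys = [a.start for a in self.attrs]
        self.attrs.insert(bisect.bisect_right(keys, attr.start), attr)

    def at(self, index):
        """All attributes covering a given char index."""
        return [a for a in self.attrs if a.start <= index < a.end]
```

Note the slices never store the text itself; they implicitly reference the one string the whole list styles.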

There are textual formats from/to which attribute lists & their corresponding strings can be parsed or serialized. One’s seemingly for debugging; the other reuses the GMarkup lightweight XML parser bundled with GNOME’s GLib to build something convenient to use, where various tagnames are parsed specially as shorthands.

GMarkup requires manually tracking relevant aspects of the tagstack. Attributes are multi-buffered before being emitted, including in that tagstack. It can auto-underline accelerators.

A Pango FontDescription holds a family name, style, variant, weight, stretch, gravity, textual variations, mask, size, & some bitflags for whether the family or variations are static or the size is absolute. Font descriptions can be merged, heuristic differences can be computed to determine the best match, & there are more typical comparison methods & various accessors. It includes a fairly sophisticated parser, used in attribute parsing.

Fonts I’ll discuss later…

“Gravity” refers to an optional right-angle rotation, which can be converted to radians or matrix transforms & can be looked up for the Unicode “script” (roughly, alphabet) being used. A gravity hint defines which relevant gravity to prefer, mostly relevant when mixing scripts.

Pango defines its own (partial) matrix multiplication implementation with conversion from geometric transforms.
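That partial matrix implementation boils down to 2D affine transforms. A Python sketch (my function names; the field ordering mirrors, but doesn’t copy, Pango’s xx/xy/yx/yy/x0/y0 layout):

```python
import math

def matrix_rotate(degrees):
    """An affine transform (xx, xy, yx, yy, x0, y0) for a rotation,
    illustrating how geometric transforms convert to matrices."""
    s, c = math.sin(math.radians(degrees)), math.cos(math.radians(degrees))
    return (c, -s, s, c, 0.0, 0.0)

def matrix_concat(a, b):
    """Multiply two affine transforms, dropping the constant bottom row
    (which is what makes the implementation 'partial')."""
    axx, axy, ayx, ayy, ax0, ay0 = a
    bxx, bxy, byx, byy, bx0, by0 = b
    return (axx*bxx + axy*byx,
            axx*bxy + axy*byy,
            ayx*bxx + ayy*byx,
            ayx*bxy + ayy*byy,
            axx*bx0 + axy*by0 + ax0,
            ayx*bx0 + ayy*by0 + ay0)

def matrix_apply(m, x, y):
    """Transform a point by the matrix."""
    xx, xy, yx, yy, x0, y0 = m
    return (x*xx + y*xy + x0, x*yx + y*yy + y0)
```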

There’s logic for negotiating & inferring ISO language codes, parsing preference lists, & finding sample text.

Pango Itemization

To init its itemization Pango captures the given context, text, & attr iterator whilst computing/saving end, run_start, changed, embedding levels (via FriBidi, a later topic!), embedding end, gravity-related data (gravity, centered baseline, hint, resolved value, & font descriptions), a script iterator with its range, a width iterator, & an emoji iterator. It nulls out result, item, embedding end offset, emoji font description, derived lang, current fonts, cache, basefont, first space, & font pos.

Upon both initialization & iteration range invariants are enforced. Finalization frees several of these properties. Iterating to the next item involves advancing the appropriate iterator.

Processing each resulting run involves checking which aspect has changed, possibly computing a new gravity, derived language, or current font. Then it processes each non-whitespace character, allocating a new linkedlist “item” for output & storing results there, handling the final item specially.

For postprocessing it reverses that linkedlist & computes a running sum.

Pango’s “itemization” process is split into several iterators which are unioned together: embedding levels (precomputed), richtext attributes, Unicode scripts, emojis, & widths.

Some languages are left-to-right, others are right-to-left. Some are vertical (though those can usually be written horizontally too), & some are even diagonal (though no computer system I know of supports those)! Computing embedding levels determines which to use.

Precomputing embedding levels involves (after converting from Pango types to FriBidi types) computing the number of UTF-8 characters & allocating 3 per-char sidetables, of which it returns one (the others are temporary). It iterates over each char once, looking up & recording their BiDi types with special handling for brackets, whilst bitwise-OR’ing & maybe (if flagged “strong”) AND’ing these bitflags together. It fastpaths unmixed text directions, otherwise defers to FriBidi, & converts back to Pango types.
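That fastpath can be sketched in Python (with my toy per-char classifier standing in for FriBidi’s real type tables):

```python
# Strong-direction bitflags, OR'd together across the whole text.
LTR, RTL, NEUTRAL = 1, 2, 0

def classify(ch):
    """Toy stand-in for FriBidi's per-char BiDi-type lookup."""
    if '\u0590' <= ch <= '\u08ff':   # Hebrew & Arabic blocks, roughly
        return RTL
    if ch.isalpha():
        return LTR
    return NEUTRAL

def embedding_levels(text):
    """Fastpath: if only one strong direction appears, every char gets the
    same level (0 = LTR, 1 = RTL); mixed text needs the full algorithm."""
    seen = 0
    for ch in text:
        seen |= classify(ch)
    if seen != LTR | RTL:
        level = 1 if seen == RTL else 0
        return [level] * len(text)
    raise NotImplementedError("mixed direction: defer to the full BiDi algorithm")
```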

Pango attributes are stored in a sidearray from the text itself to make them trivial to iterate over! Though a stack is required to yield the end of all the attributes it has previously yielded the start of. Furthermore this stack is consulted to extract the styling for this run of text.

As stated previously Pango’s “attributes” are what’s parsed out of the XML (via GMarkup) you hand to it or GTK. They represent Pango’s richtext support!

To split the text into runs requiring different “scripts” (approximately a.k.a. “alphabets”) Pango iterates over each UTF-8 character. For each char Pango looks up the script The Unicode Consortium catalogued for it; for the “common” script it looks up the corresponding charcode it pairs with, maintains a size-capped stack to balance those paired chars, & either fixes up any previously-unknown scripts (including in that stack) or yields a script boundary.
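A toy Python version of that iterator, with a hypothetical 2-script classifier & only ASCII bracket pairs (the real thing consults Unicode’s script & paired-bracket tables):

```python
PAIRS = {'(': ')', '[': ']', '{': '}'}
STACK_CAP = 8   # Pango similarly caps its paired-char stack

def toy_script(ch):
    """Stand-in for the Unicode script-property lookup."""
    if '\u0370' <= ch <= '\u03ff':
        return 'Greek'
    if ch.isascii() and ch.isalpha():
        return 'Latin'
    return 'Common'   # spaces, digits, brackets… inherit surrounding script

def script_runs(text):
    """Yield (start, end, script) runs. 'Common' chars inherit the current
    run's script; a capped stack keeps paired chars (brackets) on the
    script of their opening context."""
    runs, start, current, stack = [], 0, None, []
    for i, ch in enumerate(text):
        script = toy_script(ch)
        if script == 'Common':
            if ch in PAIRS and len(stack) < STACK_CAP:
                stack.append((PAIRS[ch], current))
            elif stack and ch == stack[-1][0]:
                script = stack.pop()[1] or 'Common'
        if script in ('Common', current):
            continue
        if current is not None:
            runs.append((start, i, current))
            start = i
        current = script
    runs.append((start, len(text), current or 'Common'))
    return runs
```

Note how a closing paren rejoins the run of its opener, even if a different script intervened; that’s the point of balancing paired chars.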

To determine whether a char is an emoji Pango uses a lexer contributed by Chromium, written in Ragel. The iterator checks whether the current char is an emoji or not & scans all subsequent chars in the same classification.

Similarly the width iterator classifies chars (with some special cases) by horizontal or vertical writing direction according to a builtin lookuptable.

Pango Layout

The PangoLayout GObject class tracks its PangoContext, richtext attributes, font description, tab indents (a sized array of indents with a boolean for units), the plain text, serial numbers for itself & its context, number of bytes, number of chars, layout width & height, initial indent, line spacing, justification/alignment, whether to ignore newlines, whether to autodetect textdirection, whether to wrap & whether it has, whether to ellipsize & whether it has, a count of unknown glyphs, cached layout rects with flags, a cached tabwidth, a decimal mode, resulting logical attributes, a list of resulting lines, & a linecount. There is bitpacking involved, & there are a couple of fields denoting which fields (the bulk of them) should be memcpy’d when duplicating.

Has standard GObject methods, & plenty of accessors.

A couple of these accessor methods wrap the XML parser for richtext markup. The serial number is used to detect changes invalidating layout computation, freeing the computed lines & resetting various properties whenever the input fields are mutated via the accessors.

Upon accessing output properties (upon which some interesting logic is implemented) the PangoLayout lazily recomputes them ensuring all needed inputs are set. If flagged to infer textdir consults FriBiDi per-char or its context.

After clearing its output fields & retrieving initial extents/height from the initial font, Pango’s highlevel layout algorithm repeatedly: optionally looks for paragraph boundaries, optionally determines the base direction filling in gaps from the previous value, determines whether this is the final iteration with the last segment, runs the itemization algorithm I described above, copies attributes over to results whilst locating the correct slices thereof, optionally updates some per-item flags & attributes utilizing relatively-complex Unicode processing I don’t understand some of which is language-specific, applies some itemization postprocessing I’ll describe later, either repeatedly splits the items into lines (again, I’ll describe this later) or constructs a single line, checks whether we’ve surpassed the height limit, & in preparation for the next iteration (if any) computes the next start index. Then it cleans up & aligns text!

PangoLayout has a method for iterating over the “runs” in a given line, locating a UTF-8 offset in the appropriate run corrected to avoid landing in the middle of clusters, before iterating over the Harfbuzz-computed glyphs taking into account the FriBidi-computed text direction.

There are a couple of methods for computing the appropriate line from the linkedlist for an index, & one computing extents. And there are methods combining these.

Another method computes new indexes moving up or down a line.

There’s a method which iterates over the lines locating the one which contains a given y-coordinate, then defers to the method for computing the x-coordinate. And a method which similarly does the reverse.

There are further wrappers around these methods which return properties of their results. A relatively complex one computes “strong & weak” rectangles to depict bidirectional text-cursors. And a further wrapper incorporates Harfbuzz positions.

There are methods for refining the chosen alignment & computing the x-offset to apply that alignment. This gets incorporated into the extents upon request, & this info is retrieved at the end of the core layout algorithm to ensure it’s applied.

There’s yet more methods locating & processing appropriate runs. There’s an iterator over the computed lines or runs computing line-extents as it goes, with plenty of its own getters.

Pango Itemization Postprocessing

Text layout/rendering is full of little nuances which must be handled for internationalization’s sake. Here I’ll describe such nuances (selecting “variants” & tweaking fontsize) which can’t fully be captured by intersecting several iterators to split the text up into “items”.

For variants it iterates over each item checking if a valid font-variant has been selected for it. If not it’ll split upper & lowercase letters to approximate them via fontsize & text transformation.

Another pass allows font designers to handle the difference between visual & geometric sizes, by iterating over every item. For each it analyzes the actual size of the selected font to compute how much it needs to be scaled to achieve the desired size. Results are saved into the item’s attributes. This size adjustment is tweaked further for superscript, subscript, & smallcaps text.

Pango Line-Splitting

Pango’s main job is arguably to split text up into lines (or is it to split text more generally, around richtext, writing systems, etc?), so how does that work?

Until there are no more items (if there were none to begin with it adds an empty line) Pango allocates a lightweight line object, initializes several properties on it (possibly with minor computation), then iterates over the previously-split “items”.

For each previously-split “item” Pango gathers styling properties & runs Harfbuzz. If the trailing char is a newline it inserts the item into the line’s linkedlist, incrementing its length by what Harfbuzz computed whilst adjusting positioning to take into account tabs, before returning to the loop indicating this case. If the item fits entirely on the line it does likewise.

Otherwise it may have to split the item. For which it first sums the width & specialcases tabs as all-fit.

Then Pango checks if it can split at the end of the item & whether to add a hyphen. This tweaks the following fastpath check which checks if that width fits.

Having exhausted those options it computes per-char widths & iterates over the chars looking for valid breakpoints (taking into account hyphens) until we pass the max linelength. For each it might trial-split the text to consider the hyphens or tabs before updating the max-valid run.

If this fails it might try again with looser constraints.
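Stripping away the hyphen, tab, & trial-split handling, the core of that scan is a greedy search for the furthest breakpoint whose prefix still fits (names mine):

```python
def break_line(widths, breakpoints, max_width):
    """Return the furthest valid breakpoint whose prefix of per-char
    widths still fits within max_width, or None if nothing fits.
    `breakpoints` is the set of indices where the text may legally break."""
    total, best = 0, None
    for i, w in enumerate(widths):
        total += w
        if total > max_width:
            break               # passed the max line length; stop scanning
        if i + 1 in breakpoints:
            best = i + 1        # update the max-valid run
    return best
```

When this returns None, the caller can retry with looser constraints or force a split, as described above.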

If Pango has successfully split the item to fit (or has a force flag set) it either applies the split & returns that the first half all fits, or it indicates that nothing fits. Or splits as best it can indicating some of the item fits.

Otherwise it indicates nothing fits. Harfbuzz data is freed since linesplitting invalidates it.

When everything fit the outer loop checks for tabs & removes the item from its todolist. When some fit on an empty line it sets a flag & exits. When nothing fit it backs up over previously-selected runs to see if they have valid splitpoints & reruns those checks with that tweak. For newlines it removes the item from its todolist & exits. Upon exiting it updates some counts & appends the line to the computed output.

Having computed a line Pango inserts missing hyphens, truncates whitespace, reverses runs, normalizes the text baseline from Harfbuzz, considers ellipsizing the line, converts logical to visual order, redistributes whitespace, optionally “justifies” words in a separate pass, & updates some flags.


With the variety of written languages around the globe, it’s not uncommon to want to combine them. Which can get tricky if those languages don’t agree on which direction they’re written in! To reorder text from “logical order” to “visual order” & back in C, adhering to subtle rules, we can use a simple well-tested library called “FriBidi”! As long as that text is entirely horizontal.

After validation & stack/heap allocation Fribidi starts by classifying each char via a lookuptable.

Chars are classified as left-to-right, right-to-left, arabic, european numerals, arabic numerals, european number seperator, european number terminator, common separator, non-spacing mark, boundary neutral, block separator, segment separator, whitespace, other, left-to-right embedding, direction embedding, direction overrides, pop/push directional flag, direction isolate, & first-strong isolate.

Another iteration refines other to extract bracket type & open/close. Via metaprogramming.

Given that classification FriBidi applies a run-length encoding & commences its core logic!
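A quick Python sketch of that run-length encoding step, which collapses consecutive identical classifications so the core logic can work per-run instead of per-char:

```python
def run_length_encode(types):
    """Collapse consecutive identical BiDi classifications into
    (type, length) runs."""
    runs = []
    for t in types:
        if runs and runs[-1][0] == t:
            runs[-1][1] += 1
        else:
            runs.append([t, 1])
    return [tuple(r) for r in runs]

def run_length_decode(runs):
    """The final decompression iteration: expand runs back to per-char."""
    return [t for t, n in runs for _ in range(n)]
```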

The first main iteration locates the first non-isolated letter & its text direction.

Then after initializing some collections, counters, & other state it iterates again to interpret char classifications upon a stack, tweaking the “run” & possibly moving between linkedlists. A postprocessing iteration handles isolations, populating an indexmapping.

Between compaction passes an iteration resolves “implicit levels” handling numbers differently from other text, & extracting the max level.

Then it merges back in the extracted runs taking care to apply new levels.

And finishes with an iteration to finalize the computed run properties, & an iteration for run-length decompression.

I’m not fully clear what exactly this is all accomplishing…

Then it allocs index-mappings & maybe rearranged text to return to caller.

As a cursive alphabet, Arabic requires special handling selecting “joining chars” when rearranging into visual order. This involves a lookuptable preprocessing step & an iteration over the Arabic chars to detect the literal edgecases between Arabic & other writing systems. Then if rearranged text is requested it handles “shaping” Arabic (involving “shaping” and/or “ligaturing” substeps, both via lookuptables) & mirrored text (also involving a lookuptable) each specially.

P.S. Skimming over the rest of FriBidi I see logic for converting a handful of charsets to UTF32 & back, at least one of which required FriBidi’s logic. Since FriBidi deals in UTF32 by default, this included a UTF8 decoder.

GNU LibC vendors a much more thorough project for charset conversion, but I can’t be surprised FriBidi’s interested in the subject!


What is geometrically correct does not always appear so. So font designers may wish to tweak their fonts to look better at different sizes & styles, which they do by providing multiple fontfiles for the same “font family”. GNOME’s text stack uses FontConfig to choose these fontfiles.

FontConfig provides a library & a large suite of commands trivially wrapping it. I’ll be focusing on studying the fc-pattern command as a starting point.

The first step of querying the FontConfig fonts database is to construct a “pattern” to match against. These patterns are represented in-memory as a dynamic array of properties, in turn holding linkedlists of tagged unions & how strongly each “binds”. Values may be bytes, ints, bools, doubles, 2x2 transformation matrices, arrays of desired character coverage, arbitrary pointers, bitmasks of supported scripts/languages, & floatingpoint ranges.

Offset pointers are common for ease of serialization.

Here the pattern is parsed from commandline args, starting with a “name” that is split by commas, dashes, & colons. If it’s unsplit it’s added as a string to the pattern to match the FAMILY, retrying upon commas. For dashes it’ll parse out floats to match SIZE. For colons it repeatedly looks up the property to match & converts the strings to the appropriate type.
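A toy Python parse of such a name, like "Monospace,Sans-12:weight=bold" (real FontConfig additionally coerces each value to the property’s registered type, which I skip here):

```python
def parse_pattern(name):
    """Toy parse of a FontConfig-style pattern string:
    families, then -size, then :prop=value pairs."""
    pattern = {}
    head, *props = name.split(':')
    if '-' in head:
        head, size = head.rsplit('-', 1)
        pattern['size'] = [float(size)]        # dashes introduce sizes
    if head:
        pattern['family'] = head.split(',')    # commas separate families
    for prop in props:                         # colons introduce properties
        if '=' in prop:
            key, value = prop.split('=', 1)
            pattern.setdefault(key, []).append(value)
    return pattern
```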

Remaining commandline args are gathered into a multistring.

FontConfig Preprocessing

To cope with apps & sites that don’t consider which fonts are available on freedesktop systems, to “synthesize” fonts for styles that aren’t provided, & more, FontConfig can preprocess queries according to some XML config. Reasonable defaults are shipped with the project.

Which first requires loading the XML config under a locked refcount using a LibXML SAX parser in reference to an element stack; the bulk of the work happens on element-end. In the meantime it gathers various filepath collections.

For “match patterns” it iterates over the configured locales to consider incorporating them into the pattern once normalized. It then fills in a PRGNAME field, allocates some collections & font-family hashtables, & iterates over the configuration. For each config it branches over the config type & evaluates its condition to determine whether to mutate or replace the query pattern.

Then it cleans up.

A separate pass may fill in default values for properties.

FontConfig Querying

To get the font database to query, it first performs various checks to determine whether it needs to reload the database into memory, including file timestamps. Buried within the code to read configuration is code to very carefully mmap an index of all installed fonts. Interestingly I can add more within the bounds of the process; WebFonts are easy!

Or maybe it’ll have your process carefully read over the font directories itself in sorted order, consulting FreeType (which I’ll discuss soon-ish!) via an extensive internal API binding building on a few utils. Config rules are applied to the extracted pattern for each font. After which it serializes this data back out so other processes don’t have to go through all this effort. This serializer includes an allocator within the mmap’d file utilizing a hashmap.

It may then (depending on how you call this code) index all the fonts into a hashmap by lang filtered by a pattern & read them back out into a linkedlist. This filtering requires all given properties to be present with compatible values. At which point (once normalized into common type) we get into FontConfig’s dynamic type system!


Harfbuzz is a vital component of a text rendering stack. It converts a sequence of Unicode characters into a sequence of positioned “glyphs”.

This makes English text look a little nicer, but is vital for other written languages. Especially non-latinate ones (i.e. outside Europe and the nations they colonised who didn’t previously have writing).

Harfbuzz centers around a single complex function (with multiple OS-specific backends), but to start: the datastructures you pass to that function!

Harfbuzz includes a methodtable wrapping LibUCD, GLib, LibICU, or noops for unicode processing. Implements trivial utilities itself.

Behind a few layers of light abstraction Harfbuzz provides a bitset segmented into a hashmap. The innermost implementation is a “bitpage”, wrapped (with a 2nd-layer allocation-managed mapping) in a “bitset”, a possibly-negating “bitset invertable”, & finally a “set” with C bindings (from C++). Then there’s the “set digest”.

Harfbuzz includes an open-addressed hashmap implementation with consecutive rehashing, whose entries can be bitflagged as used or tombstoned. The initial hash is kept within a prime range though rehashing is capped to powers-of-two for efficiency; occupancy & population are tracked to decide when to resize.
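A toy Python equivalent of that design, with linear probing standing in for Harfbuzz’s rehashing scheme & power-of-two capacities throughout:

```python
class ProbingMap:
    """Toy open-addressed map with tombstones: deleted slots are flagged
    rather than emptied, so probe chains stay intact. Population counts
    live entries; occupancy also counts tombstones, & triggers resizing."""
    TOMB = object()

    def __init__(self, cap=8):
        self.slots = [None] * cap
        self.population = 0
        self.occupancy = 0

    def _probe(self, key):
        mask = len(self.slots) - 1        # power-of-two capacity
        i = hash(key) & mask
        while True:
            slot = self.slots[i]
            if slot is None or (slot is not self.TOMB and slot[0] == key):
                return i
            i = (i + 1) & mask            # linear probing past collisions & tombs

    def set(self, key, value):
        if (self.occupancy + 1) * 2 > len(self.slots):
            self._resize()
        i = self._probe(key)
        if self.slots[i] is None:
            self.population += 1
            self.occupancy += 1
        self.slots[i] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return None if slot is None else slot[1]

    def delete(self, key):
        i = self._probe(key)
        if self.slots[i] is not None:
            self.slots[i] = self.TOMB     # tombstone, don't break probe chains
            self.population -= 1

    def _resize(self):
        old = [s for s in self.slots if s not in (None, self.TOMB)]
        self.slots = [None] * (len(self.slots) * 2)
        self.population = self.occupancy = 0
        for k, v in old:
            self.set(k, v)
```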

Harfbuzz provides a resizable, quicksortable, slicable, binary-searchable C++ array type.

There’s a memory-managed byteslice type called a “blob” used to wrap raw fontfile data.

One of the two main inputs provided to Harfbuzz’s main function is a “buffer” of the text you wish to shape. And its main output is a buffer of the corresponding glyphs. Buffers can function as a resizable array, with some of those methods performing text transcoding. The output can be normalized with sorting.

These have several methods for moving data around within them, & a method for enforcing bitmasks. Mostly for the sake of testing, presumably, there’s a diffing method.

Buffers have (beyond memory management) properties for:

Bitflags include:

The cluster level may be “monotone graphemes”, “monotone chars”, or “chars”. Can convert to/from JSON or markup formats. Can verify results.

Harfbuzz can parse a colon-separated $HB_OPTIONS envvar.

It defines a 4char “tag” type (within a single CPU word) used, amongst several other things, to denote the text’s script/alphabet.
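The packing matches the arithmetic of Harfbuzz’s HB_TAG macro; a Python sketch:

```python
def hb_tag(s):
    """Pack a 4-char tag like 'GSUB' or 'latn' into one 32-bit word,
    first char in the most significant byte."""
    assert len(s) == 4
    b = s.encode('ascii')
    return b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3]

def hb_untag(tag):
    """Unpack a 32-bit tag back into its 4 chars."""
    return bytes([(tag >> 24) & 0xff, (tag >> 16) & 0xff,
                  (tag >> 8) & 0xff, tag & 0xff]).decode('ascii')
```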

There’s a direction enum which can be interpreted as bitflags for trivial processing. You can determine this direction for the script.
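If I’m reading hb-common.h right, the enum values are LTR=4, RTL=5, TTB=6, BTT=7, & the bitflag tricks fall out of them:

```python
# hb_direction_t values: one bit selects the axis, the other the sense.
LTR, RTL, TTB, BTT = 4, 5, 6, 7

def is_horizontal(d): return (d & ~1) == 4   # {LTR, RTL}
def is_vertical(d):   return (d & ~1) == 6   # {TTB, BTT}
def is_backward(d):   return (d & ~2) == 5   # {RTL, BTT}
def reverse(d):       return d ^ 1           # flip the sense, keep the axis
```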

There’s a “language” canonicalized-string type, with a linkedlist interning them.

There’s parsing utils, used to read/write font features & variations.

As-usual it exposes a version number.

Harfbuzz defines word-sized bools, 32bit codepoints & positions & masks, numeric unions with or without including floats.

Font features consist of a tag, 32bit value, start, & end. Font variations consists of a tag & floatingpoint value.

Both tags & colours are 32bit values split into 4 components.

You can provide Harfbuzz callbacks for visually-rendering these glyphs, with an internal structure tracking internal state.

Harfbuzz Fonts & SubSetting

Besides the buffer of the text to be shaped, Harfbuzz’s main function accepts the font with which it’s to be rendered. The font defines the glyphs it can choose from to represent that text.

Harfbuzz provides a font object specifying synchronization as its other dominant parameter & memory-management properties, as well as:

These are loaded from blobs & an index via a “Face” object.

Faces themselves include some memory-management callbacks, index number, units-per-em, glyphcount, a “shaper” parameterized for this type, an OpenType face, & linkedlist of shape plans.

There’s a whole segment of the codebase modelling each table present in an OpenType fontfile (the dominant & most featureful format, of which WOFF files are a variant). They have accessors & utility methods usually wrapping iterators (another type Harfbuzz reimplements itself). Many use unions to implement schema upgrades.

Then there are utility methods, like those on the Sequence, SingleSubstFormat, or AlternateSet tables, which apply the substitutions to a buffer. I am glossing over much of this code since there’s so much of it that looks highly repetitive, so I’ll discuss it in more detail later as I discuss how it’s used.

The glyf table, amongst some variations, has matrix-transform utility methods.

Harfbuzz wraps C++ iters with operator overloading to add functional-style map/etc in a pipeline.

To help ensure the job’s done right, taking into account the concerns Harfbuzz implements, they implement font subsetting themselves. For which it models “objects” as a doubly-linkedlist of “real” & “virtual” “link” arrays. Links in turn hold an unsigned width, position, & object-index.

Postprocessing involves copying objects into a new array item-by-item & calling into a “repacker” to wrap that in a sorted graph (another subsystem), decide how to efficiently split it, & serialize the result.

A subset plan holds various bitflags, sets, & (possibly-hash) maps, as well as some arrays & a source/dest font to which it provides accessors. It has a C wrapper & default field values. Or you can create a subsetplan from a subset input, computing:

That’s just touching the surface, especially of the glyph-IDs to retain. It frees properties upon error, & has accessors.

The serialization logic can be extensive, & different formats may optimize differently.

With that plan it repeatedly iterates over preserved tables computing what to keep until fixpoint (both inner & outer loops), attaching a new “accelerator” within the font’s userdata after the fact.

Subsetting a table generally involves estimating the subsetted tablesize for allocation, deferring to a method on the table, reserializing it, & recovering from errors.

Shaping Dispatch

Harfbuzz’s central function is hb_shape_full, at least once they decided to add extra parameters to hb_shape.

hb_shape_full validates the buffer, possibly copying data over into a new buffer for complete checks, creates/executes/destroys a cached “shape plan”, & possibly extensively validates the resulting buffer: ensuring all glyphs face in the same direction, breaking invariants are followed, etc.

Creating a cached Harfbuzz shape plan involves checking whether it’s worth caching, if so allocating a caching key & iterating over a linkedlist looking for it, creating the plan itself (returning it immediately if we’re not caching), & allocating a node into that linkedlist.

Creating a Harfbuzz shape plan involves validating that a direction & properties are given, allocating the structure, flagging the fontface (falling back to empty) as immutable, initializing the key, & initializing the OpenType info.

Executing a shaping plan involves, after validation, going over all builtin “shapers” checking whether the font holds relevant data & calling its entry function.

These builtin shapers may (depending on build flags) include:

Fallback Shaping

In case a font doesn’t provide shaping tables Harfbuzz provides a rudimentary fallback. Still less euro-centric than much toy text-rendering code; according to documentation it supports Latin (the alphabet being used here for English), Cyrillic, Greek, Armenian, Georgian, Tifinagh, etc!

This fallback first looks up the space glyph & whether it’s present, zeroes output, & retrieves parameters before iterating over the buffer & cleaning up.

For each char this fallback shaper checks if it’s default-ignorable. Those become space glyphs with zero x & y advance. Otherwise it looks up the char, position advance, & position origin.

Cleaning up involves checking whether the given direction is “backward” (sounds a little judgemental of a term to me…) & if so the glyph order is reversed, using an iteration with both forward & backward indices.
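That two-index reversal is simply (a Python sketch):

```python
def reverse_glyphs(glyphs):
    """In-place reversal walking a forward & a backward index toward
    each other, swapping as they go."""
    i, j = 0, len(glyphs) - 1
    while i < j:
        glyphs[i], glyphs[j] = glyphs[j], glyphs[i]
        i += 1
        j -= 1
    return glyphs
```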

Then it finishes by unsetting all glyphs’ flags except HB_GLYPH_FLAG_DEFINED.

Looking up a glyph involves deferring to an internal methodtable. Horizontal & vertical advances are separate methods internally. Same for origin, incorporating subtraction.

To perform these per-char lookups Harfbuzz may defer to FreeType (will describe later), or it may use its own OpenType implementation.

There’s a bit of C-preprocessor magic which I found a bit difficult to navigate, but it defers to the font’s CMap table & in turn an appropriate callback for its “fontpage” (roughly equivalent to alphabet/“script”) or version number. In either case the fallback is to defer indirectly to an appropriate “get_glyph” method which may do a binary search or array lookup.

There’s fastpaths for formats 12 & 4.
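The binary-search flavour of those lookups can be sketched as follows, assuming format-4-style segments where glyph IDs run consecutively from each segment’s start (a simplification of the real delta/range-offset encodings):

```python
def cmap_lookup(segments, cp):
    """Binary-search sorted (start, end, first_glyph) codepoint ranges;
    within a segment, glyph IDs are assumed consecutive."""
    lo, hi = 0, len(segments)
    while lo < hi:
        mid = (lo + hi) // 2
        start, end, first_glyph = segments[mid]
        if cp < start:
            hi = mid
        elif cp > end:
            lo = mid + 1
        else:
            return first_glyph + (cp - start)
    return 0   # glyph 0 is .notdef
```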

A second try may attempt rewriting the Unicode char.

Emulating Microsoft, Harfbuzz may try adding all 1s to the top nybble of single-byte chars. Or it may consult some sophisticated bittwiddling & lookuptables for fallback Arabic codepoints, in one of 2 codepaths.

Results are cached.

To compute horizontal advances for some glyphs Harfbuzz may first consult a cache, before consulting the HMTX table, tweaking scale, & writing into output arrays. Much of which is bypassed for char-by-char lookup.

Vertical advances use VMTX near-identically.

The HMTX & VMTX tables perform what resembles a bounds-wrapped array lookup, possibly followed up by consulting a VAR (variations) table for tweaks & possibly deferring to the glyph’s methodtable.

To compute vertical offsets (horizontal appears commented out?) Harfbuzz first computes the horizontal advance as I’ve just described, & checks for a VORG table. If present it checks the VMTX’s variations table for tweaks before deferring to the VORG table.

Otherwise it attempts to compute a vertical origin from the font extents & leading-bearing (I think I’ll explore those topics later as they become more relevant?). Otherwise it tries a trivial formula to compute the origin from extents.

Consulting the VORG table involves a binary search.

OpenType Shaping

Harfbuzz’s core logic, when it’s not deferring to some other OS’s implementation, involves:

  1. Gathering parameters into a “context”
  2. Updating a bitmask of allocated variants
  3. Resetting a mapping
  4. Iterating over the buffer splitting “graphemes”
  5. Possibly (checking per-glyph bitflags) prepending a dotted-circle glyph, if available, for Unicode marks to apply to, annotating it with appropriate bitflags
  6. Possibly iterating over graphemes to merge or split them
  7. Iterating over the buffer again considering whether to reverse the graphemes to enforce a consistent text direction
  8. Running the context’s shaper preprocessing plan surrounded by debug messages
  9. Considering flipping each character horizontally or vertically, allocating variations
  10. Applying some normalization
  11. Looking for fractions to avoid breaking
  12. Telling the shaper to set up masks, copying results over to the buffer
  13. Possibly reclassifying each char’s combining class
  14. Overwriting codepoints with computed glyphs
  15. Deallocating variations from the bitmask
  16. Allocating glyphs & ligatures into the bitmask
  17. Consulting the GDEF table to tweak glyph properties whilst clearing ligatures
  18. Recomputing each glyph’s properties
  19. Running the plan’s substitution pass
  20. Removing flagged glyphs, compacting the array whilst recomputing clusters
  21. Computing positioning
  22. Postprocessing substitutions
  23. Propagating bitflags
  24. Cleaning up

It computes positioning by:

  1. Clearing old info
  2. Computing initial info from the font’s advances & origin, with a fallback inserting whitespace via positioning properties
  3. Tweaking that based on various font tables (h-origin, & repeatedly GPOS) including another shaper callback, whilst considering “mark”, “ignorable”, & deleted glyphs specially, falling back to per-cluster layout along a “baseline”
  4. Considering reversing glyphs
  5. Deallocating variants

Substitution postprocessing involves possibly removing glyphs flagged deleted, overwriting invisible glyphs or deleting them, & running the shape plan.

Graphemes (first-pass char groupings)

There’s a general initial pass setting each glyph-info’s continuation & other Unicode flags based on which characters it sees.

Then there’s a choice between two codepaths selected by the buffer’s given cluster_level, implemented as methods on said buffer called per-grapheme. Hopefully they’re 1 toot each?

After some size-validation & taking a fastpath for char-level clustering, merging clusters involves iterating over each of its chars taking the minimum cluster number, iterating over following chars until they’re in a different cluster, doing the same for preceding chars (continuing into an “out-buffer”), & ensuring all chars in this broadened range have their cluster set to the same value. Flagging them DEFINED upon change.
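The merge described above boils down to something like this (a simplified sketch, not Harfbuzz’s code; the widening rules are approximated):

```python
# Take the minimum cluster value over a range, widen the range to cover
# neighbours already sharing the boundary clusters, then assign the
# minimum to everything in the widened range.

def merge_clusters(clusters, start, end):
    target = min(clusters[start:end])
    # widen forwards while following chars share the old end cluster
    while end < len(clusters) and clusters[end] == clusters[end - 1]:
        end += 1
    # widen backwards likewise
    while start > 0 and clusters[start - 1] == clusters[start]:
        start -= 1
    for i in range(start, end):
        clusters[i] = target
    return clusters

print(merge_clusters([0, 1, 2, 2, 3], 1, 3))
# → [0, 1, 1, 1, 3]: index 3 got pulled in since it shared cluster 2
```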

The alternate codepath flags each char-info in the range with UNSAFE_TO_BREAK & UNSAFE_TO_CONCAT, using one of a few minorly-differing codepaths handling edgecases.

Locale-Specific Preprocessing

Depending on your alphabet, Harfbuzz may apply additional preprocessing logic. Given Latin’s the one I’m familiar with I’m not sure how much sense I’ll make out of this, but here goes…

Hangul (if I understand correctly) needs to compose its syllables from leading-consonant, vowel, & (optional?) trailing components, each component having its own Unicode char-range. Or it may need to decompose, depending on the font.

This involves a main iteration, which first checks for “tones” & whether to normalize by adding dotted circles.

Then it checks for leadings followed by vowels with a possible (if font has the glyph) trailing merging them into a single cluster. Otherwise for pre-combined glyphs it checks whether the font has the glyph & considers decomposing it with appropriate Hangul bitflags.

It considers zero-width chars whilst processing Hangul “tones”.

For Thai it uses a simpler iteration skipping non-“sara-am” chars, flagging continuations & switching them to “sara-aa” chars.

Then scans back over certain chars rearranging or merging clusters.

If the font doesn’t embed its own shaping logic Harfbuzz runs a 2nd iteration during preprocessing that classifies each char’s “mark type”, filling in blanks or following a state machine for which glyph lookup table to use, attempting Windows’ then Mac’s mappings.

There’s a “Universal Shaping Engine” designed by Microsoft whose preprocessing inserts dotted circles where vowels are missing.

Unless Harfbuzz is trying to be Uniscribe bug-compatible, for Indic it’ll apply that same logic filling in missing vowels. The condition for doing so is compiled from a Microsoft datafile to C via a Python script. Metaprogramming!

And that appears to be all the “scripts”/”alphabets” known to Unicode & Harfbuzz with special preprocessing needs?

Text Normalization

After splitting graphemes, preprocessing, & flagging mirroring Harfbuzz enforces some normalization to match the chars defined by the font.

First off it checks a handful of fields are populated. If the normalization-mode is “auto” it’ll choose an actual one based on whether the plan has a “gpos mark”. This mode is used to flag whether it always or might shortcircuit.

Further checks utilize a “normalization context”.

Then there’s 3 “rounds”…

The initial “decomposition” round clears the buffer’s output flags, & iterates over the buffer chars before syncing properties. For each char it skips ahead to the next “mark”-bitflagged char then rewinds. If it might shortcircuit it checks whether the char’s pre-existing glyph already exists; if so it repeatedly decomposes the char handling failures &, together with all marks, decomposes the cluster.

Decomposing a char sees if there’s a directly-corresponding glyph to append to output.

If not it tries consulting the shaper methodtable (I’ll discuss that tomorrow…), shortcircuiting if it didn’t decompose or the font doesn’t have the char. Otherwise it recurses on the first decomposed (“a”) char, checking for its availability before or after depending on whether it “might shortcircuit”. The preceding check can likewise be moved after.

Then it handles spaces or hex 2011 chars specially if needed.

Decomposing a “cluster” involves iterating over it looking for variation selectors, then decomposing each char.

For each variation selector codepoint in a cluster Harfbuzz iterates over said cluster looking for other (rare) variation selectors. For each in that inner loop it either consults the font for a variation or leaves the data for a later “GSub” pass to process (another day), skipping subsequent variation selectors.
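The recursive per-char decomposition described above can be sketched as follows (a toy sketch with a tiny hand-picked decomposition table; the real logic defers to a Unicode library & the shaper methodtable):

```python
# Emit the char directly if the font has a glyph for it, else recurse
# into its canonical decomposition; fail if neither works. The table
# entries are real Unicode canonical decompositions, but only a sample.

DECOMPOSE = {
    0x00E9: (0x0065, 0x0301),  # é -> e + combining acute
    0x01D8: (0x00FC, 0x0301),  # ǘ -> ü + combining acute
    0x00FC: (0x0075, 0x0308),  # ü -> u + combining diaeresis
}

def decompose(cp, font_glyphs, out):
    if cp in font_glyphs:          # directly-corresponding glyph
        out.append(cp)
        return True
    if cp not in DECOMPOSE:        # can't decompose further
        return False
    a, b = DECOMPOSE[cp]
    return decompose(a, font_glyphs, out) and decompose(b, font_glyphs, out)

out = []
decompose(0x01D8, {0x0075, 0x0301, 0x0308}, out)
print([hex(c) for c in out])
# → ['0x75', '0x308', '0x301']
```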

The 2nd “reorder” pass (only if the initial pass found a char which needs this) iterates over glyphs locating runs with “combining classes” & sorts them with a postprocessing callback.

If there’s CGJ chars (hex 034F, has a bitflag on the containing buffer) Harfbuzz examines surrounding chars to determine whether to unset the hidden bitflag.

The 3rd & optional round (upon the same check, for specific modes) combines diacritics into a single char when that’s what the font expects. By clearing output, iterating over glyphs, & syncing. For each glyph it checks whether it’s bitflagged as a Unicode mark & checks the combining classes before running a callback & looking up the glyph.

Locale-Specific Normalization

Harfbuzz’s logic for composing & decomposing glyphs is usually deferred to some Unicode library or other (LibICU has already been added to my queue), but it may be done specially for the different “scripts” Harfbuzz supports. But even there it typically defers to some extent or other to said Unicode library.

For Hebrew failing generic composition & if there’s no GPOS marks, Harfbuzz uses a lookuptable or 2.

The first lookuptable is a switch statement upon the 1st char with conditional checks upon the second. In a specific branch it may consult an array of “dagesh forms”. All amounting to hardcoded ligatures.

Indic disables decomposition of chars hex 0931, 09dc, 09dd, & 0b94 before deferring to general logic. Composition skips marks & recomposes hex 09af & 09bc as 09df specially, before deferring to general logic.

Khmer decomposes certain chars specially to insert hex 17c1 chars, before deferring to general logic. Composing skips marks before the general logic.

The Microsoft USE algorithm composes chars likewise.

Arabic & Hebrew have special postprocessing after sorting “combining classes” in a cluster.

For combining classes 220 then 230 Arabic scans ahead over chars with lower combining class, & if it found anything it scans ahead over “modified combining marks” with equal combining class.

Those subclusters are internally-merged & swapped, before assigning new combining classes.

Hebrew checks the combining class of 3 char runs, & if they have certain combining classes the char-clusters are merged & the chars are swapped.

Locale-Specific Char-Property Computation

After normalizing text to use the available glyphs, the methodtable for the appropriate “script” (~alphabet) is given a chance to attach additional info.

Afterwards “font features” are applied into these bits via a dynamic lookuptable over the appropriate ranges (unless that range is the entire text) in the appropriate cluster.

For Arabic it allocates some bits, computes joining info, possibly computes Mongolian variations, & copies resulting data over into a different private property.

Arabic is a cursive script, meaning the glyph for each char is selected based on its surrounding chars. As such to compute those “joinings” it examines the preceding context consulting a multi-stage lookuptable to classify each preceding char, then another lookuptable to traverse a statemachine for the first non-“T” joiningtype.

Then it continues the lookups to classify each char & run the statemachine, & for the first non-“T”-joining trailing char.

For certain Mongolian chars (hex 180b, 180d, or 180f) it may copy the joining state from the previous char in a postprocessing loop.
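The joining computation above amounts to picking one of four contextual forms per char; a simplified sketch (assumed joining classes, not Harfbuzz’s actual tables or statemachine):

```python
# D = dual-joining, R = right-joining, U = non-joining; transparent
# chars (T, e.g. vowel marks) are skipped when finding neighbours.
# A char gets isol/init/medi/fina depending on whether joining occurs
# across each of its boundaries.

def joining_forms(types):
    joins_next = lambda t: t == 'D'          # joins towards following char
    joins_prev = lambda t: t in ('D', 'R')   # joins towards preceding char
    visible = [i for i, t in enumerate(types) if t != 'T']
    forms = {}
    for n, i in enumerate(visible):
        prev = visible[n - 1] if n > 0 else None
        nxt = visible[n + 1] if n + 1 < len(visible) else None
        before = prev is not None and joins_next(types[prev]) and joins_prev(types[i])
        after = nxt is not None and joins_next(types[i]) and joins_prev(types[nxt])
        forms[i] = {(False, False): 'isol', (False, True): 'init',
                    (True, False): 'fina', (True, True): 'medi'}[(before, after)]
    return [forms.get(i, '') for i in range(len(types))]

print(joining_forms(['D', 'D', 'R', 'D']))
# → ['init', 'medi', 'fina', 'isol']: the R char ends the joining run
```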

For Myanmar it allocates some “category” & “position” bits, & consults a multistage lookup table for each char’s category. The bulk of the logic here is in adjusting the mapping, discussed later.

Indic works the same as Myanmar here, except the looked-up value includes the position bits.

For Khmer it similarly allocates some “category” bits & for each char consults that same lookuptable for the category. Again most of the logic is in adjusting the mapping.

For Microsoft’s USE algorithm it considers running the Arabic routine before allocating category bits & iterates over each char retrieving values from a sophisticated lookuptable + bitwise manipulation.

For Hangul it iterates over all chars to lookup the previously-computed “hangul shaping feature” in a lookuptable, if that lookuptable’s present. The bits for the “hangul shaping feature” are deallocated regardless.

Locale-Specific Map Initialization

As the OpenType shaper is initialized, it gathers then compiles a map, possibly followed by a data_create callback.

Gathering those features involves enabling “rvrn” (adding it to a feature_infos array), registering a null callback on the GSUB table, enabling “ltra” & “ltrm” features or the rtl equivalents, compiletime-optionally enabling “frac”/“numr”/“dnom” for rendering fractions, enabling “rand”, compiletime-optionally enabling “trak”, enabling “Harf” & “HARF”, & deferring to the script!

After alphabet/“script”-specific logic it enables the “Buzz” & “BUZZ” features, adds “abvm”/“blwm”/“ccmp”/“locl”/“mark”/“mkmk”/“rlig” features, adds “calt”/“clig”/“curs”/“kern”/“liga”/“rclt” features or if-vertical “vert”, iterates over caller-specified features adding each, possibly adds them again as AAT features, & gives the “script” a chance to override this config.

Compilation involves selecting certain features/etc, compacting memory, & extracting various properties/conditions.

For Arabic it enables the “stch” feature, registers a callback on the GSUB table (checking ligature information per-glyph to choose “repeating” or “fixed” shaping actions whilst setting ARABIC_HAS_STCH bitflags), enables “ccmp” & “locl” features, registers a null GSUB callback, adds “isol”/“fina”/“fin2”/“fin3”/“medi”/“med2”/“init” features flagging Syriac fallbacks & registering null GSUB callbacks, registers a GSUB callback to deallocate arabic-shaping-action bits, enables the “rlig” feature, registers a GSUB callback to apply some fallback shaping logic if the font doesn’t provide it, enables the “calt” feature, if “rclt” is missing adds a null GSUB callback & enables it, & enables “liga”, “clig”, & “mset”.

This Arabic fallback logic is implemented in another file with “Unicode” & “Win1256” subcomponents to populate the lookuptable. It’s predominantly a per-glyph lookup.

For Khmer it registers a couple GSUB callbacks, enables “locl” & “ccmp” features, adds “pref”/“blwf”/“abvf”/“pstf”/“cfar” features, registers a callback to deallocate syllable bits, & adds “pres”/“abvs”/“blws”/“psts” features.

An initial callback allocates syllable bits, & parses the text via a generated lexer to populate those bits. Flagging each syllable as unsafe-to-break.

The other runs some more generic logic to insert dotted circles in place of missing vowels, before reordering extensively within each syllable.

For Indic it registers a callback that lexes Indic syllables & flags each as unsafe-to-break, enables “locl” & “ccmp” features, registers a reordering callback, adds “nukt”/“akhn”/“rphf”/“pref”/“blwf”/“abvf”/“half”/“pstf”/“vatu”/“cjct” features, registers a final reordering callback, & adds “init”/“pres”/“abvs”/“blws”/“psts”/“haln” features. Not dissimilar to Khmer!

The initial reordering callback involves checking some bits for the syllable-type to branch over alongside initial chars.

The outer branch chooses whether to apply this logic taking into account whether we want to be bug-compatible with Uniscribe. I’m not comprehending what the sizable inner branch is attempting to achieve as it rewrites the glyphs.

The final reordering callback looks for “virama” glyphs ensuring they’re categorized as “Halant”. Then it iterates over chars, considers the prefix/suffix, checks whether that leaves much remaining, checks for “rephs”, checks for “prefs”, & finalizes clustering.

That loop locates “base-c” chars & the subsequent “pref” char setting bitflag on the next non-halant char. Malayalam has its own base-c char logic taking into account joiners, halants, & consonants.

If the text is near-entirely a prefix/suffix it puts a bit more work into locating the central word.

For initial pre-m glyphs it memmoves some glyphs & merges resulting clusters. Otherwise it merges clusters split by pre-m glyphs.

I won’t put more time into studying Indic shaping…

For Microsoft’s USE algorithm Harfbuzz registers a callback (to allocate syllable bits, lex into them, propagate a certain bitflag, aggregate bitflags, & iterate over syllables propagating joining flags taking the aggregated flags into account), enables “locl”/“ccmp”/“nukt”/“akhn” features, deallocates substitution bitflags, adds the “rphf” feature, registers a GSUB callback to iterate over syllables & their chars to see where substitutions happened, registers a substitution-deallocator GSUB callback again, enables the “pref” feature, registers a GSUB callback to record its use, enables “rkrf”/“abvf”/“blwf”/“half”/“pstf”/“vatu”/“cjct” features, registers a GSUB callback to insert dotted circles in place of missing vowels & lightly reorder within each syllable, registers a syllabic-bits-deallocator GSUB callback, adds “isol”/“init”/“medi”/“fina” features, registers a null GSUB callback, & enables “abvs”/“blws”/“haln”/“pres”/“psts” features.

Finally for Hangul it just adds the “ljmo”, “vjmo”, & “tjmo” features.

Glyph Substitution

Harfbuzz’s main(?) “shaping” pass serves to substitute character IDs for glyph IDs corresponding to (usually) vector images within the fontfile based on the information extracted beforehand.

This may start with an AAT layout substitution, before consulting the standard path.

Most of the substitution logic in this general pass is shared with the layout logic.

To “apply” glyph substitutions or positioning to the text it iterates over the previously-queued stages & their “lookups” for each setting various info into a context object & calling another method. Whilst outputting debug info. After the “lookups” iteration it’ll call any pause callbacks if available.

For each lookup Harfbuzz discards empty text-segments, & decides whether to apply forward or backward substitutions. Forward substitutions happen in-place & require synchronization!

In either direction it checks each char’s bitflags against the stage (amongst other bitflags) & runs a substitution- vs shaping-specific callback. In the forward direction it consults a cache.

This is underlaid by a hb_ot_layout_lookup_accelerator_t object & the appropriate table from the fontfile. This “accelerator” gathers & allocates summary information upon initialization retrieving info from the fontface; wrapping a “digest set”, “subtables” array, & cache-user index number.

Behind a few layers of abstraction Harfbuzz dispatches to methods on each opcode in the relevant fonttable.

Substitution opcodes

Harfbuzz’s AAT Layout substitution turns out to apply similar logic to what I discussed yesterday, in a pre-processing step upon the optional Morx & Mort tables.

Arabic Postprocessing

After all the “shaping” computation (save some propagation & cleanup) Harfbuzz gives its “script”/alphabet methodtables a chance to do some final postprocessing. Here I’ll go over how Arabic uses this opportunity! The other supported scripts don’t (I expected more to dig into here…).

Shortcircuiting if the stch char isn’t present according to the buffer’s bitflags, it performs the following steps (next toot) twice, for “measure” & “cut” passes.

The primary subtask is a reverse-iteration over the buffer’s chars within which it:

  1. Skips non-STCH_FIXED or STCH_REPEATING shaping actions, whilst (in the CUT phase) removing them
  2. Otherwise iterates over the STCH_FIXED or STCH_REPEATING chars summing the total fixed & repeating width
  3. Iterates to sum the total width
  4. Computes sophisticated justification
  5. In the CUT phase iterates over once more duplicating repeating chars, having ensured there’s space for them during MEASURE

The secondary subtask is to (for MEASURE) enlarge the buffer or (for CUT) assign the new size.
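The MEASURE/CUT split is a classic two-pass pattern: a first pass computes how much the buffer must grow, then a second pass (after one reallocation) actually writes. A hypothetical sketch repeating a kashida-like filler glyph to cover a target width (invented names, not Harfbuzz’s code):

```python
def stretch(glyphs, filler, filler_width, extra_width):
    """Repeat `filler` enough times to cover `extra_width`."""
    repeats = -(-extra_width // filler_width)  # ceiling division: MEASURE
    out = []                                   # the "reallocated" buffer: CUT
    for g in glyphs:
        out.append(g)
        if g == filler:
            out.extend([filler] * (repeats - 1))
    return out

print(stretch(['a', '_', 'b'], '_', 3, 7))
# → ['a', '_', '_', '_', 'b']: ceil(7/3) = 3 copies of the filler
```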

Shaping Miscellania

Looking through the rest of Harfbuzz, I see:


FreeType is the preeminent library for reading and rasterizing fontfiles! Neatly split into several submodules, most of which tackle a different fontformat. Or maybe they implement common forms of processing.


“Autofitting” relates to aligning text to the pixel grid. Less relevant now given even moderately high-res displays, but still… For the old “standard res” we hacked in higher-res for the sake of text by using the 3 colours as “subpixels”.

FreeType autofit has a struct tracking the original, current/scaled, & current/fitted position/width for the text converting from font units to screen subpixel units. The computation can be parameterized upon a fontface, x/y scale, x/y delta, rendermode, & some bitflags that can disable horizontal, vertical, or advance hinting.

There’s a routine which compiletime-may consult Harfbuzz retrieving all glyphs in the font & writing an unsigned short to those indices in a given table.

This mutual dependency between FreeType & Harfbuzz may be broken on the FreeType side.

Writing systems (here from a static array, generated from macros) can have methodtables. There’s various datatables listing unicode ranges for different scripts.

A “module record” struct listing fallbackstyle, default script, whether to darken stems, & 8 darkening integral parameters.

A loader struct holding a face, globals, hints, metrics, whether its transformed, the matrix, delta, & 2 more vectors.

There’s a bubblesort for position arrays (actually a good sorting algorithm for trivial amounts of data).
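For the handful of positions autofit deals with, the simple loop genuinely beats fancier algorithms’ constant factors; a sketch with the usual early-exit optimization:

```python
# O(n²) compares, but with tiny n & mostly-sorted input bubble sort is
# fine, & the swapped flag exits early on already-sorted data.

def bubble_sort(positions):
    for end in range(len(positions) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if positions[i] > positions[i + 1]:
                positions[i], positions[i + 1] = positions[i + 1], positions[i]
                swapped = True
        if not swapped:  # already sorted: stop
            break
    return positions

print(bubble_sort([5, 1, 4, 2]))
# → [1, 2, 4, 5]
```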

When initializing the globals structure, FreeType iterates over all the charcodes for each writing system seeing which are present in the font (couple inner loops) whilst populating a lookuptable. Defers to the Harfbuzz logic described earlier, if present. ASCII digits are handled specially, & maybe output debugging.

An iteration over writing systems populates a metrics lookuptable.

There’s a noop autofitting methodtable.

There’s a routine to bubblesort & statistically summarize a width array.

There’s a routine for storing an array in an “axis hints segment”. Or an edge, recomputing a bounds array.

There’s a routine to recompute hinting properties for new parameters, tweaking existing data.

There’s a routine for iterating over segments aligning them to the x or y grid. There’s an interpolation routine along the subpixel grid. And a separate one for “weakpoints”.

All these routines operate upon a somewhat sophisticated internal datamodel.

There’s a routine abstracting glyph loading, adding unit conversion & a chance for the computed metrics, stem darkening, matrix transforms, the relevant writing system, etc to incorporate tweaks.

Everything’s abstracted behind a module methodtable.

The basic writing systems FreeType understands for autofitting include latinate, indic, & CJK. With a “blue” subsystem.

Each of those writing systems are processed according to their own datamodels. The Indic writing system for autofitting purposes is treated as a minor variation upon CJK.

The “blue” subsystem autofitting “bluestrings” is entirely datafiles.

I don’t think I’ll tackle the CJK writing system since I don’t write it & can’t speak to it. There is a fair bit of code there, which seems appropriate to my limited knowledge.

FreeType “autofitting”’s method table for this “latinate” writing system used here:

FreeType Core

FreeType includes a “core” module for all its dependencies to build upon. In here I see:

BDF Fonts

BDF is (according to the README here) an early & limited bitmap font format from Adobe intended to be human- & computer-legible. Could well have been vital to getting computing off the ground!

FreeType’s BDF implementation has a CMap subclass with binary-search lookups (2 methods).

A routine extracts bitflags & extends flags from BDF metadata.

This submodule includes its own resizable-array implementation for its parser.

Parsing a given FreeType Stream (filedescriptor wrapper) involves repeatedly:

  1. Checking whether to refill the buffer from that Stream
  2. Skipping empty lines
  3. Locating the end-of-line & whether we need to refill to do so; if so preparing
  4. Unless it’s a comment line, running the callback
  5. Skipping to end-of-line.
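That refill-&-scan loop can be sketched over an in-memory chunked stream (a hypothetical structure; the real parser works on FreeType Streams):

```python
def parse_lines(read_chunk, callback):
    """read_chunk() returns the next chunk of bytes, or b'' at EOF."""
    buf = b''
    while True:
        nl = buf.find(b'\n')
        if nl < 0:                    # need a refill to find end-of-line
            chunk = read_chunk()
            if not chunk:
                if buf:
                    callback(buf)     # final unterminated line
                return
            buf += chunk
            continue
        line = buf[:nl]
        buf = buf[nl + 1:]            # skip past end-of-line
        if line and not line.startswith(b'COMMENT'):
            callback(line)            # skip empty & comment lines

chunks = iter([b'STARTFONT 2.1\nCOMM', b'ENT hi\nENDFONT\n'])
seen = []
parse_lines(lambda: next(chunks, b''), seen.append)
print(seen)
# → [b'STARTFONT 2.1', b'ENDFONT']: the comment line was skipped
```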

There’s number-parsing routines (seems they plonked a self-contained BDF parser into FreeType & minimally modified it to fit), an encoding-comparator, keyed accessors into a metadata parallel-array (+ FreeType hashmap), append a comment to their array, deallocator, & line-oriented statemachine parser.

This parser isn’t really all that complex relative to other parsers I’ve seen…

There’s a Face subclass extensively abstracting the parser’s output into FontConfig’s common format, including populating that CMap subclass & ensuring any requests are within BDF’s limited capabilities.

BZip2 Fonts

Read over the next FreeType submodule on my list. It was practically nothing!

To cut down on disk-usage FreeType includes a small wrapper around LibBZ2, hooking it up to its Stream trait & memory management. Resulting in a Stream decorator designpattern!
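The decorator pattern in miniature: wrap one read()-style stream in another that decompresses on the fly, here using Python’s bz2 module in place of LibBZ2 (this shows the structure, not FreeType’s actual code):

```python
import bz2, io

class Bz2Stream:
    """Decorates a byte stream, decompressing bzip2 data on the fly."""
    def __init__(self, wrapped):
        self.wrapped = wrapped               # the underlying stream
        self.decomp = bz2.BZ2Decompressor()
        self.pending = b''
    def read(self, n):
        # pull & decompress chunks until we can satisfy the request
        while len(self.pending) < n and not self.decomp.eof:
            chunk = self.wrapped.read(64)
            if not chunk:
                break
            self.pending += self.decomp.decompress(chunk)
        out, self.pending = self.pending[:n], self.pending[n:]
        return out

raw = io.BytesIO(bz2.compress(b'glyph data' * 10))
stream = Bz2Stream(raw)
print(stream.read(10))
# → b'glyph data'
```

Callers keep using the same read() interface, unaware the bytes were ever compressed; that transparency is the point of the pattern.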

Font Caching

The last FreeType subsystem I studied used more CPU to cut down on memory usage by calling into the LibBZip2 decompressor. Today I’ll study the reverse, namely how FreeType uses caching thus using more (temporary) memory to cut down on the amount of work that needs to be recomputed.

In this subsystem I see:

CFF Fonts

Compact Font Format (CFF) is a font format developed by Adobe, and is currently one of the two possible “flavours” of OpenType fonts. Because getting everyone to agree on a standard is hard, in this case meaning that the current standard font format is actually two standards smooshed into one!

In this FreeType subsystem I see:

CID Metadata

The next FreeType subsystem on my list is “Character ID” (CID). This includes a driver implementation performing a largely identity transform, or looking up additional metadata attached to the glyph.

There’s a glyph-loading routine that, after some initialization/validation, defers to its wrappee, retrieving additional info, applies any transforms, & computes/adjusts metrics.

There’s a relatively straightforward parser, involving a triple-nested loop & a postprocessing loop, & a corresponding deallocator.

There’s initializers & finalizers for that driver in a separate file. Which also contains Face, Size, & Slot objects gathering appropriate data in their initializers.

That Face object defers to the “cidload” sourcefile to gather & parse its metadata, which alongside plenty of its own parsing routines defers to those other 2 components.

This “cidload.c” parser is not quite as trivial, and does involve an identifier-to-subparser lookuptable. It operates upon the generic Stream abstraction.

FreeType Dialog

FreeType includes a submodule you can call to aid wrapping it within commandline tools.

This includes verbose memory-management & multithreading utilities.

But mostly it serves to provide its own variations upon the printf utilities operating upon FreeType datamodels; these share some code. Including a little commandline formatting.

Graphics Validation

For the sake of failing fast FreeType includes routines for validating graphics tables. Some of these tables are validated the same way as others, & a few just checks reserved bits are not used. There’s plenty of support routines for traversing these tables.

Lots of code gathering data then deferring elsewhere!

It validates that the zero-advance char infact does have zero-advance, alongside validating complementary chars & version numbers depending on bitflags.

It checks font names, checks for duplicates (if I’ve got this right?), routes several fields to be validated as “properties”, validates ranges don’t overlap (?), ensures known ligature-actions are given, ensures no out-of-bounds ligature index is given, checks version fields are what’s expected, validates lig-actions are in a dynamic range, validates used features are flagged as existing & in-bounds, validates the max used glyph-id is in bounds, & dispatches based on present tables.

Furthermore it checks whether certain glyph properties are positive-nonzero, whether necessary tables are present, that all controlpoints have non-empty values, checks for non-duplicate feature names & flags, that value-offsets are in the value-table, various other reserved-bitflags checks, & certainly others I’m missing.

There’s some datatables, some computed at compiletime. There’s a public API registering into FreeType’s subsystem infrastructure.

Uses sorting to detect duplicates.
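That sort-then-scan trick: after sorting, any duplicate must sit next to its twin, so one linear pass of adjacent compares suffices. A minimal sketch:

```python
# O(n log n) overall vs. O(n²) for pairwise comparison, & no hash
# table allocation needed: handy inside a validator.

def has_duplicates(values):
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

print(has_duplicates([3, 1, 4, 1, 5]))  # → True  (1 appears twice)
print(has_duplicates([3, 1, 4]))        # → False
```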

Font Decompression

To reduce font filesizes the FreeType font-decoding library vendors a copy of LibGZip, adding adaptors to its Stream trait, configured to use FreeType’s allocators.

There’s also an LZW subsystem which similarly exposes a Stream implementation using a written-from-scratch decompressor upon FreeType abstractions. It appears to involve bit-unpacking backreferences & copying data over.
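A from-scratch LZW decoder in the same spirit (this is the textbook algorithm, not FreeType’s actual code, & it skips the bit-unpacking layer): codes index a growing dictionary of byte strings, with the classic edge case where a code references an entry not yet in the dictionary.

```python
def lzw_decode(codes):
    table = {i: bytes([i]) for i in range(256)}  # initial single-byte entries
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                      # the "KwKwK" edge case: code == next slot
            entry = prev + prev[:1]
        out.append(entry)
        table[len(table)] = prev + entry[:1]  # grow the dictionary
        prev = entry
    return b''.join(out)

print(lzw_decode([65, 66, 256, 256]))  # → b'ABABAB'
print(lzw_decode([65, 256]))           # → b'AAA' (the edge case)
```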

OpenType Validation

For the sake of failing fast FreeType includes an OpenType Font Format (a subset of TrueType Font Format) validator. This exposes an otv_validate function to FreeType’s subsystem infrastructure.

The initial checks are:

Validating GDEF tables includes checking:

Validating the basetable includes checking:

Intermingled with parsing & AST traversal.

Validating JSTF tables involves checking:

Validating math tables involves checking:

Validating GSUB tables involves similar checks in addition to, via dynamic-dispatch, each substitution-format & their parameters, whether coverages, counts, glyph-IDs, ligatures, enums, or (via recursion) dynamic types.

Validating GPOS works very similarly to validating GSUB, dispatchtable & all but with a few more support routines.

And finally there’s a large suite of utilities for validating common syntaxes between all these tables!

PCF Fonts

FreeType includes support for the old & simplistic X11 PCF (Portable Compiled Format) fontformat.

This is built with the aid of a bit/byte-reversal utilities.

There’s code exposing this font loader to FreeType’s submodule infrastructure, mostly serving as accessors. Though of course it also calls & frees the parser. And it decodes glyphs.

That binary-parser splits the font up into “tables” according to a table-of-contents, parsing each table separately with minimal use of pushdown automata. After which it reformats the property-table into standard FreeType structures, & validates results.

PFR Fonts

Portable Font Resource files are a concise vector format from BitStream Inc commonly used on TVs. Here I’ll study FreeType’s implementation of it!

This exposes via FreeType’s subsystem infrastructure an iterator over the characters/glyphs (apparently this format doesn’t distinguish, not great…) in the bytestream. Or binary-search within this array. Plus an allocator & finalizer.

Separately it registers a class with accessors scaled to common units. Though kerning requires a separate lookup.

As for glyphloading, that defers to a couple of other lookup routines based on given bitflags, wrapped in a couple datastructures with their own allocators/finalizers, before thoroughly normalizing to common units upon success.

Those methods, & the kerning-lookup, are implemented in a separate file.

Parsing a glyph may involve (after initial bit-parsing) recursively aggregating a compound-glyph (whilst normalizing units) or, on the base path, bit-parsing vector paths into a validated AST.

Upon initializing a PFR fontface object it gets binary parsed (with the help of a callback-lookuptable when “extra items” are bitflagged present) into a subsystem-internal AST with corresponding finalizer. Includes a couple minor subparsers, PFR header-parsing written declaratively via macros & bytecode.

The same file provides logging & validation routines also called upon PFR fontface init.

Bitmaps have their own lookup, binary-parsing, & conversion routine & subroutines thereof.

The bitmap lookup may either be done via a linear-scan or a binary-search based on whether its been validated that the array is sorted.
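That strategy choice in miniature (a sketch with invented names, not FreeType’s code): binary-search only once the array has been validated sorted, else fall back to a linear scan.

```python
import bisect

def find_bitmap(entries, char, validated_sorted):
    if validated_sorted:               # safe to binary-search
        i = bisect.bisect_left(entries, char)
        return i if i < len(entries) and entries[i] == char else -1
    for i, e in enumerate(entries):    # unsorted: linear scan
        if e == char:
            return i
    return -1

print(find_bitmap([10, 20, 30], 20, True),
      find_bitmap([30, 10, 20], 20, False))
# → 1 2
```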

The metrics & (in one of 3 formats) bitmap are bit-parsed from a compact format seperately. With the bitmap being run-length decoded. Results are normalized to common units & loaded into a common representation.

PostScript Auxiliary

FreeType includes several subsystems related to PostScript fonts, a popular format from Adobe from which PDF is derived, including the “auxiliary” FreeType PostScript subsystem.

Amongst the methods “psaux” exposes to the broader FreeType system is a Type1 “decryptor” which XORs every byte in the stream with a pseudo-random number generator (PRNG). I hear this security-through-obscurity fake cryptography used to be pretty popular?

A different (xor-shift) PRNG is also exposed.
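That Type1 scheme is (as far as I can tell) the well-known eexec cipher from the Adobe Type 1 spec: each ciphertext byte is XORed with the high byte of a 16-bit running key, which is then updated from the ciphertext. A sketch:

```python
# Constants & the eexec initial key (55665) come from the Adobe Type 1
# Font Format spec; real streams also prepend padding bytes, omitted here.
C1, C2 = 52845, 22719

def t1_crypt(data, key, decrypt):
    r, out = key, bytearray()
    for byte in data:
        if decrypt:
            out.append(byte ^ (r >> 8))           # plain = cipher ^ hi(key)
            r = ((byte + r) * C1 + C2) & 0xFFFF   # key updates from cipher
        else:
            c = byte ^ (r >> 8)
            out.append(c)
            r = ((c + r) * C1 + C2) & 0xFFFF      # same update on encrypt
    return bytes(out)

enc = t1_crypt(b'/Subrs 5 array', 55665, decrypt=False)
print(t1_crypt(enc, 55665, decrypt=True))
# → b'/Subrs 5 array'
```

Since the keystream depends only on the ciphertext, decryption needs no lookahead, which suits FreeType’s streaming parsers.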

FreeType’s psaux subsystem implements a decoder object, as well as subfonts.

It also defines a CFF-decoder object wrapping a CFF-builder & its CFF-loader in turn, with methods for preparation & single-function binary parsers.

The builder object provides a methodtable for converting the results from that decoder into standard FreeType FT_Outline objects at parse-time, with validation.

It exposes an implementation of the CMap class to the broader FreeType system.

The FreeType psaux subsystem includes methods performing array-lookups & iteration. There’s multiple such implementations. Possibly deferring to the Postscript-names system. All these methodtables are gathered into their own lookup table.

It may expose an “AFM Parser” object, exposing a textual pushdown automaton with keyword tables & a tokenizer.

It exposes a Type1 Decoder wrapping the psnames, pshinter, & a Type1 builder exposing single-function binary parsers. With 1 subparser.

There’s a Type1 Builder object some of these parsers use to construct standard FreeType FT_Outline objects instead of the CFF builder to take into account PostScript hinting.

There’s a PostScript builder object with several geometric properties but no methods.

The PostScript parser appears to be a handful of mostly self-contained textual parsers resembling little more than a lexer. There’s several of these for different PostScript constructs, or loops around them.

It defers to a separate file to parse literal values.

And the PostScript Table object implements a smallintmap/array.

CF2 Fonts

Within FreeType’s “psaux” subsystem there’s a decent chunk of code implementing CF2 fonts.

A “parser” is exposed to the broader FreeType system for CF2 fonts. After some validation, data gathering, & allocation (some of which are implemented in separate functions) this defers to cf2_getGlyphOutline &, if successful, sets the width property.

Now we reach a stack-based opcode interpreter over the given syntax, with extensive initialization. These opcodes usually manipulate the stack & may:

Otherwise error-reporting & recovery.

That hintmap incurs a linear-search (presumably assuming it’s dealing with tiny data); there’s a couple routines which together populate this hintmap. There’s hint bitflags/masks determining which codepaths get taken, in a dynamically-sized array.

These hints incur tweaks to coordinates to ensure the paths line up to low-res pixel grids which are less common now. Possibly manipulating the stack.

Line intersections are handled specially when building these paths, to generate new hints.

After that extensive computation (which also involves tracking “winding” via a cross-product) the path construction ultimately boils down to external callbacks.

There’s a “blues” system which helps generate data to populate the hintmap by iterating over an array checking various conditions. Presumably to compute the subpixel-grid the text is being fitted to?

PostScript Hinting

PostScript has its own “hinting” algorithm, which FreeType has a subsystem for allocating & deallocating methodtables of standard hinting interfaces. This is called by FreeType’s “psaux” subsystem described above.

This includes a “globals” methodtable which includes:

Linear scan query functions are exposed to the core algorithm upon this data.

Another methodtable with its own initializer & finalizer has methods for:

I was looking at the T2 methodtable there, but T1 is extremely similar! Both sharing “PostScript” methods with occasional tweaks. These are built upon “Dimension”, “Mask Table” (bitmask), & “Hint Table” (sorted array) objects. This file also includes a module-initializer storing the given input memory, & a module-finalizer.

As for that “apply” method & core logic:

After some validation it initializes various local properties from the given data, & applies relevant scaling. Then per-dimension (x & y) FreeType’s PSHinter zeroes an array for the glyphs, iterates over the contours array, skipping imperceptible linkedlist entries therein, to bitflag “extremums”. A second iteration flags these extremums’ directions.

Another iteration checks the bitflags ensuring they’re bitflagged as fitted where necessary, adjusts as-per the blues-table, & applies the alignment indicated by that blues-table (using recursion in the fallback path or applying length-specific tweaks). Then this iteration checks snapping bitflags & applies any such snapping according to the alignment. Results are bitflagged as fitted.

Bitmasks are iterated over to ensure the “active” bitflag is set correctly & locate “strong points”, performing comparisons & hittests to ensure those are bitflagged appropriately. With a fastpath when there’s a single bitmask. Then all points with hints are bitflagged as strong.

Yet another iteration snaps points to “bluezones” & bitflags them as strong & fitted, where they weren’t already strong or have invalid tangents.

Contour-points are interpolated between these strongpoints with a bit of geometry. Another iteration interpolates yet more points, & a 3rd interpolates the rest.

These new points are saved in the standard FreeType datastructures, & rescaling is optionally applied.

PostScript Naming

The “psnames” subsystem called by “psaux” is mostly lookuptables & UTF-8 decoding.

Glyph Rasterization

Many font formats, including the most capable one(s) used near-exclusively today, represent the images for each of their glyphs as vector graphics for concision & rescaling. But most compositors require raster graphics. So FreeType includes a rasterizer!

This exposes a method table & a few additional validation/normalization wrapper methods to the subsystem infrastructure tweaking the outline.

Aside from some essentially-noop entries in that methodtable, after additional validation the render method configures precision properties & an inner methodtable, performs a vertical “pass”, & if successful it might perform a horizontal “pass”.

These passes repeatedly perform a “conversion” & either iterate to the next “band” or apply the “sweep”.

Conversion involves iterating over the contours & decomposing their opcodes into cubic/conic curves & lines, with scaling, into a linkedlist tracking the current point.

If decomposing that curve hasn’t failed the contours-iteration proceeds to check the extrema for joins & overshoots. After that iteration it frees all intermediate data tracked to assemble those decomposed paths.

Drawing the sweep involves iterating over the path populating a waiting queue. After validating that y-turns exist in the outline it initializes the output target if needed, computes offsets of each pathsegment from min-y, initializes counters & tweaks the turn count, & loops.

That loop repeatedly iterates over the waiting list copying entries over to the left & right linked lists as appropriate. From there it bubblesorts (since it’s usually already sorted) the left & right lists, & iterates over the y-axis (or x if flipped), virtually merging the left & right lists to trigger the rasterization callback, before advancing to the next scanline, adjusting the render target & re-sorting. Any trailing lefts & rights are freed before iterating to the next turn in the waitinglist.
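Bubblesort is a sensible choice precisely because the edge lists arrive nearly sorted; with an early-exit flag it degenerates to a single O(n) verification pass. A generic sketch:

```python
def bubble_sort_nearly_sorted(items, key=lambda x: x):
    """Bubble sort with an early exit: a single O(n) pass when the input
    is already sorted, which is the common case for such edge lists."""
    n = len(items)
    swapped = True
    while swapped:
        swapped = False
        for i in range(n - 1):
            if key(items[i]) > key(items[i + 1]):
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        n -= 1  # the largest element has bubbled to the end
    return items
```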

Remaining scanlines are iterated over & remaining left/right lines are dropped. Those callbacks essentially amount to a fair bit of bitwise run-length decompression bookkeeping.

Signed Distance Fields

FreeType includes a submodule which renders glyphs to “signed distance fields”, though I don’t know why. Perhaps to aid automatic boldening?

A renderer class is exposed to the FreeType subsystems infrastructure, mainly consisting of a render method. This method, after initialization & validation, defers to another method & cleans up.

It also has methods wrapping FT_Outline operations. There’s another extremely similar class with a different render method.

Then there’s methodtables for getting/setting named properties.

There’s utilities for computing square roots, converting 16.16 fixedpoint numbers to a given integral range, & inverting signs.
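FreeType’s FT_Fixed type is 16.16 fixed point: 16 integer bits, 16 fractional bits. A minimal sketch of such helpers (the exact rounding rules & clamp behaviour here are assumptions, not FreeType’s):

```python
ONE = 1 << 16  # 1.0 in 16.16 fixed point, like FreeType's FT_Fixed

def to_fixed(x: float) -> int:
    return int(round(x * ONE))

def from_fixed(f: int) -> float:
    return f / ONE

def fixed_mul(a: int, b: int) -> int:
    # The product of two 16.16 values is 32.32, so shift back down.
    return (a * b) >> 16

def clamp_to_range(f: int, lo: int, hi: int) -> int:
    # Convert a 16.16 value to a given integral range, rounding to
    # nearest then clamping (a guess at what those utilities do).
    return max(lo, min(hi, (f + ONE // 2) >> 16))
```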

In another file a similar class has a render method which appears to hold the actual core logic! After validation & initialization for the given mode, this one applies edge-detection to the bitmap filling in those distance values, then runs 4 passes to propagate these distances in each direction.

A last iteration finalizes the results into the desired format, & square-roots them. Otherwise I’m sure that square-root could become a bottleneck. Then cleans up.
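The sweep-and-propagate idea with a deferred square root can be sketched like so — a simplified two-sweep, 4-neighbour, unsigned variant (the real code runs 4 directional passes & produces *signed* distances):

```python
import math

INF = (10**6, 10**6)

def distance_field(bitmap):
    """Sketch: per-pixel distance to the nearest 'on' pixel in a 0/1
    bitmap, propagated in sweeps, square-rooted only at the very end."""
    h, w = len(bitmap), len(bitmap[0])
    # Store offset-vector components to the nearest edge pixel; the
    # expensive sqrt is deferred until all propagation is done.
    d = [[(0, 0) if bitmap[y][x] else INF for x in range(w)] for y in range(h)]

    def relax(y, x, ny, nx):
        if 0 <= ny < h and 0 <= nx < w:
            dy, dx = d[ny][nx]
            cand = (dy + abs(ny - y), dx + abs(nx - x))
            if cand[0]**2 + cand[1]**2 < d[y][x][0]**2 + d[y][x][1]**2:
                d[y][x] = cand

    # Forward sweep (top-left to bottom-right), then backward sweep.
    for y in range(h):
        for x in range(w):
            relax(y, x, y - 1, x); relax(y, x, y, x - 1)
    for y in reversed(range(h)):
        for x in reversed(range(w)):
            relax(y, x, y + 1, x); relax(y, x, y, x + 1)

    return [[math.sqrt(dy*dy + dx*dx) for dy, dx in row] for row in d]
```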

Alternate core logic applies fancier but less memory-intensive calculations to glyph outlines. After its initialization & validation it runs the standard decomposition operation, & chooses a codepath based on whether there’s overlaps.

FreeType’s core logic to compute a signed-distance-field for a vector outline with overlaps, after additional validation & initialization, involves counting contours, allocating sidetables to store auxiliary info based on that count, iterating over the contours (taking into account their computed orientation) to carefully geometrically decompose the path & trace the outline into the output bitmap, iterating again over all pixels tested against all contours to fill in their values, & cleaning up.

If there’s no overlaps to consider only a fraction of that work needs to be done. Namely what I phrased as “decomposing the path & tracing the outline into the output bitmap”, without need for its outerloop.

This is split into substeps: splitting the shape & generating the bounding box.

Generating the bounding-box, after initialization & validation, involves iterating over the contours & their edges. For each it computes a bounding box (differing based on edgetype) whose pixels it iterates over, to compute edge distances according to a choice of (often nontrivial; the majority of this codebase) tweening functions & square-root them. A final pixel iteration fills in missing values. Literal corner-cases are handled specially.

Splitting a shape involves iterating over a contour & its edges handling each edge-type differently.

Lines are cloned. Conics are recursively subdivided based on a computed split-point, allocating two new edges in the base-case. As for cubics: of course, computing splitpoints differently.

Results are collected into a new contour.
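The conic subdivision step is standard de Casteljau midpoint splitting — a sketch (not FreeType’s fixed-point implementation):

```python
def split_conic(p0, p1, p2):
    """Subdivide a conic (quadratic Bezier) arc at t = 1/2 using
    de Casteljau's algorithm, yielding two new arcs."""
    mid = lambda a, b: ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    q0 = mid(p0, p1)   # left arc's control point
    q1 = mid(p1, p2)   # right arc's control point
    m = mid(q0, q1)    # the split point, which lies on the curve
    return (p0, q0, m), (m, q1, p2)

def eval_conic(p0, p1, p2, t):
    """Evaluate the quadratic Bezier at parameter t, for checking."""
    u = 1 - t
    return (u*u*p0[0] + 2*u*t*p1[0] + t*t*p2[0],
            u*u*p0[1] + 2*u*t*p1[1] + t*t*p2[1])
```

Recursing until each arc is flat enough turns curves into the line segments the rasterizer actually consumes.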

In this file I also see some legacy functions that are commented out due to memory issues, & a methodtable to interpret paths.

SFNT fonts

SFNT is a font-container format from Apple, originally designed to wrap TrueType fonts.

FreeType’s subsystem for it contains a file with 3 functions that dereferences metrics data from the file, 1 of which might (depending on build flags) send some through a configured Font Variation’s callbacks.

Similarly another file has a couple routines to lookup & free “bdf” properties from the fontfile.

Another file consults lookuptables both in the font & hardcoded to determine glyph names.

This SFNT subsystem as per usual exposes a methodtable to FreeType’s submodule infrastructure. This exposes methods for retrieving the charset encoding & registry (everyone uses UTF8 now).

That accessor wraps one I’ve described yesterday, & directly exposes it as a method.

There’s a method for looking up postscript font names, possibly (per build & runtime flags) consulting the font-variation. It looks up an index then prefers the Windows stringtable to Apple’s, since it appears they couldn’t agree in the early days. This font-variation lookup is implemented in the same file, checking a few different places & running callback methods before checking hashes.

There’s a support routine for looking up strings from the font, amongst others. It implements Murmur hashing itself. There’s a couple lookup tables.
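For reference, here’s a sketch of the widely-published 32-bit MurmurHash3 — FreeType’s exact variant & seed handling may differ:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Public-domain MurmurHash3, 32-bit variant, as commonly published."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed & 0xffffffff
    length = len(data)
    rounded = length - (length % 4)
    for i in range(0, rounded, 4):  # body: 4-byte little-endian blocks
        k = int.from_bytes(data[i:i+4], "little")
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff  # rotl 15
        k = (k * c2) & 0xffffffff
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xffffffff  # rotl 13
        h = (h * 5 + 0xe6546b64) & 0xffffffff
    k, tail = 0, data[rounded:]                   # leftover 1-3 bytes
    if len(tail) >= 3: k ^= tail[2] << 16
    if len(tail) >= 2: k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        k = (k * c2) & 0xffffffff
        h ^= k
    h ^= length                                   # finalization mix
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xffffffff
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xffffffff
    h ^= h >> 16
    return h
```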

There’s a methodtable for looking up glyph names, wrapping what I’ve described yesterday. And there’s a methodtable for looking up fonttables from the parsed datastructures, this one exposes the face-loading method defined elsewhere.

There’s a submodule for loading SBit images via the appropriate binary subparsers/decompressors, & encode result in the general FT_Bitmap type. Plenty of options have been added by now! Includes support routines for loading image metrics, or to load the bitmap tables.

There’s a submodule for loading & parsing various font tables, to be called by its sibling sourcecode files.

There’s routines for decoding & processing colours & gradients thereof. Parsing gradients can be quite involved.

There’s routines to decode contained WOFF2 files, with special composite glyph & bbox handling. Restructuring it into a more traditional format with sorted tables.

There’s a submodule defining loadable face objects with a few subparsers.

Finally there’s a binary-parser for character-maps, validated against a selection of formats via a callback array.

The bulk of this file defines various implementations of FreeType’s CMap class for each supported format it validates. Some of which require support routines, mostly subparsers & (typically binary-search) lookups. Not to mention UTF8 iterators.
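The binary-search lookups resemble TrueType’s format-4 cmap, which maps character codes through sorted segments. A simplified sketch with toy (start, end, delta) segments, not the real subtable layout:

```python
from bisect import bisect_left

def make_cmap(segments):
    """segments: sorted, non-overlapping (start, end, delta) triples,
    loosely modeled on a format-4 cmap subtable's segment arrays."""
    ends = [end for _, end, _ in segments]

    def lookup(code):
        # Binary search for the first segment whose end >= code.
        i = bisect_left(ends, code)
        if i == len(segments):
            return 0  # glyph 0 is .notdef by convention
        start, end, delta = segments[i]
        if code < start:
            return 0
        return (code + delta) & 0xFFFF  # format 4 adds delta modulo 65536

    return lookup
```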

Font Smoothing

FreeType’s font-smoothing subsystem exposes a renderer decorator methodtable whose render method, after validation & translation to a 64×64 grid, chooses a codepath (if not directly deferring to the inner renderer) & cleans up. In normal or light modes with the OUTLINE_OVERLAP flag set it tweaks the rendering parameters to a hardcoded scale. Or it may just scale x xor y by 3 for LCD screens.

The inner renderer calls a bitmask-filling callback according to the hardcoded scale. Initialization & other methods are passed straight through to the inner renderer or core FreeType objects.

There’s a rasterization methodtable the main one defers to. Its render method, after validation & bounding box computation, initializes some variables & iterates over the y-axis & each “band” therein to interpret the path & gather spans.

That interpretation involves, with error handling & possibly-disabled tracing, iterating over contours deferring to a func-interface with normalization & counting. To gather spans it iterates over the y-axis & cells therein, with or without involving a callback.

There’s an internal methodtable it uses with that interpretation that gathers interpolated line segments &, where they’re not horizontal, iterates over the y-axis to prepare those “cells” as a linkedlist.

SVG Renderer

FreeType includes a SVG renderer, which appears to be mainly for testing purposes. If I’ve got this right, it’s so they can use external rasterizers they know work to diagnose when theirs don’t.

This renderer includes a matrix transform (with a “delta” offset separated out), filling in default matrices & deltas.

There’s accessor methods, exposing a single validated svg-hooks property.

The render method allocates memory to call the render_svg callback, possibly after init_svg.

That render method also calls a preset_slot method calling the preset_slot callback.

There’s an initializer & deinitializer.

The actual logic appears to be elsewhere in those callbacks… Maybe I’ll get to that next time? If I find a caller?

FreeType Tooling

FreeType includes a handful of utility commands to aid its development.

This includes a shellscript calling a Perl script to update the copyright year for all files in the git repo which are not listed in a file of exceptions. The Perl script retrieves the current year & loops over the files looking for strings to replace. That said, I believe it’s more proper to list the year created, or years (plural!) edited…

A Python script outputs CORDIC constants, whilst computing precision loss.

There’s a Python-script grepping the C macros to find unused debugging macros, or duplicate definitions of them. That’s 2 grep loops followed by 2 output loops.

There’s a Python-script wrapping the build system (indicates to me that Make isn’t a well-designed tool…).

There’s a testscript for exercising its PostScript AFM parser, & another for exercising bbox computation from a vector “outline”.

There’s a script which greps FT_EXPORT macros to list & sort public APIs.

There’s a testsuite for trigonometry functions.

There’s a script which lowers Postscript “blues” data to C code.

There’s a fuzzer randomly tweaking the fonts in a directory until one fails to parse and/or noop-render.

There’s a Python-script full of glyph data, which converts that to C syntax.

TrueType Fonts

Almost all fonts now use the TrueType fileformat, so FreeType has a subsystem to support them!

This subsystem exposes a methodtable to core FreeType some of which is dynamically looked-up, which mostly adds validation around other APIs I’ll describe & converts types.

There’s a file implementing binary-search or small-int-map lookups, initialization/finalization, & loading of specific opcode-programs embedded in the font & other control tables for TrueType Face objects.

Another file provides a “tweaks” bitflags setter, with behaviour validation routines against optional hardcoded testdata. This is called by the other files implementing TrueType.

There’s a file which provides initializers/finalizers for TrueType GlyphSlots, the module itself, glyph sizes, bytecode, faces & glyph zones. Glyph sizes can also be prepped & reset. Bytecode has a routine wrapping its interpreter. It validates faces contain a .notdef glyph, amongst other things.

There’s a routine wrapping glyph lookups with fallbacks & unit conversion. These loaders & corresponding finalizers are defined in the same file, running the appropriate programs embedded in the font. Glyph metrics are computed separately from the vector outline. The most core glyph lookup opens the appropriate “glyph frame” with a bit of binary parsing, applies an appropriate matrix transform, applies some postprocessing including subglyph compositing, & cleans up.

There’s a few support routines for composite glyphs aside from recursion, & one for simple glyphs. Both of which call into a hinting system with preprepared zones, snapping points to the subpixel grid & applying emboldening.

The same file populates a methodtable on the font for glyph-loading routines, which actually does the binary parsing.

Another file defines initializers/finalizers/accessors for various graphics objects. With utils to incorporate into calculations.

These graphics includes blends, their variations stores, glyph & interpolation deltas, named instances, design coordinates, MM blends (with caching), MM vars (with unit conversion utils), variable tuples, graphics vars, M vars, horizontal and/or vertical advances, item deltas, HV vars & their index mappings, & A vars.

The same file has a utility for modifying the font’s CVT table. Most of the initializers include binary parsing. Support routines decompress deltas & points.

TrueType Interpreter

Finally there’s a careful opcode interpreter which, repeatedly upon validating (both within & outside the loop) its input, loads the opcode & dispatches the appropriate function call before checking for errors inside the loop & cleaning up outside it. The majority is spent implementing opcodes:

Several call down to rasterizer methods. And there’s a fair few support functions, & some disassembly data.
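The validate-fetch-dispatch-check shape of such an interpreter loop can be sketched with a toy stack machine — the opcodes & names here are invented, not TrueType’s:

```python
def run(program, stack=None):
    """Toy stack-machine dispatch loop in the same overall shape:
    validate, fetch the opcode, dispatch through a table, check for
    errors, repeat until the program is exhausted."""
    stack = stack if stack is not None else []

    def op_push(ip):
        stack.append(program[ip + 1]); return ip + 2
    def op_add(ip):
        if len(stack) < 2: raise IndexError("stack underflow")
        b, a = stack.pop(), stack.pop(); stack.append(a + b); return ip + 1
    def op_dup(ip):
        if not stack: raise IndexError("stack underflow")
        stack.append(stack[-1]); return ip + 1

    dispatch = {0: op_push, 1: op_add, 2: op_dup}
    ip = 0
    while ip < len(program):
        handler = dispatch.get(program[ip])
        if handler is None:
            raise ValueError(f"bad opcode {program[ip]} at {ip}")
        ip = handler(ip)  # each handler returns the next instruction pointer
    return stack
```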

Type1 Fonts

FreeType includes support for PostScript Type1 fonts (which search results indicate are no longer going to be supported by Adobe). One file in this subsystem reads metrics from the PFM file presorting them, & provides methods for looking them up via binary-search or linear scans.

Another file defines a validated Parsing object for a “private dictionary” used during loading the font where it carefully & repeatedly reads PFB tags with extensive controlflow.

Another file defines Driver, T1 Face, GlyphSlot, & Size objects; with initializers & finalizers. Sizes also provides a method wrapping FT_Request_Metrics.

A dedicated file implements the methods for binary-parsing T1 fonts, wrapping a PostScript Auxiliary “decoder” providing it a straightforward glyph parser. As well as providing partial-parsers linear-scanning for desired metrics on the specified glyph.

This subsystem exposes a methodtable to FreeType, though as mentioned some of these methods are defined in separate files, whilst others get shallow wrappers. The bulk of what is defined in the same file as these methodtables are its property accessors.

By far the bulk of the logic is a pushdown macro-aided parser (and parser + loader classes) called by its Face constructor, with postprocessing to gather tables. It calls the private-dict parser & provides collection accessors.

Type42 Fonts

Type42 is a PostScript font format, which FreeType supports.

Alongside the methodtable is defined a number of accessors & a couple array lookups.

Another file defines GlyphSlots (wrapping the load_glyph method), sizes (wrapping generic FreeType infrastructure), the “driver” itself, & faces (whose initializer calls an open method itself wrapping dict parsing). These are exposed to the rest of FreeType in that methodtable.

Finally another file defines “loader” & “parser” classes & implements a manually-written (apart from some struct-parsing macros) pushdown automaton binary parser, wrapping PSAux & Type1 subsystems.

Windows Fonts

FreeType supports some old Windows font formats, to finish my FreeType studies. This is relatively straightforward.

Aside from accessors this methodtable exposes methods for (upon validating arguments):

That face initializer attaches a methodtable to iterate over the glyph array.

But mostly this subsystem consists of a couple binary parsers, with their own finalizer. The struct-parser helper macros are heavily used here!


LibICU (International Components for Unicode) provides a large C(++) toolbox for processing international text. Some of which we’re using in Haphaestus to help decide where to split text into lines!

LibICU Core Library

LibICU splits into several libraries but the bulk of the core is in icu4c/source/common, & this section will list all the tools I see in this toolbox!

Lookup Tables

A large portion of LibICU is various lookuptables with more-or-less sophisticated wrapper functions, some of which may support parsers & serializers, including (but not limited to) ones for:

Some lookuptables (including a couple listed above) are stored in a [trie] datastructure of which LibICU provides several implementations to heavily optimize them, including but not limited to ones:

There’s several utilities for constructing & iterating over these tries, possibly because we’re binary-parsing them from datafiles at runtime. The datastructures frequently provide utilities for swapping header fields between parsed files whilst versioning, handling endianness, & reporting errors. Some of these datafiles can be validated & cached by LibICU Common routines, or iterated over. Or there’s hashmaps, or binary searching.
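The trick behind such tries is splitting a codepoint into stages of table indices so identical blocks can be shared. A minimal two-stage sketch with a toy block size (ICU’s actual trie layouts are more elaborate, with multiple stages & packed widths):

```python
BLOCK_SHIFT = 5            # 32 code points per data block (toy sizing)
BLOCK_SIZE = 1 << BLOCK_SHIFT

def build_two_stage(values, default=0):
    """values: dict of codepoint -> property value. Identical blocks are
    shared, which is where the compression comes from."""
    max_cp = max(values, default=0)
    index, data, seen = [], [], {}
    for base in range(0, max_cp + 1, BLOCK_SIZE):
        block = tuple(values.get(base + i, default) for i in range(BLOCK_SIZE))
        if block not in seen:       # deduplicate identical blocks
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, cp, default=0):
    # High bits select the index entry; low bits offset into the block.
    i = cp >> BLOCK_SHIFT
    if i >= len(index):
        return default
    return data[index[i] + (cp & (BLOCK_SIZE - 1))]
```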

These parsed lookuptables include but are not limited to:

But most significantly we can count localizing strings as such dynamically-parsed lookuptables! With its looped & error-handled wrappers, amongst others. LibICU provides localization infrastructure, including:

Basic Utils for LibICU

Like most software written in C (actually it’s in C++, I don’t know why they didn’t use the C++ standard library), LibICU reimplements some of the APIs missing from the C standard library itself. These include:

The C++-to-C language binding utils are used throughout LibICU4C to ensure you don’t need to use C++ to call it. Or in some cases like hashmaps LibICU binds its C implementation to C++, which doesn’t require any infrastructure.

Core Datastructures

There’s a central sliceable (yielding another type) UnicodeString class constructable from (or convertable to) various string types, largely focusing on memory management, comparisons, & array operations. Numerous iterators are implemented around this, some of which perform text conversions & tend to have wrappers gathering results, whilst others perform bittwiddling to iterate over characters in a certain text encoding. Then there’s iterators like the one for “runs” which consults a stack & character properties. And additional infrastructure for type & text-encoding conversions.

Altering letter casing is more complex than an English speaker might assume, with the logic for that spanning several files, some of which integrates with other utils.

There’s text-encoding converters between specific encoding pairs, typically involving bittwiddling. Or methodtables, typically non-trivial & possibly involving lookuptables parsed from BLD datafiles like other lookuptables. These methodtables have wrappers for reading/writing codepoints in those strings, writing a substring, or writing escape sequences for unsupported chars. IDNA & UTS46 conversion includes Bidi checks.

This text-encoding conversion has a public API with format-sniffing, with numerous encodings being supported including Punycode.

There’s also a Unicode character set datastructure (whether a trie or binary-searched sorted array), which in turn is converted to a subtype of a pattern type upon which text can be matched partially or fully. These character sets have utilities for parsing (a couple separate utils there), expanding from character properties, serializing, iterating over forwards or backwards, & combining them. Or we might test against common character sets, a couple of which are specially-optimized! “Simple” patterns also have their own parser.
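The binary-searched sorted-array representation is, as I understand it, an inversion list: a flat array of boundaries alternating between set-start & set-end, so membership is one bisection & a parity check. A sketch:

```python
from bisect import bisect_right

class InversionList:
    """Sketch of an inversion-list character set: a sorted array of
    boundary codepoints alternating set-start / one-past-set-end."""

    def __init__(self, ranges):
        # ranges: sorted, non-overlapping (first, last) inclusive ranges.
        self.bounds = []
        for first, last in ranges:
            self.bounds += [first, last + 1]

    def __contains__(self, cp):
        # An odd insertion index means cp fell inside a [start, end) span.
        return bisect_right(self.bounds, cp) % 2 == 1
```

Set operations like union & complement also stay cheap in this form (complement is just shifting the boundary array by one).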


LibICU includes a complex “normalizer” class with lots of methods consulting tries, which amongst other things process “FCD” boundaries, merge composed chars into other equivalent chars, the reverse, various checks, etc. This is wrapped with utilities to normalize the joinpoint between strings to concatenate, possibly filtered or multi-choice iterators yielding normalized chars, normalize for comparison, & much more!

Bidirectional Text

To output text reordered into a consistent writing direction ubidi_writeReordered performs some validation & handles empty input specially before normalizing the bitflags from a previous lookup-table iteration.

If !BIDI_INSERT_LRM_FOR_NUMERIC it iterates over runs & outputs its chars in the appropriate order updating pointers.

Otherwise for each run it checks & normalizes the insertRemove field, considering whether to insert text-direction marker characters before & after the (possibly reversed) text from the run.

BIDI_OUTPUT_REVERSE indicates whether LTR or RTL text should be reversed in the output. These 2 bitflags are branched over to create 4 codepaths.

Regardless a nil character is appended to close the string.

Outputting reversed text involves parameter validation, before branching over bitflags normally iterating over chars in reverse copying a UTF16 char to the destination.

Upon KEEP_BASE_COMBINING, for each non-combining char it gathers the following combining marks so as not to reverse their order.

REMOVE_BIDI_CONTROLS causes it to strip off trailing BIDI control chars & skip any which remain in the main iteration. Given DO_MIRRORING it looks up the mirror for any chars it encounters.

Plain & KEEP_BASE_COMBINING have their own fastpaths, but there’s also a fully versatile codepath checking the bitflags in the loop.
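The KEEP_BASE_COMBINING idea — reverse the string while keeping each base character’s trailing combining marks attached — can be sketched using Python’s `unicodedata` as a stand-in for ICU’s own character-property lookups:

```python
import unicodedata

def reverse_keep_combining(text):
    """Reverse a string cluster-by-cluster, where a cluster is a base
    character plus its trailing combining marks (which must stay in
    order, after their base, for the text to render correctly)."""
    clusters, i = [], 0
    while i < len(text):
        j = i + 1
        while j < len(text) and unicodedata.combining(text[j]):
            j += 1  # gather combining marks onto their base
        clusters.append(text[i:j])
        i = j
    return "".join(reversed(clusters))
```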

Outputting forward text needs less parameter-validation & has fastpaths for each combination of DO_MIRRORING (a map operation) & REMOVE_BIDI_CONTROLS (a filter operation).

There’s C wrappers around an iterator over breakpoints (I hoped I’d see more of that logic today…).

To apply a bidi transformation, once arguments are validated it scans a lookuptable to find a matching scheme & reformats those arguments. Then it iterates over desired actions from that scheme to call, updating its stored text, before reformatting back into common C types.

The scheme may apply char mirroring, perform Arabic-specific shaping, reverse the text, set REORDER_RUNS_ONLY mode, flag as inverse & set that mode, output reordering, and/or set the base-level.

There’s a routine for building an inverse mapping, alongside a routine for repeatedly iterating over runs to build a mapping from logical indices to visual indices, postprocessed to add insert points & control points. There’s several variations of that latter function. As well as a routine to apply the reordering maps.

A function, with fastpaths, consults a levels map to compute a list of runs & applies reordering.

That reordering, once args are validated & normalized, iterates over each embedding level to find its text direction & reverse indices if necessary. A final reverse is considered in postprocessing.
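This per-level reversal resembles rule L2 of the Unicode Bidirectional Algorithm (UAX #9): from the highest embedding level down to the lowest odd one, reverse every maximal run at or above that level. A sketch mapping levels to a visual ordering of logical indices:

```python
def reorder_visual(levels):
    """Given one embedding level per character (even = LTR, odd = RTL),
    return logical indices in visual order, per UAX #9 rule L2."""
    result = list(range(len(levels)))
    if not levels:
        return result
    highest = max(levels)
    lowest_odd = min((l for l in levels if l % 2), default=None)
    if lowest_odd is None:
        return result  # purely left-to-right: nothing to reverse
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(levels):
            if levels[i] >= level:
                j = i
                while j < len(levels) and levels[j] >= level:
                    j += 1
                result[i:j] = result[i:j][::-1]  # reverse this run
                i = j
            else:
                i += 1
    return result
```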

There’s utilities around that runs list & those levels including constructors.

And finally there’s a file full of more-or-less sophisticated validating accessors (calculating missing data). With a few of their own lookuptables!

Text Breaking

There’s a few different techniques used for breaking text into lines, words, or sentences based on the locale.

You could use regexp-like rules encoded into a binary tree & parsed much like all the other lookuptables/tries! Or there’s a dataheader, compiled from a datafile via a Perlscript, providing common rules. Or there’s parsers for a textual syntax. Also these have rule break sets, range descriptors, & transition tables with their own builders.

There’s some fastpath C++ classes.

Or you could process & match against word dictionaries to determine “possible words”, with subclasses specializing the selection for Chinese-Japanese-Korean scripts, Khmer, Burmese, Lao, & Thai. (Not speaking those languages, I don’t follow their code)

A factory returns the appropriate instance of the appropriate class to break text in the scripts the given char corresponds to, with a fallback class dispatching to the UnicodeSet class. There’s iterators wrapping this (and, as per usual, C APIs) adding rule-file lookups, filtering, caching, or buffering. Each as separate iterator classes, which may optionally provide debug output.

Miscellaneous Text Utils

LibICU is primarily a toolbox for international text, and as such it provides several more such tools! Including several wrappers around case-folding, including specifically for (possibly stack-utilizing) case-insensitive comparisons; skipping whitespace; a diff calculator (via a Dynamic Programming algorithm) & patch applier; a couple C++ classes to optimize concatenation, including for string formatting; a few string collections & hashmaps; wrappers around methods of its core class; fastpath operations on string types (including files) other than its internal one; & text escaping.

LibICU includes “ICU Service” objects consisting of an ID cache, a DN cache, a key, observable list of “factories”, & display names & visible ID map from those caches.

And as you might expect there’s several parsers, serializers, & syntax-validators for:

Data files

As mentioned when studying the Core library, LibICU relies heavily on datafiles. Most of these appear to be written in some sort of JSON-predecessor that personally I like better. Less syntactic noise.

Much of this is dedicated to declaratively describing where soft-lines, paragraphs, or words should break. Including “dictionaries” listing words, a fallback, name, type, embeddings count, hunits count, & an intvector. Or there’s newline-separated word files. Or regexp-like syntax.

Line-separated word-dicts can optionally specify integral priorities.

Jaml uses its own schema to define its break rules, consisting of prioritized “BW”, “TW”, & “UW” keys & values (reminds me of defining “systems” specifically to declaratively serialize Ethiopian or Chinese numbers…). And there’s a schema for composing the different rules declared here, or for defining exceptions (usually for common abbreviations).

Line-breaking linguistic rules are complex, & differ per language.

There’s a large suite of files defining collation rules for each locale, some of which are empty or declare aliases. Others define multiple versioned “sequences” multi-strings. Most of these are trivial. Does this have to do with determining which script some text is in?

LibICU includes a massive suite of files, in the JSON-predecessor format I mentioned yesterday, specifying the symbols/abbreviations & names for the currencies in the different locales. A few of these are empty, or aliases to other such files.

A few files define symbols separately, and/or plural forms.

If you actually want currency conversion, the exchange rates change frequently enough that you’d want an internet service for that.

A couple of these files also include “from” & “to” int-arrays; don’t ask me what they’re for. There can be inheritance between the files.

Skimming more of LibICU’s data files today…

There’s a bunch of files in that JSON-predecessor format providing localizations for various locale keys, language codes (plain, long, menu, short, or variant), scripts (plain, stand-alone, or variants), codepatterns, locale display patterns, character label patterns, variants, or types for calendars, cf, collation (plain, alternate, backwards, case-first/level, normalization, numeric, or strength), &/or numbers, various units, & more.

Several of those files are aliases or empty, & an inheritance hierarchy is supported but rarely used.

Localizing localization! There’s a lot of these files, especially for English & Spanish!

There’s some binary files holding Chinese Unicode characters; I’m not motivated enough to discern their syntax. And there’s 2 DTDs each for 3 XML schemas. I’ll describe these formats when I actually see them used.

LibICU includes another large suite of localization files in that JSON-predecessor format with an inheritance hierarchy, specifying how to format dates (including which calendar to use & how to format it), relative dates, times, numbers (including which system to use) & currency, various exemplar characters, ellipsis, more info, character labels, fields, & much more.

Several of these are empty or aliases.

It’s quite extensive, & far more capable than GNU LibC’s builtin support!

Skimming more of LibICU’s datafiles, there’s a large suite mapping various character sets (mostly IBM’s, but also Mac’s, Windows’, & others’) which consist of newline-separated metadata followed by 3-column character(s)-mappings with integral flags indicating how accurate the mapping is.

There are #-comments (including a copyright notice) & blank lines in these files to ignore.

The IBM files are often the longest.
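The 3-column layout described above resembles ICU’s `.ucm` convention (e.g. `<U0041> \x41 |0`); here’s a sketch parser for that assumed syntax, handling the #-comments & blank lines too:

```python
import re

# One mapping per line, loosely following the assumed .ucm-like layout:
# Unicode codepoint, target byte(s), then an accuracy flag after '|'.
LINE = re.compile(r"<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")

def parse_mappings(text):
    """Parse codepoint -> (bytes, accuracy_flag) mappings, skipping the
    #-comments & blank lines these files contain."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        m = LINE.match(line)
        if m:
            cp = int(m.group(1), 16)
            raw = bytes(int(b, 16) for b in m.group(2).split("\\x")[1:])
            table[cp] = (raw, int(m.group(3)))
    return table
```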

Skimming through yet more LibICU datafiles…

I see a file declaring language aliases. Another translates from language codes to likely “subtags”. Or a mapping with justifications. A mapping to typical timezones. Timezone aliases. One file contains various data about countries, calendars, & locales.

There’s several files on how to convert numbers to text for different locales. There’s a file specifying how to convert between common units.

Most of these are in the JSON-predecessor format.

Another file contains detailed info on the different timezones. Various shorthands are defined for locale fields. There’s a datafile of version numbers. I’m not clear what “icustd.txt” is about; it looks empty. There’s a file naming grammatical forms. Another describes how different languages are gendered. A mapping from languages to how the hours of the day are conceptually split up. There’s integral codes for different currencies.

There’s a template for a Microsoft Developer Studio buildfile in a pre-XML format (can’t you smell the history?).

There’s numerous files, with an inheritance hierarchy & aliasing, specifying how to BNF-parse natural-text numbers.
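If I’m reading them right, those rule files express something vaguely like the following (syntax approximate; ←← recurses on the quotient & →→ on the remainder):

```
%spellout-cardinal:
0: zero; 1: one; 2: two;
20: twenty[-→→];
100: ←← hundred[ →→];
```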

Skimming a bit more of LibICU’s datafiles this morning (I don’t have much time), I see numerous files providing localized country names (whether normal, short, or variant) in the JSON-predecessor format with an inheritance hierarchy. None are empty.

Continuing my LibICU datafiles studies…

There’s a large collection of files phonetically converting between text in different alphabets, using a #-commented, ;-delimited, →-based syntax. In which it is possible to define variables & metadata, or use regex-like patterns. Alongside these is a file in that JSON-predecessor format listing aliases & other tweaks on these conversions, & a 2nd minor one.
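To illustrate (rules invented, not taken from any real file), that syntax reads like:

```
# Variables, then context-sensitive mappings; each rule ends with ;
$vowel = [αεηιου] ;
φ → ph ;
γ } $vowel → g ;
```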

There’s also several files listing the allocation status of IETF RFC identifier-ranges.

Studying more of LibICU’s datafiles (I promise I’m getting through these!)…

I see a shellscript running normalization datafiles through a bazelisk command.

There’s a #-commented ;-separated values file claiming to catalogue severe typos in the Unicode spec which impact normalization. Another one defines case conversion mappings, amongst several other mappings & classification files largely relating to normalization. Including classification of characters as emoji.

There’s a shorthand syntax for those mappings formed from sequences of comparison & boolean operators between literal characters (possibly /-separated). Other such files use space-seperated hexcodes.
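Those composition/decomposition mappings are what normalization implementations consume; Python’s unicodedata module (which bundles the same Unicode data) can demonstrate the effect:

```python
import unicodedata

# U+00E9 (é) and "e" + U+0301 (combining acute) are canonically equivalent:
composed = unicodedata.normalize("NFC", "e\u0301")   # compose to é
decomposed = unicodedata.normalize("NFD", "\u00e9")  # decompose back
print(composed == "\u00e9", decomposed == "e\u0301")  # True True
```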

There’s a ;-seperated-values file defining testcases.

There’s a buildfile & changelog.

Then there’s a whole suite of files localizing various units using that JSON-predecessor format with an inheritance hierarchy & aliasing. A few of these files are empty to avoid redundancy.

Skimming the rest of LibICU’s datafiles…

There’s several mostly-trivial XML files describing parsing rules & aggregated breaking rules, one referencing collation files, & one defining the version number & “root” as the language.

There’s numerous files in that JSON-predecessor format localizing timezone names, with an implicit or explicit inheritance hierarchy. Several of these files are empty.

And I usually don’t mention the buildscripts & documentation…

Auxiliary LibICU Libraries

Skimming over some of LibICU’s auxiliary libraries & commands…

I see a lib which straightforwardly abstracts LibICU to provide more convenient I/O. There’s a trivial test script wrapping the ScriptRun class, which is defined in an adjacent file using loops, bittwiddling, & lookuptables computed (with a little help) by Core. There’s a library providing utilities to apply text transcoding as we output text to files. Alongside testfiles, localization files, & a file transcoding class.

Some less trivial aspects of the I/O abstractions make heavy use of lookuptables. This I/O library includes an interpreter for format strings. There’s a non-trivial text transcoding command. There’s a basic paragraph layout library, building on lookuptables, bittwiddling collections, & text-breaking APIs with sophisticated multi-pass processing. There’s Python language bindings, using the CTypes module & build automation.

Skimming over LibICU’s internationalization sublibrary today… There’s a C++ class for determining whether some text appears to be in UTF-8, & classes for other encodings. Some classes do little more than hold data, including ones for the different calendars. The CurrencyUnit class has branching determining which identifier to use. There’s collation dataheaders. There’s a class integrating Currency infrastructure to Plurality infrastructure.
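A crude stand-in for that UTF-8 check (function name mine; ICU’s detector instead scores candidate encodings statistically rather than just testing validity):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Strict validity check: does the byte string decode as UTF-8 at all?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("héllo".encode("utf-8")))    # True
print(looks_like_utf8("héllo".encode("latin-1")))  # False: lone 0xE9 byte
```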

Collation may be aided by graph processing. For some reason there’s additional character iterators here, supported by lookuptables. There’s a sniffer determining which text encoding is being used. There’s a class for transliterating from one charset to another according to datafiles via an iteration. There’s a reorderings collection. There’s a collation collection, parsed from datafiles. There’s lookuptables & bittwiddling for collation fastpaths. Collation has a lot of infrastructure!
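The core idea behind all that collation infrastructure is multi-level sort keys; a toy sketch (entirely mine; real tailorings differ per locale, e.g. French compares accents from the end of the word):

```python
import unicodedata

def toy_sort_key(word):
    # Primary weight: base letters (accents stripped, case folded);
    # secondary: the accents; tertiary: the raw string to break case ties.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()
    accents = tuple(c for c in decomposed if unicodedata.combining(c))
    return (base, accents, word)

print(sorted(["côte", "cote", "Cote", "coté"], key=toy_sort_key))
```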

Some of that infrastructure builds collation indices, or reads those files. Amongst other parsers. As well as collation comparison infrastructure, a latinate fastpath, & iterators. A whole class is dedicated to parameterizing this collation. There’s an NGram parser utilizing several hardcoded lookuptables. There’s a class merging case-mapping/collation & transliteration. There’s various converters to/from doubles, as well as between calendars.

Some of those converters are aided by lookuptables. And there’s converters to/from decimals. There’s a whole thing about classifying “eras” as used in certain variations of these calendars. There’s a text-escaping transliterator. There’s various string collections. There’s a class for looking up a localized string amongst a choice of them as per the language’s plurality rules.
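That plurality lookup can be sketched like so (toy version; CLDR-style rules also define “zero”, “two”, “few”, & “many” categories for other languages, & all names here are mine):

```python
# English only distinguishes the "one" & "other" plural categories.
MESSAGES = {"one": "{n} file", "other": "{n} files"}

def plural_category(n):
    return "one" if n == 1 else "other"

def localize_count(n):
    """Pick the message variant matching n's plural category."""
    return MESSAGES[plural_category(n)].format(n=n)

print(localize_count(1), "/", localize_count(3))  # 1 file / 3 files
```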

Skimming over more of LibICU’s internationalization library… There’s several C++ classes wrapping array operations, one of which adds string formatting. There’s an iterator over field positions. There’s plenty of text formatting infrastructure & datastructures. There’s a text sanitization class. There’s yet more calendar datastructures; this almost seems more like a time library than a text library… There’s a class for internationalizing gender via hashmaps.

There’s a datastructure for localizing measurements with units. There’s a large suite of classes for internationalizing number parsing & serialization according to the datafiles. There’s simple tests on numbers. There’s a couple different ways in which this i18n library handles normalization, including declarative parsing rules. There’s a transliterator which parses unicode names embedded in the text. There’s localized list parsing & serialization. There’s unitted number serialization.


Skimming the rest of LibICU’s internationalization library…

There’s bindings to Windows’ date, timezone, & string formatting APIs. There’s a C API wrapping a “TimeZoneTransition” class, & another wrapping TimeZoneRule objects or VTimeZone objects. There’s a lookuptable with wrapper functions for converting between different timescales. There’s a class for querying timezone metadata & parsing dates according to it. There’s configurable “spoofing” infrastructure to aid testing.

There’s an API for parsing a Unicode “codepage” file as a regexp, & a regexp parser/interpreter class. There’s collation iterators with normalization. A class for opening files describing numbering systems (I’ve implemented this for CSS!). There’s a C API wrapping a “Region” class, & another wrapping PluralRules. Or transliterators. Additional significant infrastructure, including datastructures & binary-search lookups, for unit conversions. There’s timezone parsing & serialization.
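The core of a decimal numbering system, as those datafiles describe, is just a digit set; a toy sketch (function name mine; here Arabic-Indic digits, U+0660–0669):

```python
def format_digits(n, digits="٠١٢٣٤٥٦٧٨٩"):
    """Rewrite a number's decimal digits into the target system's digit set."""
    return "".join(digits[int(d)] for d in str(n))

print(format_digits(2024))  # ٢٠٢٤
```

Non-decimal (“algorithmic”) systems like Roman numerals instead defer to the spell-out rule machinery.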

There’s a transliterator parsing unicode names, another parsing escape sequences. There’s C APIs wrapping a FieldPositionIterator class, or a list of format strings. Or DateTimePatternGenerator objects. There’s various formatter methodtables. There’s parsers for the datafiles. There’s APIs for looking up translated strings, like Gettext does. More number parsing/serialization. String searching, sophisticated.

Skimming over the rest of LibICU’s internationalization library (hopefully today’s not a lie…) I see more code for collation, timezones, transliterators (including case conversion), C bindings, & some catalogues thereof. Time units have their infrastructure. And I’m seeing support for another calendar. There’s a parser for timezones, & another for locale files. There’s a text substitution class, & a text search class.

This internationalization library implements string slices itself. There’s more text-search classes, including a superclass. It has an enum-parser for plural cases. There’s comparable sorting keys. There’s some singletons. There’s more date formatting, amongst other things. There’s collation/transliteration/number rules. There’s a list of scripts to classify text into. There’s a “region” datastructure. There’s some iterators. And a few other things!

Skimming over more of Apache Solr, I see several factories for classes which highlight where results match the query, which in turn involves trivial text searching & reformatting with an occasional superclass. Or a couple slightly-less-trivial ones using a regexp, & another couple providing the core routines, gathering fields, or aggregating the others together.

Similarly there’s factories & classes for indexing the documents to be queried. There’s a CSV parser. Solr has a suite of abstractions around Java servlets. There’s some legacy field types & other datastructures. Solr has its own logging infrastructure involving a circular list, which may defer to Log4j.

And I’m pleased to report there’s a sizable testsuite!

Skimming some of LibICU’s support scripts, including C scripts to generate the dataheaders (with support dataheaders), of which there’s lots, there are commands for: running a performance test on user-specified text files (with support classes & headers), validating data files, retrieving metadata from LibICU (possibly outputting as XML), generating packaging files, generating more datafiles (some output binary), plenty of timezone handling, modifying datafiles, merging datafiles, & trivial memory checks.

Also there’s a test plugin. And plenty of serializer/deserializer support files! Especially in relation to packages, in a submodule with terminal abstractions.

Finally there’s a LibICU4J which reimplements the C++ library in Java, with the higher-level language making the code a bit more concise. But otherwise it’s basically the same logic with a couple of extra datafiles. Then as per usual there’s build infrastructure & testsuites.