Text Layout & Rendering

Any text GTK renders is sent through Pango & its transitive dependencies to handle richtext, linewrapping, internationalization, & more. Pango directly addresses the first 2, and I will be reimplementing Pango in Haskell. Or ideally paying someone else to do so if I can manage a decent income stream from my hobby…

Pango

If you’ve ever written some XML markup to style your text in GTK, that is Pango! By the way, unlike HTML there’s no separate inline-styling engine here; that’s the difference.


The public API mostly consists of a Pango Context, a GObject class.

It tracks & wraps a base direction, base & resolved “gravity”, a gravity hint, serial number to invalidate iterators, language (2 properties), fontmap, whether to round glyph positions, a font description, & a matrix transform. The main entrypoint is its get_metrics method!

get_metrics, after normalizing its parameters, considers returning its cached value. Otherwise it loads the fonts from the map, iterates over them retrieving each’s metrics & keeping the last, & retrieves a language-specific sample string to render (usually a localized “I can eat glass and it doesn’t hurt me”, or something or other about a fox & a lazy dog) to pass to the true main entrypoints before cleaning up.
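
To make that concrete, here’s a minimal sketch of calling that entrypoint from the outside (assuming pangocairo supplies the fontmap; the “Sans 12” description is an arbitrary example):

    #include <pango/pangocairo.h>

    int main(void) {
        /* A context wraps a fontmap plus the state described above. */
        PangoFontMap *fontmap = pango_cairo_font_map_get_default();
        PangoContext *context = pango_font_map_create_context(fontmap);

        /* The font description & language are the parameters get_metrics normalizes. */
        PangoFontDescription *desc = pango_font_description_from_string("Sans 12");
        PangoFontMetrics *metrics =
            pango_context_get_metrics(context, desc, pango_language_get_default());

        g_print("ascent: %d descent: %d (in Pango units)\n",
                pango_font_metrics_get_ascent(metrics),
                pango_font_metrics_get_descent(metrics));

        pango_font_metrics_unref(metrics);
        pango_font_description_free(desc);
        g_object_unref(context);
        return 0;
    }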

I’ll discuss pango_itemize_with_font later! The result of which is iterated over…

For each item it iterates over, it attempts to look up the font & its metrics, to conditionally copy some attributes over to the overall metrics & inform some postprocessing via Harfbuzz (with results tweaked, behind extra abstraction). A running sum of widths is computed.


Attribute lists are a sorted array of subclassable slices, implicitly all of the same string, representing different styling options.

There are textual formats from/to which attribute lists & their corresponding strings can be parsed or serialized. One’s seemingly for debugging; the other reuses the GMarkup lightweight XML parser bundled with GNOME’s GLib to build something convenient to use, where various tagnames are parsed specially as shorthands.

GMarkup requires manually tracking relevant aspects of the tagstack. Attributes are multi-buffered before being emitted, including in that tagstack. Accelerators can be auto-underlined.
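
Driving that markup parser directly looks roughly like this (a sketch; the markup string is just an example):

    #include <pango/pango.h>

    void demo_markup(void) {
        const char *markup = "Plain, <b>bold</b> &amp; <span foreground=\"red\">red</span> text";
        PangoAttrList *attrs = NULL;
        char *text = NULL;
        GError *error = NULL;

        /* GMarkup-based parser: splits the markup into plain text plus an attribute list. */
        if (!pango_parse_markup(markup, -1, 0, &attrs, &text, NULL, &error)) {
            g_printerr("parse failed: %s\n", error->message);
            g_clear_error(&error);
            return;
        }

        g_print("plain text: %s\n", text);
        pango_attr_list_unref(attrs);
        g_free(text);
    }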

A Pango FontDescription holds a familyname, style, variant, weight, stretch, gravity, textual variations, mask, size, & some bitflags for whether the family or variations are static or the size is absolute. Font descriptions can be merged, heuristic differences can be computed to determine the best match, & there are more typical comparison methods & various accessors. It includes a fairly sophisticated parser, used in attribute parsing.
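
For instance, a quick sketch of that parser & the merge operation from the caller’s side (the description strings are arbitrary):

    #include <pango/pango.h>

    void demo_font_description(void) {
        /* The sophisticated parser mentioned above. */
        PangoFontDescription *desc = pango_font_description_from_string("Monospace Bold Italic 11");
        PangoFontDescription *fallback = pango_font_description_from_string("Serif 14");

        /* Merge fills in fields of `desc` that are unset; FALSE means don't overwrite set ones. */
        pango_font_description_merge(desc, fallback, FALSE);

        char *str = pango_font_description_to_string(desc);
        g_print("merged: %s (weight %d)\n", str, (int)pango_font_description_get_weight(desc));

        g_free(str);
        pango_font_description_free(fallback);
        pango_font_description_free(desc);
    }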

Fonts I’ll discuss later…

“Gravity” refers to an optional right-angle rotation; it can be converted to radians or matrix transforms & can be looked up for the Unicode “script” (roughly, alphabet) being used. A gravity hint defines which relevant gravity to prefer, mostly relevant when mixing scripts.

Pango defines its own (partial) matrix multiplication implementation with conversion from geometric transforms.

There’s logic for negotiating & inferring ISO language codes, parsing preference lists, & finding sample text.

Pango Itemization

To init its itemization Pango captures the given context, text, & attr iterator whilst computing/saving end, run_start, changed, embedding levels (via FriBidi, a later topic!), embedding end, gravity-related data (gravity, centered baseline, hint, resolved value, & font descriptions), script iterator with its range, width iterator, & emoji iterator properties. It nulls out result, item, embedding end offset, emoji font description, derived lang, current fonts, cache, basefont, first space, & font pos.

Upon both initialization & iteration range invariants are enforced. Finalization frees several of these properties. Iterating to the next item involves advancing the appropriate iterator.

Processing each resulting run involves checking which aspect has changed, possibly computing a new gravity, derived language, or current font. It then processes each non-whitespace character, allocating a new linkedlist “item” for output & storing results there, handling the final item specially.

For postprocessing it reverses that linkedlist & computes a running sum.


Pango’s “itemization” process is split into several iterators which are unioned together: embedding levels (precomputed), richtext attributes, Unicode scripts, emojis, & widths.

Some languages are left-to-right, others are right-to-left. Some are vertical (though those can usually be written horizontally too), & some are even diagonal (though no computer system I know of supports those)! Computing embedding levels determines which to use.

Precomputing embedding levels involves (after converting from Pango types to FriBidi types) computing the number of UTF-8 characters, allocating 3 per-char sidetables of which it returns one (the others are temporary), iterating over each char once to look up & record its BiDi type with special handling for brackets whilst bitwise-OR’ing & maybe (if flagged “strong”) AND’ing these bitflags together, fastpathing unmixed text directions & otherwise deferring to FriBidi, & converting back to Pango types.
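
A minimal sketch of the FriBidi calls this precomputation ultimately leans on, operating directly on UTF-32 rather than Pango’s UTF-8 (newer FriBidi also offers an _ex variant taking bracket types):

    #include <fribidi.h>
    #include <stdlib.h>

    void demo_embedding_levels(const FriBidiChar *str, FriBidiStrIndex len) {
        FriBidiCharType *types = malloc(sizeof(FriBidiCharType) * len);
        FriBidiLevel *levels = malloc(sizeof(FriBidiLevel) * len);
        FriBidiParType base_dir = FRIBIDI_PAR_ON;  /* neutral: let FriBidi detect the base direction */

        /* Per-char BiDi classification, then the Unicode BiDi algorithm proper. */
        fribidi_get_bidi_types(str, len, types);
        fribidi_get_par_embedding_levels(types, len, &base_dir, levels);

        /* Even levels mark left-to-right runs, odd levels right-to-left. */

        free(levels);
        free(types);
    }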


Pango attributes are stored in a sidearray separate from the text itself to make them trivial to iterate over! Though a stack is required to yield the end of all the attributes it has previously yielded the start of. Furthermore this stack is consulted to extract the styling for this run of text.

As stated previously Pango’s “attributes” are what’s parsed out of the XML (via GMarkup) you hand to it or GTK. They represent Pango’s richtext support!


To split the text into runs requiring different “scripts” (approximately a.k.a. “alphabets”) Pango iterates over each UTF-8 character. For each char Pango looks up the script The Unicode Consortium catalogued for it, for the “common” script looks up the corresponding charcode it pairs with, maintains a size-capped stack to balance those paired chars, & either fixes up any previously unknown scripts (including in that stack) or yields a script boundary.
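
Pango exposes this iterator publicly too; a sketch of consuming it:

    #include <pango/pango.h>

    void demo_script_runs(const char *text) {
        PangoScriptIter *iter = pango_script_iter_new(text, -1);
        do {
            const char *start, *end;
            PangoScript script;
            /* Each range is a maximal run of chars sharing one script. */
            pango_script_iter_get_range(iter, &start, &end, &script);
            g_print("run of %d bytes in script %d\n", (int)(end - start), (int)script);
        } while (pango_script_iter_next(iter));
        pango_script_iter_free(iter);
    }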


To determine whether a char is an emoji Pango uses a lexer contributed by Chromium, written in Ragel. The iterator checks whether the current char is an emoji or not & scans all subsequent chars with the same classification.

Similarly the width iterator classifies (with some special cases) chars by horizontal or vertical writing directions according to a builtin lookuptable.

Pango Layout

The PangoLayout GObject class tracks its PangoContext, richtext attributes, font description, tab indents (sized array of indents with boolean for units), the plain text, serial number for itself & its context, number of bytes, number of chars, layout width & height, initial indent, line spacing, justification/alignment, whether to ignore newlines, whether to autodetect textdirection, whether to wrap & whether it has, whether to ellipsize & whether it has, count of unknown glyphs, cached layout rects with flags, cached tabwidth, a decimal mode, resulting logical attributes, list of resulting lines, & a linecount. There is bitpacking involved, there’s a couple fields denoting which fields (the bulk of them) should be memcpy’d when duplicating.

Has standard GObject methods, & plenty of accessors.


A couple of these accessor methods wrap the XML parser for richtext markup. The serial number is used to detect changes invalidating layout computation, freeing the computed lines & resetting various properties whenever the input fields mutate via the accessors.

Upon accessing output properties (upon which some interesting logic is implemented) the PangoLayout lazily recomputes them, ensuring all needed inputs are set. If flagged to infer text direction it consults FriBidi per-char or its context.
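
Putting the input side together, typical PangoLayout usage sketches out like this (assuming a PangoContext obtained elsewhere, e.g. via pangocairo):

    #include <pango/pango.h>

    void demo_layout(PangoContext *context) {
        PangoLayout *layout = pango_layout_new(context);

        pango_layout_set_text(layout, "Some text which will be wrapped into lines.", -1);
        pango_layout_set_width(layout, 200 * PANGO_SCALE);  /* wrap at 200 device units */
        pango_layout_set_wrap(layout, PANGO_WRAP_WORD);

        /* Accessing output lazily triggers the layout algorithm described below. */
        int width, height;
        pango_layout_get_pixel_size(layout, &width, &height);
        g_print("%d lines, %dx%d px\n", pango_layout_get_line_count(layout), width, height);

        g_object_unref(layout);
    }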


After clearing its output fields & retrieving initial extents/height from the initial font, Pango’s highlevel layout algorithm involves repeatedly: optionally looking for paragraph boundaries, optionally determining the base direction filling in gaps from the previous value, determining whether this is the final iteration with the last segment, running the itemization algorithm I described above, copying attributes over to results whilst locating the correct slices thereof, optionally updating some per-item flags & attributes utilizing relatively-complex Unicode processing I don’t understand (some of which is language-specific), applying some itemization postprocessing I’ll describe later, either repeatedly splitting the items into lines (again, I’ll describe this later) or constructing a single line, checking whether we’ve surpassed the height limit, & in preparation for the next iteration (if any) computing the next start index. Then it cleans up & aligns text!


PangoLayout has a method for iterating over the “runs” in a given line to locate a UTF-8 offset in the appropriate run, corrected to avoid landing in the middle of clusters, before iterating over the Harfbuzz-computed glyphs taking into account the FriBidi-computed text direction.

There’s a couple of methods for computing the appropriate line from the linkedlist for an index, & one computing extents. And there are methods combining these.

Another method computes new indexes moving up or down a line.


There’s a method which iterates over the lines locating the one which contains a given y-coordinate, then defers to the method for computing the x-coordinate. And a method which similarly does the reverse.

There are further wrappers around these methods which return properties of their results. A relatively complex one computes “strong & weak” rectangles to depict bidirectional text-cursors. And a further wrapper which incorporates Harfbuzz positions.
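
A sketch of those hit-testing & cursor wrappers in use:

    #include <pango/pango.h>

    void demo_hit_testing(PangoLayout *layout, int x, int y) {
        int index, trailing;

        /* Map a point (in Pango units) back to a byte index in the text. */
        gboolean inside = pango_layout_xy_to_index(layout, x, y, &index, &trailing);

        /* "Strong" & "weak" cursor rectangles for bidirectional text. */
        PangoRectangle strong, weak;
        pango_layout_get_cursor_pos(layout, index, &strong, &weak);

        g_print("%s the text; strong cursor at %d,%d\n",
                inside ? "inside" : "outside", strong.x, strong.y);
    }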


There are methods for refining the chosen alignment & computing the x-offset to apply that alignment. This gets incorporated into the extents upon requesting them, & this info is retrieved at the end of the core layout algorithm to ensure it’s applied.

There’s yet more methods locating & processing appropriate runs. There’s an iterator over the computed lines or runs computing line-extents as it goes, with plenty of its own getters.

Pango Itemization Postprocessing

Text layout/rendering is full of little nuances which must be handled for internationalization’s sake. Here I’ll describe such nuances (selecting “variants” & tweaking fontsize) which can’t fully be captured by intersecting several iterators to split the text up into “items”.

For variants it iterates over each item checking if a valid font-variant has been selected for it. If not it’ll split upper & lowercase letters to approximate them via fontsize & text transformation.

Another pass allows font designers to handle the difference between visual & geometric sizes, by iterating over every item. For each it analyzes the actual size of the selected font to compute how much it needs to be scaled to achieve the desired size. Results are saved into the item’s attributes. This size adjustment is tweaked further for superscript, subscript, & smallcaps text.

Pango Line-Splitting

Pango’s main job is arguably to split text up into lines (or is it to split text up more generally, around richtext, writing systems, etc?), so how does that work?

Until there are no more items (if there were none to begin with it adds an empty line) Pango allocates a lightweight line GObject, initializes several properties on it (possibly with minor computation), then iterates over the previously-split “items”.


For each previously-split “item” Pango gathers styling properties & runs Harfbuzz. If the trailing char is a newline it inserts the item into the line’s linkedlist, incrementing its length by what Harfbuzz computed whilst adjusting positioning to take into account tabs, before returning to the loop indicating this case. If the item fits entirely on the line it does likewise.

Otherwise it may have to split the item. For which it first sums the width & special-cases tabs as all fitting.

Then Pango checks if it can split at the end of the item & whether to add a hyphen. This tweaks the following fastpath check which checks if that width fits.


Having exhausted those options it computes per-char widths & iterates over the chars looking for valid breakpoints (taking into account hyphens) until we pass the max linelength. For each it might trial-split the text to consider the hyphens or tabs before updating the max-valid run.

If this fails it might try again with looser constraints.

If Pango has successfully split the item to fit (or has a force flag set) it either applies the split & returns that the first half all fits, or it indicates that nothing fits. Or splits as best it can indicating some of the item fits.

Otherwise it indicates nothing fits. Harfbuzz data is freed since linesplitting invalidates it.


Upon “all fit” the outer loop checks for tabs & removes the item from its todolist. Upon “empty, some fit” it sets a flag & exits. Upon “nothing fits” it backs up over previously-selected runs to see if they have valid splitpoints & reruns those checks for that tweak. For newlines it removes the item from its todolist & exits. Upon exiting it updates some counts & appends the line to the computed output.

Having computed a line Pango inserts missing hyphens, truncates whitespace, reverses runs, normalizes the text baseline from Harfbuzz data, considers ellipsizing the line, converts logical to visual order, redistributes whitespace, optionally “justifies” words in a separate pass, & updates some flags.
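
To make the overall shape concrete, here’s a toy greedy word-wrapper in the same spirit; nothing like Pango’s actual code, which as described above also juggles hyphens, tabs, clusters, & backtracking:

    #include <stdio.h>

    /* Toy greedy line-breaker: each "item" is just a word width here, whereas
     * Pango's items carry glyphs, attributes & break opportunities. */
    static void greedy_break(const int *widths, int n, int space, int max_width) {
        int line_width = 0, line = 0;
        for (int i = 0; i < n; i++) {
            int needed = (line_width == 0) ? widths[i] : line_width + space + widths[i];
            if (line_width != 0 && needed > max_width) {
                /* Doesn't fit: emit the line & start a new one with this item. */
                printf("line %d: width %d\n", line++, line_width);
                line_width = widths[i];
            } else {
                line_width = needed;
            }
        }
        if (line_width > 0)
            printf("line %d: width %d\n", line, line_width);
    }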

FriBidi

With the variety of written languages around the globe, it’s not uncommon to want to combine them. Which can get tricky if those languages don’t agree on which direction they’re written in! To resolve text from “logical order” to “visual order” & back in C, adhering to subtle rules, we can use a simple well-tested library called “FriBidi”! As long as that text is entirely horizontal.
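
The simplest entrypoint for all of this is fribidi_log2vis; a minimal sketch (UTF-32 input assumed, with the optional position-mapping outputs skipped):

    #include <fribidi.h>
    #include <stdlib.h>

    void demo_log2vis(const FriBidiChar *logical, FriBidiStrIndex len) {
        FriBidiChar *visual = malloc(sizeof(FriBidiChar) * len);
        FriBidiLevel *levels = malloc(sizeof(FriBidiLevel) * len);
        FriBidiParType base_dir = FRIBIDI_PAR_ON;  /* let FriBidi detect the paragraph direction */

        /* Reorders a single line from logical to visual order; the optional
           logical<->visual position mappings are passed as NULL here. */
        if (fribidi_log2vis(logical, len, &base_dir, visual, NULL, NULL, levels)) {
            /* `visual` now holds the reordered (& mirrored) codepoints. */
        }

        free(levels);
        free(visual);
    }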

After validation & stack/heap allocation Fribidi starts by classifying each char via a lookuptable.


Chars are classified as left-to-right, right-to-left, Arabic, European numerals, Arabic numerals, European number separator, European number terminator, common separator, non-spacing mark, boundary neutral, block separator, segment separator, whitespace, other, left-to-right embedding, direction embedding, direction overrides, pop/push directional flag, direction isolate, & first-strong isolate.

Another iteration refines the “other” class to extract bracket type & open/close, via metaprogramming.


Given that classification FriBidi applies a run-length encoding & commences its core logic!

The first main iteration locates the first non-isolated letter & its text direction.

Then after initializing some collections, counters, & other state it iterates again to interpret char classifications upon a stack. Tweaking the “run” possibly moving between linkedlists. A postprocess iteration handles isolations populating an indexmapping.


Between compaction passes an iteration resolves “implicit levels” handling numbers differently from other text, & extracting the max level.

Then it merges back in the extracted runs taking care to apply new levels.

And finishes with an iteration to finalize the computed run properties, & an iteration for run-length decompression.

I’m not fully clear what exactly this is all accomplishing…

Then it allocs index-mappings & maybe rearranged text to return to caller.


As a cursive alphabet Arabic requires special handling selecting “joining chars” when rearranging into visual order; this involves a lookuptable preprocessing step & an iteration over the Arabic chars to detect the literal edgecases between Arabic & other writing systems. Then if rearranged text is requested it handles “shaping” Arabic (involving “shaping” and/or “ligaturing” substeps, both via lookuptables) & mirrored text (also involving a lookuptable) each specially.


P.S. Skimming over the rest of FriBidi I see logic for converting a handful of charsets to UTF-32 & back, at least one of which required FriBidi’s logic. Since FriBidi deals in UTF-32 by default, this included a UTF-8 decoder.

GNU LibC vendors a much more thorough project for charset conversion, but I can’t be surprised FriBidi’s interested in the subject!

FontConfig

What is geometrically correct does not always appear so. So font designers may wish to tweak their fonts to look better at different sizes & styles, which they do by providing multiple fontfiles for the same “font family”. GNOME’s text stack uses FontConfig to choose these fontfiles.

FontConfig provides a library & a large suite of commands trivially wrapping it. I’ll be focusing on studying the fc-pattern command as a starting point.


The first step of querying the FontConfig fonts database is to construct a “pattern” to match against. These patterns are represented in-memory as a dynamic array of properties, in turn holding linkedlists of tagged unions & how strongly each “binds”. Values may be bytes, ints, bools, doubles, a 2x2 transformation matrix, an array of desired character coverage, arbitrary pointers, a bitmask of supported scripts/languages, & floatingpoint ranges.

Offset pointers are common for ease of serialization.
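
A sketch of building such a pattern through the public API (the family & size values are arbitrary):

    #include <fontconfig/fontconfig.h>
    #include <stdio.h>
    #include <stdlib.h>

    FcPattern *demo_build_pattern(void) {
        FcPattern *pat = FcPatternCreate();

        /* Each property holds a list of typed values with a binding strength. */
        FcPatternAddString(pat, FC_FAMILY, (const FcChar8 *)"DejaVu Sans");
        FcPatternAddDouble(pat, FC_SIZE, 12.0);
        FcPatternAddBool(pat, FC_SCALABLE, FcTrue);

        /* Round-trip back into fc-pattern's textual syntax. */
        FcChar8 *name = FcNameUnparse(pat);
        printf("%s\n", (char *)name);
        free(name);

        return pat;
    }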


Here the pattern is parsed from commandline args, starting with a “name” that is split by commas, dashes, & colons. If it’s unsplit it’s added as a string to the pattern to match the FAMILY, retrying upon commas. For dashes it’ll parse out floats to match SIZE. For colons it repeatedly looks up the property to match & converts the strings to the appropriate type.

Remaining commandline args are gathered into a multistring.

FontConfig Preprocessing

To cope with apps & sites that don’t consider which fonts are available on freedesktops, to “synthesize” fonts for styles that aren’t provided, & more, FontConfig can preprocess queries according to some XML config. Reasonable defaults are shipped with the project.

Which first requires loading the XML config under a locked refcount using a LibXML SAX parser in reference to an element stack, the bulk of which happens on element-end. In the meantime it gathers various filepath collections.


For “match patterns” it iterates over the configured locales to consider incorporating them into the pattern once normalized. It then fills in a PRGNAME field, allocates some collections & font-family hashtables, & iterates over the configuration. For each config it branches over the config type & evaluates its condition to determine whether to mutate or replace the query pattern.

Then it cleans up.

A separate pass may fill in default values for properties.
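
From the caller’s side this preprocessing stage is the familiar pair of calls (a sketch):

    #include <fontconfig/fontconfig.h>

    void demo_substitute(FcPattern *pat) {
        /* Apply the XML config's "match pattern" rules to the query... */
        FcConfigSubstitute(NULL, pat, FcMatchPattern);  /* NULL = current config */

        /* ...then the separate pass filling in default values. */
        FcDefaultSubstitute(pat);
    }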

FontConfig Querying

To get the font database to query it first performs various checks to determine whether it needs to reload the database into memory, including file timestamps. Buried within the code to read configuration is code to very carefully mmap an index of all installed fonts. Interestingly I can add more within the bounds of the process, WebFonts are easy!


Or maybe it’ll have your process carefully read over the font directories itself in sorted order, consulting FreeType (which I’ll discuss soon-ish!) via an extensive internal API binding building on a few utils. Config rules are applied to the extracted pattern for the font. After which it serializes this data back out so other processes don’t have to go through all this effort. This serializer includes an allocator within the mmap’d file utilizing a hashmap.


It may then (depending on how you call this code) index all the fonts into a hashmap by lang filtered by a pattern & read them back out into a linkedlist. This filtering requires all given properties to be present with compatible values. At which point (once normalized into common type) we get into FontConfig’s dynamic type system!
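
And the query itself, sketched via the simplest entrypoint:

    #include <fontconfig/fontconfig.h>
    #include <stdio.h>

    void demo_query(FcPattern *pat) {
        FcResult result;
        /* Match against the (possibly mmap'd) font database loaded above. */
        FcPattern *match = FcFontMatch(NULL, pat, &result);

        if (match) {
            FcChar8 *file = NULL;
            if (FcPatternGetString(match, FC_FILE, 0, &file) == FcResultMatch)
                printf("best file: %s\n", (char *)file);
            FcPatternDestroy(match);
        }
    }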

Harfbuzz

Harfbuzz is a vital component of a text rendering stack. It converts a sequence of Unicode characters into a sequence of positioned “glyphs”.

This makes English text look a little nicer, but is vital for other written languages. Especially non-latinate ones (i.e. outside Europe and the nations they colonised who didn’t previously have writing).

Harfbuzz centers around a single complex function (with multiple OS-specific backends), but to start: the datastructures you pass to that function!


Harfbuzz includes a methodtable wrapping LibUCD, GLib, LibICU, or noops for unicode processing. Implements trivial utilities itself.

Behind a few layers of light abstraction Harfbuzz provides a bitset segmented into a hashmap. The innermost implementation is a “bitpage”, wrapped in (with a 2nd-layer allocation-managed mapping) a “bitset”, a possibly-negating “bitset invertible”, & finally a “set” with C bindings (from C++). Then there’s a “set digest”.
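
From the outside that stack of abstractions is just hb_set_t; a quick sketch:

    #include <hb.h>

    void demo_set(void) {
        hb_set_t *codepoints = hb_set_create();

        /* Backed by the bitpage/bitset machinery described above. */
        hb_set_add(codepoints, 0x41);             /* 'A' */
        hb_set_add_range(codepoints, 0x61, 0x7A); /* 'a'..'z' */

        if (hb_set_has(codepoints, 0x41))
            hb_set_del(codepoints, 0x41);

        hb_set_destroy(codepoints);
    }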


Harfbuzz includes a closed-addressed hashmap implementation with consecutive rehashing, whose entries can be bitflagged as used or tombstoned. The initial hash is kept within a prime range though rehashing is capped to powers of two for efficiency; occupancy & population are tracked to decide when to resize.

Harfbuzz provides a resizable, quicksortable, slicable, binary-searchable C++ array type.

There’s a memory-managed byteslice type called a “blob” used to wrap raw fontfile data.


One of the two main inputs provided to Harfbuzz’s main function is a “buffer” of the text you wish to shape. And its main output is a buffer of the corresponding glyphs. Buffers can function as a resizable array, with some of those methods performing text transcoding. The output can be normalized with sorting.

These have several methods for moving data around within them, & a method for enforcing bitmasks. Mostly for the sake of testing, presumably, there’s a diffing method.
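
Filling one of those buffers looks roughly like this (a sketch; the shaping call itself comes later):

    #include <hb.h>

    hb_buffer_t *demo_fill_buffer(const char *utf8_text) {
        hb_buffer_t *buf = hb_buffer_create();

        /* One of the transcoding "resizable array" methods mentioned above. */
        hb_buffer_add_utf8(buf, utf8_text, -1, 0, -1);

        /* Direction, script & language; guessed from the content here. */
        hb_buffer_guess_segment_properties(buf);

        return buf;
    }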


Buffers have (beyond memory management) properties for:

Bitflags include:

The cluster level may be “monotone graphemes”, “monotone chars”, or “chars”. Can convert to/from JSON or markup formats. Can verify results.


Harfbuzz can parse a colon-separated $HB_OPTIONS envvar.

It defines a 4char “tag” type (within a single CPU word) used, amongst several other things, to denote the text’s script/alphabet.

There’s a direction enum which can be interpreted as bitflags for trivial processing. You can determine this direction for the script.

There’s “language” canonicalized-string, with a linkedlist.

There’s parsing utils, used to read/write font features & variations.

As-usual it exposes a version number.


Harfbuzz defines word-sized bools, 32bit codepoints & positions & masks, numeric unions with or without including floats.

Font features consist of a tag, 32bit value, start, & end. Font variations consists of a tag & floatingpoint value.
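
These can be built by hand or parsed from the usual textual syntax; a sketch:

    #include <hb.h>

    void demo_features(void) {
        /* Parsed form: disable standard ligatures over the whole buffer. */
        hb_feature_t liga;
        hb_feature_from_string("liga=0", -1, &liga);

        /* Hand-built form: tag, value, start & end cluster indices. */
        hb_feature_t kern = { HB_TAG('k','e','r','n'), 1, 0, (unsigned int)-1 };
        (void)kern;
    }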

Both tags & colours are 32bit values split into 4 components.

You can provide Harfbuzz callbacks for visually-rendering these glyphs, with an internal structure tracking internal state.

Harfbuzz Fonts & Subsetting

Besides a buffer of the text to be shaped, Harfbuzz’s main function accepts the font with which it’s to be rendered. The font defines the glyphs it can choose to represent that text.

Harfbuzz provides a font object specifying synchronization as its other dominant parameter & memory-management properties as well as:

These are loaded from blobs & an index via a “Face” object.
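
The blob → face → font chain, sketched (the filepath is a placeholder):

    #include <hb.h>

    hb_font_t *demo_load_font(void) {
        /* A "blob" wrapping the raw fontfile bytes. */
        hb_blob_t *blob = hb_blob_create_from_file("/usr/share/fonts/some-font.ttf");

        /* Face index 0 within the file, then a font instance scaled for use. */
        hb_face_t *face = hb_face_create(blob, 0);
        hb_font_t *font = hb_font_create(face);
        hb_font_set_scale(font, 16 * 64, 16 * 64);  /* arbitrary units; mimics 26.6 fixed-point */

        hb_blob_destroy(blob);
        hb_face_destroy(face);  /* the font keeps its own references */
        return font;
    }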

Faces themselves include some memory-management callbacks, index number, units-per-em, glyphcount, a “shaper” parameterized for this type, an OpenType face, & linkedlist of shape plans.

There’s a whole segment of the codebase modelling each table present in an OpenType fontfile (the dominant & most featureful format, which WOFF files are a variant of). They have accessors & utility methods usually wrapping iterators (another type Harfbuzz reimplements itself). Many use unions to implement schema upgrades.


Then there’s utility methods, like those on the Sequence, SingleSubstFormat, or AlternateSet tables, which apply the substitutions to a buffer. I am glossing over much of this code since there’s so much of it that looks highly repetitive, so I’ll discuss them in more detail later as I discuss how they’re used.

The glyf table, amongst some variations, has matrix-transform utility methods.

Harfbuzz wraps C++ iters with operator overloading to add functional-style map/etc in a pipeline.


To help ensure the job’s done right, taking into account concerns Harfbuzz implements, they implement font subsetting themselves. For this it models “objects” as being a doubly-linkedlist of “real” & “virtual” “link” arrays. Links in turn hold an unsigned width, position, & object-index.

Postprocessing involves copying objects into a new array item-by-item & calling into a “repacker” to wrap that in a sorted graph (another subsystem), decide how to efficiently split it, & serialize the result.


A subset plan holds various bitflags, sets, & (possibly-hash) maps as well as some arrays & a source/dest font, to which it provides accessors. It has a C wrapper & default field values. Or you can create a subsetplan from a subset input, computing:

That’s just touching the surface, especially of glyph-IDs to retain. It frees properties upon error, & has accessors.

The serialization logic can be extensive, & different formats may optimize differently.


With that plan it repeatedly iterates over preserved tables computing what to keep until fixpoint (both inner & outer loops), attaching a new “accelerator” within the font’s userdata after the fact.

Subsetting a table generally involves estimating the subsetted tablesize for allocation, deferring to a method on the table, reserializing it, & recovering from errors.

Shaping Dispatch

Harfbuzz’s central function is hb_shape_full, at least once they decided to add extra parameters to hb_shape.

hb_shape_full validates the buffer, possibly copying data over into a new buffer for complete checks, creates/executes/destroys a cached “shape plan”, & possibly extensively validates the resulting buffer, ensuring all glyphs face in the same direction, breaking invariants are followed, etc.
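
End-to-end that looks like the following sketch; hb_shape here is the convenience wrapper over hb_shape_full:

    #include <hb.h>
    #include <stdio.h>

    void demo_shape(hb_font_t *font, hb_buffer_t *buf) {
        /* NULL/0: no extra features, default shaper list. */
        hb_shape(font, buf, NULL, 0);

        unsigned int count = hb_buffer_get_length(buf);
        hb_glyph_info_t *infos = hb_buffer_get_glyph_infos(buf, NULL);
        hb_glyph_position_t *pos = hb_buffer_get_glyph_positions(buf, NULL);

        for (unsigned int i = 0; i < count; i++)
            printf("glyph %u at cluster %u, advance %d\n",
                   infos[i].codepoint, infos[i].cluster, pos[i].x_advance);
    }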


Creating a cached Harfbuzz shape plan involves checking whether it’s worth caching, if so allocating a caching key & iterating over a linkedlist looking for it, creating the plan itself (returning it immediately if we’re not caching), & allocating a node into that linkedlist.

Creating a Harfbuzz shape plan involves validating that a direction & properties are given, allocating the structure, flagging the fontface (falling back to empty) as immutable, initializing the key, & initializing the OpenType info.


Executing a shaping plan involves, after validation, going over all builtin “shapers” checking whether the font holds relevant data & calling its entry function.

These builtin shapers may (depending on build flags) include:

Fallback Shaping

In case a font doesn’t provide shaping tables Harfbuzz provides a rudimentary fallback. Still less euro-centric than much toy text-rendering code: according to the documentation it supports Latin (the alphabet being used here for English), Cyrillic, Greek, Armenian, Georgian, Tifinagh, etc!

This fallback first looks up the space glyph & whether it’s present, zeroes output, & retrieves parameters before iterating over the buffer & cleaning up.

For each char this fallback shaper checks if it’s default-ignorable. Those become space glyphs with zero x & y advance. Otherwise it looks up the char, position advance, & position origin.


Cleaning up involves checking whether the given direction is “backward” (sounds a little judgemental of a term to me…) & if so the glyph order is reversed. Using an iteration with both forward & backward indices.

Then it finishes by unsetting all glyphs’ flags except HB_GLYPH_FLAG_DEFINED.

Looking up a glyph involves deferring to an internal methodtable. Horizontal & vertical advances are separate methods internally. Same for origin, incorporating subtraction.


To perform these per-char lookups Harfbuzz may defer to FreeType (will describe later), or it may use its own OpenType implementation.

There’s a bit of C-PreProcessor magic which I found a bit difficult to navigate, but it defers to the font’s CMap table & in turn an appropriate callback for its “fontpage” (roughly equivalent to alphabet/”script”) or version number. In either case the fallback is to defer indirectly to an appropriate “get_glyph” method which may do a binary-search or array lookup.

There’s fastpaths for formats 12 & 4.
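
From the caller’s side all of this cmap & metrics machinery is reached through plain font methods; a sketch:

    #include <hb.h>
    #include <stdio.h>

    void demo_glyph_lookup(hb_font_t *font) {
        hb_codepoint_t gid;

        /* cmap lookup: Unicode codepoint -> glyph ID, if the font covers it. */
        if (hb_font_get_nominal_glyph(font, 'A', &gid)) {
            /* HMTX-backed horizontal advance (in the font's scaled units). */
            hb_position_t advance = hb_font_get_glyph_h_advance(font, gid);
            printf("glyph %u advances %d\n", gid, advance);
        }
    }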


A second try may attempt rewriting the Unicode char.

Emulating Microsoft, Harfbuzz may try adding all 1s to the top nybble of single-byte chars. Or it may consult some sophisticated bittwiddling & lookuptables for fallback Arabic codepoints, in one of 2 codepaths.

Results are cached.


To compute horizontal advances for some glyphs Harfbuzz may first consult a cache, before consulting the HMTX table, tweaking scale, & writing into output arrays. Much of which is bypassed for char-by-char lookup.

Vertical advances use VMTX near-identically.

The HMTX & VMTX tables perform what resembles a bounds-wrapped array lookup, possibly followed up by consulting a VAR (variations) table for tweaks & possibly deferring to the glyph’s methodtable.


To compute vertical offsets (horizontal appears commented out?) Harfbuzz first computes the horizontal advance as I’ve just described, & checks for a VORG table. If present it checks the VMTX’s variants table for tweaks before deferring to the VORG table.

Otherwise it attempts to compute a vertical origin from the font extents & leading-bearing (think I’ll explore those topics later as it becomes more relevant?). Otherwise tries a trivial formula to compute origin from extents.

Consulting the VORG table involves a binary search.

OpenType Shaping

Harfbuzz’s core logic, when it’s not deferring to some other OS’s implementation, involves:

  1. Gathering parameters into a “context”
  2. Updating a bitmask of allocated variants
  3. Resetting a mapping
  4. Iterating over the buffer splitting “graphemes”
  5. Possibly considering (checking per-glyph bitflags) prepending a dotted-circle glyph, if available, for Unicode marks to apply to, annotating with appropriate bitflags
  6. Possibly iterating over graphemes to merge or split them
  7. Iterating over buffer again considering whether to reverse the graphemes to enforce a consistent text direction
  8. Running the context’s shaper preprocessing plan surrounded by debug messages
  9. Considering flipping each character horizontally or vertically, allocating variations
  10. Applying some normalization
  11. Looking for fractions to avoid breaking
  12. Telling the shaper to setup masks copying results over to the buffer
  13. Possibly reclassifying each char’s combining class
  14. Overwriting codepoints with computed glyphs
  15. Deallocating variations from the bitmask
  16. Allocating glyphs & ligatures into the bitmask
  17. Consulting the GDEF table to tweak glyph properties whilst clearing ligatures
  18. Recomputing each glyph’s properties
  19. Running the plan’s substitution pass
  20. Removing flagged glyphs, compacting the array whilst recomputing clusters
  21. Computing positioning
  22. Postprocessing substitutions
  23. Propagating bitflags
  24. Cleaning up

It computes positioning by:

  1. Clearing old info
  2. Computing initial info from the font’s advances & origin, with a fallback inserting whitespace via positioning properties.
  3. Tweaking that based on various font tables (h-origin, & repeatedly GPOS) including another shaper callback whilst considering “mark” & “ignorable” & deleted glyphs specially falling back to per-cluster layout along a “baseline”
  4. Considering reversing glyphs
  5. Deallocating variants.

Substitution postprocessing involves possibly removing glyphs flagged deleted, overwriting invisible glyphs or deleting them, & running the shape plan.

Graphemes (first-pass char groupings)

There’s a general initial pass setting each glyph-info’s continuation & other Unicode flags based on which characters it sees.

Then there’s a choice between two codepaths selected by the buffer’s given cluster_level, implemented as methods on said buffer called per-grapheme. Hopefully they’re 1 toot each?

After some size-validation & taking a fastpath for char-level clustering, merging clusters involves iterating over each of its chars taking the min cluster number, iterating over following chars until they’re in a different cluster, doing the same for preceding chars continuing into an “out-buffer”, & ensuring all chars in this broadened range have their cluster set to the same value, flagging them DEFINED upon change.

The alternate codepath flags each char-info in the range with UNSAFE_TO_BREAK & UNSAFE_TO_CONCAT, using one of a few minorly-differing codepaths handling edgecases.

Locale-Specific Preprocessing

Depending on your alphabet, Harfbuzz may apply additional preprocessing logic. Given Latin’s the one I’m familiar with I’m not sure how much sense I’ll make out of this, but here goes…

Hangul (if I understand correctly) needs to compose syllables from a leading consonant, a vowel, & an optional trailing consonant, each component having its own Unicode char-range. Or it may need to decompose, depending on the font.

This involves a main iteration. Which first checks for “tones” & whether to normalize by adding dotted circles.


Then it checks for leadings followed by vowels with a possible (if the font has the glyph) trailing, merging them into a single cluster. Otherwise for pre-combined glyphs it checks whether the font has the glyph & considers decomposing it with appropriate Hangul bitflags.

Considers zero-width chars during processing Hangul “tones”.


For Thai it uses a simpler iteration skipping non-“sara-am” chars, flagging continuations & switching to “sara-aa” chars.

Then scans back over certain chars rearranging or merging clusters.

If the font doesn’t embed its own shaping script Harfbuzz runs a 2nd iteration during preprocessing that classifies each char’s “mark type”, filling in blanks or following a state machine for which glyph lookuptable to use, attempting Windows’ then Mac’s mappings.


There’s a “Universal Shaping Engine” designed by Microsoft whose preprocessing inserts dotted circles where vowels are missing.

Unless Harfbuzz is trying to be Uniscribe bug-compatible, for Indic it’ll apply that same logic filling in missing vowels. The condition for doing so is compiled from a Microsoft datafile to C via a Python script. Metaprogramming!


And that appears to be all the “scripts”/”alphabets” known to Unicode & Harfbuzz with special preprocessing needs?

Text Normalization

After splitting graphemes, preprocessing, & flagging mirroring Harfbuzz enforces some normalization to match the chars defined by the font.

First off it checks a handful of fields are populated. If the normalization-mode is “auto” it’ll choose an actual one based on whether the plan has a “gpos mark”. This mode is used to flag whether it always or might shortcircuit.

Further checks utilizes a “normalization context”.

Then there’s 3 “rounds”…


The initial “decomposition” round flags the buffer output clear, & iterates over the buffer chars before syncing properties. For each char it skips ahead to the next “mark”-bitflagged char then rewinds. If it might shortcircuit it checks whether the pre-existing glyph already exists for the char; if so it repeatedly decomposes the char handling failures &, together with all marks, decomposes the cluster.

Decomposing a char sees if there’s a directly-corresponding glyph to append to output.


If not it tries consulting the shaper methodtable (discuss that tomorrow…), shortcircuiting if it didn’t decompose or the font doesn’t have the char. Otherwise it recurses on the “a char” checking for its availability before or after depending on whether it “might shortcircuit”. Preceding check can also likewise be moved after.

Then handles spaces or hex 2011 specially if needed.

Decomposing a “cluster” involves iterating over it looking for variation selectors, then decomposing each char.


For each variation selector codepoint in a cluster Harfbuzz iterates over said cluster looking for rare other variation selectors. For each in that inner loop it either consults the font for a variation or leaves the data for a later “GSub” pass to process (another day), skipping subsequent variation selectors.


The 2nd “reorder” pass (only if the initial pass found a char which needs this) iterates over glyphs locating runs with “combining classes” & sorts them with a postprocessing callback.

If there are CGJ chars (hex 034F, with a bitflag on the containing buffer) Harfbuzz examines surrounding chars to determine whether to unset the hidden bitflag.


The 3rd & optional round (upon the same check, for specific modes) combines diacritics into a single char when that’s what the font expects: by clearing output, iterating over glyphs, & syncing. For each glyph it checks whether it’s bitflagged as a Unicode mark & checks the combining classes before running a callback & looking up the glyph.

Locale-Specific Normalization

Harfbuzz’s logic for composing & decomposing glyphs is usually deferred to some Unicode library or other (LibICU has already been added to my queue), but it may be done specially for the different “scripts” Harfbuzz supports. But even there it typically defers to some extent or other to said Unicode library.

For Hebrew failing generic composition & if there’s no GPOS marks, Harfbuzz uses a lookuptable or 2.

The first lookuptable is a switch statement upon the 1st char with conditional checks upon the second. In a specific branch it may consult an array of “dagesh forms”. All amounted to hardcoded ligatures.


Indic disables decomposition of chars hex 0931, 09dc, 09dd, & 0b94 before deferring to general logic. Composition skips marks & recomposes hex 09af & 09bc as 09df specially, before deferring to general logic.


Khmer decomposes certain chars specially to insert hex 17c1 chars, before deferring to general logic. Composing skips marks before general logic.

The Microsoft USE algorithm composes chars likewise.


Arabic & Hebrew have special postprocessing after sorting “combining classes” in a cluster.

For combining classes 220 then 230 Arabic scans ahead over chars with a lower combining class, & if it found anything it scans ahead over “modified combining marks” with an equal combining class.

Those subclusters are internally-merged & swapped, before assigning new combining classes.

Hebrew checks the combining class of 3 char runs, & if they have certain combining classes the char-clusters are merged & the chars are swapped.

Locale-Specific Char-Property Computation

After normalizing text to use the available glyphs, the methodtable for the appropriate “script” (~alphabet) is given a chance to attach additional info.

Afterwards “font features” are applied into these bits via a dynamic lookuptable over the appropriate ranges (unless that range is the entire text) in the appropriate cluster.


For Arabic it allocates some bits, computes joining info, possibly computes Mongolian variations, & copies resulting data over into a different private property.

Arabic is a cursive script, meaning the glyph for each char is selected based on its surrounding chars. As such to compute those “joinings” it examines the preceding context consulting a multi-stage lookuptable to classify each preceding char, then another lookuptable to traverse a statemachine for the first non-“T” joiningtype.

Then it continues the lookups to classify each char & run the statemachine, & for the first non-“T”-joining trailing char.

For certain Mongolian chars (hex 180b, 180d, or 180f) it may copy the joining state from the previous char in a postprocessing loop.


For Myanmar it allocates some “category” & “position” bits, & for each char consults a multistage lookup table for each char’s category. The bulk of the logic here is in adjusting the mapping, discussed later.


Indic works the same as Myanmar here, except the looked-up value includes the position bits.

For Khmer it similarly allocates some “category” bits & for each char consults that same lookuptable for the category. Again most of the logic is in adjusting the mapping.


For Microsoft’s USE algorithm it considers running the Arabic routine before allocating category bits & iterates over each char retrieving values from a sophisticated lookuptable + bitwise manipulation.


For Hangul it iterates over all chars to lookup the previously-computed “hangul shaping feature” in a lookuptable, if that lookuptable’s present. The bits for the “hangul shaping feature” are deallocated regardless.

Locale-Specific Map Initialization

As the OpenType shaper is initialized, it gathers then compiles a map, possibly followed by a data_create callback.

Gathering those features involves enabling “rvrn” (adding it to a feature_infos array), registering a null callback to the GSUB table, enabling “ltra” & “ltrm” features or the RTL equivalents, compiletime-optionally enabling “frac”/“numr”/“denom” for rendering fractions, enabling “rand”, compiletime-optionally enabling “trak”, enabling “Harf” & “HARF”, & deferring to the script!


After alphabet/“script”-specific logic it enables the “Buzz” & “BUZZ” features, adds “abvm”/“blwm”/“ccmp”/“locl”/“mark”/“mkmk”/“rlig” features, adds “calt”/“clig”/“curs”/“kern”/“liga”/“rclt” features or if-vertical “vert”, iterates over caller-specified features adding each, possibly adds them again to the AAT feature, & gives the “script” a chance to override this config.

Compilation involves selecting certain features/etc, compacting memory, & extracting various properties/conditions.


For Arabic it enables the “stch” feature, registers a callback on the GSUB table (checking ligature information per-glyph to choose “repeating” or “fixed” shaping actions whilst setting ARABIC_HAS_STCH bitflags), enables “ccmp” & “locl” features, registers a null GSUB callback, adds “isol”/“fina”/“fin2”/“fin3”/“medi”/“med2”/“init” features flagging Syriac fallbacks & registering null GSUB callbacks, registers a GSUB callback to deallocate arabic-shaping-action bits, enables the “rlig” feature, registers a GSUB callback to apply some fallback shaping logic if the font doesn’t provide it, enables the “calt” feature, if “rclt” is missing adds a null GSUB callback & enables it, & enables “liga”, “clig”, & “mset”.

This Arabic fallback logic is implemented in another file with “Unicode” & “Win1256” subcomponents to populate the lookuptable. It’s predominantly a per-glyph lookup.


For Khmer it registers a couple of GSUB callbacks, enables “locl” & “ccmp” features, adds “pref”/“blwf”/“abvf”/“pstf”/“cfar” features, registers a callback to deallocate syllable bits, & adds “pres”/“abvs”/“blws”/“psts” features.

An initial callback allocates syllable bits, & parses the text via a Ragel-generated lexer to populate those bits, flagging each syllable as unsafe-to-break.

The other runs some more generic logic to insert dotted circles in place of missing vowels, before reordering extensively within each syllable.


For Indic it registers a callback that lexes Indic syllables & flags each as unsafe-to-break, enables “locl” & “ccmp” features, registers a reordering callback, adds “nukt”/“akhn”/“rphf”/“pref”/“blwf”/“abvf”/“half”/“pstf”/“vatu”/“cjct” features, registers a final reordering callback, & adds “init”/“pres”/“abvs”/“blws”/“psts”/“haln” features. Not dissimilar to Khmer!

The initial reordering callback involves checking some bits for the syllable-type to branch over alongside initial chars.


The outer branch chooses whether to apply this logic taking into account whether we want to be bug-compatible with Uniscribe. I’m not comprehending what the sizable inner branch is attempting to achieve as it rewrites the glyphs.

The final reordering callback looks for “virama” glyphs ensuring they’re categorized as “Halant”. Then it iterates over chars, considers prefix/suffix, checks whether that leaves much remaining, checks for “rephs”, checks for “prefs”, & finalizes clustering.


That loop locates “base-c” chars & the subsequent “pref” char setting bitflag on the next non-halant char. Malayalam has its own base-c char logic taking into account joiners, halants, & consonants.

If the text is near-entirely a prefix/suffix it puts a bit more work into locating the central word.

For initial pre-m glyphs it memmoves some glyphs & merges resulting clusters. Otherwise it merges clusters split by pre-m glyphs.

I won’t put more time into studying Indic shaping…


For Microsoft’s USE algorithm Harfbuzz registers a callback (to allocate syllable bits, lex into them, propagate a certain bitflag, aggregate bitflags, & iterate over syllables to propagate joining flags taking the aggregated flags into account), enables “locl”/“ccmp”/“nukt”/“akhn” features, deallocates substitution bitflags, adds the “rphf” feature, registers a GSUB callback to iterate over syllables & their chars to see where substitutions happened, registers a substitution-deallocator GSUB callback again, enables the “pref” feature, registers a GSUB callback to record its use, enables “rkrf”/“abvf”/“blwf”/“half”/“pstf”/“vatu”/“cjct” features, registers a GSUB callback to insert dotted circles in place of missing vowels & lightly reorder within each syllable, registers a syllabic-bits-deallocator GSUB callback, adds “isol”/“init”/“medi”/“fina” features, registers a null GSUB callback, & enables “abvs”/“blws”/“haln”/“pres”/“psts” features.


Finally for Hangul it just adds the “ljmo”, “vjmo”, & “tjmo” features.

Glyph Substitution

Harfbuzz’s main(?) “shaping” pass serves to substitute character IDs for glyph IDs corresponding to (usually) vector images within the fontfile based on the information extracted beforehand.

This may start with an AAT layout substitution, before consulting the standard path.

Most of the substitution logic in this general pass is shared with the layout logic.


To “apply” glyph substitutions or positioning to the text it iterates over the previously-queued stages & their “lookups” for each setting various info into a context object & calling another method. Whilst outputting debug info. After the “lookups” iteration it’ll call any pause callbacks if available.

For each lookup Harfbuzz discards empty text-segments, & decides whether to apply forward or backward substitutions. Forward substitutions happen in-place & require synchronization!


In either direction it checks each char’s bitflags against the stage (amongst other bitflags) & runs a substitution- vs shaping-specific callback. In the forward direction it consults a cache.

This is underlaid by a hb_ot_layout_lookup_accelerator_t object & the appropriate table from the fontfile. This “accelerator” gathers & allocates summary information upon initialization retrieving info from the fontface; wrapping a “digest set”, “subtables” array, & cache-user index number.


Behind a few layers of abstraction Harfbuzz dispatches to methods on each opcode in the relevant fonttable.

Substitution opcodes


Harfbuzz’s AAT Layout substitution turns out to apply similar logic to what I discussed yesterday in a pre-processing step upon the optional Morx & Mort tables.

Arabic Postprocessing

After all the “shaping” computation (save some propagation & cleanup) Harfbuzz gives its “script”/alphabet methodtables a chance to do some final postprocessing. Here I’ll go over how Arabic uses this opportunity! The other supported scripts don’t (I expected more to dig into here…).

Shortcircuiting if the stch char isn’t present according to the buffer’s bitflags, it performs the following steps (next toot) twice, for “measure” & “cut” passes.


The primary subtask is a reverse-iteration over the buffer’s chars within which it:

  1. Skips non-STCH_FIXED or STCH_REPEATING shaping actions, whilst (for the CUT phase) removing them
  2. Otherwise iterates over the STCH_FIXED or STCH_REPEATING chars summing the total fixed & repeating width
  3. Iterates to sum the total width
  4. Computes sophisticated justification
  5. In the CUT phase iterates over once more duplicating repeating chars, having ensured there’s space for them during MEASURE

The secondary subtask is to (for MEASURE) enlarge the buffer or (for CUT) assign the new size.

Shaping Miscellanea

Looking through the rest of Harfbuzz, I see:

FreeType

FreeType is the preeminent library for reading and rasterizing fontfiles! Neatly split into several submodules, most of which tackle a different fontformat. Or maybe they implement common forms of processing.
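
Before digging into its submodules, FreeType’s everyday use sketches out like this (error handling trimmed; the path is a placeholder):

    #include <ft2build.h>
    #include FT_FREETYPE_H

    void demo_freetype(void) {
        FT_Library library;
        FT_Face face;

        FT_Init_FreeType(&library);
        FT_New_Face(library, "/usr/share/fonts/some-font.ttf", 0, &face);
        FT_Set_Pixel_Sizes(face, 0, 16);

        /* Load & rasterize one glyph; the bitmap lands in face->glyph->bitmap. */
        FT_Load_Char(face, 'A', FT_LOAD_RENDER);

        FT_Done_Face(face);
        FT_Done_FreeType(library);
    }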

Autofitting

“Autofitting” relates to aligning text to the pixel grid. Less relevant now given even moderately high-res displays, but still… For the old “standard res” we hacked in higher-res for the sake of text by using the 3 colours as “subpixels”.

FreeType autofit has a struct tracking the original, current/scaled, & current/fitted position/width for the text converting from font units to screen subpixels units. The computation can be parameterized upon a fontface, x/y scale, x/y delta, rendermode, & some bitflags that can disable horizontal, vertical, or advance hinting.

There’s a routine which may (depending on compiletime flags) consult Harfbuzz, retrieving all glyphs in the font & writing an unsigned short to those indices in a given table.


This mutual dependency between FreeType & Harfbuzz may be broken on the FreeType side.

Writing systems (here from a static array, generated from macros) can have methodtables. There’s various datatables listing unicode ranges for different scripts.

A “module record” struct listing fallbackstyle, default script, whether to darken stems, & 8 darkening integral parameters.

A loader struct holding a face, globals, hints, metrics, whether its transformed, the matrix, delta, & 2 more vectors.


There’s a bubblesort for position arrays (actually a good sorting algorithm for trivial amounts of data).

When initializing the globals structure, FreeType iterates over all the charcodes for each writing system seeing which are present in the font (couple inner loops) whilst populating a lookuptable. Defers to the Harfbuzz logic described earlier, if present. ASCII digits are handled specially, & maybe output debugging.

An iteration over writing systems populates a metrics lookuptable.


There’s a noop autofitting methodtable.

There’s a routine to bubblesort & statistically summarize a width array.

There’s a routine for storing an array in an “axis hints segment”. Or an edge, recomputing a bounds array.

There’s a routine to recompute hinting properties for new parameters, tweaking existing data.

There’s a routine for iterating over segments aligning them to the x or y grid. There’s an interpolation routine along the subpixel grid. And a separate one for “weakpoints”.

All these routines operate upon a somewhat sophisticated internal datamodel.


There’s a routine abstracting glyph loading, adding unit conversion & a chance for the computed metrics, stem darkening, matrix transforms, the relevant writing system, etc to incorporate tweaks.

Everything’s abstracted behind a module methodtable.

The basic writing systems FreeType understands for autofitting include latinate, indic, & CJK. With a “blue” subsystem.


Each of those writing systems are processed according to their own datamodels. The Indic writing system for autofitting purposes is treated as a minor variation upon CJK.

The “blue” subsystem autofitting “bluestrings” is entirely datafiles.

I don’t think I’ll tackle the CJK writing system since I don’t write it & can’t speak to it. There is a fair bit of code there, which seems appropriate to my limited knowledge.


FreeType “autofitting”’s method table for this “latinate” writing system used here:

FreeType Core

FreeType includes a “core” module for all its dependencies to build upon. In here I see:

BDF Fonts

BDF is (according to the README here) an early & limited bitmap font format from Adobe intended to be human- & computer- legible. Could well have been vital to getting computing off the ground!

FreeType’s BDF implementation has a CMap subclass with binary-search lookups (2 methods).

A routine extracts bitflags & extends flags from BDF metadata.

This submodule implements its own resizable-array implementation for its parser.


Parsing a given FreeType Stream (filedescriptor wrapper) involves repeatedly:

  1. Checking whether to refill the buffer from that Stream
  2. Skip empty lines
  3. Locate end-of-line & whether we need to refill to do so; if so prepare
  4. Unless commentline run callback.
  5. Skip to end-of-line.

There are number-parsing routines (it seems they plonked a self-contained BDF parser into FreeType & minimally modified it to fit), an encoding-comparator, keyed accessors into a metadata parallel-array (+ a FreeType hashmap), a routine to append a comment to their array, a deallocator, & a line-oriented statemachine parser.

This parser isn’t really all that complex relative to other parsers I’ve seen…

There’s a Face subclass extensively abstracting the parser’s output into FreeType’s common format, including populating that CMap subclass & ensuring any requests are within BDF’s limited capabilities.