UNIX Standard Libraries

No programmer can get far at all without reusing existing code provided by the operating system, if not by others as well. At the very least we need our software to communicate with other programs (crucially the Linux kernel), and through them humans, to have any reason to exist. This reusable code is packaged as “libraries” or “modules”.

GNU LibC

GNU LibC, following the POSIX standards, is the most fundamental library on most UNIX-like systems today, providing I/O, memory allocation, filesystem manipulation, etc. Put simply, its primary job is to abstract away the data structures exposed by Linux, central config files, GCC, & the underlying CPU. Full details on the linked page.

Related: WeLibC’s blog

StdC++

C++ is a variant of C which adds syntax for classes & more. As such C++ has its own standard library atop C’s to take advantage of these syntactic features.

Though projects may implement their own C++ standard library, reimplement something similar but not identical via an extremely verbose set of C macros, or use a higher-level language.

C++’s stdlib mostly just wraps C’s (which itself often uses method tables internally), so that should make this faster to get through!

I’m ignoring the header files in this study. I skimmed them; they do contain some nontrivial implementation, but nothing that interesting.

Compiling these C++ files works mostly the same as for C, though with added complications throughout. Still nothing compared to what’s required to make optimal use of a variety of modern CPUs!


In the implementation code there’s an array of prime numbers used for implementing hashmaps in those header files, and maybe elsewhere too.
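As a hedged sketch of the idea (these are not libstdc++’s actual values, and its real table is much longer), a hash table can pick its next bucket count from such an array like so:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>

    // Illustrative primes only; prime bucket counts help modulo-reduction
    // spread keys evenly across buckets.
    static const std::size_t bucket_primes[] = {
        5, 11, 23, 47, 97, 199, 409, 823, 1741, 3469, 6949, 14033
    };

    // Return the smallest listed prime >= the requested bucket count.
    std::size_t next_bucket_count(std::size_t requested) {
        const std::size_t *p = std::lower_bound(std::begin(bucket_primes),
                                                std::end(bucket_primes), requested);
        return p == std::end(bucket_primes) ? *(std::end(bucket_primes) - 1) : *p;
    }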

For the C++ filesystem APIs (as provided by GCC) there’s a _Dir_base struct which wraps opendir/readdir/closedir. Separate classes further abstract this with a C++ iterator, maybe a depth-first stack, & more.
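To make the wrapping concrete, here’s a hypothetical RAII sketch of what such a class boils down to; GCC’s real _Dir_base carries more state (error policy, file-type hints, etc.):

    #include <cerrno>
    #include <dirent.h>
    #include <system_error>

    class DirHandle {
        DIR *d;
    public:
        explicit DirHandle(const char *path) : d(opendir(path)) {
            if (!d) throw std::system_error(errno, std::generic_category(), path);
        }
        ~DirHandle() { if (d) closedir(d); }
        DirHandle(const DirHandle &) = delete;
        DirHandle &operator=(const DirHandle &) = delete;

        // Advance to the next entry, or return nullptr at the end of the directory.
        const dirent *next() { return readdir(d); }
    };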

There are platform-independent reexports of the basic file operations, followed by utils for: reading the time, filetype, & status properties from the result of the “stat” syscall; reformatting copy-file options; copying a file from a source path to a destination path after validating that destination using “stat” & the above functions, throwing any relevant errors & [f]chmoding it (uses the sendfile syscall with a byte-copying fallback); & platform-independently retrieving available diskspace.
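A hedged sketch of that copy strategy (not GCC’s actual code): stat the source, try Linux’s sendfile syscall, & fall back to a read/write loop if it’s unavailable.

    #include <cerrno>
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    bool copy_file_contents(const char *src, const char *dst) {
        int in = open(src, O_RDONLY);
        if (in < 0) return false;
        struct stat st;
        if (fstat(in, &st) < 0) { close(in); return false; }
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 07777);
        if (out < 0) { close(in); return false; }

        off_t offset = 0;
        ssize_t n = sendfile(out, in, &offset, st.st_size);
        if (n < 0 && (errno == EINVAL || errno == ENOSYS)) {
            // Byte-copying fallback where sendfile isn't supported.
            char buf[65536];
            while ((n = read(in, buf, sizeof buf)) > 0)
                if (write(out, buf, n) != n) { n = -1; break; }
        }
        close(in);
        close(out);
        return n >= 0;
    }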

There’s a large suite of higher level file operations abstracting these, including finding absolute & canonical paths, copying files & symlinks, creating dirs & sym/hardlinks, etc.

There’s a path struct (used by directory traversal) which stores a “components” array/stack (split by “/”) for userspace traversal alongside the filepath string, overrides the divide operator to concatenate paths, tracks the filepath type, & provides other filepath manipulation/comparison/hashing utilities.
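The divide-operator concatenation is part of the public std::filesystem::path API, so a quick usage sketch:

    #include <filesystem>
    #include <iostream>

    int main() {
        namespace fs = std::filesystem;
        // operator/ appends components, inserting separators as needed.
        fs::path p = fs::path("/usr/share") / "locale" / "de" / "LC_MESSAGES";
        std::cout << p << '\n';                              // the composed path
        std::cout << p.filename() << ' ' << p.parent_path() << '\n';
        std::cout << (p.is_absolute() ? "absolute\n" : "relative\n");
    }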


Introduced in C++98…

There’s a simple hash function for doubles involving multiplies, adds, & divides. Strings use FNV hashing, as implemented in the header files.
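For reference, a sketch of the FNV-1a family on strings; the exact variant & constants in libstdc++’s headers may differ:

    #include <cstddef>
    #include <cstdint>
    #include <string_view>

    std::size_t fnv1a(std::string_view s) {
        std::uint64_t h = 14695981039346656037ull;   // 64-bit FNV offset basis
        for (unsigned char c : s) {
            h ^= c;                                  // xor in each byte...
            h *= 1099511628211ull;                   // ...then multiply by the FNV prime
        }
        return static_cast<std::size_t>(h);
    }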

ios_base contains numeric formatting parameters, & error-throwing methods.

basic_istream has an ignore(n) method, with error handling & buffering.

There are transfer, reverse, & [un]hook doubly-linked list methods.
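As a hedged sketch of what the [un]hook splicing boils down to (the real std::list node base works on untyped node pointers):

    struct Node {
        Node *prev, *next;

        // Insert this node immediately before `pos`.
        void hook(Node *pos) {
            next = pos;
            prev = pos->prev;
            pos->prev->next = this;
            pos->prev = this;
        }

        // Remove this node from whatever list it's linked into.
        void unhook() {
            prev->next = next;
            next->prev = prev;
        }
    };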

free_list appears to allocate out of arrays. codecvt implements format conversion upon generics.

There’s a routine for wrapping stdin/stdout/stderr with [w]istream & [w]ostreams.

The >> operator & getline methods on basic_istreams retrieve the underlying buffer & read chunks until they find the delimiter.

There’s code for serializing & validating float format strings.

There’s a global of type __gnu_parallel::_Settings.

It ensures you can do trigonometry, etc. on long doubles & floats.

There’s functions for copying data from one basic_streambuf to another until EOF.

__pool_alloc_base allocates its larger chunks (split between multiple callers) via a bump pointer, falling back to a small-int map of linked-list freelists, with error handling.

gslice_to_index does something involving populating certain indices of an array with indices &, in part, values computed via an intermediary array.

There’s a locale class for parsing locale configuration from strings, environment variables, & elsewhere. Includes a cache.

There’s a red-black tree implementation. A red-black tree is a binary search tree which keeps itself approximately balanced by labelling each node “red” or “black” (for the sake of visual, non-jargony explanations) & enforcing invariants on those colours: no red node has a red child, & every path from the root down passes through the same number of black nodes. Thereby ensuring all keys present are comparably fast to look up!
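A hedged sketch of those invariants (not libstdc++’s actual _Rb_tree node layout): a checker returning a subtree’s black-height, or -1 on violation.

    struct RbNode {
        bool red;
        RbNode *left, *right;
    };

    int black_height(const RbNode *n) {
        if (!n) return 1;                                  // null leaves count as black
        if (n->red && ((n->left && n->left->red) || (n->right && n->right->red)))
            return -1;                                     // no red node may have a red child
        int l = black_height(n->left), r = black_height(n->right);
        if (l < 0 || r < 0 || l != r) return -1;           // equal black count on every path
        return l + (n->red ? 0 : 1);
    }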

There’s a stream implementation for strings (strstreambuf) with read/write indices & reallocation.

The __pool allocator has plenty of setup & cleanup, managing a segmented array to allocate out of, with a free array as fallback.


In C++11 the following was introduced to the standard library…

A hash function on doubles, & another (FNV) on strings, for use by a hashmap implementation sized to prime numbers (with a shortcut small-int map for choosing capacities when length < 13).

A “future” type tracking, under a lock, the arrival of asynchronous data.
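A quick usage sketch of the public promise/future pairing that handshake underlies:

    #include <future>
    #include <iostream>
    #include <thread>

    int main() {
        std::promise<int> p;
        std::future<int> f = p.get_future();
        std::thread worker([&p] { p.set_value(42); });  // deliver the value asynchronously
        std::cout << f.get() << '\n';                   // blocks until the value arrives
        worker.join();
    }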

A wrapper class around atomic mutexes.

A string implementation for strings compiled into your executable, dynamically copied to the heap as needed.

Then there’s new exceptions!

There’s a class for character categorization.

There’s a pseudorandom number generator backed by a “random device” class.
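In usage terms: the “random device” supplies a nondeterministic seed for a deterministic pseudorandom engine.

    #include <iostream>
    #include <random>

    int main() {
        std::random_device rd;                        // nondeterministic seed source
        std::mt19937 gen(rd());                       // Mersenne Twister engine
        std::uniform_int_distribution<int> die(1, 6); // map its output onto a range
        std::cout << die(gen) << '\n';
    }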

There are further wrappers around mutexes like _Sp_locker & atomic conditions. Upon which the gthread class is built.

There’s a trivial snprintf implementation.

There’s a class representing locale-related configuration.

The “futex” class wraps the futex syscalls (as well as time syscalls) & userspace atomics.

“systemclock” class wraps the gettimeofday syscall.

There’s an implementation of swappable buffers to underly I/O streams (ios_base class).

Threads have methods wrapping other syscalls.

There are copy-on-write strings which reallocate into the heap upon first write.

There’s an atomic flags class.

There are more locale classes.

There’s text decoding.

There’s debugging utilities for outputting human-semireadable data & attaching profilers to sequences/collections.


In C++17, the following were added to the standard library:

There’s an allocator out of a pre-allocated buffer used for parsing & formatting between numbers & strings. Error handling is involved.
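In terms of the public C++17 API this backs, <charconv> parses & formats into caller-provided buffers with explicit error codes:

    #include <charconv>
    #include <iostream>
    #include <string_view>
    #include <system_error>

    int main() {
        char buf[32];
        // Format an integer into a caller-provided buffer; no allocation involved.
        auto [ptr, ec] = std::to_chars(buf, buf + sizeof buf, 12345);
        if (ec == std::errc())
            std::cout << std::string_view(buf, ptr - buf) << '\n';

        // Parse it back out of the same buffer.
        int value = 0;
        auto res = std::from_chars(buf, ptr, value);
        if (res.ec == std::errc())
            std::cout << value << '\n';
    }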

For some reason there’s a struct wrapping fixed-sized int types with operator overloading.

Additional filesystem classes were introduced, including ones for iterating over directories & for filepaths (including copy, create, equivalence, etc. operations), largely wrapping previous implementations.

There’s a reimplementation of formatting floating point numbers to (decimal or hexadecimal) text.

There’s an arena allocator, with its own bitset implementation & suballocators, whereby it increments a pointer into a “chunk” until overflow, at which point another chunk will be allocated. All memory allocated in a “pool” is only freed once the whole pool is freed.
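The public face of this is std::pmr; a usage sketch of the monotonic (pool-freed-all-at-once) behaviour described above:

    #include <memory_resource>
    #include <vector>

    int main() {
        // Allocations bump a pointer through this buffer, spill into freshly
        // allocated chunks on overflow, & are all released together at the end.
        char stack_buffer[1024];
        std::pmr::monotonic_buffer_resource arena(stack_buffer, sizeof stack_buffer);
        std::pmr::vector<int> v(&arena);
        for (int i = 0; i < 100; ++i) v.push_back(i);
    }   // nothing is freed per-element; the whole pool goes at once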

Much of this implementation for C++17 is built on code taken from “ryu”…

The code GCC’s C++17 implementation takes from the Apache-licensed Ryu library uses lookup tables with multiplies, adds, shifts, & comparisons for postprocessing, to compute powers, inverse powers, corresponding divides, & (via a separate lookup table) reencoding. It may use similar math (without lookup tables) to compute multiplies, shifts, divides, remainders, etc.

It decomposes floats &, using those power computations, a digit lookup table, & debugging messages, formats them as text.


For C++20 I just see an internal header file defining basic_stringbufs & basic_[i,o]stringstreams. So I can’t comment further. C++20 didn’t change the standard libraries much…

Then there are the more extensive public headers which contain more of the implementation…

Gettext

When communicating textually it is important to be prepared to teach your program the (natural) languages spoken by its prospective users. I don’t think it’s reasonable to expect devs to “localize” the software themselves, but thanks in large part to Gettext it is definitely reasonable to expect us to “internationalize” our software!

I’ll study Gettext following along the dev pipeline of extracting, translating, compiling, & applying the translated UI text.

UI String Extraction

To use Gettext you mark any UI text in your programs with the _ macro (or variants) & run the xgettext command over your source code.
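A minimal, hypothetical example of that marking (the `_` macro convention is defined by your own project):

    #include <libintl.h>
    #include <cstdio>
    #define _(msgid) gettext(msgid)

    int main() {
        std::puts(_("Hello, world!"));   // xgettext will extract "Hello, world!"
    }
    // Extract with e.g.: xgettext --keyword=_ --output=messages.pot main.cpp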

After extensive initialization, including parsing per-programming-language lists of flags into a hashmap, xgettext parses its many commandline flags. It handles --version & --help specially before extensively validating/normalizing those flags. After possibly reading input files from a file, it appends the remaining commandline args.


Then xgettext considers the issue of text encoding, normalizing everything to UTF-8, allocates an array, generates a plaintext metadata entry to append, & possibly parses the previous .po file (using a plugin like any other language) to append its previous entries, possibly translating their charset.

And finally it iterates over each input file before cleaning up, sorting results, & outputting them as serialized text. For each it possibly parses a rule list from XML or infers the programming language (based on filenames, particularly extensions where there’s a fastpath, & the XML root tag), & based on that info it parses the file using a selected callback or traverses an XML file looking for translatable text.

C UI String Extraction

I’ll describe how the callback for extracting strings from C (and C++/Objective-C) works here, without exploring the other callbacks. They all work more-or-less the same way with a few exceptions; for Ruby, for example, it delegates to rxgettext.

This involves initializing various globals, including lists of what to look for (KDE extends this), then balancing parens like a LISP parser.


Balancing parens involves a recursive inner loop (restarting the inner loop once it has balanced the parens found so far) examining each C token (lexer described in subsequent toots). If the token’s an identifier it sets a flag & proceeds immediately to the symbol logic without unsetting it. If the token’s a symbol it unsets that flag & swaps out its iterator, whilst handling ObjC specially. For LParen tokens it recurses, exiting upon EOF, before nulling out the iterators & unsetting the flag. Upon RParen it closes the iterator & exits the innermost recursive loop. For commas it resets context, iterators, & the flag. For colons it handles ObjC specially or nulls the iterators, resetting the flag in either case. For string literals it saves them aside, with or without checking (via previously-saved state) that they’re in an appropriate macro call, before nulling iterators & resetting the flag. It closes the iterator & exits upon EOF. Otherwise it nulls iterators & resets the flag.


The C lexer it uses operates in 9 “phases”. The topmost phase lightly postprocesses the tokens, largely serving to look up all names in its hashset, falling back to calling them “symbol” tokens.

Phase8 is split into 4 subphases. The topmost concatenates consecutive strings. Phase8c strips ‘@’ symbols preceding strings when lexing ObjC. Phase8b drops whitespace. Phase8a lowers int-type macros to strings.

Phase6 (phase7 got rearranged), for each phaseX token preceded with a ‘#’, looks for the end of line or a “define” token, buffering these into a “pushback” buffer. It then checks whether that was a line-number macro to update that global counter, before freeing the buffered tokens & clearing saved comments. The bodies of “#define”s are left in the tokenstream in case they contain UI text.

PhaseX lowers ‘#’ tokens differently for start-of-line vs mid-of-line.

Phase5 does much of the work, branching over phase4’s initial char!


For EOF chars phase5 emits an EOF token. For newlines it emits end-of-line tokens. For other whitespace it collapses subsequent non-newline whitespace before emitting a whitespace token. For letters & ‘_’ it scans subsequent ones & digits, capturing them in a buffer with extensive special handling for C++11 strings, before emitting name tokens. For apostrophes it retrieves the next char & emits a character token. Parens, commas, hashes, colons, & (for ObjC) ‘@’s each get their own token. For ‘.’ it decides whether that’s a decimal number or a symbol. For digits or decimal numbers it scans & buffers subsequent chars, emitting a number token parsed via the standard lib. For double quotes it scans each phase7 char until it sees a closing quote, emitting a string literal token. Otherwise it yields symbol tokens.


Phase7 handles lexing escape sequences in string or char literals for phase5.

Phase4 lexes, removes, & saves aside comments to be attached to translation messages.

Phase3 handles escaped newlines.

Phase2 optionally decodes “trigraphs”, a concept grappling with the limitations of early keyboards & character sets, by interpreting “??”-prefixed sequences as a different char, e.g. “??(” = ‘[’.

Phase1 counts line numbers & strips escaped newlines.

Phase0 handles alternate newline encodings. Uses getc/ungetc as a scanner.

PO-File Initialization

To start translating a program using Gettext into a new language you run msginit to copy the extracted UI strings from a .pot file into a new .po file. Today I’ll study how msginit works!

After initializing its own internationalization (amongst a couple other things), parsing commandline flags (handling --version & --help specially), & validating there aren’t any more commandline args, msginit locates the .pot file in the current directory if not explicitly given & warns about using the “C” locale.


From there msginit normalizes the locale name, generates an output file name if not explicitly given (warning about overwriting existing files), parses the .pot file (next toot), on Windows overrides $PATH, normalizes metadata on each of the UI strings, ensures there’s the correct number of plural forms (extracting that count from metadata somehow; I’m failing to find all that code) OR for English there’s a variant which prefills from the POT file, outputs these UI strings, & exits.


Opening the .pot file handles “-“ specially & incorporates a configurable search path within which to look for relative paths, as well as a couple of extensions it’ll try appending. The successfully opened file’s returned to the parser.

The parser allocates/initializes some initial state to be cleaned up on exit, before calling a parse method surrounded by setting/unsetting a global, calling parse_[de]brief methods, & error reporting.

Parse method is given externally. Other methods collect/track parsed data.


msginit can parse one of 3 “input formats”, the default being of course .po[t].

The .po lexer is handwritten with a scanner dealing with escaped newlines & full UTF-8. This lexer repeatedly branches over the initial ASCII char, handles UTF-8 EOF errors, or complains about unquoted Unicode characters.

For newlines it unsets some flags. For other whitespace it continues the loop. For ‘#’ it checks whether it’s followed by a ‘~’ or ‘ ’ (which has special meaning) to skip that prefix & set a flag.

Otherwise ‘#’ scans to the end of the line, with or without emitting a token as per config. Double quotes collect all chars until the end quote, warning about newlines or EOF & resolving escape sequences. ASCII letters, ‘_’, or ‘$’ collect subsequent such chars or digits to convert into the appropriate keyword token or an error. For digits it scans subsequent digits, parses via the LibC, & emits a NUMBER. Brackets are each their own token with state updates.

The parser is implemented via Bison.


Said parser is relatively simple, with rules to parse comments (their own token), domains (domain & STRING tokens), & messages (some combination of an intro, stringlist, MSGSTR token, & pluralforms). Message intros in turn consist of a MSGID token possibly preceded by MSGCTXT & a stringlist and/or PREV variants of these tokens. Pluralforms consist of a MSGID_PLURAL token & a stringlist. Each pluralform consists of a MSGSTR token, a bracketed NUMBER token, & a stringlist.

A stringlist is a sequence of STRING tokens.
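For illustration, an invented PO entry exercising most of these rules (the German translation is made up for the example):

    # translator comment
    #: src/main.c:42
    msgctxt "file manager"
    msgid "Removed one file"
    msgid_plural "Removed %d files"
    msgstr[0] "Eine Datei entfernt"
    msgstr[1] "%d Dateien entfernt"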

The stringtable & properties alternate input syntaxes more resemble the C lexing.


For output it first checks whether there is anything to output & whether we want to output that nothing anyway. It checks if/how it should complain about there being multiple domains based on the given output syntax class, & otherwise checks whether/how it should complain about plurals & contexts. It checks whether we’re outputting to stdout & whether to configure colouring, & calls print with an open file.

The output syntax options are the same as for input but independent from it.


The default PO syntax iterates over each message, outputting the metadata entry specially before extracting the header &, in turn, the charset to inform how it outputs each message, obsolete ones last.

Outputting a message involves outputting all the different comments first then message context (with optional charset warnings), message ID, plural ID, & message string(s).

Colours are added via a rabbithole of a bundled library. I’m not going to discuss libgettextpo, responsible for Gettext’s fancy formatting, since it looks like quite the tangent! Though, of course, colouring is optional. Interestingly it does support outputting to HTML for this formatting!

There’s a complex utility for wrapping text strings with a preceding identifier within the PO writer module, amongst other simpler helper functions.

The other output syntaxes are quite straightforward without colouring & text wrapping support.

PO-File Merging

As software continues to develop it’ll add (and remove) UI strings for localizers to keep on top of. If the devs are considerate they’ll stick to a lifecycle where over certain periods they’ll avoid adding new UI strings so the localizers have a chance to catch up.

To incorporate these upstream changes into their localization, where they can then localize them, localizers use the msgmerge command!


After initializing its own internationalization, parsing commandline flags, & validating 2 additional commandline args are given, msgmerge handles --version & --help specially, validates commandline flags, in update mode ensures the output syntax matches the input syntax, tweaks flags for use by msgfmt, possibly initializes OpenMP, calls the core logic, applies desired sorting, handles update mode specially, & outputs the messages just like msginit.


In update mode it might sort obsolete entries to the end, checks whether any changes were made, & if so generates a backup file & outputs the messages.

The core logic involves reading the two inputs (just like msginit), ensuring there’s a metadata message, iterating over the POT file looking for charsets in the headers, converting everything to UTF-8 or otherwise handling text encodings, handling charsets specially in fuzzy mode, allocating an output messagelist, in multidomain mode allocating sublists (whilst matching domains, validating plural expressions, & iterating over existing definitions to find a previously-translated match, fuzzily or not, to combine) & otherwise skipping the sublist allocations, optionally performing postprocessing regarding source locations for msgfmt, & validating results.

PO-File Editing

The next step of the internationalization/localization workflow described by Gettext’s manual is to actually edit the localizations! Said manual mentions desktop & emacs interfaces, but personally I’m more tempted to study a web interface “WebLate”. Since (even without JS, just forms) the web is a handy way to quickly gather information from people!

Web-based Localization

WebLate is a self-hostable Django webapp facilitating multilingual translation contributions to software projects which use tools like Gettext!

Upon Django’s builtin accounts system WebLate adds:


WebLate bundles a suite of “addons” you can enable or disable which upon database update may:

Each of these has a form integrated into the admin web & commandline UIs. There’s a common superclass & a database model holding JSON for addon-specific configuration.


To provide an HTTP API WebLate has a routing util for capturing which virtually-hosted WebLate instance is selected, an auth token, declarative serializers via rest_framework, routing paths to views via that utility, & various views returning results of database queries.

WebLate’s custom “auth” Django-app for some reason implements its own permissions system with WebLate-specific instances, Django-Admin integration, migration utility functions, & various utilities for checking these permissions. Bundles its own createadmin, importusers, & setupgroups management commands. Declares its own Group & User classes upon Django’s baseclasses for some reason. Also adds a templatetag for checking permissions for the logged-in user. Auth’s database model can automatically group users based on email-address regexps.

If you’re selling access to your WebLate instance, “billing” adds a database model for which plans your customers are on & their invoices, with Django Admin integration. With Celery-automated tasks & a management command to compute & send invoices. Or there are views to download your invoices online.


WebLate has a Django-app which highlights syntax errors according to a choice of classes checking different interpolation or markup syntaxes; involving gather, sort, & overlap-removal steps to be rendered via a template. It has database models for which checks to perform. There are web UIs for viewing these.

It incorporates a datafile of localized language names, & has management commands for listing ignored checks, listing the top failing checks for untranslated entries, & rerunning checks.

Supported checks include:

These all share a superclass (with variants), and are often heuristic.


For importing & exporting localizations WebLate implements a minor variation upon Python’s builtin io.BytesIO that adds mode & name attributes, a ParseError exception, & infrastructure to call the appropriate importer or exporter class.

Most of these exporters are provided by the Translate Toolkit (which I won’t discuss) via wrappers, but there’s also an importer for Excel via OpenPyXL, & logic automatically determining which importer to delegate to. They share featureful datamodelling superclasses. Each of its many supported importer & exporter formats gets its own, typically trivial, class.

The “gitexport” Django-app provides utility functions to compute the Git URIs, & views which proxy Git’s internal git-http-backend script.

The “lang” Django-app datamodels the human languages being localized into with Django Admin integration. Including standard plurals config, datafiles, display web UI, shell commands, & fuzzy matching.

For the sake of its own localization language data is moved into its own near-empty Django-app.

There’s a Django-app for tracking agreements to legal terms with its own database model, forms, web UI (with even its own “templates”), a decorator for use by other Django-apps’ web UIs or a “middleware” around all, Django-Admin integration, & a template tag for linking to these agreements.


WebLate’s “machinery” Django-app offers classes integrating into various machine translation services including previous translations on this WebLate instance, dummy translations for “Hello, world!”, deepl.com, glosbe.com, translation.googleapis.com, terminology.microsoft.com, mymemory.translated.net, amagama-live.translatehouse.org or other instances, AWS Translate (via boto3 module), microsofttranslator.com possibly with API keys from cognitive.microsoft.com, SAP MT instances, translate.yandex.net, youdao.com, fanyi.baidu.com, or an Apertium instance.

Which have their own configuration fields, and all have a common baseclass helping to abstract HTTP API usage, prioritizing, rate limiting, & language selection.

WebLate’s “memory” Django-app offers forms, its own integration “machinery” class, bridge over to whoosh module, Celery-automated cleanup, web UI, & shell commands for importing, exporting, deleting, & listing translation-memory XML/JSON files.

WebLate’s “screenshots” Django-app offers Django Admin integration, database fieldtype subclassing ImageField with additional validation, modelform, database model, Celery-automated cleanup, & CRUD (Create/Read/Update/Delete) web UI recording illustrative app screenshots clarifying what UI strings refer to. With integrated OCR via tesserocr.

WebLate’s “vcs” Django-app implements support for Git with or without Gerrit or GitHub, Subversion, GPG, Mercurial, & SSH keys as dynamically-loadable subclasses of a common baseclass aiding those in deferring to the commandline; with its own configuration fields.

And there’s various tweaks to the Django Admin, mostly to add a performance dashboard & special SSH keys screen.


For the clientside WebLate vendors Chartist.js, specially-configured BootStrap including their Datepicker, slugify.js, autosize.js, js.cookie, Modernizr.js, Mousetrap.js for keyboard shortcuts, multi.js for nicer multiselects (tells me I’ve got some redesign work to do in Iris…), Clipboard.js, FontAwesome & Font Linux, jQuery, & a couple more fonts for serverside CAPTCHAs alongside its own robots.txt & security.txt.

The general utils WebLate implements for itself include:

There’s symbols & localized labels for EMPTY, FUZZY, TRANSLATED, & APPROVED states.

As well as abstractions around Whoosh, Django Messaging (for use alongside Django ReST framework), Django templating (without autoescaping but with special context), the current site, & Django Templates localization.


At WebLate’s core is its “trans” Django-app! This provides:

For the sake of templates WebLate provides a template tag to extensively format translations including diff rendering & whitespace highlighting, as well as tags to render random WebLate project self-“adverts”.

Also various simple accessors on checks, name database lookups for slugs, counts, numerous links, rendering messages & checks & translation progress, outputting time more naturally, querying message state, retrieving messages, aiding a tab UI, & checksum, permission, & legal checks.


But mostly it is lots of views & models!

For WebLate is centred around the database models:

Alongside this it defines config fields, & event handling to reflect database changes in repo files. Lots of methods largely relating to VCS, including on Manager/QuerySet classes.

For the core WebLate web UI there’s Django-views for:

Outside the translation view there’s very little in the way of helper functions beyond what I’ve discussed previously. Though Django’s forms framework is used extensively to interpret/validate user input!

Catalog Processing

Beyond the more boiler-platey logic msgcat reads & deduplicates the input filenames from the commandline args & maybe a given file, & calls catenate_msgdomain_list which in turn parses those files & iterates over them to determine the encoding, again to determine the identifications, counts the number of translations for each message, iterates twice to drop unwanted messages, maybe determines a common encoding, determines the output encoding (ideally no conversion) if not given by the user, applies text reencoding, copies all the messages into a single output catalog, & handles duplicates specially.

msggrep parses 5 input “grep tasks” deferring to a libgrep to handle multiple regexp syntaxes using them to compile the given regexps, before filtering all messages by filename, msgctxt, msgid, msgid plural, msgstrs, translator comments, & other comments.

msgcomm works basically the same way as msgcat but with additional globals.

msgconv calls iconv on all parsed messages before writing them back out possibly sorted.

msgfilter runs a subcommand (or builtin function) upon all parsed messages once text-reencoded before serializing them back out possibly sorted.

msgexec runs a subcommand upon all parsed messages once text-reencoded echoing their output instead of serializing results.

msguniq catenates a single file.

msgcmp removes obsolete entries between 2 inputs, extracts the text encoding from header fields to ensure if one’s UTF-8 the other is as well, canonicalizes text for fuzzy matching if requested, allocs an output, iterates over messages to retrieve & display matching entries in the other file, & a final iteration outputs strings which weren’t present.

msgattrib reads the catalog, possibly alongside allow/block-list catalogues to filter by, & iterates over it to update fuzzy & obsolete flags.

And finally msgen (did I cover this already? The name assumes English is the source language) copies source text to translated text for each entry in the parsed file.

You can also build your own utils based on the same library all these commands I’ve been describing use.

Compiling PO files

Once you have sufficiently-translated .po files Gettext requires you to compile them into .mo files, which are an on-file sorted parallel array with an optional hashmap index! To do so you use the msgfmt command, which can be reversed with the msgunfmt command.
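A sketch of that layout as described in Gettext’s documentation: a fixed header, two sorted parallel tables of {length, offset} descriptors (originals & translations), & an optional hash table of indices into those tables.

    #include <cstdint>

    struct MoHeader {
        std::uint32_t magic;             // 0x950412de (byteswapped if endianness differs)
        std::uint32_t revision;          // file format revision
        std::uint32_t nstrings;          // number of string pairs
        std::uint32_t orig_tab_offset;   // offset of the original-string descriptor table
        std::uint32_t trans_tab_offset;  // offset of the translated-string descriptor table
        std::uint32_t hash_tab_size;     // number of hash slots (0 if the index was omitted)
        std::uint32_t hash_tab_offset;   // offset of the hash table
    };

    struct MoStringDesc {
        std::uint32_t length;            // string length excluding the trailing NUL
        std::uint32_t offset;            // offset of the string within the file
    };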

After initializing its own internationalization & parsing commandline flags, handling --version & --help specially, msgfmt validates there are additional commandline args unless handling XML or .desktop input.


Then msgfmt extensively validates those commandline flags, handles .desktop files or directories of them specially (echoing the data it parses with added localizations), handles XML mode specially (parsing a rulelist before merging it with all XML in a directory, both utilizing an external XML parser), possibly allocates a new domain in the absence of an output filename, reads the specified input file according to the specified syntax, checks that syntax produces UTF-8, & removes obsolete strings.

With special cases out of the way & the .po (or whatever) messages parsed, msgfmt now iterates over the catalog domains to check plural formulas match the counts seen elsewhere in the file, whilst trial-evaluating said formulas & performing various other per-message basic checks (i.e. whether translations begin or end with newlines, that format strings match, that both have accelerators, & that metadata has all necessary fields), then outputs the messages in the appropriate syntax & maybe outputs stats.


For .mo output, after validating there are in fact messages to output, msgfmt deletes the “POT-Creation-Date” header for reproducible builds before opening the output file if not stdout (a.k.a. “-“), taking care not to overwrite existing files, and:

  1. With some arrays allocated, iterates over all messages to concatenate msgctxt & msgid into msgctid, test for system-dependent strings, parse C or ObjC format strings to see if there are any platform-specific directives, & gather strings into the appropriate array.
  2. Sorts the platform-independent strings if any were found
  3. Computes the minimum output version
  4. Computes the hashmap size (if desired): a prime number > 3 at least 4/3 the number of entries
  5. Gathers a header struct with or without including the header fields for platform-specific strings
  6. Optionally applies a byteswap to the header & outputs it.
  7. Iterates over strings to prepare length & offset fields, optionally byteswapped, before outputting them.
  8. Does the same for their corresponding translations.
  9. If outputting a hashmap index, allocs/zeroes said hashmap, inserts each entry (using the HashPJW hash function with increment-rehashing, sketched below), optionally byteswaps each entry, & writes them out.
  10. If including platform-specific strings, generates an array splitting them by platform, writing the segments header out followed by the clustered strings
  11. Writes each original string then all of their translations
  12. If needed does the same for platform-specific strings
  13. Cleanup!
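A hedged sketch of the HashPJW-style string hash & the increment-rehashing insertion from step 9, assuming 32-bit hash words & a prime slot count > 3 (the real code lives in Gettext’s sources):

    #include <cstdint>

    std::uint32_t hash_pjw(const char *s) {
        std::uint32_t h = 0;
        while (*s) {
            h = (h << 4) + static_cast<unsigned char>(*s++);
            std::uint32_t g = h & 0xf0000000u;       // top nibble overflow
            if (g) { h ^= g >> 24; h ^= g; }         // fold it back in
        }
        return h;
    }

    // Table of 1-based indices (0 = empty slot); the probe step ("increment")
    // is derived from the hash, so collisions rehash along a second stride.
    void hash_insert(std::uint32_t *slots, std::uint32_t size,
                     std::uint32_t hash, std::uint32_t index) {
        std::uint32_t idx = hash % size;
        std::uint32_t incr = 1 + hash % (size - 2);
        while (slots[idx] != 0) {
            idx += incr;
            if (idx >= size) idx -= size;
        }
        slots[idx] = index + 1;
    }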

After initializing & parsing/validating commandline flags, handling --help & --version, msgunfmt parses each input file for the specified syntax, possibly sorts messages by their ID, & outputs them back out using the same .po serialization most other commands use!

For .mo files (also C#, or separately Java, C#, or TCL) it opens the file (if not stdin, a.k.a. “-“), checks whether we need to swap the byteorder, performs format validations, & iterates over all strings into a messagelist.

Incorporating Translations

Once you’ve gone through the translation process I described the tools for above, you now need to actually incorporate those translations into your software! For this the functions you call to mark text to be translated also look up those translations to be swapped into the UI.

But first you need to call textdomain to set the catalog from which it looks these UI strings up.

If its argument is NULL textdomain returns the current global.

Otherwise textdomain claims a writelock, then examines the arg further. If it’s empty textdomain sets the global (and local) to “messages”. If it’s unchanged it sets the local. Otherwise it sets the global & local to a new copy of the arg.

In any case if the new local is non-NULL it increments the catalog counter & considers freeing the old value, before releasing the lock & returning the new value.

In short textdomain is a slightly fancy accessor.
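A usage sketch of the runtime side (the domain name & path here are illustrative, not prescribed):

    #include <clocale>
    #include <cstdio>
    #include <libintl.h>

    int main() {
        std::setlocale(LC_ALL, "");                      // honour the user's locale
        bindtextdomain("myapp", "/usr/share/locale");    // where myapp.mo catalogs live
        textdomain("myapp");                             // the catalog gettext/_ will consult
        std::puts(gettext("Hello, world!"));
    }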


The rest of Gettext's API including gettext/_ are trivial wrappers around dcigettext. Here I’ll describe how dcigettext works when the args unspecified by gettext/_ are NULLed out. Domains, categories, & plurals will be described later. All of which are handled in this function.

If the UI string is unspecified dcigettext returns NULL. Otherwise it saves the error number, claims readlocks, & retrieves configured catalogue.

After that initialization it populates a cache-search structure, searches that binary search tree under a readlock, & returns the looked-up translation if it found one cached, releasing locks & restoring errorcodes.

Otherwise it determines whether it needs to be more careful due to running as root, determines the directory to search for in the path, & iterates over all configured locales (exiting upon “C” or “POSIX”), mmapping & validating & caching the files so it can look up translations in them.

If it successfully found a translation dcigettext updates the cache (checking whether there's an existing cache entry it can overwrite) before restoring errorcodes, releasing locks, & returning the result.

Otherwise if not in a SUID program it checks $GETTEXT_LOG_UNTRANSLATED to see if (under a lock) it should log the untranslated UI string, possibly lazily-opening the logfile to do so, to aid localizers in prioritizing. Then it returns the untranslated string!


Searching for a localization in a mmapped file involves checking if said file has a hashtable. If so it performs a hashmap lookup (HashPJW with increment rehashing until an empty slot or a match), otherwise it performs a binary search over the sorted keys table (almost as fast!).

In either case, upon success it looks up the translation in either the cross-platform or platform-specific arrays, extensively considers whether we need to convert text encodings, & returns the result with its length.


What with synonyms & context, sometimes the untranslated UI string is not enough to identify the appropriate translation! So for disambiguation dcigettext & some of its wrappers accept a “category” & a heavier-weight “domain”.

Categories get validated first-thing (after the untranslated UI string) at the start of each call, with LC_MESSAGES_COMPAT being converted into the default LC_MESSAGES. They are incorporated into the caching. And are used in determining which locale to use!

I’m failing to see where the functions it calls to convert the category into a locale are defined, even searching online, but I think I can infer they relate to LibC’s APIs. Unless this locale is “C” it then consults $LANGUAGE & applies platform-specific preprocessing to normalize the format, before returning that priority list.

The category is then incorporated into the filename of the .mo files it should consult. I described how it handles the priority list yesterday.

Domains default to a configurable global, are considered in cache lookup, are located within a global path to get a directory to look for the .mo within, are incorporated into the .mo filepath, & are incorporated into missing-message logging.

Plurals

Many if not most natural languages have different grammatical structures (“pluralforms”/”plurals”) to indicate different quantities. Though not every language agrees how quantities map to their pluralforms! e.g. Is 0 plural, not, or something else?

One of those things you might assume is trivial…

Gettext’s facilities for this assumes English as a source language, though I suspect those assumptions can easily be overcome for programming in other languages.


dcigettext & its many wrappers will resolve plurals (count defaults to 0) once they’ve successfully looked up a translation in the configured/selected catalogue. If unsuccessful it may optionally apply an English-like “Germanic” n == 1 choice between the 2 given strings.

This involves interpreting (for the given count) the plural formula from the catalogue & iterating over the multistring to find the computed index. Which .mo compilation validates, via bruteforce, stays in range.
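From the caller’s side this is the ngettext family; a usage sketch:

    #include <libintl.h>
    #include <cstdio>

    // ngettext picks a plural form via the catalogue's formula, or falls back
    // to the Germanic n == 1 choice between the two strings given here.
    void report(unsigned long n) {
        std::printf(ngettext("Removed %lu file\n", "Removed %lu files\n", n), n);
    }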


Interpreting the plural formula is done over the abstract syntax tree, recursively branching over the number of operands (0-3 inclusive) before branching over & applying the mathematical operation.
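A hedged sketch of interpreting such an AST (not Gettext’s actual node layout, & only a handful of the operators): each node stores an operation & up to three operands, evaluated recursively for the given count.

    struct PluralExpr {
        enum class Op { Number, Var, Mod, NotEqual, Ternary } op;
        unsigned long value = 0;                 // for Number nodes
        const PluralExpr *args[3] = {};          // operands, as needed per Op

        unsigned long eval(unsigned long n) const {
            switch (op) {
            case Op::Number:   return value;
            case Op::Var:      return n;
            case Op::Mod:      return args[0]->eval(n) % args[1]->eval(n);
            case Op::NotEqual: return args[0]->eval(n) != args[1]->eval(n);
            case Op::Ternary:  return args[0]->eval(n) ? args[1]->eval(n)
                                                       : args[2]->eval(n);
            }
            return 0;
        }
    };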

Said expression is parsed when loading in the .mo file by locating the “plural=” & “nplurals=” header fields of the metadata entry (the translation for “”), parsing nplurals via strtoul after scanning digits, & parsing plural using Bison & a manual lexer. Relatively trivial usage. Defaults to returning an AST representing the Germanic n != 1 expression.