UNIX Standard Libraries

No programmer gets far without reusing existing code, provided by the operating system if not by others as well. At the very least our software needs to communicate with other programs (most crucially the Linux kernel), and through them with humans, to have any reason to exist. These units of reusable code are referred to as “libraries” or “modules”.

GNU LibC

GNU LibC, following the POSIX standards, is the most fundamental library on most UNIX-like systems today, providing I/O, memory allocation, filesystem manipulation, etc. Put simply, its primary job is to abstract away the datastructures exposed by Linux, central config files, GCC, & the underlying CPU. Full details on the linked page.

Related: WeLibC’s blog

StdC++

C++ is a variant of C which adds syntax for classes & more. As such C++ has its own standard library atop C’s to take advantage of these syntactic features.

Though projects may implement their own C++ standard library, reimplement something similar-but-not-identical via an extremely verbose set of C macros, or use a higherlevel language.

C++’s stdlib mostly just wraps C’s (which itself often uses methodtables internally), so that should make this faster to get through!

I’m ignoring the headerfiles in this study. I skimmed them; they do contain some nontrivial implementation, but nothing that interesting.

Compiling these C++ files works mostly the same as for C, though with added complications throughout. Still nothing compared to what’s required to make optimal use of a variety of modern CPUs!


In the implementation code there’s an array of prime numbers used for implementing hashmaps in those headerfiles, and maybe elsewhere too.

For the C++ filesystem APIs (as provided by GCC) there’s a _Dir_base struct which wraps opendir/readdir/closedir. Separate classes further abstract this with a C++ iterator, maybe a depth-first stack, & more.

There’s platform-independent reexports of the basic file operations, followed by utils for: reading the time, filetype, & status properties from the result of the “stat” syscall; reformatting copy-file options; copying a file from a source path to a destination path after validating that destination using “stat” & the above functions, throwing any relevant errors & [f]chmoding it (uses the sendfile syscall with a byte-copying fallback); & platform-independently retrieving available diskspace.

There’s a large suite of higher level file operations abstracting these, including finding absolute & canonical paths, copying files & symlinks, creating dirs & sym/hardlinks, etc.

There’s a path struct (used by directory traversal) which stores a “components” array/stack (split by “/”) for userspace traversal alongside the filepath string, overrides the divide operator to concatenate paths, tracks the filepath type, & provides other filepath manipulation/comparison/hashing utilities.


Introduced in C++ 98…

There’s a simple hash function for doubles involving multiplies, adds, & divides. Strings use FNV hashing, as implemented in the headerfiles.
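For reference, a minimal sketch of FNV-1a as it’s commonly implemented (32-bit constants shown; libstdc++’s actual version works at size_t width, so treat this as illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* FNV-1a: start from an "offset basis", then for every byte XOR it in
     * & multiply by a fixed prime. Simple, fast, & a decent distribution
     * for hash tables. */
    static uint32_t fnv1a_hash(const void *data, size_t len)
    {
        const unsigned char *bytes = data;
        uint32_t hash = 2166136261u;          /* offset basis */
        for (size_t i = 0; i < len; i++) {
            hash ^= bytes[i];
            hash *= 16777619u;                /* FNV prime */
        }
        return hash;
    }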

ios_base contains numeric formatting parameters, & error-throwing methods.

basic_istream has an ignore(n) method, with error handling & buffering.

There’s transfer, reverse, & [un]hook doubly-linked list methods.

free_list appears to allocate out of arrays. codecvt implements format conversion upon generics.

There’s a routine for wrapping stdin/stdout/stderr with [w]istream & [w]ostreams.

The >> operator & getline methods on basic_istreams retrieve the underlying buffer & read chunks until they find the delimiter.

There’s code for serializing & validating float format strings.

There’s a global of type __gnu_parallel::_Settings.

It ensures you can do trigonometry, etc on long doubles & floats.

There’s functions for copying data from one basic_streambuf to another until EOF.

__pool_alloc_base allocates its larger chunks (split between multiple callers) via a bumppointer, falling back to a smallintmap of linkedlist freelists, with error handling.
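As a rough illustration of that strategy (not libstdc++’s actual code; all names here are made up), a pool allocator along these lines rounds requests up to a size class, pops from that class’s freelist when possible, & otherwise carves the object out of a larger malloc’d chunk with a bump pointer:

    #include <stdlib.h>

    #define ALIGN 8
    #define MAX_SMALL 128
    #define NCLASSES (MAX_SMALL / ALIGN)
    #define CHUNK_SIZE 4096

    /* Freed objects get threaded into per-size-class linked lists. */
    union free_node { union free_node *next; };

    static union free_node *free_lists[NCLASSES];
    static char *bump_ptr, *bump_end;   /* the current chunk being carved up */

    static void *pool_alloc(size_t size)
    {
        if (size == 0 || size > MAX_SMALL)
            return malloc(size);                    /* big requests go straight to malloc */
        size = (size + ALIGN - 1) & ~(size_t)(ALIGN - 1);
        union free_node **list = &free_lists[size / ALIGN - 1];
        if (*list) {                                /* fast path: reuse a freed object */
            void *obj = *list;
            *list = (*list)->next;
            return obj;
        }
        if (bump_ptr == NULL || (size_t)(bump_end - bump_ptr) < size) {
            bump_ptr = malloc(CHUNK_SIZE);          /* chunk exhausted: grab another */
            if (bump_ptr == NULL) return NULL;
            bump_end = bump_ptr + CHUNK_SIZE;
        }
        void *obj = bump_ptr;                       /* bump-pointer carve */
        bump_ptr += size;
        return obj;
    }

    static void pool_free(void *obj, size_t size)
    {
        if (size == 0 || size > MAX_SMALL) { free(obj); return; }
        size = (size + ALIGN - 1) & ~(size_t)(ALIGN - 1);
        union free_node *node = obj;
        node->next = free_lists[size / ALIGN - 1];
        free_lists[size / ALIGN - 1] = node;
    }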

gslice_to_index does something involving populating certain indices of an array with indices &, in part, values computed via an intermediary array.

There’s a locale class for parsing locale configuration from strings, environment variables, & elsewhere. Includes a cache.

There’s a redblack tree implementation. A redblack tree is a binary search tree which keeps its depth balanced by colouring each node “red” or “black” & enforcing two invariants: a red node never has a red child, & every path from the root down to a leaf passes through the same number of black nodes. The longest path is therefore at most twice the shortest, ensuring all keys present are roughly equally fast to lookup!
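A hedged sketch of what such a node & its invariants look like (illustrative names, not libstdc++’s _Rb_tree internals):

    /* Illustrative red-black tree node; libstdc++'s _Rb_tree_node differs in detail. */
    enum rb_color { RB_RED, RB_BLACK };

    struct rb_node {
        enum rb_color   color;
        struct rb_node *parent, *left, *right;
        int             key;    /* the real container stores an arbitrary value here */
    };

    /* Invariants maintained on every insert/erase via recolouring & rotations:
     *  1. the root is black;
     *  2. a red node never has a red child;
     *  3. every root-to-leaf path contains the same number of black nodes.
     * Together these bound the tree's height at O(log n). */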

There’s a stream implementation over strings (strstreambuf) with read/write indices & reallocation.

The __pool allocator has plenty of setup & cleanup, managing a segmented array to allocate out of, with a free array as fallback.


In C++11 the following was introduced to its standard library…

A hashfunction on doubles, & another (FNV) on strings, for use by a hashmap implementation sized to prime numbers (shortcut smallintmap for choosing capacities by length < 13).

A “future” type tracking locked state of arrival of asynchronous data.

A wrapper class around atomic mutexes.

A string implementation for those compiled into your executable, dynamically copied to heap as-needed.

Then there’s new exceptions!

There’s a class for character categorization.

There’s a pseudorandom number generator backed by a “random device” class.

There’s further wrappers around mutexes like _Sp_locker & atomic conditions. Upon which the gthread class is built.

There’s a trivial snprintf implementation.

There’s a class representing locale-related configuration.

“futex” class wraps both those syscalls (as well as time syscalls) & userspace atomics.

The “system_clock” class wraps the gettimeofday syscall.

There’s an implementation of swappable buffers to underly I/O streams (ios_base class).

Threads have methods wrapping other syscalls.

There’s copy-on-write strings which reallocate into heap upon first write.

There’s an atomic flags class.

There are more locale classes.

There’s text decoding.

There’s debugging utilities for outputting human-semireadable data & attaching profilers to sequences/collections.


In C++17, the following were added to the standard library:

There’s an allocator out of a pre-allocated buffer, used for parsing & formatting between numbers & strings. Error handling is involved.

For some reason there’s a struct wrapping fixed-sized int types with operator overloading.

Additional filesystem classes were introduced, including ones for iterating over directories, filepaths (including copy, create, equivalence, etc operations) largely wrapping previous implementations.

There’s a reimplementation of formatting floating point numbers to (decimal or hexadecimal) text.

There’s an arena allocator, with its own bitset implementation & suballocators, whereby it increments a pointer into a “chunk” until it overflows, at which point another chunk is allocated. All memory allocated in a “pool” is only freed when the pool itself is freed.

Much of this implementation for C++17 is built on code taken from “ryu”…

The code GCC’s C++17 implementation takes from Apache’s Ryu library uses lookuptables with multiplies, adds, shifts, & comparisons for postprocessing, to compute powers, inverse powers, the corresponding divides, & (via a separate lookuptable) reencoding. It may use similar math (without lookup tables) to compute multiplies, shifts, divides, remainders, etc.

It decomposes floats &, using those power computations, a digit lookup table, & debugging messages, formats them as text.


For C++20 I just see an internal headerfile defining basic_stringbufs, & basic_[i,o]stringstreams. So I can’t comment further. C++20 didn’t change the standard libraries much…

Then there’s the more extensive public headers which contain more of the implementation…

GNU CoreUtils

GNU CoreUtils provides the majority of the most-used commands on the commandline. These mostly consist of language bindings from the C standard lib to the text-heavy user-interactive terminal; those C standard lib calls in turn are mostly language bindings to kernel systemcalls via Assembly!

GNU CoreUtils’ commands are largely mutually independent, with a tiny bit of shared code. There is some overlap with Bash’s builtin commands, in which case those take precedence.

All these commands, and more, that I’m describing can be compiled into a single coreutils executable whose entrypoint compares argv[0] to all commandnames known to coreutils.h & calls the corresponding function. Otherwise it parses some commandline longflags to determine the function to call, or performs typical initialization (with --help also listing those commandnames), possibly outputting an error.
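A hedged sketch of that single-binary dispatch pattern (the real coreutils.h generates its table with macros; the names & table here are illustrative):

    #include <libgen.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical per-command entrypoints; the real ones live in each command's .c file. */
    int true_main(int argc, char **argv)  { (void)argc; (void)argv; return 0; }
    int false_main(int argc, char **argv) { (void)argc; (void)argv; return 1; }

    static const struct { const char *name; int (*main)(int, char **); } commands[] = {
        { "true",  true_main  },
        { "false", false_main },
    };

    int main(int argc, char **argv)
    {
        /* Dispatch on the program's own name, so hardlinking or symlinking the
         * combined binary as "true", "false", etc selects the matching command. */
        const char *name = basename(argv[0]);
        for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++)
            if (strcmp(name, commands[i].name) == 0)
                return commands[i].main(argc, argv);
        fprintf(stderr, "%s: unknown command\n", name);
        return 1;
    }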

Yes, you don’t need to switch to BusyBox for this!


yes (after initializing internationalization, registering an exit callback to safely fclose standardout, & parsing common GNU commandline flags namely --help & --version; yes there’s a minor shared library, mostly for normalizing UNIXes) adds “y” to the commandline args if missing, computes memory size needed for space-concatenating (unparsing) those commandline args then allocates with a minsize (duplicating text if too small) & does so.

Then it infinitely loops writing that to stdout!
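In miniature that’s roughly the following sketch (the real yes concatenates many repetitions into one big buffer so each write syscall emits far more than a single line):

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        for (;;) {
            /* Print the args space-separated, or "y" if none were given. */
            if (argc <= 1)
                fputs("y", stdout);
            else
                for (int i = 1; i < argc; i++) {
                    if (i > 1) putchar(' ');
                    fputs(argv[i], stdout);
                }
            putchar('\n');
        }
    }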

whoami after typical initialization & validating no additional args are given, wraps geteuid syscall (a getter for a property of the current process) & getpwuid.

uptime after typical initialization retrieves the optional commandline argument & reads that utmp file (later topic) before retrieving the system’s uptime in seconds via kernel-specific facilities (on Linux /proc/uptime) & manually converting that into a human-readable & localized value to print via printf.

unlink after typical initialization retrieves its single argument & wraps the unlink systemcall (deferred to an OO-method on the appropriate filesystem).

uname, on more standard UNIXes & after mostly-typical initialization parsing commandline flags itself with no other commandline args, optionally wraps the uname syscall outputting specified fields of it space-separated, & optionally makes a couple sysinfo calls to output CPU & hardware names.

tty, after mostly-typical initialization, parses -s, -h, & -v flags itself validating there’s no further commandline arguments, & depending on the presence of -s either wraps isatty outputting via exit code or ttyname outputting via stdout.

truncate, after initialization, parses & validates commandline flags, stats & opens the first (non-optional) commandline arg, & for each of the rest temporarily opens it, computes missing flags via fstat & maybe fseek before wrapping ftruncate.

true & false optionally run the typical initialization if there’s only two arguments, with their main purpose being to immediately exit with a specific errorcode.

touch, after mostly-typical initialization, parses & validates commandline args possibly lstating a reference file or calling gettime syscall to fill in missing fields, optionally parses the given date, before iterating over commandline args wrapping fopen and optionally fdutimensat.

tee, after mostly-typical init, parses flags, disables specified signal handlers, configures binary for stdin & stdout with sequential access optimizations, opens all the args into a newly-allocated array, repeatedly reads blocks from stdin writing them to all of them, & once all outputs or the input close it closes any remaining files.

sync, after mostly-typical init, parses & validates flags, for each arg temp-opens that file, fcntl(F_GETFL)s it, & wraps f[data]sync or syncfs.

sleep, after typical initialization, parses each commandline arg as a double + unit, summing the results to pass to xnanosleep.

runcon (simple sandboxing), after mostly-typical initialization, repeatedly parses & validates flags, calls into SELinux to construct a new context possibly based on the current one, validates & saves that context, & execvps the trailing arguments.

rmdir wraps that syscall maybe ignoring certain errors & maybe iterating over parents.

realpath, after mostly-typical initialization, parses commandline flags validating there’s no further commandline args, normalizes the base path in different modes, & for each arg normalizes it, maybe finds common prefix with base to reformat into a relative path, & outputs it.

pwd, after mostly-typical initialization, parses commandline flags to determine whether to return normalized $PWD or getcwd with fallback concatenating all parent dirs from “.”.

readlink, after mostly-typical initialization, parses commandline flags validating argcount based on -n, & for each arg outputs the result of a readlink syscall wrapper reallocating a larger buffer on fail or of a hashmap-aided common utility iterating over the link normalizing it whilst readlinking any intermediate links realloc’ing as needed.

printenv, after mostly-typical initialization, parses commandline flags & iterates over libC’s environ to output all or specified envvars.

printf, after mostly-typical initialization, checks for --help or --version flags performing their special tasks, validates there’s at least 1 commandline argument, & interprets that first argument akin to C’s standard lib concatenating the given text with appropriate escaping, unescaping, & validation.

nproc, after mostly-typical initialization, parses commandline flags before checking OpenMP envvars before consulting traditional kernel-specific syscalls (now sched_getaffinity’s standard).

nohup, after typical initialization, checks $POSIXLY_CORRECT to determine which error code to use, validates there are args, checks whether we’re running in the terminal, possibly NULLs out stdin, opens a new stdout/stderr, disables SIGHUP signals, & execvp’s its remaining args.

nice, after mostly-typical initialization, parses flags with special path, parses offset if given, wraps nice or setpriority, & execvps remaining args. If there are none another codepath outputs niceness.

mktemp, after mostly-typical initialization, parses its commandline flags, validates its commandline args possibly providing defaults, concatenates on a given suffix and/or containing directory, & calls gen_tempname_len with appropriate bitflags. Which GNU CoreUtils reimplements themselves in case it’s not available in LibC.

mknod, after mostly-typical initialization, parses & validates it’s commandline flags including SELinux/SMACK context, & wraps makedev + mknod or mkfifo.

mkfifo, after mostly-typical initialization, parses commandline flags including SELinux or SMACK context, possibly calls mode_compile/umask/mode_adjust, & for each commandline arg error-handled mkfifo possibly with preceding SELinux defaultcon.

logname, after typical initialization, validates there are no commandline args, wraps getlogin, & puts the result.

link, after typical initialization, validates there’s exactly 2 commandline args & wraps the link syscall.

mkdir parses flags including SMACK/SELinux context, validates there are args, possibly calls mode_compile/adjust & umask, & for each arg temp-registers the security context whilst calling a thoroughly error-handled wrapper around mkdir syscall. With or without a callback which mkdirs any missing parents.

kill parses commandline flags to either call the kill syscall on each arg as a pid, or serialize each arg as a signal name with or without its number. Or the same for all signals.

id, after mostly typical initialization, parses & validates commandline flags, & for each commandline arg parses it as a user+group (an int or two) to wrap getpwuid & print specified properties. If there were no commandline args it instead retrieves its own user/group to print the same specified properties.

hostname, after typical initialization, wraps sethostname if single commandline arg or gethostname if there’s none. Otherwise complains.

hostid wraps gethostid.

groups, after typical initialization, wraps getpwnam for each arg. With getpwuid, getgroups, and/or getgrgid calls to help output its result. Or if there were no args retrieves the process’s own uid, & (effective) gid to output the same way.

getlimits, after typical initialization, prints various numeric constants from LibC.

echo, after mostly-standard initialization if --help or --version aren’t given, possibly parses flags and/or unescapes each arg & outputs each in turn.

dirname, after mostly-typical initialization, parses flags & for each arg (complains if none) locates last “/” in the path & outputs up to that point.

date, after mostly-typical initialization, parses & validates flags largely into a format string with a default, computes input time via posixtime or gettime or stat+get_stat_mtime or parse_datetime2, optionally calls settime, & ultimately wraps fprintftime. Or if given -f parses & reformats each of that file’s lines.

chown, after mostly-typical initialization, parses & validates commandline flags & retrieves ownership from stating a reference file or parses the specified user, before opening & normalizing all the specified files, deciding via fstatat whether to actually call chown, & possibly then doing so with the appropriate chown variant. It uses a shared FTS abstraction to implement recursion, which I’ll study tomorrow.

chgrp works very similarly, using the exact same core function.

chmod, after mostly-typical initialization, parses & validates flags, parses the given mode (usually from first arg) manually parsing & reformatting octal ints or as text or copies this info from stating a reference file (also there’s normalization), uses that same FTS abstraction to implement optional recursion, & for each given file dequeued from FTS it handles (possibly outputting) any dequeued errors, with validation/normalization calls chmod, & possibly outputs info about it.

chroot, after mostly-typical initialization, parses & validates commandline flags, wraps chroot, parses & retrieves groups (as ints or via getgrnam) to also wrap the setgroups syscall if available, & execvps remaining args.

And finally basename, after mostly-typical initialization, parses & validates commandline flags before iterating over each commandline arg looking for “/”s & maybe a given suffix to strip off the result it fputs.


Some of GNU CoreUtils’ commands for manipulating the filesystem are more interactive, allowing optional recursion, prompts, & error reporting. Namely cp, mv, & rm. ln & ls appear to be in a similar bucket. These are the topics of today’s study!

cp, after mostly-typical initialization, checks whether SELinux is enabled & initializes a context, parses & validates commandline flags into that context (& a global for e.g. backup suffixes), registers SELinux context, & with a global closed hashmap cp performs some further flags validation, & if the destination is a directory it might start populating that global hashmap & iterates over the remaining args to maybe strip off trailing slashes, maybe concatenate on a parent directory possibly creating it or preprocess away “..”, performs core logic if parent dir exists possibly followed by chown/chmod/SELinux syscalls. Or (after lowering a backup edgecase) directly calls this core logic!

That core logic for copying a file involves possibly considering (for mv not cp) calling renameat2 or equivalent available syscall, possibly carefully fstats to warn about non-recursively copying directories, uses the hashmap to validate args aren’t duplicated, if the move didn’t succeed or wasn’t performed might [l,f]stat it, maybe perform validation against copying a file to itself emitting an error upon failure, maybe checks timestamps, considers whether to abandon the op, warns about overwriting a file with a directory or vice versa, performs checks against losing data, maybe applies autobackup operations, and/or attempts unlinking the destination. Then cp/mv validates it won’t copy into a symlink it created in the process of copying, considers writing verbose output, updates or consults the hashmap, to perform yet more validation, for mv attempts to apply a rename op followed by optional SELinux updates, verbose/error output, and/or hashmap updates.

Then computes permissions whilst configuring SELinux state, if it’s copying a directory it performs more validation, possibly updates hashmap and/or verbose output, & co-recurses over a multistring collected from readdir syscall. If it’s a symlink carefully calls the symlinkat syscall. If it’s a hardlink carefully calls the linkat syscall. If it’s a fifo calls mknod falling back to mkfifo. If it’s a device file calls mknod. There’s another link copying case. Then cleans up.

To copy a regular file cp/mv opens & fstats it with a bit more validation, opens the destination file configuring SELinux permissions & cleaning up on failure with a couple more alternate open paths, validates & fstats the destination, possibly attempts to use the clone syscall, or carefully reads from one file to write the data to the other whilst carefully avoiding/punching “holes” then copies permissions over.


mv works basically the same as cp, running the same logic in a different mode & incorporating a trailing call to rm’s core logic.

rm in turn, after mostly-typical initialization, initializes some parameters for its core logic then parses & validates flags into it possibly prompting the user (via a possibly localized shared utility function) with the argcount. Before calling the core logic shared with mv. Which incorporates the FTS utility for optional recursion’s sake.

Deleting each file involves checking the file’s type. If it’s a directory it might complain (especially if it’s “.”, “..”, or “/”) possibly flagging ancestor dirs before prompting & carefully unlinking it. If it’s a regular file, stat failure, symbolic link with or without target, postorder or unreadable directory, dangling symlink, etc it possibly flags ancestor dirs, possibly prompts (gathering data to report), & unlinks the file. It skips cyclic directories & reports errors.

ln, after mostly-typical initialization, parses & validates commandline flags before extracting the target & destination filepaths from commandline args possibly creating & fstating it. Then ln initializes autobackup globals, considers initializing a deduplication hashmap, & with some preprocessing runs the core logic per argument. There’s a separate codepath to this core logic for single arguments.

This core logic involves possibly attempting to call [sym]linkat before diagnosing errors whilst applying backups & tweaking filepaths to try again. If either succeeded it possibly updates the hashmap and/or verbose output. Or reports the failure undoing backup.

ls, after typical initialization, parses & validates its extensive commandline flags followed by $LS_COLORS/$COLORTERM in which case it disables tabs & performs postprocessing, if recursion is enabled initializes a hashmap & obstack, retrieves the timezone, initializes “dired” obstacks, possibly initializes a table for escaping URIs whilst retrieving the hostname, mallocs a “cwd file”, clears various state, enqueues commandline args whilst lstating them & deciding how to render, dirs are enqueued in a linked list, results are optionally mpsorted then it separates out dirs to be enqueued, the current batch of files are printed to stdout in a selection of formats, then it dequeues each directory skipping cycles & repeating similar logic, then cleans up after colours, dired, and/or loop detection if any of those were used.

ls reimplements a tiny subset of ncurses (& lookuptables from e.g. filetypes to colours) for the sake of columnized & colourized output.


To help implement optional recursion in rm, chown, chmod, etc, GNU CoreUtils implements an “FTS” utility.

To open an FTS filesystem traversal it validates the arguments, callocs some memory whilst saving some properties, might test-open “.”, computes the maximum length of its arguments to allocate memory to store any of them in, allocs a parent FTS entry & one for each commandline argument referencing it possibly qsorting them, & allocs the current entry.

To dequeue the next entry from FTS it performs some validation, considers re-yielding & re-lstating the previous entry for certain kernel errors, considers calling diropen for recursion’s sake. If the current entry’s a directory it may close it if instructed to by the caller, possibly clears its fts_child property, possibly calls diropen whilst carefully avoiding “..” whilst updating a ringbuffer and/or the process’s current directory, & traverses to the directory’s child.

Then it iterates to the next child by calling dirfd or opendirat+lstat whilst handling errors, decides whether to descend into directories, initializing some vars, & before cleanup/sorting repeatedly calls readdir, allocing/initializing memory to hold the new entry & its filepath carefully handling errors, lstats the file for more fields to store, & inserts into a linked list.

If it found a next entry it gets validated & tweaked as instructed by caller whilst recalling lstat.

Once all the entries in a directory have been traversed it follows the parent pointer freeing previous memory, & validates/tweaks it before yielding that virtual entry.

This tree traversal may be augmented with a hashmap to detect cycles.

I don’t see much use of that ringbuffer…
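From the caller’s side, an FTS-style traversal gets driven roughly like this; a hedged sketch of how an rm/du/chmod-style command might walk a tree using the fts(3) API:

    #include <fts.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s dir...\n", argv[0]); return 1; }

        /* fts_open takes a NULL-terminated argv-style list;
         * FTS_PHYSICAL means don't follow symlinks. */
        FTS *fts = fts_open(&argv[1], FTS_PHYSICAL | FTS_NOCHDIR, NULL);
        if (!fts) { perror("fts_open"); return 1; }

        FTSENT *ent;
        while ((ent = fts_read(fts)) != NULL) {
            switch (ent->fts_info) {
            case FTS_D:  printf("dir  %s\n", ent->fts_path); break;  /* pre-order directory */
            case FTS_DP: break;                                      /* post-order directory */
            case FTS_F:  printf("file %s\n", ent->fts_path); break;
            case FTS_ERR:
            case FTS_DNR:
            case FTS_NS: fprintf(stderr, "%s: error %d\n", ent->fts_path, ent->fts_errno); break;
            default:     break;
            }
        }
        fts_close(fts);
        return 0;
    }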


Beyond exposing language bindings for LibC’s/Linux’s syscalls, the other dominant task GNU CoreUtils’ commands perform is textual transformation or summarisation of stdin. The simpler cases of which I’ll describe today!

uniq, after mostly-typical initialization, parses & validates commandline flags whilst gathering an array of 2 filepaths. Which if not “-“ will be freopened over stdin & stdout with sequential access & (internal util) linebuffering optimizations enabled.

The fastpath reads each line from the input stream (enlarging the buffer until it finds the configured delimiter, defaults to newline), skips a configured number of whitespace-separated fields & then chars, compares against the previous row case-sensitively or not, & depending on grouping mode outputs a delimiter and/or outputs the line whilst updating state.

The slowpath also tracks an error-checked count of repeated lines & whether we’ve seen our first delimiter, moving a line read & write out of the loop.
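Stripped of field-skipping, counting, & grouping, uniq’s core loop amounts to something like this sketch:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    int main(void)
    {
        char *line = NULL, *prev = NULL;
        size_t cap = 0, prevcap = 0;
        ssize_t len;

        /* Print each line that differs from the one before it. */
        while ((len = getline(&line, &cap, stdin)) != -1) {
            if (prev == NULL || strcmp(line, prev) != 0)
                fputs(line, stdout);
            /* Remember this line for the next comparison by swapping buffers. */
            char *tmp = prev; prev = line; line = tmp;
            size_t tcap = prevcap; prevcap = cap; cap = tcap;
        }
        free(line);
        free(prev);
        return 0;
    }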

unexpand converts spaces back to tabs. After mostly-typical initialization & parsing & normalizing commandline flags into e.g. a tabstop array & temp-allocated filename list (both via a util shared with expand), it pops & fopens (with sequential optimizations) the first file off that array, mallocs a blank column, & repeatedly reads the next char popping the next file upon EOF, looks up the appropriate tabstop from the array upon blank chars stopping future conversions if it goes beyond the end, validates the line wasn’t too long, replaces the whitespace with a tab char if it was already one or we’ve changed tabstops, decrements column upon \b recalculating tabstops, & otherwise increments column. If it has prepared pending whitespace to write it’ll finalize & output it. Then outputs the non-whitespace char.

Repeating until end-of-line (innerloop) & end-of-files (outerloop).

tac, after mostly-typical initialization & commandline flags parsing including regexp-parsing the “sentinel” & bytesize validation, finds the remaining commandline args defaulting to “-“, configures binary output mode, & before flushing remaining output & cleaning up for each arg it opens it in binary mode (handling “-“ specially), lseeks to the end, handles seekable & non-seekable files differently, & cleans up.

For nonseekable files it copies over to a seekable file before tacing it.

For seekable files, or after converting non-seekable files, it normalizes the computed seek offset to be multiple of a precomputed read_size lseeking there, before lseeking back then forward a page at a time looking for EOF, & repeatedlies runs the configured regex to find the configured line seperator OR performs a simpler fixed-size string-in-string search, if it didn’t find a match at filestart it outputs a line & exits. Or it reads from line start into newly realloced memory.

If it found a match it outputs that line with or without trailing line seperator, updating past_end & maybe match_start properties.

There’s also an in-memory codepath I don’t see used.

paste, after mostly-typical initialization & parsing commandline flags, defaults args to “-“ escaping those filepaths, runs serial or parallel core logic before cleaning up based on one of those flags.

That serial logic involves opening each file (handling “-“ specially) with sequential optimization.

After opening each file checking for empties, it then copies individual chars from input to output replacing any line delims & adding a trailing one if needed.

The parallel logic involves opening each file validating stdin handling, then repeatedly iterates over each file considering outputting extra delims from a preprepared buffer before repeatedly copying chars from input to output.

Or if the file was already closed it considers which state it needs to update or delims to output.

nl, after mostly-typical initialization & parsing flags, prepares several buffers, processes each file (defaulting to just “-“ a.k.a. stdin) & maybe fclosing stdin, & returns whether all of those files were successful.

Processing a file involves fopening it (handling “-“ specially) with sequential optimizations, reading each line determining via memcmp whether we’re in a header, body, footer, or (where the real logic/incrementing+output happens) text. Resets counter for non-text.

join, after mostly-typical initialization, registers to free an array on exit, parses & validates commandline flags, gathers an array of filenames whilst determining which join fields to use, & with the two files open handling “-“ specially it runs the core logic.

This core logic consists of enabling sequential read optimizations, initializing state for both of the input files populated with their first line, maybe updating some autocounts, maybe running an initial join, & repeatedlies…

For each pair of lines join memcmps the appropriate field case-sensitively or not, might output a linked list of fields or just the fields being joined from file at lower key whilst advancing it to next line, advances leftfile until no longer equal then same for rightfile, maybe outputs those lines, & updates each file’s state whilst checking for EOF.

Trailing lines from either file are possibly printed after this loop & memory is cleaned up.

Fields are split upon reading each line.

head, after mostly-typical initialization, parses & validates commandline flags possibly with special handling for integral flags, defaults remaining args to “-“, & in binary output mode iterates over all those args for each temporarily opens the filepath (handling “-“ specially) optionally outputs a filepath header surrounded by fat arrows & uses different core logic for whether we’re operating in terms of lines or bytes & whether we’re outputting a finite number of them.

For fixed number of bytes head copies a bufferful of data at a time until we’ve met that target.

For a fixed number of lines head copies a bufferful of data at a time whilst counting newlines, until we’ve hit the line count. Or rather it decrements the linecount until 0.
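That newline-counting loop amounts to roughly the following sketch (the real head works on raw file descriptors & handles partial reads/writes more carefully):

    #include <stdio.h>
    #include <string.h>

    /* Copy stdin to stdout until `lines` newlines have been emitted. */
    static void head_lines(unsigned long lines)
    {
        char buf[8192];
        size_t n;
        while (lines > 0 && (n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            size_t emit = n;
            /* Scan the buffer for newlines, decrementing the remaining count;
             * if it hits zero, only emit up to & including that newline. */
            for (char *p = buf; (p = memchr(p, '\n', buf + n - p)) != NULL; p++) {
                if (--lines == 0) { emit = (size_t)(p - buf) + 1; break; }
            }
            fwrite(buf, 1, emit, stdout);
        }
    }

    int main(void) { head_lines(10); return 0; }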

To output all but the last n bytes head, if it can query the filesize, copies a computed number of bytes a bufferful at a time. Or in 1 of 2 ways copies a buffer of computed size at a time, chopping off n bytes once it reaches EOF.

To output all but the last n lines on a seekable file head reads it backwards a bufferful at a time counting newlines until it finds the stop point in bytes. Then copies a bufferful at a time until it reaches that point.

To output all but last n lines on a pipe head allocs a linked list & repeatedly reads a bufferful at a time maybe immediately outputting the line if we’re not eliding anything, counts newlines in that buffer, & considers merging buffers or outputting the old head.

To wrap text to a fixed width fold, after mostly-typical initialization & flags parsing, iterates over every arg falling back to “-“. For each it temporarily opens the file (handling “-“ specially) with sequential optimizations & reads a char at a time adding them to a buffer.

Upon newlines it writes the buffered text. Otherwise computes new column handling \b, \r, & \t specially. If overflows given width it might locate last buffered whitespace to output until, or outputs full buffer.

fmt, after mostly-typical initialization & parsing flags handling digits specially, iterates over & fopens the args falling back to stdin as “-“. For each it enables sequential optimizations, handles a configured prefix & preceding blank lines, then optionally reads the rest of the paragraph collapsing solo-newlines. For each such paragraph it performs split-costed linewrapping & outputs them in a separate pass. Then tidies up errors after the loop.

fmt’s a more sophisticated fold!

To replace tabs with spaces expand, after mostly-typical initialization & parsing commandline flags, finalizes tabstops & saves an array of commandline arguments falling back to “-“ then dequeues one, & before possibly cleaning up reading stdin repeatedlies: reads each char from each file in turn, upon tab looks up the corresponding tabstop & outputs the appropriate number of spaces, decrements column upon \b, or increments column, then outputs the read char (except tabs, translated to spaces).
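Ignoring the configurable tabstop list (assume a fixed width of 8), that loop is roughly:

    #include <stdio.h>

    #define TABWIDTH 8

    int main(void)
    {
        int c;
        long column = 0;

        while ((c = getchar()) != EOF) {
            if (c == '\t') {
                /* Pad with spaces up to the next multiple of TABWIDTH. */
                do { putchar(' '); column++; } while (column % TABWIDTH != 0);
            } else {
                if (c == '\n')      column = 0;
                else if (c == '\b') { if (column > 0) column--; }
                else                column++;
                putchar(c);
            }
        }
        return 0;
    }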

cut, after mostly-typical initialization & parsing & validating flags including field selection, iterates over all remaining args falling back to “-“, & cleans up after parsed fields & reading stdin. For each it temporarily fopens the file (handling “-“ specially) with sequential optimizations, handling byte & field counts differently.

For byte cuts it counts non-delimiter chars locating appropriate cut entries to determine when to output the delimiter, whilst copying all chars out unless the current cut entry indicates otherwise.

Field cuts works essentially the same, except reading entire fields split by configurable delimiters instead of individual chars.

These cut entries are tracked in an array with high & low bounds.

csplit, after mostly-typical initialization & parsing & validating commandline flags, validates there are remaining commandline args, reopens the given file over stdin, parses given regexps, registers a signals handler, iterates over & applies the given “controls” before carefully temporarily opening an output file to write all buffered lines to. This output file might also be opened when processing any of those controls.

For regexp controls at an offset repeatedlies looks up a line (upon failure to find this it either outputs the rest of the file or reports to stderr) before evaluating the regexp over that line to determine whether to output it.

For non-offset regexp controls it does basically the same logic but slightly simpler.

For linecount controls it creates the output file, reports errors, repeatedly dequeues lines to save to the file until reaching the desired linecount, & tidies up.

Upon dequeueing a line csplit considers whether it needs to read a new bufferful of data & split it into lines.

comm, after mostly-typical initialization & parsing flags, validates the args count before running its core logic. Which involves fopening each specified file (handling “-“ specially) with sequential optimizations, perfile data allocated, & the firstline read in. Then mergesorts the lines from both files into stdout with or without collation, closes those files, & optionally outputs intersection/exclusion counts.

And last but not least cat, after mostly-typical initialization & flags parsing, fstats stdout to help determine most optimal codepath & maybe sets it to binary mode.

Then cat iterates over its args fopening & fstating each one, retrieving the optimal blocksize, validates it’s not cating a file to itself, & if various flags aren’t set it’ll simply repeatedly copy data from input to output an optimally sized & memaligned buffer at a time OR with plenty of microoptimization it iterates over the buffer reading more as-needed looking for newlines possibly inserting linenumbers and/or escaping lines between them.

Before writing remaining text & cleaning up.


GNU CoreUtils provides several useful commands for rearranging & summarising text files!

wc, after mostly-typical initialization, retrieves the optimal buffersize, configures line buffering mode, checks $POSIXLY_CORRECT, parses & normalizes flags indicating which counts it should output, possibly opens & fstats a specified file listing the other files to summarize OR consults args, possibly fstats all input files again to estimate the width of the eventual counts, iterates over & validates all files whether listed in a file or args, running the core logic for each or reading from stdin, & tidies up whilst outputting desired counts.

This core logic involves temp-opening the file again handling “-“ specially, possibly enabling sequential optimization, & possibly fstating again & seeking near the end in case the size is approximate, using repeated reads for an exact size.

Or the core logic may involve considering whether we can use AVX2-specific microoptimizations before repeatedly reading a bufferful in, counting bytes & newlines. AVX2 allows x86 CPUs to do this in 256bit chunks (or rather wc uses 2 of them for 512bit chunks) by summing equality results.
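The scalar version of that inner loop is essentially the following; the AVX2 path performs the same equality test 32 bytes at a time (a sketch, not wc’s actual code):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[65536];
        size_t n;
        unsigned long bytes = 0, lines = 0;

        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            bytes += n;
            /* Count newlines in the buffer; memchr is itself usually vectorized. */
            for (char *p = buf; (p = memchr(p, '\n', buf + n - p)) != NULL; p++)
                lines++;
        }
        printf("%lu %lu\n", lines, bytes);
        return 0;
    }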

Or it reads a bufferful at a time whilst decoding UTF8 chars (with ASCII fastpath) via mbrtowc handling \n, \r, \f, \t, space, \v, & kanji specially. Or maybe it’s compiled to not support Unicode.

After counting lines, words, chars, and/or bytes in each file it outputs those numbers before adding to the total sums across all files in case we want those values too.

tr, after mostly-typical initialization & flag parsing possibly altering locally-configured locale, validates args count, initializes a linkedlist & escapes then parses a regex-like pattern (or two) via an internal scanner, validates them, switches to binary input mode with sequential optimizations, & under various differing conditions consults a rename table and/or a couple smallintset (both compiled from parsed input pattern) to determine which input chars to output.

tail, after typical initialization & parsing & validating commandline flags (obsolete syntax first), defaults remaining commandline args to “-“, locates/validates/warns about “-“ in those args, shortcircuits if certain flags are all unset, allocs an array, considers whether to output headers, enables binary output mode, iterates over all those args performing the core logic, then given -f goes back to output any additional lines written to those files ideally using inotify syscalls & a hashmap to interpret its responses.

The core logic involves temporarily opening the specified file in binary mode (handling “-“ specially) & if successful possibly outputs the filename surrounded by fat arrows, locates the last n bytes or lines to write, & given -f validates & populates properties for the above -f loop.

To output all but the first n bytes tail consults fstat before lseeking & copying bufferfuls from input to output, before outputting any remaining text with or without headers.

To output the last n bytes tail lseeks to the end of the file less n, decides whether it needs to apply pipe logic or seek back to start, & refines the seek position before outputting any remaining text with or without headers.

For pipes it buffers into a segmented linkedlist until EOF then outputs said buffer.

To output all but the first n lines tail fstats the file, reads bufferfuls counting (or rather decrementing) newlines until it reaches desired count, possibly outputs remainder of that buffer, & outputs the remainder of the file.

To output the last n lines tail fstats the file, tries seeking to end, reading the file backwards a bufferful at a time counting newlines until reaches desired count, outputs the buffer from that point followed by the remainder of the file.

To output the last n lines of a pipe tail reads the pipe into a linkedlist of buffers counting the number of newlines in each until EOF, uses those counts to locate the start of those lines, & outputs them before cleaning up their memory.

split, after mostly typical initialization & parsing & validating commandline flags, extracts & validates commandline args, freopens the specified input file over stdin, enables binary input mode, fstats the input to get the optimal blocksize, mallocs a memaligned buffer, performs some trial reads to get the filesize, if in filtering mode registers SIGPIPE to be ignored, & decides whether it wants to apply the logic for digits/lines, bytes, byteslines, chunkbytes, chunklines, or RR.

For linesplits it reads input in bufferfuls counting newlines to determine which output file to write to.

For bytesplits it reads bufferfuls tracking a bytes countdown to determine which output to write to.

To split into lines of a maximum bytesize split reads buffers of the specified size counting newlines within them to determine which output to write that buffer to. Or splits buffer at a line break.

To split into byte chunks it may either be equivalent to splitting by bytes, or it seeks to a computed start index copying buffers of the specified size to output.

Or there’s a variant which avoids splitting lines.

Or a variant that cycles between the array of output files in a “round-robin” fashion.

sort, after initializing locale as per usual, loads some envvars, further initializes locale, generates some lookuptables, alters signal handlers (mostly to ignore, resets SIGCHLD handler away from parent’s), registers to cleanup tempfiles & close stdout on exit, allocs/clears some configuration, parses extensive commandline flags with various conditional tweaks into those structures & other localvars, opens & tokenizes the file specified by the --files0-from flag if present, propagates various properties of fields to compare by whilst determining whether any requires randomness, ensures there’s at least one entry in that linkedlist before validating it, maybe outputs debugging information, initializes randomness if required via getrandom via a wrapper reading from a file instead for debugging purposes, sets a default tmpdir of $TMPDIR or /tmp, defaults remaining args to “-“, normalizes the amount of RAM to use, optionally extensively validates instead, validates it can read all the inputs & write all the outputs, & with minimal tidyup commences bigishdata sort! (Might be overengineered for modern hardware…)

You can configure sort to only use its disk-based mergesort assuming the given input files are already sorted. This involves an initial loop which merges each pair (or whatever) of input files (by parsing each head line into fields for comparison, copying whichever’s head is lower to the output, & advancing that file) then each pair of those.
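The heart of that merge, for two already-sorted inputs & plain byte comparison (ignoring fields, collation, & the n-way case), looks something like this sketch:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Merge two sorted text files line-by-line onto stdout.
     * Always emit whichever file's current line compares lower, then advance it. */
    static void merge_sorted(FILE *a, FILE *b)
    {
        char *la = NULL, *lb = NULL;
        size_t ca = 0, cb = 0;
        int have_a = getline(&la, &ca, a) != -1;
        int have_b = getline(&lb, &cb, b) != -1;

        while (have_a && have_b) {
            if (strcmp(la, lb) <= 0) { fputs(la, stdout); have_a = getline(&la, &ca, a) != -1; }
            else                     { fputs(lb, stdout); have_b = getline(&lb, &cb, b) != -1; }
        }
        while (have_a) { fputs(la, stdout); have_a = getline(&la, &ca, a) != -1; }
        while (have_b) { fputs(lb, stdout); have_b = getline(&lb, &cb, b) != -1; }
        free(la); free(lb);
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s sorted1 sorted2\n", argv[0]); return 1; }
        FILE *a = fopen(argv[1], "r"), *b = fopen(argv[2], "r");
        if (!a || !b) { perror("fopen"); return 1; }
        merge_sorted(a, b);
        return 0;
    }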

Or when compute’s the bottleneck rather than RAM (which RAM was on early computers) sort retrieves the CPU core count as a max pthreads count & for each bufferful of input from each file in turn temp-initializes a mutex-locked priority queue & merge tree node (possibly in a new pthread) to apply an in-RAM mergesort with its chunks prioritized via the priority queue. Once these sorted arrays get too large they’re written to disk for the disk-based mergesort.

The comparator either interprets the relevant commandline flags parsed into an array, selecting certain fields from the pre-lexed line & applying a choice of comparison logics (including randomly-salted MD5 for shuffling). Or it uses a possibly-collated memcmp.

shuf, after mostly-typical initialization & parsing & validating commandline flags, gathers inputs whether empty, echoed from commandline args, numeric range, or specified files, initializes a random number generator, maybe populates an array listing the new random index for each line in the file taking great care to preserve randomness distribution, considers closing stdin, maybe computes the array via a sparse hashmap & randomly swapping indices, writes the randomly chosen indices or lines at those indices or the “reservoir”, & tidies up.

seq, after mostly-typical initialization & parsing/validating commandline flags, determines whether it can use a fastpath.

The fastpath keeps the number in text form incrementing chars carrying at ‘9’, populating a bufferful before outputting them. Uses memcmp to decide when to end.
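That text-form increment is a cute trick; a hedged sketch of it (right-aligning the digits in the buffer so a carry out of the top digit just extends leftwards):

    #include <stdio.h>
    #include <string.h>

    /* Increment a decimal number stored as ASCII digits in [num, end), in place,
     * e.g. "199" -> "200". There must be spare bytes in front of `num` so a carry
     * out of the top digit can prepend a '1'; returns the new start of the digits. */
    static char *incr_decimal(char *num, char *end)
    {
        char *p = end - 1;                 /* least-significant digit */
        while (p >= num && *p == '9')      /* propagate the carry over trailing 9s */
            *p-- = '0';
        if (p < num) {                     /* carried past the first digit */
            *--num = '1';
            return num;
        }
        (*p)++;
        return num;
    }

    int main(void)
    {
        char buf[32];
        memset(buf, ' ', sizeof buf);      /* spare space up front for carries;       */
        char *end = buf + sizeof buf;      /* a real implementation would reallocate  */
        char *start = end - 1;             /* once the buffer fills up                */
        *start = '1';

        for (int i = 0; i < 15; i++) {     /* prints 1 through 15 */
            printf("%.*s\n", (int)(end - start), start);
            start = incr_decimal(start, end);
        }
        return 0;
    }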

Otherwise it parses & validates the given floats whilst checking whether it’s actually an int & computing output width, reconsiders the fastpath, generates a formatstring in the absence of -f, & outputs each multiple of the given step (carefully avoiding precision loss) up to the limit, printfing each & adding separators & the terminator.

ptx, after mostly-typical initialization, calls setchrclass(NULL) if that function’s available, parses flags, gathers an array of filenames with linecount & text buffer sidetables whether or not args are given or GNU extensions are left enabled, chooses a default output format, compiles the given regexp whilst compiling a char rewrite table, loads all chars from a file into that rewrite table as “breaks”, loads a second sorted sidetable of “break words” from a given file, initializes some counts, for each given input file reads it all into memory running the core logic & updating line counts, sorts results, applies various normalization, & iterates over these results to compute the charwidth of its fields & output in a choice of syntax including TeX.

That core logic involves iterating over the file’s text probably running the regexp to locate the start index for next iteration, & repeatedlies locates next word start & end via regexp or scanning over chars present in rewrite table skipping empty words, updates max length & counts, binary searches sidetable of allow & block sorted wordlists skipping words as dictated by them, possibly allocs an occurs_table entry & populates it partially as directed by caller, & possibly skips trailing chars other than whitespace.

pr (which reformats text for printing), after mostly-typical initialization & parsing & validating/normalizing commandline flags, copies trailing commandline args into a new array, iterates over all those filenames defaulting to stdin or tells the core logic to render them in parallel, & tidies up.

That core logic involves computing various layout parameters, opening each file being laid out in parallel (handling “-“ specially) with sequential read optimizations whilst laying out pageheader text, possibly allocing memory to store columns, partially laying out a given number of pages to skip them using that parameter as the initial page number, computing some per-column layout parameters whilst choosing per-column callbacks representing whether to directly output text or buffer it in columns, then rendering each subsequent page.

For each output page it resets some layout parameters, validates there’s text to layout, repeatedly outputs lines, updates flags, resets state, & outputs padding.

Laying out a line involves iterating over cols calling their callback (possibly skipping the rest of the input’s line) until it has no more lines to output, in parallel mode ensures columns remain aligned even when empty, & considers adding newlines.

Whilst laying out columns per-page it reads in the first line for each of them & reshuffles lines between columns to keep them balanced.

That callback applies text alignment, line numbers, textwrapping, etc & buffers text via the other callback.

od, after mostly-typical initialization & initializing a couple lookup tables, parses, validates, & normalizes commandline flags, chooses “modern” or “traditional” syntax for extracting commandline arguments, possibly into a printf-string & read bytesize, defaults commandline args to “-“, opens the first file of those, carefully skips a specified number of bytes, computes the least common multiple (shared util) between all given readwidths & uses that to compute the number of bytes per block, computes necessary padding to align output, in some builds outputs debugging info, & runs one of two variants of the core logic before possibly attempting to close stdin.

If we’re “dumping strings” from the file it repeatedly keeps reading bytes looking for at least a given number of ASCII chars loading them into a buffer, then reads until it’s found the NUL terminator resizing the buffer as necessary, then outputs the address via the configured callback & escapes/outputs the found string.

Otherwise od reads bufferfuls at a time with or without (two mainloops) checking against end offset, outputting last byte specially consulting computed lowest-common-multiple, & outputs end address.

Upon reading a bufferful of data it considers closing the current file on EOF & opening the next one. To write that block it first compares against the previous block. If they were equal it’ll output at most “*\n”. Or outputs the address followed by each specifier’s callback possibly followed by hex format.

digest, after typical initialization & enabling output buffering, parses & verifies flags, defaults args to “-“, & iterates over each arg either checking the hash or computing the hash & outputting it via the caller-specified callback.

To check a file’s hash digest opens the file (handling “-“ specially) repeatedly reads a line & strips it, extracts the hash, whether it’s a binary file, & filepath, runs core logic, compares to expected value, & decides which results to output.

The shared library for computing CRCs (not really a hashfunction, but works well for detecting transmission errors!) embeds via the C preprocessor a script to generate its own lookuptable headerfile using bittwiddling & 2 intermediary tables. It has a separate codepath specifically for making optimal use of x86/x64 CPUs by taking advantage of their pclmul instructions.

CRC at its core involves repeated bitwise shifts & XORs.
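Here’s the bit-at-a-time form, using the common reflected CRC-32 polynomial as an illustration (cksum itself uses a different, non-reflected polynomial & also mixes in the length, so treat this as a sketch of the technique rather than its exact output):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Bit-at-a-time CRC-32 (reflected 0xEDB88320 polynomial, zlib/gzip style).
     * Table-driven & pclmul versions compute the same function, just faster. */
    static uint32_t crc32(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= p[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return ~crc;
    }

    int main(void)
    {
        const char msg[] = "123456789";
        printf("%08x\n", crc32(msg, sizeof msg - 1));  /* classic check value: cbf43926 */
        return 0;
    }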

Finally base## (e.g. base64) commands, after mostly-typical initialization & parsing flags possibly (depending on build flags) including a separate option for the desired base, validate they have at most one argument defaulting to “-“ which they then temporarily fopen (handling “-“ specially) with sequential read optimizations, & either decode or encode the data as specified by -d.

For decoding it mallocs some buffers, reads a fullbuffer in, & calls the appropriate decode function.

Encoding works basically the same way but possibly with added text wrapper.

The logic (which is in a library shared within GNU CoreUtils) for encoding & decoding base32 or base64 text involves bitshifts, bitmasks, & character lookup tables. There’s wrappers around this, as well as similar code written inline with the command, tweaking the behaviour for additional basenc options.
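The encoding direction can be sketched as: take 3 input bytes, split their 24 bits into 4 six-bit groups, & map each through a 64-character alphabet, padding with “=” (a sketch, not the shared library’s actual code):

    #include <stddef.h>
    #include <stdio.h>

    static const char b64_alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode `len` bytes of `in` into `out` (must hold 4*ceil(len/3)+1 chars). */
    static void base64_encode(const unsigned char *in, size_t len, char *out)
    {
        size_t i;
        for (i = 0; i + 2 < len; i += 3) {
            unsigned v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
            *out++ = b64_alphabet[(v >> 18) & 63];
            *out++ = b64_alphabet[(v >> 12) & 63];
            *out++ = b64_alphabet[(v >> 6) & 63];
            *out++ = b64_alphabet[v & 63];
        }
        if (i < len) {                         /* 1 or 2 trailing bytes get '=' padding */
            unsigned v = in[i] << 16;
            if (i + 1 < len) v |= in[i + 1] << 8;
            *out++ = b64_alphabet[(v >> 18) & 63];
            *out++ = b64_alphabet[(v >> 12) & 63];
            *out++ = (i + 1 < len) ? b64_alphabet[(v >> 6) & 63] : '=';
            *out++ = '=';
        }
        *out = '\0';
    }

    int main(void)
    {
        char out[64];
        base64_encode((const unsigned char *)"hello", 5, out);
        puts(out);                             /* prints "aGVsbG8=" */
        return 0;
    }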


Not every system call GNU CoreUtils exposes to the commandline is very text centric.

dd, after typical initialization & configuring signal handlers, retrieves the system’s optimal buffersize, initializes a translation table with sequentially-incrementing numbers (identity transform), decodes keyword args in a different syntax from usual into e.g. filenames & bitflags & ints, updates that transform table, opens the if= given input file with specified flags checking whether it’s seekable, opens the of= given output file with specified flags possibly ftruncateing it using fstat to diagnose failures, retrieves the current time via whatever highprecision syscall is available, & runs the core logic before diagnosing errors, ensuring any signals have been handled, cleaning up, & outputting final status.

dd’s core logic involves maybe fstating & lseeking the given offset falling back to read twice (the second time outputting zeroes in their place for “seek” options), allocs input & output buffers, possibly retrieves the current time to determine whether to output status, stops if we’ve copied enough records, maybe zeroes the input buffer, reads a full or possibly partial (depending on if keyword arg) buffer of conditional size whilst handling signals & warning about errors, updates counters possibly clearing output cache or possibly lseeks past bad blocks whilst invalidating cache maybe ending this loop, possibly zeroes the input buffer’s tail, maybe takes a fastpath outputting that input buffer immediately, maybe translates all the bytes according to the translation table, maybe swaps every two bytes, & lseeks & writes that postprocessed buffer either in full or a char at a time whilst tracking columns.

After dd’s mainloop it outputs the final byte if present, maybe pads with spaces, maybe adds a final newline, outputs last block if necessary, if the final op was a seek fstats, lseeks, & ftruncates the file, & f[data]syncs the file whilst handling signals.

Signals are handled around several of these syscalls. Clearing output caches involves some throttling & posix_fadviseing.

Some standard translation tables are bundled for e.g. EBCDIC.

df, after mostly-typical initialization & parsing & validating flags including resolving filesize units filling in certain missing ones from envvars, test-opening each file whilst stating it, parses the syscall or device file listing currently mounted filesystems complaining upon error, maybe ensures all data is syncd so we can analyze it, allocs fields to hold each row data & outputs a header, gathers desired entries, probably outputs them ideally nicely aligned, & cleans up.

Gathering desired entries may involve for each commandline arg iterating over the mount linkedlist looking for the specified device whilst canonicalizing filepaths & stating the files looking for the closest match, then reformatting the data into text adding filesize units back in whilst calling whichever stat[v]fs-variant syscall is available. Or complaining if that device has been “eclipsed”.

Then another couple iterations where the device is used.

Or it might iterate over the mountlist deduplicating it via a temporary hashmap & filtering the entries as specified by the parsed commandline args whilst stating them, populating a templinkedlist before possibly copying it over, then iterates over the now-filtered mountlist to populate the table as per before.

In populating the table it increases some counts to also possibly output.

The alignment code is quite sophisticated, & in a internally-shared library.

du, after mostly-typical initialization & parsing & validating commandline flags & $DU_BLOCK_SIZE & maybe $TIME_STYLE envvars, determines where to read the argument list from possibly freopening the file given by --files0-from over stdin, mallocing a hashset of device-inode pairs, tweaks some bitflags, repeatedlies retrieves & validates the next specified file to apply the core logic to, tidies up all the allocated memory, & possibly prints the total count.

du’s core logic involves reusing the filetree traversal used by chown, chgrp, rm, etc.

Upon anything other than errors (which it reports) or entering directories it checks whether the commandline flags specified to exclude the file. If not it configures NSOK entries to be revisited & proceeds to the next one validating it’s not an error & reconsidering whether to exclude. If so it tells the traversal to skip this entry.

For dirs it does no further processing & errors are reported.

Then per-file du gathers a local struct from the provided stat info, callocs or reallocs to form a tree out of this data, adds to appropriate counters, & maybe outputs the filesize, maybe date, & label.

There’s cycle detection logic in traversing directories referring to a lazily-loaded mounttable & the device+inode hashset.


env, after mostly-typical initialization & initializing a signals table, parses & validates commandline flags & validates it didn’t receive the = op, resets all signal handlers to either default or ignore, maybe configures a signal mask, & maybe outputs these new signal handlers, maybe switches to a new current directory possibly with debug/error messages, maybe outputs which command it’s about to run, & execvps it tidying up on failure.

There’s a shared util func for traversing up a filepath until just before it changes st_dev/st_ino.

install, after mostly-typical initialization & parsing & validating commandline flags largely into a struct but also globals like (from a shared util mentioned previously) backup suffixes, validates there’s at least one additional commandline arg, further parses a couple flags, & either, with a preserved current working directory & SELinux context, creates the specified directory with any missing parents.

Or with a global hashmap (and with or without creating any missing parent dirs) stats the file, if successful copies the file over into the new location as per cp if needed, if successful maybe runs the strip command over it, copies timestamps over via the utimens syscall, & copies (both traditional UNIX & SELinux) permission attributes over.

Or it prepends a directorypath first before doing that.

pinky starts by mostly-typical initialization & parsing commandline flags.

In short mode pinky reads the specified UTmp file, determines which datetime format to use, outputs a heading line, & iterates over that UTmp file looking for user processes possibly filtered by a provided arrayset, stats the listed file & consults LibC's flatfile databases to determine what text to output with.

In long mode it iterates over the commandline args, consults getpwnam for more extensive info to output, followed by the user's ~/.project & ~/.plan files.

There’s shared utils for manipulating SELinux devicefiles. There’s other shared utils for parsing field references from commandline flags. And another involved in quite sophisticated commandline flags parsing.

In storing data in magnetic fields hard disks leave residual traces of deleted data, so it can be useful to repeatedly overwrite it with white noise to ensure that data is truly gone. Solid-state drives I believe don't have the same issue, and using these "secure erase" tools on them just serves to shorten their lifespans. GNU CoreUtils provides shred for this.

shred, after mostly-typical initialization & parsing commandline flags, validates there's additional commandline args, initializes a random-number generator & registers for it to be cleaned up on exit, & iterates over its commandline arguments (handling "-" specially, jumping near-straight to the core logic) temp-opening & maybe chmoding if necessary to apply the core logic before repeatedly renaming the file to progressively shorter names & unlinking it whilst syncing each step.

shred's core logic fstats the file validating the result, computes optimal buffersize whilst retrieving exact filesize, populates the buffer with random data with various counters possibly reusing previous chunks of randomness, goes over the buffer again to improve randomness slightly, & repeatedly seeks back to the start of each block a given number of times, possibly bit-twiddles the buffer, outputs status, & repeatedly verified-writes random segments of the buffer to the file being shredded.

After the innermost loop shred outputs status info & syncs to disk so it actually has an effect.
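A heavily simplified sketch of a single shred-style pass (hypothetical helper; GNU shred uses a stronger PRNG, multiple patterned passes, & optional verification):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

/* Overwrite the whole file once with pseudo-random bytes, then flush. */
static int overwrite_once(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    char buf[64 * 1024];
    for (off_t left = st.st_size; left > 0; ) {
        size_t n = left < (off_t)sizeof buf ? (size_t)left : sizeof buf;
        for (size_t i = 0; i < n; i++)
            buf[i] = (char)rand();               /* stand-in for shred's PRNG */
        if (write(fd, buf, n) != (ssize_t)n) { close(fd); return -1; }
        left -= n;
    }
    fsync(fd);   /* without syncing, the pass might never reach the disk */
    return close(fd);
}
```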

stat, after mostly-typical initialization & parsing flags, validates there's remaining args & (filling in dynamically-generated defaults) its -c/--printf flag, & for each commandline arg calls fstat or statfs or an available variant syscall, before manually interpreting (lots of options) the given format string to generate output text whilst maybe locating the mountpoint or SELinux context.

stdbuf, after mostly-typical initialization & parsing commandline flags, validates there's additional commandline args, sets specified envvars, extracts the directory containing this command possibly referring to the /proc/self/exe symlink, configures the LD_PRELOAD envvar to "libstdbuf.so" wherever that is, & execvps the remaining args. libstdbuf in turn adds a little code to the executable(s) which parses those envvars to pass to LibC's setvbuf.
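A minimal sketch of how such an LD_PRELOAD shim can work (the envvar name here is hypothetical; the real libstdbuf reads variables stdbuf sets for stdin/stdout/stderr): a constructor runs inside the target program before main() & adjusts buffering via setvbuf.

```c
#include <stdio.h>
#include <stdlib.h>

__attribute__((constructor))
static void apply_buffering(void)
{
    const char *mode = getenv("MY_STDBUF_O");        /* hypothetical variable */
    if (!mode) return;
    if (mode[0] == '0')                              /* "0" => unbuffered */
        setvbuf(stdout, NULL, _IONBF, 0);
    else if (mode[0] == 'L')                         /* "L" => line-buffered */
        setvbuf(stdout, NULL, _IOLBF, 0);
    else                                             /* otherwise: size in bytes */
        setvbuf(stdout, NULL, _IOFBF, (size_t)atol(mode));
}
/* Build & inject, e.g.:
     gcc -shared -fPIC -o libmystdbuf.so mystdbuf.c
     MY_STDBUF_O=L LD_PRELOAD=./libmystdbuf.so ./some-program */
```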

stty, after mostly-typical initialization & parsing & thoroughly validating commandline flags, possibly reopens the specified file over stdin turning off blocking mode, retrieves the input’s mode, possibly parses $COLUMNS whilst outputting the specified subset of hardcoded controlchars. Or iterates over all specified settings (amongst other IOCTLs) encoding a new mode to subsequently pass to tcsetattr & reports if tcgetattr yields anything different.

Includes lightweight text wrapping.

test/[, after mostly-typical initialization, specifically checks for sole --help & --version flags on [, validates there are args, & runs an immediately-evaluated pushdown parser with a scanner but no lexer beyond the caller splitting commandline args. Its leaves call various syscalls, typically stat variants to retrieve/compare different returned properties.

timeout, after mostly-typical initialization & parsing flags, validates there's at least 2 args remaining, parses the next commandline arg as a timeout duration with an optional unit, maybe calls setpgid(0, 0) so all subprocesses are killed along with timeout, configures signal handlers, & forks execvping the remaining commandline args in the child process with reset SIGTTIN & SIGTTOU signal handlers; in the parent process it ensures we receive SIGALRM signals, calls the appropriate available syscall to schedule its triggering, blocks several signals, & waits for the child process to end.

Once the child process has ended (or a signal was received) it checks a flag set by the SIGALRM callback & kills itself without coredumps.

Various signal handlers, including SIGALRM's, consider killing the child process or resetting a second timeout.
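Stripped of signal forwarding, sub-second precision, & --kill-after, the skeleton of what timeout does might look like this sketch:

```c
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t child;
static volatile sig_atomic_t timed_out;

static void on_alarm(int sig) { (void)sig; timed_out = 1; kill(child, SIGTERM); }

/* Run argv[] with a deadline; 124 is the exit code GNU timeout uses on expiry. */
int run_with_timeout(unsigned seconds, char **argv)
{
    signal(SIGALRM, on_alarm);
    child = fork();
    if (child == 0) {                 /* child: become the command */
        execvp(argv[0], argv);
        _exit(127);
    }
    alarm(seconds);                   /* the real tool prefers timer_create/setitimer */
    int status;
    waitpid(child, &status, 0);
    alarm(0);
    if (timed_out) return 124;
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```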

users, after typical initialization, parses the given UTmp (or default) file, iterates over every userprocess therein extracting names to qsort then output & deallocate.

And finally for today who, after mostly-typical initialization & parsing & validating commandline flags into various bools, selects a time format with max charsize to allocate for interpreting it, decides how to behave based on commandline args count, temporarily parses the given (or typically the UTmp) file, & decides how to handle it by the presence of -q.

If it’s present it iterates over all user processes, extracts & outputs the trimmed name, & outputs how many of those entries it counted.

Otherwise it considers outputting a heading line, considers calling ttyname for data to filter entries to only list ourselves, & for each entry in that file outputs an appropriate line for its type (if enabled for that type). This last bit gets fairly involved yet tedious, & shares an internal utility for outputting tablelines & formatting times/periods.


tsort, after typical initialization & validating there’s at most one arg defaulting to “-“, mallocs a root node, freopens the specified file over stdin if not “-“ enabling sequential read optimizations & initializing a multistring tokenizer, for each token it locates where to place it in the tree whilst balancing & inserts a new treenode there, validates there was an even number of tokens, counts all treenodes, & computes output from it.

To compute output from its binary tree tsort gathers a linkedlist of binary tree nodes with no dependencies, outputs each of their strings whilst removing them from the binary tree & decrements the counts on their dependencies to determine which to add to this linked list. If there are tree nodes left after this that indicates there's a loop, in which case it iterates over the tree to find & output these loops, removing an edge to break the cycle so it can try again.
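That output phase is Kahn's topological sort. A compact sketch, with a hypothetical adjacency matrix standing in for tsort's tree & linked list:

```c
#include <stdio.h>

#define N 4
static const char *name[N] = {"a", "b", "c", "d"};
/* edge[i][j] != 0 means "i must come before j" */
static int edge[N][N] = { {0,1,1,0}, {0,0,0,1}, {0,0,0,1}, {0,0,0,0} };

int main(void)
{
    int indeg[N] = {0}, done = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            indeg[j] += edge[i][j];

    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < N; i++)
            if (indeg[i] == 0) {          /* no remaining dependencies */
                printf("%s\n", name[i]);
                indeg[i] = -1;            /* mark emitted */
                for (int j = 0; j < N; j++)
                    if (edge[i][j]) indeg[j]--;
                done++; progress = 1;
            }
    }
    if (done < N)                         /* leftovers sit on a cycle */
        fprintf(stderr, "cycle detected among remaining nodes\n");
    return 0;
}
```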

pathchk, after mostly-typical initialization & parsing commandline flags into global bools determining which checks are performed, validates there's at least one additional commandline arg, & iterates over them. For each it maybe checks if there's a leading hyphen (an early UNIX bug treated all of those as stdin), maybe checks if the filename is empty, maybe checks whether the filename is pure ASCII (excluding symbols & whitespace) or checks whether the file exists via lstat; partially based on that it might check the character length of the filepath, & it might check the character length of each path component in 2 (fast & slow) passes. Appropriate error messages for any of these failing checks are written to stderr.

Though I sure hope no modern system requires these portability checks!
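For what it's worth, the length checks reduce to a couple of pathconf queries. A sketch with error handling trimmed (real pathchk also knows the POSIX minimum limits for its -p mode):

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Check the whole path against PATH_MAX & each component against NAME_MAX. */
static int check_lengths(const char *path)
{
    long path_max = pathconf(".", _PC_PATH_MAX);
    long name_max = pathconf(".", _PC_NAME_MAX);

    if (path_max > 0 && strlen(path) > (size_t)path_max) {
        fprintf(stderr, "path too long: %s\n", path);
        return 1;
    }
    char buf[PATH_MAX];
    strncpy(buf, path, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *part = strtok(buf, "/"); part; part = strtok(NULL, "/"))
        if (name_max > 0 && strlen(part) > (size_t)name_max) {
            fprintf(stderr, "component too long: %s\n", part);
            return 1;
        }
    return 0;
}
```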

numfmt, after mostly-typical initialization whilst maybe setting the CPU's floating point precision & retrieving the default decimal point from the locale, parses & validates flags including a printf-like format string, reallocs a buffer according to configured padding, & iterates over commandline args (with a possible warning in the presence of --header) OR stdin's lines (the first given number of which are treated as a header).

For each (surrounded by delimiters) it iterates over the line's fields, removes specified suffixes & whitespace, maybe reconfigures the padding buffer based on input character length, carefully parses the number leniently & resiliently, computes the charsize to validate it's not too large, reassembles the printf format string whilst applying a choice of rounding to the parsed number to reconsider whether to show the decimal point, applies that format string, possibly adds a suffix & applies alignment via mbsalign, & outputs that formatted number with appropriate prefix & suffix. Or outputs the raw input text.
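As an illustration of the suffix handling, here's a tiny sketch of a "--to=si"-style conversion (numfmt itself is configurable about rounding, precision, & 1000-vs-1024 scaling):

```c
#include <stdio.h>

/* Scale down by 1000 until the value fits, then attach the matching suffix.
   Index 0 is a space, i.e. no suffix. */
static void to_si(double value, char *out, size_t outlen)
{
    static const char suffixes[] = " KMGTPE";
    size_t i = 0;
    while (value >= 1000.0 && suffixes[i + 1] != '\0') {
        value /= 1000.0;
        i++;
    }
    snprintf(out, outlen, "%.1f%c", value, suffixes[i]);
}
/* to_si(1234567, buf, sizeof buf) yields "1.2M". */
```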

expr, after typical initialization & validating there are non-"--" commandline args, runs an immediately-evaluated pushdown parser with a scanner over the commandline args operating upon a tagged enum holding either multiprecision integers (mpz_t) or multibyte strings. Values can also be tested for falsiness, or have regexps evaluated upon them via the ":" string infix operator.

Results are converted into a textual output and a boolean errorcode.

dircolors, after mostly-typical initialization & parsing & validating flags, either outputs some pre-compiled hardcoded text, OR guesses which shell is used based on $SHELL if not explicitly stated via commandline flags before reformatting the input file or stdin surrounded by shell-specific prefix/suffix text.

Parsing/reformatting the specified input streams involves possibly temporarily-freopening the specified file over stdin if not "-", retrieving $TERM, & repeatedly (with linecounts) reading & straightforwardly-parsing each line; if the keyword was "TERM" it checks whether it matches $TERM, & unless that check failed it reformats keyword-arg pairs quoting each separated by "=", possibly adding or removing punctuation, replacing keys with acronyms in a pair of lookuptables, or dropping the "OPTIONS", "COLOR", & "EIGHTBIT" keywords.

This reformatted text is buffered into a string for output.


To relatively efficiently extract prime factors from a number, factor, after mostly-typical initialization with an added exit-handler outputting any remaining buffered text, parses a handful of commandline flags, possibly zeroes out a frequencies buffer to output after the core logic, & iterates over trailing commandline args or tokenized stdin.

For each it parses the number, considers taking a fastpath or reporting any errors, or falls back to using multiprecision arithmetic.

The fastpath (if the number's small enough, i.e. 2 words) recursively divides by 1,000,000,000 taking the remainder to aid outputting the int, to which it adds a ":" (a similar technique is used to output factors once computed), & computes the actual factors by first trying to extract some obvious factors & iterating over a pre-generated (I'll describe how soon) table of prime numbers which are factors of the input in two passes to quickly discard options.

If there are more prime factors to find it'll check if the simplified input itself is prime using some math theorems (Miller-Rabin & Lucas) I'm not familiar with, after discarding any additional factors of 2. If so it adds it to a large-prime-factors array to be outputted separately.

Otherwise it computes the square root checking if that's a prime, then iterates over 1 of 2 tables & does more computation involving squareroots, of course remainders, & recursion.

Within or after that pass it tries using Pollard's rho algorithm (recursive, involving modulo-multiplies/adds/subtracts) to narrow down prime candidates to record.

There are variants of most of these functions for operating on either one or two words, and a variant of all of them to operate on a dynamic number of words.

To autogenerate the smallprimes table it parses the first arg as an int, allocs/zeroes some tables, iterates over that range, & outputs results.

For each number "i" in that range it populates a table entry with p = 3+2i, then marks each multiple of p from (p*p - 3)/2 up to the given max number.

To output the actual primes from those tables it counts the number of bits in a wide_uint & outputs it as a C-preprocessor macro, outputs P macro calls for each prime as the diff from last prime, diff from 8 ahead, & (via bitwise shifts & ORs) the inverse. Then uses the inverse & limits to locate the next prime to output as FIRST_OMITTED_PRIME.
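Reconstructed as standalone C under those assumptions (odd-only table, index i corresponding to p = 3 + 2i), the sieve looks roughly like this sketch rather than the generator's actual macro-emitting code:

```c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    unsigned long max = argc > 1 ? strtoul(argv[1], NULL, 10) : 1000;
    unsigned long len = (max - 3) / 2 + 1;        /* odd numbers 3..max */
    unsigned char *composite = calloc(len, 1);

    for (unsigned long i = 0; i < len; i++) {
        if (composite[i]) continue;
        unsigned long p = 3 + 2 * i;
        /* p*p is the first multiple not already crossed off; its index in
           the odd-only table is (p*p - 3) / 2, and odd multiples of p are
           a fixed p indices apart. */
        for (unsigned long j = (p * p - 3) / 2; j < len; j += p)
            composite[j] = 1;
    }

    printf("2\n");                                 /* the only even prime */
    for (unsigned long i = 0; i < len; i++)
        if (!composite[i]) printf("%lu\n", 3 + 2 * i);
    free(composite);
    return 0;
}
```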

Ext4FS

I've just studied GNU CoreUtils, most of which are more or less simple wrappers around various syscalls (for trivial wrappers, see historical code). LibC forwards these to Linux via a special Assembly opcode, & Linux forwards them to the appropriate implementation via a caching/mounting lookup layer called the "Virtual File System".

Linux supports various filesystems but the one you’re probably using is called Ext4FS which I’ll study today!


To aid allocation of the blocks in which to store files Ext4FS computes group numbers & offsets from a block number ideally via a divide, & provides per-group checksummed bitmasks to test whether that memory's free.
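The arithmetic itself is the easy part. A simplified sketch (the struct & field names are assumptions loosely modelled on the superblock info, not the kernel's own types):

```c
#include <stdint.h>

struct sb_info {
    uint64_t first_data_block;   /* usually 0 or 1 depending on block size */
    uint32_t blocks_per_group;
};

/* Which block group a block falls in, & where inside that group it sits. */
static void block_to_group(const struct sb_info *sbi, uint64_t block,
                           uint32_t *group, uint32_t *offset)
{
    uint64_t rel = block - sbi->first_data_block;
    *group  = (uint32_t)(rel / sbi->blocks_per_group);
    *offset = (uint32_t)(rel % sbi->blocks_per_group);
}
```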

There's functions for retrieving cluster counts, maybe summing/subtracting/counting them.

Another for dereferencing a block’s descriptor.

There’s a function to carefully retrieve & validate (partially via checksum) a group’s allocation bitmask, with or without blocking on locks. Initializing that bitmask if needed, populated with a scan.

One to check a desired allocation count against various (mostly per-CPU) counters in the superblock’s info to see if desired memory is available, possibly claiming via a wrapper function.

Another for retrieving the # of blocks used by a group, or to retrieve the supergroup possibly via GCD.

There's a function for counting free blocks by summing each group descriptor's count (for groups with a valid bitmask), which is validated against the bitmasks in debugging builds.

It may compute a hint for this allocation by first considering bitmasking & maybe incrementing which blockgroup the rootnode specified it should use, applying a rootnode-specified multiplicand & offset to get the first block number of that group, reading its blockcount property, & computing from the thread ID.

There’s a multiblock buddy allocator implemented around the single-block allocator & a redblack tree.

There’s code for migrating a file’s blocks between allocation groups.


There's code to serialize & deserialize access control lists between a file attribute & a Linux-shared type.

To handle the readdir syscall Ext4FS retrieves encryption data if present, if it's htree-indexed initializes an iterator if necessary with hashes & that we haven't reached EOF before reformatting into a redblack tree & possibly unsets a bitflag if there's a checksum, checks if there's inline data as a file attribute (useful for configuration & lock files) which it reads specially, maybe allocates some memory to decrypt into, & repeatedly checks for fatal signals, maps in some more blocks handling error & empty cases, maybe validates blocksizes and/or checksums setting a flag on success, & in an inner loop validates directory entries, increments an offset, & emits each directory entry with or without decryption.

llseeking Ext4 dirs isn't special.

HTree directories are converted into redblack trees & on into linear-scan dirs if the client wishes to list them. Data from this conversion may need to be freed upon closing the directory.

There’s a validation function alongside this readdir implementation.

Lots of encoding details are defined as structs, enums, & macros. With inlined functions handling byteorder.

There’s a “journal” which tracks all ops in a ringbuffer to aid recovering from unexpected shutdowns.

Extent references have checksums, credits (akin to currency, to determine when to merge), access permissions, dirty flags, meta & space blocks & roots with indexes, & (pre)caching.

They can be split, validated, read, sought via a binary search (2 variants), initialized & deinitialized, "mapped", & zeroed.

To "map" an extent (ignoring debug output) it first traverses the extents tree reading & validating entries as needed thus flattening it for a binary search, gets & validates the depth, retrieves the tree-traversal path at that depth, & if the block's there considers expanding certain holes before returning it; otherwise, unless this is disabled, it creates the block.

Which involves gathering some fields, allocating some space with good memorylocality ideally by extending the allocations on either side, inserts into extents tree merging where profitable, updates reservedspace counts, & considers syncing to the journal.

To truncate some extents it calculates the range to remove, deletes it under lock from tree via a redblack tree retrying slightly later upon failure then any trailing holes.

To fallocate some extents, depending on the given mode it'll return -EOPNOTSUPP, flush state & remove those extents with journalling, try inlining data into file attributes, flatten that range of the tree, remove the given range from the tree with journalling, zero it out, or alloc new ranges.

There’s a couple wrappers around mapping extents converting a selected range to an IOVec to be copied to userspace.

To map some “fiemap” extents it checks cache if indicated to by bitflag (or clears bitflag), validates the given range, & defers to Linux’s generic filesystem code with a callback to read from the file attribute or have it wrap the map blocks code.

To precache it might check inlined data under lock, retrieve cache if bitflagged, run generic logic, validate range, & flattens tree.

Related to the extents tree there’s an extents status tree used for partially locking files and more. This is implemented similarly to extents trees but is entirely in-memory as a redblack tree.

There’s a journalling fastpath for smaller ops.

Underlying the read syscall it constructs an iterator in one of three types if not shutting down & non-empty.

Upon refcount deallocation it deallocs dynamically-allocated blocks, discards preallocations under lock, & frees HTree dir info.

Underlying the write syscall it constructs an iterator in one of three types if not shutting down based on bitflags. The “dax” write iter, after validation, starts journalling, registers & journals an “orphan” inode, & hands a callback to filesystem-generic code possibly wrapped in a decorator or two. I’ll describe these callbacks later. One of the decorators handles most of the journalling, another is filesystem-generic.

The “DIO” iter checks memalignment, obtains locks, journals, hands one of two callbacks (whether we’re overwriting or not) to filesystem-generic code possibly decorated with journalling, cleans up, & commits writes via filesystem-generic code.

Most methods on mmap'd Ext4FS DAX files segfault unless it's copy-on-write, though I'm not making much sense of this callback. The methods for normal Ext4FS files are largely filesystem-generic though they ensure blocks are mapped before being written to.

The implementation for mmap decides which of those methodtables to hand to the given virtual-memory struct to integrate into the process’s memory mapping, if it’s not shutting down or unless needed DAX mapping is unsupported.

The implementation for open, with several wrappers, defers to filesystem-generic code. Ext4FS's llseek is also largely filesystem-generic.

There's an internal filesystem map object, which I think lives on-disk.

Syncing is reflected in the journal as “barriers”, and flushes the underlying blockdevice.

There’s an internal hashfunction.

There’s functions for allocating & deallocating inodes directly out of the relevant allocation bitmasks.

There’s a handful of functions dealing with some concept of “chains”, blocks, & paths.

Functions for operating upon file bodies stored “inline” within the file’s attributes.

This section's quite long, so I won't cover the highlevel inode objects exposed externally.

There’s several supported IOCTLs, updating properties & deferring to the other lowerlevel components.

It uses “MMP” checksumming for the allocation bitmasks, filebodies, filemetadata, etc.

Extent slices can be shifted around.

There’s a concept of orphaned inodes, which sometimes is just a step of allocating inodes.

The I/O methods are an abstraction around internal paged I/O functions.

There’s several functions dedicated to resizing allocation groups.

There's an internal rate-limited "superblock" structure, defined alongside some of the methods for mounting & unmounting Ext4FS filesystems.

Symlinks are their own type of inode, or rather 3, as exposed to outside world.

There’s a pagecache with (publicly exposed) “verity descriptors”.

And it natively understands several file attributes including HURD's.

Gettext

When communicating textually it is important to be prepared to teach your program the (natural) languages spoken by its prospective users. I don't think it's reasonable to expect devs to "localize" the software themselves, but thanks in large part to Gettext it is definitely reasonable to expect us to "internationalize" our software!

I’ll study Gettext following along the dev pipeline of extracting, translating, compiling, & applying the translated UI text.

UI String Extraction

To use Gettext you mark any UI text in your programs with the _ macro (or variants) & run the xgettext command over your source code.
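A minimal example of that workflow: wrap UI strings in the conventional _() macro & let xgettext harvest them into a template catalogue.

```c
#include <libintl.h>
#include <stdio.h>
#define _(msgid) gettext(msgid)

int main(void)
{
    printf(_("Hello, world!\n"));   /* marked: translatable & extractable */
    printf(_("Goodbye\n"));
    return 0;
}
/* Extract the marked strings into a template catalogue with e.g.:
     xgettext --keyword=_ -o hello.pot hello.c */
```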

After extensive initialization including parsing extensive per-computerlanguage lists of flags into a hashmap xgettext parses extensive commandline flags. Then handles --version & --help specially before extensively validating/normalizing those commandline flags. After possibly reading input files from a file, it appends the remaining commandline args.


Then xgettext considers the issue of text encoding normalizing everything to UTF8, allocates an array, generates a plaintext metadata entry to append, possibly parses the previous .po file (using a plugin like any other language) to append to its previous entries possibly translating its charset.

And finally iterates over each input file before cleaning up, sorting results, & outputting them as serialized text. For each it possibly parses a rulelist from XML or infers computerlanguage (based on filenames, particularly extensions where there’s a fastpath, & XML root tag), & based on that info it parses the file using a selected callback or traverses an XML file looking for translatable text.

C UI String Extraction

I’ll describe how the callback for extracting strings from C (and C++/Objective-C) works here, without exploring the other callbacks. They all work more-or-less the same way with a few exceptions. Like for Ruby it delegates to rxgettext.

This involves initializing various globals including what to look for (KDE extends this) then balances parens like a LISP parser.


Balancing parens involves a recursive innerloop (restart the innerloop once it has balanced the parens it found so far) examining each C token (lexer described in subsequent toots). If the token's an identifier it'll set a flag & proceed immediately to symbol logic without unsetting it. If the token's a symbol it unsets that flag & swaps out its iterator whilst handling ObjC specially. For LParen tokens it recurses exiting upon EOF before nulling out the iterators & unsetting the flag. Upon RParen it closes the iterator & exits the innermost recursive loop. For Commas it resets context, iterators, & the flag. For colon it handles ObjC specially or nulls the iterators resetting the flag in either case. For string literals it saves them aside with or without checking (via previously-saved state) that they're in an appropriate macro call before nulling iterators & resetting the flag. Closes the iterator & exits upon EOF. Otherwise nulls iterators & resets the flag.


The C lexer it uses operates in 9 “phases”. The topmost phase of which lightly postprocesses the tokens largely serving to lookup all names in its hashset falling back to calling them “symbol” tokens.

Phase8 is split into 4 subphases. The topmost concatenates consecutive strings. Phase8c strips ‘@’ symbols preceding strings when lexing ObjC. Phase8b drops whitespace. Phase8a lowers inttype macros to strings.

Phase6 (phase7 got rearranged): for each phaseX token preceded with a '#' it looks for the end of line or "define" token buffering these into a "pushback" buffer. Then checks whether that was a linenumber macro to update that global counter before freeing the buffered tokens & clearing saved comments. The bodies of "# define"s are left in the tokenstream in case they contain UI text.

PhaseX lowers '#' tokens differently for start-of-line vs mid-line.

Phase5 does much of the work branching over phase4's initial char!


For EOF chars phase5 emits a EOF token. For newlines it emits end-of-line tokens. For other whitespace it collapses subsequent non-newline whitespace before emitting a whitespace token. For letters & ‘_’ it scans subsequent ones & digits capturing them in a buffer with extensive special handling for C++11 strings before emitting name tokens. For apostrophes it retrieves the next char & emits a character token. Parens, commas, hashes, colons, & (for ObjC) ‘@’s each have their own token. For ‘.’ it decides whether that’s a decimal number or symbol. For digits or decimal numbers it scans & buffers subsequent chars emitting a number token parsed via standard lib. For doublequotes it scans each phase7 char until it sees quotes emitting a string literal token. Otherwise it yields symbol tokens.


Phase7 handles lexing escape sequences in string or char literals for phase5.

Phase4 lexes, removes, & saves aside comments to be attached to translation messages.

Phase3 handles escaped newlines.

Phase2 optionally decodes "trigraphs", a concept grappling with the limitations of early keyboards & character sets, by interpreting "??"-prefixed chars as a different char, e.g. "??(" = '['.
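The replacement table is small enough to sketch in full; the mappings below are the standard C trigraphs phase2 would be decoding:

```c
/* Given the third character of a "??X" sequence, return the replacement,
   or 0 if it isn't a trigraph (in which case the "??" stays literal). */
static int decode_trigraph(int third_char)
{
    switch (third_char) {
    case '(':  return '[';
    case ')':  return ']';
    case '<':  return '{';
    case '>':  return '}';
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '!':  return '|';
    case '-':  return '~';
    default:   return 0;
    }
}
```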

Phase1 counts line numbers & strips escaped newlines.

Phase0 handles alternate newline encodings. Uses getc/ungetc as a scanner.

PO-File Initialization

To start translating a program using Gettext into a new language you run msginit to copy the extracted UI strings from a .pot file into a new .po file. Today I'll study how msginit works!

After initializing its own internationalization (amongst a couple other things) & parsing commandline flags (handling --version & --help specially) validating there aren't any more commandline args msginit locates the .pot file in the current directory if not explicitly given & warns about using the "C" locale.


From there msginit normalizes the localename, generates an output file name if not explicitly given warning about overriding existing files, parses the .pot file (next toot), on Windows overrides $PATH, normalizes metadata on each of the UI strings, ensures there’s the correct number of plural forms (extracting that count from metadata somehow, I’m failing to find all that code) OR for English there’s variant which prefills from the POT file, outputs these UI strings, & exits.


Opening the .pot file handles "-" specially & incorporates a configurable search path of directories to look within, as well as a couple of extensions it'll try appending. The successfully-opened file is returned to the parser.

The parser allocates/initializes some initial state to be cleaned up on exit; before calling a parse method surrounded by setting/unsetting a global, calling parse_[de]brief methods, & error reporting.

Parse method is given externally. Other methods collect/track parsed data.


msginit can parse one of 3 "input formats", the default being of course .po[t].

The .po lexer is handwritten with a scanner dealing with escaped newlines & full UTF8. This lexer repeatedly branches over the initial ASCII char, handles UTF8 EOF errors, or complains about unquoted Unicode characters.

For newlines it unsets some flags. For other whitespace it continues the loop. For '#' it checks whether it's followed by a '~' or ' ' (which has special meaning) to skip that prefix & set a flag.

Otherwise ‘#’ scans to the end of line with or without emitting a token as per config. Double quotes collects all chars until the endquote warning about newlines or EOF & resolving escape sequences. ASCII letters, ‘_’, or ‘$’ collects subsequent such chars or digits to convert into the appropriate keyword token or an error. For digits it scans subsequent digits, parses via the LibC, & emits a NUMBER. Brackets are each their own token with state updates.

The parser is implemented via Bison.


Said parser is relatively simple with rules to parse comment (own token), domain (domain & STRING tokens), message (some combination of intro, stringlist, MSGSTR token, & pluralforms). Message intros in turn consist of a MSGID token possibly preceded by MSGCTXT & stringlist and/or PREV variants of these tokens. Pluralforms consist of MSGID_PLURAL token & a stringlist. Each pluralform consists of a MSGSTR token, bracketed NUMBER token, & a stringlist.

A stringlist is a sequence of STRING tokens.

The stringtable & properties alternate input syntax more resembles the C lexing.


For output it first checks whether there is anything to output & whether we want to output nothing anyway. It checks if/how it should complain about there being multiple domains based on the given output syntax class, otherwise checks whether/how it should complain about plurals & contexts. Checks whether we're outputting to stdout & whether to configure colouring, & calls print with an open file.

The output syntax options are the same as for input but independent from it.


The default PO syntax iterates over each message outputting the metadata entry specially before extracting the header & in turn the charset to inform how it outputs each message, obsolete ones last.

Outputting a message involves outputting all the different comments first then message context (with optional charset warnings), message ID, plural ID, & message string(s).

Colors are added via a rabbithole of a bundled library. I'm not going to discuss libgettextpo responsible for Gettext's fancy formatting since it looks like quite the tangent! Though, of course, colouring is optional. Interestingly it does support outputting to HTML for this formatting!

There’s a complex utility for wrapping text string with preceding identifier within the PO writer module, amongst other simpler helper functions.

The other output syntaxes are quite straightforward without colouring & text wrapping support.

Po-File Merging

As software continues to develop it’ll add (and remove) UI strings for localizers to keep on top of. If the devs are considerate they’ll stick to a lifecycle where over certain periods they’ll avoid adding new UI strings so the localizers have a chance to catch up.

To incorporate these upstream changes into their localization where they can localize it localizers use the msgmerge command!


After initializing its own internationalization & parsing commandline flags validating 2 additional commandline args are given msgmerge handles --version & --help specially, validates commandline flags, in update mode ensures output syntax matches input syntax, tweaks flags for use by msgfmt, possibly initializes OpenMP, calls the core logic, applies desired sorting, handles update mode specially whilst outputting the messages just like msginit.


In update mode it might sort obsolete entries to the end, checks whether any changes were made, & if so generates a backup file & outputs the messages.

The core logic involves reading the two inputs (just like msginit), ensures there's a metadata message, iterates over the POT file looking for charsets in the headers, converts everything to UTF8 or otherwise handles text encodings, handles charsets specially in fuzzymode, allocates an output messagelist, in multidomain mode allocates sublists (whilst matching domains, validating plural expressions, & iterating over existing definitions to find a previously-translated match fuzzily or not to combine) otherwise skips the sublist allocations, optionally performs postprocessing regarding source locations for msgfmt, & validates results.

Po-File Editing

The next step of the internationalization/localization workflow described by Gettext’s manual is to actually edit the localizations! Said manual mentions desktop & emacs interfaces, but personally I’m more tempted to study a web interface “WebLate”. Since (even without JS, just forms) the web is a handy way to quickly gather information from people!

Web-based Localization

WebLate is a self-hostable Django webapp facilitating multilingual translation contributions to software projects which use tools like Gettext!

Upon Django’s builtin accounts system WebLate adds:


WebLate bundles a suite of “addons” you can enable or disable which upon database update may:

Each of these have a form integrated into an admin web & commandline UIs. There’s a common superclass & database model holding JSON for addon-specific configuration.


To provide an HTTP API WebLate has a routing util for capturing which virtually-hosted WebLate instance is selected, an auth token, declarative serializers via rest_framework, routing paths to views via that utility, & various views returning results of database queries.

WebLate’s custom “auth” Django-app for some reason implements its own permissions system with WebLate-specific instances, Django-Admin integration, migration utility functions, & various utilities for checking these permissions. Bundles its own createadmin, importusers, & setupgroups management commands. Declares its own Group & User classes upon Django’s baseclasses for some reason. Also adds a templatetag for checking permissions for the logged-in user. Auth’s database model can automatically group users based on email-address regexps.

If you're selling access to your WebLate instance "billing" adds a database model for which plans your customers are on & their invoices, with Django Admin integration. With Celery-automated tasks & a management command to compute & send invoices. Or there's views to download your invoices online.


WebLate has a Django-app which highlights syntax errors according to a choice of classes checking different interpolation or markup syntaxes; involving gather, sort, & overlap removal steps to be rendered via a template. Has database models for which checks to perform. There's web UIs for viewing these.

It incorporates a datafile of localized language names, & has management commands for listing ignored checks, listing top failing checks for untranslated entries, & rerunning checks.

Supported checks include:

These all share a superclass (with variants), and are often heuristic.


For importing & exporting localizations WebLate implements a minor variation upon Python's builtin io.BytesIO that adds mode & name attributes, a ParseError exception, & infrastructure to call the appropriate importer or exporter class.

Most of these exporters are provided by Translation ToolKit (which I won't discuss) via wrappers, but there's also an importer for Excel via OpenPyXL, & automatic determination of which importer to delegate to. They share featureful datamodelling superclasses. Each of its many supported importer & exporter formats gets its own, typically trivial class.

The "gitexport" Django-app provides utility functions to compute the Git URIs, & views which proxy Git's internal git-http-backend script.

The “lang” Django-app datamodels the human languages being localized into with Django Admin integration. Including standard plurals config, datafiles, display web UI, shell commands, & fuzzy matching.

For the sake of its own localization language data is moved into its own near-empty Django-app.

There's a Django-app for tracking agreements to legal terms with its own database model, forms, web UI (with even its own "templates"), decorator for use by other Django-apps' web UIs or a "middleware" around all, Django-Admin integration, & template tag for linking to these agreements.


WebLate’s “machinery” Django-app offers classes integrating into various machine translation services including previous translations on this WebLate instance, dummy translations for “Hello, world!”, deepl.com, glosbe.com, translation.googleapis.com, terminology.microsoft.com, mymemory.translated.net, amagama-live.translatehouse.org or other instances, AWS Translate (via boto3 module), microsofttranslator.com possibly with API keys from cognitive.microsoft.com, SAP MT instances, translate.yandex.net, youdao.com, fanyi.baidu.com, or an Apertium instance.

Which have their own configuration fields, and all have a common baseclass helping to abstract HTTP API usage, prioritizing, rate limiting, & language selection.

WebLate’s “memory” Django-app offers forms, its own integration “machinery” class, bridge over to whoosh module, Celery-automated cleanup, web UI, & shell commands for importing, exporting, deleting, & listing translation-memory XML/JSON files.

WebLate’s “screenshots” Django-app offers Django Admin integration, database fieldtype subclassing ImageField with additional validation, modelform, database model, Celery-automated cleanup, & CRUD (Create/Read/Update/Delete) web UI recording illustrative app screenshots clarifying what UI strings refer to. With integrated OCR via tesserocr.

WebLate’s “vcs” Django-app implements support for Git with or without Gerrit or GitHub, Subversion, GPG, Mercurial, & SSH keys as dynamically-loadable subclasses of a common baseclass aiding those in deferring to the commandline; with its own configuration fields.

And there’s various tweaks to the Django Admin, mostly to add a performance dashboard & special SSH keys screen.


For the clientside WebLate vendors Chartist.js, specially-configured BootStrap including their Datepicker, slugify.js, autosize.js, js.cookie, Modernizr.js, Mousetrap.js for keyboard shortcuts, multi.js for nicer multiselects (tells me I’ve got some redesign work to do in Iris…), Clipboard.js, FontAwesome & Font Linux, jQuery, & a couple more fonts for serverside CAPTCHAs alongside its own robots.txt & security.txt.

General utils WebLate implements for itself include:

There’s symbols & localized labels for EMPTY, FUZZY, TRANSLATED, & APPROVED states.

As well as abstractions around Whoosh, Django Messaging (for use alongside Django ReST framework), Django templating (without autoescaping but with special context), the current site, & Django Templates localization.


At WebLate's core is its "trans" Django-app! This provides:

For the sake of templates WebLate provides a template tag to extensively format translations including diff rendering & whitespace highlighting, as well as tags to render random WebLate project self-“adverts”.

Also various simple accessors on checks, name database lookups for slugs, counts, numerous links, rendering messages & checks & translation progress, outputting time more naturally, querying message state, retrieving messages, aid a tab UI, & checksum, permission, & legal checks.


But mostly it is lots of views & models!

For WebLate is centred around these database models:

Alongside this it defines config fields, & event handling to reflect database changes in repo files. Lots of methods largely relating to VCS, including on Manager/QuerySet classes.

For the core WebLate web UI there’s Django-views for:

Outside the translation view there's very little in the way of helper functions beyond what I've discussed previously. Though Django's forms framework is used extensively to interpret/validate user input!

Catalog Processing

Beyond the more boiler-platey logic msgcat reads & deduplicates the input filenames from the commandline args & maybe a given file & calls catenate_msgdomain_list, which in turn parses those files & iterates over them to determine the encoding, again to determine the identifications, count the number of translations for each message, twice to drop unwanted messages, maybe determine a common encoding, determine the output encoding (ideally no conversion) if not given by the user, apply text reencoding, copy all the messages into a single output catalog, & handle duplicates specially.

msggrep parses 5 input “grep tasks” deferring to a libgrep to handle multiple regexp syntaxes using them to compile the given regexps, before filtering all messages by filename, msgctxt, msgid, msgid plural, msgstrs, translator comments, & other comments.

msgcomm works basically the same way as msgcat but with additional globals.

msgconv calls iconv on all parsed messages before writing them back out possibly sorted.

msgfilter runs a subcommand (or builtin function) upon all parsed messages once text-reencoded before serializing them back out possibly sorted.

msgexec runs a subcommand upon all parsed messages once text-reencoded echoing their output instead of serializing results.

msguniq catenates a single file.

msgcmp removes obsolete entries between 2 inputs, extracts textencoding from header fields to ensure if one's UTF-8 the other is as well, canonicalizes text for fuzzy matching if requested, allocs an output, iterates over messages to retrieve & display matching entries in the other file, & a final iteration outputs strings which weren't present.

msgattrib reads the catalog possibly alongside allow/block-list catalogues to filter by, & iterates over it to update fuzzy & obsolete flags.

And finally msgen (did I cover this already? The name assumes English is the source language) copies source text to translated text for each entry in the parsed file.

You can also build your own utils based on the same library all these commands I’ve been describing use.

Compiling PO files

Once you have fully-enough translated .po files Gettext requires you to compile them into .mo files, which is an on-disk sorted parallel array with an optional hashmap index! To do so you use the msgfmt command, which can be reversed with the msgunfmt command.
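For reference, the layout being compiled to is simple enough to sketch as structs. This is an illustrative reading of the format described in the Gettext manual, not the project's own headers; all fields are byteswapped or not depending on the magic.

```c
#include <stdint.h>

struct mo_header {
    uint32_t magic;            /* 0x950412de, or byteswapped if written big-endian */
    uint32_t revision;         /* format version */
    uint32_t nstrings;         /* number of (original, translation) pairs */
    uint32_t orig_tab_offset;  /* table of {length, offset}, sorted by original */
    uint32_t trans_tab_offset; /* parallel table for the translations */
    uint32_t hash_tab_size;    /* 0 if no hash index was emitted */
    uint32_t hash_tab_offset;
};

struct mo_string_desc {        /* one entry in either table */
    uint32_t length;
    uint32_t offset;           /* from the start of the file */
};
```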

After initializing its own internationalization & parsing commandline flags handling --version & --help specially msgfmt validates there are additional commandline args unless handling XML or .desktop input.


Then msgfmt extensively validates those commandline flags, handles .desktop files or directories of them specially echoing the data it parses with added localizations, handles XML mode specially parsing a rulelist before merging it with all XML in a directory both utilizing an external XML parser, possibly allocates a new domain in the absence of an output filename, reads the specified input file according to the specified syntax, checks that syntax produces UTF8, & removes obsolete strings.

With special cases out of the way & .po (or whatever) messages parsed msgfmt now iterates over the catalog domains to check plural formulas match the counts seen elsewhere in the file whilst trial-evaluating said formulas & various other per-message basic checks (i.e. begins or ends with newlines, validates format strings matches, validates both have accelerators, & validates metadata has all necessary fields), then outputs the messages in appropriate syntax & maybe outputs stats.


For .mo output, after validating there are in fact messages to output, msgfmt deletes the "POT-Creation-Date" header for reproducible builds before opening the output file if not stdout (a.k.a. "-") taking care not to overwrite existing files, and:

  1. With some arrays it iterates over all messages to concatenate msgctxt & msgid into msgctid, tests for system-dependent strings, parses C or ObjC format strings to see if there's any platform-specific directives, & gathers strings into the appropriate array.
  2. Sorts the platform-independent strings if any were found
  3. Computes the min output version
  4. Computes the hashmap size (if desired): a prime number > 3, at least 4/3× the message count so the table stays under 3/4 full
  5. Gathers a header struct with or without including the headerfields for platform-specific strings
  6. Optionally applies a byteswap to the header & outputs it.
  7. Iterate over strings to prepare length & offset fields optionally byteswapped before outputting them.
  8. Do same for their corresponding translations.
  9. If outputting a hashmap index alloc/zero said hashmap, insert each entry (using HashPJW hashfunction with increment-rehashing), optionally byteswaps each entry, & writes them out.
  10. If including platform-specific strings generate an array splitting them by platform writing the segments header out followed by the clustered strings
  11. Write each original string then all of their translations
  12. If needed do same for platform-specific strings
  13. Cleanup!

After initializing & parsing/validating commandline flags handling --help & --version msgunfmt parses each input file for the specified syntax possibly sorts messages by their ID, & outputs them back out using the same .po serialization most other commands use!

For .mo files (also C#, or separately Java, C#, or TCL) it opens the file (if not stdin a.k.a. "-"), checks whether we need to swap the byteorder, performs format validations, & iterates over all strings into a messagelist.

Incorporating Translations

Once you’ve gone through the translation process I described the tools for above, you now need to actually incorporate those translations into your software! For this the functions you call to mark text to be translated also looks up those translations to be swapped into the UI.

But first you need to call textdomain to set the catalog from which it looks these UI strings up.

If its argument is NULL textdomain returns the current global.

Otherwise textdomain claims a writelock, then examines the arg further. If it's empty textdomain sets the global (and local) to "messages". If it's unchanged it sets the local. Otherwise it sets the global & local to a new copy of the arg.

In any case if the new local is non-NULL it increments the catalog counter & considers freeing the old value, before releasing the lock & returning the new value.

In short textdomain is a slightly fancy accessor.
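Putting the setup calls together, a typical program's initialization looks something like this sketch (the paths & domain name are illustrative):

```c
#include <libintl.h>
#include <locale.h>
#include <stdio.h>
#define _(msgid) gettext(msgid)

int main(void)
{
    setlocale(LC_ALL, "");                        /* honour the user's locale */
    bindtextdomain("myapp", "/usr/share/locale"); /* where myapp.mo catalogues live */
    textdomain("myapp");                          /* catalogue used by _() lookups */
    printf(_("Hello, world!\n"));
    return 0;
}
```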


The rest of Gettext's API including gettext/_ are trivial wrappers around dcigettext. Here I’ll describe how dcigettext works when the args unspecified by gettext/_ are NULLed out. Domains, categories, & plurals will be described later. All of which are handled in this function.

If the UI string is unspecified dcigettext returns NULL. Otherwise it saves the error number, claims readlocks, & retrieves configured catalogue.

After that initialization it populates a cache-search key, searches that binary search tree under a readlock, & returns the looked-up translation if it found one cached, releasing locks & restoring errorcodes.

Otherwise determines whether it needs to be more careful due to running as root, determines the directory to search for in the path, iterates over all configured locales exiting upon “C” or “POSIX” mmaping & validating & caching the files so it can lookup translations in them.

If it successfully found a translation dcigettext updates the cache (checking whether there's an existing cache entry it can overwrite) before restoring errorcodes, releasing locks, & returning the result.

Otherwise if not in a SUID program it checks $GETTEXT_LOG_UNTRANSLATED to see if (under a lock) it should log the untranslated UI string possibly lazily-opening the logfile to do so. To aid localizers in prioritizing. Then returns the untranslated string!


Searching for a localization in a mmaped file involves checking if said file has a hashtable. If so it performs a hashmap lookup (HashPJW with increment rehashing until tombstone or match), otherwise performs a binary search over the sorted keys table (almost as fast!).

In either case upon success it looks up the translation in either the cross-platform or platform-specific arrays, extensively considers whether we need to convert text encodings, & returns the result with its length.
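A sketch of that lookup scheme, written from the format description rather than copied from the runtime: the P.J. Weinberger string hash, with a second hash derived from it as the probe increment.

```c
#include <stdint.h>
#include <string.h>

static uint32_t hash_pjw(const char *s)
{
    uint32_t h = 0;
    while (*s) {
        h = (h << 4) + (unsigned char)*s++;
        uint32_t g = h & 0xf0000000u;
        if (g) { h ^= g >> 24; h ^= g; }
    }
    return h;
}

/* slots[] holds 1-based indices into the sorted string table, 0 = empty slot.
   `size` is the prime table size msgfmt chose, so any increment < size works. */
static uint32_t hash_lookup(const uint32_t *slots, uint32_t size,
                            const char *const *orig_strings, const char *msgid)
{
    uint32_t h = hash_pjw(msgid);
    uint32_t idx = h % size;
    uint32_t incr = 1 + h % (size - 2);   /* coprime with the prime size */
    while (slots[idx] != 0) {
        if (strcmp(orig_strings[slots[idx] - 1], msgid) == 0)
            return slots[idx];            /* 1-based match */
        idx = (idx + incr) % size;
    }
    return 0;                             /* not present in this catalogue */
}
```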


What with synonyms & context sometimes the untranslated UI string is not enough to identify the appropriate translation! So for disambiguation dcigettext & some of its wrappers accept a “category” & heavier-weight “domain”.

Categories get validated first-thing (after the untranslated UI string) at the start of each call with LC_MESSAGES_COMPAT being converted into the default LC_MESSAGES. They are incorporated into the caching. And is used in determining which locale to use!

I'm failing to see where the functions it calls to convert the category into a locale are defined even searching online, but I think I can infer they relate to LibC's APIs. Unless this locale is "C" it then consults $LANGUAGE & applies platform-specific preprocessors to normalize the format, before returning that priority list.

The category is then incorporated into the filename of the .mo files it should consult. I described how it handles the priority list yesterday.

Domains default to a configurable global, are considered in cache lookup, located within a global path to get a directory to look for the .mo within, are incorporated into the .mo filepath, & domains are incorporated into missing-message logging.

Plurals

Many if not most natural languages have different grammatical structures ("pluralforms"/"plurals") to indicate different quantities. Though not every language agrees how quantities map to their pluralforms! e.g. Is 0 plural, not, or something else?

One of those things you might assume is trivial…

Gettext’s facilities for this assumes English as a source language, though I suspect those assumptions can easily be overcome for programming in other languages.


dcigettext & its many wrappers will resolve plurals (count defaults to 0) once it successfully looked up translation in the configured/selected catalogue. If unsuccessful it may optionally apply an English-like “germanic” n == 1 pluralform between the 2 given strings.

This involves interpreting (for the given count) the plural formula from the catalogue & iterating over the multistring to find the computed index. Which .MO compilation validates via bruteforce stays in-range.


Interpreting the plural formula is done over the abstract syntax tree recursively branching over the number of operands (0-3 inclusive) before branching over & applying the mathematical operation.
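A minimal sketch of such an interpreter (the node type & operator set here are hypothetical; the real one covers the full C relational/arithmetic/ternary operators):

```c
enum op { NUM, VAR_N, LNOT, MULT, MOD, EQ, NEQ /* ... */ };
struct expr { enum op op; unsigned long val; int nargs; struct expr *args[3]; };

static unsigned long plural_eval(const struct expr *e, unsigned long n)
{
    switch (e->nargs) {
    case 0:                        /* a literal, or the variable n itself */
        return e->op == NUM ? e->val : n;
    case 1:                        /* logical not is the only unary op */
        return !plural_eval(e->args[0], n);
    case 2:
        switch (e->op) {
        case MULT: return plural_eval(e->args[0], n) * plural_eval(e->args[1], n);
        case MOD:  return plural_eval(e->args[0], n) % plural_eval(e->args[1], n);
        case EQ:   return plural_eval(e->args[0], n) == plural_eval(e->args[1], n);
        case NEQ:  return plural_eval(e->args[0], n) != plural_eval(e->args[1], n);
        default:   return 0;
        }
    default:                       /* 3 operands: cond ? a : b */
        return plural_eval(e->args[0], n) ? plural_eval(e->args[1], n)
                                          : plural_eval(e->args[2], n);
    }
}
```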

Said expression is parsed when loading in the .mo file by locating the “plural=” & “nplurals=” headerfields of the metadata entry (translation for “”) parsing nplurals via strtoul after scanning digits, & parsing plural using Bison & a manual lexer. Relatively trivial usage. Defaults to returning an AST representing Germanic n != 1 expression.