GNU LibC (Core Libraries)

GNU LibC, following the POSIX standards, is the most fundamental library on most UNIX-like systems today, allowing for I/O, memory allocation, filesystem manipulation, etc. Put simply, its primary job is to abstract away the datastructures exposed by Linux, central config files, GCC, & the underlying CPU.

Cross-platform modules

GNU LibC provides several modules which (building upon its others) are platform-independent. Mostly these serve to funnel calls into the operating system’s “kernel”, e.g. Linux.

Commandline Flags (argp)

<argp.h> aids parsing commandline flags, since they’re not parsed when passed into the program. And it can output help info for its configuration.

To do the parsing, after initialization argp_parse repeatedly calls “posix” APIs wrapped in state tracking & caller-specified parsing callbacks, plus a finalizer.

That’s possibly preceded by reformatting the list of arguments to parse into the string-based syntax and/or adding some of its own options for version or help.

There’s a suite of functions for outputting errors/warnings related to commandline flags, including reformatting the list of options & some callbacks into localized & user-friendly help text via an array of “hols”. These “hols” are a flattened array of “entries” & “clusters” referencing the original data.

In converting from “hols” to textual output it uses gettext (of course, described later) & a suite of buffered textstream formatting functions aiding indentation & text wrapping.

Cryptography (crypt)

GNU LibC includes a basic crypto library built in, which’ll be the last one I discuss today. Crypto is built on the concept of “trapdoor” mathematical functions, so I will describe the different standard algorithms it supports in terms of their underlying trapdoor function.

Memory Allocation (malloc)

Memory allocation is a central construct in computer programming upon which basically everything else relies. Most languages make it an implicit aspect of their syntactic constructs, as part of their “runtime system”. C, however, doesn’t have much of a runtime system, and Assembly doesn’t implement one. So it ends up being GNU LibC’s job.

First off, GNU LibC provides debugging hooks incorporated into the allocators. These may fprintf() relevant data, sourced in part from the dynamic loader, and call a function for GDB to break on. Or they may seek to detect multithreading conflicts.

And there are atomic profiling functions which can optionally wrap all the memory allocators, outputting their data to a file specified by the MEMUSAGE_OUTPUT environment variable.

There’s an internal allocator for fixed-size arrays which specifically checks for integer overflow resulting from multiplication.

GNU LibC provides “obstacks”, which implement a stack of arbitrary-sized values using a merge between arrays (implemented in the C preprocessor, thus guaranteeing inlining) and linked chunks, calling a stored callback to allocate or free the next chunk whilst maintaining memory alignment.

There’s also code for (re)allocating dynamically-resized arrays, which tracks its size in a header struct.

And finally, if you have an allocation pattern where everything’s freed simultaneously you can go without a lot of the bookkeeping required by GNU LibC’s main allocator. In the common case it allocates by simply atomically incrementing an appropriate in-memory pointer, making it very fast! Yes, care is taken to make this thread-safe. It uses mmap() directly as the underlying allocator.

Fully Flexible Allocation

Last night I described several auxiliary memory allocators GNU LibC provides, catering to certain use cases it can better optimize for. Usually when projects reimplement their own memory allocator it’s because they fall into one of these cases (or because they’re implementing a programming language), though not always, e.g. GLib’s “slice” allocator.

But catering to more general use cases requires more bookkeeping, balancing between time, space, & portability.

The central constructs in this generic memory allocator are cyclical doubly-linked lists for a range of sizes, formed out of the unused memory. The data in these linked lists are the nodes themselves, which can be removed for use by your program until free() reinserts them into the appropriate linked list.

Between every two chunks in memory is a header specifying the size of the chunks on either side, and indicating if the lower side is free, to aid defragmentation & realloc().

If malloc() requires more memory to allocate into, or if you request a very large amount of data from it, GNU LibC will carefully call mmap() with profiling & optional validation, thus handing the job off to the kernel. The initial bytes of this will be used to store the head of additional linked lists. Yes, munmap() & mremap() are also used, as well as brk().

To delay the cost of freeing memory (including malloc() excess) a special linked list is kept with a corresponding flag.

The allocator is thread-protected by a mutex, and compiletime-optionally offers per-thread recently-freed lists, which malloc() checks first.

The main logic of malloc() then involves computing which bucket to check, checking early whether it needs to mmap() more memory from the kernel, & iterating over the buckets large enough for its needs to see which has any memory it can remove from the linked lists for caller-use. Possibly stealing everything else in that list for use by its thread.

There are two types of bins malloc() might check like that: “fastbins” and, for smaller sizes, “normal bins”. Otherwise it needs to “consolidate”.

Then it compiletime-optionally checks the local thread allocator & iterates over the recently-freed list, validating and sorting the entries as it looks for an appropriately-sized entry.

Failing all that it iterates over the chunks from the fastbins to find the one with the minimum excess space, or checks remaining bins.

Then it cleans up & returns a pointer.

Freeing memory meanwhile involves compiletime-optionally checking if the local thread allocator hasn’t overflowed, then considering whether to insert into a fastbin, the recently-freed list, or hand off to the kernel.

Consolidation involves checking every chunk in a fastbin to see if it’s now contiguous with free()’d memory & if so moving it into the recently-freed (or “unsorted”) list. So that the actual consolidation can be done lazily.

realloc() checks if the next contiguous chunk is free before falling back to the normal malloc()/memcpy()/free() routines. malloc_trim() does the reverse, freeing segments of allocated memory.

malloc_usable_size() checks the between-chunk headers to return how much RAM was allocated. Then there are functions for returning gathered statistics, possibly outputting them to stderr!

Or you could alter the magic numbers used by this allocator.

Remote User Accounts (nis)

Having skimmed nis, I think I’ll skip over describing it in detail. NIS (formerly known as Yellow Pages, after the phonebook) is a legacy authentication protocol. It’s useful in a business or cloud setting for synchronizing UNIX user accounts between devices, but it has since been superseded by LDAP & I don’t know why GNU LibC still supports it.

I would consider this feature, amongst similar ones in the Linux kernel, spyware wherever any distro auto-enables it. And it’s near-certainly irrelevant to the home computing uses I’m interested in.

Hence, in addition to it being superseded by LDAP, I’m not interested in digging into GNU LibC’s NIS support beyond briefly describing what it is.

PThreads (nptl)

The UNIX fork() syscall already offers asynchronicity, so pthread’s job is to reintroduce some useful synchronization, including offering the caller useful datastructures to invoke their own synchronization.

A callback is called by the platform-specific code to run other appropriate caller-registered callbacks “on fork”.

Thread-local variables are stored in an array, or overflow hashmap, within the current thread’s struct pthread. How it gets the struct pthread representing the current thread is platform-specific, so I’m just kicking the can down the road. Ultimately there needs to be CPU support.

pthread_setattr_default_np sets some validated global variables as requested.

PThread provides functions (with validation) wrapping __sched_getparam, etc to set thread priority in Linux.

Threads can have a stack of callbacks registered to be called upon errors before jumping to an unwind_stop caller-specified callback.

“Semaphores” wrap the CPU’s support for “atomic values” & the kernel’s support for “futexes”, which incidentally in turn wrap atomic values (again!) & the process scheduler. Calls into the kernel are slow, so semaphores first update their own atomic value, whilst adding error handling & stuffing an extra count into the same atomic value.

A read/write lock is a variant of semaphores which has two atomic counters, enforcing the invariant that no single writer may lock the data whilst there are others reading it. GNU LibC stuffs a couple bitflags into the read counter.

(Atomics are read/write operations which cannot be interrupted, full details later)

“Once” variables use an atomic & a futex to ensure a given callback is only ever called once, atomically flagging whether it has been, or is being, run.

PThreads includes a complex mutex representation counting (non-atomically) the number of users, distinguishing different subtypes, allowing the same thread to “recursively” lock & unlock the mutex whilst it already holds it via a separate count, tracking owners for “adaptive” locks, etc.

WebKit has written they don’t like this overhead, so they wrote their own.

There’s also a lock() variant which specifies how long to wait before timeout.

A barrier allows a configurable number of threads into the critical section, and once they’ve all finished it lets any further threads try to enter the critical section again, using 2 atomic counters (for threads in & threads out) & a futex.

When various OS signals are received those stacked callbacks mentioned earlier are called, protected by an internal lock. Separate functions (protected by their own internal locks) allow you to send one of those OS signals to a pthread.

A “condition” is amongst the most primitive synchronization constructs: threads block on a condition until it’s triggered. GNU LibC uses futexes (yet again) to power its conditions, swapping between two of them as determined by atomic use-counters ensuring it can always be triggered. Ultimately the kernel’s scheduler has to be the one implementing atomic conditions.

Upon the internal SIGSETXID signal pthreads will call the setxid syscall() on themselves again under an internal lock.

Finally there’s some decent initialization logic to run on program startup & a cache of allocated memory for struct pthreads.

TODO What’s nptl_db tracking?

User Accounts (nscd)

In most computing setups, you want to limit who can use the computer. This is what user accounts are for, implemented by either nss or nscd.

Central functions like __nscd_getgrouplist first consult a pthread-safe profiled hashmap cache for data on a usergroup, etc. before requesting said data via a UNIX socket from a local SELinux-secured daemon centrally caching this data from /etc/* in RAM.

The central daemon parses those “passwd”, “group”, “hosts”, “services”, & “netgroup” TSV files from /etc/ (using the same caching), serves those files or parsed data to requesting clients, & aggregates stats for logging. The daemon is started as-needed & secured via SELinux if possible. It takes great care to clean up any mess in its local environment to avoid security vulnerabilities.

Miscellaneous Utils (stdlib)

Going over the non-trivial cross-platform utility functions in stdlib:

Also there are several functions which may be overridden by platform-specific code, including:

Platform-dependent APIs with stubs dumped in “stdlib” include:

And there’s a linter script here too…

Highlevel (Textual) I/O (“libio”)

“libio” makes the crucial task of inputting & outputting (primarily textual) information more efficient & convenient.

“libio” includes several backends, including various ones wrapping provided memory, but mostly it wraps a kernel “file descriptor” from “io”. This file-descriptor backend will be discussed through the rest of this thread, & includes a buffer & text decoding.

Additional state on open files (including the seeked offset in the buffer) is tracked in-process by the “io” GNU LibC sublibrary, beyond the file descriptor & the buffer, to reduce how many syscalls are needed against it; some of this state is parsed from the caller-specified “mode”.

The buffer is implemented as a ringbuffer, with separate read & write indices chasing each other. Once they catch up, or upon certain seeks, the file is “flushed”. The text decoding dispatches to a different backend provided by the “iconv” sublibrary, applied in-place within the ringbuffer.

There’s an API for opening a process & returning I/O streams for communicating with it. The public APIs, in addition to calling the methods on the backend (or initializing the appropriate one), also update the in-memory state for the file.

Evaluating formatting strings is handed off to the “stdio-common” sublibrary I’ll cover later.

There are several stubs it expects to be entirely overridden by platform-specific code, including:

Also overridable, but with platform-independent defaults:

In most cases where there’s locked & _unlocked variants of a function, one or the other needs to be overwritten with platform specific code.

Semi Platform-Dependent

Sublibraries in this section are largely platform-dependent, but include some code which is platform-independent. Though, due in part to incomplete knowledge when I classified GNU LibC’s sublibraries, the difference from the platform-independent sublibraries described above is a little fuzzy.

InterNETworking (“inet”)

“inet” allows you to open communication channels to computers elsewhere on the InterNET. Its main platform-independent task is to resolve the destination, whether it’s textually identified by an IPv4 or IPv6 address, hostname, or DNS address.

inet_ntoa applies the format string “%d.%d.%d.%d” to the given parsed IPv4 address. inet_netof & inet_lnaof are heuristics for extracting the mask from a sole IPv4 address.

inet_network parses IPv4 addresses into 4 bytes, with support for indicating numbers are written in base 8, 10, or 16.

ether_ntohost & ether_hostton wrap the “nss” sublibrary to convert between hostnames & Ethernet (MAC) addresses.

ether_ntoa_r applies the format string “%x:%x:%x:%x:%x:%x” to the given parsed Ethernet (MAC) address.

ether_line & ether_aton_r parse an Ethernet (MAC) address, only supporting base-16 numbers.

rexec[_af] (utilizing ruserpass) & the superseding rcmd implement protocols for running shell commands on a remote machine. ruserpass consults common (non-specialized) environment variables for authentication parameters.

There’s functions, with a preceding __idna_name_classify_check, for resolving the IDNA encoding of domain names written in the broader Unicode character set, via a dynamically loaded library.

__check_pf indicates whether an IPv4 and/or IPv6 address was parsed.

There’s functions converting connection “deadlines” to/from other formats used elsewhere.

There’s functions for reading & writing (as per the IETF spec) the options in the extended IPv6 header provided from the kernel via struct cmsg. Including routing information for e.g. ping/traceroute.

bindresvport iterates over privileged ports to bind a socket on an open one.

To aid network management the “nss” & “nscd” sublibraries are wrapped to implement “net groups”.

And finally getnameinfo, after some validation, recomputes the hostname the given network address was parsed from, branching upon the socket type & flags, as well as the TCP or UDP port number. Ultimately it calls the other functions I have described or will describe. The localhost address is cached, protected by a lock & a double-checked flag.

__inet6_scopeid_pton retrieves & serializes network interface indices.

As always I’m skipping over minor functions.

inet_makeaddr merges a network address & host address into the space of a single structure, determining how far to shift the network address based on which “class” it is in before OR’ing the two together. This code can be overridden by platform-specific code.

Fully platform-specific stubs include:

Input/Output (“io”)

One of the most important (and challenging, though not at this layer I’m describing today) things an operating system’s APIs implement is input & output. Important in the sense of “if a tree falls in the forest and no-one’s around to hear it, does it make a sound?”

Most languages, including C, expose I/O to programs via APIs for them to call, which ultimately need to wrap the kernel’s syscalls. GNU LibC’s “io” sublibrary is almost entirely such language bindings.

Starting with the few platform-independent functions in GNU LibC’s “io” sublibrary: ppoll wraps poll, adapting the format for timeouts & temporarily setting a sigprocmask. statx* wraps fstatat64 with alternate interfaces.

There are internal functions for determining whether a file has changed based on the results of stat64.

And there’s a fairly sophisticated depth-first iterator over all descendant files of a directory, with binary-searched caching, filepath normalization, symlinks, etc.

GNU LibC’s “io” sublibrary has a few functions which may be overridden by platform-specific code but don’t have to be:

stat64(file, buf) = fstatat64(AT_FDCWD, file, buf, 0). stat(file, buf) = fstatat(AT_FDCWD, file, buf, 0). lstat(file, buf) = fstatat(AT_FDCWD, file, buf, AT_SYMLINK_NOFOLLOW). lstat64(file, buf) = fstatat64(AT_FDCWD, file, buf, AT_SYMLINK_NOFOLLOW). And lockf[64] is adapted to calls to fcntl with reformatted arguments.

Platform-dependent stubs include:

LOGging IN (“login”)

I’ve described several platform-independent APIs before for authenticating into a computer, which you do so the kernel can authorize & audit your interactions as per its configuration. But how does the kernel know who’s currently logged in? That’s what “login” does.

For the platform-independent components:

logwtmp straightforwardly wraps updwtmp whilst passing the current time from clock_gettime64.

pts_name wraps ptsname_r, doubling the passed buffersize until it’s large enough to fit the response.

logout reads the current state from the useraccounts logfile, and appends an updated state of DEAD_PROCESS with a fresh timestamp.

openpty wraps getpt, grantpt, unlockpt, ioctl(TIOCGPTPEER), optionally pts_name & open, optionally tcsetattr(TCSAFLUSH), optionally ioctl(TIOCSWINSZ), & optionally pts_name to open & configure a new pair of tty file descriptors.

try_file_lock wraps fcntl64_nocancel with a noop SIGALRM temporarily set. Restores existing SIGALRM configuration on exit.

If ioctl(TIOCSCTTY) is unsupported, login_tty instead closes the old stdin, stdout, stderr, & a reopened device file for the shell. In either case it calls setsid & configures stdin, stdout, & stderr to point to the given shell file descriptor, closing the old file descriptor number.

file_unlock wraps fcntl64_nocancel(F_SETLKW).

tty_name wraps ttyname_r like pts_name does ptsname_r.

maybe_setutent opens the authenticated logfile & seeks to the end if it’s not open already. read_last_entry reads a binary record at a saved offset.

login retrieves the shell name for stdin, stdout, or stderr, strips /dev/ if present, and appends it to the logfile.

__libc_getutent_r ensures the logfile’s open, locks it, & reads the last entry using above functions. internal_getut_nolock retries upon failure.

__libc_getutline_r opens the logfile if necessary, locks it, & reads the final entry until successful.

__libc_pututline opens the logfile if necessary, ensures we’ve got permission to write it erroring out if not, checks whether the new entry matches the last one to determine where to write it, seeks to the new location, writes it, & truncates the file!

updwtmp opens & locks the logfile, seeks to end, truncates off partial writes, & appends to it.

Amongst other related functions.

Functions with platform-independent fallbacks which can be overridden by platform-specific code:

utmpname with appropriate locks & closed logfile, validates & stores the given filepath in a global.

updwtmp forwards (with filename adjustments) to the internal implementation described above. As do (with a lock & validation) getutline_r, getutid_r, setutent, getutent_r, pututline, & endutent.

getutent, getutid, & getutline allocate global buffers for their namesakes.

Platform-dependent stubs include:

Shell commands are provided for outputting the logfile as text (“utmpdump”) & for configuring access rights to pseudoterminals (“pt_chown”).


Miscellaneous Utils (“misc”)

Next on my list of GNU LibC’s semi-platform-dependent sublibraries is “misc”, so it sounds like no good introduction is possible.

Amongst its platform-independent functions are:

lfind iterates over an array finding the first element equal to the given value according to a given callback. lsearch wraps this to append the item to the end of the array if absent.

insque inserts an element into a doubly-linked list (initial properties defined by struct qelem). remque removes it.

getusershell, endusershell, & setusershell consult globals set by initshells, calling it if necessary. initshells frees the old values, opens & fstat64s a configuration file, mallocs the new values, & reads each line into the global array stripping comments, spaces, & non-absolute paths.

getpass tries to open /dev/tty falling back to stdin & stderr, ensures it’ll get closed, locks the file, tries to disable echoing via tcgetattr, writes the given prompt, reads a line, & restores the terminal.

fstab_init allocates a buffer, & reopens or resets the file descriptor. fstab_fetch reads an entry into that buffer. fstab_convert converts from internal to external structs. This all wraps the “mntent” APIs via these several higher-level wrappers, & there are deallocators.

The internal __fd_to_filename function converts the file-descriptor integer to a string (inline code) & appends it to FD_TO_FILENAME_PREFIX.

dirname locates the last (non-final) “/” & returns that slice of the string.

The internal __FCVT_R function validates its arguments, extracts the sign bit then takes the absolute value, divides by 10 until we get the desired rounding, calls SNPRINTF, strips off leading zeros, & adds trailing zeros. The internal __ECVT_R function wraps __FCVT_R with additional preprocessing of the number being converted to text.

daemon (necessary for init systems prior to systemd) does a dance to detach this process from its parent: forking, calling setsid, cd’ing to “/”, opening & fstat64ing “/dev/null”, & configuring stdin, stderr, & stdout to point to it.

is_open wraps fcntl(F_GETFL) (or on Windows, _get_osfhandle). flush_stdout wraps fflush(stdout) whilst validating stdout is_open. print_errno_message wraps strerror[_r] & fprintf to output a given errnum. error_tail wraps vf[x]printf(stderr), optionally print_errno_message, putc or fxprintf, fflush, & optionally exit.

Functions which may be overridden by platform-dependent code, but have platform-independent implementations, include:

__error_internal adds pthread shutdown, stdout flushing, outputting the program’s name, & file locking to error_tail. error adds variadic arguments. [__]error_at_line[_internal] adds pthread shutdown, stdout flushing, file locking, and outputting the filename & line number falling back to the program name. The filename & number are presumably provided by the C preprocessor.

Turns out error[_at_line] is overridable by platform-specific code.

tsearch, tfind, tdelete, twalk[_r], & tdestroy implement a “red-black” balanced binary search tree, utilizing internal tdestroy_recurse, trecurse[_r], maybe_split_for_insert, & check_tree[_recurse] support functions & a given comparison callback. A binary search tree stores values less than a node’s value under its left child, & greater under its right. A red-black tree ensures all leaves are at roughly the same depth by tracking whether each branch is left-heavy (roughly “red”) or right-heavy (roughly “black”).

[v]syslog[_chk] validates the passed flags, grabs the syslog lock (ensuring it will be unlocked), & tests the “priority” against a bitmask. It then opens a “memstream” to output formatting results to (with a fallback for out-of-memory errors), locks the memstream, grabs & formats the current time, retrieves the current seek offset, outputs the “LogTag” or program name, optionally the process ID in square brackets, optionally a “: “, restores the errno, & outputs the caller-provided format string. That text optionally gets outputted to stderr if so configured. Then, if SIGPIPE is available a signal handler is configured for it, the configured socket is opened with a couple retries if necessary, the message is sent to that socket (closing & reopening upon error), & it cleans up.

openlog under the lock (re)opens the socket with a couple retries. closelog[_internal] under a lock resets some globals & closes the socket.

setlogmask under a lock sets the LogMask global, returning the old value.

There’s a hashmap implementation maintaining a prime size using linear rehashing.

getttynam iterates over ttyents to find the one with the desired name. getttyent opens & locks the file, reads each line (where they’re not too big) whilst skipping comments & preceding whitespace, parses the first found data line utilizing an internal “skip” function to handle quotes & otherwise splitting by whitespace, & returns it. setttyent opens the file & configures locking. endttyent closes it.

getauxval checks various globals depending on the given “type”.

mk[o]stemp[s] & mktemp wraps __gen_tempname. setmntent, endmntent, getmntent[_r], addmntent, & hasmntopt opens, closes, reads, & writes a TSV file of mountpoints. sbrk wraps brk to retrieve the old boundary & set a new incremented boundary, with added validation.

Functions with platform-dependent implementations in GNU LibC’s “misc” sublibrary include:


POSIX APIs (“posix”)

“posix”’s platform-independent code here largely implements simple character-wise interpreters, including:

Functions which can be overridden by platform-specific code but don’t have to be include:

Function stubs whose implementations are overridden by platform-dependent code include:

Regular Expressions

Regular expressions provide an easy way (when not overcomplicated, as they frequently are) to test whether some text matches some given relatively-simple syntax, extract where it does, or replace those occurrences.

I’ll start from regcomp.

Anyways I quite like this implementation! The syntax is simple enough to not overstay its welcome, it uses the most optimal algorithm (unlike most other languages), & it still found ways to incorporate the features devs like elsewhere.


regcomp starts by initializing/allocating the given regex_t pointer & reformatting the given bitflags before handing off to the main code &, if successful, compiling “fastmaps” from it.

re_compile_internal further initializes the provided regex_t pointer, allocs/inits a DFA from the regex’s buffer (reallocating said buffer if needed), handles errors, & copies the regex source code into the DFA, (re)alloc/init’ing a re_string_t for it whilst reencoding the text & recording char boundaries.

It then runs its handwritten pushdown-automaton parser, emitting a bin_tree_t of partial re_dfa_t’s whilst pulling tokens from its lexer & lowering character classes & repetitions. Refer to the UNIX documentation for details on this syntax.

If successful it then analyzes this AST to convert it into Non-deterministic Finite Automaton form. Which involves allocating next, org_indices, edests, & eclosures arrays for the first re_dfa_t in the regex’s allocation buffer.

Amongst those arrays is a subexpression smallint map, populated from a postorder traversal over the Abstract Syntax (Binary) Tree setting appropriate bitflags to indicate where it has done so.

The next several postorder traversals are for lowering away parenthesized “subexpressions”, gathering the possible initial characters for each branch, computing the subsequent possible characters from that, constructing an NFA graph from that analysis & the control flow, & calculating epsilon closures.

The difference between a Non-deterministic Finite Automaton (NFA) & a Deterministic Finite Automaton (DFA) is that an NFA may have multiple edges for the same char going off a node, and that it may contain “epsilon nodes” which may be traversed prior to following the edge for a character. Expressing a pattern is easier in an NFA, but interpretation’s harder.

Another compiletime-optional pass traverses the graph/automaton to flag where we’re dealing in pure ASCII. Then computes initial state & cleans up.


To evaluate a regular expression compiled as per above, you call regexec, which validates its input, takes a lock, & hands off to re_search_internal.

re_search_internal validates the fastmap if present, computes the number of extra matches to capture (stored separately in memory), performs additional validation, & allocs/inits a new re_string_t for the text being matched, followed by the re_match_context_t. There’s additional validation for multicaptures & more initialization.

Iterating over each starting position in the string to be matched (in forward or reverse order) it considers whether it can use the fastmap as a cache to quickly discard incorrect starting points, takes a slice of the string being matched (making sure to properly handle multibyte characters), then runs the core interpreter (check_matching) checking for success, & copies out all the text which has been captured.

After validating/selecting it’s initial state check_matching optionally saves an offset for capturing text and/or dynamically lowers backrefs into the DFA, checks whether to exit, & enters it’s mainloop.

This mainloop first considers whether it needs to realloc buffers (should only be needed in presence of backrefs), looks up & traverses the appropriate edge by char or multibyte “word”, optionally stores the state in a per-char log, detects failures & whether to retry, & maybe exits.

re_match/re_search[_2] are variants of regexec which operate on a slice & do more preprocessing to ensure a fastmap is computed & that there’s space for at least 1 pseudoregister. Those registers are returned to the caller. The _2 variants concatenate two strings beforehand.

re_exec is a BSD-compatible interface to regexec.

During interpretation this regex engine may wish to prune impossible-to-reach nodes. Upon updating the per-char log NFA nodes are merged into 1 DFA node.

(De)Serialization (“stdio-common”)

GNU LibC’s “stdio-common” sublibrary provides some higher-level I/O abstractions & syscall wrappers, largely relating to serializing & deserializing C’s primitive types to/from text.

psignal(sig, s) runs fxprintf(NULL, "%s%s%s\n", s or "", ": " or "", _(__sys_siglist[sig])) or asprintf(&buf, _("%s%sUnknown signal %d\n"), s or "", ": " or "", sig) before fxprintfing the result. For some reason it doesn’t directly fxprintf the int, worried about triggering additional signals?

psiginfo extracts si_signo, si_code, etc. fields from the given siginfo_t & fprintfs them to at most 128 bytes (via fmemopen) before outputting to stderr.

printf_size is a printf plugin which, handling NaN & infinity specially, repeatedly divides the decoded input by 1000 (if uppercase specifier) or 1024 (if lowercase) to find the appropriate unit before forwarding to the floating point formatter & appending the unit via put[w]c.

printf_fphex is another printf plugin which, after retrieving params from the format spec & the locale, hands the decoded data to ito[w]a[_word] in its constituent parts, for postprocessing to round, etc.

After handling a quirk with stderr, perror serializes the given errnum via strerror_r & fxprintfs the result with the optionally-given message.

itowa_word serializes an integer by repeatedly dividing the given value by the given base, looking up each remainder in one of two arrays (depending on desired casing) to output that digit at the start of the output string (decrementing from the end of the buffer). Special codepaths are generated for bases 10, 16, & 8 to encourage GCC to optimize them specially; certainly it can for 16 & 8…

itowa does similar for long long ints more complexly using multiprecision maths, fastpaths, & chunking.

itoa[_word] don’t appear to be any different from itowa[_word].

tmpfile wraps __gen_tmpfd, falling back to path_search & gen_tempname, before opening the new file.

Functions which may be overridden by platform-dependent code include:

Function stubs with platform-dependent code for their actual implementations include:

Floating Point Serializing

printf_fp is a final bundled printf plugin which outputs floats in decimal; it’s more complex & uses multiprecision arithmetic from the previously-described “stdlib” sublibrary.

After extracting the locale-specified parameters & the input, outputting NaN or infinity specially, printf_fp initializes a struct hack_digit_param (internal type), mallocs its other fields, & computes its scalesize based on the extracted exponent.

It performs some complex multiprecision math (which honestly I fail to comprehend) to convert the mantissa * 2^exponent format into mantissa * 10^exponent format (the new mantissa is fully computed later, one digit at a time), and to decide where to place the decimal point by counting zeroes. This math is performed differently depending on whether the base-2 exponent is > 2, negative, or between 0 & 2 inclusive.

It derives formatting parameters from the formatting char, estimates the number of grouping characters, & validates no overflows happened.

It mallocs a buffer to use, then:

  1. Iterates over the pre-computed integer digit positions, lazily finishing the multiprecision arithmetic (divmod & mul operations) to emit each digit.
  2. Adds a decimal point or trailing zeros, & continues emitting digits for the fractional component.
  3. Rounds those characters informed by the subsequent digit, counts the number of added zeroes & then removes them, & removes a leading 0.
  4. Adds thousands separators, & serializes the new (integer) exponent if needed.
  5. Adjusts the desired width, outputs specified padding then the sign, & insists ‘0’ padding be left-aligned.
  6. Looks up a factor from the locale, allocs a new buffer, copies the previously computed text into it, & applies internationalization postprocessing.
  7. Outputs that text, cleans up, & adds trailing padding.

Interpreting Textual User Input

Allowing programs to receive user input is the easy bit; interpreting it is the hard bit. Which is perhaps why it’s typically part of programming languages’ standard libraries rather than their syntax.

vfscanf & its variants, building on a pair of trivial wrappers (char_buffer & scratch_buffer) around the core “malloc” sublibrary implementing slices & arrays, after initializing their variables & variadic argument iterator iterate over the chars of the given format string.

Until it reaches the end of the format string (to be matched against some given “input” “libio” file; yes, “libio” & “stdio-common” are mutually dependent) vfscanf & variants repeatedly:

  1. Checks if the multibyte (non-ASCII) chars are complete, matching them against the input file.
  2. Retrieves the next char & checks if it isn’t a %. If so it skips whitespace (format whitespace indicates input whitespace should be skipped too), retrieves the input character possibly skipping whitespace, & checks for a match.
  3. Otherwise prepares the character buffer & checks if the next character is a digit. If so it parses the number & checks whether it’s trailed by a $: if so it’s an argument position, otherwise it’s a width & steps 4 & 5 are skipped.
  4. Checks for *, ', & I chars & converts them to SUPPRESS, GROUP, & I18N bitflags.
  5. If the next char’s a digit the number’s parsed as the “width”.
  6. If the next char’s an h, l, q, L, as, aS, a[, m, z, j, or t it’s converted into bitflags.
  7. Checks for premature format string end. Unless the next char (step 6 “unreads” invalid chars) is a [, c, C, or n, all subsequent whitespace input chars are skipped when so flagged.
  8. Checks whether the next, and final of this format declaration, char is %, n, c, C, s, S, x, X, o, u, d, i, e, E, f, F, g, G, a, A, [, or p, handling them appropriately.
  9. Repeat.

At format end trailing space is skipped & errors handled.

For %% vfscanf & variants check whether the next input char is also % & not EOF; %% is the escape syntax.

For non-suppressed %n the count of read input chars is assigned to the corresponding variadic argument (though e.g. %2$... duplicates the arg iterator to skip that number of places). %*n is a noop.

%x, %X, %o, %u, %d, %i, & %p (which also adjusts bitflags) all select an appropriate base to use in the generic number parsing code.

%c defaults the width to 1, optionally ensures enough space (with a cap) is malloc‘d, ensures we haven’t reached EOF, reads that number of chars into the specified memory (if not suppressed) realloc’ing more memory as needed whilst compiletime-optionally (don’t know why you’d disable it) decoding UTF8, & truncates the text to actual size if callee-malloc’d.

That falls through to %C which appears to be doing the same thing?

%s & %S do basically the same thing but exit on whitespace/EOF too.

To parse a number it checks whether the input char is - or +, adding it to the charbuffer & decrementing the format width if set; optionally parses 0x as base 16 or a leading 0 as base 8 (defaulting to 10); base 10 with i18n enabled has its own codepath. It validates each input char (thousands separators handled specially) whilst adding it to the charbuffer, optionally case-insensitively parses “(nil)” for %p, ungetcs the last char, adds a nil character, calls strto[u]l[l], & assigns to the variadic arg.

To parse floating point numbers for upper-or-lower %e, %f, %g, or %a it checks for a - or + input char, scans (signed) “nan” or “inf[inity]”, parses 0x to indicate base 16, validates each character is a (possibly hexadecimal) digit, a singular e (or p for hexadecimal), a + or - following that e or p, a singular localized decimal point, or a thousands separator prior to said decimal point, substitutes out localized chars separately, then calls strto[f,d] & assigns to the appropriate arg.

%[ requires additional parsing to determine whether it’s followed by ^ (indicating the not_in flag) & to collect all the characters prior to ]. If the long flag’s set it iterates over the given width of input chars, with an inner iteration over the valid set (specially interpreting ranges indicated by -) to determine whether each input char is valid, then optionally copies those input chars over to the appropriate variadic arg, alloc’ing if so configured with a fastpath for failures.

Communicating to Humans

Whilst vfscanf & its variants aid your programs in interpreting textual user input, vfprintf & its variants aid your programs in outputting their data in human-legible textual form using the same format string syntax.

After validating its params (adding buffering to the output if missing) & with the output file locked, vfprintf & variants use wcschr or strchr to locate the first format declaration & sputn to output the preceding text.

Unless that was the entirety of the work (which GCC optimization ensures it isn’t) until the end of the format string vfprintf & variants repeatedly uses several jump tables to parse the slightly more complex format declarations into several local variables.

Ultimately those jump tables end at a final char, like:

The number of specs interpreted are counted, the next one is located, & the intermediate text is written via sputn at the end of each iteration. Extensive macro-aided error handling is present throughout.

Once a positional arg is seen in the format string, a printf_positional fallback function is called for the remainder of the format string. This parses that remaining format string into a scratch_buffer to figure out how to convert the variadic args for interpreter input.

printf_positional uses the same code (with the aid of C macros) to interpret the format declarations. Upon encountering an unknown format declaration it serializes the format specifier itself.

This slowpath interpreter can also utilize caller-registered callbacks during parsing & input gathering, registered into lazily-allocated smallintmaps via register_printf_specifier/modifier/type. Parsing’s in a separate file.

Bytearrays (“string”)

GNU LibC’s “string” sublibrary processes bytearrays, usually those representing (UTF8-encoded) text. As such looping over the bytes in these arrays is extremely common!

There are no functions which require a platform-dependent implementation, but some allow for one. I’ll start with the functions which are never overridden with platform-specific optimizations.

swab iterates in reverse over every 2 bytes in the array & swaps (via a temp var) their positions.

An old implementation of strtok first iterates over initial string characters matching the given separator, then if that’s not the end of the string iterates over the remaining characters until it finds the given separator char (which is replaced with a nil byte) or the end of string. Old implementations of strsep wrap strchr or reimplement the loop themselves. Old implementations of strcspn iterate over all initial characters which don’t match the given separator(s) & return that count, as does strspn for characters which do match. An old implementation of strpbrk iterates over all chars which don’t match any of the given separator(s) or the nil byte.

strfry swaps 2 pseudorandomly-selected bytes within the given nil-terminated string once for every char in that string, tracking its own pseudorandom state.

strcmp iterates over every equal char in the two given strings in tandem, returning the difference between the final, possibly nil, chars.

memfrob xors every char with 42, iterating in reverse.

envz_entry iterates over the given multistring (multiple nil-separated strings with an external byte-length) with an outer loop over the strings & an inner loop which determines how much of each string matches the given “name”. If the following char is nil or ‘=’ that string is returned, otherwise it returns NULL.

envz_get wraps envz_entry before conditionally skipping past that ‘=’ if present. envz_remove wraps envz_entry & conditionally envz_delete.

envz_add wraps envz_remove to remove the appropriate entry, then either reallocs enough space to (if successful) memcpy the given key & value with added ‘=’ & 0 bytes OR does so by calling argz_add/append.

envz_merge iterates over every entry in the 2nd given multistring to retrieve the corresponding entry in the 1st via envz_entry, argz_appending it after optionally removing the old entry if needed.

envz_strip iterates over entries to find empty ones & memmoves over them.

Duff’s Device refers to wrapping a loop in a switch statement, with the cases pointing inside the loop body, as a clunky micro-optimization reducing loop iterations. Thankfully GNU LibC’s “string” library doesn’t quite do this (to allow merging loop body instructions?), but it does get close. GCC can do some of these optimizations itself, but I guess GNU LibC can’t rely on the right optimization level being set.

“string” functions applying similar microoptimizations also includes:

GNU LibC’s “string” sublibrary also provides several more trivial wrapper functions:

Platform-overridable “string” Functions

Finishing my explorations of GNU LibC’s “string” sublibrary, roughly the other half of its API may be overridden by platform-specific code. But if not, here’s how they work:

utf8_encode decides whether to directly store the char as an ASCII byte, or computes how many bytes should be used (via a bittwiddling loop) & stores 6 bits of the char at a time, iterating over those destination bytes in reverse with the high-order bit set. The first byte (last bits) also encodes the length.

find_idx follows a table of how far to skip ahead in the string for a given number of iterations. find_position wraps find_idx with a lookup in the ruleset smallintmap given by the locale.

do_xfrm, for a locale-specified number of iterations, looks up the locale-specified pos for the current char & (handling 0 specially) iterates over the (UTF-32?) input starting at that current char, for each calling find_idx & following the rulesets with a char backlog.

There’s another variant of do_xfrm with added caching, with strxfrm choosing which to use after shortcircuiting for no locale-specified rulesets & locating the max index of a character to rewrite.

Turns out the input strings are normal UTF8.

strverscmp iterates over equal chars whilst following a “state” lookup table, with a separate lookup table determining whether to compare the remaining characters by decimal number length or char difference.

strrchr repeatedly calls strchr until it finds the last one, skipping ahead after each match.

str[n]casecmp compares the lowercased chars (via tolower[_l]) of both inputs in lockstep & returns the difference between the first non-equal chars.

argz_stringify iterates over every string in the multistring (via strnlen) & replaces the separating nil chars with a given separator char.

argz_extract iterates over every string in the multistring, storing pointers in a given presized array.

argz_create iterates over the given string array once to count up the total len (including nil terminators) so it can malloc a multistring into which it can stpcpy each string in that array.

argz_count iterates over each string in the multistring to return a count of them.

argz_create_sep mallocs the given amount of space & loops over the chars to replace the given delimiter with the nil byte. argz_add_sep does the reverse.

argz_replace iterates over the strings via argz_next, filtered by strstr. For each it uses strstr to find each occurrence in that string, & either repeatedly str_appends (preceded by strndup) or argz_append/argz_add to replace each occurrence.

Functions which are microoptimized to iterate in large bitsizes within their loop bodies include:

memchr uses bittwiddling tricks to iterate over all chars looking for the first one matching the given needle.

memcmp iterates over every byte of the two inputs in lockstep in chunks of 4 long ints, with a special fastpath when the alignment between the two inputs is the same.

memmove uses the same optimization (calling into kernel, wrapped with CPU, optimizations via C macros) tricks as memcpy, except it first determines whether there’s overlap requiring it to iterate in reverse order.

memrchr uses bittwiddling tricks to operate 1 long int at a time exiting once it finds the given byte.

strchr[nul] uses the same bittwiddling tricks to iterate over the given string 1 long int at a time looking for the first occurrence of the given char (which is preprocessed for those bittwiddling tricks).

strnlen uses the same bittwiddling tricks as strlen to iterate over the string 1 long int at a time.

strcasestr meanwhile also uses higher-level optimizations!

After checking whether the needle fits in the haystack, it checks for repetitions in the needle.

Locating where the suffix first repeats involves iterating over the needle’s (canonicalized) chars twice, comparing prefix & suffix chars, each time with three pointers: two for the prefix & one for the suffix. It extracts the length of the repeating prefix as the “period”.

If the needle’s short enough & there’s repetition(s) it repeatedly counts how many chars match between the needle & haystack at their current offsets, in the right half then the left, to determine whether to return this position to the caller. If there were no repetitions it’ll canonicalize the needle & repeatedly compare it against each position in the string, suffix then prefix, canonicalizing each char of the haystack for comparison.

If the needle’s too long for either of those, it first computes a shift table mapping each char to its last occurrence in the needle, or the needle length if absent. Then it’ll repeatedly check that shift table for how far to skip ahead before doing the comparison at each position. If there are repetitions (and the needle is too long) that loop also involves tracking an offset into the needle for which repetition to check.

Simpler wrapper functions which may be overridden by platform-specific code include:


Timekeeping (“time”)

Processing time can be useful or even vital to many applications. You can’t see it, you can’t hear it, you can’t weigh it, you can’t… measure it in a laboratory. All a computer can do is count every time e.g. a certain capacitor charges & discharges through a resistor, forming an electronic pendulum.

Ben Eater has some great videos on the hardware side (YouTube via Invidious):

Here I’m exploring GNU LibC’s “time” sublibrary, which exposes the time in both seconds-since-1970 & the Gregorian calendar with internationalized rendering. It could still be more internationalized if C didn’t stick solely to the Gregorian calendar… Then there’s, especially historical, timezones & leap seconds…

tzfile_read looks up the caller-specified file relative to the directory specified by the “TZDIR” environment variable, checks whether it changed, & parses the binary array data under lock.

tzfile_default tzfile_reads a constant filename & overrides several fields.

tzfile_compute calls tzstring on various fields, including every zone referenced in the types array. Or it wraps tzset_parse_tz & tz_compute. Or it traverses a transitions graph. In any case it looks up the full details for the timezone (including a call to tzstring) & extracts the leap seconds to add or subtract.

offtime computes the days, incorporates the timezone offset, hour, min, secs, weekday, & year. Computing years is the trickiest step for offtime, for which until fixpoint it estimates that a year is 365 days before going back & computing the number of leapdays. Then it computes the month & day-of-month from the remaining day-of-year. The results are packaged in a struct.

mktime is the central function for converting that broken-down Gregorian timestamp (struct tm) back into seconds since 1970. It first extracts & validates (Windows needs extra for timezones) the arguments.

After extracting & validating the arguments it normalizes away leapseconds, then until fixpoint converts back & forth to determine any corrections which need to be made for the difference between iterations (exiting if e.g. Daylight-Savings Time is invalid) before drawing the final conclusion. The conversion is done via localtime64.

nl_select/get_era_entry wraps a table parsed & chronologically-verified from the locale files.

difftime[64] subtracts two seconds-since-1970 timestamps, taking care around the sign & weird numeric representations.

nl_get_[w]alt_digit & nl_parse_alt_digit parses “alternative digits” for use in date formatting from the user’s current locale if they haven’t already, and retrieves the properties that parser sets.

For functions which may be overridden by platform-specific code:

asctime[_r] wraps snprintf & the format string “%.3s %.3s%3d %.2d:%.2d:%.2d %d\n”.

get_date[_r] opens the file specified by the “DATEMSK” environment variable, trims whitespace surrounding the given text, reads each date on each line via getline & strptime exiting once successful, then cleans up & infers missing data.

gettimeofday wraps clock_gettime converting the resulting struct timespec to a struct timeval.

gmtime[64][_r] wraps tz_convert. tz_convert in turn wraps tzset, the previously described tzfile_compute, & maybe offtime.

settimeofday wraps clock_settime & maybe settimezone.

The aforementioned internal struct tzstring_l type is a linked list of sized strings. tzstring[_len] initializes this datastructure & appends to a global tzstring_list.

The internal update_vars function resets the daylight, timezone, & tzname globals from the tz_rules global. compute_offset converts seconds, minutes, & hours into a total number of seconds.

parse_tzname extracts all the initial alnum chars at the start of a string into a tzstring to store into the name property of the relevant tz_rule. parse_offset checks a sign char & uses sscanf to parse a timezone offset to store into the offset property of the relevant tz_rule. parse_rule parses the dates (& optional times) at which Daylight-Savings transitions occur.

tzset_parse_tz wraps parse_tzname/offset once or twice falling back to tzfile_default, before attempting parse_rule.

tzset, under a lock with a singleton flag, retrieves the “TZ” environment variable checking whether it’s changed & ignoring a preceding “:”, reads it via tzfile_read, saves the result in the first of the tz_rules globals, & updates other globals to match.

tz_compute first calculates the seconds-since-1970 datetime for both global tz_rules on the given year (year into seconds if after 1970, then depends on calendar), then optionally compares them to determine whether it’s Daylight Saving.

strftime & variants aid outputting a Gregorian date according to a given format string resembling printf’s & co’s. This involves iterating over the format string, copying each (possibly multibyte) char over verbatim (optionally converting casing, with various ASCII characters handled specially) until it reaches a ‘%’. It’s simpler if you don’t compiletime-enable internationalization. Then it parses the “_-0^#”, digit, & E or O flags before acting upon the formatting character, falling back to outputting invalid chars.

”%%” is the escape sequence which outputs “%”. “%n” outputs a newline. “%t” outputs a tab.

Heavy use of macros throughout this code help to avoid buffer overflows.

Other format chars might look up fields of the given struct tm in global arrays and/or (usually enabled by the ‘E’ modifier) the configured locale. Various of those will update casing flags if configured by the ‘#’ modifier. Not all modifiers are valid on all formatchars; if an invalid one is encountered it errors out. Interestingly AM & PM are stored in a single string.

Other formatchars recurse to apply another format string, whether hardcoded and/or supplied by the configured locale. These tend not to accept modifiers.

Yet others output various properties, possibly with minor (pre?)processing, as a possibly-padded (via memset) integer by repeatedly dividing by 10 inline outputting remainders (+’0’) from right to left. Where padding is present it usually defaults to spaces. The locale may specify alternative characters to be consulted first before outputting most integers.

Or for “%s” (GNU extension) it’ll use mktime to convert back into a seconds-since-1970 UNIX timestamp for the integer to output, handling negatives specially. Whilst “%V”, “%g”, & “%G” compute the ISO week via a helper function postprocessed to enforce year boundaries, & “%z” unapplies daylight savings if set & retrieves the difference.

And some formatchars look up “eras” from the locale to incorporate some property thereof in their output, usually falling back to computing the century from the year.

strptime essentially works in reverse of strftime, consisting mostly of a loop over the format string. Whitespace skips any amount of whitespace in the input string. Non-percent chars are matched against the input advancing both iterators, returning NULL on non-match. Modifiers (“-_0^#”) & field width (digits) are skipped in favour of leniency so you can reuse format strings from strftime, & then it acts upon the formatchar.

”%%” matches “%” in the input.

Some formatting directives like “%a” iterate over multiple string arrays, including those provided by the configured locale, to determine which matches (via strncasecmp[_l]) & what to store for e.g. tm_wday, giving the longest matching string precedence.

Other formatting directives trigger recursion on formatting strings which might be provided by the locale.

Others skip spaces then parse integers via multiply-add by 10 whilst they see digits. Sometimes requiring more computation.

“%O” triggers scanf to check the subsequent character. Some numbers (especially for “%O”-prefixed directives) may first consult nl_parse_alt_digit.

A statep parameter, passed through recursion & unpacked into local variables, indicates postprocessing shouldn’t be done yet. Said postprocessing increments the hour by 12 if PM is set, & incorporates the century and/or era into the year. If an era was desired it defaults to +100.

Days may be specified in a few ways which need to be made consistent. If it doesn’t have both a month & a monthday but does have a yearday it can calculate the former taking leap years into account, & from there the weekday if lacking. Vice versa it can calculate the yearday from the year, month, & monthday. If it has a week (in some form) & a weekday it can calculate the monthday, month, & year from those.

UTF-32 Wide Strings (“wcsmbs”)

GNU LibC’s “wcsmbs” sublibrary implements wide-string processing (glibc’s wchar_t is 4 bytes, i.e. UTF-32), which loosely resembles the “string” library with builtin “iconv” integration. Though UNIX programs now typically stick to UTF8, so “wcsmbs” doesn’t get as much use or optimization. (Microsoft’s kernel however deals in ASCII or UTF16, so things are different there.) Typically when any real text processing is desired we pull in the external LibICU library to be properly internationalized!

I wouldn’t recommend using “wcsmbs” (or “wctype”) since wchar_t’s width varies between platforms (16bit on Windows, which hasn’t kept up since the Unicode Consortium found more than 65,536 characters across the different written languages) & these APIs don’t offer real internationalization. I recommend using LibICU instead.

wctob wraps the innards of “gconv” to convert a widechar into a single byte, with a fastpath for ASCII.

wc[s]width wraps the configured locale & “wctype” to compute the width of each char in the given widestring; wcswidth sums the result.

wcstok wraps wcsspn then wcspbrk, returning early NULL on error.

wcsspn iterates over the given widestring & with an innerloop over the acceptables chars to exit as soon as it finds a char in that set.

wcsrchr has a C macro-unrolled loop (defaulting to no unrolling) over the given widestring storing the index of the last occurrence of the given char, which it returns upon reaching the nil byte.

wcspbrk repeatedly calls wcschr for each starting index in the given widestring.

wcsncmp iterates every equal char in the two given widestrings in lockstep, comparing them, returning that difference once nonequal or end of strings.

wcsncat wraps wcs[n]len & wmemcpy.

wcsmbs_getfct wraps gconv_find_transform with validation. wcsmbs_load_conv wraps wcsmbs_getfct with added pre- & post- processing. wcsmbs_clone_conv wraps get_gconv_fcts with locked refcounts. wcsmbs_named_conv wraps wcsmbs_getfct to retrieve converters in both directions, erroring if either’s unavailable. nl_cleanup_ctype retrieves the locale’s private.ctype property, cleaning up if unavailable.

mbsrtowcs_l wraps the “gconv” internals to convert a string in a locale- (caller-provided) specified encoding into a widestring, specially handling when it’s left to do the malloc. mbrtoc16 works similarly, choosing the input text encoding from the global locale.

c16tomb splits the specified 16bit char into one or two bytes before continuing via wcrtomb.

For “wcsmbs” functions which may have platform-dependent optimizations applied:

btowc wraps the “gconv” internals to convert a Unicode codepoint int to a widechar according to the global locale’s configured char encoding, with fastpaths taken for single-pass conversions & especially ASCII.

mbrtowc also wraps the “gconv” internals to convert a string to a widestring according to the configured locale’s charset, taking special error-reporting care.

mbsnrtowcs is another wrapper around the “gconv” internals handling sized strings, specially handling cases where it’s responsible for mallocing the output.

wcrtomb is a more straightforward “gconv” internals wrapper converting widestrings to strings, given some state to propagate between consecutive calls.

wcscasecmp iterates in lockstep over the equal (once lowercased) chars in both given widestrings, & returns the case-insensitive difference of the next char.

wcschrnul iterates over a given widestring until it reaches the given char or nil, returning that pointer.

wcschr, with C macro-based loop unrolling, iterates over a given widestring & returns a pointer to the first occurrence of the given char, or NULL upon reaching the nil byte.

wcscmp iterates lockstep over the equal chars in both given widestrings, returning the difference between the next chars.

wcscpy wraps wmemcpy or uses its own macro-unrolled loop.

wcslen iterates over a given widestring in chunks of 4 & returns the char count.

wcsncasecmp iterates in lockstep over the equal-once-lowercased chars of the two given widestrings, returning the lowercased difference between the subsequent chars.

wcsnrtombs (GNU-specific, very useful internally) again wraps the “gconv” internals to convert from a sized widestring to a sized string according to the given locale, specially handling when it’s responsible for the malloc. wcsrtombs works similarly.

wcsstr iterates over the given “haystack” widestring (including some inner fastpath loops) with a separate iterator over the given “needle” widestring to determine at which index that needle occurs in the haystack without backtracking on said haystack.

wmemset sets each index in the given sized widestring to the given character in chunks of 4.

wmemcmp compares the given number of chars in both given widestrings (in chunks of 4) to determine the first differing character & returns that diff.

And wmemchr iterates over the given sized widestring to find the first occurrence of the given widechar & returns its pointer.

Other more trivial wrapper functions which may be overridden by platform-specific optimizations include:

Wide Characters

I want to finish describing GNU LibC’s platform semi-dependent sublibrary implementations. So having just described “wcsmbs” dealing with “wide strings” I’ll now describe its complementary “wctype” for dealing with individual wide characters. Again I’d recommend using LibICU instead.

Most of “wctype”’s platform-independent function implementations wrap wctype_table_lookup after looking up lookuptables from the global locale & checking whether they can take an ASCII fastpath.

That also closely describes towlower_l & towupper_l, except those don’t take ASCII fastpaths & wrap wctrans_table_lookup instead.

The rest can all be overriden with CPU-specific optimizations, several of which are also implemented in that same way…

wctrans[_l] looks up the list of charset names from the locale, finds the index of the desired name, & looks up the corresponding wctrans_t from the locale. wctype[_l] works similarly. The *_l funcs allow caller-specified locales.

The aforementioned table lookups (wctype_table_lookup, wcwidth_table_lookup used by “wcsmbs”, wctrans_table_lookup functions) consults three tiers of hashtables without any rehashing.

towctrans[_l] & iswctype[_l] just validates their desc arg is provided & publicly wraps those lookup functions. The caller-specified locale is ignored.

Plugin Modules

Several GNU LibC sublibraries, including that for loading these Dynamically Loaded Modules, may defer to plugin modules to support different formats, text encodings, naming systems, etc. While you may add your own, GNU LibC bundles a few such plugin modules built-in.

Executable & Linkable Format (“elf”)

When you compile a program, without special effort adding additional capabilities requires recompiling that code. As such it’s often useful to engineer away this rigidity! That’s what dynamically-loaded modules do!

On Linux-based systems this is often achieved using the Executable & Linkable Format (ELF), and it’ll take me several days to get through its featureset as implemented within GNU LibC’s “elf” extension to “dlfcn”.

Entrypoint _dl_start (“rtld”)

The entrypoint _dl_start is called via platform-dependent code and, with time-profiling & a cleared bootstrap_mem.l_info array, retrieves some platform-specific globals into bootstrap_mem before converting an array in the ELF file into that l_info smallintmap with a little postprocessing & validation. Then it initializes internal malloc-wrapper stubs before calling _dl_start_final & returning its optionally-mapped start code address.

With additional profiling & a dl_rtld_map loaded from some global pointers or compiletime-conditionally what _dl_start gathered, _dl_start_final in turn calls __builtin_frame_address (a GCC builtin, hence I can’t find its definition) & forwards to the “sysdep” subsystem (later).

Upon start _dl_sysdep_start iterates over the dl_auxv to retrieve appropriate fields, calls platform-dependent code to retrieve the system memorypage size if currently unknown, updates a global if there’s a new sysinfo & the lib doesn’t already have dl_sysinfo_dso, initializes “tunables” (won’t cover that topic…), runs some platform-specific initialization code (via C macros), computes the dl_platform length if missing, sbrk-allocs some memory, validates standard FDs, & runs the dl_main code.

That function, and any necessary cleanup (none by default), might be overridden with platform-dependent code. And there’s a _dl_show_auxv function for textually outputting that dl_auxv array in case debugging is desired!

_dl_sysdep_start in turn calls dl_main which is where the bulk of “elf”’s code is! With initialized state, locks, stacks, etc dl_main starts by iterating over envvars to set appropriate globals.

Then it conditionally (for suid executables) clears any “unsecure” environment variables in a platform-specific list & checks if /etc/suid-debug is present and/or parses debug configuration opening the appropriate file.

If the ELF library was run as the commandline program it’s handled specially (not digging into details); otherwise dl_main allocates a main_map representing the embedding executable. In either case it iterates over the program headers to populate various main_map, etc fields.

Those extracted fields from the ELF header array are postprocessed, if it’s not the main executable it converts an array in the file into a smallintmap just like it did first thing on _dl_start, & initializes a new hashmap.

If dl_main was called for verification, it’ll validate the presence of main_map->ld & has_interp alongside caller macro-specified checks.

Otherwise it continues by configuring the virtual DSO, calls _dl_init_paths, initializes debug data & updates some properties.

dl_main then iterates over the ELF headers again to extract the PT_GNU_RELRO field to retrieve the l_relro_addr/size properties from. If there’s a l_tls_blocksize it’ll retrieve the thread-local-storage (TLS) modID (later). It appends 2 items to the audit list & considers outputting help text.

If there’s anything in the audit list it’ll initialize TLS & security, & load all the specified audit modules taking great care to handle errors via _dl_open & _dl_lookup_symbol_x.

If it successfully loaded any audit items, it’ll call registered callbacks.

It stores some previously retrieved debug references, passes each link_map_audit_state to the afct->activity callback, & with time-profiling iterates over any configured preloads (via env, commandline args if the lib’s run as an executable, or /etc/) calling _dl_map_object (later) with error handling for each, then gathers an array of them & applies _dl_map_object_deps (later).

It iterates over everything in the main_map->l_searchlist.r_list array in reverse marking them as globals, removes dl_rtld_map from the doubly-linked list, then populates that l_searchlist.r_list array from the dl_rtld_map linked-list. If there were any items it rebuilds the linked-list head.

Then it checks version numbers (later), followed by initializing thread-local-storage (later) & in normal mode optionally outputs debugging & applies relocations.

If any libraries were loaded, it iterates over them to validate the precomputed checksums & names match. Might output debugging info.

Then it retrieves some fields into globals, optionally outputs scope debugging info, & calls platform-specific validation code. If prelinked it time-profiled resolves conflicts (later topic), ensures malloc is properly linked (tomorrow’s topic, pushing off previous comments to that effect), & iterates over the main_map marking executable pages as readonly & collecting slots.

If it isn’t prelinked then dl_main iterates over the searchlist in reverse to time-profiled apply relocations & gather slots.

The thread-local-storage “generation” is conditionally incremented, unconditionally allocated, & conditionally added to a list for the local thread with optional debugging output.

If not prelinked but with multiple references it ensures malloc’s properly linked & time-profiled applies relocations.

dl_main finishes by initializing parts of GNU LibC (later topic), starts sysdep cleanup (after tomorrow), compile- & run- time optionally runs loaded audits, finishes initializing debugging & compiletime-optionally unloads the cache.

ELF Format (“elf.h”)

Continuing my discussions about how GNU LibC’s “elf” plugin to “dlfcn” works, today I’ll describe the ELF binary file format itself, because executables & libraries don’t just contain machine code for the CPU to fetch, decode, & execute. They also contain additional information required for a “linker” to get the CPU ready to do so.

I’ll use the term “word” to refer to 32bit or 64bit depending on the file. C macros help “elf” generate codepaths for both!

An ELF extended header (referred to by the “dl-load”, “sprof”, “readelflib”, “rtld” as described yesterday, “dl-map-segments”, “readlib”, & “dl-support” subcomponents) contains a 16byte identifier, 16bit object file type, 16bit target architecture, 32bit version, 1word (virtual) pointer to _start, 1word offsets for program headers & section headers, 32bit CPU flags, 16bit bytesizes for all(?) headers, entry & byte sizes for program & section header tables, & section index for string table.

That identifier starts with “\177ELF” followed by enums for wordsize & endianness, a version number, an enum & version number for the ABI (e.g. 3 for GNU/Linux) the machine code was compiled against, & a padding byte. These enums, and others, are defined manually via C macros.

Section headers have a 32bit name index, 32bit section type & 1word flags, 1word address & file offset & size, 32bit next link & additional info, & 1word alignment & entry size. Those section headers, if their SHF_COMPRESSED bitflag is set, can contain additional 1word fields for compression type (currently ZLIB/DEFLATE is fully supported), size, and alignment.

Each entry in the symbol table (used throughout the “elf” extension) contains 32bit name index, 1word pointer, 32bit size, 1byte each for type/binding & visibility, & 16bit section index. 64bit ELF moves the pointer & size to the end of the struct. Optionally also 16bit each for binding refs & flags.

Relocation tables (used throughout “elf”, via a parallel array) map a 1word offset to a 1word symbol index & relocation type, optionally with a 1word “addend”.

Program segment headers (used primarily in the “dl-load” & “rtld” subcomponents) include a 32bit type (from an enum), 1word offset, 1word virtual & physical addresses, 1word file & mem size, 32bit flags (moved to the second field for 64bit ELF), & 1word alignment.

Elf??_Dyn is a dynamically-typed 1word value used throughout “elf”.

Version definitions (used in “dl-version” & “rtld” subcomponents) consist of 16bits each for the revision, flags, index, & count; & 32bits each for hash, offset in verdaux array, & offset to the next version definition. That verdaux table lists 32bit names with 32bit offsets to the next such entry.

Each version dependency describes a 16bit version, 16bit “aux” entry count, & 32bits offset references to dependency filename, to the vernaux array, & to the next version dependency. Version dependencies optionally also hold 32bit hash reference, 16bit flag, 16bit padding, 32bit dependency name string offset, & 32bit offset reference to next auxiliary entry.

There are dynamically-typed 32bit values used predominantly by “dl-sysdep” & “dl-support”.

Note headers (used by “dl-load” & “readelflib”) have 32bits each for namesize, bodysize, & type.

Each move record has 1word each for value, size & index, & symbol offset ref; & 16bits each for repeat count & stride.

In 32bit ELF SHT_MIPS_GPTAB sections hold a 32bit current value, 32bit padding, 32bit new value, & 32bit tailsize. SHT_MIPS_REGINFO holds bytes for the kind & size, 16bits for section, & additional 32bits interpreted according to its kind.

.options sections contain 64bits of flags.

SHT_MIPS_LIBLIST sections contain 32bits each for name index, timestamp, checksum, version, & flags.

SHT_MIPS_CONFLICT contains 16bit version; chars for ISA level, revision, size for general regs & coprocessor 1 & 2, & ABI, 32bits for ISA extensions, 32bit mask of ASEs used, & 64 bits of flags.

Opening Dynamic Libraries (“dl-open”)

_dl_open is called by that dl_main function described previously to load the auditing plugin modules for the given plugin module, & is referenced by a methodtable I don’t see used anywhere.

With validated args & a lock it checks whether the given nsid is LM_ID_NEWLM. If so it iterates over the namespace array looking for a free slot (erroring out if none), initializes it if needed, & flags it. Otherwise an invalid namespace is an error.

With careful error handling & compiletime-optionally cache unloading, it does the main work of:

  1. If the given filename has a $ separator, the namespace ID is __LM_ID_CALLER, or there’s no slash in the filename it calls _dl_find_dso_for_object to load the “link map”. _dl_find_dso_for_object iterates over every namespace, & the _ns_loaded linked lists within, to find the highest addressed object inside that memory region.
  2. Check if LibC’s loaded.
  3. Temporarily saves the namespace’s _ns_global_scope_pending_adds to be restored later.
  4. Initialize some debugging data (later topic)
  5. Loads the named object (later, separate topic)
  6. If there’s a l_searchlist optionally outputs debugging information, validates them & allocates a new global “link map”, sets the l_nodelete_active flag on the dynamically loaded lib, and/or copies over into the namespace’s _ns_main_searchlist array.
  7. Sets the l_nodelete_pending flag on that lib.
  8. Load the “object”’s dependencies by:
    1. Preparing a “preloads” array copying from a given array, & taking a reference to the last item.
    2. Allocate a scratchbuffer.
    3. Iterate over that array, flagging each dep as being done, allocating some temp mem, optionally iterating over its l_ld array applying its substitutes & calling _dl_map_object with error handling (later topics) whilst saving results including into linked lists, consolidating the dependency array, & skipping completed deps.
    4. Frees temp memory & restores the error number.
    5. Saves the old l_initfini if present, before alloc’ing a new one.
    6. Iterates over the runp linked-list populating non “faked” entries into that array whilst keeping an index to the last one with a link map.
    7. Optionally iterates over that array skipping items without a local scope & erroring out for ones with AUXTAG & FILTERTAG, for each building local scope, flagging DT_SYMBOLIC items within, & allocating an array for them.
    8. Optionally flags everything temporarily in the l_searchlist as l_reserved & iterates over the l_reldeps array to locate & remove duplications.
    9. Copies the l_searchlist array into the l_initfini array, removing the last item with a link map if any.
    10. Sorts that l_initfini array optionally skipping the first item.
    11. Swaps in the new l_initfini & l_reldeps arrays adding a NULL terminator, with an atomic writebarrier or two. (Their structure’s a later topic)
  9. For each dependency in l_searchlist with a version attached, validates that version by:
    1. Extracting relevant values, exiting if missing.
    2. If there are DT_VERNEED data validate the records are of version 1 & repeatedly locate a given namespace of what’s already been loaded, if not tracemode or faked repeatedly call match_symbol validating result, & iterate to next dependency.
    3. If there’s DT_VERDEF defined locate highest address amongst them.
    4. If it found a highest index calloc an l_versions array & compute l_versions pointer, if there were DT_VERNEEDs iterate over them & their auxiliary data populating the l_versions array, & if there were DT_VERDEFS iterate over them to do likewise.
  10. Compile- & run-time optionally checks if the namespace’s head enables auditing & if so iterates over the audit plugins & calls their callbacks.
  11. Reinitializes the debugger & its state (later topic) to trigger a “probe” (from ).
  12. Runs some platform-dependent validation.
  13. Optionally outputs “scope” debugging information (later topic).
  14. Gathers the relocation bitflags.
  15. Iterates over the l_initfini array to extract the first & last entries.
  16. Iterates in reverse over those, skipping ones which were already relocated, applying _dl_relocate_object once or twice & optionally _dl_start_profile on them.
  17. It (re)allocs the scopes (later topic).
  18. Gathers all the struct link_maps from the searchlist with l_init_called unset & a l_tls_blocksize (later topic).
  19. If the RTLD_GLOBAL flag is set gathers the new entries into the _ns_main_searchlist array again, with careful error handling as per before.
  20. Updates flags on each struct link_map of the namespaces.
  21. Updates the scopes & thread-local storage.
  22. Triggers another probe.
  23. Specially handles its own dependencies.
  24. Platform-specific code for the “static” sections.
  25. Calls initializers with careful error handling, by:
    1. Loads the dl_initfirst entry & if present calls it: checking whether it’s already been called according to its containing struct link_map, validating the entry can be called, optionally outputting a debugging message, calling platform-dependent code, calling its DT_INIT entry if present, & iterating over its DT_INIT_ARRAY entry if present calling each function reference contained within.
    2. If there’s a DT_PREINIT_ARRAY it optionally outputs debugging information & iterates over it calling all the contained function pointers.
    3. Begrudgingly (according to a comment) iterates over the l_initfini array calling them as per dl_initfirst.
  26. Appends to namespaces array.
  27. Optionally output debugging info.

_dl_relocate_object, skipping already relocated objects & with optional debugging output, checks if there are DT_TEXTREL entries. If so it iterates over the PT_LOAD headers to alloca a struct textrels for each & enable rewriting the mmap‘d executable file, retrieves the stringtable to inform platform-specific code (both C functions & macros), optionally callocs a l_reloc_result array, marks the object as relocated, & restores the memory protections whilst clearing cache.

_dl_start_profile meanwhile iterates over the PT_LOAD headers extracting the start & end addresses, pads those addresses, and uses that to compute how much memory to allocate for profiling whilst checking whether it’s a power of two indicating it can take advantage of hardware optimizations. Then it assembles metadata inside the loaded ELF file, opens & mmaps a file to which it can output profiling data, initializes entries for each branch/“arc”, & calls platform-specific code.

When checking each individual symbol (in match_symbol) it optionally outputs debugging information, throws an error upon missing DT_VERDEF entry, before iterating over those verdefs validating they have an ELF version of 1 & exits upon finding the desired vd_hash value. If not found or a weak reference, throws an error.

Loading Dynamic Libraries (“dl-load”)

The process of “opening” an ELF file as described above thoroughly wraps a subprocess for “loading” the executable from disk, which is what I’ll discuss today starting from _dl_map_object.

_dl_map_object, after iterating over all already-open namespaces to ensure this file hasn’t already been successfully loaded & optionally outputting some debugging info, compile- & run-time optionally iterates over the loaded audit plugins to call their objsearch callbacks refining the filename.

If the filepath (as refined by the auditing plugins’ objsearch methods) does not specify a directory & no DT_RUNPATH is configured on the “loader” it iterates over each loader to check whether the file can successfully be loaded from its configured colon-separated DT_RPATHs, then if present the main link map’s RPATH. Utility functions load & split those paths, and another concatenates the filename onto each & with careful error handling attempts to open the file.

If that’s unsuccessful or not run & there’s a __rtld_env_path_list to search, it does so. Same for the loader’s l_runpath_dirs, followed by platform-specific code.

If compiletime-configured to use LDConfig… Remind me to get back to this tangent today! And finally it falls back to the __rtld_search_dirs global.

If there were folders specified in the filename, it iterates over the filepath to substitute any valid ${...} “DSTs” into a newly malloc‘d string & attempts to open it.

It validates the caller-given mode, & if it failed to find/open the file a “fake” entry is added to the namespace array if configured with DL_DEBUG_PRELINK. Without that config this is an error to throw (later topic).

Otherwise it continues by attempting to call platform-specific code for retrieving file information. Compile- & run-time optionally it takes a shortpath to avoid self-dependency. And a little more validation/debugging.

Compile & runtime optionally calls each audit plugin’s activity method, alongside some debugging.

Then getting into the real logic it callocs/inits a new “object”, copies info from the header into the object, computes a maplength to alloca & pread, iterates over the Phdrs filling in the appropriate object fields with special handling for “load commands” via _dl_map_segments (which mmaps & mprotects the file segments per those headers), & similarly for the Dyn entries.

If appropriate stack & mode flags are set, and sharedlibs are compiletime-enabled, it calls mprotect on the stack memory segments with some platform-specific checks if available.

A configured offset is added to the l_tls_initimage if present.

Platform-specific code is called to process each PT_NOTE or PT_GNU_PROPERTY Phdr.

The file is closed.

If successful it’ll increment the object’s l_entry by its l_addr. It creates a hashtable.

It replaces its searchlist with information from its new “scope”, updates a dl_initfirst global & l_file_id property, optionally appends a value to the lib’s name array, updates the corresponding namespace’s libc_map property to point to this object, computes a thread-local-storage module ID (later topic), calls platform-specific code if available, appends the object to the namespaces array, & compile- & run-time optionally calls each auditing plugin’s objopen method.

Circling back to the LDConfig fallback file lookup logic: if it hasn’t already (global variable) it mmaps /etc/ with careful error handling (after optionally outputting debugging info) whilst validating the format with help from platform-specific code. It can then binary search this file, with its fixed bytesize entries, looking for the entry best matching (according to some platform-specific code & later some globals) the available hardware capabilities.

The result is copied into newly alloca‘d memory, optionally checked against some configured directories to see whether it should be filtered out, & it attempts to open the looked-up filepath.

Those hardware-capabilities globals which that code refers to are initialized alongside the rest of this “dl-load” component from heavily-processed caller-provided input.

Closing Dynamic Libraries (“dl-close”)

I’ve discussed how GNU LibC’s “dlfcn” plugin “elf” opens an Executable & Linkable Format file. Today I wish to explore the flipside & discuss how it closes these files, and in the process some of the underlying infrastructure I’ve been skipping over in previous “elf” threads.

After taking a lock, validating that the dynamic library hasn’t already been closed whilst waiting on that lock, & validating the refcount _dl_close decrements that refcount & validates the lib’s state/type.

It looks up the namespace corresponding to the caller-given linkmap, iterates over its _ns_loaded dependencies array collecting them into an array, clears used & done bitmasks, & iterates over that array again skipping over ones which (according to those bitmasks) have already been closed or aren’t still being used elsewhere. For each it flags it in both the done & used bitmasks before iterating over its l_initfini then l_reldeps arrays to mark those as used too.

It then sorts those gathered linkmaps by their partial order, moving each out-of-place item into place in reference to seen counts in a sidetable.

It then iterates over these linkmaps again, this time in sorted order. For each of those libs which are unused & had their initializer called it optionally outputs some debugging information & with careful handling calls each function in its DT_FINI_ARRAY followed by the singular DT_FINI if present.

To output debugging information it uses its own simpler/self-contained printf implementation (it can’t rely on stdio-common being loaded already?) which also prepends the process ID & a colon built-in, using _dl_writev to output the computed string. _dl_writev wraps __writev (from “misc”, platform-dependent implementation) possibly wrapped with a lock after startup.

For the careful error handling it temporarily clears the catch_hook global before calling the given callback.

_dl_catch_exception (said careful error handling) may for other callers instead stack-allocate a new value for catch_hook (storing targets for the exception, errcode, & control flow) to be available whilst running the given callback. In the event of an error _dl_signal_error updates these pointer values & longjumps into a _dl_catch_exception branch. Or wrappers thereof will first consider running a callback function.

_dl_close compile- & run-time optionally calls each audit plugin’s objclose method, flags its l_removed property & an unload_any local, if it’s a global linkmap increments the number of globals unloaded, saves the min loaded index, ensures libs with an l_initfini have a l_searchlist by reusing that memory, & counts scopes whilst determining which to remove.

If there were any scopes to remove it mallocs a computed number of scopes to copy each remaining scope into, freeing the old ones.

A scope is a sized array of struct link_maps, & link_maps in turn are a doubly-linked list of named addresses, dynamic sections, & so much more. It attempts to free scopes via a dedicated freelist array & platform-specific threadsafety, falling back to the malloc APIs.

Otherwise if there’s a new list it’ll set the searchlist to NULL. If the lib has an unused l_loader it once again updates the first_loaded index.

After that loop it exits if there’s nothing found to unload.

It compile- & run- time optionally calls the activity method on all audit methods with one present, before initializing the debugger & it’s state.

To initialize the debugger it looks up the object from a global or the specified namespace & initializes its properties with configured values. Including a reference to _dl_debug_state, which is a noop intended for setting a breakpoint on in GDB. Most callers immediately trigger that breakpoint.

If there were any global libs to unload, it counts up the number of libs flagged l_removed in _ns_main_searchlist & rearranges that list if necessary to remove newly-opened holes.

If multithreaded with globals or scopes to free, it performs some platform-specific synchronization & free()s that freelist.

With a lock held it iterates over the struct link_maps one final time skipping any used entries.

For each unused item in this final iteration, if it has a l_tls_blocksize it conditionally & recursively removes the “slot” restoring various properties (and, if successful, globals) to previous values before (for TLS offsets other than NO_TLS_OFFSET & FORCED_DYNAMIC_TLS_OFFSET) collecting a contiguous memory chunk under various compile- & run-time conditions, updating the dl_tls_static_used field.

If caller-requested it clears each entry in the symbols table if present under a lock.

It calls caller-specified code to unmap all the mmap‘d memory; removes the lib from the doubly-linked list (which may or may not be cyclic depending on build flags); decrements the _ns_nloaded count; frees its l_versions, if present l_origin, & l_reldeps; optionally outputs debugging; frees l_name, each item in the l_libname linkedlist, l_initfini, conditionally l_scope, & if present l_phdr_allocated & l_r[un]path_dirs. Then it NULLs the dl_initfirst field if that’s this lib.

After that final (locked) loop…

If there was any thread-local-storage to free it increments the dl_tls_generation field translating numeric overflows into fatal errors, using _exit(127) within the wrapper around its internal printf implementation. If the dl_tls_static_used field equals the computed tls_free_end it resets it to the computed tls_free_start.

It compile- & run-time optionally calls the activity method on all auditing plugins where it’s present.

Finally it conditionally truncates the namespaces array, & updates the debugging state triggering the breakpoint & a probe. And if the local state is rerun it jumps right back up to just after it converted the linkedlist of libs into an array, otherwise it sets the state to not_pending.

Symbol Lookup (“dl-lookup”)

Once you’ve opened/loaded a dynamic library, for that to be of any use you need to look up symbols within it. For that you use _dl_[v]sym via a wrapper from “dlfcn”.

If the given lib’s RTLD_DEFAULT, dl_[v]sym iterates over every linkmap in every namespace to find which object the caller is in. Then it calls dl_lookup_symbol_x (which I’ll come back to); if multithreaded it does so with a platform-specific GScope flag temporarily set whilst forwarding exceptions.

If the given lib’s RTLD_NEXT then dl_[v]sym locates which object the given caller is in, validates it’s dynamically-loaded code, locates the last struct link_map in its l_loader linkedlist, & calls dl_lookup_symbol_x within that loader’s scope.

Otherwise it directly wraps dl_lookup_symbol_x within the given lib.

If the result’s in thread-local-storage it’ll call the platform-specific THREAD_DTV macro to retrieve a TLS global, with various postprocessing under different conditions.

If the result’s not thread-local-storage it calls a platform-specific macro to compute the pointer from the looked-up symbol, which may involve taking a special codepath (_dl_make_fptr) if it’s annotated as being a function.

_dl_make_fptr allocates (via mmap with error handling) a global function pointer table if needed, resolves the offsetpointer, validates the index within that table is in-range, & atomically inserts a new entry into that table which it returns.

Additional postprocessing for non-thread-local-storage involves following indirect function references if relevant via platform-specific code, & if run- & compile-time enabled looks up the caller symbol & calls each audit plugin’s symbind method for it.

Looking up the symbol by given name in the given linkmap involves incrementing the number of relocations, validating there’s a version & no DL_LOOKUP_RETURN_NEWEST flag, if there’s a skipmap counting scope entries matching it, iterating over the different scopes to apply the core logic to each, throwing an exception if not found, checking whether it’s protected and if so possibly performing the lookup again, for lt_loaded types with the DL_LOOKUP_ADD_DEPENDENCIES flag set adding undef_map to the dependencies & retrying, flagging the symbol as used, optionally outputting extensive debugging info, & applying some platform-specific usually-noop logic.

That even-more-core logic involves iterating over the scope (with a compiler readbarrier on its length) skipping entries which match the given skip value, executable copies, removed, or empty values whilst outputting optional debugging info. For each it retrieves its symbol table & checks its “bitmask” (or rather bloomfilter by the look of things, a GNU extension) if present allowing it to perform a hashmap lookup. Otherwise it uses array lookup.

undef_map is handled specially.

Weakpointers, globals, & (GNU extension) uniques are handled specially. Especially uniques, which have additional hashmap lookup logic inserting new values & resizing if necessary.

Then it iterates onto the next entry in the r_list.

Commandline Tooling

Finally, GNU LibC’s “dlfcn” sublibrary’s “elf” extension provides some commandline tools for working with Executable & Linkable Format files.

pldd is a command for listing the shared libraries linked into an ELF file. To do so, with parsed commandline flags & a singular integral argument, it consults the /proc/*/exe symlink to retrieve the corresponding command, consults /proc//task/ ptrace-attaching to each thread, retrieves/outputs the data (get_process_info), & cleans up.

It opens /proc/*/mem, /proc/*/exe validating that it is indeed an ELF file, & /proc/*/auxv, & calls the 32 or 64bit variant of find_maps. This iterates over the auxv array extracting the values of the ones of type AT_PHDR, AT_PHNUM, & AT_PHENT. It mallocs & preads AT_PHNUM * AT_PHENT bytes of /proc/*/mem, allowing it to compute the AT_PHDR runtime pointer offset, finds its PT_DYNAMIC header & mallocs/preads those sections searching for the DT_DEBUG entry, if it finds PT_INTERP it mallocs the specified amount of bytes from the program’s memory, outputs PID & program name, initializes a scratch buffer, reads the runtime struct link_map, & reads each import resizing the scratch buffer as necessary, outputting each immediately once read.

ldconfig, with localization initialized & commandline flags, validates that any remaining arguments are absolute paths & retrieves the “LD_HWCAP_MASK” integral environment variable.

For the absolute-path commandline args, it parses & allocs a new dir_list global linked-list entry whilst checking for duplication & saving chroot params. If successful it traverses its subdirs as specified by hardware caps.

If a chroot was set it calls that syscall & cds to /.

It retrieves fallback values for cache_file & config_file. If the -r flag was set it opens, fstat64s, & mmaps the specified cachefile validating it is indeed a cachefile of the correct endianness, printing each entry therein & a count.

Performs additional filepath normalization if chroot configured.

If configured to apply specified links manually, it iterates over those commandline args further normalizing/validating the filepaths, opens & mmaps them validating each is an ELF file, reads its EI_CLASS, checks for the ET_DYN type, & iterates over the ELF headers retrieving data from & validating the PT_DYNAMIC, PT_INTERP, PT_NOTE, & PT_GNU_PROPERTY (platform-specific code) ELF headers.

Then iterates over the segments looking for the string table, followed by DT_NEEDED & DT_SONAME entries. Then computes the soname if missing, & creates a symlink from that soname to the lib name whilst outputting debugging info.

If not manually linking it instead continues by optionally clearing the caching linked list.

Optionally opens & line-based parses (recursing for include directives) a config file, followed by adding two system dirs populating the aforementioned linkedlist.

The system dirs are registered via CPU-specific code.

It normalizes the filepath to the “aux cache” file, loading (via mmap into a struct whose entries are loaded into a closed-addressed hashmap) or initializing it as needed or configured via the -C or -i commandline flags.

Then iterates over the dir_entries global linkedlist, gathering all libs in each of these dirs (incorporating hardware capabilities), checking whether they’re in the cache already, & processing them as before.

On each iteration it iterates over the gathered libs again before freeing them to create appropriate symlinks given hardware capabilities & insertion sorts into the cachefile & stringtable.

If the -n or -N commandline flags are not set it saves the cache & possibly the auxiliary cache by, after preprocessing & outputting headers, iterating over the gathered cache entries referring to the stringtable & copying them into the mmap‘d cache, saving the results in a binary file in 1 of 2 formats.

Text Encodings (“iconvdata”)

Computers are only as smart as you make it, and even then it don't know nothing. It can store. It can store information but can't interpret it. It doesn't know what it means unless you give it meaning. It don't mean nothing unless you tell it to mean something, and even then the computer doesn't know what meaning is. Computers/Commutors by Dr Rev Churchdoktor

There’s historically been numerous text encodings representing different alphabets, most of which are, like UTF8 which we use near-universally today, supersets of ASCII. IBM certainly designed a lot of them, I guess they were dominant in the appropriate era & “International” is their first name!

Most of these text encoders/decoders are written as converters to/from UCS, consisting mostly of Perl-generated lookup tables from UNIX datafiles.

While there may occasionally be a tiny bit more logic converting between charsets, the “iconv” extensions consist of basically just lookup tables converting chars to/from an internal UCS4-like representation.

It is reasonable to view “iconv” as legacy code taking the stance we should all be using Unicode, e.g. UTF8 specifically, now. But this diversity of text encodings (mostly from IBM or ISO) played a vital role in the past & who knows who’ll still need to read those text files?

There’s occasional 0xhex TSV files (listed within another TSV) to test these charset converters, with a corresponding testrunner shellscript.

There’s typically separate lookup tables for conversion to UCS (of unicode codepoints) & conversion from UCS (of strings). Typically there’ll be C macros (arranged inconsistently in different files, sometimes even in a separate file) specifying identifying, sizing, & lookup logic to incorporate into a template (or an extension thereof) provided by the “iconv” sublibrary these extend.

The lookup tables may be smallintmaps (most frequent), parallel arrays, state machines, etc.

There’s a datafile listing the still-unsupported text encodings, and another listing the supported text encodings alongside their aliases & which dynamically-loaded modules to use for them.

The JOHAB (Korean Hangul Jamo) text encoding is an interesting one: 5bit consonant, 5bit vowel, & 5bit consonant. It requires more sophisticated transcoding & lookup tables!

euc-jisx0213 is mostly macros rather than datafiles. asmo_449 doesn’t list its lookuptables in C form.

Then there’s the issue of composition/decomposition in certain charsets/alphabets… And there’s ones with different lookuptables for different ranges…

While most text en/de-coders for “iconv” use the templates, some implement the conversion themselves. Skipping the trivial cases like jis0201 & simple cases like jisx0213 (which computes which lookuptable row to consult), UTF32 conversion upon init parses the name & stores it (alongside lookahead requirements) inside the given converter data.

To convert UCS to UTF32 an initial iteration is performed to seek out any byteorder marks (determining whether to output bigendian or littleendian), then it retrieves each 32bit char, validates it, increments the read pointers, swaps byteorder if needed, & outputs the 32bit char.

To convert from UTF32 to UCS it retrieves the 32bit char, swaps byteorder if needed, validates more simply, outputs the 32bit char, & increments the read & write pointers by 4.

UTF16 has similar initialization but if the 32bit char is larger than 16bit it bitmasks it in & adds some high-order bits in appropriate byteorder (byteorder check moved further out to play well with the CPU’s branch predictor). And there’s the inverse operation checking those highorder bits.

I’ll skip describing the UTF7 encoding, designed for apps which really break on non-ASCII chars.

There’s a converter to/from the internal UCS4-like from/to UCS2. UTF8 also has similar logic, but is also the only encoding directly implemented, statically compiled into it, by the “iconv” sublibrary.

There’s some code shared between the euc-kr, iso-2022-jp, iso-2022-kr, & uhc text encoding “iconv” extensions. This includes a function ksc5601_to_ucs4 which reads 2 chars which it validates & computes into a singular index split across 3 lookuptables.

ucs4_to_ksc5601 determines the appropriate lookuptable to use, performs a binary search, & divides the result by 94 to take the quotient & remainder as the output chars, or reads the two entries in the lookuptable.

There’s a korean text encoding which can switch back & forth between ASCII, using an escape code like "<esc>$)C". Which charset the converter’s in needs to be tracked. Some Japanese & Chinese text encodings do similar.

There’s a suite of asian text encodings (standardized by ISO) which support switching between them. These are implemented in a single “iconv” plugin with a giant switch statement in place of lookuptables.

The IBM text encodings are diverse in how much they match or mismatch Unicode. Most are converted to/from the internal UCS4-like via straightforward lookup tables. Some like ibm1364 require tracking (typically 7bit) state between each byte to handle “combined” chars.

There’s variable-length char encodings like ibm939 (Japanese), ibm937 (Chinese), & ibm932, generally using that 7bit state. I see this called “EBCDIC” encoding.

There’s conversion between GBK & GB2312 encodings.

There’s gb2312 common between the iso-2022-jp, iso-2022-cn-ext, euc-cn, & iso-2022-cn text encodings, implementing its from-UCS4 lookuptable as a C switch statement. The to-UCS4 function looks up, validates, & outputs 2 chars from its lookuptable.

euc-tw combines ASCII & cns11643l1’s two “planes”, with invalid codepoints between them & no lookuptables of its own. euc-kr adds Unicode decomposition.

euc-jp likewise combines ASCII, jisx0201, & jisx0208.

And euc-cn combines ASCII & gb2312.

cns11643l2 is embedded into iso-2022-cn[-ext], using a 2byte encoding. As does cns11643l1 (which uses a C switch in place of a from-UCS4 lookup table), which is also shared with euc-tw.

cns11643 (uses multiple to/from UCS4 lookup tables, and also a C switch from UCS4) is embedded in iso-2022-cn-ext & euc-tw.

Localization (“localedata”)

Different cultures do fairly fundamental things differently. That may be stating the obvious, but it is something core operating system components like GNU LibC need to deal with. As such its “locale” sublibrary has extensive datafiles & a few scripts to configure GNU LibC’s other sublibraries, including even case-insensitive string comparison!

The LC_CTYPE translit_start/end section specifies which char(s) to replace each char with for comparison. Some reusable ones are autogenerated.

include directives within translit_start/end loads these reusable substitution configs (which may even remove chars entirely for comparison).

Within LC_CTYPE you may also configure lists of chars for the different char classes. I’m not sure why, but LC_COLLATE lists char substitutions for case-insensitive comparisons (this isn’t necessarily converting everything to lowercase). The POSIX locale sticks to ASCII; others include i18n configs.

Locale files have metadata LC_IDENTIFICATION sections.

LC_COLLATE may define script, collating-symbol, copy (loading another file), reorder-after, & symbol-equivalence directives.

LC_MONETARY defines formatting params for denoting prices, with LC_NUMERIC being very similar.

LC_TIME provides localized smallintmaps for weekdays, months, etc & standard format strings.

LC_MESSAGES defines regexps for yes & no responses.

LC_PAPER defines paper size.

LC_MEASUREMENT whether to use Imperial or Metric.

LC_NAME, LC_ADDRESS, & LC_TELEPHONE defines localized printf format strings.

This config format is line-oriented with start & (often preceded with an END keyword) stop directives. The initial word defines the key under which the value is stored, or in some sections defines which char is being substituted. A different config file is provided for each locale, and some directives (copy & include) may import directives from another file. Many of those are autogenerated.

In a similar format there’s a directory of “charmaps”.

In ;-commented key-space-value files there’s test files for those charmaps. Other files list ;-commented sample chars (typically one-per-line). Not sure how these two are used.

There’s a script which iterates over all character classes & mappings, & all chars within each of both to output the codepoints in the character class & the key-tab-value mapping into correspondingly-named files.

Another translates codepoints into chars.

Unicode Converter Scripts

Amongst the locale datafiles are directories for testing, and one for generating locale configuration from datafiles maintained by The Unicode Consortium, who make it their job to assign a number to every letter, etc from every alphabet so that they can all be represented inside computers.

Unicode’s datafiles are #-commented ;-separated values, identifying chars by their hexadecimal (base 16) number, often specifying ranges using the .. infix operator.

There’s a Python module for reading in these SSV files, used by all these internal commandline tools, which can also determine classifications from that data.

The main datafile lists each char’s ID, name, category, combining classes, bidirectional category, decomposition mapping, decimal digit value, digit value, numeric value, mirrored, old name, upper, lower, & title. Another lists the classification of each char-range & another lists the width for each char-range (mostly for kanji on the terminal).

Around these GNU LibC implements Python scripts for:

Each script tends to work, after commandline flags are parsed, by:

  1. Parse the Unicode datafiles via the aforementioned Python module.
  2. Output that parsed data in its new format (including preludes & epilogues, which may be copied over from an existing file), or parse the specified files & use Python’s set implementation to describe what’s changed between the input files.

The UTF8 charmap generator parses the main Unicode file itself… gen_unicode_ctype requires slightly more processing…

gen_translit_combining involves slightly more processing to convert decomposition into substitutions, in part via a simple recursive function.

ctype_compatibility set-compares the two given files, before running a testsuite over each with a custom testrunner.

DNS client (“resolv”)

On the Internet we need to name (one of the hardest software challenges) hosts. Every computer has an often-temporary numeric IP address assigned to it, but for some you often want to assign a unique human-readable name anyone can look up via their ISP. These are managed centrally with ICANN at the root of the hierarchy, and consulted via a vast decentralized CDN.

“resolv” (a BIND fork) extends “nss” (via the _nss_dns_getcanonname_r, _nss_dns_gethostbyname[2,3,4]_r, _nss_dns_gethostbyaddr[2]_r, _nss_dns_getnetbyname_r, & _nss_dns_getnetbyaddr_r callbacks) to support looking up rented (I wish there were other options…) names via DNS, whilst providing a public API much of which can be overridden by platform-specific code.
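From the caller’s side, all of this machinery sits behind the standard resolution APIs; a minimal lookup via getaddrinfo (resolving "localhost", which is served from /etc/hosts so no network is needed, assuming the usual 127.0.0.1 mapping) looks like:

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    struct addrinfo hints = {0}, *res;
    hints.ai_family = AF_INET;          /* IPv4 only, for a stable answer */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("localhost", NULL, &hints, &res) != 0) return 1;

    /* Render the first result's address as dotted-quad text. */
    struct sockaddr_in *sa = (struct sockaddr_in *)res->ai_addr;
    char buf[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &sa->sin_addr, buf, sizeof buf);
    printf("%s\n", buf);

    freeaddrinfo(res);
    return 0;
}
```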

I’ll focus on exploring from _nss_dns_gethostbyname_r.

After validating the domain name, it retrieves a refcounted global struct resolv_context, alloc/init’ing it whilst alloc/parsing (under a lock) a commented configfile if necessary, then hands off to gethostbyname3_context (this tries IPv6 first, but that’s called out as legacy code). The configfile also has a refcounted global, only (re)parsed when the file changes; it ensures /etc/hosts is parsed too, & the domain can be overridden via the “LOCALDOMAIN” envvar.

After its input validation/reformatting gethostbyname3_context first tries looking up user-level aliases (if there are no dots in the domain name) by scanning the 2 column TSV file specified by the “HOSTALIASES” envvar. This may yield another domain name.

It then allocates a 1k buffer & calls __res_context_search handling its errors appropriately, followed by getanswer_r to convert the data from BIND (most of this code comes from BIND) to “nss” formats.

__res_context_search first counts up dots & trailing dots, if there were any dots at all it tries consulting that $HOSTALIASES file & queries the looked-up value for the response, if there’s enough (or any trailing) dots it runs the query with a length check.

In certain cases (no dots & RES_DEFNAMES, or dots without any trailing & RES_DNSRCH) it looks up the name under all configured search_list or dnsrch domains.

If all those failed & dots it runs the query with a length check.

From there it initializes a struct of DNS headers & “nopt” or (for “A” & “AAAA” lookups) retrying upon lack of allocated buffer space, & reencoding the domain names being looked up weirdly, & sends it via UDP (or cached TCP socket) to the configured server to receive a binary response. Functions are provided for parsing certain fields like “TTL” (Time To Live a.k.a. cache timeouts) of this binary struct, for serializing it to text for debugging, and for textually-comparing domain names (e.g. ns_same_domain for efficiently checking whether a is a subdomain of, or same as, b & ns_makecanon for stripping non-escaped trailing dots).

Multi-name Lookups

The central public API of GNU LibC’s “resolv” sublibrary is getaddrinfo_a which, with validated arguments & a lock held, iterates over the caller-provided array to enqueue those requests whilst counting up successes. If all fails it notifies a provided sigevent.

If the caller-specified to wait on all the caller-specified requests it iterates over them again to configure an atomic condition which it’ll wait on. Otherwise it’ll configure a “waitlist” notifying the caller PID.

Upon enqueueing a DNS request onto that global linked queue (with its own freelist), GNU LibC’s “resolv” checks whether to start a new thread (none idle & unreached limit) to service them & triggers an atomic signal to wake those threads if necessary.

It creates a “DETACHED” pthread to repeatedly call getaddrinfo (which has a platform-dependant implementation) for each request, saving the response & under lock notifying (via platform-specific code) the caller, freeing the request into said freelist, & awaiting the next task, sleeping or starting a new thread if necessary.

Timezones, Landmarks, & Time History (“timezone”)

Time, or at least how we denote it, is a very political construct. For a variety of reasons people in different countries adjust their clocks forward or back. And we’ve got timezones!

GNU LibC’s “time” datafiles list, in #-commented TSV, all the times different regions, etc (separate files) adjusted their clocks (to correct for when computing diffs), timezones (possibly in the same file), coordinates of timezone-landmark cities, timezone aliases, leap years, & ISO country codes.

Alongside these heavily-commented TSV files are some C header files defining structs which can hold the same data, and expose “time”’s API.

There’s a shellscript which, after parsing the commandline flags using a getopt command & ensuring UTF-8 text encoding is used regardless of locale, prepares an AWK script to compute the distance of each landmark city from a given coord (the maths actually looks correct!), repeatedly uses AWK scripts & case branching to resolve user input, & sets $TZ.

And finally there’s C scripts for converting all this timezone & historical clock adjustment data between its text & binary forms. The “time” sublibrary refers to these binary files.

The text format requires significant (de)serialization (postprocessed once parsed), and some of the C standard library is reimplemented in case it’s running on a less full-featured UNIX than GNU. Because one Arthur David Olson maintained this flatfile database, in the public domain, for all UNIXes.

Platform-Dependant Sublibraries

Sublibraries in this section are almost-entirely (or fully) implemented with platform-dependant code. It is GNU LibC’s job to expose platform-independant APIs to this platform-dependant code, and these libraries consist of stubs which are overridden by the appropriate platform-specific code which will be discussed later.

The sublists name the APIs exposed by these sublibraries with platform-specific implementations.

Trigonometry & Other Mathematical Functions (“math”)

Most of the tasks we set computers involves mathematical computation in some form or other. Heck, much of computers’ I/O is expressed as numbers! GNU LibC has a sublibrary “math” implementing many common mathematical functions to help with this, which I’ll describe today.

The implementation is almost entirely platform-specific, though there are a relative few functions which are the exception to that rule.

ctan, after handling non-finite floatingpoint numbers specially, computes both the sine & cosine for both the real & imaginary components of the input to compute tan(x+iy) = (sin(2x) + i*sinh(2y))/(cos(2x) + cosh(2y)) = (sin(x)*cos(x) + i*sinh(y)*cosh(y))/(cos(x)^2 + sinh(y)^2).

ctanh does similar to compute tanh(x+iy) = (sinh(2x) + i*sin(2y))/(cosh(2x) + cos(2y)) = (sinh(x)*cosh(x) + i*sin(y)*cos(y))/(sinh(x)^2 + cos(y)^2).

csqrt, after handling non-finite numbers specially, checks the classification of its input’s real & imaginary components, computing the answer differently if the imaginary component is 0, or the real component is zero, or checking which quadrant the two components are in; before checking for underflows & assembling the output complex number.

csin, csinh, cexp, ccosh, catan, & catanh check the classification of the real & imaginary components to handle their non-finite cases, while using normal trig (or hypotenuse) logic on the success path.

clog refers to the component’s classifications for error-reporting, computes scale, absx, & absy based on how the absolute components of the input compare then calls off to different functions based on the range of the results. The imaginary output is computed by atan2. As does clog10. casin wraps casinh.

casinh & cacosh both use classification error handling & wrap __kernel_casinh on the happy path.

multc3(a, b, c, d) computes ac - bd + (ad + bc)i, with error handling.

casinh, depending on how the input relates to its precision, calls onto clog/M_LN2/COPYSIGN, M_HYPOT/M_LOG/M_ATAN2, M_SQRT/M_LOG/M_ATAN2, M_SQRT/M_LOG1P/M_ATAN2, or csqrt, possibly as part of slightly more complex multiply-add computations. Then it reassembles the complex output with signs.

fromfp_max_exponent(bool negative, int width), fromfp_round(bool negative, uintmax_t x, bool halfbit, bool more_bits, int round), fromfp_overflowed(bool negative, uintmax_t x, int exponent, int max_exponent), fromfp_domain_error(bool negative, unsigned int width), & fromfp_round_and_return performs simple conditional maths on these floating point components.

exp2 wraps scalbn, exp, & LN2 with overflow checks & other error handling. divtc3(a, b, c, d) computes the complex quotient (a + bi)/(c + di).

Functions which simply wrap other functions (and may be overridden by platform-specific code), etc include cimag which dispatches to some GCC-specific syntax, carg which wraps atan2, cabs which wraps hypot, ldexp which wraps M_SCALBN with error handling, w_log1p which wraps log1p, & w_scalbln which wraps scalbln.

For a bit more fallback logic when lacking platform-specific code nexttowardf compares the two given floats’ exponents to determine whether to decrement or increment. nextafter does basically the same thing.

Several functions’ fallbacks wrap corresponding platform-specific functions with more error handling. These include:

Functions which wrap the platform-specific kernel_standard function include:

Most of these also call the corresponding platform-dependant function under certain conditions, and they all add some error handling.

Functions with fully platform-dependant implementations include:

CPU & Kernel Ports

GNU LibC’s whole job is to expose platform-independant APIs upon platform-specific code. While much of the surrounding code can be platform-independant, GNU LibC manages all their platform “ports” (as I’ll call them) in its “sysdeps” directory as explained for x86_64 Linux here. Numerous other platforms are also supported.

UNIX-like Port

The different syscalls are listed in a TSV file (name, caller, syscall name, mnemonic arg types, strong name, & weak names cols) which is converted into C code via a shell script, referring to a C preprocessor header file to retrieve the syscall number. This defines macros to instantiate a CPU-specific assembly-code template, whilst validating args to be compiled via the C calling convention.

The core syscalls GNU LibC supports are:

More Linux- or BSD-specific syscalls may be defined.

The Assembly template used by these involves storing the syscall number in the RAX register & using the syscall opcode, possibly wrapped with error handling to save returned error codes to the errno global and/or convert EWOULDBLOCK_sys to EAGAIN.

For trivial UNIX code:

Slightly more complex, getlogin wraps ttyname_r & (surrounded by setutent & endutent) getutline_r.

ifreq calls opensock if it’s not given an open socket, then calls ioctl(SIOCGIFCONF) with exponentially growing buffer sizes, compiletime-optionally followed by repeated if_nextreq calls.

pts_name repeatedly calls ptsname_internal with exponentially larger buffer sizes. I’m not seeing the definition for ptsname_internal outside of HURD.

grantpt retrieves input from pts_name, getuid (optionally with a chown to ensure we own the shell), getgrnam_r (buffer allocated via alloca(sysconf(SC_GETGR_R_SIZE_MAX))) once or getgid (also optionally with a chown). With that info it wraps an optional chmod, possibly via a privileged command.

And getlogin_r wraps ttyname_r & locked (via mutex & set/end-utent) __libc_getutline_r.

Implementing POSIX Upon Core UNIX

While I don’t believe it’s needed to get GNU LibC running on Linux, GNU LibC does implement the more refined POSIX APIs upon the older UNIX ones. The way it implements asynchronous I/O is especially interesting…

Studying the simplest cases first:

Slightly more complex, there’s:

More complex ones involve:

Relying more on magic numbers/strings there’s:

Finally there’s a header file for some error codes & textual descriptions this platform-binding might emit, some code for looking these up in their table, & extensive logic (with a lookup table of sizes for different output structs) for resolving textual host identifiers to IP addresses, etc using the various platform-independant logic (including NSS) I’ve described previously. I won’t bother with details.

Various of these functions can be overridden elsewhere.

If the kernel doesn’t support asynchronous I/O (which Linux does) GNU LibC will reimplement it poorly upon pthreads.

At the core of this “pthread” platform binding is a sorted linked list of timestamps at which a dedicated thread should trigger interrupts, sleeping during the intermediate times. It blocks on a condition so this sleep can be interrupted when new timers are added.

There’s a binary-searched sorted array in which to look up semaphores, each stored in mmap-shared memory.

The mutex-locked arrays/linked-lists for file operation “requests”, & their “waitlist” atomic conditions, have additional logic for keeping them tidy. The requests array has its own freelist.

Upon enqueueing it groups requests by file descriptor, and the thread for enacting these requests calls [p]read[64], [p]write[64], fdatasync, or fsync for each depending on the given opcode, notifying the enqueuer of the result via those atomic conds.

Generic Port

GNU LibC has bindings to a “generic” platform. This appears to be for when it’s running upon extremely limited kernels, and needs a rudimentary userspace implementation of core UNIX features. I don’t know why this isn’t implemented as the fallback logic provided by the public APIs…

Anyways starting with the most trivial implementations:

For slightly more complex functions, there’s:

More complex functions in GNU LibC’s “generic” bindings:

And finally, there’s headerfiles containing data on:

x86_64 Port

Despite what I’ve described so far, GNU LibC doesn’t only abstract datastructures provided by the kernel. It also abstracts those provided by the CPU (and GCC). So it includes ports for various different CPU architectures, though I’ll only describe x86_64.

Separately I’ll describe the FPU (Floating Point Unit) ports amongst others, which are separate directories. But today I’ll just focus on the core x86 port.

Starting with the simplest APIs:

Slightly more complex, there’s:

The relatively complex internal functions are:

Functions related to the ELF or DWARF linkers:

The x86_64 linker port especially has several internal functions written in C:

And there’s a datafile defining registers for the systemcall calling convention.

x86 Multiarch port

There’s a few different variants of x86_64 CPUs; they may support one or more extended instruction sets, & you can ask x86_64 CPUs which extended instruction sets they support!

For core memory-copying (“string” sublibrary) functions GNU LibC often wishes to take advantage of these extra opcodes where available. There are numerous of these functions (the non-multiarch versions were described the other day) backed by complex Assembly code, but today I wish to specifically focus on strcat.

The entrypoint supports AVX2 with fast unaligned load & an optimal VZEROUPPER op, just fast-unaligned-load, or SSE3 with a fallback to SSE2.

Advanced Vector Extensions (AVX, 2011) adds 16 256bit “vector” registers each holding 8 32bit numbers or 4 64bit numbers, such that it can apply the same arithmetic to all of them in a single instruction. AVX2 extends these ops to integers, & the VZEROUPPER op clears the upper half of those larger vector registers to avoid penalties when mixing with older SSE code.

The CPU’s cache hierarchy may operate in fixed wordsizes, to the degree that it is suboptimal to read data spread across different words. The dispatch code for multiarch bytearray functions like strcat tests whether or not it needs to optimize for this.

Streaming SIMD Extensions (SSE) are another suite of extensions to the x86_64 instruction set supporting vectorized floats. SSE1 supported 8 registers each holding 4 32bit floats. SSE2 added int/byte & 64bit-float support.

That SSE2 fallback strcat implementation is the normal non-multiarch code described the other day. At this point we assume any 64bit Intel/AMD CPU supports SSE2.

If there’s no performance penalty to unaligned reads, it need only run a vectorized unrolled loop around SSE2 ops: pcmpeqb & pmovmskb to determine when to stop copying, & unaligned vector moves to actually copy each chunk to its new location.

For AVX2 CPUs it uses vpcmpeqb & vpmovmskb in its own vectorized unrolled loop for wider copies.

If unaligned reads are inefficient on your SSE3 CPU, GNU LibC will use a strcat implementation which uses the movlpd instruction for copying unaligned data 256bits at a time, with numerous exit codeblocks! (Though movlpd itself actually dates back to SSE2.)

NPTL PThreads-port

Here I’ll discuss the PThreads port to GNU LibC’s “nptl” sublibrary, and to x86 machine code.

pthread_cancel_init emits a membarrier if PThreads are already initialized. Otherwise it loads _Unwind_Resume, gcc_personality_v0, _Unwind_ForcedUnwind, & _Unwind_GetCFA from LibGCC. It exposes public wrappers for those functions which also lazily call that initializer & apply atomic read barriers.

GAI_MISC_NOTIFY macro iterates over/clears an array of futexes to futex_wake.

GAI_MISC_WAIT reads the given futex & if non-zero under a __gai_requests_mutex lock repeatedly calls __futex_abstimed_wait[_cancellable]64 until successful.

gai_start_notify_thread wraps pthread_sigmask & sigemptyset.

gai_create_helper_thread wraps pthread_create with various configuration APIs including pthread_sigmask & pthread_attr_setstacksize.

futex_abstimed_wait_common32 wraps the corresponding syscall (cancellable variant or not) with range & timestamp validation.

__futex_abstimed_wait_common64 wraps the corresponding syscall (cancellable variant or not), falling back to the 32bit variant described above, with timeout & clockid validation & error reporting. __futex_abstimed_wait64 & futex_abstimed_wait_cancelable64 are trivial wrappers around that.

fork wraps the kernel port’s version with callback dispatch & (if multiple threads are running, some under an IO_list lock) other NSS functions, clearing I/O locks & allocating new thread data.

thread_gscope_wait iterates over all system-allocated pthreads in the dl_stack_used linkedlist skipping THREAD_GSCOPE_FLAG_UNUSED ones, atomically exchanges its gscope_flagp from THREAD_GSCOPE_FLAG_WAIT to ...FLAG_USED, & if successful repeatedly futex_waits on them. Then does the same for the user-allocated pthreads in dl_stack_user.

thread_attr_compare tests equality on the various properties of the given struct pthread_attrs.

_IO_lock_init macro stores some initial values in the given var. _IO_lock_fini macro returns 0. _IO_lock_lock conditionally (on other threads) wraps lll_lock & stores THREAD_SELF to the given var before unconditionally incrementing a counter.

_IO_lock_trylock similarly conditionally wraps lll_trylock. _IO_lock_unlock decrements the count & upon reaching 0 calls lll_unlock. There’s some _IO_cleanup_region_* macros wrapping __libc_cleanup_region_* funcs.

_IO_acquire_lock wraps IO_flockfile with a register cleanup function. _IO_release_lock macro closes it’s codeblock.

INLINE_SETXID_SYSCALL (unless there’s multithreading) wraps INLINE_SYSCALL macro with enqueueing until pthreads have been initialized. This enqueueing is done differently whether it’s shared threads-system or not.

Alongside var declaration macros & with lighter bootstrapping variants for use in the platform-independant PThreads implementation, there’s mutex macros.

__libc_lock_init_recursive macro wraps pthread_mutex_init with pthread_mutexattr_* configuration & an existence check. libc_lock_fini_recursive wraps pthread_mutex_destroy. libc_lock_recursive wraps pthread_mutex_lock, or in bootstrapping resembles the IO_locks. Equivalent for libc_lock_trylock_recursive & libc_lock_unlock_recursive. And there’s a couple macros wrapping pthread_cleanup_push/pop_defer/restore with local state & checks.

There’s a futex API wrapping the lll_futex_* API & occasional syscalls with validation & error reporting. And for futex_clocklock64 an atomic-exchange loop.

There’s AIO_MISC_NOTIFY & AIO_MISC_WAIT akin to the corresponding GAI_ macros.

For simpler public APIs there’s:

For x86_64-specific components, there’s:

Atomics are uninterruptable CPU ops reading/writing single words, upon which synchronization primitives are implemented.

Floating Point Math

We love giving computers mathematical equations to solve! Even if your problem isn’t inherently mathematical, computers tend to represent I/O as numbers. To aid the non-trivial formulas you give them, GNU LibC provides a highly optimized library implementing common mathematical functions, of which there are CPU-independant & CPU-dependant implementations.

There’s some generic error reporting functions branching upon a number indicating which function was called. There’s a CPU-overridable __matherr function which appears to return 0, with an x->arg1!=x->arg1 check (which tests whether arg1 is NaN). And a wrapper around that error reporting.

CPU-independent (“ieee754”) Doubles Implementation

I’ll continue by beginning to describe the CPU-independent implementations for 64bit doubles. The CPU-independent implementations for other numeric types are very similar, just that the more bits are used, the larger the approximating polynomials & the more magic numbers needed.

Starting with the simplest cases:

Slightly more complex there’s:

For the more complex functions with relatively few magic numbers…

There’s a double to [u]intmax_t conversion template function, with caller-given rounding mode & width, wrapping bitmasks, bitshifts, fromfp_domain_error upon error, fromfp_max_exponent, & fromfp_round_and_return.

remquo wraps 2 EXTRACT_WORDS64 & a single INSERT_WORDS64 with some error checks, fastpaths (one deferring to __ieee754_fmod), & conditional arithmetic rounding the divide & multiplying to take the remainder.

lround utilizes EXTRACT/INSERT_WORDS64 & bitwise ops, with conditions for errorchecks & upon compiletime bitsize.

fma may wrap __builtin_fma if available or performs the * & + itself with lots & lots of rounding/error checks. Possibly multiplying by different exponents.

mpatan initializes a multiprecision output number, chooses a number of iterations based on how EX (where’d that input come from?) compares to 0 & if that’s non-zero it repeatedly sums, squareroots, & divides. It looks up the resulting precision in a lookuptable & performs that many divides, multiplications, & subtractions.

__remainder splits its two inputs into 2 halves bitmasking off the signbit & checks their range, performing different multiplies, adds, subtracts, bitmasks, & recursion.

branred performs range reduction via some multiplies & subtracts, extracts the high & low words of that, bitshifts & bitmasks the highword twice with a tad of arithmetic each time, gathers an array of six doubles from multiplications, a couple of sum loops (a bit convoluted to handle precision issues), a few more additions/subtractions, repeats essentially the same thing, & some multiply-adds.

mpn_extract_double uses a C union & conditional bittwiddling to convert a double to multiprecision.
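That union trick can be sketched as follows, assuming the usual IEEE754 binary64 layout (1 sign, 11 exponent, 52 mantissa bits); the helper names here are mine, not glibc’s:

```c
#include <stdint.h>

/* Pick apart a double's bits via a union, akin to what
 * mpn_extract_double & the EXTRACT_WORDS64 macro do. */
union dbl_bits { double d; uint64_t u; };

static int dbl_sign(double d) {
  union dbl_bits b = { d };
  return (int)(b.u >> 63);                   /* top bit is the sign */
}

static int dbl_exponent(double d) {
  union dbl_bits b = { d };
  return (int)((b.u >> 52) & 0x7ff) - 1023;  /* remove the bias */
}

static uint64_t dbl_mantissa(double d) {
  union dbl_bits b = { d };
  return b.u & 0xfffffffffffffULL;           /* low 52 bits */
}
```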

hypot splits the inputs each into two halves with sign bitmasked off, swaps input to ensure the 2nd is larger, checks range & errors to determine what to set the highwords to, and what to multiply the inputs by, performs sqrt surrounded by arithmatic depending on whether diff’s larger than the larger input, & an error check.

mpsin wraps slowsin with some conditional range reduction arithmetic. If the caller specifies, it performs some multiplies & subtracts (how many depends on range).

If not requested it instead converts its inputs into multiprecision. Followed by, under 3 conditions, 3 multiprecision squares followed by a couple multiprecision squares, numerous multiplies (a few loops), & half as many subtracts.

The range reduction, if run, indicates which quadrant to simplify the computation before deferring to slowsin. Similar is done for mpcos.

Then there’s a range of multiprecision arithmetic functions…

Of these multiprecision functions mcr iterates over the mantissa digits until it finds the first differing one. acr compares the first two exponent digits falling back via mcr to the mantissa.

cpy (which I’ve been eliding from explanations of other functions) copies every digit from one multiprecision to another.

norm branches upon the given precision to perform the relevant multiply-adds. For large enough values it’ll use for-loops. Then multiplies all digits by RADIX[I] constants.

denorm branches upon the given precision & EX to populate a 5digit array (from constants, input digits, or their sum), applies rounding, and computes the output double from those digits.

mp_dbl decides whether to defer to norm or denorm.

dbl_mp extracts the sign then exponent then digits to populate multiprecision number.

add_magnitudes iterates over the given x number (must be larger) & corresponding y digits to compute sum. Final optional iteration shifts digits down 1.

sub_magnitudes does essentially same thing for subtraction.

add & sub wrap cpy, acr (to get arguments in correct order), & add_magnitudes or sub_magnitudes (depending on sign).
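The digit-wise scheme behind add_magnitudes can be sketched with a toy base-10 version (glibc’s mp numbers actually store digits as doubles in a much larger radix, plus a sign & exponent, all elided here):

```c
/* Toy base-10 digit-wise magnitude addition, akin to add_magnitudes:
 * sum corresponding digits from the least-significant end, carrying
 * overflow upward; the extra leading output digit is the final
 * "shift digits down 1" the text mentions. */
#define RADIX 10
#define NDIGITS 4

static void add_mags(const int x[NDIGITS], const int y[NDIGITS],
                     int out[NDIGITS + 1]) {
  int carry = 0;
  for (int i = NDIGITS - 1; i >= 0; i--) {
    int s = x[i] + y[i] + carry;
    out[i + 1] = s % RADIX;  /* keep one digit */
    carry = s / RADIX;       /* push the rest up */
  }
  out[0] = carry;
}
```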

mul after checking fastpaths computes the product of each pair of digits into a “diag” array then computes a “zk” via multiplies, adds, & subtracts over the two inputs & the diagonal in two loops populating output as it goes. Then incorporates carries.

sqr, after some shortcuts, repeatedly multiplies the last p digits before the kth digit with the first p digits taking the sum for kth digit, with a similar but tighter trailing loop & some minor postprocessing/normalization.

inv converts the multiprecision to a double and back to apply an inverse operation. Then uses 2 multiplies & 1 subtract of postprocessing per digit to regain the multiprecision.

dvd wraps inv & mul if non-zero input to perform divide.

For math on doubles which utilizes some non-trivial magic numbers… (Not that I can read these hexadecimal magic numbers)

sincos splits the input into two halves & bitmasks off the signbit, & checks the range of that absolute high half, ultimately deferring to do_sin, do_cos, & maybe reduce_sincos or branred.

scalb[l]n[l] wraps EXTRACT/INSERT_WORDS64 with bitshifts, bitmasks, & constant multiplies by 2^54 (~1.8e16), 2^-54 (~5.55e-17), tiny (~1e-300), or huge (~1e300).

rint & nearbyint wrap EXTRACT_WORDS64 & maybe INSERT_WORDS64 with bittwiddling & adding/subtracting ±2^52 (~4.5e15). lrint uses more sophisticated bittwiddling & error reporting, less for llrint.
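That ±2^52 trick works because doubles at that magnitude have no mantissa bits left for fractions; a hedged sketch (only valid for |x| &lt; 2^52, & the volatile stops the compiler folding the add/subtract back out):

```c
/* Round to integer via the 2^52 trick rint-style functions rely
 * upon: adding then subtracting 2^52 forces the FPU to round the
 * fraction bits away in the current rounding mode. */
static double round_via_2p52(double x) {
  const double big = 4503599627370496.0;  /* 2^52 */
  if (x < 0.0) {
    volatile double t = x - big;
    return t + big;
  }
  volatile double t = x + big;
  return t - big;
}
```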

reduce_sincos wraps mostly subtracts & constant multiplies between its args.

For values between + & - 0.126 do_sin (with its own sin wrapper) squares its input & uses Taylor Series refinement, which is a polynomial equation. Otherwise do_sin uses a different polynomial & looks up several parameters from a lookuptable to incorporate into a second polynomial (constants look like multiples of 1/6). Copying the original sign back over into the output.
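For such small inputs a truncated Taylor series already reaches near-double accuracy; a sketch of that fast path, using the plain factorial coefficients rather than glibc’s tuned constants:

```c
/* Taylor-series sine for small |x|:
 * sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! + ...,
 * evaluated Horner-style as a polynomial in x^2. For |x| < 0.126
 * the next (omitted) term is far below double precision. */
static double taylor_sin(double x) {
  double x2 = x * x;
  return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120
             + x2 * (-1.0/5040 + x2 * (1.0/362880)))));
}
```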

do_cos (with its own cos wrapper) works similarly but without the Taylor Series fast path.

log10 wraps EXTRACT/INSERT_WORDS64 with adds, subtracts, bitmasks, & multiplies by 2^54 (~1.8e16), 1/ln(10), or upper & lower halves of log10(2).

log1p extracts the highword, bitmasking off the signbit, & checks its range to report errors or perform the appropriate arithmetic or bittwiddling. Often refining the result via polynomial.

sqrt splits its input into two halves bitmasking off exponent & signbit & validates range, & performs lots of arithmetic with error checks. Or recurses.

To compute cuberoots (cbrt) it extracts mantissa & exponent, computes a new mantissa via a polynomial to cube, then multiplies by a fraction computed from the old mantissa & an exponent-looked-up value of the cuberoot of 2 (~1.26), its square (~1.6), 1, or their inverses.

acosh extracts the two words from its input, & checks its range to report errors, compute log(x) + ln(2), compute log(2x - 1/(x + sqrt(x*x - 1))), or compute log1p(x - 1 + sqrt(2 * (x - 1) + (x - 1)^2)).

expm1 retrieves the highword & separates off its signbit; checks its range to report error cases; adds, subtracts, or multiplies the two words of ln(2) depending on the range to compute two words & “k”; performs some constant multiply-adds; under different ranges of the computed “k” computes different multiply, add, & subtract formulas; & under those different conditions reassembles the two words into output.

lgamma_r reduces its input range to be <= 8 via constant logs, approximates via polynomial, refines, & handles negative inputs specially amongst other special cases.

jn wraps sincos with word extraction, bittwiddling, various range checks, & arithmetic. Maybe some loops. yn wraps y0 or y1 or similar logic.

I see various other wrappers around sin/cos. lgamma_neg wraps log1p, fabs, log, some of those wrappers, and/or lgamma_product with plenty of arithmetic, conditions, & loops.

rem_pio2 looks up a “jk” value based on given precision from a lookup table, performs various arithmetic & loops, wraps scalbn & floor to compute “z”, some conditional bittwiddling within an “iq” array, computes a “diagonal” via another arithmetic loop, populates additional tables from that, & branches over precision for how to compress the array.

There’s plenty of functions which do little more than multiply-adds upon a lookup table. Maybe add some conditional trigonometry.

gamma_positive wraps exp, lgamma_r, maybe other such arithmatic functions, & maybe a repeatedly-checked lookup table under different ranges. gamma_r wraps gamma_positive with some error checks, scalbn, trunc, & trigonometry.

Some of those various multiply-add lookuptable functions perform different multiply-adds for different input ranges.

For implementations which use lots of magic numbers…

Looking at the dataheaders dosincos (described yesterday) refers to constants for a 5-degree polynomial, a large value, I think another polynomial, etc. Variable names aren’t that legible.

utan’s dataheader includes positive & negative multiplicands for at least 3 polynomials, and some other constants.

In a separate headerfile utan has a lookup table of “xi”, “Fi”, “Gi”, & “FFi” parameters. I see no documentation on where these numbers came from beyond “IBM Accurate Mathematical Library”.

urem has constants for a large value, 2^128, 2^-128, 0, & -0.

There’s separate constant values depending on CPU byteorder. Maybe I won’t continue describing these dataheaders, they’re not very self-documenting.

cosh retrieves the high word of its input “x” bitmasking off the signbit & checks the range to determine whether to return 1, 1 + (t^2)/(2(1+t)) with t = expm1(|x|), 0.5exp(|x|) + 0.5/exp(|x|), 0.5exp(|x|), 0.5exp(0.5|x|)^2, or an error.

asinh does similar, copying sign onto the output, for the formulas: x, 2x, log(|x|) + ln(2), log(2|x| + 1/(sqrt(x^2 + 1) + |x|)), or log1p(|x| + x^2/(1 + sqrt(1 + x^2))).

fmod extracts the two words of its input, extracting the signbit; checks for errors; computes ilogb(x) via bitshift-subtract or by repeatedly decrementing ix for each bit in the extracted “x” until it’s no longer positive; does the same for “y”; aligns the highwords of x & y via bittwiddling; repeatedly (the difference-between-those-logs times) compares the high words to determine whether to double the x highword, 0, or the difference to produce a new “hx”; & reformats back into floating point under different conditions.

exp retrieves the 12 highestorder bits with sign bitmasked off, if that’s greater than those for 512 it checks various error conditions to exit on, divides x by ln(2N), performs bitcasting & rounding, adds x to the rounded value times -ln(2N) “r”, performs a couple of tablelookups, computes a polynomial on “r” adding the first lookup, checks whether it needs to be careful regarding overflows, & computes/applies a “scale” from the second lookup.

If its input is less than 0.5*hp0.x docos performs a tablelookup for partial parameters to a large polynomial it then evaluates via macros taking great care around overflows. If its input is less than 1.5 times that it’ll perform a little preprocessing to run the sin helper which works similarly. Otherwise it’ll subtract that absolute input from 2*hp0.x before calling the helper.

exp2 checks error cases before applying similar arithmetic & table lookups to exp.

If input’s less than 22 sinh, with some overflow checks, computes sign(x)*0.5 times a formula in t = expm1(|x|) (e.g. 2t - t*t/(t+1)). If it’s less than log(maxdouble) sinh computes sign(x)*0.5*exp(|x|). If it’s less than an overflow threshold sinh computes sign(x)*0.5*exp(0.5|x|)^2. Otherwise sinh reports an overflow error.

atan after splitting its input, error checking, temporarily configuring rounding to NEAREST, & taking absolute input checks its input’s range.

It may return an identity of its input. It may compute a couple polynomials with fast exits, eventually getting careful regarding overflows before finally resorting to arbitrary precision arithmetic. Another couple polynomials it might run consult a lookuptable, branching to determine their parameters with similar regard to overflows, calling signArctan on fastexits. Another resembles the first but with signArctan calls on fastexits. Or it reports overflows.

log bitcasts its input to an int whilst also extracting its top 16 bits before checking its input’s range, possibly performing some additional polynomial calculation and/or refining the lookuptable index, performing a tablelookup, & doing a little bit of arithmetic on it. log2 works similarly with slightly more maths. tan works similarly to sinh.

pow bitcasts both inputs to ints whilst extracting the 12 highest-order bits before checking the range & performing some final arithmetic.

Those range checks amount to errorchecks, fastpaths, & input simplification. The final arithmetic calls off to log, __builtin_fma (or it performs the two-word multiply-add manually), & exp. So pow(x, y) = exp(log(x)*y). There’s a specialcase where it’s extra careful regarding overflows, & it’s got its own inlined implementations of exp & log.

atan2 splits its input in half & checks its range for errors then fastpaths then normalization then …

… applying an appropriate polynomial, being careful later on regarding overflows, eventually resorting to multiprecision arithmetic, & upon fastexit copying the appropriate sign over, maybe looking up parameters in a lookuptable. Regardless there’s a final such lookuptable polynomial it applies, being careful regarding numeric overflows.

Normalizing may involve multiprecision divide, multiply, & subtract.

acos & asin involve lots of range-conditional arithmetic. And maybe a lookuptable.

x86_64 Float Port

Here I’ll study how GNU LibC implements various math functions on x86_64 CPUs.

For the simpler cases:

Slightly more complex, there’s:

Most of the more complex implementations of mathematical functions refer to various constants & lookup tables.

For trigonometry data it has a bitmask for copying everything but the floating point sign bit OR only that bit, range limits, polynomial coefficients for high accuracy, Pi in 4 parts & a variant for FMA units, 4 polynomial constants & a FMA variant, 1/Pi, rightshift value, pi/2, 1/2, lookup table mask, & 2^(k-1).

All those values are stored in a lookup table with macros defining indices into it.

For exponentiation it has constants for value ranges, infinity, bitmask copying everything but the floating point sign bit, 4 polynomial coefficients, bitmask to retrieve the floating point exponent field, 2^10 (do I have this right? I don’t read hex floats), more ranges, bitmask to divide by 2, 1(?), log(2) high & low parts, threshold, bias (2 parts), 1/ln(2), shift value, 2^(1/2^k)-1, & 2^k-1.

Some of those values are repeated (to play well with the CPU’s dataprefetcher? That’s often optimized for array traversal), and interspersed in here are lookup tables. Not that I see any comments about how they’re computed…

There are ones for “Log(2) lookup High+Low table for logarithmic part”, “Log(2) lookup table for logarithmic part”, “dInfs = DP infinity, +/- ==”, “dOnes = DP one, +/- ==”, “dZeros = DP zero +/- ==”, & a last one I think called “dbT”.

For logarithms there’s 7 polynomial coefficients, constants for work range checks, a mantissa “break point”, significand mask, 1.0, ln(2), +/- infinity, +/- 1.0, & +/- 0.0 (yes, floating point supports signed zero).

For expf it has constants for 1/ln(2), right shift value, ln(2) low & high parts, a bias, 6 polynomial coefficients, bitmask copying everything but the floating point sign bit, & a working domain range.

There’s a reencoding of the trig constants in doubles (might be some other differences…). There’s 5 lookup tables for log(2) & 2^x (whether or not there’s “HSW” support, also with reciprocals if not).

Between these are a bitmask to retrieve mantissa field, 1.0, Cvt bitmask, min & max normal values, rnd bit & bitmask, 6 polynomial coefficients (first with low & high part), bitmask to copy everything but sign bit, domain range, shift value, index mask, & 4 more polynomial coefficients.

After those lookup tables there’s additional constants for exponentiation computation on doubles: 7 polynomial coefficients, 5 more for a different polynomial. bitmask to retrieve mantissa field, 0x3fe7fe0000000000, 0xffffffff00000000, 0xfff0000000000000, & 0x3fe7fe00 are unlabeled, 1.0, 2^20+2^19, a bitmask, -log2(e), 2^45+2^44, -infinity, -0.0, 2^52, 1/2^111, delta, range, bitmask to copy everything but the sign bit, +infinity, domain range, index bitmask, index add, 2^20+2^19, & 1.0.

For logs of doubles there’s two lookup tables for “high+low parts and 9-bit index for -log(mRcp), where mRcp is mantissa of 1/x 9-bit accurate reciprocal” & “9-bit index for -log(mRcp), where mRcp is mantissa of 1/x 9-bit accurate reciprocal”; followed by 4 polynomial coefficients, bitmask to retrieve exponent field, 2^10, min & max normal values, bitmask to divide the mantissa in half, 1.0, log(2) high & low parts, threshold, bias, log(2), +/- infinity, +/- 1.0, & +/- 0.0.

For exponentiation of doubles there’s a 2^(j/2^k) lookup table followed by 2^k/ln(2), shift values of 3*2^52, ln(2) high & low parts, 3 polynomial coefficients, index mask, bitmask to copy everything but the sign bit, & a domain range.

exp10l & similar, after validating input, repeatedly: multiplies by a relevant CPU constant, temporarily sets round-to-nearest whilst it rounds two numbers, consults a tiny lookup table to multiply-subtract by, and does a little more arithmetic.

expm1l uses extremely similar but non-identical code. And this code can be used as a template for other even more similar functions.

log2l after validating input takes the absolute value of the decremented input, fcompl & fnstsw opcodes on that, validates the result, conditionally takes the absolute value, & wraps fyl2xp1. The finite variant does a few checks & no loops.

powl after extensive input validation with fastpaths for ints >= 4, or 0.0, calls a helper function.

There’s a C macro, presumably for debugging purposes since the public API has better logic for this, which outputs a floating point number in hexadecimal by wrapping itoaword under different conditions. Hexadecimal so it doesn’t need to convert to base 10, which is complex.

sincos for x86_64 Advanced Vector Extensions CPUs surrounds calls to the SSE4 implementation with plenty of movs & lea ops. If there’s 32 Instruction Level Parallelism the initial movs will be stack ops, otherwise vmovdqu ops.

There’s similar handling for sincos on SSE2 & AVX512 CPUs. Double or floating point.

Though there is a vzeroupper opcode prior to the two calls in all these. And the 32 Instruction-Level Parallelism code may include constants inline. Or in one case there’s 4 calls.

There’s macros for generating simpler variants of such logic. The AVX ones look somewhat interesting, though I still don’t get the impression I’m seeing the core logic.

x86_64 Multiarch port

There are extended instruction sets CPUs may support to aid efficient implementation of various math functions. Today I wish to describe how GNU LibC takes advantage of these! (Which is neatly bifurcated into 2 or 3 sections)

For the simplest cases:

There’s C function macro-templates for choosing which implementation of various functions to be used based on the CPU’s self-reported featureset.

To compute sinf16 on AVX512 CPUs it uses Assembly code to bring the input range between -Pi/2 & Pi/2 by removing the sign bit saving aside old sign, multiply by inverse pi (i.e. divide by pi) to get the “octant” (using vfmadd213ps, vcmpps, & vpbroadcastd ops), subtract a right-shift, & something about signbit & integers (using vfnmadd ops).

It continues by using multiply-adds to evaluate a polynomial, squaring the ranged-reduced input first, then restores the original sign bit & performs some final movs & validation. Possibly back to the CPU-generic code.

I am judging the operation largely by comments here… There’s several other implementations of this same function, in large part for different extended instruction sets.

Cosine is computed similarly, possibly as a sideeffect.

GNU LibC implements pow(x, y) as 2^(log2(x)y). The log2 here is implemented by splitting the float into integral (“k1”) & fractional (“X1”) parts, & computes the reciprocal of X1, undoing & redoing it 3 times to handle imprecision. These reciprocals & their logarithms are looked up in a table, before being summed with a polynomial of their product (less 1/ln(2)) to get the logarithm.

This gives a 3 part number for the multiplication to operate on.

To calculate 2^x as part of pow(x,y) GNU LibC splits “x” into N + j/2^7 + Z where N & j are integers. Hence 2^x ~= 2^N * 2^(j/2^7) * 2^Z, where 2^(j/2^7) is looked up in a table, & 2^Z is computed via a polynomial. I’m not clear what’s happening with N.

Then there’s validation…

The AVX512 implementation uses opcodes like vshuff32x4, kxnorw, vcvtps2pd, vpsrlq, vpbroadcastd, vpternlogq, vrcp28pd, vrndscalepd, vcvtdq2pd, & vgatherqpd. Though there’s other implementations.

GNU LibC implements log(x) as exponent*log(2) + log(mantissa), with adjustments for mantissa > 4/3. With log(mantissa) being approximated as a polynomial of varying degrees. And of course there’s validation!

The AVX 512 implementation uses relatively few weird opcodes, though it does use vpsrad, vcvtdq2ps, & vpbroadcastd. Mostly multiply-adds. Though again there’s other implementations.

For exp & related it takes the input as resembling N*ln(2) + ln(2)*(j/2^k) + r.

As such exp(X) = exp(N*ln(2) + ln(2)*(j/2^k) + r) = 2^N * 2^(j/2^k) * exp(r). 2^N is calculated via bit manipulation. 2^(j/2^k) via a lookup table. exp(r) approximated via polynomial. And there’s the usual validation.

AVX512 implementation uses vpcmpgtd to compare against a threshold value, & vpbroadcastd in regards to computing 2^N. Otherwise it’s mostly multiply-adds. Again there’s other implementations.

Linux Port

Most of GNU LibC’s porting effort of course goes to Linux, and to HURD. So how does this well optimized port work?

The functions which on Linux wrap system calls (unless otherwise stated the syscall identifier is the same as the function’s name), or are trivial wrappers for other functions, include:

Some more functions around syscalls, ones which wrap several syscalls, are:

Here I’ll study the functions which don’t call into Linux, at least directly, & don’t use many magic numbers!

There’s a few functions to check & report overflows.

There’s format conversion helper functions for e.g. xstat[64,32] or rusage.

I see more wrappers around the time syscalls.

I see a pthread which infinitely loops calling sigwaitinfo & spawning new pthreads for provided callbacks, for implementing timers.

I see atomic accessors for a global caching whether ..._time64 syscalls are supported.

There’s a thread_sleep[64] wrapper around clock_nanosleep_time64, thread_sleep adds format conversion.

Unless overridden by CPU ports the INTERNAL_VSYSCALL_CALL macro compiles to a function call, whilst INLINE_VSYSCALL wraps it & INTERNAL_SYSCALL_CALL with error checks & GLRO globals. INTERNAL_SYSCALL_ERROR_P compares the given value against -4096. SYSCALL_ERROR_LABEL wraps set_errno.

INLINE_SYSCALL macro wraps INTERNAL_SYSCALL, INTERNAL_SYSCALL_ERROR_P, & depending on that result SYSCALL_ERROR_LABEL & INTERNAL_SYSCALL_ERRNO (which negates its arg). INLINE_SYSCALL_ERROR_RETURN_VALUE also wraps set_errno. There’s alignment macros which by default assume registers are paired like on x86. There’s SYSCALL_LL[64][_PRW] macros for reformatting 64bit args if necessary when passing to Linux. & LO_HI_LONG splits a word in 32bit halves.
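The -4096 comparison works because Linux returns errors as negated errno values, & no valid syscall result falls within the top page of the address space. A sketch of the convention (helper names mine, not glibc’s):

```c
/* Linux syscall return convention: values in the range
 * (unsigned long)-4095 .. (unsigned long)-1 are negated errno
 * codes; everything else is a legitimate result. */
static int syscall_error_p(unsigned long retval) {
  return retval > (unsigned long)-4096L;
}

static int syscall_errno(unsigned long retval) {
  return (int)-(long)retval;  /* e.g. -ENOENT back to ENOENT */
}
```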

There’s more statfs[64] wrappers, & format conversion helpers. And sigwait, a sigtimedwait retry-wrapper.

sigstack wraps sigaltstack with stack_t datastructures.

There’s a sigset_t smallintmap counter implementation.

set[ipv4]sourcefilter wraps setsockopt(MCAST/IP_MSFILTER) with input (including response from get_sol) gathered into stack or heap-allocated memory.

Dirs are read like normal files on Linux, and there’s functions for decoding their intermediate fileformat. Though reading individual entries of a directory is a separate “getdents64” syscall, despite what I stated above, even if there’s still need for buffering & output reformatting. Has several wrappers.

ptsname[_r] wraps ioctl(TIOCGPTN) with format conversion.

There are format conversion functions, some retrieving clock IDs from struct pthreads.

Several more time-related wrappers.

ifreq opens a socket to wrap ioctl(SIOCGIFCONF) with format conversion, similar for if_nametoindex/if_indextoname & ioctl(SIOCGIFINDEX).

if_nameindex[_netlink] wraps calls into Linux’s Netlink system to get linkedlists to reformat into an array, with a corresponding function to free that data.

getipv4sourcefilter works like setipv4sourcefilter.

fdopendir calls fstat64, fcntl64_nocancel(F_GETFL) & alloc_dir to validate the given filedescriptor.

fpathconf dispatches to different functions processing the response from fstatfs, or falls back to posix_fpathconf.

setup_vdso_pointers, called by the ELF loader, loads GLRO globals from said ELF loader.

DL_SYSDEP_OSCHECK validates OS version requirements from GLRO globals. dl_setup_stack_chk_guard reprocesses some provided random numbers to help validate against stack underflow.

dl_make_stack_executable wraps mprotect with compiletime-optional validation & setting of a bitflag in GL(dl_stack_flags).

cmsg_nxthdr performs pointer arithmetic to traverse a struct cmsghdr.

There’s code for reading/allocating linkedlists as mentioned above & validating them from Linux’s PF_NETLINK sockets. And freeing said data. With usecount & “timestamp” atomic counters. Some such functions validate the interfaces use native transport, utilizing failure-retry.

And there’s structs, enums, & macros defining the format of a.out files.

Here I’ll look at the function implementations which use some magic numbers.

The TRANSFORM_UTMP_FILE_NAME macro wraps a few iterations of strcmp & access checks upon the given filename & some global constants typically with “x” appended. I see this exact same macro in two different files: one extending <login/utmp_file.c> & the other <login/updwtmp.c>.

There’s funcs for transforming “stat64” parameter formats involving some bittwiddling.

readdir_r whilst locking the provided DIR* repeatedly calls readdir_unlocked until successful.

opendirat wraps openat_nocancel with non-empty filename & fstat64 validation before allocating the DIR*.

There’s ftime[64] format conversion which may involve a divide & zeroing unspecified fields.

dl_osversion_init & its wrapping envvar-loading macros parse the provided string with some bittwiddling & store it in GLRO(dl_osversion).

getcwd wraps “getcwd” syscall with validation.

sethostid writes the given 32bit ID number to /etc/hostid. gethostid reads that number from that file with a fallback to gethostname with repeated gethostbyname_r calls given an exponentially-growing buffer.
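That exponentially-growing buffer is the standard pattern for the reentrant *_r lookup calls, which report ERANGE when the caller’s scratch buffer is too small. A sketch using getpwuid_r (same shape as the gethostbyname_r loop; lookup_user is my name for the helper):

```c
#include <errno.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Look up a user record, doubling the scratch buffer on ERANGE.
 * On success *bufp holds the strings pw's fields point into;
 * the caller must free it. */
static int lookup_user(uid_t uid, struct passwd *pw, char **bufp) {
  size_t len = 64;
  for (;;) {
    char *buf = realloc(*bufp, len);
    if (buf == NULL)
      return -1;
    *bufp = buf;
    struct passwd *result;
    int err = getpwuid_r(uid, pw, buf, len, &result);
    if (err == ERANGE) {  /* buffer too small: grow & retry */
      len *= 2;
      continue;
    }
    if (err != 0 || result == NULL)
      return -1;          /* real error, or no such user */
    return 0;
  }
}
```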

There’s adjtime[64] wrapping clock_adjtime64 with format & unit conversion.

dl_discover_osversion may consult GLRO(dl_sysinfo_map) looking for the Linux version number. Or it’ll use uname. Or /proc/sys/kernel/osrelease. Which requires parsing the integers.

getsourcefilter stack-or-heap allocates a struct group_filter, iterates over a table to retrieve the “socket level” from the given “address family” whilst validating address bytelength, & defers to getsockopt(MCAST_MSFILTER) on that socket level.

There’s an internal function for validating responses from AF_NETLINK sockets. With a call to getsockname & fcntl(F_GETFL), error code validation, bytesize validation, & error messages.

wait4_time64 wraps the “wait4” or “waitid” syscalls possibly with some bittwiddling format conversion & NULL checks.

timer_delete wraps the "timer_delete" syscall with input conversion & a mutex-locked linked-list. timer_create has more input conversion and starts a helper thread with its data under a once flag before recalling the syscall.

tcsendbreak wraps “ioctl(TCSBRK)” if input < 0 or, with unit conversion, “ioctl(TCSBRKP)” if supported. Otherwise errors.

tcgetattr wraps “ioctl(TCGETS)”, if successful, with output conversion via bitmasks & memcpys.

cfget(i/o)speed wraps struct termios properties with bitmasks. cfset(o/i)speed does same with some additional validation.

sigaction wraps "rt_sigaction" syscall with, if valid, conversion between given struct sigactions & a struct kernel_sigaction.

getentropy wraps repeated “getrandom” syscalls until the provided buffer is filled.
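A sketch of that retry loop (Linux-only, glibc ≥ 2.25 for &lt;sys/random.h&gt;; my_getentropy is a hypothetical name):

```c
#include <errno.h>
#include <sys/random.h>
#include <sys/types.h>

/* Fill the buffer with random bytes: getrandom may return fewer
 * bytes than requested or be interrupted by a signal, so keep
 * calling until the whole buffer is full. */
static int my_getentropy(void *buffer, size_t length) {
  char *p = buffer;
  while (length > 0) {
    ssize_t n = getrandom(p, length, 0);
    if (n < 0) {
      if (errno == EINTR)
        continue;  /* interrupted: just retry */
      return -1;
    }
    p += n;
    length -= (size_t)n;
  }
  return 0;
}
```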

spawni[x] wraps numerous syscalls to configure the child process environment (including mmaping the stack) & call the configured exec syscall.

There’s conversion functions between struct __msqid64_ds & struct kernel_msqid64_ds involving bitshifts. msgctl_syscall wraps “msgctl” or "ipc(IPCOP_msgctl)", which msgctl64 wraps with format conversion under different conditions. There’s conversion functions between struct __shmid64_ds & struct kernel_shmid64_ds. shmctl_syscall wraps “shmctl” or "ipc(IPCOP_shmctl)", which shmctl64 wraps with format conversion under different conditions. Same situation with struct __shmid64_ds & struct shmid_ds. And between struct __semid64_ds & struct kernel_semid64_ds, as well as struct __semid64_ds & struct semid_ds. As per struct __msqid64_ds & struct msqid_ds.

getdents wraps “getdents64” syscall with repeated response parsing involving memmoves & lseek64s upon error.

create_thread wraps "sched_setaffinity", “tgkill”, and/or “sched_setscheduler” under different conditions with locking, input validation, & input reformatting.

mq_notify conditionally wraps a corresponding syscall, possibly with an atomically initialized helper pthread with signals blocked to manage an AF_NETLINK socket running callbacks in new pthreads upon each message.

getifaddrs wraps a PF_NETLINK socket with a couple of requests, response parsing, & subsequent multipass multi-conditional reformatting & validation. Includes corresponding routines to free any memory this allocated.

Linux exposes some of its capabilities to userspace as virtual filesystems. GNU LibC’s Linux port may consult some of these, and that’s what I’ll be studying today!

check_may_shrink_heap consults /proc/sys/overcommit_memory for the text “2” caching the result.

getloadavg consults /proc/loadavg parsing at most 3 doubles from its text.
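Since /proc/loadavg is just a line of text like “0.42 0.37 0.30 1/123 4567”, that parse can be sketched as below (Linux-only; my_getloadavg is my name, not glibc’s):

```c
#include <stdio.h>

/* Parse up to nelem (at most 3) load averages off the front of
 * /proc/loadavg; returns how many were retrieved, or -1. */
static int my_getloadavg(double loadavg[], int nelem) {
  FILE *f = fopen("/proc/loadavg", "r");
  if (f == NULL)
    return -1;
  if (nelem > 3)
    nelem = 3;
  int i = 0;
  while (i < nelem && fscanf(f, "%lf", &loadavg[i]) == 1)
    i++;
  fclose(f);
  return i;
}
```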

gen_tempfd opens the tempdir consulting a global for the filepath falling back to /tmp.

get_login_r consults /proc/self/loginuid to get a user ID to pass through getpwuid_r (given an exponentially growing buffer), validating both responses. Falling back to getlogin_r_fd0.

fips_enabled_p consults /proc/sys/crypto/fips_enabled parsing the integer & caching whether the response is true, false, or fail.

fexecve, when the execveat syscall is unavailable, falls back to execveing the appropriate file in /proc/self/fd/. Erroring with ENOSYS if that dir’s unavailable.

dl_get_origin wraps the /proc/self/exe symlink with output validation & a fallback to GLRO(dl_origin_path).

fchmodat, wherever its syscall is unavailable, falls back to fstatat64ing the given filepath opened to the given filedescriptor number, validating the response, & chmods the corresponding symlink within /proc/self/fd/.

sysconf(SC_MONOTONIC_CLOCK/SC_[THREAD_]CPUTIME) returns the POSIX_VERSION constant. sysconf(SC_ARG_MAX) wraps getrlimit & MAX. sysconf(SC_NGROUPS_MAX) parses the textual integer from /proc/sys/kernel/ngroups_max. And sysconf(SC_SIGQUEUE_MAX) parses the textual integer from /proc/sys/kernel/rt-sigmax.

readonly_area parses the dash/space-separated table of textual integers at /proc/self/maps.

opensock checks whether /proc/net is available, followed by a table of /proc/ subpaths for each protocol family, opening a socket for the first protocol family whose corresponding subfilepath is available & caching which that was.

ttyname[_r] validates/fstat64s its given filedescriptor, reading the corresponding symlink within /proc/self/fd/ to get the actual filepath to validate. Failing that it locates the corresponding file within /dev/pts/ or /dev/.

shm_directory checks /dev/shm/ is available before consulting /proc/mounts to locate an appropriate “tmpfs” or “shm” mountpoint. The successfully-located directory is cached & returned, with that initialization happening only atomically-once.

get_nprocs_conf counts the /sys/devices/system/cpu/cpu* pseudofiles, falling back to parsing /proc/cpuinfo, compiletime-optionally checking/parsing /sys/devices/system/cpu/online & /proc/stat before /proc/cpuinfo.

pathconf statfss the given filepath, reformatting its f_type differently given PC_LINK_MAX, PC_FILESIZEBITS, or PC_2_SYMLINKS. Given PC_CHOWN_RESTRICTED it just validates whether that statfs was successful. Otherwise it falls back to posix_pathconf.

Here I’ll discuss GNU LibC’s Linux port’s datafiles. And their generators.

There’s a dataheader of _STAT_VER_... & _MKNOD_VER_... enum-macros. There’s a dataheader of standard filepaths used internally to this port. There’s a dataheader of magic identifiers for different filesystems. There’s a dataheader for CPUCLOCK magic numbers & bittwiddling. There’s a dataheader of buildflag macros. There’s a dataheader for an IPCOP_... macro-enum. There’s a dataheader of version numbers for /dev/{null,full,tty}. A dataheader stores the /proc/self/fd/ directory path. Another defines the SCSI_IOCTL_... macro-enum.

There’s several TSV datafiles in the same format as for UNIX more generally, I’ll see how many I can list in the next toot…

For 64bit wordsizes:

For 32bit wordsizes: prlimit64 & fanotify_mark.

And also in a separate file (I don’t understand what the “generic” subdirectory’s about) there’s:

A separate datafile lists all the names newline-separated from these files.

Then there’s scripts to help generate these datafiles from Linux’s build files!

There’s a Sed script to insert additional code into the linker’s shellscript. There’s a localized C script which checks the given file for some magic numbers & execs it with some environment variables. There’s a Python script which parses the Linux header files (via GCC subprocesses & regexps) to define macros specifying the ID number for each syscall. There’s an AWK script which wraps the __NR_... macros with SYS_... macros if you’re on a kernel version which supports it. And there’s another AWK script for stripping off the __NR_...-prefix.

While GNU LibC’s Linux port doesn’t link directly to Linux’s headerfiles, there is a small part of Linux’s (and the internet’s) extensive ABI which is relevant to it. So GNU LibC redeclares these itself as C macros & structs.

There’s a headerfile which declares the structure, including bitflags, of the arguments to pass to the epoll & inotify syscalls. Another defines enums & (largely-corresponding) macros defining ints passed into the mount, prctl, personality (which has two, one of which represents bitflags), ptrace, & (bitvalues look interesting to human debuggers) reboot syscalls. There’s macros representing various control characters.

adjtimex takes a struct holding a struct timeval & several long ints, half of which are padding, whilst returning a value from a macro-enum. signalfd’s input struct holds numerous ints signed or unsigned of various bitwidths (usually 32bits).

Config request IOCTLs take a struct of 3 64bit unsigned ints (O.K., the first int’s less constrained) representing a version number. The quotactl syscall takes structs of 32bit, & several 64bit, unsigned integers which have corresponding bitpacking macros.

Additional Linux-specific structs for ELF are defined, namely: struct prstatus/prstatus_t, which contains several ints signed or unsigned of different bitwidths, __pid_ts, & struct timevals, whilst prpsinfo_t mostly contains text.

acct takes enums of text, numerous unsigned 16 or 32bit integers, & maybe a float. With a corresponding bitflag enum.

There’s a headerfile defining SCSI control characters alongside a struct to hold 12 of them for different hardware purposes. There’s structs for SCSI communication over IOVecs including sized pointers, request headers, & different request types consisting largely of ints of various bitwidths. Associated macros define what goes into those properties.

Various time-related syscalls take a struct __timex64 or __ntptimeval64 holding various (usually long) ints, plenty of padding, & a struct __timeval64.

GNU LibC defines various routing protocols in the form of C structs. Linux is the one actually doing the routing (even if you don’t personally run Linux, the “router” your ISP gave you probably does), but I suppose the raw packets are exposed to userspace in case they want additional info from it.

Amongst these are: “Rose” for packet radio use, NetROM, arbitrary packets, IUCV (IBM’s Inter-User Communication Vehicle), Internetwork Packet eXchange, IEEE 802.5 Token Rings, Fiber Distributed Data Interface, Ethernet + ARP, EcoNet (from Acorn Computers), AX25 for amateur radio, Ash for wireless network sensors, Point to Point Protocol, ARP for local network address lookup, & more Ethernet.

Also there’s a socket level for Apple Talk declared via macro; a text-containing struct for addressing AF_SOCKET sockets; a struct, with bitflag macro-enums, for sending routing tables to the kernel; & a struct shaperconf holding a 16bit command & a 14char name or 32bit speed.

I’ll focus on Ethernet, Point-to-Point Protocol, & ARP. Apparently IP is too high-level here…

Point-to-Point Protocol declares a struct npioctl holding a protocol (int) & mode (enum). struct ppp_option_data is a sized pointer with an integral transmit field. struct ifppp[c]statsreq combines struct ifreq & struct ppp_[comp_]stats. With macros for what should go in these fields.

ARP headers contain 16bit fields specifying the hardware & protocol address formats, bytes giving their lengths, & a 16bit opcode with a macro-enum. Higher-level structs package the addresses into struct sockaddrs.

Ethernet addresses contain 6 bytes. Ethernet headers contain two of those plus a 16bit macro-enum “type”. Plus some pointer-arithmetic macros.

Linux x86_64 Port

Some parts of the kernel ports, including Linux’s, are CPU-specific. Especially regarding how systemcalls are triggered in the kernel. I’m specifically focusing on x86_64 CPUs here since that’s what’s common & what I use.

Starting with simpler macros & functions…

internal_syscall1 loads the argument into the “rdi” register before deferring to the syscall opcode. There’s ZERO_EXTEND_... macros which mov from a smaller register to a larger register. ARGIFY performs type coercion if necessary. The SYSCALL_ERROR_HANDLER macro saves the error code aside (via another macro, SYSCALL_SET_ERRNO) & returns -1. lseek* defaults to noops. In preparation for mmap syscalls the MMAP_PREPARE macro will first attempt to use the 32bit syscall if the CPU indicates it prefers that; this can be overridden via an environment variable (which is not inherited by setuid programs) updating the GLRO global. There’s timer_settime/settime syscall wrappers with input format conversion. There’s optional legacy compatibility for timer_getoverrun/delete, or with atomic postprocessing timer_create. There’s a syscallpath which negates the “RAX_LP” register. There’s a syscall Assembly function which wraps the syscall opcode with movs of the arguments + syscall number into the appropriate registers & an error path. sigcontext_get_pc retrieves the saved “RIP” register out of the given ucontext_t*.

The vfork syscall pops its return address into the “RDI” register (that’s what’s available per calling convention) whilst running the syscall, then uses xorl, rdsspq, & testq to determine whether the “shadowstack” is being used & thus what to return. There’s macros for prepending the __NR_ macroprefix to syscall names, & others wrapping other syscalls with various fallback codepaths. There’s different macros wrapping the syscall opcode for different numbers of args. There’s a macro wrapping the restore[2] syscall with a struct (additional macros mutating its fields) & trailing magic numbers. The arch_prctl syscall is wrapped with minor pre/post-processing for the ARCH_GET_FS/GS codes.

For more complex functions…

clone wraps that syscall with (in Assembly) validation, argument gathering, and a “thread_start” codepath to call the given function & exit. getcontext wraps its syscall with lots of movs & a compile/run-time optional codepath for the presence of a “shadow stack” including its own syscall. There’s longjump validation code with both userspace & kernelspace checks, possibly aborting with a “longjmp causes uninitialized stack frame” message.

setcontext & swapcontext wrap the rt_sigprocmask syscall followed by loading various pieces of CPUstate. Then there’s a loop looking for a “restore token” to load in “shadow stack” state. Program Counter comes last.

makecontext configures the ucontext_t* CPUstate to call the given function pointer with given args upon swapping to it. makecontext calls down into a __push___start_context Assembly function which wraps the arch_prctl syscall & mutates the stack to push the __start_context function onto it. So that when makecontext’s callback returns it’ll call exit, possibly preceded by setcontext.

There’s a C function for serializing CPU state to semi-human-readable text, including hexcodes via itoa_word. The add_system_dir macro appends one of “/lib[64,x32]” to the given library path.

Looking at the datafiles…

There’s a buildfile for generating macros to access the ucontext_t fields. Another’s for stack_t fields. There’s a Sed script for rewriting a linker shellscript to correct the LD_LIBRARY_VERSION & RTLDLIST environment variables. Another dataheader lists filepaths to ELF linkers & (which all this is the source code to!) libm/libc shared libraries. There’s macros for the bytesizes of saved CPUstate & the dynamic loader cache. Other datafiles list syscalls like arch_prctl, modify_ldt, syscall_clock_gettime, bind, get(peer,sock)name, (g,s)etsockopt, listen, shutdown, & socket[pair]. There’s 2 headerfiles declaring syscall IDs. The ID for the set_thread_area syscall is declared separately.