GNU LibC, following the POSIX standards, are the most fundamental libraries on most UNIX-like systems today. Allowing for I/O, memory allocation, filesystem manipulation, etc. Put simply, it’s primary job is to abstract away the datastructures exposed by both Linux, central config files, GCC, & the underlying CPU.
Cross-platform modules
GNU LibC provides several libraries which (building upon it’s others) are platform-independant. Mostly these serves to throttle calls into the operating system’s “kernel” e.g. Linux.
<assert.h>
is largely implemented in the C preprocessor compiletime depending on the__cplusplus
&NDEBUG
macros. Optionally upon runtime failure of the caller-specified condition calls a function to output debugging info sourced from that preprocessor taking great care not to trigger further errors. Including by using rawmmap
for memory allocation.- I’ll probably get back to the full implementation later, but this internationalization library (
<catgets.h>
) parses/evaluates some environment variables (includingLANG
&NLSPATH
) to determine which binary file it should mmap for a hashmap of translated strings taking care to get the correct text encoding. A supporting utility script can parse a relevant subset of C to extract calls to_()
& output this binary format. csu
provides code for allocating (viasbrk
) some memory for thread-local storage (taking extra care regarding alignment & initial values) if not allocated already.- More broadly,
csu
implements a __start routine which initializes GNU LibC, the memory layout (partially CPU-specific) & any C++/ELF initializers/finalizers before calling main() (taking care regarding threading) & exit()ing. Some details in ELF lib. ctype
classifies & converts “character types” via lookup tables (provided said input’s in-range), some of which are optionally locale-specific.debug
adds ELF initializers/finalizers to open a file specified byPCPROFILE_OUTPUT
(prepending byteorder & pointer sizes), to which it’ll write a reformatted callstack (sourced from another lib) upon segfault, etc signals. A seperate program reformats that binary data into human readable text, or you could use GDB.dl
(dynamic libraries) appears to dispatch calls elsewhere once arguments are validated & synchronized. Which is a little tricky to do before having implemented dynamic libraries! I’ll defer describing the libraries it dispatches to.gshadow
&nss
libraries serializes & (preprocess-aided) deserializes user account, etc info from coordinated binary-searched in-memory storage, /etc files, & dynamically-loaded libraries.nss
provides some commands to create & query these files.pwd
reads & writes the passwords in these useraccount files.iconv
dispatches to statically & dynamically loaded modules to convert between character encodings (source encoding taken from locale where unspecified), with caching & a commandline program operating on files one chunk at a time. I’ll defer describing these modules.intl
exposes more of the `gettext internationalization library, managing globals, determining which plural form to use via interpreting exprs, binary searching aliases, better determines which locale to use, etc.locale
parses files specified byLC_ALL
,LANG
, etc environment variables (once validated, cached in a hashmap) to extract formatting parameters used accross GNU LibC & elsewhere. Incorporating logic for evaluating locale rankings; OS-specific logic for extracting additional locale-dependant parameters & setting the locale global; & some nontrivial commands for writing these locale files.sunrpc
provides multiple, buffered backends for encoding binary data over various IPC channels & multiple backends (e.g. DES hashing) for authenticating into them. The public API allows for (de)serializing aggregate types & fully abstracts away the underlying IPC channels.
Commandline Flags (argp
)
<argp.h>
aids parsing commandline flags, since they’re not parsed when parsed into the program. And it can output help info for it’s configuration.
To do the parsing, after initialization argp_parse
repeatedly calls “posix” APIs wrapped in state tracking & caller-specified parsing callbacks. & a finalizer.
That’s possibly preceded by reformatting the list of arguments to parse into the string-based syntax and/or adding some of it’s own options for version or help.
There’s a suite of functions for outputting errors/warnings related to commandline flags, including reformatting the list of options & some callbacks into localized & user-friendly help text via an array of “hols”. These “hols” are a flattened array of “entries” & “clusters” referencing the original data.
In converting from “hols” to textual output it uses gettext (ofcourse, described later) & a suite of buffered textstream formatting functions aiding indentations & textwrap.
Cryptography (crypt
)
GNU LibC includes a basic crypto library builtin, which’ll be the last one I discuss today. Crypto is built on a concept of “trapdoor” mathematical functions, so I will describe the different standard algorithms it supports in terms of their underlying trapdoor function.
- SHA512 appears to largely rely on XOR + shift as a trapdoor function, with a couple other bitwise ops also involved. The results of 80 applications of that on each 128 bytes of input are summed into 8 output numbers.
- SHA256 works very similarly except operating on chunks of 64 bytes at a time & using smaller bitsizes.
- MD5 (outdated, do not use) looks a little more convoluted, involving various bitwise ops & summing, with several variations of it’s core loop run in the outer pipeline.
- Crypt starts by initializing a few lookup tables, to obfuscate shifted values which are bitwise OR’d together.
- A somehow related hashing uses lookup tables + SHIFT, XOR’d together.
Memory Allocation (malloc
)
Memory allocation is a central construct in computer programming upon which basically everything else relies. Most languages make it an implicit aspect of their syntactic constructs, making it part of their “runtime system”. C doesn’t have a runtime system however, and Assembly doesn’t implement it. So it ends up being GNU LibC’s job.
First off, GCC provides debugging hooks incorporated into the allocators. These may fprintf()
relevant data, sourced inpart from the dynamic loader, and call a function for GCC to break on. Or they may seek to detect multithreading conflicts.
And there’s atomic profiling functions which can optionally wrap all the memory allocators, outputting this data to a file specified by the MEMUSAGE_OUTPUT environment variable.
There’s an internal allocator for fixed-size arrays which specifically checks for integer overflow resulting from multiplication.
GNU LibC provides “obstacks” which implements a stack of arbitrary-sized values, using a merge between arrays (implemented in the C preprocessor thus gauranteeing inlining) and links. Calling a stored callback to allocate or free the next chunk. Making sure to maintain memory alignment.
There’s also code for (re)allocating dynamically-resized arrays, which tracks it’s size in a header struct.
And finally, if you have an allocation pattern where everything’s freed simultaneously you can go without a lot of the bookkeeping required by GCC’s main allocator. In the common case allocating by simply atomically incrementing an appropriate in-memory pointer, making it very fast! Yes, care is taken to make this thread-safe. Uses mmap() directly for an underlying allocator.
Fully Flexible Allocation
Last night I described several auxialary memory allocators GNU LibC provides, catering to certain usecases it can better optimize for. Usually when projects reimplement their own memory allocator it’s because they fall into one of these cases (or because they’re implementing a programming language), though not always e.g. GLib’s “slice” allocator.
But catering for more general usecases requires more bookkeeping, and balancing between time, space, portability.
The central construct in this generic memory allocator are cyclical doubly linked lists for a range of sizes, formed out of the unused memory. The data in this linked list are the nodes themselves, which can be removed for use by your program until free() reinserts it into the appropriate linked list.
Between every two chunks in memory is a header specifying the size of the chunks on either side, and indicating if the lower-side is free, to aid defragmentation & realloc().
If malloc()
requires more memory to allocate into, or if you request a very large ammount of data from it, GNU LibC will carefully call mmap() with profiling & optional validation. Thus handing the job off to the kernel. The initial bytes of this will be used to store the head of additional linked lists. Yes munmap() & mremap() are also used, as well as brk().
To delay the cost of freeing memory (including malloc() excess) a special linked list is kept with a corresponding flag.
The allocator is thread-protected by a mutex, and compiletime-optionally offers per-thread recently-freed lists. Which malloc() would check first.
The main logic of malloc() then involves computing which bucket to check, checking early whether it needs to mmap() more memory from the kernel, & iterates over the buckets large enough for it’s needs to see who has any memory it can remove from the linked lists for caller-use. Possibly stealing everything else in that list for use by it’s thread.
There’s two types of bins malloc() might check like that “fastbins” and, for smaller sizes, “normal bins”. Otherwise it needs to “consolidates”.
Then it compiletime-optionally the local thread allocator & iterates over the recently freed list, validating and sorting the entries as it looks for an appropriately-sized entry.
Failing all that it iterates over the chunks from the fastbins to find the one with the minimum excess space, or checks remaining bins.
Then cleans up & returns a pointer.
Freeing memory meanwhile involves compiletime-optionally checking if the local thread allocator hasn’t overflowed, then considers whether to insert into a fastbin, the recently freed list, or hand off to the kernel.
Consolidation involves checking every chunk in a fastbin to see if it’s now contiguous with free()’d memory & if so moving it into the recently-freed (or “unsorted”) list. So that the actual consolidation can be done lazily.
realloc() checks if the next contiguous chunk is free before falling back to normal malloc()/free()/memcpy() routines. mtrim() does the reverse, freeing segments of allocated memory.
malloc_usable_size()
checks the between chunk readers to return how much RAM was allocated. Then there’s functions for returning gathered statistics, possibly outputting it to stderr!
Or you could alter the magic numbers used by this allocator.
Remote User Accounts (nis
)
Having skimmed nis
, I think I’ll skip over describing it in detail. NIS (formally known as Yellow Pages after the phonebook) is a legacy authentication protocol. It’s useful in a business or cloud setting for synchronizing UNIX user accounts between devices, but it has since been superceded by LDAP & I don’t know why GNU LibC still supports it.
I would consider this feature, amongst similar ones in the Linux kernel, spyware wherever any distro auto-enables it. And it’s near-certainly irrelevant to the home computing uses I’m interested in.
Hence why, in addition to it being superceded by LDAP, I’m not interested in digging into GNU LibC’s NIS support beyond briefly describing what it is.
PThreads (nptl
)
The UNIX fork() syscall already offers asynchronisity, so pthread’s job is to reintroduce some useful synchronization. Including offering the caller useful datastructures to invoke their own synchronization.
A callback is called by the platform-specific code to run other appropriate caller-registered callbacks “on fork”.
Thread-local variables are stored in an array, or overflow hashmap, within the current thread’s struct pthread
. How it gets that containing thread-local variable representing the current thread is platform specific, so I’m just kicking the can down the road. Ultimately there needs to be CPU support.
pthread_setattr_default_np
sets some validated global variables as requested.
PThread provides functions (with validation) wrapping __sched_getparam
, etc to set thread priority in Linux.
Threads can have a stack of callbacks registered to be called upon errors before jumping to an unwind_stop
caller-specified callback.
“Semaphores” wraps the CPU’s support for “atomic values” & the kernel’s support for “futexes”, which incidentally in turn wraps atomic values (again!) & the process scheduler. Calls into the kernel are slow so semaphores first updates it’s own atomic value, whilst adding error handling & stuffing an extra count into the same atomic value.
A read/write lock is a variant of semaphores which has two atomic counters, enforcing the invariant that no single writer may lock the data whilst there’s others reading it. GNU LibC stuffs a couple bitflags into the read counter.
(Atomics are read/write operations which cannot be interrupted, full details later)
Once variables uses an atomic & futex to ensure a given callback is only ever called once. Atomically flagging whether it has been, or is being, run.
PThreads includes a complex mutex representation counting (non-atomically) the number of users, distinguishing different subtypes, allowing the same thread to “recursively” lock & unlock the mutex whilst it already has it locked via a seperate count, track owners for “adaptive” locks, etc.
WebKit has written they don’t like this overhead, so they wrote their own.
There’s also a lock() variant which specifies how long to wait before timeout.
A barrier allows a configurable number of threads into the critical section, and once they’ve all finished it lets any further threads try to enter the critical section again, using 2 atomic counters (for threads in & threads out) & a futex.
When various OS signals are received those stacked callbacks mentioned earlier are called, protected by an internal lock. Seperate functions (protected by their own internal locks) allows you to send one of those OS signals to a pthread.
A “condition” is amongst the primitive synchronization construct: threads block on a condition until it’s triggered. GNU LibC uses futexes (yet again) to power it’s conditions, swapping between two of them as determined by atomic use-counters ensuring it can always be triggered. Ultimately the kernel’s scheduler has to be the one implementing atomic conditions.
Upon the SIGSETXID syscall pthreads will call that syscall() on themselves again with an internal lock.
Finally there’s some decent initialization logic to run on program startup & a cache of allocated memory for struct pthread
s.
TODO What’s nptl_db
tracking?
Useraccounts (nscd
)
In most computing setups, you want to limit who can use the computer. This is what useraccounts are for, and is implemented by either nss
or nscd
.
Central functions like __nscd_getgrouplist
first consults a pthread-safe profiled hashmap cache for data on a usergroup, etc before requesting said data via a UNIX socket from a local SELinux-secured daemon centrally caching this data from /etc/* in RAM.
The central daemon parses those “passwd”, “group”, “hosts”, “services”, “netgroup” using the same caching, etc TSV files from /etc/, serves those files or parsed data to requesting clients, & aggregates stats for logging. The daemon is started as-needed & secured via SELinux if possible. It takes great care to cleanup any mess in it’s local environment to avoid security vulnerabilities.
Miscallaneous Utils (stdlib
)
Going over the non-trivial cross-platform utility functions in stdlib
:
__xpg_basename
returns the string slice from the last “/” in the given filepath, unless that “/” was the last character.mpn_sub_n
performs long subtraction on arbitrary-precision numbers (mp_ptr
type), akin formpn_add_n
.mpn_submul_1
multiplies each word of an arbitrary-precision number by a fixed precision number, subtracting a given size.STRTOD_NAN
determines the appropriate NaN error for a given string.STRFROM
parses the optional integral precision (following “%.”) from a format string & wraps a lower-level__printf_fp_l
or__printf_fphex
function based on the following character in the format string.mpn_rhsift
shifts the bits between the bytes of a multi-precision number.rpmatch
extracts regular expressions from the configured locale to determine whether you’ve answered “yes” or “no” in your language & evaluates said regexps.rand_r
is a naive psuedorandom number generator where the given seed is multiplied by 1103515245 & added to 12345 thrice, shifted left by 10 the last 2 times. Each time this updated seed is divided by 65536 & remaindered by 2048 or 1024, these values are xor’d together to produce the resulting random number.mpn_mul_1
multiplies a fixed-precision number by each byte of a multiprecision number, with carries.- There’s several wrappers around the external library “gconv”.
_quicksort
implements one of several algorithms for rearranging an array into sorted order. QuickSort takes two slices of the given array seperated by a “pivot” value. All values less then the pivot are moved into the left array & vice versa, before recursing over both those slices. This implementation uses a caller-provided comparison fuction, an explicit stack instead of recursion, & chooses a median value between three numbers in each slice for the pivot.mpn_lshift
is akin tompn_rshift
.llabs
,labs
, etc are literally implemented asreturn i < 0 ? -i : i;
.mpn_mod_1
wraps the arbitrary-precision division functions with special remainder optimizations I don’t comprehend.mpn_mul_n
uses the other arbitrary-precision arithmatic functions I described & the Karatsuba Square to multiply two arbitrary-precision numbers.l64a
uses a lookup table to convert each 7 bits of it’s 32bit input number to string characters.getsubopt
splits an input string by “,” & returns the index of the entry which begins one of the given keywords followed by “=”.__correctly_grouped_prefix
wc/mb validates the syntax of the given string.getenv
iterates over the global__environ
array looking for the entry which starts with the given name & an “=” returning the trailing text.__drand48_iterate
incorporates some extra shifts & a multiply-add into the pseudorandom number generator.- Before yielding to the syscall,
exit
carefully iterates over a number of registered callbacks to call. div
returns a structure containing the results of both division & remainder between two numbers. GCC would merge these!mpn_cmp
compares two arbitrary precision numbers by finding the first differing digit (if any) between two arbitrary-precision numbers &.comparing those.- There’s a bit of C++ runtime here.
mpn_divmod_1
first considers whether converting into a multiplication via dividing 1 by the input would be faster. Not comprehending beyond that. I’m not clear howmpn_divrem
works.a64l
converts a string back into a 32bit by converting each character via a lookup table into each 7bits.abort
reconfigures, triggers, resets, & retrigger theSIGABRT
signal, then attempts to use platform-specific abortion code before resorting to callingexit
.- There’s a few more functions for processing filepaths, including one which converts a relative filepath to an absolute filepath & normlizes it.
Also there’s several functions which may be overwritten by platform-specific code, including:
__srand48_r
for”seeding” the pseudorandom number generator.strfmon_l
for merging the configured locale (including monetary options) into a given format string.- Leniently parsing strings to various numeric types. Involves multiply-add by 10 for decimal to binary conversion. Floating points require more sophistication!
- Various supporting floating point operations including rounding (involves deconstructing the CPU’s datastructure) & error reporting.
- Registering exit handlers.
- More sophisticated random number generator.
- Parsing longs source some parameters from the configured locale.
- Merge Sorting (the algorithm underlying
qsort
) an array, by splitting the input array into two slices, recursing over both slices, & iterating over both results to insert out-of-order entries where they belong. Requires additional temp memory. - Formatting logging messages to be sent to stdout or syslog, with environment variable configuration.
- Formatting & appending a new entry to the environment variables global. This has platform independant wrapper ensuring memory is allocated correctly.
lcong48_r
saves given parameters into the pseudorandom number generator config.
Platform-dependant APIs with stubs dumped in “stdlib” include:
system
swapcontext
setcontext
secure_getenv
makecontext
getrandom
getentropy
getcontext
And there’s a linter script here too…
Highlevel (Textual) I/O (“libio”)
“libio” makes the crucial task of inputting & outputting (primarily textual) information more efficient & convenient.
“libio” includes several backends including various wrapping provided memory, but mostly it wraps a kernel “file descriptor” from “io”. This file descriptor backend will be discussed through the rest of this thread, & includes a buffer & text decoding.
Additional state (including the seeked offset in the buffer) is tracked in-process on open files by the “io” GNU LibC sublibrary beyond the file descriptor & the buffer for reducing how many syscalls are needed against it, some of which is parsed from the caller-specified “mode”.
The buffer is implemented as a ringbuffer, with seperate read & write indices into the buffer chasing each other. Once they catch up, or upon certain seeks, up the file is “flushed”. The text decoding dispatches to a different backend provided by the “iconv” sublibrary, applied in-place within the ringbuffer.
There’s an API for opening a process & returning I/O streams for communicating with it. The public APIs in addition to calling the methods on the backend, or initializing the appropriate one, also updates the in-memory state for the file.
Evaluating formatting strings is handed off to the “stdio-common” sublibrary I’ll cover later.
There’s several stubs it experts entirely to be overwritten by platform-specific code, including:
putc_unlocked
& maybeputc
,putchar
Also overridable, but with platform-indendant defaults:
fputs_unlocked
&fputs
fread
&fread_unlocked
ftell
gets
&puts
setbuffer
&setvbuf
vdprintf
,vsnprintf
, &vswprintf
open_memstream
fopen64
fgets
fflush_unlocked
&fflush
getwc_unlocked
&getwc
getc_unlocked
&getc
getchar
ftello64
&ftello
fseeko64
&fseeko
fileno
ferror_unlocked
feof_unlocked
&feof
fcloseall
In most cases where there’s locked & _unlocked
variants of a function, one or the other needs to be overwritten with platform specific code.
Semi Platform-Dependant
Sublibraries in this section are largely platform-dependant, but includes some code which is platform independant. Though, due in part to incomplete knowledge when I classified GNU LibC’s sublibraries, the difference is from the platform independant sublibraries described above is a little fuzzy.
- “dirent” would be fully platform-dependant if it weren’t for a function which applies filter & sort operations upon the
readdir()
syscall/iterator, using caller-specified callbacks. Platform-dependant:seekdir
,telldir
,rewinddir
,readdir[64][_r]
,opendir
,getdirentries[64]
,fdopendir
,dirfd
, &closedir
. - “gmon” sets a timer to periodically increment the entries for which functions are currently running during that SIGALRM, double checking it’s not called recursively. Another function determines the finest tick frequency
setitimer
can schedule. A public initializes, configures, & serializes this sytem. Platform-dependant:[s]profil
- “grp” provides several (de)serializers wrapping the
setgroups
platform-dependant function, & other sublibraries.
InterNETworking (“inet”)
“inet” allowing you to open communication channels to computers elsewhere on the InterNET. The main platform-independant task of which is to resolve the destination, whether it’s textually identified by IPv4, IPv6, hostname, or DNS addresses.
inet_ntoa
applies the format string “%d.%d.%d.%d” to the given parsed IPv4 address. inet_netof
& inet_lnaof
are hueristics for extracting the mask from a sole IPv4 address.
inet_network
parses IPv4 addresses into 4 bytes, with options for indicating numbers are written in base 8, 10, or 16.
ether_ntohost
& ether_hostton
wraps the “nss” sublibrary to resolve hosts identified by hostnames.
ether_ntoa_r
applies the format string “%x:%x:%x:%x:%x:%x” to the given parsed IPv6 address.
ether_line
& ether_aton_r
parses an IPv6 address, only supporting base 16 numbers.
rexec[_af]
(utilizing ruserpass
) & the superceding rcmd
implements protocols for running shell commands on a remote machine. ruserpass
consolts common (non-specialized) environment variables for authentication parameters.
There’s functions, with a preceding __idna_name_classify_check
, for resolving the IDNA encoding of domain names written in the broader Unicode character set, via a dynamically loaded library.
__check_pf
indicates whether a IPv4 and/or IPv6 address was parsed.
There’s functions converting connection “deadlines” to/from other formats used elsewhere.
There’s functions for reading & writing (as per the IETF spec) the options in the extended IPv6 header provided from the kernel via struct cmsg
. Including routing information for e.g. ping/traceroute.
bindresvport
iterates over privileged ports to bind
a socket on an open one.
To aid network management “nss” & “nscp” sublibraries are wrapped to implement “net groups”.
And finally getnameinfo
, after some validation, recomputes the hostname the given network address was parsed from, branching upon the socket type & flags. As well as the TCP or UDP port number. Ultimately calling the other functions I have or will described. The localhost address is cached, protected by a lock & double-checked first
flag.
__inet6_scopeid_pton
retrieves & serializes network interface indices.
As always I’m skipping over minor functions.
inet_makeaddr
merges a network address & host address into the space of a single structure, determining how far to shift the network address based on which “class” it is in before OR’ing the two together. This code can be overriden by platform specific code.
Fully platform-specific stubs include:
setsourcefilter
&setipv4sourcefilter
if_nametoindex
,if_indextoname
,if_freenameindex
, &if_nameindex
getifaddrs
&freeifaddrs
Input/Output (“io”)
One of the most important (and challenging, but not at this layer I’m describing today) things an operating system’s APIs implements are input & output. Important in the sense of “if a tree falls in the forest and noone’s around to see it, does it make a sound?”
Most languages including C expose I/O to programs via APIs for them to call, which ultimately needs to wrap language bindings. GNU LibC’s “io” sublibrary is almost entirely language bindings.
Starting with the few platform-independant functions in GNU LibC’s “io” sublibrary, ppoll
wraps poll
. Adapting the format for timeouts & temporarily setting a sigprocmask
. statx*
wraps fstatat64
with alternate interfaces.
Internal functions for determines whether a file has changed based on the results of stat64
.
And there’s a fairly sophisticated depth-first iterator over all descendant files of a directory, with binary-searched caching, filepath normalization, symlinks, etc.
GNULibC’s “io” sublibrary has a few functions which may be overriden by platform-specific code but don’t have to be:
stat64(file, buf)
= fstatat64(AT_FDCWD, file, buf, 0)
.
stat(file, buf)
= fstatat(AT_FDCWD, file, buf, 0)
.
lstat(file, buf)
= fstatat(AT_FDCWD, file, buf, AT_SYMLINK_NOFOLLOW)
.
lstatat64(file, buf)
= fstatat64(AT_FDCWD, file, buf, AT_SYMLINK_NOFOLLOW)
And lockf[64]
is adapted to calls to fcntl
with reformatted arguments.
Platform dependant stubs include:
write
utime[nsat]
unlink[at]
umask
ttyname[_r]
symlink[at]
stat[v]fs[64]
sendfile[64]
rmdir
readlink[at]
read
posix_fallocate[64]
posix_fadvise[64]
poll
pipe[2]
open[at][64]
mknod[at]
mkfifo[at]
mkdir[at]
lseek[64]
link[at]
lchown
isatty
getcwd
futimens
fstat[v]fs[64]
fstatat[64]
flock
fcntl[64]
fchown[at]
fchmod[at]
fchdir
faccessat
e[uid]access
dup3
dup[2]
copy_file_range
close
chown
chmod
chdir
- &
access
LOGging IN (“login”)
I’ve described several platform-independant APIs before for authenticating into a computer, which you do so the kernel can authorize & audit your interactions as per it’s configurations. But how does the kernel know who’s currently logged-in? That’s what “login” does.
For the platform-independant components:
logwtmp
straightforwardly wraps updwtmp
whilst passing the current time from clock_gettime64
.
pts_name
wraps ptsname_r
, doubling the passed buffersize until it’s large enough to fit the response.
logout
reads the current state from the useraccounts logfile, and appends an updated state of DEAD_PROCESS
with a fresh timestamp.
openpty
wraps getpt
, grantpt
, unlockpt
, ioctl(TIOCGPTPEER)
, optionally pts_name
& open
, optionally tcsetattr(TCSAFLUSH)
, optionally ioctl(TIOCSWINSZ)
, & optionally pts_name
to open & configure a new pair of tty file descriptors.
try_file_lock
wraps fcntl64_nocancel
with a noop SIGALRM
temporarily set. Restores existing SIGALRM configuration on exit.
If ioctl(TIOCSCTTY) is unsupported,
login_tty instead closes the old stdin, stdout, stderr, & a reopened device file for the shell. In either case it calls
setsid & configures the
stdin,
stdout, &
stderr` to point to the given shell file descriptor, closing the old file descriptor number.
file_unlock
wraps fcntl64_nocancel(F_SETLKW)
.
tty_name
wraps ttyname_r
like pts_name
does ptsname_r
.
maybe_setutent
opens the authenticated logfile & seeks to the end if it’s not open already. read_last_entry
reads a binary record at a saved offset.
login
retrieves the shell name for stdin, stdout, or stderr, strips if /dev/ if present, and appends it to the logfile.
__libc_getutent_r
ensures the logfile’s open, locks it, & reads the last entry using above functions. internal_getut_nolock
retries upon failure.
__libc_getutline_r
opens the logfile if necessary, locks it, & reads the final entry until successful.
__libc_pututline
opens the logfile if necessary, ensures we’ve got permission to write it erroring out if not, checks whether the new entry matches the last one to determine where to write it, seeks to the new location, writes it, & truncates the file!
updwtmp
opens & locks the logfile, seeks to end, truncates off partial writes, & appends to it.
Amongst other related functions.
Functions with platform-independant fallbacks but can be overriden by platform-specific code:
utmpname
with appropriate locks & closed logfile, validates & stores the given filepath in a global.
updwtmp
fotwards (with filename adjustments) to the internal implementation described above. As does (with a lock & validation) getutline_r
, getutid_r
, setutent
, getutent_r
, pututline
, & endutent
.
getutent
, getutid
, & getutline
allocates global buffers for their namesakes.
Platform-dependant stubs include:
unlockpt
setlogin
ptsname[_r]
grantpt
getpt
posix_openpt
- &
getlogin[_r]
Shell commands are provided for outputting the logfile as text (“utmpdump”) & for configuring access rights to pseudoterminals (“pt_chown”).
“misc”
Next on my list of GNU LibC semi-platform-dependant sublibraries is “misc”, so sounds like no good introduction is possible.
Amongst it’s platform-independant functions are:
lfind
which iterates over an array finding the first element equal to the given value according to a given callback. lsearch
wraps this to append the item to the end of the array if absent.
insque
inserts an element into a doubly-linked list (initial properties defined by struct qelem
). remq
removes it.
getusershell
, endusershell
, &setusershell
consults globals set by initshells
calling it if necessary. initshells
frees the old values, opens & fstat64
s a configuration file, malloc
s the new values, & reads each line into the global array stripping comments, spaces, & non-absolute paths.
getpass
tries to open /dev/tty
falling back to stdin & stderr, ensures it’ll get closed, lock the file, tries disable echoing via tcgetattr
, writes the given prompt, reads a line & restores.
fstab_init
allocates a buffer, & reopens or resets the file descriptor. fstab_fetch
reads an entry into that buffer. fstab_convert
converts from internal to external structs. This all wraps the “mntent” APIs, these several higher level wrappers, & there’s deallocators.
The internal __fd_to_filename
function converts the filedescriptor integer to a string (inline code) & appends to FD_TO_FILENAME_PREFIX
.
dirname
locates the last (non-final) “/” & returns that slice of the string.
The internal __FCVT_R
function validates it’s arguments, extracts the sign bit then takes the absolute value, divides by 10 until we get the desired rounding, calls SNPRINTF
, strips off leading zeros, & adds trailing zeros. The internal __ECVT_R
function wraps __FCVT_R
with additional preprocessing the number being converted to text.
daemon
(necessary for init systems prior to systemd) does a dance to detach this process from it’s parent. By forking, calling setsid
, cd
s to “/”, open
s & fstat64
s “/dev/null”, & configures stdin
, stderr
, & stdout
ot point to it.
is_open
wraps fcntl(F_GETFL)
(or on Windows, _get_osfhandle
).
flush_stdout
wraps fflush(stdout
with validating stdout_fd
is_open
.
print_errno_message
wraps strerror[_r]
& fprintf
to output a given errnum
.
error_tail
wraps vf[x]printf(stderr)
, optionally print_errno_message
, putc
or fxprintf
, fflush
, & optionally exit
.
Functions which may be overriden by platform dependant code, but has platform-independant implementations include:
__error_internal
adds pthread shutdown, stdout flush, outputting the program’s name, & file locking to error_tail
. error
adds variadic arguments.
[__]error_at_line[_internal]
adds pthread shutdown, stdout flushing, file locking, and outputting filename & line number fallingback to program name. The filename & number are presumably provided by the C preprocessor.
Turns out error[_at_line]
is overridable by platform-specific code.
tsearch
, tfind
, tdelete
, twalk[_r]
, & tdestroy
implements a “red-black” balanced binary-search-tree, utilizing internal tdestroy_recurse
, trecurse[_r]
, maybe_split_for_insert
, & check_tree[_recurse]
support functions & a given comparison callback. binary-search-trees stores values less than it’s value under left child, & greater than under right child. A red-black tree ensures all leaves are at roughly the same depth by tracking whether each branch is left-heavy (roughly “red”) or right-heavy (roughly “black”).
[v]syslog[_chk]
validates the passed flags, grabs the syslog lock ensuring it will be unlocked, tests the “priority” against a bitmask, opens a “memstream” to output formatting results to with a fallback for out-of-memory errors, locks the memstream, grabs the current time & formats it, retrieves the current seek offset, outputs the “LogTag” or program name, optionally the process ID in square brackets, optionally a “: “, restore the errno, & output the caller-provided format string, That text optionally gets outputted to stderr if configured, if SIGPIPE is available a signal handler will be configured for it, opens the configured socket with a couple retries if necessary, sends the message to that socket, closing & reopenning open error, & cleans up.
openlog
under the lock (re)opens the socket with a couple retries. closelog[_internal]
under a lock resets some globals & closes the socket.
setlogmask
under a lock sets the LogMask global returnning the old value.
There’s a hashmap implementation maintaining a prime size using linear rehashing.
getttynam
iterates over ttyents
to find the one with the desired name.
getttyent
opens & locks the file, reads each line where they’re not too big whilst skiping comments & preceding whitespace, parses the first found data line utilizing an internal “skip” function to handle quotes & otherwise splitting by whitespace, & returns it.
setttyent
opens the file & configures locking. endttyent
closes it.
getauxval
checks various globals depending on the given “type”.
mk[o]stemp[s]
& mktemp
wraps __gen_tempname
.
setmntent
, endmntent
, getmntent[_r]
, addmntent
, & hasmntopt
opens, closes, reads, & writes a TSV file of mountpoints.
sbrk
wraps brk
to retrieve the old boundary & set a new incremented boundary, with added validation.
Functions with platform-dependant implementations in GNU LibC’s “misc” sublibrary include:
acct
brk
chroot
[f]chflags
[f,l]setxattr
[f,l]removexattr
[f,l]listxattr
[f,l]getxattr
[f,l]utimes[at]
ftruncate[64]
fsync
getdtablesize
getdomainname
(consults global if_UTSNAME_DOMAIN_LENGTH
set)getpagesize
getloadavg
gethostname
gethostid
gtty
ioctl
get_nprocs[_conf]
get_[av]phys_pages
madvise
mincore
mlock[all]
mmap[64]
[p]writev[64][v2]
[p]trace
[p]select
[p]readv[64][v2]
munmap
munlock[all]
msync
mprotect
reboot
revoke
remap_file_pages
set[r]e[u,g]id
setdomainname
sethostname
sethostid
sstk
stty
syscall
sync[fs]
swapon
&swapoff
truncate
ustat
usleep
ualarm
vhangup
“posix”
“posix”’s platform-independant code here largely implements simple character-wise interpretors, including:
wordexp
interprets shell-like shortcut syntax to expand one word into multiple, possibly incorporating results of maths or commands.regexec
determines if/where some text matches a simple caller-specified syntax by compiling these “regexes” to NFAs then DFAs. Abstractions finds all matches, including text matching subpatterns specified by the regexp. Regexp’s NFAs (has duplicate paths) & DFAs are labled graphs where matching strings are paths along them.glob
filepath patterns into full filepaths it has found in the filesystem. With optionally sorting in a postprocessing step. A utility function adds a prefix to all results from a recursive call, another handles the iteration over the named directories whilst determining which contained files/dirs match the partial pattern.getopt
parses commandline flags (including the “–long” syntax) according to caller-provided data, using seperate support functions for long vs short flags.getopt
has a seperate support function for parsing long flags which’ll also log ambiguities in-memory, before looking up each character in a “-“-prefixed commandline argument in a given string, checking whether it expects an argument as donoted by “:”. Another support function moves commandline arguments around.fnmatch
applies the same wildcard syntax asglob
to a string array.
Functions which can be overriden by platform-specific code but don’t have to be include:
posix_spawn_file_actions_addclose
&posix_spawn_file_actions_add_chdir_np
alters a global table for how to handle the caller-specified file descriptor on fork once validated.posix_spawnattr_setsigmask
,posix_spawnattr_setschedpolicy
,posix_setattr_setsigdefault
,posix_spawnattr_init
,posix_spawnattr_setflags
,posix_spawn_file_actions_addup2
,posix_spawn_file_actions_realloc/init/destroy/addopen
sets appropriate globals.re_compile*
,re_set_*
,regcomp
, ®error
wraps internals of the regexp implementation, whilst still allowing more optimal platform-specific to override it. There’s a lookup table for nicely reporting errors. Maybe I’ll revisit this regexp engine tomorrow?execv[p]e[x]
adds logic over similar syscalls for resolving which programming to run from the first commandline argument & the “PATH” environment varible.confstr
retrieves the specified string global, or callssysconf
, for the caller-specified “name”, copying results into a caller-provided buffer.exit
sets astatus
global &abort()
s.glob_pattern_p
wrapsglob_pattern_type
group_member
as a fallback of allocating & retrieving groups then iterating over in reverse order to check whether the groupID is in it.pathconf
wrapsfpathconf
with validation.nanosleep
wrapsclock_nanosleep(CLOCK_REALTIME, 0)
uname
wrapsgethostname
to return it’s response alongside some constants.
Function stubs whose implementation are overriden by platform-dependant code include:
alarm
[f]execve
&execvp
fpathconf
fork
getaddrinfo
& (defaults to nothing)freeaddrinfo
getgroups
get[e,p][g,u]id
&getres[u,g]id
get[p]pid
getsid
pause
pread[64]
&pwrite[64]
posix_madvise
sched_rr_get_interval
sched_get_priority_min
&sched_get_priority_max
sched_get_scheduler
&sched_setscheduler
sched_getparam
sched_yield
sched_setparam
sched_setaffinity
set[p][u,g,s]id
setres[u,g]id
posix_spawn[_compat]
(wrapsspawni
by default)sleep
setuid
posix_spanwattr_destroy
(does nothing by default)spawni
sysconf
(some incomplete fallback)fattach
,fdetach
,get[p]msg
,isastream
, &put[p]msg
times
[v]fork
(vfork
fallsback tofork
)wait[[p]id]
(waitpid
fallsback towait4
,wait
fallsback towaitpid(WAIT_ANY)
)wait4
&wait3
Regular Expressions
Regular expressions provide an easy way (when not overcomplicated, as they frequently are) to test whether some text matches some given relatively-simple syntax, extract where they do, or replace those occurances.
I’ll start from regcomp
…
Anyways I quite like this implementation! The syntax is simple enough to not overstay it’s welcome, it uses the most optimal algorithm (unlike most other languages), & it still found ways to incorporate the features devs like elsewhere.
Compilation
regcomp
starts by initializing/allocating the given regex_t
pointer & reformatting given bitflags before handing off to the main code & if successful compiling “fastmaps” from it.
re_compile_internal
further initializes the provided regex_t
pointer, allocs/inits a DFA from the regex’s buffer reallocating said buffer if needed, handles errors, copies the regex source code into the DFA (re)alloc/init’ing a re_string_t
for it whilst reencoding the text & recording char boundaries.
It then runs it’s handwritten pushdown automaton bin_tree_t
of partial re_dfa_t
emitting parsing pulling tokens from it’s lexer, whilst lowering character classes & repititions. Refer to the UNIX documentation for details on this syntax.
If successful it than analyze
s this AST to convert it into Non-deterministic Finite Automoton form. Which involves allocating next
, org_indices
, edests
, & eclosures
arrays for the first re_dfa_t
in the regex’s allocation buffer.
Amongst those arrays is a subexpression smallint map, populated from a postorder traversal over the Abstract Syntax (Binary) Tree setting appropriate bitflags to indicate where it has done so.
The next several postorder traversals are for lowering away parenthesized “subexpressions”, gathering the possible initial characters for each branch, computing the subsequent possible characters from that, constructing a NFA graph from that analysis & the control flow, & calculate epsilon closures.
The difference between an Non-determinstic Finite Automaton (NFA) & a Deterministic Finite Automaton (DFA) is that an NFA may have multiple edges for the same char going off a node, and that it may contain “epsilon nodes” which may be traversed prior to following the edge for a character. Expressing is easier in a NFA, but interpretation’s harder.
Another compiletime-optional pass traverses the graph/automaton to flag where we’re dealing in pure ASCII. Then computes initial state & cleans up.
Execution
To evaluate a regular expression compiled as per above, you call regexec
which validates it’s input takes a lock & hands off to re_search_internal
.
re_search_internal
validates the fastmap if present, computes the number of extra matches to capture (stored seperately in memory), performs additional validation, allocs/inits a new re_string_t
for the text being matched followed by the re_match_context_t
. There’s additional validation for multicaptures & more initialization.
Iterating over each starting position in the string to be matched (in forward or reverse order) it considers whether it can use the fastmap as a cache quickly discarding incorrect starting points, takes a slice of the string being matched making sure to properly handle multibyte characters, than runs the core interpretor (check_matching
) checking for success. & copies out all the text which has been captured.
After validating/selecting it’s initial state check_matching
optionally saves an offset for capturing text and/or dynamically lowers backrefs into the DFA, checks whether to exit, & enters it’s mainloop.
This mainloop first considers whether it needs to realloc buffers (should only be needed in presence of backrefs), looks up & traverses the appropriate edge by char or multibyte “word”, optionally stores the state in a per-char log, detects failures & whether to retry, & maybe exits.
re_match/fork[_2]
are a variant of regexec
which operates on a slice & does more preprocessing to ensure a fastmap is computed & that there’s space for at least 1 pseudoregister. Those registers are returned to the caller. The _2
variant concatenates two strings before hand.
re_exec
is a BSD-compatible interface to regexec
.
During interpretation this regex engine may wish to prune impossible-to-reach nodes. Upon updating the per-char log NFA nodes are merged into 1 DFA node.
(De)Serialization (“stdio-common”)
GNU LibC’s “stdio-common” sublibrary some higher-level I/O abstractions & syscalls, largely relating to serializing & deserializing C’s primitive types to/from text.
psignal(sig, s)
runs fxprintf(NULL, "%s%s%s\n", s or "", ": " or "", _(__sys_siglist[sig]))
or asprintf(&buf, _("%s%sUnknown signal %d\n"), s or "", ": " or "", sig)
before fxprintf
ing the result. For some reason it doesn’t directly fxprintf
the int, worried about triggering additional signals?
psiginfo
extracts si_signo
, si_code
, etc fields from the given siginfo_t
& fprintf
s them to at most 128b (via fmemopen
) before outputting to stderr.
printf_size
is a printf
plugin which, handling NaN & infinity specially, repeatedly divides the decoded input by 1000 (if uppercase specifier) or 1024 (if lowercase) to find the appropriate unit before forwarding to the floating point formatter & appending the unit via put[w]c
.
printf_fphex
is another printf
plugin which, after retrieving params from the format spec & the locale, hands the decoded data to ito[w]a[_word]
it’s constituent parts, for postprocessing to round, etc.
After handling a quirk with stderr
, perror
serializes the given errnum
via strerror_r
& fxprintf
s the result with the optionally-given message.
itowa_word
serializes an integer by repeatedly divides the given value by the given base looking up the remainder in one of two arrays (depending on desired casing) to output that digit at the start of the output string (decrementing from end of buffer). Special codepaths are generated for bases 10, 16, & 8 to encourage GCC to specially optimize them, certainly it can for 16 & 8…
itowa
does similar for long long ints more complexly using multiprecision maths, fastpaths, & chunking.
itoa[_word]
don’t appear to be any different from itowa[_word]
.
tmpfile
wraps __gen_tmpfd
, fallingback to path_search
& gen_tempname
, before opening the new file.
Functions which may be overriden platform-dependant code includes:
getline
which defaults to agetdelim
wrapper.snprintf
which defaults to avsnprintf
wrapper.vf[w]scanf
which defaults to the code I described earlier.vfwprintf
which also defaults to the code I described earlier.
Function stubs with platform-dependant code for their actual implementations include:
path_search
&gen_tempname
rename[at[2]]
remove
funlock
,ftrylock
, &flockfile
(defaults to noops)cuserid
- &
ctermid
Floating Point Serializing
printf_fp
is a final bundled printf
plugin which outputs floats in decimal, which is more complex & uses multiprecision arithmatic from the previously-described “stdlib” sublibrary.
After extracting the locale-specified parameters & the input, outputting NaN or infinity specially, printf_fp
initializes a struct hack_digit_param
(internal type) & malloc
s it’s other fields & computes it’s scalesize
based on the extraced exponent
.
It performs some complex multiprecision math (which honestly I fail to comprehend) to convert the mantissa * 2^exponent
format into mantissa * 10^exponent
format (the new mantissa is fully computed later one digit at a time), and to decide where to place the decimal point by counting zeroes. This math is performed differently whether the base 2 exponent is > 2, negative, or between 0-2 inclusive.
It derives formatting parameters from the formatting char, estimates the number of grouping characters, & validates no overflows happened.
It mallocs a buffer to use, iterates over the computed integer digit positions (pre-computed) lazily finishing the multiprecision arithmatic (divmod & mul operations) to emit each one, adds a decimal point or trailing zeros, continues emitting those digits for the fractional component, rounds those characters informed by the subsequent digit, counts the number of added zeroes & then removes them, removes a leading 0, adds thousands seperators, serializes new (integer) exponent if needed, it adjusts the desired width, outputs specified padding, outputs the sign, insists ‘0’ padding be left-aligned, looksup a factor
from the locale, allocs a new buffer, & copies the previously computed text into it, applies internationalization postprocessing, outputs that text, cleans up, & adds trailing padding.
Interpreting Textual User Input
Allowing programs to receive user input is the easy bit, interpreting it is the hard bit. Which is perhaps why it’s typically part of programming language’s standard libraries rather than their syntax.
vfscanf
& it’s variants, building on a pair of trivial wrappers (char_buffer &
scratch_buffer`) around the core “malloc” sublibrary implementing slices & arrays, after initializing it’s variables & variadic argument iterator iterates over the chars of the given format string.
Until it reaches the end of the format string (to be matched some given “input” “libio” file, yes “libio” & “stdio-common” are mutually dependant) vfscanf
& variants repeatedly:
- Checks if the multibyte (non-ASCII) char are complete, matching them against the input file.
- Retrieves the next char & checks if it isn’t a
%
. If so skips whitespace (indicating input whitespace should be skipped too), retrieves the input character possibly skipping whitespace, & checks for match. - Otherwise prepares the character buffer & checks if the next character is a digit. Parses the number and checks whether it’s trailed by a
$
. If so it’s an argument position, otherwise a width & should skip steps 4 & 5. - Checks for
*
,'
, &I
chars & converts them toSUPRESS
,GROUP
, &I18N
bitflags. - If the next char’s a digit the number’s parsed as the “width”.
- If the next char’s a
h
,l
,q
,L
,as
,aS
,a[
,m
,z
,j
, ort
are converted into bitflags. - Checks for premature format string end & whether the next(? Step 6 “unreads” invalid chars) is a
[
,c
,C
,n
, or otherwise flagged to skip whitespace, all subsequent whitespace input chars are skipped. - Checks whether the next, and final of this format declaration, char is
%
,n
,c
,C
,s
,S
,x
,X
,o
,u
,d
,i
,e
,E
,f
,F
,g
,G
,a
,A
,[
, orp
, handling them appropriately. - Repeat.
At format end trailing space is skipped & errors handled.
For %%
vfscanf
& variants check whether the next input char is also %
& not EOF; %%
is the escape syntax.
For non-suppressed %n
the count of read input chars to the correponding variadic argument (though e.g. %2$...
duplicates the arg iterator to skip that number of places). %*n
is a noop.
%x
, %X
, %o
, %u
, %d
, %i
, & (which also adjusts bitflags) %p
all selects an appropriate to use in the generic number formatting code.
%c
defaults the width to 1, optionally ensures enough space (with a cap) is malloc
‘d, ensures we haven’t reached EOF, reads that number of chars into (if not suppressed) specified memory realloc’ing more memory as needed whilst compiletime-optionally (don’t know why you’d disable) decoding UTF8, & truncates text to actual size if callee-malloc’d.
That falls through to %C
which appears to be doing the same thing?
%s
& %S
does basically the same thing but escapes on space/EOF too.
To parse a number it checks whether the input char is -
or +
for it to add to the charbuffer decrementing format width if set, optionally parses 0x
as base 16 or 0
as base 8 (defaults to 10), base 10 with i18n-enabled has own codepath, validates each input char (thousands seperator handled special) whilst adding it to a the charbuffer, optionally case-insensitively parses “(nil)” for %p
, ungetc
s the last char, adds a nil character, calls strto[u]l[l]
, & assigns to variadic arg.
To parse floating point numbers for upper-or-lower %e
, %f
, %g
, or %a
it checks for a -
or +
input char, scans (signed) “nan” or “inf[inity]”, parses 0x
to indicate base 16, validates each character is a (possibly hexadecimal) digit, singular e
(or for hexadecimal p
), +
or -
following that e
or p
, singular localized decimal point, thousands seperator prior to said decimal point, substitutes out localized chars seperately, calls strto[f,d]
& assigns to appropriate arg.
%[
requires additional parsing to determine whether it’s followed by ^
indicating not_in
flag & all the characters prior to ]
. If the long flag’s set it iterates over the given width of input chars & iterating over the valid set specially interpreting ranges indicated by -
to indicate whether each input char is valid, then optionally copies those input chars over to the appropriate variadic arg alloc’ing if configured with fastpath for fails.
Communicating to Humans
Whilst vfscanf
& it’s variants aids your programs in interpreting textual user input, vfprintf
& it’s variants aids your programs to output their data in human-legible textual form using the same format string syntax.
After validating it’s params (adding buffering to the output if missing) & with the output file locked, vfprintf
& variants uses wcschr
or strchr
to locate first format declaration & sputn
to output preceding text.
Unless that was the entirety of the work (which GCC optimization ensures it isn’t) until the end of the format string vfprintf
& variants repeatedly uses several jump tables to parse the slightly more complex format declarations into several local variables.
Ultimately those jump tables end at a final char, like:
%%
outputs “%” viaput[w]c
.%d
&%i
configures some args, retrieves the number to format from the variadic args, & dispatches to the[longlong_]number
label.%u
(base 10),%o
(base 8), &%X
or%x
(base 16) sets their numeric base & dispatches gotosunsigned_number
. Which ifis_longlong
retrieves the arg as such & (same logic for longlong%d
or%i
) defaults precision to 1 or pad to whitespace, outputs 0 as “0” or usingitoa
with optionallygroup_number
and/ori18n_number_rewrite
postprocessing. Similar for non-longlong. Postprocessing adds octal markers, adds signs, & precision.- Case-insensitive
%e
,%f
, &%g
retrieves the variadic arg with an auxialiary bitflag set indicating it’s a float & hands off toprintf_fp
formatting parameters gathered into a struct for it. Unless we’ve reached the end of the format string thedone
is incremented by that number of chars whilst overflow checking. %a
or%A
does the same forprintf_fphex
.- Case insensitive
%p
formats non-NULL pointers as a base 16 unsigned number, otherwise outputs as the string “(nil)”. %n
optionally callsreadonly_format
to update that flag, & stores thedone
counter in the corresponding variadic argument.%m
callsstrerror_r
to retrieve the string to output corresponding toerrno
prior to thisvfprintf
call.- non-long
%c
adds configured initial & traling padding (viapad[w]n
) & the character itself (viaputc
). %C
or long%c
does the same thing for typewchar_t
.%s
or%S
overwrites NULL strings with “(null)”, handles non-long%s
specially or retrieves optionally-capped length & outputs the string viasputn
surrounded by padding if space available.- non-long
%s
a) if right-aligned with configured width, converts between UTF16 & UTF8 viawcsrtombs
ormbsrtowcs
to count characters determining how many padding spaces to output viapad[w]n
b) converts the string in batches to output viasputn
c) output right padding if width-configured.
The number of specs interpreted are counted, the next one is located, & the intermediate text is written via sputn
at the end of each iteration. Extensive macro-aided error handling is present throughout.
Once a positional arg is seen in the format string, a printf_positional
fallback function is called for the remainder of the format string. This parses that remaining format string into a scratch_buffer
to figure out how to convert the variadic args for interpretator input.
printf_positional
uses the same code (with the aid of C macros) to interpret the format declarations. Upon encountering an unknown format declaration it serializes the format specifier itself.
This slowpath interpretor can also utilize caller-registered callbacks during parsing & input gathering, registered into lazily-allocated smallintmaps via register_printf_specifier/modifier/type
. Parsing’s in a seperate file.
Bytearrays (“string”)
GNU LibC’s “string” sublibrary processes bytearrays, usually those representing (UTF8-encoded) text. As such looping over the bytes in these arrays is extremely common!
There’s no functions which requires a platform-dependant implementations, but some functions allow for them. I’ll start with the functions are never overriden with platform-specific optimizations.
swab
iterates in reverse over every 2 bytes in the array & swaps (via a temp var) their positions.
An old implementation of strtok
first iterates over initial string characters matching the given seperator, then if that’s not the end of the string iterates over the remaining characters it finds the given seperator char (which is replaced with a nil byte) or end of string. Old implementations of strsep
wraps strchr
or reimplements the loop itself. Old implementations of strcspn
iterates over all characters which matches the given seperator(s) & returns the tail, as does strspn
. An old implementation for strpbrk
iterates over all chars which don’t match any of the given seperator(s) or the nil byte.
strfry
swaps 2 psuedorandomly-selected bytes within the given nil-terminated string once for every char in that string, tracking it’s own pseudorandom state.
strcmp
iterates over every equal char in the two given strings in tandem, returning the difference between the final, possibly nil, chars.
memfrob
xors every char in reverse with 42.
envz_entry
iterates over the given multistring (multiple nil-seperated strings with external byte-length) has an outer loop iterating over the strings & an inner loop which determines how much of the string matches the given “name”. If the following char is nil or ‘=’ that string is returned, otherwise returns NULL.
envz_get
wraps envz_entry
to before conditionally skipping past that ‘=’ if present. envz_remove
wraps envz_entry
& conditionally envz_delete
.
envz_add
wraps envz_remove
to remove the appropriate entry, then either realloc
s enough space to (if successful) memcpy
the given key & value with added ‘=’ & 0 bytes OR does so by calling argz_add/append
.
envz_merge
iterates over every entry in the 2nd given multistring to retrieve corresponding in the 1st via envz_entry
& argz_append
ing it after optionally removing the old entry if needed.
envz_strip
iterates over entries to find empty ones & memmove
over them.
Duff’s Device reffers to wrapping a loop in a switch
statement, with the case
s pointing inside the loop body as a clunky micro-optimization reducing loop iterations. Thankfully GNU LibC’s “string” library doesn’t quite do this (to allow merging loop body instructions?), but it does get close. GCC can do all these some optimizations, but I guess GNU LibC can’t rely on the right optimization level being set.
“string” functions applying similar microoptimizations also includes:
- the internal
WORDCOPY_(F,B)WD[_DEST]_ALIGNED
. - For non-trivial yet not too complex “needles”
strstr
which iterates over char in the given “haystack” skipping forward according to a trivial 256b char-to-int hashtable (amongst other shortcuts) & callsmemcmp
. - For needles of 3 or fewer chars it straightforwardly iterates over haystack char to test where the needle matches.
- For needles of more than 256 chars it puts more effort into building that shift table.
strspn
iterates over the string looking for the char matching the char if given only one. Otherwise converts the given “accept” chars into a bitmask & iterates over the initial chars (in chunks of 4) which are in that bitmask.strncmp
iterates over the given number of chars in chunks of 4 to test where they’re equal. Returns difference between first unqual.- Iterates over chars in chunks of 4 using a special bitops to speedup nil-byte detection.
memcpy
calls a kernel-accelerated macro surrounded by CPU-accelerated ones.
GNU LibC’s “string” sublibrary also provides several more trivial wrapper functions:
__xpg_strerror_r
wraps__strerror_r
,strlen
, &mempcpy
.strsignal
wrapssigdescr_np
,glibc_tls_internal
’sstrsignal_buf
property, &asprintf
.strncpy
&stpncpy
wrapsstrlen
,memcpy
, &memset
for the remaining bytes of the destination.sigabbrev_np
looks up the given signum in an array if in range.- &
explicit_bzero
wrapsmemset
adding a compiler membarrier.
Platform-overridable “string” Functions
Finishing my explorations of GNU LibC’s “string” sublibrary, roughly the other half of it’s API may be overriden by platform-specific code. But if not here’s how they work:
utf8_encode
decides whether to directly store the char as an ASCII byte, or compute how many bytes should be used (via a bittwiddling loop) & storing 6 bits of the char at time iterating over those destination bytes in reverse with that highorder bit set. The first byte (last bits) also encodes the length.
find_idx
wraps followed by following a table of how to far skipahead in the string for a given number of iterations. find_position
wraps find_idx
with a lookup in the given rulesets
smallintmap given by the locale.
do_xfrm
for a locale-specified number of iterations looksup that local-specified pos for the current char & (handling 0 specially) iterates over the (UTF-32?) input starting at that current char for each calling find_idx
& following the rulesets with char backlog.
There’s another variant of do_xfrm
with added caching, with strxfrm
choosing which to use after shortcircuiting for no locale-specified rulesets & locating the locating max index of a character to rewrite.
Turns out the input strings are normal UTF8.
strverscmp
uses iterates over equal chars whilst following a “state” lookup table, with a seperate lookup table determining whether the remaining characters by decimal number length or char difference.
strrchr
repeatedly calls strchr
until it finds the last one, skipping ahead after each match.
str[n]casecmp
compares the lowercased chars (via tolower[_l]
) of both inputs in lockstep & returns the difference between the first non-equal chars.
argz_stringify
iterates over every string in the multistring (via strnlen
) & replaces the seperating nil chars with a given seperator char.
strlen
iterates over every string in the multistring storing pointers in a given presized array.
argz_create
iterates over the given string array once to count up the total len (including nil terminators) so it can malloc
a multistring into which it can stpcpy
each string in that array.
argz_count
iterates over each string in the multistring to return a count of them.
argz_create_sep
malloc
s the given ammount of space & loops over the chars to replace the given deliminator with the nil byte. argz_add_sep
does reverse.
argz_replace
iterates over the strings via argz_next
filtered by strstr
. For each it uses strstr
to find each occurance in that string, and either repeated str_append
s (preceding by strndup
or argz_append
/argz_add
to replace each occurance.
Functions which are microoptimized to iterate in large bitsizes within their loop bodies include:
memchr
which uses bittwiddling tricks to iterate over all chars looking the first matching the given needle.
memcmp
iterates every byte in lockstep in the two inputs in chunks of in chunks of 4 long ints, with special fastpath when the alignment between the two inputs are the same.
memmove
uses the same optimization (calling into kernel, wrapped with CPU, optimizations via C macros) tricks as memcpy
, except it first determines whether there’s overlap requiring it to iterate in reverse order.
memrchr
uses bittwiddling tricks to operate 1 long int at a time exiting once it finds the given byte.
strchr[nul]
uses the same bittwiddling tricks to iterate over the given string 1 long int at a time looking for the first occurance of the given char (which is preprocessed for those bittwiddling tricks).
strnlen
uses the same bittwiddling tricks as strlen
iterate over the string 1 long int at a time.
strcasestr
meanwhile also uses higher-level optimizations!
After checking whether the needle fits in the haystack, it checks for repetitions in the needle.
Locating where the suffix first repeats involves iterating over the needle’s (canonicalized) chars twice comparing prefix & suffix chars, each time with three pointers: two for prefix & one for suffix. Extracts the length of the repeating prefix as the “period”.
If the needle’s short enough & there’s repetition(s) it repeatedly counts how many chars match between the needle & haystack at their current offsets in the right half then left to determine to return this position to the caller. If there were no repetitions it’ll canonicalize the needle & repeatedly compare it against each position in the string suffix than prefix canonicalizing each char of the haystack for comparison.
If the needle’s too long for either of those, it first computes a shift table mapping each char to it’s last occurance in the needle, or the needle length if absent. Then it’ll repeatedly check that shift table for how far to skip ahead before doing the comparison at each position. If there’s repetitions (and the needle is too long) that loop also involves tracking an offset into the needle for which repetition to check.
Simpler wrapper functions which may be overriden by platform-specific code includes:
strtok_r
wrapsstrspn
thenstrcspn
.strsep
wrapsstrcspn
.strndup
wrapsstrnlen
,malloc
, &memcpy
.strncat
wrapsstr[n]len
&memcpy
.strerror
wrapsget_errlist
, optionallyasprintf
, &_
with temporarily setuselocale
.strerrordesc_np
trivially wrapsget_errlist
.strerror_r
wrapsget_errlist
& maybesnprintf
.strdup
wrapsstrlen
,malloc
, &memcpy
.strcpy
wrapsstrlen
&memcpy
.stpcpy
works likestrcpy
but returns the string tail.sigdescr_np
wraps thesys_siglist
internationalized string array.rawmemchr
wrapsmemchr
orstrlen
for the nil char.mempcpy
wrapsmemcpy
while returning the array tail.memccpy
wrapsmemchr
&mem[p]cpy
.- The
argz_next
iterator wrapsstrchr
. argz_insert
wrapsargz_add
, orrealloc
&memmove
.argz_delete
wrapsstrlen
&memmove
.argz_append
/argz_add
wrapsrealloc
&memcpy
.
Time
Processing time can be useful or even vital to many applications. You can’t see it, you can’t hear it, you can’t weigh it, you can’t… measure it in a laboratory. All a computer can do is count every time e.g. a certain capacitor charges & discharges through a resistor, forming an electronic pendulum.
Ben Eater has some great videos on the hardware side (YouTube via Invidious):
Here I’m exploring GNU LibC’s “time” sublibrary, which exposes the time in both seconds-since-1970 & the Gregorian calender with internationalized rendering. Could still be more internationalized if C didn’t stick solely to the Gregorian calendar… Then there’s, especially historical, timezones & leap seconds…
tzfile_read
looks up the caller-specified file relative to the directory specified by the “TZDIR”, checks whether it changed, & parses the binary array data under lock.
tzfile_default
tzfile_read
s a constant filename & overrides several fields.
tzfile_compute
calls tzstring
on various fields, including every zone referenced in the types
array. Or wraps tzset_parse_tz
& tz_compute
. Or traverses a transitions graph. In any case looksup the full details for the timezone (including a call to tzstring
) & extracts the leap seconds to add or subtract.
offtime
computes the days, incorporates timzone offset, hour, min, secs, weekday, & year. Computing years is the trickiest step for offtime
, for which until fixpoint it estimates that a year is 365 days before going back & computing the number of leapdays. Then it computes the month & day-of-month from the remaining day-of-year. The results are packaged in a struct.
mktime
is the central function for converting that broken-down Gregorian timestamp (struct tm
) back into seconds since 1970. It first extracts & validates (Windows needs extra for timezones) the arguments.
After extracting & validating the arguments it normalizes away leapseconds, until fixpoint converts back & forth between to determine any corrections which need to be made for the difference between iterations (exiting if e.g. Daylight-Savings Time invalid) before doing the final conclusion. The conversion is done via localtime64.
nl_select/get_era_entry
wraps a table parsed & chronologically-verified from the locale files.
difftime[64]
subtracts two seconds-since-1970 timestamps, taking care around the sign & weird numeric representations.
nl_get_[w]alt_digit
& nl_parse_alt_digit
parses “alternative digits” for use in date formatting from the user’s current locale if they haven’t already, and retrieves the properties that parser sets.
For functions which may be overriden by platform-specific code:
asctime[_r]
wraps snprintf
& the format string “%.3s %.3s%3d %.2d:%.2d:%.2d %d\n”.
get_date[_r]
opens the file specified by the “DATEMSK” environment variable, trims whitespace surrounding the given text, reads each date on each line via getline
& strptime
exiting once successful, thens cleans up & infers missing data.
gettimeofday
wraps clock_gettime
converting the resulting struct timespec
to a struct timeval
.
gmtime[64][_r]
wraps tz_convert
. tz_convert
in turn wraps tzset
, the previously described tzfile_compute
, & maybe offtime
.
settimeofday
wraps clock_settime
& maybe settimezone
.
The aforementioned internal struct tzstring_l
type is a linked list of sized strings. tzstring[_len]
initializes this datastructures & appends to a global tzstring_list
.
The internal update_vars
function resets the daylight
, timezone
, & tzname
globals from the tz_rules
global. compute_offset
converts seconds, minutes, & hours into a total number of seconds.
parse_tzname
extracts all the initial alnum chars at the start of a string into a tzstring
which to store into the name
property of the relevant tz_rule
. parse_offset
checks a sign char & uses sscanf
to parse a timezone offset which to store into the offset
property of the relevant tz_rule
. parse_rule
parses the date formats which precede log entries.
tzset_parse_tz
wraps parse_tzname/offset
once or twice falling back to tzfile_default
, before attempting parse_rule
.
tzset
under a lock with a singleton flag retrieves the “TZ” environment variable checking whether it’s changed & ignoring preceding “:”, reads it via tzfile_read
, saves the result in the first of tz_rules
global & updates other globals to match.
tz_compute
first calculates the seconds-since-1970 datetime for both global tz_rules
on the given year (year into seconds if after 1970, then depends on calendar), then optionally compares them to determine whether it’s Daylight Saving.
strftime
& variants aids outputting a gregorian date in a given format string resembling printf
, etc’s. This involves an iteration over the format string copying each, possibly multibyte, char over optionally, except for various ASCII characters it handles verbatim, converting casing, until it reaches a ‘%’. Simpler if you don’t compiletime-enable internationalization. Then parses the “_-0^#”, digit, & E
or O
flags. Before acting upon the formatting character, fallingback to outputting invalid chars.
”%%” is the escape sequence which outputs “%”. “%n” outputs a newline. “%t” outputs a tab.
Heavy use of macros throughout this code help to avoid buffer overflows.
Other format chars might lookup fields of the given struct tm
in global arrays and/or (usually enabled by the ‘E’ modifier) the configured locale. Various of those will update casing flags if configured by the ‘#’ modifier. Not all modifiers are valid on all formatchars, if they’re encountered it errors out. Interestingly AM & PM are stored in a single string.
Other formatchars recurses to apply another another format string, whether hardcoded and/or supplied by the configured locale. These tend to not accept modifiers.
Yet others output various properties, possibly with minor (pre?)processing, as a possibly-padded (via memset
) integer by repeatedly dividing by 10 inline outputting remainders (+’0’) from right to left. Where padding is present it usually defaults to spaces. The locale may specify alternative characters to be consulted first before outputting most integers.
Or for “%s” (GNU extension) it’ll use mktime
to convert back into seconds-since-1970 UNIX timestamp for the integer to output, handling negatives specially. Whilst “%V”, “%g” & “%G” computes the ISO weekday via a helper function postprocessed to enforce year boundaries, and “%z” unapplies daylight savings if set & retrieves the difference.
And some formatchars looks up “eras” from the locale to incorporate some property thereof in it’s output, usually falling back to computing the century from the year.
strptime
essentially works in the reverse of strftime
. Consisting mostly of a loop over the format string. Whitespace skips any amount of whitespace in the input string. Non-percentiles are matched against the input advancing both iterators, returning NULL on non-match. Modifiers (“-_0^#”) & field width (digits) are skipped in favor of leniency so you can reuse the format strings for strftime
, and then acts upon the formatchar.
”%%” matches “%” in the input.
Some formatting directives like “%a” iterates over multiple string arrays, including those provided by the configured locale, to determine which matches (via strncasecmp[_l]
) & what it should store for e.g. tm_wday
giving the longest matching string precedance.
Other formatting directives trigger recursion on formatting strings which might be provided by the locale.
Others skip spaces then parses integers via multiply-add by 10 whilst it sees a digit. Sometimes requiring more computation.
“%O” triggers scanf
to check the subsequent character. Some numbers (especially for “%O”-prefixed directives) may first consult nl_parse_alt_digit
.
A statep
parameter is passed through to recursion & unpacked into local variables, indicates postprocessing shouldn’t be done yet. Said postprocessing increments the hour by 12 if PM is set, incorporates the century and/or era into the year. If an era was desired defaults to +100.
Days may be specified in a few ways which needs to be made consistant. If it doesn’t have either a month & a monthday but does have a yearday it can calculate the former taking into account leap years, and from there the weekday if lacking. Vice versa it can calculate the yearday from the year, month, & monthday. If it has a week (in some form) & a weekday it can calculate the monthday, month, & year from it.
Days may be specified in a few ways which needs to be made consistant. If it doesn’t have either a month & a monthday but does have a yearday it can calculate the former taking into account leap years, and from there the weekday if lacking. Vice versa it can calculate the yearday from the year, month, & monthday. If it has a week (in some form) & a weekday it can calculate the monthday, month, & year from it.
UTF16 Wide Strings “wcsmbs”
GNU LibC’s “wcsmbs” sublibrary implements UTF16 processing, which loosely resembles the “string” library with builtin “iconv” integration. Though UNIX programs now typically stick to UTF8 so “wcsmbs” doesn’t get as much use or optimization. (Microsoft’s kernel however deals in ASCII or UTF16, so things are different there) Typically when any real text processing is desired we pull in the external LibICU library to be properly internationalized!
I wouldn’t recommend using “wcsmbs” (or “wctype”) since it doesn’t appear to have kept up after the Unicode Consertium found more than 65,536 characters accross the different written languages. I recommend using LibICU instead.
wctob
wraps the innards of “gconv” to convert char into a Unicode codepoint, fastpath for ASCII.
wc[s]width
wraps the configured locale & “wctype” to compute the width of each char in the given widestring, wcswidth
sums the result.
wcstok
wraps wcsspn
then wcspbrk
, returning early NULL on error.
wcsspn
iterates over the given widestring & with an innerloop over the acceptables chars to exit as soon as it finds a char in that set.
wcsrchr
has a C macro-unrolled loop (defaulting to no unrolling) over the given widestring storing the index of the last occurance of the given char, which it should return upon reaching the nil byte.
wcspbrk
repeatedly calls wcschr
for each starting index in the given widestring.
wcsncmp
iterates every equal char in the two given widestrings in lockstep, comparing them, returning that difference once nonequal or end of strings.
wcsncat
wraps wcs[n]len
& wmemcopy
.
wcsmbs_getfct
wraps gconv_find_transform
with validation. wcsmbs_load_conv
wraps wcsmbs_getfct
with added pre- & post- processing. wcsmbs_clone_conv
wraps get_gconv_fcts
with locked refcounts. wcsmbs_named_conv
wraps wcsmbs_getfct
to retrieve converters in both directions, erroring if either’s unavailable. nl_cleanup_ctype
retrieves the locale’s private.ctype
property, cleaning up if unavailable.
mbsrtowcs_l
wraps the “gconv” internals to convert a string in a locale- (caller-provided) specified encoding into a (UTF16) widestring, specially handling when it’s left to do the malloc
. mbrtoc16
works similarly choosing the input text encoding from the global locale.
c16tomb
splits the specified 16bit char into one or two bytes before continuing via wcrtomb
.
For “wcsmbs” functions which may have platform-dependant optimizations applied:
btowc
wraps the “gconv” internals to convert a Unicode codepoint int to a widechar according to the global locale’s configured char encoding, with fastpaths taken for single-pass conversions & especially ASCII.
mbrtowc
also wraps the “gconv” internals to convert a string to a widestring according to the configured locale’s charset, taking special error-reporting care.
mbsnrtowcs
is another wrapper around the “gconv” internals handling sized strings, specially handling cases where it’s responsible for malloc
ing the output.
wcrtomb
is a more straightforward “gconv” internals wrapper converting widestrings to strings, given some state to propagate between consecutive calls.
wcscasecmp
iterates lockstep over the equal (once lowercased) chars in both given widestrings, returns the case insensitive difference of the next char.
wcschrnul
iterates over a given widestring until it reaches the given char or nil, returning that pointer.
wcschr
, with C macro-based loop unrolling, iterates over a given widestring saving the last index where it encountered the given char & returns that once it reaches the nil byte.
wcscmp
iterates lockstep over the equal chars in both given widestrings, returning the difference between the next chars.
wcscpy
wraps wmemcpy
or uses it’s own macro-unrolled loop.
wcslen
iterates over a given widestring in chunks of 4 & returns the char count.
wcsncasecmp
iterates lockstep over the equal-once-lowercased chars two given widestrings returning the lowercased difference between the subsequent chars.
wcsnrtombs
(GNU-specific, very useful internally) again wraps the “gconv” internals to convert from a sized widestring to sized string according to the given locale, specially handling when it’s responsible for the malloc
. wcsrtombs
works similarly.
wcsstr
iterates over the given “haystack” widestring (including some inner fastpath loops) with a seperate iterator over the given “needle” widestring to determine at which index that needle occurs in the haystack without backtracking on said haystack.
wmemset
sets each index in the given sized widestring to the given character in chunks of 4.
wmemcmp
the compares the given number of chars in both given widestrings to determine the first differing character & returns that diff, 4-chunks.
And wmemchr
iterates over the given sized widestring to determine the first differing widechar & returns it’s pointer.
Other more trivial wrapper functions which may be overriden by platform-specific optimizations include:
wmempcpy
which wrapsmempcpy
wmemmove
wrapsmemmove
wmemcpy
wrapsmemcpy
wcsnlen
wrapswmemchr
wcsncpy
wrapswcsnlen
, maybewmemset
, &wmemcpy
wcscat
wrapswcslen
&wcscpy
wcpncpy
wrapswcsnlen
,wmemcpy
, &wmemset
wcpcpy
wrapswcslen
&wmemcpy
mbsrtowcs
wrapsmbsrtowcs_l
for the global localembs_init
initializes some state for some conversion functions I described earlier- &
mbrlen
wrapsmbrtowc
Wide Characters
I want to finish describing the GNU LibC’s platform semi-dependant sublibrary implementations. So having just described “wcsmbs” dealing with “wide strings” I’ll now describe it’s complementary “wctype” for dealing with individual wide characters. Again I’d recommended to use LibICU instead.
Most of “wctype”’s platform-independant function implementations wraps wctype_table_lookup
after looking up lookuptables from the global locale & checking whether it can take an ASCII fastpath.
That also closely describes towlower_l
& to_wupper_l
, except those don’t take ASCII fastpaths & wraps wctrans_table_lookup
instead.
The rest can all be overriden with CPU-specific optimizations, several of which are also implemented in that same way…
wctrans[_l]
looks up the list of charset names from the locale, finds the index of the desired name, & looks up the corresponding wctrans_t
from the locale. wctype[_l]
works similarly. The *_l
funcs allow caller-specified locales.
The aforementioned table lookups (wctype_table_lookup
, wcwidth_table_lookup
used by “wcsmbs”, wctrans_table_lookup
functions) consults three tiers of hashtables without any rehashing.
towctrans[_l]
& iswctype[_l]
just validates their desc
arg is provided & publicly wraps those lookup functions. The caller-specified locale is ignored.
Plugin Modules
Several GNU LibC sublibraries, including that for loading these Dynamically Loaded Modules, may defer to plugin modules to support different formats, text encodings, naming systems, etc. While you may add your own, GNU LibC bundles a few such plugin modules built-in.
Executable & Linkable Format (“elf”)
When you compile a program, without special effort adding additional capabilities requires recompiling that code. As such it’s often useful to engineer away this rigidity! That’s what dynamically-loaded modules do!
On Linux-based systems this is often achieved using the Executable & Linkable Format (ELF), and it’ll take me several days to get through it’s featureset as implemented within GNU LibC’s “elf” extension to “dlfcn”.
- The “dl-minimal” subsystem handles specially linking the “malloc”, “setjmp”, error reporting, & “string” sublibraries for “elf”’s own use, with temporary implementations directly wrapping
mmap
. Because withoutmmap
“elf” can’t do anything! _dl_fini
iterates over namespaces skipping empty or auditting ones, for each under lock, gathers than sorts (soon-ish topic) a list of items under that namespace, & releasing the lock iterates over them again skipping uninitialized ones to call each registered finalizer. Optionally followed by auditting & debugging output both within & outside the loop.- There’s utilities for iterating over every “LD_“-prefixed environment variable, & for removing an environment variable.
_dl_addr
(called & exposed directly by “dlfcn” sublibrary) wraps_dl_lookup_address
(on ia64 & hppa CPUs) & under lock_dl_find_dso_for_object
with output format conversion & hashmap or array symbol lookup.
Entrypoint _dl_start
(“rtld”)
The entrypoint _dl_start
is called via platform-dependant code and, with time-profiling & a cleared bootstrap_mem.l_info
array, retrieves some platform-specific globals into bootstrap_mem
before converting an array in the ELF file into that l_info
smallintmap with a little postprocessing & validation. Then initializes internal malloc-wrapper stubs before calling _dl_start_final
& returning it’s optionally mapped start code address.
With additional profiling & a dl_rtld_map
loaded from some global pointers or compiletime-conditionally what _dl_start
gathered, _dl_start_final
in turn calls the __builtin_frame_address
function which I can’t find it’s definition & forwards to the “sysdep” subsystem (later).
Upon start (_dl_sysdep_start
) iterates over the dl_auxv
to retrieve appropriate fields, calls platform-dependant code to retrieve system memorypage size if currently unknown, if there’s a new sysinfo
& the lib doesn’t have dl_sysinfo_dso
already updates global, initializes “tunables” (won’t cover that topic…), runs some platform-specific initialization code (via C macros), computes dl_platform
length if missing, sbrk
-allocs some memory, validates standard FDs, & runs the dl_main
code.
That function, and any necessary cleanup (none by default), might be overriden with platform-dependant code. And there’s a _dl_show_auxv
function for textually outputting that dl_auxv
array in case debugging is desired!
_dl_sysdep_start
in turn calls dl_main
which is where the bulk of “elf”’s code is! With initialized state
, locks, stacks, etc dl_main
starts by iterating over envvars to set appropriate globals.
Then it conditionally (for suid executables) clears any “unsecure” environment variables in a platform-specific list & checks if /etc/suid-debug is present and/or parses debug configuration opening the appropriate file.
If the ELF library was ran as the commandline program, handling it specially (not digging into details), otherwise allocates a main_map
representing the embedding executable. In either case iterating over the program headers to populate various main_map
, etc fields.
Those extracted fields from the ELF header array are postprocessed, if it’s not the main executable it converts an array in the file into a smallintmap just like it did first thing on _dl_start
, & initializes a new hashmap.
If dl_main
was called for verification, it’ll validate the presence of main_map->ld
& has_interp
alongside caller macro-specified checks.
Otherwise it continues by configuring the virtual DSO, calls _dl_init_paths
, initializes debug data & updates some properties.
dl_main
then iterates over the ELF headers again to extract the PT_GNU_RELRO
field to retrieve the l_relro_addr/size
properties from. If there’s a l_tls_blocksize
it’ll retrieve the thread-locale-storage (TLS) modID (later). It appends 2 items to the audit list & considers outputting help text.
If there’s anything in the audit list it’ll initialize TLS & security, & load all the specified audit modules taking great care to handle errors via _dl_open
& _dl_lookup_symbol_x
.
If it successfully loaded any audit items, it’ll call registered callbacks.
It stores some previously retrieved debug references, passes each link_map_audit_state
to the afct->activity
callback, with time-profiling iterates over any configured (via env, commandline args if lib’s ran as executable, or /etc/ld.so.preload) preloads calling _dl_map_object
(later) with error handling for each, than gathers an array of them & applies _dl_map_object_deps
(later).
It iterates over everything in the main_map->l_searchlist.r_list
array in reverse marking them as globals, removes dl_rtld_map
from the doubly-linked linked list, then populates that l_search_list.r_list
array from the dl_rtld_map
linked-list. If there were any items it rebuilds the linked-list head.
Then it checks version numbers (later), followed by initializing thread-local-storage (later) & in normal mode optionally outputs debugging & applies relocations.
If any libraries were loaded, it iterates over them to validate the precomputed checksums & names match. Might output debugging info.
Then it retrieves some fields into globals, optionally scope debugging info, calls platform-specific validation code. If prelinked time-profiled resolves conflicts (later topic), ensures malloc
is properly linked (tomorrow’s topic, pushing off previous comments to that effect), & iterates over the mainmap marking executable pages as readonly & collecting slots.
If it isn’t prelinked then dl_main
iterates over the searchlist in reverse to time-profiled apply relocations & gather slots.
The thread-local-storage “generation” is conditionally incremented, unconditionally allocated, & conditionally added to a list for the local thread with optional debugging output.
If not prelinked but with multiple references it ensures malloc’s properly linked & time-profiled applies relocations.
dl_main
finishes by initializing parts of GNU LibC (later topic), starts sysdep cleanup (after tomorrow), compile- & run- time optionally runs loaded audits, finishes initializing debugging & compiletime-optionally unloads the cache.
ELF Format (“elf.h”)
Continuing my discussions about how GNU LibC “elf” plugin to “dlfcn” works, today I’ll describe the ELF binary file format itself, because executables & libraries don’t just contain machine code for the CPU to fetch, decode, & execute. It also contains additional information required for a “linker” to get the CPU ready to do so.
I’ll use the term “word” to refer to 32bit or 64bit depending on the file. C macros help “elf” generate codepaths for both!
An ELF extended header (referred to by the “dl-load”, “sprof”, “readelflib”, “rtld” as described yesterday, “dl-map-segments”, “readlib”, & “dl-support” subcomponents) contains a 16byte identifier, 16bit object file type, 16bit target architecture, 32bit version, 1word (virtual) pointer to _start
, 1word offsets for program headers & section headers, 32bit CPU flags, 16bit bytesizes for all(?) headers, entry & byte sizes for program & section header tables, & section index for string table.
That identifier starts with “\177fELF” followed by enums for wordsize & endianness, a version number, an enum & version number for the ABI (e.g. 3 for GNU/Linux) the machine code was compiled against, & a padding byte. These enums, and others, are defined manually via C macros.
Section headers have 32bit name index, 32bit section type & 1word flags, 1word address & file offset & size, 32bit next link & additional info, & 1word alignment & entry size. Those section headers, if it’s `SHF_COMPRESSED bitflag is set, can contain additional 1word fields for compression type (current ZLIB/DEFLATE fully supported), size, and alignment.
Each entry in the symbol table (used throughout the “elf” extension) contains 32bit name index, 1word pointer, 32bit size, 1byte each for type/binding & visibility, & 16bit section index. 64bit ELF moves the pointer & size to the end of the struct. Optionally also 16bit each for binding refs & flags.
Relocation tables (used throughout “elf”, via a parallel array) maps 1word offset tables to 1word symbol index, optionally with 1word “addend”s.
Program segment headers (used primarily in “dl-load” & “rtl” subcomponents) include a 32bit type (from an enum), 1word offset, 1word virtual & physical addresses, 1word file & mem size, 32bit flags (moved to second field for 64bit ELF), & 1word alignment.
Elf??_Dyn
is a dynamically-typed 1word value used throughout “elf”.
Version definitions (used in “dl-version” & “rtld” subcomponents) consist of 16bits each for the revision, flags, index, & count; & 32bits each for hash, offset in verdaux
array, & offset to the next version definition. That verdaux
table lists 32bit names with 32bit offsets to the next such entry.
Each version dependency describes a 16bit version, 16bit “aux” entry count, & 32bits offset references to dependency filename, to the vernaux array, & to the next version dependency. Version dependencies optionally also hold 32bit hash reference, 16bit flag, 16bit padding, 32bit dependency name string offset, & 32bit offset reference to next auxiliary entry.
There’s dynamically-typed 32bit values used predominantly by “dl-sysdep” & “dl-support”.
Note headers (used by “dl-load” & “readelflib”) have 32bits each for namesize, bodysize, & type.
Each move records have 1word each for value, size & index, & symbol offset ref; & 16bits each for repeat count & stride.
In 32bit ELF SHT_MIPS_GPTAB
sections hold 32bit current value, 32bit padding, 32bit new value, & 32bit tailsize. SHT_MIPS_REGINFO
holds bytes for the kind & size, 16bits for section, & additional 32bits interpreted according to it’s kind.
.options
sections contains 64bits of flags.
SHT_MIPS_LIBLIST
sections contains 32bits each for name index, timestamp, checksum, version, & flags.
SHT_MIPS_CONFLICT
contains 16bit version; chars for ISA level, revision, size for general regs & coprocessor 1 & 2, & ABI, 32bits for ISA extensions, 32bit mask of ASEs used, & 64 bits of flags.
Opening Dynamic Libraries (“dl-open”)
_dl_open
is called by that dl_main
function described previously to load the auditting plugin modules for the given plugin module, & is referenced by a methodtable I don’t see used anywhere.
With validated args & a lock it checks whether the given nsid
is LM_ID_NEWLM
. If so iterates over the namespace array looking for a free slot erroring out if none, initializes it if needed, & flags it. Otherwise invalid namespace errors.
With careful error handling & compiletime-optionally cache unloading, it does the main work of:
- If the given filename has a
$
seperator, the namespace ID is__LM_ID_CALLER
, or there’s no slash in the filename it calls_dl_find_dso_for_object
to load the “link map”._dl_find_dso_for_object
iterates over every namespace, &_ns_loaded
linked lists within to find the highest addressed object inside that memory region. - Check if LibC’s loaded.
- Temporaily saves the namespace’s
_ns_global_scope_pending_adds
to be restored later. - Initialize some debugging data (later topic)
- Loads the named object (later, seperate topic)
- If there’s a
l_searchlist
optionally outputs debugging information, validates them & allocates a new global “link map”, set thel_nodelete_active
on the dynamicly loaded lib flag, and/or copies over into the namespace’s_ns_main_searchlist
array. - set the
l_nodelete_pending
flag on that lib. - Load the “object”’s dependencies by:
- Preparing a “preloads” array copying from a given array, & taking a reference to the last item.
- Allocate a scratchbuffer.
- Iterate over that array, flagging each dep as being
done
, allocating some temp mem, optionally iterating over it’sl_ld
array applying it’s substitutes & calling_dl_map_object
with error handling (later topics) whilst saving results including into linked lists, consolidating dependency array, & skipping completed deps. - Frees temp memory & restores the error number.
- Saves the old
l_initfini
if present, before alloc’ing a new one. - Iterates over the
runp
linked-list populating non “faked” entries into that array whilst keeping an index to the last one with a link map. - Optionally iterates over that array skipping items without a local scope & erroring out for ones with AUXTAG & FILTERTAG, for each building local scope, flagging
DT_SYMBOLIC
items within, & allocating an array for them. - Optionally flags everything temporarily in the
l_searchlist
asl_reserved
& iterates over thel_reldeps
array to locate & remove duplications. - Copies the
l_searchlist
array into thel_initfini
array, removing the last item with a link map if any. - Sorts that
l_initfini
array optionally skipping the first item. - Swaps in the new
l_inifini
&l_reldeps
arrays adding a NULL terminator, with an atomic writebarrier or two. (Their structure’s a later topic)
- For each dependency in
l_search_list
with a version attached, validated that version by:- Extracting relevant values exitting if missing.
- If there are
DT_VERNEED
data validate the records are of version 1 & repeatedly locate a given namespace of what’s already been loaded, if not tracemode or faked repeatedly callmatch_symbol
validating result, & iterate to next dependency. - If there’s
DT_VERDEF
defined locate highest address amongst them. - If it found a highest index
calloc
anl_versions
array & computel_versions
pointer, if there wereDT_VERNEED
s iterate over them & their auxiliary data populating thel_versions
array, & if there wereDT_VERDEFS
iterate over them to do likewise.
- Compile & runtime-optionally check if the namespace’s head enables auditting & if so iterates over the audit plugins & call their callbacks.
- Reinitialize the debugger & it’s state (later topic) to trigger a “probe” (from
). - Runs some platform-dependant validation.
- Optionally outputs “scope” debugging information (later topic).
- Gathers the relocation bitflags.
- Iterates over the
l_initfini
array to extract the first & last entries. - Iterates in reverse over those skipping ones which were already relocated
_dl_relocate_object
once or twice & optionally_dl_start_profile
on them. - It (re)allocs the scopes (later topic).
- Gathers all the
struct link_map
s from the searchlist withl_init_called
unset & al_tls_blocksize
(later topic). - If the
RTLD_GLOBAL
flag is set gathers the new entries into the_ns_main_searchlist
array again, with careful error handling as per before. - Updates flags on each
struct link_map
of the namespaces. - Updates the scopes & thread-local storage.
- Triggers another probe.
- Specially handles it’s own dependencies.
- Platform-specific code for the “static” sections.
- Calls initializers with careful error handling, by:
- Loads the
dl_initfirst
entry & if present calls it by checking whether it’s already been called according to it’s containingstruct link_map
, validates the entry can be called, optionally outputs a debugging message, calls platform-dependant code, calls it’sDT_INIT
entry if present, & iterates over it’sDT_INIT_ARRAY
entry if present calling each function reference contained within. - If there’s a
DT_PREINIT_ARRAY
it optionally outputs debugging information & iterates over it calling all the contained function pointers. - Begrudgingly (according to a comment) iterates over the
l_initfini
array calling them as perdl_initfirst
.
- Loads the
- Appends to namespaces array.
- Optionally output debugging info.
dl_relocate_object
, skipping already relocated objects & with optional debugging output, checks if there’s DT_TEXTREL
entries. If so it iterates over the PT_LOAD
headers to alloca
a struct textrels
for each & enable rewriting the mmap
‘d exectuable file, retrieve the stringtable to inform platform-specific code (both C functions & macros), optionally calloc
a l_reloc_result
array, mark the object as relocated, & restore the memory protections whilst clearing cache.
_dl_start_profile
meanwhile iterates over the PT_LOAD
headers extracting the start & end addresses, pads those addresses, and uses that to compute how much memory to allocate for profiling whilst checking whether it’s a power of two indicating it can take advantage of hardware optimizations. Then it assembles metadata inside the loaded ELF file, and opens, mmaps a file to which it can output profiling data to, & initializes entries for each branch/”arc”, & calls platform-specific code.
When checking each individual symbol (in match_symbol
) it optionally outputs debugging information, throws an error upon missing DT_VERDEF
entry, before iterating over those verdefs validating they have an ELF version of 1 & exits upon finding the desired vd_hash
value. If not found or a weak reference, throws an error.
Loading Dynamic Libraries (“dl-load”)
The process of “opening” an ELF file as described above thoroughly wraps a subprocess for “loading” the executable from disk, which is what I’ll discuss today starting from _dl_map_object
.
_dl_map_object
, after iterating over all already open namespaces to ensure this file hasn’t already been successfully loaded & optionally outputting some debugging info, compile & runtime optionally iterate over the loaded audit plugins to call their objsearch
refining the filename.
If the filepath (as refined by the auditting plugins’ objsearch
methods) does not specify a directory & no DT_RUNPATH
is configured on the “loader” it iterates over each loader to check the file can successfully be loaded from it’s configured colon-seperated DT_RPATH
s, then if present the main link map’s RPATH. Utility functions loads & splits those paths, and another concatenates the filename onto each & with careful error handling attempts to open the file.
If that’s unsuccessful or not run & theres a __rtld_env_path_list
to search, do so. Same for the loader’s l_runpath_dirs
, followed by platform-specific code.
If compiletime-configured to use LDConfig… Remind me to get back to this tangent today! And finally it fallsback to __rtld_search_dirs
global.
If there were folders specified in the filename, it iterates over the filepath to substitute any valid ${...}
“DSTs” into a newly malloc
‘d string & attempts to open it.
It validates the caller-given mode, & if it failed to find/open the file a “fake” entry is added to the namespace array if configured with DL_DEBUG_PRELINK
. Without that config this is an error to throw (later topic).
Otherwise it continues by attempting to call platform-specific code for retrieving file information. Compile & runtime optionally takes a shortpath for ld.so to avoid self-dependency. And a little more validation/debugging.
Compile & runtime optionally calls each audit plugin’s activity
method, alongside some debugging.
Then getting into the real logic it calloc
s/inits a new “object”, copies info from the header into the object, computes a maplength
to alloca
& pread
, iterates over Phdr
of which to fill in the appropriate object fields with special handling for “load commands” via _dl_map_segments
(which mmaps
& mprotects
the file segments by those headers), & similar for Dyn
entries.
If appropriate stack & mode flags are set, and sharedlibs are compiled-time enabled, it calls mprotect
on the stack memory segments with some platform-specific checks if available.
Aconfigured offset is added to the l_tls_initimage
if present.
Platform-specific code is called to process each PT_NOTE
or PT_GNU_PROPERTY
Phdr
.
The file is closed.
If sucessful it’ll increment the object’s l_entry
by it’s l_addr
. It creates a hashtable.
It replaces it’s searchlist with information from it’s new “scope”, updates a dl_initfirst
global & l_file_id
property, optionally appends a value to the lib’s name array, updates the corresponding namespace’s libc_map
property to point to this object, computes a thread-local-storage module ID (later topic), calls platform-specific code if available, appends the object to the namespaces array, & compile & runtime optionally calls each auditting plugin’s objopen
method.
Circling back to the LDConfig fallback file lookup logic, that involves (after optionally outputting debugging info) if it hasn’t already (global variable) it mmaps /etc/ldo.so.cache with careful error handling whilst validating the format with help from platform-specific code. It can then binary search this file, with it’s fixed bytesize entries, looking for the matching entry best (according to some platform-specific code & later some globals) for the available hardware capabilities.
The result is copied is copied into newly alloca
‘d memory, optionally checked if it’s within some configured directories to see whether it should filter it out, and attempts to open the looked-up filepath.
Those hardware capabilities globals which the ld.so.cache refers to are initialized alongside the rest of this “dl-load” component from heavily-processed caller-provided input.
Closing Dynamic Libraries (“dl-close”)
I’ve discussed how GNU LibC’s “dlfcn” plugin “elf” opens an Executable & Linkable Format file. Today I wish to explore the flipside & discuss how it closes these files, and in the process some of the underlying infrastructure I’ve been skipping over in previous “elf” threads.
After taking a lock, validating that the dynamic library hasn’t already been closed whilst waiting on that lock, & validating the refcount _dl_close
decrements that refcount & validates the lib’s state/type.
It looks up the namespace corresponding to the caller-given linkmap, iterates over it’s _ns_loaded
dependencies array collecting them into an array, clears used
& done
bitmasks, & iterates over that array again skipping over ones which (according to those bitmasks) have already been closed or aren’t still being used elsewhere. For each it flags it in both the done
& used
bitmasks before iterates over it’s l_initfini
then l_reldeps
arrays to mark those as used too.
It then sorts those gathered linkmaps by their partial order by moving each out of place items into place, in reference to seen
counts in a sidetable.
It then iterates over these linkmaps again this time in sorted order. For each of those libs which are unused & had their initializer called it optionally outputs some debugging information & with careful handling calls each function in it’s DT_FINI_ARRAY
followed by the singular DT_FINI
if present.
To output debugging information it uses it’s own simpler/self-contained printf
implementation (can’t rely on stdio-common
being loaded already?) which also prepends the process ID & a colon built-in, using _dl_writev
to output the computed string. _dl_writev
wraps __writev
(from “misc”, platform-dependant implementation) possibly wrapped with lock after startup.
For the careful error handling it temporarilies clear the catch_hook
global before calling the given callback.
_dl_catch_exception
(said careful error handling) may for other callers instead stack-allocate a new value for catch_hook
(storing a targets for the exception, errcode, & control flow) to be available whilst running the given callback. In the event of an error _dl_signal_error
updates these pointer values & longjumps into a _dl_catch_exception
branch. Or wrappers thereof will first consider running a callback function.
_dl_close
compile- & run- time optionally calls each audit plugin’s objclose
method, flags it’s l_removed
property & unload_any
local, if it’s a global linkmap increment the number of globals unloaded, saves the min loaded index, ensures libs with a l_initfini
have a l_searchlist
by reusing that memory, & counts scopes whilst determining which to remove.
If there were any scopes to remove it mallocs a computed number of scopes to copy each remaining scope into, freeing the old ones.
A scope is a sized array of struct link_map
s, & link_maps in turn are a doubly-linked list of named addresses, dynamic sections, & so much more. It attempts to free scopes via a dedicated freelist array & platform-specific threadsafety, fallingback to the malloc APIs.
Otherwise if there’s a new list it’ll set the searchlist to NULL. If the lib has a unused l_loader
& once again updates the first_loaded
index.
After that loop it exits if there’s nothing found to unload.
It compile- & run- time optionally calls the activity
method on all audit methods with one present, before initializing the debugger & it’s state.
To initialize the debugger it looks up the object from a global or the specified namespace & initializes it’s properties with configured values. Including a reference to _dl_debug_state
which is a noop intended for setting a breakpoint on in GDB. Most callers immediately triggers that breakpoint.
If there were any global libs to unload, count up the number of libs flagged l_removed
in _ns_main_searchlist
& rearranges that list if necessary to remove newly-opened holes.
If multithreaded with globals or scopes to free, performs some platform-specific synchronization & free()
s the that freelist.
With a lock held it iterates over the struct link_maps
one final time skipping any used entries.
For each unused item in this final iteration, if it has a l_tls_blocksize
it conditionally & recursively removes the “slot” restoring various properties (and, if successful, global) to previous values before (for TLS offsets other than NO_TLS_OFFSET
& FORCED_DYNAMIC_TLS_OFFSET
) collects a contiguous memory chunk under various compile- & run- time conditions updating dl_tls_static_used
field.
If caller-requested it clears each entry in the symbols table if present under a lock.
Calls caller-specified code to unmap all the mmap
‘d memory; removes the lib from the doubly-linked list (which may or may not be cyclic depending on build flags); decrements the _ns_nloaded
count; frees it’s l_versions
, if present l_origin
, & l_reldeps
; optional debugging output, frees l_name
, each item in the l_libname
linkedlist, l_initfini
, conditionally l_scope
, & if present l_phdr_allocated
& l_r[un]path_dirs
. Then NULLs dl_initfirst
field if that’s this lib.
After that final (locked) loop…
If there were any thread-local-storage to free it increments the dl_tls_generation
field translating numeric overflows into fatal errors, using _exit(127)
within the wrapper around it’s internal printf implementation it uses. If the dl_tls_static_used
field equals the computed tls_free_end
it resets it to the computed tls_free_start
.
It compile- & run- time optionally calls the activity
method on all auditting plugins where it’s present.
Finally it conditionally truncates the namespaces array, updates the debugging state triggering the breakepoint & a probe. And if the local state is rerun
it jumps right back up just after it converted the linkedlist of libs into an array, otherwise it sets the state to not_pending
.
Symbol Lookup (“dl-lookup”)
Once you’ve opened/loaded a dynamic library for that to be of any use you need to lookup symbols within it. For that you use _dl_[v]sym
via a wrapper from “dlfcn”.
If the given lib’s RTLD_DEFAULT
dl_[v]sym
iterates over every linkmap in every namespace to find which symbol the caller is in. Then calls dl_lookup_symbol_x
(which I’ll come back to), if multithreaded it does so with a GScope platform-specific flag temporarily set whilst forwarding exceptions.
If the given lib’s RTLD_NEXT
then dl_[v]sym
locates in which symbol the given caller is in, validates it’s dynamically-loaded code, locates the last struct link_map
in it’s l_loader
linkedlist, & dl_lookup_symbol_x
s within that loader’s scope.
Otherwise directly wraps dl_lookup_symbol_x
within the given lib.
If the result’s in thread-local-storage it’ll call the platform-specific THREAD_DTV
macro to retrieve a TLS global, with various postprocessing under different conditions.
If the result’s not thread-local-storage it calls a platform-specific macro to compute the pointer from the looked-up symbol, which may involve taking a special codepath (_dl_make_fptr
if it’s annotated as being a function.
_dl_make_fptr
allocates (via mmap
with error handling) a global function pointer table if needed, resolves the offsetpointer, validates the index within that table is in-range, & atomically inserts a new entry into that table which it returns.
Additional postprocessing for non-thread-local-storage involves following indirect function references if relevant via platform-specific code, & if run- & compile-time enabled looks up the caller symbol & calls each audit plugin’s symbind
method for it.
Looking up the symbol by given name in the given linkmap involves incrementing the number of relocations, validating there’s a version & no DL_LOOKUP_RETURN_NEWEST
flag, if there’s a skipmap counts scope entries matching it, iterates over the different scopes to apply the core logic to each, throws an exception if not found, checks whether it’s protected and if so possibly performs the lookup again, for lt_loaded
types with DL_LOOKUP_ADD_DEPENDENCIES
flag set it adds undef_map
to the dependencies & retries, flags the symbol as used, optionally outputs extensive debugging info, & applies some platform-specific usually-noop logic.
That even more core logic involves iterating over the scope (with a compiler readbarrier on it’s length) skipping entries which match the given skip value, executable copies, removed, or empty values whilst outputting optional debugging info. For each it retrieves it’s symbol table & checks it’s “bitmask” (or rather bloomfilter by the look of things, GNU extension) if present allowing it to perform a hashmap lookup. Otherwise uses array lookup.
undef_map
is handled specially.
weakpointers, globals, & (GNU extension) uniques are specially. Especially uniques, which has additional hashmap lookup logic inserting new values & resizing if necessary.
Then it iterates onto the next entry in the r_list
.
Commandline Tooling
Finally, GNU LibC’s “dlfcn” sublibrary’s “elf” extension provides some commandline tools for working with Executable & Linkable Format files.
pldd
is a command for listing shared libraries linked into a ELF file. To do so, with parsed commandline flags & singular integral argument, consults /proc/*/exe
symlink to retrieve the corresponding command, consults /proc//task/ ptrace attaching to each thread, retrieves/outputs the data (get_process_info
), & cleans up.
It opens /proc/*/mem
, /proc/*/exe
validating that it is indeed an ELF file, /proc/*/auxv
, & calls the 32 or 64bit variants of find_maps
. This iterates over the auxv
array extracting the values of the ones of type AT_PHDR
, AT_PHNUM
, & AT_PHENT
. It malloc
s & pread
s AT_PHNUM
* AT_PHENT
bytes of /proc/*/memfd
, allowing it to compute the AT_PHDR
runtime pointer offset, find it’s PT_DYNAMIC
header & malloc
s/pread
s those sections searching for DT_DEBUG
entry, it finds PT_INTERP
it mallocs
the specified ammount of bytes from the program’s memory, outputs PID & program name, initializes a scratch buffer, reads the runtime struct link_map
, reads each import resizing the scratch buffer as necessary & outputs each immediately once read.
ldconfig
, with localization initialized & commandline flags, validates that any remaining arguments are absolute paths & retrieves the “LD_HWCAP_MASK” integral environment variable.
The absolute path commandline args, parses & allocs a new dir_list
global linked list entry whilst checking duplication & saving chroot params. If successful it traverses it’s subdirs specified by hardware caps.
If a chroot was set calls that syscall & cd to /.
Retrieve fallback values cache_file
& config_file
. If the -r flag was set open
, fstat64
, & mmap
the specified cachefile validating it is indeed a cachefile of the correct endianness, printing each entry therein & count.
Performs additional filepath normalization if chroot configured.
If configured to apply specified links manually, iterates over those commandline args further normalizing/validating the filepaths, open
s & mmap
s them validating it is a ELF file, reads it’s EI_CLASS
, checks for ET_DYN
type, iterates over the ELF headers retrieving data from & validates PT_DYNAMIC
, PT_INTERP
, PT_NOTE
, & PT_GNU_PROPERTY
(platform-specific code) ELF headers.
Then iterates over the segments looking for the string table, followed by DT_NEEDED
& DT_SONAME
entries. Then computes the soname
if missing, & creates a symlink from that soname to the lib name whilst outputting debugging info.
If not manually linking it instead continues by optionally clearing the caching linked list.
Optionally opens & line-based parses (recursing for include
directives) a config file, followed by adding two system dirs populating the aforementioned linkedlist.
The system dirs are registered via CPU-specific code.
It normalizes the filepath to the “aux cache” file, loading (via mmap
into a struct whose entries are loaded into a closed-addressed hashmap) or initializing it as needed or configured via the -C or -i commandline flags.
Then iterates over the dir_entries
global linkedlist, gathering all libs in each of these dirs (incorporating hardware capabilities), checking whether they’re in the cache already, & processing them as before.
On each iteration it iterates over the gathered libs again before freeing them to create appropriate symlinks given hardware capabilities & insertion sorts into the cachefile & stringtable.
If -n or -N commandline flags are not set it saves the cache & possibly auxilliary cache by, after preprocessing & output headers, iterating over the gathered cache entries referring to stringtable copying them into the mmap
‘d cache. Saving the results in a binary file, 1 of 2 formats.
Text Encodings (“iconvdata”)
Computers are only as smart as you make it, and even then it don't know nothing. It can store. It can store information but can't interpret it. It doesn't know what it means unless you give it meaning. It don't mean nothing unless you tell it to mean something, and even then the computer doesn't know what meaning is. Computers/Commutors by Dr Rev Churchdoktor
There’s historically been numerous text encodings representing different alphabets, most of which are, like UTF8 which we use near-universally today, supersets of ASCII. IBM certainly designed a lot of them, I guess they were dominant in the appropriate era & “International” is their first name!
Most of these text encoders/decoders are written as converters to/from UCS, consisting mostly of Perl-generated lookup tables from UNIX datafiles.
While there may occasionally be a tiny bit more logic, converting between charsets the “iconv” extensions consist of basically just lookup tables converting chars to/from an internal UCS4-like.
It is reasonable to view “iconv” as legacy code taking the stance we should all be using Unicode, e.g. UTF8 specifically, now. But this diversity of text encodings (mostly from IBM or ISO) played a vital role in the past & who knows who’ll still need to read those text files?
There’s occasional 0xhex TSV files (listed within another TSV) to test these charset converters, with a corresponding testrunner shellscript.
There’s typically seperate lookup tables for conversion to UCS (of unicode codepoints) & conversion from UCS (of strings). Typically there’ll be C macros (arranged inconsistantly in different files, sometimes even in a seperate file) specifying identifying, sizing, & lookup logic to encorporate into a template (or an extension thereof) provided by the “iconv” sublibrary these extend.
The lookup tables may be smallintmaps (most frequent), parallel arrays, state machines, etc.
There’s a datafile listing the still unsupported text encodings, and another listing the supported text encodings alongside their aliases which dynamically loaded modules to use for them.
The JOHAB (Korean Hangul Jamo) text encoding is an interesting one: 5bit consanant, 5bit vowel, & 5bit consanant. Requires more sophisticated transcoding & lookup tables!
euc-jisx0213
is mostly macros rather than datafiles. asmo_449
doesn’t list it’s lookuptables in C form.
Then there’s the issue of composition/decomposition in certain charsets/alphabets… And there’s ones with different lookuptables for different ranges…
While most text en/de-coders for “iconv” use the templates, some implement the conversion themselves. Skipping the trivial cases like jis0201
& simple cases like jisx0213
which computes which lookuptable row to consult, UTF32 conversion upon init parses the name & stores (alongside lookahead requirements) it inside given converter data.
To convert UCS to UTF32 an initial iteration is performed to seek out any byteorder marks determining whether to output bigendian or little endian retrieves each 32bit char, validates it, increment readpointers, swap byteorder if needed, & outputs the 32bit char.
To convert from UTF32 to UCS it retrieves the 32bit char, swaps byteorder if needed, validates more simply, outputs the 32bit char, & increment read & write pointers by 4.
UTF16 has similar initialization but if the 32bit char is larger than 16bit it bitmasks it in & adds some high-order bits in appropriate byteorder (byteorder check moved further out to play well with the CPU’s branch predictor). And there’s the inverse operation checking those highorder bits.
I’ll skip describing the UTF7 encoding designed for apps which really breaks on non-ASCII chars.
There’s a converter to/from the internal UCS4-like from/to UCS2. UTF8 also has similar logic, but is also the only encoding directly implemented, statically compiled into it, by the “iconv” sublibrary.
There’s some code shared between euc-kr
, iso-2022-jp
, iso-2022-kr
, & uhc
text encoding “iconv” extensions. This includes a function ksc5601_to_ucs4
which reads 2 chars which it validates & computes into a singular index split accross 3 lookuptables.
`ucs4_to_ksc5601 determines the appropriate lookuptable to use & performs a binary search & divides the result by 94 to take the result & remainder as the output chars, or reads the two entries in the lookuptable.
There’s a korean text encoding which can switch back & forth between ASCII, using an escape code like "<esc>$)C"
. Which charset the converter’s in needs to be tracked. Some Japanese & Chinese text encodings do similar.
There’s a suite of asian text encodings (standardized by ISO) which support switching between them. These are implemented in a single “iconv” plugin with a giant switch
statement in place of lookuptables.
The IBM text encodings are diverse in how much they match or mismatch Unicode. Most are converted to/from the internal UCS4-like via straightforward lookup tables. Some like ibm1364
requires tracking (typically 7bit) state between each byte to handle “combined” chars.
There’s variable-length char encodings like ibm939
(Japanese), ibm937
(Chinese), & ibm932
, generally using that 7bit state. I see this called “EBCDIC” encoding.
There’s conversion between GBK & GB2312 encodings.
There’s gb2312
common between iso2022-jp
, iso-2022-cn-ext
, euc-cn
, & iso-2022-cn
text encodings, implementing it’s from UCS4 lookuptable as a C switch
statement. The to UCS4 function looks up, validates, & outputs 2 chars from it’s lookuptable.
euc-tw
combines ASCII & cns11643l1
’s two “planes”, with invalid codepoints between them & no lookuptables of it’s own. euc-kr
adds Unicode decomposition.
euc-jp
likewise combines ASCII, jisx0201
. & jisx0208
.
And euc-cn
combines ASCII & gb2312
.
cns11643l2
is embedded into iso-2022-cn[-ext]
, using a 2byte encoding. As does cns11643l1
(uses C switch
inplace of from UCS4 lookup table) but also shared with euc-tw
.
cns11643
(uses multiple to/from UCS4 lookup tables, and also a C switch
from UCS4) is embedded in iso-2022-cn-ext
& euc-tw
.
Localization (“localedata”)
Different cultures do fairly fundamental things differently. That may be stating the obvious, but it is something core operating system components like GNU LibC needs to deal with. As such it’s “locale” sublibrary has extensive datafiles & a few scripts to configure the GNU LibC’s other sublibraries including even case-insensitive string comparison!
The LC_CTYPE
translit_start/end
section specifies which char(s) to replace each char with for comparison. Some reusable ones are autogenerated.
include
directives within translit_start/end
loads these reusable substitution configs (which may even remove chars entirely for comparison).
Within LC_CTYPE
you may also configure lists of chars for the different char classes. Not sure what LC_COLLATE
lists char substitutions for case-insensitive comparisons (this isn’t necessarily converting everything to lowercase). POSIX locale sticks to ASCII, others include i18n
configs.
Locale files have metadata LC_IDENTIFICATION
sections.
LC_COLLATE
may define script
, collating-symbol
, copy
(loading another file), reorder-after
, & symbol-equivalence
directives.
LC_MONETARY
defines formatting params for denoting prices, with LC_NUMERIC
being very similar.
LC_TIME
provides localized smallintmaps for weekdays, months, etc & standard format strings.
LC_MESSAGES
defines regexps for yes & no responses.
LC_PAPER
defines paper size.
LC_MEASUREMENT
whether to use Imperial or Metric.
LC_NAME
, LC_ADDRESS
, & LC_TELEPHONE
defines localized printf
format strings.
This config format is line-oriented with start & (often preceded with END
keyword) stop directives. The initial word defines the key under which the value is stored, or in some sections defines which char is being substituted. A different config file is provided for each locale, and some directives (
copy &
include`) may import directives from another file. Many of those are autogenerated.
In a simalar format there’s a directory of “charmaps”.
In ;-commented key-space-value files there’s test files for those charmaps. Other files list ;-commented sample chars (typically one-per-line). Not sure how these two are used.
There’s a script which iterates over all character classes & mappings, & all chars within each of both to output the codepoints in the character class & the key-tab-value mapping into correspondingly-named files.
Another translates codepoints into chars.
Unicode Converter Scripts
Amongst the locale datafiles are directories for testing, and one for generating locale configuration from datafiles maintained by The Unicode Consertium who makes it their job to assign a number to every letter, etc from every alphabet so that they can all be represented inside computers.
Unicode’s datafiles are #
-commented ;
-seperated values, identifying chars by their hexadecimal (base 16) number often specifying ranges using the ..
infix operator.
There’s a Python module for reading in these SSV files used by all these internal commandline tools. Or determine classifications from that
The main datafile lists each char’s ID, name, category, combining classes, bidirectional category, decomposition mapping, decimal digit value, digit value, numeric value, mirrored, old name, upper, lower, & title. Another lists the classification of each char-range & another lists the width for each char-range (mostly for kanji on the terminal).
Around these GNU LibC implements Python scripts for:
- Generating a UTF8 charmap configfile & another to test it’s backwards compatibility.
- Generating a
LC_CTYPE
localefile config section & another to test it’s backwards compatibility. - Generating
translit_fraction
localefile config sections, as per_font
,_compat
,_combining
,_cjk_compat
,_circle
, - Generating
translit_font
localefile config sections, as per_compat
,_combining
.
Each script tends to work, after commandline flags are parsed, by:
- Parse the Unicode datafiles via the aforementioned Python module.
- Output that parsed data in it’s new format (including preludes & epilogues, may be copied over from an existing file), or parse the specified files & use Python’s
set
implementation to describe what’s changed between the input data.
The UTF8 charmap generator parses the main Unicode file itself…
gen_unicode_ctype
requires slightly more processing…
gen_translit_combining
involves slightly more processing to convert decomposition into substitutions, in part via a simple recursive function.
ctype_compatibility
set-compares the two given files, before running a testsuite over each with a custom testrunner.
DNS client (“resolv”)
On the Internet we need to name (one of the hardest software challenges) hosts. Every computer has an often-temporary numeric IP address assigned to it (e.g. 219.88.233.43
), but for some you often want to assign a unique human-readable name (e.g. adrian.geek.nz
) anyone can lookup via their ISP. These are managed centrally with ICANN at the root of the hierarchy, and consulted via a vast decentralized CDN.
“resolv” (a BIND fork) extends “nss” (via the _nss_dns_getcanonname_r
, _nss_dns_gethostbyname[2,3,4]_r
, _nss_dns_gethostbyaddr[2]_r
, _nss_dns_getnetbyname_r
, & _nss_dns_getnetbyaddr_r
callbacks) to support looking up rented (I wish there were other options…) names via DNS, whilst providing a public API much of which can be overriden by platform-specific code.
I’ll focus on exploring from _nss_dns_gethostbyname_r
.
After validating the domain name, retrieves a refcounted global struct resolv_context
alloc/init’ing it whilst alloc/pars’ing (under a lock) a commented configfile if necessary, then hands off to gethostbyname3_context
(legacy code tries IPv6 first, but that’s called out as legacy code). The configfile also has a refcounted global, only (re)parsed when the file changes, ensures /etc/hosts is parsed too, can override domain via “LOCALDOMAIN” envvar.
After it’s input validation/reformatting gethostbyname3_context
first tries looking up user-level aliases if there’s no dots in the domain name by scanning the 2 column TSV file specified by the “HOSTALIASES” envvar. This may be another domain name.
It then allocates a 1k buffer & calls __res_context_search
handling it’s errors appropriately, followed by getanswer_r
to convert the data from BIND (most of this code comes from BIND) to “nss” formats.
__res_context_search
first counts up dots & trailing dots, if there were any dots at all it tries consulting that $HOSTALIASES file & queries the looked-up value for the response, if there’s enough (or any trailing) dots it runs the query with a length check.
In certain cases (no dots & RES_DEFNAMES
, or dots without any trailing & RES_DNSRCH
) it looks up the name under all configured search_list
or dnsrch
domains.
If all those failed & dots it runs the query with a length check.
From there it initializes a struct
of DNS headers & “nopt” or (for “A” & “AAAA” lookups) retrying upon lack of allocated buffer space, & reencoding the domain names being looked up weirdly, & sends it via UDP (or cached TCP socket) to the configured server to receive a binary response. Functions are provided for parsing certain fields like “TTL” (Time To Live a.k.a. cache timeouts) of this binary struct
, for serializing it to text for debugging, and for textually-comparing domain names (e.g. ns_same_domain
for efficiently checking whether a
is a subdomain of, or same as, b
& ns_makecanon
for stripping non-escaped trailing dots).
Multi-name Lookups
The central public API of GNU LibC’s “resolv” sublibrary is getaddrinfo_a
which, with validated arguments & a lock held, iterates over the caller-provided array to enqueue those requests whilst counting up successes. If all fails it notifies a provided sigevent
.
If the caller-specified to wait on all the caller-specified requests it iterates over them again to configure an atomic condition which it’ll wait on. Otherwise it’ll configure a “waitlist” notifying the caller PID.
Upon enqueueing a DNS request onto that global linked queue (with it’s own freelist), GNU LibC’s “resolv” checks whether to start a new thread (none idle & unreached limit) to service them & triggers an atomic signal to wake those threads if necessary.
It creates a “DETACHED” pthread to repeatedly call getaddrinfo
(which has a platform-dependant implementation) for each request, saving the response & under lock notifying (via platform-specific code) the caller. freeing the request into said freelist, awaits next task, if necessary sleeping or starting a new thread.
Timezones, Landmarks, & Time History (“timezone”)
Time, at least how denote it, is a very political construct. For a variety of reasons people in different countries adjust their clocks forward or back. And we’ve got timezones!
GNU LibC’s “time” datafiles lists, in #
-commented TSV, all the times different regions, etc (seperate files) adjusted their clocks (to correct for when computing diffs), timezones (possibly in same file), coordinates of timezone-landmark cities, timzone aliases, leap years, & ISO country codes.
Alongside these heavily-commented TSV files are some C header files defining structs which can hold the same data, and expose “time”’s API.
There’s a shellscript which, after parsing the commandline flags using a getopt
command & ensuring UTF-8 text encoding is used regardless of locale, an AWK script is prepared to compute the distance of each landmark city from a given coord (maths actually looks correct!), & repeatedly uses AWK scripts & case
branching to resolve user input, & sets $TZ.
And finally there’s C scripts for converting all this timezone & historical clock adjustment data between it’s text & binary forms. The “time” sublibrary refers to these binary files.
The text format requires significant (de)serialization (postprocessed once parsed), and some of the C standard library is reimplemented in case it’s running on a less full-featured UNIX than GNU. Because one Arthur David Olson maintained this flatfile database, in the public domain, for all UNIXes.
Platform-Dependant Sublibraries
Sublibraries in this section are almost-entirely (or fully) implemented with platform-dependant code. It is GNU LibC’s job to expose platform-independant APIs to this platform-dependant code, and these libraries consist of stubs which are overriden by the appropriate platform-specific code which will be discussed later.
- “resource” provides an API for retrieving the kernel’s metrics of your process’s resource usage. Abstracting these
setrlimit64
wrapssetrlimit
(similar forgetrlimit
).vlimit
wrapsgetrlimit
&setrlimit
updatingrlim_cur
property, &vtimes
wrapsgetrusage
for both the current process & it’s summed children converting the datamodel.ulimit
setrlimit
&getrlimit[64]
setpriority
nice
getrusage
- &
getpriority
- “rt” provides functions for event handling & faster IPC
timer_settime
&timer_gettime
timer_getoverrun
,timer_delete
, &timer_create
shm_unlink
&shm_open
mq_unlink
mq_timedsend
&mq_timedreceive
mq_setattr
&mq_getattr
mq_send
&mq_receive
mq_open[_2]
,mq_notify
, &mq_close
lio_listio
aio_write
&aio_read
aio_suspend
,aio_sigqueue
,aio_init
(defaults to noop),aio_fsync
, &aio_cancel
aio_return
(retrieves stored__return_value
property by default) &aio_error
(retrieves stored__error_code
property by default)
- “setjmp” allows saving CPU state to jump to elsewhere in the program.
longjmp
may or may not addlongjmp_unwind
&sigprocmask
calls, which pthread alway uses.__sigjmp_save
abstracts__sigprocmask
.__sigsetjmp
&[sig,_]longjmp
- “signal” allows registering callbacks with the kernel to respond to various events, which may be triggered other processes including itself.
sigvec
wraps an internal array,sigset_set_old_mask
(optional),sigaction
, & optionallysigset_get_old_mask
.sigvec_wrapper_handler
wraps that same array,sigset_set_old_mask
&sigaction
.sysv_signal
sigwait[info]
sigtimedwait
sigsuspend
sigstack
&sigsetmask
sigset
&sigreturn
sigqueue
sigprocmask
sigpending
[xpg___]sigpause
signal
siginterrupt
sigignore
sigfillset
sigblock
sigaltstack
sigaction
raise
kill[pg]
sigemptyset
sigdelset
sigandset
sigaddset
- “socket” allows opening comms to programs on another, or the same, computer utilizing the IETF or POSIX specs.
libc_sa_len
retrieves thesizeof
for the given addressing family.opensock
wrapssocket
caching the successful family.socket[pair]
sockatmark
setsockopt
&getsockopt
sendto
&send[[m]msg]
recv[[m]msg]
&recvfrom
isfdtype
getsockname
&getpeername
- &
listen
,connect
,bind
,accept[4]
, &shutdown
- “sysvipc” exposes useful IPC (allowing different programs/daemons to communicate) constructs first introduced by System V unix.
ftok
wrapsstat64
to compute a bitpacked “key” from the resultingst_ino
,st_dev
, &proj_id
properties.shmget
,shmdt
,shmctl
, &shmat
semtimedop
,semop
,semget
, &semctl
msgsnd
,msgrcv
,msgget
, &msgctl
- “termios” expands the expressivity of what you can output to post-teletype terminal windows.
tcgetsid
wrapsioctl(TIOCGSID)
, falling back totcgetpgrp
& thengetsid
.cf(g,s)et(i,o)speed
&cfmakeraw
updates internal properties on the givenstruct termios
, further abstracted via lookup table bycfsetspeed
.tcsetpgrp
&tcgetpgrp
tcsetattr
&tcgetattr
tcsendbreak
tcflush
tcflow
- &
tcdrain
The sublists names the APIs exposed by these sublibraries with platform-specific implementations.
Trigonometry & Other Mathametical Functions (“math”)
Most of the tasks we set computers involves mathematical computation in some form or other. Heck, much of computers’ I/O is expressed as numbers! GNU LibC has a sublibrary “math” implementing many common mathematical functions to help with this, which I’ll describe today.
The implementation is almost entirely platform-specific, though there are a relative functions which are the exception to that rule.
ctan
, after handling non-finite floatingpoint numbers specially, computes both the sine & cosine for both the real & imaginary components of the input to compute tan(x+iy) = (sin(2x) + i*sinh(2y))/(cos(2x) + cosh(2y)) = (sin(x)*cos(x) + i*sinh(y)*cosh(y)/(cos(x)^2 + sinh(y)^2)
.
ctanh
does similar to compute tanh(x+iy) = (sinh(2x) + i*sin(2y))/(cosh(2x) + cos(2y)) = (sinh(x)*cosh(x) + i*sin(y)*cos(y))/(sinh(x)^2 + cos(y)^2)
.
csqrt
, after handling non-finite number specially, checks the classification of it’s input’s real & imaginary components. Computing the answer differently if the imaginary component is 0, or the real component is zero, or checking which quadraint the two components are in; before checking for underflows & assembling the output complex number.
csin
, csinh
, cexp
, ccosh
, catan
, & catanh
checks the classification of the real & imaginary components in handling it’s non-finite cases while using normal trig (or computes hypotenuse) logic in success path.
clog
refers to the component’s classifications for error-reporting, computes scale
, absx
, & absy
based on how the absolute components of the input compare then calls off to different functions based on the range of the results. The imaginary output is computed by atan2
. As does clog10
. casin
wraps casinh
.
casinh
& cacosh
both uses classification error handling & wraps __kernel_casinh
on happy path.
multc3(a, b, c, d)
computes ac - bd + (ad + bc)i
, with error handling.
casinh
, depending on the input’s relationship to it’s specificity, calls onto clog
/M_LN2
/COPYSIGN
, M_HYPOT
/M_LOG
/M_ATAN2
, M_SQRT
/M_LOG
/M_ATAN2
, M_SQRT
/M_LOG1P
/M_ATAN2
, or csqrt
possibly as part of slightly more complex multiply-add computations. Then reassembles complex output with signs.
fromfp_max_exponent(bool negative, int width)
, fromfp_round(bool negative, uintmax_t x, bool halfbit, bool more_bits, int round)
, fromfp_overflowed(bool negative, uintmax_t x, int exponent, int max_exponent)
, fromfp_domain_error(bool negative, unsigned int width)
, & fromfp_round_and_return
performs simple conditional maths on these floating point components.
exp2
wraps scalbn
, exp
, & LN2
with overflow checks & other error handling. Not clear what divtc3(a, b, c, d)
is doing.
Functions which simply wraps other functions (and may be overriden by platform-specific code), etc include cimag
which dispatches to some GCC-specific syntax, carg
which wraps atan2
, cabs
which wraps hypot
, ldexp
which wraps M_SCALBN
with error handling, w_log1p
wraps log1p
, & w_scalbln
wraps scalbln
.
For a bit more fallback logic when lacking platform-specific code nexttowardf
compares the two given floats’ exponents to determine whether to decrement or increment. nextafter
does basically the same thing.
Several functions fallback wraps corresponding platform-specific functions with more error handling. These include:
tgamma
sqrt
scalb
remainder
pow
log
log10
log2
llogb
lgamma
jn
&yn
j1
&y1
j0
&y0
ilogb
hypot
fmod
exp
exp10
exp2
cosh
- &
atanh
Functions which wraps the platform-specific kernel_standard
function includes:
exp2_compat
j0[f,l]
&y0[f,l]
j1[f,l]
&y1[f,l]
lgamma[f,l]
log[2][f,l]
pow
remainder[f,l]
scalb[f,l]
sinhl
tgamma[l,f]
Most of these also calls the corresponding platform-dependant function under certain conditions, and they all add some error handling.
Functions with fully platform-dependatn implementations include:
ftestexcept
fesetexceptflag
fraiseexcept
fgetexceptflag
feupdateenv
fesetround
fesetmode
fesetexcept
fesetenv
feholdexcept
fegetround
fegetmode
fegetexcept
fegetenv
feenableexcept
fedisableexcept
- &
feclearexcept
CPU & Kernel Ports
GNU LibC’s whole job is to expose platform-independant APIs upon platform-specific code. While much of the surrounding code can be platform-independant, GNU LibC manages all their platform “ports” (as I’ll call them) in it’s “sysdeps” directory as explained for x86_64 Linux here. Numerous other platforms are also supported.
UNIX-like Port
The different syscalls are listed in a TSV file (name, caller, syscall name, mneomonic arg types, strong name, & weak names cols) which is converted into C code via a shell script. Referring to a C preprocessor header file to retrieve the syscall number, which defines macros to instantiate CPU-specific assembly-code template. Whilst validating args to be compiled via C calling convention.
The core syscalls GNU LibC supports are:
accept
access
acct
adjtime
bind
[f]chdir
[f]chmod
[f]chown
chroot
close
connect
dup[2,3]
fcntl
fstatfs
[f]truncate
getdomain
getgid
getgroups
gethostname
getpeername
getpid
getpriority
getrlimit
getsockname
getsockopt
getuid
ioctl
kill
[read,sym,un]link
listen
lseek
madvise
mkdir
mmap
mprotect
munmap
open
profil
ptrace
read[v]
reboot
recv[from,msg]
rename
rmdir
select
send[msg,to]
setdomain
set[e,p,re]gid
set[[r]e]uid
setgroups
sethostid
sethostname
setpriority
setrlimit
setsid
setsockopt
shutdown
sigaction
sigsuspend
socket[pair]
statfs
swapoff
swapon
sync[fs]
umask
uname
utimes
vhangup
write[v]
More Linux- or BSD-specific syscalls may be defined.
The Assembly template used by these involves storing the syscall number in the RAX register & using the syscall
opcode, possibly wrapped with error handling to save returned error codes to the errno
global and/or convert EWOULDBLOCK_sys to EAGAIN.
For trivial UNIX code:
sockatmark
wrapsioctl(SIOCATMARK)
.fcntl_compat
wraps__libc_fnctl64
.getpagesize
returns the appropriate macro-defined global__get_child_max
wraps__getrlimit
- $PATH defaults to “/bin:/usr/bin”
Slightly more complex, getlogin
wraps ttyname_r
& (surrounded by setutent
& endutent
) getutline_r
.
ifreq
calls opensock
if it’s not given an open socket, calls ioctl(SIOCGIFCONF)
with exponentially buffer sizes, compiletime-optionally followed by repeated if_nextreq
calls.
pts_name
repeatedly ptsname_internal
with exponentially larger buffer sizes. I’m not seeing the definition for ptsname_internal
outside of HURD.
grantpt
retrieves input from pts_name
, getuid
(optionally with a chown
to ensure we own the shell), getgrnam_r
(buffer allocated via alloca(sysconf(SC_GETGR_R_SIZE_MAX))
) once or getgid
(also optionally with a chown
). With that info it wraps an optional chmod
, possibly via a privileged command.
And getlogin_r
wraps ttyname_r
& locked (via mutex & set/end-utent
) __libc_getutline_r
.
Implementing POSIX Upon Core UNIX
While I don’t believe it’s needed to get GNU LibC running on Linux, GNU LibC does implement the more refined POSIX APIs upon the older UNIX ones. The way it implements asynchronous I/O is especially interesting…
Studying the simplest cases first:
usleep
wrapsnanosleep
sysv_signal
&sigignore
wrapssigaction
sigblock
wrapssigset_set/get_old_mask
&sigprocmask
shm_unlink
wrapsunlink
shm_directory
returns “/dev/shm/”remove
wrapsunlink
with fallback tormdir
raise
wrapskill
&getpid
pwritev64v2
wrapswritev
orpwritev64
depending on whether an offset’s given- Similarly
pwritev2
wrapswritev
orpwritev
depending on whether an offset’s given - Equivalent for
pread[64]v2
pause
wrapssigprocmask
&sigsuspend
libc_open64
wrapslibc_open
mkfifo[at]
wrapsmknod[at]
killpg
wrapskill
isfdtype
wrapsfstat64
isatty
wrapstcgetattr
getpagesize
wrapssysconf(SC_PAGESIZE)
gethostname
wrapsuname
getdtablesize
wrapsgetrlimit(RLIMIT_NOFILE)
flock
wrapsfcntl
dup
wrapsfcntl(F_DUPFD)
dl_get_file_id
wrapsfstat64
dl_file_id_matches_p
compares thedev
&ino
properties of the 2 givenstruct r_file_id
scuserid
wrapsgetpwuid
&geteuid
ctermid
returns “/dev/tty”clock_getres
wrapssysconf(SC_CLK_TCK)
clock
wrapstimes
- &
alarm
wrapssetitimer
Slightly more complex, there’s:
dup2
wrapsfcntl(F_GETFL)
&fcntl(F_DUPFD)
[f]pathconf
returns the appropriate constant or wrapsfstat[vfs]64
libc_pread[64]
wrapslibc_read
surrounded bylibc_lseek[64]
callsnice
wrapssetpriority
surrounded bygetpriority
callslibc_pwrite[64]
wrapslibc_write[64]
surrounded bylibc_lseek
callspwritev
concatenates the given bytearrays intommap
-allocated memory & wrapspwrite
rename
wrapslink
thenunlink
with error recoveryshm_open
wrapsopen
upon a computed filepath surrounded by calls topthread_setcancelstate
sigpause
(and it’s wrappers) wrapssigprocmask
&sigdelset
orsigset_set_old_mask
in either case followed bysigsuspend
bsd_signal
wrapssigaction
with it’s args prepared bysigemptyset
,sigaddset
, &sigismember
siginterrupt
wrapssigaction
with it’s args prepared bysigaddset
orsigdelset
sigset
wrapssigprocmask
,sigismember
, &sigaction
in one of two orderssigsetmask
wrapssigprocmask
surrounded by calls tosigset_set/get_old_mask
sigsuspend
wrapspause
surrounded by calls tosigprocmask
sleep
wraps repeatednanosleep
calls, to avoid getting awoken prematurelytruncate
wrapsftruncate
surrounded byopen
&close
callssysconf
wraps various globals, some of which are exposed via functions.utimes
wrapsutime
& vice versa
More complex ones involve:
writev
works likepwritev
except withmalloc
‘d memory[p]readv
does the reverse process, with thep
prefix indicating whether to usemmap
ormalloc
allocated memoryulimit
wrapsgetrlimit
,setrlimit
, orsysconf
depending on the given commandsigwait
wrapssigaction
(temporarily-set) &sigsuspend
to wait for the given child processes to exiteuidaccess
wrapsaccess
orstat64
with cached real/effective user/group IDsposix_fallocate[64_l64]
wrapsfcntl(F_GETFL)
,fstat[fs]64
,S_ISFIFO
&S_ISREG
, optionallyftruncate[64]
, & repeatedpread[64]
s &pwrite[64]
s- &
spawn_i[x]
wrapspipe2
,libc_ptf_call
withpthread_setcancelstate
, validates there aren’t too many args, &fork
- In the parent process continues with
close
,read
orwaitpid
, & againlibc_ptf_call
withpthread_set_cancel_state
. - In the child process
close
, optionallysigaction
withsigismember
, compile- & run-time optionallysched_setparam
&/orsched_setscheduler
, account ID setters, interpretes the given/optionalfile_actions
withclose_nocancel
calls & the like, & the givenexec
callback (typicallyexecv[p]e[x]
) with fallback logic handing it to the/bin/sh
command. Tidies up upon error.
- In the parent process continues with
Relying more on magic numbers/strings there’s:
getcwd_generic
which expands an absolute path from the system current working directorylibc_message
&libc_fatal
wrapswritev
with error recovery & it’s own format string logic limited to “%s”, followed byabort
profil
configures a timer (viasetitimer
on a dedicated clock) to periodically trigger an interrupt which looks up the given program counter in a hashmap & increment it’s bucketgetttyname[_r]
opens & iterates over the “/dev” dirstat
ing each file looking for the one with the givenst_rdev
&d_fileno
properties; wrapped byttyname[_r]
gen_tempname
locates (viastrspn
) all the “X”s in the given template with pseudorandom characters, in same file there’s lookup for the pre-existing tempdir from the $TMPDIR environment var “:”-seperated pathsystem
wrapsposix_spawn
(with additional action reconfiguration) to execute “/bin/sh” with the given command- &
sprofil
sums the profiled counts (from the previously-mentioned hashmap) within each function & outputs in sorted order.
Finally there’s a hader file for some error codes & textual descriptions this platform-binding might emit, some code for looking these up in their table, & extensive logic (with a lookup table of sizes for different output structs) for looking up textual host identifiers to IP addresses, etc using the various platform-independant logic (including NSS) I’ve described previously. I won’t bother with details.
Various of these functions can be overriden elsewhere.
If the kernel doesn’t support asynchronous I/O (which Linux does) GNU LibC will reimplement it poorly upon pthreads.
At the core of this “pthread” platform binding is a sorted linked list of timestamps a dedicated thread should trigger interrupts, sleeping during the intermediate times. By blocking on a condition so this sleep can be interrupted when adding new timers.
There’s binary-searched sorted array in which to lookup semaphores, each stored in mmap
-shared memory.
The mutex-locked arrays/linked-lists for file operation “requests”, & their “waitlist” atomic conditions, has additional logic for keeping them tidy. The requests array has it’s own freelist.
Upon enqueueing it groups requests by file descriptor, and the thread for enacting these requests calls [p]read[64]
, [p]write[64]
, fdatasync
, or fsync
foreach depending on the given opcode. Notifying the enqueuer of the result via those atomic conds.
Generic Port
GNU LibC has bindings to a “generic” platform. This appears to be for when it’s running upon extremely limited kernels, and needs a rudimentary userspace implementation of core UNIX features. I don’t know why this isn’t implemented the fallback logic provided by the public APIs…
Anyways starting with the most trivial implementations:
unwind_arch_adjustment(prev, addr)
returnsaddr
.glibc_tls_internal[_free]
wraps a global.startup_get[e](u,g)id
wrapsget[e](u,g)id
.safe_fatal
calls the CPU’s abort instruction via macro if available.profil_counter
wrapsprofil_count
& compile-time optionallysigcontext_get_pc
.mmap_maximum_offset
on 64bit (or greater) platforms returnsUINT64_MAX
.mmap_maximum_offset(shift)
for smaller wordsizes computes(1 << (shift + 8*sizeof(off_t))) - 1
.SET_NAN_PAYLOAD
uses a C union to set the NaN mantissa(s) of the given float.math_opt/force_barrier
wraps assignment & noop asm code.check_may_shrink_heap
returns a global.thread_stack_pointer
returns a global CPU register.spin_[un,try_]lock
&spin_lock_locked
implements non-threadsafe stubs, mutating the given bool pointer. As for other synching primitives.is_internal_signal
returns false.clear_internal_signals
is a noop.libc_signal_block_all
wrapssigfillset
&sigprocmask
.libc_signal_block_app
also wraps thatclear_internal_signals
noop.libc_signal_restore_set
wrapssigprocmask
.ifunc_sel
returns one of the the three given function pointers based on a global.ifunc_one
returns the given function pointer.if_freereq
wrapsfree
.HP_TIMING_DIFF
,HP_TIMING_ACCUM_NT
, &HP_TIMING_PRINT[_SIZE]
macros wraps simple arithmatic.HP_TIMING_PRINT
also wrapsitoa
&memcpy
.hp_timing_now
wrapsclock_gettime[64]
fips_enabled_p
returns false.default_libc_feholdexcept
wrapsfeholdexcept
, as per other floating point exception APIs.exit_thread
is an infinite loop.eloop_threshold
wrapsSYMLOOP_MAX
if that macro’s available orsysconf(SC_SYMLOOP_MAX)
, min-capped toMIN_ELOOP_THRESHOLD
.read_gnu_property
(for loading ELF executables) &dl_get_file_id
are true-returning noops.elf_machine_sym_no_match
&dl_file_id_match_p
are false-returning noops.dl_vdso_vsym
is a NULL-returning noop.elf_ifunc_invoke
calls the given function ELF pointer.libc_use_alloca
compares the given value to a cap.
For slightly more complex functions, there’s:
- A suite of more ELF backend logic (now involving
memcpy
s), incuding a function for computing hashes (using shift, adds, comparisons, & bitwise and & xor) from function names. frame_state_for
dynamically loads LibGCC to wrap it’s implementation.itoa_word
repeatedly divides the input by 10 outputting the remainder in reverse, specially optimized code generated for bases 10, 16, & 8.get_rounding_mode
wrapsFPU_GETCW
macro.- Various macros for the dynamic linker wraps pointer arithmatic.
GETTIME
wrapsclock_gettime64
with some postprocessing.- The macros for indicating a set of signals to
sigaction
wraps bitwise ops. These are further wrapped bysigset_set_old_mask
&sigset_get_old_mask
. utmp_equal
compares the various fields of the givenstruct utmp
.
More complex functions in GNU LibC’s “generic” bindings:
- Decoding metadata from DWARF executables (extensive logic, with C structs)
- Loops in C macros underlying
memcpy
, etc copying one page, word, or byte at a time forwards or backwards. - Yet more DWARF executable decoding (with pthread mutex synchronization), including caching & stack interpretors.
And finally, there’s headerfiles containing data on:
- Paths to core configuration, commands, etc.
- Supported interrupts with long/localized & short names.
- “Insecure” environment variables which shouldn’t be passed to setuid programs.
x86_64 Port
Despite what I’ve described so far, GCC doesn’t only abstract datastructures provided by the kernel. It also abstracts those provided by the CPU (and GCC). So it has includes ports for various different CPU architectures, though I’ll only describe x86_64.
Seperately I’ll describe the FPU (Functional Processing Unit) ports amongst others, which are seperate directories. But today I’ll just focus on the core x86 port.
Starting with the simplest APIs:
__wmemset_chk
wrapswmemset
with a comparison check & coldpath for errors.TLS_LE
&TLS_IE
macros wraps (if I’m reading it right) some special registers.TLS_LD
&TLS_GD
macros wrapstls_get_addr
& those same registers.TLS_GD_PREFIX
macro may (compiletime condition) optionally refer to “.byte 0x66”CALL_MCOUNT
macro incorporated into syscall code wrapscfi_adjust/def_cfa[_offset/register]
Asm directives- Alongside
CALL_MCOUNT
it defines various registers used in the calling convention & aPSEUDO
macro building upon embedder’s macros. stackinfo_get/sub_sp
wraps theesp
orrsp
registers depending on theILP32
register.STACK/POINTER_CHK_GUARD
macros wraps thestack/pointer_guard
properties oftcbhead_t
in Asm macros.POPCNT
macro wraps thepopcntq
opcode.GETSP
macro wraps thersp
opcode.GETTIME
wraps therdtsc
opcode.memset_chk
wrapsmemset
with a comparison errorcheck. As permem[p]cpy_chk
,memmove_chk
,[w]memset
wrapsmov(d,q)
,punpckl(bw,wd
, &pshufd
whilst populating a “multi-architecture” template I’ll study later.pop
wraps_mm_popcnt_u32
from<popcntintrin.h>
, or pulls in<intl/l10nflist.c>
. (Implemented in C)_JMPBUF_UNWINDS
wraps pointer comparison &demangle
. (C)_jmpbuf_sp
performs a registertable lookup & compiletime-optionally wrapsPTR_DEMANGLE
macro. (C)_JMPBUF_UNWINDS_ADJ
wraps_jmpbuf_sp
with pointer comparisons._JMPBUF_CFA_UNWINDS_ADJ
wraps_Unwind_GetCFA
with that.__libc_unwind_longjmp
is a synonym forlibc_longjmp
. (C)htonl
wrapsbswap
opcode.ffsll
wrapsbsfq
opcode.ffs
wrapsbsfl
opcode.__tls_get_addr_slow
, used for ELF dynamic linking, wrapsTHREAD_DTV
macro and eitherupdate_get_addr
ortls_get_addr_tail
.reloc_offset(a, b)
computesb*sizeof(ElfW(Rela))
.reloc_index(a, b)
returnspltn
. (C)elf_ifunc_invoke
calls the given function ELF pointer.elf_irela
wrapself_ifunc_invoke
with additional pointer decoding. (C)addq
&ret
opcodes are added into the .init & .fini sections._init
prepares the callstack &call
s the RAX register._fini
decrements the stack pointer.[_]setjmp
wrapssigsetjmp
on BSD.- &
ABORT_INSTRUCTION
macro wraps thehlt
opcode.
Slightly more complex, there’s:
tls_get_addr
largely wrapsmov[q,l]
with validation comparisons, magic numbers, & a more complex slowpath for when the stack is misaligned._start
uses various opcodes to initialize the callstack before callingmain
&hlt
ing.sigsetjmp
wraps severalmov[q]
opcodes to copy the current CPU state into the given registertable. Possibly with thePTR_DEMANGLE
macro if available. With a couple other ops._mcount
&__fentry__
profiling functions wrapsmcount_internal
with a bunch ofmovq
ops, correspondingcfi_rel_offset
orcfi_restore
Asm directives, a couplecfi_adjust_cfa_offset
Asm directives, & pointer arithmatic.longjump
wraps a bunch ofmovl
opcodes (with pointer demangling) carefully ordered to restore the CPU state from the registertable without breaking it. Also includes a loop around theincsspq
opcode.dl_hwcaps_subdirs_active
reencodes CPU-provided features data into a format appropriate the ELF linker. (C)mpn_rhsift
is manually loop-unrolled with different entry points for different values of the first 2 bits, bitshifting every word in the array._mpn_add_n
& similar are loop unrolled similarly tompn_rshift
with the exit block in the middle optimizing for single-word numbers._mpn_add_n
wrapsadc
in it’s loop, but this could be used as a template for other MPN APIs.
The relative complex internal functions are:
mpn_addmul_1
,mpn_lshift
, & similar which is a manually unrolled loop wrappingadd
(replaced when used as a template for other MPN APIs),mov
,lea
, &adc
opcodes with seperate entry points for odd & even numbers.elf_machine_matches_host
checks if thee_machine
header isEM_X86_64
.elf_machine_dynamic
returns the head of the global offset table, whilstelf_machine_load_address
computes offset from_DYNAMIC
global. (C)- If
DT_JMPREL
is set in the ELF info table & laziness is enabledelf_machine_runtime_setup
decodes theDT_PLTGOT
entry of the infotable copying it over if successful to the linktable’sl_mach
property & stores the linktable beneath that entry, then it might does something with the CPU features depending on whether profiling is enabled or not. IfDL_TLSDESC_GOT
entry is set & laziness is enabled it copies thedl_tlsdesc_resolve_rela
beneath that entry. (C) memcmp
is a manually unrolled loop with fastpaths for different bytesizes, & seperate loops for aligned vs unaligned. Utilizes vector opcodes.[w,raw]mem[r]chr
,(str,wcs)[r]chr
, &wcslen
wrapspcmpeq[d,b]
&pmovskb
opcodes in a manually-unrolled vectorized loop, with a seperate loop for unaligned & cross-cache data. Has multiple exit codeblocks.(str,wcs)[r]chr
&wcslen
has a coldloop for when memory pages are crossed.mpn_mul_1
wrapsmul
&adc
opcodes in a unrolled loop with multiple entry points for the different of the first 2 bits.strcat
&strcpy
wrapsmovq
in a manually unrolled loop.str[c]spn
wrapscmpb
/testb
in a manually unrolled vectorized loop.str[n][case]cmp
&wcscmp
both have a fairly sophisticated implementation with a dispatch table, bittwiddling tricks, & multiple vectorized unrolled loops for different cases.strlen
(after checking for empty strings) wrapspminu(d,b)
&pcmpeq(d,b)
in a unrolled vectorized loop.
Functions related to the ELF or DWARF linkers:
RTLD_START
macro bootstraps the callstack to call to_dl_init
&_dl_fini
from the ELF linker._dl_tlsdesc_return
loads offset 8 of the given pointer._dl_tlsdesc_undefweak
subtracts another given value from it._dl_tlsdesc_dynamic
wrapssalq
withmovq
s & a slowpath._dl_tlsdesc_resolve_rela
wraps_dl_tlsdesc_resolve_rela_fixup
withmovq
ops._dl_tlsdesc_resolve_hold
similarly wraps_dl_tlsdesc_resolve_hold_fixup
.dl_runtime_resolve
wraps_dl_fixup
with plenty of ofmov[q,l]
,xsave[c]
&[f]xrstor
, pointer arithmatic, & thecfi_adjust/rel_[cfa_]offset
Asm directives._dl_runtime_profile
does similar (plus looped vector ops) arounddl_profile_fixup
.
The x86_64 linker port especially has several internal functions written in C:
elf_machine_type_class
bitpacks some equals tests against the given “type”.dl_platform_init
compiletime-optionally calls_dl_x86_init_cpu_features
or runtime-optionally unsetsdl_platform
global if needed.elf_machine_fixup_plt(a, b, c, d, e, f, g)
assigns g to f.elf_machine_plt_value(a, b, c)
returns c.elf_machine_rela[_relative]
under various decoding conditions stores the sum of the givenl_addr
&r_addend
to the givenreloc_addr
. Possibly with error checks._relative
has only two conditions.elf_machine_lazy_rel
performs different pointer computations based on what data’s available._dl_tlsdesc_resolve_rela_fixup
decodes the given pointers whilst wrapping_dl_lookup_symbol_x
and/ordl_make_tlsdesc_dynamic
under different conditions._dl_tlsdesc_resolve_hold_fixup
locks & unlocks the dynamic linker’s mutex._dl_unmap
wraps_dl_unmap_segments
& compile/run-time optionallyhtab_delete
upon the thread-local-storage table.
And there’s a datafile defining registers for the systemcall calling convention.
x86 Multiarch port
There’s a few different variants of x86__64 CPUs, they may support one or more extended instruction sets & you can ask x86_64 CPUs which extended instruction sets they support!
For core memory-copying (“string” sublibrary) functions GNU LibC often wishes to take advantage of these extra opcodes where available. There’s numerous of these functions (non-multiarch described the other day) backed by complex Assembly code, but today I wish to specifically focus on strcat
.
The entrypoint supports AVX2 with fast unaligned load & an optimal VZEROUPPER
op, just fast-unaligned-load, or SSE3 with a fallback to SSE2.
Advanced Vector Extensions (AVX, 2011-?) adds 16 special “vector” registers each holding 8 32bit numbers or 4 64bit numbers, such that it can apply the same arithmatic on all in a single instruction. AVX2 doubles these register sizes, with a VZEROUPPER
op to clear the upper half of those larger vector registers.
The CPU’s cache hierarchy may operate in fixed wordsizes, to the degree that it is inoptimal to read data spread accross different words. The dispatch code for multiarch bytearray functions like strcat
tests whether or not it needs to optimize for this.
Streaming SIMD Extensions (SSE) are another suite of extensions to x86_64 instruction set supporting vectorized floats. SSE1 supported 8 registers each holding 4 32bit floats. SSE2 added int/byte support & doubled the register size.
That SSE2 fallback strcat
implementation is the normal non-multiarch code described the other day. At this point we assume any 64bit Intel/AMD CPU supports SSE2.
If there’s no performance penalty to unaligned reads, it needs only run vectorized unrolled loop around SSE2 ops pcmpeqb
to determine when to stop copying & pmovmskb
to actually copy each 256bits to their new location.
For AVX2 CPUs it uses vpcmpeqb
& vpcmpeqb
in it’s own vectorized unrolled loop for 512bit copies.
If unaligned reads are inefficient on your SSE3 CPU, GNU LibC will use a strcat
implementation which’ll use the movlpd
instruction for copying unaligned data 256bits at a time. With numerous exit codeblocks! So movlpd
is SSE3-specific?
NPTL PThreads-port
Here I’ll discuss the PThreads port to GNU LibC’s “nptl” sublibrary, and to x86 machine code.
pthread_cancel_init
emits a membarrier if PThreads are already initialized. Otherwise _Unwind_Resume
, gcc_personality_v0
, _Unwind_ForcedUnwind
, & _Unwind_GetCFA
from LibGCC. And exposes public wrappers for those functions which also lazily calls that initializes & applies atomic read barriers.
GAI_MISC_NOTIFY
macro iterates over/clears an array of futexes to futex_wake
.
GAI_MISC_WAIT
reads the given futex & if non-zero under a __gai_requests_mutex
lock repeatedly calls __futex_abstimed_wait[_cancellable]64
until successful.
gai_start_notify_thread
wraps pthread_sigmask
& sigemptyset
.
gai_create_helper_thread
wraps pthread_create
with various configuration APIs including pthread_sigmask
& pthread_attr_setstacksize
.
futex_abstimed_wait_common32
wraps the corresponding syscall (cancellable variant or not) with range & timestamp validation.
__futex_abstimed_wait_common64
wraps the corresponding syscall (cancellable variant or not), falling back to the 32bit variant described above, with timeout & clockid validation & error reporting. __futex_abstimed_wait64
& futex_abstimed_wait_cancelable64
are trivial wrappers around that.
fork
checks wraps the kernel port’s version with callback dispatch & (if multiple threads running, some under a IO_list
lock) other NSS functions, clearing I/O locks, & allocates new thread data.
thread_gscope_wait
iterates over all system-allocated pthreads in the dl_stack_used
linkedlist skipping THREAD_GSCOPE_FLAG_UNUSED
ones, atomically exchanges it’s gscope_flagp
from THREAD_GSCOPE_FLAG_WAIT
to ...FLAG_USED
, & if successful repeatedly futex_wait
s on them. Then does the same for the user-allocated pthreads in dl_stack_user
.
thread_attr_compare
tests equality on the various properties of the given struct pthread_attr
s.
_IO_lock_init
macro stores some initial values in the given var. _IO_lock_fini
macro returns 0. _IO_lock_lock
conditionally (on other threads) wraps lll_lock
& storing THREAD_SELF
to the given var before unconditionally incrementing a counter.
_IO_lock_trylock
similarly conditionally wraps lll_trylock
. _IO_lock_unlock
decrements the count & upon reaching 0 calls lll_unlock
. There’s some _IO_cleanup_region_*
macros wrapping __libc_cleanup_region_*
funcs.
_IO_acquire_lock
wraps IO_flockfile
with a register cleanup function. _IO_release_lock
macro closes it’s codeblock.
INLINE_SETXID_SYSCALL
(unless there’s multithreading) wraps INLINE_SYSCALL
macro with enqueueing until pthreads have been initialized. This enqueueing is done differently whether it’s shared threads-system or not.
Alongside var declaration macros & with lighter bootstrapping variants for use in the platform-indepent PThreads implementation, there’s mutex macros.
__libc_lock_init_recursive
macro wraps pthread_mutex_init
with pthread_mutexattr_*
configuration & an existance check. libc_lock_fini_recursive
wraps pthread_mutex_destroy
. libc_lock_recursive
wraps pthread_mutex_lock
, or in bootstrapping resembles the IO_lock
s. Equivalent for libc_lock_trylock_recursive
& libc_lock_unlock_recursive
. And there’s a couple macros wrapping pthread_cleanup_push/pop_defer/restore
with local state & checks.
There’s a futex API wrapping lll_futex_*
API & occasional syscalls with validation & error reporting. And for futex_clocklock64
a atomic-exchange loop.
There’s AIO_MISC_NOTIFY
& AIO_MISC_WAIT` akin to the corresponding macros.
For simpler public APIs there’s:
longjump_unwind
wrappingpthread_cleanup_upto
- More var declaration macros.
- Trivial acros around the lock/unlock macros. Fallingback to nonatomic counters.
- Macros for conditional callback calls.
lll_[cond_]lock
macro wrapslll_lock_wait[_private
with an atomic compexchange loop.lll_unlock
macro wrapslll_futex_wake
with an atomic exchange.- There’s macros for check lll-locks bitflag’s & syscalls.
- There’s C++ code & macros implementing a stack of cleanup functions to call upon interrupt.
- There’s bittwiddling macros for “TD eventsets”.
For x86_64-specific components, there’s:
- Macros ultimately wrapping a manual syscall & mov opcodes to get threadlocal storage.
THREAD_GSCOPE_RESET
flag (alongside macros for how to encode different GScope values) wrapslll_futex_wake
with axchgl
opcode.- There’s an x86 Assembly spinlock implementation which repeatedly claims hardware-locked decrements the number pointed to by the RDI-register, with tight fail loop waiting for it to reach zero before trying again. Unlocking stores 1 to it. Trylock wraps a hardware-locked
cmpxchgl
opcode. - There’s a file listing
struct pthread
,pthread_mutex_t
, &tcb_head_t
properties the Assembly code wishes to access. For the build system to turn into Assembly macros. - There’s Assembly datafiles describing bytesizes (and which register is the stack pointer) & bitoffsets for mutex for mutex & read/writer locks. The core must be a single word to benefit from atomic ops!
Atomics are uninterruptable CPU ops reading/writing single words required to implement synchronization upon.
Floating Point Math
We love giving computers mathematical equations to solve! Even if your problem isn’t inherantly mathematical computers tend to represent I/O in numbers. To aid the non-trivial formulas you give them, GNU LibC provides a highly optimized library implementing the mathematical functions. Of which there are CPU-independant implementations & CPU-dependant implementations.
There’s some generic error reporting functions branching upon a number indicating which function was called. There’s a CPU-overridable __matherr
function which appears to return 0, with a meaningless x->arg1!=x->arg1
check. And a wrapper around the that error reporting.
CPU-independant (“ieee754”) Doubles Implementation
I’ll continue by beginning to describe the CPU-independant implementations for 64bit doubles. The CPU-independant implementations for other numeric types are very similar, just the more bits are used the larger approximating polynomials & the more magic numbers are needed.
Starting with the simplest cases:
x2y2m1f
implements(x-1)*(x+1) + x*y
with float input/output but internally using doubles to overflow overflows.totalordermag
performs some normalizing (including NaN handling) before comparing.signbit
wraps__builtin_signbit
(compiler port?)issignalling
checks if certain highorder bits are set, possibly (compiletime) inverting the NaN bit to catch it too.isnan
&isinf
checks whether some certain highorder bits are set.getpayload
bitmasks all but the highestorder bit, with bitwise checks.fsub
wraps theNARROW_SUB_ROUND_TO_ODD
macro.fpclassify
bitwise checks to determine whether to returnFP_ZERO
,FP_SUBNORMAL
,FP_NAN
, orFP_INFINITY
.fmul
wraps theNARROW_MUL_ROUND_TO_ODD
macro.finite
performs a bitwise check.fdiv
wraps theNARROW_DIV_ROUND_TO_ODD
macro.fadd
wraps theNARROW_ADD_ROUND_TO_ODD
macro.fabs
wraps__builtin_fabs
. (compiler port?)f32xsubf64
wraps theNARROW_SUB_TRIVIAL
macro.f32xmulf64
wrapsNARROW_MUL_TRIVIAL
.f32xdivf64
wrapsNARROW_DIV_TRIVIAL
.f32xaddf64
wrapsNARROW_ADD_TRIVIAL
.copysign
wraps__builtin_copysign
. (compiler port?)mpn_construct_double
bitpack it’s args via C union into a IEEE754 double, depending on value ofBITS_PER_MP_LIMB
macroDIV_RADIX
macro bitmasks & shifts by compiletime constsINTEGER_OF
subtracts the givendouble
by itself casted to along
& backALIGN_DOWN_TO(in, f)
negatesf
to get a bitmask onin
- Suite of uninlined funcs stores errors in the
errno
global asuint64
&asdouble
casts the input using C unionsissignaling_inline
bitchecks highorder bits
Slightly more complex there’s:
EADD
&ESUB
macros which doubles the bitsize of+
or-
output via arithmatic.- An optional
DLA_FMS
macro which wraps__builtin_fma
(compiler port?), & may underlyEMULV
&EMUL12
macros akin to above. Otherwise uses arithmatic. - I don’t understand how
ADD2
,SUB2
,MUL2
, &DIV2
differs. ADD2A
&SUB2A
performs conditional checks for how to compute extended output word.- For values (“x”, taking absolute value whilst restoring sign on output) between -0.5 & 0.5 exclusive
atanh
validates the input is small enough it practically equals it’s sign (once error checked), before using the approximation0.5*log1p(2x + 4x/(1-x))
. - For values less than 1
atanh
uses the approximation0.5*log1p(2x/(1-x))
. - For values greater than 1 it returns signed infinity(?). For 1 it returns signed NaN(?). Computed from x via invalid divides.
- For infinity
exp10
defers toexp
. For values less than some min value it returns a (squared) constant. For values greater than a maximum it returns a (squared) constant. For absolute values less than 0x1p-56 it returns 1. - Otherwise
exp10
does some bittwiddling to process the low & high words which it multiplies by a constant, hands both off toexp
multiplying both call results. - For large values
ilogb
shifts it’s input right by 20 & subtracts 1023, or returns NaN or INT_MAX. - For reasonably sized input
ilogb
splits it’s input into high & low words. Returns a constant if both words are 0. If the high word is 0 decrements -1043 for each bit of the low word until it’s no longer positive. Otherwise decrements -1022 for each bit (less 11) the high word until it’s no longer positive. gammaproductf
repeatedly multiplies the given “x” by the next consecutive “n” numbers.gammaproduct
also callsmul_split
as input to compute the rounding error.lgamma_product
puts more effort into computing the rounding error.mpatan2
uses the multiprecision lib from “stdlib” & local dir to wrapmpatan
with a divide(?), & possibly multiply, adds, & a square root.mpsqrt
wrapsfastiroot
with, for a number of iterations given by a lookup table, square, multiplications, & subtraction.fastiroot
bitcasts the double to int twice to compare the two words to compute what to subtract from the hiword, & computes lowword via polynomial.mptan
wrapsmpranred
&c32
with one divide(?) or other based on the sign.ceil
&floor
uses conditions to determine which other argument to hand toINSERT_WORDS64
macro`, or which error to report.fmaf
either uses the__builtin_fmaf
compiler function or performs the multiply & add itself taking great care regarding errors & precision.frexp
wrapsEXTRACT/INSERT_WORDS
with some bit twiddling to return a double0.5<=|x|<1
& an exponent.llround
wrapsEXTRACT_WORDS64
with bittwiddling (depending on compiletime type sizes) to compute returned int.logb
wrapsEXTRACT_WORDS64
&__builtin_clzll
with a little bittwiddling.modf
wrapsEXTRACT_WORDS64
& under various conditions for bitmasksINSERT_WORDS64
.nextup
wrapsEXTRACT/INSERT_WORDS
with error checks & increment or decrement ops.roundeven
wrapsEXTRACT/INSERT_WORDS64
with error conditions, trivial arithmatic, & bitmasks.round
either (compiletime condition) wraps__builtin_round
, orEXTRACT/INSERT_WORDS64
with different condition conditions & bittwiddling for different bitmasks & error handling.- There’s a
EXTRACT/INSERT_WORDS64
bittwiddling template function, with an error condition on exponent & some conditional bittwiddling for non-zero “ix” words. tanh
wrapsEXTRACT/INSERT_WORDS
with different exponential arithmatic (viaexpm1
) for different ranges of “ix”.totalorder
wrapsEXTRACT_WORDS64
, bitshifts, XORs (some conditional) & a comparison.trunc
wrapsEXTRACT/INSERT_WORDS64
with some bittwiddling, error checks, & a conditional & nonconditional bitmask.add_split
splits addition to have low & high parts to it’s result.compare
compares the absolute values of two inputs.x2y2m1
temporarily sets rounding mode, performs two mulsplits, sorts array of those results, and sums them with more sorting.
For the more complex functions with relatively few magic numbers…
There’s a double
to [u]intmax_t
conversion template function, with caller-given rounding mode & width. wrapping bitmasks, bitshifts, fromfp_domain_error
upon error, fromfp_max_exponent
, & fromfp_round_and_return
.
remquo
wraps 2 EXTRACT_WORDS64
& a single INSERT_WORDS_64
with some error checks, fastpaths one deferring to __ieee754_fmod
, & conditional arithmatic rounding the divide & multiplies to take remainder.
lround
uses utilizes EXTRACT/INSERT_WORDS64
& bitwise ops, with conditions for errorchecks & upon compiletime bitsize.
fma
may wrap __builtin_fma
if available or performs the *
& +
itself with lots & lots of rounding/error checks. Possibly multiplying by different exponents.
mpatan
initializes a multiprecision output number, chooses a number of iterations based on how EX
(where’d that input come from?) compares to 0 & if that’s non-zero it repeatedly sums, squareroots, & divides. It looks up the resulting precision in a lookuptable & performs that many divides, multiplications, & subtractions.
__remainder
splits it’s two inputs in 2 halves bitmasking off the signbit & checks their range. Performing different multiply, adds, subtracts, bitmasks, & recursion.
branred
performs range reduction via some multiplies & subtracts, extracts the high & low words of that, bitshifts & bitmasks the highword twice with a tad of arithmatic each time, gathers an array of six doubles from multiplications, a couple of sum loops (bit convoluted to handle precision issues), few more additions/subtractions, repeats essentially same thing, & some multiply-adds.
mpn_extract_double
uses a C union & conditional bittwiddling to convert a double to multiprecision.
hypot
splits the inputs each into two halves with sign bitmasked off, swaps input to ensure the 2nd is larger, checks range & errors to determine what to set the highwords to, and what to multiply the inputs by, performs sqrt surrounded by arithmatic depending on whether diff’s larger than the larger input, & an error check.
mpsin
wraps slowsin
with some conditional range reduction arithmatic. If the caller specifies it performs some multiplies subtracts (how many depends on range).
If not requested it instead convert it’s inputs into multiprecision. Folowed by, under 3 conditions, 3 multiprecision squares followed by a couple multiprecision squares, numerous multiplies (a few loops), & half as many subtracts.
The range reduction if ran indicates which quadrant to simplify the computation before deferring to slowsin
. Similar is done for mpcos
.
Then there’s a range of multiprecision arithmatic functions…
Of these multiprecision functions mcr
iterates over the mantissa digits until it finds the first differing one. acr
compares the first two exponent digits falling back via mcr
to the mantissa.
cpy
(which I’ve been eliding from explanations of other functions) copies every digit from one multiprecision to another.
norm
branches upon the given precision to perform the relevant multiply-adds. For large enough values it’ll use for-loops. Then muliplies all digits by a RADIX[I]
consts.
denorm
branches upon the given precision & EX
to populate a 5digit array (from constants, input digits, or their sum), applies rounding, and computes the output double
from those digits.
mp_dbl
decides whether to defer to norm
or denorm
.
dbl_mp
extracts the sign then exponent then digits to populate multiprecision number.
add_magnitudes
iterates over the given x number (must be larger) & corresponding y digits to compute sum. Final optional iteration shifts digits down 1.
sub_magnitudes
does essentially same thing for subtraction.
add
& sub
wraps cpy
, acr
(to get arguments in correct order), & add_magnitudes
or sub_magnitudes
(depending on sign).
mul
after checking fastpaths computes the product of each pair of digits into a “diag” array then computes a “zk” via multiplies, adds, & subtracts over the two inputs & the diagonal in two loops populating output as it goes. Then incorporates carries.
sqr
, after some shortcuts, repeatedly multiplies the last p digits before the kth digit with the first p digits taking the sum for kth digit, with a similar but tighter trailing loop & some minor postprocessing/normalization.
inv
converts the multiprecision to a double and back to apply an inverse operation. Then uses postprocessing 2 mulitiplies & 1 subtract per digit to regain the multiprecision.
dvd
wraps inv
& mul
if non-zero input to perform divide.
For math on doubles which utilizes some non-trivial magic numbers… (Not that I can read these hexadecimal magic numbers)
sincos
splits the input into two halves & bitmasks off the signbit, & checks the range of that absolute high half ultimately deffering to do_sin
, do_cos
, & maybe reduce_sincos
or branred
.
scalb[l]n[l]
wraps EXTRACT/INSERT_WORDS64
with bitshifts, bitmasks, & constant multiplies by 2^54 (~1.8e16) or 2m^54 (~5.55e-17), tiny (~1e-300), or huge (~1e300).
rint
& nearbyint
wraps EXTRACT_WORDS64
& maybe INSERT_WORDS64
with bittwiddling & adding/subtract + or - 2^52 (~4.5e15). lrint
uses more sophisticated bittwiddling & error reporting, less for llrint
.
reduce_sincos
wraps mostly subtracts & constant multiplies between it’s args.
For values between + & - 0.126 do_sin
(with it’s own sin
wrapper) squares it’s input & uses Taylor Series refinement. Which is a polynomial equation. Otherwise do_sin
uses a different polynomial & looks up several parameters from a lookuptable to incorporate into a second polynomial (constants look like multiples of 1/6). Copying the original sign back over into the output.
do_cos
(with it’s own cos
wrapper) works similar but without the Taylor Series fast path.
log10
wraps EXTRACT/INSERT_WORDS64
with adds, subtracts, bitmasks, & multiplies by 2^54 (~1.8e16), 1/ln(10), or upper & lower halves of log10(2).
log1p
extracts the highword bitmasking off the signbit checks it’s range to report errors or perform the appropriate arithmatic or bittwiddling. Often refining result via polynomial.
sqrt
splits input into two halves bitmasking off exponent & signbit & validates range, & performs lots of arithmatic with error checks. Or recurses.
To compute cuberoots (cbrt
) it extracts mantissa & exponent, computes a new mantissa via a polynomial to cube than multiplies by a fraction computed from old mantissa & an exponent-looked up value of cuberoot of 2 (~1.26), it’s square (~1.6), 1 or their inverse.
acosh
extracts the two words from it’s input, & checks it’s range to report errors, compute log(x) + ln(2)
, compute log(2x - 1/(x + sqrt(xx - 1)))
, or compute log1p(x - 1 + sqrt(2 * (x - 1) + (x - 1)^2))
.
expm1
retrieves the highword & seperate off it’s signbit; checks it’s range to report error cases; adds, subtracts; or multiplies the two words of ln(2) depending on the range to compute two words & “k”; performs some constant multiply-adds; under different ranges of the computed “k” compute different multiply, add, & subtract formulas; & under those different conditions reassemble the two words into output.
lgamma_r
reduces it’s input range to be <= 8 via constant logs, approximate via polynomial, refine, & handle negative inputs specially amongst other special cases.
jn
wraps sincos
with word extraction, bittwiddling, various range checks, & arithmatic. Maybe some loops. yn
wraps y0
or y1
or similar logic.
I see various other wrappers around sin
/cos
. lgamma_neg
wraps log1p
, fabs
, log
, some of those wrappers, and/or lgamma_product
with plenty of arithmatic, conditions, & loops.
rem_pio2
looks up a “jk” value based on given precision from a lookup table, performs various arithmatic & loops, wraps scalbn
& floor
to compute “z”, some conditional bittwiddling within an “iq” array”, computes a “diagonal” via another arithmatic loop, populates additional tables from that, & branches over precision for how to compress the array.
There’s plenty of functions which do little more than multiply-adds upon a lookup table. Maybe add some conditional trigonometry.
gamma_positive
wraps exp
, lgamma_r
, maybe other such arithmatic functions, & maybe a repeatedly-checked lookup table under different ranges. gamma_r
wraps gamma_positive
with some error checks, scalbn
, trunc
, & trigonometry.
Some of those various multiply-add lookuptable functions perform different multiply-adds for different input ranges.
For implementations which use lots of magic numbers…
Looking at the dataheaders dosincos
(described yesterday) refers to constants for a 5-degree polynomial, a large value, I think another polynomial, etc. Variable names aren’t that legible.
utan
’s includes positive & negative multiplicands for at least 3 polynomials, and some other constants.
In a seperate headerfile utan
has a lookup table of “xi”, “Fi”, “Gi”, & “FFi” parameters. I so no documentation on where these numbers came from beyond “IBM Accurate Mathematical Library”.
urem
has constants for a large value, 2^128, 2^-128, 0, & -0.
There’s seperate constant values depending on CPU byteorder. Maybe I won’t continue describing these dataheaders, they’re not very self-documenting.
cosh
retrieves the high word of it’s input “x” bitmasking off the signbit & checks the range to determine whether to return 1, t = expm1(|x|) in 1 + (t^2)/((1+t)^2)
, 0.5exp(|x|) + 0.5/exp(|x|)
, 0.5exp(|x|)
, 0.5exp(0.5|x|)^2
or an error.
asinh
does similar copying sign onto output for formulas: x, 2x, log(|x|) + ln(2)
, log(2|x| + 1/sqrt(|x^2| + 1) + |x|
, or log1p(|x| + |x^2|/(1 + sqrt(1 + |x^2|)
.
fmod
extracts the two words of it’s input extracting the signbit; checks for errors; computes ilogb(x)
via bitshfit-subtract or repeatedly decrementing ix
for each bit in the extracted “x” until it’s no longer positive; same for “y”; aligns highwords of x & y via bittwiddling; repeating the difference between-those-logs times compares the high words to determine to double x highword, 0, or difference to produce new “hx”; & reformats back into floating point under different conditions.
exp
retrieves the 12 highestorder bits with sign bitmasked off, if that’s greater than those for 512 it checks various error conditions to exit on, divides x by ln(2N), performs bitcasting & rounding, adds x to the rounded value times -ln(2N) “r”, performs performs a couple tablelookup, computes a polynomial on “r” adding the first lookup, checks whether it needs to be careful regarding overflows, & computes/applies a “scale” from the second lookup.
If it’s input is less than 0.5*hp0.x
docos
performs a tablelookup for partial parameters to a large polynomial it then evaluates via macros taking great care around overflows. If it’s input is less than 1.5 times that it’ll perform a little preprocessing to run the sin
helper which works similarly. Otherwise it’ll subtract that absolute input from 2*hp0.x
before calling the helper.
exp2
checks error cases before applying similar arithmatic & table lookups to exp
.
If input’s less than 22 sinh
, with some overflow checks, computes sign(x)*0.5*(2expm1(x)/(expm1(x)+1)
. If it’s less than log(maxdouble)
sinh
computes sign(x)*0.5*exp(|x|)
. If it’s less than an overflow threshold sinh
computes sign(x)*0.5*exp(0.5|x|)^2
. Otherwise sinh
reports an overflow error.
atan
after splitting it’s input, error checking, temporarily configuring rounding to NEAREST, & taking absolute input check’s it’s input’s range.
It may return an identity of it’s input. It may compute a couple polynomials with fast exits, eventually getting careful regarding overflows before finally resorting to arbitrary precision arithmatic. Another couple polynomials it might run consults a lookuptable & branching to determine it’s parameters with similar regard to overflows, calling signArctan
on fastexits. Another more resembles the first but with signArctan
calls on fastexits. Or reports overflows.
log
bitcasts it’s input to an int whilst also extracting it’s top 16 bits before checking it’s input’s range possibly performing some additional polynomial calculation and/or refining the lookuptable index, performing a tablelookup, & doing a little bit of arithmatic on it. log2
works similarly with slightly more maths. tan
works similarly to sinh
.
pow
bitcasts both inputs to ints whilst extracting the 12 highest-order bits before checking the range & some final arithmatic.
Those range checks ammount to errorchecks, fastpaths, & input simplification. The final arithmatic calls off to log
, __builtin_fma
(or it performs the two-word multiply-add manually), & exp
. So pow(x, y) = exp(log(x)*y)
. There’s a specialcase where it’s extra careful regarding overflows, & it’s got it’s own inlined implementations of exp
& log
.
atan2
splits it’s input in half & checks it’s range for errors then fastpaths then normalization then …
… applying an appropriate polynomial being careful later-on regarding overflows eventually resorting to multiprecision arithmatic upon fastexit copying appropriate sign over, maybe looking up parameters in a lookuptable. Regardless there’s a final such lookuptable polynomial it applies being careful regarding numeric overflows.
Normalizing may involve multiprecision divide, multiply, & subtract.
acos
& asin
involves lots of range-conditional arithmatic. And maybe a lookuptable.
x86_64 Float Port
Here I’ll study how GNU LibC implements various math functions on x86_64 CPUs.
For the simpler cases:
signbit[f]
copies the float into a normal register (pmovmskb
) & takes a bitmask (andl
)scalbnl
wrapsfildl
,fldt
,fscale
, &fstp
ops to add a value to the exponent.nearbyintl
wrapsfldt
,fnstenv
,frndint
,fnstsw
, bitwise ops, &fldenv
to round to nearest int.llrintl
wrapsfldt
,fistpll
, &fwait
ops to round to nearest int.llrintf
wrapsctss2si
op.llrint
wrapscvtsd2si
op.finitel
wraps bitwise manipulation to test if it’s storing infinity.fabsl
wrapsfldt
&fabs
.fabs[f]
wraps__builtin_fabs[f]
. (implemented in C, compiler port?)copysign[l,f]
wraps bitwise manipulation to copy the sign bit from one float to another.- There’s internal macros wrapping CPU feature tests. (C)
- There’s a C union & wrapping macros for extracting the different fields in a float. (C)
fetestexcept
&fegetexceptflag
wrapsfnstsw
&stmxcsr
ops with (in C) bitwise manipulation.fesetmode
&fesetexcept
wrapsstmxcr
&ldmxcsr
between which it performs bitwise operations in C.feholdexcept
wrapsfnstenv
,stmxcsr
,fnclex
, & (after minor bitwise ops in C)ldmxcsr
ops.fegetround
wrapsfnstcw
op with highorder bit test in C.fegetmode
wrapsstmxcsr
op.fegetexcept
wrapsfstcw
op.fegetenv
wrapsfnstenv
,fldenv
(to restore previous env afterfnstenv
), &stmxcsr
ops.__ieee754_remainderl/fmodl
wrapsfprem1
&fstsw
in a loop after loading the float args viafldt
.
Slightly more complex, there’s:
__ieee754_exp2l
wrapsfld
(x),frndint
(int(x)),fsubr
(fract(x)),fxch
&f2xm1
(2^fract(x) - 1),fld1
&faddp
(2^fract(x)), &fscale
(e^x) with fast paths.__ieee754_ilogbl
wrapsfxtract
,fstp
,fistpl
, &fwait
with two fastpaths.__ieee754_log10l
(which makes it clear to me x86 floating point ops are a stack machine) wraps decrement;fabs
,fcompl
.fnstsw
, & a conditional codepath withfxam
,fnstsw
, optionalfabs
in a loop. Also has multiple exitpaths usually withfstp
&fyl2x[p1]
ops, loads a CPU constant viafldlg2
.log10l_finite
is an alternate entrypoint/fastpath to above.__ieee754_logl
&__logl_finite
works similarly but with a different CPU constantfldln2
.feclearexcept
&fesetexceptflag
wrapsfnstenv
,fldenv
,stmxcsr
, &ldmxcsr
ops with bitwise ops in C.feenableexcept
&fedisableexcept
wrapsfstcw
,fldcw
,smxcsr
, &ldmxcsr
with bitwise ops in C.feupdateenv
wrapsfnstsw
op,fesetenv
, &feraiseexcept
. (C)__ieee754_sclabl
wrapsfrndint
,fcomip
, &fscale
with error paths/checks for infinity & NaN.feraiseexcepts
given theFE_INVALID
bit wrapsdivss %0 %0
.feraiseexcepts
given theFE_DIVBYZERO
bit wrapsdivss %1 %0
.feraiseexcepts
given theFE_OVERFLOW
,FE_UNDERFLOW
, and/orFE_INEXACT
bits wrapsfnstenv
, sets the relevant exception bit(s) in C,fldenv
, &fwait
.fesetround
wrapsfnstcw
,fldcw
,stmxcsr
, &ldmxsr
opcodes with bitwise manipulation & validation in C. (C)fmin
wraps theminsd
opcode with NaN checks.fesetenv
wrapsfnstenv
,stmxcsr
,fldenv
, &ldmxcsr
opcodes with intermediate bitwise manipulation (under different conditions) in C.fmaxl
&fminl
wrapsfcmov[n]b
&fstp
opcodes with NaN handling.fmaxf
wrapsmaxss
with NaN handling.fmax
wrapsmaxsd
with NaN handling.floorl
&s_truncl
wraps (beyond environment, control, & float loading; + bitwise ops) thefrndint
opcode.fminf
wrapsminss
with NaN handling.log1pl
conditionally wrapsfabs
,fcompp
,fnstsw
, maybe increment, &fyl2x[pl]
. Loads thefldln2
CPU constant.LDBL_CHECK_FORCE_UFLOW_NONNEG[_NAN]
Assembly macro wraps af[u]comip
&fstp
NaN check, withfmul
&fstp
on the failpath.LDBL_CHECK_FORCE_UFLOW_NONNAN
does essentially the same with an addedfabs
opcode.
Most of the more complex implementations of mathematical functions refers to various constants & lookup tables.
For trigonemetry data it has a bitmask for copying everything but the floating point sign bit OR only that bit, range limits, polynomial coefficiants for high accuracy, Pi in 4 parts & a variant for FMA unit, 4 polynomial constants & a FMA variant, 1/Pi, rightshift value, pi/2, 1/2, lookup table mask, 2^(k-1)
All those values are stored in a lookup value with macros defining indices into it.
For exponentiation it has constants for value ranges, infinity, bitmask copying everything but the floating point sign bit, 4 polynomial coefficients, bitmask to retrieve the floating point exponent field, 2^10 (do I have this right? I don’t read hex floats), more ranges, bitmask to divide by 2, 1(?), log(2) high & low parts, threshold, bias (2 parts), 1/ln(2), shift value, 2^(1/2^k)-1, & 2^k-1.
Some of those values are repeated (to play well with the CPU’s dataprefetcher? That’s often optimized for array traversal), and interspersed in here are lookup tables. Not that I see any comments about how they’re computed..
There’s ones for “Log(2) lookup High+Low table for logarithmic part”, “Log(2) lookup table for logarithmic part”, “dInfs = DP infinity, +/- ==”, “dOnes = DP one, +/- ==”, “dZeros = DP zero +/- ==”, & a last one I think called “dbT”.
For logarithms there’s 7 polynomial coefficients, constants for work range checks, a mantissa “break point”, signicand mask, 1.0, ln(2), +/- infinity, +/- 1.0, & +/- 0.0 (yes, floating point supports signed zero).
For expf
it has constants for 1/ln(2), right shift value, ln(2) low & high parts, a bias, 6 polynomial coefficients, bitmask copying everything but the floating point sign bit, & a working domain range.
There’s a reencoding of the trig constants in doubles (might be some other differences…). There’s 5 lookup tables for log(2) & 2^x (whether or not there’s “HSW” support, also has recipricals if not).
Between these are a bitmask to retrieve mantissa field, 1.0, Cvt bitmask, min & max normal values, rnd bit & bitmask, 6 polynomial coefficients (first with low & high part), bitmask to copy everything but sign bit, domain range, shift value, index mask, & 4 more polynomial coefficients.
After those lookup tables there’s additional constants for exponentiation computation on doubles: 7 polynomial coefficients, 5 more for a different polynomial. bitmask to retrieve mantissa field, 0x3fe7fe0000000000, 0xffffffff00000000, 0xfff0000000000000, & 0x3fe7fe00 are unlabled, 1.0, 2^20+2^19, a bitmask, -log2(e), 2^45+2^44, -infinity, -0.0, 2^52, 1/2^111, delta, range, bitmask to copy everything but the sign bit, +infinity, domain range, index bitmask, index add, 2^20+2^19, & 1.0.
For logs of doubles there’s two lookup tables for “high+low parts and 9-bit index for -log(mRcp), where mRcp is mantissa of 1/x 9-bit accurate reciprocal” & “9-bit index for -log(mRcp), where mRcp is mantissa of 1/x 9-bit accurate reciprocal”; followed by 4 polynomial coefficients, bitmask to retrieve exponent field, 2^10, min & max normal values, bitmask to divide to divide mantissa in half, 1.0, log(2) high & low parts, threshold, bias, log(2), +/- infinity, +/- 1.0, & +/- 0.0.
For exponentiation of doubles there’s a 2^(j/2^k) lookup table followed by 2^k/ln(2), shift values of 3*2^52, ln(2) high & low parts, 3 polynomial coefficients, index mask, bitmask to copy everything but the sign bit, & a domain range.
exp10l
& similar after validating input repeatedlies:multiplies by relevant CPU constant, temporarily sets round-to-nearest whilst it rounds two numbers, consults a tiny lookup table to multiply-subtract by, and does a little more arithmatic.
expm1l
uses extremely similar but non-identical code. And this code can be used as a template for other even more similar functions.
log2l
after validating input takes the absolute value of the decremented input, fcompl
& fnstsw
opcodes on that, validates the result, conditionally takes the absolute value, & wraps fyl2xp1
. The finite varient does few checks & no loops.
powl
after extensive input validation with fastpaths for ints >= 4, or 0.0, calls a helper function.
There’s a C macro, presumably for debugging purposes since the public API has better logic for this, which outputs a floating point number in hexadecimal by wrapping itoaword
under different conditions. Hexadecimal so it doesn’t need to convert base 10 which is complex.
sincos
for x86_64 Advanced Vector Extensions CPUs surrounds calls to SEE4 implementation with plenty movs & lea ops. If there’s 32 Instruction Level Parallesm the initial movs will be stack ops, otherwise vmovdqu
ops.
There’s similar handling for sincos on SSE2 & AVX512 CPUs. Double or floating point.
Though there is a vzeroupper
opcode prior to the two calls in all these. And the 32 Instruction-Level Parallelism code may include constants inline. Or one case there’s 4 calls.
There’s macros for for generating simpler variants of such logic. Though the AVX ones look somewhat interesting, though I still don’t get impression I’m seeing the core logic.
x86_64 Multiarch port
There extended instruction sets CPUs may support to aid efficient implementation of various math functions. Today I wish to describe how GNU LibC takes advantage of these! (Which is neatly bifarcated into 2 or 3 sections)
For the simplest cases:
- SSE4.1 includes a
roundsd
opcode fortrunc
,rint
,nearbyint
,floor
, &ceil
to wrap; &roundss
opcode for float equivalents to wrap. - There may be
vfmadd[213]ss
opcodes forfma
to wrap.
There’s C function macro-templates for choosing which implementation of various functions to be used based on the CPU’s self-reported featureset.
To compute sinf16
on AVX512 CPUs it uses Assembly code to bring the input range between -Pi/2 & Pi/2 by removing the sign bit saving aside old sign, multiply by inverse pi (i.e. divide by pi) to get the “octant” (using vfmadd213ps
, vcmpps
, & vpbroadcastd
ops), subtract a right-shift, & something about signbit & integers (using vfnmadd
ops).
It continues by using multiply-adds to evaluate a polynomial, squaring the ranged-reduced input first, then restores the original sign bit & performs some final movs & validation. Possibly back to the CPU-generic code.
I am judging operation largely by comments here… There’s several other implementations of this same function, in large part for different extended instruction sets.
Cosine is computed similarly, possibly as a sideeffect.
GNU LibC implements pow(x, y)
as 2^(log2(x)y)
. The log2
here is implemented by splitting the float into integral (“k1”) & fractional (“X1”) parts, & computes the reciprical of X1 undoing & redoing it 3 times to handle imprecision. These recipricals & their logorithms are looked up in a table, before being summed to with a polynomial of their product (less 1/ln(2)
) to get the logorithm.
This gives a 3 part number for the multiplication to operate on.
To calculate 2^x
as part of pow(x,y)
GNU LibC splits “x” into N + j/2^7 + Z where N & j are integers. Hence 2^x ~= 2^N * 2^(j/2^7) * 2^Z, where 2^(j/2^7) is looked up in a table, & 2^Z is computed via a polynomial. I’m not clear what’s happening with N.
Then there’s validation…
The AVX512 implementation uses opcodes like vshuff32x4
, kxnorw
, vcvtps2pd
, vpsrlq
, vpbroadcastd
, vpternlogq
, vrp28pd
, vrndscalepd
, vcvtdq2pd
, & vgatherqpd
. Though there’s other implementations.
GNU LibC implements log(x)
as exponent*log(2) + log(mantissa), with adjustments for mantissa > 4/3. With log(mantissa) being approximated as a polynomial of varying degrees. And ofcourse there’s validation!
The AVX 512 implementation uses relatively few weird opcodes, though it does use vpsrad
, vcvtdq2ps
, & vpbroadcastd
. Mostly multiply-adds. Though again there’s other implementations.
For exp
& related it takes the input as resembling N*ln(2) + ln(2)*(j/2^k) + r
.
As such exp(X)
= exp(N*ln(2) + ln(2)*(j/2^k) + r)
= 2^N * 2^(j/2^k) * exp(r)
. 2^N
is calculated via bit manipulation. 2^(j/2^k) via a lookup table.
exp(r)` approximated via polynomial. And there’s the usual validation.
AVX512 implementation uses vpcmpgtd
to compare against a threshold value, & vpbroadcastd
in regards to computing 2^N
. Otherwise it’s mostly multiply-adds. Again there’s other implementations.
Linux Port
Most of GNU LibC’s porting effort ofcourse goes to Linux, and to HURD. So how does this well optimized port work?
These functions which on Linux wraps system calls are (unless otherwise stated the syscall identifier is the same as the function’s), or are trivial wrappers for other functions include:
xmknod[at]
(wraps__mknodat
with error reporting)writev
write[_nocancel]
(“write”)waitid
wait3[_time64]
(wraps__wait4_time64
withstruct __rusage64
tostruct rusage
output conversion)vmsplice
versionsort64
(wrapsstrverscmp
)ustat
unlockpt
(wrapsioctl(TIOCSPTLCK)
)umount2
& it’sumount
wrappertruncate64
truncate
(“truncate64” or “truncate” depending on__NR_truncate
build macro)glibc_tls_internal[_free]
(wrapsTHREAD_SELF
’stls_state
property)timer_getoverrun
with timer ID input conversiontelldir
(wraps lockedfilepos
property of givenDIR*
)tee
tcflush
(wrapsioctl(TCFLSH)
)tcflow
(wrapsioctl(TCXONC)
)tcdrain
(“ioctl” givenTCSBRK
& 1 as args)sync_file_range
(“sync_file_range2” or “sync_file_range” depending onNR_sync_file_range2
build macro)stat
(wrapsfstatat(AT_FDCWD)
)splice
socket[pair]
possibly viaSOCKETCALL
macrosigwaitinfo
(wrapssigtimedwait
)sigsuspend
(“rt_sigsuspend”)sigqueue
("rt_sigqueueinfo"
) with inputs & user/process IDs packaged into asiginfo_t
sigprocmask
(wraps"__pthread_sigmask"
) storing aside errorssigpending
(“rt_sigpending”)signalfd
(“signalfd4”)shutdown
possibly viaSOCKETCALL
macroshmget
(“shmget” or “ipc(IPCOP_shmget)”)shmdt
(similar as forshmget
)setuid
(“setuid32” or “setuid” depending on__NR_setuid32
build macro)settimezone
(“settimeofday” if available)setsockopt
possibly viaSOCKETCALL
macrosetrlimit64
(“prlimit64(0)”)setreuid
(“reuid32”)setresuid
(“setresuid32”)setresgid
(“setresgid32”)setregid
(“setregid32”)setgroups
(“setgroups32”)setgid
(“setgid[32]”(1))seteuid
(“setresuid[32]”(3, -1)) with error handling for ~0 GIDssendto
possibly viaSOCKETCALL_CANCEL
macrosend[m]msg
possibly viaSOCKETCALL_CANCEL
macrosend
(“send[to]”) possibly viaSOCKETCALL_CANCEL
macrosemop
(wrapssemtimedop
)semget
(“semget” or “ipc(IPCOP_semget)”)sched_setaffinity
compiletime-optionally with legacy compatibilitysched_getcpu
(“getcpu”) possibly viaINLINE_VSYSCALL
safe_fatal
(“getpid” & “kill”)- `renameat (“renameat[2]”)
rename
(“rename[at[2]]”)recvmsg
possibly viaSOCKETCALL_CANCEL
macrorecvfrom
possibly viaSOCKETCALL_CANCEL
macrorecv
(“recv[from]”) possibly viaSOCKETCALL_CANCEL
macroreboot
with magic numbersreadv
read[_nocancel]
(“read”)readahead
pwritev[64]
(“pwritev”) compiletime-optionally if successful withatomic_pwritev[64]_replacement
callpwrite[64]
__pthread_initialize_pids
(“set_tid_address”) with the givenstruct pthread*
’stid
property__profil_counter
(wrapsprofil_count
&sigcontext_get_pc
) with anasm volatile
hack to prevent theprofil_count
call from being tailcall-optimized__profile_frequency
(wrapsGLRO(dl_clktck)
)process_vm_writev
process_vm_readv
preadv[64]
(“preadv”) compiletime-optionally if sucessful with__atomic_preadv64_replacement
callpread[64[_nocancel]]
(“pread64”)prctl
lowering away variadic argumentspkey_get/set
is legacy erroring nooppkey_mprotect
, for -1 wraps__mprotect
pause
, fallsback to “ppol” if that syscall is unavailableopen[at[64]][_nocancel]
(“openat”) lowering away variadic argumentsopen_by_handle_at
- `writev_nocancel_nostatus (“writev”)
msync
msgsnd
fallsback to “ipc(IPCOP_msgsnd)”msgrecv
fallsback to “ipc(IPCOP_msgrcv)”msgget
fallsback to “ipc(IPCOP_msgget)”mq_send
(wrapsmq_timedsend
)mq_receive
(wrapsmq_timedreceive
)mq_getattr
(wrapsmq_setattr
)mq_close
(“close(1)”)lseek64
(“llseek” or “lseek”)lseek
(“llseek”) with error reporting of overflowslocal_seteuid
macro (“setresuid[32]”)local_setegid
macro (“setresgid[32]”)listen
possibly viaSOCKETCALL
macrowritev_for_fatal
(“writev”) in a looplibc_signal_block_all
("rt_sigprocmask(SIG_BLOCK)"
), same file has input bittwiddle abstractions- Other
"rt_sigprocmask"
syscall wrappers from that file includelibc_signal_block_app
(with input preprocessing,libc_signal_block_sigtimer
,libc_signal_unblock_sigtimer
, &libc_signal_restore_set
. grantpt
(wrapsioctl(TIOCGPTN)
)gettimeofday[_syscall]
withmemset
on optional inputgettimeofday[64]
(wrapsclock_gettime64
) with `memset on optional input & output format conversion possibly with validationgetsockopt
possibly viaSOCKETCALL
macrogetsockname
possibly viaSOCKETCALL
macrogetrlimit
(“[u]getrlimit”)getrandom
posix_openpt
(wrapsopen
) & it’s wrappergetpt
getpriority
with output validationgetpeername
possibly viaSOCKETCALL
macrogetpagesize
(wrapsGLRO(dl_pagesize)
)getlogin
(wrapsgetlogin_r_loginuid
& if successfulgetlogin_fd0
)getcpu
possibly viaINLINE_VSYSCALL
macrogetclktck
(wrapsGLRO_clktck
orSYSTEM_CLK_TCK
global)ftruncate[64]
(“ftruncate64”)fsync
- There’s numerous
fstat[at][64][_time64]
wrappers, which I’ll discuss soon-ish fdatasync
FATAL_PREPARE
macro (wraps__libc_ptf_call(__pthread_setcancelstate)
)fallocate[64]
(“fallocate”)exit_thread
(“exit”) in infinite loopexit
(“exit_group” & “exit”) fallingback to CPU opcode & infinite loopeventfd_write
(wrapswrite
)eventfd_read (wraps
read`)epoll_wait
fallsback to wrappingepoll_pwait
epoll_pwait
_dl_writev
(“writev”)_dl_write
(“write”) with output validationopenat64
(“openat(3)”)dirfd
(wrapsfd
property of givenDIR*
)create[64]
(“creat” or wraps__open64
)copy_file_range
connect
possibly viaSOCKETCALL_CANCEL
macroclose[_nocancel]
(“close”)brk
with output validationbind
possibly viaSOCKETCALL
macroalphasort64
(wrapsstrcoll
)access
fallsback to “faccessat”accept4
accept
(“accept[4]”) possibly viaSOCKETCALL_CANCEL
macrostatfs
(“statfs64”) with overflow error reportingfstatfs
(“fstatfs64”) with overflow error reportingunlink
&rmdir
(“unlinkat”)symlink
(“symlinkat”)readlink
(“readlinkat”)pipe
(“pipe2”)- `mkdir (“mkdirat”)
link
(“linkat”)lchown
(“fchownat”)inofity_init
(“inotify_init1(1, 0)”)epoll_create
(“epoll_create1”) with input validationdup2
(fcntl(F_GETFL)
if two args are equal or “dup3”)chown
(“fchownat”)chmod
(“fchmodat”)
Some more functions around syscalls, ones which wraps several syscalls, are:
xstat64
(“[f]stat[64]”, “newfstatat”, or “statx”)xstat
(“[f]stat[64]”)utime[n]s[at][64]
(“utimensat[64]”) with validation & format conversiontimes
(“times”) with validationtimer_settime64
possibly with validation, format conversion, & “timer_settime” fallback; hastimer_settime
wrappertimer_gettime64
possibly with fallback totimer_gettime
; hastimer_gettime
wrappertimerfd_settime64
possibly with validation, output conversion, & fallback totimerfd_settime
; hastimerfd_settime
wrappertimerfd_gettime64
possibly with fallback totimerfd_gettime
; hastimerfd_gettime
wrappertime
depending on startup flags may alternatively wrapclock_gettime64
with validation & format conversiontcsetattr
(“ioctl”) with input conversionstatx
possibly (compiletime) with external fallbackstatfs64
possibly withstatfs
fallbacksigtimedwait[64]
(“rt_sigtimedwait[_time64]”)shmat
with “ipc(IPCOP_shmat)” fallbacksetrlimit
(“prlimit64(0)”)setitimer[64]
(“setitimer” in 1 or 2 configurations depending on build macros) with or without format conversionsemtimedop[64]
(“semtimedop[_time64]” or “ipc(IPCOP_semtimedop)”)select[64]
(“pselect6[_time64]” or “select” fallback) with format conversionsched_rr_get_interval[64]
(“sched_rr_getinterval_time” with “…64” fallback)sched_getaffinity
withmemset
tidying & optional legacy compatibilityrecvmmsg[64]
(“recvmmsg[_time64]”) possibly usingSOCKETCALL_CANCEL
macro on “recvmmsg” fallback syscallraise
(“getpid”, “gettid”, & “tgkill”) with a temporary signal blockpwrite[v64]v2
(“pwritev2”) falling back to__writev
orpwritev[64]
ptrace
with variadic lowering & saving aside errorspthread_sigqueue
(“getpid” & “rt_tgsigqueueinfo”) with input validation & gathering into asiginfo_t
, & error reformattingpthread_kill
(“getpid” & “tgkill”) with input validation,pthread_t
’stid
property, & error reformattingpselect32
(“pselect6”) with input validation & reformattingpselect[64]
(“pselect6_time64”) with input rearranging &pselect32
fallbackprlimit
(“prlimit64”) with input & output gathering to/fromstruct rlimit64
spread[v64]v2
(“preadv2”) with output validation & fallback to__readv
orpreadv64
ppoll64
(“ppoll[_time64]”) with input reordering & caching whether 64bit is supportedposix_madvise
(“madvise”)posix_fallocate64_l64
(“fallocate`) with external (another GNU LibC file) fallbackposix_fallocate
(“fallocate”) with certain 0’d out, & external fallbackposix_fadvise64_l64/l32
(“fadvise64_64”) handling ABI breakposix_fadvise
(“fadvise64[_64]”) handling ABI breakpoll
(“[p]poll”)personality
with error reformattingsetup_thread
(“set_tid_address”) withconfstr(_CS_GNU_LIBPTHREAD_VERSION)
check &head->nscd_certainly_running
struct database_dyn*
’s propertyaccess_noerrno
(“access” or “faccessat”) with error reformattingkill_noerrno
(“kill”) with error reformattingmq_unlink
with input validation & error reformattingmq_timedsend[_time64]
with “mq_timedsend” fallbackmq_timedreceive_[time64]
withmq_timedreceive
fallbackmq_open
with input validation, variadic lowering, & amq_open_2
wrappermmap[64]
(“mmap[2]”) with CPU-specific input preparation/validation (unless overriden by CPU ports,MMAP_CALL
macro this uses is a direct wrapper aroundINLINE_SYSCALL_CALL
)mlock[2]
(“mlock2”, possibly trying “mlock” first given no flags)mknodat
with input validationlxstat64
(“lstat[64]”, “[new]fstatat[64]”, or “statx”)lxstat(_STAT_VER_KERNEL)
(“fstatat64” or “lstat” depending on build flags)lxstat
otherwise (“lstat64”) with output reformattinggetrusage[64]
(“getrusage”) with output reformattinggetrlimit64
(“prlimit64(0)”) with optional legacy compatibilitygetitimer[64]
(“getimer”) with output reformattinggai_sigqueue
(“rt_sigqueueinfo”) gathering inputs into asiginfo_t
fxstatat[64]
(“[new]fstatat[64]” or “statx”) possibly with light output reformattingfxstat64
(“fstat[64]” or “statx”, runtime & compiletime conditions)fxstat
(“fstat[64]”, runtime & compiletime conditions)futimens64
- same as forutimensat64
fstatfs64
withfstatfs
fallback caching whether that’s needed[f]statat64[_time64]
(“statx”, “newfstatat”, or “fstatat64” fallback)fstatat
(“fstatat64”) with output validation/reformatting & “newfstatat” fallbackfcntl64_nocancel[_adjusted]
(“fcntl64”) with normalized inputfcntl64
with above fallbackfcntl
(“fcntl64”) with variadic arg validated/reformatted based on the specifiedcmd
, with optional legacy compatibilityfaccessat
(“faccessat[2]”) with output-validated/reformattedfstatat64
fallbackcollect_default_sched
(“sched_getscheduler” and/or “sched_getparam”)clock_settime[64]
with validated input & output, & “clock_settime” fallbackclock_nanosleep[_time64]
with input validation/reformatting & “clock_nanosleep” fallbackclock_gettime64
with “clock_gettime” fallback & caches whether that’s neededclock_getres[64]
("clock_getres[_time64]"
) caching “clock_getres” fallback’s neededclock_getcpuclockid
("clock_getres[_time64]"
) with minor reformattingclock_adjtime64
with fallback to “clock_adjustime” & output reformattingarch_fork
(“clone[2]”)aio_sigqueue
("rt_sigqueueinfo"
) gathering input into asiginfo_t
adjtimex[64]
wrapsclock_adjtime64
aio_create_helper_thread
("rt_sigprocmask"
) run in a new pthread surrounded by “rt_sigprocmask” callssendfile
(“sendfile64”) with input validationlseek_overflow
&stat[fs]overflow
validates & saves asideEOVERFLOW
errorslongjmp_chk
(“sigaltstack”) wraps__longjmp
with linker validation
Here I’ll study the functions which don’t call into Linux, at least directly, & doesn’t use many magic numbers!
There’s a few functions to check & report overflows.
There’s format conversion helper functions for e.g. xstat[64,32]
or rusage
.
I see more wrappers around the time syscalls.
I see a pthread which infinite loops calling sigwaitinfo
& spawning new pthreads for provided callbacks for implementing timers.
I see atomic accessors for a global caching whether ..._time64
syscalls are supported.
There’s a thread_sleep[64]
wrapper around clock_nanosleep_time64
, thread_sleep
adds format conversion.
Unless overriden by CPU ports INTERNAL_VSYSCALL_CALL
macro compiles to a function call, whilst INLINE_VSYSCALL
wraps it & INTERNAL_SYSCALL_CALL
with error checks & GLRO
globals. INTERNAL_SYSCALL_ERROR_P
compares the given value against -4096. SYSCALL_ERROR_LABEL
wraps set_errno
.
INLINE_SYSCALL
macro wraps INTERNAL_SYSCALL
, INTERNAL_SYSCALL_ERRORP
, & depending on that result SYCALL_ERROR_LABEL
& (which negates it’s arg) INTERNAL_SYSCALL_ERRNO
. INLINE_SYSCALL_ERROR_RETURN_VALUE
also wraps set_errno
. There’s alignment macros which by default assumes registers are paired like on x86. There’s SYSCALL_LL[64][_PRW]
macros for reformatting 64bit args if necessary when passing to Linux. & LO_HI_LONG
splits a word in 32bit halves.
There’s more statfs[64]
wrappers, & format conversion helpers. And a sigtimedwait
retry-wrapper sigwait
.
sigstack
wraps sigaltstack
with stack_t
datastructures.
There’s a sigset_t
smallintmap counter implementation.
set[ipv4]sourcefilter
wraps setsockopt(MCAST/IP_MSFILTER)
with input (including response from get_sol
) gathered into stack or heap-allocated memory.
Dirs are read like normal files on Linux, and there’s functions for decoding their intermediate fileformat. Though reading individual entries of a directory is a seperate “getdents64” syscall, despite what I stated above, even if there’s still need for buffering & output reformatting. Has several wrappers.
ptsname[_r]
wraps ioctl(TIOCGPTN)
with format conversion.
There’s a format conversion functions, some retrieving clock IDs from struct pthreads
.
Several more time-related wrappers.
ifreq
opens a socket to wrap ioctl(SIOCGIFCONF)
with format conversion, similar for if_nametoindex
/if_indextoname
& ioctl(SIOCGIFINDEX)
.
if_nameindex[_netlink]
wraps calls into Linux’s Netlink system to get linkedlists to reformat into an array, with a corresponding function to free that data.
getipv4sourcefilter
works like setipv4sourcefilter
.
fdopendir
calls fstat64
, fnctl64_nocancel(F_GETFL)
& alloc_dir
to validate the given filedescriptor.
fpathconf
dispatches to different functions processing the response from fstatfs
, or fallsback to posix_fpathconf
.
setup_vdso_pointers
, called the ELF loader, loads GLRO
globals from said ELF loader.
DL_SYSDEP_OSCHECK
validates OS version requirements from GLRO
globals. dl_setup_stack_chk_guard
reprocesses some provide random numbers to help validate against stack underflow.
dl_make_stack_executable
wraps mprotect
with compiletime-optional validation & setting of a bitflag in GL(dl_stack_flags)
.
cmsg_nxthdr
performs pointer arithmatic to traversea struct cmsghdr
.
There’s code for reading/allocating linkedlists as mentioned above & validating them from Linux’s PF_NETLINK
sockets. And freeing said data. With usecount & “timestamp” atomic counters. Some such functions validate the interfaces use native transport, utilizing failure-retry.
And there’s structs, enums, & macros defining the format of a.out files.
Here I’ll look at the function implementations which use some magic numbers.
The TRANSFORM_UTMP_FILE_NAME
macro wraps a few iterations of strcmp
& access
checks upon the given filename & some global constants typically with “x” appended. I see this exact same macro in two different files: one extending <login/utmp_file.c>
& the other <login/updwtmp.c>
.
There’s funcs for transform “stat64” parameter formats involving some bittwiddling.
readdir_r
whilst locking the providing DIR*
repeatedly calls readdir_unlocked
until successful.
opendirat
wraps openat_nocancel
with non-empty filename & fstat64
validation before allocating the DIR*
.
There’s ftime[64]
format conversion which may involve a divide & zeroing unspecified fields.
dl_osversion_init
& it’s wrapping envvar-loading macros parses provided string with some bittwiddling & stores in GLRO(dl_osversion)
.
getcwd
wraps “getcwd” syscall with validation.
sethostid
writes the given 32bit ID number to /etc/hostid. gethostid
reads that number from that file with a fallback to gethostname
with repeated gethostbyname_r
calls given an exponentially-growing buffer.
There’s adtime[64]
wrapping clock_adjtime64
with format & unit conversion.
dl_discover_osversion
may consult GLRO(dl_sysinfo_map)
looking for the Linux version number. Or it’ll use uname
. Or /proc/sys/kernel/osrelease. Which requires parsing the integers.
getsourcefilter
stack-or-heap allocates a struct group_filter
, iterates over a table to retrieve the “socket level” from the given “address family” whilst validating address bytelength, & defers to getsockopt(MCAST_MSFILTER)
on that socket level.
There’s an internal function for validating responses AF_NETLINK
sockets. With a call to getsockname
& fcntl(F_GETFL)
, error code validation, bytesize validation, & error messages.
wait4_time64
wraps the “wait4” or “waitid” syscalls possibly with some bittwiddling format conversion & NULL checks.
timer_delete
wraps the "timer_delete"
syscall with input conversion & a mutex-locked linked-list. timer_create
has more input conversion and starts a helper thread with it’s data under a once flag before recalling the syscall.
tcsendbreak
wraps “ioctl(TSCBRK)” if input < 0 or with unit conversion “ioctl(TCSBRKP)” if supported. Otherwise errors.
tcgetattr
wraps “ioctl(TCGETS)” if successful with output conversion with bitmasks & memcpy
s.
cfget(i/o)speed
wraps struct termios
properties with bitmasks. cfset(o/i)speed
does same with some additional validation.
sigaction
wraps "rt_sigaction"
syscall with, if valid, conversion between given struct sigaction
s & a struct kernel_sigaction
.
getentropy
wraps repeated “getrandom” syscalls until the provided buffer is filled.
spawni[x]
wraps numerous syscalls to configure the childprocess environment (including mmap
ing the stack) & call the configured exec syscall.
There’s conversion functions between struct __msqid64_ds
& struct kernel_msqid64_ds
involving bitshifts. msgctl_syscall
wraps “msgctl” or "ipc(IPCOP_msgctl)"
, which msgctl64
wraps with format conversion under different conditions. There’s conversion functions between struct __shmid64_ds
& struct_kernel_shmid64_ds
. shmctl_syscall
wraps “shcmctl” or "ipc(IPCOP_shmctl)"
, which shmctl64
wraps with format conversion under different conditions. Same situation with struct __shmid64_ds
& struct shmid_ds
. And between struct __semid64_ds
& struct kernel_semid64_ds
, as well as struct __semid64_ds
& struct semid_ds
. As per struct __msqid64_ds
& struct msqid_ds
getdents
wraps “getdents64” syscall with repeated response parsing involving memmove
s & lseek64
s upon error.
create_thread
wraps "sched_setaffinity"
, “tgkill”, and/or “sched_setscheduler” under different conditions with locking, input validation, & input reformatting.
mq_notify
conditionally wraps a corresponding syscall, possibly with an atomically initialized helper pthread with signals blocked to manage a AF_NETLINK
socket running callbacks in in new pthreads upon each message.
getifaddrs
wraps a PF_NETLINK
socket with a couple of requests, responses parsing with subsequent multipass multi-conditional reformatting & validation. Includes corresponding routines to free any memory this allocated.
Linux exposes some of it’s capabilities to userspace as virtual filesystems. GNU LibC’s Linux port may consult some of these, and that’s what I’ll be studying today!
check_may_shrink_heap
consults /proc/sys/overcommit_memory for the text “2” caching the result.
getloadavg
consults /proc/loadavg parsing at most 3 doubles from it’s text.
gen_tempfd
opens the tempdir consulting a global for the filepath falling back to /tmp.
get_login_r
consults /proc/self/loginuid to get a user ID to pass through (given an exponentially growing buffer) getpwuid_r
validating both responses. Falling back to getlogin_r_fd0
.
fips_enabled_p
consults /proc/sys/crypto/fips_enabled parsing the integer & caching whether the response is true, false, or fail.
fexecve
, when the execveat
syscall is unavailable, fallsback to execve
ing the appropriate file in /proc/self/fd/. Erroring with ENOSYS if that dir’s unavailable.
dl_get_origin
wraps the /proc/self/exe symlink with output validation & a fallback to GLRO(dl_origin_path)
.
fchmodat
, wherever it’s syscall is unavailable, fallsback to fstatat64
ing the given filepath opened to the given filedescriptor number validating response, & chmod
s the corresponding symlink within /proc/self/fd/.
sysconf(SC_MONONIC_CLOCK/SC_[THREAD_]CPUTIME)
returns POSIX_VERSION constant.
sysconf(SC_ARG_MAX) wraps
getrlimit &
MAX.
sysconf(SC_NGROUPS_MAX) parses the textual integer from /proc/sys/kernel/ngroups
_max. And
sysconf(SC_SIGQUEUE_MAX)` parses the textual integer from /proc/sys/kernel/rt-sigmax.
readonly_area
parses the -/space seperated table of textual integers at /proc/self/maps.
opensock
checks whether /proc/net is available followed a table of /proc/ subpaths for each protocol family, opening a socket for the first protocol family whose corresponding subfilepath is available & caching which that was.
ttyname[_r]
validates/fstat64
s it’s given filedescriptor reading the corresponding symlink within /proc/self/fd/ to get the actual filepath to validate. Failing that it locates the corresponding file within /dev/pts/ or /dev/.
shm_directory
checks /dev/shm/ is available before consulting /proc/mounts to locate an appropriate “tmpfs” or “shm” mountpoint to. The successfully-located directory is cached & returned, with that initialization happening only atomically-once.
get_nprocs_conf
counts the /sys/devices/system/cpu/cpu* pseudofiles falling back to parsing /proc/cpuinfo, compiletime-optionally checking/parsing /sys/devices/system/cpu/online & /proc/stat before /proc/cpuinfo.
pathconf
statfs
s the given filepath reformatting it’s f_type
differently given PC_LINK_MAX
, PC_FIELSIZEBITS
, or PC_2_SYMLINKS
. Given PC_CHOWN_RESTRICTED
it just validates whether that statfs
was successful. Otherwise fallsback to posix_pathconf
.
Here I’ll discuss GNU LibC’s Linux port’s datafiles. And their generators.
There’s a dataheader of _STAT_VER_...
& _MKNOD_VER_...
enum-macros.
There’s a dataheader of standard filepaths used internally to this port.
There’s a dataheader of magic identifiers for different filesystems.
There’s a dataheader for CPUCLOCK magic numbers & bittwiddling.
There’s a dataheader of buildflag macros.
There’s a dataheader for an IPCOP_...
macro-enum.
There’s a dataheader of version numbers for /dev/{null,full,tty}.
A dataheader stores the /proc/self/fd/ directory path.
Another defines the SCSI_IOCTL_...
macro-enum.
There’s several TSV datafiles in the same format as for UNIX more generally, I’ll see how many I can list in the next toot…
alarm
bdflush
capget
&capset
create/delete/init_module
epoll_create[1]/ctl
eventfd
execve
flock
get_kernel_syms
get[p,e,re](p,g,u)id
,getpgrp
, &getsid
inotify_init[1]
&inotify_add/rm_watch
ioperm
&iopl
klogctl
lchown
mincore
m[un]lock[all]
mount
mremap
nfsservctl
pipe[2]
pivot_root
query_module
quotactl
remap_file_pages
sched_(g,s)et(p,s)
sched_prim(ax,in)
sched_yield
sendfile[64]
set(fs(g,u),pg)id
sigaltstack
sysinfo
swapo(n,ff)
unshare
uselib
chown
fchownat
mkdirat
[sym,un,read]linkat
[l,f](s,g)etxattr
&[l,f](list,remove)xattr
mq_setattr
timerfd_create
fanotify_init
name_to_handle_at
setns
memfd_create
pkey_alloc/free
gettid
tgkill
For 64bit wordsizes:
[f]statfs
sendfile
prlimit
fanotify_mark
personality
For 32bit wordsizes: prlimit64
& fanotify_mark
.
And also in a seperate file (don’t understand what the “generic” subdirectory’s about) there’s:
socket[pair]
bind
listen
get(sock,peer)name
(g,s)etsockopt
shutdown
A seperate datafile lists all the names newline-seperated from these files.
Then there’s scripts to help generate these datafiles from Linux’s build files!
There’s a Sed script to insert additional code into the linker’s shellscript.
There’s a localized C script which checks the given file for some magic numbers & execs it with some environment variables.
There’s a Python script which parses the Linux header files (via GCC subprocesses & regexps) to define macros specifying the ID number for each syscall.
There’s an AWK script which wraps the __NR_...
macros with SYS_...
macros if you’re on a kernel version which supports it.
And there’s another AWK script for stripping off the __NR_...
-prefix.
While GNU LibC’s Linux port doesn’t link directly to Linux’s headerfiles, there are a small part of Linux’s, and the internet’s, extensive ABI which is relevant to it. So GNU LibC redeclares these itself as C macros & structs.
There’s a headerfile which declares the structure, including bitflags, of the arguments to pass to the epoll
& inotify
syscalls. Another defines enums & (largely-corresponding) macros defining ints passed into mount
, prctl
, personality
(which has two, one of which represents bitflags), ptrace
, & (bitvalues look interesting to human debuggers) reboot
syscalls. There’s macros representing various control characters.
adjtimex
takes a struct holding a struct timeval
& several long int
s, half of which are padding, whilst returning a value from a macro-enum. signalfd
’s input struct holds numerous ints signed or unsigned of various bitwidths (usually 32bits).
Config request IOCTLs takes a struct of 3 64bit unsigned ints (O.K., first int’s less constrained) representing a version number. quoteactl
syscall takes structs of 32bit, & several 64bit, unsigned integers which have corresponding bitpacking macros.
Additional Linux-specific structs for ELF are defined, namely: struct prstatus
/prstatus_t
which contains several ints signed or unsigned of different bitwidths, __pid_t
s, & struct timeval
s, & prpsinfo_t
mostly contains text.
acct
takes enums of text, numerous unsigned 16 or 32bit integers, & maybe a float. With a corresponding bitflag enum.
There’s a headerfile defining SCSI control characters alsongside a struct to hold 12 of them for different hardware purposes. There’s structs for SCSI communication over IOVecs including sized pointers, request headers, & different request types consisting largely of ints of various bitwidths. Associated macros defines what goes into those properties.
Various time-related syscalls take a struct __timex64
or __ntptimeval64
holding various (usually long) ints, plenty of padding, & a struct __timeval64
.
GNU LibC defines various routing protocols in the form of C structs. Linux is the one actually doing the routing (even if you don’t personally run Linux, the “router” your ISP gave you probably does), but I suppose the raw packets are exposed to userspace in case they want additional info from it.
Amongst these are: “Rose” for packet radio use, NetROM, arbitrary packets, IUCV for cars, Internetwork Packet eXchange, IEEE 802.5 Token Rings, Fiber Distributed Data Interface, Ethernet + ARP, EcoNet (from Acorn Computers), AX25 for amateur radio, Ash for wireless network sensors, Point to Point Protocol, ARP for local network address lookup, & more Ethernet.
Also there’s a socket level for Apple Talk declared via macro; a text-containing struct for addressing AF_SOCKET
sockets, a struct, with bitflag macro-enums, for sending routing tables to the kernel; & a struct shaperconf
holding a 16bit command & a 14char name or 32bit speed.
I’ll focus on Ethernet, Point-to-Point Protocol, & ARP. Apparantly IP is too high-level here…
Point-to-Point Protocol declares a struct npioctl
holding a protocol (int) & mode (enum). struct
ppp_option_data is a sized pointer with an integral
transmit field.
struct ifppp[c]statsreq combines
struct ifreq &
struct ppp_[comp_]stats`. With macros for what should go in these fields.
ARP headers contain bytes specifying the hardware & protocol address formats & lengths, & an opcode byte with a macro-enum. Higher-level structs package the addresses into struct sockaddr
s.
Ethernet addresses contain 16 bytes. Ethernet headers contain two of those plus a 16bit macro-enum “type”. Plus some pointer-arithmatic macros.
Linux x86_64 Port
Some parts of the kernel ports, including Linux’s are CPU-specific. Especially regarding how systemcalls are triggered in the kernel. I’m specifically focusing on x86_64 CPUs here since that’s what’s common & what I use.
Starting with simpler macros & functions…
internal_syscall1
loads the argument into the “rdi” register before deferring to the syscall
opcode.
There’s ZERO_EXTEND_...
macros which mov
s from a smaller register to a larger register. ARGIFY
performs type coercion if necessary.
SYSCALL_ERROR_HANDLER
macro saves the error code aside (via another macro SYSCALL_SET_ERRNO
) & returns -1. lseek*
defaults to noops.
In preparation for mmap
syscalls MMAP_PREPARE
macro will first attempt to use the 32bit syscall if the CPU indicates it prefers that. This can be overriden via an environment variable (which is not inherited by setuid programs) updating the GLRO
global.
There’s a timer_settime/settime
syscalls with input format conversion.
There’s optional legacy compatibility for timer_getoverrun/delete
, or with atomic postprocessing timer_create
.
There’s a syscallpath which negates the “RAX_LP” register.
There’s syscall
assembly function which wraps the syscall
opcode with mov
s of all 5 arguments + syscall number into the appropriate registers & an error path.
sigcontext_get_pc
retrieves the saved “RIP” registers out of the given out of the given ucontext_t*
.
The vfork syscall pop
s it’s return address into the “RDI” register (that’s what’s available per calling convention) whilst running the syscall, then uses xorl
, rdsspq
, & testq
to determine whether the “shadowstack” is being used & thus what to return.
There’s macros for prepending the __NR_
-macroprefix to syscall names, & others wrapping other syscalls with various fallback codepaths.
There’s different macros wrapping the syscall
opcode for different numbers of args.
There’s a macro wrapping the restore[2]
syscall with a struct (additional macros mutating it’s fields) & trailing magic numbers.
The arch_prctl
syscall is wrapped with minor pre/post-processor for ARCH_GET_FS/GS
codes.
For more complex functions…
clone
wraps that syscall with (in Assembly) validation, argument gathering, and a “thread_start” codepath to call the given codepath & exit
.
getcontext
wraps that syscall with lots of movs & a compile/run-time optional codepath for the presence of a “shadow stack” including it’s own syscall.
There’s longjump validation code with both userspace & kernelspace checks, possibly aborting with a “longjump causes uninitialized stack frame” message.
setcontext
& swapcontext
wraps the rt_sigprocmask
syscall followed by loading various pieces of CPUstate. Then there’s a loop looking for a “restore token” to load in “shadow stack” state. Program Counter last.
makecontext
configures the ucontext_t*
CPUstate to call the given function pointer with given args upon swapping to it.
makecontext
calls down into a __push___start_context
Assembly function which wraps the "arch_prctl"
syscall & mutates the stack to push __start_context
function onto it. So that when makecontext
’s callback returns it’ll call exit
possibly preceded by setcontext
.
There’s a C function for serializing CPU state to human semi-readable text, including hexcodes via itoa_word
.
add_system_dir
macro appends one of “/lib[64,x32]” to the given library path.
Looking at the datafiles…
There’s a buildfile for generating macros to access the ucontext_t
fields. Anothers for stack_t
fields.
There’s a Sed script for rewriter a linker shellscript to correct LD_LIBRARY_VERSION
& RTLDLIST
environment variables.
Another dataheader lists filepaths to ELF linkers & (which all this is the source code to!) libm
/libc
shared libraries.
There’s macros for the bytesizes of saved CPUstate & the dynamic loader cache.
Other datafiles list syscalls like arch_prctl
, modify_ldt
, syscall_clock_gettime
, bind
, get(peer,sock)name
, (g,s)etsockopt
, listen
, shutdown
, & socket[pair]
.
There’s 2 headerfiles declaring syscall IDs.
The ID for the set_thread_area
syscall is declared seperately.