System Databases

A lot of computing, if not most of it (save communicating this information to humans), is dedicated to maintaining databases: to updating and querying various collections of data. Some of the databases most core to making your computer function at all are discussed here.

Package Management (Apt)

To assemble all the software developed by Linux, GNU, FreeDesktop.Org, GNOME, KDE, etc, etc, etc into a functioning operating system, “distros” (as we often insist on calling them) create “package manager” tools for installing this software from their curated “package repositories”. Programming languages often have their own dedicated ones, which I’d consider a great idea if those were decently curated.

For this, Debian has the Advanced Packaging Tool “APT”.

apt-pkg

Apt’s underlying library is predominantly apt-pkg as described here.

There’s a pkgRecords C++ class which converts a cached array into a package mapping indexed by ID.

There’s a pretty-printer.

There’s a routine to fetch each queued package update with user interaction. There’s a more general routine which applies various queued changes with user interaction.

There’s a datafile listing package header keys in a standardized order.

There’s a class which parses a list of local files listing packages into an array.


There’s a dist-upgrade routine which flags all installed packages to be upgraded, checks whether to install all essential packages, reinstalls all previously installed packages to resolve conflicts, possibly flags held packages to be kept, & repairs any issues, whilst updating a progressbar.

Another routine goes through all packages selecting which ones are interesting to upgrade. There’s a couple routines for upgrading packages with or without installing new ones, & another which chooses between these codepaths.
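
To make that concrete, here’s a rough sketch of such a marking pass in C. All the types & helpers are hypothetical stand-ins, not apt-pkg’s actual API:

    /* Hypothetical sketch of an upgrade-marking pass; these names are
     * illustrative, not apt-pkg's actual DepCache API. */
    struct package {
        int installed, held, essential, has_newer;
        int marked_upgrade, marked_keep;
    };

    static void mark_all_upgrades(struct package *pkgs, int n) {
        for (int i = 0; i < n; i++) {
            struct package *p = &pkgs[i];
            if (!p->installed && !p->essential)
                continue;               /* only installed or essential packages matter */
            if (p->held) {
                p->marked_keep = 1;     /* held packages may be flagged to be kept */
                continue;
            }
            if (p->has_newer)
                p->marked_upgrade = 1;  /* later passes repair any breakage this causes */
        }
    }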


There’s a class which parses a list of index files with error reporting, & exposes results.

There’s a pkgPolicy class which fills in various default properties, including pinnings & priorities.

There’s a class which parses & compares version numbers.
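
Debian’s version ordering is worth seeing concretely. Here’s a sketch of the core comparison modelled on dpkg’s verrevcmp (ignoring epochs & Debian revisions for brevity); note how “~” sorts before everything, so 1.0~rc1 precedes 1.0:

    /* Sketch of Debian-style version comparison, modelled on dpkg's verrevcmp. */
    #include <ctype.h>

    static int order(int c) {
        if (isdigit(c)) return 0;
        if (isalpha(c)) return c;
        if (c == '~') return -1;   /* '~' sorts before everything, even string-end */
        if (c) return c + 256;     /* other punctuation sorts after letters */
        return 0;
    }

    static int vercmp(const char *a, const char *b) {
        while (*a || *b) {
            int first_diff = 0;
            /* compare runs of non-digits character by character */
            while ((*a && !isdigit((unsigned char)*a)) ||
                   (*b && !isdigit((unsigned char)*b))) {
                int ac = order((unsigned char)*a), bc = order((unsigned char)*b);
                if (ac != bc) return ac - bc;
                a++; b++;
            }
            /* then compare runs of digits numerically, ignoring leading zeros */
            while (*a == '0') a++;
            while (*b == '0') b++;
            while (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
                if (!first_diff) first_diff = *a - *b;
                a++; b++;
            }
            if (isdigit((unsigned char)*a)) return 1;   /* longer number is larger */
            if (isdigit((unsigned char)*b)) return -1;
            if (first_diff) return first_diff;
        }
        return 0;
    }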

I’m not clear what the metaIndex class is doing.

There are routines for parsing configuration & system files to determine which packages can be installed & run on this computer.

There’s a class for tracking & reporting install progress.


There’s of course some (line-by-line) parsing code in a dedicated file, and a 2nd one for RFC-822 “tags” which includes seeking around the file.
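
The format being parsed is the “Key: value” stanza syntax of Debian control files. A minimal sketch of parsing one stanza (Apt’s real parser additionally tracks offsets & supports that seeking):

    /* Minimal sketch of parsing one RFC-822-style stanza: "Key: value" lines
     * terminated by a blank line, as in Debian control files. */
    #include <stdio.h>
    #include <string.h>

    static void parse_stanza(FILE *f) {
        char line[512];
        while (fgets(line, sizeof line, f)) {
            if (line[0] == '\n')
                break;                      /* blank line ends the stanza */
            if (line[0] == ' ' || line[0] == '\t')
                continue;                   /* continuation of the previous value */
            char *colon = strchr(line, ':');
            if (!colon)
                continue;                   /* malformed line; real code reports errors */
            *colon = '\0';
            printf("key=%s value=%s", line, colon + 1);
        }
    }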

There’s an abstraction around “index files”.

There’s a routine iterating over Apt’s directories deleting irrelevant files. Apt implements its own directory iterator.

The prioritized partial orderings are computed by a dedicated pkgOrderList class. There’s a class holding & validating info parsed from the cache.


There’s a pkgPackageManager class abstracting most of this. It has methods for downloading the archives for given packages, marking missing packages to be kept, flagging a package & its dependencies for immediate install, partial-ordering dependencies, checking whether dependencies are irrelevant, checking a list of dependencies for conflicts, & checking which packages may get broken by an install. It can run “configuration” over all non-configured packages, with a couple methods performing a wide choice of that “configuration” on individual packages, including a couple repeated nested iterations & seeking out reverse-breakages as mentioned earlier. It carefully flags packages to be removed with user-reporting, carefully flags/runs removal of a package (actual removal is implemented elsewhere), unarchives packages extremely carefully in a loop with version iterations before running the actual install/configuration, & has another method for ordering dependencies, this one recovering from failures.
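
At its heart that partial ordering is a dependency-aware topological sort. A minimal sketch under that framing, with hypothetical types rather than pkgOrderList’s actual interface:

    /* Hypothetical sketch of partial-ordering packages so dependencies come
     * before their dependents; the real algorithm also weighs priorities,
     * loops, & immediate-configure constraints. */
    #include <stddef.h>

    struct pkg {
        struct pkg **deps;   /* packages this one depends on */
        int ndeps;
        int visited;
    };

    static void order_visit(struct pkg *p, struct pkg **out, size_t *count) {
        if (p->visited)
            return;
        p->visited = 1;
        for (int i = 0; i < p->ndeps; i++)
            order_visit(p->deps[i], out, count);  /* emit dependencies first */
        out[(*count)++] = p;                      /* then the package itself */
    }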


There’s a class abstracting file copies, typically off CD or USB, with careful validation (involving parsing, hashing, etc) & progress reporting, with auxiliary methods to carefully compute filepaths.

The cache parser has a separate class build its abstract syntax tree, normalizing & simplifying results. There’s an interpreter for querying the package cache, and a separate class for Aptitude-style syntax.

There’s a class abstracting a cachefile, and related classes.


There’s a class abstracting the configuration files away further, combining various field checks.

There’s a superclass for retrieving packages that implements user-reporting itself.

Debian supports reading packages off a USB or CD as a datasource, traversing all its directories not blocklisted by “.aptignr”, treating an “i18n” directory specially. Results are scored or ignored based on keywords. It can mount & eject verbosely.


There’s a “worker” class applying configuration & running background commands specified there.

There’s a class handling various edgecases with error-recovery and verbose output, since you really want Apt not to fail on you because then your whole OS might fail on you!

There’s a class extracting various subsets from the package cache.

There’s a class providing parsing & serialization utilities for package descriptors.


There’s a class caching per-package dependency information & tracking package installation-state.

There’s a class maintaining a queue of local & (via subprocesses) remote files to open. As well as a class processing each individual item in that queue, including communicating with the subprocesses & logging.


Looking over some somewhat-auxiliary subsystems for apt-pkg I see:

apt-private

Here I see utilities for commandline UI.

Apt Misc.

Looking through the rest of Apt, I see:

Man Pages

The first package Linux From Scratch has you install during the main userspace build is “Man-Pages”, because documentation is essential for software: can a feature truly be said to exist if no one knows how to use it?

This package mostly provides documentation for POSIX-standardized APIs/commands/etc implemented by GNU & Linux, written in a format where most lines start with a period & a formatting controlcode. It’s organized into 8 numbered sections, each with an “intro” documentation file.
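
For instance, a minimal (hypothetical) page in that format might read:

    .TH EXAMPLE 1 2024-01-01 "Linux man-pages"
    .SH NAME
    example \- illustrate man-page markup
    .SH DESCRIPTION
    Lines starting with a period invoke formatting macros:
    .B this
    renders bold, and
    .I this
    italic.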


Man-Pages also provides some scripts they use (or have used?) to normalize the formatting of these man-pages. These are for project management; they don’t need to be installed.

IANA

For compactness IP, TCP, & UDP all include a number denoting which protocol (or connection, but that’s outside the scope of this thread) the rest of the packet speaks. To make this more human-legible IANA (who are charged with tracking these numbers prescribed by the IETF) provides a couple of datafiles (IP’s datafile is separate) in both TSV & XML formats, 4 total, for you to preinstall into /etc, extensively listing these protocols.
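
The IP-protocol listing traditionally gets installed as /etc/protocols (with TCP & UDP’s ports in /etc/services), which GNU LibC will happily query for you:

    /* Querying the IANA protocol-numbers database installed at /etc/protocols. */
    #include <stdio.h>
    #include <netdb.h>

    int main(void) {
        struct protoent *p = getprotobynumber(6);  /* 6 is TCP's protocol number */
        if (p)
            printf("IP protocol 6 is \"%s\"\n", p->p_name);  /* prints "tcp" */
        return 0;
    }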

Naming is an interesting topic…


There’s this concept of “Zooko’s Triangle”: between global-uniqueness, human-legibility, & decentralization of names, any naming system can only achieve 2 (caveat: blockchains, don’t at-me…).

Today we typically embrace the global-uniqueness & human-legibility edge, though I’d certainly say the other edges are worth exploring (which GNU LibC can facilitate!) using federated systems typically centred around IANA.

IANA is by no means perfect, what with the privatization of the Internet.

Pkg-Config

When writing software we rely on reusing code from the operating system & others (in freesoftware this line is fuzzy…). To reuse various “libraries” in C code there are different flags you need to hand GCC. Pkg-Config manages a flatfile database of those flags.
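
Each entry in that flatfile database is a “.pc” file. A hypothetical one for an imaginary libfoo might read:

    prefix=/usr
    libdir=${prefix}/lib
    includedir=${prefix}/include

    Name: foo
    Description: An imaginary example library
    Version: 1.0.0
    Requires: glib-2.0
    Libs: -L${libdir} -lfoo
    Cflags: -I${includedir}

Running pkg-config --cflags --libs foo would then print those last 2 fields with their variables expanded.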

Pkg-Config in turn reuses GNOME’s extended C toolbox to do so, which it vendors. For GNOME this “GLib” toolbox primarily provides a mainloop & OO bindings, for Pkg-Config it primarily provides a “keyfile” parser.


To do so Pkg-Config checks some envvars & builds a searchpath & globals (stored in GLib collections), handling Windows specially. It parses commandline flags (using GLib abstractions), possibly printing debugging messages of which options were selected & copying them over to globals in other files, & possibly exits printing a version number or comparing that version against an expected value.

Then it inits hashtables, inserting itself & all .pc keyfiles in the path, giving enough info to print them all.


Then it collects remaining commandline args & checks the $PKG_CONFIG_LOG envvar before iterating over the reparsed commandline args. For each it consults the hashtables to determine which keyfile to parse (it turns out not to use GLib’s keyfile parser), expanding variables from their hashmap along the way. From there it can populate relevant hashmaps & check whether it satisfies given requirements, recursing for any transitive dependencies.

Results are collected & possibly logged.
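
That expansion is the ${var} substitution seen in keyfiles like the libfoo example above. A rough sketch, where lookup() is a hypothetical stand-in for Pkg-Config’s variable hashtables:

    /* Sketch of ${var} expansion over a .pc field; lookup() stands in for
     * Pkg-Config's per-package & global variable hashtables. */
    #include <stdio.h>
    #include <string.h>

    static const char *lookup(const char *name, size_t len) {
        if (len == 6 && strncmp(name, "prefix", 6) == 0)
            return "/usr";
        return "";  /* real Pkg-Config reports undefined variables */
    }

    static void expand(const char *s) {
        while (*s) {
            const char *end;
            if (s[0] == '$' && s[1] == '{' && (end = strchr(s + 2, '}'))) {
                fputs(lookup(s + 2, (size_t)(end - (s + 2))), stdout);
                s = end + 1;   /* note: real expansion recurses on the result */
            } else {
                putchar(*s++);
            }
        }
        putchar('\n');
    }

So expand("-L${prefix}/lib") prints “-L/usr/lib”.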


If that logic passes, that might be all the command needs to do. Otherwise it iterates over the collected packages outputting sorted vars, checking whether any are uninstalled, outputting their version numbers, outputting pkg + version numbers, outputting each of their dependencies (private or not), outputting their vars, and/or outputting their flags.

Finally it may need to output a newline before exiting.

It also includes comparators on version strings.

GDBM

GDBM implements a fileformat for faster key-value stores.

Skimming over it this morning I see…

In short: It’s a memory-mapped hashmap with optional LRU caching & some support utils.
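
Its C API is pleasantly tiny. A minimal round-trip (filename mine):

    /* Minimal GDBM round-trip: open (creating if needed), store, fetch, close. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <gdbm.h>

    int main(void) {
        GDBM_FILE db = gdbm_open("example.gdbm", 0, GDBM_WRCREAT, 0644, NULL);
        if (!db) return 1;

        datum key = { "greeting", 8 };      /* datum is a pointer + length pair */
        datum val = { "hello", 5 };
        gdbm_store(db, key, val, GDBM_REPLACE);

        datum out = gdbm_fetch(db, key);    /* out.dptr is malloc'd for us */
        if (out.dptr) {
            printf("%.*s\n", out.dsize, out.dptr);
            free(out.dptr);
        }
        gdbm_close(db);
        return 0;
    }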


GDBM includes a suite of commandline tools for working with its fileformat, including a YACC/FLEX parser for a human-legible syntax & a suite of scanners (for different datasources) to go with it. As well as memory management abstractions, error reporting, parsing commandline args, & some other supporting utils for implementing these.

These commands include a few REPLs, for which a data-conversion command is implemented semi-externally.

There’s also a word-wrapping util, & a variable system.

GPerf

GPerf is a tool which computes an optimal hash function for distinguishing between a set of keys a program will be receiving as input. Many system utilities (and WebKit) build on this to be faster! Though I’m not sure it still reflects the needs of modern CPUs…

After parsing commandline flags & other args GPerf parses a keyword list, runs it through the core logic, outputs it as code, & cleans up allocated memory.


What else is there in GPerf?

There’s a bool-array class used for detecting collisions, which I barely see the implementation for. There’s a trivial hashtable into which it gathers keywords as it computes a more optimal hashfunction. There’s keyword & keyword-list classes holding & sorting parsed inputs.

There’s a few files (mostly in a public lib) handling option parsing. There’s array-wrappers for storing the computed positions of distinguishing chars. There’s a version string.

The public lib (usable by other programs, but of little interest) includes utils for parsing commandline flags (as mentioned), reading a single line, & computing a generic hash.

GPerf Core Logic

An initial iteration counts the number of keys. A second finds the maximum & minimum lengths of keys, disallowing empty keywords & otherwise validating lengths.

Then it locates optimal fetches (if ones weren’t given already) by running through a loop locating instances where keys are off by a single character (the trivial cases). Then it iterates through adding fetches which decrease the number of keys which look like duplicates. A 3rd loop removes fetches which no longer do so.


A 4th loop replaces 2 fetches with one where that improves results. And then it might do some debugging.

Following loops look for increments then factors which reduce the number of hash collisions.

After checking whether it found a hashfunction with no collisions, it sorts the keyword list by hash value & extracts some additional outputs.
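
The collision test underlying those loops can be as simple as the bool-array mentioned earlier, indexed by candidate hash values. A sketch:

    /* Sketch of collision detection with a bool array, in the spirit of the
     * bool-array class mentioned earlier; hashes[] holds each keyword's value
     * under the candidate hashfunction. */
    #include <stdbool.h>
    #include <stddef.h>

    static bool has_collision(const unsigned *hashes, size_t nkeys, unsigned range) {
        bool seen[4096] = { false };   /* assumes range <= 4096 for this sketch */
        for (size_t i = 0; i < nkeys; i++) {
            unsigned h = hashes[i] % range;
            if (seen[h])
                return true;           /* two keywords collide; try another candidate */
            seen[h] = true;
        }
        return false;
    }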

GPerf Input

Today I’m studying the rest of GPerf, focusing on its input & output formats.

For input it runs a line-based parser incorporating configuration options (“declarations”) & escape-syntax.

The output is C code split into macro-definitions, an enum, & (through more sophisticated logic) the code to compute the chosen hashfunction, utilizing template classes to output those macros & enums, as well as the output results & their comparisons.

GPerf Output

The codegen routine, after analyzing numeric-ranges & choosing return-types, first outputs a comment indicating that this is auto-generated code and which options were set. Possibly alongside the computed “positions” for each keyword.

An assertion might be generated to ensure that ASCII is being used. Verbatim code (if provided) is copied into place with a #line directive. It may output a declaration of the return type. It may import a header.


It may call templates to output the positions as enums or globals.

It outputs a comment indicating the range & duplicates. It may output its own functions for ASCII case-insensitive comparisons. It may output a C++ class declaration.

Then it outputs the hashfunction itself, with a few conditions (largely down to the chosen programming language) around what type signature to yield. It outputs a nicely-formatted array of “asso values”. A fastpath returns the length or 0.


Then it iterates over the selected positions, either with a fastpath summing them with selected increments, or generating a switch/case statement over the length with fallthroughs so that it doesn’t fetch out-of-bounds values.
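
Put together, the emitted hashfunction typically takes a shape like this hand-written imitation (not real GPerf output):

    /* Hand-written imitation of GPerf-shaped output: the hash is the length
     * plus table lookups on a few selected character positions. */
    #include <stddef.h>

    static unsigned hash(const char *str, size_t len) {
        /* "asso values": mostly a large default, with small values at the
           characters which actually distinguish the keywords */
        static const unsigned char asso_values[256] = { /* ... */ 0 };
        return (unsigned)len
             + asso_values[(unsigned char)str[0]]        /* a selected position */
             + asso_values[(unsigned char)str[len - 1]]; /* the last character */
    }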

It may generate globals to store the keywords. It may generate a keywords lookup-table with a lengths sidetable. It may generate a wrapper function which checks whether the input is actually valid (we know which keyword it’d be!).

And some optional verbatim code.