Encoding refers to all the numerous arbitrary decisions regarding how data is represented. These decisions can go all the way from how the hardware stores, orders, & transmits bits all the way up to e.g. which property names data is stored under.
Determining Encoding Used
file is a command for sniffing the filetype of a file, beyond the more vague kernel-level GNU CoreUtils can determine.
After initializing internationalization whilst expanding wildcards & extract the program name into a global, it parses extensive commandline flags, sandboxes via SECComp, validates the shared library version matches the commandline interface, for check/compile/list actions initializes a “magic” struct, opens $MAGIC with various fallbacks, further initializes magic tables, attempts to duplicate the retrieved filepath, clears any previous entries in the magic table, parses & maybe compiles the magic file possibly listing a filetypes with the patterns they’re recognized by or outputting any errors as determined by the requested action, or does a slight subset of that process to just load the magicfile, either outputs usage information or computes maximum arg charwidth before processing each of them.
For each arg it temporarily opens that file (unless it’s “-“ which means stdin), runs some core logic I think I’ll cover tomorrow, asks Linux whether the file’s a FIFO, reads a configured number of bytes to hand off to some other core logic I’ll also cover tomorrow, & cleans up copying the buffer out of lookuptables to form most of the output, prints & flushes results, & exits with appropriate errorcode.
To judge based on filepath it
[l]stats it, some Windows-specific code to tweak these values, outputs results/errors with buffering including assigning filesystem components mimetypes.
To guess a filetype based on it’s contents it initializes a buffer struct to wrap the read data from the file, after some validation does various checks using lookuptables for which text encoding is used judging by e.g. which bytes are present, on OS2 calls a output-adding wrapper around
DosQueryAppType, attempts to decompress the file using a range of compression utils outputting which was successful, checks the Tar checksum to identify those files, AST-less parses JSON to identify them, AST-less parses CSV to identify them, binary-parses Windows CDF files (e.g. Microsoft Office) to identify them extracting identifying information, parses an ELF header to determine whether that matches, iterates over the (pre?-)parsed configuration files checking whether the expected bytes (or byterange) for each entry are within the expected region of the buffer with some preprocessing for e.g. byteencoding, fallingback to reporting whether it’s a plaintext file.
Often you have data in one format (say, an on-disk filesystem) and need it in a different format. This section documents such commands.
GNU FindUtils provides a small suite of commands which helps you locate specific files (though Grep’s, which I’ll cover after AWK, developed seperately). I think I’ll categorize these as reencoding the filesystem into something more concise.
find command, which I’ll cover today, lists all descendant filepaths under a directory (defaulting to “.”) so that they can be
greped. GNU FindUtils also provides other arguably more optimal commands for the same thing, but
find’s still useful.
To do so
find extracts the basename in argv into a global or two, saves the current-working-directory into a global, if
$GNU_FINDUTILS_FD_LEAK_CHECK saves data regarding open file descriptors for debugging, initializes loop detecting hashtable, initializes defaults for various options based in part on envvars, initializes internationalization, registers to close stdout on-exit, parses few commandline flags manually (no shorthands or longopts), handles a couple debug options, TDOP-parses remaining commandline arguments then costs/optimizes them, & before final/optional debug output & cleanup it calls it’s more-core logic.
This more-core logic involves either validating options & carefully-opening a file of basepaths, calling the core logic directly for “.”, or iterating over the commandline args; in all but the middle case for each (validated) entry exiting upon expressions, & calling the core logic.
After reencoding some options it reuses the same filetree traversal logic from GNU CoreUtils using the loop-detecting hashmap & interprets the given expressions TDOP-parsed earlier to determine which entries to skip, & where’s it outputting these filepaths?
That interpretor supports numerous operators including executing subcommands. This makes up the bulk of the code.
In programming there’s often a tradeoff between using more memory-space & using more CPU-time. Because
find generates the filepath list from the directory hierarchy everytime you run it. But what if that data was pregenerated (prefix-compressed) by a
updatedb shellscript? This is how
updatedb shellscript, after setting the locale to “C” & parsing commandline flags, determines whether
sort supports “-z”, loads & validates configuration from envvars, validates
frcode is present, deletes the old tempfile & configures it’s replacement to be deleted upon exit signals, runs
find twice as configured (both optional, both have 2 different codepaths) piped into
frcode with a failure handler, then
mvs the temp file into place.
frcode is an internal command (written as per usual in C) used by
updatedb to efficiently cut down on disk usage by the sorted
find results it caches for
locate. To do so, after retrieving program name & registering to close stdout upon exit, it
mallocs two 1kb-large buffers, parses a couple commandline flags validating no other args are given, outputs a file identifier, reads each stdin line computing a prefix shared with previous line to output count (1 of 2 ways) then suffix.
locate opens that index as a setuid-program before dropping all privileges (filenames can be sensitive), extracts the program name & initializes internationalization, initializes defaulting text-escaping options, configures to close stdout & cleanup those quoting options upon exit, parses commandline flags & validates commandline flags defaulting -d to $LOCATE_PATH, determines whether stdout is a (virtual) teletype, & repeatedly until an optional limit performs the core logic.
In the mainloop it considers temporarily-opening a different “database” file, fstats the file to determine if it’s out-of-date, reads & validates it’s magic number (like those
file looks for), enqueues a callback or two to read/decompress each line appropriate for that magic number & commandline flags, enqueues additional ones for any pattern matching args or containing directory, possibly enqueues one to check whether a file exists whilst maybe following symlinks, …
maybe enqueues one to count statistics, possibly enqueues one to output visited filepaths escaped or not, & enqueues one to count results possibly exitting the loop, determines whether to treat these visitors as disjunctive or conjuctive or singular conditions, possibly outputs the filepath & format, maybe tweaks readinput, repeatedly calls those callbacks, possibly outputs various stats, & outputs any errors if they occurred.
xargs command, which is very handy in conjuction with
find probably hence why it’s bundled with them, reads each line from stdin to pass them to a specified command.
To do so it extracts the commandname, maybe save info on open filedescriptors for debugging, retrieves the process ID from the kernel, initializes internationalization, registers to close stdin & waits for all child processes to exit upon exit, initializes buildcommand defaults (mostly zero-ing it out) handling any errors, & parses commandline flags.
Then it warns about -E flag being present with -0 or -d, finishes handling those buildcommand initialization errors, possibly registers
SIGUSR1/2 signal handlers to increment or decrement the process pool size, opens the specified input stream if not stdin/”-“, defaults trailing commandline args to
echo, possibly outputs various limits stored in the buildcommand,
mallocs some buffers, resets the
SIGCHLD process, and either pushes each commandline arg to be pushed onto the subcommand for it to read each line passing those args to that command ensuring it runs it at least once, or counts the length of each commandline arg then for each line pushes the initial commandline arg taking the rest from the read input line in reference to lengths before executing them.
The functions for reading those input lines can be overriden by commandline flags & handles splitting args. Core logic’s in buildcommand abstraction.
buildcommand abstraction serves to gather data into place enforcing configured limits before handing it off to a callback (provided by
xargs’s main implementation file to actually count subprocesses enforcing a limit,
fork, prepare pipelines, &
This buildcommand can also be used to run “quoted” commands from stdin.
Often we want to treat entire directories/folders as individual files we can send over the internet, always have. For that we use “archivers” to reencode the directory as a single file, which at least traditionally were conflicted with compressors since usually you want both features. The most dominant archive formats are Tar & Zip, though Linux now has its own you can mount as a filesystem!
After initializing debug-info/error-handling & internationalization,
tar allocates a name array, ignores SIGCHLD signals, disables ability to unlink directories, parses commandline args including flags via
argp_parse & several callbacks, allocates a buffer tracking names, optionally reads in a volume number, runs the specified subcommand or outputs error message, optionally outputs various summaries whilst possibly validating results, & cleans up.
To create a Tar file containing given files/directories (after standard initialization & before standard cleanup)
tar closes existing directory reads & carefully opens the file possibly via a compression command, allocates a header populated with codewords and/or reencodes it into the output stream, if not in incremental mode it iterates over a linkedlist dynamically extended from line-parsed files testing whether it wants to include each, for each carefully copying the file with metadata to the archive’s in-memory blocks handling sparsefiles specially & directories via co-recursion before extensively cleaning up after it, writes metadata about the blocks, carefully flushes remaining output, profiles how much time was spent, optionally validates output, closes the file possibly waiting for the process to end, cleans up memory, evaluates deferred unlinks, & possibly additional metadata directory metatdata to a seperate file.
In incremental mode instead of a straightforward iteration
tar gathered a sorted namelist split into two two iterations both similar to before with a merge sort after each (the second populates a hashmap), an optional third iteration validates the result, then iterates over that outputting files as per before, blanks the foundcount for each name, & iterates the non-excluded entries again.
For each copying the name to a buffer, retrieving statinfo & previously-read dir contents, if it finds any such previously-read dir-contents it iterates over each entry in the multistring that’s preceded by “Y” optionally repeatedly retrieving its metadata until success before extending the buffer & dumping that file.
Once you’ve created a Tarfile the next thing you (or whoever received it) probably want to do is to “extract” it. So how does this subcmd work?
Its initialized by checking whether we are root to determine whether to increment a couple of “option” globals, & retrieves the umask with or without restoring it.
To clean up it iterates over a linkedlist to determine which files to carefully attempt altering the
fstat info both before & after iterating over a 2D linkedlist of delayed symlinks.
The bulk of Tarfile extraction logic involves some shared code for reading Tarfiles with callback to copy the read data into new files.
This callback involves updating the reader’s scanner, determining based on options whether to skip the file for its name starting with “../”, otherwise determines whether to skip the file possibly involving a prompt if requested by adjusting the scanner, possibly decodes the header to output status, possibly evaluates queued
fstat adjustments immediately, possibly carefully renames any existing files by this name to back them up, checks the header to determine which codepath to use to generate this file falling back to skipping the entry, & possibly attempts to restore the backedup file.
Extracting a file involves carefully opening the file setting correct attributes, running through a shellcommand, or outputting to stdout before adjusting the scanner, copying each block to that file handling with fastpath for contiguous files, updates scanner, possibly configures
fstat info, & closes the file. Sparsefiles use a dispatch table presumably crossplatform portability.
For directories it validates the entry is safe & only contains files within the tarfile, decodes directory permissions, & carefully/repeatedly attempts to create it handling symlinks specially before enqueueing an operation to update the directory’s
For symlinks it might create placeholder file first to avoid complaints.
For hardlinks it might create a placeholder file (two conditions) or it’ll repeatedly attempt to create the symlink before decoding the error if any.
For device files it syscalls
mknodat before decoding the error & adjusting
fstat info. For FIFOs it similarly syscalls
Other filetypes issue warnings & are skipped. Then it validates results possibly removing the existing file, or checking timestamps.
To read a Tarfile
tar initializes a base64 decode table, gathers filenames into a single linkedlist, carefully opens the archive for writing very similarly to how it does for reading, & before closing that tarfile (again, as per before) & and outputting any missing names it repeatedly until names are found reads a new fileheader after destroying previous & decodes then validates it if successful, performs some normalization, runs the callback for the subcommand, or skips the entry.
Listing a Tarfile is straightforward: use the common logic for reading those files & after computing the “block ordinal” from the scanner outputs the file header & possibly directory contents (determined by the verbose & incremental commandline flags) before skipping the file itself.
Adding to a Tarfile
Sometimes you might want to add a new file to a Tarfile without decoding & reencoding that archive. For this
tar offers the Cat, Update, & Append subcommands!
To do so it gathers all the given names, opens the archive, & validates there aren’t any global overrides, before reading each fileheader, resets some flags if EOF was set, iterates over given names to determine which to append at the end, outputs zeroed EOT block, closes the archive, performs deferred unlinks, & frees names.
For each file preexisting in the archive upon successful reads its decoded & normalized & if its amongst the specified names during an Update subcommand carefully removes the entry, before skipping the file itself. Otherwise updates some globals or outputs errors.
For each name to be added after filtering it reuses the same logic as Tarfile creation, unless you’re using the
cat subcommand. Instead
cat manually opens another archive & copies its bytes to the end of the first.
Removing from a Tarfile
If you’re adding files to an Tarfile (without recreating it), you probably also want to be able to delete!
This is done through an initial iteration over the file’s fileheaders counting the number of times each given name occurs in it whilst rearranging some blocks until it successfully read a header. If so it allocs a record saving aside those previous records & possibly flushing the output, it consults the namelist before carefully copying the file over to output.
Then upon success it zeroes out the overriden blocks, attempts to truncate the file, & regardless tidies up open files & allocated memory warning about names not in the file.
It has helper functions for flushing these changes to the output.
Diffing a Tarfile
Like with directories sometimes you want to examine the difference between two archive files, without extracting them.
To initialize it allocs a buffer & maybe linebased-parses a directoryfile. Then for each file in the archive (usual callback iteration logic) it configures the subsequent block, maybe outputs the file header as text (as per the list subcommand) possibly preceded by “Verify “, & handles the file differently depending on its type.
For directories it reports whether the other file is also a dir & whether its permissions have changed.
For files it checks the other is also a file, checks whether the permissions/owner differs, checks timestamp, checks filesize, handling sparse files differently checks whether the file contents are the same,
For hardlinks compares the
stat data to determine if they’re pointing to the same “ino”. For symlinks compares the contained filepath.
For special files checks if both are the same of the same type, if the device number differs for char & block files, & whether the permissions differ.
For “dump dirs” it opens the subfile considering removing it & carefully compares its contents with a little nuance, or simply scans through its blocks ensuring they’re valid.
For a volume it might treat as a normal directory, or checks the other’s filetype, filesize, with the subfile open compares the contents.
And finally there’s a subcommand which iterates over the fileheaders in the Tarfile (semi-custom codepath) & tests whether each extracted volume label is amongst the given names, reporting any labels which weren’t found. In verbose mode it’ll print these volume labels.
Often in computing there’s a tradeoff to be made between conserving CPUtime vs memoryspace. The epitomy of this are “compression/decompression” tools (often used to distribute software for installation, including on Linux From Scratch) which locates & removes repetition from the input such that it can be restored later, saving space at expense of time.
Rhapsode/Haphaestus/etc happens to use GZip for decompressing HTTP responses & (after making repetition more obvious) PNG images. Not to mention all the other projects which uses GZip in their build toolchains if not at runtime!
gzip, after expanding wildcards in commandline args (in shells which don’t do that already) & extracting the program name from argv, does so by appending $GZIP (unless buildtime overriden) to argv, determines whether the program name indicates indicates to decompress & maybe output to stdout, parses & validates commandline flags with some special casing, allocates some buffers, registers some signal handlers to quit the program, finds the input & does the work, & tidies up.
gzip may (de)compress each file given in argv, carefully opens the file following any symlinks possibly trying with suffixes removed, checks whether it’s a directory which it instead recurses over, validates stdout if configured to output there, retrieves & validates filenames, empties buffers, creates missing output files, in decompression mode determines which “method” to use based on magic numbers, in list mode instead outputs compression percentages & other metadata, & repeatedlies runs core logic followed by flushing any output buffers & maybe debug output.
For “-“ or absent commandline args validates it’s input, configures binary mode & filenames,
fstats stdin, saves filesize & timestamp if set, empties buffers, if decompressing determines which “method” to use based on magic numbers, repeatedlies run the core logic, & maybe outputs debugging output. Largely same as for files but simpler.
The core logic is a callback function configured to use zip, lzw (deleted due to now expired patents), unzip, unpack, unlzw, unlzh, or (no compression or any transformation) copy determined by the magic numbers or commandline flags.
For zip compression it outputs the magic number telling it to decompress via the unzip logic, outputs converted bitflags, outputs a timestamp or zero, initializes a CRC with all ones, initializes bitstring output, initializes decompression tables with some initial data, converts compression level into bitflags to output followed by an operating-stystem identifier, maybe outputs the filename (but not directory), runs it’s core logic, validates filesize hasn’t changed in the meantime, outputs CRC & filesize, & flushes output.
The core logic involves either (buildtime choice):
- Initializes a longestmatch table & prefills the buffer whilst computing CRC via lookuptable, determines whether to take fastpath below, & repeatedlies updates that lookuptable, seeks back through the linkedlist to find the best match for data to compress, skipping distant length-3 matches, & before refilling buffer as-needed either:
- outputs previous match by validates match, updates Huffman frequencies whilst buffering for later compression, & inserts each substring into lookuptable.
- It tallies & enqueues a literal table to be Huffman-encoded, validating rsyncing.
- Upon finding nothing to compare to considers flushing a Huffman-compressed block, and lets the next iteration compress this char.
- A fastpath very similar to above with fewer checks & lookuptable inserts.
- Wraps a CPU opcode available on certain (presumably old) IBM machines in a loop with I/O & envvars.
Huffman encoding encodes more frequent “symbols” with fewer bits than more frequent ones.
To Huffman-compress an enqueued block according to the recorded tallys zip compression first min-heapifies (binary tree datastructure where parents always have lower value than both their children) so it can repeatedly pair the least common entries in the min-heap to create a new heap entry. Then computes stats on the resulting binary tree & reencodes it as a decodingtable which is immediately to the bitstring output. Then it similarly builds a bitlength tree.
It determines which Huffman-tree gives optimal compression, outputs a header, consults encoding tables to compress the data or directly outputs it if we failed to determine an effective compression, clears it’s state, & flushes the rest of the output.
A “bitstring” wrapper is used around the output adding bittwiddling so the caller think can think in terms of bits rather than bytes.
To decompress the files any recent version of this tool generated it initializes a CRC to all ones reading the original CRC & length, validates the method either copying bytes direct from to output or runs the core logic, reads the original CRC & length again. validates the computed CRC matches, validates there isn’t another entry in the PKZip file, & cleans up.
The core logic either (compiletime choice) involves an I/O loop around that old IBM CPU opcode, or decompressing each block (1st bit indicates valid block) before undoing lookahead & flushing any output whilst updating CRC.
To decompress a “dynamic” block (next 3 bits are 2) it reads in table lengths & validates them, reads in each length in table zeroing out others, reformats that into a newly-malloc’d lookuptable (using plenty of increments, bittwiddling, & possibly
mallocing subtables) freeing them upon failure, does same for a couple other tables, & decompresses normally.
To decompress a “stored” block (next 3 bits are 0) it reads the block’s length & validates the complement value, & reads each byte into a “slide” to be flushed a bufferful at a time.
And to decompress a “fixed” block (next 3 bits are 1) it hardcodes the Huffman table to be decoded before decompressing normally.
Once it has the Huffman lookuptables for dynamic or fixed blocks it repeatedlies looks up the first few bits in that table either echoing the stored literal or exitting the loop or Lempel-Ziv decompressing by Huffman-decompressing block length & (var-int encoded) distance so it can look back that far in the “slide” copying bytes one-by-one to the tail whilst flushing a bufferful at a time with CRC computation.
To decompress LZH-compressed input zeroing out several fields & filling the readbuffer. Then it repeatedlies copies a given number of bytes to the header of the buffer, reads the blocksize & Huffman lookuptables if unknown, traverses the decoded Huffman-tree to decode each bit, upon finding one outputting direct into the buffer or decoding how far to look back from another Huffman-tree & copyings bytes from the buffer one-by-one, & maybe flushing bufferfuls at a time.
Same basic concepts as modern Zip files, but not as CPUtime or RAM efficient.
To decompress LZW files (apparantly no patents infringed here, as opposed to compression) it decodes a packed-header, clears a prefixof lookuptable, identity-populates a suffixof lookuptable, before flushing remaining output repeatedlies reads a bufferful in at a time preserving some bytes from previous buffer & repeatedlies decodes a position, code (preserving previous one) before interpreting clearbuf opcodes, performing some validation and/or push lookuptable entries to a stack, repeatedlies pops the stack entries copying them to output, & adds new lookuptable entries.
And finally to decompress newer Pack files GZip Unpack decodes a Huffman-Tree which is quickly reformatted into a prefixtable, zeroes some fields, repeatedlies checks the next few bits looking them up in the table to determine whether to discard extra bits or consult any subtables before exitting the loop or outputting the Huffman-decoded byte, & flushes all remaining output validating length.
GZip isn’t the only compressor used amongst core OS projects in, for example
xz apparantly yields smaller filesizes the GZip.
After extracting the program name (via a vendored “tuklib”) & normalizing stdin/stdout/stderr,
- internationalization (via, minor, Tuklib util)
- available memory (via LibLZMA & cross-platform TukLib)
Then parses commandline args, mostly flags, & envvars.
After some validation & configuring files count
xz then maybe (for non-list mode) configure signal handling, maybe sets a sandboxing flag, determines which core code to run (coder or listing), & iterates over the commandline args with validation relating to stdin calling that core logic, parses each filename from the given listfile & runs the core logic for each before closing it, maybe (in lost mode) output totals), & cleans up.
That cleanup involves freeing alloc’d memory, clearing signal handlers, closing stdout/stderr, & exiting.
On Windows this whole command is registered as a critical section for concurrency’s sake.
After some validation & logging (including updating the progress line)
xz --list carefully temporarily-opens &
lstats the file wrapped in an abstraction & logging, before parsing & printing the indices.
After validation that parsing involves repeatedly (logging any errors):
- repeatedly reading the fileheaders until successful with another loop skipping padding.
- Reading a footer consisting of magic number, CRC32 (whose computation may or may not be sped up by a lookuptable), bitflags, & backwards size.
- Determining how much memory it can use & validating its not using any more
- Initializing data for decoding a file
- Piping the compressed/archive file into that decoder
- Reading & decoding fileheader consisting of magic number, CRC32, & bitflags.
- Saving those bitflags, padding, & decoded file.
All with extensive validation & logging.
After that parse loop it tidies up. The saving is done via callbacks.
Printing (upon success) can be done in robot, basic, & advanced modes with or without counting totals. In either case closes the index reader, and the file whether successful or not.
All of these output modes assemble resulting text via
printf & usually via intemediate strings & TukLib to better align the columns.
XZ supports several encoding & decoding formats (raw, LZMA, & XZ with two implementations of the latter for whether the CPU’s multicore).
Initializing a raw encoder involves (before setting some flags in an array) involves validating an array of filters whilst computing its length before reencoding it via a callback & linear scan through a constant array to contain methodtables.
That logic is generic between encoding & decoding, with its function call tweaked via a macro to retrieve internal coder properties.
The methods coder methods come from those filters as far as I can tell. These filters include:
- The LZMA & LZMA2 encoders
- & Encoders optimized for x86, PowerPC, IA64, ARM, ARM THUMB, SPARC, or Delta CPUs
XZ supports LZMA compression, which is intialized (before a couple flags) by registering this very constructor as a method on the next coder, possibly allocating private data if needed & sets its properties, validates encoding options & reencodes them as a byte, validates dictionary size before using bittwiddling to round it to the next power-of-two, outputs that encoding options byte, clears the buffer, constructs a filter & calls its initializer.
To initialize that filter it first dynamically computes a CRC32 lookuptable (cutting iterations down to number of bytes rather than bits) if it hasn’t been already.
Then XZ allocates the coder & sets its properties if previously NULL, runs a callback (which here allocates some additional properties if previously NULL, determines/computes parameters based on whether its in fast or normal mode, sets some other flags based on validation, sets various options, & resets its state), validates encoding options before computing & saving parameters from them whilst freeing any existing buffers, selects which find & skip methods to use (hc3/4 or bt2/3/4), computes & rounds hash size/mask then count, frees old hashmap, computes number of iterations for match finder, & allocates new buffer & hashmap prepopulating an existing dictionary into it if given.
Resetting the LZMA encoder involves validating the options & recomputing various parameters from it, resets range decoder properties back to defaults, clears the coder’s state & each of its reps, resets all the coder’s probability-tracker bit encoders including those in “bittrees” & length encoders (including those length encoder pricings, computed from those probabilities), & resets some count properties.
The LZ filter provides the methods for encoding, closing, & updating as proxied by the coder. Encoding involves repeatedly both “filling a window” (next toot) & deferring to next coder. Closing involves freeing memory. Updates are deferred.
Once everything’s initialized & we’re entirely diversed from I/O along with XZ’s coder & filter abstractions, LZMA-encoding involves:
- Checking whether we need to move the sliding window, via
memmove& pointer updates.
- Checking whether there’s next filter (there isn’t in this case) to determine whether to defer or carefully
memzeroesread data mostly to stop Valgrind from complaining.
- Checks whether we’ve reached EOF.
- Considers calling the “skip” dictionary-compression method with updated buffer indices.
I don’t yet see where the find method gets called… Presumably the decompressor…
These compression dictionary lookups/stores are implemented as a choice of straightforward hashmaps, binary trees, or combination thereof. I don’t know why you’d want this choice, or what the differences are…
tl;dr; Simple dictionary compression.
XZ supports LZMA compression, multicore or singlecore.
To initialize LZMA compression it allocs the coder if needed & configures its properties/methods. Then clears away any existing index tree & initializes a new one erroring out upon allocation error, enqueues a stream header, & writes it via a newly-initialized “block encoder”. The block encoder performs extensive validation before initializing & its underlying “raw encoder” (previously-discussed concept).
The method to update the singlecore LZMA compressor checks the sequence number before possibly freeing old filters & initializing new instances of the configured filters. If the sequence numbers within the blockinit prefix it initializes the block encoder whilst updating flags & copying over configured filters. If its within block encode the update’s deferred to the block encoder.
Ending the LZMA compression ends the index & block encoders & frees memory.
And encoding involves repeatedly branching over the sequence ID:
STREAM_FOOTERinvolves copying the buffer over before validation & maybe ending or incrementing the sequence ID.
BLOCK_INITeither initializes a new index encoder preceding to
INDEX_ENCODEOR assembles various block properties into a byte array including a computed CRC32 proceding to
BLOCK_ENCODEdefers to the block encoder before retrieving the blocksize to insert into the index whilst validating & tweaking some counts, proceding to
INDEX_ENCODEdefers to the index encoder before assembling a footer bytearray as it procedes to
Ending the index encoder is a trivial deallocator.
Index encoding involves a similar state machine with CRC32 computation:
INDICATORwrites a 0 byte & precedes to
COUNTencodes a variable-lengthed integer (uses repeated bittwiddling so smaller numbers use fewer bytes) counting the number of bytes in the index, preceding to
NEXTiterates a single iteration over the index preceding to
PADDINGor upon error immediately to
UNCOMPRESSEDoutputs a variable-lengthed uncompressed size before incrementing the state.
PADDINGadds new 0 bytes (one per statemachine iteration) before finalizing CRC32 computation & preceding immediately to
CRC32encodes each byte of the CRC32 (in an innerloop) computed upon each iteration & upon
Iterating over the index involves validating the given mode, choosing a group based on said mode & saved itermethod, (possibly repeatedly) using different logic based on presence or absence of the stream or group, & updates some properties.
Then semi-conditionally copies various properties over to the iterator returned to the caller.
If there isn’t a stream it loads the next stream & group out of the leftmost branch of the index. Otherwise it iterates to the next record if there is one & there’s a group. Otherwise it iterates to the next group (via tree traversal) then group (via repeated tree traversal).
These indices are sorted binary trees of “groups” & “streams”.
To update a block encoder (erroring if not in
CODE state) before deferring to the next coder if present. Which in this case is the previously-described “raw coder”.
To end a block encoder it defers to or frees its next coder before deallocating itself.
As for the encoding itself, after validation it is a non-self-repeating state machine:
CODEdefers to the next coder before CRC32/SHA256 validation, state updates, & maybe (unless there’s more of the block to write) proceding immediately to
PADDINGrepeatedly outputs 0 bytes before proceding immediately to
CHECKif there’s space.
CHECKcopies the computed error correction code over to as much output as is available, and exits upon completion.
There’s more sniffing logic here to determine which decoder to use, but ultimately the come down to the same options as per before! Just LZMA decompression can’t use multicore optimizations, unlike LZMA compression.
I was going to focus on the “raw” decoder to start, but that’ basically the same as the “raw” encoder with different filters. So lets discuss those filters!
The LZMA filter has minimal initialization resetting its state & configuring some methods with a little validation. It wraps bittrees & rangecoders in a statemachine implementing its own decompression. The rangecoder serves via bittwiddling to read/write bits rather than bytes, with a bit of compression. I’ve already discussed bittrees. All these filters have methods to sum their memory usage & reconfigure their properties.
It has a corresponding encoder.
The LZMA2 filters look very similar, but simpler. Not much to add.
The x86-optimized encoder & decoder are the same (accepting a flag for whether its encoding or decoding) & very trivial, though I struggle anything x86-specific about it. As for IA64.
The PowerPC-optimized encoder/decoder is even more trivial! As for ARM, ARM/Thumb, & SPARC.
There’s maybe a delta encoder & decoder which computes running sums/diffs to expose more compression opportunities!
XZ implements an “alone decoder”.
Its initializer, method for configuring how much memory is used, & destructor are trivial. Its decoding logic is a statemachine:
PROPERTIESvalidates encoding options from first byte & procedes to
DICTIONARY_SIZEdecodes a 4byte (1 byte per iteration) unsigned number before validating this dictionarysize & proceding to
UNCOMPRESSED_SIZEdecodes & validates an 8byte number, before proceding to
CODER_INITconstructs a LZMA filter before proceding to
CODEwhich defers to it.
I touched on the LZMA filter yesterday, but I could go into more detail! Surrounded by some “LZ decoder” bookkeeping it reads some input & initializes some variables before entering a hacky statemachine!
IS_MATCHchecks whether the input's valid & the opcode.
IS_MATCHopcodes followed by
LITERALopcode or state decodes an 8segment literal from the rangecoder.
IS_MATCHopcodes followed by anything else (or
LITERAL_MATCHEDstate) decodes an 8segment number from the rangecoder. Or that loop might be semi-manually unrolled using the C preprocessor! The next state is determined via a lookuptable, before checking whether there's output to write & repeating the process from the top.
LITERAL_WRITEchecks whether there's output to write
IS_REPchecks if it sees the matching opcode & decodes the slot length proceding immediately to
DIST_SLOT. Otherwise it validates dict distance before proceding to
DIST_SLOTdecodes 6 dist slots (with or without unrolling) & after some validation procedes immediately to
DIST_MODELdecodes the given number of bits from the rangecoder with or without manual unrolling to a max iteration count.
DIRECTreads a given number of “direct” inputs before preceding immediately to
ALIGNreads the given number of bits possibly loop-unrolled into seperate states followed by validation (including
IS_REP0sees the corresponding opcode it procedes immediately to
IS_REP0_LONGsees the corresponding opcode it updates its rangecoder state & procedes immediately to
SHORTREP, or updates said state differently.
SHORTREPchecks for dictionary output or restarts from top.
IS_REP1sees matching opcode it updates its angecoder & other state, otherwise procedes immediately to
IS_REP2updatse its rangecoder & other state depending on whether it sees matching upcode, before decoding a length & proceding to
COPYchecks whether there's dictionary output before proceding back to
Normally procedes to
After emptying input or seeing output to emit it updates/resets state & checks whether we've reached EOF.
LZMA decompression has a straightforward initializer, with methods to configure the memory limit, retrieve flags, deallocate the decoder with underlying block decoder & hash index.
As for the decoding method its a straightforward statemachine:
STREAM_HEADERcopies header bytes into an internal buffer (possibly split between multiple calls) & decodes/validates proceding immediately to
BLOCK_HEADERdetermines whether to procede to
INDEXinstead, for 0th byte decodes block header size, reading that much into its buffer & decodes it; initializing a block decoder with those options (revisit soon!) & procedes immediately to
BLOCKdefers to said block decoder, proceding back to
INDEXuses a new statemachine to decode the indexhash, proceding immediately to
STREAM_FOOTERcopies a stream footer to its internal buffer & decodes it validating index size matches; procedes immediately to
STREAM_PADDINGskips over padding bytes validating their in bundles of 4bytes; then resets state thus proceding to
The index decoding statemachine involves (before trailing CRC computation):
BLOCKvalidates lack of nullbyte proceding to
COUNTdecodes a variable-sized integer validating it matches an internal count, determining whether to procede to
UNPADDEDbased on it.
UNCOMPRESSED(why’s this a single codepath?) decodes a variable-sized integer to appropriate field, validates results, for
UNCOMPRESSEDappends to a hashmap, & procedes to
PADDING_INITcomputes a hashindex before proceding immediately to
PADDINGreads null padding bytes, validates those bytes & sizes & hashmap before finalizing a CRC32 to procede immediately to
CRC32validates the computed CRC32 matches the next 4 bytes in an innerloop & exits.
Block decoders wrap a raw encoder as discussed earlier with slightly less trivial init/deinit & a simple statemachine:
CODEdefers to its rawcoder, computing appropriate CRC or SHA256, updates & validates sizes, & procdes immediately to
PADDINGskips over null bytes until it reaches a 4byte boundary, & finalizes the hash to procede immediately to
CHECKreads in the trailing check validating it matches the computed hash, & exits.
I’ve already discussed the underlying rawdecoder, and its underlying “filters”.