Formatting Commands

Some of UNIX’s core commands serve to reformat text, and other data, into something nicer to work with. These tend to involve more computational logic of their own.


When managing & collaborating on text files, it can be useful to see what has changed! For this GNU DiffUtils provides a small suite of commands which I’ll study today.

sdiff, after initializing internationalization, registers a cleanup function to kill it’s child & remove it’s tempfile, ensures there’s openfiles for stdin, stdout, & stderr, reads $EDITOR envvar defaulting to ed, parses commandline flags largely buffer them to be handed to diff, validation additional arg count.

In simple enough cases (where postprocessing isn’t needed) it then buffers a couple extra flags & args buffer execvping diff.

Otherwise it checks whether the two inputs are files or directories, fopens the resolved relative paths, buffers some additional args, configures several signals to be ignored, forks a new process (using popen instead if fork’s broken on this UNIX) to execvp the buffered diff command in, initializes 3 “line filters”, & interprets any output lines.

To interpret diff output lines sdiff (during which it also handles signals) an initial space indicates to output the line, nonspaces parse a couple ints before preceding. ‘i’ reads a parsed number of common lines with or without (depending on commandline flag) outputting them before copying a parsed number of lines from left & skipping lines on right.

Or it prompts for user input to, via large looped switch statement, determines which output to write if any. Possibly opening $EDITOR.

diff3, after initializing i18n & ensuring there’s open stdin/stdout/stderr, parses & normalizes flags, validates 3 args are provided & extracts them, ensures the “common” file isn’t set to stdin swapping them around if necessary, constructs an index mapping & inverse mapping, validates those files exist, resets SIGCHLD signal handler, computes the diff between between the common file & the other two others parsing results, backs up the linkedlists, merges & outputs them, & cleans up.

Merging two diffs into a 3way-diff involves what resembles a mergesorting linkedlists tracking high & low watermarks, with specially handling/notation for where they overlap.

There’s 3 different output syntaxes: ed, “merge”, & default. These iterate over, possibly in reverse, & serialize the parsed & merged diff.

diff, after initialization internationalization & regexes + exclude lists whilst ensuring there’s open files at stdin/stdout/stderr, parses & normalizes extensive commandline flags partially into that regexp & excludelist, decides which files to compare to which other files (3 codepaths) whilst calling the core logic to do so, outputs any buffered messages, checks whether stdout errored, & exits.

That core logic involves handling edgecases introduced by directory traversal, [l]stating the files handling same filename & stdin specially, flags files which aren’t present zeroing out their stat info appropriately, performs various additional validation possibling handing off to the directory diffing logic, otherwise opens the two files to apply the text diffing algorithm to it, & cleans up with appropriate output.

Diffing a directory involves reading & sorting all it’s filenames (some excluded) into an array, skips over entries until a “starting file”, & essentially mergesorts the two files together calling a callback (which in diff command’s case recurses over the filesystem) for all matches before cleaning up.

To handle text or binary files diff opens both files detecting whether they’re binary & reading+hashing all their lines after skipping common prefix. If either’s binary (fastpaths for unequal lengths or stdin) it computes a lowest-common-multiple for buffersize & memcmps a bufferful at a time outputting whether they differ.

For text files diff initializes numerous fields, compares in detail (revisit this bit next toot?) the first unequal line, finds the next pair of equal lines, builds a “script” reverse or not by recording how many lines have changed in this region, repeatedly checks whether all altered lines are configured to be ignored via regexp, maybe emits a brief report or outputs detailed in a wide choice of syntax before cleaning up. These syntaxes will lazily generate line comparisons.

Generating a line comparison (where it’s actually done) involves skipping over common prefixes & suffixes, checking if it’s a trivial case of insertion or deletion, breadth-first bidirectional searches the conceptual editspace graph (hueristically shortcircuit opon expensive traversals) to locate a decent chunk of equal lines to divide the input around in this divide-and-conquere algorithm, recurses into one subproblem & iterates over to the next.

I think this algorithm’s called “Patience” after the solitaire card game.

And finally to compare files byte-by-byte cmp, after initializing internationalization & ensuring there’s open stdin/stdout/stderr, parses commandline flags, validates there exists at most two additional commandline args retrieving the first couple, opens & fstats them if not stdin, validates they’re different files, checks if stdout’s /dev/null & thus we can avoid work, if communicating entirely via exitcode emit failure upon unequal lengthed files, allocates a buffer of lowest-common-multiple size, & before cleaning up & after lseeking or reading to an initial start offset repeatedly reads bufferfuls in to memcmp` them fallingback to an explicit more-precise yet less optimal loop, possibly counts newlines, & emits a choice of output syntax describing the differing byte. If one file ends before the other that is described in the output too.