Some of UNIX’s core commands serve to reformat text, and other data, into something nicer to work with. These tend to involve more computational logic of their own.
When managing & collaborating on text files, it can be useful to see what has changed! For this, GNU Diffutils provides a small suite of commands which I’ll study today.
sdiff, after initializing internationalization, registers a cleanup function to kill its child & remove its tempfile, ensures there are open files for stdin, stdout, & stderr, reads the $EDITOR envvar defaulting to
ed, parses commandline flags largely buffering them to be handed to
diff, & validates the additional arg count.
In simple enough cases (where postprocessing isn’t needed) it then buffers a couple extra flags & args into that diff command.
Otherwise it checks whether the two inputs are files or directories,
fopens the resolved relative paths, buffers some additional args, configures several signals to be ignored,
forks a new process (using
popen instead if
fork’s broken on this UNIX) in which to
execvp the buffered
diff command, initializes 3 “line filters”, & interprets any output lines.
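The fork+execvp-and-read-lines shape of that can be sketched in Python via subprocess (a hedged stand-in; the real sdiff uses raw fork/execvp or popen, plus its line filters):

```python
import subprocess

def run_diff(left, right, extra_flags=()):
    """Spawn `diff` on two paths (as sdiff does via fork+execvp) and
    yield its output lines one at a time for interpretation.

    A simplified sketch: the real sdiff buffers the argument list up
    front & also juggles signal handling around the child.
    """
    argv = ["diff", *extra_flags, left, right]
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()  # diff exits 1 when the files differ; that's fine
```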
Interpreting diff’s output lines (during which sdiff also handles signals) works like so: an initial space indicates to output the line, while nonspace characters parse a couple of ints before proceeding. ‘i’ reads a parsed number of common lines, with or without (depending on a commandline flag) outputting them, before copying a parsed number of lines from the left & skipping lines on the right.
Or it prompts for user input &, via a large looped
switch statement, determines which output to write, if any. Possibly opening $EDITOR.
diff3, after initializing i18n & ensuring there’s an open stdin/stdout/stderr, parses & normalizes flags, validates 3 args are provided & extracts them, ensures the “common” file isn’t set to stdin (swapping them around if necessary), constructs an index mapping & inverse mapping, validates those files exist, resets the
SIGCHLD signal handler, computes the
diff between the common file & each of the other two files parsing the results, backs up the linkedlists, merges & outputs them, & cleans up.
Merging two diffs into a 3way-diff resembles mergesorting linkedlists whilst tracking high & low watermarks, with special handling/notation for where they overlap.
There are 3 different output syntaxes:
ed, “merge”, & the default. These iterate over (possibly in reverse) & serialize the parsed & merged diff.
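That mergesort-with-overlaps merge can be sketched like so (hunks here are just half-open line ranges over the common file; the real diff3 tracks per-file line mappings & watermarks, so treat this as an illustration of the control flow only):

```python
def merge_hunks(left, right):
    """Merge two sorted lists of 2-way diff hunks into 3-way blocks.

    Each hunk is a (lo, hi) half-open line range over the *common*
    file.  Take whichever hunk starts first, then grow the block while
    hunks from either side overlap it, tagging the result "left",
    "right", or "both" (the conflict-ish case).
    """
    i = j = 0
    merged = []
    while i < len(left) or j < len(right):
        # Pick the side whose next hunk starts first.
        if j >= len(right) or (i < len(left) and left[i][0] <= right[j][0]):
            lo, hi = left[i]; i += 1; sides = {"left"}
        else:
            lo, hi = right[j]; j += 1; sides = {"right"}
        # Absorb any hunks (from either side) overlapping the growing block.
        grew = True
        while grew:
            grew = False
            if i < len(left) and left[i][0] < hi:
                hi = max(hi, left[i][1]); i += 1; sides.add("left"); grew = True
            if j < len(right) and right[j][0] < hi:
                hi = max(hi, right[j][1]); j += 1; sides.add("right"); grew = True
        merged.append((lo, hi, "both" if len(sides) == 2 else sides.pop()))
    return merged
```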
diff, after initializing internationalization & regexes + exclude lists whilst ensuring there’s open files at stdin/stdout/stderr, parses & normalizes extensive commandline flags (partially into those regexps & excludelists), decides which files to compare to which other files (3 codepaths) whilst calling the core logic to do so, outputs any buffered messages, checks whether stdout errored, & exits.
That core logic involves handling edgecases introduced by directory traversal: it
[l]stats the files (handling same-filename & stdin specially), flags files which aren’t present zeroing out their
stat info appropriately, performs various additional validation possibly handing off to the directory-diffing logic, otherwise opens the two files to apply the text-diffing algorithm to them, & cleans up with appropriate output.
Diffing a directory involves reading & sorting all its filenames (some excluded) into an array, skipping over entries until a “starting file”, & essentially mergesorting the two listings together, calling a callback (which in
diff command’s case recurses over the filesystem) for all matches before cleaning up.
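That mergesort-of-listings-with-callback idea, sketched (assuming both listings are already sorted & filtered; the callback names are mine, not diff’s):

```python
def merge_dir_listings(names_a, names_b, callback, starting_file=None):
    """Walk two sorted directory listings in lockstep, as diff's
    directory compare does.

    For each name the callback learns which side(s) it appeared in;
    diff's callback recurses into matching subdirectories & reports
    "Only in ..." for the one-sided cases.
    """
    i = j = 0
    while i < len(names_a) or j < len(names_b):
        a = names_a[i] if i < len(names_a) else None
        b = names_b[j] if j < len(names_b) else None
        if b is None or (a is not None and a < b):
            name, in_a, in_b = a, True, False; i += 1
        elif a is None or b < a:
            name, in_a, in_b = b, False, True; j += 1
        else:                       # same name on both sides
            name, in_a, in_b = a, True, True; i += 1; j += 1
        if starting_file is not None and name < starting_file:
            continue                # skip entries before the "starting file"
        callback(name, in_a, in_b)
```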
To handle text or binary files
diff opens both files, detecting whether they’re binary & reading+hashing all their lines after skipping the common prefix. If either’s binary (with fastpaths for unequal lengths or stdin) it computes a lowest-common-multiple buffersize &
memcmps a bufferful at a time, outputting whether they differ.
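In Python that LCM-buffersize-then-chunked-memcmp shape looks roughly like this (the cap & fallback choice are my assumptions, not diff’s exact policy; `math.lcm` needs Python 3.9+):

```python
import math

def lcm_buffer_size(blksize_a, blksize_b, cap=8 * 1024 * 1024):
    """Pick a read-buffer size: the least common multiple of the two
    files' st_blksize values, capped to something sane."""
    size = math.lcm(blksize_a, blksize_b)
    return size if size <= cap else max(blksize_a, blksize_b)

def binary_equal(data_a, data_b, bufsize):
    """Compare two byte strings a bufferful at a time, a stand-in for
    the read+memcmp loop over two file descriptors."""
    if len(data_a) != len(data_b):   # fastpath: unequal lengths differ
        return False
    for off in range(0, len(data_a), bufsize):
        if data_a[off:off + bufsize] != data_b[off:off + bufsize]:
            return False
    return True
```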
For text files
diff initializes numerous fields, compares the first unequal line in detail (revisit this bit next toot?), finds the next pair of equal lines, builds a “script” (reverse or not) by recording how many lines have changed in this region, repeatedly checks whether all altered lines are configured to be ignored via regexp, & maybe emits a brief report or detailed output in a wide choice of syntaxes before cleaning up. These syntaxes will lazily generate line comparisons.
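That ignore-via-regexp check (diff’s -I flag) is simple enough to sketch, if I’ve read its semantics right: a hunk may be suppressed only when *every* changed line matches the regexp.

```python
import re

def hunk_ignorable(deleted, inserted, ignore_re):
    """diff -I style check: suppress a hunk only when all of its deleted
    AND inserted lines match the ignore regexp (a search, not a full
    match)."""
    pat = re.compile(ignore_re)
    return all(pat.search(line) for line in deleted + inserted)
```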
Generating a line comparison (where it’s actually done) involves skipping over common prefixes & suffixes, checking for the trivial cases of pure insertion or deletion, then running breadth-first bidirectional searches over the conceptual editspace graph (heuristically shortcircuiting upon expensive traversals) to locate a decent chunk of equal lines to divide the input around; this divide-and-conquer algorithm recurses into one subproblem & iterates over to the next.
This bidirectional “middle snake” search is Eugene Myers’ O(ND) difference algorithm, not (as I’d first guessed) “Patience” diff, which is a different technique named after the solitaire card game.
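The forward half of Myers’ search can be sketched compactly. This only computes the shortest edit distance (insertions + deletions), not the divide-and-conquer middle-snake recursion diff layers on top:

```python
def myers_edit_distance(a, b):
    """Length of the shortest edit script between sequences a & b via
    Eugene Myers' greedy O(ND) algorithm: breadth-first over edit
    depth d, greedy along each diagonal k = x - y."""
    n, m = len(a), len(b)
    # v[k] = furthest x reached on diagonal k so far
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                x = v.get(k + 1, 0)        # step down  (an insertion)
            else:
                x = v.get(k - 1, 0) + 1    # step right (a deletion)
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1; y += 1             # follow the "snake" of equal items
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

Note this distance counts only inserts & deletes (no substitutions), so it equals `len(a) + len(b) - 2*LCS(a, b)`.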
And finally, to compare files byte-by-byte
cmp, after initializing internationalization & ensuring there’s an open stdin/stdout/stderr, parses commandline flags, validates there are at most two additional commandline args retrieving the first couple,
fstats them if not stdin, validates they’re different files, checks if stdout’s /dev/null (& thus we can avoid work), if communicating entirely via exitcode emits failure upon unequal-length files, allocates a buffer of lowest-common-multiple size, &, after
reading to an initial start offset, repeatedly reads bufferfuls in to memcmp them (falling back to an explicit more-precise yet less-optimal loop), possibly counts newlines, & emits a choice of output syntaxes describing the differing byte before cleaning up. If one file ends before the other, that is described in the output too.
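The chunked-memcmp-then-precise-byte-loop structure, sketched over byte strings rather than file descriptors:

```python
def first_difference(a, b, bufsize=4096):
    """cmp-style search for the first differing byte offset: compare a
    bufferful at a time (the fast memcmp path), then fall back to a
    slower byte loop only inside the first unequal chunk.

    Returns the 0-based offset of the first difference, min(len) when
    one input is a strict prefix of the other (cmp's "EOF on ..."
    case), or None when they're identical.
    """
    limit = min(len(a), len(b))
    for off in range(0, limit, bufsize):
        end = min(off + bufsize, limit)
        if a[off:end] != b[off:end]:      # fast chunk compare
            for i in range(off, end):     # precise fallback loop
                if a[i] != b[i]:
                    return i
    if len(a) != len(b):
        return limit
    return None
```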
Documentation is a vital but often neglected aspect of software development. GNU’s tool for viewing documentation (bundled with a reformatter, not discussed here) is Texinfo!
After initializing internationalization & parsing commandline flags with a bit of postprocessing, Texinfo might extract the next commandline arg as the input file, looks up various info for $TERM, parses $INFOPATH into an array of search dirs, & decides whether to perform a search or continue on the mainpath.
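Parsing $INFOPATH into that search-dir array is PATH-like; a hedged sketch (my reading is that an empty component roughly stands in for the default dirs, as a leading/trailing ‘:’ does):

```python
import os

def parse_infopath(value, default_dirs):
    """Split an $INFOPATH-style string into a list of search dirs.

    Colon-separated like $PATH; an empty component splices in the
    default directories (an assumption about Texinfo's exact policy).
    """
    dirs = []
    for part in value.split(os.pathsep):
        if part:
            dirs.append(part)
        else:
            dirs.extend(default_dirs)
    return dirs
```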
To perform a search Texinfo retrieves/allocates a root dir node & iterates over its references if any. For each unique reference it loads the filecontents (checking a cache, possibly decompressing via an appropriate subcommand), before extracting all the index tags concatenating together references (whilst clearing a textentry in its reimplemented-Curses internal library) & postprocessing the computed index.
It may optionally filter a secondary result by a keyword, here to be outputted.
There’s a secondary search codepath (for the --all flag) which straightforwardly parses (this parser is also in the previous search path) the AST from all dirs referenced by $INFOPATH, & locates the dirs which contain the desired filename, before checking the manpages. The resulting array is written to stdout.
Otherwise it reformats the given file, with sophisticated logic to locate it.
In either case Texinfo finishes by cleaning up output & state.
Depending on flags & whether it found anything Texinfo may load that filepath, generate some markup representing a table of contents from the AST, & if successful either opens the output file & writes it or outputs it into its internal Curses reimplementation. These’ll both be later topics.
Or it might open the file, locate the desired section, & output it into that Curses reimplementation.
Now that the documentation is read into the viewer, amongst other initialization, how is this actually displayed? When it’s not directly copied over to the output, possibly with a table of contents prepended?
The display logic involves finishing initializing the internal NCurses reimplementation (terminal info & some “windows” whilst clearing the screen) before…
Once everything’s setup Texinfo copies the read/lightly-parsed file into the textual window, split by newlines amongst other preprocessing. For which it computes a linemap before resetting window state & search matches. Including computing a “modeline” textually indicating scroll position & what you’re viewing. This node & scroll position is added to a history array.
At which point it’s time to handle userinput!
In the mainloop TexInfo carefully reads as much input as it can, …
Upon success it outputs the changed lines in all vertically-tiled text-windows with SIGWINCH blocked. Regardless Texinfo might rerender the window holding the textcursor.
Then it moves the cursor to the appropriate point adjusted for window coords, reads a key (or mouse, for scrollwheel) sequence from the input buffer pausing for more input as-needed, outputting any printable text into an “echo window”, looking up & running a callback for each key.
Finally Texinfo runs the macro-registered callback for the given command if any, clearing the echo-area before repeating the mainloop. This keymap may be a trie to traverse, though it also includes keybindings for handling editing events. I will not explore any of these commands.
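That trie-shaped keymap traversal can be sketched like this (dict-as-trie; the real keymaps also carry the echo-area editing bindings, & these binding names are just examples):

```python
def make_keymap(bindings):
    """Build a trie from key-sequence -> command-name bindings."""
    root = {}
    for seq, command in bindings.items():
        node = root
        for key in seq[:-1]:
            node = node.setdefault(key, {})
        node[seq[-1]] = command
    return root

def lookup(keymap, keys):
    """Traverse the trie one key at a time.  Returns a command name,
    a sub-keymap dict (prefix so far: keep reading input), or None
    (unbound sequence)."""
    node = keymap
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node
```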
Then it cleans up & exits.