Shell Languages

When interacting with computers via text (the “commandline”), it’s useful to have some “domain-specific” languages (DSLs) that make it easier to process certain kinds of data. If you generalize too much everything’s a DSL, especially in this medium! This page documents how various of these DSLs work!

Text Expansion “M4”

M4 is a C-like preprocessor for text files - that is, it interprets “macros” within a text file & expands them into more verbose output. It’s a simple program, and where most of the illegibility of GNU’s build systems comes from! Personally I prefer Meson/Ninja.
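
To give a feel for the tool, here’s a minimal shell-level sketch of that expansion (the GREET macro is my own invention, not an M4 builtin; define & dnl are standard):

    # Feed M4 a tiny macro definition & two uses of it via a heredoc.
    m4 <<'EOF'
    define(`GREET', `Hello, $1!')dnl
    GREET(`world')
    GREET(`M4')
    EOF
    # Expands to:
    #   Hello, world!
    #   Hello, M4!

Autoconf’s configure scripts are essentially this trick scaled up until it becomes illegible.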


After initializing internationalization, includes, debug logs, & interrupt exit handling (compile/run-time optionally testing them) it parses commandline flags via getopt_long.

After parsing commandline flags (D/U/s/t/--debugfile [linkedlist], E, F, G, H, I, L, P, Q, R, W, d, e/i, g, l, o, --warn-macro-sequence, --version, & --help options; deprecated B/S/T, N/--diversions) it continues initialization for: configured debug logfile, input parsing, output serialization, intermediate symbol table, macrosequence warnings via regex, include environment from $M4PATH parsed to linkedlist, & either builtin or pre-saved (micro-optimized parser) libs.

M4’s standard lib consists of simple-to-trivial bindings to GNU LibC & in turn systemcalls. Nothing worth commenting on. They are loaded in at compiletime via a static lookuptable to be copied over into the symboltable.

Minor adjustments are made for “interactive mode” ignoring SIGINT & disabling the stdout buffer.

Any define-related commandline flags are deferred to this point to be loaded into the symboltable closed hashmap, with macrosequence warnings where relevant.


After, & possibly during, reparsing those define commandline flags it looks up the given file in the current dir & parsed $M4PATH (alternatively “-“ means stdin) to push onto an obstack stack & expand. Afterwards it repeatedly pops & expands the wrapup_stack. Then it cleans up while either serializing the definitions, diversions, & config OR outputting any text from “diversions”.

To expand it lexes then expands each token with the help of two obstacks.

To lex a token it consults a scanner & some obstacks to check for certain characters & a configurable regex. Expanding it may involve directly writing the text to stdout via custom-buffered specially-microoptimized write routines which may insert debugging info.

Or it may involve looking up the named macro in the symbol table for text to output in the same way, or expand. Expanding such a macro involves updating counts, validating state, optionally outputting “trace” debugging, turning any arguments into additional macros whilst evaluating them (triggering an edgecase which is handled above), then, surrounded by tracing debug output & with a string pushed to the scanner, “calling” the macro, & cleaning up by decrementing counters & freeing obstacks.

Calling the macro involves either calling the callback function (for builtins) or preprocessing the macrobody to get rid of argument references. The tokenize-eval loop handles the rest.

Bash

For computers to run the programs we tell them to we need another program to figure out which program we told it to run. On the commandline we use “shells” like Bash for this.

Bash Builtins

Though several of those programs are, for various reasons, not actually separate programs. Usually because they modify or query shell-internal state, or the state of that process in-kernel. All the commands I’m listing add minor text parsing, binding internal functions & syscalls to a textual language.
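
You can see the split for yourself with the type builtin (a small illustrative session; the paths will vary per system):

    type cd          # "cd is a shell builtin" - it must mutate shell state
    type ls          # typically resolves to a file such as /usr/bin/ls
    type -a echo     # echo exists both as a builtin & as a disk command
    builtin echo hi  # force the builtin even if an alias or function shadows it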

Bash tracks the internal state of the “processes” you have it run (might dedicate a day to this topic…), and several builtin commands wrap that infrastructure including:


To alter the in-kernel state there’s:


To mess with BASH’s command lookup infrastructure there’s:


Bash bundles an infix expression interpretor from another project, exposed via the builtins:

There’s builtins for running shell commands:

Bash includes control flow syntax, which can be controlled via the builtins:


Bash supports functions in its shell syntax, and that comes with an internal calling convention, exposed via the builtins:


Bash has a concept of “shell variables”, manipulable via:

[un]set & shopt defer to the specified internal accessor functions.
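
For instance (a few representative options; the full lists live in help set & shopt -p):

    set -o noclobber     # refuse to truncate existing files with >
    shopt -s extglob     # enable extended glob patterns
    shopt -u extglob     # ...and switch them back off
    set +o noclobber
    unset MYVAR          # drop a shellvar entirely (MYVAR is hypothetical)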


There’s builtins for directly operating Bash’s I/O:


Bash maintains a history of commands it has run in-memory & ondisk, which can be manipulated via:


Some builtins manipulate ReadLine’s (later topic) behaviour:


Other builtins include:

Bash Initialization

Upon startup bash initializes interrupts, debug tracing, extended memory allocator, /dev/tty, Cygwin /temp/, freezes if debugging login shell, internationalization, user/group IDs, checks $POSIXLY_CORRECT or $POSIX_PEDANTIC, validates memory allocation, & reconfigures interrupts.

Then Bash parses configuration from the commandline arguments (long opts first) & the current time, whilst ensuring collections have been cleared. Possibly dropping setUID privileges.

Based on the result it’ll run interactive or non-interactive mode.

Login shells close stdin/stdout/stderr on shutdown. POSIXly correct shells bind a POSIXLY_CORRECT envvar & shellvar. Any shopt options from the commandline are applied.

Buffering is enabled on stderr/stdout. The builtins table is sorted.


Interrupts are reconfigured again. Hostname is set. User info is retrieved. An expansion table for resolving shorthand (~) path syntax is built. It checks whether we’re in a restricted shell. Shellvars are initialized into a hashmap from envvars handling some specially, then from other sources. It configures the kernel to aid switching between different background processes. The parser is initialized. It saves shell flags, & evals the appropriate builtin.

Copies locale envvars to shellvars.


Interactive shells have some additional initialization.

Then it configures a savepoint, possibly sets certain shellvars OR the -i flag, checks whether the shell’s restricted again, loads additional commandline args into the appropriate list, possibly runs startup shellscripts, & initializes caches for loading commands & shellvar exceptions.

Then it evals from the appropriate input (possibly a debugger), then tidies ALL this up.

Interpreting Bash Control Flow

Bash is primarily an interpretor over a text centric DSL intended for live interaction. It has control flow constructs, but the focus is on running other programs.

The different options for where Bash reads its input from include: the commandline args (via parse_and_execute), a given filepath (resolved via $PATH), or readline.

The parser, after configuring exception recovery, configures the buffered parser input, saves aliases if we’re expanding them, gathers/preprocesses input text, clears the shell input line, then repeatedly parses (& if successful evaluates) a COMMAND* via a parser implemented in Bison, reading directly from read[line].


To evaluate the abstract syntax tree Bash parsed with the aid of Bison, once it’s been validated it runs any actions attached to interrupts flagged as having occurred, tweaks flags & considers running in a “subshell” (again with new tempstate) or “coprocess” (same in new process) instead or with time measurements, adjusts pipeline filedescriptors, possibly throws an exception, checks trapped interrupts, and configures an exception handler before branching on the AST node type & cleaning up.


Upon seeing a simple command the control flow interpretor handles linecounting, traps/debugging, optionally forking, “expands”, empty commands, & exceptions then runs it as described later!

for validates/preprocesses its inputs, catches any exceptions, possibly tweaks flags, uses substitution logic (tomorrow) to get a list of words to iterate over, & repeatedly executes its body with a newly defined shellvar & debugging handling.

Arithmetic for is a variant that evals infix exprs twice per iteration, like C syntax.


select repeatedly prompts the user to select from a list of options by index, checking the break/continue flags to determine when to end the loop.

case iterates over each of its clauses and, for any which has a pattern (of possibly multiple) that matches, evaluates that command, followed by any flagged for fallthrough.

while or until executes its test command to determine based on its errorcode whether to stop before evaluating the body.


if determines which of its bodies to run based on the errorcode of its test.

group recurses into the main interpretor function with appropriate flags set.

connection handles postfix & (recurses once or twice, running the first command in a new background process), ; (recurses twice, sequentially), | (recurses with appropriately adjusted stdin/stdout), &&/|| (if creating a subshell, decides whether to eval the 2nd based on errorcode of the first).

Double parens interprets given maths.


Double square brackets evaluates a boolean expression, possibly incorporating maths or regular expressions; after preprocessing.

The function syntax registers a new function in the variables table (later topic), provided a readonly one doesn’t already exist under that name. Debugging builds include typechecking. Arithmetic, conditionals, & functions share some pre/post-processing.
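
To ground those constructs, here’s a small illustrative script exercising several of them (the names are made up):

    greet() {                              # function syntax, registered in the variables table
      local who=$1
      if [[ $who =~ ^[A-Z] ]]; then        # double square brackets, with a regexp
        echo "Hello, $who"
      else
        echo "hello, $who"
      fi
    }
    for i in 1 2 3; do greet "user$i"; done    # for loop over expanded words
    (( total = 1 + 2 )) && echo "total=$total" # double parens, chained via a && connection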

Throughout this whole interpretation process it frequently checks for exit conditions.

argv Expansion

In Bash “simple commands” (linkedlist of “words”, first being the command to execute - i.e. normal shell commands) there’s various shorthand syntax you can use which gets expanded in a preprocessing pipeline.

Expanding words in a simple command or for loop involves passes for:


Shellvar assignment, if the appropriate flag is set, involves:

Initial pass clears specified shellvars, final pass assigns them. Shellvars may hold indexable arrays.


In Bash you can specify a comma-separated list between curly brackets for the outer arg to get duplicated once for each listed substitution. Very handy when the commandline doesn’t change that much!

To implement this, for each word in the command not flagged to disable this pass it first locates a balanced non-escaped curly bracket char in the word & its corresponding close bracket, taking a text slice. Then within that slice it locates unescaped ‘,’ & specially handles where that’s absent.

Having parsed the input & handled an edgecase, for each escaped-comma-separated textslice it recursively expands any braces therein, concatenating those results onto its own. Then finally it concatenates on the pre- & post-amble via strcpy before recursing again until there’s no more braces to expand.

If there were only a single item in the braces it’s parsed as a pair of numbers to expand into a numeric sequence.
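
For example (expansions shown in the comments):

    echo img.{png,jpg,webp}   # img.png img.jpg img.webp
    echo file{1..3}.txt       # file1.txt file2.txt file3.txt
    cp config.yaml{,.bak}     # cp config.yaml config.yaml.bak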

This word array is then spliced into the word linkedlist.


There’s a main expansion iteration over the word linkedlist which, between validation/correction code (e.g. handles “$@” specially) iterates/branches all chars therein:

Otherwise copies chars over with a quoted check.


Bash allows you to use wildcard chars which expand to all matching files visible to it in the filesystem. It wraps LibC’s, or LibGlob’s, implementation with linkedlist iteration & some pre/post-processing. Skipping results which match specified “ignore” globpatterns.

This globbing pass, if it runs, also serves to remove quoting & escaping, though that work is done regardless of whether globbing is performed.
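
For instance (assuming a directory holding .c & .o files):

    ls *.c              # every C source in the current directory
    GLOBIGNORE='*.o'    # the "ignore" globpatterns mentioned above
    echo *              # now lists everything except object files
    shopt -s nullglob   # unmatched globs expand to nothing rather than themselves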

After actually assigning shellvars the shorthand expansion is now complete!

Executing Commands

The central task we all expect from Bash is to run the external programs we specify with our specified input. Though Bash also supports other types of commands e.g. the previously mentioned “builtins”. Here I’ll describe how it locates & runs commands regardless of their source.

But first it forks a new process with appropriate stdin/stdout/stderr before locating the command or even (as described above) preprocessing the commandline args to lower shortcuts.


The edgecase of an empty arglist, the “null command”, still requires appropriate cleanup to be done.

Then it checks hashtable-registered “functions” (unless $POSIXLY_CORRECT or functions are disabled). Which it runs (after consulting additional lookups) by determining optimization flags it can set, registering exception handlers, updating the stackframe, interpreting its body (as described the previous day), & restoring the previous shell state.


Next it binary searches the sorted builtins array to locate a callback function to (with extensive exception handling, and some I/O adjustments) hand the commandline args to, possibly rewriting any invocations of the command builtin.

If the first word starts with % it’s treated as a shorthand for the bg or fg builtins. Or you could reference these jobs by name.

An optional autocd feature converts dir paths to cd commands.


Finally there’s the central concept of “disk commands”!

Optionally hashmap-caching results (with edgecase handling) it checks for absolute paths before iterating over $PATH preprocessing each entry & appending the commandname before checking if it’s executable.
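
You can poke at that lookup order & its cache from the prompt:

    type -a echo     # lists the builtin plus every matching file on $PATH
    hash             # show Bash's cached command-to-path table
    hash -r          # forget the cache, forcing a fresh $PATH search
    command -v gcc   # print the path that would be executed, if any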

To run the located command it first gathers all exported shellvars to pass to it as envvars (adding $_), ensures we have forked, adjusts signal handling & stdin/stdout/stderr, closes any FIFOs, & applies any redirections; upon failure to locate the command it’ll fall back to running the command_not_found_handle shellfunction or in serious cases output an error message itself.

Then it reformats the args linkedlist to an args array to hand to the program via execve!


As for aliases, that’s handled during parsing & autocompletion.

fork syscall (Linux)

To run a “disk” command, Bash utilizes Linux’s (or other UNIX-like’s) fork & execve syscalls.

fork, clone, & its related syscalls verify parameters configurable via clone, clone the configured struct task_struct * properties (including CPU-specific & CGroups ones) with minor verification & appropriate locking, tweak some of those fields based on flags possibly including graph traversal for debugging, add protection against stack leaks, construct a new “[pidfd]” file descriptor, & disable stepwise debugging flags.


Then Linux’s implementation of fork, etc under lock checks resource usage permissions before copying those fields over, initializing proctree, etc fields to empty values.

Then it clears the scheduler fields whilst computing priority; initializes perf events; checks allocation permissions; & copies/alters permissions, semaphore undo stacks, open files, struct fs_struct*, signal handlers, signals, memory management, namespaces, I/O, & subthread properties.


Finally it allocs a free Process ID under appropriate locks possibly incrementing it, creates the anonymous “[pidfd]” virtual filesystem entry, clears stepwise debugger flags, checks CGroups permissions again, copies a few more fields, verifies that CGroup isn’t dying & no fatal signals are pending, replaces that “[pidfd]” open filedescriptor, initializes more empty lists, adds PTrace fields, saves the PIDs in the appropriate fields, inits subtask cleanup if it interrupts upon exit, or refcounted-adds the “thread group leader”’s lists, tweaks a couple lists based on that, emits an event, initializes some final UClamp, profiling, & UProbe fields, enqueues the new thread on its selected scheduler, & returns the PID.

execve syscall (Linux)

To actually run the specified commandline programs in the new processes it forks, Bash uses the execve syscall. Theoretically this could all be done in userspace (GNU LibC supports dynamic libraries after all!), but if we want to discard the shell’s memoryspace to never return someone else has to take over the work. So how does Linux do so?
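
Before diving into the kernel side, here’s how that fork+execve pair surfaces at the shell level (a sketch; the subshell supplies the fork, the exec builtin the execve):

    ( exec ls -l / )    # fork a subshell, then replace it with ls via execve
    echo "still here"   # the parent shell was untouched
    # exec sleep 5      # without the subshell, exec would replace this very shell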

After carefully reading the filepath from userspace & normalizing args, execve & related syscalls validate & check permissions.


From there it initializes a “binary PRM” struct to represent the executable being mmap’d, takes some counts, configures a stack limit, & copies (with a decent chunk of logic to cross userspace/kernelspace boundaries) the filename, envvars, & commandline args to the stack before deferring to BPRM’s execve implementation.

BPRM then duplicates &, for SUID programs, alters the credentials struct under lock, performs some safety checks, opens the specified file read-only, …


considers switching CPU cores whilst the addressspace is small to better balance tasks between them, checks permissions via callback hook, & iterates over a linkedlist of externally-registered “binary handlers” to determine e.g. whether to mmap the executable or mmap another program specified via a “shebang” (#!) line to interpret this file, & which instruction to start execution on.

Outputs to ptrace. Then defers to “rseq” (zeroing some fields) & updates bookkeeping.

Job Control

Bash has a handy feature whereby you can run commands “in the background”, so that you can continue using the shell or another “foreground” command whilst it’s still running. I’ve already discussed some builtin commands for manipulating/viewing these, but today I wish to discuss the core bookkeeping behind it. Tomorrow probably I’ll discuss the kernel-side components.
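
The user-facing side of that bookkeeping looks like (an illustrative interactive session):

    sleep 300 &    # run in the background; Bash prints a job number & PID
    jobs           # list tracked jobs, e.g. [1]+ Running  sleep 300 &
    fg %1          # bring job 1 to the foreground (Ctrl+Z stops it again)
    bg %1          # resume it in the background
    kill %1        # signal it via its jobspec rather than its PID
    wait           # block until all background children have exited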

You can disable compiling this component & it’s integrations.


“Job stats” (JS) are tracked, & zeroed out in subshells.

It also has a concept of “pipelines” which upon startup (called internally) deallocs the old pipeline, zeroes out the pipeline PGroup if it’s different from the shell’s, & closes the old pipe.

Stopping a pipeline (called upon interpreting the trailing command of the control flow, whether that be a simple command, pipeline, coprocess, or function definition) serves to add the full pipeline’s processes to a joblist.

Before resizing the joblist ringbuffer upon closing a pipeline it’ll first see if it can reap dead processes it’s tracking. Normally it’d flag them as dead first so it can tell you they died. Otherwise it’s deallocating the entry & updating counts. Compaction is another iteration or two.

Then it iterates over the pipeline a couple times to populate a newly allocated “job”. Or it calls tcsetpgrp (with signals blocked via sigprocmask) to place it in the foreground.

It doesn’t handle SIGCHLD in the meantime.


These pipelines can be “saved” & “restored” to/from a linkedstack, used for expanding backtick-subcommands into literal arguments & trap handling.

That ringbuffer is resizable & used to populate a trivial numeric-closed-hashmap (multiply+remainder hashfunction), with statically-allocated maximum capacity. This hashmap might be invalidated upon resizing the ringbuffer.

bgp_add moves an entry from the ringbuffer to the hashmap.


There’s a linked-list of process substitutions. Not that it appears to actually be used anywhere…

It tracks the range of job IDs.

The control flow interpretor may call append_process to add a new PROCESS* to the jobs array (which appears separate from the hashmap & ringbuffer). This may be iterated over to SIGTERM or SIGHUP then SIGCONT all those pgroups, find jobs by their IDs, & output their statuses possibly filtered by status.


There’s an iteration over the cyclic linkedlist for a given pipeline seeking a given PID, or over the pipeline stack. Pipelines can be printed. There’s abstractions over certain interrupt handling patterns.

For BSD it might need to wait for all output to drain before changing the baudrate. It abstracts the TIOCGETP, TIOCGETC, & TIOCGLTC and/or TCGETA ioctls to retrieve the teletype state. Another abstracts the corresponding setters.

There’s a function for setting the global (in-memory) foreground job, with validation.


There’s an abstraction to wait on a given PID, possibly background, or (possibly any) job with or without cleaning up all the surrounding bookkeeping I’m describing. The rawest function here wraps a repeated waitchild with bookkeeping for which other children die in the meantime.

Another function resets the foreground job. This seems to be all userspace.

Upon a simple command the controlflow interpreter calls start_job.

After validation start_job unsets the NOTIFIED bitflag, optionally sets & flags it as foreground, outputs the status manually otherwise, followed by any other jobs it needs to notify you about, flags the command as running, tells the kernel to hand terminal control to it (ah! There’s that code!), if needed sends SIGCONT, & cleans up.

There’s an abstraction around kill[pg] which ensures the process is in a good state to receive the interrupt.


Bash has its own interrupt handlers.

Upon SIGCHLD it repeatedly waits for the child to exit updating any of the datastructures it might be in. Including the job structure itself, unsetting the NOTIFIED bitflag & calling any cleanup callbacks while we’re at it.

There’s a function for outputting detailed information about a job’s status.

There’s an initializer, destructor, & reset for this whole subsystem which is called lazily. There’s a new_line_discipline setter wrapping several IOCTLs.

SIGCONT is forwarded to fg.

SIGSTOP restores old signal handlers & takes terminal control for itself before forwarding to fg.

There are functions to delete all jobs being tracked, trigger NOHUP on all of them, count them, flag them all as DEAD, & unset the NOTIFIED bitflag on all of them with some counting.

There’s functions to access the jobs_list_frozen global. There’s one to retrieve & cache its own PID. To update lmaxchild. To configure the SIGCHLD interrupt handler. And some cleanup functions.

Exiting Processes (kernel-side)

When a process exits (by calling that syscall explicitly or implicitly), Linux notifies e.g. Bash’s job subsystem via a SIGCHLD signal or waitpid syscall. Also GNU LibC adds a little of its own logic to help userspace tidy up after itself. So today I’ll describe that syscall, next time: teletype IOCTLs.

After validation/correction, Linux updates registered profilers (including, separately, fuzzing support & PTrace), checks credentials, & cleans up once it’s in a consistent state.


Whilst tidying up it tackles, in order:


Then it considers which processes it needs to notify and (if it’s a thread leader or PTrace’d) it validates state, triggering an atomic condition attached (waitpid) to the process, & sends SIGCHLD to the parent.

An “atomic condition” integrates into the scheduler, carefully avoiding race conditions, to pause threads until the condition’s triggered.

Once the input’s gathered & validated, sending a signal involves appending an entry to a linkedqueue whilst updating rlimit counts, triggering the corresponding filedescriptor’s atomic condition, setting the appropriate bitflag, & waking the process up. Presumably as it wakes up the userspace callstack will get updated for each queued signal.


Finally it deconstructs its “process connector” driver & its mempolicy, validates no locks are still held, cleans up its block I/O context, pipe info (with its ringbuffer), & task frag memory page. Then revalidates credentials, validates stack usage, temporarily disables preemptive multitasking, increments a per-CPU dirty count, RCU tasks (2nd pass for debug output, whatever RCU means…), & schedules something else in that’s actually alive!

Teletype Files (Kernel-side)

Linux provides special devicefiles for (virtual) teletypes, which Bash uses to manipulate which process (or rather which process group; I’ll discuss CGroups much later alongside systemd) receives its user input.

This, like all files, is implemented as a methodtable to which the appropriate syscalls defer.


A TTY file cannot be seeked. Upon read, after validation, they repeatedly defer to their underlying configured line discipline’s method & maybe update a timestamp.

Upon write, after validation & retrieving the locked line discipline, it either overwrites the methodtable with noops or repeatedly defers to the wrapped & locked file descriptor with extensive error handling. Similar for polling, but that doesn’t require much error handling at all.

It doesn’t override methods for spliced reads & writes.



As for job control IOCTLs, which is what I’m really interested in here…

Or it might defer to methods on the TTY or line discipline to handle unknown IOCTLs. There’s also a legacy IOCTLs method that does similar things.


Upon open they defer to their wrapped file, possibly followed by tcgetattr/raw calls.

Upon final close they flush & close their wrapped file, wait on outstanding tasks, output appropriate warnings, & deconstruct themselves.

Upon fasync after validation & under lock they retrieve the process ID & type to set as the file’s owner.

And outputting debugging info is deferred to its own corresponding method.

Bash History

Bash optionally logs all the commands you type to make it easier to repeat variations upon them & eventually translate into shell scripts. In the ReadLine prompt navigating this history is tied to the up/down keybindings. Plus there’s builtin commands for accessing them & ctrl+R searches them by substring.
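
The knobs & commands involved (a representative configuration, not the defaults):

    HISTSIZE=5000 HISTFILESIZE=10000     # in-memory & on-disk limits
    HISTCONTROL=ignoredups:ignorespace   # skip duplicates & space-prefixed commands
    history 10     # show the last ten entries
    history -w     # write the in-memory list out to $HISTFILE now
    fc -l -5       # list recent commands via the fc builtin
    !!             # rerun the previous command ($histchars shorthand)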


To (re)initialize this history logging (upon startup or flags) it sets a few globals to default values, looking up shorthand chars from the $histchars shellvar.

Upon disabling history logging (called from the parser & some builtin commands) it unsets some globals, reloading the $HISTCONTROL & $HISTIGNORE shellvars.

Upon loading history (called on startup & set) it ensures the $HISTSIZE & $HISTFILESIZE shellvars are configured & has ReadLine parse the lines out of the file specified by $HISTFILE, remembering the linecount. There’s a corresponding function for clearing history (called on history) deferring almost entirely to ReadLine.


There’s functions abstracting ReadLine’s APIs for altering in-memory history, exposed via the history & fc builtin commands. One of which attempts to open a file for ReadLine to write to. A similar function is called by the REPL loop & signal handlers to save the in-memory history to $HISTFILE.

There’s an abstraction around ReadLine’s preprocessor, with error reporting, where it appends history records. And separate ones for stripping comments, etc.

There’s several functions to log commands to different targets, including in-memory with ReadLine, $HISTFILE, & Syslog. All abstracted into bash_add_history called from above functions.

And there’s a few other miscellaneous functions.

So… basically all the logic’s in ReadLine.

Bash Autocompletion

In Bash when you press the tab key it’ll autocomplete what you’re typing, mixing its own logic with ReadLine’s UI.

Upon hitting that tab key ReadLine calls a Bash-provided callback which, after adjusting ReadLine globals & stripping preceding whitespace, considers whether this is a quoted or unquoted command name to autocomplete, in which case (after correcting some flags regardless) it runs a completion callback specific to that (or reuses the generic filepath completion).


That commandname completion conditionally sources from all the same locations used for lookup, and supports glob matching. Amongst other adjustments.

It compiletime-optionally skips any preceding shellvars and considers whether to use that commandname autocompletion or interpret the command-specific configuration before falling back to a generic autocompleter. Which chooses between completing command substitutions (recursive), shellvars, usernames, hostnames, ignores, or filepath globs.


To evaluate programmable autocompletions Bash first looks up the command in a hashmap before consulting the specified bitflags found there to determine whether to autocomplete aliases, arrayvars, bindings, builtin commands, disabled commands, enabled commands, envvars, functions, builtin help, hostnames, jobs, keywords, running jobs, set flags, shopt flags, signals, stopped jobs, shellvars, commands (as per before), files, users, groups, & services.

Directories are treated specially.


Then it autocompletes globs against the filesystem. If the autocompletion declaration specified a list of words, it’ll autocomplete any matching ones at this point in the process.

It then splits args, consults the shellfunction about to be called, & optionally runs a specified command to retrieve autocompletion options.

Then finally tidies up and performs some final filtering. Possibly rerunning the main suite of autocompleters.
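
Those programmable completions are declared with the complete builtin; a small sketch (the deploy command & _deploy function are hypothetical):

    # Offer a fixed word list for a hypothetical "deploy" command:
    complete -W "staging production rollback" deploy

    # Or compute candidates in a shellfunction via the COMP* variables:
    _deploy() {
      local cur=${COMP_WORDS[COMP_CWORD]}
      COMPREPLY=( $(compgen -W "staging production rollback" -- "$cur") )
    }
    complete -F _deploy deploy

    complete -A directory cd   # one of the "action" categories listed above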

Gawk

In Unix it’s common to need to convert some input format into some output format. And when there’s not a dedicated command to do the conversion you need, often AWK can be programmed to do it for you. I’ll be studying GNU’s implementation today.
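
A typical conversion job, reformatting /etc/passwd into a little report (plain portable AWK, nothing Gawk-specific):

    awk -F: '
      BEGIN { print "login\tshell" }         # runs before any input is read
      $3 >= 1000 { print $1 "\t" $7 }        # a pattern/action rule, per record
      END { print NR " records scanned" }    # runs after the last record
    ' /etc/passwd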

To interpret these programs Gawk checks whether $TIDYMEM is defined possibly installing memory allocation logging callbacks if so, extracts the executable basename from argv[0], considers outputting usage info if not enough commandline args, retrieves $GAWK_LOCALE_DIR with default, ensures stderr’s in appendmode if those syscalls are available, initializes internationalization, registers signal handlers, allocs some extra stackspace, allocs an empty “null” string, validates stdin & stdout filedescriptors, & registers methodtables for string “arrays” (or rather hashmaps) & (if compiled with MPFR) ints & complex ints.

Then it allocs & zeroes-out some symboltables with appropriate methodtables for empty arrays & registers a couple globals, allocs & linkedlist-pushes a new context, parses commandline flags with those commandline args (enqueueing any assignments & input source files) backed up for debugging, possibly reinitializes internationalization, prepares a lookuptable for quickly converting bytes to widechars, for non-ASCII/Unicode (EBCDIC) locales initializes a lookuptable for uppercasing chars, outputs a nostalgic error message if requested, checks $POSIXLY_CORRECT converting it into a bitflag whilst checking bitflag conflicts, possibly warns about the setuid permission bit being enabled, possibly registers breakpoint debugging callbacks, initializes some MPFR globals if compiled against MPFR, saves a list of groups from the kernel, initializes regexp globals, & loads a bunch of globals & envvars into the runtime symboltable.

Next it adds the pre-allocated null to the start of the fields array, loads in the variables specified by commandline flags, considers setting stdin to binary mode, in debug builds enables stdout buffering, detects whether stdout is a (virtual) teletype, initializes some stdlib flags, loads & runs some dynamically-loaded modules, maybe outputs the version number alongside all of those extensions’ versions & a copyleft notice before exiting, fills in missing source code from commandline args or errors out, initializes the stackframe & boolean lookup table for the interpretor, populates vars from commandline args, reconfigures the numeric locale if possible, & parses the AWK program (tomorrow’s topic) exiting immediately given the -g flag.

Finally it sets a global for the current namespace, populates syntactically-builtin functions into the symbols table, possibly lints that no function shadows any globals, possibly lints whether there’s any sourcecode, possibly registers signal handlers to dump execution state with or without exiting, restores the numeric locale, initializes the runtime scanner, debugs or interprets the program (later topic), maybe outputs interpretor state, & cleans up.


Like any programming language Awk needs to be parsed. To do so Gawk saves whether this parsing was called from eval in a global, allocs a noop end instruction, checks the context level to determine whether to alloc other global opcodes, finds the first EXTLIB sourcefile to save in a global, clears some other globals, allocs a tokenization buffer if needed, runs a Bison/YACC parser, restructures it into straightline code, clears some globals, validates undefined functions aren’t called, validates parameters don’t shadow functions, & allocs some memory to store corresponding arguments in.

Reformatting the program involves adding the end instruction to the endblock creating it if necessary, possibly merges the begin & program blocks + maybe endblock appending a STOP opcode, appends the endfile opcode to the endfile block creating it if necessary, does similarly for beginfile, handles certain empty cases, & more append/prepend/merges.

Those appends/prepends/merges involve appending an AFTER_ENDFILE opcode to the endfile block, prepending the record op to the program block & appending a JMP to it, appending an AFTER_BEGINFILE opcode to the beginfile block, merging the beginfile block & program block, prepending the newfile op to the merged block, merging the endfile block & end block & (if it exists) begin block, merging the outercomment op if it exists, appending the interblock comment if it exists, & appending the atexit op & a STOP opcode.

The parsing itself thanks to Bison resembles BNF.

Upon EOF the parser iterates the lexer over to the next input file, otherwise it updates/allocs various linkedlists of opcodes using a hashtable & the symboltable to resolve function references. Whilst allowing the source code to append input files to the scanner or load/run dynamic libraries extending the available APIs.

It adds opcodes to assert certain conditions, or “lints”, within this parsing. It uses its regexp engine. There’s constant propagation.

switch statements require significant parsing to gather/reformat their cases. Loops require updating the jump targets. for loops can be significant shorthand syntax which is expanded here in the parser, as can print statements.

The parser may have temp memory to free. Equals require paying special attention to the lvalue.

There’s numerous helpers for manipulating these linked lists.


The Bison-generated parser is configured to use a handwritten lexer which, after consulting the previous token to determine whether to output a SUBSCRIPT or EOF token & consulting a ringbuffer before falling back to reading the current file a bufferful (or for debugging, a line) at a time to peek the next char to lex, skips OS/2 “extproc” lines, & if it accepts regexps it iterates over chars counting open square brackets whilst lexing escapes & validates it’s terminated before the end of the line/file.

During lexing regexps it lexes any trailing regexp modifiers.

Then it skips whitespace and branches over the token. Certain (pseudo)chars are converted straight to returned tokens. Whilst lexing linecomments it may or may not preserve the line it skips in a global for pretty printing before yielding a newline token. “@” becomes a token, but with following regexps handled specially. “\” removes the following newline (and maybe whitespace/comments) from the input stream.

“?” & “:” pairs are counted with globals updated to indicate the lvalue is a condition. Corresponding parens (“(“ & “)”) are counted. “$” updates the lvalue global to indicate it’s a field. “;”, “,”, & “[” are converted direct to tokens. “]” peeks at the next token to possibly warn about the portability of multidimensional arrays. “*=” flags the lvalue as Op_assign_times. “**” & “**=”, unless sticking to POSIX compliant behaviour, become “^” & “^=” tokens. “*” otherwise is a token itself.

“/=” becomes SLASH_BEFORE_EQUAL & = tokens, otherwise “/” is a token in its own right. “%=” marks the lvalue as Op_assign_mod yielding an ASSIGNOP token, otherwise “%” is a token in its own right. Similar for “^” but with optional warnings that this is relatively new. And for “+” but with a dedicated “++” INCREMENT token. “!” has a dedicated “!~” MATCHOP (indicating the “not” via lvalue global) token. “<” & “=” are lexed similarly.

“>” has a “>>” IO_OUT token & lexes differently in parens. “~” corresponds to its own MATCHOP token. “}” counts braces with special newline lexing. Lexing quoted strings generates/mallocs a literal value stored in the lvalue whilst handling escapes. “-“ is lexed similarly to “+”. “.” is its own token when not followed by digits, in which case it determines which base to use & parses via MPFR or LibC. “&” can be its own token, or double up into an AND token with a new Op_and lvalue.

“|” is lexed similarly to “&” but optionally with a special “|&” IO_OUT/IO_IN (depending on whether in both a print & parens expression) token whilst setting an Op_symbol lvalue; on its own it’s also an IO_OUT/IO_IN token with different redir_type values on the Op_symbol lvalue.

Otherwise it lexes an identifier & checks if it’s a special keyword, before validation, lvalue correction, & cleanup.


To interpret this linkedlist of opcodes Gawk, starting from a “program counter” of instruction 0, repeatedly saves the source line number to a global, maybe runs preexecute callbacks for stepwise-debugging or profiling, maybe outputs each opcode being executed to stderr, branches over the opcode, & normally jumps to the next instruction in the linkedlist.

Op_rule opcodes update the currule global (and, if it’s BEGINFILE, set the record global to NULL copying over any existing data) & the source global. Op_func only updates the source global. Op_atexit has the scanner iterate over to the next input file, closes all remaining input files, closes any dynamically-loaded extensions via their destructors, & maybe dies indicating a SIGPIPE exception. Op_stop exits the interpretor loop.

Op_push_i pushes the given refcounted value onto the stack after possibly internationalizing it. Op_push[_arg[_untyped]] all retrieve the given value from the opcode, consulting a different lookuptable with type validation if it’s originally of type Node_param_list, before branching on the value’s type. If it’s Node_var it might warn about uninitialized values before refcounting & pushing to stack. For Node_var_new it replaces the nullstring, possibly warns, & pushes a duplicate to stack.

And for Node_var_array it pushes to stack in Op_push_arg[_untyped] opcodes, complaining for Op_push opcodes. Op_push_param retrieves given value, maybe replaces with one from args array, & either refcounts & pushes the var’s value or pushes the value itself. Op_push_array pushes the given value. Op_push_lhs does some validation/conversion before pushing the given value.

Op_subscript pops the given number of items from the stack followed by the array range-validating it, if the popped array is func_table it might warn once & yield the subscript but normally calls the “array” method to perform a lookup, if the popped array was symbol_table it first flushes any mutations & afterwards performs appropriate warnings & normalization, dereferences the slice, & references & pushes the result to stack.

Op_sub_array pops given number of items from the stack followed by the array range-validating it, on success creating a new array to copy that entry over into. Op_subscript_lhs works essentially same way with more validation & pushing a reference to stack instead of value.

Op_field_spec[_lhs] pops an int to lookup in the fields array to get a value to push. Op_lint emits given error message if configured to do so via commandline flags. Op_lint_plus peeks at the top two entries to warn if they’re not strings. Op_K_break/continue & Op_jmp repeats loop for given target. Op_jmp_false/true coerces popped value to bool & conditionally jumps.

Op_and & Op_or pops a scalar coerced to a boolean to decide whether to shortcircuit, on success pushing the coerced value back to the stack & jumping to given instruction. The _final variant of those opcodes coerces the top-of-stack to a boolean with refcounting.

Op_not does the same but with negation operator. Op_equal, Op_notequal, Op_less, Op_greater, Op_leq, & Op_geq wraps a helper util to compare to the top 2 stackvalues as strings if not numbers.

Op_plus adds the top 2 stack entries, replacing the 2nd with a value wrapping the sum; the _i variant allows referencing the 1st value from the opcode instead. Op_minus, Op_times, Op_exp, Op_quotient, & Op_mod do the equivalent for their respective operations. Op_preincrement & Op_predecrement replace the top-of-stack coerced/parsed to a number with that value plus or minus one, updated in place if possible. The post variant works essentially the same with a little more juggling.

Op_unary_minus coerces top-of-stack to a number & negates it. Op_unary_plus coerces top-of-stack to a number.

Op_store_sub retrieves given array, calls its lookup method validating result is not itself an array, validates more whilst following indirection, pops a scalar, & calls array store method. Op_store_var retrieves given lhs value & stores given value there. Op_store_field pops top int, looks it up in fields array, calls given assignment callback, & overwrites returned value.


Op_assign_concat pops the top-of-stack lvalue coerced to string, pops a string, takes a shortcut if (amongst other things) they’re the same, allocating the summed length & memcpying both strings into place optimizing for sole references. Op_assign pops an address & scalar to store a unique copy of the scalar in that pointer. Op_subscript_assign after various validation calls an array method. Op_assign_plus/minus/times/quotient/mod/exp pops two numbers, one as an address, to apply that operator wrapped in a value which is also pushed to stack. Op_var_update calls a callback on the opcode. Op_var/field_assign performs some validation on top-of-stack before calling the corresponding callback on the opcode.

Op_concat pops the given number of strings (fastpath for 1) summing their lengths, allocating a new string, & copying them all into place.

Op_K_case pops & compares 2 scalars to determine whether to take the branch, with a special regexp variant. Op_K_delete pops an array pops the given number of indices off the stack validating them before calling the corresponding “array” method. Op_K_delete_loop pops an array & address to call the corresponding symbolmethod.

Op_in_array pops an array & a given number of indices to check whether those indices are in the array pushing result converted to a value.

Op_arrayfor_init pops an array validating whether it’s non-empty, initializes the sorted_in string constant if needed, tests whether that key’s in the array to retrieve how to sort it, coerces & possibly sorts the list as a newly alloc’d value then converts it into an iterator value to push to stack. Op_arrayfor_incr increments the current index throwing errors on overflow & pushes the value at that index. Op_arrayfor_final frees this iterator.

Op_builtin calls given callback & pushes result. Op_ext_builtin after argcount validation calls given callback before popping all args from stack & pushing converted result. There’s several opcodes directly corresponding to builtin functions like sub, print, printf, & print_rec implementing sophisticated string manipulation.

Op_push_re retrieves given value & maybe pops a string to store in its re_exp property before pushing it. Op_match_rec retrieves given field & evaluates regexp before converting to & pushing a bool value.

Op_[no]match does the same but retrieves the text to match from the stack whilst possibly overriding the regexp to evaluate.

Op_indirect_func_call peeks & validates top-of-stack, looks that property up in the global “arrays” & either calls the callback pushing the result or updates the callstack jumping to the function’s opcodes. Op_func_call looks up the named function body in those global “arrays” caching the result inline, updates the stackframe, & jumps to its start instruction.

Op_K_return_from_eval errors. Op_K_return pops a scalar & stackframe to determine where to jump to after restoring that scalar to the top-of-stack as the return value.

Op_K_getline_redir tells the scanner to read a line (later topic). Op_K_getline validates it’s not called within the BEGINFILE or ENDFILE rules before repeatedly iterating to the nextfile and either reading the line to push the successful result or pushing a stackframe to jump to the begin or endfile rule.

Op_after_endfile is akin to a return statement. Op_after_beginfile also resets various scanner state. Op_newfile iterates scanner to next file jumping to appropriate destination possibly popping a stackframe. Op_get_record validates there is a curfile global (following a jumppointer if not) before having scanner read a record following another jumppointer on failure. Op_K_nextfile iterates scanner to nextfile returning from current stackframe, BEGINFILE handled specially.

Op_K_exit validates there is a currule, pops an errorcode, pops a stackframe if in BEGINFILE or ENDFILE, pops an ignored value from stack, and determines where to jumpto (END rule handled specially) from an opcode property. Op_K_next validates currule before popping stack & following goto.

Op_pop pops & ignores a scalar. Op_line_range consults a flag on opcode for whether to follow the goto. Op_cond_pair pops a bool to update associated Op_line_range & decide which goto to follow. And finally Op_exec_count if profiling is enabled increments a counter property on the opcode. Op_no_op, Op_K_do, Op_K_while, Op_K_for, Op_K_arrayfor, Op_K_switch, Op_K_default, Op_K_if, Op_K_else, Op_cond_exp, Op_comment, & Op_parens are all noops to aid parsing.

Datamodel involves lookup arrays for function args & input matches, an interpretor stack, typed hashmaps with methodtables including ones for the global context, & dynamically-typed regexps, numbers, strings, or hashmaps called “arrays”.

There’s a runtime scanner used in interpreting the broader AWK syntax which I skipped over yesterday preferring all the different opcodes.

When iterating over to the nextfile to concatenate, it consults some script-accessible globals to determine which file to open, which is where most of the extra complexity is, and has callbacks for setting other script-accessible globals.

To read a “record” it uses configurable callbacks (including a read timeout) to repeatedly enlarge the read buffer until it finds the record-terminator.

There’s special handling for network files. The scanner ensures reads can timeout as configured via the select syscall. There’s an OO wrapper around AWK’s output.


Gawk, for whatever reason, implements regexps itself rather than reusing GNU LibC’s. Doesn’t seem to be so that it can operate byte-by-byte as it reads the file, since the scanner operates entirely in-memory…

To parse these regexps, after some initialization & bracket-syntax linting, it resolves any escapecodes character-by-character into a new input string, determines whether to ignore casing, possibly allocs some DFA graphnodes, saves bitflags to a global, recursively compiles it, tweaks the DFA to handle newlines & backreferences, & extracts bitflags from the resulting buffers.

The parser uses a handwritten lexer (mostly converting singular chars to an enum) that consults various flags, an in-memory scanner shared with the interpretor, & a handwritten pushdown automaton emitting a generic binary abstract syntax tree with minor postprocessing to terminate the regexp.

Once parsed this generic abstract syntax tree is compiled to a non-deterministic finite automaton (NFA) via some initial allocations, maybe a postorder traversal to inline all subexps it can, postorder traversals to lower any remaining subexps into open/close opcodes, to extract the first match for each branch, to extract the next match, & to connect resulting NFA graphnodes with labeled edges representing control flow, recursively computes epsilon closures (starting DFA lowering) for each NFA graphnode, & maybe (if needed) computes inverse epsilon closures.

Then if possible it optimizes to traverses UTF8 bytes rather than chars, & regardless expands initial state into a DFA.

To evaluate the regexp it calls an interpretor callback on the DFA if it’s already been lowered, which normally checks whether the edgetable is larger than 1kb in which case it frees any attached state on those edges first (seemingly to avoid memory leaks), checks if it needs to reallocate the transition table, possibly allocates a transition set for multibyte chars, & repeatedly looks up (fairly sophisticated) a DFA edge to follow constructing a new one from NFA edges where needed.

Failing that, interpreting the regexp, with ranges validated & a lock held, involves gathering some bitflags, constructing entry-lookuptables if needed, allocating a registertable, &, before cleaning up, performing some validation, allocating an input string & match context & DFA state table, computing the initial state, & interpreting each match_kind yielded from the NFA/DFA graph traversal using the fastmaps where possible. Afterwards it selects a single answer to return from all possible matches.

Grep

grep outputs just the lines of its input file(s) which match a given regexp.
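
Typical invocations touching the features described below (illustrative; the filenames are made up):

    grep -n 'TODO' src/*.c               # basic regexp match, with line numbers
    grep -E '^(foo|bar):' config.ini     # extended-regexp syntax
    grep -F 'literal.string' notes.txt   # fgrep mode: fixed strings, no regexp
    grep -r --color=auto 'main(' .       # recurse into directories, colourized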

To do so, after initializing internationalization & registering to close stdout on exit, grep initializes a hashmap of regexps, parses flags handling digit flags specially allowing selecting between 4 regexp syntaxes, outputs specified messages whilst normalizing commandline args, frees the original pattern hashtable, validates stdin is a chr-device & not /dev/null, validates those commandline flags, determines whether to colourize the output & initializes it if so parsing which colours it should use from envvars, chooses a “unichar-mask” for the charset, normalizes the options for selecting a regexp engine handling fgrep specially, compiles the given regexp via the selected callback, evaluates that regexp to determine whether to skip blank lines, determines the output file to write to with or without configuring binary mode, determines the bytesize & alignment for the buffer & allocs it, & iterates over each file from the commandline arguments (with defaults).

For each, handling “-“ specially, it opens the file with configurable bitflags following symlinks, fstats it, validates whether it actually wants to process this file, considers iterating over directories filtering entries with co-recursion & GNU CoreUtils’ filetree traversal utility, & reads the file bufferfuls at a time (checking for NUL chars, or removing them) looking for newlines, evaluating the regexp via a pre-selected callback with some normalization, buffered-printing and/or counting results. The code for printing results is split into several functions handling each line & in turn the filename/line, prefix, match, & suffix.


Usually grep will use some variation or other of its own internal regexp engine.

To compile a regexp (done upon startup) it allocs some memory, normalizes bitflags, repeatedly locates each backref in the pattern lowering them away & compiling each resulting regexp (presumably) just like Gawk did via a light wrapper, performs some tweaks on the resulting DFA, & compiles the result again just like Gawk. With a fastpath for locating a fixed number of substrings in the haystack.

To interpret the resulting DFA it extracts the DFA superset & whether a fastpath can be taken & for each char it maybe defers to keywords matcher & convert back to DFA traversal, maybe evaluates the DFA superset followed by a mem[r]chr call, & evaluates the DFA itself all to compute a startpointer. Or it already knows the startpointer. Regardless it finishes each iteration by validating the position, iterates over the regexps evaluating them normalizing results & selecting best result.

The keywords search fastpath can be treated as an alternative regexp engine, which is parsed by splitting input by newlines & evaluated by iterating over each start position deferring this fastpath before possibly memrchr, the full regexp engine, & repeating the fastpath.

But more generally it has a datastructure with an allocator/destructor prominently storing a trie (with fast & slow paths), such that the core logic is evaluating the trie via the Boyer-Moore or Aho-Corasick algorithms.

You can choose to use LibPCRE instead to evaluate regexps.

There’s a scanner shared by the DFA engine & trie fastpath.

And yes, after having just described Gawk’s regexp engine I’m not digging into Grep’s any further.

Make

To tie together a project’s build toolchain (once you’ve downloaded, uncompressed, & unarchived the source code) UNIX offers the make command. Though personally I prefer Meson/Ninja which so happens to be on my LFS schedule.
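
For reference, the kind of input make consumes, written & run from the shell (a toy example assuming a hello.c sits alongside it; recipe lines must start with a TAB, hence printf):

    printf 'hello: hello.o\n\tcc -o $@ $^\n\nclean:\n\trm -f hello hello.o\n' > Makefile
    make hello    # compiles hello.o from hello.c via an implicit rule, then links it
    make hello    # second run does nothing: the target is already up to date
    make clean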

Most of the UNIX commands I’m describing on this LFS schedule are used in some build toolchain or other.

To do so make does the following in a convoluted repetitive manner navigating potentially unpredictable dependencies:

Then it runs some core logic on success, redefines the “MAKEFLAGS” variable, loads any new files into the hashmap, attempts to unlink stdin, fills in a default goal possibly with a file lookup, & diagnoses failures whilst recomputing the buildplan differently.


Upon parsing the given makefiles it defines “MAKEFILE_LIST”, retrieves “MAKEFILES” with undefined-variable warnings disabled, then iterates over the whitespace-separated tokens to parse those files. Then it parses the given files with OS-specific fallbacks.

To parse each of those it allocates & zeroes a “goaldep” prepending it to a linkedlist, saves the filename with line number & offset for error messages, checks given flags for localized warnings to report, maybe expands tildes in the filepath, carefully opens the file reporting ENOMEM errors, falls back to searching the configured include_directories for the makefile, opens it & enters it into a deduplication hashmap, checks for NULL files, clears errors, disables subprocesses from inheriting this open file, appends to the “MAKEFILE_LIST” var, resets various buffers, “evaluates” the text, & cleans up.

Evaluation, after initializing various fields & mallocing a string, processes each line (checking for a trailing “;”-escape to immediately repeat the fgets call whilst counting up those newlines to add to the line number). It skips a UTF-8 byte-order-marker on the first line & skips empty lines. If a line starts with a tab it’s parsed as a shellcommand for the previous rule; otherwise it allocs a new copy, deletes escaped newlines & backslashes escaping backslashes (memmove shortcut), replaces unquoted comment or var chars (via lookuptable) with a nil terminator, & skips preceding whitespace. It parses var assignments with modifier keywords & a handful of unbracketed infix assignment ops, upon success saving to the AST, ending the previous rule, & adding to the hashmap after parsing & expanding the value. Otherwise it skips empty lines; if flagged as in an “ignored define” it skips further processing after considering whether to exit this block; otherwise it parses conditional lines to consider whether to enter such a block (or a more complex variant); looks up “[un]export” vars when not paired with an assignment to flag them appropriately; upon “vpath” declarations saves its value whilst expanding vars, ending the previous rule; upon “include” declarations ends the previous rule & co-recurses; upon “load” declarations enqueues parsed files (after expanding vars & closing the previous rule) into the linkedlist of “goaldeps”; & warns about missed shellcommand lines. Otherwise it looks for a trailing uncommented/quoted semicolon “;”, collapses escaped newlines again, lexes the pattern, expands any vars, repeatedly lexes the pattern with an infix unquoted/commented semicolon, skips whitespace, validates there’s something to do in response, attempts to parse a var assignment, otherwise unescapes any equals, expands any vars, locates the semicolon, determines which infix operator was used, normalizes filepaths to UNIX format on DOS or AMIGA, & with a little bit more preprocessing opens a new rule. After this loop it cleans up, validating all conditional statements are closed & committing the last rule.

A handwritten line-oriented parser.

The AST involves a hashmap tracking string vars with various metadata (where/how it was defined, whether to export, conditionality, etc). Also vars can be flagged to be expanded upon lookup not just storage. These hashmaps can be chained in a linked list/stack.

The AST also consists of “pattern var” linkedlists, specifying a suffix, target, & var.

But mostly the AST consists of a linkedlist of “rules” specifying target strings each with their own length, suffixes for each target, dependencies, commands with count, whether to fork a shell, & a use flag. These can be parsed from a lexed “pspec”.

A command associated with a rule references the source code where it was defined & consists of the shellcommands concatenated or chopped into an array of lines, per-line bitflags, linecount, a “recipe prefix”, & a bitflag for recursion.

A goaldep consists of dependencies, an error, & a reference to its source code. Dependencies are represented as an inlinable linkedlist consisting of a namesequence, file, stemstring, & bitflags. Nameseqs in turn are a string inlinable linkedlist.
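
Pieced together from those paragraphs, the record shapes look roughly like the following; the field names are my guesses from the prose, not make’s exact declarations:

    /* Rough sketch of the records described above; field names are guesses,
       not make's exact declarations. */
    struct file;                    /* make's per-target bookkeeping, elided here */

    struct nameseq {                /* string inlinable linkedlist */
        struct nameseq *next;
        const char *name;
    };

    struct dep {                    /* dependency: a nameseq plus extra fields */
        struct dep *next;
        const char *name;
        struct file *file;
        const char *stem;
        unsigned short flags;       /* bitflags */
    };

    struct goaldep {                /* goal: a dep plus error & source reference */
        struct goaldep *next;
        const char *name;
        struct file *file;
        const char *stem;
        unsigned short flags;
        int error;
        const void *floc;           /* where in the makefile it came from */
    };

    struct commands {               /* recipe attached to a rule */
        const void *fileinfo;       /* where it was defined */
        char *commands;             /* the shellcommands, concatenated */
        char **command_lines;       /* ...or chopped into an array of lines */
        unsigned char *lines_flags; /* per-line bitflags */
        unsigned short ncommand_lines;
        char recipe_prefix;
        unsigned int any_recurse:1; /* bitflag for recursion */
    };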


If that initialization succeeded it outputs debugging info, counts the number of makefiles to allocate an array, checks if each rule has a dependency on shell-commands with no transitive dependencies (removing those entries from the linkedlist, otherwise recording the file mtime into the array), redefines “MAKEFLAGS”, with db_level temporarily set to DB_NONE & a bitflag computes a partial ordering whilst evaluating it, & diagnoses failures.

To (initially) diagnose failures it checks whether any makefiles were remade or deleted. Upon success it deletes configured intermediate files, possibly outputs the vars/rules/etc for debugging, releases/resets the jobserver, possibly repopulates argv, restores original working directory, outputs a message that it’s reexecuting, recomputes envvars handling Amiga & VMS specially, defines $MAKE_RESTARTS, fflushes stdout & stderr, has the jobserver fork a new child, & OS-specifically executes the computed command.

That process only runs when the parser indicates the makefiles (a lot of work to go through to support arguably bad coding!) might rewrite themselves, normally computing & evaluating the partial ordering is essentially the last task performed.


To compute & evaluate a partial ordering of goals it copies the linkedlist of goals, then repeatedly iterates over it until no more goals are left to be evaluated.

Before each iteration over the goals it reaps all dead children, both during & after forking & executing each linkedlist-queued job, with plenty of OS-specific bookkeeping & error diagnosis. After each iteration it might increment a counter. After all iterations it might update some globals.

In the core logic for each goal make iterates over the files it updates/generates before possibly outputting status information and/or removing the goal from the todolist. For each file it copies the don’t-care bitflag for the goal over to the file, checks whether the file’s being renamed by following in-RAM indirection, if rebuilding makefiles updates some globals based on a flag on the file, saves the value of commands_started, considers running the commands to generate the file, checks whether that has renamed the file following any new indirections, flags the goal as having caused a change if new subprocesses were spawned, diagnoses failures, aggregates file updated flags, unsets don’t-care flag, & considers exiting the loop.

Diagnosing failures involves checking flags & previous failures to see if there was one. If the file has an update-status it saves it for the next iteration & checks some flags to decide whether to exit after this iteration; otherwise it checks the file’s timestamp &, if things have changed, checks some bitflags to determine whether to clear previous failures or end after this iteration.

To update a file, after removing duplication within each iteration over the goals, make iterates over the files generated from this file checking whether we care about each & whether it has or is running a command to generate it; if not it flags the file as being updated, checks modified time & renaming, iterates over the also-make linkedlist & their dependencies, then possibly iterates over its dependencies which have “intermediates”, unflags the files as being updated, checks whether commands are running & updates flagged state appropriately, if not checks whether the command has failed & whether any of the files have been updated, if not checks whether dependencies were running & updates flagged state having not been started, iterates over dependencies again, decrements depth, diagnoses syntactic errors, checks whether we actually need to remake this file, checks the filename to determine whether to set the file’s ignore_vpath flag, if there’s commands attached splits them into lines & spawns & starts a new job for each (possibly via the jobserver) otherwise synthesizing success/failure status, updates info on the newly-generated file, outputs debugging info, & flags the file as updated.

For each also-makes’ dependencies make checks whether the file’s been renamed & its modified-time, skips files which are currently flagged as being updated, if rebuilding makefiles checks don’t-care flags, recursively checks whether dependencies need to be updated, rechecks a couple things, checks if those files are currently running, considers exiting this loop depending on the dependency’s status & make’s own commandline flags, & updates the dependency’s changed flag.

For each dependency with an intermediate, if it iterates over them at all, make performs a cut-down version of the previous loop.

And later during diagnosis, for each dependency make checks the file’s modified-time & renaming, possibly or’s together whether dependencies have changed, updates whether this dependency has changed, & reports any failures to stdout.
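
Strip away the renaming, intermediates, & flag juggling and what remains is the familiar timestamp comparison; a minimal sketch, not make’s code, which also layers VPATH search, phony targets, & order-only deps on top:

    /* Minimal "does this target need remaking?" test; real make layers VPATH
       search, phony targets, order-only deps, & many flags on top of this. */
    #include <stdbool.h>
    #include <sys/stat.h>

    static bool needs_remake(const char *target, const char *const deps[], int ndeps)
    {
        struct stat st;
        if (stat(target, &st) != 0)
            return true;                    /* output missing: rebuild */
        for (int i = 0; i < ndeps; i++) {
            struct stat dst;
            if (stat(deps[i], &dst) != 0)
                return true;                /* dependency missing: it must be (re)built first */
            if (dst.st_mtime > st.st_mtime)
                return true;                /* dependency is newer than the target */
        }
        return false;                       /* everything up to date */
    }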

Patch

Once you’ve had diff compute the changes between 2 files/directories it can be useful (e.g. to lightweight-fork a project, I’ve seen that a few times) to interpret the resulting difffiles thus reapplying the previous changes, even if we didn’t ask diff to spit out an edscript. For that we have the patch command.

After initializing the program name, time, stderr buffering, manual buffer, envvar config, backup hashtable, & linkedlists of files to delete or output, patch parses commandline flags then captures remaining args, maybe sets its TZ envvar to UTC, maybe configures the backup “version” via a helper library (variants of which are vendored into most GNU projects primarily for portability), initializes & maybe opens the output target, configures various signals to be ignored or trigger a crash, decides whether to use unsafe mode, decides whether to take a copy fastpath, iterates over each given difffile, outputs any errors, & cleans up.


There’s an iterator implemented over the given difffiles which carefully opens each one & sniffs which format it uses.

For each newly-opened difffile, resetting state between iterations, patch decides whether to flag failures, specially handles an edgecase regarding gitdiffs, considers unlinking a couple tempfiles, bails upon wanting to change filetype, attempts to recognize concatenated gitdiffs, considers carefully reading the file being modified into a buffer, validates the output file isn’t readonly, generates a tempfile reporting failures or setting a cleanup flag, hands edscripts off to a newly-spawned $ED reporting errors, otherwise validates it’s not a binary file, ensures there is an output, considers splitting input into buffered lines validating renaming & symlinks, processes each “hunk” of input, validates & flushes results, blocks signals during cleanup, flushes output possibly tidying up after failures, & moves the new file into place.

For each line-based parsed “hunk” patch computes max fuzzy-matching, repeatedly tries to locate the best match for where to start applying this chunk of the diff possibly with debugging messages ignoring surrounding whitespace, considers closing the output file, & interprets the hunk via merge, apply, or abort codepaths with error reporting.
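
That “best match” search roughly amounts to sliding the hunk up & down the file while progressively relaxing how much context must match; a sketch with a caller-supplied matcher, not patch’s actual routines:

    /* Sketch of fuzzy hunk placement: try the stated line first, then nearby
       offsets in both directions, relaxing the required context ("fuzz") each
       round. The matcher is caller-supplied; this is not patch's real code. */
    #include <stdbool.h>

    static long locate_hunk(long stated_line, long max_offset, int max_fuzz,
                            bool (*context_matches)(long where, int fuzz))
    {
        for (int fuzz = 0; fuzz <= max_fuzz; fuzz++)
            for (long offset = 0; offset <= max_offset; offset++) {
                if (context_matches(stated_line + offset, fuzz))
                    return stated_line + offset;
                if (offset && stated_line > offset &&
                    context_matches(stated_line - offset, fuzz))
                    return stated_line - offset;
            }
        return -1;                          /* no acceptable placement: the hunk fails */
    }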


To merge a hunk into output it skips = lines to the first valid start line with the aid of a dynamic longest-common-substring algorithm, tweaks the old input line, computes diffs via that common & vendored lib, possibly outputs per-line debugging info, copies data from input to output, & before final output & cleanup repeatedly checks the initial character of each line. On conflict it adjusts iterators to locate the source, outputs common prefix lines, determines whether to ignore the hunk as having already been applied (possibly decrementing iters by it), copies output, & maybe outputs the conflict in the syntax you’re familiar with from git.

For ‘-’ old lines corresponding to ‘+’ new lines it skips over subsequent old ‘-’ lines whilst checking for conflicts, skips over new ‘+’ lines, counts then outputs those lines, & updates counters.

For ‘ ’ old lines followed by ‘-’ it skips subsequent cases, reporting conflicts upon new ‘+’. For ‘ ’ old lines followed by ‘+’ it skips subsequent ‘+’ lines & copies them to output. Otherwise it copies runs of ‘ ’ lines to output.

Otherwise it validates it’s the end of the merge.


Applying a hunk involves skipping over ‘=’ lines, interpreting each line, considering flushing output possibly followed by conditional preprocessing & maybe subsequent plus chars, considering other conditional processing, & updating counters. For old ‘-’ lines it copies output until that point, possibly outputs conditional preprocessing with error handling & state adjustment, & increments counters. For lines after the end it exits. For new ‘+’ lines it behaves similarly to ‘-’.

If old & new chars are otherwise unequal it complains. If the new line’s ‘!’ it’s treated similarly to ‘+’ or ‘-’ but with additional skiploops over ‘!’ lines both before & after the second conditional preprocessing output. Otherwise it increments counts & maybe outputs that conditional preprocessing.

Aborting a hunk creates a tempfile & outputs parsed info into it differently for unified vs context syntaxes.

Most of the code seems to be in lexing diff & target files. And parsing those diff files.

Bison

So much of UNIX’s tooling is domain-specific languages to parse, and to make those easier to write in C they (eventually) created YACC, which GNU rewrote as Bison. Arguably all commandline programs are domain-specific languages! It is even used by Gettext (where in any other language I would’ve written a Shunting Yard or Pratt parser…), and most languages have equivalents! Even if that equivalent is Bison itself!


The first thing Bison does is retrieve $BISON_PROGRAM_NAME if present into argv[0] & a global, followed by initializing internationalization, quoting as per $LC_CTYPE, on-exit cleanup, glyphicons table (preferably Unicode), deduplicating hashmap for strings, “muscles” table + obstack allocation, line truncation data & tables for how seriously to treat different kinds of errors, & obstack/flag for lexing.

Then it parses extensive commandline flags, preprocessing to extract colour options first.


Bison’s extensive commandline parsing integrates the internal “complaints” infrastructure with special line-numbering handling, populating the tables I described earlier amongst other globals. Afterwards it validates there’s a single extra commandline arg, which it deduplicates & stores in a global & the “file_name” interpreter-var. As always, –help & –version are handled specially.

With extensive & optional time-profiling & some profiling bitflags it parses the input file exiting upon fatal error, garbage collects the AST, computes the derives relation & nullability from it, converts it into a statemachine, computes lookahead sets, detects conflicts & complains about them, & compiles the output code! I’ll dig into the C/etc output more later…


Compilation is done by allocating arrays for precomputed counts (populated with the aid of some temp arrays bucketsorting actions), outputting additional warnings, & if desired:

  1. Computes output filenames with appropriate extensions
  2. Outputs a file reporting rules marked as useless, states marked as conflicting, all useful grammar rules, terminals & all rules referencing them, the same for non-terminals, & the state machine
  3. Iterates over each state generating a nontrivial label & outputting a graphviz node followed by corresponding edges with reductions in red, all surrounded by a graphviz header & footer. A separate internal module abstracts graphviz
  4. Outputs the grammar & statemachine as XML without an XML serializer
  5. For HTML it runs that XML file through a bundled XSLT file via a configurable XSLT processor!
  6. Unconditionally (as long as there weren’t any errors) frees lookahead tokens.
  7. Conditionally again outputs the statemachine to C in multiple passes deferring to M4 scripts I’ll describe later.

From there it frees all the various data it has allocated whilst applying minor fixes keeping a backup file.

Parsing Bison Files

As a language in its own right before it can compile a parser Bison must parse its own syntax!

To do so it starts by temporarily allocating a symbols hashmap (keys are deduplicated in their own hashset which speeds up hashing for this hashmap!) prepopulated with “$accept”, “YYerror”, “error”, “YYUNDEF”, & “$undefined” symbols. Followed by a semantic type hashtable also keyed by uniqstrs.

Then initializes lexer/scanner with its underlying file & obstack, then parser with its own obstack!


Bison’s parser is self-hosted with atypically few tweaks most of which are involved in AST construction or error reporting. It defers to a Flex program for lexing utilizing C macros for line/col counts & string capturing. Each directive gets its own token whilst utilizing Flex’s statemachine.

This Abstract Syntax Tree is quite flat, beyond the symbols hashtables it populates. The lexer handles deduplicating strings. Lexer & underlying file/obstack are closed after parsing.


The parser also populates the “muscle” tables, & afterwards default values are filled in for the “lr.keep-unreachable-state”, “lr.type”, “lr.default_reduction” (based on the retrieved value of “lr.type”), & “tool.xsltproc” muscles. Then it validates various core muscles are given a valid keyword value.

If there weren’t any complaints it’ll validate syntax rules were actually defined, generate an EOF token if missing, locate the start symbol if unspecified & gather a list of startrules, iterate over symboltables in sorted order checking everything is defined, validate the token count, resolve & remove aliases, validate startsymbols are defined & not tokens, run a C Flex lexer to handle variable renaming, merge AST rules to generate the rules & ritems arrays with name resolution & additional checks, & free the raw AST.

Bison Optimization

Once Bison has parsed a grammar it needs to be optimized before it can be lowered & compiled to C.

The 1st step is deduplication utilizing 4 bitsets, involving computing the closure of all rules which match an empty string followed by another closure (with a little postprocessing) indicating which rules are actually used. a.k.a. garbage collection! If either of those detected any changes to make it'll complain about it before removing those rules keeping indirect references functional.


The next step does some lowering with 2 temp arrays in 2 loops & a segmented newly-allocated global array for an output. The first loop locates infix operators populating array of linkedlists. The second loop populates the global derives array flattening those linkedlists.

A closure is computed via 3 temparrays & a temp queue for all rules that can match an empty string in 2 passes. The first detects all direct cases, the 2nd detects indirect cases.
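
That empty-string (“nullable”) computation is a classic; Bison’s version uses temp arrays & a queue in two passes as described, but the naive fixpoint below, over a hypothetical grammar encoding, computes the same set:

    /* Naive nullable-set fixpoint over a hypothetical grammar encoding:
       symbol ids below nterms are nonterminals, the rest are tokens.
       Bison itself uses a work queue & two passes, but the result is the same. */
    #include <stdbool.h>
    #include <string.h>

    struct rule { int lhs; const int *rhs; int rhs_len; };

    static void compute_nullable(const struct rule *rules, int nrules,
                                 int nterms, bool nullable[])
    {
        memset(nullable, 0, nterms * sizeof nullable[0]);
        for (bool changed = true; changed; ) {      /* iterate to a fixed point */
            changed = false;
            for (int r = 0; r < nrules; r++) {
                if (nullable[rules[r].lhs])
                    continue;
                bool all_nullable = true;           /* empty righthand sides qualify trivially */
                for (int i = 0; i < rules[r].rhs_len; i++) {
                    int sym = rules[r].rhs[i];
                    if (sym >= nterms || !nullable[sym]) { all_nullable = false; break; }
                }
                if (all_nullable) { nullable[rules[r].lhs] = true; changed = true; }
            }
        }
    }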

Bison Lowering

Once Bison has parsed & simplified a grammar it needs to be lowered into a statemachine! Though to be honest I failed to comprehend this code well…

The initial pass involves iterating over all rules to generate an initial state, before iterating over states as they are added: for each it computes closures via bitmasks, saves reduction rules as an array in the state, locates outgoing edges (with or without debug output) before generating/retrieving corresponding states from a hashmap (enqueued for processing if new), & saves transitions to the state.


The second phase of lowering deals with issues of lookahead!

The 1st subphase of which is handled differently depending on how “lr.type” is configured. In simple cases it iterates over states & their edges, symbols twice, & states & their edges again to compute a consolidated gotomap. Let’s skip over more complex cases…

The 2nd subphase uses bitmasks to compute a set of lookahead states, filters out “follow” edges amongst those, then edges to nodes with the greatest number of edges amongst those. After freeing the lookaheadset it processes the follow edges some more, frees yet more bitsets, & maybe gathers a predecessors set in 4 iterations over the states.

In the 3rd subphase Bison might allocate & populate metadata tables/”annotation lists” from various tables it previously computed followed by some counting, debugging output, & deallocation.

In the 4th subphase it uses bitsets to detect duplication in the graph.

And the final subphase resembles the first!


The 3rd phase of lowering deals with conflicts in the graph consulting configured operator precedence with the aid of some newly-allocated bitmasks & obstacks. For each state (with some handling tweaks) it gathers a lookahead set then iterates over all symbols requiring a lookahead before cleaning up & reporting errors involving a 2nd iteration.

For each such symbol it iterates over all other symbols checking their precedence & tweaking graph actions appropriately with debugging output.
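
The comparison itself is the standard yacc-style tie-break between a rule’s precedence & the lookahead token’s; a sketch, not Bison’s exact code:

    /* Standard yacc-style shift/reduce tie-break via precedence &
       associativity declarations; a sketch, not Bison's exact code. */
    enum assoc  { LEFT_ASSOC, RIGHT_ASSOC, NON_ASSOC };
    enum action { SHIFT, REDUCE, ERROR_ACTION };

    static enum action resolve(int token_prec, enum assoc token_assoc, int rule_prec)
    {
        if (token_prec == 0 || rule_prec == 0)
            return SHIFT;                   /* no declared precedence: report the conflict, default to shift */
        if (token_prec > rule_prec)
            return SHIFT;                   /* the lookahead binds tighter */
        if (token_prec < rule_prec)
            return REDUCE;                  /* the rule binds tighter */
        switch (token_assoc) {              /* equal precedence: associativity decides */
        case LEFT_ASSOC:  return REDUCE;    /* %left: a-b-c groups as (a-b)-c */
        case RIGHT_ASSOC: return SHIFT;     /* %right: a=b=c groups as a=(b=c) */
        default:          return ERROR_ACTION;  /* %nonassoc: chaining is an error */
        }
    }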

After resolving conflicts it might (this pass can be disabled…) recursively gather a bitset of all reachable states so it can iterate over all the states & free unreachable ones. Then it iterates over the edges followed by terminals to rename the state IDs to be more contiguous again. Then over the states to update the conflicts table.

If any conflicts still remain appropriate complaints will be written out possibly with clarifying counterexamples.

Bison Compilation

Having described how Bison parses & optimizes its input code, as well as lowers the grammar to a statemachine, today: how it compiles to lookuptables then C! Or, say, Java.

Generating those tables involves allocating all of them, clearing useful flags on all rules, processing all states whilst updating those useful flags utilizing 4 temparrays, processing each tokensymbol to populate the yydefgoto table with another temparray, freeing remaining temparrays, bucketsorting actions, & cleaning up.


In that first loop, for each state it zeroes out the actrow & conflrow arrays, iterates over all referenced symbols requiring a lookahead to populate those temparrays, records which symbols trigger “shift” actions into actrow, records explicit actions into actrow, checks the lr.default_reduction muscle, & fills in error actions.

Then non-zero actions are counted & if there are any it allocs froms, tos, & maybe conflict_tos arrays for the state & populates from the actrow temparray.

The iteration over tokens similarly populates froms/tos arrays essentially the previously computed matrix.


Once it has computed the parsingtables it lazily de-lowers some of the code it is compiling to complain about symbols whose precedence is flagged as useless.

Then it outputs the specified files (as described previously) which for C involves subphases outputting symbols, rules, states, actions, symbol definitions, & (after some preparation) a skeleton file. Then it tidies up in the event of error. All with an obstack allocated for formatting.


For symbols Bison generates several muscles (tokens_number, nterms_number, symbols_number, & code_max) including an array (translate), validates tname & symbol_names, maybe generates a translatable muscle flagging every translatable symbol if there are any, & writes each token’s “content code” into a toknum muscle.

For rules it allocates several temparrays, iterates over rules to flatten them into those arrays, & turns them into muscles (rhs, prhs, rline, r1, r2, dprec, merger, & immediate).

To output states Bison extracts each’s accessing_symbol property to output into the stos muscle, followed by storing last, final_state_number, & states_number muscles.

For actions Bison converts the tables generated earlier (as described this morning) into “muscles” (defact, defgoto, pact, pact_ninf, pgoto, table, table_ninf, check, conflict_list_head, & conflicting_rules).
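
For orientation, the generated parsers consult those pact/table/check muscles with a comb-compressed lookup along these lines (a simplified sketch; the real skeletons also handle default reductions & GLR conflict lists):

    /* Roughly how generated parsers consult the comb-compressed action tables;
       simplified, ignoring default reductions & GLR conflict lists. */
    static int lookup_action(const short pact[], const short table[],
                             const short check[], int last, int ninf,
                             int state, int token)
    {
        int base = pact[state];
        if (base == ninf)
            return 0;                       /* this state only has its default action */
        int idx = base + token;
        if (idx < 0 || idx > last || check[idx] != token)
            return 0;                       /* the row doesn't cover this token: use the default */
        return table[idx];                  /* positive entries typically encode shifts, negative ones reductions */
    }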

For symbol definitions it generates a separate muscle for each token then several per symbol & their properties.

The final preparation involves straightforwardly adding several more muscles! (version_string, version, required_version, header_flag, flr_flag, nondeterministic_flag, synclines_flag, token_table_flag, use_push_for_pull_flag [from corresponding envvar], yacc_flag, maybe prefix, file_name_all_but_ext, dir_prefix, mapped_dir_prefix, parser_file_name, spec_header_file, spec_mapped_header_file, spec_file_prefix, spec_name_prefix, spec_outfile, spec_verbose_file, maybe skeleton, & skeletonsdir).


With all the muscles generated Bison then locates all the appropriate .m4 files, invokes M4, outputs various data into M4’s stdin including most importantly all these “muscles” it read & generated, & parses the result to lightly reformat it into the resulting output file!

Don’t ask me about these M4 files though, I’m finding M4 to be quite an illegible language…

“Flex” Lexer Generator

Several programs (documented on this page here) are essentially interpreters for domain-specific programming languages. If you want to be uselessly vague you could reasonably suggest this describes every program, especially on the commandline!

However most of these programs are written in C or C++, which is inconvenient & unsafe for this. So we use a pair of DSLs!

In parsing nontrivial text it’s useful to preprocess input abstracting it into a sequence of “tokens”.


After possibly initializing internationalization, to aid that abstraction Flex configures an exit path, initializes various globals & parses commandline args into them whilst loading new macros or prepending syntax. Then it runs its self-hosted parser & starts outputting code before filling in back edges from forward edges & elsewhere, interprets a “skeleton” file, converts the NFA to a DFA, outputs any errors, generates a state transition table, & outputs final errors & code comments.

Parsing Flex files

As a domain-specific language itself Flex requires its own lexer & parser, which are implemented using itself. I'm reading through these today to see if there's anything else worth remarking on.

The lexer manipulates various globals, most notably an “action” stringbuffer of M4-preprocessed C code as it spits out tokens.

Plenty of lowering is performed during parsing, constructing NFA graphs with simplified equivalences & dispatch “character class” smallintmaps.


As it finishes parsing a rule Flex generates some dispatchcode into that “action” stringbuffer of M4-preprocessed C code to precede what the lexer is about to append there.

It also resolves any “symbols” through a hashmap.

Both the lexer & parser put effort into emitting nice error messages.

That NFA is a labelled graph where input text identifies paths (plural!) to traverse, with unlabeled edges part of any traversal. Can be constructed from charclass tables.
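
Pictured as C data, such an NFA node might look like this (a sketch of the concept, not Flex’s actual representation):

    /* Sketch of a labelled NFA node; a picture of the concept, not Flex's
       actual representation. A state labelled EPSILON is traversed without
       consuming input & may carry a second outgoing edge. */
    #define EPSILON 0

    struct nfa_state {
        int id;                     /* this state's number */
        int label;                  /* character-class id to match, or EPSILON */
        struct nfa_state *out1;     /* primary outgoing edge */
        struct nfa_state *out2;     /* secondary edge (epsilon states only) */
        int accepting_rule;         /* rule number accepted here, 0 if none */
    };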

Lowering NFAs to DFAs

Once a Flex file has been parsed into prelude M4-preprocessed C code, an NFA (labelled graph), & char dispatch tables, to convert that into C code Flex must lower it to dispatch tables via a DFA.

To deduplicate edges whilst collapsing always-taken “epsilon” edges thus converting the NFA into a DFA… It starts by initializing some collections & possibly outputting the NFA for debugging, possibly with some buffered code representing state 0.


With that it enters a double iteration over “start states”, start-of-line tokens or not, merging each of their epsilon-closures into a single DFA state. Via a depth-first traversal over epsilon edges it checks whether we’ve already generated a DFA node for that NFA state, & failing that initializes a new NFA node into the NFA array, associating a sorted NFA-states mapping for later deduplication.

All such start DFA nodes might then be similarly merged.


Once it has tackled the start states it traverses the graph & their (partitioned) outgoing edges locating duplicate edges to merge & traversing any subsequent epsilon edges to yield all the NFA states to merge into a single DFA state.
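
The epsilon-closure at the heart of that merging is short to sketch; here over a hypothetical parallel-array NFA encoding, recursive where Flex iterates over its own arrays:

    /* Sketch of the depth-first epsilon-closure used when merging NFA states
       into one DFA state, over a hypothetical parallel-array NFA encoding.
       Sorting the resulting set gives the signature used to deduplicate DFA states. */
    #include <stdbool.h>
    #include <stddef.h>

    #define EPSILON 0

    struct nfa {                    /* per-state: edge label & up to 2 successors (0 = none) */
        const int *label;
        const int *out1, *out2;
    };

    static void epsilon_closure(const struct nfa *nfa, int state,
                                int *set, size_t *n, bool *seen)
    {
        if (state == 0 || seen[state])
            return;
        seen[state] = true;
        set[(*n)++] = state;                  /* member of the DFA state being built */
        if (nfa->label[state] == EPSILON) {   /* only free edges are followed here */
            epsilon_closure(nfa, nfa->out1[state], set, n, seen);
            epsilon_closure(nfa, nfa->out2[state], set, n, seen);
        }
    }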

Then it finishes by counting & maybe debug-outputting DFA states for “backing up”, finalizing some state, maybe relatively directly outputting a compressed datatable into the buffered M4-preprocessed C code, & freeing data.

Compiling DFAs

To compile that DFA into C code: amongst some more of the skeleton (which I think I'll cover tomorrow…) it outputs calculations & if-checks computing memory-sizing at appropriate indentations (assuming tabs are 8 chars wide), outputs state counts, declares a state variable, converts the DFA to runtime datastructures possibly with compression minimizing bitwidths with minor preprocessing, optionally outputs a linecounter, outputs backtracking vars & data, & possibly a few more vars.


From there, amongst some more of the skeleton, it starts outputting the main lexing loop, reading data via read or getc, before (amongst more of the skeleton) starting to output the controlflow consulting the tables with or without backtracking logic or compression. Then it generates code to branch over the final couple states from that loop to determine which token to emit. Then it adds trailing exception-handling code with or without debugging, including backtracking.
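
Minus the buffering, compression, & backtracking, the emitted inner loop boils down to a table walk like this (a generic sketch, not the exact code Flex writes out):

    /* Generic sketch of the table-driven inner loop a generated lexer runs;
       Flex's real output adds buffering, table compression, & proper
       backtracking to the last accepting position. */
    #include <stdio.h>

    #define JAM 0                   /* no transition: stop & decide which rule matched */

    static int next_token(FILE *in, const short (*nxt)[256], const short *accept)
    {
        int state = 1;              /* start state */
        int last_rule = 0;          /* most recent accepting rule seen */
        int c;
        while ((c = getc(in)) != EOF) {
            int next = nxt[state][(unsigned char)c];
            if (next == JAM) {
                ungetc(c, in);      /* give back the char that didn't fit */
                break;
            }
            state = next;
            if (accept[state])
                last_rule = accept[state];
        }
        return last_rule;           /* 0 means no rule matched */
    }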

The Flex Skeleton

To wrap up my studies of the Flex lexer generator, I'll describe the reference sourcecode which gets compiled into the Flex binary & intermingled with its compiled output.

This file starts with a lot of C preprocessor & M4 macros based on certain M4 (or its own preprocessor prior to compiling it into Flex) conditions. Later ones use M4 to prepend a standard prefix to the C macros.

Then it includes the relevant C & possibly C++ libraries & abstracts compiler attributes.


Then this “skeleton” file defines several more macros (M4 then C then M4), & additional C macros are defined if in the implementation, some of which abstract over the input.

After all those it defines a struct around the input state, several globals, & function declarations. A couple more bounds-check macros might be defined alongside conditional function signatures. It might include a few more libraries.

Then defines a stack-allocated struct for reentrant lexing, with some optional fields.


After yet more function signatures, globals, & macros… A struct & lookup-table is defined to decode lexer output.

Then comes the main function with a macro-specified signature. After defining its locals this does some initialization if needed, inserts Flex-compiled initializations, & enters a mainloop incorporating compiled per-iteration code. This mostly consists of a switch-statement state machine compiled from the Flex file, ending in an end-of-buffer state.


A C++ abstraction over this lexer might be outputted, followed by a routine (which may or may not be a method on that C++ abstraction) for proceeding to the next buffer from the configured files. As well as routines for querying the previous state (mostly Flex-compiled code), lexing nul chars (same), unlexing some chars, specifying a textual input buffer (with postprocessed line-tracking code compiled from the Flex file), immediately switching the lexer to given text or a file pushed onto a stack, initializing a buffer, freeing a buffer, flushing & pushing & popping buffer state, allocating a larger stack if necessary, parsing input of different types, reporting errors, & variations upon these.
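
For context, user code typically drives all of this through a handful of those entry points; e.g. with the classic non-reentrant API (linking against a generated lexer, which provides yylex, yytext, yyin, & the buffer calls):

    /* Typical driver for a (non-reentrant) Flex-generated scanner, exercising
       the buffer-switching entry points described above. Link against the
       generated lexer, which provides yylex, yytext, yyin, & the buffer calls. */
    #include <stdio.h>

    extern FILE *yyin;
    extern char *yytext;
    int yylex(void);
    struct yy_buffer_state *yy_scan_string(const char *str);
    void yy_delete_buffer(struct yy_buffer_state *buf);

    int main(void)
    {
        /* Lex from an in-memory string... */
        struct yy_buffer_state *buf = yy_scan_string("some input\n");
        for (int tok; (tok = yylex()) != 0; )
            printf("token %d: %s\n", tok, yytext);
        yy_delete_buffer(buf);

        /* ...or point yyin at a file & lex that instead. */
        yyin = fopen("input.txt", "r");
        if (yyin)
            while (yylex() != 0)
                ;
        return 0;
    }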


Then it redefines the yyless macro upon that scanner, & defines several accessor functions, a few of which include validation. Then a few initializers generated with M4 help. Followed by some loop, memory allocation, & file I/O abstractions. And finally there are routines for parsing the bundled YYtable data.