When interacting with computers via text (the "commandline"), it's useful to have some "domain-specific" languages (DSLs) that make it easier to process certain kinds of data. If you generalize too much, everything's a DSL, especially in this medium! This page documents how various of these DSLs work!
Text Expansion “M4”
M4 is a C-like preprocessor for text files - that is, it interprets "macros" within a text file & expands them into more verbose output. It's a simple program, and where most of the illegibility of GNU's build systems comes from! Personally I prefer Meson/Ninja.
After initializing internationalization, includes, debug logs, & interrupt exit handling (compile/run-time optionally testing them) it parses commandline flags via getopt_long.

After parsing commandline flags (D/U/s/t/--debugfile [linkedlist], E, F, G, H, I, L, P, Q, R, W, d, e/i, g, l, o, --warn-macro-sequence, --version, & --help options; deprecated B/S/T, N/--diversions) it continues initialization for: the configured debug logfile, input parsing, output serialization, the intermediate symbol table, macrosequence warnings via regex, the include environment from $M4PATH parsed to a linkedlist, & either builtin or pre-saved (micro-optimized parser) libs.
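To illustrate the shape of that flag parsing, here's a minimal sketch of the getopt_long pattern M4 (like most GNU tools) relies on. The option table shown is hypothetical, not M4's actual one.

```c
#include <getopt.h>
#include <stdio.h>

int main(int argc, char **argv) {
  /* Hypothetical long-option table; M4's real one is much longer. */
  static const struct option longopts[] = {
    {"debug",   no_argument,       NULL, 'd'},
    {"include", required_argument, NULL, 'I'},
    {"version", no_argument,       NULL, 'V'},
    {0, 0, 0, 0}
  };

  int opt;
  while ((opt = getopt_long(argc, argv, "dI:V", longopts, NULL)) != -1) {
    switch (opt) {
    case 'd': puts("debugging enabled"); break;
    case 'I': printf("include path: %s\n", optarg); break;
    case 'V': puts("version 0.0"); return 0;
    default:  fprintf(stderr, "usage: %s [-d] [-I dir] file...\n", argv[0]); return 1;
    }
  }
  /* Remaining argv[optind..] entries are the input files. */
  for (int i = optind; i < argc; i++)
    printf("input file: %s\n", argv[i]);
  return 0;
}
```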
M4's standard lib is a set of simple-to-trivial bindings to GNU LibC & in turn syscalls. Nothing worth commenting on. They're loaded in at compiletime via a static lookuptable to be copied over into the symboltable.
Minor adjustments are made for "interactive mode": ignoring SIGINT & disabling the stdout buffer.
Any define-related commandline flags are deferred to this point, to be loaded into the symboltable's closed hashmap, with macrosequence warnings where relevant.
After, & possibly during, reparsing those define commandline flags it looks up the given file in the current dir & the parsed $M4PATH (alternatively "-" means stdin) to push onto an obstack stack & expand. Afterwards it repeatedly pops & expands the wrapup_stack. Then it cleans up whilst either serializing the definitions, diversions, & config OR outputting any text from "diversions".
To expand, it lexes then expands each token with the help of two obstacks.

To lex a token it consults a scanner & some obstacks to check for certain characters & a configurable regex. Expanding it may involve directly writing the text to stdout via custom-buffered, specially-microoptimized write routines which may insert debugging info.
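For those unfamiliar with GNU obstacks, here's a minimal sketch of the grow-then-finish idiom (using the standard <obstack.h> API); it isn't M4's actual lexer buffer, just the pattern it leans on.

```c
#include <obstack.h>
#include <stdio.h>
#include <stdlib.h>

#define obstack_chunk_alloc malloc
#define obstack_chunk_free  free

int main(void) {
  struct obstack tokens;
  obstack_init(&tokens);

  /* Grow the current object character-by-character, as a lexer would. */
  const char *input = "dnl";
  for (const char *p = input; *p; p++)
    obstack_1grow(&tokens, *p);
  obstack_1grow(&tokens, '\0');

  /* "Finish" the object to get a stable pointer into the obstack. */
  char *token = obstack_finish(&tokens);
  printf("lexed token: %s\n", token);

  obstack_free(&tokens, NULL);  /* frees everything in the obstack at once */
  return 0;
}
```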
Or it may involve looking up the named macro in the symbol table for text to output in the same way, or expand. Expanding such a macro involves updating counts, validating state, optionally outputting "trace" debugging, turning any arguments into additional macros whilst evaluating them (triggering an edgecase which is handled above), "calling" the macro surrounded by tracing debug output & with a string pushed to the scanner, & cleaning up by decrementing counters & freeing obstacks.

Calling the macro involves either calling the callback function (for builtins) or preprocessing the macrobody to get rid of argument references. The tokenize-eval loop handles the rest.
Bash
For computers to run the programs we tell them to, we need another program to figure out which program we told it to run. On the commandline we use "shells" like Bash for this.
Bash Builtins
Though several of those programs are, for various reasons, not actually separate programs. Usually because they modify or query shell-internal state, or the state of that process in-kernel. All the commands I'm listing add minor text parsing, binding internal functions & syscalls to a textual language.
Bash tracks the internal state of the "processes" you have it run (I might dedicate a day to this topic…), and several builtin commands wrap that infrastructure, including:
- `wait` exposes the internal `wait_for_job`, to block until said process exits.
- `kill` exposes `kill_pid` to send the specified interrupt.
- `jobs` prints this state.
- `disown` removes entries.
- `fg` & `bg` expose the internal `start_job` with corresponding flags, to resume execution whilst, for `fg`, waiting on it.
To alter the in-kernel state there’s:
- `umask` & `ulimit` expose corresponding syscalls.
- `trap` indirectly exposes the `sigaction` syscall.
- `times` exposes `getrusage`.
- `exit` & `logout` possibly validate there aren't jobs to kill, execute the ~/.bash_logout shellscript, & end Bash.
- `exec` indirectly exposes `execve`.
- `cd` exposes `chdir`, `cdxattr`, or $PWD with decent preprocessing (looking up envvars or maybe even interpreting the arg via $CDPATH).
To mess with BASH’s command lookup infrastructure there’s:
- `type` wraps various internal lookup functions (including semi-manually finding all matches in $PATH) to output whether the command(s) are aliases, reserved words, functions, builtins, or diskfiles; and whether the filepath's cached in a hashtable.
- `inlib` registers a (lazily-loaded) dynamically-loaded library for fallback lookups, via the internal `loader_$inlib`.
- `help` outputs looked-up builtins' help properties, or all of them.
- `hash` queries or updates the hashtable cache of commandnames to filepaths.
- `command` only consults filesystem lookup when running its arguments as a command; possibly describing what it's running.
- `builtin` similarly only consults the internals table when running its args.
- `alias` exposes `add_alias` or `print_alias` to register or display command aliases.
- `unalias` exposes `remove_alias` to unregister command aliases.
- `enable` & `disable` toggle these builtins on & off via LibDL & manipulating the array directly.
Bash bundles an infix expression interpreter from another project, exposed via the builtins:

- `test` & `[`
- `let` evaluates multiple expressions.
There's builtins for running shell commands:

- `source` & `.` expose the internal `source_file` function, with temp-saved state, to evaluate the shell command in the same process.
- `eval` concatenates its args & `evalstring`s it.
Bash includes control flow syntax, which can be controlled via the builtins:

- `break` & `continue` set internal globals for the interpreter to consult.
Bash supports functions in its shell syntax, and that comes with an internal calling convention, exposed via the builtins:

- `shift` wraps the internal `shift_args` or `clear_dollar_vars`, moving everything over in the internal array & linked list (it has both!) n times.
- `return` throws a corresponding "exception".
- `caller` outputs the top n frames of the callstack.
- Bash reimplements `getopt` itself to operate on a linkedlist rather than an array, for all these builtins to use. The `getopts` builtin exposes this to shellscripts, storing its output in shell variables.
Bash has a concept of "shell variables", manipulable via:

- `export` runs the shellvar set syntax flagged to replicate into the kernel's, or LibC's, envvars.
- `readonly` runs the shellvar set syntax flagged to block further writes.
- Said syntax, after preprocessing, defers to `assignment` & `declare_builtin`. Or outputs all shellvars.
- `declare`, `typeset`, & `local` (which is the `declare_builtin` mentioned above) wrap `assignment`, `make_local_variable`, `find_global_variable`, `declare_find_variable`, & `bind_assoc/array_variable` with numerous checks. And allow specifying bitflags to apply to the shellvar. Or output all shellvars.
- `[un]set` & `shopt` defer to the specified internal accessor functions.
There's builtins for directly operating Bash's I/O:

- `echo` `putchar`s or `printf`s each arg to stdout.
- `mapfile` wraps repeated & postprocessed `zgetline` to populate an array shellvar, possibly running a given callback command every n lines.
- `read` wraps `readline` or `zread[int][n,c]` with extensive temp-saved state & possibly a `SIGALRM` timeout, saving into the specified shellvar.
- `printf` reimplements GNU LibC's to use Bash's datamodel, possibly saving into a specified shellvar.
Bash maintains a history of commands it has run, in-memory & ondisk, which can be manipulated via:

- `history` exposes `check_add_history`, `history_expand` + `fputs`, `bash_delete_history_range`, `history_set_pos`, or `history_list` to output each returned entry. Followed by `maybe_append_history`, `write_history`, `read_history`, or `read_history_range` depending on flags.
- `fc` wraps `history_list` & `parse_and_execute` to evaluate a specified history entry, altering its own hist record.
Some builtins manipulate ReadLine's (later topic) behaviour:

- `bind`, with temp-saved state, wraps `rl_named_function` & related APIs to alter the behaviour of certain keys; most of the logic is in ReadLine.
- `complete` reads, removes, or adds to the array of autocompletion rules for the command; thus controlling the behaviour of the tab key.
- `compgen` wraps `gen_compspec_completions` & `bash_default_completion` postprocessed via `rl_completion_matches`.
- `compopt` prints autocompletion rules for specified or all commands.
Other builtins include:

- `:` & `true` return 0. `false` returns 1. Yes, they're builtins! No sense spinning up a new process for them…
- `pushd` & `popd` maintain an array stack of dirpaths to `cd` between. `dirs` outputs that array.
- `suspend` triggers its own `SIGSTOP` interrupt, registering an action to continue upon `SIGCONT` (sketched below).
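A minimal sketch of that suspend pattern in C (not Bash's actual code): install a SIGCONT handler, then stop the current process until something sends it SIGCONT.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_cont(int sig) {
  (void)sig;
  /* async-signal-safe "we're back" notification */
  write(STDOUT_FILENO, "resumed\n", 8);
}

int main(void) {
  struct sigaction sa = {0};
  sa.sa_handler = on_cont;
  sigaction(SIGCONT, &sa, NULL);   /* register the continue action */

  printf("stopping; run `kill -CONT %d` to resume\n", (int)getpid());
  fflush(stdout);
  kill(getpid(), SIGSTOP);          /* suspend ourselves */

  puts("exiting normally");
  return 0;
}
```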
Bash Initialization
Upon startup Bash initializes interrupts, debug tracing, the extended memory allocator, /dev/tty, Cygwin /temp/, freezes if debugging a login shell, internationalization, user/group IDs, checks $POSIXLY_CORRECT or $POSIX_PEDANTIC, validates memory allocation, & reconfigures interrupts.

Then Bash parses configuration from the commandline arguments (long opts first) & the current time, whilst ensuring collections have been cleared. Possibly dropping setUID privileges.
Based on the result it’ll run interactive or non-interactive mode.
Login shells close stdin/stdout/stderr on shutdown. POSIXly correct shells bind a POSIXLY_CORRECT envvar & shellvar. Any `shopt` options from the commandline are applied.

Buffering is enabled on stderr/stdout. The builtins table is sorted.

Interrupts are reconfigured again. The hostname is set. User info is retrieved. An expansion table for resolving shorthand (`~`) path syntax is initialized. It checks whether we're in a restricted shell. Shellvars are initialized into a hashmap from envvars (handling some specially), then from other sources. It configures the kernel to aid switching between different background processes. The parser is initialized. It saves shell flags & evals the appropriate builtin.
Copies locale envvars to shellvars.
Interactive shells have some additional initialization.
Then it configures a savepoint, possibly sets certain shellvars OR the -i flag, checks whether the shell's restricted again, loads additional commandline args into the appropriate list, possibly runs startup shellscripts, & initializes caches for loading commands & shellvar exceptions.

Then it evals from the appropriate input, possibly a debugger, then tidies ALL this up.
Interpreting Bash Control Flow
Bash is primarily an interpreter over a text-centric DSL intended for live interaction. It has control flow constructs, but the focus is on running other programs.
The different options for where Bash reads its input from include: the commandline args (via `parse_and_execute`), a given filepath (resolved via $PATH), or readline.

The parser, after configuring exception recovery, configures the buffered parser input, saves aliases if we're expanding them, gathers/preprocesses input text, clears the shell input line, & repeatedly parses `COMMAND*` via a parser implemented in Bison, directly reading from `read[line]`.
To evaluate the abstract syntax tree Bash parsed with the aid of Bison, once it's been validated it runs any actions attached to interrupts flagged as having occurred, tweaks flags & considers running in a "subshell" (again with new tempstate) or "coprocess" (same but in a new process) instead, or with time measurements, adjusts pipeline filedescriptors, possibly throws an exception, checks trapped interrupts, and configures an exception handler before branching on the AST node type & cleaning up.
Upon seeing a simple command the control flow interpreter handles linecounting, traps/debugging, optionally forking, "expansion", empty commands, & exceptions, then runs it as described later!
`for` validates/preprocesses its inputs, catches any exceptions, possibly tweaks flags, & uses substitution logic (tomorrow's topic) to get a list of words to iterate over, repeatedly executing its body with a newly defined shellvar & debugging handling.

Arithmetic For is a variant that evals infix exprs twice per iteration, like C syntax.
`select` repeatedly prompts the user to select from a list of options by index, checking the break/continue flags to determine when to end the loop.
`case` iterates over each of its clauses and, for any which has a pattern (of possibly multiple) which matches, evaluates that command. Followed by any flagged for fallthrough.
`while` or `until` executes its test command to determine, based on its errorcode, whether to stop before evaluating the body.
`if` determines which of its bodies to run based on the errorcode of its test.
`group` recurses into the main interpreter function with appropriate flags set.
`connection` handles postfix `&` (recurses with appropriate flags once or twice), `;` (recurses twice, the second time in a new process), `|` (recurses with appropriately adjusted `stdin`/`stdout`), & `&&`/`||` (if creating a subshell, decides whether to eval the 2nd based on the errorcode of the first).
Double parens interpret the given maths.

Double square brackets evaluate a boolean expression, possibly incorporating maths or regular expressions, after preprocessing.

The function syntax registers a new function in the variables table (later topic), provided a readonly one doesn't already exist under that name. Debugging builds include typechecking. Arithmetic, conditionals, & functions share some pre/post-processing.
Throughout this whole interpretation process it frequently checks for exit conditions.
argv Expansion

In Bash "simple commands" (a linkedlist of "words", the first being the command to execute - i.e. normal shell commands) there's various shorthand syntax you can use, which gets expanded in a preprocessing pipeline.
Expanding words in a simple command or `for` loop involves passes for:
- shellvar assignment
- brace syntax
- combined, whilst tidying up word splitting:
- ”$@”
- Process substitution
- (detects assignment)
- ’:’
- ’~’
- ’$’ shellvars
- ’`’ command substitution
- ’\’ escaping.
- quoting
- splitting
- glob expansion OR dequoting
- & shellvar assignment again
Shellvar assignment, if the appropriate flag is set, involves:
- Splitting the wordlist by flagged word.
- Iterating over all names to assign, saving the recursively expanded value into the shellvar hashmap. Possibly via the shellvar's callback function, and handling different types specially.
- Before & after each, validating it's a valid assignment, erroring out (if configured) if not.

The initial pass clears specified shellvars, the final pass assigns them. Shellvars may hold indexable arrays.
In Bash you can specify a comma-separated list between curly brackets for the outer arg to get duplicated once for each listed substitution. Very handy when the commandlines don't change that much!

To implement this, for each word in the command not flagged to disable this pass, it first locates a balanced non-escaped curly bracket char in the word & its corresponding close bracket, taking a text slice. Then within that slice it locates unescaped ','s & specially handles where they're absent.

Having parsed the input & handled an edgecase, for each escaped-comma-separated textslice it recursively expands any braces therein, concatenating those results onto its own. Then finally it concatenates on the pre- & post-amble via `strcpy` before recursing again until there's no more braces to expand.

If there was only a single item in the braces it's parsed as a pair of numbers to expand into a numeric sequence.

This word array is then spliced into the word linkedlist.
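Here's a heavily simplified, non-recursive sketch of that slicing in C (the real Bash pass also handles escaping, nesting, & numeric sequences; `expand_braces` is a made-up name, not Bash's):

```c
#include <stdio.h>
#include <string.h>

/* Expand a single unnested {a,b,c} group: print preamble + alternative + postamble. */
static void expand_braces(const char *word) {
  const char *open = strchr(word, '{');
  const char *close = open ? strchr(open, '}') : NULL;
  if (!open || !close) {           /* no brace group: pass the word through */
    puts(word);
    return;
  }

  const char *item = open + 1;
  while (item < close) {
    const char *comma = memchr(item, ',', (size_t)(close - item));
    const char *end = comma ? comma : close;

    /* preamble, then this alternative, then the postamble */
    printf("%.*s%.*s%s\n",
           (int)(open - word), word,
           (int)(end - item), item,
           close + 1);

    item = end + 1;
  }
}

int main(void) {
  expand_braces("file.{c,h,o}");   /* file.c, file.h, file.o (one per line) */
  expand_braces("plain");
  return 0;
}
```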
There's a main expansion iteration over the word linkedlist which, between validation/correction code (e.g. handling "$@" specially), iterates/branches over all chars therein:

- The nilbyte ends iteration.
- A special "control escape" char appends this sequence to the output.
- `<(` or `>(` parses & runs the embedded command in a new process, appending a FIFO or /dev/fd filepath to output.
- `=` & `:` alter some flags.
- `~` might append a string with the aid of LibReadLine or LibTilde.
- `$` handles digits & various symbols specially, looking up various globals to append to output. Possibly with reformatting for `$@`/`$*` (function args), or evaluating parenthesized arithmetic or command expressions (after recursively expanding parenthesized words).
- Otherwise `$` parses off the subsequent name & looks it up in the shellvar hashmap. Possibly with a trailing array lookup. Or upon syntax error doesn't substitute.
- In either case it skips the trailing `$` when appending to output.
- '`' finds the text slice between it & the subsequent '`', forks a new process to parse & run that command, & reads the result from its standard out to append to the resulting word. With decent error handling.
- '\' outputs the escaped or unescaped char under different conditions. Full escaping later!
- Quotes alter flags & recursively expand the quoted text specially.
- Upon whitespace it flags whether to split the word here.
- Otherwise it copies chars over with a quoted check.
Bash allows you to use wildcard chars which expand to all matching files visible to it in the filesystem. It wraps LibC's, or LibGlob's, implementation with linkedlist iteration & some pre/post-processing, skipping results which match specified "ignore" globpatterns.
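A minimal sketch of the LibC interface being wrapped here (POSIX `glob()`; Bash layers its own quoting, ignore-patterns, & linkedlist handling on top):

```c
#include <glob.h>
#include <stdio.h>

int main(void) {
  glob_t matches;

  /* Expand the pattern against the filesystem; no special flags or error callback. */
  if (glob("*.c", 0, NULL, &matches) == 0) {
    for (size_t i = 0; i < matches.gl_pathc; i++)
      printf("matched: %s\n", matches.gl_pathv[i]);
    globfree(&matches);
  } else {
    puts("no matches");   /* GLOB_NOMATCH, or an error */
  }
  return 0;
}
```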
This globbing pass, if it runs, also serves to remove quoting & escaping, though that work is done regardless of whether globbing is performed.
After actually assigning shellvars the shorthand expansion is now complete!
Executing Commands
The central task we all expect from Bash is to run the external programs we specify with our specified input. Though Bash also supports other types of commands e.g. the previously mentioned “builtins”. Here I’ll describe how it locates & runs commands regardless of their source.
But first it `fork`s a new process with appropriate stdin/stdout/stderr before locating the command or even (as described above) preprocessing the commandline args to lower shortcuts.
The edgecase of an empty arglist, the “null command”, still requires appropriate cleanup to be done.
Then it checks hashtable-registered "functions" (unless $POSIXLY_CORRECT or functions are disabled). Which it runs (after consulting additional lookups) by determining optimization flags it can set, registering exception handlers, updating the stackframe, interpreting its body (as described the previous day), & restoring the previous shell state.
Next it binary searches the sorted builtins array to locate a callback function to (with extensive exception handling, and some I/O adjustments) hand the commandline args to, possibly rewriting any invocations of the `command` builtin.
If the first word starts with `%` it's treated as shorthand for the `bg` or `fg` builtins. Or you could reference these jobs by name.
An optional `autocd` feature converts dir paths to `cd` commands.
Finally there’s the central concept of “disk commands”!
Optionally hashmap-caching results (with edgecase handling), it checks for absolute paths before iterating over `$PATH`, preprocessing each entry & appending the commandname before checking if it's executable.
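Roughly the same lookup can be sketched in C (ignoring Bash's hashing, tilde expansion, & permission edgecases; `find_in_path` is a made-up helper name, not Bash's):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: return a malloc'd "dir/cmd" path for the first
 * executable match along $PATH, or NULL if none is found. */
static char *find_in_path(const char *cmd) {
  const char *path = getenv("PATH");
  if (!path) return NULL;

  char *copy = strdup(path);
  char *result = NULL;
  for (char *dir = strtok(copy, ":"); dir; dir = strtok(NULL, ":")) {
    char candidate[4096];
    snprintf(candidate, sizeof candidate, "%s/%s", *dir ? dir : ".", cmd);
    if (access(candidate, X_OK) == 0) {   /* executable by us? */
      result = strdup(candidate);
      break;
    }
  }
  free(copy);
  return result;
}

int main(void) {
  char *p = find_in_path("ls");
  printf("ls resolves to: %s\n", p ? p : "(not found)");
  free(p);
  return 0;
}
```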
To run the located command it first gathers all exported shellvars to pass to it as envvars (adding `$_`), ensures we have forked, adjusts signal handling & stdin/stdout/stderr, closes any FIFOs, & applies any redirections. Upon failure to locate the command it'll fall back to running the `command_not_found_handle` shellfunction or, in serious cases, output an error message itself.
Then it reformats the args linkedlist into an args array to hand to the program via `execve`!
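The skeleton of that fork-then-exec dance, stripped of all of Bash's bookkeeping (a generic sketch, not Bash's execute_disk_command):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  /* argv & envp arrays, NULL-terminated as execve expects. */
  char *args[] = {"/bin/echo", "hello from the child", NULL};
  char *envp[] = {"GREETING=hi", NULL};

  pid_t pid = fork();
  if (pid < 0) {
    perror("fork");
    return 1;
  }
  if (pid == 0) {
    /* Child: replace this process image with the command. */
    execve(args[0], args, envp);
    perror("execve");          /* only reached if execve failed */
    _exit(127);                /* Bash also uses 127 for "command not found" */
  }

  /* Parent: wait for the child & report its exit status. */
  int status;
  waitpid(pid, &status, 0);
  if (WIFEXITED(status))
    printf("child exited with %d\n", WEXITSTATUS(status));
  return 0;
}
```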
As for aliases, that’s handled during parsing & autocompletion.
fork syscall (Linux)

To run a "disk" command, Bash utilizes Linux's (or other UNIX-likes') `fork` & `execve` syscalls.
`fork`, `clone`, & related syscalls verify parameters configurable via `clone`, clone the configured `struct task_struct *` properties (including CPU-specific & CGroups ones) with minor verification & appropriate locking, tweak some of those fields based on flags (possibly including graph traversal for debugging), add protection against stack leaks, construct a new "[pidfd]" file descriptor, & disable stepwise debugging flags.
Then Linux's implementation of `fork`, etc, under lock checks resource usage permissions before copying those fields over, initializing proctree etc fields to empty values.
Before clearing the scheduler fields whilst computing priority; initializing perf events; checking allocation permissions; copying/altering permissions, semaphore undo stacks, open files, `struct fs_struct *`, signal handlers, signals, memory management, namespaces, I/O, & subthread properties.
Finally it allocs a free Process ID under appropriate locks (possibly incrementing it), creates the anonymous "[pidfd]" virtual filesystem entry, clears stepwise debugger flags, checks CGroups permissions again, copies a few more fields, verifies that the CGroup isn't dying & no fatal signals are pending, replaces that "[pidfd]" open filedescriptor, initializes more empty lists, adds PTrace fields, saves the PIDs in the appropriate fields, inits subtask cleanup if it interrupts upon exit or refcounted-adds to the "thread group leader"'s lists, tweaks a couple lists based on that, emits an event, initializes some final UClamp, profiling, & UProbe fields, enqueues the new thread on its selected scheduler, & returns the PID.
execve syscall (Linux)

To actually run the specified commandline programs in the new processes it `fork`s, Bash uses the `execve` syscall. Theoretically this could all be done in userspace (GNU LibC supports dynamic libraries afterall!), but if we want to discard the shell's memoryspace to never return, someone else has to take over the work. So how does Linux do so?
After carefully reading the filepath from userspace & normalizing args, `execve` & related syscalls validate & check permissions.

From there it initializes a "binary PRM" struct to represent the executable being mmap'd, takes some counts, configures a stack limit, & copies (with a decent chunk of logic to cross userspace/kernelspace boundaries) the filename, envvars, & commandline args to the stack before deferring to BPRM's `execve` implementation.
BPRM then duplicates &, for SUID programs, alters the credentials struct under lock, performs some safety checks, opens the specified file read-only, considers switching CPU cores whilst the addressspace is small to better balance tasks between them, checks permissions via a callback hook, & iterates over a linkedlist of externally-registered "binary handlers" to determine e.g. whether to mmap the executable or mmap another program specified via a "shebang" (`#!`) line to interpret this file. Which determines the instruction to start execution on.
It outputs to `ptrace`, then defers to "rseq" (zeroing some fields) & updates bookkeeping.
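To give a feel for what the shebang-handling binary handler is checking, here's a userspace sketch that reads a file's first line & reports the `#!` interpreter, if any (purely illustrative; the kernel inspects the mmap'd header, not a stdio stream):

```c
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s script\n", argv[0]);
    return 1;
  }

  FILE *f = fopen(argv[1], "r");
  if (!f) { perror("fopen"); return 1; }

  char line[256];
  if (fgets(line, sizeof line, f) && line[0] == '#' && line[1] == '!') {
    line[strcspn(line, "\r\n")] = '\0';      /* trim the newline */
    printf("interpreter line: %s\n", line + 2);
  } else {
    puts("no shebang; a binary handler like binfmt_elf would claim it");
  }
  fclose(f);
  return 0;
}
```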
Job Control
Bash has a handy feature whereby you can run commands “in the background”, so that you can continue using the shell or another “foreground” command whilst it’s still running. I’ve already discussed some builtin commands for manipulating/viewing these, but today I wish to discuss the core bookkeeping behind it. Tomorrow probably I’ll discuss the kernel-side components.
You can disable compiling this component & its integrations.

"Job stats" (JS) are tracked, and are zeroed out in subshells.

It also has a concept of "pipelines", which upon startup (called internally) deallocs the old pipeline, zeroes out the pipeline PGroup if it's different from the shell's, & closes the old pipe.

Upon stop (called upon interpreting the trailing command of the control flow, whether that be a simple command, pipeline, coprocess, or function definition) it serves to add the full pipeline's processes to a joblist.

Before resizing the joblist ringbuffer upon closing a pipeline, it'll first see if it can reap dead processes it's tracking. Normally it'd flag them as dead first so it can tell you they died; otherwise it's deallocating the entry & updating counts. Compaction is another iteration or two.
Then it iterates over the pipeline a couple of times to populate a newly allocated "job". Or it calls the `tcsetpgrp` syscall, guarded by `sigprocmask`, to place it in the foreground. `SIGCHLD` isn't handled in the meantime.
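A hedged sketch of that hand-the-terminal-over pattern (the same syscalls, not Bash's actual give_terminal_to): block the job-control signals, change the terminal's foreground process group, then restore the mask.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Make `pgrp` the foreground process group of the controlling terminal,
 * blocking the signals that tcsetpgrp could otherwise raise against us. */
static int give_terminal_to(pid_t pgrp) {
  sigset_t block, old;
  sigemptyset(&block);
  sigaddset(&block, SIGTTOU);
  sigaddset(&block, SIGTTIN);
  sigaddset(&block, SIGTSTP);
  sigaddset(&block, SIGCHLD);
  sigprocmask(SIG_BLOCK, &block, &old);

  int ret = tcsetpgrp(STDIN_FILENO, pgrp);

  sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the old signal mask */
  return ret;
}

int main(void) {
  /* Demo: hand the terminal to our own process group (a no-op if we already have it). */
  if (give_terminal_to(getpgrp()) != 0)
    perror("tcsetpgrp");
  else
    printf("foreground pgrp is now %d\n", (int)tcgetpgrp(STDIN_FILENO));
  return 0;
}
```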
These pipelines can be “saved” & “restored” to/from a linkedstack, used for expanding backtick-subcommands into literal arguments & trap handling.
That ringbuffer is resizable & used to populate a trivial numeric-closed-hashmap (multiply+remainder hashfunction), with statically-allocated maximum capacity. This hashmap might be invalidated upon resizing the ringbuffer.
`bgp_add` moves an entry from the ringbuffer to the hashmap.
There’s a linked-list of process substitutions. Not that it appears to actually be used anywhere…
It tracks the range of job IDs.
The control flow interpreter may call `append_process` to add a new `PROCESS *` to the jobs array (which appears separate from the hashmap & ringbuffer). This may be iterated over to `SIGTERM` or `SIGHUP` then `SIGCONT` all those pgroups, find jobs by their IDs, & output their statuses, possibly filtered by status.
There’s an iteration over the cyclic linkedlist for a given pipeline seeking a given PID, or over the pipeline stack. Pipelines can be printed. There’s abstractions over certain interrupt handling patterns.
For BSD it might need to wait for all output to drain before changing the baudrate. It abstracts the `TIOCGETP`, `TIOCGETC`, & `TIOCGLTC` and/or `TCGETA` IOCTLs to retrieve the teletype state. Another abstraction covers the corresponding setters.
There's a function for setting the global (in-memory) foreground job, with validation.

There's an abstraction to wait on a given PID, possibly background, or (possibly any) job, with or without cleaning up all the surrounding bookkeeping I'm describing. The rawest function here wraps a repeated `waitchild` with bookkeeping for which other children die in the meantime.

Another function resets the foreground job. This seems to be all userspace.
Upon a simple command the controlflow interpreter calls `start_job`.
After validation `start_job` unsets the `NOTIFIED` bitflag, optionally sets & flags it as foreground (outputting the status manually otherwise), followed by any other jobs it needs to notify you about, flags the command as running, tells the kernel to hand terminal control to it (ah! there's that code!), if needed sends `SIGCONT`, & cleans up.
There's an abstraction around `kill[pg]` which ensures the process is in a good state to receive the interrupt.

Bash has its own interrupt handlers.
Upon `SIGCHLD` it repeatedly waits for the child to exit, updating any of the datastructures it might be in. Including the job structure itself, unsetting the `NOTIFIED` bitflag & calling any cleanup callbacks while we're at it.
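The core of such a handler, as a generic sketch (Bash's waitchld does far more bookkeeping): reap every child that has already exited, without ever blocking inside the handler.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_sigchld(int sig) {
  (void)sig;
  int saved_errno = errno;     /* keep errno intact for the interrupted code */
  int status;
  pid_t pid;

  /* WNOHANG: collect every already-dead child, but never block in a handler. */
  while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
    /* Real code would update its job tables here; write() is async-signal-safe. */
    write(STDOUT_FILENO, "reaped a child\n", 15);
  }
  errno = saved_errno;
}

int main(void) {
  struct sigaction sa = {0};
  sa.sa_handler = on_sigchld;
  sa.sa_flags = SA_RESTART;    /* restart interrupted syscalls where possible */
  sigaction(SIGCHLD, &sa, NULL);

  if (fork() == 0) _exit(0);   /* spawn a child that exits immediately */
  pause();                     /* returns once the handler has run */
  return 0;
}
```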
There’s a function for outputting detailed information about a job’s status.
There’s an initializer, destructor, & reset for this whole subsystem which is called lazily.
There's a `new_line_discipline` setter wrapping several IOCTLs.
`SIGCONT` is forwarded to the foreground job. `SIGSTOP` restores old signal handlers & takes terminal control for itself before forwarding to the foreground job.
There's functions to delete all jobs being tracked, trigger `NOHUP` on all of them, count them, flag them all as `DEAD`, & unset the `NOTIFIED` bitflag on all of them with some counting.
There's functions to access the `jobs_list_frozen` global. There's one to retrieve & cache its own PID. One to update `lmaxchild`. And one to configure the `SIGCHLD` interrupt handler.
And some cleanup functions.
Exiting Processes (kernel-side)
When a process `exit`s (by calling that syscall explicitly or implicitly), Linux notifies e.g. Bash's job subsystem via a `SIGCHLD` signal or the `waitpid` syscall. Also GNU LibC adds a little of its own logic to help userspace tidy up after itself. So today I'll describe that syscall; nexttime: teletype IOCTLs.

After validation/correction, Linux updates registered profilers (including, separately, fuzzing support & PTrace), checks credentials, & cleans up once it's in a consistent state.
Whilst tidying up it tackles, in order:
- URing I/O
- CGroups/threadgroups & received signals
- Updates memory status (two different functions)
- Decrements the livecount of its signal
- If so, validates it's not PID 0 (`init`), cancels timers, & further updates memory stats
- More stats updates derived from the group
- Validates the teletype's new state
- Any attached auditing
- Notifies any relevant NETLINK sockets
- Its virtual memory manager
- More (slower) stats
- Debug output
- Semaphores
- Shared memory
- Open files
- Root/current filepaths
- Disconnect from the controlling teletype
- The task namespace
- Any work it still has linked-queued
- CPU-specific cleanup
- Triggers a perf event
- Scheduling
- Finish with CGroups
- PTrace breakpoints
- RCU tasks
Then it considers which processes it needs to notify and (if it's a thread leader or PTrace'd) it validates state, triggering an atomic condition attached to the process (`waitpid`), & sends `SIGCHLD` to the parent.
An “atomic condition” integrates into the scheduler, carefully avoiding race conditions, to pause threads until the condition’s triggered.
Once the input's gathered & validated, sending a signal involves appending an entry to a linkedqueue whilst updating `rlimit` counts, triggering the corresponding filedescriptor's atomic condition, setting the appropriate bitflag, & waking the process up. Presumably as it wakes up the userspace callstack will get updated for each queued signal.

Finally it deconstructs its "process connector" driver & its mempolicy, validates no locks are still held, cleans up its block I/O context, pipe info (with its ringbuffer), & task frag memory page. Then it revalidates credentials, validates stack usage, temporarily disables preemptive multitasking, increments a per-CPU dirty count, handles RCU tasks (2nd pass for debug output, whatever RCU means…), & schedules something else in that's actually alive!
Teletype Files (Kernel-side)
Linux provides special devicefiles for (virtual) teletypes, which Bash uses to manipulate which process (or rather CGroup; I'll discuss those much later alongside systemd) receives its user input.
This, like all files, is implemented as a methodtable to which the appropriate syscalls defer.
A TTY file cannot be seeked. Upon read, after validation, they repeatedly defer to their underlying configured line discipline’s method & maybe update a timestamp.
Upon write, after validation & retrieving the locked line discipline, they either overwrite the methodtable with noops or repeatedly defer to the wrapped & locked file descriptor with extensive error handling. Similar for polling, but that doesn't require much error handling at all.
It doesn’t override methods for spliced reads & writes.
- Upon `TIOCSTI` IOCTLs, after validation, they defer to their line discipline's `receive_buf` method.
- Upon `TIOCGWINSZ` IOCTLs they return the `winsz` property.
- Upon `TIOCSWINSZ` IOCTLs they defer to their own `resize` method, by default updating that property & `SIGWINCH`-signalling the appropriate process group under lock if it has changed.
- Upon `TIOCCONS` IOCTLs they might swap out the global `redirect` file descriptor.
- Upon `TIOCEXCL`, `TIOCNXCL`, or `TIOCGETEXCL` IOCTLs they check/update a given bitflag.
- Upon `TIOCGETD` IOCTLs they return their line discipline's `num` classproperty.
- Upon `TIOCSETD` IOCTLs, if valid, they set that property, cleaning up the old one.
- Upon `TIOCVHANGUP` IOCTLs, if permitted, they tidy up with a bitflag set before optionally deferring to their own `hangup` method.
- Upon `TIOCGDEV` IOCTLs they retrieve & bitpack the device number to return.
- Upon `TIOCSBRK` or `TIOCCBRK` IOCTLs they defer to their own `break_ctl` method.
- Upon `TCSBRK` IOCTLs they defer to said `break_ctl` method, with or without a brief sleep before retrying as determined by the `TTY_DRIVER_HARDWARE_BREAK` bitflag.
- Upon `TIOCMGET` IOCTLs they defer to their own corresponding method.
- Upon `TIOCMSET`, `TIOCMBIC`, or `TIOCMBIS` IOCTLs they reencode the given bitflags appropriately & defer to their own `tiocmset` method.
- Upon `TIOCGICOUNT` IOCTLs they defer to their own `get_icount`.
- Upon `TCFLUSH` IOCTLs, given an arg of `TCI[O]FLUSH`, they free the locked linkedlist buffer.
- Upon `TIOCSSERIAL` IOCTLs they defer to their own `set_serial`.
- Upon `TIOCGSERIAL` IOCTLs they defer to their own `get_serial`.
- Upon `TIOCGPTPEER` IOCTLs they construct a PTM wrapper.
As for job control IOCTLs, which is what I’m really interested in here…
- Upon `TIOCNOTTY` IOCTLs they retrieve & deinitialize the global TTY under appropriate locks, before NULLing out the current process's TTY.
- Upon `TIOCSCTTY` IOCTLs, after validation/permission checks, they remove the TTY from its current process & set it on the given one under its own spinlock.
- Upon `TIOCGPGRP` IOCTLs, after validation, they retrieve a TTY property under spinlock (see the sketch below).
- Upon `TIOCSPGRP` IOCTLs they do the reverse, more validated.
- Upon `TIOCGSID` IOCTLs, after validation, they reformat the current session's process ID.
Or it might defer to methods on the TTY or line discipline to handle unknown IOCTLs. There’s also a legacy IOCTLs method that does similar things.
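From userspace these IOCTLs mostly hide behind termios wrappers; here's a small demo querying the window size & foreground process group of the controlling terminal (illustrative only, not Bash's code):

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
  /* TIOCGWINSZ: ask the TTY for its winsz property. */
  struct winsize ws;
  if (ioctl(STDIN_FILENO, TIOCGWINSZ, &ws) == 0)
    printf("terminal is %d columns x %d rows\n", ws.ws_col, ws.ws_row);

  /* tcgetpgrp() wraps the TIOCGPGRP ioctl described above. */
  pid_t fg = tcgetpgrp(STDIN_FILENO);
  printf("foreground process group: %d (we are %d)\n", (int)fg, (int)getpgrp());
  return 0;
}
```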
Upon open they defer to their wrapped file, possibly followed by `tcgetattr`/`raw` calls.
Upon final close they flush & close their wrapped file, wait on outstanding tasks, output appropriate warnings, & deconstruct themselves.

Upon fasync, after validation & under lock, they retrieve the process ID & type to set as the file's owner.

And outputting debugging info is deferred to its own corresponding method.
Bash History
Bash optionally logs all the commands you type to make it easier to repeat variations upon them & eventually translate into shell scripts. In the ReadLine prompt navigating this history is tied to the up/down keybindings. Plus there’s builtin commands for accessing them & ctrl+R searches them by substring.
To (re)initialize this history logging (upon startup or via flags) it sets a few globals to default values, looking up shorthand chars from the $histchars shellvar.
Upon history logging disable (called from parser & some builtin commands) it unsets some globals, reloading the $HISTCONTROL & $HISTIGNORE shellvars.
Upon load history (called on startup & `set`) it ensures the $HISTSIZE & $HISTFILESIZE shellvars are configured & has ReadLine parse the lines out of the file specified by $HISTFILE, remembering the linecount. There's a corresponding function for clearing history (called on `history`) deferring almost entirely to ReadLine.
There's functions abstracting ReadLine's APIs for altering in-memory history, exposed via the `history` & `fc` builtin commands. One of which attempts to open a file for ReadLine to write to. A similar function is called by the REPL loop & signal handlers to save the in-memory history to $HISTFILE.
There's an abstraction around ReadLine's preprocessor, with error reporting, where it appends history records. And separate ones for stripping comments, etc.
There's several functions to log commands to different targets, including in-memory with ReadLine, $HISTFILE, & Syslog. All abstracted into `bash_add_history`, called from the above functions.

And there's a few other miscellaneous functions.
So… basically all the logic’s in ReadLine.
Bash Autocompletion
In Bash when you press the tab key it'll autocomplete what you're typing, mixing its own logic with ReadLine's UI.

Upon hitting that tab key ReadLine calls a Bash-provided callback which, after adjusting ReadLine globals & stripping preceding whitespace, considers whether this is a quoted or unquoted command name to autocomplete, in which case (after correcting some flags regardless) it runs a completion callback specific to it (or reuses the generic filepath completion).
That commandname completion conditionally sources from all the same locations used for lookup, and supports glob matching. Amongst other adjustments.
It compiletime-optionally skips any preceding shellvars and considers whether to use that commandname autocompletion or interpret the command-specific configuration before falling back to a generic autocompleter. Which chooses between completing command substitutions (recursive), shellvars, usernames, hostnames, ignores, or filepath globs.
To evaluate programmable autocompletions Bash first looks up the command in a hashmap before consulting the specified bitflags found there to determine whether to autocomplete aliases, arrayvars, bindings, builtin commands, disabled commands, enabled commands, envvars, functions, builtin help, hostnames, jobs, keywords, running jobs, `set` flags, `shopt` flags, signals, stopped jobs, shellvars, commands (as per before), files, users, groups, & services.

Directories are treated specially.
Then it autocompletes globs against the filesystem. If the autocompletion declaration specified a list of words, it’ll autocomplete any matching ones at this point in the process.
It then splits args, consults the shellfunction about to be called, & optionally runs a specified command to retrieve autocompletion options.
Then finally tidies up and performs some final filtering. Possibly rerunning the main suite of autocompleters.
Gawk
In Unix it's common to need to convert some input format into some output format. And when there's not a dedicated command to do the conversion you need, often AWK can be programmed to do it for you. I'll be studying GNU's implementation today.
To interpret these programs Gawk checks whether $TIDYMEM is defined, possibly installing memory allocation logging callbacks if so, extracts the executable basename from argv[0], considers outputting usage info if there aren't enough commandline args, retrieves $GAWK_LOCALE_DIR with a default, ensures stderr's in appendmode if those syscalls are available, initializes internationalization, registers signal handlers, allocs some extra stackspace, allocs an empty "null" string, validates the stdin & stdout filedescriptors, registers methodtables for string "arrays" (or rather hashmaps) & if compiled with MPFR ints & complex ints, allocs & zeroes-out some symboltables with appropriate methodtables for empty arrays & registers a couple globals, allocs & linkedlist-pushes a new context, parses commandline flags with those commandline args (enqueueing any assignments & input source files) backed up for debugging, possibly reinitializes internationalization, prepares a lookuptable for quickly converting bytes to widechars, for non-ASCII/Unicode (EBCDIC) locales initializes a lookuptable for uppercasing chars, outputs a nostalgic error message if requested, checks $POSIXLY_CORRECT converting it into a bitflag & checking bitflag conflicts, possibly warns about the setuid permission bit being enabled, possibly registers breakpoint debugging callbacks, initializes some MPFR globals if compiled against MPFR, saves a list of groups from the kernel, initializes regexp globals, loads a bunch of globals & envvars into the runtime symboltable, adds the pre-allocated null to the start of the fields array, loads in the variables specified by commandline flags, considers setting stdin to binary mode, in debug builds enables stdout buffering, detects whether stdout is a (virtual) teletype, initializes some stdlib flags, loads & runs some dynamically-loaded modules, maybe outputs the version number alongside all of those extensions' & a copyleft notice before exiting, fills in missing source code from commandline args or errors out, initializes the stackframe & a boolean lookup table for the interpreter, populates vars from commandline args, reconfigures the numeric locale if possible, parses the AWK program (tomorrow's topic) exiting immediately given the -g flag, sets a global for the current namespace, populates syntactically-builtin functions into the symbols table, possibly lints that no function shadows any globals, possibly lints whether there's any sourcecode, possibly registers signal handlers to dump execution state with or without exiting, restores the numeric locale, initializes the runtime scanner, debugs or interprets the program (later topic), maybe outputs interpreter state, & cleans up.
Like any programming language, Awk needs to be parsed. To do so Gawk saves whether this parsing was called from `eval` in a global, allocs a noop end instruction, checks the context level to determine whether to alloc other global opcodes, finds the first EXTLIB sourcefile to save in a global, clears some other globals, allocs a tokenization buffer if needed, runs a Bison/YACC parser, restructures it into straightline code, clears some globals, validates undefined functions aren't called, validates parameters don't shadow functions, & allocs some memory to store corresponding arguments in.

Reformatting the program involves adding the end instruction to the endblock (creating it if necessary), possibly merging the begin & program blocks + maybe the endblock appending a STOP opcode, appending the endfile opcode to the endfile block creating it if necessary, doing the same for beginfile, handling certain empty cases, & more append/prepend/merges.
Those appends/prepends/merges involve appending an `AFTER_ENDFILE` opcode to the endfile block, prepending the record op to the program block & appending a JMP to it, appending an `AFTER_BEGINFILE` opcode to the beginfile block, merging the beginfile block & program block, prepending the newfile op to the merged block, merging the endfile block & end block & (if it exists) begin block, merging the outercomment op if it exists, appending the interblock comment if it exists, & appending the atexit op & a STOP opcode.
The parsing itself thanks to Bison resembles BNF.
Upon EOF the parser iterates the lexer over to the next input file; otherwise it updates/allocs various linkedlists of opcodes using a hashtable & the symboltable to resolve function references. Whilst allowing the source code to append input files to the scanner or load/run dynamic libraries extending the available APIs.

It adds opcodes to assert certain conditions, or "lints", within this parsing. It uses its regexp engine. There's constant propagation.
`switch` statements require significant parsing to gather/reformat their `case`s. Loops require updating the jump targets. `for` loops can be significant shorthand syntax which is expanded here in the parser, as are `print` statements.

The parser may have temp memory to free. Assignments require paying special attention to the lvalue.
There’s numerous helpers for manipulating these linked lists.
The Bison-generated parser is configured to use a handwritten lexer which, after consulting the previous token to determine whether to output a SUBSCRIPT or EOF token & consulting a ringbuffer before falling back to reading the current file a bufferful (or for debugging, a line) at a time to peek the next char to lex, skips OS/2 "extproc" lines; if it accepts regexps it iterates over chars counting open square brackets whilst lexing escapes & validates it's terminated before the line/file.
During lexing regexps it lexes any trailing regexp modifiers.
Then it skips whitespace and branches over the token. Certain (pseudo)chars are converted straight to returned tokens. Whilst lexing linecomments it may or may not preserve the skipped line in a global for pretty printing before yielding a newline token. "@" becomes a token, but with following regexps handled specially. "\" removes the following newline (and maybe whitespace/comments) from the input stream.

"?" & ":" pairs are counted with globals updated to indicate the lvalue is a condition. Corresponding parens ("(" & ")") are counted. "$" updates the lvalue global to indicate it's a field. ";", ",", & "[" are converted directly to tokens. "]" peeks at the next token to possibly warn about the portability of multidimensional arrays. "*=" flags the lvalue as an `Op_assign_times`. "**" & "**=", unless sticking to POSIX-compliant behaviour, become "^" & "^=" tokens. "*" otherwise is a token itself.
"/=" becomes SLASH_BEFORE_EQUAL & = tokens, otherwise "/" is a token in its own right. "%=" marks the lvalue as `Op_assign_mod` yielding an ASSIGNOP token, otherwise "%" is a token in its own right. Similar for "^" but with optional warnings that this is relatively new. And for "+" but with a dedicated "++" INCREMENT token. "!" has a dedicated "!~" MATCHOP (indicating the "not" via the lvalue global) token. "<" & "=" are lexed similarly.
">" has a ">>" IO_OUT token & lexes differently in parens. "~" corresponds to its own MATCHOP token. "}" counts braces with special newline lexing. Lexing quoted strings generates/`malloc`s a literal value stored in the lvalue whilst handling escapes. "-" is lexed similarly to "+". "." is its own token when not followed by digits, in which case it determines which base to use & parses via MPFR or LibC. "&" can be its own token, or double up into an AND token with a new `Op_and` lvalue.

"|" is lexed similarly to "&" but optionally with a special "|&" IO_OUT/IO_IN (depending on whether we're in both a print & parens expression) token whilst setting an Op_symbol lvalue; on its own it's also an IO_OUT/IO_IN token with different redir_type values on the Op_symbol lvalue.

Otherwise it lexes an identifier & checks whether it's a special keyword before validation, lvalue correction, & cleanup.
To interpret this linkedlist of opcodes Gawk, starting from a "program counter" of instruction 0, repeatedly saves the source line number to a global, maybe runs preexecute callbacks for stepwise-debugging or profiling, maybe outputs each opcode being executed to stderr, branches over the opcode, & normally jumps to the next instruction in the linkedlist.
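In outline that main loop looks something like the following sketch (invented opcode names & structs, not Gawk's actual INSTRUCTION type):

```c
#include <stdio.h>

/* Hypothetical straightline opcode list; Gawk's INSTRUCTION carries far more. */
typedef enum { OP_PUSH_I, OP_PLUS, OP_PRINT, OP_STOP } opcode_t;

typedef struct insn {
  opcode_t op;
  int operand;
  struct insn *next;      /* normal fall-through target */
} insn;

static void interpret(insn *pc) {
  int stack[64], sp = 0;

  while (pc) {
    switch (pc->op) {               /* branch over the opcode... */
    case OP_PUSH_I: stack[sp++] = pc->operand; break;
    case OP_PLUS:   sp--; stack[sp - 1] += stack[sp]; break;
    case OP_PRINT:  printf("%d\n", stack[--sp]); break;
    case OP_STOP:   return;         /* like Op_stop, exit the interpreter loop */
    }
    pc = pc->next;                  /* ...then jump to the next instruction */
  }
}

int main(void) {
  /* 2 + 3, printed: a stand-in for a parsed AWK rule. */
  insn prog[] = {
    {OP_PUSH_I, 2, &prog[1]},
    {OP_PUSH_I, 3, &prog[2]},
    {OP_PLUS,   0, &prog[3]},
    {OP_PRINT,  0, &prog[4]},
    {OP_STOP,   0, NULL},
  };
  interpret(prog);
  return 0;
}
```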
`Op_rule` opcodes update the `currule` global (and, if it's `BEGINFILE`, set the record global to NULL copying over any existing data) & the `source` global. `Op_func` only updates the `source` global. `Op_atexit` has the scanner iterate over to the next input file, closes all remaining input files, closes any dynamically-loaded extensions via their destructors, & maybe dies indicating a SIGPIPE exception. `Op_stop` exits the interpreter loop.
`Op_push_i` pushes the given refcounted value onto the stack after possibly internationalizing it. `Op_push[_arg[_untyped]]` all retrieve the given value from the opcode, consulting a different lookuptable with type validation if it's originally of type `Node_param_list`, before branching on the value's type. If it's `Node_var` it might warn about uninitialized values before refcounting & pushing to stack. For `Node_var_new` it replaces the nullstring, possibly warns, & pushes a duplicate to stack. And for `Node_var_array` it pushes to stack in `Op_push_arg[_untyped]` opcodes, complaining for `Op_push` opcodes. `Op_push_param` retrieves the given value, maybe replaces it with one from the args array, & either refcounts & pushes the var's value or pushes the value itself. `Op_push_array` pushes the given value. `Op_push_lhs` does some validation/conversion before pushing the given value.
`Op_subscript` pops the given number of items from the stack followed by the array, range-validating it; if the popped array is `func_table` it might warn once & yield the subscript, but normally it calls the "array" method to perform a lookup; if the popped array was `symbol_table` it first flushes any mutations & afterwards performs appropriate warnings & normalization; then it dereferences the slice, & references & pushes the result to stack. `Op_sub_array` pops the given number of items from the stack followed by the array, range-validating it, on success creating a new array to copy that entry over into. `Op_subscript_lhs` works essentially the same way with more validation & pushing a reference to stack instead of a value.
`Op_field_spec[_lhs]` pops an int to look up in the fields array to get a value to push. `Op_lint` emits the given error message if configured to do so via commandline flags. `Op_lint_plus` peeks at the top two entries to warn if they're not strings. `Op_K_break/continue` & `Op_jmp` repeat the loop for the given target. `Op_jmp_false/true` coerces the popped value to a bool & conditionally jumps.
`Op_and` & `Op_or` pop a scalar coerced to a boolean to decide whether to shortcircuit, on success pushing the coerced value back to the stack & jumping to the given instruction. The `_final` variants of those opcodes coerce the top-of-stack to a boolean with refcounting.
`Op_not` does the same but with the negation operator. `Op_equal`, `Op_notequal`, `Op_less`, `Op_greater`, `Op_leq`, & `Op_geq` wrap a helper util to compare the top 2 stackvalues as strings if not numbers.
`Op_plus` adds the top 2 stack entries, replacing the 2nd with a value wrapping the sum; the `_i` variant allows referencing the 1st value from the opcode instead. `Op_minus`, `Op_times`, `Op_exp`, `Op_quotient`, & `Op_mod` do the equivalent for their operators. `Op_preincrement` & `Op_predecrement` replace the top-of-stack coerced/parsed to a number with that value plus or minus one, updated in place if possible. The `post` variants work essentially the same with a little more juggling.
`Op_unary_minus` coerces the top-of-stack to a number & negates it. `Op_unary_plus` coerces the top-of-stack to a number.

`Op_store_sub` retrieves the given array, calls its lookup method validating the result is not itself an array, validates more whilst following indirection, pops a scalar, & calls the array store method. `Op_store_var` retrieves the given lhs value & stores the given value there. `Op_store_field` pops the top int, looks it up in the fields array, calls the given assignment callback, & overwrites the returned value.
`Op_assign_concat` pops the top-of-stack lvalue coerced to string, pops a string, takes a shortcut if (amongst other things) they're the same, allocating the summed length & `memcpy`ing both strings into place, optimizing for sole references. `Op_assign` pops an address & scalar to store a unique copy of the scalar in that pointer. `Op_subscript_assign` after various validation calls an array method. `Op_assign_plus/minus/times/quotient/mod/exp` pop two numbers, one as an address, to apply that operator wrapped in a value which is also pushed to stack. `Op_var_update` calls a callback on the opcode. `Op_var/field_assign` performs some validation on the top-of-stack before calling the corresponding callback on the opcode.
`Op_concat` pops the given number of strings (fastpath for 1), summing their lengths, allocating a new string, & copying them all into place.
`Op_K_case` pops & compares 2 scalars to determine whether to take the branch, with a special regexp variant. `Op_K_delete` pops an array then pops the given number of indices off the stack, validating them before calling the corresponding "array" method. `Op_K_delete_loop` pops an array & address to call the corresponding symbolmethod.
`Op_in_array` pops an array & a given number of indices to check whether those indices are in the array, pushing the result converted to a value.
`Op_arrayfor_init` pops an array validating whether it's non-empty, initializes the `sorted_in` string constant if needed, tests whether that key's in the array to retrieve how to sort it, coerces & possibly sorts the list as a newly alloc'd value, then converts it into an iterator value to push to stack. `Op_arrayfor_incr` increments the current index, throwing errors on overflow, & pushes the value at that index. `Op_arrayfor_final` frees this iterator.
`Op_builtin` calls the given callback & pushes the result. `Op_ext_builtin` after argcount validation calls the given callback before popping all args from the stack & pushing the converted result. There's several opcodes directly corresponding to builtin functions like `sub`, `print`, `printf`, & `print_rec` implementing sophisticated string manipulation.
`Op_push_re` retrieves the given value & maybe pops a string to store in its `re_exp` property before pushing it. `Op_match_rec` retrieves the given field & evaluates the regexp before converting to & pushing a bool value. `Op_[no]match` does the same but retrieves the text to match from the stack whilst possibly overriding the regexp to evaluate.
`Op_indirect_func_call` peeks & validates the top-of-stack, looks that property up in the global "arrays" & either calls the callback pushing the result or updates the callstack jumping to the function's opcodes. `Op_func_call` looks up the named function body in those global "arrays" caching the result inline, updates the stackframe, & jumps to its start instruction.
`Op_K_return_from_eval` errors. `Op_K_return` pops a scalar & stackframe to determine where to jump to, after restoring that scalar to the top-of-stack as the return value.
`Op_K_getline_redir` tells the scanner to read a line (later topic). `Op_K_getline` validates it's not called within the BEGINFILE or ENDFILE rules before repeatedly iterating to the nextfile and either reading the line to push the successful result or pushing a stackframe to jump to the begin or endfile rule.
`Op_after_endfile` is akin to a return statement. `Op_after_beginfile` also resets various scanner state. `Op_newfile` iterates the scanner to the next file, jumping to the appropriate destination possibly popping a stackframe. `Op_get_record` validates there is a `curfile` global (following a jumppointer if not) before having the scanner read a record, following another jumppointer on failure. `Op_K_nextfile` iterates the scanner to the nextfile returning from the current stackframe, with BEGINFILE handled specially.
`Op_K_exit` validates there is a currule, pops an errorcode, pops a stackframe if in BEGINFILE or ENDFILE, pops an ignored value from the stack, and determines where to jump to (the END rule handled specially) from an opcode property. `Op_K_next` validates the currule before popping the stack & following the goto.
`Op_pop` pops & ignores a scalar. `Op_line_range` consults a flag on the opcode for whether to follow the goto. `Op_cond_pair` pops a bool to update the associated `Op_line_range` & decide which goto to follow. And finally `Op_exec_count`, if profiling is enabled, increments a counter property on the opcode. `Op_no_op`, `Op_K_do`, `Op_K_while`, `Op_K_for`, `Op_K_arrayfor`, `Op_K_switch`, `Op_K_default`, `Op_K_if`, `Op_K_else`, `Op_cond_exp`, `Op_comment`, & `Op_parens` are all noops to aid parsing.
The datamodel involves lookup arrays for function args & input matches, an interpreter stack, typed hashmaps with methodtables including ones for the global context, & dynamically-typed regexps, numbers, strings, or hashmaps called "arrays".

There's a runtime scanner used in interpreting the broader AWK syntax, which I skipped over yesterday preferring all the different opcodes.

When iterating over to the nextfile to concatenate, it consults some script-accessible globals to determine which file to open, which is where most of the extra complexity is, and has callbacks for setting other script-accessible globals.

To read a "record" it uses configurable callbacks (including a read timeout) to repeatedly enlarge the read buffer until it finds the record-terminator.
There's special handling for network files. The scanner ensures reads can timeout as configured via the `select` syscall. There's an OO wrapper around AWK's output.

Gawk, for whatever reason, implements regexps itself rather than reusing GNU LibC's. It doesn't seem to be so that it can operate byte-by-byte as it reads the file, since the scanner operates entirely in-memory…
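For contrast, here's what leaning on LibC's engine looks like through the POSIX <regex.h> API; Gawk's own engine exposes a richer internal interface, so this is only a point of comparison, not what Gawk does.

```c
#include <regex.h>
#include <stdio.h>

int main(void) {
  regex_t re;
  regmatch_t match;

  /* Compile an extended regexp, then run it against a subject string. */
  if (regcomp(&re, "[0-9]+", REG_EXTENDED) != 0) {
    fputs("bad pattern\n", stderr);
    return 1;
  }

  const char *subject = "record 42 of 97";
  if (regexec(&re, subject, 1, &match, 0) == 0)
    printf("first match at bytes %d..%d\n", (int)match.rm_so, (int)match.rm_eo);
  else
    puts("no match");

  regfree(&re);
  return 0;
}
```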
To parse these regexps, after some initialization & bracket-syntax linting, it resolves any escapecodes character-by-character into a new input string, determines whether to ignore casing, possibly allocs some DFA graphnodes, saves bitflags to a global, recursively compiles it, tweaks the DFA to handle newlines & backreferences, & extracts bitflags from the resulting buffers.
The parser uses a handwritten lexer (mostly converting singular chars to an enum) that consults various flags, an in-memory scanner shared with the interpretor, & a handwritten pushdown automaton emitting a generic binary abstract syntax tree with minor postprocessing to terminate the regexp.
Once parsed, this generic abstract syntax tree is compiled to a non-deterministic finite automaton (NFA) via some initial allocations, maybe a postorder traversal to inline all the subexps it can, postorder traversals to lower any remaining subexps into open/close opcodes, to extract the first match for each branch, to extract the next match, & to connect the resulting NFA graphnodes with labelled edges representing control flow; it recursively computes epsilon closures (starting DFA lowering) for each NFA graphnode, & maybe (if needed) computes inverse epsilon closures.

Then if possible it optimizes to traverse UTF8 bytes rather than chars, & regardless expands the initial state into a DFA.
To evaluate the regexp it calls an interpreter callback on the DFA if it's already been lowered, which normally checks whether the edgetable is larger than 1kb, in which case it frees any attached state on those edges first (seemingly to avoid memory leaks), checks if it needs to reallocate the transition table, possibly allocates a transition set for multibyte chars, & repeatedly looks up (fairly sophisticatedly) a DFA edge to follow, constructing a new one from NFA edges where needed.
Failing that, interpreting the regexp, with ranges validated & a lock held, involves gathering some bitflags, constructs entry-lookuptables if needed, allocates a registertable, & before cleaning up performs some validation, allocs an input string & match context & DFA state table, computes initial state, & intreprets each match_kind
yielded from the NFA/DFA graph traversal using the fastmaps where possible. Afterwards selecting a single answer to return from all possible matches.
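Once everything has been lowered all the way to a DFA, the heart of evaluation is just a table walk. A minimal sketch, assuming a hand-filled transition table for the regexp "ab+" — nothing like GNU's actual layout, which builds transitions lazily & copes with multibyte chars:

    #include <stdbool.h>
    #include <string.h>

    #define NSTATES 3
    #define NCHARS  256

    /* Transition table for a DFA recognizing "ab+": state 0 is the start,
     * state 2 accepts, -1 marks a dead transition. */
    static int trans[NSTATES][NCHARS];

    static void dfa_init(void) {
        memset(trans, -1, sizeof trans);
        trans[0]['a'] = 1;   /* saw 'a' */
        trans[1]['b'] = 2;   /* saw "ab" -- accepting */
        trans[2]['b'] = 2;   /* further 'b's stay accepting */
    }

    static bool dfa_match(const char *s) {
        int state = 0;
        for (; *s; s++) {
            state = trans[state][(unsigned char)*s];
            if (state < 0) return false;   /* fell off the graph: reject */
        }
        return state == 2;
    }

Call dfa_init() once; then dfa_match("abbb") succeeds while dfa_match("ba") doesn't.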
Grep
grep outputs just the lines of its input file(s) which match a given regexp.
To do so, after initializing internationalization & registering to close stdout on exit, grep initializes a hashmap of regexps, parses flags handling digit flags specially allowing selecting between 4 regexp syntaxes, outputs specified messages whilst normalizing commandline args, frees the original pattern hashtable, validates stdin is a chr-device & not /dev/null, validates those commandline flags, determines whether to colourize the output & initializes it if so parsing which colours it should use from envvars, chooses a “unichar-mask” for the charset, normalizes the options for selecting a regexp engine handling fgrep specially, compiles the given regexp via the selected callback, evaluates that regexp to determine whether to skip blank lines, determines the output file to write to with or without configuring binary mode, determines the bytesize & alignment for the buffer & allocs it, & iterates over each file from the commandline arguments (with defaults).
For each, handling "-" specially, it opens the file with configurable bitflags following symlinks, fstats it, validates whether it actually wants to process this file, considers iterating over directories filtering entries with co-recursion & GNU CoreUtils' filetree traversal utility, reads the file bufferfuls at a time (checking for NUL chars, or removing them) looking for newlines, evaluating the regexp via a pre-selected callback with some normalization, buffered-printing and/or counting results. The code for printing results is split into several functions handling each line & in turn the filename/line, prefix, match, & suffix.
Usually grep will use some variation or other of its own internal regexp engine.
To compile a regexp (done upon startup) it allocs some memory, normalizes bitflags, repeatedly locates each backref in the pattern lowering them away & compiling each resulting regexp (presumably) just like Gawk did via a light wrapper, performs some tweaks on the resulting DFA, & compiles the result again just like Gawk. With a fastpath for locating a fixed number of substrings in the haystack.
To interpret the resulting DFA it extracts the DFA superset & whether a fastpath can be taken, & for each char it maybe defers to the keywords matcher & converts back to DFA traversal, maybe evaluates the DFA superset followed by a mem[r]chr call, & evaluates the DFA itself, all to compute a startpointer. Or it already knows the startpointer. Regardless it finishes each iteration by validating the position, then iterates over the regexps evaluating them, normalizing results & selecting the best result.
The keywords-search fastpath can be treated as an alternative regexp engine, which is parsed by splitting input by newlines & evaluated by iterating over each start position, deferring to this fastpath before possibly memrchr, the full regexp engine, & repeating the fastpath.
But more generally it has a datastructure with an allocator/destructor prominently storing a trie (with fast & slow paths), such that the core logic is evaluating the trie via the Boyer-Moore or Aho-Corasick algorithms.
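For a single fixed keyword the Boyer-Moore family boils down to a "how far can I safely skip" table. Here's a minimal Horspool-variant sketch; the real matcher in grep/gnulib is far more elaborate & handles multiple keywords via that trie:

    #include <stddef.h>
    #include <string.h>

    /* Boyer-Moore-Horspool: returns the offset of the first occurrence of
     * `needle` in `hay`, or -1.  The skip table records, for each byte, how far
     * the window may jump when that byte is the last one examined. */
    long bmh_search(const unsigned char *hay, size_t hlen,
                    const unsigned char *needle, size_t nlen) {
        size_t skip[256];
        if (nlen == 0 || nlen > hlen) return -1;
        for (size_t i = 0; i < 256; i++) skip[i] = nlen;
        for (size_t i = 0; i + 1 < nlen; i++) skip[needle[i]] = nlen - 1 - i;

        for (size_t pos = 0; pos + nlen <= hlen; pos += skip[hay[pos + nlen - 1]]) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return (long)pos;   /* full window matched */
        }
        return -1;
    }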
You can choose to use LibPCRE instead to evaluate regexps.
There’s a scanner shared by the DFA engine & trie fastpath.
And yes, after having just described Gawk’s regexp engine I’m not digging into Grep’s any further.
Make
To tie together a project's build toolchain (once you've downloaded, uncompressed, & unarchived the source code) UNIX offers the make command. Though personally I prefer Meson/Ninja, which so happens to be on my LFS schedule.
Most of the UNIX commands I’m describing on this LFS schedule are used in some build toolchain or other.
To do so make does the following in a convoluted, repetitive manner, navigating potentially unpredictable dependencies:
- interfaces with debugging utils
- configures stdout
- special-cases now-legacy (then-current) operating systems
- initializes OS-specific globals
- initializes internationalization
- configures signal handlers
- extracts the program’s name from argv & (for VMS) envvars
- validates jobserver access credentials
- initializes various hashmaps
- retrieves the current working directory
- populates numerous variables into a couple of those hashmaps, including from commandline, (via semi-OS-specific code) envvars, hardcoded OS-specific constants, & more
- registers a GNU Guile function
- parses commandline flags & (specifying goals & vars) other arguments, both from the shell & the [GNU]MAKEFLAGS envvar
- parses default suffixes & registers OS-specific default suffix rules
- retrieves which makefile to interpret & which of its "goals" to work towards
- parses the given makefile (described below)
- outputs requested information
- locates default shell
- configures syncing
- initializes jobserver
- populates a files deduplicating hashmap
- reformats suffix format rules into pattern format rules
- flushes buffered output
Then it runs some core logic on success, redefines the "MAKEFLAGS" variable, loads any new files into the hashmap, attempts to unlink stdin, fills in a default goal possibly with a file lookup, & diagnoses failures whilst recomputing the buildplan differently.
Upon parsing the given makefiles it defines "MAKEFILE_LIST", retrieves "MAKEFILES" with undefined-variable warnings disabled, then iterates over the whitespace-separated tokens to parse those files. Then parses the given files with OS-specific fallbacks.
To parse each of those it allocates & zeroes a "goaldep" prepending it to a linkedlist, saves the filename with line number & offset for error messages, checks given flags for localized warnings to report, maybe expands tildes in the filepath, carefully opens the file reporting ENOMEM errors, falls back to searching the configured include_directories for the makefile, opens it & enters it into a deduplication hashmap, checks for NULL files, clears errors, disables subprocesses from inheriting this open file, appends to the "MAKEFILE_LIST" var, resets various buffers, "evaluates" the text, & cleans up.
Evaluation, after initializing various fields & mallocing a string, involves for each line (checking for a trailing ";"-escape to immediately repeat the fgets call whilst counting up those newlines to add to the line number): it skips a UTF-8 byte-order-marker on the first line, skips empty lines, if the line starts with a tab parses it as a shellcommand for the previous rule, otherwise allocs a new copy, deletes escaped newlines & backslashes escaping backslashes (memmove shortcut), replaces unquoted comment or var chars (via lookuptable) with a nil terminator, skips preceding whitespace, parses var assignments with modifier keywords & a handful of unbracketed infix assignment ops (upon success saving to the AST, ending the previous rule, & adding to the hashmap after parsing & expanding the value), otherwise skips empty lines, if flagged as in an "ignored define" skips further processing after considering whether to exit this block, otherwise parses conditional lines to consider whether to enter such a block (or a more complex variant), looks up "[un]export" vars when not paired with an assignment to flag them appropriately, upon "vpath" declarations saves its value whilst expanding vars ending the previous rule, upon "include" declarations ends the previous rule & co-recurses, upon "load" declarations enqueues parsed files (after expanding vars & closing the previous rule) into the linkedlist of "goaldeps", warns about missed shellcommand lines, otherwise looks for a trailing uncommented/quoted semicolon ";", collapses escaped newlines again, lexes the pattern, expands any vars, repeatedly lexes the pattern with an infix unquoted/commented semicolon, skips whitespace, validates there's something to do in response, attempts to parse a var assignment, otherwise unescapes any equals, expands any vars, locates the semicolon, determines which infix operator was used, normalizes filepaths to UNIX format on DOS or AMIGA, & with a little bit more preprocessing opens a new rule. After this loop it cleans up, validating all conditional statements are closed & committing the last rule.
A handwritten line-oriented parser.
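To give a feel for what a line-oriented makefile parser looks like at its simplest, here's a toy sketch — nowhere near Make's real grammar (no continuations, conditionals, includes, or variable expansion) — that only distinguishes tab-prefixed recipe lines from "target: deps" rule lines:

    #include <stdio.h>
    #include <string.h>

    /* Toy makefile reader: a line starting with a tab is a recipe for the
     * previous rule, a line containing ':' opens a new rule, '#' starts a
     * comment, anything else is ignored.  Real make layers variable expansion,
     * conditionals, includes, & escaped newlines on top of a loop like this. */
    static void parse(FILE *in) {
        char line[4096];
        char target[256] = "";
        while (fgets(line, sizeof line, in)) {
            line[strcspn(line, "\n")] = '\0';
            char *hash = strchr(line, '#');
            if (hash) *hash = '\0';                    /* strip comments */
            if (line[0] == '\t') {
                printf("recipe for %-12s : %s\n", target, line + 1);
            } else if (strchr(line, ':')) {
                char *colon = strchr(line, ':');
                *colon = '\0';
                snprintf(target, sizeof target, "%s", line);
                printf("rule %s depends on%s\n", target, colon + 1);
            }
        }
    }

    int main(void) { parse(stdin); return 0; }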
The AST involves a hashmap tracking string vars with various metadata (where/how it was defined, whether to export, conditionality, etc). Also vars can be flagged to be expanded upon lookup not just storage. These hashmaps can be chained in a linked list/stack.
The AST also consists of “pattern var” linkedlists, specifying a suffix, target, & var.
But mostly the AST consists of a linkedlist of “rules” specifying target strings each with their own length, suffixes for each target, dependencies, commands with count, whether to fork a shell, & a use flag. These can be parsed from a lexed “pspec”.
A command associated with a rule references the source code where it was defined & consists of the shellcommands concatenated or chopped into an array of lines, per-line bitflags, a linecount, a "recipe prefix", & a bitflag for recursion.
A goaldep consists of dependencies, an error, & a reference to its source code. Dependencies are represented as an inlinable linkedlist consisting of a namesequence, file, stemstring, & bitflags. Nameseqs in turn are a string inlinable linkedlist.
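Roughly, those pieces might be shaped like the following structs. The field names here are invented to mirror the description above rather than copied from Make's sources, which carry far more bookkeeping:

    struct file;                       /* opaque here: make's per-file record */

    struct nameseq { struct nameseq *next; const char *name; };

    struct dep {                       /* inlinable linkedlist of dependencies */
        struct dep *next;
        const char *name;
        struct file *file;
        const char *stem;
        unsigned changed : 1, ignore_mtime : 1;
    };

    struct commands {                  /* recipe attached to a rule */
        const char *defined_at;        /* source location for error messages */
        char **command_lines;
        unsigned char *lines_flags;    /* per-line bitflags */
        unsigned short ncommand_lines;
        char recipe_prefix;
        unsigned any_recurse : 1;      /* does any line recurse into make? */
    };

    struct rule {                      /* one entry of the rule linkedlist */
        struct rule *next;
        char **targets; unsigned short *target_lens, num_targets;
        char **suffixes;
        struct dep *deps;
        struct commands *cmds;
        unsigned fork_shell : 1, in_use : 1;
    };

    struct goaldep {                   /* goals from the commandline/makefile */
        struct goaldep *next;
        struct file *file;
        int error;
        const char *defined_at;
    };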
If that initialization succeeded it outputs debugging info, counts the number of makefiles to allocate an array, checks if each rule has a dependency on shell-commands with no transitive dependencies removing those entries from the linkedlist, otherwise recording the file mtime into the array, redefines "MAKEFLAGS", with db_level temporarily set to DB_NONE & a bitflag computes a partial ordering whilst evaluating it, & diagnoses failures.
To (initially) diagnose failures it checks whether any makefiles were remade or deleted. Upon success it deletes configured intermediate files, possibly outputs the vars/rules/etc for debugging, releases/resets the jobserver, possibly repopulates argv, restores the original working directory, outputs a message that it's reexecuting, recomputes envvars handling Amiga & VMS specially, defines $MAKE_RESTARTS, fflushes stdout & stderr, has the jobserver fork a new child, & OS-specifically executes the computed command.
That process only runs when the parser indicates the makefiles (a lot of work to go through to support arguably bad coding!) might rewrite themselves; normally computing & evaluating the partial ordering is essentially the last task performed.
To compute & evaluate a partial ordering of goals it copies the linkedlist of goals, then repeatedly iterates over it until no more goals are left to be evaluated.
Before each iteration over the goals it reaps all dead children, both during & after forking & executing each linkedlist-queued job, with plenty of OS-specific bookkeeping & error diagnosis. After each iteration it might increment a counter. After all iterations it might update some globals.
In the core logic, for each goal make iterates over the files it updates/generates before possibly outputting status information and/or removing the goal from the todolist. For each file it copies the don't-care bitflag for the goal over to the file, checks whether the file's being renamed by following in-RAM indirection, if rebuilding makefiles updates some globals based on a flag on the file, saves the value of commands_started, considers running the commands to generate the file, checks whether that has renamed the file following any new indirections, flags the goal as having caused a change if new subprocesses were spawned, diagnoses failures, aggregates file-updated flags, unsets the don't-care flag, & considers exiting the loop.
Diagnosing failures involves checking flags & previous failures to see if there was one; if the file has an update-status it saves it for the next iteration & checks some flags for whether to exit after this iteration, otherwise it checks the file's timestamp & if things have changed checks some bitflags to determine whether to clear previous failures or end after this iteration.
To update a file, after removing duplication within each iteration over the goals, make iterates over the files generated from this file checking whether we care about each & whether it has or is running a command to generate it, if not flags the file as being updated, checks modified time & renaming, iterates over the also-make linkedlist & their dependencies, then possibly iterates over its dependencies which have "intermediates", unflags the files as being updated, checks whether commands are running updating flagged state appropriately, if not checks whether the command has failed & whether any of the files have been updated, if not checks whether dependencies were running updating flagged state having not been started, iterates over dependencies again, decrements depth, diagnoses syntactic errors, checks whether we actually need to remake this file, checks the filename to determine whether to set the file's ignore_vpath flag, if there are commands attached splits them into lines and spawns & starts a new job for each (possibly via the jobserver) otherwise synthesizing a success/failure status, updates info on the newly-generated file, outputs debugging info, & flags the file as updated.
For each also-makes' dependencies make checks whether the file's been renamed & its modified-time, skips files which are currently flagged as being updated, if rebuilding makefiles checks don't-care flags, recursively checks whether dependencies need to be updated, rechecks a couple things, checks if those files are currently running, considers exiting this loop depending on the dependency's status & make's own commandline flags, & updates the dependency's changed flag.
For each dependency with an intermediate, if it iterates over them at all, make performs a cut-down version of the previous loop.
And later during diagnosis, for each dependency make checks the file's modified-time & renaming, possibly or's together whether dependencies have changed, updates whether this dependency has changed, & reports any failures to stdout.
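Stripped of all that bookkeeping, the decision at the heart of it is the classic one: rebuild a target when it's missing or older than any of its dependencies, depth-first. A minimal sketch assuming a simple in-memory tree of targets with mtimes (no jobserver, pattern rules, or phony targets):

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    struct target {
        const char *name;
        time_t mtime;                  /* 0 = does not exist yet */
        struct target **deps;
        int ndeps;
        void (*recipe)(struct target *);
    };

    /* Recursively bring `t` up to date; returns true if anything was rebuilt. */
    static bool update(struct target *t) {
        bool rebuilt = false;
        bool out_of_date = (t->mtime == 0);
        for (int i = 0; i < t->ndeps; i++) {
            rebuilt |= update(t->deps[i]);             /* depth-first over deps */
            if (t->deps[i]->mtime > t->mtime)
                out_of_date = true;                    /* dep newer than target */
        }
        if (out_of_date && t->recipe) {
            printf("remaking %s\n", t->name);
            t->recipe(t);
            t->mtime = time(NULL);
            rebuilt = true;
        }
        return rebuilt;
    }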
Patch
Once you've had diff compute the changes between 2 files/directories it can be useful (e.g. to lightweight-fork a project; I've seen that a few times) to interpret the resulting difffiles thus reapplying the previous changes, even if we didn't ask diff to spit out an edscript. For that we have the patch command.
After initializing the program name, time, stderr buffering, manual buffer, envvar config, backup hashtable, & linkedlists of files to delete or output, patch parses commandline flags then captures remaining args, maybe sets its TZ envvar to UTC, maybe configures the backup "version" via a helper library (variants of which are vendored into most GNU projects primarily for portability), initializes & maybe opens the output target, configures various signals to be ignored or trigger a crash, decides whether to use unsafe mode, decides whether to take the copy fastpath, iterates over each given difffile, outputs any errors, & cleans up.
There's an iterator implemented over the given difffiles which carefully opens each one & sniffs which format it uses.
For each newly-opened difffile, resetting state between iterations, patch decides whether to flag failures, specially handles an edgecase regarding gitdiffs, considers unlinking a couple of tempfiles, bails upon wanting to change filetype, attempts to recognize concatenated gitdiffs, considers carefully reading the file being modified into a buffer, validates the output file isn't readonly, generates a tempfile reporting failures or setting a cleanup flag, hands edscripts off to a newly-spawned $ED reporting errors, otherwise validates it's not a binary file, ensures there is an output, considers splitting input into buffered lines validating renaming & symlinks, processes each "hunk" of input, validates & flushes results, blocks signals during cleanup, flushes output possibly tidying up after failures, & moves the new file into place.
For each line-based parsed "hunk" patch computes the max fuzzy-matching, repeatedly tries to locate the best match for where to start applying this chunk of the diff (possibly with debugging messages, ignoring surrounding whitespace), considers closing the output file, & interprets the hunk via merge, apply, or abort codepaths with error reporting.
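The "best match" search is, at its core, a scan outward from the line the diff expects with a steadily loosening tolerance. This toy sketch captures only that idea — patch's real matcher compares hunk context lines, honours --fuzz by relaxing edge context rather than counting arbitrary mismatches, & can ignore whitespace — so treat the mismatch-counting here as an invented stand-in:

    #include <string.h>

    /* How many of the hunk's context lines mismatch if applied at `at`? */
    static int mismatches(char **file, int nfile, char **ctx, int nctx, int at) {
        int bad = 0;
        for (int i = 0; i < nctx; i++)
            if (at + i >= nfile || strcmp(file[at + i], ctx[i]) != 0)
                bad++;
        return bad;
    }

    /* Search outward from the line the diff claims (`want`), accepting the
     * first offset whose mismatch count fits within the current allowance. */
    static int locate_hunk(char **file, int nfile, char **ctx, int nctx,
                           int want, int max_fuzz) {
        for (int fuzz = 0; fuzz <= max_fuzz; fuzz++)
            for (int off = 0; off <= nfile; off++) {
                if (want - off >= 0 &&
                    mismatches(file, nfile, ctx, nctx, want - off) <= fuzz)
                    return want - off;
                if (off && want + off < nfile &&
                    mismatches(file, nfile, ctx, nctx, want + off) <= fuzz)
                    return want + off;
            }
        return -1;                      /* no acceptable position: abort the hunk */
    }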
To merge a hunk into the output it skips '=' lines to the first valid start line with the aid of a dynamic longest-common-substring algorithm, tweaks the old input line, computes diffs via that common & vendored lib, possibly outputs per-line debugging info, copies data from input to output, & before final output & cleanup repeatedly checks the initial character of each line. On conflict it adjusts iterators to locate the source, outputs common prefix lines, determines whether to ignore it as having already been applied possibly decrementing iters by it, copies output, and maybe outputs the conflict in the syntax you're familiar with from git.
For '-' old lines corresponding to '+' new lines it skips over subsequent old '-' lines whilst checking for conflicts, skips over new '+' lines, counts then outputs those lines, & updates counters.
For ‘ ‘ old lines followed by ‘-‘ skips subsequent cases reporting conflicting upon new ‘+’. For ‘ ‘ old lines followed by ‘+’ skips subsequent ‘+’ lines & copies them to output. Otherwise copies runs of ‘ ‘ lines to output.
Otherwise it validates it's the end of the merge.
Applying a hunk involves skipping over '=' lines, interpreting each line, considering flushing output possibly followed by conditional preprocessing & maybe subsequent plus chars, considering other conditional processing, & updating counters. For old '-' lines it copies output until that point, possibly outputs conditional preprocessing with error handling & state adjustment, & increments counters. For lines after the end it exits. New '+' lines are treated similarly to '-'.
If old & new chars are otherwise unequal it complains. If the new line is '!' it's treated similarly to '+' or '-' but with additional skip-loops over '!' lines both before & after the second conditional preprocessing output. Otherwise it increments counts & maybe outputs that conditional preprocessing.
Aborting a hunk creates a tempfile & outputs parsed info into it differently for unified vs context syntaxes.
Most of the code seems to be in lexing diff & target files. And parsing those diff files.
Bison
So much of UNIX's tooling consists of domain-specific languages to parse, and to make those easier to write in C they (eventually) created YACC, which GNU rewrote as Bison. Arguably all commandline programs are domain-specific languages! It is even used by Gettext (where in any other language I would've written a Shunting Yard or Pratt parser…), and most languages have equivalents! Even if that equivalent is Bison itself!
The first thing Bison does is retrieve $BISON_PROGRAM_NAME if present into argv[0] & a global, followed by initializing internationalization, quoting as per $LC_CTYPE, on-exit cleanup, the glyphicons table (preferably Unicode), a deduplicating hashmap for strings, the "muscles" table + obstack allocation, line truncation data & tables for how seriously to treat different kinds of errors, & an obstack/flag for lexing.
Then it parses extensive commandline flags, preprocessing to extract colour options first.
Bison's extensive commandline parsing integrates the internal "complaints" infrastructure with special line-numbering handling, populating the tables I described earlier amongst other globals. Afterwards it validates there's a single extra commandline arg which it deduplicates & stores in a global & the "file_name" interpreter var. As always --help & --version are handled specially.
With extensive & optional time-profiling & some profiling bitflags it parses the input file exiting upon fatal error, garbage collects the AST, computes derivatives & nullability from it, converts it into a statemachine, computes lookahead sets, detects conflicts & complains about them, & compiles the output code! I'll dig into the C/etc output more later…
Compilation is done by allocating arrays for precomputed counts (populated with the aid of some temp arrays bucketsorting actions), outputting additional warnings, & if desired:
- Computes output filenames with appropriate extensions
- Outputs a file reporting rules marked as useless, states marked as conflicting, all useful grammar rules, terminals & all rules referencing them, the same for non-terminals, & the state machine
- Iterates over each state generating a nontrivial label & outputting a graphviz node followed by corresponding edges with reductions in red, all surrounded by a graphviz header & footer. A separate internal module abstracts graphviz
- Outputs the grammar & statemachine as XML without an XML serializer
- For HTML it runs that XML file through a bundled XSLT file via a configurable XSLT processor!
- Unconditionally (as long as there weren't any errors) frees lookahead tokens.
- Conditionally again outputs the statemachine to C in multiple passes deferring to M4 scripts I'll describe later.
From there it frees all the various data it has allocated whilst applying minor fixes keeping a backup file.
Parsing Bison Files
As a language in its own right before it can compile a parser Bison must parse its own syntax!
To do so it starts by temporarily allocating a symbols hashmap (keys are deduplicated in their own hashset which speeds up hashing for this hashmap!) prepopulated with “$accept”, “YYerror”, “error”, “YYUNDEF”, & “$undefined” symbols. Followed by a semantic type hashtable also keyed by uniqstrs.
Then initializes lexer/scanner with its underlying file & obstack, then parser with its own obstack!
Bison’s parser is self-hosted with atypically few tweaks most of which are involved in AST construction or error reporting. It defers to a Flex program for lexing utilizing C macros for line/col counts & string capturing. Each directive gets its own token whilst utilizing Flex’s statemachine.
This Abstract Syntax Tree is quite flat, beyond the symbols hashtables it populates. The lexer handles deduplicating strings. Lexer & underlying file/obstack are closed after parsing.
The parser also populates the "muscle" tables, & afterwards default values are filled in for the "lr.keep-unreachable-state", "lr.type", "lr.default_reduction" (based on the retrieved value of "lr.type"), & "tool.xsltproc" muscles. Then it validates various core muscles are given a valid keyword value.
If there weren't any complaints it'll validate syntax rules were actually defined, generate an EOF token if missing, locate the start symbol if unspecified & gather a list of startrules, iterate over symboltables in sorted order checking everything is defined, validate the tokens count, resolve & remove aliases, validate startsymbols are defined & not tokens, run a C Flex lexer to handle variable renaming, merge AST rules to generate the rules & ritems arrays with name resolution & additional checks, & free the raw AST.
Bison Optimization
Once Bison has parsed a grammar it needs to be optimized before it can be lowered & compiled to C.
The 1st step is deduplication utilizing 4 bitsets, involving computing the closure of all rules which match an empty string followed by another closure (with a little postprocessing) indicating which rules are actually used. a.k.a. garbage collection! If either of those detected any changes to make it'll complain about it before removing those rules keeping indirect references functional.
The next step does some lowering with 2 temp arrays in 2 loops & a segmented newly-allocated global array for an output. The first loop locates infix operators populating an array of linkedlists. The second loop populates the global derives array, flattening those linkedlists.
A closure is computed via 3 temparrays & a temp queue for all rules that can match an empty string in 2 passes. The first detects all direct cases, the 2nd detects indirect cases.
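That "can this rule match an empty string" closure is a textbook fixpoint computation. A small sketch over an invented grammar representation (Bison's real version works over its packed rule/ritem arrays & bitsets rather than these structs):

    #include <stdbool.h>
    #include <string.h>

    #define MAXSYMS  32
    #define MAXRULES 32
    #define MAXRHS    8

    struct rule { int lhs; int rhs[MAXRHS]; int len; };   /* symbols as small ints */

    /* Compute which nonterminals can derive the empty string: keep sweeping the
     * rules until a pass adds nothing new.  The first sweep catches the direct
     * (empty right-hand side) cases, later sweeps the indirect ones. */
    void compute_nullable(const struct rule *rules, int nrules,
                          const bool is_terminal[MAXSYMS], bool nullable[MAXSYMS]) {
        memset(nullable, 0, MAXSYMS * sizeof(bool));
        bool changed = true;
        while (changed) {
            changed = false;
            for (int r = 0; r < nrules; r++) {
                if (nullable[rules[r].lhs]) continue;
                bool all_nullable = true;
                for (int i = 0; i < rules[r].len; i++) {
                    int sym = rules[r].rhs[i];
                    if (is_terminal[sym] || !nullable[sym]) { all_nullable = false; break; }
                }
                if (all_nullable) { nullable[rules[r].lhs] = true; changed = true; }
            }
        }
    }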
Bison Lowering
Once Bison has parsed & simplified a grammar it needs to be lowered into a statemachine! Though to be honest I failed to comprehend this code well…
The initial pass involves iterating over all rules to generate an initial state before iterating over states as they are added, for each computing closures via bitmasks, saving reduction rules as an array in the state, locating outgoing edges with or without debug output before generating/retrieving corresponding states from a hashmap (enqueued for processing if new), & saving transitions to the state.
The second phase of lowering deals with issues of lookahead!
The 1st subphase of which is handled differently depending on how "lr.type" is configured. In simple cases it iterates over states & their edges, symbols twice, & states & their edges again to compute a consolidated gotomap. Let's skip over more complex cases…
The 2nd subphase uses bitmasks to compute the set of lookahead states, filters out "follow" edges amongst those, then edges to nodes with the greatest number of edges amongst those. After freeing the lookaheadset it processes the follow edges some more, frees yet more bitsets, & maybe gathers a predecessors set in 4 iterations over the states.
In the 3rd subphase Bison might allocate & populate metadata tables/”annotation lists” from various tables it previously computed followed by some counting, debugging output, & deallocation.
In the 4th subphase it uses bitsets to detect duplication in the graph.
And the final subphase resembles the first!
The 3rd phase of lowering deals with conflicts in the graph, consulting configured operator precedence with the aid of some newly-allocated bitmasks & obstacks. For each state (with some handling tweaks) it gathers a lookahead set then iterates over all symbols requiring a lookahead before cleaning up & reporting errors involving a 2nd iteration.
For each such symbol it iterates over all other symbols checking their precedence & tweaking graph actions appropriately with debugging output.
After resolving conflicts it might (this pass can be disabled…) recursively gather a bitset of all reachable states so it can iterate over all the states & free unreachable ones. Then it iterates over the edges followed by terminals to rename the state IDs to be more contiguous again. Then over the states to update the conflicts table.
If any conflicts still remain appropriate complaints will be written out possibly with clarifying counterexamples.
Bison Compilation
Having described how Bison parses & optimizes its input code, as well as lowers the grammar to a statemachine, today: how it compiles to lookuptables then C! Or, say, Java.
Generating those tables involves allocating all of them, clearing useful flags on all rules, processing all states whilst updating those useful flags utilizing 4 temparrays, processing each tokensymbol populating the yydefgoto table with another temparray, freeing the remaining temparrays, bucketsorting actions, & cleaning up.
In that first loop, for each state it zeroes out the actrow & conflrow arrays, iterates over everything referenced requiring a lookahead to populate those temparrays, records which symbols trigger "shift" actions into actrow, records explicit actions into actrow, checks the lr.default_reduction muscle, & fills in error actions.
Then non-zero actions are counted & if there are any it allocs froms, tos, & maybe conflict_tos arrays for the state & populates them from the actrow temparray.
The iteration over tokens similarly populates froms/tos arrays from essentially the previously-computed matrix.
Once it has computed the parsingtables it lazily de-lowers some of the code it is compiling to complain about symbols whose precedence is flagged as useless.
Then it outputs the specified files (as described previously), which for C involves subphases outputting symbols, rules, states, actions, symbol definitions, & (after some preparation) a skeleton file. Then it tidies up in the event of error. All with an obstack allocated for formatting.
For symbols Bison generates several muscles (tokens_number, nterms_number, symbols_number, & code_max) including an array (translate), validates tname & symbol_names, maybe generates translatable flagging every translatable symbol if there are any, & each token's "content code" into a toknum muscle.
For rules it allocates several temparrays, iterates over rules to flatten them into those arrays, & turns them into muscles (rhs, prhs, rline, r1, r2, dprec, merger, & immediate).
To output states Bison extracts each's accessing_symbol property to output into the stos muscle, followed by storing the last, final_state_number, & states_number muscles.
For actions Bison converts the tables generated earlier (as described this morning) into "muscles" (defact, defgoto, pact, pact_ninf, pgoto, table, table_ninf, check, conflict_list_head, & conflicting_rules).
For symbol definitions it generates a separate muscle for each token then several per symbol & their properties.
The final preparation involves straightforwardly adding several more muscles! (version_string, version, required_version, header_flag, flr_flag, nondeterministic_flag, synclines_flag, token_table_flag, use_push_for_pull_flag [from the corresponding envvar], yacc_flag, maybe prefix, file_name_all_but_ext, dir_prefix, mapped_dir_prefix, parser_file_name, spec_header_file, spec_mapped_header_file, spec_file_prefix, spec_name_prefix, spec_outfile, spec_verbose_file, maybe skeleton, & skeletonsdir).
With all the muscles generated Bison then locates all the appropriate .m4 files, invokes M4, outputs various data into M4's stdin including most importantly all these "muscles" it read & generated, & parses the result to lightly reformat it into the resulting output file!
Don’t ask me about these M4 files though, I’m finding M4 to be quite an illegible language…
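In effect the muscle table is a string-keyed key/value store that gets serialized as m4_define statements on M4's stdin. A hedged sketch of that handoff — the m4_define/b4_ naming follows Bison's conventions, but the struct, function, & sample values here are invented:

    #include <stdio.h>

    struct muscle { const char *key, *value; };

    /* Serialize each muscle as an m4 definition.  Real Bison escapes values,
     * distinguishes several muscle kinds, & pipes this into m4 alongside its
     * skeleton files; here we just print to a stream. */
    static void muscles_output(const struct muscle *m, int n, FILE *out) {
        for (int i = 0; i < n; i++)
            fprintf(out, "m4_define([b4_%s],\n[[%s]])\n", m[i].key, m[i].value);
    }

    int main(void) {
        const struct muscle m[] = {
            { "parser_file_name", "parse.c" },      /* invented sample values */
            { "tokens_number",    "12" },
        };
        muscles_output(m, 2, stdout);
        return 0;
    }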
“Flex” Lexer Generator
Several programs (documented on this page here) are essentially interpreters for domain-specific programming languages. If you want to be uselessly vague you could reasonably suggest this describes every program, especially on the commandline!
However most of these programs are written in C or C++, which is inconvenient & unsafe for this. So we use a pair of DSLs!
In parsing nontrivial text it’s useful to preprocess input abstracting it into a sequence of “tokens”.
After possibly initializing internationalization, to aid that abstraction Flex configures an exit path, initializes various globals & parses commandline args into them whilst loading new macros or prepending syntax. Then it runs its self-hosted parser & starts outputting code before filling in back edges from forward edges & elsewhere, interprets a "skeleton" file, converts the NFA to a DFA, outputs any errors, generates a state transition table, & outputs final errors & code comments.
Parsing Flex files
As a domain-specific language itself Flex requires its own lexer & parser, which are implemented using itself. I'm reading through these today to see if there's anything else worth remarking on.
The lexer manipulates various globals, most notably an “action” stringbuffer of M4-preprocessed C code as it spits out tokens.
Plenty of lowering is performed during parsing, constructing NFA graphs with simplified equivalences & dispatch "character class" smallintmaps.
As it finishes parsing a rule Flex generates some dispatchcode into that “action” stringbuffer of M4-preprocessed C code to precede what the lexer is about to append there.
It also resolves any “symbols” through a hashmap.
Both the lexer & parser put effort into emitting nice error messages.
That NFA is a labelled graph where input text identifies paths (plural!) to traverse, with unlabeled edges part of any traversal. Can be constructed from charclass tables.
Lowering NFAs to DFAs
Once a Flex file has been parsed into prelude M4-preprocessed C code, an NFA (labelled graph), & char dispatch tables, Flex must lower that NFA to dispatch tables via a DFA in order to convert it into C code.
To deduplicate edges whilst collapsing always-taken epsilon edges, thus converting the NFA into a DFA… It starts by initializing some collections & possibly outputting the NFA for debugging, possibly with some buffered code representing state 0.
With that it enters a double iteration over "start states", start-of-line tokens or not, merging each of their epsilon-closures into a single DFA state. Via a depth-first traversal over epsilon edges it checks whether we've already generated a DFA node for that NFA state, & failing that initializes a new NFA node into the NFA array, associating a sorted NFA-states mapping for later deduplication.
All such start DFA nodes might then be similarly merged.
Once it has tackled the start states it traverses the graph & their (partitioned) outgoing edges locating duplicate edges to merge & traversing any subsequent epsilon edges to yield all the NFA states to merge into a single DFA state.
Then it finishes by counting & maybe debug-outputting DFA states for "backing up", finalizing some state, maybe relatively directly outputting a compressed datatable into the buffered M4-preprocessed C code, & freeing data.
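The epsilon-closure at the centre of this subset construction is a plain graph-reachability walk; a sketch over an invented NFA representation:

    #include <stdbool.h>

    #define MAXSTATES 128
    #define MAXEPS      4

    /* Each NFA state lists the states reachable via unlabeled (epsilon) edges. */
    struct nfa_state { int eps[MAXEPS]; int neps; };

    /* Depth-first walk marking every state reachable from `s` without consuming
     * input; `closure[]` doubles as the visited set. */
    static void epsilon_closure(const struct nfa_state *nfa, int s,
                                bool closure[MAXSTATES]) {
        if (closure[s]) return;
        closure[s] = true;
        for (int i = 0; i < nfa[s].neps; i++)
            epsilon_closure(nfa, nfa[s].eps[i], closure);
    }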
Compiling DFAs
To compile that DFA into C code: amongst some more of the skeleton (which I think I'll cover tomorrow…) it outputs calculations & if-checks computing memory-sizing at appropriate indentations (assuming tabs are 8 chars wide), outputs state counts, declares a state variable, converts the DFA to runtime datastructures possibly with compression minimizing bitwidths with minor preprocessing, optionally outputs a linecounter, outputs backtracking vars & data, & possibly a few more vars.
From there, amongst some more of the skeleton, it starts outputting the main lexing loop, reading data via the read or getc calls, before (amongst more of the skeleton) starting to output the controlflow consulting the tables with or without backtracking logic or compression. Then it generates code to branch over the final couple of states from that loop to determine which token to emit. Then it adds trailing exception-handling code with or without debugging, including backtracking.
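The generated inner loop ends up shaped roughly like this — table & token names invented, tables left uncompressed; real Flex output compresses them & weaves buffer-refill & backtracking logic around the same skeleton:

    /* Toy 2-token lexer: words ([a-z]+) & numbers ([0-9]+).  State 0 is the
     * start; accept[] maps accepting states to tokens. */
    enum { TOK_EOF, TOK_WORD, TOK_NUMBER, TOK_ERROR };

    static int next_token(const char **cursor) {
        static const int accept[] = { 0, TOK_WORD, TOK_NUMBER };
        const char *p = *cursor;
        int state = 0, last_tok = 0;
        const char *last_end = p;

        while (*p) {
            int c = (unsigned char)*p;
            int next = (c >= 'a' && c <= 'z') ? 1
                     : (c >= '0' && c <= '9') ? 2 : -1;
            if (next < 0 || (state != 0 && next != state))
                break;                                   /* no edge: stop the walk */
            state = next;
            p++;
            if (accept[state]) { last_tok = accept[state]; last_end = p; }
        }
        if (!last_tok) {                                 /* nothing matched */
            if (!**cursor) return TOK_EOF;
            (*cursor)++;                                 /* skip the offending char */
            return TOK_ERROR;
        }
        *cursor = last_end;                              /* longest match wins */
        return last_tok;
    }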
The Flex Skeleton
To wrap up my studies of the Flex lexer generator, I'll describe the reference sourcecode which gets compiled into the Flex binary & intermingled with its compiled output.
This file starts with a lot of C preprocessor & M4 macros based on certain M4 (or its own preprocessor prior to compiling it into Flex) conditions. Later ones use M4 to prepend a standard prefix to the C macros.
Then it includes the relevant C & possibly C++ libraries & abstracts compiler attributes.
Then this "skeleton" file defines several more macros (M4 then C then M4), & additional C macros are defined if in the implementation, some of which abstract over the input.
After all those it defines a struct around the input state & several globals, function declarations. A couple more bounds-check macros might be defined alongside conditional function signatures. It might include a few more libraries.
Then defines a stack-allocated struct for reentrant lexing, with some optional fields.
After yet more function signatures, globals, & macros… A struct & lookup-table is defined to decode lexer output.
Then the main function with a macro-specified signature. After defining its locals this does some initialization if needed, inserts Flex-compiled initializations, & enters a mainloop incorporating compiled per-iteration code. This mostly consists of a switch-statement state machine compiled from the Flex file, ending on an end-of-buffer state.
A C++ abstraction over this lexer might be outputted, followed by a routine (which may or may not be a method on that C++ abstraction) for proceeding to the next buffer from the configured files. As well as routines for querying previous state (mostly Flex-compiled code), lexing nul chars (same), & unlexing some chars, specifying a textual input buffer (with postprocessed line-tracking code compiled from the Flex file), immediately switching the lexer to given text or a file pushed onto a stack, initializing a buffer, freeing a buffer, flushing & pushing & popping buffer state, allocating a larger stack if necessary, parsing input of different types, reporting errors, & variations upon these.
Then it redefines the yyless macro upon that scanner, & defines several accessor functions, a few of which include validation. Then a few initializers generated with M4's help. Followed by some loop, memory allocation, & file I/O abstractions. And finally there's routines for parsing the bundled YYtable data.