C as a programming language has a lot of gotchas for programmers to carefully avoid. So programmers have made things easier on themselves by developing new languages without those gotchas, even if they are a bit slower. Some of these which I find particularly interesting, or which are heavily used in implementing GNU, are documented here.
Perl is a general-purpose scripting language whose design is derived from UNIX’s domain-specific languages. It is infamous for looking like a cat walked across your keyboard, & is heavily used in building GNU’s core suite, e.g. for metaprogramming.
Perl’s main function is autogenerated by the build system, & starts by calling initialization functions…
Digging through all the C macros used by Perl’s build system to target various operating systems, that initialization involves allocating memory, disabling floating-point exceptions, possibly initializing an I/O mutex in case of threading, & possibly initializing an allocation mutex.
Then it possibly registers callbacks with PThread to lock & unlock these mutexes.
It further configures floating-point handling based on the OS.
If undumping is disabled Perl allocates & (if successful) initializes an object representing the interpreter. Exactly what’s involved (including the allocation) can vary significantly based on buildflags.
This may involve populating various fields defined via C macros in a dataheader, tracking a destruct_level counter to handle reentrancy, zeroing out instruction execution counts, initializing/validating various globals, registering signal handler callback globals, GC tweaks, allocating stacks, etc, etc.
That initialization also includes several system APIs exposed to Perl scripts.
An update flag is set indicating what needs to be cleaned up. Commandline args are validated, saved to a global, & manually parsed whilst handling exceptions; followed by the referenced file, which involves further initializing APIs exposed to Perl scripts.
With the Perl script parsed, & the APIs it might use initialized, it is run!
To clean up, Perl might iterate over signals to unregister its signal handlers. Then after waiting on any child processes it’ll report any errors (or, if configured, bad state) whilst deinitializing all the previously initialized state in reverse order.
That memory is freed & the terminal I/O state is reset in an OS-specific way.
Continuing my explorations of how Perl works, I’d like to study how Perl parses the text your cat typed! That parsing is surrounded by standard library initialization, so I’ll include that too…
The preceding standardlib initialization is, once again, heavily conditional on buildflags. But it may involve checking various environment variables, & loading various C macros defined by the build system into the Perl runtime.
It then opens the specified file or string using Perl’s own I/O APIs.
With the file opened Perl checks whether it is running as a set-uid script, ensures SIGCHLD signals aren’t ignored, if -x is given complains about the set-uid bit whilst skipping certain prefixes & changing the directory, & initializes a couple more globals in Perl’s datamodel.
Then it initializes the Perl I/O, UNIVERSAL, builtin, & mro standard libraries.
If the xsinit global callback is configured it is run, before initializing OS extras & SOCKS libraries.
Continuing with my Perl thread from this morning…
Perl then initializes “pre &, if undumping’s disabled, post dump symbols”. It might enable UTF8 decoding on stdio. It might set or unset the PERL_SIGNALS_UNSAFE bitflag. And then it initializes the lexer with its various fields, dereferencing its own pointer to the raw text.
It might register a callback to tweak the lexing results when specified as a string as opposed to file/stream. It might set the HINTS_DEFAULT bitflag.
Perl’s parser is implemented using Bison with no tweaks. It uses some utilities to manage blocks of code & operators, but mostly it constructs the AST directly! Perl has a fairly complex grammar…
The lexer is also quite involved & codes directly against Bison. After UTF8 decoding & with optional debugging output it checks a buffer of tokens to yield, then traverses a statemachine before updating various globals.
The lexer’s state machine consists of: NORMAL/INTERPNORMAL noop states; INTERPCASEMOD which handles a trailing \E amongst other escapecodes to update the parser’s “mode” whilst yielding the appropriate token; INTERPPUSH which saves various properties whilst reporting any failures; INTERPSTART which I’ll discuss shortly; INTERPENDMAYBE which checks whether there’s more text to parse, otherwise falling through to INTERPEND which’ll also be discussed shortly; INTERPCONCAT; & FORMLINE which does most of the work.
INTERPSTART checks whether we’re at the end of the file, possibly outputs debugging info for parens, switches to INTERPNORMAL state, desugars the “@” join operator & parenthesized implicitly-do blocks, etc before recursing.
INTERPEND closes the desugaring INTERPSTART begins, & falls through to INTERPCONCAT, which in turn closes similar constructs.
FORMLINE, after errorchecks but before inferring close-curlies & recursing, iterates over the text looking for dots, newlines, hashes, ~~, @, or ^.
After that initial iteration it skips over whitespace & checks for open-curlies. After this state machine it invokes a more traditional yet still very involved lexer manually implemented across numerous functions. Also there’s several APIs for error reporting!
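To make the shape of that state machine concrete, here’s a toy sketch in Python (Perl’s real sublexer lives in C & its states handle far more; the two states, token names, & `$`-only interpolation here are my simplifications, not Perl’s actual behaviour):

```python
# Toy dispatch-on-state lexer in the spirit of Perl's sublex states.
# NORMAL accumulates plain text; INTERP consumes an interpolated variable
# name, then falls back to NORMAL (loosely like INTERPEND does).
NORMAL, INTERP = "NORMAL", "INTERP"

def tokenize(text):
    state, out, buf, i = NORMAL, [], "", 0
    while i < len(text):
        if state == NORMAL:
            if text[i] == "$":              # interpolation begins
                if buf:
                    out.append(("STR", buf))
                    buf = ""
                state, i = INTERP, i + 1
            else:
                buf += text[i]
                i += 1
        else:                               # INTERP: consume the variable name
            name = ""
            while i < len(text) and text[i].isalnum():
                name += text[i]
                i += 1
            out.append(("VAR", name))
            state = NORMAL
    if buf:
        out.append(("STR", buf))
    return out

assert tokenize("hi $name!") == [("STR", "hi "), ("VAR", "name"), ("STR", "!")]
```

The real machine has eight-ish states rather than two, but the structure is the same: branch on the current state, emit a token, pick the next state.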
Then the parsing function cleans up, setting several globals largely via C macros & decrementing refcounts. If the do_undump flag is given (which apparently is an integration with an Emacs thing…) it maybe does validation.
Having just described how Perl scripts are parsed, I’ll describe how those parsed Perl scripts are run!
Within exception handling & with optional debugging output throughout, Perl does so by first checking whether we’re restarting an op. If not it might output whether the syntax parsed successfully and/or run the core interpreter in PHASE_INIT mode. Otherwise, in PHASE_RUN mode, it loads the op to interpret & enters the mainloop. For the main function it does so differently.
Following that, integrated with the exception handling, it pops the interpreter stack until reaching an exception handler, stashes interpreter state, & runs the core interpreter for that stackframe.
With interpreter state saved to the stack & exceptions handled, within each mainloop iteration it evaluates a function call.
After consulting some bitflags to determine which state to save, look up a method name, etc, Perl interprets each function call by checking whether it can successfully stash various state aside (setting a flag if so), looking up the method if either of those flags are set, & with or without exception handling (based on a bitflag) running a callback on a stashed global representing the op. Same as for the initial algorithm.
Skimming over the rest of the Perl interpreter’s code (minus standard libs, etc) in the hopes of describing the structures which Perl’s datamodel is built upon.
I see a straightforward yet macro-heavy array implementation.
There’s a bitcounts lookuptable.
The builtin:: namespace is defined directly within the interpreter, since its import function has to be built in…
There’s utils for generating debug output.
I’ve seen an I/O API Perl uses internally & exposes as an API to Perl scripts, including a noop alternative. And there’s transcoding APIs.
There’s an implementation of vectorops.
There’s a string interpolation interpreter.
There’s an OS abstraction layer.
There’s an “inversion list” represented as a slice of an array.
There’s a parallel-array implementation incorrectly called “hashes”.
Perl hashes include bitflags describing their encoding, including text encoding.
There’s an internal library refining C’s datatypes.
There’s a struct gp which can refer to a “scalar value”, file handle, subroutine, “hash”, or array, with metadata for their refcount, general validity, format Perl-function, effective value, & where it was declared. It has util functions to manipulate this.
Various C libraries get their own wrappers, including conversions between strings & numbers. Strings in particular have plenty of utils, e.g. to handle Perl’s UTF-8 variant!
There’s some code for traversing “hashes” & arrays in depth-first-search order to determine their “method resolution order”. Possibly reporting errors to the user.
There’s plenty of dataheaders.
There’s some “magic values” used to track interpreter initialization, deinitialization, & validity. Errors may be reported to users.
Perl extensively wraps memory allocation, throttling OS calls, adding validation, & allowing alternate callbacks to be used.
There’s a dedicated slab allocator/optimizer for compiled Perl opcodes, with mini & full variants.
There’s a complex pretty-printer.
There’s a concept of “pads” for locally-scoped variables.
There’s a regex engine.
There’s taint tracking, threading, timing, & that basically covers it.
Python is a very popular scripting language (and the language I started out on) famous for the extensive suite of “modules” bundled with it & for its love of British humour, namely Monty Python & The Hitchhiker’s Guide to the Galaxy; can’t blame Guido.
To initialize the interpreter Python starts with its “runtime”, ensuring it wasn’t already initialized, by (whilst using the default allocator): clearing all fields except some callbacks it sets aside, possibly initializing the Global Interpreter Lock, setting various config properties to default values, allocating locks (platform-specifically initializing threading if needed) for the interpreter, ID registry, & Unicode IDs, & retrieving the ID of the main thread. It returns an errorcode.
If that’s successful it retrieves the runtime object & checks its errorcode.
Then with the preinitializing flag set it initializes some more configuration properties & copies them over.
If that’s successful it copies yet more properties over (unclear what this is about…).
And finally if that’s all successful it updates its initialization flags & returns a success errorcode.
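The overall pattern is “check it wasn’t already initialized, set defaults, return a status”. A minimal sketch of that pattern in Python (the `Status`/`Runtime` names & config fields are invented stand-ins; CPython’s real equivalents are `PyStatus` & `_PyRuntimeState` with far more fields):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Status:
    # Mirrors the errorcode-returning convention: no error means success.
    err: Optional[str] = None
    def ok(self):
        return self.err is None

@dataclass
class Runtime:
    initialized: bool = False
    config: dict = field(default_factory=dict)

def runtime_init(rt):
    if rt.initialized:                    # "ensuring it wasn't already"
        return Status("runtime already initialized")
    # Invented defaults standing in for the real config properties:
    rt.config = {"hash_seed": 0, "utf8_mode": False}
    rt.initialized = True
    return Status()

rt = Runtime()
assert runtime_init(rt).ok()
assert not runtime_init(rt).ok()          # a second init is rejected
```

Every init step in the real code checks the status of the previous one before proceeding, exactly as the thread above describes.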
Python commandline interface
If that initialization’s all successful (success errorcode, & no exceptions flagged) CPython then goes on to run its own “main” function, with trailing cleanup/exception-handling code (& a temporarily-set default allocator)!
It does so by retrieving the thread’s interpreter & in turn its configuration, tweaking the import path based on configuration, possibly outputting version info, possibly validating it can import readline, & deciding which subcommand to run.
The run subcommand converts the commandline args to Python objects & interprets the file within a new “__main__” module’s scope with error handling.
The run-module subcommand (or if there’s an importpath given) imports the runpy module for its _run_module_as_main function, converts commandline args to Python objects, & calls that function.
The run-file subcommand opens the specified file possibly skipping the first line (save its trailing newline), checks it’s not a directory, checks for exception handlers to call, & defers to another function depending on whether the file’s interactive. If it is interactive, after generating a prompt it repeatedly retrieves various object IDs, opens a new allocation-arena, parses the string, runs it as a module, & handles errors.
Otherwise it populates vars describing the file, tries loading the .pyc file, runs the .pyc or .py file, & flushes any output.
If reading from interactive stdin it loads some envvars, runs a specified startup file, & calls the Python function sys.__interactive_hook__. Regardless of whether stdin is interactive it checks for more functions to call & treats it similarly to arbitrary files.
And regardless of the subcommand it considers calling
Evaluating some string (whether source or filename) involves 2 substeps: parse & interpret, with an arena allocator set & error handling.
With some debug output this involves initializing a lexer whilst normalizing newlines & normalizing to UTF8; if that succeeded, configuring its filename, extracting parsing bitflags, initializing/allocating a parser object, & running these before cleaning up.
The lexer has an underlying scanner that can read (upon underflowing its buffer, during which it handles Unicode byte-order-marks) from a string, interactive prompt, or Python file object; capable of uneating specified chars or looking ahead several chars.
The lexer has a flag indicating whether it’s at the beginning of a line outside of parens, at which point it counts indentation to emit indent/dedent tokens, checking whether the line is an empty/comment line. Regardless it skips whitespace, checks for comments, EOF, & identifiers.
Then it checks for newlines; periods (standalone, starting a number, or forming an ellipsis); numbers in a selection of bases, floating point or not; quoted strings; escaped newlines; & 2-character tokens. Before branching over the character to check parenthesis balancing, & turning any remaining char into its own token.
There’s a suite of helper functions for lexing these tokens, & the lexer may raise Python exceptions.
The parser is wrapped in a layer that adds error diagnosis & exception throwing. The core parsing logic may call back into this.
Said core logic, including the Abstract Syntax Tree it yields, is autogenerated by a self-hosted parser generator from what very closely resembles the BNF file you might find in Python’s documentation. I don’t think I want to dig into this parser generator…
Of course this .asdl file is compiled to C; interpreters can’t be as self-hosted as compilers…
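You can poke at the output of this whole parsing pipeline from Python itself, via the standard `ast` module, which exposes exactly the ASDL-defined node types just mentioned:

```python
import ast

# Parse a statement & inspect the AST nodes the grammar produced.
tree = ast.parse("x = 1 + 2")
stmt = tree.body[0]

assert isinstance(stmt, ast.Assign)           # an assignment statement node
assert isinstance(stmt.value, ast.BinOp)      # whose value is a binary op
assert isinstance(stmt.value.op, ast.Add)     # specifically addition

# ast.dump renders the node tree the way the ASDL file defines it:
print(ast.dump(stmt.value))
```

Each node class (`Assign`, `BinOp`, `Add`, …) corresponds to a constructor in the `.asdl` grammar file.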
Running a parsed module in Python is split into 2 parts (beyond retrieving the thread & debugging output): compiling to bytecode & interpreting those bytecodes.
The bytecode compiler starts by ensuring the “__doc__” & “__annotations__” strings are interned, zeroes out various fields whilst initializing a Python dictionary & list, saves the filename & arena, parses & lightly postprocesses future imports, & extracts the optimization level.
Optimizing the parsed AST involves retrieving the recursion limit & thread state before traversing the abstract syntax tree. Within this traversal it may perform constant propagation (including for small collections), extract/concatenate docstrings, & in certain cases replace lists with tuples, seemingly to avoid the overhead of interpreting them. Though most of the code is there for traversing all the different branches of the AST.
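That constant folding is easy to observe from Python: the folded result lands in the code object’s constant table, & the original operands never make it to the bytecode:

```python
# CPython's AST optimizer folds constant expressions before codegen,
# so compiling "1 + 2" stores 3 in the constant table:
code = compile("x = 1 + 2", "<demo>", "exec")
assert 3 in code.co_consts        # folded at compile time
assert 1 not in code.co_consts    # the operands are gone

# Small collections of constants get folded into tuples too:
code2 = compile("x = (1, 2, 3)", "<demo>", "exec")
assert (1, 2, 3) in code2.co_consts
```

So by the time the interpreter loop runs, a single LOAD_CONST replaces what would have been a load-load-add sequence.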
The next pass resolves variable scope.
To resolve variable scope it allocates a new symbol table, retrieves recursion limit, validates input, saves filename & future imports, retrieves the threadstate, configures recursion depth & limit, allocates/initializes & registers a new block, if successful traverses the AST, tidies up, & validates (utilizing sets & to a lesser extent dictionaries) results whilst tweaking AST to inform generated opcode. These scopes/blocks are Python objects.
The traversal extracts variable/class/function definitions & declarations for what’s global or nonlocal whilst splitting the code up into different scopes. Named expressions require careful handling, but otherwise most of the logic is involved in traversing all the branches.
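The results of this scope-resolution pass are also exposed to Python through the standard `symtable` module, so you can see it classify names the way the traversal above does:

```python
import symtable

# The symbol table pass classifies every name per scope:
table = symtable.symtable(
    "g = 1\n"
    "def f(a):\n"
    "    b = a + g\n"
    "    return b\n",
    "<demo>", "exec")

f = table.get_children()[0]             # the nested scope for f
assert f.get_name() == "f"
assert f.lookup("a").is_parameter()     # bound as a parameter
assert f.lookup("b").is_local()         # assigned, so local
assert f.lookup("g").is_global()        # only read, so resolved as global
```

These classifications are precisely what informs the generated opcodes: locals become fast array slots, globals become dictionary lookups.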
The final (or 2nd-to-last?) traversal, the bytecode compiler, also tracks scopes (after ensuring the module is named). It extracts annotation assignments & docstrings, but mostly converts each AST node into opcodes.
Conditional expressions have a special compilepath taking into account their jumptargets, & basically every AST node gets its own function that compiles straightforwardly into the corresponding opcodes. Imports, comprehensions, & patternmatching get special implementation effort.
Finally an assemble pass is called both within & after that compilation pass, which performs validations, corrections, controlflow optimization, line numbering, linking, compression, & consttable simplification.
The results are wrapped in a Python object.
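That wrapping Python object is the familiar code object, & the standard `dis` module will disassemble its opcodes for you (exact opcode names vary by CPython version, e.g. BINARY_ADD became BINARY_OP in 3.11):

```python
import dis

def add(a, b):
    return a + b

# List the opcode names the bytecode compiler emitted for this function:
ops = [ins.opname for ins in dis.get_instructions(add)]
assert "BINARY_ADD" in ops or "BINARY_OP" in ops   # renamed in 3.11
assert "RETURN_VALUE" in ops

dis.dis(add)   # prints the full annotated disassembly
```

This is the most direct window into everything the compilation passes above produce.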
After ensuring the __globals__ global is set, retrieving the threadstate & builtins, & gathering a stackframe, Python interprets bytecode-compiled software.
It does so by constructing & populating Python objects for that stackframe whilst populating any keyword & positional arguments (including validation, throwing appropriate exceptions), running a callback for the function (for the sake of C bindings?), & carefully dereffing the stackframe, tracking recursion depth for deallocators.
The normal callback, after validating/extracting its input & initiating optional debugging utils, considers whether to allocate an opcache, extracts the __ltrace__ property if present, throws an error if flagged to do so, tidies up upon exiting, considers “trace” debugging callbacks, pops a stackframe, decrements the recursion count, & validates errors are correctly indicated in the return value.
At the start of each loop iteration Python performs some validation, checks some locks (atomically-synchronized) for certain instructions, considers calling debugging callbacks, considers outputting which opcode it is running for debugging, considers going to top-of-iteration (normally the loop iterates into the middle of its body), & loads the opcode to dispatch it.
Upon errors Python considers outputting debugging info, gathers a Python traceback object wrapping the frame, considers calling “trace” debugging callbacks, & iterates over the stackframe looking to determine which opcode to set the programcounter to reference. Failing that it returns the error to its caller.
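Stripped of all that validation, tracing, & error machinery, the dispatch loop itself is a fetch-branch-push/pop stack machine. A toy sketch of that shape in Python (the opcode names mimic CPython’s but this is an invented miniature, not ceval.c):

```python
# A toy dispatch loop in the shape of CPython's eval loop: fetch an
# (opcode, argument) pair, branch on the opcode, & manipulate an
# explicit value stack.
def run(code, consts):
    stack, pc = [], 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1
        if op == "LOAD_CONST":
            stack.append(consts[arg])        # push a constant by index
        elif op == "BINARY_ADD":
            b, a = stack.pop(), stack.pop()  # pop two operands
            stack.append(a + b)              # push the result
        elif op == "RETURN_VALUE":
            return stack.pop()               # leave the loop with the result
        else:
            raise SystemError(f"unknown opcode {op!r}")

# Equivalent of compiling "40 + 2" down to bytecode:
program = [("LOAD_CONST", 0), ("LOAD_CONST", 1),
           ("BINARY_ADD", None), ("RETURN_VALUE", None)]
assert run(program, consts=(40, 2)) == 42
```

The real loop dispatches on hundreds of opcodes (listed below) & threads error handling through every one of them, but this is the skeleton.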
Python’s stackmachine opcodes (implemented with the aid of C macros) include ones for:
- stack manipulation
- Rearranging the top-of-stack
- data access
- retrieving/storing locals or constants by number (constants are stored in a tuple)
- Update the locals or globals dictionaries (including something apparently class-specific)
- Retrieve a function’s closure
- Calling C methods
- Numeric operators, calling C-methods whilst validating presence of Python magicmethods, may have fastpaths for strings
- Inplace variants of these
- Boolean operators, with fastpaths before attempting various C methods
- Get, store, & set entries identified by an on-stack object within a collection object
- Call the C-methods for iterating over an object asynchronously or not
- Retrieve a coroutine’s iter to await on it
- Retrieve object attributes, with a fastpath akin to V8 “hidden classes” or dict lookups
- Compare, is, contains operators
- Retrieve an object’s length
- Set up a with block, asynchronously or not
- Calling builtins (which are all implemented inside the interpreter segment of the codebase)
- Calls the local sys.displayhook for print statements (thought Python didn’t have those anymore?)
- Call Python functions implementing imports
- Inlined call to
- Calling builtin types (implemented in a separate segment of the codebase)
- Add item to a list, or to a set
- Push all items from a tuple or other iterable onto the stack
- Construct literal objects (strings, tuples, lists, sets, & maps/dicts) & convert between lists & tuples
- Inlined calls to certain list, set, or dict methods
- Configure function annotations
- Copy a dictionary then delete given keys
- Load & call methods or functions, in a few variations regarding arguments
- Construct function or slice objects
- Control flow
- Trigger the exit (for return statements) or, after nontrivially constructing the exception object, the error-handling codepaths
- Yield statements return the new values whilst preserving their stackframe, with or without repeating the opcode per-entry in the object
- Pop queued exceptionhandler blocks
- Error-handle an existing exception object
- fastpath for assertion errors
- Branch upon exception types
- Conditional jumps
- Unconditional jumps
- pattern-match a class, mapping, sequence, or keys
- Add extra bits to the next opcode’s argument
Every Python object participates in a doubly-linked list of live Python objects & consists of a refcount & a pointer to its type. The type in turn also includes:
- a textual identifier
- basic & item sizes for allocation
- a deallocator callback
- attribute accessor callbacks
- an offset for vectorcalls
- asynchronous await/iter/next/send methods
- callbacks converting the object to text (for repr & str)
- numeric operator methods (bundled together behind a pointer)
- length/concat/repeat/item/item-assignment/contains & inplace variations callbacks for collection objects (again behind single pointer)
- length & subscript accessor methods for mapping objects
- hashing callback
- callback for function-call syntax
- alternate attr-accessor methods
- callbacks for accessing underlying I/O buffer
- feature bitflags
- traverse callback
- clear callback
- “rich” comparison callback
- weaklist offset
- callbacks to construct an iterator or iterate to next item
- arrays of C methods, static properties, & dynamic properties with metadata allowing them to be called from Python
- the supertype
- its attribute dictionary
- accessor callbacks for the object’s description
- offset in some dictionaries
- initializer/allocator/deallocator/etc callbacks
- array of base objects
- order in which objects should be checked for properties/methods
- object array for some sort of caching
- array of subclassing objects
- array of objects holding a weakpointer
- 2 destructor callbacks
- version number
- callback for vector syntax
There’s a baseclass (type &, deferring to it, object) which actually knows how to look up Python attributes in reference to these C attributes, since the interpreter mostly only knows how to call these callbacks.
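Several of those C-level type fields are visible from Python itself, which makes the list above less abstract:

```python
# Peeking at the C-level type struct fields CPython exposes:
class Base: pass
class Sub(Base): pass

assert Sub.__mro__ == (Sub, Base, object)   # property/method lookup order
assert Sub in Base.__subclasses__()         # the array of subclassing objects
assert isinstance(int.__basicsize__, int)   # basic allocation size
assert int.__itemsize__ > 0                 # item size: ints are variable-length

# Slot-backed special methods show up as wrapper descriptors around
# the C callbacks:
assert type(object.__repr__).__name__ in ("wrapper_descriptor", "slot wrapper")
```

`__mro__`, `__subclasses__`, `__basicsize__`, & `__itemsize__` map directly onto the tp_mro, tp_subclasses, tp_basicsize, & tp_itemsize fields of the C struct.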
In terms of memory management, Python implements a trivial arena allocator. Whereby it allocates objects by incrementing a pointer at the cost of not being able to free them individually.
Python utilizes reference counting (confusingly organized into importable C modules) so that most objects are freed as soon as they go out of scope. But the Python interpreter & standard object types integrate a tracing GC, moving objects between doubly-linked lists & implementing weakreferences.
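Both halves of that memory-management story can be demonstrated from Python (note `sys.getrefcount` reports one extra reference for its own argument):

```python
import gc
import sys

# Reference counting: a new binding bumps the count, deleting it drops it.
x = object()
before = sys.getrefcount(x)
y = x                              # one more reference
assert sys.getrefcount(x) == before + 1
del y
assert sys.getrefcount(x) == before

# The tracing GC handles cycles that refcounting alone can never free:
a = []
a.append(a)                        # a references itself
del a                              # refcount stays at 1 via the cycle
assert gc.collect() >= 1           # the collector finds & frees the cycle
```

This is why most objects die promptly (refcount hits zero) while cyclic garbage waits for a collection pass.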
For its dictionaries & in-turn attribute lookups Python implements SipHash24, FNV, & other hashes, & hashes doubles by extracting their exponent & mantissa.
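The numeric hashing scheme is observable from Python: all number types reduce modulo a Mersenne prime (2^61 − 1 on 64-bit builds) so that equal values of different types hash identically, & `sys.hash_info` names the string-hashing algorithm:

```python
import sys

# Numeric hashes reduce modulo a Mersenne prime:
M = sys.hash_info.modulus
assert M in (2**61 - 1, 2**31 - 1)        # 64-bit vs 32-bit builds

# Equal numbers hash equal across int/float/complex, as dict keys require:
assert hash(1) == hash(1.0) == hash(1 + 0j)

# The modulus itself wraps around to zero:
assert hash(M) == 0

# Strings use SipHash (siphash24 historically, siphash13 in newer CPythons):
assert sys.hash_info.algorithm.startswith("siphash")
```

The exponent/mantissa trick for doubles exists precisely to keep `hash(1.0) == hash(1)` true without losing precision.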
The core hashmap logic for dicts, or (apparently the fast copies are useful even outside FP) Hashed-Array-Mapped-Tries, is implemented within the interpreter codebase. As is the core of file objects & string formatting. The sys & builtin modules are implemented within the interpreter section of the codebase.
Although imports are interpreted as calling Python functions, that infrastructure is handled in C, involving a Python list of preprocessing callbacks but mostly just deferring to the infrastructure I’ve been describing or loading a dynamically-loaded C library. The latter are implemented with the aid of some utilities. There’s also a concept of “frozen imports”, seemingly for obfuscation.
The interpreter exposes timestamp logic, & infrastructure for displaying but not duplicating warnings.
Types built into Python without the need to import any modules (as distinguished from the objects specific to the interpreter I described previously, & the modules implemented within the interpreter) include:
- Trivial structures
- Ellipses (don’t do much)
- Unicode character types
- slice parameters
- ranges & their iterators
- enumerate iterators
- “pickle” (Python object serialization) buffers
- “structs” (tuples with named fields)
- tuples (nonresizable arrays)
- memory (slicing bytearrays) & iters
- byte & unicode strings
- Types for typechecking
- union types
- generic aliases
- Module system
- modules (converted from namespace of the imported module)
- Data wrappers
- weakreferences (integrating into tracing GC)
- “capsules” (wrap void* pointers in C bindings)
- cells (accessor upon a single object)
- Implementing OO paradigms
- descriptors (additional metadata)
- objects (methods defers to its type)
- property accessors & dictionary proxies
- types (where __magic__ methods & attribute lookup are implemented)
- Functions & methods
- methods (with logic for capturing self & to hold metadata)
- iterators & generators (wrapping interpreter over coroutines)
- static, abstract, & class methods
- functions (not associated to an object)
- exceptions & various subclasses
- Exposing the interpreter
- interpreter IDs
- sourcecode for compiled bytecode, with compressed linemappings & iterators upon that
- OS files (core logic in the interpreter codebase segment)
- lists (resizable & “timsort”-able arrays) with iterators (timsort combines merge & insertion to take advantage of existing sorted runs in input)
- ordered dictionaries (as already stated core logic is elsewhere, this serves to track a doubly-linkedlist order)
- sets (does not share an implementation with dicts)
- unordered dictionaries, exposing more complex processing methods
- Numeric types
- floating point numbers
- variable-bitsize integers (resembling the long arithmetic you learnt in school)
- complex numbers
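That “long arithmetic like you learnt in school” is literal: CPython stores big ints as little-endian arrays of 30-bit “digits” & does schoolbook carry propagation over them. A base-10 sketch of the same addition (the helper names are mine; longobject.c works in base 2^30):

```python
# Schoolbook bignum addition over little-endian digit lists, the same
# shape as CPython's long arithmetic (which uses 30-bit digits).
def digits(n, base=10):
    out = []
    while n:
        n, d = divmod(n, base)
        out.append(d)
    return out or [0]              # little-endian: least significant first

def add(a, b, base=10):
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = carry + (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
        carry, d = divmod(s, base)
        out.append(d)
    if carry:
        out.append(carry)          # final carry becomes a new top digit
    return out

assert add(digits(999), digits(1)) == digits(1000)

# CPython exposes the per-digit storage size on the int type itself:
assert int.__itemsize__ in (2, 4)  # 15-bit vs 30-bit digit builds
```

Multiplication, comparison, etc are the same idea scaled up, plus Karatsuba for very large operands.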
Alongside is a wrapper around the system, arena, or garbage-collected memory allocators, with debugging utils. As are various utilities for calling a Python callable from C.
Unicode strings are implemented with massive amounts of code!
All the other common Python types are trivial.
There’s logic for intelligently suggesting typo corrections.
There’s cross-platform threading bundled in the interpreter, amongst other OS abstractions, including lots of string-formatting utils.
Julia, popular amongst scientists, is a highlevel yet fast programming language which takes function overloading to extremes. To the extent its JIT optimizer barely has to care about the sheer range of number types (you can tell this is geared towards academics…) in its standard prelude!
Incidentally if I were writing a Ben Eater-style tutorial on how GPUs work, I’d start by implementing the algorithms in Julia. Julia seems like a perfect fit for that…
So how does this JIT work?
After navigating differences in OSes whilst initializing its LibUV-based I/O & parsing/validating extensive commandline options, Julia specialcases when it is running under an “RR session”, optionally runs an alternate codepath adhering to a more Lispy syntax, & before running any exit callbacks & debugging output:
- Saves commandline args to a Julia global accessible to the script
- Looks up the
- Attempts to run _start with a temporarily-set world-age & exception handling
If there’s a commandline arg not starting with a dash it’ll interpret that referenced script. Otherwise it prints some warnings before repeatedly prompting for some text to interpret & outputting the result, until stdin closes.
Interpreting a file involves (with exception handling) converting the filename into a Julia string & back, reading the full file into an in-memory string, validating the input, running the parser & incrementing the refcount on its result, & with exception handling for each “arg” in that root AST branch either:
- for “linenode” ops updates some globals, or
- “expands” & “evaluates” the expression
Parsing & Desugaring Julia
How does Julia parse & desugar its scripts? There’s a couple interesting things about it…
To do so it looks up & calls the _parse function with typical language-binding code including a GC. For the sake of bootstrapping it converts the args from the Julia datamodel whilst saving aside GC handles, calls jl-parse-one, & converts back to Julia values.
These inner functions are implemented in a Lisp, seemingly wanting to use anything but C for parsing…
If I’m reading this Lisp code correctly, it implements something resembling a Pratt parser (those are cool!). Though mostly this still plenty resembles a pushdown automaton, one littered with code adding line-numbering annotations.
Julia implements this Lisp itself & exposes I/O functions to it. I will not describe this implementation.
Most operators, etc directly parse to function calls, maybe even implied calls to * (like mathematicians do; useful for imaginary numbers).
Furthermore collection types (including string templates) also usually parse to function calls. Various literal values parse to the runtime representation for that literal, which is shared by this Lisp.
There are a lexer & scanners aiding the implementation of this parser.
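To make those two ideas concrete (operators parsing into call forms, & juxtaposition like 2x implying a call to *), here’s a minimal Pratt-style precedence-climbing parser. It’s a toy in Python, not Julia’s Lisp, & its token/precedence tables are invented:

```python
# Minimal Pratt parser: binary operators become ("call", op, lhs, rhs)
# s-expression-ish forms, & a number immediately followed by an
# identifier implies multiplication, as in Julia's 2x.
import re

PREC = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokens(src):
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", src)

def parse(toks, min_prec=0):
    tok = toks.pop(0)
    if tok == "(":
        left = parse(toks, 0)
        toks.pop(0)                                  # drop the ")"
    elif tok.isdigit():
        left = int(tok)
        if toks and re.match(r"[A-Za-z_]", toks[0]):
            left = ("call", "*", left, toks.pop(0))  # implicit *: 2x -> (* 2 x)
    else:
        left = tok                                   # identifier
    # Precedence climbing: keep folding operators that bind tightly enough.
    while toks and toks[0] in PREC and PREC[toks[0]] >= min_prec:
        op = toks.pop(0)
        right = parse(toks, PREC[op] + 1)            # left-associative
        left = ("call", op, left, right)
    return left

assert parse(tokens("2x + 1")) == ("call", "+", ("call", "*", 2, "x"), 1)
```

Julia’s real parser handles vastly more (line numbers, macros, string templates), but "everything is a call" is visible even at this scale.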
After bootstrapping, the _parse Julia function is called, which if I’m interpreting this code correctly defers to the Lisp code I just described.
Above I described how Julia parses basically everything (in a custom Lisp) into function calls. So what does it do with that?
Surrounded by code restoring line-number globals regardless of exceptions, & a check that eval isn’t used in a “generated function”, Julia:
- Special-cases linenodes, symbols, & other non-expression nodes using e.g. a tree traversal.
- Lone property accesses become a heavily-overloaded Julia function call.
- Once again it checks for invalid callers.
- It checks if there’s any callees needing lowering, & “expands” that AST if so. Macros first, a tree traversal calling named Julia functions passing the parsed AST to be rewritten, before calling back into Lisp code which is its own traversal with a scope resolution, closure lowering, linearize, line-numbering, & postprocess line-numbering phases.
- For module exprs allocates a module implying standard imports, incorporated into the modules-list.
- For using exprs loads specified module then loads each of the specified names possibly as a module.
- For import exprs does similar, but slightly simpler.
- For export exprs validates syntax then looks up & flags appropriate bindings.
- For global exprs flags the named bindings as belonging to the module.
- For constant exprs does similar but also flags as constant.
- For toplevel exprs recurses into each subexpression
- For error or incomplete exprs raises an exception with human-legible text.
- For symbols looks up that global.
- Any other initial exprs returns the full expr as-is.
- Performs some validation & performs some analysis on the called function to extract bitflags.
- Decides whether to defer to codegen (the following topic after expansion) or that tree-traversal interpreter.
All with plenty of datatype conversion.
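The overall shape of that toplevel evaluation is a dispatch on the expression’s head. A hypothetical miniature in Python (the tuple-based expression encoding & module dict are invented stand-ins for Julia’s jl_expr_t & module objects, & only a few heads are sketched):

```python
# Toy head-dispatched toplevel eval in the shape described above.
def toplevel_eval(module, expr):
    if not isinstance(expr, tuple):                 # non-expression node
        # Symbols look up that global; other leaves pass through.
        return module["globals"].get(expr, expr)
    head, *args = expr
    if head == "toplevel":                          # recurse into each subexpr
        result = None
        for sub in args:
            result = toplevel_eval(module, sub)
        return result
    if head == "global":                            # flag binding as module-level
        module["globals"].setdefault(args[0], None)
        return None
    if head == "error":                             # raise with legible text
        raise SyntaxError(args[0])
    return expr                                     # anything else: as-is

mod = {"globals": {"x": 42}}
assert toplevel_eval(mod, ("toplevel", ("global", "y"), "x")) == 42
assert "y" in mod["globals"]
```

The real function additionally handles module/using/import/export/const heads & decides between codegen & the interpreter, but the branch-on-head skeleton is the same.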
Continuing my exploration of Julia I’ll discuss how the Abstract Syntax Tree (AST) is “expanded” into a simpler form before it is interpreted as (partially) discussed yesterday.
The first phase traverses the AST thrice to resolve scopes left behind by “hygienic macros”. A 4th traversal does some sort of validation. (I’m not making much sense of this; seems I’m not great at reading Lisp.)
The main suite of traversals starts with one performing a dataflow analysis of what lambda functions capture.
The 1st main traversal resolves variable names whilst stripping out local declarations, & more.
The 2nd main traversal tracks dataflow to determine which variables closures should capture, & whether to treat those as being mutable.
The 3rd main traversal uses that information to make those closures explicit. (Plenty of code I’m struggling to follow here…)
The 4th main traversal flattens the AST into opcodes, utilizing preallocated “slots”, numerous helpers, & validation.
One of those helper functions implements the actual traversal logic, branching across the various expression types & flattening them with the aid of other helper functions, both for alternate traversals & for emitting opcodes. Flattening control flow (becoming GOTOs, conditional or not) & function arguments are where things get most complex at this stage. Especially exceptions.
The final main traversal attaches line-numbering, handling lambdas very specially.
How does Julia JIT each function that is called, for the types it’s called with? The pair of which Julia refers to as a “method”. After it has allocated memory for it, determined which identifiers refer to globals, & (except in early boot) performed self-hosted type inference.
If it doesn’t have a cached version (checked atomically) for a given call, Julia…
After a couple of caching, datagathering, & exception handling wrappers (or maybe it simply uses the interpreter?) it:
- allocates a trie & iterates over instructions locating closures, to determine which vars need to be placed behind pointers,
- iterates over instructions twice to apply that info alongside some use/dataflow analysis,
- iterates over them again to mark volatile variables,
- infers a function typesignature if needed,
- allocates & sets up an LLVM wrapper,
- configures a stackframe, allocating space on it for local vars (computing space requirements is involved) into which arguments are subsequently moved,
- iterates over the arguments converting their types to LLVM’s with possible refinement,
- initializes “locations”,
- iterates over statements to examine aliasing,
- & finally moves onto control flow.
There’s an iteration over opcodes which gathers all branch targets, & an iteration over results to allocate each a “basicblock”. It might insert a function call to log allocations or compute coverage.
Then it iterates breadth-first over the basicblocks & their statements, tree-traversing the contained expressions lowering Julia types to LLVM & flattening all the expressions also to LLVM.
Unreachable blocks are deallocated.
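Those two iterations — gathering branch targets, then carving out basicblocks — can be sketched like so (toy opcodes & invented names, nothing resembling Julia’s real IR):

```python
def basic_blocks(code):
    # code: list of (op, arg) pairs; "jmp"/"br" args are instruction indices.
    targets = {0}                       # the entry always starts a block
    for i, (op, arg) in enumerate(code):
        if op in ("jmp", "br"):
            targets.add(arg)            # the jump destination starts a block
            if op == "br":
                targets.add(i + 1)      # fall-through after a branch does too
    starts = sorted(t for t in targets if t < len(code))
    # slice the instruction stream at every block start
    return [code[s:e] for s, e in zip(starts, starts[1:] + [len(code)])]

code = [("load", 1), ("br", 4), ("load", 2), ("jmp", 5), ("load", 3), ("ret", None)]
print(len(basic_blocks(code)))  # 4
```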
It then iterates over collected “phi” statements (a sideeffect of dataflow analysis on flattened control flow) & their incoming edges (without undefined values), consulting a map of what’s already been rewritten. Possibly creating new basic blocks rewriting control flow where needed & flagging constants & unions, or otherwise performing necessary data conversion.
After the incoming edges iteration it ensures the phi opcode is well-formed.
Unused phis are deleted.
There’s an iteration over generated code to finalize function calls with debugging information. Followed by optional cleanup when variadic arguments are not used & optional registration of garbage-collection “roots” (inherently live references).
It iterates over called modules to tweak their linkage, followed by opaque closure modules.
To codegen function calls it sees if the appropriate method is already compiled, to emit a call to it amongst type conversions. “Primitives” are hand-compiled.
Looking at callers of that core JIT’ing logic, Julia may compile closure expressions using it. But the main entrypoint to it serves to cache the results as a “method” (multilayered cache), manage a workqueue of methods to compile, apply the linker & debug results, & actually call the generated machine code.
Skimming through the rest of Julia, I see…
- Logic for reformatting the callstack into a stacktrace to attach to exceptions, possibly pulling in some Julia libraries for reformatting. To be combined with longjumps & exception objects.
- Code to serialize & deserialize Julia data (which again includes its AST).
- Type comparison logic & set operations, implemented in C. Complicated by type params.
- Arrays, implemented in C.
- A sys library for OS bindings, implemented in C.
- “Task” userspace scheduling infrastructure, with web bindings.
- Conversion from UNIX, Windows, or Mach signals to exceptions.
- Synchronization utilities
- Time profiling
- There’s a separate symbol table for static types, including logic to compute these types.
- Since Julia implemented its own Lisp for bootstrapping’s sake there’s lots of utils that need to be implemented for it.
- It can mmap “safepoints” with special flags. Safepoints are specially integrated into the GC & interrupts.
- “Intrinsic” functions implemented both in C (integrating into the runtime type system) for the interpreter & separately in LLVM assembly for the JIT.
- There’s functions for allocating & calling closures.
- There’s a minheap for the sake of scheduling tasks & triggering upon timeout. Integrates LibUV.
- There’s some CPU-specific code for Arm & x86 with fallbacks, LLVM handles most of that.
- Lots of LLVM utils including GC integration.
- There’s a Julia-specific version of LLVM’s LICM & demotion passes!
- The Julia types used by the intrinsic functions & the parser are manually constructed in C code for bootstrapping’s sake.
- There’s a compacted form of the “Intermediate Representation” to convert between.
- There’s a stop-the-world non-copying GC that can run finalizers, & utilizes dynamic dispatch for sweeping. With Julia-accessible APIs, including debugging.
- Has special page allocator for large objects.
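The timer minheap from that list can be sketched with Python’s heapq (the real thing is a hand-rolled C binary heap integrated with LibUV; the class & method names here are invented):

```python
import heapq

class TimerQueue:
    def __init__(self):
        self._heap = []

    def schedule(self, deadline, task):
        # heap is keyed by deadline, so the soonest task sits at the root
        heapq.heappush(self._heap, (deadline, task))

    def due(self, now):
        """Pop & return all tasks whose deadline is <= now, soonest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

q = TimerQueue()
q.schedule(5, "b"); q.schedule(1, "a"); q.schedule(9, "c")
print(q.due(6))  # ['a', 'b']
```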
The Lua interpreter starts by allocating & initializing its state covering various of its subsystems. Then it calls its real main function via the Lua stack whilst converting between the C & Lua datamodels & outputting error info upon returning false.
That main function converts back to C types to perform the actual conversion of argc/argv to a Lua hash, & parses off its own args possibly leading to printing usage or version info. Then it imports all the standard libs & runs GC.
From there it decides whether to run a string, file (which may or may not have a main function to call subsequently after the outer scope), or REPL (via libreadline). Running the parser appropriately, then calling the parsed function (after saving globals to the _ENV “upvar” to be accessible or sandboxable) & outputting any returned errors.
The Lua global state involves a randomseed for hashing, memory “debt” & “estimate” to trigger GC, callstack for debugging, longjump targets for exceptions, count of activecalls to C functions, a “registry” table for globals, allocator callbacks, string deduplication hashmap, nil, GC state & config, GC collection lists, finalizers to call, panic & warning callbacks, threads, the mainthread, preallocated message for allocation errors, metatables for basic types, & string cache for APIs.
Lua’s parser (which operates under lock with some state, tweaks to expose globals via the scope upon success, a buffer, & various arrays) begins by checking whether the file is binary, in which case it reads in the parsed bytecode wrapping it in a function object.
Otherwise Lua initializes the closure to return & a “table”/”hash” for the scanner state. The underlying scanner defers to a callback reading a file or string, & implements save/restore + error reporting.
The scanner, as per usual, also tracks line numbers for better error messages.
The lexer is handwritten branching over each subsequent character with the aid of a handful of support functions for literal values, especially strings. Has a keywords list to check. It’s wrapped in a scanner split across the parser & lexer sourcefiles, adding more error-reporting utils.
The parser is a handwritten pushdown automaton split across dozens of functions closely resembling Lua’s EBNF grammar. Infix operators are parsed using something in between a Shunting Yard & Pratt parser, whereby it uses the C callstack to store open operators.
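That technique of using the host callstack to hold open operators is precedence climbing; a minimal sketch (toy grammar with invented precedences, not Lua’s actual table):

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    pos = [0]
    def peek():  return tokens[pos[0]] if pos[0] < len(tokens) else None
    def next_(): pos[0] += 1; return tokens[pos[0] - 1]

    def expr(min_prec):
        left = next_()                      # operand: a number
        while peek() in PREC and PREC[peek()] >= min_prec:
            op = next_()
            # recursing keeps `op` open on the callstack until the
            # higher-precedence right-hand side has been consumed
            right = expr(PREC[op] + 1)      # +1 makes operators left-associative
            left = (op, left, right)
        return left

    return expr(1)

print(parse([1, "+", 2, "*", 3]))  # ('+', 1, ('*', 2, 3))
```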
The parser tracks a table of local vars which gets parsed directly into a registermachine. Immutability is enforced here whenever desired. Otherwise it’ll search the contextstack for variable names. There’s numerous helper functions for compiling vars, seemingly in part to avoid memory leaks!
Tuples are parsed away into stack operations even when returned from a function.
There’s controlflow flattening in the parser, & simplification collapsing chained gotos.
If the label for a goto is missing that becomes a runtime error.
The parser calls directly into an assembler to yield each of the opcodes it encounters in the expressions, providing it stack-allocated structures representing the operands to aid (e.g. type-directed) peephole optimization. Which in turn does little more than choosing an opcode to append to an array, though it does support e.g. updating controlflow opcodes. The assembler also handles tracking active registers & initializing literal values, & choosing opcodes for assignment statements.
Now that I’ve described how Lua parses scripts into functions to call: the interpreter code for running those functions (after moving data over to the Lua stack) first checks in which configuration to call its attached C function (e.g. is this a coroutine?), or after normalizing arguments runs the LuaVM.
The LuaVM, after checking for call hooks to call, repeatedly checks for trace hooks to call, fetches the next opcode, & branches upon it.
Lua semi-manually writes a dispatch table in case the C compiler isn’t smart enough to apply that optimization on its own.
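A dispatch table like that can be sketched as an opcode-indexed array of handlers (in C these would be function pointers or GCC computed gotos; this toy Python VM & its opcodes are invented, not Lua’s):

```python
def op_push(vm, arg): vm["stack"].append(arg)
def op_add(vm, _):    b = vm["stack"].pop(); vm["stack"][-1] += b
def op_ret(vm, _):    vm["done"] = True

DISPATCH = [op_push, op_add, op_ret]   # indexed by opcode number
PUSH, ADD, RET = 0, 1, 2

def run(code):
    vm = {"stack": [], "pc": 0, "done": False}
    while not vm["done"]:
        op, arg = code[vm["pc"]]
        vm["pc"] += 1
        DISPATCH[op](vm, arg)          # one indexed call per opcode
    return vm["stack"][-1]

print(run([(PUSH, 2), (PUSH, 40), (ADD, None), (RET, None)]))  # 42
```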
The opcodes include ones for:
- Moving data around whilst updating GC, possibly from constant table or instruction offset or closure
- Loading builtin constants, possibly skipping next opcode (can load multiple nils)
- Accessing properties from “tables” with fastpath opcodes for closures, the self argument, & specific key types.
- The :method() syntax has a special opcode which leaves self on the stack.
- Allocating a new table.
- Numeric operators, whether from registers, stack, or immediate values.
- Calling methods on the “metatable”.
- Retrieving the length of a string or table.
- Concatenating strings or defers to metatable.
- Freeing closure-functions & enqueueing cleanup functions on that.
- Conditional jumps upon comparison, & unconditional jumps, with fastpath opcodes for specific types.
- Call & tailcall functions (shares preprocessing but tailcall-recurses).
- Return a given number of values
- Iterate over values for a for loop, with values possibly flagged to-be-closed. Iterating over numbers vs iterators are separate opcodes.
- Store a value in a table at an offset from its length.
- Allocate a closure
- Retrieve & pass variadic arguments.
This is implemented with a handful of support functions, mostly for type conversion, for loops, & numeric operators.
The data these opcodes operate upon are typically stored in a stack or accumulator register, though as many registers as the function requests are made available.
Lua’s datamodel is a tagged enum which may or may not point to an entry in the garbage-collected linkedlist. Each stackentry also has a to-be-closed list.
Lua’s typesystem includes nil, numbers (internally integers or floats under-the-hood), strings, & bools as the types basically any program ends up dealing with. And for convenience functions, including their “prototypes” as parsed from the sourcecode & the “upvars” forming their closure. Also there’s some concept of “threads” for the sake of coroutines, which seems to be there to stop the garbage collector from being too eager.
Lua supports C pointers & functions for the sake of language bindings including its internal ones.
Finally collections of some sort are usually necessary, so Lua adds a single type which cleverly covers all usecases.
A couple of bits are stolen from the typetag to denote how the value’s stored in memory, where the Nil value came from (literal, empty slot, or missing value from table presumably for error reporting), or whether the Bool’s true or false. This is not exposed to scripts (except for bools ofcourse!).
Nils & Bools are stored entirely in the typefield. Strings may be deduplicated via an internal table if short enough (including keywords), as indicated by a bitflag. Long strings have a layer of indirection with lazily-computed hash.
Tables, closures whether C or Lua, “Up Values” they reference (one of those internal types), function “prototypes” (where the code & typesignature’s actually stored, the other internal type), & some userdata (which may also have a metatable) are garbage-collected.
If a closure (consisting of “upvalues”, possibly to-be-closed) isn’t attached to a C function GC & a layer of indirection can be dropped. Similar situation for C pointers, where the layer of indirection allows for metatables.
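The tag-plus-variant encoding described above can be sketched like so (the tag numbers & bit layout here are invented, not Lua’s actual ones):

```python
TNIL, TBOOL, TNUM, TSTR = 0, 1, 2, 3
VARIANT_SHIFT = 4                     # low bits = type, high bits = variant

def make(tag_, variant=0, payload=None):
    return ((variant << VARIANT_SHIFT) | tag_, payload)

def tag(v):     return v[0] & ((1 << VARIANT_SHIFT) - 1)
def variant(v): return v[0] >> VARIANT_SHIFT

FALSE = make(TBOOL, variant=0)        # bools live entirely in the tag field
TRUE  = make(TBOOL, variant=1)
NIL_EMPTY_SLOT = make(TNIL, variant=1)  # "where did this nil come from?"

assert tag(TRUE) == TBOOL and variant(TRUE) == 1
assert tag(NIL_EMPTY_SLOT) == TNIL
print(tag(make(TNUM, payload=3.5)))  # 2
```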
As stated, Lua closures reference a “prototype” parsed from the sourcecode. These track the number of:
- parameters & whether it’s variadic
- sourcelines (2 fields)
- localvars (for debugging)
As well as first & last linenumbers, source code, & arrays for:
- Expected closure (name, whether instack, index, & type)
- Lines (2 different compressions)
- Vars with lifetimes
- GC objects
Tables are Lua’s only collection type, cleverly covering all usecases! It can be an array, dictionary, (with dot-notation) object, or the result from importing/running a module! Heck I’ve seen it used as an elegant way to represent XML elements!
Tables have “metatables” (which, yes, are stored as actual Lua tables) providing methods or tables for operator-overloading (special opcodes) & notably computing missing values upon lookup. Hence Lua has prototypal inheritance!
Internally Lua tables are open hashmaps conjoined with (for integral keys) an array, caching which metatable methods are absent in a byte & referencing the metatable as an actual table. Thus optimizing for lack of metatable methods.
The hashmap component stores its size in log2 & tracks the lastfree slot as a heuristic for when to reshuffle existing entries. Singly-linking entries like a closed hashmap.
Reads/writes are optimized separately for each Lua type. Why strings store hashes!
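That array-plus-hashmap hybrid can be sketched minimally (no resizing heuristics, metatables, or per-type read paths, unlike the real thing; names invented):

```python
class HybridTable:
    def __init__(self, array_size=4):
        self.array = [None] * array_size   # slots for integer keys 1..array_size
        self.hash = {}                     # everything else

    def set(self, key, value):
        if isinstance(key, int) and 1 <= key <= len(self.array):
            self.array[key - 1] = value    # fast path: no hashing at all
        else:
            self.hash[key] = value

    def get(self, key):
        if isinstance(key, int) and 1 <= key <= len(self.array):
            return self.array[key - 1]
        return self.hash.get(key)          # misses read as nil/None

t = HybridTable()
t.set(1, "array part"); t.set("x", "hash part")
print(t.get(1), t.get("x"), t.get("missing"))  # array part hash part None
```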
Correctly allocating & deallocating memory is a pain which can easily lead to security vulnerabilities, one most languages (except C, Assembly, etc) want to save you from! The popular way of doing this is to have a “garbage collector” (GC), whereby the language’s runtime traverses & flags all “live” references in the heap to free the rest.
Lua tracks a type-tagged singly-linkedlist of all objects in its heap, with bytes available for this flagging & tracking GC “generations” for stop-the-world GC.
Lua’s optionally-generational stop-the-world tri-colour GC is conditionally triggered upon allocating a closure, table, or concatenated string. At which point it heuristically/configurably considers whether to do a young or full collection.
Collection involves marking a couple root objects gathering an initial “graylist”. Aided by a linkedlist field on tables, Lua & C closures, threads, prototypes, & some C pointers. Then it propagates these marks specially for each of those types.
Metatables may instruct whether to propagate marks to a table’s keys and/or values. If a value for a slot’s nil the corresponding key’s replaced with a special tombstone value.
It propagates marks again after specially handling closures, & propagates marks for “write barriers” separating the generations. Then handles weakkeys propagating marks to weakvalues, & considers currently scheduled finalizers.
Then iterates over the GC linkedlist multiple times tweaking generations, freeing all objects still left marked “white”. A full collection first marks everything white before marking, propagating, & sweeping.
Additionally linkedlists are used to locate weak keys & weak values to postprocess.
The GC is split into “iterations” I’m not fully comprehending, but when you reference a formerly young-generation object from the old generation they’re involved.
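Stripping away the generations & tri-colour incrementality, the core mark & sweep over a linkedlist of all heap objects can be sketched as (toy object model, invented names):

```python
class Obj:
    def __init__(self, heap, name, refs=()):
        self.name, self.refs, self.marked = name, list(refs), False
        self.next = heap.head          # prepend to the all-objects list
        heap.head = self

class Heap:
    def __init__(self):
        self.head = None               # singly-linked list of every allocation

    def collect(self, roots):
        stack = list(roots)            # mark phase: propagate from the roots
        while stack:
            o = stack.pop()
            if not o.marked:
                o.marked = True
                stack.extend(o.refs)
        survivors = []                 # sweep phase: walk the allocation list
        prev, o = None, self.head
        while o:
            if o.marked:
                o.marked = False       # re-whiten for the next cycle
                survivors.append(o.name)
                prev = o
            else:                      # unlink = "free"
                if prev: prev.next = o.next
                else:    self.head = o.next
            o = o.next
        return survivors

h = Heap()
a = Obj(h, "a"); b = Obj(h, "b", refs=[a]); dead = Obj(h, "dead")
print(sorted(h.collect([b])))  # ['a', 'b']
```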
LuaJIT is a 2nd slightly less straightforward but more optimized implementation of Lua, using the technique of JIT’ing! JIT’ing yields significant performance gains, even if it requires a bit more care to avoid security vulnerabilities.
LuaJIT works much the same way as the reference implementation (though with codegen largely in the parser file), so I’ll start my LuaJIT explorations by describing what happens once you call a parsed function. Again functions include parsed files.
With minimal preprocessing a LuaJIT functioncall hands off to some ARM, MIPS, PPC, PPC-SPE, or x86 assembly code (assembled via a custom Lua script, C script, & a C compiler) bytecode interpretor handling similar opcodes to the reference implementation.
The C script gathers a dispatchtable for the different labels, which is postprocessed into a dispatchtable for the different opcodes & builtin APIs to inline. This is consulted after evaluating each opcode to evaluate the next one.
Occasionally these opcodes call C helper functions (especially for “metamethods” & tables more generally), but mostly once data’s in the correct registers & has had its types checked we can rely on straightforward assembly code.
LuaJIT may optionally compile functions further if they’re called often enough or one of its loops repeats often enough (according to profiling counters), but I’ll discuss that tomorrow.
As described yesterday LuaJIT starts with a bytecode interpreter, just this time written in a custom Assembly syntax with its own compilation pipeline atop GCC, reimplemented for each supported architecture. Otherwise it’s very similar to the Lua reference implementation.
But when a function, or a loop within one, is called often enough (its counter underflows) LuaJIT may optionally call a hook or other that generates a more optimized version. Or it rewrites builtin calls to use prewritten opcodes.
Upon detecting a hotcodepath (correction: this “hook” as I called it isn’t configurable), LuaJIT evaluates a statemachine with exceptionhandling.
In the START state LuaJIT checks/updates the function’s flags to see whether it should actually be JIT’ing this function, allocates an ID, initializes various fields, calls an “eventhandling” registered hook, & initializes various global fields for optimizers.
And if runtimeflags change LuaJIT adjusts the opcode dispatchtable.
Upon temporary-exit (RECORD state) this statemachine applies any pending opcode-write, calls a configurable hook, & with any necessary preprocessing saves relevant information from the next opcode. Handling each of the different opcodes differently.
Upon the END state, as decided by that opcode-analysis, it applies any pending opcode-write, considers applying Dead Code Elimination (possibly trying again later), applies “split” & “sink” optimizations, maybe flags for no loops, & proceeds to the ASM state.
LuaJIT’s DeadCode Elimination (DCE) optimization pass marks all pre-gathered “snaps” & their entries as live, then iterates over the opcodes backwards to propagate those marks or replace the opcode with NOPs.
The split optimization is relevant to 32bit CPUs or those without an FPU. When relevant it iterates over the constant table then opcodes then PHIs looking for 64bit values to split.
The allocation-sinking optimization works similarly to the DCE pass with phi & opcode postprocessing.
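That backwards DCE pass can be sketched like so (toy SSA-ish opcodes & snapshot references, not LuaJIT’s real IR):

```python
def dce(ir, snap_refs):
    # ir: list of (dest, op, operands); snap_refs: dests the snapshots keep live.
    live = set(snap_refs)
    for dest, op, operands in reversed(ir):   # backwards iteration
        if dest in live:
            live.update(operands)             # a live result keeps its inputs live
    # anything still dead becomes a NOP rather than being deleted outright
    return [(d, op, ops) if d in live else (d, "NOP", ())
            for d, op, ops in ir]

ir = [("t1", "load", ()), ("t2", "load", ()),
      ("t3", "add", ("t1", "t2")), ("t4", "mul", ("t1", "t1"))]
out = dce(ir, snap_refs={"t3"})
print([op for _, op, _ in out])  # ['load', 'load', 'add', 'NOP']
```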
The END state proceeds to the ASM state, or it might decide to jump immediately here upon error. Regardless, upon the ASM state LuaJIT repeatedly tries assembling the machine code in reverse order with a bit of boilerplate whilst discarding any remaining deadcode. “Snaps” & “phis” are taken into account during & afterwards, & GC may be added alongside the “prologue”.
Then it ensures this code can be validly run (possibly including Valgrind validation), & adjusts the opcode to jump into this JIT’d code. Followed by possibly updating the dispatchtable.
Invalid states are marked as errors, or errors may be flagged when an exception is raised. In either case upon this state the pending opcode-write is applied, it checks state in order to decide how to recover & whether it should send events to the script, typically updating state to IDLE (thereby exiting this statemachine) & updating the dispatch table.
Tool Command Language (TCL)
Tool Command Language (TCL) is a scripting language which resembles the commandline. Presumably future projects I’m going to study incorporate it into their buildsystems?
TCL includes an OS decoupling layer (supporting Mac, Windows, & UNIX) & launcher. This launcher initializes the appropriate OS libraries before calling Tcl_Main. This allocates/initializes an interpreter, & in-turn an “execution context” & various collections, preloading various standard libraries.
Next this interpreter is handed off to Tcl_MainEx, at which point OS-specific aspects of the standard libraries can be initialized, before looking up the thread-local startupscript & its encoding. If null (which it is when executed from the commandline) they’re retrieved from the commandline arguments. Then script-accessible vars are allocated for this information, saving that state.
A given callback (from the launcher, runs testsuites & configures script-accessible filename) is run to perform initialization, or outputs an error.
Checking the configured startupscript again it opens that file using Tcl’s own standard library, & evaluates it (later topic) reporting any errors possibly fatally. Then it does likewise for the RC file if any & running interactively, & enters the mainloop.
This repeatedly checks whether there’s a mainloop procedure, considers outputting a prompt, checks limits, retrieves & validates input which it converts to a TCL object, interprets it & saves to a history array, & outputs the result.
If a mainloop procedure is registered for this iteration it is run & cleared instead of much of that logic.
After the mainloop all that’s left is to clean up & return whether that fatal error occurred.
TCL String Evaluation
The other day I described what TCL does upon startup, here I’ll describe how it evaluates some piece of sourcecode. Which took me a bit to locate.
Behind several safety-wrappers I’m failing to follow TCL’s “evaluator” allocates stackframes, normalizes inputs copying them to the stackframe, enters a core-loop, & carefully cleans up.
This loop parses each command in the sourcecode, advances the scanner, if successful interprets it via new arrays & a couple word iterations, & cleans up.
The parser is essentially a hand-written lexer populating an array, with separate functions for each token. Lexing comments & whitespace first so it can ignore them. A fastpath utilizes lookup tables. After which a scanner’s advanced.
If that successfully parses some tokens, sidetables are allocated for lineSpace. For each word it advances that scanner, checks for constant-propagation opportunities, & substitutes various tokens flattening results.
That loop may incur a second reverse-iteration populating those sidetables.
After which another stackframe is pushed with a history entry recorded, & it recurses through all that code I failed to follow. To call the initial token passing it the trailing ones. Before proceeding to parse the next command.
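The per-command shape of all that — substitute into each word, then call the first word passing it the trailing ones — can be sketched very minimally (variable substitution & dispatch only; real TCL also handles braces, brackets, quoting, & much more, & these names are invented):

```python
def eval_command(src, variables, commands):
    words = []
    for raw in src.split():
        if raw.startswith("$"):                # variable substitution per word
            raw = variables[raw[1:]]
        words.append(raw)
    name, args = words[0], words[1:]           # first word names the command
    return commands[name](*args)               # call it with the trailing words

cmds = {"concat": lambda *a: " ".join(a)}
print(eval_command("concat hello $who", {"who": "world"}, cmds))
# hello world
```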
Skimming through the rest of Tool Command Language’s code, I see…
- TCL API bindings to a vendored LibTommath variable-precision math library.
- Lookup tables for decoding Unicode data.
- Thread forking, joining, storage, etc.
- Testsuites littered amongst the core code…
- Initialize stub callbacks.
- Debugging APIs.
- Unicode decoding/encoding.
- Utilities upon the TCL datamodel.
- Hashmap for variables, with support for computing closures.
- ZLib language bindings.
- A string runtime datatype.
- Packaging system with version checks.
- Number parsing.
- Exit with error message.
- Name resolution.
- Lexing API exposed to TCL scripts.
- Object orientation.
- Stringify POSIX codes.
- GC handles & saving data to be restored later.
- Regexp engine.
- Pipe runtime datatype.
- Filepath API.
- Procedure runtime datatype.
- Operators upon interpreter errorcodes.
- Abstraction around dynamically-loaded libraries.
- Network sockets abstraction.
- A wrapper stream calling TCL code.
- A stream deferring to TCL methods.
- A stream piping data from TCL procedure to another.
- Filesystem API.
- Linker between TCL & plain C code.
- List objects.
- Literal tables with hashing.
- Module loader.
- Namespacing & callframes, with imports & exports.
- Event dispatch.
- Thread-safe dynamically-typed objects for the script to operate upon.
- More literal parsers, including for dates (involving a YACC parser).
- Yet more file APIs.
- Introspection API.
- Hashmap objects.
- More I/O APIs & streams.
- Filename APIs.
- Some sort of fastpath interpreter with extensive compilation, assembly, &, for debugging, disassembly.
- Envvar APIs.
- Yet more commandline APIs.
- Text transcoding.
- Dictionary APIs.
- Asynchronous procedures.
- Memory management.
- Clock API.
- Binary data.