Over the next week I wish to describe how I’d design a Just-In-Time interpretor (JIT) for a functional programming language, so as to:
Special hardware (like the Reduceron) will not be in this hypothetical, but I will heavily optimize this JIT to show it’s still simple!
Today I’ll just set the scope for this hypothetical, and mention some interesting language constructs that could be applied.
I will not discuss the language’s syntax or how variable names are resolved. Whatever happens there (maybe you can hook in your own parser?) I’ll assume it is lowered to an STG-machine consisting of:
- LET - Stores a (lazily-computed) value
- CALL - Executes a function with the given arguments
- MATCH - Branches over a tagged-enum
- DATA - returns a taged-enum
For I/O I will discuss how I’d implement Reactive Functional Programming, but not any compositing/data modelling/etc. I’m tired of all that!
Reactive Functional Programming is basically operating upon lazily-computed & timestamped linked-lists, where laziness refers to not computing data until it’s needed for output. Or in this case until the data is provided by the user. You can be sure I’ll get deep into laziness soon!
Another interesting language construct that’s worth mentioning is “Total Functional Programming” which can express any algorithm that can complete in reasonable time, but can’t express bruteforce or infinite loops by constraining what parameters can be passed to recursive calls.
Then to prevent the remaining crashes MATCH needs more typechecking than Haskell does by default.
I’d definitely want these typechecks applied on untrusted code to help ensure it doesn’t eat up my battery!
To start describing how I’d design a highly-optimized functional programming JIT, I’ll describe how I’d quickly compile first-pass relatively-unoptimized machine code.
The most important step of which is to trace variable names from their defining LET nodes to their CALL nodes in the Intermediate Language (IL), because that would inform those LET nodes of the variables they need capture for function closures. This would be hard to optimize away later!
But this Data Flow Analysis (DFA) would also aid:
- Moving loop invariants out of inner functions (which’ll also be hard to optimize away later because that’d be done on a per-function basis)
- Locating duplicate code
- Register/stack allocation (which I’ll describe tomorrow)
And would make it trivial to identify:
- Dead stores
- Function closures (as mentioned)
- Some inlining opportunities
- Optimal variable placements
After that DFA pass it’s a straight-forward operation to generate the corresponding machine code:
- LET allocs/inits the specified closures. If there are no arguments it still does this, thereby pushing the computation off until it’s actually needed.
- CALL gotos the specified function with it’s arguments stored in the appropriate registers
- MATCH pushes it’s branches onto the callstack with any necessary data
- DATA pops the callstack with it’s fields in the appropriate registers.
Some things worth noting about that mapping to machine code:
1) Data is not computed until they’re needed, “laziness”. This has weaknesses that needs addressing, but to make sure it can only improve how little CPU-time is used the DATA compilation needs to be adjusted so it can cache the computed results.
2) Ofcourse you could interpret these instructions instead of compiling them. I don’t know if that’s better.
3) Register/stack/heap allocation plays a huge part.
Perhaps the main task of a compiler is to figure out where to store data, so programmers don’t need to concern themselves with those details. Which is generally split into “registers” which CPUs can operate on directly, a “callstack” that can be pushed & popped, & “heap” memory for less predictable data lifetimes.
For register allocation much of the solution would be given by the calling convention. That is we know where LET ops need to write function arguments, and where the function parameters need to be read from.
Most registers would be allocated by propagating this information & resolving conflicts (store as identity assignments?).
Interestingly the formulation of the MATCH op makes register allocation fairly easy. Because the register allocator doesn’t need to worry about branches, data is passed into those via the stack.
Which means we can get an optimal allocation by allocating registers from the most constrained to the least.
But if LET ops can allocate multiple “thunks” simultaneously, vars not used outside later in the branch can instead be addressed by offset rather than register, making this even easier!
And I’ll add that I’d dedicate certain registers fixed tasks, like specifying where (if anywhere) it should cache computed results, or holding the function’s closure.
For stack allocation, I’d use two passes. The first pass would figure out which variables need to be written to stack (and, if required by the target machine code, when they need to be loaded into registers). The second pass would assign stack slots to those whilst avoiding copying them around.
The biggest issue in making functional programming fast is that it allocates constantly. Combining multiple allocations into one would definitely help, but we need allocation to be cheap.
For that I’d use a simple bump pointer. That is I’d have a chunk of memory to allocate out of, and a pointer to the start of free memory within that chunk (dedicate a register to this!) to be incremented as needed.
But what about freeing memory to be reused? We need a garbage collector!
The immutability of functional programming has useful characteristics for garbage collectors.
This garbage collector would start as the “young generation” starts to fill up, and pause the program on allocation to collect the active roots, and copy everything they reference to an old generation(s). Once those old generations fill up, everything is copied to new old-generations.
Once collection finishes we can free the young generation.
The main bottleneck would be the copying phase (a post-order traversal), which can be addressed by running concurrently with the program. Possibly over multiple threads (each which with it’s own old generation).
Without (much) mutation to deal with, concurrent garbage collection is relatively easy. Because once data in an old generation changes it must be recollected at some point, thereby slowing down the collection. Here data may mutate representation, but not meaning, at most once.
To continue describing how I’d design a functional JIT, I’d like to describe how I’d apply additional optimizations to the most frequently used code. This characteristic is what makes this hypothatical a JIT, rather than a compiler or interpretor.
I’d trigger these passes by counting how many times each function is called, which includes loops here because functional programming uses recursion instead.
And because each optimize pass reveals more, these passes would repeat periodically.
But first I’ll describe a potential microoptimization on the MATCH op, which makes better use of modern hardware. So I need to describe the relevant two aspects first.
- CPUs look ahead several instructions for the sake of performance, and to do so it has multiple “branch prediction” circuits. With the CALL op I’d be relying exclusively on the target prediction circuit, but I should offload some of it’s work onto the boolean and switch prediction circuits.
- RAM is addressed on byte boundaries, but pointers (on a 64-bit system) takes 8 bytes. Which means the lowest 3(?) bits are likely wasted, especially with all data in the heap being one or more pointers (“memory alignment”).
I can instead use those otherwise wasted bits (“pointer tagging”) to encode which branch an already-computed thunk should cause calling code to take. The garbage collector, if not anything else, can propagate those tag bits.
Perhaps the most important pass for revealing optimization opportunities is inlining. I’ve already mentioned that some of this could be done in the initial compilation pass (by inlining any single-use variables), and this inlining may reveal potentially infinite optimization opportunities to be examined next time.
First (like GHC) I’d attempt to inlining any variables that are only used once in any given branch, failing that inlining the simplest (by some measure) functions it calls.
Furthermore an interesting optimization a JIT can apply that a compiler can’t is to inline a “lambda” function’s closure. That is when a particular instance of a lambda is called enough times (perhaps it has it’s own call count?), all the values it captured would be inlined into a new copy of the function. This would, for example, turn any interpretor implemented within this JIT into it’s own JIT.
The other important pass to do periodically is to determine which arguments are always, never, or sometimes used. Because if some argument is always used, computing it lazily can lead to excessive memory use. Though it shouldn’t be able to do too much damage before the JIT decides it’s worth optimizing that code.
The data flow analysis pass would be useful in computing this, and from it we could compute a to-be-inlined wrapper function with the same type signature.
Recursion needs extra passes.
Higher order functions
One challenge in performing use-analysis is in dealing with the presence of “higher-order” functions passed in as parameters.
Because the use-analysis would need to propagate from any functions it calls, but no additional information is (statically) known about these higher-order functions. GHC struggles here, but a JIT can easily profile the use-analysis of those functions.
For the web-app usecase I’d rather it do this than rely on hints from a potential attacker…
After the inlining (which, btw, includes loop unrolling) & use-analysis passes I’d have it pattern-match inoptimal syntactic patterns & rewrite them into a faster alternative. And then I’d apply the same steps as the initial pass to generate the machine code.
For example a MATCH op over a constant should be replaced with the appropriate branch, and MATCH ops should not be conditional on another MATCH op, the outer should instead be inlined to the inner’s branches.
Maths could (like anything else!) be implemented in-language with what I’ve already described (“Turing Completeness”), in this case by representing numbers as the length of a linked-list. And building upon the resulting increment, repeat-n-times, & decrement+branch ops.
But that makes terrible use of memory space (due to behind-the-scenes use of pointers) & poor use of CPU time. So how else?
To start I’d gather profiling data for the numeric range of the inputs, using the normal baseline compiler. The only optimization applied would be to have maths functions (starting with +, -, *, /, %, etc) it calls already optimized.
Once a number/comparison-returning branch is called enough times I’d apply some maths-specific peek-hole optimizations, extensive inlining, & propagating the possible ranges for the other numbers.
From which I can compute how many bits are required for them.
Once we know how many bits are required for each number we can “vectorize” the maths ops by pairing them up.
And given some functions are paired to machine opcodes it’ll then be easy to compile it to machine code.
If any inputs are received outside of the maximum range it’s been optimized to handle, it’ll get recompiled. Computing this maximum range would be a seperate pass.
Until it actually does the computation I’d store 61bit numbers with pointer-flag (a concept I described yesterday) of “111”, or “000” to indicate it has not yet been computed. This would tell the garbage collector whether to collect them.
Normally the 6 other pointer-flag values would indicate which variant it is whilst still allowing for a pointer to it’s fields to be stored. But here we could use it to store larger numbers without much additional complexity.
In dealing with I/O the JIT I’ve described has incredible flexibility in how it applies optimizations! (Input) data is stored as pointers to read functions. And output can decide whichever order it wants to most efficiently compute and output the data, even allowing for easy concurrency!
Furthermore the JIT can even optimize the output representation seperately from the public API by marking that representation private and/or wrapping the running program with some JIT-specific code. In either case the JIT I’ve described already will inline & optimize away that JIT-specific code whilst still outputting the more optimal data representation.
Also it shouldn’t be too hard to compile the output logic to use special hardware like the GPU, especially when limited to maths.
Functional Reactive Programming
For interactive I/O, I’d represent the input & output as timestamped linked lists (“Functional Reactive Programming”), and the output will wait until the specified time to compute & output the attached data. With some sort of handling for missed frames.
For input data might be read before it’s available, so we need special “future” values which are never cached by the JIT (less work for it!) & indicates to the output logic to recompute it at the specified time.
It’s easy to introduce concurrency in computing the output due to a lack of mutation, but how about introducing concurrency within computing a single datum? I don’t know the answer to this question, but I do have a few thoughts.
The question isn’t in knowing where can introduce concurrency because it can be introduced anywhere. The question is where we should.
And given GHC has struggled with this, profiling would probably be involved in the hueristic. As would making concurrency cheap with (nonblocking) userspace “green” threads.
Spectre & other CPU-level vulnerabilities
As for addressing CPU-level vulnerabilities like Spectre, there’s a number things in the JIT I’ve described that should help:
- Static typing
- Excellent dead code elimination (which might throw away Spectre’s branch predictor training)
- Unpredictable control flow, which can be exacerbated by output if needed.
- No support for arrays (unless it’s input?)
But most importantly you cannot measure how long some code takes to run (only time input signals), which I think fully address Spectre!