Seeing a need for new CPU architectures, I got curious about ones based on the functional paradigm. And I’ve theorised before about this, basing my design upon the “G-Machine” Haskell compiles to.
But now I’ve finally gotten around to studying the “Reduceron” which is based more on the intuitive definition of functional programming. And FPGA prototypes have managed to get comparable performance despite running on a slower clock.
I’ll describe how it works now.
At it’s simplest the Reduceron is a stack machine.
If the top of the stack is a FUNC instruction, it looks up the referenced “template” from the dedicated “code” memory and copies it over to a dedicated “heap” memory. In doing so it rewrites any pointers (from APPLY instructions) and overrides the ARG instructions with the corresponding parameter.
APPLY instructions (which FUNC resolves to after the copy) simply copies a segment of memory onto the stack.
But how do we evaluate conditions in the Reduceron, or construct data structures?
In a “lazy functional language”, we instead store the code which will compute the data when it’s needed and cache the computation then. The Reduceron has a seperate stack to manage that.
And what we think of as data constructors instead are functions which pops the branches (also functions) off the stack and calls it with it’s field values. This really isn’t any different for the GMachine.
The final instruction(s) supported by the Reduceron are basic integer arithmatic (while this can be implemented using the other instructions, you’d waste lots of time and, worse, memory), but this requires defeating some of the Reduceron’s laziness.
So (originally) these operations were written in post-order form and the INT instruction would swap the top two items on the stack. So when we reach the “primitive function” it knows it’s args are both resolved, and if not it’s easily corrected.
Where the Reduceron really gets interesting is with it’s specialized memory needs, specifically:
- It needs to access the code, heap, and memory space in parallel, and as such those should be seperate circuits.
- For efficiency it needs to fetch multiple instructions simultaneously whilst copying/modifying individual values.
And ofcourse the heap memory needs to be garbage collected, which currently is done very naively in hardware.
NOTE: I’ve skipped over minor optimizations.