Browser Hypothetical GPU

How’d I extend our hypothetical hardware for implementing an auditory web browser (and its underlying OS) with a custom-designed GPU for visual HTML rendering once its been styled? Like most of you are used to, like webdesigners are all too often used to?

Computer graphics are complex & unfortunately it’ll be infeasible for me to trace the entire codepath like I did for audio. Especially since the default codepath is the complex one! So I’ll focus on different topics for each subsequent page. Here I’m discussing the extra hardware I’d need to add to produce visual output, and what languages/compilers I’d want to use to program them.

Output Monitors

What’s the output device?

Modern monitors from an electrical perspective are essentially analog write-only RAM. Using chemicals which produce light proportionate to each capacitor’s voltage (OLED), or rotates the “polarity” to determine how much of a LED or plasma backlight passes through a pair of sandwiching “polarisers”.

To get colours out of this we typically add a colour mask behind the screen dying the light red, green, or blue to be blurred together in our vision. Or newer monitors (QLEDs) may use fancier technology to shift the light’s wavelengths as opposed to filtering it!

Old displays shot electrons towards each point (at varying brightness) of a phosphorescent surface, with our electrical protocols still adhearing to the timing those electron beams required, with gaps between lines & frames.

Compositor

Once a webpage has been laid out, how’d I render it? Avoiding early shortcuts which only works for certain (privileged) languages, a Display List Rasterizer would be very well suited to our needs. These were then-advanced graphics chips from the late 1980s which could composite an arbitrary number of sprites onscreen, as long as they’re in sorted order by top-left corner.

It’d iterate over each sprite & when counters hit its coordinate starts outputting that sprite. These sprites would be enqueued into a 2nd internal array so that we can merge it into subsequent rows until we reach the sprite’s height.

To handle rendering textual characters (“glyphs”) or images there’d be some internal “texture memory” in which to lookup pixels to composite onscreen. As for backgrounds/borders decorations… A dozen or so multiply-sums should nicely cover colour-mixing, texture-subsampling, gradients (linear or radial), rounded corners, border styles, etc!

As for tesselation, I’d preserve registers so we can use conditional-subtraction. And there’d be an option to preserve registers in the queue of “active” sprites so we can tessellate vertically. I’d let sprites adjust their x/height coordinates between rows to aid rendering trapezoidal borders. Pythagoras can be used for rounded ones, with conditionality.

Combine with pre-rendering to add support for blurs/shadows, other convolution filters, & geometric transforms.


That would be a register machine, 32bit or even 64bit vectorized (each register storing 1 32bit number, 2 16bit numbers, or 4 8bit numbers) would be very suitable! With special coordinate registers.

I might add circuitry to save the rendered output, that might help:

Also we can minimize the impact of not having enough time to fully compute certain pixels by using those occurrences to heuristically determine where we might have trouble in the next row. And should use the gap between rows to precompute as much as we can.


To help handle cases where we have more images to display onscreen than fit in this Compositor’s texture units, I’d add a queue of prefetched images to be loaded into the texture units as-needed. It may even merge upcoming where that’s straightforward to do in hardware.

I’ll later describe some more versatile software to address this same issue, but there’s only so much that’ll be able to do without some hardware optimizations.

Layout Unit

Now that I’ve established the hardware for displaying & compositing imagery, how’d I design a circuit to determine where to composite text, images, & backgrounds/borders decor?

This process involves repeatedly traversing the styletree (which would be computed the same way as I described for the auditory browser) to negotiate sizing back & forth. So I’ll design a “tree-SIMT”! It’ll process binary trees, though the input tree won’t actually be binary.

I’ll base this coprocessor upon modern GPUs, with tweaks to allow traversing binary trees. Other trees would be encoded in these binary trees. That is I’d have numerous “Compute Units” following instructions from a shared Control Unit. The opcodes would include conditions determining whether or not to apply them in each Compute Unit. These Compute Units would be 32bit (or 64bit?) vectorized register machines, beyond that I’ll dedicate a thread justifying what ALU to give them.

I’d have a few different techniques for traversal over the tree. For the topmost layers I’d create hardwired connections between the Compute Units forming the binary tree. For deeper layers I’d have shiftregisters bringing each next layer to its Compute Unit, traversing up & down the tree. If I can’t fit the entire styletree in this chip… I’d overflow to separate memory pages. Since it’s likely unbalanced (our chip wants balanced trees) I’d combine overflowing subtrees with “pseudoroots”. Also during post-order traversals I’d allow a parent to traverse its children, maybe with parallel computation. This could be vital at times!

To process trees with arbitrary numbers of children per node, I’d include a pass in the Output Unit which encodes those trees into binary trees. With first-child on right branch & next-sibling on left branch. Except… We want balanced trees! So add a bitflag indicating the right branch is also a sibling. Where both branches are siblings it’d be useful to sort into infix-order to aid some of that parallel computation of childnodes!

To program this thing… It’d be useful to have pure-functional programming language as a lower-level option, compiling recursion specially. Or there’d be a higher-level option designed specifically CSS I’ll continue discussing…

Guess what processor’s perfectly designed to optimize its own code?

Collections

A key task we’d need to perform on our “Tree-SIMT” Layout Coprocessor is to process collections. The primary collection would be the child nodes, but we’d also have:

Text literals & pseudoelements would primarily serve to be spliced into the children collection. Under the hood if I have other collections, I’d store them in a metacollection under the right branch.

Upon these collections I’d have map (denoted with curly brackets), filter/query (denoted with square brackets), & concatenate (comma) operations. Maybe I’d even want round parentheses to apply the collection to a pure-function written in a seperate language, for when this domain-specific-language is poorly suited to the task?

In the map operation the “~” prefix would refer to the previous-child. In a query it’d refer to the last candidate for which the expression returned true.

Those operators should do a good job at making most of the fancier tasks this Tree-SIMT needs to perform easy to express! There’d be an optimization pass rearranging commutative operations to get the most parallelism out of them. Absent collections would be treated as empty.

I may include a dedicated bus in the bus for splicing global collections into the tree, though I’d have an optimization pass which prefers running it on the Parsing Unit.

Layout ALU

What operations would we primarily want our Tree-SIMT Layout Coprocessor to perform? What ALU would we want to give it?

We’d have lots of running-sums, additions, subtractions, halvings, comparisons, & less-frequently multiplies & divides. So it’d be appropriate to make each of our Compute Units loosely resemble the Arithmetic Core I described previously. The main difference is that I’d have more & larger registers, without the ADDR, RAM, or PC pseudoregs. There’d be no random-access memory in this coprocessor, no pointer dereferencing, to simplify things. I’d have 32bit registers (or even 64bit) which can be “vectorized” into 2 or 4 ints to aid colour mixing. For multiplies, running sums, etc I’d repurpose the subtree (interconnected compute units) under the active ones to assemble a summer. If they’re busy or absent I’d incur a microcode loop instead.

I’ve already mentioned tree traversal operations: preorder & postorder loops, left/right/up nav, splicing in subtrees. Arrays & floating point numbers would come up often enough that it’d be useful for the Control Unit to decode those machine codes, though not enough to add these concepts directly to the Compute Units. Also have a special STACK register could be handy to ease compilation! I’ll discuss arrays later… ROM would be used to decode the machine code into microcode, unless we decide to add yet another compilation pass instead! The Compute Units as described would themselves be simple integral register machines.

Control Flow

Extending the Tree-SIMT Layout Coprocessor I’ve been describing to support to support control flow would be interesting. On the one hand, having all the Compute Units share a Control Unit makes this complicated. On the otherhand webpages have interesting requirements. I’ve described how GPUs might do this.

For the simple cases I’d embed a condition into the microinstruction to determine whether it actually takes effect inside each Compute Unit.

As for loops… I’d include a counter in the control unit so the machine code or microcode can repeat a handful of opcodes a given number of times. With a slight tweak this circuit could also be used to iterate over arrays (gradient colourstops, background layers, text chars, etc) decoding them into consecutive registers. Supporting more-freeform loops could be done by setting a maximum number of iterations & adding condition to every instruction in its loop body.

But primarily we’d branch over enums… Previously I stated I might include a secondary, larger bus for splicing collections (which the Parsing Unit hasn’t already queried for us during preprocessing) into the tree the Compute Units can process. I could repurpose this to add support for lookup tables, efficiently covering the more trivial cases where each branch corresponds to a literal number. Maybe they could dereference registers too, I’m not sure?

But for the all-too-common complex cases I’d build a few extra Control Units! To handle, say, different values of display differently I’d have a couple opcodes which checks whether any Compute Units under the Control Unit has the given “discriminant” value (special opcode to set it). If so it’d attempt to hand those Compute Unit off to a free Control Unit for it to walk them through the referenced branch, or failing that it’d follow the jump itself. Another opcode would wait until all spawned concurrent tasks complete. These optimizations would be worth it here!

Topological Sort

When implementing CSS layout, the main struggle I faced was ensuring I’ve routed all the necessary data to the places which needed it. Haskell helped some, but a domain-specific language for implementing CSS would want to go further!

I’d have a property tree with numeric leaves, which I’d include a compilation pass to compute a topological sort for. By generating the code to compute whichever properties we can from the data we have so far until no more remain. Easiest to do in software, as a compilation pass!

When there’s no obvious next step we’d have a cyclic dependency to break (yes, these do really occur when implementing CSS layout! Especially due to percentage sizing), which could be broken using special curly-braced syntax denoting “when computing this property, use the given transient values for those dependencies). Failing to find one of these I’d locate a cyclic dependency, to emit an error message with a suggested fix. $ & its subproperties would correspond to machine code to output to the Display List compositor.

That curly-braced syntax could also be appended to boolean expressions to express if-branches, depending on the non-conditional assignments. Processing parent (^ prefix) & children (@ collection) properties would be deferred until last hoping to merge loops. Parent would incur a preorder computation, children would incur a postorder computation. Combining the 2 in an expression (typically incurred by CSS %units) would be treated cyclic dependency! This is what’d make this DSL magical!


This pass would generate a sizeable prelude to all the loops, and we can simplify the performance demands on our tree-SIMT by compiling this prelude specially. Using the Parsing Unit to query collections, Arithmetic Core to perform simpler maths, & the tree-SIMT for complex maths. Taking advantage of Scientific Notation to simplify metric & percentage unit conversion.

Whilst ensuring multiplies, scientific-to-IEEE-float conversion, trigonometry, etc is fast using the entire Tree-SIMT.

CSS Token Syntax

To help interpret the input CSS provides to its layout formulas, I’d add a layer of syntactic sugar for matching CSS tokens. With properties denoted by the lack of symbolic-prefix.

# (for floats) or ## (for ints) would separate a property (or type) from a numeric unit. This would compile to a branch multiplying the property by the given value in a conditional branch testing if the given unit (or lack thereof) is invoked.

Subproperties (. separator) would denote keywords. Those keywords could also be treated as boolean expressions, compiling to a test whether a subproperty holds the dynamically-generated numeric ID representing this keyword.

Appending parentheses containing comma-separated arguments (which may be annotated with types or ranges using : operator) denotes any parameters should be exposed as subproperties. Or a value could be assigned using = operator, or curly-brackets for subproperties. Such functions could be defined multiple times to perform pattern matching.

Then there’d be other types like url() which takes a MIME-glob & returns the parsed response from other DSLs. Or strings/identifiers, strings/images, freeform identifiers, etc. Where a property (unusually) consists of multiple numbers, I’d denote those as subproperties between units, etc & the CSS property. Where its unambiguous these values can be parsed from CSS in any order. To allow for code reuse I’d have globbed properties & reusable (denoted with :, duplicated in codegen) types.

Not only would I compile this syntactic sugar into code to run on the tree-SIMT but also I’d compile it to a tentative parser for these CSS properties! A separate DSL (resembling ones already explored) would refine those parsers to properly match the W3C Recommendations.

These units, keywords, functions, etc could also be invoked in expressions for concision, especially for output.

CSS Lists

Whether for gradients’ colourstops, multiple background layers, or the characters (integral “codepoints”) in the text being laid out, etc CSS includes on occasion comma-separated lists.

To play well with the numeric property-tree datamodel (easing compilation & typechecking) I’d introduce a ... property which gets postprocessed to form maps, folds, filters, zips, etc. Whilst tweaking the tentative parsers being generated. I’d probably to declare properties as being of array-type. Having those type declarations gives us a chance to specify how many items to store inline…

After the topological-sorting pass I’d have some post-processing to convert computation upon, or assignment to, arrays to:

Summary

To conclude my recent threads extending my hardware-browser hypothetical to support visual output, I’ll describe the compilation pipeline designed to aid efficient implementation of CSS layout.

  1. Extract nametables for reference.
  2. Lower CSS tokens, globs, & types inferring several if-conditions, whilst generating a parser via another DSL understanding CSS tokens & special combinators.
  3. Compute a Topological Sort, compiling loops’ prelude specially to make full use of the hardware.
  4. Infer loops & lower branching, determining expected output from 3’s loops’ prelude.
  5. Lower collections into binary-tree processing, including the style-node’s children.
  6. Peekhole optimizations, including constant propagation & strength reduction.
  7. Register allocation.
  8. Hardwired circuits lowering arrays, floating point arithmetic, fixed-iteration-count loops, & maybe datastacks; yielding integral register machine microcode with concurrent tree traversals.

Or for more general-purpose language, we could use Pure Functional Programming compiling recursion specially.

This Tree-SIMT should be well-suited to optimizing its own code! It’d have a direct connection to a Display List Rasterizer similar to what you might have found in the late 1980s. It’d mainly programmed using a graphics library in the CSS domain-specific-language (DSL), defining units, identifiers, functions, etc lowering to multiply-sum & texture-fetch operations to be run as the composited image is sent to the display.


Pseudocode

To make this hypothetical a bit more tangible, I’ll start writing interesting pieces of the code it would be running. I haven’t done this for the auditory browser, since the code that runs would be little than the standardized BNF grammars! Please note that the illustrative code here is very rough with loose ends, but it is using hypothetical syntax closely matching the capabilities of the hardware I’ve described.

Functional Language

To help bootstrap up to the full CSS DSL, I’d start with a basic functional language. The design goals are to quickly build a convenient syntax in which to implement tree traversals.

A concatenative syntax would certainly suit.

Concatenative Assembler main: SPACE* ( @'-'? @DEC+ { # Output Unit's handling of this branch !/Math.out # Defines Arithmetic Core's assembly language alu << SHIFT 3,1 r0: r0r0 r0: LIT,r0 parse <- alu parse << text # Input from Parsing Unit tCONST = ... # tree-SIMT's Literal opcode saving to STACK pseudo-register. out << tCONST parse # Since there's no output, defaults to reading r0. } "'" @. { out << tCONST text } | @word { # preceding is for internal idents. identifier << text # identifier's inherited to resolve each identifier to its inlined code. } | '{' declarations '}' # Will explore later | '[' { body = out! }(main){ tSKIP = ... out << tSKIP |body| stack << |out| out << body } ']' # Control flow primitive | '(' .* ')' # Comments | SPACE* EOF ) -main; # `-` means let other paths take priority. word: '`'? ALPHAX ALNUMX* ('`' .)? # Trailing `x could be used to feed argument counts to resolver, if supported.

Especially if its wrapped in a configurable infix-syntax! Here I’ll treat the ! operator specially to populate the operator-precedance table, whilst handling imports & comments itself.

Infix Parser !/Unicode.syn main: { lparen = 1 1; rparen = 1 2 !/Math.out; !Map.out # Input would be a ID number & precedence looked-up externally. It would be better if we had > 254 operator IDs. alu << r0: LIT FETCHf rx IFz 0 ADDR:~ 0 alu << CONST lparen CMP r0LIT IFe; push = |alu|; alu << 0 J popOp = |alu| alu << SHIFT -8,-8 CMP RAMr0 UNLESSg; exit = |alu|; alu << 0 J # Loop for greater precedences alu << POPf UNLESSc 0 OUT RAM UNLESSc popOp J # Pop & output them exit << |alu| alu << LIT rparen CMP r0LIT IFe; popParen = |alu|; alu << 0 J push << |alu| alu << RAM: r0 PUSH HALT popParen << |alu| alu << POP parseOp <- alu alu << OUT+1 LIT; count <- alu # And add a parseDec routine supporting negatives # Add list that outputs n ":"s. ops <- map; iOps <- map ops << '(' 0 lparen; iOps << lparen 0 '(' ops << ')' 0 rparen; iOps << rparen 0 ')' ops << '!' 0 20 0; iOps << 20 0 0 '!' nextID = 1; left = ""; newOp = "" } expr* SPACE* { parseOp << 0 }; expr: SPACE* (@WORD { out << text "`" 0 " "; left = text } # Should count arguments... | "`" @op "`" { out << text "`" 0 " "; left = text } | '!' { newOp = left } | @op { text >> ops >> parseOp >> iOps >> out; out << "`" 0 " " } | "`" @word "`" { text >> ops >> parseOp >> iOps >> out; out << "`" 2 " " } | @'-'? @DEC+ { `.*` << newOp { prec << parseDec << text; ops << text 0 prec nextID; iOps << prec nextID 0 text } # FIXME: Push op to restore previous precedence? out << text " " } | '"' {charCount = 0} (. { out << "'" text; charCount << count << charCount } )* '"' {list << charCount; out << " "} # Mostly for docstrings. | '{' .* '}' # Comment | '[' @.* ']' { !infix.syn << FILES << text } ) op: OP* | '(' | ')' / '-'

For this language to work, we’d need build an identifier table, at which point I’ll introduce pattern matching. Adding special meaning to the =, ,, ;, ::, @ (to define primitives, accessors, & argument captures), ->, & #. Types & type declarations are handled here, generator both constructor & accessor functions.

Identifier gathering !abstract-interp.syn # Not going into stack-tracking code. main: { !/Map.out; decls <- map; types <- map; datatypes <- map; doc <- map; identifier <- identifier; body! } decl { decls >> !patternmatch.syn >> identifier; body >> !recursion.syn >> identifier body >> !concat.syn } decl: -pat@expr body@expr "=" ("`" 2)? { decls << (exproot << pat) 0 pat " = " body " ; " } # FIXME: Handling "|" properly? | generics@var* ident@proper body@typedecl "=" ("`" 2)? { typedecls << ident 0 generics body} | ident@expr type@expr "::" ("`" 2)? { types << (exproot << ident) 0 type } | ident@expr opcode@num "@" ("`2")? { identifier << (exproot << ident) 0 opcode STACK STACK STACK } # Base-case name resolution for "primitives" | expr expr `!` # Leave alone, operator parser dealt with this! | decl decl "," ("`" 2)? | decl decl ";" ("`" 2)? # Sequence ops | @expr { body << expr " " } # Allow stack to pile up to form a tuple. expr|: @expr @expr (";" | "," | "=" | "::" | "!" | "@") ("`" 2)? { out << "{" text "}" } # Instruct concatenator to run this preprocessor over them. | @expr expr "#" ("`" 2) # TODO: Capture docstrings into doc collections? What outputs most useful? typedecl: ... # Comma or semicolon seperated type decls with auto-generated function bodies. # Treat `@` amongst fields specially to autogenerate accessor functions by that name, or declare recursion compiler should generate arrays. # `->` is treated as a type constructor, interpreted specially as defining functions.

That defers to postprocessors introduce the magic of pattern matching (using @, |, ->, & ...), or of the target co-processor’s concurrent tree traversals (overrides recursive call to refer to special opcodes)!

Pattern matching !abstract-interp.syn main: { !/Map.syn; branches <- map } decl (";" decl)* ";"? { !buildcase.syn << !/map/values.syn << branches } # buildcase.syn would be co-recursive, adding CPU opcodes to enact this pattern-matching decl: -xs@expr* x@expr name@ident "=" body@expr { branches << (exproot << x) 0 xs (expargs << x) name " = " body " ; " } | xs@expr* var@ident x@expr "@" ("`" 2)? name@ident "=" body@expr { branches << (exproot << x) 0 xs (expargs << x) name " = " body " ; " ... # Emit capturing operator & add relevant identifiers to table. } | xs@expr* var@ident name@ident "=" body@expr {...} # Emit capturing operator & add relevant identifiers to table | xs@expr* expr@expr pat@expr "->" ("`" 2)? name@ident "=" body@expr { expr >> !concat.syn # FIXME Common-Subexpression Elimination? Is it needed for valid output? !buildcase.syn << xs pat name " = " body } | xs@expr* start@num end@num "..." ("`" 2)? name@ident "=" body@expr { ... # Emit comparison opcodes !buildcase.syn << xs name " = " body } | decl guard "|" ("`" 2)? "=" body@expr guard: x@expr pat@expr "<-" ("`" 2) { expr >> !concat.syn; !buildcase.syn << x } | @expr { expr >> !concat.syn; out << IF } # Whatever "IF" is in the output...
Recursion I'm not writing concrete pseudocode for this as it heavily deals in the arbitrary specifics of the target machine code. But it'd take the recursive function & wrap it with a concurrent "TRAVERSE" opcode triggering recursive calls to run on lower layers of the tree. This could generate a recusive datastructure (referencing the type declarations) stored there. Bi-recursive optimal, but (if a datum spreads accross multiple binary-tree nodes in the hardware) not required. Mono-recursion may chunk the data into arrays according to `@` operators (with numeric operand) in the type declaration.

The types collection gathered there could be useful for a typechecking pass (or rather 2, running both forwards & reverse, though a self-hosted one would be faster), & combining it with the doc collection could yield some nice reference documentation. Also Constant Propagation (find constant codepaths & evaluate ahead of time), common-subexpression elimination (traverse opcodes consulting/updating a mapping to determine when to replace them with a reference to previous opcodes), & Strength Reduction (locate expressions which can be replaced with simpler variants) optimization passes would be easy to implement whether self-hosted or not! (Maybe we’d also want register allocation to correct for when we overflow available registers, that’d be more complex) Keep both variants for bootstrapping.

Higher order functions are unfortunately unimplemented, though they’d fit nicely syntactically (have operator parser feed through argument counts to identifier resolution, or use an operator) & they’d mostly if not entirely optimize away in a constant-propagation pass. Its not worth extended the target instruction set to support it, instead falling back to static dispatch if thats ever necessary.

This will be a pretty decent functional language to write later pseudocode in… Though it doesn’t have typeclasses, I didn’t think I’d need them. I’ll pretend I’ve implemented higher-order functions.

CSS DSL

Once parsed/desugared with identifiers resolved (running a non-trivial grammar in the Parsing Unit), the first pass for compiling the CSS DSL would be a topological sorting from whence much of the power would come.

Topological Sorting

        [/lang/op/prelude.op]; {L is a binary tree, L' is an array. Different optimizations for the hypothetical hardware with different tradeoffs}
        [expr.op];
        Decl = Decl :: key@String -> deps@L' String -> val@Expr -> fallbackFor@Maybe String -> condition@L' Expr -> Decl;
        topsort :: L Decl -> L Decl;
        topsort Nil = Nil;
        topsort (split doable -> next := rest) = sortBy declPrio next ++ topsort $ map (cleardeps next) rest;
        topsort = error . locatecyles; {Main processor is more suited to converting this into nice error message, crashes this process}
        doable _ = False;
        cleardeps decl (split isFallback -> fallbacks := done) = clearfallback fallbacks $ clearnormal done decl,
            clearnormal (done:dones) decl = clearNormal dones $ depsM (filter' $ `!='` $ key done) decl,
            clearnormal [] = id,
            clearfallback (fallback:fallback) decl | Just for <- fallbackFor fallback, for =' key decl =
                clearfallback fallbacks $ depsM (filter' $ `!='` $ key fallback) decl,
            clearfallback (_:fallbacks) = clearfallback fallbacks,
            clearfallback Nil = Nil;
        isFallback (fallbackFor -> Just _) = True;
        isFallback _ = False;
        declprio = length' . condition;

        { Analysis to locate cycles tripping us up }
        locatecycles graph | head graph -> Just top = dfs (l1 top) visited graph
            | otherwise = []
        locatecycles (x:xs) visited graph = map (intersect `='` visited . deps) conflict ++ locatecycles (next ++ xs) (addL x visited) graph,
            split (overlaps `='` visited . deps) $ filter (in (key x) . deps) graph -> (conflict, next)
            { NOTE: This may not traverse over the entire tree, instead limiting output to most obvious fixes }

After generating code for loops from @ collection, ^-prefixed identifiers, & ... properties I’d need to lower “collections” like @ or string literals.

Collection Lowering

        [/lang/op/prelude.op];
        [expr.op];
        lowerCollections :: Decl -> Decl;
        lowerCollections (Filter self cond)
            { NOTE: These opcode might lower further in later stages... }
            | noX isPrev cond || isTransitive cond = Postorder self $ If cond Yield Propagate
            | otherwise = Iterate self $ If cond Yield Propagate;
        lowerCollections (Map self expr)
            | noX isPrev expr = Postorder self expr
            | isCommutative expr = Postorder self expr `Preorder` expr { Does this need tweeking in one or both passes? }
            | otherwise = Iterate self expr;
        lowerCollections (Get self "...")
            | isCollection self = PassCaller $ Iterate self
            | otherwise = PassCaller $ Array self;
        lowerCollections exp@(Set self "...")
            | isCollection self = error (exp, "`...` property on collections are readonly, use `,` concatenation syntax instead")
            | otherwise = PassCaller $ Array self . Set LoopVar;
        lowerCollection (Get Collections name) = Postorder fCollections $ If (Equals fieldName $ Lit Name) Yield Propagate;
        lowerCollection expr@(Lit _) = Concat fCollections $ NewCollection $ Set fContent expr; { Desugaring, making strings more powerful }
        lowerCollections (Seq exps@(headOr Nop -> exp))
            | anyX isChildContainer exp = Seq $ inferPostorder exps
            | anyX isParent exp || otherwise = Seq $ inferPreorder exps
        lowerCollections = recurseX lowerCollections;

        { Or is this a seperate pass from `lowerCollections`? }
        inferPreorder Nil = Nil
        inferPreorder (break (anyX isChildContainer) -> (preorderLoops, rest))
            | preorderLoops -> Nil = error (headOr Nop rest, "Expression can't mix parent & child references!")
            | otherwise = Preorder $ Seq preorderLoops `cons` inferPostorder rest
        inferPostorder Nil = Nil
        inferPostorder (break (anyX isParent) -> (postorderLoops, rest))
            | postorderLoops -> Nil = error (headerOr Nop rest, "Expression can't mix parent & child references!")
            | otherwise = Postorder $ Seq postorderLoops `cons` inferPreorder rest
Loop Prelude CodeGen (Notes)

The initial preorder loop would compile specially:

This would be a fair bit of logic performing codegen on disperate CPU architectures.

After possibly translating the syntax to a lower-level intermediate language, we’d want to run a few optimization passes which might be shared with the functional language.

Common Subexpression Elimination

        [/lang/op/prelude.op];
        [/lang/op/map.op]; { Implements a binary-search-tree }
        [operation.op];
        cse (Operation op x y) seen = cse x seen -> (x', seen'), cse y seen -> (y', seen''), dedup (Operation op x' y') seen''
        cse NoOperation = NoOperation
        dedup op seen = map'lookup seen (op'hash op) <| op, map'insert op'hash seen op
Constant Propagation & Strength Reduction (Notes)

This recursive function would have numerous branches pattern-matching on each subtree to replace degenerate structures with more optimal ones. I don't want to write out all of these cases.

A vital branch here would move as much duplicate code as we can out of branches-being-compiled, & once that's done choose the most optimal opcode for each to evaluate each branch its compiling.

At which point we might as well also check whether we have constant expressions that can be interpreted compiletime rather than evaluated at runtime.

Register Allocation

Optimally allocating registers to expressions is one of the problems struggle to compute an exact answer for, though if we constrain how loops & branches interact with our register allocation there is an optimal answer for straightline code. Or we could just use a stack machine, like was involved a bootstrapping a functional language onto this processor.

First examine the code to compute the live-regions during which a variable lives. Then assign available registers to each region in chronological order.

Throwing in hardware-level stack-allocation would help reduce the amount of register allocation which needs doing, especially regarding control flow which would otherwise yield inoptimal results in this algorithm. They'd interact by either assigning to previously-allocated registers by having registers allocated for their return value. If inputs are only sometimes-required I'd spill them to be stored alongside the child collections, requiring a tree traversal to get at them which might not always succeed.

The main processor would need to know which registers were allocated for each input value.

With these vital optimizations applied, we’d generate machine code!