A thought experiment I find interesting is: How would I design special-purpose hardware specifically to run web browsers like the ones I’m implementing? In doing so can I address the memory bottlenecks in modern CPUs more simply?
Auditory Web Browser
Imagine a device which reads webpages to you aloud when you verbally state a topic for it to tell you about. Perhaps with some buttons or a joystick to control media playback. This is what my “Rhapsode” project is about, but how would I build it differently when designing custom hardware on up?
Ultimately Rhapsode’s job is to convert network responses to audio:
- It parses HTML/CSS from network responses, which also needs to be parsed & cached
- It looks up URLs from textual user input & data from the HTML page
- It expands those URLs into network requests
- It converts HTML, with a little postprocessing, to SSML via it’s referenced CSS
- eSpeak NG parses that SSML into internal opcodes.
- eSpeak NG uses internal DSLs to convert text into “phonemes” & on into an audio generation pipeline.
Both inputs & outputs are straightforward bytestreams, I just need to convert one to the other.
The primary task of a special-purpose processor designed for Rhapsode-like auditory web browsers is parsing data to be rearranged in the output. Parsing involves branching upon each consecutive character/byte/etc to determine the next state & more structured output to future steps. A fully-capable parser should be able to push & pop this state in order to parse nontrivial languages like XML, CSS, or any general purpose language. So how would implement this in hardware?
I’d split the memory it directly accesses into multiple memory segments which can be accessed concurrently. If more data than what fits inside the CPU is required it should be trivial, with one exception, for these to overflow to RAM or solid state storage.
- Input nybbles (4bits) would arrive on a FIFO queue. Perhaps a ringbuffer to optimize for large enqueues?
- Those nybbles would be added to the address register of another memory segments to traverse it’s graph, determining the next value for that address register. Unpredictable memory accesses patterns are concentrated here, and as such optimal encoding design is vital.
- The instructions retrieved from the graph would push & pop the current state on a graph. Use a shift register?
To optimize the graph traversal segment I’d allow the lookup tables for each node to overlap in a sparse graph (like Haskell’s Happy/Alex), with a fallback instruction in a sidetable stored in a seperate memory segment. Another sidetable would list the external memory pages that might get called (whilst pushing the current address to the parsing stack) so they can be prefixed & referenced concisely. And perhaps I’d have a caching layer between the two for within-page callable states.
Once we have this data parsed we can’t necessarily send it straight to the speaker, etc as output. It may need to be reparsed according to a MIMEtype or port number, or dynamically compile a new parser. We may want to populate a datastack or other collection to consult. Maybe another parser needs to know e.g. what connections are open? We need to choose whether we’re outputting to e.g. speakers or network.
- The stack would drive prefetching of instructions onto an output ops queue (ringbuffer to allow loading instructions out-of-order?)
- I’d use a “capabilities stack” to determine which open hardware, parsers, or memory channels are available.
- The output opcodes would direct their bytes (constant or variable) to one of those channels.
Persistant storage would be required to hold the software (a.k.a syntactic grammer), HTTP cache, cookies, bookmarks, history, configuration, etc. Other network caches would also be needed, but not I think require persistance.
While (nonpersistant) storage would already possible by altering code to parse filepaths into filecontents, it’d be tempting to add minimal circuitry for decoding BTrees into normal parsing micro-ops. This BTree could also be useful for implementing the timers required by TCP.
SSDs would be exposed to this processor within the same address-space as RAM. The only difference is I can’t let the SSD point to data in RAM, a “write barrier”.
When a TLS packet comes in, we want to continue the parser registered to handle that network connection. When an HTTP response comes in we want to resume the code which sent it. If we’re computing audio output too fast we need to wait until the sound card’s ready for me.
For this reason I’d add special instructions for saving the current process state into a BTree or parser, so that when parsing reaches that instruction it’ll replace the data in the current stacks.
Maths would rarely be used in this usecase, but it’s definitely present. HTML attributes & CSS properties need to be sorted. Network packets need to be error corrected. TLS packets need decrypting. CSS units needs decoding. And most importantly sound waves need to be generated, though I’m not sure where the split would be between this circuit & the sound card.
So I’d add a tiny sideprocessor that the output unit can program to perform these tasks & produce it’s own output. The design of this sideprocessor isn’t too important, but the simplest design I can think of would have:
- 2 read buses, 1 write bus
- ALU capable of adding, incrementing, subtracting, comparing, & bitwise ops.
- 16 registers, including ones for:
MEMORYfor access to some ammount of internal RAM.
ADDRESSto determine where in RAM
MEMORYreads/writes, also stores literals from the instruction stream
FLAGSto store comparison results
- 16bit instructions (ALU control, input registers, & output register)
- Optional 16bit instruction header (condition, & literal or bitwise-shift ammounts)
I’m not all that clear on what voice recognition entails, but I do know it involves analysing audio to extract distinctive characteristics to find into some form of a weighted graph. Maybe those weights adjust so the machine can learn?
Utilising the natural concurrency of hardware this is easily solved using a matrix multiplication circuit not disimilar to what’s in your GPU. Or by configuring the electrical network to resemble those graphs & see how long it takes signals to traverse it, allowing them to save energy by flipping fewer bits. This has been an active area of actual research, though it’s applicable to much more than just voice recognition!
Visual Web Browser
How about a visual web browser, more like the ones you’re used to? Everything the auditory web browser hardware is doing (except voice recognition) is relevant so I’d reuse it’s processor. Even the task of converting text to “phonemes” & rendering those to audio resembles the task of converting text to positioned “glyphs” & rendering each of those glyphs, the only difference is heavier use of that maths core to evaluate the I/O-limited Turing-complete programs placed in font files.
Once a page has been styled a browser needs to compute where it shows up onscreen. This requires traversing over the tree several times both preorder & postorder. The pushdown automaton can do this, but that would involve the overhead of repeatedly encoding/decoding the tree only to access an explicitly underpowered ALU. It’s worth designing it a seperate SIMD (Single Instruction Multiple Data) unit capable of traversing trees!
To create a SIMD unit you build numerous compute units following instructions from relatively few control units. In our case the operations involved laying out a webpage are mostly vectorized addition, subtraction, & comparison with some multiplication. Making the interesting question: how do we add multiplication in without slowing down the other more common operations?
So I’d combine the adders dedicated to different tree nodes into single circuit to sum all the intermediate values produced from a multiplier’s AND gates. We can reuse subcompoments of this circuit to get shifting, bitwise-and, & sum circuits. Furthermore I’d “pipeline” multiple numbers through this multiplier simultaneously for all the different nodes it covers, a common technique in GPUs.
The biggest difference between programming on a SIMD unit as opposed to a SISD/MIMD unit is control flow: the control unit(s) can’t take a branch until all children have agreed it should be taken! Otherwise it has to send each instruction to all the compute cores for them to decide whether to store the result of the operation.
So conditions get embedded inside each microcode, and it becomes more valuable to explicitly store boolean bits. Especially since vector operations produce multiple of them!
It may be useful for compute cores to choose which control unit they follow based in the
display CSS property of their current node.
The programming model would allow developers to traverse the tree from leaves to root or root to leaves. Each compute core would be able to read/write the fields of it’s own node. Depending on which direction the traversal occurs they’d also be able to read their parent’s fields or (which is vital for line splitting) iterate over it’s children. No memory outside of those would be accessible.
To implement this I’d use shift registers: once the current node has finished being computed it’s parent or children will be shifted where that compute can quickly access it. The main challenge would be to allocate tree nodes to compute cores…
I’d accomplish this using an extra dimension on those shift registers, which could also be useful for optimizing child traversal, & an incrementing/decrementing address register. To add a child node it increments the address register & shift in all it’s children beneath it where they can be shifted where the compute cores can shift to them when ready. If there’s not enough space it’ll overflow to the next layer of the tree. Once all children have been added it’ll decrement the address register & restore the position of the shift registers.
To ensure there’s always space for a child to be added under the next child I’d alternate the direction each level of the tree is shifted in. To make this work the overflow logic I’d move overflow children down two levels, not one.
In a SIMD unit it can be beneficial to move complexity out of the numerous compute cores to the relatively few control units. So what high-level operations would I find useful to add?
- Floating point arithmatic - Lowers to component integral arithmatic
- Micro-optimized “sum field of children” instruction - We have all the hardware for this from the multiplier, but doesn’t necessarily fit the programming model otherwise
- Loops - Stays in loop until all compute cores issues a BREAK back
- Index array by loop counter - so we can support, most importantly, string processing whilst letting the compute cores to pretend to be pure register machines
- Data stack - To ease compilation; stores in trailing padding bytes, selectively activates shift upon overflow
Now that we’d know where to place everything onscreen, what’s the minimal hardware we’d need to “composite” them there? Let’s start by listing the operations required:
- Composite numerous “sprites” onscreen.
- Fill axis-aligned bounding boxes with optionally tesselating images
- Solid colours could be modelled as tesselated single-pixel images
- Apply bitmasks - aids other processes to fill in gaps, vital to render glyphs “right”
- Scaling images
- Vector graphics for glyphs & SVG
- Transparant Colours
- Border corners can could be rounded or curved
- Linear & radial gradients
- Drop shadows
- Matrix transforms
The Hardware Pushdown Automaton described previously could handle the small-scale vector graphics of glyphs quite well. Larger scale vector graphics is all question in & of itself! And the described SIMD unit, bypassing it’s support for trees, would be perfect for computing rounded corner bitmasks & matrix transforms.
An easy way to handle the sheer quantities of sprites is to load their bounding boxes a range tree. That is a BTree representing the range of pixels each node covers. Maybe it’d be good to have seperate ones for X & Y axese, updating X for each new row? The great thing about range trees is there’s an upper limit to their memory use!
Upon knowing the correct sprite for the current pixel to be outputted, tesselation could be performed using a division/remainder circuit. determining x,y position to lookup in the sprites image. Another division/remainder circuit can be used to lookup the correct bit in the bitmask to check to determine whether to treat this pixel as transparant.
That division can be adjusted to support downscaling images, feeding it’s results to determine the proportions (power-of-twos/bitshifts) to sum into the final value. This upsampling circuit could also be used to apply blurs for dropshadows or compute linear gradients. Larger downscalings or upsamplings could be offloaded to the SIMD unit.
The above process could be repeated for each parent sprite until the result is opaque.