A thought experiment I find interesting is: How would I design special-purpose hardware specifically to run web browsers like the ones I’m implementing? In doing so can I address the memory bottlenecks in modern CPUs more simply?
Auditory Web Browser
Imagine a device which reads webpages to you aloud when you verbally state a topic for it to tell you about. Perhaps with some buttons or a joystick to control media playback. This is what my “Rhapsode” project is about, but how would I build it differently when designing custom hardware from the ground up?
Ultimately Rhapsode’s job is to convert network responses to audio:
- It parses HTML from network responses; the CSS those pages reference also needs to be parsed & cached
- It looks up URLs from textual user input & data from the HTML page
- It expands those URLs into network requests
- It converts HTML, with a little postprocessing, to SSML via its referenced CSS
- eSpeak NG parses that SSML into internal opcodes.
- eSpeak NG uses internal DSLs to convert text into “phonemes” & on into an audio generation pipeline.
Both inputs & outputs are straightforward bytestreams; I just need to convert one to the other.
The primary task of a special-purpose processor designed for Rhapsode-like auditory web browsers is parsing data to be rearranged in the output. Parsing involves branching upon each consecutive character/byte/etc to determine the next state & emit more structured output to future steps. A fully-capable parser should be able to push & pop this state in order to parse nontrivial languages like XML, CSS, or any general-purpose language. So how would I implement this in hardware?
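That pushdown model is easy to sketch in software. Here's a hypothetical table-driven version, assuming an invented action encoding (shift/call/return) rather than anything Rhapsode actually implements:

```python
# Hypothetical sketch of the parsing model above: each input byte indexes a
# transition table to pick the next state, while CALL/RETURN actions push &
# pop state so nontrivial languages (XML, CSS, ...) can nest.
# The action encoding here is invented for illustration.

def parse(table, start, accept, data):
    """Run a pushdown automaton; `table` maps (state, byte) -> action."""
    state, stack = start, []
    for byte in data:
        action = table.get((state, byte))
        if action is None:
            return False                       # no transition: reject
        kind, target = action
        if kind == "shift":                    # plain state transition
            state = target
        elif kind == "call":                   # descend into a sub-grammar
            stack.append(state)
            state = target
        elif kind == "return":                 # pop back to the caller
            if not stack:
                return False                   # unbalanced close: reject
            state = stack.pop()
    return state == accept and not stack

# Toy grammar: balanced parentheses, checked one byte at a time.
table = {
    ("S", "("): ("call", "S"),
    ("S", ")"): ("return", None),
}
```

The toy grammar only balances parentheses; real grammars would just have much bigger tables.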
I’d split the memory it directly accesses into multiple memory segments which can be accessed concurrently. If more data is required than fits inside the CPU, it should be trivial, with one exception, for these to overflow to RAM or solid-state storage.
- Input bytes would arrive on a FIFO queue. Perhaps a ringbuffer to optimize for large enqueues?
- There’d be some input preprocessor registers for freeing read pages, redirecting input to come from another page, optionally counting down bytes, splicing in “pseudobytes”, switching to bit mode, etc.
- Those bytes would be added to the address register of another memory segment to traverse its graph, determining the next value for that address register. Unpredictable memory access patterns are concentrated here, and as such optimal encoding design is vital.
- The instructions retrieved from the graph would push & pop the current state on a stack. Use a shift register?
To optimize the graph traversal segment I’d allow the lookup tables for each node to overlap in a sparse graph (like Haskell’s Happy/Alex), with a fallback instruction in a sidetable stored in a separate memory segment. Another sidetable would list the external memory pages that might get called (whilst pushing the current address to the parsing stack) so they can be prefetched & referenced concisely. And perhaps I’d have a caching layer between the two for within-page callable states.
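The Happy/Alex-style overlap trick can be sketched with a pair of shared next/check arrays plus the fallback sidetable. The array contents below are invented purely for illustration:

```python
# Sketch of overlapping sparse lookup tables (as in yacc/Happy/Alex): every
# state gets a `base` offset into ONE shared next/check array pair, so the
# sparse per-state tables interleave. If `check` disagrees about ownership
# we fall back to the state's entry in a sidetable of defaults.

def lookup(base, next_, check, default, state, byte):
    i = base[state] + byte
    if 0 <= i < len(check) and check[i] == state:
        return next_[i]           # dense hit in the shared table
    return default[state]         # sidetable fallback (e.g. an error state)

# One shared table for two states: state 0 goes to state 1 on byte 1,
# state 1 goes to state 0 on byte 2; everything else falls through.
base    = [0, 1]
next_   = [None, 1, None, 0]
check   = [None, 0, None, 1]
default = [-1, -1]                # -1 standing in for "error state"
```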
Once we have this data parsed we can’t necessarily send it straight to the speaker, etc. as output. It may need to be reparsed according to a MIME type or port number, or used to dynamically compile a new parser. We may want to populate a data stack or other collection to consult. Maybe another parser needs to know e.g. which connections are open? We need to choose whether we’re outputting to e.g. speakers or the network.
- The stack would drive prefetching of instructions onto an output ops queue (ringbuffer to allow loading instructions out-of-order?)
- I’d use a renametable (stored in the stack) to route data to hardware, parsers, or memory channels.
- The output opcodes would direct their bytes (constant or variable) to one of those channels.
Persistent storage would be required to hold the software (a.k.a. syntactic grammar), HTTP cache, cookies, bookmarks, history, configuration, etc. Other network caches would also be needed, but I don’t think they require persistence.
While (nonpersistent) storage would already be possible by altering code to parse filepaths into file contents, it’d be tempting to add minimal circuitry for decoding BTrees into normal parsing micro-ops. This BTree could also be useful for implementing the timers required by TCP.
SSDs would be exposed to this processor within the same address-space as RAM. The only difference is I can’t let the SSD point to data in RAM, a “write barrier”.
When a TCP packet comes in, we want to continue the parser registered to handle that network connection. When an HTTP response comes in we want to resume the code which sent the request. If we’re computing audio output too fast we need to wait until the sound card’s ready for more.
For this reason I’d add special instructions for saving the current stack into a BTree or parser, so that when parsing reaches that instruction it’ll replace the data in the current stacks.
Also when sending TCP we’d want to be able to pause the data generator. So once the input preprocessor reaches a low watermark I’d pause the stack with its input & flush out remaining output. The input unit is already well equipped to only read up to its high watermark, though when outputting to speakers the maths core may need interrupts to deal specially with this case.
Maths would rarely be used in this usecase, but it’s definitely present. HTML attributes & CSS properties need to be sorted. Network packets need to be error-corrected. TLS packets need decrypting. CSS units need decoding. And most importantly sound waves need to be generated, though I’m not sure where the split would be between this circuit & the sound card.
So I’d add a tiny sideprocessor that the output unit can program to perform these tasks & produce its own output. The design of this sideprocessor isn’t too important, but the simplest design I can think of would have:
- 2 read buses, 1 write bus
- ALU capable of adding, incrementing, subtracting, comparing, & bitwise ops.
- 16 registers, including ones for:
  - MEMORY for access to some amount of internal RAM
  - ADDRESS to determine where in RAM MEMORY reads/writes; also stores literals from the instruction stream
  - FLAGS to store comparison results
- 16bit instructions (ALU control, input registers, & output register)
- If I’m given a 17th bit I’d use it for carrying the overflow flag, so I’m not tempted to add a pseudoreg for that.
- Optional 16bit instruction header (condition, & literal or bitwise-shift amounts)
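To make that instruction format concrete, here's a software model of one possible encoding. The field layout & opcode numbering are my own guesses for illustration, not a settled design:

```python
# Simulator for a hypothetical 16-bit maths-core instruction word: 4 bits
# of ALU control plus three 4-bit register fields (two read buses, one
# write bus). Opcode numbers & field order are invented assumptions.

OPS = {
    0x0: lambda a, b: a + b,          # ADD
    0x1: lambda a, b: a - b,          # SUB
    0x2: lambda a, b: a + 1,          # INC (ignores the second operand)
    0x3: lambda a, b: a & b,          # AND
    0x4: lambda a, b: a | b,          # OR
    0x5: lambda a, b: a ^ b,          # XOR
    0x6: lambda a, b: int(a < b),     # compare, FLAGS-style result
}

def step(regs, word):
    """Decode & execute one 16-bit instruction: op | src1 | src2 | dst."""
    op   = (word >> 12) & 0xF
    src1 = (word >>  8) & 0xF
    src2 = (word >>  4) & 0xF
    dst  =  word        & 0xF
    regs[dst] = OPS[op](regs[src1], regs[src2]) & 0xFFFF  # 16-bit wraparound
    return regs
```

The optional header word (conditions, literals, shift amounts) would be decoded the same way, just ahead of this one.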
A Lempel-Ziv decompressor would be placed between this & the output unit loading instructions into it, for loop unrolling. Then the output unit could find other uses for a Lempel-Ziv decompressor…
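For illustration, here's Lempel-Ziv decompression in its simplest LZ77-ish shape: the copy command re-reads recently-emitted bytes, which is exactly what re-rolling an unrolled loop looks like. The op format is invented:

```python
# LZ77-style decompression sketch: a stream of literal bytes & backwards
# copies. Copying byte-at-a-time lets a copy overlap its own output, so
# ("copy", 1, n) repeats the previous byte n times -- i.e. a tiny loop.

def lz_decompress(ops):
    """ops: list of ('lit', byte) or ('copy', distance, length)."""
    out = bytearray()
    for op in ops:
        if op[0] == "lit":
            out.append(op[1])
        else:
            _, dist, length = op
            for _ in range(length):    # byte-at-a-time so overlaps repeat
                out.append(out[-dist])
    return bytes(out)
```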
The maths core would be wrapped in circuitry allowing it to receive clock-tick, UI, & other interrupts. These would be handled via a special persistent register-bank/RAM-page.
I’m not all that clear on what voice recognition entails, but I do know it involves analysing audio to extract distinctive characteristics to feed into some form of weighted graph. Maybe those weights adjust so the machine can learn?
Utilising the natural concurrency of hardware this is easily solved using a matrix multiplication circuit not dissimilar to what’s in your GPU. Or by configuring the electrical network to resemble those graphs & seeing how long signals take to traverse it, allowing them to save energy by flipping fewer bits. Or use an analog circuit that resembles SRAM, though the fact it wears down with retraining means I probably wouldn’t go that route. This has been an active area of actual research, though it’s applicable to much more than just voice recognition!
Visual Web Browser
How about a visual web browser, more like the ones you’re used to? Everything the auditory web browser hardware is doing (except voice recognition) is relevant so I’d reuse its processor. Even the task of converting text to “phonemes” & rendering those to audio resembles the task of converting text to positioned “glyphs” & rendering each of those glyphs; the only difference is heavier use of that maths core to evaluate the I/O-limited Turing-complete programs placed in font files.
Once a page has been styled a browser needs to compute where everything shows up onscreen. This requires traversing over the tree several times both preorder & postorder. The pushdown automaton can do this, but that would involve the overhead of repeatedly encoding/decoding the tree only to access an explicitly underpowered ALU. It’s worth designing a separate SIMT (Single Instruction Multiple Threads) unit capable of traversing trees!
To create a SIMT unit you build numerous compute units following instructions from relatively few control units. In our case the operations involved in laying out a webpage are mostly vectorized addition, subtraction, comparison, & running-sums with some multiplication. That makes the interesting question: how do we add multiplication without slowing down the other more common operations?
So as the number of nodes on each layer halves I’d assemble them into a multiplier/summer capable of computing running-sums over all relevant children concurrently!
The biggest difference between programming on a SIMD unit as opposed to a SISD/MIMD unit is control flow: the control unit(s) can’t take a branch until all compute cores have agreed it should be taken! Otherwise it has to send each instruction to all the compute cores for them to decide whether to store the result of the operation.
So conditions get embedded inside each microcode, and it becomes more valuable to explicitly store boolean bits. Especially since vector operations produce multiple of them!
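Here's what that masked control flow looks like in software, with the branch condition becoming a per-lane boolean vector (pure illustration, no real ISA implied):

```python
# SIMD predication sketch: no core ever "takes a branch" on its own.
# Every core evaluates the condition, the control unit sends BOTH sides
# of the branch, and each core keeps only the results its mask selects.

def predicated(cond, then_vals, else_vals):
    """Lane-wise select driven by a vector of boolean bits."""
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

xs = [1, 5, 3, 8]
mask = [x > 4 for x in xs]                  # vector compare -> boolean bits
halved  = [x // 2 for x in xs]              # both sides get computed...
doubled = [x * 2 for x in xs]
result = predicated(mask, halved, doubled)  # ...then masked together
```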
It may be useful for compute cores to choose which control unit they follow based on the display CSS property of their current node.
The programming model would allow developers to traverse the tree from leaves to root or root to leaves. Each compute core would be able to read/write the fields of its own node. Depending on which direction the traversal occurs they’d also be able to read their parent’s fields or (which is vital for line splitting) iterate over its children. No memory outside of those would be accessible.
To implement this I’d use shift registers: once the current node has finished being computed its parent or children will be shifted to where that compute core can quickly access it. The main challenge would be to allocate tree nodes to compute cores…
I’d accomplish this by constraining the tree as understood by the hardware to being a binary tree. More free-form tree structures can be represented within a binary tree by storing the first child on the right branch and siblings balanced (with a couple bitflags) on both branches if they’re free. That would help make running-sums faster!
There’d be a need to support child traversal.
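As a software sketch of squeezing a free-form tree into a binary one, here's the textbook first-child/next-sibling variant (the balanced-siblings refinement above would spread siblings across both branches instead):

```python
# Encoding an n-ary tree in a binary tree: one branch holds the first
# child, the other holds the next sibling. Child traversal then becomes
# "step down once, then keep stepping sideways".

class Bin:
    def __init__(self, label, child=None, sibling=None):
        self.label, self.child, self.sibling = label, child, sibling

def children(node):
    """Iterate an n-ary node's children via the binary encoding."""
    kid = node.child
    while kid is not None:
        yield kid
        kid = kid.sibling

# <html><head/><body/></html> squeezed into a binary tree:
html = Bin("html", child=Bin("head", sibling=Bin("body")))
```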
In a SIMD unit it can be beneficial to move complexity out of the numerous compute cores to the relatively few control units. So what high-level operations would I find useful to add?
- Floating point arithmetic - Lowers to component integral arithmetic
- Loops - Stays in loop until all compute cores issue a BREAK back
- Index array by loop counter - so we can support, most importantly, string processing whilst letting the compute cores pretend to be pure register machines
- Data stack - To ease compilation; stores in trailing padding bytes, selectively activates shift upon overflow
Also there’d need to be support for lookup tables so this tree-SIMT can interpret shaping tables within the process of laying out a page.
Now that we’d know where to place everything onscreen, what’s the minimal hardware we’d need to “composite” them there? Let’s start by listing the operations required:
- Composite numerous “sprites” onscreen.
- Fill axis-aligned bounding boxes & (for borders) trapagons with optionally tessellating images
- Solid colours could be modelled as tesselated single-pixel images
- Apply bitmasks - aids other processes to fill in gaps, vital to render glyphs “right”
- Scaling images
- Vector graphics for glyphs & SVG
- Transparent colours
- Border corners could be rounded or curved
- Linear & radial gradients
- Drop shadows
- Matrix transforms
An easy way to handle the sheer quantities of sprites is to load their bounding boxes into an interval tree. That is a ternary tree representing the range of pixels each node covers, with the middle prong holding all sprites overlapping the chosen centre point pre-blended. Maybe it’d be good to have separate ones for the X & Y axes, updating X for each new row? The great thing about interval trees is there’s an upper limit to their memory use!
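A software model of that sprite interval tree might look like the following. The three-sprite example is invented, and real hardware would pre-blend the middle prong rather than merely collect names:

```python
# Interval tree over sprite bounding ranges: each node picks a centre
# coordinate, keeps every sprite overlapping it in the middle prong,
# and pushes the rest into the left/right subtrees. Sprites are
# (low, high, name) tuples along one axis for simplicity.

def build(sprites):
    if not sprites:
        return None
    centre = sorted(lo for lo, hi, name in sprites)[len(sprites) // 2]
    here  = [s for s in sprites if s[0] <= centre <= s[1]]  # middle prong
    left  = [s for s in sprites if s[1] < centre]
    right = [s for s in sprites if s[0] > centre]
    return (centre, here, build(left), build(right))

def stabbing(tree, x):
    """All sprites covering pixel coordinate x."""
    if tree is None:
        return []
    centre, here, left, right = tree
    hits = [name for lo, hi, name in here if lo <= x <= hi]
    return hits + stabbing(left if x < centre else right, x)

sprites = [(0, 9, "bg"), (2, 4, "icon"), (6, 8, "text")]
tree = build(sprites)
```

Per-pixel lookup only ever walks one root-to-leaf path, which is the bounded-memory, bounded-time property that makes this attractive for a compositor.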
Upon knowing the correct sprite for the current pixel to be outputted, tessellation could be performed using a division/remainder circuit determining the x,y position to look up in the sprite’s image. Or better yet, keep a register so I only need a single adder. A similar process can compute linear gradients, and handle the slope of trapagon edges.
Add a multiplier/summer to support downscaling images, feeding its results to determine the proportions (power-of-twos/bitshifts) to sum into the final value. This resampling circuit could also be used to apply blurs for dropshadows. Larger downscalings or upsamplings could be offloaded to the SIMD unit. Using this multiplier to compute squares allows (as per Pythagoras) cropping rounded corners & computing radial gradients.
The above process could be repeated for each parent sprite until the result is opaque.
We could combine this trapagon rasterizer with the Bentley–Ottmann algorithm running on the layout unit. Such that we have multiple glyphs/SVGs being rendered concurrently: under each’s left child we can store a minheap of upcoming line starts/ends/intersections, & under each’s right child we store a scanline iterator. We can even potentially spit out trapagons ahead of the compositor unit rendering them! This relies on the ability for the layout unit to traverse children!
It’d be a shame if devices such as these couldn’t stream audio & video. So here I’ll describe how they’d handle some of Xiph’s royalty-free open standards!
FLAC is an open-standard royalty-free audio format that generally uses very little data (and to a slightly lesser extent CPU time) without ever sacrificing any of the quality of the original recording (analog or digital, makes no difference).
The way it works is by storing parameters to an encoder’s choice of formula approximating the audio wave, then using “rice codes” to compactly encode variations upon it.
One possible choice is uncompressed, in case you’re sending it whitenoise.
For stereo audio it’ll compute “mid” (average) & “side” (difference) channels to avoid duplicating data between “left” & “right” channels. It chooses which two of these four channels to compress & encode.
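Mid/side really is just adds, subtracts & a shift. A sketch of the lossless round-trip, where the parity bit of the side channel restores what the averaging shift drops (this mirrors how FLAC-style decoders reconstruct, though the actual bitstream layout differs):

```python
# Mid/side stereo decorrelation: mid carries the average, side the
# difference. The averaging shift drops one bit, but side = L - R has the
# same parity as L + R, so that bit is recoverable -- fully lossless.

def to_mid_side(left, right):
    return (left + right) >> 1, left - right

def to_left_right(mid, side):
    total = (mid << 1) | (side & 1)   # side's parity restores the dropped bit
    left = (total + side) >> 1
    return left, left - side
```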
Rice codes avoid the need to store a compression dictionary by detecting how much variance is common. Below that threshold the variance is stored as a binary value. Above that threshold it’s stored as a sequence of 1s followed by a 0 bit, a “unary” number.
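A quick sketch of rice coding with strings standing in for bitstreams, following the unary convention described above (real FLAC also zigzag-maps signed residuals to unsigned first, and its bit conventions differ in detail):

```python
# Rice coding sketch: split the value at a power-of-two threshold 2^k.
# The quotient goes out in unary (a run of 1s then a 0), the k low-order
# bits in plain binary -- so common small residuals stay short without
# any stored dictionary.

def rice_encode(value, k):
    q, r = value >> k, value & ((1 << k) - 1)
    rem = format(r, "b").zfill(k) if k else ""
    return "1" * q + "0" + rem

def rice_decode(bits, k):
    q = bits.index("0")                     # length of the unary run of 1s
    r = int(bits[q + 1 : q + 1 + k] or "0", 2)
    return (q << k) | r
```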
The formulas are essentially a weighted sum (multiplicands are typically hardcoded, which aids compiler optimizations) of some number of previous samples with the variance incorporated.
Surrounding this is “blocking” to simplify the approximating formulas, CRC checks upon decompressed audio, syncbytes for seeking, & a metadata header which even includes an imagefile + mimetype for coverart.
The parsing unit would be kept quite busy if I gave it a bit-parsing (as opposed to byte-parsing) mode, which would be useful anyways for handling other file-compression schemes, IETF-standard bitflags, & the event-dispatch mainloop.
Though I suspect any real-world hardware FLAC decoders would find MSB/signbit branching & shifting-in input perfectly satisfactory.
The verbatim & constant curves might not even need the arithmetic core, being entirely handled by the output unit!
The fixed approximating curve has small hardcoded coefficients (3 & 6 are as easy as I ensured 10 was; 2 & 4 are even simpler) so I wouldn’t need to add a multiplier; otherwise it’s just sums & vector adds! Computing left/right from mid/side is an add/sub & maybe a shift.
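Those fixed predictors can be tabulated directly: the binomial pattern below matches FLAC's fixed orders 0 through 4, and every coefficient (1, 2, 3, 4 & 6) is shift-add friendly:

```python
# FLAC's fixed predictors: each order predicts the next sample as a small
# hardcoded weighted sum of recent samples (binomial coefficients with
# alternating signs). The encoder then stores only the residual:
# residual = sample - predict(order, history).

FIXED = {
    0: [],                 # predict 0: store samples verbatim-ish
    1: [1],                # previous sample
    2: [2, -1],            # linear extrapolation
    3: [3, -3, 1],         # quadratic
    4: [4, -6, 4, -1],     # cubic
}

def predict(order, history):
    """Predict the next sample from recent samples, newest first."""
    return sum(c * s for c, s in zip(FIXED[order], history))
```

For a linear ramp the order-2 predictor is already exact, so every residual is zero; that's the whole trick.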
It might be possible to keep everything in registers within my existing design, but a ringbuffer’s a good fallback (if I designed this hardware specifically for FLAC I’d ensure it has enough registers that it wouldn’t need RAM). The Lempel-Ziv decompression/loop-unrolling incorporated into loading the arithmetic core’s instructions would be very handy here!
But Linear Predictive Coding (useful for speech) with file-specified coefficients requires a weighted multiply-sum, which that arithmetic core wasn’t designed for. However I could repurpose the AI (Rhapsode computer) or Layout (Haphaestus computer) coprocessors!
That AI coprocessor is specifically designed for (floating-point) weighted multiply-sums! It wouldn’t be doing much at that time anyways, just running a smaller neural net listening for “Hey Rhapsode”.
Surprisingly good fit. Just don’t ask about FLAC encoding!
Theora is a lossy-compressed royalty-free open-standard video format (though Xiph has faced controversy at W3C over what defines an “open standard”: permissionless implementation or cross-industry collaboration?) designed for efficient decompression. You’ll never see a Theora video as a standalone file though; it relies on a container format like Ogg or MPEG to e.g. indicate where the different frames start & end, and to attach audio like FLAC.
Can’t say I’m fully comprehending it, but here’s my understanding.
Theora streams start with metadata & a Huffman-coding dictionary used to indicate which frequencies are in each 8x8px “block”. These blocks are coded following a space-filling curve to (somehow, I don't see how they take advantage of this) aid compression of spatially-contiguous areas. They're stored in the YCbCr colourspace, typically with downsampled Cb & Cr to focus on brightness over hue.
Inter (or “intra”? Confusing terminology) frames code movement of blocks since the last frame & last intra frame.
Corrections to the “predicted” interframes are coded as per usual with Huffman-coding of discrete-cosine transforms.
Really seems to be an evolution of the techniques used for MPEG video. If you have any links explaining Theora better than the official standard or reference implementation, please send them my way!
Decoding Theora would require a little cleverness & should keep my hypothetical processors relatively busy… The surrounding Ogg container is perfectly suited to the parsing unit & output unit, with the arithmetic unit computing a CRC.
There’s plenty of effort which would need to go into parsing Theora videos bit-by-bit. Or ideally a variable number of bits at a time.
The parsing unit could also parse the initial Huffman codebook & easily use metaprogramming to decode later bits according to it.
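Decoding against a parsed codebook is just a bit-at-a-time tree walk, the same shape of work the parsing unit already does. A sketch with an invented 3-symbol codebook:

```python
# Huffman decoding as a table walk: each input bit picks a branch of the
# code tree until a leaf yields a symbol, then we restart from the root.
# Internal nodes are (left, right) tuples; leaves are the symbols.

def huffman_decode(tree, bits):
    out, node = [], tree
    for bit in bits:
        node = node[int(bit)]              # 0 -> left branch, 1 -> right
        if not isinstance(node, tuple):    # reached a leaf: emit symbol
            out.append(node)
            node = tree
    return out

# Invented codebook: a=0, b=10, c=11
book = ("a", ("b", "c"))
```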
The arithmetic unit, output unit, & compositor unit could all be involved in converting that parsed video into an image onscreen. With the results of that parsing, I’d have the arithmetic unit decompress the 8x8px “blocks” & compute their positions onscreen according to the space-filling curves & motion vectors.
The output unit would upsample the Cb & Cr colour channels & (to reduce memory reads per-pixel) interleave all 3.
The compositor unit would perform colour conversion & rearrange the blocks as specified by the arithmetic core.
However the arithmetic unit I designed isn’t quite capable of doing that work alone…
I specifically designed the arithmetic unit not to include a multiplier, because outside of layout (which would require its own tree-SIMT processor) I didn't need one.
So to get the arithmetic unit to perform multiplications I could either decode the coefficients into shifted-add instructions (if I'm not doing too much multiplying) or I can offload onto the layout unit. Or both.
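Decoding a coefficient into shifted adds is simple enough to show directly: each set bit of the constant becomes one shift-add micro-op (illustrative Python, not real microcode):

```python
# Compile a constant multiplier into shift-add micro-ops: x * c becomes a
# sum of x << i for every set bit i of c, which is all the adder-only
# arithmetic unit needs. E.g. c = 10 = 0b1010 -> (x << 1) + (x << 3).

def shift_add_ops(coefficient):
    """Return the shift amounts whose shifted copies of x must be summed."""
    return [i for i in range(coefficient.bit_length())
            if (coefficient >> i) & 1]

def multiply(x, coefficient):
    return sum(x << shift for shift in shift_add_ops(coefficient))
```

This pays off when coefficients have few set bits; dense ones are where offloading to the layout unit's multiplier wins.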
When decoding the Huffman-DCT codebook I could, at the cost of extra memory, precompute images to sum.
In conclusion the parsing, output, arithmetic, layout, & compositing units I designed for rendering HTML/CSS to images aren't the best fit for decompressing videos, but they're still perfectly capable.
On the other hand, that made the thought experiment more interesting to me!
I’d probably not want to play video on this architecture whilst rendering a page, but I wouldn't want to do that anyways due to input limitations…