The core job of a web browser is to parse a network response into something interactive & perceptible for the reader (you). So how about we design a processor which primarily reformats an input stream into an output stream via a stack?
The hardware I have in mind largely consists of hooking up memory circuits. So how do we design these memory circuits?
To store each bit we can use:
- Capacitors, great bulk option (“DRAM”)
- Electrical feedback between 2 NAND or NOR gates, fast
- Transistor-level shortcuts
- Voltage leaks to/from isolated circuits, persistent (“SSD” or “NVRAM”)
- Hardwired data or fuses
- Non-electrical (usually magnetic) techniques, used to be higher capacity before Moore’s Law caught up (won’t use)
Multiple of these “latches” can be tied to the same write-control line to form a multi-bit “register”. Incorporate a periodic “clock” signal into that write-enable line to coordinate registers to work together. This clock signal is generated by combining a resistor-capacitor pair (or a quartz crystal) with a latch that simultaneously squares off the saw wave into a square wave whilst periodically draining the capacitor to produce said saw wave.
It can be useful to tie multiple registers up to the same bus, & determine which register to enable by an externally-specified address: AND together the appropriate inverted & non-inverted address bits to produce the enable lines for each of those registers. If we store the address in a feedback-based register we wouldn’t need separate inverters. Significant effort is spent trying to compact those decoders so more bits can be stored in a memory chip. There’s a main bus on which data is stored, but I’d also incorporate independently-addressed “RAM blocks” in my design.
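To make that concrete, here’s a software sketch (Python, purely illustrative) of such a one-hot decoder: each register’s enable line ANDs together the true or inverted form of every address bit.

```python
def decode_address(address_bits):
    """One-hot decoder: AND together inverted & non-inverted address
    bits to produce one enable line per register."""
    n = len(address_bits)
    enables = []
    for reg in range(2 ** n):
        enable = 1
        for i in range(n):
            # Pick the true or inverted form of bit i for this register.
            want = (reg >> i) & 1
            enable &= address_bits[i] if want else 1 - address_bits[i]
        enables.append(enable)
    return enables

# Address 0b10 (bits listed low-order first) enables only register 2:
print(decode_address([0, 1]))  # → [0, 0, 1, 0]
```

Note how the decoder grows exponentially with address width, hence the effort spent compacting it.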
Adders (later topic) are useful for computing these addresses. And with the hardware I’ll describe consisting of RAM blocks, registers, & occasional adders they’ll be perfect for fabrication on an FPGA, chips into which we can write data for how these should be connected.
Finally, there’s always been a dynamic where higher-capacity memory is slower, even if the details have differed over the decades. For instance in a browser we have the whole world’s information at our fingertips, but it takes tenths of a second to access. I will accept this dynamic in the hardware designs I’ll discuss below.
Perhaps the main task a browser does is parsing network responses, so how’d I design hardware centered on that? Also with metaprogramming this circuit could be used for CSS styling, cache lookups, filesystem reads, or other collections!
At the core I’d use a bitmask-addressed program counter. Each cycle it’d read an input byte (or even bit!), look up that offset in the appropriate registers forming edgetables, & OR the results together into the new set of current states. Those registers may also include opcodes altering input-preprocessing registers, echoing output opcodes, or operating a callstack. An empty bitmask would pop the callstack. That callstack (which, as a stack, is trivial to prefetch, staying ahead of RAM latency) could cache microcode.
While this bitmask addressing would work well in the microcode, it simply won’t scale to the amount of syntax we’d have to parse. So I’d need a separate machine code (described shortly) which is easy for other code to generate. Then there’s prefetching… Parsers are practically impossible to prefetch, so I’d have the machinecode explicitly drive prefetching using imports. That is, say, 4 “rules” with active states could trigger pre-decoding of 4 additional rules each. Rules could import grammars, & grammars languages.
The preprocessor would dequeue bytes from a stack of queues (include multiple to aid out-of-band communication from the Output Unit, priority-scheduling, & differing fetch latencies between DRAM & SSD), possibly deallocating the memory pages as it does so. Input may be fed through a shift register to help parse bitflags & compressed data. For some grammars it could also be useful to splice “pseudobytes” (identified as such) into the input stream, possibly after a given number of bytes, to help parse length-prefixed arrays. These & the callstack could be operated via “mainloop” firmware by trusted hardware external to this unit, including facilities for task-switching (don’t bother prefetching these, there’d usually be too many options!) & dispatch into an in-RAM parser handling hardware events.
- Dequeue bytes off a stack of queues (trivially prefetched).
- Preprocess that input via a handful of registers (counters, shift, etc)
- Combine that byte (or bit) with enable lines stored in a bitmask-addressed program counter to produce its new value, plus opcodes operating the other units’ registers & more.
- Explicit imports prefetch & pre-decode new “rules” imported by rules with active states.
- Microcode can push & pop a callstack (trivial prefetch).
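A minimal software model of that core loop (Python; the edgetable layout here is my own assumption): each active state’s edgetable entry for the input byte is OR’d into the next bitmask.

```python
def step(active_mask, byte, edgetables):
    """OR together each active state's successor bitmask for this byte."""
    new_mask, state = 0, 0
    while active_mask:
        if active_mask & 1:
            new_mask |= edgetables[state].get(byte, 0)
        active_mask >>= 1
        state += 1
    return new_mask

# Tiny grammar matching "ab": state 0 --a--> state 1 --b--> state 2.
edgetables = [{ord("a"): 0b010}, {ord("b"): 0b100}, {}]
mask = 0b001
for b in b"ab":
    mask = step(mask, b, edgetables)
print(bin(mask))  # → 0b100
```

An empty result would pop the callstack, as described above.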
Machine Code for Parsing Unit
What machinecode would be easy for code generators & metaprogrammers to generate? How’d we design hardware to decode it?
The languages, grammars, rules, & states the Parsing Unit’s code is split into would all be fixed-size to aid prefetching (driven by explicit imports). The microcode, as just described, would be bitmask-addressed edgetables. From the set of active states we can derive the set of active rules. If there’s too many we could have a register-renaming circuit to compact it into a DFA. This’ll determine whose imports we pre-decode.
A machinecode state could be a sorted & diff-compressed list of labeled edges to other states in the rule, with an optional opcode (triggered when hitting the state) in place of the initial edges. All the states in a rule could be decoded concurrently, one edge at a time. For each machinecode edge we iterate over, an accumulator (adder+register) could determine which register to write to in the corresponding microcode. Having an “eta” edge can simplify compilation & shouldn’t be difficult to add, or we could leave that to the Arithmetic Unit.
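That decoding step can be sketched in a few lines of Python (illustrative only): an accumulator sums the diffs back into absolute state targets, one edge at a time.

```python
def decode_edges(compressed):
    """Recover (label, target) edges from a sorted, diff-compressed
    list, using a single accumulator (adder + register) as described."""
    target, edges = 0, []
    for delta, label in compressed:
        target += delta  # the accumulator step
        edges.append((label, target))
    return edges

print(decode_edges([(1, "a"), (2, "b"), (1, "c")]))
```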
In other cases machinecode could reference ASCII char classes. Decoding an ASCII char class would write to multiple edge registers simultaneously! As for Unicode, that has to be deferred to software, but these ASCII charclasses could aid text transcoding. Also the Parsing Unit would often want to match literal text (e.g. for tries), so I’d include a state opcode that lazily decodes to an entire rule, each state transitioning to the next upon matching its byte.
By the way, guess what circuit could easily assemble its own Assembly language?
Given the Parsing Unit I’ve been describing, what would I do with its output?
On callstack push I’d start prefetching additional Output Unit opcodes. On pop I’d enqueue them for the Output Unit to evaluate, or buffer this relatively compressed data to be evaluated in a pull-pipeline. Then prefetch the next callframe’s. This would keep these opcodes out of the Parsing Unit’s prefetch bottleneck.
The Output Unit would route data to destinations permitted by the callstack. Possible destinations would include:
- Size-capped registers (fatpointer into ringbuffer?)
- (NV)RAM segments
- Parsers spawned from (NV)RAM segments
- The Arithmetic Core (later topic) & programs running there
The opcodes would write bytes (often echoed from the Parsing Unit) to selected destinations amongst those in the callstack, as well as the contents of the registers, datastack slots, & possibly (NV)RAM segments.
I’d rely on the Parsing Unit’s firmware to allow the Output Unit to feed outputs back into the input, & to pause/resume tasks. Furthermore it’d often become important to instruct that firmware to move the task into collections, yielding an opcode into that parser which triggers an immediate taskswitch.
It’d probably be useful to include Lempel-Ziv decompression in this Output Unit, possibly blocking opcodes to the same target.
To output to a (NV)RAM segment I’d buffer the data on the stack to determine how much memory to allocate for it. I’d use a hardwired buddy allocator (pyramid of latches) to allocate that memory, for the Input Preprocessor to deallocate when ready. If we ever run out of memory to allocate I’d dispatch an event requesting memory be returned to the system by e.g. clearing expired caches, then let some firmware defrag the memory allocations.
Add a layer of address indirection to avoid needing to actually copy the buffered data into its new place (“virtual memory”). And a 2nd layer to divorce the address from the number of bytes therein, so it can be reallocated.
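As a software model of that hardwired buddy allocator (Python; the pyramid of latches becomes a dict of freelists per power-of-two size), splitting larger blocks as needed & leaving coalescing to the defragging firmware:

```python
class BuddyAllocator:
    def __init__(self, total):
        self.total = total
        self.free = {total: [0]}  # block size -> free offsets

    def alloc(self, size):
        want = 1
        while want < size:  # round the request up to a power of two
            want *= 2
        cand = want
        while not self.free.get(cand):  # smallest free block that fits
            cand *= 2
            if cand > self.total:
                raise MemoryError("dispatch a low-memory event")
        offset = self.free[cand].pop()
        while cand > want:  # split into buddies
            cand //= 2
            self.free.setdefault(cand, []).append(offset + cand)
        return offset

    def dealloc(self, offset, size):
        self.free.setdefault(size, []).append(offset)

a = BuddyAllocator(64)
print(a.alloc(16), a.alloc(16))  # → 0 16
```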
How do we design an “Arithmetic Core” (AC) which can evaluate the counters (including clock) for comparison, ARX cryptography, & all the redundant 1’s Complement checksums involved in networking? Amongst other things I’ll describe later (FIXME: edit this list to list them).
Let’s say this is a 16bit machine, the perfect size for the checksums, port numbers, certain ciphers, audio, & even the instructions themselves. A 16bit instruction can be split into 4 4bit nybbles: 2 input registers, 1 output, & 1 opcode. 65,536 2byte words should be more RAM than needed; 256 likely isn’t enough.
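Packing & unpacking such an instruction is trivial; here’s a Python sketch (the particular nybble ordering is my assumption, not a fixed spec):

```python
def encode_instr(opcode, out, a, b):
    """Pack 4 nybbles into a 16bit word: opcode lowest, input B highest."""
    assert all(0 <= x < 16 for x in (opcode, out, a, b))
    return opcode | (out << 4) | (a << 8) | (b << 12)

def decode_instr(word):
    """Unpack the 4 nybbles again."""
    return word & 0xF, (word >> 4) & 0xF, (word >> 8) & 0xF, (word >> 12) & 0xF

print(decode_instr(encode_instr(1, 3, 5, 7)))  # → (1, 3, 5, 7)
```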
We’ll start with an adder. Adding a special ZERO register allows us to move data from one register to another, & gives us a literal 0. Add a carry-in opcode-bit to allow increment (for the counters) & literal 1. Add a save-flags opcode-bit capturing carry-out amongst other bits into a special FLAGS register (disabling writes to it) to allow combining registers into larger-bitwidth numbers, or those 1’s Complement sums; the carry-in opcode-bit can be XOR’d with this carry-flag, which can be cleared via a comparison. Add an invert-B opcode-bit to allow inversion, negation, subtraction, comparison, & literal -1. Use the final opcode bit to switch from the Arithmetic Unit to a Logic Unit, where the carry-in & invert-B opcode-bits are repurposed to select between AND, OR, or XOR gates.
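The behaviour of those opcode-bits can be modelled in a few lines of Python (a sketch of the semantics above, not a gate-level design):

```python
def alu(a, b, carry_in=0, invert_b=False, logic=None):
    """16bit ALU: add with optional carry-in & inverted B, or switch
    to the Logic Unit's AND/OR/XOR gates."""
    if invert_b:
        b ^= 0xFFFF
    if logic == "AND": return a & b, 0
    if logic == "OR":  return a | b, 0
    if logic == "XOR": return a ^ b, 0
    total = a + b + carry_in
    return total & 0xFFFF, total >> 16  # (result, carry-out flag)

# Subtraction is addition with invert-B & carry-in set (2's Complement):
print(alu(5, 3, carry_in=1, invert_b=True))  # → (2, 1)
```

The carry-out of 1 on that subtraction signals no borrow occurred, which is what the conditions would test.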
Of the 16 regs, 4 would be special:
- ZERO gives all 0 bits; writing to it (OUT) outputs to external hardware.
- FLAGS stores bits controlling AC behaviour, including:
  - halt stops running the program once set (lowest-order bit, so it can be set via increment)
  - carry-out from the last ALU operation with the save-flags opcode-bit set, for use with conditions.
  - parity indicates whether there was an odd or even number of 1 bits in that ALU operation (highest-order bit for easy transfer).
- ADDR determines which address in RAM the RAM register refers to.
- PC is incremented between operations to determine the address of the next instruction. Writing to it is akin to a GOTO operation.
- Additional general-purpose registers may want to be reserved by non-trivial programs to handle function calls, but most programs running here would be trivial.
Every other clockcycle I’d emit opcodes to increment PC, whilst fetching its old address from RAM & loading it into an internal register to determine what to do next.
I’ve left a logic unit free, which combined with the save-flags bit provides just enough space for 2 header opcodes. One of which stores an 8bit literal into ADDR whilst saving a condition to be tested against FLAGS to determine whether to follow through with the write. The other header opcode would save registers (loaded directly from the instruction or from specified registers) alongside 3 parameterizing bits, which would be used to select a rotation/shift for each ALU input. This is vital for ARX crypto, number-parsing, & more.
As for modular-exponentiation crypto, that might need dedicated circuits to get decent performance… I won’t need exponentiation elsewhere! Or I could simply ask servers not to use those ciphers.
The TCP/IP stack is designed to be fast even on a trivial CPU like this, but especially for application-level protocols the Parsing Unit will be handy!
Computers sum numbers using what resembles long addition; they just have fewer digits, yielding trivial addition tables! The binary addition tables are:
0+0=0 0+1=1 1+0=1 1+1=10
It’s a XOR gate, with an AND gate for carry! Chain 2 of these “half-adders” together ORing carries (since 1+1+1=11) to form a “full-adder”, summing in the carries from the previous cell. Leave carry-in & carry-out lines to add increment, multi-register numbers, & 1’s Complement additions.
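Those tables translate directly into gates; here’s a Python sketch of the half-adder, full-adder, & a ripple-carry chain built from them:

```python
def half_adder(a, b):
    return a ^ b, a & b  # sum is XOR, carry is AND

def full_adder(a, b, carry_in):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2  # OR the carries, since 1+1+1 = 0b11

def ripple_add(a_bits, b_bits, carry=0):
    """Chain full-adders over bit-lists (low-order bit first)."""
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

print(ripple_add([1, 1, 0], [1, 0, 0]))  # 0b011 + 0b001 → ([0, 0, 1], 0)
```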
As for negative numbers, & in turn subtraction, there’s a few ways to represent them… We could dedicate a “sign bit”, in which case we’d need a special subtraction circuit. We could invert all bits of the number (1’s Complement), which mostly works in a normal adder. Wrapping any carries (as happens upon adding a negative number) around back to the start. This is ideal behavior for a checksum! Inverting all bits and incrementing (“2’s Complement”) works perfectly in a normal full adder! Since the most-significant placevalue has its sign inverted.
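That end-around carry is exactly how the networking checksums mentioned earlier work; a Python sketch of the 16bit 1’s Complement sum (RFC 1071 style):

```python
def ones_complement_sum(words):
    """16bit 1's Complement addition: wrap each carry-out back around
    into the low bit, as the carry-flag feedback would in hardware."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total

# The transmitted checksum is the complement of this sum:
print(hex(0xFFFF ^ ones_complement_sum([0x4500, 0x0030])))  # → 0xbacf
```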
How’d this Arithmetic Core (AC) I’ve been describing integrate with the rest of my hypothetical hardware designed specifically for implementing a browser (& it’s underlying OS) upon?
I’d populate the AC with code from the Output Unit, to be run either immediately or upon each byte it’s sent. Each of those bytes, if any, would be written to ADDR for the code to access. Data written to OUT would be sent to the destination the Output Unit specifies. These programs would stop once a certain bitflag is set. It may also be useful to let the Parsing Unit access the AC’s comparison flags for sorting!
As for how I’d write code for this Arithmetic Unit, mostly I’d probably use compiletime-constants in the Output Unit’s Assembly, making good use of its Lempel-Ziv decompressor to serve as extra loop counters & to unroll said loops. I would want Output Unit opcodes to help me capture labels to reference elsewhere or overwrite them. It would of course be easy to implement an Assembler (or even something higher-level) using the surrounding hardware, but I’d not have much need.
I’d also have a special privileged, persistent, & prioritized mode for the AC with its own registers & RAM. This would be used by e.g. the Parsing Unit’s mainloop-firmware to mix in additional entropy into a pool there. Also I’d dispatch interrupts via latches & some firmware into this mode, even when the device is otherwise off.
When outputting to a size-limited destination in a pull-pipeline, the AC may need to be able to pause its program to be resumed later. So upon hitting this limit I’d have some firmware output its registers & allocated memory (determined via some latches in the support circuits) to another datastore via the OUT register, with compression. Once the pull-pipeline is resumed firmware would decompress this memory & resume the program.
Having described the main processor I’d implement a browser engine & its underlying OS upon, what other hardware might I include for an auditory browser?
The output would predominantly be audio, with a headphone jack external speakers or headphones can be plugged into. Speakers are magnets vibrated by an electromagnet to create the air vibrations known as “sound”. I may or may not include a Signal Processor to take some live load off the main processor. A buzzer (working similarly to the speakers) could be useful for communicating notifications & security info, but I won’t explore it further.
A microphone (via the headphone jack when available) would be the primary input. These are reverse-speakers: magnets vibrating within an electromagnet to produce analogous voltages. This would be augmented with an AI coprocessor (Google calls it a “TPU”) to aid speech recognition.
I’d also include a volume wheel (count the number of passing slits), a joystick which can hit one of 5 buttons (primarily for in-page navigation), & a periodic “tick” signal to drive clocks & timeouts. Maybe a GPS receiver, which I won’t explore. Oh, and if this is a portable device we should include a battery-charge sensor. Then there’s the issue of randomness…
Given the whole point of the device is to communicate information from the ‘net, we need hardware to connect to it. I’ll explore WiFi (whose transmitter & receiver would include some hardwired live processing), but it’d also be good to provide the option of an Ethernet port. Adding Bluetooth transmitter/receivers could be useful too. As for cellular ones, adding the software to turn this hypothetical device into a phone would be a reader exercise. I won’t explore either of those.
Hooking up an RS232 port to interface to hardware terminals or (better yet) software terminals could help us bootstrap, at the cost of uglier & possibly less-inclusive output during initial development.
TODO: Describe the coprocessors
- Joystick (5 buttons)
- Microphone (+ headphone jack)
- Volume wheel
- Periodic tick
- GPS receiver?
- Battery sensor?
- Speaker (+ headphone jack)
- Signal Processor?
- AI co-processor
- Ethernet port?
To aid in computing some of this I/O we would need some additional hardware.
What exactly are we having this device output? And what additional hardware might be needed? What fileformats too?
Audio is vibrations in the air, which we can have vibrate a magnet inside a “solenoid” (coil of wire), thus producing a fluctuating voltage. We can connect a capacitor in parallel & a resistor in series to strip out frequencies above 20kHz, which we cannot hear.
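The cutoff of that RC low-pass filter is 1/(2πRC); a quick Python check, with hypothetical component values of my choosing:

```python
import math

def cutoff_hz(r_ohms, c_farads):
    """-3dB corner frequency of a simple RC low-pass filter."""
    return 1 / (2 * math.pi * r_ohms * c_farads)

# e.g. 80 ohms & 100nF lands just under our 20kHz hearing limit:
print(round(cutoff_hz(80, 100e-9)))  # → 19894
```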
We can hook wires holding each binary digit up to resistors connected in parallel, to convert them to analog voltages & sum them (a “DAC”). Add a couple of transistors comparing analog voltages to convert the other way (an “ADC”).
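Numerically, a binary-weighted DAC just sums each bit scaled by its placevalue; a Python sketch (component values abstracted away, reference voltage assumed):

```python
def dac(bits, v_ref=3.3):
    """Model of a binary-weighted resistor DAC: bits (low-order first)
    become an analog voltage, reaching v_ref at full scale."""
    value = sum(bit << i for i, bit in enumerate(bits))
    return v_ref * value / (2 ** len(bits) - 1)

print(round(dac([1, 1, 1, 1]), 2))  # full scale → 3.3
```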
Run this whole process in reverse (throwing in a large amplifying transistor) & we’re outputting audio! Sampling 16bits yields 98dB signal-to-noise ratio, whilst (as proven by Claude Shannon, amongst others, during WWII) 44.1khz can represent any frequency <22.05khz.
To digitally adjust volume, amongst other audio effects e.g. resampling, we’d need to apply some multiplications live on the output data. So say we create a rudimentary Digital-Signal-Processor in the form of a multiply-adder with 128 registers & 32bit instructions. No RAM. 64*44.1khz multiply-adds should usually be plenty! And very achievable. This multiply-adder could be constructed by ANDing each bit of one operand with all the bits of the other shifted to the appropriate placevalues, then incorporating a (possibly rearranged) pyramid of fulladders to sum all those intermediate results.
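The shift-and-add structure of that multiply-adder, sketched in Python (one iteration per AND’d partial product):

```python
def multiply_add(a, b, acc=0, bits=16):
    """acc + a*b via ANDed, shifted partial products, as the pyramid
    of full-adders would sum them in hardware."""
    for i in range(bits):
        if (a >> i) & 1:      # AND bit i of a with every bit of b...
            acc += b << i     # ...shifted to that placevalue, then sum.
    return acc

print(multiply_add(3, 5, acc=7))  # → 22
```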
Any controlflow in that rudimentary DSP would be offloaded onto the Arithmetic Core I’ve already established; it’d just evaluate arithmetic formulas with (mostly?) 8bit constant factors over an output queue. A couple of registers (2 registers hooked up to 2 speaker “drivers” to simulate stereo audio using their relative volumes) would be special in that they’d be sent to the DAC every 64 multiplies, & another would shift another input off the queue. As the queue nears empty, or when a countdown reaches 0, I’d trigger an event for the main processor to queue in more audio.
Unfortunately for our hypothetical speech synthesis… We’d quickly saturate all the compute power required by the other tasks! So I’d add analog circuitry to compute these “flutter” sine waves (representing our breathing cycles), sum them, & scale them. Then I’d memory-map digital samples of this combined sine wave to the topmost memory address in the Arithmetic Core for it to branch over. We can do so by feeding an amplifier’s (transistor?) output back to its input through capacitors & resistors.
If we were computing this trigonometry using digital circuits, that’d take a lot more effort!
To give the link-nav UX I want where you state a topic for our hypothetical device to infodump on, we need speech recognition. & in turn AI. Thus requiring additional hardware.
A neural net (argue amongst yourselves about the inherent metaphor) is a massive formula consisting of lots of multiplies & adds whose factors we continuously tweak to get us closer & closer to the desired behaviour on the given training inputs->outputs. Given a large enough formula we should be able to find factors matching our needs…
To aid computing each “layer”/”matrix-multiply” we need several multiply-accumulators running concurrently. Programming this would be quite different from the multiply-adder I added between the processor & its auditory output, in the sheer numbers of factors I’d want to upload to it. Or we could use analog circuits, which would strongly resemble the SSD memory used to persist code, configuration, etc. With the charge-trapped flash implementing multiplication! Energy efficient, if we don’t wear it down with too many writes.
Let’s say we use the analog approach, I want the energy efficiency! But in either case we’d need to include a somewhat-arbitrary Control Unit to allow operating these AIs, & hook it up directly to the microphone ADC. We’ll say the main processor would instruct the AI coprocessor to evaluate a specific preloaded “model”, or to tweak it towards a certain input/output pair. It’d respond with its estimate of the output probabilities.
P.S. It could be possible to repurpose a digital AI coprocessor for live playback of surround-sound FLAC files, but I’ve chosen analog. Repurposing the analog circuit in this way would quickly wear it down.
P.P.S. It’d be interesting to run LibreTranslate in this coprocessor as an optional part of the rendering pipeline… Because that is a useful feature!
For our devices to communicate over WiFi, we not only need a radio transceiver but we also need live processing (using hardwired circuits) to handle conversing in a noisy room (radio not audio but still…). Introverts are likely well aware of these challenges! I’ll discuss how this works in reference to OpenWifi’s diagrams & wikipedia’s protocol summaries.
Glossing over details of the radio hardware I don’t know much about, a major component is CRC error detection. This is a minor twist on a 32bit shift register in which certain bits are conditionally inverted (XOR’d) based on the bit being shifted out. To reason about its effectiveness, mathematicians view this simple circuit as “finite-field polynomial division”.
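That shift register is easy to model in software; a bitwise Python sketch using the reflected IEEE 802.3 polynomial these frames use:

```python
def crc32(data, poly=0xEDB88320):
    """CRC-32 as a 32bit shift register: whenever the bit shifted out
    (after XORing in the input) is 1, conditionally invert the tap bits."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            out = crc & 1  # the bit being shifted out
            crc >>= 1
            if out:
                crc ^= poly
    return crc ^ 0xFFFFFFFF

print(hex(crc32(b"123456789")))  # → 0xcbf43926
```

The standard check value 0xCBF43926 confirms the tap placement.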
Upon transmission we consider the incoming signal strength, timer, & certain received packets to determine when there's a gap in conversation. If we want to send a nontrivial amount of data we first send a Request-To-Send control frame. Once we receive a Clear-To-Send control frame from the WiFi hotspot we'll send the actual enqueued frame from the ringbuffer.
Each of these frames is followed by its corresponding CRC32. Upon receiving a valid acknowledgement (ACK) control frame we dequeue the frame. When we need to resend a packet due to a missing ACK, we do so with an exponential backoff.
That timer also runs some concept of “timeslicing” which I believe improves throughput. Frames are enqueued into a ringbuffer to send when the circuitry finds the opportunity to do so, though control frames can be injected into the middle of that queue.
The receiver validates that its CRC circuit is zeroed out by the sent CRC, which amongst other things determines whether to send an “I heard this” ACK controlframe, or to notify the main processor to read a frame out of its ringbuffer.
This is a non-trivial amount of circuitry, but it needs to run in realtime free from the CPU being busy with other things. This has proven to be very achievable.
Several images on this page are from Wikimedia Commons, click one to see more details. Most are drawn by the author.