Fundamentals of Computers & Programming

This morning I’m tempted to describe the basic theory of how computers run programs, but first what is a program or a computer?

They are instructions composed of at least four basic operations:

  1. Store arbitrary ammounts of data to be read later
  2. Repeat instructions (to process more data or refine answers)
  3. Do different things for different data
  4. Communicate to, or ideally with you (otherwise what’s the point of computation?)

More operations can be added, but they’re not strictly needed.

We can ofcourse build electrical circuits capable of following these operations.

For (3) we can use build upon electrically-controlled switches called “transistors”.

For (1) we can use electrical feedback or, for more efficiency, capacitors. Though the latter requires circuitry to periodically reload it’s values.

For (2) we can include a periodic “clock signal” amongst the inputs, driven by the (dis)charging of capacitor.

From there we can abstract:

  1. feedback into fixed-sized “registers” or capacitors into “RAM”, the latter of which includes a condition for which capacitors are read given some “memory address”. This storage is all connected via some special wires called “the bus”.
  2. The periodic clock signal drives a “control unit” to determine which instruction to run and how to run it.
  3. Transistors are abstracted into logic gates, adders, etc.
  4. It can send signals (indirectly) to your monitor, etc.

That gives us some “CPU” hardware to run arbitrary programs, though modern CPUs are lot more complicated to deal with how much faster than RAM they’ve become.

In the early days we loaded software into our computers using holes punched in paper, but today we rely entirely on software to do it. And we have software to translate from a more human-friendly language to what the machine natively understands.

A “programming language” tends to follow these general steps in order to do it’s translation:

  1. “Lex” the input text into a sequence of “tokens”.
  2. “Parse” the relationships between those tokens into a hierarchical “Abstract Syntax Tree”.
  3. “Analyze” that AST in order to extract more implicit information.
  4. “Optimize” the code to run slightly faster.
  5. Figure out where to put all the data.
  6. “Compile” it from the AST into the machine code.

Using these “compilers” you can write your code in C, etc rather than machine code where you have variables and pointers (for 1), while/for loops (for 2), and if statements (for 3).

I/O (4) tends to get a lot more complicated in order for the computer to output something you can understand, so it tends to get relegated to the standard library. Where it calls down through “language bindings” and “system calls” into the operating system’s “kernel”, which’ll then coordinate outputs to the hardware.

Though I find functional programming like in Haskell to be a clearer expression of the concept. There data is stored (1) by passing them into functions, recursion is used exclusively for loops (2), pattern matching (which is a lot more expressive!) and higher-order functions are used for conditionals (3), and your functions arguments and return value are it’s I/O (4).

Everything else is a standard library abstraction, or some syntactic sugar upon these constructs.


At the heart of your computer is a CPU & some RAM! How do these work to give life to all the software you’re running?

Lets start with the fetch-decode-execute loop. Each time a quartz crystal (combined with a digital “latch”) oscillates with voltage your computer will load an instruction (or batch thereof) from memory, figure out what that opcode means, then follows that instruction! Before repeating the sequence.

Except each “clocktick” there’s no need for just the fetcher, decoder, or executor circuits to be active! We can decode the next instruction whilst the previous one is running, & we’re fetching the subsequent one. Even the 6502 used some of this, though now we take it to extremes splitting the decoder into numerous passes!

Which is vital today since (in part to store as much data as possible for your buck, & in part due to circuit size) RAM is an order of magnitude slower than the CPU.

Now you might notice the next issue which needs tackling: instructions don’t always occur in linear order! Often an early decoding pass can look ahead for GOTOs, but what if we have conditional branches? Or function calls? Or worse: method calls & switch statements?

CPUs address this by profiling your program’s execution, and tracking a callstack internally (which can double as a security measure). So they can guess where conditional & dynamic branches lead before they’re executed!

But the most effective approach to reducing the RAM bottleneck is to build smaller, faster RAM. Then again memory capacity is a hard limit on a computers’ capabilities (even if we aren’t using it efficiently today). So instead we use that smaller, faster RAM just as a “cache” of recently accessed data & the likely-related data near it. And we build multiple layers.

Also aids multiplexing RAM access, synchronized via a 2bit state machine.

Now you’re vulnerable to Spectre attacks!!

Its worth repeating: All the CPU optimizations Spectre exploits are necessary to stop CPUs from bottlenecking on RAM. I see some naive comments on this topic.

Now we’ve reasonably-addressed the RAM bottleneck in the decoder, how do we address the bottleneck in the executor? In part the answer is “use an optimizing compiler like GCC, LLVM, or any JS engine.”

The other answer is: don’t take instruction order as gospel.

Rather than executing instructions one-at-a-time you can buffer them per appropriate ALU, to be executed once its data has arrived from RAM or other ALUs.

And you can add a “register renamer” decoding pass to clarify dataflow dependencies.

At which point it becomes easy to “speculatively” evaluate instructions which might not be needed.

These optimizations might need to disabled! Which is what atomics are for.

Digital Circuits

In the last section I walked you through the concepts of fetch-decode-execute loop, pipelining, branch prediction, memory-caching, out-of-order execution, & speculative execution. Here I want to discuss how we build electronic circuits which can compute!

Which can remember data, take different actions for different data, & repeat such operations! Not to mention arithmatic!

We start with a component called a “transistor” which temporarily-fills a gap in the circuit when voltage is applied to a second input. Connect these in series to create an AND gate. Connect them in parallel to create an OR gate. Have them switch on a short-circuit to create a NOT gate. Also there’s ways to get them to perform analogue voltage arithmatic, amplification, & comparison.

a XOR b = (a OR b) AND NOT (a AND b)

We all associate computers with arithmatic, where the most fundamental operation is addition!

In binary 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, & 1 + 1 = 0 carry 1. So use a XOR gate & an AND gate!

But what do we do with that carry? We stack 2 of these “halfadders” OR’ing the carries to form a “fulladder”!

Negative numbers are represented by inverting all the bits & incrementing, since that works in the same circuit! Multiplication is a large row of AND gates outputting into a sum circuit.

Memory is crucial for computing, and there’s been lots of ways we’ve handled it.

For small ammounts of fast memory we can use feedback: Tie the output of a pair of NOT-AND or NOT-OR gates into an input of its twin so when you use the input of one gate to turn it on the other switches off. “latches”.

Capacitors can hold data temporarily, though we can use a small latch-based refresh circuit to make that data stay until powered-off. This is how RAM chips work! Since each bit costs nothing.

Magnets are a traditional means of storing data permanantly, not to mention burnable dies on CDs.

Now we use some quirk with layering transistors too tightly I fail to comprehend.

Read-only memory might be folded into the manufacturing process, or melt fragile wires (“fuses”) where we want 0s.

To determine which memory cell (or whatever) we want outputting to a common wire (“bus”) we use AND and (if not read from latches) NOT gates carefully laidout optimally. “selectors”/”decoders”.

Finally once the circuit has finished evaluating its boolean condition these circuits are significantly more useful when they automatically get re-run with the new data!

For that we charge a capacitor (with resistor) tied to a digital-latch to a) convert from a saw wave into a square wave & b) determine when to drain the transistor to produce said saw wave.

Though typically we use quartz crystals for that capacitor + transistor, they work very well for that!

For more in-depth discussion, incredibly impressed by Ben Eater’s 8bit computer tutorial!

Programming Languages

Programming languages covers massive computing theory ground! These convert the input devices we have (now that we’re not using punchcards) into the languages our CPUs natively understand, whilst providing other conveniences like freeing us from deciding where every single byte of data in our programs should be stored.

There’s a reason why my hypothetical hardware to reimplement Rhapsode & Haphaestus upon caters specifically to parsers & compilers… Generally I use jargon to avoid revisiting these concepts too often, and I prefer to add new dependencies rather than tediously write my own parsers.

Usually the programs we want to compile/interpret gets stored as a textstream (which we usually wrap in a “scanner” tracking our current position/char to make conditional tests easier to write correctly, & to improve error messages) leaving us needing to “parse” it to extract its implicit structure as an “Abstract Syntax Tree” (AST) before we can do any non-trivial computations on it.

The same holds for binary bytestreams. Despite what some may claim there’s really not that much different between text & bytes to a computer, afterall text are bytes to a computer!

Parsers usually utilize a lexer/tokenizer (there’s a distinction, rarely relevant today) to split the stream of text up into e.g. words, numbers, & punctuation. So the parser can operate on a slightly which is easier to reason about. Often there’ll be a 2nd scanner around the lexer.

Parsers come in several variants. Usually you’ll see pushdown automaton parsers either handwritten or using a parser generator. But there’s also statemachines, Shunting Yard, or Top-Down Operator Precedance parsers.

Once you’ve parsed text or binary streams, there may be other things you might want to do with it (e.g. JSON, XML; I have read it suggested that OSs should abstract parsing away from us). But we’re talking programming languages!

Where the next step is usually to resolve variable names to their definitions. Though interpretors usually do this at compiletime as opposed to runtime. For this we use mappings, usually hashmaps (converts textual or whatever keys into indexes into an array modulo array length; with some strategy handling when it finds something else in that slot). Allowing for typechecking & more.

Now that all the data’s in a structure the computer can readily can understand, we traverse the abstract syntax tree to “lower” it to a form which more closely resembles machinecode. Usually this mostly involves simplifying the “datamodel” from one we find easier to reason about, at which point typechecking & literal syntax comes into play.

Typechecking is another tree-traversal propagating types & ensuring they follow various invariants. Outputting nice error messages if not, possibly with aid of yet another tree traversal.

Many programming languages are implemented as “interpretors” which teaches the CPU how to run the (usually but not always lowered) program they parsed, rather than “compile” it all the way down to machine code. Then again I’ve seen tooling which makes designing hardware look like writing interpretors, only now you’re compiling to images to etch into silicon or lookuptables for FPGAs to evaluate rather than machine code.

These are easier to write (at least in an OO or functional language) but noticably slower without the lowering, but anything popular generally lowers their AST to “bytecodes” they can loop & switch over (to use C terminology).

An interesting optimization in bytecode interpreters I’ve seen in SQLite & most JavaScript (including the relatively-concise QuickJS) interpretors is to rewrite the opcode its currently interpreting based on profiling information.

Or you could combine interpretors (for less-run code) & compilers (for heavily-called code) to create a “JIT” like Julia & all dominant JavaScript engines (JavaScriptCore, SpiderMonkey, & V8).

Optimizing Compilers

Recently much of the performance benefits of modern CPUs have instead come from “optimizing compilers” (like GCC, LLVM [in turn rustc, Julia, etc], Glasgow Haskell Compiler, JavascriptCore, Spidermonkey, V8, JVMs, etc) emitting code better suited for them. Addressed by numerous optimization passes, and lowering split into several passes to reveal optimization opportunities.

At which point compilers have an interesting tradeoff: Do we make our own software faster or the software we’re compiling? On the otherhand most compilers are self-hosted meaning they benefit from the optimizations they implement. JITs have this worse!

(Incidentally in JavaScript JITs optimize for JavaScript programs produced from optimizing compilers, ones which address transmission & parsing costs these JITs kick in too late to address)

So what sort of code does modern hardware like?

Modern CPUs love small software! It fits better in its instruction caches! Perhaps an optimizing compiler’s main job is to rewrite your code to be more concise without unused code, duplicate code, computations which could be performed at compiletime, etc.

Vector instructions can help with this!

Sure you could argue that this is poor code quality the compiler is cleaning up, but thanks to lowering & other optimizations it sneaks into even the cleanest of code.

CPUs like straightline code without branches! They make prefetching the next batch of instructions trivial. They dislike conditional branches which they have to predict! They really dislike dynamic branches (i.e. function pointers, methods; caveats), but that needs a JIT.

In some cases this goes hand-in-hand with concision (e.g. collapsing chained GOTOs arising from lowering controlflow or deadcode elimination), other times not so much. So there’s an analysis pass deciding where to optimize for concision vs minimal controlflow.

I suspect that compiler engineers facepalm over highlevel languages’ love of pointers!

Branch density is especially a problem within loops, then again (especially if the optimizing if the optimizing compiler has anything to say about it) loops make branch prediction easier! They can even make dynamic branches easier to predict!

So GCC, at least, has a whole suite of optimization passes dedicated to shrinking loopbody sizes & minimizing loop iterations (again vector instructions are very handy here), especially for inner loops.

Replacing finite loops with duplicate straightline code helps, a.k.a. loop unrolling.

What else?

Inlining functions is a tricky balance to get right. Can reveal more optimizations & can save minuscule branching cost (easy to branch predict these) if its not overwhelmed by the harm it does to concision.

Some instructions (e.g. multiplication, divide) take longer to evaluate than others (e.g. add, subtract, shift), so there’s “peephole” optimizers handling that.

Ordering code so related branches (with preference given to probable paths) are near each other helps the instruction cache. Memory allocators from your language’s runtime system or (for C) standard library are designed to impart similar benefits for the data cache.

Then there’s CPU registers! Even if stackmachines look more elegant to me…

RAM is relatively-extremely slow so optimizing compilers try to keep all the data they can in CPU registers. At which point a late lowering pass needs to decide which CPU registers those corresponds to.

Which is roughly equivalent to The Map Colouring Problem, meaning there’s no good way to compute it. We have to rely on hueristics & fallbacks to stack.

As always lowering yields messiness to cleanup.


The most complex task a computer undertakes is to communicate its computations & stored data with other software & hardware; and through them us humans. That’s quite a language barrier to navigate, even without considering all the differing eccentricities we have developed over the mellenia! Which you really should consider, who’s to say which eccentricities are “correct” when there’s no correct!?

Even “backend code” is more datamodelling for the UI than anything…

Even communicating between two different programs is challenging (though you could blame the kernel for this)!

Usually it involves parsing & serializing data over a bytestream shared between them, similar to how compilers work (as described above). But you might need to add compression if there’s not enough bandwidth for your needs. You might need to add error detection or recovery if its not reliable enough.

Same always applies for datastorage or communicating between computers, which isn’t the kernel’s fault.

To communicate with hardware, you usually have to also struggle with that hardware wanting set the schedule of communication. Almost all the other hardware desiring the same thing resulting in “kernelspace” (where you have permission to do this) being a hostile programming environment!

And with the sheer range of hardware each speaking its own protocol that can be combined in a myriad of different ways, its no wonder Linux is such a massive project! Even then there are also driver components both beneath & above Linux/Mac/Windows; I have heard calls for more research into better tackling the complexity of the OS beneath your OS.

Ontop of all that these “hardware drivers” also have to split the use of that hardware between multiple “processes” (running programs), manage permissions for when those processes are permitted to access the hardware, & if energy-efficiency/battery-life is important to you determine when to power it on or off. Same are required for the CPU the kernel’s running on, though arguably compilers provide the drivers there!

At other times you lack some hardware (or the ability to use it) software on your computer requires & must emulate it upon the hardware you do have. These (e.g. terminal emulators, screenreaders & text-synthesis) operate upon the full graphics stack they’re converting to.

Once you have a consistant API upon whatever hardware you’re using which implements things like capslock for you (yes capslock is implemented in userspace; surprised?) you need to efficiently generate lots of data for your output devices!

Often you’ll want to record data from input devices at which point: see top of this section! Or maybe you provide tools allowing people to create equivalent files themselves.

Audio, graphics, & text code all have their own aesthetic at this layer.

And now finally you can start worrying about communicating with humans!

Partly this involves establishing a (e.g. visual) language humans & computers can share! Though you should avoid relying on specific classes of input and output devices to be inclusive of people who are physically or mentally incapable of using those devices. Provide configuration options.

And avoid requiring them to speak English, or otherwise adhear to the conventions of your region. Add dependencies like Pango & Gettext to fix this!

I/O Mediums

Above I briefly described the sheer ammount of effort that goes into I/O. Today I want to get a little more detailed and describe what characterises different mediums from the computer’s perspective! A little about how they work.

Specifically I’ll discuss what characterises graphics, audio, text, & persistant storage!

In short: graphics by sheer data generation, audio by timing, text by iteration & concatenation, persistant storage by parsing.

Graphics is defined by the sheer ammount of data needed to be generated! On my machine that’s 1,3667684= 4.2m bytes (or rather 3.1m, the 4th “colour channel” is used for compute) 60+ times a second. Though a vast majority of those bytes don’t change that rapidly at all.

Most computergraphics algorithms boil down to the techniques of “interpolation” (filling gaps over space or time), “matrix transforms” (repositioning), & “iconography” (how to depict things).

“Scanline algorithms” where you compute an image a row at a time used to be more popular back when computers were barely powerful enough to generate a live image, but they’re still used mostly in “vector graphics” (describing shapes to show onscreen) for hittesting/cropping & filling.

RAM is extremely useful here to allow painting over your existing canvas!

The 2D nature is also characteristic of graphics, at which point windowmanagers strongly associate them with mice whilst multiplexing!

Audio is defined by its precise timing requirements! Our ears are painfully good at detecting gaps in audio data, & things don’t sound right if the timing’s off.

So the code providing audio to the soundcard most have strict performance guarantees to the extent they practically can’t use the rest of the operating system! Which is usually solved by using an atomic ringbuffer to connect a “realtime” thread to a “buffering” thread.

Most audio are tweaked variations of a recording.

Text is represented as bytearrays where every value each (few) bytes can hold is associated with some sort of symbol (“character”/”char”) used when writing text in any supported language (caveats). Text encodes data in a very information-dense & versatile form from a human’s perspective.

Operating on strings almost entirely involves iterations & concatenation. With conversion to/from numbers (repeated multiply-add or divide with remainder).

Textviewing hardware is all software now for good reason!

Internationalization is vital in text I/O, which historically we weren’t great at. Early computers were limited to output only English, but now we have Unicode! With room to play with defining missing alphabets!

Mapping lookups, domain-specific interpretors, & fancy text rendering are used to address this to the extent they’re more characteristic of text I/O code than concatenation & iteration!

We still maintain backwards compatibility with English-only code, because why not? It doesn’t slow anything down.

Persistant storage is mostly a large stack of parsers, though multiplexing & access control can get interesting… We both want & don’t want software to share saved data! Traditionally this multiplexing was made very visible using “filemanagers”, though smartphones are hiding this (I don’t like).

Multiplexing gets interesting when optimizing concurrent access with mutual exclusion!

Very often we layer an optimizing (SQL or shell) interpretor ontop. SQL is a very talented optimizer!