# Designing a GPU

Visual output is a computationally-intensive task: Multiple megabytes (depending on your screensize & setup) must be generated 30+ times a second! This page explores how we build up from logic gates to a hardware-accelerated implementation of OpenGL ES 2.0 (now superceded by Vulkan & WebGPU) capable of powering 3D games & other tools.

This will require some familiarity with college/highschool/whatever-you-call-it math!

## Early 2D Graphics

But first, let’s go back in time to 1980s/1970s arcades! How did those limited machines generate roughly a single megabyte of output 30+ times a second? This was a major feat at the time! One technique to cut down on the amount of data which needs to be computed per-frame is to tie data output pins from one RAM chip (+ some pixel counter lines) up to the address of another so subimages can be reused. This works pretty well for certain character sets or many entertaining gameboards! As RAM minitiarized it got cheaper to decompress at write-time as opposed to read-time. At a smaller scale you can use this technique to encode an image with a small pallet of colours! Which happens to aid recolouring. To aid moving “sprites” freely around the screen (guesswork, I haven’t managed to find a schematic) we had circuitry which would start shifting a small image to output once the counters controlling output signal timing had hit their X/Y position. And repeat for the next few scanlines.

Having dedicated circuitry to further aid copying memory was often useful, `memcpy` remains vital in computer graphics to this day.

Then there were vector displays, where we hooked a “Control Unit” decoding instructions up to a rewired Cathode-Ray Tube display via Digital-Anolog-converters for it to draw lines. These looked quite distinctive, with unusually sharp geometries for the time! I’ll mention them again later

Other techniques were also in use at this exploratory time, but I think that covers the main ones.

## Maths

Let’s start by discussing integral arithmetic. The way we implement this in hardware closely resembles “long arithmetic”, except we only have 2 digits: on & off, 0v & 5v, true & false, 0 & 1, etc. So they have simpler addition, etc tables! ``````	0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 10
``````

It’s a XOR gate, with an AND gate for carry! Chain 2 together so we can sum in the carries. We can simply OR those carries since in binary 1 + 1 + 1 = 11.

We can reuse that circuitry to implement subtraction by negating the most significant place-value. So the 8bit number 100000002 = -12810. To negate a number using this “2’s Complement” technique we invert (using XOR gates) all the bits & increment (typically using a carry-in signal). Once we can negate an operand we can perform subtraction!

Other techniques may be used too (say dedicating a bit to indicate the sign), but this is the most elegant from a hardware perspective. The binary multiplication table is:

``````	0*0 = 0
0*1 = 0
1*0 = 0
1*1 = 1
``````

An AND gate, no carry!

Except that isn’t the full multiplication process. We need to shift one operand to each place value for each digit of the other operand before multiplying by that bit. Yielding a bunch of intermediate values to sum into a number with potentially twice as many bits (often exposed as 2 registers)! Arranging those intermediate summers into a pyramid helps to maximize hardware concurrency, though we can do better.

To divide we chain (or clock, firmware would probably be just as fast…) subtracters to repeatedly bitshift the dividend in one bit at a time & conditionally subtract the divisor for a positive result. The last subtracter gives the remainder. The carry-out/overflow lines gives the quotient!

(Maths jargon used here: dividend/divisor = quotient rem remainder)

### Floating Point

To write very large or very small numbers “scientific notation” is commonly used where we represent them in terms of the formula “a*10b”. Calling a the “mantissa” which should have exactly 1 non-zero digit before the decimal point, & calling b the “exponent”. To add or subtract scientific numbers we adjust the exponents so the place values in the shifted mantissas line up to the larger exponent. Then we can actually add or subtract the mantissas, before reestablishing that 1-10 invariant. To multiply scientific numbers we add the exponents & multiply (XORing signs) the mantissas. To divide scientific numbers we subtract the exponents & divide the mantissas.

This does have limited precision, but your numbers probably aren’t that accurate anyways.

This directly corresponds to how we usually process fractional numbers in computers! Except the formula’s “mantissa*2exponent”. Where the mantissa is stored with a sign bit, & the exponent resembles 2’s complement (yes, the sign bit means floating point distinguishes -0.0 from +0.0). The preceding “1.” is not stored, but it needs to be re-added during calculation.

The min & max exponents represents 0, smaller-than-normal, infinity, & error values.

Normalizing involves repeatedly incrementing/decrementing the exponent & shifting the mantissa until we reach a 1bit. Hueristics can minimize the number of clock cycles. Or the FPU stores it unnormalized internally

### Polynomial Approximation

Here’s a piece of the puzzle I’ve long wondered about & only recently (as of writing this) found an answer to (thanks Julia programming language!).

Calculus & it’s concept of derivatives allows us to set the derivatives of a polynomial as it’s coefficients, provided that we divide out the factorials Calculus introduces. A “Taylor Series”. Thus allowing us to approximate any function for a range (which can be calculated) of input in way that’s easy to calculate! Especially for ones like trig or `ex` which are defined in terms of their derivatives!

The derivatives of `cos(x)` at `x = 0` are `1, 0, -1, 0, repeat infinitely` yielding the Taylor Series (around `x = 0`) of `1 - x2/2! + 24/4! - ...`.

Relatedly the derivatives of `sin(x)` at `x = 0` are `0, 1, 0, -1, ...` yielding the Taylor Series (around `x = 0`) of `x - x3/3! + x5/5! - ...`.

`tan(x)`’s Taylor Series is: `x + x3/3 + 2x5/15 + ...`.

All the trig functions have one!

`ex` is defined to be it’s own derivative, & like all exponentials `e0 = 1`. So naturally it’s Taylor series (for any x) is `1 + x + x2/2! + x3/3! + ...</sup>.`

The Taylor Series (given `|x| <= 1`) for `ln(1 + x)` is `1 - x2/2 + x3/3 - ...` since `dln(x)/dx = 1/x` definitionally. That can easily be used for the mantissa, but the exponent has to be handled separately. Meanwhile log-base-2 of an int is the count of it’s first-set-bit (reuse float-normalizer circuit?), multiply by exponent for floats!

Once we have `ln(x)` & `exp(x)`, we can define `pow(x, y) = exp(ln(x)*y)`. Shortcuts are taken, but once we have powers we can implement `sqrt(x) = pow(x, 1/2)`, `inversesqrt(x) = pow(x, -1/2)`. And with `1/x = x-1`, or log-subtraction, we can implement division in terms of multiply-adds!

In 3D graphics these non-trivial operations are mostly used to approximate lighting, as described below.

#### Trig Derivatives

I cited the derivatives of `cos(0)` as `0, 1, 0, -1, 0, ...` from which I computed it’s Taylor Series. Where do those trigonometric derivatives come from?

To answer that question we’ll need to go the fundamentals of Calculus your curriculum might have glossed over in favor of rote memorization. To express the paradoxical concept of change in an instant mathematicians use the formula `df(x)/dx = lim [d->0] ((f(x + d) - f(x))/d)`. That is: what does the slope of the graph approach as we decrease the range we sample?

For sin(x)…

``````	dsin(x)/dx = lim[d->0] ((sin(x + d) - sin(x))/d
`````` We need to get d out of `sin(x + d)`, so if you look at the attached visual proof `sin(x + d) = sin(x)cos(d) + cos(x)sin(d)`.

``````	dsin(x)/dx = lim[d->0] ((sin(x)cos(d) + cos(x)sin(d) - sin(x))/d
``````

Factorizing…

``````	dsin(x)/dx = lim[d->0](sin(x)(cos(d) - 1)/d + cos(x)*sin(d)/d)
``````

Leaving us to take the limits of `(cos(d) -1)/d = 0` & `sin(d)/d = 1`

``````	dsin(x)/dx = sin(x)*0 + cos(x)*1 = cos(x).
`````` To take that limit of `sin(x)/x` as x goes to 0, we say the area of the equilateral triangle OAB in the attached diagram < area of it’s corresponding circular sector < area of corresponding right-angled triangle.

Given the lengths of `OB` & `OA` are radi of a unit circle both = 1 this comes out to…

``````	sin(x)/2<x/2<tan(x)/2
``````

Given `sin(x) > 0` & `tan(x) = sin(x)cos(x)`, we can divide out `sin(x)/2`

``````	1<x/sin(x)<1/cos(x)
``````

Reciprocals…

``````	1<sin(x)/x<cos(x)
``````

As `cos(x)` gets closer to `cos(0) = 1`, so must `sin(x)/x`.

As for the limit of `(cos(x) - 1)/x`

``````	(cos(x) - 1)/x = (cos^2(x) - 1)/x(cos(x) + 1) = -sin^2(x)/x(cos(x) + 1).
``````

Since…

``````	Cos^2(x) - 1 = - Sin^2(x)
Cos^2(x) + Sin^2(x) - 1 = 0
Sin^2(x) + Cos^2(x) = 1
Sin & Cos are geometrically the base & height of a right-triangle inside a circle of radius one: last equation is guaranteed by pythagorean theorem
``````

Taking the limit…

``````	lim[x->0](-sin^2(x)/x(cos(x) + 1)) = -lim[x->0](sin(x)/x)*lim[x->0](sin(x)/(cos(x) + 1)) = (-1)*(0/2) = 0
``````

Similar logic as for `sin(x)` gives the deriviative of `cos(x)`

``````	dcos(x)/dx = lim[d->0]((cos(x + d) - cos(x))/d)
= lim[d->0]((cos(x)cos(d) - sin(x)sin(x) - cos(x))/d)
= lim[d->0](cos(x)(cos(d) - 1)/d - sin(x)sin(x)/x)`
``````

Using the limits calculated above.

``````	dcos(x)/dx = 0cos(x) - 1sin(x) = -sin(x).
``````

So the derivatives of `cos(x)` are `-sin(x), -cos(x), sin(x), cos(x), ...`! Plugging `x = 0` in (accepting that `cos(0) = 1` & `sin(0) = 0`) gives the `0, -1, 0, 1, ...` cited above.

Likewise the derivatives of `sin(x)` are `cos(x), -sin(x), -cos(x), sin(x), ...`, at `x = 0`: `1, 0, -1, 0, ...`.

If you didn’t follow, just know that we can compute those derivatives based on the geometric properties we want! And thus create a polynomial with those derivatives to calculate trigonometry to the precision of a standard IEEE float.

### Matrix Transforms

Computers, using the circuitry I’ve been describing, don’t natively perform algebra. We can ofcourse teach them algebra, but if we don’t want to bottleneck on it we can design shortcuts for formulas in particular structures. In 3D computer graphics we represent a point as a “vector” of 3 numbers (“x”, “y”, & “z”), 1 for each dimension. So how about simplifying formulas for those? Or rather, let me write these in 2D for concision.

Take the formula (for 2D) of:

``````	X' = aX + bY
Y' = cX + dY
``````

(mathematicians are nodding along, Computer Graphics devs think I’m missing something)

Chaining 2 of these “matrices” gives:

``````	X' = a'(aX + bY) + b'(cX + dY)
= a'aX + a'bY + b'cX + b'dY
= (a'a + b'c)X + (a'b + b'd)Y
Y' = c'(aX + bY) + d'(cX + dY)
= c'aX + c'bY + d'cX + d'dY
= (c'a + d'c)X + (c'b + d'd)Y
``````

It’s a new matrix! (Writing them in a grid makes the mathematical patterns easier to spot) Just by multiplying each vector by a handful of numbers we can apply however many matrices we want!

Take the 4D (per OpenGL convention 4th dimension “W” indicates points (1) vs directions (0)) matrix:

``````	X' = aX + bY + cZ + dW
Y' = eX + fY + gZ + hW
Z' = iX + jY + kZ + lW
W' = mX + nY + oZ + pW
``````

What do we have? Coefficients a, f, & k scales the vector along each dimension. d, h, & l provides translation (moves). Given Z is negative c & g provides perspective towards (0, 0). m, n, o, & p typically are set to 0, 0, 0, 1. The rest provides skews, which with trigonometry can become rotations! Early versions of OpenGL (which Mesa3D still supports) separately tracked model (stacked), view (inversed), & projection matrices to be combined & applied to all vertices. Inversing the matrix positioning the camera places it at the centre of the (virtual) universe where perspective matrices work correctly. There’s a technique for inversing any matrix, but it’s computationally cheaper to build it up from the inverse rotations, etc. Stacking “model” matrices helped with implementing stopmotion-like armatures for 3D models. Since v3 this is left to game engines to implement.

Now that we can geometrically transform all vertices into place, we can have an output”wireframe” image just by drawing dots or lines between them, say, by outputting to a vector monitor!

## Depth Buffering Algorithm

As just mentioned, we can multiply all vertices in a 3D scene by a “transformation matrix” covering the position of the object & camera alongside the behaviour of that camera’s lens to get “wireframe” output (think the Death Star plans in Star Wars, see right image) we can hand straight off to a rewired CRT “vector” monitor. Either as dots or lines between them. However without removing obscured lines these wireframes can be ilegible. I’ll explore a proper solution in the next section, but first: some hueristics. Ofcourse we’d want to remove points outside the display, which can be rapidly done for lines & triangles via a bitmask indicating whether each point’s above, below, left, or right of the display. Bitwise-AND combines them to indicate whether we can discard the line/triangle. Adding the Z axis into this bitmask is a useful hueristic, whose physical inaccuracies can be covered by depth-parametric blurring resembling fog. And we may want to geometrically-intersect lines with display edges…

Finally given the vertices per-face are listed in clockwise order when they face the camera, it is usually safe to delete faces where after matrix-transformation they’re listed in counter-clockwise order. They’re probably obscured by a forward-facing triangle! The formula for this on triangles resembles subtracting the last 2 vertices from the first & comparing the dot product to 0.

### Interpolation

So far we’ve designed a circuit which can output geometrically-transformed & hueristically-filtered vertices to a vector monitor producing a tidy “wireframe” image. Except the problem is, we’ve moved past wireframes & vector monitors for something more versatile! Which means we still have some more processing to do… Though we still use those hueristics as they help cut down on the computational work, hence why I covered them. We’ll say our 3D models consist of triangles.

To fill a triangle we first sort it’s 3 vertices. So we can iterate along the Y axis from the topmost to middle one & on to the bottommost. “Interpolating” the X position (amongst other configurable “varyings” like colour) along both lines using the formula:

``````	mix(x, y, a) = a*x + y*(1-a) where a is in range 0..1 inclusive
``````

From there we can interpolate the remaining varyings between those X positions! Summing (for translucancy) the possibly-postprocessed colour into this coord in the output. That output can then be sent to any standard monitor! Given we get the timing protocol right.

Finally, to properly let foreground objects obscure background ones we include the Z axis in the interpolation & store the depth for each pixel in a “depth buffer”. Comparing against the depth buffer determines whether the “fragment” pixel is obscured or not. Though translucency complicates this & needs careful handling.

### Lighting

The benefit of the “depth buffering” algorithm I’ve been describing is that as long as we can read the results out it’s trivial to keep throwing hardware at it! The actual physics of lighting (it bounces everywhere) contradicts this parallelisability, so we’ll need to rely on approximations.

And then 1990s movies narratively justify these physical innaccuracies! Whether by saying it’s a computer display or world (more a 1980’s thing), the T-Rex is escaping into rain or a well-lit open field, the characters are all toys, the heroes & villains wear masks or sunglasses in action scenes, or the like.

Or we could ofcourse reverse-simulate relevant lightrays, parallelising per output pixel.

This lighting can be done per-face, vertex (interpolated over triangle), or “fragment” pixel (treating the geometry as an approximation of a smooth curve). We’ll start with “diffuse” light. Take the direction “I” from the point “v” to the light “l”, `I = normalize(v - l)`, to use in computing it’s intensity on the vertex, `max(dot(I, N), 0)` where N is the “normal” direction the surface faces. This gives a multiplier for how much of the light’s colour to add in.

``````	normalize(v) = v/sqrt(dot(v, v))
dot(a, b) = a.x*b.x + a.y*b.y + a.z*b.z
``````

Incorporate exponentiation into that intensity to simulate shininess, “specular” lighting!

Then partially make up for the approximation by saying there’s “ambiant” light coming from all directions, like in a rain storm.

Sum these lights into the resulting colour!

We could also incorporate light distance into calculations, or test if we lie within a spotlight.

As for shadows, reflections, refractions, scifi portals, etc, etc we need to either render the scene multiple times from different angles to inform the actual render, or we need to use recursive raytracing. Or even raytracing (recursive or not) only certain objects in our otherwise depth-buffered scene! We might precompute the raytracing for static objects.

Raytracing & depth buffering were always competing 3D rendering algorithms, with full raytracing being recently more popular.

### Texturing

Finally to finish off describing The Depth Buffering Algorithm, what if we don’t want to compute all visible fragment pixels dynamically? What if we want to pre-prepare them? Whether by hand, pre-rendering from another angle, or whatever.

Then we take a 2D image & attach coordinates into that “texture” to each vertex to be interpolated along the faces. So that for each “fragment” therein we lookup that coordinate (incurs latency for hardware to handle) to add into the output.

Except… The interpolated coordinate will rarely land on an integral, so to get the best results we need to subsample the texture! Take the adjacent pixels according to the integral part of the coordinate & mix according to the fractional parts.

Except… As textures shrink into the horizon we really should be averaging more than just those 4 pixels to summarize the information at that coordinate.

This is most readily parallelized using preprocessing! Involves repeatedly averaging every 2x2 pixels in the texture to halve its size horizontally & vertically, & loading the halvings into a “mipmap” lookuptable indexed by Z. During fragment rendering the appropriate texture can be looked up from the mipmap & the coordinate subsampled as per otherwise-usual.

## GPU Hardware

I hope my writing so far highlights that the maths behind 3D graphics aren’t really that complex, we just do a lot of it! So how do we build hardware to handle this sheer amount of number crunching in real time? from wccftech reporting

The modern answer: Stuff the “GPU” full of compute units! (different vendors use different terms) Except, there’s a reason why CPUs don’t have million of cores. What lets us cram millions of them into a GPU?

The fact we’re running the same code across all vertices, triangles, fragments, or (for ray tracing) output pixels helps! This means we can extract the “control units” from the compute units to be shared amongst them. This complicates control flow, but we can deal with that (deserves it’s own thread!) especially given the mass simplification elsewhere. But how to handle the latency incurred by, say, fetching pixels out of textures?

When a CPU is being held up fetching data from RAM rightly prioritizing (also physically can’t be avoided) capacity over speed, it looks for another instruction it can execute. GPUs (whose RAM doesn’t prioritize capacity to the same extent, but concurrency becomes more of an issue) address this by assigning multiple datums to the same compute unit, so we can iterate over them as we fetch from RAM. Presumably the compute units’ resource are stretched across as many of its datums as we can.

### Compute Units

From now on these details about GPUs are educated guesswork, my own or others’. I’m striving for my comments to resemble real hardware, but the most interesting aspects are not public knowledge. I’ll relate my hypothetical hardware to GLSL, & in turn the Depth-Buffering Algorithm I’ve been describing. Which is what I’m optimizing for. All that said: How’d I design a GPU compute unit? As well as control unit & GLSL compiler? Minimizing complexity to fit more compute units in.

To start we’ll need several 32bit registers per datum processed by this compute unit. Each of those bits (since I haven’t described how registers work yet) can be stored in the electrical feedback between 2 cross-connected NAND or NOR gates (transister-level shortcuts can be taken). Add a bit more circuitry to convert the data to write into set/reset control lines only when an enable flag is set. 32 of those would be tied to the same enable line to form each register.

To maximize hardware concurrency I’d tie the read & write data lines of those registers to numerous busses (common wires) each hooked up to specific ALUs I’ll describe below. Or those busses could read “literals” (I’d probably include shorthands for 1 & 0…) that were shifted into the control unit. If we say thanks to the iteration over datums the same registers won’t be read immediately after being written, we could probably save some circuitry & clockcycles.

#### ALUs

Full adders, as I’ve described previously, are the core component of an arithmetic unit. What can we do with them? We can ofcourse add (+, +=) 2 integers, 32bit in our case. We can propagate a carry to add 64bit numbers. Or set the carry-line to increment the result. If we add AND gates to zero out either operand we can have assignment (=), increment (++), literal 0, literal 1, & selection (?:) which is useful to cut down on branching costs.  Add XOR gates to invert either operand or those 0s & we have literal -1, literal -2, decrement (–), & subtraction (-). Once we have subtraction we can take just the sign & OR’d bits to have comparison (<, >, <=, >=, ==, !=, lessThan(), greaterThan(), lessThanEqual(), greaterThanEqual(), equal(), notEqual(); the function syntax operates on vectors). Maybe that even gives us any() & all()! In turn combined with ?: this gives us min(), max(), clamp(), & step()! Add shift registers & we’d not only optimize multiply/divide by powers-of-2 but we can process floating point numbers! Adding or subtracting 2 floats would involve:

1. Subtract exponents into appropriate shift-control registers taking their absolute value.
2. Add shifted & possibly-negated mantissas taking the absolute result & possibly shifting in overflows. Whilst taking the max exponent, incrementing where needed. Shifting mantissa by exponent can give us floor() & fract()! Conditionally-increment floor() upon fract() & you have ceil()! I’d have 4 of these arithmetic units in each compute unit, & let each of them be vectorized into 4 bytes for processing exponents. Normally the registers loaded into the busses for each of these arithmetic units would be consecutive, but I’d allow the control unit to rearrange these for “swizzling”. Or applying a scalar to a vector. This covers the behaviour of several GLSL operators & aid implementing numerous functions!

#### Logic Unit

In a GPU compute unit I’d have a special bank of 1bit registers (“latches”) to store booleans. These booleans could be incorporated into computing the control signals for the ALU, & comparison results could be read out of the ALU into these booleans.

 ![Continuation of GLSL operators table showing boolean (&&, ^^, ) operators](glsl-bool-ops.png)

I’d give these latches a logic unit consisting of XOR gates (^^) thus not only providing a core component of floating point multiplies but also via literals providing NOT. The arithmetic Units could be abused to provide AND & OR, helping to rapidly discard offscreen triangles.

Storing floating point sign bits in those latches would help with: • Those multiplies I mentioned
• Comparison against zero
• faceforward() in returning it’s resulting in a sign bit
• & abs() & sign()

I will be revisiting these latches upon discussing controlflow.

#### Multiplier

Perhaps the main thing GPUs do is lots & lots of multiplications! So I’d create another “main” arithmetic unit with 32 adders similar to the previous one. Another 16 adders sum each pair of those, then 8, 4, 2, & 1 to sum all 32 of the top row. I’d allow reading results from each of these layers. Set all the input numbers to one operand, with incrementing bitshifts, & zero out according to the other operand & we have a 32bit unsigned integer multiplier!

Since we can negate inputs & invert the output this can also handle signed multiplies.

Use the minor arithmetic unit to add exponents, the logic unit to XOR signs, & this summer to multiply mantissas & we have a floating point multiplier! To fully normalize floats I’d add 1st-set-bit circuits. Yielding numbers to subtract from the exponents as we compute exponents for floating-point addition, using 2 layers of this summer. Which would also aid implementing exp2() & log2(), as that’s an int’s log2().  I’d have 16 of these multipliers (with or without extra layers) to aid performing matrix transforms in essentially a single clockcycle! With just 1 vectorized multiply we’ve implemented unit conversion via radians() or degrees(), & matrixCompMult(). Except this * operator would not work as standard upon matrices, so I’d have the GLSL compiler rearrange matrix *s into vector-matrix *s where possible.

Multiply-sums (maybe a 2nd round of firmware’s used to sum all the multiply results? Or we do it in the same clockcycle?) are extremely common in GPUs & deserve their own machinecode opcode!

There’s matrix-vector multiplies as just mentioned, used to apply geometric transforms & perspective effects. There’s dot() products upon which so much is built:

• normalize() adds * & inversesqrt().
• faceforward(), as used in culling backwards-facing triangles, copies the inverted sign to output vector.
• Result’s processed by reflect() & refract() for ray-tracing.
• cross(), which computes the perpendicular vector, adds swizzling & subtracts componentwise.
• mix(), used to interpolate varyings, multiplies by scalers (a & 1-a) & adds componentwise.

Multiply-sum is very versatile! Well worth it to a GPU to optimize whether for depth buffering or raytracing! Thanks in large part to Taylor Series, a useful opcode for implementing several standard GLSL functions is a multiply-add. Once we normalized inputs as part of addition, there should be plenty of adders available to merge the constituent multiply & add operations… Uses include sin()/cos(), asin()/acos(), sqrt()/inversesqrt(), & log()/exp(), those built on them, & a few others. These GLSL functions aren’t actually used all that often, & we can afford for them to be slower. sqrt() & inversesqrt() can reach the desired result quicker by quickly switching to Goldshmidt Approximation, giving us both sqrt() & (half) inversesqrt()! This is used with dot-products to calculate length(), distance(), & normalize(). Used in computing lighting. Merging the sin() & cos() opcode goes far towards letting them reuse microcode computations for many shaders, whilst making it easy to implement tan() by adding a division. Implement asin()/acos()/atan() the same way?

Once we’ve implemented log() & exp() (or would it better to use log2() & exp2() which I’ve already stated is hardware-accelerated as part of floating point normalization?) we can use that to implement pow() which is used to approximation shininess & whether by log-division or x^-1 the division (/) operator. Though ideally I’d optimize division away at GLSL compiletime. As for non-Taylor Series uses… A couple multiplies after clamped (x - edge0)/(edge1 - edge0) gives smoothstep().

Finally we have reflect() & refract() post-processing dot-products for ray-tracing. refract() involves several multiply-adds much of which, including those for sqrt(), is conditional upon an internal sign. reflect() is just a single multiply-add. Merging in extra doublings/halvings would be trivial & helpful for several of these opcodes.

As such a multiply-add is especially useful for raytracing, & still valuable for depth-buffering!

#### Branching

Branching is a core construct in computing: We want different outputs for different inputs! But the SIMT architecture GPUs use makes branching more difficult: Every compute unit receives the same microcode. The optimal solution is the ?: operator, so every compute unit is kept.

Failing that we could send the condition to all Compute Units as part of the microcode for them to evaluate against their boolean latches & decide whether to let the write through (clamp() could use both!). The GLSL compiler would then need to propagate conditions (converted to Conjunctive Normal Form where that doesn’t duplicate code) down to the contained assignments. Block-local assignments can be exclude, helps with other optimization passes. Early-return could be given a function-local bool to compile to.

If we actually have a loop, ideally I’d let a constant-propagation optimization pass unroll them all so they can be handled using the above technique, so the GPU doesn’t need to care.

For any loops which can’t be unrolled, we’ll just have to let the control unit & all its compute units coordinate. Only once they’ve all hit a `break` statement can the Control Unit actually take that branch. Until then there’d be break-flags in the compute units upon which the assignments are conditional upon. `continue` could be handled the same way. To map `break`/`continue` to jump targets I’d include a stack in the Control Unit & maybe overflow to caches.

GLSL has a concept of reusable functions you can call. The GPU I designed could possibly use the loop-stack I just described to evaluate these, but instead I think I’d probably have the GLSL compiler inline all functions. GLSL has requirements to allow this. Returns would become assignments to an implicit out argument, & as mentioned yesterday a flag blocking any remaining assignments in the function from running.

Builtin functions compile to special opcodes, with inlining possibly revealing useful optimizations.

#### Aggregate Types

GLSL has a concept of programmer-defined structs, which I’d mostly have the GLSL compiler decompose into their constituent fields. Except… I already said that I’d have a row of 16*32 Arithmetic Units I could use here to assign & compare (=, ==, !=) whole structs up to 512 floats or ints long! Then there’s the generated type-constructor functions which, like those for vectors & matrices, would copy data into contiguous registers. Just without repeating data to fill in unspecified fields. When it comes to arrays in GLSL, I’d treat them like structs if (post loop-unrolling & constant propagation) they’re only indexed by constants. As for the rest, which are rarely used directly but are useful for implementing textures & per-vertex “attribute” inputs, I’d let the GPU’s Compute Units access RAM via a cache hierarchy. To cover the latency of memory fetches & iterate over every datum that Compute Unit processes, & loop until all the data has arrived. To avoid compromising between fast memory & high capacity memory, CPUs have both. With the fast memory storing recently accessed data, & fallsback to slower memory for any data it’s not storing (a “cache miss”). There may be multiple levels of this, each transferring larger chunks of data. In a GPU these “caches” are also needed to coordinate sheer Compute Unit concurrency, talk about rush hour! Upon multiple cache misses I’d probably drop all but one, for the loop to refetch.

Writes would almost entirely be used for internal purposes I’ll discuss shortly. To apply these writes the caches would need to send data up towards their parents or RAM, either immediately or at their leasure. Say when we need to free up cache memory? Or when the cache hierarchy isn’t overwhelmed with memory operations? To make this a bit more consistant we could have the Compute Units grab a lock on a memory address before writing to it. Except… Did I mention the cache hierarchy would be overwhelmed? We can minimize that by tracking which cachelines we already hold the only copy of by adding a 2bit “MESIstate machine to each of them (more states could be usefully added with a 3rd bit).

And like we’d have multiple busses into the cache so our children or Compue Units can access it concurrently, we could similarly connect our sibling caches to avoid bothering our parent.

## OpenGL ES 2.2 Implementation

The Depth Buffering Algorithm transitions from processing vertices to triangles to “fragment” pixels therein before outputting each of those fragment pixels. I find this to be the most interesting part of the process, but public documentation on it is scarce. So I’m largely guessing! How’d I transition from processing vertices to processing triangles in the GPU I described? It may be simpler to go the other way: Iterate over a triangle’s vertices & process the 1st one we successfully lock. In other words, I’d have all Compute Units try to claim processing of their triangle’s (hardcode an ID to derive that from) 1st vertex, with cache misses instructing it that processing said vertex would be redundant. Instead it’d try claiming the 2nd then 3rd. Afterwords each Compute Unit would need to validate that all it’s vertices have been computed, & hopefully we won’t need to rerun the vertex shader.

Those output “varyings” would be written to “scratch” memory in the cache but never flushed to RAM. Ensuring the computed “varyings” are near at least one of the Compute Units which needs it. To transition from triangles to their inner “fragments” (as we interpolate their inputs via mix()) I’d need an opcode to fork off a computed number of new tasks, copying relevant data into them & adding a counter, after we’ve discarded over half the datums. This would be a special opcode in that it operates accross registers for different datums!

When the tasks spawned in a Compute Unit overflows the newly-freed datums I’d write the overflow to some that scratch memory. Upon underflow I’d check the overflowed tasks other Compute Units in a cache-friendly order to take over some of their workflow. If even that’s not enough to handle the overflow, we can add an implicit loop to processing the remaining data. Reading out the data could involve atomic mutations on the appropriate coordinates in the depthbuffer then framebuffer. However, in doing so I’d want to switch to a different Control Unit to minimize the load I’d be placing on the cache hierarchy. This alternate Control Unit would use the summers to merge-sort the data, outputting the lowest coordinate to the highest. Preblending colours wherever that’s needed. This should maximize the chances that the cacheline of those depth & frame buffers we need is already in the cache. Then, as mentioned, go back & processing remaining fragments!

### Device Driver

To let apps use hardware-accelerated graphics GPUs provide them a library to use (ignoring the kernel inserting itself in the middle & second-guess everything, at least I think that’s all the Direct Rendering Manager code I read through does…), adhering to a standardized API from Kronos Group so apps don’t have to worry about what GPU they’re running on. Here I’ll describe how I’d implement OpenGL ES 2.2 (to keep things simple) upon the hypothetical GPU I’ve described. Mostly the OpenGL ES implementation would consist of (de)allocating GPU-accessible memory (`glGen_()`, `glDelete_()`, & `glIs_()`), managing global references to them (`glBind_()`), & copying data into them whether in whole (`gl_Data()`) or in part (`gl_SubData()`). With seperate functions (as indicated by `_`) for Buffers (reusable accross draws), wrapping Vertex Attribute Arrays, textures, wrapping output Frame Buffers & Render Buffers, & Shader programs. There’s also global & property tracking. The bulk of the code would be in compiling GLSL shaders (glCompileShader()), consisting of a C preprocessor, parser, name resolution, typechecker, ideally optimizations (most importantly constant propagation, loop unrolling, if/return-lowering, function inlining, common-subexpression elimination, & swizzle propagation), & code generator. Also there’s combining there’s shaders (glLinkProgram()) adding connecting code & literal code to be overriden with “uniforms” & pointers to buffers. Then we have glDrawArrays() glDrawElements() to tell our hypothetical GPU to run that program! Possibly working around hardware limits like Mesa3D does. glFlush() & glFinish() operates on it’s underlying locks & queues. There’s a few functions (glDepthRangef(), glViewport(), glFrontFace(), glEnable(), glDisable(), glStencil*(), etc) which hook expose otherwise internal `uniform`s to the additional code we’ve added to connect the shaders. glLineWidth() & glPolygonOffset() hooks to different connecting GPU code. glDepthFunc(), glBlend*(), & certain enable-flags controls the behaviour that hypothetical switches to for more effecient read-out. Then we have texture-compression (which I haven’t discussed under the assumption that today RAM is plentiful), whilst glClear() & glGenerateMipMap() could be implemented glDrawArrays().

### OpenGL 1 & 2

The OpenGL standards effort started in 1992 when we haven’t fully figured out how GPUs should be designed, especially within the transister densities of the day. By the 1980s though we had pretty much figured out the computer graphics algorithms we’ve been using ever since. It was just matter of waiting for the hardware to catch up, using supercomputers like The Cray (which closely resembled a modern GPU within the manufacturing limits of the day) or robots in the meantime. So versions 1 & 2 have some significant differences from OpenGL 3 & 4, with OpenGL ES being the first to shed its legacy. How’d we implement OpenGL 2 upon OpenGL 3?

The biggest difference: In those days GPUs weren’t programmable, if they were there was not enough consistency to target multiple GPUs.

So to implement OpenGL 2 on a modern GPU the driver would have to provide their own vertex & fragment shaders games would be stuck using. The vertex shader would implement matrix transforms & diffuse/specular/ambiant lighting with optional spotlight cutoffs & intensity reduction over distance. The fragment shader would implement texturing. Adhering to OpenGL 2 we’d expose functions for each of those OpenGL uniforms.

Another difference is how vertex arrays are sent to the GPU. In OpenGL 2 we called functions per-attribute & per-vertex, where as to adapt to modern OpenGL we’d need to gather those all into a buffer before handing them off to the GPU. Though passing the GPUs pointers to data is a better API for most of what we do.

Finally these early versions of OpenGL mandated that GPU vendors bundle a matrix transform library with their drivers. There’s little reason to hardware accelerate this. So I’d need to implement matrix multiplication, as well as the rotate/scale/translate/skew/identity/perspective/orthographic matrices whilst computing their inverses, as well as a “look-at” matrix wrapping rotations. These would all be exposed functions.

As mentioned earlier 3 matrices were tracked: model (stacked), view (inversed), & projection matrices to multiply together. This can all be readily implemented in constant-time performance without hardware acceleration.

We weren’t always able stuff a GPU full of “Arithmetic Units”, instead we built them like factory assembly lines. Achieving sheer throughput at the expense of the performance of individual datums. Connecting multiply-adders into the depth-buffering formulas, whether using hardwiring (easier pipelining) or software via Control Units (where NVidia got their fame), is straightforward. But how did they transition between the pipeline stages?

Early on fragment processors & (where the CPU wasn’t already fast enough) vertex processes were separate chips, connected bucket sorting in some order or other. Like in the 1st PlayStation or iPhone. Bucket sort has its limits, but those can be aligned to the hardware limits. Or if we can fit a handful of vertex & fragment processors… The transitionary circuitry is kept secret, but I’ll echo some conjectures I’ve read.

A queue buffers input from OpenGL calls to the vertex processors (some versions of OpenGL allows for preprocessing software, or other I/O could be queued here as well like in the Nintendo Gamecube or, via an ARM core, Wii). I’d have some memory listing all triangles in the geometry with vertex indices. Once a vertex has finished processing I’d flag the appropriate registers in that memory, thus indicating which triangles are ready to be rasterized.

In GPUs imagery were traditionally split up into tiles, & often compressed. Within a tile pixels were listed along a space-filling curve. So after culling triangles they were dispatched to the appropriate processor for the output tile(s) they cover. Triangles typically only cover a single tile.

The bulk of rasterization is in looking up texture coordinates, so we’d want to compute multiple output pixels (a whole tile’s worth?) in the same processor. Then we’d need a cache to buffer & prefetch output tiles. These techniques could explain the behaviour of some later sprite hardware seen in arcades like Galaca. The circuits to output this rendered image to the screen would need to be tweaked to undo the tiling, which aided it’s computation.

Unless otherwise noted, imagery is sourced from Wikimedia Commons or Khronos Group. Click to see additional credits.