Having just studied how to make (2D vector) graphics perform well on a normal CPU, I will now turn my attention to how to hand the task off onto hardware better suited for the task. And 3D graphics. By studying the Mesa3D OpenGL implementation!
Depth Buffer Algorithm
But today I want to describe the actual “Depth Buffer” algorithm that underlies OpenGL, and which is handed off to that specialized hardware. I will not discuss the more modern “Recursive Ray Tracing” algorithm movies and increasingly games now use.
Simply put, the Depth Buffer Algorithm consists of six steps:
- Matrix maltiply every point in the 3D model*
- Light every point in the 3D model*
- Discard obviously obscured triangles from the 3D model, & clip to viewport
- Finalize “fragment” colours*
- Check depth buffer to determine which fragment to output as pixels
The * marks steps you can substitute with your own “shader” programs. There are newer variations of this which adds additional steps.
But these steps deserve further explanation.
(3D) Matrices are store the lowercase variables in the formulai:
X' = aX + bY + cZ + d Y' = eX + fY + gZ + h Z' = iX + jY + kZ + l
Which can be easily used to express any geometric transform, or sequence thereof.
Here multiplying all points by the same matrix accomplishes three things:
- Optionally applies perspective (by dividing by Z)
- Places the object in the scene
- Places the camera in the scene (by placing the camera at the center of the universe)
OpenGL exposes these as three seperate matrices, but they can easily be combined by the CPU.
Since, for performance reasons, each point is lighted independantly of all other points OpenGL is at best highly approximate.
The intensity for each light source is computed from the angle of the point’s surface (it’s normal) & the angle between it and the light source, with an exponent optionally be added to approximate shininess.
The intensities of all these light sources (and the omnipresent “ambiant” light) is multiplied by their colours and summed.
This is called “Gouraud Shading”.
Processing each triangle will be considerable more computationally intensive in the fullowing steps, so it’s important to discard ones that are obviously unneeded. Specifically, the Depth Buffer Algorithm applies the following tests:
- (OPTIONAL) Assuming it’s points are listed in clockwise order, is the triangle facing away from the camera?
- Is the triangle outside the viewport?
- Does the triangle intersect the edges of the viewport? If so split into smaller triangles.
Since we now know the color (and/or other information) of each point bounding the triangle, we need to compute the intermediate values at each pixel (or rather, for the sake of nuance, “fragment”) in directly between them. This is called “linear interpolation”.
To cover every fragment, it’ll first iterate along the edge between the topmost point and the bottom most point, then along the Y axis as determined by the middle point.
For step 5 (finalize the fragment colour) you could just use the interpolated colour.
Or you could look the interpolated point up in a 2D “texture” image. Or use a better lighting algorithm like Phong Shading. Or use a more specialized lighting algorithm, perhaps for rendering skin. Or something else entirely. Or some combination of the above.
OpenGL doesn’t care what you tell this step to do.
And finally all fragments are checked against the titular “depth buffer” to determine whether it’s in front of what’s already been rendered, and as such should override that pixel in both the frame & depth buffers.
If you want to incorporate semi-transparent surfaces this gets a bit trickier, and does require the CPU to sort the 3D models it needs rendered.
Yesterday I described the “Depth Buffer” algorithm used by OpenGL to render a 3D scene, but how do we build hardware to make that algorithm fast? That’s what I want to answer this morning.
Essentially we just need something that can process every point, triangle, and fragment therein (or alternatively, for the more modern “Recursive Ray Casting” algorithm, every output pixel) simultaneously.
Interestingly as transistors got smaller and faster GPUs got simpler, in contrast to CPUs.
But first how do processors work more generally? There are four main components:
- “Registers” store data being processed now.
- “ALUs” (Arithmatic-Logic Units) apply common operations (here floating point vector arithmatic, especially multiply-add/sum) to those registers.
- “CUs” (Control Units) fetches and decodes the processor’s instructions.
- And “cache” partially addresses the bottleneck on RAM by storing the most recently accessed data. Smaller caches are faster but less useful.
So how do you build a circuit to process multiple items simultaneously? You stuff it full of ALUs!
You have multiple have multiple register “banks” (storing a particular item) per ALU, multiple ALUs per “L1d” cache, multiple L1d caches per CU (& L2 cache), & multiple control units per GPU.
Then add a cache hierarchy (smaller caches read from bigger caches) not just for more performance but also parallelism. Covering up the latency by having ALUs constantly switch between different items.
The catch with this design is that by having numerous ALUs share a single CU, that CU can only follow a branch if all the ALUs agree that branch should be taken (which it would have to implement for the sake of loops). Otherwise the ALUs need to be responsible for deciding whether an individual instruction applies to it.
There’s no callstack, so no recursion is allowed. And certainly no function pointers!
I’d implement this by having opcodes specify a conditional mask on a special register.
That about covers what I learnt at Uni, but let me extrapolate further.
- How are data items allocated to the ALUs?
- How does the GPU transition between processing points to triangles to fragments?
- How is data read out of the GPU? How does it handle multiple fragments being on the same pixel?
The cheapest way to allocate initial jobs to ALUs would probably be to give each ALU a unique ID for it to fetch the relevant data.
To split tasks like in the transition from triangle culling to fragment shading I’d start by populating some internal registers before toggling a latch altering which physical registers are considered internal. Once that overflows it could consult per ALU atomic queues the most cache-local first. If that’s still not enough I’d incorporate a loop into this operation’s microcode or firmware.
The cheapest way to transition from processing a 3D model’s points to it’s triangles is probably to start with the triangles. At the very least you could then iterate over each of those points and process them. You could even cache computational results within the cache hierarchy, hoping that the 2 subsequent iterations don’t have any data needing to be freshly computed. This’d incur some duplicate computation, but cache bandwidth is probably the more limited resource.
Operating independantly the GPU’s ALUs do have some idea of which fragments it should discard (even before allocating that work) and can optimistically blend the computed onto the background. But those pixels need to be written to RAM in Y-then-X order.
It’s conceivable that the ALUs could order their own items if it temporarily switched to a different control-unit, but to sort results from multiple ALUs I’d perform atomic writes trying to minimize the number of different directions the cache-hierarchy is pulled whilst hoping direct memory access conflicts are rare. Each cacheline can keep a couple bits of read-write lock state, whilst the memory reads would need to loop anyways in case there’s greater than expected latency.
While I’m at it, let me say that I’d probably design a GPU’s ALUs to consist primarily of a integer vector add/sum circuit, leaving the control-unit to split floating point operations into smaller micro-ops. For a minor performance hit this should allow stuffing many more ALUs into the GPU, but it probably doesn’t make much of a difference.
Fixed Pipeline GPUs
Fore more-indepth discussion see: A Trip Through the Graphics Pipeline 2011
Before we could stuff enough transistors into a GPU that it can process every vertex, triangle, then “fragment” pixel simultaneously we had to rely on hardcoding this Depth Buffering algorithm & optimizing for sheer throughput.
Assembling multiply-adders into the formulas for matrix transforms, shading, triangle culling, interpolation, texturing, depth-testing, & blending is straightforward. As is associating such a multiply-adder with a control-unit to allow graphics devs to swap in their own logic.
This leaves the transitions between phases for discussion, & unfortunately that’s where tradesecrets lie. So this is mostly (others’) educated guesses.
There needs to be a circuit dispatching incoming vertices possibly with RAM-buffering & maybe GPU-side generation.
To transition from vertices to triangles a table (defaulting to incrementing sequence) listing vertex indices making up each triangle is tracked in the GPU to be matched against each processed vertex. Once all have been matched for a triangle its dispatched to the triangle culling phase. That strategy would limit the number of triangles that can be rendered in each batch, but I have seen Mesa3D code subverting such constraints. NOTE: PlayStation1 barely had a vertex processing unit!
Once triangles have been culled GPUs split them up into the tiles (sized to match cache sizes, & presumably prioritizing upper-left since that’s sent out first) they cover. Necessitating a many-to-many dispatch circuit.
Concurrency occurs within each tile, including interpolation. The GPU stores images in RAM as sequences of tiles.
The main throughput challenge in this “fragment shading” phase is fetching values from a texture image in graphics-RAM. So GPUs have long that time by proceding to the next tile & fetching its pixel from the texture image.
Finally we need to take those tiles & combine them into the output image! Or if you’re the PlayStation1 you do a naive RAM write, relying on the CPU to sort triangles from back to front.
Combining these tiles into an output image shouldn’t require much more than a concurrent memory cache. With a choice of logic to combine the old & new tiles. Presumably there’d a performance win in delaying writes in case more data to blend/depthtest in will follow up soon.
Certainly adding more depthtests earlier on would conserve computation thus improving throughput!