Having just studied how to make (2D vector) graphics perform well on a normal CPU, I will now turn my attention to how to hand the task off to hardware better suited for it. And 3D graphics. By studying the Mesa3D OpenGL implementation!
Depth Buffer Algorithm
But today I want to describe the actual “Depth Buffer” algorithm that underlies OpenGL, and which is handed off to that specialized hardware. I will not discuss the more modern “Recursive Ray Tracing” algorithm movies and increasingly games now use.
Simply put, the Depth Buffer Algorithm consists of six steps:
- Matrix multiply every point in the 3D model*
- Light every point in the 3D model*
- Discard obviously obscured triangles from the 3D model, & clip to viewport
- Interpolate between each triangle’s points to compute its “fragments”
- Finalize “fragment” colours*
- Check depth buffer to determine which fragments to output as pixels
The * marks steps you can substitute with your own “shader” programs.
But these steps deserve further explanation.
(3D) Matrices store the lowercase variables in the formulae:
X' = aX + bY + cZ + d
Y' = eX + fY + gZ + h
Z' = iX + jY + kZ + l
Which can be easily used to express any geometric transform, or sequence thereof.
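To make those formulae concrete, here’s a minimal sketch (using my own 3×4 row layout for illustration, not OpenGL’s actual column-major 4×4 convention):

```python
def transform(matrix, point):
    # Apply the 3x4 matrix [[a,b,c,d],[e,f,g,h],[i,j,k,l]] to (X, Y, Z),
    # computing X' = aX + bY + cZ + d, and likewise for Y' and Z'.
    x, y, z = point
    return tuple(m[0]*x + m[1]*y + m[2]*z + m[3] for m in matrix)

# The identity matrix plus a translation of 5 along X:
move = [[1, 0, 0, 5],
        [0, 1, 0, 0],
        [0, 0, 1, 0]]
print(transform(move, (1, 2, 3)))  # → (6, 2, 3)
```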
Here multiplying all points by the same matrix accomplishes three things:
- Optionally applies perspective (by dividing by Z)
- Places the object in the scene
- Places the camera in the scene (by moving everything else so the camera sits at the center of the universe)
OpenGL exposes these as three separate matrices, but they can easily be combined by the CPU.
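That combining is ordinary matrix multiplication; a sketch using the same hypothetical 3×4 affine layout as the formulae above (each matrix gets an implicit bottom row of 0, 0, 0, 1):

```python
def transform(matrix, point):
    # Apply a 3x4 affine matrix to a 3D point.
    x, y, z = point
    return tuple(m[0]*x + m[1]*y + m[2]*z + m[3] for m in matrix)

def compose(second, first):
    # Return one 3x4 matrix equivalent to applying `first`, then `second`,
    # by matrix multiplication with the implicit (0, 0, 0, 1) bottom rows.
    return [[sum(second[r][k] * first[k][c] for k in range(3))
             + (second[r][3] if c == 3 else 0)
             for c in range(4)]
            for r in range(3)]

scale2 = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0]]  # double every axis
move = [[1, 0, 0, 5], [0, 1, 0, 0], [0, 0, 1, 0]]    # then shift 5 along X
both = compose(move, scale2)
print(transform(both, (1, 1, 1)))  # → (7, 2, 2)
```

Transforming every point once by the combined matrix gives the same result as transforming twice, which is where the performance win comes from.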
Since, for performance reasons, each point is lit independently of all other points, OpenGL’s lighting is at best highly approximate.
The intensity for each light source is computed from the angle between the point’s surface (its normal) and the direction to that light source, with the result optionally raised to an exponent to approximate shininess.
The intensities of all these light sources (and the omnipresent “ambient” light) are multiplied by their colours and summed.
This is called “Gouraud Shading”.
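A minimal sketch of that per-vertex computation for a single light (a hypothetical helper; vectors are assumed pre-normalised, and colours are (r, g, b) triples in 0..1):

```python
def light_vertex(normal, to_light, light_colour, ambient, shininess=1):
    # Diffuse intensity from the angle between the surface normal and the
    # direction to the light (their dot product), clamped so surfaces facing
    # away get nothing, optionally raised to an exponent for shininess.
    intensity = max(0.0, sum(n * l for n, l in zip(normal, to_light)))
    intensity **= shininess
    # Multiply the intensity by the light's colour and sum with ambient.
    return tuple(min(1.0, a + intensity * c)
                 for a, c in zip(ambient, light_colour))

# Surface facing straight up, grey light directly overhead:
print(light_vertex((0, 1, 0), (0, 1, 0), (0.5, 0.5, 0.5), (0.25, 0.25, 0.25)))
# → (0.75, 0.75, 0.75)
```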
Processing each triangle will be considerably more computationally intensive in the following steps, so it’s important to discard ones that are obviously unneeded. Specifically, the Depth Buffer Algorithm applies the following tests:
- (OPTIONAL) Assuming its points are listed in clockwise order, is the triangle facing away from the camera?
- Is the triangle outside the viewport?
- Does the triangle intersect the edges of the viewport? If so, split it into smaller triangles.
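The first two tests can be sketched in 2D screen space like so (hypothetical helpers; which sign of the area means “facing away” depends on the Y-axis direction):

```python
def signed_area(a, b, c):
    # Twice the signed area of screen-space triangle abc: the cross product
    # of two of its edges. The sign reveals the winding order, and hence
    # (under the clockwise convention above) whether it faces away.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def outside_viewport(tri, width, height):
    # Trivial reject: every point lies beyond the same viewport edge.
    return (all(x < 0 for x, y in tri) or all(x >= width for x, y in tri) or
            all(y < 0 for x, y in tri) or all(y >= height for x, y in tri))

print(signed_area((0, 0), (0, 1), (1, 0)))                     # → -1
print(outside_viewport([(-1, -1), (-2, 3), (-5, 0)], 10, 10))  # → True
```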
Since we now know the colour (and/or other information) of each point bounding the triangle, we need to compute the intermediate values at each pixel (or rather, for the sake of nuance, “fragment”) directly between them. This is called “linear interpolation”.
To cover every fragment, it’ll first iterate along the edges between the topmost point and the bottommost point, then across each horizontal span between them, with the middle point marking where one of those edges changes.
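A minimal sketch of interpolating one such horizontal span of fragments (a hypothetical helper, ignoring how the edges themselves are walked):

```python
def lerp(a, b, t):
    # Linear interpolation: a at t=0, b at t=1.
    return a + (b - a) * t

def span_fragments(y, x_left, x_right, colour_left, colour_right):
    # Interpolate a colour across one horizontal span of fragments.
    for x in range(x_left, x_right + 1):
        t = (x - x_left) / max(1, x_right - x_left)
        yield x, y, tuple(lerp(cl, cr, t)
                          for cl, cr in zip(colour_left, colour_right))

# Fade from black to red across 5 fragments of scanline 5:
for frag in span_fragments(5, 0, 4, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)):
    print(frag)
```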
For step 5 (finalize the fragment colour) you could just use the interpolated colour.
Or you could look the interpolated point up in a 2D “texture” image. Or use a better lighting algorithm like Phong Shading. Or use a more specialized lighting algorithm, perhaps for rendering skin. Or something else entirely. Or some combination of the above.
OpenGL doesn’t care what you tell this step to do.
And finally each fragment is checked against the titular “depth buffer” to determine whether it’s in front of what’s already been rendered, and as such should override that pixel in both the frame & depth buffers.
If you want to incorporate semi-transparent surfaces this gets a bit trickier, and does require the CPU to sort the 3D models it needs rendered.
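Both the depth test and that trickier semi-transparent case can be sketched together (my own convention here: smaller Z means nearer the camera):

```python
def write_fragment(frame, depth, x, y, colour, z, alpha=1.0):
    # Depth test: keep the fragment only if it's nearer than what's already
    # been rendered (smaller Z = nearer, under this sketch's convention).
    if z >= depth[y][x]:
        return  # obscured by something already rendered
    if alpha < 1.0:
        # Semi-transparency: blend over whatever is already there. For this
        # to look right the CPU must submit transparent models back-to-front.
        old = frame[y][x]
        colour = tuple(alpha * c + (1 - alpha) * o
                       for c, o in zip(colour, old))
    frame[y][x] = colour
    depth[y][x] = z

frame = [[(0.0, 0.0, 0.0)] * 2 for _ in range(2)]
depth = [[float("inf")] * 2 for _ in range(2)]
write_fragment(frame, depth, 0, 0, (1.0, 0.0, 0.0), z=5.0)
write_fragment(frame, depth, 0, 0, (0.0, 1.0, 0.0), z=9.0)  # farther: discarded
print(frame[0][0])  # → (1.0, 0.0, 0.0)
```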
Yesterday I described the “Depth Buffer” algorithm used by OpenGL to render a 3D scene, but how do we build hardware to make that algorithm fast? That’s what I want to answer this morning.
Essentially we just need something that can process every point, triangle, and fragment therein (or alternatively, for the more modern “Recursive Ray Tracing” algorithm, every output pixel) simultaneously.
Interestingly as transistors got smaller and faster GPUs got simpler, in contrast to CPUs.
But first, how do processors work more generally? There are four main components:
- “Registers” store data being processed now.
- “ALUs” (Arithmetic-Logic Units) apply common operations (here floating point vector arithmetic, especially multiply-add/sum) to those registers.
- “CUs” (Control Units) fetch and decode the processor’s instructions.
- And “cache” partially addresses the bottleneck on RAM by storing the most recently accessed data. Smaller caches are faster but less useful.
So how do you build a circuit to process multiple items simultaneously? You stuff it full of ALUs!
You have multiple register “banks” (each storing a particular item) per ALU, multiple ALUs per “L1d” cache, multiple L1d caches per CU (& L2 cache), & multiple control units per GPU.
Then add a cache hierarchy (smaller caches read from bigger caches), not just for more performance but also parallelism: the latency gets covered up by having ALUs constantly switch between different items.
The catch with this design is that, with numerous ALUs sharing a single CU, the CU can only follow a branch if all its ALUs agree that the branch should be taken (which it must still implement for the sake of loops). Otherwise each ALU needs to be responsible for deciding whether an individual instruction applies to it.
There’s no callstack, so no recursion is allowed. And certainly no function pointers!
I’d implement this by having opcodes specify a conditional mask on a special register.
That about covers what I learnt at Uni, but let me extrapolate further.
- How are data items allocated to the ALUs?
- How does the GPU transition between processing points to triangles to fragments? Well, that last transition would involve generating new data items to be allocated as per the first question.
- How is data read out of the GPU? How does it handle multiple fragments being on the same pixel?
To route data items to a GPU’s ALUs I would have the cache hierarchy store some flags indicating who is free and accepting new work. In the individual register banks this flag would disable computation in a similar way to how I’d handle conditions.
And when those ALUs generate new work (upon interpolation) I’d first check availability nearby in order to maximize parallelism. Failing that I’d propagate it up the cache hierarchy to be allocated that way, and perhaps use the cache as a buffer.
The cheapest way to transition from processing a 3D model’s points to its triangles is probably to start with the triangles. At the very least, you could then iterate over each of their points and process them.
Or, to reduce work duplication, the L1d cache could help out and use its cache misses to determine when to queue up a new point to be processed by the vertex shader. The ALUs could then temporarily switch to processing those items and store the results in that L1d cache for interpolation.
Operating independently, the GPU’s ALUs do have some idea of which fragments they should discard (even before allocating that work) and can optimistically blend the computed colour onto the background. But those pixels need to be written to RAM in Y-then-X order.
It’s conceivable that an ALU could order its own items if it temporarily switched to a different CU, but to sort results from multiple ALUs I’d have a binary tree of adders perform a merge sort and, when necessary, simple colour blending.
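In software that merge tree could be sketched like so (averaging stands in for whatever simple colour blending the hardware would actually do):

```python
import heapq

def merge_fragment_streams(streams):
    # Merge streams of ((y, x), colour) fragments already sorted in
    # Y-then-X order, blending any that land on the same pixel.
    merged = []
    for (y, x), colour in heapq.merge(*streams):
        if merged and merged[-1][0] == (y, x):
            old = merged[-1][1]
            merged[-1] = ((y, x),
                          tuple((o + c) / 2 for o, c in zip(old, colour)))
        else:
            merged.append(((y, x), colour))
    return merged

# Two streams, both containing a fragment for pixel (0, 0):
print(merge_fragment_streams([
    [((0, 0), (1.0, 0.0, 0.0))],
    [((0, 0), (0.0, 0.0, 0.0)), ((0, 1), (0.0, 0.0, 1.0))],
]))
```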
And while I’m at it, let me say that I’d probably design a GPU’s ALUs to consist primarily of an integer vector add/sum circuit, leaving the CU to split floating point operations into smaller micro-ops. For a minor performance hit this should allow stuffing many more ALUs into the GPU, but it probably doesn’t make much of a difference.
Fin, tomorrow I’ll start studying the Mesa3D OpenGL implementation. And eventually Linux’s Direct Rendering Manager infrastructure.