How does the CPU utilize the GPU for hardware-accelerated (3D) graphics? You use an implementation of OpenGL like Mesa3D!
Mesa3D implements the public OpenGL APIs mainly to “flush” any necessary data, check inputs, track state in “context” globals, and forward calls onto the driver class. Though it also lowers matrices, compresses textures, and more; I’ll dig into its GLSL compilation later.
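To make that concrete, here’s a minimal sketch (in C, with entirely illustrative names, not Mesa’s real identifiers) of the shape such a wrapper takes: validate inputs, record state in the context, mark it dirty, and dispatch through a driver function pointer:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative context: tracked API state, dirty flags, and a
 * driver hook. All names here are made up for the sketch. */
typedef struct context {
    unsigned dirty_flags;           /* state that must be re-sent to the GPU */
    float    clear_color[4];        /* tracked public-API state */
    void   (*driver_clear_color)(struct context *ctx); /* driver vtable entry */
} context;

#define DIRTY_CLEAR_COLOR 0x1u

static void noop_driver(context *ctx) { (void)ctx; }

/* The wrapper: check inputs, track state, mark dirty, forward. */
static int api_ClearColor(context *ctx, float r, float g, float b, float a)
{
    if (!ctx)
        return -1;                      /* input validation */
    ctx->clear_color[0] = r;
    ctx->clear_color[1] = g;
    ctx->clear_color[2] = b;
    ctx->clear_color[3] = a;
    ctx->dirty_flags |= DIRTY_CLEAR_COLOR;  /* state tracking */
    ctx->driver_clear_color(ctx);           /* forward to the driver */
    return 0;
}
```

The real entrypoints do much more checking and flushing, but the track-then-dispatch structure is the core of it.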
Today I’ll start digging into Mesa3D’s Radeon driver, because I’d get overwhelmed studying all the drivers and I’ve got a Radeon GPU.
To start off with there’s an array of fallback drivers, the first two of which seem to be doing the same sort of state tracking as the wrapping functions; the third just sets some configuration, and the last does the real work.
To initialize the Radeon GPU, it allocates some memory for the device, initializes it with lots of configuration info, and asks the file descriptor for the ID of the GPU (using Linux’s Direct Rendering Manager).
When told to create a buffer for a given __DRIdrawable, the Radeon driver initializes up to six:
- A standard Mesa framebuffer
- A front color buffer
- Optional back color buffer
- Possibly combined depth & stencil buffers
I just see state tracking here, nothing happening on the GPU. Though some of these might rely on software emulation.
All the real renderbuffer action occurs when switching between different contexts, where, after some checks against dirty buffers, the public API state, and what buffers are available, it:
- Optionally calls .getBuffersWithFormat() or .getBuffers() on the screen’s DRI2 loader.
- Finds the buffer with the specified name and updates its state, possibly “opening” it to find its tiling information.
- Stores the new target & updates size info.
And beyond that when the Radeon driver’s told to make a drawable current, it flushes any necessary data and possibly resets the public API’s state. From there it creates and updates framebuffers as needed before updating its own state.
Unbinding a context just resets the public API’s globals.
And finally that vtable contains methods for constructing a “context”, which will allocate and initialize that object with plenty of state, configuration, and function pointers. A C macro chooses from one of two.
Transform ‘N Lighting work chunking
The first step is to compact the attribute arrays associated with the vertices into a TNL-specific array and the rest into a separate array, skipping any not enabled by a specific bitfield.
Following that it extracts the minimum and maximum vertex indices, before calling a driver method determined by ctx->Query.CondRenderMode to determine if it should proceed.
Then if min_index is non-zero it recomputes the triangle indices so that it is, or failing that updates .start on the vertices.
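That index rebasing can be sketched trivially (illustrative code, not Mesa’s): subtract the minimum index from every index so the vertex arrays can be uploaded starting at that offset.

```c
#include <assert.h>
#include <stddef.h>

/* Rebase indices so the smallest referenced index becomes 0,
 * letting the vertex arrays be uploaded from that offset onward. */
static void rebase_indices(unsigned *indices, size_t count, unsigned min_index)
{
    for (size_t i = 0; i < count; i++)
        indices[i] -= min_index;
}
```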
If there’s too many vertices for the CPU or GPU to process in a single go, Mesa3D’s TNL pipeline (used by its Radeon200 driver) will split the vertices in half, either in place or into newly-allocated memory depending on the quantity.
Which involves taking the longest prefix of the triangles array with the same base vertex and stack-allocating a “context” to reference that slice, before restructuring that data into new GPU buffers whilst resolving the geometry mode (GL_POLYGON, etc) with some copying.
The in-place handling meanwhile simply handles each vertex it sees in turn until it has enough to send off to the GPU, or it’s got few enough it can afford to do a copying split.
Once it has a chunk of vertices it recursively calls back into the main TNL drawing routine.
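The chunk-and-recurse loop looks roughly like this sketch (the batch limit and callback are assumptions, not the driver’s real values):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VERTS_PER_BATCH 4096u  /* assumed hardware/driver limit */

typedef void (*draw_fn)(size_t start, size_t count, void *user);

/* Walk the vertex range in capped chunks, handing each chunk back
 * to the main drawing routine; returns how many batches were made. */
static size_t split_draw(size_t total, draw_fn draw, void *user)
{
    size_t batches = 0;
    for (size_t start = 0; start < total; start += MAX_VERTS_PER_BATCH) {
        size_t count = total - start;
        if (count > MAX_VERTS_PER_BATCH)
            count = MAX_VERTS_PER_BATCH;
        draw(start, count, user);   /* re-enter the main drawing routine */
        batches++;
    }
    return batches;
}

/* Example callback: just accumulate how many vertices were drawn. */
static void count_batch(size_t start, size_t count, void *user)
{
    (void)start;
    *(size_t *)user += count;
}
```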
So far I have described how Mesa3D’s TNL subsystem checks whether the rendering should happen, rebases the indices to start from 0, and splits the vertices to cap how many are in each batch.
Once it has done that it splits the chunk by “base vertex”, associates each subchunk with the vertex buffer (using some GPU memory), and forwards each to the driver’s ->RunPipeline() method.
Continuing on with my overview of how Mesa3D’s Radeon200 GPU driver works, there’s a method for reading driver identifiers. This driver specifies that GL_VENDOR is “Mesa Project” & GL_RENDERER is whatever Linux’s Direct Rendering Infrastructure says, appended with whether the TCL fallback is enabled.
Then continuing on there’s the “IOCtl” methods of Clear(), Finish(), & Flush(), which I’ll be focusing on this morning.
To .Flush() data through the GPU, it first forwards the method call to its DMA property before sending any of its queued-up instructions to the GPU. Then it tells its screen’s DRI2 loader to flush any of that data out of the GPU to its destination.
To .Finish() some rendering, it starts with a .Flush() before waiting on all the color buffers then the depth buffer. Waiting involves scheduler-yielding spinlocks and/or Direct Rendering Manager commands written to the GPU’s file descriptor.
And to .Clear() the output so all pixels are the same color, it first flushes any existing activity. Then if needed (using buffers other than FRONT_LEFT, BACK_LEFT, DEPTH, STENCIL, or COLOR0) it tells the software rasterizer to clear its bits, before using the generic routine of temporarily switching the GPU’s state to render a couple of triangles atop everything.
The software-based clear meanwhile involves validation, traversing its data model, and setting every pixel to the specified value.
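Stripped of the validation and data-model traversal, a software clear boils down to something like this sketch (assuming an already-mapped 32-bit-per-pixel buffer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the clear value to every pixel of a mapped buffer.
 * The real swrast path walks renderbuffer "image slices" and
 * handles multiple pixel encodings; this is the inner idea only. */
static void sw_clear(uint32_t *pixels, size_t width, size_t height,
                     uint32_t clear_value)
{
    for (size_t i = 0; i < width * height; i++)
        pixels[i] = clear_value;
}
```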
Continuing on my study of how Mesa3D’s Radeon200 driver works, today I will describe its state-setting methods.
To .Update()/invalidate state it first invalidates its drawbuffer’s bounds, followed by its SoftWare RASTerizer & Transform ‘N Lighting engine, before setting some flags and possibly NULLing out the currently-running program.
Invalidating the drawbuffer calls back into the public API to set it to the scissor’s bounds.
Invalidating the SoftWare RASTerizer is split into two parts: the first just forwards to dynamic dispatch; the second sets some flags, and invalidates the vertex state by enabling all new_inputs flags & resetting the interp & copy_pv vertex methods which determine whether a shortcut can be taken.
Upon a lighting change it sets a RESCALE_NORMALS flag.
Upon setting the OpenGL draw buffer, Mesa3D’s Radeon driver first checks if it’s the front renderbuffer being set. If so it extracts encoding information and asks the DRI2 Loader to “get” those buffers. After which it opens each Buffer Object, gets its tiling information, and updates the framebuffer size.
In either case it follows up by checking whether the Radeon GPU can handle this draw call and whether it needs to update the framebuffer and its bounds, before setting the renderbuffer.
Getting a particular renderbuffer involves casting particular ColorDrawBuffers and checking their class field. Failing this it calls fallback methods (primarily fallback) on the driver. After which it references the new renderbuffer, sets some flags based on available buffers, and computes its new scissor state, which it sets via another driver method.
To read this renderbuffer it uses a very similar process.
To Copy some Pixels, Mesa3D’s Radeon200 driver uses the generic process of falling back to SoftWare Rasterization if the target is too big for its “temp” texture singleton or it’s in an unsupported state. After which it temporarily resets OpenGL’s state to render two triangles across the target rectangle.
To Draw some Pixels Mesa3D’s generic process is to first check the encoding, or tile up the input, before using OpenGL to render that box with an optional depth or stencil buffer.
Tiling a DrawPixels call just involves iterating over each cell and recursively calling _mesa_meta_DrawPixels for each. I don’t want to get into the CPU-based fallback implementations because they look quite involved, but copying pixels does appear to involve memcpy()s.
To read some pixels, Mesa’s Radeon drivers first try do_blit_readpixels before falling back to _mesa_readpixels, after making sure the state is updated.
Those fallbacks mostly deal with encoding issues, but I’ll move onto the next methods now.
Most of which (including AlphaFunc, BlendColor, BlendEquationSeparate/BlendFuncSeparate, ClipPlane, ColorMask, CullFace, DepthFunc, DepthMask, Viewport/DepthRange, etc) use bitwise operations to reencode the data for easier use with the hardware.
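The pattern is a small switch from GL enum to register bits, shifted into place so later command emission is a plain OR. A sketch for a DepthFunc-style method (the GL_* values are the real GL constants; the HW_* encodings and shift are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Real OpenGL enum values for depth-comparison functions. */
enum { GL_NEVER = 0x0200, GL_LESS = 0x0201, GL_EQUAL = 0x0202,
       GL_ALWAYS = 0x0207 };

/* Hypothetical hardware register encodings (illustrative only). */
#define HW_Z_TEST_NEVER  0x0u
#define HW_Z_TEST_LESS   0x1u
#define HW_Z_TEST_EQUAL  0x2u
#define HW_Z_TEST_ALWAYS 0x7u
#define HW_Z_TEST_SHIFT  4

/* Reencode a GL depth func as bits ready to OR into a state word. */
static uint32_t encode_depth_func(unsigned glfunc)
{
    uint32_t hw;
    switch (glfunc) {
    case GL_NEVER:  hw = HW_Z_TEST_NEVER;  break;
    case GL_LESS:   hw = HW_Z_TEST_LESS;   break;
    case GL_EQUAL:  hw = HW_Z_TEST_EQUAL;  break;
    default:        hw = HW_Z_TEST_ALWAYS; break;
    }
    return hw << HW_Z_TEST_SHIFT;
}
```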
Light[Model]fv does much the same, but might dispatch to these other methods.
Mesa3D’s Radeon200 .PolygonMode() method reconfigures methods for rendering Points, Lines, ClippedLines, Triangles, and Quads by looking them up in a table, having computed the index for that table by determining whether it’s two-sided and/or unfilled. If neither is the case it uses its own PrimTabs, PrimElts, & ClippedPolygon methods rather than Transform ‘N Lighting’s.
It also reencodes similar information (plus
_tnl_need_projected_coords()) as the “vertex state”.
Calling Mesa3D’s Radeon drivers’
.Scissor() method also involves flushing the GPU’s state and computing the bounding box, beyond simple reencoding.
And setting the viewport may also involve resizing the framebuffer (via the public API layer) and updating the “Scissor” state.
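Computing the effective scissor box is essentially a rectangle intersection against the framebuffer bounds, something like this sketch (names illustrative):

```c
#include <assert.h>

/* A scissor/bounding rectangle; half-open-ish [x0,x1)×[y0,y1). */
typedef struct { int x0, y0, x1, y1; } rect;

/* Intersect the requested scissor with the framebuffer bounds,
 * collapsing to an empty box rather than an inverted one. */
static rect clip_scissor(rect scissor, int fb_w, int fb_h)
{
    rect r = scissor;
    if (r.x0 < 0)    r.x0 = 0;
    if (r.y0 < 0)    r.y0 = 0;
    if (r.x1 > fb_w) r.x1 = fb_w;
    if (r.y1 > fb_h) r.y1 = fb_h;
    if (r.x1 < r.x0) r.x1 = r.x0;   /* degenerate -> empty box */
    if (r.y1 < r.y0) r.y1 = r.y0;
    return r;
}
```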
.NewTextureImage() just memory-allocates the datastructure; .DeleteTextureImage() deallocates it after calling a different method to free its buffer.
Allocating its buffer frees what was there before, has the SoftWare RASTerizer do its initialization, and assigns a miptree.
The software rasterizer saves some data about the buffer’s sizing before allocating memory for its “image slices”. And assigning a miptree involves calculating layout information, if the current miptree is for a different image size/encoding, before “opening” the buffer object.
And to .FreeTextureImageBuffer() it decrements the miptree’s reference count (which may in turn decrement the buffer object’s reference count), dereferences the buffer object, and frees the SoftWare Rasterizer’s texture.
To .MapTextureImage() into memory, it extracts various sizing and encoding information before memory-mapping the buffer object. And .UnmapTextureImage() unmaps its buffers.
.ChooseTextureFormat() performs a
switch statement over the current encoding to find the closest one supported natively by Radeon GPUs.
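As a sketch of that switch (the mapping and format enum here are illustrative, though GL_RGB/GL_RGBA are the real GL values; the actual driver considers many more formats):

```c
#include <assert.h>

/* Hypothetical set of formats a Radeon-class GPU handles natively. */
enum hw_format { FMT_RGBA8888, FMT_RGB565, FMT_UNKNOWN };

/* Map a requested GL internal format onto the closest natively
 * supported one; 3 and 4 are the legacy component-count forms. */
static enum hw_format choose_texture_format(int internal_format)
{
    switch (internal_format) {
    case 3: case 0x1907 /* GL_RGB */:  return FMT_RGB565;
    case 4: case 0x1908 /* GL_RGBA */: return FMT_RGBA8888;
    default:                           return FMT_UNKNOWN;
    }
}
```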
To .CopyTexSubImage() it first checks some “stamps” to avoid stepping on a buffer “swap”, computes an offset based on the miptree’s encoding (looking it up in a table), and does a blit.
.CopyTexSubImage() falls back to, & .Bitmap() uses, the normal OpenGL triangle rendering APIs.
And there’s a method for the window manager to get access to the texture, which calls a method on the screen’s DRI2 Image, extracts its encoding/sizing information, and creates a miptree for it.
And finally upon context initialization it sets some encoding globals depending on the CPU’s endianness.
Upon .CreateNewTextureObject() it sets a number of bitflags in a form that can be passed to Radeon GPUs.
And .DeleteTexture() involves .Flush()ing any vertices to the GPU, setting some flags on those vertices, and unreferencing the miptree before performing a normal texture object deletion.
.TexEnv() involves encoding
COORD_REPLACE_ARB values as expected by Radeon GPUs.
LOD_BIAS_EXT also requires querying a parameter from Linux’s Direct Rendering Infrastructure to be more specific to this GPU.
.TexParameter() unsets the texture’s
validated flag for unsupported properties.
.TexGen() sets a
recheck_texgen flag for the current “unit”.
& .NewSamplerObject() allocates/inits it via the public API layer.
.NewProgram() is a simple allocate/initialize routine mostly relying on the generic
program subsystem, allocating a bit of extra memory to vertex shaders. And .DeleteProgram() simply frees all the contained pointers.
Upon .ProgramStringNotify() it initializes certain fields of the shader, and does additional translation for vertex programs & TNL.
Translating vertex shaders for Radeon200 GPUs involves:
- Verifying there is a program
- Verifying it only outputs supported fields (falling back to software rasterization otherwise)
- Inserts code to compute final position, if necessary
- Add a “state reference” for any “fogc” output.
- Stores some counts.
- Associate inputs with OpenGL builtins
- Applies any aliasing
- Verifies necessary outputs are generated
- Rewrites swizzle instructions for some reason, handling fog
- Lower certain opcodes.
- Rewrites scalar/vector arithmetic
- Add extra instructions for handling fog.
- Verifies enough registers can be trivially allocated.
- Repeat 9-13 for each operation.
Meanwhile handing the program to the Transform ‘N Lighting CPU-based subsystem does literally nothing.
Asking .IsProgramNative() will do this same translation if necessary before checking a flag.
.NewQueryObject() is a normal allocation & initialization routine, & .DeleteQuery() decrements its buffer object’s refcount before freeing its label and then itself.
To .BeginQuery() it calls .flush() on its Direct Memory Access before opening its buffer object and setting some flags/properties.
To .EndQuery(), Mesa3D’s Radeon driver flushes its Direct Memory Access, performs a space check, forwards to another method, and unsets the current query.
To .CheckQuery(), it calls .Flush() if the buffer object is referenced by the “CS”, then gets the query result if the buffer object isn’t busy, by temporarily memory-mapping it.
Or it’s the same as .WaitQuery() which (again) flushes in the presence of that reference before getting that query result.
It’s allocated/initialized as per normal, with a ClassID and memory-management methods attached (though the realloc happens on the GPU with a flush, and plenty of encoding data is stored CPU-side).
Mapping a framebuffer object involves copying over encoding data, opening/blitting/mapping the buffer object, and if necessary reencoding it.
Unmapping the framebuffer involves reversing any reencodings before unmapping the buffer object and reversing any blits.
Binding a framebuffer sets the “draw buffer”, by:
- calling a
- updating the public API layer framebuffer state & it’s bounding box,
- checking error states,
- updating the driver’s own state,
- enabling the stencil buffer and/or depth test,
- and/or setting the frontface and/or depth range.
Mesa3D’s Radeon driver can also set the framebuffer to a renderbuffer, which follows the same process after flushing any data through the GPU and ensuring the public API-layer state is updated.
Setting it to a texture will first check with the SoftWare Rasterizer to perform any necessary conversions, and looks up the appropriate scale in the texture’s miptree tables, before setting the framebuffer as per normal.
.FinishRenderTexture() is just a .Flush() operation.
There’s a validation routine which calls the
vtbl.is_format_renderable method for the encoding of each attachment to this renderbuffer.
Blitting the framebuffer is done using the normal triangle-rendering OpenGL APIs or using one of three CPU-based routines after checking whether the blit’s actually needed.
And there’s a method for window manager integration largely referring to the screen’s DRI2 image method
lookupEGLImage, but also flushing GPU state and copying encoding info.
These are attached to a dispatch table for the SoftWare RASTerizer (swrast).
To start the span renderer, it flushes any vertices to the GPU before memory mapping all textures, the framebuffer for the swrast, and maybe the read buffer.
Finishing the span renderer involves flushing the swrast (writing out these spans with reencodings) and unmapping all those textures.
TNL Callback Methods
I’ll continue my exploration of how Mesa3D’s Radeon200 driver works by exploring its callback functions for the Transform ‘N Lighting (TNL) subsystem, which are attached to a special dispatch table.
To .UpdateMaterial() it accesses the GPU state with R200_DB_STATE(), updates it to match the arguments, and saves it with R200_DB_STATECHANGE(), by directly accessing the GPU’s memory.
To .RunPipeline(), it starts by validating the OpenGL state and copying it over into GPU control memory. Failing to validate this state will trigger the SoftWare RASTerizer or TCL subsystem. Then it’ll hand control back over to the TNL subsystem.
Which’ll check for any input changes, which it handles by updating its own state as necessary and validating it. Then it runs each pipeline stage associated with the TNL context.
Continuing my exploration of how Mesa3D’s Radeon200 driver works, I just read through code that allocates memory for the GPU state, and writes a Direct Rendering Manager command to the GPU’s file descriptor to tell it to pay attention to this memory.
But after that it attaches more methods to the TNL “Renderer”.
Starting a render involves setting the vertex format, which in turn involves queueing up a bunch of GPU commands.
To render an individual primitive it sets a radeon.swtcl.render_primitive property and if needed converts it to a point, line, or triangle. Which it does by making sure its buffers and render state are valid, whilst uploading any necessary matrices, textures, lights, etc. Before it makes any appropriate adjustments to the GPU state, including most importantly enqueueing the new primitive and setting radeon.swtcl.hw_primitive to it.
Finishing a render is a noop, and resetting the line stipple queues up a bunch of GPU commands to change the GPU state just like is done upon starting a render.
For the rest of these methods it uses the CPU-based & pluggable TNL subsystem.
.BuildVertices() resolves offset pointers and scale/translate transforms before calling the clip-space’s .emit() method.
Whilst .CopyPv() & .Interp() are forwarded directly to the clipspace property.
To summarise my toots from last week on how Mesa3D’s Radeon200 driver works, it encodes data as expected by Radeon GPUs whilst using the shared “TNL”, “swrast”, and “meta” subsystems to implement things not natively supported by the GPU. That meta subsystem in particular implements 2D operations via the normal 3D APIs, and TNL is mostly there to chunk up work.
Communicating with the GPU is mainly done via shared memory in userspace, but the kernel is involved in setting this up and synchronization.
There’s also a DRI2 loader (with window manager-specific implementations) which (de)allocates memory for textures, shader programs, buffers, framebuffers etc in GPU memory.
Wayland’s implementation communicates with the window manager (ideally via a Mesa3D-specific extension that can be hotplugged into any window manager using the reference parser) and allocates memory via Mesa’s EGL implementation and in turn memory-mapping.
Direct Rendering Manager (DRM)
Upon device-specific initialization Linux’s Direct Rendering Manager subsystem exposes a common set of “IOCtls” defined in an array under drivers/gpu/drm/drm_ioctl.c.
drm_version IOCtl copies the driver’s major, minor, patchlevel, name, date, & desc fields over into a provided pointer. This involves a special memcpy (after checking string lengths) that bridges from kernel space to user space.
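A userspace sketch of that pattern (imitating copy_to_user with memcpy, and with an illustrative function name): copy the string only if the caller’s buffer is big enough, and report the required length either way so the caller can retry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into a caller-provided buffer after checking its length.
 * Returns the string length needed, whether or not the copy happened,
 * mirroring the two-call pattern DRM version queries use. */
static size_t copy_version_string(const char *src,
                                  char *dst, size_t dst_len)
{
    size_t need = strlen(src);
    if (dst && dst_len >= need + 1)
        memcpy(dst, src, need + 1);   /* fits: copy including the NUL */
    return need;                      /* caller can retry with this size */
}
```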
drm_getunique grabs a mutex on the device file to access the unique property of that file & its _len, copying it over into the userspace pointer if it has enough room.
The drm_getmagic IOCtl, used for authentication of operations that can’t be shared between processes, copies over the magic property of the device file, idr_alloc()ing it if necessary out of a “radix tree” (space-efficient trie) before falling back to incrementing a global.
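A toy model of that allocate-once behaviour (the IDR tree is omitted entirely; the struct and counter are illustrative): each file gets a magic token the first time it asks, and the same token thereafter.

```c
#include <assert.h>

/* Global fallback counter for handing out magic tokens. */
static unsigned next_magic = 1;

/* Minimal stand-in for a DRM file's private state. */
typedef struct { unsigned magic; } drm_file;

/* Allocate a magic for the file on first request; stable afterwards. */
static unsigned get_magic(drm_file *f)
{
    if (f->magic == 0)          /* not yet allocated */
        f->magic = next_magic++;
    return f->magic;
}
```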
The legacy drm_irq_by_busid IOCtl first checks some feature (bit)flags, whether the GPU’s even on the PCI bus, and whether its state is valid, before copying out the device’s IRQ number.
Looking at the next chunk of (capabilities-related) Linux Direct Rendering Manager (DRM) IOCtls: drm_getstats is deprecated, exposing only a shell of its former functionality.
drm_getcap branches over the input’s capability to copy the specified property(s) into its value, possibly based on some condition. Most of which are unsupported unless the driver provides the relevant feature.
drm_setclientcap does the exact opposite, copying the input value property into the appropriate driver property(s) specified by the input capability property, unless the DRIVER_MODESET bitflag is unset, in which case it throws an EOPNOTSUPP error.
drm_setversion grabs the device’s mutex to access the version properties. If necessary it compares three separate version numbers to check validity and initialize these properties.
Accelerated Graphics Port IOCtls
drm_agp_acquire atomically increments the use count on a global or device property (the accessor function can be hotswapped), upon verifying the device’s agp->acquired property, after which it ensures the device has been assigned a bridge.
drm_agp_release verifies that the device has been acquired, atomically decrements that use count, and unsets that acquired flag.
drm_agp_enable sets some properties & calls a method on the bridge’s driver.
drm_agp_info copies version, mode, aperture, memory, and device identifier infos from the device’s
agp->agp_info structure over into a passed-in pointer.
drm_agp_alloc allocates from the GPU’s RAM (check memsizes, call 3 methods, or CPU allocate) and tracks this in a linked list.
drm_agp_free looks the memory up in that linked list, unbinds it if necessary (which, if C macro enabled, calls the appropriate bridge method & removes it from a linked-list), removes it from that linked-list, and frees it both on the GPU (via a choice of 5 bridge methods or done CPU-side, if similarly C macro enabled) and the CPU (via kfree).
drm_agp_bind looks up the memory in the allocated linked-list, (if enabled) calls some bridge driver methods, and adds it to a bound linked-list.
drm_agp_unbind looks up the memory in the allocated linked-list, and (if similarly compile-time enabled) calls a bridge driver method, and removes it from the bound linked-list. All of which also happens upon freeing this memory.
But one question: why’s the Accelerated Graphics Port treated specially by the Direct Rendering Manager?
Primarily for output, this chunk of IOCtls has a syscall to wait for a VBLANK interval (required by old analog monitors, for which we’d been maintaining backwards compatibility until recently), one to configure the color encoding, and a deprecated/noop DRM_IOCTL_UPDATE_DRAW.
drm_wait_vblank checks if it has access to and has locked all the relevant hardware, computes a pipe_index using bitwise operations, and looks up the vblank interval in a device-specific lookup table. If that indicates this is a “query” it reads in a new sequence number to return alongside the current timestamp. Otherwise, within several locks, it atomically arranges for the driver method to be called only from a single process, before looking up the interval, checking if it has passed, and replying/waiting.
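That “has it passed?” check is interesting because vblank sequence counters wrap around; the usual kernel idiom (sketched here with illustrative naming) compares the signed difference rather than the raw values:

```c
#include <assert.h>
#include <stdint.h>

/* Has the requested vblank sequence already occurred? Comparing
 * the signed difference handles 32-bit counter wraparound, unlike
 * a plain current >= requested comparison. */
static int vblank_passed(uint32_t current, uint32_t requested)
{
    return (int32_t)(current - requested) >= 0;
}
```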
After its checks, drm_legacy_modeset_ctl is split into two cases: pre & post vblank.
The pre case just sets a flag on the appropriate vblank table, and another if its interval is 0.
The post case checks whether it’s in a modeset & if so resets the vblank (via a couple of driver methods) within an “IRQ” lock (no hardware interruptions), before “putting” it (by disabling the timer if it’s the last to deref a refcount) if the 0x2 bitflag is set.
drm_mode_getresources reads various properties from the device & its files, CRTCs, encoders, & connectors.
Graphical Execution Manager/GPU RAM allocation
drm_gem_close finds and revokes the process’s reference, in between which it lets the GPU driver dereference it itself. Which it does via its gem_close_object method, before clearing its Direct Memory Access handle and its access control, and (within a lock) dereferencing and possibly freeing it via an IDR (compressed trie) tree.
drm_gem_flink looks up the GEM object from the process’s IDR tree before allocating a new reference for it within the driver’s IDR tree whilst holding a lock.
drm_gem_open first checks if the GEM object has already been opened (whilst holding a lock), and if not allocates a new handle in its place using the gem_object_open driver method.
Then there are IOCtls (drm_prime_handle_to_fd, etc) which convert between GEM handles and file descriptors (useful for handing renderbuffers over to the window manager); they turn out to just call driver methods if they are provided, and otherwise error with ENOSYS.