Mesa3D (OpenGL implementation)

How does the CPU utilize the GPU for hardware-accelerated (3D) graphics? You use an implementation of OpenGL like Mesa3D!

Mesa3D implements the public OpenGL APIs mainly to “flush” any necessary data, check inputs, track state in “context” globals, and forward calls onto the driver class. It also lowers matrices and compresses textures, and I’ll dig into its GLSL compilation later.
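
To give a feel for that wrapping layer, here’s a minimal sketch (all names hypothetical, not Mesa’s actual symbols) of the pattern those public entry points tend to follow: fetch the current context, check inputs, flush queued work, record the new state, and forward to a driver hook.

    /* Sketch of the typical public-API wrapper pattern; names are illustrative. */
    struct example_context {
        float line_width;                               /* tracked "context" state */
        void (*flush_vertices)(struct example_context *ctx);
        void (*driver_line_width)(struct example_context *ctx, float w); /* driver hook */
    };

    void example_LineWidth(struct example_context *ctx, float width)
    {
        if (width <= 0.0f)                              /* 1. check inputs */
            return;                                     /*    (would record GL_INVALID_VALUE) */
        if (ctx->line_width == width)                   /* 2. skip redundant state changes */
            return;
        ctx->flush_vertices(ctx);                       /* 3. flush queued-up work */
        ctx->line_width = width;                        /* 4. track the new state */
        if (ctx->driver_line_width)
            ctx->driver_line_width(ctx, width);         /* 5. forward to the driver class */
    }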

Driver Classes

Today I’ll start digging into Mesa3D’s Radeon driver, because I’d get overwhelmed studying all the drivers, and I’ve got a Radeon GPU.

To start off with, there’s an array of fallback drivers: the first two seem to be doing the same sort of state tracking as the wrapping functions, the third just sets some configuration, and the last does the real work.

To initialize the Radeon GPU, it allocates some memory for the device, initializes it with lots of configuration info, and asks the file descriptor for the ID of the GPU (using Linux’s Direct Rendering Manager).
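
Schematically, that ID query is an ioctl on the DRM device’s file descriptor; a rough sketch using libdrm (the exact request and struct names live in the kernel’s radeon_drm.h, and the header path or request may differ on older setups):

    /* Rough sketch: ask the radeon DRM driver which chip it's driving. */
    #include <stdint.h>
    #include <xf86drm.h>
    #include <drm/radeon_drm.h>

    static int query_device_id(int drm_fd, uint32_t *device_id)
    {
        struct drm_radeon_info info = {0};
        uint32_t id = 0;
        info.request = RADEON_INFO_DEVICE_ID;            /* which property to read */
        info.value = (uint64_t)(uintptr_t)&id;           /* kernel writes the answer here */
        int ret = drmCommandWriteRead(drm_fd, DRM_RADEON_INFO, &info, sizeof(info));
        if (ret == 0)
            *device_id = id;
        return ret;
    }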

When told to create a buffer for a given __DRIdrawable, the Radeon driver initializes up to six renderbuffers.

I just see state tracking here, nothing happening on the GPU. Though some of these might rely on software emulation.

All the real renderbuffer action occurs when switching between different contexts. There, after some checks against dirty buffers, the public API state, and which buffers are available, it:

  1. Optionally calls .getBuffersWithFormat() or .getBuffers() on the screen’s DRI2 loader.
  2. Finds the buffer with the specified name and updates its state, possibly “opening” it to find its tiling information.
  3. Stores the new target & updates size info.

And beyond that, when the Radeon driver’s told to make a drawable current, it flushes any necessary data and possibly resets the public API’s state. From there it creates and updates framebuffers as needed before updating its own state.

Unbinding a context just resets the public API’s globals.

And finally that vtable contains methods for constructing a “context”, which will allocate and initialize that object with plenty of state, configuration, and function pointers. A C macro chooses between the two.

Transform ‘N Lighting work chunking

The first step is to compact the attribute arrays associated with the vertices into a TNL-specific array and the rest into a separate array, skipping any not enabled by a specific bitfield.

Following that it extracts the minimum and maximum vertex indices, before calling a driver method determined by ctx->Query.CondRenderMode to decide whether it should proceed.

Then if min_index is greater than zero it rebases the indices so that it becomes zero, or failing that updates .start on the vertices.
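
A minimal sketch of what that rebasing amounts to (illustrative only, not Mesa’s code): scan the index array for its minimum and maximum, then subtract the minimum from every index so the referenced vertices start at 0.

    /* Illustrative index-rebasing sketch. */
    #include <stddef.h>
    #include <stdint.h>

    static void rebase_indices(uint32_t *indices, size_t count,
                               uint32_t *min_out, uint32_t *max_out)
    {
        uint32_t min = UINT32_MAX, max = 0;
        for (size_t i = 0; i < count; i++) {            /* find min & max vertex index */
            if (indices[i] < min) min = indices[i];
            if (indices[i] > max) max = indices[i];
        }
        if (min != UINT32_MAX && min > 0)
            for (size_t i = 0; i < count; i++)          /* rebase so indices start at 0 */
                indices[i] -= min;
        *min_out = min;
        *max_out = max;
    }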


If there are too many vertices for the CPU or GPU to process in a single go, Mesa3D’s TNL pipeline (used by its Radeon200 driver) will split the vertices in half, either in place or into newly-allocated memory depending on the quantity.

The copying split takes the longest prefix of the triangles array with the same base vertex and stack-allocates a “context” to reference that slice, before restructuring that data into new GPU buffers whilst resolving the geometry mode (GL_POLYGON, etc.) with some copying.


The in-place handling meanwhile simply handles each vertex it sees in turn, until it has enough to send off to the GPU or there are few enough left that it can afford to do a copying split.

Once it has a chunk of vertices it recursively calls back into the main TNL drawing routine.


So far I have described how Mesa3D’s TNL subsystem checks whether the rendering should happen, rebases the indices to start from 0, and splits the vertices to cap how many are in each batch.

Once it has done that it splits the chunk by “base vertex”, associates each subchunk with the vertex buffer (using some GPU memory), and forwards it on to the driver’s ->RunPipeline() method.
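
To make that chunking concrete, here’s a rough illustration (hypothetical names and limit, greatly simplified; the real code also has to respect primitive boundaries and shared vertices, which is where the in-place versus copying split distinction comes from):

    /* Rough illustration of capping how many vertices go to the GPU per batch. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_VERTS_PER_BATCH 65535         /* hypothetical hardware/driver limit */

    typedef void (*run_pipeline_fn)(const uint32_t *indices, size_t count);

    static void draw_in_batches(const uint32_t *indices, size_t count,
                                run_pipeline_fn run_pipeline)
    {
        size_t start = 0;
        while (start < count) {
            size_t n = count - start;
            if (n > MAX_VERTS_PER_BATCH)
                n = MAX_VERTS_PER_BATCH;      /* cap the slice */
            run_pipeline(indices + start, n); /* hand the slice to the driver */
            start += n;
        }
    }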

Driver attributes

Continuing on with my overview of how Mesa3D’s Radeon200 GPU driver works, there’s a method for reading driver identifiers. This driver reports GL_VENDOR as “Mesa Project” & GL_RENDERER as whatever Linux’s Direct Rendering Infrastructure says, appended with whether the TCL fallback is enabled.

Flushing/Clearing

Then continuing that batch there’s the “IOCtl” methods Clear(), Finish(), & Flush(), which I’ll be focusing on this morning.

To .Flush() data through the GPU, it first forwards the method call to its DMA property before sending any of its queued-up instructions to the GPU. Then it tells its screen’s DRI2 loader to flush any of that data out of the GPU to its destination.

To .Finish() some rendering, it starts with a .Flush() before waiting on all the color buffers and then the depth buffer. Waiting involves scheduler-yielding spinlocks and/or Direct Rendering Manager commands written to the GPU’s file descriptor.
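
As a rough idea of that scheduler-yielding wait (purely illustrative; the real driver polls a GPU “age”/fence value kept in memory shared with the kernel, and may fall back to an ioctl-based wait):

    /* Illustrative yield-while-waiting loop, not the driver's actual code. */
    #include <sched.h>
    #include <stdint.h>

    /* Pretend the GPU bumps this counter in shared memory as work retires. */
    static void wait_for_age(volatile const uint32_t *gpu_age, uint32_t wanted)
    {
        while (*gpu_age < wanted)
            sched_yield();       /* give the CPU away instead of hard-spinning */
    }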

And to .Clear() the output so all pixels are the same color, it first flushes any existing activity. Then if needed (for buffers other than FRONT_LEFT, BACK_LEFT, DEPTH, STENCIL, or COLOR0) it tells the software rasterizer to clear its bits, before using the generic routine of temporarily switching the GPU’s state to render a couple of triangles atop everything.
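
The buffer-mask dispatch might look roughly like this (the mask names are placeholders standing in for Mesa’s BUFFER_BIT_* flags):

    /* Sketch of splitting a clear between swrast and the triangle-drawing "meta" path. */
    #include <stdint.h>

    #define BIT_FRONT_LEFT 0x01u
    #define BIT_BACK_LEFT  0x02u
    #define BIT_DEPTH      0x04u
    #define BIT_STENCIL    0x08u
    #define BIT_COLOR0     0x10u
    #define HW_CLEARABLE  (BIT_FRONT_LEFT | BIT_BACK_LEFT | BIT_DEPTH | BIT_STENCIL | BIT_COLOR0)

    static void clear_buffers(uint32_t mask,
                              void (*swrast_clear)(uint32_t),
                              void (*meta_clear)(uint32_t))
    {
        uint32_t sw = mask & ~HW_CLEARABLE;   /* anything unusual goes to software */
        uint32_t hw = mask &  HW_CLEARABLE;   /* the rest gets cleared by drawing triangles */
        if (sw) swrast_clear(sw);
        if (hw) meta_clear(hw);
    }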

The software-based clear meanwhile involves validation, traversing its datamodel, and setting every pixel to the specified value.

State Setting

Continuing on my study of how Mesa3D’s Radeon200 driver works, today I will describe its state-setting methods.


To .Update()/invalidate state it first invalidates its drawbuffer’s bounds, followed by its SoftWare RASTerizer & Texture ‘N Lighting engine, before setting some flags and possibly NULLing out the currently-running program.

Invalidating the drawbuffer calls back into the public API to set it to the scissor’s bounds.

Invalidating the SoftWare RASTerizer is split into two parts: the first just forwards to dynamic dispatch; the second sets some flags and invalidates the vertex state by enabling all new_inputs flags & resetting the interp & copy_pv vertex methods, to determine whether a shortcut can be taken.


Upon a lighting change it sets a RESCALE_NORMALS flag to _NeedEyeCoords xor Transform.RescaleNormals.

Upon setting the OpenGL draw buffer, Mesa3D’s Radeon driver first checks if it’s the front renderbuffer being set. If so it extracts encoding information and asks the DRI2 Loader to “get” those buffers. After which it opens each Buffer Object, gets its tiling information, and updates the framebuffer size.

In either case it follows up by checking whether the Radeon GPU can handle this draw call and whether it needs to update the framebuffer and its bounds, before setting the renderbuffer.

Getting a particular renderbuffer involves casting particular Attachments or ColorDrawBuffers and checking their class field. Failing this it calls fallback methods (primarily fallback) on the driver. After which it references the new renderbuffer, sets some flags based on available buffers, and computes its new scissor state, which it sets via another driver method.

To read this renderbuffer it uses a very similar process.


To Copy some Pixels, Mesa3D’s Radeon200 driver uses the generic process of falling back to SoftWare Rasterization if the target is too big for its “temp” texture singleton or it’s in an unsupported state. After which it temporarily resets OpenGL’s state to render two triangles across the target rectangle.

To Draw some Pixels Mesa3D’s generic process is to first check the encoding, or tile up the input, before using OpenGL to render that box with an optional depth or stencil buffer.

Tiling a DrawPixels call just involves iterating over each cell and recursively calling _mesa_meta_DrawPixels for each. I don’t want to get into the CPU-based fallback implementations because they look quite involved, but copying pixels does appear to involve memcpy()s.
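
The tiling itself is just a pair of nested loops; schematically (illustrative only, with a hypothetical draw_tile callback rather than _mesa_meta_DrawPixels’ real signature):

    /* Schematic tiled-draw loop; the callback and tile size are made up. */
    #define TILE 256   /* hypothetical maximum tile edge in pixels */

    typedef void (*draw_tile_fn)(int x, int y, int w, int h);

    static void draw_pixels_tiled(int x, int y, int width, int height,
                                  draw_tile_fn draw_tile)
    {
        for (int ty = 0; ty < height; ty += TILE)
            for (int tx = 0; tx < width; tx += TILE) {
                int w = (width  - tx < TILE) ? width  - tx : TILE;
                int h = (height - ty < TILE) ? height - ty : TILE;
                draw_tile(x + tx, y + ty, w, h);   /* one recursive draw per cell */
            }
    }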

To read some pixels, Mesa’s Radeon drivers first try do_blit_readpixels before falling back to _mesa_readpixels, after making sure the state is updated.


Those fallbacks mostly deal with encoding issues, but I’ll move on to the next methods now.

Most of these (including AlphaFunc, BlendColor, BlendEquationSeparate/BlendFuncSeparate, ClipPlane, ColorMask, CullFace, DepthFunc, DepthMask, Viewport/DepthRange, etc.) use bitwise operations to reencode the data for easier use with the hardware. Enable & Light[Model]fv do much the same, but might dispatch to these other methods.
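
As a flavour of that reencoding, a hedged sketch of translating one piece of GL state into hardware bits (the register field and bit values here are invented, not the real R200 definitions):

    /* Invented example of packing a GL comparison function into a command word. */
    #include <GL/gl.h>
    #include <stdint.h>

    #define HW_Z_FUNC_SHIFT 4                  /* hypothetical register field offset */

    static uint32_t hw_depth_func_bits(GLenum func)
    {
        uint32_t bits;
        switch (func) {
        case GL_NEVER:    bits = 0; break;
        case GL_LESS:     bits = 1; break;
        case GL_EQUAL:    bits = 2; break;
        case GL_LEQUAL:   bits = 3; break;
        case GL_GREATER:  bits = 4; break;
        case GL_NOTEQUAL: bits = 5; break;
        case GL_GEQUAL:   bits = 6; break;
        default:          bits = 7; break;     /* GL_ALWAYS */
        }
        return bits << HW_Z_FUNC_SHIFT;        /* ready to OR into the GPU state word */
    }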


Mesa3D’s Radeon200 .PolygonMode() method reconfigures methods for rendering Points, Lines, ClippedLines, Triangles, and Quads by looking them up in a table, having computed the index for that table by determining whether it’s two-sided and/or unfilled. If neither is the case it uses its own PrimTabs, PrimElts, & ClippedPolygon methods rather than using Texture ‘N Lighting’s.

It also reencodes similar information (plus _tnl_need_projected_coords()) as the “vertex state”.

Calling Mesa3D’s Radeon drivers’ .Scissor() method also involves flushing the GPU’s state and computing the bounding box, beyond simple reencoding.

And setting the viewport may also involve resizing the framebuffer (via the public API layer) and updating the “Scissor” state.

Textures

.NewTextureImage() just allocates memory for the datastructure; .DeleteTextureImage() deallocates it after calling a different method to free its buffer.

Allocating its buffer frees what was there before, has the SoftWare RASTerizer do its initialization, and assigns a miptree.

The software rasterizer saves some data about the buffer’s sizing before allocating memory for its “image slices”. And assigning a miptree involves calculating layout information, if the current miptree is for a different image size/encoding, before “opening” the buffer object.

And to .FreeTextureImageBuffer() it decrements the miptree’s reference count (which may in turn decrement the buffer object’s reference count), dereferences the buffer object, and frees the SoftWare Rasterizer’s texture.


To .MapTextureImage() into memory, it extracts various sizing and encoding information before memory mapping the buffer object. And .UnmapTextureImage() unmaps its buffers.

.ChooseTextureFormat() performs a switch statement over the current encoding to find the closest one supported natively by Radeon GPUs.
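
Schematically that’s a switch from the requested format to the nearest natively-supported one (the hardware format names below are placeholders, not Mesa’s MESA_FORMAT_* enums):

    /* Placeholder sketch of choosing the closest natively-supported texture format. */
    #include <GL/gl.h>

    typedef enum { HW_FMT_RGBA8888, HW_FMT_RGB565, HW_FMT_A8, HW_FMT_L8 } hw_format;

    static hw_format choose_texture_format(GLenum internal_format)
    {
        switch (internal_format) {
        case GL_ALPHA:     return HW_FMT_A8;
        case GL_LUMINANCE: return HW_FMT_L8;
        case GL_RGB:       return HW_FMT_RGB565;    /* closest supported RGB layout */
        case GL_RGBA:
        default:           return HW_FMT_RGBA8888;  /* safe fallback */
        }
    }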

To .CopyTexSubImage() it first checks some “stamps” to avoid stepping on a buffer “swap”, computes an offset based on the miptree’s encoding (looking it up in a table), and does a blit.

.CopyTexSubImage() falls back to, & .Bitmap() uses, the normal OpenGL triangle rendering APIs.

And there’s a method for the window manager to get access to the texture, which calls a method on the screen’s DRI2 Image, extracts its encoding/sizing information, and creates a miptree for it.

And finally upon context initialization it sets some encoding globals depending on the CPU’s endianness.

Radeon200-Specific

Once it has created a new texture object via .CreateNewTextureObject(), it sets a number of bitflags in a form that can be passed to Radeon GPUs.

And .DeleteTexture() involves .Flush()ing any vertices to the GPU, setting some flags on those vertices, and unreferencing the miptree before performing a normal texture object deletion.

.TexEnv() involves encoding ENV_COLOR, LOD_BIAS_EXT, or COORD_REPLACE_ARB values as expected by Radeon GPUs. LOD_BIAS_EXT also requires querying a parameter from Linux’s Direct Rendering Infrastructure to be more specific to this GPU.

.TexParameter() unsets the texture’s validated flag for unsupported properties.

.TexGen() sets a recheck_texgen flag for the current “unit”.

& .NewSamplerObject() allocates/inits it via the public API layer before setting a MaxAnisotropy.

Shaders

.NewProgram() is a simple allocate/initialize routine mostly relying on the generic program subsystem, allocating a bit of extra memory to vertex shaders. And .DeleteProgram() simply frees all the contained pointers.

Upon .ProgramStringNotify() it initializes certain fields of the shader, and does additional translation for vertex programs & TNL.

Translating vertex shaders for Radeon200 GPUs involves:

  1. Verifying there is a program.
  2. Verifying it only outputs supported fields (falling back to software rasterization otherwise).
  3. Inserting code to compute the final position, if necessary.
  4. Adding a “state reference” for any “fogc” output.
  5. Storing some counts.
  6. Associating inputs with OpenGL builtins.
  7. Applying any aliasing.
  8. Verifying necessary outputs are generated.
  9. Rewriting “swizzle” instructions for some reason, and handling fog.
  10. Lowering certain opcodes.
  11. Rewriting scalar/vector arithmetic.
  12. Adding extra instructions for handling fog.
  13. Verifying enough registers can be trivially allocated.
  14. Repeating 9–13 for each operation.

Meanwhile handing the program to the Texture ‘N Lighting CPU-based subsystem does literally nothing.


Asking .IsProgramNative() will do this same translation if necessary before checking a flag.

Query Objects

.NewQueryObject() is a normal allocation & initialization routine, & .DeleteQuery() decrements its buffer object’s refcount before freeing its label and then itself.

To .BeginQuery() it calls .flush() on its Direct Memory Access before opening its buffer object and setting some flags/properties.

To .EndQuery(), Mesa3D’s Radeon driver flushes its Direct Memory Access, performs a space check, forwards to another method, and unsets the query.current property.

To .CheckQuery(), it calls .Flush() if the buffer object is referenced by the “CS”, then gets the query result, if the buffer object isn’t busy, by temporarily memory-mapping it.

Otherwise it’s the same as .WaitQuery(), which (again) flushes in the presence of that reference before getting that query result.

FrameBuffer Objects

It’s alloced/initialized as per normal, with a ClassID and memory management methods attached (though the realloc happens on the GPU with a flush, and plenty of encoding data stored CPU-side).

Mapping a framebuffer object involves copying over encoding data, opening/blitting/mapping the buffer object, and if necessary reencoding it.

Unmapping the framebuffer involves reversing any reencodings before unmapping the buffer object and reversing any blits.

Binding a framebuffer sets the “draw buffer”.

Mesa3D’s Radeon driver can also set the framebuffer to a renderbuffer, which follows the same process after flushing any data through the GPU and ensuring the public API-layer state is updated.

Setting it to a texture will first check with the SoftWare Rasterizer to perform any necessary conversions, and looks up the appropriate scale in the texture’s miptree tables, before setting the framebuffer as per normal.

.FinishRenderTexture() is just a .Flush() operation.

There’s a validation routine which calls the vtbl.is_format_renderable method for the encoding of each attachment to this renderbuffer.

Blitting the framebuffer is done using the normal triangle-rendering OpenGL APIs or using one of three CPU-based routines after checking whether the blit’s actually needed.

And there’s a method for window manager integration largely referring to the screen’s DRI2 image method lookupEGLImage, but also flushing GPU state and copying encoding info.

Span Renderer

These are attached to a dispatch table for the SoftWare RASTerizer (swrast).

To start the span renderer, it flushes any vertices to the GPU before memory mapping all textures, the framebuffer for the swrast, and maybe the read buffer.

Finishing the span renderer involves flushing the swrast (writing out these spans with reencodings) and unmapping all those textures.

TNL Callback Methods

I’ll continue my exploration of how Mesa3D’s Radeon200 driver works by examining its callback functions for the Texture ‘N Lighting (TNL) subsystem, which are attached to a special dispatch table.

To .UpdateMaterial() it accesses the GPU state with R200_DB_STATE(), updates it to match the arguments, and saves it with R200_DB_STATECHANGE(), by directly accessing the GPU’s memory.

To .RunPipeline(), it starts by validating the OpenGL state and copying it over into GPU control memory. Failing to validate this state will trigger the SoftWare RASTerizer or TCL subsystem. Then it’ll hand control back over to the TNL subsystem.

That’ll check for any input changes, which it handles by updating its own state as necessary and validating it. Then it runs each pipeline stage associated with the TNL context.

TNL Renderer

Continuing my exploration of how Mesa3D’s Radeon200 driver works, I just read through code that allocates memory for the GPU state, and writes a Direct Rendering Manager command to the GPU’s file descriptor to tell it to pay attention to this memory.

But after that it attaches more methods to the TNL “Renderer”.


Starting a render involves setting the vertex format, which in turn involves queueing up a bunch of GPU commands.

To render an individual primitive it sets a radeon.swtcl.render_primitive property and if needed converts it to a point, line, or triangle. It does this by making sure its buffers and render state are valid, whilst uploading any necessary matrices, textures, lights, etc., before it makes any appropriate adjustments to the GPU state, most importantly enqueueing the new primitive and setting radeon.swtcl.hw_primitive to it.

Finishing a render is a noop, and resetting the line stipple queues up a bunch of GPU commands to change the GPU state just like is done upon starting a render.


For the rest of these methods it uses the CPU-based & pluggable TNL subsystem.

.BuildVertices() resolves offset pointers and scale/translate transforms before calling the clip-space’s .emit() method.

.CopyPv() & .Interp() meanwhile are forwarded directly to the clipspace property.

Summary

To summarise my toots from last week on how Mesa3D’s Radeon200 driver works, it encodes data as expected by Radeon GPUs whilst using the shared “TNL”, “swrast”, and “meta” subsystems to implement things not natively supported by the GPU. That meta subsystem in particular implements 2D operations via the normal 3D APIs, and TNL is mostly there to chunk up work.

Communicating with the GPU is mainly done via shared memory in userspace, but the kernel is involved in setting this up and synchronization.

DRI2 Loader

There’s also a DRI2 loader (with window manager-specific implementations) which (de)allocates memory for textures, shader programs, buffers, framebuffers etc in GPU memory.

Wayland’s implementation communicates with the window manager (ideally via a Mesa3D-specific extension that can be hotplugged into any window manager using the reference parser) and allocates memory via Mesa’s EGL implementation and in turn memory-mapping.

Direct Rendering Manager

Linux provides several ioctl calls for interacting with GPUs and graphics cards. These are implemented generically for all GPUs to get/set generic properties (including locks and authentication), allocate/free GPU memory (with a fallback buddy memory allocator), call methods on the relevant driver, and interact with synchronization primitives. Any objects this manipulates are looked up in tries for security.
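
For a concrete taste of that userspace/kernel boundary, here’s a small standalone example (error handling trimmed) that opens a DRM device node and uses the generic layer to ask which driver sits behind it:

    /* Small example of talking to the DRM layer from userspace via libdrm. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <xf86drm.h>

    int main(void)
    {
        int fd = open("/dev/dri/card0", O_RDWR);    /* the GPU's device node */
        if (fd < 0) { perror("open"); return 1; }

        drmVersionPtr v = drmGetVersion(fd);        /* generic "which driver?" query */
        if (v) {
            printf("driver: %s (%d.%d.%d)\n", v->name,
                   v->version_major, v->version_minor, v->version_patchlevel);
            drmFreeVersion(v);
        }
        close(fd);
        return 0;
    }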

For Radeon GPUs those driver methods set/get additional properties & call down to another layer of methods for specific hardware versions to communicate over I/O registers, ringbuffers, and other shared memories. Most of the effort here goes into starting up & tearing down the driver, though the logic for figuring out what to turn on, off, and by how much is significant too.

The Radeon-specific IOCtl called by Mesa3D to configure the memory shared between it and the Radeon GPU driver simply sets various properties, presumably to be copied over to the device’s ringbuffer upon CPU interrupt.

GLSL Compiler

Code to run on the GPU is written in a C-like language called “GLSL”. This section describes how that compiler works.

glCreateShader

After validation/debugging this allocates a new shader object storing the provided stage & name alongside a refcount of 1, and uses a locked hashmap to allocate an ID for the newly-created shader and store it.

glShaderSource

This, after looking up/validating its input, concatenates its input strings via an intermediary offsets array, before possibly looking the result up in a cache & setting the new property with some memory management and possibly a checksum computation for debugging.

glCompileShader

glCompileShader, after looking up/validating/debugging its inputs, initializes its standard library tables & calls _mesa_glsl_compile_shader before doing additional logging/debugging depending on the set _Shader->Flags & CompileStatus. Today I’ll describe that initialization, leaving the actual compilation to tomorrow.
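
For reference, the application-side sequence that drives all of the above looks like this (standard OpenGL 2.0+ usage, assuming your GL headers/loader expose these functions; the GLSL string is just a trivial example):

    /* Typical application-side shader compilation, driving the paths described above. */
    #include <GL/gl.h>
    #include <stdio.h>

    static const char *vs_src =
        "#version 120\n"
        "attribute vec4 position;\n"
        "void main() { gl_Position = position; }\n";

    GLuint compile_vertex_shader(void)
    {
        GLuint shader = glCreateShader(GL_VERTEX_SHADER);   /* allocate the ID & object */
        glShaderSource(shader, 1, &vs_src, NULL);           /* store the source text */
        glCompileShader(shader);                            /* run the GLSL compiler */

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);      /* query CompileStatus */
        if (!ok) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof log, NULL, log);
            fprintf(stderr, "compile failed: %s\n", log);
        }
        return shader;
    }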

Builtins Table

After checking if it’s already been initialized (via a mutex, refcount, & null check) it atomically references the “type singleton”.

Then it memory allocates a mem_ctx, and an arbitrary shader to hold a newly-allocated glsl_symbol_table (which is actually a C++ class, even with a different naming convention). The glsl_symbol_table in turn allocates a stack of hashmaps, its own mem_ctx, & its own allocation area(s).

From there it uses a variadic (arbitrary number of arguments) utility method to allocate, validate, & register (in that shader/symbol table) each GLSL builtin function & it’s arguments.

Those types representing functions & their argument types are the same Abstract Syntax Tree types that GLSL normally parses to. The validation only runs if the DEBUG macro is set at compiletime & consists of a visitor checking various type rules. And the type signatures for these functions are allocated by the caller, often generated via a builtin-specific helper function.

The difference from user-defined functions is that these are each identified by a compiler-known ID number.

These functions start with “intrinsics” mostly referring to atomic pointers, though those are only compiled if a flag indicates the GPU supports them. But those also include image (registered via its own helper method), memory barrier, etc. functions.

Then it, with some help from various macros, finishes by registering the commonly used builtins.

Parsing

It starts by checking whether the shader includes a #include, indicating that it can’t use the on-disk shader cache. Then it allocs/inits a _mesa_glsl_parse_state C++ object, optionally claims a lock for allocating names, and evaluates a C-like preprocessor.

That GLSL preprocessor starts by alloc-/init-ing its GL C PreProcessor parser, removing escaped newlines, and running the lexer (as implemented in Lex) to convert those preprocessor directives into “tokens”, over which it runs the preprocessor parser (as implemented in Bison), which evaluates them, outputting a new string and referring to a stack for #if, etc. directives.

After which it validates the if-stack has been closed and reallocs memory.


After the preprocessor is another check for whether we can use the on-disk cache, then if that hasn’t failed it calls mesa_glsl_parse around which it allocs & frees a lexer.

These are also written in Lex/Bison & the lexer includes numeric fields in the tokens it yields.

Then if it’s a compute shader, it checks whether the GPU supports those, and if requested it’ll print the Abstract Syntax Tree it has parsed.

Lowering

Then it allocates a new exec_list to output to whilst freeing the old, & calls _mesa_ast_to_hir for the main conversion.

_mesa_ast_to_hir starts by initializing a table of all built-in variables (which partially depends on the type of shader it’s compiling) amongst other properties.

Then a new scope is pushed onto the stack for the shader’s variables, separate from the built-ins, and a method is called for each root C++ object in the Abstract Syntax Tree to have it straightforwardly compile itself.

After that it verifies functions do not share the same name, that they do not recurse (by using a visitor to build a call graph & applying the Partial Ordering algorithm to detect cycles), & that builtin assignment rules are followed.

Then it rearranges all variables to be at the front of the list in source order (previously they were in inverse order), sets a flag if gl_FragCoord is used, does something relating to removing unused variables (in part by using another visitor), and then finally checks we don’t read write-only variables via another visitor.


After that lowering & those checks it does additional validation (via validate_ir_tree) & optionally prints this “IR” for debugging.

That validation is the same routine used to validate built-in functions, as these validations should be unimpacted by GLSL’s syntax.

Then it frees the old InfoLog, validates & copies various properties from the (temporary) compiler state to the shader depending on the stage for which it’s being compiled.

After which it allocates a new symbol_table, assigns missing IDs to subroutines, uses a visitor to insert missing instructions, applies optimizations, frees the state, & caches it.

Optimization

Those optimizations involve, once or repeatedly:

  1. Replacing various patterns with smaller/faster alternatives.
  2. Inlining any function with a single return.
  3. Removing unused functions & function signatures.
  4. Splitting structures into multiple variables where it can.
  5. Propagating invariance flags through dereferences.
  6. Simplifying ifs with a constant condition and/or empty branch(es).
  7. Merging simple nested ifs. (Understandable for GPUs not to like these.)
  8. Moving conditions from if (cond) discard; into the discard instruction.
  9. Reducing the amount of copying.
  10. (Optional) Rewriting (matrix * vector) to (vector * matrix transposed), which is faster on some GPUs.
  11. (Optional) Combining similar assignments into a single vector instruction.
  12. Removing dead assignments.
  13. Removing “local” dead code.
  14. Optimizing away single-use variables.
  15. Inlining constant expressions (in 3 separate passes).
  16. Optimizing based on the numeric properties of expressions.
  17. Balancing the Abstract Syntax Tree where valid.
  18. Evaluating (partially) constant expressions & simplifying numeric operations.
  19. Simplifying implicit gotos to explicit gotos.
  20. Simplifying constant-indexing into a vector to a “swizzle” operation.
  21. Rewriting “vector inserts” into swizzle assignments.
  22. Merging and possibly removing swizzle operations.
  23. Lowering arrays that are always indexed by constants into separate variables.
  24. Simplifying gotos to gotos.

Most of these involve a visitor.

Additional more expensive optimizations may then optionally commence. This involves finding and unrolling valid loops (with simple & complex cases), after which additional constant, if, & jump lowerings may be beneficial.

After these main & optionally repeated optimizations are performed, it:

  1. Removes dead built-in variables.
  2. In debug builds, validates we haven’t messed up the AST.
  3. Compacts the memory usage.
  4. Rebuilds the symbol table.

Conclusion

I’ve still got questions including how those optimizations make it into the compiled code, because it looked like the lowering to bytecode happened before they ran via dynamic dispatch.

But once it’s been lowered to bytecode, the Radeon GPU drivers (for example) lower it to a subset of that standard supported by that hardware both in userspace and kernelspace.

glCreateProgram & glCreateShader

glCreateProgram & glCreateShader alloc/init a new shader program or shader object respectively, and use a locked hashmap to allocate an ID and store it.

glGetShaderiv & glGetProgramiv

glGetShaderiv & glGetProgramiv look up a specified property (TYPE, DELETE/COMPILE STATUS, LOG/SOURCE LENGTH, or SPIR V BINARY ARB) on the specified shader or program.

glGetShaderInfoLog & glGetProgramInfoLog

During linking or compilation, error messages are written to an in-memory logging string stored in the Shader or Program, which you can access via these functions.

Compilation

The actual compilation to bytecode is done via the NewProgram method after linking.

Radeon GPUs for example support a subset of the standard ARB bytecode, so it calls the program subsystem to do the initial translation (mostly consisting of a visitor subclass allocating registers via a bump pointer) & optimization before lowering the unsupported opcodes in both the userspace & kernelspace Radeon drivers.

glAttachShader

glAttachShader (after looking up its global context, specified shader, & specified program, & verifying that we’re not introducing a duplicate) resizes the program’s Shaders array to hold this new shader & refcounts it.

glLinkProgram

glLinkProgram (after looking up its global context & specified program, & erroring on duplicate names) starts by building a bitmask of all the shader types provided and ensuring the GLSL functions table is initialized (as was described for the parser).

Then it calls the driver method(s) to flush the vertices to the GPU before calling _mesa_glsl_link_shader to do the real work.

This starts by reinitializing the program’s data & validating all shaders are ready to be linked. The linking is done differently depending on whether the shaders want to use SPIR-V encoding, after which errors are propagated, a driver method called, debugging data might be printed, & the results possibly cached.
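
Again for reference, the application-side calls that trigger this linking path (standard OpenGL 2.0+ usage):

    /* Typical application-side program linking. */
    #include <GL/gl.h>
    #include <stdio.h>

    GLuint link_program(GLuint vertex_shader, GLuint fragment_shader)
    {
        GLuint prog = glCreateProgram();         /* allocate the program object */
        glAttachShader(prog, vertex_shader);     /* grows the program's Shaders array */
        glAttachShader(prog, fragment_shader);
        glLinkProgram(prog);                     /* kicks off the linking described here */

        GLint ok = GL_FALSE;
        glGetProgramiv(prog, GL_LINK_STATUS, &ok);
        if (!ok) {
            char log[1024];
            glGetProgramInfoLog(prog, sizeof log, NULL, log);
            fprintf(stderr, "link failed: %s\n", log);
        }
        return prog;
    }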

non-SPIR V

If it’s not using SPIR-V encoding, (after initialization & extensive validation) it combines shaders marked to run at the same stage, validating that global variables, & separately their types, are shared correctly. Then it validates that each function is only implemented in a single shader, before finding the main function & duplicating it into a new program.

From there it validates & extracts information about the globals for the appropriate shader stage, optionally followed by xfb_stride qualifiers.

Then it handles layout qualifiers, before duplicating the symbol table & getting the main() function’s definition so it can inline vars where valid (once for the new main function, and once for each shader), inline/link function calls where valid (using a visitor C++ subclass), & link output variables for non-fragment non-main() shaders (adding them to the symbol table).

After which it runs a visitor to infer array sizes, links “uniform blocks” (I don’t understand this bit), & validates the result.

Depending on the hardware, additional lowerings for certain variables may be performed during that non-SPIR V linker.

SPIR V

The alternative SPIR V linker calls a driver method to allocate the program & performs simpler checks.

Assigning to uniforms

To pass parameters into your shader programs you use the glGetUniformLocation & glUniform* family of functions. Alongside of course glBindBuffer, etc., for specifying the arrays to be concurrently processed into an image.

glGetUniformLocation, after its lookups and validations, first iterates over all uniforms in the program to find the specified “uniform” variable despite any optimizations, and finishes by looking up the appropriate location for its type.

Then to store a “uniform” of a given type, the provided array or number will be copied into an array property of the specified GLSL program, after (as per normal) extensive lookups and validations. Vectors, matrices, arrays, & structures may of course take up multiple slots in that array, & matrices may optionally be transposed.
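
And the application-side view of those uniform calls, to close things out (standard OpenGL usage; the uniform names are just examples):

    /* Typical application-side uniform upload. */
    #include <GL/gl.h>

    static void upload_uniforms(GLuint prog, const float mvp[16], const float tint[4])
    {
        glUseProgram(prog);

        /* Look the variables up by name; -1 means not found (or optimized away). */
        GLint mvp_loc  = glGetUniformLocation(prog, "u_mvp");
        GLint tint_loc = glGetUniformLocation(prog, "u_tint");

        if (mvp_loc != -1)      /* a mat4 occupies several slots in the program's array */
            glUniformMatrix4fv(mvp_loc, 1, GL_FALSE /* don't transpose */, mvp);
        if (tint_loc != -1)
            glUniform4fv(tint_loc, 1, tint);        /* copied into the program's uniform storage */
    }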