The Direct Rendering Manager (DRM) is the Linux kernel’s subsystem for exposing graphics card (display output + GPU + dedicated RAM) hardware, and multiplexing this hardware between however many programs wish to access it.
Its acronym is not to be confused with Digital Rights Management, or whatever backronym you find more fitting for it.
Upon device-specific initialization Linux’s Direct Rendering Manager subsystem exposes a common set of “IOCTLs” defined in an array in drivers/gpu/drm/drm_ioctl.c.
drm_version IOCtl copies the driver’s major, minor, patchlevel, name, date, & desc fields over into a provided pointer. This involves a special memcpy (after checking string lengths) that bridges from kernel space to user space.
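To make that length-checked copy concrete, here is a minimal userspace sketch of the idea. The names copy_out_sketch and copy_field_sketch are my own, and a plain memcpy stands in for the kernel’s user-space copy routine, so treat it as an illustration rather than the kernel’s actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's copy-to-userspace routine: a plain
 * memcpy that reports -EFAULT for a NULL destination. */
static int copy_out_sketch(char *dst, const char *src, size_t len)
{
    if (dst == NULL)
        return -EFAULT;
    memcpy(dst, src, len);
    return 0;
}

/* Copy `value` into a caller-supplied buffer of *buf_len bytes.
 * On return *buf_len holds the full length of `value`, so the caller
 * can detect truncation and retry with a larger buffer.
 * No NUL terminator is written. */
static int copy_field_sketch(char *buf, size_t *buf_len, const char *value)
{
    size_t full, len;

    assert(buf_len != NULL);
    if (value == NULL) {            /* nothing to report */
        *buf_len = 0;
        return 0;
    }
    full = strlen(value);
    len = full < *buf_len ? full : *buf_len;   /* never overflow the buffer */
    *buf_len = full;
    if (len == 0)
        return 0;
    return copy_out_sketch(buf, value, len);
}
```

A caller would typically call this once with a zero-length buffer to learn the sizes, allocate, and call it again.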
drm_getunique grabs a mutex on the device file to access the unique property of that file & its _len, copying it over into the userspace pointer if it has enough room.
drm_getmagic IOCtl, used for authentication of operations that can’t be shared between processes, copies over the
magic property of the
idr_allocing it if necessary, allocated out of a “radix tree” (space efficient trie) before falling back to incrementing a global.
drm_irq_by_busid legacy IOCtl first checks some feature (bit)flags, whether the GPU’s even on the PCI bus, and whether its state is valid before copying the device’s
Looking at the next chunk of (capabilities-related) IOCtls in Linux’s Direct Rendering Manager (DRM),
drm_getstats are deprecated exposing only a shell of their former functionality.
drm_getcap branches over the input’s capabilities to copy the specified property(s) into its value, possibly based on some condition. Most of which are unsupported in the absence of a
drm_setclientcap does the exact opposite, copying the input value property into the appropriate driver property(s) specified by the input capability property. Unless, that is, the DRIVER_MODESET bitflag is unset, in which case it fails with an EOPNOTSUPP error.
drm_setversion grabs the device’s mutex to access the version properties. If necessary it compares three separate version numbers to check validity and initialize these properties.
Accelerated Graphics Port IOCtls
drm_agp_acquire atomically increments the usecount on a global or device property (the accessor function can be hotswapped), upon verifying the device’s agp->acquired property, after which it ensures the device has been assigned a
drm_agp_release verifies that the device has been acquired, atomically decrements that use count, and unsets that acquired flag.
drm_agp_enable sets some properties & calls a method on the bridge’s driver.
drm_agp_info copies version, mode, aperture, memory, and device identifier infos from the device’s
agp->agp_info structure over into a passed-in pointer.
drm_agp_alloc allocates from the GPU’s RAM (check memsizes, call 3 methods, or CPU allocate) and tracks this in a linked list.
drm_agp_free looks the memory up in that linked list, unbinds it if necessary (which, if C macro enabled, calls the appropriate bridge method & removes it from a linked-list), removes it from that linked-list, and frees it both on the GPU (via a choice of 5 bridge methods or done CPU-side, if similarly C macro enabled) and the CPU (via kfree).
drm_agp_bind looks up the memory in the allocated linked-list, (if enabled) calls some bridge driver methods, and adds it to a bound linked-list.
drm_agp_unbind looks up the memory in the allocated linked-list, and (if similarly compile-time enabled) calls a bridge driver method, and removes it from the bound linked-list. All of which also happens upon freeing this memory.
But one question: why’s the Accelerated Graphics Port treated specially by the Direct Rendering Manager?
Primarily for output (in this chunk of IOCtls) there’s an IOCtl to wait for a VBLANK interval (required by old analog monitors, with which we’ve been maintaining backwards compatibility until recently), IOCtls to configure the color encoding, and a deprecated/noop
drm_wait_vblank checks if it has access to and locked all the relevant hardware, computes a
pipe_index using bitwise operations, and looks up the vblank interval in a device-specific lookup table. If that indicates this is a “query” it reads in a new sequence number to return alongside the current timestamp. Otherwise, within several locks, it atomically arranges for a driver method to be called from only a single process, before looking up the interval, checking if it has passed, and replying or waiting.
After its checks, drm_legacy_modeset_ctl is split into two parts: pre & post vblank.
The pre syscall just sets a flag on the appropriate vblank table, and another if its interval is 0.
The post syscall checks whether it’s in a modeset & if so resets the vblank (via a couple of driver methods) within an “IRQ” lock (no hardware interruptions) before “putting” it (by disabling the timer if it’s the last to deref a refcount) if the 0x2 bitflag is set.
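The “get”/“put” pattern on a vblank can be sketched as a reference count where only the first taker switches the interrupt on and only the last one switches it off. This is a simplified model with made-up names. As I read it the kernel defers the disabling via a timer, which this sketch skips by turning it off immediately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct vblank_sketch {
    atomic_int refcount;
    bool irq_enabled;   /* whether the (simulated) vblank interrupt is on */
};

static void vblank_init_sketch(struct vblank_sketch *vb)
{
    atomic_init(&vb->refcount, 0);
    vb->irq_enabled = false;
}

/* Take a reference; the first holder enables the interrupt. */
static void vblank_get_sketch(struct vblank_sketch *vb)
{
    if (atomic_fetch_add(&vb->refcount, 1) == 0)
        vb->irq_enabled = true;
}

/* Drop a reference; the last holder disables the interrupt. */
static void vblank_put_sketch(struct vblank_sketch *vb)
{
    int old = atomic_fetch_sub(&vb->refcount, 1);

    assert(old > 0);    /* a put without a matching get is a bug */
    if (old == 1)
        vb->irq_enabled = false;
}
```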
drm_mode_getresources reads various properties from the device & its files, CRTCs, encoders, & connectors.
Graphics Execution Manager/GPU RAM allocation
drm_gem_close finds and revokes the process’s reference, in between which it lets the GPU driver dereference it itself. Which it does via its gem_close_object methods before clearing its Direct Memory Access handle, its access control, and (within a lock) dereferencing and possibly freeing it via an IDR (compressed trie) tree.
drm_gem_flink looks up the GEM object from the process’s IDR tree before allocating a new reference for it within the driver’s IDR tree whilst holding a lock.
drm_gem_open first checks if the GEM object has already been opened (whilst holding a lock), and if not allocates a new handle in its place using the gem_object_open driver methods.
Then there’s IOCtls (
drm_prime_handle_to_fd) which convert between GEM handles and file descriptors (useful for handing renderbuffers over to the window manager), and they turn out to just call driver methods if they are provided. Otherwise they error with ENOSYS.
drm_mode_getplane copies information from each of the device’s “planes” for which it holds a “lease” to the userspace.
drm_mode_getcrtc looks up the CRTC configuration from an IDR tree and copies data out of it to userspace whilst holding a lock.
drm_mode_setcrtc does the reverse with plenty more checks (+ 2 driver methods), allocating memory where needed.
drm_mode_setplane works pretty much the same as set/get crtc, except on the set side it mainly calls a choice of driver methods.
drm_mode_cursor advances an X/Y “cursor” or initializes it using a couple of driver methods.
drm_mode_gamma_get/set are relatively simple property accessors with fewer checks, and also call the gamma_set driver method.
drm_mode_getencoder & getconnector also copy various properties to userspace, the latter calling .fill_modes().
DRM_IOCTL_MODE_DETACHMODE are now noops, successfully doing nothing.
drm_mode_getproperty looks up the property in an IDR tree and copies it to userspace according to its type.
drm_property_lookup_blob does similarly in another map, but doesn’t need to worry about types.
drm_connector_property_set is translated into a
drm_mode_getfb looks up a framebuffer, copies its data to userspace, & calls .create_handle().
drm_mode_addfb2 creates a framebuffer using a driver method and adds it to a list whilst holding a mutex, and there’s an older version which converts its input to v2.
drm_mode_rmfb looks up the specified framebuffer and removes it from the device file’s framebuffers list whilst holding a lock. After which it dereferences it and possibly queues up a free routine.
drm_mode_page_flip finds, validates, and locks the necessary data to call the
page_flip[_target] driver method.
drm_mode_dirtyfb looks up the specified framebuffer, copies input into kernel space, and calls its .dirty() driver method.
drm_mode_create_dumb validates its input and calls the .dumb_create() driver method.
drm_mode_mmap_dumb hands off directly to
.dumb_map_offset(), or falls back to the GEM infrastructure which ensures the memory’s allocated in the GPU’s RAM & creates a new GEM object referencing it.
drm_mode_destroy_dumb works very much like drm_mode_mmap_dumb, with its fallback ultimately calling .gem_close_object() before actually freeing it.
drm_mode_obj_get_properties is split into two parts, first it looks up the appropriate object (in an IDR tree) then it looks up each of the specified properties on that object, which is done via different else-if chains for different object types.
drm_mode_obj_set_property does the inverse to set the specified properties on the specified object. This time it looks up a property ID and might call a
.commit() driver method. And setting the individual properties might or might not be more involved as well.
drm_mode_atomic is a related IOCtl which allocates and initializes some state for atomically updating the state.
drm_mode_createblob allocates and initializes a BLOB structure, and adds to the relevant collections.
drm_mode_destroyblob looks up the specified BLOB, removes it from those datastructures, and decrements the refcounts.
drm_syncobj_create checks bitflags to determine whether it’s currently supported before allocating/initializing it and returning an IDR-tree reference to it. These syncobjects are reference counted and wrap more general “memory fences” useful for any driver.
drm_syncobj_destroy removes it from the IDR tree and decrements the refcount.
drm_syncobj_handle_to_fd converts a syncobj it has looked up in the relevant IDR tree into a “sync file” descriptor, wrapping the memfence with normal file syscalls and freeing the syncobject upon close.
drm_syncobj_fd_to_handle does the exact inverse.
drm_syncobj_transfer reads like 2 IOCtls in 1 determined by the dst_point argument. If that’s set it finds the syncobj, waits on its memfence and allocates a new point for its timeline. Or it transfers the memfence to another syncobj.
drm_syncobj_wait finds all the specified syncobjs, checks if any have been triggered already, and if not waits for one of those memfences to be triggered. Or for a timeout to fire, derived from the “timeline” if present - otherwise it’s as requested by userspace. There’s also a “timeline” variant of this, but I’m not seeing how that differs.
drm_syncobj_reset replaces all the memfences of the specified syncobjs with new ones. &
drm_syncobj_signal replaces them with a stub memfence.
drm_syncobj_timeline_signal finds all the specified syncobjs, copies all specified points into kernelspace, allocates a new “chains” array, adds those points/memfences to syncobjs and new chains.
drm_syncobj_query reads the current state of each specified memfence and copies it into userspace.
drm_crtc_get_sequence looks up the specified crtc and its vblanking flags so it can copy that information (partially under a lock) into userspace.
drm_crtc_queue_sequence allocs/inits a new event before sending it (by calling its completion_release method, signalling its memfence, and enqueueing it onto its device file’s event list) or enqueuing it.
As for “leases”, which are a kind of lock a central daemon can claim over some GPU operations, there’s four IOCtls that can be used to manage them:
drm_mode_create_lease performs various security checks, looks up the specified objects, allocs/inits the lease/DRM “master”, adds it to a (new) IDR tree, & duplicates the file descriptor to authenticate the caller.
drm_mode_list_lessees iterates over the lease holders under a lock & copies their IDs to userspace.
drm_mode_get_lease copies the object IDs of all listed leasable objects into userspace.
drm_mode_revoke_lease finds the lessee in the appropriate IDR tree & removes them both from the IDR tree & linked-lists.
Most of it straightforwardly gets & sets various device-independent properties, and/or dynamically dispatches to driver methods. Locks are often involved, as are IDR trees (compact tries) for resolving userspace references. Several of these are guarded by bitflags, checked either by IOCtl generic or specific logic.
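To show what “resolving userspace references” amounts to, here is a toy handle table with the same three operations the IOCtls above lean on: allocate an ID for an object, look the object up by ID, and remove it. The real IDR is a radix tree. This flat array (and every name in it) is just my stand-in for the interface, not how the kernel stores anything.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 16

/* Slot 0 is never handed out, so 0 can mean "no handle". */
struct handle_table {
    void *slots[MAX_HANDLES + 1];
};

/* Store obj under the lowest free handle; returns the handle, or -1 when full. */
static int handle_alloc(struct handle_table *t, void *obj)
{
    assert(t != NULL && obj != NULL);
    for (int id = 1; id <= MAX_HANDLES; id++) {
        if (t->slots[id] == NULL) {
            t->slots[id] = obj;
            return id;
        }
    }
    return -1;
}

/* Resolve a handle from userspace; unknown or out-of-range IDs give NULL. */
static void *handle_lookup(const struct handle_table *t, int id)
{
    if (id < 1 || id > MAX_HANDLES)
        return NULL;
    return t->slots[id];
}

/* Revoke a handle, returning the object it referred to (or NULL). */
static void *handle_remove(struct handle_table *t, int id)
{
    void *obj = handle_lookup(t, id);

    if (obj != NULL)
        t->slots[id] = NULL;
    return obj;
}
```

The important property is that userspace only ever holds small integers, which the kernel validates on every call instead of trusting a pointer.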
To handle older display output protocols (compatible with Cathode Ray Tubes) it has a lookup table for vblanking delays/other flags, & it implements a simple “Buddy” Memory Allocator (augmenting that of the corresponding userspace library “libdrm”) to manage the dedicated RAM in a subsystem called GEM (Graphics Execution Manager).
It uses standard memfences to synchronize with the GPU, and has linked lists to manage collections vitally including the queue of “events” to be received by the device file’s
Those IOCtls wrap methods provided by the GPU device driver, and since I use a Radeon GPU I’ll be discussing that driver.
(Un)loading & Opening/Closing
Linux’s Radeon GPU driver has a method
load_kms (Kernel Mode Setting) to configure the hardware for a given output encoding. I will describe how it works this morning.
It starts by allocating memory for the new configuration, reading in flags from the PCI bus (+ an internal Radeon property), and initializing/verifying its properties, outputting any errors to a special debugging file. This includes setting callbacks & flags for the different versions of these GPUs. It also registers debug files.
Having initialized all those fields it runs some tests (I won’t go into details) to make sure things work and are performant and registers it with the PM runtime or VGA hardware.
Then it initializes the modeset, including its “dynamic” properties which can be looked up via an IOCtl, which’ll also init its atombios or combios I2C bus + other hardware and check its EDID, getting its Atom Connector from a variety of sources.
Next it initializes the ACPI data/hardware, sending ATCS & ATIF signals. And if this is PX hardware, it’ll then reconfigure the PM runtime.
Not that I have much of an idea of what much of this code really means.
The second method, which I’ll cover this morning, of Linux’s Radeon GPU driver is
This synchronizes with the PM (whatever that is!) and allocs/inits an internal structure and GEM (Graphics Execution Manager) buffer objects, whilst setting up GPU virtual memory tables if supported (by a preceding check and callback methods), and enqueueing the operation on a shared ring buffer.
Once everything’s initialized, it reserves some additional Buffer Object memory, adds it to the device’s virtual memory tables, computes the CPU & GPU virtual addresses for that page, adjusts an interval tree (not dissimilar to a binary search tree) and the Buffer Object to match, and frees it all on fail.
All of this only happens for newer Radeon GPUs that support virtual memory; on older ones it just synchronizes before closing up with the following shared steps.
In either case it calls into the PM runtime (again, whatever that is!) to mark this GPU as the last busy device and configures autosuspend.
The third method, which I’ll describe this morning, of Linux’s Radeon GPU driver is
This synchronizes with the PM runtime, unsets
cmask_filp properties under a lock if they’re the same as the device’s own
file_priv value, and atomically frees each UVD & VCE handle using memfences & noting any uses whilst communicating this to the GPU using Buffer Objects.
For newer GPUs with virtual memory support it’ll try reserving some additional buffer object memory. If it fails & an
accel_working flag on the GPU driver is set, it’ll remove that buffer object (by removing it from a kernel-space linked list & the GPU’s interval tree before adding it to a free list or unreferencing its memfence), & unreserve it via the TTM infrastructure.
In either case if that flag’s set it’ll deallocate that virtual memory table.
For those GPUs which support virtual memory it’ll also free & unset the
Then for all Radeon GPUs it calls into the PM runtime to mark the GPU as being the last busy device & enable autosuspend.
Linux’s Radeon GPU driver’s
lastclose_kms method is called once no more processes are using it according to a refcount. Though most of the work has already been done per-process in
This’ll start by resetting the mode of the framebuffer emulation, using other driver methods, Direct Rendering Manager properties, & locks whilst disabling its “planes”. Another method “commits” this new state once adjusted.
Then it switches the VGA state by looking it up from a linked list by stored ID, finds an active client from another list, and reconfigures its audio (via the client’s set_gpu_state method), framebuffer console (under its lock, switching the method table for debugging messages), & power state (optionally calling into the PM runtime). Between which it’ll call the VGA’s (under lock)
switchto & the client’s
reprobe methods. Then it’ll set the active flag.
load_kms method, Linux’s Radeon GPU driver has a
This starts by checking if it’s already been unloaded, and if it’s a Px-series device calls into the PM runtime to synchronize with and forbid the device. Then it frees its ACPI (unregistering from the bus), modeset (if flagged as initialized unschedules polling, calls a hpd.fini method, resets CRTC config, and frees memory), device (disconnect shared [virtual] memory & buses), and its own memory.
Starting my study of Linux’s Radeon GPU driver’s VBlanking methods, this morning I’ll describe get_vblank_counter. VBlanking is a delay between output signals where analog Cathode Ray Tube circuits would move their electron beam back to the top, and digital displays didn’t care to switch away from that protocol until recently with HDMI.
This starts by looking up the given CRTC pipe & (once its count stabilizes) stats. Stats involves reading the GPU’s registers via RAM & performing bitwise ops.
To get the count itself it calls a callback method that’s been configured whilst initializing the bus. And once it has successfully/stably read the stats/count it’ll check some bitflags on those stats to output debugging information.
And if an inactive CRTC is specified it’ll just read the count without the stats with a debugging message that this count may be wrong.
The second VBlanking method of Linux’s Radeon GPU driver is there to enable the VBlanking interrupt,
enable_vblank_kms. I’ll describe it this morning.
This validates that the specified CRTC number is in the valid range, and within an interrupt (IRQ) lock it sets a flag for that CRTC & sets the interrupt via a callback method.
disable_vblank_kms method does the exact same thing but unsets the flag.
The fourth VBlanking method of Linux’s Radeon GPU driver is
get_vblank_timestamp, which I’ll describe this morning.
This validates its input and calls
get_scanout_position for a maximum number of retries until it has responded fast enough, passing either the VBlank or CRTC hardware mode depending on a device flag. Then it does some simple math to convert the output.
get_scanout_position (the fifth & final VBlanking method) reads from certain GPU registers depending on the general Radeon GPU model & the specified pipe/CRTC. Bitwise operations are used to decode the response, and timestamps are taken to measure the latency of these reads as used by
Linux’s Radeon GPU driver’s first method handling CPU interrupts is
This unsets a bunch of flags (supposedly to disable the to-be installed interrupts) within an interrupts lock. Then it calls callback methods to set the CPU interrupts (within that lock) & to process any received interrupts (outside the lock).
The postinstall method meanwhile sets the max_vblank property to 0x00ffffff or 0x001fffff based on whether it’s an AVIVO (family >= CHIP_RS600) model.
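Both of those constants are one less than a power of two (24 and 21 bits respectively), which is what makes them handy as wraparound masks for a hardware frame counter. Here is a small sketch of that selection plus the masked subtraction. The family enum is hypothetical, and I’m assuming the wider counter is the AVIVO one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical family ordering; only the relative order matters here. */
enum chip_family_sketch {
    CHIP_R100_SKETCH,
    CHIP_RS600_SKETCH,
    CHIP_R600_SKETCH
};

/* Assumption: AVIVO parts (family >= RS600) have the 24-bit counter. */
static uint32_t max_vblank_count_sketch(enum chip_family_sketch family)
{
    return family >= CHIP_RS600_SKETCH ? 0x00ffffffu : 0x001fffffu;
}

/* Frames elapsed between two raw counter readings, tolerating one wrap.
 * Works because `max` is of the form 2^n - 1. */
static uint32_t vblank_delta_sketch(uint32_t last, uint32_t cur, uint32_t max)
{
    assert(last <= max && cur <= max);
    return (cur - last) & max;
}
```

So a counter that reads 0x1ffffe and then 0x000001 on a 21-bit part has advanced by 3 frames, not gone backwards.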
Linux’s Radeon GPU driver’s third method for managing its CPU interrupts is
This unsets a bunch of flags before calling the callback method for setting the interrupts, all within a CPU interrupts lock, very similar to what happens before installing an interrupt.
The fourth & final method is
irq_handler_kms which gets called when the GPU triggers the interrupt. Which just dispatches to a callback method, and on success tells the PM runtime to mark this device as last busy.
Today I want to describe the IOCtls specific to Radeon GPUs, which are looked up by the generic Direct Rendering Media IOCtl method in a driver-specific table.
This starts with a whole bunch of deprecated IOCtls, all of which now successfully do nothing and are flagged to require authentication.
Most of the other Radeon-specific IOCtls are related to the GEM subsystem for managing the GPU’s dedicated RAM, but there’s also CS & INFO IOCtls.
RADEON_GEM_INFO copies various properties from the Radeon device into userspace, including mman.bdev.man[TTM_PL_VRAM]->size << PAGE_SHIFT, with the appropriate pin sizes subtracted from the output.
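The arithmetic is just a pages-to-bytes shift and a couple of subtractions. A sketch, with field and parameter names I’ve made up and a 4 KiB page assumed. Exactly which pin size gets subtracted from which field is my guess at the shape, not lifted from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* assumes 4 KiB pages */

struct gem_info_sketch {
    uint64_t vram_size;      /* total VRAM, in bytes */
    uint64_t vram_visible;   /* CPU-visible VRAM minus what's pinned */
    uint64_t gart_size;      /* GART space minus what's pinned */
};

static struct gem_info_sketch gem_info_sketch(uint64_t vram_pages,
                                              uint64_t visible_bytes,
                                              uint64_t vram_pinned,
                                              uint64_t gart_bytes,
                                              uint64_t gart_pinned)
{
    struct gem_info_sketch info;

    assert(vram_pinned <= visible_bytes && gart_pinned <= gart_bytes);
    /* the memory manager counts pages, userspace wants bytes */
    info.vram_size = vram_pages << PAGE_SHIFT_SKETCH;
    /* pinned memory can't be handed out, so don't advertise it */
    info.vram_visible = visible_bytes - vram_pinned;
    info.gart_size = gart_bytes - gart_pinned;
    return info;
}
```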
RADEON_GEM_CREATE rounds up the desired size, creates the GEM object (in turn validating the size, creating the buffer object, & adding it to a linked-list), and wraps it in a GEM handle, all whilst holding a readlock and immediately returning any errors.
RADEON_GEM_MMAP looks up the GEM object from the appropriate IDR tree, converts it to a Radeon buffer object, gets its userptr if it has the right method table, & gets its mmap offset.
RADEON_GEM_SET_DOMAIN when passed the
RADEON_GEM_DOMAIN_CPU domain tells the DMA subsystem to wait on that Radeon buffer object’s memory. For other domains it does nothing, possibly erroring.
RADEON_GEM_PWRITE both complains about being unimplemented & errors with ENOSYS.
RADEON_GEM_WAIT_IDLE calls the DMA subsystem to wait on the looked up Radeon buffer object, and if the
tbo.mem.mem_type property is
RADEON_GEM_DOMAIN_VRAM & it can it calls the
mmio_hdp_flush bus callback method.
RADEON_CS initializes an on-stack parser, its instruction buffer, & its relocations buckets from userspace data. It then processes that instruction buffer by parsing it via a callback, synchronizing on each ring’s memfences, and enqueueing instructions on a locked ring.
Enqueueing an instruction may involve allocating a new Virtual Memory ID, synchronizing those rings again, flushing the virtual memory tables to the GPU via its ring buffer, and executing the instruction via another bus-specific callback method.
RADEON_INFO copies a specified Radeon-specific property over into userspace.
And back to the GEM related IOCtls,
RADEON_GEM_SET_TILING validates its input & sets the specified flags on the looked up Radeon buffer object. As per its
RADEON_GEM_BUSY checks with the DMA subsystem on the looked up Radeon buffer object, and gets its (converted)
RADEON_GEM_VA validates its input before reserving additional memory for the looked up Radeon buffer object and adding it to the list of GPU virtual memory tables if necessary.
RADEON_GEM_OP gets or sets the initial_domain property of the looked up Radeon buffer object.
RADEON_GEM_USERPTR validates its input before setting certain properties, registering their values with the Memory Management Unit, optionally doing additional validation with extra reserved memory, and wrapping the GEM object in a GEM handle, all whilst holding a readlock.
Here I’ll cover Linux’s Radeon GPU driver’s main GEM (Graphics Execution Manager) methods.
The first is
gem_free_object_unlocked. This bitcasts the provided GEM object to a Radeon buffer object, unregisters it with the Memory Management Unit’s notifier, & decrements its refcount via the TTM subsystem.
The second is
gem_open_object which reserves additional memory for the provided/bit-casted Radeon buffer object & if necessary adds that memory to the GPU’s virtual memory tables.
gem_close_object meanwhile reserves that memory, decrements its refcount in the virtual memory table removing it if necessary (by taking a number of locks & removing it from appropriate lists & the interval tree before adding it to a freelist), before it can finally unreserve that memory via the TTM subsystem.
radeon_mode_dumb_create memory aligns the input sizing info (branching upon the cpp, allocating at least one page), creates the GEM object (via the TTM & kernel-alloc subsystems, storing the process’s PID & adding to a mutex-locked linked-list), & wraps it in a handle (via an IDR tree, VMA nodes, the GEM object’s open method or the driver’s gem_open_object, & locks).
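The alignment step is the part that’s easy to show in isolation: round the width up to a cpp-dependent boundary, multiply out to a byte pitch and a total size, then round that up to whole pages. In this sketch the structure, the function names, and the particular masks per cpp are my assumptions for illustration (the driver’s real masks may also depend on the chip and tiling), and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u   /* assumes 4 KiB pages */

struct dumb_sketch {
    uint32_t width, height, bpp;   /* inputs */
    uint32_t pitch;                /* output: bytes per row */
    uint64_t size;                 /* output: bytes to allocate */
};

/* Round width up to a cpp-dependent boundary and return bytes per row.
 * The mask values are illustrative assumptions. */
static uint32_t align_pitch_sketch(uint32_t width, uint32_t cpp)
{
    uint32_t mask = 0;

    switch (cpp) {
    case 1: mask = 255; break;
    case 2: mask = 127; break;
    case 3:
    case 4: mask = 63; break;
    }
    return ((width + mask) & ~mask) * cpp;
}

static void dumb_create_sketch(struct dumb_sketch *args)
{
    uint32_t cpp = (args->bpp + 7) / 8;   /* bytes ("chars") per pixel */

    assert(cpp >= 1 && cpp <= 4);
    args->pitch = align_pitch_sketch(args->width, cpp);
    args->size = (uint64_t)args->pitch * args->height;
    /* round up to whole pages, so any non-empty buffer gets at least one */
    args->size = (args->size + PAGE_SIZE_SKETCH - 1)
                 & ~(uint64_t)(PAGE_SIZE_SKETCH - 1);
}
```

Userspace reads the pitch and size back out of the same structure it passed in, which is why they’re written in place.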
radeon_mode_dumb_mmap looks up the appropriate GEM object/Radeon buffer object, checks if it has a
userptr property set, and returns its VMA node’s
vm_node.start property bit-shifted by a constant PAGE_SHIFT to be memory mapped into userspace by the caller.
As for directly handling file syscalls, it’s mostly the same as for any GPU (as described earlier). Though MMap calls into the TTM subsystem & IOCtls additionally call the PM (Power Management?) runtime.
Prime handle / File Descriptor Conversion
prime_handle_to_fd looks up the specified DRM GEM object (in an IDR tree/compressed trie) & DRM prime buffer (within a Red-Black balanced binary tree) within a lock. Within an additional lock it’ll then get the file descriptor from its dma_buf properties or the DMA subsystem, or call other driver methods to “export” it.
Before releasing the second lock Linux’s Radeon GPU driver’s
prime_handle_to_fd method will add the DMA buffer & handle to the prime handles Red-Black tree, and prior to unlocking the first it’ll clean up any memory after getting a file descriptor from the DMA buffer. The DMA buffer is managed by an external subsystem.
prime_fd_to_handle meanwhile gets the DMA buf from the provided file descriptor, locks that file to look up the prime handle, and takes another lock to “import” the buffer via either the
gem_prime_import or (with more preparation)
gem_prime_import_sg_table other driver methods. Then it creates the prime handle (by adding it to the appropriate IDR tree and VMA nodes before calling the object’s
open or driver’s
gem_open_object methods) and buf handle (by adding it to the RB-tree).
NOTE: These are the generic Direct Rendering Manager method implementations the Radeon driver is using here.
gem_prime_export just adds a check for the TTM userpointer before resuming the generic processing from Linux’s Direct Rendering Manager subsystem which just hands off to the Direct Memory Access subsystem and increments some refcounts.
gem_prime_pin casts its input GEM object to a Radeon Buffer Object which it pins whilst temporarily “reserving” it.
Reserving/unreserving the Radeon Buffer Object is done via the TTM subsystem (whatever that is!).
And the actual pinning is done by first checking whether the buffer object has a userpointer. If it has already been pinned, it increments that count and gets + verifies the “GPU offset” to memory map.
Otherwise it verifies the
prime_shared_count and domain, computes each “placement”, has the TTM subsystem validate it, and on success increments the pin count & updates the pin sizes.
Unpinning meanwhile involves decrementing the pin count and upon reaching zero: unsetting the “placements”, revalidating with the TTM subsystem, & upon success subtracting from the pin sizes. This also happens whilst “reserving” the Radeon buffer object.
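Stripped of the TTM validation, the bookkeeping is a nested pin count plus a running total of pinned bytes that only changes on the first pin and the last unpin. A sketch with invented structure and function names:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct dev_sketch {
    uint64_t vram_pin_size;   /* total bytes currently pinned */
};

struct bo_sketch {
    uint64_t size;
    unsigned pin_count;
};

/* Pin a buffer; only the first pin counts towards the pinned total. */
static void bo_pin_sketch(struct dev_sketch *dev, struct bo_sketch *bo)
{
    if (bo->pin_count++ == 0)
        dev->vram_pin_size += bo->size;
}

/* Unpin a buffer; only the last unpin releases its share of the total. */
static int bo_unpin_sketch(struct dev_sketch *dev, struct bo_sketch *bo)
{
    if (bo->pin_count == 0)
        return -EINVAL;       /* unbalanced unpin */
    if (--bo->pin_count == 0) {
        assert(dev->vram_pin_size >= bo->size);
        dev->vram_pin_size -= bo->size;
    }
    return 0;
}
```

That running total is the same kind of “pin size” that RADEON_GEM_INFO subtracts before reporting free memory to userspace.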
gem_prime_get_sg_table casts the input GEM object to a Radeon Buffer object to get the num_pages property whose data is required by the generic Direct Rendering Manager implementation, which kmalloc’s the output & initializes with
This scatterlist is defined in lib/scatterlist.c as it’s used by various drivers, and serves to collapse contiguous pages into a single entry.
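The collapsing itself is simple to sketch: walk the page list in order and extend the current segment whenever the next page directly follows it. This works on plain page frame numbers with made-up names, and ignores the segment size limits a real scatterlist would have to respect.

```c
#include <assert.h>
#include <stddef.h>

/* One run of physically contiguous pages. */
struct seg_sketch {
    unsigned long first_pfn;   /* page frame number of the first page */
    unsigned long npages;
};

/* Collapse runs of consecutive page frame numbers into segments.
 * `out` must have room for n entries (the worst case: no two pages adjacent).
 * Returns the number of segments written. */
static size_t coalesce_pages_sketch(const unsigned long *pfns, size_t n,
                                    struct seg_sketch *out)
{
    size_t nsegs = 0;

    assert(pfns != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (nsegs > 0 &&
            out[nsegs - 1].first_pfn + out[nsegs - 1].npages == pfns[i]) {
            out[nsegs - 1].npages++;        /* extends the current run */
        } else {
            out[nsegs].first_pfn = pfns[i]; /* starts a new run */
            out[nsegs].npages = 1;
            nsegs++;
        }
    }
    return nsegs;
}
```

Fewer, larger segments mean fewer descriptors for the DMA hardware to chew through.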
To import one of these SG Tables, it creates a new Radeon Buffer Object (which stores the SG table) within a resv lock & adds it to a linked list of GEM objects within the device’s GEM lock.
gem_prime_vunmap both hand directly off to the TTM subsystem once their input has been cast to a Radeon Buffer Object from a GEM object.
Different versions of Radeon GPUs use different I/O protocols, so Linux’s driver for them has two layers of methods. The first layer (called by the Direct Rendering Manager subsystem) manages memory and other higher-level concepts specific to the Radeon family. The second knows how to read and write data directly to/from the GPU.
Linux’s Radeon rs780 driver’s
init method starts by initializing a new debugging device file outputting r600/rs780-specific registers, which it reads via the appropriate data bus, writing a request if necessary.
Then it reads in the BIOS information from ATRM falling back to ACPI VFCT, IGP VRam, etc exposing the BIOS as a u8 array. If it can’t it errors out and the driver doesn’t start, otherwise it verifies the BIOS state and checks whether it’s ATOM.
In the case of r600/rs780 it expects an ATOM BIOS, which it’ll then proceed to initialize with methods for reading and writing data & I/O registers (via the CPU’s I/O bus which may or may not also be the RAM bus, and may involve an INDEX register for extending the number of registers supported), methods to directly hand off to other driver methods, and information:
- identifying the device,
- parsed from the ATOM BIOS
- scratch registers
Next it checks the CPU BIOS and various GPU registers to see if the card has been “posted”. If not it errors out for missing GPU BIOS or initializes the ATOM ASIC by reading some registers and having ATOM execute a “table” or two.
With the GPU BIOS running it now gets to the bulk of initialization, including:
- Referencing scratch registers into a new array
- Clearing all the surface_regs or their referenced buffer objects, including via the TTM & DMA subsystems.
- Get the clock info parsed from the ATOM (or COM-) BIOS, or fall back to OpenFirmware. Then it’ll initialize various properties upon success, failure, or independent thereof.
- Initialize memfence properties & memory rings, and schedule work to verify it’s working.
- Initializes AGP if available on this device by calling into the generic “DRM” implementation to acquire it and read its info before reading more info out of registers & driver properties, then enabling AGP and a flag register. By default AGP refers to a global variable and an enable method thereof.
- Disable AGP if the hardware was available but the previous step failed, by reconfiguring some methods to older logic operating directly on GPU registers.
- Read the GPU’s memory controller registers.
- Initialize GPU buffer objects by calling into the “arch”, TTM, & DMA subsystems before having it allocate some initial memory and creating a debugging filesystem for it.
- If required, initialize the GPU with the appropriate microcode.
- Figure out the appropriate Power Management mode and initialize it with hardware monitors, background heat monitoring work, mutexes, debugging device files, etc.
- Allocate a ring buffer and verify that it supports scratch registers as claimed.
- Similarly initialize a UVD ring buffer with added background work, firmware, and buffer objects.
- Allocate a r600 ring but don’t validate it supports scratch registers.
- Allocate GART pages & (as buffer objects) GPU VRAM.
- Enable PCIE Gen2 by reading/writing/verifying GPU registers.
- Initialize VRAM as pinned buffer objects.
- Write initialization code to GPU registers.
- Enable “PCIE GART” or “AGP” by writing into appropriate GPU registers.
- A bunch more reading & especially writing of GPU registers and driver properties for
- Allocate WriteBack structures as Buffer Objects.
- Start the GFX Index ring by resetting a scratch register, and writing to a memfence.
- Start the UVD subsystem & the corresponding ringbuffer, zeroing the ring on error.
- If “installed” initialize the CPU interrupts by initializing a spin lock, the DRM VBlank (for Cathode-Ray Tube-compatible output protocols) subsystem, the MSI PCI-protocol, various background work, before installing the CPU interrupts (via the IRQ subsystem wrapped by other driver methods) and flushing delayed work on error.
- Initializes r600/700 IRQs, reading & writing GPU registers to disable them, writing to GPU registers to resume the evergreen or r600 IRQs dealloc’ing the ring buffer on failure, writing to more GPU registers, disables the evergreen or r600 registers with yet more GPU register writes, calls into the PCI subsystem, and sets ENABLE flags in the IH_RB_CNTL GPU registers. Cleans up on failure.
- Set the IRQs by reading and especially writing more GPU registers.
- Allocs/Init the WriteBack ringbuffer as a buffer object.
- Load the CP microcode via more GPU register writes.
- Resume the CP via more register writes and the TTM subsystem.
- Resume the UVD subsystem and its ring buffer.
- Init the Indirect Buffer (with a waitqueue & buffer objects) depending on whether it’s
CHIP_BONAIRE or newer, starting it (by pinning & mapping the buffer object whilst it’s TTM reserved), and creating a debugging filesystem for it.
- It figures out how many audio pins the GPU card model has, starts the audio driver, & enables those pins.
- On failure, it uninitializes most of the previous steps.
- Fini the Power Management infrastructure whether it’s DPM or “old” by (for DPM) removing any device files, disabling (within a lock) & finalizing the DPM, finalizing the hardware monitor, and freeing the memory. “Old” also needs to manage the CRTC clock & PM “profiles”.
- Disable the audio driver & (by possibly recounting them & calling an optional audio driver method I won’t dig into) all its “pins”.
- Stop the CP infrastructure (via register writes & the TTM subsystem), finalize its ringbuffer (unsetting properties within a lock, & un-mapping/pinning it as a buffer object whilst reserving it via the TTM subsystem), & mark its scratch registers as free.
- Disable CPU interrupts (by unsetting some GPU register bitflags, reading acknowledgement GPU registers, & writing to more GPU registers), stop the RLC (by unsetting a specific GPU register or two), & finalize the Interrupt Handling ringbuffer (by un-pinning/mapping it as a buffer object whilst reserved & decrementing the refcount).
- If it has it set up, finalize the UVD subsystem, its ringbuffer, other Buffer Object, & firmware.
- Unset the WriteBack enabled flag & un-pin/map its corresponding Buffer Object.
- Suspend (by un-map/pin-ing it as a Buffer Object) & finalize (by signalling the memory fences, freeing the corresponding CPU-side linked lists, & decrementing the Buffer Object’s refcount) the SA BO Manager.
- Disable generic Radeon CPU interrupts by calling into the DRM (Direct Rendering Manager) subsystem generic to all GPUs, disabling MSI in the PCI subsystem, and “flushing” any hotplugging work. The DRM subsystem has interrupts for VGA & Cathode-Ray Tube-compatible output protocols, before calling another GPU method.
- Finalize the GART tables by freeing their memory (in part via other driver methods per page & the PCI subsystem), unsetting the appropriate GPU registers, & freeing + unref’ing the Buffer Object.
- Free/unref the VRAM scratch Buffer Object.
- Release the AGP backend if attached.
- Forcefully release any remaining GEM Buffer Objects.
- Within a lock & for each ringbuffer: wait on its memfence (by checking various atomic pointers before waiting for a CPU interrupt), cancel/wakeup any delayed work, & mark its scratch registers as free.
- Finalize the Buffer Objects by removing their debugging filesystems, unpinning the
stolen_vga_memory buffer object, calling into the TTM subsystem, finalizing GART (again), & calling into the Arch subsystems.
- Free the ATOM BIOS info.
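To make the shape of those two lists concrete, here's a minimal sketch of the usual C pattern for staged bring-up with unwinding on failure, and teardown in reverse order. Everything in it (fake_gpu, stage_init, stage_fini, the stage count) is a hypothetical stand-in rather than the Radeon driver's own code, and the real driver only undoes most, not all, of its earlier steps.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's many init/fini stages. */
enum { STAGE_COUNT = 4 };

struct fake_gpu {
    bool up[STAGE_COUNT];   /* which stages are currently initialized */
    int fail_at;            /* stage index that should fail, or -1 for none */
};

static int stage_init(struct fake_gpu *gpu, int stage)
{
    if (stage == gpu->fail_at)
        return -1;
    gpu->up[stage] = true;
    return 0;
}

static void stage_fini(struct fake_gpu *gpu, int stage)
{
    gpu->up[stage] = false;
}

/* Bring stages up in order; on failure tear down the ones already up, newest first. */
static int fake_gpu_init(struct fake_gpu *gpu)
{
    int stage, err;

    for (stage = 0; stage < STAGE_COUNT; stage++) {
        err = stage_init(gpu, stage);
        if (err)
            goto unwind;
    }
    return 0;

unwind:
    while (--stage >= 0)
        stage_fini(gpu, stage);
    return err;
}

/* Full teardown: the same stages in reverse order. */
static void fake_gpu_fini(struct fake_gpu *gpu)
{
    for (int stage = STAGE_COUNT - 1; stage >= 0; stage--)
        if (gpu->up[stage])
            stage_fini(gpu, stage);
}
```

Setting fail_at to a stage index before calling fake_gpu_init shows the unwinding: every stage that came up before the failing one is taken back down.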
suspend follows the steps:
- Suspend Power Management within a lock by resetting some state. For DPM it calls another lower-level driver method, and for older GPUs it (outside the lock) cancels some delayed work.
- Finalize the audio drivers as per
- Stop the CP ringbuffer via the TTM subsystem & GPU register writes.
- If attached finalize the UVD subsystem, its handles, & their memfences.
- Disable CPU interrupts & unset corresponding GPU register(s) like upon
fini, but don’t free the ringbuffer.
- Disable the WriteBack buffer by unsetting a driver flag.
- Write to GPU registers to disable GART tables, & un-map/pin its buffer object.
resume on the other hand reinitializes the ATOM BIOS, resumes Power Management (also possibly setting up voltages & clocks), & repeats steps (15)-(31) of
vga_set_state, which is also shared by many of the other versions, simply sets or unsets bitflags in the
CONFIG_CTL GPU register.
The following method is
asic_reset, which starts by checking various bitflags in the GPU registers and, depending on which are set, sets various bitflags in the GPU registers (which it might wait on to take effect) including in
It’ll then check those GPU registers again to see whether it should do a hard reset, and a third time to verify that it has been reset. Or the caller can tell this function to do the hard reset procedure instead of all of this.
If those bitflags are not set upon the third check it unsets the hung bitflag in the
R600_BIOS_3_SCRATCH register (which it has set previously).
To do a hard reset it’ll write to various GPU registers (and bitflags thereof) with hardcoded delays, and call into the PCI subsystem.
Some of the aforementioned pauses are dynamically dispatched to other driver methods I might cover later, but others are busy-loops repeatedly checking a bitflag.
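Those busy-loops all have roughly the same shape, so here's a small sketch of one. The register is simulated (fake_reg clears its busy bit after a set number of reads) and the bit position is made up; in the kernel the pause between polls would be something like udelay().

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_BUSY_BIT (1u << 0)   /* made-up bit position */

/* Hypothetical register model: the busy bit clears after a number of reads. */
struct fake_reg {
    uint32_t value;
    unsigned reads_until_idle;    /* 0 means "never goes idle" */
};

static uint32_t fake_reg_read(struct fake_reg *reg)
{
    if (reg->reads_until_idle > 0 && --reg->reads_until_idle == 0)
        reg->value &= ~FAKE_BUSY_BIT;
    return reg->value;
}

/* Poll until the busy bit clears or we run out of attempts. Returns true once idle. */
static bool wait_for_idle(struct fake_reg *reg, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (!(fake_reg_read(reg) & FAKE_BUSY_BIT))
            return true;
        /* a real driver would pause briefly here before polling again */
    }
    return false;   /* timed out; the caller decides whether to escalate */
}
```

The bounded loop matters: without a poll limit a hung GPU would hang the CPU along with it.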
Continuing my exploration of Linux’s Radeon rs780 driver’s methods,
mmio_hdp_flush writes to the
R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL GPU register.
gui_idle checks a bitflag in the
GRBM_STATUS GPU register.
mc_wait_for_idle busy waits on a bitflag in the
R_000E50_SRBM_STATUS GPU register, with 1-microsecond delays.
get_xclock returns a CPU-side driver property.
get_gpu_clock_counter writes request (1) to
RLC_CAPTURE_GPU_CLOCK_COUNT and reads the response from
MSB GPU registers, all within a dedicated mutex.
get_allowed_info_register checks which register the caller wants read, and if it’s within a certain set it reads it.
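The allow-list check in get_allowed_info_register is simple enough to sketch. The register offsets and the fake_mmio array below are hypothetical; I'm also assuming a disallowed register is reported with -EINVAL.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical register offsets; the real set is defined by the driver. */
#define FAKE_REG_STATUS_A 0x10u
#define FAKE_REG_STATUS_B 0x14u
#define FAKE_REG_PRIVATE  0x20u

static uint32_t fake_mmio[16];   /* simulated register file, indexed by offset / 4 */

static uint32_t fake_rreg32(uint32_t reg)
{
    return fake_mmio[reg / 4];
}

/* Only registers on the allow-list are read on the caller's behalf. */
static int get_allowed_info_register(uint32_t reg, uint32_t *val)
{
    switch (reg) {
    case FAKE_REG_STATUS_A:
    case FAKE_REG_STATUS_B:
        *val = fake_rreg32(reg);
        return 0;
    default:
        return -EINVAL;   /* assumed error code for a disallowed register */
    }
}
```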
Linux’s Radeon rs780 driver has three methods grouped under the method table’s
gart property all of which are used for managing memory paging.
gart.tlb_flush writes bitflags to
VM_CONTEXT0_REQUEST_RESPONSE GPU registers, before busy waiting on a bitflag in
VM_CONTEXT0_REQUEST_RESPONSE until a timeout.
gart.get_page_entry merges its
flags arguments into an encoding understood by the GPU.
gart.set_page just calls writeq to send a signal out over the I/O bus (which may just be a write to a memory location).
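As a rough illustration of what “merging flags into an encoding understood by the GPU” looks like, here's a sketch of building a page table entry. The flag names and bit positions on both sides are hypothetical; only the general idea (page-aligned address in the high bits, permission bits in the low bits) is what I'm trying to show.

```c
#include <stdint.h>

/* Caller-side flags (hypothetical names and values). */
#define PAGE_FLAG_VALID  (1u << 0)
#define PAGE_FLAG_READ   (1u << 1)
#define PAGE_FLAG_WRITE  (1u << 2)

/* Bits in the hardware page table entry (hypothetical layout). */
#define PTE_VALID      (1ull << 0)
#define PTE_READABLE   (1ull << 5)
#define PTE_WRITEABLE  (1ull << 6)

static uint64_t make_page_entry(uint64_t addr, uint32_t flags)
{
    uint64_t entry = addr & ~0xFFFull;   /* keep only the 4 KiB-aligned page address */

    if (flags & PAGE_FLAG_VALID)
        entry |= PTE_VALID;
    if (flags & PAGE_FLAG_READ)
        entry |= PTE_READABLE;
    if (flags & PAGE_FLAG_WRITE)
        entry |= PTE_WRITEABLE;
    return entry;
}
```

The resulting 64-bit value is the kind of thing set_page would then write into the table in a single 64-bit store.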
Today I’d like to describe Linux’s Radeon rs780 driver’s methods for managing the ringbuffers used for communication with the GPU. For the rs780 there are three sets of these methods, communicating over different channels. I’ll at least cover the ones communicating via the GFX ring today.
ib_execute writes instructions to the Indirect Buffer’s ring buffer to push a new function onto its call stack.
emit_fence writes instructions to flush the read and other caches, and maybe more.
emit_semaphore writes a simpler instruction into the semaphore’s ring buffer.
I’ll have to cover
cs_parse later, which translates Mesa3D’s standard bytecode to that supported by Radeon rs780 GPUs.
ring_test finds a free register, locks the ring buffer CPU-side, and writes a
SET_CONFIG instruction. And
ib_test sends that instruction through the Indirect Buffer in order to test that it works.
is_lockup checks the same GPU-register bitflags as for GPU restarts, before timing an atomic read.
get_rptr reads a GPU register or WriteBack ringbuffer property.
get_wptr reads the
R600_CP_RB_WPTR register, &
set_wptr sets that register before waiting on a read to complete.
The dispatch tables for communicating over Direct Memory Access (DMA) or the UVD ring buffer are implemented very similarly.
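Since get_rptr, get_wptr, & set_wptr only make sense with a picture of the ring in mind, here's a minimal CPU-only sketch of a ring buffer with separate read and write pointers. The size and names are my own; in the driver the producer is the CPU, the consumer is the GPU, and the pointers live in GPU registers or shared memory rather than in a plain struct.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_WORDS 8u   /* a power of two, so wrapping is a simple mask */

struct ring {
    uint32_t buf[RING_WORDS];
    uint32_t rptr;   /* next word the consumer will read */
    uint32_t wptr;   /* next word the producer will write */
};

/* One slot always stays empty so that "full" and "empty" can be told apart. */
static uint32_t ring_free_words(const struct ring *r)
{
    return (r->rptr - r->wptr - 1) & (RING_WORDS - 1);
}

static bool ring_write(struct ring *r, uint32_t word)
{
    if (ring_free_words(r) == 0)
        return false;
    r->buf[r->wptr] = word;
    r->wptr = (r->wptr + 1) & (RING_WORDS - 1);
    return true;
}

static bool ring_read(struct ring *r, uint32_t *word)
{
    if (r->rptr == r->wptr)
        return false;   /* empty */
    *word = r->buf[r->rptr];
    r->rptr = (r->rptr + 1) & (RING_WORDS - 1);
    return true;
}
```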
Radeon GPUs do not fully support the standardised bytecode format provided by userspace/Mesa3D to the Radeon drivers. These bytecodes need to be further lowered in kernelspace.
This morning I’ll describe how that happens, though it can depend on the communication channel to the GPU.
It starts by alloc/init’ing a “track” and stores certain driver configuration properties over to it. It’ll free this once the compilation has completed.
For each bytecode in the provided Indirect Buffer it first “parses”/decompacts it (possibly erroring out) before lowering it (and parsing additional fields of it) depending on its “type” property.
For RADEON_PACKET_TYPE0 it iterates over every
AVIVO_D1MODE_VLINE_START_END instruction contained, checking that its arguments are supported, including by looking up the CRTC number. If that CRTC is not enabled the instruction is zeroed out, otherwise it’s slightly rewritten.
RADEON_PACKET_TYPE2 is not rewritten in any way.
RADEON_PACKET_TYPE3 has different logic for each opcode, each validating their arguments (possibly by calling
r600_cs_track_check to validate external encoding issues) and in some cases performing minor rewrites according to a relocations list. These relocations in part involve validating the target instruction & converting from CPU addresses to GPU addresses.
For others it errors.
Also, it’s worth noting that when using DMA communication channels there are additional opcodes to lower.
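The “parse”/decompact step boils down to pulling bitfields out of a 32-bit packet header. Here's a sketch of that. The field layout (type in the top two bits, a 14-bit count below it, then either a register index or an opcode) is how I understand the Radeon packet headers, but treat the exact masks and shifts as assumptions; the struct and function names are mine.

```c
#include <stdint.h>

enum pkt_type { PKT_TYPE0 = 0, PKT_TYPE2 = 2, PKT_TYPE3 = 3 };

struct pkt {
    unsigned type;     /* header bits 31:30 */
    unsigned count;    /* header bits 29:16 */
    unsigned reg;      /* type 0 only: register byte offset */
    unsigned opcode;   /* type 3 only: header bits 15:8 */
};

/* Returns 0 on success, -1 for a packet type this sketch rejects. */
static int pkt_decode(uint32_t header, struct pkt *out)
{
    out->type = (header >> 30) & 0x3;
    out->count = (header >> 16) & 0x3FFF;
    out->reg = 0;
    out->opcode = 0;

    switch (out->type) {
    case PKT_TYPE0:
        out->reg = (header & 0xFFFF) << 2;   /* dword index to byte offset */
        return 0;
    case PKT_TYPE2:
        return 0;                            /* filler packet, nothing more to decode */
    case PKT_TYPE3:
        out->opcode = (header >> 8) & 0xFF;
        return 0;
    default:
        return -1;                           /* any other type is an error */
    }
}
```

A checker would then switch on type (and, for type 3, on opcode) to decide which validation and relocation logic to run over the following count words.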
CPU Interrupt Handling
irq.set first validates that some required flags have been set on the driver, before reading various GPU flags based on the version of the GPU hardware. Then it (mostly atomically) reads additional flags from various driver properties & ringbuffers, encodes that data into GPU registers depending on the hardware version, and finally syncs on another.
irq.process meanwhile first checks a couple of flags & possibly synchronizes on the
IH_RB_WPTR. Then it reads the current write pointer either from that register or shared memory before checking and fixing buffer overflows.
Then it grabs a (manually-written) lock, reads the ring buffer’s read pointer, reads then writes various GPU registers to acknowledge this data has been read. After which it can iterate over all items currently in the interrupts ring buffer.
For each item in that ring buffer, it examines the opcode (as one or two words:
src_data) to validate it, set driver properties, and/or enact that operation.
This includes D1/D2 VBlank (calling down into DRM & Radeon-specific handling which uses GPU registers, other methods, & locks)/VLine/PFlip (by setting some driver properties & using DRM to send events to userspace), HPD/DAC hotplug, various ring buffers, thermal high to/from low, & logs GUI idles.
Having interpreted all those items in the interrupts ring buffer, it schedules hotplugging/HDMI/thermal work based on flags set by the interpreter (the handlers are implemented elsewhere), updates the read pointer, & if new items have been added to it jumps back to the interpreter.
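The overall loop in irq.process can be sketched as follows. This is a single-threaded simplification: the source IDs, entry size, and struct fields are hypothetical, the write pointer is a plain field rather than a GPU register, and the locking, acknowledgement, and overflow handling are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define IH_WORDS        16u   /* ring size in 32-bit words, a power of two */
#define IH_ENTRY_WORDS   4u   /* words per interrupt entry in this sketch */

enum { SRC_VBLANK = 1, SRC_HOTPLUG = 19 };   /* hypothetical source IDs */

struct ih {
    uint32_t ring[IH_WORDS];
    uint32_t rptr;
    uint32_t wptr;          /* stand-in for the hardware write pointer */
    unsigned vblanks;       /* handled immediately */
    bool queue_hotplug;     /* deferred: the real handler runs elsewhere */
};

/* Drain the ring, returning how many entries were interpreted. */
static unsigned ih_process(struct ih *ih)
{
    unsigned handled = 0;
    uint32_t wptr;

restart:
    wptr = ih->wptr;
    while (ih->rptr != wptr) {
        uint32_t src_id = ih->ring[ih->rptr] & 0xFF;

        switch (src_id) {
        case SRC_VBLANK:
            ih->vblanks++;
            break;
        case SRC_HOTPLUG:
            ih->queue_hotplug = true;
            break;
        default:
            break;   /* unknown source: a real driver would log it */
        }
        ih->rptr = (ih->rptr + IH_ENTRY_WORDS) & (IH_WORDS - 1);
        handled++;
    }
    /* New entries may have arrived while we were busy; if so, go around again. */
    if (ih->wptr != wptr)
        goto restart;
    return handled;
}
```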
copy.copy (they both refer to the same C function), &
copy.dma first syncs (using a newly-created
radeon_sync object) & locks the specified ringbuffer, before writing instructions and data to that ring buffer.
The ringbuffers to use are specified alongside these methods.
surface.clear_reg are not yet implemented.
hpd.init iterates over each “connector” in the driver’s
mode_config, and for each checks its
connector_type (as some types break certain ring buffers), sets a hardware version-specific GPU register, builds up an enable bitmask, and sets its polarity via
After that loop, it updates a driver property for those enable flags & calls
irq.set within the CPU interrupt lock.
hpd.fini iterates over those “connectors” again, unsetting the GPU register appropriate to the hardware version and building up “disable” bitflags, which’ll be used to update that driver property before calling
irq.set again using the CPU interrupts lock.
hpd.sense reads the appropriate bitflag from a GPU register for the specific hardware version.
hpd.set_polarity (as used in
hpd.init) sets the appropriate GPU register bitflag, or clears it depending on
pm.misc looks up the currently configured
requested_clock_mode_index from the
power_state table & if that configuration says to, it adjusts the voltage by running a GPU “ATOM” BIOS program with the given parameters.
pm.prepare disables the active CRTCs by setting the
AVIVO_CRTC_DISP_READ_REQUEST_DISABLE in their appropriate GPU register.
pm.finish enables those CRTCs again by unsetting that DISABLE bitflag on them.
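pm.prepare & pm.finish are mirror-image read-modify-write operations, which a few lines of C can show. The fake_crtc struct and the bit position are hypothetical; the real flag lives in a per-CRTC GPU register.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_READ_REQUEST_DISABLE (1u << 24)   /* made-up bit position */

struct fake_crtc {
    bool enabled;
    uint32_t control;   /* stand-in for the CRTC's control register */
};

/* Set the disable bit on every active CRTC, leaving other bits alone. */
static void pm_prepare(struct fake_crtc *crtcs, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (crtcs[i].enabled)
            crtcs[i].control |= FAKE_READ_REQUEST_DISABLE;
}

/* Clear that same bit again once the power state change is done. */
static void pm_finish(struct fake_crtc *crtcs, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (crtcs[i].enabled)
            crtcs[i].control &= ~FAKE_READ_REQUEST_DISABLE;
}
```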
pm.init_profile (a new update for the rs780!) builds that table referred to by
pm.misc depending on how many
power_states (2, 3, or otherwise) are desired. The “default” profile is initialized with other Power Management configuration outside these tables.
pm.get_dynpm_state sets three PM config properties based conditionally on the CRTC count, planned action, device model, flags, power state level, etc.
pm.get/set_engine_clock both run programs in the GPU’s BIOS to get/set this data.
pm.set_clock_gating methods are all NULL, indicating to the caller to do some fallback logic.
pm.get_temperature reads and decodes the
CG_THERMAL_STATUS GPU register.
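“Reads and decodes” here means extracting a signed temperature field from the register value. A sketch of that decoding is below. The field position (9 bits starting at bit 16), the two's-complement interpretation, and returning millidegrees Celsius are all assumptions on my part for illustration.

```c
#include <stdint.h>

/* Hypothetical field layout: a 9-bit signed temperature at bits 24:16. */
#define THERM_T_SHIFT 16
#define THERM_T_MASK  (0x1FFu << THERM_T_SHIFT)

/* Decode a raw status value into millidegrees Celsius. */
static int decode_temperature(uint32_t status)
{
    uint32_t raw = (status & THERM_T_MASK) >> THERM_T_SHIFT;
    int degrees = (int)(raw & 0xFF);

    if (raw & 0x100)      /* ninth bit set: the value is negative */
        degrees -= 256;
    return degrees * 1000;
}
```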
pm.set_uvd_clocks writes control signals to 2, 3, or 4 GPU registers, computes “clock dividers”, before writing the results to various registers with occasional pauses.
Most of these are specific to rs780 rather than 6xx more broadly.
dpm.init parses various “ATOM” BIOS headers to set various driver properties.
dpm.setup_asic does nothing.
dpm.enable starts by retrieving the refresh rate from the first enabled CRTC, then writes to
CG_INTGFX_MISC to disable BIOS powersaving. If the
GLOBAL_PWRMGT_EN bitflag of the
GENERAL_PWRMGT GPU register is set, it errors out. Then it sets various GPU registers for the DPM parameters (which may error out), computing clock dividers where necessary, before writing to more GPU registers to enable various aspects of DPM waiting for vblanks, etc where necessary. If a desirable
voltage_control is set, it prepares tables before setting the GPU registers.
It then enables clock scaling & the “program at” via more GPU registers.
dpm.late_enable checks the
pm.int_thermal_start, if so it sets the GPU registers for the thermal temperature ranges & CPU interrupts.
dpm.disable sets the GPU registers to disable the dynamic power management (as done during
dpm.enable), clock scaling (a subset of what is done during
dpm.enable), & possibly trigger CPU interrupts.
dpm.pre_set_power_state does nothing, as does
dpm.set_power_state similarly sets various GPU registers.
dpm.display_configuration_changed gets the first enabled CRTC’s VRefresh rate (copied to
refresh_rate driver property) and writes that to the “program at” GPU registers.
dpm.fini frees the memory used to store the DPM parameters.
dpm.get_sclk returns the
->sclk_high property depending on the value of its
dpm.get_mclk returns the
dpm.print_power_state printk’s various subproperties from that
ps_priv property for debugging. Where this info goes is very much a topic for another time.
dpm.debugfs_print_current_performance_level printk’s out a handful of properties chosen based on the decoded
CG_SPLL_FUNC_CNTL GPU registers.
dpm.force_performance_level writes to various GPU registers partially determined by the provided
level, possibly incorporating computed clock dividers.
dpm.get_current_sclk reads & decodes the
CG_SPLL_FUNC_CNTL GPU registers.
dpm.get_current_mclk returns the
pm.dpm.priv->bootup_uma_clk driver property.
pflip.page_flip writes an
AVIVO_D1GRPH_UPDATE_LOCK bitflag to the
AVIVO_D1GRPH_UPDATE GPU register corresponding to the specified CRTC# as a lock, before writing to three more corresponding registers, busywaiting on a different bitflag there & releasing the lock.
pflip.page_flip_pending simply returns the
AVIVO_D1GRPH_SURFACE_UPDATE_PENDING bitflag in the corresponding
AVIVO_D1GRPH_UPDATE GPU register.