Direct Rendering Manager is the Linux kernel’s subsystem for exposing graphics card hardware (display output + GPU + dedicated RAM), and multiplexing that hardware between however many programs wish to access it.
Its acronym is not to be confused with Digital Rights Management, or whatever backronym you find more fitting for it.
IOCtls
Upon device-specific initialization, Linux’s Direct Rendering Manager subsystem exposes a common set of “IOCTLs” defined in an array in drivers/gpu/drm/drm_ioctl.c.
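For a sense of the shape, here’s a heavily simplified sketch of that table (the real entries carry more flags, and there are far more of them):

```c
/* Heavily simplified sketch of the dispatch table in
 * drivers/gpu/drm/drm_ioctl.c; each entry pairs an IOCtl number
 * with its handler function & permission flags. */
static const struct drm_ioctl_desc drm_ioctls[] = {
	DRM_IOCTL_DEF(DRM_IOCTL_VERSION, drm_version, DRM_RENDER_ALLOW),
	DRM_IOCTL_DEF(DRM_IOCTL_GET_UNIQUE, drm_getunique, 0),
	DRM_IOCTL_DEF(DRM_IOCTL_GET_MAGIC, drm_getmagic, 0),
	/* ...one entry per IOCtl described below... */
};
```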
Metadata IOCtls
The drm_version IOCtl copies the driver’s major, minor, patchlevel, name, date, & desc fields over into a provided pointer. This involves a special memcpy (after checking string lengths) that bridges from kernel space to user space.
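The pattern looks roughly like this sketch (a simplified stand-in for the real field-copying helper, with hypothetical names): report the full length back, but only copy as many bytes as userspace said it had room for, via copy_to_user.

```c
#include <linux/uaccess.h>
#include <linux/string.h>

/* Simplified sketch: copy a kernel string into a user buffer,
 * capped at the length userspace claimed to have allocated. */
static int copy_field(char __user *buf, size_t *buf_len, const char *value)
{
	size_t len = strlen(value);

	/* Don't overflow userspace's buffer... */
	if (len > *buf_len)
		len = *buf_len;
	/* ...but do report the true length so it can retry. */
	*buf_len = strlen(value);

	/* copy_to_user() safely bridges kernel & user space; a
	 * plain memcpy could oops on a bad pointer. */
	if (len && copy_to_user(buf, value, len))
		return -EFAULT;
	return 0;
}
```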
drm_getunique grabs a mutex on the device file to access the unique property of that file & its _len, copying it over into the userspace pointer if it has enough room.
The drm_getmagic IOCtl, used for authenticating operations that can’t be shared between processes, copies over the magic property of the drm_file, idr_alloc’ing one if necessary out of a “radix tree” (a space-efficient trie) before falling back to incrementing a global.
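The IDR allocation looks roughly like this sketch (struct & field names hypothetical):

```c
#include <linux/idr.h>
#include <linux/gfp.h>

struct my_file { int magic; };	/* stand-in for drm_file */

static DEFINE_IDR(magic_map);

static int alloc_magic(struct my_file *file)
{
	/* idr_alloc() stores the pointer in a radix tree & returns
	 * the smallest free ID >= 1 (the 0 here means "no upper
	 * bound"), or a negative errno on failure. */
	int id = idr_alloc(&magic_map, file, 1, 0, GFP_KERNEL);

	if (id < 0)
		return id;
	file->magic = id;
	return 0;
}
```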
And the legacy drm_irq_by_busid IOCtl first checks some feature (bit)flags, whether the GPU’s even on the PCI bus, and whether its state is valid before copying out the device’s pdev->irq property.
Looking at the next (capabilities-related) chunk of Linux’s Direct Rendering Manager (DRM) IOCtls, drm_getclient & drm_getstats are deprecated, exposing only a shell of their former functionality.
drm_getcap branches over the input’s capability to copy the specified property(s) into its value, possibly based on some condition. Most of these are unsupported in the absence of the DRIVER_MODESET bitflag.
drm_setclientcap does the exact opposite, copying the input value property into the appropriate driver property(s) specified by the input capability property. If the DRIVER_MODESET bitflag is unset, it instead errors with EOPNOTSUPP.
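The branching reads roughly like this sketch (a couple of real capability names, but far from the full list, and the real code arranges its cases a little differently):

```c
/* Sketch of drm_getcap-style branching over the requested
 * capability; the real code handles many more cases. */
static int sketch_getcap(struct drm_device *dev, struct drm_get_cap *req)
{
	switch (req->capability) {
	case DRM_CAP_TIMESTAMP_MONOTONIC:
		req->value = 1;
		return 0;
	case DRM_CAP_DUMB_BUFFER:
		/* A modeset-only capability. */
		if (!drm_core_check_feature(dev, DRIVER_MODESET))
			return -EOPNOTSUPP;
		req->value = !!dev->driver->dumb_create;
		return 0;
	default:
		return -EINVAL;
	}
}
```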
And drm_setversion grabs the device’s mutex to access the version properties. If necessary it compares three separate version numbers to check validity and initialize these properties.
Accelerated Graphics Port IOCtls
drm_agp_acquire atomically increments the use count on a global or device property (the accessor function can be hotswapped), upon verifying the device’s agp->acquired property, after which it ensures the device has been assigned an agp->bridge property.
drm_agp_release verifies that the device has been acquired, atomically decrements that use count, and unsets that acquired flag.
drm_agp_enable sets some properties & calls a method on the bridge’s driver.
drm_agp_info copies version, mode, aperture, memory, and device-identifier info from the device’s agp->agp_info structure over into a passed-in pointer.
drm_agp_alloc allocates from the GPU’s RAM (checking memory sizes, calling 3 methods, or allocating CPU-side) and tracks this in a linked list.
drm_agp_free looks the memory up in that linked list, unbinds it if necessary (which, if enabled by a C macro, calls the appropriate bridge method & removes it from a bound linked-list), removes it from the allocation linked-list, and frees it both on the GPU (via a choice of 5 bridge methods, or done CPU-side, if similarly macro-enabled) and the CPU (via kfree).
drm_agp_bind looks up the memory in the allocated linked-list, (if enabled) calls some bridge driver methods, and adds it to a bound linked-list.
drm_agp_unbind looks up the memory in the allocated linked-list, (if similarly compile-time enabled) calls a bridge driver method, and removes it from the bound linked-list. All of which also happens upon freeing this memory.
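The bookkeeping these AGP IOCtls share boils down to the standard kernel linked-list idiom, roughly (names hypothetical):

```c
#include <linux/list.h>

/* Hypothetical stand-in for the per-allocation node: one entry
 * per AGP allocation, chained through a list_head. */
struct agp_mem {
	unsigned long handle;
	struct list_head head;
};

static LIST_HEAD(agp_allocations);

static struct agp_mem *lookup_agp_mem(unsigned long handle)
{
	struct agp_mem *entry;

	/* An O(n) walk, which is fine for the handful of
	 * allocations a legacy AGP client makes. */
	list_for_each_entry(entry, &agp_allocations, head)
		if (entry->handle == handle)
			return entry;
	return NULL;
}
```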
But one question: why’s the Accelerated Graphics Port treated specially by the Direct Rendering Manager?
Screen Outputting
Primarily for output (in this chunk of IOCtls) there’s an IOCtl to wait for a VBLANK interval (required by old analog monitors, with which we’d maintained backwards compatibility until recently), one to configure the color encoding, and the deprecated/noop DRM_IOCTL_UPDATE_DRAW.
drm_wait_vblank checks that it has access to and has locked all the relevant hardware, computes a pipe_index using bitwise operations, and looks up the vblank interval in a device-specific lookup table. If that indicates this is a “query” it reads in a new sequence number to return alongside the current timestamp. Otherwise, within several locks, it atomically arranges for a driver method to be called from only a single process, before looking up the interval, checking whether it has passed, and replying or waiting.
After its checks, drm_legacy_modeset_ctl is split into two parts: pre & post vblank.
The pre syscall just sets a flag on the appropriate vblank table, and another if its interval is 0.
The post syscall checks whether it’s in a modeset & if so resets the vblank (via a couple of driver methods) within an “IRQ” lock (no hardware interruptions) before “putting” it (by disabling the timer if it’s the last to deref a refcount) if the 0x2 bitflag is set.
There’s also drm_mode_getresources, which reads various properties from the device & its files, CRTCs, encoders, & connectors.
Graphics Execution Manager/GPU RAM allocation
drm_gem_close finds and revokes the process’s reference, in between which it lets the GPU driver dereference the object itself, via its close or gem_close_object methods, before clearing its Direct Memory Access handle and its access control, and (within a lock) dereferencing and possibly freeing it via an IDR (compressed trie) tree.
drm_gem_flink looks up the GEM object from the process’s IDR tree before allocating a new reference for it within the driver’s IDR tree whilst holding a lock.
drm_gem_open first checks if the GEM object has already been opened (whilst holding a lock), and if not allocates a new handle in its place using the open or gem_object_open driver methods.
Then there are IOCtls (drm_prime_fd_to_handle & drm_prime_handle_to_fd) which convert between GEM handles and file descriptors (useful for handing renderbuffers over to the window manager); they turn out to just call driver methods if those are provided, and otherwise error with ENOSYS.
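That dispatch-or-error pattern is simple enough to sketch (simplified from the real PRIME plumbing):

```c
/* Sketch: the core just forwards to the driver hook, erroring
 * with ENOSYS when the driver doesn't implement PRIME. */
static int prime_handle_to_fd(struct drm_device *dev,
			      struct drm_file *file_priv,
			      u32 handle, u32 flags, int *prime_fd)
{
	if (!dev->driver->prime_handle_to_fd)
		return -ENOSYS;
	return dev->driver->prime_handle_to_fd(dev, file_priv, handle,
					       flags, prime_fd);
}
```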
Output encoding
drm_mode_getplane_res copies information from each of the device’s “planes” for which it holds a “lease” to userspace.
drm_mode_getcrtc looks up the CRTC configuration from an IDR tree and copies data out of it to userspace whilst holding a lock. drm_mode_setcrtc does the reverse with plenty more checks (+ 2 driver methods), allocating memory where needed.
drm_mode_getplane & drm_mode_setplane work pretty much the same as set/get CRTC, except on the set side it mainly calls a choice of driver methods.
drm_mode_cursor advances an X/Y “cursor” or initializes it using a couple of driver methods.
The drm_mode_gamma_ get/set pair are relatively simple property accessors with fewer checks; the setter also calls the gamma_set driver method.
drm_mode_getencoder & getconnector also copy various properties to userspace, the latter calling .fill_modes().
DRM_IOCTL_MODE_ATTACHMODE & DRM_IOCTL_MODE_DETACHMODE are now noops, successfully doing nothing.
drm_mode_getproperty looks up the property in an IDR tree and copies it to userspace according to its type. drm_property_lookup_blob does similarly in another map, but doesn’t need to worry about types.
drm_connector_property_set is translated into a drm_mode_obj_set_property IOCtl.
drm_mode_getfb looks up a framebuffer, copies its data to userspace, & calls .create_handle().
drm_mode_addfb2 creates a framebuffer using a driver method and adds it to a list whilst holding a mutex; there’s an older version which converts its input to v2.
drm_mode_rmfb looks up the framebuffer and removes it from the device file’s framebuffers list whilst holding a lock. After which it dereferences it and possibly queues up a free routine.
drm_mode_page_flip finds, validates, and locks the necessary data to call the page_flip[_target] driver method.
drm_mode_dirtyfb looks up the specified framebuffer, copies input into kernel space, and calls its .dirty() driver method.
drm_mode_create_dumb validates its input and calls the .dumb_create() driver method.
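From userspace the dumb-buffer path looks like this (a hedged example against the real UAPI headers; error handling trimmed):

```c
#include <sys/ioctl.h>
#include <drm/drm.h>
#include <drm/drm_mode.h>

/* Ask the kernel (& thus the driver's .dumb_create()) for a
 * 1024x768 32bpp scanout buffer. */
int create_dumb(int fd, struct drm_mode_create_dumb *out)
{
	struct drm_mode_create_dumb creq = {
		.width = 1024, .height = 768, .bpp = 32,
	};

	if (ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &creq) < 0)
		return -1;
	*out = creq;	/* handle, pitch & size now filled in */
	return 0;
}
```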
drm_mode_mmap_dumb hands off directly to .dumb_map_offset(), or falls back to the GEM infrastructure, which ensures the memory’s allocated in the GPU’s RAM & creates a new GEM object referencing it.
drm_mode_destroy_dumb works very much like drm_mode_mmap_dumb, with its fallback ultimately calling .close() or .gem_close_object() before actually freeing it.
drm_mode_obj_get_properties is split into two parts: first it looks up the appropriate object (in an IDR tree), then it looks up each of the specified properties on that object, via different else-if chains for different object types.
drm_mode_obj_set_property does the inverse, setting the specified properties on the specified object. This time it looks up a property ID and might call a .commit() driver method. And setting the individual properties might or might not be more involved as well.
drm_mode_atomic is a related IOCtl which allocates and initializes some state for atomically updating the display configuration.
drm_mode_createblob allocates and initializes a BLOB structure, and adds it to the relevant collections.
And drm_mode_destroyblob looks up the specified BLOB, removes it from those datastructures, and decrements the refcounts.
Synchronization
drm_syncobj_create checks bitflags to determine whether it’s currently supported before allocating/initializing a syncobj and returning an IDR-tree reference to it. These syncobjects are reference counted and wrap more general “memory fences” useful for any driver.
drm_syncobj_destroy removes it from the IDR tree and decrements the refcount.
drm_syncobj_handle_to_fd converts a syncobj it has looked up in the relevant IDR tree into a “sync file” descriptor, wrapping the memfence with normal file syscalls and freeing the syncobject upon close. drm_syncobj_fd_to_handle does the exact inverse.
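As a userspace illustration of the create/destroy pair (hedged; assumes a render-capable /dev/dri file descriptor):

```c
#include <sys/ioctl.h>
#include <drm/drm.h>

/* Create a syncobj, then immediately destroy it again; the
 * handle is the IDR-tree reference the text describes. */
int syncobj_roundtrip(int fd)
{
	struct drm_syncobj_create create = {0};
	struct drm_syncobj_destroy destroy = {0};

	if (ioctl(fd, DRM_IOCTL_SYNCOBJ_CREATE, &create) < 0)
		return -1;
	destroy.handle = create.handle;
	return ioctl(fd, DRM_IOCTL_SYNCOBJ_DESTROY, &destroy);
}
```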
drm_syncobj_transfer reads like 2 IOCtls in 1, determined by the dst_point argument. If that’s set it finds the syncobj, waits on its memfence, and allocates a new point for its timeline. Otherwise it transfers the memfence to another syncobj.
drm_syncobj_wait finds all the specified syncobjs, checks if any have been triggered already, and if not waits for one of those memfences to be triggered, or for a timeout to fire, derived from the “timeline” if present - otherwise it’s as requested by userspace. There’s also a “timeline” variant of this, but I’m not seeing how that differs.
drm_syncobj_reset replaces all the memfences of the specified syncobjs with new ones. & drm_syncobj_signal replaces them with a stub memfence.
drm_syncobj_timeline_signal finds all the specified syncobjs, copies all specified points into kernelspace, allocates a new “chains” array, and adds those points/memfences to the syncobjs and new chains.
And drm_syncobj_query reads the current state of each specified memfence and copies it into userspace.
Sequences
drm_crtc_get_sequence looks up the specified CRTC and its vblanking flags so it can copy that information (partially under a lock) into userspace. drm_crtc_queue_sequence allocs/inits a new event before sending it (by calling its completion_release method, signalling its memfence, and enqueueing it onto its device file’s event list) or enqueuing it.
Leases
As for “leases”, which are a kind of lock a central daemon can claim over some GPU operations, there are four IOCtls that can be used to manage them:
drm_mode_create_lease performs various security checks, looks up the specified objects, allocs/inits the lease/DRM “master”, adds it to a (new) IDR tree, & duplicates the file descriptor to authenticate the caller.
drm_mode_list_lessees iterates over the lease holders under a lock & copies their IDs to userspace.
drm_mode_get_lease copies the object IDs of all listed leasable objects into userspace.
And finally drm_mode_revoke_lease finds the lessee in the appropriate IDR tree & removes them from both the IDR tree & the linked-lists.
Conclusion
Most of it straightforwardly gets & sets various device-independent properties, and/or dynamically dispatches to driver methods. Locks are often involved, as are IDR trees (compact tries) for resolving userspace references. Several of these are guarded by bitflags, checked either by generic or IOCtl-specific logic.
To handle older display output protocols (compatible with Cathode Ray Tubes) it has a lookup table for vblanking delays & other flags, & it implements a simple “Buddy” Memory Allocator (augmenting that of the corresponding userspace library “libdrm”) to manage the dedicated RAM in a subsystem called GEM (Graphics Execution Manager).
It uses standard memfences to synchronize with the GPU, and has linked lists to manage collections, vitally including the queue of “events” to be received by the device file’s read() syscall.
Radeon Driver
Those IOCtls wrap methods provided by the GPU device driver, and since I use a Radeon GPU I’ll be discussing that driver.
(Un)loading & Opening/Closing
Linux’s Radeon GPU driver has a method load_kms (Kernel Mode Setting) to configure the hardware for a given output encoding. I will describe how it works this morning.
It starts by allocating memory for the new configuration, reading in flags from the PCI bus (+ an internal Radeon property), and initializing/verifying its properties, outputting any errors to a special debugging file. This includes setting callbacks & flags for the different versions of these GPUs. And it registers debug files.
Having initialized all those fields it runs some tests (I won’t go into details) to make sure things work and are performant, and registers it with the PM runtime or VGA hardware.
Then it initializes the modeset, including its “dynamic” properties which can be looked up via an IOCtl; this’ll also init its atombios or combios I2C bus + other hardware and check its EDID. It gets its Atom Connector from a variety of sources.
Next it initializes the ACPI data/hardware, sending ATCS & ATIF signals. And if this is PX hardware, it’ll then reconfigure the PM runtime.
Not that I have much of an idea of what much of this code really means.
The second method of Linux’s Radeon GPU driver, which I’ll cover this morning, is open_kms.
This synchronizes with the PM (whatever that is!) and allocs/inits an internal structure and GEM (Graphics Execution Manager) buffer objects, whilst setting up GPU virtual memory tables if supported (by a preceding check and callback methods) and enqueueing the operation on a shared ring buffer.
Once everything’s initialized, it reserves some additional Buffer Object memory, adds it to the device’s virtual memory tables, computes the CPU & GPU virtual addresses for that page, adjusts an interval tree (not dissimilar to a binary search tree) and the Buffer Object to match, and frees it all on failure.
All of this only happens for newer Radeon GPUs that support virtual memory; on older ones it just synchronizes, finishing up with the following shared steps.
In either case it calls into the PM runtime (again, whatever that is!) to mark this GPU as the last busy device and configures autosuspend.
The third method of Linux’s Radeon GPU driver, which I’ll describe this morning, is postclose_kms.
This synchronizes with the PM runtime, unsets the hyperz_filp & cmask_filp properties under a lock if they’re the same as the device’s own file_priv value, and atomically frees each UVD & VCE handle using memfences, noting any uses whilst communicating this to the GPU using Buffer Objects.
For newer GPUs with virtual memory support it’ll try reserving some additional buffer object memory. If that fails & an accel_working flag on the GPU driver is set, it’ll remove that buffer object (by removing it from a kernel-space linked list & the GPU’s interval tree before adding it to a free list or unreferencing its memfence), & unreserve it via the TTM infrastructure.
In either case if that flag’s set it’ll deallocate that virtual memory table.
For those GPUs which support virtual memory it’ll also free & unset the driver_priv property.
Then for all Radeon GPUs it calls into the PM runtime to mark the GPU as being the last busy device & enable autosuspend.
Linux’s Radeon GPU driver’s lastclose_kms method is called once no more processes are using the device, according to a refcount. Though most of the work has already been done per-process in postclose_kms.
This’ll start by resetting the mode of the framebuffer emulation, using other driver methods, Direct Rendering Manager properties, & locks whilst disabling its “planes”. Another method “commits” this new state once adjusted.
Then it switches the VGA state by looking it up from a linked list by stored ID, finds an active client from another list, and reconfigures its audio (via the client’s set_gpu_state method), framebuffer console (under its lock, switching the method table for debugging messages), & power state (optionally calling into the PM runtime). Between which it’ll call the VGA’s (under lock) switchto & the client’s reprobe methods. Then it’ll set the active flag.
Complementing its load_kms method, Linux’s Radeon GPU driver has an unload method.
This starts by checking if it’s already been unloaded, and if it’s a PX-series device calls into the PM runtime to synchronize with and forbid the device. Then it frees its ACPI (unregistering from the bus), modeset (if flagged as initialized, unscheduling polling, calling an hpd.fini method, resetting the CRTC config, and freeing memory), device (disconnecting shared [virtual] memory & buses), and its own memory.
VBlanking
Starting my study of Linux’s Radeon GPU driver’s VBlanking methods, this morning I’ll describe get_vblank_counter. VBlanking is a delay between output signals during which analog Cathode Ray Tube circuits would move their electron beam back to the top; digital displays didn’t care to switch away from that protocol until recently, with HDMI.
This starts by looking up the given CRTC pipe & (once its count stabilizes) stats. Gathering stats involves reading the GPU’s registers via RAM & performing bitwise ops.
To get the count itself it calls a callback method that was configured whilst initializing the bus. And once it has successfully/stably read the stats & count, it’ll check some bitflags on those stats to output debugging information.
And if an inactive CRTC is specified it’ll just read the count without the stats, with a debugging message that this count may be wrong.
The second VBlanking method of Linux’s Radeon GPU driver, enable_vblank_kms, is there to enable the VBlanking interrupt. I’ll describe it this morning.
This validates that the specified CRTC number is in the valid range, and within an interrupt (IRQ) lock it sets a flag for that CRTC & sets the interrupt via a callback method.
And the disable_vblank_kms method does the exact same thing but unsets the flag.
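This enable/disable pair is a textbook IRQ-safe spinlock pattern, roughly (struct, field, & callback names approximated, not the driver’s exact ones):

```c
#include <linux/spinlock.h>

struct my_radeon {			/* hypothetical slice of the device struct */
	int num_crtc;
	spinlock_t irq_lock;
	bool crtc_vblank_int[6];
	void (*irq_set)(struct my_radeon *);
};

/* Flip the per-CRTC vblank-interrupt flag & push the new mask
 * to hardware, with CPU interrupts masked so the interrupt
 * handler can't race the update. */
static int enable_vblank(struct my_radeon *rdev, int crtc)
{
	unsigned long flags;

	if (crtc < 0 || crtc >= rdev->num_crtc)
		return -EINVAL;

	spin_lock_irqsave(&rdev->irq_lock, flags);
	rdev->crtc_vblank_int[crtc] = true;	/* the flag */
	rdev->irq_set(rdev);			/* the callback method */
	spin_unlock_irqrestore(&rdev->irq_lock, flags);
	return 0;
}
```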
The fourth VBlanking method of Linux’s Radeon GPU driver is get_vblank_timestamp, which I’ll describe this morning.
This validates its input and calls get_scanout_position for a maximum number of retries until it has responded fast enough, passing either the VBlank or CRTC hardware mode depending on a device flag. Then it does some simple math to convert the output.
That get_scanout_position (the fifth & final VBlanking method) reads from certain GPU registers depending on the general Radeon GPU model & the specified pipe/CRTC. Bitwise operations are used to decode the response, and timestamps are taken to measure the latency of these reads, as used by get_vblank_timestamp.
CPU Interrupts
Linux’s Radeon GPU driver’s first method handling CPU interrupts is irq_preinstall_kms.
This unsets a bunch of flags (supposedly to disable the to-be-installed interrupts) within an interrupts lock. Then it calls callback methods to set the CPU interrupts (within that lock) & to process any received interrupts (outside the lock).
The postinstall method meanwhile sets the max_vblank property to 0x00ffffff or 0x001fffff based on whether it’s an AVIVO (family >= CHIP_RS600) model.
Linux’s Radeon GPU’s third method for managing its CPU interrupts is irq_uninstall_kms.
This unsets a bunch of flags before calling the callback method for setting the interrupts, all within a CPU interrupts lock, very similar to what happens before installing an interrupt.
The fourth & final method is irq_handler_kms, which gets called when the GPU triggers the interrupt. This just dispatches to a callback method, and on success tells the PM runtime to mark this device as last busy.
IOCtls
Today I want to describe the IOCtls specific to Radeon GPUs, which are looked up by the generic Direct Rendering Manager IOCtl method in a driver-specific table.
This starts with a whole bunch of deprecated IOCtls, all of which now successfully do nothing and are flagged to require authentication.
Most of the other Radeon-specific IOCtls are related to the GEM subsystem for managing the GPU’s dedicated RAM, but there are also CS & INFO IOCtls.
RADEON_GEM_INFO copies various properties from the Radeon device into userspace, including mman.bdev.man[TTM_PL_VRAM]->size << PAGE_SHIFT, with the appropriate pin sizes subtracted from the output vram_visible & gart_size properties.
RADEON_GEM_CREATE rounds up the desired size, creates the GEM object (in turn validating the size, creating the buffer object, & adding it to a linked-list), and wraps it in a GEM handle, immediately returning any errors, all whilst holding a readlock.
RADEON_GEM_MMAP looks up the GEM object from the appropriate IDR tree, converts it to a Radeon buffer object, gets its userptr if it has the right method table, & gets its mmap offset.
when passed the RADEON_GEM_DOMAIN_CPU
domain tells the DMA subsystem to wait on that Radeon buffer object’s memory. For other domains it does nothing, possibly erroring.
RADEON_GEM_PREAD & RADEON_GEM_PWRITE both complain about being unimplemented & error with ENOSYS.
RADEON_GEM_WAIT_IDLE calls the DMA subsystem to wait on the looked-up Radeon buffer object, and if the tbo.mem.mem_type property is RADEON_GEM_DOMAIN_VRAM & it can, it calls the mmio_hdp_flush bus callback method.
RADEON_CS initializes an on-stack parser, its instruction buffer, & its relocations buckets from userspace data. It then processes that instruction buffer by parsing it via a callback, synchronizing on each ring’s memfences, and enqueueing instructions on a locked ring.
Enqueueing an instruction may involve allocating a new Virtual Memory ID, synchronizing those rings again, flushing the virtual memory tables to the GPU via its ring buffer, and executing the instruction via another bus-specific callback method.
RADEON_INFO copies a specified Radeon-specific property over into userspace.
And back to the GEM-related IOCtls: RADEON_GEM_SET_TILING validates its input & sets the specified flags on the looked-up Radeon buffer object; likewise, in reverse, for its GETter.
RADEON_GEM_BUSY checks with the DMA subsystem on the looked-up Radeon buffer object, and gets its (converted) tbo.mem.mem_type property.
RADEON_GEM_VA validates its input before reserving additional memory for the looked-up Radeon buffer object and adding it to the list of GPU virtual memory tables if necessary.
RADEON_GEM_OP gets or sets the initial_domain property of the looked-up Radeon buffer object.
And finally, RADEON_GEM_USERPTR validates its input before setting certain properties, registering their values with the Memory Management Unit, optionally doing additional validation with extra reserved memory, and wrapping the GEM object in a GEM handle, all whilst holding a readlock.
GEM methods
Here I’ll cover Linux’s Radeon GPU driver’s main GEM (Graphics Execution Manager) methods.
The first is gem_free_object_unlocked. This bitcasts the provided GEM object to a Radeon buffer object, unregisters it with the Memory Management Unit’s notifier, & decrements its refcount via the TTM subsystem.
The second is gem_open_object, which reserves additional memory for the provided/bit-cast Radeon buffer object & if necessary adds that memory to the GPU’s virtual memory tables.
gem_close_object meanwhile reserves that memory, decrements its refcount in the virtual memory table, removing it if necessary (by taking a number of locks & removing it from the appropriate lists & the interval tree before adding it to a freelist), before finally unreserving that memory via the TTM subsystem.
Dumb methods
radeon_mode_dumb_create memory-aligns the input sizing info (branching upon the cpp, allocating at least one page), creates the GEM object (via the TTM & kernel-alloc subsystems, storing the process’s PID & adding to a mutex-locked linked-list), & wraps it in a handle (via an IDR tree, VMA nodes, the GEM object’s open method or the driver’s gem_open_object, & locks).
& radeon_mode_dumb_mmap looks up the appropriate GEM object/Radeon buffer object, checks if it has a userptr property set, and returns its VMA node’s vm_node.start property bit-shifted by the constant PAGE_SHIFT, to be memory-mapped into userspace by the caller.
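That shift is the whole trick: the VMA manager hands out page-granular offsets, and userspace mmaps the byte address. Roughly (using the real drm_vma_node_offset_addr helper):

```c
#include <drm/drm_gem.h>
#include <drm/drm_vma_manager.h>

/* The fake mmap "offset" returned to userspace is just the VMA
 * node's page index scaled up to bytes; the driver's mmap
 * handler later matches it back to the buffer object. */
static u64 dumb_mmap_offset(struct drm_gem_object *obj)
{
	/* == (u64)obj->vma_node.vm_node.start << PAGE_SHIFT */
	return drm_vma_node_offset_addr(&obj->vma_node);
}
```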
File operations
As for directly handling file syscalls, it’s mostly the same as for any GPU (as described earlier). Though MMap calls into the TTM subsystem & IOCtls additionally call the PM (Power Management?) runtime.
Prime handle / File Descriptor Conversion
prime_handle_to_fd looks up the specified DRM GEM object (in an IDR tree/compressed trie) & DRM prime buffer (within a Red-Black balanced binary tree) within a lock. Within an additional lock it’ll then get the file descriptor of its import_attach->dma_buf or dma_buf properties, or call other driver methods (and the DMA subsystem) to “export” it.
Before releasing the second lock, prime_handle_to_fd will add the DMA buffer & handle to the prime-handles Red-Black tree, and prior to unlocking the first it’ll clean up any memory after getting a file descriptor from the DMA buffer. The DMA buffer is managed by an external subsystem.
prime_fd_to_handle meanwhile gets the DMA buf from the provided file descriptor, locks that file to look up the prime handle, and takes another lock to “import” the buffer via either the gem_prime_import or (with more preparation) gem_prime_import_sg_table driver methods. Then it creates the prime handle (by adding it to the appropriate IDR tree and VMA nodes before calling the object’s open or driver’s gem_open_object methods) and buf handle (by adding it to the RB-tree).
NOTE: These are the generic Direct Rendering Manager method implementations the Radeon driver is using here.
GEM Prime
gem_prime_export just adds a check for the TTM userpointer before resuming the generic processing from Linux’s Direct Rendering Manager subsystem, which just hands off to the Direct Memory Access subsystem and increments some refcounts.
gem_prime_pin casts its input GEM object to a Radeon Buffer Object, which it pins whilst temporarily “reserving” it.
Reserving/unreserving the Radeon Buffer Object is done via the TTM subsystem (whatever that is!).
And the actual pinning is done by first checking whether the buffer object has a userpointer. If it has already been pinned, it increments that count and gets + verifies the “GPU offset” to memory-map.
Otherwise it verifies the prime_shared_count and domain, computes each “placement”, has the TTM subsystem validate it, and on success increments the pin count & updates the pin sizes.
Unpinning meanwhile involves decrementing the pin count and, upon reaching zero, unsetting the “placements”, revalidating with the TTM subsystem, & upon success subtracting from the pin sizes. Also whilst “reserving” the Radeon buffer object.
gem_prime_get_sg_table casts the input GEM object to a Radeon Buffer Object to get the num_pages property, whose data is required by the generic Direct Rendering Manager implementation, which kmalloc’s the output & initializes it with sg_alloc_table_from_pages.
This scatterlist is defined in lib/scatterlist.c as it’s used by various drivers, and serves to collapse contiguous pages into a single entry.
To import one of these SG Tables, it creates a new Radeon Buffer Object (which stores the SG table) within a resv lock & adds it to a linked list of GEM objects within the device’s GEM lock.
gem_prime_vmap & gem_prime_vunmap both hand directly off to the TTM subsystem once their input has been cast to a Radeon Buffer Object from a GEM object.
Bus methods
Different versions of Radeon GPUs use different I/O protocols, so Linux’s drivers for them have two layers of methods. The first layer (called by the Direct Rendering Manager subsystem) manages memory and other higher-level concepts specific to the Radeon family. The second knows how to read and write data directly to/from the GPU.
open
Linux’s Radeon rs780 driver’s init method starts by initializing a new debugging device file outputting r600/rs780-specific registers, which it reads via the appropriate data bus, writing a request if necessary.
Then it reads in the BIOS information from ATRM, falling back to ACPI VFCT, IGP VRam, etc., exposing the BIOS as a u8 array. If it can’t, it errors out and the driver doesn’t start; otherwise it verifies the BIOS state and checks whether it’s ATOM.
In the case of r600/rs780 it expects an ATOM BIOS, which it’ll then proceed to initialize with methods for reading and writing data & I/O registers (via the CPU’s I/O bus, which may or may not also be the RAM bus, and may involve an INDEX register for extending the number of registers supported), methods to directly hand off to other driver methods, and information:
- identifying the device,
- parsed from the ATOM BIOS
- mutexes
- scratch registers
Next it checks the CPU BIOS and various GPU registers to see if the card has been “posted”. If not it errors out for a missing GPU BIOS, or initializes the ATOM ASIC by reading some registers and having ATOM execute a “table” or two.
With the GPU BIOS running it now gets to the bulk of initialization, including:
- Referencing scratch registers into a new array
- Clearing all the surface_regs or their referenced buffer objects, including via the TTM & DMA subsystems.
- Getting the clock info parsed from the ATOM (or COM-) BIOS, or falling back to OpenFirmware. Then it’ll initialize various properties upon success, failure, or independent thereof.
- Initialize memfence properties & memory rings, and schedule work to verify it’s working.
- Initializes AGP if available on this device by calling into the generic “DRM” implementation to acquire it and read its info before reading more info out of registers & driver properties, then enables AGP and a flag register. By default AGP refers to a global variable and an enable method thereof.
- Disable AGP if the hardware was available but (5) failed, by reconfiguring some methods to older logic operating directly on GPU registers.
- Read the GPU’s memory controller registers.
- Initialize GPU buffer objects by calling into the “arch”, TTM, & DMA subsystems before having it allocate some initial memory and creating a debugging filesystem for it.
- If required, initialize the GPU with the appropriate microcode.
- Figure out the appropriate Power Management mode and initialize it with hardware monitors, background heat monitoring work, mutexes, debugging device files, etc.
- Allocate a ring buffer and verify that it supports scratch registers as claimed.
- Similarly initialize a UVD ring buffer with added background work, firmware, and buffer objects.
- Allocate a r600 ring but don’t validate it supports scratch registers.
- Allocate GART pages & (as buffer objects) GPU VRAM.
- Enable PCIE Gen2 by reading/writing/verifying GPU registers.
- Initialize VRAM as pinned buffer objects.
- Write initialization code to GPU registers.
- Enable “PCIE GART” or “AGP” by writing into appropriate GPU registers.
- A bunch more reading & especially writing of GPU registers and driver properties for r600_gpu_init.
. - Allocate WriteBack structures as Buffer Objects.
- Start the GFX Index ring by resetting a scratch register, and writing to a memfence.
- Start the UVD subsystem & the corresponding ringbuffer, zeroing the ring on error.
- If “installed”, initialize the CPU interrupts by initializing a spin lock, the DRM VBlank (for Cathode-Ray-Tube-compatible output protocols) subsystem, the MSI PCI-protocol, and various background work, before installing the CPU interrupts (via the IRQ subsystem wrapped by other driver methods) and flushing delayed work on error.
- Initializes r600/700 IRQs, reading & writing GPU registers to disable them, writing to GPU registers to resume the evergreen or r600 IRQs (dealloc’ing the ring buffer on failure), writing to more GPU registers, disabling the evergreen or r600 registers with yet more GPU register writes, calling into the PCI subsystem, and setting ENABLE flags in the IH_CNTL & IH_RB_CNTL GPU registers. Cleans up on failure.
- Set the IRQs by reading and especially writing more GPU registers.
- Allocs/Init the WriteBack ringbuffer as a buffer object.
- Load the CP microcode via more GPU register writes.
- Resume the CP via more register writes and the TTM subsystem.
- Resume the UVD subsystem and it’s ring buffer.
- Init the Indirect Buffer (with a waitqueue & buffer objects) depending on whether it’s CHIP_BONAIRE or newer, starting it (by pinning & mapping the buffer object whilst it’s TTM-reserved), and creating a debugging filesystem for it.
- Figure out how many audio pins the GPU card model has, start the audio driver, & enable those pins.
- On failure, it uninitializes most of the previous steps.
fini
- Fini the Power Management infrastructure whether it’s DPM or “old” by (for DPM) removing any device files, disabling (within a lock) & finalizing the DPM, finalizing the hardware monitor, and freeing the memory. “Old” also needs to manage the CRTC clock & PM “profiles”.
- Disable the audio driver & (by possibly recounting them & calling an optional audio driver method I won’t dig into) all its “pins”.
- Stop the CP infrastructure (via register writes & the TTM subsystem), finalize its ringbuffer (unsetting properties within a lock, & un-mapping/pinning it as a buffer object whilst reserving it via the TTM subsystem), & mark its scratch registers as free.
- Disable CPU interrupts (by unsetting some GPU register bitflags, reading acknowledgement GPU registers, & writing to more GPU registers), stop the RLC (by unsetting a specific GPU register or two), & finalize the Interrupt Handling ringbuffer (by un-pinning/mapping it as a buffer object whilst reserved & decrementing the refcount).
- If it has been set up, finalize the UVD subsystem, its ringbuffer, other Buffer Object, & firmware.
- Unset the WriteBack enabled flag & un-pin/map its corresponding Buffer Object.
- Suspend (by un-map/pin-ing it as a Buffer Object) & finalize (by signalling the memory fences, freeing the corresponding CPU-side linked lists, & decrementing the Buffer Object’s refcount) the SA BO Manager.
- Disable generic Radeon CPU interrupts by calling into the DRM (Direct Rendering Manager) subsystem generic to all GPUs, disabling MSI in the PCI subsystem, and “flushing” any hotplugging work (the DRM subsystem has interrupts for VGA & Cathode-Ray-Tube-compatible output protocols), before calling another GPU method.
- Finalize the GART tables by freeing their memory (in part via other driver methods per page & the PCI subsystem), unset the appropriate GPU registers, & free + unref the Buffer Object.
- Free/unref the VRAM scratch Buffer Object.
- Release the AGP backend if attached.
- Forcefully release any remaining GEM Buffer Objects.
- Within a lock & for each ringbuffer: wait on its memfence (by checking various atomic pointers before waiting for a CPU interrupt), cancel/wake up any delayed work, & mark its scratch registers as free.
- Finalize the Buffer Objects by removing their debugging filesystems, unpinning the stolen_vga_memory buffer object, calling into the TTM subsystem, finalizing GART (again), & calling into the Arch subsystems.
- Free the ATOM BIOS info.
suspend & resume
suspend follows these steps:
- Suspend Power Management within a lock by resetting some state. For DPM it calls another lower-level driver method, and for older GPUs it (outside the lock) cancels some delayed work.
- Finalize the audio drivers as per fini.
- Stop the CP ringbuffer via the TTM subsystem & GPU register writes.
- If attached, finalize the UVD subsystem, its handles, & their memfences.
- Disable CPU interrupts & unset corresponding GPU register(s) like upon fini, but don’t free the ringbuffer.
- Disable the WriteBack buffer by unsetting a driver flag.
- Write to GPU registers to disable GART tables, & un-map/pin its buffer object.
resume on the other hand reinitializes the ATOM BIOS, resumes Power Management (also possibly setting up voltages & clocks), & repeats steps (15)-(31) of init.
vga_set_state & asic_reset
vga_set_state, which is also shared by many of the other versions, simply sets or unsets bitflags in the CONFIG_CTL GPU register.
The following method, asic_reset, starts by checking various bitflags in the GPU registers and, depending on which are set, sets various bitflags in the GPU registers (which it might wait on to take effect), including in R600_BIOS_3_SCRATCH.
It’ll then check those GPU registers again to see whether it should do a hard reset, and a third time to verify that it has been reset. Or the caller can tell this function to do the hard reset procedure instead of all of this.
If those bitflags are not set upon the third check it unsets the hung bitflag in the R600_BIOS_3_SCRATCH register (which it had set previously).
To do a hard reset it’ll write to various GPU registers, and bitflags thereof, with hardcoded delays, and via the PCI subsystem.
Some of the aforementioned pauses are dynamically dispatched to other driver methods I might cover later, but others are busy-loops repeatedly checking a bitflag.
Miscellaneous
Continuing my exploration of Linux’s Radeon rs780 driver’s methods: mmio_hdp_flush writes to the R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL GPU register.
gui_idle checks a bitflag in the GRBM_STATUS GPU register.
mc_wait_for_idle busy-waits on a bitflag in the R_000E50_SRBM_STATUS GPU register, with 1-microsecond delays.
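That busy-wait is a pattern this driver leans on constantly; a hedged sketch (the bitmask here is illustrative, not rs780’s actual one, and RREG32 is the driver’s register-read accessor):

```c
#include <linux/delay.h>

#define SRBM_BUSY_MASK 0x3f00	/* illustrative, not the real mask */

static int mc_wait_for_idle(struct radeon_device *rdev)
{
	unsigned i;

	for (i = 0; i < rdev->usec_timeout; i++) {
		/* Poll the status register & test the busy bits. */
		if (!(RREG32(R_000E50_SRBM_STATUS) & SRBM_BUSY_MASK))
			return 0;	/* memory controller is idle */
		udelay(1);
	}
	return -1;	/* timed out */
}
```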
get_xclock returns a CPU-side driver property.
get_gpu_clock_counter writes a request (1) to RLC_CAPTURE_GPU_CLOCK_COUNT and reads the response from the RLC_GPU_CLOCK_COUNT_LSB & MSB GPU registers, all within a dedicated mutex.
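Combining the two 32-bit halves under that mutex looks roughly like this (mirroring the text; WREG32/RREG32 are the driver’s register accessors, and the mutex name is approximated):

```c
#include <linux/mutex.h>

static uint64_t get_gpu_clock_counter(struct radeon_device *rdev)
{
	uint64_t clock;

	mutex_lock(&rdev->gpu_clock_mutex);
	/* Ask the RLC to latch the clock counter... */
	WREG32(RLC_CAPTURE_GPU_CLOCK_COUNT, 1);
	/* ...then read it back as two 32-bit halves. */
	clock = (uint64_t)RREG32(RLC_GPU_CLOCK_COUNT_LSB) |
		((uint64_t)RREG32(RLC_GPU_CLOCK_COUNT_MSB) << 32);
	mutex_unlock(&rdev->gpu_clock_mutex);
	return clock;
}
```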
And get_allowed_info_register checks which register the caller wants read, and if it’s within a certain set it reads it.
GART
Linux’s Radeon rs780 driver has three methods grouped under the method table’s gart property, all of which are used for managing memory paging.
gart.tlb_flush writes bitflags to the R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL, VM_CONTEXT0_INVALIDATION_LOW/HIGH_ADDR, & VM_CONTEXT0_REQUEST_RESPONSE GPU registers, before busy-waiting on a bitflag in VM_CONTEXT0_REQUEST_RESPONSE until a timeout.
gart.get_page_entry merges its addr & flags arguments into an encoding understood by the GPU.
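That merging is plain bit-twiddling; a hedged sketch (the flag names & bit layout here are illustrative, not rs780’s actual page-table encoding):

```c
#include <stdint.h>

#define GART_PAGE_VALID     (1u << 0)	/* hypothetical flag names */
#define GART_PAGE_WRITEABLE (1u << 1)

/* Merge a page-aligned bus address with permission bits into a
 * single page-table-entry word the GPU can walk. */
static uint64_t gart_page_entry(uint64_t addr, uint32_t flags)
{
	uint64_t entry = addr & ~0xFFFULL;	/* keep the page address */

	if (flags & GART_PAGE_VALID)
		entry |= 1ULL << 0;
	if (flags & GART_PAGE_WRITEABLE)
		entry |= 1ULL << 1;
	return entry;
}
```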
And gart.set_page just calls writeq to send the signal out on the I/O bus (which may just be a write to a memory location).
Ring Buffers
Today I’d like to describe Linux’s Radeon rs780 driver’s methods for managing the ringbuffers used for communication with the GPU. For version rs780 there are three sets of these methods communicating over different channels; I’ll at least cover the ones communicating via the GFX ring today.
Here ib_execute writes instructions to the Indirect Buffer’s ring buffer to push a new function onto its call stack.
emit_fence writes instructions to flush the read and other caches, and maybe more.
emit_semaphore writes a simpler instruction into the semaphore’s ring buffer.
I’ll have to cover cs_parse later, which translates Mesa3D’s standard bytecode to that supported by Radeon rs780 GPUs.
ring_test finds a free register, locks the ring buffer CPU-side, and writes a SET_CONFIG instruction. And ib_test sends that instruction through the Indirect Buffer in order to test that it works.
is_lockup checks the same GPU-register bitflags as for GPU restarts, before timing an atomic read.
get_rptr reads a GPU register or WriteBack ringbuffer property. get_wptr reads the R600_CP_RB_WPTR register, & set_wptr sets that register before waiting on a read to complete.
The dispatch tables for communicating over Direct Memory Mapping or the UVD ring buffer are implemented very similarly.
Shader Programs
Radeon GPUs do not fully support the standardised bytecode format provided by userspace/Mesa3D to the Radeon drivers. These bytecodes need to be further lowered in kernelspace.
This morning I’ll describe how that happens, though it can depend on the communication channel to the GPU.
It starts by alloc/init’ing a “track” and copies certain driver configuration properties over to it. It’ll free this once the compilation has completed.
For each bytecode in the provided Indirect Buffer it first “parses”/decompacts it (possibly erroring out) before lowering (and parsing additional fields of) it depending on its “type” property.
For RADEON_PACKET_TYPE0 it iterates over every AVIVO_D1MODE_VLINE_START_END instruction contained, checking that its arguments are supported, including by looking up the CRTC number. If that CRTC is not enabled the instruction is zeroed out, and otherwise it’s slightly rewritten.
RADEON_PACKET_TYPE2 is not rewritten in any way.
And RADEON_PACKET_TYPE3 has different logic for each opcode, each validating its arguments (possibly by calling r600_cs_track_check to validate external encoding issues) and in some cases performing minor rewrites according to a relocations list. These relocations in part involve validating the target instruction & converting from CPU addresses to GPU addresses. For other opcodes it errors.
Also, it’s worth noting that when using DMA communication channels, there are additional opcodes to lower.
CPU Interrupt Handling
irq.set first validates that some required flags have been set on the driver, before reading various GPU flags based on the version of the GPU hardware. Then it (mostly atomically) reads additional flags from various driver properties & ringbuffers, encodes that data into GPU registers depending on the hardware version, and finally syncs on another.
irq.process meanwhile first checks a couple of flags & possibly synchronizes on the IH_RB_WPTR register. Then it reads the current write pointer either from that register or from shared memory before checking for and fixing buffer overflows.
Then it grabs a (manually-written) lock, reads the ring buffer’s read pointer, reads then writes various GPU registers to acknowledge this data has been read. After which it can iterate over all items currently in the interrupts ring buffer.
For each item in that ring buffer, it examines the opcode (as one or two words: src_id & src_data) to validate, set driver properties, and/or enact that operation.
This includes D1/D2 VBlank (calling down into DRM & Radeon-specific handling which uses GPU registers, other methods, & locks)/VLine/PFlip (by setting some driver properties & using DRM to send events to userspace), HPD/DAC hotplug, various ring buffers, thermal high-to/from-low, & logging GUI idles.
Having interpreted all those items in the interrupts ring buffer, it schedules hotplugging/HDMI/thermal work based on flags set by the interpreter (the handlers are implemented elsewhere), updates the read pointer, & if new items have been added it jumps back to the interpreter.
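The consume loop at the heart of irq.process reads roughly like this sketch (the struct, field names, & the 16-byte stride are approximated from the r600 code, not copied from it):

```c
struct my_ih {				/* hypothetical slice of the IH state */
	u32 *ring;
	u32 rptr, ptr_mask;
	u32 (*read_wptr)(struct my_ih *);
	void (*write_rptr)(struct my_ih *, u32);
	void (*dispatch)(struct my_ih *, u32 src_id, u32 src_data);
};

/* Walk the interrupt ring from the read pointer to the write
 * pointer, dispatching each (src_id, src_data) pair, then
 * publish the new read pointer back to the GPU. */
static void ih_process(struct my_ih *ih)
{
	u32 rptr = ih->rptr;
	u32 wptr = ih->read_wptr(ih);

	while (rptr != wptr) {
		u32 src_id   = le32_to_cpu(ih->ring[rptr / 4]) & 0xff;
		u32 src_data = le32_to_cpu(ih->ring[rptr / 4 + 1]) & 0xfffffff;

		ih->dispatch(ih, src_id, src_data);
		rptr = (rptr + 16) & ih->ptr_mask;	/* ring wraps via mask */
	}
	ih->rptr = rptr;
	ih->write_rptr(ih, rptr);	/* acknowledge consumption */
}
```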
Memory Copying
copy.blit, copy.copy (they both refer to the same C function), & copy.dma first sync (using a newly-created radeon_sync object) & lock the specified ringbuffer, before writing instructions and data to that ring buffer.
The ringbuffers to use are specified alongside these methods.
Surface Registers
surface.set_reg & surface.clear_reg are not yet implemented.
HPD
hpd.init iterates over each “connector” in the driver’s mode_config, and for each checks its connector_type (as some types break certain ring buffers), sets a hardware-version-specific GPU register, builds up an enable bitmask, and sets its polarity via hpd.set_polarity.
After that loop, it updates a driver property for those enable flags & calls irq.set within the CPU interrupt lock.
hpd.fini iterates over those “connectors” again, unsetting the GPU register appropriate to the hardware version and building up “disable” bitflags, which’ll be used to update that driver property before calling irq.set again under the CPU interrupts lock.
hpd.sense reads the appropriate bitflag from a GPU register for the specific hardware version.
& hpd.set_polarity (as used in hpd.init) sets or clears the appropriate GPU register bitflag depending on hpd.sense’s result.
Power Management
pm.misc looks up the currently configured requested_power_state_index/requested_clock_mode_index from the power_state table & if that configuration says to, adjusts the voltage by running a GPU “ATOM” BIOS program with the given parameters.
pm.prepare disables the active CRTCs by setting the AVIVO_CRTC_DISP_READ_REQUEST_DISABLE bitflag in their appropriate GPU registers.
pm.finish enables those CRTCs again by unsetting that DISABLE bitflag on them.
pm.init_profile (a new addition for the rs780!) builds that table referred to by pm.misc depending on how many power_states (2, 3, or otherwise) are desired. The “default” profile is initialized with other Power Management configuration outside these tables.
pm.get_dynpm_state sets three PM config properties conditionally, based on the CRTC count, planned action, device model, flags, power state level, etc.
pm.get/set_engine_clock both run programs in the GPU’s BIOS to get/set this data.
The pm.get/set_memory_clock, pm.get/set_pcie_lanes, & pm.set_clock_gating methods are all NULL, indicating to the caller to run some fallback logic.
pm.get_temperature reads and decodes the CG_THERMAL_STATUS GPU register.
& pm.set_uvd_clocks writes control signals to 2, 3, or 4 GPU registers and computes “clock dividers”, before writing the results to various registers with occasional pauses.
DPM
Most of these are specific to rs780 rather than 6xx more broadly.
dpm.init parses various “ATOM” BIOS headers to set various driver properties.
dpm.setup_asic does nothing.
dpm.enable starts by retrieving the refresh rate from the first enabled CRTC, then writes to CG_INTGFX_MISC to disable BIOS powersaving. If the GLOBAL_PWRMGT_EN bitflag of the GENERAL_PWRMGT GPU register is set, it errors out. Then it sets various GPU registers for the DPM parameters (which may error out), computing clock dividers where necessary, before writing to more GPU registers to enable various aspects of DPM, waiting for vblanks, etc. where necessary. If a desirable voltage_control is set, it prepares tables before setting the GPU registers. Finally it enables clock scaling & the “program at” via more GPU registers.
dpm.late_enable checks the irq.installed & pm.int_thermal_start flags; if set, it programs the GPU registers for the thermal temperature ranges & CPU interrupts.
dpm.disable sets the GPU registers to disable the dynamic power management (as done during dpm.enable) and clock scaling (a subset of what’s done during dpm.enable), & possibly triggers CPU interrupts.
dpm.pre_set_power_state does nothing, as does dpm.post_set_power_state. dpm.set_power_state similarly sets various GPU registers.
dpm.display_configuration_changed gets the first enabled CRTC’s VRefresh rate (copied to the refresh_rate driver property) and writes that to the “program at” GPU registers FVTHROT_TARGET_REG & FVTHROT_CB1/2/3/4.
dpm.fini frees the memory used to store the DPM parameters.
dpm.get_sclk returns the pm.dpm.requested_ps->ps_priv->sclk_low or ->sclk_high property depending on the value of its low argument.
dpm.get_mclk returns the pm.dpm.priv->bootup_uma_clk property.
dpm.print_power_state printk’s various subproperties from that ps_priv property for debugging. Where this info goes is very much a topic for another time.
dpm.debugfs_print_current_performance_level printk’s out a handful of properties chosen based on the decoded FVTHROT_STATUS_REG0 & CG_SPLL_FUNC_CNTL GPU registers.
dpm.force_performance_level writes to various GPU registers partially determined by the provided level, possibly incorporating computed clock dividers.
dpm.get_current_sclk reads & decodes the FVTHROT_STATUS_REG0 & CG_SPLL_FUNC_CNTL GPU registers. dpm.get_current_mclk returns the pm.dpm.priv->bootup_uma_clk driver property.
Page Flipping
pflip.page_flip writes an AVIVO_D1GRPH_UPDATE_LOCK bitflag to the AVIVO_D1GRPH_UPDATE GPU register corresponding to the specified CRTC as a lock, before writing to three more corresponding registers, busy-waiting on a different bitflag there, & releasing the lock.
pflip.page_flip_pending simply returns the AVIVO_D1GRPH_SURFACE_UPDATE_PENDING bitflag from the corresponding AVIVO_D1GRPH_UPDATE GPU register.
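The lock/write/wait/unlock dance reads roughly like this sketch (register names from the text above; the surface-address register, per-CRTC offsets array, & exact masks are approximations, not rs780’s actual ones):

```c
/* Sketch of pflip.page_flip: take the hardware update lock so
 * the new scanout address lands atomically, write it, wait for
 * the pending flag to clear, then unlock. */
static void page_flip(struct radeon_device *rdev, int crtc, u64 base)
{
	u32 reg = AVIVO_D1GRPH_UPDATE + crtc_offsets[crtc];

	WREG32(reg, RREG32(reg) | AVIVO_D1GRPH_UPDATE_LOCK);

	/* One of the "three more corresponding registers". */
	WREG32(AVIVO_D1GRPH_PRIMARY_SURFACE_ADDRESS + crtc_offsets[crtc],
	       (u32)base);

	while (RREG32(reg) & AVIVO_D1GRPH_SURFACE_UPDATE_PENDING)
		cpu_relax();	/* busy-wait, as the text describes */

	WREG32(reg, RREG32(reg) & ~AVIVO_D1GRPH_UPDATE_LOCK);
}
```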