GHC statically links a C/C– library into every Haskell executable. This page describes how that executable performs each of it’s tasks.
Perhaps the most inarguable component of GHC’s RunTime System is the (de)allocation of memory, for your programs to store all it’s data in regardless of whether it has been fully computed yet.
GHC’s root allocator, upon which the other(s) are implemented, allocates large chunks of memory at a time independant of e.g. free()/malloc(). Thus making it more resiliant when combined with other libraries & OSs.
There’s two variations of this for whether it needs to track which chunks are invalid. Let’s say it doesn’t.
In which case it first gets those “megablocks” then marks them as allocated in a bitmask possibly split by higher order bits.
Allocating the megablocks on POSIX is a call to
mmap(), with additional error handling and pointer alignment. To free a megablock it calls
munmap() & clearing the bits in the bitmask(s). The bitmask is for introspection (by e.g. profilers) or shutdown, not for reusing memory.
There’s also code to bind memory to certain cores/CPUs via
mbind() unless debugging is enabled.
Beneath the megablock allocator is a “block allocator” which works a lot like
free() usually are implemented by bucketting into linked-lists holding memory within some power of 2. Except it tries to keep allocation close to O(1) at the expensive of deallocation.
Allocation must be rapid, it’s done constantly and thus can easily become a bottleneck!
Consecutive free memory gets merged. Allocation counts are kept.
Upon those allocators is implemented “arenas” where a free pointer is incremented by varying ammounts until it reaches a limit. Which makes allocation instentaneous at the cost of not being able to free that memory individually.
Finally there’s some (computationally expensive) code the GHC devs can enable to traverse the heap & assure themselves that the it remains correctly structured.
Those allocation optimizations came at the cost of deallocation. The main “arena allocator” doesn’t even support deallocation!
So how does it deallocate memory? And how does it figure out what should get deallocated?
Scattered prominantly throughout the compiled code (and the scheduler, to be discussed later) is calls to GarbageCollect().
After significant scheduling/profiling/debugging initialization, it starts by collecting the “weak pointers” which is an aside I’ll tackle another day. It postprocesses all new “blocks” allocated for each CPU.
The Haskell heap is split into a configurable number of “generations”, so that when one overflows it’s collected alongside any lower generations. New objects are allocated into generation 0.
At this point the generations are initialized depending on if they’re being collected.
Initializing the generations involves setting various properties, freeing mutable blocks (if it’s “non-moving”, a concept for tomorrow), and makes sure all memory is associated with a generation. Older generation mutable blocks are stashed aside & their state validated.
It then initializes a thread for the GC, and a mark stack if it’s a full collection. It increments the number of running GC threads & wakes them all up (if threading’s enabled). Followed by more debugging code.
After all that setup it can now start the actual collection! It iterates over the root objects (“mut lists”), examines it’s metadata reference, & (unless it’s “clean”) traverses it’s fields to copy them to an older generation. Or increment their “age” & copy to a new arena in the same generation.
If it encounters any “static objects” it’ll deduplicate them. A write barrier is enforced here.
Compact & large objects are handled specially. Everything else gets flagged & pushed to stack.
When copying a “thunk” to a new arena/generation it’ll overwrite the old data (under a writebarrier) with a forwarding pointer. If the computed value for a lazily-computed thunk doesn’t fit it’ll also write a forwarding pointer. These forwarding pointers are removed as it “evacuates” a heap, immediately before actually copying that data over to the new heap.
The arena allocations are manually inlined here.
There’s some concept of black & white holes, and atomics when threading’s enabled.
At this point, after traversing the heap, it’ll free the copy-collected roots. Then find more roots to evacuate from the interpretor & scheduler, after which it handles weak and stable pointers (again, I’ll discuss them another day).
That traversal often pushes items to be traversed onto a stack to be copy-collected later in a concurrent copying GC. Now’s that later (with special handling of weak pointers). Afterwhich these GC threads need to be shutdown.
Now we’ve completed the collection, and are on to tidy up.
- If threading’s enabled “sparks” are pruned (later topic).
- If profiling’s enabled it computes a census, it may gather a census of uncollected thunks.
- It sums some other counts to report performance data to the GHC devs, followed up by per-generation data.
- It iterates over generation collected & frees the evacuated arenas. And uncollected large objects.
- It compacts certain data (later topic).
- It finalizes & validates generations state.
At this point there’s some special finalization for non-moving heaps (tomorrows topic) & it computes the new size of older generations.
- It frees the mark stack & any bitmaps.
- It resizes the “nursery” generation 0 (reallocating them) & “resets” them assigning scheduler capabilities.
- It marks GC’d CAFs from the interpretor as dead.
- It updates stable name tables & unlocks them.
- For full GCs it’ll traverse the heap to determine whether it can unload any object code.
- It’ll schedule any weak pointer finalizers, run a census if enabled,
- It sends exceptions to threads being killed.
- For full GC’s it’ll gather additional counts and check whether there’s memory to return to the OS via munmap() on UNIX.
- It restores cost-center data.
- If enabled it’ll generate an inventory.
- It’ll report the end of GC collection to the profiler.
- And finally it releases remaining locks.
I described GHC as having a concurrent copying generational garbage collector. This is heavily optimized for the case your Haskell program is constantly spewing out memory allocations which rarely last long, the “Strong Young Generational Hypothesis”. You may view that as inefficient, but it’s what the runtime expects.
But what if that isn’t appropriate? What if in the oldest generation it’s faster to free what little is now dead than to copy out what’s still alive?
A benefit of the copying allocators is that by not worrying about freeing thunks individually we could make allocation little more than incrementing a pointer. That allocator is inlined into the Haskell compilation! But the optional & final nonmoving generation needs a different allocator to be called whilst evacuating younger generations.
In this allocator free memory is stored in different linked lists for each power of 2, for contiguous blocks of the same size to be merged.
After all the younger generations are collected, this nonmoving generation if enabled (it isn’t by default) is collected specially.
To do so if no other garbage collectors are running, it:
- Resizes the younger generations.
- Resets some bitflags on all objects.
- It allocates a “mark queue”.
- It iterates over all the root objects from different sources & pushes them to a mark stack if they’re in this nonmoving generation. Largely same as normal GC.
- Traverse the data graph referenced from those roots, handled a bit differently upon shutting down multithreading. This involves: a) It checks which memory blocks are already fully collected. b) It runs the main depth-first (a.k.a. queueing up items in a FIFO stack) graph traversal with a little post-processing for weak pointers, etc.
- It collects dead weak pointers, then reruns the graph traversal one last time.
- Frees emptied blocks.
- If threading’s enabled, schedule finalizers & resurrect threads.
- If debug’s enabled, it processes the interpretor’s “CAFs”.
- Validates the mark queue has completed. (Validation is very necessary in this RunTime System for the GHC devs to maintain their sanity!)
- Resets the threads & weakpointers states.
- If threading’s enabled, prunes the “spark” queue. Sparks are a topic for another time.
- Releases more states, lists, capabilities for the marking.
- Frees the mark queue used during traversal.
- Resizes the younger generations again.
- It now begins the process of freeing the thunks not marked as live, and reports this to the profiler.
- It frees the large objects directly via the block (de)allocator, then compacted objects, then stable names. The later two will be discussed next time?
- Iterate over the non-moving heap to find consecutive dead objects to push onto freelists.
- Finalize any threading & profiling.
Weak pointers in GHC allows you to keep a value alive as long as a value it references remains alive. And the allow you to run a finalizer (written in Haskell or C) under a global lock when it dies. This may be integrated into GHC’s scheduler for performance. ForeignPtrs are implemented upon WeakPtrs, and threads are garbage collected via weak pointers.
The linked list of weak pointers is checked after & between the main GC, which it calls into to keep it’s main value alive.
Stable Pointers & Names
The garbage collector moves pointers around to keep allocation fast & optimize for “wasteful” use of memory. But if you want to expose a Haskell thunk to C, that code won’t be able to handle the pointer moving. So we need “Stable Pointers”.
Stable pointers are allocated via a linked list into a locked increased-on-demand arena, and are pointers to the unstable pointers. The GC treats these as another source of “live” roots.
A similar concept of “stable names” are for pointer equality.
There’s a concept of an “allocation pool” which manages a freelist around a provided allocator, but I don’t see what that’s used for. Or how Haskell code can call it.
For performance Haskell code may reencode certain data as non-lazy
Compact# regions, which are self-contained and treated as a single object by the garbage collector. A hashtable may be involved in this reencoding for deduplication.
Calling Haskell Functions from C
Sometimes you want to call a Haskell function from C code, at which point you’d need to change “calling conventions”. Because the way you call a Haskell function differs from how you call a C function.
The code to call C functions from Haskell is generated at compile time, possibly using LLVM. C code is also generated vice versa, but it calls APIs defined by the RunTime System.
There’s also APIs for JIT’ing function pointers to call arbitrary Haskell callbacks…
Before calling a Haskell function you have to write each of it’s arguments into a new “thunk” for this particular call. Though I’m not making much detailed sense of the C macro which is called to do this, it’s not called elsewhere in the RunTime System…
Though the C– handles different types of thunks differently, possibly even calling intermediate functions which should have already been compiled away. And there’s some concept of “INFO TABLES” there for the different types of thunks…
“Adjusters” are GHC RunTime’s API for JIT’ing Haskell callback functions into C function pointers. The tricky part is receiving the C arguments in dynamically-allocated/generated code. It has a raw implementation for X86 & otherwise uses LibFFI.
Once it’s decoded the C arguments from that calling-convention, they can be “applied” to the Haskell function before calling it.
Haskell has a function named “error” which crashes your program with an error message, calls to which gets implicitly generated for incomplete patterns. This could easily be implemented as a foreign C function via the exit() syscall, but what if you want to recover from errors?
For that the GHC runtime system, and C– itself, is extended.
Ultimately allowing you to call
catch in IO functions. In pure Haskell though you should prefer straightforward types like
The first files I’m reading relate to outputting nice error messages, and allowing alternate callbacks to be run instead. The most complicated bit of which is for reading the callstack & printing it out, but that can be disabled at compiletime.
Another component allows throwing errors to other userspace threads if they’re still alive whilst they’re not actively running by messing with it’s callstack. Or maybe enqueueing on a linked list.
Looking at the “primitive functions” exposed to the Haskell standard libraries via GHC.Prim, there’s:
unmaskAsyncExceptionswhich untoggles some bitflags & dequeues any exceptions enqueued whilst those bits were enabled.
maskAsyncExceptionstoggles those bits on, as does
getMaskingStatechecks those bits.
throwToC function mentioned earlier to raise an exception on another thread.
killMyselfraises an exception.
Haskell has a fundamental rule that (except for the MVar API) you cannot mutate any data, only construct new variants of the data. Which’ll usually share most of the same memory safe in the knowledge that won’t change out from under it.
This immutability means you can introduce concurrency wherever you like in your program without messing anything (except C APIs) up. I’ll explain how this works, regarding keeping enough OS threads running despite C hazards.
To start GHC’s concurrency, some mutexes & linked lists are initialized for the scheduler so it can complete this work under
Next it initializes “capabilities” to hold per-core state (either configurable or kernel-provided count), counts & thread safe “task” reference for the “task manager”, before starting the per-core threads via pthread & corresponding “task” structures under their locks.
Profiling messages are written throughout this process.
Upon startup each thread retrieves it’s capability under a lock, optionally configures it’s hardware characteristics & sets that thread-safe pointer to this task.
Then it (re)allocates a fresh “incall” before starting the scheduling. Once that ends it gives the capability to another thread, removes the task from the scheduler’s linked lists, & frees the associated memory. All under the capability’s lock.
SIGALRM (on UNIX) is used to trigger preemptive multitasking, and to dynamically schedule garbage collection.
Each iteration starts with a check that suspendThread()/resumeThread() calls are balanced (which are used to signal to the scheduler to work around possible blocking I/O calls).
Next it responds to the process of shutting down according to the value of
sched_state. Upon interruption it’ll run a full garbage collection, & upon the following SHUTTING_DOWN it’ll check whether it can exit now.
Then it schedules handlers for any OS signals received.
Next it runs each “message” in the capability’s “inbox”, before each batch of which it considers whether to run the garbage collector. The next batch is all the messages which comes in whilst processing the current ones. After each batch it iterates over the relevant MVars & fills them with
Running the message differs depending on it’s type. For TRY_WAKEUP it reinserts the thread into the scheduler’s list. On error it’ll call the exception handling code described previously.
There’s also blackhole & whitehole messages, which are mechanisms by which GHC detects certain forms of infinite loops. Or the message might have been revoked, in which case it shouldn’t do anything.
Next if there’s blocked messages it’ll marshal the scheduling information into & out of a call to the
select() system call. And if there’s any “sparks” (tomorrow’s topic) it’ll create a thread to run those.
Then it figures out how many of the threads to activate & dispatches tasks to them.
Now it tries to detect deadlock situations & sees if it can resolve it with e.g. garbage collection.
It’ll check if it should sleep during which it might perform some garbage collection. If there’s no work it’ll repeat this massive mainloop.
Now it dequeues a userspace thread to run, verifies it can run it now & that we’re not shutting down this scheduler.
It starts a heap profiling timer & eventually runs the haskell code under a flag indicating so.
It closes off the iteration by checking for blackholes (e.g.
a = a in a), closes any profiling, catches certain synchronization issues to be thrown as errors, reallocates any memory limits it ran into (or throws an error), updates the thread’s state & position in the linked list, & does a final garbage collection & heap profile.
CPU Bound Parallelism
There’s two basic types of concurrency: I/O bound & CPU bound. Modern OSs (and the scheduler described yesterday, as well as people selling you hardware) do not distinguish these, which adds performance overhead. Haskell however does & that’s where it’s userspace threads can make a huge improvement!
CPU bound concurrency can only be sped up by utilizing as many CPU cores you have available, and is the topic of today’s much briefer discussion.
If you were wandering what the term “spark” meant from yesterday, it’s these non-IO threads. They are run by a tight loop run as a thread on-demand by the scheduler written a Haskell (because that’s what the scheduler expects), which repeatedly calls back into the C runtime to dequeue the next function to call.
It exits as soon as there’s any less-speculative IO threads to run.
Tasks are queued amongst atomic deques per OS-thread, so it can “steal” work from other threads when idle.
“Fizzled”/expired sparks are discarded & possibly profiled, and if it failed to find anything it’ll determine whether it should retry upon failing to steal work from another thread. It won’t if there’s a lack of work, & it will in the event of synchronization issues.
The deque itself is a ringbuffer with atomic top & non-atomic bottom counters to be incremented & decremented. The top is incremented when stealing an item, the bottom when adding a new one. Decrement top upon pop.
I/O Bound Concurrency
The IO-bound scheduler is also implemented in Haskell, though the root userspace scheduler written in C can also do the same work.
I/O bound tasks rarely requires actual “parallelism” with multiple cores, just dispatching events to the correct coroutines. Kernels provides some system call or other to receive those events.
The IO scheduling threads are run on startup or whenever starting a new OS thread, if threading’s enabled.
It creates a “backend” to wrap the
kqueue() system call & manage it’s state. GHC’s runtime will additionally store a mutable array of the filedescriptor events to listen for & where to dispatch them, a filedescriptor for other threads to wake this one up, a counter, & a lock.
Once that mainloop is initialized it forks a new OS(?) thread, tells the C-based RunTime System which file descriptor it should use to wake this thread, & starts looping.
With the lock held after checking the thread state it repeatedly runs the systemcall 3 times before checking the thread state & either yielding (a GHC primitive function returning control to the root userspace scheduler described above) or closing the loop/releasing the lock.
If the first syscall fails to turn up anything it yields before the second. The third syscall is blocking.
The Prelude I/O APIs will call into this mainloop to register event handlers when an action would’ve blocked.
Alongside the I/O schedulers, GHC’s RunTime System starts a singular thread for scheduling timeouts. (Again, written in Haskell)
This tracks the upcoming timeouts in a minheap (binary tree with the invariant that each branch has a lower priority than it’s children) to identify expired timeouts and how long to sleep for on each iteration. It uses the
poll() syscall so this thread can be awoken on a special file descriptor when the upcoming tasks change.
The branches of the minheap are ID’d, with bitmasks directing edits to the correct branch.
System.IO.timeout public API will call into this scheduler via
GHC.Conc.IO.threadDelay, or directly on
mingw32_HOST_OS systems. Or if threading’s disabled it’ll use a primitive function to call into the root scheduler.
There’s a few more miscallaneous topics to cover regarding concurrency before move onto discussing synchronizing. Starting with some (mutex-locked) globals.
Code’s added around your
main function to catch & report uncaught exceptions even if they occur during error reporting. A mutex-locked global reference to this thread is saved for other threads to propagate their errors to it.
And there’s a mutex-locked global hashmap of what to label each thread in debugging reports.
A linked-list of the state of all OS threads seen by the GHC runtime is tracked with their Haskell callstack. Each task references the ID of the OS thread (which is configured to match the scheduling params of the associated “capability”), and each OS thread has a thread-local variable to the task.
OS “conditions” are used to wake up a OS thread when needed, and blocking syscalls are tracked.
A seperate task is tracked for the garbage collector.
Finally it’s worth noting that as threads go to sleep, they readjust their callstacks to remove contiguous “update frames” (and handle errors). In doing so it checks for infinite loops (lazily or eagerly marked as “blackholes”, via atomic instructions), or cases where the same data’s being computed on another thread.
An “update frame” is a computation who’s value will get cached once lazily computed. Contiguous ones updates the same “thunk”.
GHC’s RunTime System provides synchronization mechanisms to help you make use of the results of concurrency.
This includes a locked hashmap of non-atomic reader/writer locks on open file descriptors, allowing multiple readers or a single writer to ensure a consistent view of the filesystem. I’ve been tripped up by this before…
It implements legacy atomics functions in ARM assembly. But mostly it implements Software Transactional Memory.
Software Transactional Memory
Software Transactional Memory (STM) lets you commit & rollback your RAM reads/writes just like you would in a relational database. Since Haskell can prevent STM transactions from having any other side-effects helps the rest be implemented in GHC’s RunTime System!
It implements STM by maintaining a linked-list of mutations to apply when committing.
To commit it locks the TVars having checked all the TVars haven’t changed, then writes their new values & unlocks them. To rollback it discards this list.
Here I’ll describe inter-thread messaging, which I’ve already touched upon when describing the root userspace scheduler & the exceptions system. Because it’s run as part of that scheduler.
Each thread has linked-list “inbox” of “messages”, which upon each iteration of the scheduler it interprets.
There are several types of messages which will be described in the following toots…
MSG_TRY_WAKEUP examines why the thread was blocked to undo it, updates it’s state, & adds it to the scheduling linked-list.
MSG_THROWTO` checks whether this thread is already dead, why it’s blocked, & whether exceptions are currently blocked. If they’re not blocked it rewrites the callstack, otherwise adding the exception to a linked list to be examined upon unblocking exceptions. Then it might have a closure to unlock.
A thunk being marked as a blackhole indicates that it’s final value is currently being computed. This might indicate an infinite loop if that value is computed from itself, or it might indicate it may be more efficient to wait for that data to be computed in another thread before finishing computing the current one. In which case a message is sent!
The appropriate thread will then examine the blackholed data’s thunk type & possible enqueue computation onto it’s linked list. Or wakes the thread.
If it wakes the thread, that’s done the exact same way as
TRY_WAKEUP. And the enqueued tasks blocked on a blackhole are stored in yet another linked-list.
Some messages it’s reading might be revoked & should be skipped. Whilst “whiteholes” redirects to another message to be evaluated as per above.
This is a bytecode interpretor which comes with a dedicated disassembler converting it into text you can read for debugging. Not that I see much to comment on in that disassembler…
The interpretor is run from the scheduler, and after some validation it iterates over each bytecode by examining each object’s type.
- Indirect pointers repeat this branch for the referenced object.
- Data constructors becomes C returns.
- Returning callback functions may also involve allocating or duplicating special objects for profiling.
- Calling a function involves checking whether there’s enough stackspace & maybe performing some profiling before pushing to the update stack & repeating the loop to interpret it.
- Otherwise it might gather stats & profiling before returning to the scheduler.
To return a value, it examines the stack pointer.
- If the top is
RET_SMALLit’ll convert the response into the appropriate function call. For which it’ll interpret the thunk specially, expecting to see some bytecode.
UPDATE_FRAMErewrites the called thunk to cache the computed value.
RET_BCOmight do some garbage collection &/or profiling before running the bytecode.
- Otherwise it’ll return to the scheduler.
To run each opcode in Byte Code Objects, after some profiling:
- For breakpoints it checks whether to return to the scheduler indicating the breakpoint
- For stack checks it checks whether to return to the scheduler asking for more stackspace to be allocated
- It can push various types onto that stack, including possibly multiple return targets for the callee to jump to. Or newly-allocated memory
- It can rearrange the stack
- It can pop an arg & callee to combine them
- Expand top-of-stack into it’s fields, or collapse them
- Conditional opcodes branching on the top-of-stack, or two.
- Evaluate the thunk at the top-of-stack, where this discussion started. May first decide to yield to the scheduler for the sake of it.
- Returning the top-of-stack works as described previously.
- There’s an opcode for calling C functions, which mostly involves use of LibFFI.
- And ofcourse
Usually GHC will use the GNU linker to resolve cross-module goto opcodes, but if you’re running the GHCi/Template Haskell interpretor it’ll use a built-in one. This built-in linker needs to know where all the functions (identified by e.g. 2 numbers) are located in memory.
This is tracked in a mutex-locked hashmap.
It also refers to a compile-time generated list of built-in function names to their callbacks.
Every “code object” being linked also includes yet another hashmap from function names to value & it’s type, they’re not all global.
Some libraries rely on being able to store globals (even if multiple versions are somehow loaded), so there’s a mutex-locked array in which to store those. The RunTime System lists these supported globals in an enumeration, for which it generates onetime-set accessors.
Even the simplest programs today are assembled from multiple “libraries”, and there needs to be there to combine them!
To do so for each
ObjectCode needing linking (stored in a linked list & flagged) it starts by constructing a hashmap which has been populated with the builtins upon startup. Whilst validating lack of duplicates.
Once it knows all the destinations for machine code jumps, it iterates over the header fields of the executable file. Different code is compiled depending on whether it’s outputting a ELF, PEi386, or MachO format. But as it does so it may recurse back into this whole linker to load new libraries in.
Bitwise operations are used to update the machine code for jumps to directly reference it’s target.
After that rewrite if it hasn’t failed those jump opcodes the mprotect() syscall might then be called to prevent the code from being further modified, as mutable code is a prime candidate for security vulnerabilities. This might involve a dedicated memory (de)allocator for the linker when applied to e.g. ELF files.
The headers for the executable format are then examined again to e.g. allocate mutable memory segments it declares.
If that hasn’t failed the
ObjectCode is flagged and we’re done.
Here’s some of the standard “thunks” the GHCi/Template Haskell interpretors refer to in order to implement datastructures & other function calls.
Of the simpler cases, there’s ones for:
- Extracting a certain field (offsets < 16) from a pointer
- Forward the call with n-arguments to another function
- Handle stack underflows & overflows
- save/restore “cost centers”
- Return to the branch for some variant < 8
- Convert between compiled & interpreted calling conventions
- “Blackholes” marks values that are being computed. Upon call if it’s not an infinite loop (same thread), it’ll enqueue the task to be run after it’s done computing.
- “Whiteholes” represent locked closures & acts as a yielding spinlock.
- Various thunks used directly by the interpretor, GC, etc errors if they somehow get called.
- Some thunks for int, char, & bool literals are compiletime-allocated. Other functions converts to/from C types.
Care is taken to handle laziness.