Compiling C & C++ with GNU Binutils & GCC

How do the C(++) programs you write, and all the C programs making up an OS, get converted into the raw machine code your CPU’s Control Unit can directly understand? Primarily via GCC & GNU Binutils.

Many languages also follow a similar process, or they may have special programs for “interpreting” their bytecode.

GNU Binutils

“Binary” executable files contain more than the raw machine code your CPU’s control unit can directly understand. There’s additional labelling for our sake, & for linkers (though I haven’t gotten through that code yet).

Whilst today’s topic GNU Binutils contains a suite of commands, the core of most of them is libbfd. LibBFD’s header files are predominantly C structs & macros defining the numerous formats (including archive files) GNU supports; it also implements method tables for them.


Usually the read & write implementations for each format don’t take that much code, but ELF for some reason takes a lot more. Can anyone please explain what ELF is about?

LibBFD also provides e.g. a hashmap implementation that is central to many GNU Binutils commands.

There’s several trivial (or at least as trivial as C gets) commands, though I’ll have to discuss the major commands over the next two days…

ELF “object” format

I don’t know the other commonly used formats well, but ELF is very complex.

ELF is basically a memory representation, that’s why it can be used for programs or core-dumps.

It describes memory sections and their contents, and can also provide other extra data, like symbols and debug headers.

Ekaitz Zarraga

Actually ELF offers two complementary views into the binary:

The section header, meant to be used by the linker at build time, tells what are the contents of each section of the binary (e.g., symbols, relocation entries, debug info).

The program header, meant to be used by the loader (e.g., ld.so) to create the process image, tells which parts of the binary (“segments”) go where in memory, and which sections they are made of.

Thiago Jung Bauermann
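
To make those two views concrete, here’s a minimal sketch of my own (not Binutils code) that dumps where both tables live using the structs from <elf.h>; readelf -h, -S & -l report the same information far more thoroughly.

    /* Minimal sketch: peek at the two ELF views via <elf.h>.
       Assumes a 64-bit ELF; real tools (readelf, libbfd) handle all the
       variants, and error handling here is pared down for brevity. */
    #include <elf.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *f = fopen(argc > 1 ? argv[1] : "/bin/ls", "rb");
        if (!f) { perror("fopen"); return 1; }

        Elf64_Ehdr ehdr;
        if (fread(&ehdr, sizeof ehdr, 1, f) != 1) { fclose(f); return 1; }

        /* e_shoff/e_shnum locate the section header table (the linker's view),
           e_phoff/e_phnum locate the program header table (the loader's view). */
        printf("%d sections at offset %lu\n", ehdr.e_shnum, (unsigned long)ehdr.e_shoff);
        printf("%d segments at offset %lu\n", ehdr.e_phnum, (unsigned long)ehdr.e_phoff);

        fseek(f, ehdr.e_phoff, SEEK_SET);
        for (int i = 0; i < ehdr.e_phnum; i++) {
            Elf64_Phdr phdr;
            if (fread(&phdr, sizeof phdr, 1, f) != 1) break;
            if (phdr.p_type == PT_LOAD)   /* a segment the loader maps into memory */
                printf("LOAD file offset %#lx -> vaddr %#lx (%lu bytes)\n",
                       (unsigned long)phdr.p_offset, (unsigned long)phdr.p_vaddr,
                       (unsigned long)phdr.p_memsz);
        }
        fclose(f);
        return 0;
    }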

GNU ASsembler (GAS)

The GAS program (not to be confused with the recent GNU Assembly project) converts human-readable assembly code, resolves e.g. jump targets, & converts it into opcodes the machine directly understands.


After extensive initialization & parsing the commandline flags (some flags are specific to which CPU it’s assembling for; referenced files are parsed with their flag), it does sanity checks on input & output files.

Next it initializes all the datastructures used for resolving symbols, fragments, subsegs, reading, input scrubbing, expression parsing, macros, writing the output file (whilst opening it) via LibBFD as mentioned yesterday, dot symbols also via LibBFD, anything else specifically for the target CPU, optionally instruction tables, & DWARF.

Then it defines a “.gasversion” symbol & any others defined on the commandline, saving them all to a hashtable.

Now it may start the real work…


For Mach-O files (compiletime flag) it starts by creating text, data, & bss subsegments via LibBFD then sets their flags. Regardless it adds ABS, UND, & special GAS reg/exp subsegments. CPU architecture-specific logic follows, which for i386 includes initializing hashtables for instruction mnemonics, register mnemonics, & lexing.

Then it iterates over all source files listed in the remaining commandline arguments & for each…

After this iteration it cleans up & ends.


For each file it resets several other subcomponents possibly creating a debug file.

For each line, chunks of which are read in by the “input scrubber” (scanner), it first checks if it’s seeing a label & if so defines a new symbol for it. After skipping preceding whitespace it checks if the following identifier is followed by a :, =, or ==, otherwise looking it up in the pseudo-ops & macros tables to evaluate before parsing the opcode.

Non-identifiers may be empty, local labels, or comments.


Assembling each line of Assembly code (after labelling, pseudo-ops, & macros have been resolved) is CPU-specific. For i386 it:

GNU Linker

The GNU Linker subproject is probably the main reason Linux From Scratch has you install GNU Binutils first, as it’s what combines multiple “object files” into a single executable whether at build time (for reliability) or by Linux immediately prior to running the program (for memory efficiency).

GNU Binutils also contains a Gold subproject contributed by Google that does the same thing in C++ with multithreaded optimizations, but I won’t discuss that further.


After extensive initialization & finding the correct filepaths, which in part involves looking up a method table for the emulation, it parses the remaining commandline flags with help from the emulator. As part of this flags parsing it may edit the object file via LibBFD & its ldlang subcomponent (which has a stack), open/parse linker scripts via a Yacc/Bison-generated parser, and/or as per usual set globals.

This is followed by validation & LibBFD adjustments.

During this validation/finalization it calls some additional methods on the emulator & loads plugins.

ldlang

The ldlang subcomponent may now run to reflect its complex linked-tree-based datamodel in LibBFD, with hooks for those plugins & the emulator.

After this initial linking pass it gathers the list of everything to be linked, from which it can build start & stop sections before applying all the gathered assignments via LibBFD. By this point we’re traversing “expressions”.


Next it makes sure all symbols are associated with the correct sections & optionally removes any unreferenced symbols. There’s some validation which may also be handed to an emulator method. It marks any “wild” sections. It traverses the script to associate inputs to outputs. It evaluates resulting “insert statements”.

After another emulator hook it gives any remaining input sections somewhere to go. It merges any sections requesting that, then does the same for CTF sections, which it’ll then write.


GNU Binutils bundles another library for handling CTF regions.

It then copies LMA regions forward a section. It optionally defines __start & __stop sections early, followed by startof & sizeof sections.

An emulator hook is called before recording sections taking into account any VMA memory hierarchy.

It locates the relro section if present, then computes the memory sizes of each section in the linked tree.

After another emulation hook it configures the __start & __stop sections.


The ldlang subcomponent then evaluates all remaining assignments, calls a “finish” emulation hook, rewrites ldexp’s symbols table to reference sections, validates the result, checks whether required symbols are defined, & performs some final adjustments.

Finalization

Having processed the “lang” datamodels it outputs any warnings & errors before writing the data out by:

It may then output the resulting information to configured files for debugging, etc and/or do more expensive validation, before cleaning up after itself & copying the file into its destination.

GDB

Perhaps the most famous subproject of GNU Binutils (which didn’t come with the download I have, nor would it be relevant to Linux From Scratch at this early stage) is the GNU DeBugger!

GDB’s vast codebase can reasonably be summarised as an event-driven commandline program following the Bridge object-oriented design pattern. That is, its behaviour is defined by the interaction between the appropriate implementations of:

P.S. Both GDB & GtkSourceView take care to provide all the hooks you need to tie the two together, so I’m serious about wanting that elementary GDB GUI! I think this was done mainly for GNOME’s monolithic IDE.

But the real logic is ultimately in the kernel’s (in our case Linux’s) ptrace() syscall.

ptrace() syscall

To handle GDB’s ptrace() syscalls:

  1. For PTRACE_TRACEME after security checks it sets a flag, possibly CPU-specifically “attaches”, & returns.
  2. Otherwise it finds the referenced process under a lock.
  3. For PTRACE_ATTACH & PTRACE_SEIZE it “attaches” (with some locks) by setting some flags, performing security checks, inserting it into a linked list, sending it a signal, updating the scheduler, & continuing with CPU-specific code.
  4. More locks & checks
  5. CPU-specific code…

On x86, Linux continues by checking which ptrace sub-call was made:

All other ptrace() sub-calls return to being largely non-CPU-specific:

To summarize: Having access to the scheduler which swaps the CPU’s state between the different “processes” makes it trivial for Linux’s ptrace() syscall to inspect the state of a process. Even if the exact structure being inspected is CPU-specific.

Breakpoints & instruction/block stepping, on the other hand, benefit from hardware assistance trapping back into the kernel from the debugged process.
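
Here’s a minimal sketch of that dance from userspace (my own illustration, x86-64 Linux assumed): the child volunteers via PTRACE_TRACEME, then the parent single-steps it & inspects its registers, which is essentially what GDB does at a much grander scale.

    /* Minimal tracer/tracee sketch on x86-64 Linux: the child requests
       tracing, the parent single-steps it & peeks at its registers. */
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);    /* flag ourself as traced */
            execl("/bin/true", "true", (char *)NULL); /* stops with SIGTRAP at exec */
            _exit(1);
        }

        int status;
        waitpid(child, &status, 0);                   /* wait for the exec() stop */
        for (int i = 0; i < 5 && WIFSTOPPED(status); i++) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            printf("rip = %#llx\n", (unsigned long long)regs.rip);
            ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);  /* one instruction */
            waitpid(child, &status, 0);
        }
        if (WIFSTOPPED(status)) {
            ptrace(PTRACE_CONT, child, NULL, NULL);   /* let it run to completion */
            waitpid(child, &status, 0);
        }
        return 0;
    }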

Multiprecision Arithmetic

To evaluate literal arithmetic at compiletime GCC, for some reason, uses arbitrary-precision arithmetic provided by libGMP, libMPC, & libMPFR. In some cases (global constants & preprocessing) this “constant propagation” is required, in others it goes far towards outputting concise machine code, because CPU opcodes can rarely embed more than one literal.

While it doesn’t use multiprecision arithmetic at runtime, using it at compiletime allows for greater reliability for metaprogrammers.

Multiprecision arithmetic basically implements the long-arithmetic notations you learnt in school. So does the “Arithmetic Logic Unit” circuitry in a CPU, but that can only handle a fixed number of bits. Software, on the other hand, can use arrays to store as many digits as it wants!

LibGMP

LibGMP’s MPN subcomponent microoptimizes by implementing this in Assembly, reimplemented for a wide variety of CPU architectures to take advantage of carry bitflags & other CPU features not exposed directly to C.

The MPF subcomponent wraps MPN to store the array’s length alongside its pointer, and adds preprocessing to speed it up further. It treats that length as a signed integer (the sign of the size field is really the sign of the whole number) & adds an exponent. It can also read/write textual notation. The exponent & “mantissa” are combined to create floating point numbers.

The MPZ subcomponent adds resizability beyond what MPF does (but not exponents), & MPQ wraps two MPZ numbers to form a fraction.


GMP also bundles:

It’s worth noting MPF’s maths resembles Scientific Notation, & more closely how your CPU’s Floating Point Unit works. The difference here is you get to decide how many bits to use. Whilst MPQ can store any Rational number, unlike MPF or your CPU.
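
As a quick taste of the documented public API all this adds up to (GCC drives these same routines internally, of course, rather than through a main() like this):

    /* Tiny taste of GMP's public API: 100! overflows any CPU register,
       but mpz_t just keeps growing its limb array. Link with -lgmp. */
    #include <gmp.h>
    #include <stdio.h>

    int main(void)
    {
        mpz_t n;
        mpz_init_set_ui(n, 1);
        for (unsigned i = 2; i <= 100; i++)
            mpz_mul_ui(n, n, i);          /* n *= i, reallocating as needed */
        gmp_printf("100! = %Zd\n", n);

        mpq_t third;
        mpq_init(third);
        mpq_set_ui(third, 1, 3);          /* exactly 1/3, unlike any float */
        gmp_printf("1/3 stays %Qd\n", third);

        mpq_clear(third);
        mpz_clear(n);
        return 0;
    }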

LibMPFR & LibMPC

LibMPFR (a separate project) wraps LibGMP’s MPN to implement its own scientific-notation-based floating point representation separate from MPF, adding additional trigonometric operators. And LibMPC wraps LibMPFR to implement Complex numbers, composed of two subnumbers one of which is conceptually multiplied by the square root of -1 before being added.

It will be interesting to discuss how trigonometry is implemented in-software, but I’ll do so elsewhere.

GCC

I’ve recently discussed GNU ASsembler which converts from a textual representation to the “machine code” your CPU’s Control Unit directly understands.

But these “Assembly Languages” require you to rewrite the same code for each CPU architecture you want to support. And to decide in which register, stack slot, or other pointer offset to store every single individual piece of data, including function arguments & returns.

GNU Compiler collection (GCC) provides a better abstraction.

I’ll focus exclusively on GCC’s C compiler, although GCC also compiles C++ & bundles frontends for Ada, HSAIL, D, Fortran, Go, & Objective-C/C++.


After multi-pass parsing of the commandline arguments GCC, through its TopLev C++ class, initializes localization, diagnostics (with colours & URLs), OS-specific handling, exit handling, the callstack, & allocation.

Further configuration is read in from specified files & environment variables. Unrecognized commandline arguments are reported with possible autocompletion suggestions. It outputs & exits for help/info commands.


After all this initialization, some of which is language-specific, and some profiling it calls compile_file. Then it outputs some diagnostics, tidies up, & exits.

compile_file (ignoring the time profiling) starts by running the frontend’s parser, possibly outputting computed locations to stderr, before lowering the resulting Abstract Syntax Tree to the “gimple” cyclically-linked-list SSA bytecode form described yesterday. A post-parsing language hook is then called before the backend.


With the program (regardless of whether it’s written in C or not) lowered into GCC’s IL, it finishes sanitizing addresses, threads, & hardware addresses. And finishes compiling OpenMP (a feature I’m largely ignoring in these studies).

Then it outputs the shared constant pool, object blocks, finishes clone pairs, weak pointers, possibly DWARF frames, debugging info, indirect constants, & processes remaining external directives. And gives hints to the linker.

Language Hooks (for C)

The pipelining infrastructure in the GCC’s driver class discussed the other day calls some “hooks” for the appropriate language (decided at compiletime) registered using C macros. I’ll be going over all the significant hooks for C in a rough approximation of the correct order, ignoring anything for C++. Most of the rest are for initialization!

Which I think I’ll start with LANG_HOOKS_PARSE_FILE, or I might push this off until tomorrow…

Parsing

Unix represents “files”, for better or worse, as a sequence of bytes which may or may not represent (ASCII/UTF-8 encoded) text that’s easier for humans to read & edit. UNIX ideologues would say to always use text, but I don’t see how there’s much difference to the computer.

To do any serious processing of this data its richer structure must first be “parsed” out. Which is definitely the case for converting C programs into Assembly & on into Machine Code.


All C-like frontends first finalize that initialization.

GCC’s C parser is a handwritten “pushdown automaton”, loosely resembling how you might express C’s syntax in EBNF. The lexer is, with some postprocessing to e.g. add types to literals & concatenate consecutive string literals, bundled as a core part of the C PreProcessor. Though the parser occasionally interacts with it directly via its “pragmas”, possibly sourced from build config files.

This libcpp doesn’t appear to be bundled with the download I have…


The parser incorporates behind feature flags Objective-C, OpenACC, OpenMP, and a little C++ at minimal additional complexity.

GCC provides a generic binary tree for the parser to use for representing expressions & types, as well as identifier lookup & pretty error reporting.

The parser calls into a suite of Builder (OO design pattern) functions for resolving, validating, & constructing declarations & statements during parsing.


Not only are identifiers resolved during parsing (rather than in a separate compilation pass like, say, Haskell’s) but so is typechecking! Allowing the parser to desugar typeof(), sizeof(), & alignof() into literal values.

C makes it easy to typecheck each individual expression independently of all others, just by referring to the local or global identifiers.

The code for building, validating, & desugaring the AST for statements & expressions is alongside the typechecking code.


The expression AST builders largely serve to make C’s implicit type coercions explicit, resolve pointer offsets from e.g. struct fields, perform the same operation over every field of an enum, and to partially resolve the &reference operator which is meaningless in machine code. Assignment operators infer a dereference operator on the left-hand-side, which must be a valid “lvalue”.

The statement AST builders largely serve to flatten C’s “structured” control flow.

Already quite close to machine code!

Lexing & Preprocessing

When programming in a relatively rudimentary language like C it may be useful to be able to abstract away the syntax using “macros”. Even if there’s plenty of gotchas with C’s preprocessor…

When parsing a text file it may be useful to first “lex” the text into a sequence of consecutive “tokens”, between which the parser can then extract relationships.

In GCC’s C compiler these tasks are one & the same!
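
To illustrate one of those gotchas: macros substitute tokens rather than values, so operator precedence leaks through unless you parenthesize defensively.

    /* Classic preprocessor gotcha: macros paste tokens, they don't
       evaluate arguments, so precedence & side effects leak through. */
    #include <stdio.h>

    #define SQUARE_BAD(x)  x * x
    #define SQUARE_OK(x)   ((x) * (x))

    int main(void)
    {
        printf("%d\n", SQUARE_BAD(1 + 2));  /* expands to 1 + 2 * 1 + 2 = 5 */
        printf("%d\n", SQUARE_OK(1 + 2));   /* expands to ((1 + 2) * (1 + 2)) = 9 */
        return 0;
    }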


This “LibCPP” lexer starts with a scanner class converting a stack of files in any text encoding to a sequence of UTF-8 characters. At a first pass (primarily for the preprocessor component) it uses CPU vector instructions to rapidly split the file into lines, before dispatching to a function for enacting the #directive or lexing the next token.

The directives may in turn dynamically dispatch to the appropriate “pragma” callback.


The directives may also interpret boolean expressions (entirely separate from GCC’s constant propagation optimization!), often to manipulate a conditions stack instructing the scanner whether to skip lines. Or they might add “macros” to a hashtable. There’s also error reporting infrastructure.

These macros are resolved (with any arguments themselves becoming macros) in a separate pass over the lexer I’ve already described.

Yes, LibCPP implements its own hashtables…

Memory Layout

I would argue that, while it may also do additional optimizations, a C compiler’s main job is to save you from deciding where in memory every minuscule piece of data belongs. Or in which register, but that’s a topic for another day.

C allows you to declare structs, enums, & unions which’ll be lowered to pointer offsets & constants at parsetime. Or to introspect the memory sizes & alignment of those custom types using sizeof() & alignof().


To parse sizeof() & alignof() to a literal integer, the parser calls c_sizeof_or_alignof_type() which’ll examine the type’s AST.

A function’s size is size_one_node & its alignment is FUNCTION_BOUNDARY / BITS_PER_UNIT. void & errors have a size & alignment of size_one_node. These types & incomplete types throw warnings/errors.

These constants & the following verification code are CPU-specific.


To compute the size of other user-defined types, after handling commoncases & caching, it calls a bi-recursive routine that switches upon the type of expression which can also handle maths within array sizes.

The code to compute alignments is platform-specific & incorporated into the sizing code.

Memory offsets are stored in the AST for structs whilst parsing them, computed from this sizing & alignment info, to be looked up when one of their fields is accessed.

Enums meanwhile parse to constants.
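
To make those offsets & alignments concrete, here’s a small example of my own; the exact numbers are ABI-specific, and the comments assume a typical 64-bit target.

    /* Illustration of the sizing/alignment rules being computed here.
       Exact numbers are ABI-specific; comments assume a typical 64-bit target. */
    #include <stdio.h>
    #include <stddef.h>

    struct example {
        char  c;    /* offset 0 */
        /* 3 bytes of padding so 'i' lands on a 4-byte boundary */
        int   i;    /* offset 4 */
        short s;    /* offset 8, then 2 bytes of tail padding */
    };              /* sizeof == 12, _Alignof == 4 */

    int main(void)
    {
        printf("size=%zu align=%zu offset(i)=%zu\n",
               sizeof(struct example), _Alignof(struct example),
               offsetof(struct example, i));
        return 0;
    }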

SSA Lowering

It may make certain compiler optimizations more obvious once you’ve established a Single Static Assignment (SSA) invariant that any variable can only be assigned once when it’s initialized. In other words, functional languages are already SSA!

It may make other optimizations more obvious once you model the reality that every intermediate value in an expression also needs to be placed somewhere.

This is what the “gimplification” pass is for! Beyond what the parser has already desugared.


The first step of converting a function to SSA is to consider its arguments: which arguments can be stored in registers? How about its return value?

If all the relevant flags are set, it’ll create a hashtable to use for address sanitization. Then it tackles the body!

Once it’s (re)initialized all the relevant datastructures for this function body, it incorporates the arguments into the SSA tree taking types & the calling convention into account, though I’m not fully comprehending.


The next step is to convert the function’s body, so to convert a statement or expression to SSA it uses a recursive routine (with validation) switching over all the various operators, some of which may be language-specific.

There’s a bit of a different process for lowering booleans, because bools aren’t native to machine code: it’s just control flow & conditions.

The few remaining &reference operators are lowered away.

It tracks any sideeffects affecting this lowering.


There’s another recursive subroutine for deciding how much memory to allocate on the callstack. Address sanitization if enabled is tracked whenever lowering a variable declaration. Where possible variables are inlined into the SSA tree.

Structured control flow is further lowered to GOTOs (sketched below). Cases are associated with their containing switches.

New variables are allocated throughout this process, some of which might be void’d.
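
Here’s roughly what that lowering of structured control flow looks like, written by hand in C itself (GIMPLE isn’t C, so treat this purely as an analogy):

    /* "Lowering structured control flow to GOTOs", done by hand in C. */
    #include <stdio.h>

    int sum_before(int n)                /* structured version */
    {
        int total = 0;
        while (n > 0) {
            total += n;
            n--;
        }
        return total;
    }

    int sum_after(int n)                 /* gimplified-by-hand version */
    {
        int total = 0;
    loop_top:
        if (!(n > 0)) goto loop_exit;    /* the condition becomes an explicit test */
        total = total + n;
        n = n - 1;
        goto loop_top;
    loop_exit:
        return total;
    }

    int main(void)
    {
        printf("%d %d\n", sum_before(5), sum_after(5));
        return 0;
    }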


The C-specific hook handles lowering shifts, declaration warnings, & pre/post increment/decrement specially.

There’s plenty of post-processing… It outputs a cyclicly linked-list of variable-sized opcodes which may include labels, including creating new variables.

And upon compiling a function body adds additional wrapping SSA ops, or rearranges existing ones. Maybe there’s callee-copy arguments?

The function lowering may add additional processing for profiling & address sanitization.

Other C Language Hooks

Looking over the next bundle of C compiler GCC “language hooks”, FINISH_INCOMPLETE_DECL addresses some corner cases.

WARN_UNUSED_GLOBAL_DECL checks whether it should print a warning, PRINT_IDENTIFIER makes the appropriate printf() calls, TYPES_COMPATIBLE_P calls into the typechecking system to determine whether given types can be coerced, MISSING_NORETURN_OK_P checks for special & legacy syntactic cases for main(), BUILTIN_FUNCTIONS[_EXT_SCOPE] special cases internal functions.


I’m not quite clear what SIMULATE_BUILTIN_FUNCTION_DECL does beyond adding implicit declarations. There’s a table of callbacks for all the different “attributes” you can attach to a declaration, under language hooks COMMON/FORMAT_ATTRIBUTE_TABLE.

TREE_DUMP_DUMP_TREE_FN handles C bitfields specially. SIMULATE_ENUM_DECL calls the same functions for parsing an enum.

TYPE_FOR_MODE does numerous checks to return an appropriate type, as does TYPE_FOR_SIZE.


INCOMPLETE_TYPE_ERROR runs the appropriate error_at() calls to report an incompletely specified type. TYPE_PROMOTES_TO returns an appropriate type for the input one.

REGISTER_BUILTIN_TYPE registers implicit type declarations.

TO_TARGET_CHARSET performs conversions between character encodings via LibCPP.

EXPR_TO_DECL conditionally wraps an expression in a global variable. GET_DECLS is a noop for C. TREE_INLINING_VAR_MOD_TYPE_P checks for unsized types.

Then there’s OpenMP…

Sanitization

With your program parsed into its cyclically-linked-list SSA variable-sized bytecode form, GCC finalizes the “sanitization” the parser started.

For memory addresses this involves constructing “shadow pointer” types if missing, conditionally adding some calls to builtin functions, & constructing a global for those runtime checks to consult, with enough space for all relevant globals.

Thread sanitization imports those same functions calling the runtime checks at startup & shutdown.


Hardware address sanitization adds a different runtime initialization/finalization function call to the program.

All of these checks are entirely optional to compile, and probably wouldn’t be included in the first bootstrapping of an OS like Linux From Scratch due to their runtime requirements.

I’ll skim some of this code to see if there’s anything worth commenting upon when I upload this thread…

Compilation

Compiling the code, as opposed to embedded data, is done conditionally by calling finalize_compilation_unit on the global symbol_table object.

After some profiling, this C++ method computes the calling convention (partially CPU-specific) & marks alias targets. Processing each previously-marked aliased pointer involves considering whether we can remove it.


It also remodels the aliased pointers in the varpool or control graph subsystems.

After some debugging output it lowers the program’s Abstract Syntax Tree to the intermediate language both before & after additional alias removal/remodelling. I think I’ve already described that, but I’ll double check.

After some more debugging info, validation, & setting hooks it’ll move on to the actual compilation. If there weren’t any errors it’ll run the IPA optimization passes & plugin hooks.


After running the IPA passes it may run some CPU-specific code & debugging output, before rerunning those IPA passes & debug hooks.

After more debugging/profiling, late IPA passes are run & it figures out which functions we want to output.

Then it gathers, sorts, & outputs each cgraph, varpool, or ASM node whether defined or not, mostly using CPU-specific code. Or directly outputting that data or ASM. CGraph nodes also refer to the generic subsystem for LTO sections & its hashmaps.


CGraph nodes also run an SSA pass & apply those IPA & other optimization passes. After some optional warnings & tidyup it iterates over the CGraph to output any literal labels & CPU-specific Assembly instructions.

Varpools may also incorporate LTO sections.

It’ll then topologically sort all the functions & expand them separately if they haven’t already been. Followed by variables, as per above.

It handles any new alias pairs & iterates over newly-created functions to output/optimize.

It outputs any weakreferences to be linked later before finishing with debugging, validation, & freeing memory.

Optimization

GCC runs a suite of extra checks & optimizations over your programs to help ensure it works optimally. This section will describe each of those passes in turn.

Warn On Unused Result

pass_warn_unused_result warns when calling functions marked with the warn_unused_result attribute without doing anything with the result value. Presumably largely used for error handling.
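
For reference, the annotation it’s looking for (this attribute syntax is documented GCC behaviour):

    /* The attribute this pass looks for; the first call below triggers
       -Wunused-result under GCC. */
    #include <stdio.h>

    __attribute__((warn_unused_result))
    static int try_reserve(int n)
    {
        return n <= 16 ? 0 : -1;   /* pretend 0 = ok, -1 = error */
    }

    int main(void)
    {
        try_reserve(32);           /* warning: return value ignored */
        if (try_reserve(8) != 0)   /* fine: the result is inspected */
            puts("reservation failed");
        return 0;
    }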


For each function this pass iterates over all its bytecodes, recursing into any BIND, TRY, CATCH, or EH_FILTER. (relevant to C?)

On any relevant function calls it checks the output target stored in the opcode to decide whether to warn.

Lowering Control Flow

Skipping over GCC optimization passes which aren’t relevant to plain C, the next on my list is pass_lower_cf which in part combines multiple return statements into a single one per function.

It also lowers away lexical scope (possibly relevant…) & exception handling (irrelevant).


Expecting the parser to start each function with a BIND opcode, it recurses over the tree of those BIND opcodes extracting vars & ensuring correct opcode ordering as it does so.

Within each BIND block it examines each statement’s opcode:

With the bulk of the lowering done, it optionally validates that all DEBUG instructions have been removed, setting the debug_nonbind_markers.

A final return; instruction will be added if missing.

All RETURN opcodes located are replaced with GOTOs targeting the appropriate labels at the end of the function, where the single return now lives.

And all temp data involved in these tasks is cleaned up!

Control Flow Graph

Each “block” of C code can (if not for function pointers) transition to a statically-defined set of other blocks depending on control flow. This forms the “Control Flow Graph” (CFG). This analysis pass is performed by the pass_build_cfg C++ class & the execute_build_cfg/build_gimple_cfg C functions.

This CFG will be vital if we ever want to optimize it!


build_gimple_cfg starts by initializing a new graph with the function’s codeblocks & their labels.

After optionally preprocessing DEBUG instructions GCC iterates every instruction in the function.

If it’s a CALL instruction it checks whether it should set or unset its control-flow-altering flag depending on whether it may trigger any exceptions. So seemingly irrelevant to C. I guess the CFG only considers within-function control flow.

It splits the cyclically-linked list into the separate codeblocks, populating the labels mapping. Certain instructions start or end a codeblock.


GCC’s CFG analysis ensures there’s at least one (empty) block, it trims the allocation of the codeblocks array, & removes duplicate labels including for exception handling. The latter involves rewriting GOTOs, etc to not reference the labels about to be deleted.

switch statements (whose cases have previously been sorted) are handled specially in a separate subpass to collapse contiguous cases.

It then constructs hashtables when populating edges…


GCC creates an opening edge then iterates over each codeblock & the instructions therein, converting GOTO, RETURN, COND, SWITCH, exception handling, CALLs in case of exception handlers or no returns, ASSIGN in case of exceptions, any special ASM code, OpenMP & Transactional Memory into control flow edges in this Control Flow Graph.

It keeps a list of blocks entered via GOTO, CALL, or OpenMP &, if there are any of the former, it goes over the basic blocks again to find & correct abnormal edges.


A basic block is created to dispatch to the abnormal edges, though I’m not clear on what defines an edge as being abnormal.

build_gimple_cfg iterates over the basic blocks again to assign “discriminators” based on line numbers & a hashmap. And it finishes up by once again removing duplicate labels revealed by those subpasses.

Basic CFG Optimization

With the function body thus modified, execute_build_cfg outputs some debugging, “cleans up” the CFG, & optimizes loops.

The cleanup subpass starts by optionally altering any loop headers to have only one entry edge, before traversing the tree in preorder looking for edges to remove & flagging each codeblock. It also allows constants to propagate into & simplify control flow. GOTOs to GOTOs are merged into a single GOTO.

If necessary SSA invariants are then reestablished, & dominators recomputed.

It initializes a hashmap & bitmap to track case labels.


GCC’s CFG cleanup then iterates over the basic blocks to remove GOTOs to the next instruction, before traversing the control flow again re-applying both those same optimizations in case opportunities got revealed.

Any codeblocks not marked are garbage collected.

Next it locates or validates any loops in this Control Flow Graph. Which it’ll output for debugging & flag in the opcodes.

Presumably OpenMP looped ANNOTATE instructions are handled specially.

Invalid Return warnings

The next GCC (warning) pass in my list is pass_warn_function_return. This iterates over every exit edge from the Control Flow Graph (as described yesterday) to ensure there are RETURN opcodes & it checks whether the presence of a return value matches the signature of the function.

Two separate loops are implemented for this behind different conditions. In a noreturn function it lowers return statements to __unreachable() calls.

The C parser fully typechecks returns.

Warn alloca()

The next (warning) pass validates calls to alloca(), once the code’s in SSA form with a corresponding Control Flow Graph.

To do so it iterates over every codeblock & opcode within to find those alloca() calls. Depending on configuration this may be an unconditional warning. Or it may check whether the use of alloca() stays within fixed runtime limits: it’s not in a loop & its argument has a max or smallish constant value.

These warnings are nicely formatted for a good DX.

Call Graphs

So far I’ve discussed use of SSA invariants & control flow graphs to aid optimization & additional typechecking. So today I’ll discuss how it extracts an explicit Call Graph from this data to further aid optimization.

To do so it iterates over each instruction of each codeblock skipping DEBUG instructions. If it’s a CALL opcode it’ll add an explicit or implicit edge to the callgraph. And there’s special handling for OpenMP.

Then it records references to every instruction & destructors.


Recording references from a statement involves iterating over its tree to mark which data is loaded & stored.

Each node (function) in this call graph has a hashmap storing all its edges to the other nodes. The nodes are populated one function at a time, like all the other optimization/analysis/warning passes work.

The callgraph node to populate for each function is stored inside the IL declarations, created as needed when adding the edges.

Freeing Lang Data

The next GCC optimization/analysis/warning pass removes fields required by the frontend parser & (C-irrelevant) lowering of exception handling. They’re not needed anymore. Though I don’t find such deallocation very interesting, so I’ll otherwise skip this pass!

String Optimization

This optimization pass reduces duplicate calls to strlen() arising from use of C’s string library, amounting to the dominant C compiler admitting NUL-terminated strings were a poor design choice.
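
The sort of code being rescued looks something like this sketch of mine:

    /* The second strlen() call can reuse the first's result, since nothing
       in between modifies the string; that's what this pass tracks. */
    #include <stdio.h>
    #include <string.h>

    void shout(const char *msg)
    {
        if (strlen(msg) == 0)
            return;
        /* Without the pass this walks the string a second time. */
        printf("%s! (%zu chars)\n", msg, strlen(msg));
    }

    int main(void)
    {
        shout("hello");
        return 0;
    }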


First this pass computes the dominators for each block & possibly locates loops to initialize their “SCalar EVolution” checks, before conditionally initializing a hashmap & arrays for this pass’s analysis. From there it walks the dominators tree, outputs optional debugging info, & deallocates its temp data.

Upon entering a codeblock this tree traversal iterates over its instructions looking for relevant ones.

The first iteration over instructions upon entering a codeblock invalidates the previously runtime-computed lengths for modified strings. The second simplifies relevant PHI instructions from the SSA pass. The final iteration applies the optimization before storing aside the altered state.

Upon leaving a codeblock it clears this stored state.


To optimize str[n]len() it looks up the constant or variable size for the string regardless of intermediate assignments. Or stores result in the map.

Calls to functions annotated with alloc_size stores a constant length in that mapping.

Optimizing strchr() populates any missing second argument to hold an end pointer.

Optimizing strcpy(), stpcpy(), or their _chk variants converts them if possible to memcpy() & makes the length of the 1st arg the same as the 2nd’s.

Optimizing strncat(), stpncpy(), strncpy(), & their _chk variants issues warnings for out-of-bounds access.

Optimizing mem[p]cpy[_chk]() checks for string copying.

Optimizing strcat() adds the length of the second argument to the first’s.

Optimizing alloca(), malloc(), or calloc() stores the allocated length as corresponding to that string.

Optimizing memset(_, 0, _) removes duplicates (arising from calloc() calls) & sets the string length to 0.

Optimizing memcmp for tiny arrays converts it into inlined comparison operators.

Optimizing str[n]cmp() involves adding that n arg, or determining whether it’s definitely equal or not.

It tracks string assignments &, as noted earlier, clobbering. Whilst pointer equality can inform str[n]cmp optimizations, so that’s tracked.

printf()

The C standard library has a printf() function (don’t worry, I’ll be discussing GNU LibC shortly after GCC) which has its own type rules for safe usage. The variadic args must match what’s described in the format string. Traditionally these additional rules were not typechecked by the compiler, but now GCC provides a warning pass for printf()-like functions as determined by an annotation.

Arguably this should be more than a warning, but they don’t want to break old programs.
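
You can opt your own wrappers into these same checks with the (documented) format attribute; a small example of my own:

    /* The annotation in question: the format attribute names which argument
       holds the format string & where the variadic values start. */
    #include <stdarg.h>
    #include <stdio.h>

    __attribute__((format(printf, 1, 2)))
    static void log_line(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    int main(void)
    {
        log_line("%d items\n", 3);    /* ok */
        log_line("%s items\n", 3);    /* -Wformat warning: %s expects a char *, got int */
        return 0;
    }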


This starts, if lacking, by making sure we have a lookup table initialized to decode the target’s charset, via its CPU-specific dispatch table.

The printf()-like optimizations spend some effort locating sprintf()’s destination argument to get at its size for all the previous optimizations, or compute the possible size range.

Then it locates the format string before doing the raw check & length computation on the extracted values, before applying some simplifications & returning.

For small-bounded text formatting calls it’s faster to use the return value rather than a pointer arg. Or use strcpy() rather than sprintf("%s").


To check a format string & compute the possible lengths of the result it iterates over & parses (using strchr()) every “directive” in the literal format string.

For each it calls a method on the parsed directive, counts how much literal text to add, validates the computed range for the segment, issues common (e.g. NULL) warnings, sums them into the full length range (partially based on likelihood), & checks how much space remains in the output buffer, if any, within a generous limit.


Literal text in the format string is parsed to a format_plain directive of that length.

”%%” parses to a format_percent directive of length 1.

Upper or lower “%a”, “%e”, “%f”, or “%g” parses to a format_floating directive with a length taken ideally from the precision parameters.

“%d”, “%i”, “%o”, “%u”, “%x”, or “%X” parses to a format_integer directive similar but simpler than format_floating.

“%p” disables further checks.

“%n” parses to a format_none directive of length 0.

“%c” parses to a format_character directive of length 1, adjusted for width or precision parameters. “%C” parses to a format_character directive which checks the range of its unicode input to determine whether it’s 0, 1, or 2 bytes long, & at which likelihood.

“%s” or “%S” parses to a format_string directive which gets its length from that of the passed string using all the previous measures.

And of course, within the iteration, these issue warnings.

Warn Non-NULL Compare

The next (warning) GCC pass iterates every function’s arguments to check whether any of them are annotated nonnull & are compared against NULL regardless. As that’s checked at compiletime.

Because while yesterday I discussed how GCC optimizes away C’s poor design choice of NUL-terminated strings, this pass extends the language to address the poor design of C’s NULL.
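
For reference, the annotation being checked against; the redundant NULL test below is what gets flagged:

    /* GCC assumes arg 1 is never NULL thanks to the attribute, so the explicit
       check is flagged (-Wnonnull-compare) & may even be optimized away. */
    #include <stdio.h>

    __attribute__((nonnull(1)))
    static size_t count_bytes(const char *s)
    {
        if (s == NULL)          /* warning: comparing a nonnull argument to NULL */
            return 0;
        size_t n = 0;
        while (s[n]) n++;
        return n;
    }

    int main(void)
    {
        printf("%zu\n", count_bytes("abc"));
        return 0;
    }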

Undefined Variables

In C you can declare a variable without assigning it a value. If you then read the variable before writing to it that is Undefined Behaviour. Single Static Assignment makes it easy to warn C devs whenever this might crop up!

And most languages now upgrade this warning to an outright error.
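
A minimal example of the pattern being warned about:

    /* With -Wall & optimization GCC reports that 'x' may be used
       uninitialized on the flag == 0 path. */
    #include <stdio.h>

    int pick(int flag)
    {
        int x;                 /* declared, never unconditionally assigned */
        if (flag)
            x = 42;
        return x;              /* undefined behaviour when flag is 0 */
    }

    int main(void)
    {
        printf("%d\n", pick(1));
        return 0;
    }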


With the dominators computed, it iterates over every codeblock in the function & its instructions, skipping DEBUG ops. A first pass checks all SSA_NAMEs referenced by the instruction.

After thus issuing any warnings for instruction parameters it increments the number of vdefs if relevant & handles CALL & ASSIGN ops specially.

For CALLs it also checks if there’s any references to undefined memory passed to constant or specially-annotated pointers. The checks on ASSIGN ops are quite nuanced but also appears to relate to &referencing undefined memory.

A fuller pass will run later in GCC’s optimization pipeline!

Undefined Behaviour Sanitization

In C there’s various edge cases which the standards explicitly do not specify. This is the dreaded “Undefined Behaviour” which is generally less than useful, and widely avoided. As such GCC can warn you whenever it compiles such code.

So GCC allows you to optionally insert, via the pass_ubsan pass, a suite of runtime checks to throw C++ exceptions.
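
A minimal example of what those runtime checks catch, assuming the program is built with -fsanitize=undefined:

    /* Built with -fsanitize=undefined, the overflow below is reported at
       runtime instead of silently wrapping (or being assumed away). */
    #include <limits.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        (void)argv;
        int x = INT_MAX;
        int y = x + argc;   /* argc >= 1, so this signed addition overflows: UB */
        printf("%d\n", y);
        return 0;
    }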


It iterates over each instruction in each codeblock.

DEBUG or clobbering instructions are skipped. Maths operations are optionally replaced with library calls to handle signed integer overflow. Non-null pointer dereferences are optionally augmented with NULL checks. Enums & non-bitfield bools are optionally annotated with checks that their encoded values are in-range, which is 0 or 1 for bools.

Earlier I mentioned a pass warning against checking if nonnull-annotated function parameters are NULL. This pass optionally checks against passing NULL.


The arguments passed to the clz() & ctz() builtin functions are optionally annotated with validation checks. It optionally adds checks that NULL isn’t returned from non-NULL functions. Pointers are optionally validated at runtime to be in-bounds of valid memory. And pointer arithmetic is optionally annotated with functions to check the pointers remain valid.

A runtime C++ exception is thrown upon failing these checks. Control-flow edges for other exceptions are removed per codeblock.

Function Metrics

To guide heuristics for later optimizations, GCC includes a pass to compute useful metrics for each function. Like (according to the file’s own summary):

This “context” refers to which parameters are constant & whether the call is inlined or not.


It computes the summary for a given function, once it has been (re)allocated, by first computing the stackframe size. This “stackframe” is where local variables (not moved out into CPU registers) are stored in memory, so how much memory do we need to allocate from the callstack?

To do so it iterates over every variable declaration associated with the function, summing the size of every non-register non-global var taking into account alignment. The result is scaled to a distribution.


If I’m reading the code correctly, the computed code size & timing are cached. If they aren’t yet, it reestablishes Single Static Assignment, checks inlining annotations, & analyzes the function body.

That analysis involves optionally computing dominators before summing the CPU-specifically scaled times/sizes for each instruction in the appropriate tablerow for the different contexts. Adding some semi-constant overhead for calling the function.

That context is reconsidered for each codeblock.

Initial Inlining

It’s often a useful optimization to replace a function call with that function’s “body”, as that frequently reveals more optimizations! The initial “inlining” pass I’ll describe today doesn’t use the analysis described yesterday, instead using simpler checks handling all the more obvious cases.

C++ abstractions often yield these obvious cases, but they can occur in C too.
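
For reference, the annotations this pass honours, in a tiny example of my own:

    /* clamp() is pasted into its callers unconditionally, regardless of -O level,
       thanks to always_inline. */
    #include <stdio.h>

    __attribute__((always_inline))
    static inline int clamp(int v, int lo, int hi)
    {
        return v < lo ? lo : v > hi ? hi : v;
    }

    int main(void)
    {
        printf("%d\n", clamp(300, 0, 255));   /* the call disappears after inlining */
        return 0;
    }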

This pass expects the instructions to be in SSA form, which will need to be reestablished after it runs.


After checking for errors it considers whether it should inline.

Any non-recursive functions annotated “always_inline” are always inlined, anything else is never inlined into an “always_inline”-annotated function. “flatten”-annotated functions recursively inline all their non-cyclic callees. Then a couple of extra iterations inline any small or “inline”-annotated functions, according to any function summaries described yesterday. It’s during these iterations the actual inlining happens!


These checks update the relevant Call Graph edges to mark them for inlining & ensure they can actually be inlined regardless of other optimization options.

After each iteration of looking for small or “inline”-annotated functions it iterates over every codeblock/instruction to find the corresponding CALL instructions, split the containing codeblock around the call, & splice in the callee’s codeblocks. Turning arguments & returns into variables to be SSA’d.

__builtin_object_size()

A common issue in C programs is when they read from or write to memory outside of the data they intend to. This is trivially overlooked when skimming code, and is the source of real security vulnerabilities like Heartbleed.

To help C devs address this GCC provides the __builtin_object_size() function, which is lowered in the next pass. This seems to resemble sizeof(), but runs later in the compilation pipeline. Perhaps so it can be used by sanitization?
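
A small example of my own showing what the lowered builtin evaluates to; glibc’s _FORTIFY_SOURCE wrappers are built on exactly this:

    /* __builtin_object_size() folds to the known size of the object a pointer
       refers to, or (size_t)-1 when it can't be determined (compile with
       optimization enabled so the sizes can actually be computed). */
    #include <stdio.h>

    int main(void)
    {
        char buf[32];
        struct { char name[16]; int id; } rec;

        printf("%zu\n", __builtin_object_size(buf, 0));       /* 32 */
        printf("%zu\n", __builtin_object_size(rec.name, 1));  /* 16: just the member */
        return 0;
    }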


To do so it iterates over every codeblock & instruction therein looking for those __builtin_object_size() CALL instructions.

The gotcha for GCC in lowering these CALLs is that the type system has already been lowered away at parse-time, and as such it needs to go recursively searching for all the type sizes to sum. So it refers heavily to a cache of computed sizes.

If I understand this chunk of code correctly, the actual types are on the SSA_NAME instructions - and propagated through the other instructions.

The resulting size is converted into a literal int.

Constant Propagation Bitflags

Sometimes programs have mathematical expressions where every number involved is a constant value. Maybe because it makes the intent clearer, or maybe as a consequence of other optimizations. In either case it’s optimal to do this computation at compiletime rather than runtime.

In part because it’ll always be the same anyways so why bother recomputing? In part because minimizing constants yields more compact machine code.

Sometimes these constants propagate between control flow branches.


To propagate these constants regardless of control flow, pass_ccp with dominators computed splits this task between initialization, propagation, & finalization, where it sets computed values & unsets appropriate LOOP_CLOSED_SSA loop flags.

To initialize it iterates over each codeblock & contained instruction to determine whether they’re definitely varying. If one is, it propagates that status to all its immediate outputs. Then it iterates over the codeblocks to handle PHI instructions.


The propagation phase seems to be little more than a SSA graph traversal using a bitmask as the “fringe” to store which edges still needs to be processed. As it does so it propagates the previously-set flags, taking special care around PHI instructions.

Constant Propagation & Peephole Optimizations

Continuing with the story of how GCC implements constant propagation, yesterday I described how it flags which values are definitely varying. Today I’ll discuss forward propagation.

This implements “peephole” optimizations to combine a handful of connected instructions into a more efficient alternative.


It creates a lattice to track SSA_NAMEs (it wasn’t SSA’d after inlining yet?) & postorder traverses the codeblock tree iterating over their instructions.

The first instruction iteration looks for PHI instructions to discard. To be replaced with a value to propagate via the lattice.

The second iteration applies forward propagation of assignment statements, taking special care around pointers, complex numbers, & some C-irrelevant types.

The third iteration attempts to merge each instruction going forwards with their arguments whilst retrieving propagated values from the lattice.


For pointer assignments this GCC pass reconsiders the flags for whether this expression is constant or varying, and whether it has sideeffects.

In this iteration it’ll repeatedly apply 2 subpasses to replace the instruction & its arguments with more optimal alternatives. The first normalizes any pointers, then applies constant propagation, then handles ASSIGN/CALL/ASM/DEBUG/RETURN. If anything changed it reconsiders the flags.

The second handles ASSIGNs of CONDs, SWITCH, COND, or CALL.

Then it updates the lattice.


The third iteration updates destination PHI instructions. PHI instructions, btw, represents a value which may come from multiple control flow blocks, thereby making asserting the Single-Static-Assignment invariant easier.

Trailing loops removes instructions which were marked as replaced via a bitmask, or are now dead due to rearranged no-return function calls.

Codeblock Ordering

Like all computer data, programs are stored as a sequence of bytes. At least if your computer pretends to follow the Von Neumann architecture, which they all do. So what’s the best order in which to output these bytes? That’s what pass_early_thread_jumps estimates, if I understand this right.


With a loop optimizer initialized it hands the work off to a new thread_jumps object. It selects a next codeblock for each one with nontrivial exits, then applies the results.

For each codeblock with a nontrivial number of exits, it examines the final/branch instruction selecting which branch is taken at runtime. It looks backwards for where that condition is constant, making sure not to infinite loop over loops. Because that bumps the likelihood one branch is taken over another.

To extract the optimal ordering of codeblocks it examines the “paths” that computed, removing ones mentioning dead branches or (which becomes a bitmask) are already “jump-threaded”.


It converts all the paths into a tree it can traverse in postorder. Loop headers are threaded specially to assume we stay in the loop before examining their loop bodies to select each of their codeblocks’ most common exit target. This is done twice for each loop/function body, once without joiners & once with.

The function’s body is threaded whilst computing the postorder traversal; loops are all handled afterwards.

Scalar Replacement of Aggregates

As far as a CPU is concerned all data are numbers (or bitsequences), so GCC’s Scalar Replacement of Aggregates (early) pass deconstructs larger types into their scalar components. The C parser already does much of this, but this ensures none are left dangling from e.g. inlinings.

A near identical pass will be run again amongst the final optimizations.


It starts by initializing its collections (bitmaps, hashtables, & a stack) for the current function, before looking for scalarization candidates in the arguments & declarations.

A valid scalarization candidate is:

These variables are marked in a bitmask.

Then it iterates over all codeblock’s instructions attempting to decompose any RETURN, ASSIGN, CALL, or ASM instruction.


The next subpass examines all the candidate instructions that the last subpass extracted into the collection types, checking all reads to see if the intermediate storage is actually needed. Which it’ll then proceed to remove.

Before iterating over the declarations again to apply further validation &, if it passes, flag its accesses. Then optionally & non-optionally checks some more invariants.

The final subpass modifies those RETURN/ASSIGN/CALL/ASM instructions according to what’s in the collections.


Those instruction modifications recurses over the type to generate corresponding instructions to replace the “aggregated” one. Because this enlarges the codesize it may have previously discarded larger structs from this optimization.

Then it applies the same decomposition to read in the function arguments.

Profiling info may be written out for the GCC devs, and the collections are freed. Steps may be skipped if there’s no work to do.

Full Redundancy Elimination

TLDR: Iterating over a function’s codeblocks in postorder (with or without tracking cyclic edges), it hashes instructions to determine what to replace future instructions with. Unfortunately the Intermediate Language’s data representation does not make this easy.

A prime concern of compiled machine code is to minimize its filesize. Because the better the program fits in the CPU’s instruction cache, the less it needs to resort to the slow process of fetching it from RAM.

Redundant instructions, whether arising from other optimization passes or because it reads nicer, are a prime offender. Which the Full Redundancy Elimination optimization pass addresses.


With dominators calculated & the loop optimizer optionally initialized it does its processing.

If these arguments are missing, this central function do_rpo_vn finds the entry & exit points to analyze. It clears the EDGE_DFS_BACK bitflag on all entry edges.

It depth-first iterates over the control flow graph to set some flags, gather a postorder indices array, & optionally gather cyclic edges (loops). This is separated into loops for locating the cycles & function/loop headers, shortcircuiting tidyups on trivial cases, locating exits, computing postorder, & cleanup.


If that iteration over the control flow edges yields no loops it’ll skip processing PHI instructions. It creates a mapping from the codeblock indices to their index in the postorder array. It updates in-region flags for each basic block & edge. It creates a few more collections for the next loop.

Which iterates in postorder to gather unwind state (C-irrelevant exception handling?) & unset codeblocks’ EXECUTABLE flags.

It optionally iterates over cycles to expose them better to later loops.


If it is optimizing loops it performs a main iteration over all codeblocks checking/updating the unwind state, validating EXECUTABLE flags, process the basic block, & their PHI instructions to validate them, & inserts unwinding instructions.

Otherwise it uses a greedy worklist to iterate over the basic blocks in postorder to validate them & determine in which order to process them.

A few final checks/loops outputs debug information to the GCC devs & frees the collections.


To actually process a basic block it first checks if there’s actually any PHI if the caller told us to process them.

Unless we’re handling cycles, it’ll consider recursively inserting the loop header into place.

It’ll optionally iterate over the PHI instructions to recursively compute a “value number” (hash?) for the expressions they reference.

It updates the edges’ EXECUTABLE flags. It iterates over instructions to choose other branches to examine & lookup what can replace this instruction.


SSA_NAMEs, assignments, conditions, & calls are handled specially. There’s plenty of debugging info written for the GCC devs. If it’s made any changes there may be per-instruction tidyup for it to perform, possibly deferred to a later pass.

Failing to find another instruction to replace it with, pass_fre will add its parameters to the collections of instructions which may replace later instructions.

Numeric Ranges

It can be useful to compute the possible numeric ranges of each variable in order to figure out how many bits it requires to represent any number in that range. Such an optimization pass is crucial for getting the most performance out of, say, JavaScript but it can also be useful for C.


So with dominators calculated & loops handled the pass_early_vrp (Value Range Propagation) folds the appropriate “folder” selected by configuration over the current function.

The different expression folders prioritize local expressions or (using those computed dominators) control flow, possibly exclusively. Each relevant instruction or codeblock is handed off to a gimple_ranger or gimple_range_analyzer to interpret for the previously-computed max & min values. The extracted ranges are stored temporarily in a cache, or in an array of bitmaps indicating which vars hold equivalent numeric ranges.

PHI Merging

A PHI instruction is introduced during enforcing Single Static Assignment (SSA) to indicate a value which may come from multiple conditional codeblocks, a lightweight way to model control flow without variable reassignment.

But during optimization these PHI instructions may be altered into degenerate forms. Maybe a PHI instruction only wraps a single value, or is only used as input to another PHI instruction. pass_merge_phi iterates over the PHIs of each block to locate & remove these cases.

Dead Store Elimination

Developers will probably avoid this for clearer code, but this case can crop up as the result of optimizations: sometimes a variable is assigned to but never used. In which case it’s useless to compute that variable’s value, and the assignment should be removed.

So after reassigning identifiers, calculating dominators, & computing the postorder sequence pass_dse (Dead Store Elimination) iterates over every codeblock & instruction therein.


If it finds any assignments or memcpy/strncpy/memset calls, it gathers an array of which byteslices of each variable are read & written, dropping anything that’s dead either obviously (e.g. assigning zeros to the result of calloc()) or, looking forward in the instruction stream, less obviously so.

It’ll also examine any SSA “definitions” or PHI nodes to see if dropping those stores has obviously removed the need for them.

It may then remove exception handling edges.

Dead Code Elimination

Of course variable assignments are far from the only instructions which might have no effect other than to slow the CPU down with busywork. So dead code elimination runs at various points throughout the GCC compilation pipeline, starting (optionally, for high-enough optimization levels) now, after pass_dse.

pass_[cd_]dce essentially amounts to a garbage collector for computer code rather than data.

Starting with the usual dominators & loops calculations. Marking backedges for full collections.


The first thing a garbage collector does is to find the “roots” which are obviously alive, and it’s no different for pass_dce. For this it iterates over all codeblocks and their PHIs & other instructions. PHI/PREDICT/LABEL/DEBUGs aren’t inherently necessary; ASM/RESX/RETURN/GOTO are. As are volatile or global mutations, exception handling, & in full collections infinite loops. CALL is more nuanced, whilst control flow is treated as necessary in partial collections.


The next step of Garbage Collection (GC) or Dead Code Elimination (DCE) is to traverse the graph of objects/instructions to mark anything referenced directly or indirectly by a root. Which for GCC requires treating CALLs (especially to free(), etc), ASSIGN, CAST, & transactions specially.

Now that we know what to keep, it can iterate through freeing anything else. And removing any resulting degenerate instructions.

Then it tidies up after itself & optionally outputs profiling info.
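
To make the garbage-collection analogy concrete, here's a minimal sketch of my own of that mark & sweep over a toy instruction array; it's nothing like GCC's real GIMPLE datastructures, just the shape of the algorithm:

    #include <stdio.h>
    #include <stdbool.h>

    enum { MAX_INSNS = 8 };

    struct insn {
        const char *text;
        bool is_root;       /* e.g. a RETURN, volatile store, or call */
        int  uses[2];       /* indices of instructions whose results we read, -1 = none */
        bool live;
    };

    int main(void) {
        struct insn insns[MAX_INSNS] = {
            { "a = 1",      false, {-1, -1}, false },
            { "b = 2",      false, {-1, -1}, false },
            { "c = a + b",  false, { 0,  1}, false },
            { "d = a * 99", false, { 0, -1}, false },  /* result never used: dead */
            { "return c",   true,  { 2, -1}, false },
        };
        int n = 5, worklist[MAX_INSNS], top = 0;

        /* Mark the roots. */
        for (int i = 0; i < n; i++)
            if (insns[i].is_root) { insns[i].live = true; worklist[top++] = i; }

        /* Propagate liveness backwards along the use edges. */
        while (top > 0) {
            struct insn *it = &insns[worklist[--top]];
            for (int u = 0; u < 2; u++) {
                int def = it->uses[u];
                if (def >= 0 && !insns[def].live) {
                    insns[def].live = true;
                    worklist[top++] = def;
                }
            }
        }

        /* "Sweep": anything unmarked would be deleted. */
        for (int i = 0; i < n; i++)
            printf("%-12s %s\n", insns[i].text, insns[i].live ? "kept" : "deleted");
        return 0;
    }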

Control Flow Simplification

One technique for lowering away PHI nodes arising from enforcing Static Single Assignment is to duplicate the codeblocks where they occur for each of their possible values. This helps to reveal more opportunities for inlining, but can also increase the codesize, which is bad for the instruction cache.

So with dominance info, possibly the (hash)set of nontrapping memory accesses, & the order to process codeblocks computed, pass_phiopt iterates over every "nonfixed" codeblock.


Codeblocks ending in a GOTO, with abnormal edges, whose source or target have NULL successors, etc, are discarded.

Where there’s no more than two branch targets it optionally attempts a “cond store replacement”. If there’s only assignment on a certain branch & there’s a PHI node which reads it, the trailing code after both branches is duplicated with those branches merged in.


Otherwise it checks all the PHI nodes to see if it can inline them away.

In this initial iteration over PHI nodes it looks for values (whose types don't honour signed zeros) whose value has been checked for equality against a literal it can inline. It specially handles optimizing (x != 0) ? x + y : y into x + y.

A second iteration finds a common PHI node across both branches that it can move out into the trailing code. Then it optimizes away various cases on conditionals, taking into account the possible numeric range of each variable.

Memory Access Analysis

The main bottleneck in a CPU, which motivates most of its (vulnerable) optimizations, is memory access. As such GCC needs to optimize memory accesses away as much as possible, and for that it needs to gather some data. This is what GCC's pass_modref does.


To do so, after examining some flags & (optionally) allocating a new summary for the function, it iterates over the function to determine whether it can be analyzed at all.

During that iteration over codeblocks & their instructions it analyzes each instruction's arguments to record any loads & stores into arrays in its summary, before handling ASM (which might disable this analysis) & CALLs (to fetch their summaries) specially. Upon memory-clobbering ASM this analysis is discarded.

A second iteration incorporates the memory clobbering data from any recursive calls, which may also lead to this analysis being discarded.

Tailcalls

Behind the scenes functions allocate "stack frames" into a memory region starting near the top of (virtual) RAM. These stackframes are freed upon return. Though if a function returns the value of another call (a "tailcall"), there's no need to keep the caller's stack frame around. Instead it can have the callee directly return to its caller.

pass_tail_recursion checks the exit branches for these cases, flags those CALLs, & removes setup/teardown overhead from recursion.
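
For example, in this hypothetical accumulator-style factorial of mine the recursive call is a tailcall:

    /* The recursive call is the last thing the function does, so its stack
     * frame can be reused: the call compiles down to a jump. */
    static unsigned long fact(unsigned n, unsigned long acc) {
        if (n <= 1)
            return acc;
        return fact(n - 1, acc * n);   /* tail call */
    }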

if to switch

Sometimes it’s more optimal to compile conditions using switch (looking up the branch to take in a table), at other times it’s more optimal to use if/else if/else. pass_if_to_switch handles converting obviously relevant if/else branches to switch in case that’s more optimal.

To do so it iterates over every basic block looking for branches taken upon a trivial condition on a common variable, adding them to a hashmap. If successful it'll iterate again in postorder to validate what it found.

Having asserted that what it found was a straight line of if/else if/else conditions, it restructures the intermediate code, control flow graph, & PHI instructions to turn them into a single switch branch.
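
The sort of chain it's looking for (my own hypothetical example):

    /* An if/else if chain testing one variable against constants... */
    int classify(int c) {
        if (c == 'a') return 1;
        else if (c == 'b') return 2;
        else if (c == 'c') return 3;
        else return 0;
        /* ...is equivalent to: switch (c) { case 'a': return 1; ... } */
    }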

switch to if

To convert a switch branch into if/else if/else branches where that's better, GCC's pass_convert_switch iterates over the last instruction of every basic block. For each switch it finds, with some debugging output, it:

  1. Groups cases by value; this may reveal degenerate switch branches with only one target.
  2. Computes upper & lower bounds for each set of labels, & counts how many comparisons would be required in if/else form.
  3. Discards cases better handled by other optimizations.
  4. Discards cases with too large a range, with non-empty branches, or whose PHIs it can't handle.
  5. Gathers the branch targets as codeblocks rather than labels, & iterates over those branches to determine in which order to emit the comparisons.
  6. Stores the value being switched over, rewrites PHIs whilst emitting the comparisons, & emits an initial comparison to assert the value's in range.

Branch Prediction

It takes a relatively long time for a CPU to fetch the next batch of instructions from RAM to execute. So it fetches them ahead of time, at which point branches get in the way of this optimization. Runtime profiling can aid this, but for code the CPU's never seen before (or has forgotten) the CPU requires the compiler's guidance as to which of those branches it should prefetch.

pass_profile is the optimization which computes these CPU hints.


After validating this work hasn’t already been done, and initializing both loop optimizers (adjusting infinite loops to be handled correctly) & dominators info, GCC’s pass_profile optimization iterates over the codeblocks to determine which are obviously unlikely according to developer annotations.

Any codeblock only branching to obviously unlikely codeblocks is also unlikely.
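
Those developer annotations are GCC builtins, which projects like GNU LibC & Linux wrap in likely()/unlikely()-style macros; here's my own hypothetical wrapper:

    #define unlikely(x) __builtin_expect(!!(x), 0)

    int first_byte(const char *p) {
        if (unlikely(p == NULL))   /* hints that this branch is the cold path */
            return -1;
        return p[0];
    }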

A second iteration & propagation counts each codeblock's successors, feeding a 3rd iteration which marks edges.


With the unlikely codeblocks computed it iterates over all edges to check for return statements, to propagate that as a likely (or unlikely, if it heuristically appears we're returning an error) codepath.

It then iterates over the codeblocks & their instructions to propagate probabilities from annotations on called functions or the codeblocks themselves.

Each loop or recursion is handled next, marking any exits as unlikely & bailing if they're absent. It estimates the number of iterations.


It computes a precise probability for each exit edge. It incorporates any heuristically-detected loop variables into its probability calculations for the remaining edges. It iterates over each edge &, according to previously determined methods, computes its likelihood.

It iterates over edges to predict function calls.

It iterates again to combine probabilities if a branch ended up with multiple.

It cleans up & determines whether to recompute via a similar but different process.

Constant & Pure Functions

Often functions will do nothing but read from memory to compute a return value ("pure" functions), and often those reads are restricted to the function's arguments ("constant" functions). This makes code more maintainable, but it also enables greater optimization. So GCC has an analysis, pass_local_pure_const, to detect these cases.
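
These correspond to attributes you can also write by hand; a hypothetical pair of mine:

    /* A "const" function depends only on its arguments; a "pure" function
     * may also read (but never write) global memory. */
    __attribute__((const)) int square(int x) { return x * x; }

    extern int table[256];
    __attribute__((pure)) int lookup(unsigned char i) { return table[i]; }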


After checking whether to skip this function (no recursion, has direct calls) it examines the function, reports results, & reconstructs control flow.

After initializing a closure it iterates over every codeblock & instruction therein skipping DEBUGs.

If an instruction operates upon volatile globals it's neither constant nor pure. It looks over any memory the instruction modifies, skipping statics or locals, to find volatiles (again), "used" annotations (neither constant nor pure), optionally all other writes (neither constant nor pure), & constant memory (pure).

During tidyup it detects infinite loops & handles malloc(), etc specially.

Function Splitting

Sometimes functions have a fast codepath & a slow codepath. These are often explicitly split in the header files you import so the first codepath (and cheap condition) can be inlined, but if not GCC can split these two codepaths itself in the pass_split_functions optimization.


Skipping noreturn, malloc, main(), unlikely, non-inlinable, versioned, (C-irrelevant) nested, single-direct-caller, non-inlined or noinline, & section-annotated functions, it starts by finding the back edges of loops.

After initializing some collections GCC's pass_split_functions iterates over the codeblocks, examining their probabilities & the instructions therein to calculate the codesize of each block. It then depth-first-searches the control flow tree (backedges were previously removed) to consider where it can split & where it should.

If it found any candidates it adds the new function, adjusts the callgraph, & replaces the slow codepath with its call.


GCC has builtin functions for marking which codepaths are more or less likely, which I believe GNU LibC uses heavily. There are also opcodes in the intermediate language to do the same thing. The next (lowering) pass removes these, as their feedback has already been incorporated into their codeblock’s probabilities.

pass_release_ssa_names iterates over all SSA_NAMEs in a function & allocates each a new identifier.

Multi-target functions

There’s a pair of optimization passes which clones functions annotated with multiple targets, and choose the correct clone for each of it’s callees (with inlining).


The next pass rearranges memory structures so their alignment's friendlier to the CPU's vector circuits (disables carry bits of the adders), does something for transactional memory (ignoring this topic), & lowers thread-local-storage to runtime calls if not natively supported by the CPU/OS.

That, with ALL my previous threads, covers GCC's "small IPA passes".

pass_ipa_icf

pass_ipa_icf, if I understand it correctly, removes duplication in your program, making the output machine code more concise so it can better fit in the instruction cache.


For each function it starts by filtering out flagged or to-be-linked functions, & mutable variables, from consideration before building its own callgraph which it uses to update previously-computed hashes by testing each pair for equality.

It then iterates over those hashmap buckets to reconsider the groupings.

Surrounding that “subdivision” it outputs debugging info for the GCC devs.

It traverses the callgraph to determine which callers it should "split", stored as a "congruence class" in a dedicated collection (which it'll optionally validate after collection).

This yields data for additional subdivision, with a 2nd congruence collection yielding data for congruence classes to merge. Which involves multi-pass sorting & different handling for vars vs functions.

Interprocedural Constant Propagation

If a function is repeatedly called with the same constant arguments, it may be beneficial to clone those functions to allow for that constant propagation. This is what pass_ipa_cp is for.

It also serves to lower some C++ (etc) constructs, but I won't go into that further.
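
The sort of case it targets, as a hypothetical example of mine:

    /* Every call passes n = 3, so GCC may clone scale() with n fixed
     * to 3 & fold the multiplication at each callsite. */
    static int scale(int x, int n) { return x * n; }

    int f(int a) { return scale(a, 3); }
    int g(int b) { return scale(b, 3); }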


After allocating collections it iterates over the functions in topological order constructing "lattices" to detect opportunities for this, propagating constants along the callgraph.

A second pass in the same (precomputed) topological order clears out dead nodes & decides whether each function would benefit from this optimization by e.g. estimating the associated move costs, looking at other callers, consulting the lattices, etc. If so it creates the new callgraph node.

Information gathered is copied into the new clones in a third pass.

A fourth pass corrects previously-computed value ranges to match the new callgraph.

Then cleanup!

Interprocedural Aggregate Decomposition

As far as a CPU is concerned all data are numbers (or bitsequences), so GCC deconstructs larger types into their scalar components. (Repeating previous toot) The C parser already does this, but unless we remove these aggregates from function signatures it can't do much. That's what this pass is for!


Iterating over the functions in postorder (callees to caller) it looks for functions whose type signatures it can modify whilst removing dead parameters. Taking care of some edgecases.

This first iteration also propagates flags marking which parameters are actually used & actually splits the parameters until nothing changes.

A second reverse/preorder (caller to callees) iteration removes unneeded returns.

After which is cleanup & debugging output.

Interprocedural Inlining

It’s often a useful optimization to replace a function call with that function’s “body”, as that frequently reveals more optimizations! (Verbatim repeating previous toot) Whilst I have discussed this topic previously, the pass_ipa_inline optimization can examine more data to determine when it’s probably profitable to inline.


After initializing per-function flags & reordering non-"flatten"-annotated functions, it iterates over them to determine which (non-recursive, etc) to inline.

After inlining any appropriate "flatten"-annotated functions, it greedily inlines any small & single-caller functions in callees-to-callers order. A first pass here gathers relevant data; a second decides which edges would be valid & profitable to inline, inserting the results into a minheap. A third pass over the minheap consults a cache for final/readjusted codesize growth estimates & actually does the inlining!

It then removes uncalled functions & inlines single-call functions on hot codepaths.

Interprocedural Constant & Pure Functions

Often functions will do nothing but read from memory to compute a return value ("pure" functions), and often those reads are restricted to the function's arguments ("constant" functions). This makes code more maintainable, but it also enables greater optimization. (Repeating previous toot) pass_ipa_pure_const handles this across function calls.


After C-irrelevant propagation of the “nothrow” annotation, it propagates flags for where memory allocation is performed.

A 3rd postorder traversal propagates the pure & const flags based on worst-case analysis. This may remove C-irrelevant constructors & destructors.

Interprocedural Memory Accesses

The main bottleneck in a CPU, which motivates most of its (vulnerable) optimizations, is memory access. As such GCC needs to optimize memory accesses away as much as possible, and for that it needs to gather some data. (Repeating previous toot) pass_ipa_modref propagates this info between function calls.

In a postorder callgraph iteration GCC’s pass_ipa_modref first propagates within Strongly Connected Components (SCCs, co-recursive?) until no more changes need to be made. This is based in small part on the pure/const flags computed by the previous pass. Any useless summaries are removed.

Any flags are merged/propagated in a second fix-point loop within this iteration.

A second iteration updates the function type signatures.

Then it tidies up.


As a flipside to pure functions, GCC has a pass_ipa_reference analysis for determining which static globals each function directly or indirectly accesses (unless the variable has its address taken). Presumably this helps it improve data locality for these variables.

After gathering all the variables for it to consider, it iterates over all the functions twice to find how these variables are accessed.

Then it iterates over every function in postorder to do the real work.

For any valid cycle (direct or indirect recursion) it merges the read & write variable sets from all callees into every function in that cycle. All functions in a cycle share the same read/write var sets.

It finishes with some debug output for GCC devs & tidyup, involving another postorder iteration over the functions.


pass_ipa_single_use builds on pass_ipa_reference to flag any variables which are only read or written by a single function.

Loop Unrolling

If a loop iterates only a few times, it would likely reveal more optimizations if the loop was converted into straightline code. Besides, this removes conditional branches which, while vital to support, CPUs struggle with!

The other day I linked to a blogpost which manually unrolled a loop to reveal vector optimizations, but I don’t think this pass attempts anything like that.
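
The simplest case looks like this hypothetical snippet of mine:

    /* A loop with a tiny, known trip count... */
    int sum4(const int *a) {
        int s = 0;
        for (int i = 0; i < 4; i++)
            s += a[i];
        return s;
        /* ...which complete unrolling turns into straight-line code:
         * return a[0] + a[1] + a[2] + a[3]; */
    }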


With loop optimizers initialized & if there's more than 1 loop, it starts by estimating the number of iterations.

To estimate the maximum number of iterations it looks at every exit condition for the loop & finds the loop counter to extract the maximum number of iterations.

For a configurable number of iterations (if it wasn’t configurable, this optimization pass would definitely optimize itself!) it discards loops with non-trivial exits & recomputes those estimated iteration numbers so that it can look for the innermost loop it should unroll.


Upon finding a loop which (after some modification to remove edgecases) has a single exit & no more than a configured number of iterations, it'll duplicate the loop body that many times (with some validation & cleanup) and/or remove dead edges for the next iteration. Failing that it'll try again under different conditional checks.

These loops are placed in an array so that if it managed to unroll any loops or remove edges it can tidyup those loops to allow for further loop unrolling.


This tidyup involves several smaller steps I won't enumerate here.

It ends this loop by validating the results, possibly clearing invalid state.

And finishes by flagging loops as irreducible, & cleans up.

Propagation passes

For some mathematical functions the sign of the input doesn't matter, in which case we'd want to drop any negations, etc. For non-obvious cases of that, GCC's pass_backprop analysis/optimization propagates these sign flags backwards through the expression graph.

This is done in 4 phases: propagating into variables/PHIs in block postorder, reprocessing any queued variables flagged too optimistically, using that info to optimize instructions in block preorder, & deleting dead variables.
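
The obvious cases look something like these hypothetical one-liners of mine:

    #include <math.h>
    /* The sign of the argument doesn't affect the result, so the negation
     * (and any work to compute it) can be dropped. */
    double f(double x) { return fabs(-x); }   /* same as fabs(x) */
    double g(double x) { return cos(-x);  }   /* cos is even: same as cos(x) */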


Not sure why this is done, but there's a pass_phiprop which, when multiple branches assign different dereferenced pointers to a common var, merges that into a single dereference. Maybe to give more time for prefetching data, & drop unneeded loads?

For each PHI in dominator-preorder it checks if all parameters are dereferences & rewrites those args, then if successful rewrites all uses of the PHI to insert the combined dereference.


Then it repeats normal forward propagation.

Returning Aggregate Values

After all this propagation, GCC relowers __builtin_object_size & sets a flag to reanalyze aliases, then optimizes certain cases of return values. It iterates over every codeblock & instruction to ensure that larger return values which don't fit in CPU registers are written directly to the callstack rather than some temporary memory.

Specifically it looks for the "aggregate" return value being computed by a subcall, so it can annotate that call with where it should write its result.

Conditional Dead Code Elimination

GCC repeats several more previous passes including refining the ordering of the codeblocks (very little difference in operation, I checked). The next new pass lowers C’s variadic arguments for certain trickier calling conventions. Let’s assume it’s not X86 relevant?

Then for expensive error-throwing functions with unused results, pass_call_cdce wraps them with a condition to only call them when they'll actually throw an error.

With one iteration to find these cases, & another to rewrite them.

Conditional Store Elimination

Conditional branches can interfere with the future Common Subexpression Elimination optimization pass, so pass_cselim (Conditional Store Elimination) attempts to offload more of those branches onto the PHI instructions.

With loop optimizers initialized & dominators (& non-trapped memory accesses) computed it iterates over the basic blocks from predecessors to successors to find simple enough branches it can optimize, figuring out which branch falls through to the other.


Before/after the branch those stores will be diffed to land them in the appropriate codeblock, in one of two ways.

After a few more checks it restructures the codeblocks/stores to actually remove the branch, & adjusts the referenced instructions to read the right data.


It uses several tricks to remove the need for the condition, & (depending on the branch structure) if successful it’ll actually go through the work of removing it.

These PHIs are postprocessed to determine where the PHIs aren’t actually needed.

After which it tidies up all its memory.

Copy Propagation

I know firsthand that minimizing how many times data is copied around RAM is vital, and GCC addresses smaller cases of this in its pass_copy_prop optimization. Memory accesses after all are a bottleneck for the CPU!


This is split into two parts: initialization & propagation.

The initialization iterates over every codeblock & instruction/PHI therein, gathering a small-int map & flagging instructions where appropriate.

GCC’s pass_copy_prop reuses a generic traversal over the control flow graph of codeblocks & instructions therein propagating those flags it has just set according to a bitmask of what has yet to be traversed.

Some post-processing iterates over every variable to remove any duplicates & update their references. This may reveal more dead instructions for more generic code to delete, which in turn may lead it to tell the pipeline runner that the control flow graph needs to be recomputed.

Bitflag Conditions

In an attempt to yield more concise machine code, GCC has a pass for combining conditional branches when they're used like logical (short-circuiting) || or && ops.

It iterates over the codeblocks in reverse order looking for ones ending in if/else branches. If it finds a pair resembling && or (using De Morgan’s) || which can beneficially be converted from short-circuiting boolean operations to bitwise operations it’ll perform the appropriate conversion.
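
For example, with a hypothetical side-effect-free condition of mine:

    /* Two cheap tests joined by &&: instead of two conditional branches,
     * both can be evaluated & combined with a bitwise &, leaving one branch. */
    int in_range(unsigned x) {
        if (x >= 10 && x < 20)
            return 1;
        return 0;
        /* roughly: return (x >= 10) & (x < 20); */
    }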

Copy Loop Headers

If loop headers are simple enough it can be beneficial to move them from the top of the loop to the bottom (converting while loops into do-whiles), so that continuing a loop involves 1 branch not 2.

To make that correction GCC’s pass_ch optimization, with loop optimizers initialized, iterates over every loop looking for ones it can optimize & determining how much (if any) of the header it should inline.

Then it finalizes the SSA restructuring.

More Aggregate Decomposition

At least 3 earlier passes, including the C parser, have already done much of this work, but CPUs don't deal in terms of structures & unions. They deal in terms of numbers/bitsequences possibly stored in RAM. The previous passes ensured all within-function "aggregates" are "scalarized" & that function calls can be. This pass finishes the job!
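
For instance (my own hypothetical struct, not GCC's example):

    /* The struct never escapes the function, so SRA can replace it with two
     * independent scalars, likely held entirely in registers. */
    struct point { int x, y; };

    int manhattan(int x, int y) {
        struct point p = { x, y };   /* becomes: int p_x = x, p_y = y; */
        return p.x + p.y;
    }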

Once it’s collections are initialized it finds all arguments & variables to be scalarized.


Next GCC’s pass_sra optimization iterates over every codeblock & instruction therein looking for returns, assignments, function calls, & asms for it to attempt to scalarize by recursing over the type system’s Abstract Syntax Tree expanding the operation into multiple. Assignment is handled slightly differently to pair source & target.

Any accesses queued for more advanced restructuring are handled after that main iteration, before another loop commits the changes to the opcodes.


And the final subpass (before cleanup/additional debug output) iterates over a function’s parameters & their accesses to update any code referencing them.

Dominators

In compiler engineering a codeblock's "dominator" is another codeblock which is necessarily executed before it, no matter which branches are taken in between. This has informed several optimizations I've previously described, but today's suite of subpasses specifically optimizes over the dominator tree.


After initializing various collections & calculating the dominator tree & loops, this pass_dominators optimization flags any depth-first-search backedges in the control flow graph.

It then iterates over all the basic blocks examining their last instruction to collect their outgoing control flow graph edges & any invariants within each codeblock.

It then reruns the Value Range, jump threading, & jump thread simplifier optimizations over the examined function (I believe I've discussed them all previously).

If this triggered any modifications it’ll check if there’s any dead edges & use the jump threader to remove them.


For each codeblock & every instruction therein it’ll correct any flags for modified instructions.

It flags that SSA needs to be reestablished, frees the now-outdated dominator info, & uses the jump threader to rearrange the codeblocks into a new optimal order.

For any flagged codeblocks it’ll fixup C++, etc exceptions.

It removes deadcode defined as following a noreturn-annotated call.

Then it cleans up the collections used to inform all this, with optional debugging output.

Erroneous Codepaths

There may be codepaths in your program which can't or shouldn't ever be taken. These should be separated out by GCC so they can be flagged as unlikely. This is what GCC's pass_isolate_erroneous_paths optimization does.


After initializing it’s collections it runs two subpasses for implicit & explicit erroneous codepaths. After which it frees those collections, deletes outdated dominator info, & set any relevant flags if the code’s been modified.

For implicit erroneous codepaths it iterates over every codeblock (unless it has a direct loop, abnormal branch edges, or exceptions), every PHI op therein, & every variable they unify to find & isolate any undefined behaviour. The isolation is performed by duplicating the codeblock.

For explicit erroneous codepaths it iterates over the codeblocks (unless they have abnormal edges or exceptions) & instructions therein to find & isolate invalid uses of 0 or NULL, turning them into signal-triggering conditions.

Commutativity

In mathematics a + b = b + a, and a * b = b * a. This is known as "commutativity", and allows GCC to rearrange operands into an order it can better optimize.
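
A tiny hypothetical example of mine of why the rearranging matters:

    /* As written the constants are separated by variables; reassociating
     * the adds lets GCC fold them together. */
    int f(int a, int b) {
        return (a + 1) + (b + 2);    /* can become (a + b) + 3 */
    }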


After initializing various collections (populated with variables, etc from the current function) along with the loop optimizer & dominators, it starts by converting subtract instructions (whose result isn't used outside +/-) into add + negate instructions to reveal more commutativity. The negation is stored out-of-line from the code.

Another pass iterates over each codeblock's instructions again to find commutative operators. Once it finds one it first gathers all the direct/indirect operands (decomposing multiplies-by-constants into adds) into a single array, swapping the operands so the head is non-commutative, so that it can become a left-associative expression.

In doing so it flags the other instructions it’s rewriting so they can be removed during the broader instruction traversal.


In gathering that operands list it looks up the pre-computed “rank” for each operand for them to be sorted by after it’s converted into left-associative form.

Some “peephole” optimizations & (key reason for this optimization!) constant propagation are applied. Followed by “undistributing” multiplication/division factors from sums, as per the algebra you learnt at school, going from most common to least.

It does similar to increase array locality. Or it’ll convert adds back to multiplies.


Subpasses on each commutative run also include ones for bitwise ops on comparisons, for unifying range checks, for the copysign(x, y) builtin, & (with various related optimizations) for converting multiplies into the powi(x, y) builtin.

Unless those subpasses optimized basically everything away, it moves constants towards the front, checks if there's 3 or (rarely) more operands, & approximately (due to that rarity) sorts them by rank in the output code.

Then adds reassocpow() & negation.


GCC’s pass_reassoc then reinserts the negations it has previously set aside, rearranging additions into subtractions where appropriate. Followed by fixing up impacted conditional branches.

And, as per usual, outputting debugging information for GCC devs & freeing all the collections it referred to in all those subpasses.

Trigonometry

Computers calculate sin(x) & cos(x) together, each refining the other, and some other math functions are implemented the same way. As such each is calculated as a byproduct of the other, & if you take both functions of an angle the calls should be unified.


It iterates over every call in every codeblock to cos(), sin(), cexpi(), pow(), powi(), & cabs(). For sin(), cos(), & cexpi() it gathers all the relevant uses before merging the calls & adjusting those uses to reference the merged one.

powi() by a constant gets inlined into repeated multiplication, or sqrt(), etc.

And there’s special logic to handle taking the absolute value of a imaginary number.

bswap

Not much to say for the pass_optimize_bswap optimization: if you manually open-code a byteswap, GCC will recognize the pattern & replace it with the single instruction. So please don't, you're just making your code harder to read!
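
The pattern it recognizes looks like this hypothetical function of mine:

    #include <stdint.h>
    /* The manual byte-shuffle... */
    uint32_t swap32(uint32_t x) {
        return (x >> 24) | ((x >> 8) & 0xff00)
             | ((x << 8) & 0xff0000) | (x << 24);
        /* ...collapses into the equivalent of __builtin_bswap32(x). */
    }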

Address-of Array Indexing

The parser already does this wherever it can, but any address-of instructions need to be lowered prior to outputting the corresponding Assembly code. pass_laddress implements this specifically for any new &b[i] instructions that have been revealed by other optimizations or calling-convention lowering.

Loop Invariant Code Motion

When there’s some expression within a loop (a “loop invariant”) who’s value doesn’t change, it’s better to calculate it once rather than once-per-iteration. Moving these loop invariants is called “loop invariant code motion”.


Once it’s detected all the loops & initialized various collections populated from them, pass_lim iterates over all the codeblocks in loop-prefixed postorder to gather any memory references, optionally validate, & propagate up the loop hierarchy.

Another pass disqualifies any side-effecting function calls from the optimization. From which it can iterate over each loop's PHIs (if there are more than two) then instructions to mark any loop invariants.

So that it can recursively iterate over the loops (without abnormal exits) & move those variables to the outer loop, with some validity checks & pruning.

After which it recomputes that iteration order & iterates over the PHIs then instructions in each block to move whole expressions, not just vars.


Those PHIs and (apart from CONDs) instructions are moved into outer loops as specified by the same analysis as used for moving variables.

After which it finalizes changes to the control flow graph & cleans up its collections & detected loops.

Partial Redundancy Elimination

For the sake of concision GCC's pass_pre (Partial Redundancy Elimination) approximately removes duplicate computations. I think this is more for compiletime rather than runtime speed since other passes like pass_fre do a better job.


After initializing loops, dominators, reverse-postorder, etc collections & skipping massive functions, it operates in 3 basic steps: compute what's available, compute what's "antic" (anticipated), then insert expressions where needed.

Then finalizes, outputs debugging info, & cleans up.

Computing available variables breadth-first traverses the dominators graph looking for relevant PHIs then instructions to populate a list of variables & hashed expressions.

Computing antic involves computing a bitmask of abnormal edges then traversing the edges to find which incoming variables are valid, until no more changes need to be made.

For insertion, until no more changes need to be made, it iterates over the codeblocks in reverse postorder to construct assignments & insert them into the appropriate PHIs.

Code Sinking

You may have code which stores identical values within the branches out from a basic block, so GCC has a pass_sink_code optimization to make sure those get deduplicated, amongst everything else, for code concision. Or maybe those stores should be moved down to where they're later used.


After initializing loop optimizers, dominators, & profiling, & clearing away branching edge cases, it examines each codeblock's predecessors' PHIs for common stores to extract into the current codeblock, until there are no more.

If there’s any dominator children without abnormal children GCC’s pass_sink_code will iterate over the codeblock’s instructions in reverse order to determine which codeblocks they should be “sunk” to, as long as they don’t have any sideeffects or depend on unsunk instructions. From there it examines PHIs, etc to find in which blocks the value’s actually used to determine where to move the instruction if anywhere.

The process repeats for all dominator children before cleaning up.

Loop Optimizations

The loop optimizers are kept around between all these optimizations, rather than constructing & deconstructing them per-pass. All of these serve to reduce loops & loop iterations, avoiding the cost of loading (mispredicted) non-linear machine code from RAM.

Loop Unswitching

An important performance metric for GCC to optimize is branch mispredicts (which cause the CPU to clear its pipeline & start over), and one effective means of improving this is to move invariant checks out from inside the loop, instead duplicating the loop behind a check for each conditional invariant.
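
Concretely, in a hypothetical loop of mine:

    /* flag never changes inside the loop, so the loop can be duplicated:
     * one copy for flag != 0, one for flag == 0, neither branching inside. */
    void adjust(int *a, int n, int flag) {
        for (int i = 0; i < n; i++) {
            if (flag)
                a[i] *= 2;
            else
                a[i] += 1;
        }
    }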

So for any function with more than one loop, GCC’s pass_tree_unswitch optimization iterates over them.


For innermost loops it first checks if it’s an iteration loop or if we’re enlarging the code more than configured.

If not it locates a valid branch to "unswitch" & generates those duplicated loops, replacing the extracted conditions with 0 == 0 or 0 == 1 & making sure to drop the inevitable dead branches now. This process repeats up to a configured maximum, and if there were any successful unswitchings it performs more thorough dead branch elimination, reestablishes SSA, & repeats for the two new loops.


For outer loops it finds those newly generated “loop guards” & attempts to move them to an outer loop. Which may require it to reestablish SSA again, & fix up PHIs.

Note that it does leave the branches it extracts within the loop, just makes it painfully obvious for later optimizations to remove the conditional branch. A flag is set indicating this needs to be done.

Loop Versioning

Multiplication is a relatively expensive operation (summing e.g. 64 intermediate values), so an effective optimization which can reveal vectorization opportunities is to split such loops into different versions.


If there’s more than one loop GCC’s pass_loop_version optimization starts by iterating over every (hot) loop to determine which it can optimize. No rejected inner loops, & has (expensive) multiplications this could remove.

Then chooses versions of the loops to generate.

Next pass_loop_versioning traverses the dominators tree for each loop it can optimize to recompute version ranges & determine whether this optimization allows it to remove any conditions from within the loop. A third iteration over the loops uses that info to determine whether it should apply the optimization to each loop, collecting them into a queue.

Which will be iterated over to duplicate, “guard”, SSA, & simplify a more optimized version of them for the selected “strides”.

Induction Variables

It’s common for loops to run for a fixed number of iterations, and there’s ways of optimize such loops. Such as by replacing it with n copies of it’s body, “loop unrolling”! To detect less obvious cases of this, pass_iv_canon adds obvious “loop inductors” to them. Possibly doing some loop unrolling & peeling itself.


If there’s more than one loop in the function, it starts by estimating & caching the number of iterations for each loop utilizing the numeric range analysis.

It then iterates over all the loops to determine how to rewrite them based on that analysis. It determines whether it can add a new loop condition & which loop conditions that replaces, & attempts to unroll or peel (unroll n iterations before the loop proper) the loops.

After this main rewrite it marks exits for unrolled/peeled loop iterations as unreachable. It iterates over the loop’s Control Flow Graphs to mark cycles as irreducible. It cleans up, reestablishes SSA, & flags CFG needs to be recomputed.

Loop Distribution

To reveal vectorization opportunities in C/etc loops GCC will try splitting a loop into multiple loops with mutually independent bodies. However this does increase pressure on the branch predictor, so these loops will be merged again once the optimizations they reveal have been taken advantage of.
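
For example (hypothetical code of mine):

    /* The two statements are independent, so the loop can be split in two;
     * the first may even be recognized as a memset. */
    void reset_and_double(int *a, int *b, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = 0;          /* -> memset(a, 0, n * sizeof *a); */
            b[i] = b[i] * 2;   /* -> its own, vectorizable loop */
        }
    }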


Once it’s collections have been initialized (include postorder codeblock list) & if there’s more than one loop, it unsets instruction flags & iterates over the loops.

For each innermost loop it discards multi-exit, (optionally) cold, or non-for-like loops. Then it locates the outermost loop it could move this one to & iterates over the PHIs then the loop body to find instructions it'd be useful to split into their own loops: ones that assign values used outside the loop without triggering sideeffects.

It then uses dominators to compute inter-instruction dependencies to determine which instructions should come bundled with those it decided to “distribute”.


Within the main method for applying these optimizations, distribute_loop, it bails if it can't find which loop to nest within, can't construct a Reduced Dependence Graph (RDG) from the instructions & computed dependencies, or if there's too many memory references.

It iterates over each loop we've decided to generate & traverses the RDG to find which other instructions should be bundled with it in its own loop, using the instruction's flags field & a depth-first-search, merging where these overlap.


It merges the bitmaps of instructions we’ve assigned to new decomposed loops, so it can use it whilst iterating over the new loops it’ll generate to see if they actually reveal any optimizations. It bails if not.

It merges any partitions which won’t give a win, or have similar memory access models, etc.

Then it generates the new loops, or replaces them with (previously-merged) memset() or memcpy() calls, before cleaning up & iterating to the next loop.

Outer Loop Unrolling

In my GCC discussions, it looks like I’ve skipped over a pass which unrolls the outer loop around some innermost loop. Possibly rearranging the execution order back into a single innermost loop if valid. Presumably to minimize how often it has to check & branch predict the loop condition?


If there’s more than one loop in the function it iterates over all innermost loops looking for ones it can & should optimize. If so it computes the loop’s data dependencies for more involved checks.

After checking the variables within these innermost loops aren't too interdependent it unrolls the outer loop before attempting to fuse the resulting inner loops: iterating over those inner loops fixing the PHI instructions, & merging the codeblocks of each subsequent loop into the first child.

This necessitates recomputing the loop structure & SSA invariants.

Loop Interchange

Sometimes the nesting of loops is suboptimal; we want the innermost loop to iterate more often than the outermost loop. So GCC's pass_linterchange figures out when to correct for this.
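
The textbook case, as a hypothetical example of mine, is really about cache-friendly strides:

    /* The inner loop strides down the columns of a row-major array,
     * touching a new cache line every iteration; swapping the two loops
     * makes the inner loop walk consecutive memory instead. */
    void zero(double m[100][100]) {
        for (int j = 0; j < 100; j++)
            for (int i = 0; i < 100; i++)
                m[i][j] = 0.0;
    }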


If there’s more than two loops in the function it’ll iterate over all the innermost loops computing & applying optimal loop nesting.

To compute the nesting it checks that the loop's structured simply enough & doesn't have any data dependencies preventing the swap, then bubblesorts by known number of iterations.

To apply this computed loop order it iterates over it, revalidates the referenced loop extracting its condition, double checks it's actually profitable to do the swap by checking all the data references, & swaps the loop conditions between the existing target loops.

This requires swapping the corresponding loop indices, & removing the newly-revealed deadcode.

Graphite IL

The next GCC optimization pass on my list has an awful potential to become a large sidetrack from discussing how GCC optimizes loops. It recompiles GCC's GIMPLE intermediate language to Graphite & back to better optimize memory references in loops, if there's more than one loop & (as this optimization would've already been applied) in the absence of OpenMP/etc. This happens after preprocessing loops & calculating dominators.

So for an actual explanation please see: http://gcc.gnu.org/wiki/Graphite

If Conversion

A key goal of all these GCC loop optimizations I've been describing is to ensure loops can be "vectorized". Conditional branches can stand in the way of this, so there's a pass which attempts to replace them with an instruction that selects which value to use based on a condition (akin to cond ? t : f, & akin to how GPUs work).
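
For instance, in a hypothetical loop of mine:

    /* The branch inside the loop body blocks vectorization... */
    void clamp(int *a, int n) {
        for (int i = 0; i < n; i++) {
            if (a[i] < 0)
                a[i] = 0;
            /* ...if-converted: a[i] = (a[i] < 0) ? 0 : a[i];   (no branch) */
        }
    }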


If there’s more than one loop in the function, it iterates over them all to apply this optimization before tidying up (enforcing SSA) & validating.

To optimize each loop GCC’s tree_if_conversion optimization:

  1. Determines how aggressive to be.
  2. Finds & "splits" all "critical" edges in the loop's Control Flow Graph.
  3. Validates that it can optimize the loop, based on that & on examining the instructions.
  4. Applies loop versioning as described earlier.
  5. Merges all the loop's codeblocks into a single (vectorizable) one, referring to the dominators tree, after inlining the conditions.
  6. Applies CSE & DCE optimizations & cleans up.

Loop Vectorization

A simple means by which CPUs allow for parallelism is "vectors", which can be implemented by disabling certain carry bits, thereby halving the bitsize of the numbers & doubling the concurrency. Due to the success of these optimizations larger vector sizes are now widely supported.

Unlike in GPU programs, developers don't normally have direct access to these CPU features, leaving the compiler's optimizations to infer where to add them. Loops are a prime opportunity, with the previous optimizations clarifying opportunities.
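
The classic candidate, as a hypothetical snippet of mine (the restrict qualifiers promise there's no aliasing):

    /* With no aliasing or control flow in the way, GCC can process several
     * elements per iteration using the CPU's vector registers. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }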


If there’s more than one loop & noting vectorization already applied, it iterates every loop taking some care regarding the order in which inner loops are vectorized.

If the loop is flagged for vectorization it first analyzes it to determine 1) whether the outer loop has already been vectorized, 2) that it doesn't nest more than one loop, 3) has trivial enough control flow, 4) doesn't call unanalyzable functions, 5) has trivial enough dataflow within & between iterations, 6) etc.

With some code tweaks.


Amongst later checks it begins the process of vectorizing the loop body by loading its relevant instructions &, separately, PHIs into a "worklist" for it to load into a new structure. Which, through extensive (I don't understand it) postprocessing, becomes an "SLP" structure.

Given these SLPs it determines whether the computed optimization is worth it in multiple passes. Or whether it requires duplicating the loop for different conditions (again).

In some cases it retries this analysis.


If unsuccessful there’s plenty of tidyup to do. If successful (after revalidating some of those conditions) it generates those new conditionalized “versions” of the loop, then “peels” a few trailing iterations off in case they wouldn’t fill a full hardware vector unit.

It scales probabilities in the loop's Control Flow Graph to reflect that it's now running multiple iterations simultaneously, & merges the computed SLPs (once "scheduled") with the instructions referenced by the loop's PHIs.


Following this it corrects the results of several more analysis passes (or flags analyses which need to be rerun), and tries repeating the vectorization on that trailing loop for whatever failed to get vectorized in this pass.

After all the loops have been vectorized it postprocesses the optimized code to make some minor corrections (again flagging analyses that need to be rerun) regarding function calls, data storage, & vector ops.

Evidently GCC considers this a vital optimization.

Loop Unrolling

If a loop only iterates a fixed number of times, it can be more efficient to replace it with straightline code. Because the CPU can trivially prefetch such straightline code without getting confused regarding which code to prefetch.


If there’s more than one loop in the function & with some bitflags allocated, it starts by estimating the number of iterations per loop before iterating over them all a configurable number of times.

For each of GCC’s pass_complete_unroll optimization’s configurable number of iterations, it checks whether we’ll need to keep a bitmask of invalidated SSA invariants & reestimates the number of iterations for each loop.

Before iterating over all those loops from innermost to outermost setting the appropriate bitflags, discarding loops flagged as needing vectorization (which should have been done by now, but the code's reused) amongst other flags, & finding/recreating induction variables.


As part of the routine for finding/recreating those induction variables, pass_complete_unroll finishes by doing the unrolling (& peeling) it's named for. Unrolling involves performing some final checks (e.g. that we're not increasing codesize too much!) before cloning the loop body (& loop indices, etc) n times, substituting new variables & fixing up the PHIs.

If successful it’ll remove the loop from the loop index & control flow graph, indicating these unrolled iterations are certain to run.


If successful pass_complete_unroll will also somewhat manually update the related loop datastructures.

After it has finished (re)processing all the loops it may redetermine which loops are irreducible.

pass_predcom

Loops processing arrays often involve data from previous iterations. In which case it's more optimal to keep this data within CPU registers, so we don't have to wait around for RAM to respond.

Just how extremely slow RAM is, relatively speaking, motivates most hardware & compiler optimizations!
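
For example, in a hypothetical loop of mine:

    /* a[i - 1] was already loaded (as a[i]) on the previous iteration, so
     * predictive commoning carries it over in a register rather than
     * re-reading it from memory. */
    void smooth(double *a, int n) {
        for (int i = 1; i < n; i++)
            a[i] = (a[i] + a[i - 1]) / 2.0;
    }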


After checking there’s loops present, with copy tables initialized, & before reestablishing SSA invariants if necessary pass_predcom iterates over all loops in the function.

For each loop (once validated) it starts by attempting to extract the loop's dataflow. If successful it groups the root datarefs by which array (if any) they index into, via a tree structure & bitmasks, discarding any groupings which may be unpredictably mutated. A second pass removes invalid groupings from the resulting linked list.

Another iteration over them extracts the index offsets to restructure the groupings into “chains”.

If it computed any it adds initializers & finalizers.


Loading those chains into a worklist, it checks every pair to see if they can be combined via mergesort.

It determines how much it should unroll the loop so it can apply the computed optimizations. Followed by actually unrolling the loop, with a callback to replace the appropriate reads according to those computed chains.

Vectorization by Instruction Pairing

Continuing my explorations of how GCC takes advantage of your CPU’s vector hardware, despite C not explicitly supporting vector types, pass_slp_vectorize pairs compatible operations within each codeblock.


After unflagging instructions & PHIs as having been visited, & with an allocator (and, for reruns, the loop indexes) initialized, it iterates over codeblocks in reverse postorder locating bundles of codeblocks with the same dominator.

For each bundle of codeblocks pass_slp_vectorize collects dataref “regions” (iterating over every instruction in every codeblock until it next successfully extracts interesting data references) before repeatedly locating & applying vectorization opportunities (reusing the same “SLP” infrastructure used to vectorize loops) for each subsequent vector size.

Collecting data refs mostly involves validation more than extracting the appropriate fields for each opcode.


Looking for vectorization opportunities involves a 1st iteration over all those datarefs (if any) to see if there are any hurdles regarding non-vectorizable instructions or memory representations.

A 2nd iteration pays more attention to the in-memory representation, with a fixpoint loop to fixup groups with duplicate entries.

Then it examines the opcodes for constructors, applies various tweaks, & constructs & optimizes the SLP representation for various cases. Then checks alignments.

Loop Prefetching

It takes a relatively very long time for a CPU to fetch memory from RAM, so the sooner a program can start the prefetcher the better. Loops provide GCC with good opportunities to derive these prefetch instructions!
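
Roughly what it inserts, expressed back in source terms (a hypothetical sketch of mine; the distance of 16 elements is arbitrary):

    /* __builtin_prefetch(address, rw, locality) hints the CPU to pull the
     * cache line in ahead of time. */
    long sum(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16], 0, 3);   /* read, high temporal locality */
            s += a[i];
        }
        return s;
    }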


To do so pass_loop_prefetch, if there's more than one loop & the target machine supports explicit prefetches, iterates over all the loops from innermost to outermost, with copy tables initialized & ensuring a builtin prefetch function is declared.

Skipping cold loops in favour of small codesize, pass_loop_prefetch estimates how long the loop would take to execute to determine whether adding prefetch instructions would be profitable.

It iterates over all the loop’s codeblocks & instructions therein to retrieve any memory references from assignments. The optimizations is skipped is skipped if there’s too many or too few.

It deduplicates those retrieved memory references by “reuse”, judging profitability.


If that ~deduplication yields nothing to prefetch it exits.

It estimates the number of iterations until a value is reused, exiting if unsuccessful, so that it can take the least common multiple as the number of iterations to unroll. Each item in its prefetch plan is adjusted for this number of iterations, exiting if it's now unable to prefetch them, or if it's not profitable to do so.

It flags the stores, allocates prefetches to hardware units, loop unrolls, & generates the instructions.

Loop Counter Optimization

Most loops (especially for loops) include a counter, so it’s profitable for the compiler to ensure those are written to the most optimal machine code.


If there’s more than one loop & with collections initialized tree_iv_optimize iterates over the loops from innermost to outermost before resetting the optimization indexes.

For each one it locates the counter variable & its uses, considers & costs various alternatives, inserts the most optimal, & removes the old one(s).

SIMD Cleanup

Once GCC has inferred SIMD operations in your programs it’s important to ensure those SIMD ops work nicely together.

To do so pass_simduid_cleanup runs three subpasses to extract all vectors being processed into a hashmap, optimize each of their calls with that info, & shrink the size of temporary arrays.

Undo SIMD

There’s pass which can be enabled per-function to lower those vector operations back into scaler, possibly revealing deadcode it should immediately remove.

This pass is quite straightforward and doesn't need much more explanation.

Lower Switch

Switch statements provide GCC with great flexibility regarding how to compile them to machine code, & the main optimization pass taking advantage of this opportunity is pass_lower_switch.


After collecting all switch statements amongst the instructions within every codeblock in the function, possibly merging cases at higher optimization levels, it iterates over those switches again.

pass_lower_switch starts lowering each switch statement by treating each branch as an individual cluster. It aggregates clusters whose conditions could be compiled as bit tests, in two passes over that array.

Iterating over those results & twice more over the labels it finds good opportunities for jump tables (looking up branch targets from an array). It does the same for any remaining switch labels.

Then it generates the code & relevant PHIs.

Inverse Square Root

I can’t recall whether I’ve covered this already, but inverse square root is commonly computed as a byproduct of normal square root (using the same techniques as sin() & cos()). So there’s a pass_cse_reciprocals which, utilizing the dominators tree, seeks out for each relevant args & vars to seek out divisions by precomputed square roots to, seperately, turn them into multiplications.

Because multiplications can be done in logarithmic time rather than linear time.
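
The kind of rewrite it's after, as a hypothetical sketch of mine (GCC only does this under -ffast-math-style flags since it changes rounding):

    #include <math.h>
    /* Three divisions by the same sqrt become one reciprocal & three
     * multiplications. */
    void normalize(double v[3]) {
        double len = sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
        v[0] /= len; v[1] /= len; v[2] /= len;
        /* roughly: double r = 1.0 / len; v[0] *= r; v[1] *= r; v[2] *= r; */
    }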

Strength Reduction

An important optimization for GCC to apply is to examine each individual instruction to see if it could be replaced with something more optimal.
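
The classic example, hypothetical & mine:

    /* i * 8 recomputes a multiply every iteration; strength reduction
     * replaces it with an addition carried between iterations
     * (t = 0; ...; a[i] = t; t += 8;). */
    void ramp(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = i * 8;
    }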

To do so, with allocators, collections, & loop indexes initialized, pass_strength_reduction first traverses the dominators tree looking for opportunities amongst each codeblock’s instructions & (which it’ll consider the opportunity cost of removing) PHIs.

It then decides whether & how to optimize each of those candidates as it applies those optimizations!

Codepath Splitting

Presumably to aid in simplifying unstructured control flow, GCC has an optimization pass_split_paths which decides whether to duplicate the backedges of (especially) loops.

To do so, with loop optimizers, copy tables, & dominance info initialized, it iterates over every loop from innermost to outermost skipping ones we’d prefer to keep concise rather than fast.

For each loop it locates a codeblock with relevant control flow, and if broader control flow isn’t too complex it duplicates it.

Codepath Tracing

A fairly similar pass called pass_tracer detects suboptimal control flow paths & duplicates "tails" to optimize them.


If there’s enough codeblocks to justify the optimization, it flags Depth-First Search backedges as an initial pass. Then, with collections & the dominators tree initialized, it iterates over every relevant codeblock to populate a minheap scored by previously-computed probabilities.

From least frequent to most, pass_tracer computes the control flow path to each such codeblock. For each of those it removes it from the minheap & decides whether it can/should duplicate that codeblock.

Warn Restrict

Ignoring repeat optimizations, next in GCC’s pipeline is the pass_wrestrict warning which typechecks function calls for invalid pointer usage.

Wide Arithmetic

CPUs typically provides an instruction to compute both the upper & lower words of a multiplication result.

For this reason pass_optimize_widening_mul traverses the dominators tree to locate multiplication (as well as +, -, ~, & mod) assignments & builtin calls which it’ll attempt to expand.

If the multiplication result is of a larger bitsize it'll get converted into a widening multiply, depending on its type. Failing that, x*copysign(1,y) becomes xorsign(x,y). Or subsequent adds are fused in.

pass_optimize_widening_mul also recognizes that CPUs keep a bitflag recording whether an arithmetic operation has overflowed, something C devs have to fight to recreate. So if there are any arithmetic operations it'll check whether they appear to be checking for arithmetic overflow, so it can replace such checks with the CPU's native support. The unsized arithmetic used within GCC itself resorts to Assembly to overcome this shortcoming.
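
The kind of fragile manual check it can replace (a hypothetical sketch of mine):

    #include <limits.h>
    /* The laborious way to detect signed overflow without invoking UB... */
    int safe_add(int a, int b, int *out) {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
            return 0;                       /* would overflow */
        *out = a + b;
        return 1;
        /* ...which the CPU's overflow flag (exposed via builtins like
         * __builtin_add_overflow(a, b, out)) handles in one instruction pair. */
    }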

Similar widening is applied to +/-. Divides & remainders are fused into 1 op.

Store Merging

CPUs nowadays operate on larger wordsizes than individual bytes. So if you're writing to 4 consecutive bytes you can usually store a 32bit value in their place, so there's a pass_store_merging optimization to perform this conversion! Yielding more concise code with possibly less pressure on the data caches.
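
For example (hypothetical code of mine):

    #include <stdint.h>
    struct header { uint8_t a, b, c, d; };

    /* Four adjacent byte stores... */
    void init(struct header *h) {
        h->a = 1; h->b = 2; h->c = 3; h->d = 4;
        /* ...can be merged into a single 32-bit store of a constant. */
    }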


To do so, with exception edgecases removed & the dominator tree calculated, pass_store_merging iterates over the codeblocks to find relevant stores.

An initial iteration over the instructions in each codeblock counts the number of valid assignments & (C++?) variable constructors to see if it’s a valid codeblock to optimize. If it’s invalid the relevant state is removed.

A second iteration over the instructions gathers relevant stores, invalidating & optimizing where relevant as it goes. Upon terminating each chain (which it also does for all of them at the end) it applies the chain to the code being compiled.


Applying a chain involves two steps: 1) final planning/coalescing & 2) mutating the code.

For (1) it does validation, iterates over the newly-sorted list of stores to extract “merged store groups” (lowering bitflags specially in the process & validating the stores don’t overlap), & validates the stores it has merged.

For (2) it iterates over all the store groups outputting the new stores (in various different cases) for each & if successful conditionally deletes the old stores.

Split Critical Edges

There’s a pass_split_crit_edges pass for tidying up some sort of edgecase which can trip up several optimizations, via a callback on each “critical edge” it finds. With special handling for loops & case statements.

A critical edge is an edge whose source codeblock has more than one successor & whose target has more than one predecessor. A heavily-shared piece of control flow.

Not that I really understand what this is about.

Late Warn Uninitialized

A dominant source of undefined behaviour in C is uninitialized memory. I've already described GCC's initial pass at warning about this, but (after repeating the earlier checks) pass_late_warn_uninitialized now does some more thorough checks regarding control flow.


With the dominators tree initialized it loads all the uninitialized PHIs (where values join between control flow branches) from all codeblocks into a worklist.

Skipping PHIs which have virtual operands or are fully initialized, GCC’s pass_late_warn_uninitialized warning iterates over all PHIs in its worklist to find any uses under conditions for which the value’s possibly undefined. If it finds one it outputs a warning.
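
For example (a sketch of my own), this is the shape of code the control-flow-aware warning is after; GCC reports it under -Wmaybe-uninitialized:

    int pick(int flag)
    {
        int x;          /* only assigned along one branch */
        if (flag)
            x = 42;
        return x;       /* possibly-uninitialized use when !flag */
    }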

Uncopy Propagate

Sometimes the constant/copy propagation passes can be overeager, so there’s a pass_uncprop optimization to walk suboptimal cases back, traversing the dominators tree to find them & maybe push the values back into the PHIs.

Lowering VA_ARG

Yesterday I finished describing the main, midsuite of GCC optimizations, but before the endsuite there’s a few tweaks that need to be applied. Some of these are repeat optimizations, & some are for C++ or transactional memory. Ignoring those, the next pass lowers variadic arguments.

This is split in two iterations over the codeblocks & instructions therein: The first, with a dominator tree, locates VA_ARG() calls & lowers them into reference retrievals.

The second validates that all VA_ARGs have been removed.
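
For reference, each va_arg() use in code like this (my own example) is what gets lowered into lower-level argument accesses:

    #include <stdarg.h>

    int sum_ints(int count, ...)
    {
        va_list ap;
        int total = 0;

        va_start(ap, count);
        for (int i = 0; i < count; i++)
            total += va_arg(ap, int);   /* each VA_ARG gets lowered here */
        va_end(ap);

        return total;
    }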

Alternative Optimizations

pass_lower_vector is another pass which lowers vector operations back into scalar operations for functions flagged for it, revealing dead code to immediately eliminate. Upon change it flags that the Control Flow Graph needs to be recomputed.

Following this are lighterweight -O0 variants of pass_lower_complex, pass_sancov, pass_lower_switch, pass_asan, & pass_tsan.

Then pass_sanopt optimizes the API calls which the sanitization passes added, then optimizes C++ exceptions.

Late Aggregate Return Values

GCC made attempts at this earlier, but when a function returns an aggregate (e.g. struct) type it’s more efficient to decompose that into multiple scalars, like is done everywhere else, without the hurdle of function ABIs.


This time pass_nrv starts by validating it can optimize this function, including by iterating over the codeblocks & instructions therein to validate there’s only a single variable which is returned (introduced via previous optimization).

Once it has found that variable being returned from the function it replaces all assignments to it with the variable specified by the function’s previously-computed metadata, whilst removing any such assignments which already exist.
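
Roughly speaking (my own sketch), this is the named-return-value shape the pass looks for:

    struct point { double x, y; };

    struct point make_point(double x, double y)
    {
        struct point p;     /* the single variable ever returned */
        p.x = x;
        p.y = y;
        return p;           /* assignments to p can instead target the
                               caller-provided return slot directly */
    }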

Correcting Vector Ops

CPUs provide separate instructions for processing vectors as opposed to arrays, so GCC’s next optimization pass_gimple_isel iterates over all codeblocks & instructions to find any vector (comparison/index) ops to correct. Revealing deadcode.

Control Flow Graph Simplification

Now that all the main optimizations have been done, GCC runs the pass_cleanup_cfg_post_optimizing optimization to remove various degenerate structures in the Control Flow Graph (CFG).

CPUs don’t like control flow as it hinders their ability to prefetch instructions, so simplifying it is vital!


After rerunning the main CFG fixup pass it starts by traversing the graph using a bitmask fringe[1] to locate dead codeblocks. Where relevant it merges codeblocks, propagates constants, etc as it goes. Loops are handled specially.

After handling C++ exceptions pass_cleanup_cfg_post_optimizing iterates over the instructions to find then remove any labels which are never jumped to.

Consecutive cases of switch statements jumping to the same label are merged into a range test, which if successful sets a flag to indicate it may have revealed more opportunities for simplifying control flow.

[1] In graph/tree traversal a “fringe” is the collection storing to-be-processed nodes.

Warn NoReturn

Finally pass_warn_function_noreturn checks flags set on each function to berate you about not adding noreturn attributes to the appropriate function.

Also I was going to cover pass_expand today, but it looks more relevant to cover that as part of the endsuite of passes. Even if it isn’t listed as such.

Nearing done! Wow, GCC has a lot of code!

Expand to RTL

As we switch to the endsuite of optimization passes, lowering the GIMPLE Intermediate Language into something close to the output Assembly for Binutils to assemble, it first lowers the control flow to the RTL Intermediate Language.


To do so it first undoes the SSA invariant (time-profiled), & allocates a small int map from SSA “partitions” to “pseudos”. An initial iteration over the codeblocks & instructions therein recursively removes DEBUG op edgecases.

A second iteration over the codeblocks & instructions therein flags any arrays which are ever indexed by a non-constant value.

After extensive initialization (some of this code is autogenerated based on CPU data) it allocates stackspace for each of the local variables (incorporating those SSA partitions). Possibly introduces a “stack protector” (to abort your program upon detecting a stack-smashing buffer overflow) amongst other things.

It then emits the function prelude required by the Calling Convention.


Following the prelude it outputs the instructions it has planned to allocate/initialize stack vars making sure to properly handle SSA partitions.

If the function’s main() it ensures it gets called on startup.

It then outputs code to protect the stack, lowers SSA PHI ops into mutable variables which may reveal dead control flow edges, initializes the new function body, unsets EXECUTABLE bitflags on each edge, iterates over the codeblocks with a new hashmap, & cleans up extensively.


For each codeblock it does some final initialization getting ready for the new IL, removes trailing returns where it can control flow through to the function epilogue, and after handling some edgecases iterates over each instruction in this codeblock to lower DEBUG, conditional, & CALL ops.

Then it iterates over the control flow edges to turn them into explicit GOTOs, handles an edgecase, adjusts stack pointer, finds the block tail, & sets the codeblock for each instruction.

Stack Variables

To call a function in Assembly you push your important state onto the stack followed by any arguments for the function you’re about to call & the memory address to return control flow to. CPUs may even provide specially-optimized instructions to aid this, though that still leaves you doing a lot manually without a compiler like GCC!

This stack is stored in the uppermost segment of (virtual) memory growing down, with its head stored in one or (to track “stackframes”) two registers.

Once GCC has “expanded” the code to resemble typical Assembly, it continues that task in pass_instantiate_virtual_regs to lower “virtual registers” into offsets from the stack pointer.


To do so pass_instantiate_virtual_regs (ignoring USE/CLOBBER/ASM_INPUT) iterates over the instructions to, after handling special cases involving the destination, lower each of their MEM/REG/SUBREG operands with validation. Changes are propagated into uses. Sometimes more thoroughly than others.

That may reveal the instruction to be dead, and it reruns the simpler operands pass once, or for functions twice, again.

After iterating over the instructions it iterates over the variable declarations for the sake of debugging to lower it & any referenced expressions to stack slots. With a little postprocessing & a second iteration ensuring a stackslot is assigned to each declaration.

RTL Control Flow

The next GCC pass initializes collections to analyze the control flow expressed by the “RTL” Intermediate Language, including copy tables, callback hooks, control flow chains, flags for GOTO targets, & cleans up dead code.

Those control flow chains start as a linked list of newly-computed codeblocks before getting split wherever relevant.

GOTO Folding

GOTOs involve loading code from RAM, which is painfully slow compared to the CPU itself. Though not as bad as conditional jumps, because GOTOs (usually) jump to a hardcoded address & hence don’t (normally) need to be branch predicted. So the next optimization, pass_jumps, tries to minimize these.


After deleting code storing to unused registers & movs to the same register, & optionally outputting debugging info to GCC devs, it reruns a variant of the normal Control Flow Graph Cleanup.

This Control Flow Graph (CFG) cleanup (which was also used in the RTL CFG initialization) first deletes unreachable codeblocks judging by the dominators tree. If successful it checks whether deleting those trivially dead instructions (as per the start of pass_jumps) gives any new results. This may mean it needs to compact the array of codeblocks, possibly using a bitmask.

For noreturn functions it might add fake exit edges.


Next it repeatedly:


If it added those fake exit edges for noreturn functions, they’re now removed again.

It quickly garbage collects any dead code; this quicker collection can also be used inside the cleanup mainloop described previously, instead of the one pass_jumps started with.

Jumptables for dead switch branches are deleted.

And with a freshly computed dominators tree it might repair loop structures.

Subregisters

In some Assembly languages like x86, a register can be decomposed into multiple smaller subregisters.

If we only ever use the subregs rather than the aggregate register, there’s no point keeping the aggregate together; splitting it apart increases compilation flexibility, so that’s what the pass_lower_subreg optimization does.
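
For instance (my sketch, assuming a 32-bit target where a uint64_t occupies a register pair):

    #include <stdint.h>

    uint64_t pack(uint32_t hi, uint32_t lo)
    {
        /* Only the two 32-bit halves of the result are ever touched, so the
           64-bit pseudo can be decomposed into two independent 32-bit ones. */
        return ((uint64_t)hi << 32) | lo;
    }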


After optionally outputting debugging info & checking if the target Assembly language supports subregs it iterates over all the available registers to see which are used.

If there’s anything to optimize it’ll optionally garbage collect the code, allocate collections, & iterate over the codeblocks & instructions therein.

For each instruction it memoizes them (struggle to find that code…), it handles shift instructions specially to set corresponding bitflags, extracts various info from the instruction, examines mov instructions, & examines the instruction’s operands for subregs.


If that yields results it alters the bitmaps to propagate movs & plan the decomposition before iterating over the codeblocks & instructions therein to apply it (specially handling clobber, use, & debug instructions).

If that altered control flow, another iteration over the instructions repairs it, & the collections’ bitmasks & register copy graph are freed.

Dataflow

An important perspective from which GCC needs to optimize programs is dataflow, looking at the paths data takes through your program to be transformed into output. My Haskell experience tempts me to say this is the perspective we devs should take; controlflow is a weird fixation our industry seems to have.

In preparation for dataflow optimizations GCC’s pass_df_initialize_opt analysis indexes this view of the code.


After allocating this index, it registers sets of callbacks to extract the dataflow, compute live regions, and optionally build a reverse index of which registers are available at each op. It computes a postorder, & inverse postorder, traversal over the codeblocks.

It initializes a bitmask of registers to eliminate, & an array of live registers whilst lowering pseudoregisters. Changes take effect after it’s determined what to change.

It runs a scan, then reinitializes those collections.


That scan involves extracting all uses of variables into bitmasks (separately for exception handling), processing entry & exit codeblocks specially, and then processing all the codeblocks to collect bitmasks & an array of “chains”. Normally using a recursive function over each opcode.

I’m struggling to see what those sets of callbacks previously registered do, but as best as I can tell they’ll be called by a later pass to clean up & apply all the previously performed optimizations.

RTL Common Subexpressions

Once GCC has parsed the expression-based syntax into purely statement-based syntax & (regardless of how poorly I explained it) computed dataflow, duplicate code is often revealed. Where, say, the developer found it easier to write z=(x+1)*(x+1) than y=x+1; z=y*y.

This is called “common subexpression elimination” (CSE).
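
A minimal before/after sketch of my own:

    /* Before: x + 1 is computed twice. */
    int square_next(int x)
    {
        return (x + 1) * (x + 1);
    }

    /* After CSE, conceptually: the common subexpression is computed once. */
    int square_next_cse(int x)
    {
        int tmp = x + 1;
        return tmp * tmp;
    }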


To do so pass_cse will free the computed dominators, queue & run (alongside previously queued ones) a set of callbacks to annotate registers as dead or unused.

To run those sets of callbacks (“dataflow problems”) GCC first initializes some array & bitmask collections of live codeblocks & the order in which they should be processed, possibly removing the dead ones. Then it reinitializes various collections, & if appropriate bitflags are set iterates over instructions bitflagged to be deleted, rescanned, or have their notes rescanned. And if the corresponding bitflag is set it checks certain invariants (some can be disabled at compiletime).

With that preprocessing done GCC iterates over the “dataflow problems”, and if they need to be rerun it calls in sequence their allocation, local computation, dataflow, & finalize callbacks (time-profiled), passing them the optionally-inverse (as requested by the problem) postorder array of codeblocks. 2 profiling callbacks can be compiletime enabled.


pass_rescan initializes more collections and iterates & recurses over instructions to extract first & last use of each pseudo register.

pass_cse initializes more collections, flags, etc and runs an analysis iterating over codeblocks in reverse postorder & their registers to see which might point to the same memory.

With its collections initialized & pre-analysis performed it iterates over all codeblocks in reverse postorder skipping visited codeblocks, determines controlflow paths to analyze, and for each it:

  1. Counts instruction sets, skips if none.
  2. Iterates over the path’s codeblocks & instructions therein.

For each instruction in the codepath pass_cse normalizes any dataflow notes attached to registers, clears clobbered registers, retrieves its instruction set, performs other normalizations, looks up in a table (with plenty of nuance) which previously-computed registers from this instruction set to use as input instead, computes costs to determine whether to actually apply this optimization, adjusts indexes to match, & inserts this instruction into the table for future CSE.


pass_cse has some per-instruction postprocessing to make sure these efforts don’t mess up control flow.

At which point it frees all its collections, & examines some flags it may have set to determine how much effort to put into repairing controlflow. That is whether to repair jump labels, just the control flow graph, or nothing.

A later pass_cse2 optimization will also remove duplicate stores & code before conditionally cleaning up the control flow graph, but is otherwise identical.

RTL Forward Propagation

For the sake of concision (minimizing code fetches from RAM), there’s an optimization, pass_rtl_fwprop, which attempts to merge instructions into their single use once it’s in this “RTL” intermediate language resembling most Assembly languages.


Once dominators, loop indices, & dataflow has been updated/computed, pass_rtl_fwprop iterates over the instructions skipping non-debug ones that can’t be optimized queueing anything it can’t immediately handle for reprocessing.

For each instruction it iterates over all their non-memory uses making sure to handle loops for non-addressing uses. If any merge succeeds the instruction will be enqueued into the worklist for reprocessing.

After validating it can merge into the use/instruction (tried to write details out, but got too long) & specially handling loading constants from RAM pass_rtl_fwprop transactionally & co-recursively alters & simplifies the instructions based on which opcode is being merged into.


Failing to propagate into the instruction itself, it inserts human readable notes into the dataflow indices & determines whether to actually retry merging this instruction after all the others have been (re)processed.

RTL Constant Propagation

For similar reasons, there’s a pass for propagating constants & copies on the RTL IL, merging them into a single constant or copy.


To do so pass_rtl_cprop first garbage collects codeblocks & recomputes dataflow, skipping if there’s too few codeblocks.

After initializing counts & allocators it iterates over every codeblock, instruction therein, & the registers each uses to examine where that register was last assigned. If it’s followed by or being propagated into a (conditional) jump it’s treated specially.

Otherwise for valid sources it transactionally copies, simplifies, & validates the indirect register/constant reference into a direct one. Possibly writing notes into the dataflow structures. This may reveal common subexprs to remove.


After that processing it reinserts all “traps” (C++ exceptions?) by splitting the codeblocks with memory barriers. And if anything changed it garbage collects codeblocks.

It iterates over the codeblocks again to extract “implicit sets” constrained by some statically-known invariant. If either subpass yields any changes it reanalyzes dataflow.

It allocates some collections, including iterating codeblocks & instructions therein to gather assignments it optionally outputs for GCC debugging.


If it found any assignments to index it iterates over the codeblocks & their instructions to copy propagate them again this time with an assignment table to reference. As it does so it deletes newly-dead stores.

If any of this changed anything it flags CSE as needing to be rerun & cleans up the control flow graph.

RTL Partial Redundancy

To eliminate more subtle cases of redundant code in the goal of concise output GCC runs a Partial Redundancy Elimination pass specifically written for this low-level “RTL” intermediate language.


This starts by garbage collecting codeblocks & updating the dataflow analysis. If this pass manages to apply any optimizations it’ll flag CSE as needing to be rerun & will clean up the control flow graph. Much like the previous optimization pass.

If the function has enough codeblocks it’ll initialize counters, alias analysis, allocators, a hashtable of memory operations (with postprocessing), & a hashtable of the first/last instructions to set each register in each codeblock. If this is a noreturn function fake exit edges will be added over the duration of this analysis.

If that latter hashtable has any instructions it’ll allocate some bitmasks to compute some PRE-specific dataflow with extensive postprocessing to remove candidates.


After constructing a smallintmap from bitmask indexes to hashtable entries, pass_rtl_pre deletes then (along controlflow edges) inserts instructions according to that dataflow analysis. Then updates registers between codeblocks, in case it was referenced by a deleted instruction.

These alterations to the control flow edges are committed, after which it’ll clean up with a bit of optional debugging output.

RTL Hoisting

Furthering Common Subexpression Elimination (CSE), expressions may be duplicated between different code branches. So there’s a pass_rtl_hoist optimization to “hoist” these into common codeblocks in the dominators tree.
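
As a rough illustration (my own example), the same expression computed in both branches can be hoisted above the branch:

    int pick_scaled(int cond, int x, int y)
    {
        int r;
        if (cond)
            r = x * y + 1;      /* x * y appears in both branches... */
        else
            r = x * y - 1;      /* ...so it can be hoisted above the if */
        return r;
    }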


Being framed like the previous two passes, if there’s enough codeblocks & the Control Flow Graph isn’t so complex that it’ll take forever to run this pass, it’ll calculate roughly how many registers are taken in each codeblock.

To calculate the “register pressure” for all codeblocks, pass_rtl_hoist optionally: temporarily initializes a smallintmap of registers from the dataflow analysis with separately-initialized smallintmaps (and a temporary hashtable) for pseudoregisters, allocates additional memory per-codeblock, & iterates over every codeblock using those to compute said register pressure. Propagating bitflags through.

After initializing alias analysis, allocators, & collections it does the optimization.


If it’s found any expressions to add to its hashtable (collected as per Partial Redundancy Elimination) pass_rtl_hoist allocates additional bitmasks initialized to track the effects of every instruction & post-processed to remove instructions from consideration, finds the “busiest” instruction where dataflow frequently intersects, calculates the dominators tree, initializes a few more collections, & iterates over all codeblocks in dominator order & busy instructions therein.


For each busy instruction pass_rtl_hoist‘ll check if it’s bitflagged for hoisting. If so it’ll optionally find & count the correct occurrence of that register to hoist, then iterates over the dominated codeblocks. For each it relocates that occurrence, figures out how far it can be hoisted, checks register pressure, checks whether it should hoist the instruction, & enqueues it for hoisting.

After edgecase handling, it iterates over that queue, deletes from the old spot, & adds to new spot.

RTL Store Motion

Where hoisting fails to move instructions up the dominator tree, pass_rtl_store_motion moves assignments down to later in the code. Probably to reduce register pressure.


pass_rtl_store_motion starts by garbage collecting the codeblocks, updating the dataflow analysis, & initializing alias analysis, then collects a hashtable of & counts store instructions (referring to the registers in each codeblock), determining which can be moved around or might be overwritten.

That hashtable is reprocessed into bitmasks & arrays, temporarily adds fake exit edges to noreturn functions & infinite loops, computes an order to the control flow edges, then iterates over the collected stores inserting & deleting them where previously decided after discarding abnormal edges.

After which it cleans up & if any changes were made a flag is set to rerun CSE. Constant propagation was rerun immediately before this pass.

CSE After RTL Global Optimizations

After those “global” optimizations have completed, it performs a final Common Subexpression Elimination as might have been flagged by them as being required.

This involves rebuilding jump labels, running the common CSE routine, garbage collecting control flow edges, deleting obviously dead instructions, flagging whether a simpler CSE rerun will be required, rebuilding the control flow graph as indicated by that common CSE routine, & flagging whether to follow jump instructions.

RTL If Conversion

CPUs don’t like evaluating conditional branches - it takes forever to load the referenced instructions from RAM, and it can’t always predict which instructions it should prefetch. So GCC has a pass_rtl_ifcvt optimization which replaces these branches with straightline code where possible.

Incidentally preventing Spectre vulnerabilities involves manually applying this optimization, sometimes getting quite clever about it!
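
A sketch of the kind of rewrite involved (my own example; the exact RTL transformation depends on the target):

    /* Before: a conditional branch. */
    int max_branchy(int a, int b)
    {
        int max;
        if (a < b)
            max = b;
        else
            max = a;
        return max;
    }

    /* After if-conversion, conceptually: straight-line code which targets
       like x86 can emit as a conditional move (cmov), no branch needed. */
    int max_straightline(int a, int b)
    {
        return a < b ? b : a;
    }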


Optionally before & after cleaning up the Control Flow Graph it’ll reanalyze dataflow (optionally adding liveness analysis), calculates loop exit edges, initializes counters & flags, & recomputes the dominators tree before repeatedly iterating over the non-dirtied codeblocks to find & optimize an if branch. After which it’s all clean up, and flagging the need for dead code elimination if anything changed.

To process a codeblock it first determines whether it’s actually an if-header…


It discards codeblocks with more (or fewer) than 2 successors, with dirty successors, abnormal edges, loop exits, or a non-fallthrough successor. Otherwise it collects/normalizes info about this branch & tries 5 different approaches (with subapproaches) to fixing it.

Converting lowerings like the conditional-move rewrite sketched above.

Each iteration it’ll recompute dataflow & update flags.

RTL Loop Optimizations

Starting the subsuite of RTL loop optimizations, during which the loop indexes (and dominators tree) are kept initialized. And the dataflow graph is rescanned in preparation.

RTL Loop Invariants

There may be instructions inside a loop in the “RTL” Assembly-like intermediate code which are constant over all iterations of that loop. Though the same optimization has already been done in GIMPLE, so many cases have already been handled.
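
For example (my own sketch):

    void fill(int *a, int n, int x, int y)
    {
        for (int i = 0; i < n; i++)
            a[i] = x * y + i;   /* x * y never changes inside the loop, so it
                                   can be computed once before the loop */
    }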


If there’s any (real) loops in the function, pass_rtl_move_loop_invariants will first extensively reanalyze dataflow when under register pressure.

That dataflow analysis includes recalculating register pressure, pseudo classes, n-sets, & optionally liveness.

pass_rtl_move_loop_invariants iterates over all indexed loops twice. The first iteration, from innermost to outermost, does the optimization; the second does cleanup.

To find candidate invariants it first locates loop exits, always-reached codeblocks, & definitions. Then iterates over the loop’s codeblocks & instructions therein to collect invariants, which it’ll then deduplicate.


After a little preprocessing to roughly estimate register pressure, pass_rtl_move_loop_invariants will repeatedly seek out & flag the most optimal (based on register pressure) invariants to move & where they should be moved to.

Now that it knows which invariants to move, it removes those registers, inserts them into the new codeblock, & replaces all uses with the new register. Some postprocessing might be done to relieve register pressure & it might reanalyze the dataflow without chaining.

RTL Loop Unrolling

Worse than conditional branches to a CPU are repeated conditional branches; that is loops. So GCC will (again) look for small frequently-run innermost loops it can unroll, where it hasn’t been told not to unroll.
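
A sketch of the simplest case, a constant trip count (my own example):

    /* Before: four iterations, each ending in a conditional branch. */
    int sum4(const int *a)
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += a[i];
        return sum;
    }

    /* Fully unrolled, conceptually: no loop control left at all. */
    int sum4_unrolled(const int *a)
    {
        return a[0] + a[1] + a[2] + a[3];
    }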


A first iteration over the loops from innermost to outermost flags which should be unrolled & tries to determine how many iterations they’ll run. A second iteration applies the unroll in one of 3 different ways.

For constant number of iterations it’ll duplicate the loop body a pre-determined n-times.

Or it might run a variant which selects which copy of the loop body to jump into for the first iteration, when the number of iterations is only known at runtime.

Or if there aren’t any loop counters it’ll duplicate the loop condition alongside the loop body.

If any of these succeeded it’ll recalculate the dominator tree & the loop structure. After all loops it fixes IV analysis.

Conversion to do-while

In Assembly languages, do { } while (cond) loops are often more concise than while (cond) {} loops, as testing the condition at the bottom allows us to omit a jump instruction. So GCC’s pass_rtl_doloop will try to rewrite your loops into that do-while shape.
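
To illustrate the shape change (my own sketch, at the C level rather than RTL):

    /* Top-tested loop: the body ends with a jump back up to the test. */
    void zero_while(int *a, int n)
    {
        int i = 0;
        while (i < n) {
            a[i] = 0;
            i++;
        }
    }

    /* Bottom-tested (do-while) shape: one guard up front, then only a single
       conditional jump per iteration. */
    void zero_dowhile(int *a, int n)
    {
        int i = 0;
        if (i < n) {
            do {
                a[i] = 0;
                i++;
            } while (i < n);
        }
    }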


Optionally with liveness analysis enabled on the dataflow analysis it’ll iterate over all loops. Afterward it’ll clean up the IV analysis & verify loop structures.

For each loop it initializes its IV analysis & verifies it can optimize this loop.

pass_rtl_doloop then estimates the number of iterations for this loop & its cost before restructuring the labels being jumped to, followed by (with the help of bitmask register analysis) the code itself. Possibly deleting that trailing jump.

Independent Virtual Register Usage

In RTL GCC has a concept of “virtual registers” which have not yet been allocated a physical register (these “physical” registers are actually virtual too, allowing the CPU to rearrange code as it waits for data). Splitting independent uses of these virtual registers gives GCC more flexibility in assigning them, so that’s what pass_web does.
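
For example (my own sketch), the two uses of i below never overlap, so they form independent “webs” which can become separate pseudoregisters:

    void init_both(int *a, int n, int *b, int m)
    {
        int i;
        for (i = 0; i < n; i++)     /* first, independent use of i */
            a[i] = 0;
        for (i = 0; i < m; i++)     /* second, unrelated use of i */
            b[i] = 1;
    }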


After reanalyzing dataflow (with chaining & altered flags), it iterates over all codeblocks, instructions, & uses to count pseudoregs.

After allocating some arrays (those counts determine how much memory to allocate) it iterates over codeblocks, instructions therein, & their uses again to collect the set of all uses for each pseudoregister. And it collects a register renaming smallint map, akin to what’s hardwired into your CPU.

A third iteration applies that register renaming map to the code, updating the map as it goes.

RTL Dead Store Elimination

(Skipping repeat optimizations)

Sometimes as code gets lowered closer to Assembly, assignment statements are generated but not used. For the sake of concision & not wasting CPU cycles these dead instructions are removed in an 8-step process by pass_rtl_dse1/2.
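
The simplest case looks like this (my own sketch):

    void set_flag(int *out, int computed)
    {
        *out = computed;    /* dead store: overwritten before anyone reads it */
        *out = 1;
    }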


After reanalyzing dataflow with note updating, pass_rtl_dse1 & pass_rtl_dse2 will:

  1. Initialize counters, allocators, collections, & alias analysis.
  2. With additional temp bitmasks, iterates over all codeblocks. For each codeblock (skipping the function prelude) it clears relevant collections, collects grouped stores whilst processing dataflow & effects, optionally deletes stores not in group 1 (skipping clobbering & flagged stores), frees a deferred changelist, & deletes stores based on CSE analysis.
  3. Iterates over the collected RTX group array twice: Once populating bitmasks, etc for each group, then reprocessing them into offset arrays. Reanalyzing dataflow if successful.
  4. If (3) was successful, iterates over all the codeblocks determining which stores aren’t subsequently read (differently based upon whether this is a function exit codeblock), garbage collecting codeblocks while it’s at it. An iteration over the bits that are set postprocesses this bitmask.
  5. If (3) was successful, incorporates dataflow into this analysis.
  6. Iterates over the codeblocks again to delete the determined instructions, whilst considering special cases.
  7. Iterates over codeblocks & instructions therein to delete now obviously dead stores overwriting a variable immediately after its previous assignment.
  8. Frees all the collections used across the previous steps.

Increment/Decrement

Some CPUs provide addressing modes which increment/decrement a pointer register as part of a load or store. So, on machines supporting such auto-increment/decrement addressing, GCC’s pass_inc_dec looks for a separate add/subtract adjacent to a memory access which it can fold into one of those forms.


After register and (with note updates) dataflow analysis & initializing a lookup table & arrays, pass_inc_dec iterates over every codeblock in the function & instructions therein in reverse order to split it where necessary & note previous instructions to rewrite. Counting successes.

Zeroing Uninitialized Vars

In C only static-storage variables default to 0; automatic variables are undefined until assigned (hence the warnings described earlier). Leaving Assembly registers undefined, however, can upset later RTL passes, so GCC has a pass_initialize_regs pass which sets pseudoregs to 0 on uninitialized codepaths.
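
To illustrate the distinction (my own sketch):

    static int hits;        /* static storage: guaranteed to start at 0 */

    int bump(int enable)
    {
        int delta;          /* automatic storage: indeterminate until assigned */
        if (enable)
            delta = 1;
        hits += delta;      /* undefined when !enable; GCC may warn, and
                               pass_initialize_regs zeroes the pseudo so later
                               RTL passes see a defined value */
        return hits;
    }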


After reanalyzing dataflow with liveness analysis pass_initialize_regs iterates over codeblocks, instructions therein, & the uses of each instruction, skipping non-pseudo, PIC, & already-moved regs. It checks per-codeblock bitmasks to determine whether to emit a new store & update indices.

Undefined Behaviour DCE

You may have been warned against allowing any undefined behaviour in your C programs, because any code it touches also becomes undefined behaviour allowing GCC to throw away large chunks of your code.

You can blame that on the pass_ub_rtl_dce optimization, which is a variant of normal Dead Code Elimination that also collects any “artificial” uses followed by anything using those results.

Merging Consecutive Instructions

Often a sequence of instructions can be replaced with a single instruction, so GCC has a pass_combine optimization which looks for these opportunities in the name of concision.
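
A rough example of my own: a multiply-by-5 expressed as a shift plus an add is the sort of adjacent pair combine tries to fuse (on x86, for instance, into a single lea):

    int scale5(int x)
    {
        int t = x << 2;     /* x * 4 */
        return t + x;       /* x * 4 + x == x * 5; candidates for fusing */
    }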


To do so (after duplicating registers, reanalyzing dataflow with notes, & analyzing n-sets & refs) pass_combine locates the first instruction of the function, initializes counters & arrays, & iterates over every codeblock, the instructions therein in reverse order, & their definitions then uses to collect info about their uses.

A second iteration over the codeblocks (skipping non-trivial control flow) & instructions therein records dataflow notes, when regs were set or killed via dataflow, & how numbers were resized.

After clearing/reinitializing the start state a 3rd iteration skipping unreachable or nontrivial control flow attempts to actually (transactionally) merge sequences of 3 instructs into a simpler instruct under various conditions, handling numerous edge cases for the iteration.

Then cleans up.

Splitting Code By Frequency

Instructions (and data) are loaded into the CPU in sizable chunks at a time. It would be beneficial if rarely executed code was left out of the frequently executed chunks, to minimize (except in degenerate cases) how many chunks need to be loaded from RAM.

So pass_partition_blocks splits the hot & cold code.
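
For instance (my own sketch; handle_error() is a hypothetical callee), marking a branch as unlikely with GCC’s __builtin_expect gives the partitioning pass the profile hint it needs to push the error handling into the cold chunk:

    void handle_error(int err);     /* hypothetical, rarely-called handler */

    void step(int err)
    {
        if (__builtin_expect(err != 0, 0)) {
            /* unlikely path: a candidate for the cold partition */
            handle_error(err);
        }
    }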


To do so it iterates over every codeblock & their edges to count & flag cold ones. If it found any it validates hot paths, marks DFS back edges, & collects those cold codeblocks.

Then it handles C++ exceptions (skipping that topic), & iterates over every codeblock & their control flow edges to flag such edges which cross to/from hot code from/to cold code. Exiting if no such edges were found.

The array of crossing edges are postprocessed to ensure they don’t have any fallthroughs & always includes a label to jump to. Control flow to the function’s exit doesn’t need to be added up. In 2 passes to make sure it didn’t miss anything.


Some Assembly languages may require intermediate GOTOs to be added to any conditional control flow to cold code, due to not supporting conditionally jumping to somewhat-distant code. Correcting for this involves a conditional iteration over all codeblocks checking for such cases to alter.

If the Assembly language doesn’t support unconditionally jumping to somewhat-distant code, another optional iteration needs to convert such GOTOs into loading their value from a new pseudoregister.

After correcting the flags on control flow edges after these adjustments, it tidies up & reanalyzes dataflow if there’s any C++ exceptions in this function.

Splitting Instructions

Modern CPUs contain circuitry to schedule multiple instructions to run simultaneously upon different ALUs. GCC can help the CPU with this, in contrast to recently discussed passes, by splitting instructions up into multiple.

Several very similar passes will be run later.


To do so pass_split_all_insns with a per-codeblock bitmask allocated iterates over all codeblocks & instructions therein. For each deleting noops & splitting other instructions flagging the bitmask where successful.

After extracting probability pass_split_all_insns runs some auto-generated code to compute which instructions to replace it with (sorry about skimming over the bulk of this logic!). If there are any results it’ll ensure there aren’t any new infinite loops (via a count), stackframes are properly formatted, RTX usage (as well as control flow & function calls) is properly structured. Further postprocessing copies notes then control flow labels over, deletes old instruction & replaces it.


After some recursion this postprocessing (in a separate function) propagates more notes, flags the replaced instruction as deleted, & tidies up subregs.

After this iteration pass_split_… profiles the RTL code, if anything changed adjusts the splits between codeblocks, if those changes impacted C++ exception handling the control flow graph is reestablished, & verifies the control flow.

Stackframe Analysis

(Skipping a near-repeat reinitialization of dataflow)

Sometimes C programs rely on the callstack being stored in RAM. Most of the time they don’t, in which case GCC can optimize them away. So pass_stack_ptr_mod conditionally iterates over all codeblocks & instructions therein to check whether any memory ops are accessing the callstack & set a flag.

If that’s not the case it updates the dataflow indices for the function’s exit block.

Mode Switching

Some Assembly languages (like ARM I believe) have multiple “modes”, some of which are more concise but less capable. When compiling for these CPUs GCC must select the most optimal mode for each run of instructions in order to do an optimal job.


To do so, if compiletime-enabled, pass_mode_switching first iterates over each mode to conditionally generate new exit codeblocks turning those modes off. If there are no modes it exits.

It splits the entry edge annotating it as normal mode, reanalyzes dataflow & initializes various bitmasks before iterating over every mode, codeblock, & registers (skipping ones with complex dataflow edges) then instructions therein collecting valid code segments for each mode.

For each mode, immediately after iterating over codeblocks, it adjusts for when no particular mode is required or when it control flows to the function’s prelude or epilogue. A 2nd iteration over codeblocks handles cases where no modes are valid.


Bitwise manipulation on the gathered (and other) bitmasks & a few more iterations over the codeblocks/edges/etc determines the optimal placement of mode switches amongst the valid placements. (I’m unclear what the logic here is)

An iteration over the modes, edges then codeblocks twice (the first to compute exit & entry modes) inserts the chosen mode switches.

Whilst tidying up it’ll commit the changes.

Tidyup asm() SSA

The Static Single Assignment invariant used to simplify mid-level optimizations introduces some funny quirks in inline Assembly statements which need to be tidied up before compilation.


So the next GCC pass, pass_match_asm_constraints, iterates over all codeblocks, instructions therein (skipping non-Assembly ones), & parameters thereof. For each parameter it locates its use in the Assembly, validates it can optimize it, & adds the relevant MOV.

Upon change dataflow reanalysis is needed.

In-Loop Scheduling

I’m not entirely clear on the exact benefits yet, but GCC may specially schedule operations in loops.


With the Control Flow Graph & loop indices computed and various collections/parameters (including those for the HAIFA scheduler) initialized, pass_sms (for Swing Modulo Scheduler) iterates over the loops in the function. It validates it can optimize each loop whilst optionally outputting debugging info (e.g. no write barriers or function calls) & computes its Data Dependence Graph.

A second iteration over the loops extracts each loop’s counter register & maximum number of iterations, then schedules all nodes in the DDG (Data Dependence Graph) into a new array with some postprocessing applying it to the RTL code. Then it’s all cleanup & applying the new order (in a separate iteration over codeblocks) & register count.


To schedule codeblocks pass_sms takes the DDG (including per-codeblock read/write counts & the dominators graph), iterates over its edges & nodes to initialize new bitmasks specifically for this loop, pairs equally sized nodes (a “Floyd-Warshall loop”), computes the lengths of cycles in the graph, sorts & validates the resulting SCCs, computes worst case order parameters, & iterates over the SCCs to extract paths from the DDG start & compute schedule positions before recomputing in reverse.


Postprocess handles edge cases & saves counts into the DDG. Yielding a “partial schedule” indicating valid positionings for each node.

And the separate postprocessing loop reinitializes some arrays & bitmasks, essentially bruteforces a valid concrete ordering with postprocessing reconsidering degenerate cases, recomputes some counts, adjusts the schedule to (mostly) start in the correct spot, inserts MOVs in the plan where necessary, normalizes & profiles (& optionally versions) the loop, reorders the instructions to match the schedule, optionally disables future scheduling, sets a codeblock dirty flag, inserts the movs whilst rescanning dataflow, & where it couldn’t align the schedule to the start/end of the loop unrolls the selected instructions of the loop (knowing the min-iterations indicates whether this is valid).

P.S. This pass can be compiletime-disabled.

Liverange shortening

The shorter a pseudoregister lives the more flexibility the register allocator has in assigning it a CPU (virtual) register. So there’s a pass_live_range_shrinkage optimization (compiletime disablable) which rearranges (schedules) instructions to achieve this. The register allocator, as an NP-hard problem, will need all the help it can get!


With a flag set & another saved, it first initializes callbacks, collections, etc.

During that initialization it conditionally iterates over the code to collect “regions” using a selection of loops depending on the control flow. And it computes a standard dominators tree.

However they’re collected, pass_live_range_shrinkage then iterates over these regions, specially handling cases which really need it (too many live registers) & skipping where disabled, first initializing some more collections & computing forward & backwards dataflow with 2 iterations over the instructions.


It then collects bitmasks describing control flow, computes per-instruction priority taking the max, collects instructions to schedule, & iterates over the codeblocks again to schedule them utilizing the previously initialized callbacks. Using generic scheduling code to dequeue instructions & insert them in their new places with GCC debugging info. The specialized code inserts the instructions into a temp array (which it’ll iterate over yielding the optimized code) to plan their new positions.

Memfences

This pass is followed up by a near-repeat (of which there’s another) pass_sched which may alternatively, with regions collected & collections initialized, iterate over the regions. For each (with some variation) it removes empty codeblocks, initializes runtime memfences, schedules the instructions around them (by splitting linked lists whilst assigning & sorting per-instruction sequence numbers), & specially bruteforces, with reference to bitmasks, an optimal order for CPU pipelining.

Recomputing Stack Values

When you call a function in C, most of the CPU registers need to be pushed to the callstack (the callee may push the rest). This callstack lives in RAM. It takes forever for the CPU to retrieve data from RAM (common theme of these optimizations!). So it can be counterintuitively more efficient to recompute a value after a function call than to read it back from the callstack.

pass_early_remat chooses the initial pseudoregs to benefit from this.
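
Sketching the idea with my own (hypothetical) example: a trivially recomputable address may be cheaper to rematerialize after a call than to spill & reload:

    int  *get_base(void);   /* hypothetical helpers for illustration */
    void  consume(int);

    void demo(void)
    {
        int *base = get_base();
        int *p = base + 16;     /* cheap to recompute from base */
        consume(p[0]);          /* the call clobbers most registers... */
        consume(p[1]);          /* ...so p may be recomputed here rather than
                                   saved to the stack & reloaded */
    }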


After some target-CPU specific code selects candidates into a bitmask, if there are any pass_early_remat allocates some bitmasks & reanalyzes dataflow, then checks every register in the function to see if their referencing instructions warrant “rematerialization”. Transactionally collecting them into an array & bitmask. If there are any, the other registers are added as “dummies” to the collection; otherwise it bails on this pass.


Having collected candidates pass_early_remat allocates a sidetable, recomputes bitflagged label IDs corresponding to each candidate, sorts the candidates by precomputed dataflow postorder position, allocates a bitmask for each candidate register, & iterate over the candidates to populate that sidetable with candidate counts & indexes.

After optional GCC debugging output it iterates over candidates, codeblocks, dataflow, & candidates in reverse to collect usage info into bitmasks.


Another reverse-iteration over the candidates, done twice, uses that additional info & a depth-first traversal over data dependencies to discard invalid candidates from the list.

A postorder traversal over the codeblocks (skipping the “fixed” ones), with bitmasks normalized, & the instructions therein iterates over uses to determine via bitmasks where to insert the recomputations, checks whether the instruction’s a function call before emitting the recomputation, & iterates over candidates to kill.


Capturing all cases where the selected registers need to be rematerialized requires a second subpass taking into account control flow. This involves iterating over the dataflow & codeblocks to bitflag which values are already available, traversing the control flow graph in loose postorder to determine where to recompute the values (possibly propagating them back into the codeblock’s predecessors), then iterating over the codeblocks to actually insert that recomputation.

Register Allocation

CPUs provide some number of internal “registers” in which to store the values currently being processed; in some architectures data must be loaded into a register before processing it. Due to the cost of storing said data elsewhere (in RAM) the hardest & most important optimization is to assign each pseudoregister (variable, etc) to one of the CPU’s registers.

This is NP-hard, akin to the Map Colouring Problem; it can only be bruteforced or approximated. Approximation will do!

Tidyup Instructions

For this GCC’s pass_ira (Integrated Register Allocator) first, depending on the target CPU, transactionally iterates over every codeblock for trailing GOTOs. For each it iterates over all input operands, & twice over successor edges therein looking for abnormal edges it should skip or otherwise function exits to temporarily augment with noops.

Regardless of whether that’s done it next initializes some flags, counters, allocators, register sets, etc some of which is CPU-specific.


pass_ira might iterate over all codeblocks, instructions therein, & instructions therein again/nested to shorten recorded pseudoregister liveranges (periods where each is actually used) to better match reality. Possibly with the dominators tree. If optimizing away indirect jumps yields any obviously dead instructions to delete it’ll reanalyze the dataflow again.

Several additional iterations over the registers & their uses, codeblocks, etc computes stats for register allocation to refer to.


This is the most compiletime-optimal place to put an unrelated setjmp() warning subpass.

With alias & loop analysis it iterates over codeblocks & instructions therein to note function calls, optionally iterates over caller-save regs to incorporate into its register-alloc datamodel, optionally iterates over codeblocks & instructions therein to note equivalent regs, iterates over pseudoregs to see if they can now be inlined, & optionally over instructions to record store equivalents.


It’ll optionally recompute register sets in case that freed anything up, recompute regsets, optionally iterate thrice over codeblocks, instructions therein, & twice over their uses to bitflag which pseudoregisters are movable utilizing several temp bitmasks, determines which registers are clobbered where, initialize cost counters, & optionally reinitializes loop analysis.

With this tidy workbench pass_ira is ready to actually do its job & allocate some registers!

Extracting Conflicts Graph

The Map Colouring Problem refers to the challenge of shading a political map with as few colours as possible so that no two bordering countries share the same colour. Or any equivalent challenge.

GCC’s pass_ira must not assign any two pseudoregisters alive at the same time the same CPU register. This is the Map Colouring Problem! Or maybe the more general Graph Colouring Problem.

Today I’ll discuss how GCC phrases register allocation as a Graph Colouring Problem.


For this GCC’s pass_ira starts by reanalyzing dataflow again & initializes various collections & allocators. Recursing over the dataflow populates bitmasks.

It iterates over all instructs with some special collections to collect pseudoregs with known values. It recursively iterates over control flow labels to collect their offsets. All in support (with other collections) of an iteration over codeblocks & instructs therein, then pseudoregs to compute the cost of spilling each reg.


GCC’s pass_ira continues initializing collections, recursing over loops to compute pseudoregister ranges, after which it considers which loops it should exclude from consideration (because loops are what makes register allocation complex).

It simplifies the live ranges collection by removing irrelevant nodes with no impact on liveranges.

It iterates over the pseudoregisters to flag which cannot be spilled. Then iterates over the loops to determine if any’s an “allocation region”.


If there were any such loops pass_ira will iterate over the pseudoregisters & their allocno’s to propagate some of this info where needed, then iterates over those allocno’s to initialize additional info.

All this analysis may require each allocno to reconsider how much they’d cost to spill, considering which ones live through function calls. It compiletime-optionally validates each allocno.

It iterates over allocnos then by pseudoregs to extract the min & max live ranges.


The allocno collection is restructured into a smallintmap, which pass_ira will iterate over twice to extract the min then max conflicting IDs. Info which is used to allocate the conflicts graph.

To initialize this conflicts graph it optionally first iterates over the allocnos & their objects thrice to initialize an initial bitmask; if successful it iterates over the loops & their instructions to adjust the bitmask & save relevant notes, propagate them, & deallocate duplicates.

Regardless of whether it does that bitmask analysis it checks spillage behaviour then iterates over the allocnos & their objects, bitwise-or’ing conflicting regs based on different conditions.


pass_ira then iterates over all allocnos to record previously-computed costs & preferred registers where relevant.

It optionally reconsiders whether to keep each loop or allocno.

And then finishes off with optional GCC debugging output.

Graph Colouring

In theory you could colour any political map with only 4 colours. But a register conflict graph, unlike a map, needn’t be planar, so you can’t count on compiling every program to use only 4 CPU registers.

In practice even finding a minimal colouring is computationally infeasible; a perfect solution, as far as we know, must be bruteforced. Thankfully in GCC’s case x86 has at least 8, & 32-bit ARM has around 13, general purpose registers for example. GCC consistently has a palette of more than 4 colours!


To colour the conflicts graph (computed as described yesterday) with CPU registers GCC’s pass_ira optimization, after adjusting some flags, first initializes a couple of colouring-exclusive fields of the allocno’s before deciding whether a lack of conflicts allows it to use a fastpath.

This fastpath iterates over a new array, sorted by newly-computed priority, of allocno’s to allocate a valid register, per the instruction’s constraints & any conflicts. Gives great results given no loops!


The slow path (with an allocno stack, boolarray of allocated CPU regs, sorted allocnos array, priorities & cost sidetable, & sorted allocno copies array) iterates over every loop topdown.

For each loop, with additional collections (including a sidetable of “color data” & formulating “thread” linked lists) allocated, it first clears bitflags for any already assigned allocno’s, iterates over various bitmasks to choose preferable CPU regs before dropping any which can’t be satisfied.


There’s two core algorithms GCC’s pass_ira might then use to initially colour all the allocnos (nodes in the conflicts graph).

The priority algorithm iterates over the bitmask of allocnos to colour, flagging where the allocno’s class has no CPU regs left & collecting the others into the prioritized allocnos array. If any were found it’ll sort & iterate over those allocnos assigning available registers in that order.

Otherwise it doesn’t sort the order it assigns registers in.


After that initial register allocation & clearing some counts it iterates over the bitmask of allocnos to colour again to select a CPU register from the allocno’s preferred set, checking those regs available from its class & estimated costs. Iterating over the allocno’s class, objects, & their conflicts (both twice) to find it.

The remaining allocnos are assigned in order of cost.

Then it recomputes limits & unassigned allocnos in a conditional iteration over all allocnos then per subloop.

Merging Assigned Registers into IL

Yesterday I described how GCC’s pass_ira optimization selects which registers to use in its conflicts graph (“colours” it). Now how does that get merged back into the RTL intermediate language?

I’ll finish off this megathread today describing the final stages of pass_ira, and maybe the next, related one.


With additional collections initialized it iterates over every loop. For each it might iterate over the instructions to see where it needs to reanalyze dataflow.

Or it might (with some debugging info) iterate over the allocnos (“border” ones conditionally first) to associate allocnos to the corresponding instructions. Then it sets some flags.

After freeing some of those instructions it iterates over every codeblock, their non-exit edges, & their live regs to emit relevant MOV instructions matching the ABI of the successor codeblocks.

With some more collections allocated it twice iterates over the codeblocks to deduplicate those planned MOVs.


A separate iteration over the codeblocks yields the actual MOV instructs where decided, and another adjusts bitmasks & via them costs.

Most of the collections used to perform all the steps so far are now freed, involving iterations over the codeblocks & their edges or instructs. And it numbers all instructs.

It retrieves the maximum reg number & optionally reinitializes the colouring collections. Possibly with updated virtual regs, dataflow, & loop indices to recolour missed allocnos.


That recolouring involves iterating over those remaining allocnos, their conflicts, & their objects to assign valid colours to gather a full collection of what needs to be recoloured. Then two prioritized iterations assigns the new, valid registers as per before.

It then frees allocnos which have already been incorporated into the bytecode & iterates over the remaining ones to recompute costs & whether they’re caller-save. Then again to calculate in-register vs in-stack costs.


After compiletime-optional validation of all allocno’s & their assigned regs/objects, pass_ira optionally frees some register sets (as per before).

It gathers an equivalent-registers array in one of two ways, then optionally iterates over the registers to refer-to/alter this to adjust used registers. I think to remove accidental duplicates?

Pseudoregisters previously set aside are now rewritten to refer to their duplicate value. Then ends by restoring some globals.

Reloading

Once GCC has decided which CPU register corresponds to each instruction, it has yet to actually rewrite the code to do so. At least that’s what GCC’s docs seem to indicate pass_reload does…


After adjusting some globals & starting some profiling pass_reload either resets some indices & runs some local register allocation, or it iterates over the registers then codeblocks with various dataflow bitflags to record register lifetimes & do the main job.

The core logic involves (re)initializing various collections, iterate over pseudoregisters to propagate their lifetimes to the allocated CPU register, decides whether to set a flag saving all registers to the callstack in case of C++ exception upon function call, flags the CPU registers as being live, (re)initializes more collections, including a table of equivalent values for each register from an iteration over the instructions, …


optionally iterate over spilled regs & allocnos to allocate optimal stack slots, assign each reg new “homes” to indicate spillage to the callstack, iterate over instructs & dataflow to flag which regs can’t be eliminated, unspill where required by literal Assembly code, reset some globals, validate each gathered eliminatable regs to demote them to spillage where needed followed by the stackframe pointer, iterate over the CPU regs then spilled regs then pseudoregs then etc to apply spillage.


pass_reload then optionally runs dead code elimination (DCE, garbage collect the code) & finishes with some checking of the framepointer.

Post-Reload

The register allocator’s spillage logic may leave some mess behind which needs tidying up.

CSE

Register allocation may introduce minor cases of duplicate subexpressions. Probably due to spillage. So there’s a pass_postreload_cse optimization to get rid of this for the sake of concision!


An initial iteration (with memory CSE records & alias analysis initialized) over the codeblocks & instructions therein first conditionally (skipping non-instructions & sideeffecting function calls) tracks stackpointer updates, serial & parallelized SET ops.

For SET ops it performs some checks to ensure it can optimize away this memory store into CPU registers before looking up the datasource in the memory CSE records if present & estimating the current cost.

It then iterates over all looked-up duplicate values to find the cheapest alternative (if any) & substitutes it in over the current instruction. If that yielded a change which turns the instruction into a noop (assign to self) & it’s not inc’d or dec’d the instruction is deleted.


If that yielded any changes pass_postreload_cse commits them, otherwise it iterates over the instruction’s operands validating any memory ops with possible recursion & looking up from the memory CSE records any CPU registers it can replace the operand with.

After retrieving a bitmask from the instruction (different for speed or size optimizations) a second iteration selects available registers based on cost thresholds & constraints. Operands will select from these after bubblesorting.


For PARALLELized SETs it discards any memory CSE records clobbered by literal asm code, iterates over every action in the PARALLEL to check whether it only contains noops - if so deleting this instruction, iterates over all those SETs again to (as described above) simplify them before committing changes or simplifying the operands.

Regardless of whether that happens it considers adding the instruction to the memory CSE records or discarding invalidated ones.

After tidying up, including the Control Flow Graph if anything changed, it tackles “combining” duplicate regs across control flow: computing the range of index regs & some counts, iterating over codeblocks in reverse to record label ops, & iterating over the CPU regs then instructs to determine when those regs are used, whilst adding notes for where these regs should be merged.

Another iteration over registers then instructs determines where ADD is more optimal than constant MOV.

Global CSE

Yesterday I described how GCC tackles removing common subexpressions left over from its register allocator’s spillage logic. Today I’ll describe the subsequent pass which deduplicates loads between different codeblocks.


To do so pass_gcse2 initializes various collections (including allocators, a smallintmap of instructs counted by type, alias analysis, & a hashtable populated from an iteration over codeblocks, instructs twice, & regs). If it indexed any instructs it reanalyzes dataflow &, unless too expensive, populates a new bitmask with an iteration over that hashtable of operands. It then iterates over the codeblocks (unless there’s only one) & instructs therein, skipping over abnormal edges & cold codepaths, to remove (via various additional iterations) redundant loads whilst updating the table used to determine them. Finally it iterates over that hashtable again & the values’ occurrences to determine when to delete them.

After which it’ll clean up & output debugging info.

Redundant Extension Elimination

Some CPU architectures like x86 can process numbers of different bitsizes, at which point GCC needs to consider conversions between those bitsizes. Register allocation may reveal some of these bitsize conversions to be redundant, so there’s a pass_ree (Redundant Extension Elimination) optimization to merge these conversions (“extensions”) into earlier instructions, possibly freeing up registers in doing so.
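
As a small illustration (assuming an x86-64 target; exact codegen varies with flags), the byte load below is typically done with a zero-extending instruction already, so the widening in the return statement is the kind of redundant extension pass_ree can delete:

    #include <stdio.h>

    /* On x86-64 a byte load is usually a zero-extending movzbl, so the
       implicit widening in the return needs no extra instruction;
       pass_ree can drop the redundant extension. */
    static unsigned int widen(const unsigned char *p)
    {
        unsigned char byte = *p;   /* loaded already zero-extended */
        return byte;               /* this conversion is the redundant extension */
    }

    int main(void)
    {
        unsigned char data = 200;
        printf("%u\n", widen(&data));   /* prints 200 */
        return 0;
    }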


To do so it first reanalyzes dataflow with chaining, MIR, & including dead defs.

With several bitmasks to reference it iterates over the codeblocks & instructions therein, updating those bitmasks for each of the instruction’s definitions & referencing that to gather certain code patterns (indicating redundant “extensions”) into a smallintmap.


pass_ree then iterates over that smallintmap, foreach with debugging output:

  1. Gathers a transitive closure of the instruction’s definitions, skipping if none.
  2. Determines whether the reg allocator indicates a MOV’s needed.
  3. Extracts components from the existing instructions & transactionally/iteratively reconstructs the new code depending on the above conditions.

The number of opportunities & the fraction it managed to take advantage of are counted, & each altered candidate & its first definition are gathered into a new array, alongside another array of the old instructions to delete.

These arrays are finally iterated over to apply the alterations to the code being optimized.

Duplicate Compare Elimination

In Assembly languages conditional control flow is performed by comparing two numbers (bitwise examining the result of, usually, subtraction) to set some bitflags for the conditional branch instructions to refer to. Booleans are usually optimized away.

Since this mismatches how C expresses control flow, duplicate comparisons can easily sneak into the generated machine code, which pass_compare_elim_after_reload eliminates now that code order’s finalized.
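
For example, the two branches below compare the same pair of values; once code order is final the pass can keep a single compare & let both conditional jumps read the same flags (whether it fires depends on the target):

    #include <stdio.h>

    /* Both ifs compare the same a & b; one CMP can feed both
       conditional jumps once the duplicate compare is eliminated. */
    static int compare3(int a, int b)
    {
        if (a < b)  return -1;
        if (a > b)  return 1;    /* may reuse the flags from the first compare */
        return 0;
    }

    int main(void)
    {
        printf("%d\n", compare3(2, 3));   /* prints -1 */
        return 0;
    }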


After reanalyzing dataflow pass_compare_elim_after_reload iterates over the (temporarily-computed) dominators tree, iterating over the instructions of each codeblock to gather valid compares & delete duplicates based on that. Invalid compares still need to be considered as they still invalidate the flags register. As do other instructions which change compared values.

That info might propagate into successors, & C++ exceptions are considered.

The aux field for these codeblocks is then cleared.


The compares gathered by that iteration, if any, are iterated over to try merging them with previous instructs (e.g. CMP & SUB merge trivially) via an inner iteration over previous instructs, deleting the two old instructs & inserting the single new one.

Failing that it iterates over previous instructs to find earlier regs with the same values as the ones being compared. It’ll generate & validate a new compare back at that point in case it helps, updating dataflow notes & deleting the old compare.

Function Prologues & Epilogues

When compiling a C function to Assembly, there’s prologue & epilogue codeblocks handling the setup & teardown. Merging these with the more explicit codeblocks can reveal more optimizations.


To do so pass_thread_prologue_and_epilogue optionally cleans up the control flow graph (CFG), reanalyzes dataflow, calls CPU-specific code to generate these prologues & epilogues (if supported), & finds the optimal placement(s) for these instructs, possibly restructuring control flow to allow simplifications.

In positioning the function prologue & epilogue: not all codeblocks require the prologue, & if the prologue wasn’t executed we don’t require a full epilogue. Control flow code is duplicated to maximize the number of codepaths without a prologue, & hence the number of codepaths with simplified epilogues.
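
A minimal example of the kind of function this benefits: the early return needs no saved registers or stack frame, so the prologue can be sunk onto the path that actually does the work. The names here are just illustrative, & actual codegen varies by target & flags.

    #include <stdio.h>

    static int do_work(int x) { return x * x + 1; }

    /* The x <= 0 path needs no prologue at all; only the do_work()
       path needs the full prologue/epilogue. */
    static int maybe_work(int x)
    {
        if (x <= 0)
            return 0;            /* fast path: ideally prologue-free */
        return do_work(x);       /* prologue sunk onto this path */
    }

    int main(void)
    {
        printf("%d %d\n", maybe_work(-5), maybe_work(6));   /* prints "0 37" */
        return 0;
    }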

Dataflow analysis over the prologue & the codeblocks is performed to determine where the prologue is required before analyzing the (subsequently recomputed) dominators tree.


Once it knows where to place the function prologues it iterates over the codeblocks again to collect any which might run with or without the prologue, which it can then use in a separate iteration to consider duplicating those codeblocks, redirecting edges & inserting the simplified epilogues in two separate iterations.

Then it inserts branches to the full epilogue & frees the dominators tree.

Then it’ll attempt a more thorough/complex positioning of function prologues & epilogues.


That more thorough “shrinkwrap” (as GCC calls it), after validation, separates the function into “components” which it’ll process independently. It deduplicates loads/saves until fixpoint & iterates over codeblocks to determine where it can’t insert the prologue/epilogue according to bitmasks, before actually inserting the new code whilst outputting debugging info, then cleaning up.

If that succeeded it’ll regenerate the function prologue/epilogue & rewrites returning C++ exception branches.


It locates the fallthrough edge to the function epilogue & inserts it there, turning any other returns into GOTOs to this new code. Then it transactionally inserts the prologue code in the codeblocks where decided. Possibly restructuring the codeblocks according to a new bitmask.

A separate iteration ensures tailcalls are compiled correctly.

And it conditionally considers reordering the epilogue, before updating the dataflow entry/exit call info.


After that main optimization it reanalyzes dataflow & iterates over the codeblocks to find any marked “hot” which are only called from cold codepaths, as may be introduced once the codepaths without function prologues are split off. Then it iterates over those invalidly-marked codeblocks to mark them as cold & update any groupings to reflect that via an inner iteration over the codeblocks.

Finally it’ll cleanup the now significantly altered CFG (optionally with more effort).

Optimizing Callstack (De)allocation

In Assembly temporary data is normally stored either in CPU regs or the callstack, with the reg allocator being largely responsible for choosing between them (though it prefers recomputation over callstack storage). When storing data in the callstack you first need to allocate that memory, which incurs some overhead. So there’s a pass_stack_adjustments optimization which attempts to minimize this cost!


With reanalyzed dataflow (+notes) & a bitmask it iterates over all codeblocks.

For each codeblock pass_stack_adjustments extracts a bitmask from the dataflow, iterates over all instructions therein, & cleans up.

For each instruction it determines whether it’s resizing the callstack. If it finds such an “adjustment” it checks whether it can efficiently combine it with the previously-recorded (if any) adjustment, handling it slightly differently for downwards- vs upwards-growing stacks, thoroughly validating the change before applying it, & noting it in the dataflow graph.


If the callstack allocations & deallocations conspired to cancel each other out, it deletes the instruction.
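
Here’s a toy sketch of that combining & cancelling, over a plain array of byte adjustments rather than real instructions; the real pass also has to validate each combination against the target before applying it:

    #include <stdio.h>

    /* Merge runs of adjacent stack-pointer adjustments (in bytes; negative =
       allocate, positive = deallocate) & drop any that cancel out to zero. */
    static int combine_adjustments(int adj[], int n)
    {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (out > 0)
                adj[out - 1] += adj[i];   /* fold into the previous adjustment */
            else
                adj[out++] = adj[i];
            if (out > 0 && adj[out - 1] == 0)
                out--;                    /* allocation & deallocation cancelled */
        }
        return out;                       /* new number of adjustments */
    }

    int main(void)
    {
        int adj[] = { -16, -16, 32, -8 };
        int n = combine_adjustments(adj, 4);
        for (int i = 0; i < n; i++)
            printf("%d\n", adj[i]);       /* prints just -8 */
        return 0;
    }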

If it finds a memory store which so happens to align with the new desired callstack value (stackslot referencing?) the callstack pointer will be copied from that register.

Function calls invalidate this pass’s state.

Optimizing Away Dynamic GOTOs

To run code quickly CPUs need to fetch instructions long before they’re actually run. As such it’s faster to run a GOTO to a constant memory address than it is to GOTO a dynamically-computed memory address. It’s even worth duplicating code so each computed GOTO site gets its own, more predictable, indirect branch! Which is what GCC’s pass_duplicate_computed_gotos optimization does.
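
Here’s a minimal computed GOTO (a GNU C extension) of the shape this pass targets: every handler funnels into one shared indirect jump, which the pass may copy into each handler so every opcode gets its own indirect branch:

    #include <stdio.h>

    /* Each handler jumps to one shared computed GOTO;
       pass_duplicate_computed_gotos may duplicate that indirect
       jump into each handler to aid branch prediction. */
    static int run(const unsigned char *ops)
    {
        static void *table[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
    dispatch:
        goto *table[*ops++];     /* the computed GOTO worth duplicating */
    op_inc:  acc++; goto dispatch;
    op_dec:  acc--; goto dispatch;
    op_halt: return acc;
    }

    int main(void)
    {
        const unsigned char prog[] = { 0, 0, 1, 2 };  /* inc, inc, dec, halt */
        printf("%d\n", run(prog));                    /* prints 1 */
        return 0;
    }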


To do so, after computing some optimization parameters, it iterates over the codeblocks then tidies up the CFG & partitions if anything changed.

Where it finds a computed GOTO at the end of a duplicatable codeblock, pass_duplicate_computed_gotos first sums the instructions’ attr lengths to determine whether it’s too long to be worth duplicating.

If that indicates it’s appropriate it’ll iterate over the codeblock’s edges to find all predecessors which might benefit from this optimization & can be duplicated into. Where so it duplicates the codeblock & merges it into that predecessor.

RTL Peephole Optimization

Like in most programming languages, in Assembly there’s often multiple different ways to express the same operation, some of which are more concise than others. So throughout the optimization pipeline there are passes, like pass_peephole2, which convert the less concise instruction sequences into more concise ones.
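
As a toy illustration of the idea (on an invented mini-IR, not GCC’s RTL & nowhere near its generated matchers), here’s a 2-instruction peephole folding adjacent immediate ADDs to the same register into one:

    #include <stdio.h>

    /* Invented mini-IR: ADD_IMM adds an immediate to a register. */
    enum op { ADD_IMM, OTHER };
    struct insn { enum op op; int reg; int imm; };

    static int peephole(struct insn code[], int n)
    {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (out > 0 && code[i].op == ADD_IMM &&
                code[out - 1].op == ADD_IMM &&
                code[out - 1].reg == code[i].reg) {
                code[out - 1].imm += code[i].imm;   /* fold into previous ADD */
                continue;
            }
            code[out++] = code[i];
        }
        return out;   /* new instruction count */
    }

    int main(void)
    {
        struct insn code[] = {
            { ADD_IMM, 0, 1 }, { ADD_IMM, 0, 1 }, { OTHER, 1, 0 }, { ADD_IMM, 0, 4 },
        };
        printf("%d instructions remain\n", peephole(code, 4));   /* prints 3 */
        return 0;
    }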


To do so pass_peephole2 first sets some flags, reanalyzes dataflow, & initializes a bitmask & bitmask array, before iterating over the codeblocks in reverse.

For each codeblock pass_peephole2 first profiles it, copies the liveness bitmask in from previous analysis with dataflow postprocessing, & flags the sidetable’s entries as invalid, except the last one which is initialized with that liveness bitmask. Then it iterates over every instruct therein.

For each instruct it fills & tweaks that liveness sidetable. With a full liveness sidetable it continues by calling some code generated from CPU-specific data to determine which instruct to replace it with.


After making sure to properly handle e.g. function calls & the callstack, it inserts the instruction that metaprogrammed CPU-specific code told it to, deleting the old instruct it’s replacing & annotating with the proper register-reallocation notes.

If that’s also successful it’ll repeatedly update the sidetable with updated liveness info until fixpoint, to inform the next attempt at replacing an instruct sequence. Regardless of any success it increments the index into the sidetable/ringbuffer.

The instructions iteration ends when it fails to fill the sidetable.

Once the outer codeblocks iteration ends it reprofiles the function, frees the bitmasks in the sidetable or not, & tidies up jump labels and/or control flow if that’s flagged as needed.

Register Renaming

After register allocation there were a few passes which might have lightened “register pressure”. So now register allocation is refined given this new information by a register renamer.

Modern CPUs also include a regrenamer in order to allow instructions to get rearranged into whichever order their referenced data comes in.


After reanalyzing dataflow with dead code elimination & tidied notes, and with an allocator & array initialized, pass_regrename analyzes then applies renamings.

With some additional collections (postorder codeblock array, codeblocks sidetable, current ID counter, & chains array + bitmask), pass_regrename’s analysis involves an iteration over all the codeblocks (skipping flagged ones) in postorder, instructs therein (skipping DEBUGs), & their operands.


For each operand it gathers IN/OUT operands (with some preprocessing on OUT) to initialize the chains array. Per instruct that’s then used to update the available regs with a few more iterations.

Reads outside an instruct’s operands, function calls, & literal Assembly all partially invalidate this analysis, requiring special handling. Then it concatenates the appropriate chains & flags the operands appropriately. Any chains live across function calls should be moved to a callee-saved register, before determining which chains should be closed & updating the collection of live regs.

If that succeeded, per-codeblock it truncates chains so they don’t enlarge & propagates them to the next codeblock.


A followup iteration over the codeblocks, incoming chains bitmask from the previous iteration, control flow edges, & the connected codeblock’s open chains (again from previous iteration) determines where chains can be merged between codeblocks. Counting possibilities.

A third iteration actually merges their bitmasks.

A final iteration clears each codeblock’s auxiliary data.


After possibly reserving the callstack pointer regs, it iterates over those chains to apply them.

For each chain it validates if it can actually rename the register, counts the number of instructions in the (linkedlist) chain (skips if less than 2) & determines registers it can rename to. It chooses a register for the chain by first selecting an appropriate register set, then iterating over them possibly twice to check which are available. If this is a different register than the current one it iterates over the chain to set the new register where valid.
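
A toy sketch of that final choice, loosely modelled on what’s described above: pick the first register of the preferred class that isn’t live anywhere across the chain, otherwise keep the old one. The register numbering & “class” bitmask are invented for illustration.

    #include <stdio.h>

    #define NUM_REGS 8

    /* Pick a rename target for one chain: a register of the right class
       that isn't live across the chain, else fall back to the current reg. */
    static int pick_rename(unsigned live_across_chain, unsigned class_mask, int current)
    {
        for (int reg = 0; reg < NUM_REGS; reg++) {
            unsigned bit = 1u << reg;
            if (reg != current && (class_mask & bit) && !(live_across_chain & bit))
                return reg;              /* free register of the right class */
        }
        return current;                  /* no better choice: keep the old reg */
    }

    int main(void)
    {
        /* regs 0 & 1 are live across the chain; the chain currently uses reg 1. */
        printf("rename to r%d\n", pick_rename(0x03, 0xff, 1));   /* prints "rename to r2" */
        return 0;
    }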

Copy Propagation Hardware Registers

When you have e.g. MOV instructions, it’s often possible to optimize them away using a “copy propagation” pass. So pass_cprop_hardreg (copy propagate hardware register) does so to aid concision & more flexible instruction scheduling.
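
Here’s a toy version of the idea on an invented mini-IR: after a register-to-register MOV, later uses of the destination are rewritten to the (older) source until either register is overwritten. The real pass tracks far more (modes, subregs, notes), but the core table looks roughly like this:

    #include <stdio.h>

    #define NUM_REGS 8
    enum op { MOV, ADD };                 /* MOV dst, src: dst = src;  ADD dst, src: dst += src */
    struct insn { enum op op; int dst, src; };

    static void cprop(struct insn code[], int n)
    {
        int copy_of[NUM_REGS];            /* copy_of[r] = older register holding the same value */
        for (int r = 0; r < NUM_REGS; r++)
            copy_of[r] = r;

        for (int i = 0; i < n; i++) {
            code[i].src = copy_of[code[i].src];          /* use the oldest known copy */
            int dst = code[i].dst;
            for (int r = 0; r < NUM_REGS; r++)
                if (copy_of[r] == dst)
                    copy_of[r] = r;                      /* dst is overwritten: drop stale copies */
            copy_of[dst] = (code[i].op == MOV) ? code[i].src : dst;
        }
    }

    int main(void)
    {
        struct insn code[] = {
            { MOV, 1, 0 },    /* r1 = r0 */
            { ADD, 2, 1 },    /* r2 += r1, rewritten to r2 += r0 */
        };
        cprop(code, 2);
        printf("ADD r%d, r%d\n", code[1].dst, code[1].src);   /* prints "ADD r2, r0" */
        return 0;
    }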


To do so, with a bitmask, array, & reanalyzed dataflow+notes, it first iterates over all codeblocks validating it can apply the optimization to them (yet), & iterates over all the instructions therein referencing that array as a regrename table.

After propagating reg rename tables where control flow is simple enough, or initializing a new one, it iterates over all codeblocks to update the codeblock’s reg rename table & the instructs themselves depending on which opcode it is. Deleting obviously dead stores, adjusting relevant OUT operands into INOUT operands, & updating the dataflow notes as it does so. Plain MOVs are handled specially as it’s possible this pass can remove them.

If any changes are made it’s queued for reprocessing.


After reanalyzing dataflow & specially handling DEBUG instructions, it iterates over all codeblocks queued for reprocessing.

Until fixpoint, with reset dataflow analysis & bitmasks, for each codeblock it reruns that same iteration over its instructions.

RTL Dead Code Elimination

At this point another dead code elimination pass is run over the RTL intermediate code, in case the previous lowering passes introduced any mess. Dead code elimination is essentially garbage collection for instructs.
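
As a toy illustration (again on an invented mini-IR), dead code elimination boils down to a backwards walk with a liveness set: any write whose result is never read afterwards, & which has no side effects, can be deleted:

    #include <stdio.h>

    #define NUM_REGS 8
    struct insn { int dst, src; int is_return; int dead; };

    static void dce(struct insn code[], int n)
    {
        int live[NUM_REGS] = { 0 };
        for (int i = n - 1; i >= 0; i--) {
            if (code[i].is_return) {
                live[code[i].src] = 1;      /* the returned register is live */
                continue;
            }
            if (!live[code[i].dst]) {
                code[i].dead = 1;           /* result never read: delete it */
                continue;
            }
            live[code[i].dst] = 0;          /* this write satisfies later reads */
            live[code[i].src] = 1;          /* ...& its operand becomes live */
        }
    }

    int main(void)
    {
        struct insn code[] = {
            { 1, 0, 0, 0 },   /* r1 = r0   (dead: r1 is overwritten below) */
            { 2, 0, 0, 0 },   /* r2 = r0 */
            { 1, 2, 0, 0 },   /* r1 = r2 */
            { 0, 1, 1, 0 },   /* return r1 */
        };
        dce(code, 4);
        for (int i = 0; i < 4; i++)
            printf("insn %d %s\n", i, code[i].dead ? "deleted" : "kept");
        return 0;
    }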


With typical collections & some bitmasks initialized, pass_fast_rtl_dce repeatedly iterates over all the codeblocks in postorder (bitflagging them & deciding whether to reprocess) & instructs therein in reverse. Bitmasks are cleared between iterations.

Per instruction (after reanalyzing dataflow which it uses to approximate initial garbage, clearing some state, & optionally outputting GCC debugging info) pass_fast_rtl_dce collects every use of a DEBUG or flagged instruction as being dead. For non-DEBUG instructions this is followed up by additional dataflow analysis.

The collected dead instructions are deleted in a separate iteration & the bitmasks for per-codeblock analysis are freed.

RTL Codeblock Reordering

The order in which code is stored in memory is crucial for reducing the load on the instruction prefetcher, and can reduce the need for GOTO instructions!
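
For a source-level example, __builtin_expect (a real GCC builtin) marks a branch as unlikely, which block reordering can use to keep the hot path as straight-line fallthrough code & push the cold error handling out of the way (final layout still depends on target & flags):

    #include <stdio.h>
    #include <stdlib.h>

    /* The error branch is hinted as unlikely, so reordering can lay out
       the likely path as fallthrough & move the cold block aside. */
    static int checked_double(int x)
    {
        if (__builtin_expect(x < 0, 0)) {   /* unlikely, cold block */
            fprintf(stderr, "negative input\n");
            exit(1);
        }
        return x * 2;                        /* likely, hot fallthrough block */
    }

    int main(void)
    {
        printf("%d\n", checked_double(21));  /* prints 42 */
        return 0;
    }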


Between normal Control Flow Graph optimizations pass_reorder_blocks:

  1. Iterates over codeblocks & their control flow edges to flag fallthru edges, and again to mark backedges.
  2. Either:
    • Iterates over codeblocks then their stably-sorted edges to gather edges that can be optimized away via reordering, using this to choose an order.
    • Uses possible codepaths with likelihoods to choose the code order.
  3. Applies the new order in a separate iteration.
  4. Applies normal CFG cleanup.
  5. Copies the computed codeblock order into each codeblock’s auxiliary dataflow field.

Afterwards a final iteration over the codeblocks recomputes LUID dataflow.

Leaf Registers

There’s a pass_leaf_regs analysis which flags functions which only use “leaf” registers that can be safely renumbered (in an iteration over all their registers, consulting a CPU-specific lookup table) & don’t call any other function (an iteration over instructions).

Stack Machines

After a few repeat passes, the final optimization in this subsuite is pass_stack_regs_run, which only matters for register-stack machines. If I understand its blurb correctly, it lowers flat register references onto a register stack, as used by x86’s legacy x87 floating-point unit. It might be disabled at compiletime.


After initializing collections, checking whether there’s actually any work to do, reanalyzing dataflow, & bitflagging depth-first-search backedges, it iterates over codeblocks then regs, followed by the actual conversion.

The iteration over codeblocks counts predecessors & bitflags register usage. The iteration over registers determines what to replace them with.

After choosing a NaN value & allocating a collection for stackregs, pass_stack_regs_run determines entry & exit stackregs, followed by an iteration over edges then codeblocks which (with a new array & updated stackpointer) repeatedly collects from all predecessors then:

  1. Chooses an optimal exit edge & register layout for the predecessors.
  2. Iterates over the instructions to rewrite them & rearrange the register stack as they (or the referenced function) expect.
  3. Handles DEBUG instructions in a separate iteration.
  4. Iterates over the used registers to determine when it needs to initialize a value for return/output regs.
  5. Deletes dead control flow edges.

This repeats until the register stack stabilizes.


Abnormal edges are corrected, the auxiliary codeblock field is cleared, & edge mutations are committed.

If needed the control flow graph is tidied up. And if any changes were made it’ll thoroughly reanalyze dataflow. Flagging this pass has run in either case.

Backend passes

These passes largely relate to generating the compiled Assembly code, more than rewriting that code to be more optimal.

Zero Call Used Regs

Within the final subsuite of GCC optimizations it first rewrites any exit edges of functions annotated with zero_call_used_regs to reset the specified registers, if used, to 0.
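
For example (assuming GCC 11 or later on a target that implements it, such as x86), the attribute below asks GCC to zero whichever call-clobbered registers the function used before it returns, which is the zeroing this step inserts on the exit edges:

    #include <stdio.h>

    /* GCC zeroes the call-clobbered registers this function used on return,
       so secrets don't linger in registers after the call. */
    __attribute__((zero_call_used_regs("used")))
    static int check_pin(int entered, int expected)
    {
        return entered == expected;
    }

    int main(void)
    {
        printf("%d\n", check_pin(1234, 1234));   /* prints 1 */
        return 0;
    }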

Compute Alignments

The following pass, pass_compute_alignments, computes per-branch memory GOTO alignments by, with loop indices computed, iterating over codeblocks. For each codeblock it counts the edges & if there’s any exits it selects the max alignment, considering a few fixups for GOTOs & loops.

Variable Tracking

When debugging your C programs in, say, GDB you need to see the value of local variables no matter where they’re currently stored. So this info needs to be annotated.


To do so pass_variable_tracking, with flags saved, deletes any debug instructions if bitflagged to do so or if there’s too much code, exits early if not flagged to do this analysis, initializes collections (including per-codeblock data) whilst examining the callstack structure, …

collects info via the Common Subexpression Elimination infrastructure from all codeblocks & instructions therein, cleans up & deletes DEBUG instructions if those last two steps are unsuccessful, then with several new collections (bitmasks, arrays, etc) repeatedly traverses a worklist of all codeblocks to propagate variable names (if unsuccessful it deletes DEBUG instructions & alters flags before trying again or hard-failing), & a separate iteration outputs the differential annotations.

Free Control Flow Graph

GCC follows that up with a pass_free_cfg pass which, after reanalyzing dataflow & noting down where the cold/hot codepaths were split, frees the Control Flow Graph analysis heavily utilized by almost all of the previous optimization passes.

Machine Reorganize

And the next pass_machine_reorg optimization defers to CPU-specific code to finalize instruction layout, since we now know final register allocation & jump optimizations.

Membarrier Cleanup

When writing asynchronous code you will need to, at times, reintroduce synchronization. Since this synchronization inherently defeats the asynchronous optimizations it’s vital for duplicate synchronization instructions to be deleted.

To do so pass_cleanup_barriers iterates over instructs, skipping non-“barriers”, checking if the previous instruct (save for NOTEs or DEBUGs) is also a barrier. If so it’s deleted. If not it tries reordering the instructs in the previous instruct’s recomputed codeblock.

Delay Slots

Whenever you run a (conditional) jump a couple clock cycles will be spent loading those instructions in. As such it’s worth selecting instructions to fill in those arguably-wasted clock cycles by rearranging instructions.


To do so pass_delay_slots, if there’s a non-minimal number of codeblocks & with a smallintmap from instruct UIDs to new positions plus an allocator, runs an initial iteration over all instructions.

For each instruct it skips jump tables & rewrites any jumps to deduplicate consecutive labels.

After reinitializing a few collections it runs a number of passes, involving two attempts at filling delay slots by iterating over all unfilled slots with a corresponding non-deleted branch instruct. It retrieves the number of delay slots to fill &, if any, checks if the next “active” instruct is another branch to add to this delaylist. Then it iterates backwards to find instructs it can move after.


If both attempts at that failed it’ll iterate over the unfilled slots a third time to locate their jump targets & iterate forward over those instructions (per some flags set differently for tight loops) to find probable instructions to speculatively prioritize, taking care around control flow.

After which it’ll iterate over the instructions again to tidy up any messy JUMP instructions any of that reordering left behind, whilst outputting those delay slots.


After those main passes it considers deleting either or both (fast vs slow) of the labels to function return code. Jumps to the fast return may be converted into even faster RETURN instructions in two additional iterations over the function’s instructions.

Another iteration deletes any USE instructions, as future optimizations don’t care about those. Then it’s just cleanup & GCC debugging output.

Shortening Branches

Skipping a near-repeat (simplified) GCC optimization decomposing instructs for greater parallelism & one further lowering C++ exceptions, the next optimization pass on my list is pass_shorten_branches.

Some machine & Assembly languages have variable-sized addressing in conditional & unconditional jumps. pass_shorten_branches specifically chooses which variants of these control flow instructs should be used!


With several collections it runs an initial iteration over the instructs.

For each instruction, skipping non-labels, it finalizes previously-computed alignment info to layout jump tables. Membarriers (thread synchronization) impact this alignment.

It stops here if the HAVE_ATTR_length flag is unset.

With some additional collections initialized it iterates over the instructions again in reverse, looking at the alignment logs attached to any labels to populate a sidetable via a lookup table.


Optionally (at runtime & compiletime) it iterates over instructs a third time & the range of “SHUIDs” within, retrieving the min alignment from the SHUID-referenced labels, from which it sets fields & bitflags on the instruct.

A fourth, unconditional iteration computes the instructs’ sizes, skipping metadata (notes/barriers/labels/debugs) & deleted instructs, handling jump tables & literal assembly specially, & otherwise taking into consideration the previously computed “delay slots” or clobbering.


Finally, until fixpoint pass_shorten_branches repeatedly iterates over all instructs again handling jumptable labels, control flow or literal data references (thoroughly aligning the point), & SEQUENCEs (2 approaches) to determine whether it can use more compact instructions, updating instruction size data as it goes & setting a repeat flag if anything changed.

NOTHROW flagging

Next in GCC’s optimization pipeline there’s a pass which iterates over a function’s instructions to flag more of them NOTHROW.

Followed by one outputting DWARF2 annotations for the sake of tools like GDB. Including static annotations, & ones per instruction (twice) or codepath. With lots of checks, involving some collections & plenty of conditions.

Though without a good idea of what DWARF2 is, I’m not following it well.

Codegen

The final step of GCC’s optimization pipeline (save cleanup) is codegen, or as it calls it pass_final. This is what generates the Assembly code which is (after assembling via Binutils) actually run.


After optionally deleting debug instructions, it’ll “start the function” by generating labels for the hot & cold partitions of the function if any, disables the assembler’s preprocessing, outputs any constants used in this function, computes alignment for the hot/cold partitions if any, …

switches to func start section & outputs a label, instructs assembler about code alignment if any or dev-specified, outputs func prefix, calls a debugging callback, conditionally informs the linker about the func (enqueues in one or two queues & calls CPU-specific code), conditionally mark the function as preserved, outputs a “patching area”, outputs the function’s name or label, optionally outputs another patch area, & conditionally flags saw_no_split_stack. Most of which is CPU-specific.


This func prelude is followed up by optionally outputting address validation, (nearly) skipping over some initial instructs, conditionally calling a debug callback, outputting DWARF2 metadata, optionally renaming regs for “leaf” funcs that don’t call any others, optionally iterating over instructs to determine whether it needs to output profiling instructs (& if so doing so), outputting GCC debugging metadata with an iteration over instructs then blocks, outputting the func prologue, then maybe more profiling.


The first, optional, iteration annotates where LABELs were jumped to from.

If there were DEBUG instructions it extracts smallintmaps from the codeblocks array.

A main iteration switches over each instruction/NOTE to serialize the Assembly code via the backend callbacks (making sure to be in the right sections) whilst counting code addresses. Then it cleans up some collections.


If a flag’s set & some function attributes aren’t it’ll iterate over the instructions yet again to collect which CPU registers are used. Unconditionally followed by backend-specific code to output the function epilogue.

C++ exceptions are handled. Followed by some more not-quite-as CPU-specific function epilogue output.

The register info is freed, the output is flushed out, a debug callback is called & timed, memory is freed, & C++ CPU-specific con/de-structors are outputted.

Cleanup

This pass_final is followed up by two more passes, to deallocate memory for dataflow analysis then compilation state.

During the state cleanup pass (pass_clean_state) it also puts a fair bit of work into outputting debugging info for the GCC devs.

Backend

The GCC backends consist of a lot of code which, at least on first glance, doesn’t appear to be doing anything that interesting. Implemented as part of the build system, a GCC backend implements a shared common API across all backends, with the build system choosing which implementation files to use for the target architecture at compiletime.

Data Output

Within any program lives data, & not all of it comes from external input. This data may be documentation, part of the datamodelling, and/or decoration. It may be numbers, text, images, audio, or basically anything else, though you may want to use different tools like GIO’s GResource or Haskell’s file-embed to embed image, sound, etc files.

During GCC’s compilation this data needs to be optionally optimized & written out separately.

Unstructured Data

The first step of either optimizing or compiling constants is to serialize them out from the Abstract Syntax Tree into a bytearray, which is done using a recursive function & may involve normalizing endianness & rounding floats.

If there were any constants to place in the bytearray the optional optimization pass will then sort the list of its slices by decreasing size then hash, which it then iterates over to remove any duplication in a new bytearray with hashmap help.
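
A toy sketch of that deduplication: identical byte sequences get stored once & callers receive the shared offset. GCC sorts the slices & uses a hashtable; a linear search (with no bounds checking) keeps this sketch short:

    #include <stdio.h>
    #include <string.h>

    static char pool[256];
    static size_t pool_len;

    /* Return the offset of these bytes in the pool, reusing an existing
       copy where one already exists. */
    static size_t intern(const void *bytes, size_t len)
    {
        for (size_t off = 0; off + len <= pool_len; off++)
            if (memcmp(pool + off, bytes, len) == 0)
                return off;                    /* deduplicated */
        memcpy(pool + pool_len, bytes, len);   /* otherwise append */
        pool_len += len;
        return pool_len - len;
    }

    int main(void)
    {
        size_t a = intern("hello", 6);
        size_t b = intern("hello", 6);         /* shares the first copy */
        printf("%zu %zu\n", a, b);             /* prints "0 0" */
        return 0;
    }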


That optional deduplicating optimization alters the AST for the output logic to iterate over the constant pool again. This checks if each constant needs to be outputted again (a couple of different cases), where it should do so, & writes it out with its labelling to Assembly (not Machine Code). It also special-cases an “object block”, but I don’t think that’s relevant for C.

Once it has a place (if any) to place the constant it uses a different serializer, writing the bytearray as text.

Structured Data

To output unstructured data it iterates over every “block” of code in a consistent order. For each it starts by declaring a new section & adding alignment bytes with any relevant labels.

It iterates over every object in the block, outputting additional constant pools separately.


Structured Data may need to be outputted as constant or variable; variable blocks may need to be written out later (I suspect this isn’t a C thing). Both may optionally run the code to add debugging function calls for sanitizing memory addresses.

In either case it uses a recursive function to traverse the Abstract Syntax Tree (why not IL?). This starts by checking if it’s a pointer value to output that specially.

After doing some minimal integer & boolean constant propagation it outputs this data to Assembly taking endianness into account.