How do the C(++) programs you write, and all the C programs making up an OS, get converted into the raw machine code your CPU’s Control Unit can directly understand? Primarily using GCC & GNU Binutils.
Many languages also follow a similar process, or they may have special programs for “interpreting” their bytecode.
“Binary” executable files contain more than the raw machine code your CPU’s control unit can directly understand. There’s additional labelling for our sake, & for linkers (though I haven’t gotten through that code yet).
Whilst today’s topic GNU Binutils contains a suite of commands, the core of most of them is libbfd. LibBFD’s header files are predominantly C structs & macros defining the numerous formats (including archive files) GNU supports; it also implements method tables for them.
Usually the read & write implementations for each format don’t take that much code, but ELF for some reason takes a lot more. Can anyone please explain what ELF is about?
LibBFD also provides utilities, e.g. a hashmap implementation, that are central to many GNU Binutils commands.
There’s several trivial (or at least as trivial as C gets) commands, though I’ll have to discuss the major commands over the next two days…
ELF “object” format
I don’t know the other commonly-used formats well, but ELF is very complex.
ELF is basically a memory representation, that’s why it can be used for programs or core-dumps.
It describes memory sections and their contents, and can also provide other extra data, like symbols and debug headers.
– Ekaitz Zarraga
Actually ELF offers two complementary views into the binary:
The section header, meant to be used by the linker at build time, tells what are the contents of each section of the binary (e.g., symbols, relocation entries, debug info).
The program header, meant to be used by the loader (e.g., ld.so) to create the process image, tells which parts of the binary (“segments”) go where in memory, and which sections they are made of.
– Thiago Jung Bauermann
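To make those two views concrete, here’s a minimal sketch of the start of an ELF64 file header (field names abbreviated; the real definitions live in `<elf.h>`), which records the file offsets of both the section headers & the program headers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A minimal sketch mirroring the start of the ELF64 header layout
 * documented in <elf.h>; field names here are abbreviated. */
struct elf64_header {
    uint8_t  ident[16];   /* magic, class, endianness, version, ABI, padding */
    uint16_t type;        /* relocatable, executable, shared object, core dump */
    uint16_t machine;     /* target CPU architecture */
    uint32_t version;
    uint64_t entry;       /* entry-point virtual address */
    uint64_t phoff;       /* file offset of the program headers (loader's view) */
    uint64_t shoff;       /* file offset of the section headers (linker's view) */
};

/* Returns 1 if the buffer starts with the ELF magic bytes 0x7f 'E' 'L' 'F'. */
int is_elf(const uint8_t *buf, size_t len) {
    static const uint8_t magic[4] = {0x7f, 'E', 'L', 'F'};
    return len >= 4 && memcmp(buf, magic, 4) == 0;
}
```

Both `phoff` & `shoff` may be present in the same file, which is how a binary serves the linker & the loader at once.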
GNU ASsembler (GAS)
The GAS program (not to be confused with the recent GNU Assembly project) parses human-readable assembly code, resolves e.g. jump targets, & converts it into opcodes the machine directly understands.
After extensive initialization & parsing the commandline flags (some flags are specific to which CPU it’s assembling for; referenced files are parsed with their flag), it does sanity checks on input & output files.
Next it initializes all the datastructures used for resolving symbols, fragments, subsegs, reading, input scrubbing, expression parsing, macros, writing the output file (whilst opening it) via LibBFD as mentioned yesterday, dot symbols also via LibBFD, anything else specifically for the target CPU, optionally instruction tables, & DWARF.
Then it defines a “.gasversion” symbol & any others defined on the commandline, saving them all to a hashtable.
Now it may start the real work…
For Mach-O files (compiletime flag) it starts by creating text, data, & bss subsegments via LibBFD then sets their flags. Regardless it adds ABS, UND, & special GAS reg/exp subsegments. CPU architecture-specific logic follows, which for i386 includes initializing hashtables for instruction mnemonics, register mnemonics, & lexing.
Then it iterates over all source files listed on the remaining commandline arguments & for each…
After this iteration it cleans up & ends.
For each file it resets several other subcomponents possibly creating a debug file.
For each line, chunks of which are read in by the “input scrubber” (scanner), it first checks if it’s seeing a label & if so defines a new symbol for it. After skipping preceding whitespace it checks if the following identifier is followed by a ==, otherwise looking it up in the pseudo-ops & macros tables to evaluate before parsing the opcode.
Non-identifiers may be empty, local labels, or comments.
Assembling each line of Assembly code (after labelling, pseudo-ops, & macros have been resolved) is CPU-specific. For i386 it:
- Zeros out some globals
- Parses out & looks up the mnemonic validating it’s appropriate for the target CPU with prefixes & suffixes
- Parses the operands depending on whether AT&T or Intel syntax is used, possibly involving the expression parser
- Applies typechecks & corrections
- Chooses best instruction for holding literals & displacements
- Locates the appropriate bitpattern from the looked up mnemonic
- Validates that opcode, adding missing FWAIT & BND prefixes
- Attempts to shorten the operand bitsizes
- Fills in missing suffixes
- Some further processing I don’t understand
- Something to do with VEX instructions
- Switch to more compact opcodes for literal 3 & JUMPs
- Add empty REX prefix for 8bit registers
- Output the instruction, possibly with LFENCEs before and/or after
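To give a flavour of the mnemonic lookup at the heart of this pipeline, here’s a toy sketch in C (nothing like GAS’s real, far richer templates, which also carry operand constraints, prefixes, & CPU feature flags; these three entries are just well-known one-byte x86 opcodes):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A toy sketch of the mnemonic -> base-bitpattern lookup that i386
 * assembly goes through. GAS uses hashtables; a linear scan suffices here. */
struct insn_template {
    const char *mnemonic;
    uint8_t base_opcode;
};

static const struct insn_template templates[] = {
    {"nop", 0x90},  /* one-byte no-op */
    {"ret", 0xC3},  /* near return */
    {"hlt", 0xF4},  /* halt */
};

/* Returns the base opcode byte, or -1 if the mnemonic is unknown. */
int lookup_opcode(const char *mnemonic) {
    for (size_t i = 0; i < sizeof templates / sizeof templates[0]; i++)
        if (strcmp(templates[i].mnemonic, mnemonic) == 0)
            return templates[i].base_opcode;
    return -1;
}
```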
The GNU Linker subproject is probably the main reason Linux From Scratch has you install GNU Binutils first, as it’s what combines multiple “object files” into a single executable whether at build time (for reliability) or by Linux immediately prior to running the program (for memory efficiency).
GNU Binutils also contains a Gold subproject contributed by Google that does the same thing in C++ with multithreaded optimizations, but I won’t discuss that further.
After extensive initialization & the correct filepaths have been found, which in part involves looking up a method table for emulation, it parses remaining commandline flags with help from the emulator. As part of the commandline flags parsing it may edit the object file via LibBFD & its ldlang subcomponent (which has a stack), open/parse config files via a Yacc/Bison script, and/or as per usual set globals.
This is followed by validation & LibBFD adjustments.
During this validation/finalization it calls some additional methods on the emulator & loads plugins.
The ldlang subcomponent may now run to reflect its complex linked-tree-based datamodel in LibBFD with hooks for those plugins & the emulator.
After this initial linking pass it gathers the list of everything to be linked, from which it can build start & stop sections before applying all the gathered assignments via LibBFD. By this point we’re traversing “expressions”.
Next it makes sure all symbols are associated with the correct sections & optionally removes any unreferenced symbols. There’s some validation which may also be handed to an emulator method. It marks any “wild” sections. It traverses the script to associate inputs to outputs. It evaluates resulting “insert statements”.
After another emulator hook it gives any remaining input sections somewhere to go. It merges any sections requesting that, then merges CTF sections, which it’ll then write.
GNU Binutils bundles another library for handling CTF regions.
It then copies LMA regions forward a section. It optionally defines __start & __stop sections early, followed by startof & sizeof sections.
An emulator hook is called before recording sections taking into account any VMA memory hierarchy.
It locates the relro section if present, then computes the memory sizes of each section in the linked tree.
After another emulation hook it configures the __start & __stop sections.
The ldlang subcomponent then evaluates all remaining assignments, calls a “finish” emulation hook, rewrites ldexp’s symbols table to reference sections, validates the result, checks whether required symbols are defined, & performs some final adjustments.
Having processed the “lang” datamodels it outputs any warnings & errors before writing the data out by:
- Validating the output from ldlang
- Traversing ldlang’s tree, adjusting the object file via LibBFD for each statement
- Possibly “splitting” the output section by relocating its entries via LibBFD
- Finalizing BFD’s linking, reporting any errors
It may then output the resulting information to configured files for debugging, etc and/or do more expensive validation, before cleaning up after itself & copying the file into its destination.
Perhaps the most famous subproject of GNU Binutils (which doesn’t come in the same download, nor would be relevant to Linux From Scratch at this early stage) is the GNU DeBugger!
GDB’s vast codebase can reasonably be summarised as an event-driven commandline program following the Bridge object-oriented design pattern. That is, its behaviour is defined by the interaction between the appropriate implementations of:
- Converting to/from low-level memory addresses & high-level code references (LibBFD)
- How to actually stepwise-debug running machine code on each supported kernel/CPU combination (the bulk of the codebase; can communicate with GDBServer, with which it shares code, for remote debugging). This includes data files on available x86 features, registers, & kernel syscalls.
- Whether output should be more human readable or machine readable (Hey, I’d like a nice elementary GUI for GDB? More of a non-monolithic IDE?)
- All the different commands you might type, each bundle of which might require additional datamodelling.
- Python & Guile language bindings to those commands, so you can extend the set.
P.S. Both GDB & GtkSourceView take care to provide all the hooks you need to tie the two together, so I’m serious about wanting that elementary GDB GUI! I think this was done mainly for GNOME’s monolithic IDE.
But the real logic is ultimately in the kernel’s (in our case Linux’s) ptrace() syscall.
To handle GDB’s ptrace() syscalls:
- PTRACE_TRACEME: after security checks it sets a flag, possibly “attaches” CPU-specifically, & returns.
- Otherwise it finds the referenced process under a lock.
- PTRACE_FREEZE: it “attaches” (with some locks) by setting some flags, performing security checks, inserting it into a linked list, sending it a signal, updating the scheduler, & continuing with CPU-specific code.
- More locks & checks
- CPU-specific code…
On x86, Linux continues by checking which subsystemcall was made:
- SETFPXREGS et al.: all refer to the scheduler’s register table for the debugged process.
- PTRACE_POKEUSR: may involve setting hardware breakpoints with globals.
- PTRACE_GET/SET_THREAD_AREA: uses normal kernel routines (which now include Meltdown/Spectre mitigations) to copy data between the registers.
- I’m unclear what’s…
All other ptrace() subsystemcalls return to being largely non-CPU-specific:
- TEXT: uses slightly special routines for copying data between the debugging & debugged processes.
- PTRACE_SETOPTIONS: updates specified flags if permitted.
- PTRACE_GETEVENTMSG: returns a field previously set on the debugged process.
- PEEK_SIGINFO: copies metadata from the debugger & updates the scheduler.
- GET/SET_SIGINFO: copies it to/from the debugger.
- GET/SET_SIGMASK: retrieves/mutates the saved_sigmask/blocked property of the debugged.
- PTRACE_INTERRUPT: marks the debugged with the specified signal & tells the scheduler to wake it up for it to handle the UNIX signal.
- PTRACE_LISTEN: checks the siginfo, sets/checks bitflags, & wakes up the debugged under its lock.
- PTRACE_DETACH: starts with CPU-specific bits (focused on single-step debugging for x86) then sets a specified exitcode & sends appropriate notifications.
- KILL: sets appropriate hardware & software bitflags before waking the debugged.
- PTRACE_GET/SET_REGSET: copies the scheduler’s register table for the debugged to/from userspace.
- GET_SYSCALL_INFO: retrieves various CPU-specific properties.
- SECCOMP_GET_FILTER/METADATA: retrieves a specified index in an array for the debugged.
To summarize: Having access to the scheduler, which swaps the CPU’s state between the different “processes”, makes it trivial for Linux’s ptrace() syscall to inspect the state of a process, even if the exact structure being inspected is CPU-specific.
Breakpoints & instruction/block stepping, on the other hand, benefit from hardware assistance triggering system calls from the debugged.
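A minimal sketch of the PTRACE_TRACEME handshake from userspace, assuming a Linux system where ptrace() is permitted (sandboxes may deny it, which the child here reports via its exit code):

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* The child asks to be traced, stops itself, and the parent observes
 * the ptrace-stop before resuming it. Returns the child's exit code,
 * or -1 if it didn't exit normally. */
int trace_child(void) {
    pid_t pid = fork();
    if (pid == 0) {                       /* child: the "debugged" process */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(2);                     /* ptrace unavailable here */
        raise(SIGSTOP);                   /* stop so the parent can inspect us */
        _exit(0);
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSTOPPED(status)) {             /* child is frozen under our control */
        ptrace(PTRACE_CONT, pid, NULL, NULL);  /* resume without a signal */
        waitpid(pid, &status, 0);
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Between the two waitpid() calls a real debugger would PEEK/POKE memory & registers, set breakpoints, or single-step.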
To evaluate literal arithmetic at compiletime GCC, for some reason, uses arbitrary-precision arithmetic provided by libGMP, libMPC, & libMPFR. In some cases (global constants & preprocessing) this “constant propagation” is required; in others it goes far towards outputting concise machine code, because CPU opcodes can rarely embed more than one literal.
While it doesn’t use multiprecision arithmetic at runtime, using it at compiletime allows for greater reliability for metaprogrammers.
Multiprecision arithmetic basically implements the long-arithmetic notations you learnt in school. So does the “Arithmetic Logic Unit” circuitry in a CPU, but that can only handle a fixed number of bits. Software on the other hand can use arrays to store as many digits as it wants!
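A sketch of that schoolbook long addition over arrays of base-2³² “limbs” (GMP’s term for these digits), stored least-significant first, with the carry rippling upward just like pen & paper:

```c
#include <assert.h>
#include <stdint.h>

/* Adds two n-limb numbers digit by digit; each uint32_t is one base-2^32
 * digit. Returns the final carry out (0 or 1). */
uint32_t mp_add(uint32_t *result, const uint32_t *a, const uint32_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        result[i] = (uint32_t)sum;   /* low 32 bits become this digit */
        carry = sum >> 32;           /* the high bit carries into the next */
    }
    return (uint32_t)carry;
}
```

GMP’s MPN does exactly this shape of loop, just in per-CPU Assembly that uses the hardware carry flag directly.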
LibGMP’s MPN subcomponent micro-optimizes by implementing this in Assembly, rewritten for a wide variety of CPU architectures, to take advantage of bitflags & other CPU features not exposed directly to C.
The MPF subcomponent wraps MPN to store the array’s length alongside its pointer, and adds preprocessing to speed it up further. The sign of that size field is really the sign of the whole number, & an exponent is added: the exponent & “mantissa” are combined to create floating point numbers. It can also read/write textual notation.
The MPZ subcomponent adds resizability (but not exponents), & MPQ wraps two MPZ numbers to form a fraction.
GMP also bundles:
- A “mini” version implemented in pure C, at up to a 10× performance loss.
- C++ operator overloading.
- Reading/writing strings utilities.
- Pseudo-random number generation.
- demo scripts.
- Profiling & of course unit testing.
It’s worth noting MPF’s maths resembles Scientific Notation, & more closely how your CPU’s Floating Point Unit works. The difference here is you get to decide how many bits to use. Whilst MPQ can store any Rational number, unlike MPF or your CPU.
LibMPFR & LibMPC
LibMPFR (a separate project) wraps LibGMP’s MPN to implement its own scientific-notation-based floating point format separate from MPF, adding additional trigonometric operators. And LibMPC wraps LibMPFR to implement Complex numbers, composed of two subnumbers, one of which is conceptually multiplied by the square root of -1 before being added.
It would be interesting to discuss how trigonometry is implemented in-software, but I’ll do so elsewhere.
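The Complex-number layer is simple enough to sketch with plain doubles standing in for MPFR reals; multiplication just distributes & substitutes i² = -1, giving (a+bi)(c+di) = (ac-bd) + (ad+bc)i:

```c
#include <assert.h>

/* Complex numbers as a pair (re, im), where im is conceptually
 * multiplied by the square root of -1. */
struct cplx { double re, im; };

struct cplx cplx_mul(struct cplx x, struct cplx y) {
    struct cplx r;
    r.re = x.re * y.re - x.im * y.im;  /* ac - bd: the i*i term flips sign */
    r.im = x.re * y.im + x.im * y.re;  /* ad + bc */
    return r;
}
```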
I’ve recently discussed GNU ASsembler which converts from a textual representation to the “machine code” your CPU’s Control Unit directly understands.
But these “Assembly Languages” require you to rewrite the same code for each CPU architecture you want to support. And to decide in which register, stack slot, or other pointer offset to store every single individual piece of data; including function arguments & returns.
GNU Compiler collection (GCC) provides a better abstraction.
I’ll focus exclusively on GCC’s C compiler, although GCC itself is written in C++ & it also bundles Ada, HSAIL, D, Fortran, Go, & Objective-C/C++ compilers.
After multi-pass parsing of the commandline arguments, GCC through its TopLev C++ class initializes localization, diagnostics (with colours & URLs), OS-specific behaviour, exit handling, the callstack, & allocation.
Further configuration is read in from specified files & environment variables. Unrecognized commandline arguments are reported with possible autocompletion suggestions. It outputs & exits for help/info commands.
After all this initialization, some of which is language-specific, and some profiling it calls compile_file. Then it outputs some diagnostics, tidies up, & exits.
compile_file (ignoring the time profiling) starts by running the frontend’s parser, possibly outputting computed locations to stderr, before lowering the resulting Abstract Syntax Tree to the “gimple” cyclicly-linked-list SSA bytecode form as described yesterday. A post-parsing language hook is now called before the backend.
With the program (regardless of whether it’s written in C or not) lowered into GCC’s IL, it finishes sanitizing addresses, threads, & hardware addresses. And finishes compiling OpenMP (a feature I’m largely ignoring in these studies).
Then it outputs the shared constant pool, object blocks, finishes clone pairs, weak pointers, possibly DWARF frames, debugging info, indirect constants, & processes remaining external directives. And gives hints to the linker.
Language Hooks (for C)
The pipelining infrastructure in GCC’s driver class discussed the other day calls some “hooks” for the appropriate language (decided at compiletime), registered using C macros. I’ll be going over all the significant hooks for C in a rough approximation of the correct order, ignoring anything for C++. Most of the rest are for initialization!
Which I think I’ll start with LANG_HOOKS_PARSE_FILE, or I might push this off until tomorrow…
Unix represents “files”, for better or worse, as a sequence of bytes which may or may not represent (ASCII/UTF-8 encoded) text, which is easier for humans to read & edit. UNIX ideologues would say always use text, but I don’t see how, to the computer, there’s much difference.
To do any serious processing of this data its richer structure must first be “parsed” out. Which is definitely the case for converting C programs into Assembly & on into Machine Code.
All C-like languages first finalize that initialization.
GCC’s C parser is a handwritten “pushdown automaton”, loosely resembling how you might express C’s syntax in EBNF. The lexer is, with some postprocessing to e.g. add types to literals & concatenate consecutive string literals, bundled as a core part of the C PreProcessor. Though the parser occasionally interacts directly with its “pragmas”, possibly sourced from build config files.
This libcpp doesn’t appear to be bundled with the download I have…
The parser incorporates behind feature flags Objective-C, OpenACC, OpenMP, and a little C++ at minimal additional complexity.
GCC provides a generic binary tree for the parser to use for representing expressions & types, as well as identifier lookup & pretty error reporting.
The parser calls into a suite of Builder (OO design pattern) functions for resolving, validating, & constructing declarations & statements during parsing.
Not only are identifiers resolved during parsing (rather than in a separate compilation pass like, say, Haskell’s) but so is typechecking! This allows the parser to desugar alignof() into literal values.
C makes it easy to typecheck each individual expression independently of all others, just referring to the local or global identifiers.
The code for building, validating, & desugaring the AST for statements & expressions is alongside the typechecking code.
The expression AST builders largely serve to make C’s implicit type coercions explicit, resolve pointer offsets from e.g. struct fields, perform the same operation over every field of an enum, and to partially resolve the &reference operator, which is meaningless in machine code. Assignment operators infer a dereference operator on the left-hand-side, which must be a valid “lvalue”.
The statement AST builders largely serve to flatten C’s “structured” control flow.
Already quite close to machine code!
Lexing & Preprocessing
When programming in a relatively rudimentary language like C it may be useful to be able to abstract away the syntax using “macros”. Even if there’s plenty of gotchas with C’s preprocessor…
When parsing a text file it may be useful to first “lex” the text into a sequence of consecutive “tokens”, between which the parser can extract relationships.
In GCC’s C compiler these tasks are one & the same!
This “LibCPP” lexer starts with a scanner class converting a stack of files in any text encoding to a sequence of UTF-8 characters. At a first pass (primarily for the preprocessor component) it uses CPU vector instructions to rapidly split the file into lines, before dispatching a function for enacting the #directive or lexing the next character.
The directives may in turn dynamically dispatch to the appropriate “pragma” callback.
The directives may also interpret boolean expressions (entirely separate from GCC’s constant propagation optimization!), often to manipulate a conditions stack instructing the scanner whether to skip lines. Or they might add “macros” to a hashtable. There’s also error reporting infrastructure.
These macros are resolved (with any arguments themselves becoming macros) in a separate pass over the lexer I’ve already described.
Yes, LibCPP implements its own hashtables…
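To illustrate what “lexing into tokens” means, here’s a toy sketch (nothing like libcpp’s vectorized, encoding-aware scanner) that scans off one token & reports how much input it consumed:

```c
#include <assert.h>
#include <ctype.h>

/* Token kinds for a toy C-ish lexer. */
enum tok { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

/* Scans one token from src (skipping leading whitespace), writes its kind
 * to *type, and returns how many characters were consumed in total. */
int lex_one(const char *src, enum tok *type) {
    int i = 0;
    while (isspace((unsigned char)src[i])) i++;      /* skip whitespace */
    if (src[i] == '\0') { *type = TOK_EOF; return i; }
    if (isalpha((unsigned char)src[i]) || src[i] == '_') {
        *type = TOK_IDENT;                           /* identifier/keyword */
        while (isalnum((unsigned char)src[i]) || src[i] == '_') i++;
    } else if (isdigit((unsigned char)src[i])) {
        *type = TOK_NUMBER;                          /* integer literal */
        while (isdigit((unsigned char)src[i])) i++;
    } else {
        *type = TOK_PUNCT;                           /* single punctuator */
        i++;
    }
    return i;
}
```

A parser would call this in a loop, building its tree from the token kinds rather than raw characters.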
I would argue that, while it may also do additional optimizations, a C compiler’s main job is to save you from deciding where in memory every minuscule piece of data belongs. Or in which register, but that’s a topic for another day.
C allows you to declare structs, enums, & unions which’ll be lowered to pointer offsets & constants at parsetime. Or to introspect the memory sizes & alignment of those custom types using sizeof() & alignof().
To parse sizeof() & alignof() to a literal integer, the parser calls c_sizeof_or_alignof_type() which’ll examine the type’s AST.
A function’s size is size_one_node & its alignment is FUNCTION_BOUNDARY / BITS_PER_UNIT. void & errors have a size & alignment of size_one_node. These types, & incomplete types, throw warnings/errors.
These constants & the following verification code are CPU-specific.
To compute the size of other user-defined types, after handling common cases & caching, it calls a bi-recursive routine that switches upon the type of expression, which can also handle maths within array sizes.
The code to compute alignments is platform-specific & incorporated into the sizing code.
Memory offsets are stored in the AST for structs whilst parsing them, computed from this sizing & alignment info, to be looked up when one of its fields is accessed.
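A concrete illustration of those computed offsets, using the standard offsetof() & C11 alignof(): the compiler inserts padding after `c` so `i` lands on an int-aligned boundary, which is exactly the computation baked into the AST:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* `i` cannot start at byte 1: int must sit on an int-aligned boundary,
 * so padding follows `c`. Field offsets fall out of size + alignment. */
struct padded {
    char c;   /* offset 0, size 1 */
    int  i;   /* offset alignof(int), preceded by padding */
};
```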
Enums meanwhile parse to constants.
It may make certain compiler optimizations more obvious once you’ve established a Static Single Assignment (SSA) invariant: that any variable can only be assigned once, when it’s initialized. In other words, functional languages are already SSA!
It may make other optimizations more obvious once you model the reality that every intermediate value in an expression also needs to be placed somewhere.
This is what the “gimplification” pass is for! Beyond what the parser has already desugared.
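A hand-lowered sketch of what gimplification produces: three-address statements where every intermediate value gets its own single-assignment temporary (the `_N` names in the comments mimic GIMPLE’s temporaries; the C here is just an illustration of the shape):

```c
#include <assert.h>

/* The nested source expression... */
int source_form(int a, int b) {
    return (a + b) * (a - b);
}

/* ...hand-lowered to SSA-style three-address form: each temporary is
 * assigned exactly once, so every intermediate value has a home. */
int gimple_form(int a, int b) {
    int t1 = a + b;    /* _1 = a + b; */
    int t2 = a - b;    /* _2 = a - b; */
    int t3 = t1 * t2;  /* _3 = _1 * _2; */
    return t3;         /* return _3; */
}
```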
The first step of converting a function to SSA is to consider its arguments: which arguments can be stored in registers? How about its return value?
If all the relevant flags are set, it’ll create a hashtable to use for address sanitization. Then it tackles the body!
Once it’s (re)initialized all the relevant datastructures for this function body, it incorporates the arguments into the SSA tree taking types & the calling convention into account, though I don’t fully comprehend this part.
The next step is to convert the function’s body, so to convert a statement or expression to SSA it uses a recursive routine (with validation) switching over all the various different operators. Some of which may be language-specific.
There’s a bit of a different process for lowering booleans, because bools aren’t native to machine code: it’s just control flow & conditions.
The few remaining &reference operators are lowered away.
It tracks any sideeffects affecting this lowering.
There’s another recursive subroutine for deciding how much memory to allocate on the callstack. Address sanitization if enabled is tracked whenever lowering a variable declaration. Where possible variables are inlined into the SSA tree.
Structured control flow is further lowered to GOTOs.
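A hand-written illustration of that lowering: the while loop & its goto translation compute the same sum, but the second form only uses the conditional & unconditional jumps that machine code actually has:

```c
#include <assert.h>

/* Structured form: the compiler sees a while loop... */
int sum_structured(int n) {
    int total = 0;
    while (n > 0) {
        total += n;
        n--;
    }
    return total;
}

/* ...and flattens it to labels & gotos, roughly like this. */
int sum_lowered(int n) {
    int total = 0;
loop_test:
    if (!(n > 0)) goto loop_exit;  /* inverted condition jumps out */
    total += n;
    n--;
    goto loop_test;                /* unconditional back-edge */
loop_exit:
    return total;
}
```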
cases are associated with their containing switch.
New variables are allocated throughout this process, some of which might be void’d.
The C-specific hook handles lowering shifts, declaration warnings, & pre/post increment/decrement specially.
There’s plenty of post-processing… It outputs a cyclicly linked-list of variable-sized opcodes which may include labels, including creating new variables.
And upon compiling a function body adds additional wrapping SSA ops, or rearranges existing ones. Maybe there’s callee-copy arguments?
The function lowering may add additional processing for profiling & address sanitization.
Other C Language Hooks
Looking over the next bundle of C compiler GCC “language hooks”, FINISH_INCOMPLETE_DECL addresses some corner cases.
WARN_UNUSED_GLOBAL_DECL checks whether it should print a warning, PRINT_IDENTIFIER makes the appropriate printf() calls, TYPES_COMPATIBLE_P calls into the typechecking system to determine whether given types can be coerced, MISSING_NORETURN_OK_P checks for special & legacy syntactic cases for main(), BUILTIN_FUNCTIONS[_EXT_SCOPE] special cases internal functions.
I’m not quite clear what SIMULATE_BUILTIN_FUNCTION_DECL does beyond adding implicit declarations. There’s a table of callbacks for all the different “attributes” you can attach to a declaration, under language hooks COMMON/FORMAT_ATTRIBUTE_TABLE.
TREE_DUMP_DUMP_TREE_FN handles C bitfields specially. SIMULATE_ENUM_DECL calls the same functions for parsing an enum.
TYPE_FOR_MODE does numerous checks to return an appropriate type, as does TYPE_FOR_SIZE.
INCOMPLETE_TYPE_ERROR runs the appropriate error_at() calls to report an incompletely specified type. TYPE_PROMOTES_TO returns an appropriate type for the input one.
REGISTER_BUILTIN_TYPE registers implicit type declarations.
TO_TARGET_CHARSET converts between character encodings via LibCPP.
EXPR_TO_DECL conditionally wraps an expression in a global variable. GET_DECLS is a noop for C. TREE_INLINING_VAR_MOD_TYPE_P checks for unsized types.
Then there’s OpenMP…
With your program parsed into a cyclicly-linked-list SSA variable-sized bytecode form, GCC finalizes some “sanitization” after the parser’s start.
For memory addresses this involves constructing “shadow pointer” types if missing, conditionally adding some calls to builtin functions, & constructing a global for those runtime checks to consult with enough for space for all relevant globals.
Thread sanitization imports those same functions calling the runtime checks at startup & shutdown.
Hardware address sanitization adds a different runtime initialization/finalization function call to the program.
All of these checks are entirely optional to compile, and probably wouldn’t be included in the first bootstrapping of an OS like Linux From Scratch due to their runtime requirements.
I’ll skim some of this code to see if there’s anything worth commenting upon when I upload this thread…
Compiling the code, as opposed to embedded data, is done conditionally by calling finalize_compilation_unit on the global symbol table.
After some profiling, this C++ method computes the calling convention (partially CPU-specific) & marks alias targets. Processing each previously marked aliased pointer involves considering whether we can remove it.
It also remodels the aliased pointers in the varpool or control graph subsystems.
After some debugging output it lowers the program’s Abstract Syntax Tree to the intermediate language, both before & after additional alias removal/remodelling. I think I’ve already described that, but I’ll double check.
After some more debugging info, validation, & setting hooks it’ll move on to the actual compilation. If there weren’t any errors it’ll run the IPA optimization passes & plugin hooks.
After running the IPA passes it may run some CPU-specific code & debugging output, before rerunning those IPA passes & debug hooks.
After more debugging/profiling, late IPA passes are run & it figures out which functions we want to output.
Then it gathers, sorts, & outputs each cgraph, varpool, or ASM node, whether defined or not, mostly using CPU-specific code. Or directly outputting that data or ASM. CGraph nodes also refer to the generic subsystem for LTO sections & its hashmaps.
CGraph nodes also run a SSA pass & apply those IPA & other optimization passes. After some optional warnings & tidyup it iterates over the CGraph to output any literal labels & CPU-specific Assembly instructions.
Varpools may also incorporate LTO sections.
It’ll then topologically sort all the functions & expand them separately if they haven’t already been. Followed by variables, as per above.
It handles any new alias pairs & iterates over newly-created functions to output/optimize.
It outputs any weak references to be linked later before finishing with debugging, validation, & freeing memory.
GCC runs a suite of extra checks & optimizations over your programs to help ensure it works optimally. This section will describe each of those passes in turn.
Warn On Unused Result
pass_warn_unused_result warns when calling functions marked with the warn_unused_result attribute without doing anything with the result value. Presumably largely used for error handling.
For each function this pass iterates over all its bytecodes, recursing into any BIND, TRY, CATCH, or EH_FILTER. (Are all of those relevant to C?)
On any relevant function calls it checks the output target stored in the opcode to decide whether to warn.
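For illustration, here’s how the attribute is declared (the function & its error-code convention are hypothetical). A bare `try_reserve(8);` that discards the result is what this pass would warn about, under -Wunused-result:

```c
#include <assert.h>

/* GCC warns if a caller ignores this function's return value,
 * which is typically an error code as sketched here. */
__attribute__((warn_unused_result))
int try_reserve(int capacity) {
    return capacity > 0 ? 0 : -1;   /* 0 on success, -1 on failure */
}
```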
Lowering Control Flow
Skipping over GCC optimization passes which aren’t relevant to plain C, the next on my list is pass_lower_cf, which in part combines multiple return statements into a single one per function.
It also lowers away lexical scope (possibly relevant…) & exception handling (irrelevant).
Expecting the parser to start each function with a BIND opcode, it recurses over the tree of those BIND opcodes, extracting vars & ensuring correct opcode ordering as it does so.
Within each BIND block it examines each statement’s opcode:
- BIND opcodes recurse
- COND/GOTO/SWITCH opcodes set a…
- Dead RETURN opcodes are removed, as determined by…
- Live RETURN opcodes set that cannot_fallthrough flag, & find or create a compatible return to GOTO
- TRY opcodes traverse both subsequences of ops, updating…
- EH_ELSE iterates over both subsequences of ops
- DEBUG ops are conditionally removed.
- NOP, ASM, ASSIGN, PREDICT, LABEL, EH_MUST_NOT_THROW, & OMP_* opcodes removed?
- All CALL ops set the “block” for all their arguments
- __builtin_setjmp CALL ops lower to conditional gotos where that can be done statically, sets…
- Adds conditional calls to __builtin_assume_aligned after calls to…
- Runs a “fold” operation for other builtin functions, peephole optimizing arguments.
- More ops for OpenMP or Transactional Memory.
- No other opcodes allowed!
With the bulk of the lowering done, it optionally validates that all DEBUG instructions have been removed. A return; instruction will be added if missing.
All RETURN opcodes located & replaced with GOTOs are moved to the end of the function, with the appropriate labels.
And all temp data involved in these tasks is cleaned up!
Control Flow Graph
Each “block” of C code can (if not for function pointers) transition to a statically-defined set of other blocks depending on control flow. This forms the “Control Flow Graph” (CFG). This analysis pass is performed by the pass_build_cfg C++ class & the build_gimple_cfg C function.
This CFG will be vital if we ever want to optimize the program!
build_gimple_cfg starts by initializing a new graph with the function’s codeblocks & their labels.
After optionally preprocessing DEBUG instructions GCC iterates every instruction in the function.
If it’s a CALL instruction it checks whether it should set or unset its control-flow-altering flag depending on whether it may trigger any exceptions. So seemingly irrelevant to C. I guess the CFG only considers within-function control flow.
It splits the cyclicly-linked list into the separate codeblocks, populating the labels mapping. Certain instructions start or end a codeblock.
GCC’s CFG analysis ensures there’s at least one (empty) block, trims the allocation of the codeblocks array, & removes duplicate labels, including for exception handling. The latter involves rewriting GOTOs, etc to not reference the labels about to be deleted.
switch statements (whose cases have previously been sorted) are handled specially in a separate subpass to collapse contiguous cases.
It then constructs hashtables when populating edges…
GCC creates an opening edge then iterates over each codeblock & the instructions therein, converting GOTO, RETURN, COND, SWITCH, exception handling, CALLs in case of exception handlers or no returns, ASSIGN in case of exceptions, any special ASM code, OpenMP & Transactional Memory into control flow edges in this Control Flow Graph.
It keeps a list of blocks entered via GOTO, CALL, or OpenMP & if there are any of the former it goes over the basic blocks again to find & correct abnormal edges.
A basic block is created to dispatch to the abnormal edges, though I’m not clear on what defines an edge as being abnormal.
build_gimple_cfg iterates over the basic blocks again to assign “discriminators” based on line numbers & a hashmap. And it finishes up by once again removing duplicate labels revealed by those subpasses.
Basic CFG Optimization
With the function body thus modified,
execute_build_gimple outputs some debugging, “cleans up” the CFG, & optimizes loops.
The cleanup subpass starts by optionally altering any loop headers to have only one entry edge, before traversing the tree in preorder looking for edges to remove & flagging each codeblock. Constants are allowed to propagate, simplifying control flow. GOTOs to GOTOs are merged into a single GOTO.
If necessary SSA invariants are then reestablished, & dominators recomputed.
It initializes a hashmap & bitmap to track case labels.
GCC’s CFG cleanup then iterates over the basic blocks to remove GOTOs to the next instruction, before traversing the control flow again re-applying both those same optimizations in case opportunities got revealed.
Any codeblocks not marked are garbage collected.
Next it locates or validates any loops in this Control Flow Graph. Which it’ll output for debugging & flag in the opcodes.
Presumably OpenMP looped ANNOTATE instructions are handled specially.
Invalid Return warnings
The next GCC (warning) pass in my list is
pass_warn_function_return. This iterates over every exit edge from the Control Flow Graph (as described yesterday) to ensure there are RETURN opcodes & it checks whether the presence of a return value matches the signature of the function.
Two separate loops are implemented for this behind different conditions. In a
noreturn function it lowers away any
return statements.
The C parser fully typechecks returns.
The next (warning) pass validates calls to alloca(), once the code’s in SSA form with a corresponding Control Flow Graph.
To do so it iterates over every codeblock & opcode within to find those alloca() calls. Depending on configuration this may be an unconditional warning. Or it may check whether the use of alloca() stays within fixed runtime limits: it’s not in a loop & its argument has a max or smallish constant value.
These warnings are nicely formatted for a good DX.
So far I’ve discussed use of SSA invariants & control flow graphs to aid optimization & additional typechecking. So today I’ll discuss how it extracts an explicit Call Graph from this data to further aid optimization.
To do so it iterates over each instruction of each codeblock skipping DEBUG instructions. If it’s a CALL opcode it’ll add an explicit or implicit edge to the callgraph. And there’s special handling for OpenMP.
Then it records references to every instruction & destructors.
Recording references from a statement involves iterating over its tree to mark which data is loaded & saved.
Each node (function) in this call graph has a hashmap storing all its edges to the other nodes. The nodes are populated one function at a time, like all other optimization/analysis/warning passes work.
The callgraph node to populate for each function is stored inside the IL declarations, created as needed when adding the edges.
Freeing Lang Data
The next GCC optimization/analysis/warning pass removes fields required by the frontend parser & (C-irrelevant) lowering of exception handling. They’re not needed anymore. Though I don’t find such deallocation very interesting, so I’ll otherwise skip this pass!
This optimization pass reduces duplicate calls to
strlen() arising from use of C’s string library, amounting to the dominant C compiler admitting nil-terminated strings were a poor design choice.
First this pass computes the dominators for each block & possibly locates loops to initialize their “SCalar EVolution” checks, before conditionally initializing a hashmap & arrays for this pass’s analysis. From there it walks the dominators tree, outputs optional debugging info, & deallocates its temp data.
Upon entering a codeblock this tree traversal iterates over its instructions looking for relevant ones.
The first iteration over instructions upon entering a codeblock invalidates the previously runtime-computed lengths for modified strings. The second simplifies relevant PHI instructions from the SSA pass. The final iteration applies the optimization before storing aside the altered state.
Upon leaving a codeblock it clears this stored state.
For str[n]len() it looks up the constant or variable size for the string regardless of intermediate assignments. Or stores the result in the map.
Calls to functions annotated with
alloc_size store a constant length in that mapping.
strchr() populates any missing second argument to hold an end pointer.
stpcpy() & their
_chk variants are converted if possible to
memcpy(), & the length of the 1st arg is made the same as the 2nd’s.
strncpy() & their
_chk variants issue warnings for out-of-bounds access.
mem[p]cpy[_chk]() checks for string copying.
strcat() adds the length of the second argument to the first’s.
calloc() stores the allocated length as corresponding to that string.
memset(_, 0, _) removes duplicates (arising from
calloc() calls) & sets the string length to 0.
memcmp() on tiny arrays is converted into inlined comparison operators.
str[n]cmp() involves adding that
n arg, or determining whether it’s definitely equal or not.
It tracks string assignments &, as noted earlier, clobbering. Pointer equality can inform
str[n]cmp optimizations, so that’s tracked too.
The C standard library has a
printf() function (don’t worry, I’ll be discussing GNU LibC shortly after GCC) which has its own type rules for safe usage. The variadic args must match what’s described in the format string. Traditionally these additional rules were not typechecked by the compiler, but now GCC provides a warning pass for
printf()-like functions as determined by an annotation.
Arguably this should be more than a warning, but they don’t want to break old programs.
This starts, if one’s lacking, by initializing a lookup table to decode the target’s charset, via its CPU-specific dispatch table.
printf()-like optimizations spend some effort locating
sprintf()’s destination argument to get at its size for all the previous optimizations, or compute the possible size range.
Then it locates the format string before doing the raw check & length computation on the extracted values. Before applying some simplifications & returning.
For small-bounded text formatting calls it’s faster to use the return value rather than a pointer arg. Or use
strcpy() rather than
To check a format string & compute the possible lengths of the result it iterates over & parses (using
strchr()) every “directive” in the literal format string.
For each it calls a method on the parsed directive, counts how much literal text to add, validates the computed range for the segment, issues common (e.g. NULL) warnings, sums them into the full length range (partially based on likelihood), & checks how much space remains in the output buffer, if any, within a generous limit.
Literal text in the format string is parsed to a
format_plain directive of that length.
”%%” parses to a
format_percent directive of length 1.
Upper or lower “%a”, “%e”, “%f”, or “%g” parses to a
format_floating directive with a length taken ideally from the precision parameters.
“%d”, “%i”, “%o”, “%u”, “%x”, or “%X” parses to a
format_integer directive, similar to but simpler than format_floating.
“%p” disables further checks.
“%n” parses to a
format_none directive of length 0.
“%c” parses to a
format_character directive of length 1, adjusted for width or precision parameters. %C parses to a
format_character directive which checks the range of its unicode input to determine whether it’s 0, 1, or 2 bytes long, at which likelihood.
“%s” or “%S” parses to a
format_string directive which gets its length from that of the passed string using all the previous measures.
And of course, as it iterates, these issue warnings.
Warn Non-NULL Compare
The next (warning) GCC pass iterates every function’s arguments to check whether any of them are annotated
nonnull & are compared against NULL regardless. As that’s checked at compiletime.
Where yesterday I discussed how GCC optimizes away C’s poor design choice of nil-terminated strings, this pass extends the language to address the poor design of C’s NULL.
In C you can declare a variable without assigning it a value. If you then read the variable before writing to it that is Undefined Behaviour. Single Static Assignment makes it easy to warn C devs whenever this might crop up!
And most languages now upgrade this warning to an outright error.
With the dominators computed, it iterates over every codeblock in the function & it’s instructions skipping DEBUG ops. A first pass checks all
SSA_NAMES referenced by the instruction.
After thusly issuing any warnings for instruction parameters it increments the number of vdefs if relevant & handles CALL & ASSIGN ops specially.
For CALLs it also checks if there’s any references to undefined memory passed to constant or specially-annotated pointers. The checks on ASSIGN ops are quite nuanced but also appears to relate to &referencing undefined memory.
A fuller pass will run later in GCC’s optimization pipeline!
Undefined Behaviour Sanitization
In C there’s various edge cases which the standards explicitly do not specify. This is the dreaded “Undefined Behaviour” which is generally less than useful, and widely avoided. As such GCC can warn you whenever it compiles such code.
So GCC allows you to optionally insert, via the
pass_ubsan pass, a suite of runtime checks to throw C++ exceptions.
It iterates over each instruction in each codeblock.
DEBUG or clobbering instructions are skipped. Maths operations are optionally replaced with library calls to handle signed integer overflow. Non-null pointer dereferences are optionally augmented with NULL checks. Enums & non-bitfield bools are optionally annotated with checks that their encoded values are in-range, which is 0 or 1 for bools.
Earlier I mentioned a pass warning against checking if
nonnull-annotated function parameters are NULL. This pass optionally checks against passing NULL.
The arguments passed to the clz() & ctz() builtin functions are optionally annotated with validation checks. It optionally adds checks that NULL isn’t returned from non-NULL functions. Pointers are optionally validated at runtime to be within the bounds of valid memory. And pointer arithmetic is optionally annotated with functions to check the pointers remain valid.
A runtime C++ exception is thrown upon failing these checks. Control-flow edges for other exceptions are removed per codeblock.
To guide hueristics for later optimizations, GCC includes a pass to compute useful metrics for each function. Like (according to the file’s own summary):
- Function body size
- Size after “specializing” into certain contexts
- Average execution time for certain contexts
- Function frame size
And for each call:
- Caller codesize
- Parameter change frequency
This “context” refers to which parameters are constant & whether the call is inlined or not.
It computes the summary for a given function, once it has been (re)allocated, by first computing the stackframe size. This “stackframe” is where local variables (not moved out into CPU registers) are stored in memory, so how much memory do we need to allocate from the callstack?
To do so it iterates over every variable declaration associated with the function, summing the size of every non-register non-global var taking into account alignment. The result is scaled to a distribution.
If I’m reading the code correctly computing code size & timing is cached. If it isn’t yet it reestablishes Single Static Assignment, checks inlining annotations, & analyzes the function body.
That analysis involves optionally computing dominators before summing the CPU-specifically scaled times/sizes for each instruction in the appropriate tablerow for the different contexts, adding some semi-constant overhead for calling the function.
That context is reconsidered for each codeblock.
It’s often a useful optimization to replace a function call with that function’s “body”, as that frequently reveals more optimizations! The initial “inlining” pass I’ll describe today doesn’t use the analysis described yesterday, instead handling all the more obvious cases with simpler checks.
C++ abstractions often yields these obvious cases, but they can occur in C too.
This pass expects the instructions to be in SSA form, which will need to be reestablished after it runs.
After checking for errors it considers whether it should inline.
Any non-recursive functions annotated “always_inline” are always inlined, anything else is never inlined into an “always_inline”-annotated function. “flatten”-annotated functions recursively inline all their non-cyclic callees. Then a couple of extra iterations inline any small or “inline”-annotated functions, according to any function summaries described yesterday. It’s during these iterations the actual inlining happens!
These checks update the relevant Call Graph edges to mark them for inlining & ensure they can actually be inlined regardless of other optimization options.
After each iteration of looking for small or “inline”-annotated functions it iterates over every codeblock/instruction to find the corresponding CALL instructions, split the containing codeblock around the call, & splice in the callee’s codeblocks. Turning arguments & returns into variables to be SSA’d.
A common issue in C programs is when they read from or write to memory outside of the data they intend to. This is trivially overlooked when skimming code, and is the source of real security vulnerabilities like Heartbleed.
To help C devs address this GCC provides the
__builtin_object_size() function, which is lowered in the next pass. This seems to resemble
sizeof(), but runs later in the compilation pipeline. Perhaps so it can be used by sanitization?
To do so it iterates over every codeblock & instruction therein looking for those
__builtin_object_size() CALL instructions.
The gotcha for GCC in lowering these CALLs is that the type system has already been lowered away at parse-time, and as such it needs to go recursively searching for all the type sizes to sum. So it refers heavily to a cache of computed sizes.
If I understand this chunk of code correctly, the actual types are on the SSA_NAME instructions - and propagated through the other instructions.
The resulting size is converted into a literal int.
Constant Propagation Bitflags
Sometimes programs have mathematical expressions where every number involved is a constant value. Maybe because it makes the intent clearer, or maybe as a consequence of other optimizations. In either case it’s optimal to do this computation at compiletime rather than runtime.
In part because it’ll always be the same anyways so why bother recomputing? In part because minimizing constants yields more compact machine code.
Sometimes these constants propagate between control flow branches.
To propagate these constants regardless of control flow,
pass_ccp with dominators computed splits this task between initialization, propagation, & finalization, where it sets computed values & unsets appropriate LOOP_CLOSED_SSA flags.
To initialize it iterates over each codeblock & contained instruction to determine whether they’re definitely varying. If one is, it propagates that status to all its immediate outputs. Then it iterates over the codeblocks to handle PHI instructions.
The propagation phase seems to be little more than an SSA graph traversal using a bitmask as the “fringe” to store which edges still need to be processed. As it does so it propagates the previously-set flags, taking special care around PHI instructions.
Constant Propagation & Peephole Optimizations
Continuing with the story of how GCC implements constant propagation, yesterday I described how it flags which values are definitely varying. Today I’ll discuss forward propagation.
This implements “peephole” optimizations to combine a handful of connected instructions into a more efficient alternative.
It creates a lattice to track SSA_NAMEs (it wasn’t SSA’d after inlining yet?) & postorder traverses the codeblock tree iterating over their instructions.
The first instruction iteration looks for PHI instructions to discard. To be replaced with a value to propagate via the lattice.
The second iteration applies forward propagation of assignment statements, taking special care around pointers, complex numbers, & some C-irrelevant types.
The third iteration attempts to merge each instruction going forwards with their arguments whilst retrieving propagated values from the lattice.
For pointer assignments this GCC pass reconsiders the flags for whether this expression is constant or varying, and whether it has sideeffects.
In this iteration it’ll repeatedly apply 2 subpasses to replace the instruction & its arguments with more optimal alternatives. The first normalizes any pointers, then applies constant propagation, then handles ASSIGN/CALL/ASM/DEBUG/RETURN ops, reconsidering the flags if anything changed.
The second handles ASSIGN of CONDs, SWITCH, COND, or CALL.
It then updates the lattice.
The third iteration updates destination PHI instructions. PHI instructions, btw, represent a value which may come from multiple control flow blocks, thereby making asserting the Single-Static-Assignment invariant easier.
Trailing loops remove instructions which were marked as replaced via a bitmask, or are now dead due to rearranged no-return function calls.
Like all computer data, programs are stored as a sequence of bytes. At least if your computer pretends to follow the Von Neumann architecture, which they all do. So what’s the best order in which to output these bytes? That’s what
pass_early_thread_jumps estimates, if I understand this right.
With a loop optimizer initialized it hands the work off to a new
thread_jumps object. It selects a next codeblock for each one with nontrivial exits, then applies the results.
For each codeblock with a nontrivial number of exits, it examines the final/branch instruction selecting which branch is taken at runtime. It looks backwards for where that condition is constant, making sure not to infinite loop over loops. Because that bumps the likelihood one branch is taken over another.
To extract the optimal ordering of codeblocks it examines the “paths” that computed, removing ones mentioning dead branches or ones (tracked via a bitmask) which are already “jump-threaded”.
It converts all the paths into a tree it can traverse in postorder. Loop headers are threaded specially to assume we stay in the loop before examining their loop bodies to select each of their codeblock’s most common exit target. This is done twice for each loop/function body, once without joiners & once with.
The function’s body is threaded whilst computing the postorder traversal; loops are all handled afterwards.
Scalar Replacement of Aggregates
As far as a CPU is concerned all data are numbers (or bitsequences), so GCC’s Scalar Replacement of Aggregates (early) pass deconstructs larger types into their scalar components. The C parser already does much of this, but I presume function arguments & returns are left for this pass to deal with.
A near identical pass will be run again amongst the final optimizations.
It starts by initializing its collections (bitmaps, hashtables, & a stack) for the current function, before looking for scalarization candidates in the arguments & declarations.
A valid scalarization candidate is:
- An aggregate type
- Doesn’t need to live in memory unless it’s constant
- Isn’t marked
- Is complete
- Has a fixed size
- Requires some bits
- All its fields also meet these conditions
- It doesn’t contain bitfields
va_list is only lowered in the later pass, not the initial one.
These variables are marked in a bitmask.
Then it iterates over all codeblock’s instructions attempting to decompose any RETURN, ASSIGN, CALL, or ASM instruction.
The next subpass examines all the candidate instructions that last subpass extracted into the collection types, checking all reads to see if the intermediate storage is actually needed. Which it’ll then proceed to remove.
It then iterates over the declarations again to apply further validation &, if that passes, flags their accesses. Then it checks some more invariants, some optionally.
The final subpass modifies those RETURN/ASSIGN/CALL/ASM instructions according to what’s in the collections.
Those instruction modifications recurse over the type to generate corresponding instructions to replace the “aggregated” one. Because this enlarges the codesize it may have previously discarded larger structs from this optimization.
Then it applies the same decomposition to read in the function arguments.
Profiling info may be written out for the GCC devs, and the collections are freed. Steps may be skipped if there’s no work to do.
Full Redundancy Elimination
TLDR: Iterating over a function’s codeblocks in postorder (with or without tracking cyclic edges), it hashes instructions to determine what to replace future instructions with. Unfortunately the Intermediate Language’s data representation does not make this easy.
A prime concern of compiled machine code is to minimize its filesize. Because the better the program fits in the CPU’s instruction cache the less it needs to resort to the slow process of fetching it from RAM.
Redundant instructions, whether arising from other optimization passes or because it reads nicer, are a prime offender. Which the Full Redundancy Elimination optimization pass addresses.
With dominators calculated & the loop optimizer optionally initialized it does its processing.
If these arguments are missing, this central function
do_rpo_vn finds the entry & exit points to analyze. It clears the
EDGE_DFS_BACK bitflag on all entry edges.
It depth-first iterates over the control flow graph to set some flags, gather a postorder indices array, & optionally gather cyclic edges (loops). This is separated into loops for locating the cycles & function/loop headers, shortcircuiting tidyups on trivial cases, locating exits, computing postorder, & cleanup.
If those iterations over the control flow edges yield no loops it’ll skip processing PHI instructions. It creates a mapping from the codeblock indices to their index in the postorder array. It updates in-region flags for each basic block & edge. It creates a few more collections for the next loop.
Which iterates in postorder to gather unwind state (C-irrelevant exception handling?) & unset codeblocks’ EXECUTABLE flags.
It optionally iterates over cycles to expose them better to later loops.
If it is optimizing loops it performs a main iteration over all codeblocks checking/updating the unwind state, validating EXECUTABLE flags, processing each basic block & their PHI instructions to validate them, & inserting unwinding instructions.
Otherwise it uses a greedy worklist to iterate over the basic blocks in postorder to validate them & determine in which order to process them.
A few final checks/loops output debug information for the GCC devs & free the collections.
To actually process a basic block it first checks if there are actually any PHIs, if the caller told us to process them.
Unless we’re handling cycles, it’ll consider recursively inserting the loop header into place.
It’ll optionally iterate over the PHI instructions to recursively compute a “value number” (hash?) for the expressions they reference.
It updates the edges’ EXECUTABLE flags. It iterates over instructions to choose other branches to examine & lookup what can replace this instruction.
SSA_NAMES, assignments, conditions, & calls are handled specially. There’s plenty of debugging info written for the GCC devs. If it’s made any changes there may be per-instruction tidyup for it to perform, possibly deferred to a later pass.
Failing to find another instruction to replace one with,
pass_fre will add its parameters to the collections of instructions which may replace later instructions.
So with dominators calculated & loops handled the
pass_early_vrp (Value Range Propagation) folds the appropriate “folder” selected by configuration over the current function.
The different expression folders prioritize local expressions or (using those computed dominators) control flow, possibly exclusively. Handing each relevant instruction or codeblock off to a
gimple_range_analyzer to interpret for the previously-computed max & min values. The extracted ranges are stored temporarily in a cache, or in an array of bitmaps indicating which vars hold equivalent numeric ranges.
A PHI instruction is introduced during enforcing Single Static Assignment (SSA) to indicate a value which may come from multiple conditional codeblocks, a lightweight way to model control flow without variable reassignment.
But during optimization these PHI instructions may be altered to degenerate forms. Maybe a PHI instruction only wraps a single value, or is only used as input to another PHI instruction. The
pass_merge_phi iterates over PHIs for each block to locate & remove these cases.
Dead Store Elimination
Developers will probably avoid this for clearer code, but this case can crop up as the result of optimizations: sometimes a variable is assigned to but never used. In which case it’s useless to compute that variable’s value, and the assignment should be removed.
So after reassigning identifiers, calculating dominators, & computing the postorder sequence
pass_dse (Dead Store Elimination) iterates over every codeblock & instruction therein.
If it finds any assignment or memcpy/strncpy/memset calls, it gathers an array of which byteslices of each variable are read & written. Dropping anything that’s obviously dead, e.g. assigning zeros to the result of
calloc(), or, looking forward in the instruction stream, less obviously dead.
It’ll also examine any SSA “definitions” or PHI nodes to see if dropping those stores has obviously removed the need for them.
It may then remove exception handling edges.
Dead Code Elimination
Of course variable assignments are far from the only instructions which might have no effect other than to slow the CPU down with busywork. So at various points throughout the GCC compilation pipeline, starting (optionally, for high-enough optimization levels) now,
pass_[cd_]dce essentially amounts to a garbage collector for computer code rather than data.
Starting with the usual dominators & loops calculations. Marking backedges for full collections.
The first thing a garbage collector does is to find the “roots” which are obviously alive, and it’s no different for
pass_dce. For this it iterates over all codeblocks and their PHIs & other instructions. PHI/PREDICT/LABEL/DEBUGs aren’t inherently necessary, ASM/RESX/RETURN/GOTO are. As well as volatile or global mutations, exception handling, & in full collections infinite loops. CALL is more nuanced, whilst control flow is treated as necessary in partial collections.
The next step of Garbage Collection (GC) or Dead Code Elimination (DCE) is to traverse the graph of objects/instructions to mark anything referenced directly or indirectly by a root. Which for GCC requires treating CALLs (especially to
free(), etc), ASSIGN, CAST, & transactions specially.
Now that we know what to keep, it can iterate through freeing anything else. And removing any resulting degenerate instructions.
Then it tidies up after itself & optionally outputs profiling info.
Control Flow Simplification
One technique for lowering away PHI nodes arising from enforcing Single Static Assignment is to duplicate the codeblocks where they occur for each of their possible values. This helps to reveal more opportunities for inlining, but can also increase the codesize which is bad for the instruction cache.
So with dominance info, possibly the (hash)set of nontrapping memory accesses, & the order to process codeblocks computed
pass_phiopt iterates over every “nonfixed” codeblock.
Codeblocks ending in a GOTO, with abnormal edges, where the source or target have NULL successors, etc, etc are discarded.
Where there’s no more than two branch targets it optionally attempts a “cond store replacement”. If there’s only assignment on a certain branch & there’s a PHI node which reads it, the trailing code after both branches is duplicated with those branches merged in.
Otherwise it checks all the PHI nodes to see if it can inline them away.
In this initial iteration over PHI nodes it looks for values (whose types don’t honour signed zeros) whose value has been checked for equality against a literal that it can inline. It specially handles optimizing
(x != 0) ? x + y : y to
x + y.
A second iteration finds a common PHI node across both branches that it can move out into the trailing code. Then it optimizes away various cases on conditionals, taking into account the possible numeric range of each variable.
Memory Access Analysis
The main bottleneck in a CPU, which motivates all their (vulnerable) optimizations, is memory accesses. As such GCC needs to optimize them away as much as possible, and for that it needs to gather some data. This is what GCC’s memory access analysis pass does.
To do so after examining some flags & (optionally) allocating a new summary for the function, it iterates over the function to determine whether the function can be analyzed.
During that iteration over codeblocks & their instructions it analyzes any arguments to each instruction to record any loads & stores into arrays in its summary, before handling ASM (which might disable this analysis) & CALLs (to get their summary) specially. Upon memory-clobbering ASM this analysis is discarded.
A second iteration incorporates the memory clobbering data from any recursive calls, which may also lead to this analysis being discarded.
Behind the scenes functions allocate “stack frames” into a memory region starting from near the top of (virtual) RAM. These stackframes are freed upon return. Though if a function returns the value of another call (a “tailcall”), there’s no need to keep the caller’s stack frame around. Instead it can have the callee directly return to its caller.
pass_tail_recursion checks the exit branches for these cases, flags those CALLs, & removes setup/teardown overhead from recursion.
Sometimes it’s more optimal to compile conditions using
switch (looking up the branch to take in a table), at other times it’s more optimal to use if/else branches.
pass_if_to_switch handles converting obviously relevant
if/else branches to switch in case that’s more optimal.
To do so it iterates over every basic block looking for those taken upon a trivial condition upon a common variable to add to a hashmap. If successful it’ll iterate again in postorder to validate what it found.
Having asserted what it found was a straight line of
if/else if/else conditions it restructures the bytecode, control flow graph, & PHI instructions to turn them into a single switch.
To convert a
switch branch into
if/else if/else branches where optimal GCC’s
pass_convert_switch iterates over the last instruction of all basic blocks. For each
switch it finds, with some debugging output, it:
- Sorts cases by value; this may reveal degenerate
switch branches with only one target.
- Computes upper & lower bounds for each set of labels, & counts how many comparisons would be required in an if/else chain.
- Discards cases better handled by other optimizations
- Discards cases with too large of a range, or if there are non-empty branches, or if it can’t handle the PHIs.
- It gathers loop targets as codeblocks rather than labels, & iterates over those branches to determine in which order to emit the comparisons.
- It stores the value being switched over, rewrites PHIs whilst emitting the comparisons, & emits the initial comparison to assert the value is in range.
It takes a relatively long time for a CPU to fetch the next batch of instructions from RAM to execute. So it fetches them ahead of time, at which point branches get in the way of this optimization. Runtime profiling can aid this, but for code the CPU’s never seen before (or has forgotten) the CPU requires the compiler’s guidance as to which of those branches it should prefetch.
pass_profile is the optimization which computes these CPU hints.
After validating this work hasn’t already been done, and initializing both loop optimizers (adjusting infinite loops to be handled correctly) & dominator info, GCC’s pass_profile optimization iterates over the codeblocks to determine which are obviously unlikely according to developer annotations.
Any codeblock only branching to obviously unlikely codeblocks is also unlikely.
A second iteration & propagation counts the successors of each codeblock, for a third edge-marking iteration.
With the unlikely codeblocks computed it iterates over all edges to check for return statements, propagating that as a likely (or unlikely, if it heuristically appears we’re returning an error) codepath.
It then iterates over the codeblocks & their instructions to propagate probabilities from annotations on called functions or the codeblocks themselves.
Each loop or recursion is handled next, marking any exits as unlikely & bailing if they’re absent. It estimates the number of iterations.
It computes a precise probability for each exit edge, and incorporates any heuristically detected loop variables into its probability calculations for remaining edges. It iterates over each edge &, according to previously determined methods, computes its likelihood.
It iterates over edges to predict function calls.
It iterates again to combine probabilities if a branch ended up with multiple.
It cleans up & determines whether to recompute via a similar but different process.
Constant & Pure Functions
Often functions will only read from memory (besides the callstack) to compute their return value; GCC calls these “pure” functions, or “const” functions when those reads are restricted to their arguments. This makes code more maintainable, but it also enables greater optimization. So GCC has an analysis,
pass_local_pure_const, to detect these cases.
After checking whether to skip this function (no recursion, has direct calls) it examines the function, reports results, & reconstructs control flow.
After initializing a closure it iterates over every codeblock & instruction therein skipping DEBUGs.
If the instruction operates upon volatile globals it’s neither constant nor pure.
It looks over any memory it modifies, skipping statics or locals, to find volatiles (again), used annotations (neither constant nor pure), optionally all other writes (neither constant nor pure), & constant memory (pure).
During tidyup it detects infinite loops & handles malloc(), etc. specially.
Sometimes functions have a fast codepath & a slow codepath. These are often explicitly split in the header files you import so the first codepath (and cheap condition) can be inlined, but if not GCC can split these two codepaths itself in the pass_split_functions optimization.
Skipping noreturn, malloc, main(), unlikely, non-inlinable, versioned, (C-irrelevant) nested, single-direct-caller, non-inlined or noinline, & section functions it starts by finding the back edges of loops.
After initializing some collections GCC’s
pass_split_functions iterates over codeblocks examining their probabilities & the instructions therein, calculating the codesize for each block. It then depth-first-searches the control flow tree (backedges were previously removed) to consider where it can split & where it should.
If it found any candidates it adds the new function, adjusts the callgraph, & replaces the slow codepath with its call.
GCC has builtin functions for marking which codepaths are more or less likely, which I believe GNU LibC uses heavily. There are also opcodes in the intermediate language to do the same thing. The next (lowering) pass removes these, as their feedback has already been incorporated into their codeblock’s probabilities.
pass_release_ssa_names iterates over all SSA_NAMEs in a function & allocates each a new identifier.
There’s a pair of optimization passes which clones functions annotated with multiple targets, and chooses the correct clone for each of its call sites (with inlining).
Within any program lives data; not all data comes from external input. This data is documentation, part of the data modelling, and/or decoration. It may be numbers, text, images, audio, or basically anything else, though you may want to use different tools like GIO’s GResource or Haskell’s file-embed to embed image, sound, etc. files.
During GCC’s compilation this data needs to be optionally optimized & written out separately.
The first step of either optimizing or compiling constants is to serialize them out from the Abstract Syntax Tree into a bytearray, which is done using a recursive function. This may involve normalizing endianness & rounding floats.
If there were any constants to place in the bytearray the optional optimization pass will then sort the list of its slices by decreasing size then hash, which it then iterates over to remove any duplication into a new bytearray with hashmap help.
That optional deduplicating optimization alters the AST for the output logic, which then iterates over the constant pool again. This checks whether each constant needs to be outputted again (a couple of different cases), where it should do so, & writes it out with its labelling to Assembly (not Machine Code). It also special-cases constants in an “object block”, but I don’t think that’s relevant for C.
Once it has a place (if any) to put the constant it uses a different serializer, writing the bytearray out as text.
To output unstructured data it iterates over every “block” of code in a consistent order. For each it starts by declaring a new section & adding alignment bytes with any relevant labels.
It iterates over every object in the block, outputting additional constant pools separately.
Structured data may need to be outputted as constant or variable; variable blocks may need to be written out later (I suspect this isn’t a C thing). Both may optionally run code to add debugging function calls for sanitizing memory addresses.
In either case it uses a recursive function to traverse the Abstract Syntax Tree (why not IL?). This starts by checking if it’s a pointer value to output that specially.
After doing some minimal integer & boolean constant propagation it outputs this data to Assembly taking endianness into account.