NOTE: This page is a recollection of what I've written previously. I do not guarantee it's accuracy.
SQLite3, possibly the most widely deployed database, is used by Odysseus to persist all your browser data. It is also used in various other elementary apps, for example Music uses it to store your collection’s metadata and power “smart playlists”.
Integrated into Odysseus’s templating language is a SQLite3’s concept of “prepared statements”. This seperates the compilation/optimization of the query from it’s evaluation, in fact these are the primitives
sqlite3_exec() is built on.
To start your query is sent to a parser implemented via SQLite’s own “Lemon” parser generator, dispatching to directly to the code for the various SQL statements and clauses. The SQL statements will then compile themselves down to SQLite3’s “VDBE” bytecode, some much more readily then others.
Often this will involve looking up the tables to which the query refers. So the
CREATE TABLE statements will insert their own AST into a hashtable used for that purpose, and those queries will get persisted (and updated) in a special table stored at a known location in the file. Upon startup this table will be read and all the contained SQL will be parsed.
These (fixed-sized 5 argument) opcodes are written out into a resizable array. And to write branches callers will retrieve the current pointer offsets in order to store it in the necessary opcodes.
Once we have a prepared statement, we need to run it. So once input data is written to a sidetable via the
sqlite3_stmt_bind* functions, the query’s VDBE bytecode is interpreted by a massive switch statement in a for loop until the next YIELD op. From there a slice of the program’s virtual registers can be accessed directly via
These bytecodes are in many cases will rewrite themselves for optimizations, and in which case their memory is laid out to make this trivial. Whilst the data it operates upon are stored in sidetables each associated, addressed into by the operator arguments. The data in these tables have a dynamic type, which is used for automatic conversion and to read/write it to/from disk.
At it’s simplest,
SELECT compiles via an AST into a loop which tests each row of the table’s BTree (or some variant there upon) against the
WHERE condition and computes the data to be YIELDed. Though optimization (a topic deserving it’s own page) and some of it’s clauses makes this more complex. When it gets really complex,
SELECT uses some special “coroutine” opcodes to simplify.
The BTrees (a highly versatile and performant datastructure), recognizing that data is loaded from disk in fixed-size “pages”, stores as much (sorted) data as it can in each page. If the data isn’t present on the current page, it directs the iterator to the page containing data between any two items on that page. This makes it very fast to query all rows that satisfy any comparison operator on the key.
To read these pages the BTrees use a choice of platform-specific bindings via a “pager” object.
While it can be optimized away by SELECT’s optimizers,
ORDER BY will gather up all of the query’s results sorting it in chunks. Then before yielding them out it performs a merge sort, possibly utilizing multiple processes and temp files where the data is “big” enough.
Judging it’s bytecode interface, it seems reasonable to assume this used to be done using the same BTree implementation used heavily elsewhere.
These clauses maintain a counter to determine when to skip otherwise valid rows, or to stop.
INSERT compiles to code which checks the database’s invariants (as prescribed by the
CREATE TABLE expression) and computes data to be written to disk. The way it handles failures is configurable by the query. Writing that data involves not only serializing it according it’s types (which unique to SQLite is stored, as an integer, alongside the data itself), but also inserting it into the correct place in the BTree.
There are two tricky aspects to inserting this data: handling the case where the data overflows the page, and making sure the database in the event of an unexpected crash.
Handling page overflows
When a page is about to overflow, it is first split into two pages, with the middlemost entry being inserted into the parent. If that overflows the same process is repeated until we reach the root, in which case a new root is created. This process ensures reliable performance, as all leaves are at the same depth.
However this process requires us to allocate new pages from the SQLite database. To find them the database header references the first free page, and a small handful of those will be packed with references to the rest. And track which of these pages it’s currently allocating out of.
In case of an unexpected crash, SQLite will first write each relevant page to a logfile before updating it in memory. Only once all of those pages have been logged does it write out the changes. It’ll then tell the kernel to “flush” all changes to disk, and mark the operation as “committed” by truncating that logfile. So that if SQLite sees that logfile on startup it can restore the database to the last valid state by writing each of it’s pages to the specified offset.
Throughout this it handles the more common case of coordinating multiple transactions by taking a read/write lock on the entire database. This is done using the kernel’s “file advisory locks”, though on older systems (mostly to keep the file format backwards compatible) it can write these locks to the database file.