SQLite3 Full Text Search 5

NOTE: This page is a recollection of what I've written previously. I do not guarantee it's accuracy.

You can free-text search your Odysseus browser history using a simple form on that page. This is powered by SQLite3’s builtin “FTS5” extension, so on this page I will be describing how SQLite3 FTS5 works.

FTS5 integrates into SQLite3 by way of it’s concept of “virtual tables”, which you can query like any other table but it runs arbitrary logic to:

  1. Create the virtual table
  2. Open the table (with optional arguments)
  3. Filter by column expression
  4. Iterate over results
  5. Add/remove rows

To start the provided searches, as passed in step 2 or 3, are parsed for each query execution, thereby allowing the same prepared statement to run different user-provided searches. This parsing is done using the “Lemon” parser generator used elsewhere by SQLite3 and outputs an Abstract Syntax Tree to be interpreted by dynamic dispatch. At this point the query terms are normalized according to the text encoding for splitting and some hueristics (the Porter Stemming algorithm) for detecting likely synonyms.

Then for step 4 FTS5 reads a binary-encoded sorted list of prefix-compressed and normalized “tokens” and feeds it through that AST for processing. So it can find the appropriate row to return this structure lists all rows for each token, all columns for each token/row, and all text offsets for each token/cell.

While it may need to refer to the raw text value being matched at times, for the most part the processing done by these AST nodes is very trivial (given NOT is an infix operator filtering results from it’s left-hand input). However the AND operator is where the fact these results are kept sorted is valuable, because there it tracks the alphabetically-highest value it has seen so far. With that it knows any text lower than that cannot be common across all the streams.

That binary data is stored compactly in a behind-the-scenes table, and if necessary so are the raw text values. There’s also a third table storing the configuration for this virtual table.

It optionally orders these results using postprocessing over the returned rows, using the BM25 hueristic by default.

To add some rows (which annoyingly must be done manually) to an FTS5 virtual table, a hashmap will be used for throttling and deduplicating that input after which it will be added as a new row indexed by it’s lowest and highest values. That range will later be used by querying to lookup the row(s) it should merge together as the input stream to the search query AST.

While that can restrict how much data is scanned for each search, it can also lead to innefficient access patterns when there’s too many rows that need to be merged. As such state is tracked in it’s own table row to guide FTS5 towards the rows that need to be merged on-disk.

Deletion works by finding the rows containing the necessary tokens, moving it’s data into that hashtable and removing it from disk, and processing it there by writing the updates as per normal.