SearchMySite.Net

Michael I. Lewis's SearchMySite.Net is a curated search engine I use for finding additional interesting technical writing to link to. Though there are others like Wiby.Me & Marginalia.

SearchMySite.Net is built upon Apache Solr (thus in turn Apache Lucene), Scrapy, PostgreSQL, & Python's Flask web framework (the latter to build a now-opensource abstraction/UI layer I'm not interested in studying).

Apache Lucene

Apache Lucene is a Java library for full-text document search. While many relational databases like PostgreSQL & SQLite also support this, Lucene is more full-featured. For instance it supports a wide range of spoken languages including Arabic, Bulgarian, Bengali, Brazilian Portuguese, Catalan, Czech, Danish, German, etc, etc, etc.

As such I'd heavily recommend it, or one of its abstractions, for any search engine you might want to make. Unless search isn't your app's focus, in which case it might be easier to use e.g. Postgres's.

I won't discuss Lucene's extensive localization further (everything search-related I've read is English-centric, hence my heavy recommendation of Lucene or one of its abstractions); ask a linguist to explain it. Beyond saying the languages span a wide range of complexities (usually trivial), & that they involve parsing via a Java equivalent of Flex/Bison. Nor will I dig into the text-processing toolshed surrounding these localizations.

Instead I’ll start from the indexing (then search) demo commands.

Indexing

After parsing the commandline flags & validating that “-docs” was provided, the indexing demo passes that filepath to Lucene's FSDirectory class.

FSDirectory wraps a given filepath, which it opens via one of two other ByteBufferIndexInput wrappers (local or NIO filesystem access), alongside an atomic pendingDeletes hashset & two atomic counters: one counting operations since the last deletion (which determines when to delete those pending files) & another for naming temp files. It also has directory-traversal methods.
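
For reference, here's how such a directory typically gets opened (standard Lucene API; the path is just an example of mine). FSDirectory.open() picks the concrete subclass appropriate for the platform:

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class DirSketch {
        public static void main(String[] args) throws Exception {
            // FSDirectory.open() selects the best implementation for this
            // platform (e.g. memory-mapped vs. NIO file access).
            Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
            System.out.println(dir.getClass().getSimpleName());
            dir.close();
        }
    }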


It next constructs a StandardAnalyzer. StandardAnalyzer, which subclasses StopwordAnalyzerBase, pipelines a StandardTokenizer through a LowerCaseFilter & a StopFilter, bundled up as TokenStreamComponents.

StopwordAnalyzerBase stores a set of words, possibly parsed from a given file via WordlistLoader which parses a #-commented newline-separated file.

StandardTokenizer wraps a StandardTokenizerImpl (which is implemented with help from the JFlex lexer generator) with its own scanner & configuration. I'm not entirely clear what this lexing's for, but it primarily seems to serve to whitespace-separate words & determine which alphabet's used.

LowerCaseFilter, of course, converts all the words sent through it to lowercase via an internal utility iterating over the char arrays.

StopFilter implements its accept method to return whether the given query term is in the stopset by wrapping CharArraySet which in turn wraps CharArrayMap. CharArrayMap implements a simplistic hashmap around separate key (2D character array) & value arrays sized to powers of two, so it can use bitmasks rather than division. Hashes are computed by repeatedly multiplying the running hash by 31 & adding the next char.
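
Here's a rough illustration (my own sketch, not Lucene's actual code) of those two tricks: the power-of-two table size lets the slot be computed with a bitmask instead of a modulo, & the hash is a multiply-by-31-and-add rolling hash.

    public class CharHashSketch {
        // Compute the hashtable slot for a key, assuming tableSize is a power of two.
        static int slotFor(char[] key, int tableSize) {
            int hash = 0;
            for (char c : key) {
                hash = hash * 31 + c;       // multiply running hash by 31, add next char
            }
            return hash & (tableSize - 1);  // bitmask instead of hash % tableSize
        }

        public static void main(String[] args) {
            System.out.println(slotFor("the".toCharArray(), 16));
        }
    }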

I’m having trouble locating TokenStreamComponents, it’s probably just an output struct.
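
To make the pipeline concrete, here's a minimal example (standard Lucene API, not the demo's exact code; the field name & sample text are mine) running a string through StandardAnalyzer & printing the resulting tokens:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerSketch {
        public static void main(String[] args) throws Exception {
            try (Analyzer analyzer = new StandardAnalyzer()) {
                // Run a sample sentence through the tokenizer + filter pipeline.
                TokenStream ts = analyzer.tokenStream("contents", "The Quick Brown Fox");
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints each token lowercased (& stop-filtered, if a stopset is configured).
                    System.out.println(term.toString());
                }
                ts.end();
                ts.close();
            }
        }
    }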


Next the demo program constructs an IndexWriterConfig struct, which subclasses the LiveIndexWriterConfig struct, and sets some of its properties. I'm calling these “structs” because they contain little more than accessor methods.

Then it creates an IndexWriter, reads every file directly or indirectly in the specified directory, & loads them into that IndexWriter as dynamically-typed serializable Document structs.

And cleans up with time profiling & exception handling.
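
Roughly, the demo's indexing side fits together like this (a hedged sketch along the lines of Lucene's IndexFiles demo, not a verbatim copy; the paths & field names are mine):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class IndexSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                // One dynamically-typed Document per file: add whichever fields you like.
                Document doc = new Document();
                doc.add(new StringField("path", "/docs/hello.txt", Field.Store.YES));
                doc.add(new TextField("contents",
                        Files.readString(Paths.get("/docs/hello.txt")),
                        Field.Store.NO));
                // updateDocument replaces any previous version sharing the same "path" term.
                writer.updateDocument(new Term("path", "/docs/hello.txt"), doc);
            }
        }
    }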


Lucene can index on-disk Documents, which are dynamically-typed structures with whatever properties you want, via an IndexWriter.

To add or update each given doc (which the demo loads from the given dir) it first ensures the index file is still open, defers to DocWriter, processes all atomically-queued events under a semaphore, & handles merges. Handling merges involves validation with error messages, deferring to the configured MergePolicy, & deferring to a multithreaded MergeScheduler.

That MergeScheduler, after its state-tracking & validation, defers back to each given MergeSource, each now in a new thread.

To update a document Lucene's DocWriter first validates that the file is still open & checks for stalled threads or pending flushes; additional metadata is written out before each flush. Then it obtains a per-thread lock which, between tracking threading state, it defers to, before considering whether to flush again.

Each thread, between outputting optional logs, iterates over the given docs validating we aren't exceeding a document cap & deferring to an IndexingChain. After iterating over all given documents it enqueues a deletion of the previous doc version to populate a DeleteSlice, which it'll then .apply regardless.


The IndexingChain will first tell its TermsHash & (with exception handling) StoredFieldsConsumer to start the document before iterating over the fields three times, with exception handling.

The first iteration gathers all the fields' structural metadata into an inlined closed hashmap, validating there aren't too many & updating the FieldSchema.

The second iteration initializes the FieldInfo for each of those fields & the appropriate writer for its type.

The third actually writes out the field values using the classes initialized in the previous iteration.

Beneath this is a suite of allocator & serialization classes, with or without sorting/compression, wrapping StoredFieldsWriter.

There's essentially one shallow layer around the classes which write raw bytes for the fields.

The basic structure is a little unobvious from this perspective, but seems to resemble what SQLite FTS5 does: storing a prefix-compressed sorted list in each “page” (a hardware & filesystems concept) of persistent storage, mapping from search terms to sorted lists of matching documents.

When such a page is too small to hold everything, that's when the concurrent merge sort kicks in.
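
A toy illustration of prefix compression (my own sketch, not Lucene's actual encoding): each term in a sorted list is stored as the count of leading chars it shares with the previous term, plus its distinct suffix.

    import java.util.ArrayList;
    import java.util.List;

    public class PrefixCompressSketch {
        // Encode a sorted list of terms as (sharedPrefixLength, suffix) pairs.
        static List<String> compress(List<String> sortedTerms) {
            List<String> out = new ArrayList<>();
            String prev = "";
            for (String term : sortedTerms) {
                int shared = 0;
                int max = Math.min(prev.length(), term.length());
                while (shared < max && prev.charAt(shared) == term.charAt(shared)) shared++;
                out.add(shared + "|" + term.substring(shared));
                prev = term;
            }
            return out;
        }

        public static void main(String[] args) {
            // Terms must be sorted first, so shared prefixes sit next to each other.
            System.out.println(compress(List.of("search", "searches", "searching")));
            // Prints: [0|search, 6|es, 6|ing]
        }
    }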

Searching

Now that I've described how Apache Lucene generates its index file, how does it actually use that to speed up searching?

The first thing the demo program for this does is to initialize an IndexSearcher wrapping a given IndexReaderContext & its related classes. It further gathers a sorted array of “leaf slices” from the IndexReaderContext's .leaves() whilst grouping them into consistently-sized buckets.

It also implements a .search method with various wrappers/signatures.

One of those .search wrappers/signatures dispatches the result to methods on a CollectorManager, others provide their own CollectorManagers. But ultimately it defers to the appropriate LeafCollector (from the given Collector) & BulkScorer (from the given Weight). Possibly after repeatedly rewriting the given query until fixpoint, then incrementing hitcounts.

That rewrite can be exposed to the caller/enduser.

Next the demo program initializes the same StandardAnalyzer pipeline used during indexing, maybe a KnnVectorDict (which uses a Finite State Machine to map words to numeric IDs as parsed from a file), & a stream of input queries.

Then it creates a QueryParser (written in JavaCC) to extract the structure from the enduser's input text. It uses FastCharStream as a trivial lexer/scanner. Barely any syntax for endusers to learn…

For each whitespace-trimmed input line it parses the query, optionally resolves word IDs from that KnnVectorDict, optionally time-benchmarks the search a given number of times, & then performs the search for real with additional queries to determine pagination information.

After those iterations it closes the input stream & optionally the KnnVectorDict.
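
Stripped of the benchmarking & KNN options, the core of that loop looks something like this (standard Lucene API, not the demo's exact code; the index path & field names are mine):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class SearchSketch {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/tmp/lucene-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Same analyzer pipeline as at indexing time, so terms line up.
                QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
                Query query = parser.parse("quick +fox -dog");
                TopDocs hits = searcher.search(query, 10); // top 10 results
                for (ScoreDoc hit : hits.scoreDocs) {
                    Document doc = searcher.doc(hit.doc);
                    System.out.println(hit.score + "\t" + doc.get("path"));
                }
            }
        }
    }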


Lucene ultimately calls down into a Collector implementation to load the search results from the index, of which there's several trivial decorators around other implementations (to e.g. cache results) which I'm skipping. A few also implement Scorer to calculate heuristic rankings, which I'm also skipping.

Today I’ll discuss the more substantial of Lucene’s Collectors!

Often there's multiple possible interpretations of a search query, making it worth ensuring you return results for all of them.

This works by maintaining a hashmap of priority queues, with a separate overflow priority queue. The keys for this hashmap are determined elsewhere.
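
As a rough illustration of that shape (my own sketch, not Lucene's classes): a bounded priority queue per interpretation, keyed by whatever the surrounding logic decides, plus one overflow queue catching evicted hits.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    public class GroupedTopHitsSketch {
        static final int PER_GROUP = 10;
        // One bounded min-heap of scores per group key, plus an overflow queue.
        final Map<String, PriorityQueue<Double>> perGroup = new HashMap<>();
        final PriorityQueue<Double> overflow = new PriorityQueue<>();

        void collect(String groupKey, double score) {
            PriorityQueue<Double> queue =
                    perGroup.computeIfAbsent(groupKey, k -> new PriorityQueue<>());
            queue.offer(score);
            if (queue.size() > PER_GROUP) {
                // Evict this group's weakest hit into the overflow queue.
                overflow.offer(queue.poll());
            }
        }
    }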

The central Collector appears to be GlobalOrdinalsWithScoreCollector which uses a bitmask & 2 abstracted 2D arrays of scores/occurrences to underlie its .match() & .score() methods. It hands off, after loading a list of sorted docs, to its embedded OrdinalMapCollector or SegmentOrdinalCollector classes to gather those collections.

GlobalOrdinalsWithScoreCollector has several embedded concrete subclasses for Min, Max, Sum, Avg, or NoScore scorings.

There's also a couple of Collectors/Scorers which track a minimum competitive score to enforce.

TopFieldCollector wraps a MultiLeafFieldComparator (itself dispatching to type-appropriate comparators for the dynamic doc fields), enforces that min competitive score, gathers results, & maybe paginates them.

Finally, for today, there's a MultiCollector which wraps an array of other LeafCollectors, deferring to all its children's collect methods until they have no more results to yield.

Then there's SimpleCollector which defers to its subclasses' doSetNextReader method, which I'll discuss tomorrow.
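
For a feel of that interface, a minimal custom collector might look like this (the SimpleCollector API is real; the counting logic is just my example):

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.SimpleCollector;

    // Counts matching docs, remembering each segment's docBase so per-segment
    // doc IDs could be turned into global ones if needed.
    public class CountingCollector extends SimpleCollector {
        private int docBase;
        private int count;

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
            docBase = context.docBase; // called once per index segment ("leaf")
        }

        @Override
        public void collect(int doc) {
            count++; // `doc` is relative to the current leaf; global ID = docBase + doc
        }

        @Override
        public ScoreMode scoreMode() {
            return ScoreMode.COMPLETE_NO_SCORES; // we never ask for scores
        }

        public int getCount() { return count; }
    }

You'd hand an instance of such a class to IndexSearcher's .search(query, collector) signature.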


TopSuggestDocsCollector, a SimpleCollector subclass, wraps a priority queue, or a pending list & CharArraySet, to sort & deduplicate gathered results.

ValueSource provides its own embedded ValueSourceComparator subclass which defers to ValueSource subclasses (with embedded subclasses for floating point).

Lucene has an on-disk cache following the same format as Lucene's main index, from which it can retrieve a cache & query ID as well as a binary “Monitor Query”. There's a corresponding SimpleCollector to consult it.

Queries usually consist of more than one term, so there's a factory for SimpleCollector subclasses deferring to an appropriate scoreAggregator & joinScorer. Then the factory initializes an appropriate score-computing Stream, & a PointInSetQuery subclass to return.

TermGroupFacet has an embedded SimpleCollector subclass which retrieves caller-given group & facet fields (two lookups) & allocates a smallintmap before iterating over the facet results looking up each value in the group field. Then it looks up a count depending on whether there was a prefix to the facet.

There's 2 subclasses for grouping results. The second wraps a caller-given GroupReducer instance & GroupSelector instance. The first-pass wraps the same GroupSelector, a HashMap, & a TreeSet.

Also the first-pass constructs a few smallintmaps of field comparators & IDs, and serves to figure out which groups to categorize results into as data comes in.

There’s a simpler SimpleCollector which tracks the HashSet of all values it’s seen for deduplication as determined by a given GroupSelector.

BlockGroupingCollector wraps its own PriorityQueue as directed by some multi-field sorting parameters, buffering into a smallintmap before populating the priority queue.

There’s a AllGroupHeadsCollector which wraps a HashMap & GroupSelector to yield only the first entry of each group streamed into it.

There's a FacetsCollector which performs searches by wrapping the given IndexSearcher with a TotalHitCountCollector (results stored in a TopDocs sorted collection & in turn a TotalHits struct), TopFieldsCollector, or TopScoreDocCollector.

And FeatureSortField seeks out the given “feature name” in each result, filtering results without it.

Apache Solr

Apache Solr is an abstraction around Apache Lucene (and its full suite of internationalized search heuristics!), making it easier to use as a potentially-distributed webservice.

It was a bit difficult for me to figure out where to get started, but I found the SolrDispatchFilter servlet. Which seems to be the entrypoint!

It does a little preprocessing, surrounded by rate-limiting logic via a dedicated class (a collection of atomics in a 2nd class, following config), before considering the request.


After that preprocessing it performs authentication (if not /admin/info/key, /solr/, or /) via a configured class (range of choices!) with or without logging. Then it performs some nullchecks before selecting whether to dispatch off to the v1 or v2 HTTP APIs.

With exception handling it calls the selected API version & at its behest forwards to the next servlet in the chain, recurses to retry, or follows its redirection.
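
For readers unfamiliar with Java servlets: a dispatch filter like this sits in front of the servlet chain & decides whether to handle the request itself or pass it along. A generic sketch of the pattern (plain Servlet API, not Solr's actual code; the /api/ routing is just an example):

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import java.io.IOException;

    // Assumes Servlet API 4.0+, where Filter.init() & destroy() have defaults.
    public class DispatchFilterSketch implements Filter {
        @Override
        public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
                throws IOException, ServletException {
            String path = ((HttpServletRequest) req).getPathInfo();
            if (path != null && path.startsWith("/api/")) {
                // Handle it ourselves (Solr would route to its v1/v2 APIs here).
                resp.getWriter().write("handled by the filter");
            } else {
                // Not ours: forward to the next servlet/filter in the chain.
                chain.doFilter(req, resp);
            }
        }
    }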

V2 HTTP API

After parsing the path & querystring from the given request, Solr's V2 HTTP API class (with exception handling) first checks for an empty path, in which case it returns a documentation link & hardcoded description; otherwise it gets & checks the path-prefix, looking up an appropriate plugin. It checks if that prefix is a builtin type so it can defer to CompositeApi for later handling, otherwise deferring to the plugin whilst saving aside certain properties.


For “c” or “collections” prefixes it runs the core querying logic & saves results or throws an exception. For a “cores” prefix it saves parameters.

If no core was computed Solr flags this request as an “admin” operation. Before finally configuring multithreading, running a final Request Parser, & handling exceptions.

Other Java Servlet methods defer to the class selected by that logic, surrounded by error-checks/handling.

That core-logic I mentioned wraps a “ZK State Reader” & “DocCollection Supplier lambda” to resolve a collection's name, trying twice. I take it that it's the plugin's responsibility to do something with the returned collection.

Parsing

Looking over Apache Solr's parsing subsystem, I see an Exception subclass including message formatting with unescaping logic. Another exception subclass assembles syntax error messages.

There’s a Token structure to serve as the AST.

And of course a parser generator is used to define the syntax itself with separate lexer & parser! With utility methods building that AST, most of which are in a separate subclass.

Request Handlers

Looking through Solr’s request handler classes, I see ones which:

All of these requesthandlers share a superclass, and are dynamically dispatched to from the servlets.

Request Handler Subsystems

The Solr request-handlers subsystem described above includes several subsubsystems.

There’s extensive admin HTTP-APIs collating technical internal information including Apache Zookeeper integration, with its subsystem rerouting more modern HTTP-APIs to the core logic. Occasionally this core logic involves some statistics. Including extensive replication APIs.

There’s utilities for parsing components of HTTP-requests.


There's a miscellaneous “components” subsystem including spellchecking, non-distributed configuration, sharding utilities, collections & other datamodelling, statistics like heatmaps, checking replicas in realtime, various facets, more complex sorting, automated suggestions with underlying term processing, more-like-this, scoring phrases, selecting a pivotal facet, cancelling queries, merge strategies, active task lists, highlighting where matches occurred, & more.


There’s a simple configuration API.

There’s a somewhat sophisticated “designer” subsystem to help webmasters decide on configurations before they’ve got real data.

There’s a suite of classes for serializing results, with sorted fields.

There’s classes for importing documents into the underlying Lucene database.

And there’s automated document tagging.

Plugins infrastructure

Most of Apache Solr, from what I gather from studying its servlets, is implemented as plugins. So here I'll study its extensions framework!

This maintains several collections, can cluster query parameters to pass to the given handler class (discussed last time), can query counts from a Solr index searcher, can run a Solr index searcher over a doc list with optional match-highlighting, can parse debugging flags, convert explanations to serializable mappings & query those explanations.


There’s a staticmethod to parse & evaluate a query.

There’s regexp-based parsers for field-boosts.

There’s a counter for optional & necessary query clauses, & a regexp-based parser for that configuration.

There’s a boolean-query optimizer/flattener.

There’s an autoescaper.

There’s strippers for (regexp-based) invalid operators, unbalanced quotes, & null queries.

There’s a results converter to simpler objects.

There's a queryparser subclass which gathers “disjuncts”.

There’s a routine abstracting sortspec parsers to take the appropriate query parameter, as well as for queries.

There’s an introspective setter-caller.

A routine to query a “purposes” mapping to build a string, & an iterator over that purposes mapping.

Miscellaneous/Overview

Skimming over Apache Solr's codebase to see what I missed as I attempted to trace its logic, I see subsystems of:

Surrounding that HTTP API is a larger suite of code distributing load & managing its own collections. Including automated elections for which instance should coordinate things.

Of course there's plenty of locking in this “cloud” subsystem! It's implemented with the aid of Apache ZooKeeper, via a couple of abstractions, with a CLI interface.

Amongst the “core” subsystem there’s routines for saving, modelling, & restoring backups with their repositories & snapshots.

There’s some core Java-interfaces & data modelling for everything to build around.

That “core” subsystem includes logic others can hook into to dispatch requests or whatever. And generally it provides infrastructure for everything else to use. Also there’s caching, configuration, plugin-management, etc.


Having skimmed through several such URL handlers (many of which defer to other Java classes, with or without light tweaks), I see a couple which wrap object properties or throw exceptions for deprecated APIs. Then there's a few which defer to the CollectionsHandler instances & related classes adapting it to an HTTP API. Another wraps Apache Zookeeper. There's classes & interfaces abstracting a collection of files to validate & index, persisting that index to disk in a distributed network. It has its own HTTP APIs.

There's “admin” URL handlers. There's a suite of operations admins can perform, many of which reformat responses from methods on the “core container”. There's “security handlers”, several of which wrap Apache Zookeeper or custom configuration files. There's request handlers, some of which also optionally wrap Zookeeper, or decorate other requests to e.g. add diffs or distributed computing. Several wrap accessors, possibly with analysis. And infrastructure to hook those together. There's a SolrEnvironment struct.

Also Apache Solr's admin URL handlers can be used to e.g. trigger an “election” for which machine should coordinate all the others. Votes are cast randomly, everyone's equally-qualified for the job. Or they could parse a registry file. Amongst other similar tasks, largely related to distributed computing & indexing. There's aggregate admin operations, & backup/restore admin ops.
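
For context, the classic ZooKeeper recipe such elections are usually built on (a generic sketch using the standard ZooKeeper client API, not necessarily Solr's exact code): every candidate creates an ephemeral sequential znode, & whoever holds the lowest sequence number coordinates until its session dies.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.List;

    public class ElectionSketch {
        // Returns true if this node currently holds the lowest-numbered
        // ephemeral sequential znode, i.e. it "won" the election.
        static boolean runForLeader(ZooKeeper zk, String electionPath, String nodeId)
                throws Exception {
            // Ephemeral: vanishes if our session dies. Sequential: ZooKeeper
            // appends a monotonically increasing counter to the name.
            String myNode = zk.create(electionPath + "/candidate-",
                    nodeId.getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);

            List<String> candidates = zk.getChildren(electionPath, false);
            Collections.sort(candidates);
            // Lowest sequence wins; the rest would watch the node just ahead of them.
            return myNode.endsWith(candidates.get(0));
        }
    }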


Skimming more of Apache Solr's URL handlers… There's a support class to merge spellchecking collections & other collection wrappers, various “replica sources”, an interface for computing stats, utils abstracting HTTP networking, & several classes tweaking requests/responses. There's HTTP endpoints for managing configsets, enqueueing commands to apply those operations if permitted. Including a class processing responses. Or retrieving properties from the core container.

One of those configsets handlers cleans up empty files, or uploads files. The decorators around search queries include things like analytics (with supporting classes), suggested queries, sharding (with supporting classes), & so much more. There's a suite of APIs (with supporting classes) for loading up sample documents to see how different configurations impact searches upon them, with recommendations using a heatmap, diffing, & more.

There's several classes aiding serialization & comparison of key-value documents, including an implementation of a Heap. And there's a separate suite of importers from different formats. Another suite of URL handlers & supporting classes for applying “tags” to documents. There's a suite of “source evaluators”. And there's more URL handlers, most of which are similar to ones I've already mentioned. As well as supporting utils.


It took me a bit to figure out where to resume my skimming of Apache Solr, but I see subsystems for…


Studying the “search” module, it appears to provide a massive suite of trivial classes to be composed to evaluate a parsed search query. These operate on the dynamically-typed record datamodel I described recently.

There's several such “facet” classes at least, including amongst other things iterators & aggregations upon those documents. A handful of these are non-trivial, or bundle several trivial classes into a single Java file.

There's several classes implementing functions that can be called from the search query, including a suite implementing various formulas for distances between points or other datatypes. There's also a suite of trivial grouping aggregators (& a collector class) whether using distributed computing or not. There's a suite of mostly-trivial join interpreters. There's a suite outputting statistics. Another suite computing similarity heuristics. And a couple small AI suites.

Finally there's a larger suite of even more trivial classes which can be composed to interpret a search query, alongside exceptions to throw & builders (or as Solr calls them “qparsers”) to composite new instances of these classes. As well as string-parser classes & collection types to aid implementing the rest of this. And of course there's collections in which to look up functions, the most trivial of which are defined whilst populating those collections. Not to mention XML (de)serialization.

Basically, this all resembles the core of a relational database. But with a nicer syntax intended for non-developers (then again SQL was intended for non-developers, & that didn't work out…).


I see a “security” subsystem including authentication, authorization, accounting, & a plugin system. As well as configurable rules. There's another suite of (largely trivial) servlet abstractions. There's a large suite of classes to handle misspelling, including a subsubsystem handling suggestions out of FST, JaSpell, or TST files. This requires some datamodelling in addition to the collection lookups. There's a negation “uninvert” optimization pass.

I see a large subsystem for updating the database, including a preprocessor with several classes (sharing an interface) for trivially populating its different fields. Including regexps. Or parsing different field types. As well of course as some surrounding logic to actually write the changes to the database, including vector clocks & logic for splitting indices.

Then there’s a large suite of utilities, including classes for managing RAM & CPU usage, configuration & its “providers” especially relating to SSL, binary serialization & deserialization, plugin infrastructure, stats-gathering API, & trace-debugging. As well as a larger bucket containing collections, thread-synchronizing, commandline-interfacing, & so much more! Some of these appear to be duplicated from Java’s stdlib, but specialized to Solr’s performance needs.


Some more small-to-trivial submodules in Apache Solr. Including:

Also there's auxiliary files for packaging, various support commands (e.g. converting Solr's metrics to Prometheus's format), documentation, build configuration, testing, etc.

Scrapy

Search engines also need to gather webpages to be searched. For that SearchMySite uses Scrapy, with the atypical twist that they only index curated & consenting sites.

Scrapy is a heavily optimized Python module upon which you can implement such crawlers.


In the main module there’s various interfaces for:

There’s main files exposing Scrapy as a commandline program, or as a Python module. There’s various collections for e.g. addons & “middleware”, & caching for e.g. DNS. There’s various exceptions.

There’s datamodelling covering:

There’s scheduling for:

There’s a couple files implementing a commandline interface for configuring the crawlers. There’s a class for parsing network responses. There’s various classes for outputting results in various fileformats, email, or logging. And there’s a couple classes integrating all these components.

Subcommands

Scrapy includes a number of subcommands you can invoke, including:

And there’s some superclasses.

A separate submodule provides a class (and subclasses) for the unit tests against your crawler to subclass from, & run in the built-in testrunner subcommand.

Scrapy Core

Scrapy’s Core submodule includes a “downloader” subsubsystem which schedules dispatching URLs to data:, file:, ftp:, http: (special implementations for 1.0, 1.1, & 2), & s3: (delegating to HTTP). Building on the Twisted networking framework (queue that project up). Supports invoking “middleware” callbacks, & optionally logging TLS certificate checks.

Scrapy implements HTTP2 itself, with pooling.

The outer & more complex scheduler queues items both in-memory & on-disk (they’re expecting to have massive task queues I see…) tracking stats & deduplicating tasks. A separate class delegates authentication & tracks the active downloads as well as the spiders triggering them. And another tracks error/success stats, & yet another collects the spiders’ outputs.

Scrapy Middleware

Scrapy's crawlers can compose “middlewares” to process their HTTP requests before sending them.

Some builtin ones offer to:

Also there’s 2 separate middlewares for decompressing responses. Whether said compression was indicated at the HTTP level or the fileformat level.

Scrapy Extensions

Today I’ll quickly go over a bunch of builtin “extensions” which can run inside Scrapy to process your spiders’ data, which includes:

HTTP Abstractions

Scrapy includes datamodelling around Twisted’s HTTP client.

Including a Request object with Form, RPC, JSON subclasses; as well as a Response object with Text & in turn XML & HTML subclasses.

As well as a headers dictionary & a cookie jar.

Other Submodules

Skimming over a handful of Scrapy’s more trivial submodules there’s…

The pipeline assembling utils have filters that can handle files (across several backends), images, & media.

There’s template directories to get you started building Scrapy “projects” & their “spiders”, with Scrapy subcommands evaluating these templates.

Scrapy Utilities

And there’s several utils to help implement the rest of Scrapy, including:

This is all very trivial.