Michael I. Lewis’s SearchMySite.Net is a curated search engine I use for finding additional interesting technical writing to link to. Though there are others like Wiby.Me & Marginalia.
SearchMySite.Net is built upon Apache Solr (thus in turn Apache Lucene), Scrapy, PostgreSQL, & Python’s Flask web framework (used to build a now open-source abstraction/UI layer I’m not interested in studying).
Apache Lucene is a Java library for document search. While many relational databases like PostgreSQL & SQLite also support this, Lucene is more full-featured. To the extent it supports a wide range of spoken languages including Arabic, Bulgarian, Bengali, Brazilian Portuguese, Catalan, Czech, Danish, German, & many many more.
As such I’d heavily recommend Lucene or one of its abstractions for any search engine you might want to make. Unless search isn’t your app’s focus, in which case it might be easier to use e.g. Postgres’s built-in full-text search.
I won’t discuss Lucene’s extensive localization further (everything search-related I’ve read is English-centric, hence my heavy recommendation of Lucene or one of its abstractions); ask a linguist to explain it. Beyond saying there’s a wide range of complexities (usually trivial), & they involve parsing via a Java equivalent of Flex/Bison. Nor will I dig into the text-processing toolshed surrounding these localizations.
Instead I’ll start from the indexing (then search) demo commands.
After parsing the commandline flags & validating that “-docs” was provided, the demo parses that filepath into a Path which it wraps in an
FSDirectory. FSDirectory wraps a given file Path, which it
opens via one of two (local or NIO filesystem)
ByteBufferIndexInput wrappers, with an atomic
pendingDeletes hashset & two atomic counters: one for operations since the last deletion & another for naming temp files. The former determines when to delete those files. It also has directory traversal methods.
It next constructs a
StandardAnalyzer, which subclasses
StopwordAnalyzerBase. StopwordAnalyzerBase stores a set of stopwords, possibly parsed from a given file via
WordlistLoader, which parses a #-commented newline-separated file.
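To make that file format concrete, here’s a toy parser for a #-commented, newline-separated wordlist — a sketch in the spirit of WordlistLoader, not Lucene’s actual API (the class & method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

// Toy wordlist parser: blank lines & lines whose first non-blank
// character is '#' are skipped; everything else is a stopword.
class StopwordFile {
    static List<String> parse(String text) {
        List<String> words = new ArrayList<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            words.add(line);
        }
        return words;
    }
}
```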
StandardTokenizer wraps a
StandardTokenizerImpl (which is implemented with help from the JFlex lexer generator) with its own scanner & configuration. I’m not entirely clear what this lexing’s for, but it primarily seems to serve to split words on whitespace & determine which alphabet’s in use.
LowerCaseFilter, of course, converts all the words sent through it to lowercase via an internal utility iterating over the token’s characters.
StopFilter implements its
accept method to return whether the given query term is in the stopset, by wrapping a
CharArraySet which in turn wraps a
CharArrayMap. CharArrayMap implements a simplistic hashmap around separate key (2D character array) & value arrays sized to powers of two, so it can use bitmasks rather than division. Hashes are computed by repeatedly multiplying the running hash by 31 & adding the next char.
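That’s the classic Java polynomial hash. A minimal sketch of the two tricks (my own illustration, not CharArrayMap’s real code):

```java
// Sketch of CharArrayMap-style hashing & bucket selection: a running
// hash multiplied by 31 per char, & a power-of-two table size so the
// modulo reduces to a cheap bitmask instead of a division.
class CharHash {
    static int hash(char[] key, int off, int len) {
        int code = 0;
        for (int i = off; i < off + len; i++) {
            code = code * 31 + key[i]; // 31 * running hash + next char
        }
        return code;
    }

    // For power-of-two capacities, (h & (cap - 1)) == h % cap
    // whenever h is non-negative.
    static int bucket(int hashCode, int powerOfTwoCapacity) {
        return hashCode & (powerOfTwoCapacity - 1);
    }
}
```

This is also exactly why the arrays are sized to powers of two: the bitmask trick only works for those sizes.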
I’m having trouble locating
TokenStreamComponents; it’s probably just an output struct.
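Putting that analysis pipeline together, here’s a toy eager equivalent of StandardTokenizer → LowerCaseFilter → StopFilter (names & the crude regex tokenizer are mine; Lucene streams tokens lazily instead):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer pipeline: split on runs of non-letters (crude stand-in
// for StandardTokenizer), lowercase each token (LowerCaseFilter), then
// drop anything in the stopset (StopFilter's accept check).
class ToyAnalyzer {
    static List<String> analyze(String text, Set<String> stopset) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("[^\\p{L}]+")) {
            if (tok.isEmpty()) continue;
            tok = tok.toLowerCase(Locale.ROOT);
            if (!stopset.contains(tok)) out.add(tok);
        }
        return out;
    }
}
```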
Next the demo program constructs an
IndexWriterConfig struct, which subclasses the
LiveIndexWriterConfig struct, and sets some of its properties. I’m calling these “structs” because they contain little more than accessor methods.
Then it creates an
IndexWriter, reads every file directly or indirectly in the specified directory, & loads them into that
IndexWriter as dynamically-typed serializable Documents.
Then it cleans up with time profiling & exception handling.
Lucene can index on-disk
Documents, which are dynamically-typed structures with whatever properties you want, via an IndexWriter.
To add or update each given doc (which the demo loads from the given dir) it first ensures the index file is still open, defers to a method to
.process all atomically-queued events under a semaphore, & handles merges. Handling merges involves validation with error messages, deferring to the configured
MergePolicy, & deferring to a multithreaded MergeScheduler.
That MergeScheduler, after its state-tracking & validation, defers back to each given
MergeSource, now each in a new thread.
To update a document Lucene’s
DocWriter first validates the file is still open & handles any stalled threads or pending flushes. Additional metadata is written out before each flush. Then it obtains a per-thread lock which, between tracking threading state, it defers to before considering reflushing.
Each thread, between outputting optional logs, iterates over the given docs validating we aren’t exceeding a document cap & deferring to an
IndexingChain. After iterating over all given documents it enqueues a deletion of the previous doc version to populate a
DeleteSlice it’ll then apply regardless.
The IndexingChain will first tell its
TermHash & (with exception handling)
StoredFieldsConsumer to start the document, before iterating thrice over its fields with exception handling.
The first iteration gathers all the fields’ structural metadata into an inlined closed hashmap, validating there aren’t too many & updating the relevant bookkeeping.
The second iteration initializes the
FieldInfo for each of those fields & the appropriate writer for its type.
The third actually writes out the field values using the classes initialized in the previous iteration.
Beneath this is a suite of allocator & serialization classes, with or without sorting/compression; there’s essentially one shallow layer around the classes which write raw bytes to the fields.
The basic structure of which is a little unobvious from this perspective, but seems to resemble what SQLite FTS5 does: store a prefix-compressed sorted list in each “page” (hardware & filesystems concept) of persistent storage mapping from search terms to sorted matching documents.
When such a page is too small to hold its list, that’s when the concurrent merge sort kicks in.
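To illustrate that prefix compression, here’s a toy encoding of a sorted term list — each term stored as the length of the prefix it shares with the previous term, plus its suffix. My own throwaway format, not Lucene’s or FTS5’s actual on-disk layout:

```java
import java.util.ArrayList;
import java.util.List;

// Toy prefix compression over a sorted term list, in the spirit of
// what an index page stores. Each entry is "<shared-prefix-len>:<suffix>".
class PrefixCoder {
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) shared++;
            out.add(shared + ":" + term.substring(shared));
            prev = term;
        }
        return out;
    }

    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int colon = entry.indexOf(':');
            int shared = Integer.parseInt(entry.substring(0, colon));
            String term = prev.substring(0, shared) + entry.substring(colon + 1);
            out.add(term);
            prev = term;
        }
        return out;
    }
}
```

Since the list is sorted, adjacent terms tend to share long prefixes, which is what makes this pay off.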
Now that I’ve described how Apache Lucene generates its index file, how does it actually use that to speed up searching?
The first thing the demo program for this does is to initialize an
IndexSearcher wrapping a given
IndexReaderContext & its related classes. It further gathers a sorted array of “leaf slices” from the context’s
.leaves() whilst grouping them into consistently-sized buckets.
It also implements a
.search method with various wrappers/signatures.
One of those
.search wrappers/signatures dispatches the result to methods on a
CollectorManager; others provide their own
CollectorManagers. But ultimately it defers to the appropriate
LeafCollector (from the given
BulkScorer, in turn from the given
Weight), possibly after repeatedly rewriting the given query until fixpoint, then incrementing hitcounts.
That rewrite can be exposed to the caller/enduser.
Next the demo program initializes the same
StandardAnalyzer pipeline used during indexing, maybe a
KnnVectorDict (uses a Finite State machine to map words to numeric IDs as parsed from a file), & a stream of input queries.
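Functionally that dictionary is just an ordered word → numeric ID mapping. Here’s a toy stand-in using binary search over a sorted array — the interface only; Lucene’s Finite State machine gets the same mapping far more compactly by sharing prefixes & suffixes:

```java
import java.util.Arrays;

// Toy word -> numeric ID dictionary: a word's ID is its position in
// the sorted array, found by binary search. Illustrative stand-in for
// an FST-backed dictionary, not Lucene's implementation.
class WordIds {
    private final String[] sortedWords;

    WordIds(String[] words) {
        this.sortedWords = words.clone();
        Arrays.sort(this.sortedWords);
    }

    // Returns the word's ID (its sorted position), or -1 if absent.
    int idOf(String word) {
        int i = Arrays.binarySearch(sortedWords, word);
        return i >= 0 ? i : -1;
    }
}
```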
Then it creates a
QueryParser (written in JavaCC) to extract the structure from the enduser’s input text. It uses
FastCharStream as a trivial lexer/scanner. Barely any syntax for endusers to learn…
For each whitespace-trimmed input line it parses the query, optionally resolves word IDs from that
KnnVectorDict, optionally time-benchmarks the search a given number of times, before performing the search for real with additional queries to determine pagination information.
After those iterations it closes the input stream & optionally the KnnVectorDict.
Lucene ultimately calls down into a
Collector implementation to load the search results from the index, of which there are several trivial decorators around other implementations (to e.g. cache results) which I’m skipping. A few also implement
Scorer to calculate heuristic rankings, which I’m also skipping.
Today I’ll discuss the more substantial of Lucene’s Collectors.
Often there are multiple possible interpretations of a search query, making it worth ensuring you return results for all of them.
This works by maintaining a hashmap of priority queues, with a separate overflow priority queue. The keys for this hashmap are determined elsewhere.
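A sketch of that shape — my guess at the mechanics, with the key derivation left to the caller as the text notes, & plain doubles standing in for whatever the real Collector stores:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Per-key bounded min-heaps plus a shared overflow heap: each key
// (query interpretation) keeps its best few scores; evicted minimums
// spill into the overflow queue.
class KeyedQueues {
    private final int perKeyCap;
    private final Map<String, PriorityQueue<Double>> queues = new HashMap<>();
    private final PriorityQueue<Double> overflow = new PriorityQueue<>();

    KeyedQueues(int perKeyCap) { this.perKeyCap = perKeyCap; }

    void offer(String key, double score) {
        PriorityQueue<Double> q = queues.computeIfAbsent(key, k -> new PriorityQueue<>());
        q.offer(score);
        if (q.size() > perKeyCap) overflow.offer(q.poll()); // evict minimum
    }

    int sizeOf(String key) {
        PriorityQueue<Double> q = queues.get(key);
        return q == null ? 0 : q.size();
    }

    int overflowSize() { return overflow.size(); }
}
```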
One such Collector appears to be
GlobalOrdinalsWithScoreCollector, which uses a bitmask & 2 abstracted 2D arrays of scores/occurrences to underlie its
.score() methods. It hands off, after loading a list of sorted docs, to its embedded
SegmentOrdinalCollector classes to gather those collections.
GlobalOrdinalsWithScoreCollector has several embedded concrete subclasses for the different score-aggregation modes.
There are also a couple of Collectors/Scorers to track a minimum competitive score to enforce.
TopFieldCollector wraps a
MultiLeafFieldComparator (itself dispatching to type-appropriate comparators for the dynamic doc fields); it enforces that min competitive score, gathers results, & maybe paginates them.
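The min-competitive-score trick is easy to sketch: once a top-K heap is full, its smallest retained score is the bar any new hit must clear. A toy version over a single score (the real TopFieldCollector compares arbitrary sort fields, not one double):

```java
import java.util.PriorityQueue;

// Toy top-K collector with a "minimum competitive score": hits that
// can't beat the smallest retained score are rejected outright.
class TopK {
    private final int k;
    private final PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap

    TopK(int k) { this.k = k; }

    double minCompetitiveScore() {
        return heap.size() < k ? Double.NEGATIVE_INFINITY : heap.peek();
    }

    boolean collect(double score) {
        if (score <= minCompetitiveScore()) return false; // not competitive
        heap.offer(score);
        if (heap.size() > k) heap.poll(); // drop the old minimum
        return true;
    }
}
```

The payoff is that scorers can be told this floor & skip whole ranges of documents that can’t possibly beat it.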
Finally, for today, there’s a
MultiCollector which wraps an array of other
LeafCollectors, deferring to all its children’s
collect methods until they have no more results to yield.
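The shape of that fan-out is roughly this (hypothetical interfaces of my own; Lucene actually signals early termination by throwing CollectionTerminatedException rather than returning a boolean):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a MultiCollector-style fan-out: each collected doc goes to
// every child, & children that signal they're done are dropped so they
// stop receiving docs.
class FanOutCollector {
    interface Child {
        boolean collect(int docId); // false = done, send me no more docs
    }

    private final List<Child> children;

    FanOutCollector(List<Child> children) {
        this.children = new ArrayList<>(children);
    }

    void collect(int docId) {
        children.removeIf(child -> !child.collect(docId));
    }

    int liveChildren() { return children.size(); }
}
```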
There’s a
SimpleSelector which defers to its subclasses’
doSetNextReader method, which I’ll discuss tomorrow.
One SimpleSelector subclass wraps a priority queue, or a pending list &
CharArraySet, to sort & deduplicate gathered results.
ValueSources provide their own embedded
ValueSourceComparator subclass which defers to
ValueSource subclasses (with embedded subclasses for floating point).
Lucene has an on-disk cache following the same format as Lucene’s main index, from which it can retrieve a cache & query ID as well as a binary “Monitor Query”. There’s a corresponding
SimpleCollector to consult it.
Queries usually consist of more than one term, so there’s a factory for
SimpleCollector subclasses deferring to an appropriate
joinScorer. Then the factory initializes an appropriate score-computing
Stream to be returned.
TermGroupFacet has an embedded
SimpleCollector subclass which retrieves caller-given group & facet fields (two lookups) & allocates a smallintmap, before iterating over the facet results looking up each value in the group field. Then it looks up a count depending on whether there was a prefix to the facet.
There’s 2 subclasses for grouping results. The second wraps a caller-given
GroupReducer instance &
GroupSelector instance. The first-pass wraps the same
GroupSelector, a HashMap, & a TreeSet.
Also the first-pass constructs a few smallintmaps of field comparators & IDs, and serves to figure out which groups to categorize results into as data comes in.
There’s a simpler
SimpleCollector which tracks the HashSet of all values it’s seen for deduplication, as determined by a given GroupSelector.
A BlockGroupingCollector wraps its own
PriorityQueue as directed by some multi-field sorting parameters, buffering into a smallintmap before populating the priority queue.
There’s an AllGroupHeadsCollector which wraps a
GroupSelector to yield only the first entry of each group streamed into it.
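That group-heads behaviour is simple to sketch (a toy literally keeping the first doc per group key, as described above; the real collector can keep the *best* head per configured sort order):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy group-heads collector: as (groupKey, docId) pairs stream in,
// retain only the first doc seen for each group.
class GroupHeads {
    private final Map<String, Integer> heads = new LinkedHashMap<>();

    void collect(String groupKey, int docId) {
        heads.putIfAbsent(groupKey, docId);
    }

    Map<String, Integer> heads() { return heads; }
}
```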
And there’s a FacetsCollector which performs searches by wrapping the given
IndexSearcher with a
TotalHitCountCollector (results stored in a
TopDocs sorted collection). In turn a
FeatureSortField seeks out the given “feature name” in each result, filtering out results without it.
Apache Solr is an abstraction around Apache Lucene (and its full suite of internationalized search heuristics!), making it easier to use as a potentially-distributed webservice.
It was a bit difficult for me to figure out where to get started, but I found the SolrDispatchFilter servlet. Which seems to be the entrypoint!
After a little preprocessing, & surrounded by rate-limiting logic via a dedicated class (a collection of atomics in a 2nd class, following config), it considers the request.
After that preprocessing it performs authentication (if not /admin/info/key, /solr/, or /) via a configured class (range of choices!), with or without logging. Then it performs some nullchecks before selecting whether to dispatch off to the v1 or v2 HTTP APIs.
With exception handling it calls the selected API version & at its behest forwards to next servlet in the chain, recurses to retry, or follows its redirection.
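As a sketch of the atomic-counter style of rate limiting such a filter might wrap requests in — my own minimal version, not Solr’s actual rate limiter, which is richer (per-request-type limits, slot borrowing, timeouts):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal concurrent request limiter: an atomic counter of in-flight
// requests, rejecting anything past a configured cap via a CAS loop.
class RequestLimiter {
    private final int cap;
    private final AtomicInteger inFlight = new AtomicInteger();

    RequestLimiter(int cap) { this.cap = cap; }

    boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= cap) return false;           // rate-limited
            if (inFlight.compareAndSet(current, current + 1)) return true;
        }
    }

    void release() { inFlight.decrementAndGet(); }
}
```

The CAS loop (rather than a bare incrementAndGet) ensures the counter never overshoots the cap even under contention.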