Michael I. Lewis’s SearchMySite.Net is a curated search engine I use for finding additional interesting technical writing to link to, though there are others like Wiby.Me & Marginalia.
SearchMySite.Net is built upon Apache Solr (and thus in turn Apache Lucene), Scrapy, PostgreSQL, and (to build a now-opensource abstraction/UI layer I’m not interested in studying) Python’s Flask web framework.
Apache Lucene
Apache Lucene is server software for document search. While many relational databases like PostgreSQL & SQLite also support this, Lucene is more full-featured, to the extent that it supports a wide range of spoken languages including Arabic, Bulgarian, Bengali, Brazilian Portuguese, Catalan, Czech, Danish, German, & many more.
As such I’d heavily recommend it, or one of its abstractions, for any search engine you might want to make. Unless search isn’t your app’s focus, in which case it might be easier to use e.g. Postgres’s.
I won’t discuss Lucene’s extensive localization further (everything search-related I’ve read is English-centric, hence my heavy recommendation of Lucene or one of its abstractions); ask a linguist to explain it. Beyond saying there’s a wide range of complexities (usually trivial), and that they involve parsing via a Java equivalent of Flex/Bison. Nor will I dig into the text-processing toolshed surrounding these localizations.
Instead I’ll start from the indexing (then search) demo commands.
Indexing
After parsing the commandline flags & validating that “-docs” was provided, the demo parses that filepath into an FSDirectory class. FSDirectory wraps a given filePath, which it opens via one of two (local or NIO filesystems) other ByteBufferIndexInput wrappers, alongside an atomic pendingDeletes hashset & two atomic counters: one counting operations since the last deletion & another for naming temp files. It has directory-traversal methods. The former counter determines when to delete those pending files.
It next constructs a StandardAnalyzer. StandardAnalyzer, which subclasses StopwordAnalyzerBase, pipelines StandardTokenizer to LowerCaseFilter to StopFilter to TokenStreamComponents.
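As a minimal sketch of how such an analysis pipeline is wired together via Lucene’s Analyzer API (the field name & tiny stopword list here are my own, not what StandardAnalyzer ships with):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Roughly the pipeline described above: tokenize, lowercase, then drop stopwords.
class SketchAnalyzer extends Analyzer {
    private final CharArraySet stopwords =
            new CharArraySet(java.util.List.of("the", "a", "and"), true); // my own tiny stopset

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer(); // JFlex-generated scanner underneath
        TokenStream result = new LowerCaseFilter(source);   // normalize case
        result = new StopFilter(result, stopwords);         // filter out stopwords
        // Pair up the source tokenizer with the end of the filter chain.
        return new TokenStreamComponents(source, result);
    }
}
```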
StopwordAnalyzerBase stores a set of words, possibly parsed from a given file via WordlistLoader, which parses a #-commented newline-separated file.
StandardTokenizer wraps a StandardTokenizerImpl (which is implemented with help from the JFlex lexer generator) with its own scanner & configuration. I’m not entirely clear what this lexing’s for, but it primarily seems to serve to whitespace-separate words & determine which alphabet’s used.
LowerCaseFilter, of course, converts all the words sent through it to lowercase via an internal utility iterating over the char arrays.
StopFilter implements its accept method to return whether the given query term is in the stopset by wrapping CharArraySet, which in turn wraps CharArrayMap. CharArrayMap implements a simplistic hashmap around separate key (2D character array) & value arrays sized to powers of two so it can use bitmasks rather than division. Hashes are computed by multiplying each char by 31 & adding the next.
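A rough sketch of that hashing scheme (my own illustration, not Lucene’s exact code):

```java
// Illustrative CharArrayMap-style hashing (not Lucene's exact code).
final class CharHashSketch {
    // Multiply-by-31 rolling hash over the chars.
    static int hash(char[] text, int offset, int length) {
        int code = 0;
        for (int i = offset; i < offset + length; i++) {
            code = code * 31 + text[i];
        }
        return code;
    }

    // Power-of-two table sizes let us mask instead of the slower modulo/divide.
    static int slotFor(char[] key, int tableSize) {
        return hash(key, 0, key.length) & (tableSize - 1);
    }
}
```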
I’m having trouble locating TokenStreamComponents; it’s probably just an output struct.
Next the demo program constructs an IndexWriterConfig struct, which subclasses the LiveIndexWriterConfig struct, and sets some of its properties. I’m calling these “structs” because they contain little more than accessor methods.
Then it creates an IndexWriter, reads every file directly or indirectly in the specified directory, & loads them into that IndexWriter as dynamically-typed serializable Document structs. Finally it cleans up with time profiling & exception handling.
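Pulling those pieces together, a stripped-down sketch of such an indexing demo might look roughly like this (directory paths, field names, & the lack of subdirectory recursion or error handling are my own simplifications):

```java
import java.io.IOException;
import java.nio.file.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class IndexDemoSketch {
    public static void main(String[] args) throws IOException {
        // FSDirectory wraps the on-disk location of the index.
        FSDirectory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

        try (IndexWriter writer = new IndexWriter(dir, config);
             DirectoryStream<Path> docs = Files.newDirectoryStream(Paths.get("docs"))) {
            for (Path file : docs) {
                if (!Files.isRegularFile(file)) continue;
                // Documents are dynamically-typed bags of fields.
                Document doc = new Document();
                doc.add(new StringField("path", file.toString(), Field.Store.YES));
                doc.add(new TextField("contents", Files.readString(file), Field.Store.NO));
                // updateDocument replaces any previous version indexed under the same path term.
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }
}
```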
Lucene can index on-disk Documents, which are dynamically-typed structures with whatever properties you want, via an IndexWriter.
To add or update each given doc (which the demo loads from the given dir) it first ensures the indexfile is still open, defers to DocWriter, .processes all atomically-queued events under a semaphore, & handles merges. Handling merges involves validation with error messages, deferring to the configured MergePolicy, & deferring to a multithreaded MergeScheduler.
That MergeScheduler, after its state-tracking & validation, defers back to each given MergeSource, now each in a new thread.
To update a document Lucene’s DocWriter first validates that the file is still open & checks for stalled threads or pending flushes. Additional metadata is written out before each flush. Then it obtains a per-thread lock which, between tracking threading state, it defers to, before considering reflushing.
Each thread, between outputting optional logs, iterates over the given docs, validating we aren’t exceeding a document cap & deferring to an IndexingChain. After iterating over all given documents it enqueues a deletion of the previous docversion to populate a DeleteSlice, which it’ll then .apply regardless.
The IndexingChain will first tell its TermHash & (with exception handling) StoredFieldsConsumer to start the document, before iterating thrice over each field with exception handling.
The first iteration gathers all the field structural metadata into an inlined closed hashmap, validating there aren’t too many & updating the FieldSchema.
The second iteration initializes the FieldInfo for each of those fields & the appropriate writer for its type.
The third actually writes out the field values using the classes initialized in the previous iteration.
Beneath this is a suite of allocator & serialization classes, with or without sorting/compression, wrapping StoredFieldsWriter. There’s essentially one shallow layer around the classes which write raw bytes to the fields.
The basic structure of which is a little non-obvious from this perspective, but seems to resemble what SQLite FTS5 does: store a prefix-compressed sorted list in each “page” (a hardware & filesystems concept) of persistent storage, mapping from search terms to sorted matching documents.
When such a page is too small to hold more entries, that’s when the concurrent merge sort kicks in.
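To illustrate the prefix-compression idea (my own sketch of the general technique, not Lucene’s actual file format):

```java
import java.util.List;

// Illustrative sketch of prefix-compressing a sorted term list: each entry records
// how many leading chars it shares with the previous term, plus only the new suffix.
class PrefixCompressSketch {
    static String encode(List<String> sortedTerms) {
        StringBuilder out = new StringBuilder();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int limit = Math.min(prev.length(), term.length());
            while (shared < limit && prev.charAt(shared) == term.charAt(shared)) shared++;
            out.append(shared).append('|').append(term.substring(shared)).append('\n');
            prev = term;
        }
        // e.g. ["search", "searching", "seed"] encodes as "0|search", "6|ing", "2|ed"
        return out.toString();
    }
}
```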
Searching
Now that I’ve described how Apache Lucene generates its index file, how does it actually use that to speed up searching?
The first thing the demo program for this does is initialize an IndexSearcher wrapping a given IndexReaderContext & its related classes. It further gathers a sorted array of “leaf slices” from the IndexReaderContext’s .leaves(), whilst grouping them into consistently-sized buckets.
It also implements a .search method with various wrappers/signatures.
One of those .search wrappers/signatures dispatches the result to methods on a CollectorManager; others provide their own CollectorManagers. But ultimately it defers to the appropriate LeafCollector (from the given Collector) & BulkScorer (from the given Weight), possibly after repeatedly rewriting the given query until fixpoint, then incrementing hitcounts.
That rewrite can be exposed to the caller/enduser.
Next the demo program initializes the same StandardAnalyzer pipeline used during indexing, maybe a KnnVectorDict (which uses a Finite State machine to map words to numeric IDs as parsed from a file), & a stream of input queries.
Then it creates a QueryParser (written in JavaCC) to extract the structure from the enduser’s input text. It uses FastCharStream as a trivial lexer/scanner. There’s barely any syntax for endusers to learn…
For each whitespace-trimmed input line it parses the query, optionally resolves word IDs from that KnnVectorDict, & optionally time-benchmarks the search a given number of times, before performing the search for real with additional queries to determine pagination information.
After those iterations it closes the input stream & optionally the KnnVectorDict.
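A stripped-down sketch of that search loop might look like this (again my own simplification, omitting KnnVectorDict resolution, benchmarking, & pagination; the index path, field name, & query string are mine):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class SearchDemoSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Use the same analyzer as indexing so query terms get the same normalization.
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

            Query query = parser.parse("lucene AND indexing");
            TopDocs hits = searcher.search(query, 10); // top 10 results
            System.out.println("Total hits: " + hits.totalHits);
            for (ScoreDoc sd : hits.scoreDocs) {
                // Fetch the stored "path" field for display.
                System.out.println(searcher.doc(sd.doc).get("path") + "\tscore=" + sd.score);
            }
        }
    }
}
```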
Lucene ultimately calls down into a Collector implementation to load the search results from the index, of which there are several trivial decorators around other implementations (to e.g. cache results) which I’m skipping. A few also implement Scorer to calculate heuristic rankings, which I’m also skipping.
Today I’ll discuss the more substantial of Lucene’s Collectors!
Often there are multiple possible interpretations of a search query, making it worth ensuring you return results for all of them.
This works by maintaining a hashmap of priority queues, with a separate overflow priority queue. The keys for this hashmap are determined elsewhere.
The central Collector appears to be GlobalOrdinalsWithScoreCollector, which uses a bitmask & 2 abstracted 2D arrays of scores/occurrences to underlie its .match() & .score() methods. After loading a list of sorted docs, it hands off to its OrdinalMapCollector or SegmentOrdinalCollector embedded classes to gather those collections.
GlobalOrdinalsWithScoreCollector has several embedded concrete subclasses for Min, Max, Sum, Avg, or NoScore scorings.
There are a couple of Collectors/Scorers to track a minimum competitive score to enforce.
TopFieldCollector wraps a MultiLeafFieldComparator (itself dispatching to type-appropriate comparators for the dynamic doc fields), enforces that minimum competitive score, gathers results, & maybe paginates them.
Finally, for today, there’s a MultiCollector which wraps an array of other LeafCollectors, deferring to all its children’s collect methods until they have no more results to yield.
Then there’s SimpleCollector, which defers to its subclasses’ doSetNextReader method, which I’ll discuss tomorrow.
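For a sense of that interface, here’s a trivial SimpleCollector subclass of my own devising (illustrative only, not one of Lucene’s built-in collectors):

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Trivial SimpleCollector: just gathers global doc IDs.
class DocIdCollectorSketch extends SimpleCollector {
    private final java.util.List<Integer> docIds = new java.util.ArrayList<>();
    private int docBase; // offset of the current leaf/segment

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        docBase = context.docBase; // called as the search moves between segments
    }

    @Override
    public void collect(int doc) {
        docIds.add(docBase + doc); // doc is segment-local; add the base to globalize it
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES; // this collector doesn't need scores
    }

    public java.util.List<Integer> docIds() { return docIds; }
}
```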
TopSuggestDocsCollector, a SimpleCollector subclass, wraps a priority queue, or a pending list & CharArraySet, to sort & deduplicate gathered results.
ValueSource provides its own embedded ValueSourceComparator subclass which defers to ValueSource subclasses (with floating-point subclasses embedded).
Lucene has an on-disk cache following the same format as Lucene’s main index, from which it can retrieve a cache & query ID as well as a binary “Monitor Query”. There’s a corresponding SimpleCollector to consult it.
Queries usually consist of more than one term, so there’s a factory for SimpleCollector subclasses deferring to an appropriate scoreAggregator & joinScorer. Then the factory initializes an appropriate score-computing Stream, & a returned PointInSetQuery subclass.
TermGroupFacet has an embedded SimpleCollector subclass which retrieves caller-given group & facet fields (two lookups) & allocates a smallintmap before iterating over the facet results, looking up each value in the group field. Then it looks up a count depending on whether there was a prefix to the facet.
There are 2 subclasses for grouping results. The second wraps a caller-given GroupReducer instance & GroupSelector instance. The first-pass one wraps the same GroupSelector, a HashMap, & a TreeSet.
Also the first-pass one constructs a few smallintmaps of field comparators & IDs, and serves to figure out which groups to categorize into as data comes in.
There’s a simpler SimpleCollector which tracks the HashSet of all values it’s seen for deduplication, as determined by a given GroupSelector.
BlockGroupingCollector wraps its own PriorityQueue as directed by some multi-field sorting parameters, buffering into a smallintmap before populating the priority queue.
There’s an AllGroupHeadsCollector which wraps a HashMap & GroupSelector to yield only the first entry of each group streamed into it.
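That first-entry-per-group behaviour is simple enough to sketch generically (my own illustration, not Lucene’s actual classes):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of keeping only the first ("head") entry seen per group as results stream in.
class GroupHeadsSketch<K, V> {
    private final Map<K, V> heads = new HashMap<>();
    private final Function<V, K> groupSelector; // stands in for a GroupSelector

    GroupHeadsSketch(Function<V, K> groupSelector) { this.groupSelector = groupSelector; }

    void offer(V value) {
        // Later entries mapping to the same group key are ignored.
        heads.putIfAbsent(groupSelector.apply(value), value);
    }

    Map<K, V> heads() { return heads; }
}
```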
There’s a FacetsCollector which performs searches by wrapping the given IndexSearcher with a TotalHitCountCollector (results stored in a TopDocs sorted collection & in turn a TotalHits struct), TopFieldsCollector, or TopScoreDocCollector.
And FeatureSortField seeks out the given “feature name” in each result, filtering out results without it.
Apache Solr
Apache Solr is an abstraction around Apache Lucene (and its full suite of internationalized search heuristics!), making it easier to use as a potentially-distributed webservice.
It was a bit difficult for me to figure out where to get started, but I found the SolrDispatchFilter servlet. Which seems to be the entrypoint!
It does a little preprocessing, surrounded by rate-limiting logic via a dedicated class (a second class holding a collection of atomics, following config), before considering the request.
After that preprocessing it performs authentication (if the path isn’t /admin/info/key, /solr/, or /) via a configured class (a range of choices!), with or without logging. Then it performs some nullchecks before selecting whether to dispatch off to the v1 or v2 HTTP APIs.
With exception handling it calls the selected API version & at its behest forwards to the next servlet in the chain, recurses to retry, or follows its redirection.
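As a very rough sketch of that dispatch-filter pattern (a generic illustration of the shape described above, not Solr’s actual code; every class, method, & path here is my own invention):

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Generic dispatch filter: rate-limit, authenticate, then route to an API version or pass along.
public class DispatchFilterSketch implements Filter {
    private final RateLimiterSketch rateLimiter = new RateLimiterSketch();     // hypothetical helper
    private final AuthenticatorSketch authenticator = new AuthenticatorSketch(); // hypothetical helper

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) resp;

        if (!rateLimiter.tryAcquire(request)) {
            response.sendError(429, "Too many requests");
            return;
        }
        if (!authenticator.authenticate(request, response)) return; // auth layer already responded

        String path = request.getPathInfo();
        if (path != null && path.startsWith("/v2")) {
            handleV2(request, response);       // newer API (routing criteria are hypothetical)
        } else if (path != null) {
            handleV1(request, response);       // classic API
        } else {
            chain.doFilter(req, resp);         // not ours: pass to the next servlet in the chain
        }
    }

    // Stubs standing in for the real handlers & helpers.
    private void handleV1(HttpServletRequest rq, HttpServletResponse rs) throws IOException {}
    private void handleV2(HttpServletRequest rq, HttpServletResponse rs) throws IOException {}
    static class RateLimiterSketch { boolean tryAcquire(HttpServletRequest rq) { return true; } }
    static class AuthenticatorSketch {
        boolean authenticate(HttpServletRequest rq, HttpServletResponse rs) { return true; }
    }
}
```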
TODO: V2HttpCall