SearchMySite.Net

Michael I. Lewis’s SearchMySite.Net is a curated search engine I use for finding additional interesting technical writing to link to, though there are others like Wiby.Me & Marginalia.

SearchMySite.Net is built upon Apache Solr (thus in turn Apache Lucene), Scrapy, PostgreSQL, and (to build a now-opensource abstraction/UI layer I’m not interested in studying) Python’s Flask web framework.

Apache Lucene

Apache Lucene is a Java library for document search. While many relational databases like PostgreSQL & SQLite also support this, Lucene is more full-featured; so much so that it supports a wide range of natural languages including Arabic, Bulgarian, Bengali, Brazilian Portuguese, Catalan, Czech, Danish, German, & many more.

As such I’d heavily recommend it or one of its abstractions for any search engine you might want to make. Unless search isn’t your app’s focus, in which case it might be easier to use e.g. Postgres’s built-in full-text search.

I won’t discuss Lucene’s extensive localization further (everything search-related I’ve read is English-centric, hence my heavy recommendation of Lucene or one of its abstractions); ask a linguist to explain it. Beyond saying there’s a wide range of complexities (usually trivial), and they involve parsing via a Java equivalent of Flex/Bison. Nor will I dig into the text-processing toolshed surrounding these localizations.

Instead I’ll start from the indexing (then search) demo commands.

Indexing

After parsing the commandline flags & validating that “-docs” was provided, the demo passes that filepath to Lucene’s FSDirectory class.

FSDirectory wraps a given filepath, which it opens via one of two other ByteBufferIndexInput wrappers (for local or NIO filesystems), alongside an atomic pendingDeletes hashset & two atomic counters: one for operations since the last deletion (which determines when to delete those pending files) & another for naming temp files. It also has directory-traversal methods.
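As a minimal sketch of how that gets used (the “index” path is my own placeholder, & exact package locations vary slightly between Lucene versions), opening such a directory looks roughly like:

import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class OpenIndexDir {
    public static void main(String[] args) throws Exception {
        // FSDirectory.open() picks a suitable concrete subclass for the local
        // filesystem; "index" here is a hypothetical path.
        try (Directory dir = FSDirectory.open(Paths.get("index"))) {
            System.out.println(dir); // reports the chosen implementation & path
        }
    }
}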


It next constructs a StandardAnalyzer which, subclassing StopwordAnalyzerBase, pipelines StandardTokenizer into LowerCaseFilter into StopFilter, bundled up as TokenStreamComponents.

StopwordAnalyzerBase stores a set of words, possibly parsed from a given file via WordlistLoader, which parses a #-commented newline-separated file.

StandardTokenizer wraps a StandardTokenizerImpl (which is implemented with help from the JFlex lexer generator) with its own scanner & configuration. I’m not entirely clear what this lexing’s for, but it primarily seems to serve to whitespace-separate words & determine which alphabet’s used.

LowerCaseFilter, of course, converts all the words sent through it to lowercase via an internal utility iterating over the char arrays.

StopFilter implements its accept method to return whether the given query term is in the stopset by wrapping CharArraySet, which in turn wraps CharArrayMap. CharArrayMap implements a simplistic hashmap around separate key (2D character array) & value arrays sized to powers of two so it can use bitmasks rather than division. Hashes are computed by multiplying each char by 31 & adding the next.
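That hashing scheme is the same rolling hash java.lang.String uses; restated as a standalone toy (not Lucene’s actual code):

class CharHash {
    // Multiply the running hash by 31 & add each character, as described above.
    static int hashChars(char[] text, int offset, int length) {
        int code = 0;
        for (int i = offset; i < offset + length; i++) {
            code = code * 31 + text[i];
        }
        return code;
    }

    public static void main(String[] args) {
        char[] word = "lucene".toCharArray();
        System.out.println(hashChars(word, 0, word.length));
    }
}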

I’m having trouble locating TokenStreamComponents; it’s probably just an output struct.
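For what it’s worth, TokenStreamComponents appears to just be a holder nested inside Analyzer, pairing the source Tokenizer with the end of the filter chain. Here’s a hedged sketch of wiring the same pipeline up by hand (my own toy analyzer with a made-up stopword list, not StandardAnalyzer’s actual source; package locations shift between Lucene versions):

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// A custom Analyzer wiring up the Tokenizer -> LowerCaseFilter -> StopFilter
// pipeline described above; the stopword list here is a made-up example.
class MiniStandardAnalyzer extends Analyzer {
    private final CharArraySet stopWords =
            new CharArraySet(Arrays.asList("the", "a", "and"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, stopWords);
        // TokenStreamComponents just bundles the source & the end of the chain.
        return new TokenStreamComponents(source, result);
    }
}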


Next the demo program constructs an IndexWriterConfig struct, which subclasses the LiveIndexWriterConfig struct, and sets some of its properties. I’m calling these “structs” because they contain little more than accessor methods.

Then it creates an IndexWriter, reads every file directly or indirectly in the specified directory, & loads them into that IndexWriter as dynamically-typed serializable Document structs.

And cleans up with time profiling & exception handling.


Lucene can index on-disk Documents, which are dynamically-typed structures with whatever properties you want, via an IndexWriter.
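Here’s a hedged sketch of that in use (field names, values, & the “index” path are my own placeholders):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class IndexOneDoc {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

        try (Directory dir = FSDirectory.open(Paths.get("index"));
             IndexWriter writer = new IndexWriter(dir, config)) {
            // A Document is just a bag of typed fields you choose yourself.
            Document doc = new Document();
            doc.add(new StringField("path", "docs/hello.txt", Field.Store.YES));
            doc.add(new TextField("contents", "Hello, search engines!", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}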

To add or update each given doc (which the demo loads from the given dir), the IndexWriter first ensures the indexfile is still open, defers to DocWriter, .process()es all atomically-queued events under a semaphore, & handles merges. Handling merges involves validation with error messages, deferring to the configured MergePolicy, & deferring to a multithreaded MergeScheduler.

That MergeScheduler, after its state-tracking & validation, defers back to each given MergeSource, each now in a new thread.

To update a document Lucene’s DocWriter first validates that the file is still open & whether there are stalled threads or pending flushes. Additional metadata is written out before each flush. Then it obtains a per-thread lock which, between tracking threading state, it defers to before considering reflushing.

Each thread, between outputting optional logs, iterates over the given docs, validating we aren’t exceeding a document cap & deferring to an IndexingChain. After iterating over all given documents it enqueues a deletion of the previous doc version to populate a DeleteSlice, which it’ll then .apply() regardless.
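That delete-then-add behaviour is what IndexWriter.updateDocument() exposes publicly; a hedged sketch, assuming a “path” field uniquely identifies the old version of a doc:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class Reindex {
    // Replaces any previously-indexed version of the doc whose "path" field
    // matches, then adds the new version: the delete-then-add flow described above.
    static void reindex(IndexWriter writer, String path, Document doc) throws Exception {
        writer.updateDocument(new Term("path", path), doc);
    }
}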


The IndexingChain will first tell its TermHash & (with exception handling) StoredFieldsConsumer to start the document, before iterating thrice over the doc’s fields with exception handling.

The first iteration gathers all the fields’ structural metadata into an inlined closed hashmap, validating there aren’t too many & updating the FieldSchema.

The second iteration initializes the FieldInfo for each of those fields & the appropriate writer for its type.

The third actually writes out the field values using the classes initialized in the previous iteration.

Beneath this is a suite of allocator & serialization classes, with or without sorting/compression, wrapping StoredFieldsWriter.

There’s essentially one shallow layer around the classes which writes raw bytes to the fields.

The basic structure of which is a little unobvious from this perspective, but seems to resemble what SQLite FTS5 does: store a prefix-compressed sorted list in each “page” (hardware & filesystems concept) of persistant storage mapping from search terms to sorted matching documents.
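To illustrate what prefix-compressing a sorted term list means (a conceptual toy of my own, not Lucene’s actual on-disk encoding), each term can be stored as the count of characters shared with its predecessor plus its distinct suffix:

import java.util.ArrayList;
import java.util.List;

class PrefixCompress {
    // Encode a sorted list of terms as (shared-prefix length, suffix) pairs.
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            while (shared < prev.length() && shared < term.length()
                    && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + term.substring(shared));
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        // ["search", "searching", "seattle"] -> ["0|search", "6|ing", "3|ttle"]
        System.out.println(encode(List.of("search", "searching", "seattle")));
    }
}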

When such a page proves too small, that’s when the concurrent merge sort kicks in.

Searching

Now that I’ve described how Apache Lucene generates its index file, how does it actually use that to speed up searching?

The first thing the demo program for this does is to initialize an IndexSearcher wrapping a given IndexReaderContext & its related classes. It further gathers a sorted array of “leaf slices” from the IndexReaderContext’s .leaves() whilst grouping them into consistently-sized buckets.
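A hedged sketch of that setup (the “index” path is again my own placeholder):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class OpenSearcher {
    public static void main(String[] args) throws Exception {
        // DirectoryReader gives the IndexSearcher its IndexReaderContext
        // (& thus its "leaves", one per index segment).
        try (Directory dir = FSDirectory.open(Paths.get("index"));
             IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println("Segments (leaves): " + reader.leaves().size());
        }
    }
}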

It also implements a .search method with various wrappers/signatures.

One of those .search wrappers/signatures dispatches the result to methods on a CollectorManager, whilst others provide their own CollectorManagers. But ultimately it defers to the appropriate LeafCollector (from the given Collector) & BulkScorer (from the given Weight), possibly after repeatedly rewriting the given query until fixpoint, then incrementing hitcounts.

That rewrite can be exposed to the caller/enduser.

Next the demo program initializes the same StandardAnalyzer pipeline used during indexing, maybe a KnnVectorDict (which uses a finite state machine to map words to numeric IDs as parsed from a file), & a stream of input queries.

Then it creates a QueryParser (written in JavaCC) to extract the structure from the enduser’s input text, using FastCharStream as a trivial lexer/scanner. There’s barely any syntax for endusers to learn…

For each whitespace-trimmed input line it parses the query, optionally resolves word IDs from that KnnVectorDict, optionally time-benchmarks the search a given number of times, before performing the search for real with additional queries to determine pagination information.
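Here’s a hedged sketch of the core of that loop (the “contents” field & page size are my own placeholders, & result-paging APIs shift a little between Lucene versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class PagedSearch {
    // Parse the enduser's text & fetch one page of hits, optionally continuing
    // from the last hit of the previous page (Lucene's searchAfter pagination).
    static TopDocs page(IndexSearcher searcher, String userText, ScoreDoc after)
            throws Exception {
        Query query = new QueryParser("contents", new StandardAnalyzer()).parse(userText);
        return after == null
                ? searcher.search(query, 10)
                : searcher.searchAfter(after, query, 10);
    }
}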

After those iterations it closes the input stream & optionally the KnnVectorDict.


Lucene ultimately calls down into a Collector implementation to load the search results from the index, of which there are several trivial decorators around other implementations (to e.g. cache results) which I’m skipping. A few also implement Scorer to calculate heuristic rankings, which I’m also skipping.

Today I’ll discuss the more substantial of Lucene’s Collectors!

Often there are multiple possible interpretations of a search query, making it worth ensuring you return results for all of them.

This works by maintaining a hashmap of priority queues, with a separate overflow priority queue. The keys for this hashmap are determined elsewhere.
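Conceptually that structure resembles the following toy, written against plain java.util rather than Lucene’s own classes, with placeholder keys & scores:

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Generic illustration of "a priority queue per key, plus an overflow queue";
// the keys & scoring here are placeholders, not Lucene's actual logic.
class BucketedTopHits {
    private final int perKeyLimit;
    private final Map<String, PriorityQueue<Double>> byKey = new HashMap<>();
    private final PriorityQueue<Double> overflow = new PriorityQueue<>();

    BucketedTopHits(int perKeyLimit) { this.perKeyLimit = perKeyLimit; }

    void collect(String key, double score) {
        PriorityQueue<Double> queue =
                byKey.computeIfAbsent(key, k -> new PriorityQueue<>());
        queue.add(score);
        // Evict the lowest score into the overflow queue once a bucket is full.
        if (queue.size() > perKeyLimit) {
            overflow.add(queue.poll());
        }
    }
}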

The central Collector appears to be GlobalOrdinalsWithScoreCollector, which uses a bitmask & 2 abstracted 2D arrays of scores/occurrences to underlie its .match() & .score() methods. It hands off, after loading a list of sorted docs, to its embedded OrdinalMapCollector or SegmentOrdinalCollector classes to gather those collections.

GlobalOrdinalsWithScoreCollector has several embedded concrete subclasses for Min, Max, Sum, Avg, or NoScore scorings.

Then there are a couple of Collectors/Scorers which track a minimum competitive score to enforce.

TopFieldCollector wraps a MultiLeafFieldComparator (itself dispatching to type-appropriate comparators for the dynamic doc fields), enforces that min competitive score, gathers results, & maybe paginates them.

Finally, for today, there’s a MultiCollector which wraps an array of other LeafCollectors, deferring to all its children’s collect methods until they have no more results to yield.
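MultiCollector is one of the few of these with an obvious public face; a hedged usage sketch combining a hit counter with a top-docs collector (collector-construction APIs vary between Lucene versions, this follows the older idiom):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiCollector;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.TotalHitCountCollector;

class TwoCollectorsAtOnce {
    // Run one search whilst feeding every hit to two collectors at once.
    static long countAndRank(IndexSearcher searcher) throws Exception {
        TotalHitCountCollector counter = new TotalHitCountCollector();
        TopScoreDocCollector top = TopScoreDocCollector.create(10, Integer.MAX_VALUE);
        Collector both = MultiCollector.wrap(counter, top);
        searcher.search(new TermQuery(new Term("contents", "lucene")), both);
        return counter.getTotalHits();
    }
}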

Then there’s SimpleCollector, which defers to its subclasses’ doSetNextReader method, which I’ll discuss tomorrow.


TopSuggestDocsCollector, a SimpleCollector subclass, wraps a priority queue, or a pending list & CharArraySet, to sort & deduplicate gathered results.

ValueSource provides its own embedded ValueSourceComparator subclass which defers to ValueSource subclasses (with embedded floating-point subclasses).

Lucene has an on-disk cache following the same format as Lucene’s main index, from which it can retrieve a cache & query ID as well as a binary “Monitor Query”. There’s a corresponding SimpleCollector to consult it.

Queries usually consist of more than one term, so there’s a factory for SimpleCollector subclasses deferring to an appropriate scoreAggregator & joinScorer. Then the factory initializes an appropriate score-computing Stream & a returned PointInSetQuery subclass.

TermGroupFacet has an embedded SimpleCollector subclass which retrieves caller-given group & facet fields (two lookups) & allocates a smallintmap before iterating over the facet results, looking up each value in the group field. Then it looks up a count depending on whether there was a prefix to the facet.

There are 2 subclasses for grouping results. The second wraps a caller-given GroupReducer instance & GroupSelector instance. The first-pass one wraps the same GroupSelector, a HashMap, & a TreeSet.

Also the first-pass collector constructs a few smallintmaps of field comparators & IDs, and serves to figure out which groups to categorize results into as data comes in.

There’s a simpler SimpleCollector which tracks the HashSet of all values it’s seen for deduplication as determined by a given GroupSelector.

BlockGroupingCollector wraps its own PriorityQueue as directed by some multi-field sorting parameters, buffering into a smallintmap before populating the priority queue.

There’s an AllGroupHeadsCollector which wraps a HashMap & GroupSelector to yield only the first entry of each group streamed into it.

There’s a FacetsCollector which performs searches by wrapping the given IndexSearcher with a TotalHitCountCollector (results stored in a TopDocs sorted collection & in turn a TotalHits struct), TopFieldCollector, or TopScoreDocCollector.
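A hedged sketch of that wrapping in action (facet/taxonomy setup omitted; the query here is a placeholder):

import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class FacetedSearch {
    // FacetsCollector.search() wraps the given searcher's search, returning the
    // usual TopDocs whilst recording per-segment matches for later facet counting.
    static TopDocs searchWithFacets(IndexSearcher searcher, FacetsCollector fc)
            throws Exception {
        return FacetsCollector.search(
                searcher, new TermQuery(new Term("contents", "lucene")), 10, fc);
    }
}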

And FeatureSortField seeks out the given “feature name” in each result, filtering results without it.

Apache Solr

Apache Solr is an abstraction around Apache Lucene (and its full suite of internationalized search heuristics!), making it easier to use as a potentially-distributed webservice.

It was a bit difficult for me to figure out where to get started, but I found the SolrDispatchFilter servlet. Which seems to be the entrypoint!

It does a little preprocessing, surrounded by rate-limiting logic via a dedicated class (a collection of atomics in a 2nd class, following the config), before considering the request.


After that preprocessing it performs authentication (unless the path is /admin/info/key, /solr/, or /) via a configured class (range of choices!) with or without logging. Then it performs some nullchecks before selecting whether to dispatch off to the v1 or v2 HTTP APIs.

With exception handling it calls the selected API version &, at its behest, forwards to the next servlet in the chain, recurses to retry, or follows its redirection.

TODO: V2HttpCall