Michael I. Lewis’s SearchMySite.Net is a curated search engine I use for finding additional interesting technical writing to link to, though there are others like Wiby.Me & Marginalia.
SearchMySite.Net is built upon Apache Solr (and thus in turn Apache Lucene), Scrapy, PostgreSQL, and (to build a now-opensource abstraction/UI layer I’m not interested in studying) Python’s Flask web framework.
Apache Lucene
Apache Lucene is server software for document search. While many relational databases like PostgreSQL & SQLite also support this, Lucene is more full-featured. So much so that it supports a range of spoken languages including Arabic, Bulgarian, Bengali, Brazilian Portuguese, Catalan, Czech, Danish, German, etc, etc, etc.
As such I’d heavily recommend Lucene or one of its abstractions for any search engine you might want to make. Unless search isn’t your app’s focus, in which case it might be easier to use e.g. Postgres’s.
I won’t discuss Lucene’s extensive localization (everything search-related I’ve read is English-centric, hence my heavy recommendation of Lucene or one of its abstractions) further; ask a linguist to explain it. Beyond saying there’s a wide range of complexities (usually trivial), and they involve parsing via a Flex/Bison Java-equivalent. Nor will I dig into the text-processing toolshed surrounding these localizations.
Instead I’ll start from the indexing (then search) demo commands.
Indexing
After parsing the commandline flags and validating that “-docs” was provided, the demo parses that filepath into an FSDirectory.
FSDirectory wraps a given filePath, which it opens via one of two (local or NIO filesystem) ByteBufferIndexInput wrappers, alongside an atomic pendingDeletes hashset & two atomic counters: one counting operations since the last deletion (which determines when to delete those pending files) & another for naming temp files. It also has directory-traversal methods.
It next constructs a StandardAnalyzer. StandardAnalyzer, which subclasses StopwordAnalyzerBase, pipelines StandardTokenizer to LowerCaseFilter to StopFilter to TokenStreamComponents.
StopwordAnalyzerBase stores a set of words, possibly parsed from a given file via WordlistLoader, which parses a #-commented newline-separated file.
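That wordlist format is simple enough to sketch in a few lines of Python (a rough equivalent of what I understand WordlistLoader to do; the helper name is mine, and I’m assuming comments only occupy whole lines):

```python
def parse_wordlist(text):
    """Parse a newline-separated wordlist, skipping #-comment lines & blanks."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line)
    return words

# e.g. a tiny English stopword file:
stopwords = parse_wordlist("# articles\nthe\na\nan\n")
```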
StandardTokenizer wraps a StandardTokenizerImpl (which is implemented with help from the JFlex lexer generator) with its own scanner & configuration. I’m not entirely clear what this lexing’s for, but it primarily seems to serve to whitespace-separate words & determine which alphabet’s used.
LowerCaseFilter, of course, converts all the words sent through it to lowercase via an internal utility iterating over the char arrays.
StopFilter implements its accept method to return whether the given query term is in the stopset by wrapping CharArraySet, which in turn wraps CharArrayMap. CharArrayMap implements a simplistic hashmap around separate key (2D character array) & value arrays sized to powers of two so it can use bitmasks rather than division. Hashes are computed by multiplying the running hash by 31 & adding each char.
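That’s the classic Java polynomial hash; a Python sketch of the idea (names are mine, not CharArrayMap’s actual internals):

```python
def char_hash(word):
    """Running hash: multiply by 31 then add each char, kept to 32 bits."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

TABLE_SIZE = 16  # a power of two...

def bucket(word):
    # ...so the modulo reduces to a cheap bitmask rather than a division.
    return char_hash(word) & (TABLE_SIZE - 1)
```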
I’m having trouble locating TokenStreamComponents; it’s probably just an output struct.
Next the demo program constructs an IndexWriterConfig struct, which subclasses the LiveIndexWriterConfig struct, and sets some of its properties. I’m calling these “structs” because they contain little more than accessor methods.
Then it creates an IndexWriter, reads every file directly or indirectly in the specified directory, & loads them into that IndexWriter as dynamically-typed serializable Document structs.
And cleans up with time profiling & exception handling.
Lucene can index on-disk Documents, which are dynamically-typed structures with whatever properties you want, via an IndexWriter.
To add or update each given doc (which the demo loads from the given dir) it first ensures the indexfile is still open, defers to DocWriter, .processes all atomically-queued events under a semaphore, & handles merges. Handling merges involves validation with error messages, deferring to the configured MergePolicy, & deferring to a multithreaded MergeScheduler.
That MergeScheduler, after its state-tracking & validation, defers back to each given MergeSource, now each in a new thread.
To update a document Lucene’s DocWriter first validates that the file is still open & handles stalled threads or pending flushes. Additional metadata is written out before each flush. Then it obtains a per-thread lock which, between tracking threading state, it defers to, before considering reflushing.
Each thread, between outputting optional logs, iterates over the given docs validating we aren’t exceeding a document cap & deferring to an IndexingChain. After iterating over all given documents it enqueues a deletion of the previous docversion to populate a DeleteSlice it’ll then regardless .apply.
The IndexingChain will first tell its TermHash & (with exception handling) StoredFieldsConsumer to start the document before iterating thrice over each field with exception handling.
The first iteration gathers all the field structural metadata into an inlined closed hashmap, validating there aren’t too many fields & updating the FieldSchema.
The second iteration initializes the FieldInfo for each of those fields & the appropriate writer for its type.
The third actually writes out the field values using the classes initialized in the previous iteration.
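Those three iterations might look roughly like this in Python (the field names & dict stand-ins are illustrative, not Lucene’s API):

```python
doc = {"title": "Lucene in Action", "year": 2004}

# Iteration 1: gather structural metadata per field (here: just its type),
# building up the schema.
schema = {name: type(value).__name__ for name, value in doc.items()}

# Iteration 2: initialise a type-appropriate writer for each field
# (plain lists standing in for Lucene's per-type writer classes).
writers = {name: [] for name in doc}

# Iteration 3: actually write the values via the writers chosen above.
for name, value in doc.items():
    writers[name].append(value)
```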
Beneath this is a suite of allocator & serialization classes, with or without sorting/compression, wrapping StoredFieldsWriter. There’s essentially one shallow layer around the classes which write raw bytes to the fields.
The basic structure of which is a little unobvious from this perspective, but seems to resemble what SQLite FTS5 does: store a prefix-compressed sorted list in each “page” (a hardware & filesystems concept) of persistent storage mapping from search terms to sorted matching documents.
When such a page is too small to hold its data, that’s when the concurrent merge sort kicks in.
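Prefix compression of a sorted term list is easy to illustrate; a hypothetical Python sketch, not Lucene’s actual on-disk encoding:

```python
def prefix_compress(sorted_terms):
    """Encode each term as (chars shared with predecessor, remaining suffix)."""
    out, prev = [], ""
    for term in sorted_terms:
        shared = 0
        while shared < min(len(prev), len(term)) and prev[shared] == term[shared]:
            shared += 1
        out.append((shared, term[shared:]))
        prev = term
    return out

def prefix_decompress(entries):
    """Invert the encoding by reusing the previous term's prefix."""
    terms, prev = [], ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms
```

Sorting is what makes this pay off: adjacent terms share long prefixes, so the suffixes stay short.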
Searching
Now that I’ve described how Apache Lucene generates its index file, how does it actually use that to speed up searching?
The first thing the demo program for this does is initialize an IndexSearcher wrapping a given IndexReaderContext & its related classes. It further gathers a sorted array of “leaf slices” from the IndexReaderContext’s .leaves() whilst grouping them into consistently-sized buckets.
It also implements a .search method with various wrappers/signatures.
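That leaf-to-bucket grouping might work something like this greedy sketch (a guess at the heuristic; Lucene’s real slicing thresholds differ):

```python
def slice_leaves(leaves, max_docs_per_slice):
    """Greedily group (name, doc_count) leaves into slices whose total
    size stays under the cap, visiting the largest leaves first."""
    slices, current, current_size = [], [], 0
    for name, size in sorted(leaves, key=lambda leaf: -leaf[1]):
        if current and current_size + size > max_docs_per_slice:
            slices.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        slices.append(current)
    return slices
```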
One of those .search wrappers/signatures dispatches the result to methods on a CollectorManager; others provide their own CollectorManagers. But ultimately it defers to the appropriate LeafCollector (from the given Collector) & BulkScorer (from the given Weight), possibly after repeatedly rewriting the given query until fixpoint, then incrementing hitcounts.
That rewrite can be exposed to the caller/enduser.
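Rewrite-until-fixpoint is a generic pattern; a toy Python sketch (the flattening rule is my own example, not one of Lucene’s actual rewrites):

```python
def rewrite_until_fixpoint(query, rewrite, max_rounds=32):
    """Apply a rewrite rule until the query stops changing."""
    for _ in range(max_rounds):
        rewritten = rewrite(query)
        if rewritten == query:  # reached the fixpoint
            return query
        query = rewritten
    raise RuntimeError("query rewriting did not converge")

def flatten(query):
    """Collapse a single-child wrapper query down to its sole subquery."""
    if isinstance(query, list) and len(query) == 1:
        return query[0]
    return query
```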
Next the demo program initializes the same StandardAnalyzer pipeline used during indexing, maybe a KnnVectorDict (which uses a Finite State machine to map words to numeric IDs as parsed from a file), & a stream of input queries.
Then it creates a QueryParser, written in JavaCC, to extract the structure from the enduser’s input text. It uses FastCharStream as a trivial lexer/scanner. Barely any syntax for endusers to learn…
For each whitespace-trimmed input line it parses the query, optionally resolves word IDs from that KnnVectorDict, optionally time-benchmarks the search a given number of times, before performing the search for real with additional queries to determine pagination information.
After those iterations it closes the input stream & optionally the KnnVectorDict.
Lucene ultimately calls down into a Collector implementation to load the search results from the index, of which there’s several trivial decorators around other implementations (to e.g. cache results) which I’m skipping. A few also implement Scorer to calculate heuristic rankings, which I’m also skipping.
Today I’ll discuss the more substantial of Lucene’s Collectors!
Often there’s multiple possible interpretations of a search query, making it worth ensuring you return results for all of them.
This works by maintaining a hashmap of priority queues, with a separate overflow priority queue. The keys for this hashmap are determined elsewhere.
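A minimal sketch of that shape, assuming a fixed per-interpretation capacity (the class & method names are mine, not Lucene’s):

```python
import heapq

class InterpretationQueues:
    """Keep the top-k hits per query interpretation, spilling evicted
    hits into a shared overflow priority queue."""
    def __init__(self, k):
        self.k = k
        self.queues = {}    # interpretation key -> min-heap of (score, doc)
        self.overflow = []  # shared min-heap for evicted hits

    def collect(self, key, score, doc):
        queue = self.queues.setdefault(key, [])
        heapq.heappush(queue, (score, doc))
        if len(queue) > self.k:
            # Evict the lowest-scoring hit for this interpretation.
            heapq.heappush(self.overflow, heapq.heappop(queue))
```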
The central Collector appears to be GlobalOrdinalsWithScoreCollector, which uses a bitmask & 2 abstracted 2D arrays of scores/occurrences to underlie its .match() & .score() methods. It hands off, after loading a list of sorted docs, to its OrdinalMapCollector or SegmentOrdinalCollector embedded classes to gather those collections.
GlobalOrdinalsWithScoreCollector has several embedded concrete subclasses for Min, Max, Sum, Avg, or NoScore scorings.
There’s a couple Collectors/Scorers to track a minimum competitive score to enforce.
TopFieldCollector wraps a MultiLeafFieldComparator (itself dispatching to type-appropriate comparators for the dynamic doc fields), enforces that min competitive score, gathers results, & maybe paginates them.
Finally, for today, there’s a MultiCollector which wraps an array of other LeafCollectors, deferring to all its children’s collect methods until they have no more results to yield.
Then there’s SimpleSelector, which defers to its subclasses’ DoSetNextReader method, which I’ll discuss tomorrow.
TopSuggestDocsCollector, a SimpleSelector subclass, wraps a priority queue, or a pending list & CharArraySet, to sort & deduplicate gathered results.
ValueSource provides its own embedded ValueSourceCompator subclass which defers to ValueSource subclasses (with embedded subclasses for floating point).
Lucene has an on-disk cache following the same format as Lucene’s main index from which it can retrieve a cache & query ID as well as a binary “Monitor Query”. There’s a corresponding SimpleCollector to consult it.
Queries usually consist of more than one term, so there’s a factory for SimpleCollector subclasses deferring to an appropriate scoreAggregator & joinScorer. Then the factory initializes an appropriate score-computing Stream, & a returned PointInSetQuery subclass.
TermGroupFacet has an embedded SimpleCollector subclass which retrieves caller-given group & facet fields (two lookups) & allocates a smallintmap before iterating over the facet results looking up each value in the group field. Then it looks up a count depending on whether there was a prefix to the facet.
There’s 2 subclasses for grouping results. The second wraps a caller-given GroupReducer instance & GroupSelector instance. The first-pass wraps the same GroupSelector, a HashMap, & a TreeSet.
Also the first-pass constructs a few smallintmaps of field comparators & IDs, and serves to figure out which groups to categorize into as data comes in.
There’s a simpler SimpleCollector which tracks a HashSet of all values it’s seen for deduplication, as determined by a given GroupSelector.
BlockGroupingCollector wraps its own PriorityQueue as directed by some multi-field sorting parameters, buffering into a smallintmap before populating the priority queue.
There’s an AllGroupHeadsCollector which wraps a HashMap & GroupSelector to yield only the first entry of each group streamed into it.
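The group-heads idea is simple enough to sketch in Python (a guess at the behaviour, not Lucene’s actual code):

```python
def group_heads(docs, group_of):
    """Yield only the first document seen for each group key,
    relying on dicts preserving insertion order."""
    heads = {}
    for doc in docs:
        key = group_of(doc)
        if key not in heads:
            heads[key] = doc
    return list(heads.values())
```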
There’s a FacetsCollector which performs searches by wrapping the given IndexSearcher with a TotalHitCountCollector (results stored in a TopDocs sorted collection & in turn a TotalHits struct), TopFieldsCollector, or TopScoreDocCollector.
And FeatureSortField seeks out the given “feature name” in each result, filtering results without it.
Apache Solr
Apache Solr is an abstraction around Apache Lucene (and its full suite of internationalized search heuristics!), making it easier to use as a potentially-distributed webservice.
It was a bit difficult for me to figure out where to get started, but I found the SolrDispatchFilter servlet. Which seems to be the entrypoint!
It does a little preprocessing, surrounded by rate-limiting logic via a dedicated class (a collection of atomics in a 2nd class, following config), before considering the request.
After that preprocessing it performs authentication (if not /admin/info/key, /solr/, or /) via a configured class (range of choices!) with or without logging. Then it performs some nullchecks before selecting whether to dispatch off to the v1 or v2 HTTP APIs.
With exception handling it calls the selected API version & at its behest forwards to the next servlet in the chain, recurses to retry, or follows its redirection.
V2 HTTP API
After parsing the path & querystring from the given request, Solr’s V2 HTTP API class (with exception handling) first checks for an empty path, in which case it returns a documentation link & hardcoded description; otherwise it gets & checks the path-prefix, looking up an appropriate plugin. It checks if that prefix is a builtin type so it can defer to CompositeApi for later handling, otherwise deferring to the plugin, saving aside certain properties.
For “c” or “collections” prefixes it runs the core querying logic & saves results or throws an exception. For the “cores” prefix it saves parameters.
If no core was computed Solr flags this request as an “admin” operation. Before finally configuring multithreading, running a final Request Parser, & handling exceptions.
Other Java Servlet methods defer to the class selected by that logic, surrounded by error-checks/handling.
That core-logic I mentioned wraps a “ZK State Reader” & “DocCollection Supplier lambda” to resolve a collection’s name, trying twice. I take it that it’s the plugin’s responsibility to do something with the returned collection.
Parsing
Looking over Apache Solr’s parsing subsystem, I see an Exception subclass including message formatting with unescaping logic. Another exception subclass assembles syntax error messages.
There’s a Token structure to serve as the AST.
And of course a parser generator is used to define the syntax itself, with separate lexer & parser! With utility methods building that AST, most of which are in a separate subclass.
Request Handlers
Looking through Solr’s request handler classes, I see ones which:
- Sends the search query through the context’s registered analyzer; typically including its char filters, tokenizer, & token filters; whilst sorting & grouping these tokens utilizing struct innerclasses.
- Rewrite the query to apply to a particular field.
- Expose a backups API via an inner CollectionsHandler.
- A more extensive API wrapping the same CollectionsHandler class.
- HTTP-stream the Processor’s results via a DocumentLoader.
- Report configuration & status serialized from other classes.
- Dereference files with their metadata & hashes into the common structures.
- Serialize processed request data directly into the response.
- Parameterized integration of ExportWriter output & possibly Apache Zookeeper.
- Postprocessing upon queried results to annotate how they match the query, according to provided XML.
- Parse out multiple field types & fieldnames.
- Multipass convert search terms to sigmoided weighted-sum score, offering “explanations” from its docstream & modelstream.
- Attach additional metadata to the stream context.
- HTTP-streams concatenated files.
- Score results based on haversine metrics.
- 404
- Query results which match similar terms to another result.
- Incremental backups.
- Flag requests to be handled in “realtime”.
- Handle pings, including amongst “shards”.
- Restore from backup in server filesystem.
- Query the document schema.
- Load updated data into Lucene.
- Capture database snapshots.
- Aggregate responses by composing some innerclasses.
- Alter Solr’s configuration, & that of its plugins & shards.
- Propagate information to shards.
- Propagate queries across shards according to their index files.
All of these requesthandlers share a superclass, and are dynamically dispatched to from the servlets.
Request Handler Subsystems
The Solr request-handlers subsystem described above includes several sub-subsystems.
There’s extensive admin HTTP-APIs collating technical internal information including Apache Zookeeper integration, with its subsystem rerouting more modern HTTP-APIs to the core logic. Occasionally this core logic involves some statistics. Including extensive replication APIs.
There’s utilities for parsing components of HTTP-requests.
There’s a miscellaneous “components” subsystem including spellchecking, non-distributed configuration, sharding utilities, collections & other datamodelling, statistics like heatmaps, checking replicas in realtime, various facets, more complex sorting, automated suggestions with underlying term processing, more-like-this, scoring phrases, selecting a pivotal facet, cancelling queries, merge strategies, active task lists, highlighting where matches occurred, & more.
There’s a simple configuration API.
There’s a somewhat sophisticated “designer” subsystem to help webmasters decide on configurations before they’ve got real data.
There’s a suite of classes for serializing results, with sorted fields.
There’s classes for importing documents into the underlying Lucene database.
And there’s automated document tagging.
Plugins infrastructure
Most of Apache Solr, from what I gather from studying its servlets, is implemented as plugins. So here I’ll study its extensions framework!
This maintains several collections, can cluster query parameters to pass to the given handler class (discussed last time), can query counts from a Solr index searcher, can run a Solr index searcher over a doc list with optional match-highlighting, can parse debugging flags, convert explanations to serializable mappings & query those explanations.
There’s a staticmethod to parse & evaluate a query.
There’s regexp-based parsers for field-boosts.
There’s a counter for optional & necessary query clauses, & a regexp-based parser for that configuration.
There’s a boolean-query optimizer/flattener.
There’s an autoescaper.
There’s strippers for (regexp-based) invalid operators, unbalanced quotes, & null queries.
There’s a results converter to simpler objects.
There’s a queryparser subclass which gathers “disjuncts”.
There’s a routine abstracting sortspec parsers to take the appropriate query parameter, as well as one for queries.
There’s an introspective setter-caller.
A routine to query a “purposes” mapping to build a string, & an iterator over that purposes mapping.
Miscellaneous/Overview
Skimming over Apache Solr’s codebase to see what I missed as I attempted to trace its logic, I see subsystems of:
- A randomized benchmark (do they use this for fuzzing?) & other minor tests.
- A long case-sensitive word list, supporting those.
- Shellscript to aid running Solr.
- A few postprocessors on output text that can be invoked from configuration.
- An HTTP API (couple main classes, with trivial support classes).
- A client for that HTTP API.
- A sizable subsystem (which is becoming decreasingly useful as harddrives enlarge) to manage multi-machine deployments. With several small-ish classes aggregating various cluster state, or updating collections via (often) a handful of support classes.
Surrounding that HTTP API is a larger suite of code distributing load, managing its own collections. Including automated elections for which instance should coordinate things.
Of course there’s plenty of locking in this “cloud” subsystem! It’s implemented with the aid of Apache ZooKeeper, via a couple abstractions, with a CLI interface.
- A separate subsystem provides messages with which to synchronize “clusters”.
Amongst the “core” subsystem there’s routines for saving, modelling, & restoring backups with their repositories & snapshots.
There’s some core Java-interfaces & data modelling for everything to build around.
That “core” subsystem includes logic others can hook into to dispatch requests or whatever. And generally it provides infrastructure for everything else to use. Also there’s caching, configuration, plugin-management, etc.
Having skimmed through several such URL handlers (many of which defer to other Java classes, with or without light tweaks), I see a couple which wrap object properties or throw exceptions for deprecated APIs. Then there’s a few which defer to the CollectionsHandler instances & related classes adapting it to an HTTP API. Another wraps Apache Zookeeper. There’s classes & interfaces abstracting a collection of files to validate & index, persisting that index to disk in a distributed network. It has its own HTTP APIs.
There’s “admin” URL handlers. There’s a suite of operations admins can perform, many of which reformat responses from methods on the “core container”. There’s “security handlers”, several of which wrap Apache Zookeeper, or custom configuration files. There’s request handlers, some of which also optionally wrap Zookeeper, or decorate other requests to e.g. add diffs or distributed computing. Several wrap accessors, possibly with analysis. And there’s infrastructure to hook those together. There’s a SolrEnvironment struct.
Also Apache Solr’s admin URL handlers can be used to e.g. trigger an “election” for which machine should coordinate all the others. Votes are cast randomly; everyone’s equally-qualified for the job. Or they could parse a registry file. Amongst other similar tasks, largely related to distributed computing & indexing. There’s aggregate admin operations, & backup/restore admin ops.
Skimming more of Apache Solr’s URL handlers… There’s a supportclass to merge spellchecking collections & other collection wrappers, various “replica sources”, an interface for computing stats, utils abstracting HTTP networking, & several classes tweaking requests/responses. There’s HTTP endpoints for managing configsets, enqueueing commands to apply those operations if permitted. Including a class processing responses. Or retrieving properties from the core container.
One of those configsets handlers cleans up empty files, or uploads files. The decorators around search queries include things like analytics (with supporting classes), suggested queries, sharding (with supporting classes), & so much more. There’s a suite of APIs (with supporting classes) for loading up sample documents to see how different configurations impact searches upon them, with recommendations using a heatmap, diffing, & more.
There’s several classes aiding serialization & comparison of key-value documents, including an implementation of a Heap. And there’s a separate suite of importers from different formats. Another suite of URL handlers & supporting classes for applying “tags” to documents. There’s a suite of “source evaluators”. And there’s more URL handlers, most of which are similar to ones I’ve already mentioned. As well as supporting utils.
It took me a bit to figure out where to resume my skimming of Apache Solr, but I see subsystems for…
- Transforming encodings to desired output.
- Further abstractions for writing HTTP responses, including for the specific formats Solr deals with.
- Additional logic powering the ReST APIs, including an inner subsystem specifically for analysis & XML conversion, much of which is dedicated to file processing.
Studying the “search” module, it appears to provide a massive suite of trivial classes to be composed to evaluate a parsed search query. These operate on the dynamically-typed record datamodel I described recently.
There’s several “facet” such classes at least, including amongst other things iterators & aggregations upon those documents. A handful of these are non-trivial, or bundle several trivial classes into a single Java file.
There’s several classes implementing functions that can be called from the search query, including a suite implementing various formulas for distances between points or other datatypes. There’s also a suite of trivial grouping aggregators (& a collector class), whether using distributed computing or not. There’s a suite of mostly-trivial join interpreters. There’s a suite outputting statistics. Another suite computing similarity heuristics. And a couple small AI suites.
Finally there’s a larger suite of even more trivial classes which can be composed to interpret a search query, alongside exceptions to throw & builders (or as Solr calls them, “qparsers”) to composite new instances of these classes. As well as string-parser classes & collection types to aid implementing the rest of this. And of course there’s collections in which to look up functions, the most trivial of which are defined whilst populating those collections. Not to mention XML (de)serialization.
Basically, this all resembles the core of a relational database. But with a nicer syntax intended for non-developers (then again, SQL was intended for non-developers; that didn’t work out…).
I see a “security” subsystem including authentication, authorization, accounting, & a plugin system. As well as configurable rules. There’s another suite of (largely trivial) servlet abstractions. There’s a large suite of classes to handle misspelling, including a sub-subsystem handling suggestions out of FST, Jasper, or TST files. This requires some datamodelling in addition to the collection lookups. There’s a negation “uninvert” optimization pass.
I see a large subsystem for updating the database, including a preprocessor with several classes (sharing an interface) for trivially populating its different fields. Including regexps. Or parsing different field types. As well, of course, as some surrounding logic to actually write the changes to the database, including vector clocks & logic for splitting indices.
Then there’s a large suite of utilities, including classes for managing RAM & CPU usage, configuration & its “providers” especially relating to SSL, binary serialization & deserialization, plugin infrastructure, stats-gathering API, & trace-debugging. As well as a larger bucket containing collections, thread-synchronizing, commandline-interfacing, & so much more! Some of these appear to be duplicated from Java’s stdlib, but specialized to Solr’s performance needs.
Some more small-to-trivial submodules in Apache Solr. Including:
- A couple generic yet internationalized analyzers
- Data clustering
- Extracting data from common file formats (wasn’t that a different Apache project, Tika?).
- Client for Google Cloud Storage
- Hadoop authentication client
- Client for Hadoop Distributed File System
- Export trace-logging to Apache Jaeger
- JSON Web Tokens implementation
- Language IDs, with sniffing
- Extract Solr “features” into machine learning
- Amazon S3 storage client.
- Abstraction over a range of scripting languages.
- Specialized SQL abstraction.
Also there’s auxiliary files for packaging, various support commands (e.g. converting Solr’s metrics to Prometheus’s format), documentation, build configuration, testing, etc.
Scrapy
Search engines also need to gather webpages to be searched. For that SearchMySite uses Scrapy, with the atypical twist that they only index curated & consenting sites.
Scrapy is a heavily optimized Python module upon which you can implement such crawlers.
In the main module there’s various interfaces for:
- loading spiders (with a dynamic import implementation)
- signal management
- stats collection (with some basic implementations)
- Parsing robots.txt
- Filtering duplicate requests (with fingerprinting implementation)
There’s main files exposing Scrapy as a commandline program, or as a Python module. There’s various collections for e.g. addons & “middleware”, & caching for e.g. DNS. There’s various exceptions.
There’s datamodelling covering:
- Links
- Scraped “items”
There’s scheduling for:
- Serializing (“pickling”) results
- Requesting pages via priority queue spreading out across domains
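That domain-spreading queue can be sketched like so (a simplified guess at the behaviour; Scrapy’s actual scheduler & slot classes are more involved):

```python
import heapq
from collections import deque
from urllib.parse import urlparse

class DomainFairQueue:
    """Priority-queue requests, round-robining across domains so one
    large site can't starve the others."""
    def __init__(self):
        self.by_domain = {}      # domain -> min-heap of (priority, url)
        self.rotation = deque()  # domains in round-robin order

    def push(self, url, priority=0):
        domain = urlparse(url).netloc
        if domain not in self.by_domain:
            self.by_domain[domain] = []
            self.rotation.append(domain)
        heapq.heappush(self.by_domain[domain], (priority, url))

    def pop(self):
        domain = self.rotation[0]
        self.rotation.rotate(-1)  # next pop serves a different domain
        priority, url = heapq.heappop(self.by_domain[domain])
        if not self.by_domain[domain]:
            del self.by_domain[domain]
            self.rotation.remove(domain)
        return url
```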
There’s a couple files implementing a commandline interface for configuring the crawlers. There’s a class for parsing network responses. There’s various classes for outputting results in various fileformats, email, or logging. And there’s a couple classes integrating all these components.
Subcommands
Scrapy includes a number of subcommands you can invoke, including:
- Opening a downloaded webpage in the browser (presumably so you can see any diff from the browser doing so)
- Display version numbers, possibly including dependencies
- Iterate over running spiders to print them
- Open a specified crawler script in your configured commandline text editor
- Carefully start the “crawler process”
- Runs a server & benchmarks its crawlers against it.
- A custom testrunner
- Fetch a webpage to stdout
- Generate a template Python script implementing a crawler upon Scrapy, & another for an entire project
- Output results from fetching & processing a URL
- Import a specified crawler & run it
- Output specified settings
- Run a REPL shell
And there’s some superclasses.
A separate submodule provides a class (and subclasses) for the unit tests against your crawler to subclass, run via the built-in testrunner subcommand.
Scrapy Core
Scrapy’s Core submodule includes a “downloader” subsubsystem which schedules dispatching URLs to data:, file:, ftp:, http: (special implementations for 1.0, 1.1, & 2), & s3: (delegating to HTTP). Building on the Twisted networking framework (queue that project up). Supports invoking “middleware” callbacks, & optionally logging TLS certificate checks.
Scrapy implements HTTP2 itself, with pooling.
The outer & more complex scheduler queues items both in-memory & on-disk (they’re expecting to have massive task queues I see…) tracking stats & deduplicating tasks. A separate class delegates authentication & tracks the active downloads as well as the spiders triggering them. And another tracks error/success stats, & yet another collects the spiders’ outputs.
Scrapy Middleware
Scrapy’s crawlers can compose “middlewares” to postprocess their HTTP requests before sending them.
Some builtin ones offer to:
- Attach a useragent string
- Authenticate using HTTP Basic Auth
- Cancel requests on a timeout
- Attach other configurable HTTP headers
- Gather request statistics
- Reroute requests via a proxy server
- Follow HTTP redirects
- Retry failed requests
- Check robots.txt
- Traverse “AJAX crawlable” pages
- Caching
- Cookies
Also there’s 2 separate middlewares for decompressing responses, whether said compression was indicated at the HTTP level or the fileformat level.
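A toy middleware combining both detection strategies (a sketch only; Scrapy’s real middlewares handle more encodings & edge cases, and this class is mine, not Scrapy’s API):

```python
import gzip

class GzipDecompressionMiddleware:
    """Decompress a response body when either the HTTP headers or the
    payload's magic bytes say it's gzipped."""
    def process_response(self, headers, body):
        http_says_gzip = headers.get("Content-Encoding") == "gzip"
        magic_says_gzip = body[:2] == b"\x1f\x8b"  # the gzip file signature
        if http_says_gzip or magic_says_gzip:
            return gzip.decompress(body)
        return body
```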
Scrapy Extensions
Today I’ll quickly go over a bunch of builtin “extensions” which can run inside Scrapy to process your spiders’ data, which includes:
- emailing you the gathered stats
- Adds persistence to closing & opening spiders.
- exposing a telnet console (though I’m not following what happens over this telnet console…)
- Output Python’s garbage collection stats
- Log gathered stats
- Collect some stats from spiders.
- Outputting callstacks
- Interfacing to GDB
- Throttle spiders’ requests further.
- Several for compressing output using different algorithms.
- Monitor memory usage, & if it passes configurable limits outputs warnings (stdout & email) or intentionally crashes.
- Compresses outputs in a choice of formats.
- Periodically logs various stats.
- A full HTTP caching implementation.
- Generates webfeeds & outputs them to a choice of storage servers.
- Autoclose spiders based on certain conditions.
HTTP Abstractions
Scrapy includes datamodelling around Twisted’s HTTP client.
Including a Request object with Form, RPC, JSON subclasses; as well as a Response object with Text & in turn XML & HTML subclasses.
As well as a headers dictionary & a cookie jar.
Other Submodules
Skimming over a handful of Scrapy’s more trivial submodules there’s…
- A wrapper class around XPath queries
- Utilities for assembling processing pipelines
- Configuration infrastructure (as Python globals) with defaults.
- Spider decorators/”middleware” including limiting graph-traversal depth, HTTP error handling, blocking external link traversal, attaching Referer headers, & capping URI length
- Extract traversable links from HTML or XML, checking filetype for whether to do so
- Module-loading infrastructure
- A spider superclass with declarative APIs & subclasses for webfeeds & sitemaps.
The pipeline assembling utils have filters that can handle files (across several backends), images, & media.
There’s template directories to get you started building Scrapy “projects” & their “spiders”, with Scrapy subcommands evaluating these templates.
Scrapy Utilities
And there’s several utils to help implement the rest of Scrapy, including:
- Collecting various version numbers
- Tracking live references
- URL processing
- Unit-testing
- Evaluating such a template file
- OpenSSL abstractions
- Processing sitemaps
- JSON encoding
- Event dispatch/logging
- Handling spiders
- Register shutdown handlers to Twisted
- Managing a Scrapy project
- Locating a “JOBS” dir
- Cache parsed URLs
- Iterate over extracted data
- Abstract GUnzip
- Logging
- FTP abstractions
- Collect various stats from the execution engine
- Nicer terminal output
- Useful “function decorators”
- Test whether “botocore” is installed
- Internal server to benchmark against
- Abstracts async generators
- Configuration
- CURL syntax
- Twisted deferreds
- Etc, etc, etc
This is all very trivial.