Introduction to the Xapian Search Engine • Sébastien François, EPrints Lead Developer • EPrints Developer Powwow, ULCC
Presentation • Open Source Search Engine Library • Written in C++ (we use the Perl bindings) • Uses the BM25 ranking function to rank results by relevance • “Scales well”: 100+ million documents • Oh… code that we don’t need to maintain!
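To give a feel for what BM25 does, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Xapian's implementation (Xapian's weighting scheme has its own parameters), and the function name, the defaults `k1=1.2` and `b=0.75`, and the toy corpus are assumptions made for illustration.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with textbook Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalisation: long documents need more occurrences to score as well
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

# Toy corpus: each document is just a list of already-tokenised terms
corpus = [
    ["open", "access", "repository"],
    ["search", "engine", "library"],
    ["xapian", "search", "search", "index"],
]
scores = [bm25_score(["search", "xapian"], doc, corpus) for doc in corpus]
```

Rare terms (low document frequency) contribute more than common ones, and repeated occurrences help with diminishing returns; that is the "relevance" part of the ranking.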
Core Concepts • Database • Document • data • terms • values • (Xapian) Metadata management • Searching • Are you ready for it?
Core Concepts: Database • Collection of files storing indexes, positions, term frequencies, … • One write-lock, multiple read-locks (a single writer at a time, many concurrent readers) • Stored in archives/<id>/var/xapian/ • Supports multiple DBs (unused in EPrints) • Can store arbitrary metadata
Core Concepts: Document • A Document is an item returned by a search • So it’s also the meaty bit of indexing • Maps to a single data-obj in EPrints • Has three main components: • data • terms • values
Core Concepts: Document Data • Arbitrary blob of data • Un-processed by Xapian • Used to store information needed to display the results • Used to store the data-obj identifier in EPrints in order to quickly build EPrints::List objects • Could be used to store more complex data: cached citations, JSON/Perl representation of the data-obj • Limit ~100MB per Document
Core Concepts: Document Terms • Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB • Three main types of terms: • Un-prefixed terms: can be seen as a general pool of indexed terms • Prefixed terms: allow searching a sub-set of the information (title, authors…) • Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)
Core Concepts: Document Terms (2) • Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; each value appears 0 or 1 times in a Document. • Textual data - TermGenerator class: • Provides the Stemmer and Stopper classes (note: language-dependent) • Spelling correction • Exact phrase matching (“hello world”), which relies on term positions (the termpos joys)
Core Concepts: Document Terms (3) • Unprefixed terms used for the simple search • Prefixed terms used for a field-based search (such as the advanced search) • Boolean terms used for any identifier-type fields – this includes facets (when searching)
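The three kinds of terms can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what EPrints or Xapian's TermGenerator actually does: there is no stemming, the stopword list is a tiny made-up one, and the prefixes (`S` for title, `A` for author, `XTYPE` for the boolean type term) follow common Xapian conventions but are assumptions here, as are the record field names.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def index_terms(record):
    """Return the set of terms a record would be indexed under."""
    terms = set()
    for word in tokenize(record["title"]):
        terms.add(word)        # un-prefixed: general pool, used by the simple search
        terms.add("S" + word)  # prefixed: only matches a title search
    for name in record["creators"]:
        for word in tokenize(name):
            terms.add(word)
            terms.add("A" + word)  # prefixed: only matches an author search
    # Boolean term: exact identifier-like value, no text processing
    terms.add("XTYPE" + record["type"])
    return terms

terms = index_terms({
    "title": "Introduction to the Xapian Search Engine",
    "creators": ["Smith, Jane"],
    "type": "article",
})
```

A simple search for `xapian` would hit the un-prefixed term, a title search would look for `Sxapian`, and a filter on type would require `XTYPEarticle`.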
Core Concepts: Document Values • “search helpers” – we use them for ordering and faceting (occurrences & available facets) • Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer) • Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata • eprint.creators_name.en (1000000) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English
Core Concepts: Document Values (2) • eprint.facet.type.0 (1500300) is the 1st slot for a facet “type” on the dataset eprint • Used by the MultiValueSorter class to order data (when not ordered by relevance) • Used to find out available facets (after a search) and the occurrences of the values e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’ • Xapian documentation advises keeping the number of values low (too many slow down searching) • We usually limit the number of slots for a facet to 5
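Ordering by several value slots can be pictured with the following plain-Python sketch. It mimics the idea behind a multi-value sorter, not Xapian's MultiValueSorter class itself. Slot 1000000 is the creators_name order-value slot quoted above; slot 1000001 (a date order-value) and the document layout are made up for the example.

```python
NAME_SLOT = 1000000  # eprint.creators_name.en, as quoted in the slides
DATE_SLOT = 1000001  # hypothetical slot for a date order-value

def multi_value_sort(docs, keys):
    """Sort docs by several value slots.

    keys is a list of (slot, reverse) pairs, most significant first.
    Values are compared as strings, so they must be stored in a sortable form.
    """
    ordered = list(docs)
    # Apply the least significant key first; Python's sort is stable,
    # so earlier passes survive as tie-breakers.
    for slot, reverse in reversed(keys):
        ordered.sort(key=lambda d: d["values"].get(slot, ""), reverse=reverse)
    return ordered

docs = [
    {"id": 1, "values": {NAME_SLOT: "smith", DATE_SLOT: "2012"}},
    {"id": 2, "values": {NAME_SLOT: "jones", DATE_SLOT: "2013"}},
    {"id": 3, "values": {NAME_SLOT: "smith", DATE_SLOT: "2013"}},
]
# By creator name ascending, then newest first
by_name_then_date = multi_value_sort(docs, [(NAME_SLOT, False), (DATE_SLOT, True)])
sorted_ids = [d["id"] for d in by_name_then_date]
```

Because the comparison is on stored strings, the order-values have to be prepared at indexing time (normalised names, zero-padded numbers and so on).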
Core Concepts: Metadata management • We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us) • EPrints reserves 1 000 000 slots per dataset: • 500 000 for order-values (1 per orderable field) • 500 000 for facet slots (1 per facetable value) • EPrints also stores the current slot offsets to know: • where the range for the next dataset starts • where the next order-value slot is • EPrints also stores some other useful information as Metadata
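The bookkeeping described above can be sketched as a small slot registry. This is a guess at the arithmetic, not the EPrints code: the class and method names are hypothetical, a Python dict stands in for the Xapian DB metadata, the first dataset is assumed to start at 1 000 000 (inferred from the creators_name example), and each facet is assumed to take 5 consecutive slots. There is no check for a range overflowing.

```python
SLOTS_PER_DATASET = 1_000_000
ORDER_RANGE = 500_000  # first half: order-values, second half: facet slots
SLOTS_PER_FACET = 5    # assumption: the usual per-facet limit from the slides

class SlotRegistry:
    def __init__(self, first_base=1_000_000):
        self.metadata = {}   # stands in for Xapian DB metadata: key -> slot
        self.offsets = {}    # per dataset: next free order slot / facet slot
        self.next_base = first_base  # where the next dataset's range starts

    def _dataset(self, name):
        if name not in self.offsets:
            base = self.next_base
            self.offsets[name] = {"order": base, "facet": base + ORDER_RANGE}
            self.next_base += SLOTS_PER_DATASET
        return self.offsets[name]

    def order_slot(self, dataset, field, lang):
        """Return (allocating if needed) the order-value slot for a field."""
        key = f"{dataset}.{field}.{lang}"
        if key not in self.metadata:
            offsets = self._dataset(dataset)
            self.metadata[key] = offsets["order"]
            offsets["order"] += 1
        return self.metadata[key]

    def facet_slots(self, dataset, field):
        """Return (allocating if needed) the slots reserved for a facet."""
        keys = [f"{dataset}.facet.{field}.{i}" for i in range(SLOTS_PER_FACET)]
        if keys[0] not in self.metadata:
            offsets = self._dataset(dataset)
            for key in keys:
                self.metadata[key] = offsets["facet"]
                offsets["facet"] += 1
        return [self.metadata[k] for k in keys]

reg = SlotRegistry()
name_slot = reg.order_slot("eprint", "creators_name", "en")
type_slots = reg.facet_slots("eprint", "type")
```

Asking for the same key twice returns the same slot, which is the whole point of persisting the mapping: slots must stay stable between indexing runs.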
Core Concepts: Searching • Reverse process of indexing • Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators • $query = new Query( "hello" ); $query = new Query( AND, $query, "world" ); • Can be stringified to see how the query is interpreted (easier to read than SQL!)
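The query-tree idea can be mocked up with a toy class. This `Query` is a stand-in written for this example, not the Xapian class, and its stringified form is simpler than what Xapian prints; it only shows how nested queries combine and how stringifying exposes the structure.

```python
class Query:
    """Toy query node: either a single term or an operator over sub-queries."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.parts = None, [args[0]]       # leaf: one term
        else:
            self.op, self.parts = args[0], list(args[1:])  # operator + children

    def __str__(self):
        if self.op is None:
            return str(self.parts[0])
        return "(" + f" {self.op} ".join(str(p) for p in self.parts) + ")"

query = Query("hello")
query = Query("AND", query, "world")
nested = Query("OR", query, Query("xapian"))
print(query)   # (hello AND world)
print(nested)  # ((hello AND world) OR xapian)
```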
Core Concepts: Searching - QueryParser • Parses user queries • Supports: • wildcards: wild* will match wildcat • boolean op’s: pear AND (red OR green NOT blue) • love/hate op’s: crab +nebula -crustacean • exact match: “lorem ipsum” • synonyms: colour/color, realise/realize • stemming: happiness/happy -> happi • suggestions: may provide a corrected query • Features can be turned on/off (all are enabled on EPrints)
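As an illustration of one of these features, here is a rough sketch of the love/hate operators in plain Python. It is a simplification and not Xapian's QueryParser: the function names are hypothetical, there is no stemming, phrase or wildcard handling, and plain terms are treated as "any of these" only when no `+` term is given.

```python
def parse_love_hate(text):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for token in text.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())   # "love": must be present
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())   # "hate": must be absent
        else:
            optional.append(token.lower())       # plain term
    return required, excluded, optional

def matches(doc_terms, parsed):
    """Decide whether a document's terms satisfy a parsed query."""
    required, excluded, optional = parsed
    terms = set(doc_terms)
    if any(t in terms for t in excluded):
        return False
    if required:
        return all(t in terms for t in required)
    return any(t in terms for t in optional)

parsed = parse_love_hate("crab +nebula -crustacean")
```

With this query, a document about the Crab Nebula matches, while one that also mentions crustaceans is filtered out; in a real engine the plain term `crab` would additionally boost the relevance score.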
Core Concepts: Search - Enquire • The object which runs the query • Alternative ordering methods can be applied • A MatchDecider method may be provided to filter out results (in fact, we use that to compute facets) • Returns an MSet (Match Set) which contains the actual matching Documents
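The trick of computing facets from a match decider can be sketched as follows. This is a toy model of the flow, not Xapian's Enquire: `run_enquire`, the document layout, and the hit-count "relevance" are all invented for the example. The point is that the decider sees every candidate match, so it can count facet values as a side effect while letting everything through.

```python
from collections import Counter

def run_enquire(docs, query_terms, decider=None, limit=10):
    """Return the data blobs of matching docs, best match first."""
    mset = []
    for doc in docs:
        hits = sum(1 for t in query_terms if t in doc["terms"])
        if hits == 0:
            continue  # not a match at all
        if decider is not None and not decider(doc):
            continue  # the decider may veto a match
        mset.append((hits, doc["data"]))
    mset.sort(key=lambda item: item[0], reverse=True)  # crude relevance order
    return [data for _, data in mset[:limit]]

facet_counts = Counter()

def counting_decider(doc):
    """Accept every match, but record its facet value on the way."""
    facet_counts[doc["values"]["type"]] += 1
    return True

docs = [
    {"data": "1", "terms": {"xapian", "search"}, "values": {"type": "article"}},
    {"data": "2", "terms": {"search"}, "values": {"type": "book"}},
    {"data": "3", "terms": {"perl"}, "values": {"type": "article"}},
]
results = run_enquire(docs, ["xapian", "search"], decider=counting_decider)
```

Here the data blob plays the role of the data-obj identifier, and `facet_counts` ends up holding the occurrences per facet value for the matching documents only.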
Final words • http://xapian.org • architecture overview • documentation • advice for implementation • Questions? • EPrints implementation…