Introduction to the Xapian Search Engine

Sébastien François, EPrints Lead Developer. EPrints Developer Powwow, ULCC.



  1. Introduction to the Xapian Search Engine
     Sébastien François, EPrints Lead Developer
     EPrints Developer Powwow, ULCC

  2. Presentation
     • Open Source Search Engine Library
     • Written in C++ (we use the Perl bindings)
     • Uses the BM25 ranking function, which provides relevance ranking
     • “Scales well”: 100+ million documents
     • Oh… code that we don’t need to maintain!
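To make the BM25 bullet concrete, here is a minimal sketch of the textbook Okapi BM25 score in plain Python. It is not Xapian's implementation (Xapian's weighting scheme has further parameters and normalisations); the function name `bm25_score`, the toy corpus, and the values of `k1` and `b` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document for a bag-of-words query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates and is normalised by document length.
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus (assumption): each document is already a list of terms.
corpus = [
    ["xapian", "search", "engine", "library"],
    ["eprints", "repository", "software"],
    ["search", "the", "repository", "with", "xapian", "xapian"],
]
scores = [bm25_score(["xapian", "engine"], doc, corpus) for doc in corpus]
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

The first document wins because it contains the rarer term "engine" and is shorter than average; the second document shares no query term and scores zero.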

  3. Core Concepts
     • Database
     • Document
       • data
       • terms
       • values
     • (Xapian) Metadata management
     • Searching
     • Are you ready for it?

  4. Core Concepts: Database
     • Collection of files storing indexes, positions, term frequencies, …
     • Single writer (one write lock), multiple concurrent readers
     • Stored in archives/<id>/var/xapian/
     • Supports multiple databases (unused in EPrints)
     • Can store arbitrary metadata

  5. Core Concepts: Document
     • A Document is an item returned by a search
     • So it’s also the meaty bit of indexing
     • Maps to a single data-obj in EPrints
     • Has three main components:
       • data
       • terms
       • values

  6. Core Concepts: Document Data
     • Arbitrary blob of data
     • Not processed by Xapian
     • Used to store the information needed to display the results
     • In EPrints, used to store the data-obj identifier in order to quickly build EPrints::List objects
     • Could be used to store more complex data: cached citations, a JSON/Perl representation of the data-obj
     • Limit: ~100MB per Document
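As an illustration of the "arbitrary blob" idea, the sketch below serialises a data-obj identifier (plus an optional cached citation) into bytes and recovers the identifiers from a list of result blobs. The slide says EPrints stores only the identifier; the JSON layout and the names `make_document_data` and `ids_from_results` are hypothetical.

```python
import json


def make_document_data(dataobj_id, citation=None):
    """Serialise what is needed to display a result into an opaque blob."""
    payload = {"id": dataobj_id}
    if citation is not None:
        payload["citation"] = citation  # optional cached rendering
    return json.dumps(payload).encode("utf-8")


def ids_from_results(blobs):
    """Recover data-obj identifiers from the blobs of matching documents."""
    return [json.loads(blob.decode("utf-8"))["id"] for blob in blobs]


# The search engine never looks inside these blobs; it only hands them back.
blobs = [
    make_document_data(42),
    make_document_data(7, citation="Example citation"),
]
ids = ids_from_results(blobs)
```

Because the identifiers come straight out of the search results, a result list can be built without an extra lookup per match.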

  7. Core Concepts: Document Terms
     • Basis of relevance search: a search is the process of comparing the terms specified by a Query against the terms in the DB
     • Three main types of terms:
       • Un-prefixed terms: can be seen as a general pool of indexed terms
       • Prefixed terms: allow searching a subset of the information (title, authors…)
       • Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)

  8. Core Concepts: Document Terms (2)
     • Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; a value appears 0 or 1 times in a Document.
     • Textual data: the TermGenerator class
       • Provides the Stemmer and Stopper classes (note: language-dependent)
       • Spelling correction
       • Exact matching (“hello world”) and the termpos joys

  9. Core Concepts: Document Terms (3)
     • Un-prefixed terms are used for the simple search
     • Prefixed terms are used for field-based searches (such as the advanced search)
     • Boolean terms are used for any identifier-type field; this includes facets (when searching)
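The three kinds of terms can be sketched without any search library. The stop list, the naive `toy_stem` suffix stripper, and the `title:` / `type:` prefix strings below are all assumptions for illustration; a real index would use the library's stemmer and stopper and its own prefix conventions.

```python
STOPWORDS = {"the", "a", "of", "and"}  # toy stop list (assumption)


def toy_stem(word):
    """Very naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ness", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def text_terms(text, prefix=""):
    """Lower-case, drop stop words, stem, and optionally prefix each word."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [prefix + toy_stem(w) for w in words if w and w not in STOPWORDS]


def index_terms(record):
    terms = []
    # Un-prefixed terms: the general pool used by the simple search.
    terms += text_terms(record["title"])
    # Prefixed terms: the same text, but searchable as one field only.
    terms += text_terms(record["title"], prefix="title:")
    # Boolean term: an exact identifier, with no text processing at all.
    terms.append("type:" + record["type"])
    return terms


terms = index_terms({"title": "The Joys of Indexing", "type": "article"})
# terms == ["joy", "index", "title:joy", "title:index", "type:article"]
```

Indexing the title twice is what lets a simple search and a title-only search both find the same document.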

  10. Core Concepts: Document Values
     • “Search helpers”: we use them for ordering and faceting (occurrences and available facets)
     • Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)
     • Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata
     • eprint.creators_name.en (1000000) is the slot for the order-value of the field “creators_name” on the dataset “eprint” for English

  11. Core Concepts: Document Values (2)
     • eprint.facet.type.0 (1500300) is the 1st slot for the facet “type” on the dataset “eprint”
     • Used by the MultiValueSorter class to order data (when not ordered by relevance)
     • Used to find out the available facets (after a search) and the occurrences of their values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’
     • The Xapian documentation advises keeping the number of values low (too many slow down searching)
     • We usually limit the number of slots for a facet to 5
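Ordering by value slots can be pictured as a sort on a tuple of slot contents, most significant slot first. This is only a library-free sketch of the idea, not the MultiValueSorter API; slot 1000000 is taken from the slides, while slot 1000001 (a date order-value), the sample documents, and the name `sort_by_slots` are hypothetical.

```python
def sort_by_slots(docs, slots):
    """Order documents by the values in the given slots, first slot first."""
    return sorted(
        docs,
        key=lambda d: tuple(d["values"].get(slot, "") for slot in slots),
    )


CREATORS_SLOT = 1000000  # from the slides: eprint.creators_name.en
DATE_SLOT = 1000001      # hypothetical second order-value slot

docs = [
    {"id": 1, "values": {CREATORS_SLOT: "smith", DATE_SLOT: "2013"}},
    {"id": 2, "values": {CREATORS_SLOT: "jones", DATE_SLOT: "2012"}},
    {"id": 3, "values": {CREATORS_SLOT: "jones", DATE_SLOT: "2011"}},
]
order = [d["id"] for d in sort_by_slots(docs, [CREATORS_SLOT, DATE_SLOT])]
```

Here the two "jones" documents are tied on the first slot, so the date slot decides between them.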

  12. Core Concepts: Metadata management
     • We need to keep track of our slot mappings in the Xapian Database (Xapian does not do this for us)
     • EPrints reserves 1 000 000 slots per dataset:
       • 500 000 for order-values (1 per orderable field)
       • 500 000 for facet slots (1 per facetable value)
     • EPrints also stores the current slot offsets to know:
       • where the range for the next dataset starts
       • where the next order-value slot is
     • EPrints also stores some other useful information as metadata
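The bookkeeping described above can be sketched as a small registry that hands out slots and remembers the mappings as string key/value pairs, the way database metadata would. The range sizes come from the slides; the class `SlotRegistry`, its key names, and the starting offset are assumptions, and the sketch does not try to reproduce the exact EPrints layout (for instance the 1500300 facet slot quoted earlier).

```python
SLOTS_PER_DATASET = 1_000_000  # from the slides
ORDER_SLOTS = 500_000          # from the slides; the remainder is for facets
FIRST_OFFSET = 1_000_000       # assumption, matches the slides' example slot


class SlotRegistry:
    """Allocates value slots; `metadata` stands in for Xapian DB metadata."""

    def __init__(self):
        self.metadata = {}  # key -> string value

    def _base(self, dataset):
        key = f"base.{dataset}"
        if key not in self.metadata:
            base = int(self.metadata.get("next.dataset", FIRST_OFFSET))
            self.metadata[key] = str(base)
            self.metadata[f"next.order.{dataset}"] = str(base)
            self.metadata[f"next.facet.{dataset}"] = str(base + ORDER_SLOTS)
            # Remember where the range for the next dataset starts.
            self.metadata["next.dataset"] = str(base + SLOTS_PER_DATASET)
        return int(self.metadata[key])

    def slot_for(self, dataset, name, kind):
        """Return the slot for `name`; `kind` is "order" or "facet"."""
        base = self._base(dataset)
        key = f"slot.{dataset}.{name}"
        if key not in self.metadata:
            counter = f"next.{kind}.{dataset}"
            slot = int(self.metadata[counter])
            end = base + (ORDER_SLOTS if kind == "order" else SLOTS_PER_DATASET)
            if slot >= end:
                raise OverflowError(f"no {kind} slots left for {dataset}")
            self.metadata[key] = str(slot)
            self.metadata[counter] = str(slot + 1)
        return int(self.metadata[key])


registry = SlotRegistry()
order_slot = registry.slot_for("eprint", "creators_name.en", "order")
facet_slot = registry.slot_for("eprint", "facet.type.0", "facet")
other_slot = registry.slot_for("user", "name.en", "order")
```

Asking for the same name twice returns the same slot, which is the property the stored mappings exist to guarantee across indexing runs.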

  13. Core Concepts: Metadata management (2)

  14. Core Concepts: Searching
     • Reverse process of indexing
     • Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators
       • $query = new Query( "hello" );
       • $query = new Query( AND, $query, "world" );
     • Can be stringified to see how the query is interpreted (easier to read than SQL!)
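The query tree and its stringification can be imitated in a few lines. The `Query` class below is a toy written for this transcript, not the Xapian class: the operator names, the output format of `__str__`, and the `matches` helper are assumptions.

```python
class Query:
    """Toy query tree: a leaf holds a term, a node holds an operator."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.term, self.children = None, args[0], []
        else:
            op, *operands = args
            self.op, self.term = op, None
            # Bare strings are wrapped into leaf queries.
            self.children = [
                o if isinstance(o, Query) else Query(o) for o in operands
            ]

    def __str__(self):
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"

    def matches(self, doc_terms):
        """Evaluate the tree against the set of terms of one document."""
        if self.op is None:
            return self.term in doc_terms
        results = [child.matches(doc_terms) for child in self.children]
        if self.op == "AND":
            return all(results)
        if self.op == "OR":
            return any(results)
        raise ValueError(f"unsupported operator: {self.op}")


query = Query("hello")
query = Query("AND", query, "world")
combined = Query("OR", query, "xapian")
```

Printing `combined` gives `((hello AND world) OR xapian)`, which shows at a glance how the tree was put together.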

  15. Core Concepts: Searching - QueryParser
     • Parses user queries
     • Supports:
       • wildcards: wild* will match wildcat
       • boolean ops: pear AND (red OR green NOT blue)
       • love/hate ops: crab +nebula -crustacean
       • exact match: “lorem ipsum”
       • synonyms: colour/color, realise/realize
       • stemming: happiness/happy -> happi
       • suggestions: may provide a corrected query
     • Features can be turned on/off (all are enabled in EPrints)
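As a small illustration of the love/hate operators, the sketch below splits a user query into required, optional, and excluded terms and applies them to a document's term set. It is a simplification and not the QueryParser itself: the function names are hypothetical, and the rule that optional terms only matter when nothing is required is an assumption about the intended semantics (in a ranking engine they would also influence the score).

```python
def split_love_hate(user_query):
    """Split '+term' (required), '-term' (excluded) and plain terms."""
    required, optional, excluded = [], [], []
    for token in user_query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, optional, excluded


def love_hate_matches(doc_terms, required, optional, excluded):
    """Boolean view of the operators; ranking is deliberately left out."""
    if any(term in doc_terms for term in excluded):
        return False
    if required:
        return all(term in doc_terms for term in required)
    return any(term in doc_terms for term in optional)


parts = split_love_hate("crab +nebula -crustacean")
hit = love_hate_matches({"crab", "nebula"}, *parts)
miss = love_hate_matches({"nebula", "crustacean"}, *parts)
```

With this query, a document about the Crab Nebula is kept, while anything mentioning "crustacean" is dropped regardless of its other terms.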

  16. Core Concepts: Search - Enquire
     • The object which runs the query
     • Alternative ordering methods can be applied
     • A MatchDecider may be provided to filter out results (in fact, we use that to compute facets)
     • Returns an MSet (Match Set), which contains the actual matching Documents
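The trick of computing facets with a decider can be sketched as a callback that sees every candidate match, counts its facet values, and accepts it anyway. This is a library-free imitation: `FacetCountingDecider`, `run_query`, and the sample documents are hypothetical, and slot 1500300 is simply the facet slot quoted in the slides.

```python
from collections import Counter


class FacetCountingDecider:
    """Accepts every document, but counts facet values as a side effect."""

    def __init__(self, facet_slots):
        self.facet_slots = facet_slots  # facet name -> value slot number
        self.counts = {name: Counter() for name in facet_slots}

    def __call__(self, doc):
        for name, slot in self.facet_slots.items():
            value = doc["values"].get(slot)
            if value is not None:
                self.counts[name][value] += 1
        return True  # never filters anything out; it only observes


def run_query(docs, term, decider):
    """Toy search: the decider is consulted only for documents that match."""
    return [d for d in docs if term in d["terms"] and decider(d)]


TYPE_SLOT = 1500300  # from the slides: eprint.facet.type.0
docs = [
    {"id": 1, "terms": {"xapian"}, "values": {TYPE_SLOT: "article"}},
    {"id": 2, "terms": {"xapian"}, "values": {TYPE_SLOT: "book"}},
    {"id": 3, "terms": {"xapian"}, "values": {TYPE_SLOT: "article"}},
    {"id": 4, "terms": {"eprints"}, "values": {TYPE_SLOT: "article"}},
]
decider = FacetCountingDecider({"type": TYPE_SLOT})
mset = run_query(docs, "xapian", decider)
```

After the run, `decider.counts["type"]` holds the occurrences for the matching documents only, which is exactly what a facet sidebar needs.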

  17. Final words
     • http://xapian.org
       • architecture overview
       • documentation
       • advice for implementation
     • Questions?
     • EPrints implementation…
