
Advanced Indexing Techniques with Apache Lucene - Payloads

Advanced Indexing Techniques with Apache Lucene - Payloads. Michael Busch (buschmi@apache.org). http://people.apache.org/~buschmi/apachecon/. Agenda: Part 1: Inverted Index 101 (Posting Lists; Stored Fields vs. Payloads). Part 2: Use cases for Payloads (BoostingTermQuery; Simple facet counting).


Presentation Transcript


  1. Advanced Indexing Techniques with Apache Lucene - Payloads. Michael Busch (buschmi@apache.org). http://people.apache.org/~buschmi/apachecon/

  2. Agenda
     • Part 1: Inverted Index 101
       • Posting Lists
       • Stored Fields vs. Payloads
     • Part 2: Use cases for Payloads
       • BoostingTermQuery
       • Simple facet counting

  3. Lucene’s data structures
     [Diagram: a search runs against the Inverted Index and produces Hits; stored fields are then retrieved from the Store to build the Results.]

  4. Query: not
     c:\docs\einstein.txt: The important thing is not to stop questioning.
     c:\docs\shakespeare.txt: To be or not to be.
     String comparison slow! Solution: Inverted index

  5. Inverted index
     Query: not

     Term          Document IDs
     be            1
     important     0
     is            0
     not           0, 1
     or            1
     questioning   0
     stop          0
     to            0, 1
     the           0
     thing         0

     Document 0: c:\docs\einstein.txt: The important thing is not to stop questioning.
     Document 1: c:\docs\shakespeare.txt: To be or not to be.

  6. Inverted index
     Query: "not to"
     [Diagram: the same term-to-document-ID index as on the previous slide, with the token positions marked in each document.]

     Document 0: c:\docs\einstein.txt: The(0) important(1) thing(2) is(3) not(4) to(5) stop(6) questioning(7).
     Document 1: c:\docs\shakespeare.txt: To(0) be(1) or(2) not(3) to(4) be(5).

  7. Inverted index
     Query: "not to"

     Term          Document ID: Positions
     be            1: 1, 5
     important     0: 1
     is            0: 3
     not           0: 4    1: 3
     or            1: 2
     questioning   0: 7
     stop          0: 6
     to            0: 5    1: 0, 4
     the           0: 0
     thing         0: 2

     Document 0: c:\docs\einstein.txt: The important thing is not to stop questioning.
     Document 1: c:\docs\shakespeare.txt: To be or not to be.
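
The positional index on this slide can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene code: the class name `PositionalIndexSketch`, its methods, and the letters-only tokenizer are all hypothetical. It maps each term to its document IDs and positions, and answers a two-word phrase query by checking for adjacent positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> (docId -> positions). Not Lucene code. */
class PositionalIndexSketch {
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Lower-cases the text, splits on non-letters, and records each token's position. */
    void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Single-term query: the document IDs in the term's posting list. */
    List<Integer> search(String term) {
        TreeMap<Integer, List<Integer>> list = postings.get(term);
        return list == null ? new ArrayList<Integer>() : new ArrayList<Integer>(list.keySet());
    }

    /** Two-word phrase query: documents where `second` occurs directly after `first`. */
    List<Integer> searchPhrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        TreeMap<Integer, List<Integer>> a = postings.get(first);
        TreeMap<Integer, List<Integer>> b = postings.get(second);
        if (a == null || b == null) return hits;
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) continue; // `second` is absent from this document
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) { // adjacent positions form the phrase
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}
```

With the two example documents added as IDs 0 and 1, `searchPhrase("not", "to")` matches both, because "not" is directly followed by "to" at positions 4/5 in document 0 and 3/4 in document 1.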

  8. Inverted index with Payloads
     [Diagram: the positional index for the two example documents, with a payload stored next to a position in the posting lists. Columns: Document IDs, Positions, Payloads.]

     Document 0: c:\docs\einstein.txt: The important thing is not to stop questioning.
     Document 1: c:\docs\shakespeare.txt: To be or not to be.

  9. So far…
     • String comparison slow
     • Inverted index used to accelerate search
     • Store positions in posting lists to allow phrase searches
     • Store payloads in posting lists to store arbitrary data with each position

  10. Lucene’s data structures
      [Diagram: a search runs against the Inverted Index and produces Hits; stored fields are then retrieved from the Store to build the Results.]

  11. Store
      Documents: Field 1: title, Field 2: content, Field 3: hashvalue
      [Diagram: the store lays out documents D0, D1, D2 one after another, each with its fields F1, F2, F3.]

  12. Store
      [Diagram: documents D0, D1, D2, each with its fields F1, F2, F3.]
      • Optimized for random access
      • Document-locality

  13. Posting list with Payloads
      [Diagram: a posting list with one entry per document, each holding position 0 and the document's F3 value as payload, shown next to the store layout of D0, D1, D2. Columns: Document IDs, Positions, Payloads.]
      • Optimized for scanning and skipping
      • Space-efficient encoding

  14. Agenda
      • Part 1: Inverted Index 101
        • Posting Lists
        • Stored Fields vs. Payloads
      • Part 2: Use cases for Payloads
        • BoostingTermQuery
        • Simple facet counting

  15. Payloads - API
      org.apache.lucene.analysis.Token
        void setPayload(Payload payload)
      org.apache.lucene.index.Payload
        Payload(byte[] data)
        Payload(byte[] data, int offset, int length)

  16. Payloads - API
      org.apache.lucene.index.TermPositions
        boolean next()
        int doc()
        int freq()
        int nextPosition()
        int getPayloadLength()
        byte[] getPayload(byte[] data, int offset)

  17. Example: BoostingTermQuery
      Use case:
      • Score certain occurrences of a term higher than others
      • E.g.: Query: 'warning'
        doc1: "HURRICANE WARNING"
        doc2: "The Warning Label Generator is a fun way to generate your own warning labels!" (www.warninglabelgenerator.com)

  18. Example: BoostingTermQuery
      Analyzer:
        final byte BoldBoost = 5;
        …
        Token token = new Token(…);
        …
        if (isBold) {
          // mark bold occurrences with a one-byte boost payload
          token.setPayload(new Payload(new byte[] {BoldBoost}));
        }
        …
        return token;

  19. Example: BoostingTermQuery
      Similarity:
        Similarity boostingSimilarity = new DefaultSimilarity() {
          // @Override
          public float scorePayload(byte[] payload, int offset, int length) {
            if (length == 1) return payload[offset]; // one-byte payload is the boost
            return 1; // no usable payload: neutral factor
          }
        };

  20. Example: BoostingTermQuery
      BoostingTermQuery:
        Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
      Searching:
        Searcher searcher = new IndexSearcher(…);
        searcher.setSimilarity(boostingSimilarity);
        …
        Hits hits = searcher.search(btq);
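
The effect of the boost on the 'warning' example can be sketched without Lucene. The slides do not say how the payload score is folded into the final score, so this sketch assumes one plausible rule: the base term score is multiplied by the average payload score over all occurrences in the document. The class `PayloadBoostSketch` and the method `boostedScore` are hypothetical; only `scorePayload` follows the slide.

```java
/** Plain-Java sketch of payload-based boosting. Names are illustrative, not Lucene API. */
class PayloadBoostSketch {
    static final byte BOLD_BOOST = 5;

    /** Follows the slide's scorePayload: a one-byte payload is the boost, anything else is neutral. */
    static float scorePayload(byte[] payload, int offset, int length) {
        if (length == 1) return payload[offset];
        return 1f;
    }

    /**
     * Assumed combination rule (not stated on the slides): base score times the
     * average payload score over all occurrences of the term in the document.
     * An empty byte array stands for an occurrence without a payload.
     */
    static float boostedScore(float baseScore, byte[][] payloadsPerOccurrence) {
        if (payloadsPerOccurrence.length == 0) return baseScore;
        float sum = 0f;
        for (byte[] payload : payloadsPerOccurrence) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return baseScore * (sum / payloadsPerOccurrence.length);
    }
}
```

Under this assumption, a document whose single occurrence of the term is bold gets five times its base score, while a document with two plain occurrences keeps its base score unchanged.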

  21. Example from java-user: Unique Doc Ids
      Use case:
      • Store a unique document id (UID) that maps to a row in a database table
      • Retrieve UID at search time to influence matching/scoring
      • FieldCache takes too long to load

  22. Example from java-user: Unique Doc Ids
      Solution:
      • Index one special term for each document, e.g. ID:UID
      • Index one occurrence for each document
      • Store UID in the Payload of the occurrence

  23. Example from java-user: Unique Doc Ids
      For indexing: TokenStream
        class SinglePayloadTokenStream extends TokenStream {
          boolean done = false;
          public void setUID(int uid) {...}
          public Token next() throws IOException {
            if (done) return null; // emit exactly one token per document
            Token token = new Token("UID");
            // payload data must be a byte[]; the int-to-bytes conversion is not shown on the slide
            token.setPayload(new Payload(uid));
            done = true;
            return token;
          }
        }

  24. Example from java-user: Unique Doc Ids
      For retrieving: TermPositions
        public int[] getCachedUIDs(IndexReader reader) throws IOException {
          int[] cache = new int[reader.maxDoc()];
          TermPositions tp = reader.termPositions(new Term("ID", "UID"));
          byte[] buffer = new byte[4];
          while (tp.next()) {          // iterate over docs
            tp.nextPosition();         // only one pos per doc
            tp.getPayload(buffer, 0);  // copy the 4-byte UID payload into buffer
            cache[tp.doc()] = bytesToInt(buffer);
          }
          return cache;
        }
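
The slides call `bytesToInt` but never show it, nor the matching conversion needed at indexing time. A minimal sketch of such a pair follows, assuming a 4-byte big-endian layout. The class name `UidPayloadCodec` and the method `intToBytes` are hypothetical, and the actual encoding used in the talk is not given.

```java
import java.nio.ByteBuffer;

/** Hypothetical helpers for storing an int UID in a 4-byte payload (big-endian). */
class UidPayloadCodec {
    /** Encodes the UID as 4 bytes, most significant byte first. */
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Decodes the first 4 bytes of the buffer back into the UID. */
    static int bytesToInt(byte[] buffer) {
        return ByteBuffer.wrap(buffer, 0, 4).getInt();
    }
}
```

A fixed 4-byte width matches the `new byte[4]` buffer on the retrieval slide, so the same buffer can be reused for every document while scanning the posting list.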

  25. Example from java-user: Unique Doc Ids
      Performance:
      • Load UIDs for 2M docs into memory
      • FieldCache: 16.5 s
      • Payloads: 430 ms

  26. Example: (Very) Simple facet counting
      Use case:
      • Collection with docs from different sources
      • Show top-n results from each source instead of top-n results from entire collection

  27. Example: (Very) Simple facet counting
      Analyzer:
        public TokenStream tokenStream(String fieldName, Reader reader) {
          if (fieldName.equals("_facet")) {
            return new TokenStream() {
              boolean done = false;
              public Token next() {
                if (done) return null; // emit exactly one token per document
                Token token = new Token(…);
                // store the hash of the source URL as the facet value
                token.setPayload(new Payload(computeHash(url)));
                done = true;
                return token;
              }
            };
          }
          … // handling of all other fields is not shown on the slide
        }

  28. Example: (Very) Simple facet counting
      HitCollector:
      • Use different PriorityQueues for different sites
      • Instead of returning top-n results of the whole data set, return top-n results per site
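
The per-site collection described on this slide can be sketched in plain Java. This is an illustration under assumptions, not the talk's HitCollector: the class `PerSiteTopN`, its `collect` signature, and the idea of passing the site hash in directly (rather than reading it from the `_facet` payload) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Keeps the n best-scoring hits per site, one bounded priority queue per site hash. */
class PerSiteTopN {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(h -> h.score);

    private final int n;
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) { this.n = n; }

    /** Called once per matching document; siteHash stands in for the value read from the payload. */
    void collect(int doc, float score, int siteHash) {
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) {
            queue = new PriorityQueue<Hit>(BY_SCORE); // min-heap: lowest score on top
            queues.put(siteHash, queue);
        }
        queue.add(new Hit(doc, score));
        if (queue.size() > n) queue.poll(); // evict the weakest hit of this site
    }

    /** Document IDs of the site's best hits, highest score first. */
    List<Integer> topDocs(int siteHash) {
        List<Integer> docs = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) return docs;
        List<Hit> hits = new ArrayList<Hit>(queue);
        hits.sort(BY_SCORE.reversed());
        for (Hit hit : hits) docs.add(hit.doc);
        return docs;
    }
}
```

Each queue is a min-heap capped at n entries, so memory stays proportional to the number of sites times n regardless of how many documents match.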

  29. Example: (Very) Simple facet counting
      Summary
      • In this example: facet (site) used for scoring, but extendable for facet counting
      • Good performance due to locality of facet values

  30. Example: Efficient Numeric Search
      Use case:
      • Find documents that have a numeric value in a specific range, e.g. all docs with a date > 2006 and < 2007
      Currently in Lucene:
      • RangeQuery
      • Store all values in the dictionary
      • Query expansion

  31. Example: Efficient Numeric Search
      [Diagram: a dictionary with one term per date (01/01/2006, 01/02/2006, 01/04/2006, …, 12/30/2006), each pointing to its own posting list.]
      Query: [01/05/2006 TO 11/25/2006]
      Problem: A large number of posting lists have to be processed

  32. Example: Efficient Numeric Search
      Idea:
      • Index a special term, e.g. 'numeric:date', and store the actual value in a Payload for each doc
      • Problem: the posting list can become very big, and the entire list has to be processed
      • Solution: Hybrid approach

  33. Example: Efficient Numeric Search
      [Diagram: a dictionary with one term per month (date:01/2006, date:02/2006, …, date:12/2006), each pointing to a posting list with Document IDs, Positions, and Payloads.]
      • Store day in payload
      • Store position where date occurred

  34. Example: Efficient Numeric Search
      • Tradeoff between the number of posting lists to process and the size of each posting list
      • Significant speedup possible with a good choice of chunk size
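
The hybrid approach can be sketched in plain Java with a month as the chunk size, as on the previous slide: one dictionary entry per month, and the day of month kept as a one-byte payload in each posting. The class `HybridDateIndexSketch` and its methods are hypothetical, and a `TreeMap` merely stands in for the term dictionary. Only the first and last month of a range need their payloads checked.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy hybrid date index: one posting list per month, day of month stored as payload. */
class HybridDateIndexSketch {
    static final class Posting {
        final int doc;
        final byte day; // the payload
        Posting(int doc, byte day) { this.doc = doc; this.day = day; }
    }

    private final TreeMap<YearMonth, List<Posting>> dictionary = new TreeMap<>();

    void add(int doc, LocalDate date) {
        dictionary.computeIfAbsent(YearMonth.from(date), m -> new ArrayList<>())
                  .add(new Posting(doc, (byte) date.getDayOfMonth()));
    }

    /** Documents with from <= date <= to (both bounds inclusive). */
    TreeSet<Integer> range(LocalDate from, LocalDate to) {
        TreeSet<Integer> hits = new TreeSet<>();
        if (from.isAfter(to)) return hits;
        YearMonth first = YearMonth.from(from);
        YearMonth last = YearMonth.from(to);
        // Visit only the month terms inside the range: at most one list per month.
        for (Map.Entry<YearMonth, List<Posting>> entry
                : dictionary.subMap(first, true, last, true).entrySet()) {
            YearMonth month = entry.getKey();
            for (Posting p : entry.getValue()) {
                // Payloads only matter in the boundary months; inner months match entirely.
                boolean afterStart = !month.equals(first) || p.day >= from.getDayOfMonth();
                boolean beforeEnd = !month.equals(last) || p.day <= to.getDayOfMonth();
                if (afterStart && beforeEnd) hits.add(p.doc);
            }
        }
        return hits;
    }
}
```

For the query [01/05/2006 TO 11/25/2006] this touches at most 11 month lists instead of one list per distinct day, at the cost of filtering by payload in January and November. A larger chunk (a year) would mean fewer but longer lists, which is the tradeoff the slide names.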

  35. Conclusion
      • Payloads offer great flexibility
      • Payloads are stored very space-efficiently
      • Sophisticated data structures enable efficient skipping over payloads
      • Payloads should be used whenever special data is required for finding hits and scoring

  36. Outlook
      • Finalize API (currently Beta)
      • Add more out-of-the-box query types
      • Per-document Payloads – updateable
      • FieldCache implementation that uses Payloads

  37. Questions? http://people.apache.org/~buschmi/apachecon/
