Advanced Indexing Techniques with Apache Lucene - Payloads
Michael Busch (buschmi@apache.org)

Agenda
• Part 1: Inverted Index 101
  • Posting Lists
  • Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  • BoostingTermQuery
  • Simple facet counting

Lucene’s data structures
[Diagram: a search runs against the Inverted Index and yields Hits; stored fields are then retrieved from the Store to produce the Results.]

Query: not
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
• String comparison is slow!
• Solution: inverted index

Inverted index
Query: not
Documents:
  Doc 0 (c:\docs\einstein.txt): The important thing is not to stop questioning.
  Doc 1 (c:\docs\shakespeare.txt): To be or not to be.
Term: document IDs
  be: 1
  important: 0
  is: 0
  not: 0, 1
  or: 1
  questioning: 0
  stop: 0
  the: 0
  thing: 0
  to: 0, 1

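The term-to-document mapping on this slide can be sketched in plain Java, without Lucene. This is only an illustration: the class name `InvertedIndexSketch` and the naive analysis step (lowercase, split on non-letters) are assumptions, not how Lucene tokenizes or stores its index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Maps each term to the sorted list of IDs of the documents that contain it.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive analysis (assumption): lowercase, then split on non-letters.
            for (String term : docs.get(docId).toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document only once, even if the term repeats in it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build(Arrays.asList(
                "The important thing is not to stop questioning.",
                "To be or not to be."));
        // Answering the query "not" is now a single map lookup.
        System.out.println("not: " + index.get("not"));
    }
}
```

A query for a single term becomes a dictionary lookup instead of a string comparison against every document.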
Inverted index
Query: "not to"
Word positions in each document:
  Doc 0 (c:\docs\einstein.txt): The(0) important(1) thing(2) is(3) not(4) to(5) stop(6) questioning(7)
  Doc 1 (c:\docs\shakespeare.txt): To(0) be(1) or(2) not(3) to(4) be(5)
Term: document IDs
  be: 1
  important: 0
  is: 0
  not: 0, 1
  or: 1
  questioning: 0
  stop: 0
  the: 0
  thing: 0
  to: 0, 1

Inverted index
Query: "not to"
Term: document ID (positions)
  be: 1 (1, 5)
  important: 0 (1)
  is: 0 (3)
  not: 0 (4), 1 (3)
  or: 1 (2)
  questioning: 0 (7)
  stop: 0 (6)
  the: 0 (0)
  thing: 0 (2)
  to: 0 (5), 1 (0, 4)

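To make the role of positions concrete, here is a small plain-Java sketch of a positional index and a two-term phrase lookup. The class and method names are hypothetical, and the tokenization is the same naive split used for illustration only; Lucene's own phrase matching is more general.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndexSketch {
    // term -> (document ID -> positions of the term in that document)
    static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            int position = 0;
            for (String term : docs.get(docId).toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(position++);
            }
        }
        return index;
    }

    // Returns the IDs of documents in which `second` directly follows `first`.
    static List<Integer> phraseQuery(Map<String, Map<Integer, List<Integer>>> index,
                                     String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> a = index.getOrDefault(first, Collections.emptyMap());
        Map<Integer, List<Integer>> b = index.getOrDefault(second, Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) continue; // second term not in this document
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) { // adjacent positions form the phrase
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Integer>>> index = build(Arrays.asList(
                "The important thing is not to stop questioning.",
                "To be or not to be."));
        System.out.println(phraseQuery(index, "not", "to"));
    }
}
```

Both example documents contain "not" immediately followed by "to" (positions 4, 5 and 3, 4), so both match the phrase.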
Inverted index with Payloads
Documents:
  Doc 0 (c:\docs\einstein.txt): The important thing is not to stop questioning.
  Doc 1 (c:\docs\shakespeare.txt): To be or not to be.
[Diagram: the positional inverted index for these two documents, with a payload slot attached to each position. Levels of each posting list: Document IDs, Positions, Payloads.]

So far…
• String comparison is slow
• An inverted index is used to accelerate search
• Positions are stored in posting lists to allow phrase searches
• Payloads are stored in posting lists to attach arbitrary data to each position

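As a rough picture of the last point, the sketch below extends a positional posting list so that every position can carry a byte array. It is plain Java with hypothetical names (`PayloadPostingsSketch`, `Occurrence`) and says nothing about Lucene's actual on-disk encoding.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PayloadPostingsSketch {
    // One occurrence of a term: where it appears and the bytes attached to it.
    static final class Occurrence {
        final int position;
        final byte[] payload; // empty array when the position carries no payload

        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    // term -> (document ID -> occurrences in that document)
    private final Map<String, Map<Integer, List<Occurrence>>> postings = new TreeMap<>();

    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new ArrayList<>())
                .add(new Occurrence(position, payload));
    }

    List<Occurrence> occurrences(String term, int docId) {
        return postings.getOrDefault(term, Collections.emptyMap())
                       .getOrDefault(docId, Collections.emptyList());
    }

    public static void main(String[] args) {
        PayloadPostingsSketch index = new PayloadPostingsSketch();
        // "not" at position 4 of document 0, tagged with an arbitrary one-byte payload.
        index.add("not", 0, 4, new byte[] {5});
        Occurrence first = index.occurrences("not", 0).get(0);
        System.out.println(first.position + " -> " + first.payload[0]);
    }
}
```

Because the payload sits next to the position, a query that already walks the posting list can read it without a separate lookup.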
Lucene’s data structures
[Diagram: a search runs against the Inverted Index and yields Hits; stored fields are then retrieved from the Store to produce the Results.]

Store
Documents with three fields:
• Field 1: title
• Field 2: content
• Field 3: hashvalue
[Diagram: the Store holds documents D0, D1, D2 one after another, each with its fields F1, F2, F3 stored together.]

Store
[Diagram: documents D0, D1, D2, each with its fields F1, F2, F3 stored together.]
• Optimized for random access
• Document-locality

Posting list with Payloads
[Diagram: field F3 is crossed out in the Store for D0, D1, D2; instead, a posting list holds one entry per document (Document IDs), each at position 0 (Positions) with the F3 value as its payload (Payloads).]
• Optimized for scanning and skipping
• Value-locality

Agenda
• Part 1: Inverted Index 101
  • Posting Lists
  • Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  • BoostingTermQuery
  • Simple facet counting

Payloads - API
org.apache.lucene.analysis.Token
    void setPayload(Payload payload);
org.apache.lucene.index.TermPositions
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset);

Example: BoostingTermQuery
Analyzer:
    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    …
    // Tokens from bold text carry the boost as a one-byte payload
    if (isBold) {
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    …
    return token;

Example: BoostingTermQuery
Similarity:
    Similarity boostingSimilarity = new DefaultSimilarity() {
        // @Override
        public float scorePayload(byte[] payload, int offset, int length) {
            // A one-byte payload holds the boost written by the analyzer
            if (length == 1) return payload[offset];
            // Any other payload length: neutral factor
            return 1;
        }
    };

Example: BoostingTermQuery
BoostingTermQuery:
    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);

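The effect of the payload on the score can be illustrated without Lucene. The sketch below assumes that the term score is multiplied by the average payload score over the term's occurrences in a document; that combination rule, the class name `PayloadBoostSketch`, and the method `boostedScore` are assumptions made for illustration and are not taken from the slides.

```java
import java.util.Arrays;
import java.util.List;

class PayloadBoostSketch {
    // Same idea as the scorePayload override: a one-byte payload is the boost.
    static float scorePayload(byte[] payload, int offset, int length) {
        if (length == 1) return payload[offset];
        return 1f; // neutral factor for missing or unexpected payloads
    }

    // Assumption: the boost is the average payload score across all occurrences.
    static float boostedScore(float termScore, List<byte[]> payloads) {
        if (payloads.isEmpty()) return termScore;
        float sum = 0f;
        for (byte[] payload : payloads) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return termScore * (sum / payloads.size());
    }

    public static void main(String[] args) {
        // One bold occurrence (payload 5) and one plain occurrence (no payload).
        List<byte[]> occurrences = Arrays.asList(new byte[] {5}, new byte[0]);
        System.out.println(boostedScore(2f, occurrences));
    }
}
```

Under this assumption, a document where the search term appears in bold scores higher than one where it only appears in plain text.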
Example: Simple facet counting
Analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            // Emit exactly one token whose payload is the hash of the document's URL
            return new TokenStream() {
                boolean done = false;
                public Token next() {
                    if (done) return null;
                    Token token = new Token(…);
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        … // other fields
    }

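The slides do not show `computeHash`. One possible shape, given here purely as a hypothetical stand-in, hashes the host part of the URL and packs the result into four bytes so that all pages of one site share the same payload. The class name `SitePayloadSketch` and the `decode` helper are likewise assumptions.

```java
import java.net.URI;
import java.nio.ByteBuffer;

class SitePayloadSketch {
    // Hypothetical stand-in for computeHash(url): hash of the host, as 4 bytes.
    static byte[] computeHash(String url) {
        String host = URI.create(url).getHost();
        // Fall back to the whole string when the URL has no host component.
        int hash = (host == null ? url : host).hashCode();
        return ByteBuffer.allocate(4).putInt(hash).array();
    }

    // Reads the 4-byte site hash back from a payload buffer.
    static int decode(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] a = computeHash("http://lucene.apache.org/java/docs/");
        byte[] b = computeHash("http://lucene.apache.org/");
        // Pages on the same host decode to the same site hash.
        System.out.println(decode(a, 0) == decode(b, 0));
    }
}
```

A fixed-width payload like this keeps the per-document facet value small and easy to decode while scanning the posting list.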
Example: Simple facet counting
HitCollector:
• Use a separate PriorityQueue for each site
• Instead of returning the top-n results of the whole data set, return the top-n results per site

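The per-site collection strategy can be sketched in plain Java with one bounded min-heap per site. The class `PerSiteTopN` and its methods are hypothetical; in a real collector the site hash would come from the payload of the facet field rather than being passed in directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class PerSiteTopN {
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int n;
    // site hash -> min-heap holding the best n hits seen so far for that site
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) {
        this.n = n;
    }

    // Called once per matching document.
    void collect(int docId, float score, int siteHash) {
        PriorityQueue<Hit> queue = queues.computeIfAbsent(siteHash,
                s -> new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score)));
        queue.add(new Hit(docId, score));
        if (queue.size() > n) {
            queue.poll(); // drop the lowest-scoring hit for this site
        }
    }

    // Document IDs of the top hits for one site, best first.
    List<Integer> topDocs(int siteHash) {
        PriorityQueue<Hit> queue = queues.get(siteHash);
        List<Integer> ids = new ArrayList<>();
        if (queue == null) return ids;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        for (Hit hit : hits) ids.add(hit.docId);
        return ids;
    }

    public static void main(String[] args) {
        PerSiteTopN collector = new PerSiteTopN(2);
        collector.collect(0, 1.0f, 1);
        collector.collect(1, 3.0f, 1);
        collector.collect(2, 2.0f, 1);
        collector.collect(3, 0.5f, 2);
        System.out.println(collector.topDocs(1) + " " + collector.topDocs(2));
    }
}
```

Each queue never holds more than n hits, so memory grows with the number of distinct sites rather than with the number of matching documents.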
Example: Simple facet counting
Summary
• In this example the facet (site) is used for scoring, but the approach can be extended to facet counting
• Good performance due to the locality of facet values

Conclusion
• Payloads offer great flexibility
• Payloads are stored very space-efficiently
• Sophisticated data structures enable efficient skipping over payloads
• Payloads should be used whenever special data is required for finding hits and scoring

Outlook
• Finalize API (currently Beta)
• Add more out-of-the-box query types
• Per-document Payloads

Questions?