SOLR SIDE-CAR INDEX

SOLR SIDE-CAR INDEX. Andrzej Bialecki, LucidWorks, ab@lucidworks.com. About the speaker: started using Lucene in 2003 (1.2-dev…); created Luke – the Lucene Index Toolbox; Apache Nutch, Hadoop, Solr committer; Lucene PMC member; LucidWorks engineer.

Presentation Transcript


  1. SOLR SIDE-CAR INDEX Andrzej Bialecki. LucidWorks ab@lucidworks.com

  2. About the speaker • Started using Lucene in 2003 (1.2-dev…) • Created Luke – the Lucene Index Toolbox • Apache Nutch, Hadoop, Solr committer, Lucene PMC member • LucidWorks engineer

  3. Agenda • Challenge: incremental document updates • Existing solutions and workarounds • Sidecar index strategy and components • Scalability and performance • QA

  4. Challenge: incremental document updates • Incremental update (field-level update): modification of a part of document • Sounds like a fundamentally useful functionality! • But Lucene / Solr doesn’t offer true field-level updates (yet!) • “Update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document” • “Atomic update” functionality in Solr is a (useful) syntactic sugar
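The "update is really retrieve, modify, add, delete" sequence from slide 4 can be sketched with a toy append-only index. This is a minimal illustration in Python, not Lucene code: `ToyIndex` and all of its methods are hypothetical names invented for this example.

```python
class ToyIndex:
    """Hypothetical append-only index: documents get increasing internal IDs."""

    def __init__(self):
        self.docs = []        # internal docId -> stored fields (dict)
        self.deleted = set()  # internal docIds marked as deleted

    def add(self, fields):
        """Append a document and return its internal docId."""
        self.docs.append(dict(fields))
        return len(self.docs) - 1

    def find(self, key):
        """Return the internal docId of the live document with this primary key."""
        for doc_id, fields in enumerate(self.docs):
            if doc_id not in self.deleted and fields["id"] == key:
                return doc_id
        return None

    def update(self, key, changes):
        """An 'update' is: retrieve old doc, update fields, add new doc, delete old doc."""
        old_id = self.find(key)
        if old_id is None:
            raise KeyError(key)
        merged = {**self.docs[old_id], **changes}  # retrieve old document + update fields
        new_id = self.add(merged)                  # add updated document
        self.deleted.add(old_id)                   # delete old document
        return new_id


index = ToyIndex()
index.add({"id": "A", "title": "static text", "popularity": 1})
new_id = index.update("A", {"popularity": 2})
```

Note that the whole document is rewritten and receives a new internal ID even though only one field changed, which is why this is not a true field-level update.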

  5. Common use cases for field updates • Documents composed logically of two parts with different update schedules • E.g. mostly static documents with some quickly changing fields • Two different classes of data in changing fields • Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns • Text fields: e.g. reviews, tags, click-through feedback, user profiles • Challenge: how to integrate these modifications with the main index content? • Re-indexing whole documents isn’t always an option

  6. True full-text (inverted fields) incremental updates • Very complex issue, broad impact on many Lucene internals • Inverted index structure is not optimized for partial document updates • At least another 6-12 months away? • LUCENE-4258 – work in progress

  7. Handling updates via full re-index • If the corpus is small, or incremental updates infrequent… just re-index everything! • Pros: • Relatively easy to implement – update source documents and re-index • Allows adding all types of data, including e.g. labels as searchable text • Cons: • Infeasible for larger corpora or frequent updates, time-wise and cost-wise • Requires keeping around the source documents • Sometimes inconvenient, when documents are assembled in a complex pipeline

  8. Handling updates via Solr’s ExternalFileField • Pros: • Simple to implement • Updates are easy – just file edits, no need to re-index • Cons: • Only supports a docId => numeric field value mapping • Not suitable for full-text searchable field updates • E.g. can’t support user-generated labels attached to a doc • Still useful if a simple “popularity”-type metric is sufficient • Internally implemented as an in-memory ValueSource usable by function queries • Example file contents: doc0=1.5, doc1=2.5, doc2=0.5, …
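The key=value file format shown on slide 8 can be modelled as a simple in-memory lookup. This is a sketch of the idea only, not Solr's implementation: `parse_external_file` and its `default` parameter are hypothetical names chosen for illustration.

```python
def parse_external_file(text, default=0.0):
    """Parse 'key=value' lines and return a lookup function with a default value."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, raw = line.partition("=")
        values[key.strip()] = float(raw)
    # Documents without an entry fall back to the default value.
    return lambda doc_key: values.get(doc_key, default)


popularity = parse_external_file("doc0=1.5\ndoc1=2.5\ndoc2=0.5\n")
```

Because the values are plain numbers keyed by document, updating them means rewriting the file, but nothing here could hold searchable text, which matches the limitation listed under "Cons".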

  9. Numeric DocValues updates • Since Lucene/Solr 4.6 … to be released Really Soon • Details can be found in LUCENE-5189 • As simple as: indexWriter.updateNumericDocValue(term, field, value) • Neatly solves the problem of numeric updates: popularity, in-stock, etc. • Some limitations: • Massive updates still somewhat costly until the next merge (like deletes) • Can only update existing fields • Obviously doesn’t address the full-text inverted field updates

  10. Lucene ParallelReader overview • Pretends that two or more IndexReader-s are slices of the same index • Slices contain data for different fields • Both stored and inverted parts are supported • Data for matching docs is joined on the fly • Structure of all indexes MUST match 1:1 !!! • The same number of segments • The same count of docs per segment • Internal document ID-s must match 1:1 • List of deletes is taken from the first index • Sounds cool … but in practice it’s rarely used: • It’s very difficult to meet these requirements • This is even more difficult in the presence of index updates and merges
  [Figure: ParallelReader joining a main IR (fields f1, f2, …) and a parallel IR (fields f3, f4, …) over matching segments of 4, 2 and 1 documents]
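The 1:1 structural requirement from slide 10 can be illustrated with a toy join over lists of segments. This is a Python sketch under simplifying assumptions (segments are plain lists of field dictionaries); `parallel_view` is a hypothetical helper, not the Lucene ParallelReader API.

```python
def parallel_view(main_segments, parallel_segments):
    """Join two indexes segment-by-segment and document-by-document."""
    if len(main_segments) != len(parallel_segments):
        raise ValueError("segment counts differ")
    joined = []
    for seg_no, (main_seg, par_seg) in enumerate(zip(main_segments, parallel_segments)):
        if len(main_seg) != len(par_seg):
            raise ValueError(f"segment {seg_no}: document counts differ")
        # Documents at the same internal position are treated as one document.
        joined.append([{**m, **p} for m, p in zip(main_seg, par_seg)])
    return joined


main = [[{"f1": "a"}, {"f1": "b"}], [{"f1": "c"}]]
parallel = [[{"f3": 1}, {"f3": 2}], [{"f3": 3}]]
view = parallel_view(main, parallel)
```

The join uses positions only, never primary keys, so any difference in segment count or document order silently pairs the wrong fields unless it is rejected up front.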

  11. Handling updates via ParallelReader • Pros: • All types of data (e.g. searchable full-text labels) can be added • Cons: • Must ensure that the other index always matches the structure of the main index • Complicated and fragile (rebuild on every update?) • No tools to manage this parallel index in Solr
  [Figure: ParallelReader over a main IR (fields f1, f2, …) and a parallel IR (fields f3, f4, …)]

  12. Sidecar Index Components for Solr • Uses the ParallelReader strategy for field updates • “Main” and “sidecar” data comes from two different Solr collections • “Sidecar” collection is updated independently from the main collection • “Sidecar” collection is used as a source of document fields for building and updating a parallel index • Integrates the management of ParallelReader (“sidecar index”) into Solr • Initial creation of ParallelReader, including synchronization of internal ID-s • Tracking of updates and IndexReader.reopen(…) events • Partly based on a version of Click Framework in LucidWorks Search • Available under Apache License here: http://github.com/LucidWorks/sidecar_index

  13. “Main”, “sidecar” collections and parallel index • “Main” collection contains only the parts of documents with “main” fields • “Sidecar” collection is a source of documents with “sidecar” fields • SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index) • “Main” collection uses SidecarIndexReader that acts as ParallelReader • Main index is updated as usual, via the “main” collection’s IndexWriter
  [Figure: Solr with Main_collection and Sidecar_collection; SidecarIndexReader combines the main index and the sidecar index]

  14. Implementation details • SidecarIndexReaderFactory extends Solr’s IndexReaderFactory • newReader(Directory, SolrCore) – initial open • newReader(IndexWriter, SolrCore) – NRT open • SidecarIndexReader acts like a ParallelReader • Solr wants DirectoryReader, but ParallelReader is not a DirectoryReader • Basically had to re-implement the logic from ParallelReader • ParallelReader challenges: • How to synchronize internal ID-s? • How to create segments that are of the same size as those of the main index? • How to handle deleted documents? • How to handle updates to the main index? • How to handle updates to the sidecar data?

  15. ParallelReader challenges and solutions • How to synchronize internal ID-s? • “Main” collection is traversed sequentially by internal docId • Primary key is retrieved for each document • Matching document is found in the “sidecar” collection • Matching document is added to the “sidecar” index • Very costly phase! • Random seek and retrieval from “sidecar” collection • Primary key lookup is fast • … but stored field retrieval and indexing isn’t
  [Figure: main IR documents in internal order (D, B, A, F, C, G, E) are looked up by primary key (e.g. q=id:D) in the sidecar collection (stored as G, B, C, E, A, F, D) to build a sidecar IR in the matching order]
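The ID synchronization steps on slide 15 can be sketched as a single pass over the main index. This is a toy Python model: the sidecar collection is represented as a dictionary keyed by primary key, and `build_sidecar_in_main_order` and the `labels` field are hypothetical names used only for illustration.

```python
def build_sidecar_in_main_order(main_docs, sidecar_by_key):
    """Return sidecar documents ordered by the main index's internal docIds."""
    aligned = []
    for main_doc in main_docs:          # sequential traversal by internal docId
        key = main_doc["id"]            # retrieve the primary key
        match = sidecar_by_key[key]     # find the matching sidecar document
        aligned.append(dict(match))     # add it to the sidecar index, in main order
    return aligned


# Orders taken from the slide's figure: main index D, B, A, F, C, G, E;
# the sidecar collection stores the same keys in a different order.
main_docs = [{"id": k} for k in "DBAFCGE"]
sidecar_by_key = {k: {"labels": f"labels for {k}"} for k in "GBCEAFD"}
sidecar_index = build_sidecar_in_main_order(main_docs, sidecar_by_key)
```

In this model the lookup is a cheap dictionary access; in the real system each lookup also involves stored field retrieval and re-indexing, which is where the slide places the cost.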

  16. ParallelReader challenges and solutions • Optimization 1: don’t rebuild data for unmodified segments • Optimization 2 (cheating): ignore NRT segments • How to handle deleted docs? • Insert dummy (empty) documents so that the number and the order of documents still match
  [Figure: a deleted document (X) in the main IR is matched by a dummy document in the sidecar IR; the NRT segment of the main IR has no sidecar counterpart]
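The dummy-document trick from slide 16 can be sketched as follows. This is a toy Python model in which a segment is a list of documents and deletes are a set of internal positions; `align_with_deletes` is a hypothetical helper, not part of the sidecar code.

```python
def align_with_deletes(main_segment, deleted_ids, sidecar_by_key):
    """Build a sidecar segment with empty placeholders for deleted main documents."""
    aligned = []
    for doc_id, main_doc in enumerate(main_segment):
        if doc_id in deleted_ids:
            aligned.append({})  # dummy (empty) document keeps positions in step
        else:
            aligned.append(dict(sidecar_by_key[main_doc["id"]]))
    return aligned


main_segment = [{"id": "D"}, {"id": "B"}, {"id": "A"}, {"id": "F"}]
sidecar_by_key = {"D": {"labels": "d"}, "A": {"labels": "a"}, "F": {"labels": "f"}}
sidecar_segment = align_with_deletes(main_segment, {1}, sidecar_by_key)
```

The placeholder carries no fields, so it contributes nothing to search results, but it preserves the document count and order that the positional join depends on.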

  17. Implementation: SidecarMergePolicy • How to create segments that are of the same size as the “main” index? • Carefully manage the “sidecar” index creation: • IndexWriter uses SerialMergeScheduler to prevent out-of-order merges • Force flush when reaching the next target count of documents • Merges are enforced using SidecarMergePolicy that tracks the sizes of the “main” index segments • SidecarMergePolicy target sizes in the example: Seg0 – 4 docs, Seg1 – 2 docs, Seg2 – 1 doc
  [Figure: main IR and sidecar IR with identical segment layouts of 4, 2 and 1 documents]
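The "force flush at the next target count" idea from slide 17 amounts to cutting a document stream at predetermined boundaries. The Python sketch below models only that cutting step; `flush_by_target_sizes` is a hypothetical helper and does not reproduce the merge handling of the real SidecarMergePolicy.

```python
def flush_by_target_sizes(docs, target_sizes):
    """Split a document stream into segments whose sizes follow the main index."""
    if sum(target_sizes) != len(docs):
        raise ValueError("target sizes do not cover the document stream exactly")
    segments, start = [], 0
    for size in target_sizes:
        segments.append(docs[start:start + size])  # "flush" once the target is reached
        start += size
    return segments


# Target sizes from the slide: Seg0 = 4 docs, Seg1 = 2 docs, Seg2 = 1 doc.
segments = flush_by_target_sizes(list(range(7)), [4, 2, 1])
```

Processing documents strictly in order is what makes this work, which is consistent with the slide's use of a serial merge scheduler to avoid out-of-order merges.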

  18. Implementation: SidecarIndexReader • Re-implements the logic of ParallelReader • ParallelReader != DirectoryReader • Exposes Directory of the “main” index for replication • Replicas need the “sidecar” collection replica to rebuild the sidecar index locally • If document routing and shard placement are the same then we don’t have to use distributed search – all data will be local • Reopen(…) avoids rebuilding unmodified segments • Reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary • When there’s a major merge in the “main” index • When “sidecar” data is updated • Ref-counting of IndexReaders at different levels is very tricky!
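The "reopen avoids rebuilding unmodified segments" behaviour on slide 18 can be sketched as a cache keyed by segment name. This is a toy Python model that assumes segments are immutable and identified by a stable name; `reopen`, the `cache` layout and `build_segment` are hypothetical and do not mirror the actual SidecarIndexReader code.

```python
def reopen(cache, main_segments, build_segment):
    """Reuse sidecar data for known segments and rebuild only new ones.

    cache:         segment name -> previously built sidecar segment
    main_segments: segment name -> list of main documents (current view)
    build_segment: function that builds a sidecar segment from main documents
    """
    new_cache, rebuilt = {}, []
    for name, docs in main_segments.items():
        if name in cache:
            new_cache[name] = cache[name]           # unmodified segment: reuse
        else:
            new_cache[name] = build_segment(docs)   # new or merged segment: rebuild
            rebuilt.append(name)
    return new_cache, rebuilt


build = lambda docs: [{"labels": d["id"].lower()} for d in docs]
cache, first = reopen({}, {"_0": [{"id": "A"}, {"id": "B"}]}, build)
cache, second = reopen(cache, {"_0": [{"id": "A"}, {"id": "B"}], "_1": [{"id": "C"}]}, build)
```

In this model a major merge replaces many segment names at once, so nearly everything misses the cache, which is in line with the slide's remark that major merges trigger a rebuild.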

  19. Example configuration in solrconfig.xml

      <indexReaderFactory name="IndexReaderFactory"
                          class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
        <str name="docIdField">id</str>
        <str name="sourceCollection">source</str>
        <bool name="enabled">true</bool>
      </indexReaderFactory>

  20. Example use case: integration of click-through data • Raw click-through data: • Query, query_time, docId, click_time [, user] • Aggregated click-through data: • User-generated popularity score: F(number and timing of clicks per docId) • Numeric updates • User-generated labels: F(top-N queries that led to clicks on docId) • Full-text searchable updates • User profiles: F(top-N queries per user, top-N docId-s clicked, etc) • … • Queries can now be expanded to score based on TF/IDF in user-generated labels
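The aggregation step on slide 20 can be sketched with a small Python example. The slide leaves the scoring functions F unspecified, so this sketch assumes the simplest possible choices: popularity is the raw click count per docId and labels are the top-N queries by click count; timing and user fields are ignored. `aggregate_clicks` and the sample events are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_clicks(events, top_n=2):
    """Aggregate (query, doc_id) click events into per-document sidecar fields."""
    clicks = Counter()                # doc_id -> number of clicks
    queries = defaultdict(Counter)    # doc_id -> query -> number of clicks
    for query, doc_id in events:
        clicks[doc_id] += 1
        queries[doc_id][query] += 1
    return {
        doc_id: {
            "popularity": clicks[doc_id],  # numeric update
            "labels": [q for q, _ in queries[doc_id].most_common(top_n)],  # full-text update
        }
        for doc_id in clicks
    }


events = [
    ("solr update", "D1"),
    ("solr update", "D1"),
    ("sidecar", "D1"),
    ("luke", "D2"),
]
sidecar_docs = aggregate_clicks(events)
```

Each resulting entry corresponds to one document in the "sidecar" collection: the popularity value is a numeric field, while the labels would be indexed as searchable text.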

  21. Scalability and performance

  22. Scalability and performance • Initial full rebuild is very costly • ~0.6 ms / document • 1 mln docs = 600 sec = 10 min • Not even close to “real time” … • Cost related to new segments in “main” index depends on the size of segments • Major merge events will trigger full rebuild • BUT: search-time cost is negligible
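The rebuild cost quoted on slide 22 is a straight multiplication, which the Python sketch below makes explicit. The 0.6 ms per document figure comes from the slide; `full_rebuild_seconds` is a hypothetical helper, and real cost will vary with hardware and document size.

```python
def full_rebuild_seconds(num_docs, ms_per_doc=0.6):
    """Estimate full sidecar rebuild time from a per-document cost in milliseconds."""
    return num_docs * ms_per_doc / 1000.0


one_million = full_rebuild_seconds(1_000_000)   # about 600 seconds, i.e. 10 minutes
ten_million = full_rebuild_seconds(10_000_000)  # about 6000 seconds, i.e. 100 minutes
```

Since the cost grows linearly with corpus size, a rebuild triggered by every small update quickly dominates, which motivates the "massive but infrequent updates" use case.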

  23. Caveats • Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track • The sidecar code is still unstable and occasionally explodes • Performance of full rebuild quickly becomes the bottleneck on frequent updates • So the main use case is massive but infrequent updates of “sidecar” data • Code: http://github.com/LucidWorks/sidecar_index • Fixes and contributions are welcome – the code is Apache licensed

  24. Agenda • Challenge: incremental document updates • Existing solutions and workarounds • Sidecar index strategy and components • Scalability and performance • QA

  25. QA Andrzej Bialecki ab@lucidworks.com