Lucene Near Realtime Search

Jason Rutherglen & Jake Mannix, LinkedIn. 6/3/2009, SOLR/Lucene User’s Group, San Francisco.

Presentation Transcript


  1. Lucene Near Realtime Search • Jason Rutherglen & Jake Mannix, LinkedIn • 6/3/2009 • SOLR/Lucene User’s Group, San Francisco

  2. What is NRT? • Search on documents nearly as fast as they are indexed • Delete documents in a way that is immediate and IO efficient • Good for things like Twitter and other apps that require realtime searching (Social 2.0)

  3. Today? • Users expect to search their data immediately after updating it (Web/Social 2.0 apps) • Search engines are designed to perform efficient batch indexing (not realtime) • Batch indexing is slow and updates take a while to be searchable

  4. NRT in Lucene • Uses core Lucene code to make existing batch indexing nearly realtime • Required retrofitting of some of the core implementation • Details are hidden • Hopefully really easy for developers to use

  5. Lucene NRT Patches • LUCENE-1314 – IndexReader.clone • LUCENE-1516 – IndexWriter.getReader • LUCENE-1313 – RAMDir in IndexWriter • LUCENE-1483 – Fast FieldCache loading • LUCENE-1231 – Column stride fields • LUCENE-1526 – Incremental copy-on-write

  6. LUCENE-1314 • IndexReader.clone is like reopen • However it performs a copy-on-write of norms and deletes • Used by LUCENE-1516 to keep deletes in RAM (rather than flush them to disk)
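
The copy-on-write idea behind IndexReader.clone can be sketched without Lucene itself. The class below is a hypothetical stand-in (CowReader, cloneReader, and the shared flag are illustrative names, not Lucene's API): a clone shares the deleted-docs bit set with its source, and the bit set is only copied when one side first writes to it. Lucene's own implementation is not shown in the slides, so treat this as an assumption-laden illustration only.

```java
import java.util.BitSet;

// Hypothetical stand-in for a segment reader; NOT Lucene's actual class.
final class CowReader {
    private BitSet deleted;   // deleted-docs bits, possibly shared with a clone
    private boolean shared;   // true while another reader may hold the same BitSet

    CowReader(int maxDoc) {
        this.deleted = new BitSet(maxDoc);
        this.shared = false;
    }

    private CowReader(BitSet sharedDeleted) {
        this.deleted = sharedDeleted;
        this.shared = true;
    }

    // Cloning is cheap: both readers point at the same bits until one writes.
    CowReader cloneReader() {
        this.shared = true;
        return new CowReader(deleted);
    }

    // The first write after a clone copies the bit set (copy-on-write).
    // A plain flag can copy once more than strictly needed; reference
    // counting would avoid that, at the cost of more bookkeeping.
    void deleteDocument(int docId) {
        if (shared) {
            deleted = (BitSet) deleted.clone();
            shared = false;
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }
}

final class CowReaderDemo {
    public static void main(String[] args) {
        CowReader original = new CowReader(16);
        original.deleteDocument(1);
        CowReader copy = original.cloneReader();
        copy.deleteDocument(2);   // triggers the copy; original is unaffected
        System.out.println("original sees doc 2 deleted: " + original.isDeleted(2));
        System.out.println("copy sees doc 1 deleted: " + copy.isDeleted(1));
    }
}
```

Because the deletes live in the reader's in-memory bit set, nothing in this sketch touches disk, which is the property LUCENE-1516 relies on.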

  7. LUCENE-1516 • Adds the ability to obtain an IndexReader from an IndexWriter • Efficient in-RAM deletes • Call IndexWriter.getReader instead of IndexReader.reopen • All updating, deleting, reopening, and flushing details are hidden from the user • Will be in Lucene 2.9

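  8. Sample IW.getReader Code
       IndexWriter writer;  // assumed to be already opened on a Directory
       Document doc = new Document();
       writer.addDocument(doc);
       IndexReader reader = writer.getReader();  // near-realtime reader; no commit needed first
       Document sameDoc = reader.document(0);
       // Illustrative only: Document does not override equals(),
       // so real code would compare stored field values instead.
       assert doc.equals(sameDoc);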

  9. LUCENE-1313 • Near Realtime Search • Makes IW.getReader faster • New segments are flushed to a RAMDirectory internal to IndexWriter • Could increase overall indexing performance because there’s no pause while the RAM buffer is being written to disk • Will be in Lucene 2.9?

  10. LUCENE-1483 • Searches use field caches at the segment level • Means faster field cache loading and more efficient memory usage • Good for realtime because field cache loading is less of a bottleneck and less RAM is used • Will be in Lucene 2.9
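
Why segment-level caches help realtime search can be shown with a small, Lucene-free sketch. Segment and SegmentFieldCache are hypothetical names, and the cache is keyed on a segment name for simplicity (an assumption; Lucene's FieldCache is not keyed this way). The point is only that after a reopen, segments that did not change keep their cached values and just the new segment is loaded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical segment: a name plus one field value per document.
final class Segment {
    final String name;
    final int[] fieldValues;

    Segment(String name, int[] fieldValues) {
        this.name = name;
        this.fieldValues = fieldValues;
    }
}

// Hypothetical per-segment cache; NOT Lucene's FieldCache API.
final class SegmentFieldCache {
    private final Map<String, int[]> cache = new HashMap<>();
    private int loads = 0;

    // Loads a segment's values only on first use; later calls hit the cache.
    int[] get(Segment segment) {
        int[] values = cache.get(segment.name);
        if (values == null) {
            // Stands in for the expensive load from the index on disk.
            values = segment.fieldValues.clone();
            cache.put(segment.name, values);
            loads++;
        }
        return values;
    }

    int loadCount() {
        return loads;
    }
}

final class SegmentFieldCacheDemo {
    public static void main(String[] args) {
        SegmentFieldCache cache = new SegmentFieldCache();
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("_0", new int[] {5, 7}));
        segments.add(new Segment("_1", new int[] {9}));
        for (Segment s : segments) cache.get(s);

        // A "reopen" after indexing adds one small segment; the old ones are reused.
        segments.add(new Segment("_2", new int[] {3}));
        for (Segment s : segments) cache.get(s);
        System.out.println("segment loads so far: " + cache.loadCount());
    }
}
```

With a single index-wide cache, the same reopen would force every value to be reloaded, which is the bottleneck this slide refers to.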

  11. LUCENE-1526 • Optimize copy-on-write • When we’re doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates • So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?) • Lucene 3.0?
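
One possible shape for incremental copy-on-write is a bit set split into fixed-size pages, so that a clone followed by a few deletes copies only the pages it touches rather than the whole array. The slides do not describe the actual LUCENE-1526 design, so everything here (PagedDeletes, the 1024-bit page size, the owned flags) is a hypothetical sketch of the general technique.

```java
import java.util.Arrays;

// Hypothetical paged deleted-docs set; NOT the LUCENE-1526 implementation.
final class PagedDeletes {
    private static final int PAGE_BITS = 1024;            // arbitrary page size
    private static final int WORDS_PER_PAGE = PAGE_BITS / 64;

    private final long[][] pages;    // pages[i] may be shared with a clone
    private final boolean[] owned;   // owned[i]: safe to write pages[i] in place
    private int pagesCopied = 0;

    PagedDeletes(int maxDoc) {
        int pageCount = (maxDoc + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[pageCount][WORDS_PER_PAGE];
        owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedDeletes(long[][] sharedPages) {
        pages = sharedPages.clone();          // shallow: copies page references only
        owned = new boolean[pages.length];    // nothing is owned yet
    }

    // Cloning copies only the small array of page references.
    PagedDeletes cloneShallow() {
        Arrays.fill(owned, false);            // the source must now copy before writing too
        return new PagedDeletes(pages);
    }

    // Copies just the affected page on the first write after a clone.
    void delete(int doc) {
        int page = doc / PAGE_BITS;
        if (!owned[page]) {
            pages[page] = pages[page].clone();
            owned[page] = true;
            pagesCopied++;
        }
        int bit = doc % PAGE_BITS;
        pages[page][bit >>> 6] |= 1L << (bit & 63);
    }

    boolean isDeleted(int doc) {
        int bit = doc % PAGE_BITS;
        return (pages[doc / PAGE_BITS][bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    int pagesCopied() {
        return pagesCopied;
    }
}

final class PagedDeletesDemo {
    public static void main(String[] args) {
        PagedDeletes original = new PagedDeletes(10_000);
        original.delete(5);
        PagedDeletes copy = original.cloneShallow();
        copy.delete(5_000);
        System.out.println("pages copied by the clone: " + copy.pagesCopied());
    }
}
```

The trade-off is an extra indirection on every lookup in exchange for copying a single page per small update instead of the full array.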

  12. LUCENE-1231 • Column stride fields will make field cache loading faster because data will be loaded sequentially from disk • Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next) • Lucene 3.0?
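
The sequential-read argument can be made concrete with a toy layout. The sketch below assumes a column of fixed-width int values stored in document order in a ByteBuffer; this layout and the ColumnStride class are assumptions for illustration, since the slides do not specify the on-disk format proposed in LUCENE-1231. Loading the cache is then a single forward pass with no per-value seeking.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width column layout; NOT the LUCENE-1231 file format.
final class ColumnStride {
    // Writes one int per document, in document order.
    static ByteBuffer write(int[] valuesByDoc) {
        ByteBuffer column = ByteBuffer.allocate(valuesByDoc.length * Integer.BYTES);
        for (int value : valuesByDoc) {
            column.putInt(value);
        }
        column.flip();   // switch the buffer from writing to reading
        return column;
    }

    // Fills the field cache with one sequential scan of the column.
    static int[] load(ByteBuffer column, int maxDoc) {
        int[] cache = new int[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            cache[doc] = column.getInt();   // relative read advances the position
        }
        return cache;
    }
}

final class ColumnStrideDemo {
    public static void main(String[] args) {
        ByteBuffer column = ColumnStride.write(new int[] {42, 7, 19});
        int[] cache = ColumnStride.load(column, 3);
        System.out.println("value for doc 1: " + cache[1]);
    }
}
```

By contrast, building the same array from an inverted index means walking terms and then the documents for each term, which is where the two seeks per value mentioned above come from.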

  13. Future of Lucene NRT • LUCENE-1292 – Realtime parallel untokenized field index (for tags) • Pulsing - Store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading • Replication • More benchmarks

  14. LinkedIn Open Source Projects • Bobo – Facet library that counts using custom field caches http://code.google.com/p/bobo-browse/ • Zoie – Realtime search on top of Lucene http://code.google.com/p/zoie/ • Voldemort – Distributed key-value storage http://project-voldemort.com/

  15. BoboBrowse: facet features • MultiSelect • Runtime-defined facets (query-based, etc) • Fast (custom field-cache based) • Custom facet types: • Hierarchical (/a/b/c) • Range • Multivalued

  16. Zoie: realtime features • No modifications to core Lucene • Multiple read/write: RAMDir + FSDir • IndexReader on (small) RAMDir opened per request: instantly realtime • IndexReaderDecorator for custom Reader • Transparent Indexing: implement StreamDataProvider then inject

  17. Next Steps • Help work on the patches? https://issues.apache.org/jira/browse/LUCENE • LinkedIn is hiring • Contact: jason.rutherglen@gmail.com or jake.mannix@gmail.com
