Lucene Performance

Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA

Overview • Defining Performance • Basics • Indexing • Parameters • Threading • Search • Document Retrieval • Search Quality

Defining Performance • Many factors in assessing Lucene (and search) performance • Speed • Quality of results (subjective) • Precision • # relevant retrieved out of # retrieved • Recall • # relevant retrieved out of total # relevant • Size of index • Compression rate • Other Factors: • Local vs. distributed
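
The precision and recall definitions above translate directly into set arithmetic. The following is a minimal sketch, not part of Lucene: the class name `PrecisionRecall` and the document ids are made up for illustration, and it assumes relevance judgments are available as a set of document ids.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy precision/recall calculator over document ids; hypothetical, not a Lucene class. */
final class PrecisionRecall {

    /** Number of retrieved documents that are also judged relevant. */
    static int relevantRetrieved(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    /** # relevant retrieved out of # retrieved. */
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    /** # relevant retrieved out of total # relevant. */
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d2", "d4", "d7", "d8", "d9");
        // 2 of the 4 retrieved documents are relevant; 2 of the 5 relevant documents were retrieved.
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

Returning 0.0 for an empty result set or an empty judgment set is a convention chosen here to avoid division by zero; other conventions are equally defensible.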

Basics • Consider the latest version of Lucene • Lucene 2.3/Trunk has many performance improvements over prior versions • Consider Solr • Solr employs many Lucene best practices • contrib/benchmark can help assess many aspects of performance, including speed, precision and recall • Task-based approach makes for easy extension • Sanity check your needs • Profile to identify bottlenecks

Indexing Factors • Lucene indexes Documents into memory • When the in-memory buffer fills, it is flushed to the index as a new segment • Segments are periodically merged • Internal Lucene models are changing and (drastically) improving performance

IndexWriter factors • setMaxBufferedDocs controls the minimum # of docs buffered in memory before they are flushed as a new segment • Larger == faster • Larger == more RAM • setMergeFactor controls how often segments are merged • Smaller == less RAM, better for large # of updates • Larger == faster, better for batch • setMaxFieldLength controls the # of terms indexed from a document • setUseCompoundFile controls the file format Lucene uses. Turning off compound file format is faster, but you could run out of file descriptors
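
To get a feel for how the buffer size and merge factor interact, here is a toy simulation. It is a deliberately simplified model, not Lucene's actual merge logic: it assumes a segment is flushed every `maxBufferedDocs` documents and that `mergeFactor` segments at the same level collapse into one segment at the next level. The class `MergeSimulation` and all of its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of buffered flushes and level-based merges; NOT Lucene's real merge policy. */
final class MergeSimulation {
    final int maxBufferedDocs;
    final int mergeFactor;
    /** segmentsPerLevel.get(i) is the number of segments currently at level i. */
    final List<Integer> segmentsPerLevel = new ArrayList<>();
    int buffered = 0;
    int flushes = 0;
    int merges = 0;

    MergeSimulation(int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        this.maxBufferedDocs = maxBufferedDocs;
        this.mergeFactor = mergeFactor;
    }

    /** Buffers one document and flushes once the buffer is full. */
    void addDocument() {
        buffered++;
        if (buffered >= maxBufferedDocs) {
            flush();
        }
    }

    /** Writes the buffered documents out as one new level-0 segment. */
    void flush() {
        if (buffered == 0) {
            return;
        }
        buffered = 0;
        flushes++;
        addSegment(0);
    }

    private void addSegment(int level) {
        while (segmentsPerLevel.size() <= level) {
            segmentsPerLevel.add(0);
        }
        int count = segmentsPerLevel.get(level) + 1;
        if (count == mergeFactor) {
            // mergeFactor segments at this level are merged into one segment at the next level.
            segmentsPerLevel.set(level, 0);
            merges++;
            addSegment(level + 1);
        } else {
            segmentsPerLevel.set(level, count);
        }
    }

    /** Total number of segments currently in the simulated index. */
    int segmentCount() {
        int total = 0;
        for (int n : segmentsPerLevel) {
            total += n;
        }
        return total;
    }

    /** Indexes the given number of documents, then flushes whatever is left in the buffer. */
    static MergeSimulation run(int docs, int maxBufferedDocs, int mergeFactor) {
        MergeSimulation sim = new MergeSimulation(maxBufferedDocs, mergeFactor);
        for (int i = 0; i < docs; i++) {
            sim.addDocument();
        }
        sim.flush();
        return sim;
    }

    public static void main(String[] args) {
        int[][] settings = {{10, 10}, {10, 2}, {100, 10}};
        for (int[] s : settings) {
            MergeSimulation sim = run(990, s[0], s[1]);
            System.out.println("maxBufferedDocs=" + s[0] + " mergeFactor=" + s[1]
                    + " flushes=" + sim.flushes + " merges=" + sim.merges
                    + " segments=" + sim.segmentCount());
        }
    }
}
```

In this model a larger merge factor performs fewer merges while indexing but leaves more segments behind, and a larger buffer reduces the number of flushes. The model ignores segment sizes, RAM accounting and I/O cost, so treat the counts as intuition only and measure real settings with contrib/benchmark.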

Lucene 2.3 IndexWriter Changes • setRAMBufferSizeMB • New model for automagically controlling indexing factors based on the amount of memory in use • Obsoletes setMaxBufferedDocs and setMergeFactor • Takes storage and term vectors out of the merge process • Turn off auto-commit if there are stored fields and term vectors • Provides significant performance increase

Analysis • An Analyzer is a Tokenizer plus zero or more TokenFilters • More complicated analysis means slower indexing • Many applications could use simpler Analyzers than the StandardAnalyzer • StandardTokenizer is now faster in 2.3 (thus making StandardAnalyzer faster) • Reuse in 2.3: • Re-use Token, Document and Field instances • Use the char[] API with Token instead of String API

Thread Safety • Use a single IndexWriter for the duration of indexing • Share IndexWriter between threads • Parallel Indexing • Index to separate Directory instances • Merge when done with IndexWriter.addIndexes() • Distribute and collect
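
The "distribute and collect" pattern can be sketched without Lucene: index slices of the collection in parallel into separate structures, then merge them at the end. In the sketch below a plain `Map` from term to document ids stands in for a Directory, and `merge` stands in for the final addIndexes step. All names (`ParallelIndexSketch`, `indexSlice`, `indexInParallel`) are hypothetical, and whitespace splitting stands in for real analysis.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy "index slices in parallel, then merge" sketch; the maps stand in for Lucene Directories. */
final class ParallelIndexSketch {

    /** Builds a tiny in-memory index (term -> doc ids) for one slice of documents. */
    static Map<String, SortedSet<Integer>> indexSlice(List<String> docs, int firstDocId) {
        Map<String, SortedSet<Integer>> index = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            for (String term : docs.get(i).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(firstDocId + i);
                }
            }
        }
        return index;
    }

    /** Collect step: combines the per-slice indexes into one. */
    static Map<String, SortedSet<Integer>> merge(List<Map<String, SortedSet<Integer>>> parts) {
        Map<String, SortedSet<Integer>> merged = new HashMap<>();
        for (Map<String, SortedSet<Integer>> part : parts) {
            for (Map.Entry<String, SortedSet<Integer>> e : part.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    /** Distribute step: one task per slice, each writing only to its own private index. */
    static Map<String, SortedSet<Integer>> indexInParallel(List<String> docs, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int sliceSize = (docs.size() + threads - 1) / threads;
            List<Future<Map<String, SortedSet<Integer>>>> futures = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += sliceSize) {
                final int from = start;
                final int to = Math.min(docs.size(), start + sliceSize);
                Callable<Map<String, SortedSet<Integer>>> task =
                        () -> indexSlice(docs.subList(from, to), from);
                futures.add(pool.submit(task));
            }
            List<Map<String, SortedSet<Integer>>> parts = new ArrayList<>();
            for (Future<Map<String, SortedSet<Integer>>> f : futures) {
                parts.add(f.get());
            }
            return merge(parts);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of("apple banana", "banana cherry", "apple", "cherry date");
        System.out.println(indexInParallel(docs, 2));
    }
}
```

Because each task writes only to its own map, no locking is needed until the merge. Whether this beats sharing one writer across threads depends on the workload, so it is worth benchmarking both.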

Other Indexing Factors • NFS • Have been some improvements lately, but… • “proceed with caution” • Not as good as local filesystem • Replication • Index locally and then use rsync to replicate copies of index to other servers • Have I mentioned Solr?

Benchmarking Indexing • contrib/benchmark • Try out different algorithms between Lucene 2.2 and trunk (2.3) • contrib/benchmark/conf: • indexing.alg • indexing-multithreaded.alg • Info: • Mac Pro 2 x 2GHz Dual-Core Xeon • 4 GB RAM • ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results

Search Performance • Many factors influence search speed • Query Type, size, analysis, # of occurrences, index size, index optimization, index type • Known Enemies • Search Quality also has many factors • Query formulation, synonyms, analysis, etc. • How to judge quality?

Query Types • Some queries in Lucene get rewritten into simpler queries: • WildcardQuery rewrites to a BooleanQuery of all the terms that satisfy the wildcards • a* -> abe, apple, an, and, array… • Likewise with RangeQuery, especially with date ranges
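
The cost of a rewritten wildcard query comes from the number of terms it expands to. The sketch below imitates that expansion for a simple prefix pattern against a sorted term dictionary. It is an illustration only: `PrefixExpansion` is a hypothetical class, a `TreeSet` stands in for the index's term dictionary, and the result list stands in for the clauses of the rewritten BooleanQuery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy prefix expansion over a sorted term dictionary; hypothetical, not Lucene's rewrite code. */
final class PrefixExpansion {

    /** Returns every term in the sorted dictionary that starts with the given prefix. */
    static List<String> expand(NavigableSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // Start at the first term >= prefix and stop as soon as a term no longer matches.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
                new TreeSet<>(List.of("abe", "apple", "an", "and", "array", "bear"));
        List<String> clauses = expand(terms, "a");
        System.out.println("a* expands to " + clauses.size() + " clauses: " + clauses);
    }
}
```

On a real index a short prefix can match thousands of terms, which is why the number of expanded clauses, rather than the length of the original query string, drives the cost.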

Query Size • Stopword removal can help reduce size • Choose expansions carefully • Consider using fewer fields to search over • When doing relevance feedback, don’t use whole document, instead focus on most important terms

Index Factors for Search • Size: • more unique terms, more to search • Stopword removal and stemming can help reduce the number of unique terms • Not a linear factor due to index compression • Type • RAMDirectory if the index is small enough to fit in memory • MMapDirectory may perform better

Search Speed Tips • IndexSearcher • Thread-safe, so share • Open once and use as long as possible • Cache Filters when appropriate • Optimize if you have the time • Warm up your Searcher first by sending it some preliminary queries before making it live
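
The idea behind caching filters is to compute the set of matching documents once and reuse it across queries. Lucene ships CachingWrapperFilter for this purpose; the sketch below only illustrates the principle with standard-library types. `FilterCacheSketch`, the string filter keys and the `IntPredicate` standing in for the filter logic are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntPredicate;

/** Toy filter cache: one BitSet of matching doc ids per filter key; hypothetical, not Lucene code. */
final class FilterCacheSketch {
    private final int maxDoc;
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();
    /** Counts how often a filter actually had to be evaluated (for demonstration). */
    final AtomicInteger computations = new AtomicInteger();

    FilterCacheSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    /**
     * Returns the cached bits for the filter, computing them on first use.
     * Callers must treat the returned BitSet as read-only because it is shared.
     */
    BitSet bits(String filterKey, IntPredicate accepts) {
        return cache.computeIfAbsent(filterKey, key -> {
            computations.incrementAndGet();
            BitSet bits = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (accepts.test(doc)) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    public static void main(String[] args) {
        FilterCacheSketch cache = new FilterCacheSketch(1000);
        for (int query = 0; query < 5; query++) {
            BitSet evenDocs = cache.bits("even-docs", doc -> doc % 2 == 0);
            System.out.println("matching docs: " + evenDocs.cardinality());
        }
        System.out.println("filter evaluated " + cache.computations.get() + " time(s)");
    }
}
```

A cache like this is only valid for one view of the index. When a new searcher is opened on a changed index, the cached bits have to be discarded and rebuilt, which is one more reason to open the searcher once and keep it as long as possible.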

Known Enemies • CPU, Memory, I/O are all known enemies of performance • Can’t live without them, either! • Profile, run benchmarks, look at garbage collection policies, etc. • Check your needs • Do you need wildcards? • Do you need so many Fields?

Document Retrieval • Common Search Scenario: • Many small Fields containing info about the Document • One or two big Fields storing content • Run search, display small Fields to user • User picks one result to view content

FieldSelector • Gives developer greater control over how the Document is loaded • Load, Lazy, No Load, Load and Break, Size, etc. • In previous scenario, lazy load the large Fields • Easier to store original content without performance penalty
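
Lazy loading, as applied to the large content Fields in the scenario above, means deferring the expensive read until the value is actually requested. The following sketch shows the principle with a memoizing wrapper. `LazyField` is a hypothetical class for illustration, not Lucene's FieldSelector API, and the `Supplier` stands in for the read from the stored-fields file.

```java
import java.util.function.Supplier;

/** Toy lazily loaded field: the value is read on first access, then kept; not Lucene's API. */
final class LazyField {
    private final String name;
    private final Supplier<String> loader;
    private String value;
    private boolean loaded;

    LazyField(String name, Supplier<String> loader) {
        this.name = name;
        this.loader = loader;
    }

    String name() {
        return name;
    }

    /** Loads the value on the first call only; later calls return the stored value. */
    synchronized String stringValue() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        LazyField body = new LazyField("body", () -> {
            System.out.println("(reading large field from storage)");
            return "a very large document body";
        });
        // Displaying a result list touches only the small fields, so the body stays unloaded.
        System.out.println("loaded before click: " + body.isLoaded());
        // The user picks this result, so the large field is read now.
        System.out.println("body length: " + body.stringValue().length());
        System.out.println("loaded after click: " + body.isLoaded());
    }
}
```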

Quality Queries • Evaluating search quality is difficult and subjective • Lucene provides good out of the box quality by most accounts • Can evaluate using TREC or other experiments, but these risk overtuning • Unfortunately, judging quality is a labor-intensive task

Quality Experiments • Needs: • Standard collection of docs - easy • Set of queries • Query logs • Develop in-house • TREC, other conferences • Set of judgments • Labor intensive • Can use log analysis to estimate which results are relevant to a query, based on clicks, etc.

Query Formulation • Invest the time in determining the proper analysis of the fields you are searching • Case sensitive search • Punctuation analysis • Strict matching • Stopword policy • Stopwords can be useful • Operator choice • Synonym choices

Effective Scoring • Similarity class provides callback mechanism for controlling how some Lucene scoring factors count towards the score • tf(), idf(), coord() • Experiment with different length normalization factors • You may find Lucene is overemphasizing shorter or longer documents
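
To see what experimenting with length normalization means in practice, the sketch below reproduces the formulas used by Lucene's DefaultSimilarity for tf, idf and length normalization, and compares the default norm with a gentler alternative. The class `ScoringSketch`, the `dampedLengthNorm` function and its exponent are made up for illustration. The `termWeight` product is a simplification: the real score includes further factors such as coord, boosts and query normalization.

```java
/** Toy versions of some Similarity factors; dampedLengthNorm and termWeight are hypothetical. */
final class ScoringSketch {

    /** Term frequency factor as in DefaultSimilarity: sqrt(freq). */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency as in DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Length normalization as in DefaultSimilarity: 1 / sqrt(numTerms). */
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Hypothetical gentler alternative that penalizes long fields less: 1 / numTerms^0.25. */
    static double dampedLengthNorm(int numTerms) {
        return 1.0 / Math.pow(numTerms, 0.25);
    }

    /** Simplified per-term weight: tf * idf * norm (omits coord, boosts and query normalization). */
    static double termWeight(int freq, int docFreq, int numDocs, double norm) {
        return tf(freq) * idf(docFreq, numDocs) * norm;
    }

    public static void main(String[] args) {
        int numDocs = 10_000;
        int docFreq = 50;
        int shortDoc = 100;   // terms in a short field
        int longDoc = 1_600;  // terms in a long field
        // Same term frequency in both documents; only the field length differs.
        System.out.println("default, short: " + termWeight(3, docFreq, numDocs, defaultLengthNorm(shortDoc)));
        System.out.println("default, long:  " + termWeight(3, docFreq, numDocs, defaultLengthNorm(longDoc)));
        System.out.println("damped, short:  " + termWeight(3, docFreq, numDocs, dampedLengthNorm(shortDoc)));
        System.out.println("damped, long:   " + termWeight(3, docFreq, numDocs, dampedLengthNorm(longDoc)));
    }
}
```

With the default norm a field 16 times longer is weighted 4 times lower for the same term frequency; with the damped variant the gap shrinks to a factor of 2. Whether either is right for a given collection is exactly what the quality experiments are for.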

Effective Scoring • Can also implement your own Query class • Ask if anyone else has done it first on java-user mailing list • Go beyond the obvious: • org.apache.lucene.search.function package provides means for using values of Fields to change the scores • Geographic scoring, user ratings, others • Payloads (stay tuned for next presentation)

Resources • Talk available at: http://lucene.grantingersoll.com/apachecon07/LucenePerformance.ppt • http://lucene.apache.org • Mailing List • java-user@lucene.apache.org • Lucene In Action • http://www.lucenebook.com
