Apache Lucene and Apache Solr Performance Tuning

Mark Miller (markrmiller@apache.org). Apache Lucene and Apache Solr Performance Tuning. Brief intro to Lucene: a Java library for building and searching “inverted” indices. Small, efficient, fast; approx. 1 MB jar file. Topics: inverted index (think of a book index), segments, incremental indexing.


Presentation Transcript


  1. Apache Lucene and Apache Solr Performance Tuning
     Mark Miller (markrmiller@apache.org)

  2. Brief Intro To
     • Lucene: Java library for building and searching “inverted” indices.
     • Small, efficient, fast
     • Approx. 1 MB jar file

  3. Inverted Index
     • Think of a book index

  4. Segments
     • Incremental indexing

  5. Index Files
     • segments file
     • .fnm - field names (indexed? payloads? term vectors?)
     • .tis .tii - term dictionary
     • .frq - term frequencies
     • .fdt .fdx - stored fields
     • .tvx .tvd .tvf - term vectors (freq and optional pos/offset)
     • .nrm - norms
     • .del - deletions

  6. Brief Intro To
     • Solr: search server built on top of Lucene
     • Manages index views, provides different access protocols (HTTP, Java, PHP, Ruby, etc.).
     • Adds many features: faceting, spellchecking, distribution, replication, caching, etc.

  7. Solr's solrconfig.xml
     • Controls Solr's settings and, in some cases, Lucene's settings
     • The example solrconfig is exactly that: an example (a starting point, not an end point)

  8. solrconfig.xml
     • useCompoundFile - writes each segment's files into a single .cfs file; slower indexing (~10%)
     • mergeFactor - controls how often merges occur and how many segments accumulate
     • ramBufferSizeMB - flushes by RAM usage (generally better than flushing by document count with maxBufferedDocs)
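As a rough sketch, these three settings sit together in the indexDefaults section of solrconfig.xml. The values below are illustrative starting points rather than recommendations; measure against your own indexing load before settling on them.

```xml
<indexDefaults>
  <!-- false: keep separate per-segment files (avoids the compound-file indexing cost,
       at the price of more files on disk) -->
  <useCompoundFile>false</useCompoundFile>
  <!-- lower: fewer segments but more merge work while indexing; higher: the reverse -->
  <mergeFactor>10</mergeFactor>
  <!-- flush buffered documents to a new segment once they use roughly this much RAM -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>
```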

  9. solrconfig.xml
     <!-- If true, IndexReaders will be reopened (often more efficient)
          instead of closed and then opened. -->
     <reopenReaders>true</reopenReaders>

  10. solrconfig.xml
      <!-- An optimization for use with the queryResultCache. When a search
           is requested, a superset of the requested number of document ids
           are collected. For example, if a search for a particular query
           requests matching documents 10 through 19, and queryResultWindowSize is 50,
           then documents 0 through 49 will be collected and cached. Any further
           requests in that range can be satisfied via the cache. -->
      <queryResultWindowSize>20</queryResultWindowSize>
      <!-- Maximum number of documents to cache for any entry in the
           queryResultCache. -->
      <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

  11. solrconfig.xml
      <!-- If a search request comes in and there is no current registered searcher,
           then immediately register the still warming searcher and use it. If
           "false" then all requests will block until the first searcher is done
           warming. -->
      <useColdSearcher>false</useColdSearcher>
      <!-- Maximum number of searchers that may be warming in the background
           concurrently. An error is returned if this limit is exceeded. Recommend
           1-2 for read-only slaves, higher for masters w/o cache warming. -->
      <maxWarmingSearchers>2</maxWarmingSearchers>

  12.
      <!-- a newSearcher event is fired whenever a new searcher is being prepared
           and there is a current searcher handling requests (aka registered).
           It can be used to prime certain caches to prevent long request times for
           certain requests. -->
      <!-- QuerySenderListener takes an array of NamedList and executes a
           local query request for each NamedList in sequence. -->
      <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <!--
          <lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
          <lst> <str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str> </lst>
          <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
          -->
        </arr>
      </listener>
      <!-- a firstSearcher event is fired whenever a new searcher is being
           prepared but there is no current registered searcher to handle
           requests or to gain autowarming data from. -->
      <listener event="firstSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <lst> <str name="q">solr rocks</str><str name="start">0</str><str name="rows">10</str></lst>
          <lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
        </arr>
      </listener>

  13. Solr Caches
      • Turn them off?
      • Size them correctly
      • Look at your cache stats to decide what to do (e.g. hits, evictions)
      • Play with autowarmCount

  14. Lucene and Solr Performance Tuning

  15. Solr cache implementations
      • Solr offers two cache implementations: LRUCache, based on a synchronized LinkedHashMap, and FastLRUCache, based on a ConcurrentHashMap.
      • FastLRUCache has faster gets and slower puts in single-threaded operation, so it is generally faster than LRUCache when the cache hit ratio is high (> 75%), and may be faster in other scenarios on multi-CPU systems.
      • The example solrconfig.xml uses FastLRUCache for the filter cache.
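A sketch of how the cache implementation, size, and autowarming might be declared inside the query section of solrconfig.xml. The sizes and autowarmCount values here are assumptions chosen for illustration; the hit and eviction statistics for your own traffic should drive the real numbers.

```xml
<query>
  <!-- filter cache: typically read-heavy with a high hit ratio, so FastLRUCache is a candidate -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <!-- query result cache: ordered document id lists per query -->
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <!-- document cache: stored fields; internal doc ids change between searchers,
       so it is not autowarmed -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
</query>
```

A higher autowarmCount makes a new searcher start with warmer caches, but lengthens the time it takes to become available after a commit.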

  16. NIOFSDirectory
      • An FSDirectory implementation that uses the positional read of java.nio's FileChannel, which allows multiple threads to read from the same file without synchronizing.
      • Solr selects it automatically on non-Windows systems; on Windows it performs poorly due to a Sun JVM bug.

  17. Lucene Autocommit
      <!--
        Expert: Turn on Lucene's auto commit capability. This causes intermediate
        segment flushes to write a new lucene index descriptor, enabling it to be
        opened by an external IndexReader. This can greatly slow down indexing
        speed. NOTE: Despite the name, this value does not have any relation to
        Solr's autoCommit functionality
      -->
      <!--<luceneAutoCommit>false</luceneAutoCommit>-->
      • There is no guarantee when exactly an auto commit will occur (it used to be after every flush, but it is now after every completed merge, as of 2.4).

  18. Merge Policy
      <!--
        Expert: The Merge Policy in Lucene controls how merging is handled by
        Lucene. The default in 2.3 is the LogByteSizeMergePolicy, previous
        versions used LogDocMergePolicy.
        LogByteSizeMergePolicy chooses segments to merge based on their size. The
        Lucene 2.2 default, LogDocMergePolicy chose when to merge based on number
        of documents
        Other implementations of MergePolicy must have a no-argument constructor
      -->
      <!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>-->

  19. Merge Scheduler
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxThreadCount">3</int>
      </mergeScheduler>

  20. solrconfig.xml
      <!-- Expert:
           Controls how often Lucene loads terms into memory -->
      <!--<termIndexInterval>256</termIndexInterval>-->
      • This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed.
      • In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but rather by the processing of frequency and positional data.
      • In a small index, or when many uncommon query terms are generated (e.g., by wildcard queries), term lookup may become a dominant cost.

  21. Solr's schema.xml
      • Controls how content is going to be processed and stored in Solr.
      • Again, the version that comes with Solr is an example, not a final schema for your application.

  22. schema.xml
      • Only store the fields you need to retrieve: stored="false" for the rest
      • Lazy loading (on by default in solrconfig.xml) can help if you have large stored fields that are not always returned.
      • Don't index the fields you only want to return: indexed="false"
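A minimal sketch of the two cases in schema.xml. The field names are hypothetical, and the "text" and "string" types are assumed to be defined as in the example schema.

```xml
<!-- searched but never returned in results: index it, do not store it -->
<field name="body" type="text" indexed="true" stored="false"/>

<!-- returned with results but never searched: store it, do not index it -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>
```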

  23. schema.xml
      • copyField: copies source fields into a target field
      • Remove unused copyFields.
      • Consider using a copyField rather than searching many fields.
      • You probably don't want to store the target of a copyField.
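One way this might look, with hypothetical field names: several source fields are copied into a single catch-all target that is indexed but not stored, so queries can hit one field instead of many.

```xml
<field name="title"    type="text" indexed="true" stored="true"/>
<field name="author"   type="text" indexed="true" stored="true"/>
<!-- catch-all search target: indexed only, multiValued because it receives several sources -->
<field name="all_text" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="title"  dest="all_text"/>
<copyField source="author" dest="all_text"/>
```

The originals stay stored for display, so storing the target as well would only duplicate data on disk.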

  24. schema.xml
      • Consider Trie field types for numerics
      • Breaks up numerics into multiple tokens
      • Much faster search performance on large indexes
      • Doesn't yet work with some features (e.g. faceting, though date faceting does currently work with TrieDateField)
      • Replaces both plain numerics and sortable numerics (unless you need "sortMissingFirst" or "sortMissingLast")
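A sketch of a Trie-based integer type and a field that uses it. The field name is hypothetical and the precisionStep value is only illustrative: a smaller step indexes more tokens per value (a larger index) in exchange for faster range queries.

```xml
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>

<field name="price_cents" type="tint" indexed="true" stored="true"/>
```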

  25. schema.xml
      • Omit norms where it makes sense
      • You lose index-time boosting and document length normalization
      • Norms take up a byte per document in RAM, allocated for every document per field no matter how many documents have that field: byte[maxDoc]
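For a sense of scale, one byte per document per field means that an index of 10 million documents spends roughly 10 MB of heap on each field that keeps norms. A sketch of switching them off for a field where length normalization and index-time boosts do not matter (the field name is hypothetical):

```xml
<!-- exact-match identifier: scoring by field length is meaningless here -->
<field name="sku" type="string" indexed="true" stored="true" omitNorms="true"/>
```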

  26. schema.xml
      • Use omitTermFreqAndPositions when it makes sense
      • true by default except for text fields.
      • Drops tf and position info for a field
      • Can be useful for short db-type fields, where you want term matching but not scores or positional matching.
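A sketch of setting the attribute explicitly on a short, database-style field (the field name is hypothetical). Because positions are dropped, queries that depend on them, such as phrase queries, will not work against such a field.

```xml
<!-- match/no-match filter field: no term frequencies or positions needed -->
<field name="status" type="string" indexed="true" stored="true"
       omitTermFreqAndPositions="true"/>
```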

  27. Leading Wildcard Performance
      • Very slow by default: enumerates every term in the index
      • Lucene's QueryParser does not allow it by default; Solr hasn't allowed it at all in the past.
      • Use solr.ReversedWildcardFilterFactory
      • A filter that reverses tokens to provide faster leading wildcard and prefix queries. Add this filter to the index analyzer, but not the query analyzer.
      • The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*).
      • To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text.
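A sketch of a field type wired this way, modeled on the text_rev type in Solr's example schema. The tokenizer and lowercase filter are assumptions about the rest of the analysis chain, and the numeric attribute values are illustrative.

```xml
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <!-- index side: emit each token plus its reversed, marker-prefixed form -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <!-- query side: no reversing filter; the query parser rewrites leading wildcards itself -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With withOriginal="true" both forms of each token are indexed, so ordinary queries keep working at the cost of a larger term dictionary.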

  28. JVM Settings
      • Most important: -Xmx -Xms
      • Most RAM usage: field caches, Solr caches, index searchers (term index, norms?)
      • How to choose -Xmx -Xms?
      • Leave room for the filesystem cache

  29. Filesystem Cache
      • Leave room for it.
      • Warming queries help fire it up
      • Ensure important files are in the cache? cat *.prx *.frq *.tis > /dev/null

  30. Garbage Collection Tuning
      • Large multi-gig heap? Choose your collector:
      • Likely the concurrent low-pause collector, but perhaps the parallel (throughput) collector.
      • Adventurous? Try the G1 collector. Still likely buggy. -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC

  31. GC Tuning
      • Parallel compaction is used by default in JDK 6, but can be enabled by adding the option -XX:+UseParallelOldGC to the command line in JDK 5 update 6 and later.
      • With CMS, -XX:+UseParNewGC is on by default on multiprocessor machines.

  32. Logging
      • Solr logging is chatty: it defaults to INFO
      • Raising the level can increase performance
      • It's often not worth the information loss, though
