Solr Performance & Key Innovations

Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com, May 26 2011

Solr 3.1 Highlights
• Numeric range facets (similar to date faceting).
• New spatial search, including spatial filtering, boosting, and sorting capabilities.
• Example Velocity-driven search UI at http://localhost:8983/solr/browse
• A new, faster term-vector-based highlighter.
• Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax.
• Distributed search support for the Spell check and Terms components.

Solr 3.1 Highlights (continued)
• Suggester, a fast trie-based autocomplete component.
• Sort results by any function query.
• JSON document indexing.
• CSV response format.
• Apache UIMA integration for metadata extraction.
• Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.

What’s not in 3.1?
• Result Grouping (AKA Field Collapsing)
• Pivot Faceting
• SolrCloud
• Pseudo-fields
• Pseudo-join
• Relevancy function queries
• Per-segment faceting
• *Tons* of new Lucene performance/efficiency goodness

Recent Lucene Performance
• TieredMergePolicy: the new default
  • Much better for incremental indexing / NRT (near-real-time search)
  • Ignores segment order when selecting the best merge
  • Takes deletes into account
  • Does not over-merge (no cascading merges)
• Finite State Transducer (FST) based terms index

DocumentsWriterPerThread (DWPT)
• Flushing a new segment is now concurrent with indexing
• Use multiple indexing threads/connections
• When max memory is hit, the biggest DWPT is flushed concurrently
[Diagram: indexing threads feed an IndexWriter that holds several in-memory DWPTs; each DWPT flushes its own segment to disk (_1_0.tiv, _1_0.prx, _1_0.frq, …; _2_0.tiv, _2_0.prx, _2_0.frq, …; _3_0.tiv, _3_0.prx, _3_0.frq, …)]

Solr Cloud
http://.../solr/collection1?distrib=true
[Diagram: the request fans out as load-balanced sub-requests to shard1 and shard2, each with replica1, replica2, and replica3; cluster state is kept in a ZooKeeper quorum of ZK nodes]
ZooKeeper layout:
/livenodes
  server1:8983/solr
  server2:8983/solr
  server2:8983/solr
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
/configs
  /myconf
    solrconfig.xml
    schema.xml

Solr Cloud: Getting Started
http://wiki.apache.org/solr/SolrCloud
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
• -Dbootstrap_confdir and -Dcollection.configName: upload ./solr/conf to ZK and call it "myconf"
• -DzkRun: run an internal ZK server
http://localhost:8983/solr/collection1/admin/zookeeper.jsp

Distributed Requests
• Explicitly specify node addresses to load-balance across
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  • Equivalent nodes in a list are separated by "|"
  • Different phases of the same distributed request use the same node
• Specify logical shard ids to search across
  shards=NY_shard,NJ_shard
• Query across all shards in the collection
  http://localhost:8983/solr/collection1/select?distrib=true
• public CloudSolrServer(String zkHost)
  • SolrJ Java client that load-balances across all nodes in the cluster
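The shards syntax above can be assembled mechanically. The following is a minimal Python sketch, not part of the talk: the helper name build_shards_param and the q=ipod query are illustrative assumptions, and only the host addresses come from the slide.

```python
from urllib.parse import urlencode, parse_qs


def build_shards_param(shards):
    """Join equivalent nodes with '|' and distinct shards with ','.

    `shards` is a list of lists: one inner list of node addresses per shard.
    (Hypothetical helper, written only to illustrate the syntax.)
    """
    return ",".join("|".join(nodes) for nodes in shards)


shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],  # equivalent nodes for the first shard
    ["localhost:7574/solr", "localhost:7500/solr"],  # equivalent nodes for the second shard
]

shards_value = build_shards_param(shards)

# urlencode percent-escapes '|', ':' and '/' so the value survives in a URL.
# q=ipod is an assumed example query.
query_string = urlencode({"q": "ipod", "shards": shards_value})
```

Decoding query_string with parse_qs gives back the same shards value, which is a quick way to confirm the escaping did not damage the "|" and "," separators.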

Extended Dismax Parser
• Superset of dismax
• Designed to directly handle user queries without exceptions
  &defType=edismax&q=foo&qf=body
  • Fixes edge cases where dismax could still throw exceptions: OR AND NOT - "
• Full Lucene syntax support
  • Tries Lucene syntax first
  • Smart escaping is applied if there are syntax errors
• Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
• Fielded queries (e.g. myfield:foo), even in degraded mode
  • uf parameter controls which field names may be specified directly in "q"

Extended Dismax Parser (continued)
• boost parameter for multiplicative boost-by-function
• Pure negative query clauses
  Example: solr OR (-solr)
• Enhanced term proximity boosting
  • pf2=myfield results in term bigrams in sloppy phrase queries
    myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
• Enhanced stopword handling
  • Stopwords are omitted in the main query, but added in the optional proximity boosting part
    Example: q=solr is awesome & qf=myfield & pf2=myfield
    -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  • Currently controlled by the absence of StopWordFilter in the index analyzer, and its presence in the query analyzer
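The pf2 bigram expansion shown above amounts to pairing each term with its successor. The sketch below is a toy illustration in Python, not the edismax parser itself: the function name pf2_phrases is hypothetical, and plain whitespace splitting stands in for the real query analyzer.

```python
def pf2_phrases(field, query):
    """Return one quoted two-term phrase per adjacent word pair.

    Toy model of pf2: whitespace tokenization is an assumption; Solr
    runs the query analyzer for the field instead.
    """
    terms = query.split()
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


three_terms = pf2_phrases("myfield", "aa bb cc")
with_stopword = pf2_phrases("myfield", "solr is awesome")
```

Under these assumptions, three_terms reproduces the two phrases from the first example, and with_stopword keeps "is" inside the phrases, matching the proximity part of the stopword example.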

Faceting Performance Improvements
• For facet.method=enum, faster initial population of the filterCache (i.e. first-time faceting): anywhere from 30% to 32x faster
• Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
• Optimized deep facet paging: up to 10x faster with really large facet.offset values
• Less memory consumed by field cache entries
• Per-segment faceting with facet.method=fcs
  • Only faster when re-opening the index frequently (many times a second)
  • Only works for single-valued fields

Pivot Faceting
• Other names that could have made sense:
  • Grid Faceting, Cross-Product Faceting, Matrix Faceting
• Syntax: facet.pivot=field1,field2,field3,…
  facet.pivot=cat,inStock

Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          { "field":"popularity",
            "value":"7",
            "count":4},
(continued)
          { "field":"popularity",
            "value":"1",
            "count":2}]},
      { "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]
• "count":14 means 14 docs w/ cat==electronics
• "count":5 means 5 docs w/ cat==electronics && popularity==6
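A nested pivot response like this is often easier to consume as flat rows. The Python sketch below walks the tree recursively; the function name flatten_pivot is hypothetical, and the sample data is a trimmed copy of the slide's response with the elided entries left out.

```python
# Trimmed sample based on the slide; entries hidden behind "[…]" are omitted.
response = {
    "facet_counts": {
        "facet_pivot": {
            "cat,popularity": [
                {"field": "cat", "value": "electronics", "count": 14, "pivot": [
                    {"field": "popularity", "value": "6", "count": 5},
                    {"field": "popularity", "value": "7", "count": 4},
                ]},
                {"field": "cat", "value": "memory", "count": 3, "pivot": []},
            ]
        }
    }
}


def flatten_pivot(nodes, prefix=()):
    """Return (path, count) rows, where path is the tuple of pivot values so far."""
    rows = []
    for node in nodes:
        path = prefix + (node["value"],)
        rows.append((path, node["count"]))
        # Leaf nodes may lack "pivot" or carry an empty list.
        rows.extend(flatten_pivot(node.get("pivot") or [], path))
    return rows


rows = flatten_pivot(response["facet_counts"]["facet_pivot"]["cat,popularity"])
```

Each row pairs a value path such as ("electronics", "6") with its count, which mirrors the "cat==electronics && popularity==6" reading of the slide.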

Range Faceting
• Like Date faceting, but more generic
http://...&facet=true
  &facet.range=price
  &facet.range.start=0
  &facet.range.end=500
  &facet.range.gap=50
"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
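To read the response above, it helps to turn each key back into an explicit interval. This Python sketch assumes each key in "counts" is the lower bound of a bucket that is "gap" wide; the function name range_buckets is hypothetical, and whether the bounds are inclusive is not stated on the slide.

```python
# The "price" entry from the slide's facet_ranges response.
price_facet = {
    "counts": {
        "0.0": 5, "50.0": 2, "100.0": 0, "150.0": 2, "200.0": 0,
        "250.0": 1, "300.0": 2, "350.0": 2, "400.0": 0, "450.0": 1,
    },
    "gap": 50.0,
    "start": 0.0,
    "end": 500.0,
}


def range_buckets(facet):
    """Return (lower, upper, count) per bucket, assuming keys are lower bounds."""
    gap = facet["gap"]
    return [
        (float(lower), float(lower) + gap, count)
        for lower, count in facet["counts"].items()
    ]


buckets = range_buckets(price_facet)
total_in_range = sum(count for _, _, count in buckets)
```

With start=0, end=500, and gap=50 this yields ten buckets, the last one ending exactly at the requested end value.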

Spatial Search
Step 1: Index some locations!
  <field name="name">The Alpine Shop</field>
  <field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
  &pt=44.0153371,-73.16734
  &d=1
  &sfield=store
Step 3: Profit!
  Spatial filter: &fq={!geofilt}
  Bounding box: &fq={!bbox}
  Distance function: &sort=geodist() asc
  Returning the distance: &fl=geodist() (Pseudo-fields!)
Note: You can now sort by any arbitrary function query!
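As a rough sanity check on the example coordinates, the great-circle distance between the query point and the store can be computed by hand. The Python sketch below uses the haversine formula with an assumed mean Earth radius of 6371 km, and assumes d is expressed in kilometers; it is an approximation for illustration, not Solr's geodist() implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the talk


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)   # "store" field value from the slide
here = (44.0153371, -73.16734)    # pt parameter from the slide

distance = haversine_km(here[0], here[1], store[0], store[1])
within_filter = distance <= 1.0   # d=1, assumed to be kilometers
```

Under these assumptions the store lies well inside the d=1 radius, so both the geofilt and bbox filters in the example would be expected to keep it.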

Pseudo-Fields
Returns other info along with document stored fields
• Function queries
  fl=name,location,geodist(),add(myfield,10)
• Fieldname globs
  fl=id,attr_*
• Multiple "fl" (field list) values
  &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
• Aliasing
  fl=id,location:loc,_dist_:geodist()
• Future: inlined highlighting, "explain", sort-values, group-value

Result Grouping / Field Collapsing
• Goal
  • Limit the number of results per category
  • "Category" is normally defined by the unique values in a field
• Uses
  • Web search: collapse by web site
  • Email threads: collapse by thread id
  • Ecommerce/retail
    • Show the top 5 items for each store category (music, movies, etc.)

Field Collapsing by Site

Result Grouping by Category Field Collapse on Product Type

Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            { "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      { "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            { "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}
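A client usually wants one entry per group rather than the raw nesting. The Python sketch below summarizes the slide's response; the function name summarize_groups is hypothetical, and the data is copied from the example response.

```python
# The "grouped" section from the slide's response.
grouped = {
    "manu_exact": {
        "matches": 3,
        "groups": [
            {"groupValue": "Belkin",
             "doclist": {"numFound": 2, "start": 0, "docs": [
                 {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
            {"groupValue": "Apple Computer Inc.",
             "doclist": {"numFound": 1, "start": 0, "docs": [
                 {"id": "MA147LL/A", "name": "Apple 60 GB iPod with Video Playback Black"}]}},
        ],
    }
}


def summarize_groups(grouped, field):
    """Map each group value to (numFound, ids of the docs actually returned)."""
    return {
        group["groupValue"]: (
            group["doclist"]["numFound"],
            [doc["id"] for doc in group["doclist"]["docs"]],
        )
        for group in grouped[field]["groups"]
    }


summary = summarize_groups(grouped, "manu_exact")
```

Note that the Belkin group reports numFound 2 while returning a single document: numFound counts every match in the group, and the docs list holds only the top results returned for it.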

Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        { "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        { "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        { "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}

Grouping Params

Pseudo-Join
Blog documents:
  id: blog1
  name: Solr 'n Stuff
  owner: Yonik Seeley
  started: 2007-10-26

  id: blog2
  name: lifehacker
  owner: Gawker Media
  started: 2005-1-31
Post documents:
  id: post1
  blog_id: blog1
  author: Yonik Seeley
  title: Solr relevancy function queries
  body: Lucene’s default ranking […]

  id: post2
  blog_id: blog1
  author: Yonik Seeley
  title: Solr result grouping
  body: Result Grouping, also called […]

  id: post3
  blog_id: blog2
  author: Whitson Gordon
  title: How to Install Netflix on Almost Any Android Device
Restrict to blogs mentioning netflix:
  fq={!join from=blog_id to=id}body:netflix
• Finds all documents matching "netflix"
• Maps to different docs by following blog_id to id

Pseudo-Join Examples
• Only show posts from blogs started after 2010
  q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
• If any post in a blog mentions "obama", then search all posts in that blog for "bomb" (self-join)
  q=bomb&fq={!join from=blog_id to=blog_id}obama
• If any blog post mentions "obama", then search all websites with the same blog owner for "bomb"
  q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama
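The join semantics can be modeled in memory: run the inner query, collect the values of the "from" field, then keep the documents whose "to" field holds one of those values. The Python sketch below is a toy model over the blog and post documents from the previous slide, not Solr's implementation. The function name pseudo_join is hypothetical, fields are treated as single-valued, and because the slide does not show post3's body the first example matches on the title instead; the second example uses 2006 rather than 2010 as its cutoff so that the sample data produces a match.

```python
docs = [
    {"id": "blog1", "name": "Solr 'n Stuff", "owner": "Yonik Seeley", "started": "2007-10-26"},
    {"id": "blog2", "name": "lifehacker", "owner": "Gawker Media", "started": "2005-1-31"},
    {"id": "post1", "blog_id": "blog1", "title": "Solr relevancy function queries"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr result grouping"},
    {"id": "post3", "blog_id": "blog2",
     "title": "How to Install Netflix on Almost Any Android Device"},
]


def pseudo_join(docs, from_field, to_field, matches):
    """Toy join: collect from_field values of matching docs, then return
    the docs whose to_field equals one of them (single-valued fields only)."""
    keys = {d[from_field] for d in docs if from_field in d and matches(d)}
    return [d for d in docs if d.get(to_field) in keys]


# Analogue of {!join from=blog_id to=id}, matching "netflix" in the title.
netflix_blogs = pseudo_join(
    docs, "blog_id", "id", lambda d: "netflix" in d.get("title", "").lower())

# Analogue of {!join from=id to=blog_id}started:[2006 TO *] (cutoff changed for the sample data).
recent_posts = pseudo_join(
    docs, "id", "blog_id",
    lambda d: "started" in d and int(d["started"].split("-")[0]) >= 2006)
```

The first call maps from matching posts to their blog, and the second maps from matching blogs to their posts, which shows how swapping "from" and "to" reverses the direction of the join.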

Cross-Core Join
http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john
Single Solr server with two cores:
sec1
  id: mary
  security_groups: managers, employees

  id: john
  security_groups: employees
collection1
  id: doc1
  security: managers
  title: doc for managers only
  body: …

  id: doc1
  security: managers, employees
  title: doc for everyone
  body: …

Pseudo-Join vs Grouping

Auto-Suggest
• Many people previously used the terms component
  • Can be slow for a large corpus
• New auto-suggest builds off the SpellCheck component
  • TST implementation: compact, memory-based trie
  • FST implementation: slower to build, but smaller and faster to look up
• Based on a field in the main index, or on a dictionary file
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
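The "suggestions" value in this response is a flat list that alternates names and values, so a client has to pair the entries up. The Python sketch below does that for the slide's response; the function name parse_suggestions is hypothetical, and it assumes no query term is literally named "collation".

```python
# The "suggestions" list from the slide's response.
suggestions = [
    "ult", {"numFound": 1, "startOffset": 0, "endOffset": 3,
            "suggestion": ["ultrasharp"]},
    "collation", "ultrasharp",
]


def parse_suggestions(flat):
    """Split the alternating name/value list into per-term suggestions and collations."""
    per_term, collations = {}, []
    # Even positions hold names, odd positions hold the matching values.
    for name, value in zip(flat[0::2], flat[1::2]):
        if name == "collation":
            collations.append(value)
        else:
            per_term[name] = value["suggestion"]
    return per_term, collations


per_term, collations = parse_suggestions(suggestions)
```

For an autocomplete box, per_term["ult"] gives the candidate completions for the prefix the user typed.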

Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
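The same update can be prepared from Python using only the standard library. This is a minimal sketch that builds the request without sending it, so it runs without a Solr server; the URL and document come from the slide, and everything else is plain urllib usage rather than a Solr client API.

```python
import json
from urllib.request import Request

docs = [
    {
        "id": "978-0641723445",
        "cat": ["book", "hardcover"],
        "title": "The Lightning Thief",
        "author": "Rick Riordan",
        "series_t": "Percy Jackson and the Olympians",
        "sequence_i": 1,
        "genre_s": "fantasy",
        "inStock": True,
        "price": 12.50,
        "pages_i": 384,
    }
]

# Serialize the document list as the JSON array the update handler expects.
body = json.dumps(docs).encode("utf-8")

request = Request(
    "http://localhost:8983/solr/update/json",
    data=body,  # supplying data makes urllib issue a POST
    headers={"Content-type": "application/json"},
)

# To actually send it against a running server:
#   from urllib.request import urlopen
#   urlopen(request)
```

As with the curl form, the documents are not searchable until a commit happens, which this sketch leaves out.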

Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
• Can handle multi-valued fields (see the "cat" field in the example)
• Completely compatible with the CSV update handler (can round-trip)
• Results are streamed, which is good for dumping entire parts of the index
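The CSV output above can be read back with Python's csv module. The sketch below parses the three example rows and splits the multi-valued "cat" column; the naive comma split is an assumption that holds only when individual values contain no commas themselves.

```python
import csv
import io

# The CSV response from the slide.
csv_text = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["price"] = float(row["price"])
    row["popularity"] = int(row["popularity"])
    # The quoted cell holds all values of the multi-valued field.
    row["cat"] = row["cat"].split(",")
    rows.append(row)
```

The quoting around "electronics,connector" is what lets the reader keep a multi-valued field in a single cell before it is split.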

http://localhost:8983/solr/browse

Q&A
