Solr 1.5 and Beyond Yonik Seeley May 11, 2010

NYC Lucene/Solr Meetup • Solr 1.5 and Beyond • Yonik Seeley • May 11, 2010

Agenda • Lucene/Solr merge • Relevancy (Extended Dismax Parser) • Scalability (Solr Cloud) • Spatial/Geo Search • Near Real Time • Field Collapsing • Q&A

Lucene-Solr Merge • Lucene/Solr voted to merge (March 2010) • Were already separate sub-projects of the Lucene TLP • High committer overlap • Solr had stopped using Lucene trunk/development versions • Much code duplication • What it means • Single set of committers • Single developer mailing list (dev@lucene.apache.org) • Single subversion trunk • Keep separate downloads, user mailing lists

Lucene/Solr Development Changes • Nutch, Tika, Mahout spun off to their own TLP • Still may be considered part of “Lucene Ecosystem” • Lucene/Solr development changes • trunk is now always next major release (currently 4.0) • branch_3x will be base for all 3.x releases • No back compat guarantees between major releases

Relevance

Extended Dismax Parser • Superset of dismax: &defType=edismax&q=foo&qf=body • Fixes edge cases where dismax could still throw exceptions, e.g. queries containing OR, AND, NOT, - or " • Full Lucene syntax support • Tries Lucene syntax first • Smart escaping is done if there are syntax errors • Optionally supports treating "and"/"or" as AND/OR in Lucene syntax • Fielded queries (e.g. myfield:foo) even in degraded mode • uf parameter controls which field names may be directly specified in "q"
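A minimal sketch of how a client might assemble the request shown above. The helper name edismax_url and the base URL http://localhost:8983/solr/select are assumptions for illustration; only the defType, q and qf parameters come from the slide.

```python
from urllib.parse import urlencode

def edismax_url(base_url, query, query_fields):
    """Build a select URL that asks Solr to use the extended dismax parser."""
    params = {
        "defType": "edismax",          # parser selection, as on the slide
        "q": query,                    # raw user query, passed through unmodified
        "qf": " ".join(query_fields),  # fields to search
    }
    # urlencode escapes spaces, parentheses and quotes so the query survives transport
    return base_url + "?" + urlencode(params)

# Hypothetical host and handler path, used only for illustration
url = edismax_url("http://localhost:8983/solr/select", "foo", ["body"])
print(url)
```

Because the parser is meant to tolerate arbitrary user input, the client only has to URL-encode the query; it does not need to escape Lucene operators itself.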

Extended Dismax Parser (continued) • boost parameter for multiplicative boost-by-function • Pure negative query clauses. Example: solr OR (-solr) • Enhanced term proximity boosting • pf2=myfield – results in term bigrams in sloppy phrase queries: myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc" • Enhanced stopword handling • Stopwords omitted in main query, but added in optional proximity boosting part. Example: q=solr is awesome&qf=myfield&pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome") • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
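The bigram and stopword behaviour can be illustrated with plain string manipulation. This is a sketch of the rewriting described on the slide, not the parser's actual implementation; the function names pf2_phrases and rewrite_sketch are hypothetical.

```python
def pf2_phrases(field, query_terms):
    """Return one two-term phrase clause for each pair of adjacent query terms."""
    pairs = zip(query_terms, query_terms[1:])
    return ['{}:"{} {}"'.format(field, a, b) for a, b in pairs]

def rewrite_sketch(field, query_terms, stopwords):
    """Drop stopwords from the required clause but keep them in the optional bigrams."""
    main_terms = [t for t in query_terms if t not in stopwords]
    required = "+{}:({})".format(field, " ".join(main_terms))
    optional = " ".join(pf2_phrases(field, query_terms))
    return "{} ({})".format(required, optional)

print(pf2_phrases("myfield", ["aa", "bb", "cc"]))
print(rewrite_sketch("myfield", ["solr", "is", "awesome"], {"is"}))
```

The second call reproduces the rewritten query from the slide: the stopword "is" is absent from the required clause but still contributes to the proximity boost.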

Scalability

SolrCloud • First steps toward simplifying cluster management • Integrates ZooKeeper • Central configuration (schema.xml, solrconfig.xml, etc.) • Tracks live nodes + shards of collections • Removes need for external load balancers: shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr • Can specify logical shard ids: shards=NY_shard,NJ_shard • Clients don’t need to know shards: http://localhost:8983/solr/collection1/select?distrib=true
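To make the shards syntax concrete, here is a small sketch that splits a shards value into its parts. It assumes that commas separate shards and that "|" separates equivalent servers for one shard, which is how the load-balancing example above reads; parse_shards is a hypothetical helper, not part of Solr.

```python
def parse_shards(shards_param):
    """Split a shards value into a list of shards, each a list of equivalent servers."""
    shards = []
    for shard in shards_param.split(","):
        # Assumption: "|" lists interchangeable servers for the same shard
        servers = [s.strip() for s in shard.split("|") if s.strip()]
        if servers:
            shards.append(servers)
    return shards

explicit = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
)
logical = parse_shards("NY_shard,NJ_shard")
print(explicit)
print(logical)
```

With logical shard ids, each inner list holds a single name and the cluster state would be responsible for resolving it to live servers.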

SolrCloud : The Future • Eliminate all single points of failure • Remove Master/Searcher distinction • Enables near real-time search in a highly scalable environment • High Availability for Writes • Eventual consistency model (like Amazon Dynamo, Cassandra) • Elastic • Simply add/subtract servers, cluster will rebalance automatically • By default, Solr will handle document partitioning

Spatial Search

Spatial Search • PointType • Generic improvement: polyField – single value -> multiple indexed fields • Compound values: 38.89,-77.03 • Range queries and exact matches supported • q=location:21.33,51.37 • q=location:[10,20 TO 30,40] • Distance Functions • Generic improvement: function queries can yield multiple values • Haversine: hsin(3963.205, store, vector(10,20)) • Many possibilities, including boost by distance
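For readers unfamiliar with the haversine formula behind hsin, a standalone sketch follows. The signature here (radius plus two latitude/longitude pairs in degrees) is an assumption chosen for readability and is not Solr's function signature; the radius 3963.205 is taken from the slide and appears to be the Earth's radius in miles.

```python
import math

def haversine(radius, lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))

EARTH_RADIUS = 3963.205  # value used on the slide

# Hypothetical store locations, sorted by distance from the point (10, 20)
stores = {"a": (10.0, 20.0), "b": (12.0, 21.0), "c": (10.5, 20.1)}
by_distance = sorted(stores, key=lambda s: haversine(EARTH_RADIUS, *stores[s], 10, 20))
print(by_distance)
```

Sorting or boosting by distance amounts to evaluating this function once per candidate document against a fixed reference point.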

Spatial Search (continued) • Sorting by function query • sort=hsin(3963.205,store,vector(10,20)) asc • Distance Filtering (SOLR-1568) • fq={!sfilt fl=store_tiles}&pt=45.17614,-93.87341&d=1 • Implementations: trie range queries, spatial tiles, geohash • Return sort values or function query values for each doc • FunctionQuery results as pseudo-fields (SOLR-1298) fl=field1,field2,{!func key=dist}hsin(…) ???

Near Real Time

Near Real-Time Search • Shorter times until updates are searchable/visible • Lucene 2.9 first laid the groundwork w/ per-segment searching • Per-segment FieldCache entries for sorting and FunctionQueries • NRT IndexWriter.getReader() • Make new segments available before merging is done in background • Doesn’t cause commit/fsync first • Solr still needs • Per-segment faceting • Per-segment caching • Per-segment statistics (and anything else that uses FieldCache)

Existing single-valued faceting algorithm • [Figure: documents matching the base query "Juggernaut" (q=Juggernaut&facet=true&facet.field=hero) are counted using the Lucene FieldCache entry (StringIndex) for the "hero" field] • order: for each doc, an index into the lookup array • lookup: the string values ((null), batman, flash, spiderman, superman, wolverine) • accumulator: one counter per lookup entry, incremented for each matching doc
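The order/lookup/accumulator scheme can be sketched in a few lines. The lookup values are the ones named on the slide; the order array and the set of matching documents are hypothetical, and facet_counts is an illustrative name rather than a Solr API.

```python
def facet_counts(order, lookup, base_docs):
    """Count how many matching docs carry each term of a single-valued field."""
    accumulator = [0] * len(lookup)    # one counter per entry in lookup
    for doc in base_docs:
        accumulator[order[doc]] += 1   # order[doc] is the doc's index into lookup
    # Report only real terms (skip the null slot) that were actually seen
    return {term: n for term, n in zip(lookup, accumulator)
            if term is not None and n > 0}

lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [2, 5, 0, 1, 5, 4, 2, 1]   # hypothetical: one lookup index per document
base_docs = [0, 1, 4, 6]           # hypothetical: docs matching the base query

counts = facet_counts(order, lookup, base_docs)
print(counts)
```

The counting loop touches only integers; string values are consulted once at the end, which is what makes this approach fast for large result sets.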

Per-segment single-valued faceting algorithm • [Figure: the base DocSet is counted against a separate FieldCache entry and accumulator for each of four segments (accumulator1 to accumulator4), one thread per segment (thread1 to thread4)] • A FieldCache + accumulator merger (priority queue) combines the per-segment counts (example entries: flash, 5; Batman, 3)
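A single-threaded sketch of the per-segment variant: each segment is counted independently, then the sorted per-segment term lists are merged. Python's heapq.merge stands in for the priority-queue merger; the segment data and function names are hypothetical, and the threading shown in the figure is omitted.

```python
import heapq
from itertools import groupby

def segment_counts(order, lookup, base_docs):
    """Count one segment; returns (term, count) pairs sorted by term."""
    acc = [0] * len(lookup)
    for doc in base_docs:
        acc[order[doc]] += 1
    # lookup is sorted, so the emitted pairs are already in term order
    return [(t, n) for t, n in zip(lookup, acc) if t is not None and n > 0]

def merge_segment_counts(per_segment):
    """Merge sorted per-segment lists and sum the counts of equal terms."""
    merged = heapq.merge(*per_segment)  # heap-based k-way merge of sorted streams
    return {term: sum(n for _, n in group)
            for term, group in groupby(merged, key=lambda pair: pair[0])}

# Hypothetical segments: each has its own lookup, order array and matching docs
seg1 = segment_counts([1, 2, 2, 0], [None, "batman", "flash"], [0, 1, 2])
seg2 = segment_counts([1, 2, 1], [None, "flash", "superman"], [0, 1])
totals = merge_segment_counts([seg1, seg2])
print(totals)
```

The merge step is the extra cost of this method: term ordinals differ between segments, so counts can only be combined by comparing the term values themselves.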

Per-segment faceting • Enable with facet.method=fcs • Controllable multi-threading • facet.field={!threads=4}myfield • Disadvantages • Larger memory use (FieldCaches + accumulators) • Slower (extra FieldCache merge step needed) • Advantages • Rebuilds FieldCache entries only for new segments (NRT friendly) • Multi-threaded

Per-segment faceting performance comparison • Test index: 10M documents, 18 segments, single-valued field • Test A: base DocSet=100 docs, facet.field on a field with 100,000 unique terms • Test B: base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms • Times are complete request time, measured externally • [Results table not preserved in this transcript]

Field Collapsing

Field Collapsing • Field collapsing • Limit the number of results per category • “category” defined by unique values in a field • Uses • Web Search – collapse by web site • Email threads – collapse by thread id • Ecommerce/retail • Show the top 5 items for each store category (music, movies, etc)
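The core idea, limiting the number of results per distinct field value, can be sketched as a single pass over a ranked result list. This illustrates the concept only and is not how Solr implements it; the collapse function and the sample documents are hypothetical.

```python
from collections import defaultdict

def collapse(ranked_docs, field, limit):
    """Keep at most `limit` docs per distinct value of `field`, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for doc in ranked_docs:
        key = doc.get(field)
        if seen[key] < limit:
            seen[key] += 1
            kept.append(doc)
    return kept

# Hypothetical results, already sorted by relevance
results = [
    {"id": 1, "category": "music"},
    {"id": 2, "category": "music"},
    {"id": 3, "category": "movies"},
    {"id": 4, "category": "music"},
    {"id": 5, "category": "movies"},
]
top = collapse(results, "category", 2)
print([d["id"] for d in top])
```

With limit=1 and a site field this gives the web-search case; with a larger limit and a store category field it gives the "top 5 items per category" case.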

Field Collapsing by Site

Field Collapse on Product Type

Q&A
