
Lucene/Solr Meetup: Solr 1.5 and Beyond

Join us for the Lucene/Solr Meetup on May 11, 2010, where we will discuss topics such as Lucene/Solr merge, relevancy, scalability, spatial/geo search, near real-time, field collapsing, and more.





Presentation Transcript


  1. NYC Lucene/Solr Meetup: Solr 1.5 and Beyond • Yonik Seeley • May 11, 2010

  2. Agenda • Lucene/Solr merge • Relevancy (Extended Dismax Parser) • Scalability (Solr Cloud) • Spatial/Geo Search • Near Real Time • Field Collapsing • Q&A

  3. Lucene-Solr Merge • Lucene/Solr voted to merge (March 2010) • Were already separate sub-projects of the Lucene TLP • High committer overlap • Solr had stopped using Lucene trunk/development versions • Much code duplication • What it means • Single set of committers • Single developer mailing list (dev@lucene.apache.org) • Single subversion trunk • Keep separate downloads, user mailing lists

  4. Lucene/Solr Development Changes • Nutch, Tika, Mahout spun off to their own TLPs • May still be considered part of the “Lucene Ecosystem” • Lucene/Solr development changes • trunk is now always the next major release (currently 4.0) • branch_3x will be the base for all 3.x releases • No backward-compatibility guarantees between major releases

  5. Relevance

  6. Extended Dismax Parser • Superset of dismax: &defType=edismax&q=foo&qf=body • Fixes edge cases where dismax could still throw exceptions (queries containing OR, AND, NOT, - or ") • Full lucene syntax support • Tries lucene syntax first • Smart escaping is applied if there are syntax errors • Optionally supports treating “and”/“or” as AND/OR in lucene syntax • Fielded queries (e.g. myfield:foo) even in degraded mode • uf parameter controls what field names may be directly specified in “q”
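The request parameters on the Extended Dismax slide can be assembled programmatically. The following is a minimal sketch, not an official client: the helper name `edismax_url` and the base URL are assumptions for illustration, while `defType`, `q`, `qf`, and `pf2` are the parameter names used in the slides.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to the Solr instance being queried.
BASE_URL = "http://localhost:8983/solr/collection1/select"


def edismax_url(base_url, query, query_fields, **extra):
    """Build a select URL that asks Solr to use the edismax parser."""
    params = {
        "defType": "edismax",           # choose the extended dismax parser
        "q": query,                     # the user query string
        "qf": " ".join(query_fields),   # fields to search
    }
    params.update(extra)                # e.g. pf2="body"
    return base_url + "?" + urlencode(params)


url = edismax_url(BASE_URL, "foo", ["body"])
url_with_pf2 = edismax_url(BASE_URL, "solr OR (-solr)", ["body"], pf2="body")
print(url)
print(url_with_pf2)
```

`urlencode` takes care of escaping spaces and parentheses, so a query such as `solr OR (-solr)` can be passed through unchanged.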

  7. Extended Dismax Parser (continued) • boost parameter for multiplicative boost-by-function • Pure negative query clauses. Example: solr OR (-solr) • Enhanced term proximity boosting • pf2=myfield – results in term bigrams in sloppy phrase queries: myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc" • Enhanced stopword handling • stopwords omitted in main query, but added in optional proximity boosting part. Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome") • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
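The pf2 bigram expansion and the stopword handling described on this slide can be mimicked in a few lines. This is only a sketch of the rewrite shown in the slide's example, not the parser's actual implementation; the function names and the whitespace tokenization are assumptions.

```python
def word_bigrams(terms):
    """Return adjacent term pairs: [aa, bb, cc] -> ['aa bb', 'bb cc']."""
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]


def sketch_edismax_rewrite(query, field, stopwords):
    """Mimic the slide's example: stopwords are dropped from the
    mandatory main clause but kept in the optional bigram phrases."""
    terms = query.split()  # assumption: plain whitespace tokenization
    main_terms = [t for t in terms if t not in stopwords]
    main = f"+{field}:({' '.join(main_terms)})"
    phrases = " ".join(f'{field}:"{p}"' for p in word_bigrams(terms))
    return f"{main} ({phrases})"


bigrams = word_bigrams(["aa", "bb", "cc"])
rewritten = sketch_edismax_rewrite("solr is awesome", "myfield", {"is"})
print(bigrams)
print(rewritten)
```

Keeping the stopword in the phrase clauses means a document containing the exact wording "solr is awesome" can still earn a proximity boost even though "is" plays no part in matching.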

  8. Scalability

  9. SolrCloud • First steps toward simplifying cluster management • Integrates ZooKeeper • Central configuration (schema.xml, solrconfig.xml, etc.) • Tracks live nodes + shards of collections • Removes need for external load balancers: shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr • Can specify logical shard ids: shards=NY_shard,NJ_shard • Clients don’t need to know shards: http://localhost:8983/solr/collection1/select?distrib=true
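The shards value on the SolrCloud slide has a simple two-level structure. Reading it as commas separating shards and `|` separating equivalent servers for the same shard, a sketch of building and parsing such a value could look like this; the helper names are hypothetical.

```python
def shards_param(shards):
    """Join a list of shards, each a list of equivalent servers,
    into the comma- and pipe-delimited form shown on the slide."""
    return ",".join("|".join(replicas) for replicas in shards)


def parse_shards_param(value):
    """Inverse of shards_param: split back into shards and servers."""
    return [shard.split("|") for shard in value.split(",")]


cluster = [
    ["localhost:8983/solr", "localhost:8900/solr"],  # shard 1 and its alternate
    ["localhost:7574/solr", "localhost:7500/solr"],  # shard 2 and its alternate
]
value = shards_param(cluster)
print(value)
```

With logical shard ids the same helper yields `NY_shard,NJ_shard` when each shard is given as a single-element list.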

  10. SolrCloud : The Future • Eliminate all single points of failure • Remove Master/Searcher distinction • Enables near real-time search in a highly scalable environment • High Availability for Writes • Eventual consistency model (like Amazon Dynamo, Cassandra) • Elastic • Simply add/subtract servers, cluster will rebalance automatically • By default, Solr will handle document partitioning

  11. Spatial Search

  12. Spatial Search • PointType • Generic improvement: polyField – single value -> multiple indexed fields • Compound values: 38.89,-77.03 • Range queries and exact matches supported • q=location:21.33,51.37 • q=location:[10,20 TO 30,40] • Distance Functions • Generic improvement: function queries can yield multiple values • Haversine: hsin(3963.205, store, vector(10,20)) • Many possibilities, including boost by distance
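For readers unfamiliar with the haversine distance behind `hsin`, here is a standalone sketch of the formula. It is not Solr's implementation, and the argument order is this example's own; the radius 3963.205 is taken from the slide and is presumably the Earth's radius in miles.

```python
import math


def haversine(radius, lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees,
    expressed in the same unit as `radius`."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


EARTH_RADIUS = 3963.205  # value from the slide; assumed to be miles

same_point = haversine(EARTH_RADIUS, 10, 20, 10, 20)
quarter_turn = haversine(EARTH_RADIUS, 0, 0, 0, 90)  # a quarter of the equator
print(same_point, quarter_turn)
```

Because the result scales with the radius argument, passing a radius in kilometres instead would give distances in kilometres.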

  13. Spatial Search (continued) • Sorting by function query • sort=hsin(3963.205,store,vector(10,20)) asc • Distance Filtering (SOLR-1568) • fq={!sfilt fl=store_tiles}&pt=45.17614,-93.87341&d=1 • Implementations: trie range queries, spatial tiles, geohash • Return sort values or function query values for each doc • FunctionQuery results as pseudo-fields (SOLR-1298) fl=field1,field2,{!func key=dist}hsin(…) ???
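Sorting by a distance function, filtering by distance, and returning the computed value as a pseudo-field can all be illustrated outside Solr. The sketch below works on an in-memory list of documents; the document layout, the function names, and the assumption that `d` is in the same unit as the radius are all illustrative.

```python
import math


def great_circle(radius, p, q):
    """Haversine distance between (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def within_and_sorted(docs, point, d, radius=3963.205):
    """Keep docs whose 'store' location is within d of point,
    nearest first, and attach the distance as a 'dist' pseudo-field."""
    scored = [(great_circle(radius, doc["store"], point), doc) for doc in docs]
    hits = [(dist, doc) for dist, doc in scored if dist <= d]
    hits.sort(key=lambda pair: pair[0])
    return [dict(doc, dist=dist) for dist, doc in hits]


docs = [
    {"id": "c", "store": (46.0, -93.0)},
    {"id": "b", "store": (45.18, -93.87341)},
    {"id": "a", "store": (45.17614, -93.87341)},
]
nearby = within_and_sorted(docs, (45.17614, -93.87341), d=1)
print([doc["id"] for doc in nearby])
```

A brute-force scan like this is what the trie range, tile, and geohash implementations mentioned on the slide are meant to avoid on large indexes.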

  14. Near Real Time

  15. Near Real-Time Search • Shorter times until updates are searchable/visible • Lucene 2.9 first laid the groundwork with per-segment searching • Per-segment FieldCache entries for sorting and FunctionQueries • NRT: IndexWriter.getReader() • Makes new segments available before merging is done in the background • Doesn’t cause a commit/fsync first • Solr still needs: • Per-segment faceting • Per-segment caching • Per-segment statistics (and anything else that uses FieldCache)

  16. Existing single-valued faceting algorithm (diagram) • Request: q=Juggernaut&facet=true&facet.field=hero • Base DocSet: documents matching the base query “Juggernaut” • Lucene FieldCache entry (StringIndex) for the “hero” field • order: for each doc, an index into the lookup array • lookup: the string values ((null), batman, flash, spiderman, superman, wolverine) • accumulator: one counter per lookup entry, incremented for each matching document
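The order/lookup/accumulator scheme from this slide can be sketched as follows. The lookup values come from the slide; the `order` array and the base document set are made-up illustrative data, since the diagram's exact values are not recoverable from the transcript.

```python
def facet_counts(order, lookup, base_docset):
    """Count facet values for a single-valued field.

    order[doc]  -> index into lookup (0 means the doc has no value)
    lookup[i]   -> the string value for index i
    base_docset -> ids of the documents matching the base query
    """
    accumulator = [0] * len(lookup)
    for doc_id in base_docset:
        accumulator[order[doc_id]] += 1  # one array increment per document
    return {
        lookup[i]: count
        for i, count in enumerate(accumulator)
        if count and lookup[i] is not None
    }


lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [2, 5, 3, 5, 1, 4, 5, 0]   # hypothetical: one entry per document
base_docset = [0, 1, 3, 6]          # hypothetical matches for the base query
counts = facet_counts(order, lookup, base_docset)
print(counts)
```

The inner loop never touches a string: it only increments integers, and the term text is looked up once per distinct value at the end.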

  17. Per-segment single-valued faceting algorithm (diagram) • Base DocSet • Segment1 to Segment4 FieldCache entries, each with its own lookup array and accumulator (accumulator1 to accumulator4), incremented by thread1 to thread4 • FieldCache + accumulator merger (priority queue) • Sample merged entries: flash, 5; Batman, 3
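A per-segment variant counts each segment independently and then merges the sorted per-segment results. The sketch below uses a thread pool for the counting and `heapq.merge` (a priority-queue based merge) for the final step. It illustrates the idea only and is not Solr's code; the segment data is invented.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def count_segment(segment):
    """Count one segment; returns (term, count) pairs sorted by term."""
    order, lookup, docs = segment
    accumulator = [0] * len(lookup)
    for doc_id in docs:                 # doc ids are local to the segment
        accumulator[order[doc_id]] += 1
    # lookup[0] is the "no value" slot; the rest is sorted by term.
    return [(lookup[i], accumulator[i])
            for i in range(1, len(lookup)) if accumulator[i]]


def per_segment_facets(segments, threads=4):
    """Count segments in parallel, then merge the sorted term streams."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(count_segment, segments))
    merged = heapq.merge(*per_segment, key=itemgetter(0))
    return {term: sum(count for _, count in group)
            for term, group in groupby(merged, key=itemgetter(0))}


segments = [
    ([1, 2, 2], [None, "batman", "flash"], [0, 1, 2]),
    ([1, 0, 2], [None, "flash", "superman"], [0, 2]),
]
totals = per_segment_facets(segments)
print(totals)
```

The merge step is needed because the same term can have a different index in each segment's lookup array, so the accumulators cannot simply be added position by position.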

  18. Per-segment faceting • Enable with facet.method=fcs • Controllable multi-threading • facet.field={!threads=4}myfield • Disadvantages • Larger memory use (FieldCaches + accumulators) • Slower (extra FieldCache merge step needed) • Advantages • Rebuilds FieldCache entries only for new segments (NRT friendly) • Multi-threaded

  19. Per-segment faceting performance comparison • Test index: 10M documents, 18 segments, single-valued field • Case A: base DocSet = 100 docs, facet.field on a field with 100,000 unique terms • Case B: base DocSet = 1,000,000 docs, facet.field on a field with 100 unique terms • Times are complete request times, measured externally (results chart not included in this transcript)

  20. Field Collapsing

  21. Field Collapsing • Field collapsing • Limit the number of results per category • “category” defined by unique values in a field • Uses • Web Search – collapse by web site • Email threads – collapse by thread id • Ecommerce/retail • Show the top 5 items for each store category (music, movies, etc)
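The core idea of field collapsing, limiting the number of results per distinct field value, can be sketched over an already ranked result list. This is an illustration of the concept rather than Solr's implementation; the function name and document layout are assumptions.

```python
from collections import defaultdict


def collapse(ranked_docs, field, limit):
    """Keep at most `limit` docs per distinct value of `field`,
    preserving the incoming ranking order."""
    seen = defaultdict(int)
    kept = []
    for doc in ranked_docs:
        key = doc.get(field)
        if seen[key] < limit:
            seen[key] += 1
            kept.append(doc)
    return kept


ranked = [
    {"id": 1, "site": "apache.org"},
    {"id": 2, "site": "apache.org"},
    {"id": 3, "site": "example.com"},
    {"id": 4, "site": "apache.org"},
    {"id": 5, "site": "example.com"},
]
one_per_site = collapse(ranked, "site", limit=1)
two_per_site = collapse(ranked, "site", limit=2)
print([d["id"] for d in one_per_site], [d["id"] for d in two_per_site])
```

With `limit=1` this mirrors the web-search use case of one hit per site; a larger limit corresponds to the "top 5 items per store category" example.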

  22. Field Collapsing by Site

  23. Field Collapse on Product Type

  24. Q&A
