Join us for the Lucene/Solr Meetup on May 11, 2010, where we will discuss topics such as Lucene/Solr merge, relevancy, scalability, spatial/geo search, near real-time, field collapsing, and more.
NYC Lucene/Solr Meetup • Solr 1.5 and Beyond • Yonik Seeley • May 11, 2010
Agenda • Lucene/Solr merge • Relevancy (Extended Dismax Parser) • Scalability (Solr Cloud) • Spatial/Geo Search • Near Real Time • Field Collapsing • Q&A
Lucene-Solr Merge • Lucene/Solr voted to merge (March 2010) • Were already separate sub-projects of the Lucene TLP • High committer overlap • Solr had stopped using Lucene trunk/development versions • Much code duplication • What it means • Single set of committers • Single developer mailing list (dev@lucene.apache.org) • Single subversion trunk • Keep separate downloads, user mailing lists
Lucene/Solr Development Changes • Nutch, Tika, Mahout spun off to their own TLPs • Still may be considered part of “Lucene Ecosystem” • Lucene/Solr development changes • trunk is now always next major release (currently 4.0) • branch_3x will be base for all 3.x releases • No back compat guarantees between major releases
Extended Dismax Parser • Superset of dismax: &defType=edismax&q=foo&qf=body • Fixes edge cases where dismax could still throw exceptions on input such as: OR AND NOT - “ • Full lucene syntax support • Tries lucene syntax first • Smart escaping is applied if there are syntax errors • Optionally supports treating “and”/“or” as AND/OR in lucene syntax • Fielded queries (e.g. myfield:foo) even in degraded mode • uf parameter controls what field names may be directly specified in “q”
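A minimal sketch of assembling the request parameters shown on this slide. The parameter names (defType, q, qf) come from the slide; the base URL and the helper name `edismax_params` are assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Hypothetical base URL; point this at an actual Solr instance.
BASE_URL = "http://localhost:8983/solr/select"

def edismax_params(q, qf, **extra):
    """Build request parameters that select the extended dismax parser.

    `extra` can carry further parameters named on the slides (uf, pf2, boost).
    """
    params = {"defType": "edismax", "q": q, "qf": qf}
    params.update(extra)
    return params

# Reproduces the slide's example: &defType=edismax&q=foo&qf=body
url = BASE_URL + "?" + urlencode(edismax_params("foo", "body"))
print(url)
```

Using `urlencode` rather than string concatenation keeps operators, spaces, and quotes in user queries properly escaped.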
Extended Dismax Parser (continued) • boost parameter for multiplicative boost-by-function • Pure negative query clauses Example: solr OR (-solr) • Enhanced term proximity boosting • pf2=myfield – results in term bigrams in sloppy phrase queries myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc" • Enhanced stopword handling • stopwords omitted in main query, but added in optional proximity boosting part Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome") • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
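The pf2 and stopword examples above can be mimicked with a few lines of string handling. This is only a sketch of the shape of the rewritten query, not the parser's implementation; the function names and the whitespace tokenization are assumptions.

```python
def pf2_clauses(field, terms):
    """Return one phrase clause per adjacent term pair (term bigrams)."""
    return ['{}:"{} {}"'.format(field, a, b) for a, b in zip(terms, terms[1:])]

def main_clause(field, terms, stopwords):
    """Mandatory main query: stopwords are omitted here."""
    kept = [t for t in terms if t not in stopwords]
    return "+{}:({})".format(field, " ".join(kept))

def rewrite(field, query, stopwords):
    """Combine the main clause with the optional proximity-boost bigrams.

    Assumption: the query is tokenized on whitespace; stopwords stay in
    the bigram part, as in the slide's example.
    """
    terms = query.split()
    return "{} ({})".format(
        main_clause(field, terms, stopwords),
        " ".join(pf2_clauses(field, terms)),
    )

print(pf2_clauses("myfield", ["aa", "bb", "cc"]))
print(rewrite("myfield", "solr is awesome", {"is"}))
```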
SolrCloud • First steps toward simplifying cluster management • Integrates ZooKeeper • Central configuration (schema.xml, solrconfig.xml, etc) • Tracks live nodes + shards of collections • Removes need for external load balancers: shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr • Can specify logical shard ids: shards=NY_shard,NJ_shard • Clients don’t need to know shards: http://localhost:8983/solr/collection1/select?distrib=true
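To make the shards syntax concrete, here is a small sketch that splits a shards value into its parts. It assumes the reading suggested by the load-balancer bullet: commas separate shards, and "|" separates interchangeable servers for the same shard. The function name `parse_shards` is hypothetical.

```python
def parse_shards(value):
    """Split a shards parameter into a list of shards, each a list of servers.

    Assumption: ',' separates shards and '|' separates equivalent servers
    for one shard. Surrounding whitespace is ignored.
    """
    shards = []
    for shard in value.split(","):
        servers = [s.strip() for s in shard.split("|") if s.strip()]
        if servers:
            shards.append(servers)
    return shards

explicit = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
logical = parse_shards("NY_shard,NJ_shard")
print(explicit)
print(logical)
```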
SolrCloud : The Future • Eliminate all single points of failure • Remove Master/Searcher distinction • Enables near real-time search in a highly scalable environment • High Availability for Writes • Eventual consistency model (like Amazon Dynamo, Cassandra) • Elastic • Simply add/subtract servers, cluster will rebalance automatically • By default, Solr will handle document partitioning
Spatial Search • PointType • Generic improvement: polyField – single value -> multiple indexed fields • Compound values: 38.89,-77.03 • Range queries and exact matches supported • q=location:21.33,51.37 • q=location:[10,20 TO 30,40] • Distance Functions • Generic improvement: function queries can yield multiple values • Haversine: hsin(3963.205, store, vector(10,20)) • Many possibilities, including boost by distance
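For reference, the haversine formula behind a distance function like hsin can be sketched as follows. The Python function name, its argument order, and the use of degrees are assumptions for illustration and do not describe Solr's own signature; the radius 3963.205 is taken from the slide, and the store coordinates are invented.

```python
import math

# Radius from the slide's example, presumably the Earth's radius in miles.
RADIUS = 3963.205

def haversine(radius, lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (assumption)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp against rounding error before taking the square root.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))

# Hypothetical store locations, ordered by distance from the point (10, 20).
stores = {"a": (10.0, 20.0), "b": (12.0, 21.0), "c": (40.0, -70.0)}
by_distance = sorted(
    stores, key=lambda name: haversine(RADIUS, *stores[name], 10.0, 20.0)
)
print(by_distance)
```

The same distance value can serve as a sort key, a filter threshold, or a boost input, which is what makes a general function-query mechanism attractive here.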
Spatial Search (continued) • Sorting by function query • sort=hsin(3963.205,store,vector(10,20)) asc • Distance Filtering (SOLR-1568) • fq={!sfilt fl=store_tiles}&pt=45.17614,-93.87341&d=1 • Implementations: trie range queries, spatial tiles, geohash • Return sort values or function query values for each doc • FunctionQuery results as pseudo-fields (SOLR-1298) fl=field1,field2,{!func key=dist}hsin(…) ???
Near Real-Time Search • Shorter times until updates are searchable/visible • Lucene 2.9 first laid the groundwork w/ per-segment searching • Per-segment FieldCache entries for sorting and FunctionQueries • NRT IndexWriter.getReader() • Make new segments available before merging is done in background • Doesn’t cause commit/fsync first • Solr still needs • Per-segment faceting • Per-segment caching • Per-segment statistics (and anything else that uses FieldCache)
Existing single-valued faceting algorithm • Example request: q=Juggernaut&facet=true&facet.field=hero • Lucene FieldCache entry (StringIndex) for the “hero” field • order: for each doc, an index into the lookup array • lookup: the string values ((null), batman, flash, spiderman, superman, wolverine) • accumulator: one counter per lookup entry, incremented for each document matching the base query “Juggernaut” • [Figure: order, lookup, and accumulator arrays; numeric values not recoverable]
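The counting scheme on this slide can be sketched in a few lines. The lookup values are the ones named on the slide; the order array and the set of matching documents are invented sample data, and `facet_counts` is a hypothetical name.

```python
def facet_counts(order, lookup, base_docs):
    """Count facet values for the documents matching the base query.

    order[doc] is an index into lookup; lookup[0] is the null entry for
    documents that have no value in the field.
    """
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    # Report only real values with a non-zero count.
    return {
        lookup[i]: n
        for i, n in enumerate(accumulator)
        if n and lookup[i] is not None
    }

lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [2, 5, 0, 1, 2, 4, 5, 2]   # hypothetical: one entry per document
base_docs = [0, 3, 4, 7]           # hypothetical: docs matching the query
counts = facet_counts(order, lookup, base_docs)
print(counts)
```

The cost per matching document is one array read and one increment, which is why this approach is fast once the FieldCache entry has been built.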
Per-segment single-valued faceting algorithm • Each segment (Segment1 to Segment4) has its own FieldCache entry and its own accumulator (accumulator1 to accumulator4) • Each segment is counted by its own thread (thread1 to thread4) against the base DocSet • A FieldCache + accumulator merger (priority queue) combines the per-segment results (e.g. flash, 5; Batman, 3) • [Figure: per-segment lookup arrays and accumulators feeding a priority queue; numeric values not recoverable]
Per-segment faceting • Enable with facet.method=fcs • Controllable multi-threading • facet.field={!threads=4}myfield • Disadvantages • Larger memory use (FieldCaches + accumulators) • Slower (extra FieldCache merge step needed) • Advantages • Rebuilds FieldCache entries only for new segments (NRT friendly) • Multi-threaded
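A rough sketch of the per-segment idea: count each segment independently, then merge the per-segment results by term. The heap-based merge stands in for the priority-queue merger on the slides, the thread pool only illustrates the structure (Python threads will not speed up this loop), and all names and sample data are assumptions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def count_segment(segment):
    """Count one segment; returns (term, count) pairs sorted by term."""
    order, lookup, docs = segment
    accumulator = [0] * len(lookup)
    for doc in docs:                 # docs are segment-local ids
        accumulator[order[doc]] += 1
    # lookup is sorted, so the emitted pairs are sorted by term as well.
    return [
        (lookup[i], n)
        for i, n in enumerate(accumulator)
        if n and lookup[i] is not None
    ]

def facet_per_segment(segments, threads=4):
    """Count segments in parallel, then merge the sorted per-segment lists."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(count_segment, segments))
    merged = heapq.merge(*per_segment)   # k-way merge backed by a heap
    return {
        term: sum(n for _, n in group)
        for term, group in groupby(merged, key=lambda pair: pair[0])
    }

# Hypothetical segments: (order, lookup, matching segment-local doc ids).
segments = [
    ([1, 2, 2, 0], [None, "batman", "flash"], [0, 1, 2]),
    ([1, 2, 1], [None, "flash", "superman"], [0, 1]),
]
totals = facet_per_segment(segments)
print(totals)
```

The merge step is the extra cost listed under disadvantages: each segment has its own lookup array, so ordinals cannot be summed directly and terms have to be matched up by value.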
Per-segment faceting performance comparison • Test index: 10M documents, 18 segments, single valued field • Test A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms • Test B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms • *complete request time, measured externally • [Result tables A and B not recoverable]
Field Collapsing • Field collapsing • Limit the number of results per category • “category” defined by unique values in a field • Uses • Web Search – collapse by web site • Email threads – collapse by thread id • Ecommerce/retail • Show the top 5 items for each store category (music, movies, etc)
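The core of field collapsing, limiting the number of results per distinct field value, can be sketched as below. This assumes the input is already ranked best-first; the function name, field names, and sample documents are hypothetical.

```python
def collapse(results, field, limit):
    """Keep at most `limit` results per distinct value of `field`.

    `results` must already be ranked best-first; groups appear in the
    order in which their first document was seen.
    """
    groups = {}
    for doc in results:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups

# Hypothetical ranked results for a retail search.
ranked = [
    {"id": 1, "category": "music"},
    {"id": 2, "category": "movies"},
    {"id": 3, "category": "music"},
    {"id": 4, "category": "music"},
    {"id": 5, "category": "movies"},
]
top = collapse(ranked, "category", limit=2)
print({cat: [d["id"] for d in docs] for cat, docs in top.items()})
```

With limit=5 and a store category field this corresponds to the "top 5 items for each store category" use case; with limit=1 and a site field it gives the web-search style collapse.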