Solr 4 The NoSQL Search Server

Solr 4 The NoSQL Search Server, Yonik Seeley, May 30, 2013

NoSQL Databases
• Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
• Non-traditional data stores
• Doesn’t use / isn’t designed around SQL
• May not give full ACID guarantees
• Offers other advantages such as greater scalability as a tradeoff
• Distributed, fault-tolerant architecture

Solr Cloud Design Goals
• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency

Solr Cloud
• Distributed Indexing designed from the ground up to accommodate desired features
• CAP Theorem
  • Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
  • Reality: Must handle P – the real choice is tradeoffs between C and A
• Ended up with a CP system (roughly)
  • Value Consistency over Availability
  • Eventual consistency is incompatible with optimistic concurrency
  • Closest to MongoDB in architecture
• We still do well with Availability
  • All N replicas of a shard must go down before we lose writability for that shard
  • For a network partition, the “big” partition remains active (i.e. Availability isn’t “on” or “off”)

Solr 4

Solr 4 at a glance
• Document Oriented NoSQL Search Server
• Data-format agnostic (JSON, XML, CSV, binary)
• Schema-less options (more coming soon)
• Distributed
• Multi-tenanted
• Fault Tolerant
• HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the “SolrCloud” architecture

Quick Start
• Unzip the binary distribution (.ZIP file)
  Note: no “installation” required
• Start Solr
  $ cd example
  $ java -jar start.jar
• Go! Browse to http://localhost:8983/solr for the new admin interface

New admin UI

Add and Retrieve document
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  { "id" : "book1",
    "title" : "American Gods",
    "author" : "Neil Gaiman"
  }
]'
Note: no type of “commit” is necessary to retrieve documents via /get (real-time get)
$ curl http://localhost:8983/solr/get?id=book1
{
  "doc": {
    "id" : "book1",
    "author": "Neil Gaiman",
    "title" : "American Gods",
    "_version_": 1410390803582287872
  }
}

Simplified JSON Delete Syntax
• Single delete-by-id
  {"delete":"book1"}
• Multiple delete-by-id
  {"delete":["book1","book2","book3"]}
• Delete with optimistic concurrency
  {"delete":{"id":"book1", "_version_":123456789}}
• Delete by Query
  {"delete":{"query":"tag:category1"}}

Atomic Updates
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  {"id" : "book1",
   "pubyear_i" : { "add" : 2001 },
   "ISBN_s" : { "add" : "0-380-97365-1"}
  }
]'
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  {"id" : "book1",
   "copies_i" : { "inc" : 1},
   "cat" : { "add" : "fantasy"},
   "ISBN_s" : { "set" : "0-380-97365-0"},
   "remove_s" : { "set" : null }
  }
]'
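
The "add", "set" and "inc" modifiers in the requests above can be pictured with a small in-memory model. This is only a sketch of the semantics as the slide suggests them, not Solr's implementation: the function name apply_atomic_update is hypothetical, and the handling of plain (non-modifier) values and of "inc" on a missing field are simplifying assumptions.

```python
import copy

def apply_atomic_update(stored, update):
    """Return a new document with the update's modifiers applied to `stored`.

    Hypothetical helper: models "set", "add" and "inc" on a plain dict.
    """
    doc = copy.deepcopy(stored)
    for field, value in update.items():
        if field == "id":
            continue
        if not isinstance(value, dict):
            doc[field] = value  # assumption: a plain value simply overwrites
            continue
        for op, arg in value.items():
            if op == "set":
                if arg is None:
                    doc.pop(field, None)  # setting null removes the field
                else:
                    doc[field] = arg
            elif op == "add":
                if field not in doc:
                    doc[field] = arg
                else:
                    old = doc[field] if isinstance(doc[field], list) else [doc[field]]
                    doc[field] = old + (arg if isinstance(arg, list) else [arg])
            elif op == "inc":
                doc[field] = doc.get(field, 0) + arg  # assumption: missing counts as 0
            else:
                raise ValueError("unknown modifier: " + op)
    return doc

book = {"id": "book1", "title": "American Gods", "author": "Neil Gaiman"}
step1 = apply_atomic_update(book, {
    "id": "book1",
    "pubyear_i": {"add": 2001},
    "ISBN_s": {"add": "0-380-97365-1"},
})
step2 = apply_atomic_update(step1, {
    "id": "book1",
    "copies_i": {"inc": 1},
    "cat": {"add": "fantasy"},
    "ISBN_s": {"set": "0-380-97365-0"},
    "remove_s": {"set": None},
})
print(step2)
```

Fields that the update does not mention (title, author) are carried over unchanged, which is the point of an atomic update compared with resubmitting the whole document.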

Optimistic Concurrency
• Conditional update based on document version
Client steps (each request goes to Solr):
1. /get document
2. Modify document, retaining _version_
3. /update resulting document
4. Go back to step #1 if fail code=409
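
The get, modify, update, retry cycle can be sketched against a stand-in store. Everything here is hypothetical scaffolding: FakeStore replaces the Solr server, VersionConflict stands in for an HTTP 409 response, and the version counter is a simple integer rather than Solr's real version numbers.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 (Conflict) response."""

class FakeStore:
    """Hypothetical in-memory stand-in for /get and /update."""
    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def update(self, doc):
        current = self._docs.get(doc["id"])
        if current is not None and "_version_" in doc \
                and doc["_version_"] != current["_version_"]:
            raise VersionConflict(doc["id"])
        self._clock += 1
        self._docs[doc["id"]] = {**doc, "_version_": self._clock}

def update_with_retry(store, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = store.get(doc_id)      # 1. /get document
        modify(doc)                  # 2. modify, retaining _version_
        try:
            store.update(doc)        # 3. /update resulting document
            return store.get(doc_id)
        except VersionConflict:
            continue                 # 4. go back to step 1 on a conflict
    raise RuntimeError("too many version conflicts for " + doc_id)

store = FakeStore()
store.update({"id": "book2", "copiesIn_i": 7, "copiesOut_i": 3})

calls = []
def check_out(doc):
    if not calls:
        # Simulate another client checking out a copy between our get and update.
        other = store.get("book2")
        other["copiesIn_i"] -= 1
        other["copiesOut_i"] += 1
        store.update(other)
    calls.append(1)
    doc["copiesIn_i"] -= 1
    doc["copiesOut_i"] += 1

final = update_with_retry(store, "book2", check_out)
print(final, "attempts:", len(calls))
```

The first attempt fails because the other client changed the document in between; the second attempt re-reads the document and succeeds, so neither checkout is lost.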

Version semantics • Specifying _version_ on any update invokes optimistic concurrency
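
The table that accompanied this slide is not part of the transcript. The sketch below states the rules as Solr's reference documentation describes them for the _version_ value, so treat them as taken from that documentation rather than from the slide; the function name version_allows_update is hypothetical.

```python
def version_allows_update(requested, existing):
    """Decide whether an update may proceed.

    requested: the _version_ sent with the update (None if absent)
    existing:  the stored _version_, or None if the document does not exist
    """
    if requested is None or requested == 0:
        return True                    # no concurrency check requested
    if requested < 0:
        return existing is None        # document must not exist
    if requested == 1:
        return existing is not None    # document must exist, any version
    return existing == requested       # versions must match exactly

print(version_allows_update(123456789, 123456789))
print(version_allows_update(54321, 1408814192853516288))
```

A rejected update corresponds to the HTTP 409 (Conflict) response.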

Optimistic Concurrency Example
Get the document
$ curl http://localhost:8983/solr/get?id=book2
{ "doc" : {
    "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":7,
    "copiesOut_i":3,
    "_version_":123456789 }}
Modify and resubmit, using the same _version_
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
  { "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":6,
    "copiesOut_i":4,
    "_version_":123456789 }
]'
Alternately, specify the _version_ as a request parameter
$ curl http://localhost:8983/solr/update?_version_=123456789 -H 'Content-type:application/json' -d […]

Optimistic Concurrency Errors
• HTTP Code 409 (Conflict) returned on version mismatch
$ curl -i http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book1", "author":"Mr Bean", "_version_":54321}]'
HTTP/1.1 409 Conflict
Content-Type: text/plain;charset=UTF-8
Transfer-Encoding: chunked
{
  "responseHeader":{
    "status":409,
    "QTime":1},
  "error":{
    "msg":"version conflict for book1 expected=12345 actual=1408814192853516288",
    "code":409}}

Schema

Schema REST API
• Restlet is now integrated with Solr
• Get a specific field
  curl http://localhost:8983/solr/schema/fields/price
  {"field":{
    "name":"price",
    "type":"float",
    "indexed":true,
    "stored":true }}
• Get all fields
  curl http://localhost:8983/solr/schema/fields
• Get Entire Schema!
  curl http://localhost:8983/solr/schema

Dynamic Schema
• Add a new field (Solr 4.4)
  curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '
  {"type":"float", "indexed":"true"}
  '
• Works in distributed (cloud) mode too!
• Schema must be managed & mutable (not currently the default)
  <schemaFactory class="ManagedIndexSchemaFactory">
    <bool name="mutable">true</bool>
    <str name="managedSchemaResourceName">managed-schema</str>
  </schemaFactory>

Schemaless
• “Schemaless” usually means that the client(s) have an implicit schema
• “No Schema” is impossible for anything based on Lucene
  • A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
  • Only pre-define types of fields, not fields themselves
  • No guessing. Any field name ending in _i is an integer
• “Guessed Schema” or “Type Guessing”
  • For previously unknown fields, guess using JSON type as a hint
  • Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
  • Lose ability to catch field naming errors
  • Can’t optimize based on types
  • Guessing incorrectly means having to start over
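
The dynamic-field convention can be sketched as a suffix lookup. The slide only confirms that a name ending in _i is an integer; the _s entry and the type names "int" and "string" are assumptions for illustration, and resolve_field_type is a hypothetical helper rather than anything in Solr.

```python
# Assumed suffix-to-type table; only the _i rule comes from the slide.
DYNAMIC_SUFFIXES = {"_i": "int", "_s": "string"}

def resolve_field_type(name, declared=None, dynamic=DYNAMIC_SUFFIXES):
    """Return the type for a field name, or None if no rule matches."""
    declared = declared or {}
    if name in declared:
        return declared[name]          # explicitly defined field wins
    for suffix, type_name in dynamic.items():
        if name.endswith(suffix) and len(name) > len(suffix):
            return type_name           # type comes from the naming convention
    return None                        # unknown field: rejected, not guessed

declared = {"title": "text", "author": "string"}
for field in ("title", "pubyear_i", "ISBN_s", "titel"):
    print(field, "->", resolve_field_type(field, declared))
```

The misspelled field "titel" resolves to nothing, which is the error-catching behavior that type guessing would give up.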

Solr Cloud

SolrCloud
[Diagram: a query to http://.../solr/collection1/query?q=awesome is sent as load-balanced sub-requests to shard1 and shard2, each with replica1, replica2 and replica3; a ZooKeeper quorum of ZK nodes sits alongside]
• ZooKeeper holds cluster state
  • Nodes in the cluster
  • Collections in the cluster
  • Schema & config for each collection
  • Shards in each collection
  • Replicas in each shard
  • Collection aliases
ZooKeeper tree shown in the diagram:
/livenodes
  server1:8983/solr
  server2:8983/solr
/collections
  /collection1 configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
/configs
  /myconf
    solrconfig.xml
    schema.xml
/clusterstate.json
/aliases.json

Distributed Indexing
[Diagram: update request http://.../solr/collection1/update; shard1, shard2]
• Update sent to any node
• Solr determines what shard the document is on, and forwards to shard leader
• Shard Leader versions document and forwards to all other shard replicas
• HA for updates (if one leader fails, another takes its place)

Collections API
• Create a new document collection
  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
• Delete a collection
  http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection
• Create an alias to a collection (or a group of collections)
  http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=tri_state&collections=NY,NJ,CT
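
Since every Collections API call is a plain URL with query parameters, a small builder keeps the parameters readable. This sketch only constructs the URLs from the slide and does not contact a server; collections_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

def collections_url(action, base="http://localhost:8983/solr", **params):
    """Build a Collections API URL; commas are left unescaped for list values."""
    query = urlencode({"action": action, **params}, safe=",")
    return f"{base}/admin/collections?{query}"

print(collections_url("CREATE", name="mycollection", numShards=4, replicationFactor=3))
print(collections_url("DELETE", name="mycollection"))
print(collections_url("CREATEALIAS", name="tri_state", collections="NY,NJ,CT"))
```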

http://localhost:8983/solr/#/~cloud

Distributed Query Requests
• Distributed query across all shards in the collection
  http://localhost:8983/solr/collection1/query?q=foo
• Explicitly specify node addresses to load-balance across
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  • Equivalent nodes in a list are separated by “|”
  • Different phases of the same distributed request use the same node
• Specify logical shards to search across
  shards=NY,NJ,CT
• Specify multiple collections to search across
  collection=collection1,collection2
• public CloudSolrServer(String zkHost)
  • ZK aware SolrJ Java client that load-balances across all nodes in cluster
  • Calculate where document belongs and directly send to shard leader (new)
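
The structure of the shards parameter (commas between shards, "|" between equivalent nodes) can be sketched as follows. The helper names parse_shards and pick_nodes are hypothetical, and picking one node per shard up front is just one simple way to model "different phases use the same node"; it is not a description of Solr's load balancer.

```python
import random

def parse_shards(value):
    """Split a shards parameter into one list of equivalent nodes per shard."""
    return [group.split("|") for group in value.split(",")]

def pick_nodes(groups, rng=random):
    """Choose one node per shard; reuse the result for every request phase."""
    return [rng.choice(group) for group in groups]

shards = "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
groups = parse_shards(shards)
chosen = pick_nodes(groups, random.Random(0))
print(groups)
print(chosen)
```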

Durable Writes
• Lucene flushes writes to disk on a “commit”
  • Uncommitted docs are lost on a crash (at Lucene level)
• Solr 4 maintains its own transaction log
  • Contains uncommitted documents
  • Services real-time get requests
  • Recovery (log replay on restart)
  • Supports distributed “peer sync”
• Writes forwarded to multiple shard replicas
  • A replica can go away forever w/o collection data loss
  • A replica can do a fast “peer sync” if it’s only slightly out of date
  • A replica can do a full index replication (copy) from a peer
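
The role of the transaction log can be pictured with a toy model. This is a deliberately simplified sketch, not Solr's design: ToyIndex and its methods are hypothetical, the "log" is a Python list standing in for a file that survives a crash, and the committed dict stands in for the on-disk Lucene index.

```python
class ToyIndex:
    """Toy model: committed data and the log survive a crash, pending does not."""
    def __init__(self):
        self.committed = {}   # stands in for the flushed index on disk
        self.pending = {}     # in-memory, uncommitted documents
        self.tlog = []        # stands in for the durable transaction log

    def add(self, doc):
        self.tlog.append(dict(doc))          # log the update first
        self.pending[doc["id"]] = dict(doc)

    def realtime_get(self, doc_id):
        # Uncommitted documents are visible without any commit.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()                    # logged updates are now in the index

    def crash_and_recover(self):
        self.pending = {}                    # memory is lost in the crash
        for doc in self.tlog:                # log replay on restart
            self.pending[doc["id"]] = dict(doc)

index = ToyIndex()
index.add({"id": "book1", "title": "American Gods"})
index.commit()
index.add({"id": "book2", "title": "Neuromancer"})   # never committed
index.crash_and_recover()
print(index.realtime_get("book1"), index.realtime_get("book2"))
```

Without the log, book2 would be gone after the crash; with it, replay restores the uncommitted document.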

Near Real Time (NRT) softCommit
• softCommit opens a new view of the index without flushing + fsyncing files to disk
  • Decouples update visibility from update durability
• commitWithin now implies a soft commit
• Current autoCommit defaults from solrconfig.xml:
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

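Document Routing
[Diagram: hash ring for a collection with numShards=4 and router=compositeId]
• id = BigCo!doc5 is hashed with MurmurHash3: the BigCo part gives 9f27 and the doc5 part gives 3c71, placing the document on the hash ring in shard1
• Shard hash ranges:
  shard1 80000000-bfffffff
  shard2 c0000000-ffffffff
  shard3 00000000-3fffffff
  shard4 40000000-7fffffff
• q=my_query shard.keys=BigCo! covers hashes 9f27 0000 to 9f27 ffff, so the query goes to shard1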
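
The routing arithmetic on this slide can be sketched without computing MurmurHash3 itself. The sketch below takes the slide's values 9f27 (for BigCo) and 3c71 (for doc5) as given inputs and assumes, based on those four-hex-digit values, that the shard key supplies the upper 16 bits of the hash and the document part the lower 16 bits. The function names are hypothetical.

```python
# Hash ranges per shard, as read from the slide (numShards=4).
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}

def composite_hash(key_hash, id_hash):
    """Combine the key's upper 16 bits with the id's lower 16 bits (assumed split)."""
    return (key_hash & 0xFFFF0000) | (id_hash & 0x0000FFFF)

def shard_for(hash32):
    """Find the shard whose range contains the 32-bit hash."""
    for name, (lo, hi) in SHARD_RANGES.items():
        if lo <= hash32 <= hi:
            return name
    raise ValueError("hash outside all shard ranges")

def shards_for_key(key_hash):
    """Shards that can hold any document sharing this shard key."""
    lo = key_hash & 0xFFFF0000
    hi = lo | 0x0000FFFF
    return sorted(n for n, (s, e) in SHARD_RANGES.items() if s <= hi and lo <= e)

doc_hash = composite_hash(0x9F270000, 0x00003C71)   # values taken from the slide
print(format(doc_hash, "08x"), shard_for(doc_hash), shards_for_key(0x9F270000))
```

Because all documents with the BigCo! prefix share the same upper bits, a query restricted with shard.keys=BigCo! only needs to visit the shards returned by shards_for_key.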

Seamless Online Shard Splitting
• http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
• New sub-shards created in “construction” state
• Leader starts forwarding applicable updates, which are buffered by the sub-shards
• Leader index is split and installed on the sub-shards
• Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
[Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; an incoming update; Shard2 split into Shard2_0 and Shard2_1]
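
Which sub-shard an update is "applicable" to can be sketched as a hash-range split. This assumes the parent's range is divided into two equal halves, and the range used for Shard2 below is an illustrative value, not one given on this slide; split_range and sub_shard_for are hypothetical names.

```python
def split_range(lo, hi):
    """Split an inclusive hash range into two equal inclusive halves (assumed policy)."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)

def sub_shard_for(hash32, parent_name, parent_range):
    """Name the sub-shard that should buffer an update with this hash."""
    first, second = split_range(*parent_range)
    if first[0] <= hash32 <= first[1]:
        return parent_name + "_0"
    if second[0] <= hash32 <= second[1]:
        return parent_name + "_1"
    raise ValueError("hash does not belong to the parent shard")

parent = (0xC0000000, 0xFFFFFFFF)   # illustrative range for Shard2
halves = split_range(*parent)
print([(format(a, "08x"), format(b, "08x")) for a, b in halves])
print(sub_shard_for(0xC1234567, "Shard2", parent), sub_shard_for(0xF0000000, "Shard2", parent))
```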

Questions?
