Search at Scale: Hadoop, Katta & Solr – the s'mores of Search
Issues • Managing the raw data • Building indexes • Handling updates • Reliable search • Search latency
The Tools • HDFS for raw data storage • SolrRecordWriter for building indexes (SOLR-1301) • Katta for search latency • Katta for reliable search • Brute-force MapReduce for index updates • Near-real-time updates – Jason Rutherglen
HOW-TO SolrRecordWriter • A Solr config with schema • An implementation of SolrDocumentConverter • A Hadoop cluster you can trash – wrong tuning will crash your machines • ZipFile output – some compression, fewer files in your HDFS, easy deployment; use jar xf to unpack, zip/unzip will fail
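A minimal converter sketch, assuming the SolrDocumentConverter API from the SOLR-1301 patch (a convert(key, value) method returning SolrInputDocuments – verify the exact signature against the patch you applied); the input types and field names are placeholders:

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.solr.common.SolrInputDocument;

    // Assumed SOLR-1301 API: SolrDocumentConverter<K, V> with a single
    // convert(K, V) method; check the patch for the exact signature.
    public class LogLineConverter extends SolrDocumentConverter<LongWritable, Text> {

      @Override
      public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
        SolrInputDocument doc = new SolrInputDocument();
        // "id" and "body" are placeholders; field names must match your schema.xml.
        doc.addField("id", key.toString());
        doc.addField("body", value.toString());
        return Collections.singletonList(doc);
      }
    }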
SolrRecordWriter and your Cluster • Each SolrRecordWriter instance uses substantial quantities of system resources: • Processor – analyzing the input records • Memory – buffering the processed index records • IOPs – the optimize pass saturates storage devices • Be very careful about how many instances you run per machine
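Since each reduce task hosts one SolrRecordWriter, the simplest throttle is the reduce-task count; a sketch with illustrative numbers (the per-node slot cap lives in mapred-site.xml, not in job code):

    import org.apache.hadoop.mapred.JobConf;

    public class IndexJobTuning {
      public static JobConf tune(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);

        int nodes = 12;          // illustrative cluster size
        int writersPerNode = 1;  // start low; each writer is CPU-, memory- and IO-hungry

        // One SolrRecordWriter per reduce task, so the reduce count bounds
        // how many writers can run concurrently across the cluster.
        conf.setNumReduceTasks(nodes * writersPerNode);

        // Note: the hard per-node cap, mapred.tasktracker.reduce.tasks.maximum,
        // is a TaskTracker daemon setting in mapred-site.xml, not a per-job knob.
        return conf;
      }
    }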
Katta • Distributed search • Replicated indexes • Fault tolerant • Direct deployment from HDFS
Katta Issues • Solr is a pig; run few instances per machine • Large indexes can take time to copy in and start, consuming substantial IO resources • Use hftp: to reference your indexes – it passes through firewalls and is HDFS-version independent • Use one of the balancing distribution policies • Nodes don't handle Solr OOMs gracefully
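A deployment sketch straight out of HDFS over hftp:. The client classes (DeployClient, ZkConfiguration) are named from memory of the Katta 0.x API – treat them as assumptions and verify against the release you run; the index name, path, and replication level are placeholders:

    import net.sf.katta.client.DeployClient;
    import net.sf.katta.util.ZkConfiguration;

    public class DeployIndex {
      public static void main(String[] args) throws Exception {
        // Assumed Katta 0.x client API; ZkConfiguration reads the ZooKeeper
        // settings from katta.zk.properties on the classpath.
        DeployClient deployClient = new DeployClient(new ZkConfiguration());

        // hftp: passes through firewalls and is HDFS-version independent.
        deployClient.addIndex(
            "myIndex",                                   // logical index name (placeholder)
            "hftp://namenode:50070/indexes/myIndex.zip", // placeholder path to the zipped index
            3);                                          // replication level
      }
    }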
Search Latency • Run as many replicas of your indexes as needed to keep latency low enough • Run as many Solr front ends as needed to manage latency
Solr Issues • Poorly chosen facets can cause OOMs – be careful • Solr is slow to start, so rolling new indexes in takes time • Solr is a black box to Katta, unlike Lucene, which Katta knows intimately
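The facet OOMs usually come from un-inverting a high-cardinality field into the FieldCache; a SolrJ sketch of a safer query (the field name "category" is a placeholder, the parameters are standard Solr facet params):

    import org.apache.solr.client.solrj.SolrQuery;

    public class SafeFacetQuery {
      public static SolrQuery build() {
        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("category"); // placeholder; must exist in your schema
        query.setFacetLimit(100);        // never ask for unbounded facet lists
        // facet.method=enum walks terms via the filterCache instead of
        // un-inverting the whole field into the FieldCache, the usual
        // OOM trigger on high-cardinality facet fields.
        query.set("facet.method", "enum");
        return query;
      }
    }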
Updates • Brute force – rebuild the entire corpus and redeploy • Distribute updates to deployed indexes (not implemented) • Merge indexes (Jason Rutherglen) • Distribute new indexes and handle the merge in the fronting Solr instances (not implemented)
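The merge route boils down to Lucene's segment-merging API; a sketch against Lucene 2.9/3.0-era classes (addIndexesNoOptimize was renamed addIndexes in later releases) with placeholder paths – run it against copies, not live shards:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeDeltas {
      public static void main(String[] args) throws Exception {
        Directory main = FSDirectory.open(new File("/indexes/main"));   // placeholder
        Directory delta = FSDirectory.open(new File("/indexes/delta")); // placeholder

        IndexWriter writer = new IndexWriter(main,
            new StandardAnalyzer(Version.LUCENE_29),
            false, IndexWriter.MaxFieldLength.UNLIMITED);
        try {
          // Folds the delta index's segments into the main index.
          writer.addIndexesNoOptimize(new Directory[] { delta });
        } finally {
          writer.close();
        }
      }
    }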
Code and Configurations • We run a 12-node Katta cluster, with 3 masters and 3 ZooKeeper machines, for 18 machines in total • We give each Katta node JVM 4 GB of heap • I run 1-3 Solr front-end instances with 6 GB of heap each • Code and configurations will be on www.prohadoop.com, for members