ROAR: Increasing the Flexibility and Performance of Distributed Search
Costin Raiciu, University College London
Joint work with Felipe Huici, Mark Handley, David S. Rosenblum
We Rely on Distributed Search Every Day • Distributed search apps • Web search (Google, Bing, etc.) • Online database search (Wikipedia, Amazon, eBay, etc.) • More generally: parallel databases • Characteristics • Data too big to fit on one server • Latency too high if queries are run on one server
Distributed Search At Work [Barroso et al., 2003] [diagram: N=6 servers; the data is split into thirds (P=3) and each third is replicated on R=2 servers, so P × R = N; a frontend server partitions each query across the three data partitions and merges the results]
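A minimal sketch of this cluster layout, assuming equal-size clusters (the helper names are ours, not from the talk): P partition groups of R replicas each, so P × R = N, and each query touches one replica per partition.

```python
import random

def make_clusters(servers, p):
    """Split N servers into P clusters of R = N/P replicas each."""
    r = len(servers) // p
    return [servers[i * r:(i + 1) * r] for i in range(p)]

def pick_query_servers(clusters):
    """A query must meet all the data: pick one replica per partition."""
    return [random.choice(cluster) for cluster in clusters]

servers = [f"s{i}" for i in range(6)]    # N = 6
clusters = make_clusters(servers, p=3)   # P = 3, hence R = 2
print(clusters)                          # [['s0', 's1'], ['s2', 's3'], ['s4', 's5']]
print(pick_query_servers(clusters))      # e.g. ['s1', 's2', 's5']
```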
P Affects System Behavior • It dictates how much data each node stores • It impacts • Query latency • Overheads • Problem: P is difficult to change • Our contribution: a system that can change P efficiently at runtime
Partitioning Determines Latency and Cost [figure: the work to be done for one query, split across P=4 servers vs. P=2 servers]
Partitioning Dictates Latency [plot: query latency vs. P, with the lower bound on P needed to sustain 6 q/s and the higher lower bound needed for 10 q/s]
Partitioning Dictates Cost [plot: query cost vs. P]
The Problem • P is very difficult to change with existing solutions • Google changes it out of necessity, when the web index outgrows memory • Not changing it dynamically means • The system is either inefficient, OR • It misses the target delay for some workloads
How Google Changes P [Jeffrey Dean, Google, 2009] [diagram: new clusters 1', 2', 3' with the new P are built alongside the old clusters 1 and 2, and queries are switched over once the data is in place] • Requires over-provisioning • Copies a lot of data • Our estimate: 20TB/data center
Our proposal: Rendez-Vous On A Ring (ROAR) • Key Observation • We do not need clusters to ensure each query meets all the data! • Changes the way data are partitioned and replicated • Allows on-the-fly reconfiguration with minimal bandwidth cost
Rendez-Vous On A Ring (ROAR) • Uses consistent hashing [Karger et al.] • Parameter P: number of query partitions [diagram: servers are hashed onto the unit ring [0, 1); each server owns the range between its predecessor's point and its own, e.g. the range (0.4, 0.5]]
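A minimal sketch of the ring placement, assuming SHA-1 positions mapped into [0, 1) (function names such as `ring_position` are our own, not from the paper): each server is hashed to a point and owns the range from its predecessor's point up to its own.

```python
import hashlib

def ring_position(key: str) -> float:
    """Hash a key to a point on the unit ring [0, 1)."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

servers = [f"server-{i}" for i in range(6)]
ring = sorted((ring_position(s), s) for s in servers)

# Each server owns the range from its predecessor's point to its own point
# (the first server's range wraps around past 1.0).
for (prev, _), (pos, name) in zip([ring[-1]] + ring[:-1], ring):
    print(f"{name} owns ({prev:.3f}, {pos:.3f}]")
```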
ROAR: Storing Data • P=4 [diagram: each object is hashed to an ID on the ring and replicated on every server whose range intersects the arc of width 1/P starting at that ID]
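A sketch of this storage rule (our own illustration; the fixed example `ring` below stands in for the hashed placement from the previous sketch): an object hashed to point x is replicated on every server whose range intersects the arc [x, x + 1/P). A larger P means a shorter arc, hence fewer replicas per object.

```python
def replica_set(x: float, ring, p: int):
    """Servers whose ranges intersect the arc [x, x + 1/P) on the unit ring.
    ring: list of (position, name) pairs sorted by position."""
    arc_end = x + 1.0 / p
    # Unroll the ring so the arc never straddles the 1.0 wrap-around.
    unrolled = [(pos + k, name) for k in (0, 1, 2) for pos, name in ring]
    # Servers whose own point lies inside the arc own a slice of it...
    owners = [name for pos, name in unrolled if x <= pos < arc_end]
    # ...plus the first server at or past the arc's end, which owns its tail.
    owners.append(min((pos, name) for pos, name in unrolled if pos >= arc_end)[1])
    return list(dict.fromkeys(owners))   # deduplicate, preserving order

# An example ring: six servers at fixed points (for illustration only).
ring = [(0.10, "s0"), (0.25, "s1"), (0.40, "s2"),
        (0.55, "s3"), (0.75, "s4"), (0.90, "s5")]
print(replica_set(0.47, ring, p=4))      # arc [0.47, 0.72) -> ['s3', 's4']
```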
ROAR: Running Queries • P=4 [diagram: the query starts at a random point on the ring and is sent to the owners of P equally spaced points; since every object's replica arc has width 1/P, it contains one of those points, so the chosen servers jointly meet all the data]
ROAR Can Run Queries at a Higher PQ • Data stored with P=4, query run with PQ=5 [diagram: the five equally spaced query points still cover all the data; because replica arcs (width 1/4) are wider than the query spacing (1/5), objects like A are matched twice while objects like B are matched once]
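A sketch of the query rule, generalised to PQ ≥ P as on this slide (our own code): send one sub-query to the owner of each of PQ equally spaced points starting at a random offset. Every replica arc of width 1/P contains at least one such point, so the query still meets all the data; with PQ > P some arcs contain two points, which is why an object can be matched twice.

```python
import bisect
import random

ring = [(0.10, "s0"), (0.25, "s1"), (0.40, "s2"),
        (0.55, "s3"), (0.75, "s4"), (0.90, "s5")]   # example ring, as before

def query_servers(ring, pq: int):
    """Pick one server per 1/PQ arc: the owner of each equally spaced point."""
    positions = [pos for pos, _ in ring]
    start = random.random()              # "start here" on the slide
    chosen = []
    for i in range(pq):
        point = (start + i / pq) % 1.0
        # A point's owner is its successor on the ring (wrapping at 1.0).
        j = bisect.bisect_left(positions, point) % len(ring)
        chosen.append(ring[j][1])
    return chosen

print(query_servers(ring, pq=5))         # PQ=5 against data stored with P=4
```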
ROAR Copies Zero Data to Increase P • [P=4] Delay is higher than target => change P to 5 • Servers simply set P=5: each object's replica arc shrinks from 1/4 to 1/5, and the data already stored covers the smaller arc, so no data moves
Minimal Data is Copied When P Decreases • [P=5] Delay is lower than target => change P to 4 • Each object's replica arc grows from 1/5 to 1/4, so only the newly covered tail of each arc needs copies (see the sketch below) • Frontend tells servers to switch to P=4 • It starts using P=4 only when the servers finish loading the new replicas • Copying is done while latency is low
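A sketch of the copy set when P decreases (our own illustration, reusing `replica_set` and the example `ring` from the storage sketch above): the replica arc widens, so only the servers in the newly covered tail must fetch a copy.

```python
def extra_replicas(x: float, ring, p_old: int, p_new: int):
    """Servers that must fetch object x when P drops from p_old to p_new."""
    assert p_new < p_old, "decreasing P widens the replica arc"
    old = set(replica_set(x, ring, p_old))   # arc [x, x + 1/p_old)
    new = set(replica_set(x, ring, p_new))   # wider arc [x, x + 1/p_new)
    return new - old                         # only the tail servers copy data

print(extra_replicas(0.47, ring, p_old=5, p_new=4))
```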
ROAR Tolerates Faults • P=4 [diagram: a server fails (X) during a query; the failed server's range is re-covered by finer sub-queries at PQ=8, which surviving replicas can answer]
Experimental Evaluation • We have implemented ROAR • Tested • Extensively on ~50 servers in UCL’s HEN testbed • Briefly on 1000 servers in Amazon EC2
Application • Privacy Preserving Search (PPS) • Servers match encrypted queries against encrypted data • PPS is CPU-bound • Applications with different bottlenecks should have qualitatively similar behavior
Can ROAR Repartition Dynamically? • Workload • Index of 1M files • Generate 4 random queries / second • Target average delay: 1s • The frontend server changes P dynamically based on the average delay (a control-loop sketch follows) • Start the network with P=40
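A hypothetical sketch of the frontend's control loop described above (the thresholds and step size are our own choices, not from the talk): raise P when measured delay misses the 1s target, and lower it when there is comfortable headroom.

```python
TARGET_DELAY = 1.0   # seconds: the target average delay in this experiment
HEADROOM = 0.7       # shrink P only when delay is comfortably under target

def adjust_p(p: int, avg_delay: float) -> int:
    """One control step: nudge P toward the smallest value meeting the target."""
    if avg_delay > TARGET_DELAY:
        return p + 1     # more partitions: less data per server, lower delay
    if avg_delay < HEADROOM * TARGET_DELAY and p > 1:
        return p - 1     # fewer partitions: each query uses fewer servers
    return p
```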
ROAR Changes P Efficiently [plot: the frontend changes P to 10 and later to 5; query delays stay stable during both changes]
Other experiments in the paper • Fault tolerance • Load balancing • Energy savings • Scaling: 1000 servers on Amazon EC2 • Unexpected delay variation caused by a high packet loss rate
Conclusion • Today’s cluster-based distributed search is rigid: • Locks the system to specific values of P • When load exceeds expectation: target delay missed • When load undershoots: resources wasted • Changing P is costly • We don’t have to accept a fixed operating point • ROAR dynamically adapts P to fluctuations in load • With minimal resources • Without disrupting queries • Tolerates faults and balances load
ROAR Scales to 1000 Servers on Amazon EC2 • Frontend overhead: 25ms to schedule a query on 1000 servers • Matching delay at each server decreases as expected • Unexpected problem: huge variation in end-to-end query delay
Does ROAR tolerate failures? • Experiment • Set P=20 (R ≈ 2) • Generate 6 queries/second • Kill one server • Measure query delay, plus load on the failed server's neighbors and on the rest of the servers • Expect • No disruption to queries • Load to increase by 10% on the neighbors
ROAR Load Balancing: Fast Solution • Experiment • 43 servers, of which 15 are more powerful (8× faster) • Equal ranges, 3 queries/sec • Use PQ > P to dynamically reduce query delay • Faster servers get more work, which implicitly balances load