
ROAR: Increasing the Flexibility and Performance of Distributed Search



  1. ROAR: Increasing the Flexibility and Performance of Distributed Search. Costin Raiciu, University College London. Joint work with Felipe Huici, Mark Handley, and David S. Rosenblum.

  2. We Rely on Distributed Search Every Day • Distributed search apps • Web search (Google, Bing, etc.) • Online database search (Wikipedia, Amazon, eBay, etc.) • More general: parallel databases • Characteristics • Data too big to fit on one server • Latency too high if queries are run on one server

  3. Distributed Search at Work [Barroso et al., 2003] • [Diagram: N=6 servers; the data is split into P=3 partitions of 1/3 each, and each partition is replicated on R=2 servers, so P × R = N; a frontend server splits every query across the partitions and sends each piece to one replica]
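
The cluster layout above can be made concrete with a minimal sketch (the function and variable names are illustrative, not from the paper):

```python
import random

def make_clusters(servers, P, R):
    """Split N = P * R servers into P partitions of R replicas each."""
    assert len(servers) == P * R
    return [servers[i * R:(i + 1) * R] for i in range(P)]

def route_query(clusters, query):
    """The frontend sends the query to one replica of every partition."""
    return [(random.choice(replicas), query) for replicas in clusters]

clusters = make_clusters([f"s{i}" for i in range(6)], P=3, R=2)
print(route_query(clusters, "q"))  # three sub-queries, one per partition
```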

  4. P Affects System Behavior • P (the number of query partitions) dictates how much data each node stores • P impacts query latency and overheads • Problem: P is difficult to change • Our contribution: a system that can change P efficiently at runtime

  5. Partitioning Determines Latency and Cost • [Diagram: the same work to be done for a query, split across P=4 servers vs. P=2 servers; higher P means less work per server per query]

  6. Partitioning Dictates Latency • [Plot: query delay vs. P, with the lower bound on P needed to sustain 6 q/s and 10 q/s]

  7. Partitioning Dictates Cost • [Plot]

  8. The Problem • P is very difficult to change with existing solutions • Google changes it only out of necessity, when the web index outgrows memory • Not changing it dynamically means the system is either inefficient or misses the target delay for some workloads

  9. How Google Changes P [Jeffrey Dean, Google, 2009] • [Diagram: queries keep flowing to the existing clusters (Cluster 1, Cluster 2) while new clusters (Cluster 1', 2', 3') are built with the new partitioning, then queries switch over] • Requires over-provisioning • Copies a lot of data • Our estimate: 20 TB per data center

  10. Our Proposal: Rendezvous On A Ring (ROAR) • Key observation: we do not need clusters to ensure each query meets all the data! • Changes the way data are partitioned and replicated • Allows on-the-fly reconfiguration with minimal bandwidth cost

  11. Rendezvous On A Ring (ROAR) • Uses consistent hashing [Karger et al.] • Parameter P: the number of query partitions • [Diagram: servers placed on a circular hash space from 0 to 1; each server owns the range back to its predecessor, e.g. the server at 0.5 owns the range (0.4, 0.5]]
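
A minimal Python sketch of this ring abstraction, assuming SHA-1 placement on a [0, 1) hash space (`point` and `Ring` are assumed names; this is illustrative code, not the authors' implementation):

```python
import hashlib
from bisect import bisect_left

def point(key: str) -> float:
    """Hash a key to a position on the unit ring [0, 1)."""
    h = hashlib.sha1(key.encode()).hexdigest()  # 40 hex digits
    return int(h, 16) / 16**40

class Ring:
    """Servers on a circular hash space; each owns the arc from its
    predecessor's position up to (and including) its own."""
    def __init__(self, servers):
        self.points = sorted((point(s), s) for s in servers)

    def successor(self, x: float) -> str:
        """The server whose range contains ring position x."""
        i = bisect_left(self.points, (x,))
        return self.points[i % len(self.points)][1]
```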

  12. ROAR: Storing Data • P=4 • [Diagram: each object's hashed ID places it at a point on the ring; the object is replicated on every server whose range intersects the arc of length 1/P = 1/4 that starts at that point]

  13. ROAR: Storing Data, continued • [Diagram: the completed placement at P=4; every object is replicated along its 1/4 arc]
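
Building on the Ring sketch above, the storage rule might look like the following; the arc-of-length-1/P replication matches the slides, while the traversal details are an assumption:

```python
def replicas_for(ring: Ring, obj_id: str, P: int):
    """All servers whose range intersects the arc [x, x + 1/P),
    where x is the object's hashed ring position."""
    x = point(obj_id)
    end = x + 1.0 / P
    n = len(ring.points)
    i = bisect_left(ring.points, (x,))
    chosen = []
    while True:
        pos, server = ring.points[i % n]
        pos += i // n                   # unwrap positions past 1.0
        if server not in chosen:
            chosen.append(server)
        if pos >= end:                  # this range covers the arc's end
            return chosen
        i += 1

ring = Ring([f"server{i}" for i in range(50)])
print(replicas_for(ring, "doc123", P=4))
```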

  14. ROAR: Running Queries • P=4 • [Diagram: the frontend picks a "start here" point on the ring, then selects the servers at successive offsets of 1/4; their stored arcs jointly cover the whole ring, so the query meets all the data]
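
A sketch of query scheduling under the same assumptions (`query_servers` is a hypothetical name):

```python
import random

def query_servers(ring: Ring, P: int):
    """Pick a random start point, then one server every 1/P around the
    ring. Each picked server stores the full 1/P arc behind its slot,
    so together the P servers see every object exactly once."""
    t = random.random()
    return [ring.successor((t + k / P) % 1.0) for k in range(P)]
```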

  15. ROAR Can Run Queries at Higher P • [Diagram: data stored with P=4, query run with PQ=5; the five selected servers' stored arcs overlap, so object A would be matched twice while object B is matched once; restricting each server to its assigned 1/PQ slice avoids duplicate matches]
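
The duplicate-match problem suggests that each server should search only its assigned slice of the ring; this hedged sketch encodes that reading:

```python
import random

def query_at(ring: Ring, PQ: int):
    """Run a query at partitioning PQ (PQ >= the storage parameter P)
    without moving any data: each chosen server searches only its
    1/PQ slice of the ring, so overlapping replicas are never matched
    twice."""
    t = random.random()
    tasks = []
    for k in range(PQ):
        lo = (t + k / PQ) % 1.0
        hi = (t + (k + 1) / PQ) % 1.0
        server = ring.successor(hi)   # its 1/P arc contains (lo, hi]
        tasks.append((server, (lo, hi)))
    return tasks
```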

  16. ROAR Copies Zero Data to Increase P • [Diagram: P=4; delay is higher than the target, so the frontend tells every server to set P=5; each object's replication arc shrinks from 1/4 to 1/5, and servers simply delete the objects that fall outside their new arcs]
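
Under the sketches above, increasing P is purely local deletion; for example:

```python
def on_set_P(ring: Ring, server: str, stored_ids, new_P: int):
    """Handle a frontend 'set P=new_P' message when new_P > old P:
    every object's replication arc shrinks from 1/old_P to 1/new_P,
    so this server just deletes objects whose arc no longer reaches
    it. Nothing is copied over the network."""
    return [oid for oid in stored_ids
            if server in replicas_for(ring, oid, new_P)]
```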

  17. Minimal Data Is Copied When P Decreases • [Diagram: P=5; delay is lower than the target => change P to 4] • The frontend tells the servers to switch to P=4 • It starts using P=4 once the servers finish loading the extra replicas • The copying is done while latency is low
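
A complementary sketch for the decrease. Scanning candidate object IDs is for illustration only; a real system would transfer ring ranges between neighbors:

```python
def missing_after_grow(ring: Ring, server: str, stored_ids,
                       candidate_ids, new_P: int):
    """When P decreases, arcs grow from 1/old_P to 1/new_P; a server
    must fetch (from neighboring replicas) only the objects whose
    longer arc now reaches it. The frontend keeps querying at the old
    P until every server reports this delta as loaded."""
    have = set(stored_ids)
    return [oid for oid in candidate_ids
            if oid not in have
            and server in replicas_for(ring, oid, new_P)]
```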

  18. ROAR Tolerates Faults • [Diagram: a query at P=4 encounters a failed server (X); the missing portion of the ring is re-covered with sub-queries at PQ=8 sent to other replicas of the same data]

  19. Experimental Evaluation • We have implemented ROAR • Tested • Extensively on ~50 servers in UCL’s HEN testbed • Briefly on 1000 servers in Amazon EC2

  20. Application • Privacy Preserving Search (PPS) • Servers match encrypted queries against encrypted data • PPS is CPU-bound • Applications with different bottlenecks should have qualitatively similar behavior

  21. Can ROAR Repartition Dynamically? • Workload • Index of 1M files • Generate 4 random queries/second • Target average delay: 1 s • The frontend server changes P dynamically based on the average delay • Start the network with P=40
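
The frontend's adaptation step could look like the following sketch; the thresholds and hysteresis are assumptions, not the paper's controller:

```python
TARGET_DELAY = 1.0          # seconds, from the experiment above
HEADROOM = 0.5              # assumed hysteresis factor

def adapt_P(P: int, recent_delays) -> int:
    """One step of the frontend's control loop: raise P when average
    delay misses the target (servers only delete data), lower it when
    there is ample slack (servers first fetch small deltas)."""
    if not recent_delays:
        return P
    avg = sum(recent_delays) / len(recent_delays)
    if avg > TARGET_DELAY:
        return P + 1
    if avg < TARGET_DELAY * HEADROOM:
        return max(1, P - 1)
    return P
```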

  22. ROAR Changes P Efficiently • [Plot: query delay over time; the frontend changes P to 5 and later to 10, and query delays stay stable during both changes]

  23. Other experiments in the paper • Fault tolerance • Load balancing • Energy savings • Scaling: 1000 servers on Amazon EC2 • Unexpected delay variation caused by high packet loss rates

  24. Conclusion • Today’s cluster-based distributed search is rigid: • Locks the system to specific values of P • When load exceeds expectation: target delay missed • When load undershoots: resources wasted • Changing P is costly • We don’t have to accept a fixed operating point • ROAR dynamically adapts P to fluctuations in load • With minimal resources • Without disrupting queries • Tolerates faults and balances load

  25. Backup Slides

  26. ROAR Tolerates Failures Gracefully

  27. ROAR Scales to 1000 Servers on Amazon EC2 • Frontend overhead: 25 ms to schedule a query on 1000 servers • Matching delay at each server decreases as expected • Unexpected problem: huge variation in end-to-end query delay

  28. Potential Energy Savings

  29. Does ROAR Tolerate Failures? • Experiment • Set P=20 (R ≈ 2) • Generate 6 queries/second • Kill one server • Measure query delay and the load on the failed server's neighbors vs. the rest of the servers • Expect • No disruption to queries • Load on the neighbors to increase by 10%

  30. ROAR Balances Load Properly

  31. ROAR Load Balancing: Fast Solution • Experiment • 43 servers, of which 15 are more powerful (8x faster) • Equal ranges, 3 queries/sec • Use PQ > P to dynamically reduce query delay • Faster servers get more work, which implicitly balances load (see the sketch below)
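
One plausible reading of "faster servers get more work" is to size each server's query slice in proportion to its measured speed; this is a hedged sketch of the idea, not the paper's exact policy:

```python
def weighted_slices(server_speeds, start=0.0):
    """Divide one query's ring coverage into per-server slices whose
    widths are proportional to measured server speed, so an 8x-faster
    server gets 8x the data to search."""
    total = sum(server_speeds.values())
    slices, lo = {}, start
    for server, speed in server_speeds.items():
        hi = lo + speed / total
        slices[server] = (lo, hi)   # this server's share of the ring
        lo = hi
    return slices

print(weighted_slices({"fast": 8.0, "slow1": 1.0, "slow2": 1.0}))
```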
