
Online Balancing of Range-Partitioned Data with Applications to P2P Systems

Prasanna Ganesan, Mayank Bawa, Hector Garcia-Molina. Stanford University.


Presentation Transcript


  1. Online Balancing of Range-Partitioned Data with Applications to P2P Systems • Prasanna Ganesan, Mayank Bawa, Hector Garcia-Molina • Stanford University

  2. Motivation • Parallel databases use range partitioning • Advantages: Inter-query parallelism • Data locality → low-cost range queries → high throughput [Figure: key range 0 to 100, partitioned at 20, 35, 60, and 80]

  3. The Problem • How to achieve load balance? • Partition boundaries have to change over time • Cost: Data movement • Goal: Guarantee load balance at low cost • Assumption: Load balance is beneficial! • Contribution • Online balancing -- self-tuning system • Slows down updates by a small constant factor

  4. Roadmap • Model and Definitions • Load Balancing Operations • The Algorithms • Extension to P2P Setting • Experimental Results

  5. Model and Definitions (1) • Nodes maintain range partition (on a key) • Load of a node = # tuples in its partition • Load imbalance σ = Largest load/Smallest load • Arbitrary sequence of tuple inserts and deletes • Queries not relevant • Automatically directed to relevant node

  6. Model and Definitions (2) • After each insert/delete: • Potentially fix “imbalance” by modifying partitioning • Cost = # tuples moved • Assume no inserts/deletes during balancing • Non-critical simplification • Goal: σ < constant always • Constant amortized cost per insert/delete • Implication: Faster queries, slower updates
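
The model on slides 5 and 6 can be illustrated with a minimal Python sketch. It assumes integer keys and reuses the partition boundaries from the Motivation slide; the helper names `loads` and `imbalance` and the sample keys are made up for illustration and do not come from the talk.

```python
from bisect import bisect_right

def loads(boundaries, keys):
    """Count tuples per node; node j owns [boundaries[j], boundaries[j+1])."""
    counts = [0] * (len(boundaries) - 1)
    for k in keys:
        counts[bisect_right(boundaries, k) - 1] += 1
    return counts

def imbalance(counts):
    """sigma = largest load / smallest load (infinite if some node is empty)."""
    return max(counts) / min(counts) if min(counts) > 0 else float("inf")

boundaries = [0, 20, 35, 60, 80, 100]          # boundaries from the Motivation slide
keys = [3, 7, 12, 18, 25, 40, 41, 59, 61, 70, 85, 99]   # hypothetical tuples
counts = loads(boundaries, keys)
sigma = imbalance(counts)
print(counts, sigma)
```

Any rebalancing step would then be charged one unit of cost per tuple that changes node.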

  7. Load Balancing Operations (1) • NbrAdjust: Transfer data between “neighbors” [Figure: neighbors A and B; ranges [0,50) and [50,100) become [0,35) and [35,100)]
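
A rough sketch of NbrAdjust in Python follows. The slide only says that data is transferred between neighbors; splitting the combined load evenly, the function name `nbr_adjust`, and the sample keys are assumptions made for this example. Keys are assumed distinct.

```python
def nbr_adjust(left, right):
    """Even out two neighbours' sorted key lists.

    Returns (new_left, new_right, new_boundary, cost), where cost is the
    number of tuples that changed node.
    """
    merged = left + right                 # adjacent ranges, so this stays sorted
    half = len(merged) // 2
    new_left, new_right = merged[:half], merged[half:]
    cost = abs(len(left) - len(new_left))
    boundary = new_right[0] if new_right else None
    return new_left, new_right, boundary, cost

a = [2, 9, 17, 24, 31, 33, 35, 42, 44, 47]   # hypothetical keys in [0,50)
b = [55, 90]                                  # hypothetical keys in [50,100)
new_a, new_b, boundary, cost = nbr_adjust(a, b)
print(boundary, cost)
```

With these keys the boundary moves from 50 to 35, matching the ranges in the slide's figure, at a cost of four moved tuples.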

  8. Is NbrAdjust good enough? • Can be highly inefficient • Ω(n) amortized cost per insert/delete (n = # nodes) [Figure: nodes A to F]

  9. Load Balancing Operations (2) • Reorder: Hand over data to neighbor and split load of some other node [Figure: nodes A to F over ranges [0,10), [10,20), [20,30), [30,40), [40,50), [50,60); the reorder introduces [0,5), [5,10), and [40,60)]
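
The Reorder operation can be sketched in Python as below. The data layout (a key-ordered list of `[name, keys]` pairs), the function name `reorder`, the choice of handing data to the left neighbor when one exists, and the even split are all assumptions for illustration; the slide does not fix these details.

```python
def reorder(nodes, mover, target):
    """Move `mover` next to `target`; nodes is a key-ordered list of [name, sorted_keys].

    Returns the number of tuples moved. Assumes mover != target and >= 2 nodes.
    """
    m = next(i for i, node in enumerate(nodes) if node[0] == mover)
    name, keys = nodes.pop(m)
    if m > 0:
        nodes[m - 1][1] = nodes[m - 1][1] + keys   # hand range to left neighbour
    else:
        nodes[0][1] = keys + nodes[0][1]           # mover was first: use right neighbour
    t = next(i for i, node in enumerate(nodes) if node[0] == target)
    tkeys = nodes[t][1]
    half = len(tkeys) // 2
    nodes[t][1] = tkeys[:half]                     # target keeps the lower half
    nodes.insert(t + 1, [name, tkeys[half:]])      # mover takes the upper half
    return len(keys) + len(tkeys) - half

nodes = [["A", [1, 2, 3, 4, 5, 6, 7, 8]], ["B", [12]], ["C", [25]],
         ["D", [33]], ["E", [44]], ["F", [52]]]    # hypothetical keys
cost = reorder(nodes, "F", "A")
order = [name for name, _ in nodes]
print(order, cost)
```

Here lightly loaded F gives its single tuple to E and then takes over the upper half of A's range, which mirrors the [0,5), [5,10), and [40,60) ranges in the figure.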

  10. Roadmap • Model and Definitions • Load Balancing Operations • The Algorithms • Experimental Results • Extension to P2P Setting

  11. The Doubling Algorithm • Geometrically divide loads into levels • Level i: load in (2^i, 2^(i+1)] • Will try balancing on level change • Two invariants • Neighbors tightly balanced • Max 1 level apart • All nodes within 3 levels • Guarantees σ ≤ 8 [Figure: load scale with thresholds 1, 2, 4, 8, ..., 2^i, 2^(i+1), 2^(i+2); Level 0 spans 1 to 2, Level 1 spans 2 to 4, Level 2 spans 4 to 8, Level i spans 2^i to 2^(i+1)]
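
The level structure can be checked with a short Python sketch. It reads “within 3 levels” as “all loads fall in three consecutive levels”, which is the reading consistent with σ ≤ 8; that interpretation, the function names, and the sample loads are assumptions, and loads are taken to be positive integers.

```python
def level(load):
    """Level i such that 2**i < load <= 2**(i+1), for an integer load >= 1."""
    return (load - 1).bit_length() - 1

def invariants_hold(loads):
    """Check both invariants for loads listed in key order."""
    levels = [level(x) for x in loads]
    neighbours_ok = all(abs(a - b) <= 1 for a, b in zip(levels, levels[1:]))
    span_ok = max(levels) - min(levels) <= 2      # three consecutive levels
    return neighbours_ok and span_ok

loads = [5, 8, 9, 16, 30, 17]                     # hypothetical loads of A..F
levels = [level(x) for x in loads]
ok = invariants_hold(loads)
sigma = max(loads) / min(loads)
print(levels, ok, sigma)
```

If the smallest level is i, every load lies in (2^i, 2^(i+3)], so the largest load is less than 8 times the smallest.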

  12. The Doubling Algorithm (2) [Figure: nodes A to F]

  13. The Doubling Algorithm (2) [Figure: nodes A to F]

  14. The Doubling Algorithm (2) [Figure: nodes A to F]

  15. The Doubling Algorithm: Case 2 • Search for a blue node • If none, do nothing! [Figure: node order A B C D E F]

  16. The Doubling Algorithm: Case 2 • Search for a blue node • If none, do nothing! [Figure: node order A B E C D F]

  17. The Doubling Algorithm (3) • Similar operations when load goes down a level • Try balancing with neighbor • Otherwise, find a red node and reorder yourself • Costs and Guarantees • σ ≤ 8 • Constant amortized cost per insert/delete

  18. From Doubling to Fibbing • Change thresholds to Fibonacci numbers • σ ≤ φ^3 ≈ 4.2 • Can also use other geometric sequences • Costs are still constant • F_(i+2) = F_(i+1) + F_i
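
If the bound on this slide is read as the cube of the golden ratio (about 4.2), it can be sanity-checked numerically: with Fibonacci thresholds, loads confined to three consecutive levels differ by at most the ratio of thresholds three steps apart. The sketch below is illustrative only; the starting thresholds 1 and 2 and the name `fib_thresholds` are assumptions.

```python
from math import sqrt

def fib_thresholds(n):
    """First n thresholds, each the sum of the previous two."""
    out = [1, 2]
    while len(out) < n:
        out.append(out[-1] + out[-2])
    return out

phi = (1 + sqrt(5)) / 2               # golden ratio
t = fib_thresholds(20)
ratio = t[-1] / t[-4]                 # thresholds three steps apart
print(round(ratio, 4), round(phi ** 3, 4))
```

Since φ^2 = φ + 1, the limit is φ^3 = 2φ + 1 = 2 + √5, compared with 2^3 = 8 for doubling thresholds.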

  19. More Generalizations • Improve σ to (1+ε) for any ε > 0 [BG04] • Generalize neighbors to c-neighbors • Still constant cost O(1/ε) • Dealing with concurrent inserts/deletes • Allow multiple balancing actions in parallel • Paper claims it is ok

  20. Application to P2P Systems • Goal: Construct P2P system supporting efficient range queries • Provide asymptotic performance a la DHTs • What is a P2P system? A parallel DB with • Nodes joining and leaving at will • No centralized components • Limited communication primitives • Enhance load-balancing algorithms to • Allow dynamic node joins/leaves • Decentralize implementation

  21. Experiments • Goal: Study cost of balancing for different workloads • Compare to periodic re-balancing algorithms (Paper) • Trade-off between cost and imbalance ratio (Paper) • Results presented on Fibbing Algorithm (n=256) • Three-phase Workload • (1) Inserts (2) Alternating inserts and deletes (3) Deletes • Workload 1: Zipf • Random draws from Zipf-like distribution • Workload 2: HotSpot • Think key=timestamp • Workload 3: ShearStress • Insert at most-loaded, delete from least-loaded
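
To see why ShearStress is an adversarial workload, the following Python sketch applies its alternating phase to a vector of node loads with no balancing at all. The starting loads, the number of rounds, and the tie-breaking rule (first node wins) are assumptions for illustration, not parameters from the experiments.

```python
def shear_stress_round(loads):
    """One insert at the most-loaded node, then one delete at the least-loaded node."""
    hot = max(range(len(loads)), key=loads.__getitem__)
    loads[hot] += 1
    cold = min(range(len(loads)), key=loads.__getitem__)
    loads[cold] -= 1

loads = [10, 10, 10, 10]              # hypothetical starting loads
for _ in range(5):
    shear_stress_round(loads)
sigma = max(loads) / min(loads)
print(loads, sigma)
```

Without rebalancing the imbalance ratio keeps growing with every round, which is what makes this workload a stress test for the balancing algorithm.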

  22. Load Imbalance (Zipf) [Figure: load imbalance (y-axis, 0 to 4.5) versus time ×1000 (x-axis, 0 to 3000) across the growing, steady, and shrinking phases]

  23. Load Imbalance (ShearStress)

  24. Cost of Load Balancing

  25. Related Work • Karger & Ruhl [SPAA 04] • Dynamic model, weaker guarantees • Load balancing in DBs • Partitioning static relations, e.g., [GD92,RZML02, SMR00] • Migrating fragments across disks, e.g., [SWZ93] • Intra-node data structures, e.g., [LKOTM00] • Litwin et al. SDDS

  26. Conclusions • Indeed possible to maintain well-balanced range partitions • Range partitions competitive with hashing • Generalize to more complex load functions • Allow tuples to have dynamic weights • Change load definition in algorithms!* • Range partitioning is powerful • Enables P2P system supporting range queries • Generalizes DHTs with same asymptotic guarantees • *Lots of caveats apply. Need load to be evenly divisible. No guarantees offered on costs. This offer not valid with any other offers. Etc., etc., etc.
