MOAT: A Multi-Object Assignment Toolkit

Haifeng Yu, Intel Research Pittsburgh / CMU
Joint work with: Phillip B. Gibbons, Intel Research Pittsburgh

Background
• Availability has become a principal design goal:
  • 0.1% improvement → $2M / year for Amazon and eBay [internetweek.com]
  • One major focus of 8 OSDI’04 papers (out of 27)
• Two orthogonal efforts:
  • Robustness of lower-level system components
    • Example: disk, individual machine, Internet routing
  • Higher-level redundancy
    • Example: data replication
• This talk focuses on higher-level redundancy

High Availability via Replication
• Large amounts of data accessed by many users:
  • Distributed file systems
  • Network monitoring (PIER, SDIMS, IRISLOG)
  • Index databases for search engines (Google, p2p)
  • Scientific / medical databases
• Data is replicated across multiple machines
• Object: the unit of replication
  • File, file block, database table, database tuple, inverted index for a certain keyword

Multi-object Accesses
• Many accesses request multiple objects:
  • Compiling a project
  • Writing a paper in LaTeX
  • Asking for aggregates of network conditions
  • Searching for web pages containing multiple keywords
• Availability of a single object can be misleading:
  • An access requesting 1,000 objects can observe up to 1,000 times higher unavailability
• There is more subtlety...
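The "up to 1,000 times" figure can be checked with a short sketch. It assumes, purely for illustration, that every object is unavailable with the same probability u; the function names and the value of u are hypothetical. With independent objects the unavailability of an all-objects access is 1 - (1 - u)^n, and the union bound n * u caps it for any correlation.

```python
def multi_object_unavailability(u: float, n: int) -> float:
    """Unavailability of an access that needs all n objects, assuming the
    objects fail independently, each with probability u (an assumption)."""
    return 1.0 - (1.0 - u) ** n


def union_bound(u: float, n: int) -> float:
    """Upper bound on the same quantity; holds for any correlation."""
    return min(1.0, n * u)


if __name__ == "__main__":
    u = 1e-6  # hypothetical single-object unavailability
    for n in (1, 10, 1000):
        print(n, multi_object_unavailability(u, n), union_bound(u, n))
```

For small u the two values nearly coincide, so an access over 1,000 objects sees roughly 1,000 times the single-object unavailability.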

A Simple Example
• Compile a small project with four files, each with two replicas: A, A, B, B, C, C, D, D
• Four machines fail independently with the same probability; each holds two files
• Which assignment gives better availability?
  • Assignment 1: machines hold {A, B}, {A, B}, {C, D}, {C, D}
  • Assignment 2: machines hold {A, C}, {A, B}, {C, D}, {B, D}
• Assignment 1 is better when all four files are needed
• Assignment matters because objects are now correlated

A Simple Example - Continued
• Suppose the user is happy even if only three objects are available (e.g., when computing an average)
  • Assignment 1: machines hold {A, B}, {A, B}, {C, D}, {C, D}
  • Assignment 2: machines hold {A, C}, {A, B}, {C, D}, {B, D}
• Now Assignment 2 is better
• Assignment makes a difference
  • Even when using the same machines (same amount of redundancy / resources)
  • The difference can easily be multiple nines
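The four-machine example is small enough to enumerate exactly. The sketch below assumes the machine contents listed in the code (one reading of the diagram: a partitioned layout and a spread layout) and a hypothetical failure probability p; it sums the probability of every failure pattern that leaves fewer than t files reachable.

```python
from itertools import product

# Machine contents for the two assignments (an assumed reading of the diagram).
PARTITIONED = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"C", "D"}]
SPREAD = [{"A", "C"}, {"A", "B"}, {"C", "D"}, {"B", "D"}]


def unavailability(assignment, t, p):
    """Exact probability that fewer than t distinct files are reachable
    when each machine fails independently with probability p."""
    total = 0.0
    for failed in product([False, True], repeat=len(assignment)):
        prob = 1.0
        alive_files = set()
        for files, is_failed in zip(assignment, failed):
            prob *= p if is_failed else 1.0 - p
            if not is_failed:
                alive_files |= files
        if len(alive_files) < t:
            total += prob
    return total


if __name__ == "__main__":
    p = 0.01  # hypothetical machine failure probability
    for t in (4, 3):
        print(t, unavailability(PARTITIONED, t, p), unavailability(SPREAD, t, p))
```

With t = 4 the partitioned layout fails with probability 2p^2 - p^4 while the spread layout fails with roughly 4p^2. With t = 3 the partitioned layout still fails at about 2p^2, but the spread layout needs three machines down, about 4p^3.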

Goal and Contributions
• MOAT (Multi-Object Assignment Toolkit):
  • Goal: high availability for multi-object accesses
  • Key issue: replica assignment
• Contributions:
  • First to observe the importance of replica assignment
  • Strong theoretical results regarding best and worst assignments
  • Practical designs to approximate optimal assignments
  • MOAT toolkit implementation for replica assignments

Outline
• Motivation and MOAT contributions ✓
• System model and case studies of existing systems
• Theoretical results
• Designs for approximating optimal assignments
• Designs for mixed accesses
• Conclusions

Assumptions for This Talk
• Assume:
  • Replication (no erasure coding)
  • Crash failures (no Byzantine failures)
  • Eventual consistency (no quorum or voting)
  • Most of our results hold without these assumptions
• Assume the same replication degree for all objects
  • We have results for different replication degrees as well
• Talk to me if interested in the more complete story...

MOAT Architecture Overview
[Figure: architecture diagram. Labels: App (file system, p2p DB, search engine, network monitoring); Data API (obj create / delete / read / write); Control API (assignment policy); MOAT (replication / repair / load balancing / naming / assignment); Storage System (raw data on distributed machines or disks).]

System Model
• Basic system model:
  • N objects, each with k replicas
  • Load balancing among all machines
  • Machines fail independently with the same probability
• An assignment is a mapping replica → machine, for all Nk replicas
• [Figure: replicas of objects A, B, C, D mapped to machines]

Some Simple Assignments
• PTN: partition assignment
  • Used in most deployments of Coda [Satyanarayanan et al.’90]
  • [Figure: for k = 2, objects A B C D E F and their second replicas placed on partitioned machines]
• RAND: random assignment, picking a random machine each time a replica is placed
  • Similar to the Google File System [Ghemawat et al.’03]
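The two policies can be sketched as functions that return the mapping object → machines. This is only an illustration under stated assumptions: machines and objects are numbered from 0, PTN forms groups from consecutive machine numbers, and RAND draws k distinct machines per object without enforcing exact load balance. All names are hypothetical.

```python
import random


def ptn_assignment(num_objects, k, num_machines):
    """Partition machines into groups of k; every object placed in a group
    keeps one replica on each machine of that group."""
    if num_machines % k:
        raise ValueError("this sketch assumes num_machines is a multiple of k")
    num_groups = num_machines // k
    assignment = {}
    for obj in range(num_objects):
        group = obj % num_groups  # round-robin keeps the load balanced
        assignment[obj] = tuple(range(group * k, group * k + k))
    return assignment


def rand_assignment(num_objects, k, num_machines, rng):
    """Place the k replicas of each object on k distinct random machines."""
    return {obj: tuple(rng.sample(range(num_machines), k))
            for obj in range(num_objects)}


if __name__ == "__main__":
    print(ptn_assignment(6, 2, 6))
    print(rand_assignment(6, 2, 6, random.Random(0)))
```

Under PTN two objects either share all of their machines or none; under RAND most pairs of objects overlap on at most one machine.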

Assignment in Chord [Stoica et al.’01]
• DHTs:
  • Hash the machine IP to get the machine id
• Assignment in Chord:
  • Sliding window
  • Neither PTN nor RAND
• [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, C placed around the ring]
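One way to read "sliding window" is successor-list placement: the object goes to the first machine whose id is at or after its hash, and the remaining replicas go to the next machines clockwise. The sketch below assumes that reading; the function name is hypothetical and the ids are the ones shown on the ring.

```python
from bisect import bisect_left


def chord_replicas(object_hash, machine_ids, k):
    """Return the k machines holding an object: its successor on the ring
    followed by the next k - 1 machines clockwise (assumed placement rule)."""
    ring = sorted(machine_ids)
    start = bisect_left(ring, object_hash) % len(ring)  # wraps past the top id
    return [ring[(start + i) % len(ring)] for i in range(k)]


if __name__ == "__main__":
    ring_ids = [80, 90, 98, 101, 104, 120]
    print(chord_replicas(95, ring_ids, 3))
```

Neighbouring objects get overlapping but not identical machine sets, which is why this placement is neither PTN nor RAND.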

Assignment in CAN [Ratnasamy et al.’01]
• Hash the object k times
  • CAN uses a similar approach
• Similar to RAND
  • But machines may hold slightly different numbers of objects
• [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(A) = 95, hash2(A) = 119, hash1(B) = 84, hash2(B) = 100]
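Hashing the object k times can be sketched on the same ring. The code assumes each hash value is served by its successor machine, and it derives k hash functions by salting SHA-256; both choices, the identifier space of 128, and all names are assumptions for illustration, not the scheme CAN actually uses.

```python
import hashlib
from bisect import bisect_left

ID_SPACE = 128  # hypothetical identifier space, large enough for the ids shown


def object_hashes(name, k):
    """k hash values for one object, derived by salting SHA-256 (assumption)."""
    values = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{name}".encode()).digest()
        values.append(int.from_bytes(digest[:8], "big") % ID_SPACE)
    return values


def multi_hash_replicas(hash_values, machine_ids):
    """Map every hash value to its successor machine on the ring."""
    ring = sorted(machine_ids)
    return [ring[bisect_left(ring, h) % len(ring)] for h in hash_values]


if __name__ == "__main__":
    ring_ids = [80, 90, 98, 101, 104, 120]
    print(multi_hash_replicas([95, 119], ring_ids))  # hash values of A from the figure
    print(multi_hash_replicas([84, 100], ring_ids))  # hash values of B from the figure
    print(multi_hash_replicas(object_hashes("C", 2), ring_ids))
```

Two hash values of the same object can land on the same machine; a real design would need a rule for that case, which this sketch leaves out.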

Which assignment should we use?
• MOAT goal: improve availability of multi-object accesses
• If an access requests n (n ≤ N) objects, what if only x are available?
• Threshold-based success definition:
  • If x ≥ t, the user is happy → Available
  • If x < t, confidence is too low → Unavailable
• Availability of an access is defined as:
  • Prob[ ≥ t objects available out of n requested objects ]
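The threshold definition translates directly into a predicate. In this sketch an assignment is a dictionary from object name to the machines holding its replicas, and an object counts as available when at least one of those machines is up; the data layout and names are illustrative assumptions.

```python
def available_count(assignment, requested, failed_machines):
    """Number of requested objects with at least one replica on a live machine."""
    return sum(
        1 for obj in requested
        if any(m not in failed_machines for m in assignment[obj])
    )


def access_succeeds(assignment, requested, failed_machines, t):
    """Threshold-based success: the access is available iff x >= t."""
    return available_count(assignment, requested, failed_machines) >= t


if __name__ == "__main__":
    assignment = {"A": (0, 1), "B": (1, 2), "C": (2, 3)}
    failed = {1, 2}
    print(available_count(assignment, ["A", "B", "C"], failed))
    print(access_succeeds(assignment, ["A", "B", "C"], failed, t=2))
```

Availability of an access is then the probability, over machine failures, that this predicate holds.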

Examples of t
• t = n
  • File systems
  • Search for terrorist images in an image database
• t close to n
  • Query for the top-10 most-loaded machines on PlanetLab
• t not close to n
  • Sampling with confidence

Outline
• Motivation and MOAT contributions ✓
• System model and case studies of existing systems ✓
• Theoretical results
• Designs for approximating optimal assignments
• Designs for mixed accesses
• Conclusions

Formal Results
• For an access requesting all N objects
• Theorem: among all assignments, when t = N:
  • PTN is best (within a constant)
  • RAND is worst (within a constant)
  • The difference is about c-fold (c is #objects / machine)
• Theorem: among all assignments, when t = c + 1 < N:
  • PTN is worst
  • RAND is best (within a constant)
  • The difference is even larger

Numerical Examples (from Simulation)
[Figure: unavailability vs. threshold for PTN, Chord, and RAND (CAN), with the unavailability of a single object marked for reference. 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2. Annotation: c times difference if p is small, where c is # obj/machine.]
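The qualitative shape of such a comparison can be reproduced with a much smaller Monte Carlo run. The sketch below is not the simulator behind the figure: it assumes 400 objects, 2 replicas each, 40 machines, and p = 0.1, and it uses simplified PTN and RAND generators. All names and parameters are illustrative.

```python
import random


def ptn(num_objects, k, num_machines):
    """Partition assignment: machines form groups of k, objects go round-robin."""
    groups = num_machines // k
    return [tuple(range((o % groups) * k, (o % groups) * k + k))
            for o in range(num_objects)]


def rand(num_objects, k, num_machines, rng):
    """Random assignment: k distinct random machines per object."""
    return [tuple(rng.sample(range(num_machines), k)) for _ in range(num_objects)]


def estimate_unavailability(assignment, num_machines, p, t, trials, rng):
    """Fraction of trials in which fewer than t objects have a live replica."""
    bad = 0
    for _ in range(trials):
        down = [rng.random() < p for _ in range(num_machines)]
        available = sum(1 for replicas in assignment
                        if not all(down[m] for m in replicas))
        if available < t:
            bad += 1
    return bad / trials


def compare(seed=0, trials=1000):
    """Scaled-down run: 400 objects, 2 replicas each, 40 machines, p = 0.1."""
    rng = random.Random(seed)
    n_obj, k, n_mach, p = 400, 2, 40, 0.1
    assignments = {"PTN": ptn(n_obj, k, n_mach),
                   "RAND": rand(n_obj, k, n_mach, rng)}
    results = {}
    for name, assignment in assignments.items():
        for t in (n_obj, n_obj - 19):  # strict threshold, then a looser one
            results[(name, t)] = estimate_unavailability(
                assignment, n_mach, p, t, trials, rng)
    return results


if __name__ == "__main__":
    for key, value in sorted(compare().items()):
        print(key, value)
```

In this setting each PTN group holds 20 objects, so losing one group drops 20 objects at once. That is harmless to PTN's ranking when t = N, where it does far better than RAND, but fatal at t = N - 19, where RAND does far better.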

A Spectrum of Assignments
[Figure: unavailability vs. threshold for a spectrum of assignments between PTN and RAND (CAN). 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2.]

More Formal Arguments
• The tradeoff is fundamental:
  • Impossible to achieve the best of both RAND and PTN
• Previous results are only for accesses requesting all N objects
• Similar results hold for accesses requesting n (n ≤ N) objects
  • But each machine may not be filled to capacity:
    • For PTN, use as few machines as possible
    • For RAND, use as many machines as possible
• I have more... talk to me if you are interested

Access Requesting 500 Objects
[Figure: unavailability vs. threshold for RAND (CAN), Chord, and PTN. 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2.]

Outline
• Motivation and MOAT contributions ✓
• System model and case studies of existing systems ✓
• Theoretical results ✓
• Designs for approximating optimal assignments
• Designs for mixed accesses
• Conclusions

Design of Replica Assignment
• Trivial in a static / centralized environment
• Challenging in a dynamic environment:
  • We may not have global knowledge with many objects and many machines
• Basic solution: consistent hashing
  • But some re-design is necessary

Approximating RAND
• Multi-hash DHT:
  • Hash the object k times
  • As in CAN
• [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(B) = 84, hash2(B) = 100]

Approximating PTN
• Chord does not achieve PTN
• Group DHT:
  • (Arbitrarily) group machines into groups of size k
• [Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120 and hash(A) = 95, alongside a Group DHT ring with ids 090, 101, 120]
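A Group DHT can be sketched by letting a whole group, rather than a single machine, own a point on the ring. The code below assumes that each group has one ring id and that an object is stored on every machine of its successor group; the machine names, the way ids are given to groups, and the function names are hypothetical.

```python
from bisect import bisect_left


def build_groups(machine_names, k):
    """Arbitrarily chunk machines into groups of size k; leftovers stay ungrouped."""
    full = len(machine_names) - len(machine_names) % k
    return [tuple(machine_names[i:i + k]) for i in range(0, full, k)]


def group_dht_replicas(object_hash, group_ring):
    """group_ring maps a ring id to the group owning it; the object's k
    replicas go to all machines of the successor group."""
    ids = sorted(group_ring)
    owner = ids[bisect_left(ids, object_hash) % len(ids)]
    return group_ring[owner]


if __name__ == "__main__":
    groups = build_groups(["m1", "m2", "m3", "m4", "m5", "m6"], k=2)
    ring = dict(zip([90, 101, 120], groups))  # ring ids as in the figure
    print(group_dht_replicas(95, ring))
```

Because every object in a group shares the same k machines, the resulting layout is a partition assignment.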

Node Join and Leave in Group DHT
• Maintain r rendezvous points in the DHT
  • Diminishing Chord [Karger et al.’04] / ReDir [Karp et al.’04]
• A new node reports to a random rendezvous point
  • If a group can be formed, it joins the DHT
• Two options upon node leave:
  • Dismiss the group and delete it from the DHT
  • The group waits to recruit a new node
  • Groups use the rendezvous point to decide
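The join path can be sketched with an in-memory stand-in for the rendezvous points. Everything here is an assumption for illustration: the class and function names are hypothetical, a group forms as soon as k nodes wait at the same point, and node leave is not modelled.

```python
import random


class RendezvousPoint:
    """Collects nodes that are not yet part of a group."""

    def __init__(self, k):
        self.k = k
        self.waiting = []

    def report(self, node):
        """Record a new node; return a full group once k nodes are waiting."""
        self.waiting.append(node)
        if len(self.waiting) >= self.k:
            group = tuple(self.waiting[:self.k])
            self.waiting = self.waiting[self.k:]
            return group
        return None


def join(node, rendezvous_points, dht_groups, rng):
    """A new node reports to a random rendezvous point; if a group can be
    formed, the group joins the DHT (modelled here as a plain list)."""
    point = rng.choice(rendezvous_points)
    group = point.report(node)
    if group is not None:
        dht_groups.append(group)
    return group


if __name__ == "__main__":
    rng = random.Random(0)
    points = [RendezvousPoint(k=2) for _ in range(3)]  # r = 3
    dht = []
    for name in ["n1", "n2", "n3", "n4", "n5", "n6"]:
        join(name, points, dht, rng)
    print(dht, [p.waiting for p in points])
```

Fewer rendezvous points means fewer nodes left waiting, at the cost of more load on each point.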

Complexity Analysis

Outline
• Motivation and MOAT contributions ✓
• System model and case studies of existing systems ✓
• Theoretical results ✓
• Designs for approximating optimal assignments ✓
• Designs for mixed accesses
• Conclusions

Mixture of Queries
• Previous design is only for a single access requesting all N objects
  • PTN if t is close to N
  • RAND if t is far from N
• But there are other accesses
  • Requesting n (n < N) objects with threshold t
  • How does t change with n?
    • Infinite possibilities
    • We focus on 4 large categories

Four Application Scenarios
• Strict accesses: t equal or close to n
• Loose accesses: t < n
• The four scenarios combine strict or loose behavior for small n with strict or loose behavior for large n

Loose for both small and large n
• Goal:
  • Approach RAND for both small and large n
• Design:
  • Multi-hash DHT
• [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(B) = 84, hash2(B) = 100]

Loose for small n; Strict for large n
• Goal:
  • Approach RAND for small n
  • Approach PTN for large n
• Design:
  • Group DHT
• [Figure: Group DHT ring with ids 090, 101, 120]

Strict for both small and large n
• Goal:
  • Approach PTN for both small and large n
  • Assume accesses are tree accesses
• Design:
  • Group DHT with item-balancing [Karger et al.’04]
• [Figure: Group DHT ring with ids 090, 101, 120; A = 95]

Strict for small n; Loose for large n
• Goal:
  • Approach PTN for n < R
  • Approach RAND for n >> R
• Design:
  • Multi-hash DHT
  • But cluster objects into clusters of constant size R
• [Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(AB) = 84, hash2(AB) = 100]
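Clustering changes only what gets hashed: the cluster, not the individual object. The sketch assumes objects are numbered and that R consecutive objects form a cluster; it hashes the cluster k times with salted SHA-256 and maps each value to its successor machine. The identifier space, the clustering rule, and the names are illustrative assumptions.

```python
import hashlib
from bisect import bisect_left

ID_SPACE = 128  # hypothetical identifier space


def salted_hash(key, salt):
    """One of k hash functions, simulated by salting SHA-256 (assumption)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % ID_SPACE


def clustered_replicas(object_index, cluster_size, k, machine_ids):
    """Machines for an object: hash its cluster k times, take each successor."""
    ring = sorted(machine_ids)
    cluster = object_index // cluster_size  # R consecutive objects per cluster
    homes = []
    for i in range(k):
        h = salted_hash(f"cluster-{cluster}", i)
        homes.append(ring[bisect_left(ring, h) % len(ring)])
    return homes


if __name__ == "__main__":
    ring_ids = [80, 90, 98, 101, 104, 120]
    for obj in range(4):
        print(obj, clustered_replicas(obj, cluster_size=2, k=2, machine_ids=ring_ids))
```

Objects within one cluster always share machines, as under PTN, while different clusters are scattered independently, as under RAND.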

Simulation Results for Strict Accesses
• Here an access needs all n objects to be successful
• [Figure: unavailability vs. number (n) of objects requested by an access. 400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object.]

Simulation Results for Loose Accesses
• Here an access needs only t = n - 150 objects to be successful
• [Figure: unavailability vs. number (n) of objects requested by an access. 400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object.]

Current Status
• Waiting for paper deadlines
• Finishing the MOAT implementation
• Evaluation on IrisLog trace and file system traces

Related Work
• Multi-object accesses are rarely addressed
  • CFS [Dabek et al.’01] focuses on individual file blocks
  • Chain replication [Renesse et al.’04] considers a single data object
  • A long list...
• Replica assignment is largely ignored
  • Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments: effects not understood / studied
• Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] is well studied:
  • Typically for machines in different locations in the network
  • Machines are heterogeneous
  • These approaches do not apply to replica assignment

Conclusions
• Availability is becoming a key design goal
• Multi-object access availability is dramatically different from single-object availability
• MOAT contributions:
  • First to observe the importance of replica assignment
  • Strong theoretical results regarding the best and worst assignments
  • Practical designs to approximate optimal assignments
  • MOAT toolkit implementation

My Other Recent Work
• Om [NSDI’04]:
  • Consistent and automatic replica regeneration
  • Regenerate from any single replica rather than a majority
• Signed quorum systems [PODC’04]:
  • Constant quorum size at the cost of a small probability of inconsistency
• Node failure characteristics in WAN [WORLDS’04]:
  • Answer subtle questions regarding real-world failure properties


Erasure Coding
• Encode the object into k fragments such that any m (m < k) out of the k fragments can reconstruct the object
• RAID techniques are special cases
• Replication is a special case where m = 1
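For a single object the m-out-of-k rule gives a closed-form availability. The sketch assumes the k fragments sit on k distinct machines that fail independently with probability p; the function name is hypothetical.

```python
from math import comb


def coded_object_availability(k, m, p):
    """Probability that at least m of k fragments are on live machines,
    assuming independent machine failures with probability p."""
    return sum(comb(k, i) * (1 - p) ** i * p ** (k - i) for i in range(m, k + 1))


if __name__ == "__main__":
    p = 0.2
    print(coded_object_availability(4, 1, p))  # replication with 4 replicas
    print(coded_object_availability(8, 4, p))  # any 4 out of 8 fragments
```

Setting m = 1 recovers plain replication, where the object is lost only if all k machines fail.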

Example Revisited
• Need four files to compile:
  • Assignment 1: machines hold {A, B}, {A, B}, {C, D}, {C, D} (better)
  • Assignment 2: machines hold {A, C}, {A, B}, {C, D}, {B, D}
• Can we treat A, B, C, D as a single object and use erasure coding, so that all files can be reconstructed from any 4 out of 8 fragments?
• Erasure coding is hard to apply across a large amount of data
  • Updating any portion of the data requires updating k - m + 1 fragments, about the size of the original data
  • We cannot use erasure coding across 1,000 files

Threshold Semantics and Erasure Coding
• In short, they are different, orthogonal concepts

Numerical Examples (from Simulation)
[Figure: unavailability vs. threshold for Chord, PTN, CRAND (100), CRAND (10), and RAND (CAN). 40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2. Annotation: c times difference if p is small, where c is # obj/machine.]
