Scalable Computing on Open Distributed Systems Jon Weissman University of Minnesota National E-Science Center CLADE 2008
What is the Problem?
• Open distributed systems
  • Tasks submitted to the “system” for execution
  • Workers do the computing, execute a task, return an answer
• The Challenge
  • Computations that are erroneous or late are less useful
  • Failure, errors, hacked, misconfigured
  • Unpredictable time to return answers
• Both local- and wide-area systems
• Focus on volunteer wide-area systems
Shape of the Solution
• Replication
  • works for all sources of unreliability: computation and data
• How to do this intelligently – scalably?
Replication Challenges
• How many replicas?
  • too many – waste of resources
  • too few – application suffers
• Most approaches assume ad-hoc replication
  • under-replicate: task re-execution (increased latency)
  • over-replicate: wasted resources (reduced throughput)
• Using information about the past behavior of a node, we can intelligently size the amount of redundancy
Problems with Ad-hoc Replication
[Figure: task x sent to group A, which contains an unreliable node; task y sent to group B, which contains a reliable node]
System Model
• Reputation rating ri – degree of node reliability
• Dynamically size the redundancy based on ri
• Note: variable-sized groups
• Assume no correlated errors (relaxed later)
[Figure: example nodes with ratings 0.9, 0.8, 0.8, 0.7, 0.7, 0.4, 0.3, 0.4, 0.8, 0.8]
Smart Replication
• Rating based on past interaction with clients
  • prob. (ri) over window t
  • correct/total or timely/total
  • extend to worker group (assuming no collusion) => likelihood of correctness (LOC)
• Smarter Redundancy
  • variable-sized worker groups
  • intuition: higher-reliability clients => smaller groups
Terms
• LOC (Likelihood of Correctness), lg
  • computes the ‘actual’ probability of getting a correct or timely answer from a group g of clients
• Target LOC (ltarget)
  • the success rate that the system tries to ensure while forming client groups
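The group-level LOC can be derived from the individual reputation ratings. A minimal sketch for majority voting, assuming node failures are independent (the slides do not show the exact formulation used, so this is illustrative):

```python
from itertools import product
from math import prod

def group_loc_majority(ratings):
    """Probability that a strict majority of the group returns a correct
    answer, assuming node i is independently correct with probability
    ratings[i] (its reputation rating ri)."""
    n = len(ratings)
    return sum(
        # probability of this exact correct/incorrect outcome pattern
        prod(r if ok else 1 - r for r, ok in zip(ratings, pattern))
        for pattern in product([True, False], repeat=n)  # fine for small groups
        if sum(pattern) > n / 2  # strict majority correct
    )
```

For example, a group of three nodes rated 0.8 yields an LOC of 0.896, higher than any single member's rating, which is the intuition behind sizing groups from ratings.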
Scheduling Metrics
• Guiding metrics
  • throughput r: the number of successfully completed tasks in an interval
  • success rate s: ratio of throughput to number of tasks attempted
Algorithm Space
• How many replicas?
  • algorithms compute how many replicas are needed to meet a success threshold
• How to reach consensus?
  • Majority (better for byzantine threats)
  • M-1 (better for timeliness)
  • M-2 (2 matching answers)
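One plausible way such an algorithm could size a group: greedily take the most reliable available nodes until the group's majority-vote LOC meets ltarget. This sketch assumes independent nodes and majority consensus; the greedy rule and function names are illustrative, not the talk's actual algorithm:

```python
from itertools import product
from math import prod

def majority_loc(ratings):
    """Probability a strict majority of independent nodes is correct."""
    n = len(ratings)
    return sum(
        prod(r if ok else 1 - r for r, ok in zip(ratings, pattern))
        for pattern in product([True, False], repeat=n)
        if sum(pattern) > n / 2
    )

def form_group(ratings, l_target):
    """Pick the smallest odd-sized prefix of the most reliable nodes whose
    majority-vote LOC meets l_target (odd sizes avoid voting ties).
    Falls back to all nodes if the target is unreachable."""
    pool = sorted(ratings, reverse=True)
    for k in range(1, len(pool) + 1, 2):
        if majority_loc(pool[:k]) >= l_target:
            return pool[:k]
    return pool  # target unreachable; use everyone
```

With nodes rated 0.7 and ltarget = 0.75, a single node is not enough but three are (LOC 0.784), so the group size adapts to the ratings rather than being fixed.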
Evaluation
• Baselines
  • Fixed algorithm: statically sized, equal groups; uses no reliability information
  • Random algorithm: forms groups by randomly assigning nodes until ltarget is reached
• Simulated a wide variety of node reliability distributions
Experimental Results: Correctness
[Figure: simulation with byzantine behavior only, majority voting]
Role of ltarget
• Key parameter
  • hard to specify
• Too large: groups will be too large (low throughput)
• Too small: groups will be too small (low success rate)
• Instead, adaptively learn it
  • bias toward r or s, or both
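An adaptive controller for ltarget might nudge the target up when too many groups fail and down when oversized groups throttle throughput. The update rule, parameter names, and bounds below are all assumptions for illustration; the talk does not specify its learning rule:

```python
def adapt_l_target(l_target, success_rate, throughput, capacity,
                   step=0.01, s_goal=0.95):
    """One illustrative adaptation step (not the talk's actual rule):
    raise the target when the success rate is below its goal (groups too
    small or unreliable); lower it when throughput lags capacity (groups
    larger than needed)."""
    if success_rate < s_goal:
        l_target = min(0.999, l_target + step)
    elif throughput < capacity:
        l_target = max(0.5, l_target - step)
    return l_target
```

Biasing toward s corresponds to a high s_goal; biasing toward r corresponds to reacting more aggressively to the throughput branch.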
What About Time?
• Timeliness
  • a result arriving after time T is less (or not) useful
• (1) soft deadlines
  • user interacting, visualization output from computation
• (2) hard deadlines
  • need to get X results done before the HPDC/NSDI/… deadline
• Live experimentation on PlanetLab
• Real application: BLAST
Some PlanetLab Data
[Figures: computation times, both across and within nodes; temporal variability; communication times, both across and within nodes]
PL Environment
• RIDGE is our live system that implements reputation
• 120 wide-area nodes, fully correct, M-1 consensus
• 3 timeliness environments based on deadlines: D=120s, D=180s, D=240s
Experimental Results: Timeliness
[Figure: best BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE]
Collusion
• Suppose errors are correlated?
• How?
  • Widespread bug (hardware or software)
  • Misconfiguration
  • Virus
  • Sybil attack
  • Malicious group
• With Emmanuel Jeannot (Inria)
Key Ideas
• Execute a task => answer groups
  • A1, A2, … Ak
  • for each Ai there are associated workers Wi1, Wi2, … Win
  • Pcollusion(workers in Ai)
• Learn probability of correlated errors
  • Pcollusion(W1, W2)
• Estimate probability of group correlated errors
  • Pcollusion(G), G = [W1, W2, W3, …] via f{Pcollusion(Wi, Wj)} for all i, j
• Rank and select answer
  • based on Pcollusion(G) and |G|
• Update matrix: Pcollusion(W1, W2)
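The aggregation function f that lifts pairwise collusion probabilities to a whole answer group is left open in the slides. One plausible choice, sketched here, is the maximum over all pairs in the group (a pessimistic estimate); the function name and the max rule are assumptions:

```python
from itertools import combinations

def group_collusion_prob(group, p_collusion):
    """Estimate the probability that an answer group's workers colluded,
    taken as the max over the learned pairwise collusion probabilities.
    p_collusion maps frozenset({wi, wj}) -> learned Pcollusion(wi, wj);
    unseen pairs default to 0."""
    if len(group) < 2:
        return 0.0  # a single worker cannot collude with itself
    return max(p_collusion.get(frozenset(pair), 0.0)
               for pair in combinations(group, 2))
```

Ranking answers by a combination of this estimate and group size |G| then favors large groups with no history of correlated errors.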
Bootstrap Problem
• Building the collusion matrix
  • must first “bait” colluders
  • over-replicate such that the majority group is still correct, to expose colluders
• a: probability of worker collusion
• e: probability colluders fool the system
• Given a, e => group size k
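The a, e => k step can be sketched as a binomial tail bound: pick the smallest (odd) group size for which colluders, each colluding independently with probability a, form a majority with probability at most e. The independence assumption and this exact bound are illustrative; the talk's precise model may differ:

```python
from math import comb

def bait_group_size(a, eps, k_max=101):
    """Smallest odd k such that the probability of colluders forming a
    strict majority of a size-k group is at most eps, assuming each
    worker colludes independently with probability a."""
    for k in range(1, k_max + 1, 2):
        # P(number of colluders > k/2) under Binomial(k, a)
        p_majority = sum(comb(k, j) * a**j * (1 - a)**(k - j)
                         for j in range(k // 2 + 1, k + 1))
        if p_majority <= eps:
            return k
    return None  # no k up to k_max meets the bound
```

For a = 0.1 and e = 0.05, groups of 3 already suffice; as a grows toward 0.5 the required over-replication grows quickly, which is why baiting colluders is expensive.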
correctness 4: 1 group 30% colluders, always collude 5. Same group – colludes 30% of the time 7. 2 groups (40%, 30% colluders)
Summary
• Reliable scalable computing
  • correctness and timeliness
• Future work
  • combined models and metrics
  • workflows: coupling data and computation reliability

Visit ridge.cs.umn.edu to learn more