Sabotage-Tolerance Mechanisms for Volunteer Computing Systems
Luis F. G. Sarmenta
Ateneo de Manila University, Philippines (formerly MIT LCS)
Project Bayanihan, MIT LCS and Ateneo de Manila University
Volunteer Computing
• Idea: make it very easy for even non-expert users to join a NOW (network of workstations) by themselves
• Minimal setup requirements -> maximum participation
• Very large NOWs very quickly! (just invite people)
• e.g., SETI@home, distributed.net, others
• The Dream: Electronic Bayanihan
  • achieving the impossible through cooperation
["Bayanihan" mural by Carlos "Botong" Francisco, commissioned by Unilab, Philippines. Used with permission.]
The Problem
• Allowing anyone to join means the possibility of malicious attacks
• Sabotage: bad data from malicious volunteers
• Traditional approaches
  • encryption works against spoofing by outsiders, but not against registered volunteers
  • checksums guard against random faults, but not against saboteurs who disassemble the code
• Another approach: obfuscation
  • prevent saboteurs from disassembling the code
  • periodically re-obfuscate to stay ahead of disassembly
  • promising, but what if we can't do it, or it doesn't work?
Voting and Spot-checking
• Assume the worst case: we can't trust workers, so we need to double-check
• Voting
  • everything must be done at least m times; the majority wins (like elections)
  • e.g., triple modular redundancy, NMR, etc.
  • problem: not very efficient
• Spot-checking
  • don't check all the time, only sometimes; but if you're caught, you're "dead"
  • backtrack: all results of a caught saboteur are invalidated
  • blacklist: the saboteur's further results are ignored
  • scare people into compliance (like customs at the airport!)
  • more efficient (?)
Theoretical Analysis: Assumptions
• Master-worker model
  • eager scheduling lets us redo and undo work
  • several batches, each of N works
  • P workers, a fraction f of which are saboteurs
  • saboteurs run at the same speed, so work is distributed roughly evenly
  • no spare workers, so higher redundancy (number of copies of each work given out) means worse slowdown (time)
• Non-zero acceptable error rate, err_acc
  • error rate (err) = average fraction of bad final results in a batch = probability of error of an individual final result
  • err_acc can be relatively high for naturally fault-tolerant apps (e.g., image rendering, genetic algorithms)
  • correspondingly small for most other apps
  • e.g., to guarantee a 1% failure rate over 1000 works, err_acc = 1% / 1000 = 1e-5
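The err_acc arithmetic at the end of this slide can be sketched as a one-line helper (the function name is ours, not from the talk):

```python
def acceptable_error_rate(batch_failure_rate, num_works):
    """Per-result acceptable error rate err_acc such that a batch of
    num_works results fails with probability at most
    batch_failure_rate (simple union bound, as on the slide)."""
    return batch_failure_rate / num_works

# Slide example: guaranteeing a 1% failure rate over 1000 works.
err_acc = acceptable_error_rate(0.01, 1000)   # ~1e-5
```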
Theoretical Analysis: Assumptions (cont.)
• Assume saboteurs are Bernoulli processes with independent, identical, constant sabotage rate s
  • implies that saboteurs do not coordinate when to give bad answers (unless they always give them)
  • a simplifying assumption; may not be realistic
  • but may be OK if we assume saboteurs receive works at different times and cannot distinguish them
• Assume saboteurs' bad answers agree
  • allows them to vote together (if they happen to give bad answers at the same time)
  • a pessimistic assumption: we can use crypto and checksums to make agreeing bad answers hard to generate
  • implies there are only 2 kinds of answers: bad and good
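The Bernoulli-saboteur model above can be sketched as a toy simulation (purely illustrative; the names and setup are ours, not from the talk):

```python
import random

def saboteur_answer(s, rng):
    """A Bernoulli saboteur: with constant probability s it returns the
    agreed-upon bad answer, otherwise the good answer. Saboteurs do not
    coordinate *when* they sabotage, but their bad answers match."""
    return "bad" if rng.random() < s else "good"

# Empirically, the long-run fraction of bad answers approaches s.
rng = random.Random(42)
answers = [saboteur_answer(0.3, rng) for _ in range(20000)]
frac_bad = answers.count("bad") / len(answers)
```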
Majority Voting
• m-majority voting
  • m out of 2m-1 results must match
  • used in hardware and in systems with spare processors
  • redundancy = 2m-1
• m-first voting
  • accept as soon as m results match
  • same error rate but faster
  • redundancy = m/(1-f)
• Exponential error rate: err = (cf)^m, where c is between 1.7 and 4
• Good for small f, but bad for large f
• Minimum redundancy and slowdown of 2
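The voting formulas above can be checked with a short sketch (function names are ours; we take the pessimistic c = 4 from the slide's stated range):

```python
def mfirst_error(f, m, c=4.0):
    """Error bound for m-first voting: err = (c*f)**m, with the
    constant c between roughly 1.7 and 4 (worst case c = 4 here)."""
    return (c * f) ** m

def mfirst_redundancy(f, m):
    """Expected redundancy of m-first voting: copies are handed out
    until m good results match, so m/(1-f) on average."""
    return m / (1.0 - f)

def mmajority_redundancy(m):
    """m-out-of-(2m-1) majority voting always uses 2m-1 copies."""
    return 2 * m - 1
```

For f = 0.2 and m = 2 this gives redundancy 2.5 and an error bound of 0.64; the exponential decay in m only helps when c*f < 1, which is why voting alone is bad for large f.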
Spot-Checking with Blacklisting
• Lower redundancy: 1/(1-q), where q is the spot-check rate
• Good error rates due to backtracking
  • no error as long as the saboteur is caught by the end of the batch
  • err = s f (1-qs)^n / ((1-f) + f (1-qs)^n), where n is the number of works received per worker in a batch (related to, but a bit more than, N/P)
• Saboteurs' best strategy: give only a few bad answers, at rate s* = 1/(q(n+1))
• Maximum error: err* < (f/(1-f)) (1/(qne))
• Linear error reduction in n: larger batches give better error rates
• Simulator results: note that it works even if f > 0.5
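The error formula and bound on this slide can be sketched as follows (function names are ours):

```python
import math

def sc_bl_error(f, q, s, n):
    """Error rate of spot-checking with backtracking and blacklisting:
    err = s*f*(1-q*s)**n / ((1-f) + f*(1-q*s)**n), where q is the
    spot-check rate, s the sabotage rate, and n the works per worker."""
    survive = (1.0 - q * s) ** n   # saboteur escapes all n spot-check chances
    return s * f * survive / ((1.0 - f) + f * survive)

def sc_bl_error_bound(f, q, n):
    """Maximum error err* < (f/(1-f)) * 1/(q*n*e), attained near the
    saboteurs' optimal sabotage rate s* = 1/(q*(n+1))."""
    return (f / (1.0 - f)) / (q * n * math.e)
```

Note the bound decreases linearly in n, matching the slide's claim that larger batches give better error rates.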
Spot-Checking without Blacklisting
• What if saboteurs can come back under a new identity?
• Saboteur's strategy: stay for L turns only
• Maximum error
  • err* < f/(qLe), if L << n
  • err* < f/(qL), as L -> n
  • err* < f/(qL) in all cases
• Linear error reduction in L, not n
  • larger batches no longer guarantee better error rates
• L = 1 gives the worst error: err = f(1-q)
• Try to force larger L's
  • make forging a new ID difficult; impose sign-on delays
  • batch-limited blacklisting
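The bounds for the no-blacklisting case can likewise be sketched (function names are ours):

```python
def sc_nobl_error_bound(f, q, L):
    """Without blacklisting, a saboteur who stays only L turns and then
    re-registers keeps the error at err* < f/(q*L) in all cases (the
    tighter f/(q*L*e) applies when L << n)."""
    return f / (q * L)

def sc_nobl_worst_error(f, q):
    """Worst case L = 1: submit one bad result and leave; it escapes
    spot-checking with probability 1-q, giving err = f*(1-q)."""
    return f * (1.0 - q)
```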
Voting and Spot-Checking Combined
• Simply running them together works!
• With blacklisting, we can use the spot-checking error rate in place of f
  • exponentially reduce the linearly-reduced error rate
  • a (qne(1-f))^m improvement: a big difference, especially for large f
• Unfortunately, it doesn't work as well without blacklisting
  • bad error rate to begin with; substituting err for f doesn't work
  • the problem is saboteurs who come back near the end of a batch
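The combined improvement can be sketched by substituting the spot-checking error rate for f in the voting bound (our reading of the slide; function names are ours):

```python
import math

def combined_error(f, q, n, m, c=4.0):
    """Voting on top of spot-checking with blacklisting: substitute the
    spot-checking maximum error for f in the voting bound (c*f)**m."""
    f_eff = (f / (1.0 - f)) / (q * n * math.e)   # spot-checking err*
    return (c * f_eff) ** m

def improvement_factor(f, q, n, m):
    """Improvement over voting alone: (q*n*e*(1-f))**m."""
    return (q * n * math.e * (1.0 - f)) ** m
```

Multiplying the two recovers the plain voting bound (c*f)**m, which is exactly the (qne(1-f))^m improvement claimed above.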
Credibility-Based Fault-Tolerance
[Figure: CredWorkPool example with θ = 0.999, assuming f ≤ 0.2; work entries Work0..Work999 hold result groups, results, and worker IDs P1..P9, each annotated with credibility values CrW, CrG, CrR, CrP and the number of spot-checks passed, k.]
• Problem: errors come from saboteurs who have not yet been spot-checked enough
• Idea: give workers credibility depending on the number of spot-checks passed, k
• General idea: attach credibility values to objects in the system
• Credibility of X, Cr(X) = probability that X is, or will give, a good result
Credibility-Based Fault-Tolerance (cont.)
• 4 types of credibility (in this implementation): worker, result, result group, work entry
• Credibility Threshold Principle: if we only accept a final result when the conditional probability of it being correct is at least θ, then the overall average error rate will be at most 1-θ
• Wait until credibility is high enough before accepting
Computing Credibility
• Worker, CrP(P)
  • dubiosity (1 - Cr) decreases linearly with the number of spot-checks passed, k
  • CrP(P) = 1 - f, without spot-checking
  • CrP(P) = 1 - f/(ke(1-f)), with spot-checking and blacklisting
  • CrP(P) = 1 - f/k, with spot-checking but without blacklisting
• Result, CrR(R)
  • taken from CrP(R.solver)
• Result Group, CrG(G)
  • generally increases as the number of matching good-credibility results increases
  • conditional probability of the group being good, given the other groups and the CrR of the results
  • CrG(Ga) = P(Ga good, all others bad) / P(getting the groups we got)
  • e.g., if CrR(R) = 1-f for all R, and there are only 2 groups: CrG = (1-f)^m1 f^m2 / ((1-f)^m1 f^m2 + f^m1 (1-f)^m2 + f^m1 f^m2)
• Work Entry, CrW(W)
  • CrG(G) of the best result group
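The credibility formulas above can be sketched as follows (function names are ours; the two-group CrG is the slide's worked example):

```python
import math

def worker_credibility(f, k, blacklist=True):
    """CrP(P) after passing k spot-checks. With k = 0 (no spot-checking)
    it is just 1 - f; otherwise dubiosity 1 - Cr shrinks linearly in k."""
    if k == 0:
        return 1.0 - f
    if blacklist:
        return 1.0 - f / (k * math.e * (1.0 - f))
    return 1.0 - f / k

def group_credibility_two_groups(f, m1, m2):
    """CrG of group 1 when there are exactly two result groups of sizes
    m1 and m2 and every result has credibility 1 - f."""
    g1_good = (1 - f) ** m1 * f ** m2    # group 1 good, group 2 bad
    g2_good = f ** m1 * (1 - f) ** m2    # group 2 good, group 1 bad
    both_bad = f ** m1 * f ** m2         # both groups bad
    return g1_good / (g1_good + g2_good + both_bad)
```

Per the Credibility Threshold Principle, the server keeps assigning a work until its best group's CrG reaches θ; accepted results are then wrong with probability at most 1 - θ.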
Results: Credibility with Blacklisting
• N = 10000, P = 200, f = 0.2, 0.1, 0.05; q = 0.1; batch-limited blacklisting
• Note that the error never goes above the threshold
• The trade-off is in slowdown
• The slowdown/err ratio is very good: each additional repetition gives > 100x improvement in error rate
Results: Credibility without Blacklisting
• Error still never goes above the threshold
• A bit slower, but immune to short-staying saboteurs
• Encourages longer stays
Results: Using Voting to Spot-check
• Normally, the spot-check rate is kept low because spot-checking implies overhead
• We can use credibility-based voting itself to spot-check, since it has a guaranteed low error rate
  • if redundancy >= 2, then effectively q = 1
• Saboteurs get caught quickly -> low error rates
• Good workers gain high credibility by passing many checks -> reach the threshold faster
• Very good slowdown-to-err slope: about 3 orders of magnitude per unit of extra redundancy
• Good for non-fault-tolerant apps
Slowdown vs. Err
[Figure: slowdown vs. error rate for four schemes: voting only; credibility with spot-checking and blacklisting; credibility with spot-checking, without blacklisting; credibility using voting for spot-checking, without blacklisting.]
• At f = 20%, for the same slowdown, credibility with voting-based spot-checking (without blacklisting) gets a 10^5 times better error rate than m-first majority voting!
Variations
• Credibility-based fault-tolerance is highly generalizable
• The Credibility Threshold Principle holds in all cases, provided we compute the conditional probabilities correctly
• Changes in assumptions or implementation lead to changes in the credibility metrics, e.g.,
  • if we assume saboteurs communicate, change the result-group credibility
  • if we have trustable hosts, or untrustable domains, adjust the worker credibility accordingly
  • if we can use checksums, encryption, obfuscation, etc., adjust CrP, CrG, etc.
  • time-varying credibility
  • compute the credibility of batches or work pools
Summary of Mechanisms
• Voting
  • error reduction exponential in the redundancy, but bad for large f
  • minimum redundancy of 2
• Spot-checking with backtracking and blacklisting
  • error reduction linear in the work done by each volunteer
  • lower redundancy; good for large f
• Voting and spot-checking combined
  • exponentially reduce the linearly-reduced error rate
• Credibility-based fault-tolerance
  • guarantees a limit on error by tracking conditional probabilities
  • automatically combines voting and spot-checking as necessary
  • more efficient than simple voting or spot-checking
  • open to variations
For More Information
• Recently finished Ph.D. thesis: Volunteer Computing, by Luis F. G. Sarmenta, MIT
• This and other papers are available from: http://www.cag.lcs.mit.edu/bayanihan/
• Paper at IC 2001 (with PDPTA 2001), Las Vegas, June 25-28
  • more on how we parallelized the simulation; details are also in the thesis
• Email: lfgs@admu.edu.ph or lfgs@alum.mit.edu