Sabotage-Tolerance Mechanisms for Volunteer Computing Systems
Luis F. G. Sarmenta
Ateneo de Manila University, Philippines (formerly MIT LCS)
Project Bayanihan MIT LCS and Ateneo de Manila University
Volunteer Computing
• Idea: Make it very easy for even non-expert users to join a NOW (network of workstations) by themselves
  • Minimal setup requirements → maximum participation
  • Very large NOWs very quickly!
  • just invite people
  • SETI@home, distributed.net, others
• The Dream: Electronic Bayanihan
  • achieving the impossible through cooperation
[Image: “Bayanihan” mural by Carlos “Botong” Francisco, commissioned by Unilab, Philippines. Used with permission.]
The Problem
• Allowing anyone to join means possibility of malicious attacks
• Sabotage
  • bad data from malicious volunteers
• Traditional Approach
  • Encryption works against spoofing by outsiders
    • but not against registered volunteers
  • Checksums guard against random faults
    • but not against saboteurs who disassemble code
• Another Approach: Obfuscation
  • prevent saboteurs from disassembling code
  • periodically reobfuscate to avoid disassembly
  • Promising. But what if we can’t do it, or it doesn’t work?
Voting and Spot-checking
• Assume worst case
  • we can’t trust workers, so we need to double-check
• Voting
  • Everything must be done at least m times
  • Majority wins (like elections)
  • e.g., triple modular redundancy (TMR), N-modular redundancy (NMR), etc.
  • Problem: not so efficient.
• Spot-checking
  • Don’t check all the time. Only sometimes.
  • But if you’re caught, you’re “dead”
    • Backtrack: all results of a caught saboteur are invalidated
    • Blacklist: the saboteur’s future results are ignored
  • Scare people into compliance (like Customs at the airport!)
  • More efficient (?)
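The spot-checking bullets above can be sketched as a toy master process. This is only an illustration under stated assumptions: the class name `SpotCheckingMaster`, its methods, and the `known_answer` parameter are hypothetical and not from the slides, q is taken to be the probability that a given work is a spot-check, and re-queueing of invalidated work is left out.

```python
import random


class SpotCheckingMaster:
    """Toy master that spot-checks workers, then backtracks and blacklists."""

    def __init__(self, q, seed=None):
        self.q = q                      # assumed: probability a work is a spot-check
        self.rng = random.Random(seed)
        self.accepted = {}              # worker_id -> {work_id: result}
        self.blacklist = set()

    def next_is_spot_check(self):
        # With probability q, hand out a work whose answer is already known.
        return self.rng.random() < self.q

    def receive(self, worker_id, work_id, result, known_answer=None):
        """Record a result; known_answer is given only for spot-check works."""
        if worker_id in self.blacklist:
            return False                # blacklist: ignore anything further
        if known_answer is not None:
            if result != known_answer:
                self.blacklist.add(worker_id)
                self.accepted.pop(worker_id, None)  # backtrack earlier results
                return False
            return True                 # spot-check passed, nothing to store
        self.accepted.setdefault(worker_id, {})[work_id] = result
        return True


master = SpotCheckingMaster(q=0.1, seed=1)
master.receive("w1", 0, "A")                      # ordinary work, accepted
master.receive("w1", 1, "bad", known_answer="B")  # failed spot-check
print(sorted(master.blacklist), master.accepted)
```

A real master would also put the invalidated works back into the pool so that another worker redoes them.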
Theoretical Analysis: Assumptions
• Master-worker model
  • eager scheduling lets us redo and undo work
  • several batches, each batch has N works
  • P workers, a fraction f of which are saboteurs
  • saboteurs run at the same speed as other workers, so work is roughly evenly distributed
  • no spare workers, so higher redundancy (# of works given out) means worse slowdown (time)
• Non-zero acceptable error rate, err_acc
  • error rate (err) =
    • average fraction of bad final results in a batch
    • probability of error of an individual final result
  • relatively high for naturally fault-tolerant apps
    • e.g., image rendering, genetic algorithms, etc.
  • correspondingly small for most other apps
    • e.g., to guarantee a 1% failure rate for 1000 works, err_acc = 1% / 1000 = 1e-5
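The arithmetic in the last bullet can be checked with a short sketch. The function names are assumptions made for illustration. The first function is the slide's division by the batch size; the second is an added cross-check that assumes errors in individual results are independent, which the slide does not state.

```python
def acceptable_error_rate(batch_failure_prob, n_works):
    """Slide's approximation: spread the allowed batch failure rate over all works."""
    return batch_failure_prob / n_works


def acceptable_error_rate_independent(batch_failure_prob, n_works):
    """Solve 1 - (1 - err)^N = p for err, assuming independent errors (assumption)."""
    return 1.0 - (1.0 - batch_failure_prob) ** (1.0 / n_works)


approx = acceptable_error_rate(0.01, 1000)
exact = acceptable_error_rate_independent(0.01, 1000)
print(f"approx = {approx:.3e}, independent-error model = {exact:.3e}")
```

For small failure probabilities the two values nearly coincide, which is why the simple division is enough for sizing err_acc.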
Theoretical Analysis: Assumptions
• Assume saboteurs are Bernoulli processes with independent, identical, constant sabotage rate, s
  • implies that saboteurs do not agree on when to give bad answers (unless they always give them)
  • simplifying assumption, may not be realistic
  • but may be OK if we assume saboteurs receive works at different times and cannot distinguish them
• Assume saboteurs’ bad answers agree
  • allows them to vote (if they happen to give bad answers at the same time)
  • pessimistic assumption
    • we can use crypto and checksums to make it hard to generate agreeing answers
  • implies that there are only 2 kinds of answers: bad and good
Majority Voting
• m-majority voting
  • m out of 2m-1 must match
  • used in hardware, and systems with spare processors
  • redundancy = 2m-1
• m-first voting
  • accept as soon as m match
  • same error rate but faster
  • redundancy = m/(1-f)
• exponential error rate
  • err = (cf)^m
  • where c is between 1.7 and 4
• Good for small f, but bad for large f
• Minimum redundancy & slowdown of 2
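As a rough illustration of the voting error rate, the sketch below computes the probability that bad results win an m-majority vote. It assumes each of the 2m-1 results is bad independently with probability f (saboteurs always sabotage and their bad answers agree); the function names are hypothetical. The slide's (cf)^m is an approximation of this kind of binomial tail.

```python
from math import comb


def majority_voting_error(f, m):
    """P(at least m of 2m-1 results are bad), each bad independently with prob f."""
    n = 2 * m - 1
    return sum(comb(n, j) * f**j * (1 - f) ** (n - j) for j in range(m, n + 1))


def m_first_redundancy(f, m):
    """Expected redundancy of m-first voting, as given on the slide."""
    return m / (1 - f)


for m in (2, 3, 4):
    print(m, majority_voting_error(0.1, m), m_first_redundancy(0.1, m))
```

Each extra unit of m multiplies the error by a roughly constant factor, which is the exponential behaviour the slide refers to, and the factor gets worse as f grows.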
Spot-Checking w/ blacklisting
• Lower redundancy
  • 1/(1-q), where q is the spot-check rate
• Good error rates due to backtracking
  • no error as long as the saboteur is caught by the end of the batch
  • err = s f (1-qs)^n / ((1-f) + f (1-qs)^n)
  • where n is the number of works received in a batch (related to, but a bit more than, N/P)
• Saboteur’s strategy: only give a few bad answers
  • s* = 1/(q(n+1))
• Max error, err* < (f/(1-f)) (1/(qne))
• Linear error reduction according to n
  • larger batches, better error rates
[Figure: simulator results. Note that it works even if f > 0.5.]
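The formulas on this slide can be evaluated directly. The sketch below uses hypothetical function names, caps s* at 1 because a sabotage rate cannot exceed 1 (an added guard, not from the slide), and scans sabotage rates to show that the error stays below the stated bound.

```python
import math


def err_spotcheck_blacklist(f, q, n, s):
    """err = s f (1-qs)^n / ((1-f) + f (1-qs)^n), as on the slide."""
    survive = (1 - q * s) ** n          # chance a saboteur passes all n works
    return s * f * survive / ((1 - f) + f * survive)


def worst_case_sabotage_rate(q, n):
    """s* = 1/(q(n+1)), capped at 1 (the cap is an added guard)."""
    return min(1.0, 1.0 / (q * (n + 1)))


def max_error_bound(f, q, n):
    """err* < (f/(1-f)) * 1/(q n e)."""
    return (f / (1 - f)) / (q * n * math.e)


f, q, n = 0.2, 0.1, 50
s_star = worst_case_sabotage_rate(q, n)
print(s_star, err_spotcheck_blacklist(f, q, n, s_star), max_error_bound(f, q, n))
```

Doubling n roughly halves the bound, which is the linear error reduction mentioned above.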
Spot-Checking w/o blacklisting
• What if saboteurs can come back under a new identity?
• Saboteur’s strategy: stay for L turns only
• Max error
  • err* < f / (qLe), if L << n
  • err* < f / (qL), as L -> n
  • err* < f / (qL) in all cases
• Linear error reduction according to L, not n
  • larger batches don’t guarantee better error rates anymore
  • L = 1 gives worst errors, err = f(1-q)
• Try to force larger L’s
  • make forging a new ID difficult; impose sign-on delays
  • batch-limited blacklisting
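A small numeric sketch of these bounds, with hypothetical function names, shows why the batch size n no longer helps: the values depend only on f, q, and the length of stay L.

```python
import math


def bound_short_stay(f, q, stay_length):
    """err* < f / (q L e), valid when L << n."""
    return f / (q * stay_length * math.e)


def bound_any_stay(f, q, stay_length):
    """err* < f / (q L), valid in all cases."""
    return f / (q * stay_length)


def err_single_turn(f, q):
    """L = 1: a saboteur gives one bad answer and leaves; err = f (1 - q)."""
    return f * (1 - q)


f, q = 0.2, 0.1
print("L = 1 error:", err_single_turn(f, q))
for stay in (10, 100, 1000):
    print(stay, bound_short_stay(f, q, stay), bound_any_stay(f, q, stay))
```

With these assumed values, forcing saboteurs to stay ten times longer lowers the bound tenfold, which is the motivation for sign-on delays and batch-limited blacklisting.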
Voting and Spot-Checking
• Simply running them together works!
• With blacklisting, we can use the spot-checking err rate in place of f
  • exponentially reduce the linearly-reduced error rate
  • (qne(1-f))^m improvement
  • big difference! (esp. for large f)
• Unfortunately, doesn’t work as well w/o blacklisting
  • bad err rate to begin with
  • substituting err for f doesn’t work
  • The problem is saboteurs who come back near the end of a batch
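The improvement factor follows from substituting the spot-checking bound f / ((1-f) q n e) for f in the voting error (cf)^m. The sketch below checks that substitution numerically; the function names are hypothetical and c = 2.0 is an assumed value inside the 1.7 to 4 range quoted for voting.

```python
import math


def spotcheck_error_bound(f, q, n):
    """Bound on err with spot-checking and blacklisting: (f/(1-f)) / (q n e)."""
    return (f / (1 - f)) / (q * n * math.e)


def voting_error(f, m, c=2.0):
    """Approximate voting error (c f)^m; c = 2.0 is an assumed constant."""
    return (c * f) ** m


def improvement_factor(f, q, n, m):
    """(q n e (1-f))^m, the improvement quoted on the slide."""
    return (q * n * math.e * (1 - f)) ** m


f, q, n, m = 0.2, 0.1, 50, 2
plain = voting_error(f, m)
combined = voting_error(spotcheck_error_bound(f, q, n), m)
print(plain / combined, improvement_factor(f, q, n, m))
```

The two printed numbers agree, and the constant c cancels out of the ratio, so the factor does not depend on the assumed value of c.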
Credibility-Based Fault-Tolerance
[Figure: CredWorkPool, a work pool with a nextUnDoneWork pointer, shown for θ = 0.999 and an assumed f = 0.2. Work entries Work0 to Work999 each carry a credibility CrW; each result group carries CrG; each result records res, pid, and CrR; workers P1, P2, P6, P7, P8, and P9 each carry CrP and a spot-check count k.]
• Problem: errors come from saboteurs who have not yet been spot-checked enough
• Idea: give workers credibility depending on the number of spot-checks passed, k
• General Idea: attach credibility values to objects in the system
  • Credibility of X, Cr(X) = probability that X is, or will give, a good result
Credibility-Based Fault-Tolerance
• 4 types of credibility (in this implementation)
  • worker, result, result group, work entry
• Credibility Threshold Principle: if we accept a final result only when the conditional probability of it being correct is at least θ, then the overall average err rate will be at most (1-θ)
  • Wait until credibility is high enough
Computing Credibility
• Worker, CrP(P)
  • dubiosity (1-Cr) decreases linearly with # of spot-checks passed, k
  • CrP(P) = 1 - f, without spot-checking
  • CrP(P) = 1 - f / (ke(1-f)), with spot-checking and blacklisting
  • CrP(P) = 1 - f / k, with spot-checking without blacklisting
• Result, CrR(R)
  • taken from CrP(R.solver)
• Result Group, CrG(G)
  • generally increases as the # of matching good-credibility results increases
  • conditional probability given other groups, and CrR of results
  • CrG(Ga) = P(Ga good) P(all others bad) / P(getting the groups we got)
  • e.g., if CrR(R) = 1 - f for all R, and there are only 2 groups with m1 and m2 results: CrG(G1) = (1-f)^m1 f^m2 / ((1-f)^m1 f^m2 + f^m1 (1-f)^m2 + f^m1 f^m2)
• Work Entry, CrW(W)
  • CrG(G) of best group
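The credibility rules above can be sketched in a few functions. All names are hypothetical. Two details are assumptions rather than slide content: a worker with k = 0 falls back to 1 - f, and worker credibility is never allowed to drop below 1 - f. The group formula treats results as independent, with P(group good) taken as the product of its CrR values and P(group bad) as the product of (1 - CrR).

```python
import math
from math import prod


def worker_credibility(f, k, spot_checking=True, blacklisting=True):
    """CrP from the number of spot-checks passed, k."""
    if not spot_checking or k == 0:
        return 1 - f
    dubiosity = f / (k * math.e * (1 - f)) if blacklisting else f / k
    return max(1 - f, 1 - dubiosity)   # floor at 1 - f is an added guard


def group_credibilities(groups):
    """groups: one list of CrR values per group of matching results."""
    p_good = [prod(g) for g in groups]
    p_bad = [prod(1 - c for c in g) for g in groups]
    # P(group a good and every other group bad), for each a.
    joint = [
        p_good[a] * prod(p_bad[i] for i in range(len(groups)) if i != a)
        for a in range(len(groups))
    ]
    total = sum(joint) + prod(p_bad)    # P(getting the groups we got)
    return [j / total for j in joint]


def work_credibility(groups):
    """CrW: credibility of the best group."""
    return max(group_credibilities(groups))


def is_done(groups, theta):
    """Credibility threshold principle: accept only once CrW >= theta."""
    return work_credibility(groups) >= theta


print(worker_credibility(0.2, 6, blacklisting=False))
print(group_credibilities([[0.8, 0.8], [0.8]]))
print(is_done([[0.8]], 0.999), is_done([[0.999, 0.998]], 0.999))
```

A work entry whose best group is still below θ simply stays in the pool and collects more results until the threshold is reached.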
Results: Credibility w/ blacklisting
• N = 10000, P = 200, f = 0.2, 0.1, 0.05; q = 0.1; batch-limited blacklisting
• Note that error never goes above threshold
• Trade-off is in slowdown
• Slowdown / err ratio is very good
  • each additional repetition gives > 100x improvement in error rate
Results: Credibility w/o blacklisting
• Error still never goes above threshold
• A bit slower
• immune to short-staying saboteurs
• encourages longer stay
Results: Using Voting to Spot-check
• Normally, the spot-check rate is low because it implies overhead
• We can use cred-based voting to spot-check, since cred-based voting has guaranteed low err
  • if redundancy >= 2, then effectively, q = 1
  • Saboteurs get caught quickly -> low error rates
  • Good workers gain high credibility by passing many spot-checks -> reach threshold faster
• Very good slowdown to err slope
  • about 3 orders of magnitude per extra redundancy
  • good for non-fault-tolerant apps
Slowdown vs. Err
[Plot: slowdown vs. err, with curves for voting only; cred, w/ SC & BL; cred, w/ SC, w/o BL; and cred, using voting for SC, w/o BL. Here cred = credibility-based, SC = spot-checking, BL = blacklisting.]
• At f = 20%, for the same slowdown, cred w/ V-SC (voting used for spot-checking), w/o BL gets 10^5 times better err rate than m-first majority voting!
Variations
• Credibility-based fault-tolerance is highly generalizable
• Credibility Threshold Principle holds in all cases
  • provided that we compute conditional probability correctly
• Changes in assumptions and implementations lead to changes in credibility metrics, e.g.,
  • if we assume saboteurs communicate, then change result group credibility
  • if we have trustable hosts, or untrustable domains, adjust worker cred. accordingly
  • if we can use checksums, encryption, obfuscation, etc., then adjust CrP, CrG, etc.
  • time-varying credibility
  • compute credibility of batches or work pools
Summary of Mechanisms
• Voting
  • error reduction exponential with redundancy
  • but bad for large f
  • minimum redundancy of 2
• Spot-checking with backtracking and blacklisting
  • error reduction linear with work done by each volunteer
  • lower redundancy
  • good for large f
• Voting and Spot-checking
  • exponentially reduce linearly-reduced error rate
• Credibility-based Fault-Tolerance
  • guarantee limit on error by watching conditional prob.
  • automatically combines voting and spot-checking as necessary
  • more efficient than simple voting and spot-checking
  • open to variations
For more information
• Recently finished Ph.D. thesis
  • Volunteer Computing by Luis F. G. Sarmenta, MIT.
  • This, and other papers, available from:
    • http://www.cag.lcs.mit.edu/bayanihan/
• Paper at IC 2001 (w/ PDPTA 2001), Las Vegas, June 25-28
  • more on how we parallelized the simulation
  • details are also in thesis
• Email:
  • lfgs@admu.edu.ph or lfgs@alum.mit.edu