Sabotage-Tolerance Mechanisms for Volunteer Computing Systems


Presentation Transcript


  1. Sabotage-Tolerance Mechanisms for Volunteer Computing Systems
  Luis F. G. Sarmenta
  Ateneo de Manila University, Philippines (formerly MIT LCS)
  Project Bayanihan, MIT LCS and Ateneo de Manila University

  2. Volunteer Computing
  • Idea: Make it very easy for even non-expert users to join a NOW (network of workstations) by themselves
  • Minimal setup requirements → maximum participation
  • Very large NOWs very quickly!
    • just invite people
    • SETI@home, distributed.net, others
  • The Dream: Electronic Bayanihan
    • achieving the impossible through cooperation
  ("Bayanihan" mural by Carlos "Botong" Francisco, commissioned by Unilab, Philippines. Used with permission.)

  3. The Problem
  • Allowing anyone to join means possibility of malicious attacks
  • Sabotage
    • bad data from malicious volunteers
  • Traditional Approach
    • Encryption works against spoofing by outsiders
      • but not against registered volunteers
    • Checksums guard against random faults
      • but not against saboteurs who disassemble code
  • Another Approach: Obfuscation
    • prevent saboteurs from disassembling code
    • periodically reobfuscate to avoid disassembly
    • Promising. But what if we can't do it, or it doesn't work?

  4. Voting and Spot-checking
  • Assume worst case
    • we can't trust workers, so we need to double-check
  • Voting
    • Everything must be done at least m times
    • Majority wins (like elections)
    • e.g., Triple-Modular-Redundancy, NMR, etc.
    • Problem: not so efficient.
  • Spot-checking
    • Don't check all the time. Only sometimes.
    • But if you're caught, you're "dead"
      • Backtrack – all results of a caught saboteur are invalidated
      • Blacklist – the saboteur's results are ignored
    • Scare people into compliance (like Customs at the airport!)
    • More efficient (?)

  5. Theoretical Analysis: Assumptions
  • Master-worker model
    • eager scheduling lets us redo, undo work
  • Several batches, each batch has N work entries
  • P workers, fraction f are saboteurs
    • same-speed saboteurs, so work is roughly evenly distributed
    • no spare workers, so higher redundancy (# of times each work is given out) means worse slowdown (time)
  • Non-zero acceptable error rate, err_acc
    • error rate (err):
      • average fraction of bad final results in a batch
      • equivalently, the probability of error of an individual final result
    • relatively high for naturally fault-tolerant apps
      • e.g., image rendering, genetic algorithms, etc.
    • correspondingly small for most other apps
      • e.g., to guarantee a 1% failure rate for 1000 works, err_acc = 1% / 1000 = 1e-5 (see the worked bound below)
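A worked reading of that last figure (my own restatement, not on the slide): if each of the N final results in a batch has error probability at most err, a union bound over the N results keeps the chance of any bad result in the batch below the target:

```latex
\Pr[\text{some final result in the batch is bad}] \;\le\; N \cdot err
\quad\Rightarrow\quad
err_{acc} = \frac{1\%}{N} = \frac{0.01}{1000} = 10^{-5}.
```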

  6. Theoretical Analysis: Assumptions
  • Assume saboteurs are Bernoulli processes with independent, identical, constant sabotage rate s
    • implies that saboteurs do not agree on when to give bad answers (unless they always give them)
    • simplifying assumption, may not be realistic
    • but may be OK if we assume saboteurs receive works at different times and cannot distinguish them
  • Assume saboteurs' bad answers agree
    • allows them to vote (if they happen to give bad answers at the same time)
    • pessimistic assumption
    • we can use crypto and checksums to make it hard to generate agreeing answers
    • implies that there are only 2 kinds of answers: bad and good

  7. Majority Voting
  • m-majority voting
    • m out of 2m-1 must match
    • used in hardware, and systems with spare processors
    • redundancy = 2m-1
  • m-first voting
    • accept as soon as m match
    • same error rate but faster
    • redundancy = m/(1-f)
  • Exponential error rate
    • err = (cf)^m, where c is between 1.7 and 4
  • Good for small f, but bad for large f
  • Minimum redundancy & slowdown of 2
  (A small sketch of m-first voting follows below.)
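A minimal Python sketch of m-first voting (my illustration, not the Bayanihan code; the function names and the value of c are assumptions):

```python
from collections import Counter

def m_first_vote(results, m):
    """Accept the first answer that m of the submitted results agree on.
    `results` is an iterable of answers for one work entry, in arrival order;
    returns the accepted answer, or None if no answer reaches m matches."""
    counts = Counter()
    for answer in results:
        counts[answer] += 1
        if counts[answer] == m:      # accept as soon as m results match
            return answer
    return None

def voting_error_rate(f, m, c=2.0):
    """Slide's estimate err = (c*f)^m, with c somewhere between 1.7 and 4."""
    return (c * f) ** m

print(m_first_vote(["A", "B", "A"], m=2))   # -> 'A'
print(voting_error_rate(f=0.2, m=2))        # -> 0.16 with c = 2
```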

  8. Spot-Checking w/ Blacklisting
  • Lower redundancy
    • 1/(1-q)
  • Good error rates due to backtracking
    • no error as long as the saboteur is caught by the end of the batch
    • err = s·f·(1-qs)^n / ((1-f) + f·(1-qs)^n)
    • where n is the number of works received in the batch (related to, but a bit more than, N/P)
  • Saboteurs' strategy: only give a few bad answers
    • s* = 1/(q(n+1))
    • Max error: err* < (f/(1-f)) · 1/(qne)
  • Linear error reduction according to n
    • larger batches, better error rates
  • [Simulator results shown on slide. Note that it works even if f > 0.5]
  (The error-rate formula is sketched in code below.)
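The error-rate expressions above, transcribed into a small Python sketch (parameter names follow the slide; the numeric example is mine):

```python
import math

def spotcheck_err(s, f, q, n):
    """err = s*f*(1-q*s)^n / ((1-f) + f*(1-q*s)^n)
    s: sabotage rate, f: saboteur fraction, q: spot-check rate,
    n: works received per worker in the batch."""
    survive = (1 - q * s) ** n                 # prob. a saboteur is never caught
    return s * f * survive / ((1 - f) + f * survive)

def spotcheck_err_bound(f, q, n):
    """Saboteurs' best case: err* < (f/(1-f)) * 1/(q*n*e)."""
    return (f / (1 - f)) / (q * n * math.e)

f, q, n = 0.2, 0.1, 50
s_star = 1 / (q * (n + 1))                     # sabotage rate that maximizes err
print(spotcheck_err(s_star, f, q, n))          # ~0.017, just under the bound
print(spotcheck_err_bound(f, q, n))            # ~0.018
```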

  9. Spot-Checking w/o Blacklisting
  • What if saboteurs can come back under a new identity?
  • Saboteur's strategy: stay for L turns only
  • Max error
    • err* < f/(qLe), if L << n
    • err* < f/(qL), as L -> n
    • err* < f/(qL) in all cases
  • Linear error reduction according to L, not n
    • larger batches don't guarantee better error rates anymore
  • L = 1 gives worst error, err = f(1-q)
  • Try to force larger L's
    • make forging a new ID difficult; impose sign-on delays
    • batch-limited blacklisting
  (See the short numeric sketch below.)
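A quick numeric illustration (mine, not from the slides) of how the bound now depends on the saboteur's stay L rather than on the batch size n:

```python
def err_bound_no_blacklist(f, q, L):
    """No-blacklisting bound from the slide: err* < f / (q*L)."""
    return f / (q * L)

f, q = 0.2, 0.1
print("L=1 worst case err = f(1-q) =", f * (1 - q))        # 0.18
for L in (10, 100, 1000):
    print("L =", L, "bound =", err_bound_no_blacklist(f, q, L))
```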

  10. Voting and Spot-Checking
  • Simply running them together works!
  • With blacklisting, we can use the spot-checking err rate in place of f
    • → exponentially reduce the linearly-reduced error rate
    • (qne(1-f))^m improvement
    • big difference! (esp. for large f)
  • Unfortunately, doesn't work as well w/o blacklisting
    • bad err rate to begin with
    • substituting err for f doesn't work
    • the problem is saboteurs who come back near the end of a batch
  (A numeric sketch of the combined error rate follows below.)
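A numeric sketch (my own, using the same assumed c = 2 as before) of substituting the spot-checking error rate for f in the voting formula:

```python
import math

def voting_err(f, m, c=2.0):
    return (c * f) ** m                           # err = (c*f)^m

def spotcheck_err_bound(f, q, n):
    return (f / (1 - f)) / (q * n * math.e)       # err* < f / ((1-f)*q*n*e)

f, q, n, m = 0.2, 0.1, 50, 2
f_eff = spotcheck_err_bound(f, q, n)              # effective "f" after spot-checking
print(voting_err(f, m))                           # voting alone: 0.16
print(voting_err(f_eff, m))                       # combined: ~1.4e-3
print((q * n * math.e * (1 - f)) ** m)            # improvement factor (q*n*e*(1-f))^m, ~118
```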

  11. Credibility-Based Fault-Tolerance
  [Figure: example credibility-enhanced work pool (CredWorkPool) with threshold θ = 0.999, assuming f ≤ 0.2; each work entry carries a credibility CrW, each result group a CrG, each result a CrR, and each worker P a credibility CrP that grows with the number of spot-checks passed, k]
  • Problem: errors come from saboteurs who have not yet been spot-checked enough
  • Idea: give workers credibility depending on the number of spot-checks passed, k
  • General idea: attach credibility values to objects in the system
  • Credibility of X, Cr(X) = probability that X is, or will, give a good result

  12. Credibility-Based Fault-Tolerance
  [Figure: same CredWorkPool example as the previous slide]
  • 4 types of credibility (in this implementation)
    • worker, result, result group, work entry
  • Credibility Threshold Principle: if we only accept a final result when the conditional probability of it being correct is at least θ, then the overall average error rate will be at most (1-θ)
  • Wait until credibility is high enough
  (A sketch of this acceptance loop follows below.)
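A minimal sketch of the Credibility Threshold Principle as a scheduling loop (the object model here — group_credibilities(), mark_done(), reassign() — is assumed for illustration, not the actual CredWorkPool API):

```python
THETA = 0.999   # credibility threshold: final error rate is then at most 1 - THETA

def process_work_entry(work_entry, theta=THETA):
    """Accept the work entry's best result group only once its credibility
    reaches theta; otherwise keep assigning the work to more workers."""
    cr_w = max(work_entry.group_credibilities(), default=0.0)   # CrW = CrG of best group
    if cr_w >= theta:
        work_entry.mark_done()    # accept the best group's answer as final
    else:
        work_entry.reassign()     # eager scheduling: give the work out again
```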

  13. Computing Credibility
  • Worker, CrP(P)
    • dubiosity (1 - Cr) decreases linearly with # of spot-checks passed, k
    • CrP(P) = 1 - f, without spot-checking
    • CrP(P) = 1 - f/(ke(1-f)), with spot-checking and blacklisting
    • CrP(P) = 1 - f/k, with spot-checking without blacklisting
  • Result, CrR(R)
    • taken from CrP(R.solver)
  • Result Group, CrG(G)
    • generally increases as # of matching good-credibility results increases
    • conditional probability given other groups, and CrR of results
    • CrG(Ga) = P(Ga good)·P(all others bad) / P(getting the groups we got)
    • e.g., if CrR(R) = 1 - f for all R, and there are only 2 groups:
      CrG = (1-f)^m1 · f^m2 / ((1-f)^m1 · f^m2 + f^m1 · (1-f)^m2 + f^m1 · f^m2)
  • Work Entry, CrW(W)
    • CrG(G) of best group
  (These formulas are sketched in code below.)
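The same formulas as a Python sketch (my transcription; the k = 0 handling and helper names are assumptions):

```python
import math

def cr_worker(f, k, spot_checking=True, blacklisting=True):
    """Worker credibility CrP after passing k spot-checks."""
    if not spot_checking or k == 0:
        return 1 - f                            # CrP = 1 - f with no spot-check evidence
    if blacklisting:
        return 1 - f / (k * math.e * (1 - f))   # CrP = 1 - f/(k*e*(1-f))
    return 1 - f / k                            # CrP = 1 - f/k

def cr_group_two(cr_a, cr_b):
    """CrG of group a when exactly two result groups exist.
    cr_a, cr_b: lists of CrR values of the results in each group."""
    def prod(xs):
        p = 1.0
        for x in xs:
            p *= x
        return p
    p_a_good   = prod(cr_a) * prod(1 - c for c in cr_b)
    p_b_good   = prod(cr_b) * prod(1 - c for c in cr_a)
    p_both_bad = prod(1 - c for c in cr_a) * prod(1 - c for c in cr_b)
    return p_a_good / (p_a_good + p_b_good + p_both_bad)

# With CrR = 1 - f for every result, this reduces to the slide's example:
# CrG = (1-f)^m1 f^m2 / ((1-f)^m1 f^m2 + f^m1 (1-f)^m2 + f^m1 f^m2)
f = 0.2
print(cr_group_two([1 - f] * 3, [1 - f]))   # 3 matching results vs. 1 dissenting result
```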

  14. Results: Credibility w/ Blacklisting
  • N = 10000, P = 200, f = 0.2, 0.1, 0.05; q = 0.1; batch-limited blacklisting
  • Note that error never goes above the threshold
  • Trade-off is in slowdown
  • Slowdown / err ratio is very good
    • each additional repetition gives > 100x improvement in error rate

  15. Results: Credibility w/o Blacklisting
  • Error still never goes above the threshold
  • A bit slower
  • Immune to short-staying saboteurs
  • Encourages longer stays

  16. Results: Using Voting to Spot-Check
  • Normally, the spot-check rate is low because it implies overhead
  • We can use cred-based voting to spot-check, since cred-based voting has guaranteed low err
    • if redundancy >= 2, then effectively q = 1
  • Saboteurs get caught quickly -> low error rates
  • Good workers gain high credibility by passing a lot -> reach threshold faster
  • Very good slowdown-to-err slope
    • about 3 orders of magnitude per extra redundancy
    • good for non-fault-tolerant apps

  17. Slowdown vs. Err
  [Plot: slowdown vs. error rate for four schemes: voting only; credibility w/ spot-checking & blacklisting; credibility w/ spot-checking, w/o blacklisting; credibility using voting for spot-checking, w/o blacklisting]
  • At f = 20%, for the same slowdown, credibility using voting for spot-checking (w/o blacklisting) gets a 10^5 times better err rate than m-first majority voting!

  18. Variations
  • Credibility-based fault-tolerance is highly generalizable
  • The Credibility Threshold Principle holds in all cases
    • provided that we compute the conditional probability correctly
  • Changes in assumptions and implementations lead to changes in credibility metrics, e.g.,
    • if we assume saboteurs communicate, then change the result group credibility
    • if we have trustable hosts, or untrustable domains, adjust worker credibility accordingly
    • if we can use checksums, encryption, obfuscation, etc., then adjust CrP, CrG, etc.
    • time-varying credibility
    • compute credibility of batches or work pools

  19. Summary of Mechanisms
  • Voting
    • error reduction exponential with redundancy
    • but bad for large f
    • minimum redundancy of 2
  • Spot-checking with backtracking and blacklisting
    • error reduction linear with work done by each volunteer
    • lower redundancy
    • good for large f
  • Voting and Spot-checking
    • exponentially reduce the linearly-reduced error rate
  • Credibility-based Fault-Tolerance
    • guarantees a limit on error by watching conditional probabilities
    • automatically combines voting and spot-checking as necessary
    • more efficient than simple voting and spot-checking
    • open to variations

  20. For more information
  • Recently finished Ph.D. thesis
    • Volunteer Computing by Luis F. G. Sarmenta, MIT
  • This, and other papers, available from:
    • http://www.cag.lcs.mit.edu/bayanihan/
  • Paper at IC 2001 (w/ PDPTA 2001), Las Vegas, June 25-28
    • more on how we parallelized the simulation
    • details are also in the thesis
  • Email:
    • lfgs@admu.edu.ph or lfgs@alum.mit.edu
