
How To Tell A Secret Without Revealing It

How To Tell A Secret Without Revealing It. Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Barry Lawson University of Richmond. Outline. Distributed volunteer computing A problem Related work A real-world application


Presentation Transcript


  1. How To Tell A Secret Without Revealing It Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Barry Lawson University of Richmond

  2. Outline • Distributed volunteer computing • A problem • Related work • A real-world application • Our enhanced privacy approach • Results & analysis • Conclusions, future & ongoing work

  3. Our Scenario • You have a very large, compute-intensive project • Your PC (a few GFLOPS) • NOW (GFLOPS) • Supercomputer (< 280.6 TFLOPS), ~$128M

  4. What To Do? • “Welcome to the Internet, my friend. How can I help you?”

  5. Distributed Volunteer Computing (DVC) • Supervisor (Alice) • Participants (Bob) (< 200 TFLOPS)

  6. DVC Computation • Large-scale distributed computation: • compute-intensive • easily parallelizable • Supervisor: • divides computation into independent tasks • ships tasks to participants • collects significant results • Participants: • download and execute tasks (when otherwise idle) • return significant results

  7. Real-world Examples

  8. DVC Group @ Richmond • Faculty: • Doug Szajda (CS) • Barry Lawson (CS) • Jason Owen (Statistics) • Students: • Current: Mike Pohl, Greg Steffensen, Andy White • Past: Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C. • Four more this summer: Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja

  9. Telling a Secret Without Revealing It • A problem: • Participants are untrustworthy • Code executes outside supervisor’s control • Computation data may be proprietary • Goal: • Participants provide meaningful results • Supervisor does not divulge data

  10. Related Work (not exhaustive) • Computing with encrypted data: • Alice has x and wants Bob to compute f(x), but does not want to divulge x • Alice gives Bob x’ and f’( ) • Bob computes f’(x’) • The key: • Alice can determine f(x) from f’(x’) • Bob cannot determine x from x’ and/or f’(x’) • Difficult (often impossible) in practice

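  11. Less Formally • [Diagram: “Alice” turns x and f into x’ and f’ and sends them to “Bob”; “Bob” computes f’(x’) without learning x (“?”) and returns it; “Alice” recovers f(x) from f’(x’)]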

  12. Flexibility In Our Context • The computation: • Alice (supervisor) has many x’s • Bob (participant) determines x’s that are significant • Alice doesn’t need the value f(x) • Alice will post-process • A few false positives are OK • Sufficient accuracy: flexibility in f’( )

  13. The Adversary • Assumed to be intelligent • can decompile, analyze, modify code • understands task algorithms • understands enhanced privacy scheme(s) • Motivation • may not be obvious: business competitor? • may not care if leak is detected

  14. General Model • [Diagram: domain D partitioned into subsets D1, D2, D3] • The Computation: • evaluate an algorithm f : D → R for all x in D • Task T( ): • partition D into subsets Di • T(Di) evaluates f(xi) for all xi in Di • Filter function G( ): • determines “significance” • returns indices of significant xi
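The general model on this slide can be sketched in a few lines of Python. The helper names `partition` and `run_task`, and the toy `f` and `G`, are illustrative assumptions and do not come from the talk.

```python
from typing import Callable, List, Sequence


def partition(D: Sequence, size: int) -> List[Sequence]:
    """Supervisor side: split the domain D into subsets D_i of at most `size` items."""
    return [D[k:k + size] for k in range(0, len(D), size)]


def run_task(D_i: Sequence, f: Callable, G: Callable) -> List[int]:
    """Participant side: T(D_i) evaluates f on every x_i in D_i and returns
    the indices (within D_i) that the filter G marks as significant."""
    return [idx for idx, x in enumerate(D_i) if G(f(x))]


# Toy stand-ins (assumptions): f squares its input, G keeps values above 50.
f = lambda x: x * x
G = lambda y: y > 50

subsets = partition(list(range(12)), size=4)          # D_1, D_2, D_3
results = [run_task(D_i, f, G) for D_i in subsets]    # one index list per task
```

Only the index lists travel back to the supervisor, which matches the slide's point that Alice needs to know which inputs are significant rather than the values f(x) themselves.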

  15. Our General Approach • Transform Di, f, G into Di’, f’, G’ • Replace task T(Di) with T(Di’) • Desirable properties: • T(Di) does not leak additional info about values in Di • significance in T(Di) corresponds to significance in T(Di’) • any difference is reasonably small

  16. In Reality… • Providing desired properties is difficult • even with increased flexibility • impossible for some apps • When possible, application-specific • Bottom line: we have a potential approach • where few, if any, others exist

  17. Application: Genome Sequence Comparison • Compare sequences over genome alphabet Σ = {A,C,G,T} • Track evolutionary changes by aligning columns of sequences (an alignment) • E.g.: CTGTTA aligned with CAGTTA

  18. Sequence Evolution • Deletion: CTGTTA → CTG–TA • Insertion: C–TGTTA → CGTGTTA • Substitution: CTGTTA → CAGTTA • Deletions and insertions together are called indels

  19. Sequence Evolution • After several “generations”: C–TGT––TA–– aligned with CTA–TGCTACG • Note: # of alignments is huge (for realistic-length sequences)

  20. Alignment Types • Global alignment • considers entire sequence • Local alignment • considers substrings • biologists usually use local

  21. Measuring Alignments • Scoring function: +1 if symbols match, -1 if not • Gap penalty: • g(k) = a + b(k-1) • k is gap length (# consecutive dashes in a single sequence) • Alignment score: • sum of column scores minus gap penalties

  22. A Simple Example • Global alignment: C–TGT––TA–– aligned with CTA–TGCTACG • Scoring function: +1 match, -1 no match • Gap penalty: g(k) = 2 + 1(k-1) • Column scores and gap penalties, left to right: +1 -2 -1 -2 +1 -3 +1 +1 -3 • Alignment score: +4 - 11 = -7
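The arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative assumptions, and an ASCII `-` stands in for the dash used on the slide.

```python
def gap_penalty(k: int, a: int = 2, b: int = 1) -> int:
    """Affine gap penalty g(k) = a + b(k - 1) for a gap of length k."""
    return a + b * (k - 1)


def alignment_score(top: str, bottom: str, match: int = 1, mismatch: int = -1,
                    a: int = 2, b: int = 1, gap: str = "-") -> int:
    """Sum of column scores minus one gap penalty per run of consecutive
    gap symbols within a single row of the alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    run_row, run_len = None, 0      # row holding the current gap run, and its length
    for t, u in zip(top, bottom):
        row = 0 if t == gap else 1 if u == gap else None
        if row is not None and row == run_row:
            run_len += 1            # extend the current gap
            continue
        if run_row is not None:
            score -= gap_penalty(run_len, a, b)   # close the previous gap
        if row is None:
            run_row, run_len = None, 0
            score += match if t == u else mismatch
        else:
            run_row, run_len = row, 1             # open a new gap
    if run_row is not None:
        score -= gap_penalty(run_len, a, b)       # gap that reaches the end
    return score


score = alignment_score("C-TGT--TA--", "CTA-TGCTACG")   # the slide's alignment
```

With the slide's parameters this gives four matches (+4), one mismatch (-1), and gap penalties of 2, 2, 3, and 3 (-10), for a total of -7.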

  23. Smith-Waterman • Dynamic programming algorithm • Produces an optimal alignment • Global: O(n^2) • Local: O(n^3) • Implemented on commercial DVC platforms

  24. Significance in S-W • Significance of scores based on probability • Empirical evidence: • given randomly-generated sequences • scores exhibit extreme value distribution

  25. Determining Significance • Choosing a significance threshold p: • want small probability that a random score > p • typically, probability < 0.003
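As a rough sketch of how such a threshold could be chosen: the slides say only that scores follow an extreme value distribution, so the Gumbel form used here, and the parameter values `mu` and `beta`, are assumptions for illustration (in practice they would be fitted to scores of randomly generated sequences).

```python
import math


def tail_probability(p: float, mu: float, beta: float) -> float:
    """P(score > p) under a Gumbel distribution with location mu and scale beta."""
    return 1.0 - math.exp(-math.exp(-(p - mu) / beta))


def threshold(alpha: float, mu: float, beta: float) -> float:
    """Smallest p with P(score > p) = alpha, from inverting the Gumbel CDF."""
    return mu - beta * math.log(-math.log(1.0 - alpha))


mu, beta = 20.0, 3.0                 # hypothetical fitted parameters
p = threshold(0.003, mu, beta)       # threshold for a 0.003 tail probability
```

Any score above `p` would then be reported as significant, with roughly a 0.3% chance that a random pair exceeds it under the assumed model.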

  26. A Smith-Waterman Task • Pairwise comparison of two sets of sequences, A and B • A: proprietary sequences • B: sequences from public database • Returned: indices of well-matched pairs • Notation: T(A,B,s,g,p), with scoring function s, gap penalty g, and significance threshold p

  27. Our Transformation • Use offset sequences: • compare relative distances between occurrences of a specific nucleotide • X: GCACTTACGCCCTTACGACG • F(X,A) = {3,4,8,3} • F(X,C) = {2,2,4,2,1,1,4,3} • F(X,G) = {1,8,8,3} • F(X,T) = {5,1,7,1}
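The offset transformation F can be reproduced with a small Python sketch. The function name `offsets` is an assumption; the rule it implements (first entry is the 1-based position of the first occurrence, later entries are distances to the previous occurrence) is inferred from the values on the slide.

```python
from typing import List


def offsets(seq: str, symbol: str) -> List[int]:
    """F(seq, symbol): distance from each occurrence of `symbol` to the
    previous occurrence (the first entry is its 1-based position)."""
    result, last = [], 0
    for pos, ch in enumerate(seq, start=1):
        if ch == symbol:
            result.append(pos - last)
            last = pos
    return result


X = "GCACTTACGCCCTTACGACG"
F = {s: offsets(X, s) for s in "ACGT"}   # one offset sequence per nucleotide
```

For the slide's sequence X this yields the four offset sequences listed above.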

  28. Modified Tasks • X: GCACTTACGCCCTTACGACG F(X,C) = {2,2,4,2,1,1,4,3} • Y: GCACTCGCCACTTAGCACG F(Y,C) = {2,2,2,2,1,2,5,2} • Apply S-W to F(X,C) and F(Y,C) • Scoring function, gap penalty • “Goodness” threshold
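A minimal sketch of the modified task: a local-alignment score computed over the two offset sequences rather than the nucleotides. The talk does not say which Smith-Waterman variant was used, so the quadratic-time affine-gap recurrence below (the formulation usually attributed to Gotoh) is an assumption, as are the function name and the default parameters, which are borrowed from the earlier example slide.

```python
from typing import Sequence


def local_alignment_score(x: Sequence, y: Sequence, match: int = 1,
                          mismatch: int = -1, a: int = 2, b: int = 1) -> int:
    """Best local alignment score with gap penalty g(k) = a + b(k - 1)."""
    neg = float("-inf")
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]     # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]   # ... ending in a gap in x
    F = [[neg] * (m + 1) for _ in range(n + 1)]   # ... ending in a gap in y
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - b, H[i][j - 1] - a)   # extend or open
            F[i][j] = max(F[i - 1][j] - b, H[i - 1][j] - a)
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


FXC = [2, 2, 4, 2, 1, 1, 4, 3]   # F(X,C) from the slide
FYC = [2, 2, 2, 2, 1, 2, 5, 2]   # F(Y,C) from the slide
score = local_alignment_score(FXC, FYC)
```

The participant would compare `score` against the task's “goodness” threshold and report only whether the pair is significant.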

  29. Intuition • Similar sequences yield similar offsets • consider effects of indels, substitutions • What about false positives? • use multiple nucleotides • e.g., assign A & C tasks to distinct participants • good match if both tasks indicate significance

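  30. [Diagram: “Alice” holds CAGGATCTCAAGC and CAGCATATCACGT. “Bob 1” receives only the A offsets, 2351 and 2323; “Bob 2” receives only the C offsets, 1624 and 1352. The original sequences remain unknown (“?”) to each Bob.]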

  31. Using Multiple Nucleotides • Maximum method • one task for each of A,C,G,T • result significant if any of the four indicates significance • Adding method • one task for each of A,C,G,T • result significant if sum of four scores indicates significance • Costs reduced in either case • on average, an offset sequence is 1/4 the length of the original sequence • runtime for an offset sequence ~1/64
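The two combination rules can be sketched as follows. The scores and thresholds are made-up values for illustration, and the function names are assumptions. The ~1/64 runtime figure presumably follows from the cubic cost quoted earlier for local alignment: a sequence 1/4 as long costs about (1/4)^3 = 1/64 as much.

```python
from typing import Dict


def significant_max(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Maximum method: significant if any single-nucleotide task exceeds its threshold."""
    return any(scores[s] > thresholds[s] for s in scores)


def significant_sum(scores: Dict[str, float], sum_threshold: float) -> bool:
    """Adding method: significant if the sum of the four scores exceeds a threshold."""
    return sum(scores.values()) > sum_threshold


# Hypothetical per-nucleotide scores returned by four participants.
scores = {"A": 12, "C": 31, "G": 9, "T": 14}
per_task_threshold = {s: 25 for s in scores}   # hypothetical thresholds

by_max = significant_max(scores, per_task_threshold)
by_sum = significant_sum(scores, sum_threshold=70)
```

In this toy case the maximum method flags the pair (the C task alone exceeds its threshold) while the adding method does not, which shows that the two rules can disagree on the same inputs.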

  32. Does This Provide Real Data Privacy? • Recall desired properties: • T(Di) does not leak additional info about values in Di • significance in T(Di) corresponds to significance in T(Di’) • any difference is reasonably small

  33. Data Privacy? • Property 1 fails: • T(Di) does leak additional info about values in Di • adversary knows all info about one nucleotide • How much info is leaked? • conditional entropy gives rough estimate • e.g., N = 600, count of the known nucleotide = N/4 • 487 bits (of 1200) leaked • 713 bits of uncertainty remain
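The figures on this slide can be reproduced with a back-of-the-envelope calculation. It assumes, for illustration only, that symbols are uniform and independent: each of the N positions then carries 2 bits, and once the positions of one nucleotide are known, each remaining position is one of three equally likely symbols.

```python
import math

N = 600                          # sequence length from the slide
known = N // 4                   # positions occupied by the revealed nucleotide
total_bits = 2 * N               # log2(4) bits per position

# Each of the other positions now holds one of 3 remaining symbols.
remaining_bits = (N - known) * math.log2(3)
leaked_bits = total_bits - remaining_bits
```

Under these assumptions `remaining_bits` is about 713 and `leaked_bits` about 487, matching the slide.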

  34. Analysis • Clearly, not provable security • Suggests two questions: • Can adversary determine additional symbols; if so, how many? • How much info leakage is too much?

  35. “4 out of 5 [Biologists] Agree” • Given only the positions of a single nucleotide literal: • No additional nucleotides can be inferred • No “biologically useful” information can be inferred • Given current understanding of the structure and function of the genome

  36. Does It Work? • In general, yes • strong correlation between our scores and S-W • not as sensitive as S-W • some “weak” matches missed • Via statistical inference: • very few false positives: < 10^-4 • very few false negatives (usually none)

  37. An Extension • Sequences can be “masked” • For each task, choose random binary mask • Remove from sequence all “zero” elements • Our experiments suggest mask with “1” in 90% of positions works well • E.g., X: 2 2 4 2 1 1 4 3 with mask 1 1 1 0 1 1 1 0 gives 2 2 4 1 1 4
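The masking step can be sketched in Python as follows. The function names are assumptions; the 90% keep rate comes from the slide.

```python
import random
from typing import List, Optional


def random_mask(length: int, keep_prob: float = 0.9,
                rng: Optional[random.Random] = None) -> List[int]:
    """Random binary mask with a 1 in roughly keep_prob of the positions."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_prob else 0 for _ in range(length)]


def apply_mask(seq: List[int], mask: List[int]) -> List[int]:
    """Drop every element of seq whose mask entry is 0."""
    return [v for v, m in zip(seq, mask) if m == 1]


# The example from the slide.
masked = apply_mask([2, 2, 4, 2, 1, 1, 4, 3], [1, 1, 1, 0, 1, 1, 1, 0])

# A fresh random mask would be drawn for each task.
task_mask = random_mask(8, rng=random.Random(0))
```

Because a new mask is drawn per task, a participant receives an offset sequence with some entries missing and does not know which ones were removed.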

  38. Simulation Results • Well-matched sequences artificially generated • Substring mutated over several generations • Placed at random location into random sequences • Scoring function: +1 match, -1 no match • Gap penalty: g(k) = 2 + 1(k-1)
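One way such test data might be generated is sketched below. The slides do not give the generation procedure in detail, so everything here is an assumption: the function names, the single mutation pass with fixed counts (the talk mutates over several generations and reports averages), and the even split between insertions and deletions.

```python
import random

ALPHABET = "ACGT"


def random_sequence(n: int, rng: random.Random) -> str:
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def mutate(seq: str, n_subs: int, n_indels: int, rng: random.Random) -> str:
    """Apply n_subs substitutions, then n_indels random insertions or deletions."""
    s = list(seq)
    for _ in range(n_subs):
        i = rng.randrange(len(s))
        s[i] = rng.choice([c for c in ALPHABET if c != s[i]])
    for _ in range(n_indels):
        if rng.random() < 0.5 and len(s) > 1:
            del s[rng.randrange(len(s))]                               # deletion
        else:
            s.insert(rng.randrange(len(s) + 1), rng.choice(ALPHABET))  # insertion
    return "".join(s)


def plant(core: str, total_len: int, rng: random.Random) -> str:
    """Embed core at a random location inside random flanking sequence."""
    pad = total_len - len(core)
    left = rng.randrange(pad + 1)
    return random_sequence(left, rng) + core + random_sequence(pad - left, rng)


rng = random.Random(42)
core = random_sequence(300, rng)                # shared matching portion
a = plant(core, 700, rng)                       # first sequence of the pair
b = plant(mutate(core, 52, 52, rng), 700, rng)  # mutated copy in the second
```

The pair `(a, b)` then plays the role of one artificially generated, well-matched comparison.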

  39. 10000 comparisons, no mask, maximum method • Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels

  40. 10000 comparisons, no mask, adding method • Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels

  41. 1000 comparisons, no mask, maximum method • Sequence length 2000, matching portion length 1000, average of 150 subs and 150 indels

  42. 1000 comparisons, 90% mask, maximum method • Sequence length 1000-1300, matching portion length 500, average of 86.25 subs and 86.25 indels

  43. Conclusions • Introduced notion of sufficient accuracy • Presented a strategy for enhancing data privacy in an important real-world application • Presented an important real-world app that: • requires privacy • is efficiently parallelizable • Potential first entry for benchmark suite of apps for privacy study

  44. Future Work • Solution is less than ideal • lack of formal privacy model / provable security • need more testing on real genetic data • But it’s a start • general problem is very difficult • this is a potential avenue of attack • S-W requires more careful study in this context • Consider additional apps

  45. Ongoing DVC Work @ UR • Augmenting BOINC software for campus-wide distribution • want to collect participant/server/data info & patterns • Greg Steffensen • Exploring AI to catch malicious behavior • can we catch omitted results? • Andy White, Matt Kretchmar (Denison CS)

  46. Thanks • NSF CyberTrust • Doug Szajda, Jason Owen • All the UR students • UR Biologists: • Rafael de Sa, Laura Runyen-Janecky, Joe Gindhart • Tadayoshi Kohno (UCSD)
