460 likes | 573 Views
How To Tell A Secret Without Revealing It. Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Barry Lawson University of Richmond. Outline. Distributed volunteer computing A problem Related work A real-world application
E N D
How To Tell A Secret Without Revealing It Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Barry Lawson University of Richmond
Outline • Distributed volunteer computing • A problem • Related work • A real-world application • Our enhanced privacy approach • Results & analysis • Conclusions, future & ongoing work
Your PC (few GFLOPS) NOW (GFLOPS) Supercomputer (< 280.6 TFLOPS) ~$128M Our Scenario • You have a very large, compute intensive project
Welcome to the Internet, my friend. How can I help you? What To Do?
Distributed Volunteer Computing (DVC) Participants (Bob) (< 200 TFLOPS) Supervisor (Alice)
DVC Computation • Large-scale distributed computation: • compute intensive • easily parallelizable • Supervisor: • divide computation into tasks (independent) • ship tasks to participants • collect significant results • Participants: • download and execute tasks (when o/w idle) • return significant results
DVC Group @ Richmond • Faculty • Doug Szajda (CS) • Barry Lawson (CS) • Jason Owen (Statistics) • Students • Current Mike Pohl, Greg Steffensen, Andy White • Past Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C. • Four more this summer Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja
Telling a Secret Without Revealing It • A problem: • Participants are untrustworthy • Code executes outside supervisor’s control • Computation data may be proprietary • Goal: • Participants provide meaningful results • Supervisor does not divulge data
Related Work(not exhaustive) • Computing with encrypted data: • Alice has x, wants Bob to compute f(x) But does not want to divulge x • Alice gives Bob x’ and f’( ) • Bob computes f’(x’) • The key: • Alice can determine f(x) from f’(x’) • Bob cannot determine x from x’ and/or f’(x’) • Difficult (often impossible) in practice
“Alice” “Alice” Less Formally x x’ f f’ ? f(x) f’(x’) f’(x’) “Bob”
Flexibility In Our Context • The computation: • Alice (supervisor) has many x’s • Bob (participant) determines x’s that are significant • Alice doesn’t need the valuef(x) • Alice will post-process • A few false positives are OK • Sufficient accuracy: flexibility in f’( )
The Adversary • Assumed to be intelligent • can decompile, analyze, modify code • understands task algorithms • understands enhanced privacy scheme(s) • Motivation • may not be obvious: business competitor? • may not care if leak is detected
D1 D D2 D3 General Model • The Computation: • evaluate an algorithm f : D Rfor all x in D • Task T( ): • partition D into subsets Di • T(Di) evaluates f(xi) for all xi inD • Filter function G( ): • determines “significance” • returns indices of significant xi
Our General Approach • Transform Di, f, G into Di’, f’, G’ • Replace task T(Di) with T(Di’) • Desirable properties: • T(Di) does not leak additional info about values in Di • significance in T(Di) significance in T(Di’) • any difference is reasonably small
In Reality… • Providing desired properties is difficult • even with increased flexibility • impossible for some apps • When possible, application-specific • Bottom line: we have a potential approach • where few, if any, others exist
Application: Genome Sequence Comparison • Compare sequences over genome alphabet ∑ = {A,C,G,T} • Track evolutionary changes by aligning columns of sequences (an alignment) • E.g.: CTGTTA CAGTTA
Sequence Evolution CTGTTA CTG–TA • Deletion: • Insertion: • Substitution: indels C–TGTTA CGTGTTA CTGTTA CAGTTA
Sequence Evolution • After several “generations” • Note: # of alignments is huge (for realistic-length sequences) C–TGT––TA–– CTA–TGCTACG
Alignment Types • Global alignment • considers entire sequence • Local alignment • considers substrings • biologists usually use local
Measuring Alignments • Scoring function: +1 if symbols match -1 if not • Gap penalty • g(k) = a + b(k-1) • k is gap length (# consecutive dashes in single sequence) • Alignment score: • sum of column scores minus gap penalties
A Simple Example • Global alignment: • Scoring function: +1 match, -1 no match • Gap penalty: g(k) = 2 + 1(k-1) C – T G T – – T A – – C T A – T G C T A C G +1 -2 -1 -2 +1 -3 +1 +1 -3 • Alignment score: +4 - 11 = -7
Smith-Waterman • Dynamic programming algorithm • Produces an optimal alignment Global: O(n2) Local: O(n3) • Implemented on commercial DVC platforms
Significance in S-W • Significance of scores based on probability • Empirical evidence: • given randomly-generated sequences • scores exhibit extreme value distribution
Determining Significance • Choosing a significance threshold p : • want small probability that a random score >p • typically, probability < 0.003 p
A Smith-Waterman Task • Pairwise comparison of two sets of sequences, A and B • A : proprietary sequences • B : sequences from public database • Returned: indices of well-matched pairs • Notation: T(A,B,s,g,p)
Our Transformation • Use offset sequences: • compare relative distances b/w specific nucleotides • X: GCACTTACGCCCTTACGACG • F(X,A) = {3,4,8,3} • F(X,C) = {2,2,4,2,1,1,4,3} • F(X,G) = {1,8,8,3} • F(X,T) = {5,1,7,1}
Modified Tasks • X: GCACTTACGCCCTTACGACG F(X,C) = {2,2,4,2,1,1,4,3} • Y: GCACTCGCCACTTAGCACG F(Y,C) = {2,2,2,2,1,2,5,2} • Apply S-W to F(X,C) and F(Y,C) • Scoring function, gap penalty • “Goodness” threshold
Intuition • Similar sequences similar offsets • consider effects of indels, substitutions • What about false positives? • multiple nucleotides • e.g., assign A & C tasks to distinct participants • good match if both tasks indicate significance
2351 2351 1624 1624 2323 2323 1352 1352 “Alice” A C CAGGATCTCAAGC CAGCATATCACGT ? ? “Bob 1” “Bob 2”
Using Multiple Nucleotides • Maximum method • one task for each of A,C,G,T • result significant if any of the four indicate • Adding method • one task for each of A,C,G,T • result significant if sum of four scores indicates significance • Costs reduced in either case • on average, 1/4 length of original sequence • runtime for an offset sequence ~1/64
Does This Provide Real Data Privacy? • Recall desired properties: • T(Di) does not leak additional info about values in Di • significance in T(Di) significance in T(Di’) • any difference is reasonably small
Data Privacy? • Property 1 fails: • T(Di) does leak additional info about values in Di • adversary knows all info about one nucleotide • How much info is leaked? • conditional entropy gives rough estimate • e.g., N = 600, C∂ =N/4 • 487 bits (of 1200) leaked • 713 bits of uncertainty remain
Analysis • Clearly, not provable security • Suggests two questions: • Can adversary determine additional symbols; if so, how many? • How much info leakage is too much?
“4 out of 5 [Biologists] Agree” • Given only the position of a single nucleotide literal: • No additional nucleotides can be inferred • No “biologically useful” information that can be inferred • Given current understanding of the structure and function of the genome
Does It Work? • In general, yes • strong correlation b/w our scores and S-W • not as sensitive as S-W • some “weak” matches missed • Via statistical inference: • very few false positives: < 10-4 • very few false negatives (usually none)
An Extension • Sequences can be “masked” • For each task, choose random binary mask • Remove from sequence all “zero” elements • Our experiments suggest mask with “1” in 90% of positions works well X: 2 2 4 2 1 1 4 3 1 1 1 0 1 1 1 0 2 2 4 1 1 4
Simulation Results • Well-matched sequences artificially generated • Substring mutated over several generations • Placed at random location into random sequences • Scoring function: +1 match, -1 no match • Gap penalty: g(k) = 2 + 1(k-1)
10000 comparisons, no mask, maximum method • Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels
10000 comparisons, no mask, adding method • Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels
1000 comparisons, no mask, maximum method • Sequence length 2000, matching portion length 1000, average of 150 subs and 150 indels
1000 comp, 90% mask, maximum method • Sequence length 1000-1300, matching portion length 500, average of 86.25 subs and 86.25 indels
Conclusions • Introduced notion of sufficient accuracy • Presented a strategy for enhancing data privacy in important real-world application • Present important real-world app that: • requires privacy • efficiently parallelizable • Potential first entry for benchmark suite of apps for privacy study
Future Work • Solution is less than ideal • lack of formal privacy model / provable security • need more testing on real genetic data • But it’s a start • general problem is very difficult • this is a potential avenue of attack • S-W requires more careful study in this context • Consider additional apps
Ongoing DVC Work @ UR • Augmenting BOINC software for campus-wide distribution • want to collect participant/server/data info & patterns • Greg Steffensen • Exploring AI to catch malicious behavior • can we catch omitted results? • Andy White, Matt Kretchmar (Denison CS)
Thanks • NSF CyberTrust • Doug Szajda, Jason Owen • All the UR students • UR Biologists: • Rafael de Sa, Laura Runyen-Janecky, Joe Gindhart • Tadayoshi Kohno (UCSD)