Doug Szajda Mike Pohl * Jason Owen Barry Lawson

Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm

Large-Scale Distributed Computations • Easily parallelizable, compute intensive • Divide into independent tasks to be executed on participant PCs • Significant results collected by supervisor

Examples • seti@home: finding Martians • folding@home: protein folding • GIMPS (Entropia): Mersenne prime search • United Devices, IBM, DOD: smallpox study • DNA sequencing • Graphics • Exhaustive regression • Genetic algorithms • Data mining • Monte Carlo simulation

A Problem • Code is executing in untrusted environments • Data required for task execution may be proprietary • Can we find a way to have participants execute tasks without divulging data?

Related Work (not exhaustive) • Computing with Encrypted Data • Feigenbaum (1985) • Abadi, Feigenbaum, Kilian (1987) • Secure Circuit Evaluation • Abadi and Feigenbaum (1990) • Sander, Young, and Yung (1999)

Related Work (not exhaustive) • Privacy Homomorphisms • Rivest, Adleman, Dertouzos (1978) • Ahituv, Lapid, Neumann (1987) • Brickell and Yacobi (1987) • Multiparty function computation • Yao (1986) • Goldreich, Micali, Wigderson (1987) • Ben-Or, Goldwasser, and Wigderson (1988) • Chaum, Crepeau, and Damgard (1988)

Computing With Encrypted Data • Alice has x, wants Bob to compute f(x), but does not want to divulge x • Alice gives Bob E(x) and f’, tells him to return f’(E(x)) • Alice can determine f(x) from f’(E(x)), but Bob cannot determine x from knowledge of E(x), f’(E(x))

In Present Context • Alice has several x values. Asks Bob to identify those that are significant • Alice doesn’t need f(x), so greater flexibility in definition of f’ (Sufficient Accuracy) • Post-filtering means that some false positives are OK. • Lots of Bobs offering computing services

Adversary (as usual) • Assumed to be intelligent • Can decompile, analyze, modify code • Understands task algorithms and measures used to prevent disclosure of data

The Model • Computation: evaluate f : D -> R • Partition D into subsets Di • Task T(Di): evaluate f(xi) for all xi in Di • Each task assigned filter function Gi • Gi returns indices of interesting xi
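The model above can be sketched in a few lines. This is only an illustration: the names `partition`, `run_task`, and `is_interesting` are hypothetical, and the filter Gi is assumed here to be a simple predicate on f(xi).

```python
def partition(domain, size):
    """Split the domain D into consecutive subsets Di of at most `size` items."""
    return [domain[i:i + size] for i in range(0, len(domain), size)]


def run_task(f, subset, is_interesting):
    """Task T(Di): evaluate f on every xi in Di and return the indices
    (within Di) of the items the filter marks as interesting."""
    return [i for i, x in enumerate(subset) if is_interesting(f(x))]


if __name__ == "__main__":
    subsets = partition(list(range(10)), 4)
    # Each subset would be handed to a different participant.
    for d_i in subsets:
        print(d_i, run_task(lambda x: x * x, d_i, lambda y: y > 20))
```

Only indices travel back to the supervisor, which is what leaves room for the transformed tasks described next.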

Basic Approach • Transform Di, f, Gi into Di’, f’, Gi’ • Replace T(Di) with T(Di’) such that • T(Di’) does not leak additional information about values in Di • The set of identifiers returned by T(Di’) contains those that would be returned by T(Di) • The difference between the two sets is reasonably small

Reality • Providing required properties is difficult (impossible for some apps) • Even when possible, implementation is application specific • Bottom line: A potential approach, where few (if any) others exist

An Example: Smith-Waterman Genome Sequence Comparison

Genetic Sequence Alignment • Comparing sequences over alphabet Σ = {A,C,G,T} • Biologists track evolutionary changes by writing sequences with columns aligned (called an alignment) • Ex. CTGTTA / CAGTTA

Sequence Evolution • Deletion: CTGTTA → CTGTA • Insertion: CTGTTA → CGTGTTA • Substitution: CTGTTA → CAGTTA • Insertions and deletions are collectively called indels

Sequence Evolution (cont.) • After several “generations”: CTGTTA → CTATGCTCG • Note: Number of alignments (for pair of realistic length sequences) is huge

Alignment “Types” • Global alignment • Considers entire sequence • Local alignment • Considers substrings • Biologists usually consider local alignments

Measuring Alignments • Scoring function • +1 if symbols match • -1 if not • Gap penalty • g(k) = a + b(k-1) • k is gap length (# consecutive dashes in single sequence) • Alignment score is sum of column scores minus gap penalties
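A minimal sketch of this scoring rule follows. It assumes an alignment is given as two equal-length strings with "-" marking gap positions; the function names are hypothetical, and the defaults a = 2, b = 1 are the gap-penalty values quoted on the simulation slide.

```python
def gap_penalty(k, a=2, b=1):
    """g(k) = a + b(k-1) for a gap of k consecutive dashes."""
    return a + b * (k - 1)


def alignment_score(top, bottom, a=2, b=1):
    """Sum of column scores (+1 match, -1 mismatch) minus gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    # Column scores: only columns where both rows hold a symbol.
    for x, y in zip(top, bottom):
        if x != "-" and y != "-":
            score += 1 if x == y else -1
    # Gap penalties: one penalty per run of dashes within a single row.
    for row in (top, bottom):
        run = 0
        for ch in row + "#":  # sentinel closes a trailing run
            if ch == "-":
                run += 1
            elif run:
                score -= gap_penalty(run, a, b)
                run = 0
    return score


if __name__ == "__main__":
    print(alignment_score("CTGTTA", "CAGTTA"))  # one substitution
    print(alignment_score("CTGTTA", "CTG-TA"))  # one deletion
```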

Smith-Waterman • Dynamic programming algorithm guaranteed to produce an optimal alignment • Global: O(n^2); local: O(n^3) • Widely used by biologists • Implemented on commercial volunteer distributed computing platforms
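For illustration, here is a compact local-alignment scorer. It is a sketch rather than the implementation used in this work. It assumes the +1/-1 scoring and the affine gap penalty g(k) = a + b(k-1) from these slides. For such affine penalties it uses the recurrence usually attributed to Gotoh, which runs in O(nm) time rather than the cubic time needed for general gap functions. The function name is hypothetical.

```python
def smith_waterman(x, y, match=1, mismatch=-1, a=2, b=1):
    """Return the best local alignment score of sequences x and y."""
    neg = float("-inf")
    m = len(y)
    prev_h = [0] * (m + 1)    # best score of an alignment ending at (i-1, j)
    prev_f = [neg] * (m + 1)  # same, but ending with a gap that skips x[i-1]
    best = 0
    for i in range(1, len(x) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg] * (m + 1)
        e = neg               # ending with a gap that skips y[j-1]
        for j in range(1, m + 1):
            e = max(cur_h[j - 1] - a, e - b)               # open or extend
            cur_f[j] = max(prev_h[j] - a, prev_f[j] - b)   # open or extend
            step = match if x[i - 1] == y[j - 1] else mismatch
            cur_h[j] = max(0, prev_h[j - 1] + step, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


if __name__ == "__main__":
    print(smith_waterman("CTGTTA", "CAGTTA"))
    print(smith_waterman("CTGTTA", "CTGTA"))
```

Because the function only tests elements for equality, it accepts any indexable sequences, including lists of integers.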

Using Smith-Waterman • Significance of Smith-Waterman score based on probabilistic considerations • Empirical Evidence: Similarity scores of randomly generated sequences exhibit an extreme value distribution • Significance threshold p chosen so that the probability that a random score exceeds p is small (typically < 0.003)
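One simple way to pick such a threshold is sketched below. Note the substitution: instead of fitting an extreme value distribution, this sketch takes an empirical quantile of scores already obtained from random sequence pairs. The function name and the tail probability `alpha` are assumptions for illustration.

```python
def empirical_threshold(random_scores, alpha=0.003):
    """Return a threshold p such that at most a fraction `alpha` of the
    supplied random-sequence scores is strictly greater than p."""
    if not random_scores:
        raise ValueError("need at least one score")
    ordered = sorted(random_scores)
    n = len(ordered)
    # Number of scores allowed to exceed p; the small epsilon guards
    # against floating-point rounding in alpha * n.
    allowed = min(n - 1, int(alpha * n + 1e-9))
    return ordered[n - 1 - allowed]


if __name__ == "__main__":
    scores = list(range(1, 1001))  # stand-in for scores of random pairs
    p = empirical_threshold(scores)
    print(p, sum(s > p for s in scores))
```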

A Smith-Waterman Task • Pairwise comparison of two sets of sequences, A and B • A : proprietary sequences • B : sequences from public database • Returned: indices of well-matched pairs • Notation: T(A,B,s,g,p)

Our Transformation • Offset sequences: compare relative distances between successive occurrences of a specific nucleotide (first entry is the position of the first occurrence) • U: GCACTTACGCCCTTACGACG • F(U,A) = {3,4,8,3} • F(U,C) = {2,2,4,2,1,1,4,3} • F(U,G) = {1,8,8,3} • F(U,T) = {5,1,7,1}
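The offset sequences can be reproduced with a short function. The name `offsets` is hypothetical. The convention that the first entry counts from the start of the sequence is inferred from the values listed above.

```python
def offsets(sequence, nucleotide):
    """F(sequence, nucleotide): distances between successive occurrences
    of `nucleotide`, the first measured from the start of the sequence."""
    result = []
    last = 0  # 1-based position of the previous occurrence (0 = none yet)
    for pos, symbol in enumerate(sequence, start=1):
        if symbol == nucleotide:
            result.append(pos - last)
            last = pos
    return result


if __name__ == "__main__":
    u = "GCACTTACGCCCTTACGACG"
    for n in "ACGT":
        print(n, offsets(u, n))
```

A participant given only F(U,C) learns where every C lies but sees none of the other three symbols directly.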

Modified Tasks • U: GCACTTACGCCCTTACGACG F(U,C) = {2,2,4,2,1,1,4,3} • V: GCACTCGCCACTTAGCACG F(V,C) = {2,2,2,2,1,2,5,2} • Apply S-W to F(U,C) and F(V,C) • Scoring function, gap penalty • “Goodness” threshold

Intuition • Similar sequences should have similar offsets • Consider effects of indels, substitutions • False positives can be reduced • Consider multiple nucleotides • I.e., assign A and C info to distinct participants • Good match if both tasks indicate significance

Using Multiple Nucleotide Literals • Maximum method • One task for each of A,C,G,T • Result significant if any of the four says so • Adding method • One task for each of A,C,G,T, results passed to fifth participant • Result significant if sum of four scores indicates significance • Costs reduced in either case
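The two combination rules can be sketched as follows. The data layout (a dict of per-nucleotide scores), the function names, and the use of a strict ">" comparison are assumptions for illustration; the slides do not specify how the thresholds are represented.

```python
NUCLEOTIDES = "ACGT"


def significant_by_maximum(scores, thresholds):
    """Maximum method: significant if any one of the four per-nucleotide
    tasks exceeds its own threshold."""
    return any(scores[n] > thresholds[n] for n in NUCLEOTIDES)


def significant_by_adding(scores, sum_threshold):
    """Adding method: a fifth participant sums the four scores and
    compares the total against a single threshold."""
    return sum(scores[n] for n in NUCLEOTIDES) > sum_threshold


if __name__ == "__main__":
    scores = {"A": 12, "C": 31, "G": 9, "T": 14}      # made-up values
    thresholds = {"A": 25, "C": 25, "G": 25, "T": 25}  # made-up values
    print(significant_by_maximum(scores, thresholds))
    print(significant_by_adding(scores, 80))
```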

Security?

Recall… • T(Di’) does not leak additional information about values in Di • The set of identifiers returned by T(Di’) contains those that would be returned by T(Di) • The difference between the two sets is reasonably small

Data Privacy? • Property 1 fails: adversary will know all info about a single nucleotide literal • Conditional entropy gives rough estimate of amount of information leaked • Bits leaked: 2N - (N - C∂) log2 3 • N is sequence length; C∂ is # of occurrences of ∂ in sequence • Ex. N = 600, C∂ = N/4: about 487 bits (of 1200) leaked (713 bits of uncertainty remain)
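The arithmetic in the example can be checked directly. This sketch assumes the logarithm is base 2 (the result is in bits) and that N is the sequence length, so a sequence carries 2N bits in total. The function name is hypothetical.

```python
import math


def bits_leaked(n, count):
    """2N - (N - C) log2(3): total bits in the sequence minus the
    uncertainty left in the N - C positions known only to hold one of
    the three other symbols."""
    return 2 * n - (n - count) * math.log2(3)


if __name__ == "__main__":
    n = 600
    leaked = bits_leaked(n, n // 4)
    print(round(leaked), round(2 * n - leaked))  # leaked, remaining
```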

Analysis • Clearly, our scheme does not provide provable security, but it does suggest two questions: • Can an adversary determine additional symbols (and if so, how many)? • How much information leakage is too much in this context?

“4 out of 5 [Biologists] Agree” • Given only the position of a single nucleotide literal: • No additional elements can be inferred • There is no “biologically useful” information that can be inferred • Given current understanding of the structure and function of the genome

An Extension • Sequences can be “masked” • For each task, choose random binary mask • Remove from sequence all “zeroed” elements • Our experiments suggest mask with “1” in 90% of positions works well
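A minimal sketch of masking follows. It assumes the mask is applied to the symbol sequence before any further transformation and that each position is kept independently with probability 0.9. The slides do not spell out either detail, and the function names are hypothetical.

```python
import random


def random_mask(length, keep_prob=0.9, rng=None):
    """Random binary mask with a 1 in roughly `keep_prob` of positions."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_prob else 0 for _ in range(length)]


def apply_mask(sequence, mask):
    """Remove from the sequence every element whose mask bit is 0."""
    if len(sequence) != len(mask):
        raise ValueError("mask and sequence must have equal length")
    return "".join(s for s, bit in zip(sequence, mask) if bit)


if __name__ == "__main__":
    u = "GCACTTACGCCCTTACGACG"
    mask = random_mask(len(u), rng=random.Random(1))
    print(apply_mask(u, mask))
```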

Does it Work? • In general, yes • Strong correlation between our scores and S-W • Not as sensitive as Smith-Waterman • Some weak matches missed • Statistical inference techniques show: • Very few false positives (< 10^-4) • Very few false negatives (often none)

Simulation Results • Well-matched sequences artificially generated • Substring mutated over several generations • Placed at random location into random sequences • Scoring function as given earlier (1, -1) • Gap penalty: g(k) = 2 + 1(k-1)
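Test data of this kind could be generated roughly as follows. This is a sketch under assumptions: the per-symbol mutation probabilities, the function names, and the choice to split indels evenly between insertions and deletions are illustrative and not taken from the experiments.

```python
import random

ALPHABET = "ACGT"


def random_sequence(length, rng):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(seq, rng, p_sub=0.01, p_indel=0.01):
    """One 'generation': each symbol may be substituted, deleted, or
    preceded by an inserted random symbol."""
    out = []
    for symbol in seq:
        r = rng.random()
        if r < p_sub:                        # substitution
            out.append(rng.choice([c for c in ALPHABET if c != symbol]))
        elif r < p_sub + p_indel / 2:        # deletion
            continue
        elif r < p_sub + p_indel:            # insertion
            out.append(rng.choice(ALPHABET))
            out.append(symbol)
        else:                                # unchanged
            out.append(symbol)
    return "".join(out)


def plant(substring, total_length, rng):
    """Place `substring` at a random location inside random padding."""
    pad = total_length - len(substring)
    if pad < 0:
        raise ValueError("substring longer than requested sequence")
    left = rng.randrange(pad + 1)
    return random_sequence(left, rng) + substring + random_sequence(pad - left, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    core = random_sequence(300, rng)
    descendant = core
    for _ in range(10):                      # several generations
        descendant = mutate(descendant, rng)
    a = plant(core, 700, rng)
    b = plant(descendant, max(700, len(descendant)), rng)
    print(len(a), len(b))
```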

10000 comparisons, no mask, maximum method for determining significance • Sequence length 600-800, matching portion length 300, average of 52.5 substitutions and 52.5 indels

10000 comparisons, no mask, adding method for determining significance • Sequence length 600-800, matching portion length 300, average of 52.5 substitutions and 52.5 indels

1000 comparisons, no mask, maximum method for determining significance • Sequence length 2000, matching portion length 1000, average of 150 substitutions and 150 indels

1000 comparisons, 90% mask, maximum method for determining significance • Sequence length 1000-1300, matching portion length 500, average of 86.25 substitutions and 86.25 indels

Conclusions • Introduced notion of sufficient accuracy • Presented a strategy for enhancing data privacy in important real-world application • Presented an important real-world app that requires privacy and is efficiently parallelizable • These are relatively rare • Potential first entry for benchmark suite of apps for privacy study

In the Future • Solution is less than ideal • Lack of formal privacy model / provable security • Need more testing on real genetic data • But it’s a start • General problem is difficult, this is a potential avenue of attack • Smith-Waterman requires more careful study in this context • Application behavior vs. application configurations
