Inference and De-anonymization Attacks against Genomic Privacy. ETH Zürich, October 1, 2015. Mathias Humbert. Joint work with Erman Ayday, Jean-Pierre Hubaux, Kévin Huguenin, Joachim Hugonot, Amalio Telenti.
(Human) System Security • How is a human system encoded? • The human genome can be represented as a sequence of ternary values (called SNPs/SNVs) • Computer -> human systems: binary -> ternary values! [Figure: Programmer 1 and Programmer 2]
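As a minimal sketch of the ternary encoding mentioned above, the snippet below maps a two-letter genotype to a value in {0, 1, 2}. It assumes the common convention that the value counts copies of the minor allele; the function name `encode_snp` and the toy genotypes are hypothetical and not taken from the talk.

```python
def encode_snp(genotype: str, minor_allele: str) -> int:
    """Return the number of minor alleles (0, 1, or 2) in a two-letter genotype."""
    if len(genotype) != 2:
        raise ValueError("expected a two-letter genotype such as 'AG'")
    # Count how many of the two alleles equal the minor allele.
    return sum(1 for allele in genotype if allele == minor_allele)


# Toy data (made up): three SNP positions whose minor allele is G.
genome = [("AA", "G"), ("AG", "G"), ("GG", "G")]
encoded = [encode_snp(g, minor) for g, minor in genome]
print(encoded)
```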
Programming Human Beings… [Figure: Programmer 1 (father) and Programmer 2 (mother), each with a gamete-production step, leading to the child]
Genomic Data Deluge • Genotyping < 100$ today • > 950k people genotyped by 23andMe • Recent governmental and industrial initiatives • President Obama’s Precision Medicine Initiative (01/2015) => 1M+ citizens • Google Genomics (API to store, process, explore, and share DNA data) • Microsoft Research (genomic research in collaboration with Sanger Center) • Global Alliance for Genomics & Health (common framework for effective, responsible and secure sharing of genomic and clinical data) • Genomic-data benefits • Providing substantial improvement in diagnosis and personalized medicine • Helping medical research progress • Sharing of genomic data • Thousands of genomes are already available online (OpenSNP, Personal Genome Project, …) • First motivation for sharing: help research [1] [1] http://opensnp.wordpress.com/2011/11/17/first-results-of-the-survey-on-sharing-genetic-information/
Genomic Privacy Risks • Genome carries sensitive information about • Predisposition to diseases • Genetic discrimination in health or life insurance, … • Future physical conditions • Genetic discrimination in work, sports, ... • Kinship • Familial tragedies (like divorce caused by the discovery of illegitimate offspring [2]) • Physical appearance, metabolism • The privacy situation is worsened by • The non-revocability of genomic data • Interdependent risks [2] http://www.vox.com/2014/9/9/5975653/with-genetic-testing-i-gave-my-parents-the-gift-of-divorce-23andme
Outline • Statistical inference • Kin genomic privacy • M. Humbert, E. Ayday, J.-P. Hubaux, A. Telenti. Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy. CCS 2013 • De-anonymization • Of genomic databases with phenotypic traits • M. Humbert, K. Huguenin, J. Hugonot, E. Ayday, J.-P. Hubaux. De-anonymizing Genomic Databases with Phenotypic Traits. PETS 2015
Cross-Website Attack • Correlated genetic information between family members => an individual sharing his/her genome threatens his/her (known) relatives’ genomic privacy
Genomics 101 • Human genome consists of 3 billion nucleotide pairs, i.e. 3B pairs of 4 letters (A, C, G, or T) • Organized into 23 pairs of chromosomes • ~99.9% of the genome is identical between 2 individuals • Single nucleotide polymorphism (SNP) • > 50 million SNP positions in human genome • Disease risk can be computed by analyzing particular SNPs
Linkage Disequilibrium (LD) • Linkage disequilibrium: correlation between pairs of SNPs • D = Pr(X=A, Y=B) – Pr(X=A) Pr(Y=B) • D’ = normalized D • From these LD metrics, we can compute pairwise joint probabilities between any SNPs [Figure: D’ by SNP ID on both axes; labels: expected frequencies under independence, observed frequencies]
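The D formula above, and the step from D back to joint probabilities, can be sketched as follows. The normalization used for D’ is the standard Lewontin form, which is an assumption here because the slide only says "normalized D"; all function names are hypothetical.

```python
def ld_d(p_a: float, p_b: float, p_ab: float) -> float:
    """D = Pr(X=A, Y=B) - Pr(X=A) * Pr(Y=B)."""
    return p_ab - p_a * p_b


def ld_d_prime(p_a: float, p_b: float, p_ab: float) -> float:
    """Normalized D in [-1, 1] (Lewontin's D', assumed normalization)."""
    d = ld_d(p_a, p_b, p_ab)
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return 0.0 if d_max == 0 else d / d_max


def joint_from_d(p_a: float, p_b: float, d: float) -> dict:
    """Recover the 2x2 joint allele distribution from the marginals and D."""
    return {
        ("A", "B"): p_a * p_b + d,
        ("A", "not B"): p_a * (1 - p_b) - d,
        ("not A", "B"): (1 - p_a) * p_b - d,
        ("not A", "not B"): (1 - p_a) * (1 - p_b) + d,
    }


# Toy allele frequencies (made up for illustration).
d = ld_d(0.5, 0.5, 0.4)
joint = joint_from_d(0.5, 0.5, d)
print(d, ld_d_prime(0.5, 0.5, 0.4), joint)
```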
Quantifying Kin Genomic Privacy • Quantifying privacy risks • With respect to the amount of genomic data that is revealed, and the relative(s) revealing it • Considering the background knowledge of the adversary (familial relationships, LD values, minor allele frequencies) • Designing efficient inference algorithms that mimic reconstruction attacks given background knowledge • In order to propose protection mechanisms to reduce the inherent genomic-privacy risk
Reconstruction Attacks • Adversary’s objective: • Compute the posterior marginal probabilities of the family’s SNPs given: • The observed data (publicly available SNPs) • The background knowledge (inheritance probabilities, population allele frequencies, LD statistics) • LD statistics given by a sparse pairwise joint probability matrix L where L_{i,j} = Pr(X_i, X_j) [Figure: grid of SNP positions by relatives]
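To make the posterior computation concrete, here is a minimal sketch for a single SNP and a single trio: it infers the father's genotype from the observed child and mother. It assumes Mendelian transmission (each parent passes a minor allele with probability genotype/2) and a Hardy-Weinberg prior built from the minor allele frequency; both are illustrative assumptions, and the function names are hypothetical rather than the authors' implementation.

```python
def child_given_parents(child: int, father: int, mother: int) -> float:
    """Pr(child genotype | parents), genotypes coded as minor-allele counts 0/1/2."""
    tf, tm = father / 2.0, mother / 2.0  # chance each parent transmits the minor allele
    probs = [(1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm]
    return probs[child]


def hardy_weinberg_prior(maf: float) -> list:
    """Prior over genotypes 0/1/2 from the minor allele frequency (assumed HWE)."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def posterior_father(child: int, mother: int, maf: float) -> list:
    """Posterior over the unobserved father's genotype via Bayes' rule."""
    prior = hardy_weinberg_prior(maf)
    unnorm = [prior[f] * child_given_parents(child, f, mother) for f in range(3)]
    total = sum(unnorm)
    if total == 0:
        raise ValueError("observations are inconsistent with Mendelian inheritance")
    return [u / total for u in unnorm]


# Toy case: child has two minor alleles, mother has one, MAF = 0.2.
posterior = posterior_father(child=2, mother=1, maf=0.2)
print(posterior)
```

In this toy case the father cannot have genotype 0, since the child must have received one minor allele from him.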
Inference Algorithms • Naive marginalization of any of the random variables has computational complexity O(3^{mn}) • We chose to run belief propagation (a.k.a. message passing) on graphical models to reduce the computational complexity • Exact inference without considering LD between SNPs • Junction tree algorithm = belief propagation on a junction tree • Complexity = O(mn) • Approximate inference if LD included • Loopy belief propagation on a factor graph • Complexity = O(mn) per iteration
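The gap between naive marginalization and message passing can be illustrated on a much simpler structure than the family graphs used in the talk. The sketch below runs sum-product belief propagation on a chain of ternary variables (a tree, so the result is exact) and checks it against brute-force enumeration over all 3^k assignments. The chain model, the potentials, and the function names are assumptions chosen for illustration only.

```python
from itertools import product

STATES = range(3)  # ternary SNP values 0, 1, 2


def chain_marginals(unary, pairwise):
    """Exact marginals on a chain via forward/backward messages (linear in chain length)."""
    k = len(unary)
    fwd = [[1.0] * 3 for _ in range(k)]
    bwd = [[1.0] * 3 for _ in range(k)]
    for i in range(1, k):  # messages passed left to right
        for y in STATES:
            fwd[i][y] = sum(fwd[i - 1][x] * unary[i - 1][x] * pairwise[i - 1][x][y]
                            for x in STATES)
    for i in range(k - 2, -1, -1):  # messages passed right to left
        for x in STATES:
            bwd[i][x] = sum(pairwise[i][x][y] * unary[i + 1][y] * bwd[i + 1][y]
                            for y in STATES)
    marginals = []
    for i in range(k):
        belief = [fwd[i][x] * unary[i][x] * bwd[i][x] for x in STATES]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Same marginals by enumerating all 3^k joint assignments."""
    k = len(unary)
    totals = [[0.0] * 3 for _ in range(k)]
    for assignment in product(STATES, repeat=k):
        weight = 1.0
        for i, x in enumerate(assignment):
            weight *= unary[i][x]
        for i in range(k - 1):
            weight *= pairwise[i][assignment[i]][assignment[i + 1]]
        for i, x in enumerate(assignment):
            totals[i][x] += weight
    return [[t / sum(row) for t in row] for row in totals]


# Toy potentials (made up): evidence on variables 0 and 2, uniform elsewhere.
coupling = [[0.8, 0.15, 0.05], [0.15, 0.7, 0.15], [0.05, 0.15, 0.8]]
unary = [[0.6, 0.3, 0.1], [1.0, 1.0, 1.0], [0.2, 0.5, 0.3], [1.0, 1.0, 1.0]]
pairwise = [coupling] * (len(unary) - 1)
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
print(bp[1])
```

Both functions return the same marginals; only the enumeration cost grows exponentially with the number of variables.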
Privacy Metrics • Notation: inferred value, actual value, observed SNPs [symbols omitted] • Adversary’s incorrectness [3]: estimation error at SNP i for individual j • Adversary’s uncertainty [4] • Mutual information-based metric [5]: 1 – (normalized) mutual information at SNP i for individual j [3] R. Shokri et al., Quantifying location privacy, S&P 2011 [4] A. Serjantov, G. Danezis, Towards an information theoretic metric for anonymity, PET 2003 [5] D. Agrawal, C.C. Aggarwal, On the design and quantification of privacy preserving data mining algorithms, PODS 2001
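The first two metrics can be sketched from a posterior distribution over one SNP. The exact formulas are not preserved in this transcript, so the forms below are assumptions: incorrectness as the expected L1 distance between the inferred and actual value, and uncertainty as the entropy of the posterior normalized by log 3. The mutual information-based metric is not covered. Function names are hypothetical.

```python
import math


def expected_estimation_error(posterior, actual: int) -> float:
    """Assumed form of incorrectness: sum over x of Pr(x | observed) * |x - actual|."""
    return sum(p * abs(x - actual) for x, p in enumerate(posterior))


def normalized_entropy(posterior) -> float:
    """Assumed form of uncertainty: entropy of the posterior divided by its maximum."""
    h = -sum(p * math.log(p) for p in posterior if p > 0)
    return h / math.log(len(posterior))


# Toy posterior over the SNP values 0, 1, 2 (made up), with actual value 1.
posterior = [0.0, 0.8, 0.2]
print(expected_estimation_error(posterior, actual=1))
print(normalized_entropy(posterior))
```

Higher values of either quantity mean the adversary knows less, that is, more privacy for the individual at that SNP.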
Genomic and Health Privacy • Genomic-privacy metrics • Individual genomic privacy = average value over all of his/her SNPs • Using any of the previously defined privacy metrics • Whole-family genomic privacy = average over all SNPs • Using any of the previously defined privacy metrics • Health-privacy metrics • Privacy of individual j regarding disease d, defined from the set of SNPs associated with disease d, the genomic privacy of individual j at SNP k, and the contribution of SNP k to disease d [equation omitted]
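A minimal sketch of the two aggregations: the plain average for genomic privacy, and a contribution-weighted average for health privacy. The weighted-average form is an assumption inferred from the listed ingredients, since the slide's equation is not preserved; the SNP identifiers and numbers are made up.

```python
def average_privacy(per_snp_privacy) -> float:
    """Genomic privacy as the average of a per-SNP privacy metric."""
    values = list(per_snp_privacy)
    return sum(values) / len(values)


def health_privacy(per_snp_privacy: dict, contributions: dict) -> float:
    """Assumed form: per-SNP privacy weighted by each SNP's contribution to the disease."""
    total = sum(contributions.values())
    weighted = sum(contributions[k] * per_snp_privacy[k] for k in contributions)
    return weighted / total


# Hypothetical example: two disease-associated SNPs contributing equally.
privacy_at_snp = {"snp_1": 0.2, "snp_2": 0.4, "snp_3": 0.9}
disease_weights = {"snp_1": 1.0, "snp_2": 1.0}
print(average_privacy(privacy_at_snp.values()))
print(health_privacy(privacy_at_snp, disease_weights))
```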
Framework Evaluation • Pedigree from Utah • Family containing 4 grandparents, 2 parents, and 5 children • Focusing on chromosome 1 (longest one) • Relying on the three privacy metrics to quantify genomic privacy and health privacy • Using the L1 distance to measure the distance between two SNPs in the estimation error metric
80k SNPs, without LD [Figure: evolution of the genomic privacy of parent P5 when gradually revealing the SNPs of other family members (starting with the most distant family members)]
100 SNPs in the same region, with LD [Figure: evolution of the global genomic privacy for the whole family when gradually revealing 10% of the SNPs (randomly selected at each step)]
Real Attack Example • Linking OpenSNP and Facebook with user names • 6 individuals sharing their SNPs on OpenSNP found on Facebook, who also publicly reveal (some of) their relatives • 29 individuals in 6 different families • With one member revealing his/her SNPs in each family • Health-privacy evaluation for two families • Focusing on SNPs relevant for Alzheimer’s disease • 2 SNPs that contribute equally to the disease predisposition • 1 person/family revealing these 2 SNPs
Summary • Framework to quantify kin genomic privacy given actual observation and background knowledge • Trade-off between time efficiency and attack power • If the attacker is interested only in a subset of targeted SNPs or if he cannot observe the full set of SNPs of a relative, he would make use of the inference method that includes LD • From the decision/policy maker’s point of view, the inference method without LD gives an upper bound on the actual level of genomic privacy of the family members • Optimized protection mechanism • Obfuscation mechanism and combinatorial optimization
Outline • Statistical inference • Kin genomic privacy • M. Humbert, E. Ayday, J.-P. Hubaux, A. Telenti. Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy. CCS 2013 • De-anonymization • Of genomic databases with phenotypic traits • M. Humbert, K. Huguenin, J. Hugonot, E. Ayday, J.-P. Hubaux. De-anonymizing Genomic Databases with Phenotypic Traits. PETS 2015
Genome Sharing and Anonymity • Sharing genomic data with privacy • Naive solution: anonymizing genomic data • Anonymity of genomic data broken with two types of auxiliary information • Census data (ZIP code, birth date, …) [6] • Y-chromosome short tandem repeats (STRs) [7] • Currently not included in the genotypes provided by most popular direct-to-consumer genetic testing providers (such as 23andMe) • Other means to de-anonymize genomic data? [6] L. Sweeney et al., Identifying Participants in the Personal Genome Project by Names, Report, 2013 [7] M. Gymrek et al., Identifying Personal Genomes by Surname Inference, Science, 2013
Genomic-Phenotypic Relations • Physical/phenotypic traits are notably determined by genomic data • These dependencies can be used to infer physical traits [8,9]… • … or to match genomic data with physical/phenotypic traits [8] P. Claes et al., Toward DNA-based facial composites: Preliminary results and validation, Forensic Science International: Genetics, 2014 [9] P. Claes et al., Modeling 3D facial shape from DNA, PLoS Genetics, 2014
Our De-anonymization Attacks • Inputs: most common genomic variants (SNPs) and phenotypic traits (visible and non-visible) • Statistical relationship between genotype and phenotype • (Semi-)supervised: statistics computed over a population with known genomic-phenotypic relations • Unsupervised: qualitative relations given by a genomic knowledge DB (e.g. SNPedia.com)
Typical Attack Scenario • n genotypes: g_1 = (g_{1,1}, g_{1,2}, …, g_{1,s}), g_2 = (g_{2,1}, g_{2,2}, …, g_{2,s}), …, g_n = (g_{n,1}, g_{n,2}, …, g_{n,s}), where g_{i,j} ∈ {0, 1, 2} • 1 phenotype: p_x = (p_{x,1}, p_{x,2}, …, p_{x,t}) • Identification attack: select the genotype g_i that maximizes the likelihood Pr(p_{x,1}, p_{x,2}, …, p_{x,t} | g_{i,1}, g_{i,2}, …, g_{i,s})
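The identification attack can be sketched as an argmax over candidate genotypes. For simplicity this sketch assumes that traits are conditionally independent and that each trait depends on a single SNP, so the likelihood factorizes into a product; that factorization, the probability tables, and all names are illustrative assumptions rather than the model used in the paper.

```python
import math


def log_likelihood(phenotype, genotype, cond_prob, snp_for_trait) -> float:
    """log Pr(phenotype | genotype) under the assumed per-trait factorization."""
    total = 0.0
    for j, trait_value in enumerate(phenotype):
        snp_value = genotype[snp_for_trait[j]]  # SNP assumed to drive trait j
        total += math.log(cond_prob[j][snp_value][trait_value])
    return total


def identify(phenotype, genotypes, cond_prob, snp_for_trait) -> int:
    """Return the index of the genotype maximizing the likelihood of the phenotype."""
    scores = [log_likelihood(phenotype, g, cond_prob, snp_for_trait) for g in genotypes]
    return max(range(len(genotypes)), key=scores.__getitem__)


# Made-up tables: cond_prob[j][snp_value][trait_value] = Pr(trait j | SNP value).
cond_prob = [
    {0: {"blue": 0.1, "brown": 0.9}, 1: {"blue": 0.3, "brown": 0.7},
     2: {"blue": 0.9, "brown": 0.1}},
    {0: {"curly": 0.2, "straight": 0.8}, 1: {"curly": 0.5, "straight": 0.5},
     2: {"curly": 0.8, "straight": 0.2}},
]
snp_for_trait = [0, 1]                  # trait 0 uses SNP 0, trait 1 uses SNP 1
genotypes = [(0, 0), (2, 2), (1, 0)]    # n = 3 candidate genotypes, s = 2 SNPs
match = identify(("blue", "curly"), genotypes, cond_prob, snp_for_trait)
print(match)
```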
Perfect Matching Attack • n genotypes g_1, g_2, …, g_n and n phenotypes p_1, p_2, …, p_n • Find the best matching σ* that maximizes the product of the likelihoods: σ* = argmax_σ Π_{i=1}^{n} Pr(p_{σ(i)} | g_i), which is equivalent to maximizing the sum of the log-likelihoods • Blossom algorithm finding the maximum-weight assignment in O(n^3)
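To show what the matching objective computes, the sketch below searches all permutations by brute force for a tiny n. This is deliberately not the Blossom algorithm named on the slide: it runs in O(n!) time and is only meant to illustrate the objective, using a made-up log-likelihood matrix. The function name is hypothetical.

```python
from itertools import permutations


def best_matching(log_lik):
    """Return (sigma, score) maximizing sum_i log_lik[i][sigma[i]] by exhaustive search.

    log_lik[i][k] stands for log Pr(p_k | g_i); sigma[i] is the phenotype
    assigned to genotype i.
    """
    n = len(log_lik)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(log_lik[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score


# Made-up matrix: genotypes 0 and 1 both prefer phenotype 0 when considered alone.
log_lik = [
    [-0.10, -0.20, -3.00],
    [-0.15, -2.50, -3.00],
    [-2.00, -1.00, -0.30],
]
sigma, score = best_matching(log_lik)
print(sigma, score)
```

In this toy matrix a per-genotype argmax would assign phenotype 0 twice, whereas the matching constraint forces a one-to-one assignment and resolves the conflict jointly.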
Data-driven Evaluation • Raw dump of 818 OpenSNP users • Each profile must include genomic and phenotypic data • But many people do not share their phenotypic traits • By requiring 75% of traits and SNPs present in the data -> 80 individuals • Background knowledge construction • Unsupervised approach: SNPedia.com • Qualitative relations (e.g., «blue eyes more likely») • Supervised approach: OpenSNP data • SNP-trait associations given by SNPedia • Learning of the conditional probabilities of the traits given the SNPs with the OpenSNP data
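The supervised step, learning the conditional probabilities of a trait given a SNP, can be sketched as frequency counting over labeled pairs. The additive (Laplace) smoothing with parameter `alpha` is an assumption added here so that unseen combinations do not get probability zero; the slides do not say how the estimates were smoothed, and the samples below are made up.

```python
from collections import Counter, defaultdict


def learn_conditionals(samples, trait_values, alpha: float = 1.0) -> dict:
    """Estimate Pr(trait value | SNP value) from (snp_value, trait_value) pairs.

    alpha > 0 is an assumed additive-smoothing constant.
    """
    counts = defaultdict(Counter)
    for snp_value, trait_value in samples:
        counts[snp_value][trait_value] += 1
    table = {}
    for g in (0, 1, 2):
        total = sum(counts[g].values()) + alpha * len(trait_values)
        table[g] = {t: (counts[g][t] + alpha) / total for t in trait_values}
    return table


# Made-up training pairs for one SNP and one trait.
samples = [(0, "brown"), (0, "brown"), (0, "blue"), (2, "blue")]
table = learn_conditionals(samples, trait_values=["blue", "brown"])
print(table)
```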
Selected Phenotypic Traits • 17 associated SNPs • 22 associated SNPs + sex chromosome
Results – Identification Attack [Figure panels: attack’s success in the unsupervised scenario; attack’s success in the supervised scenario]
Results – Perfect Matching Attack [Figure panels: attack’s success in the supervised scenario; attack’s success in the unsupervised scenario]
Results – Perfect Matching Attack [Figure: evolution of the attack’s success with n=10 individuals w.r.t. the degree of intimacy with the victims (supervised case)]
Results – Perfect Matching Attack [Figure: attack’s success with n=2 individuals vs. distinguishability between these two individuals (unsupervised case)] • Distinguishability = min(Hamming distance on the phenotypes, Hamming distance on the genotypes)
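The distinguishability measure used on this slide can be sketched directly from its definition. The function names and the toy vectors are hypothetical.

```python
def hamming(a, b) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distinguishability(pheno_a, pheno_b, geno_a, geno_b) -> int:
    """min(Hamming distance on the phenotypes, Hamming distance on the genotypes)."""
    return min(hamming(pheno_a, pheno_b), hamming(geno_a, geno_b))


# Made-up pair of individuals: 3 traits and 4 SNPs each.
value = distinguishability(
    ("blue", "curly", "tall"), ("brown", "curly", "short"),
    (0, 1, 2, 2), (0, 1, 2, 0),
)
print(value)
```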
Summary • Two novel de-anonymization attacks • Making use of the most common genomic variants • Mostly relying on existing genomic knowledge • Main results • Identification attack outperforming the baseline by a factor of 3 to 8 • Perfect matching attack more successful than the identification attack: 23% correct matches with 50 individuals • These results will naturally improve (or worsen from a privacy point of view!) with the progress of genomic knowledge • Future work • Use more data => enhanced supervised approach • Implementation of countermeasures (e.g., obfuscation)
Conclusion • The genomic revolution is coming • Millions (if not billions) of people’s DNA will be sequenced in the next decade • Given the very sensitive information it contains, the genome must be protected • First step towards more genomic privacy: fully characterize and formally quantify the risks in order to • Raise general awareness about the risks • Design proper protection mechanisms • Open crucial questions • Economic value and legal ownership of the genomic data
genomeprivacy.org New community website • Searchable list of publications in genome privacy and security • List of major media news on the topic (from Science, Nature, GenomeWeb, etc.) • Research groups and companies involved • Tutorial and tools • Events (past & future)