
Inference and De-anonymization Attacks against Genomic Privacy


Presentation Transcript


  1. Inference and De-anonymization Attacks against Genomic Privacy ETH Zürich October 1, 2015 Mathias Humbert Joint work with Erman Ayday, Jean-Pierre Hubaux, Kévin Huguenin, Joachim Hugonot, Amalio Telenti

  2. (Human) System Security How is a human system encoded? Programmer 1, Programmer 2. The human genome can be represented as a sequence of ternary values (called SNPs/SNVs). Computer -> human systems: binary -> ternary values!
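
A minimal sketch of this ternary encoding, assuming each SNP value counts the minor alleles (0, 1, or 2) carried at a position; the SNP IDs, genotypes, and minor alleles below are made up for illustration.

```python
# Toy ternary SNP encoding: each value is the number of minor alleles (0, 1, or 2)
# carried at a position. SNP IDs and alleles here are illustrative, not real data.

def encode_snp(genotype: str, minor_allele: str) -> int:
    """Count how many of the two alleles at this position are the minor allele."""
    return sum(1 for allele in genotype if allele == minor_allele)

genotypes = {"rs0001": ("AG", "G"), "rs0002": ("TT", "T"), "rs0003": ("CC", "T")}
snp_vector = [encode_snp(gt, minor) for gt, minor in genotypes.values()]
print(snp_vector)  # [1, 2, 0]
```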

  3. Programming Human Beings… Programmer 1: Father. Programmer 2: Mother. Gamete production from each parent combines into the Child.

  4. Genomic Data Deluge • Genotyping < $100 today • > 950k people genotyped by 23andMe • Recent governmental and industrial initiatives • President Obama’s Precision Medicine Initiative (01/2015) => 1M+ citizens • Google Genomics (API to store, process, explore, and share DNA data) • Microsoft Research (genomic research in collaboration with Sanger Center) • Global Alliance for Genomics & Health (common framework for effective, responsible and secure sharing of genomic and clinical data) • Genomic-data benefits • Providing substantial improvement in diagnosis and personalized medicine • Helping medical research progress • Sharing of genomic data • Thousands of genomes are already available online (OpenSNP, Personal Genome Project, …) • First motivation for sharing: help research [1] [1] http://opensnp.wordpress.com/2011/11/17/first-results-of-the-survey-on-sharing-genetic-information/

  5. Genomic Privacy Risks • Genome carries sensitive information about • Predisposition to diseases • Genetic discrimination in health or life insurance, … • Future physical conditions • Genetic discrimination in work, sports, ... • Kinship • Familial tragedies (like divorce caused by the discovery of illegitimate offspring [2]) • Physical appearance, metabolism • The privacy situation is worsened by • The non-revocability of genomic data • Interdependent risks [2] http://www.vox.com/2014/9/9/5975653/with-genetic-testing-i-gave-my-parents-the-gift-of-divorce-23andme

  6. Outline • Statistical inference • Kin genomic privacy • M. Humbert, E. Ayday, J.-P. Hubaux, A. Telenti. Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy. CCS 2013 • De-anonymization • Of genomic databases with phenotypic traits • M. Humbert, K. Huguenin, J. Hugonot, E. Ayday, J.-P. Hubaux. De-anonymizing Genomic Databases with Phenotypic Traits. PETS 2015

  7. Henrietta Lacks and her Family

  8. Cross-Website Attack Correlated genetic information between family members => an individual sharing his/her genome threatens his (known) relatives’ genomic privacy

  9. Genomics 101 • Human genome consists of 3 billion nucleotide pairs, i.e. 3B pairs of 4 letters (A, C, G, or T) • Organized into 23 pairs of chromosomes • ~99.9% of the genome is identical between 2 individuals • Single nucleotide polymorphism (SNP) • > 50 million SNP positions in human genome • Disease risk can be computed by analyzing particular SNPs

  10. Linkage Disequilibrium (LD) • Linkage disequilibrium: correlation between pairs of SNPs • D = Pr(X=A, Y=B) – Pr(X=A)Pr(Y=B) • D’ = normalized D • From these LD metrics, we can compute pairwise joint probabilities between any SNPs (Figure: heatmap of D’ values over SNP ID × SNP ID, comparing observed frequencies with the expected frequencies under independence.)
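
A minimal sketch of these two LD statistics for a pair of bi-allelic SNPs, and of recovering the pairwise joint probability from D. The allele and haplotype frequencies are made-up numbers, and the D' normalization follows the standard Lewontin definition (an assumption, since the slide does not spell it out).

```python
# LD statistics for two bi-allelic SNPs X and Y (illustrative frequencies).

def ld_statistics(p_a: float, p_b: float, p_ab: float):
    """p_a, p_b: frequencies of allele A at SNP X and allele B at SNP Y;
    p_ab: frequency of the haplotype carrying both A and B."""
    d = p_ab - p_a * p_b  # D = Pr(X=A, Y=B) - Pr(X=A)Pr(Y=B)
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # D' = normalized D
    return d, d_prime

d, d_prime = ld_statistics(p_a=0.3, p_b=0.4, p_ab=0.18)
joint_ab = 0.3 * 0.4 + d  # Pr(X=A, Y=B) recovered from the allele frequencies and D
print(d, d_prime, joint_ab)
```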

  11. Quantifying Kin Genomic Privacy • Quantifying privacy risks • With respect to the amount of genomic data that is revealed, and the relative(s) revealing it • Considering the background knowledge of the adversary (familial relationships, LD values, minor allele frequencies) • Designing efficient inference algorithms that mimic reconstruction attacks given background knowledge • In order to propose protection mechanisms to reduce the inherent genomic-privacy risk

  12. Reconstruction Attacks • Adversary’s objective: compute the posterior marginal probabilities of the family’s SNPs given: • The observed data (publicly available SNPs), shown as a matrix of relatives × SNP positions • The background knowledge (inheritance probabilities, population allele frequencies, LD statistics) • LD statistics are given by a sparse pairwise joint probability matrix L where L_i,j = Pr(X_i, X_j)

  13. Inference Algorithms • Naive marginalization of any of the random variables has computational complexity O(3^{mn}), for n family members and m SNPs • We chose to run belief propagation (a.k.a. message passing) on graphical models to reduce the computational complexity • Exact inference without considering LD between SNPs • Junction tree algorithm = belief propagation on a junction tree • Complexity = O(mn) • Approximate inference if LD included • Loopy belief propagation on a factor graph • Complexity = O(mn) per iteration
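
To make the reconstruction objective and the cost of naive marginalization concrete, here is a brute-force enumeration for a single SNP in a trio (father, mother, child), using Mendelian inheritance and a Hardy-Weinberg founder prior; the minor allele frequency is an assumed value, and the talk's framework replaces this enumeration with belief propagation over whole pedigrees and many SNPs.

```python
# Brute-force posterior marginal for one SNP of a trio, given one observed relative.
from itertools import product

MAF = 0.3  # minor allele frequency (assumed)

def founder_prior(g: int) -> float:
    """Hardy-Weinberg prior over minor-allele counts 0/1/2."""
    return [(1 - MAF) ** 2, 2 * MAF * (1 - MAF), MAF ** 2][g]

def mendelian(child: int, father: int, mother: int) -> float:
    """Pr(child genotype | parents); each parent transmits a minor allele w.p. g/2."""
    pf, pm = father / 2, mother / 2
    return {0: (1 - pf) * (1 - pm),
            1: pf * (1 - pm) + (1 - pf) * pm,
            2: pf * pm}[child]

observed_father = 2  # the father publicly revealed this SNP
posterior = {c: 0.0 for c in (0, 1, 2)}
for mother, child in product((0, 1, 2), repeat=2):
    posterior[child] += (founder_prior(observed_father) * founder_prior(mother)
                         * mendelian(child, observed_father, mother))
total = sum(posterior.values())
posterior = {c: v / total for c, v in posterior.items()}
print(posterior)  # child's posterior marginal given the father's revealed SNP
```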

  14. Privacy Metrics Notation: x̂_i^j: inferred value, x_i^j: actual value, X^O: observed SNPs • Adversary’s incorrectness [3]: estimation error at SNP i for individual j, E_i^j = sum over x̂ of Pr(x̂_i^j = x̂ | X^O) ||x̂ – x_i^j|| • Adversary’s uncertainty [4]: (normalized) entropy of the posterior distribution of SNP i for individual j • Mutual information-based metric [5]: 1 – (normalized) mutual information at SNP i for individual j [3] R. Shokri et al., Quantifying location privacy, S&P 2011 [4] A. Serjantov, G. Danezis, Towards an information theoretic metric for anonymity, PET 2003 [5] D. Agrawal, C.C. Aggarwal, On the design and quantification of privacy preserving data mining algorithms, PODS 2001
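
A minimal sketch of the three per-SNP metrics, computed from the attacker's posterior over the ternary SNP value. The normalization choices (log base, normalizing entropy by log 3, approximating the mutual-information metric per SNP as the ratio of posterior to prior entropy) are assumptions for illustration.

```python
# Per-SNP privacy metrics for one SNP i of individual j.
import math

def incorrectness(posterior, true_value):
    """Expected estimation error (L1 distance); higher = more private."""
    return sum(p * abs(x - true_value) for x, p in enumerate(posterior))

def uncertainty(posterior):
    """Entropy of the posterior, normalized by the entropy of a uniform ternary value."""
    h = -sum(p * math.log2(p) for p in posterior if p > 0)
    return h / math.log2(3)

def mutual_info_metric(posterior, prior):
    """1 - normalized mutual information, approximated here as
    H(posterior) / H(prior) -- an illustrative simplification."""
    h_post = -sum(p * math.log2(p) for p in posterior if p > 0)
    h_prior = -sum(p * math.log2(p) for p in prior if p > 0)
    return h_post / h_prior if h_prior > 0 else 1.0

posterior = [0.1, 0.6, 0.3]   # Pr(SNP value = 0/1/2 | observed data)
prior = [0.49, 0.42, 0.09]    # population prior (Hardy-Weinberg, MAF = 0.3)
print(incorrectness(posterior, true_value=1),
      uncertainty(posterior),
      mutual_info_metric(posterior, prior))
```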

  15. Genomic and Health Privacy • Genomic-Privacy Metrics • Individual genomic privacy = average value over all of his SNPs • Using any of the previously defined privacy metrics • Whole family genomic privacy = average over all SNPs of all family members • Using any of the previously defined privacy metrics • Health-Privacy Metric • Privacy of individual j regarding disease d = sum over k in S_d of c_k^d · E_k^j, where S_d: set of SNPs associated with disease d, E_k^j: genomic privacy of individual j at SNP k, c_k^d: contribution of SNP k to disease d
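
A minimal sketch of how the per-SNP values might be aggregated into the individual, family, and health-privacy scores; the per-SNP values, SNP IDs, and disease weights below are illustrative assumptions (the equal weights mirror the two equally contributing Alzheimer's SNPs mentioned on slide 19).

```python
# Aggregating per-SNP privacy into the slide's genomic- and health-privacy metrics.

per_snp_privacy = {            # individual -> {SNP id -> per-SNP privacy value}
    "P5": {"rs429358": 0.8, "rs7412": 0.6, "rs123": 0.9},
    "C1": {"rs429358": 0.5, "rs7412": 0.4, "rs123": 0.7},
}
disease_weights = {"rs429358": 0.5, "rs7412": 0.5}  # c_k^d, assumed equal contributions

def individual_genomic_privacy(j):
    vals = per_snp_privacy[j].values()
    return sum(vals) / len(vals)

def family_genomic_privacy():
    return sum(individual_genomic_privacy(j) for j in per_snp_privacy) / len(per_snp_privacy)

def health_privacy(j, weights):
    return sum(c * per_snp_privacy[j][k] for k, c in weights.items())

print(individual_genomic_privacy("P5"), family_genomic_privacy(),
      health_privacy("P5", disease_weights))
```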

  16. Framework Evaluation • Pedigree from Utah • Family containing 4 grandparents, 2 parents, and 5 children • Focusing on chromosome 1 (longest one) • Relying on the three privacy metrics to quantify genomic privacy and health privacy • Using the L1 distance to measure the distance between two SNPs in the estimation error metric

  17. 80k SNPs, without LD. Evolution of the genomic privacy of parent P5 by gradually revealing the SNPs of other family members (starting with the most distant family members)

  18. 100 SNPs in the same region, with LD. Evolution of the global genomic privacy for the whole family by gradually revealing 10% of the SNPs (that are randomly selected at each step)

  19. Real Attack Example • Linking OpenSNP and Facebook with user names • 6 individuals sharing their SNPs on OpenSNP found on Facebook, who also publicly reveal (some of) their relatives • 29 individuals in 6 different families • With one member revealing his/her SNPs in each family • Health-privacy evaluation for two families • Focusing on SNPs relevant for Alzheimer’s disease • 2 SNPs that are equally contributing to the disease predisposition • 1 person/family revealing these 2 SNPs

  20. Summary • Framework to quantify kin genomic privacy given actual observation and background knowledge • Trade-off between time efficiency and attack power • If the attacker is interested only in a subset of targeted SNPs or if he cannot observe the full set of SNPs of a relative, he would make use of the inference method that includes LD • From the decision/policy maker’s point of view, the inference method without LD gives an upper bound on the actual level of genomic privacy of the family members • Optimized protection mechanism • Obfuscation mechanism and combinatorial optimization

  21. Outline • Statistical inference • Kin genomic privacy • M. Humbert, E. Ayday, J.-P. Hubaux, A. Telenti. Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy. CCS 2013 • De-anonymization • Of genomic databases with phenotypic traits • M. Humbert, K. Huguenin, J. Hugonot, E. Ayday, J.-P. Hubaux. De-anonymizing Genomic Databases with Phenotypic Traits. PETS 2015

  22. Genome Sharing and Anonymity • Sharing genomic data with privacy • Naive solution: anonymizing genomic data • Anonymity of genomic data broken with two types of auxiliary information • Census data (ZIP code, birth date, …) [6] • Y-chromosome short tandem repeats (STRs) [7] • Currently not included in the genotypes provided by most popular direct-to-consumer genetic testing providers (such as 23andMe) • Other means to de-anonymize genomic data? [6] L. Sweeney et al., Identifying Participants in the Personal Genome Project by Names, Report, 2013 [7] M. Gymrek et al., Identifying Personal Genomes by Surname Inference, Science, 2013

  23. Genomic-Phenotypic Relations • Physical/phenotypic traits are notably determined by genomic data • These dependencies can be used to infer physical traits [8,9]… • … or to match genomic data with physical/phenotypic traits [8] P. Claes et al., Toward DNA-based facial composites: Preliminary results and validation, Forensic Science International: Genetics, 2014 [9] P. Claes et al., Modeling 3D facial shape from DNA, PLoS Genetics, 2014

  24. Our De-anonymization Attacks • Match the most common genomic variants (SNPs) against phenotypic traits (visible and non-visible) using the statistical relationship between genotype and phenotype • (Semi-)supervised attack: statistics computed over a population with known genomic-phenotypic relations • Unsupervised attack: qualitative relations given by a genomic knowledge DB (e.g. SNPedia.com)

  25. Typical Attack Scenario • n genotypes: g_1 = (g_1,1, g_1,2, …, g_1,s), g_2 = (g_2,1, g_2,2, …, g_2,s), …, g_n = (g_n,1, g_n,2, …, g_n,s), where g_i,j ∈ {0, 1, 2} • 1 phenotype: p_x = (p_x,1, p_x,2, …, p_x,t) • Identification attack: select the genotype g_i that maximizes the likelihood Pr(p_x,1, p_x,2, …, p_x,t | g_i,1, g_i,2, …, g_i,s)
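
A minimal sketch of the identification attack under a naive-Bayes-style factorization over traits. The SNP IDs correspond to commonly cited eye-color and lactase-persistence variants, but the conditional probability values and candidate genotypes here are hypothetical.

```python
# Identification attack: pick the candidate genotype that maximizes the
# likelihood of the observed phenotype (traits assumed independent given the SNPs).
import math

# Pr(trait present | genotype at the trait's associated SNP) -- hypothetical values.
cond_prob = {
    "blue_eyes": {0: 0.10, 1: 0.35, 2: 0.80},
    "lactose_intolerant": {0: 0.85, 1: 0.30, 2: 0.05},
}
trait_snp = {"blue_eyes": "rs12913832", "lactose_intolerant": "rs4988235"}

def log_likelihood(phenotype, genotype):
    ll = 0.0
    for trait, present in phenotype.items():
        p = cond_prob[trait][genotype[trait_snp[trait]]]
        ll += math.log(p if present else 1.0 - p)
    return ll

phenotype = {"blue_eyes": True, "lactose_intolerant": False}
genotypes = {
    "g1": {"rs12913832": 2, "rs4988235": 1},
    "g2": {"rs12913832": 0, "rs4988235": 2},
}
best = max(genotypes, key=lambda i: log_likelihood(phenotype, genotypes[i]))
print(best)  # genotype most likely to belong to the observed phenotype
```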

  26. Perfect Matching Attack • n genotypes g_1, …, g_n to be matched against n phenotypes p_1, …, p_n • Find the best matching σ* that maximizes the product of the likelihoods, σ* = argmax_σ ∏_i Pr(p_σ(i) | g_i), which is equivalent to maximizing the sum of the log-likelihoods • Blossom algorithm finds the maximum weight assignment in O(n^3)
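
A minimal sketch of the matching step. The weights are log-likelihoods log Pr(p_j | g_i) (made-up numbers here; in practice they would come from a likelihood computation like the one sketched above). The slide's blossom algorithm handles maximum-weight matching in general graphs; since genotypes and phenotypes form a bipartite graph, this sketch substitutes SciPy's Hungarian-algorithm solver.

```python
# Maximum-weight one-to-one assignment of genotypes to phenotypes.
import numpy as np
from scipy.optimize import linear_sum_assignment

log_lik = np.array([          # rows: genotypes g_i, columns: phenotypes p_j
    [-1.2, -4.5, -3.1],
    [-3.8, -0.9, -2.7],
    [-2.5, -3.3, -1.1],
])
rows, cols = linear_sum_assignment(-log_lik)   # negate to maximize total log-likelihood
matching = {f"g{i+1}": f"p{j+1}" for i, j in zip(rows, cols)}
print(matching)  # best genotype-to-phenotype assignment
```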

  27. Data-driven Evaluation • Raw dump of 818 OpenSNP users • Each profile must include genomic and phenotypic data • But many people not sharing their phenotypic traits • By requiring 75% of traits and SNPs present in the data -> 80 individuals • Background knowledge construction • Unsupervised approach: SNPedia.com • Qualitative relations (e.g., «blue eyes more likely») • Supervised approach: OpenSNP data • SNP-trait associations given by SNPedia • Learning of the conditional probabilities of the traits given the SNPs with the OpenSNP data
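
A minimal sketch of the supervised background-knowledge step: estimating Pr(trait | genotype at the associated SNP) from profiles that expose both, with Laplace smoothing. The record format and counts are assumptions, not the paper's actual pipeline.

```python
# Learning a trait-given-genotype conditional probability table from labeled profiles.
from collections import defaultdict

profiles = [  # (genotype at the trait's associated SNP, trait observed?)
    (2, True), (2, True), (1, True), (1, False), (0, False), (0, False),
]

counts = defaultdict(lambda: [0, 0])   # genotype -> [trait absent, trait present]
for genotype, has_trait in profiles:
    counts[genotype][int(has_trait)] += 1

cond_prob = {g: (c[1] + 1) / (c[0] + c[1] + 2)   # Laplace smoothing
             for g, c in counts.items()}
print(cond_prob)  # Pr(trait | genotype) for each genotype value seen in the data
```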

  28. Selected Phenotypic Traits • 17 associated SNPs • 22 associated SNPs + sex chromosome

  29. Results – Identification Attack. Attack’s success in the unsupervised scenario; attack’s success in the supervised scenario

  30. Results – Perfect Matching Attack. Attack’s success in the supervised scenario; attack’s success in the unsupervised scenario

  31. Results – Perfect Matching Attack. Evolution of attack’s success with n=10 individuals w.r.t. the degree of intimacy with the victims (supervised case)

  32. Results – Perfect Matching Attack. Attack’s success with n=2 individuals vs. distinguishability between these two individuals (unsupervised case), where distinguishability = min(Hamming distance on the phenotypes, Hamming distance on the genotypes)

  33. Summary • Two novel de-anonymization attacks • Making use of most common genomic variants • Mostly relying on existing genomic knowledge • Main results • Identification attack outperforming the baseline by 3 to 8 times • Perfect matching attack more successful than the identification attack: 23% correct matches with 50 individuals • These results will naturally improve (or worsen from a privacy point of view!) with the progress of genomic knowledge • Future work • Use more data => enhanced supervised approach • Implementation of countermeasures (e.g., obfuscation)

  34. Conclusion • The genomic revolution is coming • Millions (if not billions) of people’s DNA will be sequenced in the next decade • Given the very sensitive information it contains, the genome must be protected • First step towards more genomic privacy: fully characterize and formally quantify the risks in order to • Raise general awareness about the risks • Design proper protection mechanisms • Open crucial questions • Economic value and legal ownership of the genomic data

  35. genomeprivacy.org New community website • Searchable list of publications in genome privacy and security • List of major media news on the topic (from Science, Nature, GenomeWeb, etc.) • Research groups and companies involved • Tutorial and tools • Events (past & future)
