Dive into the complexities of genomic data privacy, security, and ethical considerations. Learn about harms, anonymization limitations, controlled access, and SSH usage for secure data interactions.
Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 2: Ethics of Data Usage and Security Mark Phillips Bioinformatics on Big Data: Computing on the Human Genome 29 – 30 September 2016
Learning Objectives of Module 2 Participants will be able to … • identify harms stemming from genomic data privacy breach; • understand anonymization’s limitations in preserving privacy; • understand controlled-access as an alternative; and • understand why and how ssh is used to securely interact with a virtual machine.
Harms to Genomic Data Subjects Discrimination in insurance, employment ... Disclosure of sensitive health information • directly, e.g. susceptibility to disease • Indirectly, e.g. attribute disclosure Paternity information Identity theft, when used as a biometric identifier Legal jeopardy Future uses • rapid development expected in the area • exacerbated by DNA’s “stasis”
Are Genomics Exceptional? Kinship. Contains data on blood relatives. Static. Doesn’t change over time. Unique. Identifies a person. Health & behaviour. Mystique. Public perception. Value. Important & increasingly so.
Relevant Ethical & Legal Rules Personal information protection law • Distinct laws govern public, private, and health sectors • Distinct laws govern federal and provincial jurisdictions Ethical research oversight • Research Ethics Boards / Institutional Review Boards • Data Access Compliance Offices Clinical health data • Patient confidentiality duties Unifying thread: Informed consent
Projects of the International Cancer Genome Consortium (map via Lincoln Stein)
Harms to Researchers • Breach and misuse can result in • Loss of participant confidence in the researcher and the field • Cancellation of funding and data access • Fines under privacy laws • Canadian statutes • US HIPAA • Occasionally, criminal or penal sanctions • 6-month jail sentence in Google Italy v. Associazione Vivi Down • Novel risks complicate informed consent • Regulation that is overly cumbersome can stall research
Research Value of Open Genomic Data • “Guidelines for Sharing Data & Resources” (NIH-DoE Joint Subcommittee, 1993) • “Data Release and Resource Sharing Policy” (Genome Canada, 2008) • “Open Access Policy” (Canadian Institutes of Health Research (CIHR), 2008) • “Genomic Data Sharing Policy” (NIH, 2014) • “Tri-Agency Open Access Policy on Publications” (2015)
Anonymization: Privacy Panacea? • Law & policy regulate only “personal data” • De-identifying data can free researchers from the duties that would otherwise apply • Legal definitions vary, but commonly ask whether the data, alone or in combination with other data, allows a person to be identified, directly or indirectly
Loss of Confidence Sophisticated re-identification methods have eroded confidence in anonymization’s potential • See especially Paul Ohm, “Broken Promises of Privacy” (2010) The consensus is now that “Data Cannot be Fully Anonymized and Remain Useful” (Dwork and Roth 2014) • Especially true of high-dimensional data, notably genomics Ohm argues elsewhere that “[p]rivacy laws should continue to apply even to data that has been de-identified, at least for the most sensitive forms of data.”
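The point about high-dimensional data can be illustrated with a toy check of how many records share the same combination of attributes. This is only a sketch: the records and field names below are invented for illustration, and the function simply reports the size of the smallest group of indistinguishable records (the "k" of k-anonymity).

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values for the chosen fields. A result of 1 means at least one record
    is unique on those fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Invented toy records; not drawn from any real dataset.
TOY_RECORDS = [
    {"age_range": "40-49", "sex": "F", "postcode": "H3A"},
    {"age_range": "40-49", "sex": "F", "postcode": "H3B"},
    {"age_range": "40-49", "sex": "M", "postcode": "H3A"},
    {"age_range": "40-49", "sex": "M", "postcode": "H3A"},
]

if __name__ == "__main__":
    for fields in (["age_range"], ["age_range", "sex"], ["age_range", "sex", "postcode"]):
        print(fields, smallest_group_size(TOY_RECORDS, fields))
```

Each added attribute can only split groups further, so the smallest group never grows as dimensions are added. With thousands of genomic variants per record, groups of size one are the norm.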
Criticisms Unrelated to Privacy On the basis of research ethics duties (ICH) • Duty to allow participant withdrawal from research at any moment • Duty to return incidental findings to participants To maximize data-derived benefits • Inability to link a person’s records between datasets, or detect duplicate records • Inability to update prospectively
Anonymous Genomic Data? Nietfeld, “What is anonymous?” (2007, EMBO reports) • If DNA is stored “without identity data … and if there is no clue from whom the sample originates, such a sample is de facto anonymous.” • Anonymous DNA data means “Samples stored without identity data or a code.” But re-identification techniques have proliferated • Lin et al (2004): 75 SNPs can uniquely identify an individual • Homer et al (2008): Partial genetic information can be used to identify a person as belonging to either a study’s control or affected group • Gymrek et al (2013): Linking bioinformatic profiles to genealogical databases can re-identify up to 10% of anonymized WGS datasets • Cai et al (2015): Re-identification based on 25 randomly selected SNPs from Wellcome Trust data
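A back-of-the-envelope calculation suggests why so few SNPs can be identifying. The sketch below is an idealization, not the method of any of the cited papers: it assumes each SNP has three equally likely genotypes and that SNPs are independent, which real data do not satisfy. It therefore gives only a lower bound on the number of SNPs needed; real SNPs carry less information each, which is consistent with published figures being somewhat higher.

```python
def min_snps_for_unique_profiles(population, genotypes_per_snp=3):
    """Smallest number of SNPs whose possible genotype combinations are at
    least as numerous as the population (idealized: independent SNPs with
    equally likely genotypes)."""
    snps = 0
    combinations = 1
    while combinations < population:
        combinations *= genotypes_per_snp
        snps += 1
    return snps


if __name__ == "__main__":
    # Roughly the world population, used here only as a round number.
    print(min_snps_for_unique_profiles(8_000_000_000))
```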
Novel Cryptographic Methods Secure Multiparty Computing, e.g. DataSHIELD • analysis of pooled data • sensitive personal data remains secure on local computers Homomorphic encryption • Encrypted analysis Drawbacks • rely on “semi-trusted” entities • Difficulty scaling to real-world applications
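To make the idea of analysing pooled data without revealing local values concrete, here is a minimal sketch of additive secret sharing, one textbook building block of secure multiparty computation. It is not DataSHIELD's implementation; the function names and the modulus are assumptions chosen for illustration, and the sketch assumes parties that follow the protocol honestly.

```python
import secrets

# Public modulus agreed by all parties (illustrative choice).
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random shares that sum to value mod modulus.
    Any subset of fewer than n_parties shares reveals nothing about value."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last_share = (value - sum(shares)) % modulus
    return shares + [last_share]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate every party sharing its value with the others, then
    publishing only the subtotal of the shares it received."""
    n = len(private_values)
    # all_shares[i][j] is the share that party i sends to party j.
    all_shares = [make_shares(value, n, modulus) for value in private_values]
    subtotals = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    return sum(subtotals) % modulus


if __name__ == "__main__":
    # Hypothetical per-site case counts.
    print(secure_sum([12, 30, 7]))
```

Only the total is revealed; each individual count stays hidden behind random shares. The drawbacks listed above still apply: every message doubles as coordination overhead, and the guarantee depends on how far the parties can be trusted.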
Differential Privacy Statistics-backed promise to participants • Your privacy will not be undermined by allowing your data to be used, no matter what other data sets are available (Dwork and Roth 2014) Somewhat similar to k-anonymity Shortcomings • noise injection degrades data • iterative analysis vulnerabilities
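The noise-injection tradeoff can be sketched with the Laplace mechanism, a standard way to release a count under differential privacy. This is a minimal illustration, not a production implementation: the function names are assumptions, and the Laplace noise is drawn as the difference of two exponential samples using only the standard library.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng=None, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.
    A smaller epsilon gives stronger privacy and a noisier answer."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    for epsilon in (1.0, 0.1):
        print(epsilon, private_count(100, epsilon))
```

Each additional query spends more of the privacy budget, which is one face of the iterative analysis vulnerability noted above: averaging many noisy answers to the same question recovers the true value unless the total budget is capped.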
Controlled Access: ICGC Datasets Open Access: • Cancer Pathology • Histologic type or subtype • Histologic nuclear grade • Patient/Person • Gender, age range, vital status, survival time, relapse type, status at follow-up • Gene Expression (normalized) • DNA methylation • Computed Copy Number and Loss of Heterozygosity • Somatic variants from Exome / WGS Controlled Access: • Detailed Phenotype and Outcome data • Gene Expression (probe-level data) • Raw genotype calls • Gene-sample identifier links • Genome sequence files • Germline variants
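The two-tier model can be expressed as a simple lookup. The sketch below is hypothetical and is not an ICGC API: the dictionary reflects one reading of the slide's two columns, and `may_access` and its parameter name are invented for illustration.

```python
OPEN = "open"
CONTROLLED = "controlled"

# Tier assignment as read from the slide; keys are illustrative labels.
DATA_TIERS = {
    "Cancer pathology": OPEN,
    "Patient/person summary": OPEN,
    "Gene expression (normalized)": OPEN,
    "DNA methylation": OPEN,
    "Copy number and loss of heterozygosity": OPEN,
    "Somatic variants": OPEN,
    "Detailed phenotype and outcome data": CONTROLLED,
    "Gene expression (probe-level)": CONTROLLED,
    "Raw genotype calls": CONTROLLED,
    "Gene-sample identifier links": CONTROLLED,
    "Genome sequence files": CONTROLLED,
    "Germline variants": CONTROLLED,
}


def may_access(data_type, has_approved_application):
    """Open data is available to anyone; controlled data requires an
    approved access application (hypothetical flag)."""
    return DATA_TIERS[data_type] == OPEN or has_approved_application


if __name__ == "__main__":
    print(may_access("DNA methylation", False))
    print(may_access("Germline variants", False))
```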
ICGC DACO Agreement Contents ICGC Guidelines 2008 (as updated by later amendments) Privacy and Security • ICGC (Cloud) Security Best Practices for Controlled-Access Data (updated 2015) • DACO IT Security Assessment and Access Agreement • GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data Data Sharing • Fort Lauderdale Principles (2003) • Toronto Principles (2009) Intellectual Property • OECD Guidelines for the Licensing of Genetic Inventions (2006) • NIH Best Practices for the Licensing of Genomic Inventions (2005)
Data Security • Separate from, but related to, privacy concerns • Characterized, once again, by tradeoffs • This section: Best practice in connecting to VMs • Password-only authentication has various shortcomings • weak passwords are common • people lose or forget their passwords • SSH key-based authentication is a common alternative • public-key cryptography: each key pair has a public and a private key • strong keys • convenient to use for a number of services
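Generating a key pair is usually done with OpenSSH's `ssh-keygen`. The sketch below wraps that command from Python; it assumes `ssh-keygen` is installed, and the function names, key path, and comment are placeholders. The Ed25519 key type is one common choice, not a requirement of the workshop.

```python
import subprocess
from pathlib import Path


def keygen_command(private_key_path, comment):
    """Build the ssh-keygen command line for a new Ed25519 key pair.
    ssh-keygen prompts interactively for a passphrase."""
    return ["ssh-keygen", "-t", "ed25519", "-f", str(private_key_path), "-C", comment]


def generate_keypair(private_key_path, comment):
    """Run ssh-keygen and return the path of the public key it writes."""
    subprocess.run(keygen_command(private_key_path, comment), check=True)
    return Path(str(private_key_path) + ".pub")


if __name__ == "__main__":
    # Placeholder path and comment; shows the command without running it.
    print(" ".join(keygen_command(Path.home() / ".ssh" / "id_workshop", "me@laptop")))
```

Only the `.pub` file is copied to the VM; the private key stays on the laptop, ideally protected by a passphrase.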
Secure Shell (ssh) [Diagram: three laptops labelled public, public, and private, joined across the Internet by an SSH tunnel] Uses of the tunnel: • execute commands (ssh) • transfer files (scp, sftp, rsync) • remote desktop, etc.
Ports and Ongoing VM Usage • ssh allows you to connect to the VM • through an end-to-end encrypted connection • to a VM service listening on a single port • Have the VM’s ssh server listen on a random port in the dynamic range (49,152 through 65,535) • The VM’s firewall should block as many of the remaining ports as practicable • It’s crucial to limit access to the private key • whether through network exposure or through physical access • minimize any risk of losing a device holding the key
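Choosing the port can be sketched as follows. The range constants are the IANA dynamic/private port range (49152 through 65535); the function names are assumptions, and `Port` is the standard `sshd_config` keyword for the listening port.

```python
import secrets

# IANA dynamic/private port range.
DYNAMIC_PORT_MIN = 49152
DYNAMIC_PORT_MAX = 65535


def pick_ssh_port():
    """Pick a port uniformly at random from the dynamic range."""
    return DYNAMIC_PORT_MIN + secrets.randbelow(DYNAMIC_PORT_MAX - DYNAMIC_PORT_MIN + 1)


def sshd_port_line(port):
    """Return the sshd_config line that makes the server listen on port."""
    if not DYNAMIC_PORT_MIN <= port <= DYNAMIC_PORT_MAX:
        raise ValueError("port is outside the dynamic range")
    return f"Port {port}"


if __name__ == "__main__":
    print(sshd_port_line(pick_ssh_port()))
```

The client then names the port explicitly, for example `ssh -p 50022 user@host` (note that `scp` uses an uppercase `-P` for the same purpose). A non-default port mainly reduces automated scanning noise; it does not replace key-based authentication or the firewall.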
Ports and Ongoing VM Usage • If you have shared your keys or exposed them to other risks, regularly replace them with a regenerated pair • Shut down your VM whenever it is not in use • Consult additional resources to harden your configuration • e.g. prohibiting password-only ssh connections to the VM • Don’t ignore ssh authentication warnings • If you don’t understand one, ask someone who does before overriding
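Whether password logins are prohibited can be checked by reading the server's `sshd_config`. The sketch below is a simplified parser with assumed function names: it looks only at the global section (it stops at the first `Match` block), ignores `Include` files and the `keyword=value` form, and relies on two documented OpenSSH behaviours, namely that the first value found for a keyword wins and that `PasswordAuthentication` defaults to `yes`.

```python
def effective_setting(config_text, keyword, default):
    """Return the first global value set for keyword, lowercased,
    or default if the keyword does not appear."""
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)
        if parts[0].lower() == "match":
            break  # conditional blocks are out of scope for this sketch
        if len(parts) == 2 and parts[0].lower() == keyword.lower():
            return parts[1].strip().lower()
    return default


def password_login_disabled(config_text):
    """True if the global section sets PasswordAuthentication to no."""
    return effective_setting(config_text, "PasswordAuthentication", "yes") == "no"


if __name__ == "__main__":
    sample = "Port 50022\nPasswordAuthentication no\n"
    print(password_login_disabled(sample))
```

Other settings, such as keyboard-interactive authentication, can also permit passwords, so a passing check here is necessary rather than sufficient.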
Example SSH Warning The authenticity of host 'virtual_machine (10.254.142.33)' can't be established. RSA key fingerprint is 1f:51:ae:28:bf:89:e9:d8:1f:25:5d:37:2d:7d:b8:ca:9f:f5:f1:6f. Are you sure you want to continue connecting (yes/no)?
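The safe response to this prompt is to compare the displayed fingerprint with one obtained through a trusted channel, such as the VM provider's console. The sketch below computes both fingerprint styles from a public host key line (as found in `/etc/ssh/ssh_host_*_key.pub` on the VM); the function names are assumptions. Older ssh clients print colon-separated MD5 hex, newer ones print `SHA256:` followed by base64.

```python
import base64
import hashlib


def key_blob(public_key_line):
    """Decode the base64 key data, the second field of a public key line."""
    return base64.b64decode(public_key_line.split()[1])


def sha256_fingerprint(public_key_line):
    """Fingerprint in the newer OpenSSH style: SHA256: plus unpadded base64."""
    digest = hashlib.sha256(key_blob(public_key_line)).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def md5_fingerprint(public_key_line):
    """Fingerprint in the legacy style: colon-separated MD5 hex pairs."""
    hexdigest = hashlib.md5(key_blob(public_key_line)).hexdigest()
    return ":".join(hexdigest[i:i + 2] for i in range(0, len(hexdigest), 2))


def matches_trusted(public_key_line, trusted_fingerprint):
    """True if the trusted fingerprint equals either computed style."""
    return trusted_fingerprint in (
        sha256_fingerprint(public_key_line),
        md5_fingerprint(public_key_line),
    )
```

Answer "yes" only when the fingerprints match; a mismatch may mean the connection is being intercepted.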
ICGC Security Best Practices Policy provides specific guidance on measures related to • Local infrastructure • Physical security • Server controls • Source data and control of copies of data • Destruction of data when no longer needed • Cloud-specific issues, including provider-specific guidance • Audits and accountability
Review and Auditing • Avoid considering security risks only when establishing a new system • Periodic review and, depending on the complexity of the project, third-party auditing are best practice • Preferably by a certified auditor • Review and prune the public keys authorized to access your virtual machines • Not only does your project evolve, but so do the state of data security and best practices
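Pruning authorized keys can be partly automated. The sketch below is a simplified illustration with an assumed function name: it keeps a line of an `authorized_keys` file only if one of its whitespace-separated fields is a base64 key on an approved list, and it does not parse quoted option strings in full. The key strings in the usage example are shortened placeholders, not real keys.

```python
def prune_authorized_keys(authorized_keys_text, approved_key_blobs):
    """Split the file's lines into (kept, removed). Blank lines and comments
    are kept; key lines are kept only if they contain an approved key."""
    kept, removed = [], []
    for line in authorized_keys_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)
        elif any(field in approved_key_blobs for field in stripped.split()):
            kept.append(line)
        else:
            removed.append(line)
    return kept, removed


if __name__ == "__main__":
    text = (
        "# team keys\n"
        "ssh-ed25519 AAAAalice alice@laptop\n"
        "ssh-ed25519 AAAAbob bob@old-laptop\n"
    )
    kept, removed = prune_authorized_keys(text, {"AAAAalice"})
    print("keep:", kept)
    print("remove:", removed)
```

Reviewing the removed list by hand before rewriting the file avoids locking out a legitimate user.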