This article discusses the importance of confidentiality in virtual data centres and proposes a solution to protect sensitive data. It explores the risks and challenges associated with data security in virtual environments and presents strategies for confidentiality protection. The article also highlights the need for automated output checking processes to ensure data privacy.
Protecting Confidentiality in a Virtual Data Centre
Computational Informatics
Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan, CSIRO
Tim Churches, Sax Institute
28 October 2012
Overview
• Introduction to the problem
• Virtual Data Centres
• Proposed solution
Confidentiality in Virtual Data Centres | Christine O’Keefe
Population Health Research Network*
• Provides access to linkable de-identified health data for research
  • Improving outcomes
  • Improving policy
• Traditionally
  • Supplies linkable de-identified health data directly to researchers
  • Loss of control over data heightens risk of:
    • External attack on datasets
    • Accidental or inadvertent actions by researcher
    • Deliberate attack by trusted researcher
*www.phrn.org.au
Secure Unified Research Environment*
• Secure remote access to virtual workstations and network in a data centre
*Sax Institute SURE User Guide v1.2
Confidentiality Protection for Health Data
• Governance
• Comply with privacy legislation and regulation
• Honour assurances to data providers
• Restrict access to approved researchers
• Information security measures
• Restrict amount and detail of data available
• Apply statistical disclosure control methods before releasing data to researcher
• No further confidentiality measures
• Enable access via secure on-line system
• Manual checking for confidentiality issues in statistical analysis outputs
• “…developing valid output checking processes that are automated is an open research question” (Duncan, Elliot, Salazar-González 2012)
Conceptual Model for online access
• Remote Analysis (RA)
  • Researcher cannot see data itself, only “Output for publication”
• Virtual Data Centre (VDC)
  • Researcher authorised to see data and “Output” as well as “Output for publication”
Virtual Data Centre
• Assumptions
  • Custodian prepares data to comply with legislation, regulation and assurances
  • Researcher complies with applicable researcher agreements
  • Researcher authorised to see data itself
• Do not need to protect dataset records from researcher
• Do not need to protect against malicious attacks by researcher
• Data transformations and analyses are unrestricted
• Confidentiality issues with respect to readers of academic literature
• Confidentiality issues with respect to outputs of genuine queries
Main Disclosure Risks in Statistical Output
• Individual values
• Small cells/samples … threshold
• Dominance
• Differencing
• Linear or other algebraic relationships in data
• Precision
Confidentiality Protection in a Virtual Data Centre – two-stage process
• Stage 1: Dataset preparation – by Custodian
• Stage 2: Confidentialisation of statistical analysis output for publication – by Researcher
• Similarities to:
  • ESSNet SDC Guidelines for checking output based on microdata research … Hundepool, Domingo-Ferrer, Franconi, Giessing, Nordholt, Spicer, de Wolf 2012
  • Statistics New Zealand Data Lab Output Guide
Stage 1: Dataset preparation – by Custodian
• Custodian
  • Removes obvious identifiers
  • Ensures dataset has sufficient records
  • Ensures published datasets differ by sufficiently many records
  • Ensures variables and combinations of variables have sufficiently many records
  • Reduces detail in data using aggregation (especially dates, locations)
  • Other measures as needed – statistical disclosure control
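Two of the custodian's steps, reducing detail by aggregation and checking that combinations of variables have sufficiently many records, can be sketched in Python. This is an illustration only, not the custodian's actual procedure: the field names (`birth_date`, `postcode`, `sex`), the two-digit postcode grouping and the helper names are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def coarsen_record(record: dict) -> dict:
    """Reduce detail: keep only the birth year and a coarse region code.

    Assumption: the first two postcode digits stand in for a broader region.
    """
    return {
        "birth_year": record["birth_date"].year,
        "region": record["postcode"][:2],
        "sex": record["sex"],
    }


def rare_combinations(records: list, keys: tuple, min_records: int) -> dict:
    """Return each combination of `keys` that occurs in fewer than min_records records."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {combo: c for combo, c in counts.items() if c < min_records}


# Hypothetical records, invented for the example.
records = [
    {"birth_date": date(1980, 3, 1), "postcode": "2600", "sex": "F"},
    {"birth_date": date(1980, 11, 9), "postcode": "2612", "sex": "F"},
    {"birth_date": date(1975, 6, 2), "postcode": "3000", "sex": "M"},
]
coarse = [coarsen_record(r) for r in records]
flagged = rare_combinations(coarse, ("birth_year", "region", "sex"), min_records=2)
```

Any combination returned in `flagged` would need further aggregation or another disclosure control measure before release; the minimum count itself is a custodian decision.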
Confidentialisation of statistical analysis output for publication – by Researcher
• Researcher
  • uses Checklist of tests to identify outputs that fail one or more tests
  • considers context and interactions of outputs to identify potential disclosure risks
  • applies treatments from Checklist to reduce potential disclosure risk
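The slides do not list the individual treatments, so the Python sketch below shows two common ones purely as an illustration: suppressing cells below a threshold and reducing precision by rounding to a fixed number of significant figures. The function names and parameters are hypothetical and are not taken from the Checklist.

```python
from math import floor, log10


def suppress_small_cells(table: dict, n: int) -> dict:
    """Replace any cell count below n with None (a suppressed cell)."""
    return {cell: (count if count >= n else None) for cell, count in table.items()}


def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures to limit output precision."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


# Hypothetical table of counts, invented for the example.
table = {"A": 12, "B": 3}
published = suppress_small_cells(table, n=5)
estimate = round_sig(1234.5678, 3)
```

Suppression on its own may be reversible when table margins are also published, which is why the researcher still has to consider the context and interactions of outputs.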
Checklist of Tests
• Individual value: an individual data value is directly revealed
• Threshold n: A cell or statistic is calculated on fewer than n data values
• Threshold p%: A cell contains more than p% of the values in a table margin
• Dominance (n,k): Amongst the records used to calculate a cell value or statistic, the n largest account for at least k% of the value
• Dominance p%: Amongst the records used to calculate a cell value or statistic, the total minus the two largest values is less than p% of the largest value
• Differencing: A statistic is calculated on populations that differ in fewer than n records
• Relationships: The statistic involves linear or other algebraic relationships
• Precision: The output involves a high level of precision in terms of significant figures and/or decimal places
• Degrees of Freedom: The model output has fewer than n degrees of freedom
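Several of these tests are simple enough to express directly. The following Python sketch covers Threshold n, Dominance (n,k), Dominance p% and Differencing as worded above. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, contributions are assumed to be non-negative, and the parameter values n, k and p are left to the caller because the slides do not fix them.

```python
from typing import Sequence


def fails_threshold_n(values: Sequence[float], n: int) -> bool:
    """Threshold n: the statistic is calculated on fewer than n data values."""
    return len(values) < n


def fails_dominance_nk(values: Sequence[float], n: int, k: float) -> bool:
    """Dominance (n,k): the n largest values account for at least k% of the total."""
    total = sum(values)
    if total <= 0:
        return False  # assumption: the rule is only meaningful for positive totals
    top_n = sum(sorted(values, reverse=True)[:n])
    return 100.0 * top_n / total >= k


def fails_dominance_p(values: Sequence[float], p: float) -> bool:
    """Dominance p%: total minus the two largest values is below p% of the largest."""
    ordered = sorted(values, reverse=True)
    if len(ordered) < 2:
        return True  # assumption: a cell with under two contributors always fails
    remainder = sum(ordered[2:])
    return remainder < (p / 100.0) * ordered[0]


def fails_differencing(ids_a: set, ids_b: set, n: int) -> bool:
    """Differencing: the two populations differ in fewer than n records.

    Assumption: identical populations are not flagged.
    """
    differing = len(ids_a ^ ids_b)
    return 0 < differing < n


# Hypothetical contributions to one cell, invented for the example.
cell = [900, 50, 30, 20]
results = {
    "threshold": fails_threshold_n(cell, n=5),
    "dominance_nk": fails_dominance_nk(cell, n=1, k=75),
    "dominance_p": fails_dominance_p(cell, p=10),
    "differencing": fails_differencing({1, 2, 3, 4, 5}, {1, 2, 3, 4}, n=3),
}
```

An output that fails any test is only flagged for attention; deciding whether it is a real disclosure risk, and which treatment to apply, remains a judgement for the researcher.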
Checklist - examples
Summary
• Virtual Data Centres
  • Becoming more popular
  • Manual checking of outputs for confidentiality risk not sustainable
  • Automated methods for confidentiality protection in statistical analysis outputs still under development
• Interim Solution
  • Dataset preparation by Custodian
  • Researchers confidentialise their own outputs for publication
    • Training
    • Checklist of tests and confidentiality treatments
Thank you
Computational Informatics
Dr Christine O’Keefe
Research Program Leader, Decision and User Science
t +61 2 6216 7021
e Christine.OKeefe@csiro.au
w www.csiro.au