Protecting Confidentiality in a Virtual Data Centre
Computational Informatics
Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan, CSIRO
Tim Churches, Sax Institute
28 October 2012

Overview
• Introduction to the problem
• Virtual Data Centres
• Proposed solution
Confidentiality in Virtual Data Centres | Christine O’Keefe

Population Health Research Network*
• Provides access to linkable de-identified health data for research
  • Improving outcomes
  • Improving policy
• Traditionally
  • Supplies linkable de-identified health data directly to researchers
  • Loss of control over data heightens risk of:
    • External attack on datasets
    • Accidental or inadvertent actions by researcher
    • Deliberate attack by trusted researcher
*www.phrn.org.au

Secure Unified Research Environment*
• Secure remote access to virtual workstations and network in a data centre
*Sax Institute SURE User Guide v1.2

Confidentiality Protection for Health Data
• Governance
  • Comply with privacy legislation and regulation
  • Honour assurances to data providers
  • Restrict access to approved researchers
• Information security measures
• Restrict amount and detail of data available
  • Apply statistical disclosure control methods before releasing data to researcher
  • No further confidentiality measures
• Enable access via secure on-line system
  • Manual checking for confidentiality issues in statistical analysis outputs
  • “…developing valid output checking processes that are automated is an open research question” (Duncan, Elliot, Salazar-González 2012)

Conceptual Model for online access
• Remote Analysis (RA)
  • Researcher cannot see data itself, only “Output for publication”
• Virtual Data Centre (VDC)
  • Researcher authorised to see data and “Output” as well as “Output for publication”

Virtual Data Centre
• Assumptions
  • Custodian prepares data to comply with legislation, regulation and assurances
  • Researcher complies with applicable researcher agreements
  • Researcher authorised to see data itself
  • Do not need to protect dataset records from researcher
  • Do not need to protect against malicious attacks by researcher
  • Data transformations and analyses are unrestricted
• Confidentiality issues with respect to readers of academic literature
• Confidentiality issues with respect to outputs of genuine queries

Main Disclosure Risks in Statistical Output
• Individual values
• Small cells/samples … threshold
• Dominance
• Differencing
• Linear or other algebraic relationships in data
• Precision
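
To make the differencing risk concrete, here is a minimal sketch assuming hypothetical record identifiers and made-up cost values (none of these figures come from the presentation). It shows how two published totals over populations that differ by a single record reveal that record's value by subtraction. The function name `fails_differencing` and the choice to ignore identical populations are illustrative assumptions.

```python
def fails_differencing(ids_a, ids_b, n):
    """Flag two populations that differ in fewer than n records.

    Assumption: identical populations (zero differing records) are not flagged,
    since publishing the same statistic twice reveals nothing new.
    """
    differing = set(ids_a) ^ set(ids_b)  # records in exactly one population
    return 0 < len(differing) < n


# Hypothetical data: 25 records with made-up cost values.
costs = {record_id: 1000 + 10 * record_id for record_id in range(1, 26)}

population_a = set(costs)            # all 25 records
population_b = population_a - {7}    # same population minus record 7

total_a = sum(costs[i] for i in population_a)
total_b = sum(costs[i] for i in population_b)

# Subtracting the two published totals recovers the value of record 7.
revealed_value = total_a - total_b
```

With a hypothetical limit of n = 5, `fails_differencing(population_a, population_b, 5)` flags the pair, and `revealed_value` equals `costs[7]`.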

Confidentiality Protection in a Virtual Data Centre – two-stage process
1. Dataset preparation – by Custodian
2. Confidentialisation of statistical analysis output for publication – by Researcher
• Similarities to:
  • ESSNet SDC Guidelines for checking output based on microdata research … Hundepool, Domingo-Ferrer, Franconi, Giessing, Nordholt, Spicer, de Wolf 2012
  • Statistics New Zealand Data Lab Output Guide

Dataset preparation – by Custodian
• Custodian
  • Removes obvious identifiers
  • Ensures dataset has sufficient records
  • Ensures published datasets differ by sufficiently many records
  • Ensures variables and combinations of variables have sufficiently many records
  • Reduces detail in data using aggregation (especially dates, locations)
  • Other measures as needed – statistical disclosure control
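
Two of the custodian steps above, reducing detail by aggregation and checking that combinations of variables have enough records, could be sketched as follows. This is an illustration only: the field names, the rule of coarsening a postcode to its first character, and the minimum of 3 records are all hypothetical and are not taken from the presentation.

```python
from collections import Counter
from datetime import date

MIN_RECORDS = 3  # hypothetical minimum number of records per combination

# Hypothetical records, already stripped of obvious identifiers.
records = [
    {"admission_date": date(2010, 3, 14), "postcode": "2000"},
    {"admission_date": date(2010, 7, 2), "postcode": "2010"},
    {"admission_date": date(2010, 11, 30), "postcode": "2095"},
    {"admission_date": date(2011, 1, 5), "postcode": "3000"},
]


def coarsen(record):
    """Reduce detail: keep only the year of the date and a coarse location code."""
    return {
        "admission_year": record["admission_date"].year,
        "region": record["postcode"][:1],  # hypothetical coarsening rule
    }


def rare_combinations(rows, keys, min_records):
    """Return each combination of the given variables held by fewer than min_records rows."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {combo: count for combo, count in counts.items() if count < min_records}


coarse_rows = [coarsen(record) for record in records]
rare = rare_combinations(coarse_rows, ["admission_year", "region"], MIN_RECORDS)
```

Here `rare` contains the single combination `(2011, "3")`, which a custodian might handle by further aggregation or another disclosure control measure.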

Confidentialisation of statistical analysis output for publication – by Researcher
• Researcher
  • uses Checklist of tests to identify outputs that fail one or more tests
  • considers context and interactions of outputs to identify potential disclosure risks
  • applies treatments from Checklist to reduce potential disclosure risk
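
The treatments on the Checklist are not reproduced in this transcript, so the sketch below uses two generic statistical disclosure control treatments as stand-ins: suppressing small cells and rounding to a limited number of significant figures. The function names, the threshold of 5 and the limit of 3 significant figures are assumptions made for illustration.

```python
from math import floor, log10

THRESHOLD_N = 5          # hypothetical minimum cell count
SIGNIFICANT_FIGURES = 3  # hypothetical precision limit


def suppress_small_cells(table, n):
    """Replace any count below n with None (suppression of the failing cell)."""
    return {cell: (count if count >= n else None) for cell, count in table.items()}


def round_significant(value, figures):
    """Round a number to the given number of significant figures."""
    if value == 0:
        return 0.0
    digits = figures - 1 - floor(log10(abs(value)))
    return round(value, digits)


# Hypothetical frequency table and model estimate.
counts = {"group A": 12, "group B": 3, "group C": 40}

safe_counts = suppress_small_cells(counts, THRESHOLD_N)
safe_estimate = round_significant(0.0123456, SIGNIFICANT_FIGURES)
```

Suppressing a cell is not always enough on its own: if a row or column total is also published, the suppressed value may be recoverable by subtraction, which is why the researcher is asked to consider the context and interactions of outputs.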

Checklist of Tests
• Individual value: an individual data value is directly revealed
• Threshold n: A cell or statistic is calculated on fewer than n data values
• Threshold p%: A cell contains more than p% of the values in a table margin
• Dominance (n,k): Amongst the records used to calculate a cell value or statistic, the n largest account for at least k% of the value
• Dominance p%: Amongst the records used to calculate a cell value or statistic, the total minus the two largest values is less than p% of the largest value
• Differencing: A statistic is calculated on populations that differ in fewer than n records
• Relationships: The statistic involves linear or other algebraic relationships
• Precision: The output involves a high level of precision in terms of significant figures and/or decimal places
• Degrees of Freedom: The model output has fewer than n degrees of freedom
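
Three of the tests above (Threshold n, Dominance (n,k) and Dominance p%) translate directly into short checks. The sketch below follows the wording of the checklist, assuming that the contributions to a cell are non-negative numbers. The function names, the parameter values and the example cells are hypothetical.

```python
def fails_threshold(values, n):
    """Threshold n: the cell is calculated on fewer than n data values."""
    return len(values) < n


def fails_dominance_nk(values, n, k):
    """Dominance (n,k): the n largest values account for at least k% of the total."""
    total = sum(values)
    if total <= 0:
        return False  # assumption: an empty or all-zero cell is left to other tests
    top_n = sum(sorted(values, reverse=True)[:n])
    return top_n * 100 >= k * total


def fails_dominance_p(values, p):
    """Dominance p%: total minus the two largest values is less than p% of the largest."""
    ordered = sorted(values, reverse=True)
    if not ordered:
        return False
    remainder = sum(ordered) - sum(ordered[:2])
    return remainder * 100 < p * ordered[0]


# Hypothetical cells: one dominated by a single contributor, one evenly spread.
dominated_cell = [900, 50, 30, 20]
balanced_cell = [260, 250, 250, 240]

results = {
    "dominated": (
        fails_threshold(dominated_cell, 3),
        fails_dominance_nk(dominated_cell, 1, 75),
        fails_dominance_p(dominated_cell, 10),
    ),
    "balanced": (
        fails_threshold(balanced_cell, 3),
        fails_dominance_nk(balanced_cell, 1, 75),
        fails_dominance_p(balanced_cell, 10),
    ),
}
```

In this example the dominated cell passes the threshold test but fails both dominance tests, while the balanced cell passes all three. The checks use integer arithmetic (multiplying by 100 rather than dividing) to avoid floating-point edge cases at the boundary.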

Checklist - examples

Summary
• Virtual Data Centres
  • Becoming more popular
  • Manual checking of outputs for confidentiality risk not sustainable
  • Automated methods for confidentiality protection in statistical analysis outputs still under development
• Interim Solution
  • Dataset preparation by Custodian
  • Researchers confidentialise their own outputs for publication
    • Training
    • Checklist of tests and confidentiality treatments

Thank you
Computational Informatics
Dr Christine O’Keefe
Research Program Leader, Decision and User Science
t +61 2 6216 7021
e Christine.OKeefe@csiro.au
w www.csiro.au