360 likes | 547 Views
Privacy Statistics and Data Linkage. Mark Elliot Confidentiality and Privacy Group University of Manchester. Overview. The disclosure risk problem Some e-science possibilities Monitored data access Grid based Data environment Analysis The meaning of privacy. Data Data Everywhere….
E N D
Privacy Statistics and Data Linkage Mark Elliot Confidentiality and Privacy Group University of Manchester
Overview • The disclosure risk problem • Some e-science possibilities • Monitored data access • Grid based Data environment Analysis • The meaning of privacy
Data Data Everywhere… • Massive and exponential increase in data; Mackey and Purdam(2002); Purdam and Elliot(2002). • These studies have led to the setting up of the data monitoring service. • Singer(1999) noted three behavioural tendencies: • Collect more information on each population unit • Replace aggregate data with person specific databases • Given the opportunity collect personal information • Purdam and Elliot add: • Link data whenever you can
The Disclosure Risk Problem:Type I: Identification Identification file Name Address Sex Age .. Sex Age .. Income .. .. Target file Target variables ID variables Key variables
Multiple datasets • Disclosure Risk assessment for single datasets is a reasonably understood problem. • But what happens with multiple datasets?
Data Mining and the Grid • Traditional Data Mining examines and identifies patterns on single (if massive) datasets. • But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.
Smith and Elliot (2005,06,07) • Increases in data availability lead inexorably to an increase in disclosure risk • My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z. • It’s all about information!
CLEF: Clinical e-Science Framework A solution involving monitored access
CLEF Consortium Approximately 40 Staff from • University of Manchester • University of Sheffield • University College London • University of Brighton • Royal Marsden Hospital, London
Purpose • To provide a system for allowing research access to patient data, whilst maintaining privacy. • Patient records • Database • Texts such as referral letters and other clinical texts • Text mining system convert to microdata
CLEF one possible architecture Firewall Raw Data PRE-ACCESS DQI Monitor PRE-ACCESS SDRA/SDC Treated Data PRE-Output DQI Monitor PRE-OUTPUT SDRA/SDC Data Intrusionsentry Workbench
Data Sentry: an AI system • Monitors patterns of analytical requests • 3 levels: users, institution, world. • Looking for intrusive patterns. • Numbers of requests • Stores Analytical requests for future use.
CLEF Proposed Architecture Firewall Raw Data PRE-ACCESS DQI Monitor PRE-ACCESS SDRA/SDC Treated Data PRE-Output DQI Monitor PRE-OUTPUT SDRA/SDC Data Intrusionsentry Workbench
Data Quality • User analyses are run on both treated and untreated data. • Outputs are compared and assessed for difference. • Major research area – Knowledge Engineering • Analyses are stored and collectively run over pre and post SDC files for assessment of impact.
The Grid: the context for massive combining. • “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002) • Grid middleware handles the technical issues communication, security, access/authentication etc… Cole et al (2002) • Data grid • Knowledge grid
What’s it about? • Disclosure risk analysis is forever constrained by the fact that we tend to only look at the release object. • This is a bit like evaluating the risk of a house being vulnerable to flooding without looking at where it is located! • Data Environment Analysis aims to remedy that situation and complete change the face of disclosure control in so doing…..
What would it involve? • Web Crawling • Data Monitoring • Synthetic Data Generation • Grid based disclosure risk analysis
Web crawling • Untrained Screen scraping of all web sites that collect personal data. • Generic info gathering of web published personal info (personal web pages, My space etc)
Data Monitoring • The development of sophisticated metadatabases representing available info fields • Combined Database of web available data. • Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.
Architecture Web Crawler Web Crawler Web Crawler Web Crawler Web Crawler SDRA system Synthesiser Data monitor Repository: Data & Metadata
What next? • Decide on roles. • Identify funder. • Develop grant application.
Synthetic Data Generation • Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
A Blurring of Concepts • The boundaries between data and processes become less distinct. • Cyberidenties • I am my data? • The distinction between informational and physical privacy becomes less distinct.
Data Growth • There is no reason to suppose that data growth will not continue at the same break neck pace • The data environment will become increasingly richer • In this context the meaning of “privacy” will undoubtedly change. • But how?
The meaning of Privacy • Do people care about privacy in an orthodox, absolute sense? • What does a blog mean? • Private-public: Public Privacy • Control and ownership are more important than the absolute right to secrecy.
From Data Subjects to Data Citizens • A data actualised individual in control and self aware of their own data. • What would data citizens be concerned about? • Ownership • The use/abuse of their data • Harm • Permission/Consent • This suggests that the law should focus on data abuse rather than privacy per se.
Summary • Statistical Disclosure prevents a problem for the use of data • Multiple linkable datasets exacerbate that problem. • E-science provides some tools for new modes of data access
But….. • Assuming that the global culture continues to feed and be fed by the information explosion: • Our view of ourselves/our data will/must change. • The meaning of privacy must change with it. • The key question is what sort of society we are constructing; the meaning of privacy will reflect this.