Setting the Stage: How De-Identification Came into U.S. Law, and Why the Debate Matters Today
Professor Peter Swire, Ohio State University/Future of Privacy Forum
FPF Conference on DeIdentification, National Press Club
December 5, 2011
Overview
• U.S. history: Census, federal agency statistics, & HIPAA
• Why Deidentification (DeID) matters today
• The debate – it works or it doesn’t
• Three threat models
• Analogy to law enforcement
• Big picture – useful for many tasks, even with the limits shown by scientists
Census, Statistics & DeID
• Many years of Census experience
  • Highly useful data
  • Deidentified
• Periodic opposition to mandatory reporting
  • Needed strong confidentiality promises
• Suppress small cell size
  • Only home in a census tract
• Fuzz data
• Strict rules against release even for national security purposes
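The "suppress small cell size" idea on this slide can be sketched in a few lines. This is only an illustration, not the Census Bureau's actual procedure: the function name `suppress_small_cells`, the threshold of 5, and the toy tract data are all assumptions made for the example.

```python
from collections import Counter

def suppress_small_cells(records, keys, min_cell_size=5):
    """Count records per cell; hide any count below min_cell_size.

    min_cell_size=5 is an assumed threshold for illustration only.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    # A suppressed cell is reported as None instead of its true count.
    return {cell: (n if n >= min_cell_size else None)
            for cell, n in counts.items()}

# Toy data: six homes in tract A, a single home in tract B.
homes = [{"tract": "A"}] * 6 + [{"tract": "B"}]
table = suppress_small_cells(homes, ["tract"])
print(table)  # {('A',): 6, ('B',): None}
```

The single home in tract B is exactly the "only home in a census tract" case: publishing its count would point to one household, so the cell is withheld.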
Federal Agency Statistics
• Codification in Confidential Information Protection & Statistical Efficiency Act of 2002 (CIPSEA)
• Good history by Sylvester & Lohr
• Basic rule: if you collect data for statistical purposes, use it only for statistical purposes; don’t ReID
• Funny thing: same culture & practice for years in private sector polling (Gallup-style) and market research
  • Many years of practice here
  • Perhaps a basic guideline going forward?
HIPAA
• 1999-2000 regs informed by Sweeney research
• Safe harbor – delete a lot of specified data fields
• Expert determination (I pushed for this) – where there is a statistical basis, can achieve DeID based on risk, not safe harbor
• Data use agreements – release for research, with enforceable promise not to ReID
• In short:
  • If scrubbed enough, can release publicly
  • If scrubbed less, then enforceable promise not to ReID
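A rough sketch of what "delete a lot of specified data fields" looks like in practice. The field names, the `DIRECT_IDENTIFIERS` set, and the coarsening choices (birth year only, three-digit ZIP) are assumptions for illustration; this is not the regulation's actual list of fields and should not be read as a compliance recipe.

```python
# Hypothetical field names; the real rule specifies its own list.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email"}

def scrub(record):
    """Drop direct identifiers and coarsen date of birth and ZIP code."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "dob" in out:
        # Keep only the year from an ISO date string such as "1970-01-31".
        out["birth_year"] = out.pop("dob")[:4]
    if "zip" in out:
        # Keep only the first three digits of the ZIP code.
        out["zip3"] = out.pop("zip")[:3]
    return out

patient = {"name": "Pat", "dob": "1970-01-31", "zip": "43210", "dx": "flu"}
scrubbed = scrub(patient)
print(scrubbed)  # {'dx': 'flu', 'birth_year': '1970', 'zip3': '432'}
```

The other two routes on the slide do not reduce to field deletion: the expert route rests on a statistical judgment about risk, and a data use agreement rests on an enforceable promise rather than on scrubbing.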
Why It Matters Today
• Data mining now extends far beyond specialized researchers
• The Internet (commercial only since 1993) gives me access to data
• Storage & processing on my laptop > mainframe of 25 years ago
• Search is way better
• The erosion of practical obscurity – “they” really may figure out who “we” are
The Debate is Joined
• Ohm (and others) draw on Sweeney-type research
  • DeID likely to lead to ReID
• Yakowitz (and others) respond
  • Benefits of public data enormous
  • Practical risk/harm from ReID low
• Anonymization creates huge risks or low risks?
• Worth doing anonymization/DeID at all?
• Today’s conference to shed light on this …
Threat Models – Which Attackers?
• Three types of attackers on “anonymized” data:
  • Insiders “peeping”
  • Outside hackers intruding
  • The public, who don’t get into the database
• DeID often effective for first two
• Ohm/Yakowitz debate primarily on the third
Insiders Peeping
• Swire 2009 Peeping article, at peterswire.net
• Threat: employee or employee of sub-contractor sees the data and “peeps”
  • Sees celebrity information (e.g., Clooney)
  • Sees information about friend/family/ex
  • Sees information to create harm (ID theft, blackmail)
• Anonymization useful part of anti-peeping strategy
  • Employee doesn’t search for or stumble upon Clooney
  • Employee may lack tools to do Sweeney-type analysis
  • Audit logs catch employees who try
  • Give employees access to statistical data, not PII
Outside Hackers
• Hacker may intrude for a short while
  • Anonymization may prevent “ah hah” – Clooney
• Hacker may download database
  • If so, then hacker becomes similar to the public
  • May or may not be good at Sweeney-type tricks
  • May be focused on specific types of information, and not try to ReID
• Less-than-perfect DeID may substantially reduce incidence of ReID
Re-ID by “The Public”
• So, masking may help against some threats
• The debate, though, is whether “the public” (i.e., the experts) can ReID
• Sweeney & other research provides startling & important results of ReID
• Can everything be ReIdentified?
ReID & 2 Famous Studies
• Date of birth, zip, & gender -> 80%+ unique
  • Yes
  • BUT, DOB is off-the-charts different
  • Gender – splits population in half
  • DOB = 366 (days) x 80 (years) = over 29,000 cells
  • Moral – DOB ridiculously strong for ReID
• Netflix – can ReID over 60% of movie reviews
  • BUT, takes known IMDb reviewers and matches to Netflix
  • Can ReID a lot, but not a big effect
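The cell arithmetic on this slide, and the reason date of birth dominates, can be checked with a small sketch. The helper names `cell_count` and `unique_fraction` and the four toy records are assumptions for illustration; they are not data or code from the studies mentioned.

```python
from collections import Counter

def cell_count(days=366, years=80, genders=2):
    """Number of distinct (date of birth, gender) combinations."""
    return days * years * genders

def unique_fraction(records, keys):
    """Fraction of records whose values on `keys` match no other record."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records
                 if counts[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

print(366 * 80)      # 29280 date-of-birth cells
print(cell_count())  # 58560 once gender splits each cell in half

# Toy records, all in one ZIP code (made-up values).
people = [
    {"dob": "1970-01-01", "zip": "43210", "gender": "F"},
    {"dob": "1970-01-01", "zip": "43210", "gender": "F"},
    {"dob": "1982-06-15", "zip": "43210", "gender": "M"},
    {"dob": "1990-11-30", "zip": "43210", "gender": "M"},
]
print(unique_fraction(people, ["zip", "gender"]))         # 0.0
print(unique_fraction(people, ["dob", "zip", "gender"]))  # 0.5
```

In the toy data, ZIP and gender alone single out nobody, while adding full date of birth makes half the records unique. Gender only doubles the number of cells; date of birth multiplies it by tens of thousands.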
Law Enforcement Analogy
• So, is ReID generally easy or hard, useful or useless?
• Consider cop with a bunch of clues (male, tall, red hair, etc.)
  • Enough to ReID? No
  • Helpful to ReID? Yes
  • A matter of how much legwork, analysis, extra data is available and accurate
  • Very big range for difficulty of finding the suspect
• Same is true for ability of “the public” to ReID, to name the suspect
Conclusion
• Issue matters today -- more data potentially available to “the public”
• History of useful anonymization in statistics
  • If you collect data for statistical purposes, use it only for statistical purposes, store it that way, don’t ReID
• DeID helps against insider & hacker threats
• ReID by “the public” varies widely in the effort needed to find the “suspect”
• Our conference today to help policymakers learn where DeID likely to be most useful