29e Conférence internationale des commissaires à la protection de la vie privée
Data Mining
Dr. Bradley A. Malin
Assistant Professor, Department of Biomedical Informatics, Vanderbilt University
Data Collection
• Localized: Personal records in databases at source
• Distributed: Integration of records from many sources
[Figure labels: Electronic Medical Records (demographics, clinical presentation); Overt Collection: hospital visit for treatment; Covert Collection: webcam while walking]
Data Mining
• Unsupervised
  • “Labels” unknown in advance, so search for intrinsic patterns in the data
  • Clustering “similar” people
  • Purchased same products
• Supervised
  • “Labels” known in advance
  • Train models on sample data to classify new cases
[Figure: decision tree that splits on Country (USA, Canada) and then on Age (<50, >50); leaf labels: Harry Potter, 1984, Jaws, Scream]
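The contrast between the two modes can be sketched in a few lines of Python. This is only an illustration: the records, the pairing of countries and age bands with titles, and the function names are all hypothetical, and grouping by identical purchase is the simplest possible stand-in for a clustering algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (person, country, age band, title bought).
# The pairing of people, attributes, and titles is invented for illustration.
RECORDS = [
    ("p1", "USA", "<50", "Harry Potter"),
    ("p2", "USA", "<50", "Harry Potter"),
    ("p3", "USA", ">50", "1984"),
    ("p4", "Canada", "<50", "Scream"),
    ("p5", "Canada", ">50", "Jaws"),
    ("p6", "Canada", ">50", "Jaws"),
]


def cluster_by_purchase(records):
    """Unsupervised: no labels are given; group people who bought the same title."""
    groups = defaultdict(list)
    for person, _country, _age, title in records:
        groups[title].append(person)
    return sorted(sorted(members) for members in groups.values())


def train(records):
    """Supervised: the purchased title is the known label for each (country, age) case."""
    votes = defaultdict(Counter)
    for _person, country, age, title in records:
        votes[(country, age)][title] += 1
    # Keep the most common label seen for each combination of features.
    return {features: counts.most_common(1)[0][0] for features, counts in votes.items()}


def classify(model, country, age):
    """Predict a label for a new case, or None if the combination was never seen."""
    return model.get((country, age))


clusters = cluster_by_purchase(RECORDS)
model = train(RECORDS)
prediction = classify(model, "Canada", ">50")
```

Here `clusters` groups p1 with p2 and p5 with p6 without being told any labels, while `prediction` uses the labeled sample to guess what a new Canadian customer over 50 would buy.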
Website Personalization
• Can a website predict what I want to see?
  • Intra-personalization: What pages / topics did I visit in my previous visits?
  • Inter-personalization: Is my browsing / purchasing history similar to other people’s?
• Does my behavior reveal my identity or sensitive things about my life?
• What information should not be revealed?
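Inter-personalization can be sketched as a nearest-neighbour lookup over browsing histories. The visitor names, page paths, and the choice of Jaccard similarity below are assumptions made for illustration, not a description of any particular website's method.

```python
def jaccard(a, b):
    """Overlap between two sets of visited pages: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories):
    """Suggest pages seen by the most similar other visitor but not yet by the target."""
    mine = histories[target]
    others = [(jaccard(mine, pages), name)
              for name, pages in histories.items() if name != target]
    if not others:
        return []
    score, nearest = max(others)
    if score == 0.0:
        return []  # nobody shares any history with the target
    return sorted(histories[nearest] - mine)


# Hypothetical browsing histories.
HISTORIES = {
    "visitor_a": {"/cameras", "/lenses", "/tripods"},
    "visitor_b": {"/cameras", "/lenses", "/memory-cards"},
    "visitor_c": {"/garden", "/tools"},
}
suggestions = recommend("visitor_a", HISTORIES)
```

The privacy question on the slide shows up directly in this sketch: the suggestion made to visitor_a is drawn from visitor_b's history, so a recommendation can leak what a similar person looked at.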
Intelligence
• Lists of entities are becoming increasingly prevalent
  • Intelligence reports, rosters, networks
• How many Alices are there? Which is which?
• How does Alice relate to Bob?
[Figure: four documents listing names. doc A: Alice Doe, Bob Doe, Junior Doe. doc B: Bob Doe, Junior Doe, Alice Doe. doc C: Alice Doe, John Smith. doc D: Alice Doe, John Smith, Brad Malin.]
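One simple way to approach "which Alice is which" is to compare the other names that appear alongside each mention. The sketch below uses the four documents from the slide (the assignment of names to documents is read off the slide layout and may not be exact) and a hypothetical `resolve` helper; it is a heuristic illustration, not the method used in the talk.

```python
# Name lists per document, as read from the slide (assignment is an assumption).
DOCS = {
    "doc A": ["Alice Doe", "Bob Doe", "Junior Doe"],
    "doc B": ["Bob Doe", "Junior Doe", "Alice Doe"],
    "doc C": ["Alice Doe", "John Smith"],
    "doc D": ["Alice Doe", "John Smith", "Brad Malin"],
}


def resolve(name, docs):
    """Group the documents mentioning `name` by shared co-occurring names."""
    # Context of each mention: the other names listed in the same document.
    contexts = {doc: set(names) - {name}
                for doc, names in docs.items() if name in names}
    groups = []  # each entry: (list of documents, union of their context names)
    for doc in sorted(contexts):
        context = contexts[doc]
        merged_docs, merged_context = [doc], set(context)
        remaining = []
        for group_docs, group_context in groups:
            if group_context & context:
                # Shared associates: treat the mentions as the same person.
                merged_docs += group_docs
                merged_context |= group_context
            else:
                remaining.append((group_docs, group_context))
        groups = remaining + [(merged_docs, merged_context)]
    return sorted(sorted(group_docs) for group_docs, _ in groups)


alice_groups = resolve("Alice Doe", DOCS)
```

Under this heuristic the mentions fall into two groups, docs A and B (linked through Bob Doe and Junior Doe) and docs C and D (linked through John Smith), which suggests two distinct Alices. It can be wrong in both directions: two different people may share associates, and one person may appear in unrelated circles.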
Surveillance
• Location Surveillance: Did someone on Interpol’s watchlist visit hotel X? Airline Y?
• Challenge: Data holders want to collaborate, but fear disclosing strategic knowledge and face legal constraints
[Figure labels: Hotel X, Airline Y]
Privacy Protections
• Protect Anonymity
  • Remove / encrypt identifying information
  • Suppress inferences that can reveal identity
• Protect Confidentiality
  • Hide Sensitive Rules
  • Perturbation and Generalization
• Secure Multiparty Computation
  • E(a) + E(b) = E(a + b) [homomorphism]
  • E(E(John, x), y) = E(E(John, y), x) [commutativity]
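The two algebraic properties listed under secure multiparty computation can be checked with toy code. The sketch below makes several assumptions that are not in the slides: commutative encryption is illustrated with modular exponentiation over a small Mersenne prime, names are encoded with SHA-256, and the homomorphism is illustrated with a textbook Paillier scheme on tiny primes, where the operation that combines two ciphertexts is multiplication modulo n². None of the parameters are suitable for real use.

```python
import hashlib
import math
import secrets

# --- Commutative encryption: E(E(m, x), y) == E(E(m, y), x) ---------------
P = 2**127 - 1  # a known Mersenne prime; toy-sized, not a vetted parameter


def keygen():
    """Pick a secret exponent coprime to P - 1 so that E is one-to-one."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def to_group(name):
    """Encode a name as a nonzero residue mod P (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def E(value, key):
    """Encrypt by modular exponentiation; exponents commute, so key order is irrelevant."""
    return pow(value, key, P)


def count_matches(list_a, key_a, list_b, key_b):
    """Count names common to both lists by comparing doubly encrypted values."""
    # In a real exchange each party applies only its own key and passes the result on.
    double_a = {E(E(to_group(n), key_a), key_b) for n in list_a}
    double_b = {E(E(to_group(n), key_b), key_a) for n in list_b}
    return len(double_a & double_b)


# --- Additive homomorphism (toy Paillier) ----------------------------------
def paillier_keys(p=1789, q=1861):
    """Build a toy key from two tiny primes; real keys use far larger primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, lam, mu


def paillier_encrypt(m, n):
    """Encrypt m with generator g = n + 1 and fresh randomness r."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) * pow(r, n, n * n) % (n * n)


def paillier_decrypt(c, n, lam, mu):
    """Recover m from c using L(x) = (x - 1) // n."""
    return (pow(c, lam, n * n) - 1) // n * mu % n


# Hypothetical watchlist and guest list; x and y are each party's secret key.
x, y = keygen(), keygen()
matches = count_matches(["John Smith", "Alice Doe"], x, ["Bob Doe", "John Smith"], y)

n, lam, mu = paillier_keys()
c_sum = paillier_encrypt(20, n) * paillier_encrypt(22, n) % (n * n)
total = paillier_decrypt(c_sum, n, lam, mu)
```

With these inputs `matches` counts the one shared name without either list being compared in the clear, and `total` decrypts to the sum of the two encrypted values even though they were never decrypted individually.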
Clinical Genomics
• Vanderbilt DNA Databank
  • DNA from “leftover” blood
  • 25-75K per year, 250K in 5 years
• Combined with de-identified electronic medical records
  • 600 GBytes on 1.4 mil. patients
• “Hypothesis Generation” to mine correlations between clinical features and DNA
[Figure: blood samples and the clinical record, linked by medical record #, pass through de-identification; the resulting DNA and clinical record are linked by a 512-bit hash of the #]
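The linking step in the figure can be sketched as follows. The slide only says "512 bit hash"; using SHA-512 from Python's standard library, the `research_id` name, and the sample record number are assumptions for illustration.

```python
import hashlib


def research_id(medical_record_number):
    """Return a 512-bit digest (128 hex characters) of the medical record number."""
    return hashlib.sha512(medical_record_number.encode("utf-8")).hexdigest()


# Hypothetical record number; both data sets store only the digest.
dna_sample = {"id": research_id("MRN-0012345"), "source": "leftover blood"}
clinical_record = {"id": research_id("MRN-0012345"), "text": "de-identified note"}
linked = dna_sample["id"] == clinical_record["id"]
```

Because the same input always yields the same digest, the DNA sample and the de-identified record can still be joined without storing the record number itself. A plain hash of a short number can be reversed by hashing every possible number, so a production system would need a secret key or comparable safeguard; that detail is not described in the slides.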
Example De-identified Medical Record
• Replaced SSN and phone #
• MR# is removed
• Substituted names
• Shifted dates
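The four transformations named on the slide can be sketched with regular expressions. The patterns, placeholder tokens, name mapping, and date format below are all assumptions; real clinical text needs far more robust detection than this.

```python
import re
from datetime import date, timedelta

# Assumed formats: US-style SSN and phone numbers, "MR# <digits>", ISO dates.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
MRN = re.compile(r"MR#\s*\d+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def deidentify(text, names, shift_days):
    """Replace SSN and phone, remove the MR#, substitute names, and shift dates."""
    text = SSN.sub("[SSN]", text)
    text = PHONE.sub("[PHONE]", text)
    text = MRN.sub("", text)
    for real, substitute in names.items():
        text = text.replace(real, substitute)

    def shift(match):
        # Move every date by the same offset so intervals between events survive.
        original = date.fromisoformat(match.group(0))
        return (original + timedelta(days=shift_days)).isoformat()

    return DATE.sub(shift, text)


# Hypothetical note with made-up identifiers.
note = "MR# 448812\nPatient Jane Roe, SSN 123-45-6789, phone 615-555-0100, seen 2007-03-14."
clean = deidentify(note, {"Jane Roe": "Mary Major"}, shift_days=10)
```

Shifting all of a patient's dates by one offset keeps the time between visits intact, which matters for clinical analysis, while hiding the actual calendar dates.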
Patterns in data can lead to privacy compromise
Suppress patterns “intelligently” to support goals
Naïve Protection
In Detail: Cystic Fibrosis (1149 patients, 174 hospitals)
BEFORE Protection: 100% Samples In Repository
AFTER Protection: 0% Samples Re-identified
The Impact of Data Mining on Privacy in the Public and Private Sectors
Richard S. Rosenberg
Professor Emeritus, Department of Computer Science, University of British Columbia, and President of the BC Freedom of Information and Privacy Association
Vancouver, BC
rosen@cs.ubc.ca
The U.S. Government
A Revision
Top Six Purposes of Data Mining Efforts in Departments and Agencies
Table 1: Key Steps Agencies Are Required to Take to Protect Privacy, with Examples of Related Detailed Procedures and Sources
(Columns: key steps to protect privacy of personal information; examples of procedures; primary statutory source)

Key step: Publish notice in the Federal Register when creating or modifying system of records
• Specify the routine uses for the system
• Identify the individual responsible for the system
• Outline procedures individuals can use to gain access to their records
Primary statutory source: Privacy Act

Key step: Provide individuals with access to their records
• Permit individuals to review records about themselves
• Permit individuals to request corrections to their records
Primary statutory source: Privacy Act

Key step: Notify individuals of the purpose and authority for the requested information when it is collected
• Notify individuals of the authority that authorized the agency to collect the information
• Notify individuals of the principal purposes for which the information is to be used
Primary statutory source: Privacy Act

Key step: Implement guidance on system
• Perform a risk assessment to determine the information system vulnerabilities, identify threats, and develop countermeasures to those threats
• Have the system certified and accredited by management
• Ensure the accuracy, relevance, timeliness, and completeness of information
Primary statutory source: FISMA, Privacy Act

Key step: Conduct a privacy impact assessment
• Describe and analyze how information is secured
• Describe and analyze intended use of information
• Have assessment reviewed by chief information officer or equivalent
• Make assessment publicly available, if practicable
Primary statutory source: E-Government Act
ADVISE Data Mining Tool (Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement)
Cato Institute: Data Mining and Terrorism
• Attempting to use predictive data mining to ferret out terrorists before they strike would be a subtle but important misdirection of national security resources.
• With a relatively small number of attempts every year and only one or two major terrorist incidents every few years, each one distinct in terms of planning and execution, there are no meaningful patterns that show what behavior indicates planning or preparation for terrorism.
Data Mining in the Private Sector
• We generate an enormous amount of data as a by-product of our everyday transactions (purchasing goods, enrolling for courses, etc.), visits to Web sites and interactions with government (taxes, census, car registration, voter registration, etc.). Not only is the number of records we generate increasing, but the amount of data gathered for each type of record is increasing.
• As data miners, our tasks are colliding with these concerns. In analytic customer relationship management (CRM), we often analyze customer data with the specific intent of understanding individual behavior and instituting sales campaigns based on this understanding. Researchers in economics, demographics, medicine and social sciences are trying to understand the relationships between behaviors and outcomes.
• How can we reconcile the legitimate needs of business and research with the equally legitimate desire of people to maintain their privacy?
The Use of Anonymizing
• Still, anonymizing technologies have been endorsed repeatedly by panels appointed to examine the implications of data mining. And intriguing progress appears to have been made at designing information-retrieval systems with record anonymization, user audit logs (which can confirm that no one looked at records beyond the approved scope of an investigation) and other privacy mechanisms "baked in."
• The trick is to do more than simply strip names from records. Latanya Sweeney of Carnegie Mellon University, a leading privacy technologist who once had a project funded under TIA, has shown that 87% of Americans could be identified by records listing solely their birthdate, gender and ZIP code.
• Sweeney had this challenge in mind as she developed a way for the U.S. Department of Housing and Urban Development to anonymously track the homeless.
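Why stripping names is not enough can be seen by counting how many records share each combination of birthdate, gender and ZIP code. The records, field names, and generalization rule in this sketch are hypothetical; it illustrates the kind of uniqueness check behind the 87% figure, not Sweeney's actual analysis.

```python
from collections import Counter


def fraction_unique(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of indistinguishable records (the k in k-anonymity)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


def generalize(record):
    """Coarsen the fields: keep only the birth year and the first three ZIP digits."""
    return {
        "birthdate": record["birthdate"][:4],
        "gender": record["gender"],
        "zip": record["zip"][:3],
    }


# Hypothetical name-free records.
RECORDS = [
    {"birthdate": "1961-02-13", "gender": "F", "zip": "37203"},
    {"birthdate": "1961-07-30", "gender": "F", "zip": "37212"},
    {"birthdate": "1975-11-02", "gender": "M", "zip": "37203"},
    {"birthdate": "1975-01-19", "gender": "M", "zip": "37205"},
]
QI = ["birthdate", "gender", "zip"]

before = fraction_unique(RECORDS, QI)
coarse = [generalize(r) for r in RECORDS]
after = fraction_unique(coarse, QI)
k = smallest_group(coarse, QI)
```

In this toy data every record is unique on the three fields before generalization, and none is afterwards: each person then shares a combination with at least one other record, at the cost of less precise data.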
A Private Sector Example
• Tesco is quietly building a profile of you, along with every individual in the country - a map of personality, travel habits, shopping preferences and even how charitable and eco-friendly you are. A subsidiary of the supermarket chain has set up a database, called Crucible, that is collating detailed information on every household in the UK, whether they choose to shop at the retailer or not.
• The company refuses to reveal the information it holds, yet Tesco is selling access to this database to other big consumer groups, such as Sky, Orange and Gillette. "It contains details of every consumer in the UK at their home address across a range of demographic, socio-economic and lifestyle characteristics," says the marketing blurb of dunnhumby, the Tesco subsidiary in question. It has "added intelligent profiling and targeting" to its data through a software system called Zodiac. This profiling can rank your enthusiasm for promotions, your brand loyalty, whether you are a "creature of habit" and when you prefer to shop. As the blurb puts it: "The list is endless if you know what you are looking for."
The View From 30,000 Feet
• Brussels, Belgium: EU “googles” Google privacy practices
• ChoicePoint’s press release states it has forgone selling certain consumer information “in selected markets” at a cost of $15 million per year
• Canadian company releases 3-D Face Scanner, $350, 2007
• University of Pisa, Italy: KDD Laboratory & “K-Anonymity” advancements
• RCMP buys info from data broker
• Roelof Temmingh, South Africa, releases version 1 of “Evolution”
• Fall 2006 - Purdue University electrophotographic halftone printer code advances
• Melbourne, Australia: Jane Doe vs. ABC, 3 April 2007. Costs, including tort of invasion of privacy: $234,190
• Feb 2007 - Portugal adopts “biometric” national ID card provider
• New Zealand Court of Appeal, 4 May 2007: Brooker v. Police