Privacy and Data Mining: Friends or Foes? Rakesh Agrawal IBM Almaden Research Center
Theme DILEMMA • Applications abound where data mining can do enormous good, but it is vulnerable to misuse in misguided hands GOAL • Understand the concerns with data mining and identify research directions that may address those concerns QUESTIONS • Perceived concerns with data mining • How real are those concerns • What the data mining community is doing to address the concerns • What more needs to be done
Panelists • James Dempsey, Center for Democracy & Technology • Daniel Gallington, Potomac Institute • Lawrence Cox, National Center for Health Statistics • Bhavani Thuraisingham, National Science Foundation • Latanya Sweeney, Carnegie Mellon University • Christopher Clifton, Purdue University • Jeff Ullman, Stanford University
Plan • Position statements -- 6 minutes each • Rejoinders -- 2 minutes each • Questions and observations from the floor • Closing statements -- 1 minute each
The Potomac Institute for Policy Studies Privacy and Data Mining KDD 2003 August 25, 2003 Daniel J. Gallington
New Information Technology and Privacy: Status of the Debate • Demonization of Science • Technology development vs. policy/legal “envelope” • Rules vs. Process • Enablement vs. Disablement • Secrecy • When the dust settles, what could work?
Data Mining and Privacy: Friends or Foes? Dr. Bhavani Thuraisingham The National Science Foundation August 2003
Definitions • Data Mining • Data mining is the process of a user analyzing large amounts of data using techniques from statistical reasoning and machine learning and discovering information often previously unknown • Data fusion • The process of associating records from two (or more) databases, e.g., Medical Records and Grocery Store purchases • Privacy Problem • User U poses queries and deduces information from the responses that U is authorized to see; U is not authorized to see the deduced information about an individual or a group of individuals G deemed private by either G or some authority
Some Data Mining Applications • Medical and Healthcare • Mining genetic and medical databases and finding links between genetic composition and diseases • Security • Analyzing travel records, spending patterns, associations between people and determining potential terrorists • Examining audit data and determining unauthorized network intrusions • Mining credit card transactions, telephone calls and other related data and detecting fraud and identity theft • Marketing, Sales, and Finance • Understanding preferences of groups of consumers
Some Privacy concerns • Medical and Healthcare • Employers, marketers, or others knowing of private medical concerns • Security • Allowing access to individual’s travel and spending data • Allowing access to web surfing behavior • Marketing, Sales, and Finance • Allowing access to individual’s purchases
Data Mining as a Threat to Privacy • Data mining gives us “facts” that are not obvious to human analysts of the data • Can general trends across individuals be determined without revealing information about individuals? • Possible threats: • Combine collections of data and infer information that is private • Disease information from prescription data • Military action from pizza deliveries to the Pentagon • Need to protect the associations and correlations between the data that are sensitive or private
Some Privacy Problems and Potential Solutions • Problem: Privacy violations that result from data mining • Potential solution: Privacy-preserving data mining • Problem: Privacy violations that result from the inference problem • Inference is the process of deducing sensitive information from the legitimate responses received to user queries • Potential solution: Privacy Constraint Processing • Problem: Privacy violations due to unencrypted data • Potential solution: Encryption at different levels • Problem: Privacy violations due to poor system design • Potential solution: Develop methodology for designing privacy-enhanced systems
Some Research Directions: Privacy Preserving Data Mining • Prevent useful results from mining • Introduce “cover stories” to give “false” results • Only make a sample of data available so that an adversary is unable to come up with useful rules and predictive functions • Randomization • Introduce random values into the data and/or results • Challenge is to introduce random values without significantly affecting the data mining results • Give range of values for results instead of exact values • Secure Multi-party Computation • Each party knows its own inputs; encryption techniques used to compute final results • Rules, predictive functions • Approach: Only make a sample of data available • Limits ability to learn good classifier
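The randomization bullet above can be illustrated with a minimal sketch, assuming additive zero-mean uniform noise (one of several possible perturbation schemes, chosen here only for simplicity). The names `perturb` and `NOISE_SPREAD` and the synthetic ages are hypothetical. Each individual value is distorted before release, yet an aggregate such as the mean stays close to the true figure.

```python
import random
import statistics

NOISE_SPREAD = 20.0  # assumed noise half-width; an illustrative tuning knob


def perturb(values, spread=NOISE_SPREAD, rng=None):
    """Return a copy of values with independent uniform noise in [-spread, spread] added."""
    rng = rng or random.Random()
    return [v + rng.uniform(-spread, spread) for v in values]


rng = random.Random(0)                                    # fixed seed so the run is repeatable
true_ages = [rng.randint(20, 70) for _ in range(10_000)]  # synthetic data, not real records
released = perturb(true_ages, rng=rng)

true_mean = statistics.fmean(true_ages)
released_mean = statistics.fmean(released)
print(f"true mean {true_mean:.2f}, mean of released values {released_mean:.2f}")
```

The trade-off named on the slide shows up directly in `NOISE_SPREAD`: a larger spread hides individual values better but needs more records before aggregates become reliable.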
Some Research Directions: Privacy Constraint Processing • Privacy constraint processing • Based on prior research in security constraint processing • Simple Constraint: an attribute of a document is private • Content-based constraint: If a document contains information about X, then it is private • Association-based Constraint: Two or more documents taken together are private; individually each document is public • Release constraint: After X is released, Y becomes private • Augment a database system with a privacy controller for constraint processing
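The four constraint types above can be sketched as a toy privacy controller. This is an illustration under stated assumptions, not the design of any actual system: the class `PrivacyController`, its field names, and the document identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PrivacyController:
    private_attrs: set = field(default_factory=set)      # simple constraints
    private_topics: set = field(default_factory=set)     # content-based constraints
    private_groups: list = field(default_factory=list)   # association-based: sets of doc ids
    release_rules: dict = field(default_factory=dict)    # release: {X: Y}, Y is private once X is out
    released: set = field(default_factory=set)           # doc ids released so far

    def visible_attributes(self, doc):
        """Simple constraint: drop attributes that are marked private."""
        return {k: v for k, v in doc.items() if k not in self.private_attrs}

    def may_release(self, doc_id, topics=frozenset()):
        # Content-based: any private topic makes the whole document private.
        if self.private_topics & set(topics):
            return False
        # Release constraint: doc_id became private when its trigger was released.
        if any(trigger in self.released and target == doc_id
               for trigger, target in self.release_rules.items()):
            return False
        # Association-based: refuse the release that would complete a private group.
        for group in self.private_groups:
            if doc_id in group and group - {doc_id} <= self.released:
                return False
        return True

    def release(self, doc_id, topics=frozenset()):
        if not self.may_release(doc_id, topics):
            return False
        self.released.add(doc_id)
        return True


controller = PrivacyController(
    private_attrs={"ssn"},
    private_topics={"diagnosis"},
    private_groups=[frozenset({"travel", "payments"})],
    release_rules={"memo1": "memo2"},
)
print(controller.release("travel"), controller.release("payments"))
```

Keeping the history of released documents inside the controller is what lets it enforce the association-based and release constraints, which depend on earlier answers rather than on the current request alone.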
Some Research Directions: Encryption for Privacy • Encryption at various levels • Encrypting the data as well as the results of data mining • Encryption for multi-party computation • Encryption for untrusted third party publishing • Owner enforces privacy policies • Publisher gives the user only those portions of the document he/she is authorized to access • Combination of digital signatures and Merkle hash to ensure privacy
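The Merkle hash idea in the last bullet can be sketched as follows, assuming SHA-256 and a binary tree in which an odd-sized level duplicates its last hash (both are illustrative choices). The owner would sign only the root; that signature step is omitted here. A user who receives one authorized portion plus a short proof can check it against the root without seeing the other portions. The function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """Return every tree level, from (padded) leaf hashes up to the single root hash."""
    level = [h(leaf) for leaf in leaves]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]      # pad odd-sized levels by repeating the last hash
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_proof(levels, index):
    """Collect the sibling hashes needed to recompute the root from leaf number index."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))   # (hash, sibling_is_on_the_left)
        index //= 2
    return proof


def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


portions = [b"public summary", b"billing details", b"medical notes"]
levels = merkle_levels(portions)
root = levels[-1][0]                      # the owner would sign this value
proof = merkle_proof(levels, 0)           # publisher sends portion 0 plus this proof
print(verify(portions[0], proof, root))
```

The proof reveals only hashes of the withheld portions, so the user can confirm that the received portion is authentic without learning the content of the rest.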
Some Research Directions: Methodology for Designing Privacy Systems • Jointly develop privacy policies with policy specialists • Specification language for privacy policies • Generate privacy constraints from the policy and check for consistency of constraints • Develop a privacy model • Privacy architecture that identifies privacy critical components • Design and develop privacy enforcement algorithms • Verification and validation
Data Mining and Privacy: Friends or Foes? • They are neither friends nor foes • Need advances in both data mining and privacy • Need to design flexible systems • For some applications one may have to focus entirely on “pure” data mining while for some others there may be a need for “privacy-preserving” data mining • Need flexible data mining techniques that can adapt to the changing environments • Technologists, legal specialists, social scientists, policy makers and privacy advocates MUST work together
Some NSF Projects addressing Privacy • Privacy-preserving data mining • Distributed data mining techniques to replicate or approximate the results of centralized data mining, with quantifiable limits on the disclosure of data from each • Privacy for Supply Chain Management • Secure Supply-Chain Collaboration protocols to enable supply-chain partners to cooperatively achieve desired system-wide goals without revealing any private information, even though the jointly-computed decisions may depend on the private information of all the parties • Privacy Model • Model for privacy based on secure query protocol, encryption and database organization with little trust on the client or server
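As one concrete instance of computing a joint result without revealing each party's input, here is a minimal sketch of a secure sum based on additive secret sharing. It is not taken from the projects listed above; it assumes honest-but-curious parties, simulates all of them in one process, and uses the hypothetical names `make_shares`, `secure_sum`, and `MODULUS`.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible total


def make_shares(value, n_parties):
    """Split value into n random-looking shares that add up to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the protocol: no single share or partial sum reveals one party's value."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j receives the j-th share from every party and publishes only their sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partial_sums) % MODULUS


# Three sites each hold a private count, for example local support for an itemset.
print(secure_sum([12, 30, 7]))
```

Each published partial sum is uniformly random on its own, so only the final total is learned, which is the kind of quantifiable disclosure limit the first project describes.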
Other Ideas and Directions? • Please contact • Dr. Bhavani Thuraisingham The National Science Foundation Suite 1115 4201 Wilson Blvd Arlington, VA 22230 Phone: 703-292-8930 Fax 703-292-9037 email: bthurais@nsf.gov
Technologies for Privacy Latanya Sweeney, Ph.D., Assistant Professor of Computer Science, Technology and Policy, School of Computer Science, Carnegie Mellon University latanya @ privacy.cs.cmu.edu http://privacy.cs.cmu.edu/people/sweeney/index.html
Address 4 Questions • Concerns with data mining • How real are those concerns • What the data mining community is doing to address those concerns • What more needs to be done L. Sweeney. Navigating Computer Science Research Through Waves of Privacy Concerns. 2003. http://privacy.cs.cmu.edu/index.html
Address 4 Questions • Concerns with data mining: demand for person-specific data • How real are those concerns: explosion in collected information; individual bears risks and harms • What the data mining community is doing: privacy-preserving data mining too limited • What more needs to be done: construct technology with provable guarantees of privacy protection (privacy technology)
Privacy Technology Center Core People: Anastassia Ailamaki, Chris Atkeson, Guy Blelloch, Manuel Blum, Jamie Callan, Jaime Carbonell, Kathleen Carley, Robert Collins, Lorrie Cranor, Samuel Edoho-Eket, Maxine Eskenazi, Scott Fahlman, David Farber, David Garlan, Ralph Gross, Alex Hauptmann, Takeo Kanade, Bradley Malin, Bruce Maggs, Tom Mitchell, Norman Sadeh, William Scherlis, Jeff Schneider, Henry Schneiderman, Michael Shamos, Mel Siegel, Daniel Siewiorek, Asim Smailagic, Peter Steenkiste, Scott Stevens, Latanya Sweeney, Katia Sycara, Robert Thibadeau, Howard Wactlar, Alex Waibel
Emerging Technologies with Privacy Concerns 1. Face recognition, Biometrics (DNA, fingerprints, iris, gait) 2. Video Surveillance, Ubiquitous Networks (Sensors) 3. Semantic Web, “Data Mining,” Bio-Terrorism Surveillance 4. Professional Assistants (email and scheduling), Lifelog recording 5. E911 Cell Phones, IR Tags, GPS 6. Personal Robots, Intelligent Spaces, CareMedia 7. Peer to peer Sharing, Spam Blockers, Instant Messaging 8. Tutoring Systems, Classroom Recording, Cheating Detectors 9. DNA sequences, Genomic data, Pharmaco-genomics
Ubiquitous Data Sharing: Benefits and Concerns 3. Semantic Web, “Data Mining,” Bio-Terrorism Surveillance Benefits: • Counter terrorism surveillance may improve safety. • Bio-Terrorism surveillance can save lives by early detection of a biological agent and naturally occurring outbreaks. • Semantic web enables more powerful computer uses Privacy concerns: • Erosion of civil liberties • Illegal search from law-enforcement “mining” cases • Patient privacy may render healthcare less effective. • Access to uncontrolled and unprecedented amounts of data • Collected data can be used for other government purposes
1. Concerns with Data Mining A. Video, wiretapping and surveillance B. Civil liberties, illegal search C. Medical privacy D. Employment, workplace privacy E. Educational records privacy F. Copyright law “data mining” ubiquitous data sharing, increased demand for person-specific data to realize potential benefits from algorithms
Definition. Privacy Privacy reflects the ability of a person, organization, government, or entity to control its own space, where the concept of space (or “privacy space”) takes on different contexts • Physical space, against invasion • Bodily space, medical consent • Computer space, spam • Web browsing space, Internet privacy
Definition. Data Privacy When privacy space refers to the fragments of data one leaves behind as a person moves through daily life, the notion of privacy is called data privacy. • No control or ownership • Historically dictated by policy and laws • Today’s technically empowered society overtaxes past approaches
Exponential Growth in Data Collected [Figure: growth in active web servers and growth in available disk storage; year marks 1991, 1993 (First WWW conference), 1996, 2001]
Linking to Re-identify Data • Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge • Voter List: Name, Address, Date registered, Party affiliation, Date last voted • Fields in both: ZIP, Birth date, Sex L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.
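The linking attack on this slide can be sketched as a join on the fields the two data sets share. The rows below are invented for illustration, and the function name `link` and the dictionary keys are hypothetical; real data would need normalization of dates and ZIP codes first.

```python
from collections import defaultdict

QUASI_IDENTIFIER = ("zip", "birth_date", "sex")   # the fields present in both data sets


def link(medical_rows, voter_rows, keys=QUASI_IDENTIFIER):
    """Return (name, diagnosis) pairs for medical rows that match exactly one voter."""
    voters_by_key = defaultdict(list)
    for voter in voter_rows:
        voters_by_key[tuple(voter[k] for k in keys)].append(voter)

    matches = []
    for record in medical_rows:
        candidates = voters_by_key.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:                  # unambiguous match: re-identified
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


# Invented example rows; the medical data carries no names.
medical = [
    {"zip": "37213", "birth_date": "1960-09-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "37215", "birth_date": "1971-02-03", "sex": "M", "diagnosis": "fracture"},
]
voters = [
    {"name": "Ann Example", "zip": "37213", "birth_date": "1960-09-14", "sex": "F"},
    {"name": "Bob Sample", "zip": "37215", "birth_date": "1971-02-03", "sex": "M"},
    {"name": "Cal Sample", "zip": "37215", "birth_date": "1971-02-03", "sex": "M"},
]
reidentified = link(medical, voters)
print(reidentified)
```

The second medical row stays ambiguous because two voters share its ZIP, birth date, and sex, which is why the uniqueness of that combination matters so much.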
{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of the USA population.
What More Needs to Be Done Our approach: the Privacy Technology Center proactively constructs privacy technology with provable guarantees of privacy protection while allowing society to collect and share person-specific information for many worthy purposes.
Some Privacy Technology Solutions • Face de-identification • Self-controlling data • Video abstraction • CertBox (“privacy appliance”) • Reasonable cause (“selective revelation”) • Distributed surveillance • Privacy and context awareness (“eWallet”) • Data valuation by simulation • Roster collocation networks • Video and sound opt-out • Text anonymizer • Privacy agent • Blocking devices • Point location query restriction
k-Same Face De-identification Privacy Compliance: No matter how good face recognition software may become, it will not be able to reliably re-identify k-Same’d faces. Warranty: The resulting data remain useful for identifying suspicious behavior and identifying basic characteristics. E. Newton, L. Sweeney, and B. Malin. Preserving Privacy by De-identifying Facial Images. Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CS-03-119. Pittsburgh: 2003. http://privacy.cs.cmu.edu/people/sweeney/video.html
Example of k-Same Faces for Varying k [Figure: k-Same-Pixel and k-Same-Eigen output faces for k = 2, 3, 5, 10, 50, 100]
Performance of k-Same Algorithm for varying values of k Upper-bound on Recognition Performance = 1/k
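The averaging idea behind the 1/k bound can be sketched as below. This is a simplified illustration, not the exact procedure from the cited technical report: faces are plain pixel vectors, closeness is Euclidean distance, and the function name `k_same` is hypothetical. Every released face is the average of at least k originals, so a recognizer that matches a released face to an original can be right for at most one of those k people.

```python
import math


def k_same(faces, k):
    """Replace each face vector with the average of a group of at least k similar faces."""
    if len(faces) < k:
        raise ValueError("need at least k faces")
    remaining = list(range(len(faces)))
    result = [None] * len(faces)
    while remaining:
        # Fold a short tail into the last group so that no group has fewer than k members.
        size = len(remaining) if len(remaining) < 2 * k else k
        seed = remaining[0]
        group = sorted(remaining, key=lambda i: math.dist(faces[seed], faces[i]))[:size]
        average = [sum(pixels) / size for pixels in zip(*(faces[i] for i in group))]
        for i in group:
            result[i] = average
        remaining = [i for i in remaining if i not in group]
    return result


# Tiny two-pixel "faces" as stand-ins for real images.
faces = [[0, 0], [1, 1], [10, 10], [11, 11], [12, 12]]
deidentified = k_same(faces, k=2)
print(deidentified)
```

Averaging raw pixels corresponds loosely to the pixel variant named on the slides; averaging in an eigenface space would follow the same grouping logic with a different vector representation.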
Some Attempts that Don’t Work! [Figure panels: Single Bar Mask; T-Mask; Black Blob; Mouth Only; Grayscale; Black & White; Ordinal Data; Threshold; Pixelation; Negative; Grayscale; Black & White; Random; Grayscale; Black & White; Mr. Potato Head]
Legal Flow of Medical Data for Surveillance [Diagram labels: Hospitals, Labs, Physician Offices; Public Health; Surveillance Systems; Public Health Law; HIPAA; Explicitly Identified by Name, etc.; Scientifically de-identified; “No” risk!]
De-identified Data through a “Privacy Wall” Generated in Real-Time by a “CertBox” Data are de-identified automatically by a tamper-resistant system specific to the data and the task, called a “CertBox.” [Diagram labels: Explicitly Identified by Name, etc.; CertBox; Scientifically de-identified; Public Health]
Risk of Re-identification A re-identification results when a record in a sample from the Bio-Surveillance Datastream can reasonably be related to the patient who is the subject of the record in such a way that direct and rather specific communication with the patient is possible. [Diagram labels: Ann; 9/1960; “9/1960 F 37213”; Public Health]
Measuring Identifiability Identifiability estimates, in graduated sized groupings, the number of people to which a released record is apt to refer. These groupings are called binsizes. • Binsize of 1: only 1 person is green with that shape head. • Binsize of 2: 2 people are gray with that shape head. [Diagram labels: Population (Hal, Jim, Gil, Ken, Len, Mel); Release]
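The binsize notion can be sketched as a count of how many people in the population share the values of a released record. The names come from the slide, but the colors and head shapes assigned to them here are invented, and the function name `binsizes` is hypothetical.

```python
from collections import Counter


def binsizes(released, population, fields):
    """For each released record, count the people in the population it could refer to."""
    counts = Counter(tuple(person[f] for f in fields) for person in population)
    return [counts[tuple(record[f] for f in fields)] for record in released]


# Invented attributes for the six people named on the slide.
population = [
    {"name": "Hal", "color": "green", "head": "round"},
    {"name": "Jim", "color": "gray", "head": "round"},
    {"name": "Gil", "color": "gray", "head": "square"},
    {"name": "Ken", "color": "gray", "head": "square"},
    {"name": "Len", "color": "blue", "head": "round"},
    {"name": "Mel", "color": "blue", "head": "square"},
]
release = [
    {"color": "green", "head": "round"},   # only Hal fits
    {"color": "gray", "head": "square"},   # Gil and Ken both fit
]
sizes = binsizes(release, population, fields=("color", "head"))
print(sizes)
```

A binsize of 1 marks a record that points to a single person, so a release policy could require every record to reach some minimum binsize before it leaves the system.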
Risk Assessment Server The Risk Assessment Server identifies which fields and/or records in the Bio-surveillance Datastream are vulnerable to known re-identification inference strategies. The output of the assessment server is a report on the identifiability of the Bio-surveillance Datastream (not just the sample) with respect to those inference strategies. [Diagram labels: Sample from Bio-Surveillance Datastream; Assessment Engine; Inferences; Population Models; computation models; Profile of Databases] The Risk Assessment Server is licensed to Computer Information Technology Corp. (CIT). Diagram is courtesy of CIT. All rights reserved.
CertBox Contains PrivaCert™ Raw data Scientifically de-identified PrivaCert™ Rule-based system custom to data assessment
Reasonable Cause (“Selective Revelation”) [Diagram axes: Datafly Identifiability 0..1; Detection Status 0..1. Identifiability levels: Sufficiently anonymous, Sufficiently de-identified, Identifiable, Readily identifiable, Explicitly identified. Detection status levels: Normal operation, Unusual activity, Suspicious activity, Outbreak suspected, Outbreak detected. Additional label: Gross overview]
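Selective revelation can be sketched as a mapping from detection status to the most identifiable form of data that may be released. The slide does not state how the two scales line up, so the one-to-one pairing and the equal-width bands below are assumptions made only for illustration; `LEVELS` and `allowed_identifiability` are hypothetical names.

```python
# Assumed pairing of detection status with identifiability level (not stated on the slide).
LEVELS = [
    ("Normal operation", "Sufficiently anonymous"),
    ("Unusual activity", "Sufficiently de-identified"),
    ("Suspicious activity", "Identifiable"),
    ("Outbreak suspected", "Readily identifiable"),
    ("Outbreak detected", "Explicitly identified"),
]


def allowed_identifiability(detection_status):
    """Map a detection status in [0, 1] to the identifiability level that may be released."""
    if not 0.0 <= detection_status <= 1.0:
        raise ValueError("detection status must lie in [0, 1]")
    # Equal-width bands are an assumption; 1.0 falls into the last band.
    index = min(int(detection_status * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index][1]


for status in (0.0, 0.5, 1.0):
    print(status, allowed_identifiability(status))
```

The point of the design is that more identifying detail is unlocked only as the evidence of a problem grows, so routine monitoring runs on the least identifiable data.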
Address 4 Questions • Concerns with data mining: demand for person-specific data • How real are those concerns: explosion in collected information; individual bears risks and harms • What the data mining community is doing: privacy-preserving data mining too limited • What more needs to be done: construct technology with provable guarantees of privacy protection (privacy technology)
Perceived Concerns • Data mining lets you find out about my private life • I don’t want (you, my insurance company, the government) knowing everything • Data mining doesn’t always get it right • I don’t want to be put in jail because data mining said so • I don’t want to be denied a (credit, a job, insurance) because data mining said so