De-anonymizing Data. Source: http://xkcd.org/834/. CompSci 590.03, Lecture 2 (Fall 12). Instructor: Ashwin Machanavajjhala.
Announcements • Project ideas will be posted on the site by Friday. • You are welcome to send me (or talk to me about) your own ideas.
Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks
Personal Big-Data [Diagram: individuals (Person 1 … Person N) contribute records r1 … rN to data collectors such as Google, the Census, and hospitals; the resulting databases (DB) serve information retrieval researchers, recommendation algorithms, medical researchers, doctors, and economists.]
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • The Governor of MA was uniquely identified using ZIP code, birth date, and sex. • His name was thereby linked to his diagnosis. [Diagram: Medical Data attributes (Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge) and Voter List attributes (Name, Address, Date Registered, Party affiliation, Date last voted) overlap on Zip, Birth date, and Sex.]
The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • 87% of the US population is uniquely identified by the combination of Zip, Birth date, and Sex; this combination acts as a quasi-identifier linking the Medical Data to the Voter List. (A toy sketch of the linkage follows.)
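To make the linkage concrete, here is a minimal sketch of a join on the quasi-identifier {Zip, Birth date, Sex}. The records, field names, and values below are made up for illustration and are not from Sweeney's study:

```python
# Toy linkage attack: join a "de-identified" medical table with a public voter
# list on the quasi-identifier (zip, dob, sex). All values here are fictional.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "heart disease"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "Person A", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "Person B", "zip": "02139", "dob": "1980-03-02", "sex": "F"},
]

def quasi_id(rec):
    return (rec["zip"], rec["dob"], rec["sex"])

medical_by_qid = {}
for m in medical:
    medical_by_qid.setdefault(quasi_id(m), []).append(m)

for v in voters:
    matches = medical_by_qid.get(quasi_id(v), [])
    if len(matches) == 1:  # a unique match re-identifies the medical record
        print(v["name"], "->", matches[0]["diagnosis"])
```

Running this prints the one voter whose quasi-identifier matches exactly one medical record, together with that record's diagnosis.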
Statistical Privacy (Trusted Collector) Problem [Diagram: individuals 1 … N send records r1 … rN to a trusted server holding DB.] Requirements: utility for the analysts, and privacy (no breach about any individual).
Statistical Privacy (Untrusted Collector) Problem [Diagram: individuals 1 … N apply a randomizing function f(·) to their records r1 … rN before sending them to the (untrusted) server, which collects the perturbed DB.]
Randomized Response • Flip a coin: heads with probability p, tails with probability 1 – p (p > ½). • Answer the question according to the coin flip; in the standard variant, heads means report your true answer and tails means report the opposite answer. (See the sketch below.)
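The following is a minimal sketch of that variant (truthful answer on heads, opposite answer on tails) together with the standard unbiased estimator of the true "yes" fraction; the function and parameter names are mine, not the lecture's:

```python
import random

def randomized_response(truth: bool, p: float) -> bool:
    # Heads (probability p): report the true answer; tails: report the opposite.
    return truth if random.random() < p else not truth

def estimate_true_fraction(reports, p):
    # E[reported yes] = p*pi + (1-p)*(1-pi), so invert to recover pi.
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

# Example: 100,000 people, 30% of whom truly answer "yes", coin bias p = 0.75.
random.seed(0)
truths = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(t, 0.75) for t in truths]
print(estimate_true_fraction(reports, 0.75))  # close to 0.30
```

No individual report reveals the person's true answer with certainty, yet the aggregate estimate converges to the true fraction.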
Query Answering [Diagram: individuals 1 … N contribute records r1 … rN to a hospital's database DB; analysts pose queries such as "How many allergy patients?" or "Correlate genome to disease," and the server answers them directly.]
Query Answering • The server needs to know the list of questions up front. • Each answer leaks some information about individuals; after answering a few questions, the server exhausts its privacy budget and cannot answer any more. • We will see this in detail later in the course. (A toy sketch of the budget idea follows.)
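As a toy illustration of the budget idea only (not the formal mechanism defined later in the course; the class name, the Laplace noise, and the budget accounting below are illustrative assumptions):

```python
import random

class PrivateQueryServer:
    # Toy server: answers counting queries with Laplace noise and refuses to
    # answer once the cumulative privacy budget is spent.
    def __init__(self, records, total_budget=1.0):
        self.records = records
        self.remaining = total_budget

    def count(self, predicate, epsilon=0.1):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for r in self.records if predicate(r))
        # Difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon) noise.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

server = PrivateQueryServer([{"allergy": True}, {"allergy": False}, {"allergy": True}])
print(server.count(lambda r: r["allergy"]))  # noisy count of allergy patients
```

Each call spends part of the budget; once it is gone, further queries are refused rather than answered.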
Anonymous/Sanitized Data Publishing [Diagram: individuals 1 … N contribute records r1 … rN to the hospital's DB; an analyst (image: writingcenterunderground.wordpress.com) says "I won't tell you what questions I am interested in!"]
Anonymous/Sanitized Data Publishing [Diagram: the hospital publishes a sanitized database DB' derived from DB; any number of questions can then be answered directly on DB' without further modification.]
Today's class • Identifying individual records and their sensitive values from published data when the sanitization is insufficient.
Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks
Terms • Coin tosses of an algorithm (its internal randomness) • Union Bound (stated below) • Heavy Tailed Distribution
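For reference, standard statements of two of these terms (the slide only names them; the heavy-tail condition given here is one common definition):

```latex
\text{Union bound: } \Pr\Big[\bigcup_{i=1}^{N} A_i\Big] \;\le\; \sum_{i=1}^{N}\Pr[A_i].
\qquad
\text{Heavy tails: } \lim_{x\to\infty} e^{\lambda x}\,\Pr[X > x] = \infty \ \text{ for all } \lambda > 0.
```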
Terms (contd.) • Heavy Tailed Distribution [Plots: the Normal distribution is not heavy tailed; the Laplace and Zipf distributions are heavy tailed.]
Terms (contd.) • Cosine Similarity: cos θ between two vectors, i.e., their dot product divided by the product of their norms (see the sketch below). • Collaborative filtering: the problem of recommending new items to a user based on their ratings of previously seen items.
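A minimal sketch of cosine similarity on sparse rating vectors; representing each user's ratings as a dictionary from item to rating is an assumption for illustration:

```python
import math

def cosine_similarity(x, y):
    """cos(theta) between two sparse rating vectors stored as {item: rating} dicts."""
    common = set(x) & set(y)
    dot = sum(x[i] * y[i] for i in common)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

print(cosine_similarity({"m1": 5, "m2": 3}, {"m1": 4, "m3": 2}))
```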
Netflix Dataset [Diagram: a users × movies matrix; each record r is a row (a user), each column/attribute is a movie, and each non-null entry is a rating plus a timestamp.]
Definitions • Support: the set (or number) of non-null attributes in a record or column. • Similarity: Sim(r1, r2), informally, a measure of how closely two records agree on their non-null attributes. • Sparsity: informally, a dataset is sparse if, with high probability, no record is very similar to any other record.
Adversary Model • Aux(r): the adversary's auxiliary information, a (possibly noisy) subset of the attributes of the target record r.
Privacy Breach • Definition 1: An algorithm A, given Aux(r), outputs a record r' such that Pr[Sim(r, r') ≥ θ] ≥ ω; the dataset is then said to be (θ, ω)-de-anonymized. • Definition 2 (when only a sample of the dataset is input): A outputs either a candidate record r' or "not in sample," and this answer satisfies the same guarantee with probability at least ω.
Algorithm Scoreboard • For each record r', compute Score(r', aux) as the minimum similarity between an attribute in aux and the same attribute in r'. • Pick the r' with the maximum score, OR • Return all records with Score > α. (A sketch follows.)
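A minimal sketch of Scoreboard as described above; records are dictionaries from attribute to value, and the per-attribute similarity here is a simple 0/1 exact match, which is a simplifying assumption rather than the paper's actual similarity function:

```python
def attr_sim(a, b):
    # Per-attribute similarity; placeholder 0/1 exact match.
    return 1.0 if a is not None and a == b else 0.0

def score(record, aux):
    # Score = minimum per-attribute similarity over the attributes present in aux.
    return min(attr_sim(record.get(i), v) for i, v in aux.items())

def scoreboard(dataset, aux, alpha=None):
    # dataset: {record_id: {attribute: value}}, aux: {attribute: value}.
    scores = {rid: score(rec, aux) for rid, rec in dataset.items()}
    if alpha is not None:
        return [rid for rid, s in scores.items() if s > alpha]  # all candidates above threshold
    return max(scores, key=scores.get)                          # single best match
```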
Analysis Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes with m > log(N/ε) / log(1/(1 – δ)), then Scoreboard returns a record r' such that Pr[Sim(Aux, r') > 1 – ε – δ] > 1 – ε.
Proof of Theorem 1 • Call r' a false match if Sim(Aux, r') < 1 – ε – δ. • For any false match and any single randomly chosen attribute i of Aux, Pr[Sim(Aux_i, r'_i) > 1 – ε] < 1 – δ. • Sim(Aux, r') = min_i Sim(Aux_i, r'_i), so over the m independently chosen attributes, Pr[Sim(Aux, r') > 1 – ε] < (1 – δ)^m. • By the union bound over the N records, Pr[some false match has similarity > 1 – ε] < N(1 – δ)^m. • N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)).
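As a purely hypothetical numeric instantiation of this bound (the numbers are chosen for illustration and are not from the slides): with N = 500,000 records, ε = 0.05, and δ = 0.5,

```latex
m \;>\; \frac{\log(N/\epsilon)}{\log\big(1/(1-\delta)\big)}
  \;=\; \frac{\ln(500{,}000/0.05)}{\ln 2}
  \;=\; \frac{\ln 10^{7}}{0.693}
  \;\approx\; 24,
```

so on the order of two dozen known attributes already suffice to rule out all false matches with probability at least 1 – ε.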
Other results • If the dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-de-anonymized. • Analogous results hold when a list of candidate records is returned.
Netflix Dataset • A slightly different algorithm is used for the Netflix data (a sketch follows).
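The slide does not spell the variant out. In the published Netflix paper, the score is a weighted sum that gives statistically rare movies more weight, and the best match is accepted only if it stands out from the second-best score (the "eccentricity" test). The sketch below follows that reading; the threshold value and the per-attribute similarity are placeholders, not the paper's exact choices:

```python
import math
import statistics

def attr_sim(a, b):
    # Placeholder 0/1 match; the paper uses thresholded rating and date similarity.
    return 1.0 if a is not None and a == b else 0.0

def netflix_score(record, aux, support):
    # Rare movies (small support) get larger weight 1/log(support).
    # Assumes support[i] >= 2 so the logarithm is positive.
    return sum(attr_sim(record.get(i), v) / math.log(support[i])
               for i, v in aux.items())

def best_match(dataset, aux, support, phi=1.5):
    # dataset must contain at least two records for the eccentricity test.
    scores = {rid: netflix_score(rec, aux, support) for rid, rec in dataset.items()}
    ranked = sorted(scores.values(), reverse=True)
    sigma = statistics.pstdev(ranked)
    # Eccentricity test: accept only if the top score is well separated.
    if sigma > 0 and (ranked[0] - ranked[1]) / sigma >= phi:
        return max(scores, key=scores.get)
    return None  # no confident match
```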
Summary of Netflix Paper • An adversary can use a subset of the ratings made by a user to uniquely identify that user's record in the "anonymized" dataset with high probability. • The simple Scoreboard algorithm provably guarantees identification of records. • A variant of Scoreboard can de-anonymize the Netflix dataset. • The algorithms are robust to noise in the adversary's background knowledge.
Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks
Social Network Data • Social networks: graphs where each node represents a social entity and each edge represents a relationship between two entities. • Examples: email communication graphs, social interactions on Facebook, Yahoo! Messenger, etc.
Anonymizing Social Networks • Naïve anonymization: remove the label of each node and publish only the structure of the network. • Information leaks: nodes may still be re-identified based on the network structure alone. [Figure: example graph with nodes Alice, Bob, Cathy, Diane, Ed, Fred, Grace.]
Passive Attacks on an Anonymized Network • Consider the email communication graph in the figure (nodes Alice, Bob, Cathy, Diane, Ed, Fred, Grace). • Each node represents an individual. • Each edge between two individuals indicates that they have exchanged emails.
Passive Attacks on an Anonymized Network • Alice has sent emails to three individuals only. • Only one node in the anonymized network has degree three. • Hence, Alice can re-identify herself.
Passive Attacks on an Anonymized Network • Cathy has sent emails to five individuals. • Only one node has degree five. • Hence, Cathy can re-identify herself.
Passive Attacks on an Anonymized Network • Now suppose Alice and Cathy share their knowledge about the anonymized network. • What can they learn about the other individuals?
Passive Attacks on an Anonymized Network • First, Alice and Cathy know that only Bob has sent emails to both of them. • Hence, Bob can be identified.
Passive Attacks on an Anonymized Network • Alice has sent emails to Bob, Cathy, and Ed only. • Hence, Ed can be identified.
Passive Attacks on an Anonymized Network • Alice and Cathy can thus also learn that Bob and Ed are connected, as sketched below.
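A small sketch of the degree-uniqueness step of this attack; the edge list below is made up for illustration and is not the slide's figure:

```python
from collections import Counter

# Hypothetical anonymized edge list (node labels removed, numeric ids only).
edges = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6), (3, 7), (1, 6), (1, 3)]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Nodes whose degree is unique in the graph can be re-identified by anyone
# who knows how many contacts the corresponding person has.
degree_counts = Counter(degree.values())
unique_degree_nodes = [n for n, d in degree.items() if degree_counts[d] == 1]
print(unique_degree_nodes)
```

Once two such re-identified individuals compare notes, the edges between their neighbors (e.g., the Bob-Ed link above) are exposed as well.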
Passive Attacks on an Anonymized Network • The above attack is based on knowledge of node degrees [Liu and Terzi, SIGMOD 2008]. • More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008]. • Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled [Pang et al., SIGCOMM CCR 2006].
Inferring Sensitive Values on a Network • Each individual has a single sensitive attribute. • Some individuals share their sensitive attribute publicly, while others keep it private. • GOAL: infer the private sensitive attributes using (i) links in the social network and (ii) groups that the individuals belong to. • Approach: learn a predictive model (think: a classifier) using the public profiles as training data [Zheleva and Getoor, WWW 2009].
Inferring Sensitive Values on a Network • BASELINE: predict the most commonly appearing sensitive value among all public profiles.
Inferring Sensitive Values on a Network • LINK: each node x has a list of binary features Lx, one for every node in the social network; Lx[y] = 1 if and only if (x, y) is an edge. • Train a model on all pairs (Lx, sensitive value of x) for the x's with public sensitive values. • Use the learned model to predict the private sensitive values. (A rough sketch follows.)
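A rough sketch of the LINK idea: adjacency rows as binary feature vectors fed to an off-the-shelf classifier. scikit-learn's logistic regression is used here purely as an example; the paper is not tied to that particular model, and the tiny graph and labels below are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def link_predict(adj, labels):
    """adj: n x n 0/1 adjacency matrix; labels[i] is node i's sensitive value,
    or None if the node keeps it private. Returns predictions for private nodes."""
    public = [i for i, y in enumerate(labels) if y is not None]
    private = [i for i, y in enumerate(labels) if y is None]
    model = LogisticRegression(max_iter=1000)
    model.fit(adj[public], [labels[i] for i in public])  # train on public profiles
    return dict(zip(private, model.predict(adj[private])))

# Tiny made-up example: 4 nodes, node 3's sensitive value is private.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]])
labels = ["a", "b", "b", None]
print(link_predict(adj, labels))
```

The point is that a node's friendship pattern alone can be predictive of the value it tried to keep private.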