The Dilemma of Disclosing Anonymized Data

To Do or Not To Do: The Dilemma of Disclosing Anonymized DataLakshmanan L, Ng R, Ramesh GUniv. of British Columbia Oren Fine Nov. 2008 CS Seminar in Databases (236826)

Once Upon a Time… • The police is after Edgar, a drug lord suspect. • Intel. has gathered calls & meetings data records as a transactional database • In order to positively frame Edgar, the police must find hard evidence, and wishes to outsource data mining tasks to “We Mind your Data Ltd.” • But, the police is subject to the law, and is obligated to keep the privacy of the people in the database – including Edgar, which is innocent until proven otherwise. • Furthermore, Edgar is seeking for the smallest hint to disappear…

I have the pleasure to introduce Edgar vs. The Police VS.

Motivation • The Classic Dilemma: • Keep your data close to your chest and never risk privacy or confidentiality or… • Disclose the data and gain potential valuable knowledge and benefits • In order to decide, we need to answer a major question • “Just how safe is the anonymized data?” • Safe = protecting the identities of the of the objects.

Agenda • Anonymization • Model the Attacker’s Knowledge • Determine the risk to our data

Anonymization or De-Identification • Transform sensitive data into generated unique content (strings, numbers) • Example

Anonymization or De-Identification • Advantages • Very simple • Does not affect final outcome or perturb data characteristics • We do not suggest that anonymization is the “right” way, but it is probably the most common

Frequent Set Mining Crash Course • Transactional database • Each transaction has TID and a set of items • An association rule of the form XY has • Support s if s% of the transactions include (X,Y) • Confidence c if c% of the transactions that include X also include Y • Support = frequent sets • Confidence = association rules • A k-itemset is a set of k items

Example

Example (Cont.) • First, we look for frequent sets, according to a support threshold • 2-itemsets: {Angela, Edgar}, {Edgar, Steve} have 50% support (4 out of 8 transactions). • 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions) • The rule {Edgar, Steve} {Angela} has 50% confidence (2 out 4 transactions) and the rule {Tommy} {Edgar, Steve} has 66.6% confidence.

Frequent Set Mining Crash Course (You’re Qualified!) • Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics • Aimed at discovering non trivial or not necessarily intuitive relation between items/variables of large databases“Extracting wisdom out of data” • Who knows what is the most famous frequent set?

Big Mart’s Database

Modeling the Attacker’s Knowledge • We believe that the attacker has prior knowledge about the items in the original domain • The prior information regards the frequencies of items in the original domain • We capture the attacker’s knowledge with “Belief Functions”

Examples of Belief Functions

Consistent Mapping • Mapping anonymized entities to original entities only according tothe belief function

Ignorant Belief Function (Q) • How does the graph look like? • What is the expected number of cracks? • Suppose n items. Further suppose that we are only interested in a partial group, of size n1 • What is the expected number of cracks now? • Don’t you underestimate Edgar…

Ignorant Belief Function (A)

Compliant Point-Valued Belief Function (Q) • How does the graph look like? • What is the expected number of cracks? • Suppose n items. Further suppose that we are only interested in a partial group, of size n1 • What is the expected number of cracks now? • Unless he has inner source, we shouldn’t overestimate Edgar either…

Compliant Point-Valued Belief Function (A)

Compliant IntervalBelief Functions • Direct Computation Method • Build a graph G and adjacency matrix AG • The probability of cracking k out of n items: • Computing the permanent is know to be #P-complete problem, state of the art approximation running time O(n22) !! • What the !#$!% is a permanent or #P-complete?

Permanent • A permanent of an n*n matrix is • The sum is over all permutations of 1,2,… • Calculating the permanent is #P-complete • Which brings us to…

#P-Complete • Unlike well known complexity classes which are of decision problems, this is a class of function problems • "compute f(x)," where f is the number of accepting paths of an NP machine • Example • NP: Are there any subsets of a list of integers that add up to zero? • #P: How many subsets of a list of integers add up to zero?

Chain Belief Functions

Unfortunately… • General Belief Function does not always produce a chain… • We seek for way to estimate the number of cracks.

The O-estimate Heuristic • Suppose Graph G, interval belief function β. • For each x, let Ox denote the outdegree of x in G. • The probability of cracking x is simply • The expected number of cracks is

Properties of O-estimate • Inexact (hence “estimate”) • Monotonic

-Compliant Belief Function • Suppose we “somehow” know which items are guessed wrong • We sum the O-estimates only over the compliant frequency groups

Risk Assessment • Worst case \ Best case – unrealistic • Determine the intervals width • Twice the median gap of all successive frequency groups • Why? • Determine the degree of compliancy • Perform binary search on , subject to a “degree of tolerance” – .

End to End Example • These Intel. Calls & Meeting DR are classified “Top Secret”

We Anonymize the Database

Frequency Groups • The gaps between the frequency groups: 1/8, 1/8, 1/8, 1/8, 2/8 • The median gap = 1/8

The Attacker’s Prior Knowledge

1 2 3 4 5 6 7 8 9 10 11 12 The Graph, By the Way… Angela Ariel Edgar Steve Benny Hassan Tommy Joe Sara Israel Noa Mahhmud

Calculating the Risk • Oest=1/4+1/7+1/3+1/4+1/7+1/9+1/7+ 1/9+1/9+1/7+1/7+1/7 = 2.023 • Now, it’s a question of how much would you tolerate... • Note, that this is the expected number of cracks. However, if we are interested in Edgar, as we’ve seen in previous lemmas, the probability of crack – 1/3.

Experiments

Open Problems • The attacker’s prior knowledge remains a largely unsolved issue • This article does not really deal with frequent sets but rather frequent items • Frequent sets can add more information and differentiate objects from one frequency group

Modeling the Attacker’s Knowledge in the Real World • In a report for the Canadian Privacy Commissioner appears a broad mapping of adversary knowledge • Mapping phone directories • CV’s • Inferring gender, year of birth and postal code from different details • Data remnants on 2nd hand hard disks • Etc.

סוף טוב, הכל טוב

Bibliography • Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005. • Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), Santiago, Chile, pp. 487–499. • Pan-Canadian De-Identification Guidelines for Personal Health Information, Khaled El-Emam et al., April 2007. • Wikipedia • Association rule • #P • Permanent

Questions ?

The Dilemma of Disclosing Anonymized Data

The Dilemma of Disclosing Anonymized Data

Presentation Transcript

Oct/Nov 2008

22 Nov 2008

CS 461 – Nov. 18

Bocconi - Duke Seminar Nov 24 th , 2008

CS 461 – Nov. 14

Fine-Grained Authorization in Databases

CS 101 – Nov. 9

CS 111 – Nov. 17

CS 101 – Nov. 23

Oct/Nov 2008

CS 101 – Nov. 30

CS 461 – Nov. 7

Nov 2008

CS 461 – Nov. 9

CS 461 – Nov. 2

CS 111 – Nov. 19

CS 101 – Nov. 11

CS 111 – Nov. 22