Hippocratic Data Management • Rakesh Agrawal • IBM Almaden Research Center
Thesis • We need information systems that • respect the privacy of data they manage AND • do not impede the useful flow of information. • It is feasible to reconcile the apparent contradiction
Outline • Why Privacy in Data Systems • Some Technology Directions • Some Challenging Problems
Drivers for Privacy • Privacy Surveys: • 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99) • 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001) • Government legislation & guidelines: • Fair Information Practices Act (US, 1974) • OECD Guidelines (Europe, 1980) • Canadian Standards Association’s Model Code (1995) • Australian Privacy Amendment (2000) • Japan: proposed legislation (2003) • HIPAA, GLB, recent U.S. federal & state initiatives
Privacy Violations • Accidents: • Kaiser, GlobalHealthtrax • Lax security: • Massachusetts govt. • Ethically questionable behavior: • Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food • Illegal: • Toysmart
Assertion • Enterprises lack tools and technologies for managing private data and enforcing privacy policies.
Founding Tenets of Current Database Systems • Ullman, “Principles of Database and Knowledgebase Systems” • Fundamental: • Manage persistent data. • Access a large amount of data efficiently. • Desirable: • Support for data model, high-level languages, transaction management, access control, and resiliency. • Similar list in other database textbooks.
Statistical & Secure Databases • Statistical Databases • Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, [AW89] • Multilevel Secure Databases • Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91] • Need to protect privacy in transactional databases that support daily operations. • Cannot restrict queries to statistical queries. • Cannot tag all the records “top secret”.
Our Research Directions • Privacy Preserving Data Mining • Hippocratic Databases
Data Mining and Privacy • The primary task in data mining: development of models about aggregated data. • Can we develop accurate models without access to precise information in individual data records? R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), May 2000.
Privacy Preserving Data Mining (figure) • Original records (e.g. 30 | 25K | …, 50 | 40K | …) each pass through a Randomizer, producing perturbed records (e.g. 65 | 50K | …, 35 | 60K | …) • Reconstruct Age Distribution, Reconstruct Salary Distribution • Data Mining Algorithm • Model
Reconstruction Problem • Original values $x_1, x_2, \ldots, x_n$ • from probability distribution X • To hide these values, we use $y_1, y_2, \ldots, y_n$ • from probability distribution Y • Given • $x_1+y_1, x_2+y_2, \ldots, x_n+y_n$ • the probability distribution of Y • Estimate the probability distribution of X.
Intuition (Reconstruct single point) • Use Bayes' rule for density functions
Reconstruction: Intuition • Combine estimates of where a point came from for all the points: • yields estimate of original distribution.
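The two intuition slides can be written out as a worked equation. This is a sketch under assumed notation: $w_i = x_i + y_i$ denotes the i-th observed (randomized) value and $f_X$, $f_Y$ denote the densities of X and Y; these symbols are introduced here for illustration and do not appear on the slides.

```latex
% Bayes' rule for a single point: posterior density of the original value
% given one observed sum w_i = x_i + y_i
f_X(a \mid X + Y = w_i)
  = \frac{f_Y(w_i - a)\, f_X(a)}
         {\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}

% Combining the single-point estimates over all n points
% gives an estimate of the original distribution
\hat{f}_X(a)
  = \frac{1}{n} \sum_{i=1}^{n}
    \frac{f_Y(w_i - a)\, f_X(a)}
         {\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
```

Since $f_X$ appears on the right-hand side, the estimate has to be computed iteratively, starting from an initial guess.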
Reconstruction Algorithm • $f_X^0$ := uniform distribution • j := 0 • repeat • $f_X^{j+1}(a)$ := Bayes’ rule update applied to $f_X^j$ • j := j+1 • until (stopping criterion met) • Converges to maximum likelihood estimate. • D. Agrawal & C.C. Aggarwal, PODS 2001.
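The loop above might be sketched as follows on a discretized domain. This is a minimal illustration, not the authors' implementation: the function name `reconstruct`, the grid discretization, the fixed iteration count used as the stopping criterion, and the uniform noise in the demo are all assumptions.

```python
import random

def reconstruct(w, f_y, grid, iterations=50):
    """Estimate the distribution of X on `grid` from randomized values w = x + y.

    f_y is the (known) density of the noise Y. The stopping criterion here is
    simply a fixed number of iterations (an assumption for this sketch).
    """
    p = [1.0 / len(grid)] * len(grid)          # f_X^0 := uniform distribution
    for _ in range(iterations):
        new = [0.0] * len(grid)
        for wi in w:
            # Bayes' rule: posterior over grid points for this observation
            post = [f_y(wi - a) * pa for a, pa in zip(grid, p)]
            total = sum(post)
            if total == 0:
                continue                        # observation unexplained by grid
            for k, v in enumerate(post):
                new[k] += v / total
        norm = sum(new)
        p = [v / norm for v in new]             # f_X^{j+1}
    return p

# Demo with hypothetical data: true values sit at 20 and 60,
# noise is uniform on [-10, 10].
rng = random.Random(0)
xs = [20.0] * 200 + [60.0] * 200
w = [x + rng.uniform(-10, 10) for x in xs]
grid = list(range(0, 90, 10))
f_y = lambda d: 0.05 if abs(d) <= 10 else 0.0
p = reconstruct(w, f_y, grid)
for a, pa in zip(grid, p):
    print(a, round(pa, 3))
```

The estimated mass should concentrate around the two true values even though no individual original value is ever seen by the estimator.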
Classification • Naïve Bayes • Assumes independence between attributes. • Decision Tree • Correlations are weakened by randomization.
Experimental Methodology • Compare accuracy against • Original: unperturbed data without randomization. • Randomized: perturbed data but without making any corrections for randomization. • Test data not randomized. • Synthetic data benchmark from [AGI+92]. • Training set of 100,000 records, split equally between the two classes.
So far… • Question: Can we develop accurate models without access to precise information in individual data records? • Answer: yes, by randomization. • for numerical attributes, classification • How about Association Rules?
Associations Recap • A transaction t is a set of items (e.g. books) • All transactions form a set T of transactions • Any itemset A has support s in T if $s = \#\{t \in T : A \subseteq t\} / |T|$ • Itemset A is frequent if $s \ge s_{\min}$ • Task: Find all frequent itemsets
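As a small illustration of these definitions, the sketch below computes supports and lists frequent itemsets by brute force. The function names and the toy transactions are hypothetical, and the exhaustive enumeration is chosen for clarity only; it is not an efficient mining algorithm.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, s_min, max_size=2):
    """Brute-force search for itemsets with support >= s_min (up to max_size)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            s = support(cand, transactions)
            if s >= s_min:
                result[frozenset(cand)] = s
    return result

# Toy data (hypothetical): four transactions over items a, b, c
T = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
freq = frequent_itemsets(T, s_min=0.5)
```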
The Problem • How to randomize transactions so that • we can find frequent itemsets • while preserving privacy at transaction level? • A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002.
Randomization Overview (figure) • Alice, Bob, and Chris each hold a transaction: Alice: J.S. Bach, painting, nasa.gov, …; Bob: B. Spears, baseball, cnn.com, …; Chris: B. Marley, camping, linux.org, … • Each transaction is randomized before it reaches the Recommendation Service: Alice's becomes Metallica, painting, nasa.gov, …; Bob's becomes B. Spears, soccer, bbc.co.uk, …; Chris's becomes B. Marley, camping, ibm.com, … • The Recommendation Service performs Support Recovery on the randomized transactions, mines Associations, and returns Recommendations.
Uniform Randomization • Given a transaction, • keep item with, say 20% probability, • replace with a new random item with 80% probability.
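A minimal sketch of this operator, assuming items are hashable values, the replacement item is drawn uniformly from a known item universe, and collisions are simply merged by the set. The function name and parameters are hypothetical.

```python
import random

def uniform_randomize(t, universe, p_keep=0.2, rng=random):
    """Keep each item with probability p_keep, else replace it with a random item.

    The replacement is drawn uniformly from `universe` (an assumption); because
    the result is a set, a replacement that collides with another item shrinks
    the randomized transaction.
    """
    universe = list(universe)
    return {x if rng.random() < p_keep else rng.choice(universe) for x in t}

# Usage with a hypothetical five-item universe
print(uniform_randomize({"x", "y", "z"}, ["x", "y", "z", "q", "r"]))
```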
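Example: {x, y, z} • 10 M transactions of size 10 with 10 K items
• 1% have {x, y, z}: probability that the randomized transaction contains {x, y, z} is $0.2^3$, i.e. 0.008% of all transactions (800 transactions, 97.8% of the randomized transactions containing {x, y, z})
• 5% have {x, y}, {x, z}, or {y, z} only: probability $0.2^2 \cdot 8/10{,}000$, i.e. 0.00016% (16 transactions, 1.9%)
• 94% have one or zero items of {x, y, z}: probability at most $0.2 \cdot (9/10{,}000)^2$, i.e. less than 0.00002% (2 transactions, 0.3%)
• Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one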
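The arithmetic behind this example can be checked with a few lines. The variable names are illustrative; the probabilities are the ones quoted on the slide, with the last one being an upper bound.

```python
N = 10_000_000          # number of transactions
p_keep = 0.2            # probability that an original item is kept

# (share of original transactions, probability that {x, y, z} appears after
#  randomization), one pair per case on the slide
cases = {
    "all three items":   (0.01, p_keep ** 3),
    "exactly two items": (0.05, p_keep ** 2 * 8 / 10_000),
    "one or zero items": (0.94, p_keep * (9 / 10_000) ** 2),  # upper bound
}

counts = {name: N * share * prob for name, (share, prob) in cases.items()}
total = sum(counts.values())

# Certainty that {x, y, z} was in the original transaction,
# given that it is seen in the randomized one
posterior = counts["all three items"] / total

for name, c in counts.items():
    print(f"{name}: about {c:.1f} transactions ({100 * c / total:.1f}%)")
```

Roughly 800 of the ~818 randomized transactions that contain {x, y, z} come from originals that really contained it, which is the 98% breach quoted above.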
Privacy Breach • Suppose: • t is an original transaction; • t’ is the corresponding randomized transaction; • A is a (frequent) itemset. • Definition: Itemset A causes a privacy breach of level ρ if, for some item z ∈ A, $\Pr[z \in t \mid A \subseteq t'] \ge \rho$
Our Solution “Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” G.K. Chesterton • Insert many false items into each transaction • Hide true itemsets among false ones Can we still find frequent itemsets while having sufficient privacy?
Cut and Paste Randomization • Given transaction t of size m, construct t’: • Choose a number j between 0 and $K_m$ (cutoff); • Include j items of t into t’; • Each other item is included into t’ with probability $p_m$. • The choice of $K_m$ and $p_m$ is based on the desired level of privacy. • Example (j = 4): t = a, b, c, u, v, w, x, y, z; t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
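The three steps might be sketched as below. Several details are assumptions rather than statements from the slide: j is drawn uniformly and capped at the transaction size, the j items are chosen uniformly without replacement, and "each other item" is read as every item of the universe not already chosen, including the remaining items of t. The function name and signature are hypothetical.

```python
import random

def cut_and_paste(t, universe, K_m, p_m, rng=random):
    """Randomize transaction t: keep j of its items, then add noise items."""
    items = list(t)
    # Step 1: choose the cutoff j between 0 and K_m (capped at |t|, assumed)
    j = min(rng.randint(0, K_m), len(items))
    # Step 2: include j items of t into t'
    kept = set(rng.sample(items, j))
    # Step 3: every other item enters t' independently with probability p_m
    noise = {x for x in universe if x not in kept and rng.random() < p_m}
    return kept | noise

# Usage with a hypothetical universe of 50 items
universe = [f"item{i}" for i in range(50)]
t = set(universe[:9])
print(sorted(cut_and_paste(t, universe, K_m=4, p_m=0.1, rng=random.Random(0))))
```

Larger $p_m$ hides the true items among more false ones, at the cost of noisier support estimates.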
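Partial Supports • To recover the original support of an itemset, we need randomized supports of its subsets. • Given an itemset A of size k and transaction size m, • A vector of partial supports of A is $\vec{s} = (s_0, s_1, \ldots, s_k)$, where $s_l = \frac{1}{|T|} \cdot \#\{t \in T : \#(t \cap A) = l\}$ • Here $s_k$ is the same as the support of A. • Randomized partial supports are denoted by $\vec{s}\,' = (s'_0, s'_1, \ldots, s'_k)$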
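Transition Matrix • Let k = |A|, m = |t|. • Transition matrix P = P(k, m) connects randomized partial supports with original ones: $E[\vec{s}\,'] = P \cdot \vec{s}$, where $\vec{s}$ and $\vec{s}\,'$ are the original and randomized partial-support vectors and $P_{l'l} = \Pr[\#(t' \cap A) = l' \mid \#(t \cap A) = l]$ • Randomized supports are distributed as a sum of multinomial distributions.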
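The Unbiased Estimators • Given randomized partial supports $\vec{s}\,'$, we can estimate original partial supports: $\vec{s}_{est} = Q \cdot \vec{s}\,'$, where $Q = P^{-1}$ • Covariance matrix for this estimator: $\mathrm{Cov}(\vec{s}_{est}) = \frac{1}{|T|} \sum_{l=0}^{k} s_l \, Q\, D[l]\, Q^{T}$, where $D[l]_{ij} = P_{il}\,\delta_{i=j} - P_{il} P_{jl}$ • To estimate it, substitute $s_l$ with $(s_{est})_l$. • Special case: estimators for support and its variance: $s_{est} = \sum_{l'=0}^{k} Q_{kl'}\, s'_{l'}$, $\mathrm{Var}(s_{est}) = \frac{1}{|T|} \sum_{l=0}^{k} s_l \left( \sum_{l'=0}^{k} Q_{kl'}^{2} P_{l'l} - \delta_{l=k} \right)$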
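To make the estimator concrete, the sketch below inverts a small transition matrix with exact rational arithmetic. The 3×3 matrix `P` (k = 2) and the support vector `s` are made-up illustrative values, not taken from the talk, and `solve` and `apply` are hypothetical helper names. In practice the randomized supports are measured from data, so the estimate only approximates the original supports.

```python
from fractions import Fraction as F

def solve(P, b):
    """Solve P x = b by Gauss-Jordan elimination over exact rationals."""
    n = len(P)
    M = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(P)]
    for col in range(n):
        # find a row with a nonzero pivot and move it into place
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def apply(P, s):
    """Matrix-vector product P . s."""
    return [sum(p * v for p, v in zip(row, s)) for row in P]

# Hypothetical transition matrix: entry [l'][l] is the probability that a
# transaction with l items of A has l' items of A after randomization.
# Each column sums to 1.
P = [[F(8, 10), F(3, 10), F(1, 10)],
     [F(2, 10), F(5, 10), F(3, 10)],
     [F(0),     F(2, 10), F(6, 10)]]

s = [F(7, 10), F(2, 10), F(1, 10)]   # original partial supports (made up)
s_rand = apply(P, s)                 # expected randomized partial supports
s_est = solve(P, s_rand)             # estimator: P^{-1} . s'
print([str(v) for v in s_est])
```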
Privacy Breach Analysis • How many added items are enough to protect privacy? • Have to satisfy $\Pr[z \in t \mid A \subseteq t'] < \rho$ (no privacy breaches) • Select parameters so that it holds for all itemsets. • Use formula ( ): • Parameters are to be selected in advance! • Construct a privacy-challenging test: an itemset all of whose subsets have the maximum possible support. • Enough to know the maximal support of an itemset for each size.
Lowest Discoverable Support • LDS is the support s.t., when predicted, it is $4\sigma$ away from zero. • Roughly, LDS is proportional to $1/\sqrt{|T|}$ • (Plot parameters: |t| = 5, ρ = 50%)
LDS vs. Breach Level (plot parameters: |t| = 5, |T| = 5 M) • Reminder: breach level ρ is the limit on $\Pr[z \in t \mid A \subseteq t']$
Real Datasets: soccer, mailorder • Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests. • 11 K items (HTMLs), 6.5 M transactions • Mailorder is a purchase dataset from a certain on-line store • Products are replaced with their categories • 96 items (categories), 2.9 M transactions
Results • Breach level = 50%. • Soccer: $s_{\min}$ = 0.2%; 0.07% for 3-itemsets • Mailorder: $s_{\min}$ = 0.2%; 0.05% for 3-itemsets
Summary • Can have our cake and mine it too! • Randomization is an interesting approach for building data mining models while preserving user privacy! • Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000. • S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002. • J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002.
The Hippocratic Oath “What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.” – Hippocratic Oath, 8 (circa 400 BC)
Hippocratic Databases • Founding tenet: Responsibility for the privacy of data they manage. • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), August 2002.
Approach • Derive founding principles from current privacy legislation. • Strawman Design
Ten Principles of Hippocratic Databases • Collection Group • Purpose Specification, Consent, Limited Collection • Use Group • Limited Use, Limited Disclosure, Limited Retention, Accuracy • Security & Openness Group • Safety, Openness, Compliance
Collection Group • Purpose Specification • For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information. • Consent • The purposes associated with personal information shall have consent of the donor (person whose information is being stored). • Limited Collection • The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.