Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation
Chris Giannella
cgiannel AT acm DOT org
Talk Outline
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation
• Wrap-up summary
An Example Problem
• The U.S. Census Bureau collects lots of data
• If released in raw form, this data would provide
  • a wealth of valuable information regarding broad population patterns
  • access to private information regarding individuals
How to allow analysts to extract population patterns without learning private information?
Privacy-Preserving Data Mining
“The study of how to produce valid mining models and patterns without disclosing private information.”
– F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006.
Several broad approaches exist … this talk: data transformation (the “census model”).
Data Transformation (the “Census Model”)
[Diagram: the researcher applies a transformation to the private data; the data miner sees only the transformed data.]
DT Objectives
• Minimize risk of disclosing private information
• Maximize the analytical utility of the transformed data
DT is also studied in the field of Statistical Disclosure Control.
Some things DT does not address…
• Preventing unauthorized access to the private data (e.g. hacking)
• Securely communicating private data
DT and cryptography are quite different. (Moreover, standard encryption does not solve the DT problem.)
Assessing Transformed Data Utility
• How accurately does a transformation preserve certain kinds of patterns, e.g.:
  • data mean, covariance
  • Euclidean distance between data records
  • underlying generating distribution?
• How useful are the patterns at drawing conclusions/inferences?
Assessing Privacy Disclosure Risk
• Some efforts in the literature to develop rigorous definitions of disclosure risk
  • no widely accepted agreement
• This talk will take an ad hoc approach:
  • for a specific attack, how closely can any private data record be estimated?
Talk Outline
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation
• Wrap-up summary
Some DT approaches
• Discussed in this talk:
  • Additive independent noise
  • Euclidean distance preserving transformation
    • My current research
• Others:
  • Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.
Additive Independent Noise
For each private data record (x1, …, xn), add independent random noise to each entry:
(y1, …, yn) = (x1 + e1, …, xn + en)
Each ei is generated independently as N(0, d · Var(i)), where Var(i) is the variance of attribute i.
Increasing d reduces privacy disclosure risk.
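A minimal sketch of this scheme in Python/NumPy (not from the talk; the data matrix X, the noise level d, and the function name are illustrative):

```python
import numpy as np

def additive_noise(X, d, rng=None):
    """Perturb each attribute with independent Gaussian noise.

    X : (records, attributes) private data matrix.
    d : noise multiplier; entry i gets e_i ~ N(0, d * Var(attribute i)).
    """
    rng = np.random.default_rng() if rng is None else rng
    var = X.var(axis=0)  # per-attribute variance
    return X + rng.normal(0.0, np.sqrt(d * var), size=X.shape)

# Larger d -> more distortion: lower disclosure risk, but lower utility.
X = np.random.default_rng(0).normal(50.0, 10.0, size=(1000, 3))
Y = additive_noise(X, d=0.5)
```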
Additive Independent Noise
• It is difficult to set d so as to achieve both low privacy disclosure risk and high data utility.
• Some enhancements on the basic idea exist, e.g., Muralidhar et al.
Talk Outline
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation (EDPDT)
• Wrap-up summary
EDPDT – High Data Utility!
• Many data clustering algorithms use Euclidean distance to group records, e.g.
  • K-means clustering, hierarchical agglomerative clustering
• If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data.
EDPDT – High Data Utility!
[Figure: side-by-side plots of the original data and the transformed data.]
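A small illustration (mine, not the talk's) of why utility is high: multiplying by a random orthogonal matrix leaves all pairwise Euclidean distances intact, so any distance-based clusterer sees the same geometry:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))  # 100 private records as columns

# A random orthogonal M: the Q factor of a Gaussian matrix's QR decomposition.
M, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = M @ X  # transformed data

# Pairwise distances between records are unchanged (up to float rounding).
assert np.allclose(pdist(X.T), pdist(Y.T))
```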
EDPDT – Unclear Privacy Disclosure Risk
• This is the focus of the research … the approach:
  • Develop attacks combining the transformed data with plausible prior knowledge.
  • How well can these attacks estimate private data records?
Two Different Prior Knowledge Assumptions
• Known input: the attacker knows a small subset of the private data records.
  • Focus of this talk.
• Known sample: the attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
  • Happy to discuss “off-line”.
Known Input Prior Knowledge
Underlying assumption: individuals (a) know whether there is a record for them among the private data records, and (b) know the attributes of the private data records.
• Each individual knows one private record.
• A small group of malicious individuals could cooperate to produce a small subset of the private data records.
Known Input Attack
Given:
• {Y1, …, Ym} (transformed data records)
• {X1, …, Xk} (known private data records)
1) Determine the transformation constraints, i.e., which transformed records came from which known private records.
2) Choose T randomly from the set of all distance preserving transformations that satisfy the constraints.
3) Apply T^-1 to the transformed data.
Known Input Attack – General Case
Y = M X
• Each column of X (resp. Y) is a private (resp. transformed) data record.
• M is an orthogonal matrix.
[Y_kn  Y_un] = M [X_known  X_unknown]
Attack: choose T randomly from {T an orthogonal matrix : T X_known = Y_kn} and produce T^-1(Y_un).
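A hedged sketch of steps 2–3 for the easiest case: when the known records span the whole attribute space, the constraint set contains essentially one T, which can be computed by orthogonal Procrustes (the function name and the full-rank assumption are mine, not the talk's):

```python
import numpy as np

def known_input_attack(Y_kn, Y_un, X_known):
    """Estimate the unknown private records from Y_un.

    Columns are records. Assumes X_known has full row rank, so the
    orthogonal T with T @ X_known = Y_kn is (essentially) unique.
    """
    # Orthogonal Procrustes: T = argmin ||T X_known - Y_kn||_F over orthogonal T.
    U, _, Vt = np.linalg.svd(Y_kn @ X_known.T)
    T = U @ Vt
    return T.T @ Y_un  # T^-1 = T.T for orthogonal T
```

With fewer known records than attributes, T is underdetermined, and the attack instead samples T randomly from the constraint set, as the slide describes.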
Known Input Attack – Experiments
• An 18,000-record, 16-attribute real data set.
• Given k known private data records, we computed Pk, the probability that the attack estimates one unknown private record with > 85% accuracy.
• P2 = 0.16, P4 = 1, …, P16 = 1
Wrap-Up Summary
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation
Thanks to …
• You: for your attention
• Kun Liu: joint research & some material used in this presentation
• Krish Muralidhar: some material used in this presentation
• Hillol Kargupta: joint research
Distance Preserving Perturbation
[Diagram: original vs. perturbed data matrices, shown over attributes and records.]
Distance Preserving Perturbation – Example (Y = M X)

M (orthogonal):
[ -0.2507   0.4556  -0.8542 ]
[ -0.9653  -0.0514   0.2559 ]
[  0.0726   0.8887   0.4527 ]

X (private records as columns):
         ID 1001   ID 1002
Wages     98,563    83,821
Rent       1,889     1,324
Tax        2,899     2,578

Y = M X (transformed records as columns):
         ID 1001   ID 1002
         -26,326   -22,613
         -94,502   -80,324
          10,151     8,432
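The slide's numbers can be checked directly; a quick verification (not part of the original deck):

```python
import numpy as np

M = np.array([[-0.2507,  0.4556, -0.8542],
              [-0.9653, -0.0514,  0.2559],
              [ 0.0726,  0.8887,  0.4527]])
X = np.array([[98563, 83821],   # Wages for IDs 1001, 1002
              [ 1889,  1324],   # Rent
              [ 2899,  2578]])  # Tax

assert np.allclose(M.T @ M, np.eye(3), atol=1e-3)  # M is orthogonal
print(M @ X)  # matches the slide's Y up to rounding of M's entries
```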
Known Sample Attack (backup)
Known Sample Attack – Experiments (backup)
Fig.: Known sample attack on the Adult data set with 32,561 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).