Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation
Chris Giannella
cgiannel AT acm DOT org
Talk Outline
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation
• Wrap-up summary
An Example Problem
• The U.S. Census Bureau collects lots of data
• If released in raw form, this data would provide
  • a wealth of valuable information regarding broad population patterns
  • access to private information regarding individuals
How to allow analysts to extract population patterns without learning private information?
Privacy-Preserving Data Mining
“The study of how to produce valid mining models and patterns without disclosing private information.”
– F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006.
Several broad approaches exist … this talk: data transformation (the “census model”).
Data Transformation (the “Census Model”)
[Diagram: the researcher applies a transformation to the private data; the data miner sees only the transformed data.]
DT Objectives
• Minimize risk of disclosing private information
• Maximize the analytical utility of the transformed data
DT is also studied in the field of Statistical Disclosure Control.
Some things DT does not address…
• Preventing unauthorized access to the private data (e.g. hacking)
• Securely communicating private data
DT and cryptography are quite different. (Moreover, standard encryption does not solve the DT problem.)
Assessing Transformed Data Utility
• How accurately does a transformation preserve certain kinds of patterns, e.g.:
  • data mean, covariance
  • Euclidean distance between data records
  • underlying generating distribution?
• How useful are the patterns at drawing conclusions/inferences?
Assessing Privacy Disclosure Risk
• Some efforts in the literature to develop rigorous definitions of disclosure risk
  • no widely accepted agreement
• This talk will take an ad hoc approach:
  • for a specific attack, how closely can any private data record be estimated?
Talk Outline
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation
• Wrap-up summary
Some DT approaches
• Discussed in this talk:
  • Additive independent noise
  • Euclidean distance preserving transformation
    • My current research
• Others:
  • Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.
Additive Independent Noise
For each private data record (x1, …, xn), add independent random noise to each entry:
(y1, …, yn) = (x1 + e1, …, xn + en)
Each ei is generated independently as N(0, d · Var(i)), where Var(i) is the variance of attribute i.
Increasing d reduces privacy disclosure risk.
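A minimal sketch of this scheme in Python/NumPy (not from the talk; the data matrix X, the noise level d, and the function name are illustrative):

```python
import numpy as np

def additive_noise(X, d, rng=None):
    """Perturb each attribute with independent Gaussian noise.

    X : (records, attributes) private data matrix.
    d : noise multiplier; entry i gets e_i ~ N(0, d * Var(attribute i)).
    """
    rng = np.random.default_rng() if rng is None else rng
    var = X.var(axis=0)  # per-attribute variance
    return X + rng.normal(0.0, np.sqrt(d * var), size=X.shape)

# Larger d -> more distortion: lower disclosure risk, but lower utility.
X = np.random.default_rng(0).normal(50.0, 10.0, size=(1000, 3))
Y = additive_noise(X, d=0.5)
```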
Additive Independent Noise
• It is difficult to set d so as to achieve both low privacy disclosure risk and high data utility.
• Some enhancements on the basic idea exist, e.g., Muralidhar et al.
Talk Outline
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation (EDPDT)
• Wrap-up summary
EDPDT – High Data Utility!
• Many data clustering algorithms use Euclidean distance to group records, e.g.
  • K-means clustering, hierarchical agglomerative clustering
• If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data.
EDPDT – High Data Utility!
[Figure: side-by-side plots of the original data and the transformed data.]
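A small illustration (mine, not the talk's) of why utility is high: multiplying by a random orthogonal matrix leaves all pairwise Euclidean distances intact, so any distance-based clusterer sees the same geometry:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))  # 100 private records as columns

# A random orthogonal M: the Q factor of a Gaussian matrix's QR decomposition.
M, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = M @ X  # transformed data

# Pairwise distances between records are unchanged (up to float rounding).
assert np.allclose(pdist(X.T), pdist(Y.T))
```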
EDPDT – Unclear Privacy Disclosure Risk
• This is the focus of the research … the approach:
  • Develop attacks combining the transformed data with plausible prior knowledge.
  • How well can these attacks estimate private data records?
Two Different Prior Knowledge Assumptions
• Known input: the attacker knows a small subset of the private data records.
  • Focus of this talk.
• Known sample: the attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
  • Happy to discuss “off-line”.
Known Input Prior Knowledge
Underlying assumption: individuals (a) know whether there is a record for them among the private data records, and (b) know the attributes of the private data records.
• Each individual knows one private record.
• A small group of malicious individuals could cooperate to produce a small subset of the private data records.
Known Input Attack
Given:
• {Y1, …, Ym} (transformed data records)
• {X1, …, Xk} (known private data records)
1) Determine the transformation constraints, i.e., which transformed records came from which known private records.
2) Choose T randomly from the set of all distance preserving transformations that satisfy the constraints.
3) Apply T^-1 to the transformed data.
Known Input Attack – General Case
Y = M X
• Each column of X (resp. Y) is a private (resp. transformed) data record.
• M is an orthogonal matrix.
[Y_kn  Y_un] = M [X_known  X_unknown]
Attack: choose T randomly from {T an orthogonal matrix : T X_known = Y_kn} and produce T^-1(Y_un).
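A hedged sketch of steps 2–3 for the easiest case: when the known records span the whole attribute space, the constraint set contains essentially one T, which can be computed by orthogonal Procrustes (the function name and the full-rank assumption are mine, not the talk's):

```python
import numpy as np

def known_input_attack(Y_kn, Y_un, X_known):
    """Estimate the unknown private records from Y_un.

    Columns are records. Assumes X_known has full row rank, so the
    orthogonal T with T @ X_known = Y_kn is (essentially) unique.
    """
    # Orthogonal Procrustes: T = argmin ||T X_known - Y_kn||_F over orthogonal T.
    U, _, Vt = np.linalg.svd(Y_kn @ X_known.T)
    T = U @ Vt
    return T.T @ Y_un  # T^-1 = T.T for orthogonal T
```

With fewer known records than attributes, T is underdetermined, and the attack instead samples T randomly from the constraint set, as the slide describes.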
Known Input Attack – Experiments
• An 18,000-record, 16-attribute real data set.
• Given k known private data records, we computed Pk, the probability that the attack estimates one unknown private record with > 85% accuracy.
• P2 = 0.16, P4 = 1, …, P16 = 1
Wrap-Up Summary
• Introduction
  • Privacy preserving data mining – what problem is it aimed to address?
  • Focus of this talk: data transformation
• Some data transformation approaches
• My current research: Euclidean distance preserving data transformation
Thanks to …
• You: for your attention
• Kun Liu: joint research & some material used in this presentation
• Krish Muralidhar: some material used in this presentation
• Hillol Kargupta: joint research
Distance Preserving Perturbation
[Diagram: original vs. perturbed data matrices, shown over attributes and records.]
Distance Preserving Perturbation – Example (Y = M X)

M (orthogonal):
[ -0.2507   0.4556  -0.8542 ]
[ -0.9653  -0.0514   0.2559 ]
[  0.0726   0.8887   0.4527 ]

X (private records as columns):
         ID 1001   ID 1002
Wages     98,563    83,821
Rent       1,889     1,324
Tax        2,899     2,578

Y = M X (transformed records as columns):
         ID 1001   ID 1002
         -26,326   -22,613
         -94,502   -80,324
          10,151     8,432
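The slide's numbers can be checked directly; a quick verification (not part of the original deck):

```python
import numpy as np

M = np.array([[-0.2507,  0.4556, -0.8542],
              [-0.9653, -0.0514,  0.2559],
              [ 0.0726,  0.8887,  0.4527]])
X = np.array([[98563, 83821],   # Wages for IDs 1001, 1002
              [ 1889,  1324],   # Rent
              [ 2899,  2578]])  # Tax

assert np.allclose(M.T @ M, np.eye(3), atol=1e-3)  # M is orthogonal
print(M @ X)  # matches the slide's Y up to rounding of M's entries
```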
Known Sample Attack (backup)
Known Sample Attack – Experiments (backup)
Fig.: Known sample attack on the Adult data set with 32,561 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).