On the Use of Spectral Filtering for Privacy Preserving Data Mining
Songtao Guo, UNC Charlotte
Xintao Wu, UNC Charlotte
Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg
PIPEDA 2000
European Union (Directive 95/46/EC)
• HIPAA for health care
• California State Bill 1386
• Gramm-Leach-Bliley Act for financial privacy
• COPPA for children's online privacy
Source: http://www.privacyinternational.org/survey/dpmap.jpg
Mining vs. Privacy
• Data mining
• The goal of data mining is summary results (e.g., classification, clusters, association rules) derived from the data distribution
• Individual privacy
• Individual values in the database must not be disclosed, or at least no close estimate of them can be derived by attackers
• Privacy Preserving Data Mining (PPDM)
• How to "perturb" data such that
• we can build a good data mining model (data utility)
• while preserving individuals' privacy at the record level (privacy)?
Outline
• Additive Randomization
• Distribution Reconstruction
• Bayesian Method, Agrawal & Srikant SIGMOD00
• EM Method, Agrawal & Aggarwal PODS01
• Individual Value Reconstruction
• Spectral Filtering, Kargupta et al. ICDM03
• PCA Technique, Huang et al. SIGMOD05
• Error Bound Analysis for Spectral Filtering
• Upper Bound
• Conclusion and Future Work
Additive Randomization
• Hide the sensitive data by randomly modifying the data values with additive noise: the published, perturbed data is Ũ = U + V, where U is the original data and V the noise
• Privacy preservation aims at preventing attackers from deriving close estimates of the individual values in U from Ũ
• Utility preservation aims at keeping the aggregate characteristics unchanged, or at least recoverable, from Ũ
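To make the setting concrete, here is a minimal numeric sketch of additive randomization in Python; the matrix shapes and noise scale are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 1000))      # n attributes x m records (hypothetical sizes)
V = rng.normal(0.0, 0.5, size=U.shape)  # zero-mean additive Gaussian noise
U_tilde = U + V                         # the published, perturbed data

# Aggregate characteristics survive the perturbation: per-attribute means
# of U_tilde stay close to those of U, even though each record is distorted.
print(np.abs(U.mean(axis=1) - U_tilde.mean(axis=1)).max())
```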
Distribution Reconstruction
• The original density distribution can be reconstructed effectively given the perturbed data and the noise's distribution — Agrawal & Srikant SIGMOD 2000
• Assumes independent random noise with a known (arbitrary) distribution
• f_X^0 := uniform distribution
• j := 0 // iteration number
• repeat
• f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} f_V(w_i − a) f_X^j(a) / ∫ f_V(w_i − z) f_X^j(z) dz
• j := j + 1
• until (stopping criterion met)
• (w_i: observed perturbed values, f_V: noise density)
• It cannot reconstruct individual values
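A discretized sketch of this iteration, assuming a grid-based density and a known noise pdf; the names (`reconstruct_distribution`, `w`, `noise_pdf`) are ours, and a fixed iteration count stands in for the stopping criterion.

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, grid, n_iter=50):
    """Iterative Bayes update f^{j+1}(a) = (1/n) Σ_i f_V(w_i - a) f^j(a)
    / ∫ f_V(w_i - z) f^j(z) dz, discretized on `grid`.
    w: observed perturbed values; noise_pdf: known noise density f_V."""
    da = grid[1] - grid[0]
    f = np.full(grid.size, 1.0 / (grid.size * da))  # uniform start, f_X^0
    K = noise_pdf(w[:, None] - grid[None, :])       # K[i, a] = f_V(w_i - a)
    for _ in range(n_iter):                         # fixed count in place of a stopping rule
        denom = (K * f).sum(axis=1) * da            # ∫ f_V(w_i - z) f^j(z) dz per sample i
        f = (K / denom[:, None]).mean(axis=0) * f   # the Bayes update above
        f /= f.sum() * da                           # keep f a proper density
    return f

# Example usage with Gaussian noise of known variance 0.5 on a [-5, 5] grid:
# grid = np.linspace(-5, 5, 401)
# gauss = lambda x: np.exp(-x**2 / (2 * 0.5)) / np.sqrt(2 * np.pi * 0.5)
# f_hat = reconstruct_distribution(w, gauss, grid)
```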
Individual Value Reconstruction
• Spectral Filtering, Kargupta et al. ICDM 2003 (a sketch follows below)
• Apply EVD to the covariance matrix Ã of the perturbed data: Ã = QΛQ^T
• Using some published information about V, extract the first k components of Ã as the principal components
• λ_1 ≥ λ_2 ≥ … ≥ λ_k are the largest eigenvalues and e_1, e_2, …, e_k the corresponding eigenvectors
• Q_k = [e_1 e_2 … e_k] forms an orthonormal basis of a subspace S
• Find the orthogonal projection of Ũ onto S: P = Q_k Q_k^T
• Get the estimated data set: Û = P Ũ
• PCA Technique, Huang, Du and Chen, SIGMOD 05
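The sketch below implements the spectral filtering steps with numpy: center the perturbed data, eigendecompose the sample covariance, and project onto the top-k eigenvectors. The helper name is ours; choosing k is discussed on a later slide.

```python
import numpy as np

def spectral_filter(U_tilde, k):
    """Estimate the original data from perturbed data U_tilde (n x m)
    by projecting onto the top-k principal subspace."""
    mu = U_tilde.mean(axis=1, keepdims=True)
    X = U_tilde - mu                        # center the perturbed records
    A_tilde = X @ X.T / X.shape[1]          # n x n sample covariance
    vals, vecs = np.linalg.eigh(A_tilde)    # EVD, eigenvalues in ascending order
    Q_k = vecs[:, -k:]                      # e_1..e_k: top-k eigenvectors
    P = Q_k @ Q_k.T                         # orthogonal projector onto subspace S
    return P @ X + mu                       # estimated data \hat{U}
```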
Motivation
• Previous work on individual value reconstruction is only empirical
• The relationship between estimation accuracy and the noise was not clear
• Two questions
• Attacker's question: how close is the estimate obtained via SF to the original data?
• Data owner's question: how much noise should be added to preserve privacy at a given tolerated level?
Our Work
• Investigate the explicit relationship between the estimation accuracy and the noise
• Derive an upper bound on the estimation error ||Û − U||F in terms of V
• The upper bound determines how close the estimated data achieved by attackers is to the original
• It poses a serious threat of privacy breaches
Preliminary
• F-norm and 2-norm: ||A||F = (Σ_{i,j} a_{ij}²)^{1/2} and ||A||₂ = max_{||x||₂=1} ||Ax||₂
• Some properties
• ||A||₂ ≤ ||A||F ≤ √n ||A||₂ and ||A||F² = trace(A^T A)
• ||A||₂ = √λ_max(A^T A), the square root of the largest eigenvalue of A^T A
• If A is symmetric, then ||A||₂ = |λ|_max(A), the largest (in magnitude) eigenvalue of A
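These identities are standard and easy to confirm numerically; a small sanity check with numpy (random matrix, illustrative size):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
fro = np.linalg.norm(A, 'fro')
two = np.linalg.norm(A, 2)
assert two <= fro <= np.sqrt(n) * two + 1e-12   # ||A||_2 <= ||A||_F <= sqrt(n) ||A||_2
# ||A||_2 equals the square root of the largest eigenvalue of A^T A
assert np.isclose(two, np.sqrt(np.linalg.eigvalsh(A.T @ A).max()))
S = (A + A.T) / 2                               # symmetric case
assert np.isclose(np.linalg.norm(S, 2), np.abs(np.linalg.eigvalsh(S)).max())
```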
Matrix Perturbation
• Traditional matrix perturbation theory
• How a derived perturbation E affects the covariance matrix A, i.e., the eigen-structure of A + E
• Our scenario
• How the primary perturbation V on the data matrix U induces the derived perturbation E on the covariance matrix A = UU^T
Error Bound Analysis
• Prop 1. Let the covariance matrix of the perturbed data be Ã. Given Ũ = U + V, we have Ã = A + E with the derived perturbation E = UV^T + VU^T + VV^T
• Prop 2. By Weyl's theorem, |λ_i(Ã) − λ_i(A)| ≤ ||E||₂ (eigenvalue of E); the separation between the retained and discarded spectra is governed by the eigengap λ_k − λ_{k+1} of A
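A quick numeric illustration of Prop 1 and the Weyl-type eigenvalue bound, using our own random data (the paper's exact constants are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.standard_normal((5, 200))
V = rng.normal(0.0, 0.3, size=U.shape)
A = U @ U.T
E = U @ V.T + V @ U.T + V @ V.T        # Prop 1: (U+V)(U+V)^T = A + E
A_tilde = (U + V) @ (U + V).T
assert np.allclose(A_tilde, A + E)
# Weyl: each eigenvalue of A moves by at most ||E||_2 under the perturbation.
shift = np.abs(np.linalg.eigvalsh(A_tilde) - np.linalg.eigvalsh(A))
assert shift.max() <= np.linalg.norm(E, 2) + 1e-8
```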
Theorem
• Given a data set U and a noise set V, we have the perturbed data set Ũ = U + V. Let Û be the estimate obtained from Spectral Filtering; then the estimation error ||Û − U||F is bounded above in terms of E, where E is the derived perturbation on the original covariance matrix A = UU^T
• Proof is skipped
Special Cases
• When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and known variance
• When the noise is completely correlated with the data
Experimental Results
• Artificial Dataset
• 35 correlated variables
• 30,000 tuples
Experimental Results
• Scenarios of noise addition (see the sketch after this list)
• Case 1: i.i.d. Gaussian noise — N(0, COV), where COV = diag(σ², …, σ²)
• Case 2: independent Gaussian noise — N(0, COV), where COV = c · diag(σ₁², …, σ_n²)
• Case 3: correlated Gaussian noise — N(0, COV), where COV = c · Σ_U (or c · A)
• Measures
• Absolute error: ||Û − U||F
• Relative error: ||Û − U||F / ||U||F
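A sketch that generates the three noise scenarios and computes both error measures, reusing the `spectral_filter` helper from the earlier sketch; the constant c, the σ values, and k are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 35, 30000                                   # as in the artificial dataset
U = rng.standard_normal((n, n)) @ rng.standard_normal((n, m))  # correlated attributes

def gaussian_noise(cov, m=m):
    return rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=m).T

sigma2, c = 0.5, 0.1                               # illustrative scales
cov_case1 = sigma2 * np.eye(n)                     # Case 1: i.i.d. noise
cov_case2 = c * np.diag(rng.uniform(0.5, 2.0, n))  # Case 2: independent noise
cov_case3 = c * np.cov(U)                          # Case 3: correlated, c * Sigma_U

U_tilde = U + gaussian_noise(cov_case3)
U_hat = spectral_filter(U_tilde, k=5)              # helper from the SF sketch
abs_err = np.linalg.norm(U_hat - U, 'fro')         # absolute error
rel_err = abs_err / np.linalg.norm(U, 'fro')       # relative error
print(rel_err)
```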
Determining k
• Determine k in Spectral Filtering
• According to matrix perturbation theory
• Our heuristic approach: check which eigenvalues of the perturbed covariance matrix exceed the noise level, and set k to the number of eigenvalues that do
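One plausible reading of this heuristic, as a sketch: keep the eigenvalues of the perturbed sample covariance that clear the known noise variance. The exact threshold rule in the paper may differ; this version is our assumption.

```python
import numpy as np

def choose_k(U_tilde, sigma2):
    """Count eigenvalues of the perturbed sample covariance that exceed
    the (assumed known) noise variance sigma2; use that count as k."""
    X = U_tilde - U_tilde.mean(axis=1, keepdims=True)
    lam = np.linalg.eigvalsh(X @ X.T / X.shape[1])
    return int(np.sum(lam > sigma2))
```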
Effect of varying k (case 1): N(0, COV), where COV = diag(σ², …, σ²). [Plot: relative error vs. k]
Effect of varying k (case 2): N(0, COV), where COV = c · diag(σ₁², σ₂², …, σ_n²). [Plot: relative error vs. k]
Effect of varying k (case 3): N(0, COV), where COV = c · Σ_U. [Plot]
Effect of varying noise: ||V||F/||U||F = 87.8%. [Plots for σ² = 0.1, 0.5, 1.0]
Effect of covariance matrix: ||V||F/||U||F = 39.1%. [Plots for Cases 1, 2, 3]
Conclusion
• Spectral-filtering-based techniques have been investigated as a major means of point-wise data reconstruction.
• We present an upper bound on the estimation error ||Û − U||F
• which enables an attacker to determine how close the estimated data is to the original
Future Work
• We are working on the lower bound
• which represents the best estimate an attacker can achieve using SF
• which can be used by data owners to determine how much noise should be added to preserve privacy
• Bound analysis at the point-wise level
Acknowledgement
• NSF Grants
• CCR-0310974
• IIS-0546027
• Personnel
• Xintao Wu
• Songtao Guo
• Ling Guo
• More Info
• http://www.cs.uncc.edu/~xwu/
• xwu@uncc.edu
Questions? Thank you!