Attacks on Randomization based Privacy Preserving Data Mining Xintao Wu University of North Carolina at Charlotte Sept 20, 2010
Outline Part I: Attacks on Randomized Numerical Data • Additive noise • Projection Part II: Attacks on Randomized Categorical Data • Randomized Response
Additive Noise Randomization Example: Y = X + E (Perturbed = Original + Noise)
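As a minimal sketch of the additive-noise scheme (the salary-like values, sample size, and noise level are all hypothetical), the published data Y = X + E masks individual values while leaving aggregate statistics roughly intact:

```python
import random
import statistics

random.seed(0)

# Original sensitive values X (hypothetical salary-like data).
x = [random.gauss(50_000, 8_000) for _ in range(10_000)]

# Additive noise randomization: publish Y = X + E, with E drawn
# independently of X from a zero-mean Gaussian.
sigma = 20_000
y = [xi + random.gauss(0, sigma) for xi in x]

# Individual values are heavily masked, but the mean survives:
print(round(statistics.mean(x)), round(statistics.mean(y)))
```

Since the noise has mean zero, aggregate models (distributions, decision trees, etc.) can still be learned from Y, which is the point of the scheme; the attacks below ask how much of the individual values also survives.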
Individual Value Reconstruction (Additive Noise) • Methods • Spectral Filtering, Kargupta et al. ICDM03 • PCA, Huang, Du, and Chen SIGMOD05 • SVD, Guo, Wu and Li, PKDD06 • All aim to remove noise by projecting the perturbed data onto a lower-dimensional subspace.
Individual Reconstruction Algorithm (model: Up = U + V, Perturbed = Original + Noise) • Apply EVD to the covariance matrix of Up • Using published information about the noise V, extract the first k components as the principal components: λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, with corresponding eigenvectors e1, e2, ··· , ek • Qk = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X • Find the orthogonal projection of Up onto X • The projection gives the estimated data set
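The projection step can be sketched in two dimensions, where the first eigenvector of a 2×2 covariance matrix has a closed form. This is an illustrative toy (the data model, noise level, and choice k = 1 are all assumptions), not the published algorithm:

```python
import math
import random

random.seed(1)
n = 5_000

# Correlated original data U: second attribute is a noisy copy of the first.
u = [(t, 0.8 * t + random.gauss(0, 0.3)) for t in
     (random.gauss(0, 1) for _ in range(n))]

# Perturbed release Up = U + V with i.i.d. Gaussian noise V.
up = [(a + random.gauss(0, 0.5), b + random.gauss(0, 0.5)) for a, b in u]

# Sample covariance of the perturbed data.
ma = sum(a for a, _ in up) / n
mb = sum(b for _, b in up) / n
caa = sum((a - ma) ** 2 for a, _ in up) / n
cbb = sum((b - mb) ** 2 for _, b in up) / n
cab = sum((a - ma) * (b - mb) for a, b in up) / n

# First eigenvector of the 2x2 covariance (closed form): the correlated
# signal lies mostly along this direction, the noise is isotropic.
theta = 0.5 * math.atan2(2 * cab, caa - cbb)
e1 = (math.cos(theta), math.sin(theta))

# Estimate = orthogonal projection of Up onto span{e1}  (k = 1 here).
est = [((a * e1[0] + b * e1[1]) * e1[0], (a * e1[0] + b * e1[1]) * e1[1])
       for a, b in up]

# Compare per-record squared error before and after projection.
mse_raw = sum((a - c) ** 2 + (b - d) ** 2
              for (a, b), (c, d) in zip(up, u)) / n
mse_est = sum((a - c) ** 2 + (b - d) ** 2
              for (a, b), (c, d) in zip(est, u)) / n
print(mse_raw, mse_est)
```

Because the original attributes are correlated while the noise is isotropic, the discarded subspace holds mostly noise, so the attacker's estimate lands closer to the original records than the published perturbation does.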
Why it works (figure: 1-d and 2-d estimation; original signal, noise, perturbed data, 1st and 2nd principal vectors) • Noise is not correlated • Original data are correlated
Challenging Questions • Previous work on individual reconstruction is only empirical • Attacker question: How close is the estimated data to the original? • Data owner question: How much noise should be added to preserve privacy at a given tolerance level?
Determining k • Strategy 1: Huang and Du, SIGMOD05 • Strategy 2: Guo, Wu and Li, PKDD06 • The estimated data set obtained using the first k components is approximately optimal
Additive Noise vs. Projection • Additive perturbation (Y = X + E, Perturbed = Original + Noise) is not safe: • Spectral Filtering Technique, H. Kargupta et al. ICDM03 • PCA Based Technique, Huang et al. SIGMOD05 • SVD Based & Bound Analysis, Guo et al. SAC06, PKDD06 • How about projection based perturbation (Y = R X, Perturbed = Transformation × Original)? • Projection models • Vulnerabilities • Potential attacks
Rotation Randomization Example: Y = R X, where R R^T = R^T R = I
Rotation Approach (R is orthonormal) • When R is an orthonormal matrix (R^T R = R R^T = I) • Vector length: |Rx| = |x| • Euclidean distance: |Rxi - Rxj| = |xi - xj| • Inner product: <Rxi, Rxj> = <xi, xj> • Many clustering and classification methods are invariant to this rotation perturbation • Classification, Chen and Liu, ICDM05 • Distributed data mining, Liu and Kargupta, TKDE06
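The invariants listed above are easy to check numerically; a small sketch with an arbitrary 2-D rotation (the angle and vectors are made up for illustration):

```python
import math

# A 2-D rotation by angle t is orthonormal: R^T R = R R^T = I.
t = 0.7
R = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

def mat_vec(M, v):
    # 2x2 matrix times 2-vector.
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

x1, x2 = [3.0, 4.0], [1.0, -2.0]
y1, y2 = mat_vec(R, x1), mat_vec(R, x2)

# Length, Euclidean distance, and inner product are all preserved:
print(math.hypot(*x1), math.hypot(*y1))   # |x1| == |Rx1|
print(dot(x1, x2), dot(y1, y2))           # <x1,x2> == <Rx1,Rx2>
```

This is exactly why distance- and inner-product-based miners (k-means, kNN, SVMs with linear kernels) work unchanged on the rotated release.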
Example (figure: numeric illustration of a rotation with R R^T = R^T R = I)
Weakness of Rotation • Known sample attack: if the attacker knows a small sample of the original data together with the corresponding perturbed records, R can be estimated by regression, exposing the remaining records
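A sketch of the known-sample attack: given a handful of (original, perturbed) pairs, ordinary least squares recovers R exactly when Y = RX is noise-free. The rotation angle and the known records below are made up for illustration:

```python
import math

# Hidden rotation used by the data owner (unknown to the attacker).
t = 0.35
R = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

# The attacker knows a few original records and their perturbed versions.
known_x = [[1.0, 0.0], [0.3, 2.0], [-1.5, 0.7]]
known_y = [mat_vec(R, x) for x in known_x]

# Least squares: R_hat = (sum y x^T) (sum x x^T)^{-1}, 2x2 in closed form.
sxx = [[sum(x[i] * x[j] for x in known_x) for j in range(2)] for i in range(2)]
syx = [[sum(y[i] * x[j] for x, y in zip(known_x, known_y)) for j in range(2)]
       for i in range(2)]
det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
inv = [[ sxx[1][1] / det, -sxx[0][1] / det],
       [-sxx[1][0] / det,  sxx[0][0] / det]]
r_hat = [[sum(syx[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]

# R is recovered; every other perturbed record can now be inverted.
print(r_hat)
```

With R in hand the attacker applies R^T (= R^{-1} for a rotation) to the whole published data set, so a tiny known sample breaks every record.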
General Linear Transformation • Y = R X + E (Perturbed = Transformation × Original + Noise) • When R = I: Y = X + E (Additive Noise Model) • When R R^T = R^T R = I and E = 0: Y = R X (Rotation Model) • R can be an arbitrary matrix
Is Y = R X + E Safe? • R can be an arbitrary matrix, hence the regression based attack won't work • How about a direct noisy ICA attack? Y = R X + E (General Linear Transformation Model) vs. X = A S + N (Noisy ICA Model)
ICA Revisited • ICA Motivation • Blind source separation: separating unobservable or latent independent source signals when mixed signals are observed • Cocktail-party problem • What is ICA • ICA is a statistical technique which aims to represent a set of random variables as linear combinations of statistically independent component variables • ICA is a process for determining the structure that produced a signal
ICA Process • Linear mixing process: the mixing matrix transforms the source signals into the observed signals • Separation process: apply a demixing matrix to the observed signals and optimize a cost function until the separated components are independent
Restriction of ICA • Restrictions: • All the components si should be independent • They must be non-Gaussian, with the possible exception of one component • Can we apply ICA directly to Y = RX (treating it as X = AS)? No: • The attributes of X are correlated • More than one attribute of X may have a Gaussian distribution
Correctness of AK-ICA • We prove that a transformation J exists such that J represents the connection between the two distributions • More details: see Guo and Wu, PAKDD 2007
Assumption • Privacy can be breached when a small subset of the original data X is available to attackers • The assumption is reasonable (survey "Understanding net users' attitude about online privacy", April 99: 56% privacy concern, 17% refuse, 27% no concern / willing to provide data)
Outline Part I: Attacks on Randomized Numerical Data • Additive noise • Projection Part II: Attacks on Randomized Categorical Data • Randomized Response
Randomized Response (Stanley Warner, JASA 1965) • A: cheated in the exam; Ā: didn't cheat in the exam • Purpose: estimate the proportion π of population members that cheated in the exam • Procedure: with probability p the randomization device asks "Do you belong to A?", and with probability 1 − p it asks "Do you belong to Ā?"; the respondent answers "yes" or "no" truthfully, and the interviewer sees only the answer, not which question was asked • With λ the observed proportion of "yes" answers, an unbiased estimate of π is π̂ = (λ − (1 − p)) / (2p − 1), for p ≠ 1/2
Matrix Expression • RR can be expressed in matrix form (0: No, 1: Yes): λ = P π, where λ is the vector of observed proportions, π the vector of true proportions, and P = [[p, 1 − p], [1 − p, p]] • Unbiased estimate of π: π̂ = P^-1 λ
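A quick simulation of Warner's device and its unbiased estimator (the cheating rate π = 0.3, design probability p = 0.7, and sample size are all made-up values):

```python
import random

random.seed(2)
p, n = 0.7, 100_000
pi_true = 0.3  # true proportion of cheaters (hidden from the interviewer)

def answer(cheated):
    # With probability p the device asks "Do you belong to A?",
    # otherwise "Do you belong to not-A?"; the respondent is truthful.
    if random.random() < p:
        return cheated
    return not cheated

yes = sum(answer(random.random() < pi_true) for _ in range(n)) / n

# Invert the design: P(yes) = p*pi + (1-p)*(1-pi), so
# pi_hat = (yes - (1 - p)) / (2p - 1).
pi_hat = (yes - (1 - p)) / (2 * p - 1)
print(round(pi_hat, 3))
```

No individual answer reveals whether that respondent cheated, yet the population proportion is recovered accurately.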
Vector Response • π is the vector of true proportions of the population • λ is the vector of observed proportions in the survey • P is the randomization device (design matrix) set by the interviewer • Model: λ = P π; unbiased estimate: π̂ = P^-1 λ
Extension to Multiple Attributes • m sensitive attributes, where attribute i has ti categories • Let π be the vector of true proportions over all combinations of categories, arranged lexicographically; e.g., if m = 2, t1 = 2 and t2 = 3, π has 6 elements • Simultaneous model: treat all attributes as one compound attribute and apply the regular vector response RR technique • Sequential model: apply RR to each attribute in turn; the compound design matrix is P = P1 ⊗ P2 ⊗ ··· ⊗ Pm, where ⊗ stands for the Kronecker product
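The sequential model's compound design matrix takes a few lines to build; the per-attribute matrices below (a Warner matrix with p = 0.7 and a 3-category matrix that keeps the truth with probability 0.8) are hypothetical choices matching the m = 2, t1 = 2, t2 = 3 example:

```python
def kron(A, B):
    # Kronecker product of two matrices stored as lists of lists.
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# Warner matrix for a binary attribute (hypothetical p = 0.7) ...
P1 = [[0.7, 0.3],
      [0.3, 0.7]]
# ... and a 3-category RR matrix (keep truth w.p. 0.8, else uniform).
P2 = [[0.8, 0.1, 0.1],
      [0.1, 0.8, 0.1],
      [0.1, 0.1, 0.8]]

# Sequential model: the compound design matrix is the Kronecker product,
# one row/column per lexicographic category combination.
P = kron(P1, P2)
print(len(P), len(P[0]))  # 6 x 6, matching t1 * t2 = 2 * 3 combinations
```

Each column of P still sums to 1, so the same inversion π̂ = P^-1 λ applies to the compound proportions.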
Disclosure Analysis • R: a typical response, which is "yes" or "no" • The posterior probabilities P(A | R) and P(Ā | R) follow from the conditional probabilities set by the investigators • R is regarded as jeopardizing with respect to A (or Ā) if the corresponding posterior probability exceeds a tolerated threshold
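Posterior probabilities for Warner's binary device follow directly from Bayes' rule; a sketch with assumed values p = 0.7 and prior π = 0.3:

```python
p, pi = 0.7, 0.3  # assumed design probability and prior proportion

# Warner device: P(yes | A) = p, P(yes | not-A) = 1 - p.
p_yes = p * pi + (1 - p) * (1 - pi)

# Posterior membership probabilities given each response (Bayes' rule).
post_a_given_yes = p * pi / p_yes
post_a_given_no = (1 - p) * pi / (1 - p_yes)
print(round(post_a_given_yes, 3), round(post_a_given_no, 3))
```

Here a "yes" raises the posterior membership probability from the prior 0.3 to 0.5, while a "no" lowers it to about 0.155; whether either response is jeopardizing depends on the threshold the data owner tolerates.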
Q & A
Xintao Wu, xwu@uncc.edu, http://www.sis.uncc.edu/~xwu
Data Privacy Lab, http://www.dpl.sis.uncc.edu