Privacy Preservation for Data Streams
Feifei Li, Boston University
Joint work with: Jimeng Sun (CMU), Spiros Papadimitriou, George A. Mihaila and Ioana Stanoi (IBM T.J. Watson Research Center)
Application (1): Analytical Services
• Finding trends, clusters, patterns, aggregations
[Diagram: Corp. A, Corp. B, and Corp. C each send their sensitive data through P to the analytical services.]
Application (2)
• Publish data as a service
• Subscribe to data to identify trends, patterns, classes
[Diagram: Corp. A, P, Information Hub, Client A, Client B.]
Target Application
• Identify trends
• Cluster/classification
[Figure: streams 1 to 4, each plotted as value over time.]
Problem Formulation
[Figure: streams A1, A2, ..., AN plotted over time, with the values At at time t. Online generated noise is added one vector at a time.]
Problem Formulation (continued)
Given σ², obtain A* online such that D(A, A*) = σ², and such that, for a given reconstruction R (offline or online), D(A, A~) is close to σ².
[Figure: streams over time, with the reconstruction R applied.]
Data Perturbation: random i.i.d. noise
[Figure: i.i.d. noise added to each stream over time.]
i.i.d.: independent and identically distributed
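The i.i.d. perturbation on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `perturb_iid` and `mean_squared_diff`, the zero-mean Gaussian noise model, and the use of the mean squared difference as the discrepancy are all assumptions made for the example.

```python
import random

def perturb_iid(stream, sigma2, rng):
    """Add zero-mean Gaussian noise with variance sigma2 to every value.

    The Gaussian noise model is an assumption of this sketch.
    """
    sigma = sigma2 ** 0.5
    return [x + rng.gauss(0.0, sigma) for x in stream]

def mean_squared_diff(a, b):
    """Average squared difference between two equal-length streams
    (an assumed stand-in for the discrepancy D)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)                             # fixed seed for repeatability
original = [float(t % 50) for t in range(20000)]   # toy stream
perturbed = perturb_iid(original, sigma2=4.0, rng=rng)
discrepancy = mean_squared_diff(original, perturbed)
```

With enough samples, `discrepancy` lands close to the chosen `sigma2`, which is the sense in which the amount of noise is controlled.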
Principal Component Analysis (PCA): i.i.d. noise
Principal Component Analysis (PCA): correlated noise
PCA-Based Data Reconstruction
A: original data. A*: perturbed data. A~: reconstructed data.
[Figure: A, A*, and A~ relative to the principal direction. Labels: added noise (utility), σ²; removed noise; projection error; remaining noise (privacy).]
PCA-Based Data Reconstruction: correlated noise!
A: original data. A*: perturbed data. A~: reconstructed data.
[Figure: A, A*, and A~ relative to the principal direction. Labels: added noise (utility), σ²; projection error; remaining noise (privacy).]
Data Perturbation: main idea
• Observations
  • The amount of random noise controls the privacy/utility tradeoff
  • i.i.d. (independent and identically distributed) noise does not preserve privacy well enough
• Lesson learned
  • Noise should be correlated with the original data
  • Z. Huang et al., SIGMOD 05
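A toy two-stream example can make the observation concrete. The sketch below is an illustration under stated assumptions, not the method from the paper: the data are two perfectly correlated streams, the reconstruction projects the perturbed points onto their first principal direction (computed in closed form for the 2x2 case), and the correlated noise is drawn along the known direction (1, 1)/√2. All function names are hypothetical.

```python
import math
import random

def principal_direction(points):
    """Mean and first principal direction of 2-D points (closed form for a 2x2 covariance)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)  # angle of the major axis
    return (mx, my), (math.cos(theta), math.sin(theta))

def reconstruct(points):
    """Project every point onto the first principal direction through the mean."""
    (mx, my), (ux, uy) = principal_direction(points)
    out = []
    for x, y in points:
        c = (x - mx) * ux + (y - my) * uy
        out.append((mx + c * ux, my + c * uy))
    return out

def mean_sq_dist(a, b):
    """Average squared Euclidean distance between paired points."""
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(a, b)) / len(a)

rng = random.Random(1)
sigma2 = 4.0  # noise budget per point (assumed value)
original = [(v, v) for v in (rng.uniform(-50.0, 50.0) for _ in range(5000))]

# i.i.d. noise: the budget is split evenly over the two coordinates.
s_iid = math.sqrt(sigma2 / 2.0)
iid = [(x + rng.gauss(0.0, s_iid), y + rng.gauss(0.0, s_iid)) for x, y in original]

# Correlated noise: the whole budget goes along the data's principal direction.
s_corr = math.sqrt(sigma2)
r = 1.0 / math.sqrt(2.0)
correlated = []
for x, y in original:
    e = rng.gauss(0.0, s_corr)
    correlated.append((x + e * r, y + e * r))

remaining_iid = mean_sq_dist(original, reconstruct(iid))
remaining_corr = mean_sq_dist(original, reconstruct(correlated))
```

In this setup the projection strips roughly half of the i.i.d. noise, while the correlated noise survives the reconstruction almost entirely, so `remaining_corr` stays near the full budget.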
State of the Art
• Privacy preservation
  • Given a utility requirement, maximize the privacy
• Existing work (Z. Huang et al., SIGMOD 05)
  • Batch mode, static data
• Many other works (see our paper for a detailed literature review)
Adding Dynamic Correlated Noise
• Update U (U3x3: online estimation of principal components; S. Papadimitriou et al., VLDB 05)
• Generate noise Et distributed along U
• At + Et gives A~t; publish A~t
[Diagram: streams A1, A2, A3.]
Put it into an algorithm: Distribute Noise (k=3; U: eigenvectors, V: eigenvalues)
• Noise (σ²) is distributed in the principal components' subspace
• Rotate back to data space
• Added to At
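The three steps of this slide can be sketched roughly as below. This is an assumption-laden illustration rather than the algorithm from the paper: the slide does not state how the budget σ² is split among the components, so the sketch assumes a split proportional to the eigenvalues, and the function name `distribute_noise`, the Gaussian draws, and the example basis are all hypothetical.

```python
import random

def distribute_noise(U, eigenvalues, sigma2, rng):
    """Draw one noise vector whose expected squared norm is sigma2.

    U: k orthonormal direction vectors, each of length n.
    eigenvalues: k non-negative weights, one per direction.
    Assumption: the budget is split in proportion to the eigenvalues.
    """
    total = sum(eigenvalues)
    n = len(U[0])
    noise = [0.0] * n
    for u, lam in zip(U, eigenvalues):
        # Noise coefficient in the principal components' subspace.
        coeff = rng.gauss(0.0, (sigma2 * lam / total) ** 0.5)
        # Rotate back to data space by accumulating coeff * u.
        for i in range(n):
            noise[i] += coeff * u[i]
    return noise

rng = random.Random(2)
# Hypothetical orthonormal basis for k = 3 streams.
U = [
    [3 ** -0.5, 3 ** -0.5, 3 ** -0.5],
    [2 ** -0.5, -(2 ** -0.5), 0.0],
    [6 ** -0.5, 6 ** -0.5, -2.0 * 6 ** -0.5],
]
eigenvalues = [6.0, 3.0, 1.0]  # hypothetical values

a_t = [10.0, 11.0, 9.0]  # one incoming vector At
e_t = distribute_noise(U, eigenvalues, sigma2=1.0, rng=rng)
published = [a + e for a, e in zip(a_t, e_t)]  # noise added to At

# Empirical check of the noise energy over many draws.
draws = [distribute_noise(U, eigenvalues, 1.0, rng) for _ in range(20000)]
avg_energy = sum(sum(x * x for x in d) for d in draws) / len(draws)
```

Because the directions are orthonormal, the expected squared norm of each noise vector equals the sum of the per-direction variances, which is `sigma2` by construction.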
Why is our algorithm better (state of the art)?
[Figure: local principal components vs. the global principal component. Labels: noise added along the global PC (offline); noise removed by online reconstruction.]
Online Reconstruction vs. Offline Reconstruction
• Choice of adversary:
  • Offline reconstruction based on global principal components
  • Online tracking of the principal components, applying local reconstruction
• Please see the details in the paper
Tracking Autocorrelation
a = [1 2 3 4 5 6]T
Sliding windows of a over time give h streams (here h = 3):
W =
  w1: 1 2 3
  w2: 2 3 4
  w3: 3 4 5
  w4: 4 5 6
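Building W from a single stream is a sliding-window operation. A minimal sketch follows; the function name `window_matrix` and the one-window-per-row layout are assumptions for illustration.

```python
def window_matrix(a, h):
    """Return all sliding windows of length h over a, one window per row."""
    return [a[i:i + h] for i in range(len(a) - h + 1)]

a = [1, 2, 3, 4, 5, 6]
W = window_matrix(a, 3)
# W == [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Each column of W can then be treated as one of the h streams, so the correlation-tracking machinery for multiple streams applies to a single auto-correlated stream.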
Distribute Noise
• Avoid adding noise above the allowed threshold, while keeping the noise auto-correlated with the stream
• Idea: constrain the next k noise values based on the previous h-k noise values and the current estimate of U; this becomes a linear system
[Figure: the window matrix W (rows 1 2 3, 2 3 4, 3 4 5, 4 5 6) shown at successive steps.]
Experiments
• Three real data streams
  • Sensor streams (Lab): Light, Humidity, Volt, Temperature. 7712x198
  • Chlorine environmental streams: 4310x166
  • Stock streams: 8000x2
Perturbation vs. Reconstruction
• Perturbation
  • Streaming auto-correlated additive noise
  • Noise correlated with global principal components
  • Streaming correlated additive noise
• Reconstruction
  • Take perturbed data as the reconstruction
  • Streaming auto-correlated online reconstruction
  • Streaming correlated online reconstruction
  • Offline reconstruction based on global principal components
• Noise (discrepancy) is represented by its relative energy, as a percentage of the original data streams, i.e., D(A, A*)/||A||
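The relative-energy measure can be illustrated with a small worked example. The slides do not define D or the norm precisely, so this sketch assumes both are squared Frobenius norms (sums of squared entries); the function names and the toy matrices are hypothetical.

```python
def energy(M):
    """Sum of squared entries of a matrix given as a list of rows
    (assumed meaning of 'energy' in this sketch)."""
    return sum(x * x for row in M for x in row)

def relative_discrepancy(A, B):
    """Energy of (A - B) divided by the energy of A (assumed form of D(A, B)/||A||)."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]
    return energy(diff) / energy(A)

A = [[3.0, 4.0], [0.0, 5.0]]       # toy original data
A_star = [[3.0, 5.0], [1.0, 5.0]]  # toy perturbed data
ratio = relative_discrepancy(A, A_star)  # 2 / 50, i.e. 4% noise
```

Under these assumptions, a setting such as "10% noise" would correspond to a ratio of 0.10.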
Reconstruction Error: Online-R vs. Offline-R
[Figure: 10% noise, k=10.]
• Online reconstruction achieves better accuracy because it minimizes the projection error
Reconstruction Error: vary k
• Online reconstruction achieves better accuracy
• A large k reduces the projection error
Future Work
• Combining correlation and autocorrelation
• Other types of data streams beyond numeric data, such as categorical data
Questions • Thank you!