Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Nisarg Raval, Madhuchand Rushi Pillutla, Piyush Bansal, K Srinathan and C V Jawahar
IIIT Hyderabad, India
CSTAR

Motivation
• Trusted Third Party (TTP)
• Can we avoid the TTP?
• Simulate the Trusted Third Party

Privacy Preserving Outlier Detection
• Alice and Bob each have a database of customer behavior.
• Together they want to find fraudulent customers (outliers) in their respective databases.
• Only the outliers should be revealed.
• Individual data should remain private.

Outlier Detection Approach
• Statistics based
  • Barnett et al. John Wiley 1994
• Density based
  • Papadimitriou et al. ICDE 2003
• Distance based
  • Knorr et al. VLDB 1998
  • Ramaswamy et al. SIGMOD 2000
  • Wang et al. ICDE 2011

Privacy Preserving Data Mining (Verykios et al. SIGMOD 2004)
• Heuristic based
  • Atallah et al. KDEW 1999
  • Verykios et al. KDE 2003
• Reconstruction based
  • Agrawal et al. SIGMOD 2000
  • Rizvi et al. VLDB 2002
• Cryptography based
  • Lindell et al. CRYPTO 2000
  • Clifton et al. SIGKDD 2002

Related Work
• Vaidya et al. ICDM 2004
  • Pairwise distance computation
  • Secure Distance and Secure Comparison protocols
• Zhou et al. EBISS 2009
  • Pairwise distance computation
  • Homomorphic Encryption and Randomization
• Quadratic Cost
  • Approximately 10^12 operations on 1 million data points.
• Our method is 10000 times faster on 1 million data points!

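The quadratic figure follows from counting pairs. A quick check, assuming one operation per ordered pair of points and reading "10000 times faster" as a 10^4-fold reduction in operations:

```latex
\[
n = 10^{6} \;\Rightarrow\; n^{2} = 10^{12} \text{ pairwise operations},
\qquad
\frac{10^{12}}{10^{4}} = 10^{8}
\]
```

Under those assumptions, the claimed speedup corresponds to roughly 10^8 operations at this database size.
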
Outlier Detection
• Distance based outliers [Knorr et al. VLDB 1998]
• An object is an outlier if a very large fraction of the total objects lie outside the specified radius.
[Figure: an outlier, with its Neighbors inside the radius and Non Neighbors outside it]

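The distance based definition can be written down directly. The following is a minimal brute-force sketch, assuming Euclidean distance; the names `is_distance_based_outlier`, `radius` and `fraction` are illustrative and do not come from the slides.

```python
import math


def is_distance_based_outlier(points, idx, radius, fraction):
    """Return True if at least `fraction` of the other objects lie
    farther than `radius` from points[idx] (Euclidean distance assumed)."""
    others = [p for i, p in enumerate(points) if i != idx]
    outside = sum(1 for p in others if math.dist(points[idx], p) > radius)
    return outside >= fraction * len(others)
```

Testing every object this way needs a distance computation for every pair of objects, which is the quadratic cost the related work slide refers to.
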
Our Approach
• Converse of the definition
• An object is a non-outlier if it has enough neighbors within the specified radius.
• Easy to find a small number of neighbors!
[Figure: a non-outlier, with its Neighbors inside the radius and Non Neighbors outside it]

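The converse test can stop as soon as enough neighbors have been seen, which is why a small neighbor count is cheap to establish. A minimal sketch, assuming Euclidean distance and a hypothetical `min_neighbors` parameter standing in for "enough neighbors":

```python
import math


def is_non_outlier(points, idx, radius, min_neighbors):
    """Return True once `min_neighbors` objects within `radius` of
    points[idx] have been found; the scan stops early in that case."""
    found = 0
    for i, p in enumerate(points):
        if i != idx and math.dist(points[idx], p) <= radius:
            found += 1
            if found >= min_neighbors:
                return True  # enough neighbors, no need to look further
    return found >= min_neighbors
```

For an object in a dense region the loop typically exits after a handful of comparisons, while the outlier test has to examine the whole database.
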
Locality Sensitive Hashing (LSH)
• Similar objects are hashed to the same bin
• Property
• Condition
• Hash Family

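For reference, the usual definition of a locality sensitive hash family can be stated as follows. This is the standard textbook form, given here as an assumption about what the Property, Condition and Hash Family items refer to; the symbols r_1, r_2, p_1, p_2, a, b and w are the conventional ones, not symbols taken from the slides.

```latex
% A family H is (r_1, r_2, p_1, p_2)-sensitive if, for any objects u and v:
\begin{align*}
d(u, v) \le r_1 &\implies \Pr_{h \in H}\bigl[h(u) = h(v)\bigr] \ge p_1, \\
d(u, v) \ge r_2 &\implies \Pr_{h \in H}\bigl[h(u) = h(v)\bigr] \le p_2,
\end{align*}
% with r_1 < r_2 and p_1 > p_2.

% Hash family based on a p-stable distribution: a has entries drawn from a
% p-stable distribution, b is uniform in [0, w], and w is the bin width.
\[
h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor
\]
```
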
Centralized Algorithm
[Diagram: Outlier Detection, Find Non Outliers, Near Neighbor Queries, LSH]
• Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar. LSH Based Outlier Detection and Its Application in Distributed Setting. CIKM 2011

Distributed Settings
• Vertically distributed data
  • Each player has different attributes for the same set of objects
• Horizontally distributed data
  • Each player has the same attributes for a subset of the total objects

Privacy in Vertical Distribution
• Two phases:
  • Generate the global LSH bin structure using information from all the players
  • Find non outliers using the generated global bin structure
• How do we generate the LSH bin structure privately?
  • Private Hash Evaluation
  • LSH based on p-stable distribution
• Secure Evaluation of Dot Product (a.v)
  • Each player generates the values of vector a corresponding to the dimensions of v it holds.
  • Each player adds its corresponding products to generate its share of the dot product.
  • The Secure Sum protocol combines the shares into the final dot product.

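The dot product steps can be sketched as below. This is a single-process simulation under stated assumptions, not the authors' protocol: the secure sum is the classic masked running total, real values are encoded as fixed-point integers, and `SCALE`, `MODULUS` and all function names are hypothetical. The Gaussian draw for `a` corresponds to the 2-stable case.

```python
import math
import random

SCALE = 10**6        # fixed-point precision (assumption)
MODULUS = 2**61 - 1  # size of the masking ring (assumption)


def encode(x):
    """Map a real value to a fixed-point residue modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(n):
    """Map a residue back to a signed real value."""
    if n > MODULUS // 2:
        n -= MODULUS
    return n / SCALE


def local_share(a_slice, v_slice):
    """One player's share: products over the dimensions that player holds."""
    return sum(a * v for a, v in zip(a_slice, v_slice))


def secure_sum(shares, rng):
    """Masked running total: the initiator adds a random mask, every player
    adds its encoded share, and the initiator removes the mask at the end."""
    mask = rng.randrange(MODULUS)
    running = mask
    for share in shares:
        running = (running + encode(share)) % MODULUS
    return decode((running - mask) % MODULUS)


def private_dot_product(v_slices, rng):
    """Each player draws its own slice of `a` and contributes one share.
    The slices of `a` are returned only so this demo can be checked."""
    a_slices = [[rng.gauss(0.0, 1.0) for _ in v] for v in v_slices]
    shares = [local_share(a, v) for a, v in zip(a_slices, v_slices)]
    return secure_sum(shares, rng), a_slices


def lsh_bin(dot, b, w):
    """Bin label for one p-stable hash, given the dot product a.v."""
    return math.floor((dot + b) / w)
```

In this sketch every intermediate running total is offset by the mask, so a player that sees it learns nothing about the shares added before it unless it colludes with its neighbors in the ring.
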
Privacy in Vertical Distribution
• Perform near neighbor queries
  • Secure Distance and Secure Comparison
  • Many neighbors
  • Quadratic Communication
• Can we break the Quadratic Bound?

Approximate Near Neighbor Queries
• The definition of outliers is subjective
• Unlike traditional LSH queries, NO explicit distance calculation
• No communication required
[Flowchart: Hash Objects, then Count Neighbors; Yes: Non Outlier, No: Outlier]

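The flow of hashing objects and counting bin mates can be sketched as follows. This is an illustrative sketch, assuming Gaussian (2-stable) hashes and treating any object that shares a bin in at least one table as a neighbor; `num_tables`, `hashes_per_table`, `w` and `min_neighbors` are hypothetical parameters.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, w, rng):
    """One p-stable (Gaussian) hash: floor((a.v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / w)


def build_tables(points, num_tables, hashes_per_table, w, rng):
    """Hash every object into one bin per table."""
    dim = len(points[0])
    tables = []
    for _ in range(num_tables):
        funcs = [make_hash(dim, w, rng) for _ in range(hashes_per_table)]
        bins = defaultdict(list)
        labels = []
        for idx, p in enumerate(points):
            label = tuple(f(p) for f in funcs)
            bins[label].append(idx)
            labels.append(label)
        tables.append((labels, bins))
    return tables


def approximate_neighbors(idx, tables):
    """Objects sharing a bin with object idx in any table.
    No distance is ever computed."""
    found = set()
    for labels, bins in tables:
        found.update(bins[labels[idx]])
    found.discard(idx)
    return found


def classify(points, tables, min_neighbors):
    """Yes branch of the flowchart: non outlier. No branch: outlier."""
    return [
        "non outlier" if len(approximate_neighbors(i, tables)) >= min_neighbors
        else "outlier"
        for i in range(len(points))
    ]
```

Once the bin structure exists, each query is a lookup in that structure, which is consistent with the slide's point that no further communication is needed at query time.
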
Need for Pruning
• Number of queries = number of objects in the database
• Databases are very large
• Can we reduce the number of queries?
[Flowchart: Hash Objects, then Count Neighbors; Yes: Non Outlier, No: Outlier]

Pruning
• Neighbors of a non outlier are also non outliers
• Less than 1% of the total database needs to be processed!
[Flowchart: Hash Objects, then Count Neighbors; Yes: Non Outliers, No: Outlier]

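The pruning rule can be sketched as a loop that skips every object already marked as a non outlier. This is a minimal sketch; `neighbors_of` is a hypothetical callback that returns the approximate neighbors of an object, and `min_neighbors` is an assumed threshold.

```python
def prune_non_outliers(num_objects, neighbors_of, min_neighbors):
    """Return (outliers, number_of_queries_issued)."""
    is_non_outlier = [False] * num_objects
    queries = 0
    for idx in range(num_objects):
        if is_non_outlier[idx]:
            continue  # already pruned, so no query is issued
        queries += 1
        neighbors = neighbors_of(idx)
        if len(neighbors) >= min_neighbors:
            # The object and all of its neighbors are marked in one step.
            is_non_outlier[idx] = True
            for n in neighbors:
                is_non_outlier[n] = True
    outliers = [i for i in range(num_objects) if not is_non_outlier[i]]
    return outliers, queries
```

When most objects sit in dense bins, one query marks a whole bin, so the number of queries can be far smaller than the number of objects.
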
Privacy in Horizontal Distribution
• Data is the union of the sets of objects held by all players
• Steps:
  • Generate local LSH bin structure
  • Perform local pruning
  • Communicate to obtain global neighbor information
  • Perform global pruning
• How do we obtain global neighbor information privately?

Private Global Bin Structure
• Construct global LSH bin labels
  • Secure Union Protocol
• Add the object counts of corresponding bins
  • Secure Sum protocol
• Perform global pruning using the global bin structure

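The data flow of these steps can be sketched as below. The union and the sum are computed in the clear here, purely as stand-ins for the Secure Union and Secure Sum protocols, so this sketch offers no privacy by itself; the function names and `min_neighbors` are hypothetical.

```python
def global_bin_counts(local_counts):
    """local_counts: one dict per player, mapping bin label -> local count."""
    labels = set()
    for counts in local_counts:
        labels |= set(counts)  # plain stand-in for the Secure Union protocol
    return {
        # plain stand-in for the Secure Sum protocol, one sum per bin label
        label: sum(counts.get(label, 0) for counts in local_counts)
        for label in labels
    }


def globally_pruned_bins(global_counts, min_neighbors):
    """Bins whose global population gives each member enough bin mates."""
    return {
        label for label, count in global_counts.items()
        if count - 1 >= min_neighbors
    }
```

A bin that looks sparse to each player alone can still be dense globally, which is why the counts are combined before the final pruning step.
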
Approximation Error
• LSH is probabilistic
• Probability of being a near neighbor is at least [expression missing in source]
• False neighbors may cause pruning of an outlier
  • False Negatives
• How do we reduce False Negatives?

Reducing False Negatives
• Bin Threshold (BT)
• Neighbor only if it appears in at least BT bins
• Increasing BT will decrease False Negatives
[Flowchart: Hash Objects, then Count Neighbors; Yes: Non Outlier, No: Outlier]

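The bin threshold can be sketched as a co-occurrence count across hash tables. This is an illustrative sketch; the layout `bin_labels[t][i]` (the bin label of object `i` in table `t`) and the function name are assumptions.

```python
from collections import Counter


def neighbors_with_bin_threshold(idx, bin_labels, bin_threshold):
    """Objects that share a bin with object idx in at least
    `bin_threshold` of the tables."""
    shared = Counter()
    for labels in bin_labels:
        for other, label in enumerate(labels):
            if other != idx and label == labels[idx]:
                shared[other] += 1
    return {other for other, count in shared.items() if count >= bin_threshold}
```

With three tables, raising the threshold from 1 to 2 drops objects that collided with the query object only once, which are the most likely false neighbors.
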
Bin Threshold
• Bin Threshold may remove actual neighbors
• A high Bin Threshold reduces pruning efficiency
• False Positives
• How do we reduce False Positives without increasing False Negatives?

Reducing False Positives
• Multiple Runs: Iteration 1, Iteration 2, ..., Iteration n
• Each iteration: LSH and Pruning (Compute Parameters, Generate Bin Structure, Find Near Neighbors, Prune Non Outliers)
• Output: Intersection of Results gives the Final Set of Outliers

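The multiple-run scheme reduces to intersecting the candidate sets of the individual iterations. A minimal sketch, where `detect_outliers` is a hypothetical callback that performs one full LSH and pruning run for the given iteration number and returns the objects it reports as outliers:

```python
def intersect_runs(detect_outliers, num_runs):
    """Keep only the objects that every run reports as outliers."""
    final = None
    for run in range(num_runs):
        candidates = set(detect_outliers(run))
        final = candidates if final is None else final & candidates
    return final if final is not None else set()
```

An object survives only if every run, each with freshly drawn hash functions, fails to find enough neighbors for it.
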
Analysis
• Security of the algorithm depends on the security of the Secure Union and Secure Sum protocols

Effect of Bin Threshold (BT)
• Increasing BT will increase detection rate but also increase false positives
• Optimal BT
  • High detection rate
  • Low false positives
[Plots: Corel, Landsat, Darpa, Household]

Effect of Iterations on False Positives
• False positives decrease exponentially with increase in iterations
• Very small number of iterations needed to achieve low false positive rate

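One way to read the exponential decrease, assuming (hypothetically) that each run reports a given non outlier as an outlier independently with probability q, and that an object is output only if every run reports it:

```latex
\[
\Pr[\text{false positive survives } t \text{ runs}] = q^{t},
\qquad \text{e.g. } q = 0.1,\ t = 3 \;\Rightarrow\; q^{t} = 10^{-3}
\]
```

The independence assumption and the value of q are illustrative only; the slides report the trend empirically.
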
Communication
• Less than Quadratic
• Superior to previously known best results
• Up to 10000 times less communication on datasets of size 10^6!
[Plots: Corel, Landsat, Darpa, Household]

Performance
• False Positives can be considered as borderline outliers!

Conclusion
• Approximate Outlier Detection
• Efficient Private algorithms for both Vertical and Horizontal Distribution
• Efficient Pruning based on LSH
• Scalable for large and high dimensional data
• Trade off between Accuracy and Cost

CSTAR nisarg.raval@research.iiit.ac.in Supported by Microsoft Research India Travel Grant