Privacy Preserving Outlier Detection using Locality Sensitive Hashing

Nisarg Raval, Madhuchand Rushi Pillutla, Piyush Bansal, K Srinathan and C V Jawahar
IIIT Hyderabad, India
CSTAR

Motivation
• Trusted Third Party (TTP)
• Can we avoid the TTP?
• Simulate the Trusted Third Party

Privacy Preserving Outlier Detection
• Alice and Bob each have a database of customer behavior.
• Together they want to find the fraudulent customers (outliers) in their respective databases.
• Only outliers should be revealed.
• Individual data should remain private.

Outlier Detection Approach
• Statistics based
  • Barnett et al., John Wiley 1994
• Density based
  • Papadimitriou et al., ICDE 2003
• Distance based
  • Knorr et al., VLDB 1998
  • Ramaswamy et al., SIGMOD 2000
  • Wang et al., ICDE 2011

Privacy Preserving Data Mining
• Heuristic based
  • Atallah et al., KDEW 1999
  • Verykios et al., KDE 2003
• Reconstruction based
  • Agrawal et al., SIGMOD 2000
  • Rizvi et al., VLDB 2002
• Cryptography based
  • Lindell et al., CRYPTO 2000
  • Clifton et al., SIGKDD 2002
Verykios et al., SIGMOD 2004

Related Work
• Vaidya et al., ICDM 2004
  • Pairwise distance computation
  • Secure Distance and Secure Comparison protocols
• Zhou et al., EBISS 2009
  • Pairwise distance computation
  • Homomorphic Encryption and Randomization
• Quadratic cost
  • Approximately 10^12 operations on 1 million data points
Our method is 10,000 times faster on 1 million data points!

Outlier Detection
• Distance based outliers [Knorr et al. VLDB 1998]
  • An object is an outlier if a very large fraction of the total objects lie outside the specified radius.
[Figure labels: Outlier, Neighbors, Non Neighbors]

Our Approach
• Converse of the definition
  • An object is a non-outlier if it has enough neighbors within the specified radius.
• Easy to find a small number of neighbors!
[Figure labels: Non-Outlier, Neighbors, Non Neighbors]
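Note: the converse definition lends itself to early termination, since a check can stop as soon as enough neighbors are found. The following is a minimal brute-force sketch of that idea, not the authors' algorithm; the function name `is_non_outlier`, the parameter `min_neighbors`, and the toy points are assumptions made for illustration.

```python
import math


def is_non_outlier(idx, points, radius, min_neighbors):
    """Return True as soon as `min_neighbors` other points lie within `radius`."""
    found = 0
    for j, q in enumerate(points):
        if j == idx:
            continue  # an object is not its own neighbor
        if math.dist(points[idx], q) <= radius:
            found += 1
            if found >= min_neighbors:
                return True  # enough neighbors found, stop scanning
    return False  # too few neighbors within the radius: treat as outlier


# Toy data (hypothetical): four clustered points and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [
    i for i in range(len(points))
    if not is_non_outlier(i, points, radius=0.5, min_neighbors=2)
]
print(outliers)
```

Only the isolated point fails to collect two neighbors, so it is the only index reported. The scan is still quadratic in the worst case; the later slides replace it with LSH bins.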

Locality Sensitive Hashing (LSH)
Similar objects are hashed to the same bin
• Property
• Condition
• Hash Family
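Note: the hash family on this slide did not survive extraction. A later slide says the method uses LSH based on p-stable distributions, so the sketch below assumes the commonly used form h(v) = floor((a.v + b) / w) with Gaussian entries in a. The bucket width `w`, the number of concatenated hashes `k`, and all function names are illustrative assumptions, not values from the talk.

```python
import math
import random


def make_hash(dim, w, rng):
    """Build one hash h(v) = floor((a.v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian entries (2-stable)
    b = rng.uniform(0.0, w)                        # random offset between 0 and w

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / w)

    return h


def make_bin_label(dim, k, w, rng):
    """Concatenate k hashes into one bin label (a tuple of integers)."""
    hashes = [make_hash(dim, w, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hashes)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
bin_label = make_bin_label(dim=2, k=3, w=4.0, rng=rng)
label = bin_label((1.0, 2.0))
print(label)
```

Objects that are close in Euclidean distance tend to receive the same label, but only with some probability, which is why the later slides discuss approximation error.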

Centralized Algorithm
[Diagram: Outlier Detection → Find Non Outliers → Near Neighbor Queries → LSH]
• Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar
• LSH Based Outlier Detection and Its Application in Distributed Setting, CIKM 2011

Distributed Settings
• Vertically distributed data
  • Each player has different attributes for the same set of objects
• Horizontally distributed data
  • Each player has the same attributes for a subset of the total objects

Privacy in Vertical Distribution
• Two phases:
  • Generation of the global LSH bin structure using information from all the players
  • Finding non-outliers using the generated global bin structure
How do we generate the LSH bin structure privately?

Privacy in Vertical Distribution
• Private Hash Evaluation
• LSH based on p-stable distribution

Privacy in Vertical Distribution
• Secure evaluation of the dot product (a.v)
  • Each player generates the values of vector a corresponding to the dimensions of v they hold.
  • Each player adds the corresponding products to obtain its share of the dot product.
  • The Secure Sum protocol combines the shares into the final dot product.
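Note: the steps above can be sketched as follows. The slides do not say which Secure Sum variant is used, so this sketch assumes the classic masking scheme, in which an initiator adds a random mask that is removed at the end. It is a single-process simulation of the data flow with no real security; `MODULUS`, `SCALE`, and all names and values are hypothetical.

```python
import random

MODULUS = 2**61 - 1  # hypothetical ring size for the masked running total
SCALE = 10**6        # hypothetical fixed-point scale for real-valued shares


def encode(x):
    """Map a real number to a ring element using fixed-point scaling."""
    return round(x * SCALE) % MODULUS


def decode(n):
    """Map a ring element back to a (possibly negative) real number."""
    if n > MODULUS // 2:
        n -= MODULUS
    return n / SCALE


def partial_dot(a_part, v_part):
    """A player's share: dot product over the dimensions it holds."""
    return sum(ai * vi for ai, vi in zip(a_part, v_part))


def secure_sum(shares, rng):
    """Simulated masked sum: only masked running totals would be passed on."""
    mask = rng.randrange(MODULUS)  # known only to the initiating player
    running = mask
    for share in shares:           # each player adds its own encoded share
        running = (running + share) % MODULUS
    return (running - mask) % MODULUS  # the initiator removes the mask


rng = random.Random(1)
# Vertical split (toy values): Alice holds dimensions 0-1, Bob holds dimension 2.
alice_share = partial_dot([0.5, -1.0], [4.0, 1.0])  # 2.0 - 1.0 = 1.0
bob_share = partial_dot([2.0], [3.0])               # 6.0
dot = decode(secure_sum([encode(alice_share), encode(bob_share)], rng))
print(dot)
```

Working modulo a large number keeps the masked running total uniformly distributed, which is the usual reason for the ring arithmetic in this kind of protocol.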

Privacy in Vertical Distribution
• Perform near neighbor queries
• Secure Distance and Secure Comparison
• Many neighbors
• Quadratic communication
Can we break the quadratic bound?

Approximate Near Neighbor Queries
• Definitions of outliers are subjective
• Unlike traditional LSH queries, NO explicit distance calculation
• No communication required
[Flowchart: Hash Objects → Count Neighbors → Yes: Non Outlier / No: Outlier]
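Note: a minimal sketch of a query that counts neighbors purely from shared bins, with no distance computation. The bin labels are assumed to have been produced already by the LSH step; the layout `labels_per_table[t][i]`, the threshold `min_neighbors`, and the toy labels are illustrative assumptions.

```python
from collections import defaultdict


def build_tables(labels_per_table):
    """labels_per_table[t][i] is the bin label of object i in hash table t."""
    tables = []
    for labels in labels_per_table:
        bins = defaultdict(set)
        for obj, label in enumerate(labels):
            bins[label].add(obj)
        tables.append((labels, bins))
    return tables


def approx_neighbors(obj, tables):
    """Objects sharing a bin with `obj` in any table; no distances computed."""
    found = set()
    for labels, bins in tables:
        found |= bins[labels[obj]]
    found.discard(obj)
    return found


def is_non_outlier(obj, tables, min_neighbors):
    return len(approx_neighbors(obj, tables)) >= min_neighbors


# Toy bin labels (hypothetical) for 4 objects in 2 hash tables.
tables = build_tables([["A", "A", "B", "C"], ["X", "Y", "X", "Z"]])
flags = [is_non_outlier(i, tables, min_neighbors=2) for i in range(4)]
print(flags)
```

Object 0 shares a bin with object 1 in the first table and with object 2 in the second, so it alone reaches two neighbors in this toy input.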

Need for Pruning
• Number of queries = number of objects in the database
• Databases are very large
Can we reduce the number of queries?
[Flowchart: Hash Objects → Count Neighbors → Yes: Non Outlier / No: Outlier]

Pruning
• Neighbors of a non-outlier are also non-outliers
• < 1% of the total database needs to be processed!
[Flowchart: Hash Objects → Count Neighbors → Yes: Non Outliers / No: Outlier]
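Note: a minimal sketch of the pruning rule, assuming a hypothetical `query(obj)` callback that returns the approximate neighbors of an object. Once an object is found to be a non-outlier, its neighbors are marked as non-outliers too and are never queried. The toy neighbor sets are made up for illustration and say nothing about the pruning rates reported in the talk.

```python
def prune(n_objects, query, min_neighbors):
    """Return (outliers, queries_run) using the neighbor-propagation rule."""
    non_outlier = [False] * n_objects
    queries_run = 0
    for obj in range(n_objects):
        if non_outlier[obj]:
            continue  # already pruned: no query needed for this object
        queries_run += 1
        neighbors = query(obj)
        if len(neighbors) >= min_neighbors:
            non_outlier[obj] = True
            for nb in neighbors:  # neighbors of a non-outlier are pruned too
                non_outlier[nb] = True
    outliers = [i for i in range(n_objects) if not non_outlier[i]]
    return outliers, queries_run


# Toy neighbor sets (hypothetical): objects 0-3 form a cluster, 4 is isolated.
neighbor_sets = [{1, 2, 3}, {0, 2, 3}, {0, 1, 3}, {0, 1, 2}, set()]
outliers, queries_run = prune(5, neighbor_sets.__getitem__, min_neighbors=3)
print(outliers, queries_run)
```

Here only two of the five objects are ever queried: the first cluster member, which prunes the rest of its cluster, and the isolated object.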

Privacy in Horizontal Distribution
• Data is the union of the sets of objects that all players have
• Steps:
  • Generate local LSH bin structure
  • Perform local pruning
  • Communicate to obtain global neighbor information
  • Perform global pruning
How do we obtain global neighbor information privately?

Private Global Bin Structure
• Construct global LSH bin labels
  • Secure Union protocol
• Add the counts of objects in corresponding bins
  • Secure Sum protocol
• Perform global pruning using the global bin structure
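Note: the data flow above can be sketched as follows. The plain set union and plain addition below are insecure stand-ins for the Secure Union and Secure Sum protocols, whose details are not given on the slides; they only show what the global bin structure contains. The player names, bin labels, and counts are hypothetical.

```python
def global_bin_counts(local_counts):
    """Merge per-player {bin label: object count} maps into global counts."""
    labels = set()
    for counts in local_counts:
        labels |= set(counts)  # stand-in for the Secure Union protocol
    return {
        label: sum(counts.get(label, 0) for counts in local_counts)  # stand-in for Secure Sum
        for label in sorted(labels)
    }


# Toy local bin structures (hypothetical) for two players.
alice = {"A": 2, "B": 1}
bob = {"A": 1, "C": 1}
merged = global_bin_counts([alice, bob])

# Global pruning sketch: a bin is dense if each member has enough bin-mates.
min_neighbors = 2
dense_bins = [label for label, c in merged.items() if c - 1 >= min_neighbors]
print(merged, dense_bins)
```

Neither player's bin "A" is dense on its own in this toy input; it only becomes dense once the counts are combined, which is why the global step is needed.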

Approximation Error
• LSH is probabilistic
  • Probability of being near neighbor is at least …
• False neighbors may cause pruning of an outlier
  • False Negatives
How do we reduce False Negatives?

Reducing False Negatives
• Bin Threshold (BT)
  • Neighbor only if it appears in at least BT bins
  • Increasing BT will decrease False Negatives
[Flowchart: Hash Objects → Count Neighbors → Yes: Non Outlier / No: Outlier]
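Note: a minimal sketch of the Bin Threshold rule, assuming each object has one bin label per hash table and that "appears in at least BT bins" means sharing a bin with the query object in at least BT tables. The layout `labels_per_table[t][i]` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def neighbors_with_bin_threshold(obj, labels_per_table, bt):
    """Objects that share a bin with `obj` in at least `bt` hash tables."""
    shared = Counter()
    for labels in labels_per_table:
        for other, label in enumerate(labels):
            if other != obj and label == labels[obj]:
                shared[other] += 1  # one more table where both collide
    return {other for other, count in shared.items() if count >= bt}


# Toy labels (hypothetical): 3 objects, 3 hash tables.
labels_per_table = [["A", "A", "A"], ["X", "X", "Y"], ["P", "P", "P"]]
loose = neighbors_with_bin_threshold(0, labels_per_table, bt=1)
strict = neighbors_with_bin_threshold(0, labels_per_table, bt=3)
print(loose, strict)
```

Object 2 collides with object 0 in only two of the three tables, so it counts as a neighbor at BT = 1 but not at BT = 3.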

Bin Threshold
• Bin Threshold may remove actual neighbors
• A high Bin Threshold reduces pruning efficiency
• False Positives
How do we reduce False Positives without increasing False Negatives?

Reducing False Positives
[Flowchart: Multiple Runs (Iteration 1, Iteration 2, ..., Iteration n). Each iteration: LSH (Compute Parameters, Generate Bin Structure), then Pruning (Find Near Neighbors, Prune Non Outliers). Output: Intersection of Results → Final Set of Outliers]
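Note: the output stage of the flowchart can be sketched as a set intersection: an object is reported only if every run flags it. The per-run outlier sets below are made-up values for illustration; in the method each run would come from a fresh LSH bin structure followed by pruning.

```python
def intersect_runs(outlier_sets):
    """Keep only the objects flagged as outliers in every run."""
    if not outlier_sets:
        return set()
    result = set(outlier_sets[0])
    for run in outlier_sets[1:]:
        result &= set(run)  # drop anything a later run did not flag
    return result


# Hypothetical outlier sets from three independent runs.
runs = [{4, 7, 9}, {4, 9, 12}, {4, 9}]
final_outliers = intersect_runs(runs)
print(sorted(final_outliers))
```

Objects 7 and 12 are each flagged in a single run only, so they drop out of the final set.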

Analysis
• The security of the algorithm depends on the security of the Secure Union and Secure Sum protocols

Experimental Results

Effect of Bin Threshold (BT)
• Increasing BT will increase the detection rate but also increase false positives
• Optimal BT
  • High detection rate
  • Low false positives
[Plots: Corel, Landsat, Darpa, Household]

Effect of Iterations on False Positives
• False positives decrease exponentially with an increase in iterations
• A very small number of iterations is needed to achieve a low false positive rate

Communication
• Less than quadratic
• Superior to the previously known best results
• Up to 10,000 times less communication on datasets of size 10^6!
[Plots: Corel, Landsat, Darpa, Household]

Performance
• False positives can be considered borderline outliers!

Conclusion
• Approximate outlier detection
• Efficient private algorithms for both vertical and horizontal distribution
• Efficient pruning based on LSH
• Scalable for large and high dimensional data
• Trade-off between accuracy and cost

CSTAR
nisarg.raval@research.iiit.ac.in
Supported by Microsoft Research India Travel Grant
