Thesis on privacy-preserving data mining and encrypted search: private outlier detection, and private content-based search on encrypted data using hierarchical index structures. The approaches aim to improve accuracy and reduce cost in outlier detection and private image retrieval.
Thesis Defense Private Outlier Detection and Content based Encrypted Search Nisarg Raval MS by Research, CSE Advisors : Prof. C. V. Jawahar & Dr. Kannan Srinathan IIIT – Hyderabad
Privacy Preserving Data Mining (PPDM) • Sharing information leads to mutual gain • Patient records help medical research • Data confidentiality • Sensitive information • Privacy with Utility • Randomization • Cryptographic
Privacy in Information Retrieval • Cloud based solutions • Storage and Processing • Loss of control over data • Private database • Encrypted Search • Public database • Private Information Retrieval
Contributions • Distributed Outlier Detection using Locality Sensitive Hashing • Privacy Preserving Outlier Detection using Locality Sensitive Hashing • Private Content Based Search on Encrypted Data using Hierarchical Index Structures • Private Content Based Image Retrieval
Motivation Trusted Third Party (TTP)
Motivation Can we avoid TTP ? Trusted Third Party (TTP)
Motivation Simulate Trusted Third Party
Previous Results • Vaidya et al. ICDM 2004 • Secure Distance and Secure Comparison Protocol • Zhou et al. EBISS 2009 • Homomorphic Encryption and Randomization • Quadratic Cost How do we reduce Quadratic Cost ?
Approximation • No crisp definition of Outliers • Approximation is as good as exact results • Reducing Quadratic cost by approximation Trade off between Accuracy and Cost
Outlier Detection • Distance based outlier [Knorr et al. VLDB 1998] • An object is an outlier if a very large fraction of the total objects lie outside the specified radius. (Figure labels: Outlier, Neighbors, Non Neighbors)
Our Approach • Converse of the definition • An object is a non-outlier if it has enough neighbors within the specified radius. Easy to find a small number of neighbors! (Figure labels: Non-Outlier, Neighbors, Non Neighbors)
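The definition and its converse can be written compactly. The sketch below borrows the symbols DT (distance threshold, i.e. the specified radius), PT (point threshold, i.e. the fraction of objects) and N (dataset size) from the PPOD Example slide; the neighbor-set notation is introduced here only for illustration, and the boundary convention (the object does not count itself as a neighbor) is an assumption chosen to agree with that example.

```latex
% D: dataset of N objects, DT: distance threshold, PT: point threshold
\[
  \mathcal{N}_{DT}(o) \;=\; \{\, p \in D \setminus \{o\} \;:\; \operatorname{dist}(o, p) \le DT \,\}
\]
\[
  o \text{ is an outlier} \iff |\mathcal{N}_{DT}(o)| < (1 - PT)\, N,
  \qquad
  o \text{ is a non-outlier} \iff |\mathcal{N}_{DT}(o)| \ge (1 - PT)\, N .
\]
```

Under this reading, certifying a non-outlier only requires finding (1 - PT) N neighbors, which is a small number when PT is close to 1.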
Locality Sensitive Hashing (LSH) • Property • Condition • Hash Family Similar objects are hashed to the same bin
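The slide does not say which hash family is used. As a minimal sketch, assuming the widely used random-projection family for Euclidean distance, h(v) = floor((a . v + b) / w), a bin key can be built by concatenating k such hashes. The names `make_hash`, `make_bin_key`, `width` and `k` are hypothetical and chosen for this illustration.

```python
import math
import random


def make_hash(dim, width, rng):
    """Return one hash h(v) = floor((a . v + b) / width) (assumed hash family)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, width)                    # random offset in [0, width]

    def h(v):
        projection = sum(a_i * v_i for a_i, v_i in zip(a, v))
        return math.floor((projection + b) / width)

    return h


def make_bin_key(dim, width, k, rng):
    """Concatenate k hashes; points with equal keys fall in the same bin."""
    hashes = [make_hash(dim, width, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hashes)


rng = random.Random(0)  # fixed seed so the bin structure is reproducible
bin_key = make_bin_key(dim=2, width=4.0, k=3, rng=rng)
key_of_point = bin_key((1.0, 1.0))
```

Nearby points collide with high probability and distant points with low probability; larger `k` makes bins more selective, and using several independent keys raises the chance that true neighbors share at least one bin.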
Centralized Outlier Detection LSH Pruning Compute Parameters Find Near Neighbors Generate Bin Structure Prune Non Outliers Phase I Phase II Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar; LSH Based Outlier Detection and Its Application in Distributed Setting; CIKM 2011
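The two phases above can be sketched roughly as follows. This is an illustrative reading, not the algorithm from the cited paper: it assumes the random-projection hash family, verifies distances among bin-mates in Phase I, and falls back to an exhaustive scan in Phase II, so its output matches a brute-force check. The slides' version is approximate and trades some accuracy for lower cost. All function and parameter names are hypothetical.

```python
import math
import random
from collections import defaultdict


def build_tables(points, num_tables, k, width, seed=0):
    """Hash every point into num_tables bin structures (assumed hash family)."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(num_tables):
        # k random projections (a, b) concatenated into one bin key
        projs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
                 for _ in range(k)]
        bins, keys = defaultdict(list), []
        for idx, p in enumerate(points):
            key = tuple(
                math.floor((sum(a_i * x for a_i, x in zip(a, p)) + b) / width)
                for a, b in projs
            )
            bins[key].append(idx)
            keys.append(key)
        tables.append((bins, keys))
    return tables


def lsh_outliers(points, radius, min_neighbors, num_tables=4, k=2, width=4.0):
    tables = build_tables(points, num_tables, k, width)

    # Phase I: prune points that already have enough neighbors among bin-mates.
    candidates = []
    for i, p in enumerate(points):
        mates = set()
        for bins, keys in tables:
            mates.update(bins[keys[i]])
        mates.discard(i)
        near = sum(1 for j in mates if math.dist(p, points[j]) <= radius)
        if near < min_neighbors:
            candidates.append(i)

    # Phase II: check the few surviving candidates against all points.
    outliers = []
    for i in candidates:
        near = sum(1 for j, q in enumerate(points)
                   if j != i and math.dist(points[i], q) <= radius)
        if near < min_neighbors:
            outliers.append(i)
    return outliers


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found = lsh_outliers(data, radius=2, min_neighbors=2)
```

The saving comes from Phase I: most points are non-outliers and are discarded after looking only at their own bins, so the expensive Phase II runs on a small remainder.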
Distributed Setting • Horizontally partitioned data • Each player has the same attributes for a subset of the total objects A B
Distributed Outlier Detection • Global LSH parameters • Local Pruning • Centralized Algorithm • Local Outliers • May include non-outliers What about the non-outliers which have neighbours in the other player's dataset? (Figure labels: Player A, Player B)
Distributed System Overview Global Parameters Local Pruning Global Pruning Exhaustive Pruning
Private Outlier Detection • Three Phases • Global parameter computation • Local pruning • Secure global pruning • Secure Exhaustive Pruning • Secure distance computation • High cost • Minimal increase in False Positives Trade off between Accuracy and Cost
Private System Overview One player computes LSH parameters and publishes Secure Sum Local Pruning Secure Union Construct global LSH bin structure using secure union and secure sum protocols Secure Sum Secure Exhaustive Pruning is costly
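The slides name a secure sum protocol without giving its construction. A minimal single-process simulation of the textbook ring-based secure sum is sketched below, assuming honest-but-curious players and a true total smaller than the modulus; it is not necessarily the protocol used in the thesis. The function name `secure_sum` and the example bin counts are hypothetical.

```python
import secrets


def secure_sum(private_values, modulus=2**61 - 1):
    """Simulate a ring-based secure sum; private_values[i] belongs to player i."""
    # Player 0 hides its input behind a random mask known only to itself.
    mask = secrets.randbelow(modulus)
    running = (mask + private_values[0]) % modulus

    # Each subsequent player sees only the masked running total and adds its value.
    for value in private_values[1:]:
        running = (running + value) % modulus

    # The total returns to player 0, which removes the mask and publishes the sum.
    return (running - mask) % modulus


# Example: three players each hold the count of their points in one LSH bin.
bin_counts = [3, 0, 5]
total_in_bin = secure_sum(bin_counts)
```

Applied per bin, this lets the players learn how many points fall in each global bin without revealing any individual player's count.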
Analysis • Computational Cost: • L is sub-linear in n • Communication Cost: • Nb << n • Independent of dimensionality • Round Complexity : constant • Security of the algorithm depends on the security of the secure union and secure sum protocols • Honest But Curious Model (HBC) n : Number of objects d : Number of dimensions L : Number of hash functions Nb : Number of bins
Experimental Results DistributedOD : Distributed Outlier Detection PrivateOD : Private Outlier Detection DA : Processed Points PT : Point Threshold NN : Near Neighbors Only 0.02% increase in False Positives of PrivateOD False Positives can be considered as borderline outliers!
Comparison • Less than Quadratic • Superior to the best previously known results • Up to 10000 times less communication on datasets of size 10^6! (Plot panels: Corel, Landsat, HouseHold, Darpa)
Motivation Content Based Image Retrieval Google Goggles
Motivation Query Image Potential Privacy Breach!
Existing Solutions • Download the entire database and search at client side • Trivial but Impractical • Kuzu et al. ICDE 2012 • Multiple rounds of Similarity SSE (CSL) • Low accuracy and High Cost • Shashank et al. CVPR 2008 • Private Content Based Image Retrieval (PCBIR) • Complexity linear in database size Single server PIR has to access every element in the database!
Our Approach • Two server solution • Content Owner and Database Server • Hierarchical Index Structures • Client and Server jointly perform secure search • Improve Accuracy • Bag of Words • Reduce Complexity • Multi-round Protocol
Vocabulary Tree • Bag of Visual Words • Visual Words = Vector quantization of feature vectors
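As a rough illustration of how a vocabulary tree quantizes feature vectors into visual words, the sketch below descends a tiny hand-built tree by picking the nearest child center at each level, then counts the resulting words. The tree, its centers and the sample features are made up for this example; in practice the centers would come from hierarchical clustering of training features.

```python
import math
from collections import Counter


def leaf(word_id):
    return {"word": word_id}


def node(centers, children):
    return {"centers": centers, "children": children}


# Hypothetical toy tree: branching factor 2, depth 2, four visual words.
TREE = node(
    centers=[(0.0, 0.0), (10.0, 10.0)],
    children=[
        node(centers=[(0.0, 0.0), (0.0, 2.0)], children=[leaf(0), leaf(1)]),
        node(centers=[(10.0, 10.0), (12.0, 10.0)], children=[leaf(2), leaf(3)]),
    ],
)


def quantize(tree, feature):
    """Walk from the root to a leaf, following the nearest center at each level."""
    current = tree
    while "word" not in current:
        centers = current["centers"]
        best = min(range(len(centers)), key=lambda i: math.dist(feature, centers[i]))
        current = current["children"][best]
    return current["word"]


def bag_of_words(tree, features):
    """Histogram of visual words for one image's feature vectors."""
    return Counter(quantize(tree, f) for f in features)


image_features = [(0.1, 0.2), (0.0, 1.9), (11.8, 10.1), (0.2, 0.1)]
histogram = bag_of_words(TREE, image_features)
```

Each lookup compares the feature against only k centers per level, which is what makes the per-query work logarithmic in the number of leaves.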
CS-SSE using Hierarchical Indexing • Secure Index • Content Owner • Encryption and Permutation • Private Search • Authorized Users • Oblivious Traversal
Analysis • Computational cost : O(m log_k n) • Optimal cost • Communication cost : O(m log_k n) • Round complexity : O(log_k n) • Vocabulary tree with one million leaf nodes : 6 • Adaptively semantically secure against a polynomial-time adversary • Honest But Curious adversary model (HBC) m: size of a node k: branching factor n: number of leaf nodes
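The figure of 6 for a tree with one million leaves follows from the depth of the tree if one round is spent per level. The slide does not state the branching factor; the arithmetic below assumes k = 10, which is the value that reproduces the figure.

```latex
% Assumption: branching factor k = 10 (not stated on the slide), one round per tree level.
\[
  \text{rounds} \;=\; \lceil \log_k n \rceil
  \;=\; \lceil \log_{10} 10^{6} \rceil \;=\; 6 ,
  \qquad
  \text{work and communication} \;=\; O(m \log_k n) \;=\; O(6m).
\]
```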
Comparison with CSL
Datasets          Dataset Size  Precision@10      Query Time (ms)   Communication (MB)
                                CS-SSE   CSL      CS-SSE   CSL      CS-SSE   CSL
Caltech256Easy10  100           37.6     2.9      0.055    0.013    3.80     0.50
Caltech256Easy10  200           44.5     3.1      0.055    0.049    3.84     2.79
Caltech256Easy10  300           52.07    3.66     0.055    0.058    3.87     4.20
Caltech256Easy10  400           57.48    8.05     0.055    0.090    3.90     5.60
Caltech256Var20   200           15.1     3.35     0.055    0.093    3.85     2.79
Caltech256Var20   400           17.65    3.45     0.055    0.080    3.91     3.33
Caltech256Var20   600           18.47    2.73     0.056    0.030    4.01     12.68
Caltech256Var20   800           20.58    3.88     0.056    0.660    4.09     16.91
Communication is independent of database size! 30% improvement in accuracy!
Comparison with PCBIR • Our algorithm is O(10^5) times faster than PCBIR! (Plot legend: PCBIR : +, CS-SSE : x; plot panels: Scene15, Caltech256 – Var20)
Conclusions • Addressed issues of privacy in the domain of Data Mining and Information Retrieval • Private Outlier Detection • Use LSH to achieve less than quadratic cost • Distributed and Private algorithms • Private Content based Encrypted Search • Use Hierarchical Indexing for efficient encrypted search • Private Content based Image Search
Related Publications • M Pillutla, N Raval, P Bansal, K Srinathan, C.V. Jawahar, LSH based outlier detection and its application in distributed setting, CIKM 2011. • N Raval, M Pillutla, P Bansal, K Srinathan, C.V. Jawahar, Privacy Preserving Outlier Detection using Locality Sensitive Hashing, ICDMW 2011. • N Raval, M Pillutla, P Bansal, K Srinathan, C.V. Jawahar, Efficient Content Similarity Search on Encrypted Data using Hierarchical Index Structures, TDP (Under Review)
LSH Example (figure) Courtesy: Fergus et al.
Cryptographic Primitives Secure Union Secure Sum
PPOD Example • Player A: a1 = (1,1), a2 = (1,3), a3 = (2,1), a4 = (2,3), a5 = (5,1), a6 = (4,5) • Player B: b1 = (3,1), b2 = (4,2), b3 = (5,2), b4 = (4,1) • Total Dataset Size N = 10 • Point Threshold PT = 0.8 (PT’ = (1 - PT) x N = 2) • Distance Threshold DT = 2 • Approximation Factor AF = 0 • LSH Radius R = DT/(1 + AF) = 2 • Local Probable Outliers of A = {a5, a6} • Global Probable Outliers of A = {a6} (Figures: Player A’s LSH Bin Structure, Player B’s LSH Bin Structure)
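The outlier sets in this example can be reproduced with a plain brute-force check, shown below as a sanity test rather than as the private protocol (no LSH, no secure computation). It assumes Euclidean distance, that a point within distance DT counts as a neighbor, and that a point is a probable outlier when it has fewer than PT' neighbors other than itself; these conventions are inferred from the numbers on the slide.

```python
A = {"a1": (1, 1), "a2": (1, 3), "a3": (2, 1), "a4": (2, 3), "a5": (5, 1), "a6": (4, 5)}
B = {"b1": (3, 1), "b2": (4, 2), "b3": (5, 2), "b4": (4, 1)}

N = len(A) + len(B)              # total dataset size, 10
PT = 0.8                         # point threshold
DT = 2                           # distance threshold
PT_PRIME = round((1 - PT) * N)   # 2; round() avoids floating-point error


def neighbor_count(name, point, pool):
    """Number of other points in pool within distance DT (squared distances, exact)."""
    return sum(
        1
        for other, q in pool.items()
        if other != name
        and (point[0] - q[0]) ** 2 + (point[1] - q[1]) ** 2 <= DT ** 2
    )


def probable_outliers(own, pool):
    """Points of `own` with fewer than PT_PRIME neighbors in `pool`."""
    return sorted(n for n, p in own.items() if neighbor_count(n, p, pool) < PT_PRIME)


local_outliers = probable_outliers(A, A)             # A looks only at its own data
global_outliers = probable_outliers(A, {**A, **B})   # A's points against all data
```

Point a5 has no neighbors inside A but four neighbors in B (b1 through b4), so it is pruned once B's data is taken into account; a6 has no neighbors in either dataset and remains an outlier.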
CS-SSE Example Vocabulary Tree Secure Index
Img2, 1 Img1, 1 Img1, 2 Index Construction Courtesy: Nister et al.
Query Courtesy: Nister et al.
PPOD Results • Bins << Data Points • Communication Cost increases with number of players • The rate of increase in communication cost is slow
CS-SSE Results Retrieval quality of CS-SSE 30% improvement in accuracy over previous methods! Search results on Ukbench dataset
Comparison with CSL CSL : + CS-SSE : x Communication is independent of database size! 30% improvement in accuracy! Caltech256 Scene15
Our Approach Outlier Detection Pruning Non Outliers Near Neighbor Queries LSH LSH is efficient for near neighbor queries!
LSH Hash Objects LSH Bin Structure