
Private Outlier Detection and Content based Encrypted Search

A thesis on privacy-preserving data mining and encrypted search. It covers private outlier detection using Locality Sensitive Hashing and private content-based search on encrypted data using hierarchical index structures, with approaches that improve accuracy and reduce cost in outlier detection and private image retrieval.


Presentation Transcript


  1. Thesis Defense: Private Outlier Detection and Content based Encrypted Search. Nisarg Raval, MS by Research, CSE. Advisors: Prof. C. V. Jawahar & Dr. Kannan Srinathan, IIIT – Hyderabad

  2. Need for Privacy

  3. Need for Privacy

  4. Privacy Preserving Data Mining (PPDM) • Sharing information leads to mutual gain • Patient records help medical research • Data confidentiality • Sensitive information • Privacy with Utility • Randomization • Cryptographic

  5. Privacy in Information Retrieval • Cloud based solutions • Storage and Processing • Loss of control over data • Private database • Encrypted Search • Public database • Private Information Retrieval

  6. Contributions • Distributed Outlier Detection using Locality Sensitive Hashing • Privacy Preserving Outlier Detection using Locality Sensitive Hashing • Private Content Based Search on Encrypted Data using Hierarchical Index Structures • Private Content Based Image Retrieval

  7. Private Outlier Detection

  8. Motivation Trusted Third Party (TTP)

  9. Motivation: Can we avoid the Trusted Third Party (TTP)?

  10. Motivation Simulate Trusted Third Party

  11. Previous Results • Vaidya et al. ICDM 2004 • Secure Distance and Secure Comparison Protocol • Zhou et al. EBISS 2009 • Homomorphic Encryption and Randomization • Quadratic cost. How do we reduce the quadratic cost?

  12. Approximation • No crisp definition of Outliers • Approximation is as good as exact results • Reducing Quadratic cost by approximation Trade off between Accuracy and Cost

  13. Outlier Detection • Distance based outlier [Knorr et al. VLDB 1998] • An object is an outlier if a very large fraction of the total objects lie outside the specified radius. (Figure labels: Outlier, Non Neighbors, Neighbors)

  14. Our Approach • Converse of the definition • An object is a non-outlier if it has enough neighbors within the specified radius. Easy to find a small number of neighbors! (Figure labels: Non-Outlier, Neighbors, Non Neighbors)

  15. Locality Sensitive Hashing (LSH) • Property • Condition • Hash Family Similar objects are hashed to the same bin
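The slide's formulas for the LSH property, condition, and hash family did not survive in the transcript. As an illustration only, the sketch below uses the standard p-stable (Gaussian projection) family for Euclidean distance, h(v) = floor((a·v + b) / w), which may or may not be the family used in the thesis. The names `make_hash`, `collisions`, and the bucket width of 4.0 are hypothetical choices made for this example.

```python
import math
import random


def make_hash(dim, width, rng):
    """Build one hash function h(v) = floor((a . v + b) / width).

    Assumption: the standard Gaussian-projection family for Euclidean
    distance; the transcript does not say which family the thesis uses.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, width)                    # random offset in [0, width]

    def h(point):
        projection = sum(ai * xi for ai, xi in zip(a, point))
        return math.floor((projection + b) / width)

    return h


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hashes = [make_hash(dim=2, width=4.0, rng=rng) for _ in range(200)]


def collisions(p, q):
    """Count how many of the hash functions put p and q in the same bin."""
    return sum(h(p) == h(q) for h in hashes)


near = collisions((1.0, 1.0), (1.1, 1.0))    # points 0.1 apart
far = collisions((1.0, 1.0), (101.0, 1.0))   # points 100 apart
print(near, far)
```

Nearby points collide under most of the hash functions while distant points collide under very few, which is the behavior the slide summarizes as "similar objects are hashed to the same bin".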

  16. Centralized Outlier Detection • Phase I (LSH): Compute Parameters, Generate Bin Structure • Phase II (Pruning): Find Near Neighbors, Prune Non Outliers. Reference: Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar; LSH Based Outlier Detection and Its Application in Distributed Setting; CIKM 2011

  17. Distributed Setting • Horizontally partitioned data • Each player has the same attributes for a subset of the total objects A B

  18. Distributed Outlier Detection (Player A, Player B) • Global LSH parameters • Local Pruning • Centralized Algorithm • Local Outliers • May include non-outliers. What about the non-outliers which have neighbours in the other player's dataset?

  19. Distributed System Overview: Global Parameters, Local Pruning, Global Pruning, Exhaustive Pruning

  20. Private Outlier Detection • Three Phases • Global parameter computation • Local pruning • Secure global pruning • Secure Exhaustive Pruning • Secure distance computation • High cost • Minimal increase in False Positives Trade off between Accuracy and Cost

  21. Private System Overview • One player computes LSH parameters and publishes them • Local Pruning • Construct global LSH bin structure using secure union and secure sum protocols • Secure Exhaustive Pruning is costly (Diagram labels: Secure Sum, Secure Union)
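The transcript does not say which secure sum protocol is used. As a rough sketch, the textbook ring-based protocol works as follows: the first player hides its value behind a random mask, each later player adds its own value to the running total it receives, and the first player removes the mask at the end. The function name `secure_sum` and the modulus are hypothetical, and all players are simulated in one process here, so this shows only the arithmetic, not the network messages or the security argument.

```python
import secrets


def secure_sum(private_values, modulus=2**32):
    """Simulate a ring-based secure sum (an assumed, textbook variant).

    Each intermediate total looks uniformly random to the player that
    receives it, because it is offset by the first player's secret mask.
    Assumes the true sum is smaller than `modulus`.
    """
    mask = secrets.randbelow(modulus)               # known only to player 1
    running = (mask + private_values[0]) % modulus  # player 1 starts the ring
    for value in private_values[1:]:                # each later player adds its value
        running = (running + value) % modulus
    return (running - mask) % modulus               # player 1 removes the mask


# Example: three players each hold the count of points in one LSH bin.
print(secure_sum([3, 5, 7]))  # → 15
```

In the setting of this slide, such a sum could combine per-bin point counts across players without revealing any single player's count.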

  22. Analysis • Computational Cost: • L is sub-linear in n • Communication Cost: • Nb << n • Independent of dimensionality • Round Complexity : constant • Security of the algorithm depends on the security of the secure union and secure sum protocols • Honest But Curious Model (HBC) n : Number of objects d : Number of dimensions L : Number of hash functions Nb : Number of bins

  23. Experimental Results DistributedOD : Distributed Outlier Detection PrivateOD : Private Outlier Detection DA : Processed Points PT : Point Threshold NN : Near Neighbors Only 0.02% increase in False Positives of PrivateOD False Positives can be considered as borderline outliers!

  24. Comparison • Less than Quadratic • Superior to the best previously known results. Up to 10000 times less communication on datasets of size 10^6! (Datasets: Corel, Landsat, HouseHold, Darpa)

  25. Private Content Based Search

  26. Motivation: Content Based Image Retrieval (Google Goggles)

  27. Motivation Query Image Potential Privacy Breach!

  28. Existing Solutions • Download the entire database and search at client side • Trivial but Impractical • Kuzu et al. ICDE 2012 • Multiple rounds of Similarity SSE (CSL) • Low accuracy and High Cost • Shashank et al. CVPR 2008 • Private Content Based Image Retrieval (PCBIR) • Complexity linear in database size. Single server PIR has to access every element in the database!

  29. Our Approach • Two server solution • Content Owner and Database Server • Hierarchical Index Structures • Client and Server jointly perform secure search • Improve Accuracy • Bag of Words • Reduce Complexity • Multi-round Protocol

  30. Vocabulary Tree • Bag of Visual Words • Visual Words = Vector quantization of feature vectors
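A minimal sketch of how a vocabulary tree quantizes feature vectors into visual words: starting at the root, descend to the child whose centroid is nearest until a leaf is reached, then count leaf hits to form the bag of words. This is the plain, non-private traversal only; the toy tree, the `Node` class, and the function names are hypothetical and not taken from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    centroid: tuple
    children: list = field(default_factory=list)
    word_id: int = -1  # meaningful only on leaf nodes


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(root, feature):
    """Descend from the root to a leaf, always taking the nearest child."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: sq_dist(c.centroid, feature))
    return node.word_id


def bag_of_words(root, features):
    """Histogram of visual-word ids for a set of feature vectors."""
    counts = {}
    for f in features:
        word = quantize(root, f)
        counts[word] = counts.get(word, 0) + 1
    return counts


# Toy tree with branching factor 2 and four leaves (hypothetical centroids).
left = Node((0, 0), [Node((0, 0), word_id=0), Node((0, 4), word_id=1)])
right = Node((10, 10), [Node((10, 10), word_id=2), Node((10, 14), word_id=3)])
root = Node((5, 5), [left, right])

print(bag_of_words(root, [(1, 1), (0, 5), (0, 3)]))  # → {0: 1, 1: 2}
```

Each lookup touches one node per level, which is why the cost of a search grows with the depth of the tree rather than with the number of leaves.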

  31. CS-SSE using Hierarchical Indexing • Secure Index • Content Owner • Encryption and Permutation • Private Search • Authorized Users • Oblivious Traversal

  32. Secure Index Construction

  33. Private Searching on Encrypted Data

  34. Analysis • Computational cost : O(m log_k n) • Optimal cost • Communication cost : O(m log_k n) • Round complexity : O(log_k n) • 6 rounds for a vocabulary tree with one million leaf nodes • Adaptively semantically secure against a polynomial-time adversary • Honest But Curious adversary model (HBC) m : size of a node k : branching factor n : number of leaf nodes
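The figure of 6 for a tree with one million leaves can be checked with a few lines of arithmetic. The sketch below assumes a branching factor of k = 10, which the slide does not state but which is the value consistent with log_k n = 6 for n = 10^6; `tree_depth` is a hypothetical helper name.

```python
def tree_depth(n_leaves, k):
    """Smallest depth d such that a full k-ary tree has at least n_leaves leaves."""
    depth, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= k  # each extra level multiplies the leaf count by k
        depth += 1
    return depth


# Assumed branching factor k = 10 (not given on the slide).
print(tree_depth(10**6, 10))  # → 6
```

Integer multiplication is used instead of a floating-point logarithm so the result is exact at powers of k.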

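  35. Datasets. Comparison with CSL. Communication is independent of database size! 30% improvement in accuracy!

      | Dataset | Size | Precision@10 (CS-SSE) | Precision@10 (CSL) | Query Time in ms (CS-SSE) | Query Time in ms (CSL) | Communication in MB (CS-SSE) | Communication in MB (CSL) |
      |---|---|---|---|---|---|---|---|
      | Caltech256Easy10 | 100 | 37.6 | 2.9 | 0.055 | 0.013 | 3.80 | 0.50 |
      | Caltech256Easy10 | 200 | 44.5 | 3.1 | 0.055 | 0.049 | 3.84 | 2.79 |
      | Caltech256Easy10 | 300 | 52.07 | 3.66 | 0.055 | 0.058 | 3.87 | 4.20 |
      | Caltech256Easy10 | 400 | 57.48 | 8.05 | 0.055 | 0.090 | 3.90 | 5.60 |
      | Caltech256Var20 | 200 | 15.1 | 3.35 | 0.055 | 0.093 | 3.85 | 2.79 |
      | Caltech256Var20 | 400 | 17.65 | 3.45 | 0.055 | 0.080 | 3.91 | 3.33 |
      | Caltech256Var20 | 600 | 18.47 | 2.73 | 0.056 | 0.030 | 4.01 | 12.68 |
      | Caltech256Var20 | 800 | 20.58 | 3.88 | 0.056 | 0.660 | 4.09 | 16.91 |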

  36. Comparison with PCBIR (plot legend: PCBIR : +, CS-SSE : x; datasets: Scene15, Caltech256 – Var20). Our algorithm is O(10^5) times faster than PCBIR!

  37. Conclusions • Addressed issues of privacy in the domain of Data Mining and Information Retrieval • Private Outlier Detection • Use LSH to achieve less than quadratic cost • Distributed and Private algorithms • Private Content based Encrypted Search • Use Hierarchical Indexing for efficient encrypted search • Private Content based Image Search

  38. Related Publications • M Pillutla, N Raval, P Bansal, K Srinathan, C.V. Jawahar, LSH based outlier detection and its application in distributed setting, CIKM 2011. • N Raval, M Pillutla, P Bansal, K Srinathan, C.V. Jawahar, Privacy Preserving Outlier Detection using Locality Sensitive Hashing, ICDMW 2011. • N Raval, M Pillutla, P Bansal, K Srinathan, C.V. Jawahar, Efficient Content Similarity Search on Encrypted Data using Hierarchical Index Structures, TDP (Under Review)

  39. nisarg.raval@research.iiit.ac.in

  40. LSH Example (figure). Courtesy: Fergus et al.

  41. Cryptographic Primitives Secure Union Secure Sum

  42. PPOD Example
      Player A : a1 = (1,1), a2 = (1,3), a3 = (2,1), a4 = (2,3), a5 = (5,1), a6 = (4,5)
      Player B : b1 = (3,1), b2 = (4,2), b3 = (5,2), b4 = (4,1)
      Total Dataset Size N = 10
      Point Threshold PT = 0.8 (PT’ = (1 – PT) x N = 2)
      Distance Threshold DT = 2
      Approximation Factor AF = 0
      LSH Radius R = DT/(1 + AF) = 2
      Local Probable Outliers of A = {a5, a6}
      Global Probable Outliers of A = {a6}
      (Figures: Player A’s LSH Bin Structure, Player B’s LSH Bin Structure)
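The numbers in this example can be checked with an exact brute-force count, sketched below. This is not the LSH-based or private protocol from the thesis; it only applies the distance-based definition directly to the points listed on the slide. It assumes that a point is not its own neighbor, that a distance of exactly DT counts as "within the radius", and that having at least PT’ neighbors makes a point a non-outlier. The function name `probable_outliers` is hypothetical.

```python
A = {"a1": (1, 1), "a2": (1, 3), "a3": (2, 1), "a4": (2, 3), "a5": (5, 1), "a6": (4, 5)}
B = {"b1": (3, 1), "b2": (4, 2), "b3": (5, 2), "b4": (4, 1)}

N, PT, DT = 10, 0.8, 2
MIN_NEIGHBORS = round((1 - PT) * N)  # PT' = 2; round() guards against float error


def probable_outliers(candidates, dataset):
    """Return candidates with fewer than MIN_NEIGHBORS points within DT.

    Assumptions: a point is not its own neighbor, and distance == DT
    counts as inside the radius. Squared integer distances avoid sqrt.
    """
    result = []
    for name, p in candidates.items():
        neighbors = sum(
            1
            for other, q in dataset.items()
            if other != name
            and (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= DT ** 2
        )
        if neighbors < MIN_NEIGHBORS:
            result.append(name)
    return result


local_outliers = probable_outliers(A, A)             # A looks only at its own data
global_outliers = probable_outliers(A, {**A, **B})   # A's points against all data

print(local_outliers)   # → ['a5', 'a6']
print(global_outliers)  # → ['a6']
```

Point a5 has no neighbors in A's own data but is within distance 2 of all four of B's points, which is why it drops out once B's data is taken into account, while a6 stays isolated.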

  43. CS-SSE Example Vocabulary Tree Secure Index

  44. Index Construction (figure labels: Img2, 1; Img1, 1; Img1, 2). Courtesy: Nister et al.

  45. Query Courtesy: Nister et al.

  46. PPOD Results • Bins << Data Points • Communication Cost increases with number of players • The rate of increase in communication cost is slow

  47. CS-SSE Results • Retrieval quality of CS-SSE • 30% improvement in accuracy over previous methods! (Figure: search results on the Ukbench dataset)

  48. Comparison with CSL (plot legend: CSL : +, CS-SSE : x; datasets: Caltech256, Scene15). Communication is independent of database size! 30% improvement in accuracy!

  49. Our Approach Outlier Detection Pruning Non Outliers Near Neighbor Queries LSH LSH is efficient for near neighbor queries!

  50. LSH Hash Objects LSH Bin Structure
