
Similarity Search in High Dimensions via Hashing






Presentation Transcript


  1. Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk, Rajeev Motwani Presented by: Fatih Uzun

  2. Outline • Introduction • Problem Description • Key Idea • Experiments and Results • Conclusions

  3. Introduction • Similarity Search over High-Dimensional Data • Image databases, document collections, etc. • Curse of Dimensionality • All space-partitioning techniques degrade to linear search in high dimensions • Exact vs. Approximate Answer • An approximate answer may be good enough and much faster • Time-quality trade-off

  4. Problem Description • ε-Nearest Neighbor Search (ε-NNS) • Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that • dist(q,p) ≤ (1 + ε) · min r ∈ P dist(q,r) • Generalizes to K-nearest neighbor search (K > 1)

  5. Problem Description

  6. Key Idea • Locality-Sensitive Hashing (LSH) to get sub-linear dependence on data size for high-dimensional data • Preprocessing: • Hash the data points using several LSH functions so that the probability of collision is higher for closer objects

  7. Algorithm: Preprocessing • Input • Set of N points {p1, …, pN} • L (number of hash tables) • Output • Hash tables Ti, i = 1, 2, …, L • Foreach i = 1, 2, …, L • Initialize Ti with a random hash function gi(·) • Foreach i = 1, 2, …, L • Foreach j = 1, 2, …, N • Store point pj in bucket gi(pj) of hash table Ti
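The preprocessing slide above can be sketched in Python. This is an illustrative sketch, not the paper's exact implementation: the paper embeds points into Hamming space and uses bit sampling there; here we assume the input points are already short binary vectors and each gi samples k coordinates.

```python
import random
from collections import defaultdict

def make_hash_fn(d, k, rng):
    """g(.) projects a point onto k randomly chosen coordinates (bit sampling)."""
    coords = rng.sample(range(d), k)
    return lambda p: tuple(p[c] for c in coords)

def preprocess(points, L, k, seed=0):
    """Build L hash tables; store each point p_j in bucket g_i(p_j) of table T_i."""
    rng = random.Random(seed)
    d = len(points[0])
    fns = [make_hash_fn(d, k, rng) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for j, p in enumerate(points):
        for g, T in zip(fns, tables):
            T[g(p)].append(j)  # bucket holds indices of points hashing to g(p)
    return fns, tables
```

Each of the L tables ends up holding every point exactly once, spread across buckets keyed by the sampled coordinates.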

  8. LSH - Algorithm (diagram: each point pi is hashed by g1(pi), g2(pi), …, gL(pi) into tables T1, T2, …, TL)

  9. Algorithm: ε-NNS Query • Input • Query point q • K (number of approx. nearest neighbors) • Access • Hash tables Ti, i = 1, 2, …, L • Output • Set S of K (or fewer) approx. nearest neighbors • S ← ∅ • Foreach i = 1, 2, …, L • S ← S ∪ {points found in bucket gi(q) of hash table Ti}
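A self-contained sketch of the query step, under the same assumption as before (binary vectors, coordinate-sampling hashes rather than the paper's full Hamming-space embedding); `build` mirrors the preprocessing slide so the example runs on its own:

```python
import random
from collections import defaultdict

def build(points, L, k, seed=0):
    """L tables, each keyed by k randomly sampled coordinates."""
    rng = random.Random(seed)
    d = len(points[0])
    fns = [tuple(rng.sample(range(d), k)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for j, p in enumerate(points):
        for coords, T in zip(fns, tables):
            T[tuple(p[c] for c in coords)].append(j)
    return fns, tables

def query(q, points, fns, tables, K):
    """Union the buckets g_i(q) over all L tables, rank candidates, return up to K."""
    cand = set()
    for coords, T in zip(fns, tables):
        cand.update(T.get(tuple(q[c] for c in coords), []))
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
    return sorted(cand, key=lambda j: hamming(points[j], q))[:K]
```

Note the "K (or fewer)" caveat from the slide: if too few points collide with q across all L tables, the candidate set is simply smaller than K.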

  10. LSH - Analysis • Family H of (r1, r2, p1, p2)-sensitive functions {hi(·)}: • dist(p,q) < r1 ⇒ ProbH[h(q) = h(p)] ≥ p1 • dist(p,q) ≥ r2 ⇒ ProbH[h(q) = h(p)] ≤ p2 • p1 > p2 and r1 < r2 • LSH functions: gi(·) = (h1(·), …, hk(·)) • For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem can be solved • Query Time: O(d · n^(1/(1+ε))) • d: dimensions, n: data size
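The amplification step on this slide can be checked numerically for bit-sampling hashes on binary vectors (an assumption matching the paper's Hamming-space setting): one sampled coordinate collides with probability 1 − dist(p,q)/d, so concatenating k independent samples into g gives collision probability (1 − dist/d)^k, widening the gap between p1^k and p2^k.

```python
import random

def empirical_collision(p, q, k, trials=20000, seed=1):
    """Estimate Pr[g(p) = g(q)] for g built from k coordinates sampled with replacement."""
    rng = random.Random(seed)
    d = len(p)
    hits = 0
    for _ in range(trials):
        coords = [rng.randrange(d) for _ in range(k)]
        hits += all(p[c] == q[c] for c in coords)  # g collides iff every sampled bit agrees
    return hits / trials
```

For example, if p and q differ in 2 of 16 coordinates and k = 4, the predicted collision rate is (14/16)^4 ≈ 0.586, and the estimate should land close to it.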

  11. Experiments • Data Sets • Color images from COREL Draw library (20,000 points, dimensions up to 64) • Texture information of aerial photographs (270,000 points, dimension 60) • Evaluation • Speed, Miss Ratio, Error (%) for various data sizes, dimensions, and K values • Compare performance with SR-Tree (a spatial data structure)

  12. Performance Measures • Speed • Number of disk block accesses needed to answer the query (≈ number of hash tables) • Miss Ratio • Fraction of queries for which fewer than K points are found in K-NNS • Error • Average fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, taken over the entire set of queries
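The error measure can be written out as a one-liner. This follows one plausible reading of the slide's wording, averaging the fractional error (d_LSH − d*)/d* over all queries; the names `lsh_dists` and `true_dists` are illustrative, not from the source.

```python
def average_error(lsh_dists, true_dists):
    """Mean fractional error of LSH answers relative to true nearest-neighbor distances."""
    assert len(lsh_dists) == len(true_dists)
    # (d_LSH - d*) / d* per query, averaged over the query set
    return sum((dl - dt) / dt for dl, dt in zip(lsh_dists, true_dists)) / len(true_dists)
```

An exact answer on every query gives an error of 0; a 10% overshoot on one of two queries gives 0.05.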

  13. Speed vs. Data Size

  14. Speed vs. Dimension

  15. Speed vs. Nearest Neighbors

  16. Speed vs. Error

  17. Miss Ratio vs. Data Size

  18. Conclusion • Better query time than spatial data structures • Scales well to higher dimensions and larger data sizes (sub-linear dependence) • Predictable running time • Extra storage overhead • Inefficient for data with distances concentrated around the average

  19. Future Work • Investigate hybrid data structures obtained by merging tree-based and hash-based structures • Make use of the structure of the data set to systematically obtain LSH functions • Explore other applications of LSH-type techniques to data mining

  20. Questions?
