Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk, Rajeev Motwani Presented by: Fatih Uzun
Outline • Introduction • Problem Description • Key Idea • Experiments and Results • Conclusions
Introduction • Similarity Search over High-Dimensional Data • Image databases, document collections, etc. • Curse of Dimensionality • All space-partitioning techniques degrade to linear search in high dimensions • Exact vs. Approximate Answer • An approximate answer may be good enough and much faster • Time-quality trade-off
Problem Description • ε-Nearest Neighbor Search (ε-NNS) • Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that • dist(q,p) ≤ (1 + ε) · min r∈P dist(q,r) • Generalizes to K-nearest neighbor search (K > 1)
Key Idea • Locality-Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data • Preprocessing: • Hash the data points using several LSH functions so that the probability of collision is higher for closer objects
Algorithm: Preprocessing • Input • Set of n points {p1, ..., pn} • L (number of hash tables) • Output • Hash tables Ti, i = 1, 2, ..., L • Foreach i = 1, 2, ..., L • Initialize Ti with a random hash function gi(.) • Foreach i = 1, 2, ..., L • Foreach j = 1, 2, ..., n • Store point pj in bucket gi(pj) of hash table Ti
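A minimal Python sketch of this preprocessing step, assuming the points have already been embedded into the Hamming cube {0,1}^d and that each gi(.) concatenates k randomly sampled coordinates (the bit-sampling family used in the paper); the function and parameter names (build_tables, seed) are illustrative, not from the paper:

    import random
    from collections import defaultdict

    def build_tables(points, L, k, d, seed=0):
        rng = random.Random(seed)
        # Each gi(.) projects onto k of the d coordinates, chosen at random
        # (bit sampling for Hamming space).
        projections = [rng.sample(range(d), k) for _ in range(L)]
        tables = [defaultdict(list) for _ in range(L)]
        for j, p in enumerate(points):
            for i in range(L):
                key = tuple(p[c] for c in projections[i])
                tables[i][key].append(j)  # store pj in bucket gi(pj) of Ti
        return projections, tables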
LSH - Algorithm • [Figure: a point pi from P is hashed by g1(pi), g2(pi), ..., gL(pi) into its bucket in each of the hash tables T1, T2, ..., TL]
Algorithm: ε-NNS Query • Input • Query point q • K (number of approximate nearest neighbors) • Access • Hash tables Ti, i = 1, 2, ..., L • Output • Set S of K (or fewer) approximate nearest neighbors • S ← ∅ • Foreach i = 1, 2, ..., L • S ← S ∪ {points found in bucket gi(q) of hash table Ti}
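Continuing the earlier sketch, a hedged Python version of the query step; lsh_query and the dist callback are illustrative names, and ranking the union of the L buckets by true distance to return the top K is one straightforward way to realize the pseudocode above:

    def lsh_query(q, K, projections, tables, points, dist):
        candidates = set()  # S <- empty set
        for proj, table in zip(projections, tables):
            key = tuple(q[c] for c in proj)
            candidates.update(table.get(key, ()))  # S <- S U bucket gi(q)
        # Rank candidates by true distance; may return fewer than K points
        return sorted(candidates, key=lambda j: dist(q, points[j]))[:K]

For Hamming space, dist could simply be dist = lambda a, b: sum(x != y for x, y in zip(a, b)).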
LSH - Analysis • A family H of (r1, r2, p1, p2)-sensitive functions {hi(.)}: • dist(p,q) ≤ r1 ⇒ ProbH[h(q) = h(p)] ≥ p1 • dist(p,q) > r2 ⇒ ProbH[h(q) = h(p)] ≤ p2 • where p1 > p2 and r1 < r2 • LSH functions: gi(.) = (h1(.), ..., hk(.)) • For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem, can be solved • Query time: O(d · n^(1/(1+ε))) • d: dimensions, n: data size
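To make the sensitivity parameters concrete with the bit-sampling family the paper uses for Hamming space: taking h(p) = pi for a uniformly random coordinate i gives ProbH[h(q) = h(p)] = 1 − dist(p,q)/d, so the family is (r1, r2, 1 − r1/d, 1 − r2/d)-sensitive for any r1 < r2. With illustrative numbers (not from the paper), d = 100, r1 = 5, r2 = 20 give p1 = 0.95 and p2 = 0.80; concatenating k = 10 such bits into g(.) amplifies the gap, since the collision probabilities become p1^k ≈ 0.60 versus p2^k ≈ 0.11.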
Experiments • Data Sets • Color images from the COREL Draw library (20,000 points, dimensions up to 64) • Texture information from aerial photographs (270,000 points, dimension 60) • Evaluation • Speed, miss ratio, and error (%) for various data sizes, dimensions, and K values • Performance compared with the SR-Tree (a spatial data structure)
Performance Measures • Speed • Number of disk block accesses needed to answer the query (at most one per hash table, i.e., L) • Miss Ratio • Fraction of queries for which fewer than K points are found for K-NNS • Error • Fractional error in the distance to the point found by LSH, relative to the true nearest neighbor distance, averaged over the entire set of queries
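Written out (one natural formalization of the slide's wording, not a formula quoted from the paper): Error = (1/|Q|) · Σ q∈Q (dLSH(q) − d*(q)) / d*(q), where Q is the query set, dLSH(q) is the distance to the point LSH returns, and d*(q) is the exact nearest neighbor distance.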
Conclusion • Better query time than spatial data structures • Scales well to higher dimensions and larger data sizes (sub-linear dependence) • Predictable running time • Extra storage overhead • Inefficient for data whose pairwise distances are concentrated around the average
Future Work • Investigate hybrid data structures obtained by merging tree-based and hash-based structures • Use the structure of the data set to systematically derive LSH functions • Explore other applications of LSH-type techniques in data mining