Nearest Neighbor Paul Hsiung March 16, 2004
Quick Review of NN • Set of points P • Query point q • Distance metric d • Find p in P such that d(p,q) ≤ d(p’,q) for all p’ in P
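A minimal brute-force sketch of this definition (Python; the helper name and sample points are illustrative, not from the talk):

```python
import math

def nearest_neighbor(P, q, d=math.dist):
    """Linear scan: return the p in P minimizing d(p, q)."""
    return min(P, key=lambda p: d(p, q))

# Example usage:
P = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
print(nearest_neighbor(P, (2.5, 1.5)))  # -> (3.0, 1.0)
```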
NN Used In… • Image databases [Pentland et al] • Color indexing [Swain et al] • Recognizing 3D objects [Murase et al] • Shapes [Mori et al] • Drug testing • DNA sequence matching [Buhler]
Tree-based Approaches • Quadtrees • Split in the middle along all dimensions • Split until zero or one point remains • Kd-trees • Split in one dimension • Pick the split point wisely • Ball-trees • Pick two pivots and split • SR-trees • We have rectangles and spheres, so why not combine them? (A minimal kd-tree sketch follows.)
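As a concrete instance of the tree family, here is a minimal kd-tree sketch, written from the slide's description rather than taken from the talk; it splits on one dimension per level and picks the median as the pivot:

```python
import math

def build_kdtree(points, depth=0):
    """Split on one dimension per level, pivoting at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # "pick the middle wisely": median split
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nn_search(node, q, depth=0, best=None):
    """Descend toward q, backtracking only when the splitting plane is
    closer than the best distance found so far."""
    if node is None:
        return best
    if best is None or math.dist(q, node["point"]) < math.dist(q, best):
        best = node["point"]
    axis = depth % len(q)
    near, far = (("left", "right") if q[axis] < node["point"][axis]
                 else ("right", "left"))
    best = nn_search(node[near], q, depth + 1, best)
    if abs(q[axis] - node["point"][axis]) < math.dist(q, best):
        best = nn_search(node[far], q, depth + 1, best)  # plane too close: check far side
    return best
```

The backtracking test in the last `if` is exactly what degrades in high dimensions: the splitting plane is almost always closer than the current best, so both subtrees end up being visited.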
Indyk’s Gripe • Beyond 10 or 20 dimensions, tree-based structures end up examining a large fraction of the points • No better than a brute-force linear search… • So he came up with a hash-table approach: Locality-Sensitive Hashing (LSH) • The rest of this talk is on his paper
Interlude: Near Neighbor • Set of points P • Query point q • Distance metric d • Find p in P such that d(p,q) ≤ (1+ε)d(P,q), where d(P,q) is the distance from q to its closest point in P
Hash • Pick a subset I of random coordinates • The hash function h(p) returns a bucket ID: h(p) = projection of p onto I
Intuition • If two points are close, they hash to the same bucket with some probability p1 • If they are far, they hash to the same bucket with a smaller probability p2 < p1
Indyk’s Hash • Convert the coordinates of p to {0,1}^d • Use Hamming distance: d(p,q) = # of positions on which p and q differ • Example: • p=(0,1,0,1,1,1,0,0,1,0) • I={2,5,7} • Then h(p)=(1,1,0) • Demo: • http://web.mit.edu/ardonite/6.838/locality-hashing.htm
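A minimal sketch of this projection hash (Python; 1-based coordinate indices to match the slide's example):

```python
def lsh_hash(p, I):
    """Project the binary vector p onto the coordinate subset I (1-based)."""
    return tuple(p[i - 1] for i in I)

p = (0, 1, 0, 1, 1, 1, 0, 0, 1, 0)
I = (2, 5, 7)
print(lsh_hash(p, I))  # -> (1, 1, 0), used as the bucket ID
```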
Why Locality-sensitive? • Pr[h(p)=h(q)] = (1 − d(p,q)/D)^k • D is the number of dimensions in the binary representation • k is the size of I • We can vary the probability by changing k • (Figure: collision probability Pr vs. distance, plotted for k=1 and k=2)
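The formula holds because each of the k coordinates in I misses a differing position with probability 1 − d(p,q)/D (treating the k draws as independent). A quick numeric check, with D=10 assumed purely for illustration:

```python
def collision_prob(dist, D, k):
    """Pr[h(p) = h(q)] for k independently drawn coordinates out of D."""
    return (1 - dist / D) ** k

# Larger k drops the probability faster, separating near from far points:
for k in (1, 2):
    print(k, [round(collision_prob(d, 10, k), 2) for d in range(0, 11, 2)])
# 1 [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
# 2 [1.0, 0.64, 0.36, 0.16, 0.04, 0.0]
```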
Now to Use It (Training) • Generate l hash functions: h1, …, hl • Store each point p in the bucket hi(p) of the i-th hash table, for i = 1, …, l
Now to Use It (Query) • Retrieve all the points that belong to the buckets h1(q), …, hl(q) • Return the retrieved point that is closest to q • This “solves” the Near Neighbor problem
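Combining the training and query slides, a minimal LSH index sketch (Python; the class name and the parameter labels D, k, l, seed are my own, not from the talk):

```python
import random
from collections import defaultdict

class LSHIndex:
    """l hash tables, each keyed by a random projection onto k of D coordinates."""
    def __init__(self, D, k, l, seed=0):
        rng = random.Random(seed)
        self.subsets = [rng.sample(range(D), k) for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, p, i):
        return tuple(p[j] for j in self.subsets[i])

    def insert(self, p):
        # Training: store p in bucket h_i(p) of the i-th table, i = 1..l.
        for i, table in enumerate(self.tables):
            table[self._key(p, i)].append(p)

    def query(self, q):
        # Retrieve every point in buckets h_1(q)..h_l(q); return the closest.
        candidates = {p for i, t in enumerate(self.tables)
                      for p in t.get(self._key(q, i), [])}
        hamming = lambda p: sum(a != b for a, b in zip(p, q))
        return min(candidates, key=hamming, default=None)

index = LSHIndex(D=8, k=3, l=4)
index.insert((0, 1, 0, 1, 1, 1, 0, 0))
index.insert((1, 1, 1, 1, 0, 0, 0, 0))
print(index.query((0, 1, 0, 1, 1, 0, 0, 0)))  # very likely the first point
```

Note that query can return None when no bucket collides; that is exactly the "missed" case measured on the next slide.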
Indyk’s Results • Compared with another tree-based algorithm • Color histogram dataset from Corel Draw • 20,000 images, 64 dimensions • Used 1k, 2k, 5k, 10k, and 19k points for training • 1k points were used for queries • Computed the miss ratio – the fraction of queries that return no hits
Ugly Side • Works best with Hamming distance • Can be extended to the L1 and L2 norms • Requires parameter tweaking (the size k of I and the number l of hash tables) • Does not work well on uniform data
Bibliography
• A. Gionis, P. Indyk, R. Motwani. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th VLDB Conference, 1999
• J. Buhler. Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. In Bioinformatics, 17(5), 419-428, 2001
• H. Murase, S.K. Nayar. Visual Learning and Recognition of 3D Objects from Appearance. In IJCV, Vol. 14, No. 1, 5-24, 1995
• A. Pentland, R.W. Picard, S. Sclaroff. Photobook: Tools for Content-Based Manipulation of Image Databases. In SPIE Vol. 2185, 34-47, 1994
• M.J. Swain, D.H. Ballard. Color Indexing. In IJCV, Vol. 7, No. 1, 11-32, 1991
• G. Mori, S. Belongie, J. Malik. Shape Contexts Enable Efficient Retrieval of Similar Shapes. In CVPR, Vol. 1, 723-730, 2001
• Slides: “Algorithms for Nearest Neighbor Search” by Piotr Indyk
• Slides: “Approximate Nearest Neighbor in High Dimensions via Hashing” by Aris Gionis, Piotr Indyk, and Rajeev Motwani