
Algorithms for Nearest Neighbor Search



  1. Algorithms for Nearest Neighbor Search Piotr Indyk MIT

  2. Nearest Neighbor Search • Given: a set P of n points in R^d • Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P [Figure: query point q and its nearest neighbor p]

  3. Outline of this talk • Variants • Motivation • Main memory algorithms: • quadtrees • kd-trees • Locality Sensitive Hashing • Secondary storage algorithms: • R-tree (and its variants) • VA-file

  4. Variants of nearest neighbor • Near neighbor (range search): find one/all points in P within distance r from q • Spatial join: given two sets P, Q, find all pairs p in P, q in Q, such that p is within distance r from q • Approximate near neighbor: find one/all points p’ in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

  5. Motivation Depends on the value of d: • low d: graphics, vision, GIS, etc • high d: • similarity search in databases (text, images etc) • finding pairs of similar objects (e.g., copyright violation detection) • useful subroutine for clustering

  6. Algorithms • Main memory (Computational Geometry) • linear scan • tree-based: • quadtree • kd-tree • hashing-based: Locality-Sensitive Hashing • Secondary storage (Databases) • R-tree (and numerous variants) • Vector Approximation File (VA-file)

  7. Quadtree • Simplest spatial structure on Earth!

  8. Quadtree ctd. • Split the space into 2^d equal subsquares • Repeat until done: • only one pixel left • only one point left • only a few points left • Variants: • split only one dimension at a time • k-d-trees (in a moment)
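
  A minimal construction sketch in Python (the Node class, the max_points threshold, and the restriction to 2-D are illustrative choices, not part of the slides; a d-dimensional version would create 2^d children):

    class Node:
        def __init__(self, center, half, points, max_points=4):
            self.center, self.half = center, half   # square: center +- half per axis
            self.children = []
            if len(points) <= max_points or half < 1e-9:
                self.points = points                # leaf: only a few points left
                return
            self.points = []
            # split the square into 4 equal subsquares; each point goes to one child
            quads = {(sx, sy): [] for sx in (-1, 1) for sy in (-1, 1)}
            for p in points:
                sx = 1 if p[0] >= center[0] else -1
                sy = 1 if p[1] >= center[1] else -1
                quads[(sx, sy)].append(p)
            for (sx, sy), sub in quads.items():
                c = (center[0] + sx * half / 2, center[1] + sy * half / 2)
                self.children.append(Node(c, half / 2, sub, max_points))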

  9. Range search • Near neighbor (range search): • put the root on the stack • repeat • pop the next node T from the stack • for each child C of T: • if C is a leaf, examine point(s) in C • otherwise, if C intersects the ball of radius r around q, add C to the stack
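
  The stack loop above, sketched in Python against the quadtree Node from the previous sketch (ball_intersects is an assumed helper that tests whether a node's square meets the query ball):

    import math

    def ball_intersects(node, q, r):
        # per-coordinate distance from q to the node's square
        dx = max(abs(q[0] - node.center[0]) - node.half, 0)
        dy = max(abs(q[1] - node.center[1]) - node.half, 0)
        return math.hypot(dx, dy) <= r

    def range_search(root, q, r):
        found, stack = [], [root]                   # put the root on the stack
        while stack:
            node = stack.pop()                      # pop the next node
            if not node.children:                   # leaf: examine its point(s)
                found.extend(p for p in node.points if math.dist(p, q) <= r)
            else:
                for c in node.children:             # push children meeting the ball
                    if ball_intersects(c, q, r):
                        stack.append(c)
        return found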

  10. Near neighbor ctd

  11. Nearest neighbor • Start range search with r = ∞ • Whenever a point is found, update r • Only investigate nodes with respect to the current r
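
  In code, the shrinking-radius idea is a small change to the range search above (a sketch reusing the same helpers):

    def nearest(root, q):
        best, r = None, math.inf                    # start with r = infinity
        stack = [root]
        while stack:
            node = stack.pop()
            if not node.children:
                for p in node.points:
                    d = math.dist(p, q)
                    if d < r:                       # closer point found: update r
                        best, r = p, d
            else:
                for c in node.children:             # prune w.r.t. the current r
                    if ball_intersects(c, q, r):
                        stack.append(c)
        return best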

  12. Quadtree ctd. • Simple data structure • Versatile, easy to implement • So why doesn’t this talk end here? • Empty spaces: if the points form sparse clouds, it takes a while to reach them • Space exponential in dimension • Time exponential in dimension, e.g., points on the hypercube

  13. Space issues: example

  14. K-d-trees [Bentley’75] • Main ideas: • only one-dimensional splits • instead of splitting in the middle, choose the split “carefully” (many variations) • near(est) neighbor queries: as for quadtrees • Advantages: • no (or less) empty spaces • only linear space • Exponential query time still possible
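
  One possible rendering of the median-split idea (the dict-based node layout and the cycling-axes rule are illustrative assumptions; many split rules exist):

    def build_kdtree(points, depth=0):
        if len(points) <= 1:
            return {"leaf": points}                 # leaf holds at most one point
        axis = depth % len(points[0])               # one-dimensional split, cycling axes
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2                      # split at the median, not the middle
        return {"axis": axis, "split": points[mid][axis],
                "left": build_kdtree(points[:mid], depth + 1),
                "right": build_kdtree(points[mid:], depth + 1)}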

  15. Exponential query time • What does it mean exactly ? • Unless we do something really stupid, query time is at most dn • Therefore, the actual query time is Min[ dn, exponential(d) ] • This is still quite bad though, when the dimension is around 20-30 • Unfortunately, it seems inevitable (both in theory and practice)

  16. Approximate nearest neighbor • Can do it using (augmented) k-d trees, by interrupting search earlier [Arya et al’94] • Still exponential time (in the worst case)! • Try a different approach: • for exact queries, we can use binary search trees or hashing • can we adapt hashing to nearest neighbor search ?

  17. Locality-Sensitive Hashing [Indyk-Motwani’98] • Hash functions are locality-sensitive if, for a random hash function h, for any pair of points p, q we have: • Pr[h(p)=h(q)] is “high” if p is “close” to q • Pr[h(p)=h(q)] is “low” if p is “far” from q

  18. Do such functions exist? • Consider the hypercube, i.e., • points from {0,1}^d • Hamming distance D(p,q) = # positions on which p and q differ • Define hash function h by choosing a set I of k random coordinates, and setting h(p) = projection of p on I

  19. Example • Take • d=10, p=0101110010 • k=2, I={2,5} • Then h(p)=11
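
  A few lines of Python reproduce this example (make_hash is a hypothetical helper; the slide indexes coordinates from 1, so I = {2, 5} becomes {1, 4} in 0-indexed Python):

    import random

    def make_hash(d, k):
        I = random.sample(range(d), k)              # a set I of k random coordinates
        return lambda p: tuple(p[i] for i in I)     # h(p) = projection of p on I

    # the slide's instance, with I fixed rather than sampled:
    p = (0, 1, 0, 1, 1, 1, 0, 0, 1, 0)              # p = 0101110010, d = 10
    print(tuple(p[i] for i in (1, 4)))              # -> (1, 1), i.e. h(p) = 11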

  20. h’s are locality-sensitive • Pr[h(p)=h(q)] = (1 − D(p,q)/d)^k • We can vary the probability by changing k [Plots: Pr[h(p)=h(q)] as a function of distance, for k=1 and k=2]

  21. How can we use LSH? • Choose several functions h1..hl • Initialize a hash array for each hi • Store each point p in the bucket hi(p) of the i-th hash array, i = 1..l • In order to answer query q: • for each i = 1..l, retrieve the points in bucket hi(q) • return the closest point found
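
  A compact sketch of the whole scheme (class and parameter names are illustrative; points are assumed to be 0/1 tuples compared under Hamming distance):

    import random
    from collections import defaultdict

    class LSHIndex:
        def __init__(self, points, d, k, l):
            self.tables = []                        # l independent hash arrays
            for _ in range(l):
                I = random.sample(range(d), k)
                table = defaultdict(list)
                for p in points:                    # store p in bucket h_i(p)
                    table[tuple(p[i] for i in I)].append(p)
                self.tables.append((I, table))

        def query(self, q):
            # retrieve the bucket h_i(q) of every table, return the closest point
            cand = {p for I, t in self.tables for p in t[tuple(q[i] for i in I)]}
            return min(cand, key=lambda p: sum(a != b for a, b in zip(p, q)),
                       default=None)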

  22. What does this algorithm do? • By proper choice of parameters k and l, we can make, for any p, the probability that hi(p)=hi(q) for some i look like this: • Can control: • Position of the slope • How steep it is [Plot: collision probability as a function of distance]

  23. The LSH algorithm • Therefore, we can solve (approximately) the near neighbor problem with given parameter r • Worst-case analysis guarantees dn^{1/(1+ε)} query time • Practical evaluation indicates much better behavior [GIM’99, HGI’00, Buh’00, BT’00] • Drawbacks: • works best for Hamming distance (although it can be generalized to Euclidean space) • requires the radius r to be fixed in advance

  24. Secondary storage • Seek time is comparable to the time needed to transfer hundreds of KBs • Grouping the data is crucial • A different approach is required: • in main memory, any reduction in the number of inspected points was good • on disk, this is not the case!

  25. Disk-based algorithms • R-tree [Guttman’84] • departing point for many variations • over 600 citations (according to CiteSeer)! • “optimistic” approach: try to answer queries in logarithmic time • Vector Approximation File [WSB’98] • “pessimistic” approach: if we need to scan the whole data set, we had better do it fast • LSH works on disk as well

  26. R-tree • “Bottom-up” approach (k-d-tree was “top-down”) : • Start with a set of points/rectangles • Partition the set into groups of small cardinality • For each group, find minimum rectangle containing objects from this group • Repeat
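
  One round of that loop might look like the sketch below; real R-trees use more careful grouping heuristics (e.g. Guttman's quadratic split), so sorting by the x-coordinate of the rectangle centers is only an illustrative stand-in:

    def build_level(rects, group_size):
        """rects are (xmin, ymin, xmax, ymax) tuples; returns the parents' boxes."""
        rects = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)
        parents = []
        for i in range(0, len(rects), group_size):  # groups of small cardinality
            g = rects[i:i + group_size]
            parents.append((min(r[0] for r in g), min(r[1] for r in g),
                            max(r[2] for r in g), max(r[3] for r in g)))
        return parents                              # repeat until one box remains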

  27. R-tree ctd.

  28. R-tree ctd. • Advantages: • Supports near(est) neighbor search (similar to before) • Works for points and rectangles • Avoids empty spaces • Many variants: X-tree, SS-tree, SR-tree, etc. • Works well for low dimensions • Not so great for high dimensions

  29. VA-file [Weber, Schek, Blott’98] • Approach: • In high-dimensional spaces, all tree-based indexing structures examine a large fraction of their leaves • If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether • 1 seek = transfer of a few hundred KB

  30. VA-file ctd. • Natural question: how do we speed up the linear scan? • Answer: use approximation • Use only i bits per dimension (and speed up the scan by a factor of 32/i) • Identify all points which could be returned as an answer • Verify those points using the original data set
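
  A filter-and-refine sketch of the scan (the uniform-grid quantizer and all function names are assumptions; the error bound used for filtering is deliberately conservative):

    import numpy as np

    def quantize(data, i_bits):
        # keep only i bits per dimension: snap each coordinate to a uniform grid
        lo, hi = data.min(axis=0), data.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
        cells = (1 << i_bits) - 1
        approx = np.round((data - lo) / span * cells)
        return approx, lo, span / cells             # codes, origin, cell widths

    def range_query(data, approx, lo, cell_w, q, r):
        centers = lo + approx * cell_w              # cell centers approximate the data
        slack = np.linalg.norm(cell_w)              # upper bound on quantization error
        # filter on the coarse codes, then verify against the original data
        cand = np.linalg.norm(centers - q, axis=1) <= r + slack
        exact = np.linalg.norm(data[cand] - q, axis=1) <= r
        return data[cand][exact]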

  31. Time to sum up • The “curse of dimensionality” is indeed a curse • In main memory, we can perform sublinear-time search using trees or hashing • In secondary storage, a linear scan is pretty much all we can do (for high d) • Personal thought: if linear search is all we can do, we are not doing too well… • Maybe it is time to buy a few GB of RAM • …but in the end, everything depends on your data set

  32. Resources • Surveys: • Berchtold & Keim: • http://www.informatik.unihalle.de/~keim/PS/ICDE00.pdf • Theodoridis: • http://dias.cti.gr/~ytheod/research/ADBIS/handouts.pdf • Agarwal et al (range searching): • http://www.cs.duke.edu/~pankaj/papers.html

  33. Resources • Source code: http://dias.cti.gr/~ytheod/research/indexing/ http://www.cs.sunysb.edu/~algorith/major_section/1.6.shtml • References: see the surveys, plus the very recent: • [Buh’00, BT’00]: J. Buhler et al: http://www.cs.washington.edu/homes/jbuhler/ • [HGI’00]: Haveliwala et al: http://theory.lcs.mit.edu/~indyk/webdb.ps

  34. Contact • If you have any questions, feel free to e-mail me at indyk@theory.lcs.mit.edu • Thank you!
