Locality Sensitive Hashing



  1. Locality Sensitive Hashing Petra Kohoutková, Martin Kyselák

  2. Outline • Motivation • NN query, randomized approximate NN, open problems • Definition • LSH basic principles, definition, illustration • Algorithm • Basic idea, parameters, complexity • Examples • Specific LSH functions for several distance measures

  3. Motivation • Nearest neighbor queries: The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. • Several efficient algorithms are known for the case when the dimension d is low, e.g. kd-trees. However, these solutions suffer from either space or query time exponential in d. Thus, all current indexing techniques (based on space partitioning) degrade to a linear scan for sufficiently high d. This phenomenon is often called “the curse of dimensionality.”

  4. Motivation II • Approximate NN: return a point whose distance from the query is at most c times the distance from the query to its nearest point; c > 1 is called the approximation factor. • Randomized c-approximate R-near neighbor ((c, R)-NN) problem: given a set P of points in a d-dimensional space, and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 – δ. • Locality Sensitive Hashing (LSH): the key idea is to use hash functions such that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then, one can determine near neighbors by hashing the query point and retrieving the elements stored in buckets containing that point. • The aim is complexity linear in d and sublinear in n.

  5. LSH Definition • The LSH algorithm relies on the existence of locality-sensitive hash functions. Let H be a family of hash functions mapping Rd to some universe U. For any two points p and q, consider a process in which we choose a function h from H uniformly at random, and analyze the probability that h(p)= h(q).

  6. LSH Definition II • Definition: A family H of functions h: Rd → U is called (R, cR, P1, P2)-sensitive, if for any p, q: • If |p-q| ≤ R, then Pr[h(p) = h(q)] ≥ P1 • If |p-q| ≥ cR, then Pr[h(p) = h(q)] ≤ P2 • In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy P1 > P2.

  7. Example: Hamming Distance • Consider a data set of binary strings (000100, 000101, …) compared by the Hamming distance (D). In this case, we can use a simple family of functions H which contains all projections of the input point on one of the coordinates: • H= {hi | hi : {0,1}d → {0,1}, hi(p) = pi } • (pi is the i-th bit of p) • Then the probability is as follows: • Pr[h(p)=h(q)] = 1 - D(p,q) / d • Let p =000100, q = 000101; Pr[h(p)=h(q)] = 5/6 • H is (1, 2, 5/6, 4/6)-sensitive
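A minimal Python sketch of this bit-projection family (the strings, trial count, and function names are illustrative, not part of the slides):

import random

def sample_h(d):
    # h_i(p) = p_i: projection onto one coordinate, chosen uniformly at random
    i = random.randrange(d)
    return lambda p: p[i]

p, q, d = "000100", "000101", 6
trials = 100_000
collisions = 0
for _ in range(trials):
    h = sample_h(d)
    if h(p) == h(q):
        collisions += 1
print(collisions / trials)   # approaches 1 - D(p,q)/d = 5/6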

  8. LSH Algorithm • An LSH family H can be used to design an efficient algorithm for approximate NN search. However, one typically cannot use H directly since the gap between the probabilities P1 and P2 could be quite small. • Given a family H of hash functions with parameters (R, cR, P1, P2), we amplify the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions gj, j = 1,…,L, by setting gj(q) = (h1,j(q), h2,j(q), …, hk,j(q)), where ht,j (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen independently and uniformly at random from H. These are the actual functions that we use to hash the data points.
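A minimal sketch of this concatenation step, instantiated with the Hamming family from slide 7 (the values of k, L and the tuple-valued g_j are illustrative choices, not prescribed by the slides):

import random

def sample_g(k, d):
    # g = (h_1, ..., h_k): concatenate k independent coordinate projections;
    # a far pair now collides with probability at most P2**k, a near pair with at least P1**k
    idx = [random.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

d, k, L = 6, 4, 10
gs = [sample_g(k, d) for _ in range(L)]
print(gs[0]("000100"))   # e.g. ('0', '1', '0', '0')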

  9. LSH Algorithm – Parameters (figure omitted; source: A. Andoni, P. Indyk: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions)

  10. LSH Algorithm II Preprocessing: • Choose L functions gj, j = 1,…,L, by setting gj = (h1,j, h2,j, …, hk,j), where h1,j,…,hk,j are chosen at random from the LSH family H. • Construct L hash tables, where, for each j = 1,…,L, the jth hash table contains the dataset points hashed using the function gj. Query algorithm for a query point q: • For each j = 1, 2,…, L • Retrieve the points from the bucket gj(q) in the jth hash table. • For each of the retrieved points, compute its distance from q, and report the point if it is a correct answer
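A self-contained sketch of the preprocessing and query steps, instantiated with the Hamming family (the table layout, parameter values, and the verification threshold are illustrative assumptions):

import random
from collections import defaultdict

def sample_g(k, d):
    # one concatenated hash g = (h_1, ..., h_k) of random coordinate projections
    idx = [random.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def build_index(points, k, L, d):
    # Preprocessing: L hash tables, the j-th keyed by g_j(point)
    gs = [sample_g(k, d) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in points:
        for g, table in zip(gs, tables):
            table[g(p)].append(p)
    return gs, tables

def query(q, gs, tables, R):
    # Query: probe bucket g_j(q) in every table, then verify candidates by exact distance
    seen, answers = set(), []
    for g, table in zip(gs, tables):
        for p in table[g(q)]:
            if p not in seen:
                seen.add(p)
                if hamming(p, q) <= R:   # or <= c*R in the (c, R)-NN formulation
                    answers.append(p)
    return answers

data = ["000100", "000101", "111111", "101010"]
gs, tables = build_index(data, k=2, L=5, d=6)
print(query("000100", gs, tables, R=1))   # typically returns ["000100", "000101"]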

  11. LSH Algorithm III • The LSH algorithm gives a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ for some constant failure probability δ < 1. The value of δ depends on the choice of the parameters k and L. Conversely, for each δ, one can provide parameters k and L so that the error probability is smaller than δ. • The query time is also dependent on k and L. It could be as high as Θ(n) in the worst case, but, for many natural data sets, a proper choice of parameters results in a sublinear query time O(dn^ρ) for some ρ < 1.
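A sketch of one standard way to derive k and L from P1, P2, n and δ; this rule comes from the general LSH literature rather than from the slide, so treat the exact formulas as an assumption:

import math

def lsh_parameters(n, P1, P2, delta):
    # Pick k so that a far pair collides under one g with probability about 1/n,
    # then pick L so that a near pair collides in some table with probability >= 1 - delta.
    k = math.ceil(math.log(n) / math.log(1.0 / P2))
    p_near = P1 ** k                               # collision probability of a near pair under one g
    L = math.ceil(math.log(1.0 / delta) / p_near)
    rho = math.log(1.0 / P1) / math.log(1.0 / P2)  # query-time exponent: roughly O(d * n**rho)
    return k, L, rho

print(lsh_parameters(n=1_000_000, P1=5/6, P2=4/6, delta=0.1))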

  12. LSH Functions • for Hamming distance • for l1 distance • for Euclidean (l2) distance • for Jaccard’s coefficient • for Arccos measure • … • for general metric space?

  13. LSH for Hamming Distance • Hamming distance of strings p, q is equal to the number of positions where p and q differ • Define a family of functions H which contains all projections of the input point on one of the coordinates: • H= {hi | hi : {0,1}d → {0,1}, hi(p) = pi } • (pi is the i-th bit of p)

  14. LSH for l1 Distance • l1 distance of vectors x, y is defined as l1(x,y) = |x1-y1| + … + |xd-yd| • Fix a real w >> R, and impose a randomly shifted grid with cells of width w; each cell defines a bucket. More specifically, pick random real numbers s1,…,sd from [0, w) and define • hs1,…,sd(x) = (⌊(x1 – s1)/w⌋, …, ⌊(xd – sd)/w⌋)
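A minimal sketch of this randomly shifted grid (the dimension, cell width and test points are illustrative):

import math
import random

def sample_grid_hash(d, w):
    # shift each axis by s_i drawn from [0, w); the bucket is the vector of cell indices
    shifts = [random.uniform(0, w) for _ in range(d)]
    return lambda x: tuple(math.floor((x[i] - shifts[i]) / w) for i in range(d))

h = sample_grid_hash(d=2, w=4.0)
print(h((1.0, 2.5)), h((1.5, 2.0)))   # points within l1 distance much smaller than w usually share a cell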

  15. LSH for Euclidean (l2) Distance • l2(x,y) = (|x1-y1|^2 + … + |xd-yd|^2)^(1/2) • Pick a random projection of Rd onto a 1-dimensional line and chop the line into segments of length w, shifted by a random value b from [0, w) • hr,b(x) = ⌊(r·x + b)/w⌋ • the projection vector r from Rd is constructed by picking each coordinate of r from the Gaussian distribution • (illustration omitted: two example projections with w = 3, r = (3,1), b = 2 and r = (4,-1), b = 2)
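A minimal sketch of this projection-based hash (the dimension, width w and test vectors are illustrative):

import math
import random

def sample_l2_hash(d, w):
    # project onto a random Gaussian direction r, shift by b ~ U[0, w), take the segment index
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0, w)
    return lambda x: math.floor((sum(ri * xi for ri, xi in zip(r, x)) + b) / w)

h = sample_l2_hash(d=2, w=3.0)
print(h((3.0, 1.0)), h((3.2, 0.9)))   # vectors at small l2 distance tend to fall in the same segment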

  16. LSH for Jaccard’s Coefficient • Jaccard’s coefficient for sets A, B is defined as J(A, B) = |A ∩ B| / |A ∪ B| • Pick a random permutation π on the ground universe U. Then, define hπ(A) = min{π(a) | a ∈ A}. • The probability of collision is Prπ[hπ(A) = hπ(B)] = J(A, B) = 1 - d(A, B), where d(A, B) = 1 - J(A, B) is the Jaccard distance • (illustration omitted: two permutations π1, π2 whose minima agree, hπ1(A) = hπ1(B) and hπ2(A) = hπ2(B), for overlapping sets A and B)
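A minimal MinHash sketch of this family (the universe, the sets A and B, and the trial count are illustrative):

import random

def sample_minhash(universe):
    # a random permutation pi of the ground universe; h_pi(A) = minimum rank of an element of A
    order = list(universe)
    random.shuffle(order)
    rank = {x: i for i, x in enumerate(order)}
    return lambda A: min(rank[a] for a in A)

U = range(20)
A, B = {1, 2, 3, 4}, {2, 3, 4, 5}
trials = 20_000
hits = 0
for _ in range(trials):
    h = sample_minhash(U)
    if h(A) == h(B):
        hits += 1
print(hits / trials)   # approaches |A ∩ B| / |A ∪ B| = 3/5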

  17. LSH for Arccos Measure • Arccos measure for vectors p, q is defined as θ(p, q) = arccos(p·q / (|p| |q|)) • Family of LSH functions is then defined as follows: • H = {hu | hu(p) = sign(u·p)} • where u is a random unit-length vector
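A minimal sketch of this random-hyperplane family (the dimension and test vectors are illustrative); a known property of this family, not stated on the slide, is that the collision probability equals 1 - θ(p, q)/π:

import math
import random

def sample_sign_hash(d):
    # h_u(p) = sign(u · p) for a random unit-length direction u
    u = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]
    return lambda p: 1 if sum(ui * pi for ui, pi in zip(u, p)) >= 0 else -1

h = sample_sign_hash(d=3)
print(h((1.0, 0.2, 0.0)), h((0.9, 0.3, 0.1)))   # vectors at a small angle usually get the same sign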

  18. LSH for General Metric Space • We know nothing about the distance measure • We can only use distances between objects • Idea: use some randomly (?) picked objects as pivots, define buckets as Voronoi regions ?
