Locality Sensitive Hashing



  1. Locality Sensitive Hashing Petra Kohoutková, Martin Kyselák

  2. Outline • Motivation • NN query, randomized approximate NN, open problems • Definition • LSH basic principles, definition, illustration • Algorithm • Basic idea, parameters, complexity • Examples • Specific LSH functions for several distance measures

  3. Motivation • Nearest neighbor queries: The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. • Several efficient algorithms are known for the case when the dimension d is low, e.g. kd-trees. However, these solutions suffer from either space or query time exponential in d. Thus, all current indexing techniques (based on space partitioning) degrade to a linear scan for sufficiently high d. This phenomenon is often called “the curse of dimensionality.”

  4. Motivation II • Approximate NN: return a point whose distance from the query is at most c times the distance from the query to its nearest point; c > 1 is called the approximation factor. • Randomized c-approximate R-near neighbor ((c, R)-NN) problem: given a set P of points in a d-dimensional space, and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 – δ. • Locality Sensitive Hashing (LSH): the key idea is to use hash functions such that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then, one can determine near neighbors by hashing the query point and retrieving the elements stored in buckets containing that point. • The aim is complexity linear in d and sublinear in n.

  5. LSH Definition • The LSH algorithm relies on the existence of locality-sensitive hash functions. Let H be a family of hash functions mapping Rd to some universe U. For any two points p and q, consider a process in which we choose a function h from H uniformly at random, and analyze the probability that h(p)= h(q).

  6. LSH Definition II • Definition: A family H of functions h: Rd → U is called (R, cR, P1, P2)-sensitive, if for any p, q: • If |p-q| ≤ R, then Pr[h(p) = h(q)] ≥ P1 • If |p-q| ≥ cR, then Pr[h(p) = h(q)] ≤ P2 • In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy P1 > P2.

  7. Example: Hamming Distance • Consider a data set of binary strings (000100, 000101, …) compared by the Hamming distance (D). In this case, we can use a simple family of functions H which contains all projections of the input point on one of the coordinates: • H= {hi | hi : {0,1}d → {0,1}, hi(p) = pi } • (pi is the i-th bit of p) • Then the probability is as follows: • Pr[h(p)=h(q)] = 1 - D(p,q) / d • Let p =000100, q = 000101; Pr[h(p)=h(q)] = 5/6 • H is (1, 2, 5/6, 4/6)-sensitive
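A minimal Python sketch of this bit-projection family (the strings, trial count, and function names are illustrative, not part of the slides):

import random

def sample_h(d):
    # h_i(p) = p_i: projection onto one coordinate, chosen uniformly at random
    i = random.randrange(d)
    return lambda p: p[i]

p, q, d = "000100", "000101", 6
trials = 100_000
collisions = 0
for _ in range(trials):
    h = sample_h(d)
    if h(p) == h(q):
        collisions += 1
print(collisions / trials)   # approaches 1 - D(p,q)/d = 5/6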

  8. LSH Algorithm • An LSH family H can be used to design an efficient algorithm for approximate NN search. However, one typically cannot use H directly since the gap between the probabilities P1 and P2 could be quite small. • Given a family H of hash functions with parameters (R, cR, P1, P2), we amplify the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions gj, j = 1,…,L, by setting gj(q) = (h1,j(q), h2,j(q), …, hk,j(q)), where ht,j (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen independently and uniformly at random from H. These are the actual functions that we use to hash the data points.
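A minimal sketch of this concatenation step, instantiated with the Hamming family from slide 7 (the values of k, L and the tuple-valued g_j are illustrative choices, not prescribed by the slides):

import random

def sample_g(k, d):
    # g = (h_1, ..., h_k): concatenate k independent coordinate projections;
    # a far pair now collides with probability at most P2**k, a near pair with at least P1**k
    idx = [random.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

d, k, L = 6, 4, 10
gs = [sample_g(k, d) for _ in range(L)]
print(gs[0]("000100"))   # e.g. ('0', '1', '0', '0')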

  9. LSH Algorithm – Parameters (figure omitted; source: A. Andoni, P. Indyk: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions)

  10. LSH Algorithm II Preprocessing: • Choose L functions gj, j = 1,…,L, by setting gj = (h1,j, h2,j, …, hk,j), where h1,j,…,hk,j are chosen at random from the LSH family H. • Construct L hash tables, where, for each j = 1,…,L, the jth hash table contains the dataset points hashed using the function gj. Query algorithm for a query point q: • For each j = 1, 2,…, L • Retrieve the points from the bucket gj(q) in the jth hash table. • For each of the retrieved points, compute its distance from q, and report the point if it is a correct answer
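A self-contained sketch of the preprocessing and query steps, instantiated with the Hamming family (the table layout, parameter values, and the verification threshold are illustrative assumptions):

import random
from collections import defaultdict

def sample_g(k, d):
    # one concatenated hash g = (h_1, ..., h_k) of random coordinate projections
    idx = [random.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def build_index(points, k, L, d):
    # Preprocessing: L hash tables, the j-th keyed by g_j(point)
    gs = [sample_g(k, d) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in points:
        for g, table in zip(gs, tables):
            table[g(p)].append(p)
    return gs, tables

def query(q, gs, tables, R):
    # Query: probe bucket g_j(q) in every table, then verify candidates by exact distance
    seen, answers = set(), []
    for g, table in zip(gs, tables):
        for p in table[g(q)]:
            if p not in seen:
                seen.add(p)
                if hamming(p, q) <= R:   # or <= c*R in the (c, R)-NN formulation
                    answers.append(p)
    return answers

data = ["000100", "000101", "111111", "101010"]
gs, tables = build_index(data, k=2, L=5, d=6)
print(query("000100", gs, tables, R=1))   # typically returns ["000100", "000101"]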

  11. LSH Algorithm III • The LSH algorithm gives a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ for some constant failure probability δ < 1. The value of δ depends on the choice of the parameters k and L. Conversely, for each δ, one can provide parameters k and L so that the error probability is smaller than δ. • The query time is also dependent on k and L. It could be as high as Θ(n) in the worst case, but, for many natural data sets, a proper choice of parameters results in a sublinear query time O(dn^ρ) for some ρ < 1.
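A sketch of one standard way to derive k and L from P1, P2, n and δ; this rule comes from the general LSH literature rather than from the slide, so treat the exact formulas as an assumption:

import math

def lsh_parameters(n, P1, P2, delta):
    # Pick k so that a far pair collides under one g with probability about 1/n,
    # then pick L so that a near pair collides in some table with probability >= 1 - delta.
    k = math.ceil(math.log(n) / math.log(1.0 / P2))
    p_near = P1 ** k                               # collision probability of a near pair under one g
    L = math.ceil(math.log(1.0 / delta) / p_near)
    rho = math.log(1.0 / P1) / math.log(1.0 / P2)  # query-time exponent: roughly O(d * n**rho)
    return k, L, rho

print(lsh_parameters(n=1_000_000, P1=5/6, P2=4/6, delta=0.1))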

  12. LSH Functions • for Hamming distance • for l1 distance • for Euclidean (l2) distance • for Jaccard’s coefficient • for Arccos measure • … • for general metric space?

  13. LSH for Hamming Distance • Hamming distance of strings p, q is equal to the number of positions where p and q differ • Define a family of functions H which contains all projections of the input point on one of the coordinates: • H= {hi | hi : {0,1}d → {0,1}, hi(p) = pi } • (pi is the i-th bit of p)

  14. LSH for l1 Distance • l1 distance of vectors x, y is defined as l1(x,y) = |x1-y1| + … + |xd-yd| • Fix a real w >> R, and impose a randomly shifted grid with cells of width w; each cell defines a bucket. More specifically, pick random real numbers s1,…,sd from [0, w) and define • hs1,…,sd(x) = (⌊(x1 – s1)/w⌋, …, ⌊(xd – sd)/w⌋)
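A minimal sketch of this randomly shifted grid (the dimension, cell width and test points are illustrative):

import math
import random

def sample_grid_hash(d, w):
    # shift each axis by s_i drawn from [0, w); the bucket is the vector of cell indices
    shifts = [random.uniform(0, w) for _ in range(d)]
    return lambda x: tuple(math.floor((x[i] - shifts[i]) / w) for i in range(d))

h = sample_grid_hash(d=2, w=4.0)
print(h((1.0, 2.5)), h((1.5, 2.0)))   # points within l1 distance much smaller than w usually share a cell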

  15. LSH for Euclidean (l2) Distance • l2(x,y) = (|x1-y1|^2 + … + |xd-yd|^2)^(1/2) • Pick a random projection of Rd onto a 1-dimensional line and chop the line into segments of length w, shifted by a random value b from [0, w) • hr,b(x) = ⌊(r·x + b)/w⌋ • the projection vector r from Rd is constructed by picking each coordinate of r from the Gaussian distribution • (illustration omitted: two example projections with w = 3, r = (3,1), b = 2 and r = (4,-1), b = 2)
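A minimal sketch of this projection-based hash (the dimension, width w and test vectors are illustrative):

import math
import random

def sample_l2_hash(d, w):
    # project onto a random Gaussian direction r, shift by b ~ U[0, w), take the segment index
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0, w)
    return lambda x: math.floor((sum(ri * xi for ri, xi in zip(r, x)) + b) / w)

h = sample_l2_hash(d=2, w=3.0)
print(h((3.0, 1.0)), h((3.2, 0.9)))   # vectors at small l2 distance tend to fall in the same segment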

  16. LSH for Jaccard’s Coefficient • Jaccard’s coefficient for sets A, B is defined as J(A, B) = |A ∩ B| / |A ∪ B| • Pick a random permutation π on the ground universe U. Then, define hπ(A) = min{π(a) | a ∈ A}. • The probability of collision is Prπ[hπ(A) = hπ(B)] = J(A, B) = 1 - d(A, B), where d(A, B) = 1 - J(A, B) is the Jaccard distance • (illustration omitted: two permutations π1, π2 whose minima agree, hπ1(A) = hπ1(B) and hπ2(A) = hπ2(B), for overlapping sets A and B)
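A minimal MinHash sketch of this family (the universe, the sets A and B, and the trial count are illustrative):

import random

def sample_minhash(universe):
    # a random permutation pi of the ground universe; h_pi(A) = minimum rank of an element of A
    order = list(universe)
    random.shuffle(order)
    rank = {x: i for i, x in enumerate(order)}
    return lambda A: min(rank[a] for a in A)

U = range(20)
A, B = {1, 2, 3, 4}, {2, 3, 4, 5}
trials = 20_000
hits = 0
for _ in range(trials):
    h = sample_minhash(U)
    if h(A) == h(B):
        hits += 1
print(hits / trials)   # approaches |A ∩ B| / |A ∪ B| = 3/5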

  17. LSH for Arccos Measure • Arccos measure for vectors p, q is defined as θ(p, q) = arccos(p·q / (|p| |q|)) • Family of LSH functions is then defined as follows: • H = {hu | hu(p) = sign(u·p)} • where u is a random unit-length vector
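A minimal sketch of this random-hyperplane family (the dimension and test vectors are illustrative); a known property of this family, not stated on the slide, is that the collision probability equals 1 - θ(p, q)/π:

import math
import random

def sample_sign_hash(d):
    # h_u(p) = sign(u · p) for a random unit-length direction u
    u = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]
    return lambda p: 1 if sum(ui * pi for ui, pi in zip(u, p)) >= 0 else -1

h = sample_sign_hash(d=3)
print(h((1.0, 0.2, 0.0)), h((0.9, 0.3, 0.1)))   # vectors at a small angle usually get the same sign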

  18. LSH for General Metric Space • We know nothing about the distance measure • We can only use distances between objects • Idea: use some randomly (?) picked objects as pivots, define buckets as Voronoi regions ?
