CS 361A (Advanced Data Structures and Algorithms) Lecture 19 (Dec 5, 2005) Nearest Neighbors: Dimensionality Reduction and Locality-Sensitive Hashing Rajeev Motwani
Metric Space
• Metric Space (M,D)
  – For points p,q in M, D(p,q) is distance from p to q
  – only reasonable model for high-dimensional geometric space
• Defining Properties
  – Reflexive: D(p,q) = 0 if and only if p=q
  – Symmetric: D(p,q) = D(q,p)
  – Triangle Inequality: D(p,q) is at most D(p,r)+D(r,q)
• Interesting Cases
  – M = points in d-dimensional space
  – D = Hamming or Euclidean Lp-norms
High-Dimensional Near Neighbors
• Nearest Neighbors Data Structure
  – Given – N points P={p1, …, pN} in metric space (M,D)
  – Queries – “Which point p∈P is closest to point q?”
  – Complexity – Tradeoff preprocessing space with query time
• Applications
  – vector quantization
  – multimedia databases
  – data mining
  – machine learning
  – …
Known Results • Some expressions are approximate • Bottom-line – exponential dependence on d
Approximate Nearest Neighbor
• Exact Algorithms
  – Benchmark – brute-force needs space O(N), query time O(N)
  – Known Results – exponential dependence on dimension
  – Theory/Practice – no better than brute-force search
• Approximate Near-Neighbors
  – Given – N points P={p1, …, pN} in metric space (M,D)
  – Given – error parameter ε>0
  – Goal – for query q and nearest-neighbor p, return point r such that D(q,r) ≤ (1+ε)·D(q,p)
• Justification
  – Mapping objects to metric space is heuristic anyway
  – Get tremendous performance improvement
Results for Approximate NN • Will show main ideas of last 3 results • Some expressions are approximate
Approximate r-Near Neighbors
• Given – N points P={p1,…,pN} in metric space (M,D)
• Given – error parameter ε>0, distance threshold r>0
• Query
  – If no point p with D(q,p) < r, return FAILURE
  – Else, return any p’ with D(q,p’) < (1+ε)r
• Application
  – Solving Approximate Nearest Neighbor
  – Assume maximum distance is R
  – Run in parallel for r = 1, (1+ε), (1+ε)^2, …, R (see the sketch below)
  – Time/space – O(log R) overhead
  – [Indyk-Motwani] – reduce to O(polylog n) overhead
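A minimal sketch of this reduction, assuming a hypothetical subroutine near_r(q, r) that implements the approximate r-near-neighbor structure, and distances scaled so the smallest inter-point distance is at least 1 and the largest is R:

    def approx_nn(q, R, eps, near_r):
        # near_r(q, r) is assumed to return some point within (1+eps)*r of q
        # whenever a point lies within distance r, and None otherwise
        r = 1.0
        while r <= R:
            p = near_r(q, r)
            if p is not None:
                # first success: the previous radius failed, so the true nearest
                # neighbor is farther than r/(1+eps); p is then within a
                # (1+eps)^2 factor of optimal
                return p
            r *= (1.0 + eps)      # O(log_{1+eps} R) calls in the worst case
        return None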
Hamming Metric
• Hamming Space
  – Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
  – Hamming Distance: D(p,q) = # of positions where p,q differ
• Remarks
  – Simplest high-dimensional setting
  – Still useful in practice
  – In theory, as hard (or easy) as Euclidean space
  – Trivial in low dimensions
• Example
  – Hypercube in d=3 dimensions
  – {000, 001, 010, 011, 100, 101, 110, 111}
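For concreteness, a tiny Python helper for this distance, with bit-vectors represented as strings:

    def hamming(p, q):
        # number of positions where the equal-length bit-strings p and q differ
        return sum(a != b for a, b in zip(p, q))

    # hamming("01100", "00110") == 2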
Dimensionality Reduction
• Overall Idea
  – Map from high to low dimensions
  – Preserve distances approximately
  – Solve Nearest Neighbors in new space
  – Performance improvement at cost of approximation error
• Mapping?
  – Hash function family H = {H1, …, Hm}
  – Each Hi: {0,1}^d → {0,1}^t with t << d
  – Pick HR from H uniformly at random
  – Map each point in P using same HR
  – Solve NN problem on HR(P) = {HR(p1), …, HR(pN)}
Reduction for Hamming Spaces
Theorem: For any r and small ε>0, there is a hash family H of maps {0,1}^d → {0,1}^t such that for any p,q and random HR∈H, with probability > 1-δ,
• D(p,q) < r ⟹ D(HR(p),HR(q)) < (c + ε/12)·t
• D(p,q) > (1+ε)r ⟹ D(HR(p),HR(q)) > (c + ε/12)·t
provided t ≥ (C/ε^2)·log(1/δ) for some constant C.
Remarks
• For fixed threshold r, can distinguish between
  – Near: D(p,q) < r
  – Far: D(p,q) > (1+ε)r
• For N points, need δ << 1/N, i.e. t = O((log N)/ε^2)
• Yet, can reduce to O(log N)-dimensional space, while approximately preserving distances
• Works even if points not known in advance
Hash Family
• Projection Function
  – Let S be ordered, multiset of s indexes from {1,…,d}
  – p|S: {0,1}^d → {0,1}^s projects p into s-dimensional subspace
• Example
  – d=5, p=01100
  – s=3, S={2,2,4} ⟹ p|S = 110
• Choosing hash function HR in H (see the sketch below)
  – Repeat for i=1,…,t
  – Pick Si randomly (with replacement) from {1,…,d}
  – Pick random hash function fi: {0,1}^s → {0,1}
  – hi(p) = fi(p|Si)
  – HR(p) = (h1(p), h2(p),…,ht(p))
• Remark – note similarity to Bloom Filters
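A minimal sketch of choosing and evaluating HR, with points as bit-strings; make_HR is an illustrative name, and the random functions fi are realized lazily as tables of random bits:

    import random

    def make_HR(d, s, t, seed=None):
        rng = random.Random(seed)
        # t index multisets S_i (sampled with replacement) and t random
        # functions f_i : {0,1}^s -> {0,1}, filled in on first use
        S = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
        F = [dict() for _ in range(t)]

        def HR(p):                                     # p is a bit-string of length d
            bits = []
            for i in range(t):
                proj = "".join(p[j] for j in S[i])     # p | S_i
                if proj not in F[i]:
                    F[i][proj] = rng.randrange(2)      # lazily sampled bit f_i(p|S_i)
                bits.append(F[i][proj])
            return tuple(bits)

        return HR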
Illustration of Hashing
[Figure: a point p ∈ {0,1}^d is projected onto the index multisets S1, …, St to give p|S1, …, p|St ∈ {0,1}^s; the random functions f1, …, ft map these projections to the bits h1(p), …, ht(p) of HR(p)]
Analysis I
• Choose random index-set S
• Claim: For any p,q, Pr[p|S = q|S] = (1 – D(p,q)/d)^s
• Why?
  – p,q differ in D(p,q) bit positions
  – Need all s indexes of S to avoid these positions
  – Sampling with replacement from {1, …, d}
Analysis II
• Choose s = d/r
• Since 1-x < e^(-x) for |x|<1, we obtain Pr[p|S = q|S] = (1 – D(p,q)/d)^(d/r) < e^(-D(p,q)/r)
• Thus
  – D(p,q) < r ⟹ Pr[p|S = q|S] > (1 – r/d)^(d/r) ≈ e^(-1)
  – D(p,q) > (1+ε)r ⟹ Pr[p|S = q|S] < e^(-(1+ε))
Analysis III
• Recall hi(p) = fi(p|Si), with fi a random function
• Thus Pr[hi(p) ≠ hi(q)] = ½·(1 – Pr[p|Si = q|Si])
  – D(p,q) < r ⟹ Pr[hi(p) ≠ hi(q)] < ½(1 – e^(-1))
  – D(p,q) > (1+ε)r ⟹ Pr[hi(p) ≠ hi(q)] > ½(1 – e^(-(1+ε))) ≥ ½(1 – e^(-1)) + ε/6 (for small ε)
• Choosing c = ½(1 – e^(-1))
Analysis IV
• Recall HR(p) = (h1(p), h2(p),…,ht(p))
• D(HR(p),HR(q)) = number of i’s where hi(p), hi(q) differ
• By linearity of expectations, E[D(HR(p),HR(q))] = t·Pr[hi(p) ≠ hi(q)], so
  – D(p,q) < r ⟹ E[D(HR(p),HR(q))] < c·t
  – D(p,q) > (1+ε)r ⟹ E[D(HR(p),HR(q))] > (c + ε/6)·t
• Theorem almost proved
• For high probability bound, need Chernoff Bound
Chernoff Bound
• Consider Bernoulli random variables X1, X2, …, Xn
  – Values are 0-1
  – Pr[Xi=1] = x and Pr[Xi=0] = 1-x
• Define X = X1+X2+…+Xn with E[X] = nx
• Theorem: For independent X1,…,Xn, for any 0<λ<1, Pr[ |X – nx| > λ·nx ] < 2·e^(-λ^2·nx/3)
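A quick empirical sanity check of this bound, with arbitrary example parameters; the function returns the observed tail frequency alongside the bound:

    import math, random

    def chernoff_check(n=500, x=0.3, lam=0.3, trials=5000, seed=1):
        rng = random.Random(seed)
        deviations = 0
        for _ in range(trials):
            X = sum(rng.random() < x for _ in range(n))   # sum of n Bernoulli(x) variables
            if abs(X - n * x) > lam * n * x:
                deviations += 1
        empirical = deviations / trials
        bound = 2 * math.exp(-lam * lam * n * x / 3)
        return empirical, bound   # empirical tail probability vs. Chernoff bound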
Analysis V
• Define
  – Xi = 0 if hi(p) = hi(q), and 1 otherwise
  – n = t
• Then X = X1+X2+…+Xt = D(HR(p),HR(q))
• Case 1 [D(p,q) < r ⟹ x = c]
  – Chernoff: Pr[X > (c + ε/12)·t] < e^(-Ω(ε^2·t))
• Case 2 [D(p,q) > (1+ε)r ⟹ x = c + ε/6]
  – Chernoff: Pr[X < (c + ε/12)·t] < e^(-Ω(ε^2·t))
• Observe – sloppy bounding of constants in Case 2
Putting it all together
• Recall t ≥ (C/ε^2)·log(1/δ)
• Thus, error probability is at most 2·e^(-Ω(ε^2·t)) ≤ δ
• Choosing C = 1200/c makes the constants in the two Chernoff bounds work out
• Theorem is proved!!
Algorithm I
• Set error probability δ (apply the theorem with δ/N, so a union bound covers all N points)
• Select hash HR and map points p → HR(p)
• Processing query q
  – Compute HR(q)
  – Find nearest neighbor HR(p) for HR(q)
  – If D(HR(q),HR(p)) < (c + ε/12)·t then return p, else FAILURE
• Remarks
  – Brute-force for finding HR(p) implies query time O(N·t) = O(N·(log N)/ε^2)
  – Need another approach for lower dimensions
Algorithm II
• Fact – Exact nearest neighbors in {0,1}^t requires
  – Space O(2^t)
  – Query time O(t)
• How?
  – Precompute/store answers to all queries
  – Number of possible queries is 2^t
• Since t = O((log N)/ε^2), we get 2^t = N^O(1/ε^2)
• Theorem – In Hamming space {0,1}^d, can solve approximate nearest neighbor with:
  – Space N^O(1/ε^2)
  – Query time O(d·(log N)/ε^2)
Different Metric
• Many applications have “sparse” points
  – Many dimensions but few 1’s
  – Example – points ↔ documents, dimensions ↔ words
  – Better to view as “sets”
  – Previous approach would require large s
• For sets A,B, define sim(A,B) = |A∩B| / |A∪B|
• Observe
  – A=B ⟹ sim(A,B)=1
  – A,B disjoint ⟹ sim(A,B)=0
• Question – Handling D(A,B) = 1 – sim(A,B)?
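In code, this set similarity (the Jaccard coefficient) and its induced distance are simply:

    def sim(A, B):
        # Jaccard similarity |A & B| / |A | B| of two Python sets
        if not A and not B:
            return 1.0
        return len(A & B) / len(A | B)

    def set_distance(A, B):
        return 1.0 - sim(A, B)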
Min-Hash
• Random permutations π1,…,πt of universe (dimensions)
• Define mapping hj(A) = min over a∈A of πj(a)
• Fact: Pr[hj(A) = hj(B)] = sim(A,B)
• Proof? – already seen!!
• Overall hash-function HR(A) = (h1(A), h2(A),…,ht(A))
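A small sketch of min-hash signatures over the universe {0,…,d-1}; permutations are stored as explicit rank tables, and sets are assumed non-empty:

    import random

    def make_permutations(d, t, seed=None):
        rng = random.Random(seed)
        perms = []
        for _ in range(t):
            order = list(range(d))
            rng.shuffle(order)
            # rank[a] = position of element a under this permutation
            perms.append({a: rank for rank, a in enumerate(order)})
        return perms

    def HR(A, perms):
        # signature (h_1(A), ..., h_t(A)) with h_j(A) = min over a in A of pi_j(a)
        return tuple(min(perm[a] for a in A) for perm in perms)

    # For a random permutation, Pr[h_j(A) == h_j(B)] = sim(A, B).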
Min-Hash Analysis
• Select t = O((log N)/ε^2)
• Hamming Distance
  – D(HR(A),HR(B)) = number of j’s such that hj(A) ≠ hj(B)
• Theorem: For any A,B, with high probability | D(HR(A),HR(B))/t – D(A,B) | < ε
• Proof? – Exercise (apply Chernoff Bound)
• Obtain – ANN algorithm similar to earlier result
Generalization
• Goal
  – abstract technique used for Hamming space
  – enable application to other metric spaces
  – handle Dynamic ANN
• Dynamic Approximate r-Near Neighbors
  – Fix – threshold r
  – Query – if any point within distance r of q, return any point within distance (1+ε)r
  – Allow insertions/deletions of points in P
• Recall – earlier method required preprocessing all possible queries in hash-range-space…
Locality-Sensitive Hashing
• Fix – metric space (M,D), threshold r, error ε
• Choose – probability parameters Q1 > Q2 > 0
• Definition – Hash family H = {h: M → S} for (M,D) is called (r, (1+ε)r, Q1, Q2)-sensitive, if for random h and for any p,q in M
  – D(p,q) < r ⟹ Pr[h(p) = h(q)] > Q1
  – D(p,q) > (1+ε)r ⟹ Pr[h(p) = h(q)] < Q2
• Intuition
  – p,q are near ⟹ likely to collide
  – p,q are far ⟹ unlikely to collide
Examples
• Hamming Space M = {0,1}^d
  – point p = b1…bd
  – H = {hi(b1…bd) = bi, for i=1…d}
  – sampling one bit at random
  – Pr[hi(q) = hi(p)] = 1 – D(p,q)/d
• Set Similarity D(A,B) = 1 – sim(A,B)
  – Recall sim(A,B) = |A∩B| / |A∪B|
  – H = {min-hash functions h(A) = min over a∈A of π(a), over permutations π}
  – Pr[h(A) = h(B)] = 1 – D(A,B)
Multi-Index Hashing
• Overall Idea
  – Fix LSH family H
  – Boost Q1, Q2 gap by defining G = H^k
  – Using G, each point hashes into l buckets
• Intuition
  – r-near neighbors likely to collide
  – few non-near pairs in any bucket
• Define
  – G = { g | g(p) = h1(p)h2(p)…hk(p) }
  – Hamming metric ⟹ sample k random bits
Example (l=4)
[Figure: a point p and query q are each hashed by g1, g2, g3, g4, where every gj concatenates h1,…,hk; an r-near neighbor p of q lands in at least one of q’s four buckets]
Overall Scheme
• Preprocessing
  – Prepare hash table for range of G
  – Select l hash functions g1, g2, …, gl
  – Insert(p) – add p to buckets g1(p), g2(p), …, gl(p)
  – Delete(p) – remove p from buckets g1(p), g2(p), …, gl(p)
• Query(q)
  – Check buckets g1(q), g2(q), …, gl(q)
  – Report nearest of (say) first 3l points
• Complexity
  – Assume – computing D(p,q) needs O(d) time
  – Assume – storing p needs O(d) space
  – Insert/Delete/Query Time – O(dlk)
  – Preprocessing/Storage – O(dN + Nlk)
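A compact sketch of this scheme for the Hamming case, with bit-sampling gj’s and points stored as tuples of bits; class and parameter names are illustrative:

    import random
    from collections import defaultdict

    class LSHIndex:
        def __init__(self, d, k, l, dist, seed=None):
            rng = random.Random(seed)
            # g_j = concatenation of k randomly sampled bit positions
            self.g = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
            self.tables = [defaultdict(set) for _ in range(l)]
            self.dist = dist

        def _key(self, j, p):
            return tuple(p[i] for i in self.g[j])

        def insert(self, p):
            for j, tab in enumerate(self.tables):
                tab[self._key(j, p)].add(p)

        def delete(self, p):
            for j, tab in enumerate(self.tables):
                tab[self._key(j, p)].discard(p)

        def query(self, q):
            # inspect buckets g_1(q),...,g_l(q); stop after about 3l candidates
            best, best_d, seen = None, float("inf"), 0
            for j, tab in enumerate(self.tables):
                for p in tab.get(self._key(j, q), ()):
                    dpq = self.dist(p, q)
                    if dpq < best_d:
                        best, best_d = p, dpq
                    seen += 1
                    if seen >= 3 * len(self.tables):
                        return best
            return best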
Collision Probability vs. Distance
[Plot: collision probability Pr[h(p)=h(q)] decreases with D(p,q), from 1 down to 0; it is at least Q1 for D(p,q) ≤ r and at most Q2 for D(p,q) ≥ (1+ε)r]
Multi-Index versus Error
• Set l = N^z where z = log(1/Q1) / log(1/Q2)
• Theorem – For l = N^z, any query returns r-near neighbor correctly with probability at least 1/6.
• Consequently (ignoring k = O(log N) factors)
  – Time O(d·N^z)
  – Space O(N^(1+z))
• Hamming Metric ⟹ z ≤ 1/(1+ε)
• Boost Probability – use several parallel hash-tables
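For concreteness, a small helper (illustrative name) that derives k, l and the exponent z for the bit-sampling family, taking Q1 = 1 – r/d and Q2 = 1 – (1+ε)r/d:

    import math

    def lsh_parameters(N, d, r, eps):
        Q1 = 1.0 - r / d
        Q2 = 1.0 - (1.0 + eps) * r / d
        k = math.ceil(math.log(N) / math.log(1.0 / Q2))   # drives Q2^k down to about 1/N
        z = math.log(1.0 / Q1) / math.log(1.0 / Q2)       # roughly at most 1/(1+eps)
        l = math.ceil(N ** z)
        return k, l, z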
Analysis
• Define (for fixed query q)
  – p* – any point with D(q,p*) < r
  – FAR(q) – all p with D(q,p) > (1+ε)r
  – BUCKET(q,j) – all p with gj(p) = gj(q)
• Event Esize: total number of FAR(q) points in BUCKET(q,1), …, BUCKET(q,l) is at most 3l (query cost bounded by O(dl))
• Event ENN: gj(p*) = gj(q) for some j (nearest point in l buckets is r-near neighbor)
• Analysis
  – Show: Pr[Esize] = x > 2/3 and Pr[ENN] = y > 1/2
  – Thus: Pr[not(Esize & ENN)] < (1-x) + (1-y) < 5/6
Analysis – Bad Collisions
• Choose k = log(N) / log(1/Q2), so Q2^k = 1/N
• Fact – for any p in FAR(q), Pr[gj(p) = gj(q)] ≤ Q2^k = 1/N
• Clearly – E[ number of FAR(q) points over all l buckets ] ≤ l·N·(1/N) = l
• Markov Inequality – Pr[X > r·E[X]] < 1/r, for X > 0
• Lemma 1 – applying Markov with factor 3: Pr[Esize] > 2/3
Analysis – Good Collisions
• Observe – Pr[gj(p*) = gj(q)] ≥ Q1^k = Q1^(log(N)/log(1/Q2)) = N^(-z)
• Since l = N^z
  – Pr[gj(p*) ≠ gj(q) for all j] ≤ (1 – N^(-z))^(N^z) ≤ 1/e < 1/2
• Lemma 2 – Pr[ENN] > 1/2
Euclidean Norms
• Recall
  – x=(x1,x2, …,xd) and y=(y1,y2, …,yd) in R^d
• L1-norm
  – D(x,y) = Σi |xi – yi|
• Lp-norm (for p>1)
  – D(x,y) = ( Σi |xi – yi|^p )^(1/p)
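A one-line helper matching these definitions:

    def lp_distance(x, y, p=2):
        # L1 for p=1, Euclidean for p=2, general Lp for p >= 1
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)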
Extension to L1-Norm
• Round coordinates to {1,…,M}
• Embed L1-{1,…,M}^d into Hamming-{0,1}^(dM)
• Unary Mapping
  – coordinate xi ↦ xi ones followed by (M – xi) zeros, so L1 distance equals Hamming distance
• Apply algorithm for Hamming Spaces
• Error due to rounding of order 1/M
• Space-Time Overhead due to mapping of d → dM
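The unary mapping in code, assuming coordinates have already been rounded to integers in {1,…,M}:

    def unary_embed(x, M):
        # coordinate x_i -> x_i ones followed by (M - x_i) zeros;
        # L1 distance in {1,...,M}^d equals Hamming distance in {0,1}^(d*M)
        bits = []
        for xi in x:
            bits.extend([1] * xi + [0] * (M - xi))
        return tuple(bits)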
Extension to L2-Norm
• Observe
  – Little difference between L1-norm and L2-norm for high d
  – Additional error is small
• More generally – Lp, for 1 ≤ p ≤ 2
  – [Figiel et al 1977, Johnson-Schechtman 1982]
  – Can embed Lp into L1
  – Dimensions d → O(d)
  – Distances preserved within factor (1+α)
  – Key Idea – random rotation of space
Improved Bounds
• [Indyk-Motwani 1998]
  – For any Lp-norm
  – Query Time – O(log^3 N)
  – Space – N^O(1/ε^2)
• Problem – impractical
• Today – only a high-level sketch
Better Reduction
• Recall
  – Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
  – Space/Time Overhead – O(log R)
  – R = max distance in metric space
• Ring-Cover Trees
  – Removed dependence on R
  – Reduced overhead to O(polylog N)
Approximate r-Near Neighbors
• Idea
  – Impose regular grid on R^d
  – Decompose into cubes of side length s
  – Label cubes with points at distance < r
• Data Structure
  – Query q – determine cube containing q
  – Cube labels – candidate r-near neighbors
• Goals
  – Small s ⟹ lower error
  – Fewer cubes ⟹ smaller storage
[Figure: a regular grid; cells are labeled with the nearby points p1, p2, p3 whose r-balls intersect them]
Grid Analysis
• Assume r=1
• Choose side length s = ε/d
• Cube Diameter = s·d^(1/p) ≤ ε (for any Lp-norm, p ≥ 1)
• Number of cubes to label per point = (O(d/ε))^d = 2^O(d·log(d/ε))
Theorem – For any Lp-norm, can solve Approx r-Near Neighbor using
• Space – N·2^O(d·log(d/ε))
• Time – O(d)
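A rough sketch of the grid structure for r = 1; constants are loose, the class name is illustrative, and the labeling loop deliberately enumerates the exponentially many nearby cells that account for the space bound:

    import itertools, math
    from collections import defaultdict

    class GridIndex:
        # cells of side s = eps/d; each cell is labeled with every point that
        # comes within roughly distance 1 of it, so a query only reads its own cell
        def __init__(self, points, eps, dist):
            d = len(points[0])
            self.s = eps / d
            self.dist = dist
            self.labels = defaultdict(list)
            reach = math.ceil((1.0 + eps) / self.s)       # cells per axis to scan
            for p in points:
                c = self._cell(p)
                # this enumeration is (O(d/eps))^d cells -- the exponential space cost
                for off in itertools.product(range(-reach, reach + 1), repeat=d):
                    cell = tuple(ci + oi for ci, oi in zip(c, off))
                    center = tuple((ci + 0.5) * self.s for ci in cell)
                    if self.dist(p, center) <= 1.0 + eps:
                        self.labels[cell].append(p)

        def _cell(self, x):
            return tuple(math.floor(xi / self.s) for xi in x)

        def query(self, q):
            # any label of q's cell is within about (1 + O(eps)) of q,
            # provided some point lies within distance 1 of q
            cands = self.labels.get(self._cell(q), [])
            return min(cands, key=lambda p: self.dist(p, q)) if cands else None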
Dimensionality Reduction
[Johnson-Lindenstrauss 84, Frankl-Maehara 88] For any 0 < ε < 1, can map points in P into a subspace of dimension O((log N)/ε^2) while preserving all inter-point distances to within a factor (1+ε)
• Proof idea – project onto random lines
• Result for NN
  – Space – N^O(1/ε^2)
  – Time – O(polylog N)
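A plain-Python sketch of the “project onto random lines” idea, using Gaussian directions; t plays the role of the reduced dimension O((log N)/ε^2):

    import math, random

    def jl_project(points, t, seed=None):
        # map d-dimensional points to t dimensions via random Gaussian directions;
        # with t = O((log N)/eps^2), all pairwise L2 distances are preserved
        # to within a (1 + eps) factor with high probability
        rng = random.Random(seed)
        d = len(points[0])
        R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(t)]
        scale = 1.0 / math.sqrt(t)
        return [tuple(scale * sum(row[j] * p[j] for j in range(d)) for row in R)
                for p in points]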
References
• Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. P. Indyk and R. Motwani. STOC 1998.
• Similarity Search in High Dimensions via Hashing. A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.