k-Nearest Neighbors Search in High Dimensions
Tomer Peled, Dan Kushnir
Tell me who your neighbors are, and I'll know who you are
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
Nearest Neighbor Search – Problem definition • Given: a set P of n points in R^d, over some distance metric • Find: the nearest neighbor p of the query point q in P
Applications • Classification • Clustering • Segmentation • Indexing • Dimension reduction (e.g. LLE)
[Figure: a query point q in a 2D feature space with axes Weight and Color]
Naïve solution • No preprocessing • Given a query point q, go over all n points and do the comparison in R^d • Query time = O(nd) • Keep this baseline in mind
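A minimal sketch of this baseline (Python/NumPy; the Euclidean metric here is an assumption, the slides leave the metric generic):

import numpy as np

def nearest_neighbor(P, q):
    # O(nd) scan: one distance computation per point, keep the minimum
    dists = np.linalg.norm(P - q, axis=1)
    return int(np.argmin(dists))

P = np.random.rand(1000, 16)   # n = 1000 points in d = 16
q = np.random.rand(16)
print(nearest_neighbor(P, q))  # index of q's nearest neighbor in P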
Common solution • Use a data structure for acceleration • Scalability with n and with d is important
When to use nearest neighbors – high-level view • Parametric methods: probability distribution estimation • Non-parametric methods: density estimation, nearest neighbors • Nearest neighbors suit complex models, sparse data, and high dimensions • Assuming no prior knowledge about the underlying probability structure
Nearest Neighbor • The closest point to q: min over pi ∈ P of dist(q, pi)
(r, ε)-Nearest Neighbor • If there is a point p1 with dist(q, p1) ≤ r, return some p2 with dist(q, p2) ≤ (1 + ε)·r • r2 = (1 + ε)·r1
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
The simplest solution • "Lion in the desert": fence off half the space, see which side the lion is on, and repeat – halve the search space until the neighbor is cornered
Quadtree • Split each dimension into 2 (four quadrants in 2D) • Repeat recursively • Stop when each cell contains no more than 1 data point
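A minimal 2D sketch of this construction (illustrative code, not from the talk; assumes distinct points):

class QuadTree:
    # Minimal 2D quadtree: each leaf cell holds at most one data point.
    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)
        self.point = None       # the stored point, if this is a non-empty leaf
        self.children = None    # the four sub-cells, once this cell is split

    def insert(self, p):
        if self.children is not None:       # internal node: descend
            self._child(p).insert(p)
        elif self.point is None:            # empty leaf: store here
            self.point = p
        else:                               # full leaf: split and reinsert
            old, self.point = self.point, None
            x0, y0, x1, y1 = self.bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                             QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
            self._child(old).insert(old)
            self._child(p).insert(p)

    def _child(self, p):
        # Pick the quadrant of p by comparing against the cell midpoint
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]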
Quadtree – structure
[Figure: the root cell splits at (X1, Y1) into four children: (P<X1, P<Y1), (P<X1, P≥Y1), (P≥X1, P<Y1), (P≥X1, P≥Y1)]
Quadtree – query
[Figure: the query descends the same splits at (X1, Y1) to the quadrant containing q]
In many cases this works
Quadtree – pitfall 1
[Figure: the same splits, but the true nearest neighbor lies in a different quadrant than q, so the search must backtrack across cell borders]
In some cases it doesn't
Quadtree – pitfall 1
[Figure: a point configuration where many cells must be inspected]
In some cases nothing works
Quadtree – pitfall 2 • Could result in query time exponential in the number of dimensions: O(2^d)
Space-partition based algorithms • Could be improved • See the survey "Multidimensional Access Methods" by Volker Gaede and Oliver Günther
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
Curse of dimensionality • Exact solutions take query time or space O(min(nd, n^d)) – the naïve scan is the O(nd) side • For d > 10..20 this is worse than a sequential scan for most geometric distributions • Techniques specific to high dimensions are needed • Shown in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002
Curse of dimensionality – some intuition • Halving each of d dimensions multiplies the number of cells: 2, 2^2, 2^3, …, 2^d • For d = 20 that is already about a million cells
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l1 & l2
Hash function
[Diagram: Data item → Hash function → Key → Bin / Bucket]
Hash function as a data structure • X = a number in the range 0..n • h(X) = X modulo 3 → a storage address in 0..2 • Usually we would like related data items to be stored in the same bin
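A toy version of this example (Python):

def h(x):
    # Map a number in 0..n to one of 3 storage addresses
    return x % 3

print(h(7), h(10))  # 1 1 -- both land in bin 1

Plain modulo hashing gives no control over which items collide; locality sensitive hashing, next, is designed so that it is the nearby items that collide.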
Recall: (r, ε)-Nearest Neighbor • If dist(q, p1) ≤ r, return some p2 with dist(q, p2) ≤ (1 + ε)·r • r2 = (1 + ε)·r1
Locality sensitive hashing • A hash family is (r, ε, P1, P2)-sensitive if: • Pr[I(p) = I(q)] ≥ P1 ("high") when p is "close" to q: dist(p, q) ≤ r • Pr[I(p) = I(q)] ≤ P2 ("low") when p is "far" from q: dist(p, q) ≥ (1 + ε)·r • with r2 = (1 + ε)·r1 and P1 > P2
Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l1 & l2
Hamming space • Hamming space = the 2^N binary strings of length N • Hamming distance = number of differing digits, a.k.a. signal distance • (Named after Richard Hamming)
Hamming distance • distance(x1, x2) = SUM(x1 XOR x2) • Example: distance(010100001111, 010010000011) = 4
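As a quick sketch (Python; integers stand in for the bit strings):

def hamming(x1, x2):
    # Number of differing bits = popcount(x1 XOR x2)
    return bin(x1 ^ x2).count("1")

print(hamming(0b010100001111, 0b010010000011))  # 4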
L1 to Hamming space embedding • Encode each coordinate in unary: value v (0 ≤ v ≤ C) becomes v ones followed by C − v zeros • d' = C·d • Example (C = 11): p = (2, 8) → 11000000000 11111111000 • The L1 distance between points equals the Hamming distance between their embeddings
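A sketch of this embedding (the function name is illustrative):

def unary_embed(p, C):
    # Each coordinate v in 0..C becomes v ones followed by C - v zeros,
    # so two embeddings differ in exactly |v - u| bits per coordinate.
    return "".join("1" * v + "0" * (C - v) for v in p)

print(unary_embed([2, 8], C=11))  # 1100000000011111111000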
Hash function • p ∈ H^d', e.g. p = 11000000000 11111111000 • L hash functions, j = 1..L, each sampling k digits (here k = 3): Gj(p) = p|Ij – the bits of p at the index set Ij • Store p in the bucket labeled p|Ij, one of 2^k buckets • Example: p|Ij = 101
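In code, one such hash function is just a fixed set of k sampled bit positions (a sketch; the exact positions depend on the random seed):

import random

def make_bit_sampler(d_prime, k, seed=0):
    # Gj(p) = p|Ij : read off the bits of p at the k positions in Ij
    I_j = random.Random(seed).sample(range(d_prime), k)
    return lambda p: "".join(p[i] for i in I_j)

g = make_bit_sampler(d_prime=22, k=3)
print(g("1100000000011111111000"))  # a 3-bit bucket label, e.g. 101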
Construction
[Figure: each point p is inserted into all L hash tables, 1..L]
Query
[Figure: q is hashed into each of the L tables, 1..L; the points colliding with q are the candidate neighbors]
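Putting construction and query together, a minimal sketch of the whole scheme (class and parameter names are illustrative; the returned candidates still need exact distance verification):

import random
from collections import defaultdict

class BitSamplingLSH:
    # L tables; table j is keyed by the k bits of p at positions I[j]
    def __init__(self, d_prime, k=3, L=10, seed=0):
        rng = random.Random(seed)
        self.I = [rng.sample(range(d_prime), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        return "".join(p[i] for i in self.I[j])   # Gj(p) = p|Ij

    def insert(self, p):                          # construction: all L tables
        for j, table in enumerate(self.tables):
            table[self._key(p, j)].append(p)

    def query(self, q):                           # query: union of L buckets
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(q, j), []))
        return candidates

Raising k makes far points collide less often; raising L repeats the experiment so that near points are still found in at least one table.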
Alternative intuition: random projections • The embedded point p = (2, 8), C = 11, d' = C·d • Sampling one bit of the unary embedding asks "is coordinate i at least t?" – an axis-parallel cut of the grid
[Figure: the grid of coordinate values with p = (2, 8) and its encoding 11000000000 11111111000, cut by random thresholds]
Alternative intuition: random projections • k = 3 sampled bits = 3 random cuts, partitioning the space into 2^3 buckets labeled 000..111 • p falls into bucket 101
[Figure: the grid partitioned by three cuts; buckets 000, 001, 100, 101, 110, 111 marked]
Secondary hashing • Supports tuning dataset size vs. storage volume • A simple hash maps the 2^k buckets into M buckets of size B, with M·B = α·n (e.g. α = 2)
The above hashing is locality-sensitive • Pr[p, q in the same bucket] = (1 − Ham(p, q)/d')^k
[Figure: collision probability vs. distance(q, pi), for k = 1 and k = 2 – larger k makes the probability fall off more sharply]
Adapted from Piotr Indyk's slides
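A worked instance of this formula with illustrative numbers (d' = 100):

def collision_prob(ham, d_prime, k):
    # Each of the k sampled bits matches with probability 1 - ham/d'
    return (1 - ham / d_prime) ** k

print(collision_prob(5, 100, 2))   # close pair: 0.9025
print(collision_prob(50, 100, 2))  # far pair:   0.25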
Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l2
Direct L2 solution • New hashing function • Still based on sampling • Uses a mathematical trick: p-stable distributions for the Lp distance • The Gaussian distribution for the L2 distance
Central limit theorem • v1·X1 + v2·X2 + … + vn·Xn • A weighted sum of Gaussians is itself a (scaled) Gaussian
Central limit theorem • v1·X1 + v2·X2 + … + vn·Xn • v1..vn = real numbers • X1..Xn = independent identically distributed (i.i.d.)
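A quick numerical check of this fact (illustrative; here the norm of v is 5):

import numpy as np

rng = np.random.default_rng(0)
v = np.array([3.0, 4.0])               # ||v||_2 = 5
X = rng.standard_normal((100000, 2))   # rows of i.i.d. N(0,1) samples
s = X @ v                              # v1*X1 + v2*X2, per sample
print(s.std())                         # ~5.0: Gaussian with std ||v||_2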
Central limit theorem • For a vector a of i.i.d. Gaussians, the dot product ⟨v, a⟩ = Σ vi·ai is distributed as ||v||·N(0, 1) • Projecting onto a Gaussian vector preserves the norm, in distribution
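This is what enables a direct L2 hash: project onto a random Gaussian vector, then quantize. A minimal sketch, with the bucket width w as an assumed tuning parameter (not fixed by the slides):

import numpy as np

def make_l2_hash(d, w=4.0, seed=0):
    # h(v) = floor((<a, v> + b) / w): a ~ N(0, I) by the argument above,
    # b ~ U[0, w) randomizes the bucket boundaries
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(d)
    b = rng.uniform(0, w)
    return lambda v: int(np.floor((np.dot(a, v) + b) / w))

h = make_l2_hash(d=16)
v1 = np.random.rand(16)
v2 = v1 + 0.01                 # a nearby point
print(h(v1), h(v2))            # usually equal: close points share a bucket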