k-Nearest Neighbors Search in High Dimensions
Tomer Peled, Dan Kushnir
Tell me who your neighbors are, and I'll know who you are
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
Nearest Neighbor Search – Problem definition • Given: a set P of n points in R^d, over some distance metric • Find: the nearest neighbor p of the query point q in P
Applications • Classification • Clustering • Segmentation • Indexing • Dimension reduction (e.g. LLE)
[Figure: a query point q in a 2D feature space with axes Weight and Color]
Naïve solution • No preprocessing • Given a query point q, go over all n points and do the comparison in R^d • Query time = O(nd) • Keep this baseline in mind
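A minimal sketch of this baseline (Python/NumPy; the Euclidean metric here is an assumption, the slides leave the metric generic):

import numpy as np

def nearest_neighbor(P, q):
    # O(nd) scan: one distance computation per point, keep the minimum
    dists = np.linalg.norm(P - q, axis=1)
    return int(np.argmin(dists))

P = np.random.rand(1000, 16)   # n = 1000 points in d = 16
q = np.random.rand(16)
print(nearest_neighbor(P, q))  # index of q's nearest neighbor in P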
Common solution • Use a data structure for acceleration • Scalability with n and with d is important
When to use nearest neighbors – high-level view • Parametric methods: probability distribution estimation • Non-parametric methods: density estimation, nearest neighbors • Nearest neighbors suit complex models, sparse data, and high dimensions • Assuming no prior knowledge about the underlying probability structure
Nearest Neighbor • The closest point to q: min over pi ∈ P of dist(q, pi)
(r, ε)-Nearest Neighbor • If there is a point p1 with dist(q, p1) ≤ r, return some p2 with dist(q, p2) ≤ (1 + ε)·r • r2 = (1 + ε)·r1
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
The simplest solution • "Lion in the desert": fence off half the space, see which side the lion is on, and repeat – halve the search space until the neighbor is cornered
Quadtree • Split each dimension into 2 (four quadrants in 2D) • Repeat recursively • Stop when each cell contains no more than 1 data point
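A minimal 2D sketch of this construction (illustrative code, not from the talk; assumes distinct points):

class QuadTree:
    # Minimal 2D quadtree: each leaf cell holds at most one data point.
    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)
        self.point = None       # the stored point, if this is a non-empty leaf
        self.children = None    # the four sub-cells, once this cell is split

    def insert(self, p):
        if self.children is not None:       # internal node: descend
            self._child(p).insert(p)
        elif self.point is None:            # empty leaf: store here
            self.point = p
        else:                               # full leaf: split and reinsert
            old, self.point = self.point, None
            x0, y0, x1, y1 = self.bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                             QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
            self._child(old).insert(old)
            self._child(p).insert(p)

    def _child(self, p):
        # Pick the quadrant of p by comparing against the cell midpoint
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]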
Quadtree – structure
[Figure: the root cell splits at (X1, Y1) into four children: (P<X1, P<Y1), (P<X1, P≥Y1), (P≥X1, P<Y1), (P≥X1, P≥Y1)]
Quadtree – query
[Figure: the query descends the same splits at (X1, Y1) to the quadrant containing q]
In many cases this works
Quadtree – pitfall 1
[Figure: the same splits, but the true nearest neighbor lies in a different quadrant than q, so the search must backtrack across cell borders]
In some cases it doesn't
Quadtree – pitfall 1
[Figure: a point configuration where many cells must be inspected]
In some cases nothing works
Quadtree – pitfall 2 • Could result in query time exponential in the number of dimensions: O(2^d)
Space-partition based algorithms • Could be improved • See the survey "Multidimensional Access Methods" by Volker Gaede and Oliver Günther
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
Curse of dimensionality • Exact solutions take query time or space O(min(nd, n^d)) – the naïve scan is the O(nd) side • For d > 10..20 this is worse than a sequential scan for most geometric distributions • Techniques specific to high dimensions are needed • Shown in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002
Curse of dimensionality – some intuition • Halving each of d dimensions multiplies the number of cells: 2, 2^2, 2^3, …, 2^d • For d = 20 that is already about a million cells
Outline • Problem definition and flavors • Algorithms overview - low dimensions • Curse of dimensionality (d>10..20) • Enchanting the curse – Locality Sensitive Hashing (high-dimension approximate solutions) • l2 extension • Applications (Dan)
Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l1 & l2
Hash function
[Diagram: Data item → Hash function → Key → Bin / Bucket]
Hash function as a data structure • X = a number in the range 0..n • h(X) = X modulo 3 → a storage address in 0..2 • Usually we would like related data items to be stored in the same bin
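A toy version of this example (Python):

def h(x):
    # Map a number in 0..n to one of 3 storage addresses
    return x % 3

print(h(7), h(10))  # 1 1 -- both land in bin 1

Plain modulo hashing gives no control over which items collide; locality sensitive hashing, next, is designed so that it is the nearby items that collide.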
Recall: (r, ε)-Nearest Neighbor • If dist(q, p1) ≤ r, return some p2 with dist(q, p2) ≤ (1 + ε)·r • r2 = (1 + ε)·r1
Locality sensitive hashing • A hash family is (r, ε, P1, P2)-sensitive if: • Pr[I(p) = I(q)] ≥ P1 ("high") when p is "close" to q: dist(p, q) ≤ r • Pr[I(p) = I(q)] ≤ P2 ("low") when p is "far" from q: dist(p, q) ≥ (1 + ε)·r • with r2 = (1 + ε)·r1 and P1 > P2
Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l1 & l2
Hamming space • Hamming space = the 2^N binary strings of length N • Hamming distance = number of differing digits, a.k.a. signal distance • (Named after Richard Hamming)
Hamming distance • distance(x1, x2) = SUM(x1 XOR x2) • Example: distance(010100001111, 010010000011) = 4
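As a quick sketch (Python; integers stand in for the bit strings):

def hamming(x1, x2):
    # Number of differing bits = popcount(x1 XOR x2)
    return bin(x1 ^ x2).count("1")

print(hamming(0b010100001111, 0b010010000011))  # 4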
L1 to Hamming space embedding • Encode each coordinate in unary: value v (0 ≤ v ≤ C) becomes v ones followed by C − v zeros • d' = C·d • Example (C = 11): p = (2, 8) → 11000000000 11111111000 • The L1 distance between points equals the Hamming distance between their embeddings
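A sketch of this embedding (the function name is illustrative):

def unary_embed(p, C):
    # Each coordinate v in 0..C becomes v ones followed by C - v zeros,
    # so two embeddings differ in exactly |v - u| bits per coordinate.
    return "".join("1" * v + "0" * (C - v) for v in p)

print(unary_embed([2, 8], C=11))  # 1100000000011111111000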
Hash function • p ∈ H^d', e.g. p = 11000000000 11111111000 • L hash functions, j = 1..L, each sampling k digits (here k = 3): Gj(p) = p|Ij – the bits of p at the index set Ij • Store p in the bucket labeled p|Ij, one of 2^k buckets • Example: p|Ij = 101
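In code, one such hash function is just a fixed set of k sampled bit positions (a sketch; the exact positions depend on the random seed):

import random

def make_bit_sampler(d_prime, k, seed=0):
    # Gj(p) = p|Ij : read off the bits of p at the k positions in Ij
    I_j = random.Random(seed).sample(range(d_prime), k)
    return lambda p: "".join(p[i] for i in I_j)

g = make_bit_sampler(d_prime=22, k=3)
print(g("1100000000011111111000"))  # a 3-bit bucket label, e.g. 101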
Construction
[Figure: each point p is inserted into all L hash tables, 1..L]
Query
[Figure: q is hashed into each of the L tables, 1..L; the points colliding with q are the candidate neighbors]
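Putting construction and query together, a minimal sketch of the whole scheme (class and parameter names are illustrative; the returned candidates still need exact distance verification):

import random
from collections import defaultdict

class BitSamplingLSH:
    # L tables; table j is keyed by the k bits of p at positions I[j]
    def __init__(self, d_prime, k=3, L=10, seed=0):
        rng = random.Random(seed)
        self.I = [rng.sample(range(d_prime), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        return "".join(p[i] for i in self.I[j])   # Gj(p) = p|Ij

    def insert(self, p):                          # construction: all L tables
        for j, table in enumerate(self.tables):
            table[self._key(p, j)].append(p)

    def query(self, q):                           # query: union of L buckets
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(q, j), []))
        return candidates

Raising k makes far points collide less often; raising L repeats the experiment so that near points are still found in at least one table.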
Alternative intuition: random projections • The embedded point p = (2, 8), C = 11, d' = C·d • Sampling one bit of the unary embedding asks "is coordinate i at least t?" – an axis-parallel cut of the grid
[Figure: the grid of coordinate values with p = (2, 8) and its encoding 11000000000 11111111000, cut by random thresholds]
Alternative intuition: random projections • k = 3 sampled bits = 3 random cuts, partitioning the space into 2^3 buckets labeled 000..111 • p falls into bucket 101
[Figure: the grid partitioned by three cuts; buckets 000, 001, 100, 101, 110, 111 marked]
Secondary hashing • Supports tuning dataset size vs. storage volume • A simple hash maps the 2^k buckets into M buckets of size B, with M·B = α·n (e.g. α = 2)
The above hashing is locality-sensitive • Pr[p, q in the same bucket] = (1 − Ham(p, q)/d')^k
[Figure: collision probability vs. distance(q, pi), for k = 1 and k = 2 – larger k makes the probability fall off more sharply]
Adapted from Piotr Indyk's slides
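A worked instance of this formula with illustrative numbers (d' = 100):

def collision_prob(ham, d_prime, k):
    # Each of the k sampled bits matches with probability 1 - ham/d'
    return (1 - ham / d_prime) ** k

print(collision_prob(5, 100, 2))   # close pair: 0.9025
print(collision_prob(50, 100, 2))  # far pair:   0.25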
Preview • General Solution – Locality sensitive hashing • Implementation for Hamming space • Generalization to l2
Direct L2 solution • New hashing function • Still based on sampling • Uses a mathematical trick: p-stable distributions for the Lp distance • The Gaussian distribution for the L2 distance
Central limit theorem • v1·X1 + v2·X2 + … + vn·Xn • A weighted sum of Gaussians is itself a (scaled) Gaussian
Central limit theorem • v1·X1 + v2·X2 + … + vn·Xn • v1..vn = real numbers • X1..Xn = independent identically distributed (i.i.d.)
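A quick numerical check of this fact (illustrative; here the norm of v is 5):

import numpy as np

rng = np.random.default_rng(0)
v = np.array([3.0, 4.0])               # ||v||_2 = 5
X = rng.standard_normal((100000, 2))   # rows of i.i.d. N(0,1) samples
s = X @ v                              # v1*X1 + v2*X2, per sample
print(s.std())                         # ~5.0: Gaussian with std ||v||_2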
Central limit theorem • For a vector a of i.i.d. Gaussians, the dot product ⟨v, a⟩ = Σ vi·ai is distributed as ||v||·N(0, 1) • Projecting onto a Gaussian vector preserves the norm, in distribution
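This is what enables a direct L2 hash: project onto a random Gaussian vector, then quantize. A minimal sketch, with the bucket width w as an assumed tuning parameter (not fixed by the slides):

import numpy as np

def make_l2_hash(d, w=4.0, seed=0):
    # h(v) = floor((<a, v> + b) / w): a ~ N(0, I) by the argument above,
    # b ~ U[0, w) randomizes the bucket boundaries
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(d)
    b = rng.uniform(0, w)
    return lambda v: int(np.floor((np.dot(a, v) + b) / w))

h = make_l2_hash(d=16)
v1 = np.random.rand(16)
v2 = v1 + 0.01                 # a nearby point
print(h(v1), h(v2))            # usually equal: close points share a bucket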