This lecture covers the topic of Nearest Neighbor Search and introduces the concept of sketching short bit-strings to distinguish between close and far points. Various approaches for Nearest Neighbor Search are discussed, including linear scan, exhaustive storage, and Locality Sensitive Hashing. The lecture also explores LSH algorithms for Hamming space, Euclidean space, and other distance metrics.
Plan
• Distinguished Lecture: Quantum Computing — Oct 19, 11:30am, Davis Aud in CEPSR
• Nearest Neighbor Search
• Scribe?
Sketching
• sk(x): a short bit-string summarizing x
• Given sk(x) and sk(y), can distinguish between:
  • Close: d(x, y) ≤ r
  • Far: d(x, y) > (1 + ε)·r
• With high success probability: only δ failure prob.
• Hamming, ℓ₁ norm: O(ε⁻² · log(1/δ)) bits
• Is sk(x) ≈ sk(y)? Yes: close. No: far.
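A minimal illustration of the idea: for Hamming distance, sampling a few random coordinates already gives a short sketch whose disagreement rate estimates the distance. The helper names (`make_sketch`, `estimate_distance`) are hypothetical, and this toy version does not achieve the O(ε⁻² log(1/δ)) bit bound from the slide.

```python
import random

def make_sketch(d, k, seed=0):
    # Sample k coordinates (with replacement); the sketch of a d-bit
    # vector is just those k bits.  Toy sketch, not the full construction.
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: [x[i] for i in coords]

def estimate_distance(sx, sy, d):
    # Fraction of sampled bits that differ, rescaled to dimension d.
    diff = sum(a != b for a, b in zip(sx, sy))
    return diff * d / len(sx)
```

Comparing sketches of two vectors at Hamming distance 100 in d = 1000 dimensions returns an estimate concentrated around 100.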
NNS: approaches
• Sketch sk: uses k = O(ε⁻² · log n) bits
• 1: Linear scan++
  • Precompute sk(p) for each p in the dataset
  • Given query q, compute sk(q)
  • For each dataset point p, estimate distance using sk(q), sk(p)
• 2: Exhaustive storage++
  • For each possible sketch σ ∈ {0,1}^k
  • precompute the answer: a point p s.t. sk(p) = σ (if one exists)
  • On query q, output the point stored for sk(q)
  • Space: 2^k
• Near-linear space and sub-linear query time?
Locality Sensitive Hashing [Indyk-Motwani'98]
• Random hash function h satisfying:
  • for a close pair (when d(q, p) ≤ r): Pr[h(q) = h(p)] ≥ P₁ is "high"
  • for a far pair (when d(q, p′) > cr): Pr[h(q) = h(p′)] ≤ P₂ is "small"
• Use several hash tables: n^ρ of them, where ρ = log(1/P₁) / log(1/P₂) ("not-so-small")
LSH for Hamming space
• Hash function g is usually a concatenation of "primitive" functions: g(x) = ⟨h₁(x), h₂(x), …, h_k(x)⟩
• Fact 1: ρ(g) = ρ(h)
• Example: Hamming space {0,1}^d
  • h(x) = x_i, i.e., choose bit i at random
  • g(x) chooses k bits at random
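The concatenated hash can be sketched in a few lines; `make_g` is a hypothetical helper name. For x, y at Hamming distance r, each primitive hash collides with probability 1 − r/d, so Pr[g(x) = g(y)] = (1 − r/d)^k.

```python
import random

def make_g(d, k, rng):
    # g concatenates k "primitive" hashes; each primitive hash reads
    # one randomly chosen bit position of the input vector.
    idx = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in idx)
```

Empirically, for d = 100, r = 10, k = 3 the collision frequency sits near (0.9)³ ≈ 0.73.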
Full Algorithm
• Data structure is just L hash tables:
  • Each hash table uses a fresh random function g_j(x) = ⟨h₁(x), …, h_k(x)⟩
  • Hash all dataset points into the table
• Query:
  • Check for collisions in each of the L hash tables
  • until we encounter a point within distance cr
• Guarantees:
  • Space: O(nL), plus space to store points
  • Query time: O(L·(k + d)) (in expectation)
  • 50% probability of success.
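The full data structure fits in a short class; this is a hedged sketch for Hamming space with hypothetical names (`HammingLSH`), assuming points are bit tuples of length d.

```python
import random
from collections import defaultdict

class HammingLSH:
    """L hash tables; table j keyed by g_j(x), the concatenation of
    k randomly chosen bit positions.  Minimal sketch, no tuning."""

    def __init__(self, points, d, k, L, seed=0):
        rng = random.Random(seed)
        # One fresh random g_j (a list of k bit positions) per table.
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = []
        for g in self.gs:
            table = defaultdict(list)
            for p in points:
                table[tuple(p[i] for i in g)].append(p)
            self.tables.append(table)

    def query(self, q, cr):
        # Scan colliding buckets; stop at the first point within cr.
        for g, table in zip(self.gs, self.tables):
            for p in table.get(tuple(q[i] for i in g), []):
                if sum(a != b for a, b in zip(p, q)) <= cr:
                    return p
        return None
```

Querying with a point that equals a stored point collides in every table, so it is always found.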
Choice of parameters k, L?
• L hash tables with g(x) = ⟨h₁(x), …, h_k(x)⟩
• Pr[collision of far pair] ≤ P₂^k
• Pr[collision of close pair] ≥ P₁^k
• Success probability for a single hash table: P₁^k
• L = O(1/P₁^k) tables should suffice
• Runtime as a function of P₂?
• Hence set k s.t. P₂^k = 1/n ⇒ L = 1/P₁^k = n^ρ
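The parameter choice above can be computed directly; `lsh_params` is a hypothetical helper following the rule "pick k so that P₂^k ≈ 1/n, then L ≈ 1/P₁^k".

```python
import math

def lsh_params(n, p1, p2):
    # k chosen so that p2**k is about 1/n; then L = ceil(1 / p1**k),
    # which is roughly n**rho for rho = log(1/p1) / log(1/p2).
    k = math.ceil(math.log(n) / math.log(1 / p2))
    L = math.ceil(1 / p1 ** k)
    rho = math.log(1 / p1) / math.log(1 / p2)
    return k, L, rho
```

For instance, with n = 10⁶, P₁ = 0.9, P₂ = 0.5 this gives k = 20 and L = 9.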
Analysis: correctness
• Let p* be an r-near neighbor of q
• If p* does not exist, the algorithm can output anything
• Algorithm fails when: the near neighbor p* is not in the searched buckets
• Probability of failure:
  • Probability q, p* do not collide in one hash table: at most 1 − P₁^k
  • Probability they do not collide in any of the L hash tables: at most (1 − P₁^k)^L
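Spelled out, with L = 1/P₁^k as chosen above, the failure bound becomes:

```latex
\Pr[\text{failure}]
  \;\le\; \bigl(1 - P_1^{k}\bigr)^{L}
  \;=\; \Bigl(1 - \tfrac{1}{L}\Bigr)^{L}
  \;\le\; \tfrac{1}{e} \;\approx\; 0.37,
```

which matches the constant (≈50%) success probability claimed for the full algorithm.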
Analysis: Runtime
• Runtime dominated by:
  • Hash function evaluation: O(L·k) time
  • Distance computations to points in buckets
• Distance computations:
  • Care only about far points, at distance > cr
  • In one hash table, the probability a far point collides with q is at most P₂^k = 1/n
  • Expected number of far points in a bucket: n · (1/n) = 1
  • Over L hash tables, the expected number of far points is L
• Total: O(L·(k + d)) in expectation
LSH in practice (safety not guaranteed)
• If we want exact NNS, what is c?
• Can choose any parameters k, L
• Correct as long as L = Θ(1/P₁^k)
• Performance:
  • trade-off between # tables and false positives: larger k means fewer false positives, smaller k means fewer tables
  • the best choice will depend on dataset "quality"
LSH Algorithms
• Hamming space: ρ = 1/c, with space n^{1+ρ} and query time n^ρ
• Euclidean space: ρ = 1/c², with space n^{1+ρ} and query time n^ρ
• Table does not include:
  • O(nd) additive space
  • O(d) factor in query time
LSH Zoo ("to sketch or not to sketch")
• Hamming distance [IM'98]:
  • h: pick a random coordinate (or several)
• ℓ₁ (Manhattan) distance [AI'06]:
  • h: cell in a randomly shifted grid
• Jaccard distance between sets A, B: J(A, B) = |A ∩ B| / |A ∪ B|
  • h: pick a random permutation π on the universe; h(A) = min over a ∈ A of π(a) — min-wise hashing [Bro'97]
  • Claim: Pr[collision] = J(A, B)
  • Example: A = {be, not, or, to}, B = {not, or, to, sketch}; J(A, B) = 3/5
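Min-wise hashing is easy to check empirically; `make_minhash` is a hypothetical helper, and the collision frequency over many random permutations should approach J(A, B).

```python
import random

def make_minhash(universe, rng):
    # Draw one random permutation of the universe; the hash of a set
    # is the minimum rank of any of its elements under that permutation.
    perm = list(universe)
    rng.shuffle(perm)
    rank = {x: i for i, x in enumerate(perm)}
    return lambda A: min(rank[a] for a in A)
```

With the slide's example sets, the empirical collision rate sits near J(A, B) = 3/5.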
LSH for Euclidean distance [Datar-Immorlica-Indyk-Mirrokni'04]
• LSH function h(p):
  • pick a random line, and quantize
  • project the point onto the line: h(p) = ⌊(⟨g, p⟩ + b) / w⌋
  • g is a random Gaussian vector
  • b is random in [0, w]
  • w is a parameter (e.g., 4)
• Claim: ρ ≈ 1/c
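A minimal sketch of this hash, with the hypothetical helper name `make_euclidean_hash`: project onto a random Gaussian direction, shift by a random offset, and report the cell index of width w.

```python
import math
import random

def make_euclidean_hash(d, w, rng):
    # g: random Gaussian direction; b: random offset uniform in [0, w).
    # h(p) = floor((<g, p> + b) / w) -- the cell index on the shifted line.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return lambda p: math.floor(
        (sum(gi * pi for gi, pi in zip(g, p)) + b) / w)
```

Locality sensitivity shows up directly: over many random hash functions, a close pair lands in the same cell far more often than a far pair.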