Sketching, Sampling and other Sublinear Algorithms: Euclidean space, dimension reduction and NNS
Alex Andoni (MSR SVC)
A Sketching Problem
• Sketching: a map sk(·) from objects to short bit-strings
• given sk(x) and sk(y), we should be able to deduce whether x and y are "similar"
• Why? reduce the space and time needed to compute similarity
• Example: sketch "To sketch or not to sketch" to 010110 and "To be or not to be" to 010101, then compare the short bit-strings to decide whether the originals are similar
Sketch from LSH
• LSH often has the property: Pr[g(x) = g(y)] depends only on the distance between x and y
• Sketching from LSH: sk(x) = (g_1(x), g_2(x), …, g_k(x)) for k independent LSH functions
• Estimate Pr[g(x) = g(y)] by the fraction of collisions between sk(x) and sk(y)
• k controls the variance of the estimate
• [Broder'97]: min-hash sketch for the Jaccard coefficient
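As an illustration, here is a minimal Python sketch of the collision-based estimator, assuming min-hash functions in the spirit of [Broder'97]; all function names and the hashing scheme are illustrative, not from the talk.

```python
import random

def minhash_sketch(items, k, seed=0):
    """Sketch a set as k min-hash values: one LSH function per coordinate."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    # For each salt (one hash function), keep the minimum hash value over the set.
    return [min(hash((salt, item)) for item in items) for salt in salts]

def estimate_similarity(sk_x, sk_y):
    """Estimate Pr[g(x) = g(y)] by the fraction of coordinates that collide."""
    return sum(a == b for a, b in zip(sk_x, sk_y)) / len(sk_x)

# Toy example: two sentences viewed as sets of words (true Jaccard = 3/5).
x = set("to sketch or not to sketch".split())
y = set("to be or not to be".split())
print(estimate_similarity(minhash_sketch(x, 200), minhash_sketch(y, 200)))
```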
General Theory: embeddings
• The above map sk is an embedding
• General motivation: given a distance (metric) d, solve a computational problem under d
• Problems: compute the distance between two points; Nearest Neighbor Search; diameter/close-pair of a set S; clustering, MST, etc.
• Distances: Hamming distance; Euclidean distance (ℓ2); edit distance between two strings; Earth-Mover (transportation) Distance
• The map f: reduce problem <P under hard metric> to <P under simpler metric>
Embeddings: landscape
• Definition: an embedding is a map f : M → M' of a metric (M, d_M) into a host metric (M', d_M') such that for any x, y in M:
  d_M(x, y) ≤ d_M'(f(x), f(y)) ≤ c · d_M(x, y),
  where c ≥ 1 is the distortion (approximation) of the embedding f
• Embeddings come in all shapes and colors:
  • source/host spaces
  • distortion
  • can be randomized: the guarantee holds with some probability
  • time to compute
• Types of embeddings:
  • from a norm into the same norm but of lower dimension (dimension reduction)
  • from non-norms (edit distance, Earth-Mover Distance) into a norm (ℓ1)
  • from a given finite metric (shortest path on a planar graph) into a norm (ℓ1)
  • sketches: not a metric but a computational procedure
Dimension Reduction
• Johnson-Lindenstrauss Lemma: there is a linear map A : R^d → R^k, with k = O(ε^-2 · log 1/δ), that preserves the distance between two fixed vectors
  • up to distortion 1 ± ε
  • with probability 1 − δ
• Preserves distances among n points for k = O(ε^-2 · log n)
• Motivation:
  • e.g.: diameter of a pointset S of n points in d-dimensional Euclidean space
  • trivially: O(n² · d) time
  • using the lemma: O(n² · ε^-2 · log n) time, for a 1 + ε approximation
• MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering)…
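To make the diameter example concrete, here is a hypothetical NumPy sketch; the constant in the choice of k and the function name are illustrative assumptions, not part of the lemma.

```python
import numpy as np

def approx_diameter(points, eps=0.5, seed=0):
    """Approximate the diameter of n points in R^d via a random projection to k dims."""
    n, d = points.shape
    k = int(np.ceil(4.0 / eps**2 * np.log(n)))   # k = O(eps^-2 log n); constant is illustrative
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(k, d)) / np.sqrt(k)     # the JL map f(x) = A x
    proj = points @ A.T                          # project all n points: now n x k
    # Brute-force pairwise distances in the reduced space: O(n^2 k) instead of O(n^2 d).
    diffs = proj[:, None, :] - proj[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).max()

pts = np.random.default_rng(1).normal(size=(200, 5000))
print(approx_diameter(pts))   # within roughly a 1 +- eps factor of the true diameter
```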
Main intuition
• The map A can simply be a projection onto a random subspace of dimension k
1D embedding
• How about one dimension (k = 1)?
• Map f : R^d → R
  • f(x) = g · x = Σ_i g_i x_i,
  • where g_1, …, g_d are iid normal (Gaussian) random variables: pdf(a) = e^(−a²/2)/√(2π), E[g] = 0, E[g²] = 1
• Why Gaussian?
  • Stability property: g_1 x_1 + g_2 x_2 + … + g_d x_d is distributed as g' · ‖x‖, where g' is also Gaussian
  • Equivalently: g is spherically distributed, i.e., has a random direction, and the projection onto a random direction depends only on the length of x
1D embedding
• Map f(x) = g · x
• For any x, y:
  • Linear: f(x) − f(y) = f(x − y)
  • Want: |f(x) − f(y)| ≈ ‖x − y‖
• Claim: for any z in R^d, we have
  • Expectation: E[f(z)²] = ‖z‖²
  • Standard deviation: std[f(z)²] = √2 · ‖z‖² (see the numerical check below)
• Proof:
  • enough to prove it for z = x − y, since f is linear
  • Expectation: E[f(z)²] = E[(Σ_i g_i z_i)²] = E[Σ_i g_i² z_i²] + E[Σ_{i≠j} g_i g_j z_i z_j] = Σ_i z_i² = ‖z‖²
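A quick Monte-Carlo check of the claim; this snippet is an illustrative addition, with an arbitrary test vector z.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=50)                        # an arbitrary vector in R^d (here d = 50)
samples = rng.normal(size=(100_000, 50)) @ z   # many independent draws of f(z) = g . z
print(np.mean(samples**2), z @ z)              # E[f(z)^2] should be close to ||z||^2
print(np.std(samples**2), np.sqrt(2) * (z @ z))  # std[f(z)^2] close to sqrt(2) * ||z||^2
```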
Full Dimension Reduction
• Just repeat the 1D embedding k times!
  • f(x) = (1/√k) · A x, where A is a k × d matrix of iid Gaussian random variables
• Want to prove: ‖f(x) − f(y)‖ = (1 ± ε) · ‖x − y‖
  • with probability 1 − δ, for k = O(ε^-2 · log 1/δ)
• OK to prove for a fixed z = x − y (by linearity)
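A minimal sketch of the full map, assuming the scaling f(x) = Ax/√k with iid N(0,1) entries in A; dimensions and seeds are arbitrary choices for the demo.

```python
import numpy as np

def jl_map(k, d, seed=0):
    """f(x) = A x / sqrt(k), with A a k x d matrix of iid N(0,1) entries."""
    A = np.random.default_rng(seed).normal(size=(k, d))
    return lambda x: A @ x / np.sqrt(k)

f = jl_map(k=300, d=10_000)
x, y = np.random.default_rng(1).normal(size=(2, 10_000))
print(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))  # close to 1, i.e., 1 +- eps
```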
Concentration
• ‖f(z)‖² is distributed as (g_1² + g_2² + … + g_k²) · ‖z‖² / k
  • where each g_i is distributed as a Gaussian
• The sum g_1² + … + g_k² is called the chi-squared distribution with k degrees of freedom
• Fact: chi-squared is very well concentrated:
  • equal to k(1 ± ε) with probability 1 − e^(−Ω(ε²k))
• Akin to the central limit theorem
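A small empirical illustration of the concentration fact; k, ε, and the sample count are assumed parameters for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps = 500, 0.1
chi2 = (rng.normal(size=(100_000, k)) ** 2).sum(axis=1)  # chi-squared with k degrees of freedom
print(np.mean(np.abs(chi2 - k) <= eps * k))               # fraction within k(1 +- eps): close to 1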
Dimension Reduction: wrap-up
• ‖f(x) − f(y)‖ = (1 ± ε) · ‖x − y‖ with high probability
• Extra:
  • Linear: can update f(x) as x changes
  • Can use random ±1 entries instead of Gaussians [AMS'96, Ach'01, TZ'04…]
  • Fast JL: can compute f(x) faster than O(k·d) time [AC'06, AL'07'09, DKS'10, KN'10'12…]
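For the ±1 variant mentioned above, a hedged sketch of one common form (the exact construction and scaling vary across the cited papers; this is an illustrative instance, not the talk's code).

```python
import numpy as np

def rademacher_jl(k, d, seed=0):
    """Same map as before, but with random +-1 entries instead of Gaussians."""
    A = np.random.default_rng(seed).choice([-1.0, 1.0], size=(k, d))
    return lambda x: A @ x / np.sqrt(k)

f = rademacher_jl(k=300, d=10_000)
x, y = np.random.default_rng(1).normal(size=(2, 10_000))
print(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))  # again close to 1
```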
NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni'04]
• Can use dimensionality reduction to get LSH for ℓ2
• LSH function h(p):
  • pick a random line, and quantize it into segments of length w
  • project point p onto the line: h(p) = ⌊(g · p + b) / w⌋
  • g is a random Gaussian vector
  • b is random in [0, w]
  • w is a parameter (e.g., 4)
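A minimal NumPy sketch of one such hash function; the parameter names g, b, w follow the slide, while the wrapper function is an illustrative assumption.

```python
import numpy as np

def make_l2_lsh(d, w=4.0, seed=0):
    """One LSH function for l_2: project onto a random Gaussian direction, shift, quantize."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=d)        # random Gaussian vector (defines the random line)
    b = rng.uniform(0, w)         # random shift in [0, w)
    return lambda p: int(np.floor((g @ p + b) / w))

h = make_l2_lsh(d=100)
p = np.random.default_rng(1).normal(size=100)
print(h(p), h(p + 0.01))          # nearby points usually land in the same bucket
```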
Near-Optimal LSH [A-Indyk'06]
• Regular grid → grid of balls (carve the space into randomly placed balls)
  • p can hit empty space, so take more such grids until p is in a ball
  • need (too) many grids of balls
  • start by projecting into dimension t (from 2D intuition to R^t)
• Analysis gives exponent ρ = 1/c² + o(1)
• Choice of reduced dimension t?
  • tradeoff between:
  • # hash tables, n^ρ, and
  • time to hash, t^O(t)
• Total query time: d·n^(1/c² + o(1))
Open question:
• More practical variant of the above hashing?
• Design a space partitioning of R^t that is
  • efficient: point location in poly(t) time
  • qualitative: regions are "sphere-like", i.e.
    [Prob. needle of length 1 is not cut] ≥ [Prob. needle of length c is not cut]^(c²)
Time-Space Trade-offs

  query time | space
  low        | high    (one hash table lookup!)
  medium     | medium
  high       | low
NNS beyond LSH
• Data-dependent partitions…
• Practice:
  • trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
  • often no guarantees
• Theory:
  • can improve standard LSH by random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn'??]
  • tree-based approach to the max-norm (ℓ∞)
Finale
• Dimension Reduction in Euclidean space
  • ℓ2: a random projection preserves distances
  • only O(ε^-2 · log n) dimensions for distances among n points!
• NNS for Euclidean space
  • random projections give LSH
  • even better with ball partitioning
  • or better with cool lattices?