Sketching, Sampling and other Sublinear Algorithms: Euclidean space, dimension reduction and NNS
Alex Andoni (MSR SVC)
A Sketching Problem
• Sketching: a map sk(·) from objects to short bit-strings
• given sk(x) and sk(y), we should be able to deduce whether x and y are "similar"
• Why? reduce the space and time needed to compute similarity
• Example: sketch "To sketch or not to sketch" to 010110 and "To be or not to be" to 010101, then compare the short bit-strings to decide whether the originals are similar
Sketch from LSH
• LSH often has the property: Pr[g(x) = g(y)] depends only on the distance between x and y
• Sketching from LSH: sk(x) = (g_1(x), g_2(x), …, g_k(x)) for k independent LSH functions
• Estimate Pr[g(x) = g(y)] by the fraction of collisions between sk(x) and sk(y)
• k controls the variance of the estimate
• [Broder'97]: min-hash sketch for the Jaccard coefficient
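As an illustration, here is a minimal Python sketch of the collision-based estimator, assuming min-hash functions in the spirit of [Broder'97]; all function names and the hashing scheme are illustrative, not from the talk.

```python
import random

def minhash_sketch(items, k, seed=0):
    """Sketch a set as k min-hash values: one LSH function per coordinate."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    # For each salt (one hash function), keep the minimum hash value over the set.
    return [min(hash((salt, item)) for item in items) for salt in salts]

def estimate_similarity(sk_x, sk_y):
    """Estimate Pr[g(x) = g(y)] by the fraction of coordinates that collide."""
    return sum(a == b for a, b in zip(sk_x, sk_y)) / len(sk_x)

# Toy example: two sentences viewed as sets of words (true Jaccard = 3/5).
x = set("to sketch or not to sketch".split())
y = set("to be or not to be".split())
print(estimate_similarity(minhash_sketch(x, 200), minhash_sketch(y, 200)))
```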
General Theory: embeddings
• The above map sk is an embedding
• General motivation: given a distance (metric) d, solve a computational problem under d
• Problems: compute the distance between two points; Nearest Neighbor Search; diameter/close-pair of a set S; clustering, MST, etc.
• Distances: Hamming distance; Euclidean distance (ℓ2); edit distance between two strings; Earth-Mover (transportation) Distance
• The map f: reduce problem <P under hard metric> to <P under simpler metric>
Embeddings: landscape
• Definition: an embedding is a map f : M → M' of a metric (M, d_M) into a host metric (M', d_M') such that for any x, y in M:
  d_M(x, y) ≤ d_M'(f(x), f(y)) ≤ c · d_M(x, y),
  where c ≥ 1 is the distortion (approximation) of the embedding f
• Embeddings come in all shapes and colors:
  • source/host spaces
  • distortion
  • can be randomized: the guarantee holds with some probability
  • time to compute
• Types of embeddings:
  • from a norm into the same norm but of lower dimension (dimension reduction)
  • from non-norms (edit distance, Earth-Mover Distance) into a norm (ℓ1)
  • from a given finite metric (shortest path on a planar graph) into a norm (ℓ1)
  • sketches: not a metric but a computational procedure
Dimension Reduction
• Johnson-Lindenstrauss Lemma: there is a linear map A : R^d → R^k, with k = O(ε^-2 · log 1/δ), that preserves the distance between two fixed vectors
  • up to distortion 1 ± ε
  • with probability 1 − δ
• Preserves distances among n points for k = O(ε^-2 · log n)
• Motivation:
  • e.g.: diameter of a pointset S of n points in d-dimensional Euclidean space
  • trivially: O(n² · d) time
  • using the lemma: O(n² · ε^-2 · log n) time, for a 1 + ε approximation
• MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering)…
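To make the diameter example concrete, here is a hypothetical NumPy sketch; the constant in the choice of k and the function name are illustrative assumptions, not part of the lemma.

```python
import numpy as np

def approx_diameter(points, eps=0.5, seed=0):
    """Approximate the diameter of n points in R^d via a random projection to k dims."""
    n, d = points.shape
    k = int(np.ceil(4.0 / eps**2 * np.log(n)))   # k = O(eps^-2 log n); constant is illustrative
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(k, d)) / np.sqrt(k)     # the JL map f(x) = A x
    proj = points @ A.T                          # project all n points: now n x k
    # Brute-force pairwise distances in the reduced space: O(n^2 k) instead of O(n^2 d).
    diffs = proj[:, None, :] - proj[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).max()

pts = np.random.default_rng(1).normal(size=(200, 5000))
print(approx_diameter(pts))   # within roughly a 1 +- eps factor of the true diameter
```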
Main intuition
• The map A can simply be a projection onto a random subspace of dimension k
1D embedding
• How about one dimension (k = 1)?
• Map f : R^d → R
  • f(x) = g · x = Σ_i g_i x_i,
  • where g_1, …, g_d are iid normal (Gaussian) random variables: pdf(a) = e^(−a²/2)/√(2π), E[g] = 0, E[g²] = 1
• Why Gaussian?
  • Stability property: g_1 x_1 + g_2 x_2 + … + g_d x_d is distributed as g' · ‖x‖, where g' is also Gaussian
  • Equivalently: g is spherically distributed, i.e., has a random direction, and the projection onto a random direction depends only on the length of x
1D embedding
• Map f(x) = g · x
• For any x, y:
  • Linear: f(x) − f(y) = f(x − y)
  • Want: |f(x) − f(y)| ≈ ‖x − y‖
• Claim: for any z in R^d, we have
  • Expectation: E[f(z)²] = ‖z‖²
  • Standard deviation: std[f(z)²] = √2 · ‖z‖² (see the numerical check below)
• Proof:
  • enough to prove it for z = x − y, since f is linear
  • Expectation: E[f(z)²] = E[(Σ_i g_i z_i)²] = E[Σ_i g_i² z_i²] + E[Σ_{i≠j} g_i g_j z_i z_j] = Σ_i z_i² = ‖z‖²
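A quick Monte-Carlo check of the claim; this snippet is an illustrative addition, with an arbitrary test vector z.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=50)                        # an arbitrary vector in R^d (here d = 50)
samples = rng.normal(size=(100_000, 50)) @ z   # many independent draws of f(z) = g . z
print(np.mean(samples**2), z @ z)              # E[f(z)^2] should be close to ||z||^2
print(np.std(samples**2), np.sqrt(2) * (z @ z))  # std[f(z)^2] close to sqrt(2) * ||z||^2
```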
Full Dimension Reduction
• Just repeat the 1D embedding k times!
  • f(x) = (1/√k) · A x, where A is a k × d matrix of iid Gaussian random variables
• Want to prove: ‖f(x) − f(y)‖ = (1 ± ε) · ‖x − y‖
  • with probability 1 − δ, for k = O(ε^-2 · log 1/δ)
• OK to prove for a fixed z = x − y (by linearity)
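A minimal sketch of the full map, assuming the scaling f(x) = Ax/√k with iid N(0,1) entries in A; dimensions and seeds are arbitrary choices for the demo.

```python
import numpy as np

def jl_map(k, d, seed=0):
    """f(x) = A x / sqrt(k), with A a k x d matrix of iid N(0,1) entries."""
    A = np.random.default_rng(seed).normal(size=(k, d))
    return lambda x: A @ x / np.sqrt(k)

f = jl_map(k=300, d=10_000)
x, y = np.random.default_rng(1).normal(size=(2, 10_000))
print(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))  # close to 1, i.e., 1 +- eps
```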
Concentration
• ‖f(z)‖² is distributed as (g_1² + g_2² + … + g_k²) · ‖z‖² / k
  • where each g_i is distributed as a Gaussian
• The sum g_1² + … + g_k² is called the chi-squared distribution with k degrees of freedom
• Fact: chi-squared is very well concentrated:
  • equal to k(1 ± ε) with probability 1 − e^(−Ω(ε²k))
• Akin to the central limit theorem
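A small empirical illustration of the concentration fact; k, ε, and the sample count are assumed parameters for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps = 500, 0.1
chi2 = (rng.normal(size=(100_000, k)) ** 2).sum(axis=1)  # chi-squared with k degrees of freedom
print(np.mean(np.abs(chi2 - k) <= eps * k))               # fraction within k(1 +- eps): close to 1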
Dimension Reduction: wrap-up
• ‖f(x) − f(y)‖ = (1 ± ε) · ‖x − y‖ with high probability
• Extra:
  • Linear: can update f(x) as x changes
  • Can use random ±1 entries instead of Gaussians [AMS'96, Ach'01, TZ'04…]
  • Fast JL: can compute f(x) faster than O(k·d) time [AC'06, AL'07'09, DKS'10, KN'10'12…]
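For the ±1 variant mentioned above, a hedged sketch of one common form (the exact construction and scaling vary across the cited papers; this is an illustrative instance, not the talk's code).

```python
import numpy as np

def rademacher_jl(k, d, seed=0):
    """Same map as before, but with random +-1 entries instead of Gaussians."""
    A = np.random.default_rng(seed).choice([-1.0, 1.0], size=(k, d))
    return lambda x: A @ x / np.sqrt(k)

f = rademacher_jl(k=300, d=10_000)
x, y = np.random.default_rng(1).normal(size=(2, 10_000))
print(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))  # again close to 1
```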
NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni'04]
• Can use dimensionality reduction to get LSH for ℓ2
• LSH function h(p):
  • pick a random line, and quantize it into segments of length w
  • project point p onto the line: h(p) = ⌊(g · p + b) / w⌋
  • g is a random Gaussian vector
  • b is random in [0, w]
  • w is a parameter (e.g., 4)
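A minimal NumPy sketch of one such hash function; the parameter names g, b, w follow the slide, while the wrapper function is an illustrative assumption.

```python
import numpy as np

def make_l2_lsh(d, w=4.0, seed=0):
    """One LSH function for l_2: project onto a random Gaussian direction, shift, quantize."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=d)        # random Gaussian vector (defines the random line)
    b = rng.uniform(0, w)         # random shift in [0, w)
    return lambda p: int(np.floor((g @ p + b) / w))

h = make_l2_lsh(d=100)
p = np.random.default_rng(1).normal(size=100)
print(h(p), h(p + 0.01))          # nearby points usually land in the same bucket
```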
Near-Optimal LSH [A-Indyk'06]
• Regular grid → grid of balls (carve the space into randomly placed balls)
  • p can hit empty space, so take more such grids until p is in a ball
  • need (too) many grids of balls
  • start by projecting into dimension t (from 2D intuition to R^t)
• Analysis gives exponent ρ = 1/c² + o(1)
• Choice of reduced dimension t?
  • tradeoff between:
  • # hash tables, n^ρ, and
  • time to hash, t^O(t)
• Total query time: d·n^(1/c² + o(1))
Open question:
• More practical variant of the above hashing?
• Design a space partitioning of R^t that is
  • efficient: point location in poly(t) time
  • qualitative: regions are "sphere-like", i.e.
    [Prob. needle of length 1 is not cut] ≥ [Prob. needle of length c is not cut]^(c²)
Time-Space Trade-offs

  query time | space
  low        | high    (one hash table lookup!)
  medium     | medium
  high       | low
NNS beyond LSH
• Data-dependent partitions…
• Practice:
  • trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
  • often no guarantees
• Theory:
  • can improve standard LSH by random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn'??]
  • tree-based approach to the max-norm (ℓ∞)
Finale
• Dimension Reduction in Euclidean space
  • ℓ2: a random projection preserves distances
  • only O(ε^-2 · log n) dimensions for distances among n points!
• NNS for Euclidean space
  • random projections give LSH
  • even better with ball partitioning
  • or better with cool lattices?