Lecture 7: Dynamic Sampling & Dimension Reduction
Plan
• Admin:
  • PSet 2 released later today, due next Wed
  • Alex office hours: Tue 2:30-4:30
• Plan:
  • Dynamic streaming graph algorithms
  • Section 2: Dimension Reduction & Sketching
  • Scribe?
Sub-Problem: dynamic sampling
• Stream: general updates (i, δ) to a vector x ∈ R^n, i.e., x_i ← x_i + δ
• Goal:
  • Output an index i ∈ support(x) = {i : x_i ≠ 0} with probability 1/||x||_0 (uniform over the non-zero coordinates)
Dynamic Sampling
• Goal: output i ∈ support(x) with probability 1/||x||_0
• Let s = ||x||_0 = |support(x)|
• Intuition:
  • Suppose s ≤ 10:
    • How can we sample i with x_i ≠ 0? Each non-zero i is a (1/10)-heavy hitter of x
    • Use CountSketch to recover all of them
    • O(log n) words of space total
  • Suppose s ≈ 100:
    • Downsample: pick a random set I ⊆ [n] s.t. Pr[i ∈ I] = 1/10 for every i
    • Focus on the substream on I only (ignore the rest)
    • What's E[||x_I||_0]? In expectation = 100 · (1/10) = 10
    • Use CountSketch on the downsampled stream x_I as before
  • In general: s is unknown, so prepare for all levels j = 0, 1, …, log n, downsampling at rate 2^{-j}
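A quick sanity check on the downsampling step: keeping each coordinate with probability 1/10 shrinks a support of size 100 to about 10 in expectation. A toy Python simulation (the function name is illustrative):

```python
import random

def downsample_support(support, rate, rng):
    """Keep each non-zero coordinate independently with probability `rate`."""
    return [i for i in support if rng.random() < rate]

rng = random.Random(0)
support = list(range(100))   # s = 100 non-zero coordinates
trials = 10_000
avg = sum(len(downsample_support(support, 0.1, rng))
          for _ in range(trials)) / trials
# E[||x_I||_0] = s * rate = 100 * 0.1 = 10; the empirical mean is close.
assert 9.5 < avg < 10.5
```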
Basic Sketch
• Hash function h: [n] → [n] (pairwise independent)
• Let j(i) = # tail zeros in h(i)
  • j(i) ∈ {0, 1, …, log n} and Pr[j(i) = j] = 2^{-(j+1)}
• Partition the stream into substreams I_0, I_1, …, I_{log n}
  • Substream I_j focuses on the elements i with j(i) = j
• Sketch: for each j = 0 … log n,
  • Store CS_j: CountSketch for the heavy hitters of x_{I_j}
  • Store DC_j: distinct count sketch for d_j ≈ ||x_{I_j}||_0
    • A factor-2 approximation would be sufficient here!
  • Both for success probability 1 − 1/poly(n)
Estimation

Algorithm DynSampleBasic:
  Initialize:
    hash function h: [n] → [n]
    j(i) = # tail zeros in h(i)
    CountSketch sketches CS_j, j = 0 … log n
    DistinctCount sketches DC_j, j = 0 … log n
  Process(int i, real δ):
    Let j = j(i)
    Add (i, δ) to CS_j and DC_j
  Estimator:
    Let j be s.t. the estimate d_j from DC_j is in [10, 20]
    If no such j, FAIL
    i ← random heavy hitter from CS_j
    Return i

• Find a substream I_j whose support size is Θ(1)
• If no such substream, then FAIL
• Recover all i ∈ I_j with x_i ≠ 0 (using CS_j: they are all heavy hitters of x_{I_j})
• Pick any of them at random
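A minimal runnable sketch of DynSampleBasic, with one simplifying assumption made explicit: exact per-level dictionaries stand in for the CountSketch and DistinctCount sketches, so this version is not sublinear-space; the tail-zeros hashing and level structure follow the slide.

```python
import random

def tail_zeros(h, nbits=32):
    """Number of trailing zero bits in h."""
    z = 0
    while z < nbits and not (h >> z) & 1:
        z += 1
    return z

class DynSampleBasic:
    """Toy l0-sampler: exact dictionaries replace the sketches (an
    assumption to keep this short; space here is NOT sublinear)."""
    def __init__(self, seed, nbits=32):
        self.rng = random.Random(seed)
        self.nbits = nbits
        self.salt = self.rng.getrandbits(64)
        # levels[j] holds the non-zero coordinates of substream I_j
        self.levels = [{} for _ in range(nbits + 1)]

    def _level(self, i):
        h = hash((self.salt, i)) & ((1 << self.nbits) - 1)
        return tail_zeros(h, self.nbits)

    def process(self, i, delta):
        lvl = self.levels[self._level(i)]
        lvl[i] = lvl.get(i, 0) + delta
        if lvl[i] == 0:
            del lvl[i]   # exact counters: deletions cancel insertions

    def sample(self):
        for lvl in self.levels:       # find a level with Theta(1) support
            if 1 <= len(lvl) <= 20:
                return self.rng.choice(sorted(lvl))
        return None                   # FAIL

sampler = DynSampleBasic(seed=7)
for i in range(1000):
    sampler.process(i, 1.0)
for i in range(990):
    sampler.process(i, -1.0)   # delete: support(x) is now {990, ..., 999}
out = sampler.sample()
assert out is not None and 990 <= out < 1000
```

With exact dictionaries the sampler never errs, only FAILs; the real sketches add a small error probability, which the analysis absorbs into the failure bound.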
Analysis
• If s ≥ 1, then every non-zero coordinate lands in I_j for some j, so some substream is non-empty
• Suppose s ≥ 10
  • Let j* be such that E[d_{j*}] = s / 2^{j*+1} ∈ [10, 20]
• Chebyshev: d_{j*} deviates from its expectation by more than half of it with probability at most Var[d_{j*}] / (E[d_{j*}]/2)² ≤ 4 / E[d_{j*}] (pairwise independence gives Var ≤ E)
• Together with the failure probabilities of the sketches CS_j, DC_j: probability of FAIL is at most 0.7
Analysis (cont)
• Let i be any coordinate of I_{j*} with x_i ≠ 0
• Since d_{j*} = O(1), all non-zero coordinates of x_{I_{j*}} are heavy hitters
• So CS_{j*} will recover a heavy hitter, i.e., some i with x_i ≠ 0
• By symmetry, once we output some i, it is uniformly random over support(x)
• Randomness? We only used Chebyshev ⇒ a pairwise-independent h is OK!
Dynamic Sampling: overall
• DynSampleBasic guarantee:
  • FAIL: with probability at most 0.7
  • Otherwise, output a uniformly random i ∈ support(x)
  • (Modulo a negligible probability of the inner sketches failing)
• Reduce FAIL probability?
• DynSample-Full:
  • Take k = Θ(log n) independent copies of DynSampleBasic
  • At least one will not FAIL, with probability at least 1 − 0.7^k = 1 − 1/poly(n)
• Space: O(log³ n) words, for:
  • O(log n) repetitions
  • O(log n) substreams each
  • O(log n) words for each CountSketch + DistinctCount pair
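The amplification is plain independent repetition: if one copy FAILs with probability at most 0.7, then k copies all FAIL with probability 0.7^k, which is 1/poly(n) for k = Θ(log n). A small helper (hypothetical name) computing the number of copies needed:

```python
import math

def copies_needed(n, p_fail=0.7, c=1.0):
    """Smallest k with p_fail**k <= n**(-c)."""
    return math.ceil(c * math.log(n) / math.log(1 / p_fail))

k = copies_needed(10**6, p_fail=0.7, c=2.0)
assert 0.7 ** k <= 10**-12   # all k copies FAIL with probability <= 1/n^2
```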
Back to Dynamic Graphs
• Graph G on n nodes, with edges inserted/deleted
• Define node-edge incidence vectors:
  • For node i, we have a vector x_i ∈ R^{n(n−1)/2}, with one coordinate per potential edge (u,v), u < v:
    • For i = u (i.e., i < v): x_i[(i,v)] = +1 if edge (i,v) exists
    • For i = v (i.e., u < i): x_i[(u,i)] = −1 if edge (u,i) exists
    • All other coordinates are 0
• Idea:
  • Use DynSample-Full to sample an edge from each vertex
  • Collapse (contract) the sampled edges
  • How to iterate?
• Property:
  • For a set of nodes S ⊆ V:
  • Consider x_S = Σ_{i ∈ S} x_i (the sketches are linear, so the sketch of x_S is the sum of the sketches of the x_i)
  • Claim: x_S has a non-zero in coordinate (u,v) iff edge (u,v) crosses from S to outside (i.e., exactly one of u, v is in S); for an edge internal to S the +1 and −1 cancel
• The sketch is enough for: for any set S, we can sample an edge leaving S!
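The cancellation claim is easy to verify directly. A small Python check (function names are illustrative), representing each x_i sparsely:

```python
def incidence_vector(i, edges):
    """x_i: +1 on coordinate (u,v), u < v, if i == u; -1 if i == v."""
    x = {}
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        if i == u:
            x[(u, v)] = 1
        elif i == v:
            x[(u, v)] = -1
    return x

def vec_sum(vectors):
    """Coordinate-wise sum of sparse vectors, dropping zeros."""
    out = {}
    for x in vectors:
        for coord, val in x.items():
            out[coord] = out.get(coord, 0) + val
            if out[coord] == 0:
                del out[coord]
    return out

edges = [(0, 1), (1, 2), (2, 3), (0, 2), (3, 4)]
S = {0, 1, 2}
x_S = vec_sum(incidence_vector(i, edges) for i in S)
cut = {(u, v) for (u, v) in edges if (u in S) != (v in S)}
# Internal edges (0,1), (1,2), (0,2) cancel; only the cut edge survives.
assert set(x_S) == cut == {(2, 3)}
```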
Dynamic Connectivity
(with the vectors x_i from above: ±1 entries on the existing edges incident to i)
• Sketching algorithm:
  • A DynSample-Full sketch for each x_i, i ∈ V
• Check connectivity:
  • Sample an edge from each node
  • Contract all sampled edges
  • ⇒ this partitions the graph into a bunch of components (each is connected)
  • Iterate on the components, sampling an edge out of each via the summed sketches
• How many iterations?
  • O(log n): each time we reduce the number of components by a factor ≥ 2
• Issue: the iterations are not independent of the sketch randomness already revealed!
  • Fix: use a fresh DynSample-Full for each of the O(log n) iterations
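The contraction loop above can be sketched as follows, with an exact cut-edge sampler standing in for the DynSample-Full sketches (an assumption: the point is that the sketches answer exactly this "sample an edge leaving S" query in small space):

```python
import random

def count_components(n, edges, seed=0):
    """Boruvka-style contraction: repeatedly sample an edge leaving each
    current component and merge. Runs at most n rounds for safety;
    O(log n) rounds suffice with a fresh sampler per round."""
    rng = random.Random(seed)
    comp = list(range(n))    # comp[v] = current component label of v

    def sample_cut_edge(lab):
        cut = [(u, v) for (u, v) in edges
               if (comp[u] == lab) != (comp[v] == lab)]
        return rng.choice(cut) if cut else None

    for _ in range(n):
        merged = False
        for lab in set(comp):
            e = sample_cut_edge(lab)
            if e is not None:
                a, b = comp[e[0]], comp[e[1]]
                lo, hi = min(a, b), max(a, b)
                comp = [lo if c == hi else c for c in comp]
                merged = True
        if not merged:
            break                # no component has an outgoing edge
    return len(set(comp))

assert count_components(5, [(0, 1), (1, 2), (2, 3)]) == 2  # path + isolated node
assert count_components(4, [(0, 1), (1, 2), (2, 3), (3, 0)]) == 1
assert count_components(3, []) == 3
```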
A little history
• [Ahn-Guha-McGregor'12]: the above streaming algorithm
  • Overall space: n · polylog(n)
• [Kapron-King-Mountjoy'13]:
  • Data structure for maintaining graph connectivity under edge inserts/deletes
  • First algorithm with polylog(n) time per update/connectivity query!
  • Open since the '80s
Section 2: Dimension Reduction & Sketching
Why? Application: Nearest Neighbor Search in high dimensions
• Preprocess: a set P of n points in R^d
• Query: given a query point q, report a point p ∈ P with the smallest distance to q
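For reference, the trivial exact solution is a linear scan, O(n·d) per query; everything in this section aims to beat it. A minimal Python baseline:

```python
def nearest_neighbor(points, q):
    """Exact NNS by brute force: compare q against every point."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(points, key=dist2)

points = [(0, 0), (5, 5), (1, 2)]
assert nearest_neighbor(points, (1, 1)) == (1, 2)
```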
Motivation
• Generic setup:
  • Points model objects (e.g. images)
  • Distance models (dis)similarity measure
  • [Slide example: two images encoded as binary vectors, 000000 011100 010100 000100 010100 011111 vs. 000000 001100 000100 000100 110100 111111; similar objects map to nearby vectors]
• Application areas:
  • machine learning: k-NN rule
  • speech/image/video/music recognition, vector quantization, bioinformatics, etc.
• Distance can be:
  • Euclidean, Hamming, …
Low-dimensional: easy
• For, say, d = 2: compute the Voronoi diagram of the point set
• Given query q, perform point location in it
• Performance:
  • Space: O(n)
  • Query time: O(log n)
High-dimensional case
• All exact algorithms degrade rapidly with the dimension d
Dimension Reduction
• Reduce high dimension?!
  • "flatten" dimension d into dimension k ≪ d
• Not possible in general: packing bound
  • (exponentially many well-separated points in R^d cannot all fit in a much lower dimension)
• But can if: distances need to be preserved only for a fixed subset of R^d (e.g., our n points)
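For a fixed set of n points, a random linear map does the job; this previews the Johnson-Lindenstrauss construction, where k = O(ε⁻² log n) preserves all pairwise distances up to 1 ± ε. A pure-Python toy with illustrative parameters (a real implementation would use numpy):

```python
import math
import random

def jl_project(points, k, seed=0):
    """Project R^d -> R^k via a random Gaussian matrix, scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    s = 1 / math.sqrt(k)
    return [tuple(s * sum(row[t] * p[t] for t in range(d)) for row in G)
            for p in points]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

rng = random.Random(1)
pts = [tuple(rng.gauss(0, 1) for _ in range(600)) for _ in range(5)]
proj = jl_project(pts, k=300)

# All pairwise distances are preserved up to a small relative error.
for i in range(5):
    for j in range(i + 1, 5):
        ratio = dist(proj[i], proj[j]) / dist(pts[i], pts[j])
        assert 0.7 < ratio < 1.3
```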