Streaming Algorithms
Piotr Indyk
MIT
Data Streams
• A data stream is a sequence of data that is too large to be stored in available memory
• Examples:
  • Network traffic
  • Sensor networks
  • Approximate query optimization and answering in large databases
  • Scientific data streams
  • …and this talk
Outline
• Data stream model, motivations, connections
• Streaming problems and techniques
  • Estimating the number of distinct elements
  • Other quantities: norms, moments, heavy hitters…
  • Sparse approximations and compressed sensing
Basic Data Stream Model
• Single pass over the data: i_1, i_2, …, i_n
• Typically, we assume n is known
• Bounded storage (typically n^α for some constant α < 1, or log^c n)
  • Units of storage: bits, words, coordinates, or "elements" (e.g., points, nodes/edges)
• Fast processing time per element
• Randomness OK (in fact, almost always necessary)
Example stream: 8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 …
Connections
• External memory algorithms
  • Linear scan works best
  • Streaming algorithms yield efficient external memory algorithms
• Communication complexity/sketching
  • Low-space algorithms yield low-communication protocols (between two parties A and B)
• Metric embeddings, sub-linear time algorithms, pseudo-randomness, compressive sensing
Counting Distinct Elements
• Stream elements: numbers from {1…m}
• Goal: estimate the number of distinct elements DE in the stream
  • Up to a factor of 1±ε
  • With probability 1−P
• Simpler goal: for a given T > 0, provide an algorithm which, with probability 1−P:
  • Answers YES if DE > (1+ε)T
  • Answers NO if DE < (1−ε)T
• Run, in parallel, the algorithm with T = 1, (1+ε), (1+ε)^2, …, n (see the sketch below)
• Total space multiplied by log_{1+ε} n ≈ log(n)/ε
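For concreteness, here is a small Python sketch (my illustration, not from the talk) of the geometric grid of thresholds used by this reduction; its size matches the log_{1+ε} n ≈ log(n)/ε blow-up above.

    import math

    def threshold_grid(n, eps):
        """Thresholds T = (1+eps)^j for j = 0, 1, ... while T <= n."""
        thresholds = []
        T = 1.0
        while T <= n:
            thresholds.append(T)
            T *= 1 + eps
        return thresholds

    # For n = 10**6 and eps = 0.1: log_{1+eps}(n) parallel copies.
    print(len(threshold_grid(10**6, 0.1)))    # 145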
Vector Interpretation
• Initially, x = 0
• Insertion of i is interpreted as x_i = x_i + 1
• Want to estimate DE(x) = the number of non-zero elements of x
(figure: vector x with coordinates 1…9, built from the stream 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5)
Estimating DE(x) (based on [Flajolet-Martin'85])
• Choose a random set S of coordinates
  • For each i, we have Pr[i∈S] = 1/T
• Maintain Sum_S(x) = Σ_{i∈S} x_i
• Estimation algorithm I:
  • YES, if Sum_S(x) > 0
  • NO, if Sum_S(x) = 0
• Analysis:
  • Pr = Pr[Sum_S(x) = 0] = (1 − 1/T)^DE
  • For T large enough: (1 − 1/T)^DE ≈ e^{−DE/T}
  • Using calculus, for ε small enough:
    • If DE > (1+ε)T, then Pr ≈ e^{−(1+ε)} < 1/e − ε/3
    • If DE < (1−ε)T, then Pr ≈ e^{−(1−ε)} > 1/e + ε/3
(figure: Pr as a function of DE; example with T = 4 and the coordinates of S marked on x)
Estimating DE(x), ctd.
• We have an algorithm:
  • If DE > (1+ε)T, then Pr < 1/e − ε/3
  • If DE < (1−ε)T, then Pr > 1/e + ε/3
• Need to amplify the probability of correctness:
  • Select sets S_1 … S_k, k = O(log(1/P)/ε^2)
  • Let Z = number of the sums Sum_{S_j}(x) that are equal to 0
  • By the Chernoff bound, with probability > 1−P:
    • If DE > (1+ε)T, then Z < k/e
    • If DE < (1−ε)T, then Z > k/e
• Total space:
  • Decision version: O(log(1/P)/ε^2) numbers in range 0…n
  • Estimation: O(log(n)/ε · log(1/P)/ε^2) numbers in range 0…n (the probability of error is O(P log(n)/ε))
• Better algorithms known:
  • Theory: O(1/ε^2 + log n) bits [Kane-Nelson-Woodruff'10]
  • Practice: 128 bytes suffice for all works of Shakespeare, ε ≈ 10% [Durand-Flajolet'03]
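A minimal Python sketch of this decision procedure (the name de_decision and the explicit membership tables are mine; for clarity the sets S_j are materialized, whereas the streaming algorithm stores only the k counters and decides membership with a hash function, as in the Comments slide below):

    import math
    import random

    def de_decision(stream, m, T, P, eps, seed=0):
        """Decide, w.p. >= 1-P: YES if DE > (1+eps)T, NO if DE < (1-eps)T."""
        rng = random.Random(seed)
        k = math.ceil(math.log(1 / P) / eps ** 2)   # k = O(log(1/P)/eps^2)
        # Coordinate i belongs to S_j independently with probability 1/T.
        # (Materialized here for clarity; a streaming version hashes instead.)
        member = [[rng.random() < 1 / T for _ in range(m + 1)] for _ in range(k)]
        sums = [0] * k                               # Sum_{S_j}(x)
        for i in stream:                             # one pass over the stream
            for j in range(k):
                if member[j][i]:
                    sums[j] += 1
        Z = sum(1 for s in sums if s == 0)           # number of zero sums
        # Pr[Sum_S(x)=0] = (1-1/T)^DE ~ e^{-DE/T}; compare Z against k/e.
        return Z < k / math.e                        # True = YES (DE > (1+eps)T)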
Chernoff bound
• Let X_1 … X_n be a sequence of i.i.d. 0-1 random variables such that Pr[X_i = 1] = p, and let X = Σ_i X_i
• Then for any ε < 1 we have: Pr[|X − pn| > ε·pn] < 2e^{−ε^2·pn/3}
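As a quick back-of-the-envelope check (my addition), solving 2e^{−ε^2 pn/3} ≤ P for n recovers the k = O(log(1/P)/ε^2) repetition count used in the amplification step above; the constants are not tight:

    import math

    def repetitions(P, eps, p=1 / math.e):
        """Smallest n with 2*exp(-eps^2 * p * n / 3) <= P."""
        return math.ceil(3 * math.log(2 / P) / (eps ** 2 * p))

    print(repetitions(P=0.01, eps=0.1))   # 4321 (constants not optimized)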
Comments
• Implementing S:
  • Choose a uniform hash function h: {1..m} → {1..T}
  • Define S = {i : h(i) = 1}
  • We have i∈S iff h(i) = 1
• Implementing h:
  • In practice, to compute h(i): set Seed = i, then return random() (i.e., seed a pseudorandom generator with i and use its first output; see the sketch below)
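A sketch of this hash in Python (my code; a seeded PRNG stands in for the uniform hash function, and the salt is an assumption added so the whole run is deterministic):

    import random

    def h(i, T, salt=12345):
        rng = random.Random((salt << 32) | i)   # "Seed = i" (plus a salt)
        return rng.randrange(1, T + 1)          # "return random()" scaled to {1..T}

    def in_S(i, T):
        return h(i, T) == 1                     # i is in S iff h(i) = 1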
More comments
• The algorithm uses "linear sketches" Sum_{S_j}(x) = Σ_{i∈S_j} x_i
• Can implement decrements x_i = x_i − 1
  • I.e., the stream can contain deletions of elements (as long as x ≥ 0)
  • Other names: dynamic model, turnstile model
More General Problem
• What other functions of a vector x can we maintain in small space?
• L_p norms: ||x||_p = (Σ_i |x_i|^p)^{1/p}
  • We also have ||x||_∞ = max_i |x_i|
  • … and we can define ||x||_0 = DE(x), since ||x||_p^p = Σ_i |x_i|^p → DE(x) as p → 0
• Alternatively: frequency moments F_p = the p-th power of the L_p norm (exception: F_0 = L_0)
• How much space do we need to estimate ||x||_p (for const. ε)?
• Theorem:
  • For p∈[0,2]: polylog n space suffices
  • For p > 2: n^{1−2/p} polylog n space suffices and is necessary
[Alon-Matias-Szegedy'96, Feigenbaum-Kannan-Strauss-Viswanathan'99, Indyk'00, Coppersmith-Kumar'04, Ganguly'04, Bar-Yossef-Jayram-Kumar-Sivakumar'02'03, Saks-Sun'03, Indyk-Woodruff'05]
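For reference, here are direct (non-streaming) computations of these quantities from a materialized vector x (my code, handy for sanity-checking any sketch):

    def lp_norm(x, p):
        """||x||_p; p = 0 gives DE(x), p = float('inf') gives max_i |x_i|."""
        if p == 0:
            return sum(1 for v in x if v != 0)   # ||x||_0 = number of non-zeros
        if p == float("inf"):
            return max(abs(v) for v in x)
        return sum(abs(v) ** p for v in x) ** (1 / p)

    def frequency_moment(x, p):
        """F_p = ||x||_p^p for p > 0; F_0 = L_0 by convention."""
        return lp_norm(x, 0) if p == 0 else sum(abs(v) ** p for v in x)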
Alon-Matias-Szegedy'96
• Choose r_1 … r_m to be i.i.d. random variables with Pr[r_i = 1] = Pr[r_i = −1] = 1/2
• Maintain Z = Σ_i r_i x_i under increments/decrements to x_i
• Algorithm I: Y = Z^2
• "Claim": Y "approximates" ||x||_2^2 with "good" probability
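A minimal sketch of Algorithm I in Python (my code; idealized i.i.d. signs, drawn lazily and cached, where a real implementation would use a small-seed hash):

    import random

    def make_sign(rng, cache, i):
        """Lazily draw r_i in {-1,+1} and remember it (a hash in practice)."""
        if i not in cache:
            cache[i] = rng.choice((-1, 1))
        return cache[i]

    rng, cache, Z = random.Random(0), {}, 0
    for i, delta in [(3, 1), (7, 1), (3, 1)]:    # x_3 = 2, x_7 = 1
        Z += make_sign(rng, cache, i) * delta    # maintain Z = sum_i r_i x_i
    Y = Z * Z   # single estimate of ||x||_2^2 = 5; unbiased but high variance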
Analysis
• We will use the Chebyshev inequality
  • Need expectation and variance
• The expectation of Z^2 = (Σ_i r_i x_i)^2 is: E[Z^2] = E[Σ_{i,j} r_i x_i r_j x_j] = Σ_{i,j} x_i x_j E[r_i r_j]
• We have:
  • For i ≠ j, E[r_i r_j] = E[r_i] E[r_j] = 0, so the term disappears
  • For i = j, E[r_i r_j] = 1
• Therefore E[Z^2] = Σ_i x_i^2 = ||x||_2^2 (unbiased estimator)
Analysis, ctd.
• The second moment of Z^2 = (Σ_i r_i x_i)^2 equals the expectation of Z^4 = (Σ_i r_i x_i)(Σ_i r_i x_i)(Σ_i r_i x_i)(Σ_i r_i x_i)
• This can be decomposed into a sum of:
  • Σ_i (r_i x_i)^4 → expectation = Σ_i x_i^4
  • 6 Σ_{i<j} (r_i r_j x_i x_j)^2 → expectation = 6 Σ_{i<j} x_i^2 x_j^2
  • Terms in which some multiplier r_i x_i appears an odd number of times (e.g., r_1 x_1 r_2 x_2 r_3 x_3 r_4 x_4) → expectation = 0
• Total: E[Z^4] = Σ_i x_i^4 + 6 Σ_{i<j} x_i^2 x_j^2
• The variance of Z^2 is: E[Z^4] − E^2[Z^2] = Σ_i x_i^4 + 6 Σ_{i<j} x_i^2 x_j^2 − (Σ_i x_i^2)^2 = Σ_i x_i^4 + 6 Σ_{i<j} x_i^2 x_j^2 − Σ_i x_i^4 − 2 Σ_{i<j} x_i^2 x_j^2 = 4 Σ_{i<j} x_i^2 x_j^2 ≤ 2 (Σ_i x_i^2)^2
Analysis, ctd.
• We have an estimator Y = Z^2
  • E[Y] = Σ_i x_i^2
  • σ^2 = Var[Y] ≤ 2 (Σ_i x_i^2)^2
• Chebyshev inequality: Pr[|E[Y] − Y| ≥ cσ] ≤ 1/c^2
• Algorithm II:
  • Maintain Z_1 … Z_k (and thus Y_1 … Y_k), and define Y' = Σ_j Y_j / k
  • E[Y'] = k Σ_i x_i^2 / k = Σ_i x_i^2
  • σ'^2 = Var[Y'] ≤ 2k (Σ_i x_i^2)^2 / k^2 = 2 (Σ_i x_i^2)^2 / k
• Guarantee: Pr[|Y' − Σ_i x_i^2| ≥ c (2/k)^{1/2} Σ_i x_i^2] ≤ 1/c^2
• Setting c to a constant and k = O(1/ε^2) gives a (1±ε)-approximation with const. probability
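A minimal sketch of Algorithm II in Python (the function name and the toy stream are mine; idealized fully random signs, where 4-wise independence would suffice, as noted below):

    import random

    def ams_estimate(stream, k, seed=0):
        """Y' = (Z_1^2 + ... + Z_k^2)/k, where Z_j = sum_i r_i^(j) x_i.

        stream is a sequence of (i, delta) updates x_i += delta.
        """
        rngs = [random.Random(seed + j) for j in range(k)]
        signs = [{} for _ in range(k)]            # lazily drawn r_i^(j)
        Z = [0] * k
        for i, delta in stream:
            for j in range(k):
                if i not in signs[j]:
                    signs[j][i] = rngs[j].choice((-1, 1))
                Z[j] += signs[j][i] * delta
        return sum(z * z for z in Z) / k

    # Toy check: x_3 = 2, x_7 = -1, so ||x||_2^2 = 5.
    stream = [(3, 1), (7, 1), (3, 1), (7, -2)]
    print(ams_estimate(stream, k=800))            # typically within ~10% of 5

With k = 800, the standard deviation is at most (2/800)^{1/2} · 5 = 0.25, so by Chebyshev the estimate lands within 10% of 5 with probability at least 3/4.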
Comments
• Only needed 4-wise independence of r_1 … r_m
  • I.e., for any subset S of {1…m} with |S| = 4, and any b∈{−1,1}^4, we have Pr[r_S = b] = 1/2^4
  • Can generate such variables from O(log m) random bits (see the sketch below)
• What we did:
  • Maintain a "linear sketch" vector Z = [Z_1 … Z_k] = Rx
  • Estimator for ||x||_2^2: (Z_1^2 + … + Z_k^2)/k = ||Rx||_2^2 / k
  • "Dimensionality reduction": x → Rx … but the tail is somewhat "heavy"
  • Reason: we only used the second moment of the estimator
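One standard way to get such variables (my addition; the talk does not specify a construction) is to evaluate a random degree-3 polynomial over a prime field and keep one bit: the values at any 4 distinct points are 4-wise independent, and the seed is just four field elements, i.e., O(log m) bits.

    import random

    PRIME = (1 << 61) - 1                  # Mersenne prime, assumed > m

    class FourWiseSigns:
        """r_i = +/-1 from a random degree-3 polynomial over GF(PRIME)."""
        def __init__(self, seed=0):
            rng = random.Random(seed)
            self.coeffs = [rng.randrange(PRIME) for _ in range(4)]  # a0..a3

        def r(self, i):
            v = 0
            for a in reversed(self.coeffs):        # Horner's rule:
                v = (v * i + a) % PRIME            # a3*i^3 + a2*i^2 + a1*i + a0
            # The low bit of a uniform field element gives a sign with bias
            # O(1/PRIME), which is negligible here.
            return 1 if v & 1 else -1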