Compact Representations in Streaming Algorithms
Moses Charikar, Princeton University
Talk Outline
• Statistical properties of data streams
  • Distinct elements
  • Frequency moments, norm estimation
  • Frequent items
Frequency Moments [Alon, Matias, Szegedy ’99]
• Stream consists of elements from {1, 2, …, n}
• m_i = number of times i occurs
• Frequency moment F_p = Σ_i m_i^p
• F_0 = number of distinct elements
• F_1 = size of stream
• F_2 = Σ_i m_i²
Overall Scheme
• Design an estimator (i.e. a random variable) with the right expectation
• If the estimator is tightly concentrated, maintain a number of independent copies E_1, E_2, …, E_r
• Obtain an estimate E from E_1, E_2, …, E_r
• Within a (1 ± ε) factor with probability 1 − δ
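The combining step above is the standard "median of means" trick. A minimal sketch (my illustration, not code from the talk): average groups of independent estimators to shrink the variance, then take the median of the group averages to drive the failure probability down.

```python
import statistics

def median_of_means(estimates, groups):
    """Split `estimates` into `groups` equal buckets, average each bucket,
    and return the median of the bucket averages."""
    size = len(estimates) // groups
    means = [statistics.mean(estimates[g * size:(g + 1) * size])
             for g in range(groups)]
    return statistics.median(means)
```

With r = O(log(1/δ) / ε²) copies, this yields a (1 ± ε)-estimate with probability 1 − δ.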
Randomness
• Design the estimator assuming perfect hash functions, with as much randomness as needed
• Too much space would be required to store such a hash function explicitly
• Fix this later by showing that limited randomness suffices
Distinct Elements
• Estimate the number of distinct elements in a data stream
• “Brute force” solution: maintain a list of the distinct elements seen so far
• Ω(n) storage
• Can we do better?
Distinct Elements [Flajolet, Martin ’83]
• Pick a random hash function h: [n] → [0,1]
• Say there are k distinct elements
• Then the minimum value of h over the k distinct elements is around 1/k
• Apply h(·) to every element of the data stream; maintain the minimum value
• Estimator = 1/minimum
(Idealized) Analysis
• Assume a perfectly random hash function h: [n] → [0,1]
• S: set of k elements of [n]
• X = min_{a ∈ S} h(a)
• E[X] = 1/(k+1)
• Var[X] = O(1/k²)
• The mean of O(1/ε²) independent estimators is within (1 ± ε) of 1/k with constant probability
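The idealized analysis can be exercised end to end in a few lines. In this sketch (mine, not from the talk) the truly random hash h: [n] → [0,1] is simulated with a seeded lookup table, which of course uses too much memory; it only illustrates the math, and the limited-randomness fix is discussed on the next slide. Following the slide, the mins are averaged across copies and the average is inverted, using E[min] = 1/(k+1).

```python
import random

def distinct_estimate(stream, copies=64, seed=0):
    rng = random.Random(seed)
    tables = [dict() for _ in range(copies)]  # simulated random hash functions
    mins = [1.0] * copies                     # running minimum per copy
    for x in stream:
        for j in range(copies):
            table = tables[j]
            if x not in table:                # lazily assign h(x) in [0,1)
                table[x] = rng.random()
            if table[x] < mins[j]:
                mins[j] = table[x]
    # E[min] = 1/(k+1), so invert the averaged minimum to recover k.
    return 1.0 / (sum(mins) / copies) - 1.0
```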
Analysis
• [Alon, Matias, Szegedy] The analysis goes through with a pairwise independent hash function h(x) = ax + b
• 2-approximation
• O(log n) space
• Many improvements [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]
Estimating F_2
• F_2 = Σ_i m_i²
• “Brute force” solution: maintain counters for all distinct elements
• Sampling? Ω(n^{1/2}) space
Estimating F_2 [Alon, Matias, Szegedy]
• Pick a random hash function h: [n] → {+1, −1}
• h_i = h(i)
• Z = Σ_i h_i m_i
• Z starts at 0; add h_i every time you see i
• Estimator X = Z²
Analyzing the F_2 estimator
• E[X] = E[Z²] = Σ_i m_i² = F_2 (cross terms cancel in expectation)
• Var[X] ≤ 2·F_2²
• Median of means gives a good estimator
What about the randomness?
• The analysis only requires 4-wise independence of the hash function h
• Pick h from a 4-wise independent family
• O(log n) space representation, efficient computation of h(i)
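A runnable AMS-style F_2 sketch. The ±1 hash here uses one common construction (my choice, not spelled out on the slides): a random degree-3 polynomial over a prime field, whose low bit gives approximately 4-wise independent signs in O(log n) space. Parameter names are mine.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2^31 - 1

def make_sign_hash(rng):
    coeffs = [rng.randrange(P) for _ in range(4)]
    def h(x):
        v = 0
        for c in coeffs:          # Horner evaluation of the cubic mod P
            v = (v * x + c) % P
        return 1 if v & 1 else -1
    return h

def f2_sketch(stream, copies=100, seed=1):
    rng = random.Random(seed)
    hashes = [make_sign_hash(rng) for _ in range(copies)]
    zs = [0] * copies             # each Z = sum_i h_i * m_i, kept online
    for x in stream:
        for j, h in enumerate(hashes):
            zs[j] += h(x)
    # Median of means over the squared counters.
    ests = [z * z for z in zs]
    size = copies // 5
    means = sorted(sum(ests[g * size:(g + 1) * size]) / size
                   for g in range(5))
    return means[2]
```

Because each counter Z is linear in the frequencies m_i, sketches of two streams can be added or subtracted coordinate-wise, which is the property the next slide exploits.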
Properties of the F_2 estimator
• A “sketch” of the data stream that allows computation of F_2
• Linear function of the m_i
• Sketches can be added and subtracted
• Given two streams with frequencies m_i, n_i: E[(Z_1 − Z_2)²] = Σ_i (m_i − n_i)²
• Estimates the L2 norm of the difference
• How about the L1 norm? The Lp norm?
Stable Distributions
• p-stable distribution D: if X_1, X_2, …, X_n are i.i.d. samples from D, then m_1 X_1 + m_2 X_2 + … + m_n X_n is distributed as ||(m_1, m_2, …, m_n)||_p · X, where X is also a sample from D
• This is the defining property, up to a scale factor
• The Gaussian distribution is 2-stable
• The Cauchy distribution is 1-stable
• p-stable distributions exist for all 0 < p ≤ 2
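A minimal sketch (mine, under the stated stability property) of how the 1-stable Cauchy distribution yields an L1 estimate: each counter holds a Cauchy-weighted combination of the frequencies, so |counter| is distributed as ||m||_1 · |Cauchy|, and the median of |Cauchy| is exactly 1, so the median of the |counters| estimates ||m||_1.

```python
import math
import random
import statistics

def l1_estimate(freqs, counters=201, seed=2):
    rng = random.Random(seed)
    zs = [0.0] * counters
    for m in freqs:
        for j in range(counters):
            # Standard Cauchy sample via the tangent of a uniform angle.
            cauchy = math.tan(math.pi * (rng.random() - 0.5))
            zs[j] += m * cauchy
    return statistics.median(abs(z) for z in zs)
```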
Talk Outline
• Similarity preserving hash functions
• Similarity estimation
• Statistical properties of data streams
  • Distinct elements
  • Frequency moments, norm estimation
  • Frequent items
Variants of the F_2 estimator [Alon, Gibbons, Matias, Szegedy]
• Estimate the join size of two relations with frequencies (m_1, m_2, …) and (n_1, n_2, …), i.e. Σ_i m_i n_i
• Variance may be too high
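A toy check (my illustration) that multiplying two AMS counters built with the SAME sign hash estimates the join size Σ_i m_i n_i. The fully random sign hash is simulated with a shared per-copy table for brevity.

```python
import random
import statistics

def join_size_estimate(freqs_m, freqs_n, copies=400, seed=3):
    rng = random.Random(seed)
    signs = [dict() for _ in range(copies)]

    def h(j, i):                       # shared sign hash for copy j
        if i not in signs[j]:
            signs[j][i] = rng.choice((-1, 1))
        return signs[j][i]

    prods = []
    for j in range(copies):
        z1 = sum(m * h(j, i) for i, m in freqs_m.items())
        z2 = sum(n * h(j, i) for i, n in freqs_n.items())
        prods.append(z1 * z2)          # E[Z1 * Z2] = sum_i m_i * n_i
    return statistics.mean(prods)
```

The high-variance caveat on the slide shows up directly here: the variance grows with the F_2 of both relations, so many independent copies are needed.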
Finding Frequent Items [C, Chen, Farach-Colton ’02]
• Goal: given a data stream, return an approximate list of the k most frequent items in one pass and sub-linear space
• Applications: analyzing search engine queries, network traffic
Finding Frequent Items
• a_i: ith most frequent element; m_i: its frequency
• If we had an oracle that gave us exact frequencies, we could find the most frequent items in one pass
• Solution: a data structure called a Count Sketch that gives good estimates of the frequencies of the high-frequency elements at every point in the stream
Intuition
• Consider a single counter X with a single hash function h mapping items a to {+1, −1}
• On seeing each element a_i, update the counter: X += h(a_i)
• X = Σ_i m_i · h(a_i)
• Claim: E[X · h(a_i)] = m_i
• Proof idea: cross terms cancel because of pairwise independence
Finding the max element
• Problem with the single-counter scheme: the variance is too high
• Replace it with an array of t counters, using independent hash functions h_1, …, h_t
• h_1: a → {+1, −1}, …, h_t: a → {+1, −1}
Analysis of the “array of counters” data structure
• Expectation is still correct
• Claim: variance of the final estimate < Σ_i m_i² / t
• Variance of each estimate < Σ_i m_i²
• Proof idea: cross terms cancel
• Set t = O(log n · Σ_i m_i² / m_1²) to get the answer with high probability
• Proof idea: “median of averages”
Problem with the “array of counters” data structure
• The variance of the estimator is dominated by the contribution of the large elements
• Estimates for important elements such as a_k are corrupted by larger elements (variance much more than m_k²)
• To avoid collisions, replace each counter with a hash table of b counters to spread out the large elements
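Putting the pieces together gives a compact Count Sketch in the spirit of the slides: t rows, each a hash table of b counters with its own bucket hash and sign hash, and the frequency estimate for an item is the median over rows of counter × sign. As before, the fully random hashes are simulated with seeded per-row tables for brevity; class and parameter names are mine.

```python
import random
import statistics

class CountSketch:
    def __init__(self, t=5, b=64, seed=4):
        self.t, self.b = t, b
        self.rng = random.Random(seed)
        self.tables = [[0] * b for _ in range(t)]
        self.bucket = [dict() for _ in range(t)]  # item -> bucket index
        self.sign = [dict() for _ in range(t)]    # item -> +1 / -1

    def _hashes(self, x, row):
        if x not in self.bucket[row]:             # lazily simulate the hashes
            self.bucket[row][x] = self.rng.randrange(self.b)
            self.sign[row][x] = self.rng.choice((-1, 1))
        return self.bucket[row][x], self.sign[row][x]

    def add(self, x):
        for row in range(self.t):
            j, s = self._hashes(x, row)
            self.tables[row][j] += s

    def estimate(self, x):
        ests = []
        for row in range(self.t):
            j, s = self._hashes(x, row)
            ests.append(self.tables[row][j] * s)
        return statistics.median(ests)
```

Spreading items over b buckets per row keeps most large elements out of any given counter, and the median over the t rows suppresses the rows where a collision with a large element does occur.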
In Conclusion
• Simple, powerful ideas are at the heart of several algorithmic techniques for large data sets
• “Sketches” of data tailored to applications
• Many interesting research questions