Range-Efficient Counting of Distinct Elements Srikanta Tirthapura Iowa State University (joint work with Phillip Gibbons, Aduri Pavan)
Range-Efficient F0
Stream: [100,200], [0,10], [60,120], [5,25]
F0: |[0,25] ∪ [60,200]| = 167
[Figure: number line with the endpoints 0, 5, 10, 25, 60, 100, 120, 200 marked]
IIT Kanpur Streams Workshop
Range-Efficient F0
Input stream: a sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 ≤ li ≤ ri ≤ n, and li, ri are integers
Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|, i.e. the number of distinct elements in the union (F0)
Constraints:
• Single pass through the data
• Small workspace
• Fast processing time
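As a reference point for the problem statement above, here is a minimal exact computation of the union size. It is not a streaming algorithm: it stores and sorts every range, so it violates the single-pass and small-workspace constraints. The function name `union_size` is illustrative only.

```python
def union_size(ranges):
    """Exact number of distinct integers covered by inclusive ranges (l, r)."""
    total = 0
    covered_up_to = None  # largest integer counted so far
    for l, r in sorted(ranges):
        if covered_up_to is None or l > covered_up_to:
            total += r - l + 1          # range starts beyond everything counted
            covered_up_to = r
        elif r > covered_up_to:
            total += r - covered_up_to  # only the new tail is counted
            covered_up_to = r
    return total


stream = [(100, 200), (0, 10), (60, 120), (5, 25)]
print(union_size(stream))  # 167
```

The output matches the example on the earlier slide: 26 integers in [0,25] plus 141 integers in [60,200].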
Reductions to Range-Efficient F0
[Figure: Duplicate-Insensitive Sum, Max-Dominance Norm, and Counting Triangles in Graphs each reduce to Range-Efficient F0]
Duplicate-Insensitive Sum
Problem: sum of all distinct elements in a stream of integers
Input stream: a sequence of integers S = a1, a2, …, an
Output: Σ_{distinct ai in S} ai
Example: S = 4, 5, 15, 4, 100, 4, 16, 15
Distinct elements = 4, 5, 15, 100, 16
Sum = 140
Reduction from Dup-Insensitive Sum to F0
Stream from U = [0, m-1] → alternate stream from U' = [0, m²-1]
Duplicate-insensitive sum of the stream = number of distinct elements of the alternate stream
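The slide does not spell out how the alternate stream is built. One mapping that fits the stated universes, assumed here for illustration, sends each integer a in [0, m-1] to the range [a·m, a·m + a − 1], which contains exactly a integers and never overlaps the range of a different value. The names `to_ranges` and `union_size` are hypothetical, and the exact `union_size` stands in for a range-efficient F0 algorithm.

```python
def union_size(ranges):
    """Exact size of the union of inclusive integer ranges (offline stand-in for F0)."""
    total, covered_up_to = 0, None
    for l, r in sorted(ranges):
        if covered_up_to is None or l > covered_up_to:
            total += r - l + 1
            covered_up_to = r
        elif r > covered_up_to:
            total += r - covered_up_to
            covered_up_to = r
    return total


def to_ranges(stream, m):
    """Assumed mapping: integer a in [0, m-1] becomes the range [a*m, a*m + a - 1]."""
    for a in stream:
        if not 0 <= a < m:
            raise ValueError("element outside the universe [0, m-1]")
        if a > 0:                      # a = 0 contributes nothing to the sum
            yield (a * m, a * m + a - 1)


S = [4, 5, 15, 4, 100, 4, 16, 15]
print(union_size(to_ranges(S, m=101)))  # 140
```

Repeated elements produce identical ranges, so they add nothing to the union, which is what makes the sum duplicate-insensitive.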
Max Dominance Norm
Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ a_{i,j} ≤ n:
a_{1,1} a_{1,2} … a_{1,m}
a_{2,1} a_{2,2} … a_{2,m}
…
a_{k,1} a_{k,2} … a_{k,m}
Return Σ_{j=1}^{m} max_{1 ≤ i ≤ k} a_{i,j}
Reduction From Max Dominance Norm
Input stream I, output stream O: F0 of the output stream = dominance norm of the input stream
Assign ranges to the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn]
When element a_{i,j} is received, generate the range [(j-1)n+1, (j-1)n+a_{i,j}]
Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
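A small sketch of this reduction, under the assumption that position j owns the block [(j-1)n+1, jn] and that element a_{i,j} emits the first a_{i,j} integers of that block. The exact `union_size` is an offline stand-in for a range-efficient F0 algorithm, and all names are illustrative.

```python
def union_size(ranges):
    """Exact size of the union of inclusive integer ranges (offline stand-in for F0)."""
    total, covered_up_to = 0, None
    for l, r in sorted(ranges):
        if covered_up_to is None or l > covered_up_to:
            total += r - l + 1
            covered_up_to = r
        elif r > covered_up_to:
            total += r - covered_up_to
            covered_up_to = r
    return total


def dominance_ranges(items, n):
    """items: (i, j, value) triples with 1-based position j and 1 <= value <= n."""
    for _i, j, value in items:
        base = (j - 1) * n            # block of position j starts at base + 1
        yield (base + 1, base + value)


streams = [[3, 1, 4],                 # k = 2 streams, m = 3 positions, n = 5
           [2, 5, 4]]
items = [(i, j, a)
         for i, row in enumerate(streams, 1)
         for j, a in enumerate(row, 1)]
print(union_size(dominance_ranges(items, n=5)))  # 12 = 3 + 5 + 4
```

Within one block every emitted range is a prefix, so the union of the prefixes has the size of the largest one, which is the per-position maximum.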
Talk Outline
• Range-Efficient F0
• Reductions Among Data Stream Problems
• Algorithm for Range-Efficient F0 (building on distinct sampling)
• Update Streams
• Open Questions
Counting Distinct Elements (F0)
• Example
  • How many different users accessed my website today?
  • Stream = 1,1,2,3,4,1,2; F0 = 4
• Numerous applications in databases and networking
• Prior work
  • Flajolet-Martin (1985)
  • Alon, Matias and Szegedy (1996)
  • Gibbons and Tirthapura (2001)
  • Bar-Yossef et al. (2002) (currently most space-efficient)
  • Indyk-Woodruff (2003) (lower bounds)
Range-Efficient F0 (Pavan and Tirthapura)
Distinct Sampling Algorithm for F0 + Range Sampling for 2-way Independent Hash Functions
Sampling-Based Algorithm for F0 (Gibbons and Tirthapura 2001)
D = distinct elements in the stream
S0 = U = {1,2,3,…,n}
S1: each element of S0 kept with p = 1/2, e.g. S1 = {2,4,7,…}
S2: each element of S1 kept with p = 1/2, e.g. S2 = {4,7,11,…}
The algorithm maintains D ∩ S1, D ∩ S2, …
S0, S1, S2, … are stored implicitly using hash functions
Distinct Sampling: worked example (target workspace = 4 numbers)
• Sample = {}, p = 1
• Sample = {5}, p = 1
• Sample = {5,3}, p = 1
• Sample = {5,3,7}, p = 1
• Sample = {5,3,7}, p = 1
• Sample = {5,3,7,6}, p = 1
• Sample = {5,3,7,6,8}, p = 1: overflow, so Sample = Sample ∩ S1 = {3,6,8}, p = ½
• Sample = {3,6,8,9}, p = ½
• Sample = {3,6,8,9}, p = ½ (same decision for both)
• Sample = {3,6,8,9,2}, p = ½: overflow, so Sample = Sample ∩ S2 = {6,9}, p = ¼
• Finally, Sample = {6,9}, p = ¼; estimate of F0 = (sample size)(4) = 8
Counting Distinct Elements
• Finally, return a sample of distinct elements of the stream of a “large enough” size
• If target workspace = O((1/ε²) log(1/δ)) integers, then the estimate of F0 is an (ε, δ)-approximation
• Hash functions need only be pairwise independent and can be stored in small space
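The procedure can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `DistinctSampler`, the choice of the Mersenne prime 2^61 − 1, and the level thresholds floor(p / 2^i) are all assumptions made for the example.

```python
import random


class DistinctSampler:
    """Keeps D ∩ S_level, where x is in S_i iff h(x) < p >> i (assumed thresholds)."""

    def __init__(self, capacity, prime=(1 << 61) - 1, seed=None):
        rng = random.Random(seed)
        self.p = prime
        self.a = rng.randrange(1, prime)   # h(x) = (a*x + b) mod p
        self.b = rng.randrange(prime)
        self.capacity = capacity           # target workspace, in numbers
        self.level = 0                     # sampling probability is 2 ** -level
        self.sample = set()

    def _in_level(self, x, level):
        return (self.a * x + self.b) % self.p < (self.p >> level)

    def add(self, x):
        if not self._in_level(x, self.level):
            return                          # same decision for every copy of x
        self.sample.add(x)
        while len(self.sample) > self.capacity:
            self.level += 1                 # overflow: halve the probability
            self.sample = {y for y in self.sample
                           if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


sampler = DistinctSampler(capacity=4, seed=1)
for x in [1, 1, 2, 3, 4, 1, 2]:
    sampler.add(x)
print(sampler.estimate())  # 4: no overflow occurred, so the count is exact
```

Because membership in each level depends only on the hash value of the element, duplicates and arrival order do not change the final sample for a fixed hash function.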
Sampling Using Independent Coin Tosses; Distinct Sampling Using Hash Functions
[Figure: a hash function assigning a 0/1 decision to each stream element]
Adaptive Sampling for Range-Efficient F0
• Naïve approach: given range [x,y], successively insert {x, x+1, …, y} into the F0 sampling algorithm
• Problem: time per range is very large
• Range sampling: given stream element [p,q], how to sample all elements in [p,q] quickly?
• At sampling level i, quickly compute |[p,q] ∩ Si|
Hash Functions, and S0, S1, S2, …
h(x) = (ax + b) mod p, where p is prime and a, b are random in [0, p-1]
If h(x) ∈ [0, vi], then x ∈ Si
[Figure: the universe [1, n] mapped by h onto [0, p-1], with thresholds v1, v2, v3 marked]
Range Sampling
f(x) = (ax + b) mod p
Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|
[Figure: the range [x1, x2] inside [1, n] mapped onto [0, p-1], with threshold v marked]
Arithmetic Progression
f(x) = (ax + b) mod p
f(x1), f(x1+1), …, f(x2) form an arithmetic progression modulo p with common difference a
[Figure: f(x1) and f(x1+1) on the circle [0, p-1], with threshold v marked]
Low and High Revolutions
• In each revolution, the number of hits on [0,v] is
  • floor(v/a) (low revolution)
  • floor(v/a) + 1 (high revolution)
• Task: count the number of low and high revolutions
Starting Points of Revolutions
• Can find r = v mod a such that:
  • If the starting point of a revolution is in [0,r], then it is a high revolution
  • Else it is a low revolution
• Task: count the number of revolutions with starting point in [0,r]
Recursive Algorithm
Observation: starting points form an arithmetic progression on the modulo-a circle with difference −(p mod a)
[Figure: the modulo-p circle [0, p-1] and the modulo-a circle [0, a-1], each with r marked]
Recursive Algorithm
• Focus on the common difference
• Two reductions possible: from common difference a to common difference a − (p mod a), or to common difference (p mod a)
• At least one of the two common differences is at most a/2
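The claims about revolutions can be checked by brute force on small parameters. The sketch below assumes 0 < a < p, 0 ≤ v < p, and a first starting point in [0, a-1]; the function names are illustrative. It walks the progression one revolution at a time, checks that each revolution hits [0, v] either floor(v/a) or floor(v/a) + 1 times, and checks that consecutive starting points differ by −(p mod a) modulo a.

```python
def revolutions(a, p, start, count):
    """Yield (starting_point, points) for successive trips around the mod-p circle."""
    s = start
    for _ in range(count):
        points = list(range(s, p, a))   # s, s + a, s + 2a, ... while below p
        yield s, points
        s = points[-1] + a - p          # wrap around: first point of next trip


def check_revolutions(a, p, v, start=0, count=50):
    low = v // a
    previous_start = None
    for s, points in revolutions(a, p, start, count):
        hits = sum(1 for y in points if y <= v)
        assert hits in (low, low + 1)                   # low or high revolution
        if previous_start is not None:
            assert (s - previous_start) % a == (-p) % a  # step of -(p mod a)
        previous_start = s
    # the smaller of the two candidate common differences for the recursion
    return min(p % a, a - p % a)


print(check_revolutions(a=7, p=101, v=30))  # 3, which is at most 7 / 2
```

Since the smaller difference is at most half of a, recursing on it shrinks the common difference geometrically, which is where the logarithmic running time comes from.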
Range Sampling
Theorem: there is an algorithm for sampling a range [x,y] using 2-way independent hash functions with
• Time complexity O(log (y-x))
• Space complexity O(log (y-x) + log m)
Plug back into distinct sampling to get the range-efficient F0 algorithm
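The revolution-based recursion from the slides is not reproduced here. As a stand-in with the same flavor (a Euclid-style recursion that runs in logarithmic time), the sketch below counts the hits with the standard floor-sum technique, using the identity [y mod p ≤ v] = floor(y/p) − floor((y + p − v − 1)/p) + 1 for y ≥ 0 and 0 ≤ v < p. The function names are illustrative, and this is a swapped-in method, not the algorithm of Pavan and Tirthapura.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0..n-1, assuming a, b >= 0 and m > 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        n, b = divmod(y_max, m)    # swap the roles of slope and modulus
        m, a = a, m


def count_hits(a, b, p, v, x1, x2):
    """Number of x in [x1, x2] with (a*x + b) mod p in [0, v], for 0 <= v < p."""
    n = x2 - x1 + 1
    if n <= 0:
        return 0
    a %= p
    b0 = (a * x1 + b) % p          # hash value of the first element of the range
    return floor_sum(n, p, a, b0) - floor_sum(n, p, a, b0 + p - v - 1) + n


print(count_hits(a=3, b=1, p=7, v=2, x1=0, x2=6))  # 3
```

In the example, the hash values of 0..6 are 1, 4, 0, 3, 6, 2, 5, and exactly three of them (1, 0, 2) fall in [0, 2].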
Results
Input stream: a sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 ≤ li ≤ ri < n, and li, ri are integers
Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|
• Randomized (ε, δ)-approximation algorithm for range-efficient F0 of a data stream
• Processing time (n is the size of the universe):
  • Amortized processing time per interval: O(log(1/δ) log(n/ε))
  • Time to answer a query for F0 is a constant
• Workspace: O((1/ε²) log(1/δ) log n)
Pavan, Tirthapura, SICOMP (to appear)
Prior Work
• Bar-Yossef, Kumar, Sivakumar 2002
  • First studied range-efficient F0
  • Algorithms with higher space complexity
• Cormode, Muthukrishnan 2003
  • Max-dominance norm
• Nath, Gibbons, Seshan, Anderson 2004
  • Duplicate-insensitive sum assuming ideal hash functions
Comparison
[Table: comparison with prior work; contents not preserved in this transcript]
Other Applications of Distinct Sampling
• Sample of distinct elements of the stream of any desired target size
• Approximate median of all distinct elements in the stream (duplicate-insensitive median)
• Distinct frequent elements (“heavy hitters” in network monitoring)
Update Streams
• Insertions and deletions of elements into the stream: (11, +1), (7, +3), (4, +2), (7, -2), (11, -1), …
• Distinct elements problem: how many elements have a positive cumulative weight?
• Assume a “sanity constraint”: no element has weight less than 0
• The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger
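To make the problem definition concrete, here is an exact (not small-space) computation of the answer on the example updates shown above. It keeps one counter per element, which is precisely what a streaming algorithm cannot afford; the function name is illustrative.

```python
from collections import defaultdict


def distinct_with_positive_weight(updates):
    """Exact count of elements whose cumulative weight is positive."""
    weight = defaultdict(int)
    for element, delta in updates:
        weight[element] += delta
        if weight[element] < 0:
            raise ValueError("sanity constraint violated: negative weight")
    return sum(1 for w in weight.values() if w > 0)


updates = [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]
print(distinct_with_positive_weight(updates))  # 2 (elements 7 and 4)
```

Element 11 is inserted and then fully deleted, so it no longer counts. A sample that had already discarded other elements in order to keep 11 has no way to get them back, which is the failure the last bullet describes.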
Distinct Sampling on Update Streams (three independent approaches)
• Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003; followed up by Ganguly 2005 and Ganguly, Majumder 2006
• Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005
• Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in Dynamic Data Streams and Applications. SoCG 2005
Distinct Elements on Update Streams
Use of the K-set structure in storing samples
• Ganguly, Garofalakis, Rastogi 2003
• Ganguly 2005
• Ganguly, Majumder 2006
K-Set Structure
• Small-space data structure for a multiset S (size Õ(K))
• Operations
  • Insert (x, v) into S
  • Delete (x, v') from S
  • Membership query (is x in S?)
  • What is the number of distinct elements in S?
• If |S| ≤ K, then queries are answered correctly
[Figure: states labeled Active, Silent, Active relative to the threshold K]
Counting Distinct Elements on Update Streams
• Sample the stream at different probabilities: 1, ½, ¼, …
• Store each of (D ∩ S0, D ∩ S1, D ∩ S2, …) in a k-set structure for an appropriate value of k
• When queried, use the highest-probability sample that hasn’t overflowed yet
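The multi-level scheme can be illustrated with the sketch below. An important caveat: each level uses a plain dictionary as an exact stand-in for a k-set structure, so the sketch does not have the small-space guarantee; it only shows how levels are maintained and queried. The class name, the prime 2^61 − 1, the number of levels, and the thresholds floor(p / 2^i) are assumptions for the example.

```python
import random


class UpdateStreamCounter:
    """One weight table per sampling level; dicts stand in for k-set structures."""

    def __init__(self, k, levels=32, prime=(1 << 61) - 1, seed=None):
        rng = random.Random(seed)
        self.k, self.p = k, prime
        self.a = rng.randrange(1, prime)        # h(x) = (a*x + b) mod p
        self.b = rng.randrange(prime)
        self.tables = [dict() for _ in range(levels)]

    def _deepest_level(self, x):
        h = (self.a * x + self.b) % self.p
        level = 0
        while level + 1 < len(self.tables) and h < (self.p >> (level + 1)):
            level += 1
        return level                             # x is in S_0 .. S_level

    def update(self, x, delta):
        for table in self.tables[: self._deepest_level(x) + 1]:
            w = table.get(x, 0) + delta
            if w:
                table[x] = w
            else:
                table.pop(x, None)               # weight zero: element is gone

    def estimate(self):
        for level, table in enumerate(self.tables):
            if len(table) <= self.k:             # highest probability, no overflow
                return len(table) * (1 << level)
        return len(self.tables[-1]) * (1 << (len(self.tables) - 1))


counter = UpdateStreamCounter(k=4, seed=3)
for x, delta in [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]:
    counter.update(x, delta)
print(counter.estimate())  # 2: level 0 holds at most k elements, so it is exact
```

Because every level is maintained at all times, deletions can bring a higher-probability level back under k, which is the property the single adaptive sample lacks.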
Distributed Streams
[Figure: Alice observes Stream A (11 54 21 11 2 45 21 1 …) and Bob observes Stream B (1 5 21 2 54 21 35 …), each with Workspace = $$; they send Sketch(A) and Sketch(B) to a Referee, who computes Dup-Ins-Sum(A,B)]
Summary
Distinct Sampling, extended to:
• Range-Efficiency (range sampling)
• Update Streams (k-set structure)
• Sliding Windows (multiple samples)
Open Questions
• Can we efficiently handle higher-dimensional ranges?
• Klee’s measure problem in the streaming model
Open Questions
• Range-efficient F0 under update streams
• Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk