
CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results


Presentation Transcript


  1. CS 410/510 Data Streams, Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier Data Streams: Lecture 16

  2. Data Stream Sampling • Sampling provides a synopsis of a data stream • Sample can serve as input for • Answering queries • “statistical inference about the contents of the stream” • “variety of analytical procedures” • Focus on: obtaining a sample from the window (sample size « window size) Data Streams: Lecture 16

  3. Windows • Stationary Window • Endpoints of window fixed (think relation) • Sliding Window • Endpoints of window move • What we’ve been talking about • More complex than stationary window because elements must be removed from sample when they expire from window Data Streams: Lecture 16

  4. Simple Random Sampling (SRS) • What is a “representative” sample? • SRS for a sample of k elements from a window with n elements • Every possible sample (of size k) is equally likely, that is, has probability 1/C(n, k) • Every element is equally likely to be in the sample • Stratified Sampling • Divide window into disjoint segments (strata) • SRS over each stratum • Advantageous when stream elements close together in the stream have similar values Data Streams: Lecture 16

  5. Bernoulli Sampling • Includes each element in the sample with probability q • The sample size is not fixed; sample size is binomially distributed • Probability that the sample contains k elements is C(n, k) q^k (1-q)^(n-k) • Expected sample size is nq Data Streams: Lecture 16

  6. Binomial Distribution - Example [Chart: binomial distribution of sample size, n=20, q=0.5; x-axis: sample size, y-axis: probability] Expected Sample Size = 20*0.5 = 10 Data Streams: Lecture 16

  7. Binomial Distribution - Example [Chart: binomial distribution of sample size, n=20, q=1/3; x-axis: sample size, y-axis: probability] Expected Sample Size = 20*1/3 ≈ 6.667 Data Streams: Lecture 16

  8. Bernoulli Sampling - Implementation • Naïve: • Elements inserted with probability q (ignored with probability 1-q) • Use a sequence of pseudorandom numbers (U1, U2, U3, …), Ui ∈ [0,1] • Element ei is included if Ui ≤ q • Example (q = 0.2): elements e1 … e7 with U1=0.5, U2=0.1, U3=0.9, U4=0.8, U5=0.2, U6=0.3, U7=0.0 → Sample: e2, e5, e7 Data Streams: Lecture 16
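A minimal Python sketch of the naïve scheme (function and variable names are illustrative, not from the lecture): draw one uniform number per element and keep the element when Ui ≤ q.

```python
import random

def bernoulli_sample_naive(stream, q):
    """Naive Bernoulli sampling: keep each element independently with probability q."""
    sample = []
    for element in stream:
        if random.random() <= q:        # Ui <= q
            sample.append(element)
    return sample

# Example: expect to keep about 20% of the elements
print(bernoulli_sample_naive([f"e{i}" for i in range(1, 21)], q=0.2))
```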

  9. Bernoulli Sampling – Efficient Implementation • Calculate number of elements to be skipped after an insertion (Δi) • Pr {Δi = j} = q(1-q)^j • If you skip zero elements, must get: Ui ≤ q (pr: q) • Skip one element, must get: Ui > q, Ui+1 ≤ q (pr: (1-q)q) • Skip two elements: Ui > q, Ui+1 > q, Ui+2 ≤ q (pr: (1-q)^2 q) • Δi has a geometric distribution Data Streams: Lecture 16

  10. Geometric Distribution - Example [Chart: geometric distribution of the number of skips Δi, q = 0.2; x-axis: number of skips, y-axis: probability] Data Streams: Lecture 16

  11. Bernoulli Sampling - Algorithm Data Streams: Lecture 16
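The algorithm on this slide is an image that did not survive the transcript; below is a hedged sketch of the skip-based sampler the previous slides describe, drawing geometrically distributed skip lengths directly instead of one uniform number per element (names are mine).

```python
import math
import random

def geometric_skip(q):
    """Number of elements to skip before the next insertion: Pr{skip = j} = q(1-q)^j."""
    u = 1.0 - random.random()                   # in (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - q))

def bernoulli_sample_skips(stream, q):
    """Skip-based Bernoulli sampling: jump straight to the next included element."""
    stream = list(stream)
    sample = []
    pos = geometric_skip(q)                     # offset of the first included element
    while pos < len(stream):
        sample.append(stream[pos])
        pos += geometric_skip(q) + 1            # skip Δ elements, then include the next
    return sample
```

Both versions produce the same distribution over samples; the skip-based one simply avoids generating a random number for every stream element.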

  12. Bernoulli Sampling • Straightforward, SRS, easy to implement • But… • Sample size is not fixed! • Look at algorithms with deterministic sample size • Reservoir Sampling • Stratified Sampling • Biased Sampling Schemes Data Streams: Lecture 16

  13. Reservoir Sampling • Produces an SRS of size k from a window of length n (k is specified) • Initialize a “reservoir” using first k elements • For every following element, insert with probability pi (ignore with probability 1-pi) • pi = k/i for i>k (pi = 1 for i ≤ k) • pi changes as i increases • Remove one element (chosen uniformly at random) from the reservoir before insertion Data Streams: Lecture 16
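A minimal Python sketch of the reservoir sampler just described (illustrative names; evicting a uniformly chosen reservoir slot is the standard way to realize the k/i rule):

```python
import random

def reservoir_sample(stream, k):
    """Maintain an SRS of size k using reservoir sampling."""
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(element)                   # p_i = 1 for the first k elements
        elif random.random() < k / i:                   # insert with p_i = k/i
            reservoir[random.randrange(k)] = element    # evict a uniformly chosen slot
    return reservoir

# Example: an SRS of size 3 from e1 ... e8
print(reservoir_sample([f"e{i}" for i in range(1, 9)], k=3))
```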

  14. Reservoir Sampling • Example: sample size 3 (k=3) • Recall: pi = 1 for i ≤ k, pi = k/i for i > k • Insertion probabilities for e1 … e8: p1=p2=p3=1, p4=3/4, p5=3/5, p6=3/6, p7=3/7, p8=3/8 • [Diagram: uniform draws U4, U5, U6, … decide which arrivals enter the reservoir and which entries they replace; reservoir contents shown after each step] Data Streams: Lecture 16

  15. Reservoir Sampling - SRS • Why set pi = k/i? • Want Sj to be a SRS from Uj = {e1, e2, …, ej} • Sj is the sample from Uj • Recall SRS means every sample of size k is equally likely • Intuition: Probability that ei is included in an SRS from Ui is k/i • k is sample size, i is “window” size • k/i = (#samples containing ei)/(#samples of size k) = C(i-1, k-1)/C(i, k) Data Streams: Lecture 16

  16. Reservoir Sampling - Observations • Insertion probability (pi = k/i i>k) decreases as i increases • Also, opportunities for an element in the sample to be removed from the sample decrease as i increases • These trends offset each other • Probability of being in final sample is same for all elements in the window Data Streams: Lecture 16

  17. Other Sampling Schemes • Stratified Sampling • Divide window into strata, SRS in each stratum • Deterministic & Semi-Deterministic Schemes • e.g., sample every 10th element • Biased Sampling Schemes • Bias sample towards recently-received elements • Biased Reservoir Sampling • Biased Sampling by Halving Data Streams: Lecture 16

  18. Stratified Sampling Data Streams: Lecture 16

  19. Stratified Sampling • When elements close to each other in the window have similar values, algorithms such as reservoir sampling can have bad luck • Alternative: divide the window into strata and do SRS in each stratum • If you know there is a correlation between data values (e.g., timestamp) and position in the stream, you may wish to use stratified sampling Data Streams: Lecture 16

  20. Deterministic & Semi-deterministic Schemes • Produce a sample of size k by inserting every (n/k)th element into the sample • Simple, but not random • Can’t make statistical conclusions about window from sample • Bad if data is periodic • Can be good if data exhibits a trend • Ensures sampled elements are spread throughout the window • Example (n=18, k=6): every 3rd element of e1 … e18 is sampled Data Streams: Lecture 16

  21. Biased Reservoir Sampling • Recall: Reservoir sampling – insertion probability decreased as we got further into the window (pi = k/i) • What if pi were constant? (pi = p) • Alternative: pi decreases more slowly than k/i • Will favor recently-arrived elements • Recently-arrived elements are more likely to be in the sample than long-ago-arrived elements Data Streams: Lecture 16

  22. Biased Reservoir Sampling • For reservoir sampling, probability that ei is included in sample S: Pr{ei ∈ S} = pi ∏ j=max(i,k)+1..n (k − pj)/k • If pi is fixed, that is, set pi = p ∈ (0,1): Pr{ei ∈ S} = p ((k − p)/k)^(n − max(i,k)) • Probability that ei is in the final sample increases geometrically as i increases Data Streams: Lecture 16
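A short sketch of biased reservoir sampling with a fixed acceptance probability p (an illustrative rendering of the scheme above, not code from the lecture): the eviction step is unchanged, but the insertion probability never decays.

```python
import random

def biased_reservoir_sample(stream, k, p):
    """Biased reservoir sampling: after the reservoir fills, insert each new
    element with fixed probability p, evicting a uniformly chosen slot.
    Recent elements are therefore more likely to survive to the end."""
    reservoir = []
    for element in stream:
        if len(reservoir) < k:
            reservoir.append(element)
        elif random.random() < p:
            reservoir[random.randrange(k)] = element
    return reservoir
```

With p = 0.2, k = 10, n = 40, the closed form above gives Pr{ei ∈ S} = 0.2·(9.8/10)^(40 − max(i,10)), which is the curve plotted on the next slide.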

  23. Biased Reservoir Sampling [Chart: probability that ei is included in the final sample vs. element index i, for p=0.2, k=10, n=40; the curve is Pr{ei ∈ S} = 0.2 ((10 − 0.2)/10)^(40 − max(i,10))] Data Streams: Lecture 16

  24. Biased Sampling by Halving • Break the window into strata (Λ1, Λ2, Λ3, Λ4, …); maintain a sample of size 2k • Step 1: S = union of unbiased SRS samples of size k from Λ1 and Λ2 (e.g., use reservoir sampling) • Step 2: Sub-sample S to produce a sample of size k, then insert an SRS of size k from Λ3 into S Data Streams: Lecture 16
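A hedged sketch of one halving step (the helper name is mine): cut the current 2k-element sample to k by uniform subsampling, then top it up with an SRS of size k from the stratum that just closed.

```python
import random

def halving_step(current_sample, new_stratum, k):
    """One step of biased sampling by halving.
    current_sample: biased sample of size 2k over earlier strata
    new_stratum:    elements of the stratum that just completed
    Returns a new size-2k sample biased toward recent strata."""
    kept = random.sample(current_sample, k)    # sub-sample the old sample down to k
    fresh = random.sample(new_stratum, k)      # unbiased SRS of size k from the new stratum
    return kept + fresh
```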

  25. Sampling from Sliding Windows • Harder than sampling from a stationary window • Must remove elements from the sample as the elements expire from the window • Difficult to maintain a sample of a fixed size • Window Types: • Sequence-based windows - contain the n most recent elements (row-based windows) • Timestamp-based windows - contain all elements that arrived within the past t time units (time-based windows) • Unbiased sampling from within a window Data Streams: Lecture 16

  26. Sequence-based Windows • Wj is a window of length n, j ≥ 1 • Wj = {ej, ej+1, … ej+n-1} • Want a SRS Sj of k elements from Wj • Tradeoff between amount of memory required and degree of dependence between Sj’s Data Streams: Lecture 16

  27. Complete Resampling • Window size = 5, sample size = 2 • Maintain the full window (Wj) • Each time the window changes, use reservoir sampling to create Sj from Wj • Very expensive – memory and CPU are O(n) (n = window size) • [Diagram: stream e1 … e15 with windows W1, W2; example samples S1 = {e2, e4}, S2 = {e3, e5}] Data Streams: Lecture 16

  28. Passive Algorithm • Window size = 5, sample size = 2 • When an element in the sample expires, insert the newly-arrived element into the sample • Sj is an SRS from Wj • Sj’s are highly correlated • If S1 is a bad sample, S2 will be also… • Memory is O(k), k = sample size • [Diagram: stream e1 … e15 with windows W1, W2, W3; example samples S1 = {e2, e4}, S2 = {e2, e4}, S3 = {e7, e4}] Data Streams: Lecture 16
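A minimal sketch of the passive algorithm for a sequence-based window of the n most recent elements (names are illustrative): sample k positions from the first window, then let each new arrival replace the sampled element that expires with it.

```python
import random

def passive_samples(stream, n, k):
    """Passive algorithm: SRS of k positions in W1, then replace a sampled
    element with the new arrival whenever it expires from the window."""
    stream = list(stream)
    positions = sorted(random.sample(range(n), k))      # sampled positions in W1
    per_window = []
    for i in range(n, len(stream) + 1):                 # window = stream[i-n:i]
        per_window.append([stream[p] for p in positions])
        if i < len(stream) and positions[0] == i - n:   # oldest sampled element expires
            positions = positions[1:] + [i]             # the arriving element replaces it
    return per_window
```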

  29. Chain Sampling (Babcock, et al.) • Improved independence properties compared to passive algorithm • Expected memory usage: O(k) • Basic algorithm – maintains sample of size 1 • Get sample of size k, by running k chain-samplers Data Streams: Lecture 16

  30. Chain Sampling - Issue • Behaves as a reservoir sampler for the first n elements • Insert additional elements into the sample with probability 1/n • [Diagram: stream e1 … e5 with windows W1–W3 of size 3; insertion probabilities p1=1, p2=1/2, p3=1/3, p4=1/3; the sampled element is about to expire from the window] • Now, what do we do? Data Streams: Lecture 16

  31. Chain Sampling - Solution • When ei is selected for inclusion in the sample, select K from {i+1, i+2, …, i+n}; eK will replace ei if ei expires while part of sample S • Know eK will be in the window when ei expires • [Diagram: windows of size 3; after e2 enters the sample, choose K ∈ {3, 4, 5} (here K=5); after e5, choose K ∈ {6, 7, 8} (here K=7); sample shown as e1, e2, e5] Data Streams: Lecture 16

  32. Chain Sampling - Summary • Expected memory consumption O(k) • Chain sampling produces an SRS with replacement for each sliding window • If we use k chain-samplers to get a sample of size k, we may get duplicates in that sample • Can oversample (use sample size k + α), then sub-sample to get a sample of size k Data Streams: Lecture 16
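A hedged Python sketch of a single chain-sampler (sample of size 1) over a sequence-based window of the n most recent elements, following the description on the previous slides; the class and its bookkeeping are my own rendering. Running k independent copies yields a size-k sample with replacement.

```python
import random
from collections import deque

class ChainSampler:
    """One chain-sampler: sample of size 1, expected O(1) memory."""

    def __init__(self, n):
        self.n = n
        self.chain = deque()      # [(index, value), ...]; head = current sample
        self.next_index = None    # stream index chosen to extend the chain

    def process(self, i, value):
        """Feed the i-th stream element (i is 1-based)."""
        # Drop the head once it expires from the window {i-n+1, ..., i};
        # its stored successor then becomes the current sample.
        while self.chain and self.chain[0][0] <= i - self.n:
            self.chain.popleft()
        if random.random() < 1.0 / min(i, self.n):
            # Reservoir behaviour for the first n elements, probability 1/n after.
            self.chain = deque([(i, value)])
            self.next_index = random.randint(i + 1, i + self.n)
        elif i == self.next_index:
            # This element will replace the chain's tail when it expires.
            self.chain.append((i, value))
            self.next_index = random.randint(i + 1, i + self.n)

    def sample(self):
        return self.chain[0][1] if self.chain else None
```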

  33. Stratified Sampling • Divide window into strata and do SRS in each stratum Data Streams: Lecture 16

  34. Stratified Sampling – Sliding Window • [Diagram: stream e1 … e16 divided into strata of size 4, with stratum samples ss1 = {e1,e2}, ss2 = {e6,e7}, ss3 = {e9,e11}, ss4 = {e14,e16}, and windows W1–W3] • Window size = 12 (n), stratum size = 4 (m), stratum sample size = 2 (k) • Wj overlaps between 3 and 4 strata (l or l+1 strata) • l = win_size/stratum_size = n/m (= 3) • The paper says the sample size is between k(l-1) and k∙l; think it should be between k(l-1) and k(l+1) Data Streams: Lecture 16

  35. Timestamp-Based Windows • Number of elements in window changes over time • Multiple elements in sample expire at once • Chain sampling relies on insertion probability = 1/n (n is window size) • Stratified Sampling – wouldn’t be able to bound sample size Data Streams: Lecture 16

  36. Priority Sampling (Babcock, et al.) • Priority Sampler maintains an SRS of size 1; use k priority samplers to get an SRS of size k • Assign a random, uniformly-distributed priority in (0,1) to each element • Current sample is the element in the window with highest priority • Keep elements for which there is no other element with both higher priority and higher (later) timestamp Data Streams: Lecture 16

  37. Priority Sampling - Example • Keep elements for which there is no element with: • higher priority and • higher (later) timestamp • [Diagram: stream e1 … e15 with priorities .1, .8, .3, .4, .7, .1, .3, .5, .2, .6, .4, .1, .5, .3, …; windows W1–W3; legend: element in sample, element stored in memory, element in window but not stored] Data Streams: Lecture 16
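A minimal sketch of one priority-sampler over a timestamp-based window (the deque representation and names are my own): keeping only the elements whose priority exceeds every later priority is exactly a monotonically decreasing deque, and its head is the current sample.

```python
import random
from collections import deque

class PrioritySampler:
    """One priority-sampler (sample of size 1) over a window of the last
    `duration` time units."""

    def __init__(self, duration):
        self.duration = duration
        self.kept = deque()   # (timestamp, priority, value), priorities decreasing

    def process(self, timestamp, value):
        priority = random.random()
        # Anything with lower priority now has a later, higher-priority element
        # after it, so it can never become the sample: discard it.
        while self.kept and self.kept[-1][1] < priority:
            self.kept.pop()
        self.kept.append((timestamp, priority, value))

    def sample(self, now):
        # Drop elements that have expired from the window; the head is then
        # the highest-priority element still in the window.
        while self.kept and self.kept[0][0] <= now - self.duration:
            self.kept.popleft()
        return self.kept[0][2] if self.kept else None
```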

  38. Inference From a Sample • What do we do with these samples? • SRS samples can be used to estimate “population sums” • If each element ei is a sales transaction and v(ei) is the dollar value of the transaction • Σ ei∈W v(ei) = total sales of transactions in W • Count: h(ei) = 1 if v(ei) > $1000, then Σ ei∈W h(ei) = number of transactions in the window for more than $1000 • Can also do average Data Streams: Lecture 16

  39. SRS Sampling • To estimate a population sum from an SRS of size k, use the expansion estimator: Θ̂ = (n/k) Σ ei∈S h(ei) • To estimate the average, use the sample average: α̂ = Θ̂/n = (1/k) Σ ei∈S h(ei) • Also works for Stratified Sampling Data Streams: Lecture 16
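The two estimators in code form (a small illustrative helper; h is any per-element function, e.g., the $1000 indicator from the previous slide):

```python
def expansion_estimates(sample, h, n):
    """From an SRS of size k out of a window of n elements, estimate the
    population sum (expansion estimator) and the population average."""
    k = len(sample)
    sample_sum = sum(h(e) for e in sample)
    theta_hat = (n / k) * sample_sum     # estimated population sum
    alpha_hat = sample_sum / k           # estimated population average
    return theta_hat, alpha_hat

# Example: estimate how many of 1,000 window transactions exceed $1000,
# from a 50-element SRS of transaction values.
srs = [120.0, 1500.0, 40.0, 2300.0, 800.0] * 10
print(expansion_estimates(srs, lambda v: 1 if v > 1000 else 0, n=1000))
```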

  40. Estimating Different Results • SRS sampling is good for estimating population sums, statistics • But, use different algorithms for different results • Heavy Hitters algorithm • Find elements (values) that occur commonly in the stream • Min-Hash Computation • set resemblance Data Streams: Lecture 16

  41. Heavy Hitters • Goal: Find all stream elements that occur in at least a fraction s of all transactions • For example, find sourceIPs that occur in at least 1% of network flows • sourceIPs from which we are getting a lot of traffic Data Streams: Lecture 16

  42. Heavy Hitters • Divide window into buckets of width w • Current bucket id bcurrent = ⌈N/w⌉, N is current stream length • Data structure D : (e, f, Δ) • e - element • f – estimated frequency • Δ – maximum possible error in f • If we are looking for common sourceIPs in a network stream • D : (sourceIP, f, Δ) Data Streams: Lecture 16

  43. Heavy Hitters • Data structure D : (e, f, Δ) • New element e: • Check if e exists in D • If so, f = f+1 • If not, new entry (e, 1, bcurrent − 1) • At bucket boundary (when bcurrent changes) • Delete all entries (e, f, Δ) with f + Δ ≤ bcurrent • If there is only one instance of e in the bucket, the entry for e is deleted • i.e., deleting items that occur ≤ once per bucket • For threshold s, output items with f ≥ (s−ε)N (w = 1/ε) (N is stream size) Data Streams: Lecture 16
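A hedged Python sketch of this bookkeeping (it is essentially lossy counting; the dictionary layout and names are mine), applied to the sourceIP example:

```python
import math

def heavy_hitters(stream, s, epsilon):
    """Report elements that occur in at least a fraction s of the stream,
    using buckets of width w = 1/epsilon and per-entry error bound delta."""
    w = math.ceil(1.0 / epsilon)
    D = {}                                   # element -> (f, delta)
    N = 0
    for e in stream:
        N += 1
        b_current = math.ceil(N / w)
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b_current - 1)
        if N % w == 0:                       # bucket boundary
            # Delete entries with f + delta <= b_current.
            D = {x: (f, d) for x, (f, d) in D.items() if f + d > b_current}
    return [x for x, (f, d) in D.items() if f >= (s - epsilon) * N]

# Example: sourceIPs appearing in at least 20% of flows, epsilon = 1%
flows = ["10.0.0.1"] * 40 + ["10.0.0.2"] * 5 + ["10.0.0.3"] * 55
print(heavy_hitters(flows, s=0.20, epsilon=0.01))
```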

  44. Min-Hash • Resemblance, ρ, of two sets A, B: ρ(A,B) = |A ∩ B| / |A ∪ B| • Min-hash signature is a representation of a set from which one can estimate the resemblance of two sets • Let h1, h2, …, hn be hash functions • si(A) = min(hi(a) | a ∈ A) (minimum hash value of hi over A) • Signature of A: S(A) = (s1(A), s2(A), …, sn(A)) Data Streams: Lecture 16

  45. Min-Hash • Resemblance estimator: ρ̂(A,B) = (1/n) Σ i=1..n I(si(A), si(B)), where I(x,y) = 1 if x=y, 0 otherwise • Count # times the min hash values are equal • Can substitute the N minimum values of one hash function for the minimum values of N hash functions • (Recall: h1, h2, …, hn are hash functions, si(A) = min(hi(a) | a ∈ A), S(A) = (s1(A), s2(A), …, sn(A)), ρ(A,B) = |A ∩ B| / |A ∪ B|) Data Streams: Lecture 16
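A small sketch of min-hash signatures and the resemblance estimator (using salted built-in hashing as the family of hash functions is my own illustrative choice):

```python
def minhash_signature(items, n=64):
    """Signature: for each of n hash functions, the minimum hash value over the set."""
    return tuple(min(hash((i, x)) for x in items) for i in range(n))

def resemblance_estimate(sig_a, sig_b):
    """Estimated resemblance = fraction of signature positions that agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

A = {"apple", "banana", "cherry", "date"}
B = {"banana", "cherry", "date", "fig"}
print(resemblance_estimate(minhash_signature(A), minhash_signature(B)))
# True resemblance |A ∩ B| / |A ∪ B| = 3/5 = 0.6
```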
