
Algorithms for massive data sets

Algorithms for massive data sets. Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes). Negative Result for Sampling [ Charikar, Chaudhuri, Motwani, Narasayya 2000 ].

Presentation Transcript


  1. Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)

  2. Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]
     Theorem: Let E be an estimator for D(X), the number of distinct values in X, that examines r < n values in X, possibly in an adaptive and randomized order. Then, for any $\gamma > e^{-r}$, E must have ratio error at least $\sqrt{\frac{n-r}{2r}\ln\frac{1}{\gamma}}$ with probability at least $\gamma$.
     • Example
       • Say, r = n/5
       • Error of about 20% (ratio ≈ 1.18) with probability 1/2
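The example on this slide can be checked numerically. The sketch below assumes the bound in the form given in the cited paper, namely ratio error at least sqrt((n-r)/(2r) · ln(1/γ)) with probability at least γ, for γ > e^{-r}. The function name and the choice n = 1,000,000 are illustrative, not from the lecture.

```python
import math

def ratio_error_lower_bound(n: int, r: int, gamma: float) -> float:
    """Lower bound sqrt((n - r) / (2r) * ln(1/gamma)) on the ratio error.

    Assumes the theorem's precondition e^{-r} < gamma < 1 and 0 < r < n.
    """
    if not (0 < r < n):
        raise ValueError("need 0 < r < n")
    if not (math.exp(-r) < gamma < 1):
        raise ValueError("need e^{-r} < gamma < 1")
    return math.sqrt((n - r) / (2 * r) * math.log(1 / gamma))

# r = n/5 and gamma = 1/2, as in the slide's example.
bound = ratio_error_lower_bound(n=1_000_000, r=200_000, gamma=0.5)
print(round(bound, 3))  # 1.177, i.e. an error of roughly 20%
```

With r = n/5 the factor (n-r)/(2r) equals 2 regardless of n, so the bound is sqrt(2 ln 2) for any table size.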

  3. Scenario Analysis
     Scenario A:
     • all values in X are identical (say V)
     • D(X) = 1
     Scenario B:
     • distinct values in X are {V, W1, …, Wk}
     • V appears n-k times
     • each Wi appears once
     • Wi’s are randomly distributed
     • D(X) = k+1

  4. Proof
     • Little Birdie – tells E that the input is one of Scenarios A or B only
     • Suppose
       • E examines elements X(1), X(2), …, X(r) in that order
       • choice of X(i) could be randomized and depend arbitrarily on values of X(1), …, X(i-1)
     • Lemma: $P[\,X(i)=V \mid X(1)=X(2)=\dots=X(i-1)=V\,] \ge \frac{n-i-k+1}{n-i+1}$
     • Why?
       • No information on whether Scenario A or B
       • Wi values are randomly distributed, so in Scenario B exactly k of the n-i+1 unexamined positions hold some Wi

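  5. Proof (continued)
     • Define EV – event {X(1)=X(2)=…=X(r)=V}
     • $P[EV] = \prod_{i=1}^{r} P[\,X(i)=V \mid X(1)=\dots=X(i-1)=V\,] \ge \prod_{i=1}^{r} \frac{n-i-k+1}{n-i+1} \ge \left(1-\frac{k}{n-r}\right)^{r} \ge \exp\left(-\frac{2kr}{n-r}\right)$
     • Last inequality because $1 - Z \ge e^{-2Z}$ for $0 \le Z \le 1/2$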

  6. Proof (conclusion)
     • Choose $k = \frac{n-r}{2r}\ln\frac{1}{\gamma}$ to obtain $P[EV] \ge \gamma$
     • Thus:
       • Scenario A ⇒ P[EV] = 1
       • Scenario B ⇒ P[EV] ≥ γ
     • Suppose
       • E returns estimate Z when EV happens
       • Scenario A ⇒ D(X)=1
       • Scenario B ⇒ D(X)=k+1
       • Z must have worst-case ratio error > $\sqrt{k}$
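The chain of inequalities behind the proof can be checked on concrete numbers. This sketch assumes the standard form of the argument: in Scenario B the probability that all r probes return V is at least the product of (n-i-k+1)/(n-i+1), which is at least (1 - k/(n-r))^r, which is at least exp(-2kr/(n-r)), which is at least γ when k = (n-r)/(2r) · ln(1/γ). Rounding k down to an integer and the values n = 10,000, r = 100, γ = 0.1 are choices made here for illustration.

```python
import math

def proof_chain(n: int, r: int, gamma: float):
    """Return k and the three successive lower bounds on P[EV] in Scenario B."""
    # k from the proof, rounded down so that it is a valid number of values.
    k = math.floor((n - r) / (2 * r) * math.log(1 / gamma))
    # Product of the per-step bounds (n-i-k+1)/(n-i+1) for i = 1..r.
    product = 1.0
    for i in range(1, r + 1):
        product *= (n - i - k + 1) / (n - i + 1)
    z = k / (n - r)                 # must be at most 1/2 for the last step
    power = (1 - z) ** r            # (1 - k/(n-r))^r
    expo = math.exp(-2 * z * r)     # exp(-2kr/(n-r))
    return k, product, power, expo

k, product, power, expo = proof_chain(n=10_000, r=100, gamma=0.1)
print(k, product >= power >= expo >= 0.1)
```

Each quantity is a weaker lower bound than the one before it, which is why the comparison runs left to right.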

  7. Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])
     Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U = {1,2,…,u}, computes a number d’ using O(log u) memory bits, such that the probability that max(d’/d, d/d’) > c is at most 2/c.
     • A bit vector BV will represent the set
     • Let b be the smallest integer s.t. 2^b > u. Let F = GF(2^b). Let r, s be random from F.
     • For a in A, let h(a) = r·a + s = 101****10…0, where k is the number of trailing zeros. Set the k’th bit of BV.
     • Estimate is 2^{max bit set}.
     [Figure: bit vector of b bits, e.g. 0000101010001001111, with bit k marked; plot of Pr(h(a)=k) against k]
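A minimal sketch of this estimator follows. It makes one substitution that should be stated plainly: instead of arithmetic in GF(2^b), it uses a linear hash modulo a Mersenne prime and keeps the low b bits. That is simpler to write but does not carry the exact guarantees of the theorem. The names `fm_estimate` and `trailing_zeros` are illustrative.

```python
import random

# Assumption: a prime modulus stands in for the field GF(2^b) used on the slide.
P = (1 << 61) - 1

def trailing_zeros(x: int, width: int) -> int:
    """Number of trailing zero bits of x; defined as `width` when x == 0."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream, u: int, seed: int = 0) -> int:
    """Estimate the number of distinct elements as 2^(highest bit set in BV)."""
    b = u.bit_length()                 # smallest b with 2^b > u
    rng = random.Random(seed)
    r = rng.randrange(1, P)
    s = rng.randrange(P)
    bv = 0                             # the bit vector BV, held in one integer
    for a in stream:
        h = ((r * a + s) % P) & ((1 << b) - 1)   # keep the low b bits
        bv |= 1 << trailing_zeros(h, b)          # set the k'th bit
    if bv == 0:
        return 0                       # empty stream
    return 1 << (bv.bit_length() - 1)

print(fm_estimate(range(1, 501), u=1000, seed=3))
```

Because only the set of bits matters, repeating an element never changes the estimate, which is the property that makes the method a distinct-count estimator.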

  8. Randomized Approximation (2) (based on [Indyk-Motwani 1998])
     • Algorithm SM – For fixed t, is D(X) >> t?
       • Choose hash function h: U → [1..t]
       • Initialize answer to NO
       • For each x_i, if h(x_i) = t, set answer to YES
     • Theorem:
       • If D(X) < t, P[SM outputs NO] > 0.25
       • If D(X) > 2t, P[SM outputs NO] < 0.136 (≈ 1/e^2)

  9. Analysis
     • Let – Y be set of distinct elements of X
     • SM(X) = NO ⇔ no element of Y hashes to t
     • P[element hashes to t] = 1/t
     • Thus – P[SM(X) = NO] = $(1 - 1/t)^{|Y|}$
     • Since |Y| = D(X),
       • If D(X) < t, P[SM(X) = NO] > $(1 - 1/t)^{t}$ ≥ 0.25 (for t ≥ 2)
       • If D(X) > 2t, P[SM(X) = NO] < $(1 - 1/t)^{2t}$ < 1/e^2
     • Observe – need 1 bit memory only!
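The two thresholds can be checked directly from the closed form P[SM(X) = NO] = (1 - 1/t)^D(X), which holds when each distinct element hashes to t independently with probability 1/t. The sketch below also includes a one-bit SM instance. Its linear hash modulo a prime is a stand-in chosen here and is only approximately uniform on [1..t]. The class and function names are illustrative.

```python
import math
import random

P = (1 << 61) - 1   # assumption: prime modulus for the stand-in hash

def p_no(d: int, t: int) -> float:
    """P[SM outputs NO] when d distinct values each hash to t with prob. 1/t."""
    return (1 - 1 / t) ** d

class SM:
    """One instance of algorithm SM; `answer` is the single bit of state."""

    def __init__(self, t: int, seed: int = 0):
        rng = random.Random(seed)
        self.t = t
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(P)
        self.answer = False              # NO

    def update(self, x: int) -> None:
        # h(x) in [1..t]; flip the bit to YES when h(x) == t.
        if ((self.a * x + self.b) % P) % self.t + 1 == self.t:
            self.answer = True

for t in (2, 10, 1000):
    print(t, p_no(t - 1, t) > 0.25, p_no(2 * t + 1, t) < math.exp(-2))
```

The bit of state excludes the space for the hash function itself, which the slide's "1 bit" remark also leaves out.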

  10. Boosting Accuracy
      • With 1 bit ⇒ can probabilistically distinguish D(X) < t from D(X) > 2t
      • Running O(log 1/δ) instances in parallel ⇒ reduces error probability to any δ > 0
      • Running O(log n) in parallel for t = 1, 2, 4, 8, …, n ⇒ can estimate D(X) within factor 2
      • Choice of factor 2 is arbitrary ⇒ can use factor (1+ε) to reduce error to ε
      • EXERCISE – Verify that we can estimate D(X) within factor (1±ε) with probability (1-δ) using space $O\left(\frac{\log n}{\epsilon^{2}}\log\frac{1}{\delta}\right)$

  11. Sampling: Basics
      • Idea: A small random sample S of the data often well-represents all the data
        • For a fast approx answer, apply the query to S & “scale” the result
        • E.g., R.a is {0,1}, S is a 20% sample
            select count(*) from R where R.a = 0
            select 5 * count(*) from S where S.a = 0
          R.a: 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 (rows in S were marked in red on the slide)
          Est. count = 5*2 = 10, Exact count = 10
      • Leverage extensive literature on confidence intervals for sampling
        • Actual answer is within the interval [a,b] with a given probability
        • E.g., 54,000 ± 600 with prob ≥ 90%
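The scaled-count example can be reproduced in a few lines. The column values are the 30 values of R.a from the slide. The transcript does not record which rows were in the sample, so the six row indices below are hypothetical, picked so that the sample contains two zeros as in the slide's estimate.

```python
# The 30 values of R.a as listed on the slide.
R_a = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Hypothetical 20% sample (6 of 30 rows); the real sampled rows are unknown.
sample_rows = [0, 2, 5, 9, 13, 20]
S_a = [R_a[i] for i in sample_rows]

scale = len(R_a) // len(S_a)   # 5, the inverse of the sampling rate

# select count(*) from R where R.a = 0
exact = sum(1 for a in R_a if a == 0)
# select 5 * count(*) from S where S.a = 0
estimate = scale * sum(1 for a in S_a if a == 0)

print(exact, estimate)  # 10 10
```

A different sample of six rows would generally give a different estimate. The agreement here follows from the choice of rows, not from any guarantee.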

  12. Sampling versus Counting
      • Observe
        • Count merely abstraction – need subsequent analytics
        • Data tuples – X merely one of many attributes
        • Databases – selection predicate, join results, …
        • Networking – need to combine distributed streams
      • Single-pass Approaches
        • Good accuracy
        • But gives only a count -- cannot handle extensions
      • Sampling-based Approaches
        • Keeps actual data – can address extensions
        • Strong negative result

  13. Distinct Sampling for Streams [Gibbons 2001]
      • Best of both worlds
        • Good accuracy
        • Maintains “distinct sample” over stream
        • Handles distributed setting
      • Basic idea
        • Hash – random “priority” for domain values
        • Tracks highest priority values seen
        • Random sample of tuples for each such value
        • Relative error ε with probability 1-δ

  14. Hash Function
      • Domain U = [0..m-1]
      • Hashing
        • Random A, B from U, with A>0
        • g(x) = Ax + B (mod m)
        • h(x) – # leading 0s in binary representation of g(x)
      • Clearly – $0 \le h(x) \le \log m$
      • Fact: $P[h(x) = l] = 2^{-(l+1)}$
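The level function can be sketched as below, under two assumptions that the slide does not state: m is a power of two, and g(x) is written with exactly log2(m) bits, so that "leading zeros" is well defined. With an odd A, g is a bijection on [0..m-1], so the level counts over the whole domain halve from one level to the next, in line with level l occurring with probability 2^-(l+1).

```python
from collections import Counter

def level(x: int, A: int, B: int, m: int) -> int:
    """h(x): leading zeros of g(x) = (A*x + B) mod m in a log2(m)-bit word.

    Assumes m is a power of two.
    """
    bits = m.bit_length() - 1          # log2(m)
    g = (A * x + B) % m
    return bits - g.bit_length()       # g == 0 gives the maximum level, log2(m)

# Illustrative parameters: m = 16, odd A so that g permutes [0..15].
m, A, B = 16, 5, 3
histogram = Counter(level(x, A, B, m) for x in range(m))
print(sorted(histogram.items()))  # [(0, 8), (1, 4), (2, 2), (3, 1), (4, 1)]
```

Half of the domain lands on level 0, a quarter on level 1, and so on. The top level collects the single value with g(x) = 0.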

  15. Overall Idea
      • Hash ⇒ random “level” for each domain value
      • Compute level for stream elements
      • Invariant
        • Current Level – cur_lev
        • Sample S – all distinct values scanned so far of level at least cur_lev
      • Observe
        • Random hash ⇒ random sample of distinct values
        • For each value ⇒ can keep sample of their tuples

  16. Algorithm DS (Distinct Sample)
      • Parameters – memory size M
      • Initialize – cur_lev ← 0; S ← empty
      • For each input x
        • L ← h(x)
        • If L ≥ cur_lev then add x to S
        • If |S| > M
          • delete from S all values of level cur_lev
          • cur_lev ← cur_lev + 1
      • Return $2^{cur\_lev} \cdot |S|$

  17. Analysis
      • Invariant – S contains all values x such that h(x) ≥ cur_lev
      • By construction: P[h(x) ≥ cur_lev] = 2^{-cur_lev}
      • Thus: E[|S|] = D(X) / 2^{cur_lev}
      • EXERCISE – verify deviation bound
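Algorithm DS and its invariant can be sketched as follows. Several details are assumptions made here rather than taken from the slides: the domain size is 2^32, A is forced to be odd, a value is kept when its level is at least cur_lev (matching the stated invariant), the eviction step loops until the sample fits, and the returned estimate is |S| · 2^cur_lev. The class name `DistinctSampler` is illustrative.

```python
import random

class DistinctSampler:
    """Sketch of a distinct sample over a stream of integers in [0, 2^log_m)."""

    def __init__(self, M: int, log_m: int = 32, seed: int = 0):
        rng = random.Random(seed)
        self.M = M                                  # memory size
        self.log_m = log_m
        self.m = 1 << log_m
        self.A = rng.randrange(1, self.m) | 1       # odd A (assumption)
        self.B = rng.randrange(self.m)
        self.cur_lev = 0
        self.S = set()

    def level(self, x: int) -> int:
        g = (self.A * x + self.B) % self.m
        return self.log_m - g.bit_length()          # leading zeros of g(x)

    def add(self, x: int) -> None:
        if self.level(x) >= self.cur_lev:
            self.S.add(x)
        while len(self.S) > self.M:
            # Drop every value sitting exactly at the current level, then raise it.
            self.S = {v for v in self.S if self.level(v) > self.cur_lev}
            self.cur_lev += 1

    def estimate(self) -> int:
        return len(self.S) * (1 << self.cur_lev)

ds = DistinctSampler(M=8, seed=1)
for x in range(1000):
    ds.add(x)
print(ds.cur_lev, len(ds.S), ds.estimate())
```

While fewer than M distinct values have been seen, cur_lev stays at 0 and the estimate is exact. After that, each level increase halves the expected sample size.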

  18. Hot list queries
      • Why is it interesting:
        • Top ten – best seller list
        • Load balancing
        • Caching policies

  19. Hot list queries
      • Let’s use sampling
      [Figure: stream edoejddkaklsadkjdkdkpryekfvcuszldfoasd; sample djkkdkvza; the same sample with counts: k3 d2 j v z a]

  20. Hot list queries
      • The question is:
        • How to sample if we don’t know our sample size?

  21. Gibbons & Matias’ algorithm
      • Produced values: a b c d a b a c a a b d b a d d
      • p = 1.0
      [Figure: hot list of four slots holding a, b, c, d, each with a counter; counters start at 0 0 0 0]

  22. Gibbons & Matias’ algorithm
      • A new value, e, is produced
      • Need to replace one value
      • p = 1.0

  23. Gibbons & Matias’ algorithm
      • Multiply p with some amount f (f = 0.75)
      • Throw biased coins with probability f
      • Replace counts by number of seen heads
      • p = 0.75

  24. Gibbons & Matias’ algorithm
      • Replace a value which has zero count (the hot list becomes a, b, e, d)
      • p = 0.75
      • Count/p is an estimate of the number of times a value has been seen
      • E.g., the value ‘a’ has count 4, so it has been seen about 4/p = 5.33 times
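The steps on these slides can be sketched as a small class. This follows the simplified description in the transcript (multiply p by f, replace each count by the number of heads in that many coin flips, drop values whose count reaches zero), not necessarily the exact procedure of the Gibbons and Matias paper. One deliberate difference: the new value is inserted first and the list is then shrunk until it fits, instead of replacing a zero-count slot directly. The names `HotList`, `capacity`, and `estimate` are illustrative.

```python
import random

class HotList:
    """Counting-sample sketch: values with counters, kept at a sampling rate p."""

    def __init__(self, capacity: int, f: float = 0.75, seed: int = 0):
        self.capacity = capacity
        self.f = f
        self.p = 1.0
        self.counts = {}
        self.rng = random.Random(seed)

    def add(self, x) -> None:
        if x in self.counts:
            self.counts[x] += 1          # already tracked: count every occurrence
            return
        if self.rng.random() >= self.p:
            return                       # new value, not selected at rate p
        self.counts[x] = 1
        while len(self.counts) > self.capacity:
            self._shrink()

    def _shrink(self) -> None:
        self.p *= self.f                 # lower the sampling rate
        for v in list(self.counts):
            # One biased coin per counted occurrence; keep the number of heads.
            heads = sum(self.rng.random() < self.f for _ in range(self.counts[v]))
            if heads == 0:
                del self.counts[v]
            else:
                self.counts[v] = heads

    def estimate(self, x) -> float:
        return self.counts.get(x, 0) / self.p   # count / p

hot = HotList(capacity=4)
for value in "abcdabacaabdbadd":
    hot.add(value)
print(hot.p, hot.counts)
```

As long as the number of distinct values does not exceed the capacity, p stays at 1.0 and every count is exact. The first overflow is what triggers the coin flips.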

  25. Counters
      • How many bits are needed to count?
      • Prefix code
      • Approximate counters

  26. Rarity
      • Paul goes fishing.
      • There are many different fish species U = {1,…,u}
      • Paul catches one fish at a time, a_t ∈ U
      • C_t[j] = |{i | a_i = j, i ≤ t}| is the number of times Paul has caught species j up to time t
      • Species j is rare at time t if it appears only once
      • ρ[t] = |{j | C_t[j] = 1}| / u
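The definition of rarity translates directly into code. The sketch computes it exactly by counting every species, which needs space proportional to the number of species seen. The symbol ρ, the function name, and the sample stream are choices made here for illustration.

```python
from collections import Counter

def rarity(stream, u: int) -> float:
    """Fraction of the u species that have been caught exactly once so far."""
    counts = Counter(stream)                        # C_t[j] for every species j seen
    rare = sum(1 for c in counts.values() if c == 1)
    return rare / u

# Species 1 and 3 were caught once, species 2 twice, out of u = 5 species.
print(rarity([1, 2, 2, 3], u=5))  # 0.4
```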

  27. Rarity
      • Why is it interesting?

  28. Again let’s use sampling
      • U = {1,2,3,4,5,6,7,8,9,10,11,12,…,u}
      • U’ = {4,9,13,18,24}
      • X_t[i] = |{j | a_j = U’[i], j ≤ t}|

  29. Again let’s use sampling
      • X_t[i] = |{j | a_j = U’[i], j ≤ t}|
      • Reminder: ρ[t] = |{j | C_t[j] = 1}| / u
      • ρ’[t] = |{i | X_t[i] = 1}| / k, where k = |U’|
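The sampled estimator can be sketched in the same style: only the k species in the sample U’ are counted, and the estimate is the fraction of them seen exactly once. The sample {4, 9, 13, 18, 24} is the one shown on the slide. The stream and the function name are made up for illustration.

```python
from collections import Counter

def rarity_estimate(stream, sample_species) -> float:
    """Fraction of the sampled species that appear exactly once in the stream."""
    watched = set(sample_species)                        # U', with k = |U'|
    x = Counter(a for a in stream if a in watched)       # X_t[i] per sampled species
    return sum(1 for s in watched if x[s] == 1) / len(watched)

# Species 4 and 13 appear once, 9 twice, 18 and 24 never: 2 of 5 are rare.
print(rarity_estimate([4, 9, 9, 13, 2, 2, 7], [4, 9, 13, 18, 24]))  # 0.4
```

Only k counters are kept, whatever the size of U, which is the point of sampling the species instead of the stream.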

  30. Rarity
      • But ρ[t] needs to be at least 1/k to get a good estimator.

  31. Min-wise independent hash functions
      • A family H of hash functions [n] → [n] is called min-wise independent
      • if for any X ⊆ [n] and x ∈ X: $\Pr_{h \in H}[\,h(x) = \min h(X)\,] = \frac{1}{|X|}$
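The usual statement of this property is that every element of X is equally likely to receive the smallest hash value, i.e. with probability 1/|X|. A brute-force check on a tiny domain shows that the family of all permutations satisfies it. Taking [n] = {0, …, n-1} and the function name are assumptions made here.

```python
from fractions import Fraction
from itertools import permutations

def min_probability(n: int, X, x) -> Fraction:
    """P[h(x) = min h(X)] when h is drawn uniformly from all permutations of [n]."""
    perms = list(permutations(range(n)))   # each tuple h maps y to h[y]
    hits = sum(1 for h in perms if min(X, key=lambda y: h[y]) == x)
    return Fraction(hits, len(perms))

X = [0, 2, 3]
print([min_probability(4, X, x) for x in X])  # each equals 1/3 = 1/|X|
```

The full permutation family has n! members, so it is only practical for illustration. Smaller families with the same or an approximate version of the property are what make the idea usable.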
