
Randomization for Massive and Streaming Data Sets


Presentation Transcript


  1. Randomization for Massive and Streaming Data Sets Rajeev Motwani CS Forum Annual Meeting

  2. Data Streams Management Systems • Traditional DBMS – data stored in finite, persistent data sets • Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, … • Emerging DSMS – variety of modern applications • Network monitoring and traffic engineering • Telecom call records • Network security • Financial applications • Sensor networks • Manufacturing processes • Web logs and clickstreams • Massive data sets

  3. DSMS – Big Picture [Diagram: input streams and stored relations feed the DSMS; registered queries produce streamed or stored results; the DSMS maintains a scratch store and an archive.]

  4. Algorithmic Issues • Computational Model • Streaming data (or, secondary memory) • Bounded main memory • Techniques • New paradigms • Negative Results and Approximation • Randomization • Complexity Measures • Memory • Time per item (online, real-time) • # Passes (linear scan in secondary memory)

  5. Stream Model of Computation [Diagram: a data stream arriving over increasing time, with a bounded main memory holding synopsis data structures.] Memory: poly(1/ε, log N) Query/Update Time: poly(1/ε, log N) N: # items so far, or window size ε: error parameter

  6. “Toy” Example – Network Monitoring [Diagram: network measurements and packet traces stream into the DSMS; registered monitoring queries produce intrusion warnings and online performance metrics; the DSMS uses a scratch store, lookup tables, and an archive.]

  7. Frequency Related Problems – Analytics on Packet Headers (IP Addresses) [Diagram: histogram of element frequencies over elements 1–20.] • How many elements have non-zero frequency? • What is the frequency of element 3? • What is the total frequency of elements between 8 and 14? • Find all elements with frequency > 0.1% • Top-k most frequent elements • Find elements that occupy 0.1% of the tail • Mean + Variance? Median?

  8. Example 1 – Distinct Values • Input Sequence X = x1, x2, …, xn, … • Domain U = {0, 1, 2, …, u-1} • Compute D(X) = number of distinct values • Remarks • Assume stream size n is finite/known (generally, n is window size) • Domain could be arbitrary (e.g., text, tuples)

  9. Naïve Approach • Counter C(i) for each domain value i • Initialize counters C(i) ← 0 • Scan X, incrementing appropriate counters • Problem • Memory size M << n • Space O(u) – possibly u >> n (e.g., when counting distinct words in a web crawl)
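
(Not on the original slides: a minimal Python sketch of the counter-per-value approach just described; the function name and toy stream are illustrative only.)

```python
# Naive exact distinct count: one counter per domain value, O(u) space.
def naive_distinct(stream, u):
    counts = [0] * u                 # C(i) <- 0 for every domain value i
    for x in stream:                 # scan X, incrementing the matching counter
        counts[x] += 1
    return sum(1 for c in counts if c > 0)

print(naive_distinct([3, 1, 3, 7, 1], u=10))   # -> 3 distinct values
```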

  10. Negative Result Theorem: Deterministic algorithms need M = Ω(n log u) bits Proof: Information-theoretic arguments Note: Leaves open randomization/approximation

  11. Randomized Algorithm [Diagram: input stream hashed via h: U → [1..t] into a hash table with chaining.] Analysis • Random h ⇒ few collisions & average list size O(n/t) • Thus • Space: O(n) – since we need t = Ω(n) • Time: O(1) per item [Expected]
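
(A minimal sketch of the hashing idea above, not the talk's exact construction: chain colliding items in t = Θ(n) buckets so each membership check touches O(n/t) = O(1) items in expectation; the linear hash here is a simple stand-in for a random hash function.)

```python
import random

def distinct_via_hashing(stream, n_upper_bound):
    t = max(1, n_upper_bound)                     # number of buckets, t = Omega(n)
    p = (1 << 31) - 1                             # prime for a simple universal-style hash
    a, b = random.randrange(1, p), random.randrange(p)
    h = lambda x: ((a * hash(x) + b) % p) % t
    buckets = [[] for _ in range(t)]              # hash table with chaining, O(n) space
    distinct = 0
    for x in stream:
        chain = buckets[h(x)]
        if x not in chain:                        # expected O(n/t) = O(1) work per item
            chain.append(x)
            distinct += 1
    return distinct

print(distinct_via_hashing([3, 1, 3, 7, 1], n_upper_bound=5))   # -> 3
```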

  12. Improvement via Sampling? • Sample-based Estimation • Random Sample R (of size r) of the n values in X • Compute D(R) • Estimator E = D(R) × n/r • Benefit – sublinear space • Cost – estimation error is high • Why? – low-frequency values underrepresented
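
(A sketch of the sample-based estimator E = D(R) · n/r described above; names are illustrative.)

```python
import random

def sampled_distinct_estimate(x, r):
    n = len(x)
    sample = random.sample(x, r)       # random sample R of r values from X
    return len(set(sample)) * n / r    # E = D(R) * n / r
```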

  13. Negative Result for Sampling • Consider any estimator E of D(X) that examines r items of X, possibly in an adaptive/randomized fashion. Theorem: Any such E must, on some inputs, incur large relative error with constant probability. • Remarks • r = n/10 ⇒ error ≥ 75% with probability ≥ ½ • Leaves open randomization/approximation on full scans

  14. Randomized Approximation • Simplified Problem – For fixed t, is D(X) >> t? • Choose hash function h: U → [1..t] • Initialize answer to NO • For each xi, if h(xi) = t, set answer to YES • Observe – only 1 bit of memory needed! • Theorem: • If D(X) < t, P[output NO] > 0.25 • If D(X) > 2t, P[output NO] < 0.14 [Diagram: input stream hashed via h into [1..t]; a Boolean flag records whether bucket t was ever hit, giving the YES/NO answer.]
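
(A sketch of the one-bit test above: hash each item into [1..t] and remember only whether bucket t was ever hit. The seeded hash is a stand-in for a truly random hash function.)

```python
import random

def looks_bigger_than(stream, t, seed=0):
    salt = random.Random(seed).randrange(1 << 30)
    h = lambda x: (hash((salt, x)) % t) + 1   # hash into {1, ..., t}
    answer = False                            # the single bit of state
    for x in stream:
        if h(x) == t:
            answer = True                     # some element hit bucket t
    return answer                             # YES suggests D(X) is large relative to t
```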

  15. Analysis • Let Y be the set of distinct elements of X • Output NO ⇔ no element of Y hashes to t • P[element hashes to t] = 1/t • Thus – P[output NO] = (1 − 1/t)^|Y| • Since |Y| = D(X): • D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25 • D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^{2t} < 1/e² ≈ 0.14

  16. Boosting Accuracy • With 1 bit we can distinguish D(X) < t from D(X) > 2t • Running O(log 1/δ) instances in parallel ⇒ error probability reduced to any δ > 0 • Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n ⇒ can estimate D(X) within factor 2 • Choice of multiplier 2 is arbitrary ⇒ can use factor (1+ε) to reduce error to ε • Theorem: Can estimate D(X) within factor (1±ε) with probability (1−δ) using space poly(1/ε, log(1/δ), log n)
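
(A sketch of the boosting step, reusing the looks_bigger_than test from the sketch after slide 14: run k independent copies per threshold t = 1, 2, 4, …, n and separate the two cases by thresholding the empirical NO-fraction between 0.14 and 0.25. The 0.2 cutoff and k are illustrative choices.)

```python
# Uses looks_bigger_than() from the sketch after slide 14.
def estimate_distinct(stream, n, k=32):
    stream = list(stream)      # sketch only; a true streaming version runs all tests in one pass
    estimate, t = 1, 1
    while t <= n:
        no_votes = sum(not looks_bigger_than(stream, t, seed=s) for s in range(k))
        # P[NO] > 0.25 when D(X) < t and P[NO] < 0.14 when D(X) > 2t, so
        # thresholding the empirical NO-fraction at ~0.2 separates the cases
        # with error probability shrinking exponentially in k.
        if no_votes / k < 0.2:
            estimate = t       # evidence that D(X) > t
        t *= 2
    return estimate            # within roughly a factor of 2 of D(X), with high probability
```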

  17. Example 2 – Elephants and Ants • Given a stream of items, identify the items whose current frequency exceeds the support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]

  18. Algorithm 1: Lossy Counting Step 1: Divide the stream into windows (Window 1, Window 2, Window 3, …) Window size W is a function of support s – specified later…

  19. Lossy Counting in Action… [Diagram: frequency counts (initially empty) + first window; at the window boundary, decrement all counters by 1.]

  20. Lossy Counting continued… [Diagram: current frequency counts + next window; at the window boundary, decrement all counters by 1.]

  21. Error Analysis How much do we undercount? If the current stream size is N and the window size is W = 1/ε, then # windows = εN, and since each counter is decremented at most once per window, the frequency error is at most εN Rule of thumb: Set ε = 10% of support s Example: Given support frequency s = 1%, set error frequency ε = 0.1%

  22. Putting it all together… Output: Elements with counter values exceeding (s−ε)N Approximation guarantees • Frequencies underestimated by at most εN • No false negatives • False positives have true frequency at least (s−ε)N How many counters do we need? • Worst-case bound: (1/ε) log(εN) counters • Implementation details… (see the sketch below)
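
(A minimal sketch of the windowed scheme exactly as these slides describe it, with one counter per tracked item and window size W = 1/ε; the published Lossy Counting algorithm uses more refined per-bucket bookkeeping.)

```python
def lossy_counting(stream, s, eps):
    """Report items whose frequency exceeds support s, with undercount at most eps*N."""
    W = int(1 / eps)                       # window size
    counts, n = {}, 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n % W == 0:                     # window boundary: decrement all counters by 1
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:       # drop counters that reach zero
                    del counts[key]
    # output elements with counter values exceeding (s - eps) * N
    return [x for x, c in counts.items() if c > (s - eps) * n]
```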

  23. Algorithm 2: Sticky Sampling [Diagram: a stream of items, e.g., 28 31 41 34 15 30 23 35 19; sampled items acquire counters.] • Create counters by sampling • Maintain exact counts thereafter What is the sampling rate?

  24. Sticky Sampling contd… For a finite stream of length N: Sampling rate = (2/εN) log(1/(sδ)), where δ = probability of failure Output: Elements with counter values exceeding (s−ε)N Approximation guarantees (probabilistic) • Frequencies underestimated by at most εN • No false negatives • False positives have true frequency at least (s−ε)N Same error guarantees as Lossy Counting, but probabilistic Same rule of thumb: Set ε = 10% of support s Example: Given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%

  25. Number of counters? For a finite stream of length N: sampling rate = (2/εN) log(1/(sδ)) For an infinite stream with unknown N: gradually adjust the sampling rate In either case, expected number of counters = (2/ε) log(1/(sδ)) – independent of N
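
(A sketch of Sticky Sampling for the finite-stream case with N known in advance, following the per-item description on these slides; the infinite-stream version would adjust the rate as the stream grows. Names are illustrative.)

```python
import math, random

def sticky_sampling(stream, N, s, eps, delta):
    rate = (2.0 / (eps * N)) * math.log(1.0 / (s * delta))   # sampling probability
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1                 # maintain exact counts once a counter exists
        elif random.random() < rate:
            counts[x] = 1                  # create a counter by sampling
    # output elements with counter values exceeding (s - eps) * N
    return [x for x, c in counts.items() if c > (s - eps) * N]
```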

  26. Example 3 – Correlated Attributes • Input Stream – items with boolean attributes • Matrix – M(r,c) = 1 ⇔ Row r has Attribute c • Identify – Highly correlated column pairs

          C1 C2 C3 C4 C5
      R1   1  1  1  1  0
      R2   1  1  0  1  0
      R3   1  0  0  1  0
      R4   0  0  1  0  1
      R5   1  1  1  0  1
      R6   1  1  1  1  1
      R7   0  1  1  1  1
      R8   0  1  1  1  0
      …    …  …  …  …  …

  27. Correlation ⇒ Similarity • View a column as the set of row indexes where it has 1’s • Set Similarity (Jaccard measure): sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj| • Example:

          Ci Cj
           0  1
           1  0
           1  1      sim(Ci, Cj) = 2/5 = 0.4
           0  0
           1  1
           0  1
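
(A quick check of the Jaccard measure on the example columns above, as reconstructed; the function name is illustrative.)

```python
def jaccard(col_i, col_j):
    rows_i = {r for r, bit in enumerate(col_i) if bit}   # rows where the column has a 1
    rows_j = {r for r, bit in enumerate(col_j) if bit}
    return len(rows_i & rows_j) / len(rows_i | rows_j)

ci = [0, 1, 1, 0, 1, 0]   # the example columns from this slide
cj = [1, 0, 1, 0, 1, 1]
print(jaccard(ci, cj))    # -> 0.4
```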

  28. Identifying Similar Columns? • Goal – find candidate pairs in small memory • Signature Idea • Hash each column Ci to a small signature sig(Ci) • Set of signatures fits in memory • sim(Ci, Cj) approximated by sim(sig(Ci), sig(Cj)) • Naïve Approach • Sample P rows uniformly at random • Define sig(Ci) as the P bits of Ci in the sample • Problem • sparsity ⇒ would miss the interesting part of the columns • sample would get only 0’s in most columns

  29. Key Observation • For columns Ci, Cj, there are four types of rows:

          Ci Cj
      A    1  1
      B    1  0
      C    0  1
      D    0  0

  • Overload notation: A = # rows of type A (similarly B, C, D) • Observation: sim(Ci, Cj) = A / (A + B + C)

  30. Min Hashing • Randomly permute the rows • Hash h(Ci) = index of the first row with a 1 in column Ci • Surprising Property: P[h(Ci) = h(Cj)] = sim(Ci, Cj) • Why? • Both equal A/(A+B+C) • Look down columns Ci, Cj until the first non-type-D row • h(Ci) = h(Cj) ⇔ that row is of type A

  31. Min-Hash Signatures • Pick k random row permutations • Min-Hash Signature: sig(C) = the k indexes of the first rows with a 1 in column C • Similarity of signatures • Define: sim(sig(Ci), sig(Cj)) = fraction of permutations where the Min-Hash values agree • Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)
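
(A sketch of Min-Hash signatures with k explicit random permutations. It records the minimum permuted row index, which matches the slide's "first row with a 1" convention up to relabeling and gives the same agreement probability; the example columns are the ones from slide 27.)

```python
import random

def minhash_signature(column, permutations):
    # one entry per permutation: the smallest permuted index among rows where the column has a 1
    return [min(perm[r] for r, bit in enumerate(column) if bit) for perm in permutations]

def signature_similarity(sig_i, sig_j):
    return sum(a == b for a, b in zip(sig_i, sig_j)) / len(sig_i)

ci = [0, 1, 1, 0, 1, 0]                    # example columns with sim(ci, cj) = 0.4
cj = [1, 0, 1, 0, 1, 1]
k, n_rows = 200, len(ci)
perms = [random.sample(range(n_rows), n_rows) for _ in range(k)]
sig_i = minhash_signature(ci, perms)
sig_j = minhash_signature(cj, perms)
print(signature_similarity(sig_i, sig_j))  # close to 0.4 for large k; equal to it in expectation
```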

  32. Example

      Input matrix:
           C1 C2 C3
      R1    1  0  1
      R2    0  1  1
      R3    1  0  0
      R4    1  0  1
      R5    0  1  0

      Min-Hash signatures:
                          S1 S2 S3
      Perm 1 = (12345):    1  2  1
      Perm 2 = (54321):    4  5  4
      Perm 3 = (34512):    3  5  4

      Similarities:
                1-2   1-3   2-3
      Col-Col  0.00  0.50  0.25
      Sig-Sig  0.00  0.67  0.00

  33. Implementation Trick • Permuting the rows even once is prohibitive • Row Hashing • Pick k hash functions h_k: {1,…,n} → {1,…,O(n)} • Ordering rows by h_k gives a random row permutation • One-pass implementation (see the sketch below)
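
(A sketch of the one-pass trick: k simple hash functions stand in for k row permutations, and each signature entry keeps the minimum hash value seen among rows where the column has a 1. The linear hash family here is an illustrative choice, not necessarily the one used in the talk.)

```python
import random

def one_pass_minhash(rows, num_columns, k=100, seed=0):
    rng = random.Random(seed)
    p = (1 << 31) - 1                            # prime modulus for simple linear hashes
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    sig = [[float("inf")] * num_columns for _ in range(k)]   # signature matrix S
    for r, row in enumerate(rows):               # a single pass over the stream of rows
        hashed = [(a * r + b) % p for a, b in coeffs]
        for c, bit in enumerate(row):
            if bit:                              # row r has a 1 in column c
                for i in range(k):
                    if hashed[i] < sig[i][c]:
                        sig[i][c] = hashed[i]    # keep the minimum hash value per column
    return sig
```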

  34. Comparing Signatures • Signature Matrix S • Rows = Hash Functions • Columns = Columns • Entries = Signatures • Need – pair-wise similarity of signature columns • Problem • MinHash fits the column signatures in memory • But comparing all signature pairs takes too much time • Limiting candidate pairs – Locality-Sensitive Hashing

  35. Summary • New algorithmic paradigms needed for streams and massive data sets • Negative results abound • Need to approximate • Power of randomization

  36. Thank You!

  37. References Rajeev Motwani (http://theory.stanford.edu/~rajeev) STREAM Project (http://www-db.stanford.edu/stream) • STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering 2003. • Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003. • Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002. • Manku-Motwani. Approximate Frequency Counts over Streaming Data. VLDB 2003. • Babcock-Datar-Motwani-O’Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003. • Guha-Meyerson-Mishra-Motwani-O’Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

  38. References (contd) • Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing 2002. • Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002. • O’Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003. • Guha-Mishra-Motwani-O’Callaghan. Clustering Data Streams. FOCS 2000. • Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000. • Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000. • Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999. • Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
