
Sampling Based Range Partition for Big Data Analytics + Some Extras

Milan Vojnović, Microsoft Research Cambridge, United Kingdom. Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, and Jingren Zhou. INQUEST Workshop, September 2012.


Presentation Transcript


  1. Sampling Based Range Partition for Big Data Analytics + Some Extras. Milan Vojnović, Microsoft Research Cambridge, United Kingdom. Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, and Jingren Zhou. INQUEST Workshop, September 2012

  2. Big Data Analytics • Our goal: innovation in algorithms for large-scale computations, to move the frontier of the computer science of big data • Some figures of scale • Petabytes / terabytes of online services data processed daily • 200M tweets per day (Twitter) • 1B content pieces shared per day (Facebook) • 8,000 exabytes of global data by 2015 (The Economist)

  3. Research Agenda • [diagram: a distributed computing system at the intersection of machine learning, database queries, and optimization]

  4. Outline • Range Partition, with Fei Xu and Jingren Zhou • Count Tracking, with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

  5. Range Partition • [diagram: data items spread over k sites are assigned to m ranges, e.g., 1-100, 101-250, ..., 950-1024] • Special interest: balanced range partition

  6. Range Partition Requirements • Given n data items spread over k sites, m ranges, and desired relative partition sizes λ_1, ..., λ_m • ε-accurate range partition: N_j ≤ (1 + ε) λ_j n for every range j, with probability at least 1 − δ, where N_j = number of data items assigned to range j
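To make the accuracy requirement concrete, here is a small Python check of the one-sided reading of the definition above, N_j ≤ (1 + ε) λ_j n for every range j; the threshold form, function, and variable names are assumptions for illustration only.

```python
# Sketch: check whether a computed range partition is eps-accurate,
# using the (assumed) criterion N_j <= (1 + eps) * lambda_j * n for every range j.

def is_eps_accurate(range_counts, target_fractions, eps):
    """range_counts[j]     = number of items that landed in range j (N_j)
       target_fractions[j] = desired relative size lambda_j (sums to 1)"""
    n = sum(range_counts)
    return all(N_j <= (1.0 + eps) * lam_j * n
               for N_j, lam_j in zip(range_counts, target_fractions))

# Example: 4 ranges, target 25% each; eps = 0.2 allows up to 30% of items per range.
print(is_eps_accurate([260, 240, 255, 245], [0.25] * 4, eps=0.2))  # True
print(is_eps_accurate([400, 200, 200, 200], [0.25] * 4, eps=0.2))  # False
```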

  7. Two Approaches • Sampling based methods • Take a sample of data items • Compute partition boundaries using the sample • Quantile summary methods • At each node compute a local quantile summary • Merge at the coordinator node

  8. Related Work • Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998), with the required sample size characterized there • Communication cost to draw samples without replacement (Tirthapura and Woodruff, 2011): the bound takes one form when the number of sites dominates the sample size, and another form otherwise

  9. Related Work (cont’d) • Quantile summaries based approach (Greenwald and Khanna, 2001), with communication cost growing with the number of sites and 1/ε • Pros • Deterministic guarantee • Cons • It requires sorting of data items • Largest frequency of an item must be sufficiently small

  10. Problem • Range-partition the data while making one pass through the data, with minimal communication between the coordinator and the sites

  11. Sampling Based Method • Collect samples and partition using the samples • [diagram: sites 1, ..., k send samples to a coordinator] • Pros • Simplicity, scalability • Cons • How many samples to take from each site? • Data size imbalance: the number of input records per machine may differ from one machine to another

  12. Data Sizes Imbalance

  13. Origins of Data Sizes Imbalance • JOIN: SELECT … FROM A INNER JOIN B ON A.KEY == B.KEY ORDER BY COL • Lookup table: if the record value of column X is in the lookup table, then return the row • UNPIVOT: Input rows: Col 1 = 1 with Col 2 = (2, 3); Col 1 = 2 with Col 2 = (3, 9, 8, 13); … Output: (1,2), (1,3), (2,3), (2,9), …
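The UNPIVOT case is the easiest to see in code: each input row expands into a variable number of output records, so per-machine output sizes drift apart even when input sizes are balanced. A minimal Python illustration of that expansion (data taken from the slide, names illustrative):

```python
# Sketch of the UNPIVOT example: one input row per key expands into
# one output record per list element, so output size depends on the data.

rows = {1: [2, 3], 2: [3, 9, 8, 13]}          # Col 1 -> Col 2 (from the slide)

unpivoted = [(key, value) for key, values in rows.items() for value in values]
print(unpivoted)   # [(1, 2), (1, 3), (2, 3), (2, 9), (2, 8), (2, 13)]

# A machine holding key 2 now holds twice as many records as one holding key 1,
# which is the data-size imbalance the sampling scheme must cope with.
```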

  14. Weighted Sampling Scheme • SAMPLE: Each site reports a random sample of t/k data items and its total number of items • MERGE: The coordinator creates a summary by adding each data item reported by a site a number of times proportional to that site's total item count • PARTITION: Use the summary to determine partition boundaries • Note: a site reports its total number of data items only once it is available, i.e., after the site has made one pass through its local data
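A minimal Python sketch of the three steps, assuming the per-site sample size is t/k and that each sampled item is replicated in proportion to its site's total item count; the exact replication factor and boundary rule used in the paper may differ.

```python
import random

def sample_site(local_data, t, k):
    """SAMPLE: each site reports a random sample of t/k items plus its total count."""
    s = min(len(local_data), t // k)
    return random.sample(local_data, s), len(local_data)

def merge(site_reports):
    """MERGE: build a weighted summary; each sampled item from a site is replicated
    roughly n_i / (sample size) times -- an assumed weighting for this sketch."""
    summary = []
    for sample, n_i in site_reports:
        weight = max(1, round(n_i / max(1, len(sample))))
        for item in sample:
            summary.extend([item] * weight)
    return summary

def partition(summary, target_fractions):
    """PARTITION: read range boundaries off the empirical CDF of the summary."""
    summary = sorted(summary)
    boundaries, cum = [], 0.0
    for lam in target_fractions[:-1]:
        cum += lam
        boundaries.append(summary[min(len(summary) - 1, int(cum * len(summary)))])
    return boundaries

# Example: 3 sites with imbalanced data sizes, 4 ranges of equal target size.
sites = [[random.randint(1, 1024) for _ in range(n)] for n in (200, 1000, 5000)]
reports = [sample_site(d, t=300, k=3) for d in sites]
print(partition(merge(reports), [0.25] * 4))
```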

  15. SAMPLE • [diagram: sites 1, …, k each send a local sample and item count to the coordinator]

  16. MERGE • [diagram: the coordinator replicates each sampled item, forming a weighted data summary]

  17. PARTITION • [diagram: the coordinator reads range boundaries off the empirical CDF of the data summary]

  18. Sufficient Sample Size • Assume the largest frequency p of a data value is sufficiently small • For a large enough sample size t, the scheme yields an ε-accurate range partition with probability at least 1 − δ; the required t depends on ε, δ, the target partition sizes, and p • p = largest frequency of a data value

  19. Constant Factor Imbalance • Suppose that the per-site data sizes are within a constant factor of one another • Then the sufficient sample size bound of the previous slide carries over, up to a constant factor

  20. Proof Outline • Large deviation analysis of the error exponent

  21. Performance • DataSet-1 • 100K data records per range

  22. Performance (cont’d)

  23. Summary for Range Partitioning • Novel weighted sampling scheme • Provable performance guarantees • Simple and practical • Code transferred to Cosmos • More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012

  24. Outline • Range Partition, with Fei Xu and Jingren Zhou • Count Tracking, with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

  25. SUM Tracking Problem • Maintain an estimate of the running SUM of value updates arriving at k distributed sites • [diagram: sites 1, 2, 3, …, k stream updates to a coordinator that tracks the SUM]

  26. SUM Tracking

  27. Applications • Ex 1: database queries, e.g. SELECT SUM(AdBids) FROM Ads • Ex 2: iterative solvers over input data

  28. State of the Art • Count tracking [Huang, Yi and Zhang, 2011] • Worst-case input, monotonic sum • Expected total communication bounded in terms of messages • Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009] on the expected total number of messages

  29. The Challenge • Q: What are communication-efficient algorithms for the sum tracking problem with random input streams? • Random permutation • Random i.i.d. • Fractional Brownian motion

  30. Communication Complexity Bounds • Lower bound • Upper bound: sublinear, with a “price of non-monotonicity” factor

  31. Communication Complexity Bounds, Unknown Drift Case • Input: i.i.d. Bernoulli updates with an unknown drift parameter • Expected total communication bounded in terms of messages • Generalizes the monotonic case to the constant drift case

  32. Our Tracker Algorithm • Each site reports to the coordinator upon receiving a value update, with a probability chosen by the algorithm • Sync all sites whenever the coordinator receives an update from a site • [diagram: site i holds local sum S_i; the coordinator maintains S = S_1 + … + S_k]
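A hedged Python sketch of this report-and-sync structure; the specific reporting probability below (decreasing in the magnitude of the current estimate) is an assumption for illustration, not the rule from the paper, and the class names are illustrative.

```python
import random

class Coordinator:
    def __init__(self):
        self.sites = []
        self.estimate = 0.0

    def sync(self):
        # Sync all: refresh the estimate from every site's exact local sum.
        self.estimate = sum(s.local_sum for s in self.sites)

class Site:
    """Each site keeps a local sum and, on every value update, triggers a sync with
    some probability p; here p shrinks as the tracked sum grows in magnitude."""
    def __init__(self, coordinator, eps=0.1, c=1.0):
        self.local_sum = 0.0
        self.coord = coordinator
        self.eps, self.c = eps, c

    def update(self, x):
        self.local_sum += x
        p = min(1.0, self.c / max(1.0, self.eps * abs(self.coord.estimate)))
        if random.random() < p:
            self.coord.sync()

# Example: 3 sites receiving random +/-1 updates (a non-monotonic sum).
coord = Coordinator()
coord.sites = [Site(coord) for _ in range(3)]
for _ in range(1000):
    random.choice(coord.sites).update(random.choice([-1, 1]))
coord.sync()
print("tracked sum:", coord.estimate)
```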

  33. Two Applications • Second Frequency Moment • Bayesian Linear Regression

  34. App 1: Second Frequency Moment • Input: a stream of values • Counter of value v: m_v = number of occurrences of v • Second frequency moment: F_2 = Σ_v m_v² • Goal: track F_2 within relative accuracy ε

  35. AMS Sketch • Built from random {0,1}-valued hash functions • For suitable sketch dimensions, the estimate is within a (1 ± ε) factor of F_2 with probability at least 1 − δ
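For reference, a minimal Python sketch of the textbook AMS estimator for F_2; the mapping of hash values to ±1 signs, the use of plain random signs instead of 4-wise independent hashes, and the chosen sketch dimensions are assumptions of this sketch rather than the talk's exact parameterization.

```python
import random
from statistics import median

def ams_f2_estimate(stream, rows=9, cols=200, seed=0):
    """Estimate F_2 = sum_v m_v^2 with random +/-1 signs:
    each counter Z = sum_v sign(v) * m_v satisfies E[Z^2] = F_2.
    The median of per-row means gives the usual (eps, delta)-style guarantee."""
    rng = random.Random(seed)
    signs = [[dict() for _ in range(cols)] for _ in range(rows)]  # lazy sign hashes
    Z = [[0 for _ in range(cols)] for _ in range(rows)]

    def sign(table, v):
        if v not in table:
            table[v] = rng.choice((-1, 1))
        return table[v]

    for v in stream:
        for r in range(rows):
            for c in range(cols):
                Z[r][c] += sign(signs[r][c], v)

    row_means = [sum(z * z for z in row) / cols for row in Z]
    return median(row_means)

# Example: stream with known F_2 = 3 * 100^2 + 50^2 = 32500.
stream = [0] * 100 + [1] * 100 + [2] * 100 + [3] * 50
random.shuffle(stream)
print(ams_f2_estimate(stream))   # close to 32500
```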

  36. App 1: Second Frequency Moment (cont’d) • Sum tracking: each AMS sketch counter is a (non-monotonic) sum over the stream, so it can be maintained with the sum tracker • Expected total communication follows from the sum tracking bound

  37. App 2: Bayesian Linear Regression • Feature vector x_t, output y_t • Prior and posterior over the regression weights

  38. App 2: Bayesian Linear Regression (cont’d) • Posterior mean and precision depend on the data only through the sums Σ_t x_t x_tᵀ and Σ_t x_t y_t • Sum tracking: these sums are maintained with the sum tracker • Under random permutation input, the expected communication cost follows from the sum tracking bound
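To see why sum tracking suffices here, the standard Bayesian linear regression posterior depends on the data only through the two running sums above; the isotropic Gaussian prior with precision alpha and the noise precision beta in this Python sketch are assumptions for illustration.

```python
import numpy as np

def posterior(sum_xxT, sum_xy, alpha=1.0, beta=1.0):
    """Standard Bayesian linear regression posterior:
    precision  S_N^{-1} = alpha * I + beta * sum_t x_t x_t^T
    mean       m_N      = beta * S_N * sum_t x_t y_t
    Both depend on the data only through the two running sums,
    which is exactly what a distributed sum tracker would maintain."""
    d = sum_xxT.shape[0]
    precision = alpha * np.eye(d) + beta * sum_xxT
    mean = beta * np.linalg.solve(precision, sum_xy)
    return mean, precision

# Example: a stream of (x_t, y_t) pairs; only the two sums are kept.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
sum_xxT, sum_xy = np.zeros((2, 2)), np.zeros(2)
for _ in range(500):
    x = rng.normal(size=2)
    y = x @ w_true + 0.1 * rng.normal()
    sum_xxT += np.outer(x, x)
    sum_xy += x * y
print(posterior(sum_xxT, sum_xy)[0])   # posterior mean, close to [2, -1]
```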

  39. Summary for Sum Tracking • Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d., and fractional Brownian motion inputs • Proposed a novel algorithm with nearly optimal communication complexity • Details: ACM PODS 2012

  40. Outline • Range Partition, with Fei Xu and Jingren Zhou • Count Tracking, with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

  41. Problem • Partition a graph with two objectives • Sparsely connected components • Balanced number of vertices per component • Applications • Parallel processing • Community detection

  42. Problem (cont’d) • Requirements • Streaming algorithm • Single pass / incremental • Efficient computation • Desired • Approximation guarantees • Average-case efficient • [diagram: a stream of vertices assigned to k partitions]

  43. Summary for Graph Partitioning • Designed a streaming algorithm whose average-case performance appears superior to previously proposed online heuristics • Provable approximation guarantees • More details available soon
