Sampling Based Range Partition for Big Data Analytics + Some Extras. Milan Vojnović, Microsoft Research Cambridge, United Kingdom. Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou. INQUEST Workshop, September 2012.
Big Data Analytics • Our goal: innovation in algorithms for large-scale computations, to move the frontier of the computer science of big data • Some figures of scale • Petabytes / terabytes of online services data processed daily • 200M tweets per day (Twitter) • 1B content pieces shared per day (Facebook) • 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda • Machine learning • Database queries • Optimization • Distributed computing system
Outline • Range Partition, with Fei Xu and Jingren Zhou • Count Tracking, with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic
Range Partition • Special interest: balanced range partition [Figure: data items held at sites 1, 2, ..., k are assigned to ranges 1, 2, ..., m, such as 1-100, 101-250, ..., 950-1024]
Range Partition Requirements • Given and and desired relative partition sizes • -accurate range partition: with probability at least = number of data items assigned to range
Two Approaches • Sampling based methods • Take a sample of data items • Compute partition boundaries using the sample • Quantile summary methods • At each node compute a local quantile summary • Merge at the coordinator node
Related Work • Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998) Required sample size: • Communication cost to draw samples without replacement (Tirthapura and Woodruff, 2011): For otherwise:
Related Work (cont’d) • Quantile summaries based approach (Greenwald and Khanna, 2001) Communication cost = • Pros • Deterministic guarantee • Cons • It requires sorting of data items • Largest frequency of an item must be at most
Problem • Range partition the data in one pass through the data, with minimal communication between the coordinator and the sites
Sampling Based Method • Collect samples and partition using the samples [Figure: sites 1, 2, ..., k send samples to the coordinator] • Pros • Simplicity, scalability • Cons • How many samples to take from each site? • Data size imbalance: the number of input data records may differ from one machine to another
Origins of Data Size Imbalance • JOIN: SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL • Lookup table: if the record value of column X is in the lookup table, then return the row • UNPIVOT: Input (Col 1 | Col 2): 1 | 2, 3; 2 | 3, 9, 8, 13; … Output: (1,2), (1,3), (2,3), (2,9), …
Weighted Sampling Scheme • SAMPLE: Each site reports a random sample of t/k data items and its total number of items • MERGE: Summary created by adding replicas of each data item from a site, the number of replicas being specific to the site • PARTITION: Use the summary to determine partition boundaries Note: a site reports its total number of data items only once that number is available, that is, after the site has made one pass through its local data
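SAMPLE [Figure: sites 1, 2, ..., k each report to the coordinator]
MERGE [Figure: the coordinator adds replicas of the reported data items to the summary]
PARTITION [Figure: at the coordinator, the empirical CDF of the data summary, from 0 to 1, is cut into ranges 1 to 5]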
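The SAMPLE, MERGE and PARTITION steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and because the replication count is lost in the slides, the sketch assumes each sampled item is weighted by its site's item count divided by the number of samples the site returned (which is proportional to the site's size when every site returns the same number of samples).

```python
import random
from bisect import bisect_left
from itertools import accumulate


def sample_site(items, per_site, rng):
    """SAMPLE: a site's report, i.e. a random sample and its total item count."""
    size = min(per_site, len(items))
    return rng.sample(items, size), len(items)


def merge(reports):
    """MERGE: build a weighted summary of (value, weight) pairs.

    Assumption: each sampled item stands for n_items / sample_size items
    of its site; the exact replication count is not given in the slides.
    """
    summary = []
    for sample, n_items in reports:
        for value in sample:
            summary.append((value, n_items / len(sample)))
    return summary


def partition_boundaries(summary, rel_sizes):
    """PARTITION: cut the weighted empirical CDF at the cumulative target sizes."""
    summary = sorted(summary)
    cum_weights = list(accumulate(w for _, w in summary))
    total = cum_weights[-1]
    boundaries = []
    # One boundary per range except the last, which ends at the maximum value.
    for target in list(accumulate(rel_sizes))[:-1]:
        idx = min(bisect_left(cum_weights, target * total), len(summary) - 1)
        boundaries.append(summary[idx][0])
    return boundaries
```

Each returned boundary is the upper edge of one of the first m-1 ranges; weighting by site size keeps a small site from dominating the summary when per-site sample sizes are equal.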
Sufficient Sample Size • Assume For sample size -accurate range partition with probability • largest frequency of a data value
Constant Factor Imbalance • Suppose that for some • Then
Proof Outline • Large deviation analysis of the error exponent:
Performance • DataSet-1 • 100K data records per range,
Summary for Range Partitioning • Novel weighted sampling scheme • Provable performance guarantees • Simple and practical • Code transfer to Cosmos • More info: Sampling Based Range Partition Methods for Big Data Analytics, Vojnović, Xu, Zhou, MSR-TR-2012-18, Mar 2012
Outline • Range Partition, with Fei Xu and Jingren Zhou • Count Tracking, with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic
SUM Tracking • Problem: Maintain estimate [Figure: sites 1, 2, 3, ..., k and the tracked SUM]
Applications • Ex 1: database queries, e.g. SELECT SUM(AdBids) FROM Ads • Ex 2: iterative solving input data
State of the Art • Count tracking [Huang, Yi and Zhang, 2011] • Worst-case input, monotonic sum • Expected total communication: messages • Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009] • Expected total communication: messages
The Challenge • Q: What are communication cost efficient algorithms for the sum tracking problem with random input streams? • Random permutation • Random i.i.d. • Fractional Brownian motion
Communication Complexity Bounds • Lower bound: • Upper bound: Sublinear, “price of non-monotonicity”:
Communication Complexity Bounds: Unknown Drift Case • Input: i.i.d. Bernoulli : unknown drift parameter Expected total communication: messages • Generalizes monotonic case to constant drift case
Our Tracker Algorithm • Each site reports to the coordinator upon receiving a value update with probability • Sync all whenever the coordinator receives an update from a site [Figure: site i receives update Xi and holds S and Si; it reports when Mi = 1; the coordinator holds S = S1 + … + Sk]
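The two rules above can be simulated with a short sketch. The reporting probability is missing from the slides, so it is left as a caller-supplied function, `report_prob`, which is a hypothetical parameter; the function name `track_sum` and the choice to count syncs rather than individual messages are also assumptions made for illustration.

```python
import random


def track_sum(updates, num_sites, report_prob, rng):
    """Simulate the tracker on a stream of (site, value) updates.

    report_prob(estimate) is a hypothetical stand-in for the reporting
    probability, which is not recoverable from the slides.
    Returns the coordinator's estimate after each update and the sync count.
    """
    local = [0] * num_sites  # exact local sums S_1, ..., S_k held by the sites
    estimate = 0             # coordinator's view of S, refreshed on each sync
    syncs = 0
    estimates = []
    for site, value in updates:
        local[site] += value
        # The site flips a coin on every update to decide whether to report.
        if rng.random() < report_prob(estimate):
            # "Sync all": the coordinator collects every local sum.
            estimate = sum(local)
            syncs += 1
        estimates.append(estimate)
    return estimates, syncs
```

With a reporting probability of 1 the estimate is always exact at the price of one sync per update; lowering the probability trades accuracy for communication.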
Two Applications • Second Frequency Moment • Bayesian Linear Regression
App 1: Second Frequency Moment • Input: • Counter of value : • Second frequency moment: • Goal: track within relative accuracy
AMS Sketch {0,1} valued hash • For and , within w. p.
App 1: Second Frequency Moment (cont’d) • Sum tracking: • Expected total communication:
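To see why the second frequency moment reduces to sum tracking, a small AMS-style estimator may help. This is a sketch under assumptions: it maps a random bit per item to a sign of -1 or +1, uses fully random signs stored in a table rather than a limited-independence hash family, and averages the squared counters instead of taking a median of means. The function names are hypothetical.

```python
import random
from collections import Counter


def exact_f2(stream):
    """Second frequency moment: the sum of squared item counts."""
    return sum(c * c for c in Counter(stream).values())


def ams_f2_estimate(stream, num_counters, seed=0):
    """Estimate F2 as the average of squared signed counters."""
    rng = random.Random(seed)
    signs = [dict() for _ in range(num_counters)]  # per-counter item signs
    counters = [0] * num_counters
    for item in stream:
        for j in range(num_counters):
            if item not in signs[j]:
                # Random bit in {0, 1} mapped to a sign in {-1, +1}.
                signs[j][item] = 2 * rng.getrandbits(1) - 1
            # Each counter is a running sum of +1 / -1 values, that is,
            # a non-monotonic sum of the kind the tracker maintains.
            counters[j] += signs[j][item]
    return sum(z * z for z in counters) / num_counters
```

Each counter can go up or down with every arrival, which is why tracking it across sites calls for a tracker that handles non-monotonic sums.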
App 2: Bayesian Linear Regression • Feature vector , output • Prior: Posterior:
App 2: Bayesian Linear Regression (cont’d) • Posterior mean and precision: • Sum tracking: • Under random permutation input, the expected communication cost =
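For reference, the standard conjugate Gaussian form of these quantities shows where the sums come from. The slide's own formulas did not survive extraction, so the notation below (prior mean m_0, prior precision Λ_0, noise precision β, feature vectors x_s, outputs y_s) is an assumption and may differ from the authors' choice.

```latex
% Assumed model: w ~ N(m_0, \Lambda_0^{-1}),  y_s | x_s, w ~ N(w^\top x_s, \beta^{-1})
\begin{align*}
  \Lambda_t &= \Lambda_0 + \beta \sum_{s=1}^{t} x_s x_s^\top
            && \text{(posterior precision)} \\
  m_t &= \Lambda_t^{-1} \Big( \Lambda_0 m_0 + \beta \sum_{s=1}^{t} y_s x_s \Big)
            && \text{(posterior mean)}
\end{align*}
```

Both posterior parameters depend on the data only through the two running sums, so maintaining them over distributed streams is an instance of sum tracking, one sum per matrix or vector entry.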
Summary for Sum Tracking • Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d. and fractional Brownian motion • Proposed a novel algorithm with nearly optimal communication complexity • Details: ACM PODS 2012
Outline • Range Partition, with Fei Xu and Jingren Zhou • Count Tracking, with Zhenming Liu and Bozidar Radunovic • Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic
Problem • Partition a graph with two objectives • Sparsely connected components • Balanced number of vertices per component • Applications • Parallel processing • Community detection
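The two objectives can be made concrete by scoring a given partition. The sketch below is only an evaluation helper with hypothetical names, not the partitioning algorithm itself: it counts the edges that cross components and reports the size of the largest component.

```python
from collections import Counter


def partition_quality(edges, assignment, num_parts):
    """Return (cut_edges, largest_part_size) for a vertex-to-part assignment."""
    # Sparsity objective: edges whose endpoints fall in different parts.
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    # Balance objective: the number of vertices in the largest part.
    sizes = Counter(assignment.values())
    largest = max(sizes.get(p, 0) for p in range(num_parts))
    return cut, largest
```

For a 4-cycle split into two adjacent pairs, the helper returns 2 cut edges and a largest part of 2 vertices; putting all four vertices in one part gives 0 cut edges but a largest part of 4, which shows how the two objectives pull against each other.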
Problem (cont’d) • Requirements • Streaming algorithm • Single pass / incremental • Efficient computing • Desired • Approximation guarantees • Average-case efficient [Figure: partitions 1, 2, 3, ..., k]
Summary for Graph Partitioning • Designed a streaming algorithm whose average-case performance appears superior to that of any previously proposed online heuristic • Provable approximation guarantees • More details available soon