Introducing LIBRA, a solution to handle data skew in MapReduce by balancing data distribution and improving computational efficiency for big data processing.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Qi Chen, Jinyu Yao, and Zhen Xiao
Nov 2014
To appear in IEEE Transactions on Parallel and Distributed Systems
Outline
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Introduction
• The new era of Big Data is coming!
  – 20 PB per day (2008)
  – 30 TB per day (2009)
  – 60 TB per day (2010)
  – petabytes per day
• What does big data mean?
  – Important user information
  – Significant business value
MapReduce
• What is MapReduce?
  – Most popular parallel computing model, proposed by Google
• Applications
  – Database operation: Select, Join, Group
  – Search engine: Page rank, Inverted index, Log analysis
  – Machine learning: Clustering, machine translation, Recommendation
  – Scientific computation
  – Cryptanalysis
  – …
Data skew in MapReduce
• The imbalance in the amount of data assigned to each task
• Mantri has witnessed that the coefficients of variation in data size across tasks are 0.34 and 3.1 at the 50th and 90th percentiles in the Microsoft production cluster
• Fundamental reason:
  – The datasets in the real world are often skewed (physical properties, hot spots)
  – We do not know the data distribution beforehand
• It cannot be solved by speculative execution
Architecture
• Intermediate data are divided according to some user-defined partitioner
[Figure: MapReduce architecture. The master assigns map and reduce tasks. In the map stage, map tasks read input splits 1 to M from the input files and write partitioned output (Part 1, Part 2); in the reduce stage, reduce tasks write the output files (Output1, Output2). Task phases: map → combine → copy → sort → reduce.]
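To illustrate how a partitioner decides which reducer receives each intermediate pair, here is a minimal sketch of a hash partitioner. The function names and the sample words are hypothetical, and `zlib.crc32` merely stands in for whatever hash function a real framework uses.

```python
import zlib
from collections import Counter


def hash_partition(key, num_reducers):
    """Map a key to a reducer index with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(pairs, num_reducers):
    """Count how many (key, value) pairs each reducer would receive."""
    loads = Counter()
    for key, _value in pairs:
        loads[hash_partition(key, num_reducers)] += 1
    return [loads[r] for r in range(num_reducers)]


# Hypothetical intermediate data: one hot key plus a few rare keys.
pairs = [("the", 1)] * 1000 + [(w, 1) for w in ("apple", "banana", "cherry")]
print(reducer_loads(pairs, 4))
```

All pairs of the hot key land on the same reducer, so one reducer carries at least 1000 of the 1003 pairs no matter how evenly the remaining keys are hashed.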
Challenges to solve data skew
• Many real world applications exhibit data skew
  – Sort, Grep, Join, Group, Aggregation, Page Rank, Inverted Index, etc.
• The data distribution cannot be determined ahead of time
• The computing environment can be heterogeneous
  – Diversity of hardware
  – Resource competition in cloud environment
Previous work
• In the parallel database area
  – Limited to join, group, and aggregate operations
• Pre-run sampling jobs
  – Adding two pre-run sampling and counting jobs for theta join (SIGMOD’11)
  – Running pre-processing extraction and sampling procedures for spatial feature extraction (SOCC’11)
• Collect data information during the job execution
  – Collecting key frequency in each node and aggregating it on the master after all maps are done (Cloudcom’10)
  – Partitioning intermediate data into more partitions and using greedy bin-packing to pack them after all maps finish (CLOSER’11, ICDE’12)
• SkewTune (SIGMOD’12)
  – Split skewed tasks when detected
  – Reconstruct the output by concatenating the results
• Limitations of these approaches
  – Significant overhead
  – Applicable only to certain applications
  – Bring a barrier between map and reduce phases
  – Bin-packing cannot support total order
  – Need more task slots
  – Cannot detect large keys
  – Cannot split in copy and sort phases
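The greedy bin-packing mentioned for the earlier approaches can be sketched as follows. This is a generic largest-first heuristic under assumed names (`greedy_bin_pack`, `partition_sizes`), not code from the cited papers.

```python
import heapq


def greedy_bin_pack(partition_sizes, num_reducers):
    """Assign partitions, largest first, to the currently least-loaded reducer.

    Returns a dict: reducer index -> list of partition indices.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    order = sorted(range(len(partition_sizes)),
                   key=lambda i: partition_sizes[i], reverse=True)
    for i in order:
        load, r = heapq.heappop(heap)
        assignment[r].append(i)
        heapq.heappush(heap, (load + partition_sizes[i], r))
    return assignment


# Hypothetical sizes of six fine-grained partitions, packed onto two reducers.
sizes = [50, 10, 10, 10, 10, 10]
print(greedy_bin_pack(sizes, 2))
```

The packing can only start once all partition sizes are known, which is the barrier between map and reduce noted above. In general the partitions given to one reducer are also not contiguous in key order, which is why this approach cannot produce totally ordered output.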
LIBRA – Solving data skew
[Figure: LIBRA workflow. 1: the master issues sample tasks first. 2: sample data. 3: calculate partitions. 4: the master asks workers to partition map output. Sample map and normal map tasks read input from HDFS; reduce tasks write output to HDFS.]
Sampling and partitioning
• Sampling strategy
  – Random, TopCluster (ICDE’12)
  – LIBRA: p largest keys and q random keys
• Estimate intermediate data distribution
  – Large keys -> each represents only one large key
  – Random keys -> each represents a small range of keys
• Partitioning strategy
  – Hash, bin packing, range
  – LIBRA: range
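The "p largest keys and q random keys" strategy can be sketched as below. How LIBRA actually selects and transmits the samples inside a sample map task is not described on the slide, so the function name, the seeded random generator, and the return format are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_keys(keys, p, q, seed=0):
    """Return (largest, randoms): frequency maps of the p most frequent keys
    and of q keys drawn uniformly at random from the remaining keys."""
    counts = Counter(keys)
    largest = [k for k, _ in counts.most_common(p)]
    largest_set = set(largest)
    rest = sorted(k for k in counts if k not in largest_set)
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    randoms = rng.sample(rest, min(q, len(rest)))
    return ({k: counts[k] for k in largest},
            {k: counts[k] for k in sorted(randoms)})


# Hypothetical map output keys: two hot keys and six rare ones.
keys = ["a"] * 50 + ["b"] * 30 + list("cdefgh")
large, rand = sample_keys(keys, p=2, q=3)
print(large)
print(rand)
```

The large keys are reported exactly because each one can dominate a reducer on its own, while each random key serves as a representative of a small range of ordinary keys when the distribution is estimated.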
Heterogeneity Consideration
[Figure: intermediate data with Cnt = 300 is divided into partitions of Cnt = 150, 100, and 50 for three reducers on three nodes whose performance factors are 1.5, 1, and 0.5, so that all reducers start processing together and finish at about the same time.]
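The numbers in this figure follow from giving each reducer a share proportional to the performance factor of its node. A minimal sketch, assuming hypothetical node names and a hypothetical `split_by_performance` helper:

```python
def split_by_performance(total_count, performance):
    """Split total_count pairs across nodes in proportion to performance."""
    total_perf = sum(performance.values())
    return {node: total_count * perf / total_perf
            for node, perf in performance.items()}


# Performance factors from the figure; the node names are made up.
performance = {"fast": 1.5, "medium": 1.0, "slow": 0.5}
shares = split_by_performance(300, performance)
print(shares)

# Estimated processing time is share / performance for each node.
print({node: shares[node] / performance[node] for node in shares})
```

With a total of 300 pairs the shares come out as 150, 100, and 50, and every node needs the same estimated time (100 units), which is the point of balancing processing time rather than data volume.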
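Problem Statement
• The intermediate data can be represented as: $(K_1, C_1), (K_2, C_2), \ldots, (K_n, C_n)$ with $K_i < K_{i+1}$
  – $K_i$: a distinct key
  – $C_i$: number of (k,v) pairs of $K_i$
• Range partition: $0 = pt_0 < pt_1 < \ldots < pt_r = n$
  – Reducer $i$ processes keys in the range $(K_{pt_{i-1}}, K_{pt_i}]$
• Our goal: Minimize $\max_{i=1,\ldots,r} \left\{ \frac{1}{e_i} \sum_{j=pt_{i-1}+1}^{pt_i} Cost(C_j) \right\}$
  – $Cost(C_j)$: computational complexity of processing $K_j$ (sort: $Cost(C_j) = C_j$, self-join: $Cost(C_j) = C_j^2$)
  – $e_i$: performance factor of the worker node running reducer $i$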
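One way to make this goal concrete is a small dynamic program that tries every set of range boundaries and keeps the one with the smallest maximum finish time. This is an illustrative brute-force sketch and not the algorithm LIBRA uses; `range_partition`, `perf`, and `cost` are assumed names, and it requires at least as many keys as reducers.

```python
def range_partition(counts, perf, cost=lambda c: c):
    """Choose range boundaries minimizing the slowest reducer's finish time.

    counts: pair counts of the keys in sorted key order
    perf:   performance factor of the node running each reducer
    cost:   processing cost of a key with c pairs (identity models sort)
    Returns (boundaries, best_time), boundaries = [0, ..., len(counts)].
    """
    n, r = len(counts), len(perf)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + cost(c))
    inf = float("inf")
    # best[i][j]: minimal max finish time when the first i reducers
    # take the first j keys, each reducer getting at least one key.
    best = [[inf] * (n + 1) for _ in range(r + 1)]
    cut = [[0] * (n + 1) for _ in range(r + 1)]
    best[0][0] = 0.0
    for i in range(1, r + 1):
        for j in range(i, n + 1):
            for k in range(i - 1, j):
                t = max(best[i - 1][k], (prefix[j] - prefix[k]) / perf[i - 1])
                if t < best[i][j]:
                    best[i][j], cut[i][j] = t, k
    # Walk the recorded cuts backwards to recover the boundaries.
    boundaries, j = [n], n
    for i in range(r, 0, -1):
        j = cut[i][j]
        boundaries.append(j)
    return boundaries[::-1], best[r][n]


# Four equal keys, second reducer on a node three times as fast.
print(range_partition([10, 10, 10, 10], perf=[1.0, 3.0]))
# Self-join style cost: quadratic in the number of pairs of a key.
print(range_partition([100, 10, 10], perf=[1, 1], cost=lambda c: c * c))
```

In the first call the faster node receives three of the four keys, so both reducers finish after 10 time units. The triple loop costs O(r·n²) time, which is only practical on a small estimated distribution.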
Distribution estimation
[Figure: (a) sum up samples, (b) pick up “marked keys”, (c) estimate distribution. The sample list L is cut at the marked keys K1, K2, …, K|L|; the i-th interval holds Pi keys and Qi tuples, and an interval that contains a single large key has Pi = 1.]
Sparse Index to Speed Up Partitioning
• Decreases the partition time by an order of magnitude
[Figure: the intermediate data is divided into index chunks. Chunk i starts at Offseti with the record (Kbi, Vbi), followed by (Kbi+1, Vbi+1) and so on, and has length Li. The sparse index holds one entry per chunk: (Kb1, Offset1, L1, Checksum1), (Kb2, Offset2, L2, Checksum2), …, (Kbn, Offsetn, Ln, Checksumn).]
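The idea of the sparse index can be sketched as follows: keep one entry per chunk of the sorted map output, then locate a partition boundary by searching the index and scanning a single chunk. This is a simplified in-memory sketch under assumptions: offsets are record positions rather than byte offsets, the checksum field is omitted, and the function names are hypothetical.

```python
from bisect import bisect_right


def build_sparse_index(sorted_records, chunk_size):
    """One entry per chunk: (first key of chunk, offset, chunk length)."""
    return [(sorted_records[off][0], off,
             min(chunk_size, len(sorted_records) - off))
            for off in range(0, len(sorted_records), chunk_size)]


def find_split(sorted_records, index, boundary_key):
    """Return the position of the first record whose key > boundary_key,
    reading only the one chunk that can contain the boundary."""
    first_keys = [entry[0] for entry in index]
    i = bisect_right(first_keys, boundary_key) - 1
    if i < 0:
        return 0  # every record is above the boundary
    _, off, length = index[i]
    pos = off
    while pos < off + length and sorted_records[pos][0] <= boundary_key:
        pos += 1
    return pos


# Hypothetical sorted map output with integer keys and dummy values.
records = [(k, None) for k in (1, 2, 3, 5, 8, 13, 21, 34)]
index = build_sparse_index(records, chunk_size=3)
print(index)
print(find_split(records, index, 8))
```

Only the index and one chunk are examined per boundary, instead of every record of the map output, which is where the saving in partition time would come from.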
Large Cluster Splitting
• Treat each intermediate (k,v) pair independently in the reduce phase
  – e.g. sort, grep, join
[Figure: keys A (cnt = 100), B (cnt = 10), and C (cnt = 10) are assigned to two reducers. Cluster split is not allowed: Reducer 1 gets A (cnt = 100) and Reducer 2 gets B (cnt = 10) and C (cnt = 10), so the data is skewed. Cluster split is allowed: Reducer 1 gets A (cnt = 60) and Reducer 2 gets A (cnt = 40), B (cnt = 10), and C (cnt = 10).]
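The split in the figure can be reproduced with a simple fill-to-target pass over the clusters in key order. This is a sketch under assumptions (equal node performance, a hypothetical `assign_with_split` helper), not LIBRA's actual splitting logic.

```python
import math


def assign_with_split(key_counts, num_reducers):
    """Walk (key, count) clusters in key order and fill each reducer up to
    an equal target, splitting a cluster when it does not fit."""
    total = sum(cnt for _, cnt in key_counts)
    target = math.ceil(total / num_reducers)
    reducers = [[] for _ in range(num_reducers)]
    r, room = 0, target
    for key, cnt in key_counts:
        while cnt > 0:
            take = min(cnt, room)
            reducers[r].append((key, take))
            cnt -= take
            room -= take
            # Move on once this reducer is full (the last one takes the rest).
            if room == 0 and r < num_reducers - 1:
                r, room = r + 1, target
    return reducers


# Cluster sizes from the figure.
print(assign_with_split([("A", 100), ("B", 10), ("C", 10)], 2))
```

Splitting a cluster is only valid when the reduce function treats each (k,v) pair independently, as the slide notes for sort, grep, and join.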
Experiment Environment
• Cluster:
  – 30 virtual machines on 15 physical machines
• Each physical machine:
  – Dual processors (2.4GHz Xeon E5620)
  – 24GB of RAM
  – Two 150GB disks
  – Connected by 1Gbps Ethernet
• Each virtual machine:
  – 2 virtual cores, 4GB of RAM, and 40GB of disk space
• Benchmark:
  – Sort, Grep, Inverted Index, Join
Evaluation – Accuracy of the Sampling Method • Zipf distribution (parameter = 1.0), #keys = 65535 • Sample 20% of splits and 1000 keys from each split
Evaluation – LIBRA Execution (sort) • 80% faster than Hadoop Hash • 167% faster than Hadoop Range
Evaluation – Degree of the skew (sort) • The overhead of LIBRA is minimal
Evaluation – different applications • Grep application -- grep different words from the full English Wikipedia archive with total data size of 31GB
Evaluation – different applications • Inverted Index application • Dataset: full English Wikipedia archive
Evaluation – different applications • Join application
Evaluation – Heterogeneous Environments (sort) • 30% faster than without heterogeneity consideration
Conclusion
• We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce:
  – A new sampling method for general user-defined programs: p largest keys and q random keys
  – An approach to balance the load among the reduce tasks: large key split support
  – An innovative consideration of heterogeneous environments: balance the processing time instead of just the amount of data
• Performance evaluation demonstrates that:
  – The improvement is significant (up to 4x faster)
  – The overhead is minimal and negligible even in the absence of skew