
LIBRA: Lightweight Data Skew Mitigation in MapReduce

Introducing LIBRA, a solution to handle data skew in MapReduce by balancing data distribution and improving computational efficiency for big data processing.



  1. LIBRA: Lightweight Data Skew Mitigation in MapReduce. Qi Chen, Jinyu Yao, and Zhen Xiao, Nov 2014. To appear in IEEE Transactions on Parallel and Distributed Systems.

  2. Outline: 1. Introduction 2. Background 3. Previous work 4. System Design 5. Evaluation 6. Conclusion


  4. Introduction
  • The new era of Big Data is coming!
    – 20 PB per day (2008)
    – 30 TB per day (2009)
    – 60 TB per day (2010)
    – petabytes per day
  • What does big data mean? Important user information and significant business value.

  5. MapReduce
  • What is MapReduce? The most popular parallel computing model, proposed by Google.
  • Applications:
    – Database operations: Select, Join, Group
    – Search engines: PageRank, inverted index, log analysis
    – Machine learning: clustering, machine translation, recommendation
    – Scientific computation, cryptanalysis, …
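The model is easiest to see on a toy workload. Below is a minimal in-memory sketch of the map/shuffle/reduce flow for word count; it is illustrative only (plain Python, not the Hadoop API), and all function names are made up for this sketch.

```python
from collections import defaultdict

def map_fn(line):
    # map: emit a (word, 1) pair for every word in the input record
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # reduce: aggregate all counts emitted for one key
    return key, sum(values)

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                       # map phase
        for k, v in map_fn(line):
            groups[k].append(v)              # shuffle: group values by key
    # reduce phase: one reduce_fn call per distinct key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

Data skew in this picture means the shuffle step sends far more values for some keys than for others, so the reduce tasks that own the heavy keys dominate the job's completion time.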

  6. Data skew in MapReduce
  • Data skew is an imbalance in the amount of data assigned to each task.
  • Mantri witnessed coefficients of variation in data size across tasks of 0.34 and 3.1 at the 50th and 90th percentiles in the Microsoft production cluster.
  • Fundamental reasons:
    – The datasets in the real world are often skewed (physical properties, hot spots)
    – We do not know the data distribution beforehand
  • It cannot be solved by speculative execution.
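The coefficient of variation quoted above is just the standard deviation of per-task data sizes divided by their mean. A small sketch with hypothetical per-task sizes (the numbers are invented for illustration, not from the Mantri measurements):

```python
from statistics import mean, pstdev

def coeff_of_variation(sizes):
    # CV = population standard deviation / mean of per-task data sizes
    return pstdev(sizes) / mean(sizes)

# hypothetical per-task input sizes in MB: one straggler-inducing task
balanced = [100, 100, 100, 100]
skewed   = [100, 100, 100, 900]

print(coeff_of_variation(balanced))        # 0.0
print(round(coeff_of_variation(skewed), 2))  # 1.15
```

A CV of 0 means perfectly balanced tasks; the larger the CV, the more one task's size deviates from the rest, which is why the 90th-percentile figure of 3.1 indicates severe skew.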


  8. Architecture
  • [Diagram: MapReduce architecture — the master assigns map tasks to input splits and reduce tasks to partitions; each map task runs map → combine → sort, each reduce task copies its partitions from all map outputs, sorts, and writes an output file.]
  • Intermediate data are divided according to some user-defined partitioner.

  9. Challenges to solve data skew
  • Many real-world applications exhibit data skew: Sort, Grep, Join, Group, Aggregation, PageRank, Inverted Index, etc.
  • The data distribution cannot be determined ahead of time.
  • The computing environment can be heterogeneous:
    – diversity of hardware
    – resource competition in cloud environments


  11. Previous work
  • Common limitations: significant overhead, and applicable only to certain applications.
  • In the parallel database area: limited to join, group, and aggregate operations.
  • Pre-run sampling jobs:
    – Adding two pre-run sampling and counting jobs for theta-join (SIGMOD’11)
    – Running pre-processing extraction and sampling procedures for spatial feature extraction (SOCC’11)
  • Collect data information during job execution:
    – Collecting key frequencies on each node and aggregating them on the master after all maps are done (Cloudcom’10) — brings a barrier between the map and reduce phases
    – Partitioning intermediate data into more partitions and using greedy bin-packing to pack them after all maps finish (CLOSER’11, ICDE’12) — bin-packing cannot support total order and needs more task slots
  • SkewTune (SIGMOD’12): splits skewed tasks when detected and reconstructs the output by concatenating the results — cannot detect large keys and cannot split in the copy and sort phases.


  13. LIBRA – solving data skew
  • Workflow:
    1: The master issues sample map tasks first
    2: The sample map tasks sample the data
    3: The master calculates the partitions
    4: Workers are asked to partition the map output accordingly
  • [Diagram: sample map tasks run before the normal map tasks; reduce tasks read the partitioned intermediate data and write their results back to HDFS.]

  14. Sampling and partitioning
  • Sampling strategies: random, TopCluster (ICDE’12); LIBRA samples the p largest keys and q random keys from each sample split.
  • Estimating the intermediate data distribution:
    – each large key represents only that one large key
    – each random key represents a small range of keys
  • Partitioning strategies: hash, bin-packing, range; LIBRA uses range partitioning.
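The "p largest keys and q random keys" idea can be sketched for one sample split. This is a minimal illustration under assumed inputs (a local key-frequency dict), not LIBRA's actual implementation; `sample_split` and its parameters are invented names.

```python
import heapq
import random

def sample_split(key_counts, p, q, rng=None):
    # key_counts: {key: local frequency} observed by one sample map task
    rng = rng or random.Random(0)
    # keep the p keys with the largest local counts (candidate large keys) ...
    largest = dict(heapq.nlargest(p, key_counts.items(), key=lambda kv: kv[1]))
    # ... plus q keys drawn uniformly at random from the remaining keys,
    # each of which will stand in for a small range of unsampled keys
    rest = [k for k in key_counts if k not in largest]
    randoms = {k: key_counts[k] for k in rng.sample(rest, min(q, len(rest)))}
    return largest, randoms

counts = {"a": 100, "b": 90, "c": 1, "d": 2, "e": 3}
largest, randoms = sample_split(counts, p=2, q=2)
print(sorted(largest))  # ['a', 'b']
```

Tracking the heavy hitters exactly while only spot-checking the tail is what keeps the sample small yet still able to expose the keys that cause skew.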

  15. Heterogeneity consideration
  • [Diagram: 300 intermediate records are split across three reducers in proportion to their nodes’ performance factors of 1.5, 1, and 0.5 — giving partitions of 150, 100, and 50 records — so that all three reducers start processing and finish at the same time.]
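The proportional split in the diagram is simple to state in code: each reducer's target share of the intermediate data is its node's performance factor over the sum of all factors. A sketch (the helper name is invented; the numbers match the slide's example):

```python
def weighted_targets(total_load, perf):
    # target number of records for each reducer, proportional to the
    # performance factor of the node it runs on, so that (load / speed)
    # is equal across reducers and they all finish together
    total_perf = sum(perf)
    return [total_load * f / total_perf for f in perf]

# 300 intermediate records, node performance factors 1.5, 1, and 0.5
print(weighted_targets(300, [1.5, 1.0, 0.5]))  # [150.0, 100.0, 50.0]
```

Balancing load divided by speed, rather than raw record counts, is what the slide means by balancing processing time instead of just the amount of data.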

  16. Problem statement
  • The intermediate data can be represented as (K1, C1), (K2, C2), …, (Kn, Cn) with Ki < Ki+1, where Ki is a distinct key and Ci is the number of (k,v) pairs with key Ki.
  • Range partition: choose cut points 0 = ξ0 < ξ1 < … < ξr = n; reducer i processes the keys in the range (ξi−1, ξi].
  • Our goal: minimize the maximum over reducers i of (1/ηi) · Σ f(Cj) for j from ξi−1+1 to ξi, where f(Cj) is the computational complexity of processing key Kj (e.g. sort: f(Cj) ∝ Cj; self-join: f(Cj) ∝ Cj²) and ηi is the performance factor of the worker node running reducer i.
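The objective above can be evaluated directly once cut points are chosen. A sketch under assumptions (illustrative function name, identity cost f, hand-picked cut points — this is not LIBRA's partitioning algorithm, just the quantity it minimizes):

```python
def max_weighted_load(counts, cuts, perf, f=lambda c: c):
    # counts: C_1..C_n for the keys in sorted order
    # cuts:   xi_0 = 0 < xi_1 < ... < xi_r = n (range partition boundaries)
    # perf:   performance factor eta_i of each reducer's node
    # returns max_i (1/eta_i) * sum of f(C_j) for keys in (xi_{i-1}, xi_i]
    loads = []
    for i in range(len(perf)):
        lo, hi = cuts[i], cuts[i + 1]
        loads.append(sum(f(c) for c in counts[lo:hi]) / perf[i])
    return max(loads)

# two equally fast reducers, identity cost: for counts [5, 1, 1, 1],
# cutting after the first (heavy) key balances better than cutting midway
print(max_weighted_load([5, 1, 1, 1], [0, 1, 4], [1.0, 1.0]))  # 5.0
print(max_weighted_load([5, 1, 1, 1], [0, 2, 4], [1.0, 1.0]))  # 6.0
```

Minimizing this makespan-style maximum (rather than, say, the variance) is what directly bounds the job's completion time, since the slowest reducer finishes last.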

  17. Distribution estimation
  • Steps: (a) sum up the samples; (b) pick the “marked keys”; (c) estimate the distribution of the full intermediate data, then minimize the objective over the estimated distribution.
  • [Diagram: the sorted sample is cut at marked keys K1, K2, …, K|L|; between consecutive marked keys Ki−1 and Ki there are Pi keys and Qi tuples, from which the overall key distribution is estimated. A marked large key has Pi = 1.]

  18. Sparse index to speed up partitioning
  • A sparse index over the intermediate data stores one entry (Kbi, Offseti, Li, Checksumi) per index chunk: the first key of the chunk, its file offset, its length, and a checksum.
  • This decreases the partition time by an order of magnitude.
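The point of a sparse index is that a partition boundary key can be located with a binary search over the per-chunk entries instead of a scan of the whole intermediate file. A minimal sketch (the entry layout and names follow the slide; `find_chunk` and the sample keys are invented for illustration):

```python
from bisect import bisect_right

def find_chunk(sparse_index, key):
    # sparse_index: list of (first_key, offset, length) entries, one per
    # chunk, sorted by first_key. Binary-search for the last chunk whose
    # first key is <= `key`; only that chunk needs to be read to place
    # a partition boundary at `key`.
    first_keys = [entry[0] for entry in sparse_index]
    i = bisect_right(first_keys, key) - 1
    return sparse_index[max(i, 0)]

index = [("apple", 0, 4096), ("melon", 4096, 4096), ("peach", 8192, 4096)]
print(find_chunk(index, "orange"))  # ('melon', 4096, 4096)
```

With n chunks the lookup is O(log n), which is where the order-of-magnitude reduction in partition time comes from: boundaries are resolved by touching a handful of chunks rather than every record.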

  19. Large cluster splitting
  • When the application treats each intermediate (k,v) pair independently in the reduce phase (e.g. sort, grep, join), a large key cluster may be split across reducers.
  • Example: keys A (cnt = 100), B (cnt = 10), C (cnt = 10) across two reducers.
    – Cluster split not allowed: Reducer 1 gets A (100 pairs), Reducer 2 gets B and C (20 pairs) — the data is skewed.
    – Cluster split allowed: Reducer 1 gets part of A (60 pairs), Reducer 2 gets the rest of A (40 pairs) plus B and C (20 pairs) — the load is balanced.
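Splitting a large key across reducers reduces to dividing its pair count among reducers with remaining capacity. A sketch with invented names, using the slide's numbers (100 pairs of key A, Reducer 1 with room for 60 more):

```python
def split_large_key(count, capacities):
    # distribute `count` (k,v) pairs of one large key across reducers with
    # the given remaining capacities, filling each reducer in turn; only
    # valid when the application treats each pair independently in the
    # reduce phase (e.g. sort, grep, join)
    shares = []
    for cap in capacities:
        take = min(count, cap)
        shares.append(take)
        count -= take
    return shares

print(split_large_key(100, [60, 120]))  # [60, 40]
```

An application whose reduce function must see every pair of a key together (e.g. a holistic aggregate like median) cannot use this, which is why the slide restricts cluster splitting to applications that process pairs independently.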


  21. Experiment environment
  • Cluster: 30 virtual machines on 15 physical machines.
  • Each physical machine: dual processors (2.4 GHz Xeon E5620), 24 GB of RAM, two 150 GB disks, connected by 1 Gbps Ethernet.
  • Each virtual machine: 2 virtual cores, 4 GB RAM, and 40 GB of disk space.
  • Benchmarks: Sort, Grep, Inverted Index, Join.

  22. Evaluation – accuracy of the sampling method
  • Zipf distribution (σ = 1.0), #keys = 65535
  • Sample 20% of the splits and 1000 keys from each split

  23. Evaluation – LIBRA Execution (sort) • 80% faster than Hadoop Hash • 167% faster than Hadoop Range

  24. Evaluation – Degree of the skew (sort) The overhead of LIBRA is minimal

  25. Evaluation – different applications • Grep application -- grep different words from the full English Wikipedia archive with total data size of 31GB

  26. Evaluation – different applications • Inverted Index application • Dataset: full English Wikipedia archive

  27. Evaluation – different applications • Join application

  28. Evaluation – heterogeneous environments (sort)
  • 30% faster than without the heterogeneity consideration


  30. Conclusion
  • We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce:
    – A new sampling method for general user-defined programs: the p largest keys and q random keys
    – An approach to balance the load among the reduce tasks, with support for splitting large keys
    – An innovative consideration of heterogeneous environments: balance the processing time instead of just the amount of data
  • Performance evaluation demonstrates that the improvement is significant (up to 4x faster) and that the overhead is minimal and negligible even in the absence of skew.

  31. Thank You!
