
The Limitation of MapReduce : A Probing Case and a Lightweight Solution




  1. The Limitation of MapReduce: A Probing Case and a Lightweight Solution Department of Computer Science and Engineering The Hong Kong University of Science and Technology Zhiqiang Ma Lin Gu CLOUD COMPUTING 2010 November 21-26, 2010 - Lisbon, Portugal

  2. MapReduce • MapReduce: a parallel computing framework for large-scale data processing • Successfully used in datacenters comprising commodity computers • A fundamental piece of software in the Google architecture for many years • An open-source variant, Hadoop, already exists • Widely used in solving data-intensive problems

  3. Introduction to MapReduce • Map and Reduce are higher-order functions • Map: apply an operation to all elements in a list • Reduce: like “fold”; aggregate the elements of a list • Example: 1² + 2² + 3² + 4² + 5² = ? Map (m: x²) turns 1 2 3 4 5 into 1 4 9 16 25; Reduce (r: +) folds them from the initial value 0, giving the running values 0, 1, 5, 14, 30, 55; the final value is 55.
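The slide's squared-sum example can be sketched directly with Python's built-in `map` and `functools.reduce` (a left fold), purely as an illustration of the higher-order functions the deck describes:

```python
from functools import reduce

xs = [1, 2, 3, 4, 5]

# Map: apply x -> x^2 to every element independently.
squares = list(map(lambda x: x * x, xs))   # [1, 4, 9, 16, 25]

# Reduce (a fold): aggregate with +, starting from the initial value 0.
# Running values: 0, 1, 5, 14, 30, 55, matching the slide.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(total)  # 55
```

Because each Map application is independent, the framework can run them on different machines; only the fold needs the intermediate results.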

  4. Introduction to MapReduce • Massive parallel processing made simple • Example: word count • Map: parse a document and generate <word, 1> pairs • Reduce: receive all pairs for a specific word, and count them • Map pseudocode: // D is a document; for each word w in D: output <w, 1> • Reduce pseudocode: // for key w: count = 0; for each input item: count = count + 1; output <w, count>
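The word-count pseudocode above translates into a minimal single-process sketch; the `shuffle` step, which groups pairs by key between Map and Reduce, is what a real framework performs across the network:

```python
from collections import defaultdict

def map_phase(document):
    # Map: parse a document and emit <word, 1> pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group pairs by key, as the framework does between Map and Reduce.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    # Reduce: receive all pairs for one word and count them.
    total = 0
    for _ in counts:
        total += 1
    return (word, total)

pairs = map_phase("the quick fox jumps over the lazy fox")
result = dict(reduce_phase(w, cs) for w, cs in shuffle(pairs).items())
print(result["the"], result["fox"])  # 2 2
```

In a distributed run, `map_phase` executes on many machines over document shards and `reduce_phase` runs once per key; the logic is unchanged.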

  5. Thoughts on MapReduce MapReduce provides an easy-to-use framework for parallel programming. But is it good for general programs running in datacenters?

  6. Our work • Analyze MapReduce’s design and use a case study to probe the limitation • Design a new parallelization framework: MRlite • Evaluate the new framework’s performance • Goal: design a general parallelization framework and programming paradigm for cloud computing

  7. Thoughts on MapReduce • Originally designed for processing large static data sets • No significant dependence among data items • Throughput over latency • Large-data parallelism over small, maybe-ephemeral parallelization opportunities

  8. The limitation of MapReduce • One-way scalability • Allows a program to scale up to process very large data sets • Constrains the program’s ability to process moderate-size data items • Limits the applicability of MapReduce • Difficult to handle dynamic, interactive and semantic-rich applications

  9. A case study on MapReduce • Distributed compiler • Very useful in development environments • Code (data) has dependence • Abundant parallelization opportunities • A “typical” application, but a hard case for MapReduce • [Figure: dependency graph of a “make -j N” Linux kernel build: init/version.o, mm/built-in.o, driver/built-in.o, kallsyms.o feed vmlinux-init and vmlinux-main, which produce vmlinux]

  10. A case study: mrcc • Develop a distributed compiler using the MapReduce model • How to extract the parallelizable components in a relatively complex data flow? • mrcc: A distributed compilation system • The workload is parallelizable but data-dependence constrained • Explores parallelism using the MapReduce model

  11. mrcc • Multiple machines available to MapReduce for parallel compilation • A master instructs multiple slaves (“map workers”) to compile source files

  12. Design of mrcc • “make -j N” on the master explores parallelism among the source files to be compiled • Each compilation is dispatched as a MapReduce job • The map task, running on a slave, compiles an individual file
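As a rough sketch of what one map task does, a worker only needs to turn a single source file into an object file. The helper below is hypothetical (the slides do not show mrcc's actual command line); it mirrors the distcc-style command a worker would run:

```python
import shlex

def compile_map_task(source_file, flags="-O2 -c"):
    # One map task: compile a single source file into an object file.
    # Hypothetical helper; mrcc's real invocation is not shown in the slides.
    obj = source_file.rsplit(".", 1)[0] + ".o"
    cmd = f"gcc {flags} {shlex.quote(source_file)} -o {shlex.quote(obj)}"
    return cmd, obj

cmd, obj = compile_map_task("init/version.c")
print(cmd)
```

The per-file granularity is exactly why tasking overhead dominates here: each such tiny compilation pays the full cost of launching a job.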

  13. Experiment: mrcc over Hadoop • MapReduce implementation • Hadoop 0.20.2 • Testbed • 10 nodes available to Hadoop for parallel execution • Nodes are connected by 1Gbps Ethernet • Workload • Compiling Linux kernel, ImageMagick, and Xen tools

  14. Result and observation • The compilation using mrcc on 10 nodes is 2~11 times slower than sequential compilation on one node • Where does the slowdown come from? • Network communication overhead for data transportation and replication • Tasking overhead • For compiling each source file: put source files into HDFS: >2 s; start the Hadoop job: >20 s; retrieve object files: >2 s

  15. mrcc: Distributed Compilation • Is there sufficient parallelism to exploit? • Yes. “distcc” serves as baseline • One-way scalability in the (MapReduce) design and (Hadoop) implementation. MapReduce is not designed for compiling. We use this case to show some of its limitations.

  16. Parallelization framework • MapReduce/Hadoop is inefficient for general programming • Cloud computing needs a general parallelization framework! • Handle applications with complex logic, data dependence, frequent updates, etc. • 39% of Facebook’s MapReduce workloads have only 1 Map [Zaharia 2010] • Easy to use and high performance [Zaharia 2010] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys ’10, 2010.

  17. Lightweight solution: MRlite • A lightweight parallelization framework following the MapReduce paradigm • Parallelization can be invoked when needed • Able to scale “up” like MapReduce, and scale “down” to process moderate-size data • Low latency and massive parallelism • Small run-time system overhead A general parallelization framework and programming paradigm for cloud computing

  18. Architecture of MRlite • The MRlite master accepts jobs from clients and schedules them to execute on slaves • Distributed nodes (slaves) accept tasks from the master and execute them • Linked together with the application, the MRlite client library accepts calls from the app and submits jobs to the master • High-speed distributed storage stores intermediate files • [Figure: application + MRlite client, MRlite master/scheduler, slaves, and high-speed distributed storage, connected by data flows and command flows]

  19. Design • Parallelization is invoked when needed • An application can request parallel execution an arbitrary number of times • The program’s natural logic flow is integrated with parallelism • This removes one important limitation • Facility outlives utility • Threads are used and reused for the master and slaves • Memory is the “first class” medium • Avoid touching hard drives
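“Facility outlives utility” can be illustrated with a long-lived worker pool: the workers are created once and reused across many parallel invocations, instead of paying job-startup cost (as Hadoop does) on every request. This is a minimal single-machine sketch of the idea, not MRlite's actual run-time system:

```python
from concurrent.futures import ThreadPoolExecutor

# The pool outlives any single request: created once, reused many times.
pool = ThreadPoolExecutor(max_workers=4)

def parallel_map(fn, items):
    # Each call reuses the same long-lived workers; no per-call startup cost.
    return list(pool.map(fn, items))

a = parallel_map(lambda x: x * x, [1, 2, 3])
b = parallel_map(len, ["map", "reduce"])   # second invocation, same pool
print(a, b)  # [1, 4, 9] [3, 6]
```

An application following its natural logic flow can call `parallel_map` wherever a parallel opportunity arises, however small or ephemeral.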

  20. Design • Programming interface • Provides a simple API • The API allows programs to invoke parallel processing during execution • Data handling • A network file system that stores files in memory • No replication for intermediate files • Applications are responsible for retrieving output files • Latency control • Jobs and tasks have timeout limits
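The latency-control point can be sketched as a task wrapper that enforces a timeout. The names here (`run_task`, the `timeout` parameter) are illustrative, not the real MRlite API:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=2)

def run_task(fn, arg, timeout=2.0):
    # Hypothetical sketch: every task gets a timeout limit, as the slide
    # describes; a task that overruns is abandoned rather than blocking the job.
    future = pool.submit(fn, arg)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()
        return None  # the master can reschedule or fail the job

print(run_task(lambda x: x + 1, 41))  # 42
```

Bounding each task's latency keeps a slow or stuck worker from stalling an interactive application, which is central to scaling “down” to small jobs.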

  21. Implementation • Implemented in C as Linux applications • Distributed file storage • Implemented with NFS in memory; Mounted from all nodes; Stores intermediate files • A specially designed distributed in-memory network file system may further improve performance (future work) • There is no limitation on the choice of programming languages

  22. Evaluation • Re-implement mrcc on MRlite • Porting mrcc is not difficult because MRlite can handle a “superset” of the MapReduce workloads • Testbed and workload • Use the same testbed and workload to compare MRlite’s performance with MapReduce/Hadoop’s

  23. Result The compilation of the three projects using mrcc on MRlite is much faster than compilation on one node. The speedup is at least 2 and the best speedup reaches 6.

  24. MRlite vs. Hadoop The average speedup of MRlite is more than 12 times that of Hadoop. The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.

  25. Conclusion • Cloud Computing needs a general programming framework • Cloud computing shall not be a platform to run just simple OLAP applications. It is important to support complex computation and even OLTP on large data sets. • Use the distributed compilation case (mrcc) to probe the one-way scalability limitation of MapReduce • Design MRlite: a general parallelization framework for cloud computing • Handles applications with complex logic flow and data dependencies • Mitigates the one-way scalability problem • Able to handle all MapReduce tasks with comparable (if not better) performance

  26. Conclusion Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU • MRlite respects applications’ natural logic flow and data dependencies • This modularization of parallelization capability from application logic enables MRlite to integrate GPGPU processing very easily (future work)

  27. Thank you!
