Incoop: MapReduce for Incremental Computation
Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, Rafael Pasquini
Max Planck Institute for Software Systems (MPI-SWS)
ACM SOCC 2011
Large-scale data processing
• Need to process growing, large data-sets
• Use of distributed and data-parallel computing
• MapReduce: de-facto data processing paradigm
  • Simple, yet powerful programming model
  • Widely adopted to support various services
Incremental data processing
• Applications repeatedly process evolving data-sets
  • For search, PageRank is re-computed for every new crawl
• Online data-sets evolve slowly
  • Successive input data-sets change by 0.1% to 10%
• Need for incremental computations
  • Instead of re-computing from scratch
Incremental data processing
• Recent proposals for incremental processing
  • "MapReduce (…) cannot process small updates individually as they rely on creating large batches for efficiency."
  • Google Percolator [OSDI'10]
  • CBP (Yahoo!/UCSD) [SOCC'10]
• Drawbacks of these systems
  • Adopt a new programming model
  • Require implementation of dynamic algorithms
Goals
• Retain the simplicity of bulk data processing systems
• Achieve the efficiency of incremental processing
Can we meet these goals for MapReduce?
• Take an unmodified MapReduce application
• Automatically adapt it to handle incremental input changes
Incoop: Incremental MapReduce
• Incremental bulk data processing
  • Transparent
  • Efficient
• Inspired by algorithms/PL research
  • Provable asymptotic gains
• Efficient implementation based on Hadoop
Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation
Background: MapReduce
Map(input) { foreach word in input: output(word, 1); }
Reduce(key, list(v)) { print key + SUM(v); }
[Diagram: word count on the input "a a b a b c c b" — Map tasks emit (a,1) (a,1) (a,1) (b,1) …, the Reduce task for key a receives (a,<1,1,1>) and outputs a = 3]
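For readers new to the model, a minimal word-count sketch in plain Python that simulates the Map, shuffle, and Reduce phases (illustrative function names, not the Hadoop API):

    from collections import defaultdict

    def map_phase(chunk):
        # Map: emit (word, 1) for every word in the input chunk
        return [(word, 1) for word in chunk.split()]

    def shuffle(pairs):
        # Shuffle: group all emitted values by key before the Reduce phase
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: sum all counts for one key
        return key, sum(values)

    chunks = ["a a b a", "b c c b"]
    pairs = [kv for chunk in chunks for kv in map_phase(chunk)]
    for key, values in sorted(shuffle(pairs).items()):
        print(reduce_phase(key, values))   # ('a', 3), ('b', 3), ('c', 2)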
Basic design
Basic principle: "self-adjusting computation"
• Break the computation into sub-computations
• Memoize the results of sub-computations
• Track dependencies between input and computation
• Re-compute only the parts affected by changes
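As a rough illustration of this principle (a toy sketch, not Incoop's actual implementation), sub-computation results can be memoized under a hash of their input so that unchanged chunks are never reprocessed:

    import hashlib

    memo = {}   # cache of sub-computation results, keyed by input-chunk hash

    def run_subcomputation(chunk, compute):
        key = hashlib.sha1(chunk.encode()).hexdigest()
        if key in memo:           # unchanged chunk: reuse the memoized result
            return memo[key]
        result = compute(chunk)   # new or changed chunk: compute and memoize
        memo[key] = result
        return result

    # The first run processes every chunk; a second run with one changed
    # chunk recomputes only that chunk and reuses the rest.
    word_count = lambda c: len(c.split())
    print([run_subcomputation(c, word_count) for c in ["a a b", "c c b", "a"]])
    print([run_subcomputation(c, word_count) for c in ["a a b", "c c d", "a"]])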
Basic design
Changes propagate through the dependence graph.
[Diagram: read input → Map tasks → Reduce tasks → write output]
Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation
Challenges
• Stability: how to efficiently handle insertions and deletions in the input?
• Granularity: how to perform fine-grained updates to the output?
• Scheduling: how to minimize data movement?
Challenge 1: Stability
• Stability: small changes in the input lead to small changes in the dependence graph
• Stable dependence graph → efficient change propagation
• Is the basic approach stable?
Challenge 1: Stability
[Diagrams: change propagation through the dependence graph — read input → Map tasks → Reduce tasks → write output]
Challenge 1: Stability
Solution: content-based chunking
• Avoid partitioning at fixed offsets
• Instead, use a property of the input contents
• Example: assuming a chunk boundary is placed at every "b", the input "a a b a b c c b a" is split into the chunks [a a b] [a b] [c c b] [a]
Challenge 1: Stability
Incremental HDFS:
• Upon file write, compute the Rabin fingerprint of a sliding window over the contents
• Fingerprint matches a fixed pattern → chunk boundary
• Content-based chunking addresses stability
• The probability of matching the pattern controls the chunk granularity
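A minimal sketch of content-based chunking (using a simple stand-in hash rather than the Rabin fingerprinting Incoop actually uses; WINDOW and MASK are illustrative parameters):

    WINDOW = 8           # sliding-window size in bytes (illustrative value)
    MASK = (1 << 6) - 1  # boundary when the low 6 bits of the hash are zero

    def window_hash(window: bytes) -> int:
        # Stand-in for a Rabin fingerprint: any hash of the window works for this sketch
        h = 0
        for b in window:
            h = (h * 257 + b) & 0xFFFFFFFF
        return h

    def chunk(data: bytes):
        """Split data at content-defined boundaries instead of fixed offsets."""
        chunks, start = [], 0
        for i in range(WINDOW - 1, len(data)):
            if window_hash(data[i - WINDOW + 1:i + 1]) & MASK == 0:
                chunks.append(data[start:i + 1])   # fingerprint matched: cut a chunk here
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])            # trailing bytes form the last chunk
        return chunks

Because a boundary depends only on the bytes inside the window, an insertion early in a file shifts only nearby boundaries; later chunks keep the same content, so their memoized Map results can be reused.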
Challenge 2: Granularity
• Coarse-grained change propagation can be inefficient
  • Even for a small input change, a large task may need to be re-computed
• Not an issue for Map tasks
  • Incremental HDFS controls the granularity
• Difficult to address for reducers
  • A reducer processes all values for a given key
  • Depends exclusively on the computation and the input
Challenge 2: Granularity
[Diagram: read input → Map tasks → Reduce task → write output]
Challenge 2: Granularity
Leverage Combiners: pre-processing for Reduce
• Co-located with the Map task
• Pre-processes Map outputs
• Meant to reduce bandwidth
Background: Combiners
Combine(set of <k,v>) { foreach distinct k: output(<k, SUM(v)>); }
[Diagram: for the input "a a b a b c c b", combiners emit partial sums such as (a,2) and (a,1); the reducer then outputs a = 3]
Challenge 2: Granularity
Contraction tree:
• Run Combiners at the Reducer site as well
• Use them to break up the Reduce work
[Diagram: a tree of Combiners feeding the Reduce task]
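A rough sketch of the idea behind a contraction tree (toy code with made-up helper names, not Incoop's implementation): because the Combiner is associative, map-side partial results can be combined pairwise in levels, memoizing every node, so a change in one leaf only re-runs the combiners on the path from that leaf to the root.

    import hashlib

    memo = {}

    def combine(values):
        # Word-count Combiner: sum partial counts (associative, so it can
        # be applied in a tree before the final Reduce)
        return sum(values)

    def memoized_combine(values):
        key = hashlib.sha1(repr(values).encode()).hexdigest()
        if key not in memo:
            memo[key] = combine(values)
        return memo[key]

    def contraction_tree(leaves):
        """Combine partial values pairwise, level by level, memoizing each node."""
        level = leaves
        while len(level) > 1:
            level = [memoized_combine(level[i:i + 2]) for i in range(0, len(level), 2)]
        return level[0]

    # Eight map-side partial counts for one key; if one leaf changes later,
    # only the log-depth path of combiner nodes above it is recomputed.
    print(contraction_tree([3, 1, 4, 1, 5, 9, 2, 6]))   # 31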
Challenge 2: Granularity
[Diagram: read input → Map tasks → Reduce task → write output]
Challenge 2: Granularity
[Diagram: read input → Map tasks → contraction tree → Reduce task → write output]
Challenge 3: Scheduling
• The scheduler determines where to run each task
  • Based on input data locality and machine availability
• New variable for incremental computation
  • Location of previously computed (memoized) results
• Memoization-aware scheduling
  • Prevents unnecessary data movement
Challenge 3: Scheduling
• Drawback of purely memoization-aware scheduling
  • Strict scheduling leads to a straggler effect
• New hybrid scheduling algorithm
  • Data locality to exploit memoized results
  • Flexibility to prevent the straggler effect
• More details are in the paper
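A toy sketch of such a hybrid policy (hypothetical data structures and parameters, not the scheduler described in the paper): prefer the node holding a task's memoized result, but fall back to a lightly loaded node once the preferred node's queue grows too long.

    def schedule(task, memo_location, queue_len, max_wait=3):
        """Pick a node for `task`.

        memo_location: node caching the task's memoized result (or None)
        queue_len:     dict mapping node -> number of queued tasks
        max_wait:      queue length tolerated before giving up locality
        """
        if memo_location is not None and queue_len[memo_location] <= max_wait:
            return memo_location              # reuse the memoized result locally
        # Fall back: least-loaded node, trading data movement for fewer stragglers
        return min(queue_len, key=queue_len.get)

    queues = {"node1": 5, "node2": 1, "node3": 0}
    print(schedule("reduce-7", "node1", queues))  # node1 is overloaded -> node3
    print(schedule("reduce-8", "node2", queues))  # node2 holds the memoized result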
Summary
Incoop enables incremental processing for MapReduce applications.
Incoop's design includes:
• Incremental HDFS for stable input partitioning
• Contraction trees for fine-grained updates
• A scheduler for memoized-data locality and straggler mitigation
Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation
Evaluating Incoop
Goal: determine how Incoop works in practice
• What are the performance benefits?
• How effective are the optimizations?
• What are the overheads?
Methodology
• Incoop implementation based on Hadoop
• Applications (compute- and data-intensive)
• Wikipedia and synthetic data-sets
• Cluster of 20 machines
Performance gains
For incremental changes:
• Speedups range from 1000x (for a 0% change) down to 1.5x (for a 25% change)
• Compute-intensive applications benefit more than data-intensive ones
Optimization: Contraction tree
[Plots: (a) k-nearest-neighbor classifier, (b) co-occurrence matrix]
Optimization: Scheduler
Run-time savings from the modified scheduler:
• About 30% less time for data-intensive applications
• About 15% less time for compute-intensive applications
Overhead: Performance
• Run-time overhead of up to 22% for the first run
• Incurred only once; subsequent runs are fast
Overhead: Storage
• Space usage of up to 9x the input size for memoization
• Garbage collection bounds the storage consumption
Case studies
Implemented two use cases of incremental processing:
• Incremental log processing (Flume)
• Continuous query processing (Pig)
Incoop transparently benefits these as well; refer to the paper for details.
Conclusions
Incremental processing of MapReduce computations:
• Transparent
• Efficient
• Good performance at a modest overhead for the first run