Incoop: MapReduce for Incremental Computation
Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, Rafael Pasquini
Max Planck Institute for Software Systems (MPI-SWS)
ACM SOCC 2011

Large-scale data processing
• Need to process large, growing data-sets
  • Use of distributed and data-parallel computing
• MapReduce: de facto data processing paradigm
  • Simple, yet powerful programming model
  • Widely adopted to support various services
Pramod Bhatotia

Incremental data processing
• Applications repeatedly process evolving data-sets
  • For search, PageRank is re-computed for every new crawl
• Online data-sets evolve slowly
  • Successive input data-sets change by 0.1% to 10%
• Need for incremental computations
  • Instead of re-computing from scratch

Incremental data processing
• Recent proposals for incremental processing
  • Google Percolator [OSDI’10]
  • CBP (Yahoo!/UCSD) [SOCC’10]
  • “MapReduce (…) cannot process small updates individually as they rely on creating large batches for efficiency.”
• Drawbacks of these systems
  • Adopt a new programming model
  • Require implementation of dynamic algorithms

Goals
• Retain the simplicity of bulk data processing systems
• Achieve the efficiency of incremental processing
• Can we meet these goals for MapReduce?
  • Take an unmodified MapReduce application
  • Automatically adapt it to handle incremental input changes

Incoop: Incremental MapReduce
• Incremental bulk data processing
  • Transparent
  • Efficient
• Inspired by algorithms/PL research
  • Provable asymptotic gains
• Efficient implementation based on Hadoop

Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation

Background: MapReduce
Input: a a b a b c c b

  Map(input) {
    foreach word in input
      output(word, 1);
  }

  Reduce(key, list(v)) {
    print key + SUM(v);
  }

Map output (as shown on the slide): (a,1) (a,1) (a,1) (b,1)
Reduce input: (a,<1,1,1>)
Output: a = 3

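The word-count pseudocode above can be sketched as a small single-process Python program. This is only an illustration of the programming model, not Hadoop code: the helper `run_mapreduce` and the way the input is split into three pieces are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(text):
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in text.split()]

def reduce_fn(key, values):
    """Sum all counts collected for one key."""
    return key, sum(values)

def run_mapreduce(splits):
    # Shuffle: group the values emitted by all Map tasks by key.
    groups = defaultdict(list)
    for split in splits:
        for key, value in map_fn(split):
            groups[key].append(value)
    # One Reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

# The slide's input "a a b a b c c b", cut into three hypothetical splits.
counts = run_mapreduce(["a a b", "a b c", "c b"])
print(counts)  # {'a': 3, 'b': 3, 'c': 2}
```
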
Basic design
• Basic principle: “Self-adjusting computations”
  • Break computation into sub-computations
  • Memoize the results of sub-computations
  • Track dependencies between input and computation
  • Re-compute only the parts affected by changes

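A minimal sketch of the memoization idea, assuming each sub-computation is a pure function of one input chunk and that a chunk is identified by a hash of its contents. The class `MemoizingRunner` and the task `count_words` are hypothetical names for illustration; they are not part of Incoop or Hadoop.

```python
import hashlib

class MemoizingRunner:
    """Runs a task over input chunks, reusing results for unchanged chunks."""

    def __init__(self, task_fn):
        self.task_fn = task_fn
        self.memo = {}      # chunk fingerprint -> memoized task result
        self.executed = 0   # number of tasks actually run (not reused)

    def run(self, chunks):
        results = []
        for chunk in chunks:
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if key not in self.memo:
                # Only chunks never seen before trigger a real computation.
                self.memo[key] = self.task_fn(chunk)
                self.executed += 1
            results.append(self.memo[key])
        return results

def count_words(chunk):
    return len(chunk.split())

runner = MemoizingRunner(count_words)
runner.run(["a a b", "a b c", "c b"])       # first run: every chunk is new
first_run = runner.executed
runner.run(["a a b", "a b c d", "c b"])     # second run: one chunk changed
second_run = runner.executed - first_run
print(first_run, second_run)  # 3 1
```

The second run executes a single task because only one chunk differs from the previous input; the other two results come from the memo table.
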
Basic design
• Changes propagate through the dependence graph
[Figure: dependence graph with stages Read input → Map tasks → Reduce tasks → Write output]

Challenges
• Stability: How to efficiently handle insertions/deletions in the input?
• Granularity: How to perform fine-grained updates to the output?
• Scheduling: How to minimize data movement?

Challenge 1: Stability
• Stability: small changes in the input lead to small changes in the dependence graph
• A stable dependence graph enables efficient change propagation
• Is the basic approach stable?

Challenge 1: Stability
[Figure: dependence graph with stages Read input → Map tasks → Reduce tasks → Write output]

Challenge 1: Stability
• Solution: content-based chunking
  • Avoid partitioning at fixed offsets
  • Instead, use a property of the input contents
• Example: assume a partition boundary after every “b”: a a b | a b | c c b | a

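The effect of the "partition upon b" example can be sketched by comparing fixed-offset and content-based partitioning when one token is inserted at the front of the input. The token-level granularity, the inserted token `x`, and the helper names are assumptions made for illustration.

```python
def fixed_chunks(tokens, size):
    """Partition at fixed offsets: every `size` tokens."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def content_chunks(tokens, marker="b"):
    """Partition by content: close a chunk right after each marker token."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == marker:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks

def reused(old, new):
    """Number of chunks in `new` that also appeared in `old`."""
    old_set = set(old)
    return sum(1 for c in new if c in old_set)

old = "a a b a b c c b a".split()
new = ["x"] + old   # insert one token at the front

fixed_reuse = reused(fixed_chunks(old, 3), fixed_chunks(new, 3))
content_reuse = reused(content_chunks(old), content_chunks(new))
print(fixed_reuse, content_reuse)  # 0 3
```

With fixed offsets the insertion shifts every boundary, so no chunk (and hence no memoized Map result) can be reused. With content-based boundaries only the first chunk changes.
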
Challenge 1: Stability
• Incremental HDFS
  • Upon file write, compute the Rabin fingerprint of the sliding-window contents
  • A fingerprint that matches the pattern marks a chunk boundary
• Content-based chunking addresses stability
  • The probability of finding the pattern controls granularity

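A rough sketch of sliding-window chunking follows. It swaps in a simple polynomial rolling hash in place of a true Rabin fingerprint, and the constants `WINDOW`, `MASK`, `BASE` and `MOD` are illustrative assumptions, not values used by Incremental HDFS. The number of bits in `MASK` plays the role of "probability of finding the pattern": more bits mean fewer boundaries and larger chunks.

```python
import random

WINDOW = 16           # sliding-window size in bytes (illustrative)
MASK = (1 << 6) - 1   # boundary when the low 6 hash bits are zero
BASE = 257
MOD = (1 << 31) - 1

def chunk_boundaries(data: bytes):
    """Return the end offsets of chunks found by a polynomial rolling hash."""
    boundaries = []
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        # The hash depends only on the last WINDOW bytes, not on the offset.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def split(data: bytes):
    chunks, start = [], 0
    for end in chunk_boundaries(data):
        chunks.append(data[start:end])
        start = end
    return chunks

rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4096))
old_chunks = split(data)
new_chunks = split(b"HELLO" + data)   # insert 5 bytes at the front
shared = len(set(old_chunks) & set(new_chunks))
print(f"{shared} of {len(old_chunks)} chunks unchanged")
```

Because a boundary is decided only by the bytes inside the window, every boundary of the old input reappears (shifted by five bytes) in the new input, so all chunks after the first keep their contents.
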
Challenge 2: Granularity
• Coarse-grained change propagation can be inefficient
  • Even for a small input change, a large task needs to be recomputed
• Not an issue for Map tasks
  • Incremental HDFS controls granularity
• Difficult to address for reducers
  • A reducer processes all values for a given key
  • Depends exclusively on computation and input

Challenge 2: Granularity
[Figure: dependence graph with stages Read input → Map tasks → Reduce task → Write output]

Challenge 2: Granularity
• Leverage Combiners: pre-processing for Reduce
  • Co-located with the Map task
  • Pre-processes Map outputs
  • Meant to reduce bandwidth

Background: Combiners
Input: a a b a b c c b

  Combine(set of <k,v>) {
    foreach distinct k
      output(<k, SUM(v)>);
  }

[Figure: word count with a Combiner; intermediate pairs shown include (a,2) and (a,1); output a = 3]

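The Combine pseudocode can be sketched in Python as a per-Map-task pre-aggregation step. The three input splits and the counter `pairs_sent` are assumptions added to make the bandwidth saving visible; they do not come from the slides.

```python
from collections import Counter, defaultdict

def map_fn(split):
    """Emit (word, 1) for every word in one input split."""
    return [(w, 1) for w in split.split()]

def combine(pairs):
    """Pre-aggregate one Map task's output: one (key, partial sum) per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_fn(key, values):
    return key, sum(values)

splits = ["a a b", "a b c", "c b"]   # hypothetical split of "a a b a b c c b"
shuffled = defaultdict(list)
pairs_sent = 0
for split in splits:
    # The Combiner runs next to the Map task, before data crosses the network.
    for key, value in combine(map_fn(split)):
        shuffled[key].append(value)
        pairs_sent += 1

result = dict(reduce_fn(k, vs) for k, vs in shuffled.items())
print(shuffled["a"], result["a"], pairs_sent)  # [2, 1] 3 7
```

The first split contributes (a,2) instead of two (a,1) pairs, so 7 pairs are shuffled rather than 8, and the Reduce for key a still yields 3.
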
Challenge 2: Granularity
• Contraction tree
  • Run Combiners at the Reducer site as well
  • Use them to break up Reduce work
[Figure: tree of Combiners feeding the Reduce task]

Challenge 2: Granularity
[Figure: dependence graph with stages Read input → Map tasks → Contraction tree → Reduce task → Write output]

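A simplified sketch of the contraction-tree idea: Map outputs for one key are combined pairwise in a balanced tree and every internal node is memoized, so a change to one leaf re-runs only the Combiners on its path to the root. It assumes the combine function is associative, and it keys the memo table on input values for brevity. The class `ContractionTree` is a hypothetical illustration, not Incoop's implementation; the paper describes the actual design.

```python
class ContractionTree:
    """Combines inputs pairwise in a tree, memoizing every internal node."""

    def __init__(self, combine):
        self.combine = combine
        self.memo = {}    # (left, right) -> combined value
        self.calls = 0    # Combiner invocations that were not memoized

    def _combine(self, left, right):
        key = (left, right)
        if key not in self.memo:
            self.memo[key] = self.combine(left, right)
            self.calls += 1
        return self.memo[key]

    def reduce(self, leaves):
        level = list(leaves)
        if not level:
            raise ValueError("no inputs to combine")
        while len(level) > 1:
            nxt = [self._combine(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:
                nxt.append(level[-1])   # odd element is carried up unchanged
            level = nxt
        return level[0]

tree = ContractionTree(lambda x, y: x + y)
total_1 = tree.reduce([1, 2, 3, 4, 5, 6, 7, 8])    # initial run
calls_1 = tree.calls
total_2 = tree.reduce([1, 2, 30, 4, 5, 6, 7, 8])   # one leaf changed
calls_2 = tree.calls - calls_1
print(total_1, calls_1, total_2, calls_2)  # 36 7 63 3
```

The initial run needs 7 Combiner calls for 8 leaves; after a single leaf changes, only the 3 nodes on its root path are recomputed, instead of re-processing all values as a plain Reduce task would.
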
Challenge 3: Scheduling
• The scheduler determines where to run each task
  • Based on input data locality and machine availability
• New variable for incremental computation
  • Location of previously computed, memoized results
• Memoization-aware scheduling
  • Prevents unnecessary data movement

Challenge 3: Scheduling
• Drawback of only memoization-aware scheduling
  • Strict scheduling leads to a straggler effect
• New hybrid scheduling algorithm
  • Data locality to exploit memoized results
  • Flexibility to prevent the straggler effect
• More details are in the paper

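The trade-off between locality and stragglers can be illustrated with a toy placement policy: prefer the node that holds a task's memoized result, but fall back to the least-loaded node once the preferred node's queue is too long. This is not the algorithm from the paper; the function names, the `max_queue` threshold, and the node and task labels are all assumptions made for the sketch.

```python
def pick_node(task, queue_len, memo_location, max_queue=2):
    """Prefer the node holding the memoized result unless it is overloaded."""
    preferred = memo_location.get(task)
    if preferred is not None and queue_len[preferred] < max_queue:
        return preferred
    # Fall back to the least-loaded node to avoid creating a straggler.
    return min(queue_len, key=lambda n: (queue_len[n], n))

def schedule(tasks, nodes, memo_location, max_queue=2):
    queue_len = {n: 0 for n in nodes}
    placement = {}
    for task in tasks:
        node = pick_node(task, queue_len, memo_location, max_queue)
        placement[task] = node
        queue_len[node] += 1
    return placement

# Hypothetical state: three tasks have memoized results on n1, one on n2.
memo_location = {"t1": "n1", "t2": "n1", "t3": "n1", "t4": "n2"}
placement = schedule(["t1", "t2", "t3", "t4"], ["n1", "n2", "n3"], memo_location)
print(placement)  # {'t1': 'n1', 't2': 'n1', 't3': 'n2', 't4': 'n2'}
```

A strictly memoization-aware policy would queue t1, t2 and t3 all on n1; the relaxed policy moves t3 elsewhere, paying for some data movement to keep n1 from becoming the straggler.
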
Summary
• Incoop enables incremental processing for MapReduce applications
• Incoop design includes
  • Incremental HDFS for stable input partitioning
  • Contraction tree for fine-grained updates
  • Scheduler for memoized data locality and straggler mitigation

Evaluating Incoop
• Goal: determine how Incoop works in practice
  • What are the performance benefits?
  • How effective are the optimizations?
  • What are the overheads?
• Methodology
  • Incoop implementation based on Hadoop
  • Applications (compute- and data-intensive)
  • Wikipedia and synthetic data-sets
  • Cluster of 20 machines

Performance gains
• For incremental changes
  • Speedups range from 1000X to 1.5X for input changes of 0% to 25%
• Compute-intensive applications perform better than data-intensive ones

Optimization: Contraction tree
[Figure: (a) K-nearest neighbor classifier, (b) Co-occurrence matrix]

Optimization: Scheduler
• Run-time savings with the modified scheduler
  • Around 30% less time for data-intensive applications
  • Around 15% less time for compute-intensive applications

Overhead: Performance
• Run-time overhead of up to 22% for the first run
• Incurred only once; subsequent runs are fast

Overhead: Storage
• Space usage of up to 9X the input size for memoization
• Garbage collection for bounded storage consumption

Case-studies
• Implemented two use cases of incremental processing
  • Incremental log processing (Flume)
  • Continuous query processing (Pig)
• Both benefit transparently from Incoop
• Refer to the paper for details

Conclusions
• Incremental processing of MapReduce computations
  • Transparent
  • Efficient
• Good performance at a modest overhead for the first run