
FlumeJava: Easy, Efficient Data-Parallel Pipelines

Google @ PLDI’10. Mosharaf Chowdhury.


Presentation Transcript


  1. FlumeJava: Easy, Efficient Data-Parallel Pipelines. Google @ PLDI’10. Mosharaf Chowdhury

  2. Problem
     • Efficient data-parallel pipelines
       • Chains of MapReduce programs
       • Iterative jobs
       • …
     • FlumeJava exposes a limited set of parallel operations on immutable parallel collections

  3. Goals
     • Expressiveness
       • Abstractions
       • Data representation
       • Implementation strategy
     • Performance
       • Lazy evaluation
       • Dynamic optimization
     • Usability & deployability
       • Implemented as a Java library
       • Inspired by the failure of Lumberjack

  4. FlumeJava Workflow
     1. Write a Java program using the FlumeJava library:

        // Split every input line into words; each emitted word becomes
        // one element of the output collection.
        PCollection<String> words =
            lines.parallelDo(new DoFn<String, String>() {
              void process(String line, EmitFn<String> emitFn) {
                for (String word : splitIntoWords(line)) {
                  emitFn.emit(word);
                }
              }
            }, collectionOf(strings()));

     2. Call FlumeJava.run();
     3. Optimize
     4. Execute
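The workflow above relies on lazy evaluation: calling parallelDo() only records an operation, and nothing executes until the run step. The following is a minimal sketch of that idea in plain Java. `DeferredCollection` and its methods are hypothetical stand-ins invented for illustration; they are not the FlumeJava classes, and the sketch performs no optimization or parallel execution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical toy class, NOT the FlumeJava PCollection: it stores a plan
// for computing its elements instead of the elements themselves.
final class DeferredCollection<T> {
    private final Supplier<List<T>> plan; // deferred computation
    private List<T> materialized;         // cached result, filled in by run()

    private DeferredCollection(Supplier<List<T>> plan) {
        this.plan = plan;
    }

    // Wraps in-memory data as the starting point of a pipeline.
    static <T> DeferredCollection<T> of(List<T> elements) {
        List<T> copy = new ArrayList<>(elements);
        return new DeferredCollection<>(() -> copy);
    }

    // Records the operation only; fn is not invoked until run() is called.
    <R> DeferredCollection<R> parallelDo(BiConsumer<T, Consumer<R>> fn) {
        return new DeferredCollection<>(() -> {
            List<R> out = new ArrayList<>();
            for (T element : run()) {      // evaluates the upstream plan on demand
                fn.accept(element, out::add);
            }
            return out;
        });
    }

    // Evaluates the recorded plan once and caches the result.
    List<T> run() {
        if (materialized == null) {
            materialized = plan.get();
        }
        return materialized;
    }
}
```

In this sketch, `DeferredCollection.of(lines).parallelDo(splitFn)` returns immediately without calling `splitFn`; the function runs only when `run()` is invoked on the result. A real system would use the gap between recording and running to optimize the recorded plan.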

  5. Core Abstractions
     • Parallel collections
       • PCollection<T>
       • PTable<K, V>
     • Data-parallel operations
       • Primitives: parallelDo(), groupByKey(), combineValues(), flatten()
       • Derived operations: count(), join(), top()
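To make the split between primitives and derived operations concrete, here is a sketch of how a count() operation could be built from parallelDo, groupByKey, and combineValues. It uses ordinary in-memory Java collections as stand-ins for PCollection and PTable; the class `PrimitivesSketch` and all method signatures are assumptions made for illustration and do not reproduce the FlumeJava API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Consumer;

// Hypothetical in-memory stand-ins for the four primitives (sequential, not parallel).
final class PrimitivesSketch {
    // parallelDo: apply fn to every element; fn may emit zero or more outputs.
    static <T, R> List<R> parallelDo(List<T> in, BiConsumer<T, Consumer<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T element : in) {
            fn.accept(element, out::add);
        }
        return out;
    }

    // groupByKey: collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // combineValues: reduce each key's values with an associative combiner.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> combined = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            group.getValue().stream().reduce(combiner)
                 .ifPresent(value -> combined.put(group.getKey(), value));
        }
        return combined;
    }

    // flatten: treat several collections as one.
    static <T> List<T> flatten(List<List<T>> inputs) {
        List<T> out = new ArrayList<>();
        for (List<T> input : inputs) {
            out.addAll(input);
        }
        return out;
    }

    // Derived operation: count occurrences by pairing each element with 1,
    // grouping by element, then summing the ones.
    static <T> Map<T, Long> count(List<T> in) {
        List<Map.Entry<T, Long>> ones =
            parallelDo(in, (x, emit) -> emit.accept(Map.entry(x, 1L)));
        return combineValues(groupByKey(ones), Long::sum);
    }
}
```

The point of the sketch is that count() needs no machinery of its own: it is a fixed composition of three primitives, which is why an optimizer only has to understand the primitives.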

  6. MapShuffleCombineReduce (MSCR)
     • Transforms combinations of the four primitives into a single MapReduce
     • Generalizes MapReduce
       • Multiple reducers/combiners
       • Multiple outputs per reducer
       • Pass-through outputs

  7. Optimization
     • Optimizer strategy
       • Sink flattens
       • Lift CombineValues
       • Insert fusion blocks
       • Fuse parallelDos
       • Fuse MSCRs
     • Optimizer output: MSCR, Flatten, and Operate operations
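Of the optimizer steps listed above, fusing parallelDos is the easiest to illustrate: when one parallelDo consumes the output of another, the two functions can be composed into one so the intermediate collection is never built. The sketch below shows this producer-consumer composition in plain Java. `FusionSketch`, `fuse`, and the sequential `parallelDo` helper are hypothetical names chosen for illustration, not part of FlumeJava, and the sketch does not cover the other optimizer steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical illustration of producer-consumer fusion of two parallelDo functions.
final class FusionSketch {
    // Composes f and g: every value f emits is handed straight to g,
    // so no intermediate collection of B values is materialized.
    static <A, B, C> BiConsumer<A, Consumer<C>> fuse(
            BiConsumer<A, Consumer<B>> f,
            BiConsumer<B, Consumer<C>> g) {
        return (a, emitC) -> f.accept(a, b -> g.accept(b, emitC));
    }

    // Sequential stand-in for parallelDo, used to compare fused and unfused runs.
    static <T, R> List<R> parallelDo(List<T> in, BiConsumer<T, Consumer<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T element : in) {
            fn.accept(element, out::add);
        }
        return out;
    }
}
```

Running `parallelDo(parallelDo(lines, split), upper)` and `parallelDo(lines, fuse(split, upper))` yields the same elements in the same order in this sketch; the fused form makes one pass over the input instead of two.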

  8. Hit or Miss?
     • Sizable reduction in SLOC
       • Except for Sawzall
     • 5x reduction in average number of stages
     • Faster than other approaches
       • Except for hand-optimized MapReduce chains
     • 319 users over a one-year period
