FlumeJava: Easy, Efficient Data-Parallel Pipelines

FlumeJava: Easy, Efficient Data-Parallel Pipelines
Google @ PLDI’10
Mosharaf Chowdhury

Problem
• Efficient data-parallel pipelines
  • Chain of MapReduce programs
  • Iterative jobs
  • …
• FlumeJava exposes a limited set of parallel operations on immutable parallel collections

Goals
• Expressiveness
  • Abstractions
  • Data representation
  • Implementation strategy
• Performance
  • Lazy evaluation
  • Dynamic optimization
• Usability & deployability
  • Implemented as a Java library
  • Inspired by the failure of Lumberjack

FlumeJava Workflow
1. Write a Java program using the FlumeJava library

    PCollection<String> words =
        lines.parallelDo(new DoFn<String, String>() {
          void process(String line, EmitFn<String> emitFn) {
            for (String word : splitIntoWords(line)) {
              emitFn.emit(word);
            }
          }
        }, collectionOf(strings()));

2. FlumeJava.run();
3. Optimize
4. Execute
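The workflow above relies on deferred evaluation: building the pipeline only records operations, and nothing executes until run() is called. The sketch below illustrates that idea with plain in-memory lists. It is not the FlumeJava API; the class DeferredSketch, its methods, and the calls counter are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical deferred collection: records operations, computes only in run(). */
final class DeferredSketch<T> {
    /** Counts invocations of user functions, to make the laziness observable. */
    static int calls = 0;

    private final List<String> plan;        // names of the operations recorded so far
    private final Supplier<List<T>> thunk;  // the deferred computation

    private DeferredSketch(List<String> plan, Supplier<List<T>> thunk) {
        this.plan = plan;
        this.thunk = thunk;
    }

    /** Wraps an in-memory list as the pipeline input. */
    static <T> DeferredSketch<T> of(List<T> input) {
        List<String> plan = new ArrayList<>();
        plan.add("read");
        return new DeferredSketch<>(plan, () -> input);
    }

    /** Records a parallelDo-like step; fn is not invoked here. */
    <R> DeferredSketch<R> parallelDo(String name, Function<T, List<R>> fn) {
        List<String> next = new ArrayList<>(plan);
        next.add(name);
        Supplier<List<T>> upstream = thunk;
        return new DeferredSketch<>(next, () -> {
            List<R> out = new ArrayList<>();
            for (T item : upstream.get()) {
                calls++;
                out.addAll(fn.apply(item));
            }
            return out;
        });
    }

    /** The recorded plan; an optimizer would rewrite this before execution. */
    List<String> plan() {
        return Collections.unmodifiableList(plan);
    }

    /** Executes the recorded plan and materializes the result. */
    List<T> run() {
        return thunk.get();
    }
}
```

Because the plan exists as data before anything runs, an optimizer has the chance to inspect and rewrite it, which is the step between run() and execution in the workflow.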

Core Abstractions
Parallel Collections
• PCollection<T>
• PTable<K, V>
Data-parallel Operations
• Primitives
  • parallelDo()
  • groupByKey()
  • combineValues()
  • flatten()
• Derived operations
  • count()
  • join()
  • top()
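To make the relationship between primitives and derived operations concrete, here is a minimal in-memory sketch. It is not the FlumeJava API: the class PrimitivesSketch and its static methods are hypothetical stand-ins that work on ordinary java.util lists and maps, and countWords only shows how a count()-style operation can be composed from parallelDo, groupByKey and combineValues.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** In-memory stand-ins for the four primitives; hypothetical, not the FlumeJava API. */
final class PrimitivesSketch {
    /** parallelDo: apply fn to every element; each call may emit zero or more outputs. */
    static <T, R> List<R> parallelDo(List<T> in, Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : in) {
            out.addAll(fn.apply(item));
        }
        return out;
    }

    /** groupByKey: collect all values that share a key. */
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> in) {
        Map<K, List<V>> groups = new TreeMap<>();
        for (Map.Entry<K, V> e : in) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }

    /** combineValues: reduce each key's values with an associative operator. */
    static <K extends Comparable<K>, V> Map<K, V> combineValues(
            Map<K, List<V>> groups, BinaryOperator<V> op) {
        Map<K, V> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : groups.entrySet()) {
            // Groups built by groupByKey are never empty, so get() is safe here.
            out.put(e.getKey(), e.getValue().stream().reduce(op).get());
        }
        return out;
    }

    /** flatten: treat several collections as one. */
    @SafeVarargs
    static <T> List<T> flatten(List<T>... parts) {
        List<T> out = new ArrayList<>();
        for (List<T> part : parts) {
            out.addAll(part);
        }
        return out;
    }

    /** A count()-style derived operation built only from the primitives above. */
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = parallelDo(lines, line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(w, 1));
                }
            }
            return out;
        });
        return combineValues(groupByKey(pairs), Integer::sum);
    }
}
```

The sketch runs sequentially; the point is only the shape of the composition, not parallel execution.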

MapShuffleCombineReduce (MSCR)
• Transforms combinations of the four primitives into a single MapReduce
• Generalizes MapReduce
  • Multiple reducers/combiners
  • Multiple outputs per reducer
  • Pass-through outputs

Optimization
Optimizer Strategy
• Sink flattens
• Lift CombineValues
• Insert fusion blocks
• Fuse parallelDos
• Fuse MSCRs
Optimizer Output
• MSCR
• Flatten
• Operate
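Of the optimizer steps, fusing parallelDos is the easiest to illustrate: when one parallelDo feeds another, the two functions can be composed so each input element flows through both in a single pass. The sketch below shows that composition on plain Java functions. FusionSketch and fuse are hypothetical names for this illustration, and the real optimizer works on an execution plan rather than on functions directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of producer-consumer parallelDo fusion. */
final class FusionSketch {
    /**
     * Composes producer f and consumer g into one function. Each output of f
     * is handed straight to g, so the full intermediate collection between the
     * two stages is never materialized.
     */
    static <A, B, C> Function<A, List<C>> fuse(
            Function<A, List<B>> f, Function<B, List<C>> g) {
        return a -> {
            List<C> out = new ArrayList<>();
            for (B b : f.apply(a)) {
                out.addAll(g.apply(b));
            }
            return out;
        };
    }
}
```

Applied to a split-into-words function followed by a word-length function, fuse yields a single function from a line to the lengths of its words.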

Hit or Miss?
• Sizable reduction in SLOC
  • Except for Sawzall
• 5x reduction in average number of stages
• Faster than other approaches
  • Except for hand-optimized MapReduce chains
• 319 users over a one-year period
