MapReduce As A Language for Parallel Computing

Wenguang CHEN, Dehao CHEN, Tsinghua University

Future Architecture
• Many alternatives
  • A few powerful cores (Intel/AMD: 2, 3, 4, 6, …)
  • Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256, …)
  • Heterogeneous (CELL: 1/8; FPGA speedup, …)
• But programming them is not easy
  • Each uses a different programming model; some are (relatively) easy, some are extremely difficult
  • OpenMP, MPI, MapReduce
  • CUDA, Brook
  • Verilog, SystemC

What makes parallel computing so difficult
• Parallelism identification and expression
  • Automatic parallelization has failed so far
• Complex synchronization may be required
  • Data races and deadlocks, which are difficult to debug
• Load balance…

Map-Reduce is promising
• Can only solve a subset of problems
  • But an important and fast-growing subset, such as indexing
• Easy to use
  • Programmers only need to write sequential code
  • The simplest practical parallel programming paradigm?
• The dominant programming paradigm in Internet companies
• Originally built for distributed systems, now ported to GPU, CELL, and multicore
  • But there are many dialects, which hurts portability

Limitations on GPUs
• Rely on the CPU to allocate memory
  • How to support variable-length data?
    • Combine size and offset information with the key/value pair
  • How to allocate the output buffer on GPUs?
    • Two-pass scan: get the count first, then do the real execution
• Lack of lock support
  • How to synchronize to avoid write conflicts?
    • Memory is pre-allocated, so every thread knows where it should write

MapReduce on Multi-core CPU (Phoenix [HPCA'07])
Input → Split → Map → Partition → Reduce → Merge → Output

MapReduce on GPU (Mars [PACT'08])
Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output

Program Example
• Word Count (Phoenix Implementation)
    …
    for (i = 0; i < args->length; i++) {
        curr_ltr = toupper(data[i]);
        switch (state) {
        case IN_WORD:
            data[i] = curr_ltr;
            if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
                data[i] = 0;
                emit_intermediate(curr_start, (void *)1, &data[i] - curr_start + 1);
                state = NOT_IN_WORD;
            }
            break;
    …

Program Example
• Word Count (Mars Implementation)
    __device__ void GPU_MAP_FUNC //(void *key, void *val, int keySize, int valSize)
    {
        ….
        do {
            ….
            if (*line != ' ')
                line++;
            else {
                line++;
                GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize-1, sizeof(int));
                while (*line == ' ') { line++; }
                wordSize = 0;
            }
        } while (*line != '\n');
        …
    }

    __device__ void GPU_MAP_COUNT_FUNC //(void *key, void *val, int keySize, int valSize)
    {
        ….
        do {
            ….
            if (*line != ' ')
                line++;
            else {
                line++;
                GPU_EMIT_INTER_COUNT_FUNC(wordSize-1, sizeof(int));
                while (*line == ' ') { line++; }
                wordSize = 0;
            }
        } while (*line != '\n');
        …
    }

Pros and Cons
• Load balance
  • Phoenix: static + dynamic
  • Mars: static; assigns the same amount of map/reduce workload to each thread
• Pre-allocation
  • Lock-free
  • Requires a two-phase scan, which is not an efficient solution
• Sorting: the bottleneck of Mars
  • Phoenix uses insertion sort dynamically while emitting
  • Mars uses bitonic sort: O(n · log n · log n)

Map-Reduce as a Language, not a library
• Can we have a portable Map-Reduce that runs efficiently across different architectures?
• Promising
  • Map-Reduce already specifies the parallelism well
  • No complex synchronization in user code
• But still difficult
  • Different architectures provide different features
  • Either portability or performance suffers
• Use the compiler and runtime to cover the architecture differences, as we have done in supporting high-level languages such as C

Compiler, library & Runtime
[Diagram: C is mapped onto X86, Power, Sparc, …; likewise, Map-Reduce is mapped onto Cluster, Multicore, and GPU, each through its own library & runtime.]

Case study on nVidia GPU
• Portability
  • Host function support
    • Annotating libc and inline
  • Dynamic memory allocation
    • Big problem; do not support it in user code?
• Performance
  • Memory hierarchy optimization (global, shared, read-only memory identification)
  • A typed language is preferable (int4 type acceleration…)
  • Dynamic memory allocation (again!)

More to explore • …
