MapReduce As A Language for Parallel Computing Wenguang CHEN, Dehao CHEN Tsinghua University
Future Architecture • Many alternatives • A few powerful cores (Intel/AMD: 2, 3, 4, 6, …) • Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256, …) • Heterogeneous (CELL, 1/8; FPGA speedup, …) • But programming them is not easy • They all use different programming models; some are (relatively) easy, some are extremely difficult • OpenMP, MPI, MapReduce • CUDA, Brook • Verilog, SystemC
What makes parallel computing so difficult? • Identifying and expressing parallelism • Automatic parallelization has failed so far • Complex synchronization may be required • Data races and deadlocks are difficult to debug • Load balance…
Map-Reduce is promising • Can only solve a subset of problems • But an important and fast-growing subset, such as indexing • Easy to use • Programmers only need to write sequential code • The simplest practical parallel programming paradigm? • The dominant programming paradigm in Internet companies • Originally built for distributed systems, now ported to GPU, CELL, and multicore • But there are many dialects, which hurts portability
Limitations on GPUs • Rely on the CPU to allocate memory • How to support variable-length data? • Combine size and offset information with the key/value pair • How to allocate the output buffer on GPUs? • Two-pass scan: get the count first, then do the real execution • Lack of lock support • How to synchronize to avoid write conflicts? • Memory is pre-allocated, so every thread knows where it should write
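The two-pass scheme on this slide can be sketched in plain sequential C. This is only an illustration under stated assumptions, not code from Mars: `Slot`, `map_count`, `exclusive_scan`, `map_emit`, and `two_pass_map` are hypothetical names, and the loops over `i` stand in for GPU threads.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical directory entry: where one variable-length value lives. */
typedef struct {
    size_t offset;  /* byte offset into the shared output buffer */
    size_t size;    /* length of the value in bytes */
} Slot;

/* Pass 1: a "thread" only reports how many bytes it would emit. */
static size_t map_count(const char *word) {
    return strlen(word) + 1;  /* include the terminating NUL */
}

/* Exclusive prefix sum: offsets[i] = sizes[0] + ... + sizes[i-1].
   Returns the total number of bytes needed. */
static size_t exclusive_scan(const size_t *sizes, size_t *offsets, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = total;
        total += sizes[i];
    }
    return total;
}

/* Pass 2: each thread writes only into its own pre-computed region,
   so no lock is needed. */
static void map_emit(const char *word, char *out, Slot *slot, size_t offset) {
    slot->offset = offset;
    slot->size = strlen(word) + 1;
    memcpy(out + offset, word, slot->size);
}

/* Runs both passes over n inputs. Returns the bytes used,
   or 0 if the caller-provided buffer is too small. */
size_t two_pass_map(const char *const *words, size_t n,
                    char *out, size_t out_cap, Slot *slots,
                    size_t *sizes, size_t *offsets) {
    for (size_t i = 0; i < n; i++)          /* pass 1: count only */
        sizes[i] = map_count(words[i]);
    size_t total = exclusive_scan(sizes, offsets, n);
    if (total > out_cap)                    /* on a GPU, the host would */
        return 0;                           /* allocate `total` bytes here */
    for (size_t i = 0; i < n; i++)          /* pass 2: real execution */
        map_emit(words[i], out, &slots[i], offsets[i]);
    return total;
}
```

The cost noted later in the deck is visible here: the map logic runs twice, once to count and once to write.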
MapReduce on Multi-core CPU (Phoenix [HPCA'07]) Input → Split → Map → Partition → Reduce → Merge → Output
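The stages on this slide can be walked through with a small sequential sketch. It is not the Phoenix API: `Pair`, `Partition`, `map_chunk`, `reduce_part`, `count_keys`, and the three size constants are hypothetical, keys are assumed to be small non-negative integers, and a real runtime would run the map and reduce calls on worker threads with per-task buffers.

```c
#include <stddef.h>

#define NUM_KEYS  8   /* hypothetical: keys are integers 0..7 */
#define NUM_PARTS 2   /* hypothetical number of reduce partitions */
#define MAX_PAIRS 64  /* hypothetical capacity of each partition */

typedef struct { int key; int val; } Pair;
typedef struct { Pair pairs[MAX_PAIRS]; size_t n; } Partition;

/* Map: emit (key, 1) for every element of one input chunk and route
   each pair to the partition that owns its key. */
static void map_chunk(const int *chunk, size_t len, Partition *parts) {
    for (size_t i = 0; i < len; i++) {
        Partition *p = &parts[chunk[i] % NUM_PARTS];
        p->pairs[p->n].key = chunk[i];
        p->pairs[p->n].val = 1;
        p->n++;
    }
}

/* Reduce: sum the values of one partition, key by key. */
static void reduce_part(const Partition *p, int *counts) {
    for (size_t i = 0; i < p->n; i++)
        counts[p->pairs[i].key] += p->pairs[i].val;
}

/* Split -> Map -> Partition -> Reduce -> Merge, executed sequentially.
   Returns 0 on success, -1 if the input violates the sketch's limits. */
int count_keys(const int *input, size_t n, size_t chunk_len,
               int counts[NUM_KEYS]) {
    Partition parts[NUM_PARTS] = {0};
    if (chunk_len == 0 || n > MAX_PAIRS)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (input[i] < 0 || input[i] >= NUM_KEYS)
            return -1;
    for (int k = 0; k < NUM_KEYS; k++)
        counts[k] = 0;
    for (size_t start = 0; start < n; start += chunk_len) {  /* split */
        size_t len = (n - start < chunk_len) ? n - start : chunk_len;
        map_chunk(input + start, len, parts);      /* map + partition */
    }
    for (int p = 0; p < NUM_PARTS; p++)            /* reduce */
        reduce_part(&parts[p], counts);
    /* Merge is trivial here: partitions own disjoint keys, so their
       results land in disjoint entries of counts. */
    return 0;
}
```

Because each key belongs to exactly one partition, the reduce calls never touch the same output entry, which is what lets them run in parallel without synchronization.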
MapReduce on GPU (Mars [PACT'08]) Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output
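The "Sort and Group" stage is the step that turns a flat list of intermediate pairs into contiguous per-key ranges for the reduce stage. A minimal CPU sketch follows, with assumptions stated plainly: the standard library's `qsort` stands in for the GPU sort that Mars uses, and `Pair`, `Group`, `sort_and_group`, and `reduce_sum` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *key; int val; } Pair;
typedef struct { size_t start; size_t count; } Group;  /* one range per key */

/* Orders pairs by key so that equal keys become adjacent. */
static int cmp_pair(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
}

/* Sorts pairs in place, then records one (start, count) range per
   distinct key. `groups` must have room for n entries.
   Returns the number of groups found. */
size_t sort_and_group(Pair *pairs, size_t n, Group *groups) {
    size_t ngroups = 0;
    qsort(pairs, n, sizeof *pairs, cmp_pair);
    for (size_t i = 0; i < n; i++) {
        /* A new group starts wherever the key differs from its predecessor. */
        if (i == 0 || strcmp(pairs[i].key, pairs[i - 1].key) != 0) {
            groups[ngroups].start = i;
            groups[ngroups].count = 0;
            ngroups++;
        }
        groups[ngroups - 1].count++;
    }
    return ngroups;
}

/* Reduce for word count: sum the values inside one group. */
int reduce_sum(const Pair *pairs, Group g) {
    int sum = 0;
    for (size_t i = 0; i < g.count; i++)
        sum += pairs[g.start + i].val;
    return sum;
}
```

Once the groups are known, each one can be handed to a separate reduce thread, since the ranges do not overlap.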
Program Example • Word Count (Phoenix Implementation)
…
for (i = 0; i < args->length; i++) {
    curr_ltr = toupper(data[i]);
    switch (state) {
    case IN_WORD:
        data[i] = curr_ltr;
        /* Anything other than a letter or an apostrophe ends the word. */
        if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
            data[i] = 0;  /* terminate the word in place */
            emit_intermediate(curr_start, (void *)1, &data[i] - curr_start + 1);
            state = NOT_IN_WORD;
        }
        break;
…
Program Example • Word Count (Mars Implementation)
/* Map: emits one (word, size) pair per space-delimited word. */
__device__ void GPU_MAP_FUNC//(void *key, void *val, int keySize, int valSize)
{
    …
    do {
        …
        if (*line != ' ') line++;
        else {
            line++;
            GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize-1, sizeof(int));
            while (*line == ' ') { line++; }
            wordSize = 0;
        }
    } while (*line != '\n');
    …
}
/* Map count: same traversal, but only reports the sizes to be emitted. */
__device__ void GPU_MAP_COUNT_FUNC//(void *key, void *val, int keySize, int valSize)
{
    …
    do {
        …
        if (*line != ' ') line++;
        else {
            line++;
            GPU_EMIT_INTER_COUNT_FUNC(wordSize-1, sizeof(int));
            while (*line == ' ') { line++; }
            wordSize = 0;
        }
    } while (*line != '\n');
    …
}
Pros and Cons • Load balance • Phoenix: static + dynamic • Mars: static; assigns the same amount of map/reduce workload to each thread • Pre-allocation • Lock-free • Requires a two-pass scan, which is not an efficient solution • Sorting: the bottleneck of Mars • Phoenix sorts by insertion, dynamically, while emitting • Mars uses bitonic sort: O(n log² n)
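To make the O(n log² n) figure concrete, here is a sequential C sketch of the bitonic sorting network. It is a textbook formulation, not the Mars kernel; `bitonic_sort` is a hypothetical name, and n is assumed to be a power of two. The network has log n (log n + 1) / 2 steps, each with n/2 compare-exchanges that are independent of one another, which is why the innermost loop maps well onto GPU threads.

```c
#include <stddef.h>

/* In-place ascending bitonic sort; n must be a power of two.
   Returns the number of compare-exchange operations performed. */
size_t bitonic_sort(int *a, size_t n) {
    size_t compares = 0;
    for (size_t k = 2; k <= n; k <<= 1) {          /* size of merged runs */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* partner distance */
            /* Each iteration of this loop is independent of the others,
               so on a GPU it would be one thread per index. */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i) {
                    int ascending = ((i & k) == 0);
                    compares++;
                    if ((ascending && a[i] > a[partner]) ||
                        (!ascending && a[i] < a[partner])) {
                        int tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
    return compares;
}
```

For n = 8 the network has 6 steps of 4 compare-exchanges each, 24 in total, regardless of the input order. That fixed schedule is what makes the sort lock-free and GPU-friendly, and also why it does more work than an O(n log n) sort.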
Map-Reduce as a Language, not a library • Can we have a portable Map-Reduce that runs efficiently across different architectures? • Promising • Map-Reduce already specifies the parallelism well • No complex synchronization in user code • But still difficult • Different architectures provide different features • Either portability or performance suffers • Use the compiler and runtime to hide the architectural differences, as we have done for high-level languages such as C
Compiler, library & runtime • [Figure: C is mapped by a compiler, library & runtime onto x86, Power, Sparc, … • By analogy, Map-Reduce is mapped by a target-specific library & runtime onto clusters, general multicore, and GPUs]
Case study on nVidia GPU • Portability • Host function support • Annotating libc and inlining • Dynamic memory allocation • A big problem; should it be unsupported in user code? • Performance • Memory hierarchy optimization (identifying global, shared, and read-only memory) • A typed language is preferable (int4 type acceleration…) • Dynamic memory allocation (again!)
More to explore • …