Panda: MapReduce Framework on GPU’s and CPU’s
Hui Li, Geoffrey Fox
Research Goal • Provide a uniform MapReduce programming model that works on HPC clusters or virtual clusters, using both the cores of traditional Intel-architecture chips and the cores of GPUs • Related technologies: CUDA, OpenCL, OpenMP, OpenACC
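To make the idea of a uniform programming model concrete, here is a minimal host-only C++ sketch, not Panda's actual API. The names `run_job`, `MapFn`, `ReduceFn`, and `word_count` are hypothetical. It assumes the user writes one map function and one reduce function, and a driver runs map over every input split, sorts the intermediate pairs by key, and calls reduce once per distinct key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not Panda's interfaces.
using KeyValue = std::pair<std::string, int>;
using MapFn = std::function<void(const std::string&, std::vector<KeyValue>&)>;
using ReduceFn = std::function<int(const std::string&, const std::vector<int>&)>;

// Hypothetical driver: map every split, sort the intermediate pairs by key,
// then call reduce once for each distinct key.
std::vector<KeyValue> run_job(const std::vector<std::string>& splits,
                              const MapFn& map_fn, const ReduceFn& reduce_fn) {
    assert(map_fn && reduce_fn);
    std::vector<KeyValue> intermediate;
    for (const auto& split : splits) map_fn(split, intermediate);

    std::sort(intermediate.begin(), intermediate.end());

    std::vector<KeyValue> output;
    std::size_t i = 0;
    while (i < intermediate.size()) {
        const std::string& key = intermediate[i].first;
        std::vector<int> values;
        std::size_t j = i;
        // Collect the run of pairs that share this key.
        while (j < intermediate.size() && intermediate[j].first == key) {
            values.push_back(intermediate[j].second);
            ++j;
        }
        output.emplace_back(key, reduce_fn(key, values));
        i = j;
    }
    return output;
}

// Word count written once against the interface above.
std::vector<KeyValue> word_count(const std::vector<std::string>& splits) {
    MapFn map_fn = [](const std::string& text, std::vector<KeyValue>& out) {
        std::istringstream words(text);
        std::string word;
        while (words >> word) out.emplace_back(word, 1);  // emit (word, 1)
    };
    ReduceFn reduce_fn = [](const std::string&, const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    };
    return run_job(splits, map_fn, reduce_fn);
}
```

In this sketch the map and reduce functions never mention the device they run on; a real framework would decide behind the driver whether a given task runs on CPU cores or on a GPU.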
Multi Core Architecture
• Sophisticated mechanisms for optimizing instruction execution and caching
• Current trends:
  • Adding many cores
  • More SIMD: SSE3/AVX
  • Application-specific extensions: VT-x, AES-NI
  • Point-to-point interconnects, higher memory bandwidth
Fermi GPU Architecture • Generic many-core GPU • Not optimized for single-threaded performance; designed for work that requires high throughput • Low-latency, hardware-managed thread switching • Large number of ALUs per “core”, with a small user-managed cache per core • Memory bus optimized for bandwidth
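GPU Architecture Trends
[Figure: programmability (fixed function, partially programmable, fully programmable) plotted against throughput performance. The CPU track is labeled multi-threaded, multi-core, many-core; the GPU track moves from fixed function through partially programmable; the fully programmable, high-throughput region is labeled Intel Larrabee and NVIDIA CUDA.]
Figure based on the Intel Larrabee presentation at SuperComputing 2009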
Top 10 innovations in NVIDIA Fermi GPU and top 3 next challenges
GPU Clusters
• GPU cluster hardware systems
  • FutureGrid 16-node Tesla 2075 “Delta” 2012
  • Keeneland 360-node Fermi GPUs 2010
  • NCSA 192-node Tesla S1070 “Lincoln” 2009
• GPU cluster software systems
  • Software stack similar to CPU cluster
  • GPU resource management
• GPU cluster runtimes
  • MPI/OpenMP/CUDA
  • Charm++/CUDA
  • MapReduce/CUDA
  • Hadoop/CUDA
GPU Programming Models
• Shared memory parallelism (single GPU node)
  • OpenACC
  • OpenMP/CUDA
  • MapReduce/CUDA
• Distributed memory parallelism (multiple GPU nodes)
  • MPI/OpenMP/CUDA
  • Charm++/CUDA
  • MapReduce/CUDA
• Distributed memory parallelism on GPU and CPU nodes
  • MapCG/CUDA/C++
  • Hadoop/CUDA
    • Streaming
    • Pipelines
    • JNI (Java Native Interface)
CUDA: Software Stack Image from [5]
CUDA: Program Flow
[Figure: the host (CPU and main memory) is connected over PCI-Express to the device (GPU cores and device memory).]
CUDA: Thread Model
• Kernel
  • A device function invoked by the host computer
  • Launches a grid with multiple blocks, and multiple threads per block
• Blocks
  • Independent tasks composed of multiple threads
  • No synchronization between blocks
• SIMT: Single-Instruction Multiple-Thread
  • Multiple threads execute the same instruction on different data (as in SIMD), and can diverge if necessary
Image from [3]
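The grid, block, and thread hierarchy above can be illustrated with a small host-side emulation in plain C++. This is a sketch, not CUDA code: `ThreadContext`, `scale_kernel`, and `launch_scale` are hypothetical names, and the nested loops stand in for what the GPU hardware would run in parallel. The fields of `ThreadContext` mirror the CUDA built-ins `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the CUDA built-ins blockIdx.x, blockDim.x and threadIdx.x.
struct ThreadContext {
    std::size_t block_idx;
    std::size_t block_dim;
    std::size_t thread_idx;
};

// "Kernel" body: every logical thread runs the same code on its own element.
void scale_kernel(const ThreadContext& ctx, std::vector<float>& data, float factor) {
    // Global index of this thread across the whole grid.
    std::size_t i = ctx.block_idx * ctx.block_dim + ctx.thread_idx;
    if (i < data.size()) data[i] *= factor;  // threads past the end do nothing
}

// Host-side emulation of a launch: one grid of blocks, each with
// threads_per_block threads. Returns the number of blocks in the grid.
std::size_t launch_scale(std::vector<float>& data, float factor,
                         std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    // Round up so that every element is covered by some thread.
    std::size_t blocks = (data.size() + threads_per_block - 1) / threads_per_block;
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            scale_kernel({b, threads_per_block, t}, data, factor);
        }
    }
    return blocks;
}
```

Because the blocks do not depend on one another, the outer loop could run in any order, which is what lets the hardware schedule blocks independently. In real CUDA the equivalent launch would use the `kernel<<<blocks, threads>>>(...)` syntax.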
CUDA: Memory Model Image from [3]
Panda: MapReduce Framework on GPU’s and CPU’s
• Current Version 0.2
  • Applications:
    • Word count
    • C-means clustering
  • Features:
    • Runs on two GPU cards
    • Some initial iterative MapReduce support
• Next Version 0.3
  • Features:
    • Runs on GPUs and CPUs (done for word count)
    • Optimized static scheduling (to do)
Panda: Data Flow
[Figure: the Panda scheduler feeds a CPU processor group (CPU cores and CPU memory, labeled shared memory) and a GPU accelerator group (CPU cores and CPU memory connected over PCI-Express to GPU cores and GPU memory).]
Architecture of Panda Version 0.3
[Figure: flowchart, top to bottom]
• Configure Panda job, GPU and CPU groups
• Iterations
• Static scheduling based on GPU and CPU capability
  • CPU Processor Group 1: CPUMapper(num_cpus), Hash Partitioner
  • GPU Accelerator Group 1: GPUMapper<<<block,thread>>>, Round-robin Partitioner
  • GPU Accelerator Group 2: GPUMapper<<<block,thread>>>, Round-robin Partitioner
• Copy intermediate results of mappers from GPU to CPU memory; sort all intermediate key-value pairs in CPU memory (the figure shows 16 keys in unsorted order after the map stage and in sorted order, 1 through 16, after the sort)
• Static scheduling for reduce tasks
  • CPU Processor Group 1: CPUReducer(num_cpus), Hash Partitioner
  • GPU Accelerator Group 1: GPUReducer<<<block,thread>>>, Round-robin Partitioner
  • GPU Accelerator Group 2: GPUReducer<<<block,thread>>>, Round-robin Partitioner
• Merge Output
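The slide names three mechanisms without showing how they work: static scheduling by capability, a hash partitioner, and a round-robin partitioner. The C++ sketch below shows one plausible reading of each. It is an assumption, not Panda's implementation: the function names are hypothetical, and the capability weights are made-up relative numbers (for example, measured throughput per group).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Static scheduling (assumed policy): split num_tasks across processor groups
// in proportion to each group's capability weight, decided before any task runs.
std::vector<std::size_t> static_schedule(std::size_t num_tasks,
                                         const std::vector<double>& capability) {
    assert(!capability.empty());
    double total = std::accumulate(capability.begin(), capability.end(), 0.0);
    assert(total > 0.0);

    std::vector<std::size_t> share(capability.size());
    std::size_t assigned = 0;
    for (std::size_t i = 0; i < capability.size(); ++i) {
        // Round each proportional share down to a whole number of tasks.
        share[i] = static_cast<std::size_t>(num_tasks * capability[i] / total);
        assigned += share[i];
    }
    // Hand out tasks lost to rounding one at a time, starting with group 0.
    for (std::size_t i = 0; assigned < num_tasks; i = (i + 1) % share.size()) {
        ++share[i];
        ++assigned;
    }
    return share;
}

// Hash partitioner: the same key always lands in the same partition.
std::size_t hash_partition(const std::string& key, std::size_t num_partitions) {
    assert(num_partitions > 0);
    return std::hash<std::string>{}(key) % num_partitions;
}

// Round-robin partitioner: records are dealt out in turn, regardless of key.
std::size_t round_robin_partition(std::size_t record_index, std::size_t num_partitions) {
    assert(num_partitions > 0);
    return record_index % num_partitions;
}
```

For example, with 16 tasks and weights {1, 1, 2} for one CPU group and two GPU groups, `static_schedule` assigns 4, 4, and 8 tasks. Hash partitioning keeps all values for a key together, while round-robin balances record counts evenly but may spread a key across partitions, so a later sort or merge is needed to regroup them.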
Panda’s Performance on GPU’s • 2 GPU: T2075 • C-means Clustering (100dim,10c,10iter, 100m)
Panda’s Performance on GPU’s • 1 GPU T2075 • C-means clustering (100dim,10c,10iter,100m)
Panda’s Performance on CPU’s • 20 CPU Xeon 2.8GHz; 2GPU T2075 • Word Count Input File: 50MB
Acknowledgement • FutureGrid • SalsaHPC