
Panda: MapReduce Framework on GPU’s and CPU’s

Panda: MapReduce Framework on GPU's and CPU's. Hui Li, Geoffrey Fox. Research goal: provide a uniform MapReduce programming model that works on HPC clusters or virtual clusters, across cores on traditional Intel-architecture chips and cores on GPUs, using CUDA, OpenCL, OpenMP, or OpenACC.


Presentation Transcript


  1. Panda: MapReduce Framework on GPU's and CPU's Hui Li, Geoffrey Fox

  2. Research Goal • Provide a uniform MapReduce programming model that works on HPC clusters or virtual clusters, across cores on traditional Intel-architecture chips and cores on GPUs, using CUDA, OpenCL, OpenMP, or OpenACC

  3. Multi-Core Architecture • Sophisticated mechanisms for instruction optimization and caching • Current trends: • Adding more cores • More SIMD: SSE3/AVX • Application-specific extensions: VT-x, AES-NI • Point-to-point interconnects, higher memory bandwidth

  4. Fermi GPU Architecture • Generic many-core GPU • Not optimized for single-threaded performance; designed for throughput-oriented workloads • Low-latency, hardware-managed thread switching • Large number of ALUs per "core", with a small user-managed cache per core • Memory bus optimized for bandwidth

  5. GPU Architecture Trends • Programmability vs. throughput performance: fixed function → partially programmable (GPUs) → fully programmable (CPUs) • Parallelism: multi-threaded → multi-core → many-core (Intel Larrabee, NVIDIA CUDA) • Figure based on Intel Larrabee presentation at SuperComputing 2009

  6. Top 10 innovations in NVIDIA Fermi GPU and top 3 next challenges

  7. GPU Clusters • GPU cluster hardware systems • FutureGrid 16-node Tesla 2075 "Delta" 2012 • Keeneland 360-node Fermi GPUs 2010 • NCSA 192-node Tesla S1070 "Lincoln" 2009 • GPU cluster software systems • Software stack similar to a CPU cluster • GPU resource management • GPU cluster runtimes • MPI/OpenMP/CUDA • Charm++/CUDA • MapReduce/CUDA • Hadoop/CUDA

  8. GPU Programming Models • Shared memory parallelism (single GPU node) • OpenACC • OpenMP/CUDA • MapReduce/CUDA • Distributed memory parallelism (multiple GPU nodes) • MPI/OpenMP/CUDA • Charm++/CUDA • MapReduce/CUDA • Distributed memory parallelism on GPU and CPU nodes • MapCG/CUDA/C++ • Hadoop/CUDA • Streaming • Pipelines • JNI (Java Native Interface)

  9. GPU Parallel Runtimes

  10. CUDA: Software Stack Image from [5]

  11. CUDA: Program Flow • Host side: CPU and main memory • Device side: GPU cores and device memory • Host and device transfer data over PCI-Express

  12. CUDA: Thread Model • Kernel • A device function invoked by the host computer • Launches a grid with multiple blocks, and multiple threads per block • Blocks • Independent tasks composed of multiple threads • No synchronization between blocks • SIMT: Single-Instruction Multiple-Thread • Multiple threads execute the same instruction on different data (as in SIMD), and can diverge if necessary Image from [3]

  13. CUDA: Memory Model Image from [3]

  14. Panda: MapReduce Framework on GPU's and CPU's • Current version 0.2 • Applications: • Word count • C-means clustering • Features: • Runs on two GPU cards • Some initial iterative MapReduce support • Next version 0.3 • Features: • Run on GPU's and CPU's (done for word count) • Optimized static scheduling (to do)

  15. Panda: Data Flow • The Panda scheduler dispatches work to a CPU processor group (CPU cores, CPU memory) and to GPU accelerator groups (GPU cores, GPU memory) • Groups exchange data through shared memory and over PCI-Express

  16. Architecture of Panda Version 0.3 • Configure Panda job, GPU and CPU groups; iterations • Static scheduling of map tasks based on GPU and CPU capability: • CPU Processor Group 1: CPUMapper(num_cpus), hash partitioner • GPU Accelerator Group 1: GPUMapper<<<block,thread>>>, round-robin partitioner • GPU Accelerator Group 2: GPUMapper<<<block,thread>>>, round-robin partitioner • Copy intermediate results of mappers from GPU to CPU memory; sort all intermediate key-value pairs in CPU memory • Static scheduling for reduce tasks: • CPU Processor Group 1: CPUReducer(num_cpus), hash partitioner • GPU Accelerator Group 1: GPUReducer<<<block,thread>>>, round-robin partitioner • GPU Accelerator Group 2: GPUReducer<<<block,thread>>>, round-robin partitioner • Merge output

  17. Panda's Performance on GPU's • 2 GPUs: T2075 • C-means clustering (100 dim, 10 clusters, 10 iterations, 100m)

  18. Panda's Performance on GPU's • 1 GPU: T2075 • C-means clustering (100 dim, 10 clusters, 10 iterations, 100m)

  19. Panda's Performance on CPU's • 20 CPU cores (Xeon 2.8 GHz); 2 GPUs (T2075) • Word count, input file: 50 MB

  20. Acknowledgement • FutureGrid • SalsaHPC
