End Semester Project for course Parallel Computing

“Cellular Automata using nVIDIA CUDA” and “Bridging the Gap between MPJExpress and CUDA”. End Semester Project for course Parallel Computing. Team members: Bibrak Qamar NUST-2007-BIT9-105 Jahanzaib Maqbool NUST-2007-BIT9-118 Bilawal Sarwar NUST-2007-BIT9-11


Presentation Transcript


  1. “Cellular Automata using nVIDIA CUDA” and “Bridging the Gap between MPJExpress and CUDA”
     End Semester Project for course Parallel Computing
     Team members:
     • Bibrak Qamar NUST-2007-BIT9-105
     • Jahanzaib Maqbool NUST-2007-BIT9-118
     • Bilawal Sarwar NUST-2007-BIT9-11
     • Muhammad Imran NUST-2007-BIT9-127
     • Mahreen Nadeem NUST-2007-BIT9-

  2. System Specification
     • Name: CUDA-TESTBED
     • Processor: Intel(R) Xeon(R) CPU W3520 @ 2.67 GHz (quad-core)
     • Cores: 4
     • Hardware threads per core: 2
     • GPU: 2 × NVIDIA GTX 285
     • Memory: 8 GB

  3. NVIDIA GTX 285
     • GPU Engine Specs:
        • CUDA Cores: 240
        • Graphics Clock: 648 MHz (the core clock)
        • Processor Clock: 1476 MHz (the shader clock, or “hot clock”)
     • Memory Specs:
        • Memory Clock: 1242 MHz
        • Standard Memory: 1 GB GDDR3
        • Memory Interface Width: 512-bit
        • Memory Bandwidth: 159.0 GB/sec
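The memory bandwidth figure on this slide can be reproduced from the memory clock and interface width. The worked arithmetic below assumes GDDR3 transfers data twice per memory clock cycle (double data rate) and uses 1 GB = 10^9 bytes:

```latex
\[
\text{Bandwidth}
  = 1242 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}}
    \times 2\ \tfrac{\text{transfers}}{\text{cycle}}
    \times \frac{512\ \text{bits}}{8\ \text{bits/byte}}
  = 158.976 \times 10^{9}\ \tfrac{\text{bytes}}{\text{s}}
  \approx 159.0\ \text{GB/s}
\]
```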

  4. Implementation
     • Game of Life on CUDA
     • Fish and Shark on CUDA
     • Matrix Multiplication on GPU Accelerated Cluster using MPJExpress

  5. Cellular Automata: Fish and Shark Execution Flow
     • Initialize device
     • Allocate device-side and host-side memory
     • Populate cells
     • Copy from host to device
     • Loop in Display()
        • Draw cells
        • Execute kernel
        • Copy result back to host
     • End loop
     • Free memory
     • End program

  6. Kernel function
     • Get ThreadID.X and ThreadID.Y
     • Fetch the neighbors' states
     • Decide the cell's fate
     • Write the result to the resultant cellular board
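The four kernel steps above can be sketched on the CPU as follows. This is not the team's kernel: it is a minimal sequential C++ sketch in which the loop indices `x` and `y` stand in for the thread indices, the rule is Conway's Game of Life (the slides do not spell out the Fish and Shark rules), and the wrap-around border is an assumption. The names `Board`, `nextCellState`, and `step` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical board type: row-major grid, 0 = dead, 1 = alive.
using Board = std::vector<unsigned char>;

// Work done for one cell. In a CUDA kernel, x and y would be derived from
// the thread and block indices instead of being passed in.
unsigned char nextCellState(const Board& in, int width, int height,
                            int x, int y) {
    // Fetch neighbors: count the live cells among the 8 surrounding cells.
    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;      // skip the cell itself
            int nx = (x + dx + width) % width;     // assumed: edges wrap around
            int ny = (y + dy + height) % height;
            alive += in[ny * width + nx];
        }
    }
    // Decide fate (Conway's rules): birth on 3, survival on 2 or 3.
    bool isAlive = in[y * width + x] != 0;
    return (alive == 3 || (isAlive && alive == 2)) ? 1 : 0;
}

// One generation. Results go to a separate board so that every cell reads
// the previous generation only, as the "resultant board" step requires.
Board step(const Board& in, int width, int height) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    Board out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out[y * width + x] = nextCellState(in, width, height, x, y);
        }
    }
    return out;
}
```

Writing to a second board rather than updating in place is what makes the per-cell work independent, which is the property that lets each cell be handled by its own GPU thread.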

  7. Execution Graph
     • Maximum global memory throughput achieved: 95 GB/s
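For context, the achieved throughput can be compared with the GTX 285's listed peak memory bandwidth of 159.0 GB/sec from slide 3. The ratio below is simple arithmetic on those two figures and says nothing about how the measurement was taken:

```latex
\[
\frac{\text{achieved throughput}}{\text{peak bandwidth}}
  = \frac{95\ \text{GB/s}}{159.0\ \text{GB/s}}
  \approx 0.60
\]
```

That is, roughly 60% of the theoretical peak.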

  8. Height Plot: Fish and Shark, 1300 generations with display

  9. Speedup against sequential CPU version
     • Average speedup = 878.91×
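The slide does not say how the average was computed. Under the usual definition, and assuming an arithmetic mean over m measured configurations (both the symbol m and the averaging method are assumptions here), the figure would be:

```latex
\[
S_i = \frac{T_{\text{CPU},i}}{T_{\text{GPU},i}},
\qquad
\bar{S} = \frac{1}{m} \sum_{i=1}^{m} S_i = 878.91
\]
```

Here T_CPU,i and T_GPU,i denote the sequential CPU and CUDA run times for the i-th configuration.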

  10. Matrix Multiplication on GPU Accelerated Cluster using MPJExpress
     • Algorithm
        • Use MPJExpress to distribute data
        • Call the cudaMatMultiply function
        • Allocate device memory
        • Execute kernel
        • Copy results back
        • Gather results at root
     • We used JCUDA (www.Jcuda.org), a Java binding for NVIDIA CUDA
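The distribute, compute, and gather pattern above can be illustrated with a single-process C++ sketch. It is not the team's MPJExpress/JCUDA code: the loop over `rank` stands in for the processes that MPJExpress would coordinate, the plain triple loop stands in for the GPU kernel behind `cudaMatMultiply`, and the row-block partitioning is an assumption, since the slide does not say how the data was split. The names `Matrix`, `multiplyRowBlock`, and `distributedMultiply` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix type: n x n, row-major.
using Matrix = std::vector<double>;

// Work of one process: multiply rows [rowBegin, rowEnd) of A by all of B.
// In the GPU version this is where device memory would be allocated, the
// kernel executed, and the block of C copied back.
void multiplyRowBlock(const Matrix& a, const Matrix& b, Matrix& c,
                      int n, int rowBegin, int rowEnd) {
    for (int i = rowBegin; i < rowEnd; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Simulates distributing row blocks of A over `workers` processes and
// gathering the result blocks into C at the root.
Matrix distributedMultiply(const Matrix& a, const Matrix& b,
                           int n, int workers) {
    assert(n >= 0 && workers > 0);
    assert(a.size() == static_cast<std::size_t>(n) * n);
    assert(b.size() == static_cast<std::size_t>(n) * n);

    Matrix c(a.size(), 0.0);
    int base = n / workers;    // rows every worker receives
    int extra = n % workers;   // first `extra` workers receive one more row
    int rowBegin = 0;
    for (int rank = 0; rank < workers; ++rank) {
        int rows = base + (rank < extra ? 1 : 0);
        multiplyRowBlock(a, b, c, n, rowBegin, rowBegin + rows);
        rowBegin += rows;      // the next worker's block starts here
    }
    return c;
}
```

Splitting A by rows means each worker needs all of B but only its own rows of A, and its output rows of C never overlap with another worker's, so the gather step is a simple concatenation.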
