Acceleration of Peruvian Anchovy Individual Based Model on a Single GPU

Project By: Kevin Demers. Advisors: Steve Cousins, Dr. Huijie Xue, Dr. Fei Chai.

Presentation Transcript


  1. Acceleration of Peruvian Anchovy Individual Based Model on a Single GPU • Project By: Kevin Demers • Advisors: Steve Cousins, Dr. Huijie Xue, Dr. Fei Chai

  2. Project Goal • The goal of this project was to convert an existing Fortran program to run on an NVIDIA 8800 GTX Graphics Processing Unit (GPU). • NVIDIA's CUDA parallel computing architecture was used. • New tools from The Portland Group allowed for direct execution of Fortran code on a CUDA-capable GPU. • The desired result was a program capable of working with expanded data sets without a substantial increase in program run time.

  3. The Original Program • Peruvian Anchovy Individual Based Model • No interaction between Anchovies • Written entirely in Fortran • 14784 Anchovies • Years modeled: 1991-2007

  4. Slide from Yi Xu Dissertation Defense

  5. Slide from Yi Xu Dissertation Defense

  6. Slide from Yi Xu Dissertation Defense

  7. Conversion Process – Updating • Original Model – Heavy use of global variables. • Outdated method of global variables in Fortran. • Program rewritten to use Modules and Derived Types • Modules necessary for CUDA use. • CUDA kernels (subroutines) can only use data from modules they are members of.
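
The move from global variables to modules and derived types could look roughly like the sketch below. It is written in standard Fortran, and every name in it (`fish_types`, `fish`, `school`, `init_school`, and the two fields) is hypothetical; the slides do not show the model's real data layout. Only the population size comes from the presentation.

```fortran
module fish_types
  implicit none
  private
  public :: fish, school, n_fish, init_school

  ! Hypothetical per-fish state; the real model's fields are not shown on the slides.
  type fish
    integer :: age_days
    real    :: weight
  end type fish

  integer, parameter :: n_fish = 14784   ! population size quoted on slide 3
  type(fish), allocatable :: school(:)   ! module data takes the place of old globals

contains

  ! Allocate the population and give every fish the same starting state.
  subroutine init_school(start_weight)
    real, intent(in) :: start_weight
    integer :: i
    if (.not. allocated(school)) allocate(school(n_fish))
    do i = 1, n_fish
      school(i)%age_days = 0
      school(i)%weight   = start_weight
    end do
  end subroutine init_school

end module fish_types
```

Any program unit that needs the population then writes `use fish_types` and calls `init_school` once, instead of relying on shared global storage.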

  8. Conversion Process - CUDA

      program main
        use module1
        implicit none
        integer :: a, b                     !local variables
        call subroutine1(a, b, c, d, e)     !call subroutine
      end program

      subroutine subroutine1(a, b, c, d, e)
        use module1, only: comp             !only the type; c, d, e arrive as arguments
        implicit none
        integer :: a, b, c, d               !Let subroutine know type of variables
        type(comp) :: e
        ! ... do some code ...
      end subroutine

      module module1
        type comp                           !Derived type
          integer :: n
          real :: r
        end type comp
        integer :: c, d
        type(comp) :: e
      end module module1

  9. Conversion Process - CUDA

      program main
        use cudafor
        use module1
        implicit none
        integer :: a, b
        integer, allocatable, device :: a_d, b_d   !separate device variables
        allocate(a_d, b_d, c_d, d_d, e_d)          !allocate variables on device
        a_d = a                                    !copy variable data to device
        b_d = b
        c_d = c
        d_d = d
        e_d = e
        !launch kernel: <<<grid size, block size>>>
        call subroutine1<<<dimensions,dimensions>>>(a_d, b_d, c_d, d_d, e_d)
        a = a_d                                    !copy variable data from device
        b = b_d
        c = c_d
        d = d_d
        e = e_d
        deallocate(a_d, b_d, c_d, d_d, e_d)        !deallocate VERY IMPORTANT
      end program

  10. Conversion Process - CUDA

      module module1
        use cudafor
        type comp
          sequence
          integer :: n
          real :: r
        end type comp
        integer :: c, d
        integer, device, allocatable :: c_d, d_d
        type(comp) :: e
        type(comp), device, allocatable :: e_d
      contains                                     !Subroutine MUST be in module now
        attributes(global) subroutine subroutine1(a_d, b_d, c_d, d_d, e_d)
          implicit none
          integer, device :: a_d, b_d, c_d, d_d    !Let subroutine know about device variables
          type(comp), device :: e_d
          integer :: idx
          idx = (blockidx%x-1)*blockdim%x + threadidx%x   !x coordinate of thread
        end subroutine
      end module module1
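
The index formula on this slide, together with the "one thread per fish" approach, implies some launch arithmetic that the slides leave as `dimensions`. The sketch below works it out on the host in standard Fortran. The module and function names are hypothetical, and the block size of 128 used in the note afterwards is an assumption chosen for illustration, not a value from the presentation.

```fortran
module launch_math
  implicit none
contains

  ! Number of blocks needed so that blocks * block_size >= n_items
  ! (integer ceiling division).
  pure integer function blocks_needed(n_items, block_size)
    integer, intent(in) :: n_items, block_size
    blocks_needed = (n_items + block_size - 1) / block_size
  end function blocks_needed

  ! Same formula as the slide's idx = (blockidx%x-1)*blockdim%x + threadidx%x,
  ! written as a host function with 1-based block and thread numbers.
  pure integer function global_index(block_id, block_size, thread_id)
    integer, intent(in) :: block_id, block_size, thread_id
    global_index = (block_id - 1) * block_size + thread_id
  end function global_index

end module launch_math
```

For the 14784 fish of the original model and an assumed block size of 128, `blocks_needed` gives 116 blocks, and the last thread of the last block gets index 14848. The 64 surplus threads have no fish to work on, so a kernel using this scheme would normally compare `idx` against the number of fish before touching any data.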

  11. Conversion Process - Identifying • Four subroutines = 60-80% of run time • The four subroutines loop sequentially over each fish in the model • Parallel version creates a thread for each fish • CPU code primarily handles File I/O
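
The 60-80% figure puts a ceiling on what moving those four subroutines to the GPU can achieve. The presentation does not state this bound; the relation below is the standard Amdahl's law estimate, with p the fraction of run time spent in the accelerated subroutines and s the speedup of that fraction (both symbols are introduced here for illustration).

```latex
\[
  S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
  \qquad
  \lim_{s \to \infty} S(s) = \frac{1}{1 - p}
\]
\[
  p = 0.6 \;\Rightarrow\; S_{\max} = \frac{1}{0.4} = 2.5,
  \qquad
  p = 0.8 \;\Rightarrow\; S_{\max} = \frac{1}{0.2} = 5
\]
```

Even if the four subroutines took no time at all on the GPU, the whole program could run at most about 2.5 to 5 times faster, because the remaining CPU work (mainly file I/O) is unchanged.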

  12. Conversion Process – Problems • CUDA/Fortran tools are relatively new • Some CUDA features are unsupported/broken • Debugging – Cryptic error messages • Profiling Tools would not work • GPROF style profiling is very inaccurate

  13. Conversion Process - Errors • NVIDIA Visual Profiler had to be used • Visual Profiler only profiles GPU code • Doesn’t provide detailed information • Runs the program 4 times

  14. Efficiency – Memory Transfers • Excessive memory transfers slow down programs • Program has thousands of memory transfers • Each memory transfer is large (10MB +) • Memory transfers were altered until as efficient as possible
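
A rough way to see why transfer count and size matter is to multiply them out. The sketch below is a back-of-the-envelope model in standard Fortran, not a measurement: the function name is hypothetical, and the bandwidth argument is a placeholder the reader must supply, since the slides give no bandwidth figure.

```fortran
module transfer_estimate
  implicit none
contains

  ! Total seconds spent copying: n_transfers * bytes_each / bytes_per_second.
  ! Ignores per-transfer latency, so it is an optimistic lower bound.
  pure function transfer_seconds(n_transfers, mb_each, gb_per_s) result(secs)
    integer, intent(in) :: n_transfers
    real, intent(in)    :: mb_each    ! size of one transfer in MB (10**6 bytes)
    real, intent(in)    :: gb_per_s   ! ASSUMED sustained bandwidth in GB/s (10**9 bytes/s)
    real :: secs
    secs = real(n_transfers) * mb_each * 1.0e6 / (gb_per_s * 1.0e9)
  end function transfer_seconds

end module transfer_estimate
```

As a purely illustrative case, 1000 transfers of 10 MB each at an assumed 2 GB/s come to 5 seconds of copying. Under this model, halving either the number of transfers or their size halves that time, which is the motivation for trimming transfers.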

  15. Results

  16. Results – Run Time

  17. Results

  18. Why Isn’t It Faster? • NVIDIA 8800 GTX has 128 parallel cores • Each group of 8 cores makes up a multiprocessor (16 multiprocessors in total) • Each multiprocessor executes threads in groups (warps) of 32 • Maximum efficiency is obtained when all threads in a group follow the same instruction path

  19. Why Isn’t It Faster? – Cont’d • Slow clock speed • Inefficient instructions • Divergent threads • GPU code contains a large amount of branches
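
The cost of divergent threads can be illustrated with a deliberately simplified lockstep model, written here in standard Fortran on the host. It assumes that a group of 32 threads must execute, one after another, every branch that at least one of its threads takes. The module name, function name, and branch costs are hypothetical, and real hardware behaviour is more involved than this.

```fortran
module divergence_model
  implicit none
  integer, parameter :: group_size = 32   ! threads that execute in lockstep

contains

  ! Simplified lockstep model: the group pays for every distinct branch that
  ! at least one of its threads takes, because the branches are serialized.
  pure integer function group_cost(branch_of_thread, cost_of_branch)
    integer, intent(in) :: branch_of_thread(group_size)  ! branch id per thread
    integer, intent(in) :: cost_of_branch(:)             ! steps per branch id
    integer :: b
    group_cost = 0
    do b = 1, size(cost_of_branch)
      if (any(branch_of_thread == b)) group_cost = group_cost + cost_of_branch(b)
    end do
  end function group_cost

end module divergence_model
```

With made-up branch costs of 10, 40 and 25 steps, a group whose threads all take branch 1 costs 10 steps, while a group whose threads are spread over all three branches costs 75 steps, even though no single thread does more than 40 steps of useful work.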

  20. Thread Comparison

  21. Total Runtime

  22. Future Direction • Rewrite code to group non-branching segments • Trade a single GPU kernel call for many small kernels without branches • Reduce data output to reduce memory transfer time • Dynamically decide what memory to transfer for each kernel call
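
One possible reading of "group non-branching segments" is to sort the fish by whatever condition the code branches on, then process each group in its own branch-free pass or kernel launch. The standard Fortran sketch below shows only the grouping step. The notion of an integer `stage` per fish and the name `indices_in_stage` are hypothetical; the slides do not say which conditions the branches depend on.

```fortran
module stage_grouping
  implicit none
contains

  ! Return the indices of all fish whose stage equals wanted_stage, so that a
  ! later pass (or kernel launch) can process them without branching on stage.
  pure function indices_in_stage(stage, wanted_stage) result(idx)
    integer, intent(in) :: stage(:)
    integer, intent(in) :: wanted_stage
    integer, allocatable :: idx(:)
    integer :: i
    idx = pack([(i, i = 1, size(stage))], stage == wanted_stage)
  end function indices_in_stage

end module stage_grouping
```

For a five-fish example with stages 1, 2, 1, 3, 2, the call for stage 1 returns indices 1 and 3, and the call for stage 2 returns indices 2 and 5. Each index list could then drive its own small pass in which every thread follows the same path. Whether the extra launches and index lists pay off would have to be measured.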

  23. Acknowledgements • Thanks to Steve Cousins, Yifeng Zhu, Bruce Segee, and all SuperME REU members. • Thanks to Robert England and Janice Gomm for all the food. • This research work is supported by the NSF fund CCF #0754951

  24. Questions? Comments? • I’m looking at you Robert…
