Acceleration of Peruvian Anchovy Individual Based Model on a Single GPU
Project By: Kevin Demers
Advisors: Steve Cousins, Dr. Huijie Xue, Dr. Fei Chai
Project Goal • The goal of this project was to convert an existing Fortran program to run on an NVIDIA 8800 GTX Graphics Processing Unit (GPU). • NVIDIA's CUDA parallel computing architecture was used. • New tools from The Portland Group allowed direct execution of Fortran code on a CUDA-capable GPU. • The desired result was a program capable of handling expanded data sets without a substantial increase in run time.
The Original Program • Peruvian Anchovy Individual Based Model • No interaction between anchovies • Written entirely in Fortran • 14,784 anchovies • Years modeled: 1991–2007
Conversion Process – Updating • Original model made heavy use of global variables. • The global variables used an outdated Fortran mechanism. • Program rewritten to use modules and derived types. • Modules are necessary for CUDA use. • CUDA kernels (subroutines) can only use module data if they are contained in that module.
Conversion Process – CUDA

    program main
      use module1
      implicit none
      integer :: a, b                    ! local variables
      call subroutine1(a, b, c, d, e)   ! call subroutine; c, d, e come from module1
    end program main

    subroutine subroutine1(a, b, c, d, e)
      use module1, only: comp           ! import only the type, to avoid clashing with module c, d, e
      implicit none
      integer :: a, b, c, d             ! let the subroutine know the types of its arguments
      type(comp) :: e
      ! ... some code ...
    end subroutine subroutine1

    module module1
      type comp                         ! derived type
        integer :: n
        real :: r
      end type comp
      integer :: c, d
      type(comp) :: e
    end module module1
Conversion Process – CUDA

    program main
      use cudafor
      use module1
      implicit none
      integer :: a, b
      integer, allocatable, device :: a_d, b_d   ! separate device variables

      allocate(a_d, b_d, c_d, d_d, e_d)          ! allocate variables on the device
      a_d = a                                    ! copy variable data to the device
      b_d = b
      c_d = c
      d_d = d
      e_d = e

      call subroutine1<<<grid, block>>>(a_d, b_d, c_d, d_d, e_d)   ! launch kernel with given dimensions

      a = a_d                                    ! copy variable data back from the device
      b = b_d
      c = c_d
      d = d_d
      e = e_d
      deallocate(a_d, b_d, c_d, d_d, e_d)        ! deallocate -- VERY IMPORTANT
    end program main
Conversion Process – CUDA

    module module1
      use cudafor
      type comp                                  ! derived type
        sequence
        integer :: n
        real :: r
      end type comp
      integer :: c, d
      integer, device, allocatable :: c_d, d_d
      type(comp) :: e
      type(comp), device, allocatable :: e_d

    contains   ! the subroutine MUST be inside a module now

      attributes(global) subroutine subroutine1(a_d, b_d, c_d, d_d, e_d)
        implicit none
        integer, device :: a_d, b_d, c_d, d_d    ! let the subroutine know about the device variables
        type(comp), device :: e_d
        integer :: idx
        idx = (blockidx%x - 1) * blockdim%x + threadidx%x   ! x coordinate of this thread
        ! ... per-thread code ...
      end subroutine subroutine1

    end module module1
Conversion Process – Identifying • Four subroutines account for 60–80% of run time • The four subroutines loop sequentially over each fish in the model • The parallel version creates one thread per fish • CPU code primarily handles file I/O
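The loop-to-thread mapping can be sketched as follows (subroutine and variable names here are illustrative, not taken from the model):

```fortran
! CPU version: one sequential loop over all fish
do i = 1, nfish                    ! nfish = 14784 in this model
   call update_fish(i)             ! per-fish work for fish i
end do

! GPU version: one thread per fish; the loop disappears
attributes(global) subroutine update_fish_kernel(nfish)
   integer, value :: nfish
   integer :: i
   i = (blockidx%x - 1) * blockdim%x + threadidx%x
   if (i <= nfish) then            ! guard: the last block may have spare threads
      ! same per-fish work as the loop body above
   end if
end subroutine update_fish_kernel
```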
Conversion Process – Problems • CUDA/Fortran tools are relatively new • Some CUDA features are unsupported or broken • Debugging – cryptic error messages • Profiling tools would not work • GPROF-style profiling is very inaccurate
Conversion Process – Errors • The NVIDIA Visual Profiler had to be used instead • The Visual Profiler only profiles GPU code • It does not provide detailed information • It runs the program 4 times
Efficiency – Memory Transfers • Excessive memory transfers slow down GPU programs • The program performs thousands of memory transfers • Each transfer is large (10 MB+) • Transfers were restructured and minimized until no further reduction was possible
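One way this cost can be reduced, sketched below with illustrative names, is to copy the large arrays to the device once, run every kernel against the resident copies, and transfer results back only when file output is due:

```fortran
real, device, allocatable :: state_d(:)      ! large (10 MB+) array kept resident on the GPU
allocate(state_d(n))
state_d = state                              ! single host-to-device copy

do step = 1, nsteps
   call update_kernel<<<grid, block>>>(state_d)   ! no transfers inside the loop
   if (mod(step, output_interval) == 0) then
      state = state_d                        ! copy back only when output must be written
      ! ... host-side file I/O ...
   end if
end do

deallocate(state_d)
```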
Why Isn’t It Faster? • The NVIDIA 8800 GTX has 128 parallel cores • Each group of 8 cores forms a multiprocessor • Each multiprocessor runs threads in groups (warps) of 32 • Maximum efficiency is reached when all 32 threads in a warp execute the same instruction
Why Isn’t It Faster? – Cont’d • Slow clock speed (relative to the CPU) • Inefficient instructions • Divergent threads • The GPU code contains a large number of branches
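Divergence, sketched: when threads in the same warp take different sides of a data-dependent branch, the hardware executes both paths serially with the inactive threads masked off (the `alive` flag here is illustrative):

```fortran
! Inside a kernel: a data-dependent branch
if (alive(i)) then
   ! path A: runs while threads with alive(i) = .false. sit idle
else
   ! path B: runs while threads with alive(i) = .true. sit idle
end if
! Worst case: the warp pays the cost of both paths
```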
Future Direction • Rewrite code to group non-branching segments • Trade a single GPU kernel call for many small kernels without branches • Reduce data output to reduce memory transfer time • Dynamically decide what memory to transfer for each kernel call
Acknowledgements • Thanks to Steve Cousins, Yifeng Zhu, Bruce Segee, and all SuperME REU members. • Thanks to Robert England and Janice Gomm for all the food. • This research work was supported by NSF grant CCF #0754951.
Questions? Comments? • I’m looking at you Robert…