Acceleration of Peruvian Anchovy Individual Based Model on a Single GPU
Project By: Kevin Demers
Advisors: Steve Cousins, Dr. Huijie Xue, Dr. Fei Chai
Project Goal • The goal of this project was to convert an existing Fortran program to run on an NVIDIA 8800 GTX Graphics Processing Unit (GPU). • NVIDIA's CUDA parallel computing architecture was used. • New tools from The Portland Group allowed direct execution of Fortran code on a CUDA-capable GPU. • The desired result was a program capable of handling expanded data sets without a substantial increase in run time.
The Original Program • Peruvian Anchovy Individual Based Model • No interaction between anchovies • Written entirely in Fortran • 14,784 anchovies • Years modeled: 1991–2007
Conversion Process – Updating • Original model made heavy use of global variables. • The global variables used an outdated Fortran mechanism. • Program rewritten to use modules and derived types. • Modules are necessary for CUDA use. • CUDA kernels (subroutines) can only use module data if they are contained in that module.
Conversion Process – CUDA

    program main
      use module1
      implicit none
      integer :: a, b                    ! local variables
      call subroutine1(a, b, c, d, e)   ! call subroutine; c, d, e come from module1
    end program main

    subroutine subroutine1(a, b, c, d, e)
      use module1, only: comp           ! import only the type, to avoid clashing with module c, d, e
      implicit none
      integer :: a, b, c, d             ! let the subroutine know the types of its arguments
      type(comp) :: e
      ! ... some code ...
    end subroutine subroutine1

    module module1
      type comp                         ! derived type
        integer :: n
        real :: r
      end type comp
      integer :: c, d
      type(comp) :: e
    end module module1
Conversion Process – CUDA

    program main
      use cudafor
      use module1
      implicit none
      integer :: a, b
      integer, allocatable, device :: a_d, b_d   ! separate device variables

      allocate(a_d, b_d, c_d, d_d, e_d)          ! allocate variables on the device
      a_d = a                                    ! copy variable data to the device
      b_d = b
      c_d = c
      d_d = d
      e_d = e

      call subroutine1<<<grid, block>>>(a_d, b_d, c_d, d_d, e_d)   ! launch kernel with given dimensions

      a = a_d                                    ! copy variable data back from the device
      b = b_d
      c = c_d
      d = d_d
      e = e_d
      deallocate(a_d, b_d, c_d, d_d, e_d)        ! deallocate -- VERY IMPORTANT
    end program main
Conversion Process – CUDA

    module module1
      use cudafor
      type comp                                  ! derived type
        sequence
        integer :: n
        real :: r
      end type comp
      integer :: c, d
      integer, device, allocatable :: c_d, d_d
      type(comp) :: e
      type(comp), device, allocatable :: e_d

    contains   ! the subroutine MUST be inside a module now

      attributes(global) subroutine subroutine1(a_d, b_d, c_d, d_d, e_d)
        implicit none
        integer, device :: a_d, b_d, c_d, d_d    ! let the subroutine know about the device variables
        type(comp), device :: e_d
        integer :: idx
        idx = (blockidx%x - 1) * blockdim%x + threadidx%x   ! x coordinate of this thread
        ! ... per-thread code ...
      end subroutine subroutine1

    end module module1
Conversion Process – Identifying • Four subroutines account for 60–80% of run time • The four subroutines loop sequentially over each fish in the model • The parallel version creates one thread per fish • CPU code primarily handles file I/O
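The loop-to-thread mapping can be sketched as follows (subroutine and variable names here are illustrative, not taken from the model):

```fortran
! CPU version: one sequential loop over all fish
do i = 1, nfish                    ! nfish = 14784 in this model
   call update_fish(i)             ! per-fish work for fish i
end do

! GPU version: one thread per fish; the loop disappears
attributes(global) subroutine update_fish_kernel(nfish)
   integer, value :: nfish
   integer :: i
   i = (blockidx%x - 1) * blockdim%x + threadidx%x
   if (i <= nfish) then            ! guard: the last block may have spare threads
      ! same per-fish work as the loop body above
   end if
end subroutine update_fish_kernel
```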
Conversion Process – Problems • CUDA/Fortran tools are relatively new • Some CUDA features are unsupported or broken • Debugging – cryptic error messages • Profiling tools would not work • GPROF-style profiling is very inaccurate
Conversion Process – Errors • The NVIDIA Visual Profiler had to be used instead • The Visual Profiler only profiles GPU code • It does not provide detailed information • It runs the program 4 times
Efficiency – Memory Transfers • Excessive memory transfers slow down GPU programs • The program performs thousands of memory transfers • Each transfer is large (10 MB+) • Transfers were restructured and minimized until no further reduction was possible
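One way this cost can be reduced, sketched below with illustrative names, is to copy the large arrays to the device once, run every kernel against the resident copies, and transfer results back only when file output is due:

```fortran
real, device, allocatable :: state_d(:)      ! large (10 MB+) array kept resident on the GPU
allocate(state_d(n))
state_d = state                              ! single host-to-device copy

do step = 1, nsteps
   call update_kernel<<<grid, block>>>(state_d)   ! no transfers inside the loop
   if (mod(step, output_interval) == 0) then
      state = state_d                        ! copy back only when output must be written
      ! ... host-side file I/O ...
   end if
end do

deallocate(state_d)
```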
Why Isn’t It Faster? • The NVIDIA 8800 GTX has 128 parallel cores • Each group of 8 cores forms a multiprocessor • Each multiprocessor runs threads in groups (warps) of 32 • Maximum efficiency is reached when all 32 threads in a warp execute the same instruction
Why Isn’t It Faster? – Cont’d • Slow clock speed (relative to the CPU) • Inefficient instructions • Divergent threads • The GPU code contains a large number of branches
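Divergence, sketched: when threads in the same warp take different sides of a data-dependent branch, the hardware executes both paths serially with the inactive threads masked off (the `alive` flag here is illustrative):

```fortran
! Inside a kernel: a data-dependent branch
if (alive(i)) then
   ! path A: runs while threads with alive(i) = .false. sit idle
else
   ! path B: runs while threads with alive(i) = .true. sit idle
end if
! Worst case: the warp pays the cost of both paths
```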
Future Direction • Rewrite code to group non-branching segments • Trade a single GPU kernel call for many small kernels without branches • Reduce data output to reduce memory transfer time • Dynamically decide what memory to transfer for each kernel call
Acknowledgements • Thanks to Steve Cousins, Yifeng Zhu, Bruce Segee, and all SuperME REU members. • Thanks to Robert England and Janice Gomm for all the food. • This research work was supported by NSF grant CCF #0754951.
Questions? Comments? • I’m looking at you Robert…