Killdevil: Running CUDA programs on the cluster
Requesting permission • https://onyen.unc.edu/cgi-bin/unc_id/services
Compiling CUDA programs • module load cuda • Run script: compile.sh • nvcc -o MatrixMul -I/usr/local/cuda/include/ -L/usr/local/lib64 -L/usr/local/cuda/lib64 MatrixMul.cu
Running CUDA programs • ssh killdevil.unc.edu • module load cuda • Run script: submitjob.sh • bsub -q gpu -a gpuexcl_t -n 1 -o MYGPUJOB.o%J <myprogramscript>
CUDA SDK • https://developer.nvidia.com/cuda-downloads • Download the SDK for your OS • Windows: requires Visual Studio to compile the samples • Linux: requires gcc
Recap • A kernel program is executed by a grid of threads
Thread Organization • Threads are organized in a two-level hierarchy • A grid is composed of blocks • gridDim: number of blocks in the grid • A block is composed of threads • blockDim: number of threads in the block • Each block gets a unique ID: blockIdx • Each thread gets an ID that is unique within its block: threadIdx
Thread Organization • Every block has the same number of threads • blockDim.x, blockDim.y, blockDim.z • threadIdx is always local to the block
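To make the built-in variables concrete, here is a minimal sketch in which every thread records the values of blockIdx.x, threadIdx.x, blockDim.x and gridDim.x that it sees. The names recordIds and recordOnDevice, and the launch of 2 blocks of 4 threads, are illustrative assumptions and do not come from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread stores the built-in variables it observes:
// (blockIdx.x, threadIdx.x, blockDim.x, gridDim.x).
__global__ void recordIds(int4 *out)
{
    int slot = blockIdx.x * blockDim.x + threadIdx.x;  // one slot per thread
    out[slot] = make_int4(blockIdx.x, threadIdx.x, blockDim.x, gridDim.x);
}

// Hypothetical host helper: launches the kernel and copies the results back.
// hOut must have room for numBlocks * threadsPerBlock entries.
bool recordOnDevice(int4 *hOut, int numBlocks, int threadsPerBlock)
{
    size_t bytes = (size_t)numBlocks * threadsPerBlock * sizeof(int4);
    int4 *dOut = NULL;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return false;

    recordIds<<<numBlocks, threadsPerBlock>>>(dOut);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);
    return ok;
}

int main()
{
    const int numBlocks = 2, threadsPerBlock = 4;  // illustrative sizes
    int4 ids[numBlocks * threadsPerBlock];

    if (!recordOnDevice(ids, numBlocks, threadsPerBlock)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < numBlocks * threadsPerBlock; ++i)
        printf("blockIdx.x=%d threadIdx.x=%d blockDim.x=%d gridDim.x=%d\n",
               ids[i].x, ids[i].y, ids[i].z, ids[i].w);
    return 0;
}
```

In the output, threadIdx.x restarts from 0 in every block, while blockDim.x and gridDim.x are the same for all threads.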
1D Example • Grid = 128 blocks • Block = 32 threads • blockDim.x in the kernel returns 32 • Total threads = 128 x 32 = 4096 • Each thread has a unique global ID: blockIdx.x * blockDim.x + threadIdx.x
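The 1D configuration above could be used for a vector sum along the following lines. This is a sketch under stated assumptions: the kernel name vecAdd, the helper addOnDevice and the input values are hypothetical, and the helper expects at most 128 x 32 = 4096 elements.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_BLOCKS        128
#define THREADS_PER_BLOCK 32
#define N (NUM_BLOCKS * THREADS_PER_BLOCK)  // 4096 threads in total

// One thread per element; the global ID selects the element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard in case n is smaller than the grid
        c[i] = a[i] + b[i];
}

// Hypothetical host helper; n must not exceed N.
bool addOnDevice(const float *hA, const float *hB, float *hC, int n)
{
    size_t bytes = n * sizeof(float);
    float *dA = NULL, *dB = NULL, *dC = NULL;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        vecAdd<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);  // freeing NULL is a no-op
    return ok;
}

int main()
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    if (!addOnDevice(a, b, c, N)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("c[0] = %.0f, c[%d] = %.0f\n", c[0], N - 1, c[N - 1]);
    return 0;
}
```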
Things to Note • Blocks are organized as 3D arrays of threads • 1D, 2D or 3D depending on your problem • Vector sum: 1D; matrix multiplication: 2D • All blocks in a grid have the same dimensions • i.e., all blocks have an equal number of threads in each dimension • The total size of a block is limited to 512 threads on compute capability 1.x devices (1024 on 2.x and later) • With a 512-thread limit, blockDim can be (512, 1, 1), (8, 16, 2) or (16, 16, 2) • But not (32, 32, 1) • Total threads: 32 x 32 x 1 = 1024, which exceeds 512
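Since the block-size limit depends on the device, it can be queried at run time instead of being hard-coded. The sketch below reads the limits of device 0 with cudaGetDeviceProperties and tests the block shapes listed above. The helper blockFits is a hypothetical name introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does block shape b respect the device limits in p?
bool blockFits(const cudaDeviceProp &p, dim3 b)
{
    return b.x <= (unsigned)p.maxThreadsDim[0] &&
           b.y <= (unsigned)p.maxThreadsDim[1] &&
           b.z <= (unsigned)p.maxThreadsDim[2] &&
           b.x * b.y * b.z <= (unsigned)p.maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 assumed
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("%s: maxThreadsPerBlock = %d, maxThreadsDim = (%d, %d, %d)\n",
           prop.name, prop.maxThreadsPerBlock,
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    // The block shapes discussed on the slide.
    dim3 shapes[] = { dim3(512, 1, 1), dim3(8, 16, 2),
                      dim3(16, 16, 2), dim3(32, 32, 1) };
    for (int i = 0; i < 4; ++i)
        printf("(%u, %u, %u): %s\n", shapes[i].x, shapes[i].y, shapes[i].z,
               blockFits(prop, shapes[i]) ? "fits" : "too large");
    return 0;
}
```

On a device that allows 1024 threads per block, (32, 32, 1) is reported as fitting. On a device with a 512-thread limit, it is reported as too large.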
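Using blockIdx and threadIdx • [Figure: a width x width grid of elements labeled (x, y), running from (0, 0), (1, 0), (2, 0), ..., (width-1, 0) in the first row to (0, width-1), ..., (width-1, width-1) in the last row]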
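One common way to obtain such (x, y) coordinates for a width x width problem is to combine blockIdx and threadIdx in both dimensions. The sketch below assumes a 2D launch with 4 x 4 blocks and width = 8. The names fillCoords and fillOnDevice, and those sizes, are illustrative and not taken from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its (x, y) position in the width x width grid
// and stores it at the matching row-major location.
__global__ void fillCoords(int2 *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < width)                     // skip threads past the edge
        out[y * width + x] = make_int2(x, y);
}

// Hypothetical host helper; hOut must hold width * width entries.
bool fillOnDevice(int2 *hOut, int width)
{
    size_t bytes = (size_t)width * width * sizeof(int2);
    int2 *dOut = NULL;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return false;

    dim3 block(4, 4);                               // 16 threads per block
    dim3 grid((width + block.x - 1) / block.x,      // round up so the whole
              (width + block.y - 1) / block.y);     // grid is covered
    fillCoords<<<grid, block>>>(dOut, width);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);
    return ok;
}

int main()
{
    const int width = 8;                            // illustrative size
    int2 coords[width * width];

    if (!fillOnDevice(coords, width)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    // Print the four corners of the grid.
    int corners[] = { 0, width - 1, (width - 1) * width, width * width - 1 };
    for (int i = 0; i < 4; ++i)
        printf("(%d, %d)\n", coords[corners[i]].x, coords[corners[i]].y);
    return 0;
}
```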