Lecture 6: Shared-memory Computing with GPU


Presentation Transcript


  1. Lecture 6: Shared-memory Computing with GPU

  2. START: download NVIDIA CUDA
  Free download of NVIDIA CUDA: https://developer.nvidia.com/cuda-downloads
  CUDA programming on Visual Studio 2010

  3. START: Matrix Addition

  #include "cuda_runtime.h"
  #include "device_launch_parameters.h"
  #include <stdio.h>
  #include <stdlib.h>

  const int N = 1024;
  const int blocksize = 16;

  __global__ void add_matrix(float *a, float *b, float *c, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      int index = i + j * N;
      if (i < N && j < N)
          c[index] = a[index] + b[index];
  }

  int main()
  {
      float *a = new float[N * N];
      float *b = new float[N * N];
      float *c = new float[N * N];
      for (int i = 0; i < N * N; ++i) { a[i] = 1.0f; b[i] = 3.5f; }

      float *ad, *bd, *cd;
      const int size = N * N * sizeof(float);
      cudaMalloc((void**)&ad, size);
      cudaMalloc((void**)&bd, size);
      cudaMalloc((void**)&cd, size);
      cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

      dim3 dimBlock(blocksize, blocksize);
      dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
      add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);
      cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

      for (int i = 0; i < N; i++) {
          for (int j = 0; j < N; j++)
              printf("%f ", c[i * N + j]);  // was c[i,j], which is not 2-D indexing in C
          printf("\n");
      }

      cudaFree(ad); cudaFree(bd); cudaFree(cd);
      delete[] a;
      delete[] b;
      delete[] c;
      return EXIT_SUCCESS;
  }

  (Figure: thread (i, j) indexing into global memory; blocks of dimBlock.x × dimBlock.y threads tile the width × height matrix.)

  4. Memory Allocation Example

  5. Memory Allocation Example
  (Figure: element (xIdx, yIdy) of a width × height matrix, indexed by threadIdx.x and threadIdx.y within a dimBlock.x × dimBlock.y block.)

  6. Memory Allocation Example

  7. Memory Allocation Example

  8. Memory Allocation Example: transpose via shared memory
  (Figure: a width × height matrix in global memory, partitioned into blocks (xBlock, yBlock); one block's tile in shared memory is read at (threadIdx.x, threadIdx.y) and written at the transposed address (threadIdx.y, threadIdx.x).)
  (1) Read from global memory & write to the block's shared memory
  (2) Transposed address
  (3) Read from the shared memory & write to global memory

  9. Memory Allocation Example
  (Figure: the same transpose shown with both global-memory arrays: element (X, Y) of block (xBlock, yBlock) in the source ends up at element (y, x) of the transposed block in the destination; steps (1)–(3) as on the previous slide.)
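  The three steps on slides 8–9 can be sketched as a tiled transpose kernel. This is a minimal sketch, not code from the lecture: the tile size BLOCK = 16 (matching blocksize above) and the parameter names are assumptions.

  ```cuda
  #include <cuda_runtime.h>

  const int BLOCK = 16;  // assumed tile size, matching blocksize above

  // Transpose a width x height row-major matrix, one BLOCK x BLOCK tile
  // per thread block, staging the tile in shared memory.
  __global__ void transpose(const float *in, float *out, int width, int height)
  {
      __shared__ float tile[BLOCK][BLOCK];

      int x = blockIdx.x * BLOCK + threadIdx.x;
      int y = blockIdx.y * BLOCK + threadIdx.y;

      // (1) read from global memory & write to the block's shared memory
      if (x < width && y < height)
          tile[threadIdx.y][threadIdx.x] = in[y * width + x];

      __syncthreads();  // all loads must finish before any thread reads the tile

      // (2) transposed address: swap the block coordinates
      x = blockIdx.y * BLOCK + threadIdx.x;
      y = blockIdx.x * BLOCK + threadIdx.y;

      // (3) read from shared memory & write to global memory,
      //     swapping the thread coordinates inside the tile
      if (x < height && y < width)
          out[y * height + x] = tile[threadIdx.x][threadIdx.y];
  }
  ```

  Both the global-memory read in step (1) and the write in step (3) touch consecutive addresses across threadIdx.x, so they coalesce; the index swap happens only inside fast shared memory.
  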

  10. Exercise
  (1) Compile and execute the Matrix Addition program.
  (2) Write a complete version of the program for the Memory Allocation example.
  (3) Write a program to calculate π, where the number of intervals = .
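  For exercise (3), one common approach is the midpoint rule on ∫₀¹ 4/(1+x²) dx = π. The sketch below is an assumed solution, not from the lecture: the interval count N is a placeholder (the exercise leaves it unspecified), and atomicAdd on double requires compute capability 6.0 or newer.

  ```cuda
  #include <cuda_runtime.h>
  #include <stdio.h>

  const int N = 1 << 20;    // assumed number of intervals (placeholder)
  const int THREADS = 256;

  // Each thread adds the area of one midpoint-rule rectangle to *sum.
  __global__ void pi_kernel(double *sum, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          double x = (i + 0.5) / n;              // midpoint of interval i
          atomicAdd(sum, 4.0 / (1.0 + x * x) / n);
      }
  }

  int main()
  {
      double *d_sum, h_sum = 0.0;
      cudaMalloc((void**)&d_sum, sizeof(double));
      cudaMemcpy(d_sum, &h_sum, sizeof(double), cudaMemcpyHostToDevice);

      pi_kernel<<<(N + THREADS - 1) / THREADS, THREADS>>>(d_sum, N);

      cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
      cudaFree(d_sum);
      printf("pi ~ %.10f\n", h_sum);
      return 0;
  }
  ```

  A per-block reduction in shared memory, with one atomicAdd per block instead of one per thread, would fit the lecture's shared-memory theme and scale better; the atomic-per-thread version above is just the shortest correct sketch.
  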
