Lecture 6: Shared-memory Computing with GPU


Presentation Transcript


  1. Lecture 6: Shared-memory Computing with GPU

  2. START: download NVIDIA CUDA
  Free download of NVIDIA CUDA: https://developer.nvidia.com/cuda-downloads
  CUDA programming on Visual Studio 2010

  3. START: Matrix Addition

  #include "cuda_runtime.h"
  #include "device_launch_parameters.h"
  #include <stdio.h>
  #include <stdlib.h>

  const int N = 1024;
  const int blocksize = 16;

  __global__ void add_matrix(float *a, float *b, float *c, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      int index = i + j * N;
      if (i < N && j < N)
          c[index] = a[index] + b[index];
  }

  int main()
  {
      float *a = new float[N * N];
      float *b = new float[N * N];
      float *c = new float[N * N];
      for (int i = 0; i < N * N; ++i) { a[i] = 1.0f; b[i] = 3.5f; }

      float *ad, *bd, *cd;
      const int size = N * N * sizeof(float);
      cudaMalloc((void**)&ad, size);
      cudaMalloc((void**)&bd, size);
      cudaMalloc((void**)&cd, size);
      cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

      dim3 dimBlock(blocksize, blocksize);
      dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
      add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);
      cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

      for (int i = 0; i < N; i++) {
          for (int j = 0; j < N; j++)
              printf("%f ", c[i * N + j]);  // was c[i,j], which is not 2-D indexing in C
          printf("\n");
      }

      cudaFree(ad); cudaFree(bd); cudaFree(cd);
      delete[] a;
      delete[] b;
      delete[] c;
      return EXIT_SUCCESS;
  }

  (Figure: thread (i, j) indexing into global memory; blocks of dimBlock.x × dimBlock.y threads tile the width × height matrix.)

  4. Memory Allocation Example

  5. Memory Allocation Example
  (Figure: element (xIdx, yIdy) of a width × height matrix, indexed by threadIdx.x and threadIdx.y within a dimBlock.x × dimBlock.y block.)

  6. Memory Allocation Example

  7. Memory Allocation Example

  8. Memory Allocation Example: transpose via shared memory
  (Figure: a width × height matrix in global memory, partitioned into blocks (xBlock, yBlock); one block's tile in shared memory is read at (threadIdx.x, threadIdx.y) and written at the transposed address (threadIdx.y, threadIdx.x).)
  (1) Read from global memory & write to the block's shared memory
  (2) Transposed address
  (3) Read from the shared memory & write to global memory

  9. Memory Allocation Example
  (Figure: the same transpose shown with both global-memory arrays: element (X, Y) of block (xBlock, yBlock) in the source ends up at element (y, x) of the transposed block in the destination; steps (1)–(3) as on the previous slide.)
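  The three steps on slides 8–9 can be sketched as a tiled transpose kernel. This is a minimal sketch, not code from the lecture: the tile size BLOCK = 16 (matching blocksize above) and the parameter names are assumptions.

  ```cuda
  #include <cuda_runtime.h>

  const int BLOCK = 16;  // assumed tile size, matching blocksize above

  // Transpose a width x height row-major matrix, one BLOCK x BLOCK tile
  // per thread block, staging the tile in shared memory.
  __global__ void transpose(const float *in, float *out, int width, int height)
  {
      __shared__ float tile[BLOCK][BLOCK];

      int x = blockIdx.x * BLOCK + threadIdx.x;
      int y = blockIdx.y * BLOCK + threadIdx.y;

      // (1) read from global memory & write to the block's shared memory
      if (x < width && y < height)
          tile[threadIdx.y][threadIdx.x] = in[y * width + x];

      __syncthreads();  // all loads must finish before any thread reads the tile

      // (2) transposed address: swap the block coordinates
      x = blockIdx.y * BLOCK + threadIdx.x;
      y = blockIdx.x * BLOCK + threadIdx.y;

      // (3) read from shared memory & write to global memory,
      //     swapping the thread coordinates inside the tile
      if (x < height && y < width)
          out[y * height + x] = tile[threadIdx.x][threadIdx.y];
  }
  ```

  Both the global-memory read in step (1) and the write in step (3) touch consecutive addresses across threadIdx.x, so they coalesce; the index swap happens only inside fast shared memory.
  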

  10. Exercise
  (1) Compile and execute the Matrix Addition program.
  (2) Write a complete version of the program for the Memory Allocation example.
  (3) Write a program to calculate π, where the number of intervals = .
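  For exercise (3), one common approach is the midpoint rule on ∫₀¹ 4/(1+x²) dx = π. The sketch below is an assumed solution, not from the lecture: the interval count N is a placeholder (the exercise leaves it unspecified), and atomicAdd on double requires compute capability 6.0 or newer.

  ```cuda
  #include <cuda_runtime.h>
  #include <stdio.h>

  const int N = 1 << 20;    // assumed number of intervals (placeholder)
  const int THREADS = 256;

  // Each thread adds the area of one midpoint-rule rectangle to *sum.
  __global__ void pi_kernel(double *sum, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          double x = (i + 0.5) / n;              // midpoint of interval i
          atomicAdd(sum, 4.0 / (1.0 + x * x) / n);
      }
  }

  int main()
  {
      double *d_sum, h_sum = 0.0;
      cudaMalloc((void**)&d_sum, sizeof(double));
      cudaMemcpy(d_sum, &h_sum, sizeof(double), cudaMemcpyHostToDevice);

      pi_kernel<<<(N + THREADS - 1) / THREADS, THREADS>>>(d_sum, N);

      cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
      cudaFree(d_sum);
      printf("pi ~ %.10f\n", h_sum);
      return 0;
  }
  ```

  A per-block reduction in shared memory, with one atomicAdd per block instead of one per thread, would fit the lecture's shared-memory theme and scale better; the atomic-per-thread version above is just the shortest correct sketch.
  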
