Lab Assignment #2: Data Parallel Reduction
Farhad Parsan
Data Parallel Reduction • Sum reduction kernel (with thread divergence)
Data Parallel Reduction • Sum reduction kernel (without thread divergence)
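The difference between the two kernels can be illustrated on the host by counting warps that hold both active and idle threads. The sketch below is not part of the lab code: `countDivergentWarps` and `kWarpSize` are hypothetical names, and it assumes a warp size of 32, one thread per element, an interleaved pattern (`t % (2 * stride) == 0`) for the kernel with divergence, and a contiguous pattern (`t < stride`) for the kernel without it.

```cpp
#include <cassert>

// Hypothetical helper, not part of the lab code: counts, over a whole
// reduction, the warps that contain both active and idle threads.
const int kWarpSize = 32;  // assumed warp size

// interleaved == true : thread t is active when t % (2 * stride) == 0,
//                       stride = 1, 2, 4, ... (kernel with divergence)
// interleaved == false: thread t is active when t < stride,
//                       stride = n/2, n/4, ..., 1 (kernel without divergence)
int countDivergentWarps(int n, bool interleaved) {
    assert(n >= kWarpSize && (n & (n - 1)) == 0);  // power of two, >= one warp
    int divergent = 0;
    for (int step = 1; step < n; step *= 2) {
        int stride = interleaved ? step : n / (2 * step);
        for (int warp = 0; warp < n / kWarpSize; ++warp) {
            int active = 0;
            for (int lane = 0; lane < kWarpSize; ++lane) {
                int t = warp * kWarpSize + lane;
                bool isActive = interleaved ? (t % (2 * stride) == 0)
                                            : (t < stride);
                if (isActive) ++active;
            }
            // A warp diverges when only some of its lanes take the branch.
            if (active > 0 && active < kWarpSize) ++divergent;
        }
    }
    return divergent;
}
```

For n = 512, `countDivergentWarps(512, true)` returns 95 and `countDivergentWarps(512, false)` returns 5: with the contiguous pattern, divergence appears only once fewer than 32 threads remain active.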
Host Code

float computeOnDevice(float* h_data, int num_elements)
{
    // Assumes num_elements is a power of two no larger than NUM_ELEMENTS (512),
    // because the whole array is reduced by a single thread block.
    int size = num_elements * sizeof(float);
    float result;
    float* hd_data;

    // 1. Allocate and Load
    cudaMalloc((void**) &hd_data, size);
    cudaMemcpy(hd_data, h_data, size, cudaMemcpyHostToDevice);

    // 2. Kernel invocation code (one block, one thread per element)
    dim3 dimBlock(num_elements, 1);
    dim3 dimGrid(1, 1);
    reduction<<<dimGrid, dimBlock>>>(hd_data);

    // 3. Store result (the sum ends up in element 0)
    cudaMemcpy(h_data, hd_data, size, cudaMemcpyDeviceToHost);
    result = h_data[0];

    // Free device memory
    cudaFree(hd_data);
    return result;
}
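Device Code

#define NUM_ELEMENTS 512

__global__ void reduction(float *hd_data)
{
    __shared__ float partialSum[NUM_ELEMENTS];

    // Each thread loads one element into shared memory
    unsigned int t = threadIdx.x;
    partialSum[t] = hd_data[t];

    // Start at half the block size so partialSum[t + stride] stays in bounds,
    // halve the stride each iteration, and stop after stride == 1
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Element 0 now holds the total; the host reads hd_data[0]
    hd_data[t] = partialSum[t];
}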
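A host-side emulation can help check what a tree reduction of this shape computes before running it on the device. The following is a minimal sketch, not the lab's reference code: `emulateReduction` is a hypothetical name, and it assumes the stride starts at half the array length and halves down to 1, with a power-of-two length. Running the threads of one iteration sequentially is valid here because each thread writes only `partialSum[t]` and reads an element that no thread writes in that iteration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU emulation of a single-block tree reduction.
// The outer loop mirrors the kernel's stride loop; the inner loop plays the
// role of the threads that satisfy t < stride in that iteration.
float emulateReduction(std::vector<float> partialSum) {
    const unsigned int n = static_cast<unsigned int>(partialSum.size());
    assert(n > 0 && (n & (n - 1)) == 0);  // assumed: power-of-two length
    for (unsigned int stride = n >> 1; stride > 0; stride >>= 1) {
        for (unsigned int t = 0; t < stride; ++t) {
            partialSum[t] += partialSum[t + stride];
        }
    }
    return partialSum[0];  // the total accumulates in element 0
}
```

With 512 elements all equal to 1.0f, the function returns 512.0f. Small integer inputs keep every partial sum exact in single precision, which makes the comparison against a device result unambiguous.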
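Question
• How many times does your thread block synchronize to reduce the array of 512 elements to a single value?
Number of synchronizations = number of reduction iterations = log2 N, with one __syncthreads() per iteration.
With N = 512 elements and one thread per element, the active range halves each iteration (256, 128, …, 1), so the number of synchronizations = log2 512 = 9.
If instead N/2 threads each add two elements while loading, the count drops to log2 N − 1 = 8.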
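The count can be checked by walking the stride loop on the host. This is a small sketch under stated assumptions: `countSyncs` is a hypothetical helper, the loop runs from half the block size down to 1, and there is one barrier per iteration.

```cpp
#include <cassert>

// Hypothetical helper: number of stride-loop iterations, and therefore of
// __syncthreads() calls, for a power-of-two block size.
int countSyncs(unsigned int blockDimX) {
    assert(blockDimX > 0 && (blockDimX & (blockDimX - 1)) == 0);
    int syncs = 0;
    for (unsigned int stride = blockDimX >> 1; stride > 0; stride >>= 1) {
        ++syncs;  // one barrier per iteration
    }
    return syncs;
}
```

`countSyncs(512)` returns 9, the case of one thread per element. `countSyncs(256)` returns 8, the case in which 256 threads each add two elements while loading and the loop then covers only 256 partial sums.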
Question
• What is the minimum, maximum, and average number of "real" operations that a thread will perform? "Real" operations are those that directly contribute to the final reduction value.
Maximum: thread 0 = log2 N; if N = 512 => Maximum = 9
Minimum: 0, for threads that never pass the if test (odd-numbered threads in the kernel with divergence; threads N/2 to N − 1 in the kernel without divergence)
Average: [ 1 + 2 + 4 + … + (N/2) ] / N = (N − 1) / N; if N = 512 => Average = 511/512 ≈ 0.998 ≈ 1
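These figures can be reproduced by tallying additions per thread on the host. The sketch below assumes one thread per element, the contiguous pattern `t < stride`, and that only additions count as "real" operations; `OpStats` and `countAdditions` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of additions performed per thread.
struct OpStats {
    int minOps;
    int maxOps;
    double avgOps;
};

OpStats countAdditions(int n) {
    assert(n > 1 && (n & (n - 1)) == 0);  // assumed: power-of-two length
    std::vector<int> ops(n, 0);
    for (int stride = n / 2; stride > 0; stride /= 2) {
        for (int t = 0; t < stride; ++t) {
            ++ops[t];  // thread t performs one addition in this iteration
        }
    }
    long total = 0;
    for (int c : ops) total += c;
    return OpStats{*std::min_element(ops.begin(), ops.end()),
                   *std::max_element(ops.begin(), ops.end()),
                   static_cast<double>(total) / n};
}
```

For n = 512 the result is a minimum of 0, a maximum of 9 (thread 0), and an average of 511/512, since a reduction of N values always needs exactly N − 1 additions in total.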