Micro Benchmarking CUDA - Final Project Presentation
Steven Sparks, Kibeom Kim
Micro Benchmarking - Objectives
• Measure the texture cache parameters
  • Latency
  • Capacity
  • Associativity
• Measure the memory bandwidth
  • Texture memory - global and cache
  • Normal global memory
  • Shared memory
Measuring Cache Size (9600M GT)
[chart: kernel time (ms) vs. n (number of floats accessed)]
Texture Cache
• "Texture caches are used to save bandwidth and power only - in contrast to CPU caches which are also essential for lowering latency"
  http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=10
• "For example, some programs can exceed the maximum theoretical memory bandwidth of the underlying global memory through judicious use of the texture memory cache. While the latency of texture cache reference is generally the same as DRAM"
  http://www.ddj.com/architect/218100902
Compact Maximum Count
[chart: maximum compact count (320, 160, 80, 40, 20) vs. stride]
L1 cache of the 9600M GT
• Size: 640 * sizeof(float) = 2.5 KB
• Associativity: 20
• Associative offset: 32 * sizeof(float) = 128 bytes
• Block size: Size / 80 = 32 bytes
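• Consistency check on these numbers: 2560 B / 32 B per block = 80 blocks; 80 blocks / 20 ways = 4 sets; 4 sets * 32 B = 128 B, i.e. addresses 128 bytes apart map to the same set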
Cache 2D Locality and Associativity
• power of 2 x power of 2
• 4 byte (RGBA) x power of 2
Measuring Total Access Time

__global__ void CheckSet(int stride, int num, int N)
{
    float temp1;
    __shared__ float temp2;
    int j0 = stride * (num - 1);
    // count: outer repetition constant, assumed to be #define'd elsewhere
    for (int i = count; i >= 0; --i)
    {
        // sweep the tested addresses at the given stride through the texture cache
        for (int index = j0; index >= 0; index -= stride)
        {
            temp1 = tex1Dfetch(texRef, index);
        }
    }
    temp2 = temp1;  // shared-memory write keeps the compiler from removing the reads
}
Measuring Loop Overhead Time

__global__ void CheckSet(int stride, int num, int N)
{
    float temp1, temp3 = 1.0f;
    __shared__ float temp2;
    int j0 = stride * (num - 1);
    for (int i = count; i >= 0; --i)
    {
        // same loop structure, but a register copy replaces the texture fetch
        for (int index = j0; index >= 0; index -= stride)
        {
            temp1 = temp3;
        }
    }
    temp2 = temp1;
}
Measuring Exact Access Time
• Texture Access Function - Register Access Function
  = (Loop + Texture Access) - (Loop + Register Access)
  = Texture Access - Register Access
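A minimal host-side sketch of this subtraction (not from the slides). It assumes the texture-access kernel above is compiled as CheckSet, the register-access version is renamed CheckSetOverhead (hypothetical name, since both slides use the same name), texRef is bound to a linear device buffer, and a single thread is launched so the loops run serially:

texture<float, 1, cudaReadModeElementType> texRef;   // assumed file-scope declaration

float measure_access_time_ms(int stride, int num, int N)
{
    // buffer large enough for the highest index the kernel touches
    size_t bytes = (size_t)(stride * (num - 1) + 1) * sizeof(float);
    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);
    cudaBindTexture(NULL, texRef, d_data, bytes);    // tex1Dfetch reads linear memory

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float texture_ms, register_ms;

    // loop + texture access
    cudaEventRecord(start);
    CheckSet<<<1, 1>>>(stride, num, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&texture_ms, start, stop);

    // loop + register access (the overhead kernel)
    cudaEventRecord(start);
    CheckSetOverhead<<<1, 1>>>(stride, num, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&register_ms, start, stop);

    cudaUnbindTexture(texRef);
    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return texture_ms - register_ms;   // texture access time only
}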
Binary Search Example
stride : 32. num : 256. time : 45.483139 ms. It is not Compact
stride : 32. num : 128. time : 45.382069 ms. It is not Compact
stride : 32. num : 64. time : 45.726257 ms. It is not Compact
stride : 32. num : 32. time : 45.369858 ms. It is not Compact
stride : 32. num : 16. time : 33.000290 ms. It is Compact
stride : 32. num : 24. time : 45.356770 ms. It is not Compact
stride : 32. num : 20. time : 33.172470 ms. It is Compact
stride : 32. num : 22. time : 45.362503 ms. It is not Compact
stride : 32. num : 21. time : 45.405190 ms. It is not Compact
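A sketch of the binary search this trace suggests (assumed form, not the authors' exact code): given a predicate is_compact(stride, num), it narrows the interval between a known-compact and a known-non-compact num. On the trace above (lo = 16, hi = 32) it probes 24, 20, 22, 21 and returns 21.

// assumed helper: times one (stride, num) pair and compares against a threshold
bool is_compact(int stride, int num);

// lo is known compact, hi is known non-compact; returns the smallest non-compact num
int first_non_compact(int stride, int lo, int hi)
{
    while (hi - lo > 1)
    {
        int mid = (lo + hi) / 2;
        if (is_compact(stride, mid))
            lo = mid;    // still fits in the cache: look higher
        else
            hi = mid;    // already conflicting: look lower
    }
    return hi;
}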
Parallelization - Design Considerations
• A texture cache exists on each multiprocessor
• Test each set on a different multiprocessor
• The number of concurrent tests must be equal to or lower than the number of multiprocessors
• Since we cannot measure the time for each block to execute, we can only determine whether all sets are compact
• If compactness fails, we do not know which sets are compact
Parallelization - Implementation
• Two loops in the single-threaded algorithm are candidates for parallelization
  • Both have a fixed stride
  • The first loop starts with a stride of one and a length of one and keeps doubling the length until a non-compact set is found
  • The second loop uses some stride and finds the smallest length that makes a non-compact set
• Only one function is needed
  • The function takes any stride as an input
  • An input parameter determines whether to double the length or to increase it by one
  • It returns the first length to make a non-compact set
Parallelization - Host is_compact
• The single-threaded is_compact simply creates a cudaArray of length:
  stride * (length - 1) + 1
• It then binds a texture to this array and measures the time it takes the kernel to execute
• The multi-threaded version receives a list of lengths and a list of strides
  • Loops through the lists and finds the longest cudaArray needed
  • Creates the cudaArray to this length
  • Supplies the kernel with the lists
  • Times the execution of the kernel
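A sketch of that multi-set host routine (hypothetical names). It assumes the check_set kernel and texRef shown on the next slide, and it binds a plain linear device buffer rather than the cudaArray mentioned above, because tex1Dfetch reads through a linear binding:

float time_sets(const int *strides, const int *lengths, int num_sets)
{
    // find the largest index any set will touch
    int max_len = 0;
    for (int b = 0; b < num_sets; ++b)
    {
        int len = strides[b] * (lengths[b] - 1) + 1;
        if (len > max_len) max_len = len;
    }

    // one buffer big enough for every set, bound to the texture reference
    float *d_data;
    cudaMalloc(&d_data, max_len * sizeof(float));
    cudaMemset(d_data, 0, max_len * sizeof(float));
    cudaBindTexture(NULL, texRef, d_data, max_len * sizeof(float));

    // copy the per-block stride and length lists to the device
    int *d_stride, *d_length;
    cudaMalloc(&d_stride, num_sets * sizeof(int));
    cudaMalloc(&d_length, num_sets * sizeof(int));
    cudaMemcpy(d_stride, strides, num_sets * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_length, lengths, num_sets * sizeof(int), cudaMemcpyHostToDevice);

    // one block per set, one thread per block; time the whole launch
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    check_set<<<num_sets, 1>>>(d_stride, d_length);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);

    cudaUnbindTexture(texRef);
    cudaFree(d_data); cudaFree(d_stride); cudaFree(d_length);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

One block per set keeps each test on its own multiprocessor, matching the design constraint above; only the combined launch time is visible, which is why a non-compact result does not identify the offending set.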
Parallelization - Kernel
• Each block looks at a different stride and length
• Each block will have a different maxIndex

__global__ void check_set(int* stride, int* length)
{
    int i, index;
    float temp;

    // get the stride and length for this set
    int S = stride[blockIdx.x];
    int N = length[blockIdx.x];

    // determine the max index that is to be accessed by this set
    int maxIndex = S * (N - 1) + 1;

    __shared__ float finish;

    temp = 0;
    index = 0;
    for (i = 0; i < LOOP_COUNT; i++)
    {
        // call on texture memory
        temp += tex1Dfetch(texRef, index);

        // update with index of next memory access
        index += S;
        if (index >= maxIndex)
        {
            index = 0;
        }
    } // for (i = 0; i < LOOP_COUNT; i++)

    finish = temp;
} // check_set
Parallelization - First non-compact length
• Keeping the stride fixed, increase the length until a non-compact set is found
• Start with as many sets as there are multiprocessors
• If all are compact, continue with the next group of lengths and try again
• When a non-compact set is found, use binary search (see the sketch below)
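A sketch of that driver (hypothetical names). It reuses time_sets from the previous sketch and first_non_compact from the binary-search sketch, whose is_compact check can itself be a single-pair time_sets call; the compactness threshold is an assumed input chosen from the plateau in the timing plot:

int find_first_non_compact(int stride, int num_sm, float threshold_ms)
{
    int strides[32], lengths[32];   // assumes num_sm <= 32
    int length = 1;
    for (;;)
    {
        // test num_sm consecutive lengths in one launch, one block per multiprocessor
        for (int b = 0; b < num_sm; ++b)
        {
            strides[b] = stride;
            lengths[b] = length + b;
        }
        // any non-compact set in the batch slows the whole launch down
        if (time_sets(strides, lengths, num_sm) > threshold_ms)
            break;
        length += num_sm;           // all compact: try the next batch
    }
    // the previous length is known compact; the end of the batch is known non-compact
    return first_non_compact(stride, length - 1, length + num_sm - 1);
}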
Measuring Cache Size (9600M GT)
[chart: kernel time (ms) vs. n (number of floats accessed)]
L1 cache of the 9600M GT
• Size: 640 * sizeof(float) = 2.5 KB
• Associativity: 20
• Associative offset: 32 * sizeof(float) = 128 bytes
• Block size: Size / 80 = 32 bytes
Memory Bandwidth - Objectives
• Determine the maximum rate of transfer of data from memory to the processing units
• Limit overhead
• Analyze texture memory - global and cache
• Analyze global memory
• Analyze shared memory
• Accuracy: top priority
• Speed: low priority
Memory Bandwidth - Algorithm
• Access the memory as much as possible
• Increase the number of threads accessing the memory until saturation occurs
Memory Bandwidth - Design Considerations
• Compiler optimizations
  • Each value read from memory is added to a temp variable and, before the kernel exits, the value is written to shared memory
  • This keeps the compiler from removing the memory accesses and loops
• Loop overhead
  • Could have measured an idle loop and subtracted it from the measurements
  • Made very little difference in accuracy due to the high number of iterations
Memory Bandwidth - Design
• Outputs to a CSV file
• Number of threads ranges from 32 to 512 in intervals of 32
• Number of blocks ranges from 1 to the maximum block count
  • GTX 260 - 512 blocks
  • GeForce 9300 - 128 blocks
• The CSV output contains every combination of the above and the number of bytes transferred per second (GB/s)
• Throughput calculated as:
  Throughput = (1000 * noThreads * noBlocks) / (accessTimer * 1073741824)
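A sketch of that sweep (hypothetical helper). It assumes the bandwidth_kernel shown on the next slide with its texture already bound, CUDA event timing, and the throughput expression exactly as written above:

#include <cstdio>

void sweep_bandwidth(const char *csv_path, int max_blocks)
{
    FILE *csv = fopen(csv_path, "w");
    fprintf(csv, "threads,blocks,GB_per_s\n");

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int threads = 32; threads <= 512; threads += 32)
    {
        for (int blocks = 1; blocks <= max_blocks; ++blocks)
        {
            cudaEventRecord(start);
            bandwidth_kernel<<<blocks, threads>>>();
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float accessTimer;                       // elapsed time in ms
            cudaEventElapsedTime(&accessTimer, start, stop);

            // throughput as given on the slide (1073741824 = 2^30)
            double throughput = (1000.0 * threads * blocks)
                              / (accessTimer * 1073741824.0);
            fprintf(csv, "%d,%d,%f\n", threads, blocks, throughput);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    fclose(csv);
}

The same loop can launch global_kernel or shared_kernel (following slides) instead to cover the other two memory types.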
Texture Memory Bandwidth - Implementation

__global__ void bandwidth_kernel()
{
    float temp = 0;
    __shared__ float finished;
    int i;
    for (i = LOOP_COUNT; i != 0; --i)
    {
        temp += tex1Dfetch(texRef, 0);
    }
    finished = temp;
}

• Accesses one single address multiple times
Global Memory Bandwidth - Implementation

__global__ void global_kernel(float* data)
{
    float temp = 0;
    __shared__ float finished;
    int i;
    float* ptr = &data[threadIdx.x];
    for (i = LOOP_COUNT; i != 0; --i)
    {
        temp += *ptr;
    }
    finished = temp;
}

• Each thread accesses the same address multiple times
• Uses a pointer to that address to save the offset calculation
Shared Memory Bandwidth - Implementation

__global__ void shared_kernel(float* data)
{
    __shared__ float finished;
    __shared__ float data_s;
    float temp = 0;    // initialized here (left uninitialized on the original slide)
    int i;
    if (threadIdx.x == 0)
    {
        data_s = data[threadIdx.x];
    }
    __syncthreads();
    for (i = LOOP_COUNT; i != 0; --i)
    {
        temp += data_s;
    }
    finished = temp;
}

• Thread 0 loads the global value into shared memory
• Each thread accesses shared memory multiple times
Memory Bandwidth - Results
• Shared memory has lower latency than global memory but slightly lower measured bandwidth
• Texture memory has the lowest bandwidth
Memory Bandwidth - Results
• GTX 260 memory specifications from the NVIDIA website
• Measured result: ~107 GB/sec