ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 12, 2012 Timing

Measuring Performance • These notes will introduce: • Timing Program Execution • How to measure time of execution of CUDA programs • CUDA “events” • Synchronous and asynchronous CUDA routines • Bandwidth measures • Computation measures – floating point operations/sec ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 12, 2012 Timing.ppt

Ways to measure time of execution • Generally instrument code • Measure time at two places and get difference • Ways to measure time: • C clock() or time() routines • CUDA “events” (seems the best way) • CUDA SDK timer

Timing with clock() • If program uses cudaMemcpy, which is synchronous and waits for previous operations to complete and returns when it is complete, could use clock(): #include <time.h> // needed for clock() int main() { float start, stop; // actually return types are clock_t … start = clock(); cudaMemcpy mykernel<<<B,T>>>(); // kernel call cudaMemcpy stop = clock(); … printf(“Execution time is %f seconds\n", (stop-start)/(float)CLOCKS_PER_SEC); return 0; }

If just measuring time of asynchronous kernel with clock() • Important to remember that kernel calls asynchronous and return immediately and before kernels have fully executed. • Hence need to wait for kernel to complete. • Can be achieved using cudaThreadSynchronize(): start = clock(); mykernel<<<B,T>>>(); // kernel call cudaThreadSynchronize(); stop = clock(); (We will discuss synchronization within a computation later.)

CUDA event timer In general, better to use CUDA event timer. First need to create event objects. cudaEvent_t event1; cudaEventCreate(&event1); cudaEvent_t event1; cudaEventCreate(&event1); creates two “event” objects, event1 and event1.

Recording Events cudaEventRecord(event1, 0) record an “event” into default “stream” (0). Device will record a timestamp for the event when it reaches that event in the stream, that is, after all preceding operations have completed. (Default stream 0 will mean completed in CUDA context) NOTE: This operation is asynchronous and may return before recording event!

Making event actually recorded cudaEventSynchronize(event) --waits until named event actually recorded. Event recorded when all work done by threads to complete prior to specified event (Not strictly be necessary if synchronous CUDA call in code.)

Measuring time between two events cudaEventElapsedTime(&time, event1, event2) will return (pointer argument) the time elapsed between two events, in milliseconds. Resolution approx ½ millisecond. Timing measured using GPU clock.

Timing GPU Execution with CUDA events Code cudaEvent_t start, stop; float elapsedTime; cudaEventCreate(&start); // create event objects cudaEventCreate(&stop); cudaEventRecord(start, 0); // Record start event . . . cudaEventRecord(stop, 0); // record end event cudaEventSynchronize(stop); // wait for all device work to complete cudaEventElapsedTime(&elapsedTime, start, stop); //time between events cudaEventDestroy(start); //destroy start event cudaEventDestroy(stop);); //destroy stop event Time period

CUDA on-line documentation http://developer.download.nvidia.com/ compute/cuda/3_0/toolkit/docs/online/index.html http://developer.download.nvidia.com/ compute/cuda/2_3/toolkit/docs/online/index.html

Issues to watch for • first kernel launch will be more timing consuming than subsequent launches because of initialization • Asynchronous CUDA routines returning before they are complete – a big issue.

Asynchronous and synchronous calls • Kernels • Kernel starts after all previous CUDA calls completed • Control returned to CPU immediately (asynchronous, non-blocking) • cudaMemcpy • Copy starts after all previous CUDA calls completed • Returns after copy complete (synchronous)

Timing within Kernel -- Using clock() “B.10 Time Function clock_t clock(); when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater that the latter since threads are time sliced.” Possible to use clock() within kernel See NVIDIA CUDA C Programming Guide, page 115:

Timing within Kernel -- Using events Appears possible to use event timer within kernel. Events can be recorded in specific “stream” objects – sequences of in-order code operating on a data set. Events in default “stream 0” completed when all preceding operations completed by device See NVIDIA CUDA C Programming Guide, page 39 for more details on streams. (Will come back to this later.)

Bandwidth • Bandwidth is the rate at which data is transferred. • Physical connection will define the maximum system bandwidth. • S2050 (4 GPUs) 4121.6 GB/sec • C2050 Telsa (coit-grid06) 1030.4 GB/sec • GTX 280 141.6 GB/sec • GT 320M/330M (in Mac pro laptops) 25.6 GB/sec • Pentium Core i7 with Quickpath 25.6 GB/sec • Xbox 6.4 GB/sec • Effective bandwidth is the actual bandwidth achieved by a program. Hence if we measure the effective bandwidth of a program, we can compare that to the maximum possible. Wikipedia: Comparison of Nvidia graphics processing units http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#Tesla

Effective Bandwidth Effective bandwidth, the actual bandwidth achieved by a program/kernel given by Effective Bandwidth = (number_Bytes/time) * 10-9 GB/s where: number_Bytes is total number of bytes read or written time is the time period in seconds GB/s = Gigabytes per second = 1,000,000,000 Bytes/s Use effective bandwidth as a metric for measuring performance/optimization benefits* * from NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010

Bandwidth of Matrix Copy Operation Copying an N x N matrix: ((N2 x b x 2)/time) x 10-9 GB/sec where there are b bytes in each number Integers b = 4 (4 bytes) Floating point b = 4(4 bytes) Read plus write, 2 transfers. Need to know size of variables. From NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010

Computational Measures • The classical measure in high performance computing (HPC) to measure performance is the number of floating point operations. Systems have peak GFLOPs • Tianhe-1 2.5 PFLOPS* • Cray Jaguar 1.75 PFLOPS • S2050 (4 GPUs) 5152 GFLOPS • C2050 Telsa (coit-grid06) 1288 GFLOPS • GTX 280 933 GFLOPS • GT 330M (in Mac pro laptops) 182 GFLOPS • Pentium Core i7 40-55 GFLOPS • Peak single precision GFLOPs • Petaflops, 1015 FLOPS, Gflops = 109 FLOPS) These numbers need checking

Actual FLOPS Measured using standard benchmark programs such as LINPACK If measure it on your program, can see how close it get to the peak (which presumably is doing only floating point operations).

#define N 1000 // a big number up to INT_MAX, 2,147,483,647 __global__ void gpu_compute(float *result) { int i, j; float a = 0.0; int tid = blockIdx.x * blockDim.x + threadIdx.x; for (i = 0; i < N; i++) for (j = 0; j < N; j++) a = a + 0.0001; // do something, N x N floating pt operations result[tid] = a; // store result return; } int main(int argc, char *argv[]) { int T = 1, B = 1; // threads per block and blocks per grid float cpu_result, *gpu_result, ans[T * B]; // result from gpu, to make sure computation is being done cudaEvent_t start, end; // using cuda events to measure time float time; // which is applicable for asynchronous code also cudaEventCreate(&start); // instrument code to measure start time cudaEventCreate(&end); cudaEventRecord(start, 0 ); cudaMalloc((void**) &gpu_result, T * B * sizeof(float)); gpu_compute<<<B,T>>>(gpu_result); cudaMemcpy(ans,gpu_result, T * B * sizeof(float),cudaMemcpyDeviceToHost); cudaEventRecord(end, 0 ); // instrument code to measure end time cudaEventSynchronize(end); cudaEventElapsedTime(&time, start, end); printf("GPU, Answer thread 0, %e\n", ans[0]); printf("GPU Number of floating pt operations done %e\n", (double) N * N * T * B); printf("GPU Time using CUDA events: %f ms\n", time); // time is in ms cudaEventDestroy(start); cudaEventDestroy(end); return 0; } Sample partial code to measure performance on GPU

Demo of a program measuring performance

Questions

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 12, 2012 Timing