OpenCL: Advanced Features
Giuseppe Tagliavini giuseppe.tagliavini@unibo.it
Christian Pinto christian.pinto@unibo.it
Luca Benini luca.benini@unibo.it
Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA
Profiling (1/3)
Enable profiling on the device command queue:

cl_command_queue queue;
cl_int error;                                  /* error handling */
queue = clCreateCommandQueue(context,          /* context */
                             device,           /* device */
                             CL_QUEUE_PROFILING_ENABLE,  /* enable profiling */
                             &error);
Profiling (2/3)
Query timing information for a profiled event:

cl_int error;
cl_ulong param_value;
error = clGetEventProfilingInfo(event,
                                param_name,            /* e.g. CL_PROFILING_COMMAND_START */
                                param_value_size,      /* e.g. sizeof(cl_ulong) */
                                &param_value,
                                &param_value_size_ret);
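As an illustration, a minimal sketch of measuring a kernel's execution time through the event returned by clEnqueueNDRangeKernel. It assumes the usual includes (CL/cl.h, stdio.h) and that queue and kernel have been created as in the previous slides; global_size is just an example value:

cl_event event;
cl_ulong t_start, t_end;
size_t global_size = 1024;                /* example NDRange size */
cl_int error;

error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, &event);
clFinish(queue);                          /* wait until the command has completed */

/* Timestamps are expressed in device nanoseconds */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);
printf("Kernel time: %lu ns\n", (unsigned long)(t_end - t_start));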
Profiling (3/3)
On the host side, we can read the system clock:

#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Host code...
clock_gettime(CLOCK_MONOTONIC, &end);

/* Elapsed time in nanoseconds */
unsigned long delta = (end.tv_sec * 1000000000UL + end.tv_nsec)
                    - (start.tv_sec * 1000000000UL + start.tv_nsec);
Profiling on NVIDIA platforms

export OPENCL_PROFILE=1
./example
cat opencl_profile_0.log

# OPENCL_PROFILE_LOG_VERSION 2.0
# OPENCL_DEVICE 0 Tesla C2070
# OPENCL_CONTEXT 1
# TIMESTAMPFACTOR fffff6bbdff06bd0
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 1.472 ] cputime=[ 7.000 ]
method=[ memcpyHtoDasync ] gputime=[ 1.280 ] cputime=[ 4.000 ]
method=[ add ] gputime=[ 4.096 ] cputime=[ 11.000 ] occupancy=[ 0.333 ]
method=[ memcpyDtoHasync ] gputime=[ 3.712 ] cputime=[ 25.000 ]
Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA
Pinned memory
• Pinned memory is a region in host memory space which is not pageable: swapping is disabled for it
• It enables faster host/device transfer times (DMA)

cl_mem buffer;
buffer = clCreateBuffer(context,
                        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                        szBufferBytes, NULL, &error);

• How can we access this memory area?
Memory mapping
• clEnqueueMapBuffer enqueues a command to map a buffer object into the host address space, and returns a pointer to this mapped region:

unsigned char *hData;
hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                             CL_MAP_READ | CL_MAP_WRITE,
                                             0, nbytes,       /* offset, number of bytes */
                                             0, NULL, NULL, &error);

• To unmap the memory region (making it safe for use by the device):

clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);
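Putting the two slides together, a minimal sketch of filling a pinned buffer from the host. The buffer, queue and szBufferBytes names come from the snippets above; host_src is an assumed source array:

#include <string.h>   /* memcpy */
#include <CL/cl.h>

/* Map the pinned buffer, fill it on the host, then unmap it
   so the device can safely use it. */
unsigned char *hData;
cl_int error;

hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                             CL_MAP_WRITE, 0, szBufferBytes,
                                             0, NULL, NULL, &error);
memcpy(hData, host_src, szBufferBytes);              /* fill the mapped region */
clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);
/* The buffer can now be passed to a kernel with clSetKernelArg() */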
Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA
Case study: Matrix multiplication

for(int i = 0; i < heightA; i++) {
    for(int j = 0; j < widthB; j++) {
        C[i][j] = 0;
        for(int k = 0; k < widthA; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
}

Using single-dimensional arrays: A[i][k] becomes A[i*widthA + k]
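For reference, the same triple loop rewritten over flattened 1-D arrays (row-major layout), which is the form the OpenCL kernels below assume:

for(int i = 0; i < heightA; i++) {
    for(int j = 0; j < widthB; j++) {
        float sum = 0.0f;
        for(int k = 0; k < widthA; k++)
            sum += A[i*widthA + k] * B[k*widthB + j];
        C[i*widthB + j] = sum;       /* C is heightA x widthB, row-major */
    }
}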
Matrix multiplication: kernel v1

kernel void simpleMultiply(global float* outputC,
                           int widthA, int heightA,
                           int widthB, int heightB,
                           global float* inputA,
                           global float* inputB)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for(int i = 0; i < widthA; i++)
        sum += inputA[row*widthA+i] * inputB[i*widthB+col];
    outputC[row*widthB+col] = sum;
}

NDRange size = [widthB, heightA]
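As a hedged sketch of the host-side launch (queue, kernel and the buffer objects bufA, bufB, bufC are assumed to have been created earlier, and widthA/heightA/widthB/heightB are assumed to be cl_int host variables), the NDRange is sized [widthB, heightA] so that each work-item computes one element of C:

size_t global_size[2] = { (size_t)widthB, (size_t)heightA };  /* one work-item per output element */

clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufC);
clSetKernelArg(kernel, 1, sizeof(cl_int), &widthA);
clSetKernelArg(kernel, 2, sizeof(cl_int), &heightA);
clSetKernelArg(kernel, 3, sizeof(cl_int), &widthB);
clSetKernelArg(kernel, 4, sizeof(cl_int), &heightB);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &bufB);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       global_size, NULL,   /* let the runtime pick the work-group size */
                       0, NULL, NULL);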
OpenCL Memory Hierarchy
[Diagram: memory levels closer to the work-item (private, local) are faster but smaller; moving towards global memory, access gets slower and capacity gets greater.]
Matrix multiplication: kernel v2

kernel void coalescedMultiply(global float* outputC,
                              int widthA, int heightA,
                              int widthB, int heightB,
                              global float* inputA,
                              global float* inputB)
{
    local float aTile[BLOCK_SIZE][widthA];
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = inputA[row*widthA+x];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = 0; i < widthA; i++)
        sum += aTile[y][i] * inputB[i*widthB+col];
    outputC[row*widthB+col] = sum;
}

• Local memory access is FASTER
• Transfers to adjacent memory addresses are COALESCED
• Work-group size: (widthA, BLOCK_SIZE)
• Synchronization using a local barrier
Matrix multiplication: kernel v3

kernel void coalescedMultiply(global float* outputC,
                              int widthA, int heightA,
                              int widthB, int heightB,
                              global float* inputA,
                              global float* inputB)
{
    local float aTile[BLOCK_SIZE][BLOCK_SIZE];
    …
    for(int m = 0; m < widthA/BLOCK_SIZE; m++) {
        aTile[y][x] = inputA[row*widthA + m*BLOCK_SIZE + x];
        barrier(CLK_LOCAL_MEM_FENCE);
        for(int i = 0; i < BLOCK_SIZE; i++)
            sum += aTile[y][i] * inputB[(m*BLOCK_SIZE + i)*widthB + col];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    …

• Local memory usage is limited!
• Work-group size: (BLOCK_SIZE, BLOCK_SIZE)
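For completeness, a sketch of what a full version of kernel v3 could look like. The parts elided above are filled in following kernel v2; the name coalescedMultiplyFull is only used to distinguish it, BLOCK_SIZE is assumed to be defined at build time (e.g. -D BLOCK_SIZE=16), and widthA is assumed to be a multiple of BLOCK_SIZE:

kernel void coalescedMultiplyFull(global float* outputC,
                                  int widthA, int heightA,
                                  int widthB, int heightB,
                                  global float* inputA,
                                  global float* inputB)
{
    local float aTile[BLOCK_SIZE][BLOCK_SIZE];

    int row = get_global_id(1);
    int col = get_global_id(0);
    int x   = get_local_id(0);
    int y   = get_local_id(1);
    float sum = 0.0f;

    for(int m = 0; m < widthA/BLOCK_SIZE; m++) {
        /* Each work-item loads one element of the current tile of A */
        aTile[y][x] = inputA[row*widthA + m*BLOCK_SIZE + x];
        barrier(CLK_LOCAL_MEM_FENCE);
        for(int i = 0; i < BLOCK_SIZE; i++)
            sum += aTile[y][i] * inputB[(m*BLOCK_SIZE + i)*widthB + col];
        barrier(CLK_LOCAL_MEM_FENCE);   /* do not overwrite the tile too early */
    }
    outputC[row*widthB + col] = sum;
}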
Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA
P2012 “computing” as OpenCL device

[Diagram: the OpenCL conceptual architecture (compute units 1..N, each with processors Proc 1..Proc M and their private memories, local memory, a constant data cache, and global memory) mapped onto the P2012 architecture (clusters 1..N, each with processors and private memories, a 256K shared L1, and DMA engines, connected to an external L3 memory).]

OpenCL conceptual architecture:
• Scalable programming model
• Supports SPMD model
• Supports task parallelism
• Covers complex memory hierarchies
• Supports async memory transfers

P2012 architecture:
• Scalable architecture (cluster based)
• Supports SPMD with 0-cost branch divergence
• Supports task parallelism
• Shared local memory
• 1D/2D DMA engines
• Hardware synchronizer
P2012 & OpenCL: Kernel-level parallelism

With P2012 it is possible to implement more complex OpenCL task graphs than on GPUs: both task-level and data-level (ND-Range) parallelism are possible.

The P2012 OpenCL runtime does not accept more than 16 work-items per work-group when creating an ND-Range. This is because having more work-items than PEs would result in many context switches, which are really expensive on this architecture.
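As an illustrative sketch of task-level parallelism expressed as a task graph, not P2012-specific: it only uses standard OpenCL event dependencies, and kernelA/kernelB/kernelC are assumed to be already built. Two kernels run independently and a third waits for both:

cl_event evA, evB, deps[2];

/* Two independent tasks */
clEnqueueTask(queue, kernelA, 0, NULL, &evA);
clEnqueueTask(queue, kernelB, 0, NULL, &evB);

/* kernelC starts only after both kernelA and kernelB have completed */
deps[0] = evA;
deps[1] = evB;
clEnqueueTask(queue, kernelC, 2, deps, NULL);

clFinish(queue);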
P2012 & GPUs: Differences
• P2012 cores are mono-threaded, while GPU cores are highly multithreaded
• In GPU programming, memory latencies are hidden by running a massive battery of threads; GPUs have negligible task scheduling overhead
• In P2012, memory latencies are hidden by asynchronous DMA copies, because context switches are expensive
• GPU cores execute in lock-step (SIMT fashion)
• All threads in a warp execute the same instruction flow; diverging threads cause an important performance loss
• P2012 cluster PEs can execute different instruction flows without affecting application performance
P2012 & GPUs: Programming style differences

[Diagram: the same data set mapped as an ND-Range onto a GPU and onto P2012.]

GPU:
• ND-Range: 1 work-item per data element; the number of work-groups depends on the work-group size
• Several clusters, 32 PEs per cluster

P2012:
• ND-Range: 2 work-groups, 16 work-items per work-group
• 2 clusters, 16 PEs per cluster
Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA
P2012 & OpenCL: Smart memory usage
Problem:
• External L3 memory accesses have a high cost (hundreds of cycles)
• P2012 cannot hide memory latencies with thread scheduling
Solution:
• Use local memory as a user-managed cache
• Hide memory latencies by overlapping computation with memory transfers
• DMA asynchronous transfers: global ↔ local
P2012 & OpenCL: Overlap DMA transfers and computation
The best way to hide memory transfer latencies when programming for P2012 is to overlap computation with DMA transfers. This technique is based on software pipelining and double buffering (which reduces the total amount of available local or private memory). 4 buffers are needed to implement such a mechanism.
P2012 & OpenCL: DMA Primitives
DMA transfer primitives:
• Per work-item memory transfer:
  async_work_item_copy(void *src, void *dst, size_t bytes, event_t event);
• Per work-group memory transfer (copies the data for an entire work-group with a single DMA transfer):
  async_work_group_copy(void *src, void *dst, size_t bytes, event_t event);
Wait for a DMA transfer to finish:
  wait_events(int num_events, event_t *events);
  wait_group_events(int num_events, event_t *events);
P2012 & OpenCL: Overlap DMA transfers and computation
Pipeline stages: write block N-1 (local → global), process block N (in local memory), read block N+1 (global → local).
• Assume each work-item works on multiple rows
• Each work-item moves data in rows using asynchronous DMA transfers
• Each cluster has 2 DMAs to allow parallel transfers in the two directions (local → global, global → local)
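To make the scheme concrete, a minimal sketch of a work-group-level double-buffering loop. It uses the standard OpenCL built-ins async_work_group_copy/wait_group_events (whose element-count signature differs slightly from the byte-based P2012 primitives on the previous slide); ROW_SIZE is assumed to be defined at build time, the kernel and argument names are illustrative, and only the input side is double-buffered here, whereas the full scheme above also double-buffers the output, for 4 buffers in total:

kernel void process_rows(global float *in, global float *out, int num_blocks)
{
    local float bufA[ROW_SIZE];      /* buffer currently being processed   */
    local float bufB[ROW_SIZE];      /* buffer currently filled by the DMA */
    local float *cur = bufA, *next = bufB;

    event_t ev = async_work_group_copy(cur, in, ROW_SIZE, 0);   /* prefetch block 0 */

    for (int n = 0; n < num_blocks; n++) {
        wait_group_events(1, &ev);             /* block n is now in local memory */

        if (n + 1 < num_blocks)                /* start fetching block n+1 */
            ev = async_work_group_copy(next, in + (n + 1) * ROW_SIZE, ROW_SIZE, 0);

        /* ... process cur[] in local memory, e.g. one row per work-item ... */

        /* write block n back to global memory */
        event_t wr = async_work_group_copy(out + n * ROW_SIZE, cur, ROW_SIZE, 0);
        wait_group_events(1, &wr);

        /* swap the two local buffers */
        local float *tmp = cur; cur = next; next = tmp;
    }
}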