
OpenCL



Presentation Transcript


  1. OpenCL Giuseppe Tagliavini giuseppe.tagliavini@unibo.it Christian Pinto christian.pinto@unibo.it Luca Benini luca.benini@unibo.it Advanced Features

  2. Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA

  3. Profiling (1/3) Enable profiling on the device command queue:
cl_command_queue queue;
cl_int error;
queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &error);
Here CL_QUEUE_PROFILING_ENABLE enables profiling, &error is used for error handling, and context and device identify the context and the device the command queue is attached to.

  4. Profiling (2/3)
cl_int error;
cl_ulong param_value;
error = clGetEventProfilingInfo(event, param_name, param_value_size, &param_value, &param_value_size_ret);
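A minimal sketch of how these calls fit together to time a kernel, assuming a profiling-enabled queue as on the previous slide; the variables queue, kernel and global_size stand in for an existing program:

/* Sketch: time a kernel via the event returned by clEnqueueNDRangeKernel */
cl_event kernel_event;
cl_ulong t_start = 0, t_end = 0;
size_t global_size = 1024;                      /* assumed NDRange size */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, &kernel_event);
clWaitForEvents(1, &kernel_event);              /* profiling info is valid only after completion */

clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);

printf("Kernel time: %llu ns\n", (unsigned long long)(t_end - t_start));
clReleaseEvent(kernel_event);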

  5. Profiling (3/3) On the host side, we can read the system clock:
#include <time.h>
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// Host code...
clock_gettime(CLOCK_MONOTONIC, &end);
unsigned long delta = (end.tv_sec*1000000000UL + end.tv_nsec) - (start.tv_sec*1000000000UL + start.tv_nsec); // elapsed time in nanoseconds

  6. Profiling on NVIDIA platforms
export OPENCL_PROFILE=1
./example
cat opencl_profile_0.log
# OPENCL_PROFILE_LOG_VERSION 2.0
# OPENCL_DEVICE 0 Tesla C2070
# OPENCL_CONTEXT 1
# TIMESTAMPFACTOR fffff6bbdff06bd0
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 1.472 ] cputime=[ 7.000 ]
method=[ memcpyHtoDasync ] gputime=[ 1.280 ] cputime=[ 4.000 ]
method=[ add ] gputime=[ 4.096 ] cputime=[ 11.000 ] occupancy=[ 0.333 ]
method=[ memcpyDtoHasync ] gputime=[ 3.712 ] cputime=[ 25.000 ]

  7. Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA

  8. Pinned memory
• Pinned memory is a region of host memory that is not pageable: swapping is disabled
• It enables faster host-device transfer times (DMA)
cl_mem buffer;
buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBufferBytes, NULL, &error);
• How can we access this memory area?

  9. Memory mapping
• clEnqueueMapBuffer enqueues a command to map a buffer object into the host address space, and returns a pointer to this mapped region:
unsigned char *hData;
hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, nbytes, 0, NULL, NULL, &error);
The fifth and sixth arguments are the offset and the number of bytes to map.
• To unmap the memory region (making it safe for use by the device):
clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);
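A hedged end-to-end sketch of the pinned-memory workflow described above; the buffer size, the memset initialization and the kernel argument index are illustrative assumptions:

/* Sketch: allocate a pinned buffer, fill it from the host through a mapping,
 * then hand it to a kernel. context, queue and kernel are assumed to exist. */
size_t nbytes = 1024 * sizeof(unsigned char);
cl_int error;

cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               nbytes, NULL, &error);

/* Map the buffer into the host address space and initialize it */
unsigned char *hData = (unsigned char *) clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                           CL_MAP_WRITE, 0, nbytes, 0, NULL, NULL, &error);
memset(hData, 0, nbytes);

/* Unmap before the device uses the buffer */
clEnqueueUnmapMemObject(queue, buffer, hData, 0, NULL, NULL);

/* The buffer can now be passed to a kernel as usual */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);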

  10. Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA

  11. Case study: Matrix multiplication
for(int i = 0; i < heightA; i++) {
  for(int j = 0; j < widthB; j++) {
    C[i][j] = 0;
    for(int k = 0; k < widthA; k++)
      C[i][j] += A[i][k] * B[k][j];
  }
}
Using single dimensional arrays: A[i][k] becomes A[i*widthA + k]
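For reference, a sketch of the same loop rewritten with single-dimensional, row-major arrays (A, B and C are assumed to be plain float arrays of the appropriate sizes):

/* Host reference: C = A * B with row-major 1D arrays.
 * A is heightA x widthA, B is widthA x widthB, C is heightA x widthB. */
for (int i = 0; i < heightA; i++) {
  for (int j = 0; j < widthB; j++) {
    float sum = 0.0f;
    for (int k = 0; k < widthA; k++)
      sum += A[i*widthA + k] * B[k*widthB + j];
    C[i*widthB + j] = sum;
  }
}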

  12. Matrix multiplication: kernel v1
kernel void simpleMultiply(global float *outputC, int widthA, int heightA, int widthB, int heightB, global float *inputA, global float *inputB)
{
  int row = get_global_id(1);
  int col = get_global_id(0);
  float sum = 0.0f;
  for(int i = 0; i < widthA; i++)
    sum += inputA[row*widthA+i] * inputB[i*widthB+col];
  outputC[row*widthB+col] = sum;
}
NDRange size = [widthB, heightA]
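A hedged sketch of the host-side launch for this kernel; the buffer and variable names (dC, dA, dB, queue, kernel) are assumptions, while the NDRange matches the [widthB, heightA] size stated above:

/* Sketch: set the kernel arguments and enqueue simpleMultiply */
cl_int err;
size_t global[2] = { (size_t)widthB, (size_t)heightA };   /* x = columns of B, y = rows of A */

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &dC);
err |= clSetKernelArg(kernel, 1, sizeof(int),    &widthA);
err |= clSetKernelArg(kernel, 2, sizeof(int),    &heightA);
err |= clSetKernelArg(kernel, 3, sizeof(int),    &widthB);
err |= clSetKernelArg(kernel, 4, sizeof(int),    &heightB);
err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &dA);
err |= clSetKernelArg(kernel, 6, sizeof(cl_mem), &dB);

err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
clFinish(queue);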

  13. OpenCL Memory Hierarchy (diagram): the faster memory levels, private and then local, are smaller; the slower levels, down to global memory, are greater in capacity.

  14. Matrix multiplication: kernel v2
kernel void coalescedMultiply(global float *outputC, int widthA, int heightA, int widthB, int heightB, global float *inputA, global float *inputB)
{
  local float aTile[BLOCK_SIZE][widthA];
  int row = get_global_id(1);
  int col = get_global_id(0);
  float sum = 0.0f;
  int x = get_local_id(0);
  int y = get_local_id(1);
  aTile[y][x] = inputA[row*widthA+x];
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int i = 0; i < widthA; i++)
    sum += aTile[y][i] * inputB[i*widthB+col];
  outputC[row*widthB+col] = sum;
}

  15. Matrix multiplication: kernel v2 (same kernel, annotated)
• Local memory access is FASTER
• Transfers to adjacent memory addresses are COALESCED
• Work-group size: (widthA, BLOCK_SIZE)
• Synchronization using a local barrier

  16. Matrix multiplication: kernel v3
kernel void coalescedMultiply(global float *outputC, int widthA, int heightA, int widthB, int heightB, global float *inputA, global float *inputB)
{
  local float aTile[BLOCK_SIZE][BLOCK_SIZE];
  …
  for(int m = 0; m < widthA/BLOCK_SIZE; m++) {
    aTile[y][x] = inputA[row*widthA+m*BLOCK_SIZE+x];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = 0; i < BLOCK_SIZE; i++)
      sum += aTile[y][i] * inputB[(m*BLOCK_SIZE+i)*widthB+col];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  …

  17. Matrix multiplication: kernel v3 (same kernel, annotated)
• Local memory usage is limited!!!
• Work-group size: (BLOCK_SIZE, BLOCK_SIZE)
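A hedged sketch of what the complete tiled kernel might look like once the elided parts are filled in; the index setup mirrors kernel v2, and the kernel name tiledMultiply is an illustrative choice, not the original slide code:

// Sketch of a complete BLOCK_SIZE x BLOCK_SIZE tiled multiply. Assumes widthA is
// a multiple of BLOCK_SIZE and the work-group size is (BLOCK_SIZE, BLOCK_SIZE).
kernel void tiledMultiply(global float *outputC,
                          int widthA, int heightA, int widthB, int heightB,
                          global float *inputA, global float *inputB)
{
  local float aTile[BLOCK_SIZE][BLOCK_SIZE];

  int row = get_global_id(1);
  int col = get_global_id(0);
  int x = get_local_id(0);
  int y = get_local_id(1);
  float sum = 0.0f;

  for (int m = 0; m < widthA / BLOCK_SIZE; m++) {
    // Each work-item loads one element of the current tile of A
    aTile[y][x] = inputA[row*widthA + m*BLOCK_SIZE + x];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < BLOCK_SIZE; i++)
      sum += aTile[y][i] * inputB[(m*BLOCK_SIZE + i)*widthB + col];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  outputC[row*widthB + col] = sum;
}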

  18. Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA

  19. P2012 “computing” as OpenCL device
(Diagram: the OpenCL conceptual architecture, with compute units made of processors with private memory, a local memory per compute unit, a constant data cache and a global memory, is mapped onto the P2012 architecture, where each cluster has processors sharing a 256K L1 memory and DMA engines, backed by an external L3 memory.)
OpenCL programming model:
• Scalable programming model
• Supports the SPMD model
• Supports task parallelism
• Covers a complex memory hierarchy
• Supports async memory transfers
P2012 architecture:
• Scalable architecture (cluster based)
• Supports SPMD with 0-cost branch divergence
• Supports task parallelism
• Shared local memory
• 1D/2D DMA engines
• Hardware synchronizer

  20. P2012 & OpenCL: Kernel-level parallelism
With P2012 it is possible to implement more complex OpenCL task graphs than on GPUs. Both task-level and data-level (ND-Range) parallelism are possible.
The P2012 OpenCL runtime does not accept more than 16 work-items per work-group when creating an ND-Range. This is because having more work-items than PEs would result in lots of context switches, which are really expensive on this architecture.

  21. P2012 & GPUs: Differences
• P2012 cores are mono-threaded, while GPU cores are highly multithreaded
• In GPU programming, memory latencies are hidden by running a massive battery of threads. GPUs have negligible task scheduling overhead.
• In P2012, memory latencies are hidden by asynchronous DMA copies, because context switches are expensive.
• GPU cores execute in lock-step (SIMT fashion): all threads in a warp execute the same instruction flow, and diverging threads cause an important performance loss.
• P2012 cluster PEs can execute different instruction flows without affecting application performance.

  22. P2012 & GPUs: Programming style differences
(Diagram: the same data set mapped onto an ND-Range on both architectures.)
GPU (several clusters, 32 PEs per cluster): ND-Range with 1 work-item per data element; the number of work-groups depends on the size of each work-group.
P2012 (2 clusters, 16 PEs per cluster): ND-Range with 2 work-groups, 16 work-items per work-group.

  23. Outline • Profiling • Pinned memory • Local buffering • Differences between P2012 and GPUs • Smart memory management: DMA

  24. P2012 & OpenCL: Smart memory usage
Problem:
• External L3 memory accesses have a high cost (hundreds of cycles)
• P2012 cannot hide memory latencies with thread scheduling
Solution:
• Use local memory as a user-managed cache
• Hide memory latencies by overlapping computation with memory transfers
• DMA asynchronous transfers: global ↔ local

  25. P2012 & OpenCL: Overlap DMA transfers and computation
The best way to hide memory transfer latencies when programming for P2012 is to overlap computation with DMA transfers. This technique is based on software pipelining and double buffering (which reduces the total amount of available local or private memory). 4 buffers are needed to implement such a mechanism.

  26. P2012 & OpenCL: DMA Primitives
DMA transfer primitives:
- Per work-item memory transfer:
async_work_item_copy(void *src, void *dst, size_t bytes, event_t event);
- Per work-group memory transfer (copies the data for an entire work-group with a single DMA transfer):
async_work_group_copy(void *src, void *dst, size_t bytes, event_t event);
Wait for a DMA transfer to finish:
wait_events(int num_events, event_t *events);
wait_group_events(int num_events, event_t *events);
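For illustration, a minimal sketch using the standard OpenCL C built-ins, whose signatures differ slightly from the P2012 primitives above (destination first, a count of elements instead of bytes, and the event is returned); TILE and the kernel itself are assumptions:

// Sketch: copy one tile of a global array into local memory with the standard
// async_work_group_copy built-in, then process it. Assumes the work-group size
// equals TILE and the array length is a multiple of TILE.
#define TILE 64

kernel void scale_tiles(global const float *in, global float *out)
{
  local float tile[TILE];
  int base = get_group_id(0) * TILE;

  // All work-items in the group must reach the copy with the same arguments
  event_t ev = async_work_group_copy(tile, in + base, TILE, 0);
  wait_group_events(1, &ev);

  int lid = get_local_id(0);
  out[base + lid] = 2.0f * tile[lid];   // simple per-element computation
}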

  27. P2012 & OpenCL: Overlap DMA transfers and computation
Write block N-1: local → global
Process block N: local → local
Read block N+1: global → local
• Assume each work-item works on multiple rows
• Each work-item moves data in rows using asynchronous DMA transfers
• Each cluster has 2 DMAs to allow parallel transfers in the two directions (local → global, global → local)
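A hedged sketch of the double-buffering pipeline the slide describes, again written with the standard OpenCL C async copy built-ins; ROW_SIZE, the kernel name and the process_row helper are illustrative assumptions:

// Sketch: software pipelining with two local buffers per direction. While row N
// is processed, row N+1 is prefetched and the result of row N-1 is written back,
// so DMA transfers overlap with computation. Assumes num_rows >= 1.
#define ROW_SIZE 128

// Placeholder computation: each work-item scales part of the row (local -> local)
void process_row(local const float *src, local float *dst)
{
  for (int i = get_local_id(0); i < ROW_SIZE; i += get_local_size(0))
    dst[i] = 2.0f * src[i];
}

kernel void pipeline_rows(global const float *in, global float *out, int num_rows)
{
  local float in_buf[2][ROW_SIZE];
  local float out_buf[2][ROW_SIZE];

  // Prefetch the first row
  event_t load_ev = async_work_group_copy(in_buf[0], in, ROW_SIZE, 0);
  event_t store_ev;

  for (int row = 0; row < num_rows; row++) {
    int cur = row & 1;
    int have_next = (row + 1 < num_rows);
    event_t next_ev;

    // Start fetching the next row while the current one is processed
    if (have_next)
      next_ev = async_work_group_copy(in_buf[1 - cur],
                                      in + (row + 1) * ROW_SIZE, ROW_SIZE, 0);

    wait_group_events(1, &load_ev);          // current row is now in local memory
    process_row(in_buf[cur], out_buf[cur]);  // compute: local -> local
    barrier(CLK_LOCAL_MEM_FENCE);            // out_buf[cur] is fully written

    if (row > 0)
      wait_group_events(1, &store_ev);       // previous write-back has finished
    store_ev = async_work_group_copy(out + row * ROW_SIZE,
                                     out_buf[cur], ROW_SIZE, 0);
    if (have_next)
      load_ev = next_ev;
  }
  wait_group_events(1, &store_ev);           // drain the last write-back
}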
