1 / 31

An OpenCL Framework for Heterogeneous Multicores with Local Memory

An OpenCL Framework for Heterogeneous Multicores with Local Memory. PACT 2010 Jaejin Lee, Jungwon Kim, Sangmin Seo , Seungkyun Kim, Jungho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo , Seung Hak Lee,

sahkyo
Download Presentation

An OpenCL Framework for Heterogeneous Multicores with Local Memory

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An OpenCL Frameworkfor Heterogeneous Multicores with Local Memory PACT 2010 Jaejin Lee, JungwonKim, SangminSeo, SeungkyunKim, JunghoPark, HonggyuKim, Thanh Tuan Dao, YongjinCho, Sung JongSeo, SeungHakLee, Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-DeokChoi School of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea Samsung Electronics Co., Nongseo-dong, Giheung-gu, Yongin-si, Geonggi-do 446-712, Korea Presenter : Jen-Jung, Cheng

  2. Outline • Introduction • Background • OpenCL platform • Design and Implementation • OpenCL runtime • Work-item coalescing • Web-based variable expansion • Preload-poststore buffering • Evaluation • Conclusion

  3. Introduction(1/2) GPC APC APC APC The target architecture … L1$ Local store Local store Local store L2$ Interconnect bus Main memory

  4. Introduction(2/2) • Two major challenges in design and implementation of the OpenCL framework • Implements hundreds of virtual PEs with a single accelerator core and make them efficient • Overcomes the limited size and incoherency of the local store

  5. OpenCLplatform(1/2) The OpenCL platform model

  6. OpenCLplatform(2/2) • OpenCL platform:a host processor, compute devices, compute units, and processing elements • Abstract Index Space:global ID, workgroup ID, and local ID • Memory Region:private, local, constant, and global • Synchronization:work-group barrier and command-queue barrier

  7. OpenCL runtime(1/3) Mapping platform components to the target architecture

  8. OpenCL runtime(2/3) OpenCLRuntime thread GPC Command Scheduler Command Executor DAG CUStatus Array Event Queue Device Command Queues Execution ordering Issue Assign CU OpenCLHost thread CU … Ready Queue … CU Work-groups The command scheduler and the commandexecutor

  9. OpenCL runtime(3/3) • Theruntime implements a software-managed cachein each APC‘s local store. It caches the contents of theglobaland constant memory. • To guarantee OpenCL memory consistency for shared memory objects between commands, the command executor flushes software-managed caches whenever it dequeues a command from the ready-queue or it removes an event object from the DAG after the associated command has completed.

  10. Work-item coalescing(1/3) • Executing work-items on a CU by switching one work-item to another incurs a significant overhead. • When a kernel and its calleefunctions do not contain any barrier, any execution ordering defined between the two statements from different work-items in the same work-group satisfies the OpenCLsemantics. • Work-item coalescing loop(WCL) iterates on the index space of a single work-group.

  11. Work-item coalescing(2/3) __kernel void vec_add( __global float *a, __global float *b, __global float *c) { int id; id = get_global_id(0); c[id] = a[id] + b[id]; } Int __i, __ j, __k; __kernel void vec_add (__global float *a, __global float *b, __global float *c) { int id; for( __k = 0; __k < __local_size[2]; __k++ ) { for( __ j = 0; __ j < __local_size[1]; __ j++ ) { for( __ i = 0; __ i < __local_size[0]; __ i++ ) { id = get_global_id(0); c[id] = a[id] + b[id]; } } } } OpenCL C source-to-source translator

  12. Work-item coalescing(3/3) S1 barrier(); S2 [S1’ barrier(); [S2’ if (c) { S1 barrier(); S2 } [t = C’; if (t) { [S1’ barrier(); [S2’ } while (c) { S1 barrier(); S2 } while (1){ [t = C’; if (!t) break; [S1’ barrier(); [S2’ }

  13. Web-based variable expansion(1/5) • A kernel code region that needs to be enclosed with a WCL is called a work-item coalescing region (WCR). • A work-item private variable that is defined in one WCR and used in another needs a separate location for different work-items. • A du-chain for a variable connects a definition of the variable to all uses reached by the definition. • A web for a variable is all du-chains of the variable that contain a common use of the variable.

  14. Web-based variable expansion(2/5) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit

  15. Web-based variable expansion(3/5) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit Identifying du-chains

  16. Web-based variable expansion(4/5) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit Identifying webs

  17. Web-based variable expansion(5/5) WCR x1[][][]= x1=malloc() Entry t1 = C1 if (t1) while (1) =x1[][][] t2 = C2 if (t2) x = … x = … …=x x1[][][]= x1[][][]= barrier () =x1[][][] =x1[][][] x = … …=x Free(x1) Exit After variable expansion

  18. Preload-poststore buffering(1/4) • Preload-poststore buffering enables gathering DMA transfers together for array accesses and minimizes the time spent waiting for them to complete by overlapping them.

  19. Preload-poststore buffering(2/4) for(k = 0; k < ls[2]; k++ ) { for(j = 0; j < ls[1]; j++ ) { PRELOAD(buf_b, &b[0], ls[0]); PRELOAD(buf_a1, &a[j][0], ls[0]+1024); for(i = 0; i < ls[0]; i++ ) PRELOAD(buf_a2[i], &a[j][3*i+1]); WAITFOR(buf_b); for(i = 0; i < ls[0]; i++ ) PRELOAD(buf_c[i], &c[j][buf_b[i]]); for(i = 0; i < ls[0]; i++ ) { if( i < 100) buf_a1[i] = buf_c[i]; buf_c[i] = buf_a2[i]+ buf_a1[i+1024]; } POSTSTORE(buf_a1, &a[j][0], ls[0]+1024); for(i = 0; i < ls[0]; i++ ) POSTSTORE (buf_c[i], &c[j][buf_b[i]]); } } for(k = 0; k < ls[2]; k++ ) { for(j = 0; j < ls[1]; j++ ) { for(i = 0; i < ls[0]; i++ ) { if( i < 100) a[j][i] = c[j][b[i]]; c[j][b[i]] = a[j][3*i+1] + a[j][i+1024]; } } }

  20. Preload-poststore buffering(3/4) • Buffering candidate … A 0 1 2 3 4 5 6 7 n-1 c*I + d, where c and d are loop invariant to L c*x + d, where x is an array reference and c and d are loop invariant to L. [lower bound : upper bound : stride] [1 : 3 * ls[0] - 2 : 3] 3 * i + 1 … 1 4 7 10 13 16 19 22 3 * ls[0] - 2 buf_a2

  21. Preload-poststorebuffering(4/4) • Condition for single buffer • a loop-independent flow dependence (read-after-write) • a loop-independent output dependence (write-after-write)

  22. Evaluation(1/5) • Experimental Setup • an IBM QS22 Cell blade server with two 3.2GHz PowerXCell 8i processors. • The Cell BE processor consists of a single Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). • Fedora Linux 9 • SPE has 256KB of local store

  23. Evaluation(2/5) Applications used

  24. Evaluation(3/5) speedup

  25. Evaluation(4/5) Comparison with the IBM OpenCL framework for Cell BE.

  26. Evaluation(5/5) • two Intel Xeon X5660 hexa-core processors (CPU) • an NVIDIA Tesla C1060 GPU (GPU). The speedup of the OpenCLapplicationswith multicore CPUs and a GPU.

  27. Conclusion • This paper presents the design and implementation of an OpenCL runtime and OpenCL C source-to-source translator that target heterogeneous accelerator multicorearchitectures with local memory.

  28. Web-based variable expansion(1/3) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit

  29. Web-based variable expansion(2/3) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit

  30. Web-based variable expansion(3/3) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit

More Related