330 likes | 587 Views
An OpenCL Framework for Heterogeneous Multicores with Local Memory. PACT 2010 Jaejin Lee, Jungwon Kim, Sangmin Seo , Seungkyun Kim, Jungho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo , Seung Hak Lee,
E N D
An OpenCL Frameworkfor Heterogeneous Multicores with Local Memory PACT 2010 Jaejin Lee, JungwonKim, SangminSeo, SeungkyunKim, JunghoPark, HonggyuKim, Thanh Tuan Dao, YongjinCho, Sung JongSeo, SeungHakLee, Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-DeokChoi School of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea Samsung Electronics Co., Nongseo-dong, Giheung-gu, Yongin-si, Geonggi-do 446-712, Korea Presenter : Jen-Jung, Cheng
Outline • Introduction • Background • OpenCL platform • Design and Implementation • OpenCL runtime • Work-item coalescing • Web-based variable expansion • Preload-poststore buffering • Evaluation • Conclusion
Introduction(1/2) GPC APC APC APC The target architecture … L1$ Local store Local store Local store L2$ Interconnect bus Main memory
Introduction(2/2) • Two major challenges in design and implementation of the OpenCL framework • Implements hundreds of virtual PEs with a single accelerator core and make them efficient • Overcomes the limited size and incoherency of the local store
OpenCLplatform(1/2) The OpenCL platform model
OpenCLplatform(2/2) • OpenCL platform:a host processor, compute devices, compute units, and processing elements • Abstract Index Space:global ID, workgroup ID, and local ID • Memory Region:private, local, constant, and global • Synchronization:work-group barrier and command-queue barrier
OpenCL runtime(1/3) Mapping platform components to the target architecture
OpenCL runtime(2/3) OpenCLRuntime thread GPC Command Scheduler Command Executor DAG CUStatus Array Event Queue Device Command Queues Execution ordering Issue Assign CU OpenCLHost thread CU … Ready Queue … CU Work-groups The command scheduler and the commandexecutor
OpenCL runtime(3/3) • Theruntime implements a software-managed cachein each APC‘s local store. It caches the contents of theglobaland constant memory. • To guarantee OpenCL memory consistency for shared memory objects between commands, the command executor flushes software-managed caches whenever it dequeues a command from the ready-queue or it removes an event object from the DAG after the associated command has completed.
Work-item coalescing(1/3) • Executing work-items on a CU by switching one work-item to another incurs a significant overhead. • When a kernel and its calleefunctions do not contain any barrier, any execution ordering defined between the two statements from different work-items in the same work-group satisfies the OpenCLsemantics. • Work-item coalescing loop(WCL) iterates on the index space of a single work-group.
Work-item coalescing(2/3) __kernel void vec_add( __global float *a, __global float *b, __global float *c) { int id; id = get_global_id(0); c[id] = a[id] + b[id]; } Int __i, __ j, __k; __kernel void vec_add (__global float *a, __global float *b, __global float *c) { int id; for( __k = 0; __k < __local_size[2]; __k++ ) { for( __ j = 0; __ j < __local_size[1]; __ j++ ) { for( __ i = 0; __ i < __local_size[0]; __ i++ ) { id = get_global_id(0); c[id] = a[id] + b[id]; } } } } OpenCL C source-to-source translator
Work-item coalescing(3/3) S1 barrier(); S2 [S1’ barrier(); [S2’ if (c) { S1 barrier(); S2 } [t = C’; if (t) { [S1’ barrier(); [S2’ } while (c) { S1 barrier(); S2 } while (1){ [t = C’; if (!t) break; [S1’ barrier(); [S2’ }
Web-based variable expansion(1/5) • A kernel code region that needs to be enclosed with a WCL is called a work-item coalescing region (WCR). • A work-item private variable that is defined in one WCR and used in another needs a separate location for different work-items. • A du-chain for a variable connects a definition of the variable to all uses reached by the definition. • A web for a variable is all du-chains of the variable that contain a common use of the variable.
Web-based variable expansion(2/5) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit
Web-based variable expansion(3/5) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit Identifying du-chains
Web-based variable expansion(4/5) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit Identifying webs
Web-based variable expansion(5/5) WCR x1[][][]= x1=malloc() Entry t1 = C1 if (t1) while (1) =x1[][][] t2 = C2 if (t2) x = … x = … …=x x1[][][]= x1[][][]= barrier () =x1[][][] =x1[][][] x = … …=x Free(x1) Exit After variable expansion
Preload-poststore buffering(1/4) • Preload-poststore buffering enables gathering DMA transfers together for array accesses and minimizes the time spent waiting for them to complete by overlapping them.
Preload-poststore buffering(2/4) for(k = 0; k < ls[2]; k++ ) { for(j = 0; j < ls[1]; j++ ) { PRELOAD(buf_b, &b[0], ls[0]); PRELOAD(buf_a1, &a[j][0], ls[0]+1024); for(i = 0; i < ls[0]; i++ ) PRELOAD(buf_a2[i], &a[j][3*i+1]); WAITFOR(buf_b); for(i = 0; i < ls[0]; i++ ) PRELOAD(buf_c[i], &c[j][buf_b[i]]); for(i = 0; i < ls[0]; i++ ) { if( i < 100) buf_a1[i] = buf_c[i]; buf_c[i] = buf_a2[i]+ buf_a1[i+1024]; } POSTSTORE(buf_a1, &a[j][0], ls[0]+1024); for(i = 0; i < ls[0]; i++ ) POSTSTORE (buf_c[i], &c[j][buf_b[i]]); } } for(k = 0; k < ls[2]; k++ ) { for(j = 0; j < ls[1]; j++ ) { for(i = 0; i < ls[0]; i++ ) { if( i < 100) a[j][i] = c[j][b[i]]; c[j][b[i]] = a[j][3*i+1] + a[j][i+1024]; } } }
Preload-poststore buffering(3/4) • Buffering candidate … A 0 1 2 3 4 5 6 7 n-1 c*I + d, where c and d are loop invariant to L c*x + d, where x is an array reference and c and d are loop invariant to L. [lower bound : upper bound : stride] [1 : 3 * ls[0] - 2 : 3] 3 * i + 1 … 1 4 7 10 13 16 19 22 3 * ls[0] - 2 buf_a2
Preload-poststorebuffering(4/4) • Condition for single buffer • a loop-independent flow dependence (read-after-write) • a loop-independent output dependence (write-after-write)
Evaluation(1/5) • Experimental Setup • an IBM QS22 Cell blade server with two 3.2GHz PowerXCell 8i processors. • The Cell BE processor consists of a single Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). • Fedora Linux 9 • SPE has 256KB of local store
Evaluation(2/5) Applications used
Evaluation(3/5) speedup
Evaluation(4/5) Comparison with the IBM OpenCL framework for Cell BE.
Evaluation(5/5) • two Intel Xeon X5660 hexa-core processors (CPU) • an NVIDIA Tesla C1060 GPU (GPU). The speedup of the OpenCLapplicationswith multicore CPUs and a GPU.
Conclusion • This paper presents the design and implementation of an OpenCL runtime and OpenCL C source-to-source translator that target heterogeneous accelerator multicorearchitectures with local memory.
Web-based variable expansion(1/3) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit
Web-based variable expansion(2/3) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit
Web-based variable expansion(3/3) WCR x = … Entry t1 = C1 if (t1) while (1) …=x t2 = C2 if (t2) x = … x = … …=x x = … x = … barrier () … = x …=x x = … …=x Exit