OpenCL Compiler Support Based on Open64 for MPUs+GPUs Yu-Te Lin, Chung-Ju Wu, Chia-Han Lu, Shao-Chung Wang, Jenq-Kuen Lee Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Outline • OpenCL overview • Issues on supporting OpenCL • On-going implementation on Open64 • Summary & Discussion
Open Computing Language • Programming framework for heterogeneous platforms consisting of CPUs, GPUs, and other processors. • Initially developed by Apple Inc. and submitted to the Khronos Group in 2008. • Open and royalty-free. • OpenCL Language • C language extension • built-in functions • OpenCL Runtime • platform APIs • runtime APIs OpenCL tutorial, IEEE HotChips, Aug. 23, 2009
OpenCL Framework Overview Application OpenCL Kernels Host Program Separate programs into host-side and kernel-side code fragments __kernel void dot(__global const float4 *a, __global const float4 *b, __global float4 *c) { int tid = get_global_id(0); c[tid] = a[tid] * b[tid]; } int main(int argc, char **argv) { ... clBuildProgram(program, ...); clCreateKernel(program, "dot", ...); ... } OpenCL Framework OpenCL Runtime OpenCL Compiler Compiler • compiles the OpenCL C language just-in-time Runtime • allows the host program to manipulate the context Runtime APIs Front-end Back-end Platform APIs Platform MPU: host, kernel program GPU: kernel program MPU GPU OpenCL_for_Halifux.pdf, OpenCL overview, Intel Visual Adrenaline
OpenCL Sample Code main.c kernel.cl int runCL(int *a, int *b, int *c, int size) { // Device initialization clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); ... // Program and kernel creation char *source = load_program_source("kernel.cl"); cl_program program = clCreateProgramWithSource(context, 1, (const char**) &source, NULL, &err); clBuildProgram(program, 1, &device, NULL, NULL, NULL); cl_kernel kernel = clCreateKernel(program, "add", &err); cl_mem buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, NWITEMS*sizeof(cl_uint4), NULL, NULL); ... // Device teardown clFinish(queue); ... } int main() { int a[100], b[100], c[100]; int i, size = 100; runCL(a, b, c, size); return 0; } Configure platform devices __kernel void add(__global float *a, __global float *b, __global float *c) { int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; } Convert kernel source into a character array The OpenCL compiler is embedded in the API Dispatch kernels on devices int main() { int a[100], b[100], c[100]; int size = 100; int i; for(i=0; i<size; i++) c[i] = a[i] + b[i]; return 0; } Prepare memory objects OpenCL Runtime
OpenCL Execution Scenario Data-level parallelism kernel A { code 1 code 2 code 3 } A:1,2,3 data(0) A:1,2,3 data(1) A:1,2,3 data(2) A:1,2,3 data(3) A:1,2,3 data(4) A:1,2,3 data(5) Task-level parallelism kernel A { code 1 } kernel B { code 2 } kernel C { code 3 } OpenCL Runtime A:1 B:2 C:3 A:1 B:2 C:3
Supporting OpenCL • Syntax parsing by the compiler • qualifiers • vectors • built-in functions • Optimizations on a single core • Runtime implementation • handles multi-core issues • Co-work with the device vendor __kernel void add(__global float4 *a, __global float4 *b, __global float4 *c) { int gid = get_global_id(0); float4 data = (float4)(1.0, 2.0, 3.0, 4.0); c[gid] = a[gid] + b[gid] + data; } Platform APIs Runtime APIs MPU GPU
OpenCL Compiler Implementation • Nowadays, LLVM is the common solution, used by: • Apple • ATI • NVIDIA • RapidMind • ... • Clang serves as the front-end parser.
ATI SDK OpenCL Support • We used the ATI SDK to try OpenCL programs. Example kernel: __kernel void memset(__global uint4 *dst) { uint4 a = (uint4)(1, 2, 3, 4); uint4 b = (uint4)(97, 98, 99, 100); uint4 c = a * b; dst[get_global_id(0)] = c; } • LLVM -O0 compilation emits a long IL listing: one dcl_literal per constant (e.g. int 1, 4, 64, 80), many private loads/stores (get32BitStorePrivate, get128BitStorePrivate, get128BitLoadPrivate), a call to get_global_id, the vector multiply, and a final get128BitStoreUAV. • The internal optimizer and linker fold the multiply component by component into a single literal: { 97, 97, 97, 97 } → { 97, 97, 97, 400 } → { 97, 97, 297, 400 } → { 97, 196, 297, 400 }, leaving a short optimized function (a handful of dcl_literal declarations such as int 97, 400, 297, 196, the get_global_id address computation, and the store to UAV).
Our Experiences on Open64 • PACDSP compiler • based on the Open64 compiler • intermediate representation: WHIRL • CGIR-level implementation • PACDSP impact • distributed register files • 5-way VLIW
Previous Works for Distributed Register Files • Local register allocation • Compiler Supports and Optimizations for PAC VLIW DSP Processors [Lin, LCPC’05] • Register Allocation for VLIW DSP Processors with Irregular Register Files [Lin, CPC’06] • PALF: Compiler Supports for Irregular Register Files in Clustered VLIW DSP Processors [Lin, CC:PE’07] • Effective Code Generation for Distributed and Ping-Pong Register Files: a Case Study on PAC VLIW DSP Cores [Lin, JoVSPS’08] • Expression Rematerialization for VLIW DSP Processors with Distributed Register Files [Wu, CPC’09] • Global register allocation • A Local-Conscious Global Register Allocator for VLIW DSP Processors with Distributed Register Files [Lu, CPC’07] • LC-GRFA: Global Register File Assignment with Local Consciousness for VLIW DSP Processors with Non-uniform Register Files [Lu, CC:PE’09] • Local optimization • Copy Propagation Optimizations for VLIW DSP Processors with Distributed Register Files [Wu, LCPC’06] • Loop optimization • Enabling Compiler Flow for Embedded VLIW DSP Processors with Distributed Register Files [Chen, LCTES’07]
Our On-going Implementation Front-end Back-end .c .B .s cc142 .spin wgen42 WHIRL CGIR Opt asm vendor toolkits OpenCL spec. WHIRL node vendor runtime library GCC 4 parser GCC 4 tree structure Very High WHIRL vendor architecture High WHIRL • Qualifiers • WHIRL symbol table • Vectors • initialization • operations • built-in functions • extract essential modules from the ATI SDK Mid WHIRL Low WHIRL Very Low WHIRL GPU asm code CPU asm code • preserve qualifier information • vector type optimizations
Qualifier Parsing • Support OpenCL qualifiers: __global, __local, __constant, __private, and __kernel • Add the function c_parser_ocl_qualifiers to parse a qualifier and build an attribute tree node • Add functions such as handle_global_attribute and handle_local_attribute to set the corresponding flag static tree c_parser_ocl_qualifiers (c_parser *parser) { ... attr = build_tree_list (attr_name, NULL_TREE); attrs = chainon (attrs, attr); ... return attrs; } static tree handle_global_attribute (tree *node, ...) { DECL_GLOBAL (*node) = 1; return NULL_TREE; }
Vector Parsing • Built-in vector data types • Support OpenCL vector data types such as float4, int4, ... • With the help of the GCC vector extension. • The idea is to use a union to define an OpenCL vector. • Create the built-in vector tree node in the GCC front-end. typedef float __ocl_float4 __attribute__((vector_size(16))); typedef union { struct { float x, y, z, w; }; __ocl_float4 _ocl_vec; } float4; d = build_decl (FIELD_DECL, get_identifier ("w"), float_type_node); TREE_CHAIN(d) = decls; ... vec = build_vector_type_for_mode (float_type_node, V4SFmode); d = build_decl (FIELD_DECL, get_identifier ("_ocl_vec"), vec); TREE_CHAIN(d) = decls; ... record_builtin_type (RID_TYPEDEF, "float4", float4_type_node);
Vector Parsing • Vector Initialization • Add c_parser_parend_init() for vector data initialization. • Similar to c_parser_braced_init(), used for array initialization. • a = (float4)(3.0, 5.0, 7.0, 9.0) • Vector Assignment • Add additional fields in c_expr to record vector expressions. • a.x = b.y or a.xyzw = b.xxyy • Modify c_parser_postfix_expression_after_primary() to reference vector elements and store them in c_expr. • Use build_modify_expr() to build the vector assignment. • Vector Operation • Type casting for scalars or vectors with different numbers of components. • Use GCC's vector operation if possible.
Vector Parsing • Data Initialization (figure: OpenCL source lowered to WHIRL, with INIT entries recorded in the symbol table)
Vector Parsing • Binary Operation
Runtime Compilation Flow on Multi-Core CPU (x86) When the runtime API clBuildProgram is invoked: ATI SDK libatiocl.so builtin-x86.bc internal optimizer and linker clc prelink.bc opt.s as kernel.cl ld Reuse stub code and metadata stub/metadata opencc OpenCL_kernel.s lib.c llvm-extract/llc builtin-x86.bc Open64 clc: OpenCL-LLVM compiler front-end .bc: LLVM intermediate representation
Experiment Set 1 • x86 optimization testing • 4-core CPU, Linux OS, ATI SDK • Small programs for common optimizations If-conversion: unsigned int count = 0; unsigned int i; count += (input[i%100] > 5) ? 1 : 2; Global code hoisting: for(i=0; i<5000; i++) { for(j=0; j<2000; j++) { output[101] = sum * 3; output[102] = sum / 2; output[i%100] = i + j; } } Copy propagation: int a, b, c, d, e; a = value; b = a; c = b; d = c; e = d; output[100] = e; Loop nest optimization: for(i=0; i<100; i++) for(j=0; j<100; j++) for(k=0; k<9000; k++) output[tid] = input[tid] * multiplier; Dead code elimination: int i, j, k; int sum, a=1, b=2; sum = a + b; for(i=0; i<100; i++) for(j=0; j<100; j++) for(k=0; k<9000; k++) a = i + j + k + b; output[100] = sum; Constant folding: int index, sum; int a = 10, b = 2; sum = 1 + 2 + 3 + 4; sum = sum + 5 + 6 + 7 + 8 + 9 + 10; sum += a * b + index; Common available expression: int i; int a, b, c, d, e, f; d = a + b + c + 20; e = (a + b + c) * 5; f = a + b + c - 99;
Experiment Set 2 • ATI OpenCL samples testing • 4-core CPU, Linux OS, ATI SDK • Real-world OpenCL samples DCT: for(uint k=0; k < blockWidth; k++) { uint index1 = (inverse) ? i*blockWidth + k : k*blockWidth + i; uint index2 = getIdx(groupIdx, groupIdy, j, k, blockWidth, width); acc += dct8x8[index1] * input[index2]; } inter[j*blockWidth + i] = acc; barrier(CLK_LOCAL_MEM_FENCE); acc = 0.0f; for(uint k=0; k < blockWidth; k++) { uint index1 = i*blockWidth + k; uint index2 = (inverse) ? j*blockWidth + k : k*blockWidth + j; acc += inter[index1] * dct8x8[index2]; } output[idx] = acc; EigenValue: uint count = 0; float prev_diff = (diagonal[0] - x); count += (prev_diff < 0) ? 1 : 0; for(uint i = 1; i < width; i += 1) { float diff = (diagonal[i] - x) - ((offDiagonal[i-1] * offDiagonal[i-1])/prev_diff); count += (diff < 0) ? 1 : 0; prev_diff = diff; } return count; Mandelbrot: int tid = get_global_id(0); int i = tid%width; int j = tid/width; float x0 = ((i*scale) - ((scale/2)*width))/width; float y0 = ((j*scale) - ((scale/2)*width))/width; float x = x0; float y = y0; float x2 = x*x; float y2 = y*y; float scaleSquare = scale * scale; uint iter = 0; for(iter=0; (x2+y2 <= scaleSquare) && (iter < maxIterations); ++iter) { y = 2 * x * y + y0; x = x2 - y2 + x0; x2 = x*x; y2 = y*y; } MatrixMultiplication: sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x; sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y; sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z; sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w; sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x; sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y; sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z; sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w; sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x; sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y; sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z; sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w; sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x; sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y; sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z; sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
Common Optimization Experiment • Taking a multi-core CPU as the OpenCL device to test optimizations • Comparison of execution time; the y-axis is the improvement
ATI OpenCL Sample Experiment • The performance comparison against the ATI SDK's LLVM is shown in the following graph. • We take four classical sample codes from the ATI SDK.
Reviewers' Comments • Is it sufficient to test on a CPU instead of a GPU? • The differences are back-end code generation and the runtime compilation flow. • How do you handle the multi-dimensional thread execution model? • From the OpenCL perspective, the programmer is in charge of it. • How is a kernel function compiled differently from a non-kernel function? • Host program: host compiler. Kernel program: OpenCL compiler. • Does your OpenCL compiler handle both MPU and GPU code, or just GPU code? If both, what is the MPU architecture? • The OpenCL compiler only generates ONE target's assembly code. • Why is the ATI GPU targeted rather than the NVIDIA GPU? • We simply picked one to start the research work.
Summary/Discussion • Supporting OpenCL • OpenCL compiler • OpenCL runtime • OpenCL compiler on Open64 • front-end & back-end • Preliminary experiments • Open64 performs good optimizations • refinements are still required • Future work • continue work on GPU target information (targ-info)