Exploiting SIMD parallelism with the CGiS compiler framework Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm (Saarland University)
Outline • CGiS • Language, compiler and GPU back-end • SIMD back-end • Hardware • Challenges • Transformations and optimizations • Experimental results • Future Work • Conclusion
CGiS • C-like data-parallel programming language • Goals: • Exploitation of parallel processing units in common PCs (GPUs, SIMD units) • Easy access for inexperienced programmers • High abstraction level • 32-bit scalar and small vector data types • Two forms of explicit parallelism: SPMD (iteration) and SIMD (vector types)
CGiS Example: YUV to RGB

  PROGRAM yuv_to_rgb;
  INTERFACE
    extern in  float3 YUV<_>;
    extern out float3 RGB<_>;
  CODE
    procedure yuv2rgb (in float3 yuv, out float3 rgb)
    {
      rgb = yuv.x + [0, 0.344, 1.77 ] * yuv.y
                  + [1.403, 0.714, 0] * yuv.z;
    }
  CONTROL
    forall (yuv in YUV, rgb in RGB) {
      yuv2rgb (yuv, rgb);
    }
CGiS Compiler Overview [Diagram: the CGiS source is fed to the CGiS compiler, which emits PPU code and an interface; the application links against these together with the CGiS runtime.]
CGiS for GPUs • nVidia G80: • 128 floating-point units • Processes both scalar and vector data • 2-on-2 mapping of CGiS’ parallelism • Code generation for various GPU generations • NV30, NV40, G80, CUDA • Limited access to hardware features through the driver
SIMD Hardware • Every common PC features SIMD units • Intel’s SSE and Freescale’s AltiVec • SIMD parallelism not easily accessible to standard compilers • Well-known vectorization problems • Data access: the hardware requires 16-byte aligned loads • Memory access is slow but cached • Only 4-way SIMD vector parallelism usable
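To make the alignment constraint concrete, here is a minimal C sketch using SSE intrinsics (not taken from the slides; the function add_streams and its signature are invented for illustration, and a, b, out are assumed 16-byte aligned with n a multiple of 4). _mm_load_ps faults on addresses that are not 16-byte aligned, so a back-end must either establish alignment or fall back to slower unaligned loads:

  #include <xmmintrin.h>  /* SSE intrinsics */

  /* Illustrative sketch: add two float streams 4 elements at a time. */
  void add_streams(const float *a, const float *b, float *out, int n)
  {
      for (int i = 0; i < n; i += 4) {
          __m128 va = _mm_load_ps(a + i);   /* aligned load: a+i must be 16-byte aligned */
          __m128 vb = _mm_load_ps(b + i);
          _mm_store_ps(out + i, _mm_add_ps(va, vb));  /* 4 additions at once */
      }
  }
  /* _mm_loadu_ps would tolerate misaligned addresses, at a speed cost. */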
The SIMD Back-end • Goal: map CGiS’ parallelism to SIMD hardware • “2-on-1” mapping • SIMD vectorization problems • Data-dependency analyses avoided by design • Control flow • Divergence in consecutive elements • Misalignment and data layout • Reordering might be needed • Gathering operations are bottlenecks in load-heavy algorithms on multidimensional streams
Transformations and Optimizations • Control flow conversion • If/loop conversion • Loop sectioning for 2D streams • Increase cache performance for gather accesses • Kernel flattening • IR transformation that replaces compound variables and operations by scalar ones • “2-on-1”
Control Flow Conversion • Full inlining • If/loop conversion with a slightly modified Allen-Kennedy algorithm • No guarded assignments • Masks for select operations are the results of vector compares • Variables live and written after a control flow join are copied at the branching • Select operations are inserted at the join
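A hedged C sketch of this mask-and-select scheme with SSE intrinsics (illustrative names, not actual compiler output). The scalar branch if (x > t) y = a; else y = b; becomes a vector compare yielding an all-ones/all-zeros mask per lane, plus a select built from AND/ANDNOT/OR, since plain SSE has no blend instruction:

  #include <xmmintrin.h>

  /* select_ps(mask, a, b): per lane, a where mask is all ones, b elsewhere. */
  static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
  {
      return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
  }

  /* If-converted form of: if (x > t) y = a; else y = b; */
  __m128 if_converted(__m128 x, __m128 t, __m128 a, __m128 b)
  {
      __m128 mask = _mm_cmpgt_ps(x, t);  /* vector compare produces the mask */
      return select_ps(mask, a, b);      /* select inserted at the join */
  }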
Loop Sectioning • Adaptation of the iteration sequence to better exploit cached data • Only interesting for 2D streams • Iteration is subdivided into stripes • Stripe width depends on access pattern, cache size and local variables
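A plain-C sketch of the idea (an assumed shape, not the compiler’s actual output; STRIPE stands in for the width derived from cache size, access pattern and local variables):

  /* Walk a 2D stream stripe by stripe instead of full row by row, so
     rows touched by gather (neighborhood) accesses stay in the cache. */
  enum { STRIPE = 64 };  /* illustrative width */

  void process(float *img, int width, int height)
  {
      for (int x0 = 0; x0 < width; x0 += STRIPE) {
          int x1 = (x0 + STRIPE < width) ? x0 + STRIPE : width;
          for (int y = 0; y < height; ++y)
              for (int x = x0; x < x1; ++x)
                  img[y * width + x] *= 0.5f;  /* stand-in for the kernel body */
      }
  }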
Kernel Flattening • SIMD vectorization for yuv2rgb not applicable • Thus “flatten” the procedure or kernel: • Code transformation on the IR • All variables and all statements are split into scalar ones • Those can be subjected to SIMD vectorization

  procedure yuv2rgb (in float3 yuv, out float3 rgb)
  {
    rgb = yuv.x + [0, 0.344, 1.77 ] * yuv.y
                + [1.403, 0.714, 0] * yuv.z;
  }
Kernel Flattening Example • Procedure yuv2rgb_f now features data types suitable to be SIMD-parallelized

  procedure yuv2rgb_f (in  float yuv_x, in  float yuv_y, in  float yuv_z,
                       out float rgb_x, out float rgb_y, out float rgb_z)
  {
    float cy = 0.344, cz = 1.77, dx = 1.403, dy = 0.714;
    rgb_x = yuv_x + dx * yuv_z;
    rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
    rgb_z = yuv_x + cz * yuv_y;
  }
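After flattening, each scalar statement can become one 4-wide SIMD operation over four consecutive stream elements. A hedged SSE sketch of what such code could look like (a hand-written illustration, not generated output; it assumes n is a multiple of 4 and that the components have already been reordered into separate 16-byte aligned arrays, which leads to the data-layout issue below):

  #include <xmmintrin.h>

  void yuv2rgb_simd(const float *yuv_x, const float *yuv_y, const float *yuv_z,
                    float *rgb_x, float *rgb_y, float *rgb_z, int n)
  {
      const __m128 cy = _mm_set1_ps(0.344f), cz = _mm_set1_ps(1.77f);
      const __m128 dx = _mm_set1_ps(1.403f), dy = _mm_set1_ps(0.714f);
      for (int i = 0; i < n; i += 4) {   /* four stream elements per step */
          __m128 x = _mm_load_ps(yuv_x + i);
          __m128 y = _mm_load_ps(yuv_y + i);
          __m128 z = _mm_load_ps(yuv_z + i);
          _mm_store_ps(rgb_x + i, _mm_add_ps(x, _mm_mul_ps(dx, z)));
          _mm_store_ps(rgb_y + i, _mm_add_ps(x, _mm_add_ps(_mm_mul_ps(cy, y),
                                                           _mm_mul_ps(dy, z))));
          _mm_store_ps(rgb_z + i, _mm_add_ps(x, _mm_mul_ps(cz, y)));
      }
  }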
Kernel Flattening • But: data layout doesn’t fit • No stride-one access for single components • Reordering of data required • Locally via permutes or shuffles • Globally via memory copy
Global vs. Local Reordering • Global reordering • Reusable for further iterations • Simple, but expensive in-memory copy • Destroys locality for gather accesses • Local reordering • Original stream data untouched • Insertion of possibly many relatively cheap in-register permutation operations • Locality for gathering preserved
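For local reordering, here is a sketch of how in-register permutations can turn four consecutive elements (array-of-structures) into one register per component (structure-of-arrays). Shown for a float4 stream for simplicity, since a float3 stream needs a longer shuffle sequence; the helper name and the use of the _MM_TRANSPOSE4_PS macro from xmmintrin.h are this sketch’s choices, not necessarily what the compiler emits:

  #include <xmmintrin.h>

  /* aos points at four float4 elements (16 floats, 16-byte aligned).
     The stream in memory stays untouched; only registers are permuted. */
  void load_soa(const float *aos, __m128 *x, __m128 *y, __m128 *z, __m128 *w)
  {
      __m128 e0 = _mm_load_ps(aos +  0);   /* x0 y0 z0 w0 */
      __m128 e1 = _mm_load_ps(aos +  4);   /* x1 y1 z1 w1 */
      __m128 e2 = _mm_load_ps(aos +  8);   /* x2 y2 z2 w2 */
      __m128 e3 = _mm_load_ps(aos + 12);   /* x3 y3 z3 w3 */
      _MM_TRANSPOSE4_PS(e0, e1, e2, e3);   /* 4x4 in-register transpose */
      *x = e0;  /* x0 x1 x2 x3 */
      *y = e1;  /* y0 y1 y2 y3 */
      *z = e2;  /* z0 z1 z2 z3 */
      *w = e3;  /* w0 w1 w2 w3 */
  }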
Experimental Results • Tested on an Intel Core 2 Duo 1.83 GHz and a PowerPC G5 1.8 GHz • Compiled with intrinsics on gcc 4.0.1 • Examples • Image processing: Gaussian blur • Loop sectioning • Computation of the Mandelbrot set • Control flow conversion • Block cipher encryption: RC5 • Kernel flattening
Future Work • Replace intrinsics with inline assembly • Improvement of conditionals • Better control over register allocation • Improvement of register re-utilization for AltiVec • Improves with inline assembly • Cell back-end • SIMD instruction set close to AltiVec • Work list algorithm to distribute stream parts to the single PEs • More applications
Conclusion • CGiS abstracts GPUs as well as SIMD units • The SIMD back-end of the CGiS compiler produces efficient code • Different transformations and optimizations are needed than for the GPU back-end • Full control flow conversion is needed • Gather accesses gain speed with loop sectioning • Kernel flattening enables better exploitation of SIMD parallelism