430 likes | 627 Views
Multi-core Compilation an Industrial Approach. Alastair F. Donaldson EPSRC Postdoctoral Research Fellow, University of Oxford Formerly at Codeplay Software Ltd. Thanks to the Codeplay Sieve team: Pete Cooper, Uwe Dolinsky, Andrew Richards, Colin Riley, George Russell. Coverage.
E N D
Multi-core Compilationan Industrial Approach • Alastair F. Donaldson • EPSRC Postdoctoral Research Fellow, University of Oxford • Formerly at Codeplay Software Ltd. • Thanks to the Codeplay Sieve team: • Pete Cooper, Uwe Dolinsky, Andrew Richards, Colin Riley, George Russell
Coverage • Limits of automatic parallelisation • Programming heterogeneous multi-core processors • Codeplay Sieve Threads approach • Like pthreads for accelerator processors • The promises and limitations of OpenCL Laboratory session: Sieve Partitioning System for Cell Linux – a Practical Introduction
Limits of automatic parallelisation Dream:a tool which takes a serial program, finds opportunities for parallelism, produces parallel code optimized for target processor, preserves determinism • Part of why this has not been achieved • C/C++, pointers, function pointers, multiple source files, precompiled libraries • Why this will never be achieved • Many parallelisable programs require ingenuity to parallelise! • State-of-the-art: we are good at parallelising regular loops, when we can see all the code
Example: Floyd-Steinberg error diffusion 255 22 200 Threshold = 128 64 180 55 22 < 128 so set pixel value to 0 error = old – new = 22 error 255 0 200 7/16 255 0 210 64 180 55 3/16 5/16 1/16 68 187 56
Error diffusion can be parallelised 1 1 2 2 3 3 4 4 5 5 6 7 8 9 10 11 12 error 7/16 3 3 4 4 5 5 6 7 8 9 10 11 12 13 14 3/16 5/16 1/16 5 5 6 7 8 9 10 11 12 13 14 15 16 7 8 9 10 11 12 13 14 15 16 17 17 18 9 10 11 12 13 14 15 16 17 17 18 19 20 11 12 13 14 15 16 17 17 18 19 20 21 22 13 14 15 16 17 17 18 19 20 21 22 23 24 15 16 17 17 18 19 20 21 22 23 24 25 26 17 17 18 19 20 21 22 23 24 25 26 27 28 19 20 21 22 23 24 25 26 27 28 29 30
Error diffusion can be parallelised • ...but approach is problem-specific and requires human ingenuity • Panafiotis Metaxas: Parallel Digital Halftoning by Error-Diffusion, PCK50 (2003) • Previously believed to be non-parallelisable: “[the Floyd-Steinberg algorithm] is an inherently serial method; the value of [the pixel in the lower right corner of the image] depends on all m.n entries of [the input]” • Donald Knuth: Digital Halftones by Dot Diffusion, ACM Transactions on Graphics (1987)
Another example: collision response void apply_collisions(GameWorld* world, CollisionPair* collisions, int num_collisions) { for(int i=0; i<num_collisions; i++) { world -> update_velocities(collisions[i].first, collisions[i].second); } } • Can process (a, b) and (c, d) in either order • If { a, b } intersects { c, d } then cannot process (a, b) and (c, d) simultaneously • How should we deal with this? • Locks? Transactional memory? Data preprocessing?
Our perspective Let's nail auto-parallelisation for special cases In general, we are stuck with multi-threading Let's design sophisticated tools to help with multi-threaded programming Modern problem: multi-threaded programming for heterogeneous multi-core is very hard
Heterogeneous architectures x86 PC Power Processing Element Host Main memory Direct memory access (DMA) mailbox/interrupt data bus Accelerator Accelerator Accelerator RAM RAM RAM Synergistic Processing Element, GPU, FPGA, etc.
Example: Cell Broadband Engine PPE = Power Processing Element (Host) SPE = Synergistic Processing Element (Accelerator) 128-bit SIMD processor (3.2 GHz) 256 KB RAM PPE Dual hyperthreaded PowerPC core, connected to main memory SPE SPE SPE SPE SPEs access main memory via DMA interface SPE SPE SPE SPE
Programming heterogeneous machines • Write separate programs for host and accelerator • Lots of “glue” code • launch accelerators • orchestrate data movement • clear down accelerators • Can achieve great performance, but: • Time consuming • Non portable • Error prone (limited scope for static checking) • Multiple source files for logically related functionality
Illustrative example • Serial code for Mandelbrot loops #define HEIGHT ... #define WIDTH ... unsigned char mand(int, int); void computeMandelbrot(unsigned char* pixels) { for(int y = 0; y < HEIGHT; ++y ) { for(int x = 0; x < WIDTH; ++x) { pixels[y*WIDTH + x] = mand(x, y); } } }
Illustrative example (continued) unsignedchar mand(int, int); #define BLOCK ... volatileunsignedchar myPixels[BLOCK] __attribute__ ((aligned (16))); volatile context ctx __attribute__ ((aligned (16))); int main(unsignedlonglong spu_id, unsignedlonglong ctxAddress) { spu_mfcdma32(&ctx, ctxAddress, sizeof(context), MFC_GET_CMD); spu_mfcstat(MFC_TAG_UPDATE_ALL); for(int y=0; y<ctx.length; y++) { for(int x=0; x<WIDTH; x+=BLOCK) { int N = (WIDTH-x < BLOCK ? WIDTH-x : BLOCK); for(int k=0; k < N; k++) { myPixels[k] = mand(x+k, ctx.row + y); } spu_mfcdma32(myPixels, ctx.dest+y*WIDTH+x, N*sizeof(unsignedchar), MFC_PUT_CMD); spu_mfcstat(MFC_TAG_UPDATE_ALL); } } return 0; } #define HEIGHT ... #define WIDTH ... typedefstruct { int row; int length; unsignedchar* dest; int padding; } context; // PPE uses this handle to run SPE code extern spe_program_handle_t speComputeMandelbrot; void ppeComputeMandelbrot(unsignedchar* pixels) { speid_t spe_ids[8]; context ctxs[8] __attribute__ ((aligned (16))); constint count = HEIGHT / 8; for(int i=0, offset=0; i<8; i++, offset += count) { ctxs[i].length = (i==7) ? HEIGHT - offset : count; ctxs[i].dest = & (pixels[offset*WIDTH]); ctxs[i].row = offset; spe_ids[i] = spe_create_thread( &speComputeMandelbrot, &ctxs[i]); } for(int i=0; i<8; i++) { spe_wait(spe_ids[i]); } } SPE Code PPE Code
Why bother with heterogeneous architectures? Homogeneous multi-threading relatively easier Every thread running on same type of processor All methods compiled as usual No need for explicit data movement code Minimal start-up code: pthread_create(...) Heterogeneous architectures can give better performance Scratchpad memory => contention-free local access Accelerator faster than host at e.g. vector processing PlayStation is a registered trademark of Sony Computer Entertainment Inc.
Codeplay Sieve Thread approach Wrap code inside sievethread block to say “run this code asynchronously on accelerator” #include <libsieve> void GameWorld::doFrame(...) { // Suppose calculateStrategy and // detectCollisions are independent this->calculateStrategy(...); this->detectCollisions(); this->updateEntities(); this->renderFrame(); } #include <libsieve> void GameWorld::doFrame(...) { int handle = sievethread(...) { this->calculateStrategy(...); } this->detectCollisions(); sieveThreadJoin(handle); this->updateEntities(); this->renderFrame(); } Call graph for calculateStrategy compiled for accelerator Offload to accelerator – non-blocking Host can wait for sievethread to complete • Full implementation for Cell. Sievethread runs on SPE.
Parameters to sievethread block #include<libsieve> void start_accelerators(int* handles) { for(int i=0; i<NUM_SPES; i++) { handles[i] = sievethread { do_work(i); }; } } void wait_for_accelerators(int* handles) { for(int i=0; i<NUM_SPES; i++) { sieveThreadJoin(handles[i]); } } Illegal: i may change, or disappear!
Parameters to sievethread block #include<libsieve> void start_accelerators(int* handles) { for(int i=0; i<NUM_SPES; i++) { handles[i] = sievethread(i) { do_work(i); }; } } void wait_for_accelerators(int* handles) { for(int i=0; i<NUM_SPES; i++) { sieveThreadJoin(handles[i]); } } Solution – pass i by value as a parameter to sievethread block Parameters and handles omitted from many of the following examples
Working with multiple threading libraries #ifdef __WIN32__ #include<windows.h> #define thread_handle_t WinThreadHandle_t #define createThread(context, program) WinThreadCreate(context, program) #else #ifdef __LINUX__ #include<pthread.h> #define thread_handle_t pthread_t #define createThread(context, program) pthread_create(context, program) #else #ifdef __SIEVE_THREADS__ #include<sievethread.h> #define thread_handle_t SieveThreadHandle_t #define createThread(context, program) sievethread(context) { \\ program(context); \\ } #endif #endif #endif
Pointer recap int * p; // pointer to integer *p = 5 // assign location pointed to by p to 5 int x; p = &x // p is address of x const int* p; // pointer to constant integer int const* p; // means same thing
Pointer types Separate pointers into two categories: Pointer to host data: marked with __outer qualifier Pointer to accelerator data: not marked Similar to const, volatile. “__” is common in C++ Accelerator memory (kilobytes) Host memory (gigabytes) Not supported int* x 5 112 int__outer * y int __outer* __outer * z int __outer* * w
Pointer types Pointers outside sievethread context: implicitly __outer On accelerator, dereferencing __outer pointer => DMA transfer Illegal to assign between local and outer pointers For sensible code, can statically eliminate attempts to dereference host address as if it were accelerator address, and vice versa C++ => programmer can always get their way if they really want!
Pointer types: example float f_out = 3.0f; float* out_ptr; // Implicitly __outer pointer sievethread { float f_in = 5.0f; float* in_ptr; in_ptr = &f_in; out_ptr = &f_out; *in_ptr = *out_ptr; // DMA: Host -> Accelerator out_ptr = in_ptr; // ILLEGAL } DMA Accelerator Host 3.0 5.0 f_in 3.0 f_out in_ptr out_ptr
Method duplication Method has pointer/reference parameters void func(float* x, int* y) { ... } • Called from sievethread context with mixture of outer and local pointers/references int x; sievethread { float y; func(&y, &x); // signature: void (float*, __outer int*) } • For each accelerator calling context, compile separate version of method
Method duplication example class Circle { ... public: staticbool collides(Circle* c1, Circle* c2) { ... } }; void my_func() { Circle out_circ_1, out_circ_2; sievethread { Circle in_circ_1, in_circ_2; if( collides( &out_circ_1, &out_circ_2) && collides( &out_circ_1, &in_circ_2) && collides( &in_circ_1, &out_circ_2) && collides( &in_circ_1, &in_circ_2) ) { ... } } } collides duplicated: bool collides( __outer Circle*, __outer Circle*) collides duplicated: bool collides( __outer Circle*, Circle*) collides duplicated: bool collides(Circle*, __outer Circle*) collides duplicated: bool collides(Circle*, Circle*)
Challenges Function pointers, virtual methods Method duplication across multiple compilation units Silent deduction of __outer (type inference)
Function pointers Given function type: typedef void (* int_to_void) (int); • + methods void meth1(int); void meth2(int); • + function pointer: int_to_void f_ptr; • + call in sievethread context sievethread { f_ptr(25); ... } • Don’t know until runtime which method is called • How do we know what to duplicate?
Possible solutions Compile and load all matching methods May be hundreds: Long compilation time Large code size Compile all methods, load on demand Slow compilation Significant runtime overhead Compile methods on demand Prohibitive runtime overhead Requires access to compiler at runtime Delegate call to host Defeats point of offloading Would only work if all pointers are __outer Useful as a fallback
Our solution – function domains Sievethread block equipped with domain of functions OK to call via pointers Duplicate call graphs for meth1 and meth3, have methods loaded and ready to call typedef void (* int_to_void) (int); void meth1(int x) { ... } void meth2(int x) { ... } void meth3(int x) { ... } int_to_void f_ptr; ... sievethread [ meth1, meth3 ] { f_ptr(25); } Runtime exception iff_ptr == meth2
Domains in practice Virtual methods handled similarly // 2d table of methods collisionFunction collisionFunctions[3][3] = { fix_fix, fix_mov, ..., dead_dead }; sievethread [ fix_fix, fix_mov, ..., dead_dead ]{ for(...i, j...){ // Apply function according to objects’ status collisionFunctions[ status[i] ] [ status[j]] (...); } }
Method duplication across compilation units Box.h class Box{ public: bool collides(Entity &); ... Need to duplicate collides, but don’t have source code AB #include Box.cpp // Implementation of ‘collides’ bool Box::collides(Entity &) { ... } Physics.cpp Box b, c; sievethread { if (b.collides(c)) { … } }
Method duplication across compilation units Current solution: Mark externally called functions to be duplicated bool collides(Entity &) __attribute((__duplicate( bool (__outer Entity &) __outer ))); • If collides calls other functions in its compilation unit these will be automatically duplicated • Possible automatic solution: • Build up “compilation conditions” while processing files • Repeatedly process files until conditions are fulfilled
Silent deduction of__outer __outer short* x; __outer int* y; // z is given type ‘__outer int*’ due to initializer int* z = x; // OK to use ‘short*’ rather than ‘__outer short*’ in cast x = (short*) y; • Two simple relaxations help with automatic method duplication
Other features Mark method sievethread: only compile for accelerator (can then be hand-optimized) Overloading based on __outer Facility for accelerator to invoke method on host e.g. to allocate a lot of memory Sieve Partitioning System comes with libraries to help optimize data movement Compiler generates advice to suggest how to use these libraries
Development approach Identify code to offload (manually, using profiler) Enclose in sievethread block (fix a few __outer issues) Basic offload may not yield optimal performance ...but any offloading frees up host Incremental performance improvements Overload core functions with sievethread versions optimized for accelerator Compiler advice guides optimization
Performance Results on PS3 (image processing, raytracing, fractals): Linear scaling With 6 SPEs, speedup between 3x and 14x over host, after some optimization Possible to hand-optimize as much as desired Tradeoff: hand-optimization increases performance at expense of portability
OpenCL Language and API from Khronos group for programming heterogeneous multicore systems Codeplay is a contributing member Motivation: unify bespoke languages for programming CPUs, GPUs and Cell BE-like systems Host code: C/C++ with API calls to launch kernels to run on devices Kernels written in OpenCL C – C99 with some restrictions and some extensions OpenCL is portable, but too low level for large applications
Sievethreads → OpenCL Hot spots enclosed in sievethread blocks C++ application Hot spot 3 Hot spot 1 Low level OpenCL code for data movement automatically generated Hot spot 2 Automatic translation OpenCL kernel for hot spot 1 Runs on accelerator(s) C++ application Hot spots replaced by kernel calls OpenCL kernel for hot spot 2 OpenCL kernel for hot spot 3 Runs on host
Sievethreads → OpenCL - challenges Various limitations in OpenCL 1.0 (e.g. no recursion, no function pointers) which will probably go away Severe (prohibitive?) limitation: accelerator cannot randomly access host memory void some_method(__outerint* x) { ... = *x; // Read from “who knows where?” in host memory } • On Cell processor, DMA from host on demand is fine • OpenCL does not support this (due to limitations of GPUs)
Related work Hera-JVM (University of Glasgow) - Java virtual machine on Cell SPEs CUDA (NVIDIA), Brook+ (AMD) - somewhat subsumed by OpenCL Cilk++ (Cilk Arts) - shared memory only OpenMP (IBM have an implementation for Cell) PS-Algol (Atkinson, Chisholm, Cockshott) – pointers to memory vs. pointers to disk is analogous to local vs. outer pointers
Summary Sievethreads: practical way get C++ code running on heterogeneous systems Can co-exist with other threading methods Core technology: method duplication Main area for future work: data movement Data movement optimizations Declarative language for specifying data movement patterns
Thank you! After the break, come back and use the Sieve Partitioning System! Codeplay are interested in academic collaborations, e.g. student project applying sievethreads to a large open-source application