Heterogeneous Computing in Charm++ David Kunzman
Motivations • Performance and Popularity of Accelerators • Our work currently focuses on Cell (and Larrabee) • Difficult to program accelerators • Architecture-specific code (not portable) • Many asynchronous events (data movement, multiple cores) • Heterogeneous Clusters Exist Already • Roadrunner at LANL (Opterons and Cells) • Lincoln at NCSA (Xeons and GPUs) • MariCel at BSC (Powers and Cells)
Goals • Portability of code • Code should be portable between systems with and without accelerators • Across homogeneous and heterogeneous clusters • Reduce programmer effort • Allow various pieces of code to be written independently • Pieces of code share the accelerator(s) • Scheduled by the runtime system automatically • Naturally extend the existing Charm++ model • Same programming model for all hosts and accelerators
Approach • Make entry methods portable between host and accelerator cores • Allows the programmer to write entry method code once and use the same code for all cores • Still make use of architecture/core specific features • Take advantage of the clear communication boundaries in Charm++ • Almost all data is encapsulated within chare objects • Data is passed between chare objects by invoking entry methods
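To make the communication boundary concrete, below is a minimal sketch of a Charm++ interface (.ci) declaration for such a chare; the module name, chare name, and entry method signature are hypothetical, chosen to match the examples later in the talk rather than taken from it.

  // Hypothetical interface (.ci) file: a 1D chare array whose data is
  // reached only by invoking its entry methods; the float array parameter
  // is marshalled by the runtime.
  module accumExample {
    array [1D] AccumChare {
      entry AccumChare();
      entry void accum(int inArrayLen, float inArray[inArrayLen]);
    };
  };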
Extending Charm++ • SIMD Instruction Abstraction • To reach any significant fraction of peak, must use SIMD instructions on modern cores • Abstract SIMD instructions so code is portable • Accelerated Entry Methods • May execute on accelerators • Essentially a standard entry method split into two stages • Function body (accelerator or host; limited) • Callback function (host; not limited)
SIMD Instruction Abstraction • Abstract SIMD instructions supported by multiple architectures • Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee • Generic C implementation when no direct architectural support is present • Types: vecf, veclf, veci, ... • Operations: vaddf, vmulf, vsqrtf, ...
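As an illustration of the generic fallback, here is a minimal sketch of how vecf and vaddf might be defined in plain C++ when no native SIMD support exists; the 4-element width and these exact definitions are assumptions for illustration, not the actual Charm++ implementation (which maps them to SSE, AltiVec/VMX, or SPE intrinsics where available).

  // Hypothetical generic fallback for the vecf type and vaddf operation.
  const int vecf_numElems = 4;              // assumed 4-wide float vector

  struct vecf { float v[vecf_numElems]; };  // generic vector of floats

  inline vecf vaddf(vecf a, vecf b) {       // element-wise add
    vecf r;
    for (int i = 0; i < vecf_numElems; ++i)
      r.v[i] = a.v[i] + b.v[i];
    return r;
  }

With a fallback of this kind, the SIMD-ized entry methods on the following slides compile and run unchanged on cores without vector units.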
Example Entry Method

  entry void accum(int inArrayLen, float inArray[inArrayLen]) {
    if (inArrayLen != localArrayLen) return;
    for (int i = 0; i < inArrayLen; ++i)
      localArray[i] = localArray[i] + inArray[i];
  };

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
Example Entry Method w/ SIMD

  entry void accum(int inArrayLen,
                   align(sizeof(vecf)) float inArray[inArrayLen]) {
    if (inArrayLen != localArrayLen) return;
    vecf *inArrayVec = (vecf*)inArray;
    vecf *localArrayVec = (vecf*)localArray;
    int arrayVecLen = inArrayLen / vecf_numElems;
    for (int i = 0; i < arrayVecLen; ++i)
      localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
    for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
      localArray[i] = localArray[i] + inArray[i];
  };

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
Accel Entry Method Structure

Standard
  Interface File:
    entry void entryName( …passed parameters… );
  Source File:
    void ChareClass::entryName( …passed parameters… ) { …function body… }

Accelerated
  Interface File:
    entry [accel] void entryName( …passed parameters… )
      [ …local parameters… ]
      { …function body… } callback_member_function;

Invocation (both): chareObj.entryName( …passed parameters… );
Example Accelerated Entry Method

  entry [accel] void accum(int inArrayLen,
                           align(sizeof(vecf)) float inArray[inArrayLen])
    [ readOnly  : int localArrayLen <impl_obj->localArrayLen>,
      readWrite : float localArray[localArrayLen] <impl_obj->localArray> ]
  {
    if (inArrayLen != localArrayLen) return;
    vecf *inArrayVec = (vecf*)inArray;
    vecf *localArrayVec = (vecf*)localArray;
    int arrayVecLen = inArrayLen / vecf_numElems;
    for (int i = 0; i < arrayVecLen; ++i)
      localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
    for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
      localArray[i] = localArray[i] + inArray[i];
  } accum_callback;

To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
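For completeness, here is a hypothetical sketch of the host-side callback member function named at the end of the accelerated entry method; contributing to a reduction to signal that the accumulation step is finished is only an illustrative assumption (Main, accumDone, and mainProxy are invented names), not code from the talk.

  // Hypothetical callback; runs on the host core after accum() completes
  // on an accelerator (or host) core and is free to use the full runtime.
  void AccumChare::accum_callback() {
    contribute(CkCallback(CkReductionTarget(Main, accumDone), mainProxy));
  }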
Timeline of Events • Runtime system… • Directs data movement (messages & DMAs) • Schedules accelerated entry methods and callbacks
Communication Overlap • Data movement automatically overlapped with accelerated entry method execution on SPEs and entry method execution on PPE
Handling Host Core Differences • Automatic modification of application data at communication boundaries • Structure of data is known via parameters and Pack-UnPack (PUP) routines • During packing process, add information on how the data is encoded • During unpacking, if needed, modify data to match local architecture
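As a sketch of how that structural knowledge looks in code, below is a minimal PUP routine for a hypothetical patch-like data structure; the member names are assumptions, but PUP::er, operator|, and PUParray are the standard Charm++ PUP framework calls.

  // Hypothetical data structure with its Pack-UnPack (PUP) routine; the
  // routine tells the runtime the structure of the data so it can be
  // serialized and, if needed, re-encoded for the receiving host's architecture.
  struct PatchData {
    int   localArrayLen;
    float *localArray;

    void pup(PUP::er &p) {
      p | localArrayLen;                        // scalar member
      if (p.isUnpacking())
        localArray = new float[localArrayLen];  // allocate on the receive side
      PUParray(p, localArray, localArrayLen);   // array contents
    }
  };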
Molecular Dynamics (MD) Code • Based on the object interaction seen in NAMD's nonbonded electrostatic force computation (simplified) • Coulomb's Law • Single-precision floating point • Particles evenly divided between patch objects • ~92K particles in 144 patches (similar to the ApoA1 benchmark) • Compute objects (self and pairwise) compute forces for patch objects • Patches integrate the combined force data and update particle positions
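As a rough illustration of the per-pair work done by the compute objects, here is a minimal scalar sketch of a Coulomb's-law force accumulation between two particles; the names and the simplified form (constants folded into a single coefficient, no cutoff) are assumptions for illustration, not the actual NAMD or benchmark code.

  #include <cmath>

  // Simplified, single-precision Coulomb force of particle j on particle i,
  // accumulated into fi; the real kernel is written with the vecf SIMD
  // abstraction and processes many particle pairs per entry method call.
  struct Particle { float x, y, z, q; };

  void addPairForce(const Particle &pi, const Particle &pj,
                    float fi[3], float coef) {
    float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
    float r2   = dx*dx + dy*dy + dz*dz;
    float invR = 1.0f / sqrtf(r2);
    float f    = coef * pi.q * pj.q * invR * invR * invR;  // coef*qi*qj / r^3
    fi[0] += f * dx;  fi[1] += f * dy;  fi[2] += f * dz;
  }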
MD Code Results • Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs • 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures • Better scaling is achieved when the Xeons are present • 331.1 GFlop/s (19.82% of peak; the serial code is limited to 27.7% of peak on one SPE, assuming an SPE with an infinite local store)
Summary • Support for accelerators and heterogeneous execution in Charm++ • Programming model and runtime system changes • Accelerated entry methods • SIMD instruction abstraction • Automatic modification of application data • Visualization support • Supported platforms • Currently supports Cell • Adding support for Larrabee • Clusters where host cores have different architectures
Future Work • Dynamic, measurement-based load balancing on heterogeneous systems • Increase support for more accelerators • In the process of adding support for Larrabee • Increase support for existing abstractions and/or develop new abstractions