Ocelot: Supported Devices
Overview • Ocelot PTX Emulator • Multicore-Backend • NVIDIA GPU Backend • AMD GPU Backend
Multicore CPU Backend: Introduction • Target: Efficient execution of PTX kernels on CPUs • ISA Translation from PTX to LLVM • Execution-model translation from PTX thread hierarchy to serialized PTX threads • Light-weight thread scheduler • LLVM Just-in-time compilation to x86 • LLVM transformations applied before code gen
Some Interesting Features • Utilize all resources • JIT for Parallel Code • Serialization Transforms
Translation to CPUs: Thread Fusion
• Execution Manager
  • Thread scheduling
  • Context management
• Execution Model Translation
  • Thread scheduling
  • Dealing with specialized operations, e.g., custom hardware
  • Control flow restructuring
  • Resource management (multiple cores)
  • Multiple address spaces
• One worker pthread per CPU core
[Figure: executing a kernel; thread blocks are serialized and run on multicore host threads]
J. Stratton, S. Stone, and W.-m. Hwu, “MCUDA: An efficient implementation of CUDA kernels on multi-cores,” University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.
G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems,” PACT, October 2010.
Ocelot Source Code: Multicore CPU Backend
• ocelot/
  • executive/
    • interface/MulticoreCPUDevice.h
    • interface/LLVMContext.h
    • interface/LLVMExecutableKernel.h
    • interface/LLVMCooperativeThreadArray.h
    • interface/LLVMModuleManager.h
    • interface/TextureOperations.h
  • ir/
    • interface/LLVMInstruction.h
  • translator/
    • interface/PTXToLLVMTranslator.h
  • transforms/
    • interface/SubkernelFormationPass.h
    • interface/RemoveBarrierPass.h
Multicore CPU: ISA Translation • Translate PTX IR to LLVM Internal Representation • Arithmetic instructions have one-to-few mapping • Load store architectures • Special instructions and registers handled by LLVM intrinsics (e.g. cos, clock64, bar.sync) • Texture sampling calls Ocelot’s texture library • LLVMContext contains pointers to address spaces, next entry ID, thread ID • Custom LLVM IR implementation insulates Ocelot from LLVM changes • LLVM requires SSA form -> Ocelot converts PTX to SSA • Remove predication
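The "remove predication" step on this slide can be pictured as replacing a guarded instruction with an unconditional computation followed by a select. The sketch below is not Ocelot code: `predicatedAdd` and `selectAdd` are hypothetical helper names, and the PTX line in the comment is only an illustration of the guarded form.

```cpp
#include <cassert>
#include <cstdint>

// Guarded PTX form (illustrative):  @p add.u32 r, a, b;
// When p is false, r keeps its previous value.
// Hypothetical helper that models the predicated instruction directly.
std::uint32_t predicatedAdd(bool p, std::uint32_t r,
                            std::uint32_t a, std::uint32_t b) {
    if (p) {
        r = a + b;
    }
    return r;
}

// Predicate-free form: always compute the sum, then select between the
// new value and the old destination value. This maps onto an SSA-style
// IR, where a select instruction plays the role of the ?: operator.
std::uint32_t selectAdd(bool p, std::uint32_t r,
                        std::uint32_t a, std::uint32_t b) {
    const std::uint32_t sum = a + b;  // computed unconditionally
    return p ? sum : r;               // select(p, sum, r)
}
```

Both helpers return the same value for every input, which is the property the rewrite has to preserve. The select form only works for instructions without side effects; guarded loads, stores, and branches presumably need explicit control flow instead.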
PTX to LLVM ISA Translation
• Translate each PTX instruction to an LLVM IR instruction sequence
• Special PTX registers and instructions mapped to LLVM intrinsics:
  • llvm.readcyclecounter()
  • llvm.sqrt.f32()
• Result is an LLVM function implementing the PTX kernel
• Should be invertible if coupled to an LLVM->PTX code generator (not implemented)

    //
    // ocelot/translator/implementation/PTXToLLVMTranslator.cpp
    //
    void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
    {
        if( ir::PTXOperand::isFloat( i.type ) )
        {
            // Floating-point add maps one-to-one onto LLVM fadd
            ir::LLVMFadd add;
            ir::LLVMInstruction::Operand result = _destination( i );
            add.a = _translate( i.a );
            add.b = _translate( i.b );
            add.d = result;
            _llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
        }
        else
        {
            ...
        }
    }
Thread Serialization • Thread loops • Enter next executable region via scheduler block • Barriers: • store live values into thread-local memory, return to thread scheduler
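A minimal sketch of the idea on this slide, assuming a thread block whose kernel contains a single barrier. The kernel is split at the barrier into two regions, each region runs inside a loop over all threads, and the value that is live across the barrier is spilled to per-thread storage. All names here (`ThreadLocal`, `regionBeforeBarrier`, `regionAfterBarrier`, `runSerializedBlock`) are hypothetical and do not come from Ocelot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread spill area for values live across the barrier.
struct ThreadLocal {
    int partial = 0;
};

// Region 0: code before the barrier. Each thread writes shared memory and
// spills the value it still needs after the barrier.
void regionBeforeBarrier(std::size_t tid, std::vector<int>& shared,
                         ThreadLocal& local) {
    const int value = static_cast<int>(tid) * 2;
    shared[tid] = value;
    local.partial = value;  // spill the live value
}

// Region 1: code after the barrier. Each thread reloads its spilled value
// and reads a neighbor's shared-memory slot, which is safe only because
// every thread has already finished region 0.
void regionAfterBarrier(std::size_t tid, const std::vector<int>& shared,
                        const ThreadLocal& local, std::vector<int>& out) {
    const std::size_t neighbor = (tid + 1) % shared.size();
    out[tid] = local.partial + shared[neighbor];
}

// Scheduler: one thread loop per region replaces the hardware barrier.
std::vector<int> runSerializedBlock(std::size_t threads) {
    std::vector<int> shared(threads);
    std::vector<int> out(threads);
    std::vector<ThreadLocal> locals(threads);
    for (std::size_t tid = 0; tid < threads; ++tid) {
        regionBeforeBarrier(tid, shared, locals[tid]);
    }
    // The barrier is implicit: the first loop has completed for all threads.
    for (std::size_t tid = 0; tid < threads; ++tid) {
        regionAfterBarrier(tid, shared, locals[tid], out);
    }
    return out;
}
```

Calling `runSerializedBlock(4)` runs four logical threads on one host thread. A real scheduler would also have to handle barriers inside loops and divergent control flow, which this sketch leaves out.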
Execution Management Thread Serialization • Translation takes place over (sub)kernels • Code cache for translated kernels • Must synthesize thread scheduling (serialization) code
Spilling Live Values

    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive )
    {
        unsigned int bytes = 0;

        // Load the base address of the spill stack into a new register
        ir::PTXInstruction move( ir::PTXInstruction::Mov );
        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;
        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        // Insert just before the last instruction of the block
        _kernel->dfg()->insert( block, move,
            block->instructions().size() - 1 );
        ...
Spilling Live Values

        ...
        // Store each live register to local memory at consecutive offsets
        for( DataflowGraph::Block::RegisterSet::const_iterator
            reg = alive.begin(); reg != alive.end(); ++reg )
        {
            ir::PTXInstruction save( ir::PTXInstruction::St );
            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;
            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;
            bytes += ir::PTXOperand::bytes( save.type );
            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;
            _kernel->dfg()->insert( block, save,
                block->instructions().size() - 1 );
        }

        // Remember the largest spill area needed so far
        _spillBytes = std::max( bytes, _spillBytes );
    }
Using the Multicore Backend
• Edit configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
  • executive controls device properties
• trace:
  • Trace Generators may be active for devices other than PTX Emulator
  • Only initialize(), finish() called
  • event() and postEvent() never called
  • Enables uniform interface for profiling kernel launches
• executive:
  • devices:
    • llvm – efficient execution of PTX on multicore CPU
  • optimizationLevel – basic, none, full, memory, debug
  • workerThreadLimit – number of worker threads
• optimizations:
  • subkernelSize – size of subkernels in instructions
  • simplifyCFG – whether to apply CFG simplification pass
  • hoistSpecialValues – whether to load LLVMContext values at launch of kernel

    executive: {
        devices: [ llvm ],
        asynchronousKernelLaunch: true,
        optimizationLevel: none,
        workerThreadLimit: 1,
        warpSize: 1
    },
    optimizations: {
        subkernelSize: 1000,
        simplifyCFG: true,
        hoistSpecialValues: true
    },
Overview • Ocelot PTX Emulator • Multicore-Backend • NVIDIA GPU Backend • AMD Backend
NVIDIA GPU: Introduction • Executes PTX kernels on GPUs via the CUDA Driver API • Thin layer on top of CUDA Driver API • Ocelot enables rewriting of PTX kernels • Register reallocation • Runtime optimizations • Instrumentation
Ocelot Source Code: NVIDIA GPU Device Backend
• ocelot/
  • executive/
    • interface/NVIDIAGPUDevice.h
    • interface/NVIDIAExecutableKernel.h
Using the NVIDIA GPU Backend
• Edit configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
  • executive controls device properties
• trace:
  • Trace Generators may be active for devices other than PTX Emulator
  • Only initialize(), finish() called
  • event() and postEvent() never called
  • Enables uniform interface for profiling kernel launches
• executive:
  • devices:
    • nvidia – invokes NVIDIA GPU backend

    executive: {
        devices: [ nvidia ],
    },
Dynamic Instrumentation
PhD Student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska
• Run-time generation of user-defined, custom instrumentation code for CUDA kernels
• Harness chip-level instrumentation when possible
• Instrumentation data to drive
  • Off-line workload characterization
  • On-line debugging & program optimization
  • On-line resource management
• Inspired in part by the PIN [1] infrastructure
[1] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. “Pin: building customized program analysis tools with dynamic instrumentation,” PLDI '05
Instrumentation Support in Ocelot • High-level, C constructs to define instrumentation + (C-to-PTX) JIT • Integration with system management software and dynamic compiler • Online resource management based on profiling • Additional Instrumentor APIs to provide criteria for instrumentation • Selectively perform instrumentation on kernels
Custom Instrumentation • Transparent profiling and characterization of library implementations
[Figure: instrumentation tool flow. Labels: Example Instrumentation Code; CUDA; nvcc; Libraries; PTX; Lynx (Instrumentation APIs, C-on-Demand JIT, C-PTX Translator, PTX-PTX Transformer, Instrumentor); Ocelot Run Time]
Instrumentation: Instruction count * Scan (CUDA SDK)
Remote Device Layer • Remote procedure call layer for Ocelot device calls • Execute local applications that run kernels remotely • Multi-GPU applications can become multi-node
Switchable Compute • Switch devices at runtime • Load balancing • Instrumentation • Fault-and-emulate • Remote execution
Overview • Ocelot PTX Emulator • Multicore-Backend • NVIDIA GPU Backend • AMD GPU Backend
Rodrigo Dominguez, Dana Schaa, and David Kaeli. “Caracal: Dynamic Translation of Runtime Environments for GPUs.” In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4
AMD GPU Backend
• Executes PTX kernels on GPUs via the CAL Driver API
• PTX kernel rewrites (for optimization, instrumentation, etc.) are also translated for the AMD backend
• Ocelot Device Interface:
  • Module registration
  • Memory management
    • Global/Shared/Constant/Parameter memory allocation
  • Kernel launches
    • Translation from PTX to IL
  • Texture management
  • OpenGL interoperability
  • Streams and Events
AMD Evergreen Architecture • AMD Radeon HD 5870 • 20 SIMD cores • 16 Stream Cores (SC) per SIMD core • Each SC is VLIW-5 • A total of 1600 ALUs • Wavefronts of 64 threads • Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)
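The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes two things that are not stated on the slide: an engine clock of 850 MHz for the Radeon HD 5870, and one fused multiply-add (counted as 2 FLOPs) per ALU per cycle. The double-precision figure additionally assumes one DP fused multiply-add per stream core per cycle. All constant names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the slide.
constexpr std::uint64_t kSimdCores = 20;
constexpr std::uint64_t kStreamCoresPerSimd = 16;
constexpr std::uint64_t kAlusPerStreamCore = 5;  // VLIW-5

// Assumptions, not stated on the slide.
constexpr std::uint64_t kClockMHz = 850;      // assumed engine clock
constexpr std::uint64_t kFlopsPerFma = 2;     // multiply + add

constexpr std::uint64_t kStreamCores = kSimdCores * kStreamCoresPerSimd;
constexpr std::uint64_t kAlus = kStreamCores * kAlusPerStreamCore;

// Single precision: every ALU issues one FMA per cycle.
constexpr std::uint64_t kPeakSpMflops = kAlus * kFlopsPerFma * kClockMHz;

// Double precision: one FMA per stream core per cycle.
constexpr std::uint64_t kPeakDpMflops =
    kStreamCores * kFlopsPerFma * kClockMHz;

static_assert(kAlus == 1600, "20 x 16 x 5 ALUs");
static_assert(kPeakSpMflops == 2720000, "2.72 TFLOPS single precision");
static_assert(kPeakDpMflops == 544000, "544 GFLOPS double precision");
```

Under these assumptions the totals match the slide: 1600 ALUs, 2.72 TFLOPS single precision, and 544 GFLOPS double precision, with the double-precision peak coming out at one fifth of the single-precision peak.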
AMD Evergreen Architecture
• Each Stream Core includes:
  • 4 Processing Elements
    • 4 independent SP or integer operations
    • 2 DP operations
    • 1 DP fma or mult operation
  • 1 Special Function Unit
    • 1 SP or integer operation
    • SP or DP transcendental
  • Branch Execution Unit
• GPR = 5.24 MB
[Figure: one SIMD Engine and one Stream Core. Labels: Instruction and Control Flow, Processing Elements, T-Processing Element, Branch Execution Unit, General Purpose Registers]
Source: AMD OpenCL University Kit
AMD Evergreen Architecture • Local Data Share • 2 TB/s • 32 KB per SIMD • Global Data Share • Shared between all threads in a kernel • Low latency global reductions • L1 (8 KB) • L2 • 512 KB • 450 GB/s • Global Memory • GDDR5 153 GB/s • CompletePath • FastPath
Memory Hierarchy
[Figure: memory hierarchy. Labels: SIMD Engine, Local Mem + Registers, L1 Cache, Crossbar, Write Cache, L2 Cache, Atomic Path, Global Memory]
Memory Hierarchy • Benefits from vector operations (int4, float4) • Atomics are faster on local memory than in global memory (FastPath vs. CompletePath) • Typical tiled data layout for images to hit L1 cache • Compiler optimizes by: • Minimizing ALU code • Maximizing number of threads • Scheduling instructions to increase VLIW packing
Address Spaces • Unordered Access Views (raw) • 8 different UAVs • Byte-addressable (linear) • Dword (4 bytes) alignment • Arena UAV for sub-dword data • Constant Buffers • Non-linear addressing (x, y, z, w components) • Local Data Share • Byte-addressable (linear) • Dword-aligned and dword-sized (pack/unpack overhead)
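The "pack/unpack overhead" noted for dword-sized memories can be illustrated with a byte access built from 32-bit accesses. This is a minimal sketch, not the backend's generated code: `loadByte` and `storeByte` are hypothetical helpers, the memory is modelled as a vector of dwords, and little-endian byte order within a dword is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Load one byte from a memory that only supports aligned 32-bit accesses.
// Unpack step: fetch the containing dword, shift, and mask.
std::uint8_t loadByte(const std::vector<std::uint32_t>& dwords,
                      std::uint32_t byteAddress) {
    const std::uint32_t word = dwords[byteAddress / 4];
    const std::uint32_t shift = (byteAddress % 4) * 8;  // little-endian
    return static_cast<std::uint8_t>((word >> shift) & 0xFFu);
}

// Store one byte into the same kind of memory.
// Pack step: read the dword, clear the target byte, merge, write back.
void storeByte(std::vector<std::uint32_t>& dwords,
               std::uint32_t byteAddress, std::uint8_t value) {
    std::uint32_t& word = dwords[byteAddress / 4];
    const std::uint32_t shift = (byteAddress % 4) * 8;
    word = (word & ~(0xFFu << shift)) |
           (static_cast<std::uint32_t>(value) << shift);
}
```

A single byte store turns into a read-modify-write of a whole dword. On a real device that sequence would also need to be atomic if neighboring bytes can be written by other threads, which this sketch does not address.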
Translation from PTX to IL: PTX
• RISC style syntax
• Load-Store instruction set
• Registers are typed and scalar
• Unlimited virtual registers
• Predicate registers
• Control flow based on branches and labels
• Designed for compute (GPGPU)

    .entry vecAdd (
        .param .u64 A,
        .param .u64 B,
        .param .u64 C,
        .param .s32 N)
    {
        mov.u16 rh1, ctaid.x;
        mov.u16 rh2, ntid.x;
        mul.wide.u16 r1, rh1, rh2;
        cvt.u32.u16 r2, tid.x;
        add.u32 r3, r2, r1;
        ld.param.s32 r4, [N];
        setp.le.s32 p1, r4, r3;
        @p1 bra Label_1;
        ...
    }
Translation from PTX to IL: IL
• Registers are 32-bit and vectors (4 components)
• Registers have no type
• Swizzles and destination modifiers
• Resources are globally scoped
• Structured control flow (if-end, while-end)
• Designed for graphics, not compute (see FSAIL)

    il_cs_2_0
    dcl_raw_uav_id(0)
    dcl_cb cb0[2]
    dcl_cb cb1[4]
    dcl_literal l0, 4, 4, 4, 4
    mov r0.x, vThreadGrpId.x
    mov r1.x, cb0[0].x
    imul r2.x, r0.x, r1.x
    mov r3.x, vTidInGrp.x
    iadd r4.x, r3.x, r2.x
    mov r5.x, cb1[3].x
    ige r6.x, r4.x, r5.x
    if_logicalz r6.x
        ...
    endif
    end
AMD GPU Backend • Validated on over 30 applications from the CUDA SDK • Support for pre-compiled libraries • Device selection can be made at runtime • What is supported? • Global memory (cudaMalloc, cudaMemcpy) • Shared memory (including extern) • Constant memory (no caching) • Atomics (global and shared) • Barriers and Fences • 30+ PTX instructions
Ocelot Source Code: AMD GPU Device Backend
• ocelot/
  • analysis/
    • interface/StructuralAnalysis.h
  • executive/
    • interface/ATIGPUDevice.h
    • interface/ATIExecutableKernel.h
  • transforms/
    • interface/StructuralTransform.h
Using the AMD GPU Backend
• Edit configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
  • executive controls device properties
• trace:
  • Trace Generators may be active for devices other than PTX Emulator
  • Only initialize(), finish() called
  • event() and postEvent() never called
  • Enables uniform interface for profiling kernel launches
• executive:
  • devices:
    • amd – invokes AMD GPU backend

    executive: {
        devices: [ amd ],
    },
Unstructured to Structured Control Flow [*]
• Branch divergence is a key factor in GPU performance
• Its impact differs depending on whether the control flow is structured or unstructured
• Not all GPUs support unstructured CFGs directly
• Using dynamic translation to support AMD GPUs [**]
[*] H. Wu, G. Diamos, S. Li, and S. Yalamanchili. Characterization and Transformation of Unstructured Control Flow in GPU Applications. CACHES, 2011.
[**] R. Dominguez, D. Schaa, and D. Kaeli. Caracal: Dynamic Translation of Runtime Environments for GPUs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11. ACM, 2011.
Structured/Unstructured Control Flow • Structured Control Flow has a single entry and a single exit • Unstructured Control Flow has multiple entries or exits
[Figure: structured control-flow graphs with marked entry and exit nodes: for-loop/while-loop, do-while-loop, if-then-else]
Sources of Unstructured Control Flow (1/2)
• goto statement of C/C++
• Language semantics
  • Not all conditions need to be evaluated
  • Sub-graphs in red circles have 2 exits

    if ((cond1() || cond2()) && (cond3() || cond4())) {
        ……
    }

[Figure: control-flow graph: entry; B1: bra cond1(); B2: bra cond2(); B3: bra cond3(); B4: bra cond4(); B5: ……; exit]
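To see why short-circuit evaluation yields an unstructured graph, the condition on this slide can be written out with one block per condition, assuming it reads `(cond1() || cond2()) && (cond3() || cond4())`. The sketch below is illustrative only: `runShortCircuitCfg` is a hypothetical name, the conditions are passed in as plain booleans, and `trace` records which blocks were visited.

```cpp
#include <cassert>
#include <string>

// Explicit control-flow graph for
//   if ((c1 || c2) && (c3 || c4)) { body }
// using the block names B1..B5 from the slide.
// Returns true when the body (B5) executed.
bool runShortCircuitCfg(bool c1, bool c2, bool c3, bool c4,
                        std::string& trace) {
    bool bodyExecuted = false;

    trace += "B1 ";
    if (c1) goto B3;        // c1 true: skip c2 entirely
    trace += "B2 ";
    if (!c2) goto Exit;     // (c1 || c2) false: skip everything else
B3:
    trace += "B3 ";
    if (c3) goto B5;        // c3 true: skip c4
    trace += "B4 ";
    if (!c4) goto Exit;     // (c3 || c4) false: skip the body
B5:
    trace += "B5 ";
    bodyExecuted = true;
Exit:
    return bodyExecuted;
}
```

The region formed by B1 and B2 can leave either to B3 or straight to the exit, and the region formed by B3 and B4 can leave either to B5 or to the exit. Those two-exit regions are what make the graph unstructured, and they are presumably the sub-graphs circled on the slide.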
Re-convergence in AMD & Intel GPUs
• AMD IL does not support arbitrary branches
• It relies on structured constructs such as IF, ELSE, LOOP, and ENDLOOP
• Intel GEN5 works in a similar manner

C code:

    if (i < N) {
        C[i] = A[i] + B[i];
    }

AMD IL:

    ige r6, r4, r5
    if_logicalz r6
        uav_raw_load_id(0) r11, r10
        uav_raw_load_id(0) r14, r13
        iadd r17, r16, r8
        uav_raw_store_id(0) r17, r15
    endif
Re-converge at immediate post-dominator
[Figure: execution trace of threads T0–T6 through blocks B1–B5 over time steps 1–12, for the control-flow graph entry; B1: bra cond1(); B2: bra cond2(); B3: bra cond3(); B4: bra cond4(); B5: ……; exit. All threads start at Entry and re-converge at Exit.]
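The re-convergence point for a divergent branch can be computed from the control-flow graph. The sketch below uses a textbook iterative post-dominator analysis, which is an assumption and not necessarily how any GPU or Ocelot determines it. The edges are taken from the short-circuit example `(cond1() || cond2()) && (cond3() || cond4())`, and all function names are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNodes = 7;
enum Node : std::size_t { Entry, B1, B2, B3, B4, B5, Exit };
using NodeSet = std::bitset<kNodes>;
using Graph = std::vector<std::vector<std::size_t>>;

// Successor lists for the example graph (assumed edge set).
Graph exampleCfg() {
    return {
        {B1},        // Entry
        {B2, B3},    // B1: cond1 false -> B2, true -> B3
        {B3, Exit},  // B2: cond2 true -> B3, false -> Exit
        {B4, B5},    // B3: cond3 false -> B4, true -> B5
        {B5, Exit},  // B4: cond4 true -> B5, false -> Exit
        {Exit},      // B5: body
        {}           // Exit
    };
}

// pdom(n) = {n} united with the intersection of pdom(s) over successors s.
// Assumes succ.size() == kNodes and every non-exit node has a successor.
std::vector<NodeSet> postDominators(const Graph& succ, std::size_t exit) {
    std::vector<NodeSet> pdom(succ.size(), NodeSet().set());
    pdom[exit] = NodeSet().set(exit);
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t n = 0; n < succ.size(); ++n) {
            if (n == exit) continue;
            NodeSet next;
            next.set();
            for (std::size_t s : succ[n]) next &= pdom[s];
            next.set(n);
            if (next != pdom[n]) {
                pdom[n] = next;
                changed = true;
            }
        }
    }
    return pdom;
}

// The immediate post-dominator of n is the strict post-dominator d whose
// own post-dominator set equals all strict post-dominators of n.
std::size_t immediatePostDominator(const std::vector<NodeSet>& pdom,
                                   std::size_t n) {
    const std::size_t strictCount = pdom[n].count() - 1;
    for (std::size_t d = 0; d < pdom.size(); ++d) {
        if (d != n && pdom[n].test(d) && pdom[d].count() == strictCount) {
            return d;
        }
    }
    return n;  // the exit node has no immediate post-dominator
}
```

For this graph the immediate post-dominator of every block B1 through B5 is Exit, because each block has a path to the exit that bypasses the following blocks. That is consistent with the trace on the slide, where diverged threads execute B3, B4, and B5 in separate passes and only re-converge at Exit.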