Overview • Ocelot PTX Emulator • Multicore CPU Backend • NVIDIA GPU Backend • AMD GPU Backend
Multicore CPU Backend: Introduction • Target: efficient execution of PTX kernels on CPUs • ISA translation from PTX to LLVM • Execution-model translation from the PTX thread hierarchy to serialized PTX threads • Lightweight thread scheduler • LLVM just-in-time compilation to x86 • LLVM transformations applied before code generation
Some Interesting Features • Utilize all CPU resources • JIT compilation for parallel code • Thread-serialization transforms
Translation to CPUs: Thread Fusion • Execution Manager: thread scheduling and context management • Thread serialization: thread blocks are fused into serial loops, and one worker pthread per CPU core executes the kernel (see the sketch below) • Execution-model translation • Thread scheduling • Dealing with specialized operations (e.g., custom hardware) • Control-flow restructuring • Resource management (multiple cores)
J. Stratton, S. Stone, and W.-m. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores," University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.
G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk-Synchronous Applications in Heterogeneous Systems," PACT, October 2010.
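To make the execution-model translation concrete, here is a minimal C++ sketch of thread fusion. It is not Ocelot's implementation; the names (KernelContext, kernelBody, fusedKernel, launchGrid) are illustrative. Each thread block becomes a serial loop over its threads, and one worker thread per core claims blocks from a shared counter.

#include <atomic>
#include <thread>
#include <vector>

// Illustrative per-block context handed to the fused kernel.
struct KernelContext {
    int blockId;
    int blockDim;
};

// Stand-in for the translated per-thread kernel body (in Ocelot this
// is the LLVM function produced by the PTX-to-LLVM translator).
static void kernelBody(const KernelContext& ctx, int tid, float* data) {
    data[ctx.blockId * ctx.blockDim + tid] += 1.0f;
}

// Thread fusion: all threads of one block run in a serial loop.
static void fusedKernel(const KernelContext& ctx, float* data) {
    for (int tid = 0; tid < ctx.blockDim; ++tid)
        kernelBody(ctx, tid, data);
}

// One worker thread per CPU core; workers claim thread blocks from a
// shared atomic counter until the grid is exhausted.
void launchGrid(int gridDim, int blockDim, float* data) {
    std::atomic<int> nextBlock{0};
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < cores; ++w)
        workers.emplace_back([&] {
            for (int b = nextBlock++; b < gridDim; b = nextBlock++)
                fusedKernel({b, blockDim}, data);
        });
    for (auto& t : workers) t.join();
}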
Ocelot Source Code: Multicore CPU Backend • ocelot/ • executive/ • interface/MulticoreCPUDevice.h • interface/LLVMContext.h • interface/LLVMExecutableKernel.h • interface/LLVMCooperativeThreadArray.h • interface/LLVMModuleManager.h • interface/TextureOperations.h • ir/ • interface/LLVMInstruction.h • translator/ • interface/PTXToLLVMTranslator.h • transforms/ • interface/SubkernelFormationPass.h • interface/RemoveBarrierPass.h
Multicore CPU: ISA Translation • Translate PTX IR to LLVM internal representation • Arithmetic instructions have a one-to-few mapping • Special instructions and registers are handled by LLVM intrinsics (e.g. cos, clock64, bar.sync) • Texture sampling calls Ocelot's texture library • LLVMContext contains pointers to address spaces, the next entry ID, and the thread ID • A custom LLVM IR implementation insulates Ocelot from LLVM changes • LLVM requires SSA form, so Ocelot converts PTX to SSA and removes predication (see the note below)
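As a small illustration of predication removal (a C++ analogue, not translator output): a predicated PTX instruction such as "@p add.s32 r1, r2, r3" writes its destination only when the predicate holds, which in SSA form becomes an explicit select.

// The predicated update "@p add.s32 r1, r2, r3" becomes a select,
// which maps directly onto LLVM's select instruction in SSA form.
int addPredicated(bool p, int r1_old, int r2, int r3) {
    return p ? (r2 + r3) : r1_old;   // new SSA value for r1
}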
PTX to LLVM ISA Translation
// ocelot/translation/implementation/PTXToLLVMTranslator.cpp

void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
{
    if( ir::PTXOperand::isFloat( i.type ) )
    {
        ir::LLVMFadd add;
        ir::LLVMInstruction::Operand result = _destination( i );
        add.a = _translate( i.a );
        add.b = _translate( i.b );
        add.d = result;
        _llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
    }
    else
    {
        ...
    }
}
• Translate each PTX instruction to an LLVM IR instruction sequence • Special PTX registers and instructions are mapped to LLVM intrinsics: • llvm.readcyclecounter() • llvm.sqrt.f32() • The result is an LLVM function implementing the PTX kernel • Should be invertible if coupled to an LLVM-to-PTX code generator (not implemented)
Thread Serialization • Thread loops: enter the next executable region via a scheduler block • Barriers: store live values into thread-local memory and return to the thread scheduler (see the sketch below)
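A minimal sketch of barrier handling under serialization, with hypothetical names (region0, region1, LiveState); Ocelot's actual passes (SubkernelFormationPass, RemoveBarrierPass) are more general. The kernel is split at the barrier: region 0 runs for every thread and spills its live values, then region 1 resumes each thread from the saved state, so no thread crosses the barrier early.

#include <vector>

// Hypothetical live state spilled at the barrier for one thread.
struct LiveState {
    float partial;   // value live across the barrier
};

// Region 0: code before bar.sync; spill values live at the barrier.
static void region0(int tid, const float* in, LiveState& ls) {
    ls.partial = in[tid] * 2.0f;
}

// Region 1: code after bar.sync; reload live values and continue.
static void region1(int tid, int blockDim, float* out,
                    const std::vector<LiveState>& saved) {
    // After the barrier a thread may safely read its neighbor's result.
    out[tid] = saved[tid].partial + saved[(tid + 1) % blockDim].partial;
}

// Scheduler: run region 0 for all threads, then region 1 for all
// threads; the barrier semantics hold because no thread starts
// region 1 until every thread has finished region 0.
void runBlock(int blockDim, const float* in, float* out) {
    std::vector<LiveState> saved(blockDim);
    for (int tid = 0; tid < blockDim; ++tid) region0(tid, in, saved[tid]);
    for (int tid = 0; tid < blockDim; ++tid) region1(tid, blockDim, out, saved);
}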
Using the Multicore Backend • executive: { • devices: [ llvm ], • asynchronousKernelLaunch: true, • optimizationLevel: none, • workerThreadLimit: 1, • warpSize: 1 • }, • optimizations: { • subkernelSize: 1000, • simplifyCFG: true, • hoistSpecialValues: true • }, • Edit configure.ocelot • Executive: • devices: • llvm – efficient execution of PTX on multicore CPU • optimizationLevel – basic, none, full, memory, debug • workerThreadLimit -- number of worker threads • Optimizations: • subkernelSize- size of subkernels in instructions • simplifyCFG – whether to apply CFG simplification pass • hoistSpecialValues – whether to load LLVMContext values at launch of kernel
Overview • Ocelot PTX Emulator • Multicore CPU Backend • NVIDIA GPU Backend • AMD GPU Backend
NVIDIA GPU: Introduction • Executes PTX kernels on NVIDIA GPUs via the CUDA Driver API • Thin layer on top of the CUDA Driver API (see the sketch below) • Ocelot enables rewriting of PTX kernels • Register reallocation • Runtime optimizations • Instrumentation
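For context, this is roughly what a thin layer over the CUDA Driver API does once Ocelot has (optionally) rewritten a kernel: JIT-load the PTX string and launch it. A minimal sketch using standard Driver API calls, not Ocelot's internal code:

#include <cuda.h>
#include <cstdio>

// Minimal error check for Driver API calls.
static void check(CUresult r) {
    if (r != CUDA_SUCCESS) std::fprintf(stderr, "CUDA driver error %d\n", r);
}

// Load a (possibly rewritten) PTX string and launch one of its kernels.
void launchFromPTX(const char* ptx, const char* kernelName,
                   unsigned gridX, unsigned blockX, void** params) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    check(cuInit(0));
    check(cuDeviceGet(&dev, 0));
    check(cuCtxCreate(&ctx, 0, dev));
    check(cuModuleLoadData(&mod, ptx));               // driver JITs the PTX
    check(cuModuleGetFunction(&fn, mod, kernelName));
    check(cuLaunchKernel(fn, gridX, 1, 1, blockX, 1, 1,
                         0 /* shared mem */, nullptr /* stream */,
                         params, nullptr /* extra */));
    check(cuCtxSynchronize());
    check(cuModuleUnload(mod));
    check(cuCtxDestroy(ctx));
}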
Ocelot Source Code: NVIDIA GPU Device Backend • ocelot/ • executive/ • interface/NVIDIAGPUDevice.h • interface/NVIDIAExecutableKernel.h
Using the NVIDIA GPU Backend • Edit configure.ocelot:
executive: {
    devices: [ nvidia ],
},
• executive: • devices: • nvidia – invokes the NVIDIA GPU backend
Dynamic Instrumentation PhD Student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska • Run-time generation of user-defined, custom instrumentation code for CUDA kernels • Harness chip-level instrumentation when possible • Instrumentation data to drive • Off-line workload characterization • On-line debugging & program optimization • On-line resource management • Inspired in part by the Pin1 infrastructure
1 C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," PLDI '05.
Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, and Sudhakar Yalamanchili, "Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures," ISPASS, April 2012.
Instrumentation Support in Ocelot • High-level C constructs to define instrumentation + a C-to-PTX JIT (see the example below) • Integration with system management software and the dynamic compiler • Online resource management based on profiling • Additional Instrumentor APIs to provide criteria for instrumentation • Selectively perform instrumentation on kernels
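For flavor, a sketch of what such a C instrumentation specification might look like for a per-thread dynamic instruction count. This is loosely modeled on Lynx-style specifications; the constructs used here (ON_BASIC_BLOCK_EXIT, globalThreadId(), basicBlockInstructionCount(), and the deviceMem buffer) are illustrative assumptions, not the verified Lynx API.

/* Illustrative instrumentation specification (assumed syntax).
 * The C-to-PTX JIT compiles this fragment and the PTX-to-PTX
 * transformer splices it into the kernel at each basic-block exit. */
ON_BASIC_BLOCK_EXIT:
{
    /* accumulate one counter per thread in a global-memory buffer */
    deviceMem[globalThreadId()] += basicBlockInstructionCount();
}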
Custom Instrumentation • Transparent profiling and characterization of library implementations
[Diagram: example instrumentation code flows through the Lynx toolchain – CUDA libraries compiled by nvcc; Lynx provides PTX instrumentation APIs, a C-on-Demand JIT, an Instrumentor, a C-to-PTX translator, and a PTX-to-PTX transformer on top of the Ocelot run time]
Instrumentation: Instruction Count [Chart: dynamic instruction counts for Scan (CUDA SDK)]
Remote Device Layer • Remote procedure call layer for Ocelot device calls • Execute local applications whose kernels run remotely • Multi-GPU applications can become multi-node
Switchable Compute • Switch devices at runtime • Load balancing • Remote execution
Overview • Ocelot PTX Emulator • Multicore CPU Backend • NVIDIA GPU Backend • AMD GPU Backend
Rodrigo Dominguez, Dana Schaa, and David Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.
AMD GPU Backend • Executes PTX kernels on AMD GPUs via the CAL Driver API • PTX kernel rewriting (for optimization, instrumentation, etc.) carries over, since rewritten kernels are translated for the AMD backend like any others • Ocelot Device Interface (a simplified sketch follows): • Module registration • Memory management • Global/Shared/Constant/Parameter memory allocation • Kernel launches • Translation from PTX to IL • Texture management • OpenGL interoperability • Streams and Events
Rodrigo Dominguez, Dana Schaa, and David Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.
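The device interface is essentially an abstract class that each backend implements. A much-simplified sketch; Ocelot's real executive::Device class has more methods and different signatures, so treat the names below as illustrative only.

#include <cstddef>
#include <string>

// Simplified backend device interface (illustrative, not Ocelot's API).
class Device {
public:
    virtual ~Device() = default;
    virtual void load(const std::string& ptxModule) = 0;   // module registration
    virtual void* allocate(size_t bytes) = 0;              // memory management
    virtual void copy(void* dst, const void* src, size_t bytes) = 0;
    virtual void launch(const std::string& kernel, int gridDim, int blockDim,
                        const void* params, size_t paramSize) = 0;
};

// The AMD backend (ATIGPUDevice) would implement this interface by
// translating PTX to IL and driving the CAL API; the NVIDIA and
// multicore backends implement the same interface over the CUDA
// Driver API and the LLVM JIT, respectively.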
AMD Evergreen Architecture • AMD Radeon HD 5870 • 20 SIMD cores • 16 Stream Cores (SC) per SIMD core • Each SC is VLIW-5 • A total of 1600 ALUs • Wavefronts of 64 threads • Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)
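As a sanity check on these figures (assuming the HD 5870's 850 MHz engine clock, which the slide does not state): 20 SIMD cores × 16 stream cores × 5 ALUs = 1600 ALUs; 1600 ALUs × 2 FLOPs/cycle (multiply-add) × 0.85 GHz = 2.72 TFLOPS single precision. Double precision runs at one fifth of that rate: 2.72 / 5 ≈ 0.544 TFLOPS = 544 GFLOPS.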
AMD Evergreen Architecture • Each Stream Core includes: • 4 Processing Elements • 4 independent SP or integer operations, or • 2 DP operations, or • 1 DP fused multiply-add or multiply operation • 1 Special Function Unit (T-Processing Element) • 1 SP or integer operation • SP or DP transcendentals • Branch Execution Unit • Total GPR capacity = 5.24 MB
[Diagram: one SIMD core and one stream core – instruction and control flow, processing elements, T-processing element, branch execution unit, general purpose registers. Source: AMD OpenCL University Kit]
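The GPR figure also follows from the register file size, assuming Evergreen's 16,384 vector GPRs of 128 bits per SIMD core (a number not on the slide): 16,384 × 16 B = 256 KiB per SIMD core; × 20 SIMD cores = 5,242,880 B ≈ 5.24 MB.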
AMD Evergreen Architecture • Local Data Share • 32 KB per SIMD core • 2 TB/s • Global Data Share • Shared between all threads in a kernel • Low-latency global reductions • L1 cache: 8 KB • L2 cache: 512 KB, 450 GB/s • Global Memory: GDDR5, 153 GB/s
Translation from PTX to IL
PTX • RISC-style syntax • Load-store instruction set • Registers are typed and scalar • Unlimited virtual registers • Predicate registers • Control flow based on branches and labels • Designed for compute (GPGPU)
.entry vecAdd (
    .param .u64 A,
    .param .u64 B,
    .param .u64 C,
    .param .s32 N)
{
    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r1, %rh1, %rh2;
    cvt.u32.u16 %r2, %tid.x;
    add.u32 %r3, %r2, %r1;
    ld.param.s32 %r4, [N];
    setp.le.s32 %p1, %r4, %r3;
    @%p1 bra Label_1;
    ...
}
Translation from PTX to IL
IL • Registers are 32-bit and vector (4 components) • Registers have no type • Swizzles • Resources are globally scoped • Structured control flow (if-end, while-end) • Designed for graphics, not compute (see FSAIL)
il_cs_2_0
dcl_raw_uav_id(0)
dcl_cb cb0[2]
dcl_cb cb1[4]
dcl_literal l0, 4, 4, 4, 4
mov r0.x, vThreadGrpId.x
mov r1.x, cb0[0].x
imul r2.x, r0.x, r1.x
mov r3.x, vTidInGrp.x
iadd r4.x, r3.x, r2.x
mov r5.x, cb1[3].x
ige r6.x, r4.x, r5.x
if_logicalz r6.x
...
endif
end
AMD GPU Backend • Validated on over 30 applications from the CUDA SDK • Support for pre-compiled libraries • Device selection can be made at runtime • What is supported? • Global memory (cudaMalloc, cudaMemcpy) • Shared memory (including extern) • Constant memory • Atomics (global and shared) • Barriers and fences • 30+ PTX instructions
Rodrigo Dominguez, Dana Schaa, and David Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.
Ocelot Source Code: AMD GPU Device Backend • ocelot/ • analysis/ • interface/StructuralAnalysis.h • executive/ • interface/ATIGPUDevice.h • interface/ATIExecutableKernel.h • transforms/ • interface/StructuralTransform.h
Using the AMD GPU Backend • Edit configure.ocelot:
executive: {
    devices: [ amd ],
},
• executive: • devices: • amd – invokes the AMD GPU backend
Unstructured to Structured Control Flow* • Branch divergence is key to high performance on GPUs • Its impact differs depending on whether the control flow is structured or unstructured • Not all GPUs support unstructured CFGs directly • Dynamic translation is used to support AMD GPUs** (see the example below)
* H. Wu, G. Diamos, S. Li, and S. Yalamanchili, "Characterization and Transformation of Unstructured Control Flow in GPU Applications," CACHES, 2011.
** R. Dominguez, D. Schaa, and D. Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11, ACM, 2011.
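A minimal C++ illustration (not the actual compiler transform) of what restructuring does: a branch into the middle of another region has no direct if-end/while-end equivalent in IL, so a structural transform rewrites it using guard variables, at the cost of some extra code.

// Unstructured: the goto branches into the tail of the else-region,
// which cannot be expressed with structured constructs alone.
int unstructured(int x) {
    int y = 0;
    if (x > 0) {
        y = x * 2;
        if (y > 10) goto tail;   // jump into the other region
        y += 1;
        return y;
    }
    y = -x;
tail:
    y += 100;
    return y;
}

// Structured: the same behavior using only if-else, tracked by a
// guard flag; this mirrors the kind of rewrite a structural
// transform performs before emitting structured IL.
int structured(int x) {
    int y = 0;
    bool takeTail = true;
    if (x > 0) {
        y = x * 2;
        if (y <= 10) {
            y += 1;
            takeTail = false;    // corresponds to the early return
        }
    } else {
        y = -x;
    }
    if (takeTail) y += 100;
    return y;
}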