Overview of Ocelot: architecture


Overview • GPU Ocelot overview • Building, configuring, and executing Ocelot programs • Ocelot Device Interface and CUDA Runtime API • Ocelot PTX Internal Representation • PTX Pass Manager

Ocelot: Multiplatform Dynamic Compilation • Just-in-time code generation and optimization for data intensive applications • Environment for i) compiler research, ii) architecture research, and iii) productivity tools [Figure labels: Language Front-End, Data Parallel IR; credits: esd.lbl.gov, R. Domingo & D. Kaeli (NEU)]

NVIDIA’s Compute Unified Device Architecture (CUDA) • Integrate the concept of a compute kernel called from standard languages • Multithreaded host programs • The compute kernel specifies data parallel computation as thousands of threads • An accelerator model of computing • Explicit functions for off-loading computation to GPUs • Data movement explicitly managed by the programmer

NVIDIA’s Compute Unified Device Architecture (CUDA) [Figure: host and GPU] • For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

Structure of a Compute Kernel • Parallel Thread Execution (PTX) instruction set architecture • Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs) • Barrier synchronization • Mapped to single instruction stream multiple data stream (SIMD) processor

NVIDIA Fermi GF 100 • 4 Graphics Processing Clusters (GPCs) containing 4 SMs each • Each SM has 32 ALUs, 4 SFUs, and 16 LS units • Each ALU has access to 1024 32-bit registers (total of 128 kB per SM) • Each SM has its own Shared Memory/L1 cache (64 kB total) • Unified L2 cache (768 kB) • Six 64-bit Memory Controllers (384 bits wide in total) [Figure: streaming multiprocessor (SM) and ALU]
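The totals quoted on this slide follow from the per-unit figures. The sketch below only redoes that arithmetic at compile time; the constant and function names are illustrative and do not come from any NVIDIA or Ocelot header.

```cpp
#include <cassert>
#include <cstdint>

// Per-unit figures as quoted on the GF100 slide (names are hypothetical).
constexpr std::uint32_t kGpcs              = 4;     // processing clusters
constexpr std::uint32_t kSmsPerGpc         = 4;     // SMs per cluster
constexpr std::uint32_t kAlusPerSm         = 32;    // ALUs per SM
constexpr std::uint32_t kRegsPerAlu        = 1024;  // registers visible to each ALU
constexpr std::uint32_t kBytesPerReg       = 4;     // 32-bit registers
constexpr std::uint32_t kMemControllers    = 6;
constexpr std::uint32_t kBitsPerController = 64;

constexpr std::uint32_t totalSms()  { return kGpcs * kSmsPerGpc; }
constexpr std::uint32_t totalAlus() { return totalSms() * kAlusPerSm; }

// Register file per SM: 32 ALUs x 1024 registers x 4 bytes.
constexpr std::uint32_t registerFileBytesPerSm() {
    return kAlusPerSm * kRegsPerAlu * kBytesPerReg;
}

// Aggregate memory interface width: 6 controllers x 64 bits.
constexpr std::uint32_t memoryBusBits() {
    return kMemControllers * kBitsPerController;
}

static_assert(totalSms() == 16, "4 clusters x 4 SMs");
static_assert(registerFileBytesPerSm() == 128 * 1024, "128 kB per SM");
static_assert(memoryBusBits() == 384, "384-bit memory interface");
```

The same figures give 16 x 32 = 512 ALUs across the chip.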

Ocelot Structure • Ocelot is a low-level compiler that runs after CUDA applications have been compiled with nvcc • Structured around a PTX IR • Compile stock CUDA applications without modification [Figure labels: CUDA Application, nvcc, PTX Kernel]

CUDA to PTX • PTX modules stored as string literals in fat binary • We ignore accompanying binary image (GPU native binary)

Dependencies
• Software
  • C++ Compiler (GCC 4.5.x)
  • Lex Lexer Generator (Flex 2.5.35)
  • YACC Parser Generator (Bison 2.4.1)
  • Scons (Python 2.7)
  • LLVM (3.1)
• Libraries
  • boost_system (1.46)
  • boost_filesystem (1.46)
  • boost_serialization (1.46)
  • GLEW (1.5) (optional, for GL interop)
  • GL (for NVIDIA GPU Devices)
• Library headers
  • Boost (1.46)
• Installation guide: http://code.google.com/p/gpuocelot/wiki/Installation

Ocelot Source Code
• Freely available via Google Code project site (New BSD License): http://code.google.com/p/gpuocelot/
• svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• ocelot/
  • analysis/ -- Analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- Implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- Internal representations (PTX, LLVM, AMD IL)
  • parser/ -- Parser (to PTX)
  • tools/ -- Standalone applications using Ocelot
  • trace/ -- Trace generation and analysis tools
  • translator/ -- Translators from PTX to LLVM and AMD IL
  • transforms/ -- Program transformations

Building GPU Ocelot
• Obtain source code
  • svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with Scons
  • sudo ./build.py --install
• Build and execute unit tests
  • sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directory:
  • /usr/local/include/ocelot
  • /usr/local/lib
• Installation guide: http://code.google.com/p/gpuocelot/wiki/Installation

Configuring Ocelot
• configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
  • trace specifies which trace generators are initially attached
  • executive controls device properties

    {
        trace: {
            memoryChecker: {
                enabled: true,
                checkInitialization: false
            },
            raceDetector: {
                enabled: false,
                ignoreIrrelevantWrites: true
            },
            debugger: {
                enabled: false,
                kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
                alwaysAttach: true
            },
        },
        executive: {
            devices: [ "emulated" ],
        }
    }

• trace:
  • memoryChecker - ensures that accesses fall within a memory region associated with the currently selected device
  • raceDetector - enforces synchronized access to .shared
  • debugger - interactive debugger
• executive:
  • devices: list of Ocelot backend devices that are enabled
    • nvidia - NVIDIA GPU backend
    • emulated - Ocelot PTX emulator (trace generators)
    • llvm - efficient execution of PTX on multicore CPU
    • amd - translation to AMD IL for PTX on AMD RADEON GPU

Building and Executing CUDA Programs • nvcc -c example.cu -arch sm_20 • g++ -o example example.o `OcelotConfig -l` • `OcelotConfig -l` expands to '-locelot' • libocelot.so replaces libcudart.so

CUDA Runtime API • Ocelot implements CUDA Runtime API • Transparent hooks into existing CUDA applications • Override methods of cuda::CudaRuntimeInterface • Maps CUDA RT onto Ocelot device interface abstraction • cuda::CudaRuntime • Extended through custom Ocelot API • e.g. ocelot::registerPTXModule( );

Ocelot CUDA Runtime Overview • A reimplementation of the CUDA Runtime API • Compatible with existing applications • Link against libocelot.so instead of libcudart.so • Kernels execute anywhere: key to portability! [Figure credit: R. Domingo & D. Kaeli (NEU)]

Ocelot CUDA Runtime • Clean device abstraction • All back-ends implement same interface • Ocelot API Extensions • Add/remove trace generators • Device memory sharing among host threads • Device switching

Ocelot Source Code: CUDA Runtime API
• ocelot/
  • cuda/ -- Implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h

Ocelot CUDA Runtime API Implementation • Implement interface defined by cuda::CudaRuntimeInterface • ocelot/cuda/interface/CudaRuntime.h • ocelot/cuda/implementation/CudaRuntime.cpp • class cuda::CudaRuntime • cuda::CudaRuntime members • Host thread contexts • Ocelot devices • Registered modules, textures, kernels • Fat binaries • Global mutex • CUDA Runtime API functions • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule() • Additional functions • e.g. _lock(), _unlock(), _registerModule()

Ocelot Source Code: Device Interface
• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h

Ocelot Device Interface • class executive::Device • Succinct interface for device objects • Module registration • Memory management • Kernel configuration and launching • Global variable and texture management • OpenGL interoperability • Streams and Events • Trace generators • Minimal set of APIs for device-oriented programming model • 57 functions (versus CUDA Runtime’s 120+) • Capture device state: • Memory allocations, global variables, textures, graphics interoperability • Facilitate creation of backend execution targets • Implement Device interface • Enable multiple API front ends • Implement front ends targeting Device interface
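The idea of one abstract device interface with interchangeable backends can be sketched as below. This is a minimal illustration under stated assumptions, not Ocelot's executive::Device: the class names, the integer allocation handles, and the four methods shown are all hypothetical, and only memory management is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily reduced device abstraction (not the Ocelot API).
class ToyDevice {
public:
    virtual ~ToyDevice() {}
    virtual std::string name() const = 0;
    virtual int allocate(std::size_t bytes) = 0;  // returns an opaque handle
    virtual void copyToDevice(int handle, const void* src, std::size_t bytes) = 0;
    virtual void copyFromDevice(void* dst, int handle, std::size_t bytes) = 0;
    virtual void release(int handle) = 0;
};

// One possible backend: "device" memory is ordinary host memory.
class HostBackedDevice : public ToyDevice {
public:
    std::string name() const override { return "host-backed"; }

    int allocate(std::size_t bytes) override {
        int handle = next_++;
        allocations_[handle] = std::vector<unsigned char>(bytes);
        return handle;
    }
    void copyToDevice(int handle, const void* src, std::size_t bytes) override {
        std::vector<unsigned char>& buffer = allocations_.at(handle);
        assert(bytes <= buffer.size());
        if (bytes != 0) std::memcpy(buffer.data(), src, bytes);
    }
    void copyFromDevice(void* dst, int handle, std::size_t bytes) override {
        const std::vector<unsigned char>& buffer = allocations_.at(handle);
        assert(bytes <= buffer.size());
        if (bytes != 0) std::memcpy(dst, buffer.data(), bytes);
    }
    void release(int handle) override { allocations_.erase(handle); }

    std::size_t liveAllocations() const { return allocations_.size(); }

private:
    int next_ = 1;
    std::map<int, std::vector<unsigned char> > allocations_;
};

// Front-end code depends only on the abstract interface, so any backend
// implementing ToyDevice could be substituted here.
inline std::vector<int> roundTrip(ToyDevice& device, const std::vector<int>& input) {
    const std::size_t bytes = input.size() * sizeof(int);
    int handle = device.allocate(bytes);
    device.copyToDevice(handle, input.data(), bytes);
    std::vector<int> output(input.size());
    device.copyFromDevice(output.data(), handle, bytes);
    device.release(handle);
    return output;
}
```

Because roundTrip takes a ToyDevice reference, adding a second backend would not require changing the front-end function, which is the property the slide attributes to the Device interface.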

Ocelot PTX Intermediate Representation (IR) • Backend compiler framework for PTX • Full-featured PTX IR • Class hierarchy for PTX instructions/directives • PTX control flow graph • Static single-assignment form • Dataflow/dominance analysis • Enables PTX optimization • IR to IR translation • From PTX to other IRs • LLVM (x86/PowerPC/ARM) • CAL (AMD GPUs)

Ocelot Source Code: Intermediate Representation
• ocelot/
  • ir/ -- Internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- Parser (to PTX)
    • interface/PTXParser.h

Ocelot PTX Internal Representation • C++ classes representing PTX module • ir::PTXModule • ir::PTXKernel • ir::PTXInstruction • ir::PTXOperand • ir::GlobalVariable • ir::LocalVariable • ir::Parameter • Translator source • PTX to LLVM • PTX to AMD IL • Suitable for analysis and transformation • Executable representation • PTX Emulator

Ocelot PTX IR: Kernels
• An ir::Module, with slide annotations naming the IR class for each construct

    .global .f32 globalVariable;                    // ir::Global

    .entry sequence (                               // ir::Kernel
        .param .u64 __cudaparm_sequence_A,          // ir::Parameter
        .param .s32 __cudaparm_sequence_N)
    {
        .reg .u32 %r<11>;
        .reg .u64 %rd<6>;
        .local .u32 %rp0;                           // ir::Local
        . . .
    $LDWbegin_sequence:                             // ir::BasicBlock
        ld.param.s32 %r6, [__cudaparm_sequence_N];
        setp.le.s32 %p1, %r6, %r5;
        @%p1 bra $Lt_0_1026;
        . . .
    $Lt_0_1026:                                     // ir::BasicBlock
        exit;
    $LDWend_sequence:
    } // sequence

Ocelot PTX IR: Instructions
• An ir::BasicBlock holds a sequence of ir::PTXInstruction objects

    add.s32       %r7, %r5, 1;
    ld.param.u64  %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32   %rd2, %r5;
    mul.wide.s32  %rd3, %r5, 4;
    add.u64       %rd4, %rd1, %rd3;
    st.global.s32 [%rd4 + 0], %r7;
    @%p1 bra      $Lt_0_6146;

• ir::PTXInstruction fields labeled on the slide: opcode, addressSpace, dataType, guard predicate, operands d and a
• Each operand is an ir::PTXOperand with an addressMode: address, register, immediate, indirect, or label
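To make the instruction and operand fields concrete, here is a minimal sketch of a data structure that carries an opcode, address space, data type, guard predicate, and operands with an address mode, and prints them back as PTX-like text. It is not ir::PTXInstruction or ir::PTXOperand; every type, field, and helper name below is hypothetical and far simpler than the real classes.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical operand: the address mode decides how it is printed.
struct ToyOperand {
    enum AddressMode { Register, Immediate, Indirect, Address, Label, Invalid };
    AddressMode addressMode = Invalid;
    std::string identifier;   // register, symbol, or label name
    long long   immediate = 0;
    int         offset = 0;

    std::string toString() const {
        std::ostringstream out;
        switch (addressMode) {
        case Register:  out << identifier; break;
        case Immediate: out << immediate; break;
        case Indirect:  out << "[" << identifier << " + " << offset << "]"; break;
        case Address:   out << "[" << identifier << "]"; break;
        case Label:     out << identifier; break;
        case Invalid:   break;
        }
        return out.str();
    }
};

inline ToyOperand makeOperand(ToyOperand::AddressMode mode, const std::string& identifier,
                              long long immediate = 0, int offset = 0) {
    ToyOperand operand;
    operand.addressMode = mode;
    operand.identifier = identifier;
    operand.immediate = immediate;
    operand.offset = offset;
    return operand;
}

// Hypothetical instruction: unused operands stay Invalid and are skipped.
struct ToyInstruction {
    std::string opcode;        // e.g. "add", "st", "bra"
    std::string addressSpace;  // e.g. "global"; empty if not applicable
    std::string dataType;      // e.g. "s32"; empty if not applicable
    ToyOperand guard;          // optional guard predicate
    ToyOperand d, a, b;        // destination and sources

    std::string toString() const {
        std::ostringstream out;
        if (guard.addressMode != ToyOperand::Invalid) out << "@" << guard.toString() << " ";
        out << opcode;
        if (!addressSpace.empty()) out << "." << addressSpace;
        if (!dataType.empty()) out << "." << dataType;
        const ToyOperand* operands[] = { &d, &a, &b };
        bool first = true;
        for (const ToyOperand* operand : operands) {
            if (operand->addressMode == ToyOperand::Invalid) continue;
            out << (first ? " " : ", ") << operand->toString();
            first = false;
        }
        out << ";";
        return out.str();
    }
};

// Rebuilds three of the instructions shown on the slide.
inline ToyInstruction sampleAdd() {
    ToyInstruction inst;
    inst.opcode = "add";
    inst.dataType = "s32";
    inst.d = makeOperand(ToyOperand::Register, "%r7");
    inst.a = makeOperand(ToyOperand::Register, "%r5");
    inst.b = makeOperand(ToyOperand::Immediate, "", 1);
    return inst;
}

inline ToyInstruction sampleStore() {
    ToyInstruction inst;
    inst.opcode = "st";
    inst.addressSpace = "global";
    inst.dataType = "s32";
    inst.d = makeOperand(ToyOperand::Indirect, "%rd4", 0, 0);
    inst.a = makeOperand(ToyOperand::Register, "%r7");
    return inst;
}

inline ToyInstruction sampleBranch() {
    ToyInstruction inst;
    inst.opcode = "bra";
    inst.guard = makeOperand(ToyOperand::Register, "%p1");
    inst.d = makeOperand(ToyOperand::Label, "$Lt_0_6146");
    return inst;
}
```

Keeping the address mode on the operand rather than on the instruction lets one instruction mix register, immediate, and memory operands, which matches the per-operand annotations on the slide.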

Control and Data-Flow Graphs • Data structure for representing kernels • Basic blocks • fall-through and branch edges • instruction vector • label • Block traversals: • pre-order, topological, post-order • Data-flow graph overlaying CFG • definition-use chains, etc. • CFG Transformations: • split blocks, edges • DFG Transformations: • insert and remove values • iterate over def-use
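The block traversals listed above can be illustrated on a toy graph. The sketch below assumes a CFG stored as a vector of blocks with successor indices (this layout and all names are hypothetical, not ir::ControlFlowGraph) and computes a depth-first post-order. Reversing that order gives a topological order when the graph has no loops.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy CFG: successors are indices into ToyCfg::blocks.
struct ToyBlock {
    std::string label;
    std::vector<int> successors;
};

struct ToyCfg {
    std::vector<ToyBlock> blocks;
    int entry = 0;
};

inline void visitPostOrder(const ToyCfg& cfg, int block,
                           std::vector<bool>& visited, std::vector<int>& order) {
    visited[block] = true;
    for (int successor : cfg.blocks[block].successors) {
        if (!visited[successor]) visitPostOrder(cfg, successor, visited, order);
    }
    order.push_back(block);  // emitted after all reachable successors
}

// Depth-first post-order starting at the entry block.
inline std::vector<int> postOrder(const ToyCfg& cfg) {
    std::vector<bool> visited(cfg.blocks.size(), false);
    std::vector<int> order;
    if (!cfg.blocks.empty()) visitPostOrder(cfg, cfg.entry, visited, order);
    return order;
}

// Reverse post-order; a topological order if the CFG is acyclic.
inline std::vector<int> reversePostOrder(const ToyCfg& cfg) {
    std::vector<int> order = postOrder(cfg);
    std::reverse(order.begin(), order.end());
    return order;
}

// Diamond-shaped CFG: 0 -> {1, 2}, 1 -> 3, 2 -> 3.
inline ToyCfg makeDiamond() {
    ToyCfg cfg;
    cfg.blocks.resize(4);
    cfg.blocks[0].label = "entry";
    cfg.blocks[1].label = "then";
    cfg.blocks[2].label = "else";
    cfg.blocks[3].label = "exit";
    cfg.blocks[0].successors.push_back(1);
    cfg.blocks[0].successors.push_back(2);
    cfg.blocks[1].successors.push_back(3);
    cfg.blocks[2].successors.push_back(3);
    return cfg;
}
```

On the diamond, the entry block comes last in post-order and first in reverse post-order, so a forward data-flow analysis that visits blocks in reverse post-order sees every predecessor before its successor.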

Example: Control-Flow Graphs

    // example: splits basic blocks containing barriers
    //
    for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
         bb_it != kernel->cfg()->end(); ++bb_it) {
        // iterate over basic blocks
        unsigned int n = 0;
        ir::BasicBlock::InstructionList::iterator inst_it;
        for (inst_it = (bb_it)->instructions.begin();
             inst_it != (bb_it)->instructions.end(); ++inst_it, n++) {
            // iterate over instructions in *bb_it
            const ir::PTXInstruction *inst =
                static_cast<const ir::PTXInstruction *>(*inst_it);
            if (inst->opcode == ir::PTXInstruction::Bar) {
                if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                    std::string label = (bb_it)->label + "_bar";
                    // split block containing bar.sync so that it is
                    // always the last instruction in a block
                    kernel->cfg()->split_block(bb_it, n+1,
                        ir::BasicBlock::Edge::FallThrough, label);
                }
                break;
            }
        } // end for (inst_it)
    } // end for (bb_it)
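The same splitting idea can be tried outside Ocelot on a toy representation. The sketch below assumes blocks are stored as a flat list in fall-through order and instructions are plain strings; these types and the function name are hypothetical. Unlike the loop above, which splits at the first barrier and relies on the outer iteration reaching the newly created block, this version handles every barrier of a block in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy block: a label plus instructions as text.
struct ToyBasicBlock {
    std::string label;
    std::vector<std::string> instructions;
};

// Returns a new block list in which every "bar.sync" is the last
// instruction of its block. Consecutive blocks are implicitly joined
// by fall-through edges.
inline std::vector<ToyBasicBlock> splitAtBarriers(const std::vector<ToyBasicBlock>& blocks) {
    std::vector<ToyBasicBlock> result;
    for (const ToyBasicBlock& block : blocks) {
        ToyBasicBlock current;
        current.label = block.label;
        for (std::size_t i = 0; i < block.instructions.size(); ++i) {
            const std::string& text = block.instructions[i];
            current.instructions.push_back(text);
            const bool isBarrier = text.compare(0, 8, "bar.sync") == 0;
            const bool isLast = (i + 1 == block.instructions.size());
            if (isBarrier && !isLast) {
                // Close the current block and continue in a fresh one,
                // using the same "_bar" label suffix as the example above.
                const std::string nextLabel = current.label + "_bar";
                result.push_back(current);
                current = ToyBasicBlock();
                current.label = nextLabel;
            }
        }
        result.push_back(current);
    }
    return result;
}

// A block with one barrier in the middle and one at the end.
inline std::vector<ToyBasicBlock> makeBarrierExample() {
    ToyBasicBlock block;
    block.label = "BB0";
    block.instructions.push_back("add.s32 %r1, %r2, %r3;");
    block.instructions.push_back("bar.sync 0;");
    block.instructions.push_back("mul.lo.s32 %r4, %r1, %r1;");
    block.instructions.push_back("bar.sync 0;");
    return std::vector<ToyBasicBlock>(1, block);
}
```

After the split, a pass that treats barriers as block terminators needs to look only at the final instruction of each block.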

Example: Spilling Live Values

    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive ) {
        unsigned int bytes = 0;

        // load the base address of the spill stack into a new register
        ir::PTXInstruction move ( ir::PTXInstruction::Mov );
        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;
        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );
        ...

Example: Spilling Live Values

        ...
        // store each live register at its own offset in the spill stack
        for( DataflowGraph::Block::RegisterSet::const_iterator reg = alive.begin();
             reg != alive.end(); ++reg ) {
            ir::PTXInstruction save( ir::PTXInstruction::St );
            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;
            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;
            bytes += ir::PTXOperand::bytes( save.type );
            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;

            _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
        }
        _spillBytes = std::max( bytes, _spillBytes );
    }

IR for AMD and LLVM • LLVM IR • Implements all of the LLVM instruction set • AMD IL • Supports translation from PTX to AMD IL [Figure: AMD backend; credit: R. Domingo & D. Kaeli (NEU)]

PTX Pass Manager • Orchestrates analysis and transformation passes • Derived from LLVM model • Analysis Passes generate meta-data • Meta-data consumed by transformations • Transformation Passes modify the IR

Using the Pass Manager • Passes added to a manager • Schedules execution • Manages meta-data • Ensures meta-data available • Up to date; not redundantly computed
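The scheduling and meta-data bookkeeping described above can be sketched with a toy manager. Everything here is an assumption made for illustration: the "IR" is a vector of integers, each analysis result is a single integer, and the class and method names are hypothetical rather than Ocelot's pass manager API. The sketch shows two behaviors from the slide: an analysis is computed only when a pass requires it and no cached result exists, and cached results are dropped once a transformation reports that it changed the IR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

class ToyPassManager {
public:
    typedef std::vector<int> Ir;                    // stand-in for a kernel
    typedef std::map<std::string, int> Metadata;    // analysis name -> result
    typedef std::function<int(const Ir&)> Analysis;

    struct Transform {
        std::vector<std::string> required;               // analyses consumed
        std::function<bool(Ir&, const Metadata&)> run;   // true if IR changed
    };

    void registerAnalysis(const std::string& name, Analysis analysis) {
        analyses_[name] = analysis;
    }
    void addPass(const Transform& pass) { passes_.push_back(pass); }

    void runOn(Ir& ir) {
        Metadata cache;
        for (const Transform& pass : passes_) {
            // Make sure every required analysis is available, reusing
            // cached results instead of recomputing them.
            for (const std::string& name : pass.required) {
                if (cache.count(name) == 0) {
                    cache[name] = analyses_.at(name)(ir);
                    ++analysisRuns_;
                }
            }
            // Conservatively invalidate all meta-data if the IR changed.
            if (pass.run(ir, cache)) cache.clear();
        }
    }

    int analysisRuns() const { return analysisRuns_; }

private:
    std::map<std::string, Analysis> analyses_;
    std::vector<Transform> passes_;
    int analysisRuns_ = 0;
};

// Schedules two read-only passes, one modifying pass, then another
// read-only pass, and reports how often the analysis actually ran.
inline int demoAnalysisRuns() {
    ToyPassManager manager;
    manager.registerAnalysis("size", [](const ToyPassManager::Ir& ir) {
        return static_cast<int>(ir.size());
    });

    ToyPassManager::Transform readOnly;
    readOnly.required.push_back("size");
    readOnly.run = [](ToyPassManager::Ir&, const ToyPassManager::Metadata&) {
        return false;
    };

    ToyPassManager::Transform dropZeros;
    dropZeros.required.push_back("size");
    dropZeros.run = [](ToyPassManager::Ir& ir, const ToyPassManager::Metadata&) {
        const std::size_t before = ir.size();
        ir.erase(std::remove(ir.begin(), ir.end(), 0), ir.end());
        return ir.size() != before;
    };

    manager.addPass(readOnly);
    manager.addPass(readOnly);
    manager.addPass(dropZeros);
    manager.addPass(readOnly);

    ToyPassManager::Ir ir;
    ir.push_back(1);
    ir.push_back(0);
    ir.push_back(2);
    ir.push_back(0);
    manager.runOn(ir);
    return manager.analysisRuns();
}
```

Clearing the whole cache on any change is the simplest safe policy; a finer-grained manager could let each pass declare which analyses it preserves.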

Analysis Passes • Analysis runs over the PTX IR • Generates meta-data • Possibly updates or invalidates existing meta-data • Examples • Data-flow graph • Dominator and Post-dominator trees • Thread frontiers

Analysis Passes
• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h

Transformation Passes • Modify the PTX IR • Consume meta-data • Examples: • Dead-code elimination • transforms/interface/DeadCodeEliminationPass.h • Control-flow structuring • transforms/interface/StructuralTransform.h • Sync elimination • transforms/interface/SyncElimination.h • Dynamic instrumentation

Example: Dead Code Elimination Transformation Pass

Dead Code Elimination • Approach • Run once on each kernel • Consume data-flow analysis meta-data • Delete instructions producing values with no users • Implementation • transforms/interface/DeadCodeEliminationPass.h • transforms/implementation/DeadCodeEliminationPass.cpp

Dead Code Elimination (1 of 5)
• Set up pass dependencies

    DeadCodeEliminationPass::DeadCodeEliminationPass()
        : KernelPass(Analysis::DataflowGraphAnalysis
                     | Analysis::StaticSingleAssignment,
                     "DeadCodeEliminationPass") {
    }

Dead Code Elimination (2 of 5)
• Run pass

    void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k) {

• Get analysis metadata

        Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
        assert(dfgAnalysis != 0);

        // downcast to the concrete analysis type
        analysis::DataflowGraph& dfg =
            *static_cast<analysis::DataflowGraph*>(dfgAnalysis);

        assert(dfg.ssa());

Dead Code Elimination (3 of 5)
• Queue every block, then process the worklist until it is empty

        BlockSet blocks;

        for (iterator block = dfg.begin(); block != dfg.end(); ++block) {
            report(" Queueing up BB_" << block->id());
            blocks.insert(block);
        }

        while (!blocks.empty()) {
            iterator block = *blocks.begin();
            blocks.erase(blocks.begin());

            eliminateDeadInstructions(dfg, blocks, block);
        }

Dead Code Elimination (4 of 5)
• Remove unused live-out values

        AliveKillList aliveOutKillList;

        for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
             aliveOut != block->aliveOut().end(); ++aliveOut) {
            if (canRemoveAliveOut(dfg, block, *aliveOut)) {
                report(" removed " << aliveOut->id);
                aliveOutKillList.push_back(aliveOut);
            }
        }

        for (AliveKillList::iterator killed = aliveOutKillList.begin();
             killed != aliveOutKillList.end(); ++killed) {
            block->aliveOut().erase(*killed);
        }

Dead Code Elimination (5 of 5)
• Check if an instruction can be removed

        if (ptx.hasSideEffects()) return false;

        for (RegisterPointerVector::iterator reg = instruction->d.begin();
             reg != instruction->d.end(); ++reg) {
            // the reg is alive outside the block
            if (block->aliveOut().count(*reg) != 0) return false;

            InstructionVector::iterator next = instruction;
            for (++next; next != block->instructions().end(); ++next) {
                for (RegisterPointerVector::iterator source = next->s.begin();
                     source != next->s.end(); ++source) {
                    // found a user in the block
                    if (*source->pointer == *reg->pointer) return false;
                }
            }
        }
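The removal test above (no side effects, destination not live out of the block, no later user in the block) can be exercised on a toy single-block program. The sketch below is a simplified stand-in, not the Ocelot pass: the instruction type and function name are hypothetical, each instruction has at most one destination, and the code is assumed to be in SSA form so a register name identifies exactly one definition.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy instruction for a single basic block in SSA form.
struct ToyDceInstruction {
    std::string dest;                  // empty if nothing is defined
    std::vector<std::string> sources;  // registers read
    bool hasSideEffects;               // e.g. stores, barriers, branches
};

// Removes instructions whose result is never used. Walking backwards
// lets one sweep remove whole chains of dead definitions; the outer
// loop repeats until a sweep removes nothing.
inline std::vector<ToyDceInstruction> eliminateDeadCode(
        std::vector<ToyDceInstruction> code,
        const std::set<std::string>& liveOut) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = code.size(); i-- > 0; ) {
            if (code[i].hasSideEffects) continue;         // must be kept
            if (liveOut.count(code[i].dest) != 0) continue;  // used after the block

            // Look for a user later in the same block.
            bool used = false;
            for (std::size_t j = i + 1; j < code.size() && !used; ++j) {
                for (const std::string& source : code[j].sources) {
                    if (source == code[i].dest) { used = true; break; }
                }
            }
            if (!used) {
                code.erase(code.begin() + static_cast<std::ptrdiff_t>(i));
                changed = true;
            }
        }
    }
    return code;
}

// %r1 = load; %r2 = add %r1; %r3 = mul %r2 (dead); optional store of %r2.
inline std::vector<ToyDceInstruction> makeDceExample(bool withStore) {
    std::vector<ToyDceInstruction> code;
    ToyDceInstruction load = { "%r1", std::vector<std::string>(), false };
    ToyDceInstruction add  = { "%r2", std::vector<std::string>(1, "%r1"), false };
    ToyDceInstruction mul  = { "%r3", std::vector<std::string>(1, "%r2"), false };
    code.push_back(load);
    code.push_back(add);
    code.push_back(mul);
    if (withStore) {
        ToyDceInstruction store = { "", std::vector<std::string>(1, "%r2"), true };
        code.push_back(store);
    }
    return code;
}
```

With the store present only the multiply is removed, since the add feeds the store and the load feeds the add. Without the store and with nothing live out, all three instructions are removed.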

Running Passes on PTX • Static optimizer • PTXOptimizer • Runs passes on PTX assembly files • ocelot/tools/PTXOptimizer.cpp • JIT optimization • Runs passes before kernels are launched • ocelot/api/implementation/OcelotRuntime.cpp

Questions • GPU Ocelot • Google Code site: http://code.google.com/p/gpuocelot • Research Project site: http://gpuocelot.gatech.edu • Mailing list: gpuocelot@googlegroups.com • Contributors • Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili • Sponsors • AMD, IBM, Intel, LogicBlox, NSF, NVIDIA
