JASM: A Java Library for the Generation and Scheduling of PTX Assembly
Christos Kartsaklis – christos.kartsaklis@ichec.ie – ICHEC
Purpose • NVIDIA GPUs • Many mutually conflicting parameters affect performance. • Some are not nvcc-tunable. • JASM • Similar to a compiler back-end, but programmable itself. • Different constructs to generate variants of the same kernel. • Explore the optimisation-strategy space faster. • Aimed at CUDA programmers facing instruction-level bottlenecks.
Structure • Introduction • Features absent from nvcc • Code compilation • JASM • Library Features • Dependencies • Aliasing • Predication • Snippets • Reverse compilation • Summary
Features absent from nvcc – 1/3 • What is PTX? • Virtual machine ISA for NVIDIA's GPUs. • Generated by nvcc (-ptx) and compiled by ptxas. • nvcc limitation • Cannot write inline PTX in CUDA C/C++. • Some additional #pragma directives would also be needed (next two slides).
Features absent from nvcc – 2/3 • Predication • Not exposed; would like to have:
  predicate p = ...; // some condition
  #pragma predicate using p
  if (p) { ...
• Address space derivation • nvcc must be able to determine which memory space a pointer refers to; cannot do:
  *((__shared__ float*) someint) = ...
  double d = *(__constant__ double*) foo;
Features absent from nvcc – 3/3 • Latency Hints • Hard for the compiler to determine (non-)coalesced accesses; ideally:
  #pragma !coalesced
  foo = bar[i];
• Reflection
  __device__ float foo(...) { ... }
  #pragma N = registersUsedBy(foo);
Code compilation – 1/2 • Instruction Generation • The compiler does it for you: • High-level code → intermediate form (I/F). • Transform the I/F. • Generate machine code from the transformed I/F. • How good is the generated code? • You need to inspect it manually.
Code compilation – 2/2 • Instruction Scheduling • Core part of any compiler. • Determines the order in which instructions execute. • Purpose • Correctness, latency hiding and ILP. • Problems • Hard to steer from a high-level language. • The compiler often generates code of its own. • #pragma directives & compiler options are the only handles.
JASM – 1/3 • Dedicated tools exist, such as BAGEL • The user selects instructions and BAGEL schedules them. • Generation code is written in C using the BAGEL library. • Uses the notion of "abstract instructions". • JASM • Similar philosophy – enhanced functionality. • Focus: • Reflection. • User-programmable instruction scheduling.
JASM – 2/3 • Basic Block (BB) • A straight-line sequence of instructions that does not change the flow of execution. • Control Flow Graph (CFG) • A directed graph of BBs whose edges indicate changes in execution flow. • Instruction Stream • The order of instructions in memory. • Each instruction is "notionally" part of a BB.
JASM – 3/3 • Examples (a combined sketch follows) • Append an instruction to a basic block:
  lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
• Branching:
  lSource.branchTo(lTargetBB, false);
• Reorder:
  lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
• Predicate:
  lBB = lBB.predicate(lP, lRegisterFile);
• Obtain macro:
  SnippetDescriptor lSD = CToHLA.obtain("x*x / (y-z)", ...);
  BasicBlock lNewBB = lSD.instantiate(lArg1, ...);
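The calls above compose into a small generator. The sketch below is assembled only from the calls shown on this slide plus the allocate() call from the later Predication slide; the BasicBlock constructor, the setup of lRegFile and lTargetBB, and the exact operand order of append() are assumptions, not documented JASM API.

  // Assumed setup: lRegFile is a JASM register file, lTargetBB an existing BasicBlock.
  Register lArg1 = lRegFile.allocate(BT.S32, "%r1");  // allocate() as on the Predication slide
  Register lArg2 = lRegFile.allocate(BT.S32, "%r2");

  BasicBlock lBB = new BasicBlock();                  // hypothetical constructor
  lBB.append(LDSHAREDS32, lArg1, 4, lArg2);           // operand order copied from the slide

  // Search for a cheaper legal order of the block's instructions.
  lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);

  lBB.branchTo(lTargetBB, false);                     // wire the block into the CFG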
Dependencies – 1/2 • All dependencies contribute to the final instruction stream. • What is the ideal layout? • What complicates the compiler's job • Not enough information to distinguish between true and false dependencies. • Variable-latency instructions • E.g. coalesced vs non-coalesced accesses.
Dependencies – 2/2 • JASM determines instruction order based on: • Dependency Analysis (DA) modules & a Cost Function. • Full space exploration – no heuristics (see the toy sketch below): • DAs constrain instructions' motion in the stream. • The cost function estimates execution time for any stream. • Scheduling is done by an external constraint solver. • Only applicable to basic blocks.
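To make the division of labour concrete, here is a toy, self-contained Java illustration of "full space exploration" scheduling: the DA modules act as legality filters over candidate streams and the cost function ranks the legal ones. This is not JASM's implementation (JASM hands the problem to an external constraint solver); every name below is invented for the illustration.

  import java.util.*;

  final class ToyScheduler {
      interface DependencyAnalysis { boolean allows(List<String> stream); }
      interface CostFunction { double cost(List<String> stream); }

      // Enumerate every order, keep only the legal ones, return the cheapest.
      static List<String> schedule(List<String> instrs,
                                   CostFunction cf,
                                   DependencyAnalysis... das) {
          List<String> best = null;
          double bestCost = Double.POSITIVE_INFINITY;
          for (List<String> candidate : permutations(instrs)) {
              boolean legal = true;
              for (DependencyAnalysis da : das) legal &= da.allows(candidate);
              if (!legal) continue;
              double c = cf.cost(candidate);
              if (c < bestCost) { bestCost = c; best = candidate; }
          }
          return best;
      }

      static List<List<String>> permutations(List<String> xs) {
          List<List<String>> out = new ArrayList<>();
          permute(new ArrayList<>(xs), 0, out);
          return out;
      }

      private static void permute(List<String> xs, int k, List<List<String>> out) {
          if (k == xs.size()) { out.add(new ArrayList<>(xs)); return; }
          for (int i = k; i < xs.size(); i++) {
              Collections.swap(xs, k, i);
              permute(xs, k + 1, out);
              Collections.swap(xs, k, i);
          }
      }

      public static void main(String[] args) {
          List<String> instrs = List.of("ld A", "ld B", "mul C");
          // Toy DA: "mul C" must come last (it consumes A and B).
          DependencyAnalysis da = s -> s.indexOf("mul C") == s.size() - 1;
          // Toy cost: earlier loads are cheaper (more latency hidden).
          CostFunction cf = s -> s.indexOf("ld A") + s.indexOf("ld B");
          System.out.println(schedule(instrs, cf, da));
      }
  }

Exhaustive enumeration is only viable because scheduling is restricted to single basic blocks, which are typically short.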
Aliasing – 1/3 • PTX is not the final code. • It is further optimised by ptxas before machine-code generation. • We want to specify exactly what is and is not aliased. • There is no #pragma aliased / disjoint in PTX. • Goal: • Simplify the declaration of aliased/disjoint accesses. • Handle all memory spaces.
Aliasing – 2/3 • JASM Addressable Memory Regions (AMRs) • An n-ary tree of AMRs where: • Root nodes represent spaces (e.g. shared, global). • Each node is a memory region and a sub-region of its parent's. • Siblings are disjoint regions, collectively making up their parent's region. • Instructions are associated with AMRs. • AMRs are predefined for the CUDA memory spaces.
Aliasing – 3/3 • Example (instructions associated with the global-memory AMR; see the hypothetical sketch below):
  01: st.global.f32 [%r0], %f5
  02: ld.global.f64 %lf8, [%r1]
  03: st.global.f32 [%r3], %f4
  04: st.global.s32 [%r8], %s1
• Generally: • No need for pointer grouping (à la "#pragma disjoint" etc.) • We work with instructions, not pointers.
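The slides do not show the AMR API itself, so the sketch below invents all of its names (AMR, space(), subRegion(), setRegion()) purely to illustrate the tree model: siblings are disjoint by construction, so tagging the two stores with sibling AMRs tells the aliasing DA that instructions 01 and 04 may be reordered freely.

  // Hypothetical AMR declarations -- illustrative names only, not JASM API.
  AMR global = AMR.space("global");       // root node: the global memory space
  AMR a = global.subRegion("a");          // siblings => disjoint by construction
  AMR b = global.subRegion("b");

  lStore1.setRegion(a);   // 01: st.global.f32 [%r0], %f5
  lStore2.setRegion(b);   // 04: st.global.s32 [%r8], %s1
  // DA_ALIASING now knows 01 and 04 touch disjoint regions and will not
  // serialise them during reorder().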
Predication – 1/3 • Conditional execution:
  if (foo == 0) bar = 5;
• Causes thread divergence. • Predicated execution:
  setp.eq.s32 %p, foo, 0;
  @%p mov.s32 bar, 5;
• A non-divergent cycle-burner. • There is a fine line between the two. • Code cannot be predicated explicitly in CUDA.
Predication – 2/3 • Explicit • Can allocate predicates, modify them and predicate instructions. • Example: • Direct:
  Register lP = lRegFile.allocate(BT.PRED, "%p1");
  // @%p1 mov.s32 bar, 5
  lBB.append(PMOVS32, lP, lBar, new Immediate(5));
• By reflection:
  Instruction lT = new InstructionImpl(MOVS32, lBar, new Immediate(5));
  lBB.append(lT.descriptor().toPredicatedVersion(), lP, lBar, new Immediate(5));
Predication – 3/3 • Any basic block can be predicated. • Including already-predicated instructions. • Example (the combined calls are sketched below):
  @%p mov.s32 %d, %s; // if (%p) %d = %s;
• Predicate by %q • Output:
  @%q  and.pred %t, %p, %q;
  @!%q mov.pred %t, %q;  // i.e. %t = %q ? (%p && %q) : false
  @%t  mov.s32 %d, %s;
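Combining the two steps from these slides, double predication might look as follows; predicate() and allocate() are taken from the earlier examples, while the assumption (suggested by the output above) is that the RegisterFile argument supplies scratch predicates such as %t.

  Register lP = lRegFile.allocate(BT.PRED, "%p");
  Register lQ = lRegFile.allocate(BT.PRED, "%q");

  BasicBlock lPred = lBB.predicate(lP, lRegFile);  // every instruction now guarded by @%p
  lPred = lPred.predicate(lQ, lRegFile);           // folds %p and %q into a scratch %t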
Snippets – 1/3 • Problem: • Certain tasks require knowing in advance how the compiler treats a piece of code. • Software pipelining:
  template<typename T>
  void vmult(T* aDst, T* aSrc1, T* aSrc2) {
    for (int i = 0; i < N; i++)
      aDst[i] = aSrc1[i] * aSrc2[i];
  }
Snippets – 2/3 • Consider H/W vs S/W instructions • Tradeoff between pipeline stalls & register pressure. • However, register pressure: • Is also a function of the # of thread blocks. • Ideally • We want to generate pipelined code for a variable number of dependencies. • Solution: • Encapsulate the function in a reflective macro. • Reflect instructions & dependencies.
Snippets – 3/3 • Consider the complex "multiplication": • (a+ib)*(c+id) → (ac+bd) + i(ad+bc) • 2 stages: 2 muls, then 2 mads. • Snippet descriptor organisation (instantiation sketched below): • Group 0:
  mul.f32 ?x, ?a, ?c;  mul.f32 ?y, ?a, ?d;
• Group 1:
  mad.f32 ?x, ?b, ?d, ?x;  mad.f32 ?y, ?b, ?c, ?y;
• ?* items are parameters. • Any basic block can be "snippetised".
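Instantiating such a descriptor could then look like the sketch below; instantiate() is the call from the JASM examples slide, while the descriptor construction and the binding order of the ?a..?y parameters to registers are assumptions.

  // Assuming lSD describes the two groups above, binding ?a, ?b, ?c, ?d, ?x, ?y in order.
  BasicBlock lCMul = lSD.instantiate(lA, lB, lC, lD, lX, lY);
  // Because the groups expose the inter-stage dependencies, several
  // instantiations can be interleaved (software-pipelined) by reorder().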
Reverse compilation – 1/3 • What to do with legacy CUDA code? • Option: manually re-write it in JASM? • No: any PTX file can be loaded into JASM. • And not just loaded in: • It is organised into basic blocks within a Control Flow Graph. • It is malleable from then on, like any other JASM code.
Reverse compilation – 2/3 • Inlined C in JASM • Idea: obtain a snippet from a C function. • Opposite of “inlined assembly in C”. • Why? • Reuse what nvcc makes available. • Enjoy the benefits that come with snippets.
Reverse compilation – 3/3 • At the moment, we can do the following: • Code:
  SnippetDescriptor lSD = CToHLA.obtain(
      "(x*y) % 3", "int", "r", "int", "x", "int", "y");
• r is the return parameter; x & y are arguments. • Equivalent to:
  int r = (x*y) % 3;
• Now we can write (see the sketch below):
  if (lSD.numberOfRegisters() > 5) { ...
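Putting reflection to work, a generator can pick between kernel variants at generation time. The sketch below uses only obtain(), numberOfRegisters() and instantiate() from the slides; the threshold and the fallback helper emitLowPressureVariant() are hypothetical.

  SnippetDescriptor lSD = CToHLA.obtain(
      "(x*y) % 3", "int", "r", "int", "x", "int", "y");

  if (lSD.numberOfRegisters() > 5) {
      emitLowPressureVariant(lBB);         // hypothetical fallback generator
  } else {
      lBB = lSD.instantiate(lR, lX, lY);   // registers allocated elsewhere
  }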
Summary • NVIDIA GPUs • Many mutually conflicting parameters affect performance. • Some are not nvcc-tunable. • JASM • Similar to a compiler back-end, but programmable itself. • Different constructs to generate variants of the same kernel. • Explore the optimisation-strategy space faster. • The optimisations are expressed as a function of the code.