1 / 27

JASM: A Java Library for the Generation and Scheduling of PTX Assembly

JASM: A Java Library for the Generation and Scheduling of PTX Assembly. Christos Kartsaklis christos.kartsaklis@ichec.ie ICHEC. Purpose. NVIDIA GPUs Many self-conflicting parameters affect performance. Some not nvcc -tunable . JASM Similar to a compiler back-end but programmable itself.

alton
Download Presentation

JASM: A Java Library for the Generation and Scheduling of PTX Assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. JASM: A Java Library for the Generation and Scheduling of PTX Assembly Christos Kartsaklis christos.kartsaklis@ichec.ie ICHEC

  2. Purpose • NVIDIA GPUs • Many self-conflicting parameters affect performance. • Some not nvcc-tunable. • JASM • Similar to a compiler back-end but programmable itself. • Different constructs to generate variants of the same kernel. • Explore the optimisations strategy space faster. • For CUDA programmers and instruction bottlenecks.

  3. Structure • Introduction • Features absent from nvcc • Code compilation • JASM • Library Features • Dependencies • Aliasing • Predication • Snippets • Reverse compilation • Summary

  4. Features absent from nvcc – 1/3 • What is PTX? • Virtual machine ISA for NVIDIA’s GPUs. • Generated by nvcc (-ptxas) and compiled by ptxas. • nvcc limitation • Cannot write inlined PTX in CUDA C/C++. • Some extra #pragmas needed.

  5. Features absent from nvcc – 2/3 • Predication • Not exposed; would like to have: • predicate p = ...; // some condition • #pragma predicate using p • if (p) { ... • Address space derivation • nvcc must be able to determine what space pointers refer to; cannot do: • *(__shared__ float* someint) = ... • double d = *(__constant__ double*) foo;

  6. Features absent from nvcc – 3/3 • Latency Hints • Hard for the compiler to determine (non)coalesced accesses; ideally: • #pragma !coalescedfoo = bar[i]; • Reflection • __device__ float foo(...) { ... }#pragma N=registersUsedBy(foo);

  7. Code compilation – 1/2 • Instruction Generation • Compiler does it for you • High-level code  intermediary form (I/F). • Transform the I/F. • Generate machine code from the transformed I/F. • How good the generated code is? • Need to manually inspect it.

  8. Code compilation – 2/2 • Instruction Scheduling • Core part of any compiler. • Determines the order that instructions will execute in. • Purpose • Correctness, latency hiding and ILP. • Problems • Hard to steer from a high-level language. • Compiler often generates its own code. • #pragmadirectives & compiler options.

  9. JASM – 1/3 • Dedicated tools, such as BAGEL • User selects instructions and BAGEL schedules. • Generating code written in C using the BAGEL lib. • Uses the notion of “abstract instructions”. • JASM • Similar philosophy – enhanced functionality. • Focus: • Reflection. • User-Programmable Instruction Scheduling.

  10. JASM – 2/3 • Basic Block (BB) • A bunch of instructions that do not change the flow of execution. • Control Flow Graph (CFG) • A directed graph of BBs where edges indicate changes in execution flow. • Instructions Stream • The order of instructions in memory. • Each instruction is “notionally” part of a BB.

  11. JASM – 3/3 • Examples • Append an instruction to a basic block: • lBB.append(LDSHAREDS32, lArg1, 4, lArg2); • Branching: • lSource.branchTo(lTargetBB, false); • Reorder: • lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING); • Predicate: • lBB = lBB.predicate(lP, lRegisterFile); • Obtain macro: • SnippetDescriptorlSD = CToHLA.obtain(“x*x / (y-z)”,...);BasicBlocklNewBB = lSD.instantiate(lArg1, ...);

  12. Structure • Introduction • Features absent from nvcc • Code compilation • JASM • Library Features • Dependencies • Aliasing • Predication • Snippets • Reverse compilation • Summary

  13. Dependencies – 1/2 • All contribute to the final instructions stream. • What is the ideal layout? • What complicates the compiler • Not enough information to distinguish between true and false dependencies. • Variable-latency instructions • E.g. coalesced vs non-coalesced accesses.

  14. Dependencies – 2/2 • JASM determines instruction order based on: • Dependency Analysis (DA) modules & Cost Function. • Full space exploration – no heuristics: • DAs constrain instructions’ motion in the stream. • Cost function estimates execution time for any stream. • Scheduling done by external constraints solver. • Only applicable to basic blocks.

  15. Aliasing – 1/3 • PTX is not the final thing. • Further optimised by ptxas before machine code generation. • Want to specify exactly what is and what’s not aliased. • No #pragma aliased / disjoint in PTX. • Goal: • Simplify declaration of aliasing/disjoint accesses. • Handle all memory spaces.

  16. Aliasing – 2/3 • JASM Addressable Memory Regions (AMRs) • An n-ary tree of AMRs where: • Root nodes represent spaces (e.g. shared, global) • Each node is a memory region and a sub-region of its parent’s. • Siblings are disjoint regions, collectively making their parent’s • Instructions are associated with AMRs. • AMRs predefined for CUDA memory spaces.

  17. Aliasing – 3/3 • Example • Generally: • No need for pointer grouping (a la “#pragma disjoint” etc.) • We work with instructions, not pointers. global mem AMR 01: st.global.f32 [%r0], %f5 02: ld.global.f64 [%r1], %lf8 03: st.global.f32 [%r3], %f4 04: st.global.s32 [%r8], %s1

  18. Predication – 1/3 • Conditional execution • if (foo==0) bar=5; • Thread divergence. • Predicated execution • setp.eq.s32 %p, foo, 0;@%p mov.s32 bar, 5; • Non-divergent cycle-burner. • Fine line between the two. • Cannot predicate code explicitly in CUDA.

  19. Predication – 2/3 • Explicit • Can allocate predicates, modify them and predicate instructions. • Example: • Direct: Register lP = lRegFile.allocate(BT.PRED, “%p1”); // @%p1 mov.s32 bar, 5 lBB.append(PMOVS32, lP, lBar, new Immediate(5)); • By reflection: Instruction lT = new InstructionImpl( MOVS32, lBar, new Immediate(5)); lBB.append(lT.descriptor().toPredicatedVersion(), lP, lArg1, lArg2);

  20. Predication – 3/3 • Any basic block can be predicated. • Including already-predicated instructions. • Example: • %p mov.s32 %d, %s; // if (%p) %d = %s; • Predicate by %q • Output: • @ %q and.pred %t, %p, %q;@!%q mov.pred %t, %q; // i.e. %t = %q ? (%p && %q) : false;@ %t mov.s32 %d, %s;

  21. Snippets – 1/3 • Problem: • Certain tasks require knowing in advance how the compiler treats a piece of code. • Software pipelining • template<typename T>vmult( T* aDst, T* aSrc1, T* aSrc2) { for(inti=0 ; i<N ; i++)aDst[i] = aSrc2[i] * aSrc2[i];}

  22. Snippets – 2/3 • Consider H/W vs S/W instructions • Tradeoff between pipeline stall & register pressure. • However, register pressure: • Is also a function of the # of thread blocks. • Ideally • Want to generate pipelined code for a variable number of dependencies. • Solution: • Encapsulate function in a reflective macro • Reflect instructions & dependencies.

  23. Snippets – 3/3 • Consider the complex “multiplication” • (a+ib)*(c+id)  (ac+bd)+i(ad+bc) • 2 stages: 2 muls, then 2 madds. • Snippet descriptor organisation: • Group 0: • mul.f32 ?x, ?a, ?c; mul.f32 ?y, ?a, ?d • Group 1: • mad.f32 ?x, ?x, ?b, ?d; mad.f32 ?y, ?y, ?b, ?c • ?* items are parameters. • Anybasic block can be “snippetised”.

  24. Reverse compilation – 1/3 • What to do with legacy CUDA code? • Option: Manually re-write in JASM. • No. Any PTX file can be loaded in JASM. • Not just loaded in. • Organised in basic blocks within a Control Flow Graph. • Malleable from thereon like every JASM code.

  25. Reverse compilation – 2/3 • Inlined C in JASM • Idea: obtain a snippet from a C function. • Opposite of “inlined assembly in C”. • Why? • Reuse what nvcc makes available. • Enjoy the benefits that come with snippets.

  26. Reverse compilation – 3/3 • At the moment, we can do the following: • Code: • SnippetDescriptorlSD= CToHLA.obtain( “(x*y) % 3”, “int”, “r”, “int”, “x”, “int”, “y”); • r is the return parameter; x & y are arguments. • Equivalent to: • int r = (x*y) % 3; • Now we can write: • if(lSD.numberOfRegisters() > 5) { ...

  27. Summary • NVIDIA GPUs • Many self-conflicting parameters affect performance. • Some not nvcc-tunable. • JASM • Similar to a compiler back-end but programmable itself. • Different constructs to generate variants of the same kernel. • Explore the optimisations strategy space faster. • The optimisations are expressed as a function of the code.

More Related