
HARP Control Divergence & Assignment 4

An overview of the two Harp control divergence techniques supported by the ISA: predication, which handles divergence at instruction granularity, and split-join, which handles it at block granularity. Covers Harp predication, compiler support, and the split-join implementation, with the advantages and limitations of each.


Presentation Transcript


  1. HARP Control Divergence & Assignment 4 • Blaise Tine • Georgia Institute of Technology

  2. Agenda • Harp Control Divergence • Predication • Split-Join • Assignment 4 • Codebase • Clone • Barriers • Samples Walkthrough • Questions?

  3. Control Divergence Two techniques supported by ISA: • Predication • Control branch divergence at instruction granularity • Split-Join • Control branch divergence at block granularity

  4. Harp Predication • Full Predication • All instructions can be predicated • Implementation • Separate predicate register file • All predicated instructions execute • Fetch => Decode => Execute • Conditional Commit stage • Only instructions with predicate value ‘true’
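The execute-then-conditionally-commit behaviour on slide 4 can be sketched as a small Python model. This is an illustration only, not the Harp implementation: `execute_predicated`, the per-lane dictionary layout, and the example values are all hypothetical.

```python
def execute_predicated(lanes, pred, dst, op):
    """Run `op` in every lane, but commit the result only where pred is True.

    lanes: list of per-lane register dicts (hypothetical layout)
    pred:  list of per-lane booleans read from the predicate register file
    """
    # Execute stage: every lane computes a result, whatever its predicate says.
    results = [op(lane) for lane in lanes]
    # Commit stage: write back only in lanes whose predicate value is true.
    for lane, p, value in zip(lanes, pred, results):
        if p:
            lane[dst] = value
    return lanes


lanes = [{"r2": 10}, {"r2": 20}, {"r2": 30}, {"r2": 40}]
mask = [True, False, True, False]
execute_predicated(lanes, mask, "r2", lambda lane: lane["r2"] + 1)
committed = [lane["r2"] for lane in lanes]
```

Only lanes 0 and 2 see the incremented value; the work done in lanes 1 and 3 is discarded at commit, which is the cost noted later under limitations.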

  5. Harp Predication (2) • Compiler Support • If-conversion: converts control dependencies into data dependencies • Example: if (r1) { ++r2; } else { --r2; } • rtop @p0, %r1 # set predicate • @p0 ? addi %r2, %r2, #1 • ntop @p0, @p0 # invert predicate • @p0 ? subi %r2, %r2, #1
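To see what the if-converted sequence on slide 5 does across a warp, the four instructions can be traced lane by lane in Python. This is a sketch under one assumption: that `rtop` sets the predicate where the source register is non-zero. The function name `run_if_converted` and the input values are made up for illustration.

```python
def run_if_converted(r1, r2):
    """Model the four-instruction if-converted sequence, one list entry per lane."""
    n = len(r1)
    # rtop @p0, %r1 : predicate true where r1 is non-zero (assumed semantics)
    p0 = [r1[i] != 0 for i in range(n)]
    # @p0 ? addi %r2, %r2, #1 : committed only in lanes where p0 is true
    r2 = [r2[i] + 1 if p0[i] else r2[i] for i in range(n)]
    # ntop @p0, @p0 : invert the predicate
    p0 = [not p for p in p0]
    # @p0 ? subi %r2, %r2, #1 : committed only in the remaining lanes
    r2 = [r2[i] - 1 if p0[i] else r2[i] for i in range(n)]
    return r2


result = run_if_converted(r1=[1, 0, 5, 0], r2=[10, 10, 10, 10])
```

No branch is issued: both the `addi` and the `subi` pass through every lane, and the predicate alone decides which lanes keep each result.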

  6. Harp Predication (3) • Predicate Value Test Instructions • rtop @dst %src • isneg @dst %src • iszero @dst %src • Predicate Manipulation Instructions • ntop @dst @src0 • andp @dst @src0 @src1 • orp @dst @src0 @src1 • xorp @dst @src0 @src1
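The predicate instructions on slide 6 map naturally onto per-lane boolean operations. The Python helpers below are a sketch whose semantics are inferred from the mnemonics (for example, `rtop` is read as "register to predicate" with a non-zero test); the slides do not spell these out, so treat each definition as an assumption.

```python
# Predicate value tests: integer register (one value per lane) -> predicate
def rtop(src):
    return [v != 0 for v in src]    # assumed: true where the register is non-zero

def isneg(src):
    return [v < 0 for v in src]

def iszero(src):
    return [v == 0 for v in src]

# Predicate manipulation: predicate(s) -> predicate
def ntop(p):
    return [not b for b in p]

def andp(a, b):
    return [x and y for x, y in zip(a, b)]

def orp(a, b):
    return [x or y for x, y in zip(a, b)]

def xorp(a, b):
    return [x != y for x, y in zip(a, b)]


r1 = [3, 0, -2, 7]
# "strictly positive" built from the primitives: non-zero AND NOT negative
positive = andp(rtop(r1), ntop(isneg(r1)))
```

Compound conditions such as `r1 > 0` need no dedicated instruction; they compose from a value test and the manipulation instructions.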

  7. Harp Predication (4) • Advantages • No branching overhead • Simple microarchitecture • Limitations • If-conversion is not always possible • e.g. loops, indirect branches • Inefficient with unanimous branches • Both paths are always executed

  8. Harp Split-Join • ISA Support • @p split: partition a warp using predicate mask, each subset taking different target • join: merge partitioned subset into single execution block • Implementation • Hardware stack management • Compiler support

  9. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Set predicate • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  10. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Push PC and mask onto HW stack • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  11. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Execute threads with ‘true’ predicate • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  12. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Execute threads with ‘true’ predicate • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  13. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Pop HW stack and jmp to @2 • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  14. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Execute threads with ‘false’ predicate • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  15. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Execute threads with ‘false’ predicate • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  16. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Execute threads with ‘false’ predicate • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join

  17. Harp Split-Join (2) • Example: if (r1) { ++r2; } else { --r2; } • Step: Pop HW stack and jmp to @7 • rtop @p0, %r1 • @p0 ? split • @p0 ? jmp then • subi %r2, %r2, #1 • jmp next • then: addi %r2, %r2, #1 • next: join
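The walkthrough on slides 9 to 17 can be reproduced with a small Python model of one warp and its hardware stack. Several details here are assumptions, not taken from the slides: instructions are addressed @0 to @6 in slide order, `split` pushes one entry that restores the original mask plus one entry for the false-predicate subset, a unanimous `split` pushes only the restore entry, and a predicated `jmp` is taken when any active lane has a true predicate. The name `run_split_join` is hypothetical.

```python
def run_split_join(r1, r2):
    """Trace the split/join example over one warp (one list entry per lane)."""
    prog = [
        ("rtop",),     # @0  rtop @p0, %r1
        ("split",),    # @1  @p0 ? split
        ("pjmp", 5),   # @2  @p0 ? jmp then
        ("add", -1),   # @3  subi %r2, %r2, #1
        ("jmp", 6),    # @4  jmp next
        ("add", +1),   # @5  then: addi %r2, %r2, #1
        ("join",),     # @6  next: join
    ]
    n = len(r1)
    r2 = list(r2)
    p0 = [False] * n
    mask = [True] * n   # active-thread mask
    stack = []          # hardware stack of (mask, resume_pc) entries
    trace = []          # addresses issued, in order
    pc = 0
    while pc < len(prog):
        op = prog[pc]
        trace.append(pc)
        if op[0] == "rtop":
            p0 = [v != 0 for v in r1]
            pc += 1
        elif op[0] == "split":
            taken = [m and p for m, p in zip(mask, p0)]
            not_taken = [m and not p for m, p in zip(mask, p0)]
            stack.append((mask, None))             # restore entry: continue after join
            if any(taken) and any(not_taken):      # divergent: defer the false subset
                stack.append((not_taken, pc + 1))  # it resumes at @2
                mask = taken
            pc += 1
        elif op[0] == "pjmp":
            jump = any(m and p for m, p in zip(mask, p0))
            pc = op[1] if jump else pc + 1
        elif op[0] == "jmp":
            pc = op[1]
        elif op[0] == "add":
            r2 = [v + op[1] if m else v for v, m in zip(r2, mask)]
            pc += 1
        elif op[0] == "join":
            mask, target = stack.pop()
            pc = pc + 1 if target is None else target
    return r2, trace


divergent_r2, divergent_trace = run_split_join([1, 0, 5, 0], [10, 10, 10, 10])
unanimous_r2, unanimous_trace = run_split_join([1, 1, 1, 1], [10, 10, 10, 10])
```

In the divergent case the trace visits the `then` path, pops back to @2 for the false subset, and leaves through the second `join`, mirroring the slide sequence. In the unanimous case only one path is issued.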

  18. Harp Split-Join (2) • Advantages • Efficient with unanimous branches • Only a single path is executed • The active mask turns off inactive threads • Challenges • Complex microarchitecture • HW stack manager • Split-jmp-Join overhead
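The trade-off between the two techniques can be made concrete by counting issued instructions for an if/else whose `then` and `else` blocks have given lengths. The formulas below are a rough sketch derived from the example instruction sequences on the earlier slides, assuming one issue slot per instruction and a split that skips the empty path when the branch is unanimous; they are not measurements of any Harp implementation, and both function names are hypothetical.

```python
def predication_cost(then_len, else_len):
    """rtop + then block + ntop + else block; every instruction is issued once."""
    return 1 + then_len + 1 + else_len


def split_join_cost(then_len, else_len, any_true, any_false):
    """Issue count for the split / jmp / join pattern, assuming a non-empty warp."""
    cost = 3                      # rtop, split, predicated jmp
    if any_true:
        cost += then_len + 1      # then block, join
    if any_false:
        if any_true:
            cost += 1             # predicated jmp is issued again for the false subset
        cost += else_len + 2      # else block, jmp next, join
    return cost


pred = predication_cost(20, 20)
sj_unanimous = split_join_cost(20, 20, any_true=True, any_false=False)
sj_divergent = split_join_cost(20, 20, any_true=True, any_false=True)
```

With 20-instruction blocks, split-join issues far fewer instructions than predication when the branch is unanimous, but slightly more when the warp diverges, which is the split-jmp-join overhead listed above.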

  19. Assignment 4: Mini Harp • Minimal ISA • Word encoding • Integers only • A single predicate register • No Split-Join • No warp creation • No interrupts • No virtual addressing • Instruction Set • Nop, Add, Sub, And, Or, Xor, Not, Shr, Shl, Ld, St, Jmp, Jal, Bar • Configuration • Register size, warp size, number of warps

  20. Assignment 4: Code base • Shared header • Common.h // common includes and definitions • Utility Library • utils.cpp/h // utility functions • Core classes • mem.cpp/h // memory • lrucache.cpp/h // cache • Instr.cpp/h // instruction • decode.cpp/h // decoder • regfile.h // register file • warp.cpp/h // warp unit • core.cpp/h // processor core

  21. Assignment 4: Core Initialization • Core Construction • Code callouts: Program RAM, Console output, Load/Store Unit, ICache & DCache, IDecoder, Warps

  22. Assignment 4: Memory Layout • Regions shown: console, RAM

  23. Assignment 4: Warp Initialization • Warp Construction • Code callouts: GP Registers, Pred Registers, Boot enable

  24. Assignment 4: Warp Execute • Step Function • Code callouts: Pipeline stages, Fetch, Decode

  25. Assignment 4: Warp Execute (2) • Execution • Code callouts: Instructions, Predication, Jump instruction, Set predicate, Add your code!

  26. Assignment 4: Clone • Instruction Format • clone %src0 • Operation • Copy the current lane's registers into the lane indexed by %src0. • Register %src0 holds the destination lane index. • e.g. ldi %r0, #2 • clone %r0 # copy current registers into the 3rd lane (index 2)
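The clone operation on slide 26 amounts to a register-file copy between lanes. The Python sketch below assumes a register file stored as one list of general-purpose registers per lane; whether predicate registers are copied too, or whether the destination lane is also activated, is not stated on the slide and is left out. The function name and layout are hypothetical.

```python
def clone(lanes, current, src0):
    """Copy the current lane's registers into the lane whose index is in %src0.

    lanes:   list of per-lane register lists (hypothetical layout)
    current: index of the lane executing the instruction
    src0:    register number that holds the destination lane index
    """
    dest = lanes[current][src0]
    lanes[dest] = list(lanes[current])   # copy, so the two lanes do not alias
    return lanes


# 4 lanes x 8 registers, matching the "-t 4 -r 8" sample configuration
lanes = [[0] * 8 for _ in range(4)]
lanes[0][0] = 2      # ldi %r0, #2
lanes[0][1] = 99     # some live value in %r1
clone(lanes, current=0, src0=0)   # clone %r0
```

After the call, lane 2 holds the same register values as lane 0, including the copied `%r0`, while lanes 1 and 3 are untouched.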

  27. Assignment 4: Barrier • Instruction Format • bar %src0, %src1 • Operation • Synchronize %src1 warps on the barrier identified by %src0. • Register %src0 holds the barrier id (supported max value is 3). • Register %src1 holds the number of warps to wait on. • e.g. ldi %r0, #1 • ldi %r1, #2 • bar %r0, %r1 # insert a size-2 named barrier with id=1

  28. Assignment 4: Testing • Emulator command line • ./miniharp.out -r #regs -t #threads -w #warps -o #output • Sample programs • $ ./miniharp.out hello.bin -t 4 -w 1 -r 8 -o output.log • $ ./miniharp.out sum.bin -t 4 -w 1 -r 8 -o output.log • $ ./miniharp.out barrier.bin -t 4 -w 4 -r 8 -o output.log • Output format • “<Program Output>” • “Instruction Count: <?>”
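One way to model the named barrier on slide 27 is a small table that tracks which warps are stalled on each barrier id and releases them once the requested count has arrived. This Python sketch is an assumption about the bookkeeping, not the assignment's reference solution; `BarrierTable` and `arrive` are hypothetical names.

```python
class BarrierTable:
    MAX_ID = 3   # from the slide: supported max barrier id is 3

    def __init__(self):
        # One set of stalled warp ids per barrier id.
        self.waiting = {i: set() for i in range(self.MAX_ID + 1)}

    def arrive(self, barrier_id, count, warp_id):
        """Record that warp_id executed `bar` with the given id and count.

        Returns the sorted warp ids to resume, or [] if the warp must stall.
        """
        if not 0 <= barrier_id <= self.MAX_ID:
            raise ValueError("barrier id out of range")
        group = self.waiting[barrier_id]
        group.add(warp_id)
        if len(group) >= count:
            released = sorted(group)
            group.clear()        # the barrier id can be reused afterwards
            return released
        return []


bars = BarrierTable()
first = bars.arrive(barrier_id=1, count=2, warp_id=0)    # warp 0 stalls
second = bars.arrive(barrier_id=1, count=2, warp_id=3)   # warp 3 completes the barrier
```

The first arrival stalls, and the second releases both warps together, matching the size-2 barrier with id 1 in the slide's example.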

  29. Assignment 4: runtime.s • Code callouts: Print Hex, Print String, Print NewLine

  30. Assignment 4: hello.s • Code callouts: Load string, Call prints, Exit, String data

  31. Assignment 4: sum.s • Code callouts: Clone Registers, Parallel Call, Print result0, Array data, Output address

  32. Assignment 4: barrier.s • Code callouts: Start new Warp, Barrier, Single warp, Print results

  33. Questions?
