Explore Harp control divergence techniques - Predication and Split-Join - supported by ISA. Dive into Harp Predication, Compiler Support, and Split-Join implementation details with advantages and limitations. Discover insights on control divergence handling at instruction and block granularity.
HARP Control Divergence & Assignment 4 Blaise Tine Georgia Institute of Technology
Agenda • Harp Control Divergence • Predication • Split-Join • Assignment 4 • Codebase • Clone • Barriers • Samples Walkthrough • Questions?
Control Divergence Two techniques supported by ISA: • Predication • Control branch divergence at instruction granularity • Split-Join • Control branch divergence at block granularity
Harp Predication • Full Predication • All instructions can be predicated • Implementation • Separate predicate register file • All predicated instructions execute • Fetch => Decode => Execute • Conditional Commit stage • Only instructions with predicate value ‘true’ commit their results
Harp Predication (2) • Compiler Support • If-conversion: Converts control dependencies into data dependencies • Example
if (r1) { ++r2; } else { --r2; }
rtop @p0, %r1            # set predicate
@p0 ? addi %r2, %r2, #1
ntop @p0, @p0            # invert predicate
@p0 ? subi %r2, %r2, #1
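The if-converted sequence above can be modeled in a few lines of C++. This is a minimal sketch, not the assignment's emulator: the `Lane` struct, `commit_if`, and `run_if_converted` are hypothetical names, and `rtop` is assumed to set the predicate when the source register is nonzero (consistent with `if (r1)`). Every instruction is evaluated for every lane; only the commit is gated by the predicate.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-lane state: two general-purpose registers and one predicate.
struct Lane {
    int32_t r1;  // condition input (%r1)
    int32_t r2;  // value being updated (%r2)
    bool p0;     // predicate register (@p0)
};

// Commit `value` into `dst` only when the guarding predicate is true.
// The value was already computed, mirroring "all predicated instructions execute".
inline void commit_if(bool pred, int32_t& dst, int32_t value) {
    if (pred) dst = value;
}

// Runs the four-instruction if-converted sequence on every lane of a warp.
void run_if_converted(std::vector<Lane>& warp) {
    assert(!warp.empty());
    for (Lane& t : warp) t.p0 = (t.r1 != 0);               // rtop @p0, %r1 (assumed: nonzero => true)
    for (Lane& t : warp) commit_if(t.p0, t.r2, t.r2 + 1);  // @p0 ? addi %r2, %r2, #1
    for (Lane& t : warp) t.p0 = !t.p0;                     // ntop @p0, @p0
    for (Lane& t : warp) commit_if(t.p0, t.r2, t.r2 - 1);  // @p0 ? subi %r2, %r2, #1
}
```

Note that both the `addi` and the `subi` are issued for all lanes regardless of the predicate values, which is the cost the slides attribute to predication.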
Harp Predication (3) • Predicate Value Test Instructions • rtop @dst %src • isneg @dst %src • iszero @dst %src • Predicate Manipulation Instructions • ntop @dst @src0 • andp @dst @src0 @src1 • orp @dst @src0 @src1 • xorp @dst @src0 @src1
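As a rough illustration of how these instructions compose, the sketch below models each one as a C++ function on a single lane. The semantics are assumptions inferred from the mnemonics and from the earlier example (`rtop` as "nonzero is true", `ntop` as predicate inversion); `is_positive` and `in_range` are hypothetical helpers added only to show composition.

```cpp
#include <cassert>
#include <cstdint>

// Predicate value tests: derive a predicate from a general-purpose register.
inline bool rtop(int32_t src)   { return src != 0; }  // assumed: true when src is nonzero
inline bool isneg(int32_t src)  { return src < 0; }
inline bool iszero(int32_t src) { return src == 0; }

// Predicate manipulation: combine existing predicates.
inline bool ntop(bool src0)            { return !src0; }  // used as "inverse predicate" in the slides
inline bool andp(bool src0, bool src1) { return src0 && src1; }
inline bool orp(bool src0, bool src1)  { return src0 || src1; }
inline bool xorp(bool src0, bool src1) { return src0 != src1; }

// Hypothetical helper: predicate for `x > 0`, built only from the operations above.
inline bool is_positive(int32_t x) {
    return andp(ntop(isneg(x)), ntop(iszero(x)));
}

// Hypothetical helper: predicate for `lo <= x && x <= hi`.
// Assumes the subtractions do not overflow.
inline bool in_range(int32_t x, int32_t lo, int32_t hi) {
    assert(lo <= hi);
    return andp(ntop(isneg(x - lo)), ntop(isneg(hi - x)));
}
```

A compiler performing if-conversion would emit the same kind of chains to turn compound conditions into a single guarding predicate.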
Harp Predication (4) • Advantages • No branching overhead • Simple microarchitecture • Limitations • If-conversion is not always possible • e.g. loops, indirect branches • Inefficient with unanimous branches • Both paths are always executed
Harp Split-Join • ISA Support • @p ? split: partition a warp using the predicate mask, each subset taking a different target • join: merge the partitioned subsets back into a single execution block • Implementation • Hardware stack management • Compiler support
Harp Split-Join (2) • Example
if (r1) { ++r2; } else { --r2; }
rtop @p0, %r1
@p0 ? split
@p0 ? jmp then
subi %r2, %r2, #1
jmp next
then: addi %r2, %r2, #1
next: join
• Execution steps • Set predicate • Push PC and mask onto HW stack • Execute threads with ‘true’ predicate • Pop HW stack and jmp to @2 • Execute threads with ‘false’ predicate • Pop HW stack and jmp to @7
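The walkthrough above can be reproduced with a small warp simulator. This is a sketch under stated assumptions, not the Harp hardware or the assignment code: instruction addresses are taken to be zero-based (so `@2` is the instruction after `split` and `@7` is the address just past `join`), `split` is assumed to push a fall-through entry holding the original mask plus, when the warp actually diverges, an entry for the 'false' lanes, and all type and function names (`Warp`, `StackEntry`, `run`, `if_else_program`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Op { Rtop, Split, JmpIfP, Jmp, Addi, Subi, Join };
struct Instr { Op op; int arg; };  // arg: jump target or immediate (unused otherwise)

// One entry of the hypothetical hardware divergence stack.
struct StackEntry {
    uint32_t mask;     // thread mask to activate when popped
    int pc;            // resume address (ignored for fall-through entries)
    bool fallthrough;  // true: keep going past the join
};

struct Warp {
    std::vector<int32_t> r1, r2;    // %r1 and %r2, one value per lane
    uint32_t p0 = 0;                // @p0, one bit per lane
    uint32_t active = 0;            // active-thread mask
    std::vector<StackEntry> stack;  // divergence stack
};

inline bool lane_on(uint32_t mask, int lane) { return ((mask >> lane) & 1u) != 0; }

// The slide's if/else program, with zero-based instruction addresses.
std::vector<Instr> if_else_program() {
    return {
        {Op::Rtop, 0},    // @0       rtop @p0, %r1
        {Op::Split, 0},   // @1       @p0 ? split
        {Op::JmpIfP, 5},  // @2       @p0 ? jmp then
        {Op::Subi, 1},    // @3       subi %r2, %r2, #1
        {Op::Jmp, 6},     // @4       jmp next
        {Op::Addi, 1},    // @5 then: addi %r2, %r2, #1
        {Op::Join, 0},    // @6 next: join
    };
}

// Runs the program to completion and returns the number of instructions issued.
int run(Warp& w, const std::vector<Instr>& prog) {
    const int lanes = static_cast<int>(w.r1.size());
    assert(lanes > 0 && lanes <= 32 && w.r2.size() == w.r1.size());
    int pc = 0, issued = 0;
    while (pc < static_cast<int>(prog.size())) {
        const Instr in = prog[pc];
        int next = pc + 1;
        ++issued;
        switch (in.op) {
        case Op::Rtop:  // set the predicate bit of each active lane from %r1
            for (int i = 0; i < lanes; ++i) {
                if (!lane_on(w.active, i)) continue;
                if (w.r1[i] != 0) w.p0 |= (1u << i); else w.p0 &= ~(1u << i);
            }
            break;
        case Op::Split: {
            const uint32_t taken = w.active & w.p0, other = w.active & ~w.p0;
            w.stack.push_back({w.active, 0, true});  // restores the full mask at the last join
            if (taken != 0 && other != 0) {          // divergent: defer the 'false' lanes
                w.stack.push_back({other, next, false});
                w.active = taken;
            }
            break;
        }
        case Op::JmpIfP: {
            const uint32_t t = w.active & w.p0;
            assert(t == 0 || t == w.active);  // unanimous inside a split region
            if (t != 0) next = in.arg;
            break;
        }
        case Op::Jmp: next = in.arg; break;
        case Op::Addi:
        case Op::Subi:  // only active lanes update %r2
            for (int i = 0; i < lanes; ++i)
                if (lane_on(w.active, i)) w.r2[i] += (in.op == Op::Addi ? in.arg : -in.arg);
            break;
        case Op::Join: {
            assert(!w.stack.empty());
            const StackEntry e = w.stack.back();
            w.stack.pop_back();
            w.active = e.mask;
            if (!e.fallthrough) next = e.pc;  // resume the deferred lanes
            break;
        }
        }
        pc = next;
    }
    return issued;
}
```

With four lanes and `%r1 = {1, 0, 1, 0}`, this model issues nine instructions: the 'true' lanes run `jmp then`, `addi`, `join`, then the 'false' lanes resume at `@2` and run `subi`, `jmp next`, `join`. When all lanes agree, only one path is walked and five instructions are issued, which matches the "efficient with unanimous branches" claim.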
Harp Split-Join (3) • Advantages • Efficient with unanimous branches • Only a single path is executed • The active mask turns off inactive threads • Challenges • Complex microarchitecture • HW stack manager • Split-jmp-Join overhead
Assignment 4: Mini Harp • Minimal ISA • Word encoding • Integers only • A single predicate register • No Split-Join • No warp creation • No interrupts • No virtual addressing • Instruction Set • Nop, Add, Sub, And, Or, Xor, Not, Shr, Shl, Ld, St, Jmp, Jal, Bar • Configuration • Register size, warp size, number of warps
Assignment 4: Code base • Shared header • Common.h // common includes and definitions • Utility Library • utils.cpp/h // utility functions • Core classes • mem.cpp/h // memory • lrucache.cpp/h // cache • Instr.cpp/h // instruction • decode.cpp/h // decoder • regfile.h // register file • warp.cpp/h // warp unit • core.cpp/h // processor core
Assignment 4: Core Initialization • Core Construction [figure; callouts: Program RAM, Console output, Load/Store Unit, ICache & DCache, IDecoder, Warps]
Assignment 4: Memory Layout [figure; labels: console, RAM]
Assignment 4: Warp Initialization • Warp Construction [figure; callouts: GP Registers, Pred Registers, Boot enable]
Assignment 4: Warp Execute • Step Function [figure; callouts: Pipeline stages, Fetch, Decode]
Assignment 4: Warp Execute (2) • Execution [figure; callouts: Instructions, Predication, Jump instruction, Set predicate, Add your code!]
Assignment 4: Clone • Instruction Format • clone %src0 • Operation • Copy the current lane's registers into the lane indexed by %src0. • Register %src0 holds the destination lane index. • e.g.
ldi %r0, #2
clone %r0                # copy current registers into 3rd lane
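One way to picture the operation is as a register-file copy between lanes. The sketch below is an illustration only: the `LaneRegs` layout and the `clone` function signature are hypothetical and do not reflect the interface of `regfile.h` or `warp.cpp`, and it copies general-purpose registers only, since the slide does not say whether predicate registers are cloned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = uint32_t;
using LaneRegs = std::vector<Word>;  // general-purpose registers of one lane (hypothetical layout)

// Models `clone %src0` executed by lane `cur`:
// register src0 of the current lane holds the destination lane index.
void clone(std::vector<LaneRegs>& lanes, std::size_t cur, std::size_t src0) {
    assert(cur < lanes.size());
    assert(src0 < lanes[cur].size());
    const Word dst = lanes[cur][src0];
    assert(dst < lanes.size());  // destination lane must exist
    lanes[dst] = lanes[cur];     // copy every register of the current lane
}
```

With four lanes and `%r0 = 2` in lane 0, calling `clone(lanes, 0, 0)` leaves lane 2 (the 3rd lane) with the same register values as lane 0, including `%r0` itself.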
Assignment 4: Barrier • Instruction Format • bar %src0, %src1 • Operation • Synchronize %src1 number of warps with barrier identifier %src0. • Register %src0 holds the barrier id (supported max value is 3). • Register %src1 holds the number of warps to wait on. • e.g.
ldi %r0, #1
ldi %r1, #2
bar %r0, %r1             # insert a size-2 named barrier with id=1
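A named barrier of this kind can be tracked with a small table indexed by barrier id. The following is a sketch of one possible bookkeeping scheme, not the expected solution: `BarrierTable` and `arrive` are hypothetical names, and the policy of releasing all waiting warps once the requested count is reached is an assumption based on the description above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

class BarrierTable {
public:
    static constexpr std::size_t kMaxId = 3;  // from the slide: supported max barrier id is 3

    // Called when warp `warp` executes `bar id, count`.
    // Returns the warps to resume: empty while the barrier is still filling,
    // otherwise every warp that was waiting on this id (including the caller).
    std::vector<std::size_t> arrive(std::size_t id, std::size_t count, std::size_t warp) {
        assert(id <= kMaxId);
        assert(count > 0);
        std::vector<std::size_t>& waiting = waiting_[id];
        waiting.push_back(warp);
        if (waiting.size() < count) return {};  // caller must stall
        std::vector<std::size_t> released;
        released.swap(waiting);                 // leaves the slot empty for reuse
        return released;
    }

private:
    std::array<std::vector<std::size_t>, kMaxId + 1> waiting_;  // stalled warps per barrier id
};
```

In an emulator loop, a warp whose `arrive` call returns an empty list would be marked as stalled and skipped until a later arrival on the same id returns it in the released list.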
Assignment 4: Testing • Emulator command line • ./miniharp.out -r #regs -t #threads -w #warps -o #output • Sample programs • $ ./miniharp.out hello.bin -t 4 -w 1 -r 8 -o output.log • $ ./miniharp.out sum.bin -t 4 -w 1 -r 8 -o output.log • $ ./miniharp.out barrier.bin -t 4 -w 4 -r 8 -o output.log • Output format • “<Program Output>” • “Instruction Count: <?>”
Assignment 4: runtime.s [figure; callouts: Print Hex, Print String, Print NewLine]
Assignment 4: hello.s [figure; callouts: Load string, Call prints, Exit, String data]
Assignment 4: sum.s [figure; callouts: Clone Registers, Parallel Call, Print result0, Array data, Output address]
Assignment 4: barrier.s [figure; callouts: Start new Warp, Barrier, Single warp, Print results]
Questions?