Advanced Microarchitecture Lecture 8: Data-Capture Instruction Schedulers
Out-of-Order Execution
• The goal is to execute instructions in dataflow order, as opposed to the sequential order specified by the ISA
• The renamer plays a critical role by removing all of the false register dependencies
• The scheduler is responsible for:
  • detecting, for each instruction, when all dependencies have been satisfied (and the instruction is therefore ready to execute)
  • propagating readiness information between instructions
Out-of-Order Execution (2)
[Figure: static program → Fetch → dynamic instruction stream → Rename → renamed instruction stream → Schedule → dynamically scheduled instructions]
• Out-of-order = out of the original sequential order
Superscalar != Out-of-Order

A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 – 4
E: R7 = Load 20[R5]
F: R4 = R4 – 1
G: BEQ R4, #0

[Figure: the seven-instruction sequence above, with A's load missing in the cache, scheduled four ways: 1-wide In-Order takes 10 cycles, 2-wide In-Order takes 8 cycles, 1-wide Out-of-Order takes 7 cycles, and 2-wide Out-of-Order takes 5 cycles]
Data-Capture Scheduler
• At dispatch, instructions read all available operands from the register files and store a copy in the scheduler
• Unavailable operands will be "captured" from the functional-unit outputs
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files
[Figure: Fetch & Dispatch feeds the data-capture scheduler, which reads the ARF; the scheduler issues to the functional units, whose outputs loop back on the bypass network and update the physical registers (PRF/ROB)]
Non-Data-Capture Scheduler
[Figure: Fetch & Dispatch → Scheduler → ARF/PRF operand read → Functional Units, with the physical register update going to the PRF]
• More on this next lecture!
Components of a Scheduler
• Buffer for unexecuted instructions: "Scheduler Entries", "Issue Queue" (IQ), or "Reservation Stations" (RS)
• Method for tracking state of dependencies (resolved or not)
• Method for notification of dependency resolution
• Method for choosing between multiple ready instructions competing for the same resource (the arbiter)
Lather, Rinse, Repeat…
• Scheduling Loop or Wakeup-Select Loop
• Wake-Up Part:
  • Instructions selected for execution notify dependents (children) that the dependency has been resolved
  • For each instruction, check whether all input dependencies have been resolved
    • if so, the instruction is "woken up"
• Select Part:
  • Choose which instructions get to execute
  • If 3 add instructions are ready at the same time, but there are only two adders, someone must wait and try again in the next cycle (and again, and again until selected)
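The wakeup-select loop can be sketched as a toy cycle-level simulation. This is purely illustrative — the entry fields and the positional select policy below are my own, not from the lecture. Each entry holds a destination tag and the set of source tags it is still waiting on; select grants ready entries up to the number of functional units, and wakeup broadcasts the granted destination tags so matching sources are cleared.

```python
# Toy wakeup-select loop: an illustrative sketch, not a hardware description.
def wakeup_select(entries, num_fus):
    """entries: dicts with 'dst' (tag), 'srcs' (set of unready source tags),
    and 'done'. Returns the number of cycles to drain the scheduler.
    Assumes every source tag is eventually produced by some entry."""
    cycle = 0
    while any(not e["done"] for e in entries):
        # Select: among woken entries (all sources resolved), grant up to num_fus
        ready = [e for e in entries if not e["done"] and not e["srcs"]]
        for sel in ready[:num_fus]:          # simple positional priority
            sel["done"] = True
            # Wakeup: broadcast the destination tag; matching sources clear
            for e in entries:
                e["srcs"].discard(sel["dst"])
        cycle += 1
    return cycle

entries = [
    {"dst": "T1", "srcs": set(),        "done": False},  # independent
    {"dst": "T2", "srcs": {"T1"},       "done": False},  # child of T1
    {"dst": "T3", "srcs": {"T1", "T2"}, "done": False},  # needs both
]
print(wakeup_select(entries, num_fus=2))  # 3: the dependency chain serializes
```

Even with two functional units, the chain above takes three cycles — readiness, not machine width, is the limiter.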
Scalar Scheduler (Issue Width = 1)
[Figure: scheduler entries, each holding source-tag comparators (e.g. matching against broadcast tags such as T39) attached to a single tag broadcast bus; the select logic grants one ready entry, which is read out to the execute logic]
Superscalar Scheduler (detail of one entry)
[Figure: one entry with fields Dst, SrcL/ValL, SrcR/ValR, ready bits (RdyL/RdyR), and an Issued bit; each source tag has one comparator per tag broadcast bus, the ready logic raises a bid to the select logic, and the select logic returns a grant]
Interaction with Execution
[Figure: the select logic's grant indexes the payload RAM, whose entries hold each instruction's opcode, destination (D), source tags (SL, SR), and captured values (ValL, ValR); the granted entry's payload is read out and sent to the functional unit]
Again, But Superscalar
[Figure: a multi-ported payload RAM — two grants (A and B) each read out an opcode, D, SL, SR, ValL, ValR to their respective functional units]
• The scheduler captures the data, hence "Data-Capture"
Issue Width
• Maximum number of instructions selected for execution each cycle is the issue width
• The previous slide showed an issue width of two
• The slide before that showed the details of a scheduler entry for an issue width of four
• Hardware requirements:
  • Typically, an issue width of N requires N tag broadcast buses
  • Not always true: can specialize such that, for example, one "issue slot" can only handle branches
Pipeline Timing
[Timing: dependency chain A → B → C. In cycle i, A does Select | Payload | Execute; its tag broadcast enables capture on tag match and wakes up B. In cycle i+1, B does Select | Payload | Execute while capturing A's result broadcast, and B's tag broadcast wakes up C]
Pipelined Timing
• Can't read and write the payload RAM at the same time; may need to bypass the results
[Timing: A does Select | Payload | Execute across cycles i..i+2. B wakes up on A's tag broadcast, captures A's result broadcast, and does Select | Payload | Execute one cycle behind; C follows B the same way, finishing in cycle i+3]
Pipelined Timing (2)
• The previous slide placed the pipeline boundary at the writing of the ready bits
• This slide shows a pipeline where latches are placed right before the tag broadcast
[Timing: A does Wakeup/Select | Payload | Execute; B wakes up in cycle i+1 on A's tag broadcast, captures A's result broadcast, and does Select | Payload | Execute through cycle i+2]
More Pipelined Timing
• No simultaneous read/write! Need a second level of bypassing
[Timing: A does Select | Payload | Execute. B gets a tag match on its first operand (wakeup), captures, then does Select | Payload | Execute; results are broadcast and bypassed. C gets a tag match on its second operand when B broadcasts (now C is ready), captures, and does Select | Payload | Exec through cycle i+3]
More-er Pipelined Timing
• Dependent instructions cannot execute in back-to-back cycles!
[Timing: A does Select | Payload | Execute. A and B are both ready, but only A is selected, so B bids again before its Select | Payload | Execute. C and D each go Wakeup | Capture | Select | Payload | Execute one behind the other, with D finishing around cycle i+5]
• A→C and C→D must be bypassed, but no bypass is needed for B→D
Critical Loops
• The Wakeup-Select Loop cannot be trivially pipelined while maintaining back-to-back execution of dependent instructions
[Figure: chain A → B → C under regular scheduling vs. the same chain with a bubble between each dependent pair when back-to-back issue is lost]
• Worst case: IPC is reduced by ½
  • Shouldn't be that bad in practice (the previous slide had an IPC of 4/3)
  • Studies indicate a 10-15% IPC penalty
  • "Loose Loops Sink Chips", Borch et al.
IPC vs. Frequency
• A 10-15% IPC drop doesn't seem bad if we can double the clock frequency
  • 2.0 IPC at 1 GHz = 2 BIPS; 1.7 IPC at 2 GHz = 3.4 BIPS
• But frequency doesn't double
  • latch/pipeline overhead: splitting a 1000 ps stage yields two 500 ps stages only ideally (e.g. 900 ps of logic becomes 450 ps + 450 ps plus latch delay)
  • unbalanced stages: a split into 350 ps and 550 ps stages is limited by the slower one (closer to 1.5 GHz than 2 GHz)
• Other sources of IPC penalties
  • branches: ↑ pipe depth, ↓ predictor size, ↑ predict-to-update latency
  • caches/memory: the same miss time in seconds is more cycles at a higher frequency
• Power limitations: more logic and higher frequency, P = ½CV²f
Select Logic
• Goal: minimize DFG height (execution time)
  • NP-Hard: the Precedence-Constrained Scheduling Problem
  • Even harder because the entire DFG is not known at scheduling time
  • Scheduling decisions made now may affect the scheduling of instructions not even fetched yet
• Heuristics
  • For performance
  • For ease of implementation
Simple Select Logic
• Each scheduler entry i raises a bid xi = Bidi; grant the first bidding entry:
  Grant0 = 1
  Grant1 = !Bid0
  Grant2 = !Bid0 & !Bid1
  Grant3 = !Bid0 & !Bid1 & !Bid2
  …
  Grantn-1 = !Bid0 & … & !Bidn-2
• A naive grant chain across S entries yields O(S) gate delay; a tree arrangement needs O(log S) gate delays
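The grant equations above amount to a priority encoder: entry i wins only if it bids and no lower-numbered entry does. A minimal behavioral sketch (the function name is mine):

```python
def fixed_priority_grants(bids):
    """Mirror Grant_i = Bid_i AND NOT(Bid_0 OR ... OR Bid_{i-1}):
    at most one grant, going to the lowest-numbered bidding entry."""
    grants, seen = [], False
    for bid in bids:
        grants.append(bool(bid) and not seen)  # granted only if first bidder
        seen = seen or bool(bid)               # running OR of earlier bids
    return grants

print(fixed_priority_grants([0, 1, 0, 1]))  # [False, True, False, False]
```

The running OR (`seen`) is exactly the chain that makes the naive circuit O(S) deep.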
Simple Select Logic (2)
• Instructions may be located in scheduler entries in no particular order
  • The first ready entry may be the oldest, the youngest, or anywhere in between
• Simple select results in a "random" schedule
  • the schedule is still "correct" in that no dependencies are violated
  • it just may be far from optimal
Oldest First Select
• Intuition:
  • An instruction that has just entered the scheduler will likely have few, if any, dependents in the scheduler (only intra-group)
  • Similarly, the instructions that have been in the scheduler the longest (the oldest) are likely to have the most dependents
• Selecting the oldest instructions has a higher chance of satisfying more dependencies, thus making more instructions ready to execute → more parallelism
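Oldest-first select can be sketched by tagging each entry with its dispatch sequence number and granting the lowest numbers among the ready entries (the function name and data are illustrative, not from the lecture):

```python
def oldest_first_select(entries, width):
    """entries: (seq_num, ready) pairs sitting in arbitrary scheduler slots;
    lower seq_num = older instruction. Returns the slot indices of up to
    `width` granted entries, oldest first."""
    ready = sorted((seq, i) for i, (seq, rdy) in enumerate(entries) if rdy)
    return [i for _, i in ready[:width]]     # oldest ready entries win

# Ages are out of slot order, as in a non-compacting scheduler.
print(oldest_first_select([(7, True), (2, True), (5, False), (1, True)], 2))
# [3, 1]: the entries dispatched 1st and 2nd beat the one dispatched 7th
```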
Implementing Oldest First Select
• Write instructions into the scheduler in program order
• Compress up: as entries issue and vacate, the remaining instructions shift toward the head, and newly dispatched instructions fill in at the tail
[Figure: instructions A–L shifting up through the scheduler as older entries drain, with newly dispatched instructions entering at the bottom]
Implementing Oldest First Select (2)
• Compressing buffers are very complex
  • gates, wiring, area, power
  • each shift moves an entire instruction's worth of data: tags, opcodes, immediates, readiness, etc.
  • Ex. a 4-wide machine needs up to a shift-by-4
Implementing Oldest First Select (3)
[Figure: age-aware select logic — entries A through H each carry an age (empty slots marked ∞); a tree of comparators picks the bidding entry with the smallest age and raises its grant]
Handling Multi-Cycle Instructions
[Timing: Add R1 = R2 + R3 (Sched | PayLd | Exec) followed by Xor R4 = R1 ^ R5 (Sched | PayLd | Exec) works back-to-back. Mul R1 = R2 × R3 (Sched | PayLd | Exec | Exec | Exec) followed by Add R4 = R1 + R5 (Sched | PayLd | Exec) does not: the Add attempts to execute too early — the result is not ready for another two cycles]
Delayed Tag Broadcast
• Delay the producer's tag broadcast by its extra latency so the dependent wakes up just in time
[Timing: Mul R1 = R2 × R3 does Sched | PayLd | Exec | Exec | Exec; Add R4 = R1 + R5 does Sched | PayLd | Exec aligned with the Mul's final Exec. Assume pipelined such that tag broadcast occurs at the cycle boundary]
• It works, but…
  • Must make sure the tag broadcast bus will be available N cycles in the future when needed
  • Bypass and data capture potentially get more complex
Delayed Tag Broadcast (2)
• Assume issue width equals 2
[Timing: Mul R1 = R2 × R3 (3-cycle) delays its broadcast; meanwhile Sub R7 = R8 – #1 and Xor R9 = R9 ^ R6 issue. In the cycle the Mul finally broadcasts, three instructions need to broadcast their tags — more than the two buses available!]
Delayed Tag Broadcast (3)
• Possible solutions
  • Have one select for issuing, another select for tag broadcast
    • messes up the timing of data capture
  • Pre-reserve the bus
    • select logic more complicated; must track usage in future cycles in addition to the current cycle
  • Hold the issue slot from initial launch until tag broadcast
    • [Timing: sch | payl | exec | exec | exec — issue width effectively reduced by one for three cycles]
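The "pre-reserve the bus" option can be sketched as a small reservation table indexed by future cycle (the class and its fields are hypothetical, only the idea comes from the slide): at select time, a multi-cycle op claims a broadcast slot N cycles ahead, and issue fails if that future cycle is already full.

```python
class BroadcastBus:
    """Hypothetical sketch of pre-reserving tag-broadcast slots."""
    def __init__(self, width):
        self.width = width       # number of broadcast buses (= issue width)
        self.reserved = {}       # future cycle -> slots already claimed

    def try_reserve(self, cycle):
        """Claim one broadcast slot at `cycle`; fail if all buses are taken."""
        if self.reserved.get(cycle, 0) >= self.width:
            return False
        self.reserved[cycle] = self.reserved.get(cycle, 0) + 1
        return True

bus = BroadcastBus(width=2)
assert bus.try_reserve(cycle=5)      # a 3-cycle multiply books cycle 5 early
assert bus.try_reserve(cycle=5)      # a single-cycle op also lands there
assert not bus.try_reserve(cycle=5)  # a third broadcast must pick another cycle
```

This is exactly the extra bookkeeping the slide warns about: the select logic must now track bus usage across future cycles, not just the current one.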
Delayed Wakeup
• Push the delay to the consumer
[Figure: the entry for R5 = R1 + R4 sees the tag broadcast for R1 = R2 × R3 arrive, but waits three cycles before acknowledging it and setting the ready bit; R4 is already ready]
• Also need to know the parent's latency
Non-Deterministic Latencies
• The previous approaches assume that all instruction latencies are known at the time of scheduling
  • Makes things uglier for the delayed broadcast
  • This pretty much kills the delayed-wakeup approach
• Examples
  • Load instructions
    • Latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}
    • DRAM_lat is not a constant either (queuing delays)
  • Some architecture-specific cases
    • PowerPC 603 has an "early out" for multiplication with a low-bit-width multiplicand
    • Intel Core 2's divider also has an early out
The Wait-and-See Approach
• Just wait and see whether a load hits or misses in the cache
[Timing: R1 = 16[$sp] does Sched | PayLd | Exec | Exec | Exec; only once the cache hit is known can the tag broadcast, so R2 = R1 + #4 does Sched | PayLd | Exec two cycles later]
  • Load-to-use latency increases by 2 cycles (a 3-cycle load appears as a 5-cycle load)
• May be able to design the cache s.t. hit/miss is known before the data
[Timing: with separate DL1 tag and data access (Scheduler → DL1 Tags → DL1 Data), the penalty is reduced to 1 cycle]
Load-Hit Speculation
• Caches work pretty well
  • hit rates are high (otherwise caches wouldn't be too useful)
  • so just assume all loads will hit in the cache
[Timing: R1 = 16[$sp] does Sched | PayLd | Exec | Exec | Exec, with its broadcast delayed by the DL1 latency; R2 = R1 + #4 does Sched | PayLd | Exec, with the data forwarded on the cache hit]
• Um, ok, what happens when there's a load miss?
Load Miss Scenario
[Timing: the load does Sched | PayLd | Exec | Exec | Exec with its tag broadcast delayed by the DL1 latency; the dependent schedules, but when the cache miss is detected the value at the cache output is bogus, so the dependent is invalidated (its ALU output ignored). The load is rescheduled assuming a hit at the DL2 cache, broadcasting after the L2 latency, and the dependent reschedules behind it]
• Each mis-scheduling wastes an issue slot: the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used for another instruction
• There could be a miss at the L2, and again at the L3 cache; a single load can waste multiple issuing opportunities
Scheduler Deallocation
• Normally, as soon as an instruction issues, it can vacate its scheduler entry
  • The sooner an entry is deallocated, the sooner another instruction can reuse it → that instruction executes earlier
• In the case of a load, the load must hang around in the scheduler until it can be guaranteed that it will not have to rebroadcast its destination tag
  • Decreases the effective size of the scheduler
"But wait, there's more!"
• Not only do children get squashed; there may be grand-children to squash as well
[Timing: the load does Sched | PayLd | Exec | Exec | Exec and the DL1 miss triggers a squash; its child (Sched | PayLd | Exec) and the wave of grand-children scheduled behind it are all squashed]
• All waste issue slots
• All must be rescheduled
• All waste power
• None may leave the scheduler until the load hit is known
Squashing
• The number of cycles' worth of dependents that must be squashed equals the cache-miss-known latency minus one
  • previous example: miss-known latency = 3 cycles, so there are two cycles' worth of mis-scheduled dependents
• Early miss detection helps reduce mis-scheduled instructions
• A load may have many children, but the issue width limits how many can possibly be mis-scheduled
  • Max = Issue-width × (Miss-known-lat – 1)
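The bound in the last bullet is simple arithmetic; a one-liner makes it concrete:

```python
def max_squashed(issue_width, miss_known_lat):
    """Upper bound on mis-scheduled dependents per load miss, from the
    slide's formula: Issue-width x (Miss-known-lat - 1)."""
    return issue_width * (miss_known_lat - 1)

# With 4-wide issue and the miss known 3 cycles after scheduling,
# at most 2 cycles x 4 issue slots = 8 instructions can be mis-scheduled.
print(max_squashed(4, 3))  # 8
```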
Squashing (2)
• Simple approach: squash anything "in-flight" between schedule and execute
[Timing: the load (Sched | PayLd | Exec | Exec | Exec) misses; every instruction in the Sched/PayLd stages behind it is squashed, dependent or not]
  • This may include non-dependent instructions
• All instructions must stay in the scheduler for a few extra cycles to make sure they will not be rescheduled due to a squash
Squashing (3)
• Selective squashing: use "load colors"
  • each load is assigned a unique color
  • every dependent "inherits" its parents' colors
  • on a load miss, the load broadcasts its color and anyone in the same color group gets squashed
• An instruction may end up with many colors
  • Explicitly tracking each color would require a huge number of comparisons
Squashing (4)
• Can list "colors" in unary (bit-vector) form
  • Each instruction's color vector is the bitwise OR of its parents' vectors

  Load R1 = 16[R2]    1 0 0 0 0 0 0 0
  Add  R3 = R1 + R4   1 0 0 0 0 0 0 0
  Load R5 = 12[R7]    0 1 0 0 0 0 0 0
  Load R8 = 0[R1]     1 0 1 0 0 0 0 0
  Load R7 = 8[R4]     0 0 0 1 0 0 0 0
  Add  R6 = R8 + R7   1 0 1 1 0 0 0 0

• A load miss now only squashes the dependent instructions
• Hardware cost increases quadratically with the number of load instructions
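The unary color scheme can be sketched directly with integer bitmasks. The encoding below (bit i for the i-th load) and the function name are mine; the instruction list mirrors the slide's example.

```python
def assign_colors(insts):
    """insts: list of (is_load, parent_indices) in program order.
    Returns one color bit-vector per instruction: the OR of its parents'
    vectors, plus a fresh bit if the instruction is itself a load."""
    colors, next_bit = [], 0
    for is_load, parents in insts:
        vec = 0
        for p in parents:
            vec |= colors[p]          # inherit every parent's colors
        if is_load:
            vec |= 1 << next_bit      # this load's own unique color
            next_bit += 1
        colors.append(vec)
    return colors

colors = assign_colors([
    (True,  []),      # Load R1 = 16[R2]
    (False, [0]),     # Add  R3 = R1 + R4
    (True,  []),      # Load R5 = 12[R7]
    (True,  [0]),     # Load R8 = 0[R1]   (depends on the first load)
    (True,  []),      # Load R7 = 8[R4]
    (False, [3, 4]),  # Add  R6 = R8 + R7
])
miss = 1 << 0  # the first load misses: squash everything carrying its color
print([i for i, v in enumerate(colors) if v & miss])  # [0, 1, 3, 5]
```

A squash check is a single AND per entry against the missing load's color, rather than a comparator per tracked color.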
Allocation
• Allocate in-order, deallocate in-order
[Figure: a circular buffer of RS entries with head and tail pointers]
• Very simple!
• But smaller effective scheduler size
  • instructions may have already executed out of order, but their RS entries cannot be reused
  • can be very bad if a load goes to main memory
Allocation (2)
• With arbitrary placement, entries are much better utilized
[Figure: an RS allocator scanning an entry-availability bit-vector to place incoming instructions]
• Allocator more complex
  • must scan availability and find N free entries
• Write logic more complex
  • must route N instructions to arbitrary entries of the scheduler
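The availability scan can be sketched as a pass over the bit-vector (in hardware this is roughly a priority encoder repeated N times; the function name is mine):

```python
def allocate_entries(avail, n):
    """Claim up to n free entries from the availability bit-vector
    (1 = free). Returns the claimed indices; avail is updated in place."""
    claimed = []
    for i, free in enumerate(avail):
        if free and len(claimed) < n:
            avail[i] = 0              # mark the entry as occupied
            claimed.append(i)
    return claimed

avail = [0, 1, 1, 0, 1, 0, 0, 1]
print(allocate_entries(avail, 3))  # [1, 2, 4]: the first three free slots
```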
Allocation (3)
• Segment the entries
  • only one entry per segment may be allocated per cycle
  • instead of a 4-of-16 alloc (previous slide), each allocator only does a 1-of-4
  • write logic simplified as well
• Still possible inefficiencies
  • a full segment leads to allocating fewer than N instructions per cycle
[Figure: four instructions A–D, each routed to its own 4-entry segment's allocator; instruction X cannot allocate — free RS entries exist, just not in the correct segment]
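Segmented allocation restricts each incoming instruction to one segment, so each allocator is a simple 1-of-4 pick. The sketch below (structure mine) also reproduces the slide's inefficiency: a full segment blocks its instruction even though other segments have room.

```python
def segmented_allocate(segments, n):
    """Each segment grants at most one of its free entries (1 = free) per
    cycle. Returns (segment, entry) pairs; may grant fewer than n when a
    segment is full, even if free entries exist elsewhere."""
    granted = []
    for s, seg in enumerate(segments):
        if len(granted) == n:
            break
        for i, free in enumerate(seg):
            if free:
                seg[i] = 0
                granted.append((s, i))
                break                 # only one allocation per segment
    return granted

segments = [[0, 0, 1, 0], [1, 0, 1, 0], [0, 0, 0, 0], [0, 1, 1, 0]]
print(segmented_allocate(segments, 4))
# [(0, 2), (1, 0), (3, 1)]: segment 2 is full, so only 3 of 4 allocate
```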