Advanced Microarchitecture Lecture 10: ALUs and Bypass
This Lecture: Execution Datapath • ALUs • Scheduler to Execution Unit interface • Execution unit organization • Bypass networks • Clustering
ALUs • ALU: Arithmetic Logic Unit • FU: Functional Unit • EU: Execution Unit • Adder vs. ALU: what's the difference? • Implementation details, algorithms, etc. of adders, multipliers, and dividers are not covered in this course
[Diagram: an "ALU" block containing Adder, Logic, Shift, Mult, and Div units; inputs are Opcode, Operand1, and Operand2; output is Result]
Interfacing ALUs to the Scheduler • Issue N instructions per cycle • Read N sets of operands, immediates, opcodes, and destination tags • Route each to the correct functional unit
[Diagram: Fetch & Dispatch, ARF, Data-Capture Scheduler, Functional Units, Bypass, and PRF/ROB with physical register update]
Effectively one nasty crossbar: select decisions, port bindings, etc. route each issued instruction's {opcode, ValL, ValR} from the data-capture payload RAM onto an issue lane
[Diagram: select logic and payload RAM driving Issue Ports 0-3 onto Issue Lanes 0-3]
"Register File" Organization • Each RF read port input has a 1-to-1 correspondence with one and only one RF read port output • No MUXing of outputs is required
[Diagram: a register file with read addresses R1, R7, R3, R4 producing val(R1), val(R7), val(R3), val(R4) on the corresponding ports; compare to the payload RAM read by select 0-3 onto Issue 0-3]
A "Register File" Is Overkill • The payload RAM does not need SRAM row decoders; the select signals already identify which entries to read • But how do you assign which set of data gets routed to which set of read port outputs?
[Diagram: RS entries / payload RAM read directly by the four Select blocks instead of by row decoders]
Execution Lane ↔ Select Binding • Payload RAM read port outputs are in the same order as the Select blocks
[Diagram: RS entries / payload RAM with the four Select blocks, each bound to one read port output]
Single Entry Close-Up • Each RS entry holds {Opcode, Src L, Src R} and connects through tri-state drivers to output buses that span all payload RAM entries • One RS entry only bids on one select port, so its payload is never driven onto more than one port • Each select port only gives the grant to a single RS entry, so no two payload entries ever drive the same payload output port
[Diagram: a single RS entry with bid/grant pairs 0-3 and tri-state drivers onto the four payload output buses]
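The bid/grant invariant above can be summarized in a short behavioral sketch (not from the slides; the entry fields and the priority order are illustrative):

```python
# Behavioral sketch of per-entry bid/grant and payload readout.
# Each RS entry bids on exactly one select port; each select port grants
# at most one bidding entry; the granted entry drives its {opcode, srcL, srcR}
# payload onto that port's output bus.

def select_and_read(entries, num_ports):
    """entries: list of dicts with 'ready', 'port', 'opcode', 'srcL', 'srcR'."""
    outputs = [None] * num_ports          # one output bus per issue port/lane
    for port in range(num_ports):
        # Grant the first ready entry bound to this port (simple priority select).
        for idx, e in enumerate(entries):
            if e["ready"] and e["port"] == port:
                assert outputs[port] is None   # only one driver per bus
                outputs[port] = (idx, e["opcode"], e["srcL"], e["srcR"])
                break
    return outputs

rs = [
    {"ready": True,  "port": 1, "opcode": "ADD", "srcL": 0x10, "srcR": 0x20},
    {"ready": True,  "port": 1, "opcode": "SHL", "srcL": 0x30, "srcR": 0x02},  # loses arbitration
    {"ready": False, "port": 0, "opcode": "MUL", "srcL": 0x05, "srcR": 0x06},
]
print(select_and_read(rs, num_ports=4))
```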
Need to "Swizzle" at the End • Outputs from the Opcode, Src L, and Src R silos must be regrouped so that each issue lane receives its own {opcode, Src L, Src R} • Nasty tangle of wires (the Src values are 64-128 bits each!)
Non-Data-Capture Scheduler • The payload RAM holds only Src L / Src R tags; the values are read after issue from a separate register file (an SRAM array with row decoders)
[Diagram: RS entries and payload RAM read by the Select blocks, with the Src L / Src R tags indexing the register file SRAM array]
Immediate Values • A data-capture scheduler can store immediate values in the payload bay • A non-data-capture scheduler needs separate storage • Could add an extra field to the payload • Could allocate a physical register and store the immediate there • Could store it in a separate "immediate file"
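As a rough illustration of the "separate immediate file" option, the following sketch stores immediates in a small side structure so the non-data-capture payload only carries an index; the class name, entry count, and allocation policy are all assumptions:

```python
# Sketch: keep immediates in a small separate "immediate file" so the
# non-data-capture payload only needs an index, not a full 32/64-bit field.

class ImmediateFile:
    def __init__(self, num_entries=32):
        self.values = [0] * num_entries
        self.free = list(range(num_entries))

    def allocate(self, imm):
        idx = self.free.pop()            # real HW would stall dispatch if empty
        self.values[idx] = imm
        return idx

    def read(self, idx):
        return self.values[idx]

    def release(self, idx):
        self.free.append(idx)

imm_file = ImmediateFile()
idx = imm_file.allocate(0x1234)          # at dispatch
payload_entry = {"opcode": "ADDI", "srcL_tag": 7, "imm_idx": idx}
print(hex(imm_file.read(payload_entry["imm_idx"])))   # at operand read
imm_file.release(idx)                    # when the entry is freed
```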
Distributed Scheduler • Grant/payload read lines may have to travel further horizontally (multiple RS widths) • Schedule→Execute latency is less critical than the Schedule→Schedule (wakeup-select) loop latency
[Diagram: distributed RS/payload RAM banks with Select 0-3 feeding FUs such as FAdd, FM/D, Shift, Store, FP-Ld, FP-St, ALU1, ALU2, M/D, Load]
Naive ALU Organization • Besides making scheduling hard to scale, letting any issue port feed any ALU makes operand routing a horrible mess (needs a full crossbar from the payload/RF read ports)
[Diagram: every FU (add, shift, mult, div, load, store, Fadd, FMul, FDiv) reachable from every payload/RF read port]
Execution-Port-Based Layout • Just need to fan out data to the FUs within the same execution lane; no crossbar needed • Each FU needs a "valid" input to know that the incoming data is meant for it and not for another FU in the same lane • Or just let them all compute in parallel and use only the output that you want → wasted power
[Diagram: Lanes 0-3, each with its own stack of FUs (e.g., add, shift, mult, div, load, store, FP ld, FPCvt)]
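A minimal sketch of the two options on this slide, assuming a lane with three illustrative FUs; option (a) gates work with a per-FU valid signal, option (b) lets everything compute and selects one output:

```python
# Sketch: within one execution lane, either (a) assert "valid" only for the
# targeted FU, or (b) let all FUs compute and MUX out one result (wasted power).

FUS = {
    "add":   lambda a, b: (a + b) & 0xFFFFFFFFFFFFFFFF,
    "shift": lambda a, b: (a << (b & 63)) & 0xFFFFFFFFFFFFFFFF,
    "mult":  lambda a, b: (a * b) & 0xFFFFFFFFFFFFFFFF,
}

def lane_execute_valid(opcode, a, b):
    # Option (a): only the FU whose "valid" input is asserted does any work.
    results = {}
    for name, fu in FUS.items():
        valid = (name == opcode)
        if valid:
            results[name] = fu(a, b)
    return results[opcode]

def lane_execute_all(opcode, a, b):
    # Option (b): every FU computes; a MUX picks the wanted output.
    results = {name: fu(a, b) for name, fu in FUS.items()}   # wasted power
    return results[opcode]

assert lane_execute_valid("add", 3, 4) == lane_execute_all("add", 3, 4) == 7
```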
Bypass Network Organization • From the payload RAM/register file: N × 2 sets of inputs at 64 bits each (N = issue width), plus f × 64 bits of FU results (f = number of FUs) • O(f²N) area just for the bypass wiring!!! • … which is cubic in N, since f = Ω(N) • The previous slide had f = 9 FUs, and that didn't even include all of the FP units
[Diagram: f × 64-bit result buses fanned back to the operand inputs of add, shift, mult, div]
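A back-of-the-envelope sketch of how the O(f²N) wiring claim can be counted: f result buses of b bits each must reach both operand inputs of all f FUs, across a datapath whose width grows with N. The decomposition and constants below are assumptions; only the growth trend is meant to mirror the slide:

```python
# Rough scaling sketch for the O(f^2 * N) bypass-wiring claim (constants are
# arbitrary; only the growth trend matters).
def bypass_wire_area(N, f, b=64):
    result_buses = f * b                  # every FU result broadcast, b bits each
    sinks_per_bus = 2 * f                 # each FU has 2 operand inputs to reach
    lane_width = N                        # wires span a datapath ~N lanes wide
    return result_buses * sinks_per_bus * lane_width

for N, f in [(2, 4), (4, 9), (8, 18)]:    # f grows with N, so growth is ~cubic in N
    print(N, f, bypass_wire_area(N, f))
```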
ALU Stacks • Group the FUs into per-lane stacks, with separate integer and floating-point bypass networks • Bypass MUXes are reduced to one pair per ALU stack (as opposed to one pair per FU)
[Diagram: integer stacks (add, shift, mult, div, load, store) and FP stacks (Fadd, Fmul, Fdiv, FPCvt, FP ld, FP st) fed from the payload/RF, with bypass followed by FU fan-out]
Bypass Sharing • Bypass wiring is reduced to one output per execution lane/ALU stack: the FUs in a stack share a single local FU output onto the bypass network
[Diagram: integer and FP bypass networks fed from the payload/RF, with each stack contributing one local FU output]
Bypass Sharing (2) • If all FUs in a stack have the same latency, writeback conflicts are impossible • because only one instruction can issue to each lane per cycle • But not all FUs have the same latency:
[Pipeline diagram: a 2-cycle shift (S X X E1 E2) issued to Lane 1, followed one cycle later by a 1-cycle add (S X X E) to Lane 1; both finish in the same cycle, so two instructions want to write back using the same bypass path!]
Bypass Sharing (3) • How to resolve this structural hazard? • Obvious solution: stall • Creates scheduling headaches • Treat bypass/WB as another structural resource • Separate select logic* for bypass allocation • *Not the same as the regular select logic, just a table read/write (a writeback scoreboard)
[Pipeline diagram, cycles 0-6: the 2-cycle shift to Lane 1 claims its bypass/writeback slot; the 1-cycle add to Lane 1 is held for a cycle so its writeback does not collide]
Bypass Sharing (4) • Wasted issue opportunity: B is picked by select, but cannot issue due to a WB conflict with A; C could have issued, but is stalled by one cycle
[Pipeline diagram, cycles 0-8: A is a 2-cycle shift to Lane 1, B a 1-cycle add to Lane 1, C a 3-cycle load to Lane 1; B's issue slips to avoid writing back in the same cycle as A, and C slips behind B]
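A sketch of the writeback-scoreboard idea from the previous slides: before an instruction issues, its (lane, writeback cycle) slot is checked and reserved, which is exactly why B above is held for a cycle. The table layout is an assumption:

```python
# Sketch: treat the per-lane writeback/bypass port as a structural resource.
# A small table records which (lane, cycle) writeback slots are already taken.

class WritebackScoreboard:
    def __init__(self):
        self.reserved = set()        # set of (lane, cycle) slots already claimed

    def try_issue(self, lane, issue_cycle, latency):
        wb_cycle = issue_cycle + latency
        if (lane, wb_cycle) in self.reserved:
            return False             # conflict: selected but cannot issue this cycle
        self.reserved.add((lane, wb_cycle))
        return True

wb = WritebackScoreboard()
# A: 2-cycle shift issued to lane 1 in cycle 0 -> writes back in cycle 2.
assert wb.try_issue(lane=1, issue_cycle=0, latency=2)
# B: 1-cycle add to lane 1 in cycle 1 would also write back in cycle 2 -> blocked.
assert not wb.try_issue(lane=1, issue_cycle=1, latency=1)
# B retries a cycle later and succeeds (writeback in cycle 3).
assert wb.try_issue(lane=1, issue_cycle=2, latency=1)
```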
Bypass Critical Path • With a single bypass network spanning all of the stacks, the total wire length is about twice the total width plus twice the total height of the datapath
[Diagram: the bypass path crossing all of the integer and FP ALU stacks]
Bypass Critical Path (2) • If each execution lane/ALU stack is self-contained, the longest path only crosses the total width once
[Diagram: per-stack bypass wiring, each stack local to its own FUs]
Bypass Control Problem • We now have the datapaths to forward values between ALUs/FUs • How do we orchestrate what goes where and when? • In particular, how do we set the controls of each of the bypass MUXes on a cycle-by-cycle basis?
Scoreboarding • For each value produced, make a note (in the scoreboard) of where it will be available • For each source, consult the scoreboard to find out how to rendezvous with the value
[Pipeline diagram, cycles 0-7: Port 0: ADD P17 = P21 + P4; Port 1: ADD P21 = …; Port 2: MUL P30 = P21 * P17; scoreboard entries record, e.g., that P21 comes from port 1, P17 from port 0, and P4 from the register file (R)]
Scoreboarding (2) • Setting bypass controls is easy • Read where the value will come from and feed the bypass MUXes in the operand-read stage • May add schedule→execute stages for a data-capture scheduler • Why not for non-data-capture?
[Diagram: payload (src tags P21, P4) indexing the writeback scoreboard; P21 maps to lane 1, P4 is already in the RF (R)]
Scoreboarding (3) • Updating can be more complicated • Depends on when the SB read occurs w.r.t. operand reading • Earlier reads cause more disconnect • Assume the SB is read in the 1st cycle after schedule
[Pipeline diagram: A (S X X E1 E2 E3) produces a value that is bypassed and then written back to the RF; B (S X X E) sources it from the bypass; C (S X X E) reads it from the RF, so A needs to update the SB in time for C to correctly source its operand]
Scoreboarding (4) • Scoreboard can become a critical timing bottleneck • All sources must read from the scoreboard • All destinations must update the scoreboard • Once at schedule to indicate bypass location • Once later to indicate the value has written back to the RF • ~4×N ports for the scoreboard! • If the scoreboard becomes multi-cycle, things can get really crazy • Need to bypass scoreboard reads/writes like inter-group rename bypassing
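Pulling the scoreboarding slides together, a minimal sketch of the control flow: record where each destination will be available at schedule time, update the entry again at writeback, and read it in the operand-read stage to steer the bypass MUXes. Function and field names are illustrative:

```python
# Sketch: a bypass scoreboard maps a destination tag to where its value can be
# picked up ("laneK" while in flight, "RF" after writeback).

scoreboard = {}                      # dest tag -> "laneK" or "RF"

def note_scheduled(dest_tag, lane):
    scoreboard[dest_tag] = f"lane{lane}"   # first update, at schedule time

def note_written_back(dest_tag):
    scoreboard[dest_tag] = "RF"            # second update, when the value reaches the RF

def bypass_mux_select(src_tag):
    # Read in the operand-read stage; this drives the bypass MUX for one source.
    return scoreboard.get(src_tag, "RF")

note_scheduled(dest_tag=21, lane=1)        # P21 produced on lane 1
print(bypass_mux_select(21))               # -> "lane1" (take the bypass)
print(bypass_mux_select(4))                # -> "RF" (P4 already written back)
note_written_back(21)
print(bypass_mux_select(21))               # -> "RF"
```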
CAM-based Bypass • Extend the data-capture concept to the bypass network: compare each operand's register tag against the result tags broadcast by every lane, and select the matching lane's result value (or the value from the payload/RF if no lane matches)
[Diagram: per-operand tag comparators (=) against Lanes 0-3, generating Use Lane 0-3 / Use PL/RF MUX controls]
CAM-based Bypass (2) • Must carry the destination tag to execution and broadcast it along with the result • But you have to do this anyway; need the destination tag for RF writeback • A lot of CAM logic • Costs power and area • Control is simple: it's basically control-less
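A sketch of the CAM-style comparison described above: each operand tag is compared against the destination tags broadcast by every lane this cycle, and a match steers the MUX to that lane; otherwise the payload/RF value is used (the lane contents are made up):

```python
# Sketch: CAM-based bypass. Each lane broadcasts (dest_tag, result); every
# operand compares its own tag against all broadcasts in parallel.

def pick_operand(src_tag, rf_value, lane_broadcasts):
    """lane_broadcasts: list of (dest_tag, result) per lane, or None if idle."""
    for lane, bc in enumerate(lane_broadcasts):
        if bc is not None and bc[0] == src_tag:
            return bc[1], f"use lane {lane}"
    return rf_value, "use PL/RF"

broadcasts = [(17, 123), None, (21, 456), None]   # lanes 0-3 this cycle
print(pick_operand(src_tag=21, rf_value=0, lane_broadcasts=broadcasts))  # (456, 'use lane 2')
print(pick_operand(src_tag=9,  rf_value=7, lane_broadcasts=broadcasts))  # (7, 'use PL/RF')
```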
Writeback to Data-Capture • Looks very similar to the bypass CAM: each payload entry compares its SrcL/SrcR tags against the results broadcast by the execution lanes and captures the matching ValL/ValR
[Diagram: the data-capture scheduler's payload with per-entry tag comparators (=) against Exec Lanes 0-3, latching ValL/ValR]
PRF Writeback Latency • Physical register file with a 3-cycle write latency • A: ADD P21 = … • B: ADD P17 = P21 + … • C: MUL P30 = P21 × P17 • Problem: how does C pick up the value of P21 once it has left the bypass network but has not yet been written into the PRF?
[Diagram: A's result traversing the bypass network and the 3-cycle PRF write while C issues]
Multi-Level Bypass • The bypass network must cover the latency of the writeback operation • If WB requires N cycles, then the bypass must be able to source all N cycles' worth of results • But this is only for one ALU (or ALU stack)
[Diagram: a 3-level bypass holding results A, B, C between the ALU and the physical register file, with the operand MUX also fed from the PL/RF]
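A behavioral sketch of a D-level bypass for one ALU stack, assuming the slide's 3-cycle PRF write latency: the last D results stay visible to the bypass MUX while they drain toward the PRF, so a consumer always finds the value somewhere:

```python
# Sketch: multi-level bypass for one ALU stack with a D-cycle PRF write latency.
from collections import deque

class MultiLevelBypass:
    def __init__(self, depth=3):
        self.levels = deque([None] * depth, maxlen=depth)   # youngest level first

    def tick(self, produced):
        """produced: (dest_tag, value) or None; returns the entry reaching the PRF."""
        retiring = self.levels[-1]
        self.levels.appendleft(produced)
        return retiring

    def lookup(self, src_tag, rf_value):
        for entry in self.levels:            # search every in-flight level
            if entry is not None and entry[0] == src_tag:
                return entry[1]
        return rf_value                      # already written back to the RF

prf = {}
byp = MultiLevelBypass(depth=3)
for cycle, produced in enumerate([(21, 99), None, None, None]):
    retired = byp.tick(produced)
    if retired is not None:
        prf[retired[0]] = retired[1]         # PRF write completes 3 cycles later
    print(cycle, byp.lookup(21, prf.get(21)))  # a consumer of P21 always sees 99
```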
Superscalar, Multi-Level Bypass
[Diagram: ALU Stacks 0-3, each with multi-level bypass covering a 3-cycle PRF WB latency]
A Bit More Hierarchical
[Diagram: ALU Stacks 0-3 with a hierarchical bypass organization feeding the physical register writeback]
Bypass Network Complexity • Parameters • N = issue width • f = number of functional units • b = bit width of data* (e.g., 32 bits, 64 bits) • D = network depth (RF write latency) • Metrics • Area • Latency … both contribute directly to power • *For CAM-based bypass logic, the tag width should be included as well
Bypass Network Complexity (Area) • Width • 2×(N+D) + 1 inputs at b bits each • Replicated N times • Total: 2N²b + Nb(D+1) • Height • N values at b bits each, times D levels • MUXes: O((D−1)×lg N + lg(N+D)) • Assume FUs per ALU stack is constant: f/N = O(1) • Total: O(NDb) • Total area • O(N³b²D + N²b²D²) • Cubic in N, quadratic in D and b
[Diagram: N stacks wide; per stack, N+D inputs of b bits, D levels of N values, and an O(f/N)-to-1 output MUX of O(lg(f/N)) height]
Bypass Network Complexity (Delay) • ALU output to 1st latch • O(lg(f/N)) gate delay for the MUX • O(N+D) wire delay horizontally • O(f/N + lg(N+D)) wire delay vertically • Last latch to ALU input • O(N+D) wire delay horizontally • O(lg N) gate delay for the 1st MUX • O(N + lg N) wire delay vertically • O(lg(N+D)) gate delay • Gate delay (worse of the two): O(lg(N+D)) or O(lg(f/N)) • Wire length (ditto): O(N + D + f/N) • An unbuffered wire has quadratic delay in its length
[Diagram: same layout as the previous slide]
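To make the scaling concrete, a small sketch that plugs the slide's width and height expressions into an area estimate (units and constants are arbitrary):

```python
# Plug the slide's width/height expressions into a rough area estimate.
def bypass_area(N, D, b=64):
    width  = 2 * N * N * b + N * b * (D + 1)   # 2N^2*b + Nb(D+1)
    height = N * D * b                         # N values of b bits, D levels
    return width * height                      # O(N^3 b^2 D + N^2 b^2 D^2)

for N in (2, 4, 8):
    print(N, bypass_area(N, D=3))              # roughly cubic growth in N
```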
Bypass Network Complexity* • *Complexity analysis is entirely dependent on the layout assumptions; for example, hierarchical vs. non-hierarchical bypass organizations lead to different areas, wire lengths, and gate delays • When someone says "this circuit's area scales quadratically with respect to X", this really means "this circuit's area scales quadratically with respect to X, assuming a layout style of Z"
ALU Clustering • The exact distribution of FUs to ALU stacks and/or select binding groups can affect layout • Already saw how separating the INT and FP stacks reduces unnecessary datapaths • Has additional benefits when bits(INT) != bits(FP) • Ex.: x86 uses 32/64-bit integers, but internally uses 80-bit FP • SSE/SSE2/SSE3 operate on 128-bit packed SIMD values, but normal GPRs are still only 64 bits wide • Certain instructions do not generate outputs (branches) • Memory instructions are treated differently (outputs go to the LSQ), and stores don't generate a register result
Clustered Microarchitectures • Bypass network delays scale poorly • Scheduling delays scale poorly • RF delays scale poorly • Partition into smaller control and data domains
Clustered Scheduling • RS entries, payload, and FUs are partitioned per cluster; wakeups between clusters travel over a cross-cluster wakeup interconnection network • Cross-cluster wakeup may take > 1 cycle
[Diagram: RS Entries (Clusters 0-3) with Payload0-3 and FUs forming the execution clusters, connected by the cross-cluster wakeup interconnection network]
Cross-Cluster Wakeup Delay • 2 clusters, round-robin cluster assignment • A group of dependent instructions A-E (assume all 1-cycle latencies) that normally takes 3 cycles now takes 5 cycles, because each cross-cluster wakeup costs an extra cycle • But a different clustering algorithm only needs 3!
[Diagram: instructions A-E assigned round-robin across the two clusters, with cross-cluster wakeups on the critical path, versus an assignment that keeps dependent instructions together]
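A sketch of this example: compute the completion time of a dependence graph when each cross-cluster wakeup adds one cycle. The five-instruction graph and the two assignments are illustrative assumptions, chosen to reproduce the slide's 3-cycle vs. 5-cycle comparison:

```python
# Sketch: earliest completion time of a dependence graph when a cross-cluster
# wakeup adds one extra cycle. Instructions and dependences are illustrative.

def completion_time(deps, cluster, xc_penalty=1, latency=1):
    """deps: dict inst -> list of producers (listed in program order);
    cluster: dict inst -> cluster id."""
    finish = {}
    for inst in deps:
        ready = 0
        for p in deps[inst]:
            extra = xc_penalty if cluster[p] != cluster[inst] else 0
            ready = max(ready, finish[p] + extra)
        finish[inst] = ready + latency
    return max(finish.values())

deps = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": ["D"]}   # two chains: A-B-C, D-E

round_robin = {"A": 0, "B": 1, "C": 0, "D": 1, "E": 0}   # alternating clusters
dep_aware   = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}   # keep each chain together

print(completion_time(deps, round_robin))   # 5 cycles (cross-cluster wakeups on the chain)
print(completion_time(deps, dep_aware))     # 3 cycles
```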
Cross-Cluster Bypass • Similar delay issues as in the scheduling case • Values may take > 1 cycle to get from cluster to cluster
[Diagram: per-cluster payload and FUs connected by a cross-cluster bypass network]
Cross-Cluster Bypass (2) • So do we have to pay cross-cluster penalties once at schedule and again at bypass? • No: the penalties are not additive! • This assumes that the wakeup delay (Ci→Cj) equals the bypass delay (Ci→Cj); if true for all i and j, then the bypass and wakeup delays always overlap
[Pipeline diagram: A (S X X X X E) in one cluster; B schedules 2 cycles after A due to the extra cycle of wakeup delay, by which time A's value has also crossed the bypass network]
Clustered RFs • Place 1/nth of the physical registers in each cluster • How to partition? • ARF/PRF: read at dispatch; extra latency may require more levels of bypassing • Unified PRF: latency may make the sched→exec delay intolerable (replay penalty too expensive), plus all of the bypassing • Replicate the PRF • Keep a full copy of the register file in each cluster • Reduces per-cluster read port requirements • Still need to write to all clusters (each cluster needs a full set of write ports)
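Finally, a sketch of the "replicate PRF" option: each cluster keeps a full copy, reads stay local, and every write is broadcast to all copies (so the write ports are replicated too). The class is illustrative:

```python
# Sketch: replicated physical register file. Reads are local to a cluster;
# writes are broadcast so every copy stays identical.

class ReplicatedPRF:
    def __init__(self, num_clusters, num_regs):
        self.copies = [[0] * num_regs for _ in range(num_clusters)]

    def read(self, cluster, preg):
        return self.copies[cluster][preg]        # only local read ports needed

    def write(self, preg, value):
        for copy in self.copies:                 # every cluster needs the write ports
            copy[preg] = value

prf = ReplicatedPRF(num_clusters=4, num_regs=128)
prf.write(21, 0xDEAD)
assert all(prf.read(c, 21) == 0xDEAD for c in range(4))
```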