Advanced Microarchitecture Lecture 2: Pipelining and Superscalar Review
Pipelined Design
• Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or Throughput = Performance
  • BW = number of tasks / unit time
  • For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases
Pipelining Illustrated
• Combinational logic, N gate delays: BW ≈ 1/N
• Split into two blocks of N/2 gate delays each: BW ≈ 2/N
• Split into three blocks of N/3 gate delays each: BW ≈ 3/N
Performance Model
• Starting from an unpipelined design with propagation delay T and BW = 1/T
• Pipelining into k stages inserts a latch of delay S after each stage:
  Perfpipe = BWpipe = 1 / (T/k + S)
  where k = number of stages and S = latch delay
Hardware Cost Model
• Starting from an unpipelined design with hardware cost G
• Pipelining into k stages adds a latch of cost L per stage:
  Costpipe = G + kL
  where k = number of stages and L = latch cost, including control
Cost/Performance Tradeoff
• Cost/Performance:
  C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S)
      = LT + GS + LSk + GT/k
• Optimal cost/performance: find the minimum of C/P with respect to the choice of k:
  d(C/P)/dk = LS − GT/k² = 0  ⇒  kopt = √(GT / (LS))
“Optimal” Pipeline Depth: kopt
[Plot: cost/performance ratio C/P (×10⁴) vs. pipeline depth k, for (G=175, L=41, T=400, S=22) and (G=175, L=21, T=400, S=11); each curve has a minimum at its kopt]
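As a quick sanity check of the model, a small Python sketch (parameter values taken from the plot above) evaluates C/P and the closed-form optimum:

    import math

    def cost_perf(k, G, L, T, S):
        return (L * k + G) * (T / k + S)     # C/P = cost * (1 / throughput)

    def k_opt(G, L, T, S):
        return math.sqrt(G * T / (L * S))    # from d(C/P)/dk = LS - GT/k^2 = 0

    for (G, L, T, S) in [(175, 41, 400, 22), (175, 21, 400, 11)]:
        k = k_opt(G, L, T, S)
        print(f"G={G}, L={L}, T={T}, S={S} -> "
              f"k_opt = {k:.1f}, C/P at k_opt = {cost_perf(k, G, L, T, S):.0f}")

The first parameter set gives kopt ≈ 8.8 and the second kopt ≈ 17.4: cheaper, faster latches push the optimum toward deeper pipelines.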
Cost?
• "Hardware cost" includes:
  • Transistor/gate count, including the additional logic to control the pipeline
  • Area (related to gate count)
  • Power: more gates → more switching, and more gates → more leakage
• Many metrics to optimize; very difficult to determine what really is "optimal"
Pipelining Idealism
• Uniform Suboperations
  • The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of Identical Operations
  • The same operations are to be performed repeatedly on a large number of different inputs
• Repetition of Independent Operations
  • All the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts
• Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)
Instruction Pipeline Design
• Uniform suboperations … NOT!
  → Balance pipeline stages
  • Stage quantization to yield balanced stages
  • Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT!
  → Unify instruction types
  • Coalesce instruction types into one "multi-function" pipe
  • Minimize external fragmentation (some idling stages)
• Independent operations … NOT!
  → Resolve data and resource hazards
  • Inter-instruction dependency detection and resolution
  • Minimize performance loss
The Generic Instruction Cycle
• The "computation" to be pipelined:
  • Instruction Fetch (IF)
  • Instruction Decode (ID)
  • Operand(s) Fetch (OF)
  • Instruction Execution (EX)
  • Operand Store (OS), a.k.a. writeback (WB)
  • Update Program Counter (PC)
The Generic Instruction Pipeline
• Based on the obvious subcomputations:
  • IF: Instruction Fetch
  • ID: Instruction Decode
  • OF/RF: Operand Fetch
  • EX: Instruction Execute
  • OS/WB: Operand Store
Balancing Pipeline Stages
• Example stage latencies: TIF = 6 units, TID = 2 units, TOF = 9 units, TEX = 5 units, TOS = 9 units
• Without pipelining: Tcyc = TIF + TID + TOF + TEX + TOS = 31
• Pipelined: Tcyc = max{TIF, TID, TOF, TEX, TOS} = 9
• Speedup = 31 / 9 ≈ 3.4
• Can we do better in terms of either performance or efficiency?
Balancing Pipeline Stages
• Two methods for stage quantization:
  • Merging multiple subcomputations into one
  • Subdividing a subcomputation into multiple smaller ones
• Recent/current trends:
  • Deeper pipelines (more and more stages), up to a certain point: then the cost function takes over
  • Multiple different pipelines/subpipelines
  • Pipelining of memory accesses (tricky)
Granularity of Pipeline Stages
• Stage latencies as before: (TIF, TID, TOF, TEX, TOS) = (6, 2, 9, 5, 9)
• Coarser-grained machine cycle: merge IF and ID into one stage (TIF&ID = 8 units) → 4 machine cycles per instruction, Tcyc = 9 units
• Finer-grained machine cycle: subdivide each stage into 3-unit pieces (IF×2, ID×1, OF×3, EX×2, OS×3) → 11 machine cycles per instruction, Tcyc = 3 units
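A small Python sketch, using the stage latencies from the example above, compares the two quantizations against the unpipelined design (latch overhead ignored for simplicity):

    import math

    stages = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}   # latencies in units
    unpipelined = sum(stages.values())                        # 31 units

    # coarser-grained: merge IF+ID into one stage; clock = slowest stage
    coarse_cyc = max(stages["IF"] + stages["ID"],
                     stages["OF"], stages["EX"], stages["OS"])          # 9 units

    # finer-grained: quantize each stage into 3-unit machine cycles
    quantum = 3
    fine_stages = sum(math.ceil(t / quantum) for t in stages.values())  # 11 stages
    fine_cyc = quantum                                                   # 3 units

    print(f"coarse: 4 stages, speedup = {unpipelined / coarse_cyc:.2f}")   # 3.44
    print(f"fine: {fine_stages} stages, speedup = {unpipelined / fine_cyc:.2f}")  # 10.33

With real latch delays, the fine-grained design's advantage shrinks, which is exactly the cost/performance tradeoff from earlier slides.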
Hardware Requirements
• Logic needed for each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory access ports needed to support all (relevant) stages
[Diagram: the pipeline stage sequences from the previous slide, showing which stages access the register file and memory]
Pipeline Examples
• AMDAHL 470V/7 (12 stages, grouped into the generic five):
  • IF: PC GEN, Cache Read, Cache Read
  • ID: Decode
  • OF: Read REG, Add GEN, Cache Read, Cache Read
  • EX: EX 1, EX 2
  • OS: Check Result, Write Result
• MIPS R2000/R3000 (5 stages): IF, RD, ALU, MEM, WB
Instruction Dependencies
• Data Dependence
  • True dependence (RAW): an instruction must wait for all required input operands
  • Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
  • Output dependence (WAW): an earlier write must not clobber an already-finished later write
• Control Dependence (a.k.a. procedural dependence)
  • Conditional branches cause uncertainty in instruction sequencing
  • Instructions following a conditional or computed branch depend on the execution of the branch instruction
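A minimal Python sketch (instruction encoding hypothetical) of how the three register data dependence classes fall out of read/write set comparisons:

    # classify the register dependences from an earlier instruction `first`
    # to a later instruction `second`, each given as (writes, reads) sets
    def classify(first, second):
        w1, r1 = first
        w2, r2 = second
        deps = []
        if w1 & r2: deps.append(("RAW", w1 & r2))   # true dependence
        if r1 & w2: deps.append(("WAR", r1 & w2))   # anti-dependence
        if w1 & w2: deps.append(("WAW", w1 & w2))   # output dependence
        return deps

    # e.g. r1 = r2 + 1  followed by  r3 = r1 / 17  -> RAW on r1
    print(classify(({"r1"}, {"r2"}), ({"r3"}, {"r1"})))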
Example: Quick Sort on MIPS
# for (; (j < high) && (array[j] < array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low
      bge  $10, $9, $36
      mul  $15, $10, 4
      addu $24, $6, $15
      lw   $25, 0($24)
      mul  $13, $8, 4
      addu $14, $6, $13
      lw   $15, 0($14)
      bge  $25, $15, $36
$35:  addu $10, $10, 1
      . . .
$36:  addu $11, $11, -1
      . . .
Hardware Dependency Analysis
• The processor must handle:
  • Register data dependencies: RAW, WAW, WAR
  • Memory data dependencies: RAW, WAW, WAR
  • Control dependencies
Terminology
• Pipeline hazards:
  • Potential violations of program dependencies
  • Must ensure program dependencies are not violated
• Hazard resolution:
  • Static method: performed at compile time in software
  • Dynamic method: performed at runtime using hardware (stall, flush, or forward)
• Pipeline interlock:
  • Hardware mechanism for dynamic hazard resolution
  • Must detect and enforce dependencies at runtime
Pipeline: Steady State

           t0   t1   t2   t3   t4   t5   …
  Instj    IF   ID   RD   ALU  MEM  WB
  Instj+1       IF   ID   RD   ALU  MEM  …
  Instj+2            IF   ID   RD   ALU  …
  Instj+3                 IF   ID   RD   …
  Instj+4                      IF   ID   …
  Instj+5                           IF   …
Pipeline: Data Hazard
• Same steady-state picture, but now an older instruction (e.g., Instj) produces a register value that a younger instruction (e.g., Instj+1) wants to read in its RD stage before Instj has written it back in WB: the dependence points backwards across the staircase.
Pipeline: Stall on Data Hazard

           t0   t1   t2   t3   t4   t5   t6   t7   t8
  Instj    IF   ID   RD   ALU  MEM  WB
  Instj+1       IF   ID   ..   ..   RD   ALU  MEM  WB    (stalled in RD)
  Instj+2            IF   ..   ..   ID   RD   ALU  MEM   (stalled in ID)
  Instj+3                 ..   ..   IF   ID   RD   ALU   (stalled in IF)
  Instj+4                           IF   ID   RD   …
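A minimal Python sketch of the interlock's effect on timing, assuming a 6-stage pipe (IF ID RD ALU MEM WB), no forwarding, and a register file that writes in the first half of WB and reads in the second half (so a value is readable in its producer's WB cycle):

    def rd_cycles(instrs):
        """instrs: list of (dest_regs, src_regs) sets; returns the cycle in
        which each instruction occupies the RD stage (instruction i fetches
        in cycle i, single issue)."""
        ready = {}                   # reg -> first cycle it can be read in RD
        sched, last_rd = [], -1
        for i, (dsts, srcs) in enumerate(instrs):
            rd = max([i + 2, last_rd + 1] +        # RD is the third stage
                     [ready.get(r, 0) for r in srcs])
            sched.append(rd)
            last_rd = rd                           # in-order: one RD at a time
            for r in dsts:
                ready[r] = rd + 3                  # ALU, MEM, then WB at rd+3
        return sched

    # Instj writes r1, Instj+1 reads it: its RD slips from cycle 3 to cycle 5
    print(rd_cycles([({"r1"}, {"r2"}), (set(), {"r1"})]))   # -> [2, 5]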
Pipeline: Forwarding Paths
• Same staircase as before, with many possible forwarding paths added: a result can be passed directly from a later stage of an older instruction to an earlier stage of a younger one (e.g., ALU→ALU, MEM→ALU)
• Some cases still require stalling even with forwarding paths (e.g., a MEM-stage load result needed by the very next instruction's ALU stage)
ALU Forwarding Paths
[Datapath sketch: each source operand (src1, src2) read between IF/ID and the register file is compared against the destination register of the instructions currently in the ALU and MEM stages; on a match, the bypass network supplies the in-flight result instead of the register file value]
• A deeper pipeline may require additional forwarding paths
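A minimal Python sketch of the selection logic those comparators imply (names hypothetical; a real bypass network is combinational muxes, not code):

    def select_operand(src, regfile, alu_stage, mem_stage):
        """alu_stage / mem_stage: (dest_reg, result) for the instruction
        currently in that stage, or None if the stage holds no producer."""
        if alu_stage and alu_stage[0] == src:
            return alu_stage[1]        # forward the youngest in-flight value
        if mem_stage and mem_stage[0] == src:
            return mem_stage[1]        # else forward the older MEM-stage value
        return regfile[src]            # no match: the register file is current

    # r1 is being produced by the instruction now in ALU; use its result
    print(select_operand("r1", {"r1": 0}, ("r1", 42), None))   # -> 42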
Pipeline: Control Hazard
• Same steady-state picture, but Insti is a branch: until it resolves late in the pipe, the identities of Insti+1, Insti+2, … are not known for certain.
Pipeline: Stall on Control Hazard

           t0   t1   t2   t3   t4   t5   t6   t7   t8
  Insti    IF   ID   RD   ALU  MEM  WB
  Insti+1       ..   ..   ..   IF   ID   RD   ALU  MEM   (stalled in IF until the branch resolves)
  Insti+2                      IF   ID   RD   ALU  …
  Insti+3                           IF   ID   RD   …
  Insti+4                                IF   ID   …
Pipeline: Prediction for Control Hazards

              t0   t1   t2   t3   t4   t5   t6   t7
  Insti       IF   ID   RD   ALU  MEM  WB
  Insti+1          IF   ID   RD   ALU  MEM  WB          ← branch resolves in ALU; mispredict detected, speculative state cleared
  Insti+2               IF   ID   RD   nop  nop         (squashed)
  Insti+3                    IF   ID   nop  nop         (squashed)
  Insti+4                         IF   nop  nop         (squashed)
  New Insti+2                          IF   ID   RD     (fetch resteered)
  New Insti+3                               IF   ID
  New Insti+4                                    IF
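A minimal Python sketch of the squash-and-resteer step (stage names from the diagram; the encoding is hypothetical):

    NOP = "nop"

    def resolve_branch(pipe, predicted_taken, actual_taken, target, fallthrough):
        """pipe: dict mapping stage name -> in-flight instruction. Returns the
        corrected fetch PC on a mispredict, or None if the prediction held."""
        if actual_taken == predicted_taken:
            return None                       # speculation was correct
        for stage in ("IF", "ID", "RD"):      # clear the speculative state:
            pipe[stage] = NOP                 # younger instructions become nops
        return target if actual_taken else fallthrough   # resteer fetch

    pipe = {"IF": "Insti+4", "ID": "Insti+3", "RD": "Insti+2"}
    print(resolve_branch(pipe, False, True, 0x400, 0x104), pipe)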
Going Beyond Scalar
• A simple pipeline is limited to CPI ≥ 1.0
• A "superscalar" machine can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
• Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
• Contrast with vector machines, which also execute multiple operations in parallel, but all of the same kind (e.g., four parallel additions)
Architectures for Instruction Parallelism
• Scalar pipeline (baseline):
  • Instruction overlap parallelism = D
  • Operation latency = 1
  • Peak IPC = 1
[Diagram: successive instructions vs. time in cycles (1-12); D different instructions overlapped in the D-deep pipeline]
Superscalar Machine
• Superscalar (pipelined) execution:
  • Instruction parallelism = D × N
  • Operation latency = 1
  • Peak IPC = N per cycle
[Diagram: N instructions enter the pipeline per cycle; D × N different instructions overlapped]
Ex. Original Pentium
• Prefetch: 4 × 32-byte buffers
• Decode1: decodes up to 2 instructions
• Decode2 (×2): read operands, address computation
• Execute (×2): two asymmetric pipes
  • Both u-pipe and v-pipe: mov, lea, simple ALU, push/pop, test/cmp
  • u-pipe only: shift, rotate, some FP
  • v-pipe only: jmp, jcc, call, fxch
• Writeback (×2)
Pentium Hazards, Stalls
• "Pairing rules" (when can/can't two instructions execute at the same time?)
  • Read/flow dependence:
      mov eax, 8
      mov [ebp], eax
  • Output dependence:
      mov eax, 8
      mov eax, [ebp]
  • Partial register stalls:
      mov al, 1
      mov ah, 0
  • Function unit rules:
    • Some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
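A minimal Python sketch in the spirit of these rules (greatly simplified: partial-register cases and the u/v capability split are omitted, and the real pairing rules have many more cases):

    NEVER_PAIRED = {"mul", "div", "pusha", "movs"}   # plus some FP ops

    def can_pair(u_inst, v_inst):
        """Each instruction: (opcode, writes, reads) with register-name sets."""
        op1, w1, r1 = u_inst
        op2, w2, r2 = v_inst
        if op1 in NEVER_PAIRED or op2 in NEVER_PAIRED:
            return False               # function unit rules
        if w1 & r2:
            return False               # read/flow dependence
        if w1 & w2:
            return False               # output dependence
        return True

    # mov eax, 8 ; mov [ebp], eax  -> flow dependence, cannot pair
    print(can_pair(("mov", {"eax"}, set()), ("mov", set(), {"eax", "ebp"})))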
Limitations of In-Order Pipelines
• CPI of in-order pipelines degrades very sharply if machine parallelism is increased beyond a certain point, i.e., when N approaches the average distance between dependent instructions
• Forwarding is no longer effective, so the machine must stall more often
• The pipeline may never be full, due to the frequency of dependency stalls
N Instruction Limit
• Ex.: superscalar degree N = 4: a dependent instruction must be at least N = 4 instructions away; any dependency between instructions closer than that will cause a stall
• On average, the parent-child separation is only about 5 instructions (Franklin and Sohi '92)
• An average of 5 means there are many cases where the separation is < 4, and each of these limits parallelism
• Pentium: superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns
In Search of Parallelism
• "Trivial" parallelism is limited
  • What is trivial parallelism? In-order: sequential instructions that do not have dependencies
  • In all previous examples, every instruction executed either at the same time as or after earlier instructions
  • The previous slides show that superscalar execution quickly hits a ceiling
• So what is "non-trivial" parallelism? …
What is Parallelism?
• Work T1: time to complete the computation on a sequential (1-wide) system
• Critical path T∞: time to complete the same computation on an infinitely parallel system
• Average parallelism: Pavg = T1 / T∞
• For a p-wide system: Tp ≥ max{T1/p, T∞}; if Pavg >> p, then Tp ≈ T1/p
• Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
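Worked numbers for the slide's example, as a small Python sketch (counting each arithmetic operation as one unit of time):

    # x = a + b          -> 1 op
    # y = b * 2          -> 1 op
    # z = (x-y) * (x+y)  -> 3 ops (sub, add, mul)
    T1 = 1 + 1 + 3                 # work: 5 ops total
    Tinf = 3                       # critical path: (x | y) -> (x-y | x+y) -> mul
    Pavg = T1 / Tinf               # average parallelism = 5/3, about 1.67

    for p in (1, 2, 4):
        Tp = max(T1 / p, Tinf)     # a p-wide machine can do no better
        print(f"p={p}: Tp >= {Tp:.1f}")

Note that beyond p = 2 the critical path dominates: extra width buys nothing for this computation.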
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependency between instructions
• Average ILP = number of instructions / length of the longest dependence path
• code1: must execute serially; ILP = 1 (T1 = 3, T∞ = 3)
    r1 = r2 + 1
    r3 = r1 / 17
    r4 = r0 - r3
• code2: all three can execute at the same time; ILP = 3 (T1 = 3, T∞ = 1)
    r1 = r2 + 1
    r3 = r9 / 17
    r4 = r0 - r10
ILP != IPC
• ILP usually assumes infinite resources, perfect fetch, and unit latency for all instructions; it is a property of the program's dataflow
• IPC is the "real," observed metric: exactly how many instructions are executed per machine cycle, including all the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC
Scope of ILP Analysis
    r1 = r2 + 1
    r3 = r1 / 17
    r4 = r0 - r3        ← this block alone: ILP = 1
    r11 = r12 + 1
    r13 = r19 / 17
    r14 = r0 - r20      ← this block alone: ILP = 3
• Analyzed as one window of six instructions: ILP = 2 (6 instructions / longest path of 3)
• The ILP you measure depends on the scope over which you measure it
DFG Analysis
    A: R1 = R2 + R3
    B: R4 = R5 + R6
    C: R1 = R1 * R4
    D: R7 = LD 0[R1]
    E: BEQZ R7, +32
    F: R4 = R7 - 3
    G: R1 = R1 + 1
    H: R4 ST 0[R1]
    J: R1 = R1 - 1
    K: R3 ST 0[R1]
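A small Python sketch that extracts the register dependences of this sequence (memory and control dependences omitted; transitive pairs through an intervening writer are not filtered out):

    insts = [                      # (name, writes, reads), registers only
        ("A", {"R1"}, {"R2", "R3"}),
        ("B", {"R4"}, {"R5", "R6"}),
        ("C", {"R1"}, {"R1", "R4"}),
        ("D", {"R7"}, {"R1"}),
        ("E", set(),  {"R7"}),
        ("F", {"R4"}, {"R7"}),
        ("G", {"R1"}, {"R1"}),
        ("H", set(),  {"R4", "R1"}),   # the store reads R4 and R1
        ("J", {"R1"}, {"R1"}),
        ("K", set(),  {"R3", "R1"}),   # the store reads R3 and R1
    ]
    for i, (ni, wi, ri) in enumerate(insts):
        for (nj, wj, rj) in insts[i + 1:]:
            for kind, regs in (("RAW", wi & rj), ("WAR", ri & wj),
                               ("WAW", wi & wj)):
                if regs:
                    print(kind, f"{ni}->{nj}", sorted(regs))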
In-Order Issue, Out-of-Order Completion
• Issue = send an instruction to execution
• Instructions leave the in-order instruction stream and begin execution in order
• Functional units of different depths (INT; Fadd1-Fadd2; Fmul1-Fmul3; Ld/St) mean instructions complete out of order
• The issue stage needs to check:
  1. Structural dependence
  2. RAW hazard
  3. WAW hazard
  4. WAR hazard
Example
• The code from the DFG slide, with in-order issue and out-of-order completion (assuming a 3-cycle load and enough issue width):
  Cycle 1: A, B
  Cycle 2: C
  Cycle 3: D
  Cycles 4-5: (stalled: E needs the load's R7)
  Cycle 6: E, F, G
  Cycle 7: H, J
  Cycle 8: K
• IPC = 10 / 8 = 1.25
Example (2)
• Same code, but D now loads into R9, and H and J use R9, breaking the dependences through the load:
    A: R1 = R2 + R3
    B: R4 = R5 + R6
    C: R1 = R1 * R4
    D: R9 = LD 0[R1]
    E: BEQZ R7, +32
    F: R4 = R7 - 3
    G: R1 = R1 + 1
    H: R4 ST 0[R9]
    J: R1 = R9 - 1
    K: R3 ST 0[R1]
• Schedule:
  Cycle 1: A, B
  Cycle 2: C
  Cycle 3: D, E, F, G
  Cycles 4-5: (stalled: H and J need the load's R9)
  Cycle 6: H, J
  Cycle 7: K
• IPC = 10 / 7 = 1.43
Track with Simple Scoreboarding
• Scoreboard: a bit array, 1 bit for each GPR
  • If the bit is not set: the register has valid data
  • If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn(RS, RT)
  • If SB[RS] or SB[RT] is set → RAW, stall
  • If SB[RD] is set → WAW, stall
  • Else, dispatch to FU (Fn) and set SB[RD]
• Complete out of order:
  • Update GPR[RD], clear SB[RD]
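A minimal Python sketch of this scoreboard (register indices assumed; a real scoreboard is a hardware bit vector, not code):

    class Scoreboard:
        def __init__(self, nregs=32):
            self.pending = [False] * nregs    # set = outstanding write in flight

        def can_issue(self, rd, rs, rt):
            if self.pending[rs] or self.pending[rt]:
                return False                  # RAW: a source is still stale
            if self.pending[rd]:
                return False                  # WAW: an earlier write is pending
            return True

        def issue(self, rd):
            self.pending[rd] = True           # dispatch to the FU, mark RD stale

        def complete(self, rd):
            self.pending[rd] = False          # writeback: GPR[rd] is valid again

    sb = Scoreboard()
    sb.issue(1)                               # r1 = f(...), in flight
    print(sb.can_issue(2, 1, 0))              # r2 = g(r1, r0) -> False: RAW stall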
Out-of-Order Issue
• An extra stage with buffers for dependency resolution (DR) sits between the in-order instruction stream and the functional units
• Instructions leave the DR buffers and execute out of program order
• Same functional units as before (INT; Fadd1-Fadd2; Fmul1-Fmul3; Ld/St); completion is out of order
OOO Scoreboarding
• Similar to in-order scoreboarding
• Needs new tables to track the status of individual instructions and functional units
• Still enforces dependencies:
  • Stall dispatch on WAW
  • Stall issue on RAW
  • Stall completion on WAR
• Limitations of scoreboarding? Hints:
  • There are no structural hazards
  • You can always write a RAW-free code sequence:
    Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
  • Think about the x86 ISA with only 8 registers
• The finite number of registers in any ISA forces register names to be reused at some point → WAR and WAW stalls
Lessons thus Far
• More out-of-orderness → more ILP exposed, but also more hazards
• Stalling is a generic technique to ensure correct sequencing
• The RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)