CSE 502: Computer Architecture
Core Pipelining
Before there was pipelining…
• Single-cycle control: hardwired
  • Low CPI (1)
  • Long clock period (to accommodate slowest instruction)
• Multi-cycle control: micro-programmed
  • Short clock period
  • High CPI
• Can we have both low CPI and short clock period?
[Figure: timeline. Single-cycle: insn0.(fetch,decode,exec), then insn1.(fetch,decode,exec). Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, insn1.fetch, insn1.dec, insn1.exec.]
Pipelining
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2… insn1 starts stage 1
• Each instruction passes through all stages… but instructions enter and leave at a faster rate
• Can have as many insns in flight as there are stages
[Figure: timeline. Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec. Pipelined: insn0, insn1, and insn2 each run fetch, dec, exec, with each instruction starting one stage after the previous one.]
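The trade-off between the three designs can be sketched numerically. This is a minimal model, not taken from the slides: it assumes no hazards, no pipeline-register overhead, and a multi-cycle and pipelined clock equal to the slowest stage. The function names and the three unit-delay stages are hypothetical.

```python
def single_cycle_time(n_insns, stage_delays):
    # One long cycle per instruction: the clock must cover every stage.
    return n_insns * sum(stage_delays)


def multi_cycle_time(n_insns, stage_delays):
    # Short clock (slowest stage), but one cycle per stage per instruction.
    return n_insns * len(stage_delays) * max(stage_delays)


def pipelined_time(n_insns, stage_delays):
    # Fill the pipeline once, then finish one instruction every cycle.
    return (len(stage_delays) + n_insns - 1) * max(stage_delays)


delays = [1, 1, 1]  # fetch, decode, exec (assumed unit delays)
for model in (single_cycle_time, multi_cycle_time, pipelined_time):
    print(model.__name__, model(100, delays))
```

With 100 instructions, the first two models take 300 time units and the pipelined model takes 102, which is the "low CPI and short clock period" combination the previous slide asks for.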
Pipeline Examples
[Figure: an address-compare ("hit?") circuit shown at several pipelining depths, each annotated with its stage delay and bandwidth; the values did not survive extraction.]
• Increases throughput at the expense of latency
Processor Pipeline Review
[Figure: five-stage datapath. Fetch (PC, +4, I-cache), Decode (Reg File), Execute (ALU), Memory (D-cache), Write-back.]
Stage 1: Fetch
• Fetch an instruction from memory every cycle
  • Use PC to index memory
  • Increment PC (assume no branches for now)
• Write state to the pipeline register (IF/ID)
  • The next stage will read this pipeline register
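Stage 1: Fetch Diagram
[Figure: a MUX selects between PC+1 and the branch target to update the PC; the PC indexes the Instruction Cache; the instruction bits and PC+1 are latched into the IF/ID pipeline register (with enable signals), which feeds Decode.]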
Stage 2: Decode
• Decode opcode bits
  • Set up control signals for later stages
• Read input operands from register file
  • Specified by decoded instruction bits
• Write state to the pipeline register (ID/EX)
  • Opcode
  • Register contents
  • PC+1 (even though decode didn’t use it)
  • Control signals (from insn) for opcode and destReg
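Stage 2: Decode Diagram
[Figure: instruction bits from the IF/ID pipeline register select regA and regB in the Register File; the regA contents, regB contents, PC+1, and control signals are latched into the ID/EX pipeline register, which feeds Execute. The register file's write port (destReg, data, en) is driven from later stages.]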
Stage 3: Execute
• Perform ALU operations
  • Calculate result of instruction
    • Control signals select operation
    • Contents of regA used as one input
    • Either regB or constant offset (from insn) used as second input
  • Calculate PC-relative branch target
    • PC+1+(constant offset)
• Write state to the pipeline register (EX/Mem)
  • ALU result, contents of regB, and PC+1+offset
  • Control signals (from insn) for opcode and destReg
Stage 3: Execute Diagram
[Figure: from the ID/EX pipeline register, the regA contents and a MUX output (regB contents or offset) feed the ALU; an adder computes PC+1+offset; the ALU result, regB contents, branch target, and control signals are latched into the EX/Mem pipeline register, which feeds Memory.]
Stage 4: Memory
• Perform data cache access
  • ALU result contains address for LD or ST
  • Opcode bits control R/W and enable signals
• Write state to the pipeline register (Mem/WB)
  • ALU result and loaded data
  • Control signals (from insn) for opcode and destReg
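Stage 4: Memory Diagram
[Figure: from the EX/Mem pipeline register, the ALU result drives the Data Cache address (in_addr) and the regB contents drive in_data; control signals drive en and R/W; the ALU result, loaded data, and control signals are latched into the Mem/WB pipeline register, which feeds Write-back. PC+1+offset is sent back to Fetch as the branch target.]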
Stage 5: Write-back
• Write result to register file (if required)
  • Write loaded data to destReg for LD
  • Write ALU result to destReg for arithmetic insn
  • Opcode bits control register write enable signal
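Stage 5: Write-back Diagram
[Figure: from the Mem/WB pipeline register, one MUX selects between the ALU result and the loaded data as the register-file write data, and another MUX selects destReg; control signals drive both.]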
Putting It All Together
[Figure: the complete five-stage datapath. The PC and Inst Cache feed IF/ID; the register file (R0 to R7) supplies valA and valB to ID/EX; the ALU, the eq? comparator, and the PC+1+offset adder feed EX/Mem; the Data Cache supplies mdata to Mem/WB; MUXes select the next PC, the second ALU input, and the write-back data; op and dest travel through every pipeline register.]
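What each pipeline register carries can be sketched by pushing one instruction through the five stages in order. This is a functional sketch only, with no overlap between instructions. The three-instruction mini-ISA (`add`, `lw`, `sw`), the `Insn` fields, and `run_one` are hypothetical names chosen for illustration, and memory is modeled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    op: str          # "add", "lw", or "sw" (hypothetical mini-ISA)
    reg_a: int
    reg_b: int
    dest: int = 0
    offset: int = 0


def run_one(insn, pc, regs, dmem):
    # Stage 1 (Fetch): latch the instruction bits and PC+1 into IF/ID.
    if_id = {"insn": insn, "pc_plus_1": pc + 1}

    # Stage 2 (Decode): read operands; pass PC+1 and control info to ID/EX.
    i = if_id["insn"]
    id_ex = {"op": i.op, "dest": i.dest, "offset": i.offset,
             "val_a": regs[i.reg_a], "val_b": regs[i.reg_b],
             "pc_plus_1": if_id["pc_plus_1"]}

    # Stage 3 (Execute): second ALU input is regB for add, offset for lw/sw.
    second = id_ex["val_b"] if id_ex["op"] == "add" else id_ex["offset"]
    ex_mem = {"op": id_ex["op"], "dest": id_ex["dest"],
              "alu_result": id_ex["val_a"] + second,
              "val_b": id_ex["val_b"],
              "target": id_ex["pc_plus_1"] + id_ex["offset"]}

    # Stage 4 (Memory): the ALU result is the address for lw/sw.
    loaded = None
    if ex_mem["op"] == "lw":
        loaded = dmem[ex_mem["alu_result"]]
    elif ex_mem["op"] == "sw":
        dmem[ex_mem["alu_result"]] = ex_mem["val_b"]
    mem_wb = {"op": ex_mem["op"], "dest": ex_mem["dest"],
              "alu_result": ex_mem["alu_result"], "loaded": loaded}

    # Stage 5 (Write-back): loaded data for lw, ALU result for add.
    if mem_wb["op"] == "lw":
        regs[mem_wb["dest"]] = mem_wb["loaded"]
    elif mem_wb["op"] == "add":
        regs[mem_wb["dest"]] = mem_wb["alu_result"]
    return if_id["pc_plus_1"]


regs = [0, 5, 7, 0, 0, 0, 0, 0]
dmem = {9: 42}
run_one(Insn("add", 1, 2, dest=3), 0, regs, dmem)           # regs[3] = 5 + 7
run_one(Insn("lw", 1, 0, dest=4, offset=4), 1, regs, dmem)  # regs[4] = dmem[9]
```

In a real pipeline the five stages work on five different instructions in the same cycle; the dictionaries here only show which values have to be carried forward for a later stage to do its job.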
Pipelining Idealism
• Uniform Sub-operations
  • Operation can be partitioned into uniform-latency sub-ops
• Repetition of Identical Operations
  • Same ops performed on many different inputs
• Repetition of Independent Operations
  • All repetitions of op are mutually independent
Pipeline Realism
• Uniform Sub-operations … NOT!
  • Balance pipeline stages
    • Stage quantization to yield balanced stages
    • Minimize internal fragmentation (left-over time near end of cycle)
• Repetition of Identical Operations … NOT!
  • Unifying instruction types
    • Coalescing instruction types into one “multi-function” pipe
    • Minimize external fragmentation (idle stages to match length)
• Repetition of Independent Operations … NOT!
  • Resolve data and resource hazards
    • Inter-instruction dependency detection and resolution
• Pipelining is expensive
The Generic Instruction Pipeline
• IF: Instruction Fetch
• ID: Instruction Decode
• OF: Operand Fetch
• EX: Instruction Execute
• WB: Write-back
Balancing Pipeline Stages
Stage delays: IF (TIF) = 6 units, ID (TID) = 2 units, OF (TOF) = 9 units, EX (TEX) = 5 units, WB (TOS) = 9 units
• Without pipelining
  • Tcyc ≈ TIF + TID + TOF + TEX + TOS = 31
• Pipelined
  • Tcyc ≈ max{TIF, TID, TOF, TEX, TOS} = 9
  • Speedup = 31 / 9 ≈ 3.4
• Can we do better?
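The arithmetic on this slide can be checked with a few lines. This is a sketch that uses the slide's stage delays and ignores pipeline-register overhead; the dictionary and function names are assumptions.

```python
STAGE_DELAYS = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}


def unpipelined_cycle(delays):
    # One instruction must pass through every stage within a single cycle.
    return sum(delays.values())


def pipelined_cycle(delays):
    # The clock is set by the slowest stage.
    return max(delays.values())


speedup = unpipelined_cycle(STAGE_DELAYS) / pipelined_cycle(STAGE_DELAYS)
print(unpipelined_cycle(STAGE_DELAYS), pipelined_cycle(STAGE_DELAYS), round(speedup, 2))
```

Five perfectly balanced stages would give a speedup of 5; the 2-unit ID stage sitting next to 9-unit stages is why this pipeline only reaches about 3.4.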
Balancing Pipeline Stages (1/2)
• Two methods for stage quantization
  • Merge multiple sub-ops into one
  • Divide sub-ops into smaller pieces
• Recent/Current trends
  • Deeper pipelines (more and more stages)
  • Multiple different pipelines/sub-pipelines
  • Pipelining of memory accesses
Balancing Pipeline Stages (2/2)
• Coarser-grained machine cycle: 4 machine cyc / instruction
  • Stages: IF&ID (TIF&ID = 8 units), OF (TOF = 9 units), EX (TEX = 5 units), WB (TOS = 9 units)
  • # stages = 4, Tcyc = 9 units
• Finer-grained machine cycle: 11 machine cyc / instruction
  • IF, ID, OF, EX, and WB are each split into pieces of at most 3 units
  • # stages = 11, Tcyc = 3 units
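Both quantizations can be reproduced by rounding each sub-operation up to a whole number of cycles. This is a sketch under the assumption that a sub-op of delay d occupies ceil(d / Tcyc) stages; the `quantize` helper is hypothetical.

```python
import math


def quantize(delays, t_cyc):
    """Return (number of stages, total idle time per instruction)."""
    stages = sum(math.ceil(d / t_cyc) for d in delays)
    wasted = stages * t_cyc - sum(delays)  # internal fragmentation
    return stages, wasted


coarse = quantize([8, 9, 5, 9], t_cyc=9)   # IF&ID merged into one 8-unit stage
fine = quantize([6, 2, 9, 5, 9], t_cyc=3)  # every sub-op split into 3-unit pieces
print(coarse, fine)
```

The finer-grained design has more stages, a clock three times faster, and less idle time per instruction (2 units instead of 5), at the cost of more pipeline registers and more instructions in flight.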
Pipeline Examples
• MIPS R2000/R3000: IF, RD, ALU, MEM, WB
• AMDAHL 470V/7: PC GEN, Cache Read, Cache Read, Decode, Read REG, Addr GEN, Cache Read, Cache Read, EX 1, EX 2, Check Result, Write Result
[Figure: both pipelines drawn alongside the generic IF / ID / OF / EX / WB stages.]
Instruction Dependencies (1/2)
• Data Dependence
  • Read-After-Write (RAW) (only true dependence)
    • Read must wait until earlier write finishes
  • Anti-Dependence (WAR)
    • Write must wait until earlier read finishes (avoid clobbering)
  • Output Dependence (WAW)
    • Earlier write can’t overwrite later write
• Control Dependence (a.k.a. Procedural Dependence)
  • Branch condition must execute before branch target
  • Instructions after branch cannot run before branch
Instruction Dependencies (2/2)
# for (; (j < high) && (array[j] < array[low]); ++j);
        bge   j, high, $36
        mul   $15, j, 4
        addu  $24, array, $15
        lw    $25, 0($24)
        mul   $13, low, 4
        addu  $14, array, $13
        lw    $15, 0($14)
        bge   $25, $15, $36
$35:
        addu  j, j, 1
        . . .
$36:
        addu  $11, $11, -1
        . . .
• Real code has lots of dependencies
Hardware Dependency Analysis
• Processor must handle
  • Register Data Dependencies (same register)
    • RAW, WAW, WAR
  • Memory Data Dependencies (same address)
    • RAW, WAW, WAR
  • Control Dependencies
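The register checks listed above can be sketched as a pairwise comparison of destination and source registers. This is an illustration only: the `(dest, sources)` tuple encoding and the `classify` function are assumptions, and memory and control dependencies are not modeled.

```python
def classify(earlier, later):
    """Each instruction is (dest_reg, set_of_source_regs); dest_reg may be None."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    deps = set()
    if e_dest is not None and e_dest in l_srcs:
        deps.add("RAW")  # later reads what earlier writes
    if l_dest is not None and l_dest in e_srcs:
        deps.add("WAR")  # later overwrites what earlier reads
    if e_dest is not None and e_dest == l_dest:
        deps.add("WAW")  # both write the same register
    return deps


# Three instructions from the loop example in this deck.
mul_15 = ("$15", {"j"})              # mul  $15, j, 4
addu_24 = ("$24", {"array", "$15"})  # addu $24, array, $15
lw_15 = ("$15", {"$14"})             # lw   $15, 0($14)

print(classify(mul_15, addu_24))  # {'RAW'}
print(classify(addu_24, lw_15))   # {'WAR'}
print(classify(mul_15, lw_15))    # {'WAW'}
```

Hardware performs the same comparisons with register-number comparators between the instructions currently in flight.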
Pipeline Terminology
• Pipeline Hazards
  • Potential violations of program dependencies
  • Must ensure program dependencies are not violated
• Hazard Resolution
  • Static method: performed at compile time in software
  • Dynamic method: performed at runtime using hardware
  • Two options: Stall (costs perf.) or Forward (costs hw.)
• Pipeline Interlock
  • Hardware mechanism for dynamic hazard resolution
  • Must detect and enforce dependencies at runtime
Pipeline: Steady State
[Figure: pipeline timing diagram over cycles t0 to t5. Instj through Instj+4 each pass through IF, ID, RD, ALU, MEM, WB, each starting one cycle after the previous instruction.]
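The staircase in the steady-state diagram can be generated with a short loop. This is a sketch assuming one instruction enters per cycle and nothing stalls; `occupancy` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]


def occupancy(n_insns):
    """Return one dict per cycle mapping stage name -> instruction index (or None)."""
    n_cycles = n_insns + len(STAGES) - 1
    table = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            insn = cycle - depth  # the instruction that entered `depth` cycles ago
            row[stage] = insn if 0 <= insn < n_insns else None
        table.append(row)
    return table


for cycle, row in enumerate(occupancy(5)):
    print(f"t{cycle}", row)
```

Once the pipeline has filled, every stage holds a different instruction in every cycle, and one instruction completes per cycle.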
Pipeline: Data Hazard
[Figure: the same timing diagram for Instj through Instj+4 (IF, ID, RD, ALU, MEM, WB); the dependence arrows that marked the hazard did not survive extraction.]
Option 1: Stall on Data Hazard
[Figure: timing diagram over cycles t0 to t5 for Instj through Instj+4. The dependent instruction is stalled in RD until the hazard clears; the instruction behind it is stalled in ID and the next one in IF, after which all of them continue through RD, ALU, MEM, WB.]
Option 2: Forwarding Paths (1/3)
• Many possible paths
• MEM to ALU: requires stalling even with forwarding paths
[Figure: timing diagram over cycles t0 to t5 for Instj through Instj+4 (IF, ID, RD, ALU, MEM, WB) with forwarding arrows between stages of neighboring instructions.]
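Option 2: Forwarding Paths (2/3)
[Figure: IF and ID feed the Register File (src1, src2, dest), followed by the ALU, MEM, and WB stages, with forwarding paths from later stages back to the ALU inputs.]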
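Option 2: Forwarding Paths (3/3)
• Deeper pipeline may require additional forwarding paths
[Figure: the same datapath (IF, ID, Register File with src1, src2, dest, then ALU, MEM, WB) with "=" comparators matching src1 and src2 against the dest of each later stage to select a forwarding path.]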
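The cost of stalling versus forwarding can be estimated from where a value is produced and where it is consumed. This is a simplified sketch: it assumes a value becomes usable in the cycle after the producer finishes `produce_stage`, and the function and parameter names are hypothetical.

```python
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]


def stall_cycles(distance, produce_stage="WB", consume_stage="RD"):
    """Bubbles needed before a consumer `distance` instructions behind its producer.

    The consumer may enter consume_stage only in a cycle after the producer
    has completed produce_stage.
    """
    gap = STAGES.index(produce_stage) - STAGES.index(consume_stage)
    return max(0, gap + 1 - distance)


# No forwarding: values are written in WB and read in RD.
print(stall_cycles(1))                # back-to-back dependence: 3 bubbles
print(stall_cycles(4))                # far enough apart: 0 bubbles
# Forwarding the ALU output straight to the next instruction's ALU input.
print(stall_cycles(1, "ALU", "ALU"))  # 0 bubbles
# A load delivers its data at the end of MEM, so its consumer still waits.
print(stall_cycles(1, "MEM", "ALU"))  # 1 bubble
```

The last case is the MEM to ALU path noted on the first forwarding slide: forwarding removes most stalls, but a load followed immediately by a dependent ALU instruction still costs a bubble in this model.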
Pipeline: Control Hazard
[Figure: timing diagram over cycles t0 to t5 for Insti through Insti+4 (IF, ID, RD, ALU, MEM, WB); the annotation marking the control dependence did not survive extraction.]
Pipeline: Stall on Control Hazard
[Figure: timing diagram over cycles t0 to t5. After the branch, the following instruction is stalled in IF until the branch resolves; Insti+2 through Insti+4 then enter the pipeline one cycle apart.]
Pipeline: Prediction for Control Hazards
• Speculative state cleared
• Fetch resteered
[Figure: timing diagram over cycles t0 to t5. Insti+2, Insti+3, and Insti+4 are fetched speculatively; when the prediction turns out wrong, their remaining stages are replaced with nops and New Insti+2, New Insti+3, and New Insti+4 are fetched from the correct path.]
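The performance difference between stalling on every branch and predicting can be sketched with a simple CPI model. The model assumes a base CPI of 1 and a fixed penalty per stalled or mispredicted branch; the branch fraction, penalty, and misprediction rate below are made-up inputs, not figures from the slides.

```python
def cpi_with_branches(branch_fraction, penalty_cycles, mispredict_rate=1.0):
    """Base CPI of 1 plus the average number of wasted cycles per instruction.

    mispredict_rate=1.0 models stalling on every branch.
    """
    return 1.0 + branch_fraction * mispredict_rate * penalty_cycles


always_stall = cpi_with_branches(0.2, 3)  # every branch costs 3 cycles
with_predictor = cpi_with_branches(0.2, 3, mispredict_rate=0.1)
print(always_stall, with_predictor)
```

With these inputs, prediction brings the CPI from about 1.6 down to about 1.06, because the flush penalty is paid only when the predictor is wrong.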
Going Beyond Scalar
• Scalar pipeline limited to CPI ≥ 1.0
  • Can never run more than 1 insn. per cycle
• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
  • Superscalar means executing multiple insns. in parallel
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
  • Instruction/overlap parallelism = D
  • Operation Latency = 1
  • Peak IPC = 1.0
[Figure: successive instructions versus time in cycles (1 to 12); D different instructions are overlapped at any time.]
Superscalar Machine
• Superscalar (pipelined) Execution
  • Instruction parallelism = D x N
  • Operation Latency = 1
  • Peak IPC = N
[Figure: successive instructions versus time in cycles (1 to 12); N instructions start each cycle, so D x N different instructions are overlapped.]
Superscalar Example: Pentium
• Prefetch: 4× 32-byte buffers
• Decode1: decode up to 2 insts
• Decode2 (one per pipe): read operands, addr comp
• Execute (one per pipe): asymmetric pipes
  • both pipes: mov, lea, simple ALU, push/pop, test/cmp
  • u-pipe only: shift, rotate, some FP
  • v-pipe only: jmp, jcc, call, fxch
• Writeback (one per pipe)
Pentium Hazards & Stalls
• “Pairing Rules” (when can’t two insns execute together?)
  • Read/flow dependence
      mov eax, 8
      mov [ebp], eax
  • Output dependence
      mov eax, 8
      mov eax, [ebp]
  • Partial register stalls
      mov al, 1
      mov ah, 0
  • Function unit rules
    • Some instructions can never be paired
    • MUL, DIV, PUSHA, MOVS, some FP
Limitations of In-Order Pipelines
• If the machine parallelism is increased
  • … dependencies reduce performance
  • CPI of in-order pipelines degrades sharply
    • As N approaches avg. distance between dependent instructions
  • Forwarding is no longer effective
    • Must stall often
• In-order pipelines are rarely full
The In-Order N-Instruction Limit
• On average, parent-child separation is about ± 5 insn.
  • (Franklin and Sohi ’92)
• Ex. superscalar degree N = 4
  • Dependent insn must be N = 4 instructions away
  • Any dependency between the instructions in one issue group will cause a stall
  • Average of 5 means there are many cases when the separation is < 4… each of these limits parallelism
• Reasonable in-order superscalar is effectively N=2
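The limit can be illustrated with a toy in-order issue model. It assumes unit latency, full forwarding (a consumer can issue one cycle after its producer), at most one producer per instruction, and no other hazards; the `in_order_ipc` function and the dependence pattern below are hypothetical.

```python
def in_order_ipc(producers, width):
    """producers[i] is the index of the instruction that insn i depends on, or None."""
    issue_cycle = []
    cycle, slots_used = 0, 0
    for p in producers:
        earliest = issue_cycle[p] + 1 if p is not None else 0
        if slots_used == width or earliest > cycle:
            # Start a new issue group; in-order issue never reorders past a stall.
            cycle = max(cycle + 1, earliest)
            slots_used = 0
        issue_cycle.append(cycle)
        slots_used += 1
    return len(producers) / (issue_cycle[-1] + 1)


# Every instruction depends on the one two positions earlier.
chain = [None, None, 0, 1, 2, 3, 4, 5]
print(in_order_ipc(chain, width=2))  # 2.0
print(in_order_ipc(chain, width=4))  # 2.0, the extra width goes unused
print(in_order_ipc([None] * 8, width=4))  # 4.0 with no dependencies
```

When dependent instructions sit closer together than the issue width, widening the machine adds hardware without adding throughput, which is the slide's argument for an effective N of about 2.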
In Search of Parallelism
• “Trivial” parallelism is limited
  • What is trivial parallelism?
    • In-order: sequential instructions do not have dependencies
    • In all previous examples, all instructions executed either at the same time or after earlier instructions
  • Previous slides show that superscalar execution quickly hits a ceiling
• So what is “non-trivial” parallelism? …
What is Parallelism?
• Work
  • T1: time to complete a computation on a sequential system
• Critical Path
  • T∞: time to complete the same computation on an infinitely-parallel system
• Average Parallelism
  • Pavg = T1 / T∞
• For a p-wide system
  • Tp ≥ max{T1 / p, T∞}
  • Pavg >> p ⇒ Tp ≈ T1 / p
Example:
    x = a + b;
    y = b * 2
    z = (x - y) * (x + y)
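The bound for a p-wide system can be evaluated for the three-statement example on this slide. This is a sketch assuming every arithmetic operation takes one time unit, which gives five operations in total and a longest dependent chain of three; the function name is hypothetical.

```python
def parallelism_bounds(work, critical_path, p):
    """Return (average parallelism, lower bound on T_p) for a p-wide machine."""
    p_avg = work / critical_path
    t_p_lower = max(work / p, critical_path)
    return p_avg, t_p_lower


# x = a + b; y = b * 2; z = (x - y) * (x + y)
# Operations: add, mul, sub, add, mul. The chain x -> (x - y) -> z has length 3.
WORK, CRITICAL_PATH = 5, 3

for p in (1, 2, 4):
    print(p, parallelism_bounds(WORK, CRITICAL_PATH, p))
```

Going from p = 2 to p = 4 does not lower the bound, because it is already pinned at the critical path of 3.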
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = num instructions / longest path
• code1: ILP = 1 (must execute serially)
  • T1 = 3, T∞ = 3
      r1 ← r2 + 1
      r3 ← r1 / 17
      r4 ← r0 - r3
• code2: ILP = 3 (can execute at the same time)
  • T1 = 3, T∞ = 1
      r1 ← r2 + 1
      r3 ← r9 / 17
      r4 ← r0 - r10
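Average ILP can be computed mechanically from the definition above. This is a sketch that follows only true (RAW) register dependences and assumes unit latency; the `(dest, sources)` encoding and the `ilp` function are assumptions.

```python
def ilp(insns):
    """insns: list of (dest, sources) in program order; only RAW dependences are followed."""
    depth = []        # depth[i] = length of the longest dependence chain ending at insn i
    last_writer = {}  # register -> index of its most recent writer
    for dest, srcs in insns:
        d = 1 + max((depth[last_writer[s]] for s in srcs if s in last_writer), default=0)
        depth.append(d)
        last_writer[dest] = len(depth) - 1
    return len(insns) / max(depth)


code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]
print(ilp(code1))  # 1.0
print(ilp(code2))  # 3.0
```

The longest chain plays the role of the critical path from the previous slide, and the instruction count plays the role of the work.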
ILP != IPC
• Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions
• ILP is more a property of the program dataflow
• IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC
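Scope of ILP Analysis
    r1 ← r2 + 1
    r3 ← r1 / 17
    r4 ← r0 - r3        (this block alone: ILP = 1)
    r11 ← r12 + 1
    r13 ← r19 / 17
    r14 ← r0 - r20      (this block alone: ILP = 3)
• Analyzed together, the six instructions have ILP = 2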
DFG Analysis
• A: R1 = R2 + R3
• B: R4 = R5 + R6
• C: R1 = R1 * R4
• D: R7 = LD 0[R1]
• E: BEQZ R7, +32
• F: R4 = R7 - 3
• G: R1 = R1 + 1
• H: R4 → ST 0[R1]
• J: R1 = R1 - 1
• K: R3 → ST 0[R1]
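The register edges of the data-flow graph for this sequence can be enumerated by tracking the last writer and the recent readers of each register. This is a sketch covering register dependences only: the memory dependences between the load and the stores and the control dependence on E are left out, and the tuple encoding and `dependence_edges` are assumptions.

```python
def dependence_edges(program):
    """program: list of (label, dest_or_None, sources). Returns {(from, to, kind, reg)}."""
    edges = set()
    last_writer = {}          # reg -> label of its most recent writer
    readers_since_write = {}  # reg -> labels that read reg since its last write
    for label, dest, srcs in program:
        for r in srcs:
            if r in last_writer:
                edges.add((last_writer[r], label, "RAW", r))
        for r in srcs:
            readers_since_write.setdefault(r, []).append(label)
        if dest is not None:
            for reader in readers_since_write.get(dest, []):
                if reader != label:
                    edges.add((reader, label, "WAR", dest))
            if dest in last_writer:
                edges.add((last_writer[dest], label, "WAW", dest))
            last_writer[dest] = label
            readers_since_write[dest] = []
    return edges


PROGRAM = [
    ("A", "R1", ["R2", "R3"]),
    ("B", "R4", ["R5", "R6"]),
    ("C", "R1", ["R1", "R4"]),
    ("D", "R7", ["R1"]),
    ("E", None, ["R7"]),        # branch: reads R7, writes no register
    ("F", "R4", ["R7"]),
    ("G", "R1", ["R1"]),
    ("H", None, ["R4", "R1"]),  # store: reads data (R4) and address (R1)
    ("J", "R1", ["R1"]),
    ("K", None, ["R3", "R1"]),
]

for edge in sorted(dependence_edges(PROGRAM)):
    print(edge)
```

Only the RAW edges are true data flow; the WAR and WAW edges on R1 and R4 come from register reuse.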
In-Order Issue, Out-of-Order Completion
• Issue = send an instruction to execution
• Issue stage needs to check:
  1. Structural Dependence
  2. RAW Hazard
  3. WAW Hazard
  4. WAR Hazard
[Figure: an in-order instruction stream begins execution in order and feeds functional-unit pipelines of different lengths (INT; Fadd1, Fadd2; Fmul1, Fmul2, Fmul3; Ld/St), so completion is out of order.]