
CSE 502: Computer Architecture



Presentation Transcript


  1. CSE 502: Computer Architecture • Core Pipelining

  2. Before there was pipelining… • Single-cycle control: hardwired • Low CPI (1) • Long clock period (to accommodate slowest instruction) • Multi-cycle control: micro-programmed • Short clock period • High CPI • Can we have both low CPI and short clock period? [Timeline: Single-cycle: insn0.(fetch,decode,exec), then insn1.(fetch,decode,exec). Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec.]

  3. Pipelining • Start with multi-cycle design • When insn0 goes from stage 1 to stage 2… insn1 starts stage 1 • Each instruction passes through all stages… but instructions enter and leave at faster rate • Can have as many insns in flight as there are stages [Timeline: Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec. Pipelined: insn0, insn1, and insn2 each run fetch, dec, exec, with each instruction starting one stage after the previous one.]
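The comparison between the three control styles can be sketched numerically. This is a minimal model, assuming no hazards and no pipeline-register overhead; the function name `total_time` and the unit stage delays are illustrative and not part of the slides.

```python
def total_time(n_insns, stage_delays, scheme):
    """Return the time to run n_insns under one of three control schemes."""
    if scheme == "single-cycle":
        # One long cycle per instruction, sized to fit all stages back to back.
        return n_insns * sum(stage_delays)
    if scheme == "multi-cycle":
        # One short cycle per stage, but instructions do not overlap.
        return n_insns * len(stage_delays) * max(stage_delays)
    if scheme == "pipelined":
        # Fill the pipeline once, then complete one instruction per cycle.
        return (len(stage_delays) + n_insns - 1) * max(stage_delays)
    raise ValueError(f"unknown scheme: {scheme}")


delays = [1, 1, 1]  # fetch, decode, exec (hypothetical unit delays)
for scheme in ("single-cycle", "multi-cycle", "pipelined"):
    print(scheme, total_time(100, delays, scheme))
```

With equal stage delays the single-cycle and multi-cycle totals match, while the pipelined total approaches one stage delay per instruction as the instruction count grows.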

  4. Pipeline Examples [Figure: an address-comparison ("hit?") circuit shown at three pipelining depths, each annotated with its stage delay and bandwidth; the values did not survive extraction.] • Increases throughput at the expense of latency

  5. Processor Pipeline Review [Figure: five stages, Fetch, Decode, Execute, Memory, (Write-back), built from the PC with a +4 incrementer, I-cache, Reg File, ALU, and D-cache.]

  6. Stage 1: Fetch • Fetch an instruction from memory every cycle • Use PC to index memory • Increment PC (assume no branches for now) • Write state to the pipeline register (IF/ID) • The next stage will read this pipeline register

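  7. Stage 1: Fetch Diagram [Figure: a MUX selects between PC+1 and the branch target to update the PC; the PC indexes the Instruction Cache; PC+1 and the instruction bits are written to the IF/ID pipeline register, which feeds Decode. The PC and the pipeline register each have an enable (en) input.]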

  8. Stage 2: Decode • Decodes opcode bits • Set up Control signals for later stages • Read input operands from register file • Specified by decoded instruction bits • Write state to the pipeline register (ID/EX) • Opcode • Register contents • PC+1 (even though decode didn’t use it) • Control signals (from insn) for opcode and destReg

  9. Stage 2: Decode Diagram [Figure: instruction bits from the IF/ID pipeline register select regA and regB in the Register File; PC+1, regA contents, regB contents, and control signals are written to the ID/EX pipeline register, which feeds Execute. The Register File also has destReg, data, and en inputs for writes.]

  10. Stage 3: Execute • Perform ALU operations • Calculate result of instruction • Control signals select operation • Contents of regA used as one input • Either regB or constant offset (from insn) used as second input • Calculate PC-relative branch target • PC+1+(constant offset) • Write state to the pipeline register (EX/Mem) • ALU result, contents of regB, and PC+1+offset • Control signals (from insn) for opcode and destReg

  11. Stage 3: Execute Diagram [Figure: an adder computes PC+1+offset (the branch target) from the ID/EX pipeline register; a MUX selects the second ALU input; the ALU result, regB contents, PC+1+offset, and control signals are written to the EX/Mem pipeline register, which feeds Memory.]

  12. Stage 4: Memory • Perform data cache access • ALU result contains address for LD or ST • Opcode bits control R/W and enable signals • Write state to the pipeline register (Mem/WB) • ALU result and Loaded data • Control signals (from insn) for opcode and destReg

  13. Stage 4: Memory Diagram [Figure: values from the EX/Mem pipeline register drive the Data Cache inputs in_addr, in_data, en, and R/W; the ALU result, loaded data, and control signals are written to the Mem/WB pipeline register, which feeds Write-back. PC+1+offset leaves the stage as the branch target.]

  14. Stage 5: Write-back • Writing result to register file (if required) • Write Loaded data to destReg for LD • Write ALU result to destReg for arithmetic insn • Opcode bits control register write enable signal

  15. Stage 5: Write-back Diagram [Figure: a MUX selects between the ALU result and the loaded data held in the Mem/WB pipeline register; the selected data and destReg go back to the register file under the control signals.]

  16. Putting It All Together [Figure: the complete five-stage datapath, with the PC, Inst Cache, Register file (R0 to R7), ALU, Data Cache, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers carrying op, dest, valA, valB, offset, PC+1, target, eq?, ALU result, and mdata.]

  17. Pipelining Idealism • Uniform Sub-operations • Operation can be partitioned into uniform-latency sub-ops • Repetition of Identical Operations • Same ops performed on many different inputs • Repetition of Independent Operations • All repetitions of op are mutually independent

  18. Pipeline Realism • Uniform Sub-operations … NOT! • Balance pipeline stages • Stage quantization to yield balanced stages • Minimize internal fragmentation (left-over time near end of cycle) • Repetition of Identical Operations … NOT! • Unifying instruction types • Coalescing instruction types into one “multi-function” pipe • Minimize external fragmentation (idle stages to match length) • Repetition of Independent Operations … NOT! • Resolve data and resource hazards • Inter-instruction dependency detection and resolution • Pipelining is expensive

  19. The Generic Instruction Pipeline IF Instruction Fetch ID Instruction Decode OF Operand Fetch EX Instruction Execute WB Write-back

  20. Balancing Pipeline Stages • Stage delays: IF: TIF = 6 units • ID: TID = 2 units • OF: TOF = 9 units • EX: TEX = 5 units • WB: TOS = 9 units • Without pipelining • Tcyc ≥ TIF + TID + TOF + TEX + TOS = 31 • Pipelined • Tcyc ≥ max{TIF, TID, TOF, TEX, TOS} = 9 • Speedup = 31 / 9 ≈ 3.4 • Can we do better?
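The arithmetic on this slide can be reproduced in a few lines. The stage delays are the ones given on the slide; the per-stage idle time (internal fragmentation) is an extra quantity computed here for illustration, and the variable names are my own.

```python
# Stage delays from the slide, in arbitrary time units.
stage_delay = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "WB": 9}

# Without pipelining, one cycle must cover every stage in sequence.
t_unpipelined = sum(stage_delay.values())

# With pipelining, the cycle must cover only the slowest stage.
t_pipelined = max(stage_delay.values())

speedup = t_unpipelined / t_pipelined

# Idle time left over in each stage when the clock is set by the slowest one.
idle = {name: t_pipelined - d for name, d in stage_delay.items()}

print(t_unpipelined, t_pipelined, round(speedup, 2), idle)
```

The speedup falls well short of the stage count (5) because the short ID and EX stages sit idle for most of each 9-unit cycle.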

  21. Balancing Pipeline Stages (1/2) • Two methods for stage quantization • Merge multiple sub-ops into one • Divide sub-ops into smaller pieces • Recent/Current trends • Deeper pipelines (more and more stages) • Multiple different pipelines/sub-pipelines • Pipelining of memory accesses

  22. Balancing Pipeline Stages (2/2) • Coarser-Grained Machine Cycle: 4 machine cyc / instruction • # stages = 4, Tcyc = 9 units • Stages: IF&ID (TIF&ID = 8 units), OF (TOF = 9 units), EX (TEX = 5 units), WB (TOS = 9 units) • Finer-Grained Machine Cycle: 11 machine cyc / instruction • # stages = 11, Tcyc = 3 units • Stages: IF, IF, ID, OF, OF, OF, EX, EX, WB, WB, WB
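Both quantizations can be checked with a short sketch. It assumes each sub-operation is split into ceil(delay / Tcyc) equal-length stages, which reproduces the 11-stage count on the slide; the helper names `split_stages` and `summarize` are hypothetical.

```python
import math


def split_stages(stage_delay, t_cyc):
    """Divide each sub-op into ceil(delay / t_cyc) pipeline stages."""
    return {name: math.ceil(d / t_cyc) for name, d in stage_delay.items()}


def summarize(n_stages, t_cyc, total_work):
    """Per-instruction latency and time wasted to fragmentation."""
    latency = n_stages * t_cyc
    return {"stages": n_stages, "latency": latency, "wasted": latency - total_work}


stage_delay = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "WB": 9}
total_work = sum(stage_delay.values())  # 31 units of real work

# Finer-grained: divide sub-ops into 3-unit pieces.
fine = split_stages(stage_delay, 3)
fine_summary = summarize(sum(fine.values()), 3, total_work)

# Coarser-grained: merge IF and ID into one stage, clock at the slowest stage.
coarse = {"IF&ID": 8, "OF": 9, "EX": 5, "WB": 9}
coarse_summary = summarize(len(coarse), max(coarse.values()), total_work)

print(fine, fine_summary)
print(coarse_summary)
```

Under these assumptions the finer-grained design triples peak throughput (one instruction every 3 units instead of every 9) and also wastes less time per instruction.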

  23. Pipeline Examples • MIPS R2000/R3000 stages: IF, RD, ALU, MEM, WB • AMDAHL 470V/7 stages, grouped by generic stage: IF: PC GEN, Cache Read, Cache Read • ID: Decode • OF: Read REG, Addr GEN, Cache Read, Cache Read • EX: EX 1, EX 2 • WB: Check Result, Write Result

  24. Instruction Dependencies (1/2) • Data Dependence • Read-After-Write (RAW) (only true dependence) • Read must wait until earlier write finishes • Anti-Dependence (WAR) • Write must wait until earlier read finishes (avoid clobbering) • Output Dependence (WAW) • Earlier write can’t overwrite later write • Control Dependence (a.k.a. Procedural Dependence) • Branch condition must execute before branch target • Instructions after branch cannot run before branch
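The three register data dependences can be told apart mechanically by comparing destinations and sources. The sketch below is illustrative only: it represents an instruction as a hypothetical `(dest, sources)` pair and ignores memory and control dependences.

```python
def classify(earlier, later):
    """Return the set of register data dependences from `earlier` to `later`.

    Each instruction is a (dest, srcs) pair: dest is a register name or None,
    srcs is a list of register names. `earlier` precedes `later` in program order.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest is not None and e_dest in l_srcs:
        found.add("RAW")  # later reads what earlier writes (true dependence)
    if l_dest is not None and l_dest in e_srcs:
        found.add("WAR")  # later overwrites a register earlier still reads
    if e_dest is not None and e_dest == l_dest:
        found.add("WAW")  # both write the same register
    return found


add = ("r1", ["r2", "r3"])  # r1 = r2 + r3
mul = ("r4", ["r1", "r5"])  # r4 = r1 * r5
sub = ("r2", ["r6", "r7"])  # r2 = r6 - r7
inc = ("r1", ["r1"])        # r1 = r1 + 1

print(classify(add, mul), classify(add, sub), classify(add, inc))
```

A pair of instructions can carry more than one dependence at once, as the last example shows.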

  25. Instruction Dependencies (2/2) • # for (;(j<high)&&(array[j]<array[low]);++j); • bge j, high, $36 • mul $15, j, 4 • addu $24, array, $15 • lw $25, 0($24) • mul $13, low, 4 • addu $14, array, $13 • lw $15, 0($14) • bge $25, $15, $36 • $35: • addu j, j, 1 • . . . • $36: • addu $11, $11, -1 • . . . • Real code has lots of dependencies

  26. Hardware Dependency Analysis • Processor must handle • Register Data Dependencies (same register) • RAW, WAW, WAR • Memory Data Dependencies (same address) • RAW, WAW, WAR • Control Dependencies

  27. Pipeline Terminology • Pipeline Hazards • Potential violations of program dependencies • Must ensure program dependencies are not violated • Hazard Resolution • Static method: performed at compile time in software • Dynamic method: performed at runtime using hardware • Two options: Stall (costs perf.) or Forward (costs hw.) • Pipeline Interlock • Hardware mechanism for dynamic hazard resolution • Must detect and enforce dependencies at runtime

  28. Pipeline: Steady State [Timing diagram over cycles t0 to t5: Inst j through Inst j+4 each pass through IF, ID, RD, ALU, MEM, WB, and each instruction enters the pipeline one cycle after its predecessor.]
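A steady-state timing diagram of this kind can be regenerated with a short script. This is a sketch assuming one instruction enters per cycle and nothing stalls; the stage names follow the slide, while `timing_rows` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]


def timing_rows(n_insns, stages=STAGES):
    """One row per instruction; column c holds the stage occupied in cycle c."""
    n_cycles = n_insns + len(stages) - 1
    rows = []
    for i in range(n_insns):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


rows = timing_rows(5)
for i, row in enumerate(rows):
    print(f"Inst j+{i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading any column top to bottom shows which instructions occupy the pipeline in that cycle; once the pipeline is full, every stage is busy.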

  29. Pipeline: Data Hazard [Timing diagram over cycles t0 to t5: the same overlapped pattern as the steady-state diagram (Inst j through Inst j+4 in IF, ID, RD, ALU, MEM, WB), used to show a data hazard between instructions in flight.]

  30. Option 1: Stall on Data Hazard [Timing diagram over cycles t0 to t5: the leading instructions proceed through IF, ID, RD, ALU, MEM, WB; three consecutive following instructions are stalled in RD, in ID, and in IF respectively, then resume in order.]

  31. Option 2: Forwarding Paths (1/3) [Timing diagram over cycles t0 to t5: Inst j through Inst j+4 in IF, ID, RD, ALU, MEM, WB, with forwarding paths drawn between stages.] • Many possible paths • Requires stalling even with forwarding paths (MEM to ALU)

  32. Option 2: Forwarding Paths (2/3) [Figure: datapath with IF, ID, Register File (src1, src2, dest), ALU, MEM, and WB, showing the forwarding paths.]

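  33. Option 2: Forwarding Paths (3/3) [Figure: the same datapath (IF, ID, Register File with src1, src2, dest, then ALU, MEM, WB) with equality comparators (=) added to control the forwarding paths.] • Deeper pipeline may require additional forwarding paths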
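The cost of the two options can be compared with a small stall-count model. It rests on assumptions that are mine, not the slides': the stages are IF, ID, RD, ALU, MEM, WB; without forwarding a consumer may read the register file in RD only in a cycle after the producer's WB; with forwarding, an ALU result is usable the cycle after ALU and a loaded value the cycle after MEM. The function `stall_cycles` is hypothetical.

```python
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]


def stall_cycles(distance, producer_is_load, forwarding):
    """Stall cycles for a consumer `distance` instructions after its producer."""
    if forwarding:
        # Value is forwarded straight to the consumer's ALU stage.
        ready = STAGES.index("MEM" if producer_is_load else "ALU")
        need = STAGES.index("ALU")
    else:
        # Value must reach the register file before the consumer's RD stage.
        ready = STAGES.index("WB")
        need = STAGES.index("RD")
    # The producer (entering at cycle 0) has its value at the end of cycle
    # `ready`; unstalled, the consumer reaches stage `need` in cycle
    # `distance + need`.
    return max(0, ready + 1 - (distance + need))


for d in (1, 2, 3, 4):
    print(d,
          stall_cycles(d, producer_is_load=False, forwarding=False),
          stall_cycles(d, producer_is_load=False, forwarding=True),
          stall_cycles(d, producer_is_load=True, forwarding=True))
```

In this model forwarding removes every stall for back-to-back ALU operations, but a load followed immediately by a dependent instruction still costs one cycle, which is the MEM-to-ALU case noted on the first forwarding slide.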

  34. Pipeline: Control Hazard [Timing diagram over cycles t0 to t5: Inst i through Inst i+4 overlapped in IF, ID, RD, ALU, MEM, WB, used to show a control hazard between instructions in flight.]

  35. Pipeline: Stall on Control Hazard [Timing diagram over cycles t0 to t5: the instruction after the control instruction is stalled in IF until the hazard resolves; it and the following instructions (through Inst i+4) then proceed in order.]

  36. Pipeline: Prediction for Control Hazards [Timing diagram over cycles t0 to t5: instructions after the branch are fetched speculatively; on a misprediction the speculative state is cleared (the speculative instructions become nops), fetch is resteered, and New Inst i+2, New Inst i+3, and New Inst i+4 enter the pipeline.]

  37. Going Beyond Scalar • Scalar pipeline limited to CPI ≥ 1.0 • Can never run more than 1 insn. per cycle • “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0) • Superscalar means executing multiple insns. in parallel

  38. Architectures for Instruction Parallelism • Scalar pipeline (baseline) • Instruction/overlap parallelism = D • Operation Latency = 1 • Peak IPC = 1.0 [Figure: successive instructions plotted against time in cycles (1 to 12); D different instructions overlapped.]

  39. Superscalar Machine • Superscalar (pipelined) Execution • Instruction parallelism = D x N • Operation Latency = 1 • Peak IPC = N [Figure: successive instructions, N per cycle, plotted against time in cycles (1 to 12); D x N different instructions overlapped.]
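The peak figures for the scalar and superscalar cases follow from a simple ideal cycle count. The sketch assumes a D-stage, N-wide pipeline that never stalls; `ideal_cycles` and `ideal_ipc` are hypothetical helpers, and the depth of 4 is an arbitrary example value.

```python
import math


def ideal_cycles(n_insns, depth, width):
    """Cycles to finish n_insns on a depth-stage, width-wide pipeline, no stalls."""
    # The first group takes `depth` cycles; each later group finishes one cycle after.
    return depth + math.ceil(n_insns / width) - 1


def ideal_ipc(n_insns, depth, width):
    return n_insns / ideal_cycles(n_insns, depth, width)


print(ideal_cycles(12, depth=4, width=1), ideal_cycles(12, depth=4, width=3))
print(ideal_ipc(10_000, depth=4, width=1), ideal_ipc(10_000, depth=4, width=3))
```

For long instruction streams the IPC approaches the width N, which is the peak IPC stated on the slide; real programs fall short of it because of the dependencies discussed next.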

  40. Superscalar Example: Pentium • Prefetch: 4× 32-byte buffers • Decode1: decode up to 2 insts • Decode2 (one per pipe): read operands, addr comp • Execute (one per pipe): asymmetric pipes • Both pipes: mov, lea, simple ALU, push/pop, test/cmp • u-pipe only: shift, rotate, some FP • v-pipe only: jmp, jcc, call, fxch • Writeback (one per pipe)

  41. Pentium Hazards & Stalls • “Pairing Rules” (when can’t two insns exec?) • Read/flow dependence • mov eax, 8 • mov [ebp], eax • Output dependence • mov eax, 8 • mov eax, [ebp] • Partial register stalls • mov al, 1 • mov ah, 0 • Function unit rules • Some instructions can never be paired • MUL, DIV, PUSHA, MOVS, some FP

  42. Limitations of In-Order Pipelines • If the machine parallelism is increased • … dependencies reduce performance • CPI of in-order pipelines degrades sharply • As N approaches avg. distance between dependent instructions • Forwarding is no longer effective • Must stall often • In-order pipelines are rarely full

  43. The In-Order N-Instruction Limit • On average, parent-child separation is about ± 5 insn. • (Franklin and Sohi ’92) • Ex. Superscalar degree N = 4 • Dependent insn must be N = 4 instructions away • Any dependency between these instructions will cause a stall • Average of 5 means there are many cases when the separation is < 4… each of these limits parallelism • Reasonable in-order superscalar is effectively N=2

  44. In Search of Parallelism • “Trivial” Parallelism is limited • What is trivial parallelism? • In-order: sequential instructions do not have dependencies • In all previous examples, all instructions executed either at the same time or after earlier instructions • Previous slides show that superscalar execution quickly hits a ceiling • So what is “non-trivial” parallelism? …

  45. What is Parallelism? • Work • T1: time to complete a computation on a sequential system • Critical Path • T∞: time to complete the same computation on an infinitely-parallel system • Average Parallelism • Pavg = T1 / T∞ • For a p-wide system • Tp ≥ max{T1/p, T∞} • Pavg >> p ⇒ Tp ≈ T1/p • Example: x = a + b; y = b * 2; z = (x-y) * (x+y)
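Work, critical path, and average parallelism can be computed for the example on this slide. The sketch assumes every operation takes one time unit and splits z into three operations (u = x - y, v = x + y, z = u * v); the names `u`, `v`, `depth`, and `t_p_lower_bound` are introduced here for illustration.

```python
from functools import lru_cache

# Each operation maps to the operations whose results it consumes.
deps = {
    "x": [],          # x = a + b
    "y": [],          # y = b * 2
    "u": ["x", "y"],  # u = x - y
    "v": ["x", "y"],  # v = x + y
    "z": ["u", "v"],  # z = u * v
}


@lru_cache(maxsize=None)
def depth(op):
    """Length of the longest dependence chain ending at `op` (unit latency)."""
    return 1 + max((depth(d) for d in deps[op]), default=0)


t_1 = len(deps)                          # work: all operations run serially
t_inf = max(depth(op) for op in deps)    # critical path
p_avg = t_1 / t_inf                      # average parallelism


def t_p_lower_bound(p):
    """Lower bound on the run time of a p-wide system."""
    return max(t_1 / p, t_inf)


print(t_1, t_inf, p_avg, t_p_lower_bound(2))
```

Under the unit-latency assumption the example has five operations and a critical path of three, so no machine width can finish it in fewer than three steps.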

  46. ILP: Instruction-Level Parallelism • ILP is a measure of the amount of inter-dependencies between instructions • Average ILP = num instructions / longest path • code1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3 • ILP = 1 (must execute serially) • T1 = 3, T∞ = 3 • code2: r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10 • ILP = 3 (can execute at the same time) • T1 = 3, T∞ = 1
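The two ILP values can be checked by measuring the longest dependence chain in each snippet. The sketch assumes unit latency and tracks only true (RAW) register dependences; the `(dest, srcs)` encoding and the function `average_ilp` are hypothetical.

```python
def average_ilp(insns):
    """Average ILP = number of instructions / longest RAW dependence path.

    insns: list of (dest, srcs) pairs in program order; unit latency assumed.
    """
    ready_at = {}  # register -> step in which its latest value is produced
    longest = 0
    for dest, srcs in insns:
        # An instruction runs one step after its latest-arriving source.
        step = 1 + max((ready_at.get(s, 0) for s in srcs), default=0)
        ready_at[dest] = step
        longest = max(longest, step)
    return len(insns) / longest


code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1), average_ilp(code2))
```

Registers that no earlier instruction in the snippet writes are treated as available from the start, so they do not lengthen the path.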

  47. ILP != IPC • Instruction level parallelism usually assumes infinite resources, perfect fetch, and unit-latency for all instructions • ILP is more a property of the program dataflow • IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine • The ILP of a program is an upper-bound on the attainable IPC

  48. Scope of ILP Analysis • r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3 (ILP=1) • r11 ← r12 + 1; r13 ← r19 / 17; r14 ← r0 - r20 (ILP=3) • Both blocks analyzed together: ILP=2

  49. DFG Analysis • A: R1 = R2 + R3 • B: R4 = R5 + R6 • C: R1 = R1 * R4 • D: R7 = LD 0[R1] • E: BEQZ R7, +32 • F: R4 = R7 - 3 • G: R1 = R1 + 1 • H: R4 → ST 0[R1] • J: R1 = R1 - 1 • K: R3 → ST 0[R1]

  50. In-Order Issue, Out-of-Order Completion [Figure: an in-order instruction stream issues to the execution units INT, Fadd (Fadd1, Fadd2), Fmul (Fmul1, Fmul2, Fmul3), and Ld/St; execution begins in order, completion is out of order.] • Issue stage needs to check: 1. Structural Dependence 2. RAW Hazard 3. WAW Hazard 4. WAR Hazard • Issue = send an instruction to execution
