Recap (Pipelining)

Recap(Pipelining)

What is Pipelining? • A way of speeding up execution of tasks • Key idea: overlap execution of multiple taks

Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish. 45 min. 275 min. Latency: Time from start to finish for one car. 275 minutes per car. (smaller is better) Throughput: Number of finished cars per time unit. 1 car/275 min = 0.218 cars/hour (larger is better) Issues: How can we make the process better by adding?

3 4 1 2 2 3 4 1 time 4 1 2 3 3 1 4 2 4 3 2 1 An Assembly line 80 80 60 80 50 80 80 40 45 Last two stages only receive onecar/80 min to work on. Latency: 400 min/car Throughput: 4 cars/640 min (1 car/160 min) Will approach 1 car/80 min as time goes on First two stagescan’t produce faster thanone car/80 min or a backlog will occurat third stage.

1ns 200ps 200ps 200ps 200ps 200ps Pipeline Register Pipelining a Digital System • Key idea: break big computation up into piecesSeparate each piece with a pipeline register

Non-pipelined: 1 operation finishes every 1ns 1ns Pipelined: 1 operation finishes every 200ps 200ps 200ps 200ps 200ps 200ps Pipelining a Digital System • Why do this? Because it's faster for repeated computations

Comments about pipelining • Pipelining increases throughput, but not latency • Answer available every 200ps, BUT • A single computation still takes 1ns • Limitations: • Computations must be divisible into stages of equal sizes • Pipeline registers add overhead

30ns 3ns Comb. Logic R E G Clock Another Example Unpipelined System • One operation must complete before next can begin • Operations spaced 33ns apart Delay = 33ns Throughput = 30MHz Op1 Op2 Op3 ?? Time

10ns 3ns 10ns 3ns 10ns 3ns Comb. Logic R E G Comb. Logic R E G Comb. Logic R E G Clock 3 Stage Pipelining • Space operations 13ns apart • 3 operations occur simultaneously Delay = 39ns Throughput = 77MHz Op1 Op2 Op3 Op4 Time

5ns 3ns 15ns 3ns 10ns 3ns Com. Log. R E G Comb. Logic R E G Comb. Logic R E G Limitation: Nonuniform Pipelining • Throughput limited by slowest stage • Delay determined by clock period * number of stages • Must attempt to balance stages Delay = 18 * 3 = 54 ns Throughput = 55MHz Clock

5ns 3ns 5ns 3ns 5ns 3ns 5ns 3ns 5ns 3ns 5ns 3ns Com. Log. R E G Com. Log. R E G Com. Log. R E G Com. Log. R E G Com. Log. R E G Com. Log. R E G Clock Limitation: Deep Pipelines • Diminishing returns as add more pipeline stages • Register delays become limiting factor • Increased latency • Small throughput gains • More hazards Delay = 48ns, Throughput = 128MHz

MIPSPipelining

MIPS 5-stage pipeline • The MIPS processor needs 5 stages to execute instructions • Pipelining stages: • IF - Instruction Fetch • ID - Instruction Decode • EX - Execute / Address Calculation • MEM - Memory Access (read / write) • WB - Write Back (results into register file) • Not all instructions need all the stages (e.g., add instruction does not need the MEM stage)

Pipeline Registers Basic MIPS Pipelined Processor IF/ID ID/EX EX/MEM MEM/WB

Consider the following instruction sequence: lw $r0, 10($r1) sw $sr3, 20($r4) add $r5, $r6, $r7 sub $r8, $r9, $r10 Pipelined Example - Executing Multiple Instructions

Executing Multiple InstructionsClock Cycle 1 LW

Executing Multiple InstructionsClock Cycle 2 SW LW

Executing Multiple InstructionsClock Cycle 3 ADD SW LW

Executing Multiple InstructionsClock Cycle 4 ADD SW LW SUB

Executing Multiple InstructionsClock Cycle 5 SUB ADD SW LW

Executing Multiple InstructionsClock Cycle 6 SUB ADD SW

Executing Multiple InstructionsClock Cycle 7 ADD SUB

Executing Multiple InstructionsClock Cycle 8 SUB

Alternative View - Multicycle Diagram

Processor Pipelining • There are two ways that pipelining can help: • Reduce the clock cycle time, and keep the same CPI • Reduce the CPI, and keep the same clock cycle time CPU time = Instruction count * CPI * Clock cycle time

Reduce the clock cycle time, and keep the same CPI CPI = 1 Clock = X Hz

Pipeline Registers ADD ADD M U X M E U X X T N D Reduce the clock cycle time, and keep the same CPI CPI = 1 Clock = X*5 Hz 4 PC <<2 Instruction I ADDR RD 32 32 16 5 5 5 Instruction Memory RN1 RN2 WN RD1 Register File ALU WD RD2 ADDR Data RD Memory 16 32 WD

Reduce the CPI, and keep the same cycle time CPI = 5 Clock = X*5 Hz

Pipeline Registers ADD ADD 4 PC <<2 Instruction I ADDR RD 32 32 16 5 5 5 Instruction Memory RN1 RN2 WN RD1 Register File ALU WD RD2 M ADDR U X Data RD Memory M E U X 16 32 X WD T N D Reduce the CPI, and keep the same cycle time CPI = 1 Clock = X*5 Hz

Pipeline performance • Ideally we get a speedup (by reducing clock cycle or reducing the CPI) equal to the number of stages. • In practice, we do not achieve that – but we get close: • Pipelining has additional overhead (e.g., pipeline registers) • Pipeline hazards

Pipeline Hazards • Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle. • Hazards reduce the ideal speedup gained from pipelining (e.g., CPI =1) and are classified into three classes: • Structural hazards • Data hazards • Control hazards

Recap (Pipelining)

Recap (Pipelining)

Presentation Transcript

Pipelining ( Week 8 )

PIPELINING

Pipelining

Intel IA-64

Pipelining

Review: Pipelining

第七章

Computer Architecture Principles Dr. Mike Frank

Computer Architecture

Recap: Summary of Pipelining Basics

Pipelining and Retiming

Chapter One Introduction to Pipelined Processors

Recap (Pipelining)

PIPELINING AND VECTOR PROCESSING

Pipelining

Chapter One Introduction to Pipelined Processors

Computer Architecture

Lecture 7: Pipelining Review