Pipelining
Automobile Manufacturing
1. Build frame: 60 min
2. Add engine: 50 min
3. Build body: 80 min
4. Paint: 40 min
5. Finish: 45 min
Total: 275 min
• Latency: time from start to finish for one car. 275 minutes per car (smaller is better).
• Throughput: number of finished cars per time unit. 1 car/275 min = 0.218 cars/hour (larger is better).
• Issue: how can we make the process better by adding more workers?
6.1
An Assembly Line
[Figure: timeline of cars 1-4 moving through the five stations. The stages take 60, 50, 80, 40 and 45 min, but every station advances once per 80-minute slot.]
• The first two stages can’t produce faster than one car/80 min, or a backlog will occur at the third stage.
• The last two stages only receive one car/80 min to work on.
• Latency: 400 min/car
• Throughput: 4 cars/640 min (1 car/160 min). Will approach 1 car/80 min as time goes on.
6.1
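The latency and throughput figures on the two automobile slides can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the original deck; the function names are made up for illustration, and it assumes every station advances in lockstep at the pace of the slowest stage.

```python
# Stage times in minutes: frame, engine, body, paint, finish (from the slides).
STAGE_MINUTES = [60, 50, 80, 40, 45]


def unpipelined(n_cars):
    """One car is built start to finish before the next one begins.

    Returns (latency per car, total time for n_cars), both in minutes.
    """
    latency = sum(STAGE_MINUTES)
    return latency, n_cars * latency


def assembly_line(n_cars):
    """All stations work at once; each slot lasts as long as the slowest stage.

    Returns (latency per car, total time for n_cars), both in minutes.
    """
    slot = max(STAGE_MINUTES)
    n_stages = len(STAGE_MINUTES)
    latency = slot * n_stages
    # The first car needs n_stages slots; each later car finishes one slot after.
    total = slot * (n_stages + n_cars - 1)
    return latency, total


for n in (4, 100):
    _, t_seq = unpipelined(n)
    latency, t_line = assembly_line(n)
    print(f"{n} cars: one at a time {t_seq} min, assembly line {t_line} min "
          f"({t_line / n:.1f} min/car, latency {latency} min)")
```

For 4 cars the line needs 640 min (160 min/car); for 100 cars it needs 8320 min (83.2 min/car), which shows the throughput approaching one car per 80 min.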
Applying Assembly Lines to CPUs
• The single-cycle design did everything “at once”
• Can we break the single-cycle design up into stages?
• Issue: the assembly line works well for cars. Will it be as easy to apply the same technique to a CPU?
6.1
Breaking up the Single-Cycle Datapath
[Figure: the single-cycle datapath (PC, instruction memory, registers, sign extend, adders, data memory) divided into the stages from the multi-cycle design.]
• Instr. Fetch, PC = PC+4
• Instr. Decode, Register Fetch
• Execute, Address Calc.
• Memory
• Reg. Write-back
6.2
The Key - Pipeline Registers
[Figure: the same datapath with clocked pipeline registers inserted between the stages: Instr. Fetch, PC = PC+4; Instr. Decode, Register Fetch; Execute, Address Calc.; Memory; Reg. Write-back. The PC+4 value is passed along with the instruction.]
6.2
Example: R-type Instruction
[Figure: an R-type instruction traced through the pipelined datapath.]
• Writes the correct data to the wrong register: by the time the result reaches write-back, the write register number is being supplied by a later instruction that is now in the decode stage.
• In general, arrows that go backwards across pipeline stages may be bad news...
6.2
Correcting the Write Register Problem
[Figure: the pipelined datapath with the Rt:[20-16] and Rd:[15-11] fields carried through the pipeline registers, so that the write register number reaches the register file together with the data being written back.]
6.2
Assembly-line Control Signals
• In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.
[Figure: cars 1-5 on the line, each carrying a tag that lists only the instructions for its remaining stages. Tags range from “F: Standard, E: 135 HP, B: 2-door, P: Green” and “F: Leather, E: 190 HP, B: 4-door, P: Blue” through “F: Cotton, B: 2-door, P: Lavender” and “F: Leather, P: Green” down to just “F: Vinyl” or “F: Leather”.]
• By separating the control signals by stages, only the signals needed for the current stage must be decoded. All signals for later stages must be passed along.
6.1
The Pipelined Control Logic
[Figure: the pipelined datapath with a Control unit that decodes Op:[31-26]. The control signals are grouped by the stage that uses them and carried through the pipeline registers: E (ALUSrc, ALUOp, RegDest), M (Branch, MemRead, MemWrite), W (RegWrite, MemToReg). PCSrc selects the next PC, and ALUcontrol drives the ALU.]
6.3
How’d we do?
• Compared to single-cycle
• 5 stages --> Potentially 5x speedup
• Not likely
• Stages won’t all be equally long
• Pipeline registers will cause some delays
• Latency --> Greater than in single-cycle design
• More complexity, but nicely divided up
Example 1
• Consider executing the following code
    add $3, $4, $5
    and $6, $7, $8
    sub $9, $10, $11
  on
• A single-cycle machine with a cycle time of 200 ns
• A 5-stage pipeline machine with a cycle time of 50 ns
Which one runs faster? What if there were 100 instructions instead of 3?
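One way to approach Example 1 is to count cycles on each machine. The sketch below is an illustration only, with hypothetical function names. It assumes the instructions are independent (true of the three shown, which share no registers), so the pipeline never stalls and completes one instruction per cycle once it is full.

```python
def single_cycle_ns(n_instructions, cycle_ns=200):
    """Each instruction takes exactly one (long) cycle."""
    return n_instructions * cycle_ns


def pipelined_ns(n_instructions, stages=5, cycle_ns=50):
    """The first instruction needs `stages` cycles to drain through the pipe;
    every later instruction completes one cycle after its predecessor.
    Assumes no hazards and therefore no stalls."""
    return (stages + n_instructions - 1) * cycle_ns


for n in (3, 100):
    print(f"{n} instructions: single-cycle {single_cycle_ns(n)} ns, "
          f"pipelined {pipelined_ns(n)} ns")
```

Under these assumptions, 3 instructions take 600 ns versus 350 ns, and 100 instructions take 20000 ns versus 5200 ns. The longer the instruction stream, the closer the ratio gets to the 4x difference in cycle time.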
Analyzing Pipelines
Cycle             1   2   3   4   5   6   7   8   9
ADD $10, $14, $0  IF  RF  EX  M   WB
SUB $12, $13, $2      IF  RF  EX  M   WB
AND $1, $6, $11           IF  RF  EX  M   WB
SW $3, 200($9)                IF  RF  EX  M   WB
OR $9, $13, $7                    IF  RF  EX  M   WB
6.4
Data Hazards
[Figure: pipeline diagram of the five instructions below, each one cycle behind the previous one.]
ADD $13, $14, $0    Writes register $13
SUB $12, $13, $2    Reads wrong $13
AND $1, $6, $13     Reads wrong $13
SW $3, 200($13)     Reads ? $13
OR $9, $13, $7      Reads correct $13
6.4
Preventing Data Hazards
[Figure: pipeline diagram in which three NOPs separate ADD from SUB, so SUB reaches RF only after ADD has completed WB.]
ADD $13, $14, $0
NOP
NOP
NOP
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7
• Insert NOPs into the instruction stream to allow WB to happen before RF.
• Assume we can’t write a register and read the new value in the same cycle.
6.4
Detecting Hazards
[Figure: pipeline diagram of the instructions below, annotated with the register each one writes or reads.]
ADD $13, $14, $0    Write: $13
SUB $12, $13, $2    Read A: $13
AND $1, $6, $13     Read B: $13
SW $3, 200($13)     Read A: $13
OR $9, $13, $7
• Compare write reg # in EX with read reg # in RF
• Compare write reg # in M with read reg # in RF
• Compare write reg # in WB with read reg # in RF
• Check each instruction as it is being decoded (RF-ID stage).
• If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.
6.5
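The three comparisons on this slide can be sketched in software as follows. This is an illustrative model, not the actual hardware: the `Instr` tuple and `hazard_sources` function are hypothetical names, and the sketch additionally assumes that MIPS register $0 is hardwired to zero, so a write to it never creates a hazard.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]      # register number written, or None (e.g. SW)
    srcs: Tuple[int, ...]    # register numbers read in the RF stage


def hazard_sources(decoding, ahead):
    """Return the older instructions that cause a hazard for `decoding`.

    `ahead` holds the instructions currently in EX, M and WB
    (None marks an empty stage or bubble).
    """
    return [older for older in ahead
            if older is not None
            and older.dest not in (None, 0)   # assumption: $0 is never really written
            and older.dest in decoding.srcs]


ADD = Instr("ADD $13, $14, $0", 13, (14, 0))
SUB = Instr("SUB $12, $13, $2", 12, (13, 2))
AND = Instr("AND $1, $6, $13", 1, (6, 13))
SW = Instr("SW $3, 200($13)", None, (3, 13))
OR = Instr("OR $9, $13, $7", 9, (13, 7))

# SUB is being decoded while ADD is in EX: hazard on $13.
print(bool(hazard_sources(SUB, [ADD, None, None])))   # True
# OR is being decoded while SW, AND, SUB occupy EX, M, WB: ADD has already left.
print(bool(hazard_sources(OR, [SW, AND, SUB])))       # False
```

In hardware the same check is a handful of equality comparators on register numbers; the list comprehension plays that role here.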
Stalling with Bubbles
[Figure: pipeline diagram in which SUB repeats its IF stage while the hazard comparisons (=) against ADD are made, then proceeds through RF, EX, M and WB; AND, SW and OR follow behind it.]
ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7
• Stalling:
• Kill the current execution by “neutralizing” all the control signals so that it won’t write any registers.
• Don’t write PC+4 into PC --> Stay at the current instruction and try again.
6.5
Register Forwarding
[Figure: pipeline diagram of the five instructions below with no stalls; ADD’s result is forwarded to the instructions that follow.]
ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $2
• Register $13’s value is computed in the EX stage of the ADD even though it isn’t written into the register until the WB stage.
• --> The pipeline register following the EX stage holds the value of $13 that’s needed in the SUB instruction’s EX stage.
6.6
Unforwardable Loads
[Figure: pipeline diagram in which the AND following the LW is held back for one cycle, delaying the instructions behind it.]
LW $2, 30($2)
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $1
• Loads don’t compute the register to write back until the Memory stage. This is one stage too late for the next instruction.
• ---> We can’t prevent stalls if the instruction following a Load uses the result of the Load.
6.6
Example 2
• Consider executing the following code on a 5-stage pipeline datapath
    add $3, $4, $5
    lw $7, 100($3)
    sub $8, $7, $9
• Identify any potential data dependencies
• How many cycles will it take to execute this code assuming no register forwarding?
• How many cycles will it take to execute this code assuming register forwarding is available?
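A small timing model can be used to check an answer to Example 2. It is a sketch under stated assumptions rather than a full simulator. Stages are IF, RF, EX, M, WB; a register cannot be written and read in the same cycle (the assumption made on the NOP slide); with forwarding, an ALU result is usable in the next instruction's EX stage, and a loaded value one cycle later than that. `Instr` and `total_cycles` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]
    srcs: Tuple[int, ...]
    is_load: bool = False


def total_cycles(program, forwarding):
    """Cycle in which the last instruction finishes WB (first IF is cycle 1)."""
    if not program:
        return 0
    rf = []  # cycle in which each instruction successfully reads its registers
    for i, ins in enumerate(program):
        # In-order issue: at best one cycle behind the previous instruction.
        earliest = rf[-1] + 1 if rf else 2
        for j in range(i):
            producer = program[j]
            if producer.dest is None or producer.dest not in ins.srcs:
                continue
            if not forwarding:
                # Producer's WB is at rf[j] + 3; the consumer must read after it.
                earliest = max(earliest, rf[j] + 4)
            elif producer.is_load:
                # Loaded value exists only after M (rf[j] + 2): one bubble.
                earliest = max(earliest, rf[j] + 2)
            # ALU results with forwarding never delay an in-order consumer.
        rf.append(earliest)
    return rf[-1] + 3  # EX, M, WB follow the register read


example2 = [
    Instr("add $3, $4, $5", 3, (4, 5)),
    Instr("lw $7, 100($3)", 7, (3,), is_load=True),
    Instr("sub $8, $7, $9", 8, (7, 9)),
]
print(total_cycles(example2, forwarding=False))  # 13
print(total_cycles(example2, forwarding=True))   # 8
```

Under these assumptions the code takes 13 cycles without forwarding (three stall cycles for each of the two dependencies) and 8 cycles with forwarding (one stall after the load). A machine that can write and read a register in the same cycle would need fewer stalls in the first case.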
Branch Hazards
[Figure: pipeline diagram in which AND, SW and OR enter the pipeline right behind BEQ, followed by OR or LW depending on the branch outcome.]
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)
• Don’t know the result of the branch until the end of the M stage
• If the branch is taken, we’ve blown it by executing the intervening instructions
6.7
Solution 1: Stall
[Figure: pipeline diagram for the branch-not-taken case; AND repeats its IF stage until the BEQ outcome is known, then AND, SW, OR and ADD proceed.]
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)
• Stalling always solves the problem.
• If we didn’t have so many branches in programs, it would not be a problem.
6.5
Solution 2: Assume not Taken
[Figure: pipeline diagram for the case where the branch is taken: AND, SW and OR enter the pipeline behind BEQ and must be undone; the LW at SKIP is fetched next.]
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)
• If we guess right, we win --> No stall at all
• If we guessed wrong:
1. We have to undo all that we did (fortunately, no writebacks have occurred yet).
2. We still take all the time of a stall.
6.7
Solution 3: Better Prediction
• Predict that the branch goes the same way as the last time
• Works great for loops
• Works great for “special-case” code
• Need to keep track of the information for each branch, though...
• One or two bits will do
• Keep a small table of recently used branches and which way they went
6.7
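The “one or two bits” idea can be sketched as two small predictor classes. This is an illustration with hypothetical names, not a description of any particular CPU. The one-bit version remembers the last outcome; the two-bit version is the common saturating-counter scheme, which needs two wrong guesses in a row before it changes its prediction. For simplicity the table is an unbounded dictionary keyed by branch address, whereas real hardware would use a small fixed-size table.

```python
class OneBitPredictor:
    """Predict that each branch goes the same way as it did last time."""

    def __init__(self):
        self.table = {}          # branch address -> last outcome (True = taken)

    def predict(self, pc):
        return self.table.get(pc, False)   # unseen branches: guess not taken

    def update(self, pc, taken):
        self.table[pc] = taken


class TwoBitPredictor:
    """Saturating counter 0..3 per branch; 2 or 3 means predict taken."""

    def __init__(self):
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc, 1) >= 2  # unseen branches start weakly not taken

    def update(self, pc, taken):
        count = self.table.get(pc, 1)
        self.table[pc] = min(3, count + 1) if taken else max(0, count - 1)


def mispredictions(predictor, pc, outcomes):
    """Count wrong guesses for one branch over a sequence of real outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken 9 times and then not taken, with the whole loop run twice.
loop = ([True] * 9 + [False]) * 2
print(mispredictions(OneBitPredictor(), 0x40, loop))   # 4
print(mispredictions(TwoBitPredictor(), 0x40, loop))   # 3
```

The one-bit predictor misses twice per run of the loop (on entry and on exit). The two-bit predictor misses only on exit once it has warmed up, because a single not-taken outcome does not flip its prediction.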
Solution 4: Delayed Branches
Original code:
    XOR $1, $3, $3
    ADD $2, $3, $4
    SUB $4, $3, $1
    OR $3, $2, $0
    BEQ $10, $11, SKIP
    LW $4, 60($2)
    SKIP: AND $1, $2, $3
• If we had some warning, we could compute the branch ahead of time...
With a delayed branch:
    XOR $1, $3, $3
    Branch-After-Three-EQ $10, $11, SKIP
    ADD $2, $3, $4
    SUB $4, $3, $1
    OR $3, $2, $0
    LW $4, 60($2)
    SKIP: AND $1, $2, $3
• ADD, SUB and OR fill the 3 delay slots. These instructions are always executed. The branch can’t depend on them...
6.7
3-slot Delayed Branch
[Figure: pipeline diagram with no stalls: B3E (Branch-After-Three-EQ) is followed by ADD, SUB and OR in the delay slots, and then by either LW or AND depending on the branch outcome.]
Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)
SKIP: AND $1, $2, $3
6.7
Branch summary
• Two decent solutions:
• Branch prediction
• Requires more hardware
• Used in modern microprocessors
• Delayed branch
• Requires special software manipulation
• Often doesn’t deliver on its promise
• Used often in CPUs 4-10 years ago
Example 3
• Consider executing the following code
    LOOP: add $3, $4, $5
          and $6, $7, $8
          bne $12, $8, LOOP
  on
• A single-cycle machine with a cycle time of 200 ns
• A 5-stage pipeline machine with a cycle time of 50 ns
• Assume the loop executes 10 times
• Assume the loop executes 100 times
• Assume the loop executes 1000 times
Which one runs faster?
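The answer to Example 3 depends on how the pipeline handles the branch, so the sketch below makes its assumptions explicit. The branch outcome is taken to be known at the end of the M stage, as on the Branch Hazards slide, which costs 3 cycles whenever the pipeline has to wait or undo work. The loop body has no data hazards, and the single-cycle machine pays no branch penalty. Function and parameter names are hypothetical.

```python
BODY = 3            # add, and, bne
BRANCH_PENALTY = 3  # cycles lost when the branch outcome must be waited for


def single_cycle_ns(iterations, cycle_ns=200):
    return iterations * BODY * cycle_ns


def pipelined_ns(iterations, policy, stages=5, cycle_ns=50):
    """policy: 'stall' waits on every branch; 'not_taken' guesses fall-through
    and pays the penalty only when the branch is actually taken."""
    if policy == "stall":
        penalized = iterations        # every bne stalls the fetch stage
    elif policy == "not_taken":
        penalized = iterations - 1    # bne is taken on all but the last pass
    else:
        raise ValueError(f"unknown policy: {policy}")
    cycles = (stages - 1) + iterations * BODY + penalized * BRANCH_PENALTY
    return cycles * cycle_ns


for n in (10, 100, 1000):
    print(n, single_cycle_ns(n),
          pipelined_ns(n, "stall"), pipelined_ns(n, "not_taken"))
```

With these numbers the pipelined machine takes 3200 ns for 10 iterations against 6000 ns for the single-cycle machine, and the ratio stays close to 2x for 100 and 1000 iterations. Predicting not taken barely helps here because a loop branch is almost always taken.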
Example 4
• Consider executing the following code on a 5-stage pipeline datapath
               addi $3, $0, 10
    LOOPSTART: lw $5, ARRAY($3)
               addi $5, $5, 1
               sw $5, ARRAY
               addi $3, $3, -1
               bne $3, $0, LOOPSTART
               add $3, $5, $6
               sub $7, $8, $9
               addi $4, $6, 3
• Identify potential data dependencies
• How many cycles will it take to execute this code?
• With nops/stalls
• With branch prediction assuming branch not taken
• With branch prediction based on one previous result