COMP541 Pipelined MIPS Montek Singh Nov {20, 27}, 2017
Topics • Today’s topic: Pipelining • Can think of it as: • A way to parallelize, or • A way to make better utilization of the hardware. • Goal: Try to use all hardware every clock cycle • Reading • Section 7.5 of textbook
Parallelism • Two types of parallelism: • Spatial parallelism • duplicate hardware performs multiple tasks at once • Temporal parallelism • overlapping execution of successive instructions • task is broken into multiple stages • each stage operating on different parts of distinct instructions • also called pipelining • example: an assembly line
Parallelism Definitions • Some definitions: • Token: A group of inputs processed together to produce a group of outputs • a “bundle” • an instruction is a token • Latency: Time for one token to pass from start to end • Throughput: Number of tokens processed per unit time • Parallelism increases throughput • Often sacrificing latency • Why?
Parallelism Example: Baking Cookies! • Ben is baking cookies • 2-part task: • It takes 5 minutes to roll the cookies… • … and 15 minutes to bake them • After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben does NOT use parallelism? Latency = 5 + 15 = 20 min = 1/3 hour Throughput = 1 tray / 20 min = 3 trays/hour
Parallelism Example: Baking Cookies! • What is the latency and throughput if Ben uses parallelism? • Spatial parallelism: Ben asks Allysa to help, using two ovens/kitchens instead of one. • Temporal parallelism: Ben breaks the task into two stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.
Baking Cookies: Spatial Parallelism Latency = ? Throughput = ?
Baking Cookies: Spatial Parallelism Latency = 5 + 15 = 20 min = 1/3 hour (same) Throughput = 2 trays / 20 min = 6 trays/hour (doubled)
Baking Cookies: Temporal Parallelism Latency = ? Throughput = ? Note: This scenario assumes each step is performed as soon as a resource is available. Latency will be lower if rolling is started as late as possible.
Baking Cookies: Temporal Parallelism Latency = 5 + 15 = 20 min = 1/3 hour Throughput = 1 tray / 15 min = 4 trays/hour Cost? An extra tray Using both types of parallelism together, the throughput would be 8 trays/hour
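A quick Python check of the cookie arithmetic (a sketch; nothing here beyond the numbers on the slides above):

    ROLL_MIN, BAKE_MIN = 5, 15               # minutes per stage, from the slides
    latency = ROLL_MIN + BAKE_MIN            # 20 min to finish one tray
    print(60 / latency)                      # 3.0 trays/hour: no parallelism
    print(2 * 60 / latency)                  # 6.0 trays/hour: spatial (two kitchens)
    print(60 / max(ROLL_MIN, BAKE_MIN))      # 4.0 trays/hour: temporal (pipelined)
    print(2 * 60 / max(ROLL_MIN, BAKE_MIN))  # 8.0 trays/hour: both combined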
Another Example: Laundry • Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • Folding takes 20 minutes
Sequential Laundry • Sequential laundry takes 6 hours for 4 loads • With pipelining, how long would laundry take? [Timeline figure: loads A-D run back-to-back from 6 PM to midnight, each taking 30 + 40 + 20 minutes]
Pipelined Laundry • Pipelined laundry takes ?? hours for 4 loads [Timeline figure: loads A-D overlapped, starting at 6 PM]
Pipelined Laundry • Pipelined laundry takes 3.5 hours for 4 loads [Timeline figure: 30 + 40 + 40 + 40 + 40 + 20 minutes, 6 PM to 9:30 PM]
Pipelining Principles • Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload • Pipeline rate is limited by the slowest pipeline stage • Multiple tasks operate simultaneously • Potential speedup = number of pipe stages • Unbalanced stage delays reduce the speedup • Time to "fill" the pipeline and to "drain" it also reduces the speedup • can be ignored for a very large number of laundry loads
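These principles can be checked numerically. A minimal Python sketch using the laundry durations; the formula assumes the bottleneck stage (the dryer) stays busy once the pipeline fills, which matches the timeline above:

    stages = [30, 40, 20]                 # wash, dry, fold (minutes)
    n = 4                                 # number of loads
    sequential = n * sum(stages)          # 360 min = 6 hours
    pipelined = sum(stages) + (n - 1) * max(stages)   # 90 + 3*40 = 210 min
    print(sequential / 60, pipelined / 60)            # 6.0 and 3.5 hours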
Pipelined MIPS • Temporal parallelism • Divide single-cycle processor into 5 stages: • Fetch • Decode • Execute • Memory • Writeback • Add pipeline registers between stages • sort of like extra cookie trays
Pipelined MIPS • Divide single-cycle processor into 5 stages: • Fetch (IF) / Decode (ID) / Execute (EX) / Memory (MEM) / Writeback (WB)
Single-Cycle vs. Pipelined Performance • Pipelining • Break instruction into 5 steps • Each is 250 ps long (length of longest step) Write Reg and Read Reg occur during the first half and second half of the clock cycle, respectively. This allows more instruction overlap!
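Under the slide's idealized assumption of five perfectly balanced 250 ps stages, latency and throughput for n instructions work out as follows (a sketch; real stage delays are not this balanced):

    STAGE_PS, N_STAGES = 250, 5
    latency_ps = N_STAGES * STAGE_PS      # 1250 ps for one instruction to complete

    def time_for(n):                      # total time for n instructions, no stalls
        return latency_ps + (n - 1) * STAGE_PS

    print(time_for(1), time_for(1000))    # 1250 ps vs. 251000 ps
    # steady state: one instruction finishes every 250 ps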
Pipelining Abstraction • 5 stages of execution • instruction memory (fetch) • register file read • ALU/execution • data memory access • register file write (writeback)
Single-Cycle and Pipelined Datapath • Key difference: insertion of pipeline registers
Multi-Cycle and Pipelined Datapath • Key difference: full-length registers for datapath and control
One Problem • There is a problem: mismatched paths • Result to write back and reg dest are out of step
Corrected Pipelined Datapath • Solution: One modification needed • Keep result and reg dest together • both go through an equal number of pipeline stages • They belong together as a single "token"
Pipelined Control • What is similar/different w.r.t. single-cycle MIPS? • Same control signal values • Values must be delayed and delivered in correct cycles Control signals must go through registers as well! “Token” = data + control!
Pipelining Challenges • "Hazards" • when an instruction depends on results from a previous instruction that has not yet completed • 2 types of hazards: • Data hazard: register value not written back to register file yet • an instruction produces a result needed by the next instruction • new value will be stored during the write-back stage • new value needed by the next instr during register read: it is following too close behind! • Control hazard: don't know which instruction is next • next PC not decided yet • caused by conditional branches • cannot fetch next instr if current branch is not yet decided
Data Hazard: Example • First instruction computes a result ($s0) needed by the next 3 instructions • first two cause problems • third actually does not! • Why not? Register file written in first half of clock cycle, read in second half
Handling Data Hazards • Static/compiler approaches: • Insert nops (no-operations) in code at compile time • Rearrange code at compile time • Dynamic/runtime approaches: • Forward data at run time • data goes directly from producer to consumer • Stall the consuming instruction at run time
Compile-Time Hazard Elimination • Insert enough nops between dependent instructions • Or re-arrange: move independent instructions up
Compile-Time Hazard Elimination • Insert enough nops between dependent instructions • Or re-arrange: move independent instructions up, e.g.:
xor $t6, $t7, $t8
xor $t9, $t6, $t10
Dynamic Approach: Data Forwarding • Also known as register bypassing • results are actually available even though not stored in reg. • grab a copy and send where needed! • extra hardware needed • Note: forwarding actually not needed for sub. Why again?
Data Forwarding: from WriteBack
if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else ForwardAE = 00
Data Forwarding • Forward to Execute stage from either: • Memory stage (MEM) or • WriteBack (WB) stage • Forwarding logic for ForwardAE: • for instr i in Execute stage: • if the destination reg of instr i-1 matches a source reg, then forward from Memory stage • if the destination reg of instr i-2 matches a source reg, then forward from WB stage • logic equations:
if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else ForwardAE = 00
• Forwarding logic for ForwardBE is the same, but with rsE replaced by rtE
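The same ForwardAE equations, transcribed into a small Python function as a behavioral sketch of the mux-select logic (signal names come from the slide; the function wrapper is just illustration, not real HDL):

    def forward_ae(rsE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
        # Mux select for ALU operand A:
        #   0b10 -> forward ALU result from the Memory stage
        #   0b01 -> forward result from the Writeback stage
        #   0b00 -> no hazard; use the value read from the register file
        if rsE != 0 and rsE == WriteRegM and RegWriteM:
            return 0b10
        if rsE != 0 and rsE == WriteRegW and RegWriteW:
            return 0b01
        return 0b00

The Memory stage is checked first: if both stages hold a matching destination, instr i-1's result is the more recent one and must win.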
Forwarding may not always work… • Example: Load followed immediately by R-type • loads are harder to deal with because they need mem access! • data cannot go backwards in time! • need one extra cycle, compared to R-type instructions • lw has a 2-cycle latency!
Stalling • Stall for a cycle, then forward • solves the Load followed by R-type problem
Stalling Hardware
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
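The stall condition as a Python predicate, in the same behavioral-sketch style (signal names from the slide):

    def lw_stall(rsD, rtD, rtE, MemtoRegE):
        # True when the instruction in Decode reads the register that a
        # lw currently in Execute has not yet fetched from memory
        return (rsD == rtE or rtD == rtE) and MemtoRegE

    # StallF = StallD = FlushE = lw_stall(...)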
Control Hazards • beq: • branch is not determined until the fourth stage of the pipeline • Instructions after the branch are fetched before the branch is resolved • These instructions must be flushed if the branch is taken
Effect & Solutions • Could always stall until branch is completed • Expensive: 3 cycles lost per branch! • Could predict branch outcome, and flush if wrong • Branch misprediction penalty: • Instructions flushed when branch is taken • May be reduced by determining branch earlier
Control Hazards: Flushing • Flushing • turn instruction into a NOP (put all zeros) • renders it harmless!
Control Hazards: Early Branch Resolution • Introduces another data hazard in the Decode stage (fix a few slides away)
Control Hazards with Early Branch Resolution Penalty now only one lost cycle
Aside: Delayed Branch • Pipelined MIPS always executes the instruction following a branch • So the branch is delayed • This allows us to avoid killing the next instruction • Compilers move an instruction that has no conflict with the branch into the delay slot
Example • This sequence:
add $4, $5, $6
beq $1, $2, label
…
label: …
is reordered to this:
beq $1, $2, label
add $4, $5, $6
…
label: …
• The add is always executed! • If the branch is taken • add is executed before the PC is set to the branch target • If the branch is not taken • add is executed as the first instruction of the fall-through path
Control Forwarding and Stalling Hardware • Forwarding logic:
ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
• Stalling logic:
branchstall = (BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD))
OR (BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD))
StallF = StallD = FlushE = lwstall OR branchstall
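The branch-stall condition in the same style (a sketch; lw_stall is the predicate sketched earlier):

    def branch_stall(BranchD, RegWriteE, WriteRegE,
                     MemtoRegM, WriteRegM, rsD, rtD):
        # Stall a branch in Decode when a source register is still being
        # computed by an ALU instruction in Execute, or loaded by a lw in Memory
        return (BranchD and RegWriteE and WriteRegE in (rsD, rtD)) or \
               (BranchD and MemtoRegM and WriteRegM in (rsD, rtD))

    # StallF = StallD = FlushE = lw_stall(...) or branch_stall(...)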
Branch Prediction • Strategy: predict the outcome and start fetching from the predicted address • Especially important if branch penalty > 1 cycle • Guess whether the branch will be taken • Backward branches are usually taken (loops) • Perhaps consider history of whether the branch was previously taken to improve the guess (a sketch of one such predictor follows below) • Start fetching from the correct place • Either the next instruction, if the prediction is that the branch will not be taken • Or from the branch target otherwise • Good prediction reduces the fraction of branches requiring a flush
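One simple way to use history, as the slide suggests: predict that each branch does what it did last time. A minimal sketch of such a 1-bit, last-outcome predictor; the table size and PC indexing are illustrative assumptions, not from the slides:

    class LastOutcomePredictor:
        def __init__(self, entries=64):
            self.taken = [False] * entries            # one history bit per entry

        def predict(self, pc):
            return self.taken[pc % len(self.taken)]   # guess: same as last time

        def update(self, pc, was_taken):
            self.taken[pc % len(self.taken)] = was_taken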
Pipelined Performance Example • Ideally CPI = 1 • But in practice higher, due to stalls (caused by loads and branches) • SPECINT2000 benchmark: • 25% loads • 10% stores • 11% branches • 2% jumps • 52% R-type • Suppose: • 40% of loads used by next instruction • 25% of branches mispredicted • All jumps flush next instruction • What is the average CPI?
Pipelined Performance Example • SPECINT2000 benchmark: • 25% loads • 10% stores • 11% branches • 2% jumps • 52% R-type • Suppose: • 40% of loads used by next instruction • 25% of branches mispredicted • All jumps flush next instruction • What is the average CPI? • Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus, • CPI(lw) = 1(0.6) + 2(0.4) = 1.4 • CPI(beq) = 1(0.75) + 2(0.25) = 1.25 • Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
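The same average, checked in Python with the mix and penalties exactly as given:

    mix = {'lw': 0.25, 'sw': 0.10, 'beq': 0.11, 'j': 0.02, 'rtype': 0.52}
    cpi = {'lw': 1 * 0.6 + 2 * 0.4,      # 1.4: 40% of loads stall one cycle
           'beq': 1 * 0.75 + 2 * 0.25,   # 1.25: 25% of branches mispredicted
           'sw': 1, 'rtype': 1,
           'j': 2}                       # jumps always flush the next instruction
    print(sum(mix[k] * cpi[k] for k in mix))   # 1.1475, i.e. about 1.15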
Pipelined Performance • Pipelined processor critical path:
Tc = max {
tpcq + tmem + tsetup,  (Fetch)
2 (tRFread + tmux + teq + tAND + tmux + tsetup),  (Decode)
tpcq + tmux + tmux + tALU + tsetup,  (Execute)
tpcq + tmemwrite + tsetup,  (Memory)
2 (tpcq + tmux + tRFwrite)  (Writeback)
}
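Plugging delays into the max shows which stage sets the clock period. The component delays below are made-up placeholders, not the textbook's numbers; only the formula itself comes from the slide:

    # Illustrative delays in ps -- placeholders, NOT real component values
    t = dict(pcq=30, mem=250, setup=20, RFread=150, mux=25,
             eq=40, AND=15, ALU=200, memwrite=220, RFwrite=100)

    Tc = max(
        t['pcq'] + t['mem'] + t['setup'],                                # Fetch
        2 * (t['RFread'] + t['mux'] + t['eq']
             + t['AND'] + t['mux'] + t['setup']),                        # Decode
        t['pcq'] + 2 * t['mux'] + t['ALU'] + t['setup'],                 # Execute
        t['pcq'] + t['memwrite'] + t['setup'],                           # Memory
        2 * (t['pcq'] + t['mux'] + t['RFwrite']),                        # Writeback
    )
    print(Tc)  # 550 with these placeholders: Decode would dominate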