190 likes | 231 Views
Explore the challenges and solutions in CPU pipelining to enhance performance. Understand the impact of hazards and the importance of forwarding and stalling. Learn about data hazards and their types, like RAW, WAR, and WAW. Discover strategies for handling branch hazards and improving overall CPU efficiency.
E N D
CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! Read Chapter 4.7-4.8
00 00 00 00 00 +4 PCWB PCREG PCALU PCMEM IRWB IRALU A B WDALU YWB IRMEM YMEM WDMEM IRREG Rs: <25:21> Rt: <20:16> BSEL 1 0 x4 0 1 2 WDSEL 0x80000000 PC<31:29>:J<25:0>:00 0x80000040 5-Stage miniMIPS 0x80000080 JT BT PCSEL 6 5 4 3 2 1 0 Instruction PC Memory • Omits some details A D Instruction • NO bypass or interlock logic Fetch J:<25:0> Register RA1 RA2 WA File RD1 RD2 = JT Imm: <15:0> SEXT SEXT BZ shamt:<10:6> + “16” ASEL Register 0 1 2 File BT Address is available right after instruction enters Memory stage A B ALU ALUFN N V C Z ALU Wr R/W Adr WD PC+4 almost 2 clock cycles Memory Data Memory RD Rt:<20:16> “27” “31” Rd:<15:11> Data is needed just before rising clock edge at end of Write Back stage 0 1 2 3 WASEL Write Register WD WA Back WA WERF File WE
Pipelining • Improve performance by increasing instruction throughput • Ideal speedup is number of stages in the pipeline. Do we achieve this?
Pipelining • What makes it easy • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction • Individual Instructions still take the same number of cycles • But we’ve improved the through-put by increasing the number of simultaneously executing instructions
Data Hazards • Problem with starting next instruction before first is finished • dependencies that “go backward in time” are data hazards
Software Solution • Have compiler guarantee no hazards • Where do we insert the “nops” ? sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) • Problem: this really slows us down!
Forwarding • Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding
Can't always forward • Load word can still cause a hazard: • an instruction tries to read a register following a load instruction that writes to the same register. • Thus, we need a hazard detection unit to “stall” the instruction
Stalling • We can stall the pipeline by keeping an instruction in the same stage
Types of Data Hazards • RAW hazards corresponding to value dependences are most difficult to deal with, since they can never be eliminated • The second instruction is waiting for information produced by the first instruction • WAR and WAW hazards are name dependences • Two instructions happen to use the same register (name), although they don’t have to • Can often be eliminated by renaming, either in software or hardware • Implies the use of additional resources, hence additional cost • Renaming is not always possible: implicit operands such as accumulator, PC, or condition codes cannot be renamed • These hazards don’t cause problems for MIPS pipeline • Relative timing does not change even with pipelined execution, because reads occur early and writes occur late in pipeline
Branch Hazards • When we decide to branch, other instructions are in the pipeline! • We are predicting “branch not taken” • need to add hardware for flushing instructions if we are wrong
Improving Performance • Try to avoid stalls! E.g., reorder these instructions: • lw $t0, 0($t1) • lw $t2, 4($t1) • sw $t2, 0($t1) • sw $t0, 4($t1) • Add a “branch delay slot” • the next instruction after a branch is always executed • rely on compiler to “fill” the slot with something useful • Superscalar: start more than one instruction in the same cycle
Dynamic Scheduling • The hardware performs the “scheduling” • hardware tries to find instructions to execute • out of order execution is possible • speculative execution and dynamic branch prediction • All modern processors are very complicated • Pentium 4: 20 stage pipeline, 6 simultaneous instructions • PowerPC and Pentium: branch history table • Compiler technology important
NOP 00 00 00 00 00 • broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logic +4 PCMEM PCWB PCALU PCREG IRWB YMEM A YWB B IRMEM WDMEM IRALU IRREG WDALU Rs: <25:21> Rt: <20:16> • added A/B bypass muxes to get data before it’s written to regfile BSEL 1 0 x4 • added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage 0 1 2 WDSEL 0x80000000 PC<31:29>:J<25:0>:00 0x80000040 5-Stage miniMIPS 0x80000080 JT BT PCSEL 6 5 4 3 2 1 0 We wanted a simple, clean pipeline but… Instruction PC Memory A D Instruction Fetch J:<25:0> Register RA1 RA2 WA File RD1 RD2 = JT Imm: <15:0> SEXT SEXT BZ shamt:<10:6> + “16” ASEL Register 0 1 2 File BT A B ALU ALUFN N V C Z ALU Wr R/W Adr WD PC+4 Memory Data Memory RD Rt:<20:16> “27” “31” Rd:<15:11> 0 1 2 3 WASEL Write Register WD WA Back WA WERF File WE
Bypass MUX Details The previous diagram was oversimplified. Really need for the bypass muxes to precede the A and B muxes to provide the correct values for the jump target (JT), write data, and early branch decision logic. Register File RD1 RD2 from ALU/MEM/WB/PC from ALU/MEM/WB/PC A Bypass B Bypass JT = shamt ’16’ SEXT(imm) 2 0 1 0 1 ASEL BSEL BZ AALU BALU WDALU To ALU To ALU To Mem
00 00 00 00 00 +4 PCMEM PCWB PCALU PCREG YWB IRWB WDMEM YMEM IRMEM A IRALU B WDALU IRREG Rs: <25:21> Rt: <20:16> BSEL 1 0 x4 0 1 2 WDSEL 0x80000000 PC<31:29>:J<25:0>:00 0x80000040 Final 5-Stage miniMIPS 0x80000080 JT BT PCSEL 6 5 4 3 2 1 0 Instruction PC Memory •Added branch delay slot and early branch resolution logic to fix a CONTROL hazard • Added lots of bypass paths and detection logic to fix various STRUCTURAL hazards • Added pipeline interlocks to fix load delay STRUCTURAL hazard A D Instruction Fetch J:<25:0> Register RA1 RA2 WA File RD1 RD2 = Imm: <15:0> JT SEXT SEXT BZ NOP shamt:<10:6> + “16” ASEL Register 0 1 2 File BT A B NOP ALU ALUFN N V C Z ALU Wr R/W Adr WD PC+4 Memory Data Memory RD Rt:<20:16> “27” “31” Rd:<15:11> 0 1 2 3 WASEL Write Register WD WA Back WA WERF File WE
Pipeline Summary (I) • Started with unpipelined implementation – direct execute, 1 cycle/instruction – it had a long cycle time: mem + regs + alu + mem + wb • We ended up with a 5-stage pipelined implementation – increase throughput (3x???) – delayed branch decision (1 cycle) Choose to execute instruction after branch – delayed register writeback (3 cycles) Add bypass paths (6 x 2 = 12) to forward correct value – memory data available only in WB stage Introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready
Pipeline Summary (II) • Fallacy #1: Pipelining is easy • Smart people get it wrong all of the time! • Fallacy #2: Pipelining is independent of ISA • Many ISA decisions impact how easy/costly it is to implement pipelining (i.e. branch semantics, addressing modes). • Fallacy #3: Increasing Pipeline stages improves performance • Diminishing returns. Increasing complexity.