This lecture covers the instruction set architecture, processor functionality, performance definition, and semester schedule review for graduate computer architecture students.
Graduate Computer Architecture I Lecture 2: Processor and Pipelining Young Cho
Instruction Set Architecture • Set of Elementary Commands • Good ISA … • CONVENIENT functionality to higher levels • EFFICIENT functionality to higher levels • GENERAL: Used in many different ways • PORTABLE: Lasts through many generations • Points of View • Serves as the interface between HW and SW, so it looks different from each side
Processors Today • General-Purpose Register machines • Split into CISC and RISC • Development of RISC • Very Complex Designs • No longer REDUCED • Rapid Technology Advancements
Performance Definition • Performance is in units of things per second • bigger is better • If we are primarily concerned with response time: Performance(X) = 1 / ExecutionTime(X) • "X is n times faster than Y" means n = Performance(X) / Performance(Y) = ExecutionTime(Y) / ExecutionTime(X)
Amdahl’s Law • Speedup_overall = ExecTime_old / ExecTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced) • Best you could ever hope to do (as Speedup_enhanced → ∞): Speedup_maximum = 1 / (1 − Fraction_enhanced)
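A minimal sketch of the law as a function (illustrative code, not from the slide; the parameter names follow the formula above):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Overall speedup when only a fraction of execution time is improved.
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    print(amdahl_speedup(0.4, 10))    # enhance 40% of the time by 10x -> 1.5625x overall
    print(1.0 / (1.0 - 0.4))          # best you could ever hope to do: ~1.67x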
Semester Schedule Review • Basic Architecture Organization: Weeks 2-4 • Processors and Pipelining – Week 2 • Memory Hierarchy and Cache Design – Week 3 • Hazards and Predictions – Week 4 • Quiz 1 – Week 4 • Quantitative Approach: Weeks 5-10 • Instruction-Level Parallelism – Week 5 & 6 • Vector and Multi-Processors – Week 7 & 8 • Storage and I/O – Week 9 • Interconnects and Clustering – Week 10 • Quiz 2 – Week 6 • Quiz 3 – Week 9 • Advanced Topics: Weeks 11-15 • Network Processors – Week 11 • Reconfigurable Devices and SoC – Week 12 • Low Power Hardware and Techniques – Week 12 • HW and SW Co-design – Week 13 • Other Topics – Week 14 & 15 • Quiz 4 – Week 11
Administrative • Course Web Site • http://www.arl.wustl.edu/~young/cse560m • Xilinx Tools • May use Urbauer Room 116 Computers • Accounts will be available • ISE Version 7.1 and ModelSim 6.0a • http://direct.xilinx.com/direct/webpack/71/WebPACK_71_fcfull_i.exe • http://direct.xilinx.com/direct/webpack/71/MXE_6.0a_Full_installer.exe • Prerequisite Course Text • (Optional) D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition • Quizzes A and B • For your own benefit • The prerequisite course text may help, but it is not necessary • Look for answers on the WWW • Project • Form groups of 2-3 by Thursday • Weeks 1-5: Pipelined 32-bit Processor • Later work builds on top of this basic processor • Lectures at Urbauer Room 116 (project checkpoints) • Sep 27, Oct 11, Nov 08, Nov 17, and Nov 22
Traditional CISC and RISC • Reduced Instruction Set Computer • Smaller Design Footprint → Reduced Cost • Essential Set of Instructions • Intuitively Larger Programs • Complex Instruction Set Computer • Complex set of desired Instructions • Pack many functions into one Instruction • Compact Program: Memory WAS Expensive • RISC became a better fit as technology changed • Cheaper Memory • Shorter Critical Path = Faster Clock Cycles • CISC Chips Integrated the RISC Concepts • Better Compilers • RISC!? of Today • Very Complex and Large Set of Instructions • The original motivation is no longer visible • High Performance and Throughput • [Figure: image built from simple primitives: H-lines, V-lines, and circles]
Real Performance Measurement • CPU Time = Instruction Count × CPI × Cycle Time • CPU time is the REAL measure of computer performance, NOT clock rate and NOT CPI alone
Cycles Per Instruction • “Average Cycles per Instruction”: CPI = (CPU Time × Clock Rate) / Instruction Count = Total Cycles / Instruction Count • “Instruction Frequency”: CPI = Σ CPI_i × F_i, where F_i = IC_i / Instruction Count is the fraction of instructions of type i
Calculating CPI • Run a benchmark and collect workload characterization (simulation, machine counters, or sampling) • Base Machine (Reg / Reg):
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total CPI = 1.5
• Typical mix of instruction types in a program • Design guideline: make the common case fast • MIPS 1% rule: only consider adding an instruction if it is shown to give a 1% performance improvement on reasonable benchmarks
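A small sketch (illustrative code, not from the slides) that reproduces the table's arithmetic:

    # Weighted CPI from an instruction mix: sum of (frequency x cycles)
    mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}
    cpi = sum(freq * cycles for freq, cycles in mix.values())        # 1.5
    for op, (freq, cycles) in mix.items():
        contrib = freq * cycles
        print(op, contrib, f"{contrib / cpi:.0%} of time")           # 33% / 27% / 13% / 27%
    # CPU Time = Instruction Count x CPI x Cycle Time, so for a fixed
    # program and clock this CPI determines the execution time.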
Impact of Stalls • Assume CPI = 1.0 ignoring branches (ideal) • Assume the solution is stalling for 3 cycles • If 30% of instructions are branches, stall 3 cycles on that 30%:
Op      Freq  Cycles  CPI(i)  (% Time)
Other   70%   1       0.7     (37%)
Branch  30%   4       1.2     (63%)
New CPI = 1.9 • The machine runs at only 1/1.9 ≈ 0.53 times its ideal speed • Far from ideal
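The same slowdown as a quick check (again an illustrative sketch, not from the slides):

    ideal_cpi, branch_freq, stall_cycles = 1.0, 0.30, 3
    new_cpi = ideal_cpi + branch_freq * stall_cycles      # 1.0 + 0.9 = 1.9
    print(new_cpi, ideal_cpi / new_cpi)                   # 1.9, ~0.53x of ideal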
Instruction Set Architecture Design • Definition • Set of Operations • Instruction Format • Hardware Data Types • Named Storage • Addressing Modes and Sequencing • Description in Register Transfer Language (RTL) • Intermediate Representation • Map Instructions to RTL • Technology Constraint Considerations • Architected storage mapped to actual storage • Function units to do all the required operations • Possible additional storage (e.g. MAddressR, MBufferR, …) • Interconnect to move information among regs and FUs • Controller • Sequence into a symbolic controller state transition diagram (STD) • Lower the symbolic STD to control points • Controller Implementation
Typical Load/Store Processor • [Datapath diagram: PC and Instruction Memory feed the Register File, ALU, and Data Memory under Control, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Instruction Format • [Instruction encoding diagram] • General instruction format: a 4-bit opcode in bits 31-28; the remaining 28 bits vary according to instruction type • R-type instruction: opcode plus register fields (rs, rt, rd) and a funct field in bits 3-0; other bits unused • I-type instruction: opcode plus register fields and a 16-bit immediate (imm16) in bits 15-0; other bits unused • J-type instruction: opcode plus a 16-bit immediate; other bits unused
Instruction Type Datapath • [Datapath diagrams showing the paths exercised by R-type, I-type, and J-type instructions]
Cloth Washing Process • Three sequential steps (wash, dry, fold) taking 30 minutes, 35 minutes, and 25 minutes • One set of clothes in 1 hour 30 minutes
Pipelining Laundry • Overlapping the steps: 30 + 35 + 35 + 35 + 25 minutes • Three sets of clean clothes in 2 hours 40 minutes, about 53 min/set • With a large number of sets, each load takes an average of only ~35 min (the slowest step), approaching a 3X increase in productivity!!!
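A small sketch (illustrative, not from the slides) of the arithmetic behind the laundry analogy, assuming the three step times above:

    stages = [30, 35, 25]                    # minutes per step: wash, dry, fold
    def pipelined_time(n_sets):
        # The first set passes through every step; after that, one set
        # finishes per beat of the slowest step.
        return sum(stages) + (n_sets - 1) * max(stages)
    print(pipelined_time(1))                 # 90 min, same as unpipelined
    print(pipelined_time(3))                 # 160 min = 2 h 40 min, ~53 min/set
    print(pipelined_time(100) / 100)         # ~35.5 min/set as the count grows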
Introducing Problems • Hazards prevent the next instruction from executing during its designated clock cycle • Structural hazards: HW cannot support this combination of instructions (a single person cannot dry and iron clothes simultaneously) • Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (a missing sock – you need both before putting them away) • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (Er…branch & jump)
One Memory Port/Structural Hazards • [Pipeline diagram, cycles 1-7: Load, Instr 1, Instr 2, bubble, Instr 3] • With a single memory port, the Load's data-memory access and Instr 3's instruction fetch need the port in the same cycle, so the instruction fetch is stalled (bubble) as well as the Load from memory being served
Speed Up Equation for Pipelining • Speedup = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Cycle Time_unpipelined / Cycle Time_pipelined) • For a simple RISC pipeline with Ideal CPI = 1: Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction) × (Cycle Time_unpipelined / Cycle Time_pipelined)
Memory and Pipeline • Machine A: Dual-ported memory • Machine B: Single-ported memory, but a 1.05 times faster clock rate • Ideal CPI = 1 for both • Loads are 40% of instructions executed • FreqRatio = Clock_unpipelined / Clock_pipelined • SpeedUpA = (Pipeline Depth / (1 + 0)) × FreqRatio = Pipeline Depth × FreqRatio • SpeedUpB = (Pipeline Depth / (1 + 0.4 × 1)) × FreqRatio × 1.05 = Pipeline Depth × 0.75 × FreqRatio • SpeedUpA / SpeedUpB = 1.33 • Machine A is 1.33 times faster
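A quick sketch of that comparison (illustrative code, not from the slides; the pipeline depth and FreqRatio values are arbitrary assumptions, since they cancel out):

    depth, freq_ratio = 5, 4.0                               # assumed for illustration
    speedup_a = depth / (1 + 0.0) * freq_ratio               # dual-ported memory
    speedup_b = depth / (1 + 0.4 * 1) * freq_ratio * 1.05    # single-ported, 5% faster clock
    print(speedup_a / speedup_b)                             # ~1.33: Machine A wins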
Data Hazard on r1 • [Pipeline diagram: each instruction proceeds through IF, ID/RF, EX, MEM, and WB in successive cycles] • add r1,r2,r3 • sub r4,r1,r3 • and r6,r1,r7 • or r8,r1,r9 • xor r10,r1,r11
Data Hazards • Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it • I: add r1,r2,r3 J: sub r4,r1,r3 • Caused by a “Dependence” (in compiler nomenclature); this hazard results from an actual need for communication
Data Hazards • Write After Read (WAR): InstrJ writes an operand before InstrI reads it • I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • Called an “anti-dependence” by compiler writers; it results from reuse of the name “r1” • Can’t happen in the DLX 5-stage pipeline because: • All instructions take 5 stages, and • Reads are always in stage 2, and • Writes are always in stage 5
Data Hazards • Write After Write (WAW): InstrJ writes an operand before InstrI writes it • I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • Called an “output dependence” by compiler writers • This also results from reuse of the name “r1” • Can’t happen in the DLX 5-stage pipeline because: • All instructions take 5 stages, and • Writes are always in stage 5 • We will see WAR and WAW in more complicated pipelines
Solution: Data Forwarding • [Pipeline diagram: ALU results are forwarded from later pipeline stages back to the ALU inputs of the dependent instructions] • add r1,r2,r3 • sub r4,r1,r3 • and r6,r1,r7 • or r8,r1,r9 • xor r10,r1,r11
HW Change for Forwarding • [Datapath diagram: muxes in front of the ALU inputs select among the register file outputs, the immediate, and values fed back from the EX/MEM and MEM/WB pipeline registers; NextPC and Data Memory are unchanged]
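A minimal sketch of the decision such a forwarding unit makes (illustrative Python using textbook-style conditions, not the slide's own logic):

    def forward_operand(ex_mem, mem_wb, src_reg, reg_file_value):
        # Pick the ALU operand for the instruction currently in EX.
        # ex_mem / mem_wb model the pipeline registers, e.g.
        # {"reg_write": True, "rd": 1, "value": 42}
        if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
            return ex_mem["value"]           # forward from EX/MEM (most recent)
        if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
            return mem_wb["value"]           # forward from MEM/WB
        return reg_file_value                # no hazard: use the register file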
Data Hazard Even with Forwarding • [Pipeline diagram: lw r1, 0(r2) followed by sub r4,r1,r6, and r6,r1,r7, or r8,r1,r9] • The loaded value is not available until the end of MEM, so the dependent sub must still stall one cycle (bubble) even with forwarding
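And the matching load-use hazard check that inserts that stall (again an illustrative sketch, not the slide's own logic):

    def load_use_stall(id_ex, if_id):
        # Stall if the instruction in EX is a load whose destination register
        # is a source of the instruction currently being decoded.
        return id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

    # lw r1,0(r2) in EX, sub r4,r1,r6 in decode -> one bubble
    print(load_use_stall({"mem_read": True, "rt": 1}, {"rs": 1, "rt": 6}))   # True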
Software Scheduling • Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d, e, and f are in memory
Slow code:              Fast code:
LW  Rb,b                LW  Rb,b
LW  Rc,c                LW  Rc,c
ADD Ra,Rb,Rc            LW  Re,e
SW  a,Ra                ADD Ra,Rb,Rc
LW  Re,e                LW  Rf,f
LW  Rf,f                SW  a,Ra
SUB Rd,Re,Rf            SUB Rd,Re,Rf
SW  d,Rd                SW  d,Rd
• Compiler optimizes for performance; hardware checks for safety
Control Hazard on Branches • [Pipeline diagram for the instruction sequence below] • 10: beq r1,r3,36 • 14: and r2,r3,r5 • 18: or r6,r1,r7 • 22: add r8,r1,r9 • 36: xor r10,r1,r11 • What do you do with the 3 instructions in between? How do you do it? Where is the “commit”?
Branch Hazard Alternatives • Stall until the branch direction is clear • Predict Branch Not Taken • Execute successor instructions in sequence • “Squash” instructions in the pipeline if the branch is actually taken • Works because the pipeline state is not updated until late • 47% of DLX branches are not taken on average • PC+4 is already calculated, so use it to fetch the next instruction • Predict Branch Taken • 53% of DLX branches are taken on average • DLX still incurs a 1-cycle branch penalty because the target address is not known until decode • Other machines: branch target known before the outcome
Branch Hazard Alternatives • Delayed Branch • Define the branch to take place AFTER a following instruction (fill in the branch delay slot): branch instruction, sequential successor_1, sequential successor_2, …, sequential successor_n, then the branch target if taken (a branch delay of length n) • A 1-slot delay allows a proper decision and the branch target address to be available in the 5-stage pipeline
Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Assumes branches (conditional & unconditional) are 14% of instructions and 65% of them change the PC
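The CPI column follows directly from CPI = 1 + branch frequency × effective penalty; a small sketch of that check (illustrative, not from the slides):

    branch_freq, taken_frac = 0.14, 0.65
    effective_penalty = {"Stall pipeline": 3, "Predict taken": 1,
                         "Predict not taken": 1 * taken_frac,   # pay only when taken
                         "Delayed branch": 0.5}
    for scheme, penalty in effective_penalty.items():
        print(scheme, round(1 + branch_freq * penalty, 2))
    # -> 1.42, 1.14, 1.09, 1.07, matching the CPI column; the speedup columns
    #    then follow from the pipeline speedup equation shown earlier.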
Conclusion • Instruction Set Architecture • Things to Consider when designing a new ISA • Processor • Concept behind Pipelining • Five Stage Pipeline RISC • Proper Processor Performance Evaluation • Limitations of Pipelining • Structural, Data, and Control Hazards • Techniques to Recover Performance • Re-evaluating Speed-ups