Lecture 16: More Pipelining Complications

Lecture 16: More Pipelining Complications. Michael B. Greenwald, Computer Architecture, CIS 501, Fall 1999

HW questions: • #2 (3.5 in H&P): Can assume that branches are resolved in ID or in EX --- JUST STATE ASSUMPTION CLEARLY! (It’s slightly easier to assume that branches are resolved in ID in this question). • #3 (3.6 in H&P): • Fig 3.19 already has forwarding for Address calculation in BRANCH instr. • It already assumes that branches are resolved in EX. • You’re adding entries for STORE and BRANCH; think about all possible source instructions that might need forwarding.

Pipelining Review • Just overlap tasks, and easy if tasks are independent • Speed Up <= Pipeline Depth; if ideal CPI is 1, then: Speedup = [Pipeline Depth / (1 + Pipeline stall CPI)] × [Clock Cycle Unpipelined / Clock Cycle Pipelined] • Hazards limit performance on computers: • Structural: need more HW resources • Data: need forwarding, compiler scheduling • Control: early evaluation & PC, delayed branch, prediction
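Worked example (a sketch, not from the slides): the speedup formula above written as a small Python function. The function name, the parameter names, and the sample numbers (a 5-stage pipe with 0.25 stall cycles per instruction) are illustrative assumptions.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined cycle / pipelined cycle).

    Assumes an ideal CPI of 1, as on the slide.
    """
    return depth / (1.0 + stall_cpi) * (cycle_unpipelined / cycle_pipelined)


# Hypothetical numbers: no stalls vs. 0.25 stall cycles per instruction.
ideal = pipeline_speedup(depth=5, stall_cpi=0.0)
with_stalls = pipeline_speedup(depth=5, stall_cpi=0.25)
print(ideal, with_stalls)
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction eats into it directly.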

Pipelining Review • Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency • Interrupts and the instruction set make pipelining harder • Compilers reduce cost of data and control hazards • Load delay slots • Branch delay slots • Branch prediction

Pipelining Complications • Complex Addressing Modes and Instructions • Address modes: Autoincrement causes register change during instruction execution • Interrupts? Need to restore register state • Adds WAR and WAW hazards since writes no longer happen only in the last stage • Memory-Memory Move Instructions • Must be able to handle multiple page faults • Long-lived instructions: partial state save on interrupt • Condition Codes: set by previous instruction (what if interrupt in between?)

Pipelining Complications: When not all instructions take the same time... • Floating Point: long execution time compared to int. • Impractical to require that FP operations complete in one clock cycle: • Slow clock (as long as longest stage). • So we have at least one stage of the pipeline that will complete with different latencies for different instructions.

What’s the problem with variable/long latency? • If we treat an FP EX like a single long stage: • we need to stall all following instructions • Consecutive floating point ops act as if clock cycle is full length of FP EX stage. • If we treat each EX like independent stages: • Need to deal with out of order execution • More hazards

Dealing with Pipeline components of differing latency [Figure: stages for operations with varying latency. Labels: Independent operation? Pipelined? Duplicated? (Effectively) functional units]

Independent vs. Interdependent paths [Figure labels: Varying latency; Independent operation? Pipelined? Duplicated?]

Pipeline with long latency components (unpipelined vs. pipelined) • Unpipelined can only support one operation at a time • Pipelined takes > N cycles to complete, but can have as many active instructions as stages.

Duplication of components • Whether pipelined or unpipelined, duplicating a stage N times keeps latency the same, reduces initiation interval by a factor of N, increases cost by a factor of N, and reduces structural hazards.

Reality Check • Reality can be more complicated. • For example: • Partially pipelined (multiple stages, each taking more than a clock cycle). • Alternatively, multiple stages per clock cycle, if they can be done in parallel (or if a stage is shorter than a clock cycle). • Different operations can use multiple functional units, e.g. multiply uses the FP adder in the final stages.

Pipeline with long latency components • Latency: number of cycles, cannot be reduced • Initiation rate (initiation interval): number of cycles between consecutive issues. • Initiation rate can be decreased by pipelining (ups latency!) or duplication. • Increases due to structural hazards.
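A small sketch of how latency, initiation interval, and duplication interact. The model is an assumption made for illustration: at most one operation issues per cycle, each copy of the unit accepts a new operation every `initiation_interval` cycles, and every operation finishes `latency` cycles after it starts. All names are hypothetical.

```python
def completion_times(num_ops, latency, initiation_interval, copies=1):
    """Cycle at which each of num_ops back-to-back operations completes."""
    free_at = [0] * copies   # cycle at which each copy can accept a new op
    done = []
    issue = 0                # earliest cycle the next op may issue
    for _ in range(num_ops):
        # Pick the copy that becomes free first (ties go to the lowest index).
        unit = min(range(copies), key=lambda u: free_at[u])
        start = max(issue, free_at[unit])
        free_at[unit] = start + initiation_interval
        done.append(start + latency)
        issue = start + 1    # at most one issue per cycle
    return done


print(completion_times(4, latency=4, initiation_interval=4))            # unpipelined
print(completion_times(4, latency=4, initiation_interval=1))            # fully pipelined
print(completion_times(4, latency=4, initiation_interval=4, copies=2))  # duplicated
```

In all three cases the first result appears after the same latency; pipelining and duplication only change how soon the following operations can start.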

Terminology Review • Functional Units: hardware resources e.g. FP Adder, FP multiplier, FP divider • Stage: combines functional unit(s) with control and multiplexer, bracketed by latches. • Operation/instruction: a sequence or path of stages. • Latency and initiation interval are (generally) properties of an operation, not a stage (a stage may have latency of its own, usually = initiation interval). Previous drawings did not show functional units.

R4000 FP Pipeline stages
Stage  Func. Unit  Description
U      -           Unpack FP nums
E      Mult        Exception test
D      Divide      Division
M      Mult        1st stage of mult
N      Mult        2nd stage of mult
A      Add         Add mantissas
R      Add         Round
S      Add         Shift
(not independent!)

R4000 FP Instructions
Approx. Can be computed using timing diagram with stages.
Latency = cycles before the result can be used; Initiation Rate = cycles before an instruction of the same type can issue.
FP Instruction   Latency  Initiation Rate  Stages
Add, Subtract    4        3                U, S+A, A+R, R+S
Multiply         8        4                U, E+M, M^3, N, N+A, R
Divide           36       35               U, A, R, D^27, (D+A, D+R)^2, A, R
Square root      112      111              U, E, (A+R)^108, A, R
Negate           2        1                U, S
Absolute value   2        1                U, S
FP compare       3        2                U, A, R
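A sketch of the "timing diagram" computation for two rows of the table. It assumes that a second instruction may issue as soon as it never needs a stage in the same cycle as the first, and it reads the repeated M stage in the multiply row as three consecutive M cycles. The function names are hypothetical. For Add and Multiply this simple check reproduces the initiation rates listed above; the table is marked approximate, and other rows may reflect constraints this sketch does not model.

```python
# Stage usage per cycle, one set per clock cycle (expanded from the table).
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]
MULTIPLY = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]


def conflicts(first, second, distance):
    """True if `second`, issued `distance` cycles after `first`,
    needs some stage in a cycle where `first` is still using it."""
    for t, used in enumerate(second):
        cycle = t + distance          # position in first's timeline
        if cycle < len(first) and first[cycle] & used:
            return True
    return False


def min_issue_distance(first, second):
    """Smallest number of cycles after `first` at which `second` can issue."""
    d = 1
    while conflicts(first, second, d):
        d += 1
    return d


print(min_issue_distance(ADD, ADD))            # add followed by add
print(min_issue_distance(MULTIPLY, MULTIPLY))  # multiply followed by multiply
```

The same function can be called with two different stage lists to look for conflicts between unlike operations that share a functional unit.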

Longer Latency Components • Ideally, pipeline or duplicate enough so that initiation interval is 1 cycle. • If not duplicated enough, then structural hazards • Why not duplicate or pipeline? • Pipeline: increase latency (and hazards!) • Duplicate: expensive and infrequent: worthwhile? • Example: Divide, Square Root take 10X to 30X longer than Add.

Longer Latency Components: Complications • Because of increased latency, RAW hazards are more likely. • Different latencies mean multiple instructions might exit the pipe simultaneously (multiple writes in WB). • Structural hazards because initiation interval > 1 • Adds WAW (and WAR?) hazards since pipelines are no longer the same length • Out of order execution (Interrupts?)

Longer Latency Components: Complications • Because of increased latency, RAW hazards are more likely. • Deeper pipeline means that we need a bigger window in which no one uses the destination of this instruction as a source. • If there is a RAW hazard, the penalty is likely to be longer.

Longer Latency Components: Complications • Different latencies mean multiple instructions might exit the pipe simultaneously (multiple writes in WB). • Why not just implement multiple write ports to the register bank? • Average issue rate is (obviously) 1, so the extra ports would be underutilized. • How do we implement an interlock for this structural hazard? (For a fixed pipeline it’s easy to predict structural hazards.)

Longer Latency Components: Complications • Interlock for write port for variable latency instructions: • Approach #1: FIFO of scheduled writes, checked in the ID stage (example: an instruction that will write in 5 cycles). • Case 1: No write scheduled in 5 cycles: mark a write in 5 cycles and issue; the FIFO shifts left each cycle. • Case 2: Write already scheduled in 5 cycles: insert a stall, shift left, and try again.
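A minimal sketch of the FIFO approach, assuming a single register write port and modelling the FIFO as a list in which slot k means "the write port is reserved k cycles from now". The class and method names are hypothetical, not taken from the slides.

```python
class WritePortReservations:
    """Shift register of future write-port reservations, consulted in ID."""

    def __init__(self, max_latency):
        # slots[k] is True if some issued instruction writes k cycles from now.
        self.slots = [False] * (max_latency + 1)

    def try_issue(self, cycles_until_write):
        """Reserve the write port; return False if the instruction must stall."""
        if self.slots[cycles_until_write]:
            return False                     # Case 2: slot taken, stall in ID
        self.slots[cycles_until_write] = True  # Case 1: mark the write
        return True

    def tick(self):
        """Advance one clock cycle: shift every reservation one slot closer."""
        self.slots = self.slots[1:] + [False]


fifo = WritePortReservations(max_latency=8)
print(fifo.try_issue(5))   # first instruction reserves the slot 5 cycles out
fifo.tick()
print(fifo.try_issue(4))   # next instruction would write in the same cycle
fifo.tick()
print(fifo.try_issue(4))   # after a one-cycle stall the slot is free
```

Because the check happens in ID, a stalled instruction simply retries on the next cycle and nothing later in the pipe has to be disturbed.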

Longer Latency Components: Complications • Interlock for write port for variable latency instructions: • Approach #2: • 1) Wait until MEM or WB stage. • 2) 2 writes? Stall the one with lowest latency • Advantages: • Easy to detect, choice of which instruction to stall (fastest, latest, slowest, etc.) • Disadvantages: • Stalls from 2 different stages; stalls can trickle back (readjust our predicted writes). • Approach #1 is preferred.

Longer Latency Components: Complications • Adds WAW (and WAR?) hazards since pipelines are no longer the same length
    MULTD F5,F4,F6
    DIV   F5,F2,F4
    ADDD  F5,F3,F1

WAW hazards • Example:
    MULTD F5,F4,F6
    DIV   F5,F2,F4
    ADDD  F5,F3,F1
• Can this ever occur in real code? Consider:
    MULTD F5,F4,F6
    ADDD  F1,F5,#4
    ADDD  F5,F3,F1
• RAW hazard should always come first! • But delayed branch, and interrupts...
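A sketch that finds the WAW hazards in the first example by comparing write times. It assumes one instruction issues per cycle with no stalls, and the cycle counts in `EX_CYCLES` are made-up illustrative values, not figures from the lecture.

```python
# Hypothetical cycles from issue to register write for each opcode.
EX_CYCLES = {"MULTD": 7, "DIV": 25, "ADDD": 4}

# (opcode, destination register) in program order.
PROGRAM = [("MULTD", "F5"), ("DIV", "F5"), ("ADDD", "F5")]


def waw_hazards(program, ex_cycles):
    """Pairs (earlier, later) where the later instruction would write the
    same register no later than the earlier one, if nothing stalled."""
    write_cycle = [issue + ex_cycles[op]
                   for issue, (op, _dest) in enumerate(program)]
    hazards = []
    for later in range(len(program)):
        for earlier in range(later):
            same_dest = program[earlier][1] == program[later][1]
            if same_dest and write_cycle[later] <= write_cycle[earlier]:
                hazards.append((earlier, later))
    return hazards


print(waw_hazards(PROGRAM, EX_CYCLES))
```

With these numbers the short ADDD would write F5 before both the MULTD and the DIV, so F5 would end up holding a stale value unless an interlock intervenes.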

WAW hazards • How do we implement interlock? • Either • stall ADDD in ID until DIV in MEM, • or suppress DIV before WB, and let ADDD issue immediately (DIV never writes). • Rare, infrequent, so not a big difference: • Simple: If any instruction in ID tries to write to the same register as an instruction already in the pipe, stall it in ID and don’t let it go to EX.

When can instruction be released from ID? • Structural hazards? Is functional unit free? No write port conflicts? • RAW data hazard? Source registers not listed as pending destinations in a pipeline register that won’t be avail. (via forwarding) when this instruction needs it. • WAW data hazard? Any instructions in the pipe have same destination as this?
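The three checks above can be sketched as a single function that returns the reason an instruction must stay in ID, or `None` if it may issue. The data layout (`Instr`, `Pending`, a set of busy units, a set of reserved write slots) is a hypothetical simplification chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    op: str
    unit: str          # functional unit the instruction needs
    dest: str          # destination register
    sources: tuple     # source registers
    latency: int       # cycles from issue until its register write


@dataclass(frozen=True)
class Pending:
    dest: str                 # destination of an instruction still in the pipe
    cycles_until_ready: int   # cycles until the value can be forwarded


def stall_reason(instr, pending, busy_units, reserved_write_slots) -> Optional[str]:
    # Structural hazards: functional unit busy, or write port already taken.
    if instr.unit in busy_units:
        return "structural: functional unit busy"
    if instr.latency in reserved_write_slots:
        return "structural: write port conflict"
    # RAW: a source is a pending destination that cannot be forwarded yet.
    for p in pending:
        if p.dest in instr.sources and p.cycles_until_ready > 0:
            return "RAW: source " + p.dest + " not ready"
    # WAW: some instruction in the pipe has the same destination.
    for p in pending:
        if p.dest == instr.dest:
            return "WAW: destination " + p.dest + " already pending"
    return None


addd = Instr("ADDD", "fp_add", "F5", ("F3", "F1"), 4)
print(stall_reason(addd, [Pending("F5", 20)], set(), set()))
print(stall_reason(addd, [], set(), set()))
```

The order of the checks does not matter for correctness here, since any one of them is enough to hold the instruction in ID.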

Longer Latency Components: Complications • Out of order execution (Interrupts?)

Out of order execution • Given hazard detection, why is it a problem? • Interrupt. • Can’t drain pipeline, because an early instruction might destroy inputs to later instructions, so they are not restartable. • Options: • Ignore (imprecise interrupts). • Buffer results until earlier ones complete: big buffer, need bypasses to values in buffer • Simulate in software after interrupt • Treat exceptions as hazards and stall an instruction if it would complete while an earlier one may still raise an exception (external interrupts not an issue, since they can be delayed until safe).
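A sketch of the "buffer results until earlier ones complete" option, assuming each instruction carries a program-order sequence number. The class is hypothetical and leaves out the bypass paths the slide mentions; it only shows that register state is updated strictly in program order.

```python
class InOrderCommitBuffer:
    """Holds finished results until every earlier instruction has finished."""

    def __init__(self):
        self.next_to_commit = 0   # sequence number of the oldest uncommitted instr
        self.finished = {}        # seq -> (dest register, value), awaiting commit
        self.registers = {}       # architectural register state

    def complete(self, seq, dest, value):
        """Record a finished result; return the sequence numbers committed now."""
        self.finished[seq] = (dest, value)
        committed = []
        # Drain the buffer only while the oldest instruction is done.
        while self.next_to_commit in self.finished:
            d, v = self.finished.pop(self.next_to_commit)
            self.registers[d] = v
            committed.append(self.next_to_commit)
            self.next_to_commit += 1
        return committed


buf = InOrderCommitBuffer()
print(buf.complete(1, "F5", 2.0))   # finishes early, must wait for instruction 0
print(buf.complete(0, "F1", 1.0))   # oldest finishes, both can now commit
print(buf.registers)
```

If instruction 0 raised an exception instead of completing, the registers would still reflect only instructions before it, which is what makes the interrupt precise.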
