Computer Architecture Lecture Notes, Spring 2005, Dr. Michael P. Frank. (New) Competency Area 6: Introduction to Pipelining
Basic Pipelining Concepts (P&H 3rd ed., Chapter 6; H&P 3rd ed. §A.1)
Pipelining - The Basic Concept
• In early CPUs, deep combinational logic networks were used in between state updates.
  • Signal delays may vary widely across different paths.
  • New input cannot be provided to the network until the slowest paths have finished.
  • Slow clock speed, slow overall processing rates.
• In pipelined design, deep logic networks are subdivided into relatively shallow slices (pipeline stages).
  • Delays through the network are made uniform.
  • A new input can be provided to each slice as soon as its quick, shallow network has finished.
  • Multiple inputs are processed simultaneously across stages.
  • Clock cycle is only as long as the slowest pipeline stage.
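The clock-cycle argument can be checked with a little arithmetic. The sketch below is only an illustration: the slice delays and the register overhead are made-up numbers, not figures from the lecture, and the function names are hypothetical.

```python
# Hypothetical logic delay of each slice, in nanoseconds (not from the lecture).
stage_delays_ns = [1.0, 1.5, 2.0, 1.2, 0.8]
# Assumed delay added by each pipeline register (setup plus clock-to-output).
latch_overhead_ns = 0.1


def unpipelined_cycle(delays, overhead):
    """The clock must cover the whole network plus one register."""
    return sum(delays) + overhead


def pipelined_cycle(delays, overhead):
    """The clock must cover only the slowest slice plus one register."""
    return max(delays) + overhead


def total_time(n_items, n_stages, cycle):
    """Fill the pipe once, then finish one item per cycle."""
    return (n_stages + n_items - 1) * cycle


t_flat = unpipelined_cycle(stage_delays_ns, latch_overhead_ns)
t_pipe = pipelined_cycle(stage_delays_ns, latch_overhead_ns)
n_items = 1000
speedup = (total_time(n_items, 1, t_flat)
           / total_time(n_items, len(stage_delays_ns), t_pipe))

print(f"unpipelined cycle: {t_flat:.1f} ns, pipelined cycle: {t_pipe:.1f} ns")
print(f"speedup on {n_items} items: {speedup:.2f}x")
```

With these assumed delays the five-slice design is roughly 3x faster, not 5x, because the slices are unbalanced and the slowest one sets the clock.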
Generic Pipelining Illustration
• Let [gate symbol] represent any of a variety of logic gates.
• Initial, non-pipelined design for some random block of complex logic:
[Figure: a deep network of logic gates between an input latch and an output latch]
Pipelining Illustration cont.
• Aggressively pipelined version of same logic:
  • Insert extra “pipeline registers” periodically
  • Here, after every 1-2 logic layers
• This design can process 5x as much data at once!
[Figure: the same logic network between two latches, with pipeline registers inserted]
Another View of Pipelining
• Space-time diagrams:
  • Here, each colored area shows which parts of the logic network are occupied with data computed from a given input item, at which times.
[Figure: two space-time diagrams, "Non-Pipelined" and "Pipelined (depth 6)"; axes: time and depth in logic network; colored areas labeled Data 1 and Data 2]
Simple Multicycle RISC Datapath
[Figure: datapath divided into the stages IF, ID, EX, MEM, WB; labels include Next PC, Program Counter, Inst. Reg., Load fr. Mem. Data]
Basic RISC Execution Pipeline
• Basic idea of instruction-execution pipelining:
  • Each instruction spends 1 clock cycle in each of the execution stages (in our example, there are 5).
  • During 1 clock cycle, the pipeline can be processing (different stages of) 5 different instructions simultaneously!
[Figure: pipeline diagram; axes: stage and time]
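The stage-versus-time picture can be regenerated in text form. This is a minimal sketch that assumes an ideal pipeline with no stalls; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    n_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name      # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


def render(rows):
    """Format the rows as a table with a cycle-number header."""
    n_cycles = len(rows[0])
    header = "instr  " + " ".join(f"{c + 1:>3}" for c in range(n_cycles))
    lines = [header]
    for i, row in enumerate(rows):
        lines.append(f"i+{i:<4} " + " ".join(f"{cell:>3}" for cell in row))
    return "\n".join(lines)


rows = pipeline_diagram(5)
print(render(rows))

# Read one column: what each of the 5 instructions is doing in cycle 5.
busy_in_cycle_5 = [row[4] for row in rows]
print(busy_in_cycle_5)
```

Reading a row gives "same instruction, different steps"; reading a column gives "same time, different instructions".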
Different Visualizations
[Figure: the same pipeline drawn in several ways, related by a skew transformation. Annotations: "Same time, different places"; "Same place, different times"; "Same instruction, different steps"; "Same time, different data item / instruction"]
Dependences (from H&P 3rd ed. §3.1)
Dependences
• A dependence is a way in which one instruction can depend on (be impacted by) another for scheduling purposes.
• Three major dependence types:
  • Data dependence
  • Name dependence
  • Control dependence
• I’ll sometimes use the word dependency for a particular instance of one instruction depending on another.
  • Such instructions can’t be effectively (as opposed to just syntactically) fully parallelized or reordered.
Data Dependence
• Recursive definition:
  • Instruction B is data dependent on instruction A iff:
    • B uses a data result produced by instruction A, or
    • There is another instruction C such that B is data dependent on C, and C is data dependent on A.
• When a data dependence is present, there is a potential RAW hazard.
Direct data dependencies in a simple example code fragment:
    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,#8
          BNEZ R1,Loop
[Figure: arrows marking the direct data dependencies in the fragment]
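The recursive definition can be applied mechanically to the loop fragment. The sketch below is an assumption-laden illustration: the read and write sets are written out by hand, only registers are tracked (not memory), and only one pass through the loop body is considered, so loop-carried dependences are ignored.

```python
# Each entry: (text, register written or None, registers read).
program = [
    ("LD   F0,0(R1)", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "F4", ["F0", "F2"]),
    ("SD   0(R1),F4", None, ["R1", "F4"]),
    ("SUBI R1,R1,#8", "R1", ["R1"]),
    ("BNEZ R1,Loop",  None, ["R1"]),
]


def direct_dependences(prog):
    """Map each instruction index to the indices whose results it uses directly."""
    last_writer = {}
    deps = {i: set() for i in range(len(prog))}
    for i, (_, dest, srcs) in enumerate(prog):
        for reg in srcs:
            if reg in last_writer:
                deps[i].add(last_writer[reg])
        if dest is not None:
            last_writer[dest] = i
    return deps


def all_dependences(direct):
    """Apply the recursive rule: B depends on A if B depends on C and C depends on A."""
    closed = {i: set(d) for i, d in direct.items()}
    changed = True
    while changed:
        changed = False
        for b in closed:
            for c in list(closed[b]):
                extra = closed[c] - closed[b]
                if extra:
                    closed[b] |= extra
                    changed = True
    return closed


direct = direct_dependences(program)
closed = all_dependences(direct)
for i, (text, _, _) in enumerate(program):
    print(f"{text:<16} direct: {sorted(direct[i])}  all: {sorted(closed[i])}")
```

The store picks up the load as an indirect producer (through ADDD), which is the second clause of the definition at work.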
Name Dependence
• Occurs when two instructions access the same data storage location, but are not data dependent.
  • Also, at least one of the accesses must be a write.
• Two sub-types (for inst. B after inst. A):
  • Antidependence: A reads, then B writes.
    • Potential for a WAR hazard.
  • Output dependence: A writes, then B writes.
    • Potential for a WAW hazard.
• Note: Name dependencies can be avoided by changing instructions to use different locations
  • (Rather than reusing 1 location for 2 purposes.)
  • This fix is called renaming.
[Figure: two timelines, each showing instruction A followed by instruction B]
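Renaming can be illustrated on a toy instruction list. This is a sketch only: the four-instruction program and the fresh names T0, T1, ... are hypothetical, registers are the only locations tracked, and every earlier/later pair is compared without checking for intervening writes.

```python
def name_dependences(prog):
    """List (kind, earlier index, later index) for anti and output dependences."""
    found = []
    for j, (dest_j, _) in enumerate(prog):
        if dest_j is None:
            continue
        for i in range(j):
            dest_i, srcs_i = prog[i]
            if dest_i == dest_j:
                found.append(("output", i, j))   # A writes, then B writes
            if dest_j in srcs_i:
                found.append(("anti", i, j))     # A reads, then B writes
    return found


def rename_registers(prog):
    """Give every write a fresh location and point later reads at it."""
    current = {}                                  # original name -> latest fresh name
    renamed = []
    for n, (dest, srcs) in enumerate(prog):
        new_srcs = [current.get(r, r) for r in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"T{n}"
            current[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


# Hypothetical fragment as (register written, registers read); F0 is reused.
prog = [
    ("F0", ["R1"]),
    ("F4", ["F0", "F2"]),
    ("F0", ["R2"]),
    ("F6", ["F0", "F2"]),
]

before = name_dependences(prog)
after = name_dependences(rename_registers(prog))
print("before renaming:", before)
print("after renaming: ", after)
```

The data dependences survive renaming (each read still sees its producer's value); only the conflicts caused by reusing F0 disappear.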
Control Dependence
• Occurs when the execution of an instruction (as in, will it be executed, or not?) depends on the outcome of some earlier, conditional branch instruction.
• We generally can’t easily change which branches an instruction depends on w/o ruining the program’s functional behavior.
  • However, there are exceptions.
Hazards, Stalls, & Forwarding (H&P 3rd ed. §A.2-3)
Hazards
• Hazards are circumstances which may lead to stalls in the pipeline if not addressed.
  • Stalls are delays, and may be called “bubbles”
• There are three major types of hazards:
  • Structural hazards
    • Not enough HW resources to keep all instrs. moving.
  • Data hazards
    • Data results of earlier instrs. not yet avail. when needed.
  • Control hazards
    • Control decisions resulting from earlier instrs. (branches) not yet made; don’t know which new instrs. to execute.
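Structural Hazard Example
• Suppose you had a combined instruction+data memory w. only 1 read port.
  • Then a load's data access (MEM) and a later instruction's fetch (IF) need the port in the same cycle.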
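Hazards Produce “Bubbles”
[Figure: a bubble rising through the pipeline over time, with skewed and unskewed views. Labels: "Bubble rises", "Progress through pipe", "Time", "Unskew"]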
Textual View
• A pipeline stalled for a structural hazard: a load with only one memory port.
[Table: cycle-by-cycle stage occupancy for the load and the instructions that follow it]
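A textual view of this stall can be generated with a small scheduling model. This is a sketch under simplifying assumptions: only loads use the memory port in MEM, the fetch is the only thing that gets delayed, and all function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def schedule(is_load):
    """Return the 1-based fetch cycle of each instruction.

    is_load[i] is True when instruction i reads data memory in MEM.
    A fetch waits while an earlier load occupies the single port.
    """
    fetch = []
    port_busy = set()                  # cycles in which a load uses the port
    cycle = 1
    for load in is_load:
        while cycle in port_busy:      # structural hazard: insert a bubble
            cycle += 1
        fetch.append(cycle)
        if load:
            port_busy.add(cycle + STAGES.index("MEM"))
        cycle += 1
    return fetch


def render(fetch):
    """One row per instruction; stalled fetch cycles are marked 'stall'."""
    n_cycles = fetch[-1] + len(STAGES) - 1
    lines = []
    prev = 0
    for i, f in enumerate(fetch):
        row = [""] * n_cycles
        for c in range(prev + 1, f):
            row[c - 1] = "stall"
        for s, name in enumerate(STAGES):
            row[f - 1 + s] = name
        lines.append(f"i+{i}: " + " ".join(f"{cell:>5}" for cell in row))
        prev = f
    return "\n".join(lines)


# A load followed by four instructions that do not touch data memory.
fetch_cycles = schedule([True, False, False, False, False])
print(render(fetch_cycles))
```

In this model the load's MEM access falls in cycle 4, so the instruction three slots behind it cannot be fetched until cycle 5.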
Three Types of Data Hazards
• Let i be an earlier instruction, j a later one.
• RAW (read after write)
  • j is supposed to Read a value After i Writes it,
  • But instead j tries to read the value before i has written it.
• WAW (write after write)
  • j should Write to a given place After i Writes there,
  • But they end up writing in the wrong order.
  • Only occurs if >1 pipeline stage can write.
• WAR (write after read)
  • j should Write a new value After i Reads the old,
  • But instead j writes the new value before i has read the old one.
  • Only occurs if writes can happen before reads in pipeline.
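The two "only occurs if" conditions can be checked with a simple timing model. This is a sketch under assumptions that are not in the slides: stages are numbered from 0 at IF, registers are read in ID and written in WB by default, there is no forwarding, and a read in the same cycle as a write is taken to see the new value.

```python
READ_STAGE = 1    # assumed: registers are read in ID
WRITE_STAGE = 4   # assumed: registers are written in WB


def raw_hazard(distance, i_write=WRITE_STAGE, j_read=READ_STAGE):
    """True if j, issued `distance` cycles after i, reads before i writes."""
    return distance + j_read < i_write


def waw_hazard(distance, i_write=WRITE_STAGE, j_write=WRITE_STAGE):
    """True if j's write lands no later than i's write."""
    return distance + j_write <= i_write


def war_hazard(distance, i_read=READ_STAGE, j_write=WRITE_STAGE):
    """True if j's write lands no later than i's read."""
    return distance + j_write <= i_read


# Simple five-stage pipeline: only RAW is possible.
print([raw_hazard(d) for d in (1, 2, 3)])
print(waw_hazard(1), war_hazard(1))

# Hypothetical variations that open up the other two hazards:
print(waw_hazard(1, i_write=7))              # i writes from a later stage than j
print(war_hazard(1, i_read=3, j_write=1))    # j writes early, i reads late
```

With one write stage and reads placed before writes, `waw_hazard` and `war_hazard` are false for every distance of 1 or more, which matches the two restrictions on the slide.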
Data Hazard Prevention
• A clever compiler can often reschedule instructions to avoid a stall.
• A simple example:
  • Original code:
        lw  r2, 0(r4)
        add r1, r2, r3      Note: Stall happens here!
        lw  r5, 4(r4)
  • Transformed code:
        lw  r2, 0(r4)
        lw  r5, 4(r4)
        add r1, r2, r3      No stall needed!
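The effect of the reordering can be counted. The sketch below assumes a pipeline where the only data stall is a load immediately followed by a user of its result (that is, forwarding handles everything else); the tuple encoding and function name are hypothetical.

```python
def load_use_stalls(prog):
    """Count instructions that read the result of the load directly before them.

    prog is a list of (opcode, register written, registers read).
    """
    stalls = 0
    for prev, cur in zip(prog, prog[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


original = [
    ("lw",  "r2", ["r4"]),          # lw  r2, 0(r4)
    ("add", "r1", ["r2", "r3"]),    # add r1, r2, r3
    ("lw",  "r5", ["r4"]),          # lw  r5, 4(r4)
]
# The compiler's transformation: move the independent load up.
transformed = [original[0], original[2], original[1]]

print("original stalls:   ", load_use_stalls(original))
print("transformed stalls:", load_use_stalls(transformed))
```

The swap is legal because the second load neither reads nor writes r1, r2, or r3.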
Simple RISC Pipeline Stall Statistics
• Note that ~1 in 5 loads causes a stall in many programs!
[Figure: chart of the percentage of loads that cause a stall, by benchmark]
Hazard Detection Logic
• Example: Detecting whether an instruction that has just been fetched needs to be stalled 1 cycle because of an immediately preceding load.
[Figure: stages IF, ID, EX, ME, WB with pipeline registers IF/ID, ID/EX, EX/ME, ME/WB]
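The check compares fields of two pipeline registers: the load has reached ID/EX while the newly fetched instruction sits in IF/ID. The sketch below is an assumption about how such a comparison might look; the `PipeReg` class and its field names are hypothetical, and special cases such as a hardwired zero register are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipeReg:
    """Hypothetical contents of a pipeline register; field names are illustrative."""
    op: str = "nop"
    dest: Optional[int] = None   # register written, if any
    src1: Optional[int] = None   # first register read, if any
    src2: Optional[int] = None   # second register read, if any


def load_interlock(if_id: PipeReg, id_ex: PipeReg) -> bool:
    """True if the instruction in IF/ID must wait one cycle for the load in ID/EX."""
    if id_ex.op != "lw" or id_ex.dest is None:
        return False
    return id_ex.dest in (if_id.src1, if_id.src2)


load = PipeReg(op="lw", dest=2, src1=4)                  # lw  r2, 0(r4)
dependent = PipeReg(op="add", dest=1, src1=2, src2=3)    # add r1, r2, r3
independent = PipeReg(op="lw", dest=5, src1=4)           # lw  r5, 4(r4)

print(load_interlock(dependent, load))     # add right behind the load
print(load_interlock(independent, load))   # unrelated load right behind the load
```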
Control Hazards, Branch Prediction, Delayed Branches (H&P 3rd ed., §§A.2-3 & §4.2)
Control Hazards
• Suppose the new PC value was not computed until the MEM stage (like orig. RISC design).
  • Then we must stall 3 clocks after every branch!
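The cost of a 3-clock stall per branch can be estimated with the usual CPI formula. This is a rough sketch: the 15% branch fraction is an assumed round figure for illustration, and the ideal CPI of 1 ignores every other kind of stall.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles=3, base_cpi=1.0):
    """Average cycles per instruction when every branch adds a fixed stall."""
    return base_cpi + branch_fraction * stall_cycles


branch_fraction = 0.15          # assumed share of dynamic instructions that are branches
cpi = cpi_with_branch_stalls(branch_fraction)
slowdown = cpi / 1.0            # relative to an ideal CPI of 1

print(f"CPI = {cpi:.2f}, slowdown = {slowdown:.2f}x")
```

Resolving branches earlier in the pipeline shrinks `stall_cycles`; prediction and delayed branches attack the same term.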
Control Instruction Statistics
• ~10% of dynamic insts. are fwd. cond. branches
• only ~3% are backwards cond. branches
• similar percentage are unconditional branches
Stats on Taken Branches
• ~67% of cond. branches are taken
Delayed Branches
• Machine code sequence:
    Branch instruction
    Delay slot instruction(s)
    ← Branch is taken (if taken) at this point
    Post-branch instructions
Static Branch Prediction
• Earlier we discussed predict-taken, predict-not-taken static prediction strategies
  • Applied uniformly across all branches in program
• Static analysis in compiler may be able to do better, if it can non-uniformly predict whether each specific branch is likely to be taken or not
  • One way: Backwards taken, forwards not taken.
• If we can do better, it can help with static code scheduling to reduce data hazard stalls…
  • Also may assist later dynamic prediction
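"Backwards taken, forwards not taken" needs only the branch address and its target. The sketch below compares it with a uniform predict-taken rule on a made-up trace; the addresses and outcome counts are hypothetical and chosen only to show the mechanics, not to reflect measured behavior.

```python
def predict_taken(branch_pc, target_pc):
    """Uniform strategy: every branch is predicted taken."""
    return True


def predict_btfn(branch_pc, target_pc):
    """Backward branches (typically loops) predicted taken, forward ones not taken."""
    return target_pc < branch_pc


def static_mispredict_rate(trace, predictor):
    """trace holds (branch address, target address, actually taken) tuples."""
    wrong = sum(1 for pc, target, taken in trace
                if predictor(pc, target) != taken)
    return wrong / len(trace)


# Hypothetical trace: a loop-closing branch and a forward skip.
trace = (
    [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]          # backward, mostly taken
    + [(0x30, 0x38, True)] * 3 + [(0x30, 0x38, False)] * 7    # forward, mostly not taken
)

print("always taken:", static_mispredict_rate(trace, predict_taken))
print("BTFN:        ", static_mispredict_rate(trace, predict_btfn))
```

On a trace where forward branches are mostly taken, the ranking would flip, so the result depends entirely on the branch mix.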
Prediction Helps Static Scheduling
            LD    R1,0(R2)
            DSUBU R1,R1,R3
            BEQZ  R1,else
            OR    R4,R5,R6
            DADDU R10,R4,R3
            J     after
    else:   DADDU R7,R8,R9
            …
    after:
[Figure annotations: some data dependences; potential load delay to fill; "Which way will this branch go?"; code movements to consider; if-then-else control flow with an "if" case and an "else" case]
Some Static Prediction Schemes
• Always predict taken
  • 34% mispredict rate on SPEC (range 9%-54%)
• Backwards predict taken, forwards not taken
  • In SPEC, more than ½ of forwards are taken!
  • This does worse than “always predict taken” strategy
  • Usu. not better than 30-40% misprediction rate
• Better than either: Use profile information!
  • Collect statistics on earlier program runs.
  • Works well because individual branches tend to be strongly biased (taken or not) given average data
  • Bias tends to remain stable across multiple runs
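Profile-based prediction amounts to recording each branch's majority direction on one run and reusing it on later runs. The sketch below is illustrative only: both traces are invented, and the function names are hypothetical.

```python
from collections import defaultdict


def build_profile(training_trace):
    """Record, per branch address, whether it was taken at least half the time."""
    counts = defaultdict(lambda: [0, 0])       # address -> [not taken, taken]
    for pc, taken in training_trace:
        counts[pc][int(taken)] += 1
    return {pc: c[1] >= c[0] for pc, c in counts.items()}


def profile_mispredict_rate(trace, profile, default=True):
    """Branches missing from the profile fall back to `default` (predict taken)."""
    wrong = sum(1 for pc, taken in trace
                if profile.get(pc, default) != taken)
    return wrong / len(trace)


# Hypothetical earlier run used to collect statistics.
training = ([(0x40, True)] * 9 + [(0x40, False)]
            + [(0x30, True)] * 2 + [(0x30, False)] * 8)
# Hypothetical later run with different data but similar bias.
later_run = ([(0x40, True)] * 8 + [(0x40, False)] * 2
             + [(0x30, True)] * 3 + [(0x30, False)] * 7)

profile = build_profile(training)
print("profile-based:", profile_mispredict_rate(later_run, profile))
print("always taken: ", profile_mispredict_rate(later_run, {}))
```

The gain comes from the two properties on the slide: each branch is strongly biased, and the bias carries over from the training run to the later one.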
Profile-Based Predictor Statistics
[Figure: chart of profile-based predictor statistics, with the floating-point benchmarks marked]
Predict-Taken vs. Profile-Based
[Figure: instructions executed in between mispredictions for the two schemes, on a log scale, with the floating-point benchmarks marked]