Linear Pipeline Processors
• Cascade of processing stages that are linearly connected
• Performs a fixed function
• k processing stages
• External input is fed in at stage S1
• Final result emerges from stage Sk
EENG-630
Asynchronous Model
• Data flow between adjacent stages is controlled by handshaking
• Si sends a ready signal to Si+1 when ready to transmit
• Si+1 sends an ack signal to Si after receiving the data
• Allows a variable throughput rate at each stage
Synchronous Model
• Clocked latches are used to interface between stages
• Upon arrival of a clock pulse, all latches transfer data to the next stage
• Requires approximately equal delay in all stages
Reservation Table
• Specifies the utilization pattern of successive stages
• For a linear pipeline, follows a diagonal streamline
• A task needs k clock cycles to flow through the pipeline
• One result emerges at each cycle if the tasks are independent of each other
Clock Cycle
• τi = time delay of the circuitry in stage Si
• d = time delay of a latch
• τm = maximum stage delay
• τ = τm + d (clock cycle of the pipeline)
• Data is latched to the master flip-flop of each latch register at the rising edge of the clock pulse
• d also equals the width of the clock pulse (τm >> d)
Pipeline Throughput
• f = 1/τ = pipeline frequency
• At best, one result emerges per cycle, so f represents the maximum throughput
• Actual throughput is lower than f due to initiation delays and dependencies
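As a minimal numeric sketch (the stage and latch delays below are assumed values, not from the slides), the clock period and peak throughput follow directly from the definitions above:

```python
# Assumed stage delays tau_i (in ns) for a hypothetical 3-stage pipeline
stage_delays = [4.0, 6.0, 5.0]
d = 1.0                      # assumed latch delay

tau_m = max(stage_delays)    # maximum stage delay tau_m
tau = tau_m + d              # pipeline clock cycle: tau = tau_m + d
f = 1.0 / tau                # pipeline frequency = maximum throughput
print(tau, f)                # 7.0 ns and about 0.143 results per ns
```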
Clock Skewing
• The same clock pulse may arrive at different stages with a time offset of s
• tmax (tmin) = time delay of the longest (shortest) logic path within a stage
• Choose τm ≥ tmax + s and d ≤ tmin − s
• These constraints bound the clock period: d + tmax + s ≤ τ ≤ τm + tmin − s
• Ideally s = 0, tmax = τm, and tmin = d
Speedup Factor
• Ideally, a k-stage pipeline can process n tasks in k + (n − 1) cycles
• Tk = [k + (n − 1)]τ
• Flow-through delay for a nonpipelined processor = kτ per task
• For n tasks: T1 = nkτ
• Sk = T1/Tk = nk / [k + (n − 1)]
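The speedup formula can be checked with a short sketch (the function name is mine):

```python
def speedup(k, n):
    # S_k = T_1 / T_k = (n * k * tau) / ((k + n - 1) * tau) = n*k / (k + n - 1)
    return n * k / (k + n - 1)

# A single task gains nothing; many tasks approach the ideal speedup of k
print(speedup(4, 1))      # 1.0
print(speedup(4, 1000))   # close to 4
```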
Number of Stages
• Micropipelining: divide at the logic gate level
• Macropipelining: divide at the processor level
• The optimal number of stages should maximize the performance/cost ratio (PCR)
• Clock period: p = t/k + d, so f = 1/p
• Total cost = c + kh
• PCR = f/(c + kh) = 1/[(t/k + d)(c + kh)]
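Setting d(PCR)/dk = 0 gives the optimum k0 = sqrt(t·c/(d·h)). A sketch with assumed parameter values (t, d, c, h below are illustrative, not from the slides):

```python
import math

t, d = 64.0, 1.0    # total flow-through delay and latch delay (assumed)
c, h = 16.0, 4.0    # logic cost and per-stage latch cost (assumed)

def pcr(k):
    # PCR = f / (c + k*h) = 1 / ((t/k + d) * (c + k*h))
    return 1.0 / ((t / k + d) * (c + k * h))

k0 = math.sqrt(t * c / (d * h))       # analytic optimum: sqrt(t*c/(d*h))
best_k = max(range(1, 65), key=pcr)   # brute-force check over k = 1..64
print(k0, best_k)                     # both give 16 for these parameters
```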
Efficiency and Throughput
• Ek = Sk/k = n/[k + (n − 1)]
• Hk = n/([k + (n − 1)]τ) = nf/[k + (n − 1)]
• Hk reaches its maximum f when Ek → 1 as n → ∞
• Hk = Ek·f = Ek/τ = Sk/(kτ)
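A sketch of the efficiency and throughput formulas (function names are mine):

```python
def efficiency(k, n):
    # E_k = S_k / k = n / (k + n - 1)
    return n / (k + n - 1)

def throughput(k, n, tau):
    # H_k = E_k * f = n / ((k + n - 1) * tau)
    return efficiency(k, n) / tau

print(efficiency(4, 1))       # 0.25: a single task keeps only 1/k of the stages busy
print(efficiency(4, 10**6))   # approaches 1 as n grows
```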
Dynamic Pipeline
• Can be reconfigured to perform variable functions at different times
• Allows feedforward and feedback connections, making the pipeline nonlinear
• Linear pipelines, by contrast, are static pipelines for fixed functions
• By following different dataflow patterns, the same pipeline can evaluate different functions
Reservation Tables
• Multiple reservation tables can be generated for the evaluation of different functions
• Different functions may follow different paths through the pipeline
• There is a one-to-many mapping between a pipeline configuration and its reservation tables
• The number of columns is the evaluation time of a given function
Latency
• Latency: the number of time units between two initiations
• Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision (a resource conflict)
• Forbidden latencies are those that cause collisions
• To find the forbidden latencies, check the distance between any two marks in the same row of the reservation table
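This check is easy to mechanize. Below, a hypothetical reservation table (not from the slides) maps each stage to the set of clock cycles (0-indexed columns) it occupies; the forbidden latencies are the row-wise distances between marks:

```python
# Hypothetical 3-stage, 8-column reservation table
table = {
    "S1": {0, 5, 7},
    "S2": {1, 3},
    "S3": {2, 4, 6},
}

def forbidden_latencies(table):
    # The distance between any two marks in the same row is a forbidden latency
    forb = set()
    for uses in table.values():
        forb |= {b - a for a in uses for b in uses if a < b}
    return forb

print(sorted(forbidden_latencies(table)))   # [2, 4, 5, 7]
```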
Latency Analysis
• Latency sequence: a sequence of permissible latencies between successive task initiations
• Latency cycle: a latency sequence that repeats itself indefinitely
• Average latency: the sum of all latencies in a cycle divided by the number of latencies in the cycle
• Constant cycle: a cycle that contains only one latency value
Collision Vectors
• Maximum forbidden latency: m ≤ n − 1, where n is the number of columns in the reservation table
• Permissible latency: 1 ≤ p ≤ m − 1 (p = 1 is ideal)
• Collision vector: an m-bit binary vector displaying the set of permissible and forbidden latencies
• Ci = 1 if latency i causes a collision (0 otherwise)
• Cm = 1 always (the maximum forbidden latency)
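Building the collision vector from a forbidden-latency set, continuing the hypothetical example above (m = 7, so the string reads C7...C1):

```python
def collision_vector(forbidden):
    # m-bit string C_m ... C_1, with C_i = 1 iff latency i causes a collision
    m = max(forbidden)                # C_m = 1 always (max forbidden latency)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

print(collision_vector({2, 4, 5, 7}))   # 1011010
```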
State Diagrams
• Specify the permissible state transitions among successive initiations
• Initial collision vector: corresponds to the initial state at time 1
• The next state at time t + p is obtained with an m-bit right shift register
• The next state after p shifts is obtained by ORing the initial collision vector with the shifted register contents
Greedy Cycles
• Simple cycle: each state appears only once
• Some simple cycles are greedy cycles: those whose edges are all made with the minimum latencies from their respective starting states
• Their average latencies must be lower than those of other simple cycles
• The one with the minimal average latency (MAL) is chosen
Bounds on MAL
• The MAL is lower-bounded by the maximum number of marks in any row of the reservation table
• The MAL is lower than or equal to the average latency of any greedy cycle in the state diagram
• The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1 (this is also an upper bound on the MAL)
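The state diagram is small enough to search exhaustively. The sketch below builds states from the collision vector (bit i−1 represents latency i), enumerates simple cycles, and returns the minimal average latency; with the hypothetical forbidden set {2, 4, 5, 7} it finds MAL = 3, which meets the lower bound of three marks in the busiest row:

```python
def mal(forbidden):
    """Minimal average latency over all simple cycles of the state diagram."""
    m = max(forbidden)
    init = sum(1 << (i - 1) for i in forbidden)    # initial collision vector

    def successors(state):
        # Latencies 1..m whose bit is 0 are permissible; m+1 stands in for
        # every latency greater than m (those return toward the initial state)
        for p in range(1, m + 2):
            if p > m or not (state >> (p - 1)) & 1:
                yield p, (state >> p) | init       # p-bit right shift, OR with C

    best = [float("inf")]

    def dfs(state, states, lats):
        for p, nxt in successors(state):
            if nxt in states:                      # closed a simple cycle
                i = states.index(nxt)
                cycle = lats[i:] + [p]
                best[0] = min(best[0], sum(cycle) / len(cycle))
            else:
                dfs(nxt, states + [nxt], lats + [p])

    dfs(init, [init], [])
    return best[0]

print(mal({2, 4, 5, 7}))   # 3.0
```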
Optimizing the Schedule
• A greedy cycle alone does not guarantee an optimal MAL; meeting the lower bound does
• The lower bound may be approached by modifying the reservation table
• Try to reduce the maximum number of marks in any row
• The modified table must preserve the original function being evaluated
Delay Insertion
• Noncompute delay stages are used to increase pipeline performance with a shorter MAL
• The purpose is to modify the reservation table
• This yields a new collision vector
• Which in turn results in a modified state diagram
Pipeline Throughput
• Initiation rate: the average number of task initiations per cycle
• If N tasks are initiated in n cycles, the initiation rate (pipeline throughput) is N/n
• The scheduling strategy affects performance: the shorter the MAL, the higher the throughput
• Unless the MAL is reduced to 1, the throughput is a fraction of one task per cycle
Pipeline Efficiency
• Stage utilization: the percentage of time each stage is used over a long series of task initiations
• The accumulated rate determines the efficiency
• Higher efficiency implies less idle time and higher throughput
Instruction Execution Phases
• Instruction execution consists of fetch, decode, operand fetch, execute, and write-back phases
• These phases are ideal for overlapped execution on a linear pipeline
• Each phase may require one or more clock cycles
Instruction Pipeline Stages
• Fetch: fetches instructions from the cache
• Decode: reveals the function to perform and identifies the needed resources
• Issue: reserves resources, maintains control interlocks, and reads register operands
• Execute: one or several stages
• Writeback: writes results into the registers
Prefetch Buffers
• Three types of buffers can be used to match the instruction fetch rate to the pipeline consumption rate
• Sequential buffers: for in-sequence pipelining
• Target buffers: hold instructions from a branch target
• Loop buffers: hold sequential instructions within a loop
• A block of instructions is fetched into a prefetch buffer in one memory access time
Multiple Functional Units
• The bottleneck stage is the one with the maximum number of marks in its row of the reservation table
• Resolve it by using multiple copies of the same stage simultaneously
• Reservation stations for each unit are used to resolve data or resource dependencies
Reservation Stations
• Operands wait in the RS until their data dependencies have been resolved
• Each RS has an ID tag, monitored by a tag unit
• Tags allow the hardware to resolve conflicts between source and destination registers
• Reservation stations also serve as buffers
Internal Data Forwarding
• Improves throughput further by replacing some memory-access operations with register transfer operations
• Store-load forwarding: the load is replaced by a move operation
• Load-load forwarding: the second load is replaced by a move operation
• Store-store overwriting: the first store operation is removed
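A toy peephole pass illustrating store-load forwarding (the instruction encoding and function name are mine, purely for illustration):

```python
def store_load_forward(ops):
    """ops: (opcode, destination, source) triples. A load that reads the
    memory word just written by a store becomes a register move."""
    out = list(ops)
    for i in range(len(out) - 1):
        op1, op2 = out[i], out[i + 1]
        if op1[0] == "store" and op2[0] == "load" and op2[2] == op1[1]:
            out[i + 1] = ("move", op2[1], op1[2])   # forward the stored register
    return out

print(store_load_forward([("store", "M", "R1"), ("load", "R2", "M")]))
# the load becomes ("move", "R2", "R1")
```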
Hazard Avoidance
• Reads and writes of shared variables by different instructions may lead to different results if executed out of order
• Three types of hazards: RAW, WAW, and WAR
• Domain D(I) = input set of instruction I
• Range R(I) = output set of instruction I
Hazard Conditions
• RAW: R(I) ∩ D(J) ≠ ∅ (flow dependence)
• WAW: R(I) ∩ R(J) ≠ ∅ (output dependence)
• WAR: D(I) ∩ R(J) ≠ ∅ (antidependence)
• These conditions are necessary, but not sufficient
• Whether a hazard occurs depends on the order in which the two instructions are executed
• A special tag bit on each operand register indicates whether it is safe or hazard-prone
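The three conditions can be checked mechanically from each instruction's domain and range. This sketch only tests the necessary set intersections; whether a hazard actually occurs still depends on execution order:

```python
def hazards(I, J):
    """I issues before J. Each instruction is a dict with domain 'D' (inputs)
    and range 'R' (outputs) given as sets of register names."""
    found = []
    if I["R"] & J["D"]:
        found.append("RAW")   # flow dependence
    if I["R"] & J["R"]:
        found.append("WAW")   # output dependence
    if I["D"] & J["R"]:
        found.append("WAR")   # antidependence
    return found

add_r0 = {"D": {"R0", "R1"}, "R": {"R0"}}   # Add R0, R1  (R0 <- R0 + R1)
move   = {"D": {"R5"}, "R": {"R1"}}         # Move R1, R5 (R1 <- R5)
print(hazards(add_r0, move))                # ['WAR']: add reads R1, move writes it
```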
Static Scheduling
• Data dependencies create interlocked relationships between sequences of instructions
• They can be resolved by a compiler-based static scheduling approach
• The compiler increases the separation between interlocked instructions
• Cheaper to implement and flexible to apply
Static Scheduling
Original order (the multiply is held up by the previous load):
• Add R0, R1 (2 cycles)
• Move R1, R5 (1 cycle)
• Load R2, M(a) (2 cycles)
• Load R3, M(b) (2 cycles)
• Mult R2, R3 (3 cycles)
Reordered by the compiler (no delay for the multiply):
• Load R2, M(a) (2 cycles)
• Load R3, M(b) (2 cycles)
• Add R0, R1 (2 cycles)
• Move R1, R5 (1 cycle)
• Mult R2, R3 (3 cycles)
Tomasulo's Algorithm
• A hardware dependence-resolution scheme
• Resolves resource conflicts as well as data dependencies using register tagging
• An issued instruction whose operands are not available is forwarded to an RS associated with the unit it will use
CDC Scoreboarding
• Dynamic instruction scheduling hardware
• The scoreboard unit keeps track of the registers needed by instructions waiting for functional units
• When all registers have valid data, the scoreboard enables execution
• When an instruction finishes, its resources are released
Branching Terms
• Fetching a nonsequential instruction after a branch instruction is called branch taken
• The instruction to be executed after a branch taken is called the branch target
• The number of cycles between a branch taken and its target is called the delay slot (denoted by b)