Linear Pipeline Processors


Presentation Transcript


  1. Linear Pipeline Processors • A cascade of processing stages that are linearly connected • Performs a fixed function • k processing stages • External input is fed in at stage S1 • The final result emerges from stage Sk

  2. Asynchronous Model • Data flow between adjacent stages is controlled by handshaking (see the sketch below) • Si sends a ready signal to Si+1 when it is ready to transmit • Si+1 sends an ack signal to Si after receiving the data • Allows a variable throughput rate at different stages
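
As a concrete illustration of the ready/ack protocol, here is a minimal sketch in Python that models two adjacent stages as threads; the one-slot queue stands in for the data path and the event for the ack line. All names and values are illustrative assumptions, not from the slides.

```python
import threading, queue

data_line = queue.Queue(maxsize=1)   # data path from Si to Si+1 (one item wide)
ack = threading.Event()              # ack line from Si+1 back to Si

def stage_i(items):
    for x in items:
        data_line.put(x)             # assert "ready": data is on the line
        ack.wait()                   # block until Si+1 acknowledges receipt
        ack.clear()

def stage_i_plus_1(n):
    for _ in range(n):
        x = data_line.get()          # take the data off the line
        print("S(i+1) received", x)  # ...stage processing would go here...
        ack.set()                    # send ack so Si can transmit the next item

t1 = threading.Thread(target=stage_i, args=([10, 20, 30],))
t2 = threading.Thread(target=stage_i_plus_1, args=(3,))
t1.start(); t2.start(); t1.join(); t2.join()
```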

  3. [Figure slide; image not included in the transcript]

  4. Synchronous Model • Clocked latches are used to interface between stages • Upon arrival of the clock pulse, all latches transfer data to the next stage • Approximately equal delay in all stages

  5. Reservation Table • Specifies the utilization pattern of successive stages • For a linear pipeline, it follows a diagonal streamline • A task needs k clock cycles to flow through • One result emerges each cycle if the tasks are independent of one another

  6. Clock Cycle • τi = time delay of the circuitry in stage Si • d = time delay of a latch • τm = max stage delay • τ = τm + d (clock cycle of the pipeline) • Data is latched to the master flip-flop of each latch register at the rising edge of the clock pulse • d also equals the width of the clock pulse (τm >> d)

  7. Pipeline Throughput • f = 1/τ = pipeline frequency • At best, one result emerges per cycle; therefore f represents the maximum throughput • Actual throughput < f due to initiation latency and dependencies (see the worked example below)
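
A quick numeric check of the formulas on slides 6 and 7, as a minimal sketch; the stage and latch delays are made-up values.

```python
# tau = tau_m + d and f = 1/tau, with assumed example delays.
stage_delays_ns = [12.0, 15.0, 10.0, 14.0]   # tau_i for stages S1..S4
d = 2.0                                       # latch delay (ns)

tau_m = max(stage_delays_ns)                  # the slowest stage dominates
tau = tau_m + d                               # clock cycle of the pipeline
f = 1.0 / tau                                 # pipeline frequency = max throughput
print(f"tau = {tau} ns, f = {f:.4f} results/ns")
```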

  8. Clock Skewing • The same clock pulse may arrive at different stages with a time offset of s • tmax (tmin) = time delay of the longest (shortest) logic path within a stage • Choose τm ≥ tmax + s and d ≤ tmin − s • This bounds the clock period: d + tmax + s ≤ τ ≤ τm + tmin − s • Ideally s = 0, tmax = τm, and tmin = d

  9. Speedup Factor • Ideally a k-stage pipeline can process n tasks in k + (n − 1) cycles: Tk = [k + (n − 1)]τ • Flow-through delay is kτ for a nonpipelined processor, so for n tasks T1 = nkτ • Sk = T1/Tk = nk / [k + (n − 1)] (see the worked example below)
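
A worked instance of the speedup formula; k, n, and τ below are assumed example values.

```python
k, n, tau = 5, 100, 10.0              # stages, tasks, clock cycle (ns): assumed
T_k = (k + (n - 1)) * tau             # pipelined execution time
T_1 = n * k * tau                     # nonpipelined: k*tau flow-through per task
S_k = T_1 / T_k                       # = n*k / (k + n - 1)
print(f"S_k = {S_k:.2f}, approaching k = {k} as n grows")
```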

  10. Number of Stages • Micropipelining: divide at the logic gate level • Macropipelining: divide at the processor level • The optimal number of stages should maximize the performance/cost ratio • With total flow-through delay t and latch delay d, the clock period is p = t/k + d and f = 1/p • Total pipeline cost = c + kh, where c is the cost of all stage logic and h is the cost of each latch • PCR = f/(c + kh) = 1/[(t/k + d)(c + kh)] (see the sketch below)
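
The following sketch sweeps k to locate the PCR maximum for assumed cost and delay constants; setting dPCR/dk = 0 analytically gives the continuous optimum k0 = sqrt(tc/(dh)), and the sweep should land next to it.

```python
import math

t, d = 100.0, 1.0     # total flow-through delay and latch delay (assumed units)
c, h = 50.0, 5.0      # stage-logic cost and per-latch cost (assumed)

def pcr(k):
    # PCR = f / (c + k*h) with f = 1 / (t/k + d)
    return 1.0 / ((t / k + d) * (c + k * h))

best_k = max(range(1, 65), key=pcr)   # discrete sweep over stage counts
k0 = math.sqrt(t * c / (d * h))       # analytic optimum (continuous k)
print(f"sweep: k = {best_k}, analytic k0 = {k0:.1f}")
```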

  11. Efficiency and Throughput • Efficiency: Ek = Sk/k = n/[k + (n − 1)] • Throughput: Hk = n/{[k + (n − 1)]τ} = nf/[k + (n − 1)] • Hk reaches its maximum f when Ek → 1 as n → ∞ • Hk = Ek·f = Ek/τ = Sk/(kτ)
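
Continuing the assumed example from slide 9, efficiency and throughput follow directly:

```python
k, n, tau = 5, 100, 10.0               # same assumed values as before
E_k = n / (k + n - 1)                  # efficiency = S_k / k
H_k = E_k / tau                        # throughput = E_k * f = E_k / tau
print(f"E_k = {E_k:.3f}, H_k = {H_k:.4f} results/ns")
```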

  12. Dynamic Pipeline • Can be reconfigured to perform variable functions at different times • Allows feedforward and feedback connections, making the pipeline nonlinear • Linear pipelines, in contrast, are static and perform fixed functions • By following different dataflow patterns, the same pipeline can be used to evaluate different functions

  13. Reservation Tables • Multiple reservation tables can be generated for the evaluation of different functions • Different functions may follow different paths through the pipeline • There is a one-to-many mapping between a pipeline configuration and its reservation tables • The number of columns is the evaluation time of a given function

  14. [Figure slide; image not included in the transcript]

  15. Latency • The number of time units between two initiations • Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision, i.e., a resource conflict • Forbidden latencies: latencies that cause collisions • To detect the forbidden latencies, check the distance between any two marks in the same row of the reservation table (see the sketch below)
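
A minimal sketch of the forbidden-latency check: collect the distances between marks within each row of a reservation table. The table below is an assumed example, not the slides' figure.

```python
# 1 = stage used in that clock cycle; rows are stages, columns are cycles.
table = {
    "S1": [1, 0, 0, 0, 0, 0, 1, 1],
    "S2": [0, 1, 1, 0, 0, 1, 0, 0],
    "S3": [0, 0, 0, 1, 1, 0, 0, 0],
}

def forbidden_latencies(table):
    forbidden = set()
    for row in table.values():
        marks = [t for t, used in enumerate(row) if used]
        # any distance between two marks in the same row causes a collision
        forbidden |= {b - a for i, a in enumerate(marks) for b in marks[i + 1:]}
    return forbidden

print(sorted(forbidden_latencies(table)))   # [1, 3, 4, 6, 7] for this table
```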

  16. Latency Analysis • Latency sequence: a sequence of permissible latencies between successive task initiations • Latency cycle: a latency sequence that repeats itself indefinitely • Average latency: the sum of all latencies in a cycle divided by the number of latencies in it • Constant cycle: a cycle that contains only one latency value

  17. [Figure slide; image not included in the transcript]

  18. [Figure slide; image not included in the transcript]

  19. Collision Vectors • The maximum forbidden latency m ≤ n − 1, where n is the number of columns in the reservation table • Permissible latencies satisfy 1 ≤ p ≤ m − 1 (p = 1 is ideal) • Collision vector: an m-bit binary vector C = (Cm Cm−1 … C2 C1) that displays the set of permissible and forbidden latencies • Ci = 1 if latency i causes a collision, Ci = 0 otherwise • Cm = 1 always, since m is the maximum forbidden latency (see the sketch below)
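
Building on the previous sketch, the collision vector packs the forbidden-latency set into m bits, with bit i−1 standing for latency i:

```python
def collision_vector(forbidden):
    m = max(forbidden)                        # maximum forbidden latency
    bits = sum(1 << (i - 1) for i in forbidden)
    return bits, m

C0, m = collision_vector({1, 3, 4, 6, 7})     # set from the earlier sketch
print(format(C0, f"0{m}b"))                   # "1101101" = (C7 C6 ... C1)
```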

  20. State Diagrams • Specify the permissible state transitions among successive initiations • The initial collision vector corresponds to the initial state at time 1 • The next state, at time t + p, is obtained with an m-bit right shift register • After p right shifts, the next state is obtained by ORing the initial collision vector with the shifted register (see the sketch below)
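
The shift-and-OR rule in code form, as a minimal sketch using the integer collision vector built above:

```python
def next_state(state, p, c0, m):
    # take latency p: right-shift the state by p bits, then OR in the
    # initial collision vector to add the new initiation's constraints
    return ((state >> p) | c0) & ((1 << m) - 1)

C0, m = 0b1101101, 7                  # collision vector from the earlier sketch
s1 = next_state(C0, 2, C0, m)         # latency 2 is permissible (bit 1 of C0 is 0)
print(format(s1, f"0{m}b"))
```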

  21. [Figure slide; image not included in the transcript]

  22. Greedy Cycles • Simple cycles: cycles in which each state appears only once • Some simple cycles are greedy cycles: those whose edges are all made with the minimum latencies from their respective starting states • Their average latencies must be lower than those of other simple cycles • The one with the minimal average latency (MAL) is chosen (see the sketch below)
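
A sketch that follows the greedy strategy from the initial state: always take the smallest permissible latency until a state repeats. The latencies around the repeated state form a greedy cycle, and its average latency upper-bounds the MAL; finding the true MAL in general means examining all simple cycles, which this sketch does not do.

```python
def greedy_cycle(c0, m):
    state, seen, trace = c0, {}, []
    while state not in seen:
        seen[state] = len(trace)
        # smallest permissible latency; m+1 (> max forbidden) always works
        p = next(i for i in range(1, m + 2)
                 if i > m or not (state >> (i - 1)) & 1)
        trace.append(p)
        state = ((state >> p) | c0) & ((1 << m) - 1)
    cycle = trace[seen[state]:]                # latencies around the cycle
    return cycle, sum(cycle) / len(cycle)

print(greedy_cycle(0b1101101, 7))              # a greedy cycle and its average
```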

  23. Bounds on MAL • The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table • It is lower than or equal to the average latency of any greedy cycle in the state diagram • The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1, which is therefore also an upper bound on the MAL

  24. Optimizing the Schedule • Finding a greedy cycle is not sufficient to guarantee the optimality of the MAL; reaching its lower bound is • Approach the lower bound by modifying the reservation table • Try to reduce the maximum number of marks in any row • The modified table must preserve the original function being evaluated

  25. Delay Insertion • Use noncompute delay stages to increase pipeline performance with a shorter MAL • The purpose is to modify the reservation table • This yields a new collision vector • And results in a modified state diagram

  26. [Figure slide; image not included in the transcript]

  27. [Figure slide; image not included in the transcript]

  28. Pipeline Throughput • The initiation rate, or average number of task initiations per cycle • If N tasks are initiated in n cycles, the initiation rate (pipeline throughput) is N/n • The scheduling strategy affects performance: the shorter the MAL, the higher the throughput • Unless the MAL is reduced to 1, the throughput is a fraction of one task per cycle

  29. Pipeline Efficiency • Stage utilization: the percentage of time each stage is used over a long series of task initiations • The accumulated rate of all stage utilizations determines the efficiency • Higher efficiency implies less idle time and higher throughput

  30. Instruction Execution Phases • Instruction execution consists of fetch, decode, operand fetch, execute, and write-back phases • These phases are ideal for overlapped execution on a linear pipeline • Each phase may require one or more clock cycles

  31. Instruction Pipeline Stages • Fetch: fetches instructions from the cache • Decode: reveals the function to perform and identifies the needed resources • Issue: reserves resources, maintains control interlocks, and reads register operands • Execute: one or several stages • Writeback: writes results into the registers

  32. [Figure slide; image not included in the transcript]

  33. [Figure slide; image not included in the transcript]

  34. Prefetch Buffers • Three types of buffers can be used to match the instruction fetch rate to the pipeline consumption rate • Sequential buffers: for in-sequence pipelining • Target buffers: hold instructions fetched from a branch target • Loop buffers: hold sequential instructions within a loop • A block of instructions is fetched into a prefetch buffer in one memory access time

  35. [Figure slide; image not included in the transcript]

  36. Multiple Functional Units • The bottleneck stage is the one with the maximum number of marks in its row of the reservation table • Resolve it by using multiple copies of the same stage simultaneously • Reservation stations for each unit are used to resolve data or resource dependencies

  37. Reservation Stations • Operands wait in the RS until their data dependencies have been resolved • Each RS has an ID tag, monitored by a tag unit • This allows the hardware to resolve conflicts between source and destination registers • Reservation stations also serve as buffers

  38. [Figure slide; image not included in the transcript]

  39. Internal Data Forwarding • Improves throughput further by replacing some memory-access operations with register transfer operations • Store-load forwarding: the load is replaced by a move operation • Load-load forwarding: the second load is replaced by a move operation • Store-store forwarding: the first store operation is removed (see the sketch below)
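
A toy peephole pass illustrating one of the three cases, store-load forwarding; the tuple encoding of instructions is an assumption made for this sketch.

```python
def forward_store_load(prog):
    """Replace a load that reads the word just stored with a register move."""
    out = list(prog)
    for i in range(len(out) - 1):
        op1, dst1, src1 = out[i]          # e.g. ("store", "M(a)", "R1")
        op2, dst2, src2 = out[i + 1]      # e.g. ("load",  "R2",  "M(a)")
        if op1 == "store" and op2 == "load" and src2 == dst1:
            out[i + 1] = ("move", dst2, src1)   # load becomes move R2, R1
    return out

prog = [("store", "M(a)", "R1"), ("load", "R2", "M(a)")]
print(forward_store_load(prog))
# [('store', 'M(a)', 'R1'), ('move', 'R2', 'R1')]
```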

  40. [Figure slide; image not included in the transcript]

  41. Hazard Avoidance • Reads and writes of shared variables by different instructions may lead to different results if they are executed out of order • Three types of hazards: RAW, WAW, and WAR • Domain D(I) = input set of instruction I • Range R(I) = output set of instruction I

  42. [Figure slide; image not included in the transcript]

  43. Hazard Conditions • RAW: R(I) ∩ D(J) ≠ ∅ (flow dependence) • WAW: R(I) ∩ R(J) ≠ ∅ (output dependence) • WAR: D(I) ∩ R(J) ≠ ∅ (antidependence) • These are necessary, but not sufficient, conditions • Whether a hazard occurs depends on the order in which the two instructions are executed • A special tag bit is used with each operand register to indicate whether it is safe or hazard-prone (see the sketch below)
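
The set conditions translate directly into code; here is a minimal sketch with an assumed (domain, range) encoding per instruction:

```python
def hazards(I, J):
    """Hazards when I precedes J; each instruction is (domain, range) sets."""
    D_I, R_I = I
    D_J, R_J = J
    found = []
    if R_I & D_J: found.append("RAW (flow)")     # J reads what I writes
    if R_I & R_J: found.append("WAW (output)")   # both write the same register
    if D_I & R_J: found.append("WAR (anti)")     # J overwrites what I reads
    return found

add  = ({"R0", "R1"}, {"R0"})    # Add R0, R1: reads R0, R1; writes R0
mult = ({"R0", "R2"}, {"R2"})    # Mult R2, R0: reads R0, R2; writes R2
print(hazards(add, mult))        # ['RAW (flow)']
```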

  44. Static Scheduling • Data dependencies create an interlocked relationship between a sequence of instructions • They can be resolved by a compiler-based static scheduling approach • The compiler increases the separation between interlocked instructions • Cheaper to implement and flexible to apply

  45. Static Scheduling Example

Original program (the multiply is held up by the preceding load):

    Add  R0, R1      2 cycles
    Move R1, R5      1 cycle
    Load R2, M(a)    2 cycles
    Load R3, M(b)    2 cycles
    Mult R2, R3      3 cycles

After static scheduling (no delay for the multiply):

    Load R2, M(a)    2 cycles
    Load R3, M(b)    2 cycles
    Add  R0, R1      2 cycles
    Move R1, R5      1 cycle
    Mult R2, R3      3 cycles

  46. Tomasulo's Algorithm • A hardware dependence-resolution scheme • Resolves resource conflicts as well as data dependencies using register tagging • An issued instruction whose operands are not yet available is forwarded to the RS associated with the functional unit it will use

  47. [Figure slide; image not included in the transcript]

  48. CDC Scoreboarding • Dynamic instruction-scheduling hardware • The scoreboard unit keeps track of the registers needed by instructions waiting for functional units • When all of an instruction's registers hold valid data, the scoreboard enables its execution • When the instruction finishes, its resources are released

  49. [Figure slide; image not included in the transcript]

  50. Branching Terms • Fetching a nonsequential instruction after a branch instruction is called branch taken • The instruction to be executed after a branch taken is called the branch target • The number of cycles between a branch taken and its target is called the delay slot (denoted by b)
