
CS 7810 Lecture 5

CS 7810 Lecture 5. The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays. M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar. UT-Austin and Compaq, ISCA’02.


Presentation Transcript


  1. CS 7810 Lecture 5 • The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays • M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar • UT-Austin and Compaq • ISCA’02

  2. Improvements in Clock Speed • [Figure: clock speed milestones 33MHz, 66MHz, 100MHz, 200MHz, 450MHz, 1GHz, 2GHz, plotted across process generations from 1000nm to 130nm]

  3. Definitions • Clock period: $\phi = \phi_{logic} + \phi_{latch} + \phi_{skew} + \phi_{jitter}$ • $\phi_{logic}$: the actual work being done in one stage • $\phi_{latch}$: data has to be saved in latch registers at the end of each pipeline stage (1 FO4, where 1 FO4 = 36ps at 100nm) • $\phi_{skew}$: two parts of the circuit may receive their clocks through different paths, resulting in a slight phase difference (0.3 FO4) • $\phi_{jitter}$: unpredictable variations in clock arrival time (0.5 FO4)
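
As a quick check of the clock-period decomposition above, the following minimal Python sketch adds the three overhead terms quoted on the slide (latch 1 FO4, skew 0.3 FO4, jitter 0.5 FO4) to a chosen logic depth and converts the result to a clock frequency using 1 FO4 = 36ps at 100nm. The function and constant names are illustrative assumptions, not taken from the paper.

```python
# Overhead terms quoted on the slide, in FO4 inverter delays.
LATCH_FO4 = 1.0
SKEW_FO4 = 0.3
JITTER_FO4 = 0.5
FO4_PS = 36.0  # 1 FO4 = 36 ps at 100nm (from the slide)


def clock_period_fo4(logic_fo4):
    """Clock period = logic + latch + skew + jitter, all in FO4."""
    return logic_fo4 + LATCH_FO4 + SKEW_FO4 + JITTER_FO4


def clock_ghz(logic_fo4):
    """Clock frequency in GHz for a given logic depth per stage."""
    period_ps = clock_period_fo4(logic_fo4) * FO4_PS
    return 1000.0 / period_ps  # a period of 1000 ps corresponds to 1 GHz


if __name__ == "__main__":
    for logic in (8, 6, 4):
        print(f"logic={logic} FO4 -> period={clock_period_fo4(logic):.1f} FO4, "
              f"clock={clock_ghz(logic):.2f} GHz")
```

With these numbers the fixed overhead is 1.8 FO4 per stage, so it takes up a growing share of the clock period as the logic depth shrinks.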

  4. Processor Model • An Alpha-like processor with latencies updated for 100nm • Simplification: the study is insensitive to the technology generation • Note that all structures are perfectly pipelined: this is a “Limit of Pipelining” study

  5. Effect of Deep Pipelining
     Logic latencies: Add = 16 FO4 • Mpred = 128 FO4 • Load from mem = 400 FO4 • Mult = 160 FO4 • Overhead = 2 FO4 per stage

     Latency of each operation (in FO4) at two clock periods:
     Clock Period   add             mpred          load   mult
     18 FO4         16+2 = 18       8x18 = 144     400    180
     10 FO4         8+2+8+2 = 20    16x10 = 160    400    200

     Clock Period   FO4s                      Cycles   Clock speed
     18 FO4         18+144+400+180 = 742      42       1.54GHz
     10 FO4         20+160+400+200 = 780      78       2.78GHz
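
The arithmetic on this slide can be reproduced with a short Python sketch. It assumes, as the slide's numbers suggest, that each pipelined operation is split into stages of the chosen logic depth with 2 FO4 of overhead added per stage, while the load from memory is a fixed 400 FO4 whose cycle count is rounded up to whole clock periods. The names `summarize`, `PIPELINED_LOGIC_FO4` and `LOAD_FO4` are hypothetical, chosen only for illustration.

```python
import math

FO4_PS = 36.0        # 1 FO4 = 36 ps at 100nm (from the slides)
OVERHEAD_FO4 = 2     # per-stage overhead used in the slide's example

# Logic latency of each pipelined operation, in FO4 (from the slide).
PIPELINED_LOGIC_FO4 = {"add": 16, "mpred": 128, "mult": 160}
# Assumption: the memory load is a fixed 400 FO4 regardless of clock period.
LOAD_FO4 = 400


def summarize(logic_per_stage_fo4):
    """Return (total FO4, total cycles, clock in GHz) for the example."""
    period = logic_per_stage_fo4 + OVERHEAD_FO4
    total_fo4 = LOAD_FO4
    # Assumption: the load occupies a whole number of clock cycles.
    total_cycles = math.ceil(LOAD_FO4 / period)
    for logic in PIPELINED_LOGIC_FO4.values():
        stages = math.ceil(logic / logic_per_stage_fo4)
        total_fo4 += stages * period
        total_cycles += stages
    clock_ghz = 1000.0 / (period * FO4_PS)
    return total_fo4, total_cycles, clock_ghz


if __name__ == "__main__":
    for logic_depth in (16, 8):
        fo4, cycles, ghz = summarize(logic_depth)
        print(f"period={logic_depth + OVERHEAD_FO4} FO4: "
              f"{fo4} FO4, {cycles} cycles, {ghz:.2f} GHz")
```

The deeper pipeline takes more total FO4 (and many more cycles) to complete the same chain of operations, even though its clock is faster.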

  6. Yet, Performance Increases… • Deepening a car assembly line → more cars being made at the same time → a new car rolls out at twice the frequency • Independent instructions benefit from deep pipelining • Dependent instructions are slowed down • The latter dominates when pipelining overhead is a large fraction of the clock period

  7. Example Latencies

  8. In-Order Processors • With no overhead, when $\phi_{logic}$ reduces from 8 FO4 to 4 FO4, performance could go up by 100% (like in the car assembly line), but it only goes up by 18% • With overhead, max performance is seen at 6 FO4 for all three benchmark classes • For the Cray, the optimal logic depth per stage was 10.9 FO4 (Int) and 5.4 FO4 (vector) • Degree of parallelism: vector > int-programs-today > int-programs-before (no caches!)
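
To see what the 18% figure implies, the sketch below assumes the usual model that performance is IPC times clock frequency (an assumption; the slide does not state it). With no overhead, halving the logic depth from 8 to 4 FO4 doubles the clock, so the quoted speedup fixes the ratio of new IPC to old IPC. The function name is hypothetical.

```python
def implied_ipc_ratio(speedup, frequency_ratio):
    """IPC_new / IPC_old, assuming performance = IPC * clock frequency."""
    return speedup / frequency_ratio


if __name__ == "__main__":
    # No overhead: halving logic depth from 8 to 4 FO4 doubles the clock.
    frequency_ratio = 8 / 4
    speedup = 1.18  # the 18% gain quoted on the slide
    print(f"IPC ratio = {implied_ipc_ratio(speedup, frequency_ratio):.2f}")
```

Under that assumption, IPC falls to roughly 0.59 of its original value, which is why the doubled clock yields far less than a 100% gain.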

  9. Out-of-Order Processors • Optimal logic delay for integer is 6 FO4, for FP non-vector is 5 FO4, for FP vector is 4 FO4 • These results are insensitive to overhead costs and microarchitecture optimizations • P.S. The effect of o-o-o execution on performance: • Non-vector FP: 0.5 → 1.0 • Integer: 0.8 → 1.8 • Vector FP: 0.9 → 3.5

  10. Out-of-Order Processors

  11. Increased Pipeline Depth • Reasons for IPC decrease: • Longer ALU latencies (not quantified) • Longer load latency (~25% for a 6-cycle increase) • Longer branch mispredict cost (~10%) • Longer wakeup+select (~55%)

  12. Pipelining Wakeup • It takes a long time to broadcast tags across the entire issue queue • Hence, wake the first eight instructions in the first cycle, wake the next eight in the second, and so on • This works well if most ready instructions are in the first stage: a 10-stage pipeline worsens performance by only 11%. Will this change the optimal logic depth?
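
The segmented wakeup described above can be sketched in Python as follows. This is an illustrative model only: it assumes slot 0 is the head of the issue queue and that each segment of eight slots sees the broadcast tag one cycle after the previous segment. The function names are hypothetical and do not come from the paper.

```python
SEGMENT_SIZE = 8  # entries woken per cycle, as on the slide


def wakeup_cycle(slot, segment_size=SEGMENT_SIZE):
    """Cycle (1-based) in which a broadcast tag reaches issue-queue slot `slot`.

    Assumption: slot 0 is the head of the queue (the first instruction).
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return slot // segment_size + 1


def wakeup_schedule(queue_size, segment_size=SEGMENT_SIZE):
    """Map each wakeup cycle to the list of slots woken in that cycle."""
    schedule = {}
    for slot in range(queue_size):
        schedule.setdefault(wakeup_cycle(slot, segment_size), []).append(slot)
    return schedule


if __name__ == "__main__":
    for cycle, slots in wakeup_schedule(32).items():
        print(f"cycle {cycle}: slots {slots[0]}-{slots[-1]}")
```

In this model an instruction in the first segment sees no extra wakeup delay, which matches the slide's point that the scheme works well when most ready instructions sit in the first stage.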

  13. Instruction Select • Stage-1 only goes through one arbiter • Stages 2-4 have a pre-select and go through 2 arbiters • Does well if most ready instructions are in stage-1 (4% loss) • [Figure labels: stage 4, stage 3, stage 2, stage 1, 16-input arbiters, /8]

  14. IssueQ Compaction • Both techniques work well only if instructions move up to occupy empty slots • Wastes energy, increases complexity • Correctness problems: what if you miss the tag while in transit?

  15. Conclusions • Logic per stage will only shrink by a factor of two, which limits clock speed improvements in the future • Pipelining wakeup+select has the biggest impact on IPC

  16. Related Work • Hartstein and Puzak (IBM): Most programs have an optimal pipeline depth between 13 and 30 stages, corresponding to FO4 delays of 4-8 • Sprangle and Carmean (Intel): Optimum pipeline depth is 50-60 stages, corresponding to FO4 delays of 4-5
