CS 7810 Lecture 5
The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays
M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar
UT-Austin and Compaq
ISCA’02

Improvements in Clock Speed
[Figure: clock speed rising through 33 MHz, 66 MHz, 100 MHz, 200 MHz, 450 MHz, 1 GHz, and 2 GHz as process technology shrinks from 1000 nm to 130 nm]

Definitions
• Clock period: φ = φ_logic + φ_latch + φ_skew + φ_jitter
• φ_logic: the actual work being done in one stage
• φ_latch: data has to be saved in latch registers at the end of each pipeline stage (1 FO4 = 36 ps at 100 nm)
• φ_skew: two parts of the circuit may receive their clocks thru different paths, resulting in a slight phase difference (0.3 FO4)
• φ_jitter: unpredictable variations (0.5 FO4)

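As a rough illustration of the clock period definition, the sketch below sums the overhead terms and converts a per-stage logic depth into a clock frequency. The function and constant names are hypothetical, and treating the latch overhead as exactly 1 FO4 is this sketch's reading of the slide's parenthetical, not a confirmed figure.

```python
FO4_PS = 36.0  # one FO4 inverter delay in picoseconds at 100 nm (from the slide)

# Overhead components in FO4. Skew and jitter are from the slide;
# the latch value of 1 FO4 is an assumption of this sketch.
LATCH_FO4 = 1.0
SKEW_FO4 = 0.3
JITTER_FO4 = 0.5


def clock_period_fo4(logic_fo4):
    """Clock period = logic + latch + skew + jitter, all in FO4."""
    return logic_fo4 + LATCH_FO4 + SKEW_FO4 + JITTER_FO4


def clock_ghz(period_fo4):
    """Convert a clock period in FO4 to a frequency in GHz."""
    period_ps = period_fo4 * FO4_PS
    return 1000.0 / period_ps  # a 1 ps period corresponds to 1000 GHz


if __name__ == "__main__":
    for logic in (16, 8, 6):
        period = clock_period_fo4(logic)
        print(f"logic {logic} FO4 -> period {period:.1f} FO4, "
              f"{clock_ghz(period):.2f} GHz")
```

Under these assumptions the overhead is a fixed 1.8 FO4 per stage, so it takes up a growing share of the clock period as the logic depth shrinks.
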
Processor Model
• An Alpha-like processor with latencies updated for 100nm
• Simplification: the study is insensitive to the technology generation
• Note that all structures are perfectly pipelined: this is a “Limit of Pipelining” study

Effect of Deep Pipelining
• Add = 16 FO4
• Mpred = 128 FO4
• Load from mem = 400 FO4
• Mult = 160 FO4
• Overhead = 2 FO4

Clock Period   add        mpred    load   mult
18 FO4         16+2       8x18     400    180
10 FO4         8+2+8+2    16x10    400    200

Clock Period   FO4s                  Cycles   Clock speed
18 FO4         18+144+400+180=742    42       1.54GHz
10 FO4         20+160+400+200=780    78       2.78GHz

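The arithmetic on this slide can be reproduced with a short sketch. The latencies, the 2 FO4 overhead, and the 36 ps FO4 delay come from the slides; the function name `summarize` and the choice to round each operation up to a whole number of cycles are assumptions made here to match the slide's cycle counts.

```python
import math

FO4_PS = 36.0   # FO4 delay in ps at 100 nm (from the Definitions slide)
OVERHEAD = 2    # per-stage overhead in FO4 (from this slide)

# Logic delay in FO4 of each operation that gets pipelined (from this slide).
PIPELINED = {"add": 16, "mpred": 128, "mult": 160}
LOAD_FO4 = 400  # load from memory; its FO4 latency does not change with depth


def summarize(logic_per_stage):
    """Return (clock period in FO4, total FO4, total cycles, clock in GHz)."""
    period = logic_per_stage + OVERHEAD
    total_fo4 = LOAD_FO4
    cycles = math.ceil(LOAD_FO4 / period)  # assumption: round up to whole cycles
    for logic in PIPELINED.values():
        stages = math.ceil(logic / logic_per_stage)
        total_fo4 += stages * period       # each stage pays the overhead again
        cycles += stages
    ghz = 1000.0 / (period * FO4_PS)
    return period, total_fo4, cycles, ghz


if __name__ == "__main__":
    for logic in (16, 8):
        period, fo4s, cycles, ghz = summarize(logic)
        print(f"{period} FO4 clock: {fo4s} FO4, {cycles} cycles, {ghz:.2f} GHz")
```

Halving the logic per stage raises the total latency of the sequence from 742 to 780 FO4, because every extra stage pays the 2 FO4 overhead again, even though the clock runs almost twice as fast.
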
Yet, Performance Increases…
• Deepening a car assembly line → more cars being made at the same time → a new car rolls out at twice the frequency
• Independent instrs benefit from deep pipelining
• Dependent instrs are slowed down
• The latter dominates when pipelining overhead is a large fraction of the clock period

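A toy calculation makes the trade-off concrete. It uses the add latency (16 FO4) and per-stage overhead (2 FO4) from the previous slide and assumes a perfectly pipelined adder with no partial-result forwarding; the function name `add_timings` is hypothetical.

```python
OVERHEAD = 2    # FO4 of overhead per stage (from the previous slide)
ADD_LOGIC = 16  # FO4 of logic in one add (from the previous slide)


def add_timings(logic_per_stage):
    """Return (FO4 between independent adds, FO4 between dependent adds)."""
    period = logic_per_stage + OVERHEAD
    stages = -(-ADD_LOGIC // logic_per_stage)  # ceiling division
    independent_gap = period          # a new independent add starts every cycle
    dependent_gap = stages * period   # a dependent add waits for the full latency
    return independent_gap, dependent_gap


if __name__ == "__main__":
    for logic in (16, 8):
        ind, dep = add_timings(logic)
        print(f"{logic} FO4 logic/stage: independent every {ind} FO4, "
              f"dependent every {dep} FO4")
```

In this model, going from 16 to 8 FO4 of logic per stage lets independent adds issue every 10 FO4 instead of every 18, while a chain of dependent adds slows from 18 to 20 FO4 per link.
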
In-Order Processors
• With no overhead, when the logic per stage shrinks from 8 FO4 to 4 FO4, performance could go up by 100% (like in the car assembly line), but it only goes up by 18%
• With overhead, max performance is seen at 6 FO4 for all three benchmark classes
• For the Cray, the optimal logic depth per stage was 10.9 FO4 (Int) and 5.4 FO4 (vector)
• Degree of parallelism: vector > int-programs-today > int-programs-before (no caches!)

Out-of-Order Processors
• Optimal logic delay for integer is 6 FO4, for FP non-vector is 5 FO4, for FP vector is 4 FO4
• These results are insensitive to overhead costs and microarchitecture optimizations
• P.S. The effect of o-o-o execution on performance:
  • Non-vector FP: 0.5 → 1.0
  • Integer: 0.8 → 1.8
  • Vector FP: 0.9 → 3.5

Increased Pipeline Depth
• Reasons for IPC decrease:
  • Longer ALU latencies (not quantified)
  • Longer load latency (~25% for 6-cyc increase)
  • Longer branch mpred cost (~10%)
  • Longer wakeup+select (~55%)

Pipelining Wakeup
• It takes a long time to broadcast tags across the entire issue queue
• Hence, wake the first eight instructions in the first cycle, wake the next eight in the second, and so on
• This works well if most ready instructions are in the first stage: a 10-stage pipeline worsens performance by only 11%
• Will this change the optimal logic depth?

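The staged broadcast described above can be sketched as follows. This is a minimal model and not the paper's circuit: the queue representation (a list of sets of pending source tags, with `None` for empty slots) and the function names are assumptions made for illustration; only the eight-entries-per-cycle grouping comes from the slide.

```python
ENTRIES_PER_STAGE = 8  # entries reached by the tag broadcast per cycle (from the slide)


def wakeup_cycle(position):
    """Cycle (1-based) in which a broadcast tag reaches the entry at a
    0-based queue position, with position 0 nearest the head."""
    return position // ENTRIES_PER_STAGE + 1


def wake(queue, tag):
    """Return {position: cycle} for every entry still waiting on `tag`.

    `queue` is a list in which each item is the set of source tags an
    entry waits on, or None for an empty slot (hypothetical layout).
    """
    woken = {}
    for pos, waiting in enumerate(queue):
        if waiting is not None and tag in waiting:
            woken[pos] = wakeup_cycle(pos)
    return woken


if __name__ == "__main__":
    queue = [{"r1"}, None, {"r2"}] + [None] * 6 + [{"r1"}]
    print(wake(queue, "r1"))
```

Entries near the head see the tag in the first cycle, while entries further back pay one extra cycle per group of eight, which is why the scheme depends on ready instructions sitting in the first stage.
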
Instruction Select
• Stage-1 only goes through one arbiter
• Stages 2-4 have a pre-select and go thru 2 arbiters
• Does well if most ready instrs are in stage-1 (4% loss)
[Figure: issue queue divided into stages 1 to 4, feeding 16-input arbiters]

IssueQ Compaction
• Both techniques work well only if instructions move up to occupy empty slots
• Wastes energy, increases complexity
• Correctness problems: what if an instruction misses a tag while in transit?

Conclusions
• Logic per stage will only shrink by a factor of two, which limits clock speed improvements in the future
• Pipelining wakeup+select has the biggest impact on IPC

Related Work
• Hartstein and Puzak (IBM): Most programs have an optimal pipeline depth between 13 and 30 stages, corresponding to FO4 delays of 4-8
• Sprangle and Carmean (Intel): Optimum pipeline depth is 50-60 stages, corresponding to FO4 delays of 4-5