CS 7810 Lecture 3
Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures
V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger (UT-Austin), ISCA '00
Previous Papers
• Limits of ILP – it is probably worth doing o-o-o superscalar
• Complexity-Effective – wire delays make the implementations harder and increase latencies
• Today's paper – these latencies severely impact IPC and slow the growth in processor performance
1995-2000
• Clock speed has improved by 50% every year
   – Reduction in logic delays
   – Deeper pipelines
   – This will soon end
• IPC has gone up dramatically (the increased complexity was worth it)
   – Will this end too?
Wire Scaling
• Multiple wire layers – the SIA roadmap predicts dimensions (somewhat aggressive)
• As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically – Table 1)
• Parallel-plate capacitance decreases, but coupling capacitance increases (slight overall increase)
• The equations are different, but the end result is similar to Palacharla's (without repeaters)
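To make the quadratic-resistance claim concrete, here is a minimal first-order sketch of unrepeated wire delay under scaling. The per-mm resistance/capacitance values and the 0.38 distributed-RC constant are illustrative assumptions, not the paper's SIA-derived equations.

```python
# First-order sketch of unrepeated wire-delay scaling. The per-mm R and C
# values and the 0.38 distributed-RC constant are illustrative assumptions,
# not the paper's numbers.
def unrepeated_delay_ps(length_mm, scale):
    """Delay of a fixed-length wire; scale = gate length / 250nm baseline."""
    r_ohm = (100.0 / scale**2) * length_mm  # resistance grows quadratically as cross-section shrinks
    c_ff = 200.0 * length_mm                # capacitance per mm stays roughly flat overall
    return 0.38 * r_ohm * c_ff * 1e-3       # distributed RC: 0.38*R*C; ohm * fF = 1e-3 ps

print(unrepeated_delay_ps(5, 1.0))       # ~190 ps at 250nm
print(unrepeated_delay_ps(5, 35 / 250))  # ~9700 ps at 35nm -- why repeaters are needed
```

Under these (assumed) inputs the 5mm delay balloons from roughly 190ps to nearly 10ns across the shrink, which is why the paper, like Palacharla, turns to repeaters next.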
Wire Scaling
• With repeaters, the delay of a fixed-length wire does not go up quadratically as we shrink gate-width
• In going from 250nm → 35nm:
   – 5mm wire delay: 170ps → 390ps
   – delay to cross X gates: 170ps → 55ps
   – SIA clock speed: 0.75GHz → 13.5GHz
   – delay to cross X gates: 0.13 cycles → 0.75 cycles
• We could increase wire width, but that compromises bandwidth
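A quick sanity check that the cycle counts follow from the quoted absolute delays and SIA clock rates; the four input numbers are the slide's own, only the arithmetic is added here.

```python
# Convert the quoted absolute delays into cycles at each SIA clock rate.
for tech_nm, wire_ps, gates_ps, clock_ghz in [(250, 170, 170, 0.75),
                                              (35, 390, 55, 13.5)]:
    cycle_ps = 1000.0 / clock_ghz  # one clock period in ps
    print(f"{tech_nm}nm: 5mm wire = {wire_ps / cycle_ps:.2f} cycles, "
          f"X gates = {gates_ps / cycle_ps:.2f} cycles")
# 250nm: 5mm wire = 0.13 cycles, X gates = 0.13 cycles
# 35nm:  5mm wire = 5.27 cycles, X gates = 0.74 cycles
```

The same physical wire goes from a rounding error to more than five cycles, while the equivalent logic path stays under one cycle.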
Clock Scaling
• Logic delay (the FO4 delay) scales linearly with gate length
• Likewise, work per pipeline stage has also been shrinking
• The SIA predicts that today's 16 FO4/stage delay will shrink to 5.6 FO4/stage
• A 64-bit add takes 5.5 FO4 – hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies
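To see how these FO4 budgets translate into clock rates, here is a sketch using the common rule of thumb that one FO4 delay is roughly 360ps per micron of drawn gate length; the rule of thumb is an assumption here, the paper derives delays from its own circuit models.

```python
# Clock rate implied by an FO4-per-stage budget at a given gate length.
# Assumes the rule of thumb: FO4 delay ~= 360 ps per um of drawn gate length.
def clock_ghz(gate_len_nm, fo4_per_stage):
    fo4_ps = 0.360 * gate_len_nm  # 360 ps/um == 0.36 ps/nm
    return 1000.0 / (fo4_ps * fo4_per_stage)

for fo4, label in [(16, "conservative"), (8, "aggressive"), (5.6, "SIA-like")]:
    print(f"{label:>12}: {clock_ghz(250, fo4):5.2f} GHz @ 250nm, "
          f"{clock_ghz(35, fo4):5.2f} GHz @ 35nm")
# conservative:  0.69 GHz @ 250nm,  4.96 GHz @ 35nm
#   aggressive:  1.39 GHz @ 250nm,  9.92 GHz @ 35nm
#     SIA-like:  1.98 GHz @ 250nm, 14.17 GHz @ 35nm
```

The 5.6-FO4 case at 35nm lands near the SIA's 13.5GHz projection, which is reassuring for the rule of thumb.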
Clock Scaling
• While the 15-20% annual improvement from technology scaling will continue, the 15-20% improvement from deeper pipelining will cease
On-Chip Wire Delays
• The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
• Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
• Chip area is steadily increasing
• Less than 1% of the chip is reachable in a cycle; it takes ~30 cycles to go across the chip!
• Processors are becoming communication-bound
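A back-of-envelope version of the "less than 1%" claim; the die size and per-mm wire delay below are illustrative placeholders, not the paper's measured values.

```python
# Estimate the fraction of a die reachable in one clock cycle.
import math

cycle_ps = 74.0        # one period at ~13.5 GHz (the 35nm SIA clock)
wire_ps_per_mm = 78.0  # repeated wire: ~390 ps / 5 mm from the earlier slide
die_side_mm = 25.0     # a large future die (assumed)

reach_mm = cycle_ps / wire_ps_per_mm           # distance a signal covers per cycle
frac = math.pi * reach_mm**2 / die_side_mm**2  # one-cycle disc vs. total die area
print(f"{reach_mm:.2f} mm/cycle -> {100 * frac:.1f}% of the die; "
      f"~{die_side_mm / reach_mm:.0f} cycles to cross it")
# 0.95 mm/cycle -> 0.5% of the die; ~26 cycles to cross it
```

With these assumed inputs the estimate lands in the same range as the slide's <1% and ~30-cycle figures.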
Processor Structure Delays
• To model the microarchitecture, they estimate the delays of all wire-limited structures
• Weakness: bypass delays are not considered
Microarchitecture Scaling
• Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down to make structures fit
• Pipeline Scaling: constant capacities; access latencies go up, hence deeper pipelines
• Any other approaches? Replicated Capacity Scaling: a fast core with few resources, but lots of them – high IPC if you can localize communication
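A toy perf = clock × IPC model of why neither baseline strategy wins outright; the IPC values are invented purely to illustrate the tradeoff, the paper's actual numbers come from simulation.

```python
# Toy model: performance = clock (GHz) x IPC.
# The IPC values are made up for illustration, not the paper's results.
def perf(clock_ghz, ipc):
    return clock_ghz * ipc

base = perf(0.75, 1.0)       # 250nm reference design
pipeline = perf(13.5, 0.30)  # full-size structures, but multi-cycle access hurts IPC
capacity = perf(13.5, 0.35)  # single-cycle access, but shrunken structures hurt IPC
print(f"pipeline scaling: {pipeline / base:.1f}x, "
      f"capacity scaling: {capacity / base:.1f}x")
# pipeline scaling: 5.4x, capacity scaling: 6.3x -- same ballpark either way
```

Either way most of the clock gain is eaten by IPC loss, which is the setup for the results that follow.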
IPC Comparisons
[Figure: the three scaling models, with F = functional unit. Pipeline scaling keeps full-size structures (20-entry issue queues, 40-register files) but pays 2-cycle wakeup, 2-cycle regread, and 2-cycle bypass. Capacity scaling shrinks to 15-entry issue queues and 30-register files that fit in a single cycle. Replicated capacity scaling tiles multiple copies of the small configuration (15-IQ, 30 regs per cluster).]
Results
• Every instruction experiences longer latencies
• IPCs are much lower for aggressive clocks
• Overall performance is still comparable across all approaches
Results
• Over 17 years, we see only a 7-fold speedup – an annual increase of 12.5% (the historical trend of ~55% per year would have yielded a ~1,720-fold speedup)
• Growth is slow because pipeline deepening and IPC gains will stagnate
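The two growth rates are just compound growth over 17 years; a quick check:

```latex
% 12.5%/year over 17 years gives the observed ~7x speedup,
% while the historical ~55%/year trend would have given ~1,720x:
\[
  1.125^{17} \approx 7.4
  \qquad \text{vs.} \qquad
  1.55^{17} \approx 1.7 \times 10^{3}
\]
```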
Questionable Assumptions
• Additional transistors are not being used to improve IPC
• All instructions pay wire-delay penalties
Conclusions
• Large monolithic cores will perform poorly – microarchitectures will have to be partitioned
• On-chip caches will be the biggest bottlenecks – 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
• Future proposals should be wire-delay-sensitive
Next Class' Paper
• "Dynamic Code Partitioning for Clustered Architectures", UPC-Barcelona, 2001
• Instruction steering heuristics to balance load and minimize communication