
CS 7960-4 Lecture 4


Presentation Transcript


  1. CS 7960-4 Lecture 4
     Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures
     V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger, UT-Austin, ISCA '00

  2. Previous Papers
     • Limits of ILP – it is probably worth doing out-of-order superscalar
     • Complexity-Effective – wire delays make the implementations harder and increase latencies
     • Today's paper – these latencies severely impact IPCs and slow the growth in processor performance

  3. 1995-2000
     • Figure 1: Clock speed has improved by 50% every year
     • Reduction in logic delays
     • Deeper pipelines → This will soon end
     • IPC has gone up dramatically (the increased complexity was worth it) → Will this end too?

  4. Wire Scaling
     • Multiple wire layers – the SIA roadmap predicts dimensions (somewhat aggressive)
     • As transistor widths shrink, wires become thinner, and their resistivity goes up (quadratically – Table 1)
     • Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase)
     • The equations are different, but the end result is similar to Palacharla's (without repeaters)
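A back-of-envelope sketch of the unrepeated-wire trend on this slide (not the paper's exact equations; the per-mm resistance and capacitance reference values are assumptions chosen only to illustrate the quadratic growth):

```python
# Distributed RC delay of a fixed-length, unrepeated wire as feature size shrinks.
# Assumption: wire width and thickness scale with feature size, so resistance
# per mm grows ~quadratically while capacitance per mm stays roughly constant
# (plate capacitance falls, coupling capacitance rises).
def unrepeated_wire_delay_ps(length_mm, feature_nm, ref_nm=250,
                             r_ref=75.0,    # ohm/mm at 250 nm (assumed value)
                             c_ref=200.0):  # fF/mm, held ~constant (assumed)
    s = ref_nm / feature_nm                 # shrink factor > 1
    R = r_ref * s**2 * length_mm            # total resistance, ohms
    C = c_ref * length_mm * 1e-15           # total capacitance, farads
    return 0.38 * R * C * 1e12              # 0.38*RC for a distributed RC line, in ps

for tech in (250, 180, 130, 100, 70, 50, 35):
    print(f"{tech:3d} nm: 5 mm wire ~ {unrepeated_wire_delay_ps(5.0, tech):6.0f} ps")
```

Without repeaters the 5 mm delay grows by roughly (250/35)^2 ≈ 51x from 250 nm to 35 nm; the next slide's numbers show that repeaters flatten this to about 2.3x (170 ps → 390 ps).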

  5. Wire Scaling
     • With repeaters, delay of a fixed-length wire does not go up quadratically as we shrink gate-width
     • In going from 250nm → 35nm:
       5mm wire delay: 170ps → 390ps
       delay to cross X gates: 170ps → 55ps
       SIA clock speed: 0.75GHz → 13.5GHz
       delay to cross X gates: 0.13 cycles → 0.75 cycles
     • We could increase wire width, but that compromises bandwidth
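A quick arithmetic check of the slide's cycle counts, using the SIA clock targets quoted above (cycles = delay in seconds × clock rate):

```python
# cycles = delay_ps * 1e-12 (s/ps) * clk_ghz * 1e9 (cycles/s) = delay_ps * clk_ghz / 1000
for label, delay_ps, clk_ghz in [("250 nm", 170, 0.75), ("35 nm", 55, 13.5)]:
    print(f"{label}: {delay_ps} ps * {clk_ghz} GHz = {delay_ps * clk_ghz / 1000:.2f} cycles")
# 250 nm: 170 ps * 0.75 GHz = 0.13 cycles
#  35 nm:  55 ps * 13.5 GHz = 0.74 cycles
```

So the absolute delay to cross a fixed number of gates shrinks, but because the clock speeds up even faster, the delay measured in cycles grows from 0.13 to roughly 0.75.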

  6. Clock Scaling
     • Logic delay (the FO4 delay) scales linearly with gate length
     • Likewise, work per pipeline stage has also been shrinking (Fig. 2)
     • The SIA predicts that today's 16 FO4/stage delay will shrink to 5.6 FO4/stage
     • A 64-bit add takes 5.5 FO4 – hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies
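To see how pipeline depth translates into clock rate, here is a sketch using the common rule of thumb that one FO4 delay is roughly 360 ps per micron of drawn gate length (the exact constant varies with process and corner; treat it as an assumption):

```python
# Clock rate implied by gate length and FO4 delays per pipeline stage.
def clock_ghz(gate_len_nm, fo4_per_stage, ps_per_fo4_per_um=360.0):  # assumed constant
    fo4_ps = ps_per_fo4_per_um * gate_len_nm / 1000.0   # one FO4 delay, ps
    return 1000.0 / (fo4_per_stage * fo4_ps)            # period ps -> GHz

for tech in (250, 100, 35):
    print(f"{tech:3d} nm: 16 FO4 -> {clock_ghz(tech, 16):5.2f} GHz,"
          f"  8 FO4 -> {clock_ghz(tech, 8):5.2f} GHz,"
          f"  5.6 FO4 -> {clock_ghz(tech, 5.6):5.2f} GHz")
```

Reassuringly, 16 FO4/stage at 250 nm lands near the 0.75 GHz figure on the previous slide, and 5.6 FO4/stage at 35 nm lands near the 13.5 GHz SIA target.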

  7. Clock Scaling
     • While the 15-20% annual improvement from technology scaling will continue, the 15-20% improvement from deepening pipelines will cease

  8. On-Chip Wire Delays
     • The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
       → Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
     • Chip area is steadily increasing
       → Less than 1% of the chip is reachable in a cycle; 30 cycles to go across the chip!
     • Processors are becoming communication-bound
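An illustrative calculation of why the reachable fraction collapses: the one-cycle span shrinks while the die edge grows, and reachable area goes with the square of the span. The spans and die sizes below are placeholders chosen to reproduce the slide's headline numbers, not values taken from the paper:

```python
def reach_stats(span_mm, die_edge_mm):
    frac = (span_mm / die_edge_mm) ** 2   # reachable fraction of die area
    cross = 2 * die_edge_mm / span_mm     # Manhattan corner-to-corner, in cycles
    return frac, cross

for span, die in [(8.0, 16.0), (1.6, 24.0)]:   # (today-ish, future-ish) placeholders
    frac, cross = reach_stats(span, die)
    print(f"{span} mm span on a {die} mm die: "
          f"{frac:.1%} reachable per cycle, ~{cross:.0f} cycles across")
# 8.0 mm span on a 16.0 mm die: 25.0% reachable per cycle, ~4 cycles across
# 1.6 mm span on a 24.0 mm die: 0.4% reachable per cycle, ~30 cycles across
```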

  9. Processor Structure Delays
     • To model the microarchitecture, they estimate the delays of all wire-limited structures
     • Weakness: bypass delays are not considered

  10. Microarchitecture Scaling
     • Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down so each structure still fits in a cycle
     • Pipeline Scaling: constant capacities; latencies go up, hence deeper pipelines
     • Any other approaches?

  11. Microarchitecture Scaling
      • Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down so each structure still fits in a cycle
      • Pipeline Scaling: constant capacities; latencies go up, hence deeper pipelines
      • Replicated Capacity Scaling: each core is fast with few resources, but there are lots of them – high IPC if you can localize communication

  12. IPC Comparisons
      [Slide diagram, paraphrased: Pipeline Scaling keeps the large structures (20-entry issue queue, 40 registers) but pays 2-cycle wakeup, 2-cycle regread, and 2-cycle bypass; Capacity Scaling shrinks to a 15-entry issue queue and 30 registers with unchanged access latencies; Replicated Capacity Scaling duplicates the small configuration (15-entry issue queue, 30 registers) across clusters of functional units.]

  13. Results
      • Tables on Pg. 10
      • Every instruction experiences longer latencies
      • IPCs are much lower for aggressive clocks
      • Overall performance is still comparable for all approaches

  14. Results
      • In 17 years, we are seeing only a 7-fold speedup (historically, at ~55% per year, it would have been ~1720-fold) – an annual increase of only 12.5%
      • Slow growth because pipeline depth and IPC increases will stagnate
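Checking the slide's growth-rate arithmetic (the ~55%/year figure is the commonly cited historical performance trend, and it compounds to roughly the 1720x quoted above; 7x over 17 years works out to about 12% per year, which the slide rounds to 12.5%):

```python
years = 17
annual = 7 ** (1 / years) - 1
print(f"7x over {years} years -> {annual:.1%} per year")   # ~12.1%/year
print(f"1.55 ** {years} = {1.55 ** years:,.0f}x")          # ~1,721x
```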

  15. Questionable Assumptions
      • Additional transistors are not being used to improve IPC
      • All instructions pay wire-delay penalties

  16. Conclusions
      • Large monolithic cores will perform poorly – microarchitectures will have to be partitioned
      • On-chip caches will be the biggest bottlenecks – 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
      • Future proposals should be wire-delay-sensitive

  17. Next Class' Paper
      • "Dynamic Code Partitioning for Clustered Architectures", UPC-Barcelona, 2001
      • Instruction steering heuristics to balance load and minimize communication

