
Some Challenges in On-Chip Interconnection Networks


Presentation Transcript


  1. Some Challenges in On-Chip Interconnection Networks ECE 284 Spring 2014

  2. 1000-Cores: When Are We Going to Get There? • Intel 48-core SCC • 45nm • Based on x86 cores • [Mattson, SC’10]

  3. 1000-Cores: When Are We Going to Get There? • Intel x86 SCC extrapolation, assuming 2x cores per generation: 48 cores (45 nm, 2010), 96 cores (32 nm, 2012), 192 cores (22 nm, 2014), 384 cores (16 nm, 2016), 768 cores (11 nm, 2018), 1536 cores (8 nm, 2020) ** Note: These are extrapolations, not product announcements

  4. 1000-Cores: When Are We Going to Get There? • Tilera Gx100 • 40nm • MIPS core • [Tilera website] • [Chart repeats the Intel x86 SCC extrapolation from the previous slide] ** Note: These are extrapolations, not product announcements

  5. 1000-Cores: When Are We Going to Get There? • Tilera extrapolation, assuming 2x cores per generation: 100 cores (40 nm, 2011), 200 cores (28 nm, 2013), 400 cores (20 nm, 2015), 800 cores (14 nm, 2017), 1600 cores (10 nm, 2019) • Intel x86 SCC extrapolation, assuming 2x cores per generation: 48 cores (45 nm, 2010), 96 cores (32 nm, 2012), 192 cores (22 nm, 2014), 384 cores (16 nm, 2016), 768 cores (11 nm, 2018), 1536 cores (8 nm, 2020) ** Note: These are extrapolations, not product announcements
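A quick sanity check on slides 3-5: the roadmaps are nothing more than doubling the core count every process generation from a real starting part. A minimal sketch (the starting points are the shipping Intel SCC and Tilera parts; everything after them is extrapolation, as the slides note):

```python
# Toy extrapolation behind slides 3-5: double the core count each process
# generation; only the first entry of each line is a shipping part.

def extrapolate(start_year, start_cores, nodes_nm, years_per_gen=2):
    cores = start_cores
    for gen, node in enumerate(nodes_nm):
        print(f"{start_year + gen * years_per_gen}: {cores:5d} cores ({node} nm)")
        cores *= 2

extrapolate(2010, 48, [45, 32, 22, 16, 11, 8])   # Intel x86 SCC line
extrapolate(2011, 100, [40, 28, 20, 14, 10])     # Tilera line
```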

  6. Throughput Wall • Effective network throughput per core of commonly used NoC architectures drops exponentially, by roughly 1.4x (√2) with each process generation!

  7. Throughput Wall • Mesh topology most widely used • Cross traffic increases with radix k [Figure: 4 x 4 mesh of processor tiles]

  8. Throughput Wall • Normalized throughput for uniform traffic: Tuniform = 4/k [Bar chart: normalized throughput vs. process node: 40 nm (k=10), 28 nm (k=14), 20 nm (k=20), 14 nm (k=28), 10 nm (k=40)]

  9. Throughput Wall • Normalized throughput for worst-case traffic: Tworst-case = 1/(k - 1), shown alongside Tuniform = 4/k [Bar chart: normalized throughput vs. process node: 40 nm (k=10), 28 nm (k=14), 20 nm (k=20), 14 nm (k=28), 10 nm (k=40)]
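Both curves fall straight out of the formulas on these two slides. A minimal sketch, assuming unit channel bandwidth (1 flit/cycle per channel) so the printed numbers are directly the normalized values plotted above:

```python
# Per-core saturation throughput of a k x k mesh, assuming unit channel bandwidth:
#   uniform random traffic:  T_uniform    = 4 / k        (bisection-limited)
#   worst-case permutation:  T_worst_case = 1 / (k - 1)
for node_nm, k in [(40, 10), (28, 14), (20, 20), (14, 28), (10, 40)]:
    print(f"{node_nm:2d} nm (k={k:2d}): "
          f"uniform {4.0 / k:.2f}  worst-case {1.0 / (k - 1):.3f}")
```

Each process generation multiplies k by about √2 (2x cores), so both throughput figures shrink by about √2 per generation.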

  10. Latency Wall • Worst-case hop count increases with k: Hworst-case = 2(k - 1) [Bar chart: worst-case hop count vs. process node: 40 nm (k=10), 28 nm (k=14), 20 nm (k=20), 14 nm (k=28), 10 nm (k=40)]

  11. Latency Wall • When k = 32 (~1000 cores), Hworst-case = 2(k - 1) = 62 • If the average queue length is 10 flits per router, node-to-node latency can exceed 600 cycles! [Chart repeats the worst-case hop count plot from the previous slide]
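The 600+ cycle figure is straightforward arithmetic on the hop count; a small sketch (the 10-flit average queue per router is the slide's assumption, and only queueing delay is counted):

```python
# Worst-case latency estimate at ~1000 cores (32 x 32 mesh).
k = 32
h_worst = 2 * (k - 1)      # corner-to-corner hop count = 62
queue_delay = 10           # assumed ~10 cycles of queueing per router (from the slide)
print(h_worst, h_worst * queue_delay)   # 62 hops -> ~620 cycles of queueing alone
```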

  12. What About 3D Stacking? • Number of stacking layers is limited (< 4) • This only reduces the radix k by about 1/2, so per-core throughput still drops and worst-case latency still grows exponentially with each generation
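To see why stacking only buys a constant factor: splitting N cores across L layers shrinks the per-layer mesh radix from √N to √(N/L), and since k grows by about √2 every generation, halving k only rewinds the trend by roughly two generations. A hedged sketch of that arithmetic (the two-generation framing is an inference from the scaling assumptions above, not a claim from the slides):

```python
import math

# Splitting N cores across L stacked layers shrinks the per-layer mesh radix
# from k = sqrt(N) to k / sqrt(L); with L = 4 layers, k is halved.
N, L = 1024, 4
k_2d = math.isqrt(N)        # 32
k_3d = math.isqrt(N // L)   # 16
# k grows by sqrt(2) per generation (2x cores), so halving k only rewinds
# the throughput/latency trend by about two generations.
gens_bought = math.log(k_2d / k_3d, math.sqrt(2))
print(k_2d, k_3d, round(gens_bought, 1))   # 32 16 2.0
```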

  13. What About Locality? • If traffic is nearly all local (e.g., nearest neighbor) • Then per-processor throughput and latency should remain constant if frequency is held constant • And they improve if frequency scales up

  14. What About Locality? But … • Data center applications often require all-to-all communication • e.g., MapReduce must perform significant data shuffling between its map and reduce phases [Image: Single-Chip Cloud Computer (Intel SCC)]

  15. What About Locality? • A 1000-core chip may be virtualized across different applications or users • Applications or virtual machines/clusters may enter and exit dynamically • Physical placements may become fragmented [Image: Single-Chip Cloud Computer (Intel SCC)]

  16. What About Locality? • Cache coherence • Needed in shared-memory programming models • Local cache copies are invalidated when data is updated • Broadcast operations are often implemented inefficiently as many separate one-to-one unicasts
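A rough way to quantify that inefficiency: delivering a broadcast as separate unicasts pays the full source-to-destination distance for every copy, whereas a tree-based multicast traverses each link of the tree once. A toy count for a 32 x 32 mesh with a corner source (the source placement and XY routing are assumptions for illustration, not from the slides):

```python
# Rough cost of a broadcast from node (0, 0) in a k x k mesh:
#   - as k*k - 1 separate unicasts with XY routing: total link traversals
#     equal the sum of Manhattan distances to every destination
#   - as a spanning-tree multicast: roughly one traversal per reached node
k = 32
unicast_links = sum(x + y for x in range(k) for y in range(k))
multicast_links = k * k - 1
print(unicast_links, multicast_links)   # 31744 vs 1023: ~31x more link traffic
```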

  17. Some Research Directions • Use of nanophotonics • Use of transmission lines • Both can get anywhere-to-anywhere on the chip in 2-3 ns

  18. What About “Dark Silicon”? • Power growing exponentially each process generation • Can we get to 1000 usable cores? [Source: Borkar’10, The Exascale Challenge]

     Tech node            45nm    32nm    22nm    16nm    11nm    8nm
                          (2008)  (2010)  (2012)  (2014)  (2016)  (2018)
     Frequency scaling    1.00    1.10    1.19    1.25    1.30    1.34
     Vdd scaling          1.00    0.93    0.88    0.86    0.84    0.84
     Capacitance scaling  1.00    0.75    0.56    0.42    0.32    0.24
     Power scaling        1.00    0.71    0.52    0.39    0.29    0.22
     Assuming 2x cores    1.00    1.41    2.06    3.09    4.68    7.08

     With 2x cores per generation, chip power increases exponentially, by roughly 1.4-1.5x per generation.
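The last two rows are just the product of the rows above them: per-core power scales as frequency x capacitance x Vdd^2, and chip power additionally assumes the core count doubles each generation. A short sketch that reproduces the table (to within its rounding):

```python
# Reproduce the "Power scaling" and "Assuming 2x cores" rows of the table:
# per-core power ~ frequency * capacitance * Vdd^2; chip power also doubles
# the core count every generation. Small differences are rounding in the table.
nodes = ["45nm", "32nm", "22nm", "16nm", "11nm", "8nm"]
freq  = [1.00, 1.10, 1.19, 1.25, 1.30, 1.34]
vdd   = [1.00, 0.93, 0.88, 0.86, 0.84, 0.84]
cap   = [1.00, 0.75, 0.56, 0.42, 0.32, 0.24]
for gen, node in enumerate(nodes):
    per_core = freq[gen] * cap[gen] * vdd[gen] ** 2
    chip     = per_core * 2 ** gen
    print(f"{node}: power scaling {per_core:.2f}, with 2x cores {chip:.2f}")
```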

  19. Can we get to 1000 usable cores? • Depends … • NVIDIA Kepler GPU: 1536 cores already • 28nm, 1 GHz base clock • But these are simple cores • What about non-GPUs?

  20. My Back-of-Envelope Extrapolations [Chart: extrapolated chip power (W) vs. process node (40, 28, 20, 14, 10 nm); labeled points: 100 cores (1.5 GHz) 55W, 200 cores (1.65 GHz) 78W, 400 cores (1.78 GHz) 114W, 800 cores (1.5 GHz) 170W, 1600 cores (2 GHz) 257W, 1000 cores (2 GHz) 161W] ** Note: These are extrapolations, not product announcements

  21. My Back-of-Envelope Extrapolations [Chart adds a curve with frequency held at 1.5 GHz: 100 cores 55W, 200 cores 71W, 400 cores 96W, 800 cores 136W, 1000 cores 124W, 1600 cores 198W; alongside the frequency-scaled points from the previous slide: 200 cores (1.65 GHz) 78W, 400 cores (1.78 GHz) 114W, 800 cores (1.5 GHz) 170W, 1000 cores (2 GHz) 161W, 1600 cores (2 GHz) 257W] ** Note: These are extrapolations, not product announcements
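These curves appear to follow the 100-core, 1.5 GHz, 55 W anchor scaled by core count and the same per-generation power factors as the earlier table; the sketch below reproduces both curves to within a watt or two under that assumption (the model is reverse-engineered from the plotted values, not stated on the slides):

```python
# Back-of-envelope chip power, anchored at 100 cores / 1.5 GHz / 55 W at 40 nm.
# Per-core power factors follow the same style of scaling as the earlier table.
freq = [1.00, 1.10, 1.19, 1.25, 1.30]   # frequency scaling per generation
vdd  = [1.00, 0.93, 0.88, 0.86, 0.84]
cap  = [1.00, 0.75, 0.56, 0.42, 0.32]
for gen, node in enumerate([40, 28, 20, 14, 10]):
    cores  = 100 * 2 ** gen
    scaled = 55 * 2 ** gen * freq[gen] * cap[gen] * vdd[gen] ** 2  # frequency keeps rising
    fixed  = 55 * 2 ** gen * cap[gen] * vdd[gen] ** 2              # frequency held at 1.5 GHz
    print(f"{node} nm: {cores:4d} cores  ~{scaled:3.0f} W (scaled freq)  ~{fixed:3.0f} W (1.5 GHz)")
# A 1000-core point at 10 nm works out the same way: ~161 W with frequency
# scaling, ~124 W at a fixed 1.5 GHz.
```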

  22. Utilization Wall • What about more complex cores? • 1150 x86 cores could reach 300-600W with current extrapolations [Borkar’10] • 67-83% of them would have to be dark to stay under a 100W budget • What about beyond 1000 cores?
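The dark-silicon percentages are just the fraction of the chip that must be switched off to fit the budget; a one-line check:

```python
# Fraction of cores that must stay dark to fit a 100 W budget.
for chip_power_w in (300, 600):
    print(f"{chip_power_w} W chip: {1 - 100 / chip_power_w:.0%} dark")   # 67%, 83%
```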

  23. Some Research Directions • Minimizing energy is the #1 goal, even at the expense of wasting silicon area • Heavy use of “accelerators” • Analogous to an operating system, which comprises many “kernel functions” that are loaded into memory only when used • What about moving more functions into specialized hardware and only “lighting them up” when used? • What are the implications for the network?
