Some Challenges in On-Chip Interconnection Networks ECE 284 Spring 2014
1000-Cores: When Are We Going to Get There? • Intel 48-core SCC • 45nm • Based on x86 cores • [Mattson, SC’10]
1000-Cores: When Are We Going to Get There? • Intel x86 SCC, if we assume 2x cores per generation: • 2010: 48 cores (45nm) • 2012: 96 cores (32nm) • 2014: 192 cores (22nm) • 2016: 384 cores (16nm) • 2018: 768 cores (11nm) • 2020: 1536 cores (8nm) ** Note: These are extrapolations, not product announcements
1000-Cores: When Are We Going to Get There? • Tilera Gx100 • 40nm • MIPS core • [Tilera website]
1000-Cores: When Are We Going to Get There? • Tilera, if we assume 2x cores per generation: • 2011: 100 cores (40nm) • 2013: 200 cores (28nm) • 2015: 400 cores (20nm) • 2017: 800 cores (14nm) • 2019: 1600 cores (10nm) ** Note: These are extrapolations, not product announcements
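The doubling assumption behind both timelines can be checked with a few lines of Python. This is only a sketch: the function names and the two-year generation cadence are assumptions read off the slide's year labels, not anything stated by Intel or Tilera.

```python
def generations_to_reach(start_cores, target_cores, growth=2):
    """Count process generations until the core count first meets the target."""
    cores, gens = start_cores, 0
    while cores < target_cores:
        cores *= growth
        gens += 1
    return gens, cores


def first_year(start_cores, start_year, target_cores=1000, years_per_gen=2):
    """Year (and core count) at which the extrapolation first reaches the target.

    years_per_gen=2 is an assumption taken from the slide's year labels.
    """
    gens, cores = generations_to_reach(start_cores, target_cores)
    return start_year + gens * years_per_gen, cores


if __name__ == "__main__":
    print("Intel SCC line:", first_year(48, 2010))
    print("Tilera line:   ", first_year(100, 2011))
```

Under these assumptions both product lines cross 1000 cores around the end of the decade, which is what the timelines above show.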
Throughput Wall • Effective network throughput per core of commonly used NoC architectures drops exponentially with each process generation!
Throughput Wall • Mesh topology most widely used • Cross traffic increases with radix k [Figure: 4x4 mesh of processor (P) nodes]
Throughput Wall • Normalized throughput for uniform traffic • T_uniform = 4/k [Figure: normalized throughput (0.0 to 1.0) at 40 nm (k=10), 28 nm (k=14), 20 nm (k=20), 14 nm (k=28), 10 nm (k=40)]
Throughput Wall • Normalized throughput for worst-case traffic • T_uniform = 4/k • T_worst-case = 1/(k – 1) [Figure: normalized throughput (0.0 to 1.0) at 40 nm (k=10), 28 nm (k=14), 20 nm (k=20), 14 nm (k=28), 10 nm (k=40)]
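To make the two throughput curves concrete, the sketch below evaluates the slide's formulas at the radix values shown on the plot axis. The function and variable names are illustrative, and the formulas are taken as given for a k x k mesh with k >= 4.

```python
# (process node, mesh radix k) pairs as labelled on the plot axis.
NODES = [("40 nm", 10), ("28 nm", 14), ("20 nm", 20), ("14 nm", 28), ("10 nm", 40)]


def t_uniform(k):
    """Normalized per-core throughput under uniform traffic: 4/k."""
    return 4.0 / k


def t_worst_case(k):
    """Normalized per-core throughput under worst-case traffic: 1/(k - 1)."""
    return 1.0 / (k - 1)


if __name__ == "__main__":
    for node, k in NODES:
        print(f"{node:>6} k={k:2d}  uniform={t_uniform(k):.3f}  "
              f"worst-case={t_worst_case(k):.3f}")
```

Because k grows by roughly the same factor each generation in this list, both quantities shrink geometrically from one node to the next.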
Latency Wall • Worst-case hop count increases with k • H_worst-case = 2(k – 1) [Figure: worst-case hop count (axis 0.0 to 5.0) at 40 nm (k=10), 28 nm (k=14), 20 nm (k=20), 14 nm (k=28), 10 nm (k=40)]
Latency Wall • When k = 32 (1000 cores) • H_worst-case = 2(32 – 1) = 62 • Suppose average queue length = 10 flits: node-to-node latency can be 600+ cycles!
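The latency estimate can be reproduced with a very rough model. The sketch below assumes that a packet waits behind the average queue at every hop and that draining one flit costs one cycle; that cost model is an assumption chosen to match the slide's arithmetic, not a statement about any particular router.

```python
def worst_case_hops(k):
    """Corner-to-corner hop count in a k x k mesh: 2(k - 1)."""
    return 2 * (k - 1)


def worst_case_latency_cycles(k, avg_queue_flits, cycles_per_flit=1):
    """Rough latency: every hop waits behind avg_queue_flits queued flits.

    cycles_per_flit=1 is an assumed cost, used only for illustration.
    """
    return worst_case_hops(k) * avg_queue_flits * cycles_per_flit


if __name__ == "__main__":
    k = 32  # 32 x 32 = 1024 cores
    print("hops:", worst_case_hops(k))
    print("cycles:", worst_case_latency_cycles(k, avg_queue_flits=10))
```

With k = 32 and 10 queued flits per hop, the model lands above 600 cycles, in line with the slide's estimate.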
What About 3D Stacking? • Number of stacking layers < 4 • Only reduces radix k by 1/2, so effective throughput still drops, and worst-case latency still grows, exponentially with each generation
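One way to see why stacking gives only a one-time gain is to compute the per-layer radix. The sketch below assumes the cores are split evenly across layers and that each layer is a square 2D mesh; both are simplifying assumptions, and vertical links between layers are ignored.

```python
import math


def per_layer_radix(total_cores, layers):
    """Radix of each layer's square mesh when cores are split evenly."""
    return math.sqrt(total_cores / layers)


def t_uniform(k):
    """Uniform-traffic throughput of a k x k mesh (4/k, from the earlier slide)."""
    return 4.0 / k


if __name__ == "__main__":
    for cores in (1024, 2048, 4096):
        flat = per_layer_radix(cores, 1)
        stacked = per_layer_radix(cores, 4)
        print(f"{cores} cores: k={flat:.1f} flat, k={stacked:.1f} with 4 layers, "
              f"throughput gain {t_uniform(stacked) / t_uniform(flat):.1f}x")
```

Four layers halve k and so double the uniform throughput, but the gain is a constant factor: the per-layer radix still grows with every doubling of the core count.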
What About Locality? • If traffic nearly all local (e.g., nearest neighbor) • Then per-processor throughput and latency should remain constant if frequency held constant • Improving with frequency
What About Locality? But … • Data center applications often require all-to-all communication • e.g., MapReduce must perform significant data shuffling between its map and reduce phases [Figure: Single-Chip Cloud Computer (Intel SCC)]
What About Locality? • 1000 cores may be virtualized for different applications or users • Applications or virtual machines/clusters may enter and exit dynamically • Physical locations may be fragmented [Figure: Single-Chip Cloud Computer (Intel SCC)]
What About Locality? • Cache Coherence • Needed in shared-memory programming models • Local cache copies invalidated when data is updated • Broadcast operations inefficiently implemented with many 1-to-N unicast operations
Some Research Directions • Use of nanophotonics • Use of transmission lines • Both can get anywhere-to-anywhere on the chip in 2-3 ns
What About “Dark Silicon”? • Power growing exponentially each process generation • Can we get to 1000 usable cores? [Source: Borkar’10, The Exascale Challenge]

Tech node             45nm (2008)  32nm (2010)  22nm (2012)  16nm (2014)  11nm (2016)  8nm (2018)
Frequency scaling     1.00         1.10         1.19         1.25         1.30         1.34
Vdd scaling           1.00         0.93         0.88         0.86         0.84         0.84
Capacitance scaling   1.00         0.75         0.56         0.42         0.32         0.24
Power scaling         1.00         0.71         0.52         0.39         0.29         0.22
Assuming 2x cores     1.00         1.41         2.06         3.09         4.68         7.08

• Even with per-core power falling, doubling the cores each generation makes total power increase exponentially
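The power-scaling row is consistent with the standard dynamic-power relation P ∝ C · Vdd² · f, which the sketch below applies to the tabulated scaling factors. That the source derived the row this way is an assumption; the inputs here are the rounded table values, so the results match the table only to within rounding.

```python
# (node, frequency scaling, Vdd scaling, capacitance scaling) from the table.
TECH = [
    ("45nm", 1.00, 1.00, 1.00),
    ("32nm", 1.10, 0.93, 0.75),
    ("22nm", 1.19, 0.88, 0.56),
    ("16nm", 1.25, 0.86, 0.42),
    ("11nm", 1.30, 0.84, 0.32),
    ("8nm", 1.34, 0.84, 0.24),
]


def relative_power(freq, vdd, cap):
    """Per-core dynamic power relative to 45nm: P ~ C * Vdd^2 * f."""
    return cap * vdd ** 2 * freq


def chip_power(generation, freq, vdd, cap, growth=2):
    """Relative chip power when the core count grows by `growth` per generation."""
    return relative_power(freq, vdd, cap) * growth ** generation


if __name__ == "__main__":
    for gen, (node, f, v, c) in enumerate(TECH):
        print(f"{node:>5}: per-core {relative_power(f, v, c):.2f}, "
              f"chip with 2x cores {chip_power(gen, f, v, c):.2f}")
```

Per-core power falls by roughly 30% per generation, which is not enough to offset a 2x increase in core count.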
Can we get to 1000 usable cores? • Depends … • NVIDIA Kepler GPU • 1536 cores already • 28nm, 1 GHz base clock • But simple cores • What about non-GPUs?
My Back-of-Envelope Extrapolations [Figure: bar chart of estimated chip power (0W to 250W) by process node] • 40 nm: 100 cores (1.5GHz) 55W • 28 nm: 200 cores (1.65GHz) 78W; 200 cores (1.5GHz) 71W • 20 nm: 400 cores (1.78GHz) 114W; 400 cores (1.5GHz) 96W • 14 nm: 800 cores (1.5GHz) 170W; 800 cores (1.5GHz) 136W • 10 nm: 1600 cores (2GHz) 257W; 1600 cores (1.5GHz) 198W; 1000 cores (2GHz) 161W; 1000 cores (1.5GHz) 124W ** Note: These are extrapolations, not product announcements
Utilization Wall • What about more complex cores? • 1150 x86 cores could reach 300-600W with current extrapolations [Borkar’10] • 67-83% have to be dark if power < 100W • What about beyond 1000 cores?
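The 67-83% figure follows from simple proportionality. The sketch below assumes chip power scales linearly with the number of active cores and ignores uncore and leakage power; the function name is illustrative.

```python
def dark_fraction(full_chip_watts, budget_watts):
    """Fraction of cores that must stay dark to fit within the power budget.

    Assumes power is proportional to the number of active cores.
    """
    if budget_watts >= full_chip_watts:
        return 0.0
    return 1.0 - budget_watts / full_chip_watts


if __name__ == "__main__":
    for watts in (300, 600):
        frac = dark_fraction(watts, budget_watts=100)
        print(f"{watts}W chip under a 100W budget: {frac:.0%} dark")
```

Under that assumption, a 300W chip must keep about two thirds of its cores dark to stay under 100W, and a 600W chip about five sixths.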
Some Research Directions • Minimizing energy is the #1 goal, even at the expense of wasting silicon area. • Heavy use of “accelerators”. • Analogous to an operating system, which comprises many “kernel functions” that are only loaded into memory when used. • What about moving more functions into specialized hardware and only “lighting them up” when used? • What are the implications for the network?