Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies. Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge. University of Michigan, Ann Arbor. HPCA 19, February 27, 2013.
Many-Core Trend • Thousand-core chips are in our future • A scalable on-chip interconnect is required [Figure labels: Mesh (TILE-Gx100, TILE64, Intel SCC); Crossbar or Ring]
Outline • Motivation • Symmetric Low-Radix and High-Radix Designs • Asymmetric High-Radix Designs • Super-Star • Super-StarX • Results • Conclusion
Mesh Topology • Popular in tile-based many-core processors • Low complexity • Planar 2D layout properties • Can the Mesh topology scale to 100s of cores? [Figure: Tilera's TILE64 64-core processor]
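The scaling question above can be made concrete with a small calculation. The sketch below is not from the slides: it assumes a square k x k mesh, dimension-ordered (Manhattan) distance, and source/destination pairs drawn uniformly at random, including a node sending to itself. The function name `mean_mesh_hops` is hypothetical.

```python
from itertools import product


def mean_mesh_hops(k: int) -> float:
    """Mean number of link traversals between two uniformly chosen nodes of a k x k mesh."""
    # The X and Y offsets are independent and identically distributed,
    # so average the distance along one dimension and double it.
    one_dim_total = sum(abs(a - b) for a, b in product(range(k), repeat=2))
    return 2 * one_dim_total / (k * k)


if __name__ == "__main__":
    # 8 x 8 = 64 nodes (TILE64-sized); 24 x 24 = 576 nodes (assumed square arrangement).
    for k in (8, 24):
        print(f"{k * k} nodes: {mean_mesh_hops(k):.2f} hops on average")
```

Under these assumptions the mean distance grows roughly as 2k/3, so going from 64 to 576 nodes about triples the average hop count.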
High-Radix Topologies • Alternative to low-radix topologies • Concentration [Figure: grid of routers (R), each shared by a group of tiles; labels: "Tile", "6 tile"] Fewer hops improve latency, but links become bottlenecks
High-Radix Topologies • Improve throughput • Additional connectivity • Parallel links • Express links [Figure: routers (R) connected to a high-radix router]
High-Radix Switch: Swizzle-Switch • Traditional Matrix-Style Crossbar • Separate crossbar & arbiter • Not scalable as radix increases: • Routing to/from arbiter becomes more challenging • Arbitration logic grows more complex • Swizzle-Switch* • Combines routing-dominated crossbar with logic-dominated arbiter • SRAM-like technology • Scales to radix-64 in 32nm @ 1.5GHz *VLSIC 2011, ISSCC 2012, DAC 2012, JETCAS 2012, HotChips 2012
High-Radix Topologies [Figure labels: Conventional Router Delay, Swizzle-Switch Router Delay, Delay, Hop Count, Global Communication, Local Communication, Low-Radix Router, High-Radix Router] Symmetric high-radix topologies trade off the efficiency of local communication to achieve faster global communication
Outline • Motivation • Symmetric Low-Radix and High-Radix Designs • Asymmetric High-Radix Designs • Super-Star • Super-StarX • Results • Conclusion
Asymmetric High-Radix Topologies • Low-radix topologies optimize local communication • High-radix topologies optimize global communication [Figure: fast, low-radix local routers (LR = Local Router) connected to a slow, high-radix global router (GR = Global Router)] Asymmetric high-radix topologies merge the best features of both low-radix and high-radix topologies
Asymmetric High-Radix Topologies • Decouple local and global communication • Match router speed to wire speed • Local communication: short wires, fast low-radix routers • Global communication: long wires, slow high-radix routers that reduce hop count
Super-Star • Each local router connects a cluster of tiles • Each global router connects to all local routers [Figure: local routers (LR) arranged around a global router (GR)]
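The two connectivity rules on this slide imply a very short router path. The sketch below is a minimal illustration, not the authors' implementation: it assumes tiles are numbered consecutively within clusters and that each cluster holds 16 tiles (36 clusters for 576 tiles). The cluster size, the tile numbering, and the function name `super_star_path` are all assumptions made for the example.

```python
from __future__ import annotations


def super_star_path(src_tile: int, dst_tile: int,
                    tiles_per_cluster: int = 16, gr: int = 0) -> list[str]:
    """Return the routers a packet traverses in a Super-Star network.

    tiles_per_cluster and the tile numbering are illustrative assumptions.
    """
    src_lr = src_tile // tiles_per_cluster
    dst_lr = dst_tile // tiles_per_cluster
    if src_lr == dst_lr:
        # Both tiles hang off the same local router: a single router traversal.
        return [f"LR{src_lr}"]
    # Every global router connects to all local routers, so any GR can bridge them.
    return [f"LR{src_lr}", f"GR{gr}", f"LR{dst_lr}"]


if __name__ == "__main__":
    print(super_star_path(0, 5))      # same cluster
    print(super_star_path(0, 575))    # different clusters
```

With this structure a packet crosses at most three routers regardless of how far apart the two tiles are.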
Super-StarX • Inter-cluster links further reduce local communication latency • Locality-aware routing policy • Low load: inter-cluster links • High load: inter-cluster links + global router [Figure: local routers (LR) joined by inter-cluster links, with a global router (GR)]
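One way to picture the locality-aware policy is as a per-packet output choice at the source local router. The sketch below is only an illustration of the low-load/high-load behavior named on the slide: the load metric, the 0.75 threshold, the `neighbors` map, and the function name `choose_output` are hypothetical and not taken from the presentation.

```python
from __future__ import annotations


def choose_output(src_lr: int, dst_lr: int,
                  neighbors: dict[int, set[int]],
                  link_load: dict[tuple[int, int], float],
                  threshold: float = 0.75) -> str:
    """Pick an output for a packet at its source local router.

    link_load maps (src_lr, dst_lr) to an assumed utilization in [0, 1].
    """
    if src_lr == dst_lr:
        return "local"                      # destination is in the same cluster
    if dst_lr in neighbors.get(src_lr, set()):
        if link_load.get((src_lr, dst_lr), 0.0) < threshold:
            return "inter-cluster"          # low load: direct link to the neighbor
        return "global"                     # high load: spill over to a global router
    return "global"                         # non-neighbor clusters always use a GR


if __name__ == "__main__":
    nbrs = {0: {1, 6}, 1: {0, 2, 7}}
    print(choose_output(0, 1, nbrs, {(0, 1): 0.2}))
    print(choose_output(0, 1, nbrs, {(0, 1): 0.9}))
    print(choose_output(0, 20, nbrs, {}))
```

Because the check is made per packet, a busy neighbor pair ends up using both the inter-cluster link and the global router, which matches the high-load case on the slide.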
Super-StarX • Multiple global routers • Higher throughput, energy proportionality [Figure: local routers (LR) connected to four global routers (GR)]
Super-StarX Layout [Figure: chip floorplan with local routers (LR), four global routers (GR), and inter-cluster links; 576 tiles in total; annotations: "4 tile"; dimension labels: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm]
Evaluation • 576 tiles • Synthetic uniform random traffic, 4-flit messages • 128-bit Swizzle-Switch in 15nm • 4 VCs/port, buffer depth 5 flits/VC • Power & delay from SPICE modeling in 32nm, scaled to 15nm
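The buffer and message parameters above translate directly into storage figures. The arithmetic below is a small worked example and assumes one flit is as wide as the 128-bit channel, which the slide does not state explicitly; the function names are hypothetical.

```python
CHANNEL_BITS = 128       # from the slide: 128-bit Swizzle-Switch
VCS_PER_PORT = 4         # from the slide: 4 VCs/port
FLITS_PER_VC = 5         # from the slide: buffer depth 5 flits/VC
FLITS_PER_MESSAGE = 4    # from the slide: 4-flit messages


def buffer_bits_per_port(flit_bits: int = CHANNEL_BITS) -> int:
    """Input-buffer storage per port, assuming flit width equals channel width."""
    return VCS_PER_PORT * FLITS_PER_VC * flit_bits


def message_bits(flit_bits: int = CHANNEL_BITS) -> int:
    """Size of one synthetic-traffic message under the same assumption."""
    return FLITS_PER_MESSAGE * flit_bits


if __name__ == "__main__":
    print(buffer_bits_per_port())   # 2560 bits, i.e. 320 bytes per port
    print(message_bits())           # 512 bits, i.e. 64 bytes per message
```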
Results: Latency • Compared with Mesh topology, Super-Star topologies have • 39% more throughput, 45% reduction in latency
Results: Power • Compared with Mesh topology, Super-Star topologies have • 40% less power • At 30W, 3x more throughput [Figure annotations: Low Power, High Perf., 3x, 2.3x]
Results: Energy Proportionality • Available throughput can be tuned using global routers • A single global router can provide full network connectivity
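The tuning idea can be sketched as choosing how many global routers to keep powered for a given demand. This is a toy model, not the evaluated mechanism: it assumes throughput scales linearly with the number of active global routers, that there are four of them (as the layout figure suggests), and that `per_gr_capacity` is a known constant. All names and the linear model are assumptions.

```python
import math


def active_global_routers(demand: float, per_gr_capacity: float,
                          total_grs: int = 4) -> int:
    """Number of global routers to keep on for a demanded throughput.

    Assumes (hypothetically) that capacity adds up linearly across GRs.
    """
    if demand < 0 or per_gr_capacity <= 0 or total_grs < 1:
        raise ValueError("demand must be >= 0, capacity > 0, total_grs >= 1")
    # At least one GR stays on: a single GR already gives full connectivity.
    needed = max(1, math.ceil(demand / per_gr_capacity))
    return min(needed, total_grs)


if __name__ == "__main__":
    for demand in (0.0, 2.5, 10.0):
        print(demand, "->", active_global_routers(demand, per_gr_capacity=1.0))
```

The lower bound of one router reflects the slide's point that a single global router preserves connectivity; the units of `demand` and `per_gr_capacity` only need to match each other.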
Results: Localized Traffic • Nearest neighbor traffic between LRs • Maximum one hop [Figure legend: Inter-Cluster Links + Global Routers; Inter-Cluster Links]
Results: Applications • Processor Configuration • 576 nodes: 552 cores + 24 memory controllers (1 GHz frequency) • Private L1 cache; shared, distributed L2 cache • Workloads • 4 workloads – 12 SPEC CPU2006 benchmarks each • 1 workload – 8 SPLASH-2 benchmarks • Metrics • Performance (execution time in cycles) • Power • Results: Super-StarX • Average over Mesh: 17% performance improvement, 39% less power • Average over Fbfly: 32% performance improvement, 5% worse power
Conclusion • Goal: a scalable on-chip network topology for kilo-core chips • Made feasible by Swizzle-Switches • Asymmetric high-radix topologies: Super-Star and Super-StarX • Fast low-radix local routers, slow high-radix global routers • Multiple global routers for higher throughput and energy proportionality • Results: Super-StarX • Average latency: 45% reduction over Mesh • Power: 40% less than Mesh • Throughput @ 30W TDP: 3x Mesh, 2.3x Fbfly
Thank You!
High-Radix Switch: Swizzle-Switch • Radix-64 • 128-bit channels • 32nm • 1.5 GHz • 2W of power • ~2 mm² of area
Super-Star Layout [Figure: chip floorplan with local routers (LR) and four global routers (GR); 576 tiles in total; annotations: "4 tile"; dimension labels: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm]
Super-Ring (Anti-design) • Medium-radix local and global routers • Limited connectivity hinders scalability [Figure: local routers (LR) and four global routers (GR)]