
Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies

Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies. Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge. University of Michigan, Ann Arbor. HPCA 19, February 27, 2013.


Presentation Transcript


  1. Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies. Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge. University of Michigan, Ann Arbor. HPCA 19, February 27, 2013

  2. Many-Core Trend • Thousand-core chips are in our future • A scalable on-chip interconnect is required [Figure labels: Mesh; TILE Gx100; TILE64; Intel SCC; Crossbar or Ring]

  3. Outline • Motivation • Symmetric Low-Radix and High-Radix Designs • Asymmetric High-Radix Designs • Super-Star • Super-StarX • Results • Conclusion

  4. Mesh Topology • Popular in tile-based many-core processors • Low complexity • Planar 2D layout properties • Can Mesh topology scale to 100s of cores? [Figure: Tilera's TILE64 64-core processor]

  5. High-Radix Topologies • Alternative to low-radix topologies • Concentration • Fewer hops improve latency, but links become bottlenecks [Figure: grids of routers (R); labels: "Tile", "6 tile"]
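One way to see why links become bottlenecks under concentration is to estimate the load on the links that cross the middle of the chip. The sketch below is a back-of-the-envelope model, not the authors' analysis: it assumes a square grid of routers with an even side length, uniform random traffic, and one link per row crossing the bisection in each direction. The function name and the injection rate are illustrative.

```python
import math


def bisection_link_load(tiles: int, concentration: int, inj_rate: float) -> float:
    """Approximate flits/cycle carried by each link crossing the bisection.

    tiles         -- total number of tiles on the chip
    concentration -- tiles sharing one router (1 = plain mesh)
    inj_rate      -- flits/cycle injected by each tile (assumed uniform)
    """
    if tiles % concentration:
        raise ValueError("tiles must be divisible by concentration")
    routers = tiles // concentration
    k = math.isqrt(routers)
    if k * k != routers:
        raise ValueError("this sketch expects a square grid of routers")
    # Half the tiles sit on each side, and roughly half of their traffic
    # is destined for the other side under uniform random traffic.
    crossing = (tiles / 2) * inj_rate * 0.5
    # k links (one per row) carry that traffic in each direction.
    return crossing / k


if __name__ == "__main__":
    for c in (1, 4, 9, 16):
        print(c, bisection_link_load(576, c, inj_rate=0.1))
```

In this model the per-link load grows with the square root of the concentration factor: going from 1 to 4 tiles per router halves the grid side and doubles the load on each bisection link.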

  6. High-Radix Topologies • Improve throughput • Additional connectivity • Parallel links • Express links [Figure: grid of high-radix routers (R)]

  7. High-Radix Switch: Swizzle-Switch • Traditional Matrix-Style Crossbar • Separate crossbar & arbiter • Not scalable as radix increases: • Routing to/from arbiter becomes more challenging • Arbitration logic grows more complex • Swizzle-Switch* • Combines routing-dominated crossbar with logic-dominated arbiter • SRAM-like technology • Scales to radix-64 in 32nm @ 1.5GHz *VLSIC 2011, ISSCC 2012, DAC 2012, JETCAS 2012, HotChips 2012

  8. High-Radix Topologies [Figure panels: Conventional Router Delay; Swizzle-Switch Router Delay. Labels: Delay, Hop Count, Local Communication, Global Communication, Low-Radix Router, High-Radix Router] • Symmetric high-radix topologies trade off the efficiency of local communication to achieve faster global communication

  9. Outline • Motivation • Symmetric Low-Radix and High-Radix Designs • Asymmetric High-Radix Designs • Super-Star • Super-StarX • Results • Conclusion

  10. Asymmetric High-Radix Topologies • Low-radix topologies optimize local communication • High-radix topologies optimize global communication • Asymmetric high-radix topologies merge the best features of both low-radix and high-radix topologies [Figure: fast, low-radix local routers (LR = Local Router) and a slow, high-radix global router (GR = Global Router)]

  11. Asymmetric High-Radix Topologies • Decouple local and global communication • Match router speed to wire speed • Local communication → Short wires → Fast low-radix routers • Global communication → Long wires → Slow high-radix routers → Reduce hop count

  12. Super-Star • Each local router connects a cluster of tiles • Each global router connects to all local routers [Figure: local routers (LR) arranged around a global router (GR)]
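The two rules on this slide bound the path length: traffic inside a cluster passes through one local router, and traffic between clusters goes local router, global router, local router. The sketch below counts routers traversed under that reading. The cluster size is a parameter chosen for illustration (the slide does not give one), tiles are assumed to be numbered so that consecutive IDs share a cluster, and traffic is assumed uniform random.

```python
def superstar_routers_traversed(src: int, dst: int, tiles_per_cluster: int) -> int:
    """Routers on the path between two tiles in a Super-Star-like network.

    Assumes tiles src and dst belong to clusters src // tiles_per_cluster
    and dst // tiles_per_cluster (a hypothetical numbering).
    """
    same_cluster = src // tiles_per_cluster == dst // tiles_per_cluster
    # Same cluster: the shared local router only.
    # Different clusters: source LR -> global router -> destination LR.
    return 1 if same_cluster else 3


def average_routers_traversed(tiles: int, tiles_per_cluster: int) -> float:
    """Average over all (source, destination) pairs, uniform random traffic."""
    total = sum(superstar_routers_traversed(s, d, tiles_per_cluster)
                for s in range(tiles) for d in range(tiles))
    return total / tiles ** 2


if __name__ == "__main__":
    # 576 tiles as in the evaluation; 4 tiles per cluster is an assumption.
    print(average_routers_traversed(576, 4))
```

Whatever the cluster size, no path in this model crosses more than three routers, which is the property that keeps global latency flat as the tile count grows.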

  13. Super-StarX • Inter-cluster links further reduce local communication latency • Locality-aware routing policy • Low load: inter-cluster links • High load: inter-cluster links + global router [Figure: local routers (LR) joined by inter-cluster links, plus a global router (GR)]
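The slide names the locality-aware policy without giving its details, so the following is only one plausible reading, sketched under stated assumptions: a local router prefers the direct inter-cluster link to a neighboring cluster and spills onto the global router once that link is backed up. The `LocalRouter` class, the `link_backlog` field, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LocalRouter:
    cluster: int
    neighbors: frozenset  # clusters reachable over a direct inter-cluster link
    # Hypothetical congestion signal: flits queued per inter-cluster link.
    link_backlog: dict = field(default_factory=dict)


def choose_output(lr: LocalRouter, dst_cluster: int, backlog_threshold: int = 4) -> str:
    """Pick an output class for a packet arriving at local router lr."""
    if dst_cluster == lr.cluster:
        return "local"
    if dst_cluster in lr.neighbors:
        # Low load: use the short inter-cluster link.
        if lr.link_backlog.get(dst_cluster, 0) < backlog_threshold:
            return "inter-cluster"
        # High load: the link is congested, so this packet takes the
        # global router while earlier packets drain over the link.
    return "global"


if __name__ == "__main__":
    lr = LocalRouter(cluster=0, neighbors=frozenset({1}), link_backlog={1: 0})
    print(choose_output(lr, 1))   # neighbor, idle link
    lr.link_backlog[1] = 9
    print(choose_output(lr, 1))   # neighbor, congested link
    print(choose_output(lr, 7))   # distant cluster
```

With a per-link threshold like this, both paths carry neighbor traffic at high load, which matches the "inter-cluster links + global router" case on the slide.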

  14. Super-StarX • Multiple global routers • Higher throughput, energy proportionality [Figure: four global routers (GR) among local routers (LR)]

  15. Super-StarX Layout [Figure: chip floorplan showing local routers (LR), global routers (GR), and inter-cluster links; labels: "4 tile", "576 tiles in total"; dimension annotations: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm]

  16. Evaluation • 576 tiles • Synthetic uniform random traffic, 4-flit messages • 128-bit Swizzle-Switch in 15nm • 4 VCs/port, buffer depth 5 flits/VC • Power & delay from SPICE modeling in 32nm, scaled to 15nm
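The evaluation parameters imply some simple sizes that are worth spelling out. The arithmetic below assumes that a flit is as wide as the 128-bit switch channel, which the slide does not state explicitly; the constant names are illustrative.

```python
FLIT_BITS = 128            # assumed equal to the 128-bit Swizzle-Switch width
FLITS_PER_MESSAGE = 4      # from the slide
VCS_PER_PORT = 4           # from the slide
BUFFER_DEPTH_FLITS = 5     # per virtual channel, from the slide

# Size of one synthetic message.
message_bytes = FLIT_BITS * FLITS_PER_MESSAGE // 8

# Input buffering needed at each router port.
buffer_flits_per_port = VCS_PER_PORT * BUFFER_DEPTH_FLITS
buffer_bytes_per_port = buffer_flits_per_port * FLIT_BITS // 8

print(message_bytes)          # 64
print(buffer_flits_per_port)  # 20
print(buffer_bytes_per_port)  # 320
```

Under that assumption each message is 64 bytes, and each port buffers 20 flits, or 320 bytes.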

  17. Results: Latency • Compared with Mesh topology, Super-Star topologies have • 39% more throughput, 45% reduction in latency

  18. Results: Power • Compared with Mesh topology, Super-Star topologies have • 40% less power. At 30W, 3x more throughput [Figure labels: Low Power, High Perf., 3x, 2.3x]

  19. Results: Energy Proportionality • Available throughput can be tuned using global routers • A single global router can provide full network connectivity
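Because a single global router already provides full connectivity, the remaining ones can in principle be switched on only when the load calls for them. The sketch below illustrates that idea and is not the authors' mechanism: the capacity figure, its units, and the function name are assumptions, and the default of four global routers is taken from the Super-StarX figure.

```python
import math


def active_global_routers(offered_load: float,
                          per_gr_capacity: float,
                          total_grs: int = 4) -> int:
    """Number of global routers to keep powered for a given offered load.

    offered_load and per_gr_capacity share a hypothetical unit
    (for example flits/cycle of global traffic).
    """
    needed = math.ceil(offered_load / per_gr_capacity)
    # Keep at least one router on so every cluster stays reachable,
    # and never ask for more routers than exist.
    return min(total_grs, max(1, needed))


if __name__ == "__main__":
    for load in (0.0, 0.8, 2.5, 10.0):
        print(load, active_global_routers(load, per_gr_capacity=1.0))
```

A policy of this shape makes network power track demand in steps of one global router, at the cost of lower available throughput while routers are off.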

  20. Results: Localized Traffic • Nearest neighbor traffic between LRs • Maximum one hop [Figure legend: Inter-Cluster Links + Global Routers; Inter-Cluster Links]

  21. Results: Applications • Processor Configuration • 576 nodes: 552 cores + 24 memory controllers (1 GHz frequency) • Private L1 cache; shared, distributed L2 cache • Workloads • 4 workloads – 12 SPEC CPU 2006 benchmarks each • 1 workload – 8 SPLASH-2 benchmarks • Metrics • Performance (execution time in cycles) • Power • Results: Super-StarX • Average over Mesh: 17% performance improvement, 39% less power • Average over Fbfly: 32% performance improvement, 5% worse power

  22. Conclusion • Goal: a scalable on-chip network topology for kilo-core chips • Made feasible by Swizzle-Switches • Asymmetric high-radix topologies: Super-Star and Super-StarX • Fast low-radix local routers, slow high-radix global routers • Multiple global routers for higher throughput and energy proportionality • Results: Super-StarX • Average latency: 45% reduction over Mesh • Power: 40% less than Mesh • Throughput @ 30W TDP: 3x Mesh, 2.3x Fbfly

  23. Thank You! Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies. Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge. University of Michigan, Ann Arbor. HPCA 19, February 27, 2013

  24. BACKUP SLIDES

  25. High-Radix Switch: Swizzle-Switch • Radix-64 • 128-bit channels • 32nm • 1.5 GHz • 2W of power • ~2mm² of area
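The figures on this slide allow a rough estimate of peak switch bandwidth and energy per bit. The calculation assumes that every one of the 64 outputs moves one 128-bit word each cycle and that the quoted 2W is drawn at that peak; neither assumption is stated on the slide, so the results are upper and lower bounds respectively rather than measured values.

```python
RADIX = 64              # ports, from the slide
CHANNEL_BITS = 128      # bits per channel, from the slide
FREQ_HZ = 1.5e9         # 1.5 GHz, from the slide
POWER_W = 2.0           # from the slide; assumed to be drawn at peak load

# Peak aggregate bandwidth if all outputs are busy every cycle.
peak_bits_per_second = RADIX * CHANNEL_BITS * FREQ_HZ
peak_terabits_per_second = peak_bits_per_second / 1e12

# Energy per transferred bit at that peak.
picojoules_per_bit = POWER_W / peak_bits_per_second * 1e12

print(peak_terabits_per_second)
print(picojoules_per_bit)
```

Under these assumptions the switch peaks at about 12.3 Tb/s, or roughly 0.16 pJ per bit.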

  26. Super-Star Layout [Figure: chip floorplan showing local routers (LR) and global routers (GR); labels: "4 tile", "576 tiles in total"; dimension annotations: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm]

  27. Super-Ring (Anti-design) • Medium-radix local and global routers • Limited connectivity hinders scalability [Figure: local routers (LR) and four global routers (GR)]
