Throughput-Effective On-Chip Networks for Manycore Accelerators
Ali Bakhoda, John Kim¹ and Tor M. Aamodt
¹KAIST, Korea
Manycore Accelerators and NoC • Manycore accelerators • Prevalent example: high-end GPUs • 10s of thousands of threads running at the same time • Bulk Synchronous Parallel programming style • 3 of the top 5 supercomputers • Based on the Nov. 2010 Top500 list • Primary goal: higher application level throughput • NoC in accelerators • Needs a different perspective from CPUs • Not very well studied in this context
The Need for Throughput-Effective NoCs • Throughput-effective design: improves application level performance per unit of chip area
Contributions • Study impact of NoC on application level performance • Traditional improvements (router latency reduction): minimal impact on application level performance • Increasing channel width: high performance gain + high area cost • Consider application level throughput per unit area of NoC • Throughput correlated with injection rate of a few nodes • Many-to-few-to-many traffic pattern • Propose throughput-effective NoC design • Checkerboard network • Multi-port router structure
Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion
Baseline Network • Mesh with memory controllers (MCs) at the periphery of the chip • Similar to Tilera’s TILE64 or Intel’s 80-core Teraflops chip • Simple and scalable • Dimension Order Routing • Virtual Channel Flow Control • 4-cycle routers
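As a rough sketch of how dimension-order (XY) routing on the baseline mesh picks the next hop (the coordinate convention and port names here are illustrative, not from the slides):

```python
# Minimal sketch of dimension-order (XY) routing in a 2D mesh:
# resolve the X dimension completely, then the Y dimension.
# Nodes are addressed by (x, y) coordinate tuples.

def xy_next_hop(cur, dst):
    """Return the output direction for one hop of XY routing."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                      # X dimension first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                      # then the Y dimension
        return "NORTH" if dy > cy else "SOUTH"
    return "EJECT"                    # arrived at the destination

# Example: from (0, 0) to (2, 1) the path is EAST, EAST, NORTH.
```

Because every packet between the same pair of nodes follows the same path, XY routing is deadlock-free and simple, which is why meshes like this use it as the baseline.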
Finding a Balanced Design • Bisection bandwidth of the baseline mesh
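For a k×k mesh, the worst-case bisection cut crosses k channels, so bisection bandwidth scales with k times the channel width. A small illustrative calculation (the 6×6 mesh size and 16-byte channel width below are assumptions for illustration, not figures from the slides):

```python
def mesh_bisection_bw(k, channel_bytes_per_cycle):
    """Bisection bandwidth of a k x k mesh: the cut through the
    middle of the chip crosses k channels in each direction."""
    return k * channel_bytes_per_cycle

# e.g. a hypothetical 6x6 mesh with 16 B/cycle channels:
# 6 channels * 16 B/cycle = 96 B/cycle across the bisection.
bw = mesh_bisection_bw(6, 16)
```

A "balanced" design sizes this bisection bandwidth against the aggregate memory bandwidth the MCs can supply, so neither side is wildly over-provisioned.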
Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion
NoC Properties in Manycore Accelerators • Router latency has minimal impact on application level throughput • Aggressive 1-cycle routers instead of 4-cycle routers • Only 2.3% application level speedup • Channel bandwidth is very important • 27% speedup by doubling BW • But quadratic area increase
Many-to-Few-to-Many Traffic Pattern • [Figure: compute cores C0…Cn send to memory controllers MC0…MCm over the request network; replies return over a separate reply network. MC injection bandwidth is highlighted as the bottleneck.]
Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion
Checkerboard Routing: Half-Routers • Half-routers • No turns allowed at half-routers • Limited connectivity • Saves ~50% of router crossbar area • Full-routers: normal routers with complete connectivity • Use half-routers at every other node • [Figure: half-router connectivity]
Solution: Routing Restriction (1) • Routing from a full-router to a half-router that is: • An odd number of columns away • Not in the same row • Solution: Use YX routing instead of XY routing in this case
Solution: Routing Restriction (2) • Routing from a half-router to a half-router that is: • An even number of columns away • Not in the same row • Solution: needs two turns: (1) route to an intermediate full-router using YX, then (2) route to the destination using XY • Requires an extra VC to avoid deadlock
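Putting restrictions (1) and (2) together, the routing-mode choice can be sketched as a function of the endpoint router types and their column distance (this is an illustrative reconstruction of the rules on these slides, not the authors' code):

```python
def checkerboard_route(src_is_half, dst_is_half, col_dist, same_row):
    """Pick a routing mode under the checkerboard restrictions.
    col_dist is the number of columns between source and destination."""
    if same_row or col_dist == 0:
        return "XY"                    # no turn is needed at any half-router
    odd = (col_dist % 2 == 1)
    if (not src_is_half) and dst_is_half and odd:
        return "YX"                    # restriction (1): the turn happens
                                       # at a full-router instead
    if src_is_half and dst_is_half and not odd:
        # restriction (2): YX to an intermediate full-router, then XY.
        # The extra virtual channel needed for deadlock freedom
        # is not modeled in this sketch.
        return "YX-then-XY"
    return "XY"                        # default dimension-order routing
```

The key point is that only the *mode selection* changes; each individual phase is still plain dimension-order routing, so half-routers never have to turn a packet.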
Routing Restriction (3) • Full-routers that are an odd number of columns away • We avoid this case by using a different MC placement (next 2 slides)
Placement of MCs • Exploit Many-to-Few • Place the MCs at half-router nodes • Half-routers can communicate with all nodes with no penalty • Common case for BSP: compute cores communicate with MCs, not each other • [CMP-MSI’08] “Extending the Scalability of Single Chip Stream Processors with On-chip Caches”, Bakhoda et al. • [ISCA’09] “Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs”, Abts et al.
Multi-port routers at MCs • Reduce the bottleneck at the few nodes • Increase terminal BW of the few nodes • Increase the injection ports of MC routers • Minimal area overhead (~1% of total NoC area) • Speedups of up to 25%
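The idea is simply that a router's terminal bandwidth scales with its number of injection/ejection ports, so adding ports only at the few MC routers relieves the many-to-few bottleneck cheaply. A toy model (the port counts and 16-byte channel width are illustrative assumptions):

```python
def terminal_bw(injection_ports, channel_bytes_per_cycle):
    """Terminal bandwidth of a router: each injection port can
    deliver one channel's worth of data per cycle."""
    return injection_ports * channel_bytes_per_cycle

# A hypothetical MC router with 2 injection ports and 16 B/cycle
# channels has twice the terminal bandwidth of a 1-port router.
mc_bw = terminal_bw(2, 16)
core_bw = terminal_bw(1, 16)
```

Since only the handful of MC routers grow, the extra crossbar ports cost far less area than widening channels everywhere, which is why the slides report ~1% NoC area overhead.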
Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion
Methodology • Compute simulation: GPGPU-Sim (2.2.1b) • NoC simulation: BookSim 2 • Integrated into GPGPU-Sim as the network simulator • Area estimations: Orion 2.0 • Benchmarks: 24 CUDA applications including the Rodinia benchmarks
Results • Combination of: • Checkerboard routing and placement • Channel slicing • Multi-port routers at MCs • Overall HM speedup of 17% across 24 benchmarks over the balanced baseline • Total NoC area reduction of 43% • [Chart: benchmarks fall into three groups: high speedup/high traffic, low speedup/high traffic, and low speedup/low traffic]
Summary • Throughput-Effective design: Consider system level performance impact + area impact of NoC • Observations • NoC BW is more important than latency in accelerators • Many-to-Few-to-Many traffic pattern • Throughput-Effective NoC for accelerators • Checkerboard • Multi-port MC routers • Channel-slicing
Channel Slicing – Double networks • Divide the single network into two physical networks • Each new network: half the bisection BW of the original network • Overall bisection BW: constant • Saves area • Quadratic dependency of crossbar area on channel BW • Increases serialization latency • But compute accelerators are not sensitive to latency
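Because crossbar area grows roughly quadratically with channel width, splitting one network into two half-width networks keeps total bisection bandwidth constant while shrinking crossbar area. A back-of-the-envelope sketch (the quadratic area model, port count, and unit constant are simplifying assumptions for illustration):

```python
def crossbar_area(ports, channel_width, unit=1.0):
    """Simple model: crossbar area ~ ports^2 * channel_width^2."""
    return unit * (ports ** 2) * (channel_width ** 2)

single = crossbar_area(ports=5, channel_width=32)       # one full-width net
sliced = 2 * crossbar_area(ports=5, channel_width=16)   # two half-width nets

# Under this model, sliced / single == 0.5: two half-width networks
# cost half the crossbar area while carrying the same bisection BW,
# at the price of higher serialization latency.
```

The extra serialization latency is the trade-off, but as the earlier slides showed, these accelerator workloads are bandwidth-bound rather than latency-bound.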
Results • Memory controller placement • HM speedup of 13% over the balanced baseline design
Results • Checkerboard routing • Less than 1% performance loss compared to DOR with same resources • Reduces total router area by 14.2%
Results • Channel slicing • Average change in performance < 1% • NoC area reduction of 37%
Top 5 Systems (Nov. 2010 Top500) • 1. Tianhe-1A: NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C • 2. Jaguar: Cray XT5-HE, Opteron 6-core 2.6 GHz • 3. Nebulae: Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU • 4. TSUBAME 2.0: HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows • 5. Hopper: Cray XE6, 12-core 2.1 GHz
Many-to-Few-to-Many Traffic Pattern • [Figure: detailed view of the request network (core output bandwidth into MC input bandwidth) and the reply network (MC output bandwidth into core input bandwidth), connecting cores C0…Cn with memory controllers MC0…MCm]