Regional Congestion Awareness for Load Balance in Networks-on-Chip
Boris Grot, Paul Gratz, Steve Keckler
The University of Texas at Austin, Department of Computer Sciences (UTCS)
The Era of Many-core
• Intel Polaris
  • 80 tiles
  • 8x10 2D mesh
• UT TRIPS
  • 2x16 exec tiles
  • 16 NUCA tiles
• Tilera Tile
  • 64 cores
  • 5 networks
The Era of Many-core
• Many tiles: cores, cache, accelerators, more!
• Many on-chip networks
• Many traffic types: operands, memory, I/O
Networks on a Chip (NOCs)
• First-order system-level impact:
  • Performance
  • Energy
  • Resilience
• Prior work:
  • Topology (Dally, DAC 2001)
  • Flow Control (Dally, IEEE Trans. on Computers 1987)
  • Router µArch (Peh, HPCA 2001)
  • Prototyping (Taylor, IEEE Micro 2002)
  • Routing (Seo, ISCA 2005; Kim, DAC 2005)
Routing Policy
• Determines the path from source to destination
• Directly impacts load-balancing properties of the network
  • Ability to spread network load
  • Major performance implications
• Current NOC research & practice: dimension-order routing (DOR)
  • Deadlock freedom
  • Low implementation complexity
  • Fast route calculation
  • Poor load-balancing properties
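The DOR bullet above can be made concrete with a minimal sketch. It assumes a 2D mesh with routers addressed by (x, y) coordinates and the common XY variant of DOR (resolve X first, then Y); the function name `dor_route` and the coordinate convention are illustrative choices, not taken from the slides.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a router in the mesh (assumed convention)

def dor_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the routers visited from src to dst under XY dimension-order routing."""
    x, y = src
    path = [src]
    # Resolve the X offset completely before taking any step in Y.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Only then resolve the Y offset.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(dor_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

The route depends only on the source and destination, which is why route calculation is fast and simple, and also why the policy cannot steer around a loaded link.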
Routing Example: Transpose Traffic
• Dimension-Order Routing (DOR): avg latency = 230 cycles
• Adaptive Routing: avg latency = 18 cycles
• Wanted: Load Balance
[Figure comparing the two policies; legend label: 100%]
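To illustrate why transpose traffic stresses DOR, the sketch below counts how many flows cross each directed link when every node (x, y) sends to (y, x) under XY routing. The 8x8 mesh size, the XY ordering, and the helper names `xy_links` and `transpose_link_load` are assumptions made for illustration; the slide does not state the configuration behind its latency numbers.

```python
from collections import Counter

def xy_links(src, dst):
    """Yield the directed links (router, next router) used by XY dimension-order routing."""
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        yield ((x, y), (nx, y))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        yield ((x, y), (x, ny))
        y = ny

def transpose_link_load(k):
    """Count the flows crossing each directed link of a k x k mesh for transpose traffic."""
    load = Counter()
    for x in range(k):
        for y in range(k):
            if x != y:  # nodes on the diagonal would send to themselves
                load.update(xy_links((x, y), (y, x)))
    return load

k = 8  # assumed mesh radix
load = transpose_link_load(k)
total_links = 4 * k * (k - 1)  # directed links in a k x k mesh
print("links used:", len(load), "of", total_links)
print("max flows on one link:", max(load.values()))
```

In this model half of the directed links carry no transpose traffic at all, while the busiest link is shared by seven flows, which is the kind of imbalance an adaptive policy can smooth out.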
Outline
• Adaptive routing
• Problems with adaptive routing
• Regional Congestion Awareness
• Evaluation
• Conclusion
Adaptive Routing
• Path is a function of network condition
• Dynamically balances load among network links
• Used in systems from IBM, Cray, DEC, etc.
• Issues:
  • Deadlock (Duato, Trans. on Parallel & Dist’d Systems 1993)
  • Minimal vs non-minimal routing
  • Router complexity & latency (Kim, DAC 2005)
  • Performance
Adaptive Routing: Performance Issues
• Performance depends on ability to estimate network congestion
• Local metrics
  • Downstream VC & buffer availability
  • XB demand (i.e., output port contention)
• Limitations of local metrics
  • Myopic congestion estimation
  • By the time congestion is encountered, it's too late
  • Congestion in the center and underutilization at the edges
  • Poor load-balancing properties
  • Uniformly distributed traffic
  • Transient hot spots
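A minimal sketch of adaptive port selection driven by a purely local metric may help here. It assumes minimal routing on a 2D mesh, port names E/W/N/S with N meaning increasing y, and a hypothetical `free_vcs` table holding the free virtual-channel count reported for each output port; deadlock avoidance is deliberately left out.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) router position (assumed convention)

def productive_ports(cur: Coord, dst: Coord) -> List[str]:
    """Output ports that move a packet closer to dst (minimal routing only)."""
    ports = []
    if dst[0] > cur[0]:
        ports.append("E")
    elif dst[0] < cur[0]:
        ports.append("W")
    if dst[1] > cur[1]:
        ports.append("N")
    elif dst[1] < cur[1]:
        ports.append("S")
    return ports

def select_port(cur: Coord, dst: Coord, free_vcs: Dict[str, int]) -> str:
    """Pick the productive port with the most free downstream VCs."""
    candidates = productive_ports(cur, dst)
    if not candidates:
        return "LOCAL"  # packet has arrived; eject it
    return max(candidates, key=lambda p: free_vcs[p])

free = {"N": 1, "E": 3, "S": 4, "W": 4}
print(select_port((1, 1), (3, 2), free))  # E: 3 free VCs beats N with 1
```

The decision sees only the next hop: if the east neighbor looks free but everything beyond it is saturated, this selector still sends the packet east, which is the myopia the slide describes.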
Ideal Routing
• Perfect knowledge of network state
• Low router complexity
  • Low logic & state overhead
  • No impact on critical path
• Low bandwidth requirements
Regional Congestion Awareness (RCA)
• Local data collection
• Propagation to neighboring routers
• Aggregation of local & non-local data
• Trivial logic & state overhead
• Low bandwidth requirements
• Significantly improved network visibility
RCA 1D
RCA Fanin
RCA Router µArch
RCA Details
• Aggregation
  • Local vs non-local weight assignment: 50-50
  • Trivial logic (one 8-bit adder/port)
• Propagation
  • Differentiates RCA variants
  • Trivial complexity (0-2 8-bit adders/port)
• RCA bandwidth
  • Baseline: 8 bits/channel
  • Can be reduced by serializing each update
  • Negligible performance impact at 1 bit/channel
  • Subject to traffic pattern stability
Experimental Methodology
• Combined XB Demand + Free VCs
• Splash traces courtesy of A. Kumar et al.
Results: Splash
RCA Conclusions
• Improved congestion estimation through aggregation of local and non-local measurements
• Significant performance improvement
  • Improved load-balancing
  • Better throughput
  • 71% max latency reduction on Splash
• Low complexity, no critical path impact
• Multiple configurations possible
  • Performance-complexity trade-offs
RCA Future Work
• Performance in other topologies
  • E.g., 2D and 3D tori
• Applicability to off-chip networks
  • Early results are promising
• System-level energy/power impact
  • Earlier task completion vs RCA overhead
• Network fault tolerance
  • Expect significant improvements in network performance under one or more faults via improved load balancing