Scalable Reconfigurable Interconnects
Ali Pinar, Lawrence Berkeley National Laboratory
Joint work with Shoaib Kamil, Lenny Oliker, and John Shalf
CSCAPES Workshop, Santa Fe, June 11, 2008
Ultra-scale systems rely on increased concurrency. Concurrency has grown enormously since 2004. How do we connect such huge numbers of processors?
What is a good interconnect for ultra-scale systems? • Mesh/torus networks provide limited performance. • Fat-trees are widely used due to their flexibility: 94 of the top 100 Top500 systems in 2004, 72 of the top 100 in 2007. • The cost of a fat-tree scales as O(P log P). • For large numbers of processors, the cost of the interconnect dominates the cost of the compute hardware.
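A rough sketch of that scaling argument, using total switch ports as a cost proxy; the switch radix (k = 36) and the per-level port constants are my illustrative assumptions, not numbers from the talk:

```python
# Why fat-tree cost grows as O(P log P) while a torus stays O(P),
# counting switch ports as a stand-in for cost. Constants are illustrative only.
import math

def fat_tree_ports(P, k=36):
    levels = math.ceil(math.log(P, k // 2))   # tree depth needed to reach P endpoints
    return P * levels * 2                     # ~2 switch ports per endpoint per level

def torus_ports(P):
    return P * 6                              # 6 neighbor links per node in a 3D torus

for P in (1_000, 10_000, 100_000):
    print(P, fat_tree_ports(P), torus_ports(P))
```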
Step-by-step approach • Characterize the communication requirements of applications. • This replaces theoretical metrics with practical ones. • Minimize the interconnect requirements: • Choice of subdomains • Task-to-processor mapping • Scheduling of messages • Design alternative interconnects: • Static networks: fit-trees • Reconfigurable networks
Most messages are small. Employ a separate network for low-bandwidth messages.
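A minimal sketch of the kind of measurement behind this slide; the 2 KB threshold is a hypothetical parameter, not from the talk:

```python
# For a list of message sizes, how many messages (and how many bytes) fall below
# a small-message threshold and could be diverted to a low-bandwidth side network.
def split_traffic(msg_sizes, threshold=2048):
    small = [s for s in msg_sizes if s <= threshold]
    frac_msgs = len(small) / len(msg_sizes)    # typically large: most messages are small
    frac_bytes = sum(small) / sum(msg_sizes)   # typically small: few of the bytes
    return frac_msgs, frac_bytes
```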
Most fat-tree ports are not utilized. More than 50% of the ports of a fat-tree go unused.
Clever task-to-processor allocation yields better results. Hops are reduced by an average of 25%, improving latency.
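A sketch (my own illustration, not the authors' code) of the metric behind the 25% figure: the volume-weighted hop count of a task-to-processor mapping on a 3D torus, against which alternative placements can be compared. The `comm` and `place` inputs below are hypothetical.

```python
def torus_hops(a, b, dims):
    # shortest wrap-around distance between coordinates a and b on a torus
    return sum(min(abs(i - j), d - abs(i - j)) for i, j, d in zip(a, b, dims))

def mapping_cost(comm, place, dims=(8, 8, 8)):
    # total hops weighted by traffic volume; a better placement has a lower cost
    return sum(vol * torus_hops(place[u], place[v], dims) for u, v, vol in comm)

comm = [(0, 1, 10), (1, 2, 5)]                       # (task_a, task_b, volume)
place = {0: (0, 0, 0), 1: (0, 0, 1), 2: (7, 7, 7)}   # task -> torus coordinates
print(mapping_cost(comm, place))
```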
Do we need the fat-tree bandwidth? • We need the flexibility of a fat-tree, but not the full bandwidth. • The bandwidth requirement can be decreased with careful placement of tasks. • Proposed alternative: fit-trees. • Idea: analyze the communication requirements of applications and design the interconnect for what is really needed.
Even all-to-all communication does not need a fat-tree. • All-to-all communication is the bottleneck for FFT. • Clever scheduling of messages reduces the bandwidth requirement. • Conventional algorithms for all-to-all communication do not distribute communication evenly. • The savings are even more pronounced in FFT with a 2D decomposition. [Figure: traffic per tree level vs. communication step for the standard, randomized, and optimal schedules]
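A small sketch (my own, not the talk's scheduler) of how such a claim can be checked: for a given all-to-all schedule, count how many messages must cross each level of a binary tree in each step. A peak per-step load far above the average means the schedule is uneven and forces that level to be overprovisioned.

```python
def shift_schedule(p):
    # conventional schedule: in step s, rank i sends to (i + s) mod p
    return [[(i, (i + s) % p) for i in range(p)] for s in range(1, p)]

def level_load(step, p, level):
    # messages whose endpoints lie in different subtrees of size 2**level
    block = 2 ** level
    return sum(1 for i, j in step if i // block != j // block)

p = 16
for lvl in (1, 2, 3):
    loads = [level_load(s, p, lvl) for s in shift_schedule(p)]
    print(f"level {lvl}: peak/step = {max(loads)}, "
          f"average/step = {sum(loads) / len(loads):.1f}")
```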
Fit-trees: the network should fit the application • Key observation: the scalability of an application is related to the locality of its computation. • Implication: the required bandwidth decreases as we go higher in the tree. • Fitness ratio (f): the ratio of the bandwidths of two successive layers. • 2D domains: f ≈ 1.4 • 3D domains: f ≈ 1.2 [Figure: link bandwidth per level in a fat-tree vs. a fit-tree]
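A sketch of how the fitness ratio translates into savings (the endpoint and level counts are illustrative, not the paper's exact model): total uplink count when per-level bandwidth shrinks by a factor of f at each level, compared with a full fat-tree (f = 1).

```python
def uplinks(P, levels, f=1.0):
    links, total = float(P), 0.0
    for _ in range(levels):
        total += links
        links /= f                  # f = 1.0 reproduces the full fat-tree
    return total

P, levels = 4096, 6
for f in (1.0, 1.2, 1.4):           # fat-tree; fit-tree for 3D domains; for 2D domains
    print(f"f = {f}: ~{uplinks(P, levels, f):,.0f} uplinks")
```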
HFAST • Hybrid Flexibly-Assignable Switch Topology • Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run time (reconfiguration cost on the order of 10-100 ms). • The hardware to do so exists (optical networks). • Layer-1 switches are cheaper per port (no dynamic decisions; like a telephone switchboard). • Collective communication uses a separate low-latency, low-bandwidth tree network (as in IBM BlueGene).
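A toy sketch of the idea (my own illustration, not the HFAST implementation): pack heavily-communicating tasks onto the same packet switch, then list the circuits the Layer-1 switch must provide between packet switches. The `comm` edge list and `tasks_per_switch` parameter are hypothetical inputs.

```python
from collections import defaultdict

def configure(comm, tasks_per_switch):
    # comm: list of (task_a, task_b, volume); place endpoints of heavy edges first
    switch_of, next_slot = {}, 0
    for a, b, _ in sorted(comm, key=lambda e: -e[2]):
        for t in (a, b):
            if t not in switch_of:
                switch_of[t] = next_slot // tasks_per_switch
                next_slot += 1
    # aggregate remaining traffic between distinct packet switches
    circuits = defaultdict(int)
    for a, b, vol in comm:
        if switch_of[a] != switch_of[b]:
            pair = tuple(sorted((switch_of[a], switch_of[b])))
            circuits[pair] += vol
    return switch_of, dict(circuits)   # circuits to set up at run time

comm = [(0, 1, 100), (1, 2, 80), (2, 3, 5), (0, 3, 1)]   # hypothetical traffic
print(configure(comm, tasks_per_switch=2))
```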
How to use HFAST • Improved task-to-processor assignments • Even at run time • Migrate processes with little overhead • Adapt to changing communication requirements • Avoid defragmentation at the system level • Build an interconnect for each application • Avoid overprovisioning the communication resources
Processor allocation for adaptive applications: we obtain 41% and 53% of the ideal hop savings.
Conclusions • The massive concurrency of ultrascale machines will require new interconnects. • We cannot afford to overprovision the resources. • There is no magic solution that is good for all applications; flexibility or reconfigurability is necessary. • The technology for reconfigurable networks is available. • We need to: • reduce resource requirements, • design networks for typical workloads, • design methods to build a network for a given application.