Scalable Reconfigurable Interconnects
Ali Pinar, Lawrence Berkeley National Laboratory
Joint work with Shoaib Kamil, Lenny Oliker, and John Shalf
CSCAPES Workshop, Santa Fe, June 11, 2008
Ultra-scale systems rely on increased concurrency. Concurrency has grown enormously since 2004. How do we connect such huge numbers of processors?
What is a good interconnect for ultra-scale systems? • Mesh/torus networks provide limited performance. • Fat-trees are widely used due to their flexibility: 94 of the top 100 Top500 systems in 2004, 72 of the top 100 in 2007. • The cost of a fat-tree scales as O(P log P). • For large numbers of processors, the cost of the interconnect dominates the cost of the compute hardware.
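A rough sketch of that scaling argument, using total switch ports as a cost proxy; the switch radix (k = 36) and the per-level port constants are my illustrative assumptions, not numbers from the talk:

```python
# Why fat-tree cost grows as O(P log P) while a torus stays O(P),
# counting switch ports as a stand-in for cost. Constants are illustrative only.
import math

def fat_tree_ports(P, k=36):
    levels = math.ceil(math.log(P, k // 2))   # tree depth needed to reach P endpoints
    return P * levels * 2                     # ~2 switch ports per endpoint per level

def torus_ports(P):
    return P * 6                              # 6 neighbor links per node in a 3D torus

for P in (1_000, 10_000, 100_000):
    print(P, fat_tree_ports(P), torus_ports(P))
```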
Step-by-step approach • Characterize the communication requirements of applications. • This replaces theoretical metrics with practical ones. • Minimize the interconnect requirements: • Choice of subdomains • Task-to-processor mapping • Scheduling of messages • Design alternative interconnects: • Static networks: fit-trees • Reconfigurable networks
Most messages are small. Employ a separate network for low-bandwidth messages.
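A minimal sketch of the kind of measurement behind this slide; the 2 KB threshold is a hypothetical parameter, not from the talk:

```python
# For a list of message sizes, how many messages (and how many bytes) fall below
# a small-message threshold and could be diverted to a low-bandwidth side network.
def split_traffic(msg_sizes, threshold=2048):
    small = [s for s in msg_sizes if s <= threshold]
    frac_msgs = len(small) / len(msg_sizes)    # typically large: most messages are small
    frac_bytes = sum(small) / sum(msg_sizes)   # typically small: few of the bytes
    return frac_msgs, frac_bytes
```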
Most fat-tree ports are not utilized. More than 50% of the ports of a fat-tree go unused.
Clever task-to-processor allocation yields better results. Hops are reduced by an average of 25%, improving latency.
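A sketch (my own illustration, not the authors' code) of the metric behind the 25% figure: the volume-weighted hop count of a task-to-processor mapping on a 3D torus, against which alternative placements can be compared. The `comm` and `place` inputs below are hypothetical.

```python
def torus_hops(a, b, dims):
    # shortest wrap-around distance between coordinates a and b on a torus
    return sum(min(abs(i - j), d - abs(i - j)) for i, j, d in zip(a, b, dims))

def mapping_cost(comm, place, dims=(8, 8, 8)):
    # total hops weighted by traffic volume; a better placement has a lower cost
    return sum(vol * torus_hops(place[u], place[v], dims) for u, v, vol in comm)

comm = [(0, 1, 10), (1, 2, 5)]                       # (task_a, task_b, volume)
place = {0: (0, 0, 0), 1: (0, 0, 1), 2: (7, 7, 7)}   # task -> torus coordinates
print(mapping_cost(comm, place))
```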
Do we need the fat-tree bandwidth? • We need the flexibility of a fat-tree, but not the full bandwidth. • The bandwidth requirement can be decreased with careful placement of tasks. • Proposed alternative: fit-trees. • Idea: analyze the communication requirements of applications and design the interconnect for what is really needed.
Even all-to-all communication does not need a fat-tree. • All-to-all communication is the bottleneck for FFT. • Clever scheduling of messages reduces the bandwidth requirement. • Conventional algorithms for all-to-all communication do not distribute communication evenly. • The savings are even more pronounced in FFT with a 2D decomposition. [Figure: traffic per tree level vs. communication step for the standard, randomized, and optimal schedules]
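A small sketch (my own, not the talk's scheduler) of how such a claim can be checked: for a given all-to-all schedule, count how many messages must cross each level of a binary tree in each step. A peak per-step load far above the average means the schedule is uneven and forces that level to be overprovisioned.

```python
def shift_schedule(p):
    # conventional schedule: in step s, rank i sends to (i + s) mod p
    return [[(i, (i + s) % p) for i in range(p)] for s in range(1, p)]

def level_load(step, p, level):
    # messages whose endpoints lie in different subtrees of size 2**level
    block = 2 ** level
    return sum(1 for i, j in step if i // block != j // block)

p = 16
for lvl in (1, 2, 3):
    loads = [level_load(s, p, lvl) for s in shift_schedule(p)]
    print(f"level {lvl}: peak/step = {max(loads)}, "
          f"average/step = {sum(loads) / len(loads):.1f}")
```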
Fit-trees: the network should fit the application • Key observation: the scalability of an application is related to the locality of its computation. • Implication: the required bandwidth decreases as we go higher in the tree. • Fitness ratio (f): the ratio of the bandwidths of two successive layers. • 2D domains: f ≈ 1.4 • 3D domains: f ≈ 1.2 [Figure: link bandwidth per level in a fat-tree vs. a fit-tree]
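A sketch of how the fitness ratio translates into savings (the endpoint and level counts are illustrative, not the paper's exact model): total uplink count when per-level bandwidth shrinks by a factor of f at each level, compared with a full fat-tree (f = 1).

```python
def uplinks(P, levels, f=1.0):
    links, total = float(P), 0.0
    for _ in range(levels):
        total += links
        links /= f                  # f = 1.0 reproduces the full fat-tree
    return total

P, levels = 4096, 6
for f in (1.0, 1.2, 1.4):           # fat-tree; fit-tree for 3D domains; for 2D domains
    print(f"f = {f}: ~{uplinks(P, levels, f):,.0f} uplinks")
```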
HFAST • Hybrid Flexibly-Assignable Switch Topology • Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run time (reconfiguration cost on the order of 10-100 ms). • The hardware to do so exists (optical networks). • Layer-1 switches are cheaper per port (no dynamic decisions; like a telephone switchboard). • Collective communication uses a separate low-latency, low-bandwidth tree network (as in IBM BlueGene).
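A toy sketch of the idea (my own illustration, not the HFAST implementation): pack heavily-communicating tasks onto the same packet switch, then list the circuits the Layer-1 switch must provide between packet switches. The `comm` edge list and `tasks_per_switch` parameter are hypothetical inputs.

```python
from collections import defaultdict

def configure(comm, tasks_per_switch):
    # comm: list of (task_a, task_b, volume); place endpoints of heavy edges first
    switch_of, next_slot = {}, 0
    for a, b, _ in sorted(comm, key=lambda e: -e[2]):
        for t in (a, b):
            if t not in switch_of:
                switch_of[t] = next_slot // tasks_per_switch
                next_slot += 1
    # aggregate remaining traffic between distinct packet switches
    circuits = defaultdict(int)
    for a, b, vol in comm:
        if switch_of[a] != switch_of[b]:
            pair = tuple(sorted((switch_of[a], switch_of[b])))
            circuits[pair] += vol
    return switch_of, dict(circuits)   # circuits to set up at run time

comm = [(0, 1, 100), (1, 2, 80), (2, 3, 5), (0, 3, 1)]   # hypothetical traffic
print(configure(comm, tasks_per_switch=2))
```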
How to use HFAST • Improved task-to-processor assignments • Even at run time • Migrate processes with little overhead • Adapt to changing communication requirements • Avoid defragmentation at the system level • Build an interconnect for each application • Avoid overprovisioning the communication resources
Processor allocation for adaptive applications: we obtain 41% and 53% of the ideal hop savings.
Conclusions • The massive concurrency of ultrascale machines will require new interconnects. • We cannot afford to overprovision the resources. • There is no magic solution that is good for all applications; flexibility or reconfigurability is necessary. • The technology for reconfigurable networks is available. • We need to: • reduce resource requirements, • design networks for typical workloads, • design methods to build a network for a given application.