A Look at Application Performance Sensitivity to the Bandwidth and Latency of Infiniband Networks Darren J. Kerbyson Performance and Architecture Laboratory (PAL) http://www.c3.lanl.gov/pal Computer and Computational Sciences Division Los Alamos National Laboratory
Performance and Architecture Lab • Performance analysis team at Los Alamos • Measurement • Modeling • Simulation • Large-scale: • systems (10,000s to 100,000s of processors) • applications • Analyze existing systems (or near-to-market systems) • Examine possible future systems • e.g. IBM PERCS (DARPA HPCS), next-generation Blue Gene, … • Recent work includes: • Modeling and optimization of ASCI Q (SC03 best paper) • Comparison of systems: e.g. Earth Simulator & Top 5 (CCPE05) • Blue Gene/L (SC04) • Large-scale Optical Circuit Switch network (SC05)
Assessing the impact of network performance • Context • What would the performance improvement be if we had: • a network with higher bandwidth? • a network with lower latency? • Is it worth procuring an enhanced configuration? • Approach • Use application performance models • Application abstraction encapsulating performance-related features • Compute factors: single processor / node performance • Parallel factors: boundary exchanges, collectives, etc. • Parameterized in terms of system characteristics • Node characteristics • Network characteristics, including bandwidth and latency (a minimal sketch of such a model follows)
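To make the approach concrete, here is a minimal sketch of such an analytic model: runtime expressed as a single-processor compute factor plus a parallel (communication) factor, parameterized by network characteristics. The class, message counts, and peak-bandwidth figures below are illustrative assumptions, not the actual PAL models for Sweep3D, SAGE, or Partisn.

```python
# A minimal sketch of the modeling approach, NOT the actual PAL models:
# runtime = single-processor compute time + communication time,
# parameterized by network latency and achievable bandwidth.

from dataclasses import dataclass

@dataclass
class Network:
    latency_s: float      # end-to-end near-neighbor latency (seconds)
    bandwidth_Bps: float  # achievable (not peak) bandwidth (bytes/second)

def comm_time(msg_bytes: float, net: Network) -> float:
    """Classic latency + size/bandwidth cost of one point-to-point message."""
    return net.latency_s + msg_bytes / net.bandwidth_Bps

def iteration_time(compute_s: float, n_msgs: int, msg_bytes: float,
                   net: Network) -> float:
    """One iteration: compute factor plus a boundary-exchange factor."""
    return compute_s + n_msgs * comm_time(msg_bytes, net)

# Example: 4x vs. 12x IB-like links (assumed peaks of ~10 and ~30 Gb/s),
# both at ~80% achievable bandwidth and 4µs latency.
ib_4x = Network(latency_s=4e-6, bandwidth_Bps=0.8 * 1.25e9)
ib_12x = Network(latency_s=4e-6, bandwidth_Bps=0.8 * 3.75e9)
t_4x = iteration_time(1e-3, 4, 64 * 1024, ib_4x)
t_12x = iteration_time(1e-3, 4, 64 * 1024, ib_12x)
print(f"speedup from 4x to 12x: {t_4x / t_12x:.3f}x")
```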
Applications • Three applications of interest to Los Alamos: • Sweep3D: kernel application representing the heart of a deterministic SN transport calculation • SAGE: AMR hydrocode for shock propagation • Partisn: deterministic SN transport code • Performance models previously developed • Validated on large-scale systems including: • Blue Gene/L (Lawrence Livermore) up to 32K nodes • Red Storm (Sandia) up to 8K processors • ASCI Q (Los Alamos) up to 8K processors • Typical error of ~10% • Once validated, the models can be used to explore performance on new systems
Network Characteristics • A latency of 4µs seems optimistic (currently) • A latency of 1.5µs is close to PathScale (1.29µs) • Achievable bandwidth is assumed to be ~80% of peak • The Infiniband fabric is assumed to be a 12-ary fat-tree with a switch latency of 200ns (a rough latency composition is sketched below)
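As a rough illustration of how these fabric figures compose, the sketch below derives a worst-case end-to-end latency from the fat-tree geometry: endpoint overhead plus 200ns per switch crossed. The hop-count formula and the 3.8µs endpoint overhead are assumptions for illustration, not values stated in the talk.

```python
# Rough latency composition for the assumed 12-ary fat-tree fabric.
# Assumes a worst-case path crosses 2*levels - 1 switches (up to a root
# switch and back down); the 3.8µs endpoint overhead is an assumption.

import math

SWITCH_LATENCY_S = 200e-9  # per-switch latency from the slide above

def fat_tree_levels(n_nodes: int, k: int = 12) -> int:
    """Levels of a k-ary fat-tree needed to connect n_nodes endpoints."""
    return max(1, math.ceil(math.log(n_nodes, k)))

def worst_case_latency(n_nodes: int, endpoint_s: float = 3.8e-6) -> float:
    hops = 2 * fat_tree_levels(n_nodes) - 1  # switches on a worst-case path
    return endpoint_s + hops * SWITCH_LATENCY_S

for n in (256, 512, 1024):
    print(f"{n:>5} nodes: ~{worst_case_latency(n) * 1e6:.1f}µs worst case")
```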
Performance studies • Sensitivity to network bandwidth and latency • 4x, 8x, & 12x bandwidths • 4µs & 1.5µs near-neighbor latency • Effect of node size • Varying the number of processors in a node • Assumes single-core, but applicable to multi-core • Assumes nodes of 2GHz AMD Opterons • Use of measured single-processor performance • Vary system size • From 1 processor up to 8,192 processors • Concentrate on 256, 512 and 1024 processor clusters (a sketch of this parameter sweep follows)
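One way such a study grid might be driven is sketched below: sweep link width, near-neighbor latency, and node size, and report performance relative to the 4-way / 4x / 4µs baseline used in the sensitivity charts that follow. The one-line cost model and every numeric constant are assumptions, not the PAL models.

```python
# Sketch of the study grid: sweep link width, latency, and node size,
# reporting performance relative to the 4-way / 4x / 4µs baseline.
# The toy cost model and all constants are illustrative assumptions.

from itertools import product

PEAK_Bps = {"4x": 1.25e9, "8x": 2.5e9, "12x": 3.75e9}  # assumed link peaks

def model_time(width: str, latency_s: float, node_size: int) -> float:
    """Toy cost: fixed compute plus messages from processors sharing one NIC."""
    bw = 0.8 * PEAK_Bps[width]  # ~80% of peak achievable
    n_msgs, msg_bytes, compute_s = 8, 64 * 1024, 5e-3
    return compute_s + n_msgs * (latency_s + node_size * msg_bytes / bw)

baseline = model_time("4x", 4e-6, 4)
for width, lat_s, nodes in product(PEAK_Bps, (4e-6, 1.5e-6), (1, 2, 4, 8)):
    rel = baseline / model_time(width, lat_s, nodes)
    print(f"{width:>3}, {lat_s * 1e6:3.1f}µs, {nodes}-way: {rel:.2f}x baseline")
```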
Communication cost example: Sweep3D and SAGE • 4x, 8x, 12x IB with near-neighbor latency of 4µs • 4-way nodes
Performance sensitivity: Partisn • Relative to a baseline configuration: 4-way nodes, 4x IB with 4µs latency • X-axis indicates node-size sensitivity (1- to 8-way) • Bar height indicates bandwidth sensitivity: 4x is the lowest bar value, 12x the highest, and 8x the white 'mid' line • The difference between solid and shaded bars indicates latency sensitivity (4µs vs. 1.5µs)
Performance sensitivity: Partisn • 512-processor cluster • Highest sensitivity is to node size, since multiple processors share a NIC • More sensitive to bandwidth than to latency
Performance sensitivity: Sweep3D • 512-processor cluster • Highest sensitivity is to latency, since most messages are small (~1KB) • Similar sensitivity to bandwidth and to node size
Performance sensitivity: SAGE • 512-processor cluster • Similar sensitivity to bandwidth and node size (1- to 4-way) • No change from 4- to 8-way due to an application effect • Little sensitivity to latency
Sensitivity summary • Note that this summary says nothing about cost or relative workload usage
Conclusions • The performance improvement due to an enhanced network is application-dependent: • bandwidth on SAGE • latency on Sweep3D • a mixture (node size and bandwidth) on Partisn • Compute performance dampens any performance enhancement from the network • Faster processors would increase performance sensitivity to the network • Performance modeling can be used to assess configurations prior to procurement