Applications Performance on HPCx Technology
Issues, Challenges and the Prospects for Capability Computing
Martyn F Guest, Terascale Applications Team Leader
Outline
• Background - Capability & Capacity Computing
• IBM Regatta-H systems accessed to date
• HPCx Technology - Phases 1, 2 and 3 (2002-2007)
  • Single CPU and node performance & the Interconnect
• Performance Overview of “HPC’97” Applications:
  • Molecular Simulation, Computational Materials, Computational Engineering, Atomic and Molecular Physics
  • Cray T3E/1200E and Current High-end Systems
  • SGI Origin 3800, HP/Compaq AlphaServer SC and IBM SP/Regatta-H
• HPCx Strategy for Capability Computing
  • Characterisation, Performance attributes and Migration
  • User consultation - capability & capacity applications
  • Short term and longer term strategy
• HPCx Terascale Applications Team
Application, and not H/W driven
HPC User Meeting
Systems Used In Performance Analysis
• IBM Systems
  • IBM SP: 32 CPU system at DL, 4-way WH2 SMP nodes
  • p-series 690 Turbo (8-way 1.3 GHz POWER4 CPUs, Austin)
  • Regatta-H (32-way) and Regatta HPC (16-way) (Montpellier)
  • SP/Regatta-H (8-way LPAR’d nodes, 1.3 GHz) at ORNL
• HP/Compaq AlphaServer SC
  • 4-way ES40/667 (APAC) and 833 MHz SMP nodes
  • TCS1 system at PSC: 750 4-way ES45 nodes - 3,000 EV68 1 GHz CPUs, with 4 GB memory per node
  • Quadrics “fat tree” interconnect (5 usec latency, 250+ MB/sec B/W)
• SGI Origin 3800
  • SARA (1000 CPUs) - Numalink - with R14k/500 and R12k/400 CPUs
• Cray T3E/1200E
HPC(x) Technology
• Phase 1 (Dec. 2002): 3 TFlop/s Rmax Linpack
  • 40 Regatta-H SMP compute systems (1.28 TB memory)
    • 32 X 1.3GHz processors, 32 GB memory; 4 equal partitions (SP nodes)
  • 2 Regatta-H I/O systems
    • 16 X 1.3GHz processors (Regatta-HPC), 4 GPFS LPARS
    • 2 HSM/backup LPARS, 18TB EXP500 fibre-channel global filesystem
  • Switch Interconnect
    • Existing SP Switch2 with "Colony" PCI adapters in all LPARs (20+usec latency, 350 MBytes/sec bandwidth)
    • Each compute node has two connections into switch fabric (double plane)
  • 160 X 8-way compute nodes in total
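The Phase 1 totals follow from the per-system figures on this slide, and the short check below makes that arithmetic explicit. The peak rate is not stated on the slide; it is estimated here on the assumption of 4 floating-point operations per cycle per POWER4 processor (two fused multiply-add units), so the resulting Linpack efficiency is indicative only, particularly as the quoted 3 TFlop/s Rmax is a rounded figure.

```python
# Figures taken from the Phase 1 slide.
systems = 40                 # Regatta-H compute systems
cpus_per_system = 32         # 1.3 GHz processors per system
mem_per_system_gb = 32       # memory per system
partitions_per_system = 4    # equal LPARs (SP nodes) per system
clock_ghz = 1.3
rmax_tflops = 3.0            # quoted (rounded) Linpack Rmax

# Assumption, not on the slide: 4 flops/cycle per processor (two FMA units).
flops_per_cycle = 4

total_cpus = systems * cpus_per_system                      # 1280 processors
total_mem_tb = systems * mem_per_system_gb / 1000           # 1.28 TB
nodes = systems * partitions_per_system                     # 160 nodes
cpus_per_node = cpus_per_system // partitions_per_system    # 8-way nodes

# Estimated peak in TFlop/s and the implied Linpack efficiency.
peak_tflops = total_cpus * clock_ghz * flops_per_cycle / 1000
efficiency = rmax_tflops / peak_tflops

print(f"{total_cpus} CPUs, {total_mem_tb} TB, {nodes} x {cpus_per_node}-way nodes")
print(f"estimated peak {peak_tflops:.3f} TFlop/s, Rmax/peak ~ {efficiency:.2f}")
```

The derived memory total (1.28 TB) and node count (160 X 8-way) agree with the figures quoted on the slide.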
HPC(x) Technology 2.
• Phase 2 (2004): 6 TFlop/s Rmax Linpack
  • >40 Regatta-H+ compute systems
    • 32 X 1.8GHz processors, 32 GB memory, full SMP mode (no LPAR)
  • 3 Regatta-H I/O systems (Double the capabilities of Phase 1)
  • "Federation" switch fabric
    • bandwidth quadrupled, ~5-10 microsecond latency, Connect to GX bus directly
• Phase 3 (2006): 12 TFlop/s Rmax Linpack
  • >40 Regatta-H+ compute systems
    • unchanged from Phase 2
  • >40 additional Regatta-H+ compute systems
    • double the existing configuration
  • 4 Regatta I/O systems (Double the capabilities of Phase 2)
  • Open to Alternative Technology Solutions (IPF, BlueGene/L ..)
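One way to read the switch figures is through the usual latency plus size/bandwidth model of a point-to-point message. The sketch below applies that model to the quoted "Colony" numbers (20 usec, 350 MBytes/sec) and to a hypothetical "Federation" setting. The Federation values are assumptions chosen from within the quoted ranges (7.5 usec latency, four times the Colony bandwidth), and 1 MByte is taken as 10^6 bytes.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Simple linear cost model: start-up latency plus time on the wire."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which the effective bandwidth is half the asymptotic value."""
    return latency_s * bandwidth_bytes_per_s


# Quoted Phase 1 figures ("20+usec" is treated as exactly 20 usec here).
COLONY = dict(latency_s=20e-6, bandwidth_bytes_per_s=350e6)

# Assumed Phase 2 figures: mid-range latency, quadrupled bandwidth.
FEDERATION = dict(latency_s=7.5e-6, bandwidth_bytes_per_s=4 * 350e6)

for name, switch in (("Colony", COLONY), ("Federation (assumed)", FEDERATION)):
    n_half = half_bandwidth_size(**switch)
    t_small = transfer_time(1024, **switch)       # latency-dominated message
    t_large = transfer_time(1_000_000, **switch)  # bandwidth-dominated message
    print(f"{name}: n_1/2 = {n_half:.0f} bytes, "
          f"1 KB in {t_small * 1e6:.1f} usec, 1 MB in {t_large * 1e6:.0f} usec")
```

Under this model, messages much smaller than n_1/2 are dominated by latency, which is why the latency reduction matters as much as the bandwidth increase for codes that exchange many short messages.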
IBM p-series 690 Turbo: Multi-chip Module (MCM)
[Figure: four POWER4 chips (8 processors) on an MCM, with two associated memory slots. Each chip has a shared L2 cache and a distributed switch; L3 caches and memory controllers sit between the chips and the memory slots. The L3 cache is shared across all processors. 4 GX bus links provide external connections.]
Applications Performance Overview
1. Molecular Simulation (DL_POLY)
2. Computational Materials Science (CASTEP)
3. Atomic and Molecular Physics (PFARM)
4. Computational Engineering (PCHAN)
Serial (SPEC, DL) and Communication Benchmarks
Applications Performance
Serial Benchmark Summary
[Chart: performance relative to the SGI Origin 3800/R12k-400. Annotations: 6.8 X Cray T3E; 3 X SGI Origin 3800.]
SPEC CPU2000: SPECfp vs SPECfp_rate (32 CPUs)
[Chart: values relative to the IBM 690 Turbo 1.3 GHz. Series: SPECfp, SPECfp_rate.]
Interconnect Benchmark - EFF_BW
[Chart: MBytes/sec, 16 CPUs. Interconnects shown: QSNet, Myrinet2k, SCALI, Fast Ethernet.]
DL_POLY V2: Replicated Data
[Charts: performance relative to the Cray T3E/1200E vs number of CPUs, one panel per benchmark.]
• Macromolecular Simulations - Bench 7: Gramicidin in water; rigid bonds and SHAKE, 12,390 atoms, 500 time steps
• Ionic Simulations - Bench 4: NaCl; 27,000 ions, Ewald, 75 time steps, Cutoff=24Å
Migration from Replicated to Distributed data
DL_POLY-3 : Domain Decomposition
• Distribute atoms, forces across the nodes
  • More memory efficient, can address much larger cases (10^5-10^7 atoms)
• Shake and short-range forces require only neighbour communication
  • communications scale linearly with number of nodes
• Coulombic energy remains global
  • strategy depends on problem and machine characteristics
• Adopt Smooth Particle Mesh Ewald scheme
  • includes Fourier transform of the smoothed charge density (reciprocal space grid typically 64x64x64 - 128x128x128)
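The idea that short-range forces need only neighbour communication can be illustrated with a small geometric sketch. It is not DL_POLY-3 code: the function names, the cubic 200 Å box and the 2x2x2 domain grid are hypothetical, and only the 12 Å cutoff is borrowed from the DL_POLY-3 benchmark slide. Atoms are assigned to domains by position, and only atoms within one cutoff of a domain face would have to be copied to a neighbouring domain.

```python
import numpy as np


def owner_domain(positions, box, grid):
    """Return the (ix, iy, iz) index of the domain that owns each atom."""
    grid = np.asarray(grid)
    cell = box / grid                      # edge lengths of one domain
    idx = np.floor(positions / cell).astype(int)
    return np.minimum(idx, grid - 1)       # guard against a coordinate equal to box


def near_boundary(positions, box, grid, cutoff):
    """True for atoms within `cutoff` of any face of their own domain.

    Only these atoms need to be exchanged with neighbouring domains.
    """
    cell = box / np.asarray(grid)
    local = positions - owner_domain(positions, box, grid) * cell
    return np.any((local < cutoff) | (local > cell - cutoff), axis=1)


# Hypothetical test system: random atoms in a cubic periodic box.
rng = np.random.default_rng(0)
box = 200.0          # box edge in Angstrom (assumed)
grid = (2, 2, 2)     # domains per dimension (assumed)
positions = rng.uniform(0.0, box, size=(10_000, 3))

owners = owner_domain(positions, box, grid)
halo = near_boundary(positions, box, grid, cutoff=12.0)

# Atoms per domain, and the fraction each domain would have to communicate.
counts = np.bincount(np.ravel_multi_index(tuple(owners.T), grid), minlength=8)
print("atoms per domain:", counts)
print(f"fraction of atoms in halo regions: {halo.mean():.2f}")
```

For a fixed cutoff, the halo fraction falls as each domain gets larger, which is one reason the distributed-data scheme favours large systems.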
DL_POLY-3: Coulomb Energy Performance
Distributed Data SPME, with revised FFT Scheme
[Chart: performance relative to the Cray T3E/1200E vs number of CPUs. DL_POLY-3, 216,000 ions, 200 time steps, Cutoff=12Å.]
DL_POLY-3: Macromolecular Simulations
Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps
[Charts: measured time (seconds) and performance relative to the SGI Origin 3800/R14k-500, each vs number of CPUs.]
Materials Simulation. Plane Wave Methods: CASTEP
• Direct minimisation of the total energy (avoiding diagonalisation)
• Pseudopotentials must be used to keep the number of plane waves manageable
• Large number of basis functions N~10^6 (especially for heavy atoms).
The plane wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space.
• These are distributed across the processors in various ways.
• The actual FFT routines are optimized for the cache size of the processor.
CASTEP 4.2 - kG Parallel Benchmark
TiN: A 32 atom slab of TiN, 8 k points, single point energy calculation; 88,000 plane waves; 3D FFT: 108X36X36
[Chart: performance relative to the Cray T3E/1200E vs CPUs.]
Bottleneck: Data Transformation associated with 3D FFT & MPI_AlltoAllV
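To see where the all-to-all data transformation enters a distributed 3D FFT, the sketch below emulates a slab decomposition inside a single process with NumPy. It is one common scheme and not necessarily the one CASTEP uses: the function name and the choice of 4 "ranks" are assumptions, and the list-based block exchange stands in for what would be an MPI all-to-all call in a real code. Only the 108x36x36 grid size comes from the TiN benchmark.

```python
import numpy as np


def slab_fft3d(data, nranks):
    """3D FFT via slab decomposition, with the ranks emulated as list entries."""
    nx, ny, _ = data.shape
    assert nx % nranks == 0 and ny % nranks == 0

    # Step 1: each rank owns nx/nranks planes and transforms y and z locally.
    slabs = [np.fft.fftn(s, axes=(1, 2)) for s in np.split(data, nranks, axis=0)]

    # Step 2: all-to-all exchange. Rank `src` cuts its slab into blocks along y
    # and sends block `dst` to rank `dst`, so every rank ends up with all of x.
    blocks = [np.split(s, nranks, axis=1) for s in slabs]
    pencils = [
        np.concatenate([blocks[src][dst] for src in range(nranks)], axis=0)
        for dst in range(nranks)
    ]

    # Step 3: each rank completes the transform along x on its own data.
    pencils = [np.fft.fft(p, axis=0) for p in pencils]

    # Gather the distributed result only so it can be checked against fftn.
    return np.concatenate(pencils, axis=1)


rng = np.random.default_rng(1)
shape = (108, 36, 36)   # FFT grid quoted for the TiN benchmark
grid = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

out = slab_fft3d(grid, nranks=4)
print("matches np.fft.fftn:", np.allclose(out, np.fft.fftn(grid)))
```

Steps 1 and 3 are purely local, so all of the communication cost sits in step 2, where every rank exchanges a block with every other rank.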
A&M Physics: Electron-Atom Collisions
• R-matrix theory - efficient methods for investigating electron-atom and electron-molecule collisions.
• Calculation involves integration of up to 10^3 coupled channels, i.e. 2nd-order linear differential equations.
• External Region Calculation Timings:
  • Data from internal region calculations (from disk)
  • 2 stage approach - Diagonalisation [PeIGS (20%, 4K)] and functional task parallelisation (80%) - BLAS3 dominated
  • Systolic processor pipeline approach.
  • Coarse-grained parallelism ensures scalable performance.
  • Asynchronous communications minimise communication costs.
• Benchmark Example for PFARM application
External Region Calculation Timings
[Charts: elapsed time (seconds) and performance ratio vs. Cray T3E/1200E, each vs CPUs.]
Bottleneck: Matrix Diagonalisation
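A rough Amdahl-style estimate shows why a stage that accounts for a modest share of the run time can become the bottleneck. The sketch reads the 20%/80% split on the previous slide as fractions of run time, which is an interpretation rather than something the slides state, and the function and its parameters are hypothetical. The task-parallel part is assumed to scale ideally with the CPU count, while the diagonalisation speeds up by a fixed factor.

```python
def bounded_speedup(p, diag_fraction=0.2, diag_speedup=1.0):
    """Overall speedup on p CPUs when the diagonalisation scales only by diag_speedup.

    The remaining (1 - diag_fraction) of the work is assumed to scale ideally with p.
    """
    return 1.0 / (diag_fraction / diag_speedup + (1.0 - diag_fraction) / p)


for p in (16, 64, 128, 512):
    flat = bounded_speedup(p)                       # diagonalisation does not scale
    partial = bounded_speedup(p, diag_speedup=8.0)  # diagonalisation gains at most 8x
    print(f"{p:4d} CPUs: speedup {flat:5.2f} (diag 1x), {partial:6.2f} (diag 8x)")
```

With a non-scaling 20% stage the overall speedup can never exceed 5, whatever the CPU count, so improving the parallel eigensolver pays off more than tuning the part that already scales.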
Computational Engineering: UK Turbulence Consortium
• Focus on compute-intensive methods (Direct Numerical Simulation, Large Eddy Simulation, etc) for the simulation of turbulent flows.
• Shock boundary layer interaction modelling - critical for accurate aerodynamic design but still poorly understood.
• Results from a compressible turbulent channel flow benchmark using Southampton’s DNS code (PCHAN).
  • 3D compressible Navier-Stokes equations.
  • General grid transformation for complex geometries.
  • Multi-step Runge-Kutta time advancement (RK3/4).
  • High-order (4th or 6th) central difference scheme.
  • Reduced-order and stable boundary treatment.
  • Entropy splitting for Euler terms.
  • Shock capturing TVD scheme with artificial compression method.
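As an illustration of the "high-order central difference" ingredient alone, the sketch below applies the standard 4th-order central stencil for a first derivative on a periodic 1D grid and checks its order of accuracy. It is not taken from PCHAN: the function names and the sin(x) test function are assumptions, and boundary treatment, entropy splitting and time stepping are left out.

```python
import numpy as np


def ddx_central4(f, dx):
    """4th-order central first derivative on a uniform periodic grid.

    Stencil: (-f[i+2] + 8 f[i+1] - 8 f[i-1] + f[i-2]) / (12 dx)
    """
    return (8.0 * (np.roll(f, -1) - np.roll(f, 1))
            - (np.roll(f, -2) - np.roll(f, 2))) / (12.0 * dx)


def max_error(n):
    """Largest error in d/dx sin(x) on an n-point periodic grid over [0, 2*pi)."""
    x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    dx = x[1] - x[0]
    return np.max(np.abs(ddx_central4(np.sin(x), dx) - np.cos(x)))


# Halving the grid spacing should cut the error by about 2**4 = 16.
ratio = max_error(32) / max_error(64)
print(f"error ratio when the grid is refined 2x: {ratio:.1f}")
```

The stencil reaches two points either side of each grid point, so in a domain-decomposed code each subdomain would need two layers of halo data per face for this scheme.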
360^3 benchmark: Cray T3E/1200E & IBM SP/Regatta-H
Capability Computing Strategy for HPCx
• Identify “capability applications” and determine migration approach, i.e. short-term and long-term strategy
  • application workshops (full community involvement)
  • determine discipline-specific balance between capability / capacity computing
  • identify key scientific drivers amenable to capability computing
• Short-term strategy and optimisation
  • optimise key CPU and communication collectives that limit scalability of existing applications. Scalability of numerical algorithms and scope for memory-driven algorithms
  • identify and install terascaling applications from e.g. US sites and collaborators
• Longer-term strategy
  • migration from replicated data to distributed data
  • O(N) methods, optimum iterative and spectral algorithms
Terascale Applications Team
Accelerated delivery of the Federation Switch?
Components of Strategy for Capability Computing
1. Performance Attributes of Key Applications
  • Trouble-shooting with Vampir
2. Scalability of Numerical Algorithms
  • Parallel eigensolvers
3. Optimisation of Communication Collectives (EPCC)
  • MPI_ALLTOALLV and CASTEP
4. Memory-driven Approaches
  • in-core SCF & DFT, direct minimisation & CRYSTAL
5. Terascaling Applications
  • NWChem, NAMD ...
6. Migration from replicated to distributed data
  • DL_POLY-3
7. Scientific drivers amenable to Capability Computing
  • Enhanced Sampling Methods, Replica Methods
Terascale Applications Team within HPCx
Summary
• Background - Capability & Capacity Computing
• Overview of Performance of HPC’97 Applications:
  • Cray T3E/1200E and Current High-end Systems
• Prospects for Capability Computing
  • Molecular Simulation (Distributed Data, DL_POLY 3)
  • Computational Materials (CASTEP v4.2)
  • Atomic and Molecular Physics (PFARM)
  • Computational Engineering (PCHAN)
• Applications Strategy for HPCx Capability Computing
  • Characterisation, Performance attributes and Migration
  • Widespread user consultation - capability & capacity applications
  • Enhanced single CPU utilisation
  • Short term and longer term strategy
• Acknowledgements (AV Team, IBM, ORNL, PSC, SARA)