Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker, Jonathan Carter, Andrew Canning, John Shalf
Lawrence Berkeley National Laboratories
Stephane Ethier
Princeton Plasma Physics Laboratory
http://crd.lbl.gov/~oliker
Overview
• Superscalar cache-based architectures dominate the HPC market
• Leading architectures are commodity-based SMPs, due to generality and perception of cost effectiveness
• Growing gap between peak and sustained performance is well known in scientific computing
• Modern parallel vector systems may bridge this gap for many important applications
• In April 2002, the Earth Simulator (ES) became operational:
  • Peak ES performance > all DOE and DOD systems combined
  • Demonstrated high sustained performance on demanding scientific apps
• Conducting evaluation study of scientific applications on modern vector systems
• 09/2003: MOU between ES and NERSC was completed
  • First visit to ES center: December 8th-17th, 2003 (ES remote access not available)
  • First international team to conduct performance evaluation study at ES
• Examining best mapping between demanding applications and leading HPC systems - one size does not fit all
Vector Paradigm
• High memory bandwidth
  • Allows systems to effectively feed ALUs (high byte to flop ratio)
• Flexible memory addressing modes
  • Supports fine grained strided and irregular data access
• Vector registers
  • Hide memory latency via deep pipelining of memory loads/stores
• Vector ISA
  • Single instruction specifies large number of identical operations
• Vector architectures allow for:
  • Reduced control complexity
  • Efficient utilization of a large number of computational resources
  • Potential for automatic discovery of parallelism
• However: most effective if sufficient regularity is discoverable in program structure
  • Suffers even if a small % of code is non-vectorizable (Amdahl’s Law)
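The Amdahl's Law point can be made concrete with a short calculation. The sketch below is illustrative only: the function name and the sample 8x vector speedup are assumptions chosen for the example, not figures measured in this study.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only `vector_fraction` of the runtime is accelerated.

    vector_fraction: share of the original runtime that vectorizes (0.0 to 1.0).
    vector_speedup:  speedup of the vectorized share (hypothetical value).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must lie in [0, 1]")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    scalar_time = 1.0 - vector_fraction
    vector_time = vector_fraction / vector_speedup
    return 1.0 / (scalar_time + vector_time)


if __name__ == "__main__":
    # Assumed 8x speedup on the vectorized share; the fraction is varied.
    for fraction in (0.50, 0.90, 0.99, 1.00):
        print(f"{fraction:.0%} vectorized: {amdahl_speedup(fraction, 8.0):.2f}x overall")
```

Under these assumed numbers, leaving 10% of the runtime unvectorized already cuts the overall gain from 8x to under 5x.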
Architectural Comparison
• Custom vector architectures have:
  • High memory bandwidth relative to peak
  • Superior interconnect: latency, point to point, and bisection bandwidth
• Overall, ES appears as the most balanced architecture, while Altix shows the best architectural balance among superscalar architectures
• A key ‘balance point’ for vector systems is the scalar:vector ratio
Applications studied

Code      Discipline         Lines     Structure            Method
LBMHD     Plasma Physics     1,500     grid based           Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS    Astrophysics       100,000   grid based           Solves Einstein’s equations of general relativity
PARATEC   Material Science   50,000    Fourier space/grid   Density Functional Theory electronic structures codes
GTC       Magnetic Fusion    5,000     particle based       Particle in cell method for gyrokinetic Vlasov-Poisson equation

• Applications chosen with potential to run at ultrascale
• Computations contain abundant data parallelism
• ES runs must clear minimum parallelization and vectorization hurdles
• Codes originally designed for superscalar systems
• Ported onto a single node of the SX6; first multi-node experiments performed at the ES Center (ESC)
Plasma Physics: LBMHD
• LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
• Performs 2D simulation of high temperature plasma
• Evolves from initial conditions, decaying to form current sheets
• 2D spatial grid is coupled to octagonal streaming lattice
• Block distributed over 2D processor grid
(Figure: current density decay of two cross-shaped structures)
• Main computational components:
  • Collision: requires coefficients for local gridpoint only, no communication
  • Stream: values at gridpoints are streamed to neighbors; at cell boundaries, information is exchanged via MPI
  • Interpolation: step required between spatial and stream lattices
• Developed by George Vahala’s group at the College of William and Mary, ported by Jonathan Carter
LBMHD: Porting Details
(Figure, left: octagonal streaming lattice coupled with square spatial grid; right: example of diagonal streaming vector updating three spatial cells)
• Collision routine rewritten:
  • For ES, loop ordering switched so the gridpoint loop (~1000 iterations) is inner rather than the velocity or magnetic field loops (~10 iterations)
  • X1 compiler made this transformation automatically: multistreaming outer loop and vectorizing (via strip mining) inner loop
  • Temporary arrays padded to reduce bank conflicts
• Stream routine performs well:
  • Array shift operations, block copies, 3rd-degree polynomial evaluation
• Boundary value exchange:
  • MPI_Isend, MPI_Irecv pairs
  • Further work: plan to use ES "global memory" to remove message copies
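The loop reordering described above can be sketched on a toy kernel. This is not the LBMHD collision routine: the relaxation formula, the function names, and the strip length of 256 are assumptions chosen only to show the interchange (long gridpoint loop moved inside) and strip mining. Python itself does not vectorize these loops; the sketch shows only the loop structure a vectorizing compiler would target.

```python
def relax_gridpoint_outer(f, omega):
    """Original ordering: gridpoint loop outside, short direction loop inside."""
    n_dir, n_pts = len(f), len(f[0])
    out = [row[:] for row in f]
    for p in range(n_pts):
        total = 0.0
        for d in range(n_dir):            # short inner loop (~10 iterations)
            total += f[d][p]
        mean = total / n_dir
        for d in range(n_dir):
            out[d][p] = f[d][p] + omega * (mean - f[d][p])
    return out


def relax_gridpoint_inner(f, omega, strip=256):
    """Interchanged ordering: long gridpoint loop inside, processed in strips."""
    n_dir, n_pts = len(f), len(f[0])
    out = [row[:] for row in f]
    total = [0.0] * n_pts                 # temporary array, one entry per gridpoint
    for d in range(n_dir):
        for p in range(n_pts):            # long inner loop (~1000 iterations)
            total[p] += f[d][p]
    for d in range(n_dir):
        for start in range(0, n_pts, strip):      # strip mining
            stop = min(start + strip, n_pts)
            for p in range(start, stop):          # one vector-length strip
                mean = total[p] / n_dir
                out[d][p] = f[d][p] + omega * (mean - f[d][p])
    return out


if __name__ == "__main__":
    grid = [[float((3 * d + p) % 7) for p in range(1000)] for d in range(9)]
    same = relax_gridpoint_outer(grid, 0.5) == relax_gridpoint_inner(grid, 0.5)
    print("orderings agree:", same)
```

Both orderings perform the same arithmetic per gridpoint, so the results match; only the loop that a vectorizing compiler would see as innermost changes.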
LBMHD: Performance
• ES achieves highest performance to date: over 3.3 Tflops for P=1024
• X1 comparable absolute speed up to P=64 (lower % peak)
  • But performs 1.5X slower at P=256 (decreased scalability)
  • CAF improved X1 to slightly exceed ES at P=64 (up to 4.70 Gflop/P)
• ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
• Low CI (1.5) and high memory requirement (30GB) hurt scalar performance
• Altix best scalar due to: high memory bandwidth, fast interconnect
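As a rough check on the headline number, the aggregate rate can be converted to a per-processor rate and a fraction of peak. The slide gives only "over 3.3 Tflops", so the result is a lower bound. The 8 Gflop/s per-processor peak for ES is an assumption brought in from outside these slides, and the function name is hypothetical.

```python
ES_PEAK_GFLOPS_PER_PROC = 8.0  # assumption: not stated in the slides


def per_processor_gflops(total_tflops, procs):
    """Convert an aggregate Tflop/s figure into Gflop/s per processor."""
    return total_tflops * 1000.0 / procs


if __name__ == "__main__":
    rate = per_processor_gflops(3.3, 1024)      # figure quoted for LBMHD on ES
    fraction = rate / ES_PEAK_GFLOPS_PER_PROC
    print(f"at least {rate:.2f} Gflop/s per processor, about {fraction:.0%} of assumed peak")
```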
LBMHD on X1: MPI vs CAF
• X1 well-suited for one-sided parallel languages (globally addressable memory)
  • MPI hinders this feature and requires scalar tag matching
• CAF allows much simpler coding of boundary exchange (array subscripting):
  feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
  • MPI requires non-contiguous data copies into buffer, unpacked at destination
• Since communication accounts for about 10% of LBMHD, only slight improvements
• However, for P=64 on the 4096^2 grid, performance degrades. Tradeoffs:
  • CAF reduced total message volume 3X (eliminates user and system buffer copy)
  • But CAF used more numerous and smaller sized messages
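The difference between the two exchange styles can be mimicked in a single process. In this sketch, plain Python lists stand in for the per-processor blocks, and no real MPI or CAF runtime is involved. All names (`make_image`, `exchange_two_sided`, `exchange_one_sided`) and the grid layout are hypothetical.

```python
def make_image(rank, ni, nj):
    """Local block stored as block[j][i]; i = 0 is a ghost column, i = 1..ni is interior."""
    return [[float(100 * rank + 10 * j + i) for i in range(ni + 1)] for j in range(nj)]


def exchange_two_sided(images, me, prev, ni):
    """MPI-style: pack the strided boundary, transfer a buffer, unpack at the destination."""
    send_buffer = [row[ni] for row in images[prev]]   # user-level pack on the sender
    recv_buffer = list(send_buffer)                   # stands in for the message transfer
    for j, value in enumerate(recv_buffer):           # user-level unpack on the receiver
        images[me][j][0] = value


def exchange_one_sided(images, me, prev, ni):
    """CAF-style: read the neighbor's boundary directly, with no intermediate buffers."""
    for j in range(len(images[me])):
        images[me][j][0] = images[prev][j][ni]


if __name__ == "__main__":
    two_sided = [make_image(rank, 4, 3) for rank in range(2)]
    one_sided = [make_image(rank, 4, 3) for rank in range(2)]
    exchange_two_sided(two_sided, me=1, prev=0, ni=4)
    exchange_one_sided(one_sided, me=1, prev=0, ni=4)
    print([row[0] for row in two_sided[1]], [row[0] for row in one_sided[1]])
```

Both routines fill the same ghost column; the one-sided version simply drops the pack and unpack copies, which is the simplification the CAF statement above expresses with array subscripting.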
Astrophysics: CACTUS
• Numerical solution of Einstein’s equations from the theory of general relativity
• Among the most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
• CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
(Figure: visualization of grazing collision of two black holes)
• Evolves PDEs on a regular grid using finite differences
  • Communication at boundaries
  • Expect high parallel efficiency
• Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along the time dimension
• Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about the Universe
• Gravitational waves: ripples in spacetime curvature, caused by matter motion, causing distances to change
• Developed at Max Planck Institute, vectorized by John Shalf
CACTUS: Performance
• ES achieves fastest performance to date: 45X faster than Power3!
  • Vector performance related to x-dim (vector length)
  • Excellent scaling on ES using fixed data size per proc (weak scaling)
  • Scalar performance better on smaller problem size (cache effects)
• X1 surprisingly poor (4X slower than ES) - low scalar:vector ratio
  • Unvectorized boundary condition required 15% of runtime on ES and 30+% on X1
  • < 5% for the scalar version: unvectorized code can quickly dominate cost
• Poor superscalar performance despite high computational intensity
  • Register spilling due to large number of loop variables
  • Prefetch engines inhibited due to multi-layer ghost zone calculations
Material Science: PARATEC
• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials & a plane wave basis set
• Density Functional Theory (DFT) to calculate structure & electronic properties of new materials
• DFT calculations are one of the largest consumers of supercomputer cycles in the world
(Figure: induced current and charge density in crystallized glycine)
• Uses all-band CG approach to obtain wavefunction of electrons
• 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
• Part of calculation in real space, other in Fourier space
  • Uses specialized 3D FFT to transform wavefunction
• Computationally intensive - generally obtains high percentage of peak
• Developed by Andrew Canning with Louie and Cohen’s groups (UCB, LBNL)
PARATEC: Wavefunction Transpose
(Figure: panels (a)-(f) showing the stages of the transpose)
• Transpose from Fourier to real space
• 3D FFT done via 3 sets of 1D FFTs and 2 transposes
• Most communication in global transpose (b) to (c); little communication in (d) to (e)
• Many FFTs done at the same time to avoid latency issues
• Only non-zero elements communicated/calculated
• Much faster than vendor 3D-FFT
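The decomposition of a 3D transform into three sets of 1D transforms separated by two transposes can be sketched serially. This is not PARATEC's specialized FFT: it uses a naive O(n^2) DFT in place of an FFT, runs in one process, and omits the non-zero-element optimization. All function names are hypothetical.

```python
import cmath


def dft_1d(v):
    """Naive 1D discrete Fourier transform (stand-in for a 1D FFT)."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft_last_axis(a):
    """Apply the 1D transform along the last (contiguous) axis of a 3D array."""
    return [[dft_1d(line) for line in plane] for plane in a]


def rotate_axes(a):
    """Transpose step: return b with b[k][i][j] = a[i][j][k]."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)] for k in range(n2)]


def dft_3d_by_transposes(a):
    """3 sets of 1D transforms and 2 transposes; result is indexed [ky][kz][kx]."""
    a = dft_last_axis(a)       # transform along z, layout [x][y][kz]
    a = rotate_axes(a)         # layout [kz][x][y]
    a = dft_last_axis(a)       # transform along y, layout [kz][x][ky]
    a = rotate_axes(a)         # layout [ky][kz][x]
    return dft_last_axis(a)    # transform along x, layout [ky][kz][kx]


def dft_3d_direct(a):
    """Reference: direct triple-sum 3D DFT, indexed [kx][ky][kz]."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])

    def coeff(kx, ky, kz):
        return sum(a[x][y][z]
                   * cmath.exp(-2j * cmath.pi * (kx * x / nx + ky * y / ny + kz * z / nz))
                   for x in range(nx) for y in range(ny) for z in range(nz))

    return [[[coeff(kx, ky, kz) for kz in range(nz)] for ky in range(ny)]
            for kx in range(nx)]
```

Note that the output stays in the rotated layout rather than being transposed back, which is why two transposes suffice for three transform passes.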
PARATEC: Performance
• ES achieves fastest performance to date! Over 2 Tflop/s on 1024 procs
  • Main advantage for this type of code is the fast interconnect system
• X1 3.5X slower than ES (although peak is 50% higher)
  • Non-vectorizable code can be much more expensive on X1 (32:1 vs 8:1)
  • Lower bisection bandwidth to computation ratio
• Limited scalability due to increasing cost of global transpose and reduced vector length
  • Plan to run larger problem size on next ES visit
• Scalar architectures generally perform well due to high computational intensity
  • Power3, Power4, Altix are 8X, 4X, 1.5X slower than ES
• Vector architectures allow the opportunity to simulate systems not possible on scalar platforms
Magnetic Fusion: GTC
• Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
• Goal of magnetic fusion is a burning plasma power plant producing cleaner energy
• GTC solves 3D gyroaveraged gyrokinetic system with particle-in-cell (PIC) approach
• PIC scales as N instead of N^2: particles interact with electromagnetic field on grid
• Allows solving equation of particle motion with ODEs (instead of nonlinear PDEs)
• Main computational tasks:
  • Scatter: deposit particle charge to nearest point
  • Solve: Poisson equation to get potential for each point
  • Gather: calculate force based on neighbors’ potential
  • Move: particles by solving equation of motion
  • Shift: particles moved outside local domain
(Figure: 3D visualization of electrostatic potential in magnetic fusion device)
• Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
GTC: Scatter operation
• Particle charge deposited amongst nearest grid points
• Calculate force based on neighbors’ potential, then move particle accordingly
• Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
• Solution: VLEN copies of charge deposition array, with reduction after main loop
  • However, greatly increases memory footprint (8X)
• Since particles are randomly localized, scatter also hinders cache reuse
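The private-copy workaround can be sketched in a few lines. This is a simplified illustration, not GTC's deposition routine: each particle deposits its whole charge on a single grid cell, the vector length of 4 is arbitrary, and the function names are hypothetical.

```python
def deposit_naive(cells, charges, n_grid):
    """Direct scatter: particles hitting the same cell create a write dependency."""
    grid = [0.0] * n_grid
    for cell, charge in zip(cells, charges):
        grid[cell] += charge
    return grid


def deposit_with_copies(cells, charges, n_grid, vlen=4):
    """Scatter into `vlen` private grid copies, then reduce them after the main loop."""
    copies = [[0.0] * n_grid for _ in range(vlen)]    # memory grows by a factor of vlen
    for start in range(0, len(cells), vlen):
        lanes = min(vlen, len(cells) - start)
        for lane in range(lanes):                     # one particle per vector lane
            p = start + lane
            copies[lane][cells[p]] += charges[p]      # each lane writes only its own copy
    # Reduction: sum the private copies into the final charge array.
    return [sum(copies[lane][g] for lane in range(vlen)) for g in range(n_grid)]


if __name__ == "__main__":
    cells = [0, 0, 1, 0, 2, 2, 0, 1, 1]
    charges = [1.0] * len(cells)
    print(deposit_naive(cells, charges, 3))
    print(deposit_with_copies(cells, charges, 3))
```

Within one strip, no two lanes touch the same array, so the inner loop carries no dependency; the price is the extra memory for the copies and the final reduction pass.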
GTC: Performance
• ES achieves fastest performance of any tested architecture!
  • First time code achieved 20% of peak - compared with less than 10% on superscalar systems
  • Vector hybrid (OpenMP) parallelism not possible due to increased memory requirements
  • P=64 on ES is 1.6X faster than P=1024 on Power3!
  • Reduced scalability due to decreasing vector length, not MPI performance
• Non-vectorizable code portions expensive on X1
  • Before vectorization, shift routine accounted for 11% of ES and 54% of X1 overhead
• Larger tests could not be performed at ES due to parallelization/vectorization hurdles
  • Currently developing new version with increased particle decomposition
• Advantage of ES for PIC codes may reside in higher statistical resolution simulations
  • Greater speed allows more particles per cell
Overview
• Tremendous potential of vector architectures: 4 codes running faster than ever before
• Vector systems allow resolution not possible with scalar architectures (regardless of # procs)
• Opportunity to perform scientific runs at unprecedented scale
• ES shows high raw and much higher sustained performance compared with X1
  • Limited X1-specific optimization - optimal programming approach still unclear (CAF, etc.)
  • Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
• Evaluation codes contain sufficient regularity in computation for high vector performance
  • GTC is an example of a code at odds with data-parallelism
  • Much more difficult to evaluate codes poorly suited for vectorization
• Vectors potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
• Plan to expand scope of application domains/methods, and examine latest HPC architectures
Second ES visit
• Evaluate high-concurrency PARATEC performance using large-scale Quantum Dot simulation
• Evaluate CACTUS performance using updated vectorization of radiation boundary condition
• Evaluate MADCAP performance using a newly optimized version, without global file system requirements and with improved I/O behavior
• Examine 3D version of LBMHD, and explore optimization strategies
• Evaluate GTC performance using updated vectorization of shift routine as well as new particle decomposition approach designed to increase concurrency
• Evaluate performance of FVCAM3 (Finite Volume atmospheric model) at high concurrencies and resolutions (1 x 1.25, 0.5 x 0.625, 0.25 x 0.375)
Papers available at http://crd.lbl.gov/~oliker