How We Use MPI: A Naïve Curmudgeon’s View
Bronson Messer
Scientific Computing Group, Leadership Computing Facility, National Center for Computational Sciences, Oak Ridge National Laboratory
Theoretical Astrophysics Group, Oak Ridge National Laboratory
Department of Physics & Astronomy, University of Tennessee, Knoxville

Why do we (I and other idiot astrophysicists) use MPI?
• It is ubiquitous!
• … and everywhere it exists, it performs
  • OK, ‘performs’ connotes good performance, but ‘poor’ performance on a given platform is always met with alarm
  • AND, we have now figured out how to ameliorate some shortcomings in performance through avoidance
• … and it’s pretty darn easy to use
  • Even ‘modern’ (i.e. grew up with their notion of ‘computer’ meaning ‘information appliance’) grad students can figure out how to program poorly using MPI in a matter of days
• That’s it!
• Importantly, right now our need for expressiveness is close to being met

Selected Petascale Science Drivers
We Have Worked With Science Teams to Understand and Define Specific Science Objectives

Science Workload
Job Sizes and Resource Usage of Key Applications
• Total aggregate allocation for CHIMERA production & GenASiS development this FY: 38 million CPU-hours (16M INCITE, 18M NSF, 4M NERSC)

Current Planned Pioneering Application Runs
Simulation Specs on the 250 TF Jaguar System*
• Multi-physics applications are very good present-day laboratories for multi-core ideas.

Current Workhorse: mCHIMERA
• Ray-by-ray MGFLD transport (E_ν)
• 3D (magneto)hydrodynamics
• 150 species nuclear network
• Bruenn et al. (2006), Messer et al. (2007)
Possible Future Workhorse: bCHIMERA
• Ray-by-ray Boltzmann transport (E_ν, θ)
• 3D (magneto)hydrodynamics
• 150-300 species nuclear network
The “Ultimate Goal”
• Full 3D Boltzmann transport (E_ν, θ, φ)
• 3D (magneto)hydrodynamics
• 150-300 species nuclear network

Pioneering Application: CHIMERA*
Physical Models and Algorithms
Physical Models
• A “chimera” of three separate yet mature codes
• Coupled into a single executable
• Three primary modules (“heads”)
  • MVH3: Stellar gasdynamics
  • MGFLD-TRANS: “ray-by-ray-plus” neutrino transport
  • XNET: thermonuclear kinetics
• The heads are augmented by
  • Sophisticated equation of state for nuclear matter
  • Self-gravity solver capable of an approximation to general-relativistic gravity
Numerical Algorithms
• Directionally-split hydrodynamics with a standard Riemann solver for shock capturing
• Solutions for ray-by-ray neutrino transport and thermonuclear kinetics are obtained during the radial hydro sweep
  • All necessary data for those modules is local to a processor during the radial sweep
  • Computed along each radial ray using only data that is local to that ray
• Physics modules are coupled with standard operator-splitting
  • Valid because characteristic time scales for each module are widely disparate
• Neutrino transport solution
  • Sparse linear solve local to a ray
• Nuclear burning solution
  • Dense linear solve local to a zone
[Figure: Early-time distribution of entropy in 2D exploding core collapse simulation]
* Conservative Hydrodynamics Including Multi-Energy Ray-by-ray Transport

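As a rough illustration of the operator-splitting idea on the slide above (and not CHIMERA's actual scheme), the following Python sketch advances one hypothetical scalar quantity that is subject to a slow process and a fast process, applying one sub-step per process in sequence. The rate constants, step counts, and function names are all assumptions chosen for illustration.

```python
import math

def split_step(u, dt, k_slow, k_fast):
    """Advance u by dt for du/dt = -(k_slow + k_fast) * u with sequential sub-steps."""
    # Sub-step 1: slow process only, explicit (forward) Euler.
    u = u * (1.0 - k_slow * dt)
    # Sub-step 2: fast process only, implicit (backward) Euler.
    u = u / (1.0 + k_fast * dt)
    return u

def integrate(u0, t_end, n_steps, k_slow, k_fast):
    """Apply n_steps split steps of equal size from t = 0 to t = t_end."""
    dt = t_end / n_steps
    u = u0
    for _ in range(n_steps):
        u = split_step(u, dt, k_slow, k_fast)
    return u

# Hypothetical rates: the two processes differ in time scale by a factor of 50.
K_SLOW, K_FAST = 1.0, 50.0
T_END = 0.1

coarse = integrate(1.0, T_END, 100, K_SLOW, K_FAST)
fine = integrate(1.0, T_END, 1000, K_SLOW, K_FAST)
exact = math.exp(-(K_SLOW + K_FAST) * T_END)

print(f"100 steps : {coarse:.6f}")
print(f"1000 steps: {fine:.6f}")
print(f"exact     : {exact:.6f}")
```

Each sub-step sees only its own process, which is what lets separate modules be coupled without a joint solve; the price is an error that shrinks as the time step is reduced.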
CHIMERA is:
• a “chimera” of 3 separate, mature codes
• VH1 (MVH3)
  • Multidimensional hydrodynamics
  • http://wonka.physics.ncsu.edu/pub/VH-1/
  • non-polytropic EOS
  • 3D domain decomposition
  • uses directional sweeps to define subcommunicators for data transpose (MPI_alltoall)
  • results in all processes performing ‘several_to_several’

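To see why an all-to-all exchange inside a subcommunicator acts as a data transpose, here is a minimal pure-Python simulation of the exchange pattern. It uses no MPI library; the function name `alltoall`, the four ranks, and the block labels are assumptions for illustration.

```python
def alltoall(send):
    """Simulate the all-to-all exchange pattern within one communicator.

    send[i][j] is the block that rank i sends to rank j.
    The result recv[j][i] is the block that rank j received from rank i.
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

# Four hypothetical ranks; rank i starts with row i of a 4x4 array of blocks.
n_ranks = 4
send = [[f"b{i}{j}" for j in range(n_ranks)] for i in range(n_ranks)]

recv = alltoall(send)

# After the exchange, rank j holds column j of the original block array.
for rank, blocks in enumerate(recv):
    print(rank, blocks)
```

Because only the members of one subcommunicator take part, each process exchanges data with several peers rather than with every process in the job, which matches the ‘several_to_several’ description.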
MVH3: Dicing instead of slicing
• Using M*N processors; X data starts local to each processor
• Local data, zro(imax,js,ks), includes all of the X domain, but only portions of Y and Z
• Row and column subcommunicators are defined from the rank mype:
    jcol = mod(mype,N)
    krow = mype/N   ! integer division
    call mpi_comm_split(mpi_comm_world, krow, mype, mpi_comm_row, ierr)
    call mpi_comm_split(mpi_comm_world, jcol, mype, mpi_comm_col, ierr)
• Y hydro is done after transposing data only with processors with the same value of krow: transposing I and J but keeping K constant
[Figure: M × N processor grid labeled jcol = 0 … N-1 and krow = 0 … M-1, with the transposes marked MPI_ALLTOALL( MPI_COMM_ROW ) and MPI_ALLTOALL( MPI_COMM_COL )]

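The grouping produced by the two splits can be checked without MPI. The sketch below mimics only the color/key grouping rule of a communicator split in plain Python; the helper `comm_split` and the 3 × 4 grid size are assumptions for illustration.

```python
from collections import defaultdict

def comm_split(colors):
    """Group ranks by color; within a group, ranks stay in ascending order,
    which is the ordering a split with key = mype would give."""
    groups = defaultdict(list)
    for rank, color in enumerate(colors):
        groups[color].append(rank)
    return dict(groups)

M, N = 3, 4                      # hypothetical M x N processor grid
ranks = range(M * N)

jcol = [mype % N for mype in ranks]    # jcol = mod(mype, N)
krow = [mype // N for mype in ranks]   # krow = mype / N (integer division)

row_comms = comm_split(krow)     # color = krow: ranks that share a row
col_comms = comm_split(jcol)     # color = jcol: ranks that share a column

print("row communicators:", row_comms)
print("col communicators:", col_comms)
```

With this layout every rank belongs to exactly one row communicator of N members and one column communicator of M members, so each transpose involves only N or M processes instead of all M*N.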
MGFLD-TRANS
• Multi-group (energy) neutrino radiation hydro solver
• GR corrections
• 4 neutrino flavors with many modern interactions included
• flux limiter is “tuned” from Boltzmann transport simulations

XNET
• Nuclear kinetics solver
• Currently have implemented only an α network
• 150 species to be included in future simulations
• Custom interface routine written for CHIMERA
• All else is ‘stock’

How does CHIMERA work?
[Diagram of CHIMERA’s components VH1/MVH3, MGFLD-TRANS, and XNET; labels: r, ϑ, φ, ν]

Example: XNET performance and implementation
• XNET runs at ~50% of peak on a single XT4 processor
• Roughly 50% Jacobian build / 50% dense solve
• 1 XNET solve is required per SPATIAL ZONE (i.e. hundreds per ray)
• Best load balancing on a node (with OpenMP or a subcommunicator) comes from interleaving the zones
[Figure: zones along a radial ray from r=0 (hot, lots of burning) to r=rmax (cool, little burning), labeled 1 2 3 4 1 2 3 4]

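The benefit of interleaving can be seen with a toy cost model. The sketch below compares contiguous and round-robin assignment of zones to four workers; the linear fall-off of cost with radius, the zone count, and all function names are assumptions for illustration, not measured XNET costs. Workers are numbered from 0 here, whereas the slide's figure labels them 1 to 4.

```python
def block_assign(n_zones, n_workers):
    """Contiguous chunks: worker 0 gets the innermost zones, and so on."""
    per_worker = -(-n_zones // n_workers)  # ceiling division
    return [zone // per_worker for zone in range(n_zones)]

def interleaved_assign(n_zones, n_workers):
    """Round-robin: zones 0, 1, 2, 3, 4, ... go to workers 0, 1, 2, 3, 0, ..."""
    return [zone % n_workers for zone in range(n_zones)]

def loads(costs, owners, n_workers):
    """Total cost assigned to each worker."""
    totals = [0.0] * n_workers
    for cost, owner in zip(costs, owners):
        totals[owner] += cost
    return totals

# Hypothetical cost model: burning work falls off linearly from r=0 to r=rmax.
n_zones, n_workers = 16, 4
costs = [float(n_zones - zone) for zone in range(n_zones)]

block = loads(costs, block_assign(n_zones, n_workers), n_workers)
inter = loads(costs, interleaved_assign(n_zones, n_workers), n_workers)

print("block      :", block, "max/min =", max(block) / min(block))
print("interleaved:", inter, "max/min =", max(inter) / min(inter))
```

Under this cost model the contiguous split leaves the worker holding the innermost zones with several times the work of the outermost one, while the interleaved split gives every worker a mix of hot and cool zones.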
A lot of “big” codes don’t really stress the XT network
• Distribution in this space depends upon the applications and the problem being simulated for a given application
[Figure: 2007 INCITE applications (CHIMERA, S3D, MADNESS, PFLOTRAN, POP, GTC, DCA++) placed on axes of Communication (0% to 100%) versus Computation (0% to 100%)]

Relative Per Core Performance
* Only hydrodynamics module used in benchmark

GenASiS development
• GenASiS is not completely “wed” to a programming model yet
• Lots of abstraction
  • Function overloading used everywhere in the code
  • Many implementations are possible ‘under the hood’
• Full, 3D rad-hydro simulations will require an exascale computer in any event, so we have time… (Why, they couldn’t hit an elephant at this dis… [Gen. John Sedgwick, 1864])

Opinions and questions
• Ubiquity and performance are go/no-go metrics for any future methods/languages/ideas.
  • Does this present a chicken/egg conundrum: must things be built and tested on architectures unready to exhibit the expected performance?
• Are the users of an exascale machine the present users of petascale-ish platforms? Is the mapping one-to-one?
• Writing code from scratch is not anathema, but you’re lucky if you can afford to do it.
  • Even then, design decisions are often made during this process based not on wise reflection, but on attempts to snag the proverbial (but elusive) low-hanging fruit.