HFODD for Leadership Class Computers
N. Schunck, J. McDonnell, Hai Ah Nam
HFODD for Leadership Class Computers
DFT AND HPC COMPUTING
Classes of DFT solvers
• Coordinate-space: direct integration of the HFB equations
  • Accurate: provides the "exact" result
  • Slow and CPU/memory intensive for 2D-3D geometries
• Configuration space: expansion of the solutions on a basis (HO)
  • Fast and amenable to beyond-mean-field extensions
  • Truncation effects: source of divergences/renormalization issues
  • Wrong asymptotics unless different bases are used (WS, PTG, Gamow, etc.)
[Table: resources needed for a "standard HFB" calculation]
Why High Performance Computing?
Core of DFT: a global theory that averages out individual degrees of freedom
• From light nuclei to neutron stars
• Rich physics
• Fast and reliable
• Treatment of correlations?
• ~100 keV-level precision?
• Extrapolability?
The ground state of an even nucleus can be computed in a matter of minutes on a standard laptop: why bother with supercomputing?
• Large-scale DFT:
  • Static: fission, shape coexistence, etc. – compute > 100k different configurations
  • Dynamics: restoration of broken symmetries, correlations, time-dependent problems – combine > 100k configurations
  • Optimization of extended functionals on larger sets of experimental data
Computational Challenges for DFT
• Self-consistency = iterative process:
  • Not naturally prone to parallelization (suggests: lots of thinking…)
  • Computational cost: (number of iterations) × (cost of one iteration) + O(everything else)
• Cost of symmetry breaking: triaxiality, reflection asymmetry, time-reversal invariance
  • Large dense matrices (LAPACK) constructed and diagonalized many times – sizes of the order of (2,000 × 2,000) to (10,000 × 10,000) (suggests: message passing; see the sketch below)
  • Many long loops (suggests: threading)
• Finite-range forces/non-local functionals: exact Coulomb, Yukawa- and Gogny-like
  • Many nested loops (suggests: threading)
  • Precision issues
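As a concrete illustration of the repeated dense diagonalizations, here is a minimal LAPACK sketch; the matrix contents, dimension, and program name are placeholders and are not taken from HFODD:

  ! Minimal sketch: diagonalize one dense symmetric matrix with LAPACK,
  ! as a DFT solver does repeatedly at each self-consistent iteration.
  ! The matrix filled below is a stand-in for the HFB matrix.
  program diag_sketch
    implicit none
    integer, parameter :: n = 2000            ! illustrative dimension
    double precision, allocatable :: a(:,:), w(:), work(:)
    integer, allocatable :: iwork(:)
    integer :: lwork, liwork, info, i, j

    allocate(a(n,n), w(n))
    do j = 1, n                               ! placeholder symmetric matrix
       do i = 1, n
          a(i,j) = 1.0d0 / dble(i + j)
       end do
    end do

    allocate(work(1), iwork(1))               ! workspace query
    call dsyevd('V', 'L', n, a, n, w, work, -1, iwork, -1, info)
    lwork = int(work(1)); liwork = iwork(1)
    deallocate(work, iwork); allocate(work(lwork), iwork(liwork))

    ! Eigenvalues returned in w, eigenvectors overwrite a
    call dsyevd('V', 'L', n, a, n, w, work, lwork, iwork, liwork, info)
    if (info /= 0) stop 'dsyevd failed'
  end program diag_sketch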
HFODD
• Solves the HFB equations in the deformed, Cartesian HO basis
• Breaks all symmetries (if needed)
• Zero-range and finite-range forces coded
• Additional features: cranking, angular momentum projection, etc.
• Technicalities:
  • Fortran 77, Fortran 90
  • BLAS, LAPACK
  • I/O through standard input/output + a few files
Render unto Caesar the things that are Caesar's
HFODD for Leadership Class Computers
OPTIMIZATIONS
Loop reordering
• Fortran: matrices are stored in memory column-wise; elements must be accessed first by column index, then by row index (good stride)
• The cost of bad stride grows quickly with the number of indices and the dimensions
• Example, accessing M(i,j,k) (see the sketch below):
    bad stride:   do i = 1, N / do j = 1, N / do k = 1, N   (first index in the outermost loop)
    good stride:  do k = 1, N / do j = 1, N / do i = 1, N   (first index in the innermost loop)
[Figure: time of 10 HF iterations as a function of the model space (Skyrme SLy4, 208Pb, HF, exact Coulomb exchange)]
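A minimal sketch of the reordering; the array name and size are illustrative only:

  program stride_sketch
    implicit none
    integer, parameter :: n = 200
    double precision, allocatable :: m(:,:,:)
    double precision :: s
    integer :: i, j, k

    allocate(m(n,n,n))
    m = 1.0d0
    s = 0.0d0

    ! Bad stride (commented out): i, the fastest-varying index in
    ! column-major storage, sits in the outermost loop, so consecutive
    ! iterations jump through memory.
    ! do i = 1, n
    !    do j = 1, n
    !       do k = 1, n
    !          s = s + m(i,j,k)

    ! Good stride: i is the innermost loop, so memory is read contiguously.
    do k = 1, n
       do j = 1, n
          do i = 1, n
             s = s + m(i,j,k)
          end do
       end do
    end do
    print *, s
  end program stride_sketch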
Threading (OpenMP)
• OpenMP is designed to automatically parallelize loops
• Example: calculation of the density matrix in the HO basis, a triple loop of the form
    do j = 1, N
      do i = 1, N
        do k = 1, N     ! summation index
• Solutions (see the sketch below):
  • Thread it with OpenMP
  • When possible, replace such manual linear algebra with BLAS/LAPACK calls (threaded versions exist)
[Figure: time of 10 HFB iterations as a function of the number of threads (Jaguar Cray XT5 – Skyrme SLy4, 152Dy, HFB, 14 full shells)]
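A minimal sketch of the two options for a density-like contraction rho(i,j) = Σ_k v(i,k) v(j,k); the array names and the choice of dgemm as the BLAS replacement are illustrative assumptions:

  program thread_sketch
    implicit none
    integer, parameter :: n = 1000
    double precision, allocatable :: v(:,:), rho(:,:)
    integer :: i, j, k

    allocate(v(n,n), rho(n,n))
    v = 1.0d-3

    ! Option 1: thread the manual loops with OpenMP
    !$omp parallel do private(i, k)
    do j = 1, n
       do i = 1, n
          rho(i,j) = 0.0d0
          do k = 1, n
             rho(i,j) = rho(i,j) + v(i,k) * v(j,k)
          end do
       end do
    end do
    !$omp end parallel do

    ! Option 2: hand the same contraction to (threaded) BLAS: rho = v v^T
    call dgemm('N', 'T', n, n, n, 1.0d0, v, n, v, n, 0.0d0, rho, n)

    print *, rho(1,1)
  end program thread_sketch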
Parallel Performance (MPI)
• DFT = naturally parallel: 1 core = 1 configuration (only if everything fits on one core); see the sketch below
• HFODD characteristics:
  • Very little communication overhead
  • Lots of I/O per processor (specific to that processor): 3 ASCII files/core
• Scalability limited by:
  • File-system performance
  • Usability of the results (handling of thousands of files)
  • ADIOS library being implemented
[Figure: time of 10 HFB iterations as a function of the number of cores (Jaguar Cray XT5, no threads – Skyrme SLy4, 152Dy, HFB, 14 full shells)]
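A minimal sketch of the "one core = one configuration" model; the routine run_one_hfb_configuration is a hypothetical placeholder, not HFODD's API:

  program mpi_farm_sketch
    use mpi
    implicit none
    integer, parameter :: nconf = 100000    ! e.g. >100k configurations
    integer :: rank, nprocs, ierr, iconf

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)
    call mpi_comm_size(mpi_comm_world, nprocs, ierr)

    ! Round-robin distribution: rank r handles configurations
    ! r+1, r+1+nprocs, r+1+2*nprocs, ... with no inter-rank communication
    do iconf = rank + 1, nconf, nprocs
       call run_one_hfb_configuration(iconf)
    end do

    call mpi_finalize(ierr)

  contains

    subroutine run_one_hfb_configuration(iconf)
      integer, intent(in) :: iconf
      ! A real solver would set up the constraints for this configuration,
      ! iterate the HFB equations to convergence, and write its own output files.
    end subroutine run_one_hfb_configuration

  end program mpi_farm_sketch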
ScaLAPACK
[Diagram: blocks of the matrix M distributed across the cores]
• Multi-threading: more memory available per core
• How about the scalability of the diagonalization for large model spaces?
• ScaLAPACK successfully implemented for simplex-breaking HFB calculations (J. McDonnell); see the sketch below
• Current issues:
  • Needs detailed profiling, as no speed-up is observed: where is the bottleneck?
  • Is the problem size adequate?
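For reference, a minimal sketch of a ScaLAPACK eigensolve (pdsyev) on a block-cyclically distributed symmetric matrix; the grid shape, block size, and matrix size are illustrative, and this shows only the general pattern, not the HFODD routine:

  program scalapack_sketch
    implicit none
    integer, parameter :: n = 10000, nb = 64   ! global size, block size (illustrative)
    integer :: ictxt, nprow, npcol, myrow, mycol
    integer :: iam, nprocs, locr, locc, lld, lwork, info
    integer :: desca(9), descz(9)
    integer, external :: numroc
    double precision, allocatable :: a(:,:), z(:,:), w(:), work(:)

    ! Process grid: a 1 x P grid for simplicity (a near-square grid scales better)
    call blacs_pinfo(iam, nprocs)
    nprow = 1;  npcol = nprocs
    call blacs_get(-1, 0, ictxt)
    call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
    call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

    ! Local dimensions of the block-cyclic distribution
    locr = numroc(n, nb, myrow, 0, nprow)
    locc = numroc(n, nb, mycol, 0, npcol)
    lld  = max(1, locr)
    call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)
    call descinit(descz, n, n, nb, nb, 0, 0, ictxt, lld, info)

    allocate(a(lld,locc), z(lld,locc), w(n))
    a = 0.0d0    ! each process would fill its local blocks of the HFB matrix here

    allocate(work(1))                          ! workspace query
    call pdsyev('V', 'L', n, a, 1, 1, desca, w, z, 1, 1, descz, work, -1, info)
    lwork = int(work(1));  deallocate(work);  allocate(work(lwork))

    ! Parallel diagonalization: eigenvalues in w, eigenvectors in z
    call pdsyev('V', 'L', n, a, 1, 1, desca, w, z, 1, 1, descz, work, lwork, info)

    call blacs_gridexit(ictxt)
    call blacs_exit(0)
  end program scalapack_sketch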
Hybrid MPI/OpenMP Parallel Model
• Spread one HFB calculation across a few cores (< 12-24)
• MPI for task management (see the sketch below)
[Diagram: within one HFB calculation, OpenMP threads handle loop optimization; an optional MPI sub-communicator runs ScaLAPACK for very large bases; a top-level MPI layer distributes the tasks (HFB configuration i/N, then (i+1)/N, ...) over cores and time]
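A minimal sketch of the task-splitting layer: the world communicator is divided into small sub-communicators (a hypothetical group size of 12 here), each of which would own one HFB calculation and could host ScaLAPACK, while OpenMP threads the loops inside each rank. The group size and names are illustrative assumptions, not HFODD's actual scheme:

  program hybrid_sketch
    use mpi
    implicit none
    integer, parameter :: group_size = 12      ! illustrative sub-group size
    integer :: rank, nprocs, ierr, color, subcomm, subrank, subsize

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)
    call mpi_comm_size(mpi_comm_world, nprocs, ierr)

    ! Ranks with the same color end up in the same sub-communicator
    color = rank / group_size
    call mpi_comm_split(mpi_comm_world, color, rank, subcomm, ierr)
    call mpi_comm_rank(subcomm, subrank, ierr)
    call mpi_comm_size(subcomm, subsize, ierr)

    ! Each sub-communicator would handle one HFB configuration at a time;
    ! inside each rank, OpenMP directives (as in the threading sketch above)
    ! parallelize the long loops.
    write(*,'(a,i6,a,i4,a,i2,a,i2)') 'world rank ', rank, '  group ', color, &
         '  local rank ', subrank, ' of ', subsize

    call mpi_comm_free(subcomm, ierr)
    call mpi_finalize(ierr)
  end program hybrid_sketch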
Conclusions
• DFT codes are naturally parallel and can easily scale to 1M processors or more
• High-precision applications of DFT are time- and memory-consuming computations, hence the need for fine-grained parallelization
• HFODD benefits from HPC techniques and code examination:
  • Loop reordering gives a speed-up factor N ≫ 1 (Coulomb exchange: N ~ 3; Gogny force: N ~ 8)
  • Multi-threading gives an extra factor > 2 (only a few routines have been upgraded)
  • ScaLAPACK implemented: very large bases (Nshell > 25) can now be used (e.g., near scission)
  • Scaling is only average on the standard Jaguar file system because of un-optimized I/O
Year 4 – 5 Roadmap
• Year 4:
  • More OpenMP, debugging of the ScaLAPACK routine
  • First tests of the ADIOS library (at scale)
  • First development of a prototype Python visualization interface
  • Tests of large-scale, I/O-bound, multi-constrained calculations
• Year 5:
  • Full implementation of ADIOS
  • Set up a framework for automatic restart (at scale)
• SVN repository (ask Mario for an account): http://www.massexplorer.org/svn/HFODDSVN/trunk