
Running in Parallel : Theory and Practice




  1. Running in Parallel: Theory and Practice
     Julian Gale, Department of Chemistry, Imperial College

  2. Why Run in Parallel?
     • Increase real-time performance
     • Allow larger calculations:
       - usually memory is the critical factor
       - distributed memory essential for all significant arrays
     • Several possible mechanisms for parallelism:
       - MPI / PVM / OpenMP

  3. Parallel Strategies
     • Massive parallelism:
       - distribute according to spatial location
       - large systems (non-overlapping regions)
       - large numbers of processors
     • Modest parallelism:
       - distribute by orbital index / K point
       - spatially compact systems
       - spatially inhomogeneous systems
       - small numbers of processors
     • Replica parallelism (transition states / phonons)
     S. Itoh, P. Ordejón and R.M. Martin, CPC, 88, 173 (1995)
     A. Canning, G. Galli, F. Mauri, A. de Vita and R. Car, CPC, 94, 89 (1996)
     D.W. Bowler, T. Miyazaki and M. Gillan, CPC, 137, 255 (2001)

  4. Key Steps in Calculation
     • Calculating H (and S) matrices
       - Hartree potential
       - Exchange-correlation potential
       - Kinetic / overlap / pseudopotentials
     • Solving for self-consistent solution
       - Diagonalisation
       - Order N

  5. One/Two-Centre Integrals
     • Integrals evaluated directly in real space
     • Orbitals distributed according to 1-D block cyclic scheme
     • Each node calculates integrals relevant to local orbitals
     • Presently duplicated set up on each node for numerical tabulations
     • Blocksize = 4
     • 16384 atoms of Si on 4 nodes:
       - Kinetic energy integrals = 45 s
       - Overlap integrals = 43 s
       - Non-local pseudopotential = 136 s
       - Mesh = 2213 s
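The 1-D block cyclic scheme named on this slide can be illustrated with a minimal sketch. The function names `owner` and `local_orbitals` are hypothetical, indices are 0-based, and the block size of 4 on 4 nodes follows the slide; this is a generic block cyclic mapping, not code taken from SIESTA.

```python
def owner(orbital, block_size, n_nodes):
    """Node that holds a 0-based orbital index in a 1-D block cyclic layout.

    Orbitals are cut into consecutive blocks of `block_size`, and the blocks
    are dealt out to the nodes in round-robin order.
    """
    return (orbital // block_size) % n_nodes


def local_orbitals(node, n_orbitals, block_size, n_nodes):
    """All orbital indices stored on `node` (hypothetical helper)."""
    return [i for i in range(n_orbitals) if owner(i, block_size, n_nodes) == node]


# 20 orbitals, block size 4, 4 nodes: node 0 holds the first and the fifth block.
print(local_orbitals(0, 20, 4, 4))
```

Each node then only needs to evaluate the integrals whose orbital index appears in its own list.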

  6. Sparse Matrices
     • Order N memory
     [Figure: Compressed 2-D and Compressed 1-D storage]
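As a rough illustration of why compressed storage gives order N memory, here is a generic compressed-row sketch. It assumes that "Compressed 1-D" means a single value array plus column and row-pointer arrays; the slide does not spell out the layout, and the names `compress` and `lookup` are hypothetical.

```python
def compress(dense, tol=0.0):
    """Pack the non-zero entries of a dense matrix into three 1-D lists.

    Returns (values, columns, row_start), where row i occupies the slice
    row_start[i]:row_start[i + 1] of `values` and `columns`.
    """
    values, columns, row_start = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if abs(x) > tol:
                values.append(x)
                columns.append(j)
        row_start.append(len(values))
    return values, columns, row_start


def lookup(values, columns, row_start, i, j):
    """Element (i, j) of the compressed matrix; zero if it is not stored."""
    for k in range(row_start[i], row_start[i + 1]):
        if columns[k] == j:
            return values[k]
    return 0.0


dense = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.2, 0.0, 0.0, 4.0],
]
values, columns, row_start = compress(dense)
```

With localised orbitals the number of stored entries per row stays roughly constant as the system grows, so the total storage grows linearly with the number of orbitals.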

  7. Parallel Mesh Operations
     • Spatial decomposition of mesh
     • 2-D Blocked in y/z
     • Map orbital to mesh distribution
     • Perform parallel FFT → Hartree
     • XC calculation only involves local communication
     • Map mesh back to orbitals
     [Figure: 2-D Blocked distribution over 12 nodes, numbered 0 4 8 / 1 5 9 / 2 6 10 / 3 7 11]
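A 2-D blocked decomposition in y/z can be sketched as follows. The names `block_range` and `mesh_block` are hypothetical, and two details are assumptions rather than statements from the slides: nodes are numbered with the y index running fastest, and any remainder mesh points go to the first blocks.

```python
def block_range(n_points, n_parts, part):
    """Half-open range [start, stop) of one contiguous block of mesh points.

    Assumption: when n_points is not divisible by n_parts, the first
    (n_points % n_parts) blocks receive one extra point.
    """
    base, extra = divmod(n_points, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def mesh_block(node, mesh, nodes_y, nodes_z):
    """Ranges of x, y and z mesh points held by `node`.

    x is not divided; y and z are each split into contiguous blocks.
    Assumption: node = iy + nodes_y * iz.
    """
    nx, ny, nz = mesh
    iy, iz = node % nodes_y, node // nodes_y
    return (0, nx), block_range(ny, nodes_y, iy), block_range(nz, nodes_z, iz)


# Mesh of 180 x 360 x 360 points on a 4(y) x 2(z) grid of nodes.
print(mesh_block(5, (180, 360, 360), 4, 2))
```

Because each node owns a contiguous brick of the mesh, operations that are local in real space need little communication, while the FFT and the orbital-to-mesh mapping do not have that property.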

  8. Distribution of Processors
     • Better to divide work in y direction than z
     • Command: ProcessorY
     • Example:
       - 8 nodes
       - ProcessorY 4
       - 4(y) x 2(z) grid of nodes
     [Figure: grid of nodes, axes y and z]
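The example on this slide amounts to a small piece of arithmetic, sketched here. The function name `processor_grid` is hypothetical, and the requirement that ProcessorY divides the node count exactly is an assumption made for this sketch.

```python
def processor_grid(n_nodes, processor_y):
    """Return (nodes in y, nodes in z) for a given ProcessorY value.

    Assumption: ProcessorY must be a factor of the total number of nodes,
    so that the nodes form a complete rectangular grid.
    """
    if processor_y < 1 or n_nodes % processor_y != 0:
        raise ValueError("ProcessorY must divide the number of nodes")
    return processor_y, n_nodes // processor_y


# The slide's example: 8 nodes with ProcessorY 4 give a 4(y) x 2(z) grid.
print(processor_grid(8, 4))
```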

  9. Diagonalisation
     • H and S stored as sparse matrices
     • Solve generalised eigenvalue problem
     • Currently convert back to dense form
     • Direct sparse solution is possible
       - sparse solvers exist for standard eigenvalue problem
       - main issue is sparse factorisation
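One common route from the generalised problem H c = e S c to a standard one is a Cholesky factorisation S = L Lᵀ, which is where a factorisation step enters. The sketch below is a small dense illustration in pure Python, not the method used by the code described in the talk; all function names and the 2x2 test matrices are hypothetical.

```python
import math


def cholesky(s):
    """Lower-triangular L with S = L L^T, for symmetric positive-definite S."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def forward_solve(low, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, b_i in enumerate(b):
        y.append((b_i - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y


def to_standard_form(h, s):
    """Return A = L^-1 H L^-T, so H c = e S c becomes A y = e y with y = L^T c."""
    n = len(h)
    low = cholesky(s)
    # Columns of X = L^-1 H.
    x_cols = [forward_solve(low, [h[i][j] for i in range(n)]) for j in range(n)]
    # Row j of A is L^-1 applied to row j of X, because A^T = L^-1 X^T.
    return [forward_solve(low, [x_cols[k][j] for k in range(n)]) for j in range(n)]


def eigenvalues_2x2(a):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, in ascending order."""
    mean = 0.5 * (a[0][0] + a[1][1])
    radius = math.hypot(0.5 * (a[0][0] - a[1][1]), a[0][1])
    return mean - radius, mean + radius


h = [[2.0, 1.0], [1.0, 3.0]]   # illustrative Hamiltonian
s = [[1.0, 0.5], [0.5, 1.0]]   # illustrative overlap matrix
low_e, high_e = eigenvalues_2x2(to_standard_form(h, s))
```

For these matrices det(H - eS) = 0.75e² - 4e + 5, whose roots are 2 and 10/3, which is what the reduction should reproduce.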

  10. Dense Parallel Diagonalisation
      • Two options:
        - ScaLAPACK
        - Block Jacobi (Ian Bush, Daresbury)
      • Scaling vs absolute performance
      • 1-D Block Cyclic (size ≈ 12 - 20)
      • Command: BlockSize
      [Figure: blocks assigned to nodes 0, 1, 0]
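The effect of BlockSize on load balance can be estimated with a short sketch. The function names and the figures used (100 orbitals on 4 nodes) are hypothetical; the block sizes 12 and 16 are simply taken from the range quoted on the slide.

```python
def orbitals_per_node(n_orbitals, block_size, n_nodes):
    """Number of orbitals each node receives in a 1-D block cyclic layout."""
    counts = [0] * n_nodes
    for start in range(0, n_orbitals, block_size):
        node = (start // block_size) % n_nodes
        # The last block may be shorter than block_size.
        counts[node] += min(block_size, n_orbitals - start)
    return counts


def imbalance(n_orbitals, block_size, n_nodes):
    """Heaviest node's load divided by the ideal even share (1.0 is perfect)."""
    counts = orbitals_per_node(n_orbitals, block_size, n_nodes)
    return max(counts) / (n_orbitals / n_nodes)


for block_size in (12, 16):
    print(block_size, orbitals_per_node(100, block_size, 4))
```

Smaller blocks tend to even out the load, while larger blocks tend to favour the dense linear algebra kernels, which is one way to read the "scaling vs absolute performance" trade-off.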

  11. Order N
      • Kim, Mauri, Galli functional:
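The equation itself is not preserved in this transcript. For orientation, the Kim, Mauri, Galli functional is commonly written in the following form. The symbols are assumptions chosen to match the matrix names listed on the next slide: C holds the coefficients of the localised orbitals, h and s are the Hamiltonian and overlap in that orbital basis, η is the chemical potential parameter and N is the number of electrons.

```latex
E[C, \eta] = 2\,\mathrm{Tr}\!\left[(2I - s)(h - \eta s)\right] + \eta N,
\qquad h = C^{T} H C, \quad s = C^{T} S C
```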

  12. Order N
      • Direct minimisation of band structure energy => coefficients of orbitals in Wannier functions
      • Three basic operations:
        - calculation of gradient
        - 3 point extrapolation of energy
        - density matrix build
      • Sparse matrices: C, G, H, S, h, s, F, Fs => localisation radius
      • Arrays distributed by rhs index:
        - nbasis or nbands
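The slide does not say how the 3 point extrapolation is carried out. One simple possibility, shown purely as an assumption, is to fit a parabola through three energies along the search direction and step to its minimum. The function name `parabolic_minimum` is hypothetical.

```python
def parabolic_minimum(steps, energies):
    """Step length at the minimum of the parabola through three points.

    steps:    three distinct step lengths along the search direction
    energies: the energy evaluated at each of those steps
    Assumption: the energy is well approximated by a parabola on this range.
    """
    x0, x1, x2 = steps
    e0, e1, e2 = energies
    # Newton divided differences of the interpolating polynomial.
    d01 = (e1 - e0) / (x1 - x0)
    d12 = (e2 - e1) / (x2 - x1)
    curvature = (d12 - d01) / (x2 - x0)
    if curvature <= 0.0:
        raise ValueError("the three points do not bracket a minimum")
    # p(x) = e0 + d01 (x - x0) + curvature (x - x0)(x - x1); solve p'(x) = 0.
    return 0.5 * (x0 + x1) - d01 / (2.0 * curvature)


# Energies sampled from (x - 0.3)^2 + 1 at x = 0, 0.5 and 1.
best_step = parabolic_minimum((0.0, 0.5, 1.0), (1.09, 1.04, 1.49))
```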

  13. Putting it into practice…
      • Model test system = bulk Si (a = 5.43 Å)
      • Conditions as previous scalar runs
      • Single-zeta basis set
      • Mesh cut-off = 40 Ry
      • Localisation radius = 5.0 Bohr
      • Kim / Mauri / Galli functional
      • Energy shift = 0.02 Ry
      • Order N calculations -> 1 SCF cycle / 2 iterations
      • Calculations performed on SGI R12000 / 300 MHz
      • “Green” at CSAR / Manchester Computing Centre

  14. Scaling of Time with System Size
      • 32 processors

  15. Scaling of Memory with System Size
      • NB: Memory is per processor

  16. Parallel Performance on Mesh
      • 16384 atoms of Si / Mesh = 180 x 360 x 360
      • Mean time per call
      • Loss of performance is due to orbital-mesh mapping (XC shows perfect scaling for LDA)

  17. Parallel Performance in Order N
      • 16384 atoms of Si / Mesh = 180 x 360 x 360
      • Mean total time per call in 3 point energy calculation
      • Minimum memory algorithm
      • Needs spatial decomposition to limit internode communication

  18. Installing Parallel SIESTA
      • What you need:
        - f90
        - MPI
        - scalapack
        - blacs
        - blas
        - lapack
      • Usually already installed on parallel machines
      • Source/prebuilt binaries from www.netlib.org
      • If compiling, look out for f90/C cross-compatibility
      • arch.make - available for several parallel machines
      (Slide annotation: Also needed for serial runs)

  19. Running Parallel SIESTA
      • To run a parallel job:
          mpirun -np 4 siesta < job.fdf > job.out
        (-np sets the number of processors)
      • Some sites require “prun” instead
      • Notes:
        - generally must run in queues
        - copy files on to local disk of run machine
        - times reported in output are sum over nodes
        - times can be erratic (Green/Fermat)
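The command on this slide can also be driven from a script. The sketch below is only an illustration: the helper names are hypothetical, it assumes `mpirun` and `siesta` are on the PATH, and on a queued system the same command would normally sit inside a batch script instead. The last helper reflects the note that reported times are summed over nodes.

```python
import subprocess


def mpirun_command(n_procs, executable="siesta"):
    """Argument list equivalent to `mpirun -np <n_procs> siesta`."""
    return ["mpirun", "-np", str(n_procs), executable]


def run_job(n_procs, fdf_path="job.fdf", out_path="job.out"):
    """Run the job, feeding the fdf file on stdin and saving stdout."""
    with open(fdf_path) as fdf, open(out_path, "w") as out:
        return subprocess.run(
            mpirun_command(n_procs), stdin=fdf, stdout=out, check=True
        )


def mean_time_per_node(summed_time, n_procs):
    """Convert a time summed over nodes into a mean time per node."""
    return summed_time / n_procs


print(" ".join(mpirun_command(4)))
```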

  20. Useful Parallel Options
      • ParallelOverK: Distribute K points over nodes - good for metals
      • ProcessorY: Sets dimension of processor grid in Y direction
      • BlockSize: Sets size of blocks into which orbitals are divided
      • DiagMemory: Controls memory available for diagonalisation. Memory required depends on clusters of eigenvalues. See also DiagScale / TryMemoryIncrease
      • DirectPhi: Phi values are calculated on the fly - saves memory
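These options are set in the fdf input file as one keyword and value per line. A minimal sketch of writing such lines from a script follows; the helper `render_fdf` is hypothetical and the values shown are illustrative choices, not recommendations from the talk.

```python
def render_fdf(options):
    """Render a dict of options as fdf `Keyword value` lines."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Booleans are written as true/false, as in the slides' examples.
            value = "true" if value else "false"
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only: a 4-node-wide grid in y and a mid-range block size.
parallel_options = {
    "ProcessorY": 4,
    "BlockSize": 16,
    "DirectPhi": True,
    "WriteDM": False,
}
print(render_fdf(parallel_options), end="")
```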

  21. Why does my job run like a dead donkey?
      • Poor load balance between nodes: alter BlockSize / ProcessorY
      • I/O is too slow: could set “WriteDM false”
      • Job is swapping like crazy: set “DirectPhi true”
      • Scaling with increasing number of nodes is poor: run a bigger job!!
      • General problems with parallelism: latency / bandwidth
        - Linux clusters with 100 Mbit Ethernet switch - forget it!
