Parallelization of CPAIMD using Charm++ Parallel Programming Lab
CPAIMD • Collaboration with Glenn Martyna and Mark Tuckerman • Existing MPI code – PINY • Scalability problems when #procs >= #orbitals • Charm++ approach • Better scalability through virtualization • Further divide orbitals among virtual processors
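A minimal sketch of what the virtualized decomposition could look like in Charm++: each (state, plane) piece of an orbital becomes an element of a 2D chare array, so the number of work units no longer depends on the processor count. The module, class, and entry-method names here are illustrative, not the actual CPAIMD code; the .ci interface text is shown as a comment and would normally be compiled with charmc to generate the decl/def headers.

    // sketch.ci (Charm++ interface file, illustrative names):
    //   mainmodule cpaimd {
    //     array [2D] StatePlane {
    //       entry StatePlane();
    //       entry void doIteration();
    //     };
    //   };

    // sketch.C
    #include "cpaimd.decl.h"   // generated from the .ci file above by charmc

    class StatePlane : public CBase_StatePlane {
    public:
      StatePlane() { /* thisIndex.x = state, thisIndex.y = plane */ }
      void doIteration() { /* FFT, density, force work for this plane */ }
    };

    // Creation: far more virtual processors than physical ones,
    // e.g. 128 states x nPlanes planes:
    //   CProxy_StatePlane planes = CProxy_StatePlane::ckNew(128, nPlanes);

    #include "cpaimd.def.h"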
The Iteration • Start with 128 “states” • State – spatial representation of an electron • FFT each of the 128 states • In parallel • Planar decomposition => transposes required • Compute densities (DFT) • Compute energies using the density • Compute forces and move electrons • Orthonormalize states • Start over
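A rough serial sketch of the iteration just described, with each step as a placeholder function; in the real application every step is spread over many virtual processors, and all names here are illustrative only.

    #include <complex>
    #include <vector>

    // Placeholder steps; names are illustrative, not the PINY/Charm++ API.
    struct State { std::vector<std::complex<double>> gspace, rspace; };

    void fftToRealSpace(State&) {}                    // planar-decomposed 3D FFT
    std::vector<double> computeDensity(const std::vector<State>&) { return {}; }
    double computeEnergies(const std::vector<double>&) { return 0.0; }
    void computeForcesAndMove(std::vector<State>&, const std::vector<double>&) {}
    void orthonormalize(std::vector<State>&) {}

    void iteration(std::vector<State>& states) {
      for (State& s : states) fftToRealSpace(s);      // FFT each state, in parallel
      std::vector<double> rho = computeDensity(states);
      double energy = computeEnergies(rho);           // energies from the density
      (void)energy;
      computeForcesAndMove(states, rho);              // forces, then move electrons
      orthonormalize(states);                         // then start over
    }

    int main() { std::vector<State> states(128); iteration(states); }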
Optimized Parallel 3D FFT • To perform the 3D FFT • 1D FFTs followed by 2D FFTs, instead of 2D followed by 1D • Less computation • Less communication
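A skeletal outline of the 1D-then-2D structure, assuming a plane-decomposed grid and placeholder fft1d/fft2d routines (the real code would call an FFT library, and the transpose would be the all-to-all communication step between chares):

    #include <complex>
    #include <cstddef>
    #include <vector>

    using cplx  = std::complex<double>;
    using Plane = std::vector<std::vector<cplx>>;
    using Grid  = std::vector<Plane>;              // grid[x][y][z]

    // Placeholders for the real transforms (e.g. library 1D/2D FFT calls).
    void fft1d(std::vector<cplx>&) {}
    void fft2d(Plane&) {}

    // 3D FFT as 1D FFTs along z, a transpose, then 2D FFTs per z-plane.
    // In the parallel code the transpose is the communication step, and only
    // lines that actually contain data need to be transformed or moved.
    Grid fft3d_1d_then_2d(const Grid& g) {
      Grid work = g;
      for (auto& xPlane : work)                    // 1D FFT along z for every (x,y)
        for (auto& zLine : xPlane)
          fft1d(zLine);

      std::size_t Nx = work.size(), Ny = work[0].size(), Nz = work[0][0].size();
      Grid byZ(Nz, Plane(Nx, std::vector<cplx>(Ny)));  // transpose to [z][x][y]
      for (std::size_t x = 0; x < Nx; ++x)
        for (std::size_t y = 0; y < Ny; ++y)
          for (std::size_t z = 0; z < Nz; ++z)
            byZ[z][x][y] = work[x][y][z];

      for (auto& zPlane : byZ)                     // 2D FFT in x,y for each z
        fft2d(zPlane);
      return byZ;
    }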
Orthonormalization • All-pairs operation • The data of each state must meet the data of every other state • Our approach • A virtual processor acts as the meeting point for several pairs of states • Create many such VPs • Number of pairs meeting at a VP: n • Communication decreases as n grows • Computation per VP increases with n • Balance required
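A small sketch of the kind of pair-to-VP mapping this describes: enumerate the distinct state pairs and group n of them per virtual processor. The grouping policy and names are assumptions for illustration; the slide only fixes the tradeoff (larger n means fewer messages, but more work per VP).

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Assign state pairs (i, j), i < j, to virtual processors, n pairs per VP.
    // Larger n => fewer VPs and less communication, but more computation per VP.
    std::vector<std::vector<std::pair<int,int>>> mapPairsToVPs(int nStates, int n) {
      std::vector<std::vector<std::pair<int,int>>> vps;
      std::vector<std::pair<int,int>> current;
      for (int i = 0; i < nStates; ++i)
        for (int j = i + 1; j < nStates; ++j) {
          current.push_back({i, j});
          if ((int)current.size() == n) { vps.push_back(current); current.clear(); }
        }
      if (!current.empty()) vps.push_back(current);
      return vps;
    }

    int main() {
      auto vps = mapPairsToVPs(128, 8);   // e.g. 128 states, 8 pairs per VP
      std::printf("VPs needed: %zu\n", vps.size());
    }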
Performance • Existing MPI code – PINY • Does not scale beyond 128 processors • Best per-iteration time: 1.7 s • Our performance: see the per-iteration times in the load-balancing results below
Load Balancing • Load imbalance due to the distribution of data among the orbitals • Planes are cross-sections of a sphere, so they hold different numbers of points • Hence imbalance • Computation – planes with more points do more work • Communication – planes with more points send more data
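A tiny illustration of why the planes differ, under the assumption stated above that the data fills a sphere in a cubic grid: planes near the center of the sphere contain far more grid points than planes near its edge.

    #include <cstdio>

    // Count grid points inside a sphere of radius R, plane by plane along x.
    int main() {
      const int R = 16;
      for (int x = -R; x <= R; ++x) {
        long points = 0;
        for (int y = -R; y <= R; ++y)
          for (int z = -R; z <= R; ++z)
            if (x*x + y*y + z*z <= R*R) ++points;
        std::printf("plane x=%3d: %5ld points\n", x, points);
      }
      // Central planes (x near 0) carry on the order of pi*R^2 points, while edge
      // planes carry almost none, so a naive plane-per-processor mapping is imbalanced.
    }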
Load Imbalance. Iteration time: 900 ms on 1024 processors.
Improvement - I: pairing heavily loaded planes with lightly loaded planes on the same processor, as sketched below. Iteration time: 590 ms.
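A sketch of one way such a pairing could be computed (the slide does not spell out the exact policy; this version sorts planes by load and co-locates the heaviest with the lightest):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Pair the heaviest plane with the lightest, the 2nd heaviest with the
    // 2nd lightest, and so on, so every pair has a similar combined load.
    int main() {
      std::vector<double> load = {9.0, 1.0, 7.5, 2.0, 6.0, 3.5, 5.0, 4.0}; // made-up per-plane loads
      std::vector<int> idx(load.size());
      for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
      std::sort(idx.begin(), idx.end(),
                [&](int a, int b) { return load[a] > load[b]; });
      for (std::size_t i = 0, j = idx.size() - 1; i < j; ++i, --j)
        std::printf("processor gets planes %d and %d (combined load %.1f)\n",
                    idx[i], idx[j], load[idx[i]] + load[idx[j]]);
    }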
Charm++ Load Balancing. Load balancing provided by the Charm++ runtime system. Iteration time: 600 ms.
Improvement - II: a load-vector based scheme maps planes to processors, so that processors holding “heavy” planes are assigned correspondingly fewer planes than processors holding “light” planes. Iteration time: 480 ms.
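One plausible form of a load-vector based mapping (an assumption for illustration; the slide does not give the algorithm) is a greedy assignment: take planes in decreasing order of load and give each to the currently least-loaded processor, which naturally leaves heavy-plane processors with fewer planes.

    #include <algorithm>
    #include <cstdio>
    #include <queue>
    #include <utility>
    #include <vector>

    int main() {
      std::vector<double> planeLoad = {9, 7.5, 6, 5, 4, 3.5, 2, 1};  // made-up loads
      const int nProcs = 3;

      std::sort(planeLoad.begin(), planeLoad.end(), std::greater<double>());
      using Entry = std::pair<double, int>;          // (current load, processor id)
      std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
      for (int p = 0; p < nProcs; ++p) procs.push({0.0, p});

      std::vector<int> count(nProcs, 0);
      for (double load : planeLoad) {
        auto [curr, p] = procs.top(); procs.pop();   // least-loaded processor
        procs.push({curr + load, p});                // heaviest planes placed first,
        ++count[p];                                  // so their processors end up
      }                                              // with fewer planes overall
      for (int p = 0; p < nProcs; ++p)
        std::printf("processor %d: %d planes\n", p, count[p]);
    }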
Scope for Improvement • Load balancing • Charm++ load balancer shows encouraging results on 512 PEs • Combine automated and manual load balancing • Avoid copying when sending messages • In FFTs • When sending large read-only messages • Make FFTs more efficient • Use double packing (sketched below) • Make assumptions about data distribution when performing FFTs • Alternative implementation of orthonormalization
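One common meaning of “double packing” is the standard trick of transforming two real sequences with a single complex FFT; if the slide intends a more specific packing scheme, treat the following purely as an illustration of the general idea. A minimal self-contained sketch, using a naive O(N^2) DFT as a stand-in for the FFT:

    #include <complex>
    #include <cstdio>
    #include <vector>

    using cplx = std::complex<double>;
    const double PI = 3.141592653589793;

    // Naive DFT, standing in for a real FFT library call.
    std::vector<cplx> dft(const std::vector<cplx>& x) {
      const std::size_t N = x.size();
      std::vector<cplx> X(N);
      for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
          X[k] += x[n] * std::polar(1.0, -2.0 * PI * double(k * n) / double(N));
      return X;
    }

    int main() {
      // Pack two real sequences into one complex sequence: x = a + i*b.
      std::vector<double> a = {1, 2, 3, 4}, b = {4, 3, 2, 1};
      const std::size_t N = a.size();
      std::vector<cplx> x(N);
      for (std::size_t n = 0; n < N; ++n) x[n] = {a[n], b[n]};

      std::vector<cplx> X = dft(x);

      // Unpack via the conjugate symmetry of real-input transforms:
      //   A[k] = (X[k] + conj(X[N-k])) / 2,   B[k] = (X[k] - conj(X[N-k])) / (2i)
      for (std::size_t k = 0; k < N; ++k) {
        cplx Xc = std::conj(X[(N - k) % N]);
        cplx A = (X[k] + Xc) / 2.0;
        cplx B = (X[k] - Xc) / cplx(0.0, 2.0);
        std::printf("k=%zu  A=(%.2f,%.2f)  B=(%.2f,%.2f)\n",
                    k, A.real(), A.imag(), B.real(), B.imag());
      }
    }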