
Optimizing Quantum Chemistry using Charm++

Presentation Transcript


  1. Optimizing Quantum Chemistry using Charm++. Eric Bohm, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu

  2. Overview • CPMD • 9 phases • Charm applicability • Overlap • Decomposition • Portability • Communication Optimization • Decomposition: State Planes, 3D FFT, 3D matrix multiply • Utilizing Charm++: Prioritized nonlocal, Commlib, Projections

  3. Quantum Chemistry • LeanCP Collaboration • Glenn Martyna (IBM TJ Watson) • Mark Tuckerman (NYU) • Nick Nystrom (PSU) • PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali • CPMD Method • Plane wave QM: 100s of atoms • Charm++ Parallelization • PINY MD physics engine

  4. CPMD on Charm++ • 11 Charm Arrays • 4 Charm Modules • 13 Charm Groups • 3 Commlib strategies • BLAS • FFTW • PINY MD • Adaptive overlap • Prioritized computation for phased application • Communication optimization • Load balancing • Group caches • Rth threads

  5. Practical Scaling • Single Wall Carbon Nanotube Field Effect Transistor • BG/L Performance

  6. Computation Flow

  7. Charm++ • Uses the approach of virtualization • Divide the work into VPs • Typically much more than #proc • Schedule each VP for execution • Advantage: • Computation and communication can be overlapped (between VPs) • Number of VPs can be independent of #proc • Other: load balancing, checkpointing, etc.
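
To make the virtualization idea concrete, here is a minimal Charm++ sketch, with hypothetical Main/Worker names rather than LeanCP's actual chares: it creates many more chare-array elements (VPs) than physical processors and lets the runtime schedule them, so communication of one element overlaps computation of another on the same PE.

```cpp
// hello.ci (interface file, shown here as a comment):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main { entry Main(CkArgMsg *m); entry void done(); };
//     array [1D] Worker { entry Worker(); entry void compute(); };
//   };
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int nElems, finished;
public:
  Main(CkArgMsg *m) : finished(0) {
    delete m;
    nElems = 16 * CkNumPes();            // far more VPs than processors
    mainProxy = thisProxy;
    CProxy_Worker workers = CProxy_Worker::ckNew(nElems);
    workers.compute();                   // broadcast: every element runs
  }
  void done() {                          // one message per finished element
    if (++finished == nElems) CkExit();
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage *m) {}
  void compute() {
    // Per-element work (an FFT plane, a sub-matrix, ...) would go here;
    // while one element waits on communication, others on this PE run.
    mainProxy.done();
  }
};

#include "hello.def.h"
```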

  8. Decomposition • Higher degree of virtualization is better for Charm++ • Real Space State Planes, Gspace State Planes, Rho Real and Rho G, S-Calculators for each gspace state plane • Tens of thousands of chares for a 32-molecule problem • Careful scheduling to maximize efficiency • Most of the computation is in FFTs and matrix multiplies

  9. 3-D FFT Implementation • "Dense" 3-D FFT • "Sparse" 3-D FFT

  10. Parallel FFT Library • Slab-based parallelization • We do not re-implement the sequential routine • Utilize 1-D and 2-D FFT routines provided by FFTW • Allow for • Multiple 3-D FFTs simultaneously • Multiple data sets within the same set of slab objects • Useful as 3-D FFTs are frequently used in CP computations
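
The per-slab piece of such a library can be sketched as below. This is a sketch only, written against the FFTW3 "many" plan interface rather than whatever FFTW version the original library used, and the Charm++ transpose between the 2-D and 1-D stages is elided.

```cpp
// Per-slab work in a slab-decomposed 3-D FFT: a batched 2-D FFT over the
// locally held planes; the transpose and the final 1-D FFTs happen on the
// objects that own each (y,z) pencil and are not shown here.
#include <fftw3.h>

class SlabFFT {
  int ny, nz, nxLocal;          // full Y/Z extents, local share of X planes
  fftw_complex *data;           // nxLocal contiguous ny*nz planes
  fftw_plan plan2d;
public:
  SlabFFT(int nxLocal_, int ny_, int nz_)
    : ny(ny_), nz(nz_), nxLocal(nxLocal_) {
    data = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * nxLocal * ny * nz);
    // One 2-D (ny x nz) transform per local plane, batched into a single plan.
    int n[2] = { ny, nz };
    plan2d = fftw_plan_many_dft(2, n, nxLocal,
                                data, NULL, 1, ny * nz,
                                data, NULL, 1, ny * nz,
                                FFTW_FORWARD, FFTW_ESTIMATE);
  }
  void forwardLocal() {
    fftw_execute(plan2d);
    // Next step (not shown): send pencils to the slab objects that own each
    // (y,z) line so a batched 1-D FFT can be applied along X there.
  }
  ~SlabFFT() { fftw_destroy_plan(plan2d); fftw_free(data); }
};
```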

  11. Multiple Parallel 3-D FFTs

  12. Matrix Multiply • AKA S-Calculator or Pair Calculator • Decompose state-plane values into smaller objects • Use DGEMM on smaller sub-matrices • Sum together via reduction back to Gspace
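
The core step of one such element might look like the sketch below, assuming row-major sub-blocks and CBLAS for the DGEMM; the class name, block layout, and entry names are illustrative, not LeanCP's actual code.

```cpp
// pcalc.ci (sketch):
//   array [2D] PairCalc { entry PairCalc(int m, int n, int k);
//                         entry void multiply(CkCallback toGSpace); };
#include "pcalc.decl.h"
#include <cblas.h>
#include <vector>

class PairCalc : public CBase_PairCalc {
  int m, n, k;
  std::vector<double> A, B, C;   // A: m x k, B: k x n, C: m x n (row major)
public:
  PairCalc(int m_, int n_, int k_)
    : m(m_), n(n_), k(k_), A(m_ * k_), B(k_ * n_), C(m_ * n_) {}
  void multiply(CkCallback toGSpace) {
    // C = A * B on this element's sub-blocks only.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A.data(), k, B.data(), n,
                0.0, C.data(), n);
    // Element-wise sum of all partial C blocks, delivered back to G-space.
    contribute(m * n * sizeof(double), C.data(), CkReduction::sum_double, toGSpace);
  }
};

#include "pcalc.def.h"
```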

  13. Matrix Multiply: VP-based approach

  14. Charm++ Tricks and Tips • Message-driven execution and a high degree of virtualization present tuning challenges • Flow of control using Rth threads • Prioritized messages • Commlib framework • Charm++ arrays vs. groups • Problem identification with Projections • Problem isolation techniques

  15. Flow Control in Parallel • Rth Threads • Based on Duff's device, these are user-level threads with negligible overhead • Essentially goto and return without the readability loss • Allow an event-loop style of programming • Make flow of control explicit • Use familiar threading semantics
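
The following is an illustrative re-creation of the Duff's-device trick (in the style of protothreads), not the actual Rth macros: each suspend point records the current line number inside a switch, so the next call to the same entry function jumps straight back to where it left off, with no stack to save.

```cpp
// Minimal Duff's-device "thread": suspends return to the scheduler,
// resumes jump to the recorded case label. Phase names are hypothetical.
#define TH_BEGIN(th)   switch ((th)->line) { case 0:
#define TH_SUSPEND(th) do { (th)->line = __LINE__; return; case __LINE__:; } while (0)
#define TH_END(th)     } (th)->line = 0

struct Thread { int line = 0; };

struct GSpaceDriver {
  Thread th;
  void step() {                  // re-invoked each time a message arrives
    TH_BEGIN(&th);
    startFFT();                  // phase 1: launch communication
    TH_SUSPEND(&th);             // give control back; resume when data arrives
    doNonLocal();                // phase 2
    TH_SUSPEND(&th);
    finishIteration();           // phase 3
    TH_END(&th);
  }
  void startFFT() {}
  void doNonLocal() {}
  void finishIteration() {}
};
```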

  16. Rth Threads for Flow Control

  17. Prioritized Messages for Overlap
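
In Charm++, this kind of overlap comes from attaching an integer priority to a send so the scheduler serves critical-path messages first (smaller values are served earlier). A minimal sketch using CkEntryOptions, with hypothetical proxy and entry names:

```cpp
// Deprioritize the nonlocal-work message so critical-path FFT messages
// queued on the same PE are delivered first (illustrative names).
void launchNonLocal(CProxy_NonLocal nonlocal, int planeIdx)
{
  CkEntryOptions opts;
  opts.setPriority(1000);                 // larger value = lower priority
  nonlocal[planeIdx].computeZ(&opts);     // options ride along the invocation
}
```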

  18. Communication Library • Fine-grained decomposition can result in many small messages • Message combining via the Commlib framework in Charm++ addresses this problem • The streaming protocol optimizes many-to-many personalized communication • Forwarding protocols like Ring or Multiring can be beneficial • But not on BG/L
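
The idea behind the streaming strategy can be shown with a hand-rolled per-PE combiner. This is only an illustrative sketch of message combining, not the Commlib API itself; the Target proxy and its receiveBatch entry method are assumed to be declared elsewhere.

```cpp
// combiner.ci (sketch):  group Combiner { entry Combiner(CProxy_Target t);
//                                         entry void flush(); };
#include "combiner.decl.h"
#include <vector>

struct Record { int planeIdx; double value; };
PUPbytes(Record);                           // plain-old-data marshalling

class Combiner : public CBase_Combiner {    // Charm++ group: one branch per PE
  CProxy_Target target;
  std::vector<Record> buf;
  enum { FLUSH_AT = 64 };
public:
  Combiner(CProxy_Target t) : target(t) {}
  // Called locally (via ckLocalBranch) by chares on this PE.
  void deposit(const Record &r) {
    buf.push_back(r);
    if ((int)buf.size() >= FLUSH_AT) flush();
  }
  void flush() {                            // also driven periodically
    if (buf.empty()) return;
    target.receiveBatch((int)buf.size(), buf.data());  // one combined send
    buf.clear();
  }
};

#include "combiner.def.h"
```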

  19. Commlib Strategy Selection

  20. Streaming Commlib Saves Time • 610 ms vs. 480 ms

  21. Bound Arrays • Why? Efficiency and clarity of expression • Two arrays of the same dimensionality where like indices are co-placed • Gspace and the non-local computation both have plane-based computations and share many data elements • Use ckLocal() to access co-located elements with ordinary local function calls • Remain distinct parallel objects
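
A sketch of the mechanism with illustrative array names: the second array is created with CkArrayOptions::bindTo() so that like indices always share a processor, and ckLocal() then yields a plain C++ pointer to the co-located element.

```cpp
// Creation side: bind NonLocal[p] to GSpacePlane[p] (illustrative names).
void createBoundArrays(int numPlanes)
{
  CProxy_GSpacePlane gspace = CProxy_GSpacePlane::ckNew(numPlanes);
  CkArrayOptions opts(numPlanes);
  opts.bindTo(gspace);                         // like indices are co-placed
  CProxy_NonLocal nonlocal = CProxy_NonLocal::ckNew(gspace, opts);
}

// Inside NonLocal element p (the gspace proxy was stored at construction):
void NonLocal::compute()
{
  // Bound elements always live on the same PE, so this is a real local
  // pointer, not a remote reference: shared plane data, no message.
  GSpacePlane *g = gspace[thisIndex].ckLocal();
  usePlaneData(g);                             // hypothetical local helper
}
```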

  22. Group Caching Techniques • Group objects have 1 element per processor, making them excellent cache points for arrays that may have many chares per processor • Place low-volatility data in the group • Array elements use ckLocalBranch() to access it • In CPMD, all chares that have plane P share the same memory for the Structure Factor
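
A sketch of the pattern with illustrative names and fields: the group has exactly one branch per PE, and any chare on that PE reaches it through ckLocalBranch() to read the shared, rarely-changing data instead of keeping its own copy.

```cpp
// sfcache.ci (sketch):  group SFCache { entry SFCache(); entry void put(...); };
#include <map>
#include <vector>

class SFCache : public CBase_SFCache {              // one branch per PE
  std::map<int, std::vector<double> > sfByPlane;    // plane -> structure factor
public:
  void put(int plane, int n, const double *sf) {
    sfByPlane[plane].assign(sf, sf + n);            // stored once per PE
  }
  const std::vector<double> &get(int plane) { return sfByPlane[plane]; }
};

// Any array element on the same PE (cacheProxy is a CProxy_SFCache member):
void RealSpacePlane::useStructureFactor(int plane)
{
  SFCache *cache = cacheProxy.ckLocalBranch();      // plain local pointer
  const std::vector<double> &sf = cache->get(plane);
  // ... every chare holding this plane reads the same memory ...
}
```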

  23. Charm++ Performance Debugging • Complex parallel applications are hard to debug • An event-based model with a high degree of virtualization presents new challenges • Tools: Projections and the Charm++ debugger • Bottleneck identification using the Projections Usage Profile tool

  24. Old S->T Orthonormalization

  25. After Parallel S->T

  26. Problem Isolation Techniques • Using Rth threads, it is easy to isolate phases by adding a barrier • Contribute to reduction -> suspend • Reduction proxy is broadcast client -> resume • In the following example we break up the Gspace IFFT into computation and communication entry methods • We then insert a barrier between them to highlight a specific performance problem
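
The barrier itself is the standard Charm++ empty-reduction idiom; a sketch with illustrative entry names, where the whole array is the broadcast client of the reduction:

```cpp
// Each G-space element contributes to an empty reduction; the callback
// broadcasts resumePhase() to the whole array, so the compute entry method
// cannot start anywhere until the communication phase has finished everywhere.
void GSpacePlane::finishIFFTComm()
{
  CkCallback cb(CkIndex_GSpacePlane::resumePhase((CkReductionMsg *)NULL), thisProxy);
  contribute(cb);                       // acts as the inserted barrier
}

void GSpacePlane::resumePhase(CkReductionMsg *msg)
{
  delete msg;
  startIFFTCompute();                   // next phase begins on every element
}
```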

  27. Projections Timeline Analysis

  28. Optimizations Motivated by BG/L • Finer decomposition • Structure Factor and non-local computation now operate on groups of atoms within a plane • Improved scaling • Avoid creating network bottlenecks • No DMA or communication offload on BG/L's torus network • Workarounds for the MPI progress engine • Set eager <1000 • Add network probes inside inner loops • Shift communication to avoid interference across computation phases

  29. After the fixes

  30. Future Work • Scaling to 20k processors on BG/L: density pencil FFTs • Rhospace real->complex doublepack optimization • New FFT-based algorithm for the Structure Factor • More systems • Topology-aware chare mapping • HLL orchestration expression

  31. What time is it in Scotland? • There is a 1024-node BG/L in Edinburgh • The time there is 6 hours ahead of CT • During this non-production time we can run on the full rack at night • Thank you EPCC!
