Optimizing Quantum Chemistry using Charm++
Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Overview
• CPMD
  • 9 phases
  • Charm applicability
  • Overlap
• Decomposition
  • State Planes
  • 3-D FFT
  • 3-D matrix multiply
• Utilizing Charm++
  • Prioritized nonlocal
  • Commlib
  • Projections
• Portability
• Communication Optimization
Quantum Chemistry • LeanCP Collaboration • Glenn Martyna (IBM TJ Watson) • Mark Tuckerman (NYU) • Nick Nystrom (PSU) • PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali • CPMD Method • Plane-wave QM: 100s of atoms • Charm++ Parallelization • PINY MD physics engine
CPMD on Charm++
• 11 Charm Arrays
• 4 Charm Modules
• 13 Charm Groups
• 3 Commlib strategies
• BLAS, FFTW, PINY MD
• Adaptive overlap
• Prioritized computation for phased application
• Communication optimization
• Load balancing
• Group caches
• Rth Threads
Practical Scaling • Single Wall Carbon Nanotube Field Effect Transistor • BG/L Performance
Charm++ • Uses the approach of virtualization • Divide the work into VPs • Typically much more than #proc • Schedule each VP for execution • Advantage: • Computation and communication can be overlapped (between VPs) • Number of VPs can be independent of #proc • Other: load balancing, checkpointing, etc.
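A minimal Charm++ sketch (not LeanCP code) of this virtualization idea: a 1-D chare array whose element count is a multiple of, and independent of, the processor count, so the runtime can overlap one element's communication with another's computation. The module and class names are illustrative.

```cpp
// Minimal Charm++ sketch (illustrative names): many virtual processors (chare
// array elements) per physical processor, scheduled by the Charm++ runtime.
//
// sketch.ci (Charm++ interface file, assumed):
//   mainmodule sketch {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Worker {
//       entry Worker();
//       entry void compute();
//     };
//   };

#include "sketch.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    int nElems = 16 * CkNumPes();               // many more VPs than processors
    CProxy_Worker workers = CProxy_Worker::ckNew(nElems);
    workers.compute();                          // broadcast work to all elements
    delete m;
  }
};

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage *msg) {}
  void compute() {
    // per-element work; the scheduler interleaves elements on each PE,
    // overlapping one element's communication with another's computation
    CkPrintf("element %d on PE %d\n", thisIndex, CkMyPe());
    contribute(CkCallback(CkCallback::ckExit)); // empty reduction, then exit
  }
};

#include "sketch.def.h"
```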
Decomposition • A higher degree of virtualization is better for Charm++ • Real-space state planes, G-space state planes, Rho Real and Rho G, and S-Calculators for each G-space state plane • Tens of thousands of chares for a 32-molecule problem • Careful scheduling to maximize efficiency • Most of the computation is in FFTs and matrix multiplies
3-D FFT Implementation
• “Dense” 3-D FFT
• “Sparse” 3-D FFT
Parallel FFT Library • Slab-based parallelization • We do not re-implement the sequential routine • Utilize 1-D and 2-D FFT routines provided by FFTW • Allow for • Multiple 3-D FFTs simultaneously • Multiple data sets within the same set of slab objects • Useful as 3-D FFTs are frequently used in CP computations
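A hedged sketch of the slab idea, assuming contiguous row-major storage and illustrative names (not the library's actual code): FFTW's fftw_plan_many_dft batches the 2-D FFTs over the planes a slab owns and the 1-D FFTs over the pencils it owns after the parallel transpose, which is omitted here.

```cpp
// Illustrative slab object for the 3-D FFT: batched 2-D FFTs before the
// transpose, batched 1-D FFTs along the remaining axis after it.
#include <fftw3.h>

struct Slab {
  int nx, ny, nz;            // global grid dimensions
  int myPlanes, myPencils;   // planes / pencils owned by this slab object
  fftw_complex *planes;      // myPlanes  * nx * ny elements
  fftw_complex *pencils;     // myPencils * nz      elements
  fftw_plan plan2d, plan1d;

  void makePlans() {
    int n2[2] = {nx, ny};
    // one batched plan for all XY planes this slab owns
    plan2d = fftw_plan_many_dft(2, n2, myPlanes,
                                planes, NULL, 1, nx * ny,
                                planes, NULL, 1, nx * ny,
                                FFTW_FORWARD, FFTW_ESTIMATE);
    // one batched plan for all Z pencils this slab owns after the transpose
    plan1d = fftw_plan_many_dft(1, &nz, myPencils,
                                pencils, NULL, 1, nz,
                                pencils, NULL, 1, nz,
                                FFTW_FORWARD, FFTW_ESTIMATE);
  }
  void fftPlanes()  { fftw_execute(plan2d); }   // before the parallel transpose
  void fftPencils() { fftw_execute(plan1d); }   // after the parallel transpose
};
```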
Matrix Multiply • AKA Scalculator or Pair Calculator • Decompose state-plane values into smaller objects. • Use DGEMM on smaller sub-matrices • Sum together via reduction back to Gspace
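A hedged sketch of this pair-calculator pattern, with hypothetical class and file names: DGEMM on the local sub-blocks, followed by a Charm++ sum reduction that carries the partial product back toward G-space through a callback assumed to be set up at creation time.

```cpp
// Illustrative pair-calculator chare (not LeanCP's actual classes).
#include <cblas.h>
#include <cstddef>
#include <vector>
#include "paircalc.decl.h"   // assumed Charm++ interface file declaring PairCalc

class PairCalc : public CBase_PairCalc {
  int m, n, k;               // dimensions of this chare's sub-blocks
  CkCallback resultCb;       // assumed: reduction callback back to G-space
 public:
  PairCalc(int m_, int n_, int k_, CkCallback cb)
      : m(m_), n(n_), k(k_), resultCb(cb) {}
  PairCalc(CkMigrateMessage *msg) {}

  // Invoked once both input sub-blocks have arrived (row-major storage assumed).
  void multiply(const double *A, const double *B) {
    std::vector<double> C(static_cast<std::size_t>(m) * n, 0.0);
    // C = A * B^T on the local sub-blocks
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k, 1.0, A, k, B, k, 0.0, C.data(), n);
    // Sum the partial products across all pair calculators for this plane;
    // the reduction result is delivered back to G-space via resultCb.
    contribute(C.size() * sizeof(double), C.data(),
               CkReduction::sum_double, resultCb);
  }
};
```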
Charm++ Tricks and Tips • Message driven execution and high degree of virtualization present tuning challenges • Flow of control using Rth-Threads • Prioritized messages • Commlib framework • Charm++ arrays vs groups • Problem identification with projections • Problem isolation techniques
Flow Control in Parallel • Rth Threads • Based on Duff's device, these are user-level threads with negligible overhead • Essentially goto and return without the loss of readability • Allow an event-loop style of programming • Make the flow of control explicit • Use familiar threading semantics
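The actual Rth interface is not reproduced here; the following is a minimal Duff's-device coroutine sketch in the same spirit (hypothetical RTH_* names, similar to protothreads), showing how a switch on a saved resume point turns an ordinary method into a suspendable, event-driven phase driver.

```cpp
// Duff's-device coroutine sketch: the switch resumes execution at the line
// where the thread last suspended, so the only state is one integer.
struct RthState { int resumePoint = 0; };

#define RTH_BEGIN(s)   switch ((s)->resumePoint) { case 0:
#define RTH_SUSPEND(s) do { (s)->resumePoint = __LINE__; return; \
                            case __LINE__: ; } while (0)
#define RTH_END(s)     } (s)->resumePoint = 0

// Example: a phased driver that suspends after starting each phase and is
// resumed (run() called again) by the entry method that receives the replies.
struct GSpaceDriver {
  RthState rth;
  void run() {
    RTH_BEGIN(&rth);
    startForwardFFT();   // phase 1: send messages ...
    RTH_SUSPEND(&rth);   // ... return to the scheduler until replies arrive
    startNonLocal();     // phase 2
    RTH_SUSPEND(&rth);
    finishIteration();
    RTH_END(&rth);
  }
  void startForwardFFT() {}
  void startNonLocal()   {}
  void finishIteration() {}
};
```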
Communication Library • Fine-grained decomposition can result in many small messages • Message combining via the Commlib framework in Charm++ addresses this problem • The streaming protocol optimizes many-to-many personalized communication • Forwarding protocols such as Ring or Multiring can be beneficial • But not on BG/L
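A conceptual sketch of message combining, with hypothetical names and not reflecting the actual Commlib API: small per-destination records are buffered and flushed as one combined message once a threshold is reached, trading a little latency for far fewer network messages.

```cpp
// Conceptual message combiner (illustrative, not the Charm++ Commlib API).
#include <cstddef>
#include <vector>

struct SmallUpdate { int targetIndex; double value; };

class Combiner {
  std::vector<std::vector<SmallUpdate>> buffers;  // one buffer per destination PE
  std::size_t threshold;                          // flush after this many records
 public:
  Combiner(int numDests, std::size_t thresh) : buffers(numDests), threshold(thresh) {}

  void deposit(int destPe, const SmallUpdate &u) {
    buffers[destPe].push_back(u);
    if (buffers[destPe].size() >= threshold) flush(destPe);
  }
  void flushAll() {
    for (std::size_t pe = 0; pe < buffers.size(); ++pe) flush(pe);
  }

 private:
  void flush(std::size_t destPe) {
    if (buffers[destPe].empty()) return;
    sendCombined(destPe, buffers[destPe]);        // one message instead of many
    buffers[destPe].clear();
  }
  void sendCombined(std::size_t destPe, const std::vector<SmallUpdate> &batch) {
    // in the real code this would be a single Charm++ message to destPe
  }
};
```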
Streaming Commlib Saves Time
• 610 ms vs. 480 ms
Bound Arrays • Why? Efficiency and clarity of expression • Two arrays of the same dimensionality whose like indices are co-placed • G-space and the non-local computation both have plane-based computations and share many data elements • Use ckLocal() to access co-located elements with ordinary local functions and local function calls • The elements remain distinct parallel objects
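A hedged sketch, with illustrative class names, of how two Charm++ arrays are bound at creation time with CkArrayOptions::bindTo and how a bound element then reaches its partner's data through ckLocal() instead of a message.

```cpp
// Sketch of bound arrays (illustrative names, not LeanCP's actual code):
// GSpacePlane and NonLocal are bound so that elements with the same index are
// always placed on the same processor.

// At array creation (e.g. in the main chare):
//   CkArrayOptions opts(numStates, numPlanes);                  // same dimensions
//   CProxy_GSpacePlane gProxy  = CProxy_GSpacePlane::ckNew(opts);
//   opts.bindTo(gProxy);                                        // co-place like indices
//   CProxy_NonLocal    nlProxy = CProxy_NonLocal::ckNew(opts);

// Inside a NonLocal element (gProxy stored as a member; localForces and
// addNonLocalContribution are illustrative):
void NonLocal::computeWithPartner() {
  GSpacePlane *partner = gProxy(thisIndex.x, thisIndex.y).ckLocal();
  if (partner != NULL) {
    // ordinary local call on the co-located element: no marshalling, no message
    partner->addNonLocalContribution(localForces);
  }
}
```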
Group Caching Techniques • Group objects have one element per processor • This makes them excellent cache points for arrays that may have many chares per processor • Place low-volatility data in the group • Array elements access it via ckLocalBranch() • In CPMD: all chares that hold plane P share the same copy of the Structure Factor in memory
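A hedged sketch of the group-as-cache pattern with illustrative names (not the actual LeanCP classes): the group has one branch per processor, and every array element on that processor fetches the shared, low-volatility data through ckLocalBranch().

```cpp
// Per-processor cache implemented as a Charm++ group (illustrative names).
#include <map>
#include <utility>
#include <vector>
#include "sfcache.decl.h"   // assumed Charm++ interface file declaring SFCache

class SFCache : public CBase_SFCache {           // group: one branch per PE
  std::map<int, std::vector<double>> sfByPlane;  // low-volatility cached data
 public:
  SFCache() {}
  void set(int plane, std::vector<double> sf) {  // filled once per plane
    sfByPlane[plane] = std::move(sf);
  }
  const std::vector<double> &structureFactor(int plane) {
    return sfByPlane[plane];                     // read by every chare on this PE
  }
};

// From any array element on the same processor (sfCacheProxy assumed):
//   SFCache *cache = sfCacheProxy.ckLocalBranch();   // local pointer, no message
//   const std::vector<double> &sf = cache->structureFactor(myPlane);
```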
Charm++ Performance Debugging • Complex parallel applications are hard to debug • The event-based model with a high degree of virtualization presents new challenges • Tools: Projections and the Charm++ debugger • Bottleneck identification using the Projections Usage Profile tool
Problem Isolation Techniques • Using Rth threads, it is easy to isolate phases by adding a barrier • Contribute to a reduction -> suspend • The reduction's client is a broadcast to the proxy -> resume • In the following example we break the G-space IFFT into computation and communication entry methods, as sketched below • We then insert a barrier between them to highlight a specific performance problem
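A hedged sketch of that barrier, with illustrative names: the computation entry method contributes to an empty reduction whose client broadcasts the communication entry method to the whole array, so the two halves of the phase appear cleanly separated in the Projections timeline.

```cpp
// Phase-isolation barrier sketch (illustrative names; the real code differs).
#include "gspace.decl.h"   // assumed interface file; resumeComm is declared
                           // there as a [reductiontarget] entry method

void GSpacePlane::finishIFFTCompute() {
  // ... computation half of the IFFT for this plane ...
  // Barrier: every element contributes; the callback broadcasts to the array.
  contribute(CkCallback(CkReductionTarget(GSpacePlane, resumeComm), thisProxy));
  // nothing else here: the element "suspends" by simply returning
}

void GSpacePlane::resumeComm() {
  // communication half of the IFFT, started only after the global barrier
  startIFFTCommunication();
}
```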
Optimizations Motivated by BG/L • Finer decomposition • The Structure Factor and non-local computation now operate on groups of atoms within a plane • Improved scaling • Avoid creating network bottlenecks • No DMA or communication offload on BG/L's torus network • Workarounds for the MPI progress engine • Set the eager limit < 1000 • Add network probes inside inner loops (see the sketch below) • Shift communication to avoid interference across computation phases
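A hedged sketch of the "probes inside inner loops" workaround, not the actual Charm++ machine-layer code; the loop structure and probe interval are illustrative. Without DMA or communication offload, the MPI progress engine only advances inside MPI calls, so long compute loops periodically call MPI_Iprobe to keep pending messages moving.

```cpp
// Illustrative progress-engine workaround using MPI_Iprobe.
#include <mpi.h>

void longComputeLoop(int iterations) {
  const int probeInterval = 64;              // tuning knob, illustrative value
  for (int i = 0; i < iterations; ++i) {
    // ... one chunk of DGEMM / FFT work ...
    if (i % probeInterval == 0) {
      int flag;
      MPI_Status status;
      // The probe itself is cheap; its side effect is to advance the MPI
      // progress engine so pending sends and receives make headway.
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    }
  }
}
```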
Future Work
• Scaling to 20k processors on BG/L: density pencil FFTs
• Rho-space real-to-complex doublepack optimization
• New FFT-based algorithm for the Structure Factor
• More systems
• Topology-aware chare mapping
• HLL orchestration expression
What time is it in Scotland? • There is a 1024-node BG/L in Edinburgh • The time there is 6 hours ahead of Central Time • During this non-production time we can run on the full rack at night • Thank you EPCC!