Applying Automated Memory Analysis to improve the iterative solver in the Parallel Ocean Program
John M. Dennis: dennis@ucar.edu
Elizabeth R. Jessup: jessup@cs.colorado.edu
Petascale Computation for the Geosciences Workshop, April 5, 2006
Motivation
• Outgrowth of PhD thesis:
  • Memory efficient iterative solvers
  • Data movement is expensive
  • Developed techniques to improve memory efficiency
• Apply Automated Memory Analysis to POP
• Parallel Ocean Program (POP) solver:
  • Large % of time
  • Scalability issues
Outline:
• Motivation
• Background
• Data movement
• Serial Performance
• Parallel Performance
• Space-Filling Curves
• Conclusions
Automated Memory Analysis?
• Analyzes an algorithm written in Matlab
• Predicts the minimum required data movement if the algorithm were written in C/C++ or Fortran
• Predictions allow us to:
  • Evaluate design choices
  • Guide performance tuning
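As a hedged illustration of what a minimum-required prediction means (an illustrative example, not a SLAMM output): for a simple vector update y = y + a*x over n double-precision elements, at least the two operand vectors must be loaded from main memory, no matter how the loop is coded:

\[
\text{minimum loads}(y \leftarrow y + \alpha x) \;\ge\; \underbrace{8n}_{\text{read } x} + \underbrace{8n}_{\text{read } y} \;=\; 16n \ \text{bytes}
\]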
POP using 20x24 blocks (gx1v3)
• POP data structure:
  • Flexible block structure
  • Land 'block' elimination
• Small blocks:
  • Better load balance and land-block elimination
  • Larger halo overhead (rough estimate below)
• Larger blocks:
  • Smaller halo overhead
  • Load imbalance
  • No land-block elimination
• Grid resolutions:
  • test: 128x192
  • gx1v3: 320x384
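A rough, hedged estimate of the halo trade-off (the halo width g = 2 is an assumption for illustration, not taken from the slides): the relative halo overhead of a b_x x b_y block is

\[
\frac{(b_x+2g)(b_y+2g)-b_x b_y}{b_x b_y}
\qquad\Rightarrow\qquad
20\times 24:\ \frac{24\cdot 28-480}{480}=40\%,
\qquad
40\times 48:\ \frac{44\cdot 52-1920}{1920}\approx 19\%
\]

so halving the block edge roughly doubles the relative halo cost, which is why small blocks trade better load balance and land-block elimination against more communication.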
Alternate Data Structure
• 2D data structure:
  • Advantages: regular stride-1 access; compact form of the stencil operator
  • Disadvantages: includes land points; problem-specific data structure
• 1D data structure:
  • Advantages: no more land points; general data structure
  • Disadvantages: indirect addressing; larger stencil operator
• (Access-pattern sketch below)
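A minimal sketch of the access-pattern difference, assuming a toy 8x8 domain, a 5-point stencil, and hypothetical array names (this is not POP code): the 2D form sweeps regular arrays and computes over land as well, while the 1D form stores only ocean points and reaches neighbors through index arrays, i.e. indirect addressing.

  program layout_sketch
    implicit none
    integer, parameter :: nx = 8, ny = 8, nocean = nx*ny   ! toy sizes
    real(8) :: x2(nx,ny), y2(nx,ny), mask(nx,ny)
    real(8) :: x1(nocean), y1(nocean)
    integer :: inorth(nocean), isouth(nocean), ieast(nocean), iwest(nocean)
    integer :: i, j, n

    ! --- 2D form: regular stride-1 access, but land points are computed too
    x2 = 1.0d0;  mask = 1.0d0        ! mask would be 0 on land in real code
    do j = 2, ny-1
      do i = 2, nx-1
        y2(i,j) = mask(i,j) * (4.0d0*x2(i,j) - x2(i-1,j) - x2(i+1,j) &
                                             - x2(i,j-1) - x2(i,j+1))
      end do
    end do

    ! --- 1D form: only ocean points stored; neighbors found indirectly.
    ! The index arrays would be built once from the land mask; here they
    ! are filled with a trivial self-reference just so the sketch runs.
    x1 = 1.0d0
    inorth = (/ (n, n = 1, nocean) /)
    isouth = inorth;  ieast = inorth;  iwest = inorth
    do n = 1, nocean
      y1(n) = 4.0d0*x1(n) - x1(inorth(n)) - x1(isouth(n)) &
                          - x1(ieast(n))  - x1(iwest(n))
    end do
    print *, y2(2,2), y1(1)
  end program layout_sketch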
Outline (next: Data movement)
Data movement
• Working-set load size (WSL): data loaded from main memory into the L1 cache
• Measure WSL using PAPI (WSL_M)
• Compute platforms:
  • Sun Ultra II (400 MHz)
  • IBM POWER4 (1.3 GHz)
  • SGI R14K (500 MHz)
• Compare with the prediction (WSL_P)
Predicting Data Movement
• SLAMM predicts WSL_P:
  • solver w/ 2D data structure (Matlab): 4902 KB
  • solver w/ 1D data structure (Matlab): 3218 KB
• 1D data structure -> 34% reduction in data movement
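The quoted reduction follows directly from the two predictions:

\[
\frac{4902 - 3218}{4902} \approx 0.344 \approx 34\%
\]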
Measured versus predicted data movement: the measurements show excessive data movement relative to the prediction.
Two blocks of source code (PCG2 with the 2D data structure): version 1 streams the full w0 array through memory a second time after the loop; version 2 accumulates the local sum inside the loop.

PCG2+2D v1:

  do i=1,nblocks
     p(:,:,i)  = z(:,:,i) + p(:,:,i)*beta
     q(:,:,i)  = A*p(:,:,i)            ! apply the stencil operator
     w0(:,:,i) = q(:,:,i)*p(:,:,i)
  enddo
  delta = gsum(w0,lmask)               ! w0 array accessed again after the loop!

PCG2+2D v2 (extra access of w0 eliminated):

  ldelta = 0
  do i=1,nblocks
     p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
     q(:,:,i) = A*p(:,:,i)             ! apply the stencil operator
     w0       = q(:,:,i)*p(:,:,i)      ! w0 is now a per-block temporary
     ldelta   = ldelta + lsum(w0,lmask)
  enddo
  delta = gsum(ldelta)
Measured versus predicted data movement: with version 2, the measured data movement matches the prediction.
Outline (next: Serial Performance)
Using 1D data structures in the POP2 solver (serial)
• Replace solvers.F90
• Measure execution time on cache-based microprocessors
• Examine two CG algorithms with a diagonal preconditioner (a sketch of the PCG2 iteration follows below):
  • PCG2 (2 inner products per iteration)
  • PCG1 (1 inner product per iteration) [D'Azevedo 93]
• Grid: test (128x192 grid points) with 16x16 blocks
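For reference, a minimal, self-contained sketch of diagonal-preconditioned CG with two inner products per iteration (the PCG2 pattern). It is illustrative only, not POP's solvers.F90: the operator is a 1D Laplacian stand-in and all names are hypothetical. In the parallel code each dot product becomes a global sum, which is why the single-reduction PCG1 variant is attractive at scale.

  program pcg2_sketch
    implicit none
    integer, parameter :: n = 100
    real(8) :: x(n), b(n), r(n), z(n), p(n), q(n), diag(n)
    real(8) :: delta, delta_new, alpha, beta
    integer :: it

    b = 1.0d0;  x = 0.0d0
    diag = 2.0d0                     ! diagonal of A, used as the preconditioner

    call apply_a(x, q)
    r = b - q                        ! initial residual
    z = r / diag                     ! apply diagonal preconditioner
    p = z
    delta = dot_product(r, z)        ! reduction #1 (a global sum in parallel)

    do it = 1, 500
      call apply_a(p, q)             ! q = A*p (a 9-point stencil in POP)
      alpha = delta / dot_product(p, q)   ! reduction #2
      x = x + alpha*p
      r = r - alpha*q
      z = r / diag
      delta_new = dot_product(r, z)  ! reduction #1 of the next iteration
      if (sqrt(delta_new) < 1.0d-10) exit   ! preconditioned-norm test
      beta  = delta_new / delta
      delta = delta_new
      p = z + beta*p                 ! same update as the POP loop shown earlier
    end do
    print *, 'iterations:', it, ' residual estimate:', sqrt(delta_new)

  contains
    subroutine apply_a(v, av)        ! stand-in operator: 1D Laplacian
      real(8), intent(in)  :: v(n)
      real(8), intent(out) :: av(n)
      integer :: i
      do i = 1, n
        av(i) = 2.0d0*v(i)
        if (i > 1) av(i) = av(i) - v(i-1)
        if (i < n) av(i) = av(i) - v(i+1)
      end do
    end subroutine apply_a
  end program pcg2_sketch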
Serial execution time on IBM POWER4 (test grid): 56% reduction in cost per iteration.
Outline (next: Parallel Performance)
Using the 1D data structure in the POP2 solver (parallel)
• New parallel halo update (a generic sketch follows below)
• Examine several CG algorithms with a diagonal preconditioner:
  • PCG2 (2 inner products)
  • PCG1 (1 inner product)
• Existing solver/preconditioner technology: Hypre (LLNL), http://www.llnl.gov/CASC/linear_solvers
  • PCG solver
  • Preconditioners: diagonal
  • Hypre integration -> work in progress
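A hedged, generic sketch of the halo-exchange pattern for a gathered 1D layout: pack boundary values into buffers, exchange them with nonblocking MPI, then unpack into the halo entries. The single-neighbor ring topology, buffer size, and names are assumptions for brevity; this is not POP's halo-update code.

  program halo_sketch
    use mpi
    implicit none
    integer, parameter :: nhalo = 4
    real(8) :: sendbuf(nhalo), recvbuf(nhalo)
    integer :: rank, nprocs, left, right, ierr
    integer :: req(2), stats(MPI_STATUS_SIZE, 2)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    right = mod(rank + 1, nprocs)
    left  = mod(rank - 1 + nprocs, nprocs)

    sendbuf = real(rank, 8)                 ! "pack": copy owned boundary points
    call MPI_Irecv(recvbuf, nhalo, MPI_DOUBLE_PRECISION, left,  0, &
                   MPI_COMM_WORLD, req(1), ierr)
    call MPI_Isend(sendbuf, nhalo, MPI_DOUBLE_PRECISION, right, 0, &
                   MPI_COMM_WORLD, req(2), ierr)
    call MPI_Waitall(2, req, stats, ierr)
    ! "unpack": scatter recvbuf into the halo entries of the 1D array
    if (rank == 0) print *, 'received halo data from rank', int(recvbuf(1))
    call MPI_Finalize(ierr)
  end program halo_sketch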
Solver execution time for POP2 (20x24 blocks) on BG/L (gx1v3): 48% and 27% reductions in cost per iteration.
Outline (next: Space-Filling Curves)
0.1 degree POP
• Global eddy-resolving configuration
• Computational grid: 3600 x 2400 x 40
• Land creates problems: load imbalance, limited scalability
• Alternative partitioning algorithm: space-filling curves
• Evaluate using a benchmark: 1 simulated day, internal grid, 7-minute timestep
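For scale, simple arithmetic on the numbers above:

\[
3600 \times 2400 \times 40 = 3.456\times 10^{8}\ \text{grid points},
\qquad
\frac{1440\ \text{min/day}}{7\ \text{min/step}} \approx 206\ \text{timesteps per simulated day}
\]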
Partitioning with Space-filling Curves
• Map the 2D block layout -> a 1D ordering
• Curves exist for a variety of sizes Nb:
  • Hilbert (Nb = 2^n)
  • Peano (Nb = 3^m)
  • Cinco (Nb = 5^p) [new]
  • Hilbert-Peano (Nb = 2^n 3^m)
  • Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
• Partition the resulting 1D array (sketched below)
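Once the blocks are ordered along the curve, the partitioning step reduces to cutting a 1D list into contiguous chunks of roughly equal work. A minimal sketch under that assumption (the block weights, names, and greedy cutting rule are illustrative, not POP's algorithm):

  program sfc_partition_sketch
    implicit none
    integer, parameter :: nblocks = 12, nprocs = 3
    real(8) :: work(nblocks)        ! e.g. number of ocean points per block
    integer :: owner(nblocks)
    real(8) :: target, acc
    integer :: i, p

    work = 1.0d0                    ! pretend all blocks cost the same
    target = sum(work) / nprocs     ! ideal work per processor
    p = 1;  acc = 0.0d0
    do i = 1, nblocks               ! walk the blocks in SFC order
      owner(i) = p
      acc = acc + work(i)
      if (acc >= target .and. p < nprocs) then
        p = p + 1                   ! start filling the next processor
        acc = 0.0d0
      end if
    end do
    print *, owner                  ! contiguous chunks along the curve
  end program sfc_partition_sketch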
Partitioning with SFC: partition for 3 processors.
POP using 20x24 blocks (gx1v3)
POP (gx1v3) + space-filling curve
Space-filling curve (Hilbert, Nb=24)
Remove land blocks
Space-filling curve partition for 8 processors
POP 0.1 degree benchmark on Blue Gene/L
POP 0.1 degree benchmark (courtesy of Y. Yoshida, M. Taylor, P. Worley)
Conclusions
• 1D data structures in the barotropic solver:
  • No more land points
  • Reduce execution time versus the 2D data structure:
    • 48% reduction in solver time (64 processors, BG/L)
    • 9.5% reduction in total time (64 processors, POWER4)
  • Allow use of external solver/preconditioner packages
  • Implementation quality is critical!
• Automated Memory Analysis (SLAMM):
  • Evaluate design choices
  • Guide performance tuning
Conclusions (cont'd)
• Good scalability to 32K processors on BG/L:
  • 2x increase in simulation rate on 32K processors
  • SFC partitioning
  • 1D data structure in the solver
  • Only 7 source files modified
• Future work:
  • Improve scalability (55% parallel efficiency from 1K to 32K processors)
  • Better preconditioners
  • Improve load balance: different block sizes, improved partitioning algorithm
Acknowledgements / Questions?
• Thanks to: F. Bryan (NCAR), J. Edwards (IBM), P. Jones (LANL), K. Lindsay (NCAR), M. Taylor (SNL), H. Tufo (NCAR), W. Waite (CU), S. Weese (NCAR)
• Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
Serial execution time on multiple platforms (test)
Total execution time for POP2 (40x48 blocks) on POWER4 (gx1v3): 9.5% reduction, eliminating the need for ~216,000 CPU hours per year at NCAR.
POP 0.1 degree: increasing parallelism -> decreasing overhead.