Applying Automated Memory Analysis to improve the iterative solver in the Parallel Ocean Program
John M. Dennis: dennis@ucar.edu
Elizabeth R. Jessup: jessup@cs.colorado.edu
Petascale Computation for the Geosciences Workshop, April 5, 2006
Motivation
• Outgrowth of PhD thesis:
  • Memory efficient iterative solvers
  • Data movement is expensive
  • Developed techniques to improve memory efficiency
• Apply Automated Memory Analysis to POP
• Parallel Ocean Program (POP) solver:
  • Large % of time
  • Scalability issues
Outline:
• Motivation
• Background
• Data movement
• Serial Performance
• Parallel Performance
• Space-Filling Curves
• Conclusions
Automated Memory Analysis?
• Analyzes an algorithm written in Matlab
• Predicts the minimum required data movement if the algorithm were written in C/C++ or Fortran
• Predictions allow us to:
  • Evaluate design choices
  • Guide performance tuning
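As a hedged illustration of what a minimum-required prediction means (an illustrative example, not a SLAMM output): for a simple vector update y = y + a*x over n double-precision elements, at least the two operand vectors must be loaded from main memory, no matter how the loop is coded:

\[
\text{minimum loads}(y \leftarrow y + \alpha x) \;\ge\; \underbrace{8n}_{\text{read } x} + \underbrace{8n}_{\text{read } y} \;=\; 16n \ \text{bytes}
\]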
POP using 20x24 blocks (gx1v3)
• POP data structure:
  • Flexible block structure
  • Land 'block' elimination
• Small blocks:
  • Better load balance and land-block elimination
  • Larger halo overhead (rough estimate below)
• Larger blocks:
  • Smaller halo overhead
  • Load imbalance
  • No land-block elimination
• Grid resolutions:
  • test: 128x192
  • gx1v3: 320x384
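A rough, hedged estimate of the halo trade-off (the halo width g = 2 is an assumption for illustration, not taken from the slides): the relative halo overhead of a b_x x b_y block is

\[
\frac{(b_x+2g)(b_y+2g)-b_x b_y}{b_x b_y}
\qquad\Rightarrow\qquad
20\times 24:\ \frac{24\cdot 28-480}{480}=40\%,
\qquad
40\times 48:\ \frac{44\cdot 52-1920}{1920}\approx 19\%
\]

so halving the block edge roughly doubles the relative halo cost, which is why small blocks trade better load balance and land-block elimination against more communication.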
Alternate Data Structure
• 2D data structure:
  • Advantages: regular stride-1 access; compact form of the stencil operator
  • Disadvantages: includes land points; problem-specific data structure
• 1D data structure:
  • Advantages: no more land points; general data structure
  • Disadvantages: indirect addressing; larger stencil operator
• (Access-pattern sketch below)
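A minimal sketch of the access-pattern difference, assuming a toy 8x8 domain, a 5-point stencil, and hypothetical array names (this is not POP code): the 2D form sweeps regular arrays and computes over land as well, while the 1D form stores only ocean points and reaches neighbors through index arrays, i.e. indirect addressing.

  program layout_sketch
    implicit none
    integer, parameter :: nx = 8, ny = 8, nocean = nx*ny   ! toy sizes
    real(8) :: x2(nx,ny), y2(nx,ny), mask(nx,ny)
    real(8) :: x1(nocean), y1(nocean)
    integer :: inorth(nocean), isouth(nocean), ieast(nocean), iwest(nocean)
    integer :: i, j, n

    ! --- 2D form: regular stride-1 access, but land points are computed too
    x2 = 1.0d0;  mask = 1.0d0        ! mask would be 0 on land in real code
    do j = 2, ny-1
      do i = 2, nx-1
        y2(i,j) = mask(i,j) * (4.0d0*x2(i,j) - x2(i-1,j) - x2(i+1,j) &
                                             - x2(i,j-1) - x2(i,j+1))
      end do
    end do

    ! --- 1D form: only ocean points stored; neighbors found indirectly.
    ! The index arrays would be built once from the land mask; here they
    ! are filled with a trivial self-reference just so the sketch runs.
    x1 = 1.0d0
    inorth = (/ (n, n = 1, nocean) /)
    isouth = inorth;  ieast = inorth;  iwest = inorth
    do n = 1, nocean
      y1(n) = 4.0d0*x1(n) - x1(inorth(n)) - x1(isouth(n)) &
                          - x1(ieast(n))  - x1(iwest(n))
    end do
    print *, y2(2,2), y1(1)
  end program layout_sketch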
Outline (next: Data movement)
Data movement
• Working-set load size (WSL): data loaded from main memory into the L1 cache
• Measure WSL using PAPI (WSL_M)
• Compute platforms:
  • Sun Ultra II (400 MHz)
  • IBM POWER4 (1.3 GHz)
  • SGI R14K (500 MHz)
• Compare with the prediction (WSL_P)
Predicting Data Movement
• SLAMM predicts WSL_P:
  • solver w/ 2D data structure (Matlab): 4902 KB
  • solver w/ 1D data structure (Matlab): 3218 KB
• 1D data structure -> 34% reduction in data movement
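The quoted reduction follows directly from the two predictions:

\[
\frac{4902 - 3218}{4902} \approx 0.344 \approx 34\%
\]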
Measured versus predicted data movement: the measurements show excessive data movement relative to the prediction.
Two blocks of source code (PCG2 with the 2D data structure): version 1 streams the full w0 array through memory a second time after the loop; version 2 accumulates the local sum inside the loop.

PCG2+2D v1:

  do i=1,nblocks
     p(:,:,i)  = z(:,:,i) + p(:,:,i)*beta
     q(:,:,i)  = A*p(:,:,i)            ! apply the stencil operator
     w0(:,:,i) = q(:,:,i)*p(:,:,i)
  enddo
  delta = gsum(w0,lmask)               ! w0 array accessed again after the loop!

PCG2+2D v2 (extra access of w0 eliminated):

  ldelta = 0
  do i=1,nblocks
     p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
     q(:,:,i) = A*p(:,:,i)             ! apply the stencil operator
     w0       = q(:,:,i)*p(:,:,i)      ! w0 is now a per-block temporary
     ldelta   = ldelta + lsum(w0,lmask)
  enddo
  delta = gsum(ldelta)
Measured versus predicted data movement: with version 2, the measured data movement matches the prediction.
Outline (next: Serial Performance)
Using 1D data structures in the POP2 solver (serial)
• Replace solvers.F90
• Measure execution time on cache-based microprocessors
• Examine two CG algorithms with a diagonal preconditioner (a sketch of the PCG2 iteration follows below):
  • PCG2 (2 inner products per iteration)
  • PCG1 (1 inner product per iteration) [D'Azevedo 93]
• Grid: test (128x192 grid points) with 16x16 blocks
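For reference, a minimal, self-contained sketch of diagonal-preconditioned CG with two inner products per iteration (the PCG2 pattern). It is illustrative only, not POP's solvers.F90: the operator is a 1D Laplacian stand-in and all names are hypothetical. In the parallel code each dot product becomes a global sum, which is why the single-reduction PCG1 variant is attractive at scale.

  program pcg2_sketch
    implicit none
    integer, parameter :: n = 100
    real(8) :: x(n), b(n), r(n), z(n), p(n), q(n), diag(n)
    real(8) :: delta, delta_new, alpha, beta
    integer :: it

    b = 1.0d0;  x = 0.0d0
    diag = 2.0d0                     ! diagonal of A, used as the preconditioner

    call apply_a(x, q)
    r = b - q                        ! initial residual
    z = r / diag                     ! apply diagonal preconditioner
    p = z
    delta = dot_product(r, z)        ! reduction #1 (a global sum in parallel)

    do it = 1, 500
      call apply_a(p, q)             ! q = A*p (a 9-point stencil in POP)
      alpha = delta / dot_product(p, q)   ! reduction #2
      x = x + alpha*p
      r = r - alpha*q
      z = r / diag
      delta_new = dot_product(r, z)  ! reduction #1 of the next iteration
      if (sqrt(delta_new) < 1.0d-10) exit   ! preconditioned-norm test
      beta  = delta_new / delta
      delta = delta_new
      p = z + beta*p                 ! same update as the POP loop shown earlier
    end do
    print *, 'iterations:', it, ' residual estimate:', sqrt(delta_new)

  contains
    subroutine apply_a(v, av)        ! stand-in operator: 1D Laplacian
      real(8), intent(in)  :: v(n)
      real(8), intent(out) :: av(n)
      integer :: i
      do i = 1, n
        av(i) = 2.0d0*v(i)
        if (i > 1) av(i) = av(i) - v(i-1)
        if (i < n) av(i) = av(i) - v(i+1)
      end do
    end subroutine apply_a
  end program pcg2_sketch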
Serial execution time on IBM POWER4 (test grid): 56% reduction in cost per iteration.
Outline (next: Parallel Performance)
Using the 1D data structure in the POP2 solver (parallel)
• New parallel halo update (a generic sketch follows below)
• Examine several CG algorithms with a diagonal preconditioner:
  • PCG2 (2 inner products)
  • PCG1 (1 inner product)
• Existing solver/preconditioner technology: Hypre (LLNL), http://www.llnl.gov/CASC/linear_solvers
  • PCG solver
  • Preconditioners: diagonal
  • Hypre integration -> work in progress
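A hedged, generic sketch of the halo-exchange pattern for a gathered 1D layout: pack boundary values into buffers, exchange them with nonblocking MPI, then unpack into the halo entries. The single-neighbor ring topology, buffer size, and names are assumptions for brevity; this is not POP's halo-update code.

  program halo_sketch
    use mpi
    implicit none
    integer, parameter :: nhalo = 4
    real(8) :: sendbuf(nhalo), recvbuf(nhalo)
    integer :: rank, nprocs, left, right, ierr
    integer :: req(2), stats(MPI_STATUS_SIZE, 2)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    right = mod(rank + 1, nprocs)
    left  = mod(rank - 1 + nprocs, nprocs)

    sendbuf = real(rank, 8)                 ! "pack": copy owned boundary points
    call MPI_Irecv(recvbuf, nhalo, MPI_DOUBLE_PRECISION, left,  0, &
                   MPI_COMM_WORLD, req(1), ierr)
    call MPI_Isend(sendbuf, nhalo, MPI_DOUBLE_PRECISION, right, 0, &
                   MPI_COMM_WORLD, req(2), ierr)
    call MPI_Waitall(2, req, stats, ierr)
    ! "unpack": scatter recvbuf into the halo entries of the 1D array
    if (rank == 0) print *, 'received halo data from rank', int(recvbuf(1))
    call MPI_Finalize(ierr)
  end program halo_sketch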
Solver execution time for POP2 (20x24 blocks) on BG/L (gx1v3): 48% and 27% reductions in cost per iteration.
Outline (next: Space-Filling Curves)
0.1 degree POP
• Global eddy-resolving configuration
• Computational grid: 3600 x 2400 x 40
• Land creates problems: load imbalance, limited scalability
• Alternative partitioning algorithm: space-filling curves
• Evaluate using a benchmark: 1 simulated day, internal grid, 7-minute timestep
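For scale, simple arithmetic on the numbers above:

\[
3600 \times 2400 \times 40 = 3.456\times 10^{8}\ \text{grid points},
\qquad
\frac{1440\ \text{min/day}}{7\ \text{min/step}} \approx 206\ \text{timesteps per simulated day}
\]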
Partitioning with Space-filling Curves
• Map the 2D block layout -> a 1D ordering
• Curves exist for a variety of sizes Nb:
  • Hilbert (Nb = 2^n)
  • Peano (Nb = 3^m)
  • Cinco (Nb = 5^p) [new]
  • Hilbert-Peano (Nb = 2^n 3^m)
  • Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
• Partition the resulting 1D array (sketched below)
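Once the blocks are ordered along the curve, the partitioning step reduces to cutting a 1D list into contiguous chunks of roughly equal work. A minimal sketch under that assumption (the block weights, names, and greedy cutting rule are illustrative, not POP's algorithm):

  program sfc_partition_sketch
    implicit none
    integer, parameter :: nblocks = 12, nprocs = 3
    real(8) :: work(nblocks)        ! e.g. number of ocean points per block
    integer :: owner(nblocks)
    real(8) :: target, acc
    integer :: i, p

    work = 1.0d0                    ! pretend all blocks cost the same
    target = sum(work) / nprocs     ! ideal work per processor
    p = 1;  acc = 0.0d0
    do i = 1, nblocks               ! walk the blocks in SFC order
      owner(i) = p
      acc = acc + work(i)
      if (acc >= target .and. p < nprocs) then
        p = p + 1                   ! start filling the next processor
        acc = 0.0d0
      end if
    end do
    print *, owner                  ! contiguous chunks along the curve
  end program sfc_partition_sketch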
Partitioning with SFC: partition for 3 processors.
POP using 20x24 blocks (gx1v3)
POP (gx1v3) + space-filling curve
Space-filling curve (Hilbert, Nb=24)
Remove land blocks
Space-filling curve partition for 8 processors
POP 0.1 degree benchmark on Blue Gene/L
POP 0.1 degree benchmark (courtesy of Y. Yoshida, M. Taylor, P. Worley)
Conclusions
• 1D data structures in the barotropic solver:
  • No more land points
  • Reduce execution time versus the 2D data structure:
    • 48% reduction in solver time (64 processors, BG/L)
    • 9.5% reduction in total time (64 processors, POWER4)
  • Allow use of external solver/preconditioner packages
  • Implementation quality is critical!
• Automated Memory Analysis (SLAMM):
  • Evaluate design choices
  • Guide performance tuning
Conclusions (cont'd)
• Good scalability to 32K processors on BG/L:
  • 2x increase in simulation rate on 32K processors
  • SFC partitioning
  • 1D data structure in the solver
  • Only 7 source files modified
• Future work:
  • Improve scalability (55% parallel efficiency from 1K to 32K processors)
  • Better preconditioners
  • Improve load balance: different block sizes, improved partitioning algorithm
Acknowledgements / Questions?
• Thanks to: F. Bryan (NCAR), J. Edwards (IBM), P. Jones (LANL), K. Lindsay (NCAR), M. Taylor (SNL), H. Tufo (NCAR), W. Waite (CU), S. Weese (NCAR)
• Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
Serial execution time on multiple platforms (test)
Total execution time for POP2 (40x48 blocks) on POWER4 (gx1v3): 9.5% reduction, eliminating the need for ~216,000 CPU hours per year at NCAR.
POP 0.1 degree: increasing parallelism -> decreasing overhead.