
Presentation Transcript


  1. Kernel and Application Code Performance for a Spectral Atmospheric Global Circulation Model on the Cray T3E and IBM SP
  Patrick H. Worley
  Computer Science and Mathematics Division, Oak Ridge National Laboratory
  NERSC Users’ Group Meeting, Oak Ridge, TN, June 6, 2000

  2. Alternative Title … a random collection of benchmarks, looking at communication, serial, and parallel performance on the IBM SP and other MPPs at NERSC and ORNL.

  3. Acknowledgements
  • Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
  • These slides have been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
  • Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the United States Department of Energy under Contract No. DE-AC05-00OR22725.

  4. Platforms at NERSC
  • IBM SP
    • 2-way Winterhawk I SMP “wide” nodes with 1 GB memory
    • 200 MHz Power 3 processors with 4 MB L2 cache
    • 1.6 GB/sec node memory bandwidth (single bus)
    • Omega multistage interconnect
  • SGI/Cray Research T3E-900
    • Single processor nodes with 256 MB memory
    • 450 MHz Alpha 21164 (EV5) with 96 KB L2 cache
    • 1.2 GB/sec node memory bandwidth
    • 3D torus interconnect

  5. Platforms at ORNL
  • IBM SP
    • 4-way Winterhawk II SMP “thin” nodes with 2 GB memory
    • 375 MHz Power 3-II processors with 8 MB L2 cache
    • 1.6 GB/sec node memory bandwidth (single bus)
    • Omega multistage interconnect
  • Compaq AlphaServer SC
    • 4-way ES40 SMP nodes with 2 GB memory
    • 667 MHz Alpha 21264a (EV67) processors with 8 MB L2 cache
    • 5.2 GB/sec node memory bandwidth (dual bus)
    • Quadrics “fat tree” interconnect

  6. Other Platforms
  • SGI/Cray Research Origin 2000 at LANL
    • 128-way SMP node with 32 GB memory
    • 250 MHz MIPS R10000 processors with 4 MB L2 cache
    • NUMA memory subsystem
  • IBM SP
    • 16-way Nighthawk II SMP node
    • 375 MHz Power3-II processors with 8 MB L2 cache
    • switch-based memory subsystem
    • Results obtained using prerelease hardware and software

  7. Topics
  • Interprocessor communication performance
  • Serial performance
    • PSTSWM spectral dynamics kernel
    • CRM column physics kernel
  • Parallel performance
    • CCM/MP-2D atmospheric global circulation model

  8. Communication Tests
  • Interprocessor communication performance
    • within an SMP node
    • between SMP nodes
    • with and without contention
    • with and without cache invalidation
    • for both bidirectional and unidirectional communication protocols
  • Brief description of some results follows; a minimal sketch of the basic measurements appears below. For more details, see http://www.epm.ornl.gov/~worley/studies/pt2pt.html
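
To make the tests concrete, the sketch below shows, in C with MPI, the two basic protocols behind the figures that follow: an MPI_Sendrecv swap for the bidirectional case and an MPI_Send/MPI_Recv pair for the unidirectional case. This is a minimal sketch, not the benchmark itself; the message size, repetition count, and timing harness are illustrative choices, and the cache-invalidation, placement, and contention variations are handled by the actual test code documented at the URL above.

    /* Minimal point-to-point bandwidth sketch: run with exactly 2 MPI ranks.
     * NBYTES and NREPS are illustrative choices, not the benchmark's values. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (1 << 20)   /* 1 MB messages */
    #define NREPS  100

    int main(int argc, char **argv)
    {
        int rank;
        char *sbuf = malloc(NBYTES), *rbuf = malloc(NBYTES);
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;   /* ranks 0 and 1 exchange messages */

        /* Bidirectional (swap) test: both ranks send and receive at once. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++)
            MPI_Sendrecv(sbuf, NBYTES, MPI_BYTE, peer, 0,
                         rbuf, NBYTES, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("bidirectional: %.1f MB/s\n",
                   2.0 * NBYTES * NREPS / (t1 - t0) / 1.0e6);

        /* Unidirectional test: rank 0 sends, rank 1 receives. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0)
                MPI_Send(sbuf, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(rbuf, NBYTES, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("unidirectional: %.1f MB/s\n",
                   1.0 * NBYTES * NREPS / (t1 - t0) / 1.0e6);

        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }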

  9. Communication Tests MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at NERSC

  10. Communication Tests MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at NERSC

  11. Communication Tests MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at ORNL

  12. Communication Tests Bidirectional bandwidth comparison across platforms: swap between processors 0-1

  13. Communication Tests Bidirectional bandwidth comparison across platforms: swap between processors 0-4

  14. Communication Tests Bidirectional bandwidth comparison across platforms: simultaneous swap between processors 0-4,1-5,2-6,3-7

  15. Communication Tests Bidirectional bandwidth comparison across platforms: 8 processor send/recv ring 0-1-2-3-4-5-6-7-0
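
The simultaneous-swap and ring patterns of the last two figures can be sketched as follows, assuming 8 MPI ranks with ranks 0-3 on one SMP node and 4-7 on another (rank placement is an assumption here; the real runs control it through the job launcher). The fragment is meant to drop into the timing harness of the earlier sketch, and using MPI_Sendrecv for the ring step is my choice, not necessarily the benchmark's.

    #include <mpi.h>

    /* Simultaneous swap (pairs 0-4, 1-5, 2-6, 3-7 when offset == 4):
     * all four processor pairs share the node's switch adapter at once. */
    void swap_with_offset(char *sbuf, char *rbuf, int nbytes, int offset)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = rank ^ offset;
        MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, peer, 0,
                     rbuf, nbytes, MPI_BYTE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Ring 0-1-2-3-4-5-6-7-0: send downstream, receive from upstream. */
    void ring_exchange(char *sbuf, char *rbuf, int nbytes)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, (rank + 1) % nprocs, 0,
                     rbuf, nbytes, MPI_BYTE, (rank + nprocs - 1) % nprocs, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }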

  16. Communication Tests
  • Summary
    • Decent intranode performance is possible.
    • Message-passing functionality is good.
    • Switch/NIC performance is the limiting factor in internode communication.
    • Contention for switch/NIC bandwidth within SMP nodes can be significant.

  17. Serial Performance
  • Issues
    • Compiler optimization
    • Domain decomposition
    • Memory contention in SMP nodes
  • Kernel codes
    • PSTSWM - spectral dynamics
    • CRM - column physics

  18. Spectral Dynamics
  • PSTSWM
    • solves the nonlinear shallow water equations on a sphere using the spectral transform method
    • 99% of floating point operations are fmul, fadd, or fmadd
    • memory is accessed linearly, but with little reuse
    • (longitude, vertical, latitude) array index ordering (see the layout sketch below)
    • computation is independent between horizontal layers (fixed vertical index)
    • as the vertical dimension size increases, demands on memory increase
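
PSTSWM itself is a Fortran code, but the C mock-up below gives a rough picture of the data layout just described, assuming the same (longitude, vertical, latitude) ordering with longitude as the unit-stride index. The field names, grid sizes, and the grid-point update are illustrative stand-ins, not the actual shallow-water computation.

    #define NLON 128   /* T42 horizontal grid, as an example */
    #define NLAT 64
    #define NVER 18

    /* (longitude, vertical, latitude) ordering: longitude varies fastest
     * in memory, matching Fortran column-major (lon, vert, lat) arrays. */
    static double u[NLAT][NVER][NLON];
    static double tend[NLAT][NVER][NLON];

    void grid_point_update(void)
    {
        /* Each horizontal layer (fixed vertical index k) is independent of
         * the others.  The inner longitude loop streams through memory with
         * unit stride but reuses little data, so a larger vertical dimension
         * means more data streamed per step and more pressure on node
         * memory bandwidth. */
        for (int j = 0; j < NLAT; j++)
            for (int k = 0; k < NVER; k++)
                for (int i = 0; i < NLON; i++)
                    tend[j][k][i] = 0.5 * (u[j][k][i] + tend[j][k][i]);
    }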

  19. Spectral Dynamics
  Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512
  PSTSWM on the IBM SP at NERSC

  20. Spectral Dynamics PSTSWM on the IBM SP at NERSC

  21. Spectral Dynamics
  Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512
  PSTSWM Platform comparisons - 1 processor per SMP node

  22. Spectral Dynamics
  Horizontal resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512
  PSTSWM Platform comparisons - all processors active in SMP node (except Origin-250)

  23. Spectral Dynamics PSTSWM Platform comparisons - 1 processor per SMP node

  24. Spectral Dynamics PSTSWM Platform comparisons - all processors active in SMP node (except Origin-250)

  25. Spectral Dynamics
  • Summary
    • Math libraries and relaxed mathematical semantics improve performance significantly on the IBM SP.
    • Node memory bandwidth is important (for this kernel code), especially on bus-based SMP nodes.
    • The IBM SP serial performance is a significant improvement over the (previous generation) Origin and T3E systems.

  26. Column Physics
  • CRM
    • Column Radiation Model extracted from the Community Climate Model
    • 6% of floating point operations are sqrt, 3% are fdiv
    • exp, log, and pow are among the six most frequently called functions
    • (longitude, vertical, latitude) array index ordering
    • computations are independent between vertical columns (fixed longitude, latitude); see the column-loop sketch below
    • as the longitude dimension size increases, demands on memory increase
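
For contrast, here is a similarly illustrative mock-up of the column-physics pattern: vertical columns (fixed longitude and latitude) are independent of one another, while the work within a column walks the vertical levels and leans on sqrt/exp-style intrinsics. CRM is Fortran; the arrays, sizes, and the per-level update below are assumptions for illustration only.

    #include <math.h>

    #define NLON 128
    #define NLAT 64
    #define NVER 18

    static double t[NLAT][NVER][NLON];   /* same (lon, vert, lat) ordering */
    static double q[NLAT][NVER][NLON];

    void column_physics(void)
    {
        for (int j = 0; j < NLAT; j++)
            for (int i = 0; i < NLON; i++)       /* one vertical column */
                for (int k = 1; k < NVER; k++)   /* levels couple vertically */
                    q[j][k][i] = exp(-q[j][k-1][i]) + sqrt(t[j][k][i]);
        /* Within a column, consecutive levels are NLON elements apart, so a
         * larger longitude dimension spreads each column over more of memory
         * and increases the demand on the memory subsystem. */
    }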

  27. Column Physics
  CRM on the NERSC SP: longitude-vertical slice, with varying number of longitudes

  28. Column Physics
  CRM: longitude-vertical slice with varying number of longitudes, 1 processor per SMP node

  29. Column Physics
  • Summary
    • Performance is less sensitive to node memory bandwidth for this kernel code.
    • Performance on the IBM SP is very sensitive to compiler optimization and domain decomposition.

  30. Parallel Performance
  • Issues
    • Scalability
    • Overhead growth and analysis
  • Codes
    • CCM/MP-2D

  31. CCM/MP-2D
  • Message-passing parallel implementation of the National Center for Atmospheric Research (NCAR) Community Climate Model
  • Computational Domains
    • Physical domain: Longitude x Latitude x Vertical levels
    • Fourier domain: Wavenumber x Latitude x Vertical levels
    • Spectral domain: (Wavenumber x Polynomial degree) x Vertical levels

  32. CCM/MP-2D
  • Problem Sizes
    • T42L18
      • 128 x 64 x 18 physical domain grid
      • 42 x 64 x 18 Fourier domain grid
      • 946 x 18 spectral domain grid
      • ~59.5 GFlops per simulated day
    • T170L18
      • 512 x 256 x 18 physical domain grid
      • 170 x 256 x 18 Fourier domain grid
      • 14706 x 18 spectral domain grid
      • ~3231 GFlops per simulated day
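
As a cross-check on the spectral-grid sizes (my arithmetic, not stated on the slide): a triangular truncation TM retains (M+1)(M+2)/2 (wavenumber, polynomial degree) pairs, so T42 gives 43 x 44 / 2 = 946 and T170 gives 171 x 172 / 2 = 14706, matching the 946 x 18 and 14706 x 18 spectral grids above.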

  33. CCM/MP-2D
  • Computations
    • Column Physics
      • independent between vertical columns
    • Spectral Dynamics
      • Fourier transform in the longitude direction
      • Legendre transform in the latitude direction
      • tendencies for timestepping calculated in the spectral domain, independent between spectral coordinates
    • Semi-Lagrangian Advection
      • uses local approximations to interpolate wind fields and particle distributions away from grid points

  34. CCM/MP-2D
  • Decomposition across latitude
    • parallelizes the Legendre transform: currently uses a distributed global sum algorithm
    • requires north/south halo updates for semi-Lagrangian advection (see the halo-update sketch below)
  • Decomposition across longitude
    • parallelizes the Fourier transform: either use a distributed FFT algorithm, or transpose the fields and use a serial FFT
    • requires east/west halo updates for semi-Lagrangian advection
    • requires night/day vertical column swaps to load balance the physics
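
As an illustration of the north/south halo update referenced above: the C fragment below assumes a latitude-decomposed slab with one ghost latitude row on each side and uses MPI_Sendrecv. The halo width, array shapes, and choice of communication call are assumptions, not necessarily what CCM/MP-2D does; the distributed global sum used for the Legendre transform (typically built from a reduction such as MPI_Allreduce) is not shown.

    #include <mpi.h>

    #define NLON 128
    #define NVER 18
    #define NLAT_LOC 8                 /* latitudes owned by this process */

    /* local slab with one ghost latitude row at each end:
     * fld[0] = south ghost, fld[1..NLAT_LOC] = owned, fld[NLAT_LOC+1] = north ghost */
    static double fld[NLAT_LOC + 2][NVER][NLON];

    /* north/south are the neighbor ranks; at the poles pass MPI_PROC_NULL. */
    void halo_update_ns(int north, int south)
    {
        const int count = NVER * NLON;

        /* send southernmost owned row south, receive north ghost from north */
        MPI_Sendrecv(fld[1],            count, MPI_DOUBLE, south, 0,
                     fld[NLAT_LOC + 1], count, MPI_DOUBLE, north, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* send northernmost owned row north, receive south ghost from south */
        MPI_Sendrecv(fld[NLAT_LOC],     count, MPI_DOUBLE, north, 1,
                     fld[0],            count, MPI_DOUBLE, south, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }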

  35. CCM/MP-2D Sensitivity of message volume to domain decomposition

  36. Scalability CCM/MP-2D T42L18 Benchmark

  37. Scalability CCM/MP-2D T170L18 Benchmark

  38. Overhead CCM/MP-2D T42L18 Benchmark Overhead Time Diagnosis

  39. Overhead CCM/MP-2D T170L18 Benchmark Overhead Time Diagnosis

  40. CCM/MP-2D
  • Summary
    • Parallel algorithm optimization is (still) important for achieving peak performance.
    • Bottlenecks:
      • Message-passing bandwidth and latency
      • SMP node memory bandwidth on the SP
