CCSM Portability and Performance, Software Engineering Challenges, and Future Targets

Tony Craig, National Center for Atmospheric Research, Boulder, Colorado, USA. CAS Meeting, September 7-11, 2003, Annecy, France.

Presentation Transcript


  1. CCSM Portability and Performance, Software Engineering Challenges, and Future Targets
     Tony Craig, National Center for Atmospheric Research, Boulder, Colorado, USA
     CAS Meeting, September 7-11, 2003, Annecy, France

  2. Topics
     • CCSM SE and design overview
     • Coupler design and performance
     • Production and performance
       • portability
       • scaling
     • SE Challenges
     • The Future

  3. CCSM Overview
     • CCSM = Community Climate System Model (NCAR)
     • Designed to evaluate and understand Earth’s global climate, both historical and future
     • Multiple executables (5)
       • Atmosphere (CAM), MPI/OpenMP
       • Ocean (POP), MPI
       • Land (CLM), MPI/OpenMP
       • Sea Ice (CSIM), MPI
       • Coupler (CPL6), MPI

  4. CCSM SE Overview
     • Good science top priority
     • Fortran 90 (mostly)
     • 500k lines of code
     • Community project, dozens of developers
     • Collaborations are critical
       • University Community
       • DOE - SciDAC
       • NASA - ESMF
     • Regular code releases
     • NetCDF history files
     • Binary restart files
     • Many levels of parallelism: multiple executables, MPI, OpenMP

  5. CCSM “Hub and Spoke” System
     • Each component is a separate executable
     • Each component on a unique set of hardware processors
     • All communications go through coupler
     • Coupler
       • communicates with all components
       • maps (interpolates) data
       • merges fields
       • computes some fluxes
       • has diagnostic, history, and restart capability
     [Figure labels: ocn, ice, cpl, lnd, atm]

  6. The CCSM coupler
     • Recent redesign (cpl6)
       • Create a fully parallel distributed memory coupler
       • Implement M to N communication between components
       • Improve communication performance to minimize bottlenecks at higher resolutions in the future
       • Improve coupling interfaces, abstract communication method away from components
       • Improve usability, flexibility, and extensibility of coupled system
       • Improve overall performance
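To make "M to N communication" concrete, here is a minimal Python sketch. It assumes a 1-D field split into contiguous blocks on both the sending and receiving sides. The names `block_ranges` and `m_to_n_schedule` are hypothetical and are not part of cpl6, MCT, or MPH; a real coupler has to handle more general decompositions and would send the data with MPI rather than only listing the messages.

```python
def block_ranges(n_points, n_procs):
    """Split indices 0..n_points-1 into n_procs contiguous blocks.

    Returns a list of half-open (start, stop) ranges, one per process.
    The first (n_points % n_procs) blocks get one extra point.
    """
    base, extra = divmod(n_points, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def m_to_n_schedule(n_points, m_senders, n_receivers):
    """List the (sender, receiver, start, stop) messages needed to move a
    field from an M-process decomposition to an N-process decomposition.

    Each message carries only the index range owned by both processes.
    """
    send = block_ranges(n_points, m_senders)
    recv = block_ranges(n_points, n_receivers)
    schedule = []
    for s, (s0, s1) in enumerate(send):
        for r, (r0, r1) in enumerate(recv):
            lo, hi = max(s0, r0), min(s1, r1)
            if lo < hi:  # the two blocks overlap
                schedule.append((s, r, lo, hi))
    return schedule


if __name__ == "__main__":
    # Illustrative sizes only: 16 sending processes, 8 receiving processes.
    for message in m_to_n_schedule(n_points=64, m_senders=16, n_receivers=8):
        print(message)
```

In this scheme every process sends its own data directly to the processes that need it, so no single root process has to gather the whole field first.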

  7. The Solution
     Build a new coupler framework with abstracted, parallel communication software in the foundation. Create a coupler application instantiation called cpl6 which reproduces the functionality of cpl5.
     [Figure labels: cpl6, MCT*, MPH**]
     *Model Coupling Toolkit (DOE Argonne National Lab)
     **Multi-Component Handshaking Library (DOE Lawrence Berkeley National Lab)

  8. cpl6 Design: Another view of CCSM
     • In cpl5, MPI was the coupling interface
     • In cpl6, the “coupler” is now attached to each component
     • Components unaware of coupling method
     • Coupling work can be carried out on component processors
     • Separate coupler no longer absolutely required
     [Figure labels: hardware processors; cpl, ice, ocn, lnd, atm; coupling interface layer]

  9. CCSM Communication: cpl5 vs cpl6
     • Coupler on 8 PEs, ice component on 16 PEs
     • 240 transfers, 21 fields, production configuration
     • cpl5 (figure labels): copy, gather, comm (root to root), scatter
     • cpl6 (figure labels): NO copy, comm (M to N)
     • cpl5 communication = 61.5 s
     • cpl6 communication = 18.5 s
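The two communication times on slide 9 can be turned into a speedup figure with a few lines of arithmetic. This is only a worked calculation on the numbers quoted on the slide; the variable names are illustrative.

```python
# Communication times quoted on slide 9 (seconds).
cpl5_seconds = 61.5
cpl6_seconds = 18.5

# How many times faster cpl6 communication is than cpl5.
speedup = cpl5_seconds / cpl6_seconds

# Fraction of the cpl5 communication time that cpl6 eliminates.
reduction = 1.0 - cpl6_seconds / cpl5_seconds

print(f"speedup: {speedup:.2f}x")           # speedup: 3.32x
print(f"time reduction: {reduction:.1%}")   # time reduction: 69.9%
```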

  10. CCSM Production
     • Forward integration of relatively coarse models
       • atm/land - T42 (128x64, L26)
       • ocn/ice - 1 degree (320x384, L40)
     • Finite difference and spectral, explicit and implicit methods, vertical physics, global sums, nearest neighbor communication
     • I/O not a bottleneck (5 Gb / simulated year)
     • Restart capability (750 Mb)
     • Separate harvesting to local mass storage system
     • Auto resubmit
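For a sense of scale, the grid dimensions on slide 10 give the following point counts. This sketch assumes that the level counts (L26, L40) apply to the atmosphere and ocean grids respectively; the dictionary keys are illustrative labels.

```python
# (nx, ny, nz) taken from slide 10: horizontal dimensions and vertical levels.
grids = {
    "atm (T42)": (128, 64, 26),
    "ocn (1 degree)": (320, 384, 40),
}

points = {}
for name, (nx, ny, nz) in grids.items():
    columns = nx * ny               # horizontal grid columns
    points[name] = columns * nz     # total 3-D grid points
    print(f"{name}: {columns:,} columns, {points[name]:,} 3-D points")

# Ratio of ocean to atmosphere 3-D grid points.
ratio = points["ocn (1 degree)"] / points["atm (T42)"]
print(f"ocn/atm point ratio: {ratio:.1f}")
```

The ocean grid has 122,880 columns and 4,915,200 points against 8,192 columns and 212,992 points for the atmosphere, roughly 23 times as many 3-D points.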

  11. CCSM Throughput vs Resolution
     *IBM Power4 system, bluesky, as of 9/1/2003

  12. CCSM Throughput vs Platform
     T42/1 degree atm/ocn resolution
     *ANL jazz machine, 2.4 GHz Pentium

  13. Ocean Model Performance and Scaling **Courtesy of PW Jones, PH Worley, Y Yoshida, JB White III, J Levesque

  14. CCSM Component Scaling
     CCSM2_2_beta08, T42_gx1v3, IBM Power4, bluesky

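  15. CCSM Load Balance Example
     CCSM2_2_beta08, IBM Power4, bluesky
     • Processor allocation: 64 ocn, 48 atm, 16 ice, 8 lnd, 16 cpl; 152 total processors
     [Figure: timings in seconds per simulated day. Values as extracted, with their component assignment not recoverable from the transcript: 21.3 4.0 14.3 1.2 11.8 3.3 1.2 2.1 1.2 .8 3.0 1.1 21.7]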
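The processor allocation on slide 15 can be checked and expressed as shares of the machine partition. This is a small worked calculation on the slide's numbers only; it does not attempt to assign the per-component timings, which the transcript does not preserve.

```python
# Processors per component, as listed on slide 15.
allocation = {"ocn": 64, "atm": 48, "ice": 16, "lnd": 8, "cpl": 16}

total = sum(allocation.values())
print(f"total processors: {total}")

# Share of the partition used by each component.
shares = {name: count / total for name, count in allocation.items()}
for name, share in shares.items():
    print(f"{name}: {allocation[name]:3d} processors ({share:.1%})")
```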

  16. Challenges in the Environment (1)
     • Machines often not well balanced
       • chip speed
       • interconnect
       • memory access
       • cache
       • I/O
       • vector capabilities
     • Each machine is “balanced” differently
     • Optimum coding strategy often depends largely on platform
     • Need to develop “flexible” software strategies

  17. RISC vs Vector
     • Data layout; index order, data structure layout
     • Floating-point operation count (if) versus pipelining (masking)
     • Loop ordering and loop structure
     • Vectorization impacts parallelization
     • Memory access, cache blocking, array layouts, array usage
     • Bottom Line (In My Opinion):
       • Truly effective cache reuse is very hard to achieve on real codes
       • Sustained performance on some RISC machines is disappointing
       • Poor vectorization costs an order of magnitude in performance on vector machines
       • We are now (re-)vectorizing and expect to pay little or no performance penalty on RISC machines
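The "if versus masking" trade-off on slide 17 can be illustrated with two loop shapes that produce the same result. This is a sketch in Python for readability only: Python does not vectorize these loops, CCSM itself is mostly Fortran 90, and the function and argument names are hypothetical.

```python
def flux_with_branch(values, is_ocean, coeff):
    """Branching form: arithmetic is done only on ocean points.

    Fewer floating-point operations, but the branch inside the loop
    can prevent a vectorizing compiler from pipelining the loop.
    """
    out = [0.0] * len(values)
    for i, v in enumerate(values):
        if is_ocean[i]:
            out[i] = coeff * v
    return out


def flux_with_mask(values, mask, coeff):
    """Masked form: every point does the same arithmetic, and a
    0.0/1.0 mask zeroes the non-ocean points.

    More floating-point operations, but no branch in the loop body.
    """
    return [m * coeff * v for v, m in zip(values, mask)]


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0]
    print(flux_with_branch(values, [True, False, True, False], 0.5))
    print(flux_with_mask(values, [1.0, 0.0, 1.0, 0.0], 0.5))
```

Both calls print `[0.5, 0.0, 1.5, 0.0]`; the difference is in how much work is done per point and whether the loop body is uniform.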

  18. Challenges in the Environment (2)
     • Startup and control of multiple executables
     • Compilers and libraries
     • Tools
       • Debuggers inadequate; multiple executables and MPI/OpenMP parallel models
       • Timing and profiling tools generally inadequate
         • IBM HPM getting better
         • Jumpshot works well
         • Cray performance tools look promising
       • Have avoided instrumenting code (risk, robust, #if)
       • Use print statements and calls to system clock

  19. Summary
     • Science top priority, large community project, regular model releases
     • SE improvements continuous, cpl6 is a success
     • Machines change rapidly and are highly variable in architecture
     • Component scaling and CCSM load balance are acceptable
     • (Re-)vectorization
     • Tools and machine software can present significant challenges

  20. Future
     • Increased coupling flexibility
       • Single executable
       • Mixed concurrent/serial design
     • Continue to work on scalar and parallel performance in all models
     • Take advantage of libraries/collaborations for performance portability software, more layering, leverage external efforts
       • NASA/ESMF
       • DOE/SciDAC
       • University Community
       • Others
     • IBM is still an important production platform for CCSM
     • CCSM is moving onto vector platforms and Linux clusters; production capability on these platforms is still to be determined

  21. THE END
