PetaApps: Update on software engineering and performance • J. Dennis, M. Vertenstein, N. Hearn
Code Base Update • Trunk+ means ccsm4 release code + IE mods • scripts – trunk+ (just in) • fixes build problem inherent in alpha38+ • cice – trunk+ • has weighted space filling curves and restarts on tripole grid working • has OpenMP threading capability • has PIO for history and restarts (netcdf) • has multi-frequency history capability (1 file per day)
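The "weighted space filling curves" item can be illustrated with a small sketch. This is not CICE's implementation: it only assumes that blocks have already been ordered along a space-filling curve and that each block carries a work estimate, then cuts the ordered list into contiguous pieces of roughly equal total weight. The function name `partition_by_weight` and the example weights are hypothetical.

```python
def partition_by_weight(weights, nparts):
    """Assign curve-ordered blocks to nparts contiguous parts of similar weight.

    weights: per-block work estimates, already ordered along the curve
             (the ordering itself is assumed, not computed here).
    Returns a list giving the part index that owns each block.
    """
    total = float(sum(weights))
    target = total / nparts          # ideal weight per part
    owner = []
    part = 0
    acc = 0.0                        # weight of all blocks placed so far
    for w in weights:
        # Place a block by the midpoint of its weight interval; move on to
        # the next part once that midpoint passes the current part's share.
        # A very heavy block can leave a part empty in this simple sketch.
        while part < nparts - 1 and acc + w / 2.0 > target * (part + 1):
            part += 1
        owner.append(part)
        acc += w
    return owner


# Hypothetical weights: one expensive block followed by three cheap ones.
print(partition_by_weight([3, 1, 1, 1], 2))
```

With uniform weights this reduces to an even split along the curve; with uneven weights the cut points shift so that expensive blocks share a part with fewer neighbours.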
Code Base Update (cont'd) • pop - alpha38+ • has fix to tripole grid problem and restarts are working • has multi-frequency history capability (1 file per day) • TO DO: migrate time series capability from trunk onto alpha38+ • TO DO: migrate PIO capability from trunk onto alpha38+ • TO DO: OpenMP threading capability is not functional (ORNL working on this)
Code Base Update (cont'd) • cam - alpha38+ • TO DO: migrate cam to cam trunk; will then get PIO (almost done by Nathan) • clm - alpha38+ • drv - alpha38+ • interactive ensembles for atm functional • TO DO: interactive ensembles for ice in progress • TO DO: migrate driver to the head of the trunk, where interactive ocean ensembles have been implemented
Interactive Ensemble Runs Update • TO DO: Finish validation of 2 degree atm/ 1 degree ocean interactive ensembles • POP convergence problem at year 150 for low-res IE • - Reduce pop time step • Problem with branch/hybrid start for IE from HRC03 • Demonstrated functionality with a 10 member atm ensemble for high-res • Execute high-res interactive ensemble run
Experiences on Kraken • Somewhat behind on cycle usage • Highly variable disk I/O performance (~18x) • Using little-endian binary writes avoids performing 4K writes to the file system • Job performance dependent on node mapping • Some jobs are ~20% slower [excludes I/O]
Job Placement of CCSM within the Torus [figure] • White = Ice only • Blue = Ocean • Green = Land • Red = Atmosphere & Ice • Courtesy of Nick Jones
Experiences on Kraken (cont'd) • Friendly User access • Invaluable for development effort • Now can run < 1GB per core • Multi-frequency support in CICE, POP • Hex-core improves CCSM performance
Kraken Upgrade • Started August 1st October 5th • OS upgrade • Significant increase in job failures [1/3 of all jobs failed] • Subset of nodes upgraded to Hex-core • Queue wait became excessive • Friendly user access
Kraken Upgrade (cont'd) • Entire system down for upgrade • Access to Athena • Friendly user access • What changed? • CPU: • quad-core to hex-core [12 cores per node] • Improved memory controller • Memory: • All nodes to 16 GB per node (1.3 GB per core)
Simulation cost [HRC03] • CCSM(1,1,1,1) @ f0.5_tx0.1v2 on 5848 cores • Monthly output [Historical perspective] • First time [ATLAS] 140K per year [0.8 SYPD] early 2008 • NERSC [XT4] 100K per year [1.3 SYPD] fall 2008 • Budgeted [XT4] 89K per year [1.6 SYPD] early 2009 • Actual [XT5] 81K per year [1.8 SYPD] summer 2009 • Measured [XT5] 65K per year [2.1 SYPD] fall 2009 • upgraded Hex-core system • Small user group • Monthly + Daily output • Measured: 91K per year [1.6 SYPD] • Observations • Time to complete additional 100 years [61 days wall-clock]
Simulation cost (cont'd) • CCSM(10,1,1,1) @ f0.5_tx0.1v2 on 7434 cores • Monthly + Daily output • Budgeted: 234K per year • Measured: 120K per year [1.5 SYPD] • On Cray XT4 • Observations • Significantly cheaper than budgeted!! • Implied start time: mid January 2010 [41 days wall-clock]
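The cost and throughput figures on these slides appear to be linked by simple arithmetic, sketched below. It assumes "K per year" means thousands of core-hours per simulated year and that SYPD is simulated years per wall-clock day; neither definition is stated on the slides, and the function names are illustrative only.

```python
def core_hours_per_sim_year(cores, sypd):
    """Core-hours charged per simulated year.

    One wall-clock day on `cores` cores costs cores * 24 core-hours and
    advances the model by `sypd` simulated years.
    """
    return cores * 24.0 / sypd


def wallclock_days(sim_years, sypd):
    """Wall-clock days needed to integrate `sim_years` at a given SYPD."""
    return sim_years / sypd


# HRC03 configuration: 5848 cores at 2.1 SYPD (slide quotes 65K per year).
print(round(core_hours_per_sim_year(5848, 2.1)))
# ATM-IE configuration: 7434 cores at 1.5 SYPD (slide quotes 120K per year).
print(round(core_hours_per_sim_year(7434, 1.5)))
# 100 additional years at 1.6 SYPD (slide quotes 61 days wall-clock).
print(wallclock_days(100, 1.6))
```

Under these assumptions the estimates land within a few percent of the quoted values, which suggests the slide figures are rounded or include overheads not modeled here.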
ATM-IE performance on 7434 cores on Cray XT4 [figure] • ATM on 480 cores per ensemble (10 members) • 1.5 SYPD • 120K per year • Problem in CPL7 currently limits parallelism to 2000
Simulation cost (cont'd) • CCSM(10,1,10,1) @ f0.5_tx0.1v2 on 6000 cores • ICE-IE is still being tested/developed • Monthly + Daily output • Budgeted: 234K per year [0.8 SYPD] • Observations • Implied start time: December 1st, 2009 [79 days wall-clock]
Resource requirements: TRAC2 • Ice IE experiment moved to second year