From Athena to Minerva: A Brief Overview
Ben Cash, Minerva Project Team
Minerva Workshop, GMU/COLA, September 16, 2013
Athena Background • World Modeling Summit (WMS; May 2008) • Summit calls for a revolution in climate modeling to more rapidly improve climate model resolution, accuracy, and reliability • Recommends petascale supercomputers dedicated to climate modeling • Athena supercomputer • The U.S. National Science Foundation responds, offering dedicated use of the retiring Athena supercomputer over a six-month period in 2009-2010 • An international collaboration is formed among groups in the U.S., Japan, and the U.K. to use Athena to take up the challenge
Project Athena • Dedicated supercomputer • Athena was a Cray XT-4 with 18,048 computational cores • Replaced by new Cray XT-5, Kraken, with 99,072 cores (since increased) • #21 on June 2009 Top 500 list • 6 months, 24/7, 99.3% utilization • Over 1 PB data generated • Large international collaboration • Over 30 people • 6 groups • 3 continents • State-of-the-art global AGCMs • NICAM (JAMSTEC/U. Tokyo): Nonhydrostatic Icosahedral Atmospheric Model • IFS (ECMWF): Integrated Forecast System • Highest possible spatial resolution
Athena Science Goals • Hypothesis: Increasing climate model resolution to accurately resolve mesoscale phenomena in the atmosphere (and ocean and land surface) can dramatically improve the fidelity of the models in simulating climate – mean, variances, covariances, and extreme events. • Hypothesis: Simulating the effect of increasing greenhouse gases on regional aspects of climate, especially extremes, may, for some regions, depend critically on the spatial resolution of the climate model. • Hypothesis: Explicitly resolving important processes, such as clouds in the atmosphere (and eddies in the ocean and landscape features on the continental surface), without parameterization, can improve the fidelity of the models, especially in describing the regional structure of weather and climate.
Qualitative Analysis: 2009 NICAM Precipitation and Cloudiness, May 21 – August 31
Athena Lessons Learned • Dedicated usage of a relatively big supercomputer greatly enhances productivity • Dealing with only a few users and their requirements allows for more efficient utilization of resources • Challenge: Dedicated simulation projects like Project Athena can generate enormous amounts of data to be archived, analyzed and managed. NICS (and TeraGrid) do not currently have enough storage capacity. Data management is a big challenge. • Preparation time: At least 2 to 3 weeks were needed before the beginning of dedicated runs to test and optimize the codes and to plan strategies for optimal use of the system. • Communication throughout the project was essential (weekly telecons, email lists, personal calls, …)
Athena Limitations • Athena was a tremendous success, generating a tremendous amount of data and a large number of papers for a six-month project. • BUT… • Limited number of realizations • Athena runs generally consisted of a single realization • No way to assess robustness of results • Uncoupled models • Multiple, dissimilar models • Resources were split between IFS and NICAM • Differences in performance meant very different experiments were performed, making it difficult to directly compare results • Storage limitations and post-processing demands limited what could be saved for each model
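The "single realization" limitation is worth making concrete. A minimal sketch, not taken from the Athena or Minerva codebases, of why multiple realizations matter: with only one member there is no spread to measure, while with an ensemble the member-to-member spread gives a simple signal-to-noise check on whether a feature is robust. The grid size, member count, and data here are synthetic placeholders.

```python
# Illustrative only: synthetic stand-in for seasonal-mean precipitation
# anomalies from N ensemble members on a small lat/lon grid.
import numpy as np

rng = np.random.default_rng(0)

n_members, nlat, nlon = 7, 24, 48
anomalies = rng.normal(loc=0.3, scale=1.0, size=(n_members, nlat, nlon))

ens_mean = anomalies.mean(axis=0)            # estimate of the forced signal
ens_spread = anomalies.std(axis=0, ddof=1)   # member-to-member noise

# Where the mean exceeds the standard error of the mean by a wide margin,
# the result is unlikely to be an artifact of a single realization.
signal_to_noise = np.abs(ens_mean) / (ens_spread / np.sqrt(n_members))
robust_fraction = (signal_to_noise > 2.0).mean()
print(f"Fraction of grid points with a robust signal: {robust_fraction:.2f}")
```

With a single realization (n_members = 1) the spread above is undefined, which is exactly the situation the Athena runs were in.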
Minerva Background • NCAR Yellowstone • In 2012, the NCAR-Wyoming Supercomputing Center (NWSC) debuted Yellowstone, the successor to Bluefire, their previous production platform • IBM iDataPlex, 72,280 cores, 1.5 petaflops peak performance • #17 on June 2013 Top 500 list • 10.7 PB disk capacity – a vast increase over the capacity available during Athena • High-capacity HPSS data archive • Dedicated high-memory analysis clusters (Geyser and Caldera) • Accelerated Scientific Discovery (ASD) program • Recognizing that many groups would not be ready to take advantage of the new architecture, NCAR accepted a small number of proposals for early access to Yellowstone • 3 months of near-dedicated access before being opened to the general user community • Opportunity to continue the successful Athena collaboration between COLA and ECMWF, and to address limitations in the Athena experiments
Minerva Timeline • March 2012 – Proposal finalized and submitted • 31 million core hours requested • April 2012 – Proposal accepted • 21 million core hours approved • Anticipated date of production start: July 21 • Code testing and benchmarking on Janus begins • October 5, 2012 • First login to Yellowstone – bcash reportedly user 1 • October – November 23, 2012 • Jobs are plagued by massive system instabilities and a conflict between the code and the Intel compiler
Minerva Timeline continued • November 24 – December 1, 2012 • Code conflict resolved; low core-count jobs avoid the worst of the system instability • Minerva jobs occupy 61,000 cores (!) • Peter Towers estimates Minerva easily sets the record for "Most IFS FLOPs in a 24-hour period" • Jobs rapidly overrun the initial 250 TB disk allocation, triggering a request for additional resources • This becomes a Minerva project theme • Due to system instability, user accounts are not charged for jobs at this time • Roughly 7 million free core hours as a result: 28 million total • 800+ TB generated
Minerva Catalog: Base Experiments
Minerva Catalog: Extended Experiments (** to be completed)
Qualitative Analysis: 2010 T1279 Precipitation, May – November
Minerva Lessons Learned • Dedicated usage of a relatively big supercomputer greatly enhances productivity • Experience with the early usage period demonstrates that tremendous progress can be made with dedicated access • Dealing with only a few users allows for more efficient utilization • Noticeable decrease in efficiency once scheduling of multiple jobs of varying sizes was turned over to the batch scheduler • NCAR resources were initially overwhelmed by the challenges of the new machine and the individual problems that arose • Focus on a single model allows for in-depth exploration • Data saved at much higher frequency • Multiple ensemble members, increased vertical levels, etc.
Dedicated simulation projects like Athena and Minerva generate enormous amounts of data to be archived, analyzed and managed. Data management is a big challenge. • Aside from machine instability, data management and post-processing were the sole causes of halts in production • Even on a system designed with the lessons of Athena in mind, production capabilities overwhelm storage and processing • Post-processing and storage must be incorporated into the production stream • 'Rapid burn' projects such as Athena and Minerva are particularly prone to overwhelming storage resources
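One way to read "post-processing and storage must be incorporated into the production stream": each model segment is immediately followed by a reduce-and-archive step, so raw history files never pile up on scratch between segments. The sketch below is a hypothetical workflow, not the actual Minerva scripts; the paths, the run_model.sh driver, and the archive target are placeholders, and the HPSS transfer (hsi put) would depend on the site setup.

```python
# Hypothetical production loop: archive each segment's output before the
# next segment starts, so scratch usage stays bounded.
import subprocess
from pathlib import Path

SCRATCH = Path("/scratch/minerva/run01")   # placeholder run directory
ARCHIVE = "/archive/minerva/run01"         # placeholder HPSS path

def run_segment(segment: int) -> None:
    """Advance the model by one segment (placeholder driver script)."""
    subprocess.run(["./run_model.sh", f"--segment={segment}"], check=True)

def post_process(segment: int) -> None:
    """Compress the segment's history files, push them to the archive,
    then free the scratch space they occupied."""
    raw = SCRATCH / f"history_{segment:03d}"
    tarball = SCRATCH / f"history_{segment:03d}.tar.gz"
    subprocess.run(["tar", "czf", str(tarball), str(raw)], check=True)
    subprocess.run(
        ["hsi", "put", str(tarball), ":", f"{ARCHIVE}/{tarball.name}"],
        check=True,
    )
    subprocess.run(["rm", "-rf", str(raw), str(tarball)], check=True)

for segment in range(1, 13):
    run_segment(segment)
    post_process(segment)   # archiving is part of the production stream
```

The design point is simply that the archive step is serialized into the run loop rather than deferred to the end of the campaign, which is when 'rapid burn' projects typically discover their disk allocation is gone.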
Beyond Minerva: A New Pantheon • Despite advances beyond Athena, more work to be done • Focus of Tuesday discussion • Fill in matrix of experiments • Further increases in ocean and atmospheric resolution • Sensitivity tests (aerosols, greenhouse gases) • ??