
Presentation Transcript


  1. From Athena to Minerva: COLA’s Experience in the NCAR Advanced Scientific Discovery Program. SC13, 19-20 November 2013. Ben Cash, COLA. Animation courtesy of CIMSS.

  2. Why does climate research need HPC and Big Data?
  • Societal demand for information about weather-in-climate and climate impacts on weather on regional scales
  • Seamless days-to-decades prediction & unified weather/climate modeling
  • Multi-model ensembles and Earth system prediction
  • Requirements for data assimilation

  3. Balancing Demands on Resources
  [Diagram: Data Assimilation, Complexity, Resolution, and Duration and/or Ensemble Size all draw on the same pool of Data and HPC Resources]

  4. COLA HPC & Big Data Projects
  • Project Athena: An International, Dedicated High-End Computing Project to Revolutionize Climate Modeling (dedicated XT4 at NICS). Collaborating groups: COLA, ECMWF, JAMSTEC, NICS, Cray
  • Project Minerva: Toward Seamless, High-Resolution Prediction at Intra-seasonal and Longer Time Scales (dedicated Advanced Scientific Discovery resources on NCAR Yellowstone). Collaborating groups: COLA, ECMWF, U. Oxford, NCAR

  5. NICS Resources for Project Athena
  • The Cray XT4 – Athena – the first NICS machine in 2008
    • 4512 nodes: AMD 2.3 GHz quad-core CPUs + 4 GB RAM
    • 18,048 cores + 17.6 TB aggregate memory
    • 165 TFLOPS peak performance
  • Dedicated to this project during October 2009 – March 2010 → 72 million core-hours!
  • Other resources made available to the project:
    • 85 TB Lustre file system
    • 258 TB auxiliary Lustre file system (called Nakji)
    • Verne: 16-core, 128-GB system for data analysis during the production phase (2009-2010)
    • Nautilus: SGI UV with 1024 Nehalem EX cores, 8 GPUs, 4 TB memory, and 960 TB GPFS disk for data analysis in 2010-11
  Many thanks to NICS for resources and sustained support!
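
  As a rough sanity check on the 72-million-core-hour figure, a back-of-envelope sketch is given below; the ~182-day length of the dedicated period is an assumption for illustration, and only the core count and the 72 M total come from the slide.

```python
# Back-of-envelope check of the dedicated-access total (assumes ~182 days of
# wall-clock time for October 2009 - March 2010; only the 72 M figure is quoted).
cores = 18_048
wall_hours = 182 * 24                  # dedicated period in hours
theoretical_max = cores * wall_hours   # ~78.8 million core-hours if every core ran nonstop
delivered = 72e6                       # core-hours quoted on the slide
print(f"max ~{theoretical_max / 1e6:.1f} M core-hours; "
      f"delivered ~{delivered / theoretical_max:.0%} of the theoretical maximum")
```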

  6. Regional Climate Change – Beyond CMIP3 Models’ Ability?

  7. Europe Growing Season (Apr-Oct) Precipitation Change: 20th C to 21st C
  [Maps at two resolutions: T159 (125-km) and T1279 (16-km)]
  “Time-slice” runs of the ECMWF IFS global atmospheric model with observed SST for the 20th century and CMIP3 projections of SST for the 21st century at two different model resolutions. The continental-scale pattern of precipitation change in April – October (growing season) associated with global warming is similar, but the regional details are quite different, particularly in southern Europe.

  8. Future Change in Extreme Summer Drought, Late 20th C to Late 21st C
  4x probability of extreme summer drought in the Great Plains, Florida, the Yucatán, and parts of Eurasia.
  10th-percentile drought: the number of years out of 47 in a simulation of future climate (2071-2117) for which the June-August mean rainfall was less than that of the 5th driest year of 47 in a simulation of current climate (1961-2007).
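
  The drought metric described above is straightforward to compute. The sketch below is an illustrative, hypothetical implementation; the function name, array names and shapes are assumptions, not taken from the slides.

```python
import numpy as np

def extreme_drought_count(jja_current, jja_future):
    """Count future years drier than the 5th driest year of the current climate.

    jja_current, jja_future: hypothetical arrays of shape (47, nlat, nlon) holding
    June-August mean rainfall for the 1961-2007 and 2071-2117 simulations.
    Returns an (nlat, nlon) array of counts: ~5 would mean unchanged drought risk
    at the ~10th percentile, while ~20 corresponds to the ~4x increase on the slide.
    """
    # The 5th driest of 47 years is roughly the 10th percentile of the current climate.
    threshold = np.sort(jja_current, axis=0)[4]    # shape (nlat, nlon)
    return np.sum(jja_future < threshold, axis=0)
```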

  9. Clouds and Precipitation: Summer 2009 (NICAM 7km)

  10. Athena Limitations
  • Athena was a tremendous success, generating a tremendous amount of data and a large number of papers for a six-month project.
  • BUT…
  • Limited number of realizations
    • Athena runs generally consisted of a single realization
    • No way to assess the robustness of results
  • Uncoupled models
  • Multiple, dissimilar models
    • Resources were split between IFS and NICAM
    • Differences in performance meant very different experiments were performed – difficult to directly compare results
  • Storage limitations and post-processing demands limited what could be saved for each model

  11. Project Minerva
  • Explore the impact of increased atmospheric resolution on model fidelity and prediction skill in a coupled, seamless framework by using a state-of-the-art coupled operational long-range prediction system to systematically evaluate the prediction skill and reliability of a robust set of hindcast ensembles at low, medium and high atmospheric resolutions
  • NCAR Advanced Scientific Discovery Program to inaugurate Yellowstone
    • Allocated 21 M core-hours on Yellowstone
    • Used ~28 M core-hours
  Many thanks to NCAR for resources & sustained support!

  12. Project Minerva: Background
  • NCAR Yellowstone
    • In 2012, the NCAR-Wyoming Supercomputing Center (NWSC) debuted Yellowstone, the successor to Bluefire
    • IBM iDataPlex, 72,280 cores, 1.5 petaflops peak performance
    • #17 on the June 2013 Top500 list
    • 10.7 PB disk capacity
    • High-capacity HPSS data archive
    • Dedicated large-memory and floating-point accelerator clusters (Geyser and Caldera)
  • Accelerated Scientific Discovery (ASD) program
    • NCAR accepted a small number of proposals for early access to Yellowstone, as it has done in the past with new hardware installs
    • 3 months of near-dedicated access before the system was opened to the general user community
  • Opportunity
    • Continue the successful Athena collaboration between COLA and ECMWF, and address limitations in the Athena experiments

  13. Project Minerva: Timeline
  • March 2012 – ASD proposal submitted
    • 31 million core-hours requested
  • April 2012 – Proposal approved
    • 21 million core-hours approved
  • October 5, 2012
    • First login to Yellowstone – bcash = user #1 (Ben Cash)
  • November 21 – December 1, 2012
    • Minerva production code finalized
    • Yellowstone system instability due to “cable cancer”
    • Minerva’s low-core-count jobs avoid the problem; user accounts were not charged for jobs at this time → Minerva benefits from ~7 million free core-hours
    • Minerva jobs occupy as many as 61,000 cores (!)
    • Minerva sets record: “Most IFS FLOPs in 24 hours”
  • December 1 – project end
    • Failure rate falls to 1%, then to 0.5%; production computing tailed off in March 2013
    • Data management becomes by far the greatest challenge
  • Project Minerva consumption: ~28 million core-hours total
  • 800+ TB generated
  Many thanks to NCAR for resources & sustained support!

  14. Minerva Catalog / Minerva Catalog: Extended Experiments
  [Tables of the Minerva experiment catalog; contents not transcribed]

  15. Project Minerva: Selected Results
  • Simulated precipitation
  • Tropical cyclones
  • SST – ENSO

  16. Precipitation: Summer 2010 (IFS 16km)

  17. Minerva: Coupled Prediction of Tropical Cyclones
  11-12 June 2005 hurricane off the west coast of Mexico: precipitation in mm/day every 3 hours (T1279 coupled forecast initialized on 1 May 2005).
  The predicted maximum rainfall rate reaches 725 mm/day (≈30 mm/hr). Based on TRMM global TC rainfall observations (1998-2000), the frequency of rainfall rates exceeding 30 mm/hr is roughly 1%.
  Courtesy Julia Manganello, COLA
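
  The two numbers quoted above reduce to a unit conversion and an exceedance frequency; the sketch below is a hypothetical illustration of how such figures are obtained (the helper function and its inputs are assumptions, not from the slides or TRMM tooling).

```python
import numpy as np

# Unit conversion behind the quoted peak rate: 725 mm/day is ~30 mm/hr.
peak_mm_per_hr = 725 / 24                 # ~30.2 mm/hr

def exceedance_frequency(rates_mm_per_hr, threshold=30.0):
    """Fraction of rain-rate samples exceeding `threshold` mm/hr (hypothetical input)."""
    rates = np.asarray(rates_mm_per_hr)
    return float(np.mean(rates > threshold))
```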

  18. Minerva vs. Athena – TC Frequency (NH; JJASON; T1279)
  9-year mean (2000-2008) tropical cyclone counts: OBS (IBTrACS) 49.9, Athena 59.1, Minerva 48.9 (all members).
  Courtesy Julia Manganello, COLA

  19. [Figure slide: panels labeled Jul, Sep, Nov; no caption transcribed]

  20. Project Minerva: Lessons Learned
  • More evidence that dedicated usage of a relatively big supercomputer greatly enhances productivity
    • Experience with the ASD period demonstrates that tremendous progress can be made with dedicated access
    • Dedicated computing campaigns provide demonstrably more efficient utilization
    • Noticeable decrease in efficiency once scheduling multiple jobs of multiple sizes was turned over to a scheduler
  • In-depth exploration
    • Data saved at much higher frequency
    • Multiple ensemble members, increased vertical levels, etc.

  21. Project Minerva: Lessons Learned
  • Dedicated simulation projects like Athena and Minerva generate enormous amounts of data to be archived, analyzed and managed. Data management is a big challenge.
  • Other than machine instability, data management and post-processing were solely responsible for halts in production.

  22. Data Volumes
  • Project Athena: total data volume 1.2 PB (~500 TB unique)*
    • Spinning disk: 40 TB at COLA; 0 TB at NICS (was 340 TB)
    • * no home after April 2014
  • Project Minerva: total data volume 1.0 PB (~800 TB unique)
    • Spinning disk: 100 TB at COLA; 500 TB at NCAR (for now)
  • That much data breaks everything: H/W, systems management policies, networks, apps S/W, tools, and shared archive space
  • NB: Generating 800 TB using 28 M core-hours took ~3 months; this would take about a week using a comparable fraction of a system with 1M cores!
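
  The "about a week" claim works out as follows in a back-of-envelope sketch; the 90-day production window and the Yellowstone core count are assumptions used for illustration, and only the 28 M core-hours and 800 TB figures come from the slide.

```python
# Back-of-envelope version of the slide's scaling claim (the 90-day window and the
# 72,280-core Yellowstone size are assumptions; only the 28 M core-hours is quoted).
core_hours   = 28e6
days_used    = 90
avg_cores    = core_hours / (days_used * 24)     # ~13,000 cores busy on average
fraction     = avg_cores / 72_280                # ~18% of a Yellowstone-class machine
cores_future = fraction * 1_000_000              # same fraction of a 1M-core system
days_future  = core_hours / (cores_future * 24)  # ~6.5 days -> "about a week"
print(f"~{avg_cores:,.0f} cores on average; same 800 TB output in ~{days_future:.1f} days")
```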

  23. Challenges and Tensions
  • Making effective use of large allocations – takes a village
  • Exaflood of data
  • Resolution vs. parameterization
  • Sampling (e.g. extreme events)
  TENSIONS
  “Having more data won’t substitute for thinking hard, recognizing anomalies, and exploring deep truths.” – Samuel Arbesman, Wash. Post (18 Aug. 2013)

  24. Athena and Minerva: Harbingers of the Exaflood
  • Even on a system designed for big projects like these, HPC production capabilities overwhelm storage and processing, a particularly acute problem for ‘rapid burn’ projects such as Athena and Minerva
  • Familiar diagnostics are hard to do at very high resolution
  • Can’t “just recompute” – years of data analysis and mining after the production phase
  • Have we wrung all the “science” out of the data sets, given that we can only keep a small percentage of the total data volume on spinning disk? How can we tell?
  • Must move from ad hoc problem solving → systematic, repeatable workflows (e.g. incorporate post-processing and data management into the production stream) (transform Noah’s Ark → a shipping industry)
  “We need exaflood insurance.” – Jennifer Adams

  25. Any questions?
