RAMADDA for Big Climate Data
Don Murray, NOAA/ESRL/PSD and CU-CIRES
Boulder/Denver Big Data Meetup - June 18, 2014
Outline
• The Problem Space
• The Data Space
• The RAMADDA Solution
• How should we deal with complex calculations?
The Problem Space
• Climate Attribution
  • What caused the 2013 Colorado flood?
  • What is causing the California drought?
  • Has global warming stopped?
• What do the observations say?
• Can climate models give us insight into the statistical nature of these events?
The Data Space
• Observations
  • The National Climatic Data Center (NCDC) collects data from worldwide observing sites
  • Temperature (30-40K stations), precipitation (75K stations), 1901-present, 90K files
  • Problem: different stations have different recording periods and gaps in the record
• Reanalyses
  • Model reconstructions from observations
  • Help fill in the gaps, but are not observations
The Data Space
• Climate model simulations
  • Climate models are used to test the impact of external forcing on the atmosphere (experiments)
    • Greenhouse gases, sea surface temperature, Arctic sea ice
  • Multiple runs using the same inputs with slight perturbations of the initial conditions
    • Ensembles provide useful statistics (mean, variance)
  • Multiple models using the same experiment
    • An ensemble of ensembles
The Data Space
• PSD climate model output
  • Experiments are run over a period of time (e.g., 1979-present, 1880-present)
  • Global models at 0.75 to 1.25 degree resolution
  • 27 levels
  • 55-115K points/parameter/level/time step/ensemble member
  • Problem: different longitude domains (-180 to 180 vs. 0 to 360)
• Models' internal time steps vary (5 minutes to hours)
  • Output data are written for each 6-hour time step (00, 06, 12, 18)
  • Post-processing produces daily and monthly averages
• Output format is netCDF (in an ideal world)
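The per-level point counts above can be sanity-checked with a quick back-of-the-envelope calculation, assuming a regular global lat/lon grid (the grid-spacing values come from the slide; actual model grids, e.g. Gaussian grids, will differ slightly):

```python
# Rough point count for a regular global lat/lon grid.
# Assumes n_lon = 360/res and n_lat = 180/res + 1 (poles included);
# this is an illustration, not the models' actual grid definitions.

def grid_points(res_deg):
    n_lon = round(360 / res_deg)
    n_lat = round(180 / res_deg) + 1
    return n_lon * n_lat

print(grid_points(0.75))  # 115680 - the ~115K upper end of the slide's range
print(grid_points(1.25))  # 41760
```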
The Data Space
• Ensemble sizes range from 10 to 50 members (even larger in other cases)
• Multiple parameters are calculated
  • Temperature, precipitation, wind, humidity, etc.
  • Problem: each model has different variable names and units
• Each experiment can take weeks to months to complete on a supercomputer
The Data Space
• At NOAA/ESRL/PSD we run multiple models with multiple ensembles for multiple experiments
• We need to provide web-based access and analysis capabilities
The Data Problem
• 1 model, 20 ensemble members, 34 years: ~10 TB of data, 14K files, multiple parameters/file
• Post-processing
  • Separate by parameter
  • Compute daily/monthly averages, merge files
  • Convert to common names/units
• End result for 1 model/experiment
  • Monthly data: ~0.5 TB, 700 files
  • Daily data: ~7.5 TB, 13.5K files
• Times 2 models x 6 experiments
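The post-processing steps above (averaging, converting to common names) can be sketched with numpy on synthetic data. The variable names in the mapping are hypothetical stand-ins, not PSD's actual cross-model conventions:

```python
import numpy as np

# Hypothetical mapping from model-specific to common variable names
# (the real cross-model tables are far more extensive).
COMMON_NAMES = {"t2m": "air_temperature", "prate": "precipitation_rate"}

def monthly_mean(daily, days_per_month):
    """Average a (time, points) array of daily values into monthly means."""
    out, start = [], 0
    for n in days_per_month:
        out.append(daily[start:start + n].mean(axis=0))
        start += n
    return np.stack(out)

# Synthetic data: 59 days (Jan + Feb) at 4 grid points
daily = np.arange(59 * 4, dtype=float).reshape(59, 4)
monthly = monthly_mean(daily, [31, 28])
print(monthly.shape)        # (2, 4)
print(COMMON_NAMES["t2m"])  # air_temperature
```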
The RAMADDA Solution
• NOAA's Facility for Climate Assessments (FACTS)
  • Web-based access to climate model runs and reanalyses
  • Provides on-line analysis
  • Download of raw data
• PSD Climate Data Repository
  • Access to other data holdings
  • Publishing platform for visualization bundles, images and climate assessments
The RAMADDA Solution
• Ingest the metadata
  • Use a harvester for automatic metadata ingestion
  • For some datasets, use the Entry XML specification
• Organize the data
  • Use collections to partition the data (monthly vs. daily)
  • Database searches make finding the data easy
• Data processing framework
  • Loosely based on the Open Geospatial Consortium (OGC) Web Processing Service (WPS)
  • Fairly simple calculations: areal/temporal subsetting and averaging
  • Use community-accepted tools for analysis and plotting (Climate Data Operators, NCAR Command Language)
  • Other tools could be plugged in (e.g., R)
  • Currently synchronous; looking at batch processing
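The areal-subsetting step, together with the -180..180 vs. 0..360 domain problem mentioned earlier, can be sketched in numpy. This is a toy stand-in for what tools like CDO do under the hood, not RAMADDA's actual processing code:

```python
import numpy as np

def to_minus180_180(lons):
    """Normalize longitudes from a 0..360 convention to -180..180."""
    return (np.asarray(lons) + 180.0) % 360.0 - 180.0

def areal_subset(data, lats, lons, lat_range, lon_range):
    """Select a lat/lon box from a (lat, lon) field, normalizing domains."""
    lons = to_minus180_180(lons)
    lat_mask = (lats >= lat_range[0]) & (lats <= lat_range[1])
    lon_mask = (lons >= lon_range[0]) & (lons <= lon_range[1])
    return data[np.ix_(lat_mask, lon_mask)]

lats = np.array([30.0, 40.0, 50.0])
lons = np.array([240.0, 255.0, 270.0])  # stored in the 0..360 convention
field = np.arange(9.0).reshape(3, 3)

# A Colorado-ish box: 35-45N, 110W-100W
box = areal_subset(field, lats, lons, (35, 45), (-110, -100))
print(box)  # the single point at lat 40, lon 255 (= -105)
```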
The RAMADDA Solution
• Demo/Examples
Complex calculations
• Question: how are extremes behaving during the hiatus?
  • Look at the 27 standard extreme indices (e.g., frost-free days, number of days the maximum temperature exceeds the 90th percentile, etc.)
• Finding the 99th-percentile precipitation in the ensemble space requires reading all members for all times at all points
• 5 models x >100 ensemble members x multiple experiments = Big Data
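The ensemble-space percentile described above amounts to pooling the member and time axes into one sample at every grid point, then reducing. A minimal numpy sketch on synthetic data (the sizes here are toy values; the real problem has >100 members, decades of time steps, and ~10^5 points per level):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic precipitation with shape (members, times, points)
precip = rng.gamma(shape=2.0, scale=1.5, size=(20, 365, 1000))

# Pool the member and time axes, then take the 99th percentile per point
pooled = precip.reshape(-1, precip.shape[-1])  # (members * times, points)
p99 = np.percentile(pooled, 99, axis=0)

print(p99.shape)  # (1000,) - one threshold per grid point
```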
Complex calculations
• Tools used now: FORTRAN, R, Python
• The data must be treated as a cohesive unit for statistical calculations, but may be spread across many files
• Problems
  • Getting all the data into memory
  • System reliability
• Could standard Big Data processes be applied?
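One way the memory problem above can be attacked even without a Big Data stack: percentiles at different grid points are independent, so the spatial dimension can be processed in chunks, reading all members and times for only one slab of points at a time. A hedged sketch, with in-memory arrays standing in for the per-member netCDF files:

```python
import numpy as np

rng = np.random.default_rng(1)
n_members, n_times, n_points = 10, 100, 5000
# Stand-ins for per-member files on disk (each would be read lazily in practice)
members = [rng.gamma(2.0, 1.5, size=(n_times, n_points)) for _ in range(n_members)]

def chunked_percentile(members, q, chunk=1000):
    """Per-point percentile over all members/times, computed one spatial
    chunk at a time so only a chunk-sized slab is ever pooled in memory."""
    n_points = members[0].shape[1]
    out = np.empty(n_points)
    for start in range(0, n_points, chunk):
        stop = min(start + chunk, n_points)
        # Pool every member's full time series for this slab of points
        slab = np.concatenate([m[:, start:stop] for m in members], axis=0)
        out[start:stop] = np.percentile(slab, q, axis=0)
    return out

p99 = chunked_percentile(members, 99)
print(p99.shape)  # (5000,)
```

Because the reduction is per-point, the chunked result matches a single all-in-memory computation exactly; the same decomposition is what frameworks like dask or MapReduce would automate.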
Links
• NOAA/ESRL/PSD Climate Data Repository
  • http://www.esrl.noaa.gov/psd/repository
• Facility for Climate Assessments (FACTS)
  • http://www.esrl.noaa.gov/psd/repository/alias/facts
• RAMADDA
  • http://ramadda.org