Earth System Modelling & the NDG Bryan Lawrence (Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others) NCAS/British Atmospheric Data Centre Rutherford Appleton Laboratory, CCLRC
NDG Partners
• As funded, a partnership between:
  • British Atmospheric Data Centre (BADC, PI: Bryan Lawrence)
  • British Oceanographic Data Centre (BODC, Co-I: Roy Lowry)
  • CCLRC e-Science Centre (Co-I: Kerstin Kleese)
  • PCMDI at LLNL in the US (Dean Williams, Bob Drach, Mike Fiorino)
• The project has caught the imagination; extra funding now supports:
  • A number of groups at the NERC Centre for Ecology and Hydrology (CEH: Ecology DataGrid)
  • NERC Earth Observation Data Centre & Plymouth Marine Lab Remote Sensing
• Major collaborators not directly funded will include:
  • ClimatePrediction.net, GODIVA (NERC e-science projects)
  • NCAS/CGAM: the Centre for Global Atmospheric Modelling at the University of Reading (via Lois Steenman-Clark and Katherine Bouton)
• The project will support HIGEM.
Outline
• Motivation:
  • The NDG Goals
• NDG Metadata
• Networks
• Summary
British Atmospheric Data Centre – The Role. Key words: curation and facilitation!
Easily catalogued, but successfully preserved? (Phaistos Disk, 1700 BC.) One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time … And yes, they've both been translated … many times; it's a shame the meanings are different …
NERC Metadata Gateway – SST
• No clean handover from discovery to browse and use!
• Geospatial coordinates forgotten; time reference forgotten. Need to get the entire field(s), and find the correct time!
• And if I want to compare data from different locations?
  • multiple logins
  • multiple formats
  • discovery?
How good is our metadata?
• A priori, would any user know to look in the COAPEC dataset?
• Earth system science means we have to remove these boundaries!
• Detailed file-level metadata isn't visible, so data-mining applications are impossible. NB: dynamic catalogues!
Finding Data – The Goal: a very simple interface, hiding the complex software!
A newer “dataset”: the extreme relevance of this example from Amazon was pointed out by Jon Callahan (LAS project, PMEL)!
PCMDI – Best practice!
• Final references are papers (if you know where to look)!
• Is the information coupled to the datasets? What if I take a dataset home, and another, and another … and then forget which is which?
• Can I ask the question: which datasets used the Semtner sea-ice parameterisation?
Different types of data returned (Wallingford)
• Supporting a very diverse user community: NetCDF is not enough …
Modelling advances: baseline numbers (Don Middleton, NCAR)
• T42 CCSM (current, 280 km): 7.5 GB/yr, 100 years -> 0.75 TB
• T85 CCSM (140 km): 29 GB/yr, 100 years -> 2.9 TB
• T170 CCSM (70 km): 110 GB/yr, 100 years -> 11 TB
Capacity-related Improvements
• Increased turnaround, model development, ensembles of runs: increase by a factor of 10, linear data.
• Current T42 CCSM: 7.5 GB/yr, 100 years -> 0.75 TB × 10 = 7.5 TB
Capability-related Improvements
• Spatial resolution: T42 -> T85 -> T170. Increase by a factor of ~10-20, linear data.
• Temporal resolution: study the diurnal cycle, 3-hourly data. Increase by a factor of ~4, linear data.
(Image: CCM3 at T170, 70 km)
Capability-related Improvements
• Quality: improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice. Increase by another factor of 2-3, data flat.
• Scope: atmospheric chemistry (sulfates, ozone …), biogeochemistry (carbon cycle, ecosystem dynamics), middle atmosphere model … Increase by another factor of 10+, linear data.
Model Improvement Wishlist (Don Middleton, NCAR)
• Grand total: increase compute by a factor of O(1000-10000).
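To see how those individual factors compound, here is a small back-of-envelope sketch in Python. The individual factor ranges come from the preceding slides; combining them into a single product is only an illustration, since not every factor applies to every experiment.

```python
# Back-of-envelope compounding of the improvement factors quoted on the
# preceding slides. The individual factor ranges are from the slides;
# multiplying them all together is only an illustration.

factors = {
    "capacity: ensembles / turnaround":     (10, 10),
    "capability: spatial resolution":       (10, 20),
    "capability: temporal resolution":      (4, 4),
    "capability: improved physics":         (2, 3),
    "capability: added components (scope)": (10, 10),
}

low, high = 1, 1
for name, (lo, hi) in factors.items():
    low *= lo
    high *= hi
    print(f"{name:40s} x{lo}-{hi}")

print(f"raw product: {low:,} to {high:,}")
# A given experiment only picks up a subset of these factors, which is why
# the quoted grand total for compute is O(1000-10000) rather than the full product.
```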
Climate in 2010 – a graphic illustration. Figures from Gary Strand, NCAR, ESG website.
Summary thus far
Contentions:
• The average atmospheric science project spends about 1/3 of its time on data handling (getting, reformatting etc.)!
• The problem for earth system model projects is about to get worse – for everyone, from the initiator, to the archiver, to the analyst, to the contributor, to the improver.
• (Remember the documentation problem is growing exponentially too: new sub-components etc.)
Requirements (1): Information
“Scientists are real people too” – Jon Callahan (LAS project, PMEL)
• Amazon gives good examples for discovery: browse, similar datasets, details, content examples.
• Our domain issues require: dealing with volume, formats, and providing tools.
• Learn from the library and book-handling community!
• All of these require documentation (aka metadata); we need to improve our information handling.
What is metadata? The answer depends on who you are!
• Firstly, information to help one use one’s own data: e.g. calibration data (E), NetCDF metadata (A).
• It is information passed with the data to enable someone else to use it; it describes the data (B & E).
• Metadata can be used to enable software to automatically find (D) and manipulate (A) data.
• Metadata can help one find other people’s data … and then help one obtain and use it (D).
• Metadata can be used to enable the preservation of data for posterity (all of A, B, C and D).
NDG A and B metadata in practice
• Clear separation of function between use and discovery.
• Standards compliant.
• Avoid tie-in to the details of particular fields, data formats or even components.
• Metadata model (B): an “intermediate” schema, supporting multiple discovery formats.
• NDG Data Model (A):
  • provides an abstract semantic model for the structure of data within the NDG,
  • enables the specification of concrete instances for use by NDG Data Services.
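As a purely illustrative sketch of that A/B separation, here is a minimal pair of Python structures: a lightweight discovery (B) record that hands over to a richer usage (A) description. The field names and example values are hypothetical and are not the actual NDG schemas.

```python
# Hypothetical sketch of the A/B split: a lightweight discovery (B) record
# pointing at a richer usage (A) description. Field names are illustrative
# only -- they are not the real NDG schemas.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DiscoveryRecord:              # "B" metadata: enough to find the data
    identifier: str
    title: str
    abstract: str
    bounding_box: tuple             # (west, south, east, north)
    time_range: tuple               # (start, end) as ISO 8601 strings
    parameters: List[str] = field(default_factory=list)
    usage_metadata_url: str = ""    # hand-over point to the "A" description

@dataclass
class UsageDescription:             # "A" metadata: enough to use the data
    identifier: str
    axes: List[str]                 # e.g. ["time", "latitude", "longitude"]
    variables: dict                 # name -> {"units": ..., "long_name": ...}
    file_format: str                # hidden from the user by the data services
    access_service: str             # e.g. an LAS/OPeNDAP endpoint

# Illustrative instance only (values are made up).
era40_b = DiscoveryRecord(
    identifier="example:era40",
    title="ERA-40 reanalysis (illustrative record)",
    abstract="Illustrative record only.",
    bounding_box=(-180, -90, 180, 90),
    time_range=("1957-09-01", "2002-08-31"),
    parameters=["sea_surface_temperature"],
    usage_metadata_url="https://example.invalid/era40/usage.xml",
)
```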
(B) Metadata Model: an NDG Intermediate Schema, Conceptual Overview
NDG Discovery Service Element: traditional and Grid Service (GT3) interfaces
NDG Prototype
• Layout not important (yet!); it’s what’s under the hood that counts …
• The data is NOT in NetCDF; the original data is available …
• The search covered data that could have been harvested …
• The architecture works!
NDG Metadata Status
• We have built a SIMPLE prototype based primarily on our data model, and used our structures to find, locate, reformat and deliver data typical of BODC and BADC observational data. (This is a first.)
• We are about to re-engineer. Key issues to address will be:
  • vocabularies, and
  • ontologies.
• Developing a Model Attribute Language (with CGAM, PRISM, PCMDI and others).
• Populating our metadata: a boring and laborious job!
Metadata Origins
(Diagram: an individual scientist within a research group and its database, research groups sharing resources such as a satellite and a supercomputer, and the wider internet.)
• Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …
• To be well integrated, the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)
• At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.
Requirements (2)
We need to think about our networks and our tools for moving and keeping track of data!
• We can’t rely on the “leave it at the supercomputer site” approach.
• How do we do joint analysis? How do we process the data at all?
• Malcolm Atkinson, quoting Jim Gray, pointed out that it takes:
  • ~ o(minute) to grep or ftp a GB
  • ~ o(2 days) to grep or ftp a TB
  • ~ o(3 years) to grep or ftp a PB
• This requires:
  • sophisticated “fire and forget” file transfer (that has to outperform “sneaker net”), and
  • disk and compute resources for processing.
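A back-of-envelope sketch of how those timescales follow from volume alone, assuming a sustained end-to-end rate of roughly 10 MB/s; the rate is an assumption chosen to reproduce the quoted orders of magnitude, not a measurement.

```python
# Rough scaling of "grep or ftp" time with data volume. The ~10 MB/s sustained
# end-to-end rate is an assumed figure, chosen to roughly reproduce the
# orders of magnitude quoted above.

RATE_BYTES_PER_S = 10e6   # ~10 MB/s sustained (assumption)

for label, volume_bytes in [("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
    seconds = volume_bytes / RATE_BYTES_PER_S
    print(f"{label}: {seconds/60:12,.0f} min = {seconds/86400:10.2f} days = {seconds/3.15e7:6.2f} years")

# 1 GB -> minutes, 1 TB -> about a day or two, 1 PB -> about three years:
# moving the data dominates, hence the need for "fire and forget" transfer tools.
```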
SuperJanet4
• We need to address:
  • local firewall issues (not just at the Met Office), and
  • spur bandwidths. The limits are not in the backbones!
• On a 2 Mbit/s link: 80 minutes to transfer 500 MB, cf. 40 minutes with GridFTP, or less than 1 minute between DL and RAL (1 Gbit/s).
ESG1 Results (Supercomputing, 2001) Dallas to Chicago: Allcock et al. 2001
Starting with the LAS: deployment for UK users within a few weeks (the constraint is primarily access control).
ERA-40 LAS – simple box-fill output. Work for us to do: the labelling is inadequate as yet …
Cache management in LAS/CDAT (ERA-40: a 4 TB spectral archive and a < 1 TB grid cache, presented to the internet user as an 18 TB virtual dataset)
• LAS calls cdms.open to open a data file; BADC/CDAT (localCache.py) intercepts the command and checks the cache.
• Access to the cache is locked while it checks whether the regular gridded file is in the cache list.
• If YES: the cache is unlocked, a new cdms.open command is sent to CDAT and the cached file is opened.
• If NO: the spectral file is converted on-the-fly and placed in the cache. The cache also checks whether there is enough room, deleting the oldest files if necessary against the disk space limit.
• The data object is delivered to LAS, and a NetCDF file, plot or animation is delivered to the user.
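For illustration, here is a much-simplified Python sketch of the cache-check-and-convert logic described above. The function names, paths and locking details are invented for this sketch and are not the actual localCache.py code.

```python
# Simplified, hypothetical sketch of the cache logic described above.
# Function names, paths and the conversion callables are invented for
# illustration; this is not the actual localCache.py code. Unix-only locking.
import os
import fcntl

CACHE_DIR = "/badc/era40/grid_cache"     # illustrative path
CACHE_LIMIT_BYTES = 1 * 1024**4          # the "< 1 TB" grid cache

def cached_open(spectral_path, open_netcdf, convert_to_grid):
    """Return an open gridded file, converting from the spectral archive on demand."""
    grid_path = os.path.join(CACHE_DIR, os.path.basename(spectral_path) + ".nc")
    lock_path = os.path.join(CACHE_DIR, ".lock")

    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)            # lock access to the cache
        if not os.path.exists(grid_path):           # is the gridded file already cached?
            # Use the spectral file size as a rough estimate of the space needed.
            _make_room(os.path.getsize(spectral_path))
            convert_to_grid(spectral_path, grid_path)   # on-the-fly conversion
        fcntl.flock(lock, fcntl.LOCK_UN)            # cache unlocked

    return open_netcdf(grid_path)                   # e.g. cdms.open(grid_path)

def _make_room(needed_bytes):
    """Delete the oldest cached files until the new file fits under the disk limit."""
    files = sorted(
        (os.path.join(CACHE_DIR, f) for f in os.listdir(CACHE_DIR) if f.endswith(".nc")),
        key=os.path.getmtime,
    )
    used = sum(os.path.getsize(f) for f in files)
    while files and used + needed_bytes > CACHE_LIMIT_BYTES:
        oldest = files.pop(0)
        used -= os.path.getsize(oldest)
        os.remove(oldest)
```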
Summary
• Earth System Modelling extends the data handling challenge.
• We need better information management.
• We need better tools for moving things around.
• We need better tools for using remote data.
• … and we need data manipulation hardware!
The NDG is attempting (with help) to address:
• information management,
• data movement, and
• tools to manipulate large volumes of data.