The Earth System Grid (ESG)
Metadata Schemas in ESG
DOE SciDAC ESG Project Review
Argonne National Laboratory, Illinois, May 8-9, 2003
Introduction
• ESG's initial focus is on climate model data, particularly PCM/CCSM data (netCDF format).
• Consequently, our work so far has concentrated on developing or evaluating metadata schemas suited to this kind of data, specifically:
  • “ESG schema” for expressing collection-level metadata
  • NcML schema for file-level metadata
  • THREDDS schema for data cataloguing and browsing
Earth System Grid
Part I ESG Schema
ESG schema: history
• Purpose-built by ESG to meet the specific needs of the PCM/CCSM modeling community (through ESG liaison Gary Strand)
• Several other standards were evaluated before we developed our own; none was found completely satisfactory:
  • Dublin Core (not rich enough for scientific data)
  • ISO (too complex to be imposed on data providers)
  • CLRC and DIF (almost adequate, but not flexible enough to capture some details that are important to PCM/CCSM)
• Initial draft developed in conjunction with the UK eScience office; we are still collaborating towards a common schema or interoperability
ESG schema: requirements
Information that needed to be captured in the metadata:
• Model run description (including run scenario and time period)
• Model configuration notes
• Active/inactive components (atmosphere, ocean, ice)
• Pointers to documentation of model components (usually on the web)
• Input forcing datasets (which ozone dataset, sulfate dataset, etc.)
• The site at which the model binary was built, perhaps even the compiler options that were used
• The site where the model was run
• The people who carried out the model integration and submission
• Related model experiments - VERY IMPORTANT!
  • "Sibling" runs (for ensembles of runs)
  • "Parent" run (the run from which this particular experiment started)
  • "Child" runs (runs descended from this run)
ESG schema: requirements
• References to visualizations (MPGs and so on) that use this model data
• References to published journal articles, papers and presentations that have used this experiment's data
• Miscellaneous notes
• Acknowledgment of funding agencies
ESG schema: description
• Expresses collection-level metadata, i.e. logical metadata that describes a set of logically related data files (for example, a model run).
• Developed following an object model: we defined objects with properties, inheritance between objects, and relations between objects (see following slide).
• Although developed specifically for model data, it could easily be extended to express observational, experimental and analysis data.
• Metadata is encoded in XML, conforming to an XML schema definition document (metadata syntax).
• XML metadata may be stored directly in a native XML database (Apache Xindice), or may be shredded and stored in a relational database (MySQL) within a set of purpose-built tables.
• We are currently developing an API for I/O of ESG metadata as XML to/from a transparent database backend.
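The "shredding" option mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the XML element names, the three-table layout and the function name `shred` are hypothetical and are not taken from the ESG schema document, and Python's built-in sqlite3 stands in for MySQL so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical collection-level record; element and attribute names are
# illustrative only and are NOT taken from the real ESG schema document.
SAMPLE_XML = """
<simulation id="pcm.run01">
  <name>Example PCM run</name>
  <description>Illustrative collection-level record</description>
  <note>atmosphere, ocean and ice components active</note>
  <note>restarted from a spin-up run</note>
  <dataset id="pcm.run01.atm" type="monthly">
    <format type="netCDF"/>
  </dataset>
  <dataset id="pcm.run01.ocn" type="monthly">
    <format type="netCDF"/>
  </dataset>
</simulation>
"""


def shred(xml_text, conn):
    """Split one XML record into rows of three (assumed) relational tables."""
    root = ET.fromstring(xml_text)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS activity
            (id TEXT PRIMARY KEY, kind TEXT, name TEXT, description TEXT);
        CREATE TABLE IF NOT EXISTS note (activity_id TEXT, body TEXT);
        CREATE TABLE IF NOT EXISTS dataset
            (id TEXT PRIMARY KEY, activity_id TEXT, type TEXT, format TEXT);
    """)
    act_id = root.get("id")
    # Single-valued properties become columns of the parent row.
    conn.execute(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        (act_id, root.tag, root.findtext("name"), root.findtext("description")),
    )
    # Repeatable elements become rows in child tables keyed by the parent id.
    for note in root.findall("note"):
        conn.execute("INSERT INTO note VALUES (?, ?)", (act_id, note.text))
    for ds in root.findall("dataset"):
        fmt = ds.find("format")
        conn.execute(
            "INSERT INTO dataset VALUES (?, ?, ?, ?)",
            (ds.get("id"), act_id, ds.get("type"),
             fmt.get("type") if fmt is not None else None),
        )
    conn.commit()
    return act_id


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a MySQL backend
run_id = shred(SAMPLE_XML, conn)
dataset_ids = [row[0] for row in conn.execute(
    "SELECT id FROM dataset WHERE activity_id = ? ORDER BY id", (run_id,))]
note_count = conn.execute("SELECT COUNT(*) FROM note").fetchone()[0]
print(run_id, dataset_ids, note_count)
```

The trade-off this sketch hints at: a native XML store keeps the record whole, while shredding makes individual properties queryable with plain SQL at the cost of maintaining a table layout that tracks the schema.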
ESG schema: object model (diagram; only the text labels of the figure survive)
Legend: AbstractClass, Class, isA = inheritance, other labelled links = association
Object: [1] id
Person: [0,1] firstName, [0,1] lastName, [0,1] contact
Institution: [0,1] name, [0,1] type, [0,1] contact
Activity: [0,1] name, [0,1] description, [0,1] rights, [0,n] date (type=, encoding=), [0,n] note, [0,n] participant (role=), [0,n] reference (uri=)
Project: [0,n] topic (type=), [0,1] funding
Parameter: [1] name, [0,1] mapping (authority=)
Service: [0,1] name, [0,1] description
Simulation: [0,n] simulationInput (type=), [0,n] simulationHardware
Dataset: [0,1] type, [0,1] conventions, [0,n] date (type=, encoding=), [0,n] format (type=, uri=), [0,1] timeCoverage, [0,1] spaceCoverage
Classes shown without properties: Campaign, Investigation, Ensemble, Experiment, Analysis, Observation, ParameterList
Link labels: isA, worksFor, participant (role=), activityRef, serviceRef, hasParameter, hasParameters, isPartOf, isDerivedFrom, generatedBy
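As a rough illustration of how the object model (objects with properties, inheritance, relations) might look in code, here is a minimal sketch using Python dataclasses. The property names come from the diagram labels, but the inheritance chain and the endpoints of each association are assumptions: the surviving labels do not say which class each isA or isPartOf link connects, so the arrangement below is one plausible reading and not the confirmed ESG design. Only a subset of classes and properties is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ESGObject:
    id: str  # "[1] id": the only mandatory property of the root Object


@dataclass
class Institution(ESGObject):
    name: Optional[str] = None
    type: Optional[str] = None
    contact: Optional[str] = None


@dataclass
class Person(ESGObject):
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    contact: Optional[str] = None
    works_for: Optional[Institution] = None  # "worksFor" (assumed endpoints)


@dataclass
class Activity(ESGObject):
    name: Optional[str] = None
    description: Optional[str] = None
    rights: Optional[str] = None
    notes: List[str] = field(default_factory=list)   # "[0,n] note"
    is_part_of: Optional["Activity"] = None          # "isPartOf" (assumed endpoints)


@dataclass
class Simulation(Activity):  # assumed to inherit from Activity
    simulation_inputs: List[str] = field(default_factory=list)
    simulation_hardware: List[str] = field(default_factory=list)


@dataclass
class Dataset(ESGObject):
    type: Optional[str] = None
    conventions: Optional[str] = None
    generated_by: Optional[Activity] = None          # "generatedBy" (assumed endpoints)


# Hypothetical instances: an ensemble, one run inside it, one dataset from the run.
ensemble = Activity(id="ens.01", name="example ensemble")
run = Simulation(id="run.01", name="example run", is_part_of=ensemble,
                 simulation_inputs=["ozone", "sulfate"])
atm = Dataset(id="run.01.atm", type="monthly", generated_by=run)
print(atm.generated_by.is_part_of.name)
```

Optional fields with a default of `None` mirror the `[0,1]` labels, and list fields mirror `[0,n]`.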
Part II NcML NetCDF Markup Language
NcML: description
• Developed as an ESG/Unidata collaboration
• XML language for expressing metadata associated with netCDF data (i.e. data following the netCDF model)
• Modular, extensible architecture: built as a set of schema modules, each fulfilling a specific function:
  • Core NcML schema: XML encoding of the file-level metadata associated with any netCDF file (i.e. the same information as contained in the netCDF header). Useful for expressing metadata in a standard encoding (XML), so that it can be processed by a large number of clients; also, metadata may be made immediately available even if the data is not (for example, because it is on remote storage).
  • Coordinate system extension: captures information about coordinates and coordinate systems (normally encoded as netCDF conventions such as COARDS or CF). This information can be used, for example, by high-level visualization and analysis clients.
  • Dataset extension (under development): allows data aggregation and subsetting, and the definition of derived or virtual data. Aggregation metadata is used to expose a dataset independently of how the data is actually stored (i.e. in which files).
  • Planned extension for openGIS-ISO interoperability
• NcML is generated automatically by parsing the input netCDF file(s)
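To make the core NcML idea concrete (the netCDF header re-expressed as XML), here is a minimal sketch. Two assumptions apply: the element and attribute names (`netcdf`, `dimension`, `variable`, `attribute`) and the namespace follow the NcML 2.2 schema that Unidata published later, so the 2003 draft described here may have differed; and the header is given as a hypothetical Python dictionary instead of being read from a real netCDF file, so that the example needs only the standard library.

```python
import xml.etree.ElementTree as ET

# Namespace of the later NcML 2.2 schema (the 2003 draft may differ).
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

# Hypothetical header contents, standing in for what a netCDF reader
# would report for a real file.
HEADER = {
    "location": "example.nc",
    "dimensions": {"time": 12, "lat": 64, "lon": 128},
    "global_attributes": {"title": "Example model output"},
    "variables": [
        {"name": "tas", "type": "float", "shape": ["time", "lat", "lon"],
         "attributes": {"units": "K", "long_name": "surface air temperature"}},
    ],
}


def q(tag):
    """Qualify a tag name with the NcML namespace."""
    return f"{{{NCML_NS}}}{tag}"


def header_to_ncml(header):
    """Encode dimensions, attributes and variables (no data) as NcML text."""
    ET.register_namespace("", NCML_NS)  # serialize NcML as the default namespace
    root = ET.Element(q("netcdf"), {"location": header["location"]})
    for name, length in header["dimensions"].items():
        ET.SubElement(root, q("dimension"), {"name": name, "length": str(length)})
    for name, value in header["global_attributes"].items():
        ET.SubElement(root, q("attribute"), {"name": name, "value": str(value)})
    for var in header["variables"]:
        elem = ET.SubElement(root, q("variable"), {
            "name": var["name"],
            "type": var["type"],
            "shape": " ".join(var["shape"]),  # dimension names, space-separated
        })
        for name, value in var["attributes"].items():
            ET.SubElement(elem, q("attribute"), {"name": name, "value": str(value)})
    return ET.tostring(root, encoding="unicode")


ncml_text = header_to_ncml(HEADER)
print(ncml_text)
```

Because the document carries only the header, a client can inspect dimensions, variables and units without touching the data file itself, which is the point made above about data held on remote storage.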
NcML: schema architecture
• NcML core (generic netCDF data)
• NcML Coordinate Systems (netCDF conventions for coordinates and coordinate systems)
• NcML dataset (aggregation, operations on data)
• openGIS-ISO
Part III THREDDS
THREDDS
• Project led by Unidata in collaboration with many universities and research groups
• Aimed at developing a standard for hierarchical cataloguing of data and associated metadata
• Allows cross-browsing of catalogs and associated metadata, and federation of data holdings among multiple repositories
• ESG is currently evaluating THREDDS technology: we have produced and published on the web THREDDS catalogs for 16 PCM runs
• Ultimately, ESG might decide to produce THREDDS catalogs for all of its data holdings, either as a separate process or by generating them from other metadata sources
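Hierarchical cataloguing can be sketched as a small catalog document plus a recursive walk that lists every accessible dataset. The catalog below is hypothetical (names, paths and the example.org URL are invented), and its element names and namespace follow the THREDDS InvCatalog 1.0 format, which was finalized after this review, so the catalogs ESG actually published in 2003 may have used an earlier variant.

```python
import xml.etree.ElementTree as ET

# Namespace of the InvCatalog 1.0 format (an assumption for this sketch).
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical catalog: collections nest, leaves carry a urlPath.
CATALOG = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="Example PCM catalog">
  <service name="dods" serviceType="DODS" base="http://example.org/dods/"/>
  <dataset name="PCM runs">
    <dataset name="run A">
      <dataset name="atmosphere monthly" urlPath="runA/atm_monthly.nc"/>
      <dataset name="ocean monthly" urlPath="runA/ocn_monthly.nc"/>
    </dataset>
    <dataset name="run B">
      <dataset name="atmosphere monthly" urlPath="runB/atm_monthly.nc"/>
    </dataset>
  </dataset>
</catalog>"""


def leaf_datasets(node, base, trail=()):
    """Yield (hierarchical name, access URL) for each dataset with a urlPath."""
    for child in node.findall("t:dataset", NS):
        path = trail + (child.get("name"),)
        if child.get("urlPath"):
            yield " / ".join(path), base + child.get("urlPath")
        # Recurse so that collections of any depth are covered.
        yield from leaf_datasets(child, base, path)


root = ET.fromstring(CATALOG)
base = root.find("t:service", NS).get("base")
entries = list(leaf_datasets(root, base))
for name, url in entries:
    print(name, "->", url)
```

A browsing client would present the nesting as a tree; a federation layer could run the same walk over catalogs from several repositories and merge the results.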
Part IV Conclusions
Future Development
• Schema conversion: automatic generation of metadata conforming to other standards from ESG collection-level metadata
  • DIF, for publishing to the GCMD discovery system (also, DIF can be converted to ISO)
  • Dublin Core, for publishing to digital libraries
• Aggregation metadata:
  • Finalize the NcML dataset extension
  • Conversion of NcML aggregation metadata into:
    - CDML (for CDAT visualization)
    - LAS (for analysis of data through LAS)
• Ontologies for interoperability between scientific schemas
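The planned schema conversion could look roughly like the sketch below, which maps a collection-level record onto Dublin Core elements. The input record, its element names and the mapping table `FIELD_MAP` are all hypothetical; only the Dublin Core element names and namespace are taken from the Dublin Core Metadata Element Set 1.1. This is an illustration of the idea, not the conversion ESG implemented.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set 1.1

# Hypothetical ESG-style record; element names are illustrative only.
ESG_RECORD = """<simulation id="pcm.run01">
  <name>Example PCM run</name>
  <description>Illustrative collection-level record</description>
  <rights>public</rights>
  <date type="start">1990-01-01</date>
  <participant role="contact">A. Scientist</participant>
  <simulationInput type="ozone">example ozone dataset</simulationInput>
</simulation>"""

# Assumed mapping from source elements to Dublin Core elements.
FIELD_MAP = {
    "name": "title",
    "description": "description",
    "rights": "rights",
    "date": "date",
    "participant": "creator",
}


def to_dublin_core(esg_xml):
    """Return a Dublin Core element tree built from one source record."""
    src = ET.fromstring(esg_xml)
    ET.register_namespace("dc", DC_NS)
    out = ET.Element("metadata")
    ET.SubElement(out, f"{{{DC_NS}}}identifier").text = src.get("id")
    for child in src:
        dc_name = FIELD_MAP.get(child.tag)
        if dc_name is None:
            continue  # no Dublin Core counterpart: this detail is dropped
        ET.SubElement(out, f"{{{DC_NS}}}{dc_name}").text = child.text
    return out


dc = to_dublin_core(ESG_RECORD)
print(ET.tostring(dc, encoding="unicode"))
```

The conversion is lossy by design: `simulationInput` has no Dublin Core counterpart and is dropped, which matches the earlier observation that Dublin Core is not rich enough for scientific data but is adequate for discovery in digital libraries.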
Collaborations and Impact
• COLLABORATIONS
  • PCM/CCSM modeling community (“ESG schema”)
  • UK eScience office (“ESG schema”)
  • Unidata (NcML)
• FEDERATIONS
  • THREDDS servers
  • GCMD search and discovery engine
  • Digital libraries
• IMPACT
  • ESG schema could be adopted by a wide scientific community
  • NcML may become a standard for XML encoding of netCDF data
  • NcML will be used as the standard for the Unidata DODS aggregation server