650 likes | 755 Views
Distributed Data, Distributed Governance, Distributed Vocabularies, with a dash of CLADDIER: The NERC DataGrid. Bryan Lawrence (on behalf of a big team, and note also a substantial piece of work with specific authorship included herein). +. +. ]=. +[. +. +. BADC, BODC, CCLRC, PML and SOC.
E N D
Distributed Data, Distributed Governance, Distributed Vocabularies, with a dash of CLADDIER: The NERC DataGrid Bryan Lawrence (on behalf of a big team, and note also a substantial piece of work with specific authorship included herein) + + ]= +[ + + BADC, BODC, CCLRC, PML and SOC
Introduction to the BADC Motivation Standards Feature Types Taxonomy Overall Architecture NDG Products Discovery Portal Data Extractor MOLES (NumSim relationship with NMM) CSML CSML Description Prototyping in MarineXML Round-Tripping Vocabulary Issues IN NDG (Hughes, Kondapalli, Lowry) NDG Timeline Outline
The BADC role is to assist UK researchers to locate, access and interpret atmospheric data and to ensure the long-term integrity of atmospheric data produced by NERC projects. Facilitation and Curation/Preservation! BADC Role
A BADC dataset is an aggregation of data files, documents and metadata sharing common administrative policies. These policies could be file validation, access control or retention schemes. Datasets vary from TBs in millions of files to a few MBs in a single file. There are presently over 100 datasets. BADC Data Holdings
BADC User examples • Atmospheric chemistry models. • Pollution chemistry measurement campaigns.
User examples • Bird feeding habits.
User examples • Wind power research. • Radio communication modelling. • A & E influenza cases.
User examples • Castle mortar decay. • Discomfort indices.
User Diversity These stats quite old now, but the distributions haven’t changed.
Typically, two-thirds of this data will never see the light of the day: why? March 2006, 2.5 PB Climate in 20010 – A graphic Illustration Figures from Gary Strand, NCAR, ESG website No one can remember what it was, or, if they can remember that, where it is!
http://www.realclimate.org/index.php?p=121 What McIntyre got right: Data as Evidence http://www.uoguelph.ca/~rmckitri/research/trcback.html
Data Retention Policies University of Cambridge Research Division: Data generated in the course of research should be kept securely in paper or electronic format, as appropriate. Back-up records should always be kept for data stored on a computer. The [AMRC] considers a minimum of ten years to be an appropriate period. However, research based on clinical samples or relating to public health may require longer storage to allow for long-term follow-up to occur. [AMRC: Association of Medical Research Charities] University of Oxford Research Services Office: A successful laboratory notebook allows for ready verification of quality and integrity of research data and enables another investigator to reproduce the procedure which has been documented and get the same result. …. A successful laboratory notebook allows for ready verification of quality and integrity of research data and enables another investigator to reproduce the procedure which has been documented and get the same result. Natural Environment Research Council: … Scientists will frequently process the data they have collected selectively, or with specific application packages, in order to prepare material for publication in the scientific literature. But the full value of the data collected may only be realised if the entire dataset is subjected to generic processing (eg to ensure calibration and adequate quality control) and is sufficiently documented to allow others to re-use it at a later date. The original collector may be the only person in a position to undertake such work, and so to unlock the full potential of the data. Those holding data collected under NERC funding will be expected to cooperate in validating and publishing them in their entirety - when this can be justified in terms of their scientific value - rather than merely creaming off a subset for immediate publication in the literature. …
NCAR Complexity + Volume + Remote Access = Grid Challenge British Atmospheric Data Centre http://ndg.nerc.ac.uk British Oceanographic Data Centre
No one would change their data storage systems! Need to support a wide range of “metadata-maturity”! No NDG-wide user management system possible. It is illegal to share user information without each and every user agreeing … implies no way of having one virtual organisation with common user management! With a large enough group it is impossible to agree on common roles that could be associated with access control. … but we want single-sign on … and trust relationships between data providers … NDG Assumptions
Want interdisciplinary semantic access to information, not abstract data getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000) not: getData(“era40.nc”, ‘PTMP’, 20:50, 300:340, 190:200) or even worse: for j=1990:2000 getData(“era40_”+j+“.nc”, ‘PTMP’, 20:50, 300:340) Lossy is OK! Care less about completeness of representation than semantic unification Integration – semantics
ISO 19101: Geographic information – Reference model …in a defined logical structure… …delivered through services… …and described by metadata. A geospatial dataset… …consists of features and related objects… Standards
Standards • Geographic ‘features’ • “abstraction of real world phenomena” [ISO 19101] • Type or instance • Encapsulate important semantics in universe of discourse • “Something you can name” • Application schema • Defines semantic content and logical structure • ISO standards provide toolkit: • spatial/temporal referencing • geometry (1-, 2-, 3-D) • topology • dictionaries (phenomena, units, etc.) • GML – canonical encoding [from ISO 19109 “Geographic information – Rules for Application Schema”]
CSML NCML+CF MOLES THREDDS CLADDIER DIF -> ISO19115 Architecture: NDG Metadata Taxonomy … not one schema, not one solution!
Vocab Services Users NDG GUI Interface(s) Data Providers NDG Core Services Architecture: Deployment
Vocab Services Users NDG GUI Interface(s) NDG Core Services Architecture: Deployment
Vocab Services Users NDG GUI Interface(s) Architecture: Deployment
Vocab Services Users Architecture: Deployment
Core linking concept is the deployment MOLES: implementation of a Data Production Tool at an Observation Station on behalf of an Activity that produces a Data Entity Activity DataProductionTool ObservationStation Links the metadata records into a structure that can be turned into a navigable structure Deployment Each of the main metadata objects has security data attached to it. This means that this can be applied to queries on the metadata Data Entity
Simulators as data production tools: NumSim NDG Products: NumSim
NumSim Example NumSim Example
NDG “Pseudo-Demo” EXPLOITING DISCOVERY WEB SERVICE (running interface on my laptop last night)
Scrolling Down More Browse
MOLES Navigation Actually, this is where we plan to use NumSim and Numerical Model Metadata
Offering up trusted host list … NDG Authentication
Aims: provide semantic integration mechanism for NDG data explore new standards-based interoperability framework emphasise content, not container Design principles: offload semantics onto parameter type (‘phenomenon’, observable, measurand) e.g. wind-profiler, balloon temperature sounding offload semantics onto CRS e.g. scanning radar, sounding radar ‘sensible plotting’ as discriminant ‘in-principle’ unsupervised portrayal explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit) NDG-A: Climate Science Modelling Language
CSML feature types defined on basis of geometric and topologic structure Climate Science Modelling Language
CSML feature types examples... ProfileSeriesFeature ProfileFeature GridFeature Climate Science Modelling Language
Numerical array descriptors provides ‘wrapper’ architecture for legacy data files ‘Connected’ to data model numerical content through ‘xlink:href’ Three subtypes: InlineArray ArrayGenerator FileExtract (NASAAmes, NetCDF, GRIB) Composite design pattern for aggregation Climate Science Modelling Language
Inline array Array generator Climate Science Modelling Language <NDGInlineArray> <arraySize>5 2</arraySize> <uom>udunits.xml#degreeC</uom> <numericType>float</numericType> <regExpTransform>s/10/9/ge</regExpTransform> <numericTransform>+5</numericTransform> <values>1 2 3 4 5 6 7 8 9 10</values> </NDGInlineArray> <NDGArrayGenerator> <arraySize>10001</arraySize> <uom>udunits.xml#minute</uom> <numericType>float</numericType> <expression>0:5:50000</expression> </NDGArrayGenerator>
File extract Climate Science Modelling Language <NDGNASAAmesExtract> <arraySize>526</arraySize> <numericType>double</numericType> <fileName>/data/BADC/macehead/mh960606.cf1</fileName> <variableName>CFC-12</variableName> </NDGNASAAmesExtract> <NDGNetCDFExtract gml:id="feat04azimuth"> <arraySize>10000</arraySize> <fileName>radar_data.nc</fileName> <variableName>az</variableName> </NDGNetCDFExtract> <NDGGRIBExtract> <arraySize>320 160</arraySize> <numericType>double</numericType> <fileName>/e40/ggas1992010100rsn.grb</fileName> <parameterCode>203</parameterCode> <recordNumber>5</ recordNumber> <fileOffset>289412</fileOffset> </NDGGRIBExtract>
MarineXML Testbed For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FT’s and XSLT are maintained in a ‘MarineXML registry’ Phenomena in the XSD must have an associated portrayal Data from different parts of the marine community conforming to a variety of schema (XSD) The FTs can then be translated to equivalent FTs for display in the ECDIS system XSD XML Biological Species S52 Portrayal Library XSD XML Chl-a from Satellite XML Parser MarineGML(NDG) Feature Types XSLT XML XSLT XSLT SENC SeeMyDENC XSD MeasuredHydrodynamics XML XSLT XML XSLT XSLT ECDIS acts as an example client for the data. XSD Data Dictionary XML ModelledHydrodynamics The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features Features in the source XSD must be present in the data dictionary. XSD Feature described using S-57v3.1Application Schema can be imported and are equivalent to the same features in CSML’ XML S-57v3 GML Slide adapted from Kieran Millard (AUKEGGS, 2005)
MarineXML Testbed Biological sampling station with attributes for the species sampled at each Grid of Chl-a from the MERIS instrument on ENVISAT Predicted and measured wave climate timeseries (height, direction and period) Vectors of currents from instruments Slide adapted from Kieran Millard (AUKEGGS, 2005)
All this requires agreement on standards The Concept of re-using Features Here structured XML is converted to plain ascii text in the form required for a numerical model HTML warning service pages are generated ‘on the fly’ Here the same XML is converted to the SENC format used in a proprietary tool for viewing electronic navigation charts. XML can also be converted to SVG to display data graphically Slide adapted from Kieran Millard (AUKEGGS, 2005)
Other User Example: Norwegian Met Office CSML Present
conceptual model New Dataset Conforms to 101010 UGAS produces <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> XML V1.0 will be in NDG Alpha GML app schema GML dataset Application instance parser CSML Round Tripping - 1 Managing semantics
V1.0 in NDG Alpha CF Dataset scanner 101010 CF produces <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> XML V1.0 in NDG Alpha GML app schema GML dataset Application instance parser CSML Round Tripping - 2 Managing data - 1
CF Dataset CF Dataset 101010 101010 Define Dataset DECISION PROCESSES <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> XML Add Information GML dataset Managing Data 2 scanner XSLT PUBLISH ISO19115
Future of CSML Some new features: (Swath, Ragged ProfileSeries) But more importantly, factor out the storage and introduce the concept of “processing affordance”. (see http://home.badc.rl.ac.uk/lawrence/papers/ExeterCommunique06.pdf
Architecture: Deployment
Vocabulary Management for NERC DataGrid Michael Hughes, V.Siva Kondapalli and Roy Lowry