540 likes | 642 Views
Distributed Data, Distributed Governance, Distributed Vocabularies: The NERC DataGrid. Bryan Lawrence (on behalf of a big team, and note also a substantial piece of work with specific authorship included herein). +. +. ]=. +[. +. +. BADC, BODC, CCLRC, PML and SOC. Motivation Standards
E N D
Distributed Data, Distributed Governance, Distributed Vocabularies: The NERC DataGrid Bryan Lawrence (on behalf of a big team, and note also a substantial piece of work with specific authorship included herein) + + ]= +[ + + BADC, BODC, CCLRC, PML and SOC
Motivation Standards Feature Types Taxonomy Overall Architecture NDG Products Discovery Portal Data Extractor MOLES (NumSim relationship with NMM) CSML CSML Description Prototyping in MarineXML Round-Tripping Vocabulary Issues IN NDG (Hughes, Kondapalli, Lowry) NDG Timeline Outline
NCAR Complexity + Volume + Remote Access = Grid Challenge British Atmospheric Data Centre http://ndg.nerc.ac.uk British Oceanographic Data Centre
Want interdisciplinary semantic access to information, not abstract data getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000) not: getData(“era40.nc”, ‘PTMP’, 20:50, 300:340, 190:200) or even worse: for j=1990:2000 getData(“era40_”+j+“.nc”, ‘PTMP’, 20:50, 300:340) Lossy is OK! Care less about completeness of representation than semantic unification Integration – semantics
ISO 19101: Geographic information – Reference model …in a defined logical structure… …delivered through services… …and described by metadata. A geospatial dataset… …consists of features and related objects… Standards
Standards • Geographic ‘features’ • “abstraction of real world phenomena” [ISO 19101] • Type or instance • Encapsulate important semantics in universe of discourse • “Something you can name” • Application schema • Defines semantic content and logical structure • ISO standards provide toolkit: • spatial/temporal referencing • geometry (1-, 2-, 3-D) • topology • dictionaries (phenomena, units, etc.) • GML – canonical encoding [from ISO 19109 “Geographic information – Rules for Application Schema”]
CSML NCML+CF MOLES THREDDS CLADDIER DIF -> ISO19115 Architecture: NDG Metadata Taxonomy … not one schema, not one solution!
Vocab Services Users NDG GUI Interface(s) Data Providers NDG Core Services Architecture: Deployment
Vocab Services Users NDG GUI Interface(s) NDG Core Services Architecture: Deployment
Vocab Services Users NDG GUI Interface(s) Architecture: Deployment
Vocab Services Users Architecture: Deployment
Discovery Service NDG Products: Discovery Portal http://ndg.nerc.ac.uk/discovery NB: Web Service Interface (you can do the search from your own site and format and present the results there!
NDG Products: MOLES Ugly as sin! A hint of things to come:
Core linking concept is the deployment MOLES: implementation of a Data Production Tool at an Observation Station on behalf of an Activity that produces a Data Entity Activity DataProductionTool ObservationStation Links the metadata records into a structure that can be turned into a navigable structure Deployment Each of the main metadata objects has security data attached to it. This means that this can be applied to queries on the metadata Data Entity
Simulators as data production tools: NumSim NDG Products: NumSim
NumSim Example NumSim Example
Background activity being parallelised with GODIVA/CCLRC e-science collaboration (spectral -> gridpoint + CDMS + visualisation tools) Download either plot or the data that went into the plot. NDG Products: GEOSPLAT
ERA40: • All driven from one CDML file, 9 TB online spherical harmonics, looking like 40 TB “virtual” gridded!
Aims: provide semantic integration mechanism for NDG data explore new standards-based interoperability framework emphasise content, not container Design principles: offload semantics onto parameter type (‘phenomenon’, observable, measurand) e.g. wind-profiler, balloon temperature sounding offload semantics onto CRS e.g. scanning radar, sounding radar ‘sensible plotting’ as discriminant ‘in-principle’ unsupervised portrayal explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit) NDG-A: Climate Science Modelling Language
CSML feature types defined on basis of geometric and topologic structure Climate Science Modelling Language
CSML feature types examples... ProfileSeriesFeature ProfileFeature GridFeature Climate Science Modelling Language
Numerical array descriptors provides ‘wrapper’ architecture for legacy data files ‘Connected’ to data model numerical content through ‘xlink:href’ Three subtypes: InlineArray ArrayGenerator FileExtract (NASAAmes, NetCDF, GRIB) Composite design pattern for aggregation Climate Science Modelling Language
Inline array Array generator Climate Science Modelling Language <NDGInlineArray> <arraySize>5 2</arraySize> <uom>udunits.xml#degreeC</uom> <numericType>float</numericType> <regExpTransform>s/10/9/ge</regExpTransform> <numericTransform>+5</numericTransform> <values>1 2 3 4 5 6 7 8 9 10</values> </NDGInlineArray> <NDGArrayGenerator> <arraySize>10001</arraySize> <uom>udunits.xml#minute</uom> <numericType>float</numericType> <expression>0:5:50000</expression> </NDGArrayGenerator>
File extract Climate Science Modelling Language <NDGNASAAmesExtract> <arraySize>526</arraySize> <numericType>double</numericType> <fileName>/data/BADC/macehead/mh960606.cf1</fileName> <variableName>CFC-12</variableName> </NDGNASAAmesExtract> <NDGNetCDFExtract gml:id="feat04azimuth"> <arraySize>10000</arraySize> <fileName>radar_data.nc</fileName> <variableName>az</variableName> </NDGNetCDFExtract> <NDGGRIBExtract> <arraySize>320 160</arraySize> <numericType>double</numericType> <fileName>/e40/ggas1992010100rsn.grb</fileName> <parameterCode>203</parameterCode> <recordNumber>5</ recordNumber> <fileOffset>289412</fileOffset> </NDGGRIBExtract>
MarineXML Testbed For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FT’s and XSLT are maintained in a ‘MarineXML registry’ Phenomena in the XSD must have an associated portrayal Data from different parts of the marine community conforming to a variety of schema (XSD) The FTs can then be translated to equivalent FTs for display in the ECDIS system XSD XML Biological Species S52 Portrayal Library XSD XML Chl-a from Satellite XML Parser MarineGML(NDG) Feature Types XSLT XML XSLT XSLT SENC SeeMyDENC XSD MeasuredHydrodynamics XML XSLT XML XSLT XSLT ECDIS acts as an example client for the data. XSD Data Dictionary XML ModelledHydrodynamics The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features Features in the source XSD must be present in the data dictionary. XSD Feature described using S-57v3.1Application Schema can be imported and are equivalent to the same features in CSML’ XML S-57v3 GML Slide adapted from Kieran Millard (AUKEGGS, 2005)
MarineXML Testbed Biological sampling station with attributes for the species sampled at each Grid of Chl-a from the MERIS instrument on ENVISAT Predicted and measured wave climate timeseries (height, direction and period) Vectors of currents from instruments Slide adapted from Kieran Millard (AUKEGGS, 2005)
All this requires agreement on standards The Concept of re-using Features Here structured XML is converted to plain ascii text in the form required for a numerical model HTML warning service pages are generated ‘on the fly’ Here the same XML is converted to the SENC format used in a proprietary tool for viewing electronic navigation charts. XML can also be converted to SVG to display data graphically Slide adapted from Kieran Millard (AUKEGGS, 2005)
conceptual model New Dataset Conforms to 101010 UGAS produces <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> XML V1.0 will be in NDG Alpha GML app schema GML dataset Application instance parser CSML Round Tripping - 1 Managing semantics
V1.0 in NDG Alpha CF Dataset scanner 101010 CF produces <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> XML V1.0 in NDG Alpha GML app schema GML dataset Application instance parser CSML Round Tripping - 2 Managing data - 1
CF Dataset CF Dataset 101010 101010 Define Dataset DECISION PROCESSES <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> XML Add Information GML dataset Managing Data 2 scanner XSLT PUBLISH ISO19115
Architecture: Deployment
Vocabulary Management for NERC DataGrid Michael Hughes, V.Siva Kondapalli and Roy Lowry
Problem and Solution NERC DataGrid Vocabulary Model Vocabulary Technical Governance Vocabulary Content Governance Mappings and Thesaurus Server Potential Role of Local Mappings Vocabulary Presentation Outline
NERC DataGrid cannot function operationally without metadata and data semantic interoperability This will never be achieved without: Readily accessible standard terms whose meaning is clearly understood Readily accessible semantic maps both within and between lists of standard terms Semantic maps between local terms and standard terms The Problem
Implementation of a Vocabulary Server Building OWL ontologies mapping between domain-relevant de-facto standard vocabularies Deploying the ontologies through a Web Service thesaurus server Making tools available for users to build and deploy local ontologies The Solution?
Constraint (Aggregation of lists across which entry keys are unique) List (versioned, timestamped and labelled with key, name and definition) List (versioned, timestamped and labelled with key, name and definition) Entry (timestamped) Entry (timestamped) Entry (timestamped) Key Key Key Term Term Term Definition Definition Definition NDG Vocabulary Model
The vocabulary resource is built from Entries The representation of a single object in the real world comprising: Key - A bit pattern that represents an entity. It must be unique, permanent and free from semantics. Term – Text used to label the entity to facilitate human recognition. Abbreviation – An shortened version of the term for use where space is tight. Target size is 20-30 bytes. Definition – Text that unambiguously specifies the entity. Entries are aggregated into Lists (entity class or subclass e.g. UK post towns) Lists are aggregated into Constraints (entity class e.g. post towns of the world) NDG Vocabulary Model
The story so far Lists are available as flat ASCII files or XML documents as URLs e.g. http://www.cgd.ucar.edu/cms/eaton/cf-metadata/standard_name.xml ftp://ftp.pol.ac.uk/pub/bodc/jgofs/datadict/new/parameter_group.csv http://www.sea-search.net/cdi_documentation/cdi_sampling_codes.csv http://gcmd.nasa.gov/Resources/valids//gcmd_parameters.html Some (BODC, SEA-SEARCH) include keys Some (CF, BODC) include definitions None are properly versioned Vocabulary Technical Governance
Versioning should: Provide a unique label for each instantiation of the list Enable any previous instantiation of the list to be recreated Provide timestamp information for creation and modification of every object in the vocabulary system Delivery should Be from the master, not a copy Be accessible to software agents to allow automated synchronisation of local copies Have a ‘hotline’ to content governance Vocabulary Technical Governance
NERC DataGrid Vocabulary Server Back End Fully automated record archive, timestamps and version numbering. Live April 2006. 47 (of 115) lists publicly accessible. Front End Web Service API. Live June 2006. XML list downloads from website (July 2006?). Web-form tools (August 2006?). Vocabulary Technical Governance
Standard lists need to respond to ever expanding user requirements Change needs to be rapid or users lose interest Standard lists need to maintain information quality and internal consistency Content governance has to resolve these conflicting requirements Vocabulary Content Governance
Content governance in oceanographic and atmospheric domains is based on: Moderated e-mail discussion lists ‘Benign Dictator’ and well-meaning volunteers Variable success depending on right people having ‘spare’ time at the right moments More formalism underpinned by more resources required But need to be careful about going too far or levels of service become unacceptable Vocabulary Content Governance
There will never be a single list for a given topic Term mapping therefore an essential part of semantic interoperability Marine Metadata Interoperability (http://marinemetdata.org) have developed tooling and trialled mappings in the measurement phenomena arena Mappings and Thesaurus Server
MMI approach Harmonise lists to be mapped in OWL (Voc2OWL tool) Map on basis of ‘same as’, ‘broader than’ and ‘narrower than’ relationships (VINE tool) Place a Web Service API over the map to implement a term or thesaurus server Mappings and Thesaurus Server