NERC DataGrid data model and its application. Andrew Woolf 1 ( A.Woolf@rl.ac.uk ), Ray Cramer 2 , Marta Gutierrez 3 , Kerstin Kleese van Dam 1 , Siva Kondapalli 2 , Susan Latham 3 , Bryan Lawrence 3 , Roy Lowry 2 , Kevin O’Neill 1 , Ag Stephens 3 1 CCLRC e-Science Centre
Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005
NERC DataGrid data model and its application
Andrew Woolf¹ (A.Woolf@rl.ac.uk), Ray Cramer², Marta Gutierrez³, Kerstin Kleese van Dam¹, Siva Kondapalli², Susan Latham³, Bryan Lawrence³, Roy Lowry², Kevin O’Neill¹, Ag Stephens³
¹ CCLRC e-Science Centre
² British Oceanographic Data Centre
³ British Atmospheric Data Centre
Outline
• NERC DataGrid – data integration problem
• Semantics as integration key
• CSML
• Wrapper/mediator architecture
• Use and future
NERC DataGrid
[Figure labels: British Atmospheric Data Centre; Simulations; Assimilation; British Oceanographic Data Centre]
NDG data integration
• Most (but not all) NDG data is file-based…
• On the Grid, no-one should know if you’re a file or relational table… (one service to bind them all)
• The file problem
  • multiple formats
  • focus usually on container, not content
• Scientific file format examples (earth sciences):
  • netCDF
  • HDF4
  • HDF5
  • GRIB
  • NASA Ames
  • ...
NDG data integration
• Typically, API is fundamental point of reference
  • binary format details not always exposed (or guaranteed)
  • public API often the only supported access mechanism
  • API typically implemented as optimised native library
  • why reinvent a well-known working interface?
• Data Format Description Language (DFDL)
  • XML ‘facade’ to file formats
  • earth science files are often giga-scale, so an XML query interface is not likely to be efficient
• encapsulating format not the issue for NDG...
• ...integrating domain-specific semantics efficiently across files and formats is!
NDG data integration
• Information and file contents
  • same information in different file formats – want to expose information, not format (seen earlier)
  • in addition, semantic information structures may be composed across files
Integration – semantics
• Want semantic access to information, not abstract data
  • getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000)
  • not: getData(“era40.nc”, ‘PTMP’, 20:50, 300:340, 190:200)
  • or even worse: for j=1990:2000 getData(“era40_”+j+“.nc”, ‘PTMP’, 20:50, 300:340)
• Lossy is OK!
  • Care less about completeness of representation than semantic unification
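The contrast between a semantic request and per-file access can be sketched in Python. Everything below is hypothetical: the catalogue layout, the region table, the `FileRequest` type and the `plan_requests` function are illustrative assumptions, not NDG interfaces. Only the file naming (`era40_<year>.nc`), the variable name `PTMP` and the index ranges are taken from the slide's own example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRequest:
    """One low-level request against a single file (hypothetical shape)."""
    file_name: str
    variable: str
    lat_range: tuple  # (start, stop) index range, as in the slide's 20:50
    lon_range: tuple  # (start, stop) index range, as in the slide's 300:340


# Hypothetical catalogue: maps (dataset, phenomenon) to storage details.
CATALOGUE = {
    ("ERA-40", "potential temperature"): {
        "file_pattern": "era40_{year}.nc",
        "variable": "PTMP",
    },
}

# Hypothetical named regions, expressed as index ranges on the file grid.
REGIONS = {
    "North Atlantic": {"lat": (20, 50), "lon": (300, 340)},
}


def plan_requests(phenomenon, dataset, region, first_year, last_year):
    """Expand one semantic request into the per-file requests behind it."""
    entry = CATALOGUE[(dataset, phenomenon)]
    box = REGIONS[region]
    return [
        FileRequest(
            file_name=entry["file_pattern"].format(year=year),
            variable=entry["variable"],
            lat_range=box["lat"],
            lon_range=box["lon"],
        )
        for year in range(first_year, last_year + 1)
    ]


requests = plan_requests(
    "potential temperature", "ERA-40", "North Atlantic", 1990, 2000
)
```

The caller names a phenomenon, dataset, region and period; file names, variable codes and index ranges stay behind the catalogue.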
NDG data integration
• Integration approaches: warehousing
• Integration approaches: wrapper/mediator
Integration – semantics
• Summary:
  • What we require is
    • semantic access to information (within and across files);
    • and to use native (well-known) efficient APIs under the covers
  • also:
    • scalability across providers
    • warehousing not an option (tera-scale!)
    • enhance access and use, ‘outwards-facing’ (e.g. impacts community, policymakers)
    • storage heterogeneity
Integration – semantics
• Database data modelling
  • Relational model (Codd, 1970)
  • Entity-relationship model (Chen, 1976)
  • Semantic data models
  • Object-oriented data models (inheritance, aggregation, behaviour)
• File-based data modelling
  • Far less advanced
  • Abstract models (‘variables’, ‘arrays’, etc.; no ‘object’ file formats in widespread use for earth science data)
  • API-driven
Integration – semantics
• Fundamentally, an information community is defined by shared semantics
  • semantics often (but not always) implicit
  • use information semantics for data integration
• Semantics as integration ‘key’
  • common language across providers (and users)
  • supports wrapper/mediator architecture
• NDG Solution components:
  • semantic data model (Climate Science Modelling Language)
  • storage descriptor (wrapper)
  • data services (mediator)
CSML
• Geographic ‘features’
  • “abstraction of real world phenomena” [ISO 19101]
  • Object models for data types – type or instance
  • Encapsulate important semantics in universe of discourse
• Application schema
  • Defines semantic content and logical structure of datasets
  • ISO standards provide conceptual toolkit:
    • spatial/temporal referencing
    • geometry (1-, 2-, 3-D)
    • topology
    • dictionaries (phenomena, units, etc.)
  • GML – canonical encoding
[from ISO 19109 “Geographic information – Rules for Application Schema”]
CSML
• CSML aims:
  • provide semantic integration mechanism for NDG data
  • explore new standards-based interoperability framework
  • emphasise content, not container
• Design principles:
  • offload semantics onto parameter type (‘phenomenon’, observable, measurand)
    • e.g. wind-profiler, balloon temperature sounding
  • offload semantics onto CRS
    • e.g. scanning radar, sounding radar
  • ‘sensible plotting’ as discriminant
    • ‘in-principle’ unsupervised portrayal
  • explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit)
CSML
• Semantic data model
  • Climate Science Modelling Language (CSML), http://ndg.nerc.ac.uk/csml
  • Weakly-typed conceptual models for range of information types
  • Independent of storage concerns
  • Based on ISO ‘geographic feature types’ framework
  • Defined on basis of geometric and topologic structure
CSML
• CSML feature type examples:
  • ProfileSeriesFeature
  • ProfileFeature
  • GridFeature
Wrapper
• Numerical array descriptors
  • provide ‘wrapper’ architecture for legacy data files
  • act as proxy for numerical content within feature instances
  • ‘connected’ to data model numerical content through ‘xlink:href’
• Three subtypes:
  • InlineArray
  • ArrayGenerator
  • FileExtract (NASAAmes, NetCDF, GRIB)
• Composite design pattern for aggregation
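The descriptor subtypes and the Composite pattern named on this slide can be sketched in Python as follows. This is a minimal illustration, not the CSML implementation: the method name `values`, the arithmetic-sequence meaning given to `ArrayGenerator`, and the `reader` callable standing in for a native file API are all assumptions. Arrays are kept one-dimensional for brevity.

```python
from abc import ABC, abstractmethod


class ArrayDescriptor(ABC):
    """Common interface: every descriptor can yield its numerical content."""

    @abstractmethod
    def values(self):
        """Return the described numbers as a flat list."""


class InlineArray(ArrayDescriptor):
    """Numbers carried directly in the descriptor."""

    def __init__(self, data):
        self._data = list(data)

    def values(self):
        return list(self._data)


class ArrayGenerator(ArrayDescriptor):
    """Numbers generated by rule (assumed here: an arithmetic sequence)."""

    def __init__(self, start, step, count):
        self.start, self.step, self.count = start, step, count

    def values(self):
        return [self.start + i * self.step for i in range(self.count)]


class FileExtract(ArrayDescriptor):
    """Numbers held in a legacy file and fetched on demand.

    `reader` is a hypothetical callable (file_name, variable_name) -> list
    that stands in for the format's native API.
    """

    def __init__(self, file_name, variable_name, reader):
        self.file_name = file_name
        self.variable_name = variable_name
        self._reader = reader

    def values(self):
        return list(self._reader(self.file_name, self.variable_name))


class AggregatedArray(ArrayDescriptor):
    """Composite: a descriptor made of other descriptors, possibly nested."""

    def __init__(self, components):
        self.components = list(components)

    def values(self):
        combined = []
        for component in self.components:
            combined.extend(component.values())
        return combined


def fake_reader(file_name, variable_name):
    # Stand-in for a native library call; returns fixed numbers.
    return [1.0, 2.0]


aggregate = AggregatedArray([
    InlineArray([0.0]),
    ArrayGenerator(10.0, 0.5, 3),
    FileExtract("a.nc", "TMP", fake_reader),
])
nested = AggregatedArray([aggregate, InlineArray([9.0])])
```

Because the composite shares the leaf interface, a client asking for `values()` does not need to know whether the numbers are inline, generated, or spread over several files.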
Wrapper
• File extract examples:
<NDGNASAAmesExtract>
  <arraySize>526</arraySize>
  <numericType>double</numericType>
  <fileName>/data/BADC/macehead/mh960606.cf1</fileName>
  <variableName>CFC-12</variableName>
</NDGNASAAmesExtract>
<NDGNetCDFExtract gml:id="feat04azimuth">
  <arraySize>10000</arraySize>
  <fileName>radar_data.nc</fileName>
  <variableName>az</variableName>
</NDGNetCDFExtract>
<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>
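A descriptor like these can be read with a few lines of standard-library Python. The sketch below parses the GRIB example from the slide; the function name `parse_extract` and the shape of the returned dictionary are assumptions made for illustration. The netCDF example is avoided here only because its `gml:id` attribute would need a namespace declaration to parse as a standalone fragment.

```python
import xml.etree.ElementTree as ET

# The GRIB file extract from the slide, as a standalone fragment.
GRIB_EXTRACT = """
<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>
"""


def parse_extract(xml_text):
    """Turn one file-extract element into a plain dictionary.

    The element name is kept under "kind", arraySize becomes a tuple of
    ints under "shape", and all other child elements are kept as strings.
    """
    root = ET.fromstring(xml_text.strip())
    fields = {child.tag: (child.text or "").strip() for child in root}
    shape = tuple(int(n) for n in fields.pop("arraySize").split())
    return {"kind": root.tag, "shape": shape, **fields}


descriptor = parse_extract(GRIB_EXTRACT)
```

The resulting dictionary carries what a format-specific reader would need to locate the record, without any of the array data itself being held in XML.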
Wrapper
• Aggregated array
  • arrays may be aggregated along an ‘existing’ or ‘new’ dimension
<AggregatedArray gml:id="globaltemperature">
  <arraySize>180 360</arraySize>
  <aggType>existing</aggType>
  <aggIndex>1</aggIndex>
  <component>
    <NetCDFExtract>
      <arraySize>90 360</arraySize>
      <fileName>northern_hemisphere.nc</fileName>
      <variableName>TMP</variableName>
    </NetCDFExtract>
  </component>
  <component>
    <NetCDFExtract>
      <arraySize>90 360</arraySize>
      <fileName>southern_hemisphere.nc</fileName>
      <variableName>TMP</variableName>
    </NetCDFExtract>
  </component>
</AggregatedArray>
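The difference between aggregating along an ‘existing’ and a ‘new’ dimension shows up in the resulting array shape. The Python sketch below computes that shape; the function `aggregated_shape` is hypothetical, and it assumes `aggIndex` is 1-based, which is consistent with the example above (two 90 × 360 components giving 180 × 360 with `aggIndex` 1) but is not stated on the slide.

```python
def aggregated_shape(component_shapes, agg_type, agg_index):
    """Return the shape of an aggregate built from equally-ranked components.

    agg_type "existing": concatenate along a dimension the components
    already have. agg_type "new": stack the components along a freshly
    inserted dimension. agg_index is assumed to be 1-based.
    """
    shapes = [tuple(shape) for shape in component_shapes]
    if not shapes:
        raise ValueError("at least one component is required")
    first = shapes[0]
    axis = agg_index - 1  # assumption: aggIndex counts from 1

    if agg_type == "existing":
        if not 0 <= axis < len(first):
            raise ValueError("aggIndex is outside the component dimensions")
        for shape in shapes:
            same_rank = len(shape) == len(first)
            others_match = (
                shape[:axis] + shape[axis + 1:] == first[:axis] + first[axis + 1:]
            )
            if not (same_rank and others_match):
                raise ValueError("components differ off the aggregation axis")
        total = sum(shape[axis] for shape in shapes)
        return first[:axis] + (total,) + first[axis + 1:]

    if agg_type == "new":
        if not 0 <= axis <= len(first):
            raise ValueError("aggIndex is outside the aggregate dimensions")
        if any(shape != first for shape in shapes):
            raise ValueError("components must share one shape")
        return first[:axis] + (len(shapes),) + first[axis:]

    raise ValueError("aggType must be 'existing' or 'new'")


hemispheres = [(90, 360), (90, 360)]
joined = aggregated_shape(hemispheres, "existing", 1)
stacked = aggregated_shape(hemispheres, "new", 1)
```

With the two hemisphere files, ‘existing’ yields one global 180 × 360 grid, whereas ‘new’ would yield two separate 90 × 360 layers.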
Mediator
• Data services (mediator)
• Data services expose semantic model:
  • Mappings to third-party data models (e.g. file formats, OPeNDAP)
  • Canonical serialisation (e.g. ISO 19118 UML-to-XML mapping) – Geography Markup Language
• Example services:
  • netCDF file instantiation
  • OPeNDAP delivery
  • Open Geospatial Consortium (OGC) web services, e.g. Web Feature Service, Web Coverage Service
• Pushed down to the file level, a data access request should use optimised native file format-specific I/O
Mediator
• Provides semantic abstraction layer
• instantiateNetCDF(DatasetID, FeatureID)
[Figure labels: <CSML>; (SAX) demarshalling]
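One way a mediator could push a request down to format-specific I/O is a registry keyed on the descriptor type. The Python sketch below is an assumption-laden illustration: the registry, the decorator `register_reader`, the function `fetch` and the stub readers are all hypothetical, and the stubs return strings where a real service would call the native netCDF or GRIB library.

```python
# Hypothetical registry: descriptor kind -> format-specific reader.
READERS = {}


def register_reader(kind):
    """Decorator that registers a reader for one descriptor kind."""
    def decorator(func):
        READERS[kind] = func
        return func
    return decorator


@register_reader("NDGNetCDFExtract")
def read_netcdf(descriptor):
    # Stub: a real service would call the native netCDF library here.
    return "netCDF:{fileName}:{variableName}".format(**descriptor)


@register_reader("NDGGRIBExtract")
def read_grib(descriptor):
    # Stub: a real service would call a native GRIB decoder here.
    return "GRIB:{fileName}:{parameterCode}".format(**descriptor)


def fetch(descriptor):
    """Dispatch a storage descriptor to the reader for its format."""
    try:
        reader = READERS[descriptor["kind"]]
    except KeyError:
        raise ValueError("no reader for " + repr(descriptor.get("kind"))) from None
    return reader(descriptor)


netcdf_result = fetch({
    "kind": "NDGNetCDFExtract",
    "fileName": "radar_data.nc",
    "variableName": "az",
})
grib_result = fetch({
    "kind": "NDGGRIBExtract",
    "fileName": "/e40/ggas1992010100rsn.grb",
    "parameterCode": "203",
})
```

Clients work against the semantic model only; adding a format means registering one more reader, not changing the service interface.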
Using CSML
• Example of CSML use – MarineXML (with thanks to Keiran Millard, HR Wallingford)
  • Data from different parts of the marine community conform to a variety of schemas (XSD), e.g. Biological Species, Chl-a from Satellite, MeasuredHydrodynamics, ModelledHydrodynamics
  • For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FTs) defined by CSML. The FTs and XSLT are maintained in a ‘MarineXML registry’
  • Features in the source XSD must be present in the data dictionary
  • Phenomena in the XSD must have an associated portrayal
  • The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features
  • The FTs can then be translated to equivalent FTs for display in the ECDIS system
  • ECDIS acts as an example client for the data
[Figure labels: XSD; XML; XSLT; XML Parser; Data Dictionary; S52 Portrayal Library; MarineGML(NDG) Feature Types; SENC; SeeMyDENC]
Using CSML
    <gml:definitionMember>
      <om:Phenomenon gml:id="taxon">
        <gml:description>The taxon name</gml:description>
        <gml:name codeSpace="http://www.vliz.be">taxon</gml:name>
      </om:Phenomenon>
    </gml:definitionMember>
  </NDGPhenomenonDefinitions>
  <!--===================================================================-->
  <gml:FeatureCollection>
    <!-- ============================================================== -->
    <gml:featureMember>
      <NDGPointFeature gml:id="ICES_100">
        <NDGPointDomain>
          <domainReference>
            <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree">
              <location>55.25 6.5</location>
            </NDGPosition>
          </domainReference>
        </NDGPointDomain>
        <gml:rangeSet>
          <gml:DataBlock>
            <gml:rangeParameters>
              <gml:CompositeValue>
                <gml:valueComponents>
                  <gml:measure uom="#tn"/>
                  <gml:measure uom="#amount"/>
                  <gml:measure uom="#gsm"/>
                </gml:valueComponents>
              </gml:CompositeValue>
            </gml:rangeParameters>
            <gml:tupleList>
              'ANTHOZOA',63.1,missing
              'Scoloplos armiger',66.1,missing
              'Spio filicornis',10,missing
              'Spiophanes bombyx',60.3,missing
              'Capitellidae',131.8,missing
              'Pholoe',10,missing
              'Owenia fusiformis',23.4,missing
              'Hypereteone lactea',6.8,missing
              'Anaitides groenlandica',13.2,missing
              'Anaitides mucosa',6.8,missing
• EU project – MarineXML
  “MarineXML is an initiative of the IOC/IODE of UNESCO to improve marine data exchange within the marine community. The European Commission has provided a funding contribution to this initiative as part of its 5th Framework Programme to undertake a ‘pre-standardisation’ task of identifying the approaches the marine community should adopt regarding XML technology to achieve improved data exchange.”
  “... there is a momentum from organisations such as IHO and WMO to adopt consistent approaches for the vocabulary of their data along the reference implementation of ISO Standards prescribed by the [Open Geospatial Consortium]...”
  “The NDG format proved a robust recipient for the data from each community. It produced economical files with few redundant elements, striking about the right balance between weak and strong typing.”
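The `gml:tupleList` content in the excerpt above is compact but needs a little care to read, because the quoted taxon names can contain spaces. A minimal Python sketch follows; the function name `parse_tuple_list`, the regular expression, and the choice to map the token `missing` to `None` are assumptions for illustration. The three fields are named after the `uom` references in the excerpt (`#tn`, `#amount`, `#gsm`) without interpreting them further.

```python
import re

# One tuple: a single-quoted name, then two comma-separated tokens.
TUPLE_PATTERN = re.compile(r"'([^']*)',([^,\s]+),(\S+)")


def _to_number(token, missing_token):
    """Convert a token to float, or None if it is the missing marker."""
    return None if token == missing_token else float(token)


def parse_tuple_list(text, missing_token="missing"):
    """Split tupleList text into (tn, amount, gsm) records."""
    return [
        (name, _to_number(amount, missing_token), _to_number(gsm, missing_token))
        for name, amount, gsm in TUPLE_PATTERN.findall(text)
    ]


# First three tuples from the slide's excerpt.
SAMPLE = (
    "'ANTHOZOA',63.1,missing "
    "'Scoloplos armiger',66.1,missing "
    "'Spio filicornis',10,missing"
)

records = parse_tuple_list(SAMPLE)
```

Splitting on whitespace alone would break names such as 'Scoloplos armiger', which is why the quoted name is matched as a unit.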
Conclusions/future
• Conclusions
  • Mechanism is lossy, in general
    • semantic integration is far more important than completeness of representation
  • Emphasis on content, not container
  • Mediator services can expose data model
  • Well-known community formats – use efficient legacy APIs
  • Initial semantic decoration can add context to entire workflow chain
  • Loose relationship between legacy file data model and semantic (feature) instance to which it is mapped
Conclusions/future
• Current and future work (NDG)
  • Implement tooling:
    • CSML parsing/processing
    • Automated ‘scanner’: {files} → CSML
  • Implement NDG data delivery (mediator) services layered over data model
• Further perspectives
  • Integrate with broader interoperability frameworks (e.g. ‘semantics repositories’ Feature Type Catalogues – WMO, IOC, INSPIRE)
  • Generalise approach:
    • meta-model for data modelling
    • ‘data storage description language’ for file mappings (DFDL role?)
    • canonicalised serialisation for workflows
Conclusions/future
• Managing semantics
[Figure labels: conceptual model; define data models; auto-generate XSD; GML app schema; auto-generated parser; populate dataset instances; GML dataset. The GML dataset is illustrated with the NDGPointFeature "ICES_100" instance from the MarineXML example.]
Conclusions/future
• Parser: stack of Builders (for UML meta-model)
  • each holds current class, object, attribute
  • specialised for particular UML-to-XML mapping
• Builder receives:
  • filtered SAX events
  • built object
• Builder returns:
  • built object
  • new object class
  • new Builder (for inheritance through substitutionGroups)
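The stack-of-Builders idea can be illustrated with Python's standard `xml.sax` module. This is a much-simplified sketch, not the NDG parser: the `Builder` and `StackHandler` classes are hypothetical, every element gets the same generic Builder (there is no specialisation per UML-to-XML mapping and no handling of substitutionGroups), and the built objects are plain dictionaries and strings.

```python
import xml.sax


class Builder:
    """Builds one object from the SAX events of a single element."""

    def __init__(self, tag):
        self.tag = tag
        self.text = []
        self.children = {}

    def characters(self, content):
        # Filtered SAX event: character data inside this element.
        self.text.append(content)

    def receive(self, tag, built):
        # Called with the finished object of a child element.
        self.children.setdefault(tag, []).append(built)

    def build(self):
        # Elements with children become dicts; leaves become strings.
        if self.children:
            return self.children
        return "".join(self.text).strip()


class StackHandler(xml.sax.ContentHandler):
    """Routes SAX events to the Builder on top of a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.result = None

    def startElement(self, name, attrs):
        # A new element starts: push a Builder for it.
        self.stack.append(Builder(name))

    def characters(self, content):
        if self.stack:
            self.stack[-1].characters(content)

    def endElement(self, name):
        # The element is complete: pop its Builder and hand the built
        # object to the parent Builder, or keep it as the final result.
        finished = self.stack.pop()
        built = finished.build()
        if self.stack:
            self.stack[-1].receive(finished.tag, built)
        else:
            self.result = built


def parse(xml_text):
    """Parse XML text and return the object built for the root element."""
    handler = StackHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.result


tree = parse("<a><b>1</b><b>2</b><c>x</c></a>")
```

The stack depth mirrors the element nesting, so each Builder only ever sees the events for its own element plus the finished objects of its children.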