Experience with the WMO core metadata in the SIMDAT/VGISC project. Baudouin Raoult ECMWF. The SIMDAT/VGISC project. SIMDAT EU funded GRID project 7 Technologies: Grid infrastructure, Virtual Organisation, Ontologies, Analysis Services, Workflows, Distributed data access, Knowledge Services
Experience with the WMO core metadata in the SIMDAT/VGISC project Baudouin Raoult ECMWF
The SIMDAT/VGISC project • SIMDAT • EU funded GRID project • 7 Technologies: Grid infrastructure, Virtual Organisation, Ontologies, Analysis Services, Workflows, Distributed data access, Knowledge Services • 4 Activities: Automotive, Aerospace, Pharmacy and Meteorology • Meteorology activity: build a Virtual GISC (V-GISC) • DWD • UKMO • MétéoFrance • EUMETSAT • ECMWF
V-GISC Conceptual view • Through the Distributed Portal, users search for and retrieve data and subscribe to services, subject to authentication and authorization • The Virtual Database Service provides a single view of the partners' databases
Why do we need metadata (in this project)? • Create a catalogue (discovery metadata) • Searchable (Keyword, Geographical location, Time range) • Browsable (Directory hierarchy) • Implement the V-GISC (service metadata) • Describe where the data resides (physical location) • Describe how to request the data • Describe the data format (useful for offering a list of transformations, e.g. sub-sampling of gridded data, plots or format conversions) • Describe associated data policies
Study of the WMO core • Starting point • XML files available on the WMO web site • XML files from an earlier DWD prototype • Trying to describe the ECMWF archive (1.3 × 10^10 GRIB fields)
XML Root element <p:piTimeseries xmlns:p="http://www.wmo.ch/web/www/metadata/piTimeseries" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.wmo.ch/web/www/metadata" xsi:schemaLocation="http://www.wmo.ch/web/www/metadata http://www.dwd.de/UNIDART/metadata/WMO19115_metadata_v0_2.xsd http://www.wmo.ch/web/www/metadata/piTimeseries http://www.dwd.de/UNIDART/metadata/WMO19115_piTimeseries_schema.xsd"> or <metaData xmlns="http://www.wmo.ch/web/www/metadata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fc="http://www.wmo.ch/web/www/featurecatalogue" xsi:schemaLocation="http://www.wmo.ch/web/www/metadata/../WMO19115_metadata_v0_2.xsd http://www.wmo.ch/web/www/featurecatalogue/./featurecat/iso19110.xsd"> • Namespaces are a nightmare to use (especially with XPath when there is a default namespace)
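The default-namespace problem can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module, which is not the tool used in the project, and a shortened document invented for the illustration. An unqualified path silently matches nothing, because every element belongs to the default namespace; the path only works once a prefix (here the arbitrary "w") is bound to the namespace URI.

```python
import xml.etree.ElementTree as ET

WMO_NS = "http://www.wmo.ch/web/www/metadata"

# Shortened, illustrative document using the default namespace from the slide.
DOC = """<metaData xmlns="http://www.wmo.ch/web/www/metadata">
  <identificationInfo>
    <descriptiveKeywords>Temperature</descriptiveKeywords>
    <descriptiveKeywords>Wind</descriptiveKeywords>
  </identificationInfo>
</metaData>"""

root = ET.fromstring(DOC)

# Unqualified path: no error is raised, but nothing matches, because the
# elements are really named {http://www.wmo.ch/web/www/metadata}identificationInfo etc.
naive = root.findall("identificationInfo/descriptiveKeywords")

# Qualified path: every step needs a prefix bound to the namespace URI.
ns = {"w": WMO_NS}  # "w" is an arbitrary prefix chosen for this sketch
qualified = root.findall("w:identificationInfo/w:descriptiveKeywords", ns)
keywords = [element.text for element in qualified]
```

The silent empty result is what makes this painful in practice: a query written for a document without a default namespace fails without any diagnostic on one that has it.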
XML Keywords <descriptiveKeywords>Russian Federation</descriptiveKeywords> <descriptiveKeywords>Moscow region</descriptiveKeywords> <descriptiveKeywords>Temperature</descriptiveKeywords> <descriptiveKeywords>Clouds</descriptiveKeywords> <descriptiveKeywords>Meteorology</descriptiveKeywords> <descriptiveKeywords>Observation</descriptiveKeywords> <descriptiveKeywords>Pressure</descriptiveKeywords> <descriptiveKeywords>Rainfall</descriptiveKeywords> <descriptiveKeywords>Snow</descriptiveKeywords> <descriptiveKeywords>Snowfall</descriptiveKeywords> <descriptiveKeywords>Weather</descriptiveKeywords> <descriptiveKeywords>Wind</descriptiveKeywords> <descriptiveKeywords>Phenomenon</descriptiveKeywords> Or… <descriptiveKeywords>EARTH SCIENCE > Cryosphere > Sea Ice</descriptiveKeywords> <descriptiveKeywords>EARTH SCIENCE > Atmosphere</descriptiveKeywords> <descriptiveKeywords>EARTH SCIENCE > Oceans</descriptiveKeywords> <descriptiveKeywords>EARTH SCIENCE > Solid Earth</descriptiveKeywords> <descriptiveKeywords>ocean, atmosphere, ice, land</descriptiveKeywords> Or… <descriptiveKeywords>METAR aviation hourly weather observation temperature dew point precipitation amount visibility cloud amount type height weather runway colour state</descriptiveKeywords>
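To index these three styles uniformly, some normalisation is needed. The following is only a sketch of one possible approach, not the project's code: the function name normalise_keywords and its splitting rules (">" for hierarchies, "," for lists, an optional separator for space-separated blobs) are assumptions made for the illustration.

```python
def normalise_keywords(values, separator=None):
    """Flatten heterogeneous <descriptiveKeywords> contents into lower-case terms.

    values    -- text content of each <descriptiveKeywords> element
    separator -- optional separator for "blob" style keywords (e.g. " ");
                 when None, a plain value is kept as a single term
    """
    terms = []
    for value in values:
        if ">" in value:            # hierarchical style: A > B > C
            parts = value.split(">")
        elif "," in value:          # comma-separated list in one element
            parts = value.split(",")
        elif separator is not None:  # one long blob of words
            parts = value.split(separator)
        else:                       # one keyword (possibly several words) per element
            parts = [value]
        terms.extend(part.strip().lower() for part in parts if part.strip())
    # Remove duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(terms))


one_per_element = normalise_keywords(["Russian Federation", "Moscow region"])
hierarchical = normalise_keywords(
    ["EARTH SCIENCE > Cryosphere > Sea Ice", "EARTH SCIENCE > Oceans"]
)
blob = normalise_keywords(["METAR aviation hourly weather"], separator=" ")
```

Note that the separator cannot be guessed from the text alone: "Moscow region" is one keyword, while the METAR example is a list of words. Only the data owner knows which applies.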
XML Geographical extent <geographicElement> <polygon> <point> <latitude>50.78</latitude> <longitude>6.1</longitude> </point> </polygon> </geographicElement> Or… <geographicElement> <geographicIdentifier gazetteer="http://www.wmo.ch/web/www/ois/volume-a/vola-home.htm"> CCCC2 </geographicIdentifier> </geographicElement> Or… <geographicElement> <boundingBox> <westBoundLongitude>-126.3</westBoundLongitude> <eastBoundLongitude>-126.3</eastBoundLongitude> <southBoundLatitude>39.9</southBoundLatitude> <northBoundLatitude>39.9</northBoundLatitude> </boundingBox> </geographicElement>
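A geographical index has to reduce these encodings to one representation. The sketch below shows one possible reduction to a (west, south, east, north) tuple. The function name to_bounding_box is hypothetical, namespaces are left out for brevity, and a <geographicIdentifier> is returned as "unknown" because resolving it would need a gazetteer lookup that is not described in the slides.

```python
import xml.etree.ElementTree as ET


def to_bounding_box(geographic_element):
    """Return (west, south, east, north), or None when only an identifier is given."""
    box = geographic_element.find("boundingBox")
    if box is not None:
        return (
            float(box.findtext("westBoundLongitude")),
            float(box.findtext("southBoundLatitude")),
            float(box.findtext("eastBoundLongitude")),
            float(box.findtext("northBoundLatitude")),
        )
    points = geographic_element.findall("polygon/point")
    if points:
        # A polygon (or a single point) is reduced to its enclosing box.
        lats = [float(p.findtext("latitude")) for p in points]
        lons = [float(p.findtext("longitude")) for p in points]
        return (min(lons), min(lats), max(lons), max(lats))
    return None  # e.g. <geographicIdentifier>: needs a gazetteer lookup


point = ET.fromstring(
    "<geographicElement><polygon><point>"
    "<latitude>50.78</latitude><longitude>6.1</longitude>"
    "</point></polygon></geographicElement>"
)
box = ET.fromstring(
    "<geographicElement><boundingBox>"
    "<westBoundLongitude>-180</westBoundLongitude>"
    "<eastBoundLongitude>+180</eastBoundLongitude>"
    "<southBoundLatitude>-90</southBoundLatitude>"
    "<northBoundLatitude>+90</northBoundLatitude>"
    "</boundingBox></geographicElement>"
)
identifier = ET.fromstring(
    "<geographicElement><geographicIdentifier>CCCC2</geographicIdentifier></geographicElement>"
)
```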
XML Temporal extent <temporalElement> <beginDateTime>0100-01-01</beginDateTime> <endDateTime>0299-12-31</endDateTime> <dataFrequency>monthly</dataFrequency> <dataFrequency>daily</dataFrequency> </temporalElement> Or… <temporalElement> <referenceDateTime>2004-02-05T00:00:00</referenceDateTime> <beginDateTime>2004-02-05T06:00:00</beginDateTime> <endDateTime>2004-02-05T06:00:00</endDateTime> </temporalElement> Or… <referenceDate> <date>2004-01-28</date> <dateType>creationDate</dateType> </referenceDate>
Repetition of XML elements (means extension) <dataExtent> <verticalElement> <minimumValue>3.5</minimumValue> <maximumValue>992.5</maximumValue> <unitOfMeasure>mb</unitOfMeasure> </verticalElement> </dataExtent> <dataExtent> <geographicElement> <boundingBox> <westBoundLongitude>-180</westBoundLongitude> <eastBoundLongitude>+180</eastBoundLongitude> <southBoundLatitude>-90</southBoundLatitude> <northBoundLatitude>+90</northBoundLatitude> </boundingBox> <geographicIdentifier gazetteer="http://gcmd.gsfc.nasa.gov/Resources/valids/location.html">Global </geographicIdentifier> </geographicElement> </dataExtent> <dataExtent> <temporalElement> <beginDateTime>1900-01-01</beginDateTime> <endDateTime>1999-12-31</endDateTime> <dataFrequency>monthly</dataFrequency> <dataFrequency>daily</dataFrequency> </temporalElement> </dataExtent>
Repetition of XML elements (means redefinition) <dataExtent> <description>Global Grid 2.5 degree latitude and 2.5 degree longitude steps, 6 sectors, one sector per GRIB bulletin Sector S</description> <geographicElement> <boundingBox> <westBoundLongitude>-180</westBoundLongitude> <eastBoundLongitude>-60</eastBoundLongitude> <southBoundLatitude>0</southBoundLatitude> <northBoundLatitude>90</northBoundLatitude> </boundingBox> </geographicElement> </dataExtent> <dataExtent> <description>Global Grid 2.5 degree latitude and 2.5 degree longitude steps, 6 sectors, one sector per GRIB bulletin Sector T</description> <geographicElement> <boundingBox> <westBoundLongitude>-60</westBoundLongitude> <eastBoundLongitude>60</eastBoundLongitude> <southBoundLatitude>0</southBoundLatitude> <northBoundLatitude>90</northBoundLatitude> </boundingBox> </geographicElement> </dataExtent>
Findings • A flexible format that leads to a lack of consistency • Different ways to encode geographical extents, keywords and temporal extents • Missing information (for the V-GISC) • To create a directory • To locate the data • To create retrieval requests • To describe available transformations • To implement data policies
Findings (cont.) • Seems to be designed for human consumption • Free text in XML elements • <distributionInfo> • <dataQualityInfo> • Not scalable • Some documents may change frequently (hourly?) • Some documents are orders of magnitude larger than the data itself • Cannot represent very large archives with small granularity
SIMDAT/VGISC problem • Each site has its own practices • We have to be ready for variability in the XML • We will have to handle XML from other WMO programmes • We need to handle tens of thousands of documents • A lot of repeated information • We need fast search • We need to automatically: • Index the keywords, the geographical extent and the temporal extent • Create a browsable directory (similar to NCAR’s Community Data Portal) • Locate and retrieve the data • Implement the data policy
Solution: split XML documents into fragments • WMO core metadata is structured • Some parts are shared among many documents • All metadata share the Core part • All UKMO metadata share the Owner part • All synops (should) share the same description • All observations at Heathrow have the same location • The date part is variable but very small [Diagram: example fragments: Core = WMO; Owner = UKMO; Data type = Synop; Station (geographical extent) = Heathrow; Date (temporal extent) = 2005-10-12]
XML fragments are hierarchically linked [Diagram: fragment hierarchy with the nodes WMO, UKMO, Synop, Heathrow, Heathrow Synop and Heathrow Synop 2005-10-12]
Fragments: advantages • Factorizing commonalities into static fragments • Reduces size of XML documents • Indexation done once • Avoid redundancy of information • Faster searches • Frequently updated documents are small • Manageable • Scalable • Complete XML document can be rebuilt • For exchange outside the V-GISC
Indexing of XML fragments [Diagram: the fragment hierarchy (WMO, UKMO, Synop, Heathrow, Heathrow Synop, Heathrow Synop 2005-10-12) linked to three indexes: Keywords, Geographical Extent, Temporal Extent]
Prototype implementation • XML fragments are stored as “text” • Fragment table • Hierarchy table • Indexed at insertion time • Keywords table • Locations table • Periods table • Directory table • Implemented with MySQL • With OpenGIS extension • With text search extension • Indexes are “inherited” • OO approach
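The idea of “inherited” indexes can be sketched as follows. This is not the prototype's schema: the prototype used MySQL, while this sketch swaps in Python's built-in sqlite3 module so that it is self-contained. The table and column names, the fragment identifiers and the parent links are all assumptions. A keyword is indexed once, on the fragment that carries it, and a search returns that fragment together with every fragment that inherits from it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE fragment  (id TEXT PRIMARY KEY, xml TEXT NOT NULL);  -- fragments stored as text
CREATE TABLE hierarchy (child TEXT NOT NULL, parent TEXT NOT NULL);
CREATE TABLE keyword   (fragment_id TEXT NOT NULL, term TEXT NOT NULL);
""")


def insert_fragment(frag_id, xml, parents=(), keywords=()):
    """Store a fragment and index it at insertion time."""
    db.execute("INSERT INTO fragment VALUES (?, ?)", (frag_id, xml))
    db.executemany("INSERT INTO hierarchy VALUES (?, ?)",
                   [(frag_id, parent) for parent in parents])
    db.executemany("INSERT INTO keyword VALUES (?, ?)",
                   [(frag_id, term.lower()) for term in keywords])


def search(term):
    """Fragments that carry the keyword themselves or inherit it from an ancestor."""
    rows = db.execute("""
        WITH RECURSIVE hit(id) AS (
            SELECT fragment_id FROM keyword WHERE term = ?
            UNION
            SELECT h.child FROM hierarchy h JOIN hit ON h.parent = hit.id
        )
        SELECT id FROM hit ORDER BY id
    """, (term.lower(),))
    return [row[0] for row in rows]


# Hypothetical hierarchy loosely following the Heathrow synop example.
insert_fragment("wmo", "<metaData/>")
insert_fragment("ukmo", "<metaData/>", parents=["wmo"])
insert_fragment("synop", "<metaData/>", parents=["wmo"], keywords=["Synop", "Observation"])
insert_fragment("heathrow", "<metaData/>", parents=["ukmo"], keywords=["Heathrow"])
insert_fragment("heathrow.synop", "<metaData/>", parents=["heathrow", "synop"])
insert_fragment("heathrow.synop.2005-10-12", "<metaData/>", parents=["heathrow.synop"])

synop_hits = search("synop")
heathrow_hits = search("Heathrow")
```

The daily fragment has no keywords of its own, yet it is found by both searches, which is why frequently updated fragments can stay small.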
Object Oriented Approach - Behaviours • Index <descriptiveKeywords> as keyword • Index <geographicElement> <boundingBox> as geography • Index <featureAttribute> <memberName> as keyword • Index <referenceDate> <date> as period [Diagram: behaviours attached to the fragment hierarchy WMO, UKMO, Synop, Heathrow, Heathrow Synop, Heathrow Synop 2005-10-12]
Fragment properties - Behaviours • Only the owner of the data knows how to: • Describe the data (indexing information) • Request the data (create an internal request) • Extract a subset of the data (define an interface to extract a subset) • Ancillary metadata can be associated with each fragment to describe how to index, request and sub-select the data • Behaviours are inherited • Object oriented approach
Behaviours example: indexing <indexing class="XPathKeywordIndexer" separator=" "> <xpath>//identificationInfo/descriptiveKeywords</xpath> </indexing> <indexing class="XPathBoundingBoxIndexer"> <xpath>//identificationInfo/dataExtent/geographicElement/boundingBox</xpath> </indexing> <indexing class="XPathPolygonIndexer"> <xpath>//identificationInfo/dataExtent/geographicElement/polygon</xpath> </indexing> <indexing class="XPathDateIndexer"> <xpath>//identificationInfo/referenceDate/date</xpath> </indexing> <indexing class="XPathPeriodIndexer"> <xpath>//identificationInfo/dataExtent/temporalElement</xpath> <xpath>//identificationInfo/referenceDate/period</xpath> </indexing> <indexing class="XPathDirectoryIndexer"> <xpath>//identificationInfo/topicCategory</xpath> </indexing>
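How such a configuration might drive the indexing can be sketched as below. Only the class names and XPath expressions come from the slide. The Python classes, the load_indexers function, the <behaviours> wrapper element and the sample fragment are assumptions, and only two of the six indexers are shown. The class attribute selects an indexer, the remaining attributes become its options, and each <xpath> says where to look in the fragment.

```python
import xml.etree.ElementTree as ET
from datetime import date


class XPathKeywordIndexer:
    def __init__(self, xpaths, separator=None):
        self.xpaths = xpaths
        self.separator = separator

    def index(self, fragment):
        terms = []
        for xpath in self.xpaths:
            for element in fragment.findall(xpath):
                text = (element.text or "").strip()
                terms.extend(text.split(self.separator) if self.separator else [text])
        return [term for term in terms if term]


class XPathDateIndexer:
    def __init__(self, xpaths):
        self.xpaths = xpaths

    def index(self, fragment):
        return [date.fromisoformat(element.text.strip())
                for xpath in self.xpaths
                for element in fragment.findall(xpath)]


# Registry mapping the class attribute to an implementation (two of six shown).
INDEXERS = {
    "XPathKeywordIndexer": XPathKeywordIndexer,
    "XPathDateIndexer": XPathDateIndexer,
}


def load_indexers(config):
    indexers = []
    for node in config.findall("indexing"):
        indexer_class = INDEXERS[node.get("class")]
        # ElementTree rejects absolute paths on elements, so "//a/b" becomes ".//a/b".
        xpaths = ["." + x.text.strip() for x in node.findall("xpath")]
        options = {k: v for k, v in node.attrib.items() if k != "class"}
        indexers.append(indexer_class(xpaths, **options))
    return indexers


config = ET.fromstring("""<behaviours>
  <indexing class="XPathKeywordIndexer" separator=" ">
    <xpath>//identificationInfo/descriptiveKeywords</xpath>
  </indexing>
  <indexing class="XPathDateIndexer">
    <xpath>//identificationInfo/referenceDate/date</xpath>
  </indexing>
</behaviours>""")

fragment = ET.fromstring("""<metaData><identificationInfo>
  <descriptiveKeywords>METAR aviation</descriptiveKeywords>
  <referenceDate><date>2005-06-29</date></referenceDate>
</identificationInfo></metaData>""")

results = [indexer.index(fragment) for indexer in load_indexers(config)]
```

Because the configuration is attached to a fragment and inherited, a data owner who encodes keywords differently only has to override the relevant <indexing> entry.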
<vgisc> extension • A <vgisc> element from the “http://www.vgisc.org/” namespace is embedded in all the fragments • It contains all the information needed to implement the V-GISC that is not defined by the WMO core because it is not relevant outside the scope of the V-GISC • Internal unique ID • Hierarchy relationship • Physical location (which V-GISC node holds the data) • Information used to create data requests • Information used to create web pages • It is removed when the full XML document is recomposed for use outside the V-GISC
Fragment example <metaData xmlns:v='http://www.vgisc.org/'> <v:vgisc> <id>urn:akrotiri.synop.land.second.record.20050629</id> <inherit>urn:akrotiri</inherit> <inherit>urn:int.wmo.synop.land.second.record</inherit> <location>ecmwf.obs</location> </v:vgisc> <identificationInfo> <referenceDate> <date>2005-06-29</date> </referenceDate> </identificationInfo> </metaData>
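Recomposing a full document from such a fragment can be sketched as follows. The merge rule used here (recurse into container elements that already exist, append everything else) is an assumption, since the slides do not say how fragments are merged. The contents of the two parent fragments and the names recompose and merge_into are also invented for the illustration. The sketch follows the <inherit> links, merges parents before the fragment itself, and drops the <v:vgisc> element.

```python
import copy
import xml.etree.ElementTree as ET

VGISC_TAG = "{http://www.vgisc.org/}vgisc"

# Hypothetical store: only the last fragment is taken from the slide (shortened).
STORE = {
    "urn:akrotiri": """<metaData xmlns:v='http://www.vgisc.org/'>
        <v:vgisc><id>urn:akrotiri</id></v:vgisc>
        <identificationInfo><descriptiveKeywords>Akrotiri</descriptiveKeywords></identificationInfo>
    </metaData>""",
    "urn:int.wmo.synop.land.second.record": """<metaData xmlns:v='http://www.vgisc.org/'>
        <v:vgisc><id>urn:int.wmo.synop.land.second.record</id></v:vgisc>
        <identificationInfo><descriptiveKeywords>Synop</descriptiveKeywords></identificationInfo>
    </metaData>""",
    "urn:akrotiri.synop.land.second.record.20050629": """<metaData xmlns:v='http://www.vgisc.org/'>
        <v:vgisc>
            <id>urn:akrotiri.synop.land.second.record.20050629</id>
            <inherit>urn:akrotiri</inherit>
            <inherit>urn:int.wmo.synop.land.second.record</inherit>
            <location>ecmwf.obs</location>
        </v:vgisc>
        <identificationInfo><referenceDate><date>2005-06-29</date></referenceDate></identificationInfo>
    </metaData>""",
}


def merge_into(target, source):
    """Copy the children of source into target, skipping the <v:vgisc> extension."""
    for child in source:
        if child.tag == VGISC_TAG:
            continue
        existing = target.find(child.tag)
        if existing is not None and len(child) and len(existing):
            merge_into(existing, child)           # both are containers: merge contents
        else:
            target.append(copy.deepcopy(child))   # leaf or new element: append


def recompose(fragment_id, store):
    fragment = ET.fromstring(store[fragment_id])
    full = ET.Element(fragment.tag)
    for parent in fragment.findall(VGISC_TAG + "/inherit"):
        merge_into(full, recompose(parent.text.strip(), store))
    merge_into(full, fragment)
    return full


full = recompose("urn:akrotiri.synop.land.second.record.20050629", STORE)
merged_keywords = [k.text for k in full.findall("identificationInfo/descriptiveKeywords")]
```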
Variables and Requests • Some datasets have too many items • Impossible to describe every one of them • But describing the whole dataset is simple • Some datasets are very homogeneous • E.g. same parameters for a long period of time • This can be described in a compact form (<beginDateTime> and <endDateTime>) • But we still need to specify that individual dates can be requested by the user
Variables and requests (cont.) • Associate two elements with an XML fragment: • <request> • Holds information on how to generate a valid request to the data repository • <variable> • Holds information on how to create a web interface to let the user select items from the dataset • Web portal • We use WMO core for discovery • We use the <variable> element to present selection dialogues to the user
Fragment example: ECMWF Reanalysis <metadata xmlns:v='http://www.vgisc.org/'> <v:vgisc> <id>urn:int.ecmwf.era40.sfc</id> <inherit>urn:int.wmo.core</inherit> <location>ecmwf.mars</location> <request> <class>e4</class> <levtype>sfc</levtype> <database>marser</database> </request> <variables> <date type='date'> <startDate>1980-01-01</startDate> <endDate>1990-12-31</endDate> </date> <param title='Parameter' multiple='1' type='enum'> <value>2t</value> <value>msl</value> </param> <time title='Base time' multiple='1' type='enum'> <value>0000</value> <value>0600</value> <value>1200</value> <value>1800</value> </time> </variables> </v:vgisc> <identificationInfo> <descriptiveKeywords>ECMWF 40 Years reanalysis ERA40 ERA-40 in GRIB</descriptiveKeywords> <topicCategory>NWP Outputs > ECMWF > 40 years reanalysis</topicCategory> <dataExtent> <temporalElement> <beginDateTime>1980-01-01</beginDateTime> <endDateTime>1990-12-31</endDateTime> </temporalElement> …
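How the <request> and <variables> elements might be combined with a user's selection can be sketched as below. The element names and values come from the fragment above; the namespace prefix is dropped for brevity. The build_request function, the dictionary it returns and the validation rules are assumptions, and the actual request syntax of the data repository is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from datetime import date

VGISC = ET.fromstring("""<vgisc>
  <request><class>e4</class><levtype>sfc</levtype><database>marser</database></request>
  <variables>
    <date type='date'><startDate>1980-01-01</startDate><endDate>1990-12-31</endDate></date>
    <param title='Parameter' multiple='1' type='enum'><value>2t</value><value>msl</value></param>
    <time title='Base time' multiple='1' type='enum'>
      <value>0000</value><value>0600</value><value>1200</value><value>1800</value>
    </time>
  </variables>
</vgisc>""")


def build_request(vgisc, selection):
    """Merge the fixed <request> keys with a validated user selection."""
    request = {element.tag: element.text for element in vgisc.find("request")}
    for variable in vgisc.find("variables"):
        chosen = selection[variable.tag]
        kind = variable.get("type")
        if kind == "enum":
            allowed = {value.text for value in variable.findall("value")}
            invalid = [item for item in chosen if item not in allowed]
        elif kind == "date":
            start = date.fromisoformat(variable.findtext("startDate"))
            end = date.fromisoformat(variable.findtext("endDate"))
            invalid = [item for item in chosen
                       if not start <= date.fromisoformat(item) <= end]
        else:
            invalid = []
        if invalid:
            raise ValueError(f"invalid {variable.tag}: {invalid}")
        request[variable.tag] = list(chosen)
    return request


request = build_request(
    VGISC, {"date": ["1985-07-01"], "param": ["2t"], "time": ["0000", "1200"]}
)
```

The same <variables> description could also drive the selection dialogue in the portal: an enum becomes a list of choices and a date becomes a calendar limited to the given range.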
Directory structure • Problem: create a browsable hierarchy of topics, like the “Google directory” (see NCAR’s Community Data Portal) • Not to be confused with the internal “fragment hierarchy”, which is not exposed to the end user • Currently using the element <topicCategory> <topicCategory>NWP Outputs > ECMWF > 40 years reanalysis</topicCategory> • The same product can appear in several locations of the directory <topicCategory>Observations > By Type > Profile > Temp Land</topicCategory> <topicCategory>Observations > By Region > Asia > China</topicCategory> • Usage should be recommended by WMO
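Building the browsable directory from <topicCategory> values can be sketched as a simple tree. The function name build_directory, the nested-dictionary layout and the product identifier "temp.china" are assumptions; only the topic strings and "urn:int.ecmwf.era40.sfc" come from the slides. Each string is split on ">" and the product is attached to the last node, so a product with several <topicCategory> elements appears in several places.

```python
def build_directory(entries):
    """Build a topic tree from (topic_category, product_id) pairs."""
    root = {"children": {}, "products": []}
    for topic, product in entries:
        node = root
        for part in topic.split(">"):
            name = part.strip()
            node = node["children"].setdefault(name, {"children": {}, "products": []})
        node["products"].append(product)
    return root


directory = build_directory([
    ("NWP Outputs > ECMWF > 40 years reanalysis", "urn:int.ecmwf.era40.sfc"),
    # "temp.china" is a made-up identifier for a product listed in two places.
    ("Observations > By Type > Profile > Temp Land", "temp.china"),
    ("Observations > By Region > Asia > China", "temp.china"),
])

top_level = sorted(directory["children"])
by_type = directory["children"]["Observations"]["children"]["By Type"]
by_region = directory["children"]["Observations"]["children"]["By Region"]
```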
Conclusion • The approach taken in the V-GISC should help us support the large variety of XML documents • Nevertheless, the standard is too flexible • A lot of programming is required to support all possible variations • The WMO must provide “best practices” guidelines • How to encode a point in time, how to encode ranges, … • A topic hierarchy must be defined, to create the directory • WMO core metadata need only contain sufficient information for discovery • The rest can be implemented as a series of local extensions, as long as they are not exported or exchanged