The NERC DataGrid – Building Bridges for the Environmental Sciences
Bryan Lawrence
Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others
Head, NCAS/British Atmospheric Data Centre
Rutherford Appleton Laboratory, CCLRC
NDG Partners
• As funded, a partnership between:
  • British Atmospheric Data Centre (BADC, PI: Bryan Lawrence)
  • British Oceanographic Data Centre (BODC, Co-I: Roy Lowry)
  • CLRC E-science Centre (Co-I: Kerstin Kleese)
  • PCMDI at LLNL in the US (Dean Williams, Bob Drach, Mike Fiorino)
• The project has caught the imagination; extra funding now supports:
  • A number of groups at the NERC Centre for Ecology and Hydrology (CEH: Ecology DataGrid)
  • NERC Earth Observation Data Centre & Plymouth Marine Lab Remote Sensing
• Major collaborators that are not directly funded will include:
  • ClimatePrediction.net, GODIVA (NERC e-science projects)
  • NCAS/CGAM: The Centre for Global Atmospheric Modelling at the University of Reading (via Lois Steenman-Clark and Katherine Bouton)
• Already required to provide technology to support the major UK project HIGEM (a collaboration between the Hadley Centre and the NERC academic community to develop the next generation of high-resolution GCMs based on HadGEM).
Outline
• Motivation:
  • The BADC, BODC, and the Metadata Gateway
  • The NDG Goal
• NDG Metadata Structures and Architecture
  • Metadata Model
  • Data Model
  • ISO Context
• NDG Prototype Status
• Summary & Challenges
The British Oceanographic Data Centre (not for much longer, moving to a site on Liverpool University campus imminently)
BODC Mission Statement
• To operate a world class data centre in support of UK marine science by:
  • providing data management support for UK marine science projects
  • maintaining and developing the UK’s national oceanographic database
  • developing innovative marine data products and digital atlases
  • collaborating, on behalf of the UK, in the international exchange and management of oceanographic data
  • making high quality data readily available to UK research scientists in academia, government and industry
British Atmospheric Data Centre
The Role
Key words: Curation and Facilitation!
BADC Users
[Chart: users by discipline, November 2002 (2150 users)]
• 3800 registered users in March 2003
• ~300 individual users per month
BADC Storage Capacity
• Approx. 50 TB (Nov 2002)
• Projected to quadruple well within the next couple of years, given existing commitments
• Planning exercise under way now
• Committed to keeping as much as possible on spinning disk
• Further backup and extra storage at the national archival centre (ATLAS, PB soon)
[Diagram label: 2.5Gb]
Querying datasets
• Complex metadata, held in an Ingres database: export DIF and Z39.50
• No possibility of automatic data usage …
Different types of data returned
[Screenshot label: Wallingford]
Supporting a very diverse user community: NetCDF is not enough …
NERC Metadata Gateway - SST
• No clean handover from discovery to browse and use!
• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
  - multiple logins
  - multiple formats
  - discovery?
Metadata Origins
[Diagram labels: Research Group (×3), Satellite, Shared Resources, SuperComputer, DB, Wider Internet]
Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …
To be well integrated, the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)
At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.
[Diagram labels: Research Group (×3), Satellite, Shared Resources, SuperComputer, DB, Wider Internet]
A Google for data; the metadata carrot!
Separate data (A) and metadata (B) models
• Clear separation of function
• Difference between data use and discovery etc.
• “Tuning” of metadata to include relevant detail
• Allows increased reuse of metadata model
• Avoids tie-in to details of a particular field’s data formats
• Can plug in another data model
[Diagram labels: Metadata Model, Data granule ID, Data summary, Data Model]
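The separation between the two models can be sketched in Python. This is only an illustration, not NDG code: the class names `MetadataRecord`, `DataGranule` and `DataModelRegistry`, the field names, and the sample ID and path are all hypothetical. It assumes the metadata side holds nothing but a granule ID and a summary, so that the data model behind the registry can be swapped without touching the metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """Discovery side (model B): only a granule ID and a data summary."""
    granule_id: str
    data_summary: str


@dataclass(frozen=True)
class DataGranule:
    """Usage side (model A): how the data are actually stored."""
    granule_id: str
    file_format: str
    paths: tuple


class DataModelRegistry:
    """Resolves granule IDs to data-model objects (hypothetical helper).

    A different data model can be plugged in by supplying another registry.
    """

    def __init__(self):
        self._granules = {}

    def register(self, granule: DataGranule) -> None:
        self._granules[granule.granule_id] = granule

    def resolve(self, record: MetadataRecord) -> DataGranule:
        # The granule ID is the only link between the two models.
        return self._granules[record.granule_id]


registry = DataModelRegistry()
registry.register(DataGranule("example-granule-001", "NetCDF", ("/data/example.nc",)))
record = MetadataRecord("example-granule-001", "Example sea surface temperature field")
granule = registry.resolve(record)
```

Because `MetadataRecord` carries no format or path information, discovery tools written against it would not need to change when a new data format is added on the data-model side.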
(A) NDG Data Model: Overview
• Dataset: named container for a number of variables
• Variable: physical parameters within the dataset; controlled vocabularies, e.g. BODC data dictionary, CF standard names
• Array: multidimensional container for other arrays or numeric data
• Coordinate: may be shared between multiple Arrays; ‘anonymous’ if not georeferenced; MappedCoordinate vs ProductCoordinate; with respect to a Coordinate Reference System (ref. ISO 19111, ISO 19115)
• GranuleDescriptor: describes data granule in terms of file storage; enables file aggregation; SQL/OGSA-DAI for RDBMS; physical or logical (e.g. SRB) files
“Profiles” of model defined for important data types
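The five concepts in this overview can be sketched as plain Python dataclasses. This is an illustrative reading of the slide, not the NDG schema: every attribute name is an assumption, the reference-system string is a placeholder label, and the MappedCoordinate/ProductCoordinate distinction and the RDBMS access route are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Coordinate:
    name: str
    values: List[float]
    reference_system: Optional[str] = None  # None means 'anonymous' (not georeferenced)

    @property
    def is_anonymous(self) -> bool:
        return self.reference_system is None


@dataclass
class Array:
    coordinates: List[Coordinate]
    # An Array may contain numeric data or other Arrays.
    contents: List[Union["Array", float]] = field(default_factory=list)

    @property
    def shape(self):
        return tuple(len(c.values) for c in self.coordinates)


@dataclass
class Variable:
    standard_name: str  # drawn from a controlled vocabulary, e.g. CF standard names
    array: Array


@dataclass
class GranuleDescriptor:
    files: List[str]       # several files can be aggregated into one granule
    logical: bool = False  # True for logical rather than physical files


@dataclass
class Dataset:
    name: str
    variables: List[Variable] = field(default_factory=list)
    granules: List[GranuleDescriptor] = field(default_factory=list)


# Coordinates are shared between the two Arrays rather than copied.
lat = Coordinate("latitude", [50.0, 51.0], reference_system="WGS84")
lon = Coordinate("longitude", [-5.0, -4.0, -3.0], reference_system="WGS84")
level = Coordinate("model_level", [1.0, 2.0])  # anonymous coordinate

sst = Variable("sea_surface_temperature", Array([lat, lon]))
air_t = Variable("air_temperature", Array([level, lat, lon]))
dataset = Dataset(
    "example",
    [sst, air_t],
    [GranuleDescriptor(["example_0001.nc", "example_0002.nc"])],
)
```

Sharing the `lat` and `lon` objects between both variables is one way to express "may be shared between multiple Arrays": two fields on the same grid can then be recognised as such by identity rather than by comparing values.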
NDG Data Model
[Diagram; visible label: Array]
(B) Metadata Model: an NDG Intermediate Schema, Conceptual Overview
ISO TC211
• ISO 19101: Geographic information – Reference model
• ISO 19103: Geographic information – Conceptual schema language
• ISO 19107: Geographic information – Spatial schema
• ISO 19108: Geographic information – Temporal schema
• ISO 19109: Geographic information – Rules for application schema
• ISO 19111: Geographic information – Spatial referencing by coordinates
• ISO 19115: Geographic information – Metadata
• ISO 19118: Geographic information – Encoding
• ISO 19121: Geographic information – Imagery and gridded data
ISO19115
• Dataset responsible party
• Metadata point of contact
• Metadata character set
• On-line resource
• Metadata date stamp
• Metadata language
• Metadata file identifier
• Metadata standard name
• Distribution format
• Metadata standard version
• Spatial resolution of dataset
• Reference system
• Dataset character set
• Spatial representation type
• Dataset title
• Dataset language
• Abstract describing dataset
• Dataset topic category
• Dataset reference date
• Geographic location of dataset
• Lineage
• Vertical/temporal extent for dataset
ISO
• Metadata extensions and profiles
Direct relationship between ISO19115 and our (B) Intermediate schema.
ISO19101
• Profiling of ISO 191xx: “The comprehensiveness and large number of options available in various base standards make it difficult to combine them for practical applications. … A profile integrates a set of base standards and/or modules (predefined subsets) of base standards to meet a specific implementation requirement.”
• Registration of profiles: “A profile that is registered through an ISO registration procedure becomes an International Standardized Profile (ISP). National standards that are expressed as profiles of ISO base standards may be registered at a national level.”
Further Application in NERC DataGrid
• e.g. Data model “Coordinates”
[Diagram labels: ISO 19111, ISO 19108]
Key Components – need APIs and standards
[Diagram labels: Harvest, Globus]
NDG Discovery Service Element Traditional and Grid Service (GT3) Interfaces
Starting with the LAS
Deployment for UK users within a few weeks (constraint is primarily access control)
ERA40 LAS – Simple Box Fill Output
Work for us to do: labelling is inadequate as yet …
Cache management in LAS/CDAT
[Flowchart components: Internet User; LAS; CDAT; BADC/CDAT localCache.py; Grid Cache (ERA-40, < 1 TB); ERA-40 Spectral Archive (4 TB); 18 TB virtual dataset]
• LAS calls cdms.open to open data file.
• BADC/CDAT intercepts command and checks cache.
• Locks access to cache. Checks if regular gridded file is in cache list (YES/NO branch).
• If not in the cache, the spectral file is converted on-the-fly and placed in cache.
• Cache unlocked. New cdms.open command sent to CDAT and cache file opened.
• CDAT data object delivered to LAS.
• NetCDF file, plot or animations delivered to user.
Cache also checks if enough room, deletes oldest files if necessary and checks against disk space limit.
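The cache logic on this slide (lock, look up, convert on a miss, delete oldest files to stay under a disk limit) can be sketched as follows. This is not the contents of localCache.py, which is not shown in the slides: the `GridCache` class, its method names, the `.grid.nc` naming and the `convert` callable standing in for the spectral-to-grid conversion are all hypothetical, and file sizes are tracked in memory rather than on disk. No real CDAT or cdms calls are made.

```python
import threading
from collections import OrderedDict


class GridCache:
    """In-memory model of a size-limited cache of gridded files."""

    def __init__(self, limit_bytes, convert):
        self._limit = limit_bytes
        self._convert = convert        # hypothetical: spectral name -> size of gridded file
        self._entries = OrderedDict()  # gridded name -> size, oldest first
        self._lock = threading.Lock()

    def _make_room(self, needed):
        if needed > self._limit:
            raise ValueError("file is larger than the cache disk space limit")
        # Delete oldest files until the new one fits under the limit.
        while self._entries and sum(self._entries.values()) + needed > self._limit:
            self._entries.popitem(last=False)

    def open(self, spectral_name):
        """Return (gridded file name, was it already cached?)."""
        gridded_name = spectral_name + ".grid.nc"
        with self._lock:  # lock access to the cache
            if gridded_name in self._entries:
                return gridded_name, True
            size = self._convert(spectral_name)  # convert on the fly
            self._make_room(size)
            self._entries[gridded_name] = size
            return gridded_name, False

    def cached_names(self):
        with self._lock:
            return list(self._entries)


# Usage: every converted file is 40 bytes, and the cache holds at most 100 bytes.
cache = GridCache(limit_bytes=100, convert=lambda name: 40)
first = cache.open("a")
second = cache.open("b")
repeat = cache.open("a")  # already cached, no conversion
third = cache.open("c")   # does not fit: the oldest file ("a") is deleted
names_after = cache.cached_names()
```

Two simplifications are worth noting. The lock is held during conversion, which keeps the sketch short but would serialise all requests; and "oldest" is taken to mean oldest by creation, so a cache hit does not refresh a file's age.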
NERC DataGrid Prototype
• (By hand) ingestion of ACSOE data from BADC and BODC
• NASA GCMD DIF based discovery
  • Exported from Intermediate Schema
  • Harvested by hand
• Working on hand-over mechanism to pass dataset info to DataModel based LAS service
  • Generate and populate LAS database in response
  • Use standard LAS delivery
Next Steps:
• GT3 based services, improve LAS, improve delivery, implement multiple datamodel profiles, implement multiple discovery services.
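The "exported from Intermediate Schema" step can be illustrated with a small Python sketch that writes a DIF-style XML fragment. The intermediate-schema record here is a made-up dictionary with hypothetical keys (`id`, `title`, `abstract`), and only three DIF element names are emitted (`Entry_ID`, `Entry_Title`, `Summary`); a real DIF record needs many more fields, so the output is not a complete or valid DIF.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from intermediate-schema keys to DIF element names.
FIELD_MAP = (
    ("id", "Entry_ID"),
    ("title", "Entry_Title"),
    ("abstract", "Summary"),
)


def export_dif(record: dict) -> str:
    """Serialise a minimal DIF-style fragment from an intermediate record."""
    root = ET.Element("DIF")
    for key, tag in FIELD_MAP:
        ET.SubElement(root, tag).text = record[key]
    return ET.tostring(root, encoding="unicode")


# Placeholder record, not real NDG metadata.
record = {
    "id": "example-001",
    "title": "Example dataset",
    "abstract": "Placeholder summary text.",
}
dif_xml = export_dif(record)
```

Keeping the mapping in one table means that a second discovery format could be supported by adding another table rather than another exporter, which is in the spirit of the "multiple discovery services" next step.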
Summary
NDG project running for a year now, aiming to provide grid-enabled tools to support:
• a diverse community
• with diverse datasets
NDG is part of the UK National E-science programme, and will leverage off other projects to implement grid solutions.
• initial prototype web-service based
• GT3 prototype due early in the new year
Software development based on plagiarising the maximum amount from other groups, and a standards based approach within the NDG.
• All code will be in the public domain
Major challenge will not be technical: policy, attitudes, legal issues.