The NERC DataGrid
Bryan Lawrence, BADC
David Boyd (Deputy Director, CLRC e-Science Centre)
Kerstin Kleese (DL: Climate Database Expert)
Roy Lowry (BODC: Marine Database Expert)
Dean Williams (PCMDI: ESG Principal Investigator)
Bob Drach (PCMDI: ESG Metadata Architecture)
Mike Fiorino (PCMDI: Meteorologist)
Acronym Summary:
PCMDI: Program for Climate Model Diagnosis and Intercomparison (US Department of Energy, Lawrence Livermore National Laboratory)
ESG: Earth System Grid (US Grid Project: NCAR, Argonne, PCMDI, USC …)
Outline
• Motivation
• The Earth System Grid
• Definitions of “portals” and applications
• Ontologies
• Relations with other NERC e-science programmes
• Architecture
• Querying
• Software stack
• Initial steps and project management
• Connectivity with other grid projects
• Success and failure
• Summary of what we are doing and the road to the future
The BADC – part of NCAS! The Role: Key words: Curation and Facilitation! http://www.badc.rl.ac.uk
Just under half of BADC users are NOT atmospheric scientists:
Motivation – Town meeting 2001
Summary from two of the four working groups! E-science should be involved with:
• delivering an enhanced metadata record of archived data.
• 'dictionary' building.
• building systems to translate data and link databases.
• integrating computer and natural science communities.
• the ability to generate a single query across multiple datasets (in different catalogues) returning both metadata and data.
• the ability to acquire large datasets in near real time (NRT).
• the automatic production of metadata, both by models and, where possible, by observing systems.
Relevant to many stakeholders
• Energy
• Water Management
• Health
• Weather Risk
• Food Chain
(Slide from Julia Slingo’s introduction to CGAM as part of NCAS)
Motivation Page 22: NERC will …... ensure that Earth system science is underpinned by e-science investments to enable access, manipulation … of data from diverse sources.
NERC Metadata Gateway - SST
• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
  - multiple logins
  - multiple formats
  - discovery?
Searching: need comprehensive metadata!
• A priori, would any user know to look in the COAPEC data set?
• Earth system science means we have to remove these boundaries!
• Detailed file-level metadata isn’t visible, so data mining applications are impossible.
  - need ontologies to help queries match actual data descriptions.
NB: Dynamic catalogues!
What is an Ontology?
• An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:
  - Classes (general things) in the many domains of interest
  - The relationships that can exist among things
  - The properties (or attributes) those things may have
• Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.
Ontology Example
An example of part of an ontology defined using OIL (Ontology Inference Layer; e.g. see Oil in a Nutshell, D. Fensel et al.). The slide annotates the relationships, classes and properties in the listing:
    ontology-definitions
    slot-def eats
        inverse is-eaten-by
    slot-def has-part
        inverse is-part-of
        properties transitive
    class-def animal
    class-def plant
        subclass-of NOT animal
    class-def tree
        subclass-of plant
    class-def branch
        slot-constraint is-part-of
            has-value tree
    class-def leaf
        slot-constraint is-part-of
            has-value branch
    class-def defined carnivore
        subclass-of animal
        slot-constraint eats
            value-type animal
    class-def defined herbivore
        subclass-of animal
        slot-constraint eats
            value-type plant OR (slot-constraint is-part-of has-value plant)
    class-def giraffe
        subclass-of animal
        slot-constraint eats
            value-type leaf
    class-def lion
        subclass-of animal
        slot-constraint eats
            value-type herbivore
With current funding, the NDG does not aim to build a formal ontology, but we do aim to begin to build a thesaurus that can form the basis of one, and we do hope to spin off a project to build one and integrate it in the NDG.
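To make the OIL example a little more concrete, here is a minimal Python sketch of how the subclass-of links and the transitive is-part-of relation could be followed. It is not an OIL reasoner and is not NDG code; the dictionary and function names (`SUBCLASS_OF`, `PART_OF`, `ancestors`, `is_part_of_a`) are hypothetical and chosen only for illustration.

```python
# Hypothetical, minimal encoding of a few statements from the OIL example.
# This is not an OIL reasoner; it only follows subclass-of and is-part-of links.

SUBCLASS_OF = {
    "tree": "plant",
    "giraffe": "animal",
    "lion": "animal",
}

# is-part-of is the inverse of has-part, which the example declares transitive.
PART_OF = {
    "leaf": "branch",
    "branch": "tree",
}


def ancestors(term, relation):
    """Follow a single-valued relation upwards and return every term reached."""
    reached = []
    while term in relation:
        term = relation[term]
        reached.append(term)
    return reached


def is_part_of_a(term, cls):
    """True if `term` is (transitively) part of something that is a `cls`."""
    for whole in ancestors(term, PART_OF):
        if whole == cls or cls in ancestors(whole, SUBCLASS_OF):
            return True
    return False


if __name__ == "__main__":
    print(ancestors("leaf", PART_OF))
    print(is_part_of_a("leaf", "plant"))
```

Under these assumptions a leaf counts as part of a plant (leaf, branch, tree, plant), which is the kind of inference the herbivore definition relies on when a giraffe is said to eat leaves.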
ESG: Example of a Web-based Data Portal
• ESG will provide support for:
  - large but simple data sets,
  - limited metadata, but not searchable.
• NDG will provide support for:
  - small-but-complex datasets,
  - data mining (searchable metadata).
• NDG is complementary to ESG!
Live Access Server (1) … we will keep the basic structure, but gradually replace components.
Live Access Server (2) Data Request Structure:
ESG: Example of a Client Application
• We will:
  - Provide Python-based classes for our observational data to complement the access to 3D gridded data.
  - Provide a web services wrapper so that other grid applications can access NDG data.
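As an illustration of what a Python class for observational data might look like, the sketch below keeps the station position and the time stamps together with the values, so that neither is lost when the data are subset. The class name `StationSeries`, its fields and its methods are all hypothetical; the slides do not describe the actual NDG classes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class StationSeries:
    """Hypothetical container for a time series observed at one station."""
    name: str
    latitude: float    # degrees north
    longitude: float   # degrees east
    variable: str      # illustrative label, e.g. "sea_surface_temperature"
    units: str
    samples: List[Tuple[datetime, float]]  # (time, value) pairs

    def subset(self, start: datetime, end: datetime) -> "StationSeries":
        """Return a new series restricted to start <= time < end."""
        kept = [(t, v) for t, v in self.samples if start <= t < end]
        return StationSeries(self.name, self.latitude, self.longitude,
                             self.variable, self.units, kept)

    def mean(self) -> float:
        """Arithmetic mean of the values; raises ValueError if empty."""
        if not self.samples:
            raise ValueError("no samples")
        return sum(v for _, v in self.samples) / len(self.samples)
```

A subset carries the same coordinates and units as its parent, so a comparison between two stations does not require going back to the original files.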
Relationship to GODIVA (Haines et al.) (Grid for Ocean Diagnostics, Interactive Visualisation and Analysis)
Architecture of the GODIVA Grid:
The GODIVA team have already discovered issues with the XML database interface they are going to use.
• NDG will:
  - improve data discovery tools for GODIVA (even for their own datasets).
  - provide metadata creation tools for GODIVA participants.
  - provide access to data held outside GODIVA participants.
ClimatePrediction.com
[Architecture diagram. Labels: Scientific investigators; Participants & policy-makers; Summary statistics; HTTP; HTTP (DODS URL); Live Access Server; Obs; ESG-II/NERC DataGrid; Peer-to-peer visualisation; Datamining; GridFTP; 100Tb of key output at 10-20 sites; Conventional FTP/HTTP; 1Pb total output on 1M participants’ PCs]
CP.COM will need the NDG to make best use of observational data in evaluating their parameter space.
Mining on the Grid
[Diagram. Labels: Satellite Data; Archive X; Archive Y; Grid Mining Agents; Grid Processors]
From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002
Data mining: Grid Miner Architecture
[Diagram. Labels: Satellite Data; Archive X; Archive Y; IPG Mining Agents; IPG Processors; Mining Daemon; Mining Operations Repository; Mining Config Info; Control Database]
The devil is in the detail: how does the data mining agent get at the data? Need data mining clients: objects which can read specific datatypes and present themselves to agents!
From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002
Finding data: Querying!
• Requires databases of metadata & querying those databases.
• Each part of the NDG will have an internal metadata catalogue (&/or database), and data (either in flat files or the database), so the querying strategy must support centralised querying on partially indexed data, followed (if necessary) by distributed querying, which may or may not need mapping into a local database schema.
• In the grid environment the indexes themselves will be replicated, and some data may also be replicated.
• Major NDG design issue: developing appropriate data models, database schema and indexing strategies! This is not a generic problem; it will be specific to our datatypes.
• Technology needs to be public domain (i.e. free) for uptake!
• NDG approach to database technology will be developed in conjunction with DBTF.
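The two-stage querying strategy described on this slide (a centralised query on a partial index, followed where necessary by distributed queries mapped into each local schema) can be sketched as follows. This is only an illustration under stated assumptions: the site names, parameter names, dataset identifiers and function names are invented, and plain dictionaries stand in for the real index and catalogues.

```python
# Hypothetical two-stage query: a central, partial index narrows the search,
# then only the matching sites are asked to search their own catalogues.

CENTRAL_INDEX = {
    # generic term -> sites known to hold matching data (partial index)
    "sea_surface_temperature": ["site_a", "site_b"],
    "ozone": ["site_a"],
}

LOCAL_TERM_MAP = {
    # per-site mapping from the generic term to the local schema's name
    "site_a": {"sea_surface_temperature": "SST", "ozone": "O3"},
    "site_b": {"sea_surface_temperature": "sea_temp"},
}

LOCAL_CATALOGUES = {
    # per-site catalogue: local parameter name -> dataset identifiers
    "site_a": {"SST": ["a/sst_1990s"], "O3": ["a/ozone_profiles"]},
    "site_b": {"sea_temp": ["b/cruise_17", "b/buoy_3"]},
}


def query_site(site, generic_term):
    """Distributed step: map the term to the local schema and look it up."""
    local_name = LOCAL_TERM_MAP[site].get(generic_term)
    if local_name is None:
        return []
    return LOCAL_CATALOGUES[site].get(local_name, [])


def find_datasets(generic_term):
    """Central step first, then query only the sites the index points to."""
    results = {}
    for site in CENTRAL_INDEX.get(generic_term, []):
        hits = query_site(site, generic_term)
        if hits:
            results[site] = hits
    return results
```

In this sketch the user asks one question in generic terms and never sees the local names (`SST`, `sea_temp`); the per-site term map plays the role of the mapping into a local database schema.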
Information Structure
[Diagram legend: Joint Interfaces; PCMDI Components; NDG Components; Existing Components]
Simplified Software Stack
Key point: make use of existing technology, allow component replacement with time!
Achievable by: interface definition and integration.
Note: any application will be able to access our data services via the OGSA wrapper in the middleware.
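The idea of allowing component replacement through interface definition can be illustrated with a small Python sketch. The interface `CatalogueSearch` and both implementations are hypothetical stand-ins, not NDG components; the point is only that application code written against the interface does not change when the component behind it is swapped.

```python
from abc import ABC, abstractmethod


class CatalogueSearch(ABC):
    """Hypothetical interface: anything that can answer a keyword search."""

    @abstractmethod
    def search(self, keyword: str) -> list:
        ...


class InMemorySearch(CatalogueSearch):
    """Stand-in for an existing component, backed by a plain dictionary."""

    def __init__(self, records: dict):
        self._records = records

    def search(self, keyword: str) -> list:
        return sorted(name for name, text in self._records.items()
                      if keyword.lower() in text.lower())


class CachingSearch(CatalogueSearch):
    """A later replacement: wraps any CatalogueSearch and caches answers."""

    def __init__(self, backend: CatalogueSearch):
        self._backend = backend
        self._cache = {}

    def search(self, keyword: str) -> list:
        if keyword not in self._cache:
            self._cache[keyword] = self._backend.search(keyword)
        return self._cache[keyword]


def list_matches(service: CatalogueSearch, keyword: str) -> str:
    """Application code depends only on the interface, not the component."""
    return ", ".join(service.search(keyword))
```

Here `list_matches` gives the same answer whether it is handed the original component or its caching replacement.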
Draft Project Schedule Phase One Delivery
Replace with Globus Giggle?
• Next steps include:
  - Replacing the transport layers in the metadata gateway with SOAP
  - Replacing the SGML in the metadata gateway with XML
  - …etc
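For readers unfamiliar with SOAP, the sketch below shows roughly what an XML request wrapped in a SOAP 1.1 envelope could look like when built with Python's standard library. The `Search` and `Keyword` elements and the `urn:example:ndg-metadata` namespace are invented for illustration; the slides do not specify the metadata gateway's message format.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
NDG_NS = "urn:example:ndg-metadata"  # hypothetical namespace, for illustration only


def build_search_request(keyword: str) -> bytes:
    """Wrap a hypothetical 'Search' request in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    search = ET.SubElement(body, f"{{{NDG_NS}}}Search")
    ET.SubElement(search, f"{{{NDG_NS}}}Keyword").text = keyword
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    print(build_search_request("ozone").decode("utf-8"))
```

The resulting bytes would form the body of an HTTP POST to the service; the transport itself is left out of this sketch.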
Connectivity? Innovation? Evolution!
Plagiarism: copying from one person.
Research: copying from many people.
… we can’t afford to be too innovative!
Indicators of Success
• Finding and making use of data:
  - Possible to find, reformat, and visualise disparate datasets from disparate organisations within one application.
  - No longer necessary to rely on personal contacts to locate and acquire data of interest if it’s held in the BADC/BODC.
  - Key requirement for interdisciplinarity: the ability to test data comparison ideas without learning foreign formats and establishing personal relationships every time.
  - Other NERC designated data centres implementing NDG.
• Take-up by community:
  - NDG software (but not necessarily graphics tools) in use in the GODIVA project and in the wider UK university community (including data repositories in research groups).
  - Earth System Grid uses NDG components.
Risks of Failure
• Someone else does it first – unlikely!
• Performance too slow for users!
  - More cache and replication
  - Improve database performance (UK DBTF!)
  - Data-compression layer for XML
  - Reduce scope and search depth (don’t want to do this!)
• Globus 3 (OGSA) delivery heavily delayed
  - Web services implementation + Globus2 + datagrid service registry
• Availability of people with appropriate skills
  - Re-deploy existing staff where possible
  - Schedule begins with three months training.
• ESG-II architecture delayed or incompatible with UK architecture
  - Close relationship with PCMDI means we will be able to proceed effectively anyway.
NDG expected evolution
[Diagram with components numbered 1 to 6. Labels: Catalogue Ingestor; XML Catalogue Server; Python API; Catalogue Client; Local Catalogue; Computation; Graphics; Based on LAS; Evolving to OGSA; Data Repositories; NERC DDC; At USER Institution; Other: e.g. PML/ESSC]
Beyond the next three years: The NDG and earth systems science
Extension to the other NERC data centres requires:
• online (or near-line) data.
• appropriate ingestion tools.
• appropriate mappings between discipline-specific metadata and generic metadata.
• GRID-enabling data centres.
• decisions about policy and access.
The NERC DataGrid
Bryan Lawrence, BADC
David Boyd, CLRC E-science
Kerstin Kleese, CLRC E-science
Roy Lowry, BODC
Dean Williams, PCMDI
Bob Drach, PCMDI
Mike Fiorino, PCMDI
Project Management
• Weekly workgroup meetings (teleconference and physical).
• Milestoning code and documentation reviews at quarterly intervals.
• Quarterly liaison with both US colleagues and other NERC projects (GODIVA, ClimatePrediction.com etc).
• Bi-annual target reprofiling.
• Professional project management at the code level: both RAL SSTD and RAL e-Science have considerable experience managing and delivering large software projects.
• Two key tenets of management philosophy:
  - Build early, build often.
  - Evolve from a working system.
The NDG: What will we do?
Key components: BADC/BODC
• Project Management.
• Ingestion tools for station data, Oracle database data, and other data (e.g. PP; includes tools based on ESML and Marine XML).
• Format conversion tools within CDAT.
Ingestion!
• Migrate NERC Metadata gateway to WSDL/SOAP (Zoom?).
Key components: CLRC e-science
• Globus installation at all sites.
• Functional decomposition and interface definitions.
• Search database schema; search software Python API, wrappers.
• Database population.
• Logical to Physical File Manager.
• Amalgamating search API into LAS (or successor), VCDAT, metadata gateway.
• Add data retrieval interfaces into metadata gateway.