240 likes | 373 Views
V-GISC/SIMDAT project – a Virtual GISC. Alfred Hofstadler, Matteo Dell’Acqua ECMWF. Project History:. May 2002: Thirteenth session WMO Regional Association VI: “…agreed that the concept of a Virtual GISC had merit…” June 2002: V-GISC in RA-VI Kick-off Meeting
E N D
V-GISC/SIMDAT project – a Virtual GISC Alfred Hofstadler, Matteo Dell’Acqua ECMWF
Project History: • May 2002: Thirteenth session WMO Regional Association VI: “…agreed that the concept of a Virtual GISC had merit…” • June 2002: V-GISC in RA-VI Kick-off Meeting • Partners: DWD, Meteo France, UK Met-Office, EUMETSAT, ECMWF • Steering Group + 4 working groups: Policy, Data, Communications, Dissemination/Acquisition • 2003: SIMDAT project proposal submitted to EU • 1 September 2004: contract with EU is signed • October 2004: V-GISC steering group decides to move V-GISC development into the SIMDAT project • November 2004: SIMDAT Kick-off meeting • 4 V-GISC working groups are mapped onto SIMDAT working groups: Virtual Organisation, Ontologies, GRID Infrastructure, Access to Distributed Data • February 2005: First (V-)GISC-demonstrator at CBS
SIMDAT - Introduction • Data Grids for Process and Product Development using Numerical Simulation and Knowledge Discovery • 4 years project funded by the EU • Contract with EU was signed on 1 September 2004 • SIMDAT focuses on 4 applications • Product design in automotive and aerospace • Process design in pharmacology • Service provision in meteorology • Objective of SIMDAT is to use data grid technology to resolve a complex problem for each of the 4 applications • Budget of 11 M € of which 10.5% for meteorological activity • 320 men/month taking into account EU funding and the contribution from the partners
SIMDAT - Strategy • 7 Grid-technology areas have been identified to achieving SIMDAT objectives • Integrated Grid infrastructure offering basic services to applications • Access to data distributed on Grid sites • Management of Virtual Organisation • Ontologies • Integration of analysis services • Workflows • Knowledge Services Phase 1: Connectivity Phase 2: Interoperability Phase 3: Knowledge .Virtual Data Repository . Introduction of grid technologies research .Deployment of Grid infrastructure with particular attention to data transport and management . Distributed DB access Workflows for next-generation aggregated knowledge capture, discovery and mining
Meteorology application : Project Aims • 5 partners: DWD, Meteo-France, UK Met Office, EUMETSAT and ECMWF • 3 “potential” GISCs : DWD, Meteo-France, UK Met Office • 2 DCPCs : ECMWF, EUMETSAT • Instead of each National Met Service having a GISC (Global Information System Centre) • The V-GISC will be seen as a normal GISC and will fulfil the WMO Information System technical requirements • The project will build the foundations of the V-GISC by developing an infrastructure that brings together the data of the partners and provides access to the distributed meteorological databases
Meteorology application : Project Aims • A complex problem: To build a Virtual GISC, an integrated and scalable framework for the collection and sharing of distributed data that will offer: • A single view of meteorological information which is distributed amongst the 5 partners • Improve visibility and access to meteorological data through a comprehensive discovery service based on metadata development • Offer a variety of reliable reliable delivery services (routine dissemination of and collection of data) • Provide a global access control policy managed by the partners and integrated into their existing security infrastructure • Quality of services, reliability and security • Processing services and shared data manipulation facilities • The software developed within the project will be made available to WMO
GRID Technology • Grid technology will be used • To connect the diverse data sources and create a Virtual Database • To enable flexible, secure collaboration through virtual organisation • Data Grid technology presents an architectural framework that aims to provide access to distributed data in a simple,secure, reliable and scalable manner from a widely distributed set of computers and across various administrative boundaries • The essential characteristics of a Data Grid are: • Reference a dataset by a unique identifier • Discover dataset by attributes • Track multiple copies of a single file, and ultimately locate the "nearest" copy • Move files from one point on the grid to another point (push, pull and third party copy) • The domain of the V-GISC is an ideal candidate to exploit such a framework
Grid infrastructure for sharing data Monitoring Logging Control Error tracking Security AuthenticationAuthorization Audit Interoperability interfaces for data/metadata exchange Management User registration DB admin Catalogue admin Interface to offer a single view of the data - Discovery facilities - Request/Subscription Dissemination/acquisition mechanisms mechanisms to synchronise metadata V-GISC infrastructure
V-GISC Conceptual view • Through the Distributed Portal users searches for and retrieves data, subscribe to services subject to authentication and authorization • The Virtual Database Service provides a single view of partners databases
V-GISC Conceptual view • Virtual Database • Provide the unified view of all the shared datasets through a distributed catalogue • Maintain the distributed catalogue amongst the partners using synchronization mechanisms • Provide interfaces with the legacy databases • Implement data replication mechanisms • Preserve the integrity of the data • Access Facilities • Collection & Dissemination services that support secure, efficient and reliable transport mechanisms • Quality of Service (QoS): Traffic Prioritization, Queuing mechanisms, Scheduling • Discovery service by browsing the catalogue or using a keyword search engine • Interactive and batch interfaces • VO • Security Services (CA, AuthN, AuthZ, Audit,…) • Users management • Data policy management • Monitoring and control
V-GISC Distributed Architecture • V-GISC node is installed on each partner site • All the nodes are interconnected through a dedicated secure communication channel; The Database Communication Layer (DCL) • All the nodes exchange messages through the DCL • The architecture is decentralized • No central point where all the nodes are declared • No single point of failure • The network of nodes is self-organized • The network dynamically accepts new nodes and is aware of node disconnections • The network organizes its topology and indicates to the entering new nodes their position within the network • No manual intervention on the nodes to accepts new peers
V-GISC Node • Each node maintains a copy of the global catalogue describing data available through the V-GISC • The catalogue synchronization is done using the DCL • Each node maintains a cache used to replicate data and to efficiently serve the users • A node is interfaced with the local legacy databases • A node has a Web Portal for interactive access • A node has a Grid/Web Service Portal for batch access and integration of the V-GISC in a bigger Grid • A node implement all services offered by the V-GISC
Demonstrator – Functional View • To deploy a flexible infrastructure on top of which the Virtual Information Centre can be built • To use Grid technologies to federate databases located on partners site • To show to the user a unique view of data sets stored by at least 3 partners • To get a first implementation of the catalogue based on WMO core metadata • To offer first VO security services
Demonstrator - Design • 3 main components to build the virtual database: Data Repository, Catalogue Node and Portal • installed on each partner site and interconnected through a dedicated secure connection channel • Data Repository • Interface to the partners databases • Offers metadata information to describe, search, locate data • Offers interface to retrieve data from the associated local databases • Catalogue Node • Maintains the catalogue and ensures synchronisation • Harvests metadata and requests data from the data Repository • Ingests data and maintains the cache of the V-GISC • Serves clients: Portal or other Nodes • Monitors the execution of the requests • Distributed Portal • Offers interface to search/browse the V-GISC catalogue
Demonstrator - Architectural Choices • Grid Architecture that can accept any kind of Grid Technology • Free to choose any grid middleware (OGSA-DAI, GRIA, Glite, GT4) and pick the best component of each middleware that meets the V-GISC requirement • Catalogue Node built on a J2EE component framework • Solid framework used in production environment • Includes different services such as persistency, monitoring, configuration, etc • The framework can be seen as a kernel of components where it is easy to add services such as Grid services or Web services • Catalogue duplicated and synchronized on each site • To have a fast discovery (browse & search phase) phase • To have a reliable system (client redirection to another node in case of problems)
Problems and lessons learned - 1 • Grid Middleware • Technology not mature for production environment • Middleware still evolving toward standards (WSRF, WSI, …) • Access to distributed data • No efficient and robust transport mechanism • No mechanism to duplicate and synchronize data • Difficult to ensure data integrity on huge data volumes • OGSA-DAI is promising, easy to understand and use
Problems and lessons learned - 2 • Ontology / Metadata • Meteorological metadata are described using XML WMO-CORE metadata Profile • Metadata description larger than the data • Same information repeated in all metadata records Unnecessary information is circulating over the network • Large metadata records slowing down the Database hosting the catalogue • Universal request language was not a solution to the virtual database problem • VO • No standard tools to manage users and data policies • No standard security policies
What’s next • Finalise the Connectivity phase (by M18/Mar 2006) • Connect EUMETSAT to the Grid (M12-M15/Sep-Dec 2005) • Enhance the architecture (M13-M18/Oct 2005-Mar 2006) • Implement Registration Authority (M16-M17/Jan-Feb 2006) • Improve metadata model (M13-M16/Oct 2005-Jan 2006) • Enhance distributed portal (M14-M16/Nov 2005-Jan 2006) • Introduce acquisition of data (M18-M24/Mar-Sep 2006) • Develop subscription service (M20-M28/May 2006-Jan 2007) • Start developing the Virtual Organisation • Monitoring and management of the system (M18-M24/Mar-Sep 2006) • User management and data access control (M24-M30/Sep 2006-Mar 2007) • Develop the discovery mechanism (M20-M25/May-Oct 2006) • Start testing with other potential GISC • Japan and Australia have expressed interest in joining the SIMDAT project
Global View : Coordination Effort • Metadata • Request-reply mechanism • Exchange of catalogues • Definition on what data should be available and to whom • Virtual Organisation • Standardisation of services • Quality of Service • Security