WWW.GBIF.ORG
GLOBAL BIODIVERSITY INFORMATION FACILITY
The GBIF Information System – an Update
Hannu Saarenmaa
Ecoinformatics Workshop, Brussels, 22 September 2004
Outline
• Objective
• Status of GBIF network
• Data exchange standards
• Protocol standards
• Schema repository requirements

GBIF’s objective is
• to establish a distributed information infrastructure that serves primary biodiversity data
  • with initial focus on species- and specimen-level data,
  • with links to molecular, genetic and ecosystem levels
• to function as a global integrator

Participants have expressed their willingness to...
• Share biodiversity data through nodes
• Formulate and implement the GBIF work programme
  • Voting Participants (to date 24 countries) make a yearly contribution based on GDP
  • GBIF central budget is $3M
  • Associate Participants (to date 15 countries/economies, 22 international organisations) cannot vote, but otherwise participate fully
• Make additional investments in biodiversity information and the necessary infrastructure
  • 90% of investment in GBIF happens within Participants
  • 10% happens centrally, and only for providing the linking mechanism

2. Status of GBIF network
GBIF is building a distributed network of databases for sharing biodiversity data using a web services approach

GBIF network status Q3/2004
• What has been achieved
  • UDDI registry up and running since July 2003
  • Data Portal with global index opened 6 February 2004
  • Several DiGIR provider implementations available
  • Integration of the BioCASe network
  • Regional training workshops
• What to expect in the next 12 months
  • Integration of name providers (on-line)
  • Integration of image data
  • Release of Data Portal software and distributed architecture
  • Scaling up from 41 to 100 million records
  • Uses of data, for example through SpeciesBank

GBIF Architecture
[Figure: GBIF architecture diagram. Components: User; Data Portal; Index; Registry (UDDI) listing Institutions, Providers, Services; Request Marshaller; Metadata Cache; Query Engine; Accounting; Data providers and Name providers offering Provider Services over DiGIR, SOAP, HTTP, or other protocols; Resources with Metadata. Labelled flows: Metadata and name query; Provider query; Available providers; Metadata response; Publish availability; Metadata and statistics; Full data query; Full data response; Synonyms.]

The Registry
“You don’t get very far with web services unless you have a registry...” - Tom Gaskins, uddi.org
• Global yellow pages “marketplace” of shared biodiversity data
• Populated automatically by provider installations
• Based on UDDI (Systinet WASP) and web services
• Directory of Participants and data providers
• Services of the providers, i.e., datasources and datasets offered
• tModels of the standards that must be adhered to
• Open interfaces for portals and specialised search engines
• Registry is available to any portal or search engine

What role for WSDL?
• Links the standards and the UDDI registry
• The various data standards are represented by tModels

How does the GBIF registry work?
[Figure: GBIF UDDI Registry, holding provider registrations and service registrations]
1) GBIF Secretariat and other developers create and populate the registry with descriptions of standards (tModels)
2) Museums and other data providers install data provider packages, which are automatically registered
3) The GBIF Participant is notified of a new provider in its domain, for endorsement as a GBIF data provider
4) A global index queries the registry, caches metadata, and creates a unique identifier for each record (and name)
5) Specialised portals and search engines can be built by anybody to query the registry and the index
6) Scientists, decision-makers, and others can use portals to acquire data sets for analysis and synthesis

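The first four registry steps above can be sketched as a small in-memory model. This is only an illustration, not the UDDI API or the GBIF implementation: the class names `Provider` and `Registry`, their methods, and the rule that only endorsed providers are handed to the index are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    name: str
    country: str
    services: list[str]        # names of the standards (tModels) the provider follows
    endorsed: bool = False     # set once the responsible Participant endorses it


@dataclass
class Registry:
    """Hypothetical stand-in for the registry; not a UDDI client."""
    tmodels: set[str] = field(default_factory=set)
    providers: dict[str, Provider] = field(default_factory=dict)

    def add_tmodel(self, name: str) -> None:
        # Step 1: descriptions of standards are placed in the registry.
        self.tmodels.add(name)

    def register(self, provider: Provider) -> None:
        # Step 2: an installed provider package registers itself.
        unknown = [s for s in provider.services if s not in self.tmodels]
        if unknown:
            raise ValueError(f"unknown standards: {unknown}")
        self.providers[provider.name] = provider

    def endorse(self, name: str) -> None:
        # Step 3: the Participant endorses the new provider.
        self.providers[name].endorsed = True

    def endorsed_providers(self) -> list[Provider]:
        # Step 4 (assumed rule): the index harvests only endorsed providers.
        return [p for p in self.providers.values() if p.endorsed]
```

A portal or index built on such a model would call `endorsed_providers()` and then query each provider in turn; steps 5 and 6 sit on top of that list.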
The Interim GBIF Data Sharing Agreement
• Biodiversity data accessible via the GBIF network are openly and universally available to all users within the framework of the GBIF Data Use Agreement and with the terms and conditions that the data provider has identified in its metadata.
• GBIF does not assert any intellectual property rights in the data that is made available through its network.
• The data provider warrants that they have made the necessary agreements with the original owners of the data that it can make the data available through the GBIF network.
• The data provider makes reasonable efforts to ensure that the data they serve are accurate.
• Responsibility regarding the restriction of access to sensitive data resides with the data provider.
• The data provider includes a stable and unique identifier in their data so that the owner of the data is known and for other necessary purposes.
• GBIF Secretariat may cache a copy and serve full or partial data further to other users together with the terms and conditions for use set by the data provider. Queries of such data through the GBIF Secretariat are reported to the data provider.
• Data providers are endorsed by a GBIF Participant, if applicable, before their metadata is made available by the GBIF Secretariat.
• GBIF Secretariat is not responsible for data content or the use of the data.
• GBIF Secretariat is not liable or responsible, nor are its employees or contractors, for the data contents; or for any loss, damage, claim, cost or expense however it may arise, from an inability to use the GBIF network.

The Interim GBIF Data Use Agreement
• The quality and completeness of data cannot be guaranteed. Users employ these data at their own risk.
• Users shall respect restrictions of access to sensitive data.
• In order to make attribution of use for owners of the data possible, the identifier of ownership of data must be retained with every data record.
• Users must publicly acknowledge, in conjunction with the use of the data, the data providers whose biodiversity data they have used. Data providers may require additional attribution of specific collections within their institution.
• Users must comply with additional terms and conditions of use set by the data provider. Where these exist they will be available through the metadata associated with the data.

Data provider software
• Each system entails
  • Provider software
  • Communication with the DiGIR (or BioCASe) protocol
  • Data standards: Darwin Core, (ABCD,) Dublin Core
  • Configuration for each resource (local existing database)
  • Registration with GBIF UDDI registry
• Turn-key package for easy installation
  • Based on PHP and digir.sourceforge.net code
  • Packaged and supported by GBIF
  • Available now for Linux and Windows
  • Installs automatically

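The "configuration for each resource" step amounts to mapping the columns of an existing local database onto the elements of the shared data standard. A minimal sketch of that idea follows. It is not the configuration format of the GBIF provider package (which is PHP-based); the local column names are invented, and the Darwin Core element names are used only as an illustrative subset.

```python
# Hypothetical mapping from local column names to Darwin Core element names.
# Both sides are illustrative; a real resource would configure its own columns.
COLUMN_TO_DWC = {
    "inst": "InstitutionCode",
    "coll": "CollectionCode",
    "cat_no": "CatalogNumber",
    "sci_name": "ScientificName",
    "ctry": "Country",
}


def to_darwin_core(row: dict) -> dict:
    """Rename mapped columns; drop unmapped columns and empty values."""
    return {
        dwc: row[col]
        for col, dwc in COLUMN_TO_DWC.items()
        if row.get(col) not in (None, "")
    }
```

Keeping the mapping as data rather than code means a new resource can be added by editing a table, which is in the spirit of a turn-key package.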
Turn-key package
• Based on PHP and DiGIR project code
• Available now for Linux and Windows
• Registration with GBIF UDDI registry
• Supported by helpdesk@gbif.org

GBIF Data Repository Tool
• Enables data custodians to manage and publish their own data
• Makes available a simple data warehouse tool for those who want to host datasets for the community
• Upload and manage datasets in document formats such as spreadsheets and XML
• Parses the data into an embedded MySQL database that becomes available to the public as a DiGIR resource
• Owner can revoke release (data is deleted from the database)

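The upload, parse, and revoke cycle can be sketched in a few lines. This is an assumption-laden illustration rather than the tool itself: it uses Python's built-in `sqlite3` in place of the embedded MySQL database named on the slide, accepts only CSV text, and the table layout and function names are invented.

```python
import csv
import io
import sqlite3


def open_repository() -> sqlite3.Connection:
    """Create an in-memory store (SQLite stands in for the embedded MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (dataset TEXT, catalog_number TEXT, scientific_name TEXT)"
    )
    return conn


def upload(conn: sqlite3.Connection, dataset: str, csv_text: str) -> int:
    """Parse an uploaded CSV document, store its rows, and return the row count."""
    rows = [
        (dataset, r["catalog_number"], r["scientific_name"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def revoke(conn: sqlite3.Connection, dataset: str) -> int:
    """Revoke a release: delete the dataset's rows and return how many were removed."""
    cur = conn.execute("DELETE FROM records WHERE dataset = ?", (dataset,))
    conn.commit()
    return cur.rowcount
```

Tagging every row with its dataset name is what makes revocation a single `DELETE`, matching the slide's statement that revoked data is removed from the database.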
Data quality is a problem
• A central data validation service (DVS) is being planned
• The data provider can ask the DVS to run through its data and spot inconsistencies
• Requirement: a data dictionary

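To see why a data dictionary is the prerequisite, consider a sketch in which the dictionary is simply a table of per-field rules and validation is a pass over the records. The DVS was only planned at the time, so nothing here reflects its design; the field names, rules, and the function `find_inconsistencies` are assumptions for illustration.

```python
# Hypothetical data dictionary: field name -> rule that a valid value satisfies.
DATA_DICTIONARY = {
    "ScientificName": lambda v: isinstance(v, str) and v.strip() != "",
    "Latitude": lambda v: isinstance(v, (int, float)) and -90 <= v <= 90,
    "Longitude": lambda v: isinstance(v, (int, float)) and -180 <= v <= 180,
}


def find_inconsistencies(records):
    """Return (record index, field name, offending value) for every failed rule."""
    problems = []
    for i, record in enumerate(records):
        for field_name, is_valid in DATA_DICTIONARY.items():
            # Only fields present in the record are checked; absence is not an error here.
            if field_name in record and not is_valid(record[field_name]):
                problems.append((i, field_name, record[field_name]))
    return problems
```

Without an agreed dictionary there is nothing to check the records against, which is presumably why the slide lists it as the requirement.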
GBIF Data Portal
• Gateway to data of the providers
• Search and browse data by name, country, etc.
• Download data and display simple maps
• Multilingual
• Maintains a cache of key data in case a provider goes off-line
• Opened 6 February 2004
• Based on Java and MySQL, source code available later

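The cache bullet describes a common fallback pattern: try the provider first, and serve the last known copy if it cannot be reached. The sketch below shows that pattern only. The portal itself is written in Java, and the class `CachingPortal`, the `fetch` callable, and the use of `ConnectionError` to signal an off-line provider are assumptions for the example.

```python
class CachingPortal:
    """Serve provider data, falling back to a cached copy when a provider is off-line."""

    def __init__(self, fetch):
        # fetch(provider_id) returns a list of records, or raises ConnectionError
        # when the provider cannot be reached (a hypothetical convention).
        self._fetch = fetch
        self._cache = {}

    def records(self, provider_id):
        try:
            data = self._fetch(provider_id)
        except ConnectionError:
            # Provider off-line: return the last cached copy, or nothing if never seen.
            return self._cache.get(provider_id, [])
        self._cache[provider_id] = data
        return data
```

A real portal would also need to decide how stale a cached copy may become before it is no longer shown; that policy is left out here.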
Name Service: Major component of the global index
[Figure labels: User requests; GBIF Data Portal; Biodiversity Data Index; Taxonomic Name Service (ECAT); Catalogue of Life and other name providers; GBIF Data Nodes; Specimen Data; Observation Data; Name Lists; Links to other data.]

Data exchange standards are the key
Data description in XML
• Institutions, providers, collections, and persons in various roles
• Specimen, observation
• Name, taxonomic concept
• Images
• Characters for identification
• Species information
Standards process
• GBIF works with TDWG
• Discussion, documentation
• Schema repository
• Open source (sourceforge.net)
Standards for protocols and data exchange
• SOAP / UDDI
• Darwin Core / DiGIR
• ABCD / BioCASe
• SDD / BioCASe
• UBIF

Darwin Core (and Mantle)
• TDWG standard in review
• 48 elements
• Metadata almost nonexistent (is in the protocol, not data itself)
• New version 2 being reviewed
• Extensibility wanted
  • Curatorial
  • Bacteriological
  • Paleontological
  • Trappers...

ABCD
• TDWG standard in review
• 300+ elements in a hierarchical structure
• Can model almost anything
• Metadata handling totally different from DiGIR/DwC

Image data standards
• JPEG2000
• Metadata from Dublin Core

SDD (Structured Descriptive Data)
• TDWG standard in review
• Description of characters of organisms
• XML standard for identification key interchange
• Distributed descriplets as semantic web of diagnostic/identification knowledge (cf. CYC)

SDD elements
• <Document>: the root of an SDD document; encloses all other elements.
• <GenerationMetadata>: metadata about the process (application or script) that generated the document.
• <ProjectDefinition>: metadata about the project from which the document data are sourced.
• <Terminology>: a list of characters and their states used to describe the entities described in the document.
• <Entities>: a list of entities (such as taxa and specimens) for which descriptions are provided in the document.
• <Resources>: definitions of resources (images, notes, contributors etc.) referred to elsewhere in the document.
• <Descriptions>: descriptions (either coded or marked-up natural language) of the document's entities.

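The element list above implies a simple document skeleton: one `<Document>` root with the six sections inside it. The sketch below builds that skeleton with Python's standard `xml.etree.ElementTree`. It uses only the element names given on the slide, in the order listed there; the real SDD schema's namespaces, attributes, required child elements, and ordering rules are not represented, so the output should not be taken as a valid SDD document.

```python
import xml.etree.ElementTree as ET

# Section names as listed on the slide; the order is taken from the slide,
# not from the SDD schema.
SECTIONS = [
    "GenerationMetadata",
    "ProjectDefinition",
    "Terminology",
    "Entities",
    "Resources",
    "Descriptions",
]


def sdd_skeleton() -> ET.Element:
    """Build an empty <Document> element containing one empty element per section."""
    root = ET.Element("Document")
    for name in SECTIONS:
        ET.SubElement(root, name)
    return root


if __name__ == "__main__":
    # Print the skeleton as a single XML string.
    print(ET.tostring(sdd_skeleton(), encoding="unicode"))
```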
SpeciesBank
• This is where it all comes together...
• Species home pages mushrooming, but no standard exists for species information pages
• Needed for identification, invasives, pest control, taxonomic review, ...

The DiGIR protocol
• DiGIR is lightweight. It is not SOAP, but could be payload on SOAP.
• XML messaging on top of HTTP
• Used for communication between data providers and data users
• More lightweight and specialised than SOAP
• Enables single point of access (portal/search) to distributed information resources
• Resource: a collection of data objects that conform to a common schema (DB records, XML documents)
• Distributed resources comply with a federation schema
• Enables search and retrieval of structured data
• Search for data values in context (semantics)
• Results are presented as a structured data set
• Makes location and technical characteristics of the native resource transparent to the user
• The Distributed Generic Information Retrieval protocol was created by the TDWG/CODATA subgroup on biological collection data

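The central idea, a single point of access over distributed resources that all comply with one federation schema, can be illustrated without any networking. The sketch below is not the DiGIR message format or API: resources are in-memory lists, and `FEDERATION_SCHEMA`, `Resource`, and `federated_search` are hypothetical names. It shows only why a shared schema lets one query be sent unchanged to every resource and the results merged into one structured set.

```python
# Illustrative two-field federation schema; the real schema is far larger.
FEDERATION_SCHEMA = ("ScientificName", "Country")


class Resource:
    """A collection of records that must conform to the federation schema."""

    def __init__(self, name, records):
        for record in records:
            missing = [f for f in FEDERATION_SCHEMA if f not in record]
            if missing:
                raise ValueError(f"{name}: record lacks fields {missing}")
        self.name = name
        self.records = records

    def search(self, field, value):
        # How the native database is queried stays hidden behind this method.
        return [r for r in self.records if r[field] == value]


def federated_search(resources, field, value):
    """Send the same query to every resource and merge the structured results."""
    if field not in FEDERATION_SCHEMA:
        raise KeyError(f"{field} is not part of the federation schema")
    results = []
    for resource in resources:
        for record in resource.search(field, value):
            results.append({"resource": resource.name, **record})
    return results
```

Each result carries the name of the resource it came from, so the caller can attribute the data while remaining unaware of where or how the resource is hosted.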
A simple DiGIR architecture
• Portals, search engines, and applications developed for various purposes
• Data providers (have one or more databases to share and have installed DiGIR or BioCASe)
• Databases