BIS TDWG Conference, New Orleans, 2011. GBIF: Issues in providing federated access to digital information related to biological specimens. David Remsen, Senior Programme Officer, Global Biodiversity Information Facility (GBIF).
3 issues • Issue #1: The consequences of scale • Issue #2: Geospatial integration • Issue #3: Taxonomic integration
Issue #1: The consequences of scale Goal – Provide timely access to a large federated network of biodiversity databases
About GBIF The mission of the Global Biodiversity Information Facility (GBIF) is to facilitate free and open access to biodiversity data worldwide via the Internet to underpin sustainable development. • 341 publishers • 9290 datasets • 310M records • 57 countries • 45 organisations
“Wrapper” Software: install one of these ‘wrappers’ between your database and the network. Data standard / wrapper / your database: • ABCD / PyWrapper (Python) / Herbarium • DarwinCore / TAPIR Link (PHP) / Bird Observations • DarwinCore / DiGIR (PHP) / Insect Collection
The promise of federation: GBIF Data Portal as a Gateway. User: “Any specimens from Thailand?” GBIF Data Portal: “I will ask!” Of the four providers (Insect Collection, Herbarium, Bird Observations, Herbarium), three answer “I do!” and one answers “Nope!”
The challenge of federation. GBIF Data Portal: “Hello?” Replies from the providers (Insect Collection, Herbarium, Bird Observations, Herbarium) include “Hi!” and “Server Not Available”.
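The availability problem on this slide can be illustrated with a minimal sketch. The `Provider` class, the `ProviderUnavailable` exception, and the sample records are all hypothetical stand-ins, not GBIF code; the point is only that a live fan-out query returns an incomplete answer whenever one provider is offline.

```python
from dataclasses import dataclass, field


class ProviderUnavailable(Exception):
    """Raised when a (hypothetical) provider server cannot be reached."""


@dataclass
class Provider:
    name: str
    records: list = field(default_factory=list)  # each record is a dict
    online: bool = True

    def search(self, country):
        # A real wrapper would answer over HTTP; here we only simulate it.
        if not self.online:
            raise ProviderUnavailable(self.name)
        return [r for r in self.records if r["country"] == country]


def federated_search(providers, country):
    """Ask every provider live; collect hits and note who did not answer."""
    hits, unavailable = [], []
    for provider in providers:
        try:
            hits.extend(provider.search(country))
        except ProviderUnavailable:
            unavailable.append(provider.name)
    return hits, unavailable


# Made-up sample data for illustration only.
providers = [
    Provider("Insect Collection", [{"id": "i1", "country": "Thailand"}]),
    Provider("Herbarium", [{"id": "h1", "country": "Thailand"}], online=False),
    Provider("Bird Observations", [{"id": "b1", "country": "Peru"}]),
]
hits, unavailable = federated_search(providers, "Thailand")
print(hits)         # only the Insect Collection record is returned
print(unavailable)  # the Herbarium record is silently missing from the answer
```

In this sketch the Herbarium holds a matching record, yet the user never sees it because the server was down at query time.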
The rise of Indexing: GBIF Data Portal as a Data Index. User: “Any data records from Thailand?” GBIF Data Portal (now with Data!) to each provider (Insect Collection, Herbarium, Bird Observations, Herbarium): “Send me a copy of your data.”
The wrong tools for the job. User: “Any data records from Thailand?” GBIF Data Portal (now with Data!) to each provider (Insect Collection, Herbarium, Bird Observations, Herbarium): “Send me a copy of your data once per month.” Speech bubbles: “If I go offline, start again.” “You ask the same questions every time.” “Here is page one. Not too fast!”
TAPIR request example • dataset of 260,000 specimens • 200 records retrieved per request • requires 1300 request/response pairs • over 9 hours to complete • 500 MB of XML data is transferred • becomes 32 MB text file in the GBIF server • 32 MB is compressible to 3 MB zip file
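The figures on this slide can be checked with a few lines of arithmetic. All inputs come from the slide; the 9 hours is treated as a lower bound because the slide says "over 9 hours", so the per-request time is likewise a minimum.

```python
import math

records = 260_000        # specimens in the dataset
page_size = 200          # records retrieved per request
total_hours = 9          # "over 9 hours", so a lower bound
xml_mb, text_mb, zip_mb = 500, 32, 3

# Number of request/response pairs needed to page through the dataset.
requests = math.ceil(records / page_size)          # 1300

# Average wall-clock time spent per request/response pair (at least).
seconds_per_request = total_hours * 3600 / requests

# How much larger the XML transfer is than the text and zip forms.
xml_overhead = xml_mb / text_mb                    # 15.625
xml_vs_zip = xml_mb / zip_mb                       # roughly 167

print(requests)
print(f"{seconds_per_request:.1f} s per request")
print(f"XML is {xml_overhead:.1f}x the text file and {xml_vs_zip:.0f}x the zip")
```

Each page therefore takes about 25 seconds on average, and the XML envelope moves well over a hundred times more bytes than a zipped text file holding the same records.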
Darwin Core Archives A text-based solution to publishing biodiversity data
A Refined Approach. User: “Any data records from Thailand?” Speech bubbles: “This is fast!” “This is easy.” Each provider (Insect Collection, Herbarium, Bird Observations, Herbarium) publishes its data at a URL for the GBIF Data Portal (now with Data!) • reduce latency • index very large data sets
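As a rough illustration of why a zipped text file is easy to index, the sketch below builds a tiny archive in memory and filters it by country. It is a simplification under stated assumptions: a real Darwin Core Archive also carries a `meta.xml` descriptor that maps columns to terms, whereas this sketch assumes a single tab-delimited file named `occurrence.txt` with a header row. The file name, the helper functions, and the sample rows are illustrative, not taken from the slides.

```python
import csv
import io
import zipfile

FIELDS = ["occurrenceID", "scientificName", "country"]  # Darwin Core term names


def build_archive(rows):
    """Write rows to a tab-delimited text file inside an in-memory zip."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("occurrence.txt", text.getvalue())  # assumed file name
    return buf.getvalue()


def records_from_archive(archive_bytes, country):
    """Read the core file straight from the zip and filter by country."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        with zf.open("occurrence.txt") as raw:
            lines = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(lines, delimiter="\t")
            return [row for row in reader if row["country"] == country]


# Made-up sample rows for illustration only.
archive = build_archive([
    {"occurrenceID": "occ-1", "scientificName": "Oenanthe oenanthe", "country": "Thailand"},
    {"occurrenceID": "occ-2", "scientificName": "Archilochus colubris", "country": "United States"},
    {"occurrenceID": "occ-3", "scientificName": "Oenanthe crocata", "country": "Thailand"},
])
thai = records_from_archive(archive, "Thailand")
print([r["occurrenceID"] for r in thai])
```

The whole dataset arrives in one download from one URL, so there is no paging and no repeated request/response cycle to restart if the connection drops.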
Growth [chart]: records indexed over time. Time axis: 2007, 2008, 2009, 2010, Today. Values: 70 million, 147 million, 180 million, 201 million, 302 million. Annotations: “Need for a new standard identified”; “New standard adopted”.
Issue #2: Geospatial Integration Goal – Provide accurate reporting of nationally-bound data Challenge – Inaccurate recording of geospatial coordinates
Geo-referenced USA data Verbatim data as shared on the network
Issue #2: Geospatial Integration Remediation includes: • Use of country boundary shapefiles to verify that coordinates fall within them • Including EEZ boundaries • Including islands • Outliers identified • Nature of the error qualified (e.g., “coordinates inverted”) • Offending records marked and omitted from display
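The remediation steps above can be sketched in plain Python. This is a minimal illustration, not GBIF's implementation: the boundary is a crude rectangular placeholder rather than a real country or EEZ shapefile, and the function names and error labels other than “coordinates inverted” are assumptions.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def qualify(lat, lon, polygon):
    """Return 'ok' or a label describing the likely nature of the error."""
    if point_in_polygon(lon, lat, polygon):
        return "ok"
    # Try simple transformations; the first that lands inside names the error.
    candidates = {
        "coordinates inverted": (lon, lat),
        "latitude negated": (-lat, lon),
        "longitude negated": (lat, -lon),
    }
    for label, (fixed_lat, fixed_lon) in candidates.items():
        if point_in_polygon(fixed_lon, fixed_lat, polygon):
            return label
    return "outside boundary"


# Placeholder boundary: a rough box, NOT a real national boundary.
BOX = [(97.0, 5.0), (106.0, 5.0), (106.0, 21.0), (97.0, 21.0)]

for lat, lon in [(13.75, 100.5), (100.5, 13.75), (-13.75, 100.5), (48.85, 2.35)]:
    print((lat, lon), qualify(lat, lon, BOX))
```

Records that come back with any label other than `ok` would be the ones marked and omitted from display; the label records why.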
Geo-referenced USA data Data following interpretation • Coastal regions recognised • Offshore islands recognised
Issue #3: Taxonomic Integration • Goal – Provide access to biodiversity data according to taxonomic groups and concepts • Challenge – • Heterogeneous and sometimes inaccurate classification • Same taxon appearing in different classifications • Presence of homonyms that complicate reconciling the above • Misspellings • Wide range of orthographies for the same name
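Two of the challenges listed above, misspellings and varying orthographies, can be illustrated with a small matching sketch. It uses the standard-library `difflib` module as a stand-in for whatever matching GBIF actually uses; the four-name checklist, the `normalise` and `match_name` helpers, and the 0.85 cutoff are all assumptions chosen for the example.

```python
import difflib

# Tiny illustrative checklist; a real one would hold millions of names.
AUTHORITATIVE = [
    "Oenanthe oenanthe",
    "Oenanthe crocata",
    "Trochilidae",
    "Archilochus colubris",
]


def normalise(name):
    """Collapse whitespace, capitalise the first word, lowercase the rest."""
    parts = name.strip().split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def match_name(verbatim, checklist=AUTHORITATIVE, cutoff=0.85):
    """Return (matched name or None, kind of match)."""
    cleaned = normalise(verbatim)
    if cleaned in checklist:
        return cleaned, "exact"
    close = difflib.get_close_matches(cleaned, checklist, n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "unmatched"


print(match_name("  oenanthe   OENANTHE "))   # orthographic variant
print(match_name("Archilocus colubris"))       # misspelling (missing 'h')
print(match_name("Quercus robur"))             # not in the checklist
```

Fuzzy matching of this kind only helps when there is an authoritative list to match against, and it cannot separate homonyms, which need extra context such as the kingdom.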
Enabling authoritative taxonomic data to be published through GBIF
Trochilidae (Hummingbirds), today: misinterpretations (hummingbirds are restricted to the Americas)
Trochilidae (Hummingbirds), next month: improved interpretation
Search for Oenanthe (water dropwort plant or wheatear bird): resolution of homonyms • Today: difficult for user to interpret • Next month: accurate search results
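One way to picture homonym resolution is a lookup that keeps every sense of a name and narrows it with whatever context the record supplies. The sketch below is an assumption about how such a lookup could work, not a description of the GBIF portal; the `HOMONYMS` table and `resolve` function are hypothetical, and the family assignments are included only for illustration.

```python
# Each genus name maps to every taxon that legitimately bears it.
HOMONYMS = {
    "Oenanthe": [
        {"kingdom": "Plantae", "family": "Apiaceae", "common": "water dropwort"},
        {"kingdom": "Animalia", "family": "Muscicapidae", "common": "wheatear"},
    ],
}


def resolve(genus, kingdom=None):
    """Return all candidate taxa, narrowed by kingdom when it is known."""
    candidates = HOMONYMS.get(genus, [])
    if kingdom is None:
        return candidates
    return [c for c in candidates if c["kingdom"] == kingdom]


ambiguous = resolve("Oenanthe")                     # both senses: user must choose
bird = resolve("Oenanthe", kingdom="Animalia")      # narrowed to the wheatear
print(len(ambiguous), [c["common"] for c in bird])
```

A search interface built on this idea can show the two senses separately instead of mixing plant and bird records in one result list.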
In summary • GBIF has had to deploy different data access strategies in order to scale effectively • Darwin Core Archive offers a scalable solution that has led to rapid growth in data published through GBIF • Geospatial filtering via shapefiles provides a basis for more accurate national reporting • Basis for additional services later (e.g., ecosystem shapefiles, protected areas, etc.) • The heterogeneous taxonomy inherent to collections data is nearly impossible to consolidate into a taxonomically accurate structure • Comprehensive authoritative taxonomic data is a key organisational component of collections data