Integrating Diverse Sources of Scientific Data: Is it safe to match on names? Prof. Jessie Kennedy
Exploiting Diverse Sources of Scientific Data • The wealth and diversity of scientific data collected and stored is growing rapidly • Increase in automation • Genetic sequencing, remote sensing, astronomy satellites • Decrease in technological costs • Computers are more powerful and disk space is greater for the same £ • Huge potential for scientific discovery by exploiting this data • especially multi-disciplinary research • The number, complexity and diversity of resources make this a difficult task • Case Study • Data Integration • Matching data sets on biological names
SEEK • Science Environment for Ecological Knowledge • USA National Science Foundation funding • Multidisciplinary project • Biology: Ecology, Taxonomy • Environmental science: Geography, Remote sensing, Meteorology, Climatology • Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction
The SEEK Prototype: Ecological Niche Modeling • Inputs: biodiversity information (e.g. data from museum specimens, ecological surveys) and geospatial and remotely sensed data • Geographic space: occurrence points on the native species distribution • Ecological space: model of the niche in ecological dimensions (e.g. precipitation, temperature) • Project back onto geography • Native range prediction • Invaded range prediction • Results taken to integrate with other data realms (e.g., human populations, public health, etc.)
Species prediction map • Predicted Distribution: Amur snakehead (Channa argus) • Image from http://www.lifemapper.org
SEEK - Informatics Challenges • Data is Distributed • Data is Heterogeneous • Syntax • e.g. Text, Excel, Relational Database, … • Schema • e.g. Names of the tables, columns in tables • Semantics (the principal focus for SEEK) • From many disciplines • Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments, … • Data on economics, demographics, legal issues, …
SEEK Overview • BEAM WG: Biodiversity and Ecological Analysis and Modelling • EcoGrid: Making diverse environmental data systems interoperate • Analysis and Modelling System (Kepler): Modelling scientific workflows • Knowledge Representation WG: Ontologies, Metadata • Taxon WG: Taxonomic name/concept resolution server • Semantic Mediation System: “Smart” data discovery and integration
SEEK Overview: EcoGrid
EcoGrid Resources • Node types: Metacat node, SRB node, VegBank node, DiGIR node, Xanthoria node, Legacy system • Labelled sites: LUQ, AND, HBR, VCR, NTL • Partnership for Interdisciplinary Studies of Coastal Oceans (4) • Natural History Collections (>> 100) • UC Natural Reserve System (36) • Multi-agency Rocky Intertidal Network (60) • LTER Network (24) • Organization of Biological Field Stations (180)
EcoGrid Data Access • EcoGrid registry to discover data sources • EML (Ecological Metadata Language) • Experimental data, survey data, spatial raster and vector data, etc. • XML based • Discovery information • Creator, Title, Abstract, Keyword, etc. • Coverage • Geographic, temporal, and taxonomic extent • Logical and physical data structure • Data semantics via unit definitions and typing • Protocols and methods • DarwinCore • Museum collections
EcoGrid Services • Service to Analysis and Modelling Layer • Interaction with Kepler – Workflows • Interaction with Grid Computing Facilities • Distributed computation • Service to Semantic Mediation Layer • Access to Ontologies; Taxon Services • Access to Legacy Apps • LifeMapper • Spatial Data Workbench
SEEK Overview: AMS (Analysis and Modelling System)
Scientific Workflows • Model the way scientists currently work with data • coordinate export and import of data among software systems • Workflows emphasize data flow • Query EcoGrid to find data • Archive output to EcoGrid with workflow metadata • Output generation includes creating appropriate metadata • The analysis workflow itself becomes metadata • The workflow describes the data lineage as it has been transformed • Derived data sets can be stored in EcoGrid with provenance
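The idea that the workflow itself becomes metadata can be sketched in a few lines of Python. This is a minimal illustration only, not the Kepler API: `run_workflow`, `drop_missing` and `mean` are hypothetical names invented for the example.

```python
def run_workflow(data, steps):
    """Apply each step in order and record its name as lineage metadata."""
    lineage = []
    for step in steps:
        data = step(data)
        lineage.append(step.__name__)
    return data, lineage


def drop_missing(values):
    # Remove missing observations before analysis.
    return [v for v in values if v is not None]


def mean(values):
    return sum(values) / len(values)


# The derived value is stored together with the steps that produced it.
result, lineage = run_workflow([1.0, None, 3.0], [drop_missing, mean])
print(result, lineage)  # 2.0 ['drop_missing', 'mean']
```

Archiving `lineage` alongside `result` is what lets a later user see how the derived data set was produced.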
Scientific workflows • EML provides semi-automated data binding
Kepler: Ecological Niche Model • (200 to 500 runs per species x 2000 mammal species x 3 minutes/run) = 833 to 2083 days
KeplerGrid for Niche Modeling • Grid-enable Kepler • Utilize distributed computing resources • Execute single steps or sub-workflows on distributed machines • (200 to 500 runs per species x 2000 mammal species x 3 minutes/run) / 100 nodes = roughly 8 to 21 days
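The run-time estimates on the two slides above can be checked with a short calculation. The sketch assumes that runs are independent and divide evenly across nodes with no scheduling overhead; `total_days` is a name chosen for this example.

```python
MINUTES_PER_DAY = 24 * 60


def total_days(runs_per_species, n_species, minutes_per_run, n_nodes=1):
    """Wall-clock days, assuming independent runs split evenly across nodes."""
    total_minutes = runs_per_species * n_species * minutes_per_run
    return total_minutes / n_nodes / MINUTES_PER_DAY


for runs in (200, 500):
    serial = total_days(runs, 2000, 3)
    grid = total_days(runs, 2000, 3, n_nodes=100)
    print(f"{runs} runs/species: {serial:.0f} days serial, {grid:.1f} days on 100 nodes")
# 200 runs/species: 833 days serial, 8.3 days on 100 nodes
# 500 runs/species: 2083 days serial, 20.8 days on 100 nodes
```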
SEEK Overview: SMS (Semantic Mediation System)
Metadata • Key information needed to read and machine process a data file is in the metadata • Physical descriptors (CSV, Excel, RDBMS, etc.) • Logical Entity (table, image, ...) and Attribute (column) descriptions • Name • Type (integer, float, string, ...) • Codes (missing values, nulls, ...) • Integrity constraints • Semantic descriptions (ontology-based type systems) • Metadata driven data ingestion
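Metadata-driven ingestion can be illustrated with a small Python sketch. The metadata layout here is a hypothetical simplification (a dictionary of column name, type and missing-value code), not EML itself, and the sample values are invented.

```python
import csv
import io

# Hypothetical metadata for one table: column name -> (type, missing-value code).
METADATA = {
    "species": (str, ""),
    "count": (int, "-999"),
    "area_m2": (float, "NA"),
}


def ingest(text, metadata):
    """Parse CSV text, casting each column by its declared type and
    mapping the declared missing-value code to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for column, (cast, missing_code) in metadata.items():
            value = raw[column].strip()
            row[column] = None if value == missing_code else cast(value)
        rows.append(row)
    return rows


SAMPLE = "species,count,area_m2\nPicea rubens,12,100.0\nCarya floridana,-999,NA\n"
records = ingest(SAMPLE, METADATA)
```

The reader never hard-codes column types or null codes; changing the metadata is enough to ingest a differently coded file.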
Ecological ontologies • What was measured (biomass or photosynthetic solar radiation) • Type of quantity measured (mass, length) • Context of measurement (Psychotria limonensis, wavelength band) • How it was measured (dry weight, total solar radiation)
Semantic Mediation (Data, Ontology, Workflow Components) • Label data with semantic types • Label inputs and outputs of analytical components with semantic types • Use reasoning engine to generate transformation step • Use reasoning engine to discover relevant component
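Component discovery by semantic type can be sketched as a subtype check against a toy ontology. Everything here is an assumption made for illustration: the type names, the component names and the simple parent-lookup loop, which stands in for a real reasoning engine.

```python
# Hypothetical toy ontology: child type -> parent type.
IS_A = {"DryWeightBiomass": "Biomass", "Biomass": "Mass", "Mass": "Quantity"}

# Hypothetical analytical components, labelled with the semantic type they accept.
COMPONENTS = {"biomass_summary": "Biomass", "length_histogram": "Length"}


def is_subtype(semantic_type, ancestor):
    """Walk up the ontology until the ancestor is found or the root is passed."""
    while semantic_type is not None:
        if semantic_type == ancestor:
            return True
        semantic_type = IS_A.get(semantic_type)
    return False


def relevant_components(data_type):
    """Names of components whose input type the data satisfies."""
    return sorted(
        name for name, accepted in COMPONENTS.items() if is_subtype(data_type, accepted)
    )


print(relevant_components("DryWeightBiomass"))  # ['biomass_summary']
```

Data labelled only as `Mass` matches no component here, because a general mass is not known to be a biomass.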
Data integration • Homogeneous data integration • Integration via EML metadata is relatively straightforward • Heterogeneous data integration • Requires advanced metadata and processing • Attributes must be semantically typed • Collection protocols must be known • Units and measurement scale must be known • Measurement relationships must be known • e.g., that ArealDensity = Count / Area
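The measurement relationship ArealDensity = Count / Area is what allows a data set of raw counts to be combined with one that already reports densities. A minimal sketch, with invented plots and values and assuming both data sets use square metres:

```python
def areal_density(count, area):
    """ArealDensity = Count / Area; the area must be positive."""
    if area <= 0:
        raise ValueError("area must be positive")
    return count / area


# Hypothetical data sets: one reports counts per plot, the other densities.
counts = [{"plot": "A", "count": 12, "area_m2": 100.0}]
densities = [{"plot": "B", "density_per_m2": 0.2}]

# Convert the count data to densities, then pool the two sources.
integrated = [
    {"plot": r["plot"], "density_per_m2": areal_density(r["count"], r["area_m2"])}
    for r in counts
] + densities
```

Without knowing the units and the relationship between the attributes, the two columns could not be merged safely.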
Simple Example
Life Sciences Data • Much of the data gathered in ecological studies and used in ecological data analysis is bio-referenced data • typically organisms are referenced by a Latin name • e.g. Picea rubens • Many analyses require integrating data • originating in many locations and • at various points in time • For most bio-referenced data, integration involves matching on organism name • SEEK Taxon is investigating the associated issues
Biological (Scientific) Names • Used for communicating information about known organisms and groups of organisms – taxa • Framework for all biologists to communicate… • Arise from taxonomists applying them to species and higher taxa following classification • Formalized according to strict codes of nomenclature • which differ depending on kingdom • Use a Latin naming scheme • polynomial for species and below; monomial for genus and above • Quoted as: LatinName NameAuthors Year • Example: Carya floridana Sarg. 1913 • Can cause problems in data analysis…
Classification, Concepts & Names (diagram) • A pile of specimens is classified into a taxonomic hierarchy • Genus level: Taxon_concept_a • Species level: Taxon_concept_b, Taxon_concept_c, Taxon_concept_d • Type specimens
Classification, Concepts & Names (diagram, continued) • Pile of specimens • classify • Taxon_concept_d
Taxonomic history of Aus L. 1758 (diagram; panels (i) to (v) each show the genus Aus L.1758) • Legend: publication, specimen, type specimen, genus concept, genus name, species concept, species name • Publications of Taxonomic Revisions • In Linnaeus 1758: Aus aus L.1758 • In Archer 1965: Archer splits Aus aus L. 1758 into two species, retains the name for one and creates a new one (Aus bea Archer 1965) • In Fry 1989: Fry splits Aus bea Archer 1965 into two species, retains the name for one and creates a new one (Aus cea BFry 1989) • In Tucker 1991: Tucker finds new specimens and combines Aus aus L. 1758 and Aus bea Archer 1965 into one species, retains the name. Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea • In Pargiter 2003: Pargiter decides to re-split Aus aus but believes bea (beus) is in a new genus Xus, giving Xus beus (Archer) Pargiter 2003. Pargiter publishes his revision using Pyle’s corrigendum of the epithet bea to beus and Aus cea to Aus ceus • Publications of Purely Nomenclatural Observation • In Pyle 1990: A diligent nomenclaturist, Pyle (1990), notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989 • bea and cea noted as invalid names and replaced with beus and ceus
Problems with Taxonomic Names • Are not unique • “Re-use” of names with changed definition • Name is ambiguous without definition/context • Subject to alterations and 'corrections' in time • Often recorded inappropriately in datasets • No author and/or year (e.g. Carya floridana) • Abbreviated (e.g. C. floridana) • Internal code (e.g. PicRub for Picea rubens) • Vernacular used (e.g. Scrub Hickory) • Misspelled
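The recording problems listed above can be made concrete with a naive name matcher. This is a sketch of why string matching is fragile, not a recommended algorithm; `normalise` and `candidates` are hypothetical helpers, and "Cxus floridana" is an invented name added only to show how an abbreviated genus becomes ambiguous.

```python
import re


def normalise(name):
    """Reduce a recorded name to 'Genus epithet', dropping author and year."""
    tokens = name.strip().split()
    return " ".join(tokens[:2]) if len(tokens) >= 2 else name.strip()


def candidates(recorded, known_names):
    """Known names that a recorded string could refer to (several, one or none)."""
    key = normalise(recorded)
    genus, _, epithet = key.partition(" ")
    if re.fullmatch(r"[A-Z]\.", genus):  # abbreviated genus such as "C."
        return sorted(
            k for k in known_names
            if k.startswith(genus[0]) and k.split()[1] == epithet
        )
    return [k for k in known_names if k == key]


# "Cxus floridana" is invented for illustration; the other names are from the slides.
KNOWN = ["Carya floridana", "Cxus floridana", "Picea rubens"]

print(candidates("Carya floridana Sarg. 1913", KNOWN))  # ['Carya floridana']
print(candidates("C. floridana", KNOWN))  # ['Carya floridana', 'Cxus floridana']
print(candidates("Scrub Hickory", KNOWN))  # []
print(candidates("PicRub", KNOWN))  # []
```

Even when a single candidate is returned, the match says nothing about which definition of the name the recorder had in mind.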
Taxon Concepts …… • The published expert opinion defining and describing a group of organisms which are given a (scientific) name • Scientific names qualified with a reference to the definition of a concept • Should be used for communicating about groups of organisms • Comparing or integrating data based on taxon concepts will be more accurate
Taxon Concepts… • Created by someone - an Author • Described in a Publication • Given a Name • Related to the type specimen • Definition • Referenced by • Full Scientific name + “according to” (Author + Publication + Date) Definition • Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
Taxon Concepts …… • Defined by • set of Specimens examined during classification • set of common Characters • context dependent; differentiate taxa rather than fully describe them; • use natural language with all its ambiguities • relationships to other Taxon Concepts • Taxon circumscription • the lower level taxa • Congruence, overlap, includes etc. to taxa in other classifications
Taxon Concepts …… • Original concept • 1st use of name as described by the taxonomist • same author + date in scientific name and “according to” • Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913) • TC_a • Revised concept • Re-classification of a group • Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997) • TC_b • Relationship between the taxon concepts • TC_b includes TC_a
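The two concepts on this slide share a name but differ in their "according to" reference, which a small data structure makes explicit. The class and the `safe_to_merge` rule are assumptions for illustration (one possible reading of "includes"), not part of any SEEK software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    name: str          # full scientific name
    according_to: str  # author, publication and date defining the concept


TC_a = TaxonConcept(
    "Carya floridana Sarg. (1913)",
    "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)",
)
TC_b = TaxonConcept(
    "Carya floridana Sarg. (1913)",
    "Stone, Flora of North America 3:424 (1997)",
)

# Relationships as (subject, relation, object) triples: "TC_b includes TC_a".
RELATIONSHIPS = {(TC_b, "includes", TC_a)}


def safe_to_merge(source, target):
    """Assumed rule: data labelled `source` may be pooled under `target`
    if the concepts are identical or `target` includes `source`."""
    return source == target or (target, "includes", source) in RELATIONSHIPS


print(TC_a.name == TC_b.name)     # True: the names match
print(TC_a == TC_b)               # False: the concepts differ
print(safe_to_merge(TC_a, TC_b))  # True
print(safe_to_merge(TC_b, TC_a))  # False
```

Matching on `name` alone would treat the two records as interchangeable in both directions, which the relationship does not support.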
Legacy Data … • In legacy data names often appear in place of concepts • Names are imprecise • inappropriate for referring to information regarding taxa • e.g. observational/collection data • BUT…sometimes that’s all we have • How do we interpret names?….. • potentially multiple definitions • the sum of all definitions that exist for the name • one of the existing definitions • the “attributes” in common to all the definitions • represented by the type specimen
Names as Taxon Concepts • Nominal concepts • Sub-set of TaxonConcepts • Name but no AccordingTo • non-unique (concept) identifier attributes • can be given a unique concept identifier • No definition • Explicitly saying it’s something with this name • but not really sure what is/was meant by the name • Encourage people to understand and address the issue of names • Allowing mark-up of data with names allows them to believe names are really good enough • Will improve long term usefulness of scientific data • Ease integration
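A nominal concept can be modelled as a concept whose "according to" is absent but which still receives its own identifier. In this sketch the `Concept` class is hypothetical and a random UUID stands in for whatever globally unique identifier scheme is actually used.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    name: str
    according_to: Optional[str] = None  # None marks a nominal concept
    # A random UUID is a stand-in for a real GUID scheme.
    concept_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def is_nominal(self):
        return self.according_to is None


# Two legacy records carrying only a name become two distinct nominal concepts.
legacy_a = Concept("Carya floridana")
legacy_b = Concept("Carya floridana")
revised = Concept("Carya floridana Sarg. (1913)", "Stone, Flora of North America 3:424 (1997)")
```

Keeping `legacy_a` and `legacy_b` distinct records that nobody has yet established whether the two data sets meant the same thing by the name.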
SEEK Taxon’s Message….. • Scientific names are not unique identifiers for biological entities • Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data • Biologists must reference organisms precisely • if datasets are to be of use long term or to other users • Reference by taxon concept rather than name • integrate data for analysis on taxon concepts
Taxonomic Databases • Main taxonomic list servers are still name based • single perspective on taxonomy • don’t represent multiple classifications • unclear what the definition is (don’t even try!) • provide non-standardised interface (web page, xml download) • SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK • Find concepts given a name • Compare concepts • Relate concepts • Mark up ecological data sets with concepts • First • Need data on names and concepts • Need an exchange standard….
Taxon Concept Schema • TCS standard for exchange of taxonomic names/concept data • Taxonomic Databases Working Group (TDWG) • Global Biodiversity Information Facility (GBIF) • XML based exchange schema • Makes heavy use of Globally Unique Identifiers (GUIDs) • Not designed as the “correct way” to model a Taxon Concept • No “rules” as to what a taxon must have • Designed to accommodate different models • Includes Taxon Names • more constrained - the codes of nomenclature • TCS/EML • TCS modifications to EML taxon coverage
Taxon Names and Taxon Concepts • Important to be able to pass names alone • For nomenclatural and some taxonomic purposes • But not for identifications/observations • Taxon Concepts refer to Names • By GUID • Names must not change • Can’t record original taxon concept
Taxon Concept/Name Resolution Server • Taxon Object Server • Schema based on the TCS model • Implements the GUIDs using LSID technology • Tool to import/export data from TCS documents • TOS allows • registration, retrieval of taxonomic datasets • Match concepts given names, concepts, etc. • Allow users to • See different taxonomic opinions • Uses GUIDs to reference concepts (LSIDs) • Find concepts… • Author new concepts • Make new relationships between existing concepts • Integrated with Kepler workflow system
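The "find concepts given a name" operation can be sketched as an in-memory index. This is a toy stand-in and not the Taxon Object Server interface: `ConceptIndex`, its methods and the identifiers "TC_a" and "TC_b" are placeholders chosen for the example.

```python
from collections import defaultdict


class ConceptIndex:
    """In-memory name-to-concept lookup (illustrative only)."""

    def __init__(self):
        self._by_name = defaultdict(list)

    def register(self, concept_id, name, according_to):
        """Store one concept under its scientific name."""
        self._by_name[name].append(
            {"id": concept_id, "name": name, "according_to": according_to}
        )

    def find(self, name):
        """All registered concepts that carry this name (possibly none)."""
        return list(self._by_name.get(name, []))


index = ConceptIndex()
index.register("TC_a", "Carya floridana Sarg. (1913)",
               "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)")
index.register("TC_b", "Carya floridana Sarg. (1913)",
               "Stone, Flora of North America 3:424 (1997)")

matches = index.find("Carya floridana Sarg. (1913)")
print([m["id"] for m in matches])  # ['TC_a', 'TC_b']
```

A name lookup returning more than one concept is the signal that a user, or a further relationship query, has to decide which taxonomic opinion applies.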
SEEK User Interface Tools • Concept mapper • A desktop tool to assist taxonomists to relate concepts from one source to another • For use in creating data sets for TOS or TCS • For creating new relationships between concepts in TOS • Taxonomy comparison visualisation • Visualisation tool to explore different classifications • Compare concepts
Concept Mapper Main GUI (screenshot) • Panels: Query concepts, Concepts, Relationships
Concept Comparison Visualisation
SEEK Summary • Environment to support large scale ecological data analysis • Scientific Workflows: Kepler • Semantic Mediation • Ecological ontology creation/use for data integration • Grid/Web based data discovery • Resolution of Taxonomic Names/Concepts • Standards development • Concept matching server • Visualisation tools • http://seek.ecoinformatics.org
Is it safe to match on names? • I hope I have convinced you that the answer is NO • as a general rule… BUT • Depends on the purpose of the data • therefore the accuracy required • The degree of automation used in matching • greater automation – greater potential problem • Expertise of person involved in the matching
Many Outstanding Issues…. • Educating biologists about the inherent problem in names • Not limited to the Linnaean system of nomenclature • Lack of good taxon concept data • Widening usage and application of taxon concepts • Adopting GUIDs • Provision of reliable ‘look up’ facilities • Cross referencing of GUIDs • Reuse is vital • Must not create duplicate GUIDs if possible • Conversion of legacy data • Develop good matching algorithms • Potential move from XML schema -> semantic web technologies • ……..
Acknowledgements • This material is based upon work supported by: • The National Science Foundation • SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis • Matt Jones – for many of the slides…. • Global Biodiversity Information Facility • eScience Institute • Research Theme Programme • Malcolm Atkinson
Exploiting Diverse Sources of Scientific Data • Upcoming Workshop • discussing possible technology solutions • RDF, Ontologies and Meta-Data Workshop, 7th – 9th June, 2006 • e-Science Institute, 15 South College Street, Edinburgh • http://www.nesc.ac.uk/esi/events/683/