Data Integration Issues in Biodiversity Research. Jessie Kennedy Shawn Bowers, Matthew Jones, Josh Madin, Robert Peet, Deana Pennington, Mark Schildhauer, Aimee Stewart. SEEK. Science Environment for Ecological Knowledge
SEEK • Science Environment for Ecological Knowledge • Research and develop information technology to radically improve the type and scale of ecological science that can be addressed Visual Tools for Managing Taxonomic Concepts
Science and Scientific Data are Complex • Disciplines: Climatology, Hydrology, Meteorology, Geography, Oceanography, Geology, Ecology, Paleontology, Genomics, Taxonomy, Proteomics, Morphology, Nomenclature, Biochemistry
[Figure: the same disciplines shown together with shared data concepts: Temperature, Depth, Location, Organism, Taxon concept, Gene sequence, Name, Protein, Pathway]
Scientific Community: complex • Small Scientific Community • Individual Scientist • Large Scientific Community • Scientific Laboratory
[Figure: the disciplines-and-data-concepts diagram repeated four times]
Science & Scientific Data are Continually Changing • [Figure: cycle of hypothesis, experiment, observation, conclusion] • Conclusions become foundations for new hypotheses • New experiments invalidate existing knowledge • Knowledge is open to interpretation • Different opinions • Need to build this into our technological solutions
Exploiting Scientific Data • To support scientists in • Discovery • Access • Sharing • Integration/Linking • Analysis • Scientists can then improve their potential for new scientific discovery
Data Integration/Linking: approaches • Metadata • to describe the data sets and know how to interpret the data sets • Ontologies • to define the terminology used and know how data might be related and to aid automatic transformation of the data • Standardisation of formats • for exchange of data + to ease integration • LSIDs • to uniquely identify things; know when 2 things are the same • Workflows • to enable specification, refinement and repetition of integration/analysis • Provenance of data • to record where the data has come from and what has happened to it en route.
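To make the LSID bullet concrete, here is a minimal Python sketch of splitting an LSID into its parts and using it to decide whether two references point at the same thing. It assumes the general shape `urn:lsid:authority:namespace:object[:revision]`; the example identifier and the `example.org` authority are hypothetical, and the comparison rule is a simplification rather than the full resolution behaviour.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # Only the 'urn:lsid' prefix is matched case-insensitively here.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing urn:lsid prefix: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in: {text!r}")
    return LSID(*parts[2:])


def same_thing(a: str, b: str) -> bool:
    """Simplified identity test: all parsed components must match exactly."""
    return parse_lsid(a) == parse_lsid(b)


# Hypothetical identifier for the concept "Aus aus L.1758 sec. Fry 1989".
example = parse_lsid("urn:lsid:example.org:concepts:C1.3")
```

Two data sets that cite the same identifier can then be linked without comparing name strings at all.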
ESG Projects in most sciences:
Ecological Science - Analysis • Ecological niche modeling of species distributions • Where do species occur now? • Where will they occur in the future? • Image from http://www.lifemapper.org
Ecological Niche Modeling • Inputs: known species locations; environmental characteristics from gridded GIS layers (temperature layer, many other layers) • Develop model in multidimensional ecological space (D1 = Temperature, D2, … Dn) • Native distribution prediction: environmental characteristics of surrounding geographic area • Invasion area prediction: environmental characteristics of different geographic area • Environmental change prediction: future scenarios of environmental characteristics
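The niche-modeling idea above can be sketched in a few lines of Python. This is a deliberately simple rectangular "envelope" model, chosen only for illustration and not the algorithm used by the project: it records the range of each environmental dimension at the known locations, then marks any grid cell whose values fall inside every range as suitable. All sample values and layer names are hypothetical.

```python
def fit_envelope(samples):
    """Return {dimension: (min, max)} over the known species locations."""
    dims = samples[0].keys()
    return {d: (min(s[d] for s in samples), max(s[d] for s in samples))
            for d in dims}


def is_suitable(envelope, cell):
    """A cell is predicted suitable if every dimension lies in its range."""
    return all(lo <= cell[d] <= hi for d, (lo, hi) in envelope.items())


# Hypothetical environmental values sampled at known species locations.
known_locations = [
    {"temperature": 12.0, "precipitation": 800.0},
    {"temperature": 15.5, "precipitation": 650.0},
    {"temperature": 14.0, "precipitation": 900.0},
]
envelope = fit_envelope(known_locations)

# Hypothetical cells: one from a different geographic area under present
# conditions, one from a future scenario with a warmer climate.
other_area_cell = {"temperature": 13.0, "precipitation": 700.0}
future_cell = {"temperature": 18.0, "precipitation": 700.0}
```

The same fitted envelope is reused for all three kinds of prediction; only the environmental layers fed to `is_suitable` change.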
Sources of Scientific Data • Data are massively dispersed • Ecological field stations and research centers (100’s) • Natural history museums and biocollection facilities (100’s) • Agency data collections (10’s to 100’s) • Individual scientists (1000’s) • Data are heterogeneous • Syntax (format) • Schema (model) • Semantics (meaning)
Challenge: Data Integration
SEEK Components
Semantic Annotation – SEEK ontologies • Integration/merge • Concept mapping • Units conversion • Spatial & temporal scaling • Data discovery • Finding relevant data sets • Understanding data set content
Smart (Data) Integration: Merge • Discover data of interest • … connect to merge actor • … “compute merge”
Smart Merge … • [Figure: two source tables with Site and Biomass columns, whose columns carry semantic annotations (a1 to a8), are combined by a Merge actor into a single Merge Result table] • Semantic type annotations and ontology definitions are used to find mappings between sources • Executing the merge actor results in an integrated data product (via “outer union”)
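A rough Python sketch of the "outer union" idea follows. It assumes each source column has already been annotated with a semantic type, and it treats columns with the same type as the same output column; the column names, the `Depth` column, and the data values are hypothetical, and the real merge actor's behaviour is not reproduced here.

```python
def outer_union(*sources):
    """Merge sources on semantic type; absent values become None."""
    # Collect the semantic types in first-seen order: these are the output columns.
    columns = []
    for src in sources:
        for semantic_type in src["annotations"].values():
            if semantic_type not in columns:
                columns.append(semantic_type)
    merged = []
    for src in sources:
        for row in src["rows"]:
            out = dict.fromkeys(columns)  # every column starts as None
            for column, value in row.items():
                out[src["annotations"][column]] = value
            merged.append(out)
    return columns, merged


# Hypothetical sources: local column name -> semantic type annotation.
source_1 = {
    "annotations": {"site": "Site", "bm": "Biomass", "depth": "Depth"},
    "rows": [{"site": "a", "bm": 5.0, "depth": 10},
             {"site": "b", "bm": 6.0, "depth": 11}],
}
source_2 = {
    "annotations": {"biomass": "Biomass", "plot": "Site"},
    "rows": [{"biomass": 0.1, "plot": "a"},
             {"biomass": 0.2, "plot": "c"},
             {"biomass": 0.3, "plot": "d"}],
}
columns, merged = outer_union(source_1, source_2)
```

Rows from both sources are kept, and columns that one source lacks are filled with `None` rather than dropped, which is what distinguishes an outer union from a join.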
Challenges of Taxonomic Data • Scientific names change in meaning over time and across geographical regions • This affects the conclusions drawn from analysis of data integrated on names.
What is Abies lasiocarpa? • SubAlpine Fir • Sources compared: USDA Plants & ITIS; Flora North America • [Figure: names shown: Abies lasiocarpa var. lasiocarpa Abies lasiocarpa Abies bifolia var. arizonica]
Changes in meaning of names • Taxonomic history of imaginary genus Aus L. 1758 • 5 revisions of Aus: Linnaeus 1758, Archer 1965, Fry 1989, Tucker 1991, Pargiter 2003 • 1 name spelling change: Pyle 1990, Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus • [Figure: classification tree for each revision, using the names Aus L.1758, Aus aus L.1758, Aus bea Archer 1965, Aus cea Fry 1989, Aus ceus Fry 1989, Xus Pargiter 2003, Xus beus (Archer) Pargiter 2003]
Changes in meaning of names • 8 Names • 2 genera • 6 species • Pyle 1990: Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus • [Figure: the same taxonomic history of Aus across Linnaeus 1758, Archer 1965, Fry 1989, Tucker 1991 and Pargiter 2003]
Each name has many concepts or meanings • 8 Names • 17 Concepts
N0 - Aus L.1758
• C0.1 - Aus L.1758 sec. Linnaeus 1758
• C0.2 - Aus L.1758 sec. Archer 1965
• C0.3 - Aus L.1758 sec. Fry 1989
• C0.4 - Aus L.1758 sec. Tucker 1991
• C0.5 - Aus L.1758 sec. Pargiter 2003
N1 - Aus aus L.1758
• C1.1 - Aus aus L.1758 sec. Linnaeus 1758
• C1.2 - Aus aus L.1758 sec. Archer 1965
• C1.3 - Aus aus L.1758 sec. Fry 1989
• C1.4 - Aus aus L.1758 sec. Tucker 1991
• C1.5 - Aus aus L.1758 sec. Pargiter 2003
N2 - Aus bea Archer 1965
• C2.2 - Aus bea Archer 1965 sec. Archer 1965
• C2.3 - Aus bea Archer 1965 sec. Fry 1989
N3 - Aus cea Fry 1989
• C3.3 - Aus cea Fry 1989 sec. Fry 1989
• C3.4 - Aus cea Fry 1989 sec. Tucker 1991
N4 - Aus beus Archer 1965
N5 - Aus ceus Fry 1989
• C5.5 - Aus ceus Fry 1989 sec. Fry 1989
N6 - Xus beus Pargiter 2003
• C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003
N7 - Xus Pargiter 2003
• C7.5 - Xus Pargiter 2003 sec. Pargiter 2003
Many possible interpretations of Aus aus (N1) • Find data sets containing Aus aus • Original concept: C1.1 • Most recent concept: C1.5 • Preferred Authority (e.g. Fry 1989): C1.3 • Everything ever named N1: Union(C1.1,C1.2,C1.3,C1.4,C1.5) • Best fit according to some matching algorithm: Best(C1.1,C1.2,C1.3,C1.4,C1.5) • New concept containing only those features common to all concepts with the name N1: Intersection(C1.1,C1.2,C1.3,C1.4,C1.5) • Is it appropriate to link or merge data sets returned on the scientific names? Depends on the user’s purpose and the level of precision required
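The interpretations listed above can be read as different resolution strategies over the same list of concepts. The Python sketch below illustrates that reading; the concept identifiers and "sec." authorities come from the Aus example, while the specimen sets attached to each concept and the `resolve` function are hypothetical stand-ins for whatever a concept actually circumscribes. The "best fit" strategy is left out because it depends on an unspecified matching algorithm.

```python
# Concepts carrying the name N1 (Aus aus L.1758), in revision order.
# The specimen sets are HYPOTHETICAL and exist only to make the set
# operations visible.
N1_CONCEPTS = [
    ("C1.1", "Linnaeus 1758", {"s1", "s2", "s3", "s4"}),
    ("C1.2", "Archer 1965", {"s1", "s2", "s3"}),
    ("C1.3", "Fry 1989", {"s1", "s2"}),
    ("C1.4", "Tucker 1991", {"s1", "s2", "s5"}),
    ("C1.5", "Pargiter 2003", {"s1", "s2", "s5"}),
]


def resolve(concepts, strategy, authority=None):
    """Return the set of specimens a name refers to under one strategy."""
    if strategy == "original":
        return set(concepts[0][2])
    if strategy == "most_recent":
        return set(concepts[-1][2])
    if strategy == "preferred":
        for _concept_id, sec, members in concepts:
            if sec == authority:
                return set(members)
        raise KeyError(f"no concept according to {authority!r}")
    if strategy == "union":
        return set().union(*(members for _, _, members in concepts))
    if strategy == "intersection":
        return set.intersection(*(members for _, _, members in concepts))
    raise ValueError(f"unknown strategy: {strategy!r}")


everything = resolve(N1_CONCEPTS, "union")
core = resolve(N1_CONCEPTS, "intersection")
according_to_fry = resolve(N1_CONCEPTS, "preferred", authority="Fry 1989")
```

Which strategy is appropriate is the user's decision: the union maximises recall, the intersection maximises precision, and a preferred authority pins the name to one published meaning.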
Information from literature on synonymy • Names for each of the concepts • Parent child relationships in 5 revisions • Taxonomists record which names their concepts are synonymous with and any name changes • [Figure: graph linking names N0 to N7 with concepts C0.1 to C7.5]
Find data sets with Aus aus (N1) • [Figure sequence: on the names-and-concepts graph, the highlighted result grows step by step]
• Step 1: C1.1, C1.2, C1.3, C1.4, C1.5 (name N1)
• Step 2: adds C2.2, C2.3 (name N2)
• Step 3: adds C3.3, C3.4 (name N3)
• Step 4: adds C6.5 (names N6, N4)
• Step 5: adds C5.5 (name N5)
Results in everything returned for Aus aus by traversing the synonymy and name links
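The traversal in this slide sequence can be sketched as a breadth-first search over names. In the Python sketch below, the concept-to-name table follows the concept labels of the Aus example and the name changes follow the Pyle 1990 note; the synonymy links are hypothetical, chosen only so that the search reproduces the concepts highlighted in the final slide.

```python
from collections import deque

# Name carried by each concept (from the Aus example's concept labels).
CONCEPT_NAME = {
    "C1.1": "N1", "C1.2": "N1", "C1.3": "N1", "C1.4": "N1", "C1.5": "N1",
    "C2.2": "N2", "C2.3": "N2",
    "C3.3": "N3", "C3.4": "N3",
    "C5.5": "N5", "C6.5": "N6",
}

# HYPOTHETICAL synonymy: names that each concept's author lists as synonyms.
SYNONYMS = {
    "C2.2": {"N1"}, "C2.3": {"N1"},
    "C3.3": {"N2"}, "C3.4": {"N2"},
    "C6.5": {"N4"},
}

# Name changes: Aus bea -> Aus beus (N2 -> N4), Aus cea -> Aus ceus (N3 -> N5).
NAME_CHANGES = {"N2": {"N4"}, "N3": {"N5"}}


def concepts_for_name(start):
    """Follow name, synonymy and name-change links outward from one name."""
    seen_names = {start}
    found = set()
    queue = deque([start])
    while queue:
        name = queue.popleft()
        next_names = set(NAME_CHANGES.get(name, ()))
        for concept, concept_name in CONCEPT_NAME.items():
            if concept_name == name or name in SYNONYMS.get(concept, ()):
                found.add(concept)
                next_names.add(concept_name)
        for new_name in next_names - seen_names:
            seen_names.add(new_name)
            queue.append(new_name)
    return found, seen_names


concepts, names = concepts_for_name("N1")
```

Under these assumed links a query for one name pulls in eleven concepts under six names, which is the over-broad result the slides warn about.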
Information to improve data sets returned • We can build systems to return data fit for purpose • Minimally what we need are set relationships from concepts in any taxonomy to earlier concepts, and name changes related to earlier names • [Figure: the names-and-concepts graph (N0 to N7, C0.1 to C7.5) annotated with these relationships]
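One way to picture a "set relationship" between a concept and an earlier concept is to compare what each one circumscribes. The Python sketch below does this with plain sets; the relationship labels are illustrative wording, and the specimen sets are hypothetical, since in practice such relationships are asserted by taxonomists rather than computed from complete member lists.

```python
def set_relationship(a, b):
    """Describe how concept a relates to concept b, given their members."""
    if a == b:
        return "is congruent to"
    if a > b:  # proper superset
        return "includes"
    if a < b:  # proper subset
        return "is included in"
    if a & b:
        return "overlaps"
    return "excludes"


# HYPOTHETICAL circumscriptions for three concepts of the Aus example.
c1_1 = {"s1", "s2", "s3", "s4"}   # Aus aus sec. Linnaeus 1758
c1_2 = {"s1", "s2", "s3"}         # Aus aus sec. Archer 1965
c2_2 = {"s4"}                     # Aus bea sec. Archer 1965

relationships = {
    ("C1.1", "C1.2"): set_relationship(c1_1, c1_2),
    ("C1.2", "C1.1"): set_relationship(c1_2, c1_1),
    ("C2.2", "C1.1"): set_relationship(c2_2, c1_1),
    ("C2.2", "C1.2"): set_relationship(c2_2, c1_2),
}
```

With relationships like these recorded, a query on Aus aus sec. Archer 1965 could exclude data attached to Aus bea while a query on the original Linnaean concept could include it.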
Real Biological Taxonomies • Larger and change more frequently than the Aus example • German mosses • 14 classifications in 73 years • covering 1548 taxa • only 35% thought to be stable concepts • 65% of names used in legacy data sets are ambiguous • Taxonomic Revisions of genus Alteromonas 34 years: from 1972 to 2006 • At the species level • 18 “emendations” • 19 species reassigned to 4 genera • 3 new combinations • 6 synonyms • 2 species to subspecies • 2 subspecies to species • 21 new species
SEEK Taxon Approach • Use Taxon Concepts for referring to organisms • Aus aus L. 1758 sec. Tucker 1991 • Abies lasiocarpa (Hook) Nutt. sec FNA 1997 • Taxon Concept/Name Resolution • International data exchange schema • TCS (Taxonomic Concept Schema) • Concept Repository and Resolution web service • Linked to Kepler workflow system • Globally unique identifiers (LSIDs) • Visualization software for comparing Taxonomies and Asserting Concept Relationships
[Figure: SEEK taxon components: Taxonomic Data Providers (e.g. Mammal Species of the World), Taxonomic Literature, Database to TCS Mapping Tool, TCS Concept Extraction Tool, TCS, Taxon Object Server (TOS), SEEK Cache, Concept Mapper]
Taxonomic Object Service: SEEK • http://seek.nhm.ku.edu/TaxObjServ/services • Operations: Get Best Concept • Get Synonymous Concepts • Find All Concepts • Uses: Identify species • Mark up datasets • Data Analysis • [Figure: components shown: TCS, TOS, Concept Mapper, SEEK Cache, LSID Authority, Morpho, EML Datasets, EML(TCS)]
Recap… • Re-emphasised the problems with Taxonomic Names • not good identifiers for organisms • problem extends to most areas • characters, countries, habitats, vegetation types, genes, … • Shown that Taxonomic concepts are better for referring to organisms, specimens, observations, … • but we need better systems for resolving taxonomic names/concepts • which require better information
Provide better tools for users • To help taxonomists create better quality data • Better access to reference/legacy data • Explore differences/similarities in existing taxonomies • To create relationships between concepts • Improved data can be made available to the general biology community for incorporating into bio-referenced databases. • To help end users understand and use the data • and its limitations • Biologists can use tools to understand the impact of using particular data on their analysis
Conclusion • Science is complex (and therefore split into specialisms) • Identify the overlaps/linkages in the different domains • Need useful approximations of things to simplify linked domain • Need to understand the approximations or linking points well • Support re-composition, linking or building on the components • Science is inherently changing • Science is full of legacy data • Today’s scientific research is tomorrow’s legacy data • Track the changes in the data • know when components or links have changed • Provide long-term persistent storage • Any published scientific discovery should store the data as evidence • Data needs to be accurately annotated • Sufficient to repeat analyses to test hypotheses
Acknowledgements • Colleagues on the SEEK project • NSF and EPSRC funding • e-Science Centre funding • Colleagues in TDWG
Thank You Questions…