Biodiversity Informatics and the Biodiversity Literature

Overview • Progress over the last decade • Organism occurrence data • Taxonomic databases • The next challenge • Describing diversity

Organism Occurrence Data: Tools and standards created in biodiversity informatics enable data to be aggregated from around the world. The Global Biodiversity Information Facility (GBIF, data.GBIF.org) is the largest aggregator of organism occurrence data. Diagram: institutional collection databases feed data.GBIF.org, which serves an end user anywhere. Institutions: CAS (California Academy of Sciences, San Francisco); USNM (National Museum of Natural History, Washington); FMNH (Field Museum of Natural History, Chicago); NHM (The Natural History Museum, London); MNHN (Muséum national d'Histoire naturelle, Paris).

Organism occurrence data

Distribution models

Remaining challenges with occurrence data • Lots of digitization still to do • Taxonomic identifications need to be updated • Georeferencing still needs to be done. Relationship to literature: • Specimens and observations are primary data • Literature contains both reports of primary data and summarized data • Large-scale digitization efforts in museums might (will) swamp the content in literature

Taxonomic Databases: >20M; increasing density of names in relevant corpus • Nomenclator • Checklist: valid / accepted taxa (plus synonyms) • Catalog of uses in taxonomic works • Index: all unique name-strings mapped to valid names/concepts
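The idea of an index that maps every unique name-string to a valid name can be sketched in a few lines. This is a minimal illustration, not how any particular taxonomic database is built: the `CHECKLIST` structure, the `normalize` rule, and the function names are all assumptions, and the zebrafish synonymy is included only as a small example.

```python
# Hypothetical checklist: accepted name -> synonyms (illustration only).
CHECKLIST = {
    "Danio rerio": ["Brachydanio rerio", "Cyprinus rerio"],
}

def normalize(name_string):
    """Collapse whitespace and ignore case so trivial variants match."""
    return " ".join(name_string.split()).casefold()

def build_index(checklist):
    """Map every known name-string (accepted or synonym) to its accepted name."""
    index = {}
    for accepted, synonyms in checklist.items():
        for name in [accepted, *synonyms]:
            index[normalize(name)] = accepted
    return index

def resolve(index, name_string):
    """Return the accepted name for a name-string, or None if unknown."""
    return index.get(normalize(name_string))

index = build_index(CHECKLIST)
print(resolve(index, "Brachydanio  rerio"))
print(resolve(index, "Homo sapiens"))
```

Real name reconciliation also has to handle authorship strings, misspellings, and homonyms, which this sketch deliberately ignores.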

Emergent consensus • Philosophical/methodological debates • Species concepts • Biological • Evolutionary • Phylogenetic • Taxonomic definitions • Circumscription • Synonymized types • Set of specimens identified by taxon author • Tree or lineage-based definition

Anchor name-usage to publication metadata; actual publication; enable validation. Diagram labels: Citation (publication metadata); Name Usage; Name; begin; end
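One way to picture a name usage anchored to a publication is as a small record that can be checked against the text. The sketch below is an assumption-laden illustration: it reads "begin" and "end" as character offsets into the publication text, and the class names, fields, and the example citation are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    """Publication metadata (fields chosen for illustration)."""
    authors: str
    year: int
    title: str

@dataclass(frozen=True)
class NameUsage:
    """A name as it appears at a specific place in a cited publication."""
    citation: Citation
    name: str
    begin: int  # assumed: character offset where the name starts
    end: int    # assumed: character offset just past the name

    def validate(self, publication_text):
        """Check that the anchored span really contains the name."""
        return publication_text[self.begin:self.end] == self.name

# Hypothetical publication and text, not a real reference.
citation = Citation("Example Author", 2000, "A hypothetical article")
text = "The zebrafish, Danio rerio, is a model organism."
start = text.index("Danio rerio")
usage = NameUsage(citation, "Danio rerio", start, start + len("Danio rerio"))
bad_usage = NameUsage(citation, "Danio rerio", 0, 11)

print(usage.validate(text), bad_usage.validate(text))
```

Storing the anchor alongside the citation is what makes validation possible: anyone with access to the publication can confirm that the name occurs where the record says it does.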

Remaining challenges with taxonomic data • Taxa are concepts created in literature • Physical instances of the same published work are “equivalent” • Develop shared logical identifiers • Reconciliation across “authoritative” databases; fewer “same as” records

Recap • Taxonomic names are key to • Information retrieval • Information summary and grouping • Publication metadata are critical to anchoring taxonomic concepts, and • Providing the semantic touchstones for collaboration (critical) • Occurrence data gives us species distributions • Direct relationship to literature is small • But taxonomy is critical to integrating occurrence data, so the literature is still fundamental

What’s next?

What other classes of information remain in the literature? …that could be extracted and structured to be really useful?

Genetic/Genomic data Genetic and genomic data? …are not communicated or stored in the literature

A Model Organism: Danio rerio, the zebrafish

Understanding the origins of species through structured descriptions of diversity. Figure labels: Morphological Diversity (Phenotype A, Phenotype B); Development; Genomic Diversity (Genotype A, Genotype B); mutation; evolution

Morphological variation across species is difficult to find and synthesize

Information retrieval from free-text is difficult

Not computable across studies (Lundberg and Akama 2005)

What is an ontology? • A set of well-defined terms and the logical relationships that hold between them • Represents knowledge of a discipline

Teleost Anatomy Ontology terms and relationships. Figure: ontology graph. Terms: ventral hyoid arch, pharyngeal arch cartilage, replacement bone, basihyal cartilage, basihyal element, basihyal bone. Relationship types: is_a, part_of, develops_from
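A small sketch can show why terms with logical relationships are computable where free text is not. The term names and relation types below come from the slide, but which term is linked to which is an assumption made for illustration; the flattened figure does not preserve the actual edges, and the function names are hypothetical.

```python
# (subject, relation, object) triples. The pairings are assumed for illustration.
EDGES = [
    ("basihyal bone", "is_a", "basihyal element"),
    ("basihyal bone", "is_a", "replacement bone"),
    ("basihyal bone", "develops_from", "basihyal cartilage"),
    ("basihyal cartilage", "is_a", "basihyal element"),
    ("basihyal cartilage", "is_a", "pharyngeal arch cartilage"),
    ("basihyal element", "part_of", "ventral hyoid arch"),
]

def build_index(edges):
    """Index edges as {subject: {relation: set(objects)}}."""
    index = {}
    for subj, rel, obj in edges:
        index.setdefault(subj, {}).setdefault(rel, set()).add(obj)
    return index

def ancestors(index, term, relation="is_a"):
    """All terms reachable from `term` by repeatedly following `relation`."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for parent in index.get(current, {}).get(relation, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def inferred_part_of(index, term):
    """Structures `term` is part of, directly or via its is_a ancestors."""
    wholes = set()
    for t in {term} | ancestors(index, term):
        wholes |= index.get(t, {}).get("part_of", set())
    return wholes

index = build_index(EDGES)
print(sorted(ancestors(index, "basihyal bone")))
print(sorted(inferred_part_of(index, "basihyal bone")))
```

Under these assumed edges, a query about the ventral hyoid arch can find the basihyal bone even though no edge links them directly, which is the kind of inference that free-text descriptions do not support.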

Ontologies quickly become large and complex; guiding philosophy required. The Teleost Anatomy Ontology contains 3,039 terms, with >600 skeletal terms (Dahdul et al., 2010, Systematic Biology)

Translational medicine: translation from model organisms to humans (Fig. 1, Washington et al., 2010)

Phenoscape II & Research Coordination Network (RCN) • Extended to include other model organisms and taxonomic groups, e.g.: • Amphibian Anatomy Ontology (AAO) – Blackburn, CAS • Hymenoptera Anatomy Ontology (HAO) – Deans, NCSU • Plant Ontology – Huala, Stanford • NLP and term extraction (Hong Cui, Univ of Arizona)

What’s next? • Description of biological phenomena • Determining how best to do this will take time • Top-down design, guided by functional demonstration • Bottom-up curation of existing descriptions into structured knowledge through iteration
