Biodiversity Informatics and the Biodiversity Literature
Overview • Progress over the last decade • Organism occurrence data • Taxonomic databases • The next challenge • Describing diversity
Organism Occurrence Data • Tools and standards created in biodiversity informatics enable data to be aggregated from around the world. • The Global Biodiversity Information Facility (GBIF) is the largest aggregator of organism occurrence data. • Data flow: institutional collection databases → data.GBIF.org → end user anywhere • Institutions: CAS (California Academy of Sciences, San Francisco); USNM (National Museum of Natural History, Washington); FMNH (Field Museum of Natural History, Chicago); NHM (The Natural History Museum, London); MNHN (Muséum national d'Histoire naturelle, Paris)
Remaining challenges with occurrence data • Lots of digitization still to do • Taxonomic identifications need to be updated • Georeferencing still needs to be done • Relationship to literature: • Specimens and observations are primary data • Literature contains both reports of primary data and summarized data • Large-scale digitization efforts in museums might (will) swamp the content in literature
Taxonomic Databases • >20M • Increasing density of names in relevant corpus • Nomenclator • Checklist: valid / accepted taxa (plus synonyms) • Catalog of uses in taxonomic works • Index: all unique name-strings mapped to valid names/concepts
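The "index" layer above (every unique name-string mapped to a valid name) can be illustrated with a minimal sketch. The class name NameIndex, the normalize helper, and the whitespace/case normalization rule are assumptions made for illustration, not part of any database named in the talk; the zebrafish synonyms are included only as sample data.

```python
def normalize(name_string):
    """Collapse runs of whitespace and lowercase, so trivially
    different spellings of one name-string compare equal."""
    return " ".join(name_string.split()).lower()


class NameIndex:
    """Hypothetical index: every known name-string points to one valid name."""

    def __init__(self):
        self._valid_for = {}

    def add(self, valid_name, synonyms=()):
        # The valid name resolves to itself; each synonym resolves to it too.
        for name_string in (valid_name, *synonyms):
            self._valid_for[normalize(name_string)] = valid_name

    def resolve(self, name_string):
        # Returns None when the name-string is not in the index.
        return self._valid_for.get(normalize(name_string))


index = NameIndex()
index.add("Danio rerio", synonyms=["Brachydanio rerio", "Cyprinus rerio"])
resolved = index.resolve("brachydanio   rerio")
```

A real index would also have to handle authorship strings, homonyms, and competing taxonomic opinions, which this sketch deliberately ignores.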
Emergent consensus • Philosophical/methodological debates • Species concepts • Biological • Evolutionary • Phylogenetic • Taxonomic definitions • Circumscription • Synonymized types • Set of specimens identified by taxon author • Tree or lineage-based definition
Anchor name-usage to publication metadata; actual publication; enable validation • Diagram labels: Citation (publication metadata), Name Usage, Name, begin, end
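One way to read the diagram above is as a small record type: a name usage ties a name to a citation and to a begin/end position in the published text, so the usage can be validated against the publication itself. The sketch below assumes that begin and end are character offsets; the slide does not say so. The class names, field names, and citation values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Publication metadata (all field names are illustrative)."""
    authors: str
    year: int
    title: str


@dataclass(frozen=True)
class NameUsage:
    """A name as it appears at a specific place in a specific publication."""
    name: str
    citation: Citation
    begin: int  # assumed: character offset where the name starts
    end: int    # assumed: character offset just past the name

    def matches(self, publication_text):
        # Validation: the anchored span must contain exactly the name.
        return publication_text[self.begin:self.end] == self.name


# Placeholder publication and text, not a real reference.
citation = Citation(authors="Example Author", year=2000,
                    title="A hypothetical taxonomic work")
text = "The zebrafish, Danio rerio, is a model organism."
begin = text.find("Danio rerio")
usage = NameUsage(name="Danio rerio", citation=citation,
                  begin=begin, end=begin + len("Danio rerio"))
is_valid = usage.matches(text)
```

Storing the offsets rather than only the name is what makes the usage checkable: anyone holding the same published text can confirm that the name really occurs at that position.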
Remaining challenges with taxonomic data • Taxa are concepts created in literature • Physical instances of the same published work are “equivalent” • Develop shared logical identifiers • Reconciliation across “authoritative” databases; fewer “same as” records
Recap • Taxonomic names are key to • Information retrieval • Information summary and grouping • Publication metadata are critical to anchoring taxonomic concepts, and • Providing the semantic touchstones for collaboration (critical) • Occurrence data give us species distributions • Direct relationship to literature is small • But taxonomy is critical to integrating occurrence data, so the literature is still fundamental
What’s next?
What other classes of information remain in the literature? …that could be extracted and structured to be really useful?
Genetic and genomic data? • …are not communicated or stored in the literature
A Model Organism • Danio rerio, the zebrafish
Understanding the origins of species through structured descriptions of diversity • Morphological Diversity: Phenotype A, Phenotype B • Development • Genomic Diversity: Genotype A, Genotype B • mutation • evolution
Morphological variation across species difficult to find and synthesize
Not computable across studies (Lundberg and Akama 2005)
What is an ontology? • A set of well-defined terms and the logical relationships that hold between them • Represents knowledge of a discipline
Teleost Anatomy Ontology: terms and relationships • Terms: ventral hyoid arch; pharyngeal arch cartilage; replacement bone; basihyal element; basihyal cartilage; basihyal bone • Relationships: is_a, part_of, develops_from
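To show why such terms and relationships are computable, here is a minimal sketch of an ontology fragment as a labeled graph, with two simple queries. The specific edges are an assumption: the figure's layout did not survive, so the pairing of terms and relationships below is one plausible reading, not the published ontology. The function names are hypothetical.

```python
from collections import defaultdict

# Assumed edges (subject, relationship, object); see the note above.
EDGES = [
    ("basihyal cartilage", "is_a", "pharyngeal arch cartilage"),
    ("basihyal cartilage", "is_a", "basihyal element"),
    ("basihyal bone", "is_a", "replacement bone"),
    ("basihyal bone", "is_a", "basihyal element"),
    ("basihyal bone", "develops_from", "basihyal cartilage"),
    ("basihyal element", "part_of", "ventral hyoid arch"),
]


def build_graph(edges):
    """Index edges as graph[subject][relationship] -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, relationship, obj in edges:
        graph[subject][relationship].add(obj)
    return graph


def ancestors(graph, term, relationship="is_a"):
    """All terms reachable from `term` by following one relationship type."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, {}).get(relationship, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def parts_of(graph, whole):
    """Terms that are part_of `whole` directly or via an is_a ancestor."""
    result = set()
    for term in list(graph):
        candidates = {term} | ancestors(graph, term, "is_a")
        if any(whole in graph.get(c, {}).get("part_of", ()) for c in candidates):
            result.add(term)
    return result


graph = build_graph(EDGES)
bone_types = ancestors(graph, "basihyal bone")
arch_parts = parts_of(graph, "ventral hyoid arch")
```

The second query is the interesting one: nothing states directly that the basihyal bone is part of the ventral hyoid arch, but the fact follows from combining an is_a edge with a part_of edge. That kind of inference is what free-text descriptions cannot support.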
Ontologies quickly become large and complex; guiding philosophy required • The Teleost Anatomy Ontology contains 3,039 terms, with >600 skeletal terms (Dahdul et al., 2010, Systematic Biology)
Translational medicine • Translation from model organisms to humans (Fig. 1, Washington et al., 2010)
Phenoscape II & Research Coordination Network (RCN) • Extended to include other model organisms and taxonomic groups, e.g.: • Amphibian Anatomy Ontology (AAO) – Blackburn, CAS • Hymenoptera Anatomy Ontology (HAO) – Deans, NCSU • Plant Ontology – Huala, Stanford • NLP and term extraction (Hong Cui, Univ of Arizona)
What’s next? • Description of biological phenomena • Determining how best to do this will take time • Top-down design, guided by functional demonstration • Bottom-up curation of existing descriptions into structured knowledge through iteration