Ontology (Science) vs. Ontology (Engineering) • Barry Smith • University at Buffalo • http://ontology.buffalo.edu/smith
Working in ontology since 1975 • Working with biomedical ontologists since 2002 • Gene Ontology • Protein Ontology • Infectious Disease Ontology • OBO (Open Biomedical Ontologies) Foundry
NCBO • National Center for Biomedical Ontology • Dissemination and Ontology Best Practices • http://bioontology.org
ICBO • International Conference on Biomedical Ontology • Buffalo, NY. July 24-26, 2009 • http://icbo.buffalo.edu
Example ontologies • Basic Formal Ontology (BFO) • Common Anatomy Reference Ontology (CARO) • Environment Ontology (EnvO) • Foundational Model of Anatomy (FMA) • Infectious Disease Ontology (IDO) • Ontology for Biomedical Investigations (OBI) • Ontology for Clinical Investigations (OCI) • Phenotypic Quality Ontology (PATO) • Relation Ontology (RO)
Multiple kinds of data in multiple kinds of silos • Lab / pathology data • Electronic Health Record data • Clinical trial data • Patient histories • Medical imaging • Microarray data • Protein chip data • Flow cytometry
How to find your data? • How to find other people’s data? • How to reason with data when you find it? • How to understand the significance of the data you collected 3 years earlier? • To solve the silo problem, medical researchers need the help of ontology engineers
Ontologies facilitate retrieval of data by allowing grouping of annotations • Annotation counts: brain 20, hindbrain 15, rhombomere 10 • Query ‘brain’ without ontology: 20 • Query ‘brain’ with ontology: 45
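The arithmetic on this slide can be reproduced with a small sketch. It assumes, as the slide implies but does not state, that rhombomere is part of hindbrain and hindbrain is part of brain; the names `PART_OF`, `DIRECT_COUNTS` and both functions are illustrative only, not part of any ontology tool.

```python
# Assumed hierarchy: each term maps to the term it is part of.
PART_OF = {"hindbrain": "brain", "rhombomere": "hindbrain"}

# Number of annotations attached directly to each term (from the slide).
DIRECT_COUNTS = {"brain": 20, "hindbrain": 15, "rhombomere": 10}


def is_within(term, ancestor):
    """Return True if term equals ancestor or is (transitively) part of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PART_OF.get(term)
    return False


def count_without_ontology(query):
    """Plain lookup: only annotations made with exactly this term."""
    return DIRECT_COUNTS.get(query, 0)


def count_with_ontology(query):
    """Ontology-aware lookup: include annotations on all sub-parts."""
    return sum(n for term, n in DIRECT_COUNTS.items() if is_within(term, query))


print(count_without_ontology("brain"))  # 20
print(count_with_ontology("brain"))     # 45
```

The ontology-aware query walks the part_of chain upward from each annotated term, so annotations on hindbrain and rhombomere are grouped under brain.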
How to do biology across the genome? • [Slides show several screens of raw amino-acid sequence, beginning MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEK...]
Gene Ontology: three types of questions • what cellular component? • what molecular function? • what biological process?
[Figure from Clark et al., 2005, with is_a and part_of relations]
and through curation of literature • what cellular component? • what molecular function? • what biological process?
The Idea of Common Controlled Vocabularies • GlyProt • MouseEcotope • sphingolipid transporter activity • DiabetInGene • GluChem
The Idea of Common Controlled Vocabularies • GlyProt • MouseEcotope • Holliday junction helicase complex • DiabetInGene • GluChem
Gene Ontology • ca. 25,000 nodes • Example term: male courtship behavior, orientation prior to leg tapping and wing vibration
Benefits of GO • rooted in basic experimental biology • links people to data and to literature • links data to data • across species (human, mouse, yeast, fly ...) • across granularities (molecule, cell, organ, organism, population)
Benefits of GO • links medicine to biological science • allows cumulation of scientific knowledge in algorithmically tractable form LET’S GENERALIZE THESE BENEFITS TO OTHER AREAS OF BIOLOGY AND MEDICINE …
The standard engineering methodology • Pragmatics (‘usefulness’) is everything • Usefulness = we get to write software which runs on our machines
The standard engineering methodology • It is easier to write useful software if one works with a simplified model • (“…we can’t know what reality is like in any case; we only have our concepts…”) • This looks like a useful model to me • (One week goes by:) This other thing looks like a useful model to him • Data in Pittsburgh does not interoperate with data in Vancouver
The standard engineering methodology • Pragmatics (‘usefulness’) is everything • Science is siloed
Why build scientific ontologies • There are many ways to create ontologies • Multiple ontologies only make our data silo problems worse • We need to constrain ontologies so that they converge
Science-based ontology development • Q: What is to serve as constraint in order to avoid silo creation? • A: Reality, as revealed, incrementally, by experimentally-based science
Ontological realism • Find out what the world is like by doing science • Ontology is ineluctably a multi-disciplinary enterprise – it cannot be left to the engineers • Build representations adequate to this world, not to some simplified model in your laptop
In the olden days • people measured lengths using inches, ulnas, perches, king’s feet, Swiss feet, leagues of Portugal, varas of Texas, etc., etc.
The SI is a Controlled Vocabulary • Each SI unit is represented by a symbol, not an abbreviation. The use of unit symbols is regulated by precise rules. • The symbols are designed to be the same in every language. • Use of the SI system makes scientific results comparable
The SI is an Ontology • Quantities are universals • one for each measurable dimension of reality • Can we provide an analogue of the SI system for (basic dimensions of) biology?
First step • OBO (Open Biomedical Ontologies) library • comprehends some 70 ontologies • now made available also on the NCBO BioPortal • the majority of these ontologies are built to work well with the Gene Ontology
Goal of the OBO Foundry • all biomedical research data should cumulate to form a single, algorithmically processable whole • Smith et al., Nature Biotechnology, Nov 2007
Goal of the OBO Foundry • to provide a suite of controlled structured vocabularies for the calibrated annotation of data to support integration and algorithmic reasoning across the entire domain of biomedicine • as biomedical knowledge grows, these ontologies must be evolved in tandem
OBO FOUNDRY CRITERIA • The ontology is open and available to be used by all. • The ontology is instantiated in a common formal language and shares a common formal architecture. • The developers of the ontology agree in advance to collaborate with developers of other OBO Foundry ontologies where domains overlap.
OBO FOUNDRY CRITERIA • The developers of each ontology commit to its maintenance in light of scientific advance, and to soliciting community feedback for its improvement. • They commit to working with other Foundry members to ensure that, for any particular domain, there is community convergence on a single controlled vocabulary. • http://obofoundry.org
Orthogonality = modularity • one ontology for each domain • no need for semantic matching • no need for ontology integration • no need for mappings (which are in any case too expensive, too fragile, very difficult to keep up-to-date as mapped ontologies change)
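A minimal sketch of what ‘one ontology for each domain’ could look like as a mechanical check. The registry contents and the function name `overlapping_domains` are hypothetical and are not taken from the OBO Foundry; real domain boundaries are settled by community review, not by string comparison.

```python
from collections import defaultdict

# Hypothetical registry: ontology name -> the single domain it claims to cover.
REGISTRY = {
    "anatomy_onto_a": "anatomy",
    "anatomy_onto_b": "anatomy",
    "disease_onto": "infectious disease",
    "quality_onto": "phenotypic quality",
}


def overlapping_domains(registry):
    """Return every domain claimed by more than one ontology."""
    by_domain = defaultdict(list)
    for ontology, domain in registry.items():
        by_domain[domain].append(ontology)
    return {
        domain: sorted(names)
        for domain, names in by_domain.items()
        if len(names) > 1
    }


# An empty result would mean the registry is orthogonal in this simple sense.
print(overlapping_domains(REGISTRY))
```

When the result is empty, no two ontologies compete for the same terms, which is why no mappings between them are needed.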
Orthogonality • is our best (perhaps our only) hope of solving the data silo problem • Why do computer engineers hate orthogonality (and like ‘relativism’ – every project its own, new ontology) – so much?
All OBO Foundry ontologies work in the same way • we have data (biosample, haplotype, clinical data, survey data, ...) • we need to make this data available not just for string-based search but also for algorithmic processing • we create a consensus-based ontology for annotating the data
We have data • BioHealthBase: Tuberculosis Database • VFDB: Virulence Factor DB • TropNetEurop: Dengue Case Data • BioHealthBase: Influenza Database • PathPort: Pathogen Portal Project • IMBB: Malaria Data
We need to annotate this data to allow retrieval and integration of • sequence and protein data for pathogens • case report data for patients • clinical trial data for drugs, vaccines • epidemiological data for surveillance, prevention • ... Goal: to make data deriving from different sources comparable and computable
We need common controlled vocabularies to describe these data in ways that will assure comparability and cumulation • What content is needed to adequately cover the infectious domain? • Host-related terms (e.g. carrier, susceptibility) • Pathogen-related terms (e.g. virulence) • Vector-related terms (e.g. reservoir) • Terms for the biology of disease pathogenesis (e.g. evasion of host defense) • Population-level terms (e.g. epidemic, endemic, pandemic)
IDO provides a common template • It contains terms (like ‘pathogen’, ‘vector’, ‘host’) which apply to organisms of all species involved in infectious disease and its transmission • Disease- and organism-specific ontologies are then built as refinements of the IDO core • The common core guarantees some level of comparability of data
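The way a shared core makes disease-specific data comparable can be sketched as follows. Only the three core terms come from the slide; the refinement terms (such as "malaria vector"), the sample annotation lists, and the function names are hypothetical illustrations, not actual IDO content.

```python
from collections import Counter

# Core terms named on the slide.
CORE_TERMS = {"pathogen", "vector", "host"}

# Hypothetical disease-specific refinements: specific term -> core term it refines.
REFINES = {
    "malaria vector": "vector",
    "malaria host": "host",
    "dengue vector": "vector",
    "dengue pathogen": "pathogen",
}


def to_core(term):
    """Map a term to the core term it is, or refines; fail loudly otherwise."""
    if term in CORE_TERMS:
        return term
    parent = REFINES.get(term)
    if parent in CORE_TERMS:
        return parent
    raise KeyError(f"term not anchored in the core: {term!r}")


def core_profile(annotations):
    """Count annotations per core term so that datasets can be compared."""
    return dict(Counter(to_core(term) for term in annotations))


# Two hypothetical datasets annotated with different disease-specific terms.
malaria_annotations = ["malaria vector", "malaria vector", "malaria host"]
dengue_annotations = ["dengue vector", "dengue pathogen", "host"]

print(core_profile(malaria_annotations))
print(core_profile(dengue_annotations))
```

Because every specific term resolves to a core term, both profiles use the same keys and can be compared directly; a term with no anchor in the core raises an error instead of silently creating a new silo.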
IDO Qualities