Pax Terminologica Barry Smith Institute for Formal Ontology and Medical Information Science (IFOMIS), Saarland University / University at Buffalo
Overview • systems for semantic annotation • linguistics vs. science • semantic annotation in biomedical informatics • improving systems for semantic annotation • conclusions
The Penn Treebank Project • annotates naturally occurring text for linguistic structure, producing skeletal parses showing syntactic and semantic information in tree form
Automatic Content Extraction Program (ACE) • develops text corpora in English, Chinese and Arabic annotated for entities, the relations among them and the events in which they participate.
High Accuracy Retrieval from Documents (HARD) • creates corpora and annotations including topics, metadata and relevance judgements
Annotation Graph Toolkit (AGTK) • formal framework for representing linguistic annotations of time series data.
TimeML robust specification language for markup of natural language to support: • time stamping of events (identifying and anchoring in time); • ordering events with respect to one another • reasoning about persistence
SpaceML provides facilities for annotating • category attributions to spatial regions (self-connected, bounded, regular, etc.) • ascription to regions of topological, distance, morphological and orientation relations; • the definition of a region in terms of its boundary.
WordNet • groups English nouns, verbs, adjectives and adverbs into synonym sets, each representing one underlying lexical concept.
FrameNet • documents the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses
ISO/TC 37 / SC 4N 076 • Ide, N., Romary, L., de la Clergerie, E. (2003). International Standard for a Linguistic Annotation Framework. HLT-NAACL 2003 (Edmonton)
OntoGloss (influenced by ISO Linguistic Annotation Framework) • an ontology-based annotation tool that uses pre-defined terms in an ontology to mark up a document • No standard portal for semantic annotation tools/projects (?)
Purposes of semantic annotation • information retrieval (incl. semantic indexing = answering queries that use words not used in the text, including words from other languages) • automatic translation • disambiguation • topic extraction and text summarization • information integration • reasoning
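A minimal sketch of the semantic indexing idea named in the slide above: documents are indexed by concept rather than by surface word, so a query can match a text that never uses the query word, including a word from another language. The lexicon, the concept identifiers (C1, C2) and the sample texts are invented for illustration; a real system would draw them from a resource such as WordNet or a controlled vocabulary.

```python
from collections import defaultdict

# Hypothetical lexicon: surface word -> concept identifier.
# Synonyms and words from other languages map to the same concept.
LEXICON = {
    "heart": "C1", "cardiac": "C1", "herz": "C1",
    "kidney": "C2", "renal": "C2", "niere": "C2",
}

def concepts_in(text):
    """Return the set of concept ids whose words occur in the text."""
    return {LEXICON[w] for w in text.lower().split() if w in LEXICON}

def build_index(docs):
    """Map each concept id to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept in concepts_in(text):
            index[concept].add(doc_id)
    return index

def search(index, query):
    """Return documents sharing at least one concept with the query."""
    hits = set()
    for concept in concepts_in(query):
        hits |= index.get(concept, set())
    return hits

# Invented sample documents.
DOCS = {
    "d1": "cardiac output falls after injury",
    "d2": "renal function was normal",
}
INDEX = build_index(DOCS)
```

Here `search(INDEX, "heart")` finds d1 although the word "heart" does not occur in it, and `search(INDEX, "niere")` finds d2 from a German query word.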
for linguistics • fiction no less important than fact • English has no privileged status • regimentation not allowed • annotation frameworks may be competitive • cross-framework consistency is not important
for science • factual discourse alone important • English is the language par excellence • regimentation is allowed • goal of truth: to create a single computer-processable map of reality • truth is one • must strive for consistency of annotations and additivity of annotation frameworks
for science • must end the terminology wars • Plant Ontology (PO): cell =def. structural and physiological unit of a plant • what should PO do when it needs to study bacteria in plants? • answer: all shall use the word ‘cell’ to mean the same thing! • (all = in biology)
the ideal (of additivity) • WordNet for single word forms • FrameNet for valencies/combination forms • SpaceNet for spatial structures • TimeNet for temporal structures • ChemNet for chemical structures • CellNet for cellular structures etc.
a scientific problem: huge swarms of biomedical data at different granularities, from molecule to clinic • methods for data integration needed to enable reasoning across data at multiple granularities • (genomic medicine ...)
orthodox solutions to this problem • dumb statistical number-crunching or: • Semantic Web, Unified Medical Language System (UMLS), Moby, etc. • let a million flowers bloom • and rely on mappings between already existing controlled vocabularies/annotation systems
an alternative solution • use the peer-reviewed biomedical literature • contains both textual descriptions of biological functions (incl. diseases) and references to entities represented in the biochemical databases • use high-quality semantic annotations of the former to integrate across the latter • the Gene Ontology
The methodology of annotations • Model organism databases employ scientific curators, who use the experimental observations reported in the biomedical literature to link gene products (such as proteins) with GO terms in annotations.
The process of annotations • leads to improvements and extensions of the ontology, which in turn leads to better annotations • a virtuous cycle of improvement in the quality and reach of both future annotations and the ontology itself, • yielding a slowly growing computer-interpretable map of biological reality within which major databases are automatically integrated in semantically searchable form
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629]
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
need to extend GO by means of other ontologies, e.g. Cell Ontology, via integrated definitions
GO + Cell type = New Definition
Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
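The osteoblast stanza above uses plain "tag: value" lines, so it can be read with a few lines of code. The following is a rough sketch, not a full OBO parser: the function names are illustrative, and the def line is left out of the sample so that quoted strings and cross-references need no special handling.

```python
from collections import defaultdict

# The stanza from the slide, minus the def line.
STANZA = """\
id: CL:0000062
name: osteoblast
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
"""

def parse_stanza(text):
    """Parse 'tag: value' lines into a dict of tag -> list of values."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ': ' only, so ids like CL:0000062 stay whole.
        tag, _, value = line.partition(": ")
        fields[tag].append(value)
    return dict(fields)

def develops_from(fields):
    """Return the target ids of all develops_from relationships."""
    targets = []
    for rel in fields.get("relationship", []):
        rel_type, target = rel.split(None, 1)
        if rel_type == "develops_from":
            targets.append(target)
    return targets

TERM = parse_stanza(STANZA)
```

With the stanza in this form, `develops_from(TERM)` yields the two precursor cell types that the integrated definition of osteoblast differentiation refers to.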
need to extend GO also to semantic annotation of clinical literature • unfortunately, available (UMLS) clinical vocabularies are of variable quality and low mutual consistency
need for prospective standards to assure consistency and high quality • create rules for high-quality controlled vocabularies for the annotation of scientific literature • make everyone follow these rules • regimentation!
first step: a shared portal for (so far) 58 ontologies (low regimentation) • http://obo.sourceforge.net
The OBO Foundry scientific standards and principles-based coordination of systems for semantic annotation of biomedical literature to create a single interoperable family of gold standard reference ontologies
The OBO Foundry A subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles designed to ensure • formal robustness • stability • compatibility • interoperability • support for logic-based reasoning
The OBO Foundry • Custodians • Michael Ashburner (Cambridge) • Suzanna Lewis (Berkeley) • Barry Smith (Buffalo/Saarbrücken)
The OBO Foundry • A prospective standard designed to guarantee interoperability of ontologies from the very start • established March 2006; already 13 OBO ontologies have joined the Foundry and are being correspondingly reformed; three new ontologies are being constructed ab initio in its terms
The OBO Foundry Initial Candidate Members • GO Gene Ontology • CL Cell Ontology • SO Sequence Ontology • ChEBI Chemical Ontology • PATO Phenotype (Quality) Ontology • FuGO Functional Genomics Investigation Ontology • FMA Foundational Model of Anatomy • RO Relation Ontology • ChEBI Chemical Entities of Biological Interest • CARO Common Anatomy Reference Ontology • FuGO Functional Genomics Investigation Ontology • PrO Protein Ontology • RnaO RNA Ontology
The OBO Foundry Under development • Disease Ontology • Mammalian Phenotype Ontology • OBO-UBO / Ontology of Biomedical Reality • Organism (Species) Ontology • Plant Trait Ontology • Environment Ontology • Behavior Ontology • Biomedical Image Ontology • Clinical Trial Ontology
The OBO Foundry CRITERIA • The ontology is open and available to be used by all. • The ontology is in, or can be instantiated in, a common formal language. • The developers of the ontology agree in advance to collaborate with developers of other OBO Foundry ontologies where domains overlap.
The OBO Foundry CRITERIA • The developers of each ontology commit to its maintenance in light of scientific advance, and to soliciting community feedback for its improvement. • They commit to working with other Foundry members to ensure that, for any particular domain, there is community convergence on a single controlled vocabulary
The OBO Foundry CRITERIA • The ontology possesses a unique identifier space within OBO. • The ontology provider has procedures for identifying distinct successive versions. • The ontology includes textual definitions for all terms.
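Two of the criteria above (a unique identifier space, and textual definitions for all terms) are mechanical enough to check automatically. The sketch below is one possible check, not an official Foundry tool; the function name, the dict-based term records and the identifiers CL:9999998 and XX:0000001 are hypothetical and chosen only to trigger each problem.

```python
def check_terms(prefix, terms):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    seen = set()
    for term in terms:
        term_id = term.get("id", "")
        # Criterion: identifiers live in the ontology's own id space.
        if not term_id.startswith(prefix + ":"):
            problems.append(f"{term_id!r}: outside identifier space {prefix}")
        # Identifiers must also be unique within that space.
        if term_id in seen:
            problems.append(f"{term_id!r}: duplicate identifier")
        seen.add(term_id)
        # Criterion: every term carries a textual definition.
        if not term.get("def", "").strip():
            problems.append(f"{term_id!r}: missing textual definition")
    return problems

TERMS = [
    {"id": "CL:0000062", "name": "osteoblast",
     "def": "A bone-forming cell which secretes an extracellular matrix."},
    {"id": "CL:9999998", "def": ""},                        # hypothetical, no definition
    {"id": "XX:0000001", "def": "placeholder definition"},  # hypothetical, wrong prefix
]

PROBLEMS = check_terms("CL", TERMS)
```

Criteria such as openness, documentation or a plurality of independent users are matters of community practice and cannot be tested this way.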
The OBO Foundry CRITERIA • The ontology has a clearly specified and clearly delineated content. • The ontology is well-documented. • The ontology has a plurality of independent users.
The OBO Foundry CRITERIA • The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.* *Genome Biology 2005, 6:R46
analogy with FrameNet • the constituent ontologies in the OBO Foundry are focused overwhelmingly on single nouns • the OBO Relation Ontology is designed to ensure a common structure of relations shared by all Foundry ontologies – comparable to SpaceML, TimeML ... • need something like (Bio)FrameNet to pull the different levels of granularity together
The OBO Foundry CRITERIA • Further criteria will be added over time in order to bring about a gradual improvement in the quality of the ontologies in the Foundry
The OBO Foundry GOALS • semantic alignment of OBO Foundry ontologies through a common system of formally defined relations • to enable reasoning both within and across ontologies, and thus also within and between the literature annotated in its terms • and thus also to support reasoning across associated data
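A small sketch of what a common system of formally defined relations makes possible: because is_a is defined as transitive in the OBO Relation Ontology, a reasoner can follow is_a links from a term in one ontology into another once the two share that relation. The ontologies, the term identifiers (A:1, B:7, ...) and the function name are invented for illustration.

```python
def ancestors(term, is_a_edges):
    """All terms reachable from `term` via is_a links (transitive closure)."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a_edges.get(current, ()):
            if parent not in found:  # also guards against cycles
                found.add(parent)
                stack.append(parent)
    return found

# Hypothetical ontology A: A:3 is_a A:2 is_a A:1.
ONTOLOGY_A = {"A:3": ["A:2"], "A:2": ["A:1"]}
# Hypothetical ontology B: one of its terms is a subtype of a term in A.
ONTOLOGY_B = {"B:7": ["A:3"]}

# Merging is a plain dict union here because the key sets are disjoint.
MERGED = {**ONTOLOGY_A, **ONTOLOGY_B}
```

Within B alone, `ancestors("B:7", ONTOLOGY_B)` stops at A:3; over the merged graph it also reaches A:2 and A:1, which is the cross-ontology inference the slide has in mind.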
The OBO Foundry GOALS • to promote re-usability of data • if data-schemas are formulated using a single well-integrated framework for semantic annotation in widespread use, then the data will to that degree themselves become more widely accessible and usable
The OBO Foundry GOALS • to help in creating better mappings e.g. between human and model organism phenotypes: S Zhang, O Bodenreider, “Alignment of Multiple Ontologies of Anatomy: Deriving Indirect Mappings from Direct Mappings to a Reference Ontology”, AMIA 2005
The OBO Foundry GOALS • to introduce the scientific method into the development of semantic annotation frameworks • to introduce some of the features of scientific peer review into biomedical ontology development