How to Build an Ontology

10:30-12:00 How to Build an Ontology • 1-2pm Best Practices and Lessons Learned • 2-3pm BIRN Ontologies: An Overview


High quality shared ontologies build communities • General trend on the part of NIH, FDA and other bodies to consolidate ontology-based standards for the communication and processing of biomedical data. • NCIT / caBIG / NECTAR / BIRN / OBO ...

TWO STRATEGIES: • Ad hoc creation of new database schemas for each research group / research hypothesis • vs. pre-established, interoperable, stable reference ontologies in terms of which all database schemas need to be defined

How to create the conditions for a step-by-step evolution towards gold standard reference ontologies in the biomedical domain • ... and why we need to create these conditions • OBO Core project

Ontology =def • A representation of the types of entities existing in a given domain of reality, and of the relations between these types

Types have instances • Ontologies are like science texts: they are about types • (Diaries, databases, clinical records are about instances)

The need • strong general-purpose classification hierarchies created by domain specialists • clear, rigorous definitions • thoroughly tested in real cases • ontologies teach us about the instances in reality by supporting cross-disciplinary (cross-ontology) reasoning about types

The actuality (too often) • myriad special purpose ‘light’ ontologies, prepared by ontology engineers and deposited in internet ‘repositories’ or ‘registries’

these light ontologies often do not generalize … • repeat work already done by others • are not interoperable • reproduce the very problems of communication which ontology was designed to solve • contain incoherent definitions • and incoherent documentation

BIRN Ontology Experiences • In the short-term, users will probably download the data or analyses and extract the results using their preferred methods. • In the long term, however, that will become infeasible • the databases will have to be made interoperable with standard datamining software. • This is where the neuroanatomy ontologies come in. • We will need to know what the ROI is and which naming scheme it came from (e.g., a Brodmann’s area, or a sulcal/gyral area, etc.). We’ll need to know how it was defined (Talairach atlas? MNI atlas? LONI atlas? Or subject-specific regions?) and what the statistic is.


The long term begins here

A methodology for quality-assurance of ontologies • tested thus far in the biomedical domain on: • FMA • GO + other OBO Ontologies • FuGO • SNOMED • UMLS Semantic Network • NCI Thesaurus • ICF (International Classification of Functioning, Disability and Health) • ISO Terminology Standards • HL7-RIM

A methodology for quality-assurance of ontologies • accepted need for application of this methodology: • FMA • GO + other OBO Ontologies • FuGO • SNOMED • UMLS Semantic Network • NCI Thesaurus • ICF (International Classification of Functioning, Disability and Health) • ISO Terminology Standards • HL7-RIM

A methodology for quality-assurance of ontologies • signs of hope: • FMA • GO + other OBO Ontologies • FuGO • SNOMED • UMLS Semantic Network • NCI Thesaurus • ICF (International Classification of Functioning, Disability and Health) • ISO Terminology Standards • HL7-RIM

We know that high-quality ontologies built according to this methodology can help in creating high-quality mappings between human and model organism phenotypes

“Alignment of Multiple Ontologies of Anatomy: Deriving Indirect Mappings from Direct Mappings to a Reference Ontology”, Songmao Zhang and Olivier Bodenreider, AMIA 2005

We also know that OWL is not enough to ensure high-quality ontologies • and that the use of a common syntax and logical machinery, and the careful separating out of ontologies into namespaces, does not solve the problem of ontology integration

A basic distinction • type vs. instance • science text vs. clinical document • man vs. Musen

Instances are not represented in an ontology • It is the generalizations that are important • (but instances must still be taken into account)

Ontology Types Instances

Ontology = A Representation of Types

Ontology = A Representation of Types • Each node of an ontology consists of: • preferred term (aka term) • term identifier (TUI, aka CUI) • synonyms • definition, glosses, comments

Ontology = A Representation of Types • Nodes in an ontology are connected by relations • primarily: is_a (= is subtype of) and part_of • designed to support search, reasoning and annotation
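The node structure and relations described above can be sketched as a small data structure. This is a minimal illustration (Python 3.9+), not the format of any particular ontology: the `Node` class, its field names, and the `EX:` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an ontology: a representation of a single type."""
    identifier: str                                    # term identifier (format is hypothetical)
    preferred_term: str
    synonyms: list[str] = field(default_factory=list)
    definition: str = ""
    comments: list[str] = field(default_factory=list)
    is_a: list[str] = field(default_factory=list)      # identifiers of parent types
    part_of: list[str] = field(default_factory=list)   # identifiers of wholes


# Hypothetical example nodes, for illustration only.
cell = Node("EX:0001", "cell")
nucleus = Node("EX:0002", "cell nucleus", synonyms=["nucleus"], part_of=["EX:0001"])
neuron = Node("EX:0003", "neuron", synonyms=["nerve cell"], is_a=["EX:0001"])

# Index nodes by identifier so that relations can be followed.
ontology = {node.identifier: node for node in (cell, nucleus, neuron)}
```

Keeping the relations as lists of identifiers, rather than as nested objects, keeps each node self-contained and makes search and annotation a matter of dictionary lookups.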

[Figure: hierarchy of types (substance, organism, animal, mammal, cat, siamese, frog), with leaf classes marked and instances shown beneath the types]
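Reasoning over is_a can be illustrated with the type names from the figure. The sketch below assumes a particular arrangement of those names (for example, that frog sits directly under animal); that arrangement is a guess from the labels, not something the source states.

```python
# Direct is_a parents. The arrangement is an assumption read off the figure labels.
IS_A = {
    "siamese": "cat",
    "cat": "mammal",
    "mammal": "animal",
    "frog": "animal",
    "animal": "organism",
    "organism": "substance",
}


def ancestors(term):
    """Return every type reached from term by following is_a upward, nearest first."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def is_subtype_of(child, parent):
    """True if parent is reachable from child through one or more is_a links."""
    return parent in ancestors(child)


print(ancestors("cat"))                 # ['mammal', 'animal', 'organism', 'substance']
print(is_subtype_of("frog", "mammal"))  # False
```

Because each type here has a single parent, the upward walk is a simple loop. A hierarchy that allowed several parents per type would need a graph traversal instead.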

Rules for formatting terms • Terms should be in the singular • Terms should be lower case • Avoid abbreviations, even when it is clear in context what they mean (‘breast’ for ‘breast tumor’) • Avoid acronyms • Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...) • Each term ‘A’ in an ontology is shorthand for a term of the form ‘the type A’
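Some of the formatting rules above lend themselves to a rough automatic check. The sketch below covers only case, acronyms and plurals, and does so with crude heuristics chosen for illustration; abbreviations and mass terms need human judgment. The function name and the heuristics are assumptions, not part of any existing tool.

```python
import re


def term_format_problems(term):
    """Return a list of formatting-rule violations found by simple heuristics."""
    problems = []
    if term != term.lower():
        problems.append("not lower case")
    # Heuristic: a standalone run of two or more capitals suggests an acronym.
    if re.search(r"\b[A-Z]{2,}\b", term):
        problems.append("possible acronym")
    # Crude heuristic: a final word ending in 's' (but not 'ss', 'us', 'is')
    # may be a plural. This will misfire on some singular terms.
    words = term.split()
    last = words[-1].lower() if words else ""
    if last.endswith("s") and not last.endswith(("ss", "us", "is")):
        problems.append("possible plural")
    return problems


print(term_format_problems("breast tumor"))   # []
print(term_format_problems("Breast Tumors"))  # ['not lower case', 'possible plural']
print(term_format_problems("MRI scan"))       # ['not lower case', 'possible acronym']
```

A check like this is best treated as a way of flagging terms for review, not as a verdict.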

Motivation: to capture reality • Inferences and decisions we make are based upon what we know of reality. • An ontology is a computable representation of the underlying biological reality. • Designed to enable a computer to reason over the data we derive from this reality in (some of) the ways that we do.

Concepts • Biomedical ontology integration will never be achieved through integration of meanings or concepts • The problem is precisely that different user communities use different concepts • Concepts are in your head and will change as your understanding changes

Concepts • Ontologies represent types: not concepts, meanings, ideas ... • Types exist, with their instances, in objective reality • – including types of image, of imaging process, of brain region, of clinical procedure, etc.

Rules on types • Don’t confuse types with words • Don’t confuse types with concepts • Don’t confuse types with ways of getting to know types • Don’t confuse types with ways of talking about types • Don’t confuse types with data about types

Some other simple rules for high quality ontologies

Univocity • Terms should have the same meanings on every occasion of use. • They should refer to the same kinds of entities in reality • Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies

Positivity • Complements of types are not themselves types. • Hence terms such as • non-mammal • non-membrane • other metalworker in New Zealand • do not designate types in reality • There are also no conjunctive and disjunctive types: • protoplasmic astrocyte and Schwann cell • Purkinje neuron or dendritic shaft

Objectivity • Which types exist is not a function of our knowledge. • Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate types in reality.
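The positivity and objectivity rules can be approximated by a lexical screen over term names. The patterns below are heuristics picked for this sketch, not an established list, and they will produce false positives (a legitimate name may contain ‘and’), so the output is a list of candidates for review.

```python
import re

# (pattern, reason) pairs. The patterns are assumptions made for illustration.
PATTERNS = [
    (r"^non-", "complement (positivity)"),
    (r"^other\b", "complement (positivity)"),
    (r"\band\b", "conjunction (positivity)"),
    (r"\bor\b", "disjunction (positivity)"),
    (r"\b(unknown|unclassified|unlocalized)\b", "knowledge-dependent (objectivity)"),
]


def suspect_reasons(term):
    """Return the reasons, if any, why a term may fail to designate a type."""
    lowered = term.lower()
    return [reason for pattern, reason in PATTERNS if re.search(pattern, lowered)]


for term in ["non-mammal", "Purkinje neuron or dendritic shaft", "membrane"]:
    print(term, "->", suspect_reasons(term))
```

The word-boundary anchors keep ‘or’ inside ‘metalworker’ and ‘and’ inside ‘Zealand’ from triggering a match.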

Single Inheritance • No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level

Multiple Inheritance • thing • blue thing • car • blue car (joined to its two parents by is_a1 and is_a2)
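The single-inheritance rule is easy to test mechanically once the is_a links are available as data. The sketch below uses the blue car example; the function name and the dictionary layout are assumptions made for illustration.

```python
def multiple_inheritance_violations(is_a_parents):
    """Return {term: sorted parents} for every term with more than one is_a parent."""
    return {
        term: sorted(set(parents))
        for term, parents in is_a_parents.items()
        if len(set(parents)) > 1
    }


# Direct is_a parents for the example hierarchy.
hierarchy = {
    "blue thing": ["thing"],
    "car": ["thing"],
    "blue car": ["blue thing", "car"],
}

violations = multiple_inheritance_violations(hierarchy)
print(violations)  # {'blue car': ['blue thing', 'car']}
```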

is_a Overloading • serves as obstacle to integration with neighboring ontologies • The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned. • See “Relations in Biomedical Ontologies”, Genome Biology May 2005. • DISEASE MAPS

General Rule • Formulate universal statements first • Move to A may be B in such and such a context later

Intelligibility of Definitions • The terms used in a definition should be simpler (more intelligible) than the term to be defined; otherwise the definition provides no assistance • to human understanding • to machine processing

Definitions should be intelligible to both machines and humans • Machines can cope with the full formal representation • Humans need clarity and modularity

But • Some terms are primitive (cannot be defined) • AVOID CIRCULAR DEFINITIONS • Avoid definitions of the forms: • An A is an A which is B (person = person with identity documents) • An A is the B of an A (heptolysis = the causes of heptolysis)
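Directly circular definitions of the two forms above can be caught with a simple text check: does the term being defined reappear in its own definition? The sketch below does only that. It is an illustrative assumption about how such a check might look, and it does not detect circles that run through several definitions.

```python
import re


def is_circular(term, definition):
    """True if the term being defined occurs as a whole phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


print(is_circular("person", "person with identity documents"))   # True
print(is_circular("heptolysis", "the causes of heptolysis"))      # True
print(is_circular("blue car", "a car that is blue"))              # False
```

Matching the whole term as a phrase means that a definition which names the parent type (‘a car that is blue’) is not flagged, since that pattern is legitimate.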

Case Study: The National Cancer Institute Thesaurus (NCIT) • does not (yet) satisfy these and other simple principles

The NCIT reflects a recognition of the need • for high quality shared ontologies and terminologies the use of which by clinical researchers in large communities can ensure re-usability of data collected by different research groups

NCIT • “a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research” • “exhibits ontology-like properties in its construction and use”.

Goals • to make use of current terminology “best practices” to relate relevant concepts to one another in a formal structure, so that computers as well as humans can use the Thesaurus for a variety of purposes, including the support of automatic reasoning; • to speed the introduction of new concepts and new relationships in response to the emerging needs of basic researchers, clinical trials, information services and other users.

Formal Definitions • of 37,261 nodes, 33,720 were stipulated to be primitive in the DL sense • Thus only a small portion of the NCIT ontology can be used for purposes of automatic classification and error-checking by using OWL.
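The size of that small portion follows from the two counts above. The arithmetic below assumes that every node not stipulated to be primitive counts as formally defined, which the figures suggest but do not state outright.

```python
total_nodes = 37_261
primitive_nodes = 33_720

# Assumption: every non-primitive node carries a formal (DL) definition.
defined_nodes = total_nodes - primitive_nodes
primitive_share = primitive_nodes / total_nodes
defined_share = defined_nodes / total_nodes

print(f"defined nodes: {defined_nodes}")                                  # defined nodes: 3541
print(f"primitive: {primitive_share:.1%}, defined: {defined_share:.1%}")  # primitive: 90.5%, defined: 9.5%
```

On these numbers, fewer than one node in ten is available for automatic classification.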

Verbal Definitions • About half the NCIT terms are assigned verbal definitions • Unfortunately some are assigned more than one

Disease Progression • Definition1 • Cancer that continues to grow or spread. • Definition2 • Increase in the size of a tumor or spread of cancer in the body. • Definition3 • The worsening of a disease over time. This concept is most often used for chronic and incurable diseases where the stage of the disease is an important determinant of therapy and prognosis.
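Terms that carry more than one verbal definition, like the one above, can be found with a simple count. In the sketch below the data layout and the function name are assumptions, the third definition is shortened to its first sentence, and the second entry is a made-up control case.

```python
def terms_with_multiple_definitions(definitions):
    """Return {term: number of distinct definitions} for terms with more than one."""
    return {
        term: len(set(texts))
        for term, texts in definitions.items()
        if len(set(texts)) > 1
    }


definitions = {
    "disease progression": [
        "Cancer that continues to grow or spread.",
        "Increase in the size of a tumor or spread of cancer in the body.",
        "The worsening of a disease over time.",
    ],
    # Hypothetical control entry with a single definition.
    "example term": ["A single illustrative definition."],
}

print(terms_with_multiple_definitions(definitions))  # {'disease progression': 3}
```

Note that the first two definitions restrict the term to cancer while the third applies to any disease, so the count only locates the problem. Deciding which definition is correct remains a task for domain experts.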
