Explore the definitions and applications of ontologies, such as the Gene Ontology and Semantic Web, in bioinformatics research. Learn how ontologies structure data, enable knowledge sharing, and support reasoning in information science.
Ontologies: “What are they?” and “How do they work?” Michael Grobe (work supported in part by Research Technologies UITS Indiana University)
Table of Contents Panorama of definitions Explication of the Big Definition The Gene Ontology as an example Processing queries on data annotated with ontology classifications Merging and building ontologies Table of Non-contents Automated annotation The role of ontologies in the Semantic Web Using ontologies in bioinformatics research
Panorama of definitions of “Ontology” In standard use: “Ontology is a study of conceptions of reality and the nature of being. … It is the science of what is, of the kinds and structures of the objects, properties and relations in every area of reality.” (Wikipedia, 2008) The term was hijacked for use within information science (sic) where it has many applications, but… “People use the word ontology to mean different things, e.g. - glossaries and data dictionaries, - thesauri and taxonomies, - schema and data models, and - formal ontologies and inference.” (Pidcock, 2003)
More definitions Here’s a definition from Uschold, et al. quoted by Stevens, et al.: “An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.”
More definitions And a definition from Pidcock, 2003: “A formal ontology is a controlled vocabulary expressed in an ontology representation language. This language has a grammar for using vocabulary terms to express something meaningful within a specified domain of interest. The grammar contains formal constraints … on how terms in the ontology’s vocabulary can be used together.”
A definition and a “clarification” And another definition by Gruber, 1993: “I use the term ontology to mean the specification of a conceptualization. …A conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose.” Stevens, et al., 2000 clarify (?): “The conceptualisation is the couching of knowledge about the world in terms of entities (things, the relationships they hold and the constraints between them). The specification is the representation of this conceptualisation in a concrete form.”
Yet more definitions And Gruber defines one purpose for ontologies: “Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber, 1993) But then, one also finds descriptions like: “Shallow ontologies comprise relatively few unchanging terms that organize very large amounts of data—for example, terms such as customer, account number, and overdraft…” (Shadbolt, 2006)
And Wikipedia continues in this vein by describing 2 types of (IT-related) ontology: “A domain ontology (or domain-specific ontology) models a specific domain, or part of the world. It represents the particular meanings of terms as they apply to that domain.” (Wikipedia, 2008) “An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. It contains a core glossary in whose terms objects in a set of domains can be described.” There are several standardized upper ontologies available for use, including Dublin Core . . . (Wikipedia, 2008)
As an aside (because we are actually interested in domain ontologies), let’s take a quick look at the Dublin Core thanks again to Wikipedia: “The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements: Title Creator Subject Description Publisher Contributor Date Type Format Identifier Source Language Relation Coverage Rights“ Surprisingly, these are described as “metadata elements”.
But wait, there’s still more… In particular, an ontology, or some ontologies, provide some ability to “reason”: “…an ontology is a representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain.” (Wikipedia, 2008) “. . . a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.” TimBL, 2000.
Note that this “reasoning” is performed using terms representing concepts rather than the concepts themselves. (Which is to say, text strings are being shuffled around; there is no “thought” involved.) “The computer doesn’t truly “understand” any of this information, but it can now manipulate the terms much more effectively in ways that are meaningful to the human users.” (TimBL, 2001)
Note also that this is the first definition that refers to a “taxonomy”, and the term comes up a lot in the field, so let’s look at an example taxonomy. Consider (a simplified version of) the Linnaean classification of living organisms based on these categories: Dominion, Domain, Kingdom, Phylum, Subphylum, Class, Cohort, Order, Family, Genus, Species. Each individual organism is assigned a set of values, one for each category. The result is a large table with 11 (or more) columns. This taxonomy defines a hierarchy of sets and subsets, and the series of column values in an individual species record represents the path from the root of the tree to that species (leaf node). Much of this path information is redundant.
There may be better ways to store this redundant data, and . . . . . . there may be other “ways” to think about what the data mean. In particular, the set-subset relationships may be thought of as “inference rules” that can be applied to answer queries. For example: If an organism is a member of a genus and that genus is a member of a family then that organism is also a member of that family
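The genus/family rule above can be sketched in a few lines of Python. The lookup tables, the function name `family_of_organism`, and the sample organisms are illustrative assumptions and are not part of the original slides.

```python
# Hypothetical, tiny slice of a Linnaean table (illustrative data only).
GENUS_OF = {"house cat": "Felis", "lion": "Panthera", "wolf": "Canis"}
FAMILY_OF = {"Felis": "Felidae", "Panthera": "Felidae", "Canis": "Canidae"}


def family_of_organism(organism):
    """Apply the rule: organism in genus, genus in family => organism in family."""
    genus = GENUS_OF.get(organism)
    if genus is None:
        return None  # nothing is known about this organism
    return FAMILY_OF.get(genus)


print(family_of_organism("lion"))
```

Storing only the organism-to-genus and genus-to-family links, and deriving the rest on demand, is one way to avoid repeating the full path in every record.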
Finally, here is a “big,” formal definition: "An ontology O is a six-tuple ⟨C, HC, HR, L, FC, FR⟩, where C is the set of concepts, HC a taxonomy induced on the concepts, HR the set of non-taxonomic relations, L the set of terms (lexicals) which refer to concepts and relations, and FC, FR are relations that map the terms in L to the corresponding concepts and relations. If the ontology is dynamic all these structures are likely to change over time." (Niepert, et al., 2008) Note that this definition includes no mention of “inference,” but inference may be hidden within.
This is clearly a complicated description, but we can break it into parts, at least some of which are understandable: First, a couple of definitions: A “tuple” is a set of objects in a specified order. An “N-tuple” is a tuple that contains exactly N items. A “relation” is a set of “tuples” of the same “arity”, but may be thought of as a “table”, which is how Relational Databases came to be named (even tho there are differences). Now note that the primary set, C, is a “set of concepts,” not a “set of terms”. “Terms,” are used to “refer to” the “concepts”, and both terms and relations are likely to change over time. FC maps L to C, but “terms” also refer to “relations”?
As a very simple example, here’s a set of concepts (C) represented as strings of English text: { “Vehicle”, “Car”, “Truck”, “2-wheel drive car”, “4-wheel drive car”, “front-wheel drive car”, “rear-wheel drive car” } Here’s a “taxonomy” (known as HC, perhaps called “is_a”?) “induced” on the set of concepts: { ( “Car”, “Vehicle” ), ( “Truck”, “Vehicle” ), ( “2-wheel drive car”, “Car” ), ( “4-wheel drive car”, “Car” ), ( “front-wheel drive car”, “2-wheel drive car” ), ( “rear-wheel drive car”, “2-wheel drive car” ) }
Here’s a set of terms, L = { 0, 1, 2, 3, 4, 5, 6 } and a relation (FC) mapping terms from the term set to concepts: { ( 0, “Vehicle” ), ( 1, “Car” ), ( 2, “Truck” ), ( 3, “2-wheel drive car” ), ( 4, “4-wheel drive car” ), ( 5, “front-wheel drive car” ), ( 6, “rear-wheel drive car” ) } Note that it is a good idea to use “meaningless” terms: Identifiers like “GO:000056”. Here’s a representation of the taxonomy (HC) using terms: { ( 1, 0 ), ( 2, 0 ), ( 3, 1 ), ( 4, 1 ), ( 5, 3 ), ( 6, 3 ) }
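As a rough sketch, the vehicle example can be written down as plain Python sets and a dictionary. The variable names are illustrative, and the code assumes that terms 5 and 6 denote the front-wheel and rear-wheel drive concepts, which is what the term-based taxonomy implies.

```python
# C: the set of concepts, represented here as English strings.
C = {
    "Vehicle", "Car", "Truck",
    "2-wheel drive car", "4-wheel drive car",
    "front-wheel drive car", "rear-wheel drive car",
}

# FC: maps each "meaningless" term to the concept it refers to.
FC = {
    0: "Vehicle",
    1: "Car",
    2: "Truck",
    3: "2-wheel drive car",
    4: "4-wheel drive car",
    5: "front-wheel drive car",  # assumed from the taxonomy pair (5, 3)
    6: "rear-wheel drive car",   # assumed from the taxonomy pair (6, 3)
}

# L: the set of terms is just the set of keys of FC.
L = set(FC)

# HC expressed with terms: (child term, parent term) pairs.
HC_terms = {(1, 0), (2, 0), (3, 1), (4, 1), (5, 3), (6, 3)}

# Translating through FC recovers the taxonomy over concepts.
HC_concepts = {(FC[child], FC[parent]) for child, parent in HC_terms}

for child, parent in sorted(HC_concepts):
    print(f"{child} is_a {parent}")
```

Keeping the taxonomy in terms rather than strings means a concept can be renamed by editing FC alone.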
Here’s a relation (call it “is_transitively_a” or “is_a_descendant_of” or a “transitive closure”) derived from the taxonomy assuming “transitivity”: { ( 1, 0 ), ( 2, 0 ), ( 3, 1 ), ( 3, 0 ), ( 4, 1 ), ( 4, 0 ), ( 5, 3 ), ( 5, 1 ), ( 5, 0 ), ( 6, 3 ), ( 6, 1 ), ( 6, 0 ) } The pairs ( 3, 0 ), ( 4, 0 ), ( 5, 1 ), ( 5, 0 ), ( 6, 1 ), and ( 6, 0 ) were added “by transitivity”. This seems to be one way of “sneaking” inference into the definition.
This complete table, … er … relation, contains the same information as could be inferred from the transitivity inference rule: If Item_A is_a Item_B and Item_B is_a Item_C then Item_A is_a Item_C. In some cases the relation derived by transitivity would be prohibitively large, so inference rules are frequently used to determine the relationship between 2 items ad hoc. Another way to “sneak” inference into this definition would be to consider the taxonomy as a set of inference rules, as will be considered later.
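Both options can be sketched in Python: materializing the whole transitive closure, or answering a single question ad hoc by walking up the taxonomy. The function names are illustrative, and the closure loop is a simple fixed-point iteration chosen for clarity rather than speed.

```python
# The term-based taxonomy HC for the vehicle example: (child, parent) pairs.
HC = {(1, 0), (2, 0), (3, 1), (4, 1), (5, 3), (6, 3)}


def transitive_closure(pairs):
    """Repeatedly apply the transitivity rule until no new pairs appear."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


def is_a(child, ancestor, pairs):
    """Ad hoc check: walk up from child without building the full closure."""
    seen = set()
    stack = [child]
    while stack:
        node = stack.pop()
        for c, parent in pairs:
            if c == node and parent not in seen:
                if parent == ancestor:
                    return True
                seen.add(parent)
                stack.append(parent)
    return False


closure = transitive_closure(HC)
print(sorted(closure - HC))   # pairs added by transitivity
print(is_a(5, 0, HC))         # is term 5 a kind of term 0?
```

The closure trades storage for lookup speed; the ad hoc walk does the opposite.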
The Gene “Ontology” One of the best known “ontologies” is the Gene Ontology which is actually 3 separate “ontologies” (with different “namespaces”) - molecular function (cell biochemistry?) What biochemical reactions do gene products perform? - biological process (cell physiology?) What cellular processes do the gene products participate in? - cellular component (cell anatomy?) In which cellular compartments or locations are those gene products expressed?
Here is a portion of the GO is_a DAG (Blake, 2004) for molecular function (example: “chromatin binding” is_a “DNA binding”): (It is easy to confuse a gene product name with its molecular function, and for that reason many GO molecular functions are appended with the word "activity". www.geneontology.org, 2008)
Here is a subset (C) of the Gene Ontology molecular function concepts:
  binding
  enzyme activity
  helicase activity
  DNA binding
  nucleic acid binding
  chromatin binding
  lamin/chromatin binding
  DNA helicase activity
  ATP-dependent helicase activity
  adenosine triphosphatase activity
  ATP-dependent DNA helicase activity
  DNA-dependent adenosine triphosphatase activity
The set (L) of Gene Ontology molecular function terms:
  GO:0005488
  GO:0008047
  GO:0004386
  GO:0003677
  GO:0003676
  GO:0003682
  GO:0003683
  GO:0003679
  GO:0008026
  GO:0016887
  GO:0004003
  GO:0008094
The relation (FC) mapping GO terms to concepts:
  GO:0005488   binding
  GO:0008047   enzyme activity
  GO:0004386   helicase activity
  GO:0003677   DNA binding
  GO:0003676   nucleic acid binding
  GO:0003682   chromatin binding
  GO:0003683   lamin/chromatin binding
  GO:0003679   DNA helicase activity
  GO:0008026   ATP-dependent helicase activity
  GO:0016887   adenosine triphosphatase activity
  GO:0004003   ATP-dependent DNA helicase activity
  GO:0008094   DNA-dependent adenosine triphosphatase activity
Here is the is_a relation (HC) defining relationships among concepts (nucleic_acid binding activity “is a kind of” binding activity):
  Sub-function                                       Function
  molecular_function                                 root
  binding                                            molecular function
  nucleic acid binding                               binding
  enzyme activity                                    molecular function
  helicase activity                                  enzyme activity
  DNA binding                                        nucleic acid binding
  chromatin binding                                  DNA binding
  lamin/chromatin binding                            chromatin binding
  DNA helicase activity                              DNA binding
  DNA helicase activity                              helicase activity
  ATP-dependent helicase activity                    helicase activity
  adenosine triphosphatase activity                  enzyme activity
  ATP-dependent helicase activity                    adenosine triphosphatase activity
  DNA-dependent adenosine triphosphatase activity    adenosine triphosphatase activity
  ATP-dependent DNA helicase activity                DNA helicase activity
  ATP-dependent DNA helicase activity                ATP-dependent helicase activity
  ATP-dependent DNA helicase activity                DNA-dependent adenosine triphosphatase activity
(Note that some Sub-functions have multiple parent Functions.)
Here’s the first entry (of the ~26K) in the GO text version (with all three parts intermixed):
  [Term]
  id: GO:0000001
  name: mitochondrion inheritance
  namespace: biological_process
  def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
  synonym: "mitochondrial inheritance" EXACT []
  is_a: GO:0048308 ! organelle inheritance
  is_a: GO:0048311 ! mitochondrion distribution
You can also get the GO as RDF XML, or as a MySQL database.
In the example, a GO concept (“name”) is being mapped to:
  - a GO ID,
  - a root “namespace”,
  - a “def,” and also to
  - a set of “synonyms”.
And, in addition, the concept may be mapped to “parent” or “child” concepts through the
  - “is_a” (subsumption),
  - “part_of” (meronomy/partonomy),
  - “regulates” (gene transcription),
  - “positively_regulates”, and/or
  - “negatively_regulates”
links as exemplified in the next slide. Remember that “is_a” is really more like “is a kind of”, and the last 3 in the list above are “non-taxonomic relations” (HR). These links define edges of the GO DAGs.
  [Term]
  id: GO:0003677
  name: DNA binding
  namespace: molecular_function
  def: "Interacting selectively with DNA (deoxyribonucleic acid)." [GOC:jl]
  subset: goslim_candida
  subset: goslim_generic
  subset: goslim_plant
  subset: goslim_yeast
  subset: gosubset_prok
  related_synonym: "microtubule/chromatin interaction" []
  narrow_synonym: "plasmid binding" []
  is_a: GO:0003676 ! nucleic acid binding

  [Term]
  id: GO:0003682
  name: chromatin binding
  namespace: molecular_function
  def: "Interacting selectively with chromatin, the network of fibers of DNA and protein that make up the chromosomes of the eukaryotic nucleus during interphase." [GOC:jl, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]
  subset: goslim_generic
  subset: goslim_pir
  subset: goslim_plant
  related_synonym: "microtubule/chromatin interaction" []
  narrow_synonym: "nuclear membrane vesicle binding to chromatin" []
  broad_synonym: "lamin/chromatin binding" []
  is_a: GO:0003677 ! DNA binding
(This was changed since 2004.)
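Entries in this text format are simple enough to read with a few lines of Python. What follows is a minimal sketch that handles only the `tag: value` lines of `[Term]` stanzas; it is not a full parser for the GO text format, and the names `SAMPLE`, `parse_terms`, and `is_a_pairs` are illustrative. The sample text is abbreviated from the two entries above.

```python
SAMPLE = """\
[Term]
id: GO:0003677
name: DNA binding
namespace: molecular_function
is_a: GO:0003676 ! nucleic acid binding

[Term]
id: GO:0003682
name: chromatin binding
namespace: molecular_function
is_a: GO:0003677 ! DNA binding
"""


def parse_terms(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif line.startswith("["):
            current = None  # some other stanza type: skip its lines
        elif line and current is not None and ":" in line:
            tag, value = line.split(":", 1)
            # Drop the trailing "! human-readable name" comment, if any.
            value = value.split(" ! ")[0].strip()
            current.setdefault(tag.strip(), []).append(value)
    return terms


terms = parse_terms(SAMPLE)

# The is_a relation as (child ID, parent ID) pairs, i.e. edges of the DAG.
is_a_pairs = {(t["id"][0], parent) for t in terms for parent in t.get("is_a", [])}
print(sorted(is_a_pairs))
```

Tags are stored as lists because several of them (`is_a`, `subset`, the synonym tags) can repeat within one entry.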
Note that the Genes listed in the previous DAG graphic are NOT part of the ontology. In fact, there is NO “DATA” in the ontology. Blake (2004) emphasizes some important features of GO as: “Not a way to unify biological database[s] Not a dictated standard Not a database of gene products, protein domains, or motifs Does not define evolutionary relationships”
In fact, GO may not even BE an ontology : “GO ontology, which is more a nomenclature and a taxonomy, than a formal ontology, is highly successful and widely used.” (Sheth, 2003) In fact, that wide usage may be due directly to the fact that it is NOT a formal ontology: “Semi-formal ontologies that may be based on limited expressive power are most practical and useful. Formal or semi-formal ontologies represented in very expressive languages…have, in practice, yielded little value in real world applications.” (Sheth, 2003) “Our object in touting the value of semi-formal ontologies is to prevent research in the Semantic Web field from leading straight into the very problems that AI found itself in.” (Sheth, 2003)
So where is the data? Here is a 2-column table that uses GO to “annotate” the products of the genes shown in the DAG graphic, taken from the Mouse Genome Informatics database:
  Gene Name   Molecular Function
  Mcmd2       GO:0003682
  Mcmd4       GO:0003682,GO:0004003
  Mcmd6       GO:0003682,GO:0004003
  Mcmd7       GO:0003682,GO:0004003
Note that only the lowest level GO ID terms are used here to identify functions. Note also that a gene product may perform multiple functions and that multiple function entries in this table are separated by commas.
Scale of the genome annotation As of August of 2004, the Mouse Genome had been annotated using the Gene Ontology as: - Function: 12K genes with 30K annotations - Process: 11K genes annotated with 21K annotations - Location: 11K genes annotated with 20K annotations
Data may be presented via a tree representation (click an entry to see data annotated with that entry):
  binding
    nucleic acid binding
      DNA binding
        chromatin binding
          lamin/chromatin binding
        DNA helicase activity
          ATP-dependent DNA helicase activity
  enzyme activity
    helicase activity
      DNA helicase activity
        ATP-dependent DNA helicase activity
      ATP-dependent helicase activity
    adenosine triphosphatase activity
      ATP-dependent helicase activity
        ATP-dependent DNA helicase activity
      DNA-dependent adenosine triphosphatase activity
        ATP-dependent DNA helicase activity
When data is annotated using the most specific GO category, membership in parent categories (supersets) must be determined by “inference”, that is, by moving up the is_a DAG or up the part_of path (if available), applying the transitivity rule. We “know” that transitivity holds, and we use it “intuitively” as we inspect the DAG: If X is_a molecular_function_1 and molecular_function_1 is_a molecular_function_2 then X is_a molecular_function_2
Or we might think of each entry in the DAG relation as being an inference rule itself, and apply these rules whenever possible. So an entry in the is_a DAG like: ( Function_1, Function_2 ) might be interpreted as the inference rule: If gene_product_X is annotated with Function_1 and Function_1 is_a Function_2 then gene_product_X could be annotated with Function_2 (Aside: In some cases the is_a relation could be interpreted in reverse order as an “includes” relation?)
One way or another: If you have the function, process, and location GO IDs for a collection of genes (which will never be in the GO itself) and you have the GO, and you have an appropriate inference capability then you should be able to answer questions that relate to the membership of any annotated item in any GO class.
How might this process actually work with questions like: “Tell me whether mouse Mcmd4 is a helicase.” which should be roughly equivalent to: “Is Mcmd4 annotated with “helicase activity” (GO:0004386) or some child thereof?” Answer: Yes. “Which mouse genes are involved in DNA binding, but are not DNA helicases?” which should be roughly equivalent to: “Which mouse genes are annotated with “DNA binding” (GO:0003677) or some child thereof, but are not annotated with “helicase activity” (GO:0004386) or some child thereof?” Answer (for the four genes in the annotation table): Mcmd2.
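One possible mechanical reading of these two questions is sketched below in Python: collect every ancestor of each annotation and test for membership. The dictionaries restate the is_a table and the four-gene annotation table using concept names instead of GO IDs, and the helper names (`ancestors_or_self`, `has_function`) are illustrative assumptions.

```python
# Child concept -> list of parent concepts, restating the is_a (HC) table.
IS_A = {
    "binding": ["molecular_function"],
    "nucleic acid binding": ["binding"],
    "enzyme activity": ["molecular_function"],
    "helicase activity": ["enzyme activity"],
    "DNA binding": ["nucleic acid binding"],
    "chromatin binding": ["DNA binding"],
    "lamin/chromatin binding": ["chromatin binding"],
    "DNA helicase activity": ["DNA binding", "helicase activity"],
    "ATP-dependent helicase activity": [
        "helicase activity", "adenosine triphosphatase activity"],
    "adenosine triphosphatase activity": ["enzyme activity"],
    "DNA-dependent adenosine triphosphatase activity": [
        "adenosine triphosphatase activity"],
    "ATP-dependent DNA helicase activity": [
        "DNA helicase activity", "ATP-dependent helicase activity",
        "DNA-dependent adenosine triphosphatase activity"],
}

# Gene -> most specific annotated functions (GO:0003682 and GO:0004003 by name).
ANNOTATIONS = {
    "Mcmd2": {"chromatin binding"},
    "Mcmd4": {"chromatin binding", "ATP-dependent DNA helicase activity"},
    "Mcmd6": {"chromatin binding", "ATP-dependent DNA helicase activity"},
    "Mcmd7": {"chromatin binding", "ATP-dependent DNA helicase activity"},
}


def ancestors_or_self(concept):
    """Return the concept plus everything reachable by following is_a upward."""
    found = {concept}
    stack = [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def has_function(gene, concept):
    """True if the gene is annotated with the concept or some child thereof."""
    return any(concept in ancestors_or_self(a) for a in ANNOTATIONS[gene])


print(has_function("Mcmd4", "helicase activity"))
binders_not_helicases = sorted(
    g for g in ANNOTATIONS
    if has_function(g, "DNA binding") and not has_function(g, "helicase activity")
)
print(binders_not_helicases)
```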
We could answer these questions by “inspection” because we know what the is_a relation “means”, and how to manipulate the relation “meaningfully”. However, how can we answer these questions “mechanically” using a program? In particular, if we interpret the is_a relation entries as inference rules, how can we process these queries? First we will think of the queries as “assertions”, like: “Mcmd4 is a helicase.” and “Gene_product_X displays “helicase activity”” and try to prove (or “satisfy”) them by using the inference rules to derive a list of facts provable from the given data.
Suppose you want to answer the question: “Does mouse Mcmd4 display helicase activity?” Start with a “fact base” composed of the set of known “facts” from your annotation database: { ( Mcmd4, chromatin binding ), ( Mcmd4, ATP-dependent DNA helicase activity ) } Then repeatedly apply the inference rules to add facts to the collection of facts in the “fact base”, and stop when the target assertion appears in the fact base, or when no new facts have been added during a step. At that point, if the assertion is in the fact base, it has been “proved” to be true, else it is false. (This is a “forward-chaining” inference process.)
After step one, the “fact base” will contain (assuming the entries in the DAG relation are processed in the order presented earlier): { ( Mcmd4, chromatin binding ), ( Mcmd4, ATP-dependent DNA helicase activity ), ( Mcmd4, DNA helicase activity ), ( Mcmd4, DNA binding ), ( Mcmd4, ATP-dependent helicase activity ), ( Mcmd4, DNA-dependent adenosine triphosphatase activity ) }
After step two, the fact base will contain: { ( Mcmd4, chromatin binding ), ( Mcmd4, ATP-dependent DNA helicase activity ), ( Mcmd4, DNA helicase activity ), ( Mcmd4, DNA binding ), ( Mcmd4, ATP-dependent helicase activity ), ( Mcmd4, DNA-dependent adenosine triphosphatase activity ), ( Mcmd4, helicase activity ), ( Mcmd4, adenosine triphosphatase activity ), ( Mcmd4, nucleic acid binding ) } at which point we can stop, because the assertion has been “proved”. To resolve the second query we initialize the fact base with the entire annotation database, infer new facts until no facts can be added, and then list the gene products that have a “DNA binding” fact but no “helicase activity” fact. (Aside: How would this be done using SQL?)
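The forward-chaining procedure can be sketched in Python as follows. Each (sub-function, function) pair from the is_a table is treated as one inference rule, and every step applies all rules to the facts known at the start of that step. The names `RULES` and `forward_chain` are illustrative, and the top-level concept is spelled `molecular_function` throughout as a simplifying assumption.

```python
# Each (sub-function, function) pair from the is_a table acts as one rule.
RULES = [
    ("molecular_function", "root"),
    ("binding", "molecular_function"),
    ("nucleic acid binding", "binding"),
    ("enzyme activity", "molecular_function"),
    ("helicase activity", "enzyme activity"),
    ("DNA binding", "nucleic acid binding"),
    ("chromatin binding", "DNA binding"),
    ("lamin/chromatin binding", "chromatin binding"),
    ("DNA helicase activity", "DNA binding"),
    ("DNA helicase activity", "helicase activity"),
    ("ATP-dependent helicase activity", "helicase activity"),
    ("adenosine triphosphatase activity", "enzyme activity"),
    ("ATP-dependent helicase activity", "adenosine triphosphatase activity"),
    ("DNA-dependent adenosine triphosphatase activity",
     "adenosine triphosphatase activity"),
    ("ATP-dependent DNA helicase activity", "DNA helicase activity"),
    ("ATP-dependent DNA helicase activity", "ATP-dependent helicase activity"),
    ("ATP-dependent DNA helicase activity",
     "DNA-dependent adenosine triphosphatase activity"),
]


def forward_chain(facts, target=None):
    """Add inferred (gene, function) facts until target appears or nothing changes.

    Returns the final fact base and the number of steps taken.
    With target=None the fact base is saturated completely.
    """
    facts = set(facts)
    steps = 0
    while target not in facts:
        new = {
            (gene, parent)
            for (gene, function) in facts
            for (child, parent) in RULES
            if function == child
        } - facts
        if not new:
            break  # no rule produced anything new: stop
        facts |= new
        steps += 1
    return facts, steps


start = {
    ("Mcmd4", "chromatin binding"),
    ("Mcmd4", "ATP-dependent DNA helicase activity"),
}
goal = ("Mcmd4", "helicase activity")
facts, steps = forward_chain(start, goal)
print(goal in facts, steps, len(facts))
```

Calling `forward_chain(start)` without a target saturates the fact base, which is the mode needed for the second query.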
How would this be done using SQL? It might be possible to use a series of self-joins to get records that include the full path from each concept to root. On the other hand, it would probably be better to compute the paths using some external tool and store the result as a table of concept-ancestor pairs (for each DAG) like: record_count, namespace, concept ID, ancestor ID where the GO IDs are foreign keys. The record_count might be useful to identify the order of discovery during traversal. It might also be useful to include a "generation offset" from the concept to each ancestor. Query resolution would then require simple SQL requests for concept-ancestor pairs.
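The precomputed concept-ancestor table can be sketched with Python's built-in `sqlite3` module. The table and column names are illustrative, the `record_count` column is omitted, and the ancestor rows are a hand-written subset covering only the two GO IDs used in the annotation table (with each concept also listed as its own ancestor at offset 0, an assumption that keeps "or some child thereof" queries to a single join).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (gene TEXT, go_id TEXT);
CREATE TABLE concept_ancestor (
    namespace TEXT, concept_id TEXT, ancestor_id TEXT, generation_offset INTEGER
);
""")

conn.executemany("INSERT INTO annotation VALUES (?, ?)", [
    ("Mcmd2", "GO:0003682"),
    ("Mcmd4", "GO:0003682"), ("Mcmd4", "GO:0004003"),
    ("Mcmd6", "GO:0003682"), ("Mcmd6", "GO:0004003"),
    ("Mcmd7", "GO:0003682"), ("Mcmd7", "GO:0004003"),
])

NS = "molecular_function"
conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?, ?, ?)", [
    # chromatin binding and (some of) its ancestors
    (NS, "GO:0003682", "GO:0003682", 0),
    (NS, "GO:0003682", "GO:0003677", 1),  # DNA binding
    (NS, "GO:0003682", "GO:0003676", 2),  # nucleic acid binding
    (NS, "GO:0003682", "GO:0005488", 3),  # binding
    # ATP-dependent DNA helicase activity and (some of) its ancestors
    (NS, "GO:0004003", "GO:0004003", 0),
    (NS, "GO:0004003", "GO:0004386", 2),  # helicase activity
    (NS, "GO:0004003", "GO:0003677", 2),  # DNA binding
    (NS, "GO:0004003", "GO:0003676", 3),  # nucleic acid binding
    (NS, "GO:0004003", "GO:0005488", 4),  # binding
])


def genes_under(go_id):
    """Genes annotated with go_id or with any concept that has it as ancestor."""
    rows = conn.execute(
        """SELECT DISTINCT a.gene
           FROM annotation AS a
           JOIN concept_ancestor AS c ON c.concept_id = a.go_id
           WHERE c.ancestor_id = ?
           ORDER BY a.gene""",
        (go_id,),
    )
    return [row[0] for row in rows]


helicases = genes_under("GO:0004386")
binders_only = sorted(set(genes_under("GO:0003677")) - set(helicases))
print(helicases, binders_only)
```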
Merging and building ontologies If you confront 2 databases each with its own ontology, you MIGHT be able to map one to the other if you want to combine them or query both using the ontology of just 1. There has been a lot of research in this area, and apparently a handful (or 2) of tools have been developed to help, but . . . “There are multiple tools to merge or map ontologies, but they are quite difficult to use and require some user editing in order to obtain reliable results.” (Pasquier, 2008) In fact, there are also “no standardized methods for building ontologies” (Stevens, et al., 2000), and even though there exist multiple toolsets to help, building ontologies remains difficult.
Shirky (2008) argues that ontologies are not necessarily the best way to annotate all kinds of data. He provides a list of domain and user characteristics that bode well for success: “Domain has a small corpus, formal categories, stable entities, restricted entities, and clear edges.” The “participants” are expert catalogers and include authoritative sources of judgment, and the users are organized, and expert in their use of the ontology. He cites the psychiatric Diagnostic and Statistical Manual (DSM-IV) and the Periodic Table as examples. One might add that the domain categories are stable, but not too stable; the categorization structure is irregular; and there are storage space constraints.
Summary There are many (not entirely consistent) definitions of ontology. The “Big” definition provides a concrete toehold that helps clarify the other definitions, and can be used to structure further work. The Gene Ontology is, and can be, used to annotate life-sciences data. Programs can be written to use the Gene Ontology and annotated data to answer queries (prove assertions). Building ontologies may be difficult, but should be worth the effort in many circumstances.
References
Aktas, Mehmet, and Marlon Pierce, Semantic Web and RDF Ontologies. http://grids.ucs.indiana.edu/ptliupages/presentations/SemanticWeb&RDFOntology.ppt
Berners-Lee, Tim, James Hendler and Ora Lassila, The Semantic Web, Scientific American, May 2001. http://www.sciam.com/article.cfm?id=the-semantic-web
Blake, Judith, “Using the Gene Ontology for Data Analysis”. http://www.geneontology.org/teaching_resources/presentations/2004-11_dataanalysis_jblake.ppt
Feigenbaum, Lee, Ivan Herman, Tonya Hongsermeier, Eric Neumann and Susie Stephens, The Semantic Web in Action, Scientific American, 2007. Ignore this article. http://thefigtrees.net/lee/sw/sciam/semantic-web-in-action#single-page
Gruber, Tom, “What is an Ontology?”, Personal web site. http://www.ksl.stanford.edu/kst/what-is-an-ontology.html
Jonquet, Clement, Mark A. Musen, Nigam H. Shah, Help will be provided for this task: Ontology-Based Annotator Web Service, International Semantic Web Conference (ISWC08), Karlsruhe, Germany. May 2008. http://bmir.stanford.edu/file_asset/index.php/1321/ISWC08_Jonquet_Musen_Shah_final.pdf
Niepert, Mathias, Cameron Buckner, and Colin Allen, Answer Set Programming on Expert Feedback to Populate and Extend Dynamic Ontologies, Association for the Advancement of Artificial Intelligence, 2008. http://inpho.cogs.indiana.edu/Papers/2008-InPhO-flairs.pdf
Pidcock, Woody, What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model?, Web article, 2003. http://www.metamodel.com/article.php?story=20030115211223271
Rubin, Daniel L., Dilvan A. Moreira, Pradip P. Kanjamala, and Mark A. Musen, BioPortal: A Web Portal to Biomedical Ontologies, AAAI Spring Symposium Series, Symbiotic Relationships between Semantic Web and Knowledge Engineering, Stanford University, (in press). Published 2008. http://bmir.stanford.edu/file_asset/index.php/1298/AAAI-BioPortal-2008.pdf
Shadbolt, Nigel, Wendy Hall and Tim Berners-Lee, The Semantic Web Revisited, IEEE Intelligent Systems, 2006. http://eprints.ecs.soton.ac.uk/12614/1/Semantic_Web_Revisted.pdf
Sheth, Amit, Cartic Ramakrishnan, Semantic (Web) Technology In Action: Ontology Driven Information Systems for Search, Integration and Analysis, IEEE Data Engineering Bulletin, Special issue on Making the Semantic Web Real, U. Dayal, H. Kuno, and K. Wilkinson, Eds., December 2003. http://lsdis.cs.uga.edu/library/download/SR03-BW.pdf
Shirky, Clay, “Ontology is overrated”, Personal website. http://www.shirky.com/writings/ontology_overrated.html
Stevens, Robert, Carole A. Goble, and Sean Bechhofer, Ontology-based knowledge representation for bioinformatics, Briefings in Bioinformatics, November 2000. http://bib.oxfordjournals.org/cgi/reprint/1/4/398?ck=nck
Wikipedia, Ontology (information science), June 2008. http://en.wikipedia.org/wiki/Ontology_(computer_science)