Obol: Open Bio-Ontology Language
Using grammars to extract and use implicit knowledge in the GO and OBO
Chris Mungall, Berkeley Drosophila Genome Project / GO Consortium
Obol
• Obol is a system for discovering and reasoning over hidden knowledge in ontologies
• Obol is useful for helping maintain cross-products in the Gene Ontology
• Obol works by parsing syntax and semantics from GO and OBO terms
Motivation: Ontology Maintenance
• GO: 3 ontologies, 16k terms, 23k relationships
• OBO: cell, biochemical, sequence and multiple anatomical ontologies
• Many GO terms are combinatorial (cross-products)
  • “regulation of neutrophil differentiation”
• No explicit links between ontologies
• Difficult to maintain manually
Some Sample GO terms
‘regulation of neutrophil differentiation.’
‘neutrophil differentiation.’
‘granulocyte differentiation.’
‘smooth muscle contraction.’
‘nucleolar chromatin.’
‘nucleolus.’
‘oxygen transport.’
‘negative regulation of interleukin-2 biosynthesis.’
‘oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced iron-sulfur protein as one donor, and incorporation of one atom of oxygen.’
Graph complexity
[Figure: graph of is-a and part-of relationships among the terms biosynthesis; regulation of biosynthesis; negative regulation of biosynthesis; cytokine biosynthesis; regulation of cytokine biosynthesis; negative regulation of cytokine biosynthesis; interleukin-2 biosynthesis; regulation of interleukin-2 biosynthesis; negative regulation of interleukin-2 biosynthesis]
Automatic inference of relationships
• Some relationships can be derived computationally…
• …provided we have complete logical definitions
  regulation ^ (regtype:negative) ^ (regprocess:biosynthesis ^ (makes:interleukin-2))
• Tools exist for reasoning over these logical definitions, but…
Generating logical definitions
• Generating and maintaining logical definitions for GO/OBO is non-trivial
• Obol exploits the highly regular grammatical structure of GO term names
  • “regulation of X”, never “X regulation”
  • “Y biosynthesis”, never “biosynthesis of Y”
  • no stemming required
• Obol derives candidate class definitions from term names, and performs basic reasoning over them
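The regularity described above can be illustrated with a rough Python sketch. It uses two hard-coded string patterns rather than Obol's grammar, so it is a cruder stand-in for the real method; the function name `candidate_definition` and the tuple/dict representation are assumptions made for this example, while the property names `regprocess` and `makes` are taken from the slides.

```python
import re

def candidate_definition(name):
    """Derive a candidate class definition from a term name.

    Only two naming patterns are handled ("regulation of X" and
    "Y biosynthesis"); anything else is returned unchanged as an
    atomic term. This is an illustration, not Obol's grammar.
    """
    match = re.fullmatch(r"regulation of (.+)", name)
    if match:
        # The regulated process may itself be a composite term.
        return ("regulation", {"regprocess": candidate_definition(match.group(1))})
    match = re.fullmatch(r"(.+) biosynthesis", name)
    if match:
        return ("biosynthesis", {"makes": match.group(1)})
    return name

definition = candidate_definition("regulation of interleukin-2 biosynthesis")
print(definition)
```

Because GO names never use the alternative word orders, two fixed patterns are enough for this toy case; a grammar generalises the same idea to arbitrary nesting.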
Obol: parsing and reasoning
• GO/OBO term (lexical string): “interleukin-2 biosynthesis”
• Class definition(s), which may involve relationships to other OBO terms: biosynthesis^(makes:interleukin-2)
• Inferences using definitions and existing ontologies: “interleukin-2 biosynthesis” is_a “cytokine biosynthesis”, inferred from: interleukin-2 is_a cytokine
How Obol Works
• term names are broken into lexical tokens (words) using a tokeniser
• tokens are parsed using a grammar, generating parse trees
• parse trees are turned into class definitions using transformation rules and property definitions
  • transformation is reversible
• class definitions are reasoned over
• implemented in XSB Prolog
Word tokens
• Obol uses an atomic vocabulary of word tokens
• tokens are partitioned by ontology domain
  • cell, anatomy, biological process, etc
• tokens have a grammatical type
  • adj, noun, prep, relational adj, special
• vocabularies need not be correct or complete
Computational Grammars
• formal grammars can elucidate sentence structure
• grammars transform token lists into parse trees
  • multiple parses may be possible
  • parses are reversible
• a grammar is a collection of transformation rules
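As a minimal sketch of the tokenising step, assuming a tiny hand-written vocabulary: the entries in `VOCAB`, their domain labels, and the function `tokenise` are all hypothetical and are not Obol's actual word lists (Obol itself is written in XSB Prolog).

```python
# Hypothetical vocabulary: word -> (grammatical type, ontology domain).
VOCAB = {
    "negative": ("adj", "general"),
    "regulation": ("noun", "biological process"),
    "of": ("prep", "general"),
    "interleukin-2": ("noun", "biochemical"),
    "biosynthesis": ("noun", "biological process"),
}

def tokenise(term_name):
    """Split a term name on whitespace and tag each word.

    Words missing from the vocabulary are tagged 'unknown' instead of
    being rejected, since the vocabulary need not be complete.
    """
    return [(word,) + VOCAB.get(word, ("unknown", "unknown"))
            for word in term_name.split()]

for token in tokenise("negative regulation of interleukin-2 biosynthesis"):
    print(token)
```

Tagging unknown words rather than failing lets a later stage decide whether a term is unparseable.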
A simple OBO term grammar (subset of the whole OBO grammar)
Term --> NP       e.g. negative regulation of interleukin-2 biosynthesis
NP --> NP PP      e.g. negative regulation of interleukin-2 biosynthesis
NP --> NOUN       e.g. interleukin; regulation; biosynthesis
NP --> NP-TOK     e.g. interleukin-2
NP --> ADJ NP     e.g. negative regulation
NP --> NP NP      e.g. interleukin-2 biosynthesis
PP --> PREP NP    e.g. of interleukin-2 biosynthesis
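The grammar above can be exercised with a short memoised span parser in Python. This is a sketch under assumptions: the `LEXICON` tags, the tuple tree representation, and the function `parse_term` are invented here, and the memoisation only loosely mirrors what tabling provides in XSB Prolog. Splitting on spans means the left-recursive rules (NP --> NP PP, NP --> NP NP) terminate, because each sub-span is strictly shorter.

```python
from functools import lru_cache

# Hypothetical lexicon: word -> terminal category.
LEXICON = {
    "negative": "ADJ",
    "regulation": "NOUN",
    "of": "PREP",
    "interleukin-2": "NP-TOK",
    "biosynthesis": "NOUN",
}
# Binary rules from the slide, keyed by left-hand side.
BINARY = {
    "NP": [("NP", "PP"), ("ADJ", "NP"), ("NP", "NP")],
    "PP": [("PREP", "NP")],
}
# Unary rules onto terminal categories: NP --> NOUN, NP --> NP-TOK.
LEXICAL = {"NP": ("NOUN", "NP-TOK")}

def parse_term(name):
    """Return every parse tree of `name` under the Term grammar."""
    words = tuple(name.split())

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        found = []
        if j - i == 1:
            tag = LEXICON.get(words[i])
            if tag == symbol:
                found.append((symbol, words[i]))
            if tag in LEXICAL.get(symbol, ()):
                found.append((symbol, (tag, words[i])))
        for left, right in BINARY.get(symbol, ()):
            for k in range(i + 1, j):  # split point between the children
                for lt in trees(left, i, k):
                    for rt in trees(right, k, j):
                        found.append((symbol, lt, rt))
        return tuple(found)

    # Term --> NP over the whole token list.
    return [("Term", t) for t in trees("NP", 0, len(words))]

parses = parse_term("negative regulation of interleukin-2 biosynthesis")
print(len(parses), "parses")
```

This unrestricted grammar gives more than one parse for the example term, which is why precedence rules or property definitions are needed to pick the intended reading.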
Applying grammar rules
[Figure: parse tree for “negative regulation of interleukin-2 biosynthesis”, built bottom-up from the word tokens by applying the rules np -> n, np -> np-tok, np -> adj np, np -> np np, pp -> p np, np -> np pp and term -> np]
Generating Class Definitions
• A parse tree shows the syntax structure of a term
• A class definition is a description of the meaning of a term
• An Obol classdef is a cross product (intersection) of necessary and sufficient conditions
• Classdefs are generated from parse trees using tree transform rules and property descriptions
• Classdefs can be exported using obo or OWL format
Property definitions guide class construction
• Property name: makes; domain: biosynthesis; range: substance; grammar: np_modifier
• Parse: np -> np np
• interleukin-2 biosynthesis => biosynthesis^(makes:interleukin-2)
Property definitions guide class construction
• Property name: regtype; domain: regulation; range: neg/pos; grammar: np_modifier
• Parse: np -> adj np
• negative regulation => regulation^(regtype:negative)
Property definitions guide class construction
• Property name: regprocess; domain: regulation; range: biological_process; grammar: prep(of)
• Parse: np -> np pp
• regulation^(regtype:negative) of biosynthesis^(makes:IL-2) => regulation^(regtype:negative)^(regprocess:biosynthesis^(makes:IL-2))
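The three property definitions can be put together in a small Python sketch that builds the class definition step by step. The property names, domains, ranges and grammar labels come from the slides; the `TYPE_OF` table, the `(stem, properties)` tuple representation, and the helpers `attach` and `show` are assumptions made for illustration.

```python
PROPERTY_DEFS = [
    {"name": "makes", "domain": "biosynthesis",
     "range": "substance", "grammar": "np_modifier"},
    {"name": "regtype", "domain": "regulation",
     "range": "neg/pos", "grammar": "np_modifier"},
    {"name": "regprocess", "domain": "regulation",
     "range": "biological_process", "grammar": "prep(of)"},
]

# Hypothetical typing of stems, used for the range check.
TYPE_OF = {
    "interleukin-2": "substance",
    "negative": "neg/pos",
    "biosynthesis": "biological_process",
    "regulation": "biological_process",
}

def attach(head, construction, filler):
    """Extend classdef `head` with the property that `construction` licenses.

    A classdef is (stem, ((property, filler_classdef), ...)).
    """
    for prop in PROPERTY_DEFS:
        if (prop["grammar"] == construction
                and prop["domain"] == head[0]
                and prop["range"] == TYPE_OF.get(filler[0])):
            return (head[0], head[1] + ((prop["name"], filler),))
    raise ValueError("no property fits %r %s %r" % (head, construction, filler))

def show(classdef):
    """Render a classdef in the stem^(property:filler) notation."""
    stem, props = classdef
    return stem + "".join("^(%s:%s)" % (n, show(f)) for n, f in props)

biosynthesis = attach(("biosynthesis", ()), "np_modifier", ("interleukin-2", ()))
regulation = attach(("regulation", ()), "np_modifier", ("negative", ()))
full = attach(regulation, "prep(of)", biosynthesis)
print(show(full))
```

The domain and range checks are what stop, say, `makes` from being attached to a regulation stem when the same grammatical construction is seen.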
Unparseable terms and multi-parse terms
[Figure: chart of unparseable and multi-parse terms for biological process, molecular function and cellular component; single-token terms excluded from this analysis]
Reasoning over class definitions
• Using class definitions, we can:
  • autocreate parentage for new terms
  • check for missing relationships
  • find inconsistencies between ontologies
  • generate implicit orthogonal ontologies
• Method:
  • Use native OBOL rules (via prolog or DAG-Edit)
  • OR use external reasoner; eg RACER, FaCT
Finding missing relationships
• Obol is run periodically on GO to check for missing IS A and PART OF relationships
• Multiple parses produce false-positives
• 223 missing relationships added to GO
• ToDo: increase specificity by improving vocabularies and property definitions
Obol sample report
nucleolar chromatin PART OF nucleus
clathrin-coated vesicle HAS PART clathrin coat
chromoplast membrane IS A plastid membrane
nuclear microtubule PART OF nucleus
vitamin E biosynthesis IS A vitamin E metabolism
uracil permease activity IS A permease activity
chloroplast envelope IS A plastid envelope
negative regulation of lipid biosynthesis IS A negative regulation of lipid metabolism
ketone body metabolism IS A ketone metabolism
dense nuclear body IS A nuclear body
Annotations on the slide: “inverse present”; “false positive!”
Aligning to the OBO cell ontology
• most differentiation terms align precisely; some don’t…
[Figure: the GO terms “cardiac cell differentiation” and “cardioblast differentiation” set against the cell ontology terms animal cell, mesodermal cell, muscle cell, cardiac muscle cell and cardioblast, with a DEVELOPS FROM link and one alignment marked “???”]
Obol as an ontology curation tool
• Obol can be used by GO curators in a variety of ways
• “Behind the scenes”
  • Iterative
    • GO curator receives periodic suggestion reports
  • Continuous
    • GO curator uses OBOL interactively via DAG-Edit plugin
• To help the transition to a fully specified ontology
  • GO curators then maintain class definitions
• Obol as a search tool?
Problems to address
• Integration with curation process
• Memory usage
• Syntax parsing
  • chemical terms, long terms
• Dealing with and, or and not
• Generating text definitions
• Word list maintenance
  • solution: integrate with ontology maintenance
• Ontology dependencies
  • protein and generic anatomy ontologies needed
  • Obol can be used to help generate these
Conclusions
• Obol is useful for maintaining large GO-style ontologies
• combination of semantic parsing with reasoning is powerful
• benefits of both GO-style ontology development and formal reasoning
Acknowledgements
• Berkeley/GO: John Richter, Brad Marshall, Karen Eilbeck, Suzanna Lewis, Gerry Rubin
• Jackson Labs/GO: David Hill, Joel Richardson, Judith Blake
• GO Curators: Midori Harris, Jennifer Clark, Amelia Ireland, Jane Lomax
• Manchester: Chris Wroe, Robert Stevens, Phillip Lord
• J Michael Cherry
• Michael Ashburner
• all the GO Consortium
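The OBO Universe (partial)
[Figure: OBO domains shown: Chemical, Phenotype, Function, Process, Protein, Component, Cell, Sequence, Anatomy, Fly-Anat, Fish-Anat]
Detecting inconsistencies
[Figure: the terms X, Y and Z shown alongside the terms “X binding”, “Y binding” and “Z binding”]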
Prolog as an ontology language
% DATABASE OF FACTS
isa(carb_binding, binding).
isa(polysac_binding, carb_binding).
isa(chitin_binding, polysac_binding).
isa(cellulose_binding, polysac_binding).
% INFERENCE RULES (isaT is the transitive closure of isa)
isaT(X,Y) :- isa(X,Y).
isaT(X,Y) :- isa(X,Z), isaT(Z,Y).
% QUERIES
?- isaT(chitin_binding, binding).
YES
?- isaT(X, polysac_binding).
X = chitin_binding.
X = cellulose_binding.
?- isaT(chitin_binding, cellulose_binding).
NO
?- isaT(X,Y). % returns all paths
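For readers less familiar with Prolog, the same facts and transitive rule can be sketched in Python. The names `ISA`, `isa_t` and `descendants` are invented for this example, and the recursion assumes the is_a graph has no cycles.

```python
# The four is_a facts from the slide, as (child, parent) pairs.
ISA = {
    ("carb_binding", "binding"),
    ("polysac_binding", "carb_binding"),
    ("chitin_binding", "polysac_binding"),
    ("cellulose_binding", "polysac_binding"),
}

def isa_t(x, y):
    """True if x is_a y directly or through a chain of is_a links."""
    return any(child == x and (parent == y or isa_t(parent, y))
               for child, parent in ISA)

def descendants(y):
    """All terms x for which isa_t(x, y) holds."""
    return {child for child, _ in ISA if isa_t(child, y)}

print(isa_t("chitin_binding", "binding"))
print(sorted(descendants("polysac_binding")))
print(isa_t("chitin_binding", "cellulose_binding"))
```

The Prolog version gets enumeration of bindings for free through unification; in Python the "which X?" query needs the separate `descendants` helper.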
Prolog internal representation
class(regulation <process>
  qualifier=class(negative <general>)
  regulates=class(contraction <process>
    affects_cell_type=class(muscle <anatomical>
      has_type=class(smooth <general>))))
[Figure: the same definition drawn as a graph with nodes regulation, negative, contraction, muscle and smooth, with an edge labelled regulates]
Prolog Grammar Implementation
• Prolog: the classic logic programming language
  • High-level declarative language, natural choice for ontologies; built in “database”
  • Definite Clause Grammars (DCGs) are part of the language; DCGs allow passing data up the parse tree
• XSB Prolog
  • Uses tabling (more efficient, less re-calculation)
  • Tabling + DCGs = chart parsing (Earley's algorithm)
  • This means we can have left-recursive grammars
A Formal Grammar for OBO terms
• All(?) GO/OBO terms are NOUN-PHRASES (exception: phenotypes?)
• A NOUN-PHRASE is (recursively) made from
  • a NOUN (includes inflected verbs; eg binding)
  • an ADJECTIVE followed by a NOUN-PHRASE; eg inner membrane
  • a NOUN-PHRASE preceded by a NOUN-PHRASE acting as ADJECTIVE; eg clathrin coat
  • a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg regulation of transcription
  • an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE; eg clathrin-coated vesicle
• Precedence rules are also required to prune the parse forest
• Simple but effective
Sentence -> Subject Verb Object
Subject -> Article Noun
Object -> Article Noun
Article -> a | the
Verb -> ate | chased
Noun -> cat | banana | mouse
[Figure: parse tree for a simple sentence, “the cat ate the banana”]
• A formal grammar is a set of production rules operating over terminal symbols (eg words) and non-terminal symbols (eg word/phrase categories)
• The rules determine how sequences of symbols can be transformed, making a parse tree
• GENERATING: Start at top ('sentence') and apply rules until all symbols are terminal
• PARSING: Start at bottom, with a sequence of terminals; apply rules, combining symbols if necessary
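The generating direction can be sketched in a few lines of Python. The `RULES` table transcribes the six productions above; the function name `generate` and the list-of-words representation are assumptions for this example. Because this toy language is finite, recognising a sentence can be done here by simple membership in the generated set, which is not how a real parser works.

```python
from itertools import product

# The production rules from the slide; symbols without an entry are terminals.
RULES = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject": [["Article", "Noun"]],
    "Object": [["Article", "Noun"]],
    "Article": [["a"], ["the"]],
    "Verb": [["ate"], ["chased"]],
    "Noun": [["cat"], ["banana"], ["mouse"]],
}

def generate(symbol):
    """Expand `symbol` top-down into every word sequence it can produce."""
    if symbol not in RULES:
        return [[symbol]]  # terminal: a single word
    results = []
    for right_hand_side in RULES[symbol]:
        expansions = [generate(s) for s in right_hand_side]
        for parts in product(*expansions):
            results.append([word for part in parts for word in part])
    return results

sentences = {" ".join(words) for words in generate("Sentence")}
print("the cat ate the banana" in sentences)
print(len(sentences))
```

The grammar is purely syntactic, so it also produces sentences such as “the banana chased a cat”.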
Parse tree
[Figure: parse tree for “negative regulation of smooth muscle contraction”, with tokens tagged ADJ NOUN P ADJ NOUN NOUN and nodes labelled with the rules NP -> NP* PP, PP -> P NP*, NP -> NP NP* and NP -> ADJ NOUN*]
• thick lines indicate stem terms
• the parse tree shows the syntactic structure
• recurse down the tree applying grammatical context rules to get property fillers
Class Definition (shown as DAG)
[Figure: DAG with nodes regulation, negative, contraction, muscle and smooth, a regulates edge, and cross-product (X) nodes for “negative regulation of smooth muscle contraction”, “smooth muscle contraction” and “smooth muscle”]
• the classdef shows the logical structure
• we need new OBO format (or OWL) to represent cross-products (aka intersections / complete defs)
• DAG definition is minimal: non-definitional relationships to other terms are not shown
COPII-coated vesicle membrane
class(membrane <component> “COPII-coated vesicle membrane”
  part_of=class(vesicle <component> “COPII-coated vesicle”
    has_part=class(coat <component> “COPII coat”
      made_from=class(COPII <complex> “COPII”))))
• class/term name shown in quotes; these can be derived by reversing the transformation
• the above classdef is consistent with what is in the GO cellular_component ontology
• requires use of inverse properties (has_part vs part_of), supported in new OBO format
Outline
• Motivation: combinatorial issues with composite terms
• Approaches: annotation-time term composition vs tools for maintenance of large DAGs
• The OBOL System
  • Term decomposition using grammars
  • Generating computable logical class definitions
  • Rules and reasoning over class definitions
• Initial Results
• Strategies for using OBOL within GO/OBO
Manual maintenance of GO
• GO is 3 DAGs of over 16k terms
• Large DAGs of terms are hard to maintain
• “cross-products” produce combinatorial explosions and highly connected sub-graphs
• GO terms include OBO terms
  • eg oxygen binding; wing development
• Zipf's Law
  • Many terms not yet used in annotation (Ogren, pers. comm.)
Example combinatorial explosion
[Figure: graph with part_of and affects edges over the terms cell motility; regulation; negative regulation; contraction; regulation of contraction; negative regulation of contraction; muscle contraction; regulation of muscle contraction; negative regulation of muscle contraction; smooth muscle contraction; regulation of smooth muscle contraction; negative regulation of smooth muscle contraction. The legend distinguishes “implicit” terms from actual GO terms; implicit anatomical terms are not shown]
One Approach: Properties
• One extreme solution is to remove composite terms from the ontology altogether
• Generate “anonymous” composite terms at annotation-time via property/slot restrictions to atomic terms
  • binding ^ affects(interleukin-18)
  • contraction ^ affects(muscle+type(smooth))
• These bindings constitute a class definition
• But it is still necessary to make statements about composite terms in the ontology
  • macrophage activation is_a immune cell activity
  • fibrinolysis is_a negative regulation of blood coagulation
Another approach: Computationally aided ontology maintenance
• GO terms exhibit regularity in their syntactic structure
• Substring relationships highly correlated with actual relationships
  • regulation of smooth muscle contraction
  • smooth muscle contraction
  • muscle contraction
  • contraction
Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. 2004. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 9: 214-225.
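The substring observation is easy to try out. The following Python sketch lists every pair of terms in which one name occurs as a whole-word run inside another; the function `contains_term` and the word-boundary matching are assumptions for this example, and each pair is only a candidate relationship, with no claim about whether it is IS A or PART OF.

```python
TERMS = [
    "regulation of smooth muscle contraction",
    "smooth muscle contraction",
    "muscle contraction",
    "contraction",
]

def contains_term(longer, shorter):
    """True if the words of `shorter` occur consecutively inside `longer`."""
    long_words, short_words = longer.split(), shorter.split()
    n = len(short_words)
    if len(long_words) <= n:
        return False  # only proper sub-phrases count
    return any(long_words[i:i + n] == short_words
               for i in range(len(long_words) - n + 1))

candidate_pairs = [(longer, shorter)
                   for longer in TERMS
                   for shorter in TERMS
                   if contains_term(longer, shorter)]
for longer, shorter in candidate_pairs:
    print(repr(longer), "contains", repr(shorter))
```

Matching on whole words avoids spurious hits such as one term name appearing inside a longer unrelated word.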
OBOL: Syntax and Semantics
• What about using the term syntax to get at the meaning of the term?
• GO/OBO term: “interleukin binding”
• Class definition, which may involve relationships to other OBO terms: binding^affects(interleukin)
• Inferences: “interleukin binding” is_a “cytokine binding”, inferred from: interleukin is_a cytokine
Inference of intermediate terms and IS_As: example rule
FORALL classdef pairs,
IFF the stem-class is the same
AND all the property-values in the restriction-list are identical EXCEPT for one property, in which the property-values are linked by an isa,
THEN the classdefs are linked by an isa
class(C P1=V1 P2=V2..Px=Vx Pn=Vn) is_a class(C P1=V1 P2=V2..Px=Vx' Pn=Vn) <=> Vx is_a Vx'
class(regulation process_regulated=R qual=Q) is_a class(regulation process_regulated=R' qual=Q) <=> R is_a R'
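The rule can be sketched in Python under some simplifying assumptions: a classdef is a `(stem, properties)` pair, property values are plain term names rather than nested classdefs, and the only is_a fact is the “interleukin-2 is_a cytokine” link used earlier in the slides. The names `ISA`, `value_isa` and `classdef_isa` are invented for this example.

```python
# Known is_a facts between atomic terms, as (child, parent) pairs.
ISA = {("interleukin-2", "cytokine")}

def value_isa(a, b):
    """True if a is_a b directly or transitively (assumes no cycles)."""
    return any(child == a and (parent == b or value_isa(parent, b))
               for child, parent in ISA)

def classdef_isa(sub, sup):
    """Apply the rule: same stem, same properties, exactly one value
    differs, and the differing values are linked by is_a."""
    sub_stem, sub_props = sub
    sup_stem, sup_props = sup
    if sub_stem != sup_stem or sub_props.keys() != sup_props.keys():
        return False
    differing = [p for p in sub_props if sub_props[p] != sup_props[p]]
    return (len(differing) == 1
            and value_isa(sub_props[differing[0]], sup_props[differing[0]]))

il2_biosynthesis = ("biosynthesis", {"makes": "interleukin-2"})
cytokine_biosynthesis = ("biosynthesis", {"makes": "cytokine"})
print(classdef_isa(il2_biosynthesis, cytokine_biosynthesis))
```

Handling nested fillers, as in “regulation of interleukin-2 biosynthesis”, would mean letting `value_isa` fall back to `classdef_isa` when both values are classdefs; that extension is left out here.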