The Gene Ontology Categorizer. C.A. Joslyn 1 , S.M. Mniszewski 1 , A. Fulmer 2 and G. Heaton 3
The Gene Ontology Categorizer C.A. Joslyn1, S.M. Mniszewski1, A. Fulmer2 and G. Heaton3 1Computer and Computational Sciences, Los Alamos National Laboratory, 2Corporate Biotechnology, Miami Valley Labs and 3Corporate Functions-IT, Procter & Gamble, USA (Bioinformatics, Vol. 20, Suppl. 1, 2004, p. i169-i177)
Abstract (1/2) • Given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list? • From a drug discovery process, we wish to understand the overall effect of some cell treatment or condition by identifying ‘where’ in the GO the differentially expressed genes fall.
Abstract (2/2) • View bio-ontologies more as combinatorially structured databases than facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representation and algorithms appropriate for the GO. • Issues: categorization task, distances in ontologies and ontology merger and exchange.
1. Introduction (1/3) • In a gene expression experiment using high-throughput microarrays, a biomedical researcher needs to extract useful information about the types of biological processes affected. • The categorization task arises when the researcher wants to take the names of some genes and understand their overall function by examining their distribution through the GO: are they localized, grouped in distinct areas or spread uniformly?
1. Introduction (2/3) • The Gene Ontology Categorizer (GOC) applies novel research in the discrete mathematics of posets for semantic hierarchies to GO analysis. • Represent the GO as a poset ontology, then use pseudo-distances between comparable nodes to develop scoring functions. Finally, cluster the resulting rank-ordered list to produce a ranked list of appropriate summarizing nodes within the GO, which act as functional hypotheses about the characteristics of the genes expressed.
1. Introduction (3/3) • GO analysis weaknesses • Many researchers consider the GO simply as a list of categories, ignoring any structural relationships among the categories. • Even those researchers whose treatment is closest in spirit to the authors' consider the GO primarily as a tree, or cast it as a graph for determining distances between nodes.
2. Methodology (1/2) • A finite partially ordered set (poset) is a mathematical structure $\mathcal{P} = \langle P, \leq \rangle$, where $P$ is a finite set and $\leq \subseteq P^2$ is a reflexive, antisymmetric, transitive binary relation on $P$. • Every poset is a digraph with no cycles, and posets are more general than trees or lattices in that collections of nodes can have multiple parents. • The GO is a pair of directed acyclic graphs (DAGs), one for the is-a links and one for the has-part links.
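The three axioms in the definition above can be checked mechanically on a small relation. The following is a minimal sketch, assuming the relation is given explicitly as a set of pairs; the node names and the helper `is_partial_order` are illustrative and do not come from the paper.

```python
from itertools import product

def is_partial_order(nodes, leq):
    """Check that `leq`, a set of (a, b) pairs meaning a <= b,
    is reflexive, antisymmetric and transitive on `nodes`."""
    reflexive = all((p, p) in leq for p in nodes)
    antisymmetric = all(
        a == b or not ((a, b) in leq and (b, a) in leq)
        for a, b in product(nodes, repeat=2)
    )
    transitive = all(
        (a, c) in leq
        for (a, b) in leq
        for (b2, c) in leq
        if b == b2
    )
    return reflexive and antisymmetric and transitive

# Hypothetical toy relation: "x" has two parents ("y" and "z"),
# so the structure is a poset but not a tree.
nodes = {"x", "y", "z", "top"}
leq = {(n, n) for n in nodes} | {
    ("x", "y"), ("x", "z"), ("y", "top"), ("z", "top"), ("x", "top"),
}

print(is_partial_order(nodes, leq))                    # True
print(is_partial_order(nodes, leq | {("top", "x")}))   # False: antisymmetry fails
```

The explicit pair set is only practical for small examples; for something the size of the GO, one would normally store parent links and derive the order by reachability.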
2. Methodology (2/2) • $P_{GO}$ is the set of GO nodes, such as ‘DNA unwinding’ and ‘DNA replication’. • The ordering ≤ in ‘DNA repair ≤ DNA metabolism’ states that DNA repair is a kind of DNA metabolism. • GO, cast as a pair of posets $\mathcal{P}_{is} = \langle P_{GO}, \leq_{is} \rangle$ and $\mathcal{P}_{has} = \langle P_{GO}, \leq_{has} \rangle$ for the two kinds of relations, is a large, taxonomically organized semantic hierarchy. • This paper treats the two kinds of links as equivalent: $\mathcal{P}_{GO} = \langle P_{GO}, \leq_{GO} \rangle$, where $\leq_{GO} = \leq_{is} \cup \leq_{has}$.
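Treating the two link types as equivalent can be sketched by merging the two sets of parent links and testing the order by upward reachability. This is an illustration under stated assumptions: only ‘DNA repair ≤ DNA metabolism’ comes from the slide, the other links are hypothetical toy data, and taking the reflexive-transitive closure of the merged links is one plausible reading of the combined order rather than the authors' confirmed implementation.

```python
def reachable(start, parents):
    """All nodes q with start <= q, found by walking parent links upward."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# child -> parents. Only the repair/metabolism link is taken from the slide;
# the remaining links are hypothetical toy data, not real GO content.
is_a = {
    "DNA repair": {"DNA metabolism"},
    "DNA replication": {"DNA metabolism"},
}
has_part = {
    "DNA unwinding": {"DNA replication"},
}

# Merge both link types into one parent map.
merged = {}
for links in (is_a, has_part):
    for child, ps in links.items():
        merged.setdefault(child, set()).update(ps)

def leq_go(p1, p2):
    """True if p1 <= p2 in the merged order."""
    return p2 in reachable(p1, merged)

print(leq_go("DNA repair", "DNA metabolism"))     # True
print(leq_go("DNA unwinding", "DNA metabolism"))  # True, via one link of each type
```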
2.1 Poset theory (1/3) • Two nodes p1, p2∈P are comparable, denoted p1~p2, if either p1≤p2 or p2≤p1. • A chain C⊆P is a collection of pairwise comparable nodes. • Height H(P) is the size of the largest chain. • Two nodes p1, p2∈P are noncomparable, denoted p1≁p2, if neither p1≤p2 nor p2≤p1. • An antichain is a collection of pairwise noncomparable nodes. • Width W(P) is the size of the largest antichain.
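Comparability, height and width can be computed directly on a small poset. The sketch below assumes a hypothetical four-node poset stored as child-to-parent cover links, and it finds the width by brute force over subsets, which is only feasible for toy sizes.

```python
from itertools import combinations

# Hypothetical toy poset, child -> parents (cover links).
parents = {"x": {"y", "z"}, "y": {"top"}, "z": {"top"}, "top": set()}

def up_set(node):
    """All nodes q with node <= q."""
    seen, stack = {node}, [node]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def comparable(a, b):
    return b in up_set(a) or a in up_set(b)

def height():
    """Size (number of nodes) of the largest chain."""
    def longest_from(node):
        return 1 + max((longest_from(p) for p in parents[node]), default=0)
    return max(longest_from(n) for n in parents)

def width():
    """Size of the largest antichain, by exhaustive search."""
    nodes = sorted(parents)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(not comparable(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0

print(height())  # 3, e.g. the chain x <= y <= top
print(width())   # 2, the antichain {y, z}
```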
2.1 Poset theory (2/3) • Given two comparable nodes $p_1 \leq p_2$, the set of all nodes ‘between’ them is the interval $[p_1, p_2] = \{p : p_1 \leq p \leq p_2\}$, which is equivalent to the set of all chains between $p_1$ and $p_2$, denoted $\mathcal{C}(p_1, p_2)$. • The vector of chain lengths $h(p_1, p_2) = |\mathcal{C}(p_1, p_2)|$ is the collection of the lengths of all these chains. • The minimum and maximum chain lengths between $p_1$ and $p_2$ are $h_*(p_1, p_2) = \min_{C \in \mathcal{C}(p_1, p_2)} |C|$ and $h^*(p_1, p_2) = \max_{C \in \mathcal{C}(p_1, p_2)} |C|$, respectively.
2.1 Poset theory (3/3) • $P = \{1, A, B, \ldots, K\}$ • B and J are noncomparable, while A and B are comparable ($A \leq B$). • $[A, B] = \{A, F, G, H, I, B\}$ consists of the three chains $\mathcal{C}(A, B) = \{A \leq F \leq B,\ A \leq G \leq B,\ A \leq H \leq I \leq B\}$. • $h(A, B) = \langle 2, 2, 3 \rangle$ with $h_*(A, B) = 2$, $h^*(A, B) = 3$. • $H(P) = 5$ (a maximal chain is $D \leq E \leq I \leq C \leq 1$) and $W(P) = 5$ (the largest antichain is $\{F, G, H, E, J\}$).
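The chain-length vector for the interval [A, B] in this example can be reproduced with a short path enumeration. The sketch below encodes only the cover links that can be read off the three chains listed above (the rest of the example poset is not reconstructed), and it counts chain length as the number of links, which is an assumption chosen to match the values 2, 2, 3 on the slide.

```python
# Cover links (node -> nodes directly above it) restricted to the interval [A, B],
# read off from the three chains A<=F<=B, A<=G<=B, A<=H<=I<=B.
covers = {
    "A": ["F", "G", "H"],
    "F": ["B"],
    "G": ["B"],
    "H": ["I"],
    "I": ["B"],
    "B": [],
}

def chains(lo, hi):
    """All maximal chains from lo up to hi, each as a list of nodes."""
    if lo == hi:
        return [[hi]]
    return [[lo] + rest for up in covers[lo] for rest in chains(up, hi)]

all_chains = chains("A", "B")
interval = set().union(*all_chains)          # nodes lying on some chain
h = sorted(len(c) - 1 for c in all_chains)   # length = number of links (assumption)

print(all_chains)
print(h, min(h), max(h))   # [2, 2, 3] 2 3
```

The union of the chains recovers the interval {A, F, G, H, I, B}, which illustrates the stated equivalence between the interval and the set of chains.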
2.2 Methods (1/4) • Define a POSet Ontology (POSO) as $\mathcal{O} = \langle \mathcal{P}, X, F \rangle$, where $X$ is a finite, non-empty set of labels, and $F: X \to 2^P$ is an annotation function mapping each label $x \in X$ to a collection of nodes $F(x) \subseteq P$. • E.g. $X = \{a, b, \ldots, j\}$, $F(b) = \{A, E, F\}$. • In GOC, $\mathcal{O}_{GO} = \langle \mathcal{P}_{GO}, X_{GO}, F_{GO} \rangle$, where the gene products $X_{GO}$ and annotations $F_{GO}$ are provided by the GO file.
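An annotation function of this kind maps naturally onto a dictionary from labels to node sets. In the sketch below only F(b) = {A, E, F} is taken from the slide; the entries for labels a and c, and the helper names, are hypothetical and exist purely to make the example runnable.

```python
from collections import defaultdict

# Annotation function F: label -> set of annotated nodes.
F = {
    "b": {"A", "E", "F"},   # from the slide
    "a": {"F"},             # hypothetical
    "c": {"J"},             # hypothetical
}

def annotated_nodes(labels):
    """Union of F(x) over a query list of labels (e.g. genes of interest)."""
    result = set()
    for x in labels:
        result |= F.get(x, set())
    return result

# Inverse view: which labels are annotated to each node.
labels_at = defaultdict(set)
for label, node_set in F.items():
    for node in node_set:
        labels_at[node].add(label)

print(sorted(annotated_nodes(["b", "c"])))  # ['A', 'E', 'F', 'J']
print(sorted(labels_at["F"]))               # ['a', 'b']
```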
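2.2 Methods (2/4) • A pseudo-distance function $\delta: P^2 \to \mathbb{R}$ can be chosen as: • The minimum path length $\delta_m = h_*$ • The maximum path length $\delta_x = h^*$ • The average of the extreme path lengths, $(h_* + h^*)/2$ • The average of all path lengths, i.e. the mean of the entries of $h$ • In each case $h_*(p_1, p_2) \leq \delta(p_1, p_2) \leq h^*(p_1, p_2)$. • A normalized distance is defined as $\bar{\delta} = \delta / H(P)$.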
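The four pseudo-distances and their normalization can be illustrated on the chain-length vector from the earlier example, h(A, B) = ⟨2, 2, 3⟩ with H(P) = 5. This is a sketch only: the dictionary keys and the function name `pseudo_distances` are hypothetical labels, not notation from the paper.

```python
from statistics import mean

def pseudo_distances(h, height):
    """Return (raw, normalized) pseudo-distances from a chain-length vector h,
    where normalization divides by the poset height H(P)."""
    lo, hi = min(h), max(h)
    raw = {
        "min": lo,                       # minimum path length
        "max": hi,                       # maximum path length
        "avg_extremes": (lo + hi) / 2,   # average of the extreme path lengths
        "avg_all": mean(h),              # average of all path lengths
    }
    # Every choice lies between the minimum and maximum chain length.
    assert all(lo <= v <= hi for v in raw.values())
    normalized = {name: v / height for name, v in raw.items()}
    return raw, normalized

raw, norm = pseudo_distances([2, 2, 3], height=5)
print(raw)
print(norm)
```

For this input the minimum and maximum distances are 2 and 3, the average of extremes is 2.5, and the normalized minimum distance is 2/5.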
2.2 Methods (3/4) • A scoring function $S_Y(p)$ returns the weighted rank of a node $p \in P$ based on the requested nodes $Y$. • Two kinds of scores • An unnormalized score $S_Y: P \to \mathbb{R}^+$, which returns an ‘absolute’ number • A normalized score, which returns a ‘relative’ number.
2.2 Methods (4/4) • $s \in \{\ldots, -1, 0, 1, 2, 3, \ldots\}$, where low $s$ emphasizes coverage and high $s$ emphasizes specificity. Let $r = 2^s$; then we have four scoring functions: • Unnormalized distance and unnormalized score: • Unnormalized distance and normalized score: • Normalized distance and unnormalized score: • Normalized distance and normalized score:
3. Expert validation (1/2) • An experienced molecular immunologist constructed two non-overlapping lists of genes: KT1, a list of 242 genes involved in immune processes; and KT4, a list of 147 genes involved in cell–cell/cell–matrix interactions. • KT1, KT4 and KT1∪KT4 provided three queries for GOC into the biological process (BP) branch of the GO using $\delta_m$, s=7 and scoring function .
3. Expert validation (2/2) • Two assessed values • Utility (1=low to 5=high): Did the cluster terms provide a useful description of a specific biological process? • Expectation (1=high to 5=low): Was the identified biological process expected for the genes in the query?
4. Formal validation (1/3) • An independent source of annotations of collections of GO nodes: the InterPro project, which catalogs assignments of protein families, domains and functional sites to GO IDs. • E.g. ‘phosphofructokinase’ is InterPro ID IPR000023, and is annotated to GO:0006096=‘glycolysis’, GO:0003872=‘6-phosphofructokinase activity’, and GO:0005945=‘6-phosphofructokinase complex’. It also maps to 175 proteins. Thus the validation task is to make these 175 proteins a GOC query, and see how well cluster heads match against the set of GO IDs {GO:0006096, GO:0003872, GO:0005945}.
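The comparison at the heart of this validation task can be sketched as a set match between the cluster heads returned for a query and the GO IDs that InterPro assigns. The expected IDs below are the three from the phosphofructokinase example; the `cluster_heads` list is a hypothetical GOC output (including a made-up placeholder ID), and exact matching is a simplification of whatever matching criterion the authors actually used.

```python
# GO IDs annotated to InterPro entry IPR000023 (from the slide).
expected = {"GO:0006096", "GO:0003872", "GO:0005945"}

# Hypothetical cluster heads returned by a GOC query; "GO:9999999" is a
# made-up placeholder ID used only to show a non-matching result.
cluster_heads = ["GO:0006096", "GO:9999999"]

def exact_matches(heads, expected_ids):
    """Split the returned heads into hits and misses, and list expected IDs not found."""
    hits = [h for h in heads if h in expected_ids]
    misses = [h for h in heads if h not in expected_ids]
    not_found = sorted(expected_ids - set(heads))
    return hits, misses, not_found

hits, misses, not_found = exact_matches(cluster_heads, expected)
print(hits)       # ['GO:0006096']
print(misses)     # ['GO:9999999']
print(not_found)  # ['GO:0003872', 'GO:0005945']
```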
4. Formal validation (2/3) • In the run, there were 4,866 InterPro IDs with GO annotations, with 11,370 mappings to GO nodes and 787,760 mappings to proteins in total. Of these proteins, the authors were able to locate 778,494, or >99%, with GO annotations.
4. Formal validation (3/3) • Immediate family: child/parent/sibling. • Extended family: grandparent/grandchild/cousin/aunt/uncle/niece/nephew
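These family relations can be expressed on a DAG stored as child-to-parent links. The sketch below is an assumption-laden illustration: the toy DAG is hypothetical, and the kinship definitions (sibling = shares a parent, aunt/uncle = sibling of a parent, cousin = child of an aunt/uncle, niece/nephew = child of a sibling) are the everyday ones, since the slide lists the relation names without defining them.

```python
# Hypothetical toy DAG, child -> parents.
parents = {
    "root": set(),
    "p": {"root"}, "q": {"root"},
    "c1": {"p"}, "c2": {"p"}, "d": {"q"},
}

def ups(nodes):
    """Parents of any node in `nodes`."""
    return set().union(*(parents[n] for n in nodes))

def downs(nodes):
    """Children of any node in `nodes`."""
    nodes = set(nodes)
    return {c for c, ps in parents.items() if ps & nodes}

def immediate_family(n):
    par = parents[n]
    children = downs({n})
    siblings = downs(par) - {n}
    return par | children | siblings

def extended_family(n):
    par = parents[n]
    grandparents = ups(par)
    grandchildren = downs(downs({n}))
    siblings = downs(par) - {n}
    aunts_uncles = downs(grandparents) - par
    cousins = downs(aunts_uncles)
    nieces_nephews = downs(siblings)
    family = grandparents | grandchildren | aunts_uncles | cousins | nieces_nephews
    # In a DAG a node can qualify in several ways; keep the closest relation only.
    return family - immediate_family(n) - {n}

print(sorted(immediate_family("c1")))  # ['c2', 'p']
print(sorted(extended_family("c1")))   # ['d', 'q', 'root']
```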
5. Conclusions • The GOC methodology provides a valid and useful approach to categorization in the GO. • Future work • Methodological development in combinatorial approaches to data analysis, including distances between noncomparable nodes, interval-valued measures of ‘level’ in posets, algorithms for poset width calculation and poset matching. • Expansion to other ontologies. • Continuation of work in textual approaches, mapping back and forth from semantic relations among GO nodes to those among its lexical components.