Research in the Verspoor Lab

Generally speaking… • Focus on analysis of the biomedical literature • For the purpose of: • Turning unstructured data (natural language text) into structured statements • Taking advantage of the wealth of information in the literature for biological data analysis • Using (analyzing, building) semantic resources for the biomedical domain

Today: Focus on Ontologies • Use of the structure of ontologies to understand relations among protein annotations • Analysis of the term structure of ontologies • Particular ontology of interest: Gene Ontology

Gene Ontology (GO) • Taxonomic controlled vocabulary • ~16K nodes (P_GO), populated by genes and proteins • Two orders on P_GO: ≤_isa, ≤_has • Reference: Gene Ontology Consortium (2000): “Gene Ontology: Tool For the Unification of Biology”, Nature Genetics, 25:25-29

The Gene Ontology: Usage • 33703 terms • 20403 biological_process • 2810 cellular_component • 8996 molecular_function • Gene Annotations for 40+ organisms • 3504 publications in PubMed matching “gene ontology” (3/8/11) • ISI Web of Knowledge: 5371 refs to GO paper • Graph statistics as of June 9, 2009

Protein Function Prediction • Verspoor, K., Cohn, J., Mniszewski, S., and Joslyn, C. (2006). A Categorization Approach to Automated Ontological Function Annotation. Protein Science, v.15, pp.1544-1549.

Automated Protein Function Annotation • Mappings • From regions of sequence, structure, keyword spaces • Into regions of biological function space: • taxonomic bio-ontologies of molecular function • Characterize formal structure of bio-ontologies: • Order theoretical approaches • Combinatorial algorithms

POSOLE: POSet Ontology Laboratory Environment • POSOLE: a general environment for ontology experimentation • Graph representation of an ontology as a POSet • POSet statistics analysis (e.g. depth, width, average rank) • Algorithms for node categorization utilizing the structure of the ontology • First Deployment: Ontology categorization for automated protein function annotation • Function: Gene Ontology node • Protein: target sequence or Swiss-Prot identifier • Map proteins to sets of potential Gene Ontology nodes • Ontology categorization: “clustering” nodes in ontology space to identify the most likely node assignment • Dual Queries: Text and sequence neighborhoods

POSOLE strategy • Function Prediction as Categorization of Nearest Neighbors • Application of POSOC categorization methodology utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits” • “Hits” are based on (application-dependent) mappings from neighbors of an input protein to Gene Ontology nodes • Covering nodes are function annotation predictions

POSOLE architecture • PosoleRun, core of each application • Load the graph (GO) • Build a query, a set of query items • Categorize the query items • Each application defines its own QueryBuilder
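The load / build / categorize flow above can be sketched as follows. Only the names PosoleRun and QueryBuilder come from the slide; every method, the KeywordQueryBuilder subclass, the (node, weight) query-item shape, and the count_hits placeholder are assumptions made for illustration, not the POSOLE code.

```python
from collections import Counter

class QueryBuilder:
    """Application-specific step: turns raw input into weighted GO node hits."""
    def build(self, raw_input):
        raise NotImplementedError

class KeywordQueryBuilder(QueryBuilder):
    """Hypothetical builder that maps keywords to GO nodes via a lookup table."""
    def __init__(self, keyword_to_nodes):
        self.keyword_to_nodes = keyword_to_nodes

    def build(self, raw_input):
        return [(node, 1.0)
                for keyword in raw_input
                for node in self.keyword_to_nodes.get(keyword, [])]

def count_hits(graph, query):
    """Placeholder categorizer: sums hit weights per node, ignoring graph structure."""
    totals = Counter()
    for node, weight in query:
        if node in graph:
            totals[node] += weight
    return totals.most_common()

class PosoleRun:
    """Core loop: hold the graph, build a query, categorize the query items."""
    def __init__(self, graph, builder, categorize=count_hits):
        self.graph = graph
        self.builder = builder
        self.categorize = categorize

    def run(self, raw_input):
        query = self.builder.build(raw_input)
        return self.categorize(self.graph, query)

# Toy graph (child -> parents) and keyword table, both invented.
graph = {"GO:1": [], "GO:2": ["GO:1"]}
builder = KeywordQueryBuilder({"kinase": ["GO:2"], "binding": ["GO:1", "GO:2"]})
result = PosoleRun(graph, builder).run(["kinase", "binding"])
print(result)
```

Swapping in a different QueryBuilder subclass changes the application (text, sequence) without touching the run loop, which is the point of the design described on the slide.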

Categorization Task: POSOC • “Cluster” Genes in Ontology Space • Given the Gene Ontology (GO) . . . And mappings to GO nodes . . . • “Splatter” them over the GO . . . Where do they end up? • Concentrated? -- Dispersed? • Clustered? -- High or low? • Overlapping or distinct? • Pseudo-distances between comparable nodes to measure vertical separation • POSOC traverses the structure of the GO, percolating hits upwards, and calculating scores for GO nodes. • Scores to rank-order nodes with respect to gene locations, balancing: • Coverage: Covering as many genes as possible • Specificity: But at the “lowest level” possible • “Cluster” based on non-comparable high score nodes • http://www.c3.lanl.gov/posoc/ • Joslyn, Cliff; Mniszewski, Susan; Fulmer, Andy; and Heaton, Gary: (2004) “The Gene Ontology Categorizer”, Bioinformatics, v. 20:s1, pp. 169-177
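A minimal sketch of the "percolate hits upwards and score" idea, on an invented toy ontology. The geometric decay per upward step is an assumed stand-in for the pseudo-distance weighting; it is not the published POSOC scoring function.

```python
from collections import defaultdict

# Toy ontology as child -> parents (a DAG). Node names are hypothetical.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["B"],
}

def upward_distances(node):
    """Return {ancestor: fewest upward steps}, with the node itself at 0."""
    dist = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for n in frontier:
            for parent in PARENTS[n]:
                if parent not in dist:
                    dist[parent] = dist[n] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist

def score_nodes(hits, decay=0.6):
    """Each hit adds weight * decay**steps to every node at or above it."""
    scores = defaultdict(float)
    for node, weight in hits:
        for ancestor, steps in upward_distances(node).items():
            scores[ancestor] += weight * decay ** steps
    return dict(scores)

hits = [("C", 1.0), ("D", 1.0), ("E", 1.0)]
scores = score_nodes(hits)
ranked = sorted(scores, key=lambda n: (-scores[n], n))
print(ranked)
```

In this toy case node A ranks first: it covers two of the three hits while staying one step above them. The root covers all three hits but is penalized for being less specific, which is the coverage versus specificity trade-off named on the slide.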

Order Theoretical Categorization Method • Represent GO as labeled, finite ordered set • Given labels (genes) c, e, i . . . • What node(s) A,B, C, . . . ,K are best to attend to? • C • {H, J} • {A, H, J}

POSOLE applications

Application: BioCreAtIvE I, Task 2 (Critical Assessment of Information Extraction in Biology) • Automatic assignment of Gene Ontology annotations to human proteins based on a journal publication • Given a Swiss-Prot/TrEMBL protein ID and a document, predict a GO node to which the protein should be annotated • Also return the evidence text from the document supporting the annotation • Strategy: Annotation as Categorization of Document Neighborhood • Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits” • “Hits” in this case are based on overlaps between input terms and GO node terms (in labels, definitions)

POSOC as applied to context terms • Collect all terms in a context window of n sentences around any reference to the protein of interest • Transform an input query into a set of node hits: • Morphologically normalize GO node labels • Look for any overlaps between input terms and terms in the normalized node labels • An overlap = a node hit, with strength based on the input weight of the term (from TFIDF) • Multiple overlaps on a given node count as multiple hits • POSOC returns a set of GO nodes representing cluster heads for weighted term input set, and data on which input terms contributed to the selection of each cluster head: Annotation predictions
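The term-overlap step can be sketched as below. The node identifiers, the input weights, and the suffix-stripping normalizer are all invented for illustration; the slide does not specify the actual morphological normalization or the TF-IDF computation.

```python
import re

# Hypothetical node labels; a real run would use GO labels and definitions.
NODE_LABELS = {
    "GO:A": "regulation of coagulation",
    "GO:B": "blood coagulation",
    "GO:C": "organ growth",
}

def normalize(text):
    """Lowercase, tokenize, and strip a few suffixes (a crude stand-in for
    morphological normalization)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {re.sub(r"(ing|ed|s)$", "", t) for t in tokens}

def node_hits(weighted_terms):
    """Return one (node, term, weight) hit per overlap between an input term
    and a normalized node label. Multiple overlaps give multiple hits."""
    hits = []
    for node, label in NODE_LABELS.items():
        label_terms = normalize(label)
        for term, weight in weighted_terms.items():
            for norm in normalize(term):
                if norm in label_terms:
                    hits.append((node, norm, weight))
    return hits

# Input terms from the context window, with assumed TF-IDF-style weights.
hits = node_hits({"coagulation": 0.8, "growth": 0.3, "organs": 0.2, "kinase": 0.9})
for hit in hits:
    print(hit)
```

Here "GO:C" receives two hits (from "growth" and "organs"), matching the rule that multiple overlaps on one node count as multiple hits, while "kinase" overlaps nothing and contributes no hit.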

BioLASER: Los Alamos Semantic Event Recognizer for Biology • Text analysis environment: • Relation extraction • Term vector analysis • Domain-specific and application-specific components • Markup workflow implementation

Application: CASP-6 Function Prediction (Critical Assessment of Structure Prediction evaluation, Function Prediction subtask) • Automatic assignment of Gene Ontology annotations to target protein sequences • Strategy: Annotation as Categorization of Sequence Neighborhood • Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits” • “Hits” in this case are based on known mappings from proteins in the sequence neighborhood of the target to Gene Ontology nodes

CASP architecture

CASP Evaluation • Test set • proteins with known Gene Ontology mappings • 4530 SwissProt protein sequences associated from PDB • Protein to GO Mappings derived from UniProt • Eliminate PSI-BLAST identity matches from mappings used in prediction • Matches to protein with the same SwissProt Accession ID • Matches to protein with an accession ID that maps to the same SwissProt Entry ID • Matches to protein with an e-value < 10^-130 or e-value < max e-value for known identity match • Goal: compare function predictions made by the system with known functions assigned to each input protein
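One possible reading of the identity-match filter, as a sketch. The tuple layout, the accession-to-entry table, and all identifiers are invented; the interpretation of the e-value rule (drop anything below the larger of 10^-130 and the largest e-value among known identity matches) is an assumption.

```python
def remove_identity_matches(target_acc, target_entry, matches, acc_to_entry,
                            floor=1e-130):
    """Drop matches that are, or look like, the target protein itself.

    matches: list of (accession, e-value) pairs from PSI-BLAST (assumed shape).
    """
    identity = [m for m in matches
                if m[0] == target_acc or acc_to_entry.get(m[0]) == target_entry]
    # Anything scoring below this cutoff is treated as an identity match too.
    cutoff = max([floor] + [evalue for _, evalue in identity])
    return [m for m in matches if m not in identity and m[1] >= cutoff]

# Hypothetical accessions and entry IDs.
matches = [("Q1", 0.0), ("Q2", 1e-150), ("Q3", 1e-140), ("Q4", 1e-60), ("Q5", 1e-5)]
acc_to_entry = {"Q1": "ENTRY_X", "Q2": "ENTRY_X", "Q3": "ENTRY_Y",
                "Q4": "ENTRY_Z", "Q5": "ENTRY_W"}
kept = remove_identity_matches("Q1", "ENTRY_X", matches, acc_to_entry)
print(kept)
```

Q1 is removed by accession, Q2 by entry ID, and Q3 by the e-value threshold, leaving only the two weaker matches as inputs to prediction.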

CASP Evaluation runs • Baseline Best Blast: Predictions are the GO nodes associated with the non-identical protein scoring highest in the PSI-BLAST analysis. All predicted GO nodes are considered to be at rank 1. • Baseline Full Neighborhood: Predictions are the GO nodes associated with all proteins matched in the PSI-BLAST analysis (with e-value < 10). The predictions are ranked according to the e-value of the corresponding PSI-BLAST match. • POSOC Best Blast: Inputs to POSOC are the GO nodes associated with the non-identical protein scoring highest in the PSI-BLAST analysis, weighted by e-value of the match. POSOC categorizes and ranks these inputs to produce the predictions. • POSOC Full Neighborhood: Inputs to POSOC are the GO nodes associated with all proteins matched in the PSI-BLAST analysis, weighted by e-value of the match. POSOC categorizes and ranks these inputs to produce the predictions.
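The two baselines can be sketched on invented PSI-BLAST output. Identity matches are assumed to be removed already, and giving a GO node the rank of its best-scoring match is an assumption about tie handling that the slide does not spell out.

```python
# Hypothetical matches: (protein, e-value, GO nodes annotated to that protein).
matches = [
    ("P1", 1e-40, {"GO:1", "GO:2"}),
    ("P2", 1e-10, {"GO:2", "GO:3"}),
    ("P3", 5.0, {"GO:4"}),
    ("P4", 50.0, {"GO:5"}),
]

def best_blast(matches):
    """Baseline Best Blast: GO nodes of the top match, all at rank 1."""
    best = min(matches, key=lambda m: m[1])
    return [(go, 1) for go in sorted(best[2])]

def full_neighborhood(matches, max_evalue=10.0):
    """Baseline Full Neighborhood: GO nodes of every match under the e-value
    limit, ranked by the e-value order of the match that first supplies them."""
    kept = sorted((m for m in matches if m[1] < max_evalue), key=lambda m: m[1])
    ranked, seen = [], set()
    for rank, (_, _, go_nodes) in enumerate(kept, start=1):
        for go in sorted(go_nodes):
            if go not in seen:
                seen.add(go)
                ranked.append((go, rank))
    return ranked

best = best_blast(matches)
full = full_neighborhood(matches)
print(best)
print(full)
```

The POSOC variants would pass the same GO nodes, weighted by e-value, to the categorizer instead of returning them directly.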

Evaluation analysis • Precision/Recall • Precision = % of predictions that are correct • Recall = % of known annotations that are recovered • Extension to ranked list of predictions • Consider precision/recall at different ranks
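A small worked example of precision and recall at a rank cutoff. The ranked predictions and the known annotation set are made up, and only exact node matches count as correct here.

```python
def precision_recall_at_rank(ranked_predictions, known, k):
    """Precision and recall over all predictions with rank <= k."""
    top = {go for go, rank in ranked_predictions if rank <= k}
    correct = top & known
    precision = len(correct) / len(top) if top else 0.0
    recall = len(correct) / len(known) if known else 0.0
    return precision, recall

# Hypothetical ranked predictions (GO node, rank) and known annotations.
predictions = [("GO:1", 1), ("GO:2", 1), ("GO:3", 2), ("GO:4", 3)]
known = {"GO:2", "GO:3", "GO:9"}

for k in (1, 2, 3):
    p, r = precision_recall_at_rank(predictions, known, k)
    print(f"rank <= {k}: precision={p:.2f} recall={r:.2f}")
```

Moving from rank 1 to rank 2 raises both measures in this example, while admitting rank 3 adds a wrong prediction and lowers precision without improving recall.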

Ontological Distance Metrics • How “far apart” are p and q? • Genealogical approach: • Radius 0: Equals: Direct match • Radius 1: Nuclear family: Parents, children, siblings • Radius 2: Extended family: grandparents, grandchildren, cousins, aunts/uncles, nieces/nephews
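The genealogical radii can be computed from parent and child links, as sketched here on an invented toy ontology. The exact set definitions (for example, treating aunts/uncles as the grandparents' children that are not parents) are an assumed reading of the family terms on the slide.

```python
# Toy ontology as child -> set of parents. Node names are hypothetical.
PARENTS = {
    "root": set(), "A": {"root"}, "B": {"root"},
    "C": {"A"}, "D": {"A"}, "E": {"B"}, "F": {"C"},
}
CHILDREN = {n: set() for n in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].add(child)

def step(nodes, relation):
    """Follow one parent or child link from every node in the set."""
    return {m for n in nodes for m in relation[n]}

def family(node):
    """Return (radius-1 relatives, radius-2 relatives) of a node."""
    parents, children = PARENTS[node], CHILDREN[node]
    siblings = step(parents, CHILDREN) - {node}
    radius1 = parents | children | siblings
    grandparents = step(parents, PARENTS)
    grandchildren = step(children, CHILDREN)
    aunts_uncles = step(grandparents, CHILDREN) - parents
    cousins = step(aunts_uncles, CHILDREN)
    nieces_nephews = step(siblings, CHILDREN)
    radius2 = (grandparents | grandchildren | aunts_uncles
               | cousins | nieces_nephews) - radius1 - {node}
    return radius1, radius2

def radius(p, q):
    """0 for a direct match, 1 or 2 for family relations, None otherwise."""
    if p == q:
        return 0
    radius1, radius2 = family(p)
    if q in radius1:
        return 1
    if q in radius2:
        return 2
    return None

print(radius("C", "D"), radius("C", "E"), radius("F", "E"))
```

In a DAG a node can be related in more than one way, so the sketch reports the smallest applicable radius.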

Evaluation results: Precision

Evaluation results: Recall

Evaluation of Ontological predictions • Extension to ontological predictions: when does a GO node p in F(x) count as a “match” against a q in G(x)? • What about siblings? Ancestors? • Partial credit? • Based on proximity • Based on specificity • Adapt hierarchical precision/recall measure from Kiritchenko et al 2005
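A sketch of the basic hierarchical precision/recall idea: extend both the predicted and the known node sets with their ancestors, then compare the extended sets. The toy ontology is invented, excluding the root is an assumption, and this is the base measure only, not the extension proposed in the talk.

```python
# Toy ontology as child -> parents. Node names are hypothetical.
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "C": ["A"], "D": ["A"], "E": ["B"],
}
ROOT = "root"

def with_ancestors(nodes):
    """Nodes plus all of their ancestors, excluding the root."""
    out = set()
    stack = list(nodes)
    while stack:
        n = stack.pop()
        if n != ROOT and n not in out:
            out.add(n)
            stack.extend(PARENTS[n])
    return out

def hierarchical_pr(predicted, known):
    """Hierarchical precision and recall over ancestor-extended sets."""
    p_hat, k_hat = with_ancestors(predicted), with_ancestors(known)
    overlap = len(p_hat & k_hat)
    hp = overlap / len(p_hat) if p_hat else 0.0
    hr = overlap / len(k_hat) if k_hat else 0.0
    return hp, hr

sibling = hierarchical_pr({"D"}, {"C"})    # predicted a sibling of the answer
ancestor = hierarchical_pr({"A"}, {"C"})   # predicted the parent of the answer
unrelated = hierarchical_pr({"E"}, {"C"})  # predicted a node in another branch
print(sibling, ancestor, unrelated)
```

A sibling prediction earns partial credit on both measures, while a correct but less specific ancestor keeps full precision and loses recall, which addresses the "siblings? ancestors?" questions on the slide.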

Hierarchical Precision vs. Rank (Cellular Component branch)

Hierarchical Precision vs. Rank (Molecular Function branch)

Hierarchical Precision vs. Rank (Biological Process branch)

Summary: Protein Function Prediction • We have constructed the POSOLE architecture, supporting integration of mappings from different spaces into function space • We utilize the mathematical structure of function space as defined by the Gene Ontology to help identify commonalities and “clusters”, as well as in evaluation • We have proposed an extension to Kiritchenko et al’s hierarchical precision/recall measure to support comparison of sets of predictions and answers • The results on CASP function prediction show the promise of the POSOLE and POSOC technologies for automated annotation of protein sequences.

Ontology Quality Assurance • Verspoor, K., Dvorkin, D., Cohen, K.B., Hunter, L. (2009) Ontology quality assurance through analysis of term transformations. Bioinformatics 25(12):i77-i84.

Key quality concern: Univocality • Univocality = one voice (Spinoza, 1677) “a shared interpretation of the nature of reality” (with thanks to David Hill @ Jackson Lab) • Consistency of expression of concepts • Regular, compositional, linguistic structure • Facilitates human usability • Computational tools can utilize this regularity

Quality Assurance in the GO • Goal: identify violations of univocality • Problem: the GO is generally very high quality; how to identify the few inconsistencies? • Hypothesis: violations of univocality will correspond to transformational variants • Strategy: term transformation & clustering

GO Term Transformation: Abstraction • Substitution of embedded GO & ChEBI terms
  toluene oxidation via 3-hydroxytoluene → CTERM oxidation via CTERM
  regulation of coagulation → regulation of GTERM
  leukotriene production during acute inflammatory response → CTERM production during GTERM

GO Term Transformations • Stopword removal
  toluene oxidation via 3-hydroxytoluene → toluene oxidation 3-hydroxytoluene
  regulation of coagulation → regulation coagulation
• Alphabetic reordering
  toluene oxidation via 3-hydroxytoluene → 3-hydroxytoluene oxidation toluene via
  regulation of coagulation → coagulation of regulation

Transformation combinations • Abstraction=1, StopRemoval=1, Reordering=1
  toluene oxidation via 3-hydroxytoluene → CTERM CTERM oxidation
  regulation of coagulation → GTERM regulation
  leukotriene production during acute inflammatory response → CTERM GTERM production
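The three transformations can be sketched as small string functions. The stopword list and the lookup tables of embedded GO and ChEBI term names are tiny invented stand-ins for the real resources, and the stemming visible in the cluster keys elsewhere in the talk is not modeled here.

```python
STOPWORDS = {"of", "via", "during", "in", "the", "by"}
# Hypothetical tables standing in for embedded GO and ChEBI term names.
GO_TERMS = ["acute inflammatory response", "coagulation"]
CHEBI_TERMS = ["3-hydroxytoluene", "toluene", "leukotriene"]

def abstract(term):
    """Replace embedded terms with placeholders, longest names first so that
    '3-hydroxytoluene' is not partially consumed by 'toluene'."""
    for name in sorted(GO_TERMS, key=len, reverse=True):
        term = term.replace(name, "GTERM")
    for name in sorted(CHEBI_TERMS, key=len, reverse=True):
        term = term.replace(name, "CTERM")
    return term

def transform(term, a=True, s=True, r=True):
    """Apply abstraction (a), stopword removal (s), and reordering (r)."""
    if a:
        term = abstract(term)
    tokens = term.split()
    if s:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if r:
        tokens = sorted(tokens)
    return " ".join(tokens)

examples = [
    "toluene oxidation via 3-hydroxytoluene",
    "regulation of coagulation",
    "leukotriene production during acute inflammatory response",
]
for term in examples:
    print(term, "->", transform(term))
```

Calling transform with individual flags switched off gives the single-transformation forms, for example transform(term, s=False, r=False) for abstraction alone.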

Clustering • Group together all terms with a common form after transformation • Perform clustering for different combinations of transformations
asr {GTERM constit structu}
  GO:0005201 -- extracellular matrix structural constituent
  GO:0005199 -- structural constituent of cell wall
  GO:0005213 -- structural constituent of chorion
  GO:0005200 -- structural constituent of cytoskeleton
  GO:0003735 -- structural constituent of ribosome
  GO:0017056 -- structural constituent of nuclear pore
  GO:0019911 -- structural constituent of myelin sheath
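Grouping terms by their transformed form amounts to a dictionary keyed on the transformed string. In this sketch the list of embedded terms and the one-word stopword list are assumptions, and the key is built without stemming, so it reads "GTERM constituent structural" rather than the stemmed key in the example cluster.

```python
from collections import defaultdict

TERMS = {
    "GO:0005201": "extracellular matrix structural constituent",
    "GO:0005199": "structural constituent of cell wall",
    "GO:0005200": "structural constituent of cytoskeleton",
    "GO:0003735": "structural constituent of ribosome",
    "GO:0035265": "organ growth",
}
# Hypothetical list of embedded GO term names to abstract away.
EMBEDDED = ["extracellular matrix", "cell wall", "cytoskeleton", "ribosome"]
STOPWORDS = {"of"}

def asr_key(label):
    """Abstraction, stopword removal, and alphabetic reordering in one pass."""
    for name in sorted(EMBEDDED, key=len, reverse=True):
        label = label.replace(name, "GTERM")
    tokens = [t for t in label.split() if t not in STOPWORDS]
    return " ".join(sorted(tokens))

clusters = defaultdict(list)
for go_id, label in TERMS.items():
    clusters[asr_key(label)].append(go_id)

for key, members in clusters.items():
    print(key, members)
```

The four structural-constituent terms land in one cluster even though one of them uses "X structural constituent" and the others use "structural constituent of X".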

Analysis of clusters • Heuristic search: • Consider only clusters with abstraction (a±±) • Identify terms that fall in distinct a-- clusters but merge together under a-r, as-, or asr • Manual assessment of 190 clusters
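The heuristic can be sketched as a comparison of two cluster keys per term. The keys below are hand-written, unstemmed illustrations rather than output of the actual pipeline; only the a-- versus asr comparison is shown.

```python
from collections import defaultdict

# GO id -> (a-- key, asr key). Keys are illustrative assumptions.
KEYS = {
    "GO:0005201": ("GTERM structural constituent", "GTERM constituent structural"),
    "GO:0005199": ("structural constituent of GTERM", "GTERM constituent structural"),
    "GO:0003735": ("structural constituent of GTERM", "GTERM constituent structural"),
    "GO:0035265": ("organ GTERM", "GTERM organ"),
    "GO:0031100": ("organ GTERM", "GTERM organ"),
}

def merge_candidates(keys):
    """Return asr clusters whose members came from more than one a-- cluster."""
    by_asr = defaultdict(lambda: defaultdict(list))
    for go_id, (a_key, asr_key) in keys.items():
        by_asr[asr_key][a_key].append(go_id)
    return {asr_key: dict(groups)
            for asr_key, groups in by_asr.items() if len(groups) > 1}

candidates = merge_candidates(KEYS)
for asr_key, groups in candidates.items():
    print(asr_key, groups)
```

Only the structural-constituent cluster is flagged, because its members are phrased two different ways before stopword removal and reordering; the organ cluster already agreed under abstraction alone and is passed over.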

Transformation Impact • 25,539 source GO terms (12/2007 version) • Pre-processing reduces to 23,478 (8%) • a=Abstraction, s=StopRemoval, r=Reordering • Abstraction has most impact: 46% reduction

Abstraction breakdown, a-- clusters

Distribution of cluster size (asr transformation)

True Positive clusters • 67 clusters • 317 GO terms • Obsolete term filter: 7 clusters, 32 terms • Approximately 77 term rephrasings anticipated

True Positive inconsistencies • {X Y} ≈ {Y of X} | {Y in X} [45%]
{GTERM GTERM organis symbion}
  GO:0052387 -- induction by organism of symbiont apoptosis
  GO:0052351 -- induction by organism of systemic acquired resistance in symbiont
  GO:0052350 -- induction by organism of induced systemic resistance in symbiont
  GO:0052560 -- induction by organism of symbiont immune response
  GO:0052399 -- induction by organism of symbiont programmed cell death
  GO:0052396 -- induction by organism of symbiont non-apoptotic programmed cell death
{GTERM multice organis}
  GO:0010259 -- multicellular organismal aging
  GO:0022412 -- reproductive cellular process in multicellular organism
  GO:0032504 -- multicellular organism reproduction
  GO:0033057 -- reproductive behavior in a multicellular organism
  GO:0033555 -- multicellular organismal response to stress
  GO:0035264 -- multicellular organism growth

True Positives (2) • Determiners [16%]
{GTERM forebra}
  GO:0021861 -- radial glial cell differentiation in the forebrain
  GO:0021846 -- cell proliferation in forebrain
  GO:0021872 -- generation of neurons in the forebrain
{GTERM organ}
  GO:0031100 -- organ regeneration
  GO:0035265 -- organ growth
  GO:0010260 -- organ senescence
  GO:0001759 -- induction of an organ

True Positives (3) • Other alternations [16%]
{GTERM selecti site}
  GO:0000282 -- cellular bud site selection
  GO:0000918 -- selection of site for barrier septum formation
• Conflicting conventions [6%]
{GTERM endothe} (partial listing)
  GO:0003100 -- regulation of systemic arterial blood pressure by endothelin
  GO:0004962 -- endothelin receptor activity
• Punctuation [3%]
  GO:0016653 -- oxidoreductase activity, acting on NADH, heme protein as acceptor
  GO:0016658 -- oxidoreductase activity, acting on NADH, flavin as acceptor
  GO:0050664 -- oxidoreductase activity, acting on NADH, with oxygen as acceptor
  GO:0043247 -- telomere maintenance in response to DNA damage
  GO:0042770 -- DNA damage response, signal transduction

True Positives (4) • “Grab bag” • Lexical choice • “within” vs. “in” • “substrate-specific” vs. “substrate-dependent” • Superfluous words like “other”

False positive breakdown

False positive cluster examples • Semantic import of stopword [50%]
{CTERM GTERM levels modulat symbion} (partial listing)
  GO:0052430 -- modulation by host of symbiont RNA levels
  GO:0052018 -- modulation by symbiont of host RNA levels
{CTERM CTERM galacto GTERM}
  GO:0033580 -- protein amino acid galactosylation at cell surface
  GO:0033582 -- protein amino acid galactosylation in cytosol
  GO:0033579 -- protein amino acid galactosylation in endoplasmic reticulum
{callose deposit GTERM}
  GO:0052542 -- callose deposition during defense response
  GO:0052543 -- callose deposition in cell wall

False positives (2) • Non-parallel structure [27%]
{CTERM CTERM}
  GO:0005204 -- chondroitin sulfate proteoglycan
  GO:0006088 -- acetate to acetyl-CoA
  GO:0015641 -- lipoprotein toxin
{GTERM GTERM GTERM} (partial listing)
  GO:0019896 -- axon transport of mitochondrion
  GO:0047496 -- vesicle transport along microtubule
  GO:0047497 -- mitochondrion transport along microtubule
  GO:0032066 -- nucleolus to nucleoplasm transport
  GO:0052067 -- negative regulation by symbiont of entry into host cell via phagocytosis
{GTERM storage}
  GO:0001506 -- neurotransmitter biosynthetic process and storage
  GO:0000322 -- storage vacuole

False positives (3) • Stemming [17%]
{regulat GTERM} (partial listing)
  GO:0045066 -- regulatory T cell differentiation
  GO:0045069 -- regulation of viral genome replication
  GO:0045055 -- regulated secretory pathway
  GO:0031347 -- regulation of defense response
• Syntactic variation [5%]
{GTERM mainten}
  GO:0045216 -- intercellular junction assembly and maintenance
  GO:0045217 -- intercellular junction maintenance
  GO:0045218 -- zonula adherens maintenance
• Semantic import of word order [5%]
