
Research in the Verspoor Lab



Presentation Transcript


  1. Research in the Verspoor Lab

  2. Generally speaking… • Focus on analysis of the biomedical literature • For the purpose of: • Turning unstructured data (natural language text) into structured statements • Taking advantage of the wealth of information in the literature for biological data analysis • Using (analyzing, building) semantic resources for the biomedical domain

  3. Today: Focus on Ontologies • Use of the structure of ontologies to understand relations among protein annotations • Analysis of the term structure of ontologies • Particular ontology of interest: Gene Ontology

  4. Gene Ontology (GO) • Taxonomic controlled vocabulary • ~16K nodes, PGO, populated by genes and proteins • Two orders on PGO: ≤isa, ≤has • Gene Ontology Consortium (2000): “Gene Ontology: Tool for the Unification of Biology”, Nature Genetics, 25:25-29

  5. The Gene Ontology: Usage • 33703 terms • 20403 biological_process • 2810 cellular_component • 8996 molecular_function • Gene Annotations for 40+ organisms • 3504 publications in PubMed matching “gene ontology” (3/8/11) • ISI Web of Knowledge: 5371 refs to GO paper • Graph statistics as of June 9, 2009

  6. Protein Function Prediction • Verspoor, K., Cohn, J., Mniszewski, S., and Joslyn, C. (2006). A Categorization Approach to Automated Ontological Function Annotation. Protein Science, v.15, pp.1544-1549.

  7. Automated Protein Function Annotation • Mappings • From regions of sequence, structure, keyword spaces • Into regions of biological function space: • taxonomic bio-ontologies of molecular function • Characterize formal structure of bio-ontologies: • Order theoretical approaches • Combinatorial algorithms

  8. POSOLE: POSet Ontology Laboratory Environment • POSOLE: a general environment for ontology experimentation • Graph representation of an ontology as a POSet • POSet statistics analysis (e.g. depth, width, average rank) • Algorithms for node categorization utilizing the structure of the ontology • First Deployment: Ontology categorization for automated protein function annotation • Function: Gene Ontology node • Protein: target sequence or Swiss-Prot identifier • Map proteins to sets of potential Gene Ontology nodes • Ontology categorization: “clustering” nodes in ontology space to identify the most likely node assignment • Dual Queries: Text and sequence neighborhoods

  9. POSOLE strategy • Function Prediction as Categorization of Nearest Neighbors • Application of POSOC categorization methodology utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits” • “Hits” are based on (application-dependent) mappings from neighbors of an input protein to Gene Ontology nodes • Covering nodes are function annotation predictions

  10. POSOLE architecture • PosoleRun, core of each application • Load the graph (GO) • Build a query, a set of query items • Categorize the query items • Each application defines its own QueryBuilder

  11. Categorization Task: POSOC, “Cluster” Genes in Ontology Space • Given the Gene Ontology (GO) . . . And mappings to GO nodes . . . • “Splatter” them over the GO . . . Where do they end up? • Concentrated? -- Dispersed? • Clustered? -- High or low? • Overlapping or distinct? • Pseudo-distances between comparable nodes to measure vertical separation • POSOC traverses the structure of the GO, percolating hits upwards, and calculating scores for GO nodes. • Scores to rank-order nodes with respect to gene locations, balancing: • Coverage: Covering as many genes as possible • Specificity: But at the “lowest level” possible • “Cluster” based on non-comparable high score nodes • http://www.c3.lanl.gov/posoc/ • Joslyn, Cliff; Mniszewski, Susan; Fulmer, Andy; and Heaton, Gary (2004): “The Gene Ontology Categorizer”, Bioinformatics, v. 20:s1, pp. 169-177
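The "percolate hits upwards and score" idea on this slide can be illustrated with a small sketch. This is not the POSOC scoring function: the toy ontology, the `decay` parameter, and the rule that a hit contributes `weight * decay**distance` to each node above it are all assumptions made for illustration. They stand in for the pseudo-distances and the coverage/specificity balance described in the Joslyn et al. (2004) paper.

```python
from collections import defaultdict

# Hypothetical toy ontology: node -> list of parents (a DAG, like the GO).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
}

def ancestors_with_distance(node, parents):
    """Return {ancestor: minimum number of edges above node}; node itself is at 0."""
    dist = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for n in frontier:
            for p in parents[n]:
                if p not in dist:          # first visit is the shortest path (BFS)
                    dist[p] = dist[n] + 1
                    next_frontier.append(p)
        frontier = next_frontier
    return dist

def score_nodes(hits, parents, decay=0.75):
    """hits: {node: weight}. Each hit is percolated to every node at or above it,
    discounted by decay**distance (an assumed stand-in for specificity)."""
    scores = defaultdict(float)
    for node, weight in hits.items():
        for anc, d in ancestors_with_distance(node, parents).items():
            scores[anc] += weight * decay ** d
    return dict(scores)

def rank_nodes(hits, parents, decay=0.75):
    """Rank nodes by descending score, breaking ties by name."""
    scores = score_nodes(hits, parents, decay)
    return sorted(scores, key=lambda n: (-scores[n], n))

if __name__ == "__main__":
    print(rank_nodes({"C": 1.0, "D": 1.0}, PARENTS))
```

With hits on C and D, node A ranks first: it covers both hits (coverage) while sitting only one step above them (specificity). The root also covers both but is penalised for being further away.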

  12. Order Theoretical Categorization Method • Represent GO as labeled, finite ordered set • Given labels (genes) c, e, i . . . • What node(s) A, B, C, . . . , K are best to attend to? • C • {H, J} • {A, H, J}

  13. POSOLE applications

  14. Application: BioCreAtIvE I, Task 2 • Critical Assessment of Information Extraction in Biology • Automatic assignment of Gene Ontology annotations to human proteins based on a journal publication • Given a Swiss-Prot/TrEMBL protein ID and a document, predict a GO node to which the protein should be annotated • Also return the evidence text from the document supporting the annotation • Strategy: Annotation as Categorization of Document Neighborhood • Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits” • “Hits” in this case are based on overlaps between input terms and GO node terms (in labels, definitions)

  15. POSOC as applied to context terms • Collect all terms in a context window of n sentences around any reference to the protein of interest • Transform an input query into a set of node hits: • Morphologically normalize GO node labels • Look for any overlaps between input terms and terms in the normalized node labels • An overlap = a node hit, with strength based on the input weight of the term (from TFIDF) • Multiple overlaps on a given node count as multiple hits • POSOC returns a set of GO nodes representing cluster heads for weighted term input set, and data on which input terms contributed to the selection of each cluster head: Annotation predictions
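A minimal sketch of the "overlap = node hit" step follows, assuming the input terms have already been extracted from the context window and weighted (for example by TFIDF). The `normalize` function is a crude stand-in for morphological normalization, and the stopword list, function names, and sample weights are hypothetical. The two GO labels are taken from later slides in this deck.

```python
import re

# Assumed stopword list, for illustration only.
STOPWORDS = {"of", "the", "in", "by", "to", "and"}

def normalize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords.
    A real system would also stem or lemmatize each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def node_hits(term_weights, node_labels):
    """term_weights: {normalized term: weight}; node_labels: {go_id: label}.
    Returns {go_id: [(term, weight), ...]} with one entry per overlapping term,
    so multiple overlaps on a node count as multiple hits."""
    hits = {}
    for go_id, label in node_labels.items():
        label_terms = set(normalize(label))
        matched = [(t, w) for t, w in sorted(term_weights.items()) if t in label_terms]
        if matched:
            hits[go_id] = matched
    return hits

if __name__ == "__main__":
    labels = {
        "GO:0004962": "endothelin receptor activity",
        "GO:0035265": "organ growth",
    }
    weights = {"receptor": 0.8, "activity": 0.2, "kinase": 0.5}
    print(node_hits(weights, labels))
```

The resulting weighted hits would then be passed to the categorizer, which selects cluster heads and records which input terms contributed to each.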

  16. BioLASER: Los Alamos Semantic Event Recognizer for Biology • Text analysis environment: • Relation extraction • Term vector analysis • Domain-specific and application-specific components • Markup workflow implementation

  17. Application: CASP-6 Function Prediction • Critical Assessment of Structure Prediction evaluation, Function Prediction subtask • Automatic assignment of Gene Ontology annotations to target protein sequences • Strategy: Annotation as Categorization of Sequence Neighborhood • Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits” • “Hits” in this case are based on known mappings from proteins in the sequence neighborhood of the target to Gene Ontology nodes

  18. CASP architecture

  19. CASP Evaluation • Test set • proteins with known Gene Ontology mappings • 4530 SwissProt protein sequences associated from PDB • Protein to GO Mappings derived from UniProt • Eliminate PSI-BLAST identity matches from mappings used in prediction • Matches to protein with the same SwissProt Accession ID • Matches to protein with an accession ID that maps to the same SwissProt Entry ID • Matches to protein with an e-value < 10^-130 or e-value < max e-value for known identity match • Goal: compare function predictions made by the system with known functions assigned to each input protein

  20. CASP Evaluation runs • Baseline Best Blast: Predictions are the GO nodes associated with the non-identical protein scoring highest in the PSI-BLAST analysis. All predicted GO nodes are considered to be at rank 1. • Baseline Full Neighborhood: Predictions are the GO nodes associated with all proteins matched in the PSI-BLAST analysis (with e-value < 10). The predictions are ranked according to the e-value of the corresponding PSI-BLAST match. • POSOC Best Blast: Inputs to POSOC are the GO nodes associated with the non-identical protein scoring highest in the PSI-BLAST analysis, weighted by e-value of the match. POSOC categorizes and ranks these inputs to produce the predictions. • POSOC Full Neighborhood: Inputs to POSOC are the GO nodes associated with all proteins matched in the PSI-BLAST analysis, weighted by e-value of the match. POSOC categorizes and ranks these inputs to produce the predictions.

  21. Evaluation analysis • Precision/Recall • Precision = % of predictions that are correct • Recall = % of known functions that are recovered • Extension to ranked list of predictions • Consider precision/recall at different ranks
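Precision and recall at a given rank can be sketched as below. The function name and sample data are illustrative. The sketch assumes a strictly ordered list of distinct predictions, so it does not model runs in which several predictions share rank 1.

```python
def precision_recall_at_k(ranked_predictions, known, k):
    """Precision and recall over the top-k entries of a ranked prediction list.

    ranked_predictions: list of GO node ids, best first, without duplicates.
    known: set of GO node ids annotated to the protein.
    """
    top = ranked_predictions[:k]
    correct = sum(1 for p in top if p in known)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(known) if known else 0.0
    return precision, recall

if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    known = {"a", "c", "e"}
    for k in (1, 2, 4):
        print(k, precision_recall_at_k(ranked, known, k))
```

Sweeping `k` from 1 to the length of the list gives the precision-versus-rank and recall-versus-rank curves reported in the results slides.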

  22. Ontological Distance Metrics • How “far apart” are p and q? • Genealogical approach: • Radius 0: Equals: Direct match • Radius 1: Nuclear family: Parents, children, siblings • Radius 2: Extended family: grandparents, grandchildren, cousins, aunts/uncles, nieces/nephews
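One way to read the genealogical radii is sketched here as an assumption, not as the lab's definition: q is within radius r of p if some common ancestor lies at most r steps above each of them. Under that reading, siblings (one up, one down) are at radius 1, and cousins, aunts/uncles, and nieces/nephews are at radius 2, which matches the slide. The family-tree node names are hypothetical.

```python
def up_distances(node, parents):
    """Return {ancestor: minimum number of edges above node}; node itself is at 0."""
    dist = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for n in frontier:
            for p in parents.get(n, []):
                if p not in dist:
                    dist[p] = dist[n] + 1
                    next_frontier.append(p)
        frontier = next_frontier
    return dist

def genealogical_radius(p, q, parents):
    """Smallest r such that a common ancestor is within r steps of both p and q.
    Returns None if p and q share no ancestor."""
    dp, dq = up_distances(p, parents), up_distances(q, parents)
    common = dp.keys() & dq.keys()
    if not common:
        return None
    return min(max(dp[a], dq[a]) for a in common)

if __name__ == "__main__":
    # Hypothetical toy hierarchy: node -> list of parents.
    family = {
        "grandparent": [],
        "parent": ["grandparent"],
        "aunt": ["grandparent"],
        "me": ["parent"],
        "sibling": ["parent"],
        "cousin": ["aunt"],
    }
    for other in family:
        print(other, genealogical_radius("me", other, family))
```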

  23. Evaluation results: Precision

  24. Evaluation results: Recall

  25. Evaluation of Ontological predictions • Extension to ontological predictions: when does a GO node p in F(x) count as a “match” against a q in G(x)? • What about siblings? Ancestors? • Partial credit? • Based on proximity • Based on specificity • Adapt hierarchical precision/recall measure from Kiritchenko et al. (2005)
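As background for the adaptation mentioned here, the sketch below shows the basic hierarchical precision/recall idea as commonly described for the Kiritchenko et al. measure. Both the predicted set and the known set are extended with all of their ancestors, and precision and recall are computed on the extended sets. It does not reproduce the extension proposed in this talk. The optional exclusion of the root and the toy ontology are assumptions.

```python
def with_ancestors(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    closed = set()
    stack = list(nodes)
    while stack:
        n = stack.pop()
        if n not in closed:
            closed.add(n)
            stack.extend(parents.get(n, []))
    return closed

def hierarchical_pr(predicted, known, parents, root=None):
    """Hierarchical precision and recall on ancestor-extended sets.
    If root is given it is excluded, since every node trivially shares it."""
    p_hat = with_ancestors(predicted, parents)
    k_hat = with_ancestors(known, parents)
    if root is not None:
        p_hat.discard(root)
        k_hat.discard(root)
    overlap = len(p_hat & k_hat)
    hp = overlap / len(p_hat) if p_hat else 0.0
    hr = overlap / len(k_hat) if k_hat else 0.0
    return hp, hr

if __name__ == "__main__":
    # Hypothetical toy ontology: node -> list of parents.
    onto = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}
    print(hierarchical_pr({"C"}, {"D"}, onto, root="root"))  # sibling prediction
    print(hierarchical_pr({"A"}, {"C"}, onto, root="root"))  # ancestor prediction
```

A sibling prediction earns partial credit on both measures. An ancestor prediction is fully precise but recovers only part of the known annotation, which addresses the "siblings? ancestors?" questions on the slide.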

  26. Hierarchical Precision vs. Rank (Cellular Component branch)

  27. Hierarchical Precision vs. Rank (Molecular Function branch)

  28. Hierarchical Precision vs. Rank (Biological Process branch)

  29. Summary: Protein Function Prediction • We have constructed the POSOLE architecture, supporting integration of mappings from different spaces into function space • We utilize the mathematical structure of function space as defined by the Gene Ontology to help identify commonalities and “clusters”, as well as in evaluation • We have proposed an extension to Kiritchenko et al’s hierarchical precision/recall measure to support comparison of sets of predictions and answers • The results on CASP function prediction show the promise of the POSOLE and POSOC technologies for automated annotation of protein sequences.

  30. Ontology Quality Assurance • Verspoor, K., Dvorkin, D., Cohen, K.B., Hunter, L. (2009) Ontology quality assurance through analysis of term transformations. Bioinformatics 25(12):i77-i84.

  31. Key quality concern: Univocality • Univocality = one voice (Spinoza, 1677) “a shared interpretation of the nature of reality” (with thanks to David Hill @ Jackson Lab) • Consistency of expression of concepts • Regular, compositional, linguistic structure • Facilitates human usability • Computational tools can utilize this regularity

  32. Quality Assurance in the GO • Goal: identify violations of univocality • Problem: the GO is generally very high quality; how to identify the few inconsistencies? • Hypothesis: violations of univocality will correspond to transformational variants • Strategy: term transformation & clustering

  33. GO Term Transformation: Abstraction • Substitution of embedded GO & ChEBI terms • toluene oxidation via 3-hydroxytoluene → CTERM oxidation via CTERM • regulation of coagulation → regulation of GTERM • leukotriene production during acute inflammatory response → CTERM production during GTERM

  34. GO Term Transformations • Stopword removal • toluene oxidation via 3-hydroxytoluene → toluene oxidation 3-hydroxytoluene • regulation of coagulation → regulation coagulation • Alphabetic reordering • toluene oxidation via 3-hydroxytoluene → 3-hydroxytoluene oxidation toluene via • regulation of coagulation → coagulation of regulation
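The two transformations on this slide are simple enough to sketch directly. The stopword list is an assumption: the slide shows only "via" and "of" being removed. Whitespace tokenization is a further simplification.

```python
# Assumed stopword list; only "via" and "of" are attested on the slide.
STOPWORDS = {"of", "via", "in", "the", "a", "an", "by", "during", "to"}

def remove_stopwords(term):
    """Drop stopwords, keeping the remaining words in their original order."""
    return " ".join(w for w in term.split() if w.lower() not in STOPWORDS)

def reorder_alphabetically(term):
    """Sort the words of a term so that word-order variants coincide."""
    return " ".join(sorted(term.split()))

if __name__ == "__main__":
    for t in ("toluene oxidation via 3-hydroxytoluene", "regulation of coagulation"):
        print(remove_stopwords(t), "|", reorder_alphabetically(t))
```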

  35. Transformation combinations • Abstraction=1, StopRemoval=1, Reordering=1 • toluene oxidation via 3-hydroxytoluene • regulation of coagulation • leukotriene production during acute inflammatory response

  36. Transformation combinations • Abstraction=1, StopRemoval=1, Reordering=1 • toluene oxidation via 3-hydroxytoluene → CTERM CTERM oxidation • regulation of coagulation → GTERM regulation • leukotriene production during acute inflammatory response → CTERM GTERM production

  37. Clustering • Group together all terms with a common form after transformation • Perform clustering for different combinations of transformations • Example, asr cluster {GTERM constit structu}: GO:0005201 -- extracellular matrix structural constituent; GO:0005199 -- structural constituent of cell wall; GO:0005213 -- structural constituent of chorion; GO:0005200 -- structural constituent of cytoskeleton; GO:0003735 -- structural constituent of ribosome; GO:0017056 -- structural constituent of nuclear pore; GO:0019911 -- structural constituent of myelin sheath
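Grouping terms by their transformed form can be sketched as a dictionary keyed on the canonical string. The sketch makes several simplifying assumptions: embedded terms are supplied as a hand-written set instead of being looked up in GO or ChEBI, abstraction is a naive substring replacement, and no stemming is applied. The cluster key therefore reads "GTERM constituent structural" where the slide shows "GTERM constit structu".

```python
from collections import defaultdict

# Assumed stopword list, for illustration only.
STOPWORDS = {"of", "in", "the", "a", "an", "by", "via", "during"}

def abstract(label, embedded_terms, placeholder="GTERM"):
    """Replace each embedded term with a placeholder, longest terms first
    so that a short term cannot split a longer one."""
    for term in sorted(embedded_terms, key=len, reverse=True):
        label = label.replace(term, placeholder)
    return label

def canonical_form(label, embedded_terms):
    """Abstraction, then stopword removal, then alphabetic reordering (the 'asr' setting)."""
    words = abstract(label, embedded_terms).split()
    kept = [w for w in words if w.lower() not in STOPWORDS]
    return " ".join(sorted(kept))

def cluster_terms(terms, embedded_terms):
    """terms: {go_id: label}. Returns {canonical form: sorted ids} for forms
    shared by at least two terms."""
    clusters = defaultdict(list)
    for go_id, label in terms.items():
        clusters[canonical_form(label, embedded_terms)].append(go_id)
    return {form: sorted(ids) for form, ids in clusters.items() if len(ids) > 1}

if __name__ == "__main__":
    terms = {
        "GO:0005201": "extracellular matrix structural constituent",
        "GO:0005199": "structural constituent of cell wall",
        "GO:0003735": "structural constituent of ribosome",
        "GO:0035265": "organ growth",
    }
    embedded = {"extracellular matrix", "cell wall", "ribosome"}
    print(cluster_terms(terms, embedded))
```

The first term lands in the same cluster as the "structural constituent of X" terms only once stopwords are removed and words are reordered. This is the kind of phrasing inconsistency the analysis is designed to surface.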

  38. Analysis of clusters • Heuristic search: • Consider only clusters with abstraction (a±±) • Identify terms that fall in distinct a-- clusters but merge together in a-r, as-, or asr • Manual assessment of 190 clusters

  39. Transformation Impact • 25,539 source GO terms (12/2007 version) • Pre-processing reduces to 23,478 (8%) • a=Abstraction, s=StopRemoval, r=Reordering • Abstraction has most impact: 46% reduction

  40. Abstraction breakdown, a-- clusters

  41. Distribution of cluster size, asr transformation

  42. True Positive clusters • 67 clusters • 317 GO terms • Obsolete term filter: 7 clusters, 32 terms • Approximately 77 term rephrasings anticipated

  43. True Positive inconsistencies • {X Y} ≈ {Y of X} | {Y in X} [45%] • {GTERM GTERM organis symbion}: GO:0052387 -- induction by organism of symbiont apoptosis; GO:0052351 -- induction by organism of systemic acquired resistance in symbiont; GO:0052350 -- induction by organism of induced systemic resistance in symbiont; GO:0052560 -- induction by organism of symbiont immune response; GO:0052399 -- induction by organism of symbiont programmed cell death; GO:0052396 -- induction by organism of symbiont non-apoptotic programmed cell death • {GTERM multice organis}: GO:0010259 -- multicellular organismal aging; GO:0022412 -- reproductive cellular process in multicellular organism; GO:0032504 -- multicellular organism reproduction; GO:0033057 -- reproductive behavior in a multicellular organism; GO:0033555 -- multicellular organismal response to stress; GO:0035264 -- multicellular organism growth

  44. True Positives (2) • Determiners [16%] • {GTERM forebra}: GO:0021861 -- radial glial cell differentiation in the forebrain; GO:0021846 -- cell proliferation in forebrain; GO:0021872 -- generation of neurons in the forebrain • {GTERM organ}: GO:0031100 -- organ regeneration; GO:0035265 -- organ growth; GO:0010260 -- organ senescence; GO:0001759 -- induction of an organ

  45. True Positives (3) • Other alternations [16%] • {GTERM selecti site}: GO:0000282 -- cellular bud site selection; GO:0000918 -- selection of site for barrier septum formation • Conflicting conventions [6%] • {GTERM endothe} (partial listing): GO:0003100 -- regulation of systemic arterial blood pressure by endothelin; GO:0004962 -- endothelin receptor activity • Punctuation [3%] • GO:0016653 -- oxidoreductase activity, acting on NADH, heme protein as acceptor; GO:0016658 -- oxidoreductase activity, acting on NADH, flavin as acceptor; GO:0050664 -- oxidoreductase activity, acting on NADH, with oxygen as acceptor; GO:0043247 -- telomere maintenance in response to DNA damage; GO:0042770 -- DNA damage response, signal transduction

  46. True Positives (4) • “Grab bag” • Lexical choice • “within” vs. “in” • “substrate-specific” vs. “substrate-dependent” • Superfluous words like “other”

  47. False positive breakdown

  48. False positive cluster examples • Semantic import of stopword [50%] • {CTERM GTERM levels modulat symbion} (partial listing): GO:0052430 -- modulation by host of symbiont RNA levels; GO:0052018 -- modulation by symbiont of host RNA levels • {CTERM CTERM galacto GTERM}: GO:0033580 -- protein amino acid galactosylation at cell surface; GO:0033582 -- protein amino acid galactosylation in cytosol; GO:0033579 -- protein amino acid galactosylation in endoplasmic reticulum • {callose deposit GTERM}: GO:0052542 -- callose deposition during defense response; GO:0052543 -- callose deposition in cell wall

  49. False positives (2) • Non-parallel structure [27%] • {CTERM CTERM}: GO:0005204 -- chondroitin sulfate proteoglycan; GO:0006088 -- acetate to acetyl-CoA; GO:0015641 -- lipoprotein toxin • {GTERM GTERM GTERM} (partial listing): GO:0019896 -- axon transport of mitochondrion; GO:0047496 -- vesicle transport along microtubule; GO:0047497 -- mitochondrion transport along microtubule; GO:0032066 -- nucleolus to nucleoplasm transport; GO:0052067 -- negative regulation by symbiont of entry into host cell via phagocytosis • {GTERM storage}: GO:0001506 -- neurotransmitter biosynthetic process and storage; GO:0000322 -- storage vacuole

  50. False positives (3) • Stemming [17%] • {regulat GTERM} (partial listing): GO:0045066 -- regulatory T cell differentiation; GO:0045069 -- regulation of viral genome replication; GO:0045055 -- regulated secretory pathway; GO:0031347 -- regulation of defense response • Syntactic variation [5%] • {GTERM mainten}: GO:0045216 -- intercellular junction assembly and maintenance; GO:0045217 -- intercellular junction maintenance; GO:0045218 -- zonula adherens maintenance • Semantic import of word order [5%]
