
Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization

Authors: Shibiao WAN and Man-Wai MAK (The Hong Kong Polytechnic University); Sun-Yuan KUNG (Princeton University).


Presentation Transcript


  1. Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization • Shibiao WAN and Man-Wai MAK, The Hong Kong Polytechnic University • Sun-Yuan KUNG, Princeton University

  2. Outline • Introduction and Motivation • Retrieval of GO Terms • Semantic Similarity Measures • Multi-label Multi-Class Classification • Results • Conclusions

  3. Proteins and Their Subcellular Locations

  4. Subcellular Localization Prediction • Knowing where a protein resides in the cell helps biologists elucidate its function. • Identifying subcellular locations entirely by experimental means is time-consuming and costly. • Computational methods are therefore needed for subcellular localization prediction. • Previous research has found that gene ontology (GO) based methods outperform methods based on other protein features (e.g., amino acid composition).

  5. Multi-label Problem • Some proteins can simultaneously reside at, or move between, two or more subcellular locations. • Multi-label (multi-location) proteins play important roles in metabolic processes that take place in multiple subcellular locations. • State-of-the-art multi-label predictors, such as Plant-mPLoc, iLoc-Plant, and mGOASVM, use frequency counts of GO terms as features. • In this work, we propose using the semantic similarity of GO terms as features for multi-label subcellular localization prediction.

  6. Method’s Flowchart • [Flowchart: sequence S → BLAST against the Swiss-Prot database → homolog AC → GO extraction by searching the GOA database → semantic similarity (SS) measure against the GO terms of the training proteins → semantic similarity vector → multi-label SVM (SVM 1 ... SVM M) → subcellular location(s)]

  7. Gene Ontology • Gene ontology is a set of standardized vocabularies annotating the functions of genes and gene products. • Each GO term has an identifier, e.g., GO:0000187. • A protein sequence may correspond to zero, one, or many GO terms.

  8. Gene Ontology: Example • Search for GO:0000187 at http://www.geneontology.org/

  9. GOA Database • The Gene Ontology Annotation (GOA) database. • Provides structured annotations of proteins in the UniProt Knowledgebase (UniProtKB) and other protein databases using standardized GO vocabularies. • Includes a series of cross-references to other databases. • Given an accession number (AC), the GOA database returns the set of GO terms associated with that accession number.

  10. GOA Database • One accession number (AC) can map to many GO terms. • Example: search for A0M8T9 at http://www.ebi.ac.uk/GOA/
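
The one-to-many mapping from an AC to GO terms can be sketched in Python as follows. This is a minimal illustration, not the authors' retrieval code. It assumes the annotations are available as tab-separated lines in the GAF layout (object ID in column 2, GO ID in column 5, comment lines starting with "!"). The accessions and annotations in `toy_lines` are invented placeholders, not real GOA records.

```python
from collections import defaultdict


def load_go_annotations(gaf_lines):
    """Map each accession number (AC) to the set of GO term IDs annotated to it."""
    ac_to_go = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line: too few columns
        accession, go_id = fields[1], fields[4]
        ac_to_go[accession].add(go_id)
    return dict(ac_to_go)


# Invented placeholder records (hypothetical accessions, not real annotations).
toy_lines = [
    "!gaf-version: 2.2",
    "UniProtKB\tX00001\tGENE1\tinvolved_in\tGO:0000001\tREF:0\tIEA",
    "UniProtKB\tX00001\tGENE1\tlocated_in\tGO:0000002\tREF:0\tIEA",
    "UniProtKB\tX00002\tGENE2\tenables\tGO:0000003\tREF:0\tIEA",
]
annotations = load_go_annotations(toy_lines)
print(sorted(annotations["X00001"]))
```

Using a set per accession removes duplicate annotations that differ only in evidence code or reference.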

  11. Finding GO Terms without an Accession Number • [Flowchart: sequence S → BLAST against the Swiss-Prot database → homolog AC → GO extraction by searching the GOA database → GO terms of Qi]

  12. Semantic Similarity Measure • [Flowchart: GO term x and GO term y → find common ancestors A(x,y), with ancestors retrieved by SQL query from the GO database → compute semantic similarity sim(x,y)]

  13. Finding Common Ancestors, A(x,y)

  14. Finding Common Ancestors, A(x,y) • [Figure: GO graph around GO:0000187, showing is_a and part_of relationships]
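
Finding A(x,y) amounts to intersecting the ancestor sets of two terms in the GO graph. The sketch below assumes the is_a edges have already been loaded into a plain dictionary from child to parents. The term names are toy placeholders. Counting a term as its own ancestor is a convention chosen here, not something the slides state.

```python
def ancestors(term, is_a_parents):
    """Return `term` plus every term reachable from it by following is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(is_a_parents.get(current, ()))
    return seen


def common_ancestors(x, y, is_a_parents):
    """A(x, y): the terms that are ancestors of both x and y."""
    return ancestors(x, is_a_parents) & ancestors(y, is_a_parents)


# Toy is_a graph (child -> parents); names are placeholders, not real GO IDs.
toy_is_a = {
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
}
print(sorted(common_ancestors("c", "d", toy_is_a)))
```

The iterative traversal with a `seen` set handles terms that have several parents without visiting any term twice.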

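  15. Semantic Similarity Measure • We use Lin’s measure to estimate the semantic similarity between two GO terms x and y: sim(x,y) = max over c in A(x,y) of 2 ln p(c) / (ln p(x) + ln p(y)), where p(c) is the probability of occurrence of term c in the annotation corpus.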
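
Lin's measure scores two terms by the information content of their most informative common ancestor relative to their own information content: 2 ln p(c) / (ln p(x) + ln p(y)), maximized over c in A(x,y). A minimal sketch follows. The probabilities in `toy_prob` are invented, and the handling of the edge cases (no common ancestor, both terms with probability 1) is an assumption of this sketch.

```python
import math


def lin_similarity(x, y, common, prob):
    """Lin's similarity between terms x and y.

    common: set of common ancestors A(x, y)
    prob:   dict mapping each term to its occurrence probability in (0, 1]
    """
    if not common:
        return 0.0  # assumption: unrelated terms get similarity 0
    denominator = math.log(prob[x]) + math.log(prob[y])
    if denominator == 0.0:
        return 1.0  # assumption: both terms have probability 1 (root-like)
    return max(2.0 * math.log(prob[c]) / denominator for c in common)


# Invented probabilities for a toy hierarchy root > a > {c, d}.
toy_prob = {"root": 1.0, "a": 0.5, "c": 0.25, "d": 0.25}
print(lin_similarity("c", "d", {"a", "root"}, toy_prob))
```

With these numbers the best common ancestor is "a", giving ln 0.5 / ln 0.25 = 0.5. A term compared with itself scores 1.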

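  16. Semantic Similarity between 2 Proteins • Semantic similarity between two proteins with GO term sets Gi and Gj (equation not preserved in this transcript). • Semantic similarity vector: one element per training protein, so its dimension equals the number of training proteins.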
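
The protein-level formula is not recoverable from this transcript, so the sketch below uses one plausible aggregation as a stated assumption: each term in one set is matched to its most similar term in the other set, the two directed sums are added, and the result is normalized by the self-similarity scores. The names `directed_score`, `protein_similarity`, `similarity_vector` and `toy_sim` are hypothetical.

```python
def directed_score(gi, gj, sim):
    """Sum, over terms x in gi, of the best similarity to any term in gj."""
    return sum(max(sim(x, y) for y in gj) for x in gi)


def protein_similarity(gi, gj, sim):
    """Symmetric, normalized similarity between two GO term sets (assumed form)."""
    if not gi or not gj:
        return 0.0
    numerator = directed_score(gi, gj, sim) + directed_score(gj, gi, sim)
    denominator = directed_score(gi, gi, sim) + directed_score(gj, gj, sim)
    return numerator / denominator if denominator else 0.0


def similarity_vector(query_terms, training_term_sets, sim):
    """One similarity per training protein, in training-set order."""
    return [protein_similarity(query_terms, g, sim) for g in training_term_sets]


def toy_sim(x, y):
    # Stand-in for a term-level measure such as Lin's; values are invented.
    return 1.0 if x == y else 0.25


vector = similarity_vector({"a", "b"}, [{"a", "b"}, {"a"}], toy_sim)
print(vector)
```

The vector has as many elements as there are training proteins, which matches the dimension noted on the slide.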

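  17. Multi-label SVM Scoring • [Equation not preserved in this transcript; it computes the SVM scores for a query protein Qt from the GO terms of Qt and the GO terms of the training proteins]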
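
Given one SVM score per location, a multi-label decision still has to turn M scores into a set of locations. The rule sketched here is an assumption, since the slide's equation is lost: predict every location whose score is positive, and fall back to the single top-scoring location when none is positive.

```python
def predict_locations(scores):
    """Return the set of predicted location indices from per-class SVM scores."""
    positive = {m for m, s in enumerate(scores) if s > 0}
    if positive:
        return positive
    # No classifier fired: fall back to the single best-scoring location.
    best = max(range(len(scores)), key=scores.__getitem__)
    return {best}


print(predict_locations([0.3, -1.2, 0.8]))
print(predict_locations([-0.5, -0.1, -2.0]))
```

The fallback guarantees that every query protein receives at least one location.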

  18. Benchmark Datasets The Plant dataset

  19. Performance Metrics • Overall locative accuracy and overall actual accuracy (equations not preserved in this transcript). • Actual accuracy is the stricter and more objective of the two.
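
The two metrics can be sketched as follows, using the definitions common in the multi-label localization literature; treating them as the slide's exact definitions is an assumption. Locative accuracy is the fraction of true (protein, location) pairs that are predicted. Actual accuracy is the fraction of proteins whose predicted set equals the true set exactly.

```python
def locative_accuracy(true_sets, predicted_sets):
    """Correctly predicted locations divided by the total number of true locations."""
    hits = sum(len(t & p) for t, p in zip(true_sets, predicted_sets))
    total = sum(len(t) for t in true_sets)
    return hits / total


def actual_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted label set matches the true set exactly."""
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if t == p)
    return exact / len(true_sets)


# Toy labels for three proteins (location indices are arbitrary).
true_sets = [{0}, {1, 2}, {3}]
predicted_sets = [{0}, {1}, {2, 3}]
print(locative_accuracy(true_sets, predicted_sets))
print(actual_accuracy(true_sets, predicted_sets))
```

In this toy case three of four true locations are recovered (0.75), but only one of three proteins is predicted exactly right, which shows why actual accuracy is stricter: both under-prediction and over-prediction count as errors.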

  20. Performance Comparison The Plant dataset

  21. Conclusions • Our proposed predictor performs significantly better than Plant-mPLoc and iLoc-Plant, and also better than mGOASVM, in terms of both locative and actual accuracy. • The individual locative accuracies of our proposed predictor are significantly higher than those of the three other predictors for all 12 locations. • In terms of GO information extraction, Plant-mPLoc, iLoc-Plant, and mGOASVM use the occurrences of GO terms as features, whereas the proposed predictor exploits the semantic relationships between GO terms, from which the semantic similarity between proteins can be obtained.

  22. Web Servers

  23. Thank you!

  24. Multi-label SVM Classifier • Transformed labels for the M-class problem (equation not preserved in this transcript).
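
A common way to train M binary SVMs on multi-label data is to turn each protein's label set into M targets of +1 or -1, one per location. The sketch below assumes that one-vs-rest encoding; the slide's own equation is not available, so the exact form is an assumption.

```python
def transform_labels(label_sets, num_classes):
    """Encode each label set as a row of +1/-1 targets, one column per class."""
    return [
        [1 if m in labels else -1 for m in range(num_classes)]
        for labels in label_sets
    ]


# Two toy proteins: the first lives in locations 0 and 2, the second in location 1.
targets = transform_labels([{0, 2}, {1}], num_classes=3)
print(targets)
```

Column m of the result is the training target for the m-th binary SVM.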

  25. Retrieving GO Terms with/without AC • [Flowchart: Is the AC known? If yes, retrieve a set of GO terms for it. If no, retrieve homologs by BLAST and use the homolog's AC. If a set of GO terms is found, proceed to multi-label SVM classification; otherwise use back-up methods.]
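
The decision flow on this slide can be sketched as a small function. The lookups are passed in as callables because the real GOA query and BLAST search are outside the scope of the sketch; `lookup_go` and `find_homolog_acs` are hypothetical names. Falling through to BLAST when a known AC has no GO terms is this sketch's reading of the flowchart.

```python
def retrieve_go_terms(sequence, accession, lookup_go, find_homolog_acs):
    """Return GO terms for a protein, or an empty set if none can be found.

    lookup_go(ac)              -> set of GO terms for an accession (may be empty)
    find_homolog_acs(sequence) -> accessions of homologs, best match first
    """
    if accession is not None:
        terms = lookup_go(accession)
        if terms:
            return terms
    # AC unknown (or unannotated): try the ACs of homologs in ranked order.
    for homolog_ac in find_homolog_acs(sequence):
        terms = lookup_go(homolog_ac)
        if terms:
            return terms
    return set()  # caller should switch to a back-up method


# Stub lookups with invented accessions, standing in for GOA and BLAST.
toy_db = {"AC1": {"GO:A"}, "H2": {"GO:B"}}
toy_lookup = lambda ac: toy_db.get(ac, set())
toy_blast = lambda seq: ["H1", "H2"]
print(retrieve_go_terms("MKT", None, toy_lookup, toy_blast))
```

An empty result signals that the caller should use the back-up methods instead of the SVM classifier.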

  26. Finding Common Ancestors • The relationships between GO terms in the GO hierarchy can be obtained from the SQL database available at http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz • We considered only the ‘is_a’ relationship.
