Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK, The Hong Kong Polytechnic University; Sun-Yuan KUNG, Princeton University
Outline • Introduction and Motivation • Retrieval of GO Terms • Semantic Similarity Measures • Multi-label Multi-Class Classification • Results • Conclusions
Subcellular Localization Prediction • The subcellular locations of proteins help biologists to elucidate the functions of proteins. • Identifying the subcellular locations by entirely experimental means is time-consuming and costly. • Computational methods are necessary for subcellular localization prediction. • Previous research has found that gene ontology (GO) based methods outperform methods based on other protein features (e.g. amino-acid composition).
Multi-label Problem • Some proteins can simultaneously reside at, or move between, two or more subcellular locations. • Multi-label (multi-location) proteins play important roles in some metabolic processes taking place in multiple subcellular locations. • State-of-the-art multi-label predictors, such as Plant-mPLoc, iLoc-Plant, and mGOASVM, use frequency counts of GO terms as features. • In this work, we propose using semantic similarity of GO terms as features for multi-label subcellular localization prediction.
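Method’s Flowchart [Figure: flowchart. A query sequence S is searched against the Swiss-Prot database with BLAST to obtain a homolog's accession number (AC); the AC is used to extract GO terms by searching the GOA database; a semantic similarity (SS) measure compares these with the GO terms of the training proteins to form a semantic similarity vector; a multi-label SVM classifier made up of M SVMs outputs the subcellular location(s).]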
Gene Ontology • Gene ontology is a set of standardized vocabularies annotating the functions of genes and gene products. • GO terms are identified by IDs, e.g., GO:0000187. • A protein sequence may correspond to 0, 1 or many GO terms.
Gene Ontology: Example • Search for GO:0000187 at http://www.geneontology.org/
GOA Database • Gene Ontology Annotation database. • Provides structured annotations to proteins in the UniProt Knowledgebase (UniProtKB) and other protein databases using standardized GO vocabularies. • Includes a series of cross-references to other databases. • Given an accession number (AC), the GOA database allows us to find the set of GO terms associated with that accession number.
GOA Database • One AC can map to many GO terms! • Example: search for A0M8T9 at http://www.ebi.ac.uk/GOA/ [Figure: search result listing the accession number (AC) and its GO term(s).]
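Finding GO Terms without an Accession Number [Figure: the query sequence S is searched against the Swiss-Prot database with BLAST; the AC of a homolog is then used to extract GO terms by searching the GOA database, giving the GO terms of the query protein Q_i.]
Semantic Similarity Measure [Figure: for GO terms x and y, the ancestors are retrieved from the GO database by SQL query; the common ancestors A(x,y) are found and used to compute the semantic similarity sim(x,y).]
Finding Common Ancestors, A(x,y) [Figure: ancestor graph of GO:0000187, with is_a and part_of edges.]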
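Semantic Similarity Measure • We use Lin’s measure to estimate the semantic similarity between two GO terms (x and y): \mathrm{sim}(x,y) = \max_{c \in A(x,y)} \frac{2 \ln p(c)}{\ln p(x) + \ln p(y)}, where A(x,y) is the set of common ancestors of x and y and p(c) is the probability of occurrence of GO term c.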
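A minimal sketch of Lin's measure on a toy hierarchy may help make the term-level similarity concrete. The term names, the `PARENTS` table, and the probabilities in `PROB` are hypothetical, and the sketch assumes that a term counts as its own ancestor, so that a term compared with itself scores 1.

```python
import math

# Toy is-a hierarchy (child -> parents); term names are hypothetical.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a"],
}

# Hypothetical probabilities p(t): share of annotations at t or below it.
PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "c": 0.25, "d": 0.25}


def ancestors(term):
    """Return the term itself plus all of its ancestors (assumed convention)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin_similarity(x, y):
    """Lin's measure: max over common ancestors c of 2*ln p(c) / (ln p(x) + ln p(y))."""
    denom = math.log(PROB[x]) + math.log(PROB[y])
    if denom == 0.0:
        # Both terms have p = 1 (e.g. the root); treat them as identical.
        return 1.0
    common = ancestors(x) & ancestors(y)
    return max(2.0 * math.log(PROB[c]) / denom for c in common)


print(lin_similarity("c", "d"))  # siblings under "a"
print(lin_similarity("c", "b"))  # only the root in common
```

Terms that share only the root score 0 because the root carries no information (ln 1 = 0), while siblings under an informative parent score higher.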
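Semantic Similarity between 2 Proteins • Semantic similarity between two proteins, represented by their GO-term sets (G_i, G_j): • Semantic similarity vector: one element per training protein, so its dimension equals the number of training proteins.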
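The protein-level formula is not preserved in this transcript, so the following is only a sketch of one plausible aggregation, not necessarily the authors' definition: each GO term of one protein is matched with its most similar term in the other protein, the matches are summed in both directions, and the total is normalised by the self-similarities. The function names, `toy_sim`, and the term IDs are all hypothetical.

```python
def best_match_sum(gi, gj, sim):
    """For each term in gi, take its best similarity to any term in gj, then sum."""
    return sum(max(sim(x, y) for y in gj) for x in gi)


def protein_similarity(gi, gj, sim):
    """Symmetric, normalised similarity between two GO-term sets (assumed form)."""
    if not gi or not gj:
        return 0.0
    num = best_match_sum(gi, gj, sim) + best_match_sum(gj, gi, sim)
    den = best_match_sum(gi, gi, sim) + best_match_sum(gj, gj, sim)
    return num / den if den else 0.0


def similarity_vector(query_terms, training_term_sets, sim):
    """One similarity per training protein; this vector is the SVM input."""
    return [protein_similarity(query_terms, g, sim) for g in training_term_sets]


# Hypothetical term-level similarities standing in for Lin's measure.
TOY = {frozenset(("t1", "t2")): 0.5}


def toy_sim(x, y):
    return 1.0 if x == y else TOY.get(frozenset((x, y)), 0.0)


training = [{"t1"}, {"t2"}, {"t3"}]
print(similarity_vector({"t1"}, training, toy_sim))
```

Whatever aggregation is used, the feature vector has as many elements as there are training proteins, which matches the "No. of training proteins" label on the slide.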
Multi-label SVM Scoring [Figure: the GO terms of the query protein Q_t and the GO terms of the training proteins are the inputs to the scoring equation.]
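The scoring equation itself is not preserved here. As a hedged illustration of how M per-location SVM scores can be turned into a multi-label prediction, the sketch below uses a linear kernel and a decision rule that is an assumption, not a confirmed detail of the slides: predict every location whose score is positive, and fall back to the single top-scoring location if none is. `svm_score` and `predict_locations` are hypothetical names.

```python
def svm_score(support_vectors, coeffs, bias, q):
    """Linear-kernel SVM score: sum_r coeff_r * <sv_r, q> + bias.

    coeffs holds the products alpha_r * y_r from training.
    """
    return sum(
        c * sum(a * b for a, b in zip(sv, q))
        for sv, c in zip(support_vectors, coeffs)
    ) + bias


def predict_locations(scores):
    """Return the set of location indices predicted from M SVM scores."""
    positive = {m for m, s in enumerate(scores) if s > 0}
    if positive:
        return positive
    # No SVM fires: keep the single best-scoring location (assumed fallback).
    return {max(range(len(scores)), key=lambda m: scores[m])}


print(predict_locations([0.3, -1.2, 0.8]))
print(predict_locations([-0.5, -0.1, -2.0]))
```

The fallback guarantees that every query protein receives at least one location.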
Benchmark Datasets The Plant dataset
Performance Metrics • Overall locative accuracy: • Overall actual accuracy: • Actual accuracy is more objective and stricter!
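The two formulas are not preserved in this transcript. The sketch below follows the way these metrics are usually defined for multi-label localization, which is an assumption about what the slide showed: locative accuracy counts correctly predicted (protein, location) pairs over all true pairs, while actual accuracy counts a protein as correct only when its predicted label set matches its true label set exactly. The label names are hypothetical.

```python
def locative_accuracy(predicted, true):
    """Correctly predicted locations divided by the total number of true locations."""
    hits = sum(len(p & t) for p, t in zip(predicted, true))
    n_locative = sum(len(t) for t in true)
    return hits / n_locative


def actual_accuracy(predicted, true):
    """Fraction of proteins whose predicted label set equals the true label set."""
    exact = sum(1 for p, t in zip(predicted, true) if p == t)
    return exact / len(true)


true = [{"a"}, {"a", "b"}, {"c"}]
pred = [{"a"}, {"a"}, {"b", "c"}]
print(locative_accuracy(pred, true))  # 3 of 4 true locations recovered
print(actual_accuracy(pred, true))    # only the first protein matches exactly
```

In this toy case under-prediction (second protein) and over-prediction (third protein) both count as errors for actual accuracy, and locative accuracy does not penalise the over-prediction, which is why actual accuracy is the stricter measure.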
Performance Comparison The Plant dataset
Conclusions • Our proposed predictor performs significantly better than Plant-mPLoc and iLoc-Plant, and also better than mGOASVM, in terms of locative and actual accuracies. • The individual locative accuracies of our proposed predictor are significantly higher than those of the three predictors for all of the 12 locations. • In terms of GO information extraction, Plant-mPLoc, iLoc-Plant and mGOASVM use the occurrences of GO terms as features, whereas the proposed predictor exploits the semantic relationships between GO terms, from which the semantic similarity between proteins can be obtained.
Multi-label SVM Classifier Transformed labels for M-class problem:
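The label-transformation formula is not preserved. A common way to train M SVMs on multi-label data, assumed here rather than confirmed by the slides, is to give each protein one +1/-1 target per location: +1 if the protein resides at that location and -1 otherwise. `transform_labels` is a hypothetical helper.

```python
def transform_labels(label_sets, num_classes):
    """Turn each protein's set of location indices into M targets in {+1, -1}."""
    return [
        [1 if m in labels else -1 for m in range(num_classes)]
        for labels in label_sets
    ]


# Two proteins, three locations: the second protein is multi-label.
print(transform_labels([{0}, {1, 2}], 3))
```

Column m of the result is the target vector for the m-th SVM, so a multi-location protein serves as a positive example for several SVMs at once.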
Retrieving GO Terms with/without AC [Figure: flowchart. Is the AC known? If not (N), retrieve homologs by BLAST and use the homolog's AC. Retrieve a set of GO terms: if a set is found (Y), proceed to multi-label SVM classification; if not (N), use back-up methods.]
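The retrieval flow can be sketched as follows. The in-memory `goa` dictionary and the `blast` callable are hypothetical stand-ins for the GOA database and a BLAST search; the sketch also assumes that homologs are tried in ranked order and that a protein whose own AC yields no GO terms falls through to the homolog search.

```python
def retrieve_go_terms(accession, sequence, goa, blast):
    """Return (GO terms, source) for a query protein.

    goa:   dict mapping accession number -> set of GO terms (stand-in for GOA)
    blast: callable mapping a sequence -> ranked list of homolog ACs (stand-in)
    """
    if accession is not None:
        terms = goa.get(accession, set())
        if terms:
            return set(terms), "own AC"
    # AC unknown, or no annotation for it: try homologs in ranked order.
    for homolog_ac in blast(sequence):
        terms = goa.get(homolog_ac, set())
        if terms:
            return set(terms), "homolog " + homolog_ac
    # Nothing found: the caller must switch to a back-up method.
    return set(), "back-up"


goa = {"P1": {"GO:1"}, "H2": {"GO:2"}}
blast = lambda seq: ["H1", "H2"]
print(retrieve_go_terms("P1", "MK", goa, blast))
print(retrieve_go_terms(None, "MK", goa, blast))
```

Returning the source alongside the terms makes it easy to see which branch of the flowchart produced the GO terms for a given query.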
Finding Common Ancestors • The relationships between GO terms in the GO hierarchy can be obtained from the SQL database through the link: http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz. • We considered only the ‘is_a’ relationship.
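A small sketch of the common-ancestor search, restricted to one relationship type, is given below. It works on an in-memory list of (child, relation, parent) triples with hypothetical term IDs; it makes no assumption about the table layout of the SQL dump, and it assumes that a term counts as its own ancestor.

```python
from collections import defaultdict

# Hypothetical edges: (child, relation, parent).
EDGES = [
    ("T4", "is_a", "T2"),
    ("T5", "is_a", "T2"),
    ("T5", "part_of", "T3"),
    ("T2", "is_a", "T1"),
    ("T3", "is_a", "T1"),
]


def build_parents(edges, relations=("is_a",)):
    """Map each child to its parents, keeping only the chosen relation types."""
    parents = defaultdict(set)
    for child, rel, parent in edges:
        if rel in relations:
            parents[child].add(parent)
    return parents


def ancestors(term, parents):
    """Return the term itself plus everything reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def common_ancestors(x, y, parents):
    """A(x, y): the ancestors shared by terms x and y."""
    return ancestors(x, parents) & ancestors(y, parents)


is_a_only = build_parents(EDGES)
print(sorted(common_ancestors("T4", "T5", is_a_only)))
print(sorted(ancestors("T5", build_parents(EDGES, ("is_a", "part_of")))))
```

With only is_a edges, T5 does not reach T3; adding part_of to `relations` would enlarge the ancestor sets and could therefore change A(x,y).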