Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins

Mark Doderer, Kihoon Yoon, and Stephen Kwek, Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas 78249, USA

Presentation Outline • Problem Overview • Background • Problem Statement and Approach • Methods and Materials • Similarity Searching • modECOC • Datasets • Summary of Results • Conclusion

Protein Localization • For a protein to achieve its functional intent, it must localize to its intended location • This information can be used to solve other problems • Experimental determination is through cell fractionation, electron microscopy and fluorescence microscopy. These methods are time-consuming, subjective and highly variable • Putative determination has been shown to be accurate and faster, and it can annotate unknown proteins.

Problem Statement • Single-location prediction • Multi-location prediction • Many predictors focus on the majority class

A hybrid algorithm • If a similar protein can be found, use the known protein to predict the unknown protein • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the unknown protein
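The two-branch rule on this slide can be sketched as a small dispatch function. This is only an illustration: `hybrid_predict`, `find_similar` and `ml_predict` are hypothetical names, and the two callables stand in for the similarity search and the trained predictor, neither of which is specified at this level in the slides.

```python
from typing import Callable, Optional, Set


def hybrid_predict(
    sequence: str,
    find_similar: Callable[[str], Optional[Set[str]]],
    ml_predict: Callable[[str], Set[str]],
) -> Set[str]:
    """Return localization labels for one protein sequence.

    find_similar: hypothetical lookup returning the labels of a similar,
                  already-annotated protein, or None if no match is found.
    ml_predict:   hypothetical machine learning predictor used as fallback.
    """
    # Branch 1: a similar known protein exists, so transfer its labels.
    known_labels = find_similar(sequence)
    if known_labels:
        return set(known_labels)
    # Branch 2: no similar protein, so fall back to the learned predictor.
    return set(ml_predict(sequence))


# Toy stand-ins (made-up data) showing both branches.
annotated = {"MKTAYIAKQR": {"cyto", "nucl"}}


def lookup(seq: str) -> Optional[Set[str]]:
    return annotated.get(seq)


def fallback(seq: str) -> Set[str]:
    return {"mito"}


print(hybrid_predict("MKTAYIAKQR", lookup, fallback))
print(hybrid_predict("AAAAAAAAAA", lookup, fallback))
```

Passing the two components in as callables keeps the sketch independent of how the similarity search or the classifier is actually implemented.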

Similarity Searching Classifier • BlastAll • PAM30 Matrix • Bit score of 100
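As a rough illustration of how a bit-score cutoff might be applied, the sketch below filters BLAST tabular output (the 12-column format, in which the last column is the bit score) and keeps the best-scoring hit per query. It assumes the "Bit score of 100" on the slide is a minimum threshold for accepting a hit; the function name `best_hits` and the sample rows are made up for the example.

```python
from typing import Dict, Iterable, Tuple

BIT_SCORE_CUTOFF = 100.0  # value from the slide, read here as a minimum (assumption)


def best_hits(
    tabular_lines: Iterable[str], cutoff: float = BIT_SCORE_CUTOFF
) -> Dict[str, str]:
    """Map each query ID to its best subject ID with bit score >= cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular row
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        if bit_score < cutoff:
            continue  # too weak to count as a similar protein
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


# Made-up rows: q1 has two acceptable hits, q2 has only a weak one.
rows = [
    "q1\ts1\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t250.0",
    "q1\ts2\t60.0\t150\t60\t2\t1\t150\t5\t155\t1e-20\t120.0",
    "q2\ts3\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t40.2",
]
print(best_hits(rows))
```

Queries that end up without an entry in the returned mapping are the ones that would be handed to the machine learning classifier.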

modECOC – machine learning classifier • Dietterich and Bakiri proposed the Error Correcting Output Code Classifier • Handles problems with many classes • Reliable class probability estimates • Doesn’t ignore the minority classes • Can use any classifier for the base classifiers

Relabeling a dataset

Modification to ECOC to allow for multi-location prediction • Modify base classifier labeling • “cyto_plas” will be re-labeled as 0. • “cyto_nucl” will be left out of this base classifier • Prediction through class score from voting • Find mean of class probabilities • Find standard deviations from mean for each class • Predict classes significantly different than the other classes
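The relabeling and prediction steps can be sketched as follows. This is one reading of the slide, not a confirmed implementation: a multi-location label is assumed to be relabeled when all of its locations fall on the same side of a base classifier's split and left out otherwise, and "significantly different" is assumed to mean a z-score above a cutoff. The names `relabel`, `predict_locations` and `z_cutoff`, and the cutoff value itself, are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def relabel(instance_label: str, column: Dict[str, int]) -> Optional[int]:
    """Relabel one instance for one base classifier.

    column maps each single location to 0 or 1 for this base classifier.
    Returns 0 or 1, or None when the instance is left out because its
    locations fall on both sides of the split (assumed rule).
    """
    bits = {column[name] for name in instance_label.split("_")}
    return bits.pop() if len(bits) == 1 else None


def predict_locations(
    class_scores: Dict[str, float], z_cutoff: float = 1.0
) -> List[str]:
    """Predict every class whose score stands out from the mean.

    z_cutoff is an assumed threshold, in standard deviations above the mean.
    """
    mu = mean(class_scores.values())
    sigma = pstdev(class_scores.values())
    if sigma == 0:
        return []  # all classes score the same, so nothing stands out
    return sorted(
        name
        for name, score in class_scores.items()
        if (score - mu) / sigma > z_cutoff
    )


# One base classifier that groups cyto and plas against nucl.
column = {"cyto": 0, "plas": 0, "nucl": 1}
print(relabel("cyto_plas", column))  # both locations map to 0
print(relabel("cyto_nucl", column))  # locations disagree, so left out

# Made-up class scores after voting across the base classifiers.
scores = {"cyto": 0.45, "nucl": 0.40, "mito": 0.05, "plas": 0.05, "extr": 0.05}
print(predict_locations(scores))
```

Because the number of predicted locations depends on how many scores stand out, the same rule can return one, two or more locations without a fixed label count.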

Features – characterizing the data • Amino acid frequency and sequence length • Physicochemical characteristics • Betts and Russell • Hydrophobicity, polarity, size, aliphatic, charge, aromatic and C-beta branching • For example, hydrophobicity • Very hydrophobic: valine, isoleucine, leucine, methionine, phenylalanine, tryptophan and cysteine • Least, partial and other • Gapped pairs with a gap of 0, 1 and 2 amino acids • Offers spatial information • The N- and C-terminal regions contain the signal peptides, if they exist. Using 30 amino acids from each region and the reduced alphabet gives 19 x 2 features.
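A minimal sketch of three of these feature families is shown below: amino acid frequency with sequence length, gapped pair counts, and extraction of the terminal regions. It is simplified on purpose. Pairs are counted over the plain 20-letter alphabet, and the only physicochemical group used is the "very hydrophobic" set listed on the slide; the full reduced alphabet behind the 19 x 2 terminal features is not reproduced. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Very hydrophobic residues as listed on the slide: V, I, L, M, F, W, C.
VERY_HYDROPHOBIC = set("VILMFWC")


def composition(seq: str) -> Dict[str, float]:
    """Amino acid frequencies, sequence length, very-hydrophobic fraction."""
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero on an empty sequence
    feats = {f"freq_{aa}": counts[aa] / n for aa in AMINO_ACIDS}
    feats["length"] = float(len(seq))
    feats["frac_very_hydrophobic"] = (
        sum(counts[aa] for aa in VERY_HYDROPHOBIC) / n
    )
    return feats


def gapped_pairs(seq: str, gap: int) -> Counter:
    """Count residue pairs separated by `gap` intervening residues."""
    step = gap + 1
    return Counter(seq[i] + seq[i + step] for i in range(len(seq) - step))


def terminal_regions(seq: str, width: int = 30) -> Tuple[str, str]:
    """Return the first and last `width` residues (N- and C-terminal)."""
    return seq[:width], seq[-width:]


example = "MVLLA"  # made-up toy sequence
print(composition(example)["freq_L"])
print(gapped_pairs(example, gap=1))
print(terminal_regions("A" * 10 + "C" * 40))
```

A gap of 0 counts adjacent residues, while gaps of 1 and 2 pair residues that are two and three positions apart, which is how the pair counts carry some spatial information.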

Datasets • WolfPsort • Three groups of species • 12771 animal, 2333 plant and 2113 fungi proteins • From SwissProt • 12 unique labels • Maximum of two labels • Very imbalanced • PHPD • 5191 yeast proteins • 22 unique labels • ranges from 2 to 5 possible labels

Experiments • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort) • Prediction and Scoring • WolfPsort • Partial score for partially correct predictions • Never predicts more than 2 locations • PHPD • Always predicts three locations • Three measures – anything correct, average score for labels correct, each class score for that class prediction

Results compared with WolfPsort

Results compared with PHPD

Conclusion • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and can work on a variety of datasets without parameter changes • ECOC is well suited to representing protein localization problems • modECOC handles multi-label problems with flexibility during prediction
