Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins Mark Doderer, Kihoon Yoon, and Stephen Kwek Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas 78249, USA
Presentation Outline • Problem Overview • Background • Problem Statement and Approach • Methods and Materials • Similarity Searching • modECOC • Datasets • Summary of Results • Conclusion
Protein Localization • For a protein to perform its function it must localize to its intended cellular compartment • Localization information can be used to solve other problems • Experimental determination is through cell fractionation, electron microscopy, and fluorescence microscopy. These methods are time-consuming, subjective, and highly variable • Putative (computational) determination has been shown to be accurate and faster, and it can annotate unknown proteins.
Problem Statement • Single location prediction • Multi location prediction • Many predictors focus on the majority class
A hybrid algorithm • If a similar protein can be found, use the known protein's localization to predict that of the unknown protein • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the unknown protein's localization
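The two-branch rule on this slide can be sketched as a small dispatch function. This is a minimal illustration, not the authors' implementation: the names `hybrid_predict`, `similarity_lookup`, and `ml_predict` are hypothetical, and the two callables stand in for the similarity search and the machine learning predictor.

```python
from typing import Callable, Optional, Set


def hybrid_predict(
    sequence: str,
    similarity_lookup: Callable[[str], Optional[Set[str]]],
    ml_predict: Callable[[str], Set[str]],
) -> Set[str]:
    """Return localization labels for one protein sequence.

    similarity_lookup (hypothetical) returns the labels of a similar known
    protein, or None when no sufficiently similar protein exists.
    ml_predict (hypothetical) is the fallback learned predictor.
    """
    known_labels = similarity_lookup(sequence)
    if known_labels is not None:
        # A similar annotated protein was found: reuse its labels.
        return known_labels
    # No similar protein: fall back to the machine learning predictor.
    return ml_predict(sequence)
```

For example, with a dictionary of annotated sequences, `hybrid_predict(seq, annotated.get, model)` reuses an annotation when one exists and calls `model` otherwise.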
Similarity Searching Classifier • BlastAll • PAM30 Matrix • Bit score of 100
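One way to apply the bit score of 100 is to filter BLAST hits after the search. The sketch below assumes the search results are available in BLAST's 12-column tab-separated format (the `-m 8` output of legacy blastall), where the subject ID is the second field and the bit score is the last. Treating 100 as an inclusive minimum is an assumption, and `best_hit` is a hypothetical helper name.

```python
from typing import Iterable, Optional

MIN_BIT_SCORE = 100.0  # cutoff taken from the slide; inclusive use is an assumption


def best_hit(tabular_lines: Iterable[str],
             min_bits: float = MIN_BIT_SCORE) -> Optional[str]:
    """Return the subject ID of the best hit scoring at least min_bits, else None."""
    best_id: Optional[str] = None
    best_bits = min_bits
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject_id = fields[1]      # column 2: subject (database) sequence ID
        bits = float(fields[11])    # column 12: bit score
        if bits >= best_bits:
            best_id, best_bits = subject_id, bits
    return best_id
```

A return value of `None` signals that no similar protein was found, which is the case handed to the machine learning classifier.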
modECOC – machine learning classifier • Dietterich and Bakiri proposed the Error Correcting Output Code Classifier • Handles problems with many classes • Reliable class probability estimates • Doesn’t ignore the minority classes • Can use any classifier for the base classifiers
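To make the error-correcting idea concrete, here is a minimal sketch of standard ECOC decoding, not the modified version. Each class gets a binary codeword, each bit position corresponds to one base classifier, and a prediction is the class whose codeword is closest in Hamming distance to the base classifiers' outputs. The four class names and the 7-bit code matrix are illustrative assumptions.

```python
from typing import Dict, List, Sequence

# Illustrative 7-bit code for four classes (an assumption, not the authors' matrix).
# Every pair of codewords differs in 4 positions, so one wrong bit is corrected.
CODE_MATRIX: Dict[str, List[int]] = {
    "cyto": [1, 1, 1, 1, 1, 1, 1],
    "nucl": [0, 0, 0, 0, 1, 1, 1],
    "plas": [0, 0, 1, 1, 0, 0, 1],
    "mito": [0, 1, 0, 1, 0, 1, 0],
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits: Sequence[int],
                code_matrix: Dict[str, List[int]] = CODE_MATRIX) -> str:
    """Return the class whose codeword is nearest to the base classifier outputs."""
    return min(code_matrix, key=lambda label: hamming(bits, code_matrix[label]))
```

Each column of the matrix defines one binary problem, so any binary learner can serve as a base classifier.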
Modification to ECOC to allow for multi-location prediction • Modify base classifier labeling • “cyto_plas” will be re-labeled as 0. • “cyto_nucl” will be left out of this base classifier • Prediction through class score from voting • Find the mean of the class probabilities • Find how many standard deviations each class is from the mean • Predict the classes whose scores differ significantly from those of the other classes
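The two modifications can be sketched as follows. This is one reading of the slide, stated as an assumption: within a base classifier, a dual-label protein keeps a label only when both of its locations map to the same bit (so "cyto_plas" becomes 0 when cyto and plas are both 0), and is left out when they disagree (as with "cyto_nucl"). The function names and the cutoff of one standard deviation are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Set


def relabel(locations: Set[str], column_bits: Dict[str, int]) -> Optional[int]:
    """Training label of a protein for one base classifier.

    Returns the shared bit if all of the protein's locations agree,
    or None to leave the protein out of this base classifier.
    """
    bits = {column_bits[loc] for loc in locations}
    return bits.pop() if len(bits) == 1 else None


def select_locations(scores: Dict[str, float], z_cutoff: float = 1.0) -> List[str]:
    """Predict every class whose score is at least z_cutoff standard deviations above the mean.

    z_cutoff = 1.0 is an assumed value; the slide does not state one.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all classes score the same, so none stands out
    return sorted(c for c, s in scores.items() if (s - mu) / sigma >= z_cutoff)
```

Because the number of selected classes is not fixed in advance, the same rule can return one location or several.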
Features – characterizing the data • Amino acid frequency and sequence length • Physicochemical characteristics • Betts and Russell • hydrophobicity, polarity, size, aliphatic, charge, aromatic, and C-beta branching • For example, hydrophobicity • Very hydrophobic: valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine • Least, partial, and other • Gapped pairs with a gap of 0, 1, and 2 amino acids • Offers spatial information • The N- and C-terminal regions contain the signal peptides if they exist. Using 30 amino acids from each region and the reduced alphabet gives us 19 x 2 features.
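Two of the feature families above, amino acid frequency with sequence length and gapped pairs, can be computed in a few lines. This sketch works on the plain 20-letter alphabet; the physicochemical groupings and the reduced alphabet used for the terminal regions are not reproduced here, and the function and feature names are hypothetical.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq: str) -> Dict[str, float]:
    """Relative frequency of each amino acid plus the sequence length."""
    counts = Counter(seq)
    n = len(seq)
    feats = {f"freq_{aa}": (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS}
    feats["length"] = float(n)
    return feats


def gapped_pairs(seq: str, gap: int) -> Counter:
    """Count residue pairs separated by `gap` intervening residues (gap = 0, 1, or 2)."""
    step = gap + 1
    return Counter(seq[i] + seq[i + step] for i in range(len(seq) - step))


def terminal_regions(seq: str, width: int = 30):
    """N-terminal and C-terminal windows of `width` residues, as used on the slide."""
    return seq[:width], seq[-width:]
```

With a gap of 0 the pairs are adjacent residues; gaps of 1 and 2 capture residues that are slightly farther apart in the chain.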
Datasets • WolfPsort • Three groups of species • 12771 animal, 2333 plant, and 2113 fungal proteins • From SwissProt • 12 unique labels • Maximum of two labels • Very imbalanced • PHPD • 5191 yeast proteins • 22 unique labels • Ranges from 2 to 5 possible labels
Experiments • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort) • Prediction and Scoring • WolfPsort • Partial score for partially correct predictions • Never predicts more than 2 locations • PHPD • Always predicts three locations • Three measures: anything correct, average score for labels correct, and each class score for that class prediction
Conclusion • A hybrid classifier exploits the strengths of both BLAST similarity searching and machine learning classification, and can work on a variety of datasets without parameter changes • ECOC is well suited to representing protein localization problems • modECOC handles multi-label problems with flexibility during prediction