
Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins

Mark Doderer, Kihoon Yoon, and Stephen Kwek. Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas 78249, USA.


Presentation Transcript


  1. Species Independent Protein Localization Prediction for Multi-compartmentalized Proteins Mark Doderer, Kihoon Yoon, and Stephen Kwek Department of Computer Science, University of Texas at San Antonio, San Antonio, Texas 78249, USA

  2. Presentation Outline • Problem Overview • Background • Problem Statement and Approach • Methods and Materials • Similarity Searching • modECOC • Datasets • Summary of Results • Conclusion

  3. Protein Localization • For a protein to perform its intended function, it must localize to its intended location • This information can be used to solve other problems • Experimental determination is through cell fractionation, electron microscopy and fluorescence microscopy. These methods are time-consuming, subjective and highly variable • Putative (computational) determination has been shown to be accurate and faster, and can annotate unknown proteins.

  4. Problem Statement • Single-location prediction • Multi-location prediction • Many predictors focus on the majority class

  5. A hybrid algorithm • If a similar protein can be found, use the known protein's localization to predict that of the unknown protein • If a similar protein cannot be found, use a machine learning predictor built from the known data to predict the unknown protein
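
The two-branch rule on this slide can be sketched as a small dispatch function. This is only an illustration: `hybrid_predict`, `toy_lookup`, `toy_model` and the toy sequences are hypothetical names and data, not taken from the presentation.

```python
from typing import Callable, Optional, Set


def hybrid_predict(
    sequence: str,
    find_similar: Callable[[str], Optional[Set[str]]],
    ml_predict: Callable[[str], Set[str]],
) -> Set[str]:
    """Use a similar known protein if one exists, else fall back to the ML model."""
    known_locations = find_similar(sequence)
    if known_locations is not None:
        return known_locations
    return ml_predict(sequence)


# Hypothetical stand-ins for the similarity search and the trained classifier.
KNOWN = {"MKTAYIAKQR": {"cyto", "nucl"}}


def toy_lookup(seq: str) -> Optional[Set[str]]:
    return KNOWN.get(seq)


def toy_model(seq: str) -> Set[str]:
    return {"mito"}


print(hybrid_predict("MKTAYIAKQR", toy_lookup, toy_model))  # similarity branch
print(hybrid_predict("GGGGGGGGGG", toy_lookup, toy_model))  # ML fallback branch
```

Passing the two predictors in as callables keeps the dispatch independent of how the similarity search or the classifier is actually implemented.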

  6. Similarity Searching Classifier • blastall (NCBI BLAST) • PAM30 substitution matrix • Bit score cutoff of 100
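
One way to apply the bit score value from this slide is to filter tabular BLAST output (for example, the 12-column `-m 8` format of legacy `blastall`, where the bit score is the last column). The sketch below assumes the value 100 acts as a minimum cutoff and that the best-scoring hit is used; the slide does not spell out either detail. The hit lines are invented for illustration.

```python
from typing import Optional

BIT_SCORE_CUTOFF = 100.0  # value from the slide; treating it as a minimum is an assumption


def best_hit(tabular_output: str, query_id: str,
             cutoff: float = BIT_SCORE_CUTOFF) -> Optional[str]:
    """Return the subject ID of the highest-scoring hit at or above the cutoff."""
    best_subject, best_score = None, cutoff
    for line in tabular_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 12 or fields[0] != query_id:
            continue
        bit_score = float(fields[11])  # 12th column holds the bit score
        if bit_score >= best_score:
            best_subject, best_score = fields[1], bit_score
    return best_subject


# Invented example hits: query, subject, %id, length, mismatches, gaps,
# q.start, q.end, s.start, s.end, e-value, bit score.
sample = "\n".join([
    "q1\tP001\t95.0\t120\t6\t0\t1\t120\t1\t120\t1e-60\t230.0",
    "q1\tP002\t40.0\t80\t48\t2\t5\t84\t10\t89\t1e-5\t45.1",
    "q2\tP003\t35.0\t60\t39\t1\t1\t60\t3\t62\t0.002\t38.5",
])

print(best_hit(sample, "q1"))  # P001
print(best_hit(sample, "q2"))  # None: no hit reaches the cutoff
```

A query that returns `None` here would be handed to the machine learning classifier instead.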

  7. modECOC – machine learning classifier • Dietterich and Bakiri proposed the Error Correcting Output Code Classifier • Handles problems with many classes • Reliable class probability estimates • Doesn’t ignore the minority classes • Can use any classifier for the base classifiers
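
For readers unfamiliar with error-correcting output codes, the decoding step can be sketched as follows. Each class gets a binary codeword, each column is one binary base classifier, and the predicted class is the one whose codeword is closest in Hamming distance to the base classifiers' outputs. The four class names and the 7-bit exhaustive code below are illustrative assumptions, not the code matrix used in the presentation.

```python
from typing import Dict, Sequence, Tuple

# Exhaustive code for 4 classes (7 columns); any two codewords differ in 4 bits,
# so a single wrong base classifier can be corrected.
CODE_MATRIX: Dict[str, Tuple[int, ...]] = {
    "cyto": (1, 1, 1, 1, 1, 1, 1),
    "nucl": (0, 0, 0, 0, 1, 1, 1),
    "mito": (0, 0, 1, 1, 0, 0, 1),
    "plas": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: Sequence[int]) -> str:
    """Return the class whose codeword is nearest to the observed bits."""
    return min(CODE_MATRIX, key=lambda name: hamming(CODE_MATRIX[name], bits))


# The "mito" codeword with its sixth bit flipped by one erring base classifier.
noisy = (0, 0, 1, 1, 0, 1, 1)
print(decode(noisy))  # mito
```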

  8. Relabeling a dataset

  9. Modification to ECOC to allow for multi-location prediction • Modify base classifier labeling • “cyto_plas” will be re-labeled as 0. • “cyto_nucl” will be left out of this base classifier • Prediction through class score from voting • Find the mean of the class probabilities • Find how many standard deviations each class lies from the mean • Predict the classes that differ significantly from the other classes
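
A possible reading of the two steps on this slide is sketched below. The relabeling rule (keep a multi-location protein when all of its locations share the same bit in a column, drop it otherwise) is an interpretation chosen to reproduce the two examples on the slide. The column assignment, the `z_cutoff` threshold, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def relabel(label: str, column: Dict[str, int]) -> Optional[int]:
    """Binary label for one base classifier, or None to leave the protein out."""
    bits = {column[location] for location in label.split("_")}
    return bits.pop() if len(bits) == 1 else None


def predict_locations(class_scores: Dict[str, float],
                      z_cutoff: float = 1.0) -> List[str]:
    """Predict every class scoring at least z_cutoff std devs above the mean."""
    values = list(class_scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all classes tie, nothing stands out
    return sorted(c for c, s in class_scores.items() if (s - mu) / sigma >= z_cutoff)


# Hypothetical column of the code matrix for one base classifier.
column = {"cyto": 0, "plas": 0, "nucl": 1, "mito": 1}
print(relabel("cyto_plas", column))  # 0: both locations map to 0
print(relabel("cyto_nucl", column))  # None: locations disagree, left out

# Hypothetical class scores after voting.
scores = {"cyto": 0.45, "nucl": 0.40, "mito": 0.05, "plas": 0.05, "extr": 0.05}
print(predict_locations(scores))  # ['cyto', 'nucl']
```

Because the number of predicted classes depends on the scores rather than on a fixed count, the same rule can return one, two, or more locations.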

  10. Features – characterizing the data • Amino acid frequency and sequence length • Physicochemical characteristics (Betts and Russell): hydrophobicity, polarity, size, aliphatic, charge, aromatic and C-beta branching • For example, hydrophobicity: very hydrophobic (valine, isoleucine, leucine, methionine, phenylalanine, tryptophan, and cysteine), least, partial and other • Gapped pairs with a gap of 0, 1 and 2 amino acids, which offer spatial information • The N- and C-terminal regions contain the signal peptides, if they exist. Using 30 amino acids from each region and the reduced alphabet gives 19 x 2 features.
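
Three of the feature families listed on this slide can be sketched in a few lines: amino acid frequency, gapped pair counts, and extraction of the terminal regions. The function names are hypothetical, and the sketch uses the plain 20-letter alphabet because the slide does not spell out the reduced-alphabet groupings.

```python
from collections import Counter
from typing import Dict, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> Dict[str, float]:
    """Relative frequency of each of the 20 standard amino acids."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def gapped_pair_counts(seq: str, gap: int) -> Counter:
    """Count residue pairs separated by `gap` intervening residues."""
    step = gap + 1
    pairs = Counter()
    for i in range(len(seq) - step):
        pairs[(seq[i], seq[i + step])] += 1
    return pairs


def terminal_regions(seq: str, width: int = 30) -> Tuple[str, str]:
    """N-terminal and C-terminal windows (whole sequence if shorter than width)."""
    return seq[:width], seq[-width:]


seq = "MKKLAV"  # toy sequence
print(len(seq), composition(seq)["K"])        # length and frequency of lysine
print(gapped_pair_counts(seq, 0)[("K", "K")])  # adjacent K-K pairs
print(gapped_pair_counts(seq, 1)[("M", "K")])  # M and K with one residue between
```

With a gap of 0 the pairs are ordinary dipeptides; larger gaps capture residues that sit near each other without being adjacent.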

  11. Datasets • WolfPsort • Three groups of species • 12771 animal, 2333 plant and 2113 fungi proteins • From SwissProt • 12 unique labels • Maximum of two labels • Very imbalanced • PHPD • 5191 yeast proteins • 22 unique labels • ranges from 2 to 5 possible labels

  12. Experiments • Cross-validation (2-fold for PHPD, 5-fold for WolfPsort) • Prediction and Scoring • WolfPsort • Partial score for partially correct predictions • Never predicts more than 2 locations • PHPD • Always predicts three locations • Three measures – anything correct, average score for labels correct, each class score for that class prediction
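
The slide names the scoring measures without defining them. The sketch below shows one plausible reading of the first two ("anything correct" and a partial score for partially correct predictions) using set overlap; the exact formulas used in the experiments are not given, so these definitions and function names are assumptions.

```python
from typing import Set


def any_correct(predicted: Set[str], actual: Set[str]) -> bool:
    """True if at least one predicted location is a true location."""
    return bool(predicted & actual)


def fraction_correct(predicted: Set[str], actual: Set[str]) -> float:
    """Share of the true locations that were predicted (assumed partial score)."""
    if not actual:
        raise ValueError("protein has no annotated locations")
    return len(predicted & actual) / len(actual)


predicted = {"cyto", "mito"}  # toy prediction
actual = {"cyto", "nucl"}     # toy annotation
print(any_correct(predicted, actual))       # True
print(fraction_correct(predicted, actual))  # 0.5
```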

  13. Results compared with WolfPsort

  14. Results compared with PHPD

  15. Conclusion • A hybrid classifier exploits the strengths of BLAST similarity searching and machine learning classification and can work on a variety of datasets without parameter changes • ECOC is well suited to representing protein localization problems • modECOC handles multi-label problems with flexibility during prediction
