
Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids




  1. Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids • Y. Wang, O. Zaiane, R. Goebel

  2. Introduction • Protein: linear sequence of amino acids • Protein subcellular localization • Plant: nuclear, cytoplasmic, mitochondrial, extracellular, … • Intracellular vs. Extracellular • Sequence information alone • Class imbalance • Transparency

  3. Related Work • N-terminal sorting signals • Amino acid composition • Lexical analysis • Integrative approach • Subsequence methods

  4. Predicting Extracellular Proteins • Feature Extraction • Support Vector Machine • Boosting • Frequent Pattern Method

  5. Feature Extraction • Frequent subsequences: subsequences that occur in more than a certain percentage of extracellular proteins • Strong discriminative power • Perform similar functions via related biochemical mechanisms • Capture local similarity
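
The definition of a frequent subsequence on this slide can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation: it assumes that "subsequence" means a contiguous substring, it enumerates substrings directly instead of using a generalized suffix tree, and the names `frequent_subsequences`, `min_len`, `max_len` and `min_sup` as well as the toy sequences are hypothetical.

```python
from collections import Counter

def frequent_subsequences(sequences, min_len=3, max_len=5, min_sup=0.05):
    """Return {substring: support} for contiguous substrings that occur
    in more than a fraction min_sup of the given sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        # A sequence contributes at most once per substring.
        counts.update(seen)
    threshold = min_sup * len(sequences)
    return {s: c / len(sequences) for s, c in counts.items() if c > threshold}

# Toy sequences, made up for illustration (not taken from the dataset).
toy = ["MKTLLA", "AKTLLG", "MKTAAA", "GGGGGG"]
freq = frequent_subsequences(toy, min_len=3, max_len=4, min_sup=0.4)
```

In the setting described here, `sequences` would hold only the extracellular proteins, so that support is measured within the positive class.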

  6. Generalized Suffix Tree

  7. Support Vector Machine • Input data represented as feature vectors • Find a linear separator that separates the data and maximizes the margin • Kernel function: nonlinear separator
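
To make "maximizes the margin" concrete, the sketch below compares two linear separators on a toy dataset by their geometric margin, that is, the smallest signed distance of any point to the hyperplane. It does not train an SVM; the points, labels and both candidate separators are made up for illustration, and the function names are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Value of w . x + b for a linear separator (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, points, labels):
    """Smallest signed distance of any point to the hyperplane.
    Positive only if every point is on its correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(points, labels))

# Toy 2-D data with labels in {-1, +1}.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), -1.0, points, labels)
```

Both separators classify all four points correctly, but the first leaves a wider margin, so a maximum-margin method would prefer it over the second.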

  8. SVM for extracellular protein prediction • Data Transformation (sequence → vector) • Frequent subsequences as features • Transform protein sequences into binary vectors • Kernel Functions • Linear kernel • Polynomial kernel • RBF kernel
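
The data transformation and the three kernels listed on this slide might look as follows. This is a sketch under assumptions: a feature is set to 1 when the frequent subsequence occurs anywhere in the protein, the kernel parameters (`degree`, `coef0`, `gamma`) are illustrative defaults rather than the values used in the experiments, and the feature list and sequences are made up.

```python
import math

def to_binary_vector(sequence, features):
    """One entry per frequent subsequence: 1 if it occurs in the sequence."""
    return [1 if f in sequence else 0 for f in features]

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical feature set and toy sequences.
features = ["KTL", "TLL", "GGG"]
x = to_binary_vector("MKTLLA", features)   # contains KTL and TLL
y = to_binary_vector("MKTLGGG", features)  # contains KTL and GGG
```

On binary vectors the linear kernel simply counts the frequent subsequences two proteins share, which is one reason this representation stays easy to interpret.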

  9. Boosting • Iterative algorithm that improves a weak classifier • Uses a different weight distribution over the examples in each iteration • Increases the weights of incorrectly classified examples and decreases the weights of correctly classified ones
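
The reweighting step described above can be illustrated with one round of the standard discrete AdaBoost update. The transcript does not say which boosting variant or weak learner was used, so this is an assumption; the labels, predictions and the function name `adaboost_round` are hypothetical.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One discrete AdaBoost reweighting step.
    labels and predictions are in {-1, +1}; weights sum to 1.
    Requires a weighted error strictly between 0 and 1."""
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # y * h is +1 for correct predictions (weight shrinks)
    # and -1 for mistakes (weight grows).
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]      # the second example is misclassified
weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After the update the single misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.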

  10. AdaBoost

  11. Frequent Pattern Method • Frequent pattern: *X1*X2*…*Xn* → extracellular • X1, X2, …, Xn are frequent subsequences • “*” matches anywhere from zero up to MaxGap amino acids when the pattern is matched against a protein sequence
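
One way to match such a gapped pattern is to translate it into a regular expression in which every "*" becomes a bounded wildcard. The sketch below assumes the pattern is matched against the whole sequence, so the leading and trailing "*" are bounded by MaxGap as well; the transcript does not settle this point. The helper names and the toy sequences are hypothetical.

```python
import re

def pattern_to_regex(subsequences, max_gap):
    """Build a regex for *X1*X2*...*Xn* where each * spans 0..max_gap residues."""
    gap = ".{0,%d}" % max_gap
    body = gap.join(re.escape(s) for s in subsequences)
    return re.compile(gap + body + gap)

def matches(subsequences, max_gap, protein):
    """True if the whole protein sequence fits the gapped pattern."""
    return pattern_to_regex(subsequences, max_gap).fullmatch(protein) is not None

hit = matches(["MKT", "LLA"], 2, "MKTGGLLA")        # gap of two residues fits
too_far = matches(["MKT", "LLA"], 1, "MKTGGLLA")    # gap exceeds MaxGap
wrong_order = matches(["LLA", "MKT"], 5, "MKTGGLLA")
```

A protein that matches the left-hand side of a rule would then be predicted as extracellular.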

  12. FOIL algorithm
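
The slide gives only the algorithm's name. As background, FOIL grows a rule greedily by adding the condition with the highest FOIL gain, which in its standard textbook form can be computed as below. How the authors adapt FOIL to frequent subsequences is not recoverable from the transcript, and the counts for the candidate condition are hypothetical.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Standard FOIL gain of extending a rule.
    p0, n0: positive / negative examples covered before the extension.
    p1, n1: positive / negative examples covered after the extension."""
    if p1 == 0:
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

# An empty rule covers the whole plant dataset: 171 extracellular
# proteins out of 3293. The candidate condition below is assumed to
# keep 60 positives and 20 negatives (made-up counts).
gain = foil_gain(171, 3293 - 171, 60, 20)
no_change = foil_gain(171, 3293 - 171, 171, 3293 - 171)
```

A condition that leaves coverage unchanged has zero gain, so a minimum-gain threshold filters out extensions that do not sharpen the rule.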

  13. Z-number • Support of rule R • Accuracy of rule R

  14. Experiments • Dataset (PASub project at UofA) • Plant: 3293 proteins, 171 extracellular • Five-fold cross-validation
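
With only 171 extracellular proteins out of 3293, a plain random split can leave folds with very uneven numbers of positives. The sketch below shows a stratified split into five folds; whether the authors stratified their folds is not stated, so this is an assumption, and the function name and seed are hypothetical.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds, keeping class proportions
    roughly equal across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Class sizes from the slide: 171 extracellular (1), the rest intracellular (0).
labels = [1] * 171 + [0] * (3293 - 171)
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold serves once as the test set while the other four are used for training.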

  15. Evaluation Metrics • Overall accuracy alone is not a good enough measure under class imbalance • F-measure
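
The class sizes from the experiments slide show why accuracy alone is misleading. The sketch below uses the standard F-measure (the harmonic mean of precision and recall when `beta` is 1); the transcript does not say which weighting the authors used, so `beta=1.0` is an assumption.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure on the positive (extracellular) class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A trivial classifier that labels every protein as intracellular:
# all 171 extracellular proteins become false negatives.
baseline_accuracy = accuracy(tp=0, fp=0, fn=171, tn=3293 - 171)
baseline_f = f_measure(tp=0, fp=0, fn=171)
```

The trivial classifier reaches an accuracy of about 0.95 without finding a single extracellular protein, while its F-measure is 0.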

  16. Result (SVM with subsequence)

  17. Result (Boosting with subsequence)

  18. Result (Frequent Pattern) • MinLen=3 • Min_gain=0.1 • MinSup=5% • MinConf=80% • MaxGap=300

  19. Result (SVM with composition)

  20. Result (Boosting with composition)

  21. Cross Comparison

  22. SVM with combined features

  23. Boosting with combined features

  24. Effects of MinLen on SVM

  25. Effects of MinLen on boosting

  26. Conclusion • Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids • SVM achieves the best result • The FSP method provides easily interpretable rules

  27. Future Work • Use more information about proteins (e.g., structure, function, …) • Integrate amino acid composition into the FSP method • Incorporate more biological knowledge
