Sixth Annual Joint Bioinformatics Symposium 2006. Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program Department of Computer Science.
Machine Learning Versus Profile-Based Methods for Protein Phosphorylation Site Prediction

Yasser EL-Manzalawy, Cornelia Caragea, Drena Dobbs, and Vasant Honavar

Prediction of Phosphorylation Sites: Motivation

Protein phosphorylation, performed by protein kinases, is a very important process involved in signal transduction pathways. Predicting phosphorylation sites is an essential step towards understanding phosphorylation, which in turn is essential for understanding diseases and, ultimately, designing drugs that can prevent or cure them.

Profile-Based Approaches
• Scansite: a web service that uses 63 experimentally developed motifs, represented as PSSMs, to identify potential Ser/Thr phosphorylation sites.
• KinasePhos: another web service, which uses kinase-specific HMMs for prediction.
• Basic PSSM: our implementation of PSSM motifs using the PROFILEWEIGHT program.
• Basic HMM: our implementation of HMM motifs using the HMMER software package.

Results

Table 2 compares the performance of ML methods against profile-based methods for predicting kinase-specific phosphorylation sites. We also report the ROC curves for basic PSSM and basic HMM in Fig. 3.

Conclusions

We proposed PSSMPhos, a method for combining PSSM profiles and ML methods. Our study demonstrates the superiority of ML over profile-based methods when enough training data is available. Our experiments suggest that ML methods and profile-based methods should complement each other to produce more efficient phosphorylation site prediction tools.

Sequence-Based Machine Learning Methods
• The set of features for each Ser or Thr is based on a window of n amino acids (n=15) centered on that Ser or Thr residue.
• Encode each window as a 20*n binary vector, in which each entry denotes whether or not a particular amino acid appears at a particular position.
• Using this binary encoding, evaluate the performance of Support Vector Machine with Gaussian kernel (Bin(SVM)), Naïve Bayes (Bin(NB)), and Decision Tree (Bin(C4.5)) machine learning algorithms.

Fig. 1: Addition of a phosphate to an amino acid
Fig. 2: Conformation changes caused by phosphorylation

In this study, we empirically compare a number of Machine Learning (ML) and profile-based methods for predicting kinase-specific protein phosphorylation sites. We propose a method for combining PSSM profiles and ML approaches. Our proposed method yields fast and simple classifiers that consistently outperform profile-based methods for predicting kinase-specific phosphorylation sites.

PSSM-Based Representation: Our Approach (PSSMPhos)
• Combines profile-based and machine learning approaches.
• PSSM motifs are obtained as before for each kinase family.
• Encode each window as a vector of length n+1 using the computed PSSM, <e_1(x_1), …, e_n(x_n), Score(x)>, where e_i(x_i) is the PSSM score for observing amino acid x_i at position i and Score(x) is the sum of the n PSSM scores.
• Train kinase-specific classifiers (PSSMPhos(SVM), PSSMPhos(NB), PSSMPhos(C4.5)) on the PSSM-based representation.

Phospho.ELM Data Set

Phospho.ELM is a resource containing 1805 proteins from different species, covering 1372 Tyr, 3175 Ser, and 767 Thr experimentally verified phosphorylation sites manually curated from the literature.
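The two window encodings described above (the 20*n binary vector and the n+1 PSSM-based vector) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the handling of residues near the ends of a sequence, the treatment of non-standard residues, and the dictionary layout of the PSSM are all assumptions made for illustration.

```python
# Illustrative sketch only: all names and conventions here are hypothetical,
# not taken from the poster.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def extract_windows(sequence, n=15, targets="ST"):
    """Yield (position, window) for each Ser/Thr that has a full n-residue window.

    Assumption: residues too close to either end of the sequence are skipped;
    the poster does not say how such boundary cases were handled.
    """
    half = n // 2
    for pos, residue in enumerate(sequence):
        if residue in targets and half <= pos < len(sequence) - half:
            yield pos, sequence[pos - half : pos + half + 1]


def binary_encode(window):
    """Encode a window as a 20*n binary vector (one 20-slot block per position)."""
    vector = [0] * (len(AMINO_ACIDS) * len(window))
    for i, aa in enumerate(window):
        # Assumption: a non-standard residue leaves its 20-slot block all zero.
        if aa in AA_INDEX:
            vector[i * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1
    return vector


def pssm_encode(window, pssm):
    """Encode a window as <e_1(x_1), ..., e_n(x_n), Score(x)> (length n+1).

    `pssm` is assumed to be a list of n dicts mapping amino acid -> score
    for that position. Residues missing from a column score 0.0 (assumption).
    """
    scores = [pssm[i].get(aa, 0.0) for i, aa in enumerate(window)]
    return scores + [sum(scores)]
```

For n=15, `binary_encode` yields 300 features per site while `pssm_encode` yields 16, which is one way to read the "fast and simple classifiers" remark: the PSSM-based representation gives the downstream SVM, Naïve Bayes, or C4.5 learner a much smaller input.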
We constructed separate data sets for kinase families that are well represented in terms of the data available in the database (i.e., they are known to recognize more than 50 phosphorylation sites) (see Table 1).

Fig. 3: Comparison of ROC curves for BasicPSSM and BasicHMM for the six kinase families considered
Table 1: Kinase families considered in our study and the number of Ser and Thr sites known to be phosphorylated
Table 2: Prediction accuracy of different methods using 5-fold cross-validation test

Acknowledgements: This work is supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM 066387) to Vasant Honavar.