Glycosylation Site Prediction using Machine Learning Approaches

Cornelia Caragea, Jivko Sinapov, Adrian Silvescu, Drena Dobbs and Vasant Honavar

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science
Rocky 2006

Biological Motivation

Glycosylation is one of the most complex post-translational modifications (PTMs). It is the site-specific enzymatic addition of saccharides to proteins and lipids. Most proteins in eukaryotic cells undergo glycosylation.

[Figure: glycosylation type diagram. Branch labels: N-linked glycosylation, O-linked glycosylation, GPI anchor, C-mannosylation. Sugar labels: N-acetylglucosamine (N-GlcNAc); C-mannose; O-mannose, O-xylose, O-N-acetylglucosamine (O-GlcNAc), O-glucose, O-N-acetylgalactosamine (O-GalNAc), O-hexose, O-fucose.]

[Figure: example protein sequence drawn from H3N+ to COO- (residues as extracted: M L I L K T I F L R P S C S L L L T S Q Q E I D ... S E), annotated with the questions: Glycosylated? Non-Glycosylated? N-linked? O-linked? C-linked?]

[Figure: ensemble training diagram. Labels: Train DB, Sampling, train (five boxes), Bag of Trained Classifiers, Test DB, Predictions, Weighted Majority Vote.]
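The ensemble diagram (Train DB, Sampling, several "train" boxes, Bag of Trained Classifiers, Weighted Majority Vote) can be illustrated with a minimal Python sketch. The poster does not state how samples are drawn or how votes are weighted, so the choices here are assumptions made for illustration only: bootstrap sampling with replacement, five models, and each model's accuracy on the full training set as its vote weight. All function names are hypothetical, and the small Naive Bayes learner merely stands in for the SVM, J48, and Naive Bayes classifiers named on the poster.

```python
import math
import random
from collections import Counter, defaultdict


def bootstrap_sample(examples, rng):
    """Draw len(examples) items with replacement (assumed 'Sampling' step)."""
    return [rng.choice(examples) for _ in examples]


def weighted_majority_vote(predictions, weights):
    """Return the label backed by the largest total classifier weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # sorted() makes tie-breaking deterministic (first label in sort order).
    return max(sorted(totals), key=totals.get)


def train_naive_bayes(examples):
    """Stand-in base learner: per-position categorical Naive Bayes.

    examples: list of (window_string, label). Returns predict(window).
    """
    label_counts = Counter(label for _, label in examples)
    residue_counts = defaultdict(Counter)  # (label, position) -> residue counts
    for window, label in examples:
        for pos, residue in enumerate(window):
            residue_counts[(label, pos)][residue] += 1

    def predict(window):
        def log_score(label):
            score = math.log(label_counts[label])  # unnormalized class prior
            for pos, residue in enumerate(window):
                seen = residue_counts[(label, pos)][residue]
                # Laplace smoothing over 20 amino acids plus one padding symbol.
                score += math.log((seen + 1) / (label_counts[label] + 21))
            return score
        return max(sorted(label_counts), key=log_score)

    return predict


def train_ensemble(examples, train_fn, n_models=5, seed=0):
    """Train n_models on bootstrap samples; weight each by training accuracy."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        model = train_fn(bootstrap_sample(examples, rng))
        accuracy = sum(model(x) == y for x, y in examples) / len(examples)
        ensemble.append((model, accuracy))
    return ensemble


def ensemble_predict(ensemble, window):
    """Combine the bag's predictions for one window by weighted majority vote."""
    predictions = [model(window) for model, _ in ensemble]
    weights = [weight for _, weight in ensemble]
    return weighted_majority_vote(predictions, weights)
```

With weights 0.9, 0.3, 0.3 and votes "+", "-", "-", the single high-weight classifier wins (0.9 against 0.6). An unweighted majority vote would have returned "-", which is the practical difference weighting makes.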
Problem: Predict glycosylation sites from amino acid sequence

Previous Approaches
• Trained Neural Networks used in netOglyc prediction server (Hansen et al., 1995)
  • Dataset: mucin type O-linked glycosylation sites in mammalian proteins
• Trained SVMs based on physical properties, 0/1 system and a combination of these two (Li et al., 2006)
  • Dataset: mucin type O-linked glycosylation sites in mammalian proteins
  • Negative examples extracted from sequences with no known glycosylated sites
  • Trained/tested using different ratios of positive and negative sites

Our Approach
• We investigate 3 types of glycosylation and use an ensemble classifier approach
• Dataset: N-, C- and O-linked glycosylation sites in proteins from several different species: human, rat, mouse, insect, worm, horse, etc.
• Negative examples extracted from sequences with at least one experimentally verified glycosylated site

Dataset
O-GlycBase v6.00: O-, N- & C-glycosylated proteins with 242 glycosylated entries, available at http://www.cbs.dtu.dk/databases/OGLYCBASE/Oglyc.base.html

Classifiers
• SVM
  • 0/1 String Kernel
  • Substitution Matrix Kernel
  • Blast - Polynomial Kernel
• J48
• Naïve Bayes
• Identity windows
• Identity plus additional information

Results
[Figure panels, titles only: Types of Glycosylation; Training an ensemble classifier; ROC Curves for N-Linked; ROC Curves for O-Linked; ROC Curves for C-Linked; Comparison of ROC Curves for single and ensemble classifier.]

Conclusion
In this work we addressed the problem of predicting glycosylation sites. Three types of machine learning algorithms were used: support vector machines (SVM), Naïve Bayes (NB), and decision trees (DT). We built predictive ensemble classifiers based on data corresponding to three forms of glycosylation: O-, N-, and C-linked. Our experiments show encouraging results.

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs
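Sketch: identity windows and the 0/1 encoding

The poster lists "Identity windows" and a "0/1" representation but does not define them. One common reading, assumed here, is a fixed-length window of residue identities centred on each candidate site, optionally expanded into 20 bits per position. The flank size, padding symbol, and function names below are illustrative assumptions, not the authors' settings. Candidate residues follow standard biochemistry: Ser/Thr for O-linked, Asn for N-linked, and Trp for C-mannosylation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed symbol for positions that run past either terminus


def identity_windows(sequence, targets="ST", flank=3):
    """Yield (position, window) for each candidate residue in the sequence.

    targets: residues treated as candidate sites, e.g. "ST" for O-linked,
             "N" for N-linked, "W" for C-mannosylation.
    flank:   residues kept on each side, so windows have 2*flank + 1 symbols.
    """
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue in targets:
            # In padded coordinates the site sits at i + flank, so the slice
            # starting at i is centred on it.
            yield i, padded[i:i + 2 * flank + 1]


def one_hot(window):
    """0/1 encoding: 20 bits per position; padding encodes as all zeros."""
    bits = []
    for residue in window:
        bits.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return bits
```

For the short string "MKSAT" with flank=2, the Ser at index 2 yields the window "MKSAT" and the Thr at index 4 yields "SAT--". Windows like these could serve directly as categorical inputs for Naive Bayes or a decision tree, while the 0/1 vector suits a kernel method.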