Optimization of SVM parameters in caspase cleavage sites prediction using grid-computing Lawrence Wee
What are caspases?
Caspases are downstream effectors in apoptosis. [1]
[Figure: extrinsic and intrinsic apoptosis pathways]
As the final effectors of apoptosis, caspases cleave many protein substrates.
1. Hengartner MO. The biochemistry of apoptosis. Nature. 2000 Oct 12;407(6805):770-6.
Caspases are proteases: Caspase Cleavage of Substrates [1]
• Caspases are cysteine proteases.
• They recognize a tetrapeptide sequence on substrates (P4-P3-P2-P1).
Position: P4  P3  P2  P1  |  P1’  P2’
Residue:  D   E   V   D   |  T    Y
• They cleave after the canonical Asp (D) residue at the P1 position, that is, between P1 and P1’.
• 1. Fuentes-Prior et al. Biochem J. 2004 Dec 1;384(Pt 2):201-32.
• 2. Thornberry et al. J Biol Chem. 1997 Jul 18;272(29):17907-11.
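The P4-P1 / P1’-P2’ layout above can be sketched in code. This is a minimal illustration, assuming every Asp (D) with enough flanking residues is treated as a candidate P1 position; the function name `candidate_sites` and the example sequence are hypothetical and do not come from the slides.

```python
def candidate_sites(sequence, upstream=4, downstream=2):
    """Return (p1_index, window) for every Asp (D) that could sit at P1.

    Each window spans P4..P1 followed by P1'..P2', so the scissile bond
    lies between window[upstream - 1] and window[upstream].
    """
    sites = []
    for i, residue in enumerate(sequence):
        if residue != "D":
            continue
        start = i - (upstream - 1)   # index of P4
        end = i + downstream + 1     # one past P2'
        if start < 0 or end > len(sequence):
            continue                 # window would run off the sequence
        sites.append((i, sequence[start:end]))
    return sites


# Hypothetical sequence containing the DEVD motif shown on the slide.
example = "MKGDEVDTYLA"
for p1_index, window in candidate_sites(example):
    print(p1_index, window[:4], "|", window[4:])
```

Not every Asp is a real cleavage site, which is why a classifier is needed to rank the candidate windows.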
Caspases are proteases: The Enormous Range of Caspase Substrates [1]
[Figure: caspases and the classes of caspase substrates]
• Apoptotic regulators
• Cytoskeletal proteins
• Organelle proteins
• DNA-associated proteins
• RNA-associated proteins
• Cell signaling proteins
• Cell cycle proteins
• Viral proteins
• Other proteins (?)
More than 400 caspase substrates have been experimentally determined to date. [1] Many more await discovery.
1. Wee LJ, Tong JC, Tan TW, Ranganathan S. A multi-factor model for caspase degradome prediction. BMC Genomics. 2009, 10:S6.
Computational prediction of caspase cleavage sites
• Identifying caspase substrates is important for elucidating the biological function of caspases.
• It refines our understanding of apoptotic and other caspase-dependent signaling pathways.
• Wet-laboratory efforts can be laborious.
• Consider computational prediction of caspase cleavage sites?
Support Vector Machines (SVM)
• A type of machine learning algorithm.
• Works very well for several biological problems.
• Can be computationally expensive when the data are high-dimensional or when many parameter values must be tried during optimization.
Prediction of caspase cleavage sites: Support Vector Machines: A Brief Introduction [1]
Data-points belonging to 2 distinct classes are represented as vectors.
[Figure: training data-points from two classes, green and orange]
A set of “learning” or “training” data-points belongs to 2 classes (green and orange). Each data-point has its own set of attributes, represented as a vector.
1. Cortes, C. and Vapnik, V. (1995) Support-vector networks. Machine Learning, 20, 273–297.
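To make "attributes represented as a vector" concrete, here is a minimal sketch that turns a peptide window into a numeric vector. The slides do not say which encoding was used; one-hot (binary) encoding over the 20 standard amino acids is an assumption made here for illustration, and `AMINO_ACIDS` and `one_hot` are hypothetical names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(window):
    """Encode a peptide window as a flat list of 0/1 values.

    Each residue contributes 20 values with a single 1 at its position in
    AMINO_ACIDS; residues outside the alphabet stay all-zero.
    """
    vector = []
    for residue in window:
        bits = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            bits[index] = 1
        vector.extend(bits)
    return vector


# The six-residue window P4..P2' from the DEVD example.
vector = one_hot("DEVDTY")
print(len(vector), sum(vector))
```

A six-residue window therefore becomes a point in a 120-dimensional space, which is the kind of input an SVM operates on.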
Prediction of caspase cleavage sites: Support Vector Machines: A Brief Introduction [1]
The SVM algorithm constructs a “classifier” to discriminate the two classes. The classifier is a maximal margin hyperplane that separates the two classes (green and orange).
[Figure: maximal margin hyperplane and support vectors]
1. Cortes, C. and Vapnik, V. (1995) Support-vector networks. Machine Learning, 20, 273–297.
Prediction of caspase cleavage sites: SVM: A Brief Introduction [1]
The SVM algorithm classifies new, unseen data into one of two classes. The classifier assigns a new data-point to one of the two classes according to which side of the hyperplane it falls on.
[Figure: new data-point assigned to the “orange” class]
1. Cortes, C. and Vapnik, V. (1995) Support-vector networks. Machine Learning, 20, 273–297.
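For the linear case, "which side of the hyperplane" reduces to the sign of w·x + b. The sketch below assumes an already trained hyperplane; the weights, bias, and test points are hypothetical values chosen only to illustrate the rule, and mapping a non-negative score to "orange" is an arbitrary choice.

```python
def classify(point, weights, bias):
    """Assign a point to a class from the sign of w.x + b."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "orange" if score >= 0 else "green"


# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
weights, bias = (1.0, 1.0), -3.0
print(classify((2.5, 2.0), weights, bias))
print(classify((0.5, 1.0), weights, bias))
```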
Prediction of caspase cleavage sites: SVM: A Brief Introduction [1]
[Equation: SVM decision function with RBF kernel]
2 parameters: C and gamma.
1. Cortes, C. and Vapnik, V. (1995) Support-vector networks. Machine Learning, 20, 273–297.
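For reference, the standard textbook form of the soft-margin SVM decision function with an RBF kernel is given below. The symbols follow common convention and may differ from the notation on the original slide.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left( -\gamma \, \lVert \mathbf{x}_i - \mathbf{x} \rVert^{2} \right),
\qquad
0 \le \alpha_i \le C
```

Here the x_i are the training vectors with labels y_i in {-1, +1}, the alpha_i and b are learned during training, gamma sets the width of the RBF kernel, and C bounds the alpha_i and so controls how heavily margin violations are penalized. C and gamma are the two values that must be chosen before training.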
Prediction of caspase cleavage sites: Computational issues
[Figure: training dataset (390 sequences) → leave-one-out cross-validation → SVM classifier]
Predicting caspase cleavage sites: Computational issues
Leave-one-out cross-validation for a set of C and gamma values:
[Figure: a training set of 5 sequences (Seq 1 to Seq 5) is split into Set 1 to Set 5, leading to a trained classifier]
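The splitting step of leave-one-out cross-validation can be sketched as follows. This is a minimal illustration using the five-sequence example from the slide; the function name `leave_one_out` is hypothetical, and training and scoring the classifier on each split are left out.

```python
def leave_one_out(items):
    """Yield (training, held_out) pairs, holding out each item exactly once."""
    for i in range(len(items)):
        held_out = items[i]
        training = items[:i] + items[i + 1:]
        yield training, held_out


sequences = ["Seq 1", "Seq 2", "Seq 3", "Seq 4", "Seq 5"]
for n, (training, held_out) in enumerate(leave_one_out(sequences), start=1):
    print(f"Set {n}: train on {training}, test on {held_out}")
```

With N sequences this yields N splits, so a 390-sequence dataset requires 390 separate training runs for every choice of C and gamma.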
Prediction of caspase cleavage sites: Computational issues
[Figure: training dataset (390 sequences) → leave-one-out cross-validation → SVM classifier]
For C = 0.1, gamma = 0.1: accuracy = 70%.
Prediction of caspase cleavage sites: Grid-based (brute force) optimization of SVM parameters
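A minimal sketch of grid-based (brute force) parameter search: every combination of C and gamma is evaluated and the most accurate one is kept. The grid values and the `toy_accuracy` function are hypothetical stand-ins; in the real workflow the evaluation step would be a full leave-one-out cross-validation of the SVM, and the slides do not give the actual grid.

```python
import math
from itertools import product


def grid_search(c_values, gamma_values, evaluate):
    """Try every (C, gamma) pair and return (best_accuracy, C, gamma)."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        accuracy = evaluate(c, gamma)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, gamma)
    return best


def toy_accuracy(c, gamma):
    """Placeholder score, NOT real results: peaks at C = 10, gamma = 0.01."""
    return 0.9 - 0.05 * (abs(math.log10(c) - 1) + abs(math.log10(gamma) + 2))


# Hypothetical logarithmic grid; 5 x 5 = 25 evaluations.
C_VALUES = [0.01, 0.1, 1, 10, 100]
GAMMA_VALUES = [0.0001, 0.001, 0.01, 0.1, 1]

accuracy, c, gamma = grid_search(C_VALUES, GAMMA_VALUES, toy_accuracy)
print(f"best C={c}, gamma={gamma}, accuracy={accuracy:.2f}")
```

Each grid point is independent of the others, which is what makes this search a natural fit for distributing across many machines.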
Two Computational Issues
1. Leave-one-out cross-validation is computationally expensive. With a dataset of 390 training examples, one leave-one-out cross-validation run for a single pair of C and gamma values takes ~12 s on an Intel 2.66 GHz Core 2 Duo processor with 4 GB RAM.
Challenge: How fast will grid computers complete the same computation?
Two Computational Issues
2. Brute-force optimization is computationally expensive.
Challenge: How fast will grid computers complete the same computation when it is repeated 100 times with different sets of C and gamma values?
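A back-of-the-envelope estimate of the challenge, using the ~12 s per run quoted earlier and the 100 parameter sets mentioned here. The model is idealized: it assumes all runs take equal time and ignores job submission, data transfer, and queueing overhead, so real grid timings would be higher. The worker counts and the function name `wall_clock_seconds` are hypothetical.

```python
import math


def wall_clock_seconds(n_settings, seconds_per_setting, n_workers):
    """Ideal wall-clock time when independent runs are spread over workers."""
    rounds = math.ceil(n_settings / n_workers)
    return rounds * seconds_per_setting


# 100 (C, gamma) settings at about 12 s each.
for workers in (1, 10, 100):
    seconds = wall_clock_seconds(100, 12, workers)
    print(f"{workers:>3} workers: {seconds} s")
```

On a single machine the 100 runs take about 1200 s (20 minutes); in the ideal case with one worker per setting the wall-clock time approaches that of a single run.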