450 likes | 574 Views
Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas. Computer Science and Artificial Intelligence Laboratory
E N D
Predicting Secondary Structure of All-Helical Proteins UsingHidden Markov Support Vector MachinesBlaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Workshop on Pattern Recognition in Bioinformatics – August 20, 2006
Protein Structure Prediction • Classical problem: given sequence, predict structure • High-level approaches 1. Energy-minimization (ab-initio) techniques - Elegant, but often lack correct parameters 2. Homology-based techniques - Useful, but hard to predict new proteins Sequence Structure Our approach: Use energy minimization, butlearn parameters from existing proteins
Our Framework (Training) Protein Data Bank Correct structure Amino-acid Sequence Prediction Algorithm Energy Parameters Predictedstructure LearningAlgorithm correct incorrect Done! Constraints energy(incorrect) > energy(correct)
Our Framework (Testing) Amino-acid Sequence Prediction Algorithm Energy Parameters Predictedstructure
Initial Focus: Secondary Structure • Classify each residue as alpha helix, beta strand, coil • In this paper, restrict to all-alpha proteins • Applications: • Informing tertiary structure predictors • Identification of homologous proteins • Identification of active sites (coils)
Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods HMMs
Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods Neural Networks Neural Networks HMMs
Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods Neural Networks Neural Networks SVMs HMMs
Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods Neural Networks Neural Networks SVMs HMMs HMMs
Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods 1400-2900 parameters Neural Networks Neural Networks SVMs HMMs HMMs 680 MB of support vectors 471 parameters • Exploits biochemical models • Offers biological insight
Secondary Structure Predictors Sequence + Sequence + Sequence Sequence Alignment Alignment Only Only Statistical Methods Statistical Methods 302 params 1400-2900 parameters Neural Networks Neural Networks SVMs HMMs HMMs 680 MB of support vectors 471 parameters • Exploits biochemical models • Offers biological insight
Our Framework Applied to Helix Prediction Protein Data Bank Alpha Helices Correct structure Amino-acid Sequence MNIFEMLRIDEGLHHHHHHHHH HiddenMarkov Model Prediction Algorithm Energy Parameters Support Vector Machines Predictedstructure LearningAlgorithm correct incorrect Done! Constraints energy(incorrect) > energy(correct)
Energy Parameters 302 Total
Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = 302 Total
Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) 302 Total
Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) + NM,-3 + NN,-2 + NI,-1 + NF,0 + NE,1 + NL,2 + NR,3 (N-cap) 302 Total
Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) + NM,-3 + NN,-2 + NI,-1 + NF,0 + NE,1 + NL,2 + NR,3 (N-cap) + CL,-3 + CR,-2 + CI,-1 + CD,0 + CE,1 + CG,2 + CL,3 (C-cap) 302 Total
Energy Parameters • Example: Sequence:MNIFELRIDEGL Structure: HHHHHH Energy = HF + HE + HL + HR + HI + HD (Helix) + NM,-3 + NN,-2 + NI,-1 + NF,0 + NE,1 + NL,2 + NR,3 (N-cap) + CL,-3 + CR,-2 + CI,-1 + CD,0 + CE,1 + CG,2 + CL,3 (C-cap) 302 Total
Learning the Parameters Feature Space Energy ( ) = HA*A + HG*G = w¢[A G] Legal structure Correct structure where w represents the energy parameters [HA HG] G: # of Glycines in Helices Highest energy in direction of energy parameters w A: # of Alanines in Helices
Learning the Parameters Feature Space Energy ( ) = HA*A + HG*G = w¢[A G] Legal structure Correct structure where w represents the energy parameters [HA HG] G: # of Glycines in Helices Highest energy in direction of energy parameters w w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices Separating Hyperplane A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w Separating Hyperplane A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters Legal structure Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters 7. Predict structure Legal structure Structurealready predicted Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters 7. Predict structure 8. Terminate Legal structure Structurealready predicted Correct structure Predicted structure G: # of Glycines in Helices w A: # of Alanines in Helices
Learning the Parameters Feature Space 1. Predict stucture 2. Refine parameters 3. Predict structure 4. Refine parameters 5. Predict structure 6. Refine parameters 7. Predict structure 8. Terminate Legal structure Structurealready predicted Correct structure Predicted structure G: # of Glycines in Helices Details in paper: - How to converge faster - Early termination condition - w A: # of Alanines in Helices [Tsochantaridis et al., ICML’02]
Experimental Methodology • Data set: 300 non-homologous all-alpha proteins • From EVA’s sequence-unique subset of the PDB, July 2005 • Only consider alpha helices (“H” symbol in DSSP) • Randomly split into 150 training, 150 test proteins
Results • Comparison to others • Best HMM method to date that does not utilize alignment info • Offers 3.5% (Q), 0.2% (SOV) over previous best • Lags behind neural networks; e.g., Porter overall SOV = 76.6% • However, we could likely gain 6-8% from alignment profiles • Caveats • Moving beyond all-alpha proteins, we could suffer 3% • By considering 3/10 helices, we could decrease 2% [Nguyen02] [Rost93] [Jones99]
Conclusions • Represents first step toward learning biophysical parameters for energy minimization techniques • Iterative, demand-driven learning process using SVMs • Promising results on alpha-helix prediction • 77.6% among best Q for methods without alignment info • Future work: super-secondary structure • Will predict full “contact maps” rather than 3-state labels • For beta sheets, replace HMMs by multi-tape grammars http://protein.csail.mit.edu/
Prediction Algorithm • Parameters represent energetic benefitof a given feature in a protein structure • Features are fixed, chosen by designer • Example features: • Number of prolines in an alpha helix • Number of coils shorter than 2 residues • Energy (structure) = features 2 structure Energy (feature) • Minimal-energy structure found with dynamic prog. • Idea: consider all structures, exploiting overlapping problems • Implemented as HMM using Viterbi algorithm Structure withMinimal Energy
Learning Algorithm • Constraints have form: For all incorrectly predicted structures Si, in future selection of the parameters w: Energyw (Si) > Energyw (correct structure) Constraints are linear in the energy parameters. • If feasible, could solve with linear programming • In general, solve with Support Vector Machines (SVMs) • Energy(Si) ¸ Energy (correct structure) + 1 -i (i¸ 0) • Find parameters w minimizing ½ ||w||2 + C/n i=1i n • Provides general solution using soft-margin criterion