Machine Learning Algorithms for Protein Structure Prediction

Machine Learning Algorithms for Protein Structure Prediction. Jianlin Cheng Institute for Genomics and Bioinformatics School of Information and Computer Sciences University of California Irvine 2006. Outline. Introduction 1D Prediction 2D Prediction (Beta-Sheet Topology)

Presentation Transcript


  1. Machine Learning Algorithms for Protein Structure Prediction Jianlin Cheng Institute for Genomics and Bioinformatics School of Information and Computer Sciences University of California Irvine 2006

  2. Outline • Introduction • 1D Prediction • 2D Prediction (Beta-Sheet Topology) • 3D Prediction (Fold Recognition) • Publications and Bioinformatics Tools

  3. Importance of Protein Structure Prediction AGCWY…… Cell Sequence Structure Function

  4. Four Levels of Protein Structure Primary Structure (a directional sequence of amino acids/residues) N C … Residue1 Residue2 Peptide bond Secondary Structure (helix, strand, coil) Alpha Helix Beta Strand / Sheet Coil

  5. Four Levels of Protein Structure Tertiary Structure Quaternary Structure (complex) G Protein Complex

  6. 1D: Secondary Structure Prediction MWLKKFGINLLIGQSV… Helix Neural Networks + Alignments Coil CCCCHHHHHCCCSSSSS… Accuracy: 78% Strand Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

  7. 1D: Solvent Accessibility Prediction Exposed MWLKKFGINLLIGQSV… Neural Networks + Alignments eeeeeeebbbbbbbbeeeebbb… Accuracy: 79% Buried Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

  8. 1D: Disordered Region Prediction Using Neural Networks MWLKKFGINLLIGQSV… Disordered Region 1D-RNN OOOOODDDDOOOOO… 93% TP at 5% FP Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005

  9. 1D: Protein Domain Prediction Using Neural Networks MWLKKFGINLLIGQSV… Boundary + SS and SA 1D-RNN NNNNNNNBBBBBNNNN… Inference/Cut HIV capsid protein Domain 1 Domain 2 Domains Top ab-initio domain predictor in CAFASP4 Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.

  10. 1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine Correlation = 0.76 • First method to predict energy changes from sequence accurately • Useful for protein engineering, protein design, and mutagenesis analysis Support Vector Machine …MWLAVFILINLK… Cheng, Randall, and Baldi. Proteins, 2006

  11. 2D: Contact Map Prediction 3D Structure, 2D Contact Map (n × n matrix indexed by residues i and j) Distance Threshold = 8 Å Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
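
As a minimal sketch of how a contact map follows from a 3D structure with the 8 Å threshold on this slide: the code below marks residue pairs whose representative atoms lie closer than the threshold. The coordinates are made up, and using one point per residue (for example the C-alpha atom) is an assumption, since the slide does not say which atoms are measured.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when residues i and j
    lie closer than `threshold` angstroms to each other."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Hypothetical coordinates, one point per residue, spaced 3.8 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
```

The resulting matrix is symmetric, so a predictor only needs to fill in one triangle.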

  12. 2D: Disulfide Bond Prediction Cysteine i Support Vector Machine yes 2D-RNN Disulfide Bond Graph Matching Cysteine j [1] Baldi, Cheng, Vullo. NIPS, 2004. [2] Cheng, Saigo, Baldi. Proteins, 2005

  13. 2D: Prediction of Beta-Sheet Topology N terminus • Ab-Initio Structure Prediction • Fold Recognition • Protein Design • Protein Folding Beta Sheet Beta Strand Cheng and Baldi, Bioinformatics, 2005 C terminus Beta Residue Pair

  14. An Example of Beta-Sheet Topology Level 1 4 5 2 1 3 6 7 Structure of Protein 1VJG Beta Sheets

  15. An Example of Beta-Sheet Topology Level 1 Level 2 4 5 Antiparallel 2 1 3 6 7 Parallel Strand Strand Pair Strand Alignment Pairing Direction Structure of Protein 1VJG Beta Sheets

  16. An Example of Beta-Sheet Topology Level 1 Level 2 Level 3 4 5 Antiparallel H-bond 2 1 3 6 7 Parallel Strand Strand Pair Strand Alignment Pairing Direction Structure of Protein 1VJG Beta Sheets Beta Residue Residue Pair

  17. Three-Stage Prediction of Beta-Sheets • Stage 1 Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003) • Stage 2 Use beta-residue pairing probabilities to align beta-strands • Stage 3 Predict beta-strand pairs and beta-sheet topology using graph algorithms

  18. Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks Input Matrix I (m×m) Output / Target Matrix (m×m) Iij 2D-RNN O = f(I) (i,j) i j Oij: Pairing Prob. Tij: 0/1 …AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK…. 20 for Residues 3 SS 2 SA
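
The "20 for Residues, 3 SS, 2 SA" breakdown can be illustrated with a simplified encoding. This is a sketch under stated assumptions, not the network's actual input: it uses one-hot vectors, the state alphabets `HEC` and `eb` are assumed, and building each entry I_ij by concatenating the features of residues i and j is a hypothetical simplification.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue types
SS_STATES = "HEC"                     # assumed labels: helix, strand, coil
SA_STATES = "eb"                      # assumed labels: exposed, buried

def one_hot(symbol, alphabet):
    """Encode one symbol as a 0/1 vector over the given alphabet."""
    return [1.0 if symbol == a else 0.0 for a in alphabet]

def residue_features(aa, ss, sa):
    """25 values per residue: 20 (residue) + 3 (SS) + 2 (SA)."""
    return one_hot(aa, AMINO_ACIDS) + one_hot(ss, SS_STATES) + one_hot(sa, SA_STATES)

def input_matrix(seq, ss, sa):
    """Build an m x m matrix whose (i, j) entry joins the features of i and j."""
    per_res = [residue_features(a, s, x) for a, s, x in zip(seq, ss, sa)]
    m = len(per_res)
    return [[per_res[i] + per_res[j] for j in range(m)] for i in range(m)]

# Hypothetical three-residue fragment with assumed SS and SA labels.
I = input_matrix("AHY", "CEE", "ebb")
```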

  19. An Example (Target) 1 2 3 4 5 6 7 Protein 1VJG Beta-Residue Pairing Map (Target Matrix)

  20. An Example (Target) 1 2 3 4 5 6 7 Antiparallel Parallel Protein 1VJG Beta-Residue Pairing Map (Target Matrix)

  21. An Example (Prediction)

  22. Stage 2: Beta-Strand Alignment Antiparallel • Use output probability matrix as scoring matrix • Dynamic programming • Disallow gaps and use the simplified search algorithm Parallel Total number of alignments = 2(m+n-1)

  23. Strand Alignment and Pairing Matrix • The alignment score is the sum of the pairing probabilities of the aligned residues • The best alignment is the alignment with the maximum score • Strand Pairing Matrix Strand Pairing Matrix of 1VJG
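
Stage 2 can be sketched as an exhaustive search over gapless alignments. For strands of length m and n there are m+n-1 relative shifts in each of the two directions, which gives the 2(m+n-1) count on the previous slide. The residue indices and probabilities below are hypothetical, and storing the pairing probabilities in a dictionary keyed by residue pair is an assumption made for brevity.

```python
def gapless_alignments(strand_a, strand_b):
    """Yield (direction, pairs) for every gapless alignment of two strands."""
    m, n = len(strand_a), len(strand_b)
    for direction, b in (("parallel", strand_b), ("antiparallel", strand_b[::-1])):
        for shift in range(-(m - 1), n):  # m + n - 1 shifts per direction
            pairs = [(strand_a[k], b[k + shift])
                     for k in range(m) if 0 <= k + shift < n]
            yield direction, pairs

def best_alignment(strand_a, strand_b, prob):
    """Return (score, direction, pairs) with the largest summed probability."""
    best = None
    for direction, pairs in gapless_alignments(strand_a, strand_b):
        score = sum(prob.get(pair, 0.0) for pair in pairs)
        if best is None or score > best[0]:
            best = (score, direction, pairs)
    return best

# Hypothetical strands (residue indices) and pairing probabilities.
strand_a = [10, 11, 12]
strand_b = [20, 21, 22, 23]
prob = {(10, 23): 0.9, (11, 22): 0.8, (12, 21): 0.7}
score, direction, pairs = best_alignment(strand_a, strand_b, prob)
```

The best score for each strand pair is what would fill one cell of the strand pairing matrix.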

  24. Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology (a) Seven strands of protein 1VJG in sequence order (b) Beta-sheet topology of protein 1VJG

  25. Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) (a) Complete SPG Strand Pairing Matrix

  26. Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) (b) True Weighted SPG (a) Complete SPG Strand Pairing Matrix Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints Algorithm: Minimum Spanning Tree Like Algorithm
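
A rough sketch of the greedy idea behind an MST-like selection: take strand pairs in order of decreasing alignment score and accept a pair only if it satisfies the constraints. The slide does not list the constraints, so the two used here (at most two partners per strand, and a minimum score) are assumptions, and the scores are invented for illustration rather than taken from the 1VJG matrix.

```python
def greedy_strand_pairing(scores, max_partners=2, min_score=0.0):
    """Accept strand pairs from highest to lowest score under the constraints."""
    partners = {}
    chosen = []
    for (a, b), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s <= min_score:
            break  # all remaining pairs score too low
        if (len(partners.get(a, ())) >= max_partners
                or len(partners.get(b, ())) >= max_partners):
            continue  # one strand already has its maximum number of partners
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
        chosen.append((a, b))
    return chosen

def sheets(strands, chosen):
    """Group strands into connected components (one component per sheet)."""
    parent = {s: s for s in strands}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in chosen:
        parent[find(a)] = find(b)
    groups = {}
    for s in strands:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())

# Hypothetical alignment scores for seven strands.
scores = {(4, 5): 3.0, (1, 2): 2.5, (1, 3): 2.2, (3, 6): 2.0, (6, 7): 1.8,
          (2, 3): 0.5, (5, 6): 0.4, (2, 4): 0.3}
chosen = greedy_strand_pairing(scores, min_score=1.0)
found_sheets = sheets(range(1, 8), chosen)
```

With these made-up scores the accepted pairs come out in the same order as the worked example on the following slides, and the strands fall into two sheets.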

  27. An Example of MST Like Algorithm (Strand Pairing Matrix of 1VJG, strands 1 to 7) Step 1: Pair strands 4 and 5. Sheets so far: 4 5

  28. An Example of MST Like Algorithm (Strand Pairing Matrix of 1VJG, strands 1 to 7) Step 2: Pair strands 1 and 2. Sheets so far: 4 5; 2 1

  29. An Example of MST Like Algorithm (Strand Pairing Matrix of 1VJG, strands 1 to 7) Step 3: Pair strands 1 and 3. Sheets so far: 4 5; 2 1 3

  30. An Example of MST Like Algorithm (Strand Pairing Matrix of 1VJG, strands 1 to 7) Step 4: Pair strands 3 and 6. Sheets so far: 4 5; 2 1 3 6

  31. An Example of MST Like Algorithm (Strand Pairing Matrix of 1VJG, strands 1 to 7) Step 5: Pair strands 6 and 7. Sheets so far: 4 5; 2 1 3 6 7 (N terminus to C terminus)

  32. 1.Beta Residue Pairing 2. Beta Strand Alignment 3. Beta Strand Pairing

  33. 3D Structure Prediction MWLKKFGINLLIGQSV… • Ab-Initio Structure Prediction Simulation …… Physical force field – protein folding Contact map - reconstruction Select structure with minimum free energy • Template-Based Structure Prediction Query protein Fold MWLKKFGINKH… Recognition Alignment Template Protein Data Bank

  34. A Machine Learning Information Retrieval Framework for Fold Recognition Fold Recognition Cheng and Baldi, Bioinformatics, 2006 Query Protein Alignment MWLKKFGIN…… Template Protein Data Bank Machine Learning Ranking

  35. Classic Fold Recognition Approaches Sequence - Sequence Alignment (Needleman and Wunsch, 1970. Smith and Waterman, 1981) Query ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL Template ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL Alignment (similarity) score Works for >40% sequence identity (Close homologs in protein family)
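
For readers unfamiliar with the global alignment cited here, a compact Needleman-Wunsch sketch follows. The match, mismatch, and gap values are illustrative assumptions; real protein alignment would normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, a_aln, b_aln)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy sequences, not the query and template shown on the slide.
nw_score, aln_a, aln_b = needleman_wunsch("ACGT", "AGT")
```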

  36. Classic Fold Recognition Approaches Profile - Sequence Alignment (Altschul et al., 1997) ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL Query Family Average Score Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN More sensitive for distant homologs in superfamily. (> 25% identity)

  37. Classic Fold Recognition Approaches Profile - Sequence Alignment (Altschul et al., 1997) 12………………………………….………………n ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL Query Family Position Specific Scoring Matrix Or Hidden Markov Model Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN More sensitive for distant homologs in superfamily. (> 25% identity)

  38. Classic Fold Recognition Approaches Profile - Profile Alignment (Rychlewski et al., 2000) ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL Query Family Template Family ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM More sensitive for very distant homologs. (> 15% identity)

  39. Classic Fold Recognition Approaches Sequence - Structure Alignment (Threading) (Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994) Fit Query Fitness Score MWLKKFGINLLIGQS…. Template Structure Useful for recognizing similar folds without sequence similarity. (no evolutionary relationship)

  40. Integration of Complementary Approaches FR Server1 Query Meta Server FR server2 Consensus (Lundstrom et al.,2001. Fischer, 2003) FR server3 Internet • Reliability depends on availability of external servers • Make decisions on a handful of candidates

  41. Machine Learning Classification Approach Support Vector Machine (SVM) Class 1 Class 2 Proteins Class m Classify individual proteins to several or dozens of structure classes (Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004) Problem 1: can’t scale up to thousands of protein classes Problem 2: doesn’t provide templates for structure modeling

  42. Machine Learning Information Retrieval Framework Query-Template Pair Score 1 Relevance Function (e.g., SVM) + Score 2 Rank . . . - Score n • Extract pairwise features • Comparison of two pairs (four proteins) • Relevant or not (one score) vs. many classes • Ranking of templates (retrieval)
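
The retrieval step amounts to scoring every query-template pair with the relevance function and sorting. In this sketch the relevance function is a stand-in lookup table with invented scores; in the framework described here it would be the trained model applied to the pair's features.

```python
def rank_templates(query, templates, relevance):
    """Sort templates by decreasing relevance score for one query."""
    return sorted(templates, key=lambda t: relevance(query, t), reverse=True)

# Hypothetical relevance scores standing in for a trained model's output.
toy_scores = {("q1", "tA"): -0.4, ("q1", "tB"): 1.3, ("q1", "tC"): 0.2}
ranking = rank_templates("q1", ["tA", "tB", "tC"], lambda q, t: toy_scores[(q, t)])
top_five = ranking[:5]  # candidates passed on for structure modeling
```

Because each pair gets a single relevant-or-not score, adding a new template only requires scoring more pairs, not retraining a many-class classifier.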

  43. Pairwise Feature Extraction • Sequence / Family Information Features: cosine, correlation, and Gaussian kernel • Sequence – Sequence Alignment Features: Palign, ClustalW • Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST • Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM • Structural Features: secondary structure, solvent accessibility, contact map, beta-sheet topology
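
The first feature group (cosine, correlation, Gaussian kernel) can be illustrated on amino acid composition vectors. Which vectors these measures are applied to is an assumption here; composition is used only because it is the simplest fixed-length description of a sequence. The second sequence and the `gamma` value are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 amino acids in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def gaussian_kernel(x, y, gamma=1.0):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

query = composition("MWLKKFGINLLIGQSV")
template = composition("MWLAVFILINLKGQSV")  # hypothetical template sequence
features = [cosine(query, template),
            correlation(query, template),
            gaussian_kernel(query, template)]
```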

  44. Pairwise Feature Extraction

  45. Relevance Function: Support Vector Machine Learning Feature Space Positive Pairs (Same Folds) Support Vector Machine Negative Pairs (Different Folds) Training/Learning Hyperplane Training Data Set

  46. Relevance Function: Support Vector Machine Learning Margin (1), Margin (2) $f(x) = \sum_i \alpha_i y_i K(x, x_i) + b$, where K is the Gaussian kernel: $K(x, y) = \exp(-\gamma \|x - y\|^2)$
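
A small numeric sketch of a kernel SVM decision value, using the standard form in which a weighted sum of kernel evaluations against the support vectors is added to a bias. The support vectors, weights, bias, and `gamma` below are invented for illustration; a trained model would supply them.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision_value(x, support_vectors, alphas, labels, bias, gamma=0.5):
    """Weighted sum of kernel values against the support vectors, plus bias."""
    return sum(alpha * y * gaussian_kernel(x, sv, gamma)
               for sv, alpha, y in zip(support_vectors, alphas, labels)) + bias

# Hypothetical model: one positive and one negative support vector.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
bias = 0.0

near_positive = decision_value((0.9, 1.1), support_vectors, alphas, labels, bias)
midpoint = decision_value((0.0, 0.0), support_vectors, alphas, labels, bias)
```

A positive value places the pair on the "same fold" side of the hyperplane, and the magnitude can serve as the ranking score.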

  47. Training and Cross-Validation • Standard benchmark (Lindahl’s dataset, 976 proteins) • 976 x 975 query-template pairs (about 7,468 positives) • Each query (1 to 976) contributes 975 pairs • Train / Learn (90%: queries 1-878) • Test (10%: queries 879-976): rank 975 templates for each query
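
The split on this slide is made by query, so all 975 pairs of a given query land on the same side. The sketch below reproduces the pair counts for the fold shown (queries 1-878 for training, 879-976 for testing); the function name and the use of integer identifiers for proteins are assumptions for illustration.

```python
def query_split(n_proteins=976, n_train_queries=878):
    """Build (query, template) pairs and split them by query index."""
    train, test = [], []
    for q in range(1, n_proteins + 1):
        pairs = [(q, t) for t in range(1, n_proteins + 1) if t != q]
        if q <= n_train_queries:
            train.extend(pairs)
        else:
            test.extend(pairs)
    return train, test

train_pairs, test_pairs = query_split()
total_pairs = len(train_pairs) + len(test_pairs)  # 976 * 975
```

Splitting by query rather than by pair avoids ranking templates for a query whose other pairs were seen during training.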

  48. Results for Top Five Ranked Templates • Family: close homologs, more identity • Superfamily: distant homologs, less identity • Fold: no evolutionary relation, no identity

  49. Specificity-Sensitivity Plot (Family)

  50. Specificity-Sensitivity Plot (Superfamily)
