
Prediction of Protein Structure in 1D

Prediction of Protein Structure in 1D: 2° structure, TM regions, and solvent accessibility. Topic 13. Chapter 29, Gu and Bourne, “Structural Bioinformatics”. The Truth (Information) is Out (In) There.

arnaud

Presentation Transcript


  1. Prediction of Protein Structure in 1D: 2° structure, TM regions, and solvent accessibility. Topic 13. Chapter 29, Gu and Bourne, “Structural Bioinformatics”.

  2. The Truth (Information) is Out (In) There

  3. The Truth (Information) is Out (In) There. But we’re still having a tough time finding it.

  4. Protein Secondary Structure Prediction
     Given a protein sequence (primary structure), predict its secondary structure.
       GHWIATRGQLIREAYEDYRHFSSECPFIP
       CEEEEECCCEEEEECCCHHHHHHCCCCCC
     E: β-strand, H: α-helix, C: coil
     • H: (H: α-helix, G: 3₁₀-helix, I: π-helix)
     • E: (E: β-strand, B: bridge)
     • C: (T: turn, S: bend, C: coil)
     Assumption: short stretches of residues have a propensity to adopt a certain conformation ⇒ the conformation of the central residue in a sequence fragment depends only on the flanking residues (sliding window).
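The state reduction and the sliding window on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the padding symbol `-`, the default half-width, and the fallback of unknown symbols to coil are assumptions, not something the slide specifies.

```python
# Reduction of eight structure states to the three classes listed on the slide.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def reduce_states(ss8: str) -> str:
    """Map an eight-state string to H/E/C; unknown symbols become coil (assumption)."""
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in ss8)


def sliding_windows(seq: str, half_width: int = 3, pad: str = "-"):
    """Yield (index, window) pairs, one window centred on every residue."""
    padded = pad * half_width + seq + pad * half_width
    for i in range(len(seq)):
        yield i, padded[i:i + 2 * half_width + 1]


if __name__ == "__main__":
    seq = "GHWIATRGQLIREAYEDYRHFSSECPFIP"
    for index, window in list(sliding_windows(seq))[:3]:
        print(index, window)
    print(reduce_states("HGIEBTSC"))
```

Each window is what a window-based predictor would see when asked for the state of its central residue.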

  5. Why secondary structure prediction?
     • Because we can (kind of).
     • Because it could be a first step towards prediction of protein tertiary structure.
     “Have solution, need problem.” Nearly every imaginable algorithm has been applied to secondary structure prediction.

  6. Secondary Structure Prediction Methods
     1. First generation: single amino acid propensities
        • Chou-Fasman method (1974), GOR I-IV
        • ~56-60% accuracy
     2. Second generation: segments of 3-51 adjacent residues
        • NNSSP, SSPAL
        • ~65% accuracy
     3. Neural network
        • PHD, Psi-Pred, J-Pred
     4. Support vector machine (SVM)
     5. Hidden Markov Models (HMM)
     Third generation methods using evolutionary information: ~76% accuracy

  7. Secondary Structure Prediction Accuracy
     1. Three-state per-residue prediction accuracy: $Q_3 = 100 \times \sum_{i \in \{H,E,C\}} M_{ii} / N_{obs}$
        • $M_{ii}$: number of residues observed in state i and predicted in state i
        • $N_{obs}$: total number of residues observed in the 3 states
     2. Per-segment prediction accuracy (SOV, Segment OVerlap)
        • Per-state segment overlap
        • S1: observed SS segment; S2: predicted SS segment
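As a minimal sketch of the per-residue measure, the following Python computes the three-state accuracy from an observed and a predicted string, plus a per-state breakdown. The function names are illustrative, and the segment-based SOV score is not implemented here.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed: str, predicted: str) -> dict:
    """Percentage of observed residues in each state that were predicted correctly."""
    result = {}
    for state in "HEC":
        n_obs = observed.count(state)
        hits = sum(o == p == state for o, p in zip(observed, predicted))
        # None marks a state that never occurs in the observed string.
        result[state] = 100.0 * hits / n_obs if n_obs else None
    return result


if __name__ == "__main__":
    obs = "CEEEEECCCC"
    pred = "CEEEECCCCC"
    print(q3(obs, pred))
    print(per_state_accuracy(obs, pred))
```

A per-residue score treats every position independently, which is why a segment-based measure such as SOV is reported alongside it.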

  8. Single Residue Propensity Methods
     Calculate the propensity for a given amino acid to adopt a certain SS type: $P_{\alpha,i} = p(\alpha, i) / (p(\alpha)\, p(i))$, where i is the amino acid and α is the secondary structure state.
     • Example: from a data set with 30 proteins, #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580
     • p(α, aa) = 580/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
     • P = 580 / (4,000/10) = 1.45
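The worked example on this slide can be checked with a short Python sketch. The function and argument names are illustrative; the counts are the ones given on the slide.

```python
def propensity(n_aa_in_state: int, n_state: int, n_aa: int, n_total: int) -> float:
    """Ratio of the joint frequency to the product of the marginal frequencies."""
    p_joint = n_aa_in_state / n_total   # p(state, aa)
    p_state = n_state / n_total         # p(state)
    p_aa = n_aa / n_total               # p(aa)
    return p_joint / (p_state * p_aa)


if __name__ == "__main__":
    # Slide example: 580 Ala in helix, 4,000 helical residues,
    # 2,000 Ala, 20,000 residues in total.
    print(round(propensity(580, 4000, 2000, 20000), 2))
```

A value above 1 means the amino acid occurs in that state more often than expected by chance, a value below 1 less often.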

  9. Amino Acid Propensities to Secondary Structures (Chou-Fasman method)

  10. Nearest Neighbor Methods
      The idea is simple: predict the SS of the central residue of a given segment from homologous segments (neighbors). For example, find in a database some number of the sequences closest to a subsequence defined by a window around the central residue, then use max(Nα, Nβ, Nc) to assign the SS.
      [Figure: a window over the sequence RSTEVRASRQLAKEKVN; the central-residue states of the homologous sequences (E, C, C, H, H, C, C) give the assignment C.]
      • Key parameters:
        • How to define similarity?
        • What size window of sequence should be examined?
        • How many close sequences should be selected?
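A toy Python sketch of the nearest-neighbor idea follows. Everything specific in it is an assumption made for illustration: similarity is plain residue identity, the database is a hypothetical list of (segment, central state) pairs, and `k` is the number of neighbors that vote. Real methods use substitution matrices or profiles and much larger databases.

```python
from collections import Counter


def identity(a: str, b: str) -> int:
    """Number of aligned positions with identical residues (a crude similarity)."""
    return sum(x == y for x, y in zip(a, b))


def predict_central(window: str, database, k: int = 3) -> str:
    """Assign the state that the k most similar database segments vote for.

    database: list of (segment, central_state) pairs, segments as long as window.
    """
    ranked = sorted(database, key=lambda item: identity(window, item[0]), reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy database, not taken from any real structure set.
    toy_db = [
        ("LAKEK", "H"), ("LAKDK", "H"), ("LSKEK", "H"),
        ("GGPGG", "C"), ("VTVTV", "E"),
    ]
    print(predict_central("LAKEK", toy_db, k=3))
    print(predict_central("VTVTI", toy_db, k=1))
```

The three key parameters from the slide appear directly as the `identity` function, the window length, and `k`.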

  11. The Devil is in the details…

  12. Psi-Pred Method
      • D. Jones, J. Mol. Biol. 292, 195 (1999).
      • Method: neural network
      • Input data: PSSM generated by PSI-BLAST
      • Bigger and better sequence database
        • Combining several databases and data filtering
      • Training and test set preparation
        • SS prediction only makes sense for proteins with no homologous structure.
        • No sequence or structural homologues between training and test sets, checked with CATH and PSI-BLAST (mimicking a realistic situation).

  13. Psi-Pred Method: Neural Network
      • Window size = 15
      • Two networks
      • First network (sequence-to-structure):
        • 315 = (20 + 1) × 15 inputs
        • Extra unit to indicate where the window spans either the N or C terminus
        • Data are scaled to the [0, 1] range using 1/[1 + exp(-x)]
        • 75 hidden units
        • 3 outputs (H, E, L)
      • Second network (structure-to-structure):
        • Structural correlation between adjacent sequences
        • 60 = (3 + 1) × 15 inputs
        • 60 hidden units
        • 3 outputs
      • Accuracy ~76%
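The input layout of the first network can be sketched as follows. This is a hedged illustration of the numbers on the slide (15 window positions, 20 profile scores plus 1 terminus flag each, logistic scaling), not the actual PSIPRED code: the ordering of the features and the zero fill for positions beyond the chain ends are assumptions, and the function names are hypothetical.

```python
import math

WINDOW = 15   # window size from the slide
N_AA = 20     # profile scores per position


def logistic(x: float) -> float:
    """Scale a raw profile score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm, center: int, window: int = WINDOW):
    """Build the (20 + 1) * window input vector for the residue at index center.

    pssm: list of rows, one per residue, each holding 20 raw profile scores.
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(score) for score in pssm[pos])
            features.append(0.0)           # flag off: position lies inside the chain
        else:
            features.extend([0.0] * N_AA)  # assumed fill for positions past a terminus
            features.append(1.0)           # flag on: window spans the N or C terminus
    return features


if __name__ == "__main__":
    toy_pssm = [[0] * N_AA for _ in range(30)]   # hypothetical all-zero profile
    print(len(encode_window(toy_pssm, 0)))
```

With the same layout and 3 state scores instead of 20 profile scores per position, the vector has (3 + 1) × 15 = 60 entries, matching the second network.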

  14. Sample Psi-Pred Output
      Conf: Confidence (0=low, 9=high) ---very important!!!!
      Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
      AA: Target sequence
      # PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
      Conf: 966899999997542002357777557999999716898188034435788873356776
      Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
        AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
                    10        20        30        40        50        60
      Conf: 777179998337888888988751235636899718261220179868899999998557
      Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
        AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
                    70        80        90       100       110       120
      Conf: 200242314703799714651435541487355188999999999999999889999999
      Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
        AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
                   130       140       150       160       170       180
      ***Compare the prediction for residues 9 and 17***
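To make the comparison of residues 9 and 17 concrete, here is a small Python sketch that reads the Conf/Pred/AA lines of the first block shown on this slide and looks up individual residues. The parser is a simplified assumption about the layout seen here, not an official reader for the format, and it expects only data lines (the legend lines at the top of the slide are left out).

```python
SAMPLE = """\
Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
"""


def parse_horiz(text: str):
    """Return a list of (amino_acid, predicted_state, confidence), one per residue."""
    parts = {"Conf:": [], "Pred:": [], "AA:": []}
    for line in text.splitlines():
        line = line.strip()
        for key in parts:
            if line.startswith(key):
                parts[key].append(line[len(key):].strip())
    conf, pred, aa = ("".join(parts[k]) for k in ("Conf:", "Pred:", "AA:"))
    if not (len(conf) == len(pred) == len(aa)):
        raise ValueError("Conf, Pred and AA lines differ in length")
    return [(a, p, int(c)) for a, p, c in zip(aa, pred, conf)]


if __name__ == "__main__":
    residues = parse_horiz(SAMPLE)
    for number in (9, 17):
        aa, state, conf = residues[number - 1]   # residue numbers are 1-based
        print(number, aa, state, conf)
```

Both residues are predicted as helix, but with very different confidence values, which is the point of the note on the slide.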

  15. Sample Psi-Pred Output-II

  16. Again, voting-rule methods tend to be best
      ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP 2SOD
      CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC BPS
      CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC D_R
      CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC DSC
      CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC GGR
      HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC GOR
      CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC H_K
      CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC K_S
      CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC JOI
      ---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE------------- 2SOD
      HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK 2SOD
      CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC BPS
      CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC D_R
      CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC DSC
      CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC GGR
      CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC GOR
      CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC H_K
      CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE K_S
      CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC JOI
      --------------------EEEEEE------EEEEEEE--------------EEEEE-- 2SOD
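A voting rule over several predictors can be sketched as a per-position majority vote. This is a minimal illustration with hypothetical short predictions; the slide does not state how its combined row was computed, and the tie-break to coil is an assumption.

```python
from collections import Counter


def consensus(predictions, tie_break: str = "C") -> str:
    """Combine equal-length prediction strings by majority vote at each position."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    combined = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        top = counts[0][1]
        winners = [state for state, n in counts if n == top]
        # A unique winner is kept; ties fall back to tie_break (assumption).
        combined.append(winners[0] if len(winners) == 1 else tie_break)
    return "".join(combined)


if __name__ == "__main__":
    # Hypothetical outputs of three predictors for a five-residue stretch.
    print(consensus(["CCEEH", "CEEEH", "CCEEC"]))
```

Voting helps when the individual methods make different errors, since an isolated mistake by one predictor is outvoted by the others.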

  17. Prediction Accuracy (EVA) EVA: Automatic evaluation of prediction servers

  18. How Far Can We Go?
      • Currently ~76%
      • Proteins with more than 100 homologues: 80%
      • Assignment is ambiguous (5-15%). Recall DSSP vs. STRIDE.
        • Non-unique protein structures (dynamics), H-bond cutoff, etc.
      • Different secondary structures between homologues (~12%).
      • Non-locality: secondary structure is influenced by long-range interactions.
        • Some segments can have multiple structure types (chameleon sequences).

  19. Solvent accessibility
      • Conceptually similar problem to SS prediction: Buried vs. Exposed.
      • Weighted Ensemble Solvent Accessibility predictor: http://pipe.scs.fsu.edu/wesa.html
      [Figure: per-residue labels, E (exposed) and B (buried): E E E E E E B B B B B B]

  20. Why bother? • To provide structural context for putative mutations that one wants to characterize biochemically or biophysically.

  21. Transmembrane Segment Prediction • Again, conceptually similar problem to SS prediction: TM vs. Not.
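The slide names no particular method, so the following is only an illustration of how a window-based "TM vs. not" decision can look. It swaps in a classic technique, a Kyte-Doolittle hydropathy average over a sliding window. The window length of 19, the threshold of 1.6, and the truncation of windows at the chain ends are assumptions chosen for the sketch.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19):
    """Mean hydropathy in a window centred on each residue (truncated at the ends)."""
    half = window // 2
    profile = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half):min(len(seq), i + half + 1)]
        profile.append(sum(KD[aa] for aa in segment) / len(segment))
    return profile


def tm_candidates(seq: str, window: int = 19, threshold: float = 1.6):
    """Indices (0-based) whose window average reaches the threshold."""
    return [i for i, value in enumerate(hydropathy_profile(seq, window)) if value >= threshold]


if __name__ == "__main__":
    toy = "DDDDDDDDDD" + "L" * 21 + "DDDDDDDDDD"   # hypothetical sequence
    print(tm_candidates(toy))
```

As with secondary structure, the per-residue label is derived from a local window, which is what makes the two problems conceptually similar.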
