
Prediction of protein contact maps



Presentation Transcript


  1. Prediction of protein contact maps. Piero Fariselli, Department of Biology, University of Bologna

  2. From Sequence to Function
     Example sequence: >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus. MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
     [Diagram labels: Genomic sequences; Protein sequences; Protein structures; Protein functions; Functional Genomics and Proteomics]

  3. The Protein Folding. Example sequence: TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

  4. (Rost B.) http://dodo.cpmc.columbia.edu/cubic/papers/

  5. The Data Bases of Sequences and Structures (November 2009)
     • EMBL: 195,241,608 sequences; 292,078,866,691 nucleotides
     • UNIPROT: 428,650 sequences; 154,416,236 residues
     • PDB: 68,000 structures; membrane proteins 1%
     Example entry: >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus (the full sequence is the one shown on slide 2)

  6. What is a multiple alignment? The short answer is this:
     VTISCTGSSSNIGAG-NHVKWYQQLPG
     VTISCTGTSSNIGS--ITVNWYQQLPG
     LRLSCSSSGFIFSS--YAMYWVRQAPG
     LSLTCTVSGTSFDD--YYSTWVRQPPG
     PEVTCVVVDVSHEDPQVKFNWYVDG--
     ATLVCLISDFYPGA--VTVAWKADS--
     AALGCLVKDYFPEP--VTVSWNSG---
     VSLTCLVKGFYPSD--IAVEWESNG--

  7. Evolutionary information
     • Multiple Sequence Alignment (MSA) of similar sequences
     • Sequence profile: for each position, a 20-valued vector contains the amino acid composition of the aligned sequences.

     MSA (rows: sequences 1 to 10; columns: sequence positions 1 to 16)
      1  Y K D Y H S - D K K K G E L - -
      2  Y R D Y Q T - D Q K K G D L - -
      3  Y R D Y Q S - D H K K G E L - -
      4  Y R D Y V S - D H K K G E L - -
      5  Y R D Y Q F - D Q K K G S L - -
      6  Y K D Y N T - H Q K K N E S - -
      7  Y R D Y Q T - D H K K A D L - -
      8  G Y G F G - - L I K N T E T T K
      9  T K G Y G F G L I K N T E T T K
     10  T K G Y G F G L I K N T E T T K

     Sequence profile (rows: residue types; columns: sequence positions 1 to 16; values in percent)
     A   0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
     C   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     D   0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
     E   0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
     F   0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
     G  10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
     H   0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
     K   0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
     I   0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
     L   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
     M   0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
     N   0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
     P   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     Q   0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
     R   0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     S   0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
     T  20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
     V   0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
     W   0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     Y  70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
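A sequence profile of this kind can be computed directly from the alignment. The following is a minimal sketch, not the author's code: the function name `sequence_profile` is hypothetical, and it assumes that gaps are skipped so that each column is normalised over the sequences that actually have a residue there (which is how the 33/33/33 and 100 entries of the slide's table seem to arise).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_profile(alignment):
    """Return one {residue: percentage} dict per alignment column.

    Assumption: gap characters ('-') are ignored, so percentages are taken
    over the non-gap residues of each column.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({
            aa: (100.0 * counts[aa] / total if total else 0.0)
            for aa in AMINO_ACIDS
        })
    return profile

# The ten aligned sequences from the slide, one string per row.
msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
print(profile[0]["Y"], profile[0]["G"], profile[0]["T"])
```

Each column of the result is the 20-valued vector described in the slide; individual entries of the slide's table may differ from what this sketch computes.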

  8. 3D structure prediction of proteins
     [Figure: prediction methods placed along a Homology (%) axis running from 0 to 100. Labels: New folds; Existing folds; Ab initio prediction; Threading; Building by homology.]

  9. Contacts and Contact Maps
     [Figure: contact definition, illustrated with residues F156, V299, I269, F297, V271, I240 and V238.]

  10. Protein contact definitions:
     1. Based on Cα
     2. Based on Cβ
     3. All-atom (without hydrogens)

  11. From the 3D structure to the contact map
     Given a protein of length L and a square matrix M of dimension L × L:
     for each pair of residues i and j, calculate the distance between i and j;
     if distance < threshold, put 1 in the cell M(i,j);
     otherwise put 0 in the cell M(i,j).
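The procedure above translates almost line by line into code. This is a minimal sketch under stated assumptions: `contact_map` is a hypothetical name, each residue is represented by a single (x, y, z) point (for example its Cα or Cβ atom), and the toy coordinates are invented for illustration.

```python
import math

def contact_map(coords, threshold):
    """Build the L x L contact map M for a list of (x, y, z) points.

    M[i][j] is 1 when the distance between residues i and j is below
    the threshold and 0 otherwise, exactly as in the slide. The diagonal
    is therefore 1, because every residue is at distance 0 from itself.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Toy example (invented data): five points on a line, 3.8 Å apart.
toy_coords = [(3.8 * k, 0.0, 0.0) for k in range(5)]
toy_map = contact_map(toy_coords, threshold=8.0)
for row in toy_map:
    print(row)
```

Because distance is symmetric, the resulting matrix is symmetric too, so in practice only the upper triangle needs to be stored.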

  12. Computation of Contact Maps: From 3D Structure To Contact Map
     [Figure: the sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN is shown twice; a 3D structure detail is labelled with residues F156, V299, I269, F297, V271, I240 and V238.]

  13. Protein Structural Classes: all-β, all-α, α/β, α+β

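  14. An Example of a Contact map (all-α) [Figure; labels 1 to 4 appear on both the structure and the map.]

  15. An Example of a Contact map (all-β) [Figure; N and C termini marked.]

  16. An Example of a Contact map (α+β) [Figure; N and C termini marked.]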

  17. From the contact map to the 3D structure
     Two methods have been proposed:
     • Bohr et al., "Protein Structure from distance Inequalities", J. Mol. Biol. 1993, 231:861-869 => based on a steepest descent procedure
     • Vendruscolo and Domany, Fold. Des. 1998, 2:295-306 => based on a modified Metropolis procedure

  18. 6pti Reconstruction Efficiency (58 residues). Vendruscolo and Domany, Fold. Des. 1998
     [Plot: RMSD versus M, the number of random flips.]
     At M = 200:
     • No. of eliminated true contacts: 6% of real contacts
     • No. of added false contacts: 52% of real contacts

  19. From the contact map to the 3D structure: the reconstruction efficiency

  20. 3-D Modelling through Contact Maps. Example: Bacteriorhodopsin
     [Figure: contact map and resulting model compared with 1QHJ (1.9 Å); N and C termini marked. RMSD = 2.5 Å]

  21. MARC efficiency in 3D reconstruction from the protein contact map after progressive elimination of true contacts (6pti)

  22. MARC efficiency in 3D reconstruction after progressive addition of wrong contacts to a protein contact map with 30 % of true contacts (6pti)

  23. Prediction of Contact Maps

  24. Prediction of Contact Maps
     Several methods have been applied:
     • Bohr et al., FEBS 1990 261:43-46 => based on neural networks
     • Göbel et al., PROTEINS 1994 18:309-317 => based on correlated mutations in proteins
     • Thomas et al., Prot. Eng. 1996 9:941-948 => based on a statistical method and evolutionary information
     • Olmea and Valencia, Fold. Des. 1997 2:S25-S32 => based on correlated mutations and other information
     • Fariselli and Casadio, Prot. Eng. 1999 12:15-21 => based on neural networks and evolutionary information
     • Fariselli et al., CASP4 and Prot. Eng. (in press) => neural networks and other information
     • Pollastri and Baldi, Bioinformatics 2002 18:S62-S70 => recurrent neural networks

  25. Relevant points • Contact Threshold • Sequence separation (or sequence gap) • No of contacts vs No of non-contacts

  26. The Contact Threshold: 16 Å

  27. The Contact Threshold: 16 Å, 12 Å

  28. The Contact Threshold: 16 Å, 12 Å, 8 Å

  29. The Contact Threshold: 16 Å, 12 Å, 8 Å, 6 Å

  30. Sequence separation [Figure: labels 1, 20, 40 and 100 along the sequence …VTISCTGSSSNIGAGNHVKWYQQLPG…]

  31. The Sequence Separation: example of a sequence separation = 10 residues

  32. Frequency distribution of the real and hypothetical contacts as a function of sequence separation

  33. Relation between the number of contacts and the protein length [Plot: number of contacts versus protein length.]

  34. Evaluation of the efficiency of contact map predictions
     1) Accuracy:
        $A = N_{cp}^{*} / N_{cp}$
        where $N_{cp}^{*}$ and $N_{cp}$ are the number of correctly assigned contacts and the total number of predicted contacts, respectively.
     2) Improvement over a random predictor:
        $R = A / (N_c / N_p)$
        where $N_c / N_p$ is the accuracy of a random predictor; $N_c$ is the number of real contacts in the protein of length $L_p$, and $N_p$ is the number of all possible contacts.
     3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997):
        $X_d = \sum_{i=1}^{n} \frac{P_{ic} - P_{ia}}{n \, d_i}$
        where $n$ is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); $d_i$ is the upper limit (normalised to 60 Å) of each bin, e.g. 8 Å for the 4 to 8 Å bin; $P_{ic}$ and $P_{ia}$ are the percentage of predicted contact pairs with distance between $d_{i-1}$ and $d_i$ and that of all possible pairs, respectively.
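The three scores can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code behind the slides: the function names are hypothetical, contacts are assumed to be stored as (i, j) tuples with i < j, and the bin limits are passed in by the caller instead of being fixed to the 15 bins mentioned above.

```python
def accuracy(predicted, real):
    """A = Ncp* / Ncp: fraction of predicted contacts that are real."""
    predicted, real = set(predicted), set(real)
    if not predicted:
        return 0.0
    return len(predicted & real) / len(predicted)

def improvement_over_random(predicted, real, n_possible):
    """R = A / (Nc / Np), with Np the number of possible contacts."""
    random_accuracy = len(set(real)) / n_possible
    return accuracy(predicted, real) / random_accuracy

def distance_score(pred_distances, all_distances, bin_edges, d_max=60.0):
    """Xd = sum_i (Pic - Pia) / (n * di) over n distance bins.

    bin_edges holds n + 1 increasing limits. Pic and Pia are percentages;
    di is the upper limit of bin i divided by d_max (the "normalised to
    60 Å" reading of the slide, which is an assumption of this sketch).
    """
    if not pred_distances or not all_distances:
        raise ValueError("both distance lists must be non-empty")
    n = len(bin_edges) - 1

    def percentages(distances):
        counts = [0] * n
        for d in distances:
            for k in range(n):
                if bin_edges[k] <= d < bin_edges[k + 1]:
                    counts[k] += 1
                    break
        return [100.0 * c / len(distances) for c in counts]

    p_c = percentages(pred_distances)
    p_a = percentages(all_distances)
    return sum(
        (p_c[k] - p_a[k]) / (n * (bin_edges[k + 1] / d_max))
        for k in range(n)
    )

# Toy data, invented for illustration.
predicted = {(0, 10), (2, 12), (3, 20), (5, 30)}
real = {(0, 10), (2, 12), (7, 40)}
a_score = accuracy(predicted, real)
r_score = improvement_over_random(predicted, real, n_possible=30)
xd_score = distance_score([10.0, 20.0], [10.0, 40.0], bin_edges=[0.0, 30.0, 60.0])
print(a_score, r_score, xd_score)
```

A positive Xd indicates that the predicted pairs are shifted towards shorter distances than the background of all pairs, since short-distance bins carry a larger weight 1/d_i.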

  35. Tools out of machine learning approaches: Neural Networks
     [Diagram. Training: Data Base Subset, Known mapping, General rules. Prediction: New sequence (TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN), General rules, Prediction.]

  36. Contact definition used: • Cβ-Cβ distance < 0.8 nm • Sequence gap > 7 residues
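Applied to a structure, this definition splits the residue pairs into contacts and non-contacts, which also makes the imbalance between the two classes easy to count. The sketch below is an assumption-laden illustration: `label_pairs` is a hypothetical name, the sequence gap is taken to mean j - i, and the coordinates are invented toy values in nanometres.

```python
import math

def label_pairs(cb_coords_nm, threshold_nm=0.8, min_gap=7):
    """Split residue pairs into contacts and non-contacts.

    Only pairs with sequence gap j - i > min_gap are considered
    (assumed reading of "sequence gap > 7 residues"). A pair is a
    contact when its C-beta distance is below threshold_nm.
    """
    contacts, non_contacts = [], []
    n = len(cb_coords_nm)
    for i in range(n):
        for j in range(i + min_gap + 1, n):
            if math.dist(cb_coords_nm[i], cb_coords_nm[j]) < threshold_nm:
                contacts.append((i, j))
            else:
                non_contacts.append((i, j))
    return contacts, non_contacts

# Toy chain (invented data): ten points on a line, with the last one
# folded back next to the first two.
toy_cb = [(0.38 * k, 0.0, 0.0) for k in range(10)]
toy_cb[9] = (0.3, 0.3, 0.0)
toy_contacts, toy_non_contacts = label_pairs(toy_cb)
print(toy_contacts, toy_non_contacts)
```

On real proteins the non-contact list is far longer than the contact list, which is the imbalance listed among the relevant points earlier in the talk.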

  37. The database of proteins used to train and test the contact map predictors.

  38. Neural Network-based predictor • 1 output neuron (contact/non-contact) • 1 hidden layer with 8 neurons • Input layer with 1071 input neurons : • Ordered residue pairs (1050 neurons) • Secondary structures (18 neurons) • Correlated mutations (1 neuron) • Sequence conservation (2 neurons)
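The layer sizes listed above are enough to sketch the forward pass of such a network. This is a sketch only: the sigmoid activation, the random weight initialisation and all function names are assumptions, since the slide gives the layer sizes but not the transfer function or the training details.

```python
import math
import random

# Input blocks as listed on the slide; they add up to 1071 input neurons.
INPUT_BLOCKS = {
    "ordered_residue_pairs": 1050,
    "secondary_structures": 18,
    "correlated_mutations": 1,
    "sequence_conservation": 2,
}
N_INPUT = sum(INPUT_BLOCKS.values())
N_HIDDEN = 8

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(seed=0):
    """Random small weights (an assumption; any initialisation would do)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.01, 0.01) for _ in range(N_INPUT)]
                for _ in range(N_HIDDEN)]
    b_hidden = [0.0] * N_HIDDEN
    w_out = [rng.uniform(-0.01, 0.01) for _ in range(N_HIDDEN)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out

def forward(network, x):
    """Return the single output activation (contact vs non-contact)."""
    w_hidden, b_hidden, w_out, b_out = network
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

def n_parameters():
    """Weights plus biases of the 1071-8-1 network."""
    return N_HIDDEN * (N_INPUT + 1) + (N_HIDDEN + 1)

network = init_network()
example_input = [0.0] * N_INPUT
example_input[0] = 1.0          # one active input, purely illustrative
output = forward(network, example_input)
print(N_INPUT, n_parameters(), output)
```

With a single small hidden layer the parameter count is dominated by the input-to-hidden weights, so the width of the input coding largely determines the size of the model.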

  39. Representation of the input coding based on ordered couples. (A) An alignment of 5 (hypothetical) sequences as represented in an HSSP file (Sander and Schneider, 1991). i and j stand for the positions of the two residues making or not making contact (A and D in the leading sequence, sequence 1). (B) Single sequence coding. The position representing the couple (AD) in the vector is set to 1.0, while the other positions are set to 0. (C) Multiple sequence coding. For each sequence in the alignment (1 to 5 in the scheme in A), the couple of residues in positions i and j is counted. The final input coding, representing the frequency of each couple in the alignment, is normalized to the number of sequences.
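Both codings can be produced by one small function, because single sequence coding is just the multiple sequence case with one sequence. The sketch below rests on assumptions that the slide does not spell out: a couple and its mirror image (AD and DA) are assumed to share one slot, giving 210 slots, and pairs involving a gap are skipped. The names `COUPLE_INDEX` and `encode_couple` are hypothetical.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One slot per residue couple, order-independent (assumption): 210 slots.
COUPLES = [a + b for a, b in combinations_with_replacement(AMINO_ACIDS, 2)]
COUPLE_INDEX = {couple: k for k, couple in enumerate(COUPLES)}

def couple_key(a, b):
    """Map a residue couple to its slot name, ignoring the order."""
    return "".join(sorted((a, b)))

def encode_couple(alignment, i, j):
    """Frequency of each residue couple at positions i and j.

    Counts are normalised to the number of sequences, as on the slide.
    Couples containing a gap or an unknown symbol are skipped (assumption).
    """
    vector = [0.0] * len(COUPLES)
    for seq in alignment:
        a, b = seq[i], seq[j]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            vector[COUPLE_INDEX[couple_key(a, b)]] += 1.0
    return [value / len(alignment) for value in vector]

# Toy alignment (invented): two columns, five sequences.
toy_alignment = ["AD", "AD", "DA", "KE", "A-"]
multi = encode_couple(toy_alignment, 0, 1)
single = encode_couple(["AD"], 0, 1)
print(multi[COUPLE_INDEX["AD"]], single[COUPLE_INDEX["AD"]])
```

The multiple sequence vector carries the evolutionary information of the alignment, whereas the single sequence vector reduces to a one-hot code for the couple.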

  40. Correlated mutations
     Multiple sequence alignment (N sequences; positions i and j are marked):
     1 MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS
     2 MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST
     3 MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT
     The sequences are compared couple by couple (1-2, 1-3, 2-3), giving M = N·(N-1)/2 couples.
     S: McLachlan substitution matrix.
     M-valued vectors: Vi = [S(T;S), S(T;T), S(S;T)] and Vj = [S(I;L), S(I;V), S(L;V)]
     Correlation: computed between Vi and Vj.
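The scheme on this slide can be sketched as follows: build one exchange score per sequence couple at each of the two positions, then correlate the two vectors. Two things here are assumptions and not the published method: `toy_similarity` is a hypothetical stand-in for the McLachlan substitution matrix (whose values are not given in the slides), and a plain, unweighted Pearson correlation is used.

```python
import math
from itertools import combinations

def toy_similarity(a, b):
    """Hypothetical stand-in for S(a;b): 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0

def exchange_vector(alignment, pos, similarity):
    """One similarity value per sequence couple: M = N*(N-1)/2 entries."""
    column = [seq[pos] for seq in alignment]
    return [similarity(a, b) for a, b in combinations(column, 2)]

def pearson(u, v):
    """Plain Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0.0 or var_v == 0.0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def correlated_mutation(alignment, i, j, similarity=toy_similarity):
    """Correlation between the exchange vectors Vi and Vj."""
    v_i = exchange_vector(alignment, i, similarity)
    v_j = exchange_vector(alignment, j, similarity)
    return pearson(v_i, v_j)

# Toy alignments (invented): positions that change together or apart.
together = ["AK", "AK", "VE"]
apart = ["AK", "AE", "VE"]
r_together = correlated_mutation(together, 0, 1)
r_apart = correlated_mutation(apart, 0, 1)
print(r_together, r_apart)
```

Swapping `toy_similarity` for a lookup into a real substitution matrix would leave the rest of the sketch unchanged, since the matrix only enters through the `similarity` argument.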

  41. The neural network architecture for prediction of contact maps

  42. Accuracy of contact map prediction using a cross-validated data set (170 proteins) [Plot: number of proteins versus accuracy.]

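  43. T0087: 310 residues (A = 0.20, FR/NF) [Contact map figure; N and C termini marked.]

  44. T0106: 123 residues (A = 0.06, FR/NF) [Contact map figure; N and C termini marked.]

  45. T0128: 222 residues (A = 0.24, CM) [Contact map figure; N and C termini marked.]

  46. T0110: 128 residues (A = 0.30, FR) [Contact map figure; N and C termini marked.]

  47. T0125: 141 residues (A = 0.03, CM) [Contact map figure; N and C termini marked.]

  48. T0124: 242 residues (A = 0.01, NF) [Contact map figure; N and C termini marked.]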
