Comparative Protein Structure Modeling
Lecture 4.1
Roberto Sanchez
Structural Biology Program, Mount Sinai School of Medicine, New York, NY 10029, USA
roberto.sanchez@physbio.mssm.edu
http://physbio.mssm.edu/~sanchez/
• What is comparative modeling and why is it useful?
• Steps in CM (overview + some details)
• Accuracy of comparative models
• Loop modeling
• CM and Structural Genomics
Function via Structure: Sequence (GFCHIKAYTRLIM…) → Structure → Function
Why is it useful to know the structure of a protein and not only its sequence?
• The biochemical function (activity) of a protein is defined by its interactions with other molecules.
• The biological function is in large part a consequence of these interactions.
• The 3D structure is more informative than the sequence because interactions are determined by residues that are close in space but frequently distant in sequence. In addition, since evolution tends to conserve function, and function depends more directly on structure than on sequence, structure is more conserved in evolution than sequence. The net result is that patterns in space are frequently more recognizable than patterns in sequence.
Why Protein Structure Prediction?
Known Sequences (5/30/01): 694,000
Known Structures (5/29/01): 15,200
We know the experimental 3D structure for less than 3% of the protein sequences. For the remaining 97% we need some sort of 3D structure prediction.
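The "less than 3%" figure follows directly from the two counts on the slide. A minimal check, using only those two numbers (the function name is illustrative):

```python
# Counts quoted on the slide (sequences as of 5/30/01, structures as of 5/29/01).
known_sequences = 694_000
known_structures = 15_200


def coverage_percent(structures: int, sequences: int) -> float:
    """Percentage of sequences that have an experimental 3D structure."""
    return 100.0 * structures / sequences


pct = coverage_percent(known_structures, known_sequences)
print(f"{pct:.1f}% of known sequences have an experimental structure")
```

This treats every structure as corresponding to one distinct sequence, which is a simplification.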
What is Comparative Protein Structure Modeling?
Protein Structure Prediction: …SDVIFTEDGILICNRK…
Principles of Protein Structure
• Folding (GFCHIKAYTRLIMVG…): Ab initio prediction
• Evolution (Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris): Fold Recognition, Comparative Modeling
Steps in Comparative Protein Structure Modeling
START
1) Template Search
2) Target – Template Alignment
3) Model Building
4) Model Evaluation
OK? No: iterate. Yes: END
Example target sequence:
ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPERASFQWMNDK
Example target – template alignment:
TARGET   ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE
TEMPLATE MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE
A. Šali, Curr. Opin. Biotech. 6, 437, 1995.
R. Sánchez & A. Šali, Curr. Opin. Str. Biol. 7, 206, 1997.
M. Marti et al. Ann. Rev. Biophys. Biomolec. Struct., 29, 291, 2000.
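Sequence identity between target and template comes up repeatedly in this lecture. As a rough sketch, it can be computed from an alignment such as the one on this slide. The function name and the choice of denominator (positions where neither sequence has a gap) are assumptions; other definitions divide by the length of the shorter sequence or of the whole alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over positions aligned without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# The two aligned rows shown on the slide (target first, template second).
target = "ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE"
template = "MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE"
print(f"{percent_identity(target, template):.1f}% identity over aligned positions")
```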
Template Search Methods
• Sequence similarity searches (BLAST, FastA)
• Profile and iterative methods (HMMs, PSI-BLAST)
• Structure based threading (THREADER, PROFIT)
Target – Template Alignment Methods • Dynamic Programming Pairwise Alignments • Multiple Alignments, Profiles, HMMs • Structure based approaches (Threading)
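To make the first item concrete, here is a minimal sketch of a global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme). The scoring values (match +1, mismatch -1, linear gap penalty -2) are illustrative assumptions. Practical alignment programs use residue substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(row_a)
print(row_b)
```

When several alignments share the optimal score, the traceback returns only one of them.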
Model Building Methods
• Rigid Body Assembly (COMPOSER)
• Segment Matching (SEGMOD)
• Satisfaction of Spatial Restraints (MODELLER)
A. Šali, Curr. Opin. Biotech. 6, 437, 1995
R. Sánchez & A. Šali, Curr. Opin. Str. Biol. 7, 206, 1997
Comparative Modeling by MODELLER
Alignment of the template structure (3D) with the target sequence (SEQ):
3D  GKITFYERGFQGHCYESDC-NLQP
SEQ GKITFYERG---RCYESDCPNLQP
1) EXTRACT Spatial Restraints
$F(R) = \prod_i p_i(f_i / I)$
2) SATISFY Spatial Restraints
http://guitar.rockefeller.edu/modeller/
A. Šali & T. Blundell, J. Mol. Biol. 234, 779, 1993
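The following sketch shows how a product of per-restraint probability densities, F(R) = prod_i p_i(f_i / I), can be turned into an objective to minimize. It is not MODELLER code. It assumes every feature f_i is an interatomic distance with a Gaussian density, and all coordinates and restraint values are invented for illustration. Working with -ln F(R) turns the product into a sum and avoids numerical underflow.

```python
import math


def neg_log_density(value: float, mean: float, sigma: float) -> float:
    """-ln of a Gaussian probability density evaluated at value."""
    return 0.5 * ((value - mean) / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))


def objective(coords, restraints) -> float:
    """-ln F(R): sum over restraints of -ln p_i(f_i). Lower is better."""
    total = 0.0
    for i, j, mean, sigma in restraints:
        d = math.dist(coords[i], coords[j])  # the feature f_i is a distance here
        total += neg_log_density(d, mean, sigma)
    return total


# Hypothetical distance restraints: (atom i, atom j, mean distance, sigma).
restraints = [(0, 1, 3.8, 0.1), (1, 2, 3.8, 0.1), (0, 2, 7.6, 0.5)]

# Two hypothetical candidate models for the same three atoms.
model_good = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model_bad = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.6, 0.0, 0.0)]

print(objective(model_good, restraints))
print(objective(model_bad, restraints))
```

The model that satisfies the restraints more closely gets the lower objective value. An optimizer would adjust the coordinates to reduce it.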
Model Evaluation Methods
• Stereochemistry (PROCHECK)
• Environment (Profiles3D)
• Statistical potential-based methods (PROSAII)
Model Evaluation: Alignment Errors R. Sánchez & A. Šali, Proteins, Suppl. 1, 50-58, 1997
Predicting features of a model that are not present in the template
Do mast cell proteases bind proteoglycans? Where? When?
• Do mMCPs bind negatively charged proteoglycans through electrostatic interactions?
• Comparative models were used to find clusters of positively charged surface residues.
• The predictions were tested by site-directed mutagenesis.
Figure panels: Native mMCP-7 at pH=5 (His+); Native mMCP-7 at pH=7 (His0)
Huang et al. J. Clin. Immunol. 18, 169, 1998.
Matsumoto et al. J. Biol. Chem. 270, 19524, 1995.
Šali et al. J. Biol. Chem. 268, 9023, 1993.
Typical Errors in Comparative Models
• Incorrect template
• Misalignment
• Distortion in correctly aligned regions
• Region without a template
• Sidechain packing
Figure legend: MODEL, X-RAY, TEMPLATE
CASP: Lessons from Blind Predictions Build models for proteins of unknown structure. Structures are determined after the models are submitted. Models are evaluated by comparing them with the corresponding experimental structures.
CASP: Lessons from Blind Predictions: Multiple Template Models
• Comparative modeling (by MODELLER) can combine the best regions from each template.
• The per-residue accuracy of a comparative model cannot be higher than that of the templates.
• The overall accuracy of a model can nevertheless be higher than that of any single template.
CASP: Lessons from Blind Predictions (DFR) R. Sánchez & A. Šali, Proteins, Suppl. 1, 50-58, 1997
Model Accuracy as a Function of Target-Template Sequence Identity
Some Models Can Be Surprisingly Accurate (in Some Regions)
Figure panels: 25% sequence identity; 24% sequence identity
Figure labels: YGL203C, 1ac5; YJL001W, 1rypH; His 488, Ser 176, Asp 383
Loop Modeling in Protein Structures
Figure panels: α+β barrel: flavodoxin; IG fold: immunoglobulin; antiparallel β-barrel
A. Fiser, R. Do & A. Šali, Prot. Sci., 9, 1753, 2000
Loop modeling strategies
Two strategies: database search and conformational search.
• The database is complete only for loops of up to 4-6 residues.
• Even in a database search, the different conformations must be ranked.
• Loops longer than 4 residues need extensive optimization.
• The database method is efficient for specific families (e.g. canonical loops in Ig's, β-hairpins, etc.).
Loop Modeling by Conformational Search • Protein representation. • Energy (scoring) function. • Optimization algorithm.
Energy Function for Loop Modeling
The energy function is a sum of many terms:
1) Statistical preferences for dihedral angles
2) Restraints from the CHARMM-22 force field
3) Statistical potential for non-bonded contacts
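The three ingredients of a conformational search (a representation, an energy function that sums several terms, and an optimizer) can be sketched with a toy example. Everything here is an assumption made for illustration: the loop is a 2D chain of points with fixed spacing, the three energy terms are crude stand-ins for the terms listed above, and the optimizer is a plain Metropolis Monte Carlo search. It is not the method of the cited paper.

```python
import math
import random

BOND = 3.8  # assumed fixed spacing between consecutive loop points


def build_chain(start, angles):
    """Representation: 2D points; each angle is the direction of one bond."""
    pts = [start]
    x, y = start
    for a in angles:
        x += BOND * math.cos(a)
        y += BOND * math.sin(a)
        pts.append((x, y))
    return pts


def energy(angles, start, end_anchor):
    """Scoring function: a sum of three toy terms."""
    pts = build_chain(start, angles)
    # Term 1: angular preference (stand-in for dihedral angle preferences).
    e_angle = sum(0.1 * (1.0 - math.cos(a2 - a1)) for a1, a2 in zip(angles, angles[1:]))
    # Term 2: restraint pulling the last point onto the fixed end anchor.
    e_close = math.dist(pts[-1], end_anchor) ** 2
    # Term 3: soft repulsion between non-bonded points closer than 3.0.
    e_clash = 0.0
    for i in range(len(pts)):
        for j in range(i + 2, len(pts)):
            d = math.dist(pts[i], pts[j])
            if d < 3.0:
                e_clash += (3.0 - d) ** 2
    return e_angle + e_close + e_clash


def monte_carlo(start, end_anchor, n_bonds=8, steps=5000, temperature=1.0, seed=0):
    """Optimizer: Metropolis Monte Carlo over bond angles; returns the best state seen."""
    rng = random.Random(seed)
    angles = [0.0] * n_bonds  # start from a fully extended chain
    current = energy(angles, start, end_anchor)
    best_angles, best = list(angles), current
    for _ in range(steps):
        trial = list(angles)
        trial[rng.randrange(n_bonds)] += rng.gauss(0.0, 0.3)
        e = energy(trial, start, end_anchor)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e <= current or rng.random() < math.exp((current - e) / temperature):
            angles, current = trial, e
            if current < best:
                best_angles, best = list(angles), current
    return best_angles, best


start_anchor, end_anchor = (0.0, 0.0), (10.0, 0.0)
initial_energy = energy([0.0] * 8, start_anchor, end_anchor)
best_angles, best_energy = monte_carlo(start_anchor, end_anchor)
print(initial_energy, best_energy)
```

The search starts from an extended chain that overshoots the end anchor and keeps the lowest-energy conformation it visits.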
Accuracy of Loop Modeling
• HIGH ACCURACY (<1 Å): 50% (30%) of 8-residue loops; example RMSD = 0.6 Å
• MEDIUM ACCURACY (<2 Å): 40% (48%) of 8-residue loops; example RMSD = 1.1 Å
• LOW ACCURACY (>2 Å): 10% (22%) of 8-residue loops; example RMSD = 2.8 Å
A. Fiser, R. Do & A. Šali, Prot. Sci., 9, 1753, 2000
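As an illustration of the accuracy classes, RMSD between a modeled loop and the experimental loop can be computed and binned as follows. The sketch assumes the two coordinate sets are already superposed and list the same atoms in the same order. The function names are illustrative, and a value of exactly 2 Å, which the slide leaves unspecified, is treated as low accuracy.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def accuracy_class(value: float) -> str:
    """Bin an RMSD in angstroms using the thresholds on the slide."""
    if value < 1.0:
        return "high"
    if value < 2.0:
        return "medium"
    return "low"  # includes exactly 2.0, which the slide does not assign


# The three example RMSD values shown on the slide.
for value in (0.6, 1.1, 2.8):
    print(value, accuracy_class(value))
```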
Problems in Practical Loop Modeling
• Decide which regions to model as loops.
• Correct alignment of anchor regions & environment.
• Modeling of a loop.
Examples:
T0076: 46-53, RMSD_mnch loop = 1.37 Å, RMSD_mnch anchors = 1.52 Å
T0058: 80-85, RMSD_mnch loop = 1.09 Å, RMSD_mnch anchors = 0.29 Å
How can Comparative Modeling be used in Structural Genomics?
Structural Genomics
• Definition: The aim of structural genomics is to put every protein sequence within a modeling distance of a known protein structure.
• Size of the problem:
  • There are a few thousand domain fold families.
  • There are ~20,000 sequence families (30% sequence id).
• Solution:
  • Determine many protein structures.
  • Increase modeling distance.
Šali. Nat. Struct. Biol. 5, 1029, 1998.
Burley et al. Nat. Genet. 23, 151, 1999.
Šali & Kuriyan. TIBS 22, M20, 1999.
Sanchez et al. Nat. Str. Biol. 7, 986, 2000
How can Comparative Modeling be used in Structural Genomics?
• Target Selection
  How many structures need to be solved? Which structures should we solve first?
• Target Amplification
  How much of the sequence space is covered by:
  • a new structure
  • all structures
Target Selection for Structural Genomics
Select targets such that every protein sequence is within a modeling distance of a known protein structure.
Modeling distance: correct alignment, corresponding to >30% sequence identity.
G. Kurban, R. Sánchez, A. Šali, T. Gaasterland.
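One simple way to think about this selection problem is as a covering problem: pick targets until every sequence is within modeling distance of a solved structure. The slide does not say how the selection was done. The greedy heuristic below is an assumption for illustration only, and the sequence names and neighbor sets are made up.

```python
def greedy_target_selection(neighbors):
    """Greedy covering sketch.

    neighbors maps each candidate target to the set of sequences that are
    within modeling distance of it (e.g. alignable at >30% identity).
    Returns the chosen targets in selection order.
    """
    universe = set(neighbors)
    for covered in neighbors.values():
        universe |= covered
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered sequences;
        # sorting makes ties resolve alphabetically.
        best = max(sorted(neighbors), key=lambda c: len((neighbors[c] | {c}) & uncovered))
        gain = (neighbors[best] | {best}) & uncovered
        if not gain:
            break  # nothing left that any candidate can cover
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical sequence families: A, B, C are mutual neighbors of A; D and E pair up.
example = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": {"E"}, "E": {"D"}}
print(greedy_target_selection(example))
```

A greedy choice does not always give the smallest possible set of targets, but it is easy to reason about.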
Leveraging Templates by Comparative Modeling: Quantifying Productivity of Structural Genomics
http://www.nysgrc.org
Models are in MODBASE at http://guitar.rockefeller.edu/modbase/
MODPIPE: Large-Scale Comparative Protein Structure Modeling
START
For each sequence (PSI-BLAST):
1) Prepare PSI-BLAST PSSM by comparing the sequence against the NR database of sequences.
2) Use the sequence PSSM to search against the representative set of PDB chains (F and no-F).
3) Use the PDB chain PSSMs to search against the sequence (F and no-F).
4) Select templates using a permissive E-value cutoff.
For each template (MODELLER):
5) Align the matched part of the target sequence with the template structure.
6) Build a model for the target segment by satisfaction of spatial restraints.
7) Evaluate the model.
END
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998
R. Sánchez, F. Melo, N. Mirkovic, A. Šali, in preparation
MODPIPE Model of Yeast Hypothetical Protein YIL073C
Figure panels: YIL073C model; PDB 1a17 template
E-value = 65, Seq. Id. = 20%, pG = 0.97
The tetratricopeptide repeat (TPR) is a degenerate 34 aa sequence identified in a variety of proteins. It is present in tandem arrays and mediates protein-protein interactions.
Das et al. EMBO J. 17, 1192, 1998
R. Sánchez, F. Melo, N. Mirkovic, A. Šali.
Factors affecting coverage:
• PDB growth
• Fold assignments
• Reliable models
Organism Statistics
Top 10 organisms by number of models
MODBASE
R. Sánchez, U. Pieper, N. Mirkovic, P. I. W. de Bakker, E. Wittenstein, and A. Šali. Nucl. Acids Res., 28, 250, 2000
R. Sánchez and A. Šali. Bioinformatics, 15, 1060, 1999