Unraveling Protein Structures: Comparative Modeling Insights

Comparative Protein Structure Modeling
Lecture 4.1
Roberto Sanchez, Structural Biology Program, Mount Sinai School of Medicine, New York, NY 10029, USA
roberto.sanchez@physbio.mssm.edu
http://physbio.mssm.edu/~sanchez/
• What is comparative modeling and why is it useful?
• Steps in CM (overview + some details)
• Accuracy of comparative models
• Loop modeling
• CM and Structural Genomics

Function via Structure: Sequence (GFCHIKAYTRLIM…) → Structure → Function

Why is it useful to know the structure of a protein, not only its sequence?
• The biochemical function (activity) of a protein is defined by its interaction with other molecules.
• The biological function is in large part a consequence of these interactions.
• The 3D structure is more informative than sequence because interactions are determined by residues that are close in space but are frequently distant in sequence.
In addition, since evolution tends to conserve function, and function depends more directly on structure than on sequence, structure is more conserved in evolution than sequence. The net result is that patterns in space are frequently more recognizable than patterns in sequence.

Why Protein Structure Prediction? Known Sequences (5/30/01) : 694,000 Known Structures (5/29/01) : 15,200 We know the experimental 3D structure for less than 3% of the protein sequences. For the remaining 97% we need some sort of 3D structure prediction.

…SDVIFTEDGILICNRK… What is Comparative Protein Structure Modeling? Protein Structure Prediction

Principles of Protein Structure
• Folding: Ab initio prediction (GFCHIKAYTRLIMVG…)
• Evolution: Fold Recognition, Comparative Modeling (Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris)

Steps in Comparative Protein Structure Modeling
START
TARGET: ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPERASFQWMNDK
1) Template Search (find a TEMPLATE structure for the TARGET sequence)
2) Target – Template Alignment
   ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE
   MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE
3) Model Building
4) Model Evaluation
OK? No: repeat the earlier steps. Yes: END
A. Šali, Curr. Opin. Biotech. 6, 437, 1995.
R. Sánchez & A. Šali, Curr. Opin. Str. Biol. 7, 206, 1997.
M. Marti et al. Ann. Rev. Biophys. Biomolec. Struct. 29, 291, 2000.
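
The target – template alignment above can be summarized by its percent sequence identity. The following is a minimal Python sketch of that calculation; the function name `percent_identity` and the choice to divide by the number of columns in which neither sequence has a gap are assumptions made here for illustration, not conventions stated in the lecture.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Return (identical, aligned, percent) for two pre-aligned sequences.

    Assumption: identity is counted over columns where neither
    sequence has a gap character.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for t, s in zip(aligned_target, aligned_template):
        if t == gap or s == gap:
            continue  # skip gapped columns
        aligned += 1
        if t == s:
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return identical, aligned, percent


# The two aligned rows shown on the slide (target first, template second).
target = "ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE"
template = "MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE"

identical, aligned, pct = percent_identity(target, template)
print(f"{identical} identities over {aligned} aligned positions ({pct:.1f}%)")
```

Other denominators (full alignment length, or the length of the shorter sequence) are also in use, so reported identities are only comparable when the convention is stated.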

Template Search Methods
• Sequence similarity searches (BLAST, FastA)
• Profile and iterative methods (HMMs, PSI-BLAST)
• Structure-based threading (THREADER, PROFIT)

Target – Template Alignment Methods
• Dynamic Programming Pairwise Alignments
• Multiple Alignments, Profiles, HMMs
• Structure-based approaches (Threading)
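
The first method in this list, pairwise alignment by dynamic programming, can be sketched with a small global (Needleman-Wunsch style) aligner. This is an illustrative sketch only: the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for simplicity, whereas practical protein alignment would use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not parameters from the lecture.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

The table fill costs time and memory proportional to the product of the two sequence lengths, which is one reason database-scale template searches rely on faster heuristics such as BLAST.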

Model Building Methods
• Rigid Body Assembly (COMPOSER)
• Segment Matching (SEGMOD)
• Satisfaction of Spatial Restraints (MODELLER)
A. Šali, Curr. Opin. Biotech. 6, 437, 1995
R. Sánchez & A. Šali, Curr. Opin. Str. Biol. 7, 206, 1997

Comparative Modeling by MODELLER
3D  GKITFYERGFQGHCYESDC-NLQP
SEQ GKITFYERG---RCYESDCPNLQP
1) EXTRACT Spatial Restraints
$F(R) = \prod_i p_i(f_i / I)$
2) SATISFY Spatial Restraints
http://guitar.rockefeller.edu/modeller/
A. Šali & T. Blundell, J. Mol. Biol. 234, 779, 1993
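
The slide writes the objective F(R) as a product of per-feature probability densities. A toy Python sketch of that idea follows. It assumes every restraint is a single Gaussian on an interatomic distance and scores a structure by the negative log of the product; both are simplifications made here for illustration, and the function names, coordinates, and restraint values are hypothetical rather than taken from MODELLER.

```python
import math


def log_gaussian(x, mean, sigma):
    """Log of a Gaussian probability density evaluated at x."""
    z = (x - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def neg_log_objective(coords, restraints):
    """-ln of the product of restraint densities (lower is better).

    coords: list of (x, y, z) tuples.
    restraints: list of (i, j, mean_distance, sigma) tuples; a
    hypothetical, simplified restraint format.
    """
    total = 0.0
    for i, j, mean, sigma in restraints:
        d = math.dist(coords[i], coords[j])  # requires Python 3.8+
        total -= log_gaussian(d, mean, sigma)
    return total


# Hypothetical restraints on three points (distances in angstroms).
restraints = [(0, 1, 3.8, 0.1), (1, 2, 3.8, 0.1), (0, 2, 6.0, 0.5)]

consistent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (4.74, 3.68, 0.0)]
distorted = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.74, 3.68, 0.0)]

print(neg_log_objective(consistent, restraints))
print(neg_log_objective(distorted, restraints))
```

Working with the sum of log densities instead of the raw product avoids numerical underflow when many restraints are combined; a structure that violates the restraints gets a higher value than one that satisfies them.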

Model Evaluation Methods
• Stereochemistry (PROCHECK)
• Environment (Profiles3D)
• Statistical potential based methods (PROSAII)

Model Evaluation: Alignment Errors R. Sánchez & A. Šali, Proteins, Suppl. 1, 50-58, 1997

Are models useful if they are just copies of the template?

Predicting features of a model that are not present in the template
Do mast cell proteases bind proteoglycans? Where? When?
• mMCPs bind negatively charged proteoglycans through electrostatic interactions?
• Comparative models used to find clusters of positively charged surface residues.
• Tested by site-directed mutagenesis.
Figure panels: Native mMCP-7 at pH=5 (His+); Native mMCP-7 at pH=7 (His0)
Huang et al. J. Clin. Immunol. 18, 169, 1998.
Matsumoto et al. J. Biol. Chem. 270, 19524, 1995.
Šali et al. J. Biol. Chem. 268, 9023, 1993.

Model Accuracy

Typical Errors in Comparative Models
• Incorrect template
• Misalignment
• Distortion in correctly aligned regions
• Region without a template
• Sidechain packing
Figure legend: MODEL, X-RAY, TEMPLATE

CASP: Lessons from Blind Predictions Build models for proteins of unknown structure. Structures are determined after the models are submitted. Models are evaluated by comparing them with the corresponding experimental structures.

CASP: Lessons from Blind Predictions: Multiple Template Models
• Comparative modeling (by MODELLER) can combine the best regions from each template.
• The per-residue accuracy of comparative models cannot be higher than that of any of the templates.
• The overall accuracy of models can be higher than that of any of the templates.

CASP: Lessons from Blind Predictions (DFR) R. Sánchez & A. Šali, Proteins, Suppl. 1, 50-58, 1997

Model Accuracy as a Function of Target-Template Sequence Identity

Some Models Can Be Surprisingly Accurate (in Some Regions)
Figure labels: 25% sequence identity; 24% sequence identity; YGL203C, 1ac5; YJL001W, 1rypH; His 488, Ser 176, Asp 383

Applications of Comparative Models

Loop Modeling in Protein Structures
• α+β barrel: flavodoxin
• IG fold: immunoglobulin
• antiparallel β-barrel
A. Fiser, R. Do & A. Šali, Prot. Sci. 9, 1753, 2000

Loop modeling strategies: database search and conformational search
• database is complete only up to 4-6 residues
• even in DB search, the different conformations must be ranked
• loops longer than 4 residues need extensive optimization
• DB method is efficient for specific families (e.g., canonical loops in Igs, β-hairpins, etc.)

Loop Modeling by Conformational Search • Protein representation. • Energy (scoring) function. • Optimization algorithm.

Energy Function for Loop Modeling
The energy function is a sum of many terms:
1) Statistical preferences for dihedral angles
2) Restraints from the CHARMM-22 force field
3) Statistical potential for non-bonded contacts

Mainchain Terms for Loop Modeling

Optimization of Objective Function

Calculating an Ensemble of Loop Models

Accuracy of loop models

Assessing Accuracy of Loop Models

Accuracy of Loop Modeling
• HIGH ACCURACY (<1Å): 50% (30%) of 8-residue loops (example: RMSD=0.6Å)
• MEDIUM ACCURACY (<2Å): 40% (48%) of 8-residue loops (example: RMSD=1.1Å)
• LOW ACCURACY (>2Å): 10% (22%) of 8-residue loops (example: RMSD=2.8Å)
A. Fiser, R. Do & A. Šali, Prot. Sci. 9, 1753, 2000
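
The accuracy classes above are defined by RMSD thresholds. A minimal Python sketch of the RMSD calculation and the classification follows. It assumes the two coordinate sets are already superposed and list the same atoms in the same order; the function names and the treatment of a value of exactly 2 Å as low accuracy are assumptions made here, since the slide only gives <1Å, <2Å and >2Å.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length coordinate lists.

    Assumption: the structures are already superposed; no fitting is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


def accuracy_class(rmsd_value):
    """Map an RMSD in angstroms to the slide's accuracy classes."""
    if rmsd_value < 1.0:
        return "high"
    if rmsd_value < 2.0:
        return "medium"
    return "low"


# The three example loops quoted on the slide.
for value in (0.6, 1.1, 2.8):
    print(value, accuracy_class(value))
```

Whether the superposition uses only the loop atoms or the whole protein changes the reported number, so the fitting convention should accompany any RMSD used with these thresholds.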

Fraction of Loops Modeled With at Least Medium Accuracy

Problems in Practical Loop Modeling
• Decide which regions to model as loops.
• Correct alignment of anchor regions & environment.
• Modeling of a loop.
T0076: 46-53, RMSDmnch loop = 1.37 Å, RMSDmnch anchors = 1.52 Å
T0058: 80-85, RMSDmnch loop = 1.09 Å, RMSDmnch anchors = 0.29 Å

How can Comparative Modeling be used in Structural Genomics?

Structural Genomics
• Definition: The aim of structural genomics is to put every protein sequence within a modeling distance of a known protein structure.
• Size of the problem:
  • There are a few thousand domain fold families.
  • There are ~20,000 sequence families (30% sequence id).
• Solution:
  • Determine many protein structures.
  • Increase modeling distance.
Šali. Nat. Struct. Biol. 5, 1029, 1998.
Burley et al. Nat. Genet. 23, 151, 1999.
Šali & Kuriyan. TIBS 22, M20, 1999.
Sanchez et al. Nat. Str. Biol. 7, 986, 2000

How can Comparative Modeling be used in Structural Genomics?
• Target Selection: How many structures need to be solved? Which structures should we solve first?
• Target Amplification: How much of the sequence space is covered by a new structure? By all structures?

Target Selection for Structural Genomics
Select targets such that every protein sequence is within a modeling distance of a known protein structure.
Modeling distance: correct alignment, corresponding to >30% sequence identity.
G. Kurban, R. Sánchez, A. Šali, T. Gaasterland.

Leveraging Templates by Comparative Modeling: Quantifying Productivity of Structural Genomics
http://www.nysgrc.org
Models are in MODBASE at http://guitar.rockefeller.edu/modbase/

MODPIPE: Large-Scale Comparative Protein Structure Modeling
START
For each sequence:
1) Prepare PSI-BLAST PSSM by comparing the sequence against the NR database of sequences
2) Use the sequence PSSM to search against the representative set of PDB chains (F and no-F)
3) Use the PDB chain PSSMs to search against the sequence (F and no-F)
4) Select templates using a permissive E-value cutoff
For each template:
5) Align the matched part of the target sequence with the template structure
6) Build a model for the target segment by satisfaction of spatial restraints
7) Evaluate the model
END
Programs labeled in the flowchart: PSI-BLAST, MODELLER
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998
R. Sánchez, F. Melo, N. Mirkovic, A. Šali, in preparation

MODPIPE Model of Yeast Hypothetical Protein YIL073C
YIL073C model; PDB 1a17 template
E-value = 65, Seq. Id. = 20%, pG = 0.97
Das et al. EMBO J. 17, 1192, 1998
The tetratricopeptide repeat (TPR) is a degenerate 34 aa sequence identified in a variety of proteins. It is present in tandem arrays and mediates protein-protein interactions.
R. Sánchez, F. Melo, N. Mirkovic, A. Šali.

Mycoplasma genitalium MODPIPE Models

Factors affecting coverage:
• PDB growth
• Fold assignments
• Reliable models

Organism Statistics: Top 10 organisms by number of models

MODBASE
R. Sánchez, U. Pieper, N. Mirkovic, P. I. W. de Bakker, E. Wittenstein, and A. Šali. Nucl. Acids Res. 28, 250, 2000
R. Sánchez and A. Šali. Bioinformatics 15, 1060, 1999
