Bioinformatics

Bioinformatics Ayesha M. Khan Spring 2013

Protein Modelling

What is protein modelling?
• Suppose we have no resources or expertise for X-ray crystallography or NMR, only the protein sequence (the target), and we would like to know its 3D structure.
• Computational methods can provide a useful model and fill the gap between sequence space and structure space.
• Protein modelling is the only way to obtain structural information when an experiment is difficult to pursue.
• Many proteins are simply too large for NMR analysis and cannot be crystallized for X-ray diffraction; protein modelling acts as a substitute in these cases.
• Experimental structure determination is time consuming as well.

Methods of protein modelling

Comparative modelling
• Uses previously solved structures as starting points, or “templates”.
• Why is it effective? Although the number of actual proteins is vast, there is a limited set of tertiary structural motifs to which most proteins belong.
• There are estimated to be only around 2000 distinct protein folds in nature, although there are several million different proteins.
• When the structure of one protein in a family has been determined experimentally, the other members of the same family can be modelled based on their alignment to the known structure.

Homology modelling
• Based on the reasonable assumption that two homologous proteins will share very similar structures.
• Given the amino acid sequence of an unknown structure (the target) and the solved structure of a homologous protein (the template), each amino acid in the solved structure is mutated, computationally, into the corresponding amino acid from the unknown structure.
• In other words, homology modelling predicts the 3D structure of a target protein from its amino acid sequence and the X-ray or NMR structure of a homologous (template) protein.
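The residue-by-residue substitution described above can be sketched as a walk along a pairwise alignment. This is only an illustration: the function name, the example sequences and the use of "-" as the gap character are assumptions, and a real modelling program would also rebuild side chains and loops rather than just list the changes.

```python
def plan_substitutions(template_aln, target_aln):
    """Walk a gapped pairwise alignment (hypothetical input format, '-' = gap).

    Returns three lists:
      mapped     -> (template position, template residue, target residue)
      insertions -> (template position the insertion follows, target residue)
      deletions  -> template positions with no counterpart in the target
    Template positions are 1-based.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")

    mapped, insertions, deletions = [], [], []
    template_pos = 0
    for t_res, q_res in zip(template_aln, target_aln):
        if t_res != "-":
            template_pos += 1
        if t_res == "-" and q_res == "-":
            continue  # empty column, nothing to do
        if t_res == "-":
            # Target residue with no template coordinates: must be built (e.g. as a loop).
            insertions.append((template_pos, q_res))
        elif q_res == "-":
            # Template residue absent from the target: dropped from the model.
            deletions.append(template_pos)
        else:
            # Template residue is "mutated" into the aligned target residue.
            mapped.append((template_pos, t_res, q_res))
    return mapped, insertions, deletions


if __name__ == "__main__":
    mapped, insertions, deletions = plan_substitutions("MKT-AYIA", "MRTGAY-A")
    for pos, old, new in mapped:
        note = "kept" if old == new else "mutated"
        print(f"template {pos}: {old} -> {new} ({note})")
    print("insertions to build:", insertions)
    print("template positions deleted:", deletions)
```

The insertions and deletions are the positions where the template provides no direct guidance, which is why they tend to be the least reliable parts of a model.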

Programs for homology modeling
Many programs for automated homology modeling are now available, so anyone can construct a homology model on a regular PC. However, construction of a “good” homology model (at least for sequences that are not highly similar) usually requires some expertise and should be done with human intervention, rather than in a fully automated fashion.
A few of the freely available programs for homology modeling:
• SWISS-MODEL: produces accurate models; fast; good tutorials available. http://swissmodel.expasy.org/
• I-TASSER: produces accurate models; easy to use, but slow. http://zhanglab.ccmb.med.umich.edu/I-TASSER/
• Modeller: must be downloaded and installed locally. http://salilab.org/modeller/modeller.html

Databases of homology models: The rate of new protein sequence determination is far outpacing the rate of structure determination by X-ray crystallography and NMR. Therefore, initiatives are underway to automatically generate homology models for large numbers of new protein sequences. One database of automatically generated homology models is SWISS-MODEL Repository: http://swissmodel.expasy.org/repository/

Is a homology model CORRECT?
Since the actual (experimentally determined) structure of the target is not known, there is no way to say whether or not the homology model is “correct.” The best a researcher can do is compare the homology model to the structure of the template from which it was derived. If the atom positions in the model do not deviate very much from those of the template, the homology model is said to be “accurate.” The greater the deviation between model and template, the lower the accuracy of the model.
When is a homology model definitely INCORRECT?
A homology model has regions that are incorrect if it contains structural features that do not occur in native proteins, such as:
• Hydrophobic side chains on the surface of the model (these side chains should be buried)
• Buried polar or ionic groups that do not have their hydrogen-bonding or ionic-bonding capabilities “satisfied” by neighboring groups
• Unreasonable bond lengths or angles
• Unfavorable noncovalent contacts between atoms (clashes)
• Unreasonable dihedral angles
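One common way to put a number on the deviation between model and template atom positions is the root-mean-square deviation (RMSD). The slides do not name a specific measure, so RMSD is an assumption here. The sketch below also assumes that both coordinate lists contain equivalent atoms in the same order and that the two structures have already been superimposed; the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in the input units, e.g. angstroms)
    between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")

    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


if __name__ == "__main__":
    # Hypothetical C-alpha coordinates for three equivalent residues.
    template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    model = [(0.0, 0.5, 0.0), (3.8, 0.0, 0.5), (7.6, -0.5, 0.0)]
    print(f"RMSD = {rmsd(template, model):.2f}")
```

A lower value means the model stays closer to the template. It says nothing about whether the model matches the true structure of the target.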

Accuracy of homology modeling
• Template selection and alignment accuracy are crucial to the accuracy of a homology model.
• The accuracy of the model depends on the percentage of sequence identity between the target and template. On average, the coordinate agreement between the modeled structure and the actual structure worsens by ~0.3 Å for each 10% reduction in sequence identity.
• The largest structural differences between homologous proteins are in surface loops; the structure of the protein core is more highly conserved. Therefore, the regions most likely to be in error in a homology model are the surface loops.

Accuracy of homology modeling (contd)
• High-accuracy homology models can be built when the target and template have 50% or greater sequence identity. Errors are mostly mistakes in side-chain packing, small shifts of the core backbone regions, and occasionally larger errors in loops.
• Medium-accuracy homology models can be built when the proteins share 30-50% sequence identity. There can be alignment mistakes, and there are more frequent side-chain packing, core distortion, and loop modeling errors.
• Low-accuracy homology models are based on proteins that share <30% sequence identity. If a model is based on an almost insignificant alignment to a known structure, the model may have an entirely incorrect fold.
• The best model-building programs will produce models of similar accuracy, provided that the methods are used optimally.
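The identity thresholds above can be turned into a quick check on a target-template alignment. In this minimal sketch the function names and example sequences are hypothetical, and percent identity is computed over aligned columns that contain no gap, which is only one of several conventions in use.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over gap-free columns of a pairwise alignment ('-' = gap).

    Other denominators (e.g. length of the shorter sequence) give different values.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def expected_accuracy(identity):
    """Map percent identity to the accuracy tier given in the slides."""
    if identity >= 50.0:
        return "high"
    if identity >= 30.0:
        return "medium"
    return "low"


if __name__ == "__main__":
    target = "MKTAYIAKQR"
    template = "MRTAYLAHQK"
    identity = percent_identity(target, template)
    print(f"{identity:.0f}% identity -> {expected_accuracy(identity)}-accuracy model expected")
```

The tier is only a rough expectation: alignment errors can make a model worse than the identity alone would suggest.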

Building the model

Protein threading • Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures.

Threading for tertiary structure prediction
Structure is more conserved than sequence, so many proteins share similar folds, even in the absence of sequence similarity. If a suitable template does not exist for homology modeling of a target sequence, threading can be used to identify a potential structure for the target from among known structures of proteins that do not share significant sequence similarity with the target sequence.
Threading predicts the structural fold of a protein by fitting its sequence into a structural database and selecting the best-fitting fold. Essentially, the target sequence is tested for compatibility with all structures in the database. Various methods are used to compare the target sequence to the known structures and determine which one, if any, it fits best.
Unlike homology modeling, threading does not result in an all-atom structural model for the target sequence. Nevertheless, these relatively poor models can still potentially provide insight into the function of a new protein. There is a high rate of false positives when using threading.
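The idea of testing a sequence for compatibility with every structure in a database can be illustrated with a deliberately simple toy. Everything below is an assumption made for illustration: each template is reduced to a string of burial states ("B" buried, "E" exposed), the score just rewards hydrophobic residues in buried positions and other residues in exposed positions, no gaps are allowed, and the fold names are invented. Real threading programs use far richer scoring and alignment.

```python
# Residues treated as hydrophobic in this toy (a simplification).
HYDROPHOBIC = set("AVLIMFW")


def position_score(residue, environment):
    """+1 if residue type suits the environment, -1 otherwise."""
    is_hydrophobic = residue in HYDROPHOBIC
    is_buried = environment == "B"
    return 1 if is_hydrophobic == is_buried else -1


def best_ungapped_fit(sequence, environments):
    """Slide the sequence along the template; return (best score, offset) or None."""
    n, m = len(sequence), len(environments)
    if n == 0 or n > m:
        return None
    best = None
    for offset in range(m - n + 1):
        score = sum(
            position_score(res, environments[offset + i])
            for i, res in enumerate(sequence)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def rank_templates(sequence, library):
    """Score the sequence against every template, best fit first."""
    results = []
    for name, environments in library.items():
        fit = best_ungapped_fit(sequence, environments)
        if fit is not None:
            results.append((fit[0], name, fit[1]))
    results.sort(key=lambda item: (-item[0], item[1]))
    return results


if __name__ == "__main__":
    # Hypothetical fold library: name -> burial pattern.
    library = {"fold_alpha": "EEBEBEBEE", "fold_beta": "BBBBEEEE"}
    for score, name, offset in rank_templates("LKVEL", library):
        print(f"{name}: score {score} at offset {offset}")
```

Even in this toy every template receives some score, so a best hit always exists whether or not it is meaningful. That is one way to see why threading produces false positives.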

A few of the freely available programs for threading:
• GenTHREADER: another version, called pGenTHREADER, makes use of profiles and predicted secondary structure to increase accuracy. http://bioinf.cs.ucl.ac.uk/web_servers/
• 3D-PSSM: beware, the template library may be outdated. http://www.sbg.bio.ic.ac.uk/~3dpssm/index2.html
• 3D-JIGSAW: http://bmm.cancerresearchuk.org/~3djigsaw/

Ab initio structural prediction
• Ab initio predictions are based on sequence information only, without the aid of any known structures.
• Since proteins fold on their own to their correct structures, there must be information about that structure inherent in the amino acid sequence. Ab initio methods try to use what is currently known about the physicochemical laws governing protein folding to predict the structure of a protein from its sequence.
• The normal, functional structure of a protein (its “native state”) is often a conformation that has the lowest possible free energy. Ab initio methods predict a structure for a target protein by attempting to find the lowest energy conformation that the polypeptide chain can adopt.
• One approach would be to try out ALL possible conformations to determine which one has the lowest energy. However, this is not computationally possible at this time. (It would take 10^20 years for a 40-residue protein!) Ab initio methods, therefore, use a variety of heuristic approaches to sample only some of the possible conformations in an attempt to find the one with the lowest energy.
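A back-of-the-envelope calculation shows why exhaustive search is hopeless. The numbers used here are illustrative assumptions, not values from the slides: 10 possible conformational states per residue and 10^13 conformations evaluated per second. Other assumptions shift the result by orders of magnitude, but it remains astronomically large.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_residues, states_per_residue, conformations_per_second):
    """Years needed to enumerate every conformation of a chain.

    Assumes each residue independently adopts one of `states_per_residue`
    states, so the chain has states_per_residue ** n_residues conformations.
    """
    total_conformations = states_per_residue ** n_residues
    seconds = total_conformations / conformations_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Assumed values for illustration only.
    years = exhaustive_search_years(
        n_residues=40, states_per_residue=10, conformations_per_second=1e13
    )
    print(f"about {years:.1e} years")
```

With these assumed values the estimate is on the order of 10^19 years for a 40-residue chain, far longer than the age of the universe. This is why heuristic sampling is used instead.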

Ab initio structural prediction (contd)
• Ab initio methods are largely unsuccessful.
• Ab initio methods are useful only in cases where homology modeling and threading fail, and then the prediction should be interpreted very cautiously.
• Structural proteomics efforts are underway which may soon make ab initio methods largely obsolete. It is estimated that most of the possible structural folds have already been solved by X-ray diffraction or NMR.
• When at least one protein structure from each possible fold has been determined experimentally, it will be possible to predict other structures using homology modeling and threading, obviating the need for ab initio methods.
• In the meantime, ab initio methods are useful for exploring the relationship between sequence and structure, and for providing insight into the process of protein folding.
• In addition, ab initio methods can sometimes be useful during homology modeling for building portions of proteins not present in the template structure (loop building).

Accuracy and application of protein structure models. Structures A-C are homology models based on about 60% (A), 40% (B), and 30% (C) sequence identity to their template structure. Structures D and E are ab initio predictions using a program called Rosetta. Predicted structures are in red, and actual structures are in blue. The accuracy of the models decreases significantly in going from A to E, but the overall structure is still roughly correct.

CASP: Critical Assessment of Techniques for Protein Structure Prediction
CASP is an international contest held every two years in which scientists try to predict the structures of proteins using methods they have developed, including homology modeling, threading, and ab initio techniques. Contestants are given the sequences of proteins whose structures have been determined by X-ray crystallography or NMR but have not yet been made public. After contestants have submitted their predictions, the actual structures are released, the predictions are compared to the actual structures, and the predictions are assessed for accuracy. The CASP contest is a major driving force in the development of tertiary structure prediction methods. CASP began in 1994; CASP10 was held in 2012.

When dealing with predicted protein structures, it is important to remember: “Models are not molecules observed.”
No matter how they are obtained, before we ask what they tell us, we must ask how well macromolecular models fit with other things we already know. A model is like any scientific theory: it is useful only to the extent that it supports predictions that we can test by experiment. Our initial confidence in it is justified only to the extent that it fits what we already know. Our confidence can grow only if its predictions are verified.
