G53BIO – Bioinformatics http://www.cs.nott.ac.uk/~jqb/G53BIO. Protein Structure Prediction Dr. Jaume Bacardit – jqb@cs.nott.ac.uk Prof. Natalio Krasnogor – nxk@cs.nott.ac.uk.
Some material taken from Arthur Lesk, "Introduction to Bioinformatics", 2nd edition, Oxford University Press, 2005, and Anna Tramontano, "Introduction to Bioinformatics"
Outline
• Introduction and motivation
• PSP: A family of problems
• Prediction of structural aspects of protein residues
• Prediction of the 3D structure of proteins
• Assessment of PSP quality: CASP
• Summary
Protein Structure: Introduction
• Proteins are molecules of primary importance for the functioning of life
  • Structural proteins (collagen, nails, hair, etc.)
  • Enzymes
  • Transmembrane proteins
• Proteins are polypeptide chains built by joining amino acids linearly through peptide bonds
• The chain of amino acids, however, folds to create very complex 3D structures
• There is a general consensus that the end state of the folding process depends on the amino acid sequence of the chain
Motivation for PSP
• The function of a protein depends greatly on its structure
• The structure that a protein adopts is vital to its chemistry
  • Its structure determines which of its amino acids are exposed to carry out the protein's function
  • Its structure also determines what substrates it can react with
• However, the structure of a protein is very difficult to determine experimentally, and in some cases almost impossible
Protein Structure Prediction
• That is why we have to predict it
• PSP aims to predict the 3D structure of a protein based on its primary sequence
Impact of PSP
• PSP is an open problem. The 3D structure depends on many variables
• It has been one of the main holy grails of computational biology for many decades
• Better protein structure models would have an impact in many areas
  • Genetic therapy
  • Synthesis of drugs for incurable diseases
  • Improved crops
  • Environmental remediation
Prediction types of PSP
• There are several kinds of prediction problems within the scope of PSP
• The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence
• There are many structural properties of individual residues within a protein that can be predicted, for instance:
  • The secondary structure state of the residue
  • Whether a residue is buried in the core of the protein or exposed on the surface
• Accurate predictions of these sub-problems can simplify the general 3D PSP problem
Prediction types of PSP
• There is an important distinction between the two classes of prediction
• The 3D PSP is generally treated as an optimisation problem
• The prediction of structural aspects of protein residues is generally treated as a machine learning problem
Optimisation
• Given a problem for which you have a way of assessing how good each possible solution is (an evaluation function), optimisation is the process of finding the best possible solution
• Dynamic programming (as seen for sequence alignment) is an optimisation method
• Genetic Algorithms are another example of an optimisation method
• The key difference between them is how they explore the space of candidate solutions
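To make the roles of the evaluation function and the search over candidate solutions concrete, here is a minimal sketch. It uses steepest-ascent hill climbing, which is a different and much simpler technique than the dynamic programming and Genetic Algorithms named on the slide. The toy bit-string problem and the names `evaluate` and `hill_climb` are illustrative assumptions, not part of the course material.

```python
def evaluate(bits):
    """Toy evaluation function (an assumption): the number of 1s in the string."""
    return sum(bits)


def hill_climb(start, evaluate):
    """Steepest-ascent hill climbing over bit strings.

    At each step, look at every candidate that differs from the current
    solution in exactly one bit and move to the best one. Stop when no
    neighbour improves the evaluation function.
    """
    current = list(start)
    while True:
        neighbours = []
        for k in range(len(current)):
            candidate = current.copy()
            candidate[k] = 1 - candidate[k]  # flip bit k
            neighbours.append(candidate)
        best = max(neighbours, key=evaluate)
        if evaluate(best) <= evaluate(current):
            return current  # local optimum reached
        current = best


solution = hill_climb([0, 1, 0, 0, 1], evaluate)
print(solution, evaluate(solution))
```

Other optimisation methods would keep the same `evaluate` function and change only how candidates are generated and selected.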
Machine Learning
• Machine learning: How to construct programs that automatically learn from experience [Mitchell 1997]
• ML is a Computer Science discipline, part of the Artificial Intelligence field
• Its goal is to automatically construct a description of some phenomenon from a set of data extracted from previous observations of it, so that the phenomenon can be predicted in the future
Flow of data in machine learning
[Figure: flow diagram with the elements Training Set, Learning Method, Theory, Unknown instance and Class]
• Specifically, we are concerned with supervised learning, that is, when we know the solution for the training data
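Types of machine learning
• Rule learning
[Figure: a rule set drawn over the unit square, with X and Y axes running from 0 to 1. The class assigned by each rule was shown graphically and is not recoverable from the text]
  If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then [class]
  If (X>0.75 and Y>0.75) then [class]
  If (X<0.25 and Y<0.25) then [class]
  Everything else [class]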
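A rule set of this kind can be read as an ordered list of tests in which the first matching rule decides the class. The sketch below encodes the three conditions from the slide; the class labels (`"class_1"`, `"class_2"`, `"class_3"`, `"default"`) are hypothetical placeholders, since the original labels were only shown in the figure.

```python
def classify(x, y):
    """Apply the rule set in order; the first matching rule decides the class.

    The class labels are hypothetical placeholders, not taken from the slide.
    """
    if (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25):
        return "class_1"
    if x > 0.75 and y > 0.75:
        return "class_2"
    if x < 0.25 and y < 0.25:
        return "class_3"
    return "default"  # the "everything else" rule


for point in [(0.1, 0.9), (0.9, 0.9), (0.1, 0.1), (0.5, 0.5)]:
    print(point, classify(*point))
```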
Other machine learning techniques
• Other methods that have also been used in PSP are
  • Artificial Neural Networks
  • Support Vector Machines
  • Hidden Markov Models
• If you are interested in the technology side of PSP, a good book is "Bioinformatics: The Machine Learning Approach" by Baldi and Brunak
Prediction of structural aspects of protein residues
• Many of these features are due to local interactions of an amino acid and its immediate neighbours
• Can they be predicted using information from the closest neighbours in the chain?
• In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1, that is, a window of ±1 residues around the target
[Figure: a chain of residues R(i-5) … R(i+5) with their secondary structure states SS(i-5) … SS(i+5), and the windowed instances generated from it]
  R(i-1) R(i) R(i+1) → SS(i)
  R(i) R(i+1) R(i+2) → SS(i+1)
  R(i+1) R(i+2) R(i+3) → SS(i+2)
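The windowing scheme above can be sketched as follows. The function name `make_windows`, the padding symbol `"X"` used beyond the chain ends, and the toy sequence and SS string are assumptions made for illustration; the slides do not say how chain ends are handled.

```python
def make_windows(residues, ss_states, half_width=1, pad="X"):
    """Turn a chain into (window, target) training instances.

    Each instance pairs the residues in positions i-half_width .. i+half_width
    with the secondary structure state of residue i. Positions that fall
    outside the chain are filled with `pad` (an assumption).
    """
    padded = [pad] * half_width + list(residues) + [pad] * half_width
    instances = []
    for i, target in enumerate(ss_states):
        window = padded[i : i + 2 * half_width + 1]
        instances.append((window, target))
    return instances


# Toy data: the SS string is made up for the example.
for window, target in make_windows("AMEKV", "CHHHC"):
    print(window, "->", target)
```

Each `(window, target)` pair is one instance of a supervised learning problem: the window is the input and the SS state of the central residue is the class.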
What information do we include for each residue?
• Early prediction methods used just the primary sequence, i.e. the AA types of the residues in the window
• However, the primary sequence carries a limited amount of information
• It does not contain any evolutionary information: it does not say which residues are conserved and which are not
• Where can we obtain this information?
  • Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment
Position-Specific Scoring Matrices (PSSM)
• For each residue in the query sequence, compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query)
• These distributions tell us which mutations are likely and which are less likely for each residue in the query sequence
• In essence, it is similar to a substitution matrix, but tailored to the sequence that we are aligning
• A PSSM profile also tells us which residues are more conserved and which residues are more subject to insertions or deletions
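The first step described above, computing a per-position amino acid distribution from the aligned sequences, can be sketched as below. This is a simplified illustration and not the procedure of any particular tool: the function names, the `max_identity` threshold of 0.9, the treatment of gaps (`"-"` is ignored) and the toy alignment are all assumptions. PSSM tools typically go on to convert such frequencies into log-odds scores, which is presumably why PSSMs are usually shown as integer scores rather than frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def identity(a, b):
    """Fraction of aligned (non-gap) positions at which two sequences agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def column_distributions(query, aligned, max_identity=0.9):
    """Amino acid frequencies for each query residue.

    `query` and every sequence in `aligned` are rows of the same multiple
    sequence alignment. Sequences whose identity to the query exceeds
    `max_identity` (a hypothetical threshold) are discarded.
    """
    kept = [s for s in aligned if identity(query, s) <= max_identity]
    profile = []
    for col, q in enumerate(query):
        if q == "-":
            continue  # alignment column with no query residue
        counts = Counter(s[col] for s in kept if s[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment; "AMEK" is identical to the query and is discarded.
profile = column_distributions("AMEK", ["AMDK", "VMEK", "AMEK", "ALER"])
print({aa: round(p, 2) for aa, p in profile[0].items() if p > 0})
```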
PSSM for the first 10 residues of 1n7lA
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3
Secondary Structure Prediction
• The most usual way is to predict whether a residue belongs to an α helix, a β sheet, or is in coil state
• Several programs can determine the actual SS state of a protein from a PDB file. The most common of them is DSSP
• Typically, a window of ±7 amino acids (15 in total) is used
Secondary Structure Prediction
[Figure: prediction pipeline. Primary sequence (R1, R2, R3, …, Rn-1, Rn) → MSA → PSSM profile of the sequence (PSSM1, PSSM2, PSSM3, …, PSSMn-1, PSSMn) → windows generation → window of PSSM profiles (PSSMi-1, PSSMi, PSSMi+1) → prediction method → prediction of SSi]
• The most popular public SS predictor is PSIPRED
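As a rough sketch of the "windows generation" step, the function below concatenates the PSSM rows in a window of ±7 residues into one flat feature vector of 15 × 20 = 300 values. The function name, the zero padding used beyond the chain ends and the toy PSSM are assumptions; this is not how PSIPRED or any specific predictor necessarily encodes its input.

```python
def pssm_window_features(pssm, i, half_width=7):
    """Flatten the PSSM rows for residues i-half_width .. i+half_width.

    `pssm` is a list with one row of scores per residue. Positions outside
    the chain contribute a row of zeros (an assumption).
    """
    n_cols = len(pssm[0])
    features = []
    for j in range(i - half_width, i + half_width + 1):
        if 0 <= j < len(pssm):
            features.extend(pssm[j])
        else:
            features.extend([0] * n_cols)
    return features


# Toy PSSM with 30 residues: row k is filled with the value k + 1.
toy_pssm = [[k + 1] * 20 for k in range(30)]
features = pssm_window_features(toy_pssm, i=10)
print(len(features))  # 15 residues x 20 scores per residue
```

The resulting vector, paired with the known SS state of residue i, is what a prediction method would be trained on.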
Coordination Number Prediction
• Two residues of a chain are said to be in contact if the distance between them is less than a certain threshold (e.g. 8 Å)
• CN of a residue: the number of contacts that the residue has
• CN gives us a simplified profile of the packing density of the protein
[Figure: a primary sequence, its native state, and a contact between two residues]
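The definition above translates directly into code. In this sketch each residue is represented by a single 3D point; which atom that point stands for, and whether neighbours along the chain should count as contacts, are not specified on the slide, so the `min_separation` parameter and the toy coordinates are assumptions.

```python
import math


def coordination_numbers(coords, threshold=8.0, min_separation=1):
    """Count, for every residue, how many other residues lie within `threshold`.

    `coords` holds one (x, y, z) point per residue, in angstroms.
    Pairs closer than `min_separation` positions along the chain are
    skipped; the default of 1 counts every pair (an assumption).
    """
    n = len(coords)
    cn = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cn[i] += 1
                cn[j] += 1
    return cn


# Toy chain: four points on a straight line, 5 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
print(coordination_numbers(toy_coords))
```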
Example of a rule set for CN prediction
• All AA types associated with the central residue are hydrophobic (core of a protein)
• D and E consistently do not appear in the predicates. They are negatively charged residues (surface of a protein)
Other predictions
• Other kinds of residue structural aspects that can be predicted
  • Solvent accessibility: the amount of surface of each residue that is exposed to solvent
  • Recursive Convex Hull: a metric that models a protein as an onion and assigns each residue to a layer. Formally, each layer is a convex hull of points
• These features (and others) are predicted in a similar way to SS or CN
Contact Map prediction
• Prediction, given two residues from a chain, of whether these two residues are in contact or not
• This problem can be represented by a binary matrix: 1 = contact, 0 = no contact
• Plotting this matrix reveals many characteristics of the protein structure, such as helices and sheets
Contact Map Prediction
• Instead of a single window around the target, there are now two windows, one around each residue of the pair whose contact is being predicted
• Many methods also use a third window, placed at the midpoint of the chain segment between the two target residues
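The three-window layout can be sketched as index arithmetic over chain positions. The function names, the window half-width of 4 and the use of `None` for positions beyond the chain ends are assumptions for illustration; the slides do not give the window sizes.

```python
def window_positions(centre, half_width, length):
    """Chain positions covered by a window; None marks positions off the chain."""
    return [
        p if 0 <= p < length else None
        for p in range(centre - half_width, centre + half_width + 1)
    ]


def contact_pair_windows(i, j, length, half_width=4):
    """Windows around residues i and j, plus one around their midpoint."""
    middle = (i + j) // 2
    return {
        "first": window_positions(i, half_width, length),
        "second": window_positions(j, half_width, length),
        "middle": window_positions(middle, half_width, length),
    }


windows = contact_pair_windows(2, 11, length=20, half_width=1)
for name, positions in windows.items():
    print(name, positions)
```

Per-residue information for every position in the three windows would then be concatenated into one instance for the pair (i, j).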
Contact Map prediction at Nottingham
• For each position in these 3 windows we include:
  • PSSM profile
  • Predicted SS, SA, RCH and CN
• The whole connecting segment between the two targets is represented as
  • Distribution of AA and predicted SS, SA, RCH and CN
Contact Map prediction at Nottingham
• Moreover, global protein information is also included
  • Sequence length
  • Separation between target residues
  • Contact propensity of target residues
  • Distribution of AA and predicted SS, SA, RCH and CN of the whole chain
• Each instance is represented by 631 variables
Contact Map prediction at Nottingham
Training set
• Training set of 2413 proteins selected to represent a broad set of sequences
• 32 million pairs of amino acids (instances in the training set), of which less than 2% are real contacts
• Each instance is characterized by up to 631 attributes
• 50 samples of ~660,000 examples are generated from the training set. Each sample contains two no-contact instances for each contact instance
• The BioHEL GBML method (Bacardit et al., 2009) was run 25 times on each sample
• An ensemble of 1250 rule sets (50 samples × 25 seeds) makes the contact map predictions using simple consensus voting
• Confidence is computed from the distribution of votes in the ensemble
[Figure: training set → ×50 samples → ×25 rule sets per sample → consensus → predictions]
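Simple consensus voting over an ensemble can be sketched as below. The slides only say that confidence is computed from the distribution of votes, so defining it as the fraction of votes for the winning label is an assumption, as are the function name and the toy vote counts.

```python
from collections import Counter


def consensus_vote(votes):
    """Return the majority label and the fraction of votes it received.

    `votes` holds one predicted label per ensemble member
    (1 = contact, 0 = no contact). Using the winning fraction as the
    confidence is an assumption, not the documented BioHEL formula.
    """
    counts = Counter(votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(votes)


# Toy example: 50 samples x 25 seeds = 1250 rule sets voting on one pair.
votes = [1] * 900 + [0] * 350
label, confidence = consensus_vote(votes)
print(label, confidence)
```

Ranking residue pairs by such a confidence value is one way to pick the most reliable predicted contacts; how ties between labels are broken is left unspecified in this sketch.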