Protein structure prediction: The holy grail of bioinformatics. Proteins: Four levels of structural organization: Primary structure Secondary structure Tertiary structure Quaternary structure. Primary structure = the linear amino acid sequence.
Protein structure prediction: The holy grail of bioinformatics
Proteins: Four levels of structural organization: • Primary structure • Secondary structure • Tertiary structure • Quaternary structure
Secondary structure = spatial arrangement of amino-acid residues that are adjacent in the primary structure
α helix = a helical structure whose chain coils tightly as a right-handed screw, with all the side chains sticking outward in a helical array. The tight structure of the α helix is stabilized by same-strand hydrogen bonds between -NH groups and -CO groups spaced at four amino-acid residue intervals.
The β-pleated sheet is made of loosely coiled β strands that are stabilized by hydrogen bonds between -NH and -CO groups from adjacent strands.
An antiparallel β sheet. Adjacent β strands run in opposite directions. Hydrogen bonds between NH and CO groups connect each amino acid to a single amino acid on an adjacent strand, stabilizing the structure.
A parallel β sheet. Adjacent β strands run in the same direction. Hydrogen bonds connect each amino acid on one strand with two different amino acids on the adjacent strand.
α helix • β sheet (parallel and antiparallel) • tight turns • flexible loops • irregular elements (random coil)
The tertiary structure is formed by the folding of secondary structures by covalent and non-covalent forces, such as hydrogen bonds, hydrophobic interactions, salt bridges between positively and negatively charged residues, as well as disulfide bonds between pairs of cysteines.
Quaternary structure = spatial arrangement of subunits and their contacts.
Holoproteins & Apoproteins: Holoprotein = Apoprotein + Prosthetic group
Prosthetic group: Heme
Christian B. Anfinsen 1916-1995 Sela M, White FH, & Anfinsen CB. 1959. The reductive cleavage of disulfide bonds and its application to problems of protein structure. Biochim. Biophys. Acta. 31:417-426.
Not all proteins fold independently. Chaperones.
The denaturation and renaturation of proteins
Reducing agents: • Ammonium thioglycolate (alkaline), pH 9.0-10 • Glyceryl monothioglycolate (acid), pH 6.5-8.2
What do we need to know in order to state that the tertiary structure of a protein has been solved? Ideally: We need to determine the position of all atoms and their connectivity. Less ideally: We need to determine the position of all Cα atoms (backbone structure).
Protein structure: Limitations and caveats • Not all proteins or parts of proteins assume a well-defined 3D structure in solution. • Protein structure is not static; different parts of the structure show different degrees of thermal motion. • There may be a number of slightly different conformations in solution. • Some proteins undergo conformational changes when interacting with other molecules.
Experimental Protein Structure Determination • X-ray crystallography: most accurate; in vitro; needs crystals; ~$100-200K per structure • NMR: fairly accurate; in solution (no need for crystals); limited to very small proteins • Cryo-electron microscopy: imaging technology; low resolution
Why predict protein structure? • Structural knowledge = some understanding of function and mechanism of action • Predicted structures can be used in structure-based drug design • It can help us understand the effects of mutations on structure and function • It is a very interesting scientific problem (still unsolved in its most general form after more than 50 years of effort)
Secondary structure prediction • Historically, the first structure prediction methods predicted secondary structure • Can be used to improve alignment accuracy • Can be used to detect domain boundaries within proteins with remote sequence homology • Often the first step towards 3D structure prediction • Informative for mutagenesis studies
Protein Secondary Structures (Simplifications) • α-HELIX • β-STRAND • COIL (everything else)
Assumptions • The entire information for forming secondary structure is contained in the primary sequence • Side groups of residues will determine structure • Examining windows of 13-17 residues is sufficient to predict secondary structure • α-helices are 5-40 residues long • β-strands are 5-10 residues long
Predicting Secondary Structure From Primary Structure • Accuracy 64-75% • Higher accuracy for α-helices than for β-sheets • Accuracy is dependent on protein family • Predictions of engineered (artificial) proteins are less accurate
A surprising result! Chameleon sequences
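The “Chameleon” sequence
Original: TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK (sequence 1 = AATAEKVFKQY, α-helix; sequence 2 = EWTYDDATKTF, β-strand)
Replace both sequences with an engineered peptide (“chameleon”): TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
The same peptide (AWTVEKAFKTF) forms an α-helix at the first position and a β-strand at the second.
Source: Minor and Kim. 1996. Nature 380:730-734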
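The substitution on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper `find_all` and the constant names are made up for the example, and the two 40-residue strings are copied from the slide.

```python
def find_all(sequence, peptide):
    """Return every 0-based start position of `peptide` in `sequence`."""
    positions = []
    start = sequence.find(peptide)
    while start != -1:
        positions.append(start)
        start = sequence.find(peptide, start + 1)
    return positions


# Sequences as printed on the slide (Minor and Kim 1996).
NATIVE = "TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK"
ENGINEERED = "TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK"

# The 11-residue peptide that appears twice in the engineered sequence.
CHAMELEON = "AWTVEKAFKTF"

# Where the chameleon peptide sits in the engineered sequence.
positions = find_all(ENGINEERED, CHAMELEON)

# Positions at which the native and engineered sequences differ.
changed = [i for i, (a, b) in enumerate(zip(NATIVE, ENGINEERED)) if a != b]

print("chameleon starts at:", positions)
print("substituted positions:", changed)
```

Every substituted position falls inside one of the two 11-residue windows, so the rest of the protein is untouched while the identical peptide occupies both sites.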
Measures of prediction accuracy • Qindex and Q3 • Correlation coefficient
Qindex: (Qhelix, Qstrand, Qcoil, Q3) • Percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations. • Drawback: even a random assignment of structure can achieve a high score (Holley & Karplus 1991)
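A minimal sketch of how these scores could be computed, assuming the observed and predicted structures are given as equal-length strings over the labels `H` (helix), `E` (strand) and `C` (coil). The function name `q_index` and the label choice are assumptions for illustration, and the per-state score is taken here as the percentage of observed residues in that state that were predicted correctly.

```python
def q_index(observed, predicted):
    """Return Q_H, Q_E, Q_C and Q3 as percentages.

    Per-state score: correctly predicted residues of that state divided by
    the number of residues observed in that state.
    Q3: correctly predicted residues divided by all residues.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")

    scores = {}
    for state in "HEC":
        total = sum(1 for o in observed if o == state)
        correct = sum(
            1 for o, p in zip(observed, predicted) if o == state and p == state
        )
        # A state that never occurs in the observed structure has no score.
        scores["Q_" + state] = 100.0 * correct / total if total else float("nan")

    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    scores["Q3"] = 100.0 * matches / len(observed)
    return scores


example = q_index("HHHHCCCCEE", "HHHCCCCCEC")

# The drawback from the slide: a structure that is mostly coil gets a high Q3
# from a "prediction" that simply calls every residue coil.
all_coil = q_index("CCCCCCCCHH", "CCCCCCCCCC")
```

In the second call the trivial all-coil prediction scores Q3 = 80% while its helix score is 0%, which is why Q3 alone can be misleading.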
Correlation coefficient: Cα = 1 (= 100%) corresponds to a perfect prediction.
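The slide does not reproduce the formula. Assuming the coefficient meant here is the Matthews correlation coefficient, which is commonly used to score secondary structure predictions one state at a time, a sketch could look like this (the function name and the `H`/`E`/`C` labels are assumptions for illustration):

```python
import math


def matthews_correlation(observed, predicted, state="H"):
    """Matthews correlation coefficient for one structural state.

    C = (TP*TN - FP*FN) / sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
    Returns 0.0 when the denominator is zero (a common convention).
    """
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")

    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if o == state and p == state:
            tp += 1          # state observed and predicted
        elif o != state and p != state:
            tn += 1          # state neither observed nor predicted
        elif p == state:
            fp += 1          # predicted but not observed
        else:
            fn += 1          # observed but not predicted

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


perfect = matthews_correlation("HHHHCCCC", "HHHHCCCC")
inverted = matthews_correlation("HHHHCCCC", "CCCCHHHH")
trivial = matthews_correlation("CCCCCCCCHH", "CCCCCCCCCC")
```

A perfect prediction gives 1, a fully inverted one gives -1, and a trivial all-coil prediction gives 0 even when most residues happen to be coil.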
First generation methods: single residue statistics • Chou & Fasman (1974 & 1978): Some residues have particular secondary-structure preferences. • Based on empirical frequencies of residues in α-helices, β-sheets, and coils. • Examples: Glu → α-helix, Val → β-strand
Chou-Fasman Method • Accuracy: Q3 = 50-60%
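The single-residue statistics behind this kind of method can be sketched as a propensity: how often a residue occurs in a given state, relative to how often any residue occurs in that state. The code below is a simplified illustration and not the published Chou-Fasman procedure; the function name and the tiny training example are hypothetical, and no published propensity values are used.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Propensity of each amino acid for `state` ('H', 'E' or 'C').

    propensity(aa) = (fraction of aa occurrences found in `state`)
                     / (fraction of all residues found in `state`)
    Values above 1 mean the residue favours the state.
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1

    n_total = sum(aa_total.values())
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        return {}
    state_fraction = n_state / n_total
    return {
        aa: (aa_in_state[aa] / aa_total[aa]) / state_fraction for aa in aa_total
    }


# Hypothetical toy data, invented for this example only.
# Note: in the sequence 'E' is glutamate; in the structure 'E' is strand.
toy_sequences = ["EEEVGVVV"]
toy_structures = ["HHHHCEEE"]

helix = propensities(toy_sequences, toy_structures, "H")
strand = propensities(toy_sequences, toy_structures, "E")
```

With this toy input glutamate gets a helix propensity of 2.0 and valine a strand propensity of 2.0, mirroring the Glu and Val examples given for the first-generation methods.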
Second generation methods: segment statistics • Similar to single-residue methods, but incorporating additional information (adjacent residues, segmental statistics). • Problems: • Low accuracy: Q3 below 66%. • Q3 of β-strands (E): 28% - 48%. • Predicted structures were too short.
The GOR method • Developed by Garnier, Osguthorpe & Robson • Builds on Chou-Fasman Pij values • Evaluates each residue PLUS the adjacent 8 N-terminal and 8 carboxyl-terminal residues • Sliding window of 17 residues • Underpredicts β-strand regions • GOR method accuracy: Q3 = ~64%
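The 17-residue sliding window (8 residues on each side of the residue being evaluated) can be sketched as follows. Only the windowing step is shown, not the GOR scoring itself; the function name and the `-` padding character for positions beyond the chain ends are assumptions for illustration.

```python
def sliding_windows(sequence, flank=8, pad="-"):
    """Return one (central_residue, window) pair per residue.

    Each window holds `flank` residues on the N-terminal side, the central
    residue, and `flank` residues on the C-terminal side (17 in total for
    flank=8). Positions beyond the ends of the chain are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    size = 2 * flank + 1
    return [(sequence[i], padded[i:i + size]) for i in range(len(sequence))]


windows = sliding_windows("ACDEFGHIKLMNPQRSTVWY")
first_residue, first_window = windows[0]
```

For the first residue the window is `--------ACDEFGHIK`: eight padding characters, the residue itself, and its eight C-terminal neighbours.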
Third generation methods • Third generation methods reached 77% accuracy. • They rest on two new ideas: 1. A biological idea: using evolutionary information based on conservation analysis of multiple sequence alignments. 2. A technological idea: using neural networks.
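One common way to turn a multiple sequence alignment into evolutionary information is a per-column residue frequency profile, which can then be fed to a predictor in place of the single sequence. The sketch below shows only that encoding step, under the assumptions that the alignment is a list of equal-length strings and that `-` marks a gap; the function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_profile(alignment, gap="-"):
    """Return, for each alignment column, a dict of residue frequencies.

    Gaps are ignored; a column made only of gaps gives an empty dict.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        n = len(residues)
        profile.append({aa: k / n for aa, k in counts.items()})
    return profile


# Hypothetical toy alignment, invented for this example only.
toy_alignment = ["AEV", "AEI", "ADV", "A-V"]
profile = column_profile(toy_alignment)
```

A fully conserved column (the first one here) yields a single residue with frequency 1.0, while variable columns spread the frequency over several residues; that conservation signal is what a single sequence cannot provide.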
Artificial Neural Networks An attempt to imitate the human brain (assuming that this is the way it works).