Introduction to Bioinformatics: Lecture XI Computational Protein Structure Prediction

Introduction to Bioinformatics: Lecture XI Computational Protein Structure Prediction
Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
• Protein structure and complexity of conformational search: from similarity based methods to de novo structure prediction
• Multiple sequence alignment and family profiles
• Secondary structure and solvent accessibility prediction
• Matching sequences with known structures: threading and fold recognition
• Ab initio folding simulations

Polypeptide chains: backbone and side-chains
[Figure: polypeptide chain with the N-terminus (N-ter) and C-terminus (C-ter) labeled]

Distinct chemical nature of amino acid side-chains
[Figure: chain from N-ter to C-ter with side-chains labeled PHE, CYS, VAL, GLU, ARG]

Hydrogen bonds and secondary structures
[Figure: β-strand and α-helix]

Tertiary structure and long range contacts: annexin

Quaternary structure and protein-protein interactions: annexin hexamer

Domains, interactions, complexes: cyclin D and Cdk
[Figure: cyclin D and Cdk complex with the Cyclin Box labeled]

Domains, interactions, complexes: VHL

Protein folding problem
• The protein folding problem consists of predicting the three-dimensional structure of a protein from its amino acid sequence
• Hierarchical organization of protein structures helps to break the problem into secondary structure, tertiary structure and protein-protein interaction predictions
• Computational approaches for protein structure prediction: similarity based and de novo methods

Polypeptide chains: backbone and rotational degrees of freedom

        H     O         R2
        |     ||        |
NH3+ -- Cα -- C -- N -- Cα -- C -- O-
        |          |    |      \\
        R1         H    H        O

The equilibrium length of the peptide bond (C -- N) is about 1.33 Å. The average Cα - Cα distance in a polypeptide chain is about 3.8 Å. The angle of rotation around the N - Cα bond is called φ, and the angle around the Cα - C bond is called ψ. These two angles define the overall conformation of polypeptide chains. Simplifying, there are three discrete states (rotations) for each of these single bonds, i.e. 9 states per residue, implying $9^N$ possible backbone conformations for a chain of N residues.
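To get a feel for how quickly the $9^N$ estimate above grows, here is a minimal sketch that counts the discrete backbone conformations and estimates how long an exhaustive enumeration would take. The function names and the sampling rate of 10^12 conformations per second are illustrative assumptions, not values from the lecture.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def backbone_conformations(n_residues: int, states_per_angle: int = 3) -> int:
    """Two rotatable bonds per residue, each with `states_per_angle` states: 9^N for 3 states."""
    return (states_per_angle ** 2) ** n_residues


def enumeration_years(n_residues: int, rate_per_second: float = 1e12) -> float:
    """Years needed to visit every conformation at an assumed (hypothetical) sampling rate."""
    return backbone_conformations(n_residues) / rate_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, backbone_conformations(n), f"{enumeration_years(n):.3g} years")
```

Even with this generous rate, a 100-residue chain is far beyond exhaustive search, which is why the conformational search has to rely on heuristics.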

Scoring alternative conformations with empirical force fields (folding potentials)
Ideally, each misfolded structure should have an energy higher than the native energy, i.e. $E_{\mathrm{misfolded}} - E_{\mathrm{native}} > 0$.
[Figure: energy diagram (E) with levels labeled misfolded and native]

Ab initio (or de novo) folding simulations
• When dealing with a new fold, the similarity based methods cannot be applied
• Ab initio folding simulations consist of a conformational search with an empirical scoring function (“force field”) to be maximized (or minimized)
• Computational bottleneck: exponential search space and sampling problem (global optimization!)
• Fundamental problem: inaccuracy of empirical force fields
• Importance of mixed protocols, such as Rosetta by D. Baker and colleagues (more when Monte Carlo protocols for global optimization are introduced)
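The idea of a conformational search driven by a scoring function can be illustrated with a toy model. The sketch below is not the force fields or the Rosetta protocol mentioned above: it assumes a 2D square-lattice HP model (H = hydrophobic, P = polar) in which the energy is minus the number of non-bonded H-H lattice contacts, and it enumerates every self-avoiding chain exhaustively. All names are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

Point = Tuple[int, int]
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def hp_energy(seq: str, path: List[Point]) -> int:
    """Toy energy: -1 for each H-H pair adjacent on the lattice but not along the chain."""
    index_at = {p: i for i, p in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look right and up only, so each edge is counted once
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and seq[j] == "H":
                energy -= 1
    return energy


def enumerate_folds(n: int) -> Iterator[List[Point]]:
    """Yield all self-avoiding walks with n sites; the first step is fixed to skip rotated copies."""
    if n < 2:
        raise ValueError("need at least two residues")

    def extend(path: List[Point], occupied: set) -> Iterator[List[Point]]:
        if len(path) == n:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    start = [(0, 0), (1, 0)]
    yield from extend(start, set(start))


def best_fold(seq: str) -> Tuple[int, List[Point], int]:
    """Return (lowest energy, one conformation achieving it, number of conformations visited)."""
    best_energy: Optional[int] = None
    best_path: List[Point] = []
    count = 0
    for path in enumerate_folds(len(seq)):
        count += 1
        energy = hp_energy(seq, path)
        if best_energy is None or energy < best_energy:
            best_energy, best_path = energy, path
    return best_energy, best_path, count


if __name__ == "__main__":
    print(best_fold("HPHPPHHPHH"))
```

The number of conformations visited grows exponentially with chain length, so exhaustive enumeration like this is only feasible for very short chains; realistic protocols replace it with stochastic sampling.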

Similarity based approaches to structure prediction: from sequence alignment to fold recognition
• High level of redundancy in biology: sequence similarity is often sufficient to use the “guilt by association” rule: if similar sequence then similar structure and function
• Multiple alignments and family profiles can detect evolutionary relatedness at much lower sequence similarity, which is hard to detect with pairwise sequence alignments: PSI-BLAST by S. Altschul et al.
• For sufficiently close proteins one may superimpose the backbones using sequence alignment and then perform a conformational search (with the backbone fixed) to find the optimal geometry of the side-chains (according to an atomistic empirical force field): homology modeling (e.g. Modeller by A. Sali et al.)
• Many structures are already known (see PDB) and one can match sequences directly with structures to enhance structure recognition: fold recognition
• For both fold recognition and de novo simulation, prediction of intermediate attributes such as secondary structure or solvent accessibility helps to achieve better sensitivity and specificity

Protein families and domains
The notion of a protein family is derived from evolutionary considerations: members of the same family are related, perform the same function and are assumed to have diverged from the same ancestor.
The notion of a domain is derived from structural considerations: “A domain is defined as an autonomous structural unit, or a reusable sequence unit that may be found in multiple protein contexts”, Bateman et al.
PFAM (7246 families as of April 2004): http://www.sanger.ac.uk/Software/Pfam/
PRODOM: http://prodes.toulouse.inra.fr/prodom/current/html/home.php
CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi
Check: pfam00134.11, Cyclin_N

Multiple alignment and PSSM
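As a rough illustration of how a position-specific scoring matrix (PSSM) can be derived from a multiple alignment, the sketch below computes per-column log-odds scores. It makes simplifying assumptions that are not from the lecture: a uniform background distribution, a flat pseudocount, gaps ignored and no sequence weighting. Tools such as PSI-BLAST use more elaborate schemes.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm(alignment: List[str], alphabet: str = AMINO_ACIDS,
         pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one dict per alignment column with log2(frequency / background) per residue."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with rows of equal length")
    background = 1.0 / len(alphabet)  # uniform background: an assumption
    matrix = []
    for col in range(len(alignment[0])):
        # Gap characters and other symbols outside the alphabet are ignored.
        counts = Counter(row[col] for row in alignment if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix.append({
            a: math.log2((counts[a] + pseudocount) / total / background)
            for a in alphabet
        })
    return matrix


def score(seq: str, matrix: List[Dict[str, float]]) -> float:
    """Score an ungapped sequence of the same length as the profile."""
    return sum(column[a] for a, column in zip(seq, matrix))


if __name__ == "__main__":
    profile = pssm(["ACD", "ACE", "AGD", "A-D"])
    print(round(score("ACD", profile), 2), round(score("WWW", profile), 2))
```

A sequence matching the conserved columns scores higher than an unrelated one, which is the property profile searches exploit.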

Multiple alignment, clustering and families
• DP search gives the optimal solution, but its cost scales exponentially with the number of sequences K, as $O(n^K)$ for sequences of length n; it is not practical for more than 3 or 4 sequences.
• Standard heuristics start from pairwise alignments (e.g. PSI-BLAST, ClustalW)
• Hidden Markov Model approach to family profiles (profile HMM) as an alternative with pre-fixed parameters, trained separately for each family. Some initial multiple alignments are necessary for training (next lecture).
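The exponential cost of exact multiple alignment can be made concrete with a small count. The sketch below assumes K sequences of equal length n, a full K-dimensional DP table with (n + 1)^K cells, and 2^K - 1 predecessor cells examined per cell; the function name is hypothetical.

```python
from typing import Tuple


def msa_dp_cost(n: int, k: int) -> Tuple[int, int]:
    """Return (number of DP cells, number of cell updates) for K sequences of length n."""
    cells = (n + 1) ** k          # one axis of length n + 1 per sequence
    predecessors = 2 ** k - 1     # each sequence either advances or takes a gap, not all gaps
    return cells, cells * predecessors


if __name__ == "__main__":
    for k in range(2, 7):
        cells, updates = msa_dp_cost(300, k)
        print(f"K={k}: {cells:.3e} cells, {updates:.3e} updates")
```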

Predicting 1D protein profiles from sequences: secondary structures and solvent accessibility
a) Multiple alignments and family profiles improve prediction of local structural propensities.
b) Advanced machine learning techniques, such as Neural Networks or Support Vector Machines, improve results as well.
B. Rost and C. Sander were the first to achieve more than 70% accuracy in three-state (H, E, C) classification by applying a) and b).
SABLE server: http://sable.cchmc.org
POLYVIEW server: http://polyview.cchmc.org
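The three-state accuracy quoted above is usually reported as the fraction of residues whose predicted state (H, E or C) matches the observed one, often called Q3. A minimal sketch of that measure, with a hypothetical function name and made-up example strings:

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


if __name__ == "__main__":
    # Made-up strings for illustration only.
    print(q3("CHHHHHCCEEEEC", "CCHHHHCCEEEEC"))
```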

Predicting 1D protein profiles from sequences: secondary structures and solvent accessibility

Predicting transmembrane domains

“Hydropathy” profiles and membrane domains prediction
Problem: Design a simple algorithm for finding putative transmembrane regions based on “hydropathy” (or hydrophobicity) profiles. Consider an extension based on prototypes and k-NN.
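One possible starting point for the problem above is a sliding-window average over a hydropathy scale. The sketch below assumes the Kyte-Doolittle scale, a window of 19 residues and a threshold of 1.6, which are commonly quoted choices and not values given in the lecture; the minimum segment length and function names are also illustrative assumptions. The k-NN extension is not covered.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy values (assumed scale; other scales could be substituted).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> List[float]:
    """Mean hydropathy in a window centered on each residue (truncated at the chain ends)."""
    values = [KYTE_DOOLITTLE[a] for a in seq]
    half = window // 2
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


def putative_tm_segments(seq: str, window: int = 19, threshold: float = 1.6,
                         min_length: int = 15) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs with profile >= threshold."""
    profile = hydropathy_profile(seq, window)
    segments, start = [], None
    # A trailing sentinel below any threshold closes a run that reaches the chain end.
    for i, value in enumerate(profile + [float("-inf")]):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    return segments


if __name__ == "__main__":
    toy = "DKRDE" * 4 + "LIVA" * 8 + "DKRDE" * 4  # artificial sequence for illustration
    print(putative_tm_segments(toy))
```

Because the profile is smoothed, the reported boundaries are approximate; they mark window centers rather than exact membrane-spanning residues.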

Predicting transmembrane domains

Going beyond sequence similarity: threading and fold recognition
When sequence similarity is not detectable, use a library of known structures to match your query with target structures. As in the case of de novo folding, one needs a scoring function that measures compatibility between sequences and structures.

Why “fold recognition”?
• Divergent (common ancestor) vs. convergent (no common ancestor) evolution
• PDB: virtually all proteins with 30% sequence identity have similar structures; however, most of the similar structures share only up to 10% sequence identity!
• www.columbia.edu/~rost/Papers/1997_evolution/paper.html (B. Rost)
• www.bioinfo.mbb.yale.edu/genome/foldfunc/ (H. Hegyi, M. Gerstein)

Simple contact model for protein structure prediction
Each amino acid is represented by a point in 3D space, and two amino acids are said to be in contact if their distance is smaller than a cutoff distance, e.g. 7 Å.
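The contact definition above translates directly into code. The sketch below takes one 3D point per residue and lists all pairs closer than the cutoff; the optional `min_separation` parameter, which can exclude neighbors along the chain, is an added assumption and not part of the definition on the slide.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def contact_pairs(coords: Sequence[Tuple[float, float, float]],
                  cutoff: float = 7.0,
                  min_separation: int = 1) -> List[Tuple[int, int]]:
    """Return index pairs (k, l), k < l, whose points are closer than `cutoff` (in Å)."""
    return [
        (k, l)
        for k, l in combinations(range(len(coords)), 2)
        if l - k >= min_separation and math.dist(coords[k], coords[l]) < cutoff
    ]


if __name__ == "__main__":
    # Four points on a straight line, 3.8 Å apart (an artificial extended chain).
    chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    print(contact_pairs(chain))
    print(contact_pairs(chain, min_separation=2))
```

Consecutive residues are always within the cutoff, so in practice they are often excluded by setting `min_separation` to 2 or more.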

Sequence-to-structure matching with contact models
• Generalized string matching problem: aligning a string of amino acids against a string of “structural sites”, each characterized by the other residues in contact with it
• Finding an optimal alignment with gaps using inter-residue pairwise models, $E = \sum_{k<l} \varepsilon_{kl}$, is NP-hard because the scores at a given structural site are non-local (the identity of the interaction partners may change depending on the location of gaps in the alignment): R.H. Lathrop, Protein Eng. 7 (1994)

Hydrophobic contact model and sequence-to-structure alignment
[Figure: example hydrophobic/polar sequence HPHPP]
• Approaches to this further instance of the global optimization problem:
• Heuristics (e.g. frozen environment approximation)
• “Profile” or local scoring functions (folding potentials)
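A pairwise contact energy is easy to evaluate once the alignment is fixed; the hard part is the search over gapped alignments. The sketch below restricts itself to gapless threading, which sidesteps that search by trying every offset of the sequence along the template sites. The hydrophobic residue set and the energy of -1 per hydrophobic contact are illustrative assumptions, and this is not the frozen environment approximation named above.

```python
from typing import List, Tuple

HYDROPHOBIC = set("AVLIMFWC")  # one possible H/P split; an assumption


def contact_energy(seq: str, contacts: List[Tuple[int, int]], offset: int = 0) -> int:
    """Sum over template contacts (k, l): -1 if both residues placed there are hydrophobic."""
    energy = 0
    for k, l in contacts:
        if seq[k + offset] in HYDROPHOBIC and seq[l + offset] in HYDROPHOBIC:
            energy -= 1
    return energy


def best_gapless_threading(seq: str, n_sites: int,
                           contacts: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (lowest energy, offset); ties are resolved in favor of the smallest offset."""
    if len(seq) < n_sites:
        raise ValueError("sequence is shorter than the template")
    return min(
        (contact_energy(seq, contacts, offset), offset)
        for offset in range(len(seq) - n_sites + 1)
    )


if __name__ == "__main__":
    # Hypothetical template with 5 sites and two contacts.
    print(best_gapless_threading("SSLSSLSS", 5, [(0, 3), (1, 4)]))
```

With gaps allowed, the residues occupying the two ends of each contact depend on the whole alignment, which is the non-locality that makes the general problem hard.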

Using sequence similarity, predicted secondary structures and contact potentials: fold recognition protocols
In practice, fold recognition methods are often mixtures of sequence matching and threading, e.g. with compatibility between a sequence and a structure measured by contact potentials, and with predicted secondary structures compared to the secondary structure of a template.
D. Fischer and D. Eisenberg, Curr. Opinion in Struct. Biol. 1999, 9: 208

Some fold recognition servers
• PSI-BLAST (Altschul SF et al., Nucl. Acids Res. 25: 3389)
• Live Bench evaluation (http://BioInfo.PL/LiveBench/1/):
• FFAS (L. Rychlewski, L. Jaroszewski, W. Li, A. Godzik (2000), Protein Science 9: 232): seq. profile against profile
• 3D-PSSM (Kelley LA, MacCallum RM, Sternberg JE, JMB 299: 499): 1D-3D profile combined with secondary structures and solvation potential
• GenTHREADER (Jones DT, JMB 287: 797): seq. profile combined with pairwise interactions and solvation potential
• LOOPP: annotations of remote homologs http://www.tc.cornell.edu/CBIO/loopp
