CSCE555 Bioinformatics

CSCE555 Bioinformatics • Lecture 18 Protein Tertiary Structure Prediction Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.

Outline • Experimental limitations of protein structure determination • Tertiary Structure Prediction • Ab initio • Homology modeling • Threading

Experimental Protein Structure Determination • High-resolution structure determination • X-ray crystallography (<1 Å) • Nuclear magnetic resonance (NMR) (~1-2.5 Å) • Lower-resolution structure determination • Cryo-EM (cryo-electron microscopy) (~10-15 Å) • Theoretical models? • Highly variable, but a few are equivalent to X-ray!

Tertiary Structure Prediction • The fold (tertiary structure) prediction problem can be formulated as a search for the minimum-energy conformation • Search space is defined by the phi/psi angles of the backbone and the side-chain rotamers • Search space is enormous even for small proteins! • Number of local minima increases exponentially with the number of residues • Computationally, it is an exceedingly difficult problem!

Levinthal Paradox of Protein Folding: How does nature search? Assume that each amino acid has three possible conformations (e.g., α-helix, β-sheet, and random coil). For a protein of 100 amino acid residues, the total number of conformations is 3^100 = 515377520732011331036461129765621272702107522001 ≈ 5 x 10^47. If 100 psec (10^-10 sec) were required to convert from one conformation to another, a random search of all conformations would require 5 x 10^47 x 10^-10 sec ≈ 1.6 x 10^30 years. However, protein folding takes place on the order of msec to sec. Therefore, proteins fold not via a random search but via a more sophisticated search process. We want to watch the folding process of a protein using molecular simulation techniques.
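
The arithmetic in the Levinthal estimate can be checked with a short script. This is a minimal sketch under the slide's own assumptions (3 conformations per residue, 100 residues, 10^-10 sec per transition); the function name and the seconds-per-year constant are illustrative choices, not part of the lecture.

```python
# Assumed length of a year (Julian year); only affects the last digits.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(n_residues=100, states_per_residue=3,
                       seconds_per_step=1e-10):
    """Return (number of conformations, years for an exhaustive search)."""
    # Python integers are arbitrary precision, so this count is exact.
    n_conformations = states_per_residue ** n_residues
    total_seconds = n_conformations * seconds_per_step
    return n_conformations, total_seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    n, years = levinthal_estimate()
    print(f"conformations: {n:.2e}")
    print(f"search time:   {years:.2e} years")
```

The point of the exercise is the order of magnitude: the exhaustive search time exceeds observed folding times (msec to sec) by dozens of orders of magnitude.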

Steps in Protein Folding 1- "Collapse" - driving force is burial of hydrophobic amino acids (fast - msecs) 2- Molten globule - helices & sheets form, but "loose" (slow - secs) 3- "Final" native folded state - compaction, some secondary (2°) structures rearranged Native state? - assumed to be lowest free energy - may be an ensemble of structures

Protein Folding Funnel [Figure: folding funnel energy landscape, labeled with local minima and the global minimum (native structure)]

Protein Structure Prediction • Ab initio • Use just first principles: energy, geometry, and kinematics • Knowledge-based approaches • Homology: find the best match to a database of sequences with known 3D structure • Threading • Combinations • Meta-servers and other methods

Ab Initio Prediction • Basic idea • Anfinsen’s theory: the native structure of a protein corresponds to the state with the lowest free energy of the protein-solvent system • General procedures • Develop a potential/energy function • Evaluate the energy of a protein conformation • Select the native structure • Conformational search algorithm • Produce new conformations • Search the potential energy surface and locate the global minimum (native conformation) • Provides both folding pathway & folded structure • Can only be applied to very small proteins

Potential Functions for PSP • Potential function • Physics-based energy function • Empirical all-atom forcefields: CHARMM, AMBER, ECEPP-3, GROMOS, OPLS • Parameterization: quantum mechanical calculations, experimental data • Simplified potential: UNRES (united residue) • Solvation energy • Implicit solvation model: Generalized Born (GB) model, surface-area-based model • Explicit solvation model: TIP3P (computationally expensive)

General Form of All-atom Forcefields • Bond stretching term (bond length r) • Angle bending term (bond angle Θ) • Dihedral term (torsion angle Φ) • Non-bonded terms, functions of interatomic distance r (the most time-demanding part): Van der Waals term, H-bonding term, Electrostatic term
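
For reference, one commonly used functional form combining these terms is written out below. This is a generic textbook-style sketch, not the exact expression of CHARMM, AMBER, or any other forcefield named in this lecture. The symbols are notational assumptions: force constants $k_r$ and $k_\Theta$, equilibrium values $r_0$ and $\Theta_0$, dihedral barrier $V_n$ with periodicity $n$ and phase $\gamma$, Lennard-Jones coefficients $A_{ij}$ and $B_{ij}$, partial charges $q_i$, and dielectric constant $\varepsilon$. The explicit H-bonding term used by some forcefields is omitted here.

```latex
E_{\text{total}} =
    \sum_{\text{bonds}} k_r \,(r - r_0)^2
  + \sum_{\text{angles}} k_\Theta \,(\Theta - \Theta_0)^2
  + \sum_{\text{dihedrals}} \frac{V_n}{2}\,\bigl[1 + \cos(n\Phi - \gamma)\bigr]
  + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
  + \frac{q_i q_j}{\varepsilon\, r_{ij}} \right]
```

The last sum runs over all atom pairs, which is why the non-bonded terms dominate the computational cost.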

Search Potential Energy Surface We are interested in minimum points on Potential Energy Surface (PES) • Conformational search techniques • Energy Minimization • Monte Carlo • Molecular Dynamics • Others: Genetic Algorithm, Simulated Annealing

Energy Minimization • Methods • First-order minimization: steepest descent, conjugate gradient • Second-derivative methods: Newton-Raphson method • Quasi-Newton methods: L-BFGS • These methods converge to a local minimum, not necessarily the global one
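
To illustrate why minimization finds only a local minimum, here is a minimal steepest-descent sketch on a one-dimensional toy energy function. The function `x**4 - 3*x**2 + x`, the fixed step size, and the tolerance are all hypothetical choices for illustration; real forcefield minimizers work on thousands of coordinates and usually use line searches.

```python
def energy(x):
    """Toy double-well energy (hypothetical, not a real forcefield)."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Analytical derivative of the toy energy."""
    return 4 * x**3 - 6 * x + 1


def steepest_descent(x0, step=0.01, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient with a fixed step until it vanishes."""
    x = x0
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g  # move downhill
    return x


if __name__ == "__main__":
    for start in (-2.0, 2.0):
        x_min = steepest_descent(start)
        print(f"start={start:+.1f} -> x={x_min:+.4f}, E={energy(x_min):+.4f}")
```

Starting on the right-hand side of the barrier, the search settles in the shallower well and never reaches the deeper one on the left, which is the behavior the slide's "local minimum" label refers to.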

Monte Carlo • In molecular simulations, ‘Monte Carlo’ is an importance sampling technique. 1. Make a random move and produce a new conformation 2. Calculate the energy change ΔE for the new conformation 3. Accept or reject the move based on the Metropolis criterion • Boltzmann factor: P = exp(-ΔE/kT) • If ΔE < 0, then P > 1: accept the new conformation • Otherwise: if P > rand(0,1), accept; else reject
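
The three steps above can be sketched as follows. This is a minimal illustration on a one-dimensional coordinate, assuming the standard Boltzmann factor exp(-ΔE/kT) as the acceptance probability; the function names, the toy energy, the move size, and the value of kT are hypothetical choices, not taken from the lecture.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng=random):
    """Metropolis criterion: always accept downhill moves; accept uphill
    moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def monte_carlo(energy, x0, n_steps=10_000, max_move=0.5, kT=1.0, seed=0):
    """Random-walk Monte Carlo; returns the lowest-energy point visited."""
    rng = random.Random(seed)  # seeded for reproducibility
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-max_move, max_move)  # 1. random move
        e_new = energy(x_new)                         # 2. energy change
        if metropolis_accept(e_new - e, kT, rng):     # 3. accept or reject
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x):
    """Hypothetical double-well energy used only for the demo."""
    return x**4 - 3 * x**2 + x


if __name__ == "__main__":
    print(monte_carlo(toy_energy, x0=2.0))
```

Because uphill moves are sometimes accepted, the walk can cross energy barriers that trap a pure downhill minimizer.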

Ab initio Prediction – CASP results

Comparative Modeling (Knowledge based approach) Two primary methods 1) Homology modeling 2) Threading (fold recognition) Both rely on availability of experimentally determined structures that are "homologous" or at least structurally very similar to target Provide folded structure only

Homology Modeling • Identify homologous protein sequences (PSI-BLAST) • Among available structures, choose the one with the closest sequence match to the target as template (steps 1 & 2 can be combined by using PDB-BLAST) • Build model by placing residues in corresponding positions of the homologous structure & refine by "tweaking" • Homology modeling works "well" • Computationally? Not very expensive • Accuracy? Higher sequence identity → better model • Requires ~30% sequence identity with a sequence for which the structure is known
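
Since template choice hinges on sequence identity, a small sketch of how percent identity can be computed from a pairwise alignment may help. Definitions of percent identity vary in the choice of denominator; this sketch assumes identical positions divided by the number of aligned (non-gap) columns. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned (non-gap) columns of two
    equal-length aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Made-up aligned fragments; 4 of 5 compared columns are identical.
    print(percent_identity("ALK-GF", "ALRKGF"))
```

A candidate template whose identity to the target falls well below the ~30% guideline would, under this rule of thumb, be a poor basis for homology modeling.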

Homology-based Prediction • Raw model • Loop modeling • Side chain placement • Refinement

Homology-based Prediction

Threading - Fold Recognition • Identify “best” fit between target sequence & template structure • Threading works "sometimes" • Computationally? Can be expensive or cheap, depending on the energy function & whether threading is "all atom" or "backbone only" • Accuracy? In theory, should not depend on sequence identity (should depend on quality of template library & "luck") • Usually, higher sequence identity to a protein of known structure → better model

Threading Algorithm for PSP • Database of 3D structures and sequences • Protein Data Bank (or non-redundant subset) • Query sequence • Sequence < 25% identity to known structures • Alignment protocol • Dynamic programming • Evaluation protocol • Distance-based potential or secondary structure • Ranking protocol

Threading • Basic premise: the number of unique structural folds in nature is fairly small (probably 2000-3000) • Statistics from Protein Data Bank (~40,000 structures): until very recently, 90% of new structures submitted to PDB had similar structural folds in PDB • Thus, chances for a protein to have a native-like structural fold in PDB are quite good • Note: Proteins with similar structural folds could be either homologs or analogs

Steps in Threading • Target sequence: ALKKGF…HFDTSE • Structure templates: align target sequence with template structures (fold library) from the Protein Data Bank (PDB) • Calculate energy score to evaluate goodness of fit between target sequence & template structure • Rank models based on energy scores

Threading Issues • Find the “correct” sequence-structure alignment of a target sequence with its native-like fold in PDB • Structure database - must be complete: no decent model if no good template in library! • Sequence-structure alignment algorithm: bad alignment → bad score! • Energy function (scoring scheme): • must distinguish correct sequence-fold alignment from incorrect sequence-fold alignments • must distinguish “correct” fold from close decoys • Prediction reliability assessment - how to determine whether the predicted structure is correct (or even close)?

Threading: Template database • Build a database of structural templates (e.g., ASTRAL domain library derived from the PDB) • Supplement with additional decoys, e.g., generated using an ab initio approach such as Rosetta (Baker)

Threading: Energy function • Two main methods (and combinations of these) • Structural profile (environmental): based on physico-chemical properties of amino acids • Contact potential (statistical): based on contact statistics from PDB (Miyazawa & Jernigan, ISU)

Protein Threading: Typical energy function • What is the "probability" that two specific residues are in contact? (pairwise term, Ep) • How well does a specific residue fit its structural environment? (singleton term, Es) • Alignment gap penalty? (Eg) • Total energy: E = Ep + Es + Eg • Goal: find a sequence-structure alignment that minimizes the energy function
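
A toy version of this energy function can be sketched as below. It assumes that Ep is a sum of pairwise contact energies, Es a sum of per-residue environment-fit energies, and Eg a linear gap penalty; that reading of the three terms, along with every name, data structure, and numeric value in the code, is an illustrative assumption rather than the scoring scheme of any particular threading program.

```python
def threading_energy(target, alignment, contacts, environments,
                     pair_potential, singleton_potential,
                     n_gaps, gap_penalty=1.0):
    """Score one sequence-structure alignment as Ep + Es + Eg.

    target:              target sequence (string)
    alignment:           dict, template position -> index into target
    contacts:            list of (pos_i, pos_j) template positions in contact
    environments:        dict, template position -> environment label
    pair_potential:      dict, sorted (res_a, res_b) tuple -> energy
    singleton_potential: dict, (residue, environment) -> energy
    """
    # Ep: pairwise term over template contacts where both ends are aligned.
    e_pair = 0.0
    for i, j in contacts:
        if i in alignment and j in alignment:
            pair = tuple(sorted((target[alignment[i]], target[alignment[j]])))
            e_pair += pair_potential.get(pair, 0.0)

    # Es: singleton term, how well each residue fits its environment.
    e_single = 0.0
    for pos, idx in alignment.items():
        e_single += singleton_potential.get(
            (target[idx], environments[pos]), 0.0)

    # Eg: linear gap penalty.
    e_gap = gap_penalty * n_gaps
    return e_pair + e_single + e_gap


if __name__ == "__main__":
    # Made-up three-residue example with invented potentials.
    score = threading_energy(
        target="LKL",
        alignment={0: 0, 1: 1, 2: 2},
        contacts=[(0, 2)],
        environments={0: "buried", 1: "exposed", 2: "buried"},
        pair_potential={("L", "L"): -1.0},
        singleton_potential={("L", "buried"): -0.5, ("K", "exposed"): -0.3},
        n_gaps=0,
    )
    print(score)
```

In an actual threading run, a score like this would be computed for the best alignment against each template in the fold library, and templates would then be ranked by energy.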

CAFASP GOAL The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the user intervention allowed in CASP.

Performance Evaluation in CAFASP3 • Servers with names in italics are meta-servers • MaxSub score ranges from 0 to 1; therefore, the maximum total score is 30 (http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December 2002)

One structure where RAPTOR did best • Red: true structure • Blue: correct part of prediction • Green: wrong part of prediction • Target size: 144 • Superimposable size within 5 Å: 118 • RMSD: 1.9 Å
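
The two quality measures quoted here can be computed as in the following sketch. It assumes the predicted and true coordinates have already been optimally superimposed and paired residue by residue; the superposition step itself (for example, a Kabsch-style fit) is deliberately left out, and the function names and the default cutoff argument are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, pre-superimposed
    3D coordinates (lists of (x, y, z) tuples)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def count_within(coords_a, coords_b, cutoff=5.0):
    """Number of paired positions closer than `cutoff` (in the same
    length unit as the coordinates, e.g. angstroms)."""
    return sum(
        1 for a, b in zip(coords_a, coords_b) if math.dist(a, b) < cutoff
    )


if __name__ == "__main__":
    true_xyz = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    pred_xyz = [(0.0, 0.0, 2.0), (3.8, 0.0, 2.0)]
    print(rmsd(true_xyz, pred_xyz), count_within(true_xyz, pred_xyz))
```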

Some more results by other programs

Summary of current state of the art

Automated Web-Based Homology Modeling • SWISS Model: http://www.expasy.org/swissmod/SWISS-MODEL.html • WHAT IF: http://www.cmbi.kun.nl/swift/servers/ • The CPHModels Server: http://www.cbs.dtu.dk/services/CPHmodels/ • 3D Jigsaw: http://www.bmm.icnet.uk/~3djigsaw/ • SDSC1: http://cl.sdsc.edu/hm.html • EsyPred3D: http://www.fundp.ac.be/urbm/bioinfo/esypred/

Comparative Modeling Server & Program • COMPOSER: http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/matchmaker.html • MODELER: http://salilab.org/modeler • InsightII: http://www.msi.com/ • SYBYL: http://www.tripos.com/
