Computational Molecular Biology Protein Structure and Homology Modeling Dr. Francesco Musiani

ComputationalMolecular Biology ProteinStructure and HomologyModeling Dr. Francesco Musiani

Sequence, function and structure relationships • Life is the ability to metabolize nutrients, respond to external stimuli, grow, reproduce and evolve • From a chemical point of view, proteins are linear hetero-polymers formed by amino acids (aa) • Proteins assume a 3D shape which is usually responsible for function • The consequence of the tight link between structure, function and evolutionary pressure distinguish proteins from ordinary polymers

Proteinstructure • The sequence of amino acids is called the primary structure • Secondary structure refers to local folding • Tertiary structure is the arrangement of secondary elements in 3D • Quaternary structure describes the arrangement of a protein subunits • The peptide bond is planar and the dihedral angle it defines is almost always 180°

Protein structure • What is a dihedral angle? • Is the angle between two planes. In practice, if you have four connected atoms and you want measure the dihedral angle around the central bond, you orient the system in such a way that the two central atoms are superimposed and measure the resulting angle between the first and last atom.

Protein structure • The simplest arrangements of aa is the alpha-helix, a right handed spiral conformation. • The structure repeats itself every 5.4 Å along the helix axis. • There are 3.6 aa per turn. O(n)-NH(n+4) H-bond

Protein structure • The beta sheet. • The R groups of neighboring residues in strand point in opposite directions. • There are parallel or anti-parallel beta sheets.

Protein structure Ramchandanplot: pairs of angles that do not cause the atoms of a dipeptide to collide.

Protein structure Collagen triple helix Anti-parallel β-sheet Left-handed α-helix Parallel β-sheet Right-handed α-helix

Protein structure Loops: regions without repetitive structure that connects secondary structure elements.

Protein structure Supersecondary elements (motifs): arrangements of two or three consecutive secondary structure that are present in many different protein structures, even with completely different sequences.

Protein structure Domains: portion of the polypeptide chain that folds into a compact semi-independent unit. • Class (C) Derived from secondarystructurecontentisassignedautomatically • Architecture (A) Describes the grossorientation of secondarystructures, independent of connectivity. • Topology (T) Clusters structuresaccording to theirtopologicalconnections and numbers of secondarystructures • Homologoussuperfamily (H)

Ala: transient interactions Thr, Ser: phosphorylation target: protein kinases attack phosphate group to the side-chain. Gly: unusualramachandran, oftenfound in turns Thr: Beta-branched more often found in beta-sheets. Cys: Very reactive, coordinate metals.

The problem of protein folding • What is protein fold: • Compact, globular folding arrangement of the polypeptide chain • Chain folds to optimize packing of the hydrophobic residues in the interior core of the protein • Thermodynamics: ΔG = ΔH – TΔS • (i.e. stability of a given conformation) • Enthalpy: electrostatics, dispersion, van der Waals, • H-bonds. • Entropy: water molecules form “ordered cages” around hydrophobic amino acids. The protein folding process breaks this order. • The free energy of folding of a protein • is of the order of few kcal/mol

The problem of protein folding • Anfinsen’s dogma: (at least for small globular proteins) the native structure is determined only by the protein's amino acid sequence • Levinthalparadox: because of the very large number of degrees of freedom in an unfolded polypeptide chain, the molecule has an astronomical number of possible conformations • Funneltheory: everyproteinhas a specificfoldingpathway

The problem of protein folding - 0 + –TΔS Conformational entropy ΔH Internal interactions –TΔS Hydrophobic effects Result: ΔG Folding

Evolution of protein structure • What if a base-substitution event occurs • in a protein-coding DNA region? • The fine balance between the gain and loss of free energy of folding is compromised: no single energy minimun→ NOT FOLD • The energy landscape of the protein change, but there is a global minimum of energy → same or similar function (i.e. local perturbations without affecting the general shape or topology) FOLD

The “comparative modeling” principle

Evolutionary-based methods for protein structure prediction • Proteins evolved from a common ancestor maintain similar core 3D structures • We can use protein of known structure (templates) • to model protein of unknown 3D structure (targets) • by starting from the sequence • This can be done if the templates and the target • are evolutionarily correlated

Why Protein Structure Prediction? We have an experimentally determined atomic structure for only ~1% of the knownproteinsequences

Why Protein Structure Prediction? Growth in the number of unique folds per year in the PDB based on the SCOP data base from 1986 to 2007

Why? • We can use homology modeling to predict the structure of proteins of unknown structure… • but also… • To reconstruct some missing part in an incomplete protein structure (common in low resolution structures or for large mobile loops) • To calculate a mutant of a known protein structure • To calculate the mean structure of an NMR ensamble

Homology modeling flowchart Query sequence Search for suitable template(s) Sequencedatabases Alignsequence with template(s) Template PDB structure(s) Possibleerrors Calculate model(s) Model(s) Assessresults Refinement (loops)

Homology modeling flowchart http://salilab.org/modeller/

How does it works?

1. Align sequence with structures • First, must determine the template structures • Simplistically, try to align the target sequence against every known structure’s sequence. • In practice, this is too slow, so heuristics are used (e.g. BLAST) • Profile or HMM searches are generally more sensitive in difficult cases (Modeller’sprofile.build method, PSI-BLAST or HHpred) • Could also use threading or other web servers • Remember to look at: • Sequence identity/similarity between the putative template(s) and the target • Experimental method, resolution and completeness of the template(s) • Other compounds bound to the template(s) • Oligomerization state

1. Align sequence with structures • Alignment to templates • Sequence-sequence: relies purely on a matrix of observed residue-residue mutation probabilities (‘align’) • Sequence-structure: gap insertion is penalized within secondary structure (helices etc.) (‘align2d’) • Other features, profile-profile, and/or user-defined (‘salign’) or use an external program • Remember: • An error in the alignment is always a fatal error for the whole modeling procedure! • One amino acid sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud (A.M. Lesk, Introduction to Bioinformatics, 2002)

1. Align sequence with structures • Evaluation of sequence alignment quality E. Krieger, S.B. Nabuurs, G. Vriend: „Homologymodeling“. In Structural Bioinformatics. P.E. Bourne and H. Weissig Eds. (2003).

2. Extract spatial restraints • Spatial restraints incorporate homology information, statistical preferences, and physical knowledge • Template Cα- Cα internal distances • Backbone dihedrals (φ/ψ) • Sidechain dihedrals given residue type of both target and template • Force field stereochemistry (bond, angle, dihedral) • Statistical potentials • Other experimental constraints • Etc.

3. Satisfy spatial restraints • Satisfaction of spatial restraints • Represent system at appropriate level(s) of resolution (e.g. atoms, residues, domains, proteins) • Convert each data source into spatial restraints (e.g. harmonic distance simulates using “spring”) • Sum all restraints into a scoring function • Generate models that are consistent with all restraints by optimizing the scoring function (e.g. conjugate gradients, molecular dynamics, Monte Carlo)

3. Satisfy spatial restraints • All information is combined into a single objective function • Restraints and statistics are converted to an “energy” by taking the negative log • Force field (CHARMM 22) simply added in • Function is optimized by conjugate gradients and simulated annealing molecular dynamics, starting from the target sequence threaded onto template structure(s) • Multiple models are generally recommended • ‘best’ model or cluster or models chosen by simply taking the lowest objective function score, or using a model assessment method such as Modeller’s own DOPE or GA341, or external programs such as PROSA or DFIRE

4. Assess results • How do we know if the model is a good one? • Check log file for restraint violations and Modellerscore (molpdf) (not reliable since the scoring function is not perfect!) • Use another assessment score on the final model • Fold assignment: GA341 • Statistical potential: DOPE • Other programs (e.g. Prosa) • Use structure assessment programs (e.g. ProCheck) • Fit the model to some other experimental data not used in the modeling procedure

Typical assessments DOPE profile Ramachandran plot (ProCheck) PROSA profile

Structural alignment Root-mean square deviation (RMSD) 2 Wherexi and xj are the coordinate vectors of the structurei and j, respectively, and Nis the number of atoms of the twostrucures Structural alignment of thioredoxins from humans (red) and the fly Drosophila melanogaster (yellow)

Typical errors in comparative models

Model Accuracy as a Function of Target-Template Sequence Identity

Model accuracy

Applications of protein structure models Topologyrecognition Familiassignment Overallfold Mutagenesis design Functionalrelationship Drug design Virtual screening Docking Binding site detection

Model refining • Loop optimization • Often, there are parts of the sequence which have no detectable templates • “Mini folding problem” – these loops must be sampled to get improved conformations • Database searches only complete for 4-6 residue loops • Modeller uses conformational search with a custom energy function optimized for loop modeling (statistical potential derived from PDB) • Fiser/Melo protocol (‘loopmodel’) • Newer DOPE + GB/SA protocol (‘dope_loopmodel’)

Model refining • Accuracy of loop models as a function of amount of optimization

Model refining • Fraction of loops modeled with medium accuracy (<2Å)

Advanced topics • Modeller can also • Perform more sensitive searches for templates (sequence-profile, profile-profile, similar to PSI-BLAST) • Incorporate ligands, RNA/DNA and water molecules into built models • Build structures of multi-chain proteins (homo or hetero) • Add extra restraints to the modeling process (such as known distances, e.g. from FRET) • Use multiple templates to build a model • Remember: • You don’t have to use Modeller for template search, alignment, assessment or refinement. If you know your template (e.g. from BLAST) just format the alignment for Modeller and skip straight to the model building step!

Excercise • Target: • ChlIfrom cyanobacteriaSynechocystissp. (strain PCC 6803) • Procedure: • Search for templatesusingUniProt (http://www.uniprot.org/) • Align the target and the template(s) • PrepareModeller input files • Evaluate the model structure • Materials and Methods: • UniProt • Modeller(http://salilab.org/modeller/) • Modellermanual • ProCheck web server (http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/Generate.html) • Prosa web server (https://prosa.services.came.sbg.ac.at/prosa.php)

Computational Molecular Biology Protein Structure and Homology Modeling Dr. Francesco Musiani