Secondary Structure Prediction

Protein Analysis Workshop 2006. Secondary Structure Prediction. Alain Schenkel, Chris Wilton. Bioinformatics group, Institute of Biotechnology, University of Helsinki

Overview • Review of protein structure. • Introduction to structure prediction: • Different approaches. • Prediction of 1D strings of structural elements. • Server/software review: • COILS, MPEx, … • The PredictProtein metaserver.

Proteins • Proteins play a crucial role in virtually all biological processes, with a broad range of functions. • The activity of an enzyme or the function of a protein is governed by its three-dimensional structure. [Figures: H11_MOUSE histocompatibility antigen; VE2_BPV1 bovine DNA-binding domain.]

20 amino acids - the building blocks Clickable map at: http://www.russell.embl-heidelberg.de/aas/

The Amino Acids - hydrophobic

The Amino Acids - polar

The Amino Acids - charged

Secondary Structure: α-helix Alpha-helix: 4₁₃ Very seldom: 3₁₀, 5₁₆ (Pi-helix)

Secondary Structure: α-helix • 3.6 residues per turn • Axial dipole moment • Hydrogen-bonded • Protein surfaces • Typically, no Proline nor Glycine (“helix-breaker”)

Secondary Structure: β-sheets

Secondary Structure: β-sheets • Parallel or antiparallel • Alternating side-chains • Connecting loops often have polar amino acids

Secondary Structure: β-sheets

Terminology • Primary structure: The sequence of amino acid residues FTPAVHAFLDKFLAS …

Terminology • Secondary structure: • A first level of structural organization. • Provides rigidity. • The structural form adopted by each amino-acid residue: • H: helix ( alpha ) • E: extended ( beta strand ) • T: turn ( often Proline ) • C: coil ( random, unstructured )

Terminology • Secondary structure elements (SSE): • Stretches of residues in H conformation are helical SSEs. • Stretches of residues in E conformation are beta-strand SSEs. • Stretches of residues in C conformation are loops or coil. • Turns (T) are isolated residues, usually Proline or Glycine. • Other notation (in 3 states): L for all but H,E.
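The 3-state notation above can be sketched in a few lines of Python. This is only an illustration: the function name and the mapping table are assumptions made for the example, not part of any of the servers discussed.

```python
# Map the four per-residue states (H, E, T, C) onto the 3-state
# alphabet (H, E, L). Names here are illustrative only.
FOUR_TO_THREE = {"H": "H", "E": "E", "T": "L", "C": "L"}


def to_three_states(assignment: str) -> str:
    """Replace every state other than H or E with L."""
    return "".join(FOUR_TO_THREE.get(state, "L") for state in assignment)


print(to_three_states("CCHHHHTTEEEECC"))
```

Any symbol that is not H or E falls back to L, which matches the "L for all but H,E" convention.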

Secondary Structure Elements • Example: one helix, one beta strand, three loops Primary: MSEGEDDFPRKRTPWCFDDEHMC Secondary: CCHHHHHHCCCCEEEEEECCCCC
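As a minimal sketch, the example above can be split into its secondary structure elements by grouping runs of identical states. The function name is hypothetical; positions are reported 1-based and inclusive.

```python
from itertools import groupby


def secondary_structure_elements(assignment: str):
    """Return (state, start, end) for each run of identical states.

    Positions are 1-based and inclusive.
    """
    elements = []
    position = 1
    for state, run in groupby(assignment):
        length = len(list(run))
        elements.append((state, position, position + length - 1))
        position += length
    return elements


primary = "MSEGEDDFPRKRTPWCFDDEHMC"
secondary = "CCHHHHHHCCCCEEEEEECCCCC"

# Print each element together with the residues it covers.
for state, start, end in secondary_structure_elements(secondary):
    print(state, start, end, primary[start - 1:end])
```

For this string the runs are three C segments (loops), one H segment and one E segment, in line with "one helix, one beta strand, three loops".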

Terminology • Tertiary structure: • The full 3D structure of a single polypeptide chain. • Secondary structure elements pack together to form a structural core. • Called a protein “fold”.

Terminology • Quaternary structure: • How several fully folded protein chains pack together to form a fully functional protein. • Example: 1jch (ribosome inhibitor). 1jch is a PDB identifier; the Protein Data Bank is the principal repository for solved structures.

Example: 1jch has 4 chains The elongated 2-helix structures in the center are called coiled-coils.

Structural classification of folds For example (CATH): • alpha • beta • alpha+beta • alpha/beta • irregular More on structural classification next week.

Biochemical classification of folds • Globular proteins: • in aqueous environment, • compact fold, • hydrophobic core and polar surfaces. • Membrane proteins: • attached to or across the cell membrane, • hydrophobic surface within membrane. • Fibrous proteins: • structural role, • repeat of regular/atypical SSE or irregular structure.

Globular (2 domains) Transmembrane Fibrous

INTRODUCTION TO STRUCTURE PREDICTION

Why is 3D Structure Important? • A pre-requisite for understanding function • processes of molecular recognition, • eg DNA recognition by 2bop. • Catalytic mechanisms of enzymes • often require key residues to be close together in 3D space. • Structure is often preserved under evolution when sequence is not. • Drug design.

Structure Prediction GPSRYIVDL… ?

Approaches to structure prediction • Ab initio: from physical principles only. • De novo: knowledge-based potentials from PDB. • Fold recognition: thread sequence through known structures for compatibility. • Homology modeling: use sequence alignment to infer possible template structure. More on homology modeling next week.

Prediction in One-Dimension Simplification: project 3D structure onto strings of structural assignments. Eg: • coiled-coils • membrane helices • solvent accessibility: residue is buried or exposed …eeebbbbeebbbbee… • secondary structure elements: …HHHLLLEEEEEELLEEE… If accurate: can be used to improve predictions of 3D structures (eg, in fold recognition).

A Flow Chart for Structure Prediction http://speedy.embl-heidelberg.de/gtsp/flowchart2.html

Structure Prediction Why is structure prediction, and in particular ab initio prediction, a difficult problem? • Many degrees of freedom: atoms of all residues and solvent. • The number of possible conformations grows exponentially with the number of residues. • Remote noncovalent interactions complicate matters. • A delicate problem of stability. • Cannot exhaustively search all possible conformations. A folding protein does not try all conformations! (Levinthal paradox)
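A back-of-the-envelope calculation shows why exhaustive search is hopeless. All three numbers below are assumptions chosen for illustration (3 conformations per residue, a 100-residue chain, 10^13 conformations sampled per second); none of them comes from the slides.

```python
# Assumed values, for illustration only.
CONFORMATIONS_PER_RESIDUE = 3
CHAIN_LENGTH = 100
SAMPLES_PER_SECOND = 1e13
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# The search space grows exponentially with chain length.
total_conformations = CONFORMATIONS_PER_RESIDUE ** CHAIN_LENGTH
years_to_enumerate = total_conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"conformations: {total_conformations:.2e}")
print(f"years to enumerate: {years_to_enumerate:.2e}")
```

Under these assumptions the enumeration time exceeds the age of the universe by many orders of magnitude, whereas real proteins fold on timescales of milliseconds to seconds.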

Basic Principle of Folding (globular protein) Pack hydrophobic side chains into the interior of the molecule, away from solvent. So, • Hydrophobic residues predominantly within a central structural core. Tight packing (crystal-like). • Hydrophilic residues predominantly on the protein surface, exposed to solvent. But main chain is highly polar. This forces the formation of SSEs in the core. So, • Core residues tend to be in SSEs. • Loops are on the outside of the protein.

Protein Structure and Evolution • Rate of evolution of genomic DNA sequence reflects degree of functional constraint. • Protein coding regions evolve much more slowly than non-coding regions: • need to maintain stable 3D protein structure, • need to maintain vital biological function.

Rates of Protein Sequence Evolution • Sequences of highly constrained structures evolve very slowly (eg: histones). • Less constrained ones evolve more quickly (eg: immunoglobulins). • In general: response to mutation is structural change, but many mutations will not (or only slightly) change the structure => Structure is better conserved than sequence.

Evolution of SSEs and Loops • Residues in the hydrophobic core (SSEs) are constrained by the need for tight packing: • changes rarely accepted - evolution is slow. • Residues on the surface (loops) are less constrained (simply need to be hydrophilic): • aa substitution less restricted – evolution is quicker.

Evolution of Key Residues • Residues with key functional roles will be conserved. • Eg: active site residues involved in catalysis. • BUT: gene duplication can lead to change of function without changing structure. • Residues with key structural roles also tend to be conserved. Eg: • GLY: high conformational flexibility => tight turns,… • PRO: side chain bonds back to the backbone => tight turns. • CYS: disulfide bridges.

Structure Prediction by Homology Multiple sequence / structure alignments measure differences in evolutionary rates of residues, and thus • Contain more information than a single sequence for applications such as homology modeling and secondary structure prediction, • Give location of conserved regions and motifs, residues buried in the protein core or exposed to solvent, plus important secondary structures. More on homology modeling next week.
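To make the idea of per-position information concrete, here is a minimal sketch that turns a multiple sequence alignment into column-wise residue frequencies. The alignment is a toy example invented for illustration, and gaps and sequence weighting are ignored.

```python
from collections import Counter

# Toy alignment invented for illustration; real alignments are far larger.
alignment = [
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MKVLS",
]


def column_profile(rows):
    """Return one dict of residue frequencies per alignment column."""
    n_rows = len(rows)
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({aa: count / n_rows for aa, count in counts.items()})
    return profile


profile = column_profile(alignment)
for index, frequencies in enumerate(profile, start=1):
    top_residue, top_frequency = max(frequencies.items(), key=lambda kv: kv[1])
    print(index, top_residue, f"{top_frequency:.2f}")
```

Columns dominated by a single residue are candidates for conserved positions, while columns with many different residues are more likely to be variable, for instance in surface loops.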

Secondary Structure Prediction Three generations: • Single residue statistical analysis: • For each amino acid type, assign its ‘propensity’ to be in a helix, sheet, or coil. • Limited accuracy: ~55-60% on average. • Eg: Chou-Fasman (1974), not used any more.
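A propensity can be computed as the ratio between how often a residue type occurs in a given state and how often it occurs overall. The sketch below follows that general idea; it is not the published Chou-Fasman parameter set, and the single training example is far too small for meaningful statistics.

```python
from collections import Counter


def propensities(sequences, assignments):
    """Propensity of each (residue, state) pair.

    propensity = (fraction of residues in the state that are aa)
                 / (fraction of all residues that are aa)
    Values above 1 mean the residue is over-represented in that state.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    state_counts = Counter()
    for sequence, assignment in zip(sequences, assignments):
        for aa, state in zip(sequence, assignment):
            pair_counts[(aa, state)] += 1
            residue_counts[aa] += 1
            state_counts[state] += 1
    total = sum(residue_counts.values())
    return {
        (aa, state): (count / state_counts[state]) / (residue_counts[aa] / total)
        for (aa, state), count in pair_counts.items()
    }


# Toy training set: one short sequence with its state string.
table = propensities(["MSEGEDDFPRKRTPWCFDDEHMC"], ["CCHHHHHHCCCCEEEEEECCCCC"])
for (aa, state), value in sorted(table.items()):
    print(aa, state, round(value, 2))
```

Note that the keys are (amino acid, state) pairs, so ("E", "H") means glutamate in a helix, not a strand.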

Secondary Structure Prediction • Segment-based statistics: • Look for correlations (within 11-21 aa windows). • Many algorithms have been tried. • Best performing: Neural Networks: • Input: a number of protein sequences with their known secondary structure. • Output: a trained network that predicts secondary structure elements for given query sequences. • Accuracy < 70%. • Eg: GORII, COMBINE.
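Window-based methods look at each residue together with its neighbours. The sketch below only shows how such windows might be cut from a sequence; the padding character and the function name are assumptions, and the actual encoding used by any given program may differ.

```python
def sliding_windows(sequence: str, window: int = 13, pad: str = "-"):
    """Yield (central residue, window string) for each position.

    The sequence is padded so that every residue, including those at the
    termini, sits at the centre of a full-length window.
    """
    if window % 2 == 0:
        raise ValueError("window length must be odd")
    half = window // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + window]


for residue, context in sliding_windows("MSEGEDDFPRKRTPWCFDDEHMC", window=11):
    print(residue, context)
```

Each window would then be encoded numerically and paired with the known state of its central residue to form one training example.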

Neural Networks [Figure: the query is fed to the trained network, which outputs a 3-state prediction for each residue. Picture from B. Rost, 1999.]

Secondary Structure Prediction • Using information from evolution: • Compute a sequence profile from a multiple sequence alignment. • Use profile instead of query as input to Neural Network. • 6-8 percentage points increase in accuracy over Neural Network only. • Eg: • PHD/PROF: alignments by MaxHom (B. Rost, 1996/2000) • PSI-PRED: alignments from Psi-Blast (D.T. Jones, 1999) • Accuracy: 72% ± 11%. • Accuracy measured as Q3 = (# of correctly predicted secondary structure states) / (total # of residues).
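The Q3 measure is simple to compute once a predicted and an observed 3-state string are available. In this sketch the predicted string is hypothetical, made up to show two boundary errors.

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


observed = "CCHHHHHHCCCCEEEEEECCCCC"
predicted = "CCCHHHHHCCCCCEEEEECCCCC"  # hypothetical prediction
print(f"Q3 = {q3(predicted, observed):.1%}")
```

Q3 counts residues, not elements: a prediction can find every helix and strand and still lose points at the segment boundaries, as here.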

Accuracy Illustration Psi-Pred benchmark on set of 187 chains. (D.T. Jones, 1999) Your query could be here !! In particular, accuracy can be as low as 50% for a given query => Use many different methods and compare answers.

Other Structural Features There are other structural features that one can try to predict: • coiled-coils, • membrane helices, • solvent accessibility, • globularity, • disulfide bridges, • conformational switches, • …

POPULAR SERVERS FOR DEALING WITH SECONDARY STRUCTURES Coiled-coils Transmembrane helices Secondary structure Metaservers

Prediction of coiled-coils Coiled-coils are generally solvent-exposed, multi-stranded helix structures. Helix periodicity and solvent exposure impose a special pattern, the heptad repeat: … abcdefg … [Figure: two-stranded coiled-coil; helical diagram of 2 interacting helices, with hydrophobic and hydrophilic residues marked. From the Wikipedia Leucine zipper article.]
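In a canonical heptad repeat, positions a and d are usually hydrophobic and form the interface between the helices, while the remaining positions tend to be hydrophilic. The sketch below labels a sequence with heptad positions and measures how hydrophobic the a and d positions are. The hydrophobic set is one common grouping, the peptide is made up, and this is not the COILS algorithm.

```python
HYDROPHOBIC = set("AVLIMFWC")  # one common grouping; classifications vary
HEPTAD = "abcdefg"


def heptad_register(sequence: str, first_position: str = "a") -> str:
    """Label each residue with its heptad position a-g."""
    offset = HEPTAD.index(first_position)
    return "".join(HEPTAD[(offset + i) % 7] for i in range(len(sequence)))


def core_hydrophobic_fraction(sequence: str, first_position: str = "a") -> float:
    """Fraction of a and d positions occupied by hydrophobic residues."""
    register = heptad_register(sequence, first_position)
    core = [aa for aa, pos in zip(sequence, register) if pos in "ad"]
    return sum(aa in HYDROPHOBIC for aa in core) / len(core) if core else 0.0


# Idealised, made-up repeat with Leu at positions a and d.
peptide = "LKELEDK" * 4
for start in HEPTAD:
    print(start, round(core_hydrophobic_fraction(peptide, start), 2))
```

Trying all seven starting positions mimics, very roughly, the search for the best frame: only the correct register places the leucines at a and d.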

The COILS server at EMBnet • Compares a sequence to a database of known, parallel two-stranded coiled-coils, and derives a similarity score. • By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation. • Options: • scoring matrices, • window size (score may vary), • weighting options.

COILS Limitations • The program works well for parallel two-stranded structures that are solvent-exposed but runs progressively into problems with the addition of more helices, their antiparallel orientation and their decreasing length. • The program fails entirely on buried structures.

COILS Demo Let us submit the sequence
>1jch_A
VAAPVAFGFPALSTPGAGGLAVSISAGALSAAIADIMAALKGPFKFGLWGVALYGVLPSQ
IAKDDPNMMSKIVTSLPADDITESPVSSLPLDKATVNVNVRVVDDVKDERQNISVVSGVP
MSVPVVDAKPTERPGVFTASIPGAPVLNISVNNSTPAVQTLSPGVTNNTDKDVRPAFGTQ
GGNTRDAVIRFPKDSGHNAVYVSVSDVLSPDQVKQRQDEENRRQQEWDATHPVEAAERNY
ERARAELNQANEDVARNQERQAKAVQVYNSRKSELDAANKTLADAIAEIKQFNRFAHDPM
AGGHRMWQMAGLKAQRAQTDVNNKQAAFDAAAKEKSDADAALSSAMESRKKKEDKKRSAE
NNLNDEKNKPRKGFKDYGHDYHPAPKTENIKGLGDLKPGIPKTPKQNGGGKRKRWTGDKG
RKIYEWDSQHGELEGYRASDGQHLGSFDPKTGNQLKGPDPKRNIKKYL
to the COILS server at EMBnet: http://www.ch.embnet.org/software/COILS_form.html

mtidk matrix, no weights, all window lengths

Frame probabilities at each residue. Columns: window size of 14, 21, 28 aa. high probability heptads

Transmembrane Region Prediction Transmembrane regions: • Usually contain residues with hydrophobic side chains (surface must be hydrophobic). • Usually ~20 residues long, can be up to 30 if not perpendicular through membrane. Methods: • Hydropathy plots (historical, better methods now available) • Threading (TMpred, MEMSAT), • Hidden Markov Model (TMHMM), • Neural Network (PHDhtm).
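The typical lengths quoted above follow from simple geometry. An α-helix rises about 1.5 Å per residue (5.4 Å pitch divided by 3.6 residues per turn). The 30 Å hydrophobic thickness used below is an assumed round figure, not a value from the slides.

```python
import math

RISE_PER_RESIDUE = 1.5   # Å per residue along the axis of an α-helix
CORE_THICKNESS = 30.0    # Å, assumed hydrophobic thickness of the bilayer


def residues_to_span(thickness: float, tilt_degrees: float = 0.0) -> float:
    """Residues needed for a helix tilted by tilt_degrees from the membrane normal."""
    path_length = thickness / math.cos(math.radians(tilt_degrees))
    return path_length / RISE_PER_RESIDUE


for tilt in (0, 30, 45):
    print(tilt, round(residues_to_span(CORE_THICKNESS, tilt), 1))
```

With these numbers a perpendicular helix needs about 20 residues, and a strongly tilted one approaches 30, consistent with the range given above.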

Hydropathy Plots (Kyte-Doolittle) • compute an average hydropathy value for each position in the query sequence, • window length of 19 usually chosen for membrane-spanning region prediction. Peaks between scales 1-2?
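A hydropathy profile of this kind can be sketched as a sliding-window average over the Kyte-Doolittle scale. The function name and the test sequence are made up for illustration. A cutoff of about 1.6 for a 19-residue window is often quoted for membrane-spanning segments; treat that figure as a rule of thumb rather than a fixed threshold.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19):
    """Return (centre position, mean hydropathy) for each full window.

    Centre positions are 1-based. Sequences shorter than the window
    yield an empty list.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    half = window // 2
    return [
        (start + half + 1, sum(scores[start:start + window]) / window)
        for start in range(len(scores) - window + 1)
    ]


# Made-up sequence: a 19-residue Leu stretch flanked by charged residues.
toy = "DKDKDKDKDK" + "L" * 19 + "DKDKDKDKDK"
profile = hydropathy_profile(toy)
peak = max(profile, key=lambda item: item[1])
print(peak)
```

Plotting the second element of each pair against the first gives the familiar hydropathy plot, with candidate transmembrane segments showing up as broad peaks.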

Secondary Structure Prediction