V6 – Secondary Structure of TM proteins

Suggested reading for this lecture: Appl. Bioinf. 1, 21 (2002)
Outline:
- Introduction
- Prediction of secondary structure elements
- Performance on test sets
Membrane Bioinformatics – Part II

Introduction
Membrane proteins are crucial for survival:
- they are key components of cell-cell signaling
- they mediate the transport of ions and solutes across the membrane
- they are crucial for the recognition of self.
The pharmaceutical industry preferentially targets membrane-bound receptors. Particularly important is the large superfamily of G protein-coupled receptors (GPCRs), which are receptors for hormones, neurotransmitters, growth factors, light and odor-related ligands. More than 50% of prescription drugs act on GPCRs.

Topology of Membrane Proteins
Inside the lipid bilayer, the protein backbone cannot form hydrogen bonds with the aliphatic chains of the phospholipid molecules.
→ The backbone atoms need to form H-bonds among each other.
→ They adopt either α-helical or β-sheet conformations.

Topology of Membrane Proteins
http://www.biologie.uni-konstanz.de/folding/Structure%20gallery%201.html

History of membrane protein structure determination
- 1984 bacterial reaction center (Nobel Prize to Michel, Deisenhofer and Huber, 1988)
- 1990 EM map of bacteriorhodopsin (Henderson); 1997 high-resolution structure by Lücke; now several intermediates of the photocycle
- 1992 porin (complete β-barrel)
- 1995 cytochrome c oxidase
- 1998 halorhodopsin
- 1998 F1-ATPase (Nobel Prize to John Walker, 1997)
- 1998 KcsA ion channel (Nobel Prize to Roderick MacKinnon, 2003)
- 2000 aquaporin
- 2000 rhodopsin (Palczewski)
- 2002 SERCA Ca2+-ATPase (Toyoshima)
- 2003 voltage-gated ion channel
- 2005 Na+/H+ antiporter (Hunte)

Lipid bilayer simplifies the prediction problem
TM proteins are forced into two classes: α-helical or β-sheet.
α-helices are typically tilted by 10–45° with respect to the membrane normal.
The hydrophobic lipid bilayer reduces the three-dimensional structure formation almost to a 2D problem.

Predicting TM helix location
Hydrophobicity scales provide simple criteria to predict membrane helices. TM helices (TMH) can be predicted from the distinctive patterns of hydrophobic (TM) and polar (non-membrane or water-soluble) regions within the sequence.
Observed patterns:
(1) TM helices are predominantly apolar and 12-35 residues long.
(2) Globular regions between TMH are typically shorter than 60 residues.
(3) Most TMH proteins have a specific distribution of the positively charged amino acids arginine and lysine, the "positive-inside rule" (Gunnar von Heijne): connecting loop regions on the inside of the membrane carry more positive charges than loop regions on the outside.
(4) Long globular regions (> 60 residues) differ in their composition from those globular regions subject to the "inside-out-rule".

Kyte-Doolittle hydrophobicity scale (1982)
- Assign a hydropathy value to each amino acid.
- Use a sliding window to identify membrane regions: sum the hydropathy values over all w residues in a window of length w and divide by w to obtain the window average.
- Use a threshold T to assign a segment as a predicted membrane helix.
A window of w = 19 residues discriminated best between membrane and globular proteins. A threshold of T > 1.6 was suggested for the average over 19 residues.
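The sliding-window procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the input is assumed to contain only the 20 standard one-letter amino acid codes, and windows are reported by their 0-based start index.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(seq, w=19):
    """Return the mean hydropathy of every full window of length w."""
    values = [KD[aa] for aa in seq]
    if len(values) < w:
        return []
    total = sum(values[:w])
    averages = [total / w]
    # Slide the window by one residue: add the new value, drop the oldest.
    for i in range(w, len(values)):
        total += values[i] - values[i - w]
        averages.append(total / w)
    return averages


def predict_tm_windows(seq, w=19, threshold=1.6):
    """Return start indices of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, avg in enumerate(window_averages(seq, w)) if avg > threshold]


if __name__ == "__main__":
    # Hypothetical toy sequence: a poly-Leu stretch flanked by Asp residues.
    toy = "D" * 10 + "L" * 19 + "D" * 10
    print(predict_tm_windows(toy))
```

Overlapping windows above the threshold would normally be merged into one predicted segment; that step is left out here to keep the sketch short.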

More refined indices
One drawback of purely hydropathy-based methods is that they fail to discriminate accurately between membrane regions and highly hydrophobic globular segments.
PRED-TMR algorithm: combines hydropathy with the propensities of finding certain amino acid residues at the termini of TM helices.
Other hydrophobicity scales:
- Wimley & White: based on partition experiments of peptides between water/lipid bilayer and water/octanol
- TMFinder (Liu & Deber scale): based on HPLC retention times of peptides combined with non-polar phase helicity.
http://blanco.biomol.uci.edu/hydrophobicity_scales.html

Folding of helical membrane proteins
White, FEBS Lett. 555, 116 (2003)

Hydrophobicity Scales
White, FEBS Lett. 555, 116 (2003)

Translocon-assisted folding of TM proteins?
Upper picture (model!): the newly synthesized polypeptide chain of a membrane protein is inserted from the ribosome into the membrane via interaction with a TM complex, the “translocon” (EM map shown).
Lower picture: experiment largely supports the concerted view.
What determines insertion into the membrane?
White, FEBS Lett. 555, 116 (2003)

Integration of H-segments into the microsomal membrane
Ingenious experiment! Introduce a marker that shows whether helix segment H is inserted into the membrane or not.
a, Wild-type Lep has two N-terminal TM segments (TM1 and TM2) and a large luminal domain (P2). H-segments were inserted between residues 226 and 253 in the P2 domain. Glycosylation acceptor sites (G1 and G2) were placed in positions 96–98 and 258–260, flanking the H-segment. For H-segments that integrate into the membrane, only the G1 site is glycosylated (left), whereas both the G1 and G2 sites are glycosylated for H-segments that do not integrate into the membrane (right).
b, Membrane integration of H-segments with the Leu/Ala composition 2L/17A, 3L/16A and 4L/15A. Bands of unglycosylated protein are indicated by a white dot; singly and doubly glycosylated proteins are indicated by one and two black dots, respectively.
Hessa et al., Nature 433, 377 (2005)

Insertion determined by simple physical chemistry
Measure the fraction of singly glycosylated (f1g) vs. doubly glycosylated (f2g) Lep molecules.
c, ΔG_app values for H-segments with 2–4 Leu residues. Individual points for a given n show ΔG_app values obtained when the position of Leu is changed.
d, Mean probability of insertion (p) for H-segments with n = 0–7 Leu residues.
Hessa et al., Nature 433, 377 (2005)
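The slide only states that f1g and f2g are measured. A minimal sketch of how these fractions translate into an apparent free energy follows, assuming the definition used by Hessa et al. (2005): an apparent equilibrium constant K_app = f1g / f2g and ΔG_app = -RT ln K_app, with the insertion probability p = f1g / (f1g + f2g). The function names and the default temperature of 298.15 K are assumptions made for this illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def delta_g_app(f1g, f2g, temperature=298.15):
    """Apparent free energy of insertion in kcal/mol.

    f1g: fraction of singly glycosylated molecules (H-segment inserted)
    f2g: fraction of doubly glycosylated molecules (H-segment not inserted)
    temperature: in kelvin (the default is an assumption, not from the source)
    """
    k_app = f1g / f2g
    return -R_KCAL * temperature * math.log(k_app)


def insertion_probability(f1g, f2g):
    """Fraction of molecules in which the H-segment is inserted."""
    return f1g / (f1g + f2g)


if __name__ == "__main__":
    # Hypothetical measurement: 90% singly, 10% doubly glycosylated.
    print(round(delta_g_app(0.9, 0.1), 2), insertion_probability(0.9, 0.1))
```

With this sign convention a negative ΔG_app means that insertion is favored, and ΔG_app = 0 corresponds to p = 0.5.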

Biological and biophysical ΔG^aa scales
a, ΔG_app^aa scale derived from H-segments with the indicated amino acid placed in the middle of the 19-residue hydrophobic stretch. Only Ile, Leu, Phe and Val really favor membrane insertion. All polar and charged residues are strongly disfavored.
b, Correlation between ΔG_app^aa values measured in vivo and in vitro.
c, Correlation between ΔG_app^aa and the Wimley–White water/octanol free energy scale for partitioning of peptides.
Hessa et al., Nature 433, 377 (2005)

Positional dependencies in ΔG_app
Tyr and Trp are favorable in the interface region.
a, Symmetrical H-segment scans with pairs of Leu (red), Phe (green), Trp (pink) or Tyr (light blue) residues. The Leu scan is based on symmetrical 3L/16A H-segments with a Leu-Leu separation of one residue (sequence shown at the top; the two red Leu residues are moved symmetrically outwards) up to a separation of 17 residues. For the Phe scan, the composition of the central 19 residues of the H-segments is 2F/1L/16A, for the Trp scan it is 2W/2L/15A, and for the Tyr scan it is 2Y/3L/14A. The ΔG_app value for the 4L/15A H-segment GGPGAAALAALAAAAALAALAAAGPGG is also shown (dark blue).
b, Red lines show ΔG_app values for symmetrical scans of 2L/17A (triangles), 3L/16A (circles) and 4L/15A (squares) H-segments.
c, Same as b but for a symmetrical scan with pairs of Ser residues in H-segments with the composition 2S/4L/13A.
Hessa et al., Nature 433, 377 (2005)

Using observed amino acid propensities
With the availability of more and more 3D structures, it became possible to train statistical approaches based on the observed frequencies of amino acids in membrane proteins vs. non-membrane proteins. The concept is similar to that of secondary structure prediction for globular proteins.
TMpred: uses statistical amino acid preferences for scoring.
SPLIT (Juretic et al.):
- uses amino acid preferences for the "state" membrane helix, derived from a data set of integral membrane proteins with partially known secondary structure
- combines these with preferences for β-strand, turn and non-regular secondary structure based on sets of soluble proteins with known structure.
This method can identify shorter, unstable or movable membrane helices.

Incorporating more information: TopPred
TopPred (von Heijne 1992) predicts the complete topology of membrane proteins by using
- hydrophobicity analysis
- automatic generation of possible topologies
- ranking of these topologies by the positive-inside rule.
TopPred uses a sliding trapezoid window to detect segments of outstanding hydrophobicity. The two bases of the trapezoid are 11 and 21 residues long.
TopPred decides which candidate segments to accept as TM helices by choosing the topology that yields the largest difference between the number of positively charged residues on the inside and on the outside.
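The ranking step can be illustrated with a minimal sketch of the positive-inside rule. It is not TopPred itself: it assumes the TM helices have already been located, counts only Lys and Arg over the full length of each loop, and uses a hypothetical function name and a (start, end) convention with 0-based, half-open intervals.

```python
def positive_charges(segment):
    """Number of Lys and Arg residues in a sequence segment."""
    return segment.count("K") + segment.count("R")


def orient_by_positive_inside(seq, helices):
    """Guess the location of the N-terminus from the positive-inside rule.

    seq: protein sequence in one-letter code
    helices: sorted list of (start, end) tuples, 0-based and half-open
    Returns (n_terminus, bias), where n_terminus is "in" or "out" and bias is
    the charge difference between the loops on the N-terminal side and the
    loops on the opposite side of the membrane.
    """
    # Cut the sequence into the loops between and around the helices.
    loops = []
    previous_end = 0
    for start, end in helices:
        loops.append(seq[previous_end:start])
        previous_end = end
    loops.append(seq[previous_end:])

    # Loops alternate sides: loops 0, 2, 4, ... share the side of the N-terminus.
    same_side = sum(positive_charges(loop) for loop in loops[0::2])
    other_side = sum(positive_charges(loop) for loop in loops[1::2])
    bias = same_side - other_side
    return ("in" if bias >= 0 else "out"), bias


if __name__ == "__main__":
    # Hypothetical two-helix protein with charged termini.
    toy = "MKKR" + "L" * 20 + "DDE" + "L" * 20 + "RKK"
    print(orient_by_positive_inside(toy, [(4, 24), (27, 47)]))
```

A tie (bias of zero) is resolved as "in" here purely for definiteness; a real predictor would treat such cases as undecided or fall back on other evidence.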

Improvements from dynamic programming: MEMSAT
MEMSAT (1994) combines statistical tables (log likelihoods) compiled from well-characterized TM proteins with a dynamic programming algorithm to recognize membrane topology models by expectation maximisation.
Residues are classified as being in one of 5 structural states:
- Li inside loop
- Lo outside loop
- Hi inside helix end
- Hm helix middle
- Ho outside helix end.
Helix end caps are defined to span 4 adjacent residues (one helical turn).
Procedure:
- Compile the propensities of the amino acids for the 5 states.
- Calculate a score relating a given sequence to a predicted topology.
- Dynamic programming guarantees that the optimal score is found.

Using evolutionary information
It is known from secondary structure prediction for globular proteins that using multiple sequence alignment information improves prediction accuracy significantly.
PHDhtm: predicts the location and topology of TM helices by a system of neural networks. It was later combined with dynamic programming.

Using evolutionary information
TMAP (1996): uses propensity values determined for segments of 21 consecutive residues in transmembrane segments (Pm), and for the flanking 4-residue caps of TM helices (Pe).
Residues with high Pm tend to be hydrophobic; residues with high Pe tend to be polar and basic.
Compute the compositional difference between the protein segments exposed to the two surfaces of a membrane for 12 important residues:
- mostly at the outside of membranes: Asn, Asp, Gly, Phe, Pro, Trp, Tyr, Val
- mostly inside: Ala, Arg, Cys, Lys.
Use the consensus over these 12 residues to predict the topology.

Using grammatical rules
The lipid bilayer constrains the structure of the membrane-passing regions of proteins in many ways.
TMHMM (Sonnhammer et al. 1998, Krogh et al. 2001) and HMMTOP (Tusnady & Simon 1998, 2001) implement Hidden Markov Models.
TMHMM: uses a cyclic model with 7 states for
- TM helix core
- TM helix caps on the N- and C-terminal side
- non-membrane region on the cytoplasmic side
- 2 non-membrane regions on the non-cytoplasmic side (for short and long loops, to account for different membrane insertion mechanisms)
- a globular domain state in the middle of each non-membrane region

Using grammatical rules
HMMTOP: uses a hidden Markov model distinguishing 5 structural states
- inside non-membrane region
- inside TMH cap
- membrane helix
- outside TMH cap
- outside non-membrane region
This model is similar to MEMSAT.
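How a cyclic state model enforces the "grammar" of a membrane protein can be shown with a toy hidden Markov model decoded by the Viterbi algorithm. This is a deliberately simplified sketch, not the TMHMM or HMMTOP architecture: it has only four states (inside loop, helix running in-to-out, outside loop, helix running out-to-in), it has no cap states, and all probabilities and the hydrophobic residue set are invented for illustration.

```python
import math

# Toy cyclic grammar: i -> io -> o -> oi -> i (all numbers are made up).
STATES = ["i", "io", "o", "oi"]
NEXT = {"i": "io", "io": "o", "o": "oi", "oi": "i"}
START = {"i": 0.6, "io": 0.0, "o": 0.4, "oi": 0.0}
STAY = 0.9  # probability of remaining in the same state
P_HYDROPHOBIC = {"i": 0.3, "io": 0.9, "o": 0.3, "oi": 0.9}
HYDROPHOBIC = set("AILMFVWC")


def log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def emission(state, aa):
    """Log probability of emitting a hydrophobic or non-hydrophobic residue."""
    p = P_HYDROPHOBIC[state]
    return log(p if aa in HYDROPHOBIC else 1.0 - p)


def transition(prev, cur):
    """Log probability of a transition; only 'stay' and 'advance' are legal."""
    if prev == cur:
        return log(STAY)
    if NEXT[prev] == cur:
        return log(1.0 - STAY)
    return float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    score = {s: log(START[s]) + emission(s, seq[0]) for s in STATES}
    backpointers = []
    for aa in seq[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda prev: score[prev] + transition(prev, cur))
            new_score[cur] = score[best] + transition(best, cur) + emission(cur, aa)
            pointers[cur] = best
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("KKRK" + "L" * 20 + "DDDD"))
```

Because illegal transitions have probability zero, every decoded path alternates between inside and outside loops, which is the property that lets HMM-based methods predict helix locations and topology in one pass.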

Availability of prediction methods
Many of these servers are also available through the meta-server META-PP at the site of Burkhard Rost.

Prediction accuracy
Authors have often claimed that their methods are > 90% accurate. However, Chen and Rost argue that most authors have significantly overestimated the accuracy of their methods.
(1) There are not enough high-resolution structures to allow a statistically significant analysis. Training and test sets may share members or contain homologous proteins. Using low-resolution experiments, e.g. gene fusion, is no workaround: low-resolution experiments differ from high-resolution structures almost as much as prediction methods do.
(2) All methods optimise some parameters. Methods perform much better on the proteins for which they were developed than on new proteins.

Prediction accuracy
(3) Methods using evolutionary information failed due to the surprising fact that membrane helices are not entirely conserved across species. This is surprising since it implies either that those proteins, e.g. GPCRs, do not perform similar cellular functions, or that the same function can in some cases be realized with a different number of membrane regions.
(4) Levels of prediction accuracy often cannot be compared appropriately between methods, since they are frequently based on different measures of prediction accuracy and on different data sets.

Most methods get number of helices right
All methods based on advanced algorithms tend to underestimate TM helices (%obs > %prd).
a Data set: sequence-unique subset of 36 high-resolution TM helical proteins from PDB. This is the largest subset of all 105 high-resolution membrane chains which fulfils the condition that no pair in the set has significant sequence similarity as defined in Rost (1999).
b Methods.
c Per-segment accuracy:
- Qok: percentage of proteins for which all TM helices are predicted correctly (allowed deviation of up to 3 residues)
- Q%obs(htm): percentage of all observed helices that are correctly predicted
- Q%prd(htm): percentage of all predicted helices that are correctly predicted
- TOPO: percentage of proteins for which the topology (orientation of helices) is correctly predicted (empty for methods that do not predict topology).
d Per-residue accuracy:
- Q2: percentage of correctly predicted residues in two states (membrane helix / non-membrane helix)
- Q%obs(2T): percentage of all observed TMH residues that are correctly predicted
- Q%prd(2T): percentage of all predicted TMH residues that are correctly predicted
- Q%obs(2N): percentage of all observed non-TMH residues that are correctly predicted
- Q%prd(2N): percentage of all predicted non-TMH residues that are correctly predicted.
e ERROR: the estimates for per-segment accuracy resulted from a bootstrap experiment with M = 100 and K = 18; the estimates for per-residue accuracy were obtained from standard deviations over Gaussian distributions for the respective score.
f Numbers in italics: two standard deviations below the numerically highest value in each column (set in bold letters).
NOTE: all methods are tested on the same set of proteins. However, the numbers are NOT from a cross-validation experiment, i.e. some methods may have used some of the proteins for training. Generally, newer methods are more likely to be overestimated than older ones. In particular, HMMTOP2, TMHMM1 and WW have been developed using ALL the proteins listed here.
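The per-residue measures defined in note d can be computed directly from an observed and a predicted two-state labelling. The sketch below is only an illustration of those definitions: the function name, the dictionary keys and the encoding ('T' for TM helix residue, 'N' for everything else) are assumptions of this example.

```python
def per_residue_scores(observed, predicted):
    """Two-state per-residue accuracy measures, in percent.

    observed, predicted: equal-length strings over 'T' (TM helix residue)
    and 'N' (non-TM residue). A measure is None if its denominator is zero.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == "T" and p == "T")  # TM, predicted TM
    tn = sum(1 for o, p in pairs if o == "N" and p == "N")  # non-TM, predicted non-TM
    fp = sum(1 for o, p in pairs if o == "N" and p == "T")  # non-TM, predicted TM
    fn = sum(1 for o, p in pairs if o == "T" and p == "N")  # TM, predicted non-TM

    def percent(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else None

    return {
        "Q2": percent(tp + tn, len(pairs)),
        "Q%obs_2T": percent(tp, tp + fn),
        "Q%prd_2T": percent(tp, tp + fp),
        "Q%obs_2N": percent(tn, tn + fp),
        "Q%prd_2N": percent(tn, tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical example: the predicted helix is shifted by one residue.
    print(per_residue_scores("NNTTTTNN", "NTTTTNNN"))
```

Q%obs corresponds to what is elsewhere called sensitivity (recall) and Q%prd to precision, so a method with Q%obs well below Q%prd is under-predicting membrane residues.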

Prediction accuracy
About 86% of the TMH residues predicted by the best methods are correctly predicted.
Assume that we consider a prediction of a membrane helix correct if the predicted and the observed helical regions differ by less than 3 residues.
→ The best current methods correctly predict all membrane helices for 70–75% of all proteins.
However, the topology is predicted correctly for only about half of all proteins.
The best method, HMMTOP2, had all of the listed proteins in its training set.
Simple hydrophobicity scales are less accurate than advanced methods.

All methods confuse TM helices with signal peptides
Signal peptides that are cleaved off secreted proteins usually contain stretches of hydrophobic residues resembling membrane helices.
The most accurate specialists for membrane prediction (TMHMM and PHDhtm) falsely predict about 30–40% of all signal peptides as TM helices.
Simple hydrophobicity scales predict more than 90% of the signal peptides as TM helices.

Many methods predict TM helices in globular proteins
Simple hydrophobicity scales reach levels close to 100% false positives.
Advanced methods (SOSUI, TMHMM1, PHDhtm) predict TM helices in less than 2% of all globular proteins.
Different methods predict similar numbers of TM proteins in genomes: about 10–30%.
The overall content of TM proteins in genomes of different complexity is similar. However, eukaryotes have significantly more proteins with > 10 TM helices than all other organisms.
The distribution also differs:
- eukaryotes have more 7TM proteins (receptors)
- prokaryotes have more 6TM and 12TM proteins (ABC transporters).

Future directions
- Meta servers yield improved predictions: > 90% correct topologies can be obtained by a simple majority vote between the results of various methods.
- TM helix prediction and signal peptide prediction should be combined.
- Useful: databases for particular families of TM proteins and sequence motifs, e.g. the GPCR database.
- Membrane-specific substitution matrices improve database searches, e.g. PHAT by Henikoff & Henikoff improved alignments of TM proteins.
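A simple majority vote of the kind mentioned for meta servers can be sketched as follows. The method names and the representation of a topology as a (number of helices, N-terminus location) tuple are hypothetical, and requiring a strict majority is one possible design choice rather than what any particular meta server does.

```python
from collections import Counter


def majority_topology(predictions):
    """Return the topology predicted by more than half of the methods.

    predictions: dict mapping a method name to its predicted topology, here a
    tuple (number_of_helices, n_terminus) such as (7, "out").
    Returns None if no topology reaches a strict majority.
    """
    if not predictions:
        return None
    counts = Counter(predictions.values())
    topology, votes = counts.most_common(1)[0]
    return topology if 2 * votes > len(predictions) else None


if __name__ == "__main__":
    # Hypothetical predictions for one protein from three methods.
    votes = {"method_a": (7, "out"), "method_b": (7, "out"), "method_c": (6, "in")}
    print(majority_topology(votes))
```

Returning None when the methods disagree keeps the vote conservative: proteins without a consensus can then be flagged for closer inspection instead of receiving an arbitrary answer.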

Summary
- TM helices are typically continuous stretches of mostly hydrophobic residues.
- Simple methods based on summing up hydrophobicities work acceptably but not very well.
- Advanced methods include additional features such as the "positive-inside rule".
- The currently most successful methods are based on Hidden Markov Models or neural networks.
- Performance should be evaluated using carefully separated training and test sets.
- It is possible to discriminate signal peptides and TM helices.
- Only SPLIT 4.0 may detect short helices that do not span the membrane.
