Alpha-helical transmembrane protein structure prediction Timothy Nugent

Alpha-helical transmembrane proteins • Transmembrane (TM) proteins are involved in a wide range of vital cellular processes. • Comprise about 20-30% of a typical proteome. • Composed of hydrophobic, membrane-spanning alpha-helices, connected with loop regions. • Poorly represented in structural databases. • Bottleneck is obtaining sufficient quantities of material for crystallisation trials. • Recombinant overexpression techniques are successful but have worked mainly for producing prokaryotic TM proteins. • Predicting TM protein structure is therefore an important challenge for bioinformatics.

Transmembrane protein topology • Topology of a transmembrane protein describes which regions are membrane-spanning and which are 'inside' or 'outside' (e.g. cytoplasmic/extracellular or cytoplasmic/lumenal). • Number and position of TM helices. • Position of the N-terminus. • Topology can provide insight into a protein's function. • Can direct further experimental work.

Using support vector machines for topology prediction • Early prediction methods were based on the physicochemical principle of a sliding window of hydrophobicity combined with the 'positive-inside' rule. • Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (MEMSAT3) have been developed. • They have achieved significant improvements in prediction accuracy (~80%). • However, none of the top scoring methods use SVMs. • While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. • In order to deal with TM topology prediction, multiple SVMs will have to be combined, e.g. • TM helix / loop • Inside loop / Outside loop • Signal peptide / ¬Signal peptide • Re-entrant loop / ¬Re-entrant loop
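The sliding-window idea behind the early methods can be sketched as follows. This is a minimal illustration only: the Kyte-Doolittle hydropathy scale, the 19-residue window and the 1.6 threshold are assumptions (the slides do not name a scale or parameters), and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (assumed choice; the slides do not name a scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window, reported at the window centre.

    `window` is expected to be odd so that each window has a central residue.
    """
    half = window // 2
    profile = []
    for centre in range(half, len(seq) - half):
        segment = seq[centre - half:centre + half + 1]
        profile.append((centre, sum(KD[aa] for aa in segment) / window))
    return profile


def candidate_tm_centres(seq, window=19, threshold=1.6):
    """Window centres whose mean hydropathy exceeds an illustrative threshold."""
    return [c for c, score in hydropathy_profile(seq, window) if score > threshold]
```

Only the 20 standard residues are handled; a sequence containing other letters (e.g. X) would raise a KeyError in this sketch. The 'positive-inside' rule, which orients the loops, is not modelled here.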

Re-entrant helices

Data set composition • Based on Möller set and MPTOPO database. • Novel TM sequences parsed from SWISS-PROT and blasted vs PDB. • Remove fragments, chain breaks, colicins, venoms etc. • Homology reduce at 40% sequence identity. • Topologies determined by OPM or PDB_TM.

Support vector machine training • Data set of 131 non-redundant protein sequences. • Jackknife cross-validation - sequences with >25% sequence identity removed from training sets. • Signal peptide SVM – 10-fold cross-validation + additional data from Phobius set and SWISS-PROT. • PSI-BLAST profiles vs UniRef90. E-value threshold for inclusion = 0.001. • Normalise by Z-score. • 27-35 residue sliding window. • Optimise window size, kernel choice and parameters using the Matthews correlation coefficient (MCC). • SVM outputs fed into dynamic programming algorithm.
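For reference, the Matthews correlation coefficient used as the optimisation target has the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A small sketch of computing it from per-residue binary labels is given below; the function names are hypothetical, and returning 0 when the denominator vanishes is a common convention rather than something stated in the slides.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC; returns 0.0 if any marginal total is zero (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it more informative than raw accuracy when the two classes are unbalanced.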

Per residue SVM prediction accuracy

Overall prediction accuracy • OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between test and training sets. • Tested vs the Möller data set – scores 79%. • Tested vs the TOPDB data set – scores 67%.

Ubiquinol Oxidase

Glycerol uptake facilitator

Discriminating between TM and globular proteins • For SVM training, 416 non-redundant chains from globular proteins of known structure were combined with our novel set of 131 TM proteins. • 2269 globular sequences were used as test cases. • PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance. • Window size = 33, Kernel = RBF, MCC = 0.78

Whole genome analysis

Conclusions • Novel SVM-based approach predicts correct topology with 89% accuracy, 10% higher than next best method OCTOPUS. • Incorporates signal peptide and re-entrant helix prediction. • Signal peptide containing proteins correctly predicted with 93% accuracy. • Re-entrant helix containing proteins correctly predicted with 64% accuracy. • Good TM/globular protein discrimination – highly suited to whole genome analysis.

Alpha-helical transmembrane protein fold prediction • Despite significant efforts to predict TM protein topology, little attention has been directed toward developing a method to pack the helices together. • Since the membrane-spanning region is predominantly composed of alpha-helices with a common alignment, this task should in principle be easier than predicting the fold of globular proteins. • However, unusual structural features render simple lattice models that may work for regularly packed proteins unable to model the diverse packing arrangements now present in structural databases. • We have used predicted lipid exposure, residue contacts, sequence statistics and a graph-based approach to find the optimal arrangement.

Contact prediction in membrane proteins • Many methods exist for predicting residue contacts in globular proteins, but only two specifically for membrane proteins. • Methods trained to predict contact maps in globular proteins perform poorly when applied to membrane proteins. • Although still poorly represented in structural databases, there are now enough PDB structures to train methods using membrane proteins. • Incorporate membrane protein-specific features into the prediction. • Directing further experimental work. • Assist with structural and functional analysis. • Decoy discrimination. • Reliable prediction of helix interaction is an important step in prediction and classification of membrane protein folds.

Predicting residue contacts using support vector machines • Of the dataset of 131 sequences, 74 have >1 TM helix. Sequences with >25% identity removed from training sets. • Only looking at interactions within a single chain, not between chains. • 3 contact definitions: • backbone/side chain heavy atoms are within 5.5 Å. • C-beta atoms are within 6 Å or distance between interacting pair is less than the sum of their VDW radii + 0.6 Å. • C-beta atoms are within 8 Å (C-alpha for glycine). • Features – 7 residue window centred on each residue in pair. Using PSI-BLAST profiles. • Predicted lipid exposure for each residue. • Binary vector representing distance between residues. • Two values representing relative position in each helix – equivalent to a Z coordinate. • Other features, e.g. lengths of each helix, length of sequence, result in lower MCC. • SVM training files have roughly balanced positive/negative ratio (1:1.25).
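The third contact definition (C-beta atoms within 8 Å, C-alpha for glycine) can be illustrated with a short sketch. The residue representation (a dict with a three-letter name and coordinate tuples), the function names and the `min_separation` parameter are all hypothetical, and treating "within" as "less than or equal to" is an assumption.

```python
import math


def contact_atom(residue):
    """Return C-beta coordinates, or C-alpha for glycine, which has no C-beta."""
    return residue["CA"] if residue["name"] == "GLY" else residue["CB"]


def residue_contacts(residues, cutoff=8.0, min_separation=1):
    """List index pairs (i, j), i < j, whose contact atoms lie within `cutoff` angstroms.

    `min_separation` is a hypothetical knob for skipping sequence neighbours.
    """
    contacts = []
    for i in range(len(residues)):
        for j in range(i + min_separation, len(residues)):
            distance = math.dist(contact_atom(residues[i]), contact_atom(residues[j]))
            if distance <= cutoff:
                contacts.append((i, j))
    return contacts
```

In practice the pairs of interest would be restricted to residues in different TM helices; that filtering is left out here to keep the distance criterion itself in view.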

An SVM to predict lipid exposure • CGDB • Surround protein with randomly placed DPPC lipids and solvent. • Carry out minimum of 200 ns of MD using GROMACS. • Database contains simulation properties, final system coordinates, and analysis of lipid-protein interactions. • Labels generated using CGDB MD data where fraction of time exposed to lipid is > 0.5. • Thresholds in the range 0-1 were tried, but 0.5 appears to be optimal. • Used PSI-BLAST profile. • Windows of 7 residues. • 77 proteins from CGDB were used for training. • 74 proteins in our set were jackknifed.

Lipid exposure per-residue SVM results • Compared with LIPS (LIPid-facing Surface) - a method for prediction of helix-lipid interfaces of TM helices from sequence information alone. • Produces LIPS score = lipophilicity * entropy • No threshold is given by LIPS – used GA to find optimal value.

Adding the lipid exposure SVM scores to the residue-residue interaction SVM • 2 features added – raw lipid exposure SVM score for each residue in the interacting pair.

Helix-Helix interaction scores • Helix-helix interaction requires one pair of contacts to be correctly predicted. • 593 interacting helices. • 815 non-interacting helices. * No cross validation on 41 sequences common to TMHIT training set.
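One way to go from residue-level contacts to helix-helix interactions is sketched below. It assumes that a helix pair is called interacting as soon as at least one contact links the two helices, which is one reading of the criterion above; the helix boundaries are taken as inclusive residue index ranges, and all names are hypothetical.

```python
def helix_index(position, helices):
    """Return the index of the helix containing `position`, or None for loop residues."""
    for idx, (start, end) in enumerate(helices):
        if start <= position <= end:
            return idx
    return None


def interacting_helix_pairs(contacts, helices, min_contacts=1):
    """Collapse residue contacts into helix pairs supported by >= min_contacts contacts."""
    counts = {}
    for i, j in contacts:
        hi, hj = helix_index(i, helices), helix_index(j, helices)
        if hi is None or hj is None or hi == hj:
            continue  # ignore loop residues and contacts within one helix
        pair = (min(hi, hj), max(hi, hj))
        counts[pair] = counts.get(pair, 0) + 1
    return sorted(pair for pair, n in counts.items() if n >= min_contacts)
```

Raising `min_contacts` trades sensitivity for specificity; the value of 1 simply mirrors the single-contact criterion stated on the slide.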

A graph-based approach to find the optimal helix packing arrangement • Helices (vertices) and their interactions (edges) can be represented as a graph. • By employing a force-directed algorithm, the method attempts to minimise edge crossings while maintaining uniform edge length, attributes common in native structures. • Uses Kamada-Kawai algorithm.
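The Kamada-Kawai idea can be illustrated with a small pure-Python sketch: helices are vertices, predicted interactions are edges, and a 2D layout is scored by a spring energy that pulls each pair of vertices towards a separation equal to their graph distance. The sketch assumes unit ideal edge length and uses plain gradient descent with step halving in place of the vertex-by-vertex Newton-Raphson updates of the original algorithm; function names and parameters are hypothetical, and this is not the implementation used in the work described.

```python
import math
from collections import deque


def graph_distances(n, edges):
    """All-pairs shortest path lengths (counted in edges) by breadth-first search."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    dist = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == math.inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def kk_energy(pos, dist):
    """Spring energy: sum over pairs of 0.5 * (|p_i - p_j| - d_ij)^2 / d_ij^2."""
    energy = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = dist[i][j]
            if d == math.inf:
                continue  # disconnected pairs are ignored in this sketch
            energy += 0.5 * (math.dist(pos[i], pos[j]) - d) ** 2 / d ** 2
    return energy


def relax(pos, dist, steps=500, step=0.1):
    """Lower the energy by gradient descent, halving the step whenever a move fails."""
    pos = [list(p) for p in pos]
    energy = kk_energy(pos, dist)
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in pos]
        for i in range(len(pos)):
            for j in range(len(pos)):
                d = dist[i][j]
                if i == j or d == math.inf:
                    continue
                r = math.dist(pos[i], pos[j]) or 1e-9  # avoid division by zero
                coeff = (r - d) / (d ** 2 * r)
                grad[i][0] += coeff * (pos[i][0] - pos[j][0])
                grad[i][1] += coeff * (pos[i][1] - pos[j][1])
        trial = [[p[0] - step * g[0], p[1] - step * g[1]] for p, g in zip(pos, grad)]
        trial_energy = kk_energy(trial, dist)
        if trial_energy < energy:
            pos, energy = trial, trial_energy
        else:
            step /= 2
    return pos, energy
```

For example, four helices interacting in a ring could be encoded as `edges = [(0, 1), (1, 2), (2, 3), (3, 0)]`, with `relax` applied to any starting coordinates. Because moves are only accepted when they reduce the energy, the returned layout is never worse than the starting one under this score.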

A graph-based approach to find the optimal helix packing arrangement

Conclusions • A tool to predict lipid exposure of residues within the membrane. 70% accuracy, an improvement of ~8% over the LIPS method. • A tool to generate a contact map and predict helix-helix interactions within the membrane. Up to 67% accurate – significant improvement over all other methods (apart from TMHIT, though those results were not cross-validated). • Uses a graph-based approach to generate optimal packing arrangement.

References • Nugent T and Jones DT. Transmembrane Protein Topology Prediction using Support Vector Machines. BMC Bioinformatics 2009, 10:159. • Lobley AE, Nugent T, Orengo CA, Jones DT. FFPred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic Acids Res. 2008 36:W297-302. • Nugent T, Mole SE, Jones DT. The transmembrane topology of Batten disease protein CLN3 determined by consensus computational prediction constrained by experimental data. FEBS Lett. 2008 Apr 2;582(7):1019-24. • Nugent T and Jones DT. Membrane Protein Structure Prediction. From Protein Structure to Function with Bioinformatics. Edited by Daniel Rigden. Springer.
