Alpha-helical transmembrane protein structure prediction Timothy Nugent
Alpha-helical transmembrane proteins • Transmembrane (TM) proteins are involved in a wide range of vital cellular processes. • Comprise about 20-30% of a typical proteome. • Composed of hydrophobic, membrane-spanning alpha-helices, connected with loop regions. • Poorly represented in structural databases. • Bottleneck is obtaining sufficient quantities of material for crystallisation trials. • Recombinant overexpression techniques are successful but have worked mainly for producing prokaryotic TM proteins. • Predicting TM protein structure is therefore an important challenge for bioinformatics.
Transmembrane protein topology • The topology of a transmembrane protein describes which regions are membrane-spanning and which are 'inside' or 'outside' (e.g. cytoplasmic/extracellular or cytoplasmic/lumenal). • Number and position of TM helices. • Position of the N-terminus. • Topology can provide insight into a protein's function. • Can direct further experimental work.
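As an illustration of what a topology encodes, the sketch below turns helix boundaries plus the N-terminus location into a per-residue label string. The function name, the 'i'/'o'/'M' labels and the 1-based inclusive coordinates are assumptions made for this example, and re-entrant loops and signal peptides are deliberately ignored.

```python
from typing import List, Tuple


def topology_string(length: int,
                    helices: List[Tuple[int, int]],
                    n_term_inside: bool) -> str:
    """Return one label per residue: 'M' membrane, 'i' inside, 'o' outside.

    helices holds 1-based, inclusive (start, end) positions sorted by start.
    Assumes every helix fully crosses the membrane (no re-entrant loops).
    """
    labels = []
    inside = n_term_inside  # side of the membrane the current loop sits on
    pos = 1
    for start, end in helices:
        labels.extend(('i' if inside else 'o') * (start - pos))  # loop before helix
        labels.extend('M' * (end - start + 1))                   # the helix itself
        inside = not inside  # a full crossing flips the side
        pos = end + 1
    labels.extend(('i' if inside else 'o') * (length - pos + 1))  # C-terminal tail
    return ''.join(labels)


print(topology_string(12, [(3, 5), (8, 10)], n_term_inside=True))
```

With two helices and a cytoplasmic N-terminus, the C-terminus ends up back on the inside, which is why the number of helices and the N-terminus position together fix the whole topology.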
Using support vector machines for topology prediction • Early prediction methods combined a sliding window of hydrophobicity with the 'positive-inside' rule. • More recently, methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (MEMSAT3) have been developed. • They have achieved significant improvements in prediction accuracy (~80%). • However, none of the top-scoring methods use SVMs. • While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. • To deal with TM topology prediction, multiple SVMs have to be combined, e.g. • TM helix / loop • Inside loop / Outside loop • Signal peptide / ¬Signal peptide • Re-entrant loop / ¬Re-entrant loop
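The early sliding-window approach can be sketched roughly as follows. The slide names no hydrophobicity scale or window length, so the Kyte-Doolittle scale, the 19-residue default window and the function names are all assumptions for illustration; this is not the SVM method described in this talk.

```python
# Kyte-Doolittle hydropathy scale (an assumption; the slide names no scale).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy in a window centred on each residue (truncated at the termini)."""
    half = window // 2
    scores = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half):min(len(seq), i + half + 1)]
        scores.append(sum(KD[aa] for aa in segment) / len(segment))
    return scores


def positive_charge(loop):
    """Count lysine and arginine residues in a loop sequence."""
    return sum(loop.count(aa) for aa in 'KR')


def n_term_inside(loops):
    """Positive-inside rule: the side whose loops carry more K/R is taken as cytoplasmic.

    loops are listed from N- to C-terminus, so even-indexed loops share the
    N-terminal side of the membrane.
    """
    same_side = sum(positive_charge(l) for l in loops[0::2])
    other_side = sum(positive_charge(l) for l in loops[1::2])
    return same_side >= other_side
```

Peaks in the profile suggest candidate membrane-spanning segments, and the charge balance of the loops between them suggests the orientation.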
Data set composition • Based on the Möller set and the MPTOPO database. • Novel TM sequences parsed from SWISS-PROT and BLASTed against the PDB. • Fragments, chain breaks, colicins, venoms etc. removed. • Homology reduced at 40% sequence identity. • Topologies determined by OPM or PDB_TM.
Support vector machine training • Data set of 131 non-redundant protein sequences. • Jackknife cross-validation: sequences with >25% sequence identity removed from training sets. • Signal peptide SVM: 10-fold cross-validation plus additional data from the Phobius set and SWISS-PROT. • PSI-BLAST profiles vs UniRef90. E-value threshold for inclusion = 0.001. • Normalise by Z-score. • 27-35 residue sliding window. • Optimise window size, kernel choice and parameters using the Matthews correlation coefficient (MCC). • SVM outputs fed into a dynamic programming algorithm.
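The feature preparation and the selection metric above can be sketched as below. How the profiles were actually normalised (per column, per sequence or globally) and how termini were padded are not stated on the slide, so the choices here, and all function names, are assumptions.

```python
import math


def z_normalise(values):
    """Z-score a list of numbers: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def window_features(profile, centre, size):
    """Concatenate the profile rows in a window centred on residue `centre`.

    profile is a list of per-residue rows (e.g. 20 PSI-BLAST scores each).
    Positions beyond either terminus are zero-padded (an assumption).
    """
    half = size // 2
    width = len(profile[0])
    features = []
    for i in range(centre - half, centre + half + 1):
        features.extend(profile[i] if 0 <= i < len(profile) else [0.0] * width)
    return features


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to 1 and stays informative when the two classes are unbalanced, which makes it a reasonable target when comparing window sizes and kernels.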
Overall prediction accuracy • OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between test and training sets. • Tested vs the Möller data set: scores 79%. • Tested vs the TOPDB data set: scores 67%.
Discriminating between TM and globular proteins • For SVM training, 416 non-redundant chains from globular proteins of known structure were combined with our novel set of 131 TM proteins. • 2269 globular sequences were used as test cases. • PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance. • Window size = 33, Kernel = RBF, MCC = 0.78
Conclusions • Novel SVM-based approach predicts correct topology with 89% accuracy, 10% higher than next best method OCTOPUS. • Incorporates signal peptide and re-entrant helix prediction. • Signal peptide containing proteins correctly predicted with 93% accuracy. • Re-entrant helix containing proteins correctly predicted with 64% accuracy. • Good TM/globular protein discrimination – highly suited to whole genome analysis.
Alpha-helical transmembrane protein fold prediction • Despite significant efforts to predict TM protein topology, little attention has been directed toward developing a method to pack the helices together. • Since the membrane-spanning region is predominantly composed of alpha-helices with a common alignment, this task should in principle be easier than predicting the fold of globular proteins. • However, unusual structural features render simple lattice models that may work for regularly packed proteins unable to model the diverse packing arrangements now present in structural databases. • We have used predicted lipid exposure, residue contacts, sequence statistics and a graph-based approach to find the optimal arrangement.
Contact prediction in membrane proteins • Many methods exist for predicting residue contacts in globular proteins, but only two specifically for membrane proteins. • Methods trained to predict contact maps in globular proteins perform poorly when applied to membrane proteins. • Although still poorly represented in structural databases, there are now enough PDB structures to train methods using membrane proteins. • Incorporate membrane protein-specific features into the prediction. • Can direct further experimental work. • Can assist with structural and functional analysis. • Can be used for decoy discrimination. • Reliable prediction of helix interactions is an important step in the prediction and classification of membrane protein folds.
Predicting residue contacts using support vector machines • Of the 131 sequences in the data set, 74 have more than one TM helix. Sequences with >25% sequence identity removed from training sets. • Only looking at interactions within a single chain, not between chains. • 3 contact definitions: (1) backbone/side chain heavy atoms are within 5.5 Å; (2) C-beta atoms are within 6 Å, or the distance between the interacting pair is less than the sum of their VDW radii + 0.6 Å; (3) C-beta atoms are within 8 Å (C-alpha for glycine). • Features: 7-residue window centred on each residue in the pair, using PSI-BLAST profiles. • Predicted lipid exposure for each residue. • Binary vector representing distance between residues. • Two values representing relative position in each helix, equivalent to a Z coordinate. • Other features, e.g. lengths of each helix, length of sequence, result in lower MCC. • SVM training files have a roughly balanced positive/negative ratio (1:1.25).
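The third contact definition (C-beta atoms within 8 Å, C-alpha for glycine) could be applied to a known structure roughly as follows to produce training labels. The input layout, the function name and the restriction to residue pairs in different helices are assumptions for this sketch, as is treating "within" as less than or equal to the cutoff.

```python
import math


def helix_contacts(coords, helix_of, cutoff=8.0):
    """Return residue pairs (i, j), i < j, in different helices within `cutoff` Å.

    coords[i]   -- (x, y, z) of the C-beta atom of residue i (C-alpha for glycine)
    helix_of[i] -- index of the TM helix containing residue i, or None for loops
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        if helix_of[i] is None:
            continue
        for j in range(i + 1, n):
            if helix_of[j] is None or helix_of[j] == helix_of[i]:
                continue  # skip loop residues and pairs within one helix
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts
```

Swapping in the 5.5 Å heavy-atom or 6 Å definitions would only change which atom coordinates are passed in and the cutoff value.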
An SVM to predict lipid exposure • CGDB • Surround protein with randomly placed DPPC lipids and solvent. • Carry out minimum of 200 ns of MD using GROMACS. • Database contains simulation properties, final system coordinates, and analysis of lipid-protein interactions. • Labels generated using CGDB MD data: a residue is labelled lipid-exposed where the fraction of time exposed to lipid is > 0.5. • Thresholds in the range [0-1] were tried, but 0.5 appears to be optimal. • Used PSI-BLAST profile. • Windows of 7 residues. • 77 proteins from CGDB were used for training. • 74 proteins in our set were jackknifed.
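The labelling step can be sketched as below. The real CGDB data format is not described on the slide, so the per-frame boolean input, the +1/-1 label encoding and both function names are hypothetical.

```python
def fraction_exposed(frames):
    """Fraction of simulation frames in which each residue contacts lipid.

    frames is a list of per-frame lists; frames[t][r] is True when residue r
    is in contact with lipid in frame t (hypothetical input format).
    """
    n_frames = len(frames)
    n_residues = len(frames[0])
    return [sum(frame[r] for frame in frames) / n_frames
            for r in range(n_residues)]


def exposure_labels(fractions, threshold=0.5):
    """Label a residue +1 (lipid-exposed) if its fraction exceeds the threshold, else -1."""
    return [1 if f > threshold else -1 for f in fractions]


frames = [[True, False], [True, True]]
print(exposure_labels(fraction_exposed(frames)))
```

Scanning `threshold` over the [0-1] range and retraining at each value is one way the reported optimum of 0.5 could be checked.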
Lipid exposure per-residue SVM results • Compared with LIPS (LIPid-facing Surface) - a method for prediction of helix-lipid interfaces of TM helices from sequence information alone. • Produces LIPS score = lipophilicity * entropy • No threshold is given by LIPS – used GA to find optimal value.
Adding the lipid exposure SVM scores to the residue-residue interaction SVM • 2 features added – raw lipid exposure SVM score for each residue in the interacting pair.
Helix-helix interaction scores • A helix-helix interaction counts as predicted when at least one residue contact between the two helices is correctly predicted. • 593 interacting helices. • 815 non-interacting helices. * No cross-validation on 41 sequences common to the TMHIT training set.
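Collapsing residue-level contact predictions into helix-level interactions might look like the sketch below, which reads the criterion above as "one predicted contact between two helices is enough". The data layout and function name are assumptions for illustration.

```python
def interacting_helices(contacts, helix_of):
    """Map predicted residue contacts onto pairs of interacting helices.

    contacts -- iterable of (i, j) residue index pairs predicted to be in contact
    helix_of -- dict from residue index to helix index (loop residues absent)
    Returns a set of (helix_a, helix_b) pairs with helix_a < helix_b.
    """
    pairs = set()
    for i, j in contacts:
        a, b = helix_of.get(i), helix_of.get(j)
        if a is None or b is None or a == b:
            continue  # ignore loop residues and contacts within a single helix
        pairs.add((min(a, b), max(a, b)))
    return pairs


helix_of = {10: 0, 11: 0, 40: 1, 41: 1, 80: 2}
print(interacting_helices([(10, 41), (11, 40), (41, 80)], helix_of))
```

The resulting helix pairs are exactly the edges needed to represent the protein as a graph of helices.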
A graph-based approach to find the optimal helix packing arrangement • Helices (vertices) and their interactions (edges) can be represented as a graph. • By employing a force-directed algorithm, the method attempts to minimise edge crossings while maintaining uniform edge length, attributes common in native structures. • Uses the Kamada-Kawai algorithm.
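To give a feel for the idea, the sketch below lays out a helix graph by minimising the spring energy that Kamada-Kawai is built on, where each pair of vertices is pulled towards its graph-theoretic shortest-path distance. It uses plain gradient descent rather than the vertex-by-vertex Newton-Raphson scheme of the published algorithm, assumes a connected graph, and is not the implementation used in this work; all names, the step size and the iteration count are illustrative.

```python
import math
from collections import deque


def shortest_paths(n, edges):
    """All-pairs shortest path lengths (in edges) by breadth-first search."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    dist = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] == math.inf:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def circle_start(n):
    """Place the n vertices evenly on a unit circle as a starting layout."""
    return [[math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)]
            for k in range(n)]


def stress(pos, dist):
    """Spring energy: squared deviation from target distance, weighted by 1/d^2."""
    total = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            total += (math.dist(pos[i], pos[j]) - dist[i][j]) ** 2 / dist[i][j] ** 2
    return total


def minimise_stress(start, dist, steps=500, rate=0.05):
    """Gradient descent on the spring energy, updating one vertex at a time."""
    pos = [list(p) for p in start]  # copy so the starting layout is preserved
    for _ in range(steps):
        for i in range(len(pos)):
            gx = gy = 0.0
            for j in range(len(pos)):
                if i == j:
                    continue
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9
                coeff = 2 * (d - dist[i][j]) / (dist[i][j] ** 2 * d)
                gx += coeff * dx
                gy += coeff * dy
            pos[i][0] -= rate * gx
            pos[i][1] -= rate * gy
    return pos
```

For a four-helix bundle whose predicted interactions form a chain, `minimise_stress(circle_start(4), shortest_paths(4, [(0, 1), (1, 2), (2, 3)]))` returns 2D positions with lower spring energy than the starting circle.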
Conclusions • A tool to predict lipid exposure of residues within the membrane. 70% accuracy, an improvement of ~8% over the LIPS method. • A tool to generate a contact map and predict helix-helix interactions within the membrane. Up to 67% accurate: a significant improvement over all other methods (apart from TMHIT, though those results were not cross-validated). • Uses a graph-based approach to generate the optimal packing arrangement.
References • Nugent T and Jones DT. Transmembrane Protein Topology Prediction using Support Vector Machines. BMC Bioinformatics 2009, 10:159. • Lobley AE, Nugent T, Orengo CA, Jones DT. FFPred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic Acids Res. 2008 36:W297-302. • Nugent T, Mole SE, Jones DT. The transmembrane topology of Batten disease protein CLN3 determined by consensus computational prediction constrained by experimental data. FEBS Lett. 2008 Apr 2;582(7):1019-24. • Nugent T and Jones DT. Membrane Protein Structure Prediction. From Protein Structure to Function with Bioinformatics. Edited by Daniel Rigden. Springer.