Support Vector Machine-based Transmembrane Protein Topology Prediction
Tim Nugent
Alpha-helical Transmembrane Proteins
• Transmembrane proteins fulfil many critical cellular functions.
• Comprise about 30% of the human proteome.
• Composed of hydrophobic, membrane-spanning alpha-helices connected by loop regions.
• Poorly represented in structural databases.
• Predicting their structure and topology is therefore an important challenge for bioinformatics.
Using Support Vector Machines for TM Topology Prediction
• Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, Phobius) and neural networks (MEMSAT3) have been developed.
• They have achieved significant improvements in prediction accuracy (~80%).
• However, none of the top-scoring methods use SVMs.
• While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers.
• To handle TM topology prediction, multiple SVMs have to be combined, e.g.
  • TM helix / Loop
  • Inside Loop / Outside Loop
  • Signal Peptide / ¬Signal Peptide
  • Re-entrant Loop / ¬Re-entrant Loop
Assembling a Novel Data Set of Transmembrane Proteins
• To study and predict features of transmembrane (TM) proteins, a high-quality data set containing sequences with experimentally confirmed TM regions is essential for both training and validation.
• Based on the Möller set and the MPTOPO database. Novel TM sequences parsed from SWISS-PROT and searched against the PDB with BLAST.
• Remove fragments, chain breaks, colicins, venoms etc.
• Homology reduce at 40% sequence identity.
• Topologies determined by OPM or PDB_TM.
• Since PDB structures of TM proteins contain no lipid, theoretical approaches are used to predict the position of the membrane relative to the structure, and thus the TM helix boundaries.
  • OPM uses water-lipid transfer energy minimisation.
  • PDB_TM uses hydrophobicity/structural feature analysis.
Novel Data Set
• Theoretical membrane placement onto the crystal structure of the mechanosensitive channel protein MscS (PDB code 2oau) by OPM (left) and PDB_TM (right). The membrane region is between the red and blue bars.
Re-entrant Helices
• Re-entrant helices in Aquaporin Z (left) from Escherichia coli (PDB code 1rc2) and a potassium channel (right) from Bacillus cereus (PDB code 2ahy), marked with black arrows.
Support Vector Machine Training
• Data set of 131 non-redundant protein sequences.
• Jackknife cross-validation: sequences with >25% sequence identity removed from training sets.
• Signal peptide SVM: 10-fold cross-validation plus additional data from the Phobius set and SWISS-PROT (2654 sequences).
• PSI-BLAST profiles vs UniRef90. E-value threshold for inclusion = 0.001.
• Normalise by Z-score.
• 27-35 (update - 41) residue sliding window.
• Transduction.
• Optimise window size, kernel choice and parameters using the Matthews correlation coefficient (MCC).
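
For reference, the Matthews correlation coefficient used as the optimisation target can be computed from per-residue confusion counts. The sketch below uses the standard MCC definition; the function names, the label alphabet ("H" for helix, "L" for loop) and the convention of returning 0 when the denominator is zero are assumptions for illustration, not details taken from the slides.

```python
import math


def confusion_counts(true_labels, predicted_labels, positive="H"):
    """Count TP, TN, FP, FN for one positive class (label names are hypothetical)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when a row or column of the
        # confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    counts = confusion_counts("HHLL", "HLLL")
    print(counts, matthews_corrcoef(*counts))
```

MCC ranges from -1 to 1 and stays informative when the two classes are unbalanced, which is the usual reason to prefer it over raw accuracy for per-residue classification.
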
Window Size

H/L SVM
Split 1:  37    39    35    33    31
MCC 1:    0.79  0.79  0.79  0.79  0.79
Split 2:  37    35    39    33    31
MCC 2:    0.82  0.82  0.82  0.82  0.81
* * *

I/O SVM
Split 1:  43    45    41    39    37    35    33
MCC 1:    0.66  0.66  0.66  0.66  0.65  0.64  0.63
Split 2:  45    43    41    39    37    35    33
MCC 2:    0.55  0.55  0.55  0.55  0.54  0.54  0.52
* * * * *

Max TM helix length = 33 residues
Average TM helix length = 21 residues
Average topogenic loop (< 60 residues) length = 19 residues
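
To make the window-size parameter concrete, the following is a minimal sketch of how a per-residue feature vector might be built from a sequence profile with a centred sliding window. It assumes an odd window size (all sizes listed above are odd) and zero padding beyond the termini; the padding scheme, the function name and the profile layout (one row of scores per residue) are assumptions, not details given in the slides.

```python
def sliding_windows(profile, window_size):
    """Return one flattened feature vector per residue.

    profile     -- list of rows, one row of profile scores per residue
    window_size -- odd number of residues in the window (e.g. 33)
    Positions outside the sequence are zero-padded (assumed scheme).
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd integer")
    half = window_size // 2
    n_cols = len(profile[0]) if profile else 0
    pad_row = [0.0] * n_cols
    windows = []
    for centre in range(len(profile)):
        features = []
        for pos in range(centre - half, centre + half + 1):
            row = profile[pos] if 0 <= pos < len(profile) else pad_row
            features.extend(row)
        windows.append(features)
    return windows


if __name__ == "__main__":
    toy_profile = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for vector in sliding_windows(toy_profile, 3):
        print(vector)
```

With a 20-column profile and a 33-residue window, each residue would be described by 660 values under this layout.
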
Dynamic Programming
• Modified version of the original MEMSAT algorithm, treating TM helices as discrete units rather than separating them into inside, outside and middle components.
• Re-entrant helix and signal peptide states were added.
• Residues were therefore predicted to lie in one of five topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide.
• For evaluating signal peptide preference, positive signal peptide scores for residues up to position 30 in a target sequence were added to the outside loop score and subtracted from the inside loop score, in order to direct prediction towards a non-cytoplasmic amino terminus.
• The value was also scaled by a factor of 10 and subtracted from the TM helix SVM score to prevent TM helix prediction.
• For the same reason, positive re-entrant helix scores were scaled by a factor of 10 and subtracted from the TM helix SVM score.
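
The score adjustments described above can be sketched as a preprocessing step applied before the dynamic programming search. This is only an illustration: it assumes separate per-residue inside-loop and outside-loop scores, treats "up to position 30" as the first 30 residues (0-based indices 0 to 29), and uses hypothetical function and parameter names. The dynamic programming itself is not shown.

```python
def adjust_scores(helix, inside, outside, signal, reentrant,
                  sp_limit=30, scale=10.0):
    """Return adjusted (helix, inside, outside) per-residue score lists.

    All inputs are equal-length lists of raw SVM scores per residue.
    sp_limit -- number of N-terminal residues considered for signal
                peptide scores (assumed to mean the first 30 residues)
    scale    -- factor applied before subtracting from the helix score
    """
    helix = list(helix)
    inside = list(inside)
    outside = list(outside)
    for i in range(len(helix)):
        sp = signal[i]
        if i < sp_limit and sp > 0:
            # Favour a non-cytoplasmic N-terminus.
            outside[i] += sp
            inside[i] -= sp
            # Discourage calling the signal peptide a TM helix.
            helix[i] -= scale * sp
        re = reentrant[i]
        if re > 0:
            # Discourage calling a re-entrant helix a TM helix.
            helix[i] -= scale * re
    return helix, inside, outside


if __name__ == "__main__":
    print(adjust_scores([1.0, 1.0], [0.0, 0.0], [0.0, 0.0],
                        [0.5, -0.2], [0.0, 0.1]))
```

Negative signal peptide and re-entrant scores leave the other scores untouched, matching the restriction to positive scores in the bullets above.
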
Overall Prediction Accuracy
• Benchmark results for the SVM-based method ('TMSVM') against a selection of leading topology predictors. 'Correct signal peptide' and 'correct re-entrant helix' refer to correct topology prediction for proteins containing these features. TMSVM was able to detect signal peptides with 92% accuracy, and re-entrant helices with 39% accuracy. No false positives of either class were predicted.
• OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between test and training sets.
• Tested against the Möller (low-resolution) data set: scores 77%, the same as MEMSAT3.
Discriminating between TM and Globular Proteins
• For SVM training, we used 416 randomly chosen proteins from the MEMSAT3 [11] set, which consists of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins.
• The remaining 2269 sequences were used as test cases. PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance, again removing sequences from the training fold with greater than 25% sequence identity to any sequence in the test fold.
• Window size = 33, kernel = RBF, MCC = 0.78
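
The homology filtering applied to each training fold can be sketched as below. The `identity` callable, which is assumed to return percent sequence identity for a pair of sequence identifiers, is hypothetical; in practice the values would come from pairwise alignments, which the slides do not describe.

```python
def filter_training_fold(train_ids, test_ids, identity, threshold=25.0):
    """Drop training sequences that are too similar to any test sequence.

    identity  -- hypothetical callable (id_a, id_b) -> percent identity
    threshold -- sequences with identity strictly greater than this
                 value to any test sequence are removed
    """
    kept = []
    for train_id in train_ids:
        if all(identity(train_id, test_id) <= threshold
               for test_id in test_ids):
            kept.append(train_id)
    return kept


if __name__ == "__main__":
    # Toy identity table; pairs not listed are treated as 0% identical.
    table = {("a", "x"): 30.0, ("b", "x"): 10.0, ("c", "x"): 25.0}

    def lookup(p, q):
        return table.get((p, q), 0.0)

    print(filter_training_fold(["a", "b", "c"], ["x"], lookup))
```

Filtering the training fold rather than the test fold keeps the test set fixed across folds, so reported performance is not inflated by close homologues seen during training.
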
Conclusions
• Novel SVM-based approach predicts correct topology with 88% accuracy, 9% higher than the next best method, OCTOPUS.
• Incorporates signal peptide and re-entrant helix prediction.
• Signal peptide-containing proteins correctly predicted with 92% accuracy.
• Re-entrant helix-containing proteins correctly predicted with 55% accuracy: room for improvement.
• Good TM/globular protein discrimination; combined with SP prediction, highly suited to whole-genome analysis.
Further work
• SVM to predict amphipathic/pore-forming helices.