Support Vector Machine-based Transmembrane Protein Topology Prediction Tim Nugent


Alpha-helical Transmembrane Proteins • Transmembrane proteins fulfil many critical cellular functions. • Comprise about 30% of the human proteome. • Composed of hydrophobic, membrane-spanning alpha-helices, connected with loop regions. • Poorly represented in structural databases. • Predicting their structure and topology is therefore an important challenge for bioinformatics.

Machine Learning-based Approaches

Using Support Vector Machines for TM Topology Prediction • Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (MEMSAT3) have been developed. • They have achieved significant improvements in prediction accuracy (~80%). • However, none of the top-scoring methods use SVMs. • While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. • To handle TM topology prediction, multiple SVMs have to be combined, e.g. • TM helix / Loop • Inside Loop / Outside Loop • Signal Peptide / ¬Signal Peptide • Re-entrant Loop / ¬Re-entrant Loop
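As a rough illustration of how one annotated topology could supply training labels for the four binary classifiers listed above, here is a minimal Python sketch. The one-letter state alphabet (i, o, M, R, S), the task names and the choice to exclude residues that belong to neither class of a given SVM are all assumptions made for this example, not details taken from the slides.

```python
# Hypothetical one-letter topology alphabet (an assumption, not from the slides):
#   i = inside loop, o = outside loop, M = TM helix,
#   R = re-entrant helix, S = signal peptide

def binary_labels(topology):
    """Turn one topology string into per-residue labels for four binary SVMs.

    +1 / -1 are the two classes; None marks residues that this sketch
    leaves out of a given classifier's training data.
    """
    rules = {
        "helix_vs_loop": {"M": 1, "i": -1, "o": -1},
        "inside_vs_outside": {"i": 1, "o": -1},
        "signal_peptide": {"S": 1},
        "reentrant": {"R": 1},
    }
    # These two tasks are treated as one-vs-rest: every other state is negative.
    one_vs_rest = {"signal_peptide", "reentrant"}

    labels = {}
    for task, mapping in rules.items():
        default = -1 if task in one_vs_rest else None
        labels[task] = [mapping.get(state, default) for state in topology]
    return labels


if __name__ == "__main__":
    for task, values in binary_labels("SSiiMMMooRRoo").items():
        print(task, values)
```

Each classifier then sees only the residues relevant to its own decision, which is one simple way to reuse a single annotated data set for several binary problems.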

Assembling a Novel Data Set of Transmembrane Proteins • To study and predict features of transmembrane (TM) proteins, a high-quality data set of sequences with experimentally confirmed TM regions is essential for both training and validation. • Based on the Möller set and the MPTOPO database. Novel TM sequences were parsed from SWISS-PROT and searched against the PDB with BLAST. • Fragments, chain breaks, colicins, venoms etc. were removed. • Homology-reduced at 40% sequence identity. • Topologies determined by OPM or PDB_TM. • Since PDB structures of TM proteins contain no lipid, theoretical approaches are used to predict the position of the membrane relative to the structure, and thus the TM helix boundaries. • OPM uses water-lipid transfer energy minimisation. • PDB_TM uses hydrophobicity/structural feature analysis.


Novel Data Set • Theoretical membrane placement onto the crystal structure of the mechanosensitive channel protein MscS (PDB code 2oau) by OPM (left) and PDB_TM (right). The membrane region lies between the red and blue bars.

Re-entrant Helices • Re-entrant helices in Aquaporin Z (left) from Escherichia coli (PDB code 1rc2) and Potassium channel (right) from Bacillus cereus (PDB code 2ahy) marked with black arrows.

Support Vector Machine Training • Data set of 131 non-redundant protein sequences. • Jackknife cross-validation: sequences with >25% sequence identity removed from training sets. • Signal peptide SVM: 10-fold cross-validation plus additional data from the Phobius set and SWISS-PROT (2654 sequences). • PSI-BLAST profiles vs UniRef90. E-value threshold for inclusion = 0.001. • Normalise by Z-score. • 27-35 (update: 41) residue sliding window. • Transduction. • Optimise window size, kernel choice and parameters using the Matthews correlation coefficient (MCC).
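The Z-score normalisation and sliding-window steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the author's pipeline: the profile is taken to be a list of per-residue score rows (for example the 20 columns of a PSI-BLAST profile), the Z-score is computed over the whole profile, and window positions beyond the termini are zero-padded. The function names are hypothetical.

```python
from statistics import mean, pstdev


def zscore_profile(profile):
    """Z-score normalise a profile (list of per-residue score rows).

    Assumption: mean and standard deviation are taken over every value
    in the profile; the slides do not say how the normalisation is scoped.
    """
    values = [v for row in profile for v in row]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero for flat profiles
    return [[(v - mu) / sigma for v in row] for row in profile]


def window_features(profile, center, window=35):
    """Concatenate the profile rows in a window centred on one residue.

    Assumption: positions that fall outside the sequence are zero-padded.
    `window` should be odd so that the window is symmetric.
    """
    half = window // 2
    width = len(profile[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * width)
    return features


if __name__ == "__main__":
    toy_profile = [[1.0, 3.0], [5.0, 7.0], [2.0, 2.0]]  # 3 residues, 2 columns
    normalised = zscore_profile(toy_profile)
    print(window_features(normalised, center=0, window=3))
```

Each residue then yields one fixed-length feature vector (window size times profile width) that can be passed to an SVM.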

Window Size
H/L SVM
Split 1: 37   39   35   33   31
MCC 1:   0.79 0.79 0.79 0.79 0.79
Split 2: 37   35   39   33   31
MCC 2:   0.82 0.82 0.82 0.82 0.81
* * *
I/O SVM
Split 1: 43   45   41   39   37   35   33
MCC 1:   0.66 0.66 0.66 0.66 0.65 0.64 0.63
Split 2: 45   43   41   39   37   35   33
MCC 2:   0.55 0.55 0.55 0.55 0.54 0.54 0.52
* * * * *
• Max TM helix length = 33 residues • Average TM helix length = 21 residues • Average topogenic loop (< 60 residues) length = 19 residues
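For reference, the MCC values reported above follow the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), where TP, TN, FP and FN are the per-residue true/false positive/negative counts. A minimal Python sketch is given below; the function name is hypothetical, and returning 0 when the denominator is zero is a common convention assumed here, not something stated in the slides.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention: undefined MCC reported as 0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    print(matthews_cc(50, 50, 0, 0))    # perfect prediction
    print(matthews_cc(25, 25, 25, 25))  # no better than chance
```

MCC ranges from -1 to +1 and stays informative when the two classes are unbalanced, which makes it a reasonable criterion for comparing window sizes and kernels.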

Per Residue SVM Prediction Accuracy

Dynamic Programming • Modified version of the original MEMSAT algorithm, treating TM helices as discrete units rather than separating them into inside, outside and middle components. • Re-entrant helix and signal peptide states were added. • Residues were therefore predicted to lie in one of five topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide. • To evaluate signal peptide preference, positive signal peptide scores for residues up to position 30 in a target sequence were added to the outside loop score and subtracted from the inside loop score, in order to direct prediction towards a non-cytoplasmic amino terminus. • The value was also scaled by a factor of 10 and subtracted from the TM helix SVM score to prevent TM helix prediction. • For the same reason, positive re-entrant helix scores were scaled by a factor of 10 and subtracted from the TM helix SVM score.
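The score adjustments described above can be sketched as follows. This only illustrates the adjustment step that precedes the dynamic programming, not the MEMSAT-style algorithm itself. It assumes that every SVM produces one raw score per residue and that the inside and outside loop scores are held as separate per-residue lists; the function and parameter names are hypothetical.

```python
def adjust_scores(helix, inside, outside, signal, reentrant,
                  sp_limit=30, scale=10.0):
    """Apply the signal peptide and re-entrant helix adjustments.

    All arguments are equal-length lists of per-residue SVM scores
    (an assumption of this sketch). Returns adjusted copies of the
    helix, inside loop and outside loop scores.
    """
    helix, inside, outside = list(helix), list(inside), list(outside)
    for i in range(len(helix)):
        # Positive signal peptide scores within the first `sp_limit`
        # residues favour an outside (non-cytoplasmic) N-terminus ...
        if i < sp_limit and signal[i] > 0:
            outside[i] += signal[i]
            inside[i] -= signal[i]
            # ... and, scaled up, discourage a TM helix at that position.
            helix[i] -= scale * signal[i]
        # Positive re-entrant helix scores likewise suppress TM helices.
        if reentrant[i] > 0:
            helix[i] -= scale * reentrant[i]
    return helix, inside, outside


if __name__ == "__main__":
    h, i_, o = adjust_scores(
        helix=[1.0, 1.0, 1.0],
        inside=[0.5, 0.5, 0.5],
        outside=[0.5, 0.5, 0.5],
        signal=[0.25, -0.1, 0.25],
        reentrant=[0.0, 0.5, 0.0],
        sp_limit=2,
    )
    print(h, i_, o)
```

The adjusted scores would then be passed to the dynamic programming step, which selects the highest-scoring arrangement of the five topological regions.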

Overall Prediction Accuracy • Benchmark results for the SVM-based method ('TMSVM') against a selection of leading topology predictors. 'Correct signal peptide' and 'correct re-entrant helix' refer to correct topology prediction for proteins containing these features. TMSVM was able to detect signal peptides with 92% accuracy, and re-entrant helices with 39% accuracy. No false positives of either class were predicted. • OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between its test and training sets. • Tested against the Möller (low resolution) data set: scores 77%, the same as MEMSAT3.

Formate Dehydrogenase

Ubiquinol Oxidase

Glycerol uptake facilitator

ABC transporter BtuCD

Photosystem I

Discriminating between TM and Globular Proteins • For SVM training, we used 416 randomly chosen proteins from the MEMSAT3 [11] set, which consists of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins. • The remaining 2269 sequences were used as test cases. PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance, again removing sequences from the training fold with greater than 25% sequence identity to any sequence in the test fold. • Window size = 33, Kernel = RBF, MCC = 0.78

Whole Genome Analysis

Conclusions • Novel SVM-based approach predicts correct topology with 88% accuracy, 9% higher than the next best method, OCTOPUS. • Incorporates signal peptide and re-entrant helix prediction. • Signal peptide-containing proteins correctly predicted with 92% accuracy. • Re-entrant helix-containing proteins correctly predicted with 55% accuracy – room for improvement. • Good TM/globular protein discrimination – combined with SP prediction, highly suited to whole genome analysis. • Further work • SVM to predict amphipathic/pore-forming helices.
