Using Support Vector Machines for transmembrane protein topology prediction
Tim Nugent
Alpha-helical Transmembrane Proteins
• Transmembrane proteins fulfil many critical cellular functions.
• Comprise about 30% of the human proteome.
• Composed of hydrophobic, membrane-spanning alpha-helices connected by loop regions.
• Poorly represented in structural databases.
• Predicting their structure and topology is therefore an important challenge for bioinformatics.
Transmembrane Protein Topology
• The topology of a transmembrane protein describes which portions of the amino-acid sequence lie within the plane of the surrounding lipid bilayer and which portions protrude into the aqueous environment on either side.
• Two things define it: which regions of the polypeptide chain span the membrane, and on which side of the membrane the N-terminus lies.
Identification of Transmembrane Regions
Aquaporin fragment: KGVWTQAFWKAVTAEFLAMLIFVLLSVGSTINWGGSEN
To generate data for a hydropathy plot, the protein sequence is scanned with a moving window of 19-21 residues. At each position, the mean hydrophobic index of the amino acids within the window is calculated, and that value is plotted at the midpoint of the window.
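The moving-window calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the Kyte-Doolittle hydropathy scale and a 19-residue window; the function name `hydropathy_profile` is made up for this example, and midpoints are reported as 0-based sequence indices.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Return (midpoint_index, mean_hydropathy) for every full window."""
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean = sum(KD[aa] for aa in segment) / window
        profile.append((start + half, mean))  # value is assigned to the window midpoint
    return profile

# Aquaporin fragment shown on the slide.
seq = "KGVWTQAFWKAVTAEFLAMLIFVLLSVGSTINWGGSEN"
profile = hydropathy_profile(seq)
peak_pos, peak = max(profile, key=lambda point: point[1])
print(f"most hydrophobic window is centred on residue index {peak_pos}")
```

Windows whose mean exceeds a chosen threshold are then candidate transmembrane segments; the threshold itself is a tuning choice that the slide does not specify.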
Discriminating between Inside and Outside Loops
• Hydrophobic: Val, Phe, Ile, Leu, Met.
• Positive: Lys, Arg, His.
• Cytoplasmic loops are enriched in positively charged residues: the 'positive-inside rule' of von Heijne.
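One simple way to apply the positive-inside rule is to count positively charged residues in alternating loops and place the richer set on the cytoplasmic side. The sketch below assumes the loops have already been delimited, counts only Lys and Arg by default, and breaks ties in favour of an inside N-terminus; the function name and these choices are illustrative assumptions, not part of the original slides. Pass `positive="KRH"` to follow the slide's list, which also includes His.

```python
def orient_by_positive_inside(loops, positive="KR"):
    """Guess the side of the N-terminal loop from positive-charge counts.

    loops: loop sequences in N- to C-terminal order; consecutive loops lie
           on opposite sides of the membrane.
    Returns "in" if the N-terminal loop is predicted cytoplasmic, else "out".
    """
    charged = set(positive)
    # Loops 0, 2, 4, ... share a side with the N-terminus; loops 1, 3, ... do not.
    same_side = sum(aa in charged for loop in loops[0::2] for aa in loop)
    other_side = sum(aa in charged for loop in loops[1::2] for aa in loop)
    return "in" if same_side >= other_side else "out"  # tie-break is an assumption

print(orient_by_positive_inside(["MKRRK", "SGT", "RK"]))   # in
print(orient_by_positive_inside(["SGT", "KRRK", "GS"]))    # out
```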
Using Evolutionary Information
Example profile scores from the slide (two rows of 20 values):
-190 -486 -409 -225 -483 223 -414 -327 -229 -389 -83 738 -236 -56 -424 -478 -100 -370 -32768 -40
-506 218 -282 -521 159 -410 410 155 -513 -225 -311 -354 -163 106 137 50 -100 -325 -32768 -403
(1) PSI-BLAST takes a single protein sequence as an input and compares it to a protein database.
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found.
(3) The profile is compared to the protein database, again seeking local alignments.
(4) PSI-BLAST estimates the statistical significance of the local alignments found.
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
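To make the idea of a profile concrete, the toy sketch below turns a small multiple alignment into per-column log-odds scores. It illustrates only the profile-building step in a simplified form: the uniform background frequencies, the flat pseudocount, and the `build_profile` name are assumptions for this example, and PSI-BLAST itself uses sequence weighting and a more elaborate pseudocount scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

# Hypothetical four-sequence alignment; "-" marks a gap.
alignment = ["LAVL", "LGIL", "LAVM", "L-VL"]
profile = build_profile(alignment)
print(round(profile[0]["L"], 2), round(profile[0]["D"], 2))
```

A conserved column scores its preferred residue positively and everything else negatively, which is the position-specific signal a downstream classifier can exploit.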
Using Support Vector Machines for Topology Prediction
• Earlier approaches relied on physicochemical properties such as hydrophobicity to identify transmembrane helices (e.g. Kyte-Doolittle).
• More recently, methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, Phobius) and neural networks (MEMSAT3) have been developed.
• They have achieved significant improvements in prediction accuracy (~80%).
• However, none of the top-scoring methods use SVMs.
• While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers.
• To deal with TM topology prediction, multiple SVMs will have to be combined, e.g.
  • TM helix / Loop
  • Inside Loop / Outside Loop
  • Signal Peptide / TM helix
  • Re-entrant Loop / TM helix
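As a rough illustration of how two binary classifiers might be chained, the sketch below labels each residue from a helix/loop score and an inside/outside score. The sign convention (a positive score means the first-named class), the label alphabet, and the `label_residues` name are assumptions made for this example; the slides do not describe the actual combination scheme, and a real predictor would also need to smooth the labels and enforce alternating loop sides.

```python
def label_residues(helix_scores, inside_scores):
    """Combine two per-residue binary decision values into one label string.

    helix_scores:  > 0 means TM helix, otherwise loop (assumed convention)
    inside_scores: > 0 means inside loop, otherwise outside loop (assumed convention)
    Labels: "M" membrane helix, "i" inside loop, "o" outside loop.
    """
    labels = []
    for helix, inside in zip(helix_scores, inside_scores):
        if helix > 0:
            labels.append("M")  # helix/loop classifier takes priority
        else:
            labels.append("i" if inside > 0 else "o")
    return "".join(labels)

# Hypothetical decision values for a five-residue stretch.
print(label_residues([-1.0, -0.5, 1.2, 0.8, -0.3],
                     [0.9, 0.4, 0.1, -0.2, -0.7]))  # iiMMo
```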
Helix / Loop SVM Prediction Accuracy
• TM helix / Loop SVM:
  • Database of 135 non-redundant protein sequences
  • Jackknife cross-validation
  • PSI-BLAST profiles
  • Normalised by Z-score
  • 33-residue sliding window
  • Radial Basis Function kernel: gamma = 0.09, C = 0.8
• SVM Matthews Correlation Coefficient (MCC) = 0.82
  • TP = 9129
  • FP = 1351
  • TN = 22140
  • FN = 1320
• Kyte-Doolittle MCC: 0.66
• MEMSAT3 MCC: 0.76
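The feature-preparation steps listed on this slide (Z-score normalisation followed by a sliding window over the profile) could look roughly like the sketch below. The slide does not say how the Z-scores were computed or how the sequence ends were handled, so per-column normalisation and zero-padding at the termini are assumptions here, as are the function names. With a 20-column profile and a 33-residue window, each residue yields a 660-value feature vector.

```python
from statistics import mean, pstdev

def zscore_columns(profile):
    """Z-score each profile column (assumption: normalise per column)."""
    columns = list(zip(*profile))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in profile]

def window_features(profile, window=33):
    """Build one flattened feature vector per residue from a centred window."""
    half = window // 2
    pad = [0.0] * len(profile[0])  # assumption: zero-pad beyond the termini
    features = []
    for i in range(len(profile)):
        vector = []
        for j in range(i - half, i + half + 1):
            vector.extend(profile[j] if 0 <= j < len(profile) else pad)
        features.append(vector)
    return features

# Toy two-column "profile" for three residues, with a 3-residue window.
toy = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
features = window_features(zscore_columns(toy), window=3)
print(len(features), len(features[0]))  # 3 6
```

Vectors like these could then be passed to any RBF-kernel SVM implementation, for example scikit-learn's `SVC(kernel="rbf", gamma=0.09, C=0.8)`; the slides do not name the SVM package that was actually used.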
Inside Loop / Outside Loop SVM Prediction Accuracy
• Inside Loop / Outside Loop SVM:
  • 33-residue sliding window
  • Matthews Correlation Coefficient = 0.64
  • Precision = 0.86
  • Recall = 0.59
• Signal Peptide / TM Helix and Re-entrant Loop / TM Helix SVMs in training...
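All three scores reported here are functions of the confusion-matrix counts. The counts for the inside/outside classifier are not given, so the sketch below checks itself against the helix/loop counts from the previous slide (TP=9129, FP=1351, TN=22140, FN=1320), which reproduce that slide's MCC of 0.82. The function name is illustrative.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall and Matthews correlation coefficient from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "mcc": mcc}

# Helix/loop SVM counts from the previous slide.
m = confusion_metrics(tp=9129, fp=1351, tn=22140, fn=1320)
print(round(m["mcc"], 2))  # 0.82
```

MCC is a useful summary here because the classes are unbalanced: loop residues outnumber helix residues roughly two to one in those counts, so plain accuracy would flatter a classifier biased towards the loop class.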
Further work
• Expand training set.
  • Additional sequences where the TM helices are known but the topology is not can be used to train the Helix/Loop classifier.
• Parameter optimisation.
  • Window size
  • Kernel type
• Transduction.
• Signal peptide SVM.
• Re-entrant loop SVM.
• Combine SVM raw scores/probabilities into a topology.