Prediction of Bacterial Effectors using SVM and Naïve Bayes classifier

Sneha Joshi
MU Informatics Institute
November 30, 2009

Effector Prediction
• What are effectors:
• Why predict effectors:
  • Prime candidates involved in host-pathogen interaction
  • Modulate host cell functions
• What are our goals:
  • Develop a classifier to classify pathogenic proteins into effectors or non-effectors
  • Identify important features of the signal
  • Provide potential drug targets

Available Methods
• Experimental:
  • Translocation assays using fusion proteins of a putative effector with a reporter gene
  • Detection of effectors in the supernatant
  • Prior knowledge is required to screen effectors experimentally
• Computational:
  • Homology to known effectors
    • Cannot predict novel effectors
  • Transcriptional co-regulation
  • Few methods exist, and each is limited to a single secretion system

SVM prediction
• Features from full length of protein → SVM 1
• Features from N-terminal 25 amino acids → SVM 2
• Features from C-terminal 25 amino acids → SVM 3
• SVM 1, SVM 2, SVM 3 → Naïve Bayes classifier → Effectors / Non-effectors
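The slides do not say how the Naïve Bayes classifier consumes the three SVM outputs. A minimal sketch, assuming each SVM contributes a binary vote (1 = effector, 0 = non-effector) and the combiner is a Bernoulli naive Bayes with Laplace smoothing. The class name `VoteNaiveBayes` and the toy votes are hypothetical, not the talk's data.

```python
import math

class VoteNaiveBayes:
    """Bernoulli naive Bayes over binary votes, with Laplace smoothing (illustrative)."""

    def fit(self, votes, labels):
        n_features = len(votes[0])
        self.log_prior = {}
        self.p_vote = {}  # p_vote[c][j] = P(vote j == 1 | class c)
        for c in (0, 1):
            rows = [v for v, y in zip(votes, labels) if y == c]
            self.log_prior[c] = math.log((len(rows) + 1) / (len(labels) + 2))
            self.p_vote[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def predict(self, vote):
        # Sum log-probabilities per class and return the more likely class.
        scores = {}
        for c in (0, 1):
            s = self.log_prior[c]
            for v, p in zip(vote, self.p_vote[c]):
                s += math.log(p if v == 1 else 1 - p)
            scores[c] = s
        return max(scores, key=scores.get)

# Toy training set (made up): each tuple is (SVM 1, SVM 2, SVM 3) votes.
votes = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1),
         (0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

combiner = VoteNaiveBayes().fit(votes, labels)
print(combiner.predict((1, 1, 0)))  # two of three SVMs vote "effector"
```

Real-valued SVM decision scores could replace the binary votes, but that would call for a Gaussian rather than Bernoulli likelihood.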

Features from Protein sequence
Example sequence: MLKYEERKLNNLTLSSFSKVGVSNDARL
• Amino acid composition
• Dipeptide composition
• Secondary structure
• Relative solvent accessibility
• Dielectric constant
• Charge
• Polar, non-polar, charged, acidic, basic amino acids
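Two of the listed features can be computed directly from the sequence. A minimal sketch, assuming the 20 standard amino acids and overlapping dipeptides; the function names are illustrative, and the structure-based features (secondary structure, solvent accessibility) need external predictors and are not covered.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def amino_acid_composition(seq):
    """Fraction of each standard amino acid in seq (seq must be non-empty)."""
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs among overlapping dipeptides."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = len(seq) - 1  # number of overlapping dipeptides; needs len(seq) >= 2
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: c / total for pair, c in counts.items()}

def terminal_windows(seq, width=25):
    """Return the N-terminal and C-terminal windows of the given width."""
    return seq[:width], seq[-width:]

seq = "MLKYEERKLNNLTLSSFSKVGVSNDARL"  # example sequence from the slide
n_term, c_term = terminal_windows(seq)
aac = amino_acid_composition(seq)
dpc = dipeptide_composition(seq)
print(n_term, c_term, round(aac["S"], 3))
```

The same two functions can be applied to `n_term` and `c_term` to build the inputs for the N-terminal and C-terminal classifiers.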

Features from Nucleotide sequence
• Distance from known effector

Results: Data

Results: SVM1: Full-length amino acids
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)

Results: SVM2: N-terminal 25 amino acids
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)

Results: SVM3: C-terminal 25 amino acids
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
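The precision and recall definitions used on these slides can be sketched as follows. The labels are made-up toy values (1 = effector, 0 = non-effector), not the results reported in the talk.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), with the positive class labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Return 0.0 when a denominator is empty, to avoid division by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # TP = 3, FP = 1, FN = 1
```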

Results
• Effect of predicted secondary structure and solvent accessibility on prediction accuracy

Results
• Effect of serine on prediction accuracy

Feature Selection
• Feature space reduction
• Correlation-based feature selection [1]
  • Hypothesis: good feature subsets contain features highly correlated with the class yet uncorrelated with each other.
• Feature space reduced to 36 dimensions for full length, 19 for N-terminal, and 25 for C-terminal.
[1] Mark Hall, Correlation-based Feature Selection for Machine Learning
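The hypothesis above is usually scored with the CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. A minimal sketch, assuming absolute Pearson correlation and a greedy forward search; the slides do not say which correlation measure or search strategy was used, and the toy features are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def cfs_merit(features, labels, subset):
    """CFS merit of a subset; features is a list of columns, subset a list of column indices."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[j], labels)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, labels):
    """Greedily add the feature that most improves the merit; stop when nothing improves it."""
    selected, best = [], 0.0
    remaining = list(range(len(features)))
    while remaining:
        score, j = max((cfs_merit(features, labels, selected + [j]), j) for j in remaining)
        if score <= best + 1e-12:  # no real improvement
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best

# Toy data: feature 1 duplicates feature 0, feature 2 is weakly informative.
labels = [1, 1, 1, 0, 0, 0]
features = [
    [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    [1, 0, 1, 0, 1, 0],
]
selected, merit = forward_select(features, labels)
print(selected, round(merit, 3))
```

In this toy case the duplicate feature adds no merit, which illustrates why redundant features are dropped during feature space reduction.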

Results after feature selection

Case study: Xanthomonas oryzae
• Causes leaf blight of rice
• Has T2SS and T3SS
• The system detects 2 effector substrates of the type II secretion system along with 6 other effectors of the type III secretion system.

Future Work
• Naïve Bayes Classifier:
• Application to biological system: Mycobacterium tuberculosis
• Evolutionary study of effector proteins
• Extending beyond bacterial secretion systems
  • Nematode effector proteins

Acknowledgement
• This work was supported by NSF Award #0845196
• Dmitry Korkin
• Gavin Conant

Thank You.
