Sixth International Conference on Bioinformatics InCob2007, HongKong. T-cell EPITOPES PREDICTION OF HEMAGGLUTININ, NEURAMINIDASE AND MATRIX PROTEIN OF INFLUENZA A VIRUS USING SUPPORT VECTOR MACHINE AND HIDDEN MARKOV MODEL.
Sixth International Conference on Bioinformatics InCob2007, HongKong. T-cell EPITOPES PREDICTION OF HEMAGGLUTININ, NEURAMINIDASE AND MATRIX PROTEIN OF INFLUENZA A VIRUS USING SUPPORT VECTOR MACHINE AND HIDDEN MARKOV MODEL. Vo Cam Quy, Nguyen Thanh Khoi, Nguyen Thi Truc Minh, Tran Linh Thuoc. Department of Biotechnology, University of Natural Sciences, Vietnam National University – HoChiMinh city, VietNam
OUTLINE • Introduction • Epitope prediction methods • Influenza A virus • Materials And Methods • Results And Discussion • Conclusion and future work
Epitope in silico Analysis Peptide Multiepitope vaccines VACCINOME Candidate Epitope DB Epitope prediction Disease related protein DB Gene/Protein Sequence Database
Epitope • An epitope is the part of a macromolecule that is recognized by the immune system, specifically by antibodies, B cells, or T cells. • Most often described as a three-dimensional surface feature of an antigen molecule • Linear epitopes are determined by the amino acid sequence
EPITOPE PREDICTION STRATEGIES Epitope prediction B cell epitope prediction T cell epitope prediction Sequence Structure chemical features structure Statistical method Machine learning method Binding motifs, matrices Support Vector Machine, Artificial Neural Network… High accuracy Quantitative Matrices Hidden Markov Model Flexible model
T cell epitope prediction approach • Direct approach: Positive: putative epitope; Negative: non-epitope • Indirect approach: Positive: MHC binding peptides (binder); Negative: MHC-I non-binding peptides (non-binder) • The results of the two approaches are compared to obtain epitope candidates
Influenza A virus • Influenza A viruses continue to emerge from the aquatic avian reservoir and cause pandemics • The many variants and mutations in the population make vaccine production difficult • Genome: single-stranded negative-sense RNA in 8 segments • Hemagglutinin, neuraminidase and matrix protein are three of the proteins of greatest interest. Figure legend: Red: M2 protein; Green: hemagglutinin; Blue: neuraminidase; Inside: viral RNA. Source: http://www.roche.com/pages/facets/10/viruse.htm
OBJECTIVE • Build HMM and SVM models for T cell epitope prediction (MHC class I and II) • Direct approach (epitope prediction) • Indirect approach (MHC binder prediction) • Combine the results to get epitope candidates • Predict epitopes of Influenza A virus proteins for in silico vaccine design
METHODS 1 DATA COLLECTION AND PROCESSING: data collection (AntiJen, MHCBN, IEDB) → raw data → processing → training set 2 BUILDING MODEL: SVM method, HMM method 3 PARAMETERS OPTIMIZATION: training models → evaluating → optimal model 4 APPLYING: protein → predict → epitopes. Epitopes predicted by both methods / both approaches were considered as epitopes
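The final rule on this slide (a peptide counts as an epitope only when both methods predict it) amounts to a set intersection. The sketch below illustrates that idea only; the function name and the example peptides are hypothetical and are not taken from the authors' pipeline.

```python
def consensus_epitopes(svm_hits, hmm_hits):
    """Return the peptides predicted as epitopes by both methods, sorted."""
    return sorted(set(svm_hits) & set(hmm_hits))


# Made-up peptide lists, for illustration only.
svm_hits = ["AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD"]
hmm_hits = ["CCCCCCCCC", "DDDDDDDDD", "EEEEEEEEE"]
print(consensus_epitopes(svm_hits, hmm_hits))
```

The same function can be applied a second time to intersect the direct-approach and indirect-approach results.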
RESULTS OF DATA COLLECTION AND PROCESSING [Table: data sets by peptide type and allele; 24 data sets]
Step 2: BUILDING MODEL – HMM method Positive training set → ClustalW → Perl script → modelfromalign → Initial model • Result: 11 matrices x 6 alleles x 2 approaches = 132 initial models
Step 2: BUILDING MODEL – SVM method • Positive data (MHC class I and MHC class II binder/epitope data processing): motif information from the SYFPEITHI database (Perl script); choosing peptides conforming to the reported motif; 9-mer motif (binding core) • Negative data (non-binder/non-epitope data processing): the sequence is cut into overlapping 8-mers/9-mers
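Cutting a sequence into overlapping 8-mers or 9-mers is a sliding window of step 1. A minimal sketch follows; the function name is hypothetical and the peptide is used only as an example input.

```python
def overlapping_kmers(sequence, k=9):
    """Return every window of length k, moving one residue at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


peptide = "PPVPVSKVVSTDEYVAR"  # 17 residues, example input
nine_mers = overlapping_kmers(peptide, k=9)
eight_mers = overlapping_kmers(peptide, k=8)
print(len(nine_mers), nine_mers[0], nine_mers[-1])
```

A sequence shorter than k yields an empty list, so short fragments are simply skipped.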
STEP 3: PARAMETERS OPTIMIZATION HMM METHOD
TRAINING PRINCIPLE: PAIR OF MODELS • 132 initial models • 12 positive data sets → buildmodel (Baum-Welch or Viterbi) → positive model • 12 negative data sets → buildmodel (Baum-Welch or Viterbi) → negative model
10-FOLD CROSS VALIDATION Pair 1: the positive and negative data sets are split into 10 folds (1 to 10). In each round, one fold is the test set and the rest form the training set; the initial models (positive and negative) are trained on the training set and evaluated on the test set by ROC analysis, giving Acc. 1 to Acc. 10. The average accuracy is then computed.
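The 10-fold procedure can be sketched generically as below. This is an illustration under assumptions: `train_fn` and `accuracy_fn` are hypothetical callables standing in for model training and ROC-based evaluation, and the shuffling seed is arbitrary.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, labels, train_fn, accuracy_fn, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, average the scores."""
    accuracies = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        accuracies.append(accuracy_fn(model,
                                      [data[i] for i in test_idx],
                                      [labels[i] for i in test_idx]))
    return sum(accuracies) / len(accuracies)
```

The data set should contain at least k items so that no fold is empty.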
NLL CALCULATING PRINCIPLE Queried sequence (e.g. PPVPVSKVVSTDEYVAR) → hmmscore (Viterbi) against the positive model gives NLL 1; hmmscore (Viterbi) against the negative model gives NLL 2 → final NLL = NLL 1 – NLL 2 → the final NLL is compared with the threshold NLL to label the sequence as epitope or non-epitope
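The decision rule can be sketched as follows. The direction of the comparison is an assumption: the slide does not state it, and this sketch assumes a lower final NLL (a better fit to the positive model than to the negative one) means "epitope". The function name and the default threshold are hypothetical.

```python
def classify_by_nll(nll_positive, nll_negative, threshold=0.0):
    """Label a sequence from its two NLL scores.

    Assumption: final NLL below the threshold means 'epitope'.
    """
    final_nll = nll_positive - nll_negative
    label = "epitope" if final_nll < threshold else "non-epitope"
    return final_nll, label


# Made-up scores, for illustration only.
print(classify_by_nll(10.0, 15.0))
print(classify_by_nll(15.0, 10.0))
```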
ROC (Receiver Operating Characteristic) Analysis AROC (area under the ROC curve) > 90%: excellent prediction AROC > 80%: good prediction AROC < 80%: not acceptable prediction
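For reference, the area under the ROC curve equals the fraction of positive/negative pairs in which the positive example gets the higher score (ties count half). The sketch below computes it that way and applies the slide's grading; the function names are hypothetical, and treating exactly 80% as "not acceptable" is an assumption because the slide leaves that boundary undefined.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def grade(auc):
    """Map an AUC in [0, 1] to the slide's three categories."""
    if auc > 0.9:
        return "excellent"
    if auc > 0.8:
        return "good"
    return "not acceptable"  # assumption: exactly 0.8 falls here
```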
RESULTS OF VALIDATION Validation results for 22 pairs of models trained with the Baum-Welch and Viterbi algorithms in the indirect approach for the H-2-Db allele
STEP 3: PARAMETERS OPTIMIZATION SVM METHOD
LOOCV (LEAVE-ONE-OUT CROSS-VALIDATION) • One peptide is removed from the training set • The model is built from the remaining data • Testing is done on the removed peptide
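The three LOOCV steps can be sketched with a generic loop. This is not the authors' code: `train_fn` and `predict_fn` are hypothetical callables, and the majority-label "classifier" is only a placeholder standing in for an SVM.

```python
def leave_one_out_accuracy(peptides, labels, train_fn, predict_fn):
    """Hold out each peptide once, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(peptides)):
        train_x = peptides[:i] + peptides[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_fn(train_x, train_y)
        if predict_fn(model, peptides[i]) == labels[i]:
            correct += 1
    return correct / len(peptides)


def train_majority(train_x, train_y):
    """Placeholder 'model': the most frequent training label."""
    return max(set(train_y), key=train_y.count)


def predict_majority(model, peptide):
    """Placeholder prediction: always return the stored majority label."""
    return model


# Made-up peptides and labels (1 = binder, 0 = non-binder).
demo_peptides = ["AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD", "EEEEEEEEE"]
demo_labels = [1, 1, 1, 0]
print(leave_one_out_accuracy(demo_peptides, demo_labels,
                             train_majority, predict_majority))
```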
THE ACCURACY (MHC class I MODELS) [Chart: accuracy of the predictive models for the direct and indirect methods after the LOOCV procedure, by MHC allele (MHC class I)]
THE ACCURACY (MHC class II MODELS) [Chart: accuracy of the direct and indirect methods, by MHC allele]
OPTIMAL PARAMETERS (MHC CLASS I) Kernel functions: - Linear function - Polynomial function - RBF function - Sigmoid function
OPTIMAL PARAMETERS (MHC CLASS II) Kernel functions: - Linear function - Polynomial function - RBF function - Sigmoid function
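For readers unfamiliar with the four kernel families, their standard textbook forms are sketched below on plain numeric vectors. The parameter names (`gamma`, `coef0`, `degree`) and default values are assumptions chosen for illustration; the slides do not give the optimal values that were found.

```python
import math


def dot(x, y):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)
```

Peptides would first have to be encoded as numeric vectors before any of these kernels could be applied; that encoding step is not described in the slides.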
Total number of epitopes in Influenza A virus Table 7: The number of epitopes predicted by both the HMM and SVM methods, by protein and allele
WEB PREDICTION TOOL FOR HMM METHOD (cont) [Screenshot: positive results and number of positive sequences; negative results and number of negative sequences]
CONCLUSIONS • SVM method: the model accuracy • Indirect method is better • MHC class I: H-2-Db (86.58%), H-2-Kb (80.25%) and H-2-Kd (83.45%) • MHC class II: H-2-IEd (93.26%), H-2-IEk (95.19%), H-2-IAd (89.42%) • HMM method: the model accuracy • Direct method is better • MHC class I: H-2-Db (86%), H-2-Kb (84.54%) and H-2-Kd (84.72%) • MHC class II: H-2-IEd (93.90%), H-2-IEk (95.11%), H-2-IAd (77.84%)
CONCLUSIONS • Built HMM and SVM models for T cell epitope prediction (MHC class I and II) • Direct approach (epitope prediction) • Indirect approach (MHC binder prediction) with high accuracy • Successfully applied these models to epitope prediction of Influenza A virus proteins for in silico vaccine design
FUTURE WORK • Apply this tool to other proteins • Make the programs runnable via the web • B cell epitope prediction • Test the results by biological experiment • …