A discriminative method for protein remote homology detection based on N-Gram. Reporter: Xie sifa. Mentor: Zou quan.
Outline: Introduction; Method; Improving precision and recall (P&R); Conclusion
Introduction. Protein homology detection: detect 10%~30% protein structure. Remote homology detection: ...ATTATCCGACGGCCGCCT... ...TCATCTGCACGGCCTCAC... Similarity < 25%. Source: Fundamentals of Bioinformatics (《生物信息学基础》), Sun Xiao, Lu Zuhong, Xie Jianming.
Process: Data Set -> Feature Extraction -> Classify
Data Set: benchmark (Liao and Noble, 2003). 4352 proteins, 54 families, similarity < 10^-25. For each family i: Train Set: same superfamily, different family. Test Set: same family, different family.
N-gram feature counts: 1-gram: 20; 2-gram: 400; 3-gram: 8000. Skip-N-gram ("A Closer Look at Skip-gram Modelling", David Guthrie, Ben Allison et al.): for "I hit the tennis ball", the ordinary 3-grams are "I hit the", "hit the tennis", "the tennis ball"; the skip-gram additionally yields "hit the ball" !!!
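The feature counts on this slide (20, 400, 8000) are 20^n for n = 1, 2, 3 over the 20 standard amino acids. The sketch below illustrates that counting and the k-skip-n-gram idea from the cited paper. It is a minimal illustration, not the authors' implementation: the names `ngram_features` and `skip_ngrams`, the alphabet ordering, and the choice to ignore windows containing non-standard residues are all assumptions.

```python
from itertools import combinations, product

# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n):
    """Return a count vector of length 20**n for a protein sequence."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    index = {gram: i for i, gram in enumerate(vocab)}
    counts = [0] * len(vocab)
    for i in range(len(sequence) - n + 1):
        j = index.get(sequence[i:i + n])
        if j is not None:  # assumption: skip windows with non-standard residues
            counts[j] += 1
    return counts


def skip_ngrams(tokens, n, k):
    """Return all n-grams that skip at most k tokens in total (k=0: plain n-grams)."""
    grams = []
    for start in range(len(tokens)):
        # Any n-1 later positions inside this window skip at most k tokens.
        window = range(start + 1, min(len(tokens), start + n + k))
        for rest in combinations(window, n - 1):
            grams.append(tuple(tokens[i] for i in (start,) + rest))
    return grams


if __name__ == "__main__":
    words = "I hit the tennis ball".split()
    print(skip_ngrams(words, 3, 0))  # the three ordinary 3-grams
    print(("hit", "the", "ball") in skip_ngrams(words, 3, 1))
```

With k = 0 the function returns only the ordinary 3-grams listed on the slide; with k = 1 it also produces "hit the ball".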
Random Forest Ensemble !!!
Result: evaluated by the ROC50 score, the area under the ROC curve up to the first 50 false positives
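The score described on this slide can be computed from a ranked list of predictions. The sketch below is one common way to do it, not necessarily the exact procedure used for the reported results: the function name `roc_n`, the normalisation to [0, 1], and the handling of families with fewer than n negatives are assumptions, and tied scores are not treated specially.

```python
def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives, scaled to [0, 1].

    scores: classifier outputs, higher means more likely positive.
    labels: 1 for positive, 0 for negative, aligned with scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y == 1)
    total_neg = len(ranked) - total_pos
    # Assumption: if there are fewer than n negatives, use all of them.
    limit = min(n, total_neg)
    if total_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative sample")

    true_pos = false_pos = 0
    area = 0
    for _, y in ranked:
        if y == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == limit:
                break
    return area / (limit * total_pos)


if __name__ == "__main__":
    print(roc_n([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n=50))
```

A score of 1 means every positive is ranked above the first n negatives; a score of 0 means none is.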
Improving Recall and Precision: unbalanced data set; trade-off between recall and precision
Improving Recall and Precision: one family, one threshold
Improving Recall and Precision (figure: choosing the threshold for one family)
Train set, ranked by score with class label (+/-): 0.98+ 0.95+ 0.93+ 0.92+ 0.90- 0.87- 0.85+ 0.84- 0.81+ 0.79+ 0.77- 0.75- 0.73- 0.69+ 0.65- 0.62- 0.58- 0.55- 0.53-
F value: 0.88 0.85 0.82 0.79 0.78 0.76 0.75 0.72 0.70 0.68 0.67 0.63 0.60 0.57 0.56 0.54 0.51 0.49 0.48; 0.79
Panels: New train, New test, F value. Note: no value but position!
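One possible reading of "one family, one threshold" and "no value but position" is sketched below: for each family, pick the cut in the ranked training list that maximises the F value, then carry that cut to the test set as a rank position rather than as a raw score. This is an interpretation, not the authors' confirmed procedure. The function names are hypothetical, the F value is taken to be F1, and transferring the cut as a proportion of the list length is an assumption. The labels are the +/- signs from the ranked training list on the slide; the F values computed here are not expected to match the figure's numbers.

```python
def best_cut_position(labels_ranked):
    """Return (k, f1): calling the top k ranked samples positive maximises F1.

    labels_ranked: 1/0 labels ordered by decreasing classifier score.
    """
    total_pos = sum(labels_ranked)
    best_k, best_f1 = 0, 0.0
    true_pos = 0
    for k, y in enumerate(labels_ranked, start=1):
        true_pos += y
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (k + total_pos)
        f1 = 2 * true_pos / (k + total_pos)
        if f1 > best_f1:  # ties keep the earlier (stricter) cut
            best_k, best_f1 = k, f1
    return best_k, best_f1


def apply_by_position(test_scores, train_k, train_size):
    """Label the top train_k/train_size fraction of test samples as positive."""
    k = round(len(test_scores) * train_k / train_size)
    order = sorted(range(len(test_scores)), key=lambda i: test_scores[i], reverse=True)
    predicted = [0] * len(test_scores)
    for i in order[:k]:
        predicted[i] = 1
    return predicted


if __name__ == "__main__":
    # +/- labels of the ranked training list shown on the slide.
    train_labels = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    k, f1 = best_cut_position(train_labels)
    print(k, round(f1, 4))
    print(apply_by_position([0.2, 0.9, 0.5, 0.7], k, len(train_labels)))
```

Because the cut is stored as a position, it does not depend on the classifier's score scale, which can differ between families.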
Conclusion 1. The N-gram model is successfully used to detect protein remote homology. The result on the benchmark is satisfactory. 2. A novel method is proposed to improve the recall and precision of positive samples. This method yields values of 0.86752 and 0.56470 for mean recall and mean precision, respectively.