
Assessing the Performance of Macromolecular Sequence Classifiers

This research assesses the performance of macromolecular sequence classifiers using machine learning methods. It compares different approaches and evaluates their effectiveness in predictive modeling, focusing on data selection and evaluation procedures. The study includes experiments on macromolecular sequence classification and discusses window-based and sequence-based cross-validation methods. Supported by a grant from the National Institutes of Health (GM066387).

Presentation Transcript


  1. Assessing the Performance of Macromolecular Sequence Classifiers. Cornelia Caragea (cornelia@cs.iastate.edu), Iowa State University. Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar. October 15, 2007. Research supported in part by a grant from the National Institutes of Health (GM066387).

  2. Background and Motivation • Machine learning methods offer some of the most cost-effective approaches to building predictive models • One problem: multiple approaches exist • Needed: a way to compare the effectiveness of different predictive classifiers • Difficulty: published approaches use different data selection and evaluation procedures

  3. Outline • Macromolecular Sequence Classification • Performance Evaluation • Window-Based Cross-Validation • Sequence-Based Cross-Validation • Experiments • Conclusions

  4. Macromolecular Sequence Classification • Predict a label for each element in a given sequence • Example: identify post-translational modification residues [Figure: a peptide chain drawn from H3N+ to COO-, with residues marked "Glycosylated?" and "Phosphorylated?"]

  5. Macromolecular Sequence Classification • Example: identify RNA-binding residues (1T0K_B) • Sequence: SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA • Class: 0000000000000000111110010000000000000001100100000000000000000000010000000001111100000000000000000

  6. Macromolecular Sequence Classification [Diagram: All Data is split into Training Data and Test Data; the Learning System uses the Training Data to produce the Resulting Classifier; validation on the Test Data gives the performance on the test set]

  7. Macromolecular Sequence Classification • Sliding Window Approach: each window is labeled with the class of its central (target) residue • Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK • Class: 1111110011111110011111001011111100000001111101000000 • Windows with class labels: ... VKKFGGEVVKAGNIL,0 KKFGGEVVKAGNILV,0 KFGGEVVKAGNILVR,1 FGGEVVKAGNILVRQ,1 ...
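The sliding-window encoding on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `extract_windows`, the default width of 15 (read off the example windows), and the choice to skip residues too close to either end instead of padding are all assumptions.

```python
def extract_windows(sequence, labels, window_size=15):
    """Return (window, label) pairs; the label is that of the central (target) residue."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that a central residue exists")
    half = window_size // 2
    windows = []
    # Assumption: residues within `half` positions of either end are skipped (no padding).
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, int(labels[center])))
    return windows


sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = "1111110011111110011111001011111100000001111101000000"

# Windows 8..11 are the four examples listed on the slide.
for window, label in extract_windows(sequence, labels)[8:12]:
    print(f"{window},{label}")
```

With the sequence and class string from the slide, the loop prints the same four window/label pairs that the slide lists.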

  8. Outline • Macromolecular Sequence Classification • Performance Evaluation • Window-Based Cross-Validation • Sequence-Based Cross-Validation • Experiments • Conclusions

  9. Performance Evaluation • K-Fold Cross-Validation: partition the data into k disjoint subsets S1, S2, ..., Sk-1, Sk • Learn classifier C on k-1 of the subsets • Evaluate classifier C on the held-out subset • Repeat k times, holding out a different subset each time

  10. Window-Based Cross-Validation • Procedure: • Extract windows from all sequences in the dataset • Partition the set of windows into k disjoint subsets • Perform standard cross-validation [Diagram: the subsets S1, S2, ..., Sk-1, Sk contain windows; learn classifier C, evaluate classifier C, repeat k times]
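A rough Python sketch of the window-based procedure, using only the standard library. The toy sequences, the window width of 3, and the helper names are hypothetical; the point is only that windows from all sequences are pooled before they are partitioned.

```python
import random


def extract_windows(seq_id, sequence, labels, window_size):
    """Windows of one sequence as (seq_id, center, window, label) tuples."""
    half = window_size // 2
    return [
        (seq_id, center, sequence[center - half:center + half + 1], int(labels[center]))
        for center in range(half, len(sequence) - half)
    ]


def window_based_folds(dataset, k, window_size, seed=0):
    """Pool the windows of every sequence, then split the pool into k disjoint folds."""
    pool = [
        w
        for seq_id, (sequence, labels) in dataset.items()
        for w in extract_windows(seq_id, sequence, labels, window_size)
    ]
    random.Random(seed).shuffle(pool)
    return [pool[i::k] for i in range(k)]


# Hypothetical toy data: sequence id -> (residues, per-residue class labels).
dataset = {
    "seqA": ("MKTLLVAG", "01100100"),
    "seqB": ("GGSKRAEL", "00011000"),
    "seqC": ("PLVTKQWN", "10000110"),
}

folds = window_based_folds(dataset, k=3, window_size=3)
for i, test in enumerate(folds):
    train = [w for j, fold in enumerate(folds) if j != i for w in fold]
    shared = {w[0] for w in train} & {w[0] for w in test}
    print(f"fold {i}: {len(test)} test windows, sequences on both sides: {sorted(shared)}")
```

Because the split is made over windows, nothing prevents windows from the same sequence from landing in both the training and the test portion of a fold.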

  11. Sequence-Based Cross-Validation • Procedure: • Partition the set of sequences into k disjoint subsets • Extract windows from sequences in each subset • Perform standard cross-validation [Diagram: the subsets S1, S2, ..., Sk-1, Sk contain sequences; learn classifier C, evaluate classifier C, repeat k times]
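The sequence-based procedure might be sketched in Python like this. The data and helper names are hypothetical, and the window width of 3 is chosen only to keep the toy example small; the essential step is that sequences are partitioned first and windows are extracted afterwards, fold by fold.

```python
import random


def extract_windows(seq_id, sequence, labels, window_size):
    """Windows of one sequence as (seq_id, center, window, label) tuples."""
    half = window_size // 2
    return [
        (seq_id, center, sequence[center - half:center + half + 1], int(labels[center]))
        for center in range(half, len(sequence) - half)
    ]


def sequence_based_folds(dataset, k, window_size, seed=0):
    """Split the sequence ids into k disjoint groups, then extract windows per group."""
    seq_ids = sorted(dataset)
    random.Random(seed).shuffle(seq_ids)
    id_folds = [seq_ids[i::k] for i in range(k)]
    return [
        [w for sid in ids for w in extract_windows(sid, *dataset[sid], window_size)]
        for ids in id_folds
    ]


# Hypothetical toy data: sequence id -> (residues, per-residue class labels).
dataset = {
    "seqA": ("MKTLLVAG", "01100100"),
    "seqB": ("GGSKRAEL", "00011000"),
    "seqC": ("PLVTKQWN", "10000110"),
    "seqD": ("AVNDRHSY", "00110001"),
}

folds = sequence_based_folds(dataset, k=2, window_size=3)
for i, test in enumerate(folds):
    train = [w for j, fold in enumerate(folds) if j != i for w in fold]
    # No sequence contributes windows to both sides of the split.
    assert {w[0] for w in train}.isdisjoint({w[0] for w in test})
    print(f"fold {i}: {len(test)} test windows from {sorted({w[0] for w in test})}")
```

Every window of a given sequence ends up in the same fold, so the classifier is always evaluated on sequences it has not seen during training.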

  12. Window-Based vs. Sequence-Based Cross-Validation • Window-Based Cross-Validation: • Train and test sets are likely to contain some windows that originate from the same sequence. • This violates the independence assumption between train and test sets. • Sequence-Based Cross-Validation: • Windows belonging to the same sequence end up in the same set.

  13. Machine Learning Classifiers • Support Vector Machine: • 0/1 String Kernel • Example: x = VKKFGGEVVKAGNIL, y = KKFGGEVVKAGNILV, I[xi=yi] = 010010010000000 • Naïve Bayes: • Identity Window: VKKFGGEVVKAGNIL becomes x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
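The two window representations on this slide can be illustrated with a short Python sketch. The slide shows only the position-wise indicator I[xi=yi]; taking the kernel value to be the sum of that indicator is an assumption here, and the function names are hypothetical.

```python
def match_indicator(x, y):
    """Position-wise indicator I[x_i = y_i] for two windows of equal length."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return [int(a == b) for a, b in zip(x, y)]


def zero_one_kernel(x, y):
    """Assumed kernel value: the number of positions at which the windows agree."""
    return sum(match_indicator(x, y))


def identity_features(window):
    """Identity window for Naive Bayes: one categorical feature per position."""
    return list(window)


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"

print("".join(str(bit) for bit in match_indicator(x, y)))  # 010010010000000
print(zero_one_kernel(x, y))                                # 3
print(",".join(identity_features(x)))                       # V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
```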

  14. Datasets • O-GlycBase dataset: • contains experimentally verified glycosylation sites • http://www.cbs.dtu.dk/databases/OGLYCBASE/ • RNA-Protein Interface dataset, RB147: • consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank • http://bindr.gdcb.iastate.edu/RNABindR/ • Protein-Protein Interface dataset: • consists of protein-binding protein sequences

  15. Datasets • Number of positive and negative instances used in our experiments [table not included in the transcript]

  16. Outline • Macromolecular Sequence Classification • Performance Evaluation • Window-Based Cross-Validation • Sequence-Based Cross-Validation • Experiments • Conclusions

  17. Experimental Design • Questions: • How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation? • How do the results vary when we vary the size of the dataset?

  18. Results • Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM on O-GlycBase [figure not included in the transcript]

  19. Results • AUC and CC for a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface [figure panels not included in the transcript]
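For readers unfamiliar with the two measures named on this slide, here is a small Python sketch. The transcript does not expand "CC"; the code assumes it denotes the Matthews correlation coefficient, and the scores, labels, and 0.5 threshold are made-up illustration values, not results from the talk.

```python
import math


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered
    (positive, negative) pairs, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def correlation_coefficient(predicted, labels):
    """Matthews correlation coefficient (assumed meaning of CC)."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up classifier scores and true labels for six residues.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
predicted = [int(s >= 0.5) for s in scores]  # hypothetical decision threshold

print(f"AUC = {auc(scores, labels):.3f}")
print(f"CC  = {correlation_coefficient(predicted, labels):.3f}")
```

AUC is computed from the raw scores and so does not depend on a threshold, while CC is computed from thresholded predictions.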

  20. Outline • Macromolecular Sequence Classification • Performance Evaluation • Window-Based Cross-Validation • Sequence-Based Cross-Validation • Experiments • Conclusions

  21. Conclusions • We compared two variants of k-fold cross-validation: window-based and sequence-based. • The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV. • Sequence-Based CV provides more realistic estimates of performance, because in practice predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.

  22. Vasant Honavar, Jivko Sinapov, Drena Dobbs
