Proteome Analyst

Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors. Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell.

Presentation Transcript


  1. Proteome Analyst Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors

  2. Proteome Analyst Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell

  3. Proteome Analyst • Proteome • one of many ‘-omes’ • set of all proteins in an organism • Analysis • prediction of protein function or localization from sequence data

  4. Analyze a Protein • We have examples of annotated proteins in various protein classes. • We have more examples of unannotated proteins.

  5. Analyze a Protein • We have examples of annotated proteins in various protein classes. • We have more examples of unannotated proteins. • What do we do?

  6. Analyze a Protein • We have examples of annotated proteins in various protein classes. • We have more examples of unannotated proteins. • What do we do? • Find homologues to each protein and assume similar function.

  7. Analyze a Protein • We have examples of annotated proteins in various protein classes. • We have more examples of unannotated proteins. • What do we do? • Find homologues to each protein and assume similar function. • Find characteristics of each protein that affect function.

  8. Analyzing Proteins • One Protein?

  9. Analyzing Proteins • One Protein? • Just do it.

  10. Analyzing Proteins • One Protein? • Just do it. • 5 Proteins?

  11. Analyzing Proteins • One Protein? • Just do it. • 5 Proteins? • Post-doc familiar with protein classes.

  12. Analyzing Proteins • One Protein? • Just do it. • 5 Proteins? • Post-doc familiar with protein classes. • 50 Proteins?

  13. Analyzing Proteins • One Protein? • Just do it. • 5 Proteins? • Post-doc familiar with protein classes. • 50 Proteins? • grad student

  14. Analyzing Proteins • One Protein? • Just do it. • 5 Proteins? • Post-doc familiar with protein classes. • 50 Proteins? • grad student • 5000 proteins?

  15. Analyzing Proteins • One Protein? • Just do it. • 5 Proteins? • Post-doc familiar with protein classes. • 50 Proteins? • grad student • 5000 proteins? • summer students

  16. Proteome Analyst

  17. Proteome Analyst • High-throughput • Transparent • Prediction of • Protein Function • Protein Localization • Custom Classification

  18. Machine Learning Task • Training • INPUT: sequences, classes • OUTPUT: Classifier • Analysis • INPUT: sequences, Classifier • OUTPUT: classes

  19. Machine Learning Task • Training • INPUT: sequences, classes • OUTPUT: Classifier • Analysis • INPUT: sequences, Classifier • OUTPUT: classes, explanation

  20. Training • INPUT • sequences, classes • PA Tools • sequences → features • ML Algorithm • features, classes → Classifier • OUTPUT • Classifier
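
The data flow on slides 18 to 20 can be sketched as two small functions. This is only an illustration of the shape of the pipeline: the names `train`, `analyze`, `residue_tool` and `overlap_learner` are hypothetical, and the stand-in tool and learner are deliberately trivial placeholders for the BLAST-based tools and the ML algorithm the slides describe.

```python
from typing import Callable, List, Set

# Hypothetical type aliases for illustration; not Proteome Analyst's own API.
FeatureTool = Callable[[str], Set[str]]
Classifier = Callable[[Set[str]], str]
Learner = Callable[[List[Set[str]], List[str]], Classifier]


def train(sequences: List[str], classes: List[str],
          tool: FeatureTool, learner: Learner) -> Classifier:
    """Training: sequences, classes -> Classifier."""
    feature_sets = [tool(seq) for seq in sequences]  # sequences -> features
    return learner(feature_sets, classes)            # features, classes -> Classifier


def analyze(sequences: List[str], tool: FeatureTool,
            classifier: Classifier) -> List[str]:
    """Analysis: sequences, Classifier -> classes."""
    return [classifier(tool(seq)) for seq in sequences]


def residue_tool(sequence: str) -> Set[str]:
    """Stand-in feature tool: the set of residue letters present."""
    return set(sequence)


def overlap_learner(feature_sets: List[Set[str]], classes: List[str]) -> Classifier:
    """Stand-in learner: predict the class of the training example
    that shares the most features with the query."""
    examples = list(zip(feature_sets, classes))

    def classifier(features: Set[str]) -> str:
        _, best_class = max(examples, key=lambda ex: len(ex[0] & features))
        return best_class

    return classifier


clf = train(["AAAA", "CCCC"], ["class A", "class B"], residue_tool, overlap_learner)
predicted = analyze(["AAAG", "CCG"], residue_tool, clf)
```

Keeping the feature tool and the learner as parameters mirrors the slides' point that any ML algorithm may be plugged in.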

  21. Training: INPUT >class A<Training Seq 1 MVGSGLLWLALVSCILTQASAVQRGYGN PIEASSYGL... >class B<Training Seq 2 LLDEPFRSTENSAGSQGCDKNMSGWYRF VGEGGVRMS... >class B<Training Seq 3 EVIAYLRDPNCSSILQTEERNWVSVTSP VQASACRNI... . . .

  22. Training: INPUT (callouts: classes, protein sequences) >class A<Training Seq 1 MVGSGLLWLALVSCILTQASAVQRGYGN PIEASSYGL... >class B<Training Seq 2 LLDEPFRSTENSAGSQGCDKNMSGWYRF VGEGGVRMS... >class B<Training Seq 3 EVIAYLRDPNCSSILQTEERNWVSVTSP VQASACRNI... . . .
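
A minimal sketch of reading training input in the layout shown on slides 21 and 22, assuming each header has the form `>CLASS<NAME` and that the lines after it hold the sequence. The function names are hypothetical, and the sample sequences are the truncated fragments visible on the slide, not full proteins.

```python
def split_header(header):
    """Split '>class A<Training Seq 1' into ('class A', 'Training Seq 1').

    A header without '<' is treated as unlabeled and yields (None, name).
    """
    body = header[1:]
    label, sep, name = body.partition("<")
    if not sep:
        return None, body.strip()
    return label.strip(), name.strip()


def parse_labeled_fasta(text):
    """Return a list of (class_label, name, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((*split_header(header), "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((*split_header(header), "".join(chunks)))
    return records


# Sequences are truncated exactly as on the slide.
sample = """\
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGN
PIEASSYGL
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRF
VGEGGVRMS
"""
records = parse_labeled_fasta(sample)
```

The same parser accepts the unlabeled analysis input (headers such as `>Seq 1`), returning `None` as the class.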

  23. Training: PA Tools • sequences → features

  24. Training: PA Tools • sequences → features • Homology Tools (BLAST) • sequence → homologues • homologues → annotations • annotations → features

  25. Homology Tool • sequence → features • [Flowchart: sequence, seq DB → BLAST → homologues → retrieve → annotations → parse → features]

  26. Homology Tool • sequence → features • [Flowchart: sequence, seq DB → BLAST → homologues → retrieve → annotations → parse → features] • Example annotations: DBSOURCE swissprot: locus MPPB_NEUCR, ... xrefs (non-sequence databases): ... InterPro IPR001431, ... KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.

  27. Homology Tool • sequence → features • [Flowchart: sequence, seq DB → BLAST → homologues → retrieve → annotations → parse → features]
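
The final "annotations → features" step of the homology tool might look like the sketch below, which turns the KEYWORDS field from slide 26 into a set of feature tokens. How Proteome Analyst actually tokenizes annotations is not stated on the slides; splitting on semicolons and pooling keywords across homologues are assumptions, and both function names are hypothetical.

```python
def keywords_to_features(keywords_field):
    """Split a semicolon-separated KEYWORDS field into a set of tokens."""
    tokens = (kw.strip().rstrip(".").strip() for kw in keywords_field.split(";"))
    return {token for token in tokens if token}


def features_from_homologues(keyword_fields):
    """Pool the keyword tokens of every retrieved homologue (an assumption)."""
    features = set()
    for field in keyword_fields:
        features |= keywords_to_features(field)
    return features


# KEYWORDS text copied from the example record on slide 26.
field = ("Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; "
         "Oxidoreductase; Electron transport; Respiratory chain.")
features = keywords_to_features(field)
```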

  28. Training: PA Tools • sequences → features • Homology Tools (BLAST) • sequence → homologues • homologues → annotations • annotations → features • Pattern Tools (PFAM, ProSite, …) • sequences → motifs • motifs → features

  29. Pattern Tool • sequence → features • [Flowchart: sequence, pattern DB → find → patterns → parse → features]

  30. Pattern Tool • sequence → features • [Flowchart: sequence, pattern DB → find → patterns → parse → features] • Example patterns: Pfam; PF00234; tryp_alpha_amyl; 1. PROSITE; PS00940; GAMMA_THIONIN; 1. PROSITE; PS00305; 11S_SEED_STORAGE; 1.

  31. Pattern Tool • sequence → features • not included in current results • [Flowchart: sequence, pattern DB → find → patterns → parse → features]
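
The "parse" step of the pattern tool could be sketched as follows, using the three pattern hits shown on slide 30. This assumes one hit per line in the form `DATABASE; ACCESSION; NAME; COUNT.`; the `DATABASE:ACCESSION` feature token and the function names are choices made for this example, not something the slides specify.

```python
def parse_pattern_hits(text):
    """Return (database, accession, name) tuples, one per well-formed line."""
    hits = []
    for raw in text.splitlines():
        fields = [f.strip() for f in raw.strip().rstrip(".").split(";")]
        # Skip lines that do not carry at least database, accession and name.
        if len(fields) < 3 or not all(fields[:3]):
            continue
        database, accession, name = fields[:3]
        hits.append((database, accession, name))
    return hits


def hits_to_features(hits):
    """Map each hit to a 'DATABASE:ACCESSION' feature token (hypothetical format)."""
    return {f"{database}:{accession}" for database, accession, _ in hits}


# Hits copied from slide 30, placed one per line.
sample = """\
Pfam; PF00234; tryp_alpha_amyl; 1.
PROSITE; PS00940; GAMMA_THIONIN; 1.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
"""
hits = parse_pattern_hits(sample)
pattern_features = hits_to_features(hits)
```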

  32. Training: ML Algorithm • features, classes → Classifier

  33. Training: ML Algorithm • features, classes → Classifier • any ML Algorithm may be used • default = naïve Bayes • consistently near-best accuracy (SVM, ANN slightly better) • efficient (for high-throughput) • easy to interpret
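
To make the default learner concrete, here is a minimal naïve Bayes sketch over sets of feature tokens. The slides do not say which naïve Bayes variant or smoothing Proteome Analyst uses; this example assumes a Bernoulli model with Laplace smoothing (`alpha`), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(feature_sets, classes, alpha=1.0):
    """Count class frequencies and per-class feature occurrences."""
    class_counts = Counter(classes)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in zip(feature_sets, classes):
        feature_counts[label].update(features)
        vocabulary.update(features)
    return {"class_counts": class_counts, "feature_counts": feature_counts,
            "vocabulary": vocabulary, "alpha": alpha}


def log_scores(model, features):
    """Unnormalized log P(class) + sum of log P(feature present/absent | class)."""
    total = sum(model["class_counts"].values())
    alpha = model["alpha"]
    scores = {}
    for label, n_label in model["class_counts"].items():
        score = math.log(n_label / total)
        for feature in model["vocabulary"]:
            # Smoothed probability that this feature is present in the class.
            p = (model["feature_counts"][label][feature] + alpha) / (n_label + 2 * alpha)
            score += math.log(p if feature in features else 1.0 - p)
        scores[label] = score
    return scores


def predict(model, features):
    scores = log_scores(model, features)
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
training_features = [{"Hydrolase", "Zinc"}, {"Hydrolase"},
                     {"Electron transport"}, {"Electron transport", "Oxidoreductase"}]
training_classes = ["class A", "class A", "class B", "class B"]
model = train_naive_bayes(training_features, training_classes)
```

Training is a single counting pass over the data, which is one reason the method suits high-throughput use.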

  34. Training: OUTPUT • Classifier

  35. Analysis (Classification) • INPUT • sequences • PA Tools • sequences → features • Classifier • features → classes, explanation • OUTPUT • classes

  36. Analysis: INPUT >Seq 1 DTILNINFQCAYPLDMKVSLQAALQPIV SSLNVSVDG... >Seq 2 AVELSVESVLYVGAILEQGDTSRFNLVL RNCYATPTE... >Seq 3 HVEENGQSSESRFSVQMFMFAGHYDLVF LHCEIHLCD... . . .

  37. Analysis: INPUT (callout: protein sequences) >Seq 1 DTILNINFQCAYPLDMKVSLQAALQPIV SSLNVSVDG... >Seq 2 AVELSVESVLYVGAILEQGDTSRFNLVL RNCYATPTE... >Seq 3 HVEENGQSSESRFSVQMFMFAGHYDLVF LHCEIHLCD... . . .

  38. Analysis: PA Tools • sequences → features

  39. Analysis: PA Tools • sequences → features • Homology Tools (BLAST) • sequence → homologues • homologues → annotations • annotations → features • Pattern Tools (PFAM, ProSite, …) • sequences → motifs • motifs → features

  40. Analysis: Classification • features → classes

  41. Analysis: Classification • features → classes • naïve Bayes • returns probabilities of each class for each sequence • efficient (for high-throughput) • easy to interpret
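
Slide 41 notes that naïve Bayes returns a probability for each class. One standard way to get those probabilities from per-class log scores is to normalize them with the log-sum-exp trick, sketched below. The function name and the example scores are hypothetical; the slides do not describe how Proteome Analyst performs this step.

```python
import math


def posteriors(log_scores):
    """Turn unnormalized per-class log scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from underflowing on very negative scores.
    top = max(log_scores.values())
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Example scores invented for illustration.
probabilities = posteriors({"class A": math.log(0.2), "class B": math.log(0.6)})
```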

  42. Analysis: Classification • features → classes, explanation
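
The transcript does not preserve what the explanation on these slides looked like. As an illustration of why naïve Bayes is easy to interpret, the sketch below ranks the features present in a sequence by their log-likelihood ratio between the predicted class and a rival class. The function name, the probability table and this particular form of explanation are assumptions, not Proteome Analyst's documented output.

```python
import math


def explain(present_features, cond_prob, predicted, rival):
    """Rank features by log P(feature | predicted) - log P(feature | rival).

    cond_prob[label][feature] is the (smoothed, nonzero) probability that
    the feature is present in a protein of that class.
    """
    contributions = {
        feature: math.log(cond_prob[predicted][feature] / cond_prob[rival][feature])
        for feature in present_features
    }
    # Largest positive values are the strongest evidence for the predicted class.
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)


# Hypothetical conditional probabilities for two classes.
cond_prob = {
    "class A": {"Hydrolase": 0.75, "Zinc": 0.5},
    "class B": {"Hydrolase": 0.25, "Zinc": 0.25},
}
ranking = explain({"Hydrolase", "Zinc"}, cond_prob, "class A", "class B")
```

Because naïve Bayes scores are sums of per-feature terms, each term can be reported on its own as evidence for or against a class.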

  47. Results: General Function • GeneQuiz classification • 5-fold cross-validation accuracy on 14 classes

  49. Results: Specific Function • K+ Ion Channel Proteins • 5-fold cross-validation accuracy on 78 sequences, 4 classes
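
For readers unfamiliar with the evaluation protocol, 5-fold cross-validation can be sketched as below: the data is split into five folds, each fold is held out once for testing while the other four are used for training, and accuracy is pooled over all held-out predictions. The slides do not say whether the folds were shuffled or stratified by class; this sketch uses a plain interleaved split, and the function names are hypothetical.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for m, fold in enumerate(folds) if m != held_out for i in fold]
        yield train, test


def cross_val_accuracy(items, labels, train_fn, predict_fn, k=5):
    """Fraction of items classified correctly while held out of training."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
    return correct / len(items)


# Fold sizes for a data set of 78 sequences, the size reported on slide 49.
fold_sizes = [len(test) for _, test in k_fold_indices(78, k=5)]
```

Any training and prediction functions with the signatures `train_fn(items, labels)` and `predict_fn(model, item)` can be passed in.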
