Proteome Analyst: Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors
Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell
Proteome Analyst • Proteome • one of many ‘-omes’ • set of all proteins in an organism • Analysis • prediction of protein function or localization from sequence data
Analyze a Protein • We have examples of annotated proteins in various protein classes. • We have more examples of unannotated proteins. • What do we do? • Find homologues to each protein and assume similar function. • Find characteristics of each protein that affect function.
Analyzing Proteins • One Protein? • Just do it. • 5 Proteins? • Post-doc familiar with protein classes. • 50 Proteins? • grad student • 5000 proteins? • summer students
Proteome Analyst • High-throughput • Transparent • Prediction of • Protein Function • Protein Localization • Custom Classification
Machine Learning Task • Training • INPUT: sequences, classes • OUTPUT: Classifier • Analysis • INPUT: sequences, Classifier • OUTPUT: classes, explanation
Training • INPUT • sequences, classes • PA Tools • sequences → features • ML Algorithm • features, classes → Classifier • OUTPUT • Classifier
Training: INPUT • each record: a class label, then a protein sequence • >class A<Training Seq 1 MVGSGLLWLALVSCILTQASAVQRGYGN PIEASSYGL... • >class B<Training Seq 2 LLDEPFRSTENSAGSQGCDKNMSGWYRF VGEGGVRMS... • >class B<Training Seq 3 EVIAYLRDPNCSSILQTEERNWVSVTSP VQASACRNI... • . . .
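The training input shown above can be read with a few lines of code. This is a minimal sketch, assuming each record starts with a header of the form `>class label<name` on its own line and that the sequence follows on one or more lines; the one-line-per-header layout and the name `parse_training_input` are assumptions, and the sequences are the truncated fragments from the slide.

```python
import re

# Assumed header shape, copied from the slide: ">class A<Training Seq 1"
HEADER = re.compile(r"^>(?P<label>[^<]+)<(?P<name>.*)$")

def parse_training_input(text):
    """Return a list of (class_label, name, sequence) tuples."""
    records = []
    label = name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, name, "".join(chunks)))
            label, name = match.group("label"), match.group("name")
            chunks = []
        elif label is not None:
            chunks.append(line)  # sequence lines are concatenated
    if label is not None:
        records.append((label, name, "".join(chunks)))
    return records

example = """>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGN
PIEASSYGL
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRF
VGEGGVRMS
"""
records = parse_training_input(example)
```

Each tuple pairs a class label with its sequence, which is the (sequences, classes) input that training expects.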
Training: PA Tools • sequences → features • Homology Tools (BLAST) • sequence → homologues • homologues → annotations • annotations → features
Homology Tool • sequence → features • sequence + seq DB → BLAST → homologues → retrieve → annotations → parse → features
Homology Tool • sequence → features • example annotation retrieved for a homologue: DBSOURCE swissprot: locus MPPB_NEUCR, ... xrefs (non-sequence databases): ... InterPro IPR001431,... KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
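The "parse" step that turns a homologue's annotation into features could look roughly like the sketch below. It only handles the KEYWORDS field from the example above and assumes a flat-file layout in which continuation lines are indented; the function name `keyword_features` and that layout are assumptions, not a description of the actual PA parser.

```python
def keyword_features(annotation_text):
    """Collect the KEYWORDS field of an annotation as lower-case feature tokens."""
    features = set()
    in_keywords = False
    for line in annotation_text.splitlines():
        if line.startswith("KEYWORDS"):
            in_keywords = True
            line = line[len("KEYWORDS"):]
        elif in_keywords and line[:1].isspace():
            pass  # indented continuation line of the KEYWORDS field
        else:
            in_keywords = False  # any other field ends the KEYWORDS block
        if in_keywords:
            for token in line.split(";"):
                token = token.strip().rstrip(".").lower()
                if token:
                    features.add(token)
    return features

# Truncated annotation, using only the fields visible on the slide.
annotation = """DBSOURCE    swissprot: locus MPPB_NEUCR
KEYWORDS    Hydrolase; Metalloprotease; Zinc; Mitochondrion;
            Transit peptide; Oxidoreductase; Electron transport;
            Respiratory chain.
"""
features = keyword_features(annotation)
```

Using a set keeps each keyword once per sequence, so a feature records presence rather than frequency.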
Training: PA Tools • sequences → features • Homology Tools (BLAST) • sequence → homologues • homologues → annotations • annotations → features • Pattern Tools (PFAM, ProSite, …) • sequences → motifs • motifs → features
Pattern Tool • sequence → features • sequence + pattern DB → find → patterns → parse → features
Pattern Tool • sequence → features • example patterns found: Pfam; PF00234; tryp_alpha_amyl; 1. • PROSITE; PS00940; GAMMA_THIONIN; 1. • PROSITE; PS00305; 11S_SEED_STORAGE; 1.
Pattern Tool • sequence → features • not included in current results
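Pattern hits of the kind shown on the previous slide can be mapped to features in the same spirit. The sketch below assumes semicolon-separated records whose first two fields are the database name and the accession; the `database:accession` token format and the name `pattern_features` are illustrative choices, not the tool's documented output.

```python
def pattern_features(lines):
    """Turn 'Database; ACCESSION; name; count.' records into feature tokens."""
    features = set()
    for line in lines:
        fields = [field.strip() for field in line.rstrip(". \n").split(";")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            features.add(f"{fields[0].lower()}:{fields[1]}")
    return features

hits = [
    "Pfam; PF00234; tryp_alpha_amyl; 1.",
    "PROSITE; PS00940; GAMMA_THIONIN; 1.",
    "PROSITE; PS00305; 11S_SEED_STORAGE; 1.",
]
motif_features = pattern_features(hits)
```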
Training: ML Algorithm • features, classes → Classifier • any ML algorithm may be used • default = naïve Bayes • consistently near-best accuracy (SVM, ANN slightly better) • efficient (for high-throughput) • easy to interpret
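The slides name naïve Bayes as the default but do not give its exact form. As a minimal sketch, assume a model over feature presence with add-one smoothing: training then reduces to counting, which is why it suits high-throughput use. The function names and the toy feature tokens below are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: iterable of (feature_set, class_label). Returns the count tables."""
    class_counts = Counter()                # class -> number of training sequences
    feature_counts = defaultdict(Counter)   # class -> feature -> sequences containing it
    for features, label in examples:
        class_counts[label] += 1
        feature_counts[label].update(features)
    return class_counts, feature_counts

def class_prior(model, label):
    """P(class), estimated as the fraction of training sequences in that class."""
    class_counts, _ = model
    return class_counts[label] / sum(class_counts.values())

def feature_probability(model, feature, label):
    """P(feature present | class) with add-one (Laplace) smoothing."""
    class_counts, feature_counts = model
    return (feature_counts[label][feature] + 1) / (class_counts[label] + 2)

training = [
    ({"hydrolase", "zinc"}, "class A"),
    ({"membrane", "ion transport"}, "class B"),
    ({"membrane", "zinc"}, "class B"),
]
model = train_naive_bayes(training)
```

Because the model is just a table of counts, each probability can be traced back to the training sequences that produced it.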
Training: OUTPUT • Classifier
Analysis (Classification) • INPUT • sequences • PA Tools • sequences → features • Classifier • features → classes, explanation • OUTPUT • classes, explanation
Analysis: INPUT • protein sequences (no class labels) • >Seq 1 DTILNINFQCAYPLDMKVSLQAALQPIV SSLNVSVDG... • >Seq 2 AVELSVESVLYVGAILEQGDTSRFNLVL RNCYATPTE... • >Seq 3 HVEENGQSSESRFSVQMFMFAGHYDLVF LHCEIHLCD... • . . .
Analysis: PA Tools • sequences → features • Homology Tools (BLAST) • sequence → homologues • homologues → annotations • annotations → features • Pattern Tools (PFAM, ProSite, …) • sequences → motifs • motifs → features
Analysis: Classification • features → classes • naïve Bayes • returns probabilities of each class for each sequence • efficient (for high-throughput) • easy to interpret
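How a naïve Bayes classifier returns a probability for each class can be sketched as follows. The priors and per-feature probabilities are made-up numbers for two hypothetical classes, and the Bernoulli form (absent features also count as evidence) is an assumption about the model, not something the slides state.

```python
import math

def classify(features, priors, likelihoods):
    """Return {class: posterior probability} for one sequence's feature set."""
    log_scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature, p in likelihoods[label].items():
            # Present features contribute p, absent ones contribute (1 - p).
            score += math.log(p if feature in features else 1.0 - p)
        log_scores[label] = score
    # Normalise in log space so that long feature lists do not underflow.
    top = max(log_scores.values())
    unnormalised = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {label: value / total for label, value in unnormalised.items()}

priors = {"class A": 0.5, "class B": 0.5}
likelihoods = {
    "class A": {"zinc": 0.8, "membrane": 0.2},
    "class B": {"zinc": 0.2, "membrane": 0.8},
}
posterior = classify({"zinc"}, priors, likelihoods)
```

For this toy input the sequence has "zinc" but not "membrane", so nearly all of the probability mass goes to class A.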
Analysis: Classification • features → classes, explanation
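The slide text does not spell out how the explanation is computed. One common way to explain a naïve Bayes prediction, shown here only as an illustrative sketch, is to rank the features present in a sequence by their log-likelihood ratio between the predicted class and an alternative class. The probabilities and the name `explain` are hypothetical.

```python
import math

def explain(features, likelihoods, predicted, alternative):
    """Rank present features by evidence for `predicted` over `alternative`."""
    contributions = []
    for feature in features:
        p = likelihoods[predicted].get(feature)
        q = likelihoods[alternative].get(feature)
        if p is None or q is None:
            continue  # feature unseen in training: treated as no evidence here
        contributions.append((feature, math.log(p / q)))
    # Largest positive values are the strongest support for the prediction.
    return sorted(contributions, key=lambda item: item[1], reverse=True)

likelihoods = {
    "class A": {"zinc": 0.8, "hydrolase": 0.6, "membrane": 0.2},
    "class B": {"zinc": 0.2, "hydrolase": 0.5, "membrane": 0.8},
}
ranking = explain({"zinc", "hydrolase", "membrane"}, likelihoods, "class A", "class B")
```

Positive values support the predicted class and negative values argue against it, so a reader can see which annotations drove the call.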
Results: General Function • GeneQuiz classification • 5-fold cross-validation accuracy on 14 classes
Results: Specific Function • K+ Ion Channel Proteins • 5-fold cross-validation accuracy on 78 sequences, 4 classes
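For readers unfamiliar with the evaluation protocol, 5-fold cross-validation can be sketched as below: the data are split into five folds, each fold is held out once while a classifier is trained on the other four, and accuracy is the fraction of held-out items predicted correctly. The fold assignment here is a simple round-robin without shuffling or stratification, and the majority-class baseline stands in for a real classifier; both are assumptions for illustration, not the procedure used for the reported results.

```python
from collections import Counter

def k_fold_indices(n_items, k=5):
    """Assign item indices to k folds in round-robin order."""
    return [list(range(start, n_items, k)) for start in range(k)]

def cross_validate(examples, train, predict, k=5):
    """examples: list of (features, label). Returns overall held-out accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(examples), k):
        held = set(held_out)
        training = [ex for i, ex in enumerate(examples) if i not in held]
        model = train(training)
        correct += sum(predict(model, examples[i][0]) == examples[i][1]
                       for i in held_out)
    return correct / len(examples)

# Hypothetical stand-in classifier: always predict the most common training label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def predict_majority(model, features):
    return model

examples = [({"f"}, "A")] * 8 + [({"g"}, "B")] * 2
accuracy = cross_validate(examples, train_majority, predict_majority)
fold_sizes = [len(fold) for fold in k_fold_indices(78)]
```

With 78 sequences the folds cannot be equal: this assignment gives three folds of 16 and two of 15.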