Predicting Protein Function Using Machine-Learned Hierarchical Classifiers
Roman Eisner
Supervisors: Duane Szafron and Paul Lu
eisner@cs.ualberta.ca
Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion
Proteins • Functional units in the cell • Perform a variety of functions • e.g. Catalysis of reactions, structural and mechanical roles, transport of other molecules • Can take years to study a single protein • Any good leads would be helpful!
Protein Function Prediction and Protein Function Determination • Prediction: • An estimate of what function a protein performs • Determination: • Work in a laboratory to observe and discover what function a protein performs • Prediction complements determination
Proteins • Chain of amino acids • 20 amino acids • FASTA format: >P18077 – R35A_HUMAN MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL PAKAIGHRIRVMLYPSRI
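The FASTA record above can be read with a minimal parser. This sketch (the helper name `parse_fasta` is mine, not from the thesis) separates header lines from sequence lines and joins the wrapped sequence:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; flush the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# The record from the slide, wrapped over three sequence lines.
record = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI"""
header, seq = parse_fasta(record)[0]
```

Real-world parsing (e.g. via Biopython's `SeqIO`) also handles multi-record files and malformed input; the sketch only covers the format shown here.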
Ontologies • Standardized Vocabularies (Common Language) • In biological literature, different terms can be used to describe the same function • e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity” • Can be structured in a hierarchy to show relationships
Gene Ontology • Directed Acyclic Graph (DAG) • Always changing • Describes 3 aspects of protein annotations: • Molecular Function • Biological Process • Cellular Component
Hierarchical Ontologies • Can help to represent a large number of classes • Represent general and specific data • Some data is incomplete – could become more specific in the future
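In such hierarchies an annotation at a node also holds at every ancestor (the Gene Ontology's true path rule). That closure can be sketched over a toy DAG; the terms below are illustrative GO-style names, not the thesis's data:

```python
# Hypothetical toy ontology: term -> list of parent terms (a DAG, not a tree).
PARENTS = {
    "binding": [],
    "catalytic activity": [],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
}

def ancestor_closure(term, parents):
    """Return the term plus all of its ancestors (true-path rule)."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])  # walk upward through every parent
    return seen
```

A protein annotated with "DNA binding" is therefore implicitly annotated with "nucleic acid binding" and "binding" as well.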
Incomplete Annotations
Goal • To predict the function of proteins given their sequence
Data Set • Protein Sequences • UniProt database • Ontology • Gene Ontology Molecular Function aspect • Experimental Annotations • Gene Ontology Annotation project @ EBI • Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins • Final Data Set: 14,362 proteins
Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion
Predictors • Global: • BLAST NN • Local: • PA-SVM • PFAM-SVM • Probabilistic Suffix Trees
Why Linear SVMs? • Accurate • Explainability • Each term in the dot product is meaningful
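The explainability point can be made concrete: a linear SVM's decision value is w · x + b, so each feature contributes w_i * x_i and the contributions can be inspected directly. A small sketch with invented feature names and weights (not values from the thesis):

```python
# Hypothetical learned weights for one local predictor; each feature's
# contribution to the score is simply weight * feature value.
weights = {"kinase_motif": 1.8, "signal_peptide": -0.4, "atp_binding_site": 1.1}
bias = -0.9

def score(features):
    """Return the linear decision value and the per-feature contributions."""
    contributions = {f: weights[f] * x for f, x in features.items()}
    return sum(contributions.values()) + bias, contributions

# A query protein with two of the three features present.
s, contrib = score({"kinase_motif": 1.0, "signal_peptide": 0.0, "atp_binding_site": 1.0})
```

Sorting `contrib` by magnitude shows which features drove a prediction, which is what makes linear models easier to explain than kernelized ones.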
PA-SVM • Proteome Analyst features
PFAM-SVM • Hidden Markov Models
PST • Probabilistic Suffix Trees • Efficient Markov chains • Model the protein sequences directly
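A probabilistic suffix tree chooses its context lengths adaptively; as a simplified stand-in, a fixed-order Markov chain with add-one smoothing illustrates the same scoring idea of modelling sequences directly (the training sequence and parameters below are toy choices, not the thesis's models):

```python
from collections import defaultdict
import math

def train_markov(sequences, order=2):
    """Count (context -> next symbol) frequencies; a fixed-order stand-in
    for a PST, which would prune and vary context lengths adaptively."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_likelihood(seq, counts, order=2, alpha=1.0, alphabet=20):
    """Smoothed log-likelihood of seq under the model; higher = better fit.
    alphabet=20 reflects the 20 amino acids."""
    ll = 0.0
    for i in range(order, len(seq)):
        ctx, sym = seq[i - order:i], seq[i]
        total = sum(counts[ctx].values())
        ll += math.log((counts[ctx][sym] + alpha) / (total + alpha * alphabet))
    return ll

# Toy model trained on one repetitive sequence.
model = train_markov(["ACDACDACD"])
```

Prediction then amounts to scoring a query sequence under each class's model and choosing the class whose model assigns the highest likelihood.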
BLAST • Aligns a query protein’s sequence against any set of protein sequences
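BLAST nearest-neighbour (BLAST NN) prediction transfers the annotations of the best hit to the query. A sketch assuming BLAST has already been run and parsed into a hit table; the protein IDs, e-values, annotation, and cutoff below are invented for illustration:

```python
# Hypothetical parsed BLAST output: query -> list of (hit protein, e-value).
HITS = {
    "Q_NEW": [("P18077", 1e-40), ("P99999", 5e-3)],
}
# Hypothetical known annotations for well-studied proteins.
ANNOTATIONS = {"P18077": {"structural constituent of ribosome"}}

def blast_nn(query, hits, annotations, e_cutoff=1e-5):
    """Predict the annotations of the best (lowest e-value) hit below the
    cutoff; make no prediction when there is no good hit."""
    good = [(e, p) for p, e in hits.get(query, []) if e <= e_cutoff]
    if not good:
        return set()  # no good BLAST hit: abstain
    _, best = min(good)
    return annotations.get(best, set())
```

The abstain case matters later: it is exactly the ~11% of proteins with no good BLAST hit for which local predictors must do the work alone.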
Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion
Evaluating Predictions in a Hierarchy • Not all errors are equivalent • An error to a sibling is different than an error to an unrelated part of the hierarchy • Proteins can perform more than one function • Need to combine predictions of multiple functions into a single measure
Evaluating Predictions in a Hierarchy • Semantics of the hierarchy – True Path Rule • Protein labelled with: {T} -> {T, A1, A2} • Predicted functions: {S} -> {S, A1, A2} • Precision = 2/3 = 67% • Recall = 2/3 = 67%
Evaluating Predictions in a Hierarchy • Protein labelled with {T} -> {T, A1, A2} • Predicted: {C1} -> {C1, T, A1, A2} • Precision = 3/4 = 75% • Recall = 3/3 = 100%
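Once both the true and predicted label sets have been expanded with their ancestors, the two worked examples above reduce to set arithmetic:

```python
def hierarchical_pr(true_terms, predicted_terms):
    """Precision and recall over ancestor-closed label sets.
    Both arguments are sets of ontology terms after applying the
    true path rule (node plus all ancestors)."""
    tp = len(true_terms & predicted_terms)        # terms both sets agree on
    precision = tp / len(predicted_terms)          # fraction of predictions correct
    recall = tp / len(true_terms)                  # fraction of truth recovered
    return precision, recall

# Slide example 1: true {T, A1, A2}, predicted {S, A1, A2} -> 2/3, 2/3.
# Slide example 2: predicted {C1, T, A1, A2} -> 3/4, 3/3.
```

This is why an error to a sibling (example 1) still earns partial credit through the shared ancestors, while an overly specific prediction (example 2) keeps perfect recall at a small cost in precision.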
Supervised Learning
Cross-Validation • Used to estimate the performance of a classification system on future data • 5-fold cross-validation
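A minimal 5-fold split can be sketched as follows; this round-robin version is only illustrative, since real experiments would typically shuffle the data and possibly stratify by class:

```python
def k_fold_splits(items, k=5):
    """Partition items into k folds; each fold serves once as the test set
    while the remaining k-1 folds form the training set."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# 20 stand-in "proteins" split into 5 folds of 4.
proteins = list(range(20))
splits = list(k_fold_splits(proteins, k=5))
```

Averaging an evaluation measure over the five test folds estimates how the system would perform on unseen proteins.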
Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion
Inclusive vs Exclusive Local Predictors • In a system of local predictors, how should each local predictor behave? • Two extremes: • A local predictor predicts positive only for those proteins that belong exactly at that node • A local predictor predicts positive for those proteins that belong at or below it in the hierarchy • No a priori reason to choose either
Exclusive Local Predictors
Inclusive Local Predictors
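The two extremes differ in which proteins count as positive examples for a node. A sketch over a hypothetical hierarchy and protein labels (none of these names come from the thesis):

```python
# Hypothetical hierarchy: node -> list of child nodes.
CHILDREN = {"root": ["A", "B"], "A": ["A1"], "B": [], "A1": []}
# Most specific label of each (invented) protein.
LABELS = {"p1": "A", "p2": "A1", "p3": "B"}

def descendants(node, children):
    """The node plus everything below it in the hierarchy."""
    out = {node}
    for c in children[node]:
        out |= descendants(c, children)
    return out

def positives(node, inclusive):
    """Exclusive: proteins labelled exactly at the node.
    Inclusive: proteins labelled at the node or anywhere below it."""
    targets = descendants(node, CHILDREN) if inclusive else {node}
    return {p for p, lab in LABELS.items() if lab in targets}
```

For node "A", the exclusive scheme trains on p1 alone, while the inclusive scheme also pulls in p2 from the descendant node A1.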
Training Set Design • Proteins in the current fold’s training set can be used in any way • Need to select for each local predictor: • Positive training examples • Negative training examples
Comparing Training Set Design Schemes • Using PA-SVM
Exclusive local predictors have more exceptions
Lowering the Cost of Local Predictors • Top-Down • Compute local predictors top to bottom until a negative prediction is reached
Top-Down Search
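The top-down strategy can be sketched as a recursion that stops descending wherever a local predictor says negative, so most of the hierarchy's predictors are never evaluated. The hierarchy and predictor outputs below are hypothetical stand-ins for trained classifiers:

```python
# Hypothetical hierarchy and the (fixed) answer each local predictor
# would give for one query protein; a real system would run a classifier.
CHILDREN = {"root": ["A", "B"], "A": ["A1", "A2"], "B": [], "A1": [], "A2": []}
LOCAL = {"root": True, "A": True, "A1": True, "A2": False, "B": False}

def top_down(node="root"):
    """Evaluate local predictors from the root down, pruning any subtree
    whose local predictor answers negative."""
    if not LOCAL[node]:
        return set()          # negative: do not descend into this subtree
    predicted = {node}
    for child in CHILDREN[node]:
        predicted |= top_down(child)
    return predicted
```

Here the negative answers at "B" and "A2" prune those subtrees, and only the path root → A → A1 is predicted.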
Outline • Introduction • Predictors • Evaluation in a Hierarchy • Local Predictor Design • Experimental Results • Conclusion
Predictor Results
Similar and Dissimilar Proteins • 89% of proteins – at least one good BLAST hit • Proteins which are similar (often homologous) to the set of well studied proteins • 11% of proteins – no good BLAST hit • Proteins which are not similar to the set of well studied proteins
Coverage • Coverage: Percentage of proteins for which a prediction is made
Similar Proteins – Exploiting BLAST • BLAST is fast and accurate when a good hit is found • Can exploit this to lower the cost of local predictors • Generate candidate nodes • Only compute local predictors for candidate nodes • Candidate node set should have: • High Recall • Minimal Size
Similar Proteins – Exploiting BLAST • Candidate node generation methods: • Searching outward from the BLAST hit • Taking the union of more than one BLAST hit’s annotations
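Both generation methods boil down to building a small, ancestor-closed set of nodes from the annotations of the top BLAST hits, then running local predictors only there. A sketch with an invented ontology and hit list (none of these terms are from the thesis's data):

```python
# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {"binding": [], "nucleic acid binding": ["binding"],
           "DNA binding": ["nucleic acid binding"], "catalytic activity": []}

def with_ancestors(terms):
    """Close a set of terms upward so ancestor nodes are also candidates."""
    out = set()
    stack = list(terms)
    while stack:
        t = stack.pop()
        if t not in out:
            out.add(t)
            stack.extend(PARENTS[t])
    return out

def candidate_nodes(hit_annotations, top_k=2):
    """Union of the annotation sets of the top-k BLAST hits (ordered best
    first), ancestor-closed; local predictors run only on these nodes."""
    candidates = set()
    for terms in hit_annotations[:top_k]:
        candidates |= with_ancestors(terms)
    return candidates
```

Taking more hits (larger `top_k`) raises the recall of the candidate set at the cost of its size, which is exactly the trade-off the slide names.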