Predicting Protein Function Using Machine-Learned Hierarchical Classifiers

Roman Eisner
Supervisors: Duane Szafron and Paul Lu
eisner@cs.ualberta.ca

Outline
• Introduction
• Predictors
• Evaluation in a Hierarchy
• Local Predictor Design
• Experimental Results
• Conclusion

Proteins
• Functional Units in the cell
• Perform a Variety of Functions
  • e.g. Catalysis of reactions, Structural and mechanical roles, transport of other molecules
• Can take years to study a single protein
• Any good leads would be helpful!

Protein Function Prediction and Protein Function Determination
• Prediction:
  • An estimate of what function a protein performs
• Determination:
  • Work in a laboratory to observe and discover what function a protein performs
• Prediction complements determination

Proteins
• Chain of amino acids
• 20 Amino Acids
• FASTA Format:
>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
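A FASTA record is a header line starting with ">" followed by one or more sequence lines. The following is a minimal sketch of reading such a record in Python; the function name `parse_fasta` is hypothetical and not part of the original system.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated without separators.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header, len(sequence))
```

The sequence string is what a sequence-based predictor would consume; the header carries the accession and entry name.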

Ontologies
• Standardized Vocabularies (Common Language)
• In biological literature, different terms can be used to describe the same function
  • e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity”
• Can be structured in a hierarchy to show relationships

Gene Ontology
• Directed Acyclic Graph (DAG)
• Always changing
• Describes 3 aspects of protein annotations:
  • Molecular Function
  • Biological Process
  • Cellular Component

Hierarchical Ontologies
• Can help to represent a large number of classes
• Represent General and Specific data
• Some data is incomplete – could become more specific in the future

Incomplete Annotations

Goal
• To predict the function of proteins given their sequence

Data Set
• Protein Sequences
  • UniProt database
• Ontology
  • Gene Ontology Molecular Function aspect
• Experimental Annotations
  • Gene Ontology Annotation project @ EBI
• Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
• Final Data Set: 14,362 proteins
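The pruning step can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used for the data set: it assumes a protein counts toward a node when it is annotated at that node or at any descendant, and the names `parents`, `annotations`, and `prune_ontology` are hypothetical. The toy example uses a threshold of 2 instead of 20.

```python
from collections import defaultdict


def ancestors(node, parents):
    """Return node plus every node reachable by following parent links."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def prune_ontology(parents, annotations, min_proteins=20):
    """Keep only nodes supported by at least min_proteins proteins."""
    supporters = defaultdict(set)
    for protein, nodes in annotations.items():
        for node in nodes:
            # Assumption: an annotation also counts for all ancestors.
            for ancestor in ancestors(node, parents):
                supporters[ancestor].add(protein)
    return {n for n, ps in supporters.items() if len(ps) >= min_proteins}


# Toy hierarchy: child -> list of parents (a DAG allows several parents).
parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "kinase": ["catalysis"],
}
annotations = {"p1": {"kinase"}, "p2": {"catalysis"}, "p3": {"binding"}}

kept = prune_ontology(parents, annotations, min_proteins=2)
print(sorted(kept))
```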

Predictors
• Global:
  • BLAST NN
• Local:
  • PA-SVM
  • PFAM-SVM
  • Probabilistic Suffix Trees

Why Linear SVMs?
• Accurate
• Explainability
  • Each term in the dot product is meaningful
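To illustrate the explainability point: a linear classifier scores an input as a bias plus a dot product, so each weight-times-feature term can be read as that feature's contribution to the decision. The sketch below uses made-up feature names and weights purely for illustration; they are not taken from PA-SVM or PFAM-SVM.

```python
def explain_linear(weights, bias, features):
    """Split a linear decision score into per-feature contributions."""
    contributions = {
        name: weights.get(name, 0.0) * value
        for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    # Largest absolute contribution first: these features drove the decision.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


# Hypothetical weights of one trained local predictor.
weights = {"kinase_domain": 1.5, "signal_peptide": -0.5, "zinc_finger": 0.25}
bias = -0.25
# Hypothetical binary features of one query protein.
features = {"kinase_domain": 1.0, "signal_peptide": 1.0}

score, ranked = explain_linear(weights, bias, features)
print("positive" if score > 0 else "negative", ranked)
```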

PA-SVM Proteome Analyst

PFAM-SVM Hidden Markov Models

PST
• Probabilistic Suffix Trees
• Efficient Markov chains
• Model the protein sequences directly:
• Prediction:

BLAST
• Protein Sequence Alignment for a query protein against any set of protein sequences

BLAST

Evaluating Predictions in a Hierarchy
• Not all errors are equivalent
  • An error to a sibling differs from an error to an unrelated part of the hierarchy
• Proteins can perform more than one function
  • Need to combine predictions of multiple functions into a single measure

Evaluating Predictions in a Hierarchy
• Semantics of the hierarchy – True Path Rule
• Protein labeled with: {T} -> {T, A1, A2}
• Predicted functions: {S} -> {S, A1, A2}
• Precision = 2/3 = 67%
• Recall = 2/3 = 67%

Evaluating Predictions in a Hierarchy
• Protein labeled with: {T} -> {T, A1, A2}
• Predicted functions: {C1} -> {C1, T, A1, A2}
• Precision = 3/4 = 75%
• Recall = 3/3 = 100%
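The two worked examples above can be reproduced with a short sketch: expand both the true and the predicted label sets with all their ancestors (the True Path Rule), then compute precision and recall on the expanded sets. The edges in `parents` are an assumption chosen to match the sets on the slides (S and T as siblings under A1, A1 under A2, C1 as a child of T); the function names are hypothetical.

```python
def with_ancestors(nodes, parents):
    """Close a set of nodes under the True Path Rule (add all ancestors)."""
    closed, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents.get(node, ()))
    return closed


def hierarchical_precision_recall(true_nodes, predicted_nodes, parents):
    """Precision and recall computed on ancestor-expanded label sets."""
    true_set = with_ancestors(true_nodes, parents)
    pred_set = with_ancestors(predicted_nodes, parents)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


# Assumed toy hierarchy: child -> list of parents.
parents = {"A2": [], "A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}

sibling_error = hierarchical_precision_recall({"T"}, {"S"}, parents)
too_specific = hierarchical_precision_recall({"T"}, {"C1"}, parents)
print(sibling_error, too_specific)
```

Predicting a sibling loses both precision and recall, while predicting a child of the true node keeps full recall and loses only some precision.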

Supervised Learning

Cross-Validation
• Used to estimate the performance of a classification system on future data
• 5-Fold Cross-Validation:
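As a rough sketch of 5-fold cross-validation: the data is split into five disjoint folds, and each fold in turn is held out for testing while the other four are used for training. The helper `k_fold_splits` and the placeholder protein identifiers are hypothetical; the shuffling seed is an arbitrary choice.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


proteins = [f"protein_{i}" for i in range(20)]  # placeholder identifiers
splits = list(k_fold_splits(proteins, k=5))
for train, test in splits:
    print(len(train), len(test))
```

The performance estimate is then the average of the scores obtained on the five held-out folds.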

Inclusive vs Exclusive Local Predictors
• In a system of local predictors, how should each local predictor behave?
• Two extremes:
  • A local predictor predicts positive only for those proteins that belong exactly at that node
  • A local predictor predicts positive for those proteins that belong at or below that node in the hierarchy
• No a priori reason to choose either

Exclusive Local Predictors

Inclusive Local Predictors
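The difference between the two extremes can be sketched by listing which proteins count as positive for a node. This sketch assumes each protein is stored with its most specific annotations only; `children`, `annotations`, and `positive_examples` are hypothetical names, and the choice of negative examples is deliberately left out.

```python
def descendants(node, children):
    """Return node plus every node below it in the hierarchy."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def positive_examples(node, annotations, children, inclusive):
    """Proteins a local predictor at `node` should label positive."""
    # Exclusive: exactly at the node. Inclusive: at or below the node.
    targets = descendants(node, children) if inclusive else {node}
    return {p for p, nodes in annotations.items() if nodes & targets}


# Toy hierarchy: parent -> list of children.
children = {"catalysis": ["kinase", "hydrolase"], "kinase": [], "hydrolase": []}
# Most specific annotation of each protein (assumed representation).
annotations = {"p1": {"catalysis"}, "p2": {"kinase"}, "p3": {"hydrolase"}}

exclusive = positive_examples("catalysis", annotations, children, inclusive=False)
inclusive = positive_examples("catalysis", annotations, children, inclusive=True)
print(sorted(exclusive), sorted(inclusive))
```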

Training Set Design
• Proteins in the current fold’s training set can be used in any way
• Need to select for each local predictor:
  • Positive training examples
  • Negative training examples

Training Set Design

Comparing Training Set Design Schemes
• Using PA-SVM

Exclusive local predictors have more exceptions

Lowering the Cost of Local Predictors
• Top-Down
  • Compute local predictors top to bottom until a negative prediction is reached

Top-Down Search
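A minimal sketch of the top-down idea: start at the root, run a node's local predictor, and only descend to its children when the prediction is positive. The callable `predict` stands in for a trained local predictor and is hypothetical, as are the toy hierarchy and the assumption that the root itself has a predictor.

```python
def top_down_predict(root, children, predict):
    """Evaluate local predictors from the root, pruning below negatives."""
    positives, evaluated = set(), []
    visited, frontier = set(), [root]
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue  # a DAG node can be reached through several parents
        visited.add(node)
        evaluated.append(node)
        if predict(node):
            positives.add(node)
            # Only descend when the local predictor says "positive".
            frontier.extend(children.get(node, ()))
    return positives, evaluated


# Toy hierarchy: parent -> list of children.
children = {
    "root": ["catalysis", "binding"],
    "catalysis": ["kinase", "hydrolase"],
    "binding": ["dna_binding"],
}
# Stand-in for trained local predictors: fixed answers per node.
votes = {
    "root": True, "catalysis": True, "kinase": True,
    "hydrolase": False, "binding": False, "dna_binding": True,
}

positives, evaluated = top_down_predict("root", children, votes.get)
print(sorted(positives), len(evaluated))
```

In this toy run the predictor for "dna_binding" is never computed because its parent was predicted negative, which is where the cost saving comes from.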

Predictor Results

Similar and Dissimilar Proteins
• 89% of proteins – at least one good BLAST hit
  • Proteins which are similar (often homologous) to the set of well-studied proteins
• 11% of proteins – no good BLAST hit
  • Proteins which are not similar to the set of well-studied proteins

Coverage
• Coverage: Percentage of proteins for which a prediction is made

Similar Proteins – Exploiting BLAST
• BLAST is fast and accurate when a good hit is found
• Can exploit this to lower the cost of local predictors
  • Generate candidate nodes
  • Only compute local predictors for candidate nodes
• Candidate node set should have:
  • High Recall
  • Minimal Size

Similar Proteins – Exploiting BLAST
• Candidate node generation methods:
  • Searching outward from the BLAST hit
  • Taking the union of more than one BLAST hit’s annotations
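Both generation methods can be sketched in one function. This is an interpretation under stated assumptions, not the exact method from the talk: "searching outward" is read here as expanding a fixed number of steps along parent and child links from the hit's annotated nodes, and `hit_annotations` is assumed to be a list of annotation sets ordered from best to worst BLAST hit. All names are hypothetical.

```python
def candidate_nodes(hit_annotations, parents, top_hits=1, radius=0):
    """Union the annotations of the best hits, then expand outward."""
    # Build undirected neighbour links from the child -> parents map.
    neighbours = {}
    for child, node_parents in parents.items():
        for parent in node_parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)

    candidates = set()
    for nodes in hit_annotations[:top_hits]:
        candidates |= set(nodes)

    frontier = set(candidates)
    for _ in range(radius):
        # One step outward: unseen neighbours of the current frontier.
        frontier = {m for n in frontier for m in neighbours.get(n, ())} - candidates
        candidates |= frontier
    return candidates


# Toy hierarchy: child -> list of parents.
parents = {
    "root": [],
    "catalysis": ["root"],
    "binding": ["root"],
    "kinase": ["catalysis"],
    "hydrolase": ["catalysis"],
}
hits = [{"kinase"}, {"binding"}]  # annotations of the best and second-best hit

print(sorted(candidate_nodes(hits, parents, top_hits=2)))
print(sorted(candidate_nodes(hits, parents, top_hits=1, radius=1)))
```

Raising `top_hits` or `radius` increases recall of the candidate set at the price of a larger set, which is the trade-off named in the requirements above.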
