Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions. Paper by Umar Syed and Golan Yona, Department of CS, Cornell University. Presentation by Andrejus Parfionovas, Department of Math & Stat, USU
Classical methods to predict the structure of a new protein: • Sequence comparison against known proteins in search of similarities • Sequences often diverge and become unrecognizable • Structure comparison against known structures in the PDB database • Structural data is sparse and not available for newly sequenced genes
What other features can be used to improve prediction? • Domain content • Subcellular location • Tissue specificity • Species type • Pairwise interactions • Enzyme cofactors • Catalytic activity • Expression profiles, etc.
With so many features, it is important: • To extract relevant information • Directly from the sequence • Predicted secondary structure • Features extracted from databases • To combine the data in a feasible model • A mixture model of Probabilistic Decision Trees (PDT) was used
Features extracted directly from the sequence (percentages): • 20 individual amino acids • 16 amino acid groups (+ or – charged, polar, aromatic, hydrophobic, acidic, etc.) • 20 most informative dipeptides
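A minimal sketch of how such composition features could be computed from a raw sequence. The group definitions and the dipeptide set below are illustrative placeholders; the paper uses 16 physico-chemical groups and keeps only the 20 most informative dipeptides.

```python
# Sketch of sequence composition features (illustrative only; the actual
# group definitions and the 20 selected dipeptides come from the paper).
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Example groups; the paper uses 16 physico-chemical groups.
GROUPS = {
    "positively_charged": set("KRH"),
    "negatively_charged": set("DE"),
    "aromatic": set("FWY"),
    "hydrophobic": set("AVLIMFWC"),
}

def composition_features(seq):
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    feats = {}
    # 20 individual amino acid percentages
    for aa in AMINO_ACIDS:
        feats[f"aa_{aa}"] = counts[aa] / n
    # group percentages
    for name, members in GROUPS.items():
        feats[f"grp_{name}"] = sum(counts[a] for a in members) / n
    # dipeptide percentages (all 400 here; the paper keeps the 20 most informative)
    dipep = Counter(seq[i:i + 2] for i in range(n - 1))
    for a, b in product(AMINO_ACIDS, repeat=2):
        feats[f"dp_{a}{b}"] = dipep[a + b] / (n - 1)
    return feats

print(len(composition_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))
```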
Features predicted from the sequence: • Secondary structure predicted by PSIPRED: • Coil • Helix • Strand
Features extracted from the SWISSPROT database: • Binary features (presence/absence) • Alternative products • Enzyme cofactors • Catalytic activity • Nominal features • Tissue specificity (2 different definitions) • Subcellular location • Organism and species classification • Continuous features • Number of patterns exhibited by each protein (the “complexity” of a protein)
Mixture model of PDT (Probabilistic Decision Trees) • Can handle nominal data • Robust to errors in the data • Can handle missing data
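A minimal sketch of the mixture idea: each tree returns a class distribution, and the mixture averages them with per-tree weights. The tree objects and weights below are placeholders, not the paper's implementation.

```python
# Minimal sketch of a mixture of probabilistic trees: the final class
# distribution is a weighted average of the per-tree distributions.

def mixture_predict(trees, weights, x):
    """trees: list of callables returning {class: probability};
    weights: per-tree weights (e.g. based on validation performance)."""
    total = sum(weights)
    mixed = {}
    for tree, w in zip(trees, weights):
        for cls, p in tree(x).items():
            mixed[cls] = mixed.get(cls, 0.0) + (w / total) * p
    return mixed

# Toy example with two hand-written "trees"
t1 = lambda x: {"kinase": 0.7, "protease": 0.3}
t2 = lambda x: {"kinase": 0.4, "protease": 0.6}
print(mixture_predict([t1, t2], [0.8, 0.2], None))
```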
How to select an attribute for a decision node? • Use entropy to measure the impurity • Impurity must decrease after the split • Alternative measure – the de Mántaras distance metric (has a lower bias toward attributes with low split information)
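A small sketch of entropy impurity and information gain for a nominal attribute (the de Mántaras distance alternative is not shown); the attribute names and labels are invented for illustration.

```python
# Sketch of entropy impurity and information gain for a nominal attribute.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """rows: list of dicts of attribute values; attr: attribute name."""
    parent = entropy(labels)
    partitions = defaultdict(list)
    for row, y in zip(rows, labels):
        partitions[row[attr]].append(y)
    children = sum(len(part) / len(labels) * entropy(part)
                   for part in partitions.values())
    return parent - children  # impurity must decrease for a useful split

rows = [{"cofactor": "yes"}, {"cofactor": "yes"}, {"cofactor": "no"}, {"cofactor": "no"}]
labels = ["enzyme", "enzyme", "other", "enzyme"]
print(information_gain(rows, labels, "cofactor"))
```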
Enhancements of the algorithm: • Dynamic attribute filtering • Discretizing numerical features (see the sketch below) • Multiple values for attributes • Missing attributes • Binary splitting • Leaf weighting • Post-pruning • 10-fold cross-validation
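As an example of one enhancement, here is a sketch of discretizing a numerical feature by picking the binary threshold with the highest information gain; the paper's exact discretization procedure may differ.

```python
# Sketch of discretizing a numerical feature: choose the binary threshold
# with the highest information gain (illustrative only).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# e.g. the continuous "complexity" feature (number of patterns per protein)
print(best_threshold([1, 2, 2, 7, 9, 11], ["other"] * 3 + ["enzyme"] * 3))
```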
The probabilistic framework • An attribute is selected with a probability that depends on its information gain • Trees are weighted by their performance
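A sketch of the probabilistic ingredient, under the assumption that attributes are sampled with probability proportional to their information gain and that trees are weighted by validation accuracy; the paper's exact distributions may differ.

```python
# Sketch: sample the split attribute with probability growing with its
# information gain; weight the resulting trees by validation performance.
# Sampling proportional to gain is an assumption for illustration.
import random

def sample_attribute(gains, rng=random.Random(0)):
    """gains: dict attribute -> information gain (non-negative)."""
    total = sum(gains.values())
    if total == 0:
        return rng.choice(list(gains))
    r = rng.uniform(0, total)
    acc = 0.0
    for attr, g in gains.items():
        acc += g
        if r <= acc:
            return attr
    return attr  # numerical edge case

gains = {"cofactor": 0.31, "tissue": 0.05, "location": 0.12}
print(sample_attribute(gains))

# Trees weighted by (normalized) validation accuracy
accuracies = [0.81, 0.74, 0.69]
weights = [a / sum(accuracies) for a in accuracies]
print(weights)
```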
Evaluation of decision trees • Accuracy = (tp + tn)/total • Sensitivity = tp/(tp + fn) • Selectivity = tp/(tp + fp) • Jensen-Shannon divergence score
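A sketch computing these measures from confusion counts, plus the Jensen-Shannon divergence between two class distributions; the numbers are made up.

```python
# Sketch of the evaluation measures: confusion-matrix rates plus the
# Jensen-Shannon divergence between two probability distributions.
import math

def rates(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "selectivity": tp / (tp + fp),
    }

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(rates(tp=80, tn=90, fp=10, fn=20))
print(jensen_shannon([0.7, 0.3], [0.4, 0.6]))
```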
Handling skewed distributions (unequal class sizes) • Re-weight cases by 1/(# of counts) • Increases the impurity and the # of false positives • Mixed entropy • Uses an average of the weighted & unweighted information gain to split and prune trees • Interlaced entropy • Start with weighted samples and later use the unweighted entropy
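A sketch of the re-weighting and mixed-entropy ideas; the simple 50/50 average of weighted and unweighted entropy is an assumption for illustration.

```python
# Sketch of handling skewed classes: weight each example by 1/(class count),
# compute the weighted entropy from those weights, and mix it with the plain
# (unweighted) entropy. The 50/50 average here is an assumption.
import math
from collections import Counter

def entropy_from_weights(weighted_counts):
    total = sum(weighted_counts.values())
    return -sum((w / total) * math.log2(w / total)
                for w in weighted_counts.values() if w > 0)

def mixed_entropy(labels):
    counts = Counter(labels)
    unweighted = entropy_from_weights(counts)
    # after 1/count re-weighting every class carries total weight 1
    weighted = entropy_from_weights({c: 1.0 for c in counts})
    return 0.5 * (weighted + unweighted)

labels = ["kinase"] * 90 + ["protease"] * 10   # skewed class sizes
print(mixed_entropy(labels))
```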
Model selection (simplification) • Occam’s razor: out of two models with the same result, choose the simpler one • Bayesian approach: the most probable model has the maximum posterior probability
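A toy illustration of the Bayesian criterion, comparing models by log P(M|D) ∝ log P(D|M) + log P(M); the complexity-penalizing prior and the numbers are assumptions, not the paper's.

```python
# Sketch of Bayesian model selection: pick the model with the highest
# posterior. The prior that penalizes tree size is only an illustrative
# way to encode Occam's razor.

def log_posterior(log_likelihood, n_leaves, penalty_per_leaf=2.0):
    log_prior = -penalty_per_leaf * n_leaves   # simpler trees are more probable a priori
    return log_likelihood + log_prior

models = {
    "small_tree": log_posterior(log_likelihood=-120.0, n_leaves=8),
    "large_tree": log_posterior(log_likelihood=-112.0, n_leaves=25),
}
print(max(models, key=models.get))   # the simpler tree wins here
```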
Pfam classification test (comparison to BLAST) • PDT performance – 81% • BLAST performance – 86% • Main reasons: • Nodes become impure because the weighted entropy stops learning too early • Important branches were eliminated by post-pruning when the validation set is small
EC classification test (comparison to BLAST) • PDT performance on average – 71% • BLAST performance was often lower
Conclusions • Many protein families cannot be defined by sequence similarities alone • The new method makes use of other features (structure, dipeptides, etc.) • Besides classification, PDTs allow feature selection for further use • Results are comparable to BLAST
Modifications and Improvements • Use global optimization for pruning • Use probabilities for attribute values • Use boosting techniques (combine weighted trees) • Use the Gini index to measure node impurity
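For the last suggestion, the Gini index is a standard alternative impurity measure; a one-function sketch:

```python
# Gini index as an alternative node-impurity measure (suggested improvement).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["enzyme"] * 3 + ["other"] * 1))   # 1 - (0.75^2 + 0.25^2) = 0.375
```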