Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions Paper by Umar Syed and Golan Yona, Department of CS, Cornell University Presentation by Andrejus Parfionovas, Department of Math & Stat, USU
Classical methods to predict the structure of a new protein: • Sequence comparison against known proteins in search of similarities • Sequences often diverge and become unrecognizable • Structure comparison against known structures in the PDB database • Structural data is sparse and not available for newly sequenced genes
What other features can be used to improve prediction? • Domain content • Subcellular location • Tissue specificity • Species type • Pairwise interactions • Enzyme cofactors • Catalytic activity • Expression profiles, etc.
With so many features, it is important: • To extract relevant information • Directly from the sequence • Predicted secondary structure • Features extracted from databases • To combine the data in a feasible model • A mixture model of Probabilistic Decision Trees (PDTs) was used
Features extracted directly from the sequence (as percentages): • 20 individual amino acids • 16 amino acid groups (positively/negatively charged, polar, aromatic, hydrophobic, acidic, etc.) • 20 most informative dipeptides
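A minimal sketch of how such composition features could be computed from a raw sequence; the residue alphabet is standard, but the grouping and the choice of informative dipeptides here are illustrative rather than the paper's exact sets:

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(seq):
    """Fraction of each of the 20 amino acids in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}

def dipeptide_features(seq):
    """Fraction of each possible dipeptide (400 in total); the paper keeps
    only the 20 most informative ones, chosen on the training data."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return {a + b: pairs.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)}

# Example: composition of a short fragment
feats = composition_features("MKTAYIAKQR")
```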
Features predicted from the sequence: • Secondary structure predicted by PSIPRED: • Coil • Helix • Strand
Features extracted from the SWISS-PROT database: • Binary features (presence/absence) • Alternative products • Enzyme cofactors • Catalytic activity • Nominal features • Tissue specificity (2 different definitions) • Subcellular location • Organism and species classification • Continuous features • Number of patterns exhibited by each protein (the “complexity” of a protein)
Mixture model of PDTs (Probabilistic Decision Trees) • Can handle nominal data • Robust to errors • Missing data is allowed
How to select an attribute for a decision node? • Use entropy to measure the impurity • Impurity must decrease after the split • Alternative measure – the de Mántaras distance metric (has a lower bias towards attributes with low split information)
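A small sketch of the entropy-based impurity and information-gain computation that drives attribute selection (standard ID3/C4.5-style formulas, not the paper's exact code):

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Reduction in entropy after splitting on one nominal attribute."""
    n = len(labels)
    before = entropy(labels)
    after = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        after += (len(subset) / n) * entropy(subset)
    return before - after
```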
Enhancements of the algorithm: • Dynamic attribute filtering • Discretizing numerical features (a split-threshold sketch follows below) • Multiple values for attributes • Missing attributes • Binary splitting • Leaf weighting • Post-pruning • 10-fold cross-validation
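One common way to discretize a numerical feature for a binary split, sketched under the assumption that candidate thresholds are midpoints between consecutive sorted values and the threshold with the highest information gain is kept (the paper's exact procedure may differ):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return the split threshold on a continuous feature with maximum
    information gain, together with that gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([lab for _, lab in pairs])
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```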
The probabilistic framework • An attribute is selected with a probability that depends on its information gain • Trees are weighted by their performance
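One plausible reading of these two ideas, sketched below: attributes are sampled with probability proportional to their information gain instead of always taking the maximum, and the trees' class distributions are combined with performance-based weights. The sampling distribution and the `predict_proba` tree interface are assumptions for illustration, not the paper's API:

```python
import random

def sample_attribute(gains):
    """gains: dict mapping attribute name -> information gain.
    Pick an attribute with probability proportional to its gain."""
    attrs = [a for a, g in gains.items() if g > 0]
    if not attrs:                                  # no informative attribute left
        return None
    weights = [gains[a] for a in attrs]
    return random.choices(attrs, weights=weights, k=1)[0]

def mixture_predict(trees, tree_weights, x):
    """Combine per-tree class distributions, weighting each tree by its
    (e.g. validation-set) performance."""
    total = sum(tree_weights)
    combined = {}
    for tree, w in zip(trees, tree_weights):
        for cls, p in tree.predict_proba(x).items():   # hypothetical tree method
            combined[cls] = combined.get(cls, 0.0) + w * p / total
    return combined
```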
Evaluation of decision trees • Accuracy = (tp + tn)/total • Sensitivity = tp/(tp + fn) • Selectivity = tp/(tp + fp) • Jensen-Shannon divergence score
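These measures follow directly from the confusion-matrix counts; a minimal sketch with made-up numbers:

```python
def evaluation_scores(tp, tn, fp, fn):
    """Accuracy, sensitivity and selectivity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy":    (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "selectivity": tp / (tp + fp),   # also known as precision
    }

# Example (illustrative counts):
scores = evaluation_scores(tp=80, tn=90, fp=10, fn=20)
# accuracy = 0.85, sensitivity = 0.80, selectivity ~ 0.889
```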
Handling skewed distributions (unequal class sizes) • Re-weight cases by 1/(# of examples in their class) • Increases the impurity and the # of false positives • Mixed entropy • Uses an average of the weighted & unweighted information gain to split and prune trees • Interlaced entropy • Start with weighted samples and later switch to the unweighted entropy
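A sketch of the re-weighting idea: each example carries a weight inversely proportional to its class size, and entropy is computed over weighted class masses rather than raw counts. The simple 50/50 blend in `mixed_entropy` is illustrative; the paper's exact mixing scheme may differ:

```python
import math
from collections import Counter

def class_weights(labels):
    """Weight every example by 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[lab] for lab in labels]

def weighted_entropy(labels, weights):
    """Entropy over weighted class masses instead of raw counts."""
    mass = Counter()
    for lab, w in zip(labels, weights):
        mass[lab] += w
    total = sum(mass.values())
    return -sum((m / total) * math.log2(m / total) for m in mass.values())

def mixed_entropy(labels, weights, alpha=0.5):
    """Illustrative blend of weighted and unweighted entropy."""
    unweighted = weighted_entropy(labels, [1.0] * len(labels))
    return alpha * weighted_entropy(labels, weights) + (1 - alpha) * unweighted
```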
Model selection (simplification) • Occam’s razor: of 2 models with the same result, choose the simpler one • Bayesian approach: the most probable model maximizes the posterior probability, P(M | D) ∝ P(D | M) · P(M)
Pfam classification test (comparison to BLAST) • PDT performance – 81% • BLAST performance – 86% • Main reasons: • Nodes remain impure because the weighted entropy stops learning too early • Important branches were eliminated by post-pruning when the validation set is small
EC classification test (comparison to BLAST) • PDT performance on average – 71% • BLAST performance was often lower
Conclusions • Many protein families cannot be defined by sequence similarities alone • The new method makes use of other features (structure, dipeptides, etc.) • Besides classification, PDTs allow feature selection for further use • Results are comparable to BLAST
Modifications and Improvements • Use global optimization for pruning • Use probabilities for attribute values • Use boosting techniques (combine weighted trees) • Use the Gini index to measure node impurity
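For reference, the Gini index proposed above as an alternative node-impurity measure; a minimal sketch:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 - sum of squared class proportions.
    0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Example: a pure node vs. an evenly mixed node
gini_index(["A", "A", "A", "A"])   # 0.0
gini_index(["A", "A", "B", "B"])   # 0.5
```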