Application and Efficacy of Random Forest Method for QSAR Analysis presented by Pavel Polishchuk
Random Forest – consensus modelling
A Random Forest model is an ensemble of individual decision trees.
Rules for model construction:
1. Each tree is grown on a separate bootstrap sample of the initial training set compounds.
2. At each node, only a small, fixed number of randomly chosen descriptors is considered.
3. Each tree is grown to its maximum depth (no pruning).
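The three construction rules can be sketched in plain Python. This is a minimal illustration, not the presenter's implementation: the function names `bootstrap_sample` and `choose_descriptors` and the parameter `n_try` are hypothetical, and the tree-growing step itself is omitted. The sizes 644, 6021, and 2000 are the training set size, descriptor count, and `vars` setting quoted elsewhere in these slides.

```python
import random

def bootstrap_sample(n_compounds, rng):
    """Rule 1: draw n_compounds indices with replacement.

    Returns the in-bag indices and the sorted out-of-bag indices
    (compounds that were never drawn for this tree).
    """
    in_bag = [rng.randrange(n_compounds) for _ in range(n_compounds)]
    out_of_bag = sorted(set(range(n_compounds)) - set(in_bag))
    return in_bag, out_of_bag

def choose_descriptors(n_descriptors, n_try, rng):
    """Rule 2: pick a fixed number of distinct descriptors at random for one node."""
    return rng.sample(range(n_descriptors), n_try)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
in_bag, out_of_bag = bootstrap_sample(644, rng)
candidates = choose_descriptors(6021, 2000, rng)
# Rule 3 (no pruning) would mean splitting on the best of `candidates`
# at every node until no further split is possible.
```

Because sampling is done with replacement, roughly a third of the compounds end up out-of-bag for any given tree, which is what makes an internal estimate of predictive ability possible.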
Random Forest algorithm (diagram): Initial dataset → Bootstrap samples → Tree1, Tree2, Tree3, … → Combined prediction
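The slide does not say how the per-tree outputs are combined. A common choice, assumed here, is to average the trees for a regression task and to take a majority vote for a classification task. The function names and the example labels below are hypothetical.

```python
from collections import Counter
from statistics import mean

def combine_regression(tree_predictions):
    """Average the per-tree predictions for one compound."""
    return mean(tree_predictions)

def combine_classification(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]

# Three trees, one compound (made-up values for illustration only)
avg = combine_regression([1.0, 2.0, 3.0])
label = combine_classification(["active", "inactive", "active"])
```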
Random Forest advantages:
• RF models are robust to over-fitting.
• There is no need for pre-selection of variables.
• RF has its own reliable procedure for estimating the predictive ability of a model (out-of-bag estimation).
• RF models are robust to “noise” in the training dataset.
• RF allows the importance of each variable for the target property to be estimated (interpretability of the RF model).
• RF allows compounds with different mechanisms of action to be analysed.
• The RF method is very fast and effective when working with huge datasets.
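The built-in estimate of predictive ability relies on the compounds each tree did not see during training. A minimal sketch of that out-of-bag aggregation follows; the function name `oob_predictions` and the data layout are assumptions made for illustration, and the numbers are made up.

```python
from statistics import mean

def oob_predictions(per_tree_predictions, per_tree_oob):
    """Average, for each compound, only the trees that did not train on it.

    per_tree_predictions[t][i] is tree t's prediction for compound i;
    per_tree_oob[t] is the set of compounds left out of tree t's bootstrap
    sample. Compounds that were in-bag for every tree get None.
    """
    n_compounds = len(per_tree_predictions[0])
    result = []
    for i in range(n_compounds):
        values = [preds[i]
                  for preds, oob in zip(per_tree_predictions, per_tree_oob)
                  if i in oob]
        result.append(mean(values) if values else None)
    return result

preds = [[1.0, 2.0, 3.0],   # tree 0
         [3.0, 4.0, 5.0]]   # tree 1
oob = [{0, 1}, {1}]         # compound 2 is never out-of-bag
estimates = oob_predictions(preds, oob)
```

Comparing these out-of-bag predictions with the observed values gives an error estimate without setting aside a separate validation set.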
Toxicity of chemical compounds for T. pyriformis# was expressed as the logarithm of the inverse of the concentration causing 50% inhibition of Tetrahymena pyriformis growth (pIGC50).
Diverse datasets:
training set = 644 compounds
test set 1 (ts1) = 339 compounds
test set 2 (ts2) = 110 compounds
Total number of 2D simplex descriptors = 6021
# Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
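As a small worked example of the endpoint, pIGC50 can be computed as log10(1/IGC50), which equals -log10(IGC50). The base-10 logarithm is assumed from the usual "p" notation, the slide does not state the concentration units, and the function name is hypothetical.

```python
import math

def pigc50(igc50):
    """Return log10(1 / IGC50), i.e. the negative base-10 logarithm of IGC50."""
    if igc50 <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(igc50)

value = pigc50(0.01)  # a lower IGC50 (more toxic compound) gives a higher pIGC50
```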
Comparison of the RF model with other consensus models (chart; axis: mean absolute error of prediction)
RF model (trees = 500, vars = 2000)#
# Polishchuk, P.G., et al., J. Chem. Inf. Model., 2009. 49: p. 2481-2488.
## Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
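The comparison metric, mean absolute error of prediction, is the average of the absolute differences between observed and predicted values. A minimal sketch with made-up numbers (not data from the cited studies):

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    if not observed:
        raise ValueError("inputs must not be empty")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

mae = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```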
Estimation of mutagenic potential of chemical compounds (Ames test)
training set = 4361 compounds
test set = 2181 compounds
# Results of a collaboration of 13 scientific groups (not yet published)
Solubility in water: QSPR task solution#
training set = 2537 compounds
test set = 301 compounds
training set R² = 0.99
out-of-bag set R² = 0.88
test set R² = 0.82
# Kovdienko, N.A., et al., Molecular Informatics, 2010. 29: p. 394-406.
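The slide does not say which definition was used for the coefficients of determination reported for the training, out-of-bag, and test sets. One common definition, assumed here, is 1 - SS_res / SS_tot; the sketch below uses made-up numbers, not the solubility data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

r2 = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
```

Since each tree is grown to full depth, the training-set value is expected to be optimistic; the out-of-bag and external test set values are the more informative indicators of predictive ability.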
Leo Breiman (27.01.1928 – 07.07.2005), author of Random Forest
«Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.»