
Application and Efficacy of Random Forest Method for QSAR Analysis presented by Pavel Polishchuk



Presentation Transcript


  1. Application and Efficacy of Random Forest Method for QSAR Analysis presented by Pavel Polishchuk

  2. Random Forest – consensus modelling. A Random Forest model is an ensemble of single decision trees. Rules for model construction:
     1. Each tree is grown on a separate bootstrap sample of the initial training set compounds.
     2. At each node, only a small, fixed number of randomly chosen descriptors is considered.
     3. Each tree is grown to its maximum depth (no pruning).
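The first two construction rules can be sketched with the Python standard library alone. This is a minimal illustration, not the presenter's implementation: the function names are hypothetical, and the sizes (644 compounds, 6021 descriptors, 2000 descriptors tried per node) are borrowed from the T. pyriformis example later in the talk. Rule 3 (growing each tree without pruning) is not shown here.

```python
import random


def bootstrap_sample(n_compounds, rng):
    """Rule 1: draw n_compounds indices with replacement."""
    return [rng.randrange(n_compounds) for _ in range(n_compounds)]


def candidate_descriptors(n_descriptors, n_try, rng):
    """Rule 2: pick n_try distinct descriptor indices to test at one node."""
    return rng.sample(range(n_descriptors), n_try)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Compounds used to grow one tree (some appear more than once).
in_bag = bootstrap_sample(644, rng)

# Compounds never drawn for this tree; they are "out-of-bag" for it.
out_of_bag = sorted(set(range(644)) - set(in_bag))

# Descriptors considered at a single node of that tree.
node_descriptors = candidate_descriptors(6021, 2000, rng)
```

Because sampling is done with replacement, each tree leaves some compounds out of its bootstrap sample, and a fresh descriptor subset is drawn at every node rather than once per tree.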

  3. Random Forest algorithm (diagram): bootstrap samples are drawn from the initial dataset; each bootstrap sample is used to grow one tree (Tree 1, Tree 2, Tree 3, …); the outputs of all trees are merged into a combined prediction.
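The slide does not say how the combined prediction is formed. A common choice, assumed here, is to average the tree outputs for a continuous property and to take a majority vote for a class label. The function names below are hypothetical.

```python
from collections import Counter
from statistics import mean


def combine_regression(tree_predictions):
    """Average the numeric predictions of all trees for one compound."""
    return mean(tree_predictions)


def combine_classification(tree_votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


# Three trees predicting a continuous property for one compound.
consensus_value = combine_regression([1.0, 2.0, 3.0])

# Three trees voting on a class label for one compound.
consensus_label = combine_classification(
    ["mutagen", "non-mutagen", "mutagen"]
)
```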

  4. Random Forest advantages:
     • RF models are robust to over-fitting.
     • There is no need for pre-selection of variables.
     • RF has its own reliable procedure for estimating the predictive ability of a model (out-of-bag estimation).
     • RF models are robust to “noise” in the training dataset.
     • RF allows estimation of variable importance for the target property (interpretability of the RF model).
     • RF allows analysis of compounds with different mechanisms of action.
     • The RF method is very fast and effective when working with huge datasets.
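The built-in estimate of predictive ability is usually the out-of-bag (OOB) estimate: each compound is predicted only by the trees whose bootstrap sample did not contain it. The sketch below shows just that bookkeeping. As a loudly labelled simplification, each "tree" is replaced by a stand-in that predicts the mean of its in-bag values, so the numbers it produces say nothing about real model quality.

```python
import random
from statistics import mean


def oob_predictions(y, n_trees, rng):
    """Return one out-of-bag prediction per compound (None if never OOB)."""
    n = len(y)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_trees):
        # Bootstrap sample: n indices drawn with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        in_bag_set = set(in_bag)
        # Stand-in for a fitted tree (assumption): predict the in-bag mean.
        prediction = mean(y[i] for i in in_bag)
        # Only compounds this "tree" never saw receive its prediction.
        for i in range(n):
            if i not in in_bag_set:
                sums[i] += prediction
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


y = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # toy property values
oob = oob_predictions(y, n_trees=200, rng=random.Random(1))
```

Comparing the OOB predictions with the observed values gives an error estimate without setting aside a separate validation set.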

  5. Several examples of solutions to real QSAR tasks

  6. Toxicity of chemical compounds for T. pyriformis#. Toxicity was expressed as the negative logarithm of the concentration causing 50% inhibition of Tetrahymena pyriformis growth (pIGC50). Diverse datasets: training set = 644 compounds; test set 1 (ts1) = 339 compounds; test set 2 (ts2) = 110 compounds. Total number of 2D simplex descriptors = 6021. # Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
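Reading pIGC50 in the usual "p" sense, that is, as the negative base-10 logarithm of the IGC50 concentration, the conversion can be sketched as follows. The slide does not state the concentration units, so they are left as an assumption that has to match the dataset.

```python
import math


def pigc50(igc50):
    """Convert a 50% growth inhibition concentration to pIGC50.

    The concentration units are an assumption and must be the same
    for every compound in the dataset.
    """
    if igc50 <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(igc50)


# A more toxic compound (lower IGC50) gets a higher pIGC50.
weak = pigc50(1.0)
strong = pigc50(0.01)
```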

  7. Comparison of the RF model with other consensus models (chart of the mean absolute error of prediction). RF model (trees = 500, vars = 2000)#. # Polishchuk, P.G., et al., J. Chem. Inf. Model., 2009. 49: p. 2481-2488. ## Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
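The comparison metric, mean absolute error, is the average of the absolute differences between observed and predicted values. A minimal sketch follows; the input values are made up for illustration and are not taken from the study.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total / len(observed)


# Illustrative values only.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```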

  8. Estimation of the mutagenic potential of chemical compounds (Ames test). Training set = 4361 compounds; test set = 2181 compounds. # Results of a collaboration of 13 scientific groups (not published yet).

  9. Solubility in water: QSPR task solution#. Training set = 2537 compounds; test set = 301 compounds. Training set R2 = 0.99; out-of-bag set R2 = 0.88; test set R2 = 0.82. # Kovdienko, N.A., et al., Molecular Informatics, 2010. 29: p. 394-406.
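The slide does not say which definition of R2 was used. One common definition, assumed here, is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below uses made-up values, not the solubility data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values must not all be identical")
    return 1.0 - ss_res / ss_tot


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A training set value far above the out-of-bag and test set values, as reported on this slide, is expected for unpruned trees, which is why the out-of-bag or test set figure is the more honest measure of predictive ability.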

  10. Leo Breiman (27.01.1928 – 05.07.2005), author of Random Forest: «Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.»
