1.19k likes | 2.67k Views
Random Forest. Dr John Mitchell (Chemistry, St Andrews, 2019). Random Forest. A Machine Learning Method. Random Forest. A Machine Learning Method. This is a decision tree. Random Forest. A decision tree is like a flow chart. Random Forest. A Machine Learning Method.
E N D
Random Forest Dr John Mitchell (Chemistry, St Andrews, 2019)
Random Forest A Machine Learning Method
Random Forest A Machine Learning Method This is a decision tree.
Random Forest A decision tree is like a flow chart
Random Forest A Machine Learning Method Let’s visualise the decision tree ...
Random Forest A Machine Learning Method ... as a flow chart.
Random Forest A Machine Learning Method In detail, it looks like this. ... as a flow chart.
Random Forest A Machine Learning Method I came across Random Forest in the context of its application to chemical problems, that is chemoinformatics (or cheminformatics, the variant spellings are equivalent).
Encoding structure as features Mapping features to property
I refer to the entities about which predictions are to be made as items. In the context of chemistry, they are usually molecules. Each row of this matrix represents an item.
Each item is actually encoded by its descriptors. The terms feature and descriptor are synonymous. Each column of the matrix contains the values of one descriptor for each of the different items.
And each row of the matrix contains all the descriptors for one item.
The thing being predicted for each item is the property (output property). In this picture, it’s aqueous solubility. Mapping features to property
Classification is when the possible outputs of the prediction are discrete classes, so we are trying to put items into the correct pigeon hole, like [TRUE or FALSE] , or like [RED,GREEN,or BLUE]
Regressionis when the possible outputs of the prediction are continuous numerical values, so we are trying to predict as accurately as possible. We normally measure this with the root mean squared error, and also look at the correlation coefficient.
Random Forest A Machine Learning Method A single decision tree can indeed make a decent classifier, but there’s an easy way to improve upon this …
Wisdom of Crowds Francis Galton (1907) described a competition at a country fair, where participants were asked to estimate the mass of a cow. Individual entries were not particularly reliable, but Galton realised that by combining these guesses a much more reliable estimate could be obtained.
Wisdom of Crowds Guess the mass of the cow: Median of individual guesses is a good estimator: Francis Galton, Voxpopuli, Nature, 75, 450-451 (1907).
Wisdom of Crowds This is an ensemble predictor which works by aggregating individual independent estimates, and generates a result that is more reliable than the individual guesses and more accurate than the large majority of them.
Random Forest A Machine Learning Method Rather than having just one decision tree, we use lots of them to make a forest.
Random Forest • Multiple trees are only useful if not identical! • So make them randomly different.
Random Forest • So we randomly choose the data items for each tree; • Unlike a cup draw, it’s done with replacement; • We choose N items out of N for each tree, but an item can be repeated; • The resulting set of N non-unique items is known as a bootstrap sample.
Random Forest • This kind of with replacement selection gives what’s known as a bootstrap sample. • On average, e-1 or ~37% of items are not sampled by a given single tree. • These form the “out of bag” set for that tree. • The “out of bag” data are useful for validation.
Random Forest • We also randomly choose questions to ask of the data. • At each node, this is based on a fresh random sample of mtryof the descriptors. • The descriptor used at each split is selected so as to optimise splitting or minimise training error, given that the split values (e.g. MW > 315) have already been optimised for each of the mtryavailable descriptors.
LogP <=3.05 >3.05 SMR VSA <=7.4 >7.4 <=255 >255 Node 1 MW Node 2 nHbond <=315 >315 <=3 >3 Node 3 Node 4 Node 5 Node 6 Random Forest Here’s an example of such a tree.
LogP <=3.05 >3.05 SMR VSA <=7.4 >7.4 <=255 >255 Node 1 MW Node 2 nHbond <=315 >315 <=3 >3 Node 3 Node 4 Node 5 Node 6 Random Forest Here’s an example of such a tree. Question at Node 1: Is the molecular weight > 315? If true go to Node 4; if false go to Node 3.
RandomForest • The building of the decision trees is the training phase of the Random Forest algorithm. • Once the trees are built, the query items are passed through each decision tree. • Which node they end up at depends on their descriptor values. • This node determines the tree’s individual prediction.
A B D E C John Mitchell, Machine learning methods in chemoinformatics, WIREs Comput. Mol. Sci., 4, 468-481 (2014)
Random Forest: Consensus • For a classification problem, the trees vote for the class to assign the object to.
Random Forest: Consensus • For a regression problem, the trees each predict a numerical value, and these are averaged.
Random Forest So let’s summarise what we’ve said about Random Forest.
Random Forest • Introduced by Leo Breimanand Adele Cutler (early 2000s) • Development of Decision Trees (Recursive Partitioning): Random Forest can be used for either classification or regression, in the latter case the trees are regression trees. Leo Breiman, Random Forests, Machine Learning, 45, 5-32 (2001) Vladimir Svetnik, Andy Liaw, et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling, J ChemInfComputSci, 43, 1947-1958 (2003)
Random Forest • Introduced by Breimanand Cutler (2001) • Development of Decision Trees (Recursive Partitioning): • Dataset is partitioned into consecutively smaller subsets (of similar property value) • Each partition is based upon the value of one descriptor • The descriptor used at each split is selected so as to minimise the error • Tree is not pruned. Leo Breiman, Random Forests, Machine Learning, 45, 5-32 (2001) Vladimir Svetnik, Andy Liaw, et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling, J ChemInfComputSci, 43, 1947-1958 (2003)
... ntree=500 Random Forest (Classification) • Coupled Ensemble of Decision Trees • Each tree is trained: • from a bootstrap sample of the data • in situ out-of-bag cross-validation • without pruning back; • for classification typically nodesize=1 • from subset of descriptors at each split; • Advantages: • improved accuracy • method for descriptor selection • no overfitting • easy to train • human interpretable • not a black box for classification typically mtry= SQRT(no. of descriptors)
LogP <=3.05 >3.05 SMR VSA <=7.4 >7.4 <=255 >255 ... Node 1 MW Node 2 nHbond ntree=500 <=315 >315 <=3 >3 Node 3 Node 4 Node 5 Node 6 Random Forest (Regression) • Coupled Ensemble of Regression Trees • Each tree is trained: • from a bootstrap sample of the data • in situ out-of-bag cross-validation • without pruning back; • for regression typically nodesize=5 • from subset of descriptors at each split; • mtry=(no. of descriptors)/3 • Advantages: • improved accuracy • method for descriptor selection • no overfitting • easy to train • human interpretable • not a black box for regression typically
Random Forest (Summary) • Random Forest is a collection of Decision Trees grown with the CART algorithm. • Standard Parameters: • Needs a moderately large number of trees. I’d suggest at least 100; generally 500 trees is plenty. • No pruning back: Minimum node size > 5 (for regression) • mtrydescriptors tried at each split • Can quantify descriptor importance: • Incorporates descriptor selection • Incorporates “Out-of-bag” validation
Random Forest (variants) Bagging If we allow each split to use any of the available descriptors, rather than a randomly chosen subset, then Random Forest is equivalent to Bagging.
Random Forest (variants) ExtraTrees The ExtraTrees variant (“extremely randomized trees”) uses all N items for each tree. It also chooses possible splits using a given descriptor at a node randomly (RF in contrast makes them as good as possible); ExtraTreedoes however then choose the best descriptor to carry out the split with.
Which would you Prefer ... or ? Solubility in water (and other biological fluids) is highly desirable for pharmaceuticals!
Solubility is an important issue in drug discovery and a major cause of failure of drug development projects Expensive for the pharma industry Patients suffer lack of available treatments A good computational model for predicting the solubility of druglike molecules would be very valuable.
Solubility You might think that “How much solid compound dissolves in 1 litre of water” is a simple question to answer. However, experiments are prone to large errors. Solution takes time to reach equilibrium, and results depend on pH, temperature, ionic strength, solid form, impurities etc.
Humankind vs The Machines Sam Boobier, Anne Osborn & John Mitchell, Can human experts predict solubility better than computers? J Cheminformatics, 9:63 (2017) Image: scmp.com
Humankind vs The Machines Challenge is to predict solubilities of 25 molecules given 75 as training data.
Humankind vs The Machines Sent 229 emailed invitations to subject experts and students. Obtained 22 anonymous responses, of those 17 made full sets of predictions.
Humankind vs The Machines 10 machine learning algorithms were given the same training & test sets as the human panel.