Learn about Classification and Regression Trees (CART) and Random Forests, two powerful techniques for predictive modeling. Understand how CART works through recursive partitioning and decision trees, and how Random Forests reduce overfitting by using an ensemble of trees. Explore tuning parameters and the benefits of each method. Gain insight into predicting with nested data and how to avoid biases in variable importance.
CART (classification and regression trees) & Random forests. Partly based on the Statistical Learning course by Trevor Hastie and Rob Tibshirani.
CART • Classification and regression trees: recursive partitioning / decision trees (Leo Breiman & Jerome Friedman)
CART • Regression: splits are found by picking the predictor and accompanying split point that minimizes the RSS (residual sum of squares); the tree is grown top-down • Classification: splits are chosen to minimize an impurity measure: • Gini index (a measure of variance across the classes) • Cross-entropy (see the formulas below)
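The formulas for these two impurity measures appear to have been dropped in the export; the standard definitions (assuming the usual notation, where \hat{p}_{mk} is the proportion of class-k training observations in node m) are:

```latex
% Gini index: total variance across the K classes in node m
G_m = \sum_{k=1}^{K} \hat{p}_{mk}\,\bigl(1 - \hat{p}_{mk}\bigr)

% Cross-entropy (deviance) of node m
D_m = -\sum_{k=1}^{K} \hat{p}_{mk}\,\log \hat{p}_{mk}
```

Both are small when each node is dominated by a single class, which is why they are preferred over raw misclassification error for growing the tree.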
CART • Predicting: a test observation is predicted by passing it down the tree, following the splits, and using the mean (regression) or majority vote (classification) of the training observations in the terminal node it lands in • To avoid overfitting (low bias but high variance), the tree needs to be pruned using cost-complexity pruning: a penalty is placed on the total number of terminal nodes, and cross-validation is used to find the optimal value of the penalty parameter alpha (this is preferred to simply growing smaller trees, because a good split may follow a split that does not look very informative); see the sketch below
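A minimal sketch of growing and pruning a tree with the rpart package (listed later in the deck). The data frame `dat`, outcome `y`, and test set `newdat` are hypothetical placeholders; rpart's complexity parameter `cp` plays the role of the penalty parameter alpha.

```r
library(rpart)

# Grow a regression tree (use method = "class" for classification)
fit <- rpart(y ~ ., data = dat, method = "anova")

# cptable stores the cross-validated error (xerror) for each value of cp;
# pick the cp with the lowest xerror and prune the tree back to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Predict by passing test observations down the pruned tree
pred <- predict(pruned, newdata = newdat)
```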
CART • Intuitive to interpret (appealing to applied researchers) • Does not rely on common assumptions such as multivariate normality or homogeneity of variance • Automatically detects nonlinear relationships & interactions • Prone to overfitting • Does not predict as well as other common (machine learning) methods
Random forests • Instead of growing a single tree, grow an ensemble of trees • To reduce overfitting and improve prediction • Cost: loss of interpretability
Random forests • 2 tricks: • Every tree is grown on a bootstrap sample of the data (containing roughly 2/3 of the unique observations), also referred to as bagging • At every node only a random subset of m predictors is considered for partitioning (see the sketch below)
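A minimal sketch of both tricks with the randomForest package (also listed later in the deck), again assuming a hypothetical data frame `dat` with outcome `y`:

```r
library(randomForest)

# Each tree is grown on a bootstrap sample of dat (bagging);
# at each node only `mtry` randomly chosen predictors compete for the split
rf <- randomForest(y ~ ., data = dat,
                   ntree = 500,                          # number of trees
                   mtry  = floor(sqrt(ncol(dat) - 1)),   # m predictors per node
                   importance = TRUE)                    # keep importance measures
```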
Random forests • Additional benefits: • Because of point 1, we get 'test errors' for free: out-of-bag (OOB) error estimates • Because of point 2, we obtain an indication of variable importance (see the sketch below)
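Continuing the hypothetical `rf` object from the previous sketch, both by-products can be read straight off the fitted forest:

```r
# OOB error of the full forest (classification); for regression use rf$mse instead
rf$err.rate[rf$ntree, "OOB"]

# Variable importance (permutation and node-impurity based measures)
importance(rf)
varImpPlot(rf)   # quick visual overview
```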
Random forests • Tuning parameters: the number of trees to be grown and m, the number of predictors to be considered at each node (common defaults: √p for classification and p/3 for regression) • Use cross-validation to determine m (as in the caret sketch below)
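A sketch of tuning m (mtry) by cross-validation with the caret package; the data frame `dat`, outcome `y`, and the candidate grid are hypothetical:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = c(2, 4, 8, 16))        # candidate values for m

rf_cv <- train(y ~ ., data = dat, method = "rf",
               trControl = ctrl, tuneGrid = grid, ntree = 500)

rf_cv$bestTune   # cross-validated choice of mtry
```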
R packages • rpart • tree • randomForest • caret
Nested data and trees • Prediction itself is OK • With nested (multilevel) data, the trees in a random forest become highly correlated, leading to biased OOB error estimates and biased variable importance • CART is also likely to prefer variables with more distinct values, biasing variable selection towards level-1 variables