Lecture 10. Trees and Boosting Instructed by Jinzhu Jia
Outline • Tree-based methods • CART • MARS • Boosting
Tree-based methods • Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one.
Regression Trees • P inputs and a response: (x_i, y_i), i = 1, 2, ..., N • Goal: automatically decide on the splitting variables and split points, and also the topology (shape) of the tree • Algorithm: suppose first that we know the partition R_1, R_2, ..., R_M
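Given the partition, the standard CART regression fit is a constant in each region; a minimal LaTeX statement of this setup, assuming squared-error loss:

```latex
% Regression tree model for a known partition R_1, ..., R_M
f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m),
\qquad
\hat{c}_m = \operatorname{ave}\left( y_i \mid x_i \in R_m \right)
```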
Regression Trees • Greedy procedure: at each step, choose the splitting variable and split point that give the largest decrease in squared error (see the sketch below)
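A minimal sketch of one greedy split search, assuming squared-error loss and NumPy; the function name best_split and its return convention are illustrative, not from the lecture.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the (variable j, split point s) pair that
    minimizes the total squared error of region-wise constant fits."""
    n, p = X.shape
    best = (None, None, np.inf)          # (j, s, loss)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # Each half of the split is fit with its mean response.
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best  # split the node on X[:, j] <= s
```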
Tree size • A large tree might over-fit the data • A small tree might not capture the important structure • Tree size is a tuning parameter • Stop the splitting process only when some minimum node size (say 5) is reached • Then this large tree is pruned using cost-complexity pruning
Pruning • Define a subtree of the large tree to be any tree that can be obtained by pruning it, that is, by collapsing any number of its internal nodes. • Index terminal nodes by m; |T| denotes the number of terminal nodes in T.
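A sketch of the standard cost-complexity criterion used for pruning, where N_m is the number of observations in terminal node m, Q_m(T) is that node's mean squared error, and alpha trades off fit against tree size:

```latex
% Cost-complexity criterion: alpha >= 0 penalizes the number of terminal nodes |T|
C_\alpha(T) = \sum_{m=1}^{|T|} N_m \, Q_m(T) + \alpha \, |T|,
\qquad
Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} \left( y_i - \hat{c}_m \right)^2
```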
Classification Trees • The target variable is a classification outcome taking values 1, 2, ..., K • The only changes needed are the criteria for splitting nodes and pruning the tree. • Measures of node impurity: misclassification error, the Gini index, and cross-entropy (see below)
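The three standard impurity measures, written with p_hat_{mk} denoting the proportion of class-k observations in node m:

```latex
% Node impurity measures for classification trees
\text{Misclassification error: } 1 - \hat{p}_{m k(m)}, \quad k(m) = \arg\max_k \hat{p}_{mk} \\
\text{Gini index: } \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right) \\
\text{Cross-entropy (deviance): } -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
```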
MARS • Multivariate Adaptive Regression Splines • Well suited for high-dimensional problems • A generalization of stepwise linear regression • It uses expansions in piecewise linear basis functions of the form (x - t)_+ and (t - x)_+
MARS • A reflected pair: the two functions (x - t)_+ = max(x - t, 0) and (t - x)_+ = max(t - x, 0), each piecewise linear with a knot at the value t
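A minimal sketch of the reflected-pair (hinge) basis in NumPy; the function name reflected_pair and the example knot t = 0.5 are illustrative.

```python
import numpy as np

def reflected_pair(x, t):
    """Return the MARS reflected pair (x - t)_+ and (t - x)_+ with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

# Example: a pair of piecewise linear basis functions with a knot at t = 0.5
x = np.linspace(0.0, 1.0, 5)
h_plus, h_minus = reflected_pair(x, 0.5)
```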
MARS • The model has the form f(X) = beta_0 + sum_{m=1}^{M} beta_m h_m(X), where each h_m(X) is a function in C (the collection of reflected pairs), or a product of two or more such functions. • Adaptive way to add basis functions: at each stage, consider as candidates all products of a term already in the model with a new reflected pair, and add the candidate that reduces the residual error the most
MARS • The size of the model (the number of terms) matters • M(lambda) is the effective number of parameters in the model: this accounts both for the number of terms and for the number of parameters used in selecting the positions of the knots
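Model size is chosen by generalized cross-validation; a sketch of the standard criterion, where f_hat_lambda is the fitted model with lambda terms, M(lambda) is its effective number of parameters, and N is the sample size:

```latex
% Generalized cross-validation for choosing the MARS model size
\mathrm{GCV}(\lambda) =
\frac{\sum_{i=1}^{N} \left( y_i - \hat{f}_\lambda(x_i) \right)^2}
     {\left( 1 - M(\lambda)/N \right)^2}
```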
Boosting • Originally designed for classification problems • Can be extended to regression problems • Motivation: combine the outputs of many weak classifiers to produce a powerful “committee”
Adaboost • Consider a two-class problem with output Y in {-1, +1} • G(X) is the classifier • Error rate on the training sample: err = (1/N) sum_{i=1}^{N} I(y_i != G(x_i)) • Weak classifier: one whose error rate is only slightly better than random guessing
Adaboost • Boosting sequentially applies the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers G_m(x), m = 1, 2, ..., M
Adaboost • The AdaBoost.M1 algorithm (Freund and Schapire, 1997); see the sketch below
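A minimal sketch of AdaBoost.M1 with decision stumps, assuming scikit-learn is available; the function name adaboost_m1, the number of rounds M, and the error clipping are illustrative choices, not part of the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """AdaBoost.M1 with stumps; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                         # observation weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = (pred != y)
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)             # classifier weight
        w *= np.exp(alpha * miss)                   # up-weight misclassified points
        stumps.append(stump)
        alphas.append(alpha)

    def G(X_new):
        # Weighted majority vote of the weak classifiers
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)
    return G
```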
Examples • 2,000 training observations and 10,000 test observations • The weak classifier is just a stump: a two-terminal-node classification tree • Error rate: 48.5%
Boosting Fits an Additive Model • Boosting is a way of fitting an additive expansion in a set of elementary “basis” functions • For AdaBoost, the basis functions are the weak classifiers G_m(x) • More generally, an additive model has the form f(x) = sum_{m=1}^{M} beta_m b(x; gamma_m)
Exponential Loss and AdaBoost • AdaBoost is equivalent to forward stagewise additive modeling using the exponential loss function L(y, f(x)) = exp(-y f(x)) • Forward step: (beta_m, G_m) = argmin_{beta, G} sum_{i=1}^{N} exp[-y_i (f_{m-1}(x_i) + beta G(x_i))]
Exp Loss and AdaBoost • Derivation in steps 2(a)–2(d)
Why Exp Loss? • Computational reasons • It leads to the simple reweighting scheme of AdaBoost • Question: what does AdaBoost estimate? • Modeling: what is the population minimizer of the exponential loss? (see below)
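The standard answer to the question on this slide: the population minimizer of the exponential loss is one-half the log-odds, so AdaBoost estimates (half) the log-odds of class membership.

```latex
% Population minimizer of the exponential loss
f^*(x) = \arg\min_{f(x)} \mathrm{E}_{Y \mid x}\!\left[ e^{-Y f(x)} \right]
       = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}
```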
Loss Functions and Robustness • For regression: Huber loss
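A sketch of the Huber loss for regression, quadratic for small residuals and linear for large ones (the threshold delta controls where the transition happens):

```latex
% Huber loss: robust to outliers in the response
L(y, f(x)) =
\begin{cases}
\left[ y - f(x) \right]^2, & |y - f(x)| \le \delta, \\
2\delta \, |y - f(x)| - \delta^2, & \text{otherwise.}
\end{cases}
```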
Boosting for Regression • Iteratively fit the residuals (see the sketch below)
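A minimal sketch of least-squares boosting with regression stumps, assuming scikit-learn; the function name ls_boost and the shrinkage value are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, y, M=100, shrinkage=0.1):
    """Least-squares boosting: each tree is fit to the current residuals."""
    f = np.zeros(len(y))                  # start from the zero function
    trees = []
    for _ in range(M):
        residuals = y - f
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, residuals)            # fit a stump to the residuals
        f += shrinkage * tree.predict(X)  # take a small step toward the residual fit
        trees.append(tree)

    def predict(X_new):
        return shrinkage * sum(t.predict(X_new) for t in trees)
    return predict
```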
Exercise (Not Homework) • 1. Reproduce Figure 10.2