Boosting and Additive Trees (2) Yi Zhang, Kevyn Collins-Thompson Advanced Statistical Seminar 11-745 Oct 29, 2002
Recap: Boosting (1) • Background: Ensemble Learning • Boosting Definitions, Example • AdaBoost • Boosting as an Additive Model • Boosting Practical Issues • Exponential Loss • Other Loss Functions • Boosting Trees • Boosting as Entropy Projection • Data Mining Methods
Outline for This Class • Find the solution based on numerical optimization • Control the model complexity and avoid overfitting • Right-sized trees for boosting • Number of iterations • Regularization • Understand the final model (interpretation) • Single variable • Correlation of variables
Numerical Optimization • Goal: find the f that minimizes the loss function over the training data • Gradient descent: search in the unconstrained function space to minimize the loss on the training data • The loss on the training data converges to zero
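As a sketch (standard functional gradient descent over the N training-point values of f, in the chapter's notation):

```latex
g_{im} \;=\; \left[ \frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)} \right]_{f = f_{m-1}},
\qquad
f_m(x_i) \;=\; f_{m-1}(x_i) \;-\; \rho_m\, g_{im},
\qquad
\rho_m \;=\; \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i,\, f_{m-1}(x_i) - \rho\, g_{im}\big)
```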
Gradient Search on a Constrained Function Space: Gradient Tree Boosting • Introduce a tree at the m-th iteration whose predictions tm are as close as possible to the negative gradient • Advantage over unconstrained gradient search: more robust, less prone to overfitting
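A minimal runnable sketch of this idea for squared-error loss (scikit-learn is assumed to be available; the function and variable names are illustrative, not from the book):

```python
# Minimal sketch of gradient tree boosting for squared-error loss:
# at each step, fit a small tree to the negative gradient (here the residuals)
# and add a shrunken version of it to the current fit.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, n_iter=100, max_leaf_nodes=8, learning_rate=0.1):
    """Fit f(x) = f0 + nu * sum_m t_m(x) by functional gradient descent."""
    f0 = float(np.mean(y))
    f = np.full(len(y), f0)                 # f_0: best constant fit
    trees = []
    for m in range(n_iter):
        neg_grad = y - f                    # negative gradient of 1/2 (y - f)^2
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, neg_grad)               # t_m: tree closest to the negative gradient
        f = f + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, learning_rate=0.1):
    """Evaluate the additive expansion on new data X."""
    return f0 + learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```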
View Boosting as a Linear Model • Basis expansion: use basis functions Tm (m = 1..M, each Tm a weak learner) to transform the input vector X into T-space, then fit a linear model in this new space • Special for boosting: the choice of basis function Tm depends on T1, …, Tm-1
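In symbols (a restatement of this view; the forward-stagewise choice of each new basis function is what distinguishes boosting from an ordinary basis expansion):

```latex
f(x) \;=\; \sum_{m=1}^{M} \beta_m\, T_m(x),
\qquad
f_m(x) \;=\; f_{m-1}(x) + \beta_m\, T_m(x)
\;\;\text{with } T_m \text{ chosen given } T_1, \dots, T_{m-1}
```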
Recap: Linear Models in Chapter 3 • Bias-variance trade-off • Subset selection (feature selection, discrete) • Coefficient shrinkage (smoothing: ridge, lasso) • Using derived input directions (PCA, PLS) • Multiple-outcome shrinkage and selection: exploit correlations among the outcomes. This Chapter: Improve Boosting as a Linear Model • Size of the constituent trees J • Number of boosting iterations M (subset selection) • Regularization (shrinkage)
Right-Sized Trees for Boosting • The best tree for one step is not the best in the long run • Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation • Simple approach: restrict all trees to the same size J • J limits the interaction level among input features of the tree-based approximation • In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well
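Illustration only (not from the slides): fixing every tree to the same size J in scikit-learn's gradient boosting; the parameter names belong to that library and the particular values are just examples in the 4..8 range:

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    max_leaf_nodes=6,    # J: terminal nodes per tree, caps the interaction order
    n_estimators=500,    # M: number of boosting iterations
    learning_rate=0.1,   # nu: shrinkage factor (next slides)
)
```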
Number of Boosting Iterations (Subset Selection) • Boosting will overfit as M → ∞ • Use a validation set (sketch below) • Other methods … (later)
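A sketch of picking M where the held-out loss bottoms out; it assumes the fitted scikit-learn model above plus validation arrays X_val and y_val, all of which are illustrative names:

```python
import numpy as np

# staged_predict yields predictions after 1, 2, ..., M trees,
# so the validation loss can be traced over boosting iterations.
val_mse = [np.mean((y_val - pred) ** 2)
           for pred in model.staged_predict(X_val)]
best_M = int(np.argmin(val_mse)) + 1   # number of iterations to keep
```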
Shrinkage • Scale the contribution of each tree by a factor ν, 0 < ν < 1, to control the learning rate • Both ν and M control the prediction risk on the training data, and they do not operate independently • Smaller ν requires larger M
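Written out (the shrunken tree-boosting update in the chapter's notation, where R_jm are the regions of the m-th tree and γ_jm its leaf values):

```latex
f_m(x) \;=\; f_{m-1}(x) \;+\; \nu \sum_{j=1}^{J} \gamma_{jm}\, I\big(x \in R_{jm}\big),
\qquad 0 < \nu < 1
```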
Penalized Regression • Ridge regression or Lasso regression
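As a reminder of the setup (a sketch using the chapter's view of the expansion over all possible trees Tk; the lasso penalty is shown, and the ridge version replaces |αk| with αk²):

```latex
\min_{\{\alpha_k\}} \;\; \sum_{i=1}^{N} \Big( y_i - \sum_{k} \alpha_k\, T_k(x_i) \Big)^{2}
\;+\; \lambda \sum_{k} \lvert \alpha_k \rvert
```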
If each coefficient path |αk(λ)| is monotone in λ, we have Σk |αk| = νM, and the solution of Algorithm 4 is identical to the lasso regression result described on page 64: the boosting pair (ν, M) plays the role of the lasso bound t.
More about Algorithm 4 • Algorithm 4 = Algorithm 3 + shrinkage • L1 norm vs. L2 norm: more details later, in Chapter 12 after learning SVMs
Interpretation: Understanding the Final Model • Single decision trees are easy to interpret • A linear combination of trees is difficult to understand • Which features are important? • What are the interactions between features?
Relative Importance of Individual Variables • For a single tree, define the importance of xℓ from the improvements at the internal nodes that split on xℓ • For an additive tree expansion, average the single-tree importance over the M trees (formulas below) • For K-class classification, treat it as K two-class classification tasks
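The importance measures being referred to, written out (following the chapter: v(t) is the variable split at internal node t of tree T, and \hat{i}_t^2 the resulting improvement in squared error):

```latex
\mathcal{I}^{2}_{\ell}(T) \;=\; \sum_{t=1}^{J-1} \hat{\imath}^{\,2}_{t}\; I\big(v(t) = \ell\big)
\qquad\qquad
\mathcal{I}^{2}_{\ell} \;=\; \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}^{2}_{\ell}(T_m)
```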
Partial Dependence Plots • Visualize the dependence of the approximation f(x) on the joint values of important features • Usually the size of the subset is small (1-3) • Define the average (partial) dependence of f on the chosen subset • It can be estimated empirically using the training data (see below)
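Written out (partial dependence of f on a chosen subset X_S with complement X_C, followed by its empirical estimate over the N training cases):

```latex
f_S(X_S) \;=\; \mathrm{E}_{X_C}\, f(X_S, X_C)
\qquad\qquad
\bar{f}_S(X_S) \;=\; \frac{1}{N} \sum_{i=1}^{N} f(X_S,\, x_{iC})
```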
10.50 vs. 10.52 • They are the same if the predictor variables are independent • Why use 10.50 instead of 10.52 to measure partial dependence? • Example 1: f(X) = h1(XS) + h2(XC) • Example 2: f(X) = h1(XS) · h2(XC)
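Working the two examples through the partial-dependence definition above shows why the marginal average (10.50) is preferred: it recovers the effect of X_S up to an additive or multiplicative constant, whereas the conditional expectation (10.52) also absorbs how X_C co-varies with X_S:

```latex
f(X) = h_1(X_S) + h_2(X_C) \;\;\Rightarrow\;\; f_S(X_S) = h_1(X_S) + \mathrm{E}\,h_2(X_C)
\quad (\text{additive constant})
```
```latex
f(X) = h_1(X_S)\, h_2(X_C) \;\;\Rightarrow\;\; f_S(X_S) = h_1(X_S)\, \mathrm{E}\,h_2(X_C)
\quad (\text{multiplicative constant})
```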
Conclusion • Find the solution based on numerical optimization • Control the model complexity and avoid overfitting • Right-sized trees for boosting • Number of iterations • Regularization • Understand the final model (interpretation) • Single variable • Correlation of variables