CN700: HST 10.6-10.13 Neil Weisenfeld (notes were recycled and modified from Prof. Cohen and an unnamed student) April 12, 2005
Robust Loss Functions for Classification • Loss functions that lead to simple boosting solutions (squared error and exponential) are not always the most robust. • In classification (with a -1/1 response), the margin y·f(x) plays the role that the residual y - f(x) plays in regression. • The margin is negative for incorrect classifications (y and f(x) have opposite signs: -1·1 or 1·(-1)) and positive for correct ones. • Loss criteria should therefore penalize negative margins more heavily than positive ones.
Loss functions for 2-class classification • Exponential loss and binomial deviance are monotone, continuous approximations to misclassification loss. • Exponential loss penalizes large negative margins much more heavily, while the deviance spreads its penalty more evenly. • Binomial deviance is more robust in noisy situations where the Bayes error rate is not close to zero. • Squared error is a poor choice when classification is the goal: it increases again for margins greater than 1, penalizing observations that are classified correctly with high confidence. A numeric comparison follows below.
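A minimal numeric sketch (not from the slides) comparing these criteria as functions of the margin m = y·f(x); the binomial deviance is written in the log(1 + e^(-2m)) parameterization, and the function names are invented for illustration:

```python
import numpy as np

def classification_losses(margin):
    """Common 2-class losses as functions of the margin m = y*f(x), y in {-1, +1}."""
    misclassification = (margin < 0).astype(float)            # 0/1 loss
    exponential       = np.exp(-margin)                       # AdaBoost criterion
    binomial_deviance = np.log(1.0 + np.exp(-2.0 * margin))   # one common parameterization
    squared_error     = (1.0 - margin) ** 2                   # (y - f)^2 = (1 - y*f)^2 since y^2 = 1
    return misclassification, exponential, binomial_deviance, squared_error

margins = np.linspace(-2.0, 2.0, 9)
for name, loss in zip(["misclass", "exponential", "deviance", "squared"],
                      classification_losses(margins)):
    print(f"{name:12s}", np.round(loss, 2))
```

Note how the squared-error values start rising again for margins beyond 1, which is the point made in the last bullet.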
Loss functions for K classes • Bayes classifier: G(x) = argmax_k Pr(G = k | X = x). • If we are not just interested in the assignment, the class probabilities themselves are of interest. The logistic model generalizes to K classes: p_k(x) = exp(f_k(x)) / Σ_{l=1..K} exp(f_l(x)). • Binomial deviance extends to the K-class multinomial deviance loss: L(y, p(x)) = -Σ_{k=1..K} I(y = k) log p_k(x).
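A small sketch (mine, not from the slides) of the K-class logistic transform and the multinomial deviance; labels are assumed to be coded 0..K-1:

```python
import numpy as np

def softmax_probs(f):
    """Multiple logistic transform: p_k(x) = exp(f_k) / sum_l exp(f_l)."""
    f = f - f.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(f)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_deviance(y, f):
    """K-class multinomial deviance -sum_k I(y = k) log p_k(x), averaged over samples."""
    p = softmax_probs(f)
    n = len(y)
    return -np.mean(np.log(p[np.arange(n), y]))

# toy example: 3 samples, K = 3 classes
f = np.array([[ 2.0, 0.5, -1.0],
              [ 0.0, 1.5,  0.2],
              [-0.5, 0.0,  2.0]])
y = np.array([0, 1, 2])
print(multinomial_deviance(y, f))
```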
Robust loss functions for regression • Squared error loss penalizes large absolute residuals |y - f(x)| too heavily and is therefore not robust. • Absolute error is a better choice. • Huber loss (below) deals well with outliers and is nearly as efficient as least squares for Gaussian errors: L(y, f(x)) = [y - f(x)]^2 for |y - f(x)| ≤ δ, and 2δ|y - f(x)| - δ^2 otherwise.
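A one-function sketch of the Huber criterion in the form just quoted (the function name and the default δ are mine):

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    """Huber loss: squared error for small residuals, linear beyond delta
    (continuous at |y - f| = delta)."""
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2.0 * delta * r - delta ** 2)

print(np.round(huber_loss(np.linspace(-3, 3, 7), 0.0, delta=1.0), 2))
```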
Boosting Trees • Trees partition the space of joint predictor variable values into disjoint regions R_j, j = 1, …, J, with a constant γ_j assigned to each region: x ∈ R_j implies f(x) = γ_j. • A tree can be expressed formally as T(x; Θ) = Σ_{j=1..J} γ_j I(x ∈ R_j), with parameters Θ = {R_j, γ_j}, j = 1, …, J. • The parameters are found by minimizing the empirical risk over Θ: min_Θ Σ_{j=1..J} Σ_{x_i ∈ R_j} L(y_i, γ_j).
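A toy evaluation of T(x; Θ) = Σ_j γ_j I(x ∈ R_j), with the regions represented (purely for illustration) as axis-aligned boxes rather than the split structure a real tree would store:

```python
import numpy as np

def tree_predict(x, regions, gammas):
    """Evaluate T(x; Theta) = sum_j gamma_j * I(x in R_j) for disjoint regions.

    Each region is given here as (lower, upper) bounds per coordinate; an actual
    tree encodes the same regions implicitly through its splits."""
    for (lower, upper), gamma in zip(regions, gammas):
        if np.all(x >= lower) and np.all(x < upper):
            return gamma
    raise ValueError("regions should partition the predictor space")

# toy partition of the real line into J = 3 regions with constants gamma_j
regions = [(np.array([-np.inf]), np.array([0.0])),
           (np.array([0.0]),     np.array([1.0])),
           (np.array([1.0]),     np.array([np.inf]))]
gammas  = [-1.2, 0.3, 2.0]
print(tree_predict(np.array([0.5]), regions, gammas))   # falls in the middle region
```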
Boosting Trees • Finding these parameters is a formidable combinatorial optimization problem, so it is divided into two parts: • Finding γ_j given R_j: typically trivial (e.g. the region mean, or majority class). • Finding the regions R_j: use a greedy, top-down recursive partitioning algorithm (e.g. the Gini index as a surrogate for misclassification loss when growing the tree). • The boosted tree model is a sum of such trees, induced by a forward, stagewise algorithm: f_M(x) = Σ_{m=1..M} T(x; Θ_m).
Boosting Trees • The boosted tree model is "induced in a forward, stagewise manner." • At each step one must solve min_{Θ_m} Σ_{i=1..N} L(y_i, f_{m-1}(x_i) + T(x_i; Θ_m)). • Given the regions R_jm at each step, the optimal constants are γ_jm = argmin_γ Σ_{x_i ∈ R_jm} L(y_i, f_{m-1}(x_i) + γ).
Boosting Trees • For squared error loss this is no harder than for a single tree: at each stage, fit the tree that best predicts the current residuals y_i - f_{m-1}(x_i), as sketched below. • For two-class classification with exponential loss, this gives AdaBoost. • Absolute error or Huber loss for regression, and deviance for classification, would make for robust trees, but they do not lead to simple boosting algorithms.
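A minimal sketch of the squared-error case, using scikit-learn's DecisionTreeRegressor as the base learner (my choice; the slides are library-agnostic):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_squared_error(X, y, n_trees=50, max_leaf_nodes=4):
    """Forward stagewise boosting with squared-error loss:
    each new tree is fit to the current residuals y - f_{m-1}(x)."""
    f = np.full(len(y), y.mean())                        # initial constant fit
    trees = []
    for _ in range(n_trees):
        residuals = y - f                                # current residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)
        f += tree.predict(X)                             # stagewise update
        trees.append(tree)
    return y.mean(), trees

# toy usage on a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
intercept, trees = boost_squared_error(X, y)
```

Predictions on new data would be the initial constant plus the sum of the fitted trees' predictions.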
Boosting Trees: Numerical Optimization A variety of numerical optimization techniques exist for minimizing L(f) = Σ_{i=1..N} L(y_i, f(x_i)) over f. They all work iteratively: start from an initial guess for the function and successively add increments to it, each increment computed on the basis of the function obtained at the previous iteration.
Boosting Trees: Steepest Descent • Move down the gradient of L(f): f_m = f_{m-1} - ρ_m g_m, where the gradient components are g_im = ∂L(y_i, f(x_i)) / ∂f(x_i) evaluated at f = f_{m-1}, and ρ_m is the step length. • Very greedy => can get stuck in local minima. • Unconstrained => can be applied to any loss, as long as the gradient can be calculated.
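A bare-bones sketch of steepest descent in "function space", where the free parameters are just the fitted values at the N training points; a fixed step size ρ is used here instead of a line search (my simplification):

```python
import numpy as np

def steepest_descent_in_function_space(y, loss_grad, n_steps=100, rho=0.1):
    """Unconstrained steepest descent on L(f) = sum_i L(y_i, f(x_i)).

    The 'parameters' are the fitted values f(x_i) at the N training points,
    so the gradient is taken componentwise: g_i = dL(y_i, f_i)/df_i."""
    f = np.zeros_like(y, dtype=float)
    for _ in range(n_steps):
        g = loss_grad(y, f)        # gradient at the current fit
        f = f - rho * g            # step down the gradient
    return f

# squared-error example: dL/df = -(y - f), so the fit converges toward y itself
y = np.array([1.0, -2.0, 0.5])
print(steepest_descent_in_function_space(y, lambda y, f: -(y - f)))
```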
Not a Tree Algorithm, so… (notes from the Professor) • Calculate the gradient and then fit regression trees to it by least squares. • Advantage: no need to do a linear fit. • The gradient is taken only with respect to the function values at the training points, so it is as if the problem were one-dimensional.
Boosting Trees: Gradient Boosting • But the gradient is defined only at the training data points. One idea: grow boosted trees that approximate steps down the gradient. • Boosting is like gradient descent, except that each added tree is fit by least squares to the negative gradient evaluated at f_{m-1}, and hence only approximates the true gradient step. • Unlike the unconstrained gradient, each step is constrained to be a tree, which is what lets the model generalize to new x.
MART (Multiple Additive Regression Trees): Generic Gradient Tree Boosting Algorithm • 1. Initialize f_0(x) = argmin_γ Σ_{i=1..N} L(y_i, γ). • 2. For m = 1, …, M: • A) For i = 1, 2, …, N compute the pseudo-residuals r_im = -[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m-1}. • B) Fit a regression tree to the targets r_im, giving terminal regions R_jm, j = 1, …, J_m. • C) For j = 1, 2, …, J_m compute γ_jm = argmin_γ Σ_{x_i ∈ R_jm} L(y_i, f_{m-1}(x_i) + γ). • D) Update f_m(x) = f_{m-1}(x) + Σ_{j=1..J_m} γ_jm I(x ∈ R_jm). • 3. Output f̂(x) = f_M(x).
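A compact sketch of the generic algorithm, instantiated for absolute-error loss (my choice for concreteness): the pseudo-residuals are sign(y - f_{m-1}) and the optimal node constants are medians. scikit-learn's DecisionTreeRegressor stands in for the least-squares tree fit of step B:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_lad(X, y, M=100, n_leaves=4):
    """Gradient tree boosting sketched for absolute-error (LAD) loss."""
    f0 = np.median(y)                         # step 1: best constant under L1 loss
    f = np.full(len(y), f0)
    stages = []                               # list of (tree, {leaf_id: gamma})
    for _ in range(M):
        r = np.sign(y - f)                    # step 2A: pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)
        tree.fit(X, r)                        # step 2B: regions R_jm from a least-squares tree
        leaf = tree.apply(X)
        gammas = {j: np.median((y - f)[leaf == j]) for j in np.unique(leaf)}  # step 2C
        f += np.array([gammas[j] for j in leaf])                              # step 2D
        stages.append((tree, gammas))
    return f0, stages

def mart_predict(X, f0, stages):
    """Sum the stage contributions to evaluate f_M at new points."""
    f = np.full(len(X), f0)
    for tree, gammas in stages:
        leaf = tree.apply(X)
        f += np.array([gammas[j] for j in leaf])
    return f
```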
Right-Sized Trees for Boosting • At issue: for single-tree methods we grow deep trees and then prune them. How should tree size be chosen for these complex, multi-tree methods? • A simple strategy: give every tree the same size, expressed as the number of terminal nodes J. • The number of terminal nodes controls the degree of interaction among the coordinate variables that can be captured. • This follows from considering the ANOVA (analysis of variance) expansion of the "target" function: η(X) = Σ_j η_j(X_j) + Σ_{jk} η_jk(X_j, X_k) + Σ_{jkl} η_jkl(X_j, X_k, X_l) + … • This yields an approach to boosting trees in which the number of terminal nodes in each individual tree is set to J, where J - 1 is the largest degree of interaction we wish to capture about the data (see the sketch below).
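A quick illustration (not from the slides) of fixing the number of terminal nodes, using scikit-learn's GradientBoostingRegressor with its max_leaf_nodes parameter as a stand-in for J; the data set and settings are arbitrary:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Compare boosted models whose trees allow different interaction orders:
# J = 2 (stumps) fits a purely additive model, while J terminal nodes
# allow interactions of up to J - 1 variables.
X, y = make_friedman1(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for J in (2, 4, 8):
    model = GradientBoostingRegressor(max_leaf_nodes=J, n_estimators=400,
                                      learning_rate=0.1, random_state=0)
    model.fit(X_tr, y_tr)
    print(f"J = {J}: test R^2 = {model.score(X_te, y_te):.3f}")
```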
Effect of Interaction Order • Shows how the degree of interaction relates to test error in the simple example of Section 10.2. • The ideal there is J = 2 (stumps), so boosting models with J > 2 incur unnecessary variance. • Note that J, the number of terminal nodes per tree, is not the same as the "number of terms" (boosting iterations).
Regularization • Aside from J, the other meta-parameter of MART is M, the number of boosting iterations. • Continued iteration usually reduces training risk, but can lead to overfitting. • One strategy is to estimate M*, the ideal number of iterations, by monitoring prediction risk as a function of M on a validation sample. • Other regularization strategies follow…
Shrinkage • The idea of shrinkage is to scale the contribution of each tree by a factor ν between 0 and 1, so the MART update rule is replaced by f_m(x) = f_{m-1}(x) + ν · Σ_{j=1..J_m} γ_jm I(x ∈ R_jm). • There is a clear tradeoff between the shrinkage factor ν and M, the number of iterations. • Lower values of ν require more iterations and longer computation, but favor better test error. The best strategy seems to be to bite the bullet and set ν low (< 0.1).
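The modified update step as a tiny sketch; tree and gammas are assumed to be one fitted stage of the mart_lad sketch given earlier (my assumption, not anything specified by the slides):

```python
import numpy as np

def shrunken_update(f_prev, tree, gammas, X, nu=0.05):
    """Shrinkage version of the MART update:
    f_m(x) = f_{m-1}(x) + nu * sum_j gamma_jm * I(x in R_jm), with 0 < nu <= 1."""
    leaf = tree.apply(X)                                   # terminal region of each x
    return f_prev + nu * np.array([gammas[j] for j in leaf])
```

In packaged implementations this factor usually shows up as a learning-rate parameter (scikit-learn calls it learning_rate, for example).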
Shrinkage and Test Error • Again, the example of Section 10.2. • The effect is especially pronounced with the binomial deviance loss, but shrinkage always looks better. • HS&T have led us down the primrose path.
Penalized Regression • Take the set of all possible J-terminal-node trees realizable on the training data as basis functions T_k; the linear model is f(x) = Σ_{k=1..K} α_k T_k(x). • Penalized regression adds a penalty J(α) that discourages large numbers of (large) coefficients: min_α { Σ_{i=1..N} (y_i - Σ_k α_k T_k(x_i))^2 + λ · J(α) }.
Penalized Regression • The penalty can be, for example, ridge, J(α) = Σ_k α_k^2, or lasso, J(α) = Σ_k |α_k|. • However, direct implementation of this procedure is computationally infeasible, since it requires enumerating all possible J-terminal-node trees. Forward stagewise linear regression provides a close approximation to the lasso and is very similar to boosting and Algorithm 10.2.
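The criterion itself is easy to write down; a small sketch (names mine), with B standing for the matrix of basis-tree predictions that could not actually be enumerated in practice:

```python
import numpy as np

def penalized_loss(y, B, alpha, lam, penalty="lasso"):
    """Penalized least squares over a fixed basis:
    sum_i (y_i - sum_k alpha_k * B[i, k])^2 + lambda * J(alpha).

    B[i, k] would be the prediction of the k-th candidate J-terminal-node
    tree at x_i; enumerating all such trees is infeasible, which is why the
    text turns to forward stagewise fitting instead."""
    resid = y - B @ alpha
    if penalty == "lasso":
        J = np.sum(np.abs(alpha))        # lasso: sum_k |alpha_k|
    else:
        J = np.sum(alpha ** 2)           # ridge: sum_k alpha_k^2
    return np.sum(resid ** 2) + lam * J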
Forward Stagewise Linear Regression • Initialize: α_k = 0, k = 1, …, K, and choose a small step size ε > 0. • For m = 1 to M: find the basis function T_{k*} and coefficient β* that best fit the current residuals, and move α_{k*} a small step ε in that direction. • Output: f_M(x) = Σ_k α_k T_k(x). Increasing M acts like decreasing λ. Many coefficients will remain at zero; the others will tend to have absolute values smaller than their least squares estimates.
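A sketch of the incremental version, assuming the columns of the basis matrix B have been standardized; the step direction is taken from the sign of the correlation with the current residual (details are my assumptions, not the slides'):

```python
import numpy as np

def forward_stagewise(B, y, M=5000, eps=0.01):
    """Incremental forward stagewise linear regression over a basis matrix B.

    At each pass, find the basis function most correlated with the current
    residual and move its coefficient a small step eps in that direction.
    Increasing M plays the role of decreasing the lasso penalty lambda."""
    n, K = B.shape
    alpha = np.zeros(K)
    resid = y - y.mean()                     # work with a centered response
    for _ in range(M):
        corr = B.T @ resid                   # (unnormalized) correlations with residual
        k = np.argmax(np.abs(corr))          # best-matching basis function
        step = eps * np.sign(corr[k])        # small step in the chosen coordinate
        alpha[k] += step
        resid -= step * B[:, k]
    return alpha
```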
Lasso vs. Forward-Stagewise (but not on trees) • Just as a demonstration, try this out with the original predictor variables instead of trees, and compare the resulting coefficient paths to the lasso solutions.
Importance of Predictor Variables (IPV) • Find the variable that reduces the error the most. • Normalize the other variables' influence relative to this variable.
IPV: Hints • To overcome the greediness of individual splits, average the importance measure over the many boosted trees. • To prevent masking (where important variables are highly correlated with other important ones), use shrinkage.
IPV Classification • For K-class classification, fit a function for each class, and see which variable is important within each class. • If a few variables are important across all classes: 1.) Laugh your way to the bank, and 2.) Give me some $ for teaching you this.
IPV: Classification • Arrange each variable's importance in a matrix with p rows (variables) and K columns (classes). • Each column shows which variables separate that class from the others. • Each row shows how a variable's importance varies between classes; averaging across a row gives its overall importance.
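One rough way to build such a p × K matrix (my own stand-in, not the exact procedure from the text) is to fit a boosted model per class, one-vs-rest, and read off each model's relative importances:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Build a p x K importance matrix: one boosted one-vs-rest model per class,
# each contributing a column of relative variable importances.
X, y = load_iris(return_X_y=True)
classes = np.unique(y)

importance = np.column_stack([
    GradientBoostingClassifier(random_state=0)
    .fit(X, (y == k).astype(int))
    .feature_importances_
    for k in classes
])                                   # shape (p, K): rows = variables, columns = classes

print(np.round(importance, 3))
print("averaged over classes:", np.round(importance.mean(axis=1), 3))
```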
Partial Dependence Plots (PDP's) • Problem: after we've determined our important variables, how can we visualize their effects? • Solution 1: give up and become another (clueless but rich) manager. • Solution 2: just pick a few variables and keep at it (who likes the Bahamas anyway?)
PDP's: What They Are in the Limit • The partial dependence of f(X) on a chosen subset of variables X_S is the average of f over the complement variables X_C: f_S(X_S) = E_{X_C}[ f(X_S, X_C) ]. • It is estimated by averaging over the training data: f̄_S(X_S) = (1/N) Σ_{i=1..N} f(X_S, x_{iC}).
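A brute-force sketch of that training-data average for a single variable (function names and data are mine; scikit-learn also ships a built-in partial dependence routine):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

def partial_dependence_1d(model, X, feature, grid):
    """Partial dependence of a fitted model on one variable: for each grid
    value, overwrite that column for every training point, predict, and
    average (the empirical version of E_{X_C} f(x_S, X_C))."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)

X, y = make_friedman1(n_samples=500, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
grid = np.linspace(X[:, 3].min(), X[:, 3].max(), 20)
print(np.round(partial_dependence_1d(model, X, feature=3, grid=grid), 2))
```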
PDP's: Conditioning • To visualize d (> 3) dimensions, condition on a few of the input variables. • This is like looking at slices of the d-dimensional surface. • Condition on ranges of values if necessary. • Especially useful when interactions are limited and those variables have additive or multiplicative effects.
PDP's: Finding Interactions • To find interactions, compare the partial dependence plots with the variables' relative importances. • If a variable's importance is high yet its partial dependence plot appears flat, its effect likely comes through an interaction: examine its partial dependence jointly with another important variable.