Boosting constructs a sequence of weak classifiers and combines them into a strong classifier by weighted voting. This deck covers boosting properties, the additive-model view of boosting, and gradient boosting.
Boosting (2): understanding boosting as an additive model; boosted trees.
Boosting Construct a sequence of weak classifiers and combine them into a strong classifier by a weighted majority vote. "Weak" means only slightly better than random guessing (coin-tossing). Some properties: flexible; able to do feature selection; good generalization; but it can fit noise.
Boosting Example with 10 predictors. The weak classifier is a stump: a tree with a single split and two terminal nodes.
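A minimal sketch (not from the slides) of boosting decision stumps on a simulated 10-predictor problem, loosely in the spirit of the classic example from Hastie et al.; the data-generating rule, sample sizes, and parameter choices here are assumptions for illustration, and the `estimator` argument is named `base_estimator` in older scikit-learn versions.

```python
# Boosting decision stumps with AdaBoost on a simulated 10-predictor problem.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10))                # 10 predictors
y = np.where((X ** 2).sum(axis=1) > 9.34, 1, -1)   # threshold near the chi-square(10) median

stump = DecisionTreeClassifier(max_depth=1)        # a stump: one split, two terminal nodes
model = AdaBoostClassifier(estimator=stump, n_estimators=400)
model.fit(X[:1000], y[:1000])
print("test accuracy:", model.score(X[1000:], y[1000:]))
```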
Boosting Boosting can be seen as fitting an additive model, with the general form
f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m),
where the \beta_m are expansion coefficients and the basis functions b(x; \gamma) are simple functions of the features x, characterized by parameters \gamma. Examples of \gamma: the weights of a sigmoidal function in a neural network; the split variables and split points of a tree model.
Boosting In general, such models are fit by minimizing a loss function over all expansion coefficients and basis-function parameters,
\min_{\{\beta_m, \gamma_m\}} \sum_{i=1}^{N} L\left(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\right),
which can be computationally intensive. An alternative is to proceed stepwise, fitting the sub-problem of a single basis function:
\min_{\beta, \gamma} \sum_{i=1}^{N} L\left(y_i, \beta \, b(x_i; \gamma)\right).
Boosting Forward stagewise additive modeling: at each step, add a new basis function without adjusting those added previously,
(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + \beta \, b(x_i; \gamma)\right), \qquad f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).
Example: with squared-error loss, the new basis function simply fits the current residuals y_i - f_{m-1}(x_i) (see the sketch below). Note that the squared loss function is not a good choice for classification.
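A minimal sketch of forward stagewise additive modeling with squared-error loss, using depth-1 regression trees as the basis functions; the toy data and all names are assumptions for illustration, not from the slides.

```python
# Forward stagewise fitting: each new stump fits the residuals of the current model,
# and previously added terms are never readjusted.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 1))
y = np.sin(4 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(500)

f = np.zeros(len(y))          # current fit f_{m-1}(x_i)
basis = []
for m in range(200):
    residual = y - f          # squared loss: the new basis function fits the residuals
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    basis.append(tree)
    f += tree.predict(X)      # add the new term without touching earlier ones
print("training MSE:", np.mean((y - f) ** 2))
```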
Boosting The version of AdaBoost we discussed uses the exponential loss function
L(y, f(x)) = \exp(-y \, f(x)).
The basis functions are the individual weak classifiers G_m(x) \in \{-1, +1\}.
Boosting Margin: y \, f(x). A positive margin (y f(x) > 0) means a correct classification; a negative margin (y f(x) < 0) means an incorrect one. The goal of classification is to produce positive margins as often and as large as possible, so negative margins should be penalized more than positive ones. The exponential loss penalizes negative margins heavily.
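A small numerical illustration (an assumption for this write-up, not from the slides) of how the exponential loss treats the margin compared with misclassification loss: correct classifications incur small loss, while increasingly negative margins are penalized increasingly heavily.

```python
# Exponential loss vs. 0-1 loss as a function of the margin y * f(x).
import numpy as np

margins = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])   # y * f(x)
exp_loss = np.exp(-margins)                                   # exponential loss
zero_one = (margins <= 0).astype(float)                       # misclassification loss
for m, e, z in zip(margins, exp_loss, zero_one):
    print(f"margin {m:+.1f}: exp loss {e:6.3f}, 0-1 loss {z:.0f}")
```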
Boosting With the exponential loss and classifier basis functions, the sub-problem to be solved at step m is
(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\left[-y_i \left(f_{m-1}(x_i) + \beta \, G(x_i)\right)\right] = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\left(-\beta \, y_i G(x_i)\right),
where w_i^{(m)} = \exp\left(-y_i f_{m-1}(x_i)\right). These weights are all fixed at step m: they depend only on the previously fitted f_{m-1} and are independent of \beta and G.
Boosting Each observation is either correctly or incorrectly classified by G, so the sum splits accordingly and the target function to be minimized becomes
e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} = \left(e^{\beta} - e^{-\beta}\right) \sum_{i=1}^{N} w_i^{(m)} I\left(y_i \ne G(x_i)\right) + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}.
For any \beta > 0, G_m has to satisfy
G_m = \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)} I\left(y_i \ne G(x_i)\right),
i.e., G_m is the classifier that minimizes the weighted error rate.
Boosting Solving for G_m gives the weighted error rate
\mathrm{err}_m = \frac{\sum_{i} w_i^{(m)} I\left(y_i \ne G_m(x_i)\right)}{\sum_{i} w_i^{(m)}}.
Plugging it back gives \beta_m = \tfrac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}. The overall classifier is updated by plugging these in: f_m(x) = f_{m-1}(x) + \beta_m G_m(x).
Boosting The weights for the next iteration become
w_i^{(m+1)} = w_i^{(m)} e^{-\beta_m y_i G_m(x_i)}.
Using -y_i G_m(x_i) = 2 \, I\left(y_i \ne G_m(x_i)\right) - 1, this equals w_i^{(m)} e^{2\beta_m I\left(y_i \ne G_m(x_i)\right)} e^{-\beta_m}, where the factor e^{-\beta_m} is independent of i and can be ignored. These updates are collected in the sketch below.
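A from-scratch sketch of the AdaBoost updates derived above (weighted error rate, \beta_m, weight update), using decision stumps as the weak classifiers G_m; the function and variable names are assumptions for illustration.

```python
# AdaBoost with decision stumps, implementing err_m, beta_m, and the weight update.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """Fit M boosted stumps; y must be coded as +1/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # observation weights w_i
    stumps, betas = [], []
    for m in range(M):
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (G.predict(X) != y)
        err = np.sum(w * miss) / np.sum(w)     # weighted error rate err_m
        beta = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w = w * np.exp(2 * beta * miss)        # up-weight misclassified points
        w = w / w.sum()
        stumps.append(G)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(X, stumps, betas):
    # f(x) = sum_m beta_m G_m(x); classify by its sign.
    f = sum(b * G.predict(X) for G, b in zip(stumps, betas))
    return np.sign(f)
```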
Boosting trees Trees partition the feature space into disjoint regions R_j, j = 1, 2, \ldots, J, represented by the terminal nodes of the tree. A tree is expressed as
T(x; \Theta) = \sum_{j=1}^{J} \gamma_j \, I(x \in R_j), \qquad \Theta = \{R_j, \gamma_j\}_{1}^{J}.
A boosted tree model is a sum of trees, f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m). In each step of the boosting procedure, we need to find
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\right).
Boosting trees Given the regions R_j, finding the optimal constants \gamma_j is easy; finding the regions themselves is difficult, so approximate solutions are used. The AdaBoost solution under exponential loss: find the tree that minimizes the weighted error rate. Gradient boosting is a generalization of AdaBoost to other loss functions.
Boosting trees Gradient boosting. Consider the boosting procedure as stepwise numerical optimization: if the loss function is differentiable, we can use its gradient to guide the optimization. The loss function is
L(f) = \sum_{i=1}^{N} L\left(y_i, f(x_i)\right),
where f(x) is constrained to be a sum of trees. The gradient is
g_{im} = \left[\frac{\partial L\left(y_i, f(x_i)\right)}{\partial f(x_i)}\right]_{f = f_{m-1}}.
Induce a tree T(x; \Theta_m) whose predictions t_{im} = T(x_i; \Theta_m) are as close as possible to the negative gradient:
\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \left(-g_{im} - T(x_i; \Theta)\right)^2.
Boosting trees For a regression tree with squared-error loss, -g_{im} = y_i - f_{m-1}(x_i), so going along the negative gradient amounts to fitting the current residuals with a tree (see the sketch below). For a classification tree with deviance loss, the logistic model is used as the link between f(x) and the class probabilities, and trees are induced to predict the corresponding current residuals on the probability scale.
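A minimal sketch of gradient boosting for regression with squared-error loss, where each tree is fit to the negative gradient, i.e. the current residuals. The toy data, tree depth, and learning rate are assumptions for illustration.

```python
# Gradient boosting with squared-error loss: fit each tree to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.2 * rng.standard_normal(400)

f = np.full(len(y), y.mean())         # f_0: a constant fit
trees, lr = [], 0.1
for m in range(300):
    neg_grad = y - f                  # -g_im = y_i - f_{m-1}(x_i) for squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
    trees.append(tree)
    f += lr * tree.predict(X)         # step along the fitted negative gradient
print("training MSE:", np.mean((y - f) ** 2))
```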
Boosting trees Example: failure of bagging a single-split tree (stump). Bagging averages over bootstrap fits and so reduces variance, but it cannot reduce the high bias of a stump, whereas boosting can.
Boosted trees and Random Forest Example comparing RF to boosted trees.
Boosted trees and Random Forest Figure: the probability that a relevant variable will be selected at a split, as the number of noise variables grows. With only a few relevant variables this probability becomes small and random forests suffer; however, when the number of relevant variables increases, the performance of random forests is robust to an increase in the number of noise variables.
Boosted trees and Random Forest The same idea was applied to gradient boosting. Stochastic gradient boosting: subsample rows before creating each tree; subsample columns before creating each tree; subsample columns before considering every split.
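One way (a tooling assumption, not from the slides) to run stochastic gradient boosting with scikit-learn: `subsample` draws a row subsample before growing each tree, and `max_features` subsamples columns at every split; per-tree column subsampling is offered by other libraries such as XGBoost (`colsample_bytree`). The dataset and parameter values below are arbitrary choices for illustration.

```python
# Stochastic gradient boosting: row subsampling per tree, column subsampling per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.1,
    subsample=0.5,        # row subsampling before growing each tree
    max_features="sqrt",  # column subsampling at every split
    random_state=0,
)
model.fit(X[:1500], y[:1500])
print("test accuracy:", model.score(X[1500:], y[1500:]))
```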