Additive Models, Trees, and Related Methods 2006. 02. 17. Partly based on Prof. Prem Goel’s Slides
9.1 Generalized Additive Models • Mean function: E(Y | X1, …, Xp) = α + f1(X1) + … + fp(Xp) • fj: unspecified smooth (nonparametric) functions • Relate the conditional mean μ(X) of Y to an additive function of the X's via a link function g: g[μ(X)] = α + f1(X1) + … + fp(Xp)
Fitting Additive Models • Fit each fj using a scatterplot smoother, estimating all p functions simultaneously • For example, use the cubic smoothing spline as the smoother • Criterion: the penalized sum of squares (9.7), PRSS(α, f1, …, fp) = Σi [yi − α − Σj fj(xij)]² + Σj λj ∫ fj''(tj)² dtj • An additive cubic spline model minimizes this criterion • Each fj is a cubic spline in the component Xj • Knots at each of the unique values xij
The backfitting algorithm • Can accommodate other fitting methods in the same way, by specifying the appropriate smoothing operator Sj. • For a large class of linear smoothers, backfitting is equivalent to a Gauss-Seidel algorithm for solving a linear system.
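A minimal sketch of backfitting for an additive regression model, assuming a user-supplied one-dimensional smoother; the function name `smoother`, the iteration count, and the tolerance are illustrative choices, not part of the slides:

```python
import numpy as np

def backfit(X, y, smoother, n_iter=50, tol=1e-6):
    """Backfitting for y ~ alpha + f_1(X_1) + ... + f_p(X_p).

    `smoother(x, r)` is any scatterplot smoother (e.g. a cubic smoothing
    spline) that returns the smooth of the partial residuals r against x.
    """
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                 # f[:, j] holds f_j evaluated at the data
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # partial residuals: remove the intercept and all other components
            r = y - alpha - (f.sum(axis=1) - f[:, j])
            f[:, j] = smoother(X[:, j], r)
            f[:, j] -= f[:, j].mean()    # center each f_j so alpha stays identified
        if np.abs(f - f_old).max() < tol:
            break
    return alpha, f
```

Cycling through the coordinates in this way is exactly the Gauss-Seidel view mentioned above: each pass updates one component while holding the others fixed.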
Additive Logistic Regression • For the logistic regression model and other generalized additive models, the appropriate criterion is a penalized log-likelihood. • To maximize it, the backfitting procedure is used in conjunction with a likelihood maximizer.
Local Scoring Algorithm for Additive Logistic Regression
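The local scoring iteration itself is not spelled out on the slide; the sketch below shows the usual structure, an IRLS-style outer loop around a weighted backfitting step. The weighted smoother `wsmoother(x, z, w)` is an assumed helper, and running only one backfitting pass per outer iteration is a simplification:

```python
import numpy as np

def local_scoring(X, y, wsmoother, n_iter=10):
    """Local scoring for additive logistic regression (simplified sketch)."""
    n, p = X.shape
    alpha = np.log(y.mean() / (1.0 - y.mean()))   # start from the overall log-odds
    f = np.zeros((n, p))
    for _ in range(n_iter):
        eta = alpha + f.sum(axis=1)                # current additive predictor
        prob = 1.0 / (1.0 + np.exp(-eta))          # fitted probabilities
        w = prob * (1.0 - prob)                    # working weights
        z = eta + (y - prob) / w                   # working (adjusted) response
        # one pass of weighted backfitting of z on X with weights w
        alpha = np.average(z - f.sum(axis=1), weights=w)
        for j in range(p):
            r = z - alpha - (f.sum(axis=1) - f[:, j])
            f[:, j] = wsmoother(X[:, j], r, w)
            f[:, j] -= np.average(f[:, j], weights=w)
    return alpha, f
```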
9.2 Tree-Based Methods • Partition the feature space into a set of rectangles and fit a simple model in each one. • Examples: CART and C4.5
Regression Tree • Assume a recursive binary partition • Within each partition element, Y is modeled by a different constant • For each split, choose the variable and split point that minimize the sum of squares • Repeat within each resulting subset until a minimum node size is reached
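A brute-force sketch of one greedy split search, assuming numeric features in a numpy array; a real CART implementation keeps running sums so that the two node means are not recomputed for every candidate split point:

```python
import numpy as np

def best_split(X, y):
    """Find the (variable, split point) pair that minimizes the total
    within-node sum of squares after a single binary split."""
    n, p = X.shape
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue                           # cannot split between tied values
            left, right = ys[:i], ys[i:]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, 0.5 * (xs[i - 1] + xs[i]), rss
    return best_j, best_s, best_rss
```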
Regression Tree • How large should we grow the tree? • Cost-complexity pruning • Find the subtree T that minimizes Cα(T) = Σm Nm Qm(T) + α|T| • Choose α adaptively by weakest-link pruning • Successively collapse the internal node giving the smallest per-node increase in RSS, until the single-node (root) tree is reached • This sequence of subtrees contains the tree Tα that minimizes the cost-complexity criterion • Choose α by cross-validation
Classification Trees • The only changes are in the criteria for splitting nodes and for pruning the tree.
Node Impurity Measures • Cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. • Either cross-entropy or the Gini index should be used when growing the tree. • When pruning, any of the three measures can be used.
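For concreteness, the three impurity measures computed from a node's class counts; this is a small illustrative helper, not tied to any particular tree implementation:

```python
import numpy as np

def node_impurities(class_counts):
    """Misclassification error, Gini index, and cross-entropy for one node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                                  # class proportions in the node
    misclassification = 1.0 - p.max()
    gini = float(np.sum(p * (1.0 - p)))
    cross_entropy = float(-np.sum(p[p > 0] * np.log(p[p > 0])))
    return misclassification, gini, cross_entropy

# A purer node scores lower on all three measures:
print(node_impurities([300, 100]))
print(node_impurities([200, 200]))
```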
Other Issues • Instability: the hierarchical process means that an error in an upper split is propagated to all splits below it; bagging reduces this variance. • Lack of smoothness in the prediction surface, which can degrade performance in regression; MARS addresses this. • ROC curves: by varying the relative sizes of the losses L01 and L10 in the loss matrix, we can increase or decrease the sensitivity and specificity.
9.3 PRIM-Bump Hunting • Patient Rule Induction Method • Seeks boxes in which the response average is high. • Does not use binary splits • The collection of rules is harder to interpret, but each individual rule is simpler. • Patient: does not fragment the data as quickly as binary partitioning • This patience can help the top-down greedy algorithm find a better solution.
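A simplified sketch of the top-down peeling phase; the peeling fraction `alpha`, the stopping rule, and the function name are illustrative, and the later pasting phase and box-removal steps are omitted:

```python
import numpy as np

def prim_peel(X, y, alpha=0.1, min_support=10):
    """Peel off roughly a fraction alpha of the remaining points along the
    single box face whose removal leaves the highest mean response inside."""
    inside = np.ones(len(y), dtype=bool)
    while inside.sum() > min_support:
        best_mean, best_keep = -np.inf, None
        for j in range(X.shape[1]):
            xj = X[inside, j]
            lo, hi = np.quantile(xj, alpha), np.quantile(xj, 1.0 - alpha)
            # two candidate peels: drop the lower tail, or drop the upper tail
            for keep_face in (X[:, j] >= lo, X[:, j] <= hi):
                keep = inside & keep_face
                if 0 < keep.sum() < inside.sum() and y[keep].mean() > best_mean:
                    best_mean, best_keep = y[keep].mean(), keep
        if best_keep is None:
            break                       # no face can be peeled any further
        inside = best_keep
    return inside                       # mask of observations in the final box
```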
9.4 MARS: Multivariate Adaptive Regression Splines • Basic element: a pair of piecewise linear basis functions, (x − t)+ and (t − x)+ • Form a reflected pair for each input Xj, with knots at each observed value xij of that input • Total of 2Np basis functions
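The reflected pair is just the two hinge functions; a tiny sketch of generating the 2Np candidate basis functions (the (variable, knot) list representation is only illustrative):

```python
import numpy as np

def reflected_pair(x, t):
    """MARS basis pair with knot t: (x - t)_+ and (t - x)_+."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def candidate_knots(X):
    """One reflected pair per input variable and per observed value of that
    variable, i.e. 2 * N * p candidate basis functions in total."""
    N, p = X.shape
    return [(j, t) for j in range(p) for t in X[:, j]]   # (variable, knot) pairs
```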
Other Issues • MARS for classification • Two classes: 0/1 code and regression • More than two classes: optimal scoring (Section 12.5) • MARS vs. CART • Piecewise linear basis vs. step functions • Multiplication vs. splitting • Not necessarily binary splitting.
9.5 Hierarchical Mixtures of Experts • Splits are soft, probabilistic gates (gating networks), with an expert model at each terminal node.
Hierarchical Mixtures of Experts • Estimation of parameters by the EM algorithm • E-step: compute the expected values (posterior probabilities) of the gating variables • M-step: using these as observation weights, estimate the parameters in the expert networks (e.g., by multiple logistic regression) • HME vs. CART • Similar to CART with linear-combination splits • Soft splits are better at modeling a gradual response transition • There is no good method for finding a suitable tree topology for an HME
9.6 Missing Data • The key question: has the missing-data mechanism distorted the observed data? • Missing at random (MAR): the missing-data mechanism does not depend on the unobserved (missing) values, given the observed data. • Missing completely at random (MCAR): the mechanism is independent of both the observed and the unobserved data; MCAR is a stronger assumption than MAR.
Missing Data • Assuming MCAR • Discard observations with any missing values. • Rely on the learning algorithm to deal with missing values in its training phase. • Impute all missing values before training.
9.7 Computational Considerations • Additive model fitting: O(mpN + pN log N), where m is the number of backfitting iterations • Trees: O(pN log N) for the initial sorting and the split computations • MARS: O(NM² + pM²N), where M is the number of terms • HME: O(Np²) for the regressions and O(Np²K²) for the K-class logistic fits at each M-step; the EM algorithm can take a long time to converge