Classification Part 4: Tree-Based Methods BMTRY 726 4/18/14
So Far Linear Classifiers -Logistic regression: the most commonly used -LDA: uses more information (prior probabilities) and is therefore more efficient, but its output is less interpretable -Weaknesses: -modeling issues for large p relative to n -inflexibility in the decision boundary
So Far Neural Networks -Able to handle large p by reducing the number of inputs, X, to a smaller set of derived features, Z, used to model Y -Applying (non-linear) functions to linear combinations of the X’s and Z’s results in very flexible decision boundaries -Weaknesses: -Limited means of identifying important features -Tend to over-fit the data
Tree-Based Methods Non-parametric and semi-parametric methods Models can be represented graphically making them very easy to interpret We will consider two decision tree methods: -Classification and Regression Trees (CART) -Logic Regression
Tree-Based Methods Classification and Regression Trees (CART) -Used for both classification and regression -Features can be categorical or continuous -Binary splits selected for all features -Features may occur more than once in a model Logic Regression -Also used for classification and regression -Features must be categorical for these models -Features may occur more than once in a model
CART Main Idea: divide the feature space into a disjoint set of rectangles and fit a simple model in each one -For regression, the predicted output is the mean of the training data in that region, i.e. a constant function is fit in each region -For classification, the predicted class is the most frequent class in the training data in that region -Class probabilities are estimated by the class frequencies in the training data in that region
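As a minimal sketch (toy NumPy arrays, not from the slides), the per-region fits look like this:

```python
import numpy as np

# Hypothetical toy data: each training point already assigned to a region
regions = np.array([1, 1, 2, 2, 2])
y_cont  = np.array([3.0, 5.0, 10.0, 12.0, 11.0])   # continuous outcome
y_class = np.array(["A", "A", "B", "A", "B"])       # class labels

for r in np.unique(regions):
    in_r = regions == r
    # Regression: the prediction is the mean of the training outcomes in the region
    mean_pred = y_cont[in_r].mean()
    # Classification: the prediction is the most frequent class in the region,
    # and class probabilities are the observed class frequencies
    classes, counts = np.unique(y_class[in_r], return_counts=True)
    print(r, mean_pred, classes[np.argmax(counts)], dict(zip(classes, counts / counts.sum())))
```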
Feature Space as Rectangular Regions [Figure: the feature space is partitioned by the splits X1 < t1, X2 < t2, X1 < t3, and X2 < t4 into rectangular regions R1-R5]
CART Trees CART trees: graphical representations of a CART model Features of a CART tree: -Nodes/partitions: binary splits in the data based on one feature (e.g. X1 < t1) -Branches: paths leading down from a split -Go left if X satisfies the rule defined at the split -Go right if X does not meet the criterion -Terminal nodes: rectangular regions where splitting ends [Figure: the tree corresponding to the partition on the previous slide, with splits X1 < t1, X2 < t2, X1 < t3, X2 < t4 and terminal regions R1-R5]
CART Tree Predictions Predictions from a tree are made from the top down For a new observation with feature vector X, follow the path through the tree that is true for the observed vector until reaching a terminal node The prediction is the one made at that terminal node -In the case of classification, it is the class with the largest probability at that node [Figure: the same example tree as on the previous slides]
CART Tree Predictions For example, say our feature vector is X = (6, -1.5) Start at the top: which direction do we go? Since X1 = 6 does not satisfy X1 < 4, we go right at the first split [Figure: example tree with splits X1 < 4, X2 < -1, X1 < 5.5, and X2 < -0.2 and terminal regions R1-R5]
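A minimal sketch of this top-down traversal: the split values come from the slide, but the nesting of the splits and the R1-R5 labels are assumed from the earlier figure.

```python
def predict_region(x1, x2):
    """Walk the example tree from the top down (assumed structure)."""
    if x1 < 4:                            # first split
        return "R1" if x2 < -1 else "R2"
    if x1 < 5.5:                          # right branch: next split on X1
        return "R3"
    return "R4" if x2 < -0.2 else "R5"    # final split on X2

# X = (6, -1.5): fails X1 < 4 (go right), fails X1 < 5.5 (go right), passes X2 < -0.2
print(predict_region(6, -1.5))
```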
Building a CART Tree Given a set of features X and associated outcomes Y, how do we fit a tree? CART uses a recursive partitioning (aka greedy search) algorithm to select the "best" feature at each split. So let's look at the recursive partitioning algorithm.
Building a CART Tree Recursive Partitioning (aka Greedy Search) Algorithm: (1) Consider all possible splits for each feature/predictor -Think of a split as a cutpoint (when X is binary this is always ½) (2) Select the predictor, and the split on that predictor, that provides the best separation of the data (3) Divide the data into 2 subsets based on the selected split (4) For each subset, consider all possible splits for each feature (5) Select the predictor/split that provides the best separation for each subset of data (6) Repeat this process until some desired level of data purity is reached - this is a terminal node! A sketch of one splitting step is given below.
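A sketch of a single splitting step under these rules, using the Gini index (defined on the next slide) as the measure of separation; the function and variable names are just for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of the class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def best_split(X, y):
    """One greedy step: scan every feature and every candidate cutpoint and
    return the (feature index, cutpoint) minimizing the weighted impurity of
    the two resulting subsets."""
    n, p = X.shape
    best_j, best_t, best_score = None, None, np.inf
    for j in range(p):
        for t in np.unique(X[:, j]):                  # candidate cutpoints
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t, best_score
```

Recursive partitioning then applies best_split again within each of the two resulting subsets until the stopping rule is met.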
Impurity of a Node Need a measure of impurity of a node to help decide on how to split a node, or which node to split The measure should be at a maximum when a node is equally divided amongst all classes The impurity should be zero if the node is all one class Consider the proportion of observations of class k in node m: $\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k)$, where $N_m$ is the number of training observations falling in region $R_m$
Measures of Impurity Possible measures of impurity: (1) Misclassification Rate: $1 - \max_k \hat{p}_{mk}$ -Situations can occur where no split improves the misclassification rate -The misclassification rate can be equal when one option is clearly better for the next step (2) Deviance/Cross-entropy: $-\sum_{k}\hat{p}_{mk}\log\hat{p}_{mk}$ (3) Gini Index: $\sum_{k}\hat{p}_{mk}(1-\hat{p}_{mk})$
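A minimal sketch of the three measures, written as functions of the class labels in a node (the names are illustrative, not from any particular package):

```python
import numpy as np

def class_proportions(y):
    """Proportion of observations of each class in a node."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification_rate(y):
    return 1 - class_proportions(y).max()

def cross_entropy(y):
    p = class_proportions(y)
    return float(-np.sum(p * np.log(p)))

def gini_index(y):
    p = class_proportions(y)
    return float(np.sum(p * (1 - p)))
```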
Tree Impurity Define the impurity of a tree to be the sum over all terminal nodes of the impurity of a node multiplied by the proportion of cases that reach that node of the tree Example: impurity of a tree with a single node, with classes A and B each having 400 cases, using the Gini Index: -Proportion of each class = 0.5 -Therefore Gini Index = (0.5)(1 - 0.5) + (0.5)(1 - 0.5) = 0.5
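Checking the slide's arithmetic with the gini_index sketch above:

```python
import numpy as np

node = np.array(["A"] * 400 + ["B"] * 400)   # single node, 400 cases of each class
print(gini_index(node))                      # (0.5)(1 - 0.5) + (0.5)(1 - 0.5) = 0.5
```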
[Figures: growing the example classification tree one split at a time - first X2 < -1.0 (Blue terminal node), then X1 < 3.9 (Red), then X1 < 5.45 (Blue), and finally X2 < -0.25 (Red / Blue)]
Pruning the Tree Overfitting: CART trees grown to maximum purity often overfit the data The error rate will always drop (or at least not increase) with every split This does not mean the error rate on test data will improve The best method of arriving at a suitable size for the tree is to grow an overly complex one and then prune it back Pruning (in classification trees) is based on misclassification
Pruning the Tree Pruning parameters include (to name a few): (1) minimum number of observations in a node to allow a split (2) minimum number of observations in a terminal node (3) maximum depth of any node in the tree Use cross-validation to determine the appropriate tuning parameters Possible rules: -minimize the cross-validation relative error (xerror) -use the "1-SE rule", which takes the largest value of Cp whose xerror is within one standard error of the minimum
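As a hedged sketch of this workflow in scikit-learn (the course materials presumably used rpart in R, so the parameter names here, e.g. ccp_alpha for the Cp-like complexity parameter, are scikit-learn's, not rpart's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Grow an overly complex tree and extract candidate complexity (alpha) values for pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        "ccp_alpha": path.ccp_alphas[::5],   # pruning strength (thinned to keep the grid small)
        "min_samples_split": [2, 10, 20],    # min observations in a node to allow a split
        "min_samples_leaf": [1, 5, 10],      # min observations in a terminal node
        "max_depth": [None, 3, 5],           # max depth of any node in the tree
    },
    cv=5,                                    # cross-validation to pick the tuning parameters
)
grid.fit(X, y)
print(grid.best_params_)                     # minimum-CV-error rule
```

The 1-SE rule is not built into GridSearchCV; it would be applied by inspecting cv_results_ and taking the largest ccp_alpha whose mean CV error is within one standard error of the minimum.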
Logic Regression Logic regression is an alternative tree-based method It requires exclusively binary features, although the response can be binary or continuous Logic regression produces models that represent Boolean (i.e. logical) combinations of binary predictors -"AND" = Xj ∧ Xk -"OR" = Xj ∨ Xk -"NOT" = !Xj This allows logic regression to model complex interactions among binary predictors
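With 0/1 features stored as Boolean arrays, these operators are just element-wise logic (a toy illustration, not the interface of any logic regression software):

```python
import numpy as np

# Hypothetical binary features for five subjects
X1 = np.array([1, 1, 0, 1, 0], dtype=bool)
X4 = np.array([1, 0, 1, 0, 1], dtype=bool)
X5 = np.array([0, 1, 1, 0, 0], dtype=bool)

print((X1 & X4).astype(int))   # "AND": X1 AND X4
print((X1 | X4).astype(int))   # "OR":  X1 OR X4
print((~X5).astype(int))       # "NOT": !X5
```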
Logic Regression Like CART, logic regression is a non-parametric/semi-parametric, data-driven method Like CART, it is also able to model continuous and categorical responses Logic regression can also be used for survival outcomes (though I've never found an instance where someone actually did!) Unlike CART, logic regression requires that all features are binary when added to the model -Features in a CART model are treated as binary at each split, BUT CART identifies the "best" split for a given feature to improve purity in the data Logic regression also uses a different fitting algorithm to identify the "best" model -Simulated Annealing vs. Recursive Partitioning
Comparing Tree Structures CART -nodes: define splits on features in the tree -branches: define relationships between features in the tree -terminal nodes: defined by a subgroup of the data (the probability of each class in that node defines the final prediction) -predictions: made by starting at the top of the tree and going down the appropriate splits until a terminal node is reached Logic Regression -knots (nodes): "AND"/"OR" operators in the tree -branches: define relationships between features in the tree -leaves (terminal nodes): define the features (i.e. X's) used in the tree -predictions: the tree can be broken down into a set of logical relationships among features; if an observation matches one of the feature combinations defined in the tree, the class is predicted to be 1
Logic Regression Branches in a CART tree can be thought of as "AND" combinations of the features at each node in the branch The only "OR" operator in a CART tree occurs at the first split The inclusion of "OR", "AND", and "NOT" operators in logic regression trees provides greater flexibility than is seen in CART models For data that include predominantly binary predictors (e.g. single nucleotide polymorphisms, yes/no survey data, etc.) logic regression is a better option than CART Consider an example…
CART vs. Logic Regression Trees Assume the features are a set of binary predictors and the following combination predicts that an individual has disease (versus being healthy): (X1 AND X4) OR (X1 AND !X5) [Figure: the CART tree uses splits X1 < 0.5, X5 < 0.5, and X4 < 0.5 with terminal predictions C = 0 or C = 1; the logic regression tree is OR(AND(X1, X4), AND(X1, !X5))]
Logic Regression Trees Not Unique Logic regression trees are not necessarily unique. We can represent the same model with two seemingly different trees: (X1 AND X4) OR (X1 AND !X5) [Figure: Logic Regression Tree 1 is OR(AND(X1, X4), AND(X1, !X5)); Tree 2 is the equivalent AND(X1, OR(X4, !X5))]
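A quick check on toy random binary data that the two trees encode the same Boolean rule (Tree 2's structure is read from the figure; the equivalence follows because (X1 AND X4) OR (X1 AND !X5) distributes to X1 AND (X4 OR !X5)):

```python
import numpy as np

rng = np.random.default_rng(0)
X1, X4, X5 = rng.integers(0, 2, size=(3, 8)).astype(bool)  # hypothetical binary features

tree1 = (X1 & X4) | (X1 & ~X5)   # Tree 1: (X1 AND X4) OR (X1 AND !X5)
tree2 = X1 & (X4 | ~X5)          # Tree 2: X1 AND (X4 OR !X5)

print(np.array_equal(tree1, tree2))   # True: the two trees are equivalent
```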
Example of Logic Regression Logic Model for Stage P1 Cancer and the equivalent logic tree [Figure: logic tree combining the markers !p53, Rb, !p21, H-ras, FGFR3, and 9q- with AND/OR operators to predict cancer] Mitra, A. P. et al. J Clin Oncol 24:5552-5564, 2006
Fitting a Logic Regression Model CART -considers one feature at a time Logic regression -considers all logical combinations of predictors (up to a fixed size) As a result… The search space can be very large Recursive partitioning could be used for logic regression, however, since it only considers one predictor at a time, it may fail to identify optimal combinations of predictors Simulated annealing is a stochastic learning algorithm that is a viable alternative to recursive partitioning
Simulated Annealing The original idea for the simulated annealing (SA) algorithm came from metallurgy -With metal mixtures, strength, malleability, etc. are controlled by the temperature at which the metals anneal SA borrows this idea to effectively search a large feature space without getting "stuck" at a local max/min SA uses a combination of a random walk and a Markov chain to conduct its search The rate of annealing is controlled by: -the number of random steps the algorithm takes -the cooling schedule (including the starting and ending annealing "temperatures")
Simulated Annealing (0) Select a measure of model fit (scoring function) (1) Select a random starting point in the search space (i.e. one possible combination of features) (2) Randomly select one of six possible moves to update the current model -Alternate a leaf; alternate an operator; grow a branch; prune a branch; split a leaf; delete a leaf (3) Compare the new model to the old model using the measure of fit (4) If the new model is better, accept it unconditionally (5) If the new model is worse, accept it with some probability -This probability is related to the current annealing "temperature" (6) Repeat for some set number of iterations (number of steps)
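A generic simulated annealing sketch (illustrative only, not the logic regression software's implementation): propose_move would randomly apply one of the six moves, score is the chosen measure of fit (lower = better), and the geometric cooling schedule is just one common choice.

```python
import math
import random

def simulated_annealing(initial_model, propose_move, score,
                        n_iter=10000, temp_start=1.0, temp_end=1e-3):
    """Keep a current model; at each step propose a random move and accept it
    unconditionally if it scores better, or with a temperature-dependent
    probability if it scores worse."""
    current, current_score = initial_model, score(initial_model)
    for i in range(n_iter):
        # geometric cooling schedule from temp_start down to temp_end
        temp = temp_start * (temp_end / temp_start) ** (i / max(n_iter - 1, 1))
        candidate = propose_move(current)
        cand_score = score(candidate)
        if cand_score <= current_score:
            current, current_score = candidate, cand_score    # better (or equal): accept
        elif random.random() < math.exp((current_score - cand_score) / temp):
            current, current_score = candidate, cand_score    # worse: accept with some probability
    return current
```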
Six Allowable Moves [Figure: the six allowable moves, from Ruczinski I, Kooperberg C, LeBlanc M (2003). Logic Regression. JCGS, 12(3), 475-511]
Simulated Annealing Measures of model fit: • Misclassification (classification model) -Simulated annealing does not have the same issues with misclassification that recursive partitioning does • Deviance/cross-entropy (logistic classification model) • Sum-of-squares error (regression model) • Hazard function (survival) • User-specified model fit/loss function -You could write your own
Other Fitting Considerations • Choosing annealing parameters (start/end temperatures, number of iterations) -Determine an appropriate set of annealing parameters prior to running the model -Starting temperature: select it so that 90-95% of "worse" models are accepted -Ending temperature: select it so that fewer than 5% of worse models are accepted -The number of iterations is a matter of choice (more iterations = longer run time) • Choosing the maximum number of leaves/trees -Good to determine the appropriate number of leaves/trees before running the final model -Can be done by cross-validation (select the model size with the smallest CV error)
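One way to translate the "accept 90-95% of worse models at the start" guideline into a number, assuming a worse move with score increase delta is accepted with probability exp(-delta/T); this is a pilot-run heuristic for illustration, not a rule taken from the logic regression documentation.

```python
import numpy as np

def pick_start_temperature(worse_deltas, target_accept=0.90):
    """Choose T so a typical score increase (delta) from a 'worse' move is
    accepted with roughly target_accept probability: exp(-delta / T) = target."""
    typical_delta = float(np.median(worse_deltas))
    return -typical_delta / np.log(target_accept)

# deltas observed for worse moves during a short pilot run (hypothetical values)
print(pick_start_temperature([0.8, 1.2, 0.5, 2.0]))
```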
Some Final Notes The Good: Tree-based methods are very flexible for classification and regression The models can be presented graphically for easier interpretation -clinicians like them; they tend to think this way anyway CART is very useful in settings where the features are a mix of continuous and binary variables Logic regression is useful for modeling data with all binary features (e.g. SNP data)
Some Final Notes The Bad (and sometimes ugly): They do have a tendency to overfit the data -not uncommon among machine learning methods They are referred to as weak learners -small changes in the data result in very different models There has been extensive research into improving the performance of these methods, which is the topic of our next class.