Decision Trees: Learning what questions to ask
Decision tree. The job is to build a tree that represents a series of questions the classifier will ask of a data instance that is to be classified. Each node is a question about the value the instance has in a particular dimension, and the fan-out of each node is determined by how many different values that dimension can take on (discrete data, e.g. the Play Tennis example). How would the decision tree classify a given data instance?
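A minimal sketch in Python (not from the slides) of one way such a tree over discrete attributes could be represented and queried; the attribute names and structure below are illustrative, borrowed from the Play Tennis example:

# Each decision node is a dict {question: {answer: subtree-or-label}};
# a leaf is just a class label. Structure and names are illustrative.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    # Walk the tree, asking the question at each node about the instance
    while isinstance(node, dict):
        attribute = next(iter(node))                 # the question this node asks
        node = node[attribute][instance[attribute]]  # follow the matching answer branch
    return node                                      # reached a leaf: the predicted class

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> No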
Training. Training data is used to build the tree. How do we decide what question to ask first? Remember the curse of dimensionality: there might be just a few dimensions that are important, and the rest could be random. Training builds the tree; classifying means using the tree.
What question to ask. What question can I ask about the data that will give me the most information gain, i.e. bring me closer to being able to classify? This means identifying the most important dimension (the most important question): What is the outlook? How humid is it? How windy is it? And then deciding what to ask next.
Information theory. Another statistical approach: the method comes out of information theory, developed (per Wikipedia) by Claude E. Shannon to find fundamental limits on signal-processing operations such as compressing data. Basically: how much information can I cram into a given signal (how many bits can I encode)?
Entropy. The approach starts with entropy, a measure of the homogeneity of the data: purely random data (nothing but noise) has maximum entropy, while linearly separable data has minimum entropy. What does that mean with discrete data? Given all instances with a sunny outlook, suppose every low-humidity instance were classified "yes, play tennis" and every high-humidity instance "no, do not play tennis". High entropy or low? (Low.) Now suppose half were "yes, play tennis" and half "no, don't play" no matter what the humidity. High entropy or low? (High.)
Entropy. If we are going to measure it, we want a statistic that yields 1 for a sample that is 50% positive and 0 for a sample that is 0% or 100% positive: Entropy(S) = -p+ log2(p+) - p- log2(p-), where S is a collection of training samples, p+ is the proportion of positives, and p- is the proportion of negatives. We define 0 * log2(0) as 0.
Example. What if a sample were 20% positive and 80% negative? log2(.2) = log(.2)/log(2) = -2.321928 and log2(.8) = -0.321928, so Entropy = -(.2)(-2.321928) - (.8)(-0.321928) = 0.721928. What if it were 80% / 20%? The same, by symmetry. What if 50% / 50%? The highest entropy: 1.
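A quick sanity check of those numbers in Python (a sketch, not from the slides):

import math

def entropy(proportions):
    # Entropy in bits; terms with p = 0 are treated as 0 by convention
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.2, 0.8]))   # 0.7219280948873623
print(entropy([0.8, 0.2]))   # same, by symmetry
print(entropy([0.5, 0.5]))   # 1.0, the maximum for two classes
print(entropy([1.0, 0.0]))   # 0.0, a pure sample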
If not binary. The measure extends to more than two classes, not just positive and negative: Entropy(S) = -Σi pi log2(pi), summed over the classes i. • If we set the log base to the number of classes, the maximum goes back to 1. • If we stick with base 2, the maximum grows to log2 of the number of classes. • From the book: entropy is a measure of the expected encoding length, measured in bits.
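A small illustration of the base-change point (a sketch; the function name is mine):

import math

def entropy_base(proportions, base=2):
    # With base = number of classes, the maximum is rescaled back to 1
    return -sum(p * math.log(p, base) for p in proportions if p > 0)

print(entropy_base([1/3, 1/3, 1/3], base=2))  # log2(3), about 1.585 bits
print(entropy_base([1/3, 1/3, 1/3], base=3))  # 1.0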
Information gain. Simply, the expected reduction in entropy caused by partitioning the examples according to an attribute A: Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) Entropy(Sv), summed over the values v of A. The |Sv| / |S| factor scales the contribution of each answer according to its membership. If the entropy of S is 1 and each of the answers' entropies is 0, then the gain is 1 - 0 = 1. If the entropy of S is 1 and each of the answers' entropies is 1, then the gain is 1 - 1 = 0. So which question gains more: the humidity question or the windy question?
Example. What is the information gain for the Play Tennis sample S with 9 yesses and 5 no's? First, Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940; the gain of each attribute is then this value minus the membership-weighted entropies of the subsets produced by its answers.
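Assuming the standard Play Tennis split for the Wind attribute (6 yes / 2 no when Wind = Weak, 3 yes / 3 no when Wind = Strong; those counts are my assumption, used only to illustrate the calculation), the arithmetic works out like this:

import math

def H(p):
    # Binary entropy, in bits, of a (p, 1-p) class split
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

H_S      = H(9 / 14)                   # entropy of the whole sample: about 0.940
H_weak   = H(6 / 8)                    # Wind = Weak   subset (6 yes / 2 no): about 0.811
H_strong = H(3 / 6)                    # Wind = Strong subset (3 yes / 3 no): 1.0
gain_wind = H_S - (8 / 14) * H_weak - (6 / 14) * H_strong
print(round(H_S, 3), round(gain_wind, 3))   # 0.94 0.048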
The algorithm. A recursive algorithm: ID3 (Iterative Dichotomizer 3).
• ID3(S, attributes yet to be processed):
• Create a Root node for the tree.
• Base cases:
  • If all of S are the same class, return the single-node tree Root with that label.
  • If attributes is empty, return the Root node with label equal to the most common class.
• Otherwise:
  • Find the attribute with the greatest information gain and set it as the decision attribute for Root.
  • For each value of the chosen attribute:
    • Add a new branch below Root.
    • Determine Sv, the subset of S with that value.
    • If Sv is empty, add a leaf with the label of the most common class.
    • Else, add a subtree to this branch: ID3(Sv, attributes minus this attribute).
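The same algorithm as a compact Python sketch (my own illustrative code, operating on examples stored as dicts and producing the nested-dict tree used in the earlier sketch):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(examples, attributes, target):
    # Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # base case: all one class
        return labels[0]
    if not attributes:                              # base case: no questions left
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):    # one branch per observed answer
        subset = [e for e in examples if e[best] == value]
        # (ID3 proper also adds a most-common-class leaf for possible values
        #  that happen to have no training examples)
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

On the Play Tennis data, id3(examples, ["Outlook", "Humidity", "Wind"], "Play") would return a nested dict like the one sketched earlier, which classify() can then apply to new instances.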
Another example. Which attribute next?
Another example. Next attribute?
An issue. Is there a branch for every answer? What if no training samples had "overcast" as their outlook? Could you classify a new unknown or test instance if it had "overcast" in that dimension?
An issue: overfitting. The tree often perfectly classifies the training data. This is not guaranteed, but usually holds: if you exhaust every dimension as you drill down, the last decision node might still have "impure" answers and simply be labeled with the most abundant class. For instance, on the cancer data my tree had no leaves deeper than four levels. The tree basically memorizes the training data. Is this the best policy? What if a node that "should" be pure had a single exception?
Visualizing overfitting. (Figure: a decision boundary.) Sometimes it is better to live with a little error than to try to get perfection.
Overfitting. From Wikipedia: "In statistics, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship."
How to fix it. The Bayesian approach finds the boundary that minimizes error; trimming the decision tree's leaves has a similar effect, i.e. we don't try to memorize every single training sample.
Don't know until you know. Withhold some data and use it to test. Definition: given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
How to prevent it? Either stop growing the tree early (set some threshold for allowable entropy), or post-prune: build the full tree, then remove parts of it as long as doing so improves performance.
Reduced-error pruning: try it and see. Remove each decision node in turn and check performance; removing a decision node means removing all subtrees below it and assigning the most common class. Permanently remove the decision node whose removal caused the greatest increase in accuracy. Rinse and repeat.
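A simplified sketch of the idea on the nested-dict tree representation used above. Note that this version prunes bottom-up, collapsing a node whenever held-out (validation) accuracy does not drop, rather than the greedy remove-the-single-best-node loop described on the slide:

from collections import Counter

def classify(node, instance):
    while isinstance(node, dict):
        attribute = next(iter(node))
        node = node[attribute].get(instance[attribute])
    return node

def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def prune(tree, node, training, validation, target):
    # Tentatively collapse each decision node into the most common class of the
    # training examples that reach it; keep the change if validation accuracy holds
    if not isinstance(node, dict):
        return
    attribute = next(iter(node))
    for value in list(node[attribute]):
        child = node[attribute][value]
        subset = [e for e in training if e[attribute] == value]
        if isinstance(child, dict) and subset:
            prune(tree, child, subset, validation, target)     # prune deeper nodes first
            before = accuracy(tree, validation, target)
            leaf = Counter(e[target] for e in subset).most_common(1)[0][0]
            node[attribute][value] = leaf                      # tentative collapse
            if accuracy(tree, validation, target) < before:
                node[attribute][value] = child                 # revert: pruning hurt

Called as prune(tree, tree, training_examples, validation_examples, "Play"), it mutates the tree in place.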
Rule post-pruning. Build the complete (overtrained) tree, then convert the learned tree into a set of rules: one rule per path from root to leaf, where each rule is a conjunction of clauses. Remove from each rule chain any clause whose removal increases accuracy (remember, each rule chain provides a full classification on its own). Finally, sort the rules by accuracy and classify in that order.
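A sketch of the rule-extraction and clause-pruning steps on the same tree representation (the helper names are mine, and the accuracy measure here is simply each rule's precision on a held-out set):

def extract_rules(node, path=()):
    # One rule per root-to-leaf path: (tuple of (attribute, value) tests, class label)
    if not isinstance(node, dict):
        return [(path, node)]
    attribute = next(iter(node))
    rules = []
    for value, child in node[attribute].items():
        rules += extract_rules(child, path + ((attribute, value),))
    return rules

def rule_accuracy(conditions, label, examples, target):
    matched = [e for e in examples if all(e.get(a) == v for a, v in conditions)]
    return sum(e[target] == label for e in matched) / len(matched) if matched else 0.0

def prune_rule(conditions, label, validation, target):
    # Greedily drop any single clause whose removal improves the rule's accuracy
    improved = True
    while improved and conditions:
        improved = False
        for i in range(len(conditions)):
            candidate = conditions[:i] + conditions[i + 1:]
            if rule_accuracy(candidate, label, validation, target) > \
               rule_accuracy(conditions, label, validation, target):
                conditions, improved = candidate, True
                break
    return conditions, label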
The result is not really a tree any more but a series of rules: a node could be present in one rule and absent from another. Imagine a bifurcation where one track keeps only the first and last "node".
Bagging. Bootstrap aggregating (bagging) helps to avoid overfitting. It is usually applied to decision-tree models, though not exclusively.
Bagging. A machine-learning ensemble meta-algorithm: create a bunch of models by bootstrap-sampling the training data, then let all the models vote.
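A bagging sketch built on the id3() and classify() functions from the earlier sketches (the model count and function names are illustrative):

import random
from collections import Counter

def bootstrap_sample(examples):
    # Sample len(examples) items with replacement
    return [random.choice(examples) for _ in range(len(examples))]

def bagged_trees(examples, attributes, target, n_models=25):
    # One tree per bootstrap sample of the training data
    return [id3(bootstrap_sample(examples), list(attributes), target)
            for _ in range(n_models)]

def vote(models, instance):
    # Every model gets a vote; the majority class wins
    votes = [classify(m, instance) for m in models]
    return Counter(votes).most_common(1)[0][0]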
Random forest. A forest is a bunch of trees, where each tree has access to a random subset of the attributes/dimensions.
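A random-forest variant of the bagging sketch above. Following the slide, each tree here gets a per-tree random subset of the attributes; classical random forests re-sample the attributes at every split, which is a refinement of the same idea:

import random

def random_forest(examples, attributes, target, n_trees=50, n_features=2):
    # Bagging plus a random subset of attributes for each tree
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(examples) for _ in range(len(examples))]
        features = random.sample(list(attributes), min(n_features, len(attributes)))
        forest.append(id3(sample, features, target))
    return forest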
The nature of decision trees. ID3 is a greedy algorithm: it tries to race to an answer by finding the next question that best splits the data into classes by its answers. The result: short trees are preferred.
Occam's razor. The simplest answer is often the best, but does this lead to the best classifier? The book has a philosophical discussion about this without resolving the issue.
Coolness factor. Many classifiers simply give an answer with no reason; decision trees are one of the few that provide such insight.