Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014
Example: Age, Income and Owning a flat
[Figure: training set plotted as Monthly income (thousand rupees) vs. Age; points marked "Owns a house" and "Does not own a house", separated by two lines L1 and L2]
• If the training data were as above, could we define some simple rules by observation?
• Any point above the line L1 → Owns a house
• Any point to the right of L2 → Owns a house
• Any other point → Does not own a house
Example: Age, Income and Owning a flat
[Figure: the same training set plot with the lines L1 and L2]
• In general, the data will not be so cleanly separable as above
Example: Age, Income and Owning a flat
[Figure: the training set plot, Monthly income (thousand rupees) vs. Age]
• Approach: recursively split the data into partitions so that each partition becomes purer, until some stopping condition is met (see the sketch below)
• How to decide the split? How to measure purity? When to stop?
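A minimal sketch of the recursive-partitioning idea, not the exact algorithm from the slides. The helpers best_split() and impurity() are hypothetical names standing in for the choices discussed on the following slides:

```python
# Illustrative skeleton of recursive partitioning (assumptions: best_split()
# returns a (variable index, threshold) pair or None; impurity() scores a node).
from collections import Counter

def build_tree(points, labels, impurity, best_split, min_points=5):
    # Stop if the node is pure or too small; make it a leaf labelled with
    # the most frequent class at the node.
    if len(set(labels)) == 1 or len(points) <= min_points:
        return {"leaf": True, "label": Counter(labels).most_common(1)[0][0]}

    split = best_split(points, labels, impurity)   # (variable, threshold)
    if split is None:                              # no split reduces impurity
        return {"leaf": True, "label": Counter(labels).most_common(1)[0][0]}

    var, threshold = split
    left = [i for i, p in enumerate(points) if p[var] <= threshold]
    right = [i for i, p in enumerate(points) if p[var] > threshold]
    return {
        "leaf": False,
        "split": (var, threshold),
        "left": build_tree([points[i] for i in left], [labels[i] for i in left],
                           impurity, best_split, min_points),
        "right": build_tree([points[i] for i in right], [labels[i] for i in right],
                            impurity, best_split, min_points),
    }
```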
Approach for splitting
• What are the possible lines for splitting?
• For each variable, the midpoints between pairs of consecutive values of that variable (see the sketch below)
• How many? If N = number of points in the training set and m = number of variables, about O(N × m)
• How to choose which line to use for splitting?
• The line which reduces impurity (~ heterogeneity of composition) the most
• How to measure impurity?
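A small sketch of enumerating the candidate split points described above, assuming the training points are given as lists of numeric values (illustrative, not from the slides):

```python
def candidate_splits(points):
    """For each variable, return midpoints between consecutive distinct values."""
    splits = []
    n_vars = len(points[0])
    for var in range(n_vars):
        values = sorted(set(p[var] for p in points))
        for a, b in zip(values, values[1:]):
            splits.append((var, (a + b) / 2.0))   # (variable index, threshold)
    return splits

# Example: two variables (age, monthly income in thousand rupees)
print(candidate_splits([[25, 30], [32, 55], [40, 48]]))
```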
Gini Index for Measuring Impurity
• Suppose there are C classes
• Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
• Gini index: Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
• If all observations in t belong to one single class, Gini(t) = 0
• When is Gini(t) maximum?
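A small sketch of computing the Gini index from the class labels at a node (illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_i p(i|t)^2 over the classes i present at node t."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Y", "Y", "Y"]))        # pure node -> 0.0
print(gini(["Y", "N", "Y", "N"]))   # two equally frequent classes -> 0.5
```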
Entropy
• Average amount of information contained
• From another point of view – average amount of information expected – hence amount of uncertainty
• We will study this in more detail later
• Entropy: Entropy(t) = − Σ_{i=1}^{C} p(i|t) log₂ p(i|t), where 0 log₂ 0 is defined to be 0
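The same kind of sketch for entropy (illustrative); classes with zero count simply do not appear in the sum, which handles the 0 log₂ 0 = 0 convention:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) * log2 p(i|t); absent classes contribute 0."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs)

print(entropy(["Y", "N", "Y", "N"]))   # two equally frequent classes -> 1.0
# A pure node (single class) gives entropy 0.
```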
Classification Error
• What if we stop the tree building at a node?
• That is, do not create any further branches for that node; make that node a leaf
• Classify the node with the most frequent class present in the node
• Classification error as a measure of impurity: Error(t) = 1 − max_i p(i|t)
• Intuitively – the fraction of points in the rectangle (node) that do not belong to its most frequent class; the node may still be impure
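A matching sketch for classification error as an impurity measure (illustrative):

```python
from collections import Counter

def classification_error(labels):
    """Error(t) = 1 - max_i p(i|t): fraction misclassified if the node
    is labelled with its most frequent class."""
    n = len(labels)
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - most_common_count / n

print(classification_error(["Y", "Y", "Y", "N"]))   # -> 0.25
```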
The Full Blown Tree
• Recursive splitting: suppose we don't stop until all nodes are pure
[Figure: a tree with a root of 1000 points splitting into 400 and 600, then 200, 200, 160, 240, … down to leaves with 2, 1 and 5 points – statistically not significant]
• Result: a large decision tree with leaf nodes having very few data points
• Such leaves do not represent the classes well – overfitting
• Solution: stop earlier, or prune back the tree (see the sketch below)
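As a concrete illustration (not part of the slides), scikit-learn's DecisionTreeClassifier exposes the "stop earlier" idea through parameters such as max_depth and min_samples_leaf; a minimal sketch, assuming scikit-learn is installed:

```python
# Illustrative: a fully grown tree vs. early stopping (not from the slides).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree: splits until leaves are pure (risk of overfitting)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# "Stop earlier": limit depth and require a minimum number of points per leaf
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                               random_state=0).fit(X, y)

print(full.get_n_leaves(), small.get_n_leaves())
```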
Prune back
• Pruning step: collapse leaf nodes and make the immediate parent a leaf node
[Figure: a decision node (Freq = 7) with two leaves – label Y (Freq = 5) and label B (Freq = 2) – pruned into a single leaf with label Y (Freq = 7)]
• Effect of pruning: we lose purity of nodes – but were they really pure, or was that noise? Too many nodes ≈ fitting noise
• Trade-off: pruning loses some purity but reduces the complexity of the tree
Prune back: cost complexity
• Cost complexity of a (sub)tree T: the classification error (based on training data) plus a penalty for the size of the tree
• CC(T) = Err(T) + α × L(T)
• Err(T) is the classification error
• L(T) = number of leaves in T
• Penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree
[Figure: the same pruning example – a decision node (Freq = 7) with leaves Y (Freq = 5) and B (Freq = 2) collapsed into a single leaf Y (Freq = 7)]
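A small worked sketch of my own, applying the formula above to the slide's 7-point example: before pruning the subtree has two pure leaves (Y: 5 points, B: 2 points), after pruning a single leaf labelled Y misclassifies the 2 B points. Errors are counted as fractions of the 7 points in this subtree:

```python
def cost_complexity(err, n_leaves, alpha):
    """CC(T) = Err(T) + alpha * L(T)"""
    return err + alpha * n_leaves

for alpha in (0.0, 0.1, 2 / 7, 0.5):
    keep = cost_complexity(err=0.0, n_leaves=2, alpha=alpha)     # unpruned subtree
    prune = cost_complexity(err=2 / 7, n_leaves=1, alpha=alpha)  # collapsed leaf
    print(f"alpha={alpha:.2f}  keep subtree: {keep:.2f}  prune: {prune:.2f}")

# The pruned tree is preferred once alpha >= 2/7, i.e. when the penalty for an
# extra leaf outweighs the extra misclassification error.
```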
Different Decision Tree Algorithms
• Chi-square Automatic Interaction Detector (CHAID) – Gordon Kass (1980); stop subtree creation if the split is not statistically significant by a chi-square test
• Classification and Regression Trees (CART) – Breiman et al.; decision tree building by the Gini index
• Iterative Dichotomizer 3 (ID3) – Ross Quinlan (1986); splitting by information gain (difference in entropy)
• C4.5 – Quinlan's next algorithm, improved over ID3; bottom-up pruning, handles both categorical and continuous variables, and incomplete data points
• C5.0 – Ross Quinlan's commercial version
Properties of Decision Trees
• Non-parametric approach: does not require any prior assumptions about the probability distribution of the class and the attributes
• Finding an optimal decision tree is an NP-complete problem
• Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
• Fast to generate, fast to classify
• Easy to interpret or visualize
• Error propagation: an error at the top of the tree propagates all the way down
References
• Introduction to Data Mining, by Tan, Steinbach, Kumar
• Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf