This slide deck discusses decision tree learning in data mining, including splitting the data, measuring impurity, stopping criteria, pruning, and the major decision tree algorithms. It closes with the properties of decision trees and references.
Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014
Example: Age, Income and Owning a flat
• Training set: two classes, "Owns a house" and "Does not own a house"
• [Scatter plot: monthly income (thousand rupees) vs. age, with two separating lines L1 and L2]
• If the training data were as above, could we define some simple rules by observation?
• Any point above the line L1 → Owns a house
• Any point to the right of L2 → Owns a house
• Any other point → Does not own a house
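As a rough sketch of the kind of hand-made rule this example suggests (the thresholds t1 and t2 below are hypothetical stand-ins for the lines L1 and L2, which are not given numerically on the slide):

```python
def owns_house(income, age, t1=50.0, t2=55.0):
    """Hand-crafted rule from the example: above L1 (high income) or to the
    right of L2 (older) -> owns a house; otherwise -> does not.
    t1 and t2 are made-up thresholds standing in for lines L1 and L2."""
    return income > t1 or age > t2

print(owns_house(income=80, age=30))   # above L1     -> True
print(owns_house(income=30, age=60))   # right of L2  -> True
print(owns_house(income=30, age=30))   # neither      -> False
```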
Example: Age, Income and Owning a flat
• Training set: the same two classes, "Owns a house" and "Does not own a house"
• [Scatter plot: monthly income (thousand rupees) vs. age, with lines L1 and L2]
• In general, the data will not be as cleanly separable as above
Example: Age, Income and Owning a flat
• Training set: "Owns a house" vs. "Does not own a house"
• [Scatter plot: monthly income (thousand rupees) vs. age]
• Approach: recursively split the data into partitions so that each partition becomes purer, till …
• How to decide the split?
• How to measure purity?
• When to stop?
Approach for splitting
• What are the possible lines for splitting?
• For each variable, the midpoints between pairs of consecutive values of that variable
• How many? If N = number of points in the training set and m = number of variables, about O(N × m)
• How to choose which line to use for splitting? The line which reduces impurity (≈ heterogeneity of composition) the most, as sketched below
• How to measure impurity?
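A minimal sketch of this split search in Python/NumPy (an illustration, not code from the lecture); the names candidate_splits and find_best_split are made up here, and the impurity measure is left as a parameter since the concrete choices (Gini, entropy, classification error) come on the following slides:

```python
import numpy as np

def candidate_splits(x):
    """Midpoints between consecutive distinct values of one variable."""
    values = np.unique(x)
    return (values[:-1] + values[1:]) / 2.0

def find_best_split(X, y, impurity):
    """Scan the ~O(N*m) candidate splits and return the one that reduces
    the size-weighted impurity of the two resulting partitions the most."""
    best_var, best_thr, best_imp = None, None, impurity(y)
    for j in range(X.shape[1]):                  # each variable
        for t in candidate_splits(X[:, j]):      # each midpoint
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            w_imp = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
            if w_imp < best_imp:
                best_var, best_thr, best_imp = j, t, w_imp
    return best_var, best_thr, best_imp
```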
Gini Index for Measuring Impurity
• Suppose there are C classes
• Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
• Gini index: Gini(t) = 1 − Σ_i [p(i|t)]²
• If all observations in t belong to one single class, Gini(t) = 0
• When is Gini(t) maximum?
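As a quick numerical illustration (a sketch, not from the slides), the Gini index can be computed directly from class counts:

```python
import numpy as np

def gini(labels):
    """Gini(t) = 1 - sum_i p(i|t)^2 over the classes present in node t."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array(["Y", "Y", "Y", "Y"])))   # pure node  -> 0.0
print(gini(np.array(["Y", "Y", "N", "N"])))   # 50/50 node -> 0.5, the maximum for 2 classes
```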
Entropy
• Average amount of information contained
• From another point of view: the average amount of information expected, hence the amount of uncertainty
• We will study this in more detail later
• Entropy: Entropy(t) = − Σ_i p(i|t) log₂ p(i|t), where 0 log₂ 0 is defined to be 0
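A matching sketch for entropy, again just for illustration (classes with zero count never appear in the sum, so the 0 log₂ 0 convention is handled implicitly):

```python
import numpy as np

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) * log2 p(i|t), with 0*log2(0) taken as 0."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # only classes with count > 0 appear here
    return -np.sum(p * np.log2(p))

print(entropy(np.array(["Y", "Y", "N", "N"])))   # 50/50 node  -> 1.0 bit, the maximum for 2 classes
print(entropy(np.array(["Y", "Y", "Y", "N"])))   # 3/4 vs 1/4  -> ~0.811 bits
```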
Classification Error
• What if we stop the tree building at a node?
• That is, do not create any further branches for that node; make that node a leaf
• Classify the node with the most frequent class present in the node
• Classification error as a measure of impurity: Error(t) = 1 − max_i p(i|t)
• [Figure: a rectangle (node) that is still impure after the split]
• Intuitively: the fraction of observations in the rectangle (node) that do not belong to its most frequent class
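And the corresponding sketch for classification error (illustrative code, same conventions as above):

```python
import numpy as np

def classification_error(labels):
    """Error(t) = 1 - max_i p(i|t): the fraction of points outside the node's majority class."""
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

print(classification_error(np.array(["Y", "Y", "Y", "N"])))   # majority "Y" -> error 0.25
```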
The Full Blown Tree
• [Tree diagram: root node with 1000 points, recursively split into children with 400 and 600 points, then 200, 200, 160, 240, …, down to leaves with only 2, 1, 5 points, which are statistically not significant]
• Recursive splitting: suppose we don't stop until all nodes are pure
• The result is a large decision tree with leaf nodes having very few data points
• Such a tree does not represent the classes well: overfitting
• Solution: stop earlier, or prune back the tree
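One common way to avoid the fully grown tree is to stop earlier; a minimal sketch using scikit-learn's DecisionTreeClassifier (chosen here only as an illustration, not the course's tooling), where min_samples_leaf forbids leaves with very few points:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy two-class data standing in for the age/income example.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)        # grown until nodes are pure
stopped = DecisionTreeClassifier(min_samples_leaf=25,               # stop earlier: no tiny leaves
                                 random_state=0).fit(X, y)

print(full_tree.get_n_leaves(), stopped.get_n_leaves())   # many tiny leaves vs. far fewer
```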
Prune back
• Pruning step: collapse leaf nodes and make their immediate parent a leaf node
• Effect of pruning: we lose some purity of the nodes
• But were they really pure, or was that just noise? Too many nodes ≈ noise
• Trade-off between the loss of purity and the reduction in complexity
• [Diagram: a decision node (Freq = 7) with two leaves, (label = Y, Freq = 5) and (label = B, Freq = 2), is pruned into a single leaf (label = Y, Freq = 7)]
Prune back: cost complexity
• Cost complexity of a (sub)tree T: the classification error (based on the training data) plus a penalty for the size of the tree
• Cost(T) = Err(T) + α × L(T)
• Err(T) is the classification error
• L(T) = number of leaves in T
• Penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree
• [Diagram: the same pruning example as on the previous slide, with the decision now driven by cost complexity]
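The same idea appears in scikit-learn as cost-complexity pruning; a hedged sketch (again just an illustration, and scikit-learn uses node impurity rather than classification error in its cost), where ccp_alpha plays the role of the penalty factor α:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for alpha in (0.0, 0.005, 0.02):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    # Larger alpha -> heavier penalty per leaf -> smaller pruned tree.
    print(f"alpha={alpha}: {tree.get_n_leaves()} leaves, "
          f"training error={1 - tree.score(X, y):.3f}")
```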
Different Decision Tree Algorithms
• Chi-square Automatic Interaction Detector (CHAID): Gordon Kass (1980); stop subtree creation if the split is not statistically significant by a chi-square test
• Classification and Regression Trees (CART): Breiman et al.; decision tree building using the Gini index
• Iterative Dichotomizer 3 (ID3): Ross Quinlan (1986); splitting by information gain (difference in entropy), as sketched below
• C4.5: Quinlan's next algorithm, improved over ID3; bottom-up pruning, both categorical and continuous variables, handling of incomplete data points
• C5.0: Ross Quinlan's commercial version
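ID3's splitting criterion, information gain, is just the drop in entropy from a parent node to its children; a small illustrative sketch (names chosen here, reusing the entropy function from above):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array(["Y"] * 5 + ["N"] * 5)
left, right = parent[:4], parent[4:]      # a candidate split: 4 Y | 1 Y + 5 N
print(information_gain(parent, [left, right]))   # ~0.61 bits gained
```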
Properties of Decision Trees
• Non-parametric approach: does not require any prior assumptions about the probability distribution of the classes and attributes
• Finding an optimal decision tree is an NP-complete problem
• Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
• Fast to generate, fast to classify
• Easy to interpret or visualize
• Error propagation: an error at the top of the tree propagates all the way down
References
• Introduction to Data Mining, by Tan, Steinbach, Kumar
• Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf