Classification and Regression Trees Chapter 9
Example: Predicting Car Purchase -- Decision Tree
[Tree diagram: root splits on Age? with branches <30, 30-40, and >40; the <30 branch tests Student? (no -> NO, yes -> YES); the 30-40 branch is a YES leaf; the >40 branch tests Credit Rating?, leading to NO and YES leaves.]
Another Example: will the customer default on her/his loan? • Can we classify default based on balance and age?
The Decision Tree • A series of nested tests. • Each node represents a test on one attribute (for decisions) • Nominal attribute: # of branches = # of possible values • Numeric attribute: discretized (split at a threshold) • Each leaf is a class assignment • Default or Not default
[Tree diagram: root Employed? (Yes -> NOT DEFAULT; No -> Balance?); Balance? (<50K -> DEFAULT; >=50K -> Age?); Age? (>45 -> NOT DEFAULT; <=45 -> DEFAULT)]
Using the decision tree for prediction • Predict for Mark: • Age: 40 • Employment: None • Balance: 88k • Start at the root and traverse down until a leaf is reached
[Same tree diagram as on the previous slide: Employed?, Balance? (50K threshold), Age? (45 threshold), with DEFAULT / NOT DEFAULT leaves]
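The traversal can also be written as a few lines of Python. This is a minimal sketch: the nested-tuple encoding, the predict helper, and the branch outcomes are assumptions made for illustration (the leaf labels follow one reading of the figure, which the slide text does not spell out).

# Hypothetical encoding of the figure's tree: an internal node is a (test, branches)
# pair, a leaf is just a class label. Branch outcomes are one reading of the diagram.
tree = ("Employed?", {
    "Yes": "NOT DEFAULT",
    "No": ("Balance >= 50K?", {
        "No": "DEFAULT",
        "Yes": ("Age > 45?", {"Yes": "NOT DEFAULT", "No": "DEFAULT"}),
    }),
})

def predict(node, record):
    # Start at the root and follow the matching branch until a leaf is reached.
    while isinstance(node, tuple):
        test, branches = node
        node = branches[record[test]]
    return node

# Mark: Age 40, Employment None, Balance 88k
mark = {"Employed?": "No", "Balance >= 50K?": "Yes", "Age > 45?": "No"}
print(predict(tree, mark))   # class assigned by the leaf Mark ends up in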
Trees Easily Converted To Rules (for coding) • Reading Rules from a Decision Tree
[Tree diagram: Income at the root (Low -> Debts; High -> Gender); Debts (Low -> Non-Responder; High -> Responder); Gender (Male -> Children; Female -> Non-Responder); Children (Many -> Responder; Few -> Non-Responder)]
IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
IF Income=High AND Gender=Female THEN Non-Responder
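Written out in code, each root-to-leaf path becomes one IF ... THEN rule. A minimal Python version of the five rules above (the function name and argument order are just illustrative choices):

def classify(income, debts, gender, children):
    # One branch per root-to-leaf path of the tree above.
    if income == "Low" and debts == "Low":
        return "Non-Responder"
    if income == "Low" and debts == "High":
        return "Responder"
    if income == "High" and gender == "Male" and children == "Many":
        return "Responder"
    if income == "High" and gender == "Male" and children == "Few":
        return "Non-Responder"
    if income == "High" and gender == "Female":
        return "Non-Responder"

print(classify("High", None, "Male", "Few"))   # -> Non-Responder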
Decision Tree Construction • Basic algorithms are greedy • Tree constructed in a top-down recursive partitioning manner (sketched in code below) • All training examples are at the root initially • Attributes are assumed to be categorical (discretized if necessary) • Examples partitioned recursively based on selected attributes • Which attribute to select? – Using Information gain • Some Popular Methods • ID3, C4.5, CART • Similar ideas • Differ in • How tree is grown • Splitting criteria • Pruning methods • Termination criteria
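A minimal sketch of this greedy, top-down recursive partitioning, assuming categorical attributes and the entropy-based information gain defined on the following slides. It illustrates the shared idea only; it is not an implementation of ID3, C4.5, or CART.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Parent entropy minus the weighted entropy of the subgroups induced by attr.
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    # Stop when the node is pure or no attributes remain; the leaf predicts the majority class.
    # (One could also stop when the best gain is zero -- see the Stopping slide.)
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # greedy attribute choice
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {row[best] for row in rows}:                     # one branch per attribute value
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = map(list, zip(*subset))
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return (best, branches)

# usage: tree = build_tree(list_of_record_dicts, list_of_class_labels, attribute_names)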
Decision Tree Construction • Basic idea: • Partition training examples into purer and purer subgroups • Group A is “purer” than group B if the members of A are more alike in class than the members of B • Tree constructed by recursively partitioning instances
[Figure: the entire population plotted in the Age-Balance plane, split by Age >= 45 vs Age < 45 and Balance >= 50K vs Balance < 50K into regions of bad risk (Default) and good risk (Not default)]
Attribute Selection • Which attribute should be used for a split? • Choose the attribute that best partitions the relevant population into purer groups at each decision node • Many measures of impurity exist • (Optional) Gini index • Entropy and Information Gain – most common • Use the Information Gain measure to decide which attribute to use • How informative is the attribute in distinguishing among instances from different classes? • Developed by Shannon (1948) • Ideally, also try to minimize the number of splits (nodes) in the tree • More compact • Often more accurate
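For reference, the (optional) Gini index mentioned above is another impurity measure. A small sketch using its usual definition, 1 minus the sum of squared class proportions:

from collections import Counter

def gini(labels):
    # Gini index: 1 - sum of squared class proportions; 0 means a pure node.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["default"] * 10 + ["no_default"] * 20))   # mixed node -> about 0.444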
Measurement of impurity: Entropy • Entropy(A) = −Σk pk log2(pk), summing over the m classes, where pk = proportion of cases in set A that belong to class k • Entropy ranges between 0 (most pure) and log2(m) (equal representation of classes) • Maximum value of Entropy is 1 in the binary case
Entropy (Cont’d) • Set S with p elements of class P and n elements of class N • Entropy of set S is E(S) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n)) • Here k = 2, corresponding to P and N • If p = credit worthy = 10, n = non-credit worthy = 20, E(S) = −(10/(10+20)) log2(10/(10+20)) − (20/(10+20)) log2(20/(10+20)) = 0.918296
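The calculation above can be checked with a short entropy function; the helper itself is just an illustration of the formula, not part of the slides.

import math

def entropy(counts):
    # counts: number of cases in each class; returns -sum p_k * log2(p_k).
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([10, 20]))   # credit worthy = 10, non-credit worthy = 20 -> 0.918296...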
Exercise on Entropy • Calculate the entropy: entire population with 16 Plus (p) and 14 Green (n) • What is the value when p = n = 15? • For practice: entire population with 12 Plus (p) and 18 Green (n)
Information Gain • Information gain: expected reduction in entropy • Suppose node N gets partitioned into m child nodes {c1, c2, …, cm}, given attribute A • Information Gain = Entropy Reduction = Entropy of N – weighted sum of the entropies of c1…cm (each child weighted by its share of N’s records) • Select the attribute with the highest information gain • What does “highest information gain” mean intuitively?
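In code, the gain for a candidate split is the parent’s entropy minus the entropies of the children weighted by their share of the parent’s records. The child class counts in the example call are made up purely for illustration.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # children_counts: one class-count list per child node c1..cm.
    n = sum(parent_counts)
    weighted_children = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted_children

# Hypothetical split of a node with 10 positives and 20 negatives into two children.
print(information_gain([10, 20], [[9, 3], [1, 17]]))   # about 0.41 for this made-up split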
Root Node: Try (Hair Length <= 5”)
[Split diagram: Hair Length <= 5”? with yes / no branches]
Exercise: Try (Age <= 10)
[Split diagrams: Age <= 10? and Age <= 36?, each with yes / no branches]
• Find: entropies for all the splits, and the information gain • Try later: Age <= 36
Try (Weight <= 160 lb)
[Split diagram: Weight <= 160 lb? with yes / no branches]
Root Node: Try (Weight <= 160 lb)
[Split diagram: Weight <= 160 lb? with yes / no branches]
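For a numeric attribute such as Weight, candidate thresholds are scored the same way. A small sketch that scans thresholds and keeps the one with the highest information gain; the toy weights and labels below are invented for illustration, not the slide’s data.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_for_threshold(values, labels, t):
    # Split records into value <= t vs value > t and compute the entropy reduction.
    left = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Hypothetical data: weights in lb and the class of each person.
weights = [120, 135, 150, 160, 165, 180, 190, 200]
genders = ["F", "F", "F", "F", "M", "M", "M", "M"]

best_t = max(set(weights), key=lambda t: gain_for_threshold(weights, genders, t))
print(best_t, gain_for_threshold(weights, genders, best_t))   # -> 160 and gain 1.0 for this toy data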
Building The Tree • Splitting on Weight reduces entropy the most • People with weight <= 160 are not perfectly classified, so recurse • Split on Hair Length <= 2” is the best option
[Resulting tree: Weight <= 160 lb? (no -> leaf; yes -> Hair Length <= 2”? with yes / no leaves)]
Trees Easily Converted To Rules
IF (Weight > 160 lb) THEN male
ELSE IF (Hair Length <= 2”) THEN male
ELSE female
[Tree: Weight <= 160 lb? (no -> male; yes -> Hair Length <= 2”? (yes -> male; no -> female))]
Stopping • Stopping criteria for splitting • When additional splits obtain no information gain • When maximum purity is obtained • When all attributes have been used
The Overfitting Problem • Many possible splitting rules that (near-)perfectly classify the data • May not generalize to future datasets • Particularly a problem with small datasets
[Illustration: a split such as “Wears blue?” with yes / no branches that happens to separate the training data]
Overfitting • Too many branches (think about having only one data point at each leaf …) • End up fitting noise • Effect • Great fit for training data, poor accuracy for unseen samples
Avoid Overfitting • Prepruning • Halt tree construction early • Do not split a node if the resulting improvement in purity falls below a threshold • Difficult to choose the threshold • Postpruning • Remove branches from a “fully grown” tree • Get a sequence of progressively pruned trees • Use a separate dataset to select the best pruned tree (e.g., a validation set or cross-validation)
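In practice these ideas map onto standard library knobs. A hedged sketch using scikit-learn (assuming it is available): max_depth and min_impurity_decrease act as prepruning, while cost-complexity pruning via ccp_alpha plays the role of postpruning, with the pruned tree selected on a held-out validation set.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via depth / impurity-improvement thresholds.
pre = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01).fit(X_train, y_train)

# Postpruning: compute the sequence of progressively pruned trees (one per ccp_alpha),
# then pick the one that scores best on a separate validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
candidates = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
              for a in path.ccp_alphas]
best = max(candidates, key=lambda t: t.score(X_valid, y_valid))

print(pre.score(X_valid, y_valid), best.score(X_valid, y_valid))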
Strengths & Weaknesses of Decision Trees • Strengths • Easy to understand and interpret – the tree structure specifies the entire decision logic • Easy to implement • Running time is low even with large data sets • Very popular method • Weaknesses • Volatile: small changes in the underlying data result in very different models • Cannot capture interactions between variables • Can result in large error • How can we reduce volatility? • “Bagging” (sketched below)
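A brief sketch of the bagging idea the slide points to, again assuming scikit-learn: many trees fit on bootstrap resamples, with their predictions combined by voting, which damps the volatility of any single tree. The dataset and scores here are illustrative, not results from the lecture.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=100, random_state=0)   # bags decision trees by default

# Compare cross-validated accuracy: the bagged ensemble is typically more stable.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())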