Decision Tree Learning Algorithms Sagar Kasukurthy
DECISION TREE ALGORITHMS • One of the simplest forms of machine learning. • Supervised learning – the output for the training data is known. • Takes as input a vector of attribute values and returns a single output value, the decision. • We first build a decision tree from the training data and then apply the tree to each test sample. Decision – will a new customer default on the credit card payment or not? Goal – to come up with a value, yes or no. Attributes – the random variables in the problem (here: Home Owner, Marital Status, Annual Income).
New Test Sample • Consider a person who is not a home owner, is Single and has an annual income of 94k. Would he default on the payment?
Observations on the tree • Do not need to check all attributes to make a decision. • Very intuitive. • Why is home owner the root node and not marital status?
Observations on the tree • Node Types • Root Node: no incoming edges, only outgoing edges • Example: Home Owner. • Internal Node: exactly one incoming edge and >= 2 outgoing edges. • Example: Marital Status • Leaf Node: exactly one incoming edge and no outgoing edges; each leaf carries a class label. • Example: the Yes/No class label. • Edges: represent the possible values of the attributes.
Attribute Types • Binary Attribute – two possible values • Example: Home Owner: Yes or No • Nominal Attributes – many possible values • A k-valued nominal attribute can be split into two groups in (2^(k-1) − 1) ways • Example: Marital Status: • (Single/Divorced, Married) • (Single, Divorced/Married) • (Single/Married, Divorced)
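As a quick check of the 2^(k-1) − 1 count, here is a small Python sketch (the function name binary_splits is mine, purely for illustration) that enumerates the binary groupings of a three-valued attribute:

    from itertools import combinations

    def binary_splits(values):
        """Enumerate the distinct two-group splits of a nominal attribute."""
        values = list(values)
        first, rest = values[0], values[1:]
        splits = []
        # Pin the first value to the left group so mirror-image splits
        # are not counted twice.
        for r in range(len(rest) + 1):
            for combo in combinations(rest, r):
                left = {first, *combo}
                right = set(values) - left
                if right:                 # skip the split with an empty group
                    splits.append((left, right))
        return splits

    print(binary_splits(["Single", "Divorced", "Married"]))
    # 2**(3-1) - 1 = 3 splits, matching the Marital Status example above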
Attribute Types • Ordinal Attributes – similar to nominal, except the grouping must not violate the order of the attribute values. • Example: Shirt Size with values small, medium, large and extra large. Group only in the order small, medium, large, extra large. • Continuous Attributes • Binary outcome. Example: Annual Income > 80k (Yes/No) • Range query. Example: Annual Income with branches: • <10k • 10k – 25k • 25k – 50k • 50k – 80k • >80k
Learning Algorithm • Aim: find a small tree consistent with the training examples • Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree

function DTL(examples, attributes, parent_examples) returns a decision tree
    if examples is empty then return MAJORITY_VALUE(parent_examples)
    else if all examples have the same classification then return the classification
    else if attributes is empty then return MAJORITY_VALUE(examples)
    else
        best ← CHOOSE_BEST_ATTRIBUTE(attributes, examples)
        tree ← a new decision tree with root test best
        for each value vi of best do
            examplesi ← { elements of examples with best = vi }
            subtree ← DTL(examplesi, attributes − best, examples)
            add a branch to tree with label vi and subtree subtree
        return tree
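A minimal Python sketch of the DTL procedure above, assuming each example is a dict of attribute values plus a "class" key; CHOOSE_BEST_ATTRIBUTE is filled in with the entropy-based information gain described on the following slides (all names here are illustrative, not from the slides):

    import math
    from collections import Counter

    def majority_value(examples):
        # Most common class label among the examples
        return Counter(e["class"] for e in examples).most_common(1)[0][0]

    def entropy(examples):
        n = len(examples)
        counts = Counter(e["class"] for e in examples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def information_gain(attribute, examples):
        n = len(examples)
        remainder = 0.0
        for v in {e[attribute] for e in examples}:
            subset = [e for e in examples if e[attribute] == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(examples) - remainder

    def choose_best_attribute(attributes, examples):
        return max(attributes, key=lambda a: information_gain(a, examples))

    def dtl(examples, attributes, parent_examples):
        if not examples:
            return majority_value(parent_examples)
        classes = {e["class"] for e in examples}
        if len(classes) == 1:
            return classes.pop()                    # all examples agree
        if not attributes:
            return majority_value(examples)
        best = choose_best_attribute(attributes, examples)
        tree = {best: {}}                           # root test on attribute best
        for v in {e[best] for e in examples}:
            subset = [e for e in examples if e[best] == v]
            branch = dtl(subset, [a for a in attributes if a != best], examples)
            tree[best][v] = branch                  # branch labelled with value v
        return tree

    # Tiny usage example on made-up data:
    data = [{"A": 1, "B": 0, "class": 1}, {"A": 0, "B": 1, "class": 0}]
    print(dtl(data, ["A", "B"], data))              # -> {'A': {0: 0, 1: 1}}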
CHOOSE_BEST_ATTRIBUTE • Information gain at the attribute. • Choose the attribute with the highest information gain. • Equation: Gain = I(parent) − Σj (N(vj)/N) · I(vj) • I = impurity measure • N = number of samples at the node • N(vj) = number of samples for which attribute V takes value vj.
CHOOSE_BEST_ATTRIBUTE • Impurity Measure: a measure of the goodness of a split at a node. • When is a split pure? • A split is pure if, after the split, every branch contains instances of only one class. • The measures for selecting the best split are based on the degree of impurity of the child nodes.
IMPURITY MEASURES • Entropy(t) = − Σi P(i|t) log2 P(i|t) • Gini(t) = 1 − Σi P(i|t)^2 • Misclassification error(t) = 1 − maxi P(i|t) • C = number of classes; the sums and max run over i = 1 … C • P(i|t) = fraction of records belonging to class i at node t
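The three measures can be computed directly from the class fractions P(i|t) at a node; a small Python sketch under that assumption (the function names are mine):

    import math

    def entropy(p):                    # p = list of class fractions P(i|t) at node t
        return -sum(x * math.log2(x) for x in p if x > 0)

    def gini(p):
        return 1.0 - sum(x * x for x in p)

    def classification_error(p):
        return 1.0 - max(p)

    # A pure node (all records in one class) scores 0 on every measure,
    # while an evenly mixed two-class node scores the maximum:
    # entropy 1, Gini 0.5, misclassification error 0.5.
    for p in ([1.0], [0.5, 0.5]):
        print(entropy(p), gini(p), classification_error(p))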
ENTROPY • Measure of the uncertainty of a random variable. • The more the uncertainty, the higher the entropy. • Example: a coin toss that always comes up heads. • No uncertainty, thus entropy = zero. • We gain no information by observing the value, since the value is always heads. • Entropy: H(V) = − Σk P(vk) log2 P(vk) • V = random variable • P(vk) = probability of the variable taking value vk • For a fair coin, H(Fair) = − (0.5 log2 0.5 + 0.5 log2 0.5) = 1
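For comparison, a heavily biased coin that comes up heads 99% of the time has H = −(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.08, far below the fair coin's 1 bit: less uncertainty, lower entropy.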
DECISION TREE USING ENTROPY AS IMPURITY MEASURE • Worked example: 8 training samples, each with three binary attributes A, B, C and a binary class label. Let 0 be Class 0 and 1 be Class 1.
DECISION TREE USING ENTROPY AS IMPURITY MEASURE • I(Parent) • In total 8 samples: 3 in class 0 and 5 in class 1 • I(Parent) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.95
DECISION TREE USING ENTROPY AS IMPURITY MEASURE For attribute A: A takes value 0 with 3 class 0 and 1 class 1, N(vj) = 4, N = 8; A takes value 1 with 0 class 0 and 4 class 1, N(vj) = 4, N = 8. Information gain for attribute A = 0.95 − [ (4/8)·(−(3/4) log2(3/4) − (1/4) log2(1/4)) + (4/8)·(−(4/4) log2(4/4) − (0/4) log2(0/4)) ] = 0.54 (taking 0·log2 0 = 0)
DECISION TREE USING ENTROPY AS IMPURITY MEASURE For attribute B: B takes value 0 with 2 class 0 and 2 class 1, N(vj) = 4, N = 8; B takes value 1 with 1 class 0 and 3 class 1, N(vj) = 4, N = 8. Information gain for attribute B = 0.95 − [ (4/8)·(−(2/4) log2(2/4) − (2/4) log2(2/4)) + (4/8)·(−(1/4) log2(1/4) − (3/4) log2(3/4)) ] = 0.04
DECISION TREE USING ENTROPY AS IMPURITY MEASURE For attribute C: C takes value 0 with 2 class 0 and 2 class 1, N(vj) = 4, N = 8; C takes value 1 with 1 class 0 and 3 class 1, N(vj) = 4, N = 8. Information gain for attribute C = 0.95 − [ (4/8)·(−(2/4) log2(2/4) − (2/4) log2(2/4)) + (4/8)·(−(1/4) log2(1/4) − (3/4) log2(3/4)) ] = 0.04
DECISION TREE USING ENTROPY AS IMPURITY MEASURE • The information gain for attribute A is the highest, so we use A as the root node. • When A = 1, all samples belong to class 1, so that branch becomes a leaf. • The remaining 4 samples (A = 0) still need to be split, so we compute the information gain of attributes B and C on them. • Class 0 = 3 and class 1 = 1. I(Parent) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.81
DECISION TREE USING ENTROPY AS IMPURITY MEASURE For attribute B: B takes value 0 with 2 class 0 and 0 class 1, N(vj) = 2, N = 4; B takes value 1 with 1 class 0 and 1 class 1, N(vj) = 2, N = 4. Information gain for attribute B = 0.81 − [ (2/4)·(−(2/2) log2(2/2) − (0/2) log2(0/2)) + (2/4)·(−(1/2) log2(1/2) − (1/2) log2(1/2)) ] = 0.31
DECISION TREE USING ENTROPY AS IMPURITY MEASURE For attribute C: C takes value 0 with 2 class 0 and 0 class 1, N(vj) = 2, N = 4; C takes value 1 with 1 class 0 and 1 class 1, N(vj) = 2, N = 4. Information gain for attribute C = 0.81 − [ (2/4)·(−(2/2) log2(2/2) − (0/2) log2(0/2)) + (2/4)·(−(1/2) log2(1/2) − (1/2) log2(1/2)) ] = 0.31
DECISION TREE USING ENTROPY AS IMPURITY MEASURE • B and C have the same information gain, so either one can be chosen as the next split.
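The arithmetic on the preceding slides can be reproduced from the class counts alone; a short sketch (the counts come from the example, the helper names are mine):

    import math

    def entropy_from_counts(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def gain(parent_counts, child_counts):
        n = sum(parent_counts)
        remainder = sum(sum(ch) / n * entropy_from_counts(ch) for ch in child_counts)
        return entropy_from_counts(parent_counts) - remainder

    # Full data set: 3 class-0 and 5 class-1 samples
    print(gain([3, 5], [[3, 1], [0, 4]]))   # attribute A: ~0.55
    print(gain([3, 5], [[2, 2], [1, 3]]))   # attribute B: ~0.05
    print(gain([3, 5], [[2, 2], [1, 3]]))   # attribute C: ~0.05
    # A = 0 branch only: 3 class-0 and 1 class-1 samples
    print(gain([3, 1], [[2, 0], [1, 1]]))   # attribute B: ~0.31
    print(gain([3, 1], [[2, 0], [1, 1]]))   # attribute C: ~0.31
    # The values match the slides up to rounding.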
ENTROPY, GINI • Entropy – used by the ID3, C4.5 and C5.0 algorithms • Gini index – used by the CART algorithm
OVERFITTING • The algorithm generates a large tree even when there is no real pattern in the data. • Example: the problem of predicting whether a die roll comes up 6 or not. • We carry out experiments with various dice and decide to use attributes such as the color of the die, its weight, etc. • If in the experiments a 7 gram blue die happened to come up 6, the decision tree will build a pattern on that training sample, even though the attributes are irrelevant.
REASONS FOR OVERFITTING • Choosing attributes with little meaning in an attempt to fit noisy data. • A huge number of attributes. • A small training data set.
HOW TO COMBAT OVERFITTING - PRUNING • Eliminate irrelevant nodes – nodes whose split has zero information gain. • Example: a node with 100 examples (50 Yes, 50 No) whose split leaves every branch with the same 50/50 mix.
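Checking that example with entropy: I(parent) = −(50/100) log2(50/100) − (50/100) log2(50/100) = 1, and every branch is still half Yes and half No, so each branch entropy is also 1; the weighted average of the branch entropies is therefore 1, and the information gain is 1 − 1 = 0, so the split adds nothing and the node can be pruned away.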
Problems associated with Decision Trees • Missing data • Multi-valued attributes • An attribute with many possible values may have a high information gain, but choosing it first might not yield the best tree. • Continuous attributes
Continuous attributes • Steps • Sort the records by the numeric value of the attribute. • Scan the sorted values; at each candidate split position, update the count matrix of Yes/No and compute the impurity. • Choose the split position with the least impurity. • This splitting is the most expensive part of real-world decision tree learning applications.
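A sketch of that scan for a continuous attribute such as Annual Income, using entropy as the impurity measure (the income values, labels, and function names below are made up for illustration):

    import math

    def label_entropy(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p = sum(labels) / n                       # fraction of "Yes" labels
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

    def best_threshold(values, labels):
        """Sort by attribute value, try a cut between each pair of adjacent
        distinct values, and keep the one with the lowest weighted impurity."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (None, float("inf"))
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                          # no cut between equal values
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            impurity = (len(left) * label_entropy(left)
                        + len(right) * label_entropy(right)) / n
            if impurity < best[1]:
                best = (threshold, impurity)
        return best

    # Hypothetical (annual income in k, defaulted?) records
    incomes = [60, 70, 75, 85, 90, 95, 100, 120]
    default = [1, 1, 1, 0, 1, 0, 0, 0]
    print(best_threshold(incomes, default))       # -> (92.5, ~0.45) on this toy data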
References • Artificial Intelligence: A Modern Approach, third edition, by Russell and Norvig. • Video lecture – Prof. P. Dasgupta, Dept. of Computer Science, IIT Kharagpur. • Neural networks course, classroom lecture on decision trees – Dr. Eun Youn, Texas Tech University, Lubbock.