Data Mining using Decision Trees Professor J. F. Baldwin
Decision Trees from Data Base

Ex    Att     Att      Att      Concept
Num   Size    Colour   Shape    Satisfied
1     med     blue     brick    yes
2     small   red      wedge    no
3     small   red      sphere   yes
4     large   red      wedge    no
5     large   green    pillar   yes
6     large   red      pillar   no
7     large   green    sphere   yes

Choose target: Concept Satisfied
Use all attributes except Ex Num
CLS - Concept Learning System - Hunt et al.

Tree Structure: a parent node holding a mixture of +ve and -ve examples is split on an attribute V. Each value of V (v1, v2, v3) leads to one child node.
CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values on F. Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
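The steps above can be sketched in a few lines of Python. This is only an illustration: the function name `cls`, the tuple-based tree representation, and the idea of passing a fixed attribute order are assumptions made for the sketch, since CLS itself leaves the choice of attribute open.

```python
# Training set from the slides: Ex Num -> (Size, Colour, Shape, Satisfied)
EXAMPLES = {
    1: ("med", "blue", "brick", "yes"),
    2: ("small", "red", "wedge", "no"),
    3: ("small", "red", "sphere", "yes"),
    4: ("large", "red", "wedge", "no"),
    5: ("large", "green", "pillar", "yes"),
    6: ("large", "red", "pillar", "no"),
    7: ("large", "green", "sphere", "yes"),
}
COLUMN = {"SIZE": 0, "COLOUR": 1, "SHAPE": 2}


def cls(ids, attr_order):
    """Build a tree for the examples in `ids` (hypothetical helper).

    Returns "YES", "NO", or a tuple (attribute, {value: subtree}).
    `attr_order` is an assumed, caller-supplied order of attributes.
    """
    labels = {EXAMPLES[i][-1] for i in ids}
    if labels == {"yes"}:          # step 2: all +ve
        return "YES"
    if labels == {"no"}:           # step 3: all -ve
        return "NO"
    if not attr_order:             # guard added for the sketch only
        raise ValueError("examples are mixed but no attributes remain")
    attr, remaining = attr_order[0], attr_order[1:]
    partition = {}                 # step 4: split on the chosen attribute
    for i in ids:
        partition.setdefault(EXAMPLES[i][COLUMN[attr]], []).append(i)
    # step 5: recurse into each child node
    return (attr, {v: cls(sub, remaining) for v, sub in partition.items()})


tree = cls(sorted(EXAMPLES), ["SIZE", "SHAPE", "COLOUR"])
print(tree)
```

With the order SIZE, SHAPE, COLOUR the sketch splits first on SIZE, then on SHAPE, and uses COLOUR only for the large pillars.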
Data Base Example

Using attribute SIZE

{1, 2, 3, 4, 5, 6, 7} SIZE
    med   -> {1} YES
    small -> {2, 3} Expand
    large -> {4, 5, 6, 7} Expand
Expanding

{1, 2, 3, 4, 5, 6, 7} SIZE
    med   -> {1} YES
    small -> {2, 3} SHAPE
        wedge  -> {2} NO
        sphere -> {3} YES
    large -> {4, 5, 6, 7} SHAPE
        wedge  -> {4} NO
        sphere -> {7} YES
        pillar -> {5, 6} COLOUR
            red   -> {6} NO
            green -> {5} YES
Rules from Tree

IF (SIZE = large AND (SHAPE = wedge OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES
Disjunctive Normal Form - DNF

IF (SIZE = medium)
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied
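As a quick check, the DNF rule can be written as a Boolean function and run against the seven training examples. The function name `satisfied` is made up for this sketch, and the value "med" follows the spelling used in the data table rather than "medium".

```python
# Training set from the slides: (Size, Colour, Shape, Satisfied)
EXAMPLES = [
    ("med", "blue", "brick", "yes"),
    ("small", "red", "wedge", "no"),
    ("small", "red", "sphere", "yes"),
    ("large", "red", "wedge", "no"),
    ("large", "green", "pillar", "yes"),
    ("large", "red", "pillar", "no"),
    ("large", "green", "sphere", "yes"),
]


def satisfied(size, colour, shape):
    """DNF rule: one disjunct per YES leaf of the tree."""
    return (
        size == "med"
        or (size == "small" and shape == "sphere")
        or (size == "large" and shape == "sphere")
        or (size == "large" and shape == "pillar" and colour == "green")
    )


# Compare the rule's answer with the recorded concept for every example.
agreement = [
    satisfied(size, colour, shape) == (concept == "yes")
    for size, colour, shape, concept in EXAMPLES
]
print(all(agreement))
```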
ID3 - Quinlan

Attributes are chosen in any order for the CLS algorithm. This can result in large decision trees if the ordering is not optimal. An optimal ordering would result in the smallest decision tree.
No efficient method is known for determining the optimal ordering, so we use a heuristic that gives a near-optimal ordering.

ID3 = CLS + efficient ordering of attributes

Entropy is used to order the attributes.
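Entropy

For a random variable V which can take values {v1, v2, …, vn} with Pr(vi) = pi, all i, the entropy of V is given by

$$S(V) = -\sum_{i=1}^{n} p_i \ln p_i$$

Entropy for a fair dice = $-\sum_{i=1}^{6} \tfrac{1}{6} \ln \tfrac{1}{6} = \ln 6$ = 1.7917
Entropy for a fair dice with even score = $-\sum_{i=1}^{3} \tfrac{1}{3} \ln \tfrac{1}{3} = \ln 3$ = 1.0986

Information gain is the difference between entropies:
Information gain = 1.7917 - 1.0986 = 0.6931

These figures use natural logarithms. The worked data base example uses logarithms to base 2, which changes the scale but not the ordering of attributes.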
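The dice figures can be reproduced with a short script. This is a minimal sketch; the helper name `entropy` and its `base` parameter are choices made for the illustration.

```python
import math


def entropy(probs, base=math.e):
    """Entropy of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


fair_dice = entropy([1 / 6] * 6)    # six equally likely scores
even_score = entropy([1 / 3] * 3)   # told the score is even: 2, 4 or 6
gain = fair_dice - even_score       # information gained from "score is even"

print(round(fair_dice, 4), round(even_score, 4), round(gain, 4))
```

Passing `base=2` gives the same quantities in bits, which is the scale the data base example appears to use.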
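Attribute Expansion

Expand attribute Ai, which has values ai1, …, aim.

The parent node holds the joint distribution Pr(A1, …, Ai, …, An, T) over all attributes and the target T. Examples are equally likely unless specified.

The child node for value ai1 holds the distribution over the attributes except Ai and the target T:

Pr(A1, …, Ai-1, Ai+1, …, An, T | Ai = ai1)

Pass the probabilities corresponding to ai1 down from the parent and re-normalise. The examples are equally likely again if they were equally likely before. The same applies to each of the other values up to aim.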
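Expected Entropy for an Attribute

Consider attribute Ai and target T. For each value aij of Ai, pass down the probabilities of the target values tk corresponding to aij and re-normalise to obtain Pr(T | Ai = aij). The entropy of this distribution is

$$S(a_{ij}) = -\sum_{k} \Pr(t_k \mid A_i = a_{ij}) \log \Pr(t_k \mid A_i = a_{ij})$$

Expected Entropy for Ai:

$$S(A_i) = \sum_{j=1}^{m} \Pr(A_i = a_{ij})\, S(a_{ij})$$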
How to choose attribute and Information gain

Determine the expected entropy for each attribute, i.e. S(Ai), all i.

Choose s such that

$$S(A_s) = \min_i S(A_i)$$

Expand attribute As.

By choosing attribute As the information gain is S - S(As), where S is the entropy of the target before expansion:

$$S = -\sum_{k} \Pr(t_k) \log \Pr(t_k)$$

Minimising expected entropy is equivalent to maximising information gain.
Previous Example

Ex    Att     Att      Att      Concept
Num   Size    Colour   Shape    Satisfied   Pr
1     med     blue     brick    yes         1/7
2     small   red      wedge    no          1/7
3     small   red      sphere   yes         1/7
4     large   red      wedge    no          1/7
5     large   green    pillar   yes         1/7
6     large   red      pillar   no          1/7
7     large   green    sphere   yes         1/7

Concept
Satisfied   Pr
yes         4/7
no          3/7

S = -(4/7)Log(4/7) - (3/7)Log(3/7) = 0.99 (logarithms to base 2)
Entropy for attribute Size

Att     Concept
Size    Satisfied   Pr
med     yes         1/7
small   no          1/7
small   yes         1/7
large   no          2/7
large   yes         2/7

Pr(med) = 1/7:    Pr(yes) = 1                    S(med) = 0
Pr(small) = 2/7:  Pr(no) = 1/2, Pr(yes) = 1/2    S(small) = 1
Pr(large) = 4/7:  Pr(no) = 1/2, Pr(yes) = 1/2    S(large) = 1

S(Size) = (2/7)1 + (1/7)0 + (4/7)1 = 6/7 = 0.86

Information Gain for Size = 0.99 - 0.86 = 0.13
First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.7   (max, choose)

{1, 2, 3, 4, 5, 6, 7} SHAPE
    wedge  -> {2, 4} NO
    brick  -> {1} YES
    sphere -> {3, 7} YES
    pillar -> {5, 6} Expand
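The three information gains can be recomputed directly from the training table. The sketch below assumes base-2 logarithms, which is what reproduces the slide values; the helper names `entropy`, `expected_entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter, defaultdict

# Training set from the slides: (Size, Colour, Shape, Satisfied)
ROWS = [
    ("med", "blue", "brick", "yes"),
    ("small", "red", "wedge", "no"),
    ("small", "red", "sphere", "yes"),
    ("large", "red", "wedge", "no"),
    ("large", "green", "pillar", "yes"),
    ("large", "red", "pillar", "no"),
    ("large", "green", "sphere", "yes"),
]
COLUMN = {"SIZE": 0, "COLOUR": 1, "SHAPE": 2}


def entropy(labels):
    """Entropy in bits of a list of class labels (equally likely examples)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def expected_entropy(rows, col):
    """Weighted average of the target entropy within each attribute value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain(rows, col):
    return entropy([row[-1] for row in rows]) - expected_entropy(rows, col)


gains = {name: information_gain(ROWS, col) for name, col in COLUMN.items()}
best = max(gains, key=gains.get)   # attribute to expand first
print({name: round(g, 2) for name, g in gains.items()}, best)
```

Applying the same functions to the two pillar examples {5, 6} would select COLOUR for the second expansion, since it separates them completely.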
Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7} SHAPE
    wedge  -> {2, 4} NO
    brick  -> {1} YES
    sphere -> {3, 7} YES
    pillar -> {5, 6} COLOUR
        green -> {5} YES
        red   -> {6} NO

Rule: IF Shape is wedge OR (Shape is pillar AND Colour is red) THEN NO ELSE YES
A new case

Att     Att      Att      Concept
Size    Colour   Shape    Satisfied
med     red      pillar   ?

SHAPE = pillar, then COLOUR = red, so ? = NO
Post Pruning

Consider any node S containing N examples. Let C be the class with the most examples in S, i.e. the majority class, with n cases of C. C is one of {YES, NO}.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

E(S) = Pr(class of new case is a class ≠ C)
Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1].

We can update this prior using Bayes' updating with the information at node S. The information at node S is that n of its examples are of class C ("n C in S").

$$f(p \mid n\ C \text{ in } S) = \frac{\Pr(n\ C \text{ in } S \mid p)\, f(p)}{\int_0^1 \Pr(n\ C \text{ in } S \mid p)\, f(p)\, dp}$$
Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

$$f(p \mid n\ C \text{ in } S) = \frac{p^{n} (1-p)^{N-n}}{\int_0^1 p^{n} (1-p)^{N-n}\, dp}$$

The integrals are evaluated using Beta functions:

$$\int_0^1 x^{a} (1-x)^{b}\, dx = \frac{a!\, b!}{(a+b+1)!}$$

The expected error is the expectation of (1 - p) under f(p | n C in S):

$$E(S) = \frac{\int_0^1 p^{n} (1-p)^{N-n+1}\, dp}{\int_0^1 p^{n} (1-p)^{N-n}\, dp} = \frac{n!\,(N-n+1)!\,/\,(N+2)!}{n!\,(N-n)!\,/\,(N+1)!} = \frac{N-n+1}{N+2}$$
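The closed form can be checked with exact rational arithmetic. The sketch below evaluates the two integrals through the factorial identity for the Beta function and compares their ratio with (N - n + 1)/(N + 2); the function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of x^a (1-x)^b over [0, 1] for non-negative integers."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def expected_error(N, n):
    """E(S) as the ratio of the two integrals, with a uniform prior on p."""
    return beta_integral(n, N - n + 1) / beta_integral(n, N - n)


# The ratio agrees with the closed form for every small N and n.
closed_form_holds = all(
    expected_error(N, n) == Fraction(N - n + 1, N + 2)
    for N in range(0, 12)
    for n in range(0, N + 1)
)
print(expected_error(10, 6), closed_form_holds)
```

For a node with N = 10 examples of which n = 6 belong to the majority class, the sketch gives 5/12, about 0.417.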
Post Pruning for Binary Case

Let S be a node with child nodes S1, S2, …, Sm, and let

Pi = (Num of examples in Si) / (Num of examples in S)

For leaf nodes Si: Error(Si) = E(Si)

For any node S which is not a leaf node we can calculate

$$\text{BackUpError}(S) = \sum_{i} P_i\, \text{Error}(S_i)$$

$$\text{Error}(S) = \min\{E(S),\ \text{BackUpError}(S)\}$$

Decision: Prune at S if BackUpError(S) ≥ E(S)
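Example of Post Pruning

[x, y] means x YES cases and y NO cases. Each non-leaf node lists E(S) and then BackUpError(S); Error(Sk) is the smaller of the two (underlined on the slide). PRUNE means cut the sub-tree below this point.

Before Pruning

a [6, 4]  E = 0.417  BackUpError = 0.378
    b [4, 2]  E = 0.375  BackUpError = 0.413  PRUNE
        [3, 2]  E = 0.429
        [1, 0]  E = 0.333
    c [2, 2]  E = 0.5  BackUpError = 0.383
        d [1, 2]  E = 0.4  BackUpError = 0.444  PRUNE
            [1, 1]  E = 0.5
            [0, 1]  E = 0.333
        [1, 0]  E = 0.333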
Result of Pruning

After Pruning

a [6, 4]
    [4, 2]
    c [2, 2]
        [1, 2]
        [1, 0]
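The pruning example can be replayed in code. This is a sketch under two assumptions: nodes are represented as hypothetical tuples (yes, no, children), and a node is pruned when its backed-up error is at least its static error E(S) = (N - n + 1)/(N + 2).

```python
def static_error(yes, no):
    """E(S) for a node with `yes` YES cases and `no` NO cases."""
    N, n = yes + no, max(yes, no)
    return (N - n + 1) / (N + 2)


def prune(node):
    """Return (Error(S), pruned node) for node = (yes, no, children)."""
    yes, no, children = node
    e = static_error(yes, no)
    if not children:                       # leaf: Error(S) = E(S)
        return e, node
    N = yes + no
    results = [prune(child) for child in children]
    # BackUpError(S): child errors weighted by share of examples
    backup = sum(
        (child[0] + child[1]) / N * err
        for child, (err, _) in zip(children, results)
    )
    if backup >= e:                        # pruning is no worse: make S a leaf
        return e, (yes, no, [])
    return backup, (yes, no, [sub for _, sub in results])


# Tree from the "Before Pruning" slide
d = (1, 2, [(1, 1, []), (0, 1, [])])
c = (2, 2, [d, (1, 0, [])])
b = (4, 2, [(3, 2, []), (1, 0, [])])
a = (6, 4, [b, c])

error_a, pruned = prune(a)
print(round(error_a, 3), pruned)
```

The sub-trees below b and d are removed, while a and c keep their children because their backed-up errors are smaller than their static errors.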
Generalisation

For the case in which we have k classes, the generalisation for E(S) is

$$E(S) = \frac{N - n + k - 1}{N + k}$$

Otherwise, the pruning method is the same.
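A small helper makes the k-class formula concrete and confirms that k = 2 gives back the binary result. The function name and the default argument are choices made for this sketch.

```python
from fractions import Fraction


def expected_error(N, n, k=2):
    """E(S) for N examples, n in the majority class, and k classes."""
    return Fraction(N - n + k - 1, N + k)


binary = expected_error(10, 6)         # same as (N - n + 1) / (N + 2)
three_class = expected_error(10, 6, k=3)
empty_node = expected_error(0, 0, k=3)  # no data: two of three classes are wrong
print(binary, three_class, empty_node)
```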
Testing

Split the data base into a Training Set and a Test Set.

1. Learn rules using the Training Set and prune.
2. Test the rules on the Training Set and record % correct.
3. Test the rules on the Test Set and record % correct.

The % accuracy on the Test Set should be close to that on the Training Set. This indicates good generalisation.

Over-fitting can occur if noisy data is used or if attributes that are too specific are used. Pruning will overcome noise to some extent but not completely. Attributes that are too specific must be dropped.