
Data Mining CSCI 307, Spring 2019 Lecture 15

This lecture covers the properties required of a purity measure when constructing decision trees, works through the information-gain calculations for the weather data, and introduces the gain ratio as a correction for highly branching attributes.

Presentation Transcript


  1. Data Mining CSCI 307, Spring 2019, Lecture 15: Constructing Trees

  2. Wishlist for a Purity Measure
Properties we require from a purity measure:
• When a node is pure, the measure should be zero
• When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
• The measure should obey the multistage property (i.e. decisions can be made in several stages): make the first decision, then decide on the second case, so the overall decision is made in two stages:
measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
Entropy is the only function that satisfies all three properties.
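A minimal Python sketch (not from the slides; the entropy() helper name is an assumption) that checks the multistage property numerically:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class-count list, e.g. [2, 3, 4]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# measure([2,3,4]) should equal measure([2,7]) + (7/9) x measure([3,4])
lhs = entropy([2, 3, 4])
rhs = entropy([2, 7]) + (7 / 9) * entropy([3, 4])
print(round(lhs, 3), round(rhs, 3))   # both print 1.53
```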

  3. Example: attribute Outlook
Outlook    Yes  No
Sunny       2    3
Overcast    4    0
Rainy       3    2
Outlook = Sunny: class counts [2,3]
Outlook = Overcast: class counts [4,0]
Outlook = Rainy: class counts [3,2]

  4. Example: attribute Outlook
Outlook = Sunny: info([2,3]) = 0.971 bits
Outlook = Overcast: info([4,0]) = 0 bits
Outlook = Rainy: info([3,2]) = 0.971 bits
Expected information for the attribute:
info([2,3],[4,0],[3,2]) = 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693 bits

  5. Computing Information Gain
Information gain = information before splitting – information after splitting.
We've calculated the information BEFORE splitting and the information AFTER the split for the Outlook attribute, so we can calculate the information gain for Outlook:
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits
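A short sketch (helper names assumed, not part of the lecture) reproducing the before-split information, the expected information after splitting on Outlook, and their difference:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])                                  # info([9,5]) = 0.940 bits
branches = [[2, 3], [4, 0], [3, 2]]                       # sunny, overcast, rainy
after = sum(sum(b) / 14 * entropy(b) for b in branches)   # info([2,3],[4,0],[3,2]) = 0.693 bits
print(round(before - after, 3))                           # gain(Outlook) = 0.247 bits
```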

  6. attribute: Temperature
Temperature  Yes  No
Hot           2    2
Mild          4    2
Cool          3    1
Temperature = Hot: info([2,2]) = entropy(2/4, 2/4) = −2/4 log2(2/4) − 2/4 log2(2/4) = 1 bit
Temperature = Mild: info([4,2]) = entropy(4/6, 2/6) = −2/3 log(2/3) − 1/3 log(1/3) = 0.918 bits
Temperature = Cool: info([3,1]) = entropy(3/4, 1/4) = −3/4 log(3/4) − 1/4 log(1/4) = 0.811 bits

  7. attribute: Temperature
Temperature = Hot: info([2,2]) = 1 bit
Temperature = Mild: info([4,2]) = 0.918 bits
Temperature = Cool: info([3,1]) = 0.811 bits
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch:
info([2,2],[4,2],[3,1]) = 4/14 x 1 + 6/14 x 0.918 + 4/14 x 0.811 = 0.911 bits
gain(Temperature) = info([9,5]) – info([2,2],[4,2],[3,1]) = 0.940 – 0.911 = 0.029 bits

  8. attribute: Humidity
Humidity  Yes  No
High       3    4
Normal     6    1
Humidity = High: info([3,4]) = entropy(3/7, 4/7) = −3/7 log2(3/7) − 4/7 log2(4/7) = 0.985 bits
Humidity = Normal: info([6,1]) = entropy(6/7, 1/7) = −6/7 log(6/7) − 1/7 log(1/7) = 0.592 bits

  9. attribute: Humidity
Humidity = High: info([3,4]) = 0.985 bits
Humidity = Normal: info([6,1]) = 0.592 bits
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch:
info([3,4],[6,1]) = 7/14 x 0.985 + 7/14 x 0.592 = 0.788 bits
gain(Humidity) = info([9,5]) – info([3,4],[6,1]) = 0.940 – 0.788 = 0.152 bits

  10. attribute: Windy
Windy  Yes  No
False   6    2
True    3    3
Windy = False: info([6,2]) = entropy(6/8, 2/8) = −6/8 log2(6/8) − 2/8 log2(2/8) = 0.811 bits
Windy = True: info([3,3]) = entropy(3/6, 3/6) = −3/6 log(3/6) − 3/6 log(3/6) = 1 bit

  11. attribute: Windy
Windy = False: info([6,2]) = 0.811 bits
Windy = True: info([3,3]) = 1 bit
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch:
info([6,2],[3,3]) = 8/14 x 0.811 + 6/14 x 1 = 0.892 bits
gain(Windy) = info([9,5]) – info([6,2],[3,3]) = 0.940 – 0.892 = 0.048 bits

  12. Which Attribute to Select as Root?
For all the attributes from the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
Outlook is the way to go ... it's the root.
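As a check on the four numbers above, a small sketch (function names are assumptions; the branch counts come from the tables earlier in the lecture) that recomputes each gain and picks the root:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, branches):
    n = sum(parent)
    after = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(parent) - after

splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # sunny, overcast, rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # hot, mild, cool
    "Humidity":    [[3, 4], [6, 1]],           # high, normal
    "Windy":       [[6, 2], [3, 3]],           # false, true
}
for name, branches in splits.items():
    print(name, round(gain([9, 5], branches), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
print(max(splits, key=lambda a: gain([9, 5], splits[a])))  # Outlook -> the root
```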

  13. Continuing to Split
Now determine the gain for EACH of Outlook's branches: sunny, overcast, and rainy. For the sunny branch, we know at this point the entropy is 0.971; it is our "before" split information as we calculate the gain from here. The rainy branch entropy is also 0.971; use it as our "before" split information as we calculate the gain from there on down. Splitting stops when we can't split any further; that is the case with the value overcast. We don't need to consider Outlook further.

  14. Continuing the Split at Sunny
Now we must determine the gain for each of Outlook's branches. For the sunny branch we know the "before" split entropy is 0.971.

  15. Find Subroot for Sunny
humidity = high: info([0,3]) = entropy(0,1) = 0 bits
humidity = normal: info([2,0]) = entropy(1,0) = 0 bits
info([0,3],[2,0]) = 3/5 x 0 + 2/5 x 0 = 0 bits
gain(Humidity) = info([2,3]) – info([0,3],[2,0]) = 0.971 – 0 = 0.971 bits

  16. Find Subroot for Sunny (continued)
windy = false: info([1,2]) = entropy(1/3, 2/3) = −⅓ log(⅓) − ⅔ log(⅔) = 0.918 bits
windy = true: info([1,1]) = entropy(1/2, 1/2) = −½ log(½) − ½ log(½) = 1 bit
info([1,2],[1,1]) = 3/5 x 0.918 + 2/5 x 1 = 0.951 bits
gain(Windy) = info([2,3]) – info([1,2],[1,1]) = 0.971 – 0.951 = 0.020 bits

  17. Find Subroot for Sunny (continued)
temperature = hot: info([0,2]) = entropy(0,1) = −0 log(0) − 1 log(1) = 0 bits
temperature = mild: info([1,1]) = entropy(1/2, 1/2) = −½ log(½) − ½ log(½) = 1 bit
temperature = cool: info([1,0]) = entropy(1,0) = −1 log(1) − 0 log(0) = 0 bits
info([0,2],[1,1],[1,0]) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0 + 0.4 + 0 = 0.4 bits
gain(Temperature) = info([2,3]) – info([0,2],[1,1],[1,0]) = 0.971 – 0.4 = 0.571 bits

  18. Finish the Split at Sunny
gain(Humidity) = 0.971 bits
gain(Temperature) = 0.571 bits
gain(Windy) = 0.020 bits
Humidity has the largest gain, so it becomes the subroot for the sunny branch.
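The same kind of sketch, restricted to the five sunny instances (parent class counts [2,3]; helper names assumed), reproduces these three gains:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, branches):
    n = sum(parent)
    return entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)

sunny = {
    "Humidity":    [[0, 3], [2, 0]],           # high, normal
    "Temperature": [[0, 2], [1, 1], [1, 0]],   # hot, mild, cool
    "Windy":       [[1, 2], [1, 1]],           # false, true
}
for name, branches in sunny.items():
    print(name, round(gain([2, 3], branches), 3))
# Humidity 0.971, Temperature 0.571, Windy 0.02 -> Humidity becomes the subroot
```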

  19. Possible Splits at Rainy
No need to actually do the calculations: splitting on windy produces pure subsets, so it gives the maximum possible gain.
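A quick numeric check of that claim, assuming the standard weather-data counts at the rainy branch ([3,0] for windy = false and [0,2] for windy = true; the entropy() helper is an assumption):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    s = sum(c / total * log2(c / total) for c in counts if c > 0)
    return -s if s else 0.0

# At rainy the parent counts are [3, 2]; splitting on windy gives [3, 0] and [0, 2].
print(entropy([3, 0]), entropy([0, 2]))   # 0.0 0.0 -- both branches are pure
print(round(entropy([3, 2]), 3))          # 0.971 -- so gain(Windy) is the full 0.971 bits
```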

  20. Final Decision Tree
Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data can't be split any further.

  21. Highly-branching Attributes
• Problematic: attributes with a large number of values (extreme case: ID code)
• Subsets are more likely to be pure if there is a large number of values
• Information gain is biased towards choosing attributes with a large number of values
• This may result in overfitting (selection of an attribute that is non-optimal for prediction)
• Another problem: fragmentation

  22. Tree Stump for the ID code Attribute
This seems like a bad idea for a split. Entropy of the split:
info(ID code) = info([0,1]) + info([0,1]) + info([1,0]) + ... + info([1,0]) + info([0,1]) = 0 bits
Information gain is maximal for ID code (namely 0.940 bits, i.e. the before-split information).
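A short sketch of the ID-code case (the ID code attribute is hypothetical, as on the slide; helper names are assumptions): every branch holds a single instance, so the expected information after the split is 0 and the gain equals the full before-split information:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    s = sum(c / total * log2(c / total) for c in counts if c > 0)
    return -s if s else 0.0

# One [yes, no] pair per ID-code branch: 9 single yes-instances, 5 single no-instances.
branches = [[1, 0]] * 9 + [[0, 1]] * 5
after = sum(sum(b) / 14 * entropy(b) for b in branches)   # 0 bits: every branch is pure
print(round(entropy([9, 5]) - after, 3))                  # 0.94 -- the whole before-split information
```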

  23. Gain Ratio
• Gain ratio: a modification of the information gain that reduces its bias
• Gain ratio takes the number and size of branches into account when choosing an attribute
• It corrects the information gain by taking the intrinsic information of a split into account
• Intrinsic information: the entropy of the distribution of instances into branches (i.e. how much information we need to tell which branch an instance belongs to)

  24. Computing the Gain Ratio
Example: intrinsic information for ID code:
info([1,1,...,1]) = 14 x (−1/14 x log(1/14)) = 3.807 bits
The value of an attribute decreases as its intrinsic information gets larger.
Definition of gain ratio:
gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
Example:
gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
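A sketch of the same arithmetic (the entropy() helper name is an assumption): the intrinsic information is the entropy of the branch-size distribution, and the gain ratio divides the gain by it, penalising many-valued attributes such as the ID code:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

gain_id = 0.940                        # gain(ID code) from the previous slide
intrinsic = entropy([1] * 14)          # info([1,1,...,1]) = log2(14) = 3.807 bits
print(round(intrinsic, 3))             # 3.807
print(round(gain_id / intrinsic, 3))   # 0.247 (the slide rounds this to 0.246)
```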
