
Alternative measures for selecting attributes


Presentation Transcript


1. Alternative measures for selecting attributes
• Recall the intuition behind the information gain measure:
• We want to choose the attribute that, by itself, does the most work in classifying the training examples.
• So we measure how much information is gained (or how much entropy decreases) when that attribute's value is known.
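A minimal sketch in Python of how this measure could be computed; the function names and the dictionary-per-example data layout below are my own choices, not from the course code:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy (in bits) of a collection of class labels."""
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attribute, target="PlayTennis"):
        """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
        total = len(examples)
        before = entropy([ex[target] for ex in examples])
        after = 0.0
        for value in {ex[attribute] for ex in examples}:
            subset = [ex[target] for ex in examples if ex[attribute] == value]
            after += len(subset) / total * entropy(subset)
        return before - after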

2. However, the information gain measure favors attributes with many values.
• Extreme example: suppose we add an attribute “Date” to each training example, where every training example has a different date.

3.
Day   Date   Outlook    Temp   Humidity   Wind     PlayTennis
D1    3/1    Sunny      Hot    High       Weak     No
D2    3/2    Sunny      Hot    High       Strong   No
D3    3/3    Overcast   Hot    High       Weak     Yes
D4    3/4    Rain       Mild   High       Weak     Yes
D5    3/5    Rain       Cool   Normal     Weak     Yes
D6    3/6    Rain       Cool   Normal     Strong   No
D7    3/7    Overcast   Cool   Normal     Strong   Yes
D8    3/8    Sunny      Mild   High       Weak     No
D9    3/9    Sunny      Cool   Normal     Weak     Yes
D10   3/10   Rain       Mild   Normal     Weak     Yes
D11   3/11   Sunny      Mild   Normal     Strong   Yes
D12   3/12   Overcast   Mild   High       Strong   Yes
D13   3/13   Overcast   Hot    Normal     Weak     Yes
D14   3/14   Rain       Mild   High       Strong   No

Gain(S, Outlook) = 0.94 - 0.694 = 0.246. What is Gain(S, Date)?
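Using the helper functions sketched above (again my own naming), we can transcribe this table and check both numbers. Every Date value is unique, so splitting on Date leaves each subset perfectly classified, and its "gain" equals the full entropy of S:

    # Rows transcribed from the table above: (Day, Date, Outlook, Temp, Humidity, Wind, PlayTennis).
    rows = [
        ("D1",  "3/1",  "Sunny",    "Hot",  "High",   "Weak",   "No"),
        ("D2",  "3/2",  "Sunny",    "Hot",  "High",   "Strong", "No"),
        ("D3",  "3/3",  "Overcast", "Hot",  "High",   "Weak",   "Yes"),
        ("D4",  "3/4",  "Rain",     "Mild", "High",   "Weak",   "Yes"),
        ("D5",  "3/5",  "Rain",     "Cool", "Normal", "Weak",   "Yes"),
        ("D6",  "3/6",  "Rain",     "Cool", "Normal", "Strong", "No"),
        ("D7",  "3/7",  "Overcast", "Cool", "Normal", "Strong", "Yes"),
        ("D8",  "3/8",  "Sunny",    "Mild", "High",   "Weak",   "No"),
        ("D9",  "3/9",  "Sunny",    "Cool", "Normal", "Weak",   "Yes"),
        ("D10", "3/10", "Rain",     "Mild", "Normal", "Weak",   "Yes"),
        ("D11", "3/11", "Sunny",    "Mild", "Normal", "Strong", "Yes"),
        ("D12", "3/12", "Overcast", "Mild", "High",   "Strong", "Yes"),
        ("D13", "3/13", "Overcast", "Hot",  "Normal", "Weak",   "Yes"),
        ("D14", "3/14", "Rain",     "Mild", "High",   "Strong", "No"),
    ]
    columns = ["Day", "Date", "Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]
    examples = [dict(zip(columns, row)) for row in rows]

    print(f"{information_gain(examples, 'Outlook'):.3f}")  # 0.247 (the slide's 0.94 - 0.694 = 0.246, up to rounding)
    print(f"{information_gain(examples, 'Date'):.3f}")     # 0.940 = Entropy(S): every Date subset is pure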

4. Date will be chosen as the root of the tree.
• But of course the resulting tree will not generalize to new examples.

5. Gain Ratio
• Quinlan proposed another method of selecting attributes, called “gain ratio”.
• Suppose attribute A splits the training data S into m subsets. Call the subsets S1, S2, ..., Sm.
• We can define the set of proportions {|S1|/|S|, |S2|/|S|, ..., |Sm|/|S|}.
• The Penalty Term is the entropy of this set:
  SplitInformation(S, A) = - sum over i = 1..m of (|Si|/|S|) log2(|Si|/|S|)
• The gain ratio divides the information gain by this penalty:
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
• For example: What is the Penalty Term for the “Date” attribute? How about for “Outlook”?

6.
Day   Date   Outlook    Temp   Humidity   Wind     PlayTennis
D1    3/1    Sunny      Hot    High       Weak     No
D2    3/2    Sunny      Hot    High       Strong   No
D3    3/3    Overcast   Hot    High       Weak     Yes
D4    3/4    Rain       Mild   High       Weak     Yes
D5    3/5    Rain       Cool   Normal     Weak     Yes
D6    3/6    Rain       Cool   Normal     Strong   No
D7    3/7    Overcast   Cool   Normal     Strong   Yes
D8    3/8    Sunny      Mild   High       Weak     No
D9    3/9    Sunny      Cool   Normal     Weak     Yes
D10   3/10   Rain       Mild   Normal     Weak     Yes
D11   3/11   Sunny      Mild   Normal     Strong   Yes
D12   3/12   Overcast   Mild   High       Strong   Yes
D13   3/13   Overcast   Hot    Normal     Weak     Yes
D14   3/14   Rain       Mild   High       Strong   No
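One way to answer the question from slide 5, reusing the imports, entropy helper, and examples list from the earlier sketches (function names are again my own): the penalty term is just the entropy of the subset proportions that an attribute creates.

    def split_information(examples, attribute):
        """Penalty term: entropy of the proportions |Si|/|S| produced by splitting on the attribute."""
        total = len(examples)
        counts = Counter(ex[attribute] for ex in examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def gain_ratio(examples, attribute):
        """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
        return information_gain(examples, attribute) / split_information(examples, attribute)

    print(f"{split_information(examples, 'Date'):.3f}")     # 3.807 = log2(14): fourteen singleton subsets
    print(f"{split_information(examples, 'Outlook'):.3f}")  # 1.577: entropy of {5/14, 4/14, 5/14}

Date's many values give it a much larger penalty term than Outlook's, which is exactly what the gain ratio uses to discount many-valued attributes.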

7. UCI ML Repository
• http://archive.ics.uci.edu/ml/
• http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
• optdigits-pictures
• optdigits.info
• optdigits.names

8. Homework 1
• How to download the homework and data
• Demo of C4.5
• Accounts on Linuxlab?
• How to get to the Linux Lab
• Need help on Linux?
• Newer version, C5.0: http://www.rulequest.com/see5-info.html
