Alternative measures for selecting attributes
• Recall the intuition behind the information gain measure:
  • We want to choose the attribute that does the most work in classifying the training examples by itself.
  • So we measure how much information is gained (or how much entropy decreases) if that attribute is known.
• However, the information gain measure favors attributes with many values.
• Extreme example: Suppose that we add an attribute “Date” to each training example, and each training example has a different date.
Day  Date  Outlook   Temp  Humidity  Wind    PlayTennis
D1   3/1   Sunny     Hot   High      Weak    No
D2   3/2   Sunny     Hot   High      Strong  No
D3   3/3   Overcast  Hot   High      Weak    Yes
D4   3/4   Rain      Mild  High      Weak    Yes
D5   3/5   Rain      Cool  Normal    Weak    Yes
D6   3/6   Rain      Cool  Normal    Strong  No
D7   3/7   Overcast  Cool  Normal    Strong  Yes
D8   3/8   Sunny     Mild  High      Weak    No
D9   3/9   Sunny     Cool  Normal    Weak    Yes
D10  3/10  Rain      Mild  Normal    Weak    Yes
D11  3/11  Sunny     Mild  Normal    Strong  Yes
D12  3/12  Overcast  Mild  High      Strong  Yes
D13  3/13  Overcast  Hot   Normal    Weak    Yes
D14  3/14  Rain      Mild  High      Strong  No

Gain(S, Outlook) = 0.94 - 0.694 = 0.246
What is Gain(S, Date)?
• Date will be chosen as the root of the tree.
• But of course the resulting tree will not generalize.
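To see why Date wins, the gains can be checked numerically. The following is a minimal sketch, not the C4.5 implementation: the helper names `entropy` and `gain` are hypothetical, and only the Date, Outlook, and PlayTennis columns of the table are used.

```python
from collections import Counter, defaultdict
from math import log2

# (Date, Outlook, PlayTennis) for D1..D14, copied from the training table.
ROWS = [
    ("3/1", "Sunny", "No"), ("3/2", "Sunny", "No"), ("3/3", "Overcast", "Yes"),
    ("3/4", "Rain", "Yes"), ("3/5", "Rain", "Yes"), ("3/6", "Rain", "No"),
    ("3/7", "Overcast", "Yes"), ("3/8", "Sunny", "No"), ("3/9", "Sunny", "Yes"),
    ("3/10", "Rain", "Yes"), ("3/11", "Sunny", "Yes"), ("3/12", "Overcast", "Yes"),
    ("3/13", "Overcast", "Yes"), ("3/14", "Rain", "No"),
]

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain from splitting rows on the attribute in column col."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[-1])  # collect class labels per attribute value
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - remainder

print("Gain(S, Outlook) =", round(gain(ROWS, 1), 3))
print("Gain(S, Date)    =", round(gain(ROWS, 0), 3))
```

Each Date subset contains a single example, so its entropy is 0 and Gain(S, Date) equals the full entropy of S (about 0.94), the largest gain any attribute can have. The Outlook figure comes out near 0.247 at full precision; the 0.246 on the slide results from rounding the intermediate terms.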
Gain Ratio
• Quinlan proposed another method of selecting attributes, called “gain ratio”.
• Suppose attribute A splits the training data S into m subsets. Call the subsets S1, S2, ..., Sm.
• We can define the set of subset proportions: { |S1|/|S|, |S2|/|S|, ..., |Sm|/|S| }
• The Penalty Term is the entropy of this set:
  Penalty Term = - sum over i = 1..m of (|Si|/|S|) log2 (|Si|/|S|)
• For example: What is the Penalty Term for the “Date” attribute? How about for “Outlook”?
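The two questions above can be worked through in a few lines. This is a sketch under stated assumptions: the penalty term is taken to be Quinlan's split information (the entropy of the subset sizes), the gain ratio is taken to be gain divided by that penalty, and `penalty_term` and `entropy_of_counts` are hypothetical helper names. Gain(S, Outlook) is the slide's value; Gain(S, Date) = 0.94 is assumed from the fact that every Date subset is pure.

```python
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    """Entropy (in bits) of a distribution given as subset sizes."""
    counts = list(counts)
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def penalty_term(values):
    """Entropy of the subset proportions |Si|/|S| induced by one attribute."""
    return entropy_of_counts(Counter(values).values())

# Attribute columns for D1..D14, copied from the training table.
dates = [f"3/{d}" for d in range(1, 15)]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]

# Gains: Outlook from the slide; Date assumed equal to Entropy(S) = 0.94.
GAIN = {"Date": 0.94, "Outlook": 0.246}

for name, column in (("Date", dates), ("Outlook", outlook)):
    penalty = penalty_term(column)
    print(f"{name}: penalty = {penalty:.3f}, gain ratio = {GAIN[name] / penalty:.3f}")
```

Date splits S into 14 singletons, so its penalty is log2(14), about 3.81, while Outlook's three subsets (5, 4, 5 examples) give about 1.58. Dividing by the penalty cuts Date's score sharply relative to Outlook's, although on this tiny example Date still comes out slightly ahead.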
UCI ML Repository
• http://archive.ics.uci.edu/ml/
• http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
• optdigits-pictures
• optdigits.info
• optdigits.names
Homework 1
• How to download homework and data
• Demo of C4.5
• Accounts on Linuxlab?
• How to get to Linux Lab
• Need help on Linux?
• Newer version C5.0: http://www.rulequest.com/see5-info.html