Iterative Dichotomiser (ID3) Algorithm
By: Phuong H. Nguyen | Professor: Lee, Sin-Min | Course: CS 157B, Section 2 | Date: 05/08/07, Spring 2007
Overview • Introduction • Entropy • Information Gain • Detailed Example Walkthrough • Conclusion • References
Introduction • The ID3 algorithm is a greedy algorithm for decision tree construction, developed by Ross Quinlan in 1986. • ID3 uses information gain to select the best attribute for the root node and each decision node: • the split is made on the attribute with the highest information gain (the Max-Gain approach)
Entropy • Measures the impurity or randomness of a collection of examples. • A quantitative measure of the homogeneity of a set of examples. • In short, it tells us how random the given examples are with respect to the target classification class.
Entropy (cont.)
• Entropy(S) = -P(positive) log2 P(positive) - P(negative) log2 P(negative)
Where:
- P(positive) = proportion of positive examples
- P(negative) = proportion of negative examples
Example: If S is a collection of 14 examples with 9 YES and 5 NO, then:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy (cont.)
• For more than two classification classes: Entropy(S) = ∑ -p(i) log2 p(i), summed over every class i
• For a two-class collection the entropy lies between 0 and 1 (with c classes the maximum is log2 c)
• Two special cases:
- If Entropy(S) = 1 (the two-class maximum), the members are split equally between the two classes (min uniformity, max randomness)
- If Entropy(S) = 0, all members of S belong to exactly one class (max uniformity, min randomness)
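As a concrete illustration of the entropy formula above, here is a minimal Python sketch (the function name entropy and the label values are illustrative choices, not part of the original slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Reproduces the slide's example: 9 YES and 5 NO out of 14 examples.
labels = ["YES"] * 9 + ["NO"] * 5
print(round(entropy(labels), 3))   # 0.94
```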
Information Gain • A statistical measure of how well a given attribute separates the example collection into the target classes. • The ID3 algorithm uses the Max-Gain approach (highest information gain) to select the best attribute for the root node and the decision nodes.
Information Gain (cont.)
• Gain(S, A) = Entropy(S) - ∑ ((|Sv| / |S|) * Entropy(Sv)), summed over every value v of attribute A
Where:
• A is an attribute of collection S
• Sv = subset of S for which attribute A has value v
• |Sv| = number of elements in Sv
• |S| = number of elements in S
Information Gain (cont.)
Example: Collection S = 14 examples (9 YES, 5 NO)
Wind speed is one attribute of S, with values {Weak, Strong}:
• Weak = 8 occurrences (6 YES, 2 NO)
• Strong = 6 occurrences (3 YES, 3 NO)
Calculation:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy(S_Weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(S_Strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.00
Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(S_Weak) - (6/14)*Entropy(S_Strong)
= 0.940 - (8/14)*0.811 - (6/14)*1.00 = 0.048
• The information gain of every other attribute in S is calculated in the same way.
• The attribute with the highest gain is used as the root node or decision node.
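The wind-speed example can be reproduced with a small helper built on the entropy sketch above; the dictionary-based record format and the attribute names here are illustrative assumptions, not the slide's notation:

```python
def information_gain(examples, attribute, target="Outcome"):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)."""
    total = len(examples)
    gain = entropy([row[target] for row in examples])
    for value in {row[attribute] for row in examples}:
        subset = [row[target] for row in examples if row[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Wind = Weak: 6 YES / 2 NO;  Wind = Strong: 3 YES / 3 NO.
examples = ([{"Wind": "Weak", "Outcome": "YES"}] * 6 +
            [{"Wind": "Weak", "Outcome": "NO"}] * 2 +
            [{"Wind": "Strong", "Outcome": "YES"}] * 3 +
            [{"Wind": "Strong", "Outcome": "NO"}] * 3)
print(round(information_gain(examples, "Wind"), 3))   # 0.048
```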
Example Walkthrough • Consider a company that sends out promotions to various houses and records a few facts about each house, along with whether the household responded (the collection contains 14 examples: 9 Responded, 5 Nothing):
Example Walkthrough (cont.)
The target classification is "Outcome", which can be "Responded" or "Nothing".
The remaining attributes in the collection are "District", "House Type", "Income", and "Previous Customer".
The attributes have the following values:
- District = {Suburban, Rural, Urban}
- House Type = {Detached, Semi-detached, Terrace}
- Income = {High, Low}
- Previous Customer = {No, Yes}
- Outcome = {Nothing, Responded}
Example Walkthrough (cont.)
Detailed calculation for Gain(S, District):
Entropy(S = [9/14 responses, 5/14 no responses])
= -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.4098 + 0.5305 = 0.9403
Entropy(S_Suburban = [2/5 responses, 3/5 no responses])
= -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.5288 + 0.4422 = 0.9710
Entropy(S_Rural = [4/4 responses, 0/4 no responses])
= -(4/4) log2(4/4) = 0
Entropy(S_Urban = [3/5 responses, 2/5 no responses])
= -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.4422 + 0.5288 = 0.9710
Gain(S, District) = Entropy(S) - ((5/14)*Entropy(S_Suburban) + (4/14)*Entropy(S_Rural) + (5/14)*Entropy(S_Urban))
= 0.9403 - ((5/14)*0.9710 + (4/14)*0 + (5/14)*0.9710)
= 0.9403 - 0.3468 - 0 - 0.3468 ≈ 0.2468
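To double-check this arithmetic, the same result can be reproduced with the entropy helper sketched earlier, using only the per-branch counts quoted above (the variable names are illustrative):

```python
# Counts from the slide: S = 9 Responded / 5 Nothing;
# Suburban = 2/3, Rural = 4/0, Urban = 3/2.
S        = ["Responded"] * 9 + ["Nothing"] * 5
suburban = ["Responded"] * 2 + ["Nothing"] * 3
rural    = ["Responded"] * 4
urban    = ["Responded"] * 3 + ["Nothing"] * 2

gain_district = entropy(S) - sum(
    (len(branch) / len(S)) * entropy(branch)
    for branch in (suburban, rural, urban)
)
print(round(gain_district, 3))   # 0.247, i.e. the slide's 0.2468 up to rounding
```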
Example Walkthrough (cont.)
• So we now have: Gain(S, District) = 0.2468
• Applying the same process to the remaining three attributes of S, we get:
- Gain(S, House Type) = 0.049
- Gain(S, Income) = 0.151
- Gain(S, Previous Customer) = 0.048
• Comparing the information gain of the four attributes, we see that "District" has the highest value.
• "District" will be the root node of the decision tree.
• So far the decision tree looks like the following:
District
- Suburban → ???
- Rural → ???
- Urban → ???
Example Walkthrough (cont.)
• Applying the same process to the Suburban branch of the root node, we get:
- Entropy(S_Suburban) = 0.970
- Gain(S_Suburban, House Type) = 0.570
- Gain(S_Suburban, Income) = 0.970
- Gain(S_Suburban, Previous Customer) = 0.019
• The information gain of "Income" is the highest:
• "Income" will be the decision node for this branch.
• The decision tree now looks like the following:
District
- Suburban → Income
- Rural → ???
- Urban → ???
Example Walkthrough (cont.)
The Rural branch of the root node is a special case because:
- Entropy(S_Rural) = 0, so all members of S_Rural belong to exactly one target classification class, which is "Responded"
Thus, we skip the calculations and add the corresponding target classification value directly to the tree.
The decision tree now looks like the following:
District
- Suburban → Income
- Rural → Responded
- Urban → ???
Example Walkthrough (cont.)
• Applying the same process to the Urban branch of the root node, we get:
- Entropy(S_Urban) = 0.970
- Gain(S_Urban, House Type) = 0.019
- Gain(S_Urban, Income) = 0.019
- Gain(S_Urban, Previous Customer) = 0.970
• The information gain of "Previous Customer" is the highest:
• "Previous Customer" will be the decision node for this branch.
• The decision tree now looks like the following:
District
- Suburban → Income
- Rural → Responded
- Urban → Previous Customer
For the "Income" node, we have:
- High → Nothing (3/3), Entropy = 0
- Low → Responded (2/2), Entropy = 0
For the "Previous Customer" node, we have:
- No → Responded (3/3), Entropy = 0
- Yes → Nothing (2/2), Entropy = 0
There is no longer any need to split the tree; therefore, the final decision tree looks like the following:
District
- Suburban → Income
  - High → Nothing
  - Low → Responded
- Rural → Responded
- Urban → Previous Customer
  - No → Responded
  - Yes → Nothing
• From the final decision tree above, some rules can be extracted. Examples:
- (District = Suburban) AND (Income = Low) → (Outcome = Responded)
- (District = Rural) → (Outcome = Responded)
- (District = Urban) AND (Previous Customer = Yes) → (Outcome = Nothing)
- and so on…
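The walkthrough above is one unrolling of a simple recursive procedure. Below is a compact sketch of that recursion, reusing the entropy and information_gain helpers from the earlier snippets; the nested-dict tree representation is an illustrative choice, not part of the original slides:

```python
def id3(examples, attributes, target="Outcome"):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    labels = [row[target] for row in examples]
    # Base case 1: every example has the same label -> leaf (e.g. the Rural branch).
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left to split on -> majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {row[best] for row in examples}:
        subset = [row for row in examples if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```

Calling id3 on the full promotion dataset with the attributes District, House Type, Income, and Previous Customer would reproduce the tree derived above.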
Conclusion • The ID3 algorithm is straightforward to implement once we understand how it works. • ID3 is one of the foundational decision-tree techniques in data mining. • It has been applied effectively to data mining problems in industry.
References
• Lee, Sin-Min. CS 157B course slides, San Jose State University, Spring 2007. http://www.cs.sjsu.edu/%7Elee/cs157b/cs157b.html
• Colin, Andrew. "Building Decision Trees with the ID3 Algorithm." Dr. Dobb's Journal, June 1996.
• Utgoff, Paul E. "Incremental Induction of Decision Trees." Kluwer Academic Publishers, 1989.
• http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
• http://decisiontrees.net/node/27