SEEM4630 2013-2014 Tutorial 2 Classification: Decision tree, Naïve Bayes & k-NN Wentao TIAN, wttian@se.cuhk.edu.hk
Classification: Definition • Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class. • Find a model for the class attribute as a function of the values of the other attributes. • Decision tree • Naïve Bayes • k-NN • Goal: previously unseen records should be assigned a class as accurately as possible.
Decision Tree • Goal • Construct a tree so that instances belonging to different classes are separated • Basic algorithm (a greedy algorithm; see the sketch below) • Tree is constructed in a top-down recursive manner • At start, all the training examples are at the root • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Examples are partitioned recursively based on the selected attributes
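A minimal Python sketch of this top-down induction loop, assuming records are plain dicts keyed by attribute name and that a scoring function (e.g. the information gain defined on the next slide) is passed in; the dict-of-dicts tree representation is only an illustrative choice:

```python
from collections import Counter, defaultdict

def partition(records, attribute):
    """Group records by the value they take on `attribute`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r)
    return groups

def build_tree(records, attributes, score, target="class"):
    """Greedy top-down induction: pick the best attribute, split, recurse."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:        # pure node, or nothing left to test
        return Counter(labels).most_common(1)[0][0]    # leaf labelled with the majority class
    best = max(attributes, key=lambda a: score(records, a, target))
    subtree = {}
    for value, subset in partition(records, best).items():
        rest = [a for a in attributes if a != best]
        subtree[value] = build_tree(subset, rest, score, target)
    return {best: subtree}
```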
Attribute Selection Measure 1: Information Gain • Let pi be the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D| • Expected information (entropy) needed to classify a tuple in D: Info(D) = -Σi pi log2(pi) • Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj) • Information gained by branching on attribute A: Gain(A) = Info(D) - InfoA(D)
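These formulas translate almost line for line into Python; a small sketch, again assuming records are dicts with the label stored under "class":

```python
import math
from collections import Counter

def entropy(records, target="class"):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(records, attribute, target="class"):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj)."""
    total = len(records)
    info_a = 0.0
    for value in set(r[attribute] for r in records):
        subset = [r for r in records if r[attribute] == value]
        info_a += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - info_a
```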
Attribute Selection Measure 2: Gain Ratio • The information gain measure is biased towards attributes with a large number of values • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain) • SplitInfoA(D) = -Σj=1..v (|Dj|/|D|) log2(|Dj|/|D|) • GainRatio(A) = Gain(A)/SplitInfo(A)
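A corresponding sketch for the gain ratio, reusing info_gain from the previous snippet (the zero check guards against attributes that do not split D at all):

```python
import math
from collections import Counter

def split_info(records, attribute):
    """SplitInfoA(D) = -sum_j |Dj|/|D| * log2(|Dj|/|D|)."""
    total = len(records)
    counts = Counter(r[attribute] for r in records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(records, attribute, target="class"):
    """GainRatio(A) = Gain(A) / SplitInfo(A); info_gain comes from the previous sketch."""
    si = split_info(records, attribute)
    return info_gain(records, attribute, target) / si if si > 0 else 0.0
```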
Attribute Selection Measure 3: Gini Index • If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 - Σj pj^2, where pj is the relative frequency of class j in D • If D is split on A into two subsets D1 and D2, the gini index of the split giniA(D) is defined as giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2) • Reduction in impurity: Δgini(A) = gini(D) - giniA(D)
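The gini measure is just as short in code; a sketch for the two-way split described above, where left and right stand for the two subsets D1 and D2:

```python
from collections import Counter

def gini(records, target="class"):
    """gini(D) = 1 - sum_j pj^2 over the class distribution of D."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_reduction(records, left, right, target="class"):
    """gini(D) - giniA(D) for a binary split of D into left (D1) and right (D2)."""
    total = len(records)
    gini_a = len(left) / total * gini(left, target) + len(right) / total * gini(right, target)
    return gini(records, target) - gini_a
```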
Tree induction example • Entropy of data S [9+, 5-]: Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94 • Split data by attribute Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
Gain(Outlook) = 0.94 - 5/14[-2/5(log2(2/5))-3/5(log2(3/5))] - 4/14[-4/4(log2(4/4))-0/4(log2(0/4))] - 5/14[-3/5(log2(3/5))-2/5(log2(2/5))] = 0.94 - 0.69 = 0.25
Tree induction example • Split data by attribute Temperature: <15 [3+, 1-], 15-25 [4+, 2-], >25 [2+, 2-]
Gain(Temperature) = 0.94 - 4/14[-3/4(log2(3/4))-1/4(log2(1/4))] - 6/14[-4/6(log2(4/6))-2/6(log2(2/6))] - 4/14[-2/4(log2(2/4))-2/4(log2(2/4))] = 0.94 - 0.91 = 0.03
Tree induction example • Split data by attribute Humidity: High [3+, 4-], Normal [6+, 1-]
Gain(Humidity) = 0.94 - 7/14[-3/7(log2(3/7))-4/7(log2(4/7))] - 7/14[-6/7(log2(6/7))-1/7(log2(1/7))] = 0.94 - 0.79 = 0.15
• Split data by attribute Wind: Weak [6+, 2-], Strong [3+, 3-]
Gain(Wind) = 0.94 - 8/14[-6/8(log2(6/8))-2/8(log2(2/8))] - 6/14[-3/6(log2(3/6))-3/6(log2(3/6))] = 0.94 - 0.89 = 0.05
Tree induction example • Gain(Outlook) = 0.25, Gain(Temperature) = 0.03, Gain(Humidity) = 0.15, Gain(Wind) = 0.05, so Outlook is chosen as the root: Sunny → ??, Overcast → Yes, Rain → ??

Training data:
Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No
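The root-level gains above can be double-checked with the info_gain sketch from earlier by encoding this table as a list of dicts:

```python
rows = [
    ("Sunny", ">25", "High", "Weak", "No"),         ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),     ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),       ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"), ("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),      ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"),  ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),   ("Rain", "15-25", "High", "Strong", "No"),
]
cols = ["Outlook", "Temperature", "Humidity", "Wind", "class"]
records = [dict(zip(cols, row)) for row in rows]

for attr in cols[:-1]:
    print(attr, round(info_gain(records, attr), 2))
# Outlook 0.25, Temperature 0.03, Humidity 0.15, Wind 0.05 -> Outlook becomes the root
```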
• Entropy of branch Sunny [2+, 3-]: Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97
• Split Sunny branch by attribute Temperature: <15 [1+, 0-], 15-25 [1+, 1-], >25 [0+, 2-]
Gain(Temperature) = 0.97 - 1/5[-1/1(log2(1/1))-0/1(log2(0/1))] - 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] - 2/5[-0/2(log2(0/2))-2/2(log2(2/2))] = 0.97 - 0.4 = 0.57
• Split Sunny branch by attribute Humidity: High [0+, 3-], Normal [2+, 0-]
Gain(Humidity) = 0.97 - 3/5[-0/3(log2(0/3))-3/3(log2(3/3))] - 2/5[-2/2(log2(2/2))-0/2(log2(0/2))] = 0.97 - 0 = 0.97
• Split Sunny branch by attribute Wind: Weak [1+, 2-], Strong [1+, 1-]
Gain(Wind) = 0.97 - 3/5[-1/3(log2(1/3))-2/3(log2(2/3))] - 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] = 0.97 - 0.95 = 0.02
Tree induction example • Humidity has the highest gain on the Sunny branch, so the tree so far is: Outlook: Sunny → Humidity (High → No, Normal → Yes), Overcast → Yes, Rain → ??
• Entropy of branch Rain [3+, 2-]: Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97
• Split Rain branch by attribute Temperature: <15 [1+, 1-], 15-25 [2+, 1-], >25 [0+, 0-]
Gain(Temperature) = 0.97 - 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] - 3/5[-2/3(log2(2/3))-1/3(log2(1/3))] - 0 (the >25 partition is empty) = 0.97 - 0.95 = 0.02
• Split Rain branch by attribute Humidity: High [1+, 1-], Normal [2+, 1-]
Gain(Humidity) = 0.97 - 2/5[-1/2(log2(1/2))-1/2(log2(1/2))] - 3/5[-2/3(log2(2/3))-1/3(log2(1/3))] = 0.97 - 0.95 = 0.02
• Split Rain branch by attribute Wind: Weak [3+, 0-], Strong [0+, 2-]
Gain(Wind) = 0.97 - 3/5[-3/3(log2(3/3))-0/3(log2(0/3))] - 2/5[-0/2(log2(0/2))-2/2(log2(2/2))] = 0.97 - 0 = 0.97
Tree induction example • Wind has the highest gain on the Rain branch, giving the final tree: Outlook: Sunny → Humidity (High → No, Normal → Yes), Overcast → Yes, Rain → Wind (Weak → Yes, Strong → No)
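Read off as code, the learned tree is just a few nested conditions; a sketch of the resulting classifier (note that Temperature is never tested):

```python
def play_tennis(outlook, humidity, wind):
    """Classifier corresponding to the tree above."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    return "Yes" if wind == "Weak" else "No"   # outlook == "Rain"

print(play_tennis("Sunny", "High", "Weak"))    # -> No, matching the first training record
```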
Bayesian Classification • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities • Foundation: based on Bayes' theorem • Model, computed from data: P(C | x1, ..., xn) = P(x1, ..., xn | C) P(C) / P(x1, ..., xn), where xi is the value of attribute Ai • P(C | x1, ..., xn) is the posterior probability, P(C) the prior probability, and P(x1, ..., xn | C) the likelihood • Choose the class label that has the highest posterior probability
Naïve Bayes Classifier • Problem: the joint probability P(x1, ..., xn | C) is difficult to estimate directly • Naïve Bayes assumption: attributes are conditionally independent given the class, so P(x1, ..., xn | C) = P(x1 | C) × P(x2 | C) × ... × P(xn | C) • Classify by choosing the class C that maximizes P(C) × Πk P(xk | C)
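With that assumption, training reduces to counting relative frequencies; a minimal sketch for categorical attributes (dict-based records as before, and no smoothing for unseen values):

```python
from collections import Counter, defaultdict

def train_nb(records, target="class"):
    """Estimate priors P(C) and likelihoods P(x_k | C) by relative frequency."""
    priors = Counter(r[target] for r in records)
    likelihoods = defaultdict(Counter)            # (class, attribute) -> Counter of values
    for r in records:
        for attr, value in r.items():
            if attr != target:
                likelihoods[(r[target], attr)][value] += 1
    return priors, likelihoods, len(records)

def predict_nb(priors, likelihoods, total, x):
    """argmax over classes of P(C) * prod_k P(x_k | C)."""
    best, best_score = None, -1.0
    for c, class_count in priors.items():
        score = class_count / total
        for attr, value in x.items():
            score *= likelihoods[(c, attr)][value] / class_count
        if score > best_score:
            best, best_score = c, score
    return best
```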
Example: Naïve Bayes Classifier • Estimates from the training data: P(C=t) = 1/2, P(C=f) = 1/2; P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5; P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5 • Test record: A=m, B=q, C=?
Example: Naïve Bayes Classifier • For C = t: P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2 = 2/25, so P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q) • For C = f: P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2 = 1/25, so P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q) • 2/25 > 1/25, so the posterior for C = t is higher • Conclusion: A=m, B=q → C=t
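Plugging the slide's estimates straight in reproduces this comparison; the denominator P(A=m, B=q) can be ignored because it is the same for both classes:

```python
# Estimates given on the previous slide
p_t, p_f = 1/2, 1/2                 # P(C=t), P(C=f)
p_am_t, p_am_f = 2/5, 1/5           # P(A=m | C=t), P(A=m | C=f)
p_bq_t, p_bq_f = 2/5, 2/5           # P(B=q | C=t), P(B=q | C=f)

score_t = p_am_t * p_bq_t * p_t     # 2/25 = 0.08
score_f = p_am_f * p_bq_f * p_f     # 1/25 = 0.04
print("C=t" if score_t > score_f else "C=f")   # -> C=t
```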
Nearest Neighbor Classification • Input • A set of stored records • k: # of nearest neighbors • An unknown record to classify • Method (see the sketch below) • Compute the distance to every stored record, e.g. Euclidean distance d(p, q) = sqrt(Σi (pi - qi)^2) • Identify the k nearest neighbors • Output • The class label of the unknown record, determined from the class labels of its nearest neighbors (i.e. by taking a majority vote)
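A direct implementation of these steps (Euclidean distance, majority vote) might look like this:

```python
import math
from collections import Counter

def knn_classify(query, training, k):
    """training: list of (point, label) pairs; point and query are numeric tuples."""
    # 1. Compute the distance from the query to every stored record.
    dists = [(math.dist(query, point), label) for point, label in training]
    # 2. Identify the k nearest neighbors.
    nearest = sorted(dists)[:k]
    # 3. Majority vote over their class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```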
Nearest Neighbor Classification: A Discrete Example • Input: 8 training instances • P1 (4, 2) Orange • P2 (0.5, 2.5) Orange • P3 (2.5, 2.5) Orange • P4 (3, 3.5) Orange • P5 (5.5, 3.5) Orange • P6 (2, 4) Black • P7 (4, 5) Black • P8 (2.5, 5.5) Black • New instance: Pn (4, 4), with k = 1 & k = 3 • Calculate the distances: d(P1, Pn) = 2, d(P2, Pn) = 3.80, d(P3, Pn) = 2.12, d(P4, Pn) = 1.12, d(P5, Pn) = 1.58, d(P6, Pn) = 2, d(P7, Pn) = 1, d(P8, Pn) = 2.12
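Applying the knn_classify sketch above to these eight points reproduces the distances listed and the k = 1 / k = 3 decisions shown on the next slide:

```python
training = [((4, 2), "Orange"), ((0.5, 2.5), "Orange"), ((2.5, 2.5), "Orange"),
            ((3, 3.5), "Orange"), ((5.5, 3.5), "Orange"), ((2, 4), "Black"),
            ((4, 5), "Black"), ((2.5, 5.5), "Black")]
pn = (4, 4)
print(knn_classify(pn, training, k=1))   # nearest is P7 (4, 5) -> Black
print(knn_classify(pn, training, k=3))   # P7, P4, P5: two Orange, one Black -> Orange
```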
Nearest Neighbor Classification • k = 1: the single nearest neighbor is P7 (Black), so Pn is classified Black • k = 3: the three nearest neighbors are P7 (Black), P4 (Orange) and P5 (Orange), so by majority vote Pn is classified Orange
Nearest Neighbor Classification… • Scaling issues • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes • Each attribute should fall in the same range • Min-Max normalization: v' = (v - min) / (max - min) • Example: two data records a = (1, 1000), b = (0.5, 1); dis(a, b) = ? (see the sketch below)
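A quick sketch of min-max normalization, rescaling each attribute to [0, 1] column-wise; here the mins and maxes come from the rows passed in, while in practice they would come from the training set:

```python
import math

def min_max_normalize(rows):
    """Rescale every attribute to [0, 1] column-wise: v' = (v - min) / (max - min)."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi))
            for row in rows]

a, b = (1, 1000), (0.5, 1)
print(math.dist(a, b))            # ~999.0: the second attribute dominates the distance
na, nb = min_max_normalize([a, b])
print(math.dist(na, nb))          # ~1.41: both attributes now contribute equally
```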
Classification: Lazy & Eager Learning • Two Types of Learning Methodologies • Lazy learning • Instance-based learning (k-NN) • Eager learning • Decision tree and Bayesian classification • ANN & SVM
Differences Between Lazy & Eager Learning • Lazy learning • Does not require model building • Less time training but more time predicting • A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function • Eager learning • Requires model building • More time training but less time predicting • Must commit to a single hypothesis that covers the entire instance space