Data Mining and Classification Methods Overview

Data Mining and Knowledge Acquisition — Chapter 5 — BIS 541 2012/2013 Spring

Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary

Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Prediction Problems: Classification vs. Numeric Prediction • Classification • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data • Numeric Prediction • models continuous-valued functions, i.e., predicts unknown or missing values • Typical applications • Credit/loan approval • Medical diagnosis: if a tumor is cancerous or benign • Fraud detection: if a transaction is fraudulent • Web page categorization: which category it is

Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is the training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: for classifying future or unknown objects • Estimate the accuracy of the model • The known label of each test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic because the model may overfit the training data • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
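
The accuracy estimate in the model usage step can be sketched in Python as follows. This is a minimal illustration only: the function name `accuracy` and the two label lists are hypothetical and do not come from the slides.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical test set: known labels vs. labels produced by some model.
known = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no"]
rate = accuracy(known, predicted)
print(f"accuracy = {rate:.0%}")
```

The test samples here are assumed to be held out from model construction, as the slide requires.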

Classification Process (1): Model Construction • [Figure: Training Data → Classification Algorithms → Classifier (Model)] • Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction • [Figure: the Classifier is applied to Testing Data and then to Unseen Data] • Unseen tuple: (Jeff, Professor, 4) • Tenured?
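
Applying the example rule to the unseen tuple can be sketched as below. The function name `tenured` is hypothetical, and returning ‘no’ when the rule does not fire is an assumption; the slide only states the condition for ‘yes’.

```python
def tenured(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.

    Assumption: any tuple that does not satisfy the rule is labeled 'no'.
    """
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Unseen tuple from the slide: (Jeff, Professor, 4)
name, rank, years = "Jeff", "Professor", 4
print(name, "tenured?", tenured(rank, years))
```

Jeff satisfies the first condition (rank is professor), so the rule fires even though his years value is below 6.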

Issues Regarding Classification and Prediction (1): Data Preparation • Data cleaning • Preprocess data in order to reduce noise and handle missing values • Relevance analysis (feature selection) • Remove the irrelevant or redundant attributes • Data transformation • Generalize and/or normalize data

Issues regarding classification and prediction (2): Evaluating Classification Methods • Predictive accuracy • Speed • time to construct the model • time to use the model • Robustness • handling noise and missing values • Scalability • efficiency in disk-resident databases • Interpretability • understanding and insight provided by the model • Goodness of rules • decision tree size • compactness of classification rules

Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Use of the decision tree once it is built: classifying an unknown sample

Training Dataset • This follows an example from Quinlan’s ID3 • Han Table 7.1 • Original data from DMPML Ch 4 Sec 4.3 pp 8-9, 89-94

Output: A Decision Tree for “buys_computer” • Root (14 cases): age? • age <=30 (5 cases): student? no → no; yes → yes • age 31..40 (4 cases): yes • age >40 (5 cases): credit rating? excellent → no; fair → yes

In Practice • [Figure: timeline from past to current, with $I_i$ followed by $O_i$, both in the past] • For the i-th data object, • at time $I_i$, the input information is known • at time $O_i$, the output is assigned (yes/no) • For all data objects in the training data set (i = 1..14), both input and output are known

In Practice • [Figure: timeline from past through current to future, with $I_n$ at the current time and $O_n$ in the future] • For a new customer n, • at the current time $I_n$, the input information is known • but the output $O_n$ is not known • The customer is yet to be classified (yes or no) before her actual buying behavior is realized • The value of a data mining study is to predict buying behavior beforehand

ID3 Algorithm for Decision Tree Induction • ID3 algorithm Quinlan (1986) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
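
The top-down recursion and the stopping conditions listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not Quinlan's original implementation: the function names, the nested-dict tree representation, and the four-row toy data are all hypothetical. The "no samples left" case cannot arise here because the sketch only branches on attribute values that occur in the data.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain obtained by splitting on one categorical attribute."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1:            # all samples belong to the same class
        return labels[0]
    if not attrs:                        # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in keep],
                                [labels[i] for i in keep],
                                [a for a in attrs if a != best])
    return tree

# Hypothetical toy data (not the 14-case table used in the slides).
rows = [
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
]
labels = ["yes", "yes", "no", "no"]
tree = id3(rows, labels, ["student", "credit"])
print(tree)
```

In this toy data the student attribute separates the classes perfectly, so the sketch stops after a single split.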

Entropy of a Simple Event • The average amount of information needed to predict event s • The expected information of an event s is • $I(s) = -\log_2 p_s = \log_2(1/p_s)$ • $p_s$ is the probability of event s • Entropy is high for less probable events and low otherwise • As $p_s$ goes to zero • event s becomes rarer and rarer, and hence • its entropy approaches infinity • As $p_s$ goes to one • event s becomes more and more certain, and hence • its entropy approaches zero • Hence entropy is a measure of the randomness, disorder, or rareness of an event

Computational Formulas • $\log_2 1 = 0$, $\log_2 2 = 1$ • $\log_2 0.5 = \log_2(1/2) = \log_2 2^{-1} = -\log_2 2 = -1$ • $\log_2 0.25 = -2$ • Change-of-base formula: $\log_a x = \log_b x / \log_b a$, where a and b are any bases • Choosing b = e makes it convenient to compute logarithms of any base: $\log_a x = \ln x / \ln a$, and in particular $\log_2 x = \ln x / \ln 2$ • [Figure: plot of entropy $-\log_2 p$ against probability p; the curve takes the value 2 at p = 0.25, 1 at p = 0.5, and 0 at p = 1]

Entropy of a simple event • Entropy can be interpreted as the number of bits needed to encode that event • if p = 1/2 • $-\log_2(1/2) = 1$ bit is required to encode that event • 0 and 1 with equal probability • if p = 1/4 • $-\log_2(1/4) = 2$ bits are required to encode that event • 00 01 10 11 are equally likely; only one represents the specific event • if p = 1/8 • $-\log_2(1/8) = 3$ bits are required to encode that event • 000 001 010 011 100 101 110 111 are equally likely; only one represents the specific event • if p = 1/3 • $-\log_2(1/3) = -\ln(1/3)/\ln 2 = 1.58$ bits are required to encode that event
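
The bit counts above can be reproduced with the change-of-base formula $\log_2 x = \ln x / \ln 2$. A minimal Python sketch follows; the function name `information_bits` is hypothetical.

```python
import math

def information_bits(p):
    """Self-information -log2(p) of an event with probability p, via ln(p)/ln(2)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(2)

for p in (1/2, 1/4, 1/8, 1/3):
    print(f"p = {p:.3f}: {information_bits(p):.2f} bits")
```

Python also provides `math.log2` directly; the natural-logarithm form is used here only to mirror the computational formula.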

Composite events • Consider the four events A, B, C, D • A: tossing a fair coin where • $P_H = P_T = 1/2$ • B: tossing an unfair coin where • $P_H = 1/4$, $P_T = 3/4$ • C: tossing an unfair coin where • $P_H = 0.001$, $P_T = 0.999$ • D: tossing an unfair coin where • $P_H = 0.0$, $P_T = 1.0$ • Which of these events is the most certain? • How much information is needed to guess the result of each toss in A to D? • What is the expected information to predict the outcome of events A, B, C, D respectively?

Entropy of composite events • Average or expected information is highest when each event is equally likely • Event A • Expected information required to guess falls as the probability of head approaches either 0 or 1 • as $P_H$ goes to 0 or $P_T$ goes to 1: moving to case D • The composite event of tossing a coin becomes more and more certain • So for case D no information is needed, as the answer is already known to be tail • What is the expected information to predict the outcome of that event when the probability of head is p in general? • $\mathrm{ent}(S) = p[-\log_2 p] + (1-p)[-\log_2(1-p)]$ • $\mathrm{ent}(S) = -p\log_2 p - (1-p)\log_2(1-p)$ • The lower the entropy the higher the information content of that event • Weighted average of the simple events, weighted by their probabilities

Examples • When the event is certain: • $p_H = 1, p_T = 0$ or $p_H = 0, p_T = 1$ • $\mathrm{ent}(S) = -1\log_2 1 - 0\log_2 0 = -1 \cdot 0 - 0 = 0$ • Note that $\lim_{x \to 0^+} x\log_2 x = 0$ • For a fair coin $p_H = 0.5, p_T = 0.5$ • $\mathrm{ent}(S) = -(1/2)\log_2(1/2) - (1/2)\log_2(1/2)$ • $= -(1/2)(-1) - (1/2)(-1) = 1$ • ent(S) is 1 when p = 0.5 and 1-p = 0.5 • If head and tail probabilities are unequal • entropy is between 0 and 1
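
The coin entropy formula can be evaluated for the four coins A to D introduced earlier. The following Python sketch is illustrative only; the function name `coin_entropy` is hypothetical, and the code applies the convention $0\log_2 0 = 0$ by skipping zero probabilities.

```python
import math

def coin_entropy(p_head):
    """Entropy -p*log2(p) - (1-p)*log2(1-p), using the convention 0*log2(0) = 0."""
    total = 0.0
    for p in (p_head, 1.0 - p_head):
        if p > 0:
            total -= p * math.log2(p)
    return total

# Head probabilities of the four coins A, B, C, D from the slides.
coins = {"A": 0.5, "B": 0.25, "C": 0.001, "D": 0.0}
for name, p_head in coins.items():
    print(f"{name}: P(H) = {p_head}, entropy = {coin_entropy(p_head):.4f} bits")
```

The entropy falls from 1 bit for the fair coin A towards 0 for the certain coin D.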

[Figure: entropy $-p\log_2 p - (1-p)\log_2(1-p)$ of tossing a coin plotted against the probability of head p; the curve is 0 at p = 0 and p = 1 and reaches its maximum of 1 at p = 0.5]

In general • Entropy, a measure of the (im)purity of a sample variable S, is defined as • $\mathrm{Ent}(S) = \sum_{s \in S} p_s(-\log_2 p_s)$ • $= -\sum_{s \in S} p_s\log_2 p_s$ • s is a value of S, an element of the sample space • $p_s$ is the estimated or subjective probability of any $s \in S$ • Note that $\sum_{s \in S} p_s = 1$

Information needed to classify an object • Class entropy is computed in a similar manner • Entropy is 0 if all members of S belong to the same class • no information is needed to classify an object • Entropy of a partition is the weighted average of the information $-\log_2 P_i$ of all classes, weighted by the class probabilities • Total number of objects: S • There are two classes $C_1$ and $C_2$ • with cardinalities $S_1$ and $S_2$ • $I(S_1, S_2) = -(S_1/S)\log_2(S_1/S) - (S_2/S)\log_2(S_2/S)$ • Or in general with m classes $C_1, C_2, \ldots, C_m$ • $I(S_1, S_2, \ldots, S_m) = -(S_1/S)\log_2(S_1/S) - (S_2/S)\log_2(S_2/S) - \ldots - (S_m/S)\log_2(S_m/S)$ • Probability of an object belonging to class i: $P_i = S_i/S$

Example cont.: At the root of the tree • There are 14 cases, S = 14 • 9 yes denoted by Y, $S_1 = 9$, $p_y = 9/14$ • 5 no denoted by N, $S_2 = 5$, $p_n = 5/14$ • Ys and Ns are almost equally likely • How much information on average is needed to classify a person as buyer or not • without knowing any characteristics such as • age, income, … • $I(S_1, S_2) = (S_1/S)(-\log_2(S_1/S)) + (S_2/S)(-\log_2(S_2/S))$ • $I(9,5) = (9/14)(-\log_2(9/14)) + (5/14)(-\log_2(5/14))$ • $= (9/14)(0.637) + (5/14)(1.485) = 0.940$ bits of information • close to 1
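
The root entropy can be checked numerically. In this minimal Python sketch the function name `info` is hypothetical; it takes the class counts $(S_1, \ldots, S_m)$ and returns $I(S_1, \ldots, S_m)$ in bits.

```python
import math

def info(counts):
    """I(S1, ..., Sm): entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

root = info((9, 5))   # 9 "yes" and 5 "no" cases at the root
print(f"I(9,5) = {root:.3f} bits")
```

Zero counts are skipped, which matches the convention that a class with no members contributes nothing to the entropy.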

Expected information to classify with an attribute (1) • An attribute A with n distinct values $a_1, a_2, \ldots, a_n$ partitions the dataset into n distinct parts • $S_{i,j}$ is the number of objects in partition i (i = 1..n) • with class $C_j$ (j = 1..m) • Expected information to classify an object knowing the value of attribute A is the • weighted average of the entropies of all partitions • weighted by the frequency of each partition i

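Expected information to classify with an attribute (2) • $\mathrm{Ent}(A) = I(A) = \sum_{i=1}^{n} (a_i/S)\, I(S_{i,1}, \ldots, S_{i,m})$ • $= (a_1/S)\, I(S_{1,1}, \ldots, S_{1,m}) + (a_2/S)\, I(S_{2,1}, \ldots, S_{2,m}) + \ldots + (a_n/S)\, I(S_{n,1}, \ldots, S_{n,m})$ • $I(S_{i,1}, \ldots, S_{i,m})$ is the entropy of partition i • $I(S_{i,1}, \ldots, S_{i,m}) = \sum_{j=1}^{m} (S_{i,j}/a_i)(-\log_2(S_{i,j}/a_i))$ • $= -(S_{i,1}/a_i)\log_2(S_{i,1}/a_i) - (S_{i,2}/a_i)\log_2(S_{i,2}/a_i) - \ldots - (S_{i,m}/a_i)\log_2(S_{i,m}/a_i)$ • $a_i = \sum_{j=1}^{m} S_{i,j}$ is the number of objects in partition i • Here $S_{i,j}/a_i$ is the probability of class j in partition i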

Information Gain • Gain in information using the distinct values of attribute A is the reduction in entropy, or in the information needed to classify an object • $\mathrm{Gain}(A) = I(S_1, \ldots, S_m) - I(A)$ • average information without knowing A minus • average information with knowing A • E.g., knowing such characteristics as • age interval, income interval • How much do they help to classify a new object? • Can information gain be negative? • Is it always greater than or equal to zero?

Another Example • Pick a student at random in BU • What is the chance that she is staying in a dorm? • Initially we have no specific information about her • If I ask her initials • Does it help us in predicting the probability of her staying in a dorm? • No • If I ask her address and record the city • Does it help us in predicting the chance of her staying in a dorm? • Yes

Attribute selection at the root • There are four attributes • age, income, student, credit rating • Which of these provides the highest information in classifying a new customer, or equivalently • Which of these results in the highest information gain?

Testing age at the root node • Root: 14 cases, 9 Y, 5 N • age <=30: 5 cases, 2 Y, 3 N, I(2,3) = 0.971, decision: N • age 31..40: 4 cases, 4 Y, 0 N, I(4,0) = 0, decision: Y • age >40: 5 cases, 3 Y, 2 N, I(3,2) = 0.971, decision: Y • Accuracy: 10/14 • Entropy(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) • $= (5/14)(-(2/5)\log_2(2/5) - (3/5)\log_2(3/5)) + (4/14)(-(4/4)\log_2(4/4) - (0/4)\log_2(0/4)) + (5/14)(-(3/5)\log_2(3/5) - (2/5)\log_2(2/5)) = 0.694$ • Gain(age) = 0.940 - 0.694 = 0.246
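
The weighted entropy and the gain for age can be verified with a short Python sketch. The function names `info` and `expected_info` and the dictionary layout are hypothetical; the (yes, no) counts are the ones shown for the three age ranges.

```python
import math

def info(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_info(partitions):
    """Weighted average of partition entropies; maps attribute value -> (yes, no)."""
    total = sum(sum(c) for c in partitions.values())
    return sum(sum(c) / total * info(c) for c in partitions.values())

age = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}
entropy_age = expected_info(age)
gain_age = info((9, 5)) - entropy_age
print(f"Entropy(age) = {entropy_age:.3f}, Gain(age) = {gain_age:.3f}")
```

Without intermediate rounding the gain comes out near 0.247; the value 0.246 results from subtracting the already rounded figures 0.940 and 0.694.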

Expected information for age <=30 • If age is <=30 • Information needed to classify a new customer: • I(2,3) = 0.971 bits, as the training data tells us that • knowing that age <=30 • a customer buys with probability 0.4 • but does not with probability 0.6 • $I(2,3) = -(3/5)\log_2(3/5) - (2/5)\log_2(2/5)$ • $= 0.6 \times 0.737 + 0.4 \times 1.322 = 0.971$ • But what is the weight of the age range <=30? • 5 out of 14 samples are in that range • (5/14) I(2,3) is the weighted information needed to classify a customer as buyer or not

Information gain by age • gain(age) = I(Y,N) - Entropy(age) • = 0.940 - 0.694 • = 0.246 is the information gain • or the reduction of entropy in classifying a new object • Knowing the age of the customer increases our ability to classify her as buyer or not, or • helps us to predict her buying behavior

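Attribute Selection by Information Gain Computation • Class Y: buys_computer = “yes” • Class N: buys_computer = “no” • I(Y, N) = I(9, 5) = 0.940 • Compute the entropy for age: Entropy(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 • (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yeses and 3 nos • Hence Gain(age) = I(Y, N) - Entropy(age) = 0.246 • Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit rating) = 0.048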

Testing income at the root node • Root: 14 cases, 9 Y, 5 N • income low: 4 cases, 3 Y, 1 N, I(3,1) = ?, decision: Y • income medium: 6 cases, 4 Y, 2 N, I(4,2) = ?, decision: Y • income high: 4 cases, 2 Y, 2 N, I(2,2) = 1, decision: ? • Accuracy: 9/14 • $\mathrm{Entropy}(income) = (4/14)(-(3/4)\log_2(3/4) - (1/4)\log_2(1/4)) + (6/14)(-(4/6)\log_2(4/6) - (2/6)\log_2(2/6)) + (4/14)(-(2/4)\log_2(2/4) - (2/4)\log_2(2/4)) = 0.911$ • Gain(income) = 0.940 - 0.911 = 0.029

Testing student at the root node • Root: 14 cases, 9 Y, 5 N • student = no: 7 cases, 3 Y, 4 N, I(3,4) = ?, decision: N • student = yes: 7 cases, 6 Y, 1 N, I(6,1) = ?, decision: Y • Accuracy: 10/14 • $\mathrm{Entropy}(student) = (7/14)(-(3/7)\log_2(3/7) - (4/7)\log_2(4/7)) + (7/14)(-(6/7)\log_2(6/7) - (1/7)\log_2(1/7)) = 0.789$ • Gain(student) = 0.940 - 0.789 = 0.151

Testing credit rating at the root node • Root: 14 cases, 9 Y, 5 N • credit rating = fair: 8 cases, 6 Y, 2 N, I(6,2) = ?, decision: Y • credit rating = excellent: 6 cases, 3 Y, 3 N, I(3,3) = 1, decision: ? • Accuracy: 9/14 • $\mathrm{Entropy}(credit\ rating) = (8/14)(-(6/8)\log_2(6/8) - (2/8)\log_2(2/8)) + (6/14)(-(3/6)\log_2(3/6) - (3/6)\log_2(3/6)) = 0.892$ • Gain(credit rating) = 0.940 - 0.892 = 0.048

Comparing gains • Information gains for the attributes at the root node: • Gain(age) = 0.246 • Gain(income) = 0.029 • Gain(student) = 0.151 • Gain(credit rating) = 0.048 • Age provides the highest gain in information • Age is chosen as the attribute at the root node • Branch according to the distinct values of age
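
The comparison can be reproduced from the (yes, no) counts shown on the four root-node slides. This is a minimal Python sketch; the function names `info` and `gain` and the `splits` dictionary are hypothetical. Because the code does not round intermediate values, its results can differ from the slide figures in the third decimal.

```python
import math

def info(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(partitions):
    """Information gain of a split; each partition is a (yes, no) count pair."""
    class_totals = [sum(col) for col in zip(*partitions)]
    n = sum(class_totals)
    remainder = sum(sum(p) / n * info(p) for p in partitions)
    return info(class_totals) - remainder

# (yes, no) counts per attribute value at the root node.
splits = {
    "age": [(2, 3), (4, 0), (3, 2)],
    "income": [(3, 1), (4, 2), (2, 2)],
    "student": [(3, 4), (6, 1)],
    "credit rating": [(6, 2), (3, 3)],
}
gains = {attr: gain(parts) for attr, parts in splits.items()}
best = max(gains, key=gains.get)
for attr, g in gains.items():
    print(f"Gain({attr}) = {g:.3f}")
print("attribute chosen at the root:", best)
```

The attribute with the largest gain is selected as the test at the root, which is the selection rule ID3 applies at every node.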

After selecting age: first level of the tree • Root (14 cases): age? • age <=30 (5 cases): continue • age 31..40 (4 cases): yes • age >40 (5 cases): continue

Attribute selection at the age <=30 node • Node: 5 cases, 2 Y, 3 N • Split by student: yes: 2 Y, 0 N; no: 0 Y, 3 N • Split by income: high: 0 Y, 2 N; low: 1 Y, 0 N; medium: 1 Y, 1 N • Split by credit rating: excellent: 1 Y, 1 N; fair: 1 Y, 2 N • Information gain for student is the highest, as knowing whether the customer is a student or not provides perfect information to classify her buying behavior

Attribute selection at the age >40 node • Node: 5 cases, 3 Y, 2 N • Split by student: yes: 2 Y, 1 N; no: 1 Y, 1 N • Split by income: high: 0 Y, 0 N; low: 1 Y, 1 N; medium: 2 Y, 1 N • Split by credit rating: excellent: 0 Y, 2 N; fair: 3 Y, 0 N • Information gain for credit rating is the highest, as knowing the customer's credit rating provides perfect information to classify her buying behavior

Exercise • Calculate all information gains in the second level of the tree, that is, after branching by the distinct values of age
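
One way to check the exercise numerically is sketched below in Python. The function names and the `nodes` dictionary are hypothetical; the (yes, no) counts are those shown for the age <=30 and age >40 nodes. The 31..40 node is already pure, so it needs no further split.

```python
import math

def info(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(partitions):
    """Information gain of a split; each partition is a (yes, no) count pair."""
    class_totals = [sum(col) for col in zip(*partitions)]
    n = sum(class_totals)
    remainder = sum(sum(p) / n * info(p) for p in partitions)
    return info(class_totals) - remainder

nodes = {
    "age <=30": {
        "student": [(2, 0), (0, 3)],
        "income": [(0, 2), (1, 0), (1, 1)],
        "credit rating": [(1, 1), (1, 2)],
    },
    "age >40": {
        "student": [(2, 1), (1, 1)],
        "income": [(0, 0), (1, 1), (2, 1)],   # no high-income cases in this node
        "credit rating": [(0, 2), (3, 0)],
    },
}
best = {}
for node, splits in nodes.items():
    gains = {attr: gain(parts) for attr, parts in splits.items()}
    best[node] = max(gains, key=gains.get)
    print(node, {a: round(g, 3) for a, g in gains.items()}, "->", best[node])
```

An empty partition, such as high income at the age >40 node, gets weight zero and therefore does not affect the result.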

Output: A Decision Tree for “buys_computer” • Root (14 cases): age? • age <=30 (5 cases): student? no → no (3 N); yes → yes (2 Y) • age 31..40 (4 cases): yes (4 Y) • age >40 (5 cases): credit rating? excellent → no (2 N); fair → yes (3 Y) • Accuracy: 14/14 on the training set

Advantages and Disadvantages of Decision Trees • Advantages: • Easy to understand and map nicely to production rules • Suitable for categorical as well as numerical inputs • No statistical assumptions about the distribution of attributes • Generation, and application to classify unknown outputs, is very fast • Disadvantages: • Output attributes must be categorical • Unstable: slight variations in the training data may result in different attribute selections and hence different trees • Numerical input attributes lead to complex trees, as attribute splits are usually binary • Not suitable for non-rectangular regions, such as regions separated by linear or nonlinear combinations of attributes • by lines (in 2 dimensions), planes (in 3 dimensions), or in general hyperplanes (n dimensions)

A classification problem for which decision trees are not suitable • [Figure: scatter plot of income against age; the two classes X and O are separated by the oblique line AA] • Decision trees are not suitable for this problem

Other Attribute Selection Measures • Gain Ratio • Gini index (CART, IBM IntelligentMiner) • All attributes are assumed continuous-valued • Assume there exist several possible split values for each attribute • May need other tools, such as clustering, to get the possible split values • Can be modified for categorical attributes

Gain Ratio • Add another attribute: the transaction identifier TID • TID is different for each observation • $E(TID) = (1/14)\, I(1,0) + (1/14)\, I(1,0) + \ldots + (1/14)\, I(1,0) = 0$ • gain(TID) = 0.940 - 0 = 0.940 • the highest gain, so TID is the test attribute • which makes no sense • use gain ratio rather than gain • Split information: a measure of the information value of a split • without considering class information • only the number and size of child nodes • A kind of normalization for information gain

Split information $= \sum_i -(S_i/S)\log_2(S_i/S)$ • the information needed to assign an instance to one of these branches • Gain ratio = gain(S) / split information(S) • In the previous example • Split info and gain ratio for TID: • $\mathrm{split\ info}(TID) = -[(1/14)\log_2(1/14)] \times 14 = 3.807$ • gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247 • Split info for age: I(5,4,5) = • $-(5/14)\log_2(5/14) - (4/14)\log_2(4/14) - (5/14)\log_2(5/14) = 1.577$ • gain ratio(age) = gain(age)/split info(age) • = 0.246/1.577 = 0.156
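
The split information and gain ratio figures can be checked with the following Python sketch. The function name `split_info` is hypothetical; the gains 0.940 (TID) and 0.246 (age) are taken from the slides rather than recomputed.

```python
import math

def split_info(sizes):
    """Split information: entropy of the branch sizes, ignoring class labels."""
    total = sum(sizes)
    return sum(-(s / total) * math.log2(s / total) for s in sizes if s > 0)

# TID puts every one of the 14 cases in its own branch; age gives branches of 5, 4, 5.
tid_split = split_info([1] * 14)
age_split = split_info([5, 4, 5])
tid_ratio = 0.940 / tid_split
age_ratio = 0.246 / age_split
print(f"split info(TID) = {tid_split:.3f}, gain ratio(TID) = {tid_ratio:.3f}")
print(f"split info(age) = {age_split:.3f}, gain ratio(age) = {age_ratio:.3f}")
```

Splitting into 14 singleton branches yields a split information of $\log_2 14$, which is what scales down the gain of TID.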
