Explore the concepts of classification and prediction in data mining, including decision tree induction, Bayesian classification, neural networks, support vector machines, and more. Understand the importance of supervised learning in building models for accurate classification and prediction tasks.
Data Mining and Knowledge Acquisition, Chapter 5, BIS 541, 2012/2013 Spring
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data are classified based on the training set • Unsupervised learning (clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction • Classification • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data • Numeric prediction • models continuous-valued functions, i.e., predicts unknown or missing values • Typical applications • Credit/loan approval • Medical diagnosis: whether a tumor is cancerous or benign • Fraud detection: whether a transaction is fraudulent • Web page categorization: which category a page belongs to
Classification: A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is the training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: for classifying future or unknown objects • Estimate the accuracy of the model • The known label of each test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • The test set must be independent of the training set, otherwise over-fitting goes undetected and the accuracy estimate is too optimistic • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction • Training data are fed to a classification algorithm, which produces the classifier (model) • Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction • The classifier is first checked on testing data and then applied to unseen data • Example of an unseen tuple: (Jeff, Professor, 4). Tenured?
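The rule from the model-construction slide can be applied to the unseen tuple (Jeff, Professor, 4) with a few lines of Python. This is a minimal sketch: the function name `is_tenured` is hypothetical, and returning 'no' when the rule does not fire is an assumption, since the slide only states the 'yes' case.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: any tuple not covered by the rule is classified as 'no'.
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Unseen tuple from the slide: (name, rank, years)
jeff = ("Jeff", "Professor", 4)
print(jeff[0], "tenured?", is_tenured(jeff[1], jeff[2]))
```

Jeff is classified as tenured because his rank matches, even though his 4 years alone would not satisfy the rule.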
Issues Regarding Classification and Prediction (1): Data Preparation • Data cleaning • Preprocess data in order to reduce noise and handle missing values • Relevance analysis (feature selection) • Remove the irrelevant or redundant attributes • Data transformation • Generalize and/or normalize data
Issues regarding classification and prediction (2): Evaluating Classification Methods • Predictive accuracy • Speed and scalability • time to construct the model • time to use the model • Robustness • handling noise and missing values • Scalability • efficiency in disk-resident databases • Interpretability: • understanding and insight provided by the model • Goodness of rules • decision tree size • compactness of classification rules
Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Once the tree is built • Use of decision tree: classifying an unknown sample
Training Dataset • This follows an example from Quinlan’s ID3 (Han, Table 7.1) • Original data from DMPML, Ch. 4, Sec. 4.3, pp. 8-9, 89-94
Output: A Decision Tree for “buys_computer” • Root (14 cases): age? • age <=30 (5 cases): student? • student = no: no • student = yes: yes • age 31..40 (4 cases): yes • age >40 (5 cases): credit rating? • credit rating = excellent: no • credit rating = fair: yes
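Using the tree to classify an unknown sample amounts to following one path from the root to a leaf. The sketch below hard-codes the tree as derived later in these slides (age at the root, student under age <=30, credit rating under age >40); the function name `buys_computer` and the string encodings of the attribute values are assumptions made for illustration.

```python
def buys_computer(age: str, student: str, credit_rating: str) -> str:
    """Classify one customer by walking the buys_computer decision tree."""
    if age == "31..40":
        return "yes"                       # pure leaf: all 4 cases are buyers
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == ">40":
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age interval: {age!r}")


print(buys_computer("<=30", "yes", "fair"))       # young student
print(buys_computer(">40", "no", "excellent"))    # senior, excellent credit
```

Note that income never appears on any path, so it is not needed to classify a customer with this tree.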
In Practice • [Timeline: past, current] • For the i-th data object, • at time $I_i$ the input information is known • at time $O_i$ the output is assigned (yes/no) • For all data objects in the training data set (i = 1..14), both input and output are known, because $I_i$ and $O_i$ both lie in the past
In Practice • [Timeline: past, current, future] • For a new customer n, • at the current time $I_n$ the input information is known • but the output at time $O_n$ is not yet known • The customer has yet to be classified (yes or no) before her actual buying behavior is realized • The value of a data mining study is to predict buying behavior beforehand
ID3 Algorithm for Decision Tree Induction • ID3 algorithm Quinlan (1986) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
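The recursion described above can be sketched compactly in Python. This is an illustrative sketch rather than Quinlan's original implementation: the function names, the nested-dict tree representation, and the four-row toy dataset are all hypothetical. Because the sketch only branches on attribute values that actually occur at a node, the third stopping condition (no samples left) never arises here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning on attribute attr."""
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1:          # stop: all samples in the same class
        return labels[0]
    if not attrs:                      # stop: no attributes left, majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3([r for r, _ in pairs], [lab for _, lab in pairs], rest)
    return tree


# Hypothetical toy data (not the 14-case training set of the slides).
rows = [
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
]
labels = ["no", "no", "yes", "yes"]
tree = id3(rows, labels, ["student", "credit"])
print(tree)
```

On the toy data, student separates the classes perfectly while credit carries no information, so the tree consists of a single test on student.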
Entropy of a Simple Event • The amount of information needed to predict event s • The information of an event s is • $I(s) = -\log_2 p_s = \log_2(1/p_s)$ • $p_s$ is the probability of event s • Entropy is high for less probable events and low otherwise • As $p_s$ goes to zero • event s becomes rarer and rarer, and hence • its entropy approaches infinity • As $p_s$ goes to one • event s becomes more and more certain, and hence • its entropy approaches zero • Hence entropy is a measure of the randomness, disorder, or rareness of an event
Computational Formulas • $\log_2 1 = 0$, $\log_2 2 = 1$ • $\log_2 0.5 = \log_2(1/2) = \log_2 2^{-1} = -\log_2 2 = -1$ • $\log_2 0.25 = -2$ • Change-of-base formula: $\log_a x = \log_b x / \log_b a$ for any bases a and b • Choosing b = e makes it convenient to compute logarithms of any base: $\log_a x = \ln x / \ln a$, or in particular $\log_2 x = \ln x / \ln 2$ • [Figure: entropy $-\log_2 p$ plotted against probability p; the curve takes the value 2 at p = 0.25, 1 at p = 0.5, and 0 at p = 1]
Entropy of a simple event • Entropy can be interpreted as the number of bits needed to encode that event • If p = 1/2 • $-\log_2(1/2) = 1$ bit is required to encode that event • 0 and 1 with equal probability • If p = 1/4 • $-\log_2(1/4) = 2$ bits are required to encode that event • 00 01 10 11 are equally likely, and only one represents the specific event • If p = 1/8 • $-\log_2(1/8) = 3$ bits are required to encode that event • 000 001 010 011 100 101 110 111 are equally likely, and only one represents the specific event • If p = 1/3 • $-\log_2(1/3) = -\ln(1/3)/\ln 2 = 1.58$ bits are required to encode that event
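The bit counts above are easy to check numerically. The sketch below computes $-\log_2 p$ twice, once with the library base-2 logarithm and once with the change-of-base formula $\ln x / \ln 2$; the function names are hypothetical.

```python
from math import log, log2


def information_bits(p: float) -> float:
    """Bits needed to encode an event of probability p, i.e. -log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)


def information_bits_ln(p: float) -> float:
    """Same quantity via the change-of-base formula log2(x) = ln(x) / ln(2)."""
    return -log(p) / log(2)


for p in (1 / 2, 1 / 4, 1 / 8, 1 / 3):
    print(f"p = {p:.3f}: {information_bits(p):.2f} bits")
```

Both functions agree up to floating-point rounding, which is why the natural-log form is convenient on calculators that lack a base-2 logarithm.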
Composite events • Consider the four events A, B, C, D • A: tossing a fair coin where • $P_H = P_T = 1/2$ • B: tossing an unfair coin where • $P_H = 1/4$, $P_T = 3/4$ • C: tossing an unfair coin where • $P_H = 0.001$, $P_T = 0.999$ • D: tossing an unfair coin where • $P_H = 0.0$, $P_T = 1.0$ • Which of these events is the most certain? • How much information is needed to guess the result of each toss in A to D? • What is the expected information to predict the outcome of events A, B, C, D respectively?
Entropy of composite events • Average or expected information is highest when each outcome is equally likely • Event A • The expected information required to guess the outcome falls as the probability of head approaches either 0 or 1 • As $P_H$ goes to 0 and $P_T$ goes to 1 (moving to case D) • the outcome of tossing the coin becomes more and more certain • So for case D no information is needed, as the answer is already known to be tail • What is the expected information to predict the outcome when the probability of head is p in general? • $ent(S) = p[-\log_2 p] + (1-p)[-\log_2(1-p)]$ • $ent(S) = -p\log_2 p - (1-p)\log_2(1-p)$ • The lower the entropy, the more certain the outcome and the less information is needed to predict it • Entropy is the weighted average of the information of the simple events, weighted by their probabilities
Examples • When the event is certain: • $p_H = 1, p_T = 0$ or $p_H = 0, p_T = 1$ • $ent(S) = -1\log_2 1 - 0\log_2 0 = -1 \cdot 0 - 0 = 0$ • Note that $\lim_{x \to 0^+} x\log_2 x = 0$ • For a fair coin, $p_H = 0.5$, $p_T = 0.5$ • $ent(S) = -(1/2)\log_2(1/2) - (1/2)\log_2(1/2)$ • $= -(1/2)(-1) - (1/2)(-1) = 1$ • ent(S) is 1 when p = 0.5 and 1 - p = 0.5 • If the head and tail probabilities are unequal • the entropy is between 0 and 1
[Figure: entropy of tossing a coin, $-p\log_2 p - (1-p)\log_2(1-p)$, plotted against the probability of head p; the curve rises from 0 at p = 0 to its maximum of 1 at p = 0.5 and falls back to 0 at p = 1]
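The questions about coins A to D can be answered by evaluating the binary entropy formula at each head probability. A minimal sketch, in which the function name `coin_entropy` is hypothetical and the convention $0 \log_2 0 = 0$ is applied explicitly:

```python
from math import log2


def coin_entropy(p_head: float) -> float:
    """ent(S) = -p*log2(p) - (1-p)*log2(1-p), using the limit 0*log2(0) = 0."""
    total = 0.0
    for p in (p_head, 1.0 - p_head):
        if p > 0:                 # a zero-probability outcome contributes nothing
            total += -p * log2(p)
    return total


coins = {"A": 0.5, "B": 0.25, "C": 0.001, "D": 0.0}
for name, p_head in coins.items():
    print(f"{name}: P_H = {p_head}, entropy = {coin_entropy(p_head):.3f} bits")
```

The values decrease from A to D: the fair coin needs a full bit, while the outcome of coin D is already known and needs none.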
In general • Entropy is a measure of the (im)purity of a sample variable S and is defined as • $Ent(S) = \sum_{s \in S} p_s(-\log_2 p_s)$ • $= -\sum_{s \in S} p_s \log_2 p_s$ • s is a value of S, an element of the sample space • $p_s$ is the estimated or subjective probability of any $s \in S$ • Note that $\sum_{s} p_s = 1$
Information needed to classify an object • Class entropy is computed in a similar manner • Entropy is 0 if all members of S belong to the same class • no information is needed to classify an object • The entropy of a partition is the weighted average of the information of all classes • Total number of objects: S • There are two classes C1 and C2 • with cardinalities S1 and S2 • $I(S_1,S_2) = -(S_1/S)\log_2(S_1/S) - (S_2/S)\log_2(S_2/S)$ • Or in general, with m classes C1, C2, …, Cm • $I(S_1,S_2,\ldots,S_m) = -(S_1/S)\log_2(S_1/S) - (S_2/S)\log_2(S_2/S) - \ldots - (S_m/S)\log_2(S_m/S)$ • Probability of an object belonging to class i: $P_i = S_i/S$
Example cont.: At the root of the tree • There are 14 cases, S = 14 • 9 yes, denoted by Y: $S_1 = 9$, $p_y = 9/14$ • 5 no, denoted by N: $S_2 = 5$, $p_n = 5/14$ • Ys and Ns are roughly equally likely • How much information is needed on average to classify a person as buyer or not • without knowing any characteristics such as • age, income, …? • $I(S_1,S_2) = (S_1/S)(-\log_2(S_1/S)) + (S_2/S)(-\log_2(S_2/S))$ • $I(9,5) = (9/14)(-\log_2(9/14)) + (5/14)(-\log_2(5/14))$ • $= (9/14)(0.637) + (5/14)(1.485) = 0.940$ bits of information • close to 1
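The class entropy at the root can be reproduced from the class counts alone. The helper below is a sketch: the name `info` mirrors the slide notation I(S1, ..., Sm) but is otherwise an assumption, and zero counts are skipped according to the limit $x \log_2 x \to 0$.

```python
from math import log2


def info(*counts: int) -> float:
    """I(S1, ..., Sm): expected bits needed to classify an object, from class counts."""
    total = sum(counts)
    # Empty classes are skipped because x*log2(x) -> 0 as x -> 0+.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


print(f"I(9,5) = {info(9, 5):.3f} bits")   # root of the buys_computer tree
print(f"I(7,7) = {info(7, 7):.3f} bits")   # perfectly balanced classes
```

The same function works for any number of classes, for example `info(5, 4, 5)` for a three-way split.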
Expected information to classify with an attribute (1) • An attribute A with n distinct values a1, a2, .., an partitions the dataset into n distinct parts • $S_{i,j}$ is the number of objects in partition i (i = 1..n) • with class $C_j$ (j = 1..m) • The expected information to classify an object knowing the value of attribute A is the • weighted average of the entropies of all partitions • weighted by the frequency of each partition i
Expected information to classify with an attribute (2) • $Ent(A) = I(A) = \sum_{i=1}^{n} (a_i/S)\, I(S_{i,1},\ldots,S_{i,m})$ • $= (a_1/S)\, I(S_{1,1},\ldots,S_{1,m}) + (a_2/S)\, I(S_{2,1},\ldots,S_{2,m}) + \ldots + (a_n/S)\, I(S_{n,1},\ldots,S_{n,m})$ • $I(S_{i,1},\ldots,S_{i,m})$ is the entropy of partition i • $I(S_{i,1},\ldots,S_{i,m}) = \sum_{j=1}^{m} (S_{ij}/a_i)(-\log_2(S_{ij}/a_i))$ • $= -(S_{i1}/a_i)\log_2(S_{i1}/a_i) - (S_{i2}/a_i)\log_2(S_{i2}/a_i) - \ldots - (S_{im}/a_i)\log_2(S_{im}/a_i)$ • Here $a_i = \sum_{j=1}^{m} S_{ij}$ is the number of objects in partition i • $S_{ij}/a_i$ is the probability of class j in partition i
Information Gain • The gain in information from using the distinct values of attribute A is the reduction in entropy, i.e. in the information needed to classify an object • $Gain(A) = I(S_1,\ldots,S_m) - I(A)$ • = average information without knowing A minus average information knowing A • E.g. knowing such characteristics as • age interval, income interval • How much do they help to classify a new object? • Can information gain be negative? • Is it always greater than or equal to zero?
Another Example • Pick a student at random at BU • What is the chance that she is staying in a dorm? • Initially we have no specific information about her • If I ask for her initials • Does it help us in predicting the probability of her staying in a dorm? • No • If I ask for her address and record the city • Does it help us in predicting the chance of her staying in a dorm? • Yes
Attribute selection at the root • There are four attributes • age, income, student, credit rating • Which of these provides the most information for classifying a new customer, or equivalently, • which of these results in the highest information gain?
Testing age at the root node • Root: 14 cases, 9 Y, 5 N • age <=30: 5 cases, 2 Y, 3 N, I(2,3) = 0.971, decision: N • age 31..40: 4 cases, 4 Y, 0 N, I(4,0) = 0, decision: Y • age >40: 5 cases, 3 Y, 2 N, I(3,2) = 0.971, decision: Y • Accuracy: 10/14 • $Entropy(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)$ • $= (5/14)(-(2/5)\log_2(2/5) - (3/5)\log_2(3/5)) + (4/14)(-(4/4)\log_2(4/4) - (0/4)\log_2(0/4)) + (5/14)(-(3/5)\log_2(3/5) - (2/5)\log_2(2/5)) = 0.694$ • Gain(age) = 0.940 - 0.694 = 0.246
Expected information for age <=30 • If age is <=30 • The information needed to classify a new customer is I(2,3) = 0.971 bits, as the training data tell us that, • knowing that age <=30, • a customer buys with probability 0.4 • but does not buy with probability 0.6 • $I(2,3) = -(3/5)\log_2(3/5) - (2/5)\log_2(2/5)$ • $= 0.6 \cdot 0.737 + 0.4 \cdot 1.322 = 0.971$ • But what is the weight of the age range <=30? • 5 out of 14 samples are in that range • (5/14) I(2,3) is the weighted information needed to classify a customer as buyer or not
Information gain by age • gain(age) = I(Y,N) - Entropy(age) • = 0.940 - 0.694 • = 0.246 is the information gain • or the reduction in the entropy needed to classify a new object • Knowing the age of the customer increases our ability to classify her as buyer or not, or • helps us to predict her buying behavior
Attribute Selection by Information Gain Computation • Class Y: buys_computer = “yes” • Class N: buys_computer = “no” • I(Y, N) = I(9, 5) = 0.940 • Compute the entropy for age: $E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694$ • The term $(5/14) I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yeses and 3 nos • Hence Gain(age) = I(Y, N) - E(age) = 0.246 • Similarly, the gains for income, student, and credit rating are computed from their own partitions
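The computation for age can be reproduced from the (yes, no) counts of the three age partitions given in the slides. This is a sketch with hypothetical helper names `info` and `expected_info`; it keeps full precision, so the gain prints as 0.247, whereas the slides subtract the rounded values 0.940 and 0.694 and report 0.246.

```python
from math import log2


def info(*counts: int) -> float:
    """I(S1, ..., Sm) in bits; empty classes contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def expected_info(partitions) -> float:
    """Entropy(A): partition entropies weighted by partition size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions if sum(p) > 0)


age = [(2, 3), (4, 0), (3, 2)]        # (yes, no) counts for <=30, 31..40, >40
root = info(9, 5)                     # information needed without any attribute
e_age = expected_info(age)
gain_age = root - e_age
print(f"I(9,5) = {root:.3f}, Entropy(age) = {e_age:.3f}, Gain(age) = {gain_age:.3f}")
```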
Testing income at the root node • Root: 14 cases, 9 Y, 5 N • income = low: 4 cases, 3 Y, 1 N, I(3,1) = ?, decision: Y • income = medium: 6 cases, 4 Y, 2 N, I(4,2) = ?, decision: Y • income = high: 4 cases, 2 Y, 2 N, I(2,2) = 1, decision: ? • Accuracy: 9/14 • $Entropy(income) = (4/14)(-(3/4)\log_2(3/4) - (1/4)\log_2(1/4)) + (6/14)(-(4/6)\log_2(4/6) - (2/6)\log_2(2/6)) + (4/14)(-(2/4)\log_2(2/4) - (2/4)\log_2(2/4)) = 0.911$ • Gain(income) = 0.940 - 0.911 = 0.029
Testing student at the root node • Root: 14 cases, 9 Y, 5 N • student = no: 7 cases, 3 Y, 4 N, I(3,4) = ?, decision: N • student = yes: 7 cases, 6 Y, 1 N, I(6,1) = ?, decision: Y • Accuracy: 10/14 • $Entropy(student) = (7/14)(-(3/7)\log_2(3/7) - (4/7)\log_2(4/7)) + (7/14)(-(6/7)\log_2(6/7) - (1/7)\log_2(1/7)) = 0.789$ • Gain(student) = 0.940 - 0.789 = 0.151
Testing credit rating at the root node • Root: 14 cases, 9 Y, 5 N • credit rating = fair: 8 cases, 6 Y, 2 N, I(6,2) = ?, decision: Y • credit rating = excellent: 6 cases, 3 Y, 3 N, I(3,3) = 1, decision: ? • Accuracy: 9/14 • $Entropy(credit\ rating) = (8/14)(-(6/8)\log_2(6/8) - (2/8)\log_2(2/8)) + (6/14)(-(3/6)\log_2(3/6) - (3/6)\log_2(3/6)) = 0.892$ • Gain(credit rating) = 0.940 - 0.892 = 0.048
Comparing gains • Information gains for the attributes at the root node: • Gain(age) = 0.246 • Gain(income) = 0.029 • Gain(student) = 0.151 • Gain(credit rating) = 0.048 • Age provides the highest gain in information • Age is chosen as the attribute at the root node • Branch according to the distinct values of age
After selecting age: first level of the tree • Root (14 cases): age? • age <=30 (5 cases): continue splitting • age 31..40 (4 cases): yes • age >40 (5 cases): continue splitting
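The comparison at the root can be checked by computing all four gains from the (yes, no) counts shown on the preceding "Testing ... at the root node" slides. This is a sketch with hypothetical names; the values are computed at full precision, so they can differ in the third decimal from slide figures that were obtained by subtracting rounded intermediate values.

```python
from math import log2


def info(*counts: int) -> float:
    """I(S1, ..., Sm) in bits; empty classes contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain(parent, partitions) -> float:
    """Information gain of a split: parent entropy minus weighted child entropy."""
    total = sum(parent)
    remainder = sum(sum(p) / total * info(*p) for p in partitions)
    return info(*parent) - remainder


# (yes, no) counts per attribute value, as read from the slides.
splits = {
    "age": [(2, 3), (4, 0), (3, 2)],             # <=30, 31..40, >40
    "income": [(3, 1), (4, 2), (2, 2)],          # low, medium, high
    "student": [(3, 4), (6, 1)],                 # no, yes
    "credit rating": [(6, 2), (3, 3)],           # fair, excellent
}
gains = {attr: gain((9, 5), parts) for attr, parts in splits.items()}
for attr, g in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Gain({attr}) = {g:.3f}")
best = max(gains, key=gains.get)
print("root attribute:", best)
```

Sorting the gains makes the ranking explicit: age first, then student, credit rating, and income last.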
Attribute selection at the age <=30 node • Node: 5 cases, 2 Y, 3 N • student: yes 2 Y 0 N, no 0 Y 3 N • income: high 0 Y 2 N, medium 1 Y 1 N, low 1 Y 0 N • credit rating: excellent 1 Y 1 N, fair 1 Y 2 N • The information gain for student is the highest, as knowing whether the customer is a student or not provides perfect information to classify her buying behavior
Attribute selection at the age >40 node • Node: 5 cases, 3 Y, 2 N • student: yes 2 Y 1 N, no 1 Y 1 N • income: high 0 Y 0 N, medium 2 Y 1 N, low 1 Y 1 N • credit rating: excellent 0 Y 2 N, fair 3 Y 0 N • The information gain for credit rating is the highest, as knowing the customer's credit rating provides perfect information to classify her buying behavior
Exercise • Calculate all information gains in the second level of the tree that is after branching by the distinct values of age
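One way to check an answer to this exercise is to compute the gains from the (yes, no) counts listed on the two attribute-selection slides for the age <=30 and age >40 nodes. The sketch below uses hypothetical helper names and treats an empty partition (income = high under age >40) as contributing zero.

```python
from math import log2


def info(*counts: int) -> float:
    """I(S1, ..., Sm) in bits; an empty partition is defined to have entropy 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain(parent, partitions) -> float:
    """Information gain of a split: parent entropy minus weighted child entropy."""
    total = sum(parent)
    remainder = sum(sum(p) / total * info(*p) for p in partitions)
    return info(*parent) - remainder


second_level = {
    "age <=30": ((2, 3), {
        "student": [(2, 0), (0, 3)],              # yes, no
        "income": [(0, 2), (1, 1), (1, 0)],       # high, medium, low
        "credit rating": [(1, 1), (1, 2)],        # excellent, fair
    }),
    "age >40": ((3, 2), {
        "student": [(2, 1), (1, 1)],
        "income": [(0, 0), (2, 1), (1, 1)],
        "credit rating": [(0, 2), (3, 0)],
    }),
}
best = {}
for node, (parent, splits) in second_level.items():
    node_gains = {attr: gain(parent, parts) for attr, parts in splits.items()}
    best[node] = max(node_gains, key=node_gains.get)
    print(node, {a: round(g, 3) for a, g in node_gains.items()}, "->", best[node])
```

In both nodes the winning attribute produces only pure partitions, so its gain equals the full node entropy I(2,3) = I(3,2) = 0.971.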
Output: A Decision Tree for “buys_computer” • Root (14 cases): age? • age <=30 (5 cases, 2 Y 3 N): student? • student = no: no (3 N) • student = yes: yes (2 Y) • age 31..40 (4 cases): yes • age >40 (5 cases, 3 Y 2 N): credit rating? • credit rating = excellent: no (2 N) • credit rating = fair: yes (3 Y) • Accuracy: 14/14 on the training set
Advantages and Disadvantages of Decision Trees • Advantages: • Easy to understand, and map nicely to production rules • Suitable for categorical as well as numerical inputs • No statistical assumptions about the distribution of attributes • Generation, and application to classify unknown outputs, is very fast • Disadvantages: • Output attributes must be categorical • Unstable: slight variations in the training data may result in different attribute selections and hence different trees • Numerical input attributes lead to complex trees, as attribute splits are usually binary • Not suitable for non-rectangular regions, such as regions separated by a linear or nonlinear combination of attributes • i.e. by lines (in 2 dimensions), planes (in 3 dimensions), or in general by hyperplanes (n dimensions)
A classification problem for which decision trees are not suitable • [Figure: scatter plot of income versus age; the points of classes X and O lie on opposite sides of the oblique line AA] • The two classes X and O are separated by line AA • Decision trees are not suitable for this problem
Other Attribute Selection Measures • Gain Ratio • Gini index (CART, IBM IntelligentMiner) • All attributes are assumed continuous-valued • Assume there exist several possible split values for each attribute • May need other tools, such as clustering, to get the possible split values • Can be modified for categorical attributes
Gain Ratio • Add another attribute: the transaction ID (TID) • TID is different for each observation • E(TID) = (1/14) I(1,0) + (1/14) I(1,0) + (1/14) I(1,0) + ... + (1/14) I(1,0) = 0 • gain(TID) = 0.940 - 0 = 0.940 • This is the highest gain, so TID would be chosen as the test attribute • which makes no sense • Use gain ratio rather than gain • Split information: a measure of the information value of a split • without considering class information • using only the number and size of the child nodes • A kind of normalization for information gain
Split information = $-\sum_{i} (S_i/S)\log_2(S_i/S)$ • the information needed to assign an instance to one of these branches • Gain ratio(A) = gain(A) / split information(A) • In the previous example • Split info and gain ratio for TID: • split info(TID) = $-[(1/14)\log_2(1/14)] \cdot 14 = 3.807$ • gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247 • Split info for age: I(5,4,5) = $-(5/14)\log_2(5/14) - (4/14)\log_2(4/14) - (5/14)\log_2(5/14) = 1.577$ • gain ratio(age) = gain(age)/split info(age) • = 0.246/1.577 = 0.156
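The effect of the normalization can be seen by computing gain, split information, and gain ratio for TID and age side by side. This is a sketch under stated assumptions: the helper names are hypothetical, TID is modelled as 14 single-case partitions (9 buyers and 5 non-buyers), and a split with a single non-empty branch would have split information 0 and is not handled here.

```python
from math import log2


def info(*counts: int) -> float:
    """I(S1, ..., Sm) in bits; empty classes contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain(parent, partitions) -> float:
    """Information gain: parent entropy minus weighted child entropy."""
    total = sum(parent)
    return info(*parent) - sum(sum(p) / total * info(*p) for p in partitions)


def split_info(partitions) -> float:
    """Entropy of the partition sizes, ignoring class labels."""
    return info(*[sum(p) for p in partitions])


def gain_ratio(parent, partitions) -> float:
    return gain(parent, partitions) / split_info(partitions)


root = (9, 5)
tid = [(1, 0)] * 9 + [(0, 1)] * 5        # one pure partition per transaction
age = [(2, 3), (4, 0), (3, 2)]           # <=30, 31..40, >40

for name, parts in (("TID", tid), ("age", age)):
    print(f"{name}: gain = {gain(root, parts):.3f}, "
          f"split info = {split_info(parts):.3f}, "
          f"gain ratio = {gain_ratio(root, parts):.3f}")
```

Dividing by the split information shrinks the advantage of TID from 0.940 versus 0.246 in raw gain to roughly 0.247 versus 0.156, because a 14-way split is itself expensive to encode.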