Data Warehousing and Data Mining — Chapter 7 —

Presentation Transcript


  1. Data Warehousing and Data Mining — Chapter 7 — MIS 542, 2013-2014 Fall

  2. Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary

  3. Classification vs. Prediction • Classification: • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values • Typical Applications • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis

  4. Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is the training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: for classifying future or unknown objects • Estimate the accuracy of the model • The known label of a test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • The test set is independent of the training set, otherwise over-fitting will occur • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
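As an illustrative sketch of the two-step process (not part of the original slides), the snippet below constructs a classifier on a training set and then estimates its accuracy on an independent test set; scikit-learn and its bundled iris data are assumed purely for convenience.

```python
# A minimal sketch of the two-step process, assuming scikit-learn is available;
# the iris dataset stands in for any labeled training data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction, using the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage. First estimate accuracy on the independent test set;
# if acceptable, the model can then classify tuples whose labels are unknown.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```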

  5. Classification Process (1): Model Construction • [Figure: the training data are fed to a classification algorithm, which produces the classifier (model)] • Example rule learned: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

  6. Classification Process (2): Use the Model in Prediction • [Figure: the classifier is applied first to the testing data, then to unseen data] • Example: for the unseen tuple (Jeff, Professor, 4), the model predicts whether tenured = ‘yes’
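A tiny sketch (not from the slides) of applying the rule learned on the previous slide to the unseen tuple; the function and argument names are hypothetical.

```python
# Applying the learned rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
# to the unseen tuple (Jeff, Professor, 4); names are illustrative only.
def predict_tenured(name: str, rank: str, years: int) -> str:
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(predict_tenured("Jeff", "Professor", 4))  # -> yes (rank matches the rule)
```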

  7. Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

  8. Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary

  9. Issues Regarding Classification and Prediction (1): Data Preparation • Data cleaning • Preprocess data in order to reduce noise and handle missing values • Relevance analysis (feature selection) • Remove the irrelevant or redundant attributes • Data transformation • Generalize and/or normalize data

  10. Issues regarding classification and prediction (2): Evaluating Classification Methods • Predictive accuracy • Speed and scalability • time to construct the model • time to use the model • Robustness • handling noise and missing values • Scalability • efficiency in disk-resident databases • Interpretability: • understanding and insight provided by the model • Goodness of rules • decision tree size • compactness of classification rules

  11. Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary

  12. Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Once the tree is built • Use of decision tree: classifying an unknown sample

  13. Training Dataset • This follows an example from Quinlan’s ID3 (Han, Table 7.1); original data from DMPML Ch. 4, Sec. 4.3, pp. 8-9, 89-94 • The table itself appeared as an image on the slide; a reconstruction follows below
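The training table was an image on the original slide. The list below is a reconstruction of Han's Table 7.1 (the AllElectronics "buys_computer" data the course follows); it matches the per-attribute class counts quoted on the following slides, but treat it as a reconstruction rather than a transcript.

```python
# Reconstructed training data (Han, Table 7.1); columns are
# (age, income, student, credit_rating, buys_computer).
training_data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
```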

  14. Output: A Decision Tree for “buys_computer” • Root: age? (14 cases) • age <=30 (5 cases): test student? (no → no, yes → yes) • age 31..40 (4 cases): leaf yes • age >40 (5 cases): test credit_rating? (excellent → no, fair → yes)
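The tree on this slide can be transcribed directly into code; a small sketch with illustrative names:

```python
# A straight transcription of the slide's decision tree for "buys_computer";
# the function and argument names are just for illustration.
def buys_computer(age: str, student: str, credit_rating: str) -> str:
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age > 40
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))      # yes
print(buys_computer(">40", "no", "excellent"))   # no
```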

  15. In Practice • Timeline: past → current • For the i-th data object, the input information is known at time Ii and the output (yes/no) is assigned at time Oi • For all data objects in the training data set (i = 1..14) both input and output are known

  16. In Practice • Timeline: past → current → future • For a new customer n, the input information is known at the current time In • But the output On is not known • It is yet to be classified (yes or no) before the actual buying behavior is realized • The value of a data mining study lies in predicting buying behavior beforehand

  17. ID3 Algorithm for Decision Tree Induction • ID3 algorithm Quinlan (1986) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
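A compact, self-contained sketch of ID3-style induction as described on this slide (top-down, recursive, divide-and-conquer, information gain as the selection measure); the data representation (list of dicts with a label key) is an assumption made for illustration, not the slides' notation.

```python
# Minimal ID3-style sketch: pick the attribute with the highest information
# gain, split, and recurse; stop on pure nodes or when attributes run out.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label="class"):
    labels = [r[label] for r in rows]
    remainder = sum(
        len(part) / len(rows) * entropy(part)
        for part in ([r[label] for r in rows if r[attr] == v]
                     for v in set(r[attr] for r in rows)))
    return entropy(labels) - remainder

def id3(rows, attrs, label="class"):
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:          # all samples in the same class
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, label))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best], label)
                   for v in set(r[best] for r in rows)}}

# Tiny usage example on hypothetical rows:
toy = [{"student": "no", "class": "no"}, {"student": "no", "class": "no"},
       {"student": "yes", "class": "yes"}, {"student": "yes", "class": "yes"}]
print(id3(toy, ["student"]))  # e.g. {'student': {'no': 'no', 'yes': 'yes'}}
```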

  18. Entropy of a Simple Event • The average amount of information needed to predict event s • The expected information of an event s is • I(s) = -log2 ps = log2(1/ps) • ps is the probability of event s • Entropy is high for less probable events and low otherwise • as ps goes to zero • event s becomes rarer and rarer, and hence • its entropy approaches infinity • as ps goes to one • event s becomes more and more certain, and hence • its entropy approaches zero • Hence entropy is a measure of the randomness, disorder, or rareness of an event

  19. Computational Formulas • log2 1 = 0, log2 2 = 1 • log2 0.5 = log2 1/2 = log2 2^(-1) = -log2 2 = -1 • log2 0.25 = -2 • Change of base: loga x = logb x / logb a for any bases a and b • Choosing b = e makes it convenient to compute logarithms in any base: loga x = ln x / ln a, and in particular log2 x = ln x / ln 2 • [Figure: plot of entropy = -log2 p against probability p, passing through (0.25, 2), (0.5, 1) and (1, 0)]

  20. Entropy of a Simple Event • Entropy can be interpreted as the number of bits needed to encode that event • if p = 1/2, • -log2 1/2 = 1 bit is required to encode that event • 0 and 1 with equal probability • if p = 1/4 • -log2 1/4 = 2 bits are required to encode that event • 00 01 10 11 are each equally likely and only one represents the specific event • if p = 1/8 • -log2 1/8 = 3 bits are required to encode that event • 000 001 010 011 100 101 110 111 are each equally likely and only one represents the specific event • if p = 1/3 • -log2 1/3 = -(loge 1/3)/loge 2 = 1.58 bits are required to encode that event
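The bit counts above can be checked directly; a small illustrative snippet (not from the slides):

```python
# Information content of an event with probability p is -log2(p) bits.
from math import log2

for p in (1/2, 1/4, 1/8, 1/3):
    print(f"p = {p:.3f} -> {-log2(p):.2f} bits")
# p = 0.500 -> 1.00 bits, p = 0.250 -> 2.00 bits,
# p = 0.125 -> 3.00 bits, p = 0.333 -> 1.58 bits
```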

  21. Composite Events • Consider the four events A, B, C, D • A: tossing a fair coin where • PH = PT = ½ • B: tossing an unfair coin where • PH = ¼, PT = 3/4 • C: tossing an unfair coin where • PH = 0.001, PT = 0.999 • D: tossing an unfair coin where • PH = 0.0, PT = 1.0 • Which of these events is most certain? • How much information is needed to guess the result of each toss in A to D? • What is the expected information to predict the outcome of events A, B, C, D respectively?

  22. Entropy of Composite Events • Average or expected information is highest when each outcome is equally likely • Event A • The expected information required to guess the outcome falls as the probability of head goes to either 0 or 1 • as PH goes to 0 or PT goes to 1: moving to case D • the composite event of tossing the coin becomes more and more certain • So for case D no information is needed, as the answer is already known to be tail • What is the expected information to predict the outcome when the probability of head is p in general? • ent(S) = p[-log2 p] + (1-p)[-log2(1-p)] • ent(S) = -p log2 p - (1-p) log2(1-p) • The lower the entropy, the more certain the outcome and the less information is needed to predict it • Weighted average of simple events, weighted by their probabilities

  23. Examples • When the event is certain: • pH = 1, pT = 0 or pH = 0, pT = 1 • ent(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0 = 0 • Note that lim(x→0+) x log2 x = 0 • For a fair coin pH = 0.5, pT = 0.5 • ent(S) = -(1/2) log2(1/2) - (1/2) log2(1/2) = -1/2(-1) - 1/2(-1) = 1 • Ent(S) is 1 when p = 0.5 and 1-p = 0.5 • if the head and tail probabilities are unequal • entropy is between 0 and 1
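A minimal binary-entropy helper matching ent(S) = -p·log2 p - (1-p)·log2(1-p), using the limit convention 0·log2 0 = 0 from this slide; the values for cases A to D follow the coin examples above.

```python
from math import log2

def binary_entropy(p: float) -> float:
    # Skip zero-probability terms, i.e. use the convention 0*log2(0) = 0.
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

print(binary_entropy(0.5))    # 1.0     (fair coin, case A)
print(binary_entropy(0.25))   # ~0.811  (case B)
print(binary_entropy(0.001))  # ~0.011  (case C, almost certain)
print(binary_entropy(0.0))    # 0       (case D, completely certain)
```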

  24. [Figure: entropy = -p log2 p - (1-p) log2(1-p) plotted against the probability of head for the coin-tossing event; the curve is 0 at p = 0 and p = 1 and reaches its maximum of 1 at p = 0.5]

  25. In general • Entropy, as a measure of the (im)purity of a sample variable S, is defined as • Ent(S) = Σ(s∈S) ps(-log2 ps) = -Σ(s∈S) ps log2 ps • s is a value of S, an element of the sample space • ps is the estimated or subjective probability of any s ∈ S • Note that Σs ps = 1

  26. Information needed to classify an object • Class entropy is computed in a similar manner • Entropy is 0 if all members of S belong to the same class • no information is needed to classify an object • the entropy of a partition is the weighted average of the entropies of all classes • Total number of objects: S • There are two classes C1 and C2 • with cardinalities S1 and S2 • I(S1,S2) = -(S1/S)·log2(S1/S) - (S2/S)·log2(S2/S) • Or in general with m classes C1, C2, …, Cm • I(S1,S2,…,Sm) = -(S1/S)·log2(S1/S) - (S2/S)·log2(S2/S) - … - (Sm/S)·log2(Sm/S) • Probability of an object belonging to class i: Pi = Si/S

  27. Example cont.: At the root of the tree • There are 14 cases S = 14 • 9 yes denoted by Y, S1 = 9, py = 9/14 • 5 no denoted by N, S2 = 5, pn = 5/14 • Y s and N s are almost equally likely • How much information on the average is needed to classify a person as buyer or not • Without knowing any characteristics such as • age income … • I(S1,S2)=(S1/S)(-log2S1/S) +(S2/S)(-log2S2/S) • I(9,5)=(9/14)(-log29/14) + (5/14)(-log25/14) • =(9/14)(0.637) + (5/14)(1.485) = 0.940 bits of information • close to 1
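The root entropy I(9,5) quoted above can be verified with a short snippet (illustrative, not from the slides):

```python
# Class entropy at the root from the counts on this slide: 9 yes, 5 no.
from math import log2

def class_entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(class_entropy([9, 5]), 3))  # 0.94 bits, matching I(9,5) = 0.940
```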

  28. Expected information to classify with an attribute (1) • An attribute A with n distinct values a1, a2, …, an partitions the dataset into n distinct parts • Si,j is the number of objects in partition i (i = 1..n) • with class Cj (j = 1..m) • The expected information to classify an object knowing the value of attribute A is the • weighted average of the entropies of all partitions • weighted by the frequency of each partition i

  29. Expected information to classify with an attribute (2) • Ent(A) = I(A) = Σ(i=1..n) (ai/S)·I(Si,1,…,Si,m) • = (a1/S)·I(S1,1,…,S1,m) + (a2/S)·I(S2,1,…,S2,m) + … + (an/S)·I(Sn,1,…,Sn,m) • I(Si,1,…,Si,m) is the entropy of partition i • I(Si,1,…,Si,m) = Σ(j=1..m) (Sij/ai)(-log2 Sij/ai) • = -(Si1/ai)·log2(Si1/ai) - (Si2/ai)·log2(Si2/ai) - … - (Sim/ai)·log2(Sim/ai) • ai = Σ(j=1..m) Sij = the number of objects in partition i • Here Sij/ai is the probability of class j in partition i

  30. Information Gain • The gain in information from using the distinct values of attribute A is the reduction in entropy, or in the information needed to classify an object • Gain(A) = I(S1,…,Sm) – I(A) • = average information without knowing A – average information knowing A • E.g. knowing characteristics such as age interval or income interval: • how much do they help to classify a new object? • Can information gain be negative? • Is it always greater than or equal to zero?
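A self-contained sketch of the gain formula in terms of class counts (an illustration, not the slides' notation): pass the overall class counts and the per-partition class counts.

```python
# Gain(A) = I(S1..Sm) - I(A), computed from class counts.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(total_counts, partition_counts):
    total = sum(total_counts)
    expected = sum(sum(part) / total * entropy(part) for part in partition_counts)
    return entropy(total_counts) - expected

# Attribute "age" in the running example: partitions (2Y,3N), (4Y,0N), (3Y,2N).
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))
# 0.247 (reported as 0.246 on the slides, which round intermediate values)
```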

  31. Another Example • Pick a student at random at BU • What is the chance that she is staying in a dorm? • Initially we have no specific information about her • If I ask her initials • does that help us in predicting the probability of her staying in a dorm? • No • If I ask her address and record the city • does that help us in predicting the chance of her staying in a dorm? • Yes

  32. Attribute selection at the root • There are four attributes • age, income, student, credit rating • Which of these provides the highest information in classifying a new customer, or equivalently • Which of these results in the highest information gain?

  33. Testing age at the root node • 14 cases: 9 Y, 5 N • age <=30: 5 cases, 2 Y 3 N, I(2,3) = 0.971, decision N • age 31..40: 4 cases, 4 Y 0 N, I(4,0) = 0, decision Y • age >40: 5 cases, 3 Y 2 N, I(3,2) = 0.971, decision Y • Accuracy: 10/14 • Entropy(age) = 5/14·I(2,3) + 4/14·I(4,0) + 5/14·I(3,2) = 5/14(-3/5 log2 3/5 - 2/5 log2 2/5) + 4/14(-4/4 log2 4/4 - 0/4 log2 0/4) + 5/14(-3/5 log2 3/5 - 2/5 log2 2/5) = 0.694 • Gain(age) = 0.940 – 0.694 = 0.246
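The numbers on this slide can be reproduced directly from the class counts (an illustrative check, not part of the slides):

```python
# Weighted entropy of the age split and the resulting information gain.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

root = [9, 5]                         # 9 yes, 5 no at the root
age_split = [[2, 3], [4, 0], [3, 2]]  # <=30, 31..40, >40
n = sum(root)

ent_age = sum(sum(part) / n * entropy(part) for part in age_split)
print(round(ent_age, 3))                  # 0.694
print(round(entropy(root) - ent_age, 3))  # 0.247 (0.246 on the slide after rounding)
```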

  34. Expected information for age <=30 • If age is <=30 • the information needed to classify a new customer is • I(2,3) = 0.971 bits, as the training data tell us that • knowing that age <=30 • with probability 0.4 a customer buys • but with probability 0.6 she does not • I(2,3) = -(3/5) log2 3/5 - (2/5) log2 2/5 = 0.6·0.737 + 0.4·1.322 = 0.971 • But what is the weight of the age range <=30? • 5 out of 14 samples are in that range • (5/14)·I(2,3) is the weighted information needed to classify a customer as buyer or not

  35. Information gain by age • gain(age)= I(Y,N)-Entropy(age) • = 0.940 – 0.694 • =0.246 is the information gain • Or reduction of entropy to classify a new object • Knowing the age of the customer increases our ability to classify her as buyer or not or • Help us to predict her buying behavior

  36. Attribute Selection by Information Gain Computation • Class Y: buys_computer = “yes”; Class N: buys_computer = “no” • I(Y, N) = I(9, 5) = 0.940 • Compute the entropy for age: E(age) = 5/14·I(2,3) + 4/14·I(4,0) + 5/14·I(3,2) = 0.694, where 5/14·I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s • Hence Gain(age) = I(9,5) – E(age) = 0.246 • Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

  37. Testing income at the root node • 14 cases: 9 Y, 5 N • income low: 4 cases, 3 Y 1 N, I(3,1) = 0.811, decision Y • income medium: 6 cases, 4 Y 2 N, I(4,2) = 0.918, decision Y • income high: 4 cases, 2 Y 2 N, I(2,2) = 1, decision ? • Accuracy: 9/14 • Entropy(income) = 4/14(-3/4 log2 3/4 - 1/4 log2 1/4) + 6/14(-4/6 log2 4/6 - 2/6 log2 2/6) + 4/14(-2/4 log2 2/4 - 2/4 log2 2/4) = 0.911 • Gain(income) = 0.940 – 0.911 = 0.029

  38. Testing student at the root node • 14 cases: 9 Y, 5 N • student no: 7 cases, 3 Y 4 N, I(3,4) = 0.985, decision N • student yes: 7 cases, 6 Y 1 N, I(6,1) = 0.592, decision Y • Accuracy: 10/14 • Entropy(student) = 7/14(-3/7 log2 3/7 - 4/7 log2 4/7) + 7/14(-6/7 log2 6/7 - 1/7 log2 1/7) = 0.789 • Gain(student) = 0.940 – 0.789 = 0.151

  39. Testing credit rating at the root node • 14 cases: 9 Y, 5 N • credit fair: 8 cases, 6 Y 2 N, I(6,2) = 0.811, decision Y • credit excellent: 6 cases, 3 Y 3 N, I(3,3) = 1, decision ? • Accuracy: 9/14 • Entropy(credit_rating) = 8/14(-6/8 log2 6/8 - 2/8 log2 2/8) + 6/14(-3/6 log2 3/6 - 3/6 log2 3/6) = 0.892 • Gain(credit_rating) = 0.940 – 0.892 = 0.048

  40. Comparing gains • Information gains for the attributes at the root node: • Gain(age) = 0.246 • Gain(income) = 0.029 • Gain(student) = 0.151 • Gain(credit_rating) = 0.048 • Age provides the highest gain in information • Age is chosen as the attribute at the root node • Branch according to the distinct values of age

  41. After selecting age: first level of the tree • Root: age? (14 cases) • age <=30 (5 cases): continue • age 31..40 (4 cases): leaf yes • age >40 (5 cases): continue

  42. Attribute selection at the age <=30 node • 5 cases: 2 Y, 3 N • student: yes → 2 Y 0 N, no → 0 Y 3 N • income: high → 0 Y 2 N, medium → 1 Y 1 N, low → 1 Y 0 N • credit_rating: excellent → 1 Y 1 N, fair → 1 Y 2 N • Information gain for student is the highest, as knowing whether the customer is a student or not provides perfect information to classify her buying behavior

  43. Attribute selection at the age >40 node • 5 cases: 3 Y, 2 N • student: yes → 2 Y 1 N, no → 1 Y 1 N • income: high → 0 Y 0 N, medium → 2 Y 1 N, low → 1 Y 1 N • credit_rating: excellent → 0 Y 2 N, fair → 3 Y 0 N • Information gain for credit rating is the highest, as knowing the customer's credit rating provides perfect information to classify her buying behavior

  44. Exercise • Calculate all information gains in the second level of the tree that is after branching by the distinct values of age

  45. Output: A Decision Tree for “buys_computer” • Root: age? (14 cases) • age <=30 (5 cases, 2 Y 3 N): student? (no → no [3 N], yes → yes [2 Y]) • age 31..40 (4 cases): leaf yes • age >40 (5 cases, 3 Y 2 N): credit_rating? (excellent → no [2 N], fair → yes [3 Y]) • Accuracy 14/14 on the training set

  46. Advantages and Disadvantages of Decision Trees • Advantages: • Easy to understand and maps nicely to production rules • Suitable for categorical as well as numerical inputs • No statistical assumptions about the distribution of attributes • Generation, and application to classify unknown outputs, is very fast • Disadvantages: • Output attributes must be categorical • Unstable: slight variations in the training data may result in different attribute selections and hence different trees • Numerical input attributes lead to complex trees, as attribute splits are usually binary • Not suitable for non-rectangular regions, such as regions separated by linear or nonlinear combinations of attributes • by lines (in 2 dimensions), planes (in 3 dimensions), or in general hyperplanes (in n dimensions)

  47. A classification problem for which decision trees are not suitable • [Figure: scatter plot of x and o points in the income–age plane; the two classes X and O are separated by an oblique line AA] • Decision trees are not suitable for this problem

  48. Other Attribute Selection Measures • Gain Ratio • Gini index (CART, IBM IntelligentMiner) • All attributes are assumed continuous-valued • Assume there exist several possible split values for each attribute • May need other tools, such as clustering, to get the possible split values • Can be modified for categorical attributes
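As a small illustrative sketch (not from the slides), the Gini index of a node and the size-weighted Gini of a binary split, as used in CART, can be computed as follows; the example split, age <=30 versus the rest, is a hypothetical binary split of the running dataset.

```python
# Gini index of a node with class counts, and the weighted Gini of a binary split.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    n = sum(left_counts) + sum(right_counts)
    return (sum(left_counts) / n) * gini(left_counts) + \
           (sum(right_counts) / n) * gini(right_counts)

print(round(gini([9, 5]), 3))                # 0.459 at the root of the example
print(round(gini_split([2, 3], [7, 2]), 3))  # ~0.394 for splitting age <=30 vs. the rest
```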

  49. Gain Ratio • Add another attribute, the transaction ID (TID) • TID is different for each observation • E(TID) = (1/14)·I(1,0) + (1/14)·I(1,0) + (1/14)·I(1,0) + … + (1/14)·I(1,0) = 0 • gain(TID) = 0.940 - 0 = 0.940 • the highest gain, so TID would be the test attribute • which makes no sense • use gain ratio rather than gain • Split information: a measure of the information value of the split • without considering class information • only the number and size of child nodes • A kind of normalization for information gain

  50. Split information = Σi -(Si/S) log2(Si/S) • the information needed to assign an instance to one of these branches • Gain ratio = gain(S) / split information(S) • In the previous example • Split info and gain ratio for TID: • split info(TID) = -[(1/14) log2(1/14)]·14 = log2 14 = 3.807 • gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247 • Split info for age: I(5,4,5) = -(5/14) log2 5/14 - (4/14) log2 4/14 - (5/14) log2 5/14 = 1.577 • gain ratio(age) = gain(age)/split info(age) = 0.246/1.577 = 0.156
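The split-information and gain-ratio numbers on this slide can be checked with a short snippet (illustrative; the gain values are taken from the earlier slides):

```python
# Split information uses only partition sizes, not class labels.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_info(partition_sizes):
    return entropy(partition_sizes)

gain_tid, gain_age = 0.940, 0.246                   # gains from earlier slides
print(round(split_info([1] * 14), 3))               # 3.807 for the 14-way TID split
print(round(gain_tid / split_info([1] * 14), 3))    # ~0.247 gain ratio for TID
print(round(split_info([5, 4, 5]), 3))              # 1.577 for the age split
print(round(gain_age / split_info([5, 4, 5]), 3))   # ~0.156 gain ratio for age
```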
