Explore the concepts of classification and prediction in data warehousing and data mining, including methods such as decision tree induction, Bayesian classification, neural networks, Support Vector Machines (SVM), and association rule mining. Learn about classification accuracy, training data, classifier models, and the two-step process of model construction and usage. Understand supervised vs. unsupervised learning and issues related to data preparation, evaluation, and interpretability of classification methods.
Data Warehousing and Data Mining — Chapter 7 — MIS 542, 2013-2014 Fall
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Classification vs. Prediction • Classification: • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values • Typical Applications • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis
Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is the training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: classifying future or unknown objects • Estimate the accuracy of the model • The known label of a test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • The test set is independent of the training set, otherwise over-fitting will occur • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (see the sketch below)
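To make the two-step process concrete, here is a minimal sketch (not part of the original slides) using scikit-learn; the file name and column names are hypothetical.

```python
# Minimal sketch of the two-step process (model construction, then model usage)
# using scikit-learn. The dataset, column names, and file name are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("customers.csv")                        # hypothetical labeled data
X = pd.get_dummies(data.drop(columns=["buys_computer"]))   # predictor attributes
y = data["buys_computer"]                                  # class label attribute

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set,
# then classify tuples whose labels are unknown
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```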
Classification Process (1): Model Construction. [Figure] The training data table is fed to a classification algorithm, which outputs the classifier (model), e.g., IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction. [Figure] The classifier is first evaluated on testing data, then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Issues Regarding Classification and Prediction (1): Data Preparation • Data cleaning • Preprocess data in order to reduce noise and handle missing values • Relevance analysis (feature selection) • Remove the irrelevant or redundant attributes • Data transformation • Generalize and/or normalize data
Issues regarding classification and prediction (2): Evaluating Classification Methods • Predictive accuracy • Speed and scalability • time to construct the model • time to use the model • Robustness • handling noise and missing values • Scalability • efficiency in disk-resident databases • Interpretability: • understanding and insight provided by the model • Goodness of rules • decision tree size • compactness of classification rules
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Use of decision tree: Classifying an unknown sample • Test the attribute values of the sample against the decision tree
Training Dataset. This follows an example from Quinlan's ID3 (the AllElectronics customer data used by Han & Kamber):
age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31..40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31..40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31..40 | medium | no | excellent | yes
31..40 | high | yes | fair | yes
>40 | medium | no | excellent | no
Output: A Decision Tree for “buys_computer” • root: age? • age <=30 → student? (no → no, yes → yes) • age 31..40 → yes • age >40 → credit_rating? (excellent → no, fair → yes)
Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
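The following is a compact sketch of this greedy, top-down induction loop, assuming categorical attributes stored as Python dicts and an attribute-selection function supplied by the caller (e.g., information gain, described next).

```python
# Sketch of the basic greedy decision-tree induction algorithm described above.
# Assumes categorical attributes; select_attribute is any attribute selection
# measure (e.g., the information gain sketched later in this chapter).
from collections import Counter

def build_tree(rows, labels, attributes, select_attribute):
    # Stop: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting at the leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(rows, labels, attributes)
    node = {"attribute": best, "branches": {},
            "majority": Counter(labels).most_common(1)[0][0]}
    remaining = [a for a in attributes if a != best]
    # Partition the examples recursively on the selected attribute
    for value in set(r[best] for r in rows):
        subset = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["branches"][value] = build_tree(list(sub_rows), list(sub_labels),
                                             remaining, select_attribute)
    return node
```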
Attribute Selection Measure: Information Gain (ID3/C4.5) • Select the attribute with the highest information gain • S contains si tuples of class Ci for i = {1, …, m} • information (entropy) required to classify any arbitrary tuple: I(s1,…,sm) = -Σi (si/s)*log2(si/s) • entropy (expected information) of attribute A with values {a1,a2,…,av}: E(A) = Σj ((s1j+…+smj)/s)*I(s1j,…,smj) • information gained by branching on attribute A: Gain(A) = I(s1,…,sm) - E(A)
Attribute Selection by Information Gain Computation • Class P: buys_computer = “yes” (9 tuples) • Class N: buys_computer = “no” (5 tuples) • I(p, n) = I(9, 5) = 0.940 • Compute the entropy for age: “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s, so it contributes (5/14)*I(2,3). Hence E(age) = (5/14)*I(2,3) + (4/14)*I(4,0) + (5/14)*I(3,2) = 0.694 and Gain(age) = I(9,5) - E(age) = 0.246 • Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute (see the sketch below)
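A short sketch that recomputes the figures quoted above (the class counts per age group are those given on the slide):

```python
# Verifying I(9,5), E(age) and Gain(age) for the buys_computer training set.
from math import log2

def info(*counts):
    """Entropy I(c1,...,cm) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

i_root = info(9, 5)                                                 # 0.940
e_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)   # 0.694
print(round(i_root, 3), round(i_root - e_age, 3))                   # 0.94 0.246
```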
Gain Ratio • Add another attribute, transaction ID (TID) • for each observation the TID is different • E(TID) = (1/14)*I(1,0) + (1/14)*I(1,0) + (1/14)*I(1,0) + ... + (1/14)*I(1,0) = 0 • gain(TID) = 0.940 - 0 = 0.940 • the highest gain, so TID would be chosen as the test attribute • which makes no sense • use gain ratio rather than gain • Split information: a measure of the information value of the split itself • without considering class information • only the number and sizes of the child nodes • A kind of normalization for information gain
Split information(S) = -Σi (Si/S)*log2(Si/S), summed over the branches i of the split • the information needed to assign an instance to one of these branches • Gain ratio = gain(S)/split information(S) • in the previous example • Split info and gain ratio for TID: • split info(TID) = [-(1/14)*log2(1/14)]*14 = 3.807 • gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247 • Split info for age: I(5,4,5) = -(5/14)*log2(5/14) - (4/14)*log2(4/14) - (5/14)*log2(5/14) = 1.577 • gain ratio(age) = gain(age)/split info(age) = 0.246/1.577 = 0.156 (see the sketch below)
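A sketch of the split-information and gain-ratio arithmetic above; the gain values 0.940 (TID) and 0.246 (age) are taken from the preceding slides:

```python
# Split information and gain ratio for the TID and age attributes.
from math import log2

def split_info(*branch_sizes):
    """Split information of a partition given the branch sizes."""
    total = sum(branch_sizes)
    return -sum(n / total * log2(n / total) for n in branch_sizes if n > 0)

print(round(split_info(*[1] * 14), 3))           # TID: 14 singleton branches -> 3.807
print(round(0.940 / split_info(*[1] * 14), 3))   # gain ratio(TID) ~ 0.247
print(round(split_info(5, 4, 5), 3))             # age: branches of size 5, 4, 5 -> 1.577
print(round(0.246 / split_info(5, 4, 5), 3))     # gain ratio(age) ~ 0.156
```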
Exercise • Repeat the exercise of constructing the tree, this time using gain ratio as the attribute selection measure • notice that TID still has the highest gain ratio • do not split by TID
Other Attribute Selection Measures • Gini index (CART, IBM IntelligentMiner) • All attributes are assumed continuous-valued • Assume there exist several possible split values for each attribute • May need other tools, such as clustering, to get the possible split values • Can be modified for categorical attributes
Gini Index (IBM IntelligentMiner) • If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σj pj^2, where pj is the relative frequency of class j in T • If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as ginisplit(T) = (N1/N)*gini(T1) + (N2/N)*gini(T2) • The attribute that provides the smallest ginisplit(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
Gini index (CART) Example • Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no” • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}; gini index: 0.443 • D1:{medium, high}, D2:{low} gini index: 0.450 • D1:{low, high}, D2:{medium} gini index: 0.458 • The lowest gini index is for D1:{low, medium}, D2:{high}, so income is split on {low, medium} vs. {high} (see the sketch below)
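A sketch that recomputes these Gini values; the class counts per income value (low: 3 yes/1 no, medium: 4 yes/2 no, high: 2 yes/2 no) follow from the training dataset:

```python
# Gini index of the candidate binary splits on income.
def gini(*counts):
    """Gini impurity of a class distribution given as counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(part1, part2):
    """Weighted Gini index of a binary split; each part is (yes_count, no_count)."""
    n1, n2 = sum(part1), sum(part2)
    n = n1 + n2
    return n1 / n * gini(*part1) + n2 / n * gini(*part2)

print(round(gini_split((7, 3), (2, 2)), 3))  # {low,medium} vs {high}   -> 0.443
print(round(gini_split((6, 4), (3, 1)), 3))  # {medium,high} vs {low}   -> 0.450
print(round(gini_split((5, 3), (4, 2)), 3))  # {low,high} vs {medium}   -> 0.458
```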
Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes” (a rule-extraction sketch follows)
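A minimal sketch of this path-to-rule conversion, assuming the nested-dict tree representation used in the earlier induction sketch:

```python
# Extract one IF-THEN rule per root-to-leaf path from a tree in the
# nested-dict form produced by build_tree() above.
def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: holds the class prediction
        clause = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {clause} THEN class = "{node}"']
    rules = []
    for value, child in node["branches"].items():  # one conjunct per branch taken
        rules.extend(extract_rules(child, conditions + ((node["attribute"], value),)))
    return rules
```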
Approaches to Determine the Final Tree Size • Separate training (2/3) and testing (1/3) sets • Use cross validation, e.g., 10-fold cross validation • Use all the data for training • but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution • Use minimum description length (MDL) principle • halting growth of the tree when the encoding is minimized
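As an illustration of the cross-validation option, a sketch using scikit-learn (X and y as in the earlier sketch; the candidate tree depths are arbitrary):

```python
# 10-fold cross-validation as one way to choose how large/complex the tree should be.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for depth in (2, 3, 4, 5, None):                      # None = fully grown tree
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=10)
    print(depth, round(scores.mean(), 3))             # mean accuracy over the 10 folds
```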
Enhancements to basic decision tree induction • Allow for continuous-valued attributes • Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals • Handle missing attribute values • Assign the most common value of the attribute • Assign probability to each of the possible values • Attribute construction • Create new attributes based on existing ones that are sparsely represented • This reduces fragmentation, repetition, and replication
Missing Values • T training cases for attribute A • in F of these cases the value of A is unknown • gain = (1 - F/T)*(info(T) - entropy(T,A)) + (F/T)*0 • for the split info, add another branch for the cases whose value of A is unknown
Missing Values • When a case has a known value of the test attribute, it is assigned to subset Ti with weight 1 • if the attribute value is missing, the case is assigned to every subset Ti with a weight w equal to the proportion of known cases that fall into Ti • do that for each subset Ti • Then the number of cases in each Ti can take fractional values
Example • One of the training cases has a missing age • T:(age = ?, inc=mid,stu=no,credit=ex,class buy= yes) • gain(age) = inf(8,5)-ent(age)= • inf(8,5)=-(8/13)*log(8/13)-(5/13)*log(5/13) • =0.961 • ent(age)=(5/13)*[-(2/5)*log(2/5)-(3/5)*log(3/5)] • +(3/13)*[-(3/3)*log(3/3)+0] • +(5/13)*[-(3/5)*log(3/5)-(2/5)*log(2/5)] • =.747 • gain(age)= (13/14)*(0.961-0.747)=0.199
split info(age) = -(5/14)*log2(5/14) [<=30] - (3/14)*log2(3/14) [31..40] - (5/14)*log2(5/14) [>40] - (1/14)*log2(1/14) [missing] = 1.809 • gain ratio(age) = 0.199/1.809 = 0.110
after splitting by age, the age <=30 branch holds the following cases (student, class, weight): (y, B, 1), (n, N, 1), (n, N, 1), (n, N, 1), (y, B, 1), (n, B, 5/13) • the resulting subtree: age <=30 → student? (yes: 2 B; no: 3 N and 5/13 B); age 31..40 → 3 + 3/13 B; age >40 → credit? (fair: 3 B; excellent: 2 N and 5/13 B)
What happens if a new case has to be classified • T: age<=30, income=mid, student=?, credit=fair • the class has to be found • based on age it goes to the first subtree • but student is unknown • with probability (2.0/5.4) it is a student • with probability (3.4/5.4) it is not a student • (5/13 is approximately 0.4) • P(buy) = P(buy|stu)*P(stu) + P(buy|nostu)*P(nostu) = 1*(2/5.4) + (5/44)*(3.4/5.4) = 0.44 • P(nbuy) = P(nbuy|stu)*P(stu) + P(nbuy|nostu)*P(nostu) = 0*(2/5.4) + (39/44)*(3.4/5.4) = 0.56 (see the sketch below)
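A small sketch of this calculation, with the branch weights and leaf distributions taken from the example above:

```python
# Classifying a case whose 'student' value is missing: weight each branch by its
# (fractional) training weight and combine the leaf class distributions.
branches = {
    # branch: (weight of training cases in the branch, P(buy | branch))
    "student=yes": (2.0, 1.0),                        # 2 cases, both buy
    "student=no":  (3 + 5/13, (5/13) / (3 + 5/13)),   # 3 no-buys plus 5/13 of a buy
}
total = sum(w for w, _ in branches.values())           # ~5.4 cases in this subtree
p_buy = sum(w / total * p for w, p in branches.values())
print(round(p_buy, 2), round(1 - p_buy, 2))            # ~0.44 0.56
```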
Continuous variables • Income is continuous • make a binary split • try all possible splitting points • compute entropy and gain similarly • unlike a categorical attribute, income can still be used as a test attribute in any subtree • a single binary split does not use up all of the information carried by income (see the sketch below)
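A sketch of the exhaustive threshold search for a continuous attribute such as income, using information gain as on the previous slides:

```python
# Find the best binary split threshold for a continuous attribute by trying
# every candidate cut point and maximizing information gain.
from math import log2
from collections import Counter

def info_counts(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = info_counts(labels)
    best = (None, -1.0)                         # (threshold, gain)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                            # same value, not a valid cut point
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs[:i]]
        right = [c for v, c in pairs[i:]]
        gain = base - (len(left) / len(pairs)) * info_counts(left) \
                    - (len(right) / len(pairs)) * info_counts(right)
        if gain > best[1]:
            best = (thr, gain)
    return best
```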
Avoid Overfitting in Classification • Overfitting: An induced tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples • Two approaches to avoid overfitting • Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees • Use a set of data different from the training data to decide which is the “best pruned tree”
Motivation for pruning • Consider a trivial tree for two classes, with probability p: no and 1-p: yes, conditioned on a given set of attribute values • (1) assign each case to the majority class, no: error(1) = 1-p • (2) assign to no with probability p and yes with probability 1-p: error(2) = p*(1-p) + (1-p)*p = 2p(1-p) > 1-p for p > 0.5 • so the simpler majority-class assignment has a lower classification error
If the error rate of a subtree is higher than the error obtained by replacing the subtree with its most frequent leaf or branch • prune the subtree • How to estimate the prediction error? • do not use the training samples • pruning always increases the error on the training sample • estimate the error based on a test set • cost-complexity or reduced-error pruning
Pessimistic error estimates • Based on the training set only • a subtree covers N cases, E of which are misclassified • the error based on the training set is f = E/N • but this is not the true error • regard the subtree's cases as a sample from the population • estimate an upper bound on the population error based on a confidence limit • make a normal approximation to the binomial distribution
Given a confidence level c (the default value is 25% in C4.5) • find the confidence limit z such that P((f-e)/sqrt(e(1-e)/N) > z) = c • N: number of samples • f = E/N: observed error rate • e: true error rate • the upper confidence limit is used as a pessimistic estimate of the true but unknown error rate • a first approximation to the confidence interval for the error is f +/- zc/2*sqrt(f(1-f)/N)
Solving the above inequality for e gives the upper limit • e = (f + z^2/2N + z*sqrt(f/N - f^2/N + z^2/4N^2)) / (1 + z^2/N) • z is the number of standard deviations corresponding to the confidence level c • for c = 0.25, z = 0.69 • refer to Figure 6.2 on page 166 of WF (Witten & Frank) (see the sketch below)
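A sketch of this estimate as a function, reproducing the leaf values used on the next slides (z = 0.69 for the default 25% confidence level):

```python
# Pessimistic error estimate: upper confidence limit on the true error rate
# given observed error f = E/N and z for the chosen confidence level.
from math import sqrt

def pessimistic_error(E, N, z=0.69):
    f = E / N
    return (f + z*z/(2*N) + z * sqrt(f/N - f*f/N + z*z/(4*N*N))) / (1 + z*z/N)

print(round(pessimistic_error(2, 6), 2))   # 0.47  (node with f = 2/6)
print(round(pessimistic_error(1, 2), 2))   # 0.72  (node with f = 1/2)
```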
Example • Labour negotiation data • dependent variable (class to be predicted): acceptability of contract: good or bad • independent variables: • duration • wage increase in 1st year: <=2.5%, >2.5% • working hours per week: <=36, >36 • health plan contribution: none, half, full • Figure 6.2 of WF shows a branch of the decision tree
wage increase 1st year • >2.5: (rest of the tree, not shown) • <=2.5: working hours per week • <=36: 1 bad, 1 good • >36: health plan contribution • none: 4 bad, 2 good (node a) • half: 1 bad, 1 good (node b) • full: 4 bad, 2 good (node c)
for node a • E=2, N=6, so f = 2/6 = 0.33 • plugging into the formula, the upper confidence limit is e = 0.47 • use 0.47 as a pessimistic estimate of the error rather than the training error of 0.33 • for node b • E=1, N=2, so f = 1/2 = 0.50 • plugging into the formula, the upper confidence limit is e = 0.72 • for node c, f = 2/6 = 0.33 but e = 0.47 • average error = (6*0.47 + 2*0.72 + 6*0.47)/14 = 0.51 • the pessimistic error estimate for the parent node (health plan contribution) has f = 5/14 and e = 0.46 < 0.51 • so prune the subtree, replacing it with a single “bad” leaf • now the working hours per week node has two branches
working hours per week • <=36: 1 bad, 1 good, e = 0.72 • >36: “bad” leaf (14 cases), e = 0.46 • average pessimistic error = (2*0.72 + 14*0.46)/16 = 0.49 • the pessimistic error of the pruned tree (replacing this node by a single leaf): f = 6/16, ep = ? • Exercise: calculate the pessimistic error ep and decide whether to prune based on ep
Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Rule R: IF A THEN class C • Rule R-: IF A- THEN class C • delete condition X from A to obtain A- • make a contingency table: • X: Y1 cases of class C, E1 cases not of class C • not X: Y2 cases of class C, E2 cases not of class C • the Y1+E1 cases satisfying R: Y1 classified correctly, E1 misclassified • Y2+E2 cases are satisfied by R- but not by R
The total number of cases covered by R- is Y1+Y2+E1+E2 • some satisfy X, some do not • use a pessimistic estimate of the true error for each rule, using the upper confidence limit UppCF(E, N) • for rule R estimate UppCF(E1, Y1+E1) • for R- estimate UppCF(E1+E2, Y1+E1+Y2+E2) • if the pessimistic error rate of R- < that of R • delete condition X
Suppose a rule has n conditions • delete candidate conditions one by one • repeat • compare with the pessimistic error of the original rule • if the minimum over the candidate rules R- is lower than that of R, delete the corresponding condition • until there is no improvement in pessimistic error • Study the example on pages 49-50 of Quinlan (1993) (see the sketch below)
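A sketch of this greedy condition-dropping loop, reusing pessimistic_error() from the earlier sketch; the rule and case representations are illustrative only:

```python
# Greedy rule simplification: repeatedly drop the condition whose removal gives
# the lowest pessimistic error, until no deletion improves it.
# A rule is a list of (attribute, value) conditions predicting class c;
# cases is a list of (row_dict, class_label) pairs.
def rule_error(conditions, c, cases):
    covered = [cls for row, cls in cases
               if all(row.get(a) == v for a, v in conditions)]
    E = sum(1 for cls in covered if cls != c)          # misclassified covered cases
    return pessimistic_error(E, len(covered)) if covered else 1.0

def simplify_rule(conditions, c, cases):
    conditions = list(conditions)
    while conditions:
        best = min(range(len(conditions)),
                   key=lambda i: rule_error(conditions[:i] + conditions[i+1:], c, cases))
        candidate = conditions[:best] + conditions[best+1:]
        if rule_error(candidate, c, cases) < rule_error(conditions, c, cases):
            conditions = candidate                      # deleting this condition helps
        else:
            break                                       # no improvement -> stop
    return conditions
```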
AnswerTree • Variables • measurement levels • case weights • frequency variables • Growing methods • Stopping rules • Tree parameters • costs, prior probabilities, scores and profits • Gain summary • Accuracy of tree • Cost-complexity pruning
Variables • Categorical variables • nominal or ordinal • Continuous variables • All growing methods accept all types of variables • QUEST requires that the target variable be nominal • Target and predictor variables • target variable (dependent variable) • predictors (independent variables) • Case weight and frequency variables
Case weight and frequency variables • CASE WEIGHT VARIABLES • give unequal treatment to the cases • Ex: direct marketing • 10,000 households respond • and 1,000,000 do not respond • keep all responders but only a 1% sample of nonresponders (10,000) • case weight 1 for responders and • case weight 100 for nonresponders • FREQUENCY VARIABLES • the count of a record representing more than one individual (see the sketch below)
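AnswerTree is SPSS software; purely to illustrate the case-weight idea, the sketch below passes the weights to scikit-learn's decision tree (X and y are hypothetical, with y taking values "respond"/"no_respond"):

```python
# Illustration of case weights (not AnswerTree itself): responders get weight 1,
# the 1% sample of nonresponders gets weight 100, so the fitted tree sees the
# original respond / no-respond proportions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

weights = np.where(y == "respond", 1.0, 100.0)   # hypothetical class labels
model = DecisionTreeClassifier(max_depth=4)
model.fit(X, y, sample_weight=weights)
```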