Data Warehousing and Data Mining — Chapter 7 — MIS 542 2013-2014 Fall
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Classification vs. Prediction • Classification: • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values • Typical Applications • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis
Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is the training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: classifying future or unknown objects • Estimate the accuracy of the model • The known label of each test sample is compared with the model's prediction • Accuracy rate is the percentage of test set samples that are correctly classified by the model • The test set is independent of the training set, otherwise over-fitting will occur • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction
[Figure: the training data is fed to a classification algorithm, which produces the classifier (model), e.g. the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
Classification Process (2): Use the Model in Prediction
[Figure: the classifier is applied first to testing data to estimate accuracy, then to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?]
Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Issues Regarding Classification and Prediction (1): Data Preparation • Data cleaning • Preprocess data in order to reduce noise and handle missing values • Relevance analysis (feature selection) • Remove the irrelevant or redundant attributes • Data transformation • Generalize and/or normalize data
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods • Predictive accuracy • Speed and scalability • time to construct the model • time to use the model • Robustness • handling noise and missing values • Scalability • efficiency in disk-resident databases • Interpretability • understanding and insight provided by the model • Goodness of rules • decision tree size • compactness of classification rules
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by Neural Networks • Classification by Support Vector Machines (SVM) • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Use of decision tree: Classifying an unknown sample • Test the attribute values of the sample against the decision tree
Training Dataset
[Table: 14 training tuples with attributes age, income, student, credit_rating and the class buys_computer (9 "yes", 5 "no")]
This follows an example from Quinlan's ID3
Output: A Decision Tree for "buys_computer"
[Figure: the root tests age. The <=30 branch tests student (no: "no", yes: "yes"), the 31..40 branch is a leaf "yes", and the >40 branch tests credit_rating (fair: "no", excellent: "yes")]
Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
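The greedy procedure above can be summarized in a short sketch. This is not C4.5 itself, just a minimal illustration assuming categorical attributes and a list-of-dicts training set; all names (build_tree, info_gain, etc.) are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    # I(s1,...,sm) = -sum (si/s) log2(si/s) over the class distribution
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Gain(A) = I(parent) - weighted entropy of the partitions induced by A
    base = entropy([r[target] for r in rows])
    n = len(rows)
    rem = sum(len(part) / n * entropy(part)
              for value in set(r[attr] for r in rows)
              for part in [[r[target] for r in rows if r[attr] == value]])
    return base - rem

def build_tree(rows, attributes, target="class"):
    # top-down, recursive, divide-and-conquer induction over categorical attributes
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # stopping conditions: pure node, or no attributes left (majority voting labels the leaf)
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: info_gain(rows, a, target))   # heuristic selection
    node = {"attr": best, "branches": {}, "default": majority}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = build_tree(
            subset, [a for a in attributes if a != best], target)
    return node
```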
Attribute Selection Measure: Information Gain (ID3/C4.5) • Select the attribute with the highest information gain • S contains si tuples of class Ci for i = {1, …, m} • Expected information (entropy) needed to classify an arbitrary tuple: I(s1, s2, …, sm) = -Σi (si/s) log2(si/s) • Entropy of attribute A with values {a1, a2, …, av}: E(A) = Σj ((s1j + … + smj)/s) * I(s1j, …, smj) • Information gained by branching on attribute A: Gain(A) = I(s1, s2, …, sm) - E(A)
Attribute Selection by Information Gain Computation • Class P: buys_computer = "yes" (9 tuples) • Class N: buys_computer = "no" (5 tuples) • I(p, n) = I(9, 5) = 0.940 • Compute the entropy for age: E(age) = (5/14)*I(2,3) + (4/14)*I(4,0) + (5/14)*I(3,2) = 0.694 • here (5/14)*I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's • Hence gain(age) = I(9,5) - E(age) = 0.247 • Similarly, gain(income) = 0.029, gain(student) = 0.151, gain(credit_rating) = 0.048
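A small sketch that reproduces these numbers from the class counts quoted on the slide (the I helper is illustrative):

```python
import math

def I(*counts):
    # expected information I(s1,...,sm) for a node with the given class counts
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c)

# parent node: 9 "yes" and 5 "no" tuples
print(round(I(9, 5), 3))                         # 0.94

# age partitions the 14 tuples into (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no)
E_age = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)
print(round(E_age, 3))                           # 0.694
print(round(I(9, 5) - E_age, 3))                 # Gain(age) ~ 0.247
```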
Gain Ratio • Add another attribute: the transaction ID (TID) • for each observation TID is different • E(TID) = (1/14)*I(1,0) + (1/14)*I(1,0) + (1/14)*I(1,0) + ... + (1/14)*I(1,0) = 0 • gain(TID) = 0.940 - 0 = 0.940 • the highest gain, so TID would be the test attribute • which makes no sense • use gain ratio rather than gain • Split information: a measure of the information value of the split • without considering class information • only the number and size of child nodes • A kind of normalization for information gain
Split information = -Σi (Si/S) log2(Si/S) • the information needed to assign an instance to one of these branches • Gain ratio = gain(S) / split information(S) • in the previous example • Split info and gain ratio for TID: • split info(TID) = -[(1/14) log2(1/14)] * 14 = 3.807 • gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247 • Split info for age: I(5,4,5) = -(5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14) = 1.577 • gain ratio(age) = gain(age)/split info(age) = 0.247/1.577 = 0.156
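A sketch of this normalization; the split sizes (fourteen singleton TID branches, age branches of sizes 5, 4 and 5) come from the slides, and the helper names are illustrative:

```python
import math

def I(*counts):
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c)

def split_info(*sizes):
    # information value of the split itself: -sum (|Ti|/|T|) log2(|Ti|/|T|)
    s = sum(sizes)
    return -sum((n / s) * math.log2(n / s) for n in sizes if n)

gain_age = I(9, 5) - ((5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2))
gain_tid = I(9, 5) - 0.0                           # every TID branch is pure, so E(TID) = 0

print(round(split_info(*[1] * 14), 3))             # split info(TID) = 3.807
print(round(gain_tid / split_info(*[1] * 14), 3))  # gain ratio(TID) ~ 0.247
print(round(split_info(5, 4, 5), 3))               # split info(age) = 1.577
print(round(gain_age / split_info(5, 4, 5), 3))    # gain ratio(age) ~ 0.156
```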
Exercise • Repeat the exercise of constructing the tree, this time using the gain ratio criterion as the attribute selection measure • notice that TID still has the highest gain ratio • do not split by TID
Other Attribute Selection Measures • Gini index (CART, IBM IntelligentMiner) • All attributes are assumed continuous-valued • Assume there exist several possible split values for each attribute • May need other tools, such as clustering, to get the possible split values • Can be modified for categorical attributes
Gini Index (IBM IntelligentMiner) • If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σj pj^2, where pj is the relative frequency of class j in T • If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as ginisplit(T) = (N1/N) gini(T1) + (N2/N) gini(T2) • The attribute that provides the smallest ginisplit(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
Gini index (CART) Example • Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no": gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459 • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}: ginisplit = (10/14) gini(D1) + (4/14) gini(D2) = 0.443 • D1: {low, high}, D2: {medium} gini index: 0.458 • D1: {medium, high}, D2: {low} gini index: 0.450 • The split on {low, medium} (and {high}) is chosen since it has the lowest gini index
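A short sketch of the gini computation. The per-subset class counts (7 yes/3 no in {low, medium}, 2 yes/2 no in {high}, and so on) are assumptions taken from the standard 14-tuple example, since the training table is not reproduced on the slide:

```python
def gini(*counts):
    # gini(T) = 1 - sum pj^2 over the class distribution of T
    s = sum(counts)
    return 1.0 - sum((c / s) ** 2 for c in counts)

def gini_split(*partitions):
    # weighted gini of a split; each partition is a tuple of class counts
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(*p) for p in partitions)

print(round(gini(9, 5), 3))                      # gini(D) = 0.459
print(round(gini_split((7, 3), (2, 2)), 3))      # {low,medium} vs {high}:  0.443
print(round(gini_split((5, 3), (4, 2)), 3))      # {low,high} vs {medium}:  0.458
print(round(gini_split((6, 4), (3, 1)), 3))      # {medium,high} vs {low}:  0.450
```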
Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example IF age = "<=30" AND student = "no" THEN buys_computer = "no" IF age = "<=30" AND student = "yes" THEN buys_computer = "yes" IF age = "31…40" THEN buys_computer = "yes" IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes" IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
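As a rough illustration, a sketch that walks a tree (in the nested-dict form used in the induction sketch earlier) and prints one IF-THEN rule per root-to-leaf path; the tree literal below simply mirrors the rules on the slide:

```python
def extract_rules(node, conditions=()):
    # one rule per root-to-leaf path; each inner node contributes one conjunct
    if not isinstance(node, dict):               # leaf: holds the class prediction
        conj = " AND ".join('%s = "%s"' % (a, v) for a, v in conditions) or "TRUE"
        print('IF %s THEN buys_computer = "%s"' % (conj, node))
        return
    for value, child in node["branches"].items():
        extract_rules(child, conditions + ((node["attr"], value),))

tree = {"attr": "age", "branches": {
    "<=30": {"attr": "student", "branches": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"attr": "credit_rating", "branches": {"excellent": "yes", "fair": "no"}},
}}
extract_rules(tree)
```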
Approaches to Determine the Final Tree Size • Separate training (2/3) and testing (1/3) sets • Use cross validation, e.g., 10-fold cross validation • Use all the data for training • but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution • Use minimum description length (MDL) principle • halting growth of the tree when the encoding is minimized
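As a rough illustration of the cross-validation option, a sketch assuming a build(train_rows) function and a classify(model, row) function like the ones sketched earlier (the names and interface are illustrative):

```python
import random

def cross_validated_accuracy(rows, build, classify, target="class", k=10, seed=1):
    # k-fold cross validation: every tuple is tested exactly once,
    # each time by a model trained on the other k-1 folds
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = build(train)
        scores.append(sum(classify(model, r) == r[target] for r in test) / len(test))
    return sum(scores) / len(scores)
```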
Enhancements to basic decision tree induction • Allow for continuous-valued attributes • Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals • Handle missing attribute values • Assign the most common value of the attribute • Assign probability to each of the possible values • Attribute construction • Create new attributes based on existing ones that are sparsely represented • This reduces fragmentation, repetition, and replication
Missing Values • T cases for attribute A • in F cases the value of A is unknown • gain = (1 - F/T) * (info(T) - entropy(T,A)) + (F/T) * 0 • split info: add another branch for cases whose values are unknown
Missing Values • When a case has a known attribute value it is assigned to Ti with probability 1 • if the attribute value is missing it is assigned to every subset Ti with a probability • give a weight w that the case belongs to subset Ti, proportional to the number of known-value cases in Ti • do that for each subset Ti • Then the number of cases in each Ti may have fractional values
Example • One of the training cases has a missing age • T: (age = ?, inc = mid, stu = no, credit = ex, class buy = yes) • gain(age) = (13/14) * (inf(8,5) - ent(age)) • inf(8,5) = -(8/13)*log(8/13) - (5/13)*log(5/13) = 0.961 • ent(age) = (5/13)*[-(2/5)*log(2/5) - (3/5)*log(3/5)] + (3/13)*[-(3/3)*log(3/3)] + (5/13)*[-(3/5)*log(3/5) - (2/5)*log(2/5)] = 0.747 • gain(age) = (13/14)*(0.961 - 0.747) = 0.199
split info(age) = -(5/14)log2(5/14) [<=30] - (3/14)log2(3/14) [31..40] - (5/14)log2(5/14) [>40] - (1/14)log2(1/14) [missing] = 1.809 • gain ratio(age) = 0.199/1.809 = 0.110
after splitting by age
• age <=30 branch:
age     student   class   weight
<=30    yes       B       1
<=30    no        N       1
<=30    no        N       1
<=30    no        N       1
<=30    yes       B       1
<=30    no        B       5/13
• [Figure: the age node has three branches. 31..40 is a leaf with 3 + 3/13 B; >40 leads to a credit node (ext: 2 N + 5/13 B, fair: 3 B); <=30 leads to the student node above (no: 3 N + 5/13 B, yes: 2 B)]
What happens if a new case has to be classified • T: age < 30, income = mid, stu = ?, credit = fair, class = ? (to be found) • based on age it goes to the first subtree • but student is unknown • with (2.0/5.4) probability it is a student • with (3.4/5.4) probability it is not a student • (5/13 is approximately 0.4, so the student = no leaf holds about 3.4 cases and the node about 5.4 in total) • the student = no leaf contains 3 N and 5/13 B, so P(buy | not student) = (5/13)/(3 + 5/13) = 5/44 • P(buy) = P(buy|stu)*P(stu) + P(buy|nostu)*P(nostu) = 1*(2/5.4) + (5/44)*(3.4/5.4) = 0.44 • P(nbuy) = P(nbuy|stu)*P(stu) + P(nbuy|nostu)*P(nostu) = 0*(2/5.4) + (39/44)*(3.4/5.4) = 0.56
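A small sketch that reproduces this calculation from the fractional leaf weights derived on the previous slides (2 B under student = yes; 3 N plus 5/13 B under student = no):

```python
# weighted class counts at the two leaves under the student node
leaf_yes = {"buy": 2.0,    "not_buy": 0.0}       # student = yes
leaf_no  = {"buy": 5 / 13, "not_buy": 3.0}       # student = no

total = sum(leaf_yes.values()) + sum(leaf_no.values())    # ~5.4 weighted cases at the node
p_student     = sum(leaf_yes.values()) / total            # ~2.0/5.4
p_not_student = sum(leaf_no.values())  / total            # ~3.4/5.4

def p_class(leaf, cls):
    # class probability within a leaf, from the weighted counts
    return leaf[cls] / sum(leaf.values())

p_buy  = p_class(leaf_yes, "buy") * p_student + p_class(leaf_no, "buy") * p_not_student
p_nbuy = p_class(leaf_yes, "not_buy") * p_student + p_class(leaf_no, "not_buy") * p_not_student
print(round(p_buy, 2), round(p_nbuy, 2))                  # 0.44 0.56
```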
Continuous variables • Income is continuous • make a binary split • try all possible splitting points and compute entropy and gain similarly (see the sketch below) • income can still be used as a test variable in any subtree • a single binary split does not exhaust the information in income
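A sketch of the usual procedure for a continuous attribute: sort the distinct values, try every midpoint as a candidate threshold, and keep the one with the highest information gain. The toy income values below are made up purely for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    # evaluate every midpoint between consecutive distinct values; return (threshold, gain)
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left  = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# toy data: income (in 1000s) against the class label
print(best_binary_split([20, 35, 50, 28, 70, 40], ["no", "no", "yes", "no", "yes", "yes"]))
# -> (37.5, 1.0): the split income <= 37.5 separates the two classes perfectly here
```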
Avoid Overfitting in Classification • Overfitting: An induced tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples • Two approaches to avoid overfitting • Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees • Use a set of data different from the training data to decide which is the “best pruned tree”
Motivation for pruning • A trivial tree: two classes with probability p (no) and 1-p (yes), conditioned on a given set of attribute values • (1) assign each case to the majority class, no: error(1) = 1-p • (2) assign to no with probability p and to yes with probability 1-p: error(2) = p*(1-p) + (1-p)*p = 2p(1-p) > 1-p for p > 0.5 • e.g. for p = 0.7, error(1) = 0.30 while error(2) = 0.42 • so the simple (majority) tree has a lower error of classification
If the error rate of a subtree is higher than the error obtained by replacing the subtree with its most frequent leaf or branch • prune the subtree • How to estimate the prediction error? • do not use the training samples: pruning always increases the error on the training sample • estimate the error based on a test set • cost-complexity or reduced-error pruning
Pessimistic error estimates • Based on the training set only • a subtree covers N cases, E cases misclassified • error based on the training set: f = E/N • but this is not the true error • treat the subtree's cases as a sample from the population • estimate an upper bound on the population error based on the confidence limit • Make a normal approximation to the binomial distribution
Given a confidence level c (the default value is 25% in C4.5) • find the confidence limit z such that P((f-e)/sqrt(e(1-e)/N) > z) = c • N: number of samples • f = E/N: observed error rate • e: true error rate • the upper confidence limit is used as a pessimistic estimate of the true but unknown error rate • first approximation to the confidence interval for the error: f +/- zc/2 * sqrt(f(1-f)/N)
Solving the above inequality for e gives the upper limit • e = (f + z^2/2N + z*sqrt(f/N - f^2/N + z^2/4N^2)) / (1 + z^2/N) • z is the number of standard deviations corresponding to the confidence level c • for c = 0.25, z = 0.69 • refer to Figure 6.2 on page 166 of WF
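A minimal sketch of this upper limit (the WF formula above), with z = 0.69 for the default 25% confidence level; the function name is illustrative and is reused in the pruning example below:

```python
import math

def pessimistic_error(E, N, z=0.69):
    # upper confidence limit on the true error rate, given E errors in N training cases
    f = E / N
    num = f + z**2 / (2 * N) + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2))
    return num / (1 + z**2 / N)
```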
Example • Labour negotiation data • dependent variable or class to be predicted: acceptability of contract: good or bad • independent variables: duration, wage increase 1st year (<=2.5%, >2.5%), working hours per week (<=36, >36), health plan contribution (none, half, full) • Figure 6.2 of WF shows a branch of the decision tree
[Figure: a branch of the labour negotiation decision tree. Under wage increase 1st year (<=2.5 / >2.5), the subtree shown tests working hours per week, which splits into <=36 (leaf: 1 bad, 1 good) and >36, leading to health plan contribution with branches none (node a: 4 bad, 2 good), half (node b: 1 bad, 1 good) and full (node c: 4 bad, 2 good)]
for node a • E = 2, N = 6 so f = 2/6 = 0.33 • plugging into the formula, the upper confidence limit is e = 0.47 • use 0.47 as a pessimistic estimate of the error, rather than the training error of 0.33 • for node b • E = 1, N = 2 so f = 1/2 = 0.50 • plugging into the formula, the upper confidence limit is e = 0.72 • for node c, f = 2/6 = 0.33 but e = 0.47 • average error = (6*0.47 + 2*0.72 + 6*0.47)/14 = 0.51 • The error estimate for the parent node health plan is based on f = 5/14, giving e = 0.46 < 0.51 • so prune the node • now the working hours per week node has two branches
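Plugging the node counts into the pessimistic_error sketch above reproduces these estimates (the exact formula gives about 0.45 for the parent; the slide, following WF, quotes 0.46, and the pruning decision is the same either way):

```python
e_a = pessimistic_error(2, 6)                    # node a: ~0.47
e_b = pessimistic_error(1, 2)                    # node b: ~0.72
e_c = pessimistic_error(2, 6)                    # node c: ~0.47

subtree_error = (6 * e_a + 2 * e_b + 6 * e_c) / 14          # ~0.51
parent_error  = pessimistic_error(5, 14)                    # health plan as one leaf: ~0.45
print(round(subtree_error, 2), round(parent_error, 2))
print("prune" if parent_error < subtree_error else "keep")  # prune
```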
working hours per week
[Figure: after pruning, the working hours per week node has two leaves. <=36: 1 bad, 1 good (e = 0.72); >36: "bad", covering 14 cases (e = 0.46)]
• average pessimistic error = (2*0.72 + 14*0.46)/16 = 0.49
• the pessimistic error of replacing this node by a single leaf: f = 6/16, ep = ?
• Exercise: calculate the pessimistic error ep and decide whether to prune or not based on ep
Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
Rule R: IF A THEN class C • Rule R-: IF A- THEN class C • delete condition X from A to obtain A- • make a table over the cases satisfied by R-:
              class C    not class C
satisfies X     Y1           E1
not X           Y2           E2
• Y1 + E1 cases satisfy R: Y1 correct, E1 misclassified • Y2 + E2 cases are satisfied by R- but not by R
The total number of cases covered by R- is Y1 + Y2 + E1 + E2 • some satisfy X, some do not • use a pessimistic estimate of the true error of each rule, using the upper limit UppCF(E, N) • for rule R estimate UppCF(E1, Y1+E1) • and for R- estimate UppCF(E1+E2, Y1+E1+Y2+E2) • if the pessimistic error rate of R- < that of R, delete condition X
Suppose a rule has n conditions • delete conditions one by one • repeat: compare the pessimistic error of each reduced rule with that of the original rule • if min(R-) < R, delete that condition • until there is no improvement in the pessimistic error • Study the example on pages 49-50 of Quinlan 93 • a greedy sketch of this loop is given below
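A greedy sketch of this loop, assuming a rule is a list of (attribute, value) conditions over cases stored as dicts with a "class" key, and reusing the pessimistic_error sketch from the pruning slides (all names are illustrative, not Quinlan's actual implementation):

```python
def rule_error(conditions, cases, target_class):
    # pessimistic error of "IF conditions THEN target_class" over the cases it covers
    covered = [c for c in cases if all(c.get(a) == v for a, v in conditions)]
    if not covered:
        return 1.0                                   # a rule covering nothing is useless
    errors = sum(c["class"] != target_class for c in covered)
    return pessimistic_error(errors, len(covered))

def prune_rule(conditions, cases, target_class):
    # repeatedly drop the condition whose removal gives the lowest pessimistic error,
    # and stop as soon as no deletion improves on the current rule
    conditions = list(conditions)
    while conditions:
        current = rule_error(conditions, cases, target_class)
        best_err, best_cond = min(
            (rule_error([c for c in conditions if c != x], cases, target_class), x)
            for x in conditions)
        if best_err < current:
            conditions.remove(best_cond)
        else:
            break
    return conditions
```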
AnswerTree • Variables • measurement levels • case weights • frequency variables • Growing methods • Stopping rules • Tree parameters • costs, prior probabilities, scores and profits • Gain summary • Accuracy of tree • Cost-complexity pruning
Variables • Categorical Variables • nominal or ordinal • Continuous Variables • All growing methods accept all types of variables • QUEST requires that the target variable be nominal • Target and predictor variables • target variable (dependent variable) • predictor (independent) variables • Case weight and frequency variables
Case weight and frequency variables • CASE WEIGHT VARIABLES • give unequal treatment to the cases • Ex: direct marketing • 10,000 households respond and 1,000,000 do not respond • keep all responders but only 1% of the nonresponders (10,000) • case weight 1 for responders and case weight 100 for nonresponders • FREQUENCY VARIABLES • count of a record representing more than one individual