Classification I: Decision Tree AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology
Classification: Definition. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class: find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
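A minimal sketch of the train/test protocol described above, using scikit-learn and a tiny hypothetical dataset (the feature values, labels, and 25% split are illustrative assumptions, not part of the course material):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical records: [Refund (1 = Yes), Taxable Income in K]; class label = cheat?
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60], [1, 220], [0, 85]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes"]

# Hold out part of the data as a test set to estimate how well unseen records are classified
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
print(accuracy_score(y_test, model.predict(X_test)))     # validate it on the test set
```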
Classification Example: predicting borrowers who cheat on loan payments. [Figure: a training set whose records have categorical, categorical, and continuous attributes plus a class label is used to learn a classifier (the model), which is then applied to a test set.]
Issues: Evaluating Classification Methods • Accuracy: how well the class labels of test data are predicted • Speed: time to construct the model (training time) and time to use the model (classification/prediction time) • Robustness: handling of noise and missing values • Scalability: efficiency on large-scale data • Interpretability: the understanding and insight provided by the model • Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Classification Techniques: Decision Tree-based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Ensemble Methods, Support Vector Machines. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Example of a Decision Tree. Model learned from the training data (categorical, categorical, and continuous attributes plus a class label): the root node tests the splitting attribute Refund (Yes → NO; No → test MarSt); MarSt (Married → NO; Single, Divorced → test TaxInc); TaxInc (< 80K → NO; >= 80K → YES). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
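The example tree can be read as a set of nested rules. A minimal sketch (the dictionary keys Refund, MarSt, and TaxInc follow the slide's example; the record encoding is an assumption for illustration):

```python
def classify(record):
    """Apply the example decision tree to one record given as a dict."""
    if record["Refund"] == "Yes":
        return "No"                                   # leaf: NO
    if record["MarSt"] == "Married":
        return "No"                                   # leaf: NO
    # MarSt is Single or Divorced: test taxable income
    return "Yes" if record["TaxInc"] >= 80_000 else "No"

# Example test record: routed Refund = No -> MarSt = Married -> leaf NO
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}))  # -> "No"
```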
Another Example of a Decision Tree, fitting the same training data: the root node tests MarSt (Married → NO; Single, Divorced → test Refund); Refund (Yes → NO; No → test TaxInc); TaxInc (< 80K → NO; >= 80K → YES). There could be more than one tree that fits the same data! Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Decision Tree Classification Task. [Figure: the overall classification framework with a decision tree as the model: a tree induction algorithm learns the tree from the training set, and the tree is then applied to the test set.] Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Apply Model to Test Data. Start from the root of the tree and follow, at each node, the branch that matches the test record's attribute value until a leaf is reached. [Figure: the example tree (Refund, MarSt, TaxInc) with a test record being routed from the root.]
Apply Model to Test Data Test Data Refund Yes No NO MarSt Assign Cheat to “No” Married Single, Divorced TaxInc NO < 80K > 80K YES NO Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Algorithm for Decision Tree Induction. Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner; at the start, all the training examples are at the root; examples are partitioned recursively based on selected attributes; attributes are categorical (if continuous-valued, they are discretized in advance); test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain). Conditions for stopping partitioning: all samples at a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is then employed to label the leaf); there are no samples left. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Decision Tree Induction. Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
General Structure of Hunt's Algorithm. Let Dt be the set of training records that reach a node t. General procedure: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt; if Dt is an empty set, then t is a leaf node labeled by the default class yd; if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
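A runnable sketch of this recursion, assuming records are (attribute dict, class label) pairs and leaving the choice of test attribute to a pluggable heuristic (the information gain and Gini criteria discussed later in the deck):

```python
from collections import Counter

def hunt(records, attributes, choose_attribute, default_class):
    """Recursive sketch of Hunt's algorithm; records is the set Dt at the current node."""
    if not records:                                     # Dt is empty -> leaf labeled with the default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                           # all records in Dt share one class -> leaf
        return labels[0]
    if not attributes:                                  # nothing left to split on -> majority vote
        return Counter(labels).most_common(1)[0][0]

    attr = choose_attribute(records, attributes)        # attribute test (heuristic chosen elsewhere)
    majority = Counter(labels).most_common(1)[0][0]
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {rec[attr] for rec, _ in records}:     # one branch per observed value of attr
        subset = [(rec, lab) for rec, lab in records if rec[attr] == value]
        branches[value] = hunt(subset, remaining, choose_attribute, majority)
    return (attr, branches)                             # internal node: tested attribute + branches

# Tiny usage example with a deliberately naive chooser (pick the first available attribute):
data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No",  "MarSt": "Married"}, "No"),
        ({"Refund": "No",  "MarSt": "Single"}, "Yes")]
print(hunt(data, ["Refund", "MarSt"], lambda recs, attrs: attrs[0], "No"))
```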
Hunt's Algorithm. [Figure: the tree grown step by step on the cheat example: first a single leaf (Don't Cheat); then a split on Refund (Yes → Don't Cheat); then, for the Refund = No branch, a split on Marital Status (Married → Don't Cheat); and finally, for the Single/Divorced branch, a split on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat).]
Issues of Hunt's Algorithm: determine how to split the records. • How to specify the attribute test condition? (How many branches? What partition threshold for splitting?) • How to determine the best split? (Which attribute to choose?) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
How to Specify the Test Condition? Depends on the attribute type: nominal, ordinal, continuous. Depends on the number of ways to split: 2-way split, multi-way split. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Splitting Based on Nominal Attributes. Multi-way split: use as many partitions as distinct values (e.g., CarType → Family / Sports / Luxury). Binary split: divide the values into two subsets and find the optimal partitioning (e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}); see the enumeration sketch below. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
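For a nominal attribute with k distinct values there are 2^(k-1) - 1 non-trivial binary partitions to examine. A small sketch that enumerates them (the helper name and example values are illustrative):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield every way to split a set of nominal values into two non-empty subsets."""
    values = sorted(values)
    seen = set()
    for size in range(1, len(values)):                 # size of the left-hand subset
        for left in combinations(values, size):
            right = tuple(v for v in values if v not in left)
            key = frozenset([left, right])             # avoid counting {A}|{B,C} and {B,C}|{A} twice
            if key not in seen:
                seen.add(key)
                yield set(left), set(right)

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(left, "vs.", right)      # the 3 candidate binary splits of CarType
```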
Splitting Based on Ordinal Attributes. Multi-way split: use as many partitions as distinct values (e.g., Size → Small / Medium / Large). Binary split: divide the values into two subsets and find the optimal partitioning (e.g., {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}). What about the split {Small, Large} vs. {Medium}? It does not respect the order of the attribute values. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Splitting Based on Continuous Attributes. Different ways of handling: • Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. • Discretization to form an ordinal categorical attribute; ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Splitting Based on Continuous Attributes. [Figure: example test conditions on a continuous attribute, binary and multi-way.] Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
How to Determine the Best Split. Before splitting: 10 records of class 0 and 10 records of class 1. [Figure: several candidate attribute tests and the class distributions of the resulting partitions; which test condition is the best?] Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
How to Determine the Best Split. Greedy approach: nodes with a homogeneous class distribution are preferred. We need a measure of node impurity: a non-homogeneous node has a high degree of impurity, a homogeneous node a low degree of impurity. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Measures of Node Impurity. • Gini index: a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. • Entropy: a measure of the uncertainty associated with a random variable. • Misclassification error: the proportion of misclassified samples. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Measures of Node Impurity • Gini index: GINI(t) = 1 - Σj [p(j|t)]², where p(j|t) is the relative frequency of class j at node t • Entropy: Entropy(t) = - Σj p(j|t) log2 p(j|t) • Misclassification error: Error(t) = 1 - maxj p(j|t) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
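A small sketch of the three impurity measures, computed from the class counts at a node (function names are illustrative):

```python
import math

def _proportions(counts):
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t."""
    return 1.0 - sum(p * p for p in _proportions(counts))

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t)."""
    return -sum(p * math.log2(p) for p in _proportions(counts))

def misclassification_error(counts):
    """Error(t) = 1 - max_j p(j|t)."""
    return 1.0 - max(_proportions(counts))

# A node holding 10 records of class 0 and 10 of class 1 is maximally impure:
print(gini([10, 10]), entropy([10, 10]), misclassification_error([10, 10]))  # 0.5 1.0 0.5
```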
Comparison among Measures of Node Impurity. [Figure: for a 2-class problem, Gini, entropy, and misclassification error plotted as functions of p, the fraction of records in one class; all three are maximal at p = 0.5 and zero at p = 0 and p = 1.] Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Quality of Split. When a node p holding n records is split into k partitions (child i holding ni records), the quality of the split is computed as GINIsplit = Σi (ni/n) GINI(i), and the information gain is GAINsplit = GINI(p) - GINIsplit; the same form applies with Entropy in place of GINI. • Measures the reduction in GINI/Entropy achieved because of the split. • Choose the split that achieves the most reduction (maximizes GAIN). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Quality of Split: Binary Attributes • A binary attribute splits the parent node P into two partitions, Node N1 and Node N2; here N1 receives 7 records with class counts (4, 3) and N2 receives 5 records with class counts (2, 3). Gini(N1) = 1 - (4/7)² - (3/7)² = 0.4898. Gini(N2) = 1 - (2/5)² - (3/5)² = 0.480. Ginisplit(children) = 7/12 × 0.4898 + 5/12 × 0.480 = 0.486. Gainsplit = Gini(parent) - Ginisplit(children) = 0.5 - 0.486 = 0.014.
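A self-contained sketch that reproduces the numbers above from the class counts:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children_counts):
    """Weighted Gini of the children: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

parent = [6, 6]                  # class counts at the parent node (Gini = 0.5)
n1, n2 = [4, 3], [2, 3]          # class counts in the two children of the binary split

print(round(gini(n1), 4))                              # 0.4898
print(round(gini(n2), 3))                              # 0.48
print(round(gini_split([n1, n2]), 3))                  # 0.486
print(round(gini(parent) - gini_split([n1, n2]), 3))   # gain = 0.014
```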
CART • CART: Classification and Regression Trees • constructs trees with only binary splits (simplifies the splitting criterion) • uses the Gini index as the splitting criterion • splits on the attribute that provides the smallest Ginisplit(p), i.e., the largest GAINsplit(p) • needs to enumerate all the possible splitting points for each attribute Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
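For reference, scikit-learn's decision tree learner is based on CART (binary splits, Gini index by default). A minimal usage sketch on a small hypothetical encoding of the loan example (the feature encoding and values are assumptions for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [Refund (1 = Yes), Married (1 = Yes), Taxable Income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120],
     [0, 0, 95],  [0, 1, 60],  [1, 0, 220], [0, 0, 85]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes"]

# criterion="gini" and binary splits mirror CART's design
cart = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(cart, feature_names=["Refund", "Married", "TaxInc"]))
```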
Continuous Attributes: Computing the Gini Index • For efficient computation, for each attribute: • sort the attribute on its values • set the candidate split positions as the midpoints between adjacent sorted values • linearly scan these values, each time updating the count matrix and computing the Gini index • choose the split position that has the least Gini index (see the sketch below). [Figure: sorted attribute values with candidate split positions.]
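A sketch of the scan over candidate split positions for one continuous attribute; for clarity it recounts the class distribution at each midpoint instead of updating a count matrix incrementally, and the income/cheat toy data is a hypothetical example:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def best_continuous_split(values, labels):
    """Return (best_threshold, best_weighted_gini) for a 'value < threshold' test."""
    pairs = sorted(zip(values, labels))                # sort the attribute on its values
    n = len(pairs)
    classes = sorted(set(labels))
    best_thr, best_gini = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no candidate between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2      # midpoint between adjacent sorted values
        left = [lab for v, lab in pairs if v < thr]
        right = [lab for v, lab in pairs if v >= thr]
        weighted = sum(len(side) / n * gini([side.count(c) for c in classes])
                       for side in (left, right))
        if weighted < best_gini:                       # keep the position with the least Gini
            best_thr, best_gini = thr, weighted
    return best_thr, best_gini

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_continuous_split(income, cheat))            # (97.5, 0.3) on this toy data
```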
How to Find the Best Split of a nominal attribute such as CarType: compare the candidate two-way splits (find the best partition of the values, e.g., {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}) with the multi-way split (Family / Sports / Luxury); on this example the largest Gain is 0.337. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Which Attribute to Split On? [Figure: three candidate attributes with Gain = 0.02, Gain = 0.337, and Gain = 0.5; is the one with the largest gain really the best?] • Disadvantage: gain tends to prefer splits that result in a large number of partitions, each being small but pure. • A unique value for each record is not predictive. • A small number of records in each node does not give reliable predictions. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Splitting Based on Gain Ratio • Gain ratio: GainRatio_split = GAIN_split / SplitINFO, with SplitINFO = - Σi (ni/n) log2(ni/n) • the parent node p is split into k partitions, where ni is the number of records in partition i • designed to overcome the disadvantage of Information Gain • adjusts Information Gain by the entropy of the partitioning (SplitINFO) • higher-entropy partitioning (a large number of small partitions) is penalized! Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
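A small sketch of the gain-ratio computation from partition sizes and class counts; the toy numbers below only illustrate how SplitINFO penalizes a many-way split into tiny pure partitions:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """GainRatio = (Entropy(parent) - weighted child entropy) / SplitINFO."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    gain = entropy(parent_counts) - weighted
    split_info = -sum(sum(c) / n * math.log2(sum(c) / n) for c in children_counts)
    return gain / split_info if split_info > 0 else 0.0

# Ten pure single-record partitions: gain = 1.0 but SplitINFO = log2(10), so the ratio is small
print(gain_ratio([5, 5], [[1, 0]] * 5 + [[0, 1]] * 5))   # ~0.30
# Two pure partitions of five records each: gain = 1.0, SplitINFO = 1.0, ratio = 1.0
print(gain_ratio([5, 5], [[5, 0], [0, 5]]))
```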
Comparing Attribute Selection Measures • The three measures, in general, return good results, but: • Gini gain: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions • Information gain: biased towards multivalued attributes • Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
ID3 and C4.5. ID3 (Ross Quinlan, 1986) is the precursor to the C4.5 algorithm (Ross Quinlan, 1993); C4.5 is an extension of the earlier ID3 algorithm. For each unused attribute Ai, compute the information GAIN (ID3) or GainRatio (C4.5) from splitting on Ai. Find the best splitting attribute Abest with the highest GAIN or GainRatio. Create a decision node that splits on Abest. Recur on the sublists obtained by splitting on Abest, and add those nodes as children of the current node. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
Improvements of C4.5 over the ID3 algorithm. Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those records whose attribute value is above the threshold and those whose value is less than or equal to it. Handling training data with missing attribute values: C4.5 allows attribute values to be marked as ? for missing; missing attribute values are simply not used in gain and entropy calculations. Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
C4.5 Issues: needs the entire dataset to fit in memory, so it is unsuitable for large datasets; needs a lot of computation at every stage of decision tree construction. You can download the software from: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/c4.5r8.tar.gz. More information: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
SLIQ: a decision tree classifier. SLIQ, Supervised Learning In Quest (EDBT'96, Mehta et al.), uses a pre-sorting technique in the tree-growing phase (eliminating the need to sort data at each node): it creates a separate list for each attribute of the training data, and a separate list, called the class list, is created for the class labels attached to the examples. SLIQ requires that only the class list and one attribute list be kept in memory at any time, so it is suitable for classification of large disk-resident datasets. It applies to both numerical and categorical attributes. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
SLIQ Methodology. Create the decision tree by partitioning records: generate an attribute list for each attribute, then sort the attribute lists for NUMERIC attributes (a sketch follows below). [Figure: example attribute lists before (Start) and after (End) pre-sorting; only the NUMERIC attributes are sorted.] Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
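A sketch of the attribute-list / class-list representation described above; the attribute names and records are hypothetical, and each attribute-list entry keeps the value together with a record index into the class list:

```python
def build_sliq_lists(records, labels, numeric_attrs):
    """Build SLIQ-style attribute lists plus a class list from in-memory records (a sketch)."""
    class_list = list(labels)                            # class label indexed by record id
    attribute_lists = {}
    for attr in records[0]:
        pairs = [(rec[attr], rid) for rid, rec in enumerate(records)]   # (value, record id)
        if attr in numeric_attrs:
            pairs.sort()                                 # pre-sort numeric attributes once
        attribute_lists[attr] = pairs
    return attribute_lists, class_list

records = [{"Age": 30, "CarType": "Family"},
           {"Age": 23, "CarType": "Sports"},
           {"Age": 40, "CarType": "Family"}]
attr_lists, class_list = build_sliq_lists(records, ["No", "Yes", "No"], {"Age"})
print(attr_lists["Age"])       # sorted (value, record id) pairs: [(23, 1), (30, 0), (40, 2)]
print(class_list)              # ['No', 'Yes', 'No']
```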
Numeric Attributes: Splitting Index. [Figure: scanning the sorted numeric attribute list while updating the states of the class histograms; Ginisplit is evaluated at each partition position, e.g., Ginisplit = 0.44 at position 0, 0.22 at position 3, and 0.44 at position 6, and the position with the smallest value is chosen.] Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining