
Decision Trees



Presentation Transcript


  1. Decision Trees Jyh-Shing Roger Jang (張智星) CSIE Dept, National Taiwan University

  2. Classification • Stages in classification • Model construction: Given a collection of records (the training set), where each record has a set of attributes including the class, find a model (classifier) that predicts the class as a function of the other attributes. • Model evaluation: Use previously unseen records (the test set) to measure how accurately the model assigns classes. • Model application: Apply the model directly to new records.
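
As an illustration of these three stages (not part of the original slides), here is a minimal sketch using scikit-learn; the dataset and the 70/30 split are arbitrary assumptions.

```python
# Illustrative sketch of the three stages with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier()        # model construction ...
clf.fit(X_train, y_train)             # ... from the training set
print(clf.score(X_test, y_test))      # model evaluation on unseen records
print(clf.predict(X_test[:5]))        # model application to new data
```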

  3. Stages in Classification • (Figure: the stages of classification)

  4. Example

  5. Examples of Classification/Regression Tasks • Classification • Predict the trend (up or down) of stock markets • Predict tumors as benign or malignant • Classify credit card transactions as legitimate or fraudulent • Categorize news articles as finance, weather, entertainment, sports, etc. • Regression • Predict the temperature 3 hours from now • Predict tomorrow’s gold/oil price • Estimate the path of a typhoon

  6. Methods for Classification • Numerous methods for classification • Decision trees • Minimum-distance classifiers • Artificial neural networks • Naïve Bayes classifiers • Quadratic classifiers • Gaussian-mixture-model classifiers • Support vector machines • Rule-based methods • …

  7. Decision Tree Induction • Again, many algorithms • Hunt’s algorithm (one of the earliest) • CART (classification and regression trees) • ID3, C4.5 • SLIQ, SPRINT • …

  9. General Steps in Tree Induction • Idea • We want to send all the training data down the tree until it reaches the leaves, where the data should be as “pure” as possible. • Let D be the data set that reaches a node • General procedure (see the sketch below) • If D contains only records belonging to the same class y, mark the node as a leaf with class y. • Otherwise, use a test on an attribute to split the data set and build subtrees recursively.
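
A minimal sketch of this recursive procedure, in the spirit of Hunt’s algorithm; best_split and the Node layout are hypothetical helpers, not from the slides.

```python
from collections import Counter

class Node:
    def __init__(self, label=None, test=None, children=None):
        self.label = label          # class label if this is a leaf
        self.test = test            # splitting test if internal
        self.children = children    # subtrees, one per split outcome

def grow_tree(D, best_split):
    """D: list of (attributes, class) records.  best_split: hypothetical
    helper that returns (test, list_of_partitions) or None."""
    classes = Counter(y for _, y in D)
    if len(classes) == 1:                       # pure node -> leaf
        return Node(label=next(iter(classes)))
    split = best_split(D)
    if split is None:                           # no useful test left
        return Node(label=classes.most_common(1)[0][0])
    test, parts = split
    return Node(test=test, children=[grow_tree(p, best_split) for p in parts])
```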

  10. Tree Induction • Issues in tree induction • How to split the dataset at a node: split based on a greedy search that optimizes a certain criterion/test • When to stop splitting: when the “impurity measure” falls below a threshold

  11. How to Specify the Test? • Depends on attribute type • Nominal (a.k.a. “factor”) • Car type: family, sports, luxury, etc. • Ordinal • T-shirt size: small, medium, big, etc. • Continuous • Temperature: 10.3, 25.6, 38, etc. • Depends on the number of ways to split • Binary (2-way) split • Multi-way split

  12. Splitting Based on Nominal/Ordinal Attributes • Multi-way split • Use as many partitions as distinct values, e.g., CarType → {Family}, {Sports}, {Luxury} • Binary split • Divide the values into two subsets via optimal partitioning, e.g., CarType → {Sports, Luxury} vs. {Family}, or CarType → {Family, Luxury} vs. {Sports}
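
For a nominal attribute with k distinct values there are 2^(k−1) − 1 nontrivial binary partitions to consider. A small sketch enumerating them (the helper name is my own):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each unordered two-subset partition of a list of nominal
    values exactly once (values[0] is pinned to the left side)."""
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(rest) - set(combo)
            if right:                 # skip the trivial all-vs-nothing split
                yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(left, "vs", right)          # 2^(3-1) - 1 = 3 partitions
```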

  13. Splitting Based on Continuous Attributes • Multi-way split • Discretization to form an ordinal categorical attribute • Binary split (A < v or A ≥ v) • Consider all possible splits to find the best one
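
A small sketch of the discretization option, turning a continuous attribute into an ordinal one with numpy (the cut points and labels are arbitrary assumptions):

```python
import numpy as np

temps = np.array([10.3, 25.6, 38.0, 18.2, 31.5])
edges = [15, 30]                            # assumed cut points
bins = np.digitize(temps, edges)            # 0, 1, or 2 per record
labels = np.array(["cold", "mild", "hot"])[bins]
print(labels)                               # ['cold' 'mild' 'hot' 'mild' 'hot']
```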

  14. To Determine the Best Split • Goal • Nodes with a homogeneous (pure) class distribution are preferred • Need a measure of node impurity (which should be kept as low as possible during split selection) • (Figure: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity)

  15. Measures of Node Impurity • Numerous measures of node impurity, where p(j|t) denotes the relative frequency of class j at node t • Gini index: Gini(t) = 1 − Σ_j [p(j|t)]² • Entropy: Entropy(t) = −Σ_j p(j|t) log p(j|t) • Classification error: Error(t) = 1 − max_j p(j|t) • (Figure: comparison of the three measures for a 2-class problem)
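
A minimal sketch computing the three measures from a node’s class counts (the function names are my own):

```python
import math

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    return 1 - max(counts) / sum(counts)

# All three are 0 for a pure node and maximal for a uniform one.
print(gini([3, 3]), entropy([3, 3]), classification_error([3, 3]))   # 0.5 1.0 0.5
print(gini([6, 0]), entropy([6, 0]), classification_error([6, 0]))   # 0.0 0.0 0.0
```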

  16. Impurity Measure: Gini Index • Gini index for a given node t: Gini(t) = 1 − Σ_j [p(j|t)]², where p(j|t) is the relative frequency of class j at node t (the “confusion” measure in HW4) • Extreme values • Minimum = 0, when all records belong to the same class • Maximum = 1 − 1/(# of classes), when records are equally distributed among all classes

  17. Splitting Based on Gini Index • The quality of splitting a node t into k children (the “total confusion” in HW4): Gini_split = Σ_{i=1..k} (n_i/n) Gini(t_i) • t_i = node of child i • n_i = number of records at t_i • n = number of records at node t
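
A direct transcription of this weighted-average formula (a sketch; the input layout, one class-count list per child, is an assumption):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: one class-count list per child node.  Returns the
    size-weighted average of the children's Gini indices."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)
```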

  18. Gini Index for General Binary Split • Example of computing the Gini index for a binary split on attribute B, where B = Yes leads to node N1 with class counts (C1=5, C2=2) and B = No leads to node N2 with (C1=1, C2=4): • Gini(N1) = 1 − (5/7)² − (2/7)² ≈ 0.408 • Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320 • Gini_split(B) = 7/12 × 0.408 + 5/12 × 0.320 ≈ 0.371
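
A quick, self-contained check of these numbers:

```python
gini = lambda counts: 1 - sum((c / sum(counts)) ** 2 for c in counts)

g1 = gini([5, 2])                  # node N1 (B = Yes): 5 of C1, 2 of C2
g2 = gini([1, 4])                  # node N2 (B = No):  1 of C1, 4 of C2
split = 7/12 * g1 + 5/12 * g2      # children weighted by record counts
print(round(g1, 3), round(g2, 3), round(split, 3))   # 0.408 0.32 0.371
```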

  19. Gini Index for Nominal Attributes • For each child, obtain the counts for each class • Compute the Gini index for each child • Compute the Gini index for the split • (Figure: a multi-way split vs. a two-way split, where the latter requires finding the best partition of values)

  20. Gini Index for Binary Split on Continuous Attributes • For each attribute • Sort the attribute values • Linearly scan the sorted values, updating the class-count matrix and recomputing the Gini index at each candidate split position • Choose the split with the smallest Gini index • (Figure: a table of sorted values and candidate split positions)
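
A minimal sketch of this sort-then-scan procedure for a single continuous attribute (the (value, label) input layout, the midpoint thresholds, and the sample data are assumptions):

```python
def best_continuous_split(pairs):
    """pairs: list of (value, class_label).  Returns (threshold, gini)
    minimizing the size-weighted Gini index over candidate midpoints."""
    pairs = sorted(pairs)                        # sort once ...
    n = len(pairs)
    left = {lab: 0 for _, lab in pairs}          # counts below the threshold
    right = {lab: 0 for _, lab in pairs}
    for _, lab in pairs:
        right[lab] += 1
    def gini(counts, total):
        return 1 - sum((c / total) ** 2 for c in counts.values())
    best = (None, float("inf"))
    for i in range(n - 1):                       # ... then scan linearly
        _, lab = pairs[i]
        left[lab] += 1                           # move one record across
        right[lab] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # no threshold between ties
        k = i + 1                                # records on the left side
        w = k / n * gini(left, k) + (n - k) / n * gini(right, n - k)
        if w < best[1]:
            best = ((pairs[i][0] + pairs[i + 1][0]) / 2, w)
    return best

data = [(60, "No"), (70, "No"), (75, "No"), (85, "Yes"),
        (90, "Yes"), (95, "Yes"), (100, "No")]
print(best_continuous_split(data))               # (80.0, ~0.214)
```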
