Decision Trees Jyh-Shing Roger Jang (張智星) CSIE Dept, National Taiwan University
Classification • Stages in classification • Model construction: Given a collection of records (the training set), where each record has a set of attributes including the class, find a model (classifier) that predicts the class as a function of the other attributes. • Model evaluation: Use previously unseen records (the test set) to check that the model assigns classes as accurately as possible. • Model application: Apply the model directly to new records whose class is unknown.
Examples of Classification/Regression Tasks • Classification • Predict the trend (up or down) of stock markets • Predict tumors as benign or malignant • Classify credit card transactions as legitimate or fraudulent • Categorize news articles as finance, weather, entertainment, sports, etc. • Regression • Predict the temperature 3 hours from now • Predict tomorrow’s gold/oil price • Estimate the path of a typhoon
Methods for Classification • Numerous methods for classification • Decision trees • Minimum-distance classifiers • Artificial neural networks • Naïve Bayes classifiers • Quadratic classifiers • Gaussian-mixture-model classifiers • Support vector machines • Rule-based methods • …
Decision Tree Induction • Again, many algorithms • Hunt’s algorithm (one of the earliest) • CART (classification and regression trees) • ID3, C4.5 • SLIQ, SPRINT • …
General Steps in Tree Induction • Idea • We want to send all the training data down the tree until it reaches the leaves, where the data should be as “pure” as possible. • Let D be the data set that reaches a node • General procedure • If D contains only records belonging to the same class y, then mark the node as a leaf with class y. • Otherwise, use a test on an attribute to split D into subsets, and create a subtree for each subset recursively.
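The general procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any particular package: the names `build_tree`, `classify`, and `first_attribute` are hypothetical, every attribute is treated as nominal with a multi-way split, the test is chosen by a placeholder rule (first unused attribute) rather than an impurity criterion, and a majority vote is used when no attribute is left.

```python
from collections import Counter

def first_attribute(records, labels, attributes):
    # Placeholder test selection (assumption): take the first unused attribute.
    # A real inducer would pick the attribute that optimizes an impurity criterion.
    return attributes[0]

def build_tree(records, labels, attributes, choose_test=first_attribute):
    """records: list of dicts (attribute -> value); labels: list of classes."""
    # All records in D belong to the same class y: mark the node as a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Assumption: if no attribute is left to test, fall back to the majority class.
    if not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Otherwise split D on one attribute and build a subtree per value.
    attr = choose_test(records, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {r[attr] for r in records}:
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_records = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        children[value] = build_tree(sub_records, sub_labels, remaining, choose_test)
    return {"attr": attr, "children": children}

def classify(tree, record):
    # Follow the tests from the root until a leaf is reached.
    while "leaf" not in tree:
        tree = tree["children"][record[tree["attr"]]]
    return tree["leaf"]

# Made-up toy data, for illustration only.
records = [{"CarType": "Family"}, {"CarType": "Sports"},
           {"CarType": "Sports"}, {"CarType": "Luxury"}]
labels = ["no", "yes", "yes", "no"]
tree = build_tree(records, labels, ["CarType"])
print(classify(tree, {"CarType": "Sports"}))
```

Passing a different `choose_test` function is where a greedy, impurity-based selection would plug in.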
Tree Induction • Issues in tree induction • How to split the dataset at a node: Split the dataset based on a greedy search to optimize a certain criterion/test • When to stop splitting: When the “impurity measure” is less than a threshold
How to Specify the Test? • Depends on attribute types • Nominal • Car types: family, sports, luxury, etc. • Ordinal • T-shirt size: small, medium, big, etc. • Continuous • Temperature: 10.3, 25.6, 38, etc. • Depends on number of ways to split • Binary (2-way) split • Multi-way split Aka “factor”
Splitting Based on Nominal/Ordinal Attributes • Multi-way split • Use as many partitions as distinct values • Example: CarType → Family | Sports | Luxury • Binary split • Divides values into two subsets via optimal partitioning • Example: CarType → {Sports, Luxury} | {Family} OR CarType → {Family, Luxury} | {Sports}
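To find an optimal binary partition of a nominal attribute, the candidate partitions have to be enumerated first. The sketch below lists every way to divide the distinct values into two non-empty subsets; the function name `binary_partitions` is hypothetical, and the code ignores value order, so it is meant for nominal rather than ordinal attributes.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield (left, right) pairs of non-empty, disjoint subsets covering all values."""
    values = sorted(set(values))
    if len(values) < 2:
        return  # nothing to split
    first, rest = values[0], values[1:]
    # Keep `first` in the left subset so that each partition is produced once.
    for r in range(len(rest)):  # r < len(rest), so the right subset is never empty
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "|", sorted(right))
```

An attribute with k distinct values has 2^(k-1) - 1 such partitions, so exhaustive enumeration is only practical when k is small.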
Splitting Based on Continuous Attributes • Multi-way split • Discretization to form an ordinal categorical attribute • Binary split (A < v or A ≥ v) • Consider all possible splits to find the best one
To Determine the Best Split • Goal • Nodes with a homogeneous (pure) class distribution are preferred • Need a measure of node impurity (which should be kept as low as possible during split selection) • Non-homogeneous node: high degree of impurity • Homogeneous node: low degree of impurity
Measures of Node Impurity • Numerous measures of node impurity • Gini index • Entropy • Classification error • (Figure: the measures compared for a 2-class problem)
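For a 2-class problem, the three measures can be compared as functions of p, the fraction of records in one class. The sketch below uses the standard textbook definitions, which the slide itself does not spell out, so treat the formulas as an assumption about what the slide intends; the function names are hypothetical.

```python
import math

def gini_2class(p):
    # 1 - p^2 - (1 - p)^2
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def entropy_2class(p):
    # -sum q*log2(q), with the usual convention that 0*log2(0) counts as 0
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def error_2class(p):
    # 1 - (relative frequency of the majority class)
    return 1.0 - max(p, 1.0 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini_2class(p):.3f}  "
          f"entropy={entropy_2class(p):.3f}  error={error_2class(p):.3f}")
```

All three are zero for a pure node (p = 0 or p = 1) and peak at p = 0.5.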
Impurity Measure: Gini Index • Gini index for a given node t (“confusion” in HW4): $\mathrm{Gini}(t) = 1 - \sum_j [P(j|t)]^2$, where P(j|t) is the relative frequency of class j at node t • Extreme values • Minimum = 0, when all records are in the same class • Maximum = 1 − 1/(# of classes), when records are equally distributed among all classes
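A minimal sketch of the Gini index computed from the class counts at a node, assuming the usual definition (one minus the sum of squared relative class frequencies). The function name `gini` and the sample counts are illustrative only.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([6, 0]))  # all records in the same class
print(gini([3, 3]))  # records equally distributed over two classes
print(gini([1, 5]))  # a mostly pure node
```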
Splitting Based on Gini Index • The quality of splitting a node t into k children (“total confusion” in HW4): $\mathrm{Gini}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Gini}(t_i)$ • t_i = node of child i • n_i = number of records at t_i • n = number of records at node t
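The split quality is the Gini index of each child weighted by the share of records it receives. A small sketch follows, with hypothetical function names and made-up class counts (they are not taken from the slides):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """children: one tuple of class counts per child node t_i."""
    n = sum(sum(counts) for counts in children)  # records at the parent node t
    # Weight each child's Gini index by n_i / n.
    return sum(sum(counts) / n * gini(counts) for counts in children)

# Made-up example: 10 records, two classes, split into two children.
parent = (5, 5)
children = [(4, 1), (1, 4)]
print(round(gini(parent), 3), round(gini_split(children), 3))
```

A useful split yields a weighted child impurity below the parent's impurity; a split into perfectly pure children gives 0.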
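Gini Index for General Binary Split • Example of computing the Gini index for a binary split on B (Yes → node N1, No → node N2) • N1 receives 7 of the 12 records (class counts 5 and 2); N2 receives 5 (class counts 1 and 4) • $\mathrm{Gini}(N_1) = 1 - (5/7)^2 - (2/7)^2 = 0.408$ • $\mathrm{Gini}(N_2) = 1 - (1/5)^2 - (4/5)^2 = 0.320$ • $\mathrm{Gini}_{split}(B) = \frac{7}{12} \times 0.408 + \frac{5}{12} \times 0.320 = 0.371$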
Gini Index for Nominal Attributes • For each child, obtain counts for each class • Compute the Gini index for each child • Compute the Gini index for the split • Applies to both a multi-way split and a two-way split (find the best partition of values)
Gini Index for Binary Split on Continuous Attributes • For each attribute • Sort the attribute values • Linearly scan these values, updating the count matrix and computing the Gini index at each new split position • Choose the split that has the smallest Gini index • (Figure: sorted values and candidate split positions)
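The sort-and-scan procedure for one continuous attribute might look like the sketch below. It is an illustration under assumptions: the function names are hypothetical, candidate thresholds are taken as midpoints between consecutive distinct sorted values (the slide does not say where split positions lie), and the test has the form A < v.

```python
from collections import Counter

def gini_from_counts(counts, n):
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_continuous_split(values, labels):
    """Return (threshold v, weighted Gini) for the best test A < v."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # sort by attribute value
    n = len(pairs)
    left = Counter()                      # class counts for A < v
    right = Counter(y for _, y in pairs)  # class counts for A >= v
    best_v, best_g = None, float("inf")
    for i in range(1, n):
        # Linear scan: move one record from the right side to the left side.
        y = pairs[i - 1][1]
        left[y] += 1
        right[y] -= 1
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        g = (i / n) * gini_from_counts(left, i) \
            + ((n - i) / n) * gini_from_counts(right, n - i)
        if g < best_g:
            best_v, best_g = (lo + hi) / 2, g
    return best_v, best_g

# Made-up data: the two classes separate cleanly between 3 and 10.
v, g = best_continuous_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(v, g)
```

Because the class counts are updated incrementally, the scan itself is linear in the number of records once the values are sorted.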