Understanding Decision Trees: A Machine Learning Overview

Three kinds of learning • Supervised learning • Learning some mapping from inputs to outputs • Unsupervised learning • Given “data”, what kinds of patterns can you find? • Reinforcement learning • Learn from positive and negative reinforcement

Categorical data example Example from Ross Quinlan, Decision Tree Induction; graphics from Tom Mitchell, Machine Learning

Decision Tree Classification

Which feature to split on? Try to classify as many as possible with each split (This is a good split)

Which feature to split on? These are bad splits – no classifications obtained

Improving a good split

Decision Tree Algorithm Framework • Use splitting criterion to decide on best attribute to split • Each child is new decision tree – recurse with parent feature removed • If all data points in child node are same class, classify node as that class • If no attributes left, classify by majority rule • If no data points left, no such example seen: classify as majority class from entire dataset
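
The five rules above can be sketched as a short recursive function. This is only an illustrative sketch, not the slides' own code: the names `build_tree`, `classify`, `majority`, `domains` and `first_feature` are hypothetical, the tiny dataset is made up, and the splitting criterion is passed in as a function so that any criterion can be plugged in. Class labels are assumed to be strings, because tuples are used to mark internal nodes.

```python
from collections import Counter


def majority(labels):
    """Most common class label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, features, choose_feature, domains, default):
    """Return a class label (leaf) or a (feature, {value: subtree}) pair."""
    if not labels:                 # no data points left: use dataset-wide majority
        return default
    if len(set(labels)) == 1:      # all points share one class
        return labels[0]
    if not features:               # no attributes left: majority rule
        return majority(labels)
    best = choose_feature(rows, labels, features)
    remaining = [f for f in features if f != best]  # parent feature removed
    children = {}
    for value in domains[best]:    # one child per possible value of the feature
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep],
            remaining, choose_feature, domains, default)
    return (best, children)


def classify(tree, row):
    """Follow the branches matching `row` until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[row[feature]]
    return tree


# Made-up data; "z" is a value of feature "a" that never occurs in the rows.
rows = [
    {"a": "x", "b": "u"},
    {"a": "x", "b": "v"},
    {"a": "y", "b": "u"},
    {"a": "y", "b": "u"},
    {"a": "y", "b": "u"},
]
labels = ["+", "+", "-", "-", "+"]
domains = {"a": ["x", "y", "z"], "b": ["u", "v"]}


def first_feature(rows, labels, features):
    """Placeholder criterion (an assumption): split on the first feature."""
    return features[0]


tree = build_tree(rows, labels, ["a", "b"], first_feature, domains,
                  majority(labels))
```

In this toy run the branch `a = z` has no data points and falls back to the dataset-wide majority class, while the branch `a = y, b = u` runs out of attributes and is settled by majority rule.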

How do we know which splits are good? • Want nodes as “pure” as possible • How do we quantify “randomness” of a node? Want • All elements +: “randomness” = 0 • All elements –: “randomness” = 0 • Half +, half -: “randomness” = 1 • Draw plot • What should “randomness” function look like?

Typical solution: Entropy • pp = proportion of + examples • pn = proportion of – examples • Entropy = -pp log2(pp) - pn log2(pn), with 0 log2(0) taken as 0 • A collection with low entropy is good.
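
A minimal sketch of the standard two-class entropy, written in terms of the counts of + and – examples at a node. The function name `entropy` and its count-based signature are choices made for this example only. It reproduces the three values asked for on the previous slide: 0 for an all-+ node, 0 for an all-– node, and 1 for an even split.

```python
import math


def entropy(num_pos, num_neg):
    """Entropy in bits: -pp*log2(pp) - pn*log2(pn), with 0*log2(0) taken as 0."""
    total = num_pos + num_neg
    if total == 0:
        return 0.0
    result = 0.0
    for count in (num_pos, num_neg):
        p = count / total
        if p > 0:          # skip empty classes instead of evaluating log2(0)
            result -= p * math.log2(p)
    return result
```

For example, `entropy(4, 0)` and `entropy(0, 4)` are both 0, and `entropy(2, 2)` is 1.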

ID3 Criterion • Split on feature with most information gain. • Gain = entropy in original node – weighted sum of entropy in child nodes
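
The gain formula above can be sketched as follows, assuming each example is a dictionary from feature name to value. The helper names (`entropy`, `information_gain`, `best_feature`) and the four-row dataset are made up for illustration; the "weighted sum" weights each child by its share of the parent's examples.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the node minus the size-weighted entropy of its children."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def best_feature(rows, labels, features):
    """ID3 criterion: the feature with the most information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Made-up data: "a" separates the classes perfectly, "b" tells us nothing.
rows = [
    {"a": "x", "b": "u"},
    {"a": "x", "b": "v"},
    {"a": "y", "b": "u"},
    {"a": "y", "b": "v"},
]
labels = ["+", "+", "-", "-"]
gain_a = information_gain(rows, labels, "a")
gain_b = information_gain(rows, labels, "b")
```

Here splitting on `a` yields two pure children, so its gain equals the full entropy of the parent, while splitting on `b` leaves both children half + and half – and gains nothing.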

How good is this split?

The big picture • Start with root • Find attribute to split on with most gain • Recurse

Assessment • How do I know how well my decision tree works? • Training set: data that you use to build decision tree • Test set: data that you did not use for training that you use to assess the quality of decision tree

Issues on training and test sets • Do you know the correct classification for the test set? • If you do, why not include it in the training set to get a better classifier? • If you don’t, how can you measure the performance of your classifier?

Cross Validation • Tenfold cross-validation • Ten iterations • Pull a different tenth of the dataset out each time to act as a test set • Train on the remaining training set • Measure performance on the test set • Leave one out cross-validation • Similar, but leave only one point out each time, then count correct vs. incorrect
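
One way to sketch the procedure above. The function names are hypothetical, and `train_fn` and `predict_fn` stand in for whatever learner is being assessed. This sketch assigns every k-th example to the same fold, which is an assumption of the example; in practice the data is usually shuffled first. Setting `k` to the number of examples gives leave-one-out cross-validation.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]          # every k-th item, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(rows, labels, k, train_fn, predict_fn):
    """Fraction of examples classified correctly while held out."""
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(rows), k):
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, rows[i]) == labels[i]
                       for i in test_idx)
    return correct / len(rows)
```

Each example is used for testing exactly once and is never part of the training data for the fold that tests it.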

Noise and Overfitting • Can we always obtain a decision tree that is consistent with the data? • Do we always want a decision tree that is consistent with the data? • Example: Predict Carleton students who become CEOs • Features: state/country of origin, GPA letter, major, age, high school GPA, junior high GPA, ... • What happens with only a few features? • What happens with many features?

Overfitting • Fitting a classifier “too closely” to the data • Finding patterns that aren’t really there • Prevented in decision trees by pruning • When building trees, stop recursion on irrelevant attributes • Do statistical tests at each node to determine whether to continue

Examples of decision trees using Weka

Preventing overfitting by cross validation • Another technique to prevent overfitting (is this valid?) • Keep on recursing on decision tree as long as you continue to get improved accuracy on the test set

Ensemble Methods • Many “weak” learners, when combined together, can perform more strongly than any one by itself • Bagging & Boosting: many different learners, voting on which classification • Multiple algorithms, or different features, or both

Bagging / Boosting • Bagging: vote to determine answer • Run one algorithm on random subsets of data to obtain multiple classifiers • Boosting: weighted vote to determine answer • Each iteration, weight more heavily data that learner got wrong • What does it mean to “weight more heavily” for k-nn? For decision trees? • AdaBoost is recent (1997) and has become popular, fast
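
A minimal sketch of the bagging half of this slide, with hypothetical function names. It assumes the "random subsets" are bootstrap samples, that is, samples of the same size as the dataset drawn with replacement, which is the usual choice but is not stated on the slide. `train_fn` and `predict_fn` stand in for the single algorithm being bagged. Boosting is not shown.

```python
import random
from collections import Counter


def bagging_train(rows, labels, train_fn, n_models, rng):
    """Train `n_models` classifiers, each on its own bootstrap sample."""
    n = len(rows)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]   # draw with replacement
        models.append(train_fn([rows[i] for i in sample],
                               [labels[i] for i in sample]))
    return models


def bagging_predict(models, predict_fn, row):
    """Unweighted vote: return the class predicted by the most models."""
    votes = Counter(predict_fn(model, row) for model in models)
    return votes.most_common(1)[0][0]
```

A caller would pass a seeded generator such as `random.Random(0)` as `rng` to make the sampling reproducible.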

Computational Learning Theory

Chapter 20 up next • Moving on to Chapter 20: statistical learning methods • Skipping to: will revisit earlier topics (perhaps) near end of course • 20.5: Neural Networks • 20.6: Support vector machines
