Lecture 4: Decision Trees, continued
CSC 4510 – Machine Learning
Dr. Mary-Angela Papalaskari, Department of Computing Sciences, Villanova University
Course website: www.csc.villanova.edu/~map/4510/
Tiny Example of Decision Tree Learning
• Instance attributes: <size, color, shape>
• C = {positive, negative}
• D: <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: −
[Figure: target decision tree. color: green → neg, blue → neg, red → shape; shape: circle → pos, triangle → neg, square → neg]
Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, with one branch for each value of the feature, and leaves specify the category.
• Can represent arbitrary conjunctions and disjunctions. Can represent any classification function over discrete feature vectors.
• Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF) (see the sketch below):
  red ∧ circle → pos
  red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C
[Figure: two decision trees on color and shape, one with pos/neg leaves and one with A/B/C leaves]
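To make this concrete, here is a small illustrative sketch (not from the slides; the nested-dict representation and function names are my own) that encodes the pos/neg tree above as a Python dict, classifies an instance, and flattens the tree into DNF-style rules:

```python
# A decision tree as nested dicts: an internal node maps a feature name to
# {value: subtree}; a leaf is a plain class label.
tree = {
    "color": {
        "green": "neg",
        "blue": "neg",
        "red": {"shape": {"circle": "pos", "triangle": "neg", "square": "neg"}},
    }
}

def classify(tree, instance):
    """Follow the branch matching the instance's feature value until a leaf."""
    while isinstance(tree, dict):
        feature, branches = next(iter(tree.items()))
        tree = branches[instance[feature]]
    return tree

def to_rules(tree, conditions=()):
    """Flatten the tree into (conjunction of conditions -> class) rules (DNF)."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    feature, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((feature, value),)))
    return rules

print(classify(tree, {"size": "big", "color": "red", "shape": "circle"}))  # pos
for conds, label in to_rules(tree):
    print(" AND ".join(f"{f}={v}" for f, v in conds), "->", label)
```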
Top-Down Decision Tree Creation
All examples at the root: <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: −
[Figure: first split on color; the red branch receives <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −]
Top-Down Decision Tree Creation (continued)
[Figure: the completed tree. color: green → neg, blue → neg (<big, blue, circle>: −), red → shape; shape: circle → pos (<big, red, circle>: +, <small, red, circle>: +), triangle → neg, square → neg (<small, red, square>: −)]
First step in detail
Start:
  positive (count 2, probability .5): <big, red, circle>, <small, red, circle>
  negative (count 2, probability .5): <small, red, square>, <big, blue, circle>
Color = red:
  positive (count 2, probability .667): <big, red, circle>, <small, red, circle>
  negative (count 1, probability .333): <small, red, square>
Color = green:
  positive (count 0, probability 0): none
  negative (count 0, probability 0): none
Color = blue:
  positive (count 0, probability 0): none
  negative (count 1, probability 1): <big, blue, circle>
First step in detail (entropies)
Start: Entropy = 1
Color = red: Entropy = 0.92
Color = green: Entropy = 0 (no examples)
Color = blue: Entropy = 0
First step in detail: information gain for Color
Gain(Color) = Entropy(start) − (Entropy(red)·(3/4) + Entropy(green)·(0/4) + Entropy(blue)·(1/4)) = 1 − (0.92·0.75 + 0 + 0) = 1 − 0.69 = 0.31
First step in detail: split on Size?
Start:
  positive (count 2, probability .5): <big, red, circle>, <small, red, circle>
  negative (count 2, probability .5): <small, red, square>, <big, blue, circle>
Size = big:
  positive (count 1, probability .5): <big, red, circle>
  negative (count 1, probability .5): <big, blue, circle>
Size = small:
  positive (count 1, probability .5): <small, red, circle>
  negative (count 1, probability .5): <small, red, square>
First step in detail: split on Size
Start: Entropy = 1
Size = big: Entropy = 1 (1 positive, 1 negative)
Size = small: Entropy = 1 (1 positive, 1 negative)
First step in detail: information gain for Size
Gain(Size) = Entropy(start) − (Entropy(big)·(2/4) + Entropy(small)·(2/4)) = 1 − (1/2 + 1/2) = 0
Splitting on Color (gain 0.31) is therefore better than splitting on Size (gain 0).
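A quick numeric check of the two gains above (an illustrative sketch, not part of the original slides; the binary_entropy helper is my own name):

```python
from math import log2

def binary_entropy(p):
    """Entropy of a two-class set where p is the fraction of positive examples."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

start = binary_entropy(2 / 4)                                               # 1.0
gain_color = start - (3 / 4) * binary_entropy(2 / 3) - (1 / 4) * binary_entropy(0)
gain_size = start - (2 / 4) * binary_entropy(1 / 2) - (2 / 4) * binary_entropy(1 / 2)
print(round(gain_color, 3), round(gain_size, 3))                            # 0.311 0.0
```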
Top-Down Decision Tree Induction
• Build a tree top-down by divide and conquer: split on a feature, partition the examples by its values, and recurse on each impure subset (a minimal code sketch follows).
[Figure: the completed tree from the running example, as on the previous slides]
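A minimal sketch of this greedy procedure on the running example (illustrative, not the original ID3 implementation; the nested-dict tree representation and function names are assumptions), using the information-gain heuristic defined on the next slides:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """Expected reduction in entropy from splitting on `feature`."""
    n = len(labels)
    remainder = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, features):
    """Greedy top-down induction: returns a class label or a nested-dict tree."""
    if len(set(labels)) == 1:                       # pure node -> leaf
        return labels[0]
    if not features:                                # no features left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(examples, labels, f))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

examples = [{"size": "big",   "color": "red",  "shape": "circle"},
            {"size": "small", "color": "red",  "shape": "circle"},
            {"size": "small", "color": "red",  "shape": "square"},
            {"size": "big",   "color": "blue", "shape": "circle"}]
labels = ["pos", "pos", "neg", "neg"]
print(id3(examples, labels, ["size", "color", "shape"]))
# e.g. {'color': {'red': {'shape': {'circle': 'pos', 'square': 'neg'}}, 'blue': 'neg'}}
```

Because no green examples occur in D, the learned tree has no green branch, unlike the hand-drawn tree on the earlier slide.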
Another view (AISpace)
Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam's razor.
• Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem.
• The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest.
• General lesson in ML: "Greed is good."
• We want to pick a feature that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain and originated with Quinlan's ID3 system (1979).
Entropy
• Entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is:
  Entropy(S) = −p1·log2(p1) − p0·log2(p0)
  where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0·log(0) = 0).
• If examples are equally mixed (p1 = p0 = 0.5), entropy is a maximum of 1.
• Entropy can be viewed as the average number of bits required to encode the class of an example in S, where data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:
  Entropy(S) = −Σ (i = 1..c) p_i·log2(p_i)
  where p_i is the fraction of examples in S belonging to category i.
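A small illustrative sketch of this definition (not from the slides; the function name and probability-list interface are assumptions), covering both the binary case and the multi-class generalization:

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a class distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)   # define 0*log(0) = 0

print(entropy([0.5, 0.5]))        # 1.0   (maximally mixed binary set)
print(entropy([1.0, 0.0]))        # 0.0   (pure set)
print(entropy([2/3, 1/3]))        # ~0.918 (the red subset from the example)
print(entropy([1/3, 1/3, 1/3]))   # ~1.585 = log2(3), the 3-class maximum
```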
Entropy Plot for Binary Classification
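The original plot is an image; here is a short matplotlib sketch (an assumption about what the figure shows) that reproduces binary entropy as a function of the positive fraction p1:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)          # avoid log2(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("p1 (fraction of positive examples)")
plt.ylabel("Entropy(S)")
plt.title("Entropy of a binary classification")
plt.show()
```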
Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:
  Gain(S, F) = Entropy(S) − Σ (v ∈ Values(F)) (|Sv| / |S|)·Entropy(Sv)
  where Sv is the subset of S having value v for feature F.
• The entropy of each resulting subset is weighted by its relative size.
• Example summary (D: <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: −; 2+, 2−, E = 1):
  Split on size: big → 1+, 1− (E = 1); small → 1+, 1− (E = 1); Gain = 1 − (0.5·1 + 0.5·1) = 0
  Split on color: red → 2+, 1− (E = 0.918); blue → 0+, 1− (E = 0); Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
  Split on shape: circle → 2+, 1− (E = 0.918); square → 0+, 1− (E = 0); Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
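A compact sketch applying this definition to the example summary (the data encoding and helper names are illustrative, not from the slides); it reproduces the three gains above:

```python
from math import log2
from collections import Counter

data = [({"size": "big",   "color": "red",  "shape": "circle"}, "+"),
        ({"size": "small", "color": "red",  "shape": "circle"}, "+"),
        ({"size": "small", "color": "red",  "shape": "square"}, "-"),
        ({"size": "big",   "color": "blue", "shape": "circle"}, "-")]

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(data, feature):
    """Gain(S, F) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    labels = [y for _, y in data]
    total = entropy(labels)
    for value in set(x[feature] for x, _ in data):
        s_v = [y for x, y in data if x[feature] == value]
        total -= (len(s_v) / len(data)) * entropy(s_v)
    return total

for f in ("size", "color", "shape"):
    print(f, round(gain(data, f), 3))   # size 0.0, color 0.311, shape 0.311
```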
Bias in Decision-Tree Induction
• Information gain gives a bias toward trees of minimal depth.
• Recall: a bias is any criterion other than consistency with the training data that is used to select a hypothesis.
History of Decision-Tree Research
• In the 1960s, Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning.
• In the late 1970s, Quinlan developed ID3, with the information-gain heuristic, to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
• In the 1980s, a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools resulted.
• Quinlan's updated decision-tree package, C4.5, was released in 1993.
• Weka includes a Java version of C4.5 called J48.
Additional Decision Tree Issues
• Better splitting criteria: information gain prefers features with many values (a gain-ratio sketch follows below).
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning (ID4, ID5)
• Mining large databases that do not fit in main memory
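For the first issue, a standard remedy (not detailed on the slide) is C4.5's gain ratio, which divides information gain by the "split information" of the feature so that features splitting the data into many small subsets are penalized. A brief illustrative sketch (helper names and data encoding are my own):

```python
from math import log2
from collections import Counter

def entropy_of_counts(counts):
    """Entropy of a distribution given as a list of counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(data, feature):
    """C4.5-style correction: information gain divided by split information."""
    labels = [y for _, y in data]
    total = entropy_of_counts(Counter(labels).values())
    split_counts = Counter(x[feature] for x, _ in data)
    remainder = 0.0
    for value, n_v in split_counts.items():
        s_v = Counter(y for x, y in data if x[feature] == value)
        remainder += (n_v / len(data)) * entropy_of_counts(s_v.values())
    info_gain = total - remainder
    split_info = entropy_of_counts(split_counts.values())
    return info_gain / split_info if split_info > 0 else 0.0

data = [({"color": "red"}, "+"), ({"color": "red"}, "+"),
        ({"color": "red"}, "-"), ({"color": "blue"}, "-")]
print(round(gain_ratio(data, "color"), 3))   # 0.311 / 0.811, about 0.384
```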
Class Exercise
• Practice using decision tree learning on some of the sample datasets available in AISpace.
Some of the slides in this presentation are adapted from:
• Prof. Frank Klassner's ML class at Villanova
• the University of Manchester ML course: http://www.cs.manchester.ac.uk/ugt/COMP24111/
• the Stanford online ML course: http://www.ml-class.org/
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/