1 / 50

Decision Tree Learning

Decision Tree Learning. 主講人:虞台文 大同大學資工所 智慧型多媒體研究室. Content. Introduction CLS ID3 C4.5 References. Decision Tree Learning. Introduction. Examples. Generalize. Instantiated for another case. Inductive Learning. Prediction. Model.

Download Presentation

Decision Tree Learning

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Decision Tree Learning 主講人:虞台文 大同大學資工所 智慧型多媒體研究室

  2. Content • Introduction • CLS • ID3 • C4.5 • References

  3. Decision Tree Learning Introduction

  4. Examples Generalize Instantiated for another case Inductive Learning Prediction Model The general conclusion should apply to unseen examples.

  5. Decision Tree • A decision tree is a tree in which • each branch node represents a choice between a number of alternatives • each leaf node represents a classification or decision

  6. Example I

  7. Example II

  8. Decision Rules

  9. Decision Tree Learning • We wish to be able to induce a decision tree from a set of data about instances together with the decisions or classifications for those instances. • Learning Algorithms: • CLS (Concept Learning System) • ID3  C4  C4.5 C5

  10. Appropriate Problems forDecision Tree Learning • Instances are represented by attribute-value pairs. • The target function has discrete output values. • Disjunctive descriptions may be required. • The training data may contain errors. • The training data may contain missing attribute values.

  11. Decision Tree Learning CLS

  12. CLS Algorithm • T the whole training set. Create a T node. • If all examples in T are positive, create a ‘P’ node with T as its parent and stop. • If all examples in T are negative, create a ‘N’ node with T as its parent and stop. • Select an attribute X with values v1, v2, …, vN and partition T into subsets T1, T2, …, TN according their values on X. Create N nodes Ti(i = 1,..., N) with T as their parent and X = vi as the label of the branch from T to Ti. • For each Tido: TTi and goto step 2.

  13. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e Example

  14. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e tall short (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e (tall, blond, blue) w (tall, silver, black) e (short, silver, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (short, black, blue) w (short, black, brown) e blond black black silver (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e (short, black, blue) w (tall, blond, blue) w (tall, black, brown) e (short, silver, blue) w (short, black, brown) e (tall, blond, brown) w blond (tall, black, black) e silver (short, black, brown) e brown (short, blond, blue) w (short, blond, black) e (short, black, brown) e (tall, silver, black) e (short, black, brown) e blue (tall, silver, blue) w black blue blue black (short, blond, blue) w (short, black, blue) w (tall, silver, black) e (tall, silver, blue) w (short, blond, black) e Example

  15. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e blue brown black (tall, blond, blue) w (short, black, brown) e (tall, silver, black) e (short, black, brown) e (short, silver, blue) w (tall, black, black) e (tall, black, brown) e (short, black, blue) w (short, blond, black) e (tall, blond, brown) w (tall, silver, blue) w (short, blond, blue) w blond black (tall, blond, brown) w (short, black, brown) e (short, black, brown) e (tall, black, brown) e Example

  16. Decision Tree Learning ID3

  17. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e Entropy (Binary Classification) S= C1 : Class 1 C2: Class 2

  18. Entropy (Binary Classification)

  19. Entropy (Binary Classification)

  20. + +  + +   + + + + +   Entropy (Binary Classification) # + = 9 # – = 6

  21. + + + + + +  + + + + + + + + Entropy (Binary Classification) # + = 14 # – = 1

  22. S v1 v2 vn S1 S2 Sn Information Gain

  23. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e tall short (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e (tall, blond, blue) w (tall, silver, black) e (short, silver, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (short, black, blue) w (short, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e Information Gain

  24. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e blond silver black (tall, blond, blue) w (short, silver, blue) w (short, black, blue) w (tall, silver, blue) w (tall, blond, brown) w (short, black, brown) e (tall, silver, black) e (short, blond, blue) w (short, black, brown) e (tall, black, brown) e (short, blond, black) e (tall, black, black) e Information Gain

  25. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e blue brown black (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e (tall, blond, blue) w (short, black, brown) e (tall, silver, black) e (short, black, brown) e (short, silver, blue) w (tall, black, black) e (tall, black, brown) e (short, black, blue) w (short, blond, black) e (tall, blond, brown) w (tall, silver, blue) w (short, blond, blue) w Information Gain

  26. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e Information Gain

  27. ID3 • Iterative Dichotomizer (version) 3 • developed by Quinlan • Select decision sequence of the tree based on information gain.

  28. ID3 (modify of CLS) • T the whole training set. Create a T node. • If all examples in T are positive, create a ‘P’ node with T as its parent and stop. • If all examples in T are negative, create a ‘N’ node with T as its parent and stop. • Select an attribute X with values v1, v2, …, vN and partition T into subsets T1, T2, …, TN according their values on X. Create N nodes Ti(i = 1,..., N) with T as their parent and X = vi as the label of the branch from T to Ti. • For each Tido: TTi and goto step 2.

  29. ID3 (modify of CLS) By maximizing the information gain. • T the whole training set. Create a T node. • If all examples in T are positive, create a ‘P’ node with T as its parent and stop. • If all examples in T are negative, create a ‘N’ node with T as its parent and stop. • Select an attribute X with values v1, v2, …, vN and partition T into subsets T1, T2, …, TN according their values on X. Create N nodes Ti(i = 1,..., N) with T as their parent and X = vi as the label of the branch from T to Ti. • For each Tido: TTi and goto step 2.

  30. (tall, blond, blue) w (short, black, brown) e (short, silver, blue) w (tall, silver, black) e (short, black, blue) w (short, black, brown) e (tall, blond, brown) w (tall, black, brown) e (tall, silver, blue) w (tall, black, black) e (short, blond, blue) w (short, blond, black) e blue brown black (tall, blond, blue) w (short, black, brown) e (tall, silver, black) e (short, black, brown) e (short, silver, blue) w (tall, black, black) e (tall, black, brown) e (short, black, blue) w (short, blond, black) e (tall, blond, brown) w (tall, silver, blue) w (short, blond, blue) w blond black (tall, blond, brown) w (short, black, brown) e (short, black, brown) e (tall, black, brown) e Example

  31. + + + + – + + + + + + + – + + + – + + + + + – + + – + + + + – – – + – + – + – + + – + + – + – + – – + + – + – + + + – + – + + + – – + – – + + + + – – – + + + – – – + + – + + – – – + – + – + + – + – + – – + + – + + + + – – + + – + + – – + – + + + – – – – Windowing

  32. Windowing • ID3 can deal with very large data sets by performing induction on subsets or windows onto the data. • Select a random subset of the whole set of training instances. • Use the induction algorithm to form a rule to explain the current window. • Scan through all of the training instances looking for exceptions to the rule. • Add the exceptions to the window • Repeat steps 2 to 4 until there are no exceptions left.

  33. Inductive Biases • Shorter trees are preferred. • Attributes with higher information gain are selected first in tree construction. • Greedy Search • Preference bias (relative to restriction bias as in the VS approach) • Why prefer short hypotheses? • Occam's razor Generalization

  34. Hypothesis Space Search • Complete hypothesis space • because of disjunctive representation • Incomplete search • No backtracking • Non-incremental • but can be modified to be incremental • Ensemble statistical information for node selection • less sensitive to noise than the VS approach in this sense

  35. Overfitting to the Training Data • The training error is statistically smaller than the test error for a given hypothesis. • Solutions: • Early stopping • Validation sets • Statistical criterion for continuation (of the tree) • Post-pruning • Minimal description length cost-function = error + complexity

  36. Overfitting to the Training Data

  37. Pruning Techniques • Reduced error pruning (of nodes) • Used by ID3 • Rule post-pruning • Used by C4.5

  38. Reduced Error Pruning • Use a separated validation set • Tree accuracy: • percentage of correct classifications on validation set • Method: Do until further pruning is harmful • Evaluate the impact on validation set of pruning each possible node. • Greedily remove the one that most improves the validation set accuracy.

  39. Reduced Error Pruning

  40. Decision Tree Learning C4.5

  41. C4.5:An Extension of ID3 • Some additional features of C4.5 are: • Incorporation of numerical (continuous) attributes. • Nominal (discrete) values of a single attribute may be grouped together, to support more complex tests. • Post-pruning after induction of trees, e.g. based on test sets, in order to increase accuracy. • C4.5 can deal with incomplete information (missing attribute values).

  42. Rule Post-Pruning • Fully induce the decision tree from the training set (allowing overfitting) • Convert the learned tree to rules • one rule for each path from the root node to a leaf node • Prune each rule by removing any preconditions that result in improving its estimated accuracy • Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances

  43. Converting to Rules . . .

  44. Temerature<54 Temerature54-85 Temerature>85 Continuous Valued Attributes Example:

  45. Attributes with Many Values • Information Gain – biases to attribute with many values • e.g., date • One approach – use GainRatio instead

  46. Associate Attributes of Cost • The availability of attributes may vary significantly in costs. • Example: Medical disease classification • Temperature • BiopsyResult • Pulse • BloodTestResult High High

  47. Associate Attributes of Cost • Tan and Schlimmer (1990) • Nunez (1988) w [0, 1] determines the importance of cost.

  48. S = Unknown Attribute Values Attribute A={x, y, z} • Assign most common value of A to the unknown one. • Assign most common value of A with the same target value to the unknown one. • Assign probability to each possible value. Possible Approaches:

  49. Decision Tree Learning References

  50. References • “Building Decision Trees with the ID3 Algorithm”, by: Andrew Colin, Dr. Dobbs Journal, June 1996 • “Incremental Induction of Decision Trees”, by Paul E. Utgoff, Kluwer Academic Publishers, 1989 • “Machine Learning”, by Tom Mitchell, McGraw-Hill, 1997 pp. 52-81 • “C4.5 Programs for Machine Learning”, by J. Ross Quinlan, Morgan Kaufmann, 1993 • “Algorithm Alley Column: C4.5”, by Lynn Monson, Dr. Dobbs Journal, Jan 1997

More Related