CS B351: Decision Trees
Agenda • Decision trees • Learning curves • Combatting overfitting
Classification Tasks • Supervised learning setting • The target function f(x) takes on values True and False • An example is positive if f is True, else it is negative • The set X of all possible examples is the example set • The training set is a subset of X (a small one!)
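Below is a minimal sketch of this setting in Python; the type aliases and the two-example training set are illustrative, not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, bool]               # one observation x, described by observable attributes
LabeledExample = Tuple[Example, bool]   # a pair (x, f(x))

def is_positive(x: Example, f: Callable[[Example], bool]) -> bool:
    """An example is positive exactly when the target function f is True on it."""
    return f(x)

# The training set is a (small!) subset of the example set X of all possible examples.
training_set: List[LabeledExample] = [
    ({"A": True,  "B": False, "C": True},  True),
    ({"A": False, "B": True,  "C": False}, False),
]
```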
Logical Classification Dataset • Here, examples (x, f(x)) take on discrete values • Note that the training set does not say whether an observable predicate is pertinent or not
Logical Classification Task • Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
Predicate as a Decision Tree • The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: test A: if A is False → False; if A is True, test B: if B is False → True; if B is True, test C: if C is True → True; if C is False → False • Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted • x is a mushroom • CONCEPT = POISONOUS • A = YELLOW • B = BIG • C = SPOTTED • D = FUNNEL-CAP • E = BULKY
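As a sanity check, here is a small Python sketch, assuming the reconstructed formula above (attribute names follow the mushroom example), that verifies the tree and the sentence agree on every possible example.

```python
from itertools import product

def concept_formula(a, b, c):
    # CONCEPT(x) <=> A(x) and (not B(x) or C(x))
    return a and ((not b) or c)

def concept_tree(a, b, c):
    # The decision tree: test A, then B, then C.
    if not a:        # not yellow -> not poisonous
        return False
    if not b:        # yellow and small -> poisonous
        return True
    return c         # yellow and big -> poisonous iff spotted

# The two representations agree on all 8 possible attribute combinations.
assert all(concept_formula(a, b, c) == concept_tree(a, b, c)
           for a, b, c in product([False, True], repeat=3))
```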
Possible Decision Tree • The training set is also consistent with a much larger decision tree that tests D, E, B, A, and C • Read off as a logical sentence, that tree yields a formula over D, E, B, A, C that is far more complicated than CONCEPT ⇔ A ∧ (¬B ∨ C) • KIS ("keep it simple") bias: build the smallest decision tree • Building the smallest decision tree is a computationally intractable problem → use a greedy algorithm
Getting Started: Top-Down Induction of a Decision Tree • The distribution of the training set is: True: 6, 7, 8, 9, 10, 13; False: 1, 2, 3, 4, 5, 11, 12 • Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13 • Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? → Greedy algorithm
Assume It's A • If A is True: True: 6, 7, 8, 9, 10, 13; False: 11, 12 • If A is False: True: none; False: 1, 2, 3, 4, 5 • If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise • The number of misclassified examples from the training set is 2
Assume It's B • If B is True: True: 9, 10; False: 2, 3, 11, 12 • If B is False: True: 6, 7, 8, 13; False: 1, 4, 5 • If we test only B, we will report that CONCEPT is False if B is True and True otherwise • The number of misclassified examples from the training set is 5
Assume It's C • If C is True: True: 6, 8, 9, 10, 13; False: 1, 3, 4 • If C is False: True: 7; False: 2, 5, 11, 12 • If we test only C, we will report that CONCEPT is True if C is True and False otherwise • The number of misclassified examples from the training set is 4
Assume It's D • If D is True: True: 7, 10, 13; False: 3, 5 • If D is False: True: 6, 8, 9; False: 1, 2, 4, 11, 12 • If we test only D, we will report that CONCEPT is True if D is True and False otherwise • The number of misclassified examples from the training set is 5
Assume It's E • If E is True: True: 8, 9, 10, 13; False: 1, 3, 5, 12 • If E is False: True: 6, 7; False: 2, 4, 11 • If we test only E, we will report that CONCEPT is False, independent of the outcome • The number of misclassified examples from the training set is 6 • So, the best predicate to test is A
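A short Python sketch of this greedy choice. The slides' 13-example attribute table is not reproduced here, so the functions below work on any list of (attribute-dict, label) pairs; the function names are illustrative.

```python
from collections import Counter

def majority_errors(labels):
    """# of examples misclassified if the branch predicts its majority label."""
    return len(labels) - max(Counter(labels).values()) if labels else 0

def errors_if_split(examples, attribute):
    """Training-set errors after testing one attribute and applying the majority rule per branch."""
    true_branch  = [label for x, label in examples if x[attribute]]
    false_branch = [label for x, label in examples if not x[attribute]]
    return majority_errors(true_branch) + majority_errors(false_branch)

def best_predicate(examples, attributes):
    """The error-minimizing predicate -- A in the example above."""
    return min(attributes, key=lambda a: errors_if_split(examples, a))
```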
Choice of Second Predicate • Test A at the root (A = False → False); in the A = True branch, test C • If C is True: True: 6, 8, 9, 10, 13; False: none • If C is False: True: 7; False: 11, 12 • The number of misclassified examples from the training set is 1
Choice of Third Predicate • In the A = True, C = False branch, test B • If B is True: True: none; False: 11, 12 • If B is False: True: 7; False: none
Final Tree • Test A: if A is False → False; if A is True, test C: if C is True → True; if C is False, test B: if B is True → False; if B is False → True • The learned tree computes CONCEPT ⇔ A ∧ (C ∨ ¬B), which is equivalent to the original concept CONCEPT ⇔ A ∧ (¬B ∨ C)
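A quick check of that equivalence, assuming the formulas reconstructed above; the greedy tree simply tests the attributes in a different order.

```python
from itertools import product

def learned_tree(a, b, c):
    # The induced tree: test A, then C, then B.
    if not a:
        return False
    if c:
        return True
    return not b

def target_concept(a, b, c):
    return a and ((not b) or c)

# The learned tree agrees with the target concept on every attribute combination.
assert all(learned_tree(a, b, c) == target_concept(a, b, c)
           for a, b, c in product([False, True], repeat=3))
```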
Top-Down Induction of a DT
DTL(D, Predicates)
• If all examples in D are positive then return True
• If all examples in D are negative then return False
• If Predicates is empty then return failure
• A ← error-minimizing predicate in Predicates
• Return the tree whose: root is A, left branch is DTL(D+A, Predicates − A), right branch is DTL(D−A, Predicates − A)
Here D+A is the subset of examples that satisfy A, and D−A the subset that do not.
Noise in the training set! Noisy data can exhaust the predicates without separating the classes, so step 3 should return the majority rule instead of failure.
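A runnable sketch of DTL under the assumptions used above: examples are (attribute-dict, label) pairs, all attributes are boolean, and predicates are chosen by minimizing training-set errors. Returning the parent's majority label for an empty branch is a common convention, not something the slide specifies.

```python
from collections import Counter

def majority(examples):
    labels = [label for _, label in examples]
    return Counter(labels).most_common(1)[0][0]

def branch_errors(labels):
    return len(labels) - max(Counter(labels).values()) if labels else 0

def best_predicate(examples, predicates):
    def errors(a):
        t = [label for x, label in examples if x[a]]
        f = [label for x, label in examples if not x[a]]
        return branch_errors(t) + branch_errors(f)
    return min(predicates, key=errors)

def dtl(examples, predicates, default=False):
    if not examples:
        return default                           # empty branch: fall back on the parent's majority
    labels = {label for _, label in examples}
    if labels == {True}:
        return True                              # all examples positive
    if labels == {False}:
        return False                             # all examples negative
    if not predicates:
        return majority(examples)                # no predicates left: majority rule
    a = best_predicate(examples, predicates)     # greedy, error-minimizing choice
    rest = [p for p in predicates if p != a]
    pos = [(x, l) for x, l in examples if x[a]]       # D+A: examples that satisfy A
    neg = [(x, l) for x, l in examples if not x[a]]   # D-A: the rest
    return (a, dtl(pos, rest, majority(examples)),    # left branch
               dtl(neg, rest, majority(examples)))    # right branch
```

On the slides' training set, this greedy choice picks A first, matching the worked example.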
Comments • Widely used algorithm • Easy to extend to k-class classification • Greedy • Robust to noise (incorrect examples) • Not incremental
Human-Readability • DTs also have the advantage of being easily understood by humans • Legal requirement in many areas • Loans & mortgages • Health insurance • Welfare
Learnable Concepts • Some simple concepts cannot be represented compactly in DTs • Parity(x) = X1 xor X2 xor … xor Xn • Majority(x) = 1 if most of Xi’s are 1, 0 otherwise • Exponential size in # of attributes • Need exponential # of examples to learn exactly • The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
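For concreteness, here are the two concepts written as Python predicates over a boolean vector (an exact decision tree for parity must test every attribute on every path, hence the exponential size):

```python
def parity(x):
    """x1 xor x2 xor ... xor xn: True iff an odd number of the bits are 1."""
    return sum(x) % 2 == 1

def majority(x):
    """True iff more than half of the bits are 1."""
    return sum(x) > len(x) / 2

print(parity([1, 0, 1, 1]), majority([1, 0, 1, 1]))   # True True
```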
Performance Issues • Assessing performance: training set and test set; learning curve (% correct on the test set vs. size of the training set) • [Figure: typical learning curve — test-set accuracy rises with the size of the training set; some concepts are unrealizable within a machine's capacity, so the curve can level off below 100%] • Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • Tree pruning: terminate recursion when the # of errors (or the information gain) is small • The resulting decision tree + majority rule may not classify correctly all examples in the training set
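A sketch of how such a learning curve could be produced with scikit-learn (assumed to be available); the synthetic dataset and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # A large gap between training and test accuracy is the signature of overfitting.
    print(f"{n:4d} training examples: train acc {tr:.2f}, test acc {te:.2f}")
```

scikit-learn's min_impurity_decrease and ccp_alpha parameters play roughly the role of the termination and pruning rules sketched above.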
Statistical Methods for Addressing Overfitting / Noise • There may be few training examples that match the path leading to a deep node in the decision tree • More susceptible to choosing irrelevant/incorrect attributes when sample is small • Idea: • Make a statistical estimate of predictive power (which increases with larger samples) • Prune branches with low predictive power • Chi-squared pruning
Top-down DT pruning • Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly • At its k leaf nodes, the numbers of correctly/incorrectly predicted examples are p1/n1, …, pk/nk • Chi-squared statistical significance test: • Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant) • Alternate hypothesis: examples are not randomly chosen (X is relevant) • Prune X if testing X is not statistically significant
Chi-Squared Test • Let Z = Σi [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ] • where pi′ = p (pi + ni) / (p + n) and ni′ = n (pi + ni) / (p + n) are the expected numbers of correctly/incorrectly predicted examples at leaf node i if the null hypothesis holds • Z is a statistic that is approximately drawn from the chi-squared distribution with k − 1 degrees of freedom • Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05)
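A sketch of this test in code: the node's children are given as (pi, ni) pairs, and scipy (assumed available) supplies the chi-squared tail probability.

```python
from scipy.stats import chi2

def should_prune(children, alpha=0.05):
    """children: list of (p_i, n_i) counts at the leaves. Returns True if the split should be pruned."""
    p = sum(pi for pi, ni in children)
    n = sum(ni for pi, ni in children)
    z = 0.0
    for pi, ni in children:
        p_exp = p * (pi + ni) / (p + n)      # expected p_i under the null hypothesis
        n_exp = n * (pi + ni) / (p + n)      # expected n_i under the null hypothesis
        z += (pi - p_exp) ** 2 / p_exp + (ni - n_exp) ** 2 / n_exp
    p_value = chi2.sf(z, df=len(children) - 1)   # k - 1 degrees of freedom
    return p_value > alpha                       # not significant -> X looks irrelevant

# A split whose branches mirror the parent's overall correct/incorrect ratio carries
# no information, so it gets pruned:
print(should_prune([(30, 10), (60, 20)]))   # True
```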
Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting • Tree pruning • Incorrect examples • Missing data • Multi-valued and continuous attributes
Multi-Valued Attributes • Simple change: consider splits on all values A can take on • Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant • More values => dataset split into smaller example sets when picking attributes • Smaller example sets => more likely to fit well to spurious noise
Continuous Attributes • Continuous attributes can be converted into logical ones via thresholds • X → X < a • When considering splitting on X, pick the threshold a that minimizes the # of errors / entropy
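A sketch of threshold selection by error minimization (entropy could be substituted); the helper name and the toy data are illustrative.

```python
def best_threshold(values, labels):
    """Pick the split point a for the predicate X < a that minimizes training errors."""
    pairs = sorted(zip(values, labels))
    best_a, best_errors = None, len(pairs) + 1
    for i in range(1, len(pairs)):
        a = (pairs[i - 1][0] + pairs[i][0]) / 2           # candidate: midpoint between neighbors
        left  = [l for v, l in pairs if v < a]
        right = [l for v, l in pairs if v >= a]
        # majority rule on each side of the threshold
        errors = (len(left)  - max(left.count(True),  left.count(False)) +
                  len(right) - max(right.count(True), right.count(False)))
        if errors < best_errors:
            best_a, best_errors = a, errors
    return best_a, best_errors

print(best_threshold([1.0, 2.0, 3.0, 4.0, 5.0], [False, False, True, True, True]))  # (2.5, 0)
```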
Decision Boundaries • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples • [Figure: the (x1, x2) plane split by successive threshold tests x1 ≥ 20, x2 ≥ 10, and x2 ≥ 15 — each additional test refines the axis-parallel regions labeled True/False]
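A sketch of how a small tree of threshold tests carves the plane into axis-parallel regions, using the thresholds from the figure; the leaf labels are illustrative.

```python
def classify(x1, x2):
    """A depth-2 threshold tree over two continuous attributes."""
    if x1 >= 20:
        return x2 >= 15      # right half-plane: decided by one x2 threshold
    else:
        return x2 >= 10      # left half-plane: a different x2 threshold

# Each leaf is an axis-parallel rectangle; the decision boundary is made of the
# segments x1 = 20, x2 = 15 (where x1 >= 20), and x2 = 10 (where x1 < 20).
for point in [(25, 18), (25, 5), (10, 12), (10, 5)]:
    print(point, classify(*point))
```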
Exercise • With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting threshold and maximum depth: • 1? • 2? • 3? • Describe the appearance and the complexity of these decision boundaries
Reading • Next class: • Neural networks & function learning • R&N 18.6-7