CS B551: Decision Trees
Agenda • Decision trees • Complexity • Learning curves • Combatting overfitting • Boosting
Recap
• Still in supervised setting with logical attributes
• Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
Predicate as a Decision Tree
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:
[Decision tree: test A?; if False then False; if True then test B?; if B is False then True; if B is True then test C?; if C is True then True, else False]
• Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
Predicate as a Decision Tree (continued)
• Same predicate and mushroom example, with two additional observable attributes:
• D = FUNNEL-CAP
• E = BULKY
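To make the three-attribute tree concrete, here is a minimal Python sketch (my own, not from the slides) that encodes the same tests as nested conditionals; the dictionary keys are assumed attribute names.

```python
def poisonous(x: dict) -> bool:
    """Decision tree for CONCEPT(x) <=> A(x) and (not B(x) or C(x)),
    where A = YELLOW, B = BIG, C = SPOTTED."""
    if not x["yellow"]:    # A? branch: not yellow -> not poisonous
        return False
    if not x["big"]:       # B? branch: yellow and small -> poisonous
        return True
    return x["spotted"]    # C? branch: yellow and big -> poisonous iff spotted

# Example: a yellow, big, spotted mushroom is poisonous
print(poisonous({"yellow": True, "big": True, "spotted": True}))  # True
```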
Possible Decision Tree
[Figure: a larger decision tree using the attributes D, E, B, A, and C]
Possible Decision Tree
[Figure: the larger decision tree over D, E, B, A, C, shown next to the original tree for CONCEPT ⇔ A ∧ (¬B ∨ C)]
• The larger tree computes: CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A))))))
• KIS bias (keep it simple): build the smallest decision tree
• Finding the smallest tree is a computationally intractable problem ⇒ use a greedy algorithm
Top-Down Induction of a DT
DTL(D, Predicates)
• If all examples in D are positive then return True
• If all examples in D are negative then return False
• If Predicates is empty then return the majority rule
• A ← error-minimizing predicate in Predicates
• Return the tree whose:
  - root is A,
  - left branch is DTL(D+A, Predicates − A),
  - right branch is DTL(D−A, Predicates − A)
(D+A / D−A are the examples in D that satisfy / do not satisfy A)
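As a concrete reference, here is a minimal Python sketch of this procedure (my own, not from the slides); examples are assumed to be (attribute-dict, label) pairs, and the error-minimizing choice simply counts the misclassifications left by majority rules on the two branches.

```python
from collections import Counter

def majority(examples):
    """Majority label among (attrs, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, predicates):
    """Top-down induction of a decision tree (greedy, error-minimizing split)."""
    labels = {label for _, label in examples}
    if labels == {True}:
        return True
    if labels == {False}:
        return False
    if not predicates:
        return majority(examples)

    def split_errors(a):
        # Errors if each branch of a split on `a` is labeled by its own majority
        pos = [e for e in examples if e[0][a]]
        neg = [e for e in examples if not e[0][a]]
        err = 0
        for side in (pos, neg):
            if side:
                m = majority(side)
                err += sum(1 for _, label in side if label != m)
        return err

    a = min(predicates, key=split_errors)        # error-minimizing predicate
    rest = [p for p in predicates if p != a]
    pos = [e for e in examples if e[0][a]]       # D+A
    neg = [e for e in examples if not e[0][a]]   # D-A
    # Empty branches fall back to the parent's majority (a sketch-level choice)
    left = dtl(pos, rest) if pos else majority(examples)
    right = dtl(neg, rest) if neg else majority(examples)
    return (a, left, right)   # tree: (root attribute, true-branch, false-branch)
```

A returned tree is either True, False, a majority-rule label, or a nested (attribute, true-branch, false-branch) tuple.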
Learnable Concepts • Some simple concepts cannot be represented compactly in DTs • Parity(x) = X1 xor X2 xor … xor Xn • Majority(x) = 1 if most of Xi’s are 1, 0 otherwise • Exponential size in # of attributes • Need exponential # of examples to learn exactly • The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
Performance Issues
• Assessing performance:
  - Training set and test set
  - Learning curve: % correct on the test set as a function of the size of the training set
[Figure: typical learning curve, rising toward 100% as the training set grows]
• Some concepts are unrealizable within a machine's capacity (the curve levels off below 100%)
• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning: terminate the recursion when # errors / information gain is small
  - The resulting decision tree + majority rule may not classify correctly all examples in the training set
• Other issues: incorrect examples, missing data, multi-valued and continuous attributes
Using Information Theory • Rather than minimizing the probability of error, minimize the expected number of questions needed to decide if an object x satisfies CONCEPT • Use the information-theoretic quantity known as information gain • Split on variable with highest information gain
Entropy / Information Gain
• Entropy encodes the amount of uncertainty in a random variable:
  H(X) = −Σ_{x∈Val(X)} P(x) log P(x)
• Properties:
  - H(X) = 0 if X is known, i.e. P(x) = 1 for some value x
  - H(X) > 0 if X is not known with certainty
  - H(X) is maximal when P(X) is the uniform distribution
• Information gain measures the reduction in uncertainty about X given knowledge of Y:
  I(X,Y) = H(X) − E_Y[H(X|Y)] = Σ_y P(y) Σ_x [P(x|y) log P(x|y) − P(x) log P(x)]
• Properties:
  - Always nonnegative
  - Equal to 0 if X and Y are independent
  - If Y is a choice, maximizing information gain ⇔ minimizing E_Y[H(X|Y)]
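A small Python sketch (my own, for illustration) of these two quantities estimated from empirical counts; it assumes the data are given as paired lists of x and y samples.

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_x P(x) log2 P(x), estimated from a list of samples."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(xs, ys):
    """I(X,Y) = H(X) - E_Y[H(X|Y)], estimated from paired samples."""
    n = len(xs)
    cond = 0.0
    for y, cy in Counter(ys).items():
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        cond += (cy / n) * entropy(xs_given_y)   # weighted conditional entropy
    return entropy(xs) - cond

# Example: X is perfectly determined by Y, so the gain equals H(X)
xs = [1, 1, 0, 0]
ys = ['a', 'a', 'b', 'b']
print(information_gain(xs, ys))  # 1.0 bit
```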
Maximizing IG / Minimizing Conditional Entropy in Decision Trees
E_Y[H(X|Y)] = −Σ_y P(y) Σ_x P(x|y) log P(x|y)
• Let n be the # of examples
• Let n+, n− be the # of examples on the true/false branches of Y
• Let p+, p− be the accuracy of the majority rule on the true/false branches of Y
• P(correct | Y) = p+, P(correct | ¬Y) = p−, so P(correct) = (p+n+ + p−n−)/n
• E_Y[H(X|Y)] ∝ −(n+[p+ log p+ + (1−p+) log(1−p+)] + n−[p− log p− + (1−p−) log(1−p−)])
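For concreteness, a tiny sketch (my own helper, not from the slides) evaluating this split score from the branch sizes and branch accuracies; a lower value means a better test Y.

```python
import math

def branch_entropy(p):
    """Binary entropy of a branch whose majority rule is correct with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_conditional_entropy(n_pos, p_pos, n_neg, p_neg):
    """E_Y[H(X|Y)]: weighted entropy of the two branches of a candidate test Y."""
    n = n_pos + n_neg
    return (n_pos / n) * branch_entropy(p_pos) + (n_neg / n) * branch_entropy(p_neg)

# A test that isolates a pure branch scores lower (better) than a useless one
print(expected_conditional_entropy(50, 1.0, 50, 0.6))   # ~0.49
print(expected_conditional_entropy(50, 0.5, 50, 0.5))   # 1.0
```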
Continuous Attributes
• Continuous attributes can be converted into logical ones via thresholds: replace X with the boolean attribute X < a
• When considering splitting on X, pick the threshold a that minimizes the # of errors / entropy (candidate thresholds lie between consecutive values of X in the sorted training data)
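A rough sketch (my own) of threshold selection by minimizing weighted entropy over the candidate split points between consecutive sorted values.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick threshold a for the boolean test 'value < a' minimizing weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        a = (v1 + v2) / 2                      # midpoint between consecutive values
        left = [l for v, l in pairs if v < a]
        right = [l for v, l in pairs if v >= a]
        n = len(pairs)
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        best = min(best, (score, a))
    return best[1]

print(best_threshold([3, 4, 5, 6, 7], [False, False, True, True, True]))  # 4.5
```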
Multi-Valued Attributes • Simple change: consider splits on all values A can take on • Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant • More values => dataset split into smaller example sets when picking attributes • Smaller example sets => more likely to fit well to spurious noise
Statistical Methods for Addressing Overfitting / Noise • There may be few training examples that match the path leading to a deep node in the decision tree • More susceptible to choosing irrelevant/incorrect attributes when sample is small • Idea: • Make a statistical estimate of predictive power (which increases with larger samples) • Prune branches with low predictive power • Chi-squared pruning
Top-Down DT Pruning
• Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
• At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk
• Chi-squared statistical significance test:
  - Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant)
  - Alternative hypothesis: examples are not randomly chosen (X is relevant)
• Prune X if the test on X is not statistically significant
Chi-Squared Test
• Let Z = Σ_i [(p_i − p_i′)² / p_i′ + (n_i − n_i′)² / n_i′]
• where p_i′ = p(p_i+n_i)/(p+n) and n_i′ = n(p_i+n_i)/(p+n) are the expected numbers of correctly/incorrectly classified examples at leaf node i if the null hypothesis holds
• Z is a statistic approximately drawn from the chi-squared distribution with k − 1 degrees of freedom
• Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05)
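A hedged Python sketch of this test (the function and its interface are my own illustration, not code from the course); it uses scipy.stats.chi2 for the tail probability.

```python
from scipy.stats import chi2

def should_prune(leaf_counts, alpha=0.05):
    """Chi-squared irrelevance test for a split.

    leaf_counts: list of (p_i, n_i) = (# correct, # incorrect) at each leaf.
    Returns True if the split is not statistically significant (prune it).
    """
    p = sum(pi for pi, _ in leaf_counts)
    n = sum(ni for _, ni in leaf_counts)
    z = 0.0
    for pi, ni in leaf_counts:
        expected_p = p * (pi + ni) / (p + n)   # expected # correct under the null
        expected_n = n * (pi + ni) / (p + n)   # expected # incorrect under the null
        z += (pi - expected_p) ** 2 / expected_p + (ni - expected_n) ** 2 / expected_n
    p_value = chi2.sf(z, len(leaf_counts) - 1)  # upper-tail probability, k-1 dof
    return p_value > alpha

# A split whose leaves mirror the parent's overall ratio gets pruned
print(should_prune([(30, 10), (60, 20)]))   # True (no significant difference)
```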
Idea • It may be difficult to search for a single hypothesis that explains the data • Construct multiple hypotheses (ensemble), and combine their predictions • “Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988
Motivation
• 5 classifiers, each with 60% accuracy
• On a new example, run them all and pick the prediction by majority voting
• If errors are independent, the majority-vote classifier is correct about 68% of the time, better than any single classifier, and the gain grows as more classifiers are added
• (In reality errors will not be independent, but we hope they will be mostly uncorrelated)
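The figure follows from a simple binomial computation; here is a quick check (my own):

```python
from math import comb

def majority_vote_accuracy(k, p):
    """P(majority of k independent classifiers is correct), each correct with prob. p."""
    need = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

print(round(majority_vote_accuracy(5, 0.6), 3))    # ~0.683
print(round(majority_vote_accuracy(101, 0.6), 3))  # ~0.98: more voters help
```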
Boosting • Main idea: • If learner 1 fails to learn an example correctly, this example is more important for learner 2 • If learner 1 and 2 fail to learn an example correctly, this example is more important for learner 3 • … • Weighted training set • Weights encode importance
Boosting • Start with uniform weights wi=1/N • Use learner 1 to generate hypothesis h1 • Adjust weights to give higher importance to misclassified examples • Use learner 2 to generate hypothesis h2 • … • Weight hypotheses according to performance, and return weighted majority
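Here is a compact AdaBoost-style sketch of this loop in Python (my own illustration, not the exact formulas used in lecture; see R&N for those). `weak_learn` is an assumed routine that fits a hypothesis to a weighted sample, and labels/predictions are in {-1, +1}.

```python
import math

def boost(examples, labels, weak_learn, rounds):
    """AdaBoost-style boosting: reweight examples, then take a weighted majority.

    weak_learn(examples, labels, weights) -> hypothesis h with h(x) in {-1, +1}.
    """
    n = len(examples)
    w = [1.0 / n] * n                            # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = weak_learn(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero / log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # hypothesis weight
        # Increase weights of misclassified examples, decrease the rest, renormalize
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
        hypotheses.append(h)
        alphas.append(alpha)

    def classify(x):
        s = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if s >= 0 else -1
    return classify
```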
Mushroom Example • “Decision stumps”: single-attribute decision trees
Mushroom Example • Pick C first, learns CONCEPT = C
Mushroom Example • Update weights (precise formula given in R&N)
Mushroom Example • Next try A, learn CONCEPT=A
Mushroom Example • Update weights
Mushroom Example • Next try E, learn CONCEPT=E
Mushroom Example • Update Weights…
Mushroom Example
• Final classifier uses the stumps in order C, A, E, D, B
• Weights on hypotheses are determined by overall error
• Weighted majority with weights A=2.1, B=0.9, C=0.8, D=1.4, E=0.09
• 100% accuracy on the training set
Boosting Strategies
• The preceding weighting strategy is the popular AdaBoost algorithm (see R&N p. 667)
• Many other strategies exist
• Typically, as the number of hypotheses increases, accuracy increases as well
• Does this conflict with Occam's razor?
Announcements • Next class: • Neural networks & function learning • R&N 18.6-7