ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY LEARNING FROM NOISY DATA Ivan Bratko University of Ljubljana Slovenia Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides
Overview • Learning from noisy data • Idea of tree pruning • How to prune optimally • Methods for tree pruning • Estimating probabilities
Learning from Noisy Data • Sources of “noise” • Errors in measurements, errors in data encoding, errors in examples, missing values • Problems • Complex hypothesis • Poor comprehensibility • Overfitting: hypothesis overfits the data • Low classification accuracy on new data
Fitting data [Figure: points in the (x, y) plane with a smooth fitted curve] Looks good, but does not fit the data exactly! What is the relation between x and y, y = y(x)? How can we predict y from x?
Overfitting data [Figure: the same points fitted by a complex curve passing through every point] Makes no error on the training data! But how about predicting new cases? What is the relation between x and y, y = y(x)? How can we predict y from x?
Overfitting in Extreme • Let default accuracy be the probability of the majority class • Overfitting may result in accuracy lower than default • Example • Attributes have no correlation with class (i.e., 100% noise) • Two classes: c1, c2 • Class probabilities: p(c1) = 0.7, p(c2) = 0.3 • Default accuracy = 0.7
Overfitting in Extreme Decision tree with one example per leaf: a leaf labelled c1 classifies a new example correctly with probability 0.7, a leaf labelled c2 with probability 0.3. Expected accuracy = 0.7 × 0.7 + 0.3 × 0.3 = 0.58, and 0.58 < 0.7 (the default accuracy)
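A quick simulation of this arithmetic (pure Python, names hypothetical): with one training example per leaf and pure-noise attributes, the tree agrees with a new example only when two independent random labels coincide.

```python
import random

random.seed(0)
p_c1 = 0.7  # p(c1) = 0.7, p(c2) = 0.3

# Attributes carry no information, so the training label memorised by a
# leaf and the label of a new example reaching it are independent draws.
trials = 100_000
hits = 0
for _ in range(trials):
    leaf_label = 'c1' if random.random() < p_c1 else 'c2'
    new_label = 'c1' if random.random() < p_c1 else 'c2'
    hits += leaf_label == new_label

print(hits / trials)   # ~0.58 = 0.7*0.7 + 0.3*0.3
print(p_c1)            # 0.70 = default accuracy (always predict c1)
```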
Pruning of Decision Trees • Means of handling noise in tree learning • After pruning the accuracy on previously unseen examples may increase
Typical Example from Practice: Locating Primary Tumor Data set • 20 classes • Default classifier accuracy: 24.7%
Effects of Pruning [Figure: accuracy vs. tree size, from smaller to bigger trees, with one curve for accuracy on the training set and one for accuracy on the test set]
How to Prune Optimally? • Main questions • How much pruning? • Where to prune? • Large number of candidate pruned trees! • Typical relation between tree size and accuracy on new data: [Figure: accuracy vs. tree size] • Main difficulty in pruning: this curve is not known!
Two Kinds of Pruning • Pre-pruning (forward pruning) • Post-pruning
Forward Pruning • Stop expanding trees if benefits of potential sub-trees seem dubious • Information gain low • Number of examples very small • Example set statistically insignificant • Etc.
Forward Pruning Inferior • Myopic • Depends on parameters which are hard (impossible?) to guess • Example: [Figure: examples in the (x1, x2) plane with split points a on x1 and b on x2]
Pre and Post Pruning • Forward pruning is considered inferior and myopic • Post pruning makes use of the fully grown sub-trees when deciding, and in this way reduces complexity without the myopia
Post pruning • Main idea: prune unreliable parts of tree • Outline of pruning procedure: start at bottom of tree, proceed upward; that is: prune unreliable subtrees • Main question: How to know whether a subtree is unreliable? Will accuracy improve after pruning?
Estimating accuracy of subtree • One idea: Use a special test data set (“pruning set”) • This is OK if a sufficient amount of learning data is available • In case of shortage of data: try to estimate accuracy directly from the learning data
Partitioning data in tree learning • All available data is split into a training set and a test set • The training set is further split into a growing set and a pruning set • Typical proportions: training set 70%, test set 30%; growing set 70%, pruning set 30%
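A sketch of this two-level split with scikit-learn (toy data; the nested train_test_split calls are just one convenient way to do it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 5)               # toy feature matrix
y = np.random.randint(0, 2, size=200)    # toy class labels

# 70% training / 30% test, then 70% growing / 30% pruning within training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_grow, X_prune, y_grow, y_prune = train_test_split(
    X_train, y_train, test_size=0.30, random_state=0)
```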
Estimating accuracy with pruning set • Accuracy of hypothesis on new data = probability of correct classification of a new example • Accuracy of hypothesis on new data ≈ proportion of correctly classified examples in the pruning set • Error of a hypothesis = probability of misclassification of a new example • Drawback of using a pruning set: less data for the “growing set”
Reduced error pruning, Quinlan 87 • Use pruning set to estimate accuracy of sub-trees and accuracy at individual nodes • Let T be a sub-tree rooted at node v • Define: gain from pruning at v = (# misclassifications in T) − (# misclassifications at v if T is replaced by a leaf)
Reduced error pruning • Repeat: prune at node with largest gain until only negative gain nodes remain • “Bottom-up restriction”: T can only be pruned if it does not contain a sub tree with lower error than T
Reduced error pruning • Theorem (Esposito, Malerba, Semeraro 1997): REP with bottom-up restriction finds the smallest most accurate sub tree w.r.t. pruning set.
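A minimal sketch of REP, under assumptions not in the slides: a tree is either a class label (leaf) or a dict with a 'test' predicate and two children. It prunes bottom-up recursively (rather than with the largest-gain-first loop on the slide), which automatically respects the bottom-up restriction.

```python
from collections import Counter

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree['yes'] if tree['test'](x) else tree['no']
    return tree

def n_errors(tree, examples):
    return sum(classify(tree, x) != c for x, c in examples)

def rep(tree, pruning_set):
    """Reduced error pruning: replace a sub-tree by a leaf whenever the
    leaf makes no more errors on the pruning set than the sub-tree.
    Sub-trees that receive no pruning examples are left unchanged."""
    if not isinstance(tree, dict) or not pruning_set:
        return tree
    yes = [(x, c) for x, c in pruning_set if tree['test'](x)]
    no = [(x, c) for x, c in pruning_set if not tree['test'](x)]
    tree = {'test': tree['test'],
            'yes': rep(tree['yes'], yes),
            'no': rep(tree['no'], no)}
    leaf = Counter(c for _, c in pruning_set).most_common(1)[0][0]
    # gain = errors(T) - errors(leaf); prune if gain is non-negative
    if n_errors(leaf, pruning_set) <= n_errors(tree, pruning_set):
        return leaf
    return tree

# Toy usage: a single split whose leaves disagree with the pruning set.
tree = {'test': lambda x: x[0] > 0.5, 'yes': 'c1', 'no': 'c2'}
print(rep(tree, [((0.9,), 'c1'), ((0.1,), 'c1'), ((0.2,), 'c1')]))  # 'c1'
```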
Minimal Error Pruning (MEP)Niblett and Bratko 86; Cestnik and Bratko 91 • Does not require a pruning set for estimating error • Estimates error on new data directly from “growing set”, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate) Main principle: Prune so that estimated classification error is minimal
Minimal Error Pruning • Deciding about pruning at node v, the root of a tree T with sub-trees T1, T2, ... reached with branch probabilities p1, p2, ... • E(T) = error of optimally pruned tree T
Static and backed-up errors Define: static error at v: e(v) = 1 − p( class C | v ), where C is the most likely class at v. If T is pruned at v then its error is e(v). If T is not pruned at v then its (backed-up) error is: p1 E(T1) + p2 E(T2) + ...
Minimal error pruning Decision whether to prune or not: prune if static error ≤ backed-up error, i.e. E(T) = min( e(v), Σi pi E(Ti) )
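A sketch of this recursion (the Node structure is hypothetical; static errors use the Laplace estimate described on the following slides):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    class_counts: dict                  # {class: # growing-set examples}
    children: list = field(default_factory=list)   # [(p_i, Node), ...]

def static_error(counts, k):
    """Laplace estimate of e(v) = 1 - p(majority class | v)."""
    return 1 - (max(counts.values()) + 1) / (sum(counts.values()) + k)

def mep(node, k):
    """Return E(T) = min(static, backed-up error), pruning in place."""
    e_static = static_error(node.class_counts, k)
    if not node.children:
        return e_static
    backed_up = sum(p * mep(child, k) for p, child in node.children)
    if e_static <= backed_up:
        node.children = []      # prune: v becomes a leaf
        return e_static
    return backed_up

# Toy usage: a leaf with 8 c1 and 2 c2 examples, k = 2 classes.
leaf = Node({'c1': 8, 'c2': 2})
print(mep(leaf, k=2))           # 1 - 9/12 = 0.25
```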
Minimal error pruning Main question: How to estimate static errors e(v)? Use the Laplace or m-estimate of probability. At a node v: N examples, nC of them in the majority class
Laplace probability estimate pC = ( nC + 1 ) / ( N + k ), where k is the number of classes. Problems with Laplace: • Assumes all classes a priori equally likely • Degree of pruning depends on number of classes
m-estimate of probability pC = ( nC + pCa · m ) / ( N + m ) where: pCa = a priori probability of class C, and m is a non-negative parameter tuned by an expert
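Both estimates in code, using two heads in two coin flips as the running example:

```python
def laplace(n_c, n, k):
    """Laplace estimate: pC = (nC + 1) / (N + k)."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p_prior, m):
    """m-estimate: pC = (nC + pCa * m) / (N + m)."""
    return (n_c + p_prior * m) / (n + m)

print(2 / 2)                       # relative frequency: 1.0
print(laplace(2, 2, k=2))          # 0.75
print(m_estimate(2, 2, 0.5, m=2))  # 0.75: Laplace is the m-estimate
                                   # with m = k and uniform priors
```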
m-estimate Important points: • Takes into account prior probabilities • Pruning not sensitive to number of classes • Varying m: series of differently pruned trees • Choice of m depends on confidence in data
m-estimate in pruning Choice of m: low noise → low m → little pruning; high noise → high m → much pruning. Note: Using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.
Some other pruning methods • Error-complexity pruning, Breiman et al. 84 (CART) • Pessimistic error pruning, Quinlan 87 • Error-based pruning, Quinlan 93 (C4.5)
Error-complexity pruning Breiman et al. 1984, Program CART Considers: • Error rate on the "growing" set • Size of tree • Error rate on the "pruning" set • Minimise error and complexity: i.e. find a compromise between error and size
A sub-tree T with root v: • R(v) = # errors on the "growing" set at node v (v treated as a leaf) • R(T) = # errors on the "growing" set of tree T • NT = # leaves in T
Error complexity cost • Total cost = Error cost + Complexity cost • Total cost = R + α · N • α = complexity cost per leaf
Pruning at v • Cost of T (T unpruned) = R(T) + α · NT • Cost of v (T pruned at v) = R(v) + α • When the costs of T and v are equal: α = ( R(v) − R(T) ) / ( NT − 1 ) = reduction of error per leaf
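In code, the break-even α for one node (the numbers are made up):

```python
def alpha(R_v, R_T, n_leaves):
    """Solve R(v) + a = R(T) + a * N_T for a: the error reduction a
    sub-tree must buy per extra leaf to be worth its complexity."""
    return (R_v - R_T) / (n_leaves - 1)

print(alpha(R_v=10, R_T=4, n_leaves=5))   # 1.5 errors saved per extra leaf
```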
Pruning algorithm • Compute α for each node in the unpruned tree • Repeat: prune the sub-tree with the smallest α, until only the root is left • This gives a series of increasingly pruned trees; estimate their accuracy
Selecting best pruned tree • Finally select the "best" tree from this series • Select the smallest tree within 1 standard error of the minimum error (1-SE rule) • Standard error = sqrt( Rmin · (1 − Rmin) / N ), where Rmin is the minimum error and N the number of examples used to estimate it
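scikit-learn ships this scheme as cost-complexity pruning; a sketch of building the series of pruned trees and applying the 1-SE rule on a held-out pruning set (the dataset choice is purely illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_grow, X_prune, y_grow, y_prune = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One tree per critical alpha: a series of increasingly pruned trees.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_grow, y_grow).ccp_alphas
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_grow, y_grow)
         for a in alphas]

errs = [1 - t.score(X_prune, y_prune) for t in trees]
best = int(np.argmin(errs))                            # the "0-SE rule"
se = np.sqrt(errs[best] * (1 - errs[best]) / len(y_prune))
# alphas ascend, so the largest qualifying index is the smallest tree.
one_se = max(i for i, e in enumerate(errs) if e <= errs[best] + se)
print(trees[best].get_n_leaves(), trees[one_se].get_n_leaves())
```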
Comments • Note: Cost complexity pruning limits selection to a subset of all possible pruned trees. • Consequence: Best pruned tree may be missed • Two ways of estimating error on new data: (a) using pruning set (b) using cross-validation in a rather complicated way
Comments • 1-SE rule tends to overprune; • Simply choosing min. error tree ("0-SE rule") performs better in experiments • Error estimate with cross validation is complicated and based on a debatable assumption
Selecting best tree • Using pruning set: Measure error of candidate pruned trees on pruning set; • Select the smallest tree within 1 standard error of minimum error.
Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.) • Experiments with 14 data sets from the UCI repository • Results: Does pruning improve accuracy? • Generally yes • But the effects of pruning also depend on the domain: • In most domains pruning improves accuracy, in some it makes no difference, and in very few it worsens accuracy
Pruning in rule learning • Ideas from pruning decision trees can be adapted to the learning of if-then rules • Pre-pruning and post-pruning can be combined and reduced error pruning idea applies • Furnkranz (1997) reviews several approaches and evaluates them experimentally
Estimating Probabilities • Setup • n experiments (n = r + s) • r successes • s failures • How likely is it that the next experiment will be a success? • Estimate with relative frequency: p = r / n
Relative Frequency • Works when we have many experiments, but not with small samples • Consider • flipping a coin • we flip a coin twice, and both times it comes up heads • what is the probability of heads on the next flip? • A probability of 1.0 (= 2/2) seems unreasonable
Coins and mushrooms • Probability of head = ? • Probability of mushroom edible = ? • Make one, two ... experiments • Interpret results in terms of probability • Relative frequency does not work well
Coins and mushrooms • We need to consider prior expectations • Prior prob. = 1/2, in both cases not unreasonable • But, is this enough? • Intuition says: our probability estimates for coins and mushrooms still different • Difference lies in prior probability distribution • What are sensible prior distributions for coins and for mushrooms?
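The m-estimate from earlier gives one way to encode this difference: a strong prior (large m) for the coin, a weak one (small m) for the mushrooms. The m values below are only illustrative.

```python
def m_estimate(n_c, n, p_prior, m):
    return (n_c + p_prior * m) / (n + m)

# Two "successes" in two trials; prior probability 1/2 in both cases.
print(m_estimate(2, 2, 0.5, m=100))  # coin, strong prior:   ~0.51
print(m_estimate(2, 2, 0.5, m=2))    # mushroom, weak prior:  0.75
```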
Bayesian Procedure for Estimating Probabilities • Assume initial probability distribution (prior distribution) • Based on some evidence E, update this distribution to obtain posterior distribution • Compute the expected value over posterior distribution. Variance of posterior distribution is related to certainty of this estimate
Bayes Formula P( H | E ) = P( E | H ) · P( H ) / P( E ) • The Bayesian process takes a prior probability and combines it with new evidence to obtain an updated (posterior) probability
Bayes in estimating probabilities • Form of hypothesis H is: P(event) = x • So: P( H | E) = P( P(event)=x | E) • That is: probability that probability is x • May appear confusing!
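A concrete instance, assuming a Beta prior over x (the standard conjugate choice, not spelled out on the slides): prior Beta(a, b), evidence E of r successes and s failures, posterior Beta(a + r, b + s).

```python
from fractions import Fraction

def posterior_mean(r, s, a=1, b=1):
    """E[x] under the posterior Beta(a + r, b + s);
    a = b = 1 is the uniform prior over x = P(success)."""
    return Fraction(a + r, a + b + r + s)

# Two successes, no failures, uniform prior: the posterior mean is
# (r + 1) / (n + 2) -- exactly the Laplace estimate from before.
print(posterior_mean(2, 0))   # 3/4
```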