ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY LEARNING FROM NOISY DATA Ivan Bratko University of Ljubljana Slovenia Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides
Overview • Learning from noisy data • Idea of tree pruning • How to prune optimally • Methods for tree pruning • Estimating probabilities
Learning from Noisy Data • Sources of “noise” • Errors in measurements, errors in data encoding, errors in examples, missing values • Problems • Complex hypothesis • Poor comprehensibility • Overfitting: hypothesis overfits the data • Low classification accuracy on new data
Fitting data [Figure: points in the (x, y) plane with a smooth fitted curve] Looks good, but does not fit the data exactly! What is the relation between x and y, y = y(x)? How can we predict y from x?
Overfitting data [Figure: the same points fitted by a complex curve passing through every point] Makes no error on the training data! But how about predicting new cases? What is the relation between x and y, y = y(x)? How can we predict y from x?
Overfitting in Extreme • Let default accuracy be the probability of the majority class • Overfitting may result in accuracy lower than default • Example • Attributes have no correlation with class (i.e., 100% noise) • Two classes: c1, c2 • Class probabilities: p(c1) = 0.7, p(c2) = 0.3 • Default accuracy = 0.7
Overfitting in Extreme Decision tree with one example per leaf: a leaf labelled c1 classifies a new example correctly with probability 0.7, a leaf labelled c2 with probability 0.3. Expected accuracy = 0.7 × 0.7 + 0.3 × 0.3 = 0.58, and 0.58 < 0.7 (the default accuracy)
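A quick simulation of this arithmetic (pure Python, names hypothetical): with one training example per leaf and pure-noise attributes, the tree agrees with a new example only when two independent random labels coincide.

```python
import random

random.seed(0)
p_c1 = 0.7  # p(c1) = 0.7, p(c2) = 0.3

# Attributes carry no information, so the training label memorised by a
# leaf and the label of a new example reaching it are independent draws.
trials = 100_000
hits = 0
for _ in range(trials):
    leaf_label = 'c1' if random.random() < p_c1 else 'c2'
    new_label = 'c1' if random.random() < p_c1 else 'c2'
    hits += leaf_label == new_label

print(hits / trials)   # ~0.58 = 0.7*0.7 + 0.3*0.3
print(p_c1)            # 0.70 = default accuracy (always predict c1)
```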
Pruning of Decision Trees • Means of handling noise in tree learning • After pruning the accuracy on previously unseen examples may increase
Typical Example from Practice: Locating Primary Tumor Data set • 20 classes • Default classifier accuracy: 24.7%
Effects of Pruning [Figure: accuracy vs. tree size, from smaller to bigger trees, with one curve for accuracy on the training set and one for accuracy on the test set]
How to Prune Optimally? • Main questions • How much pruning? • Where to prune? • Large number of candidate pruned trees! • Typical relation between tree size and accuracy on new data: [Figure: accuracy vs. tree size] • Main difficulty in pruning: this curve is not known!
Two Kinds of Pruning • Pre-pruning (forward pruning) • Post-pruning
Forward Pruning • Stop expanding trees if benefits of potential sub-trees seem dubious • Information gain low • Number of examples very small • Example set statistically insignificant • Etc.
Forward Pruning Inferior • Myopic • Depends on parameters which are hard (impossible?) to guess • Example: [Figure: examples in the (x1, x2) plane with split points a on x1 and b on x2]
Pre and Post Pruning • Forward pruning is considered inferior and myopic • Post pruning makes use of the fully grown sub-trees when deciding, and in this way reduces complexity without the myopia
Post pruning • Main idea: prune unreliable parts of tree • Outline of pruning procedure: start at bottom of tree, proceed upward; that is: prune unreliable subtrees • Main question: How to know whether a subtree is unreliable? Will accuracy improve after pruning?
Estimating accuracy of subtree • One idea: Use a special test data set (“pruning set”) • This is OK if a sufficient amount of learning data is available • In case of shortage of data: try to estimate accuracy directly from the learning data
Partitioning data in tree learning • All available data is split into a training set and a test set • The training set is further split into a growing set and a pruning set • Typical proportions: training set 70%, test set 30%; growing set 70%, pruning set 30%
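A sketch of this two-level split with scikit-learn (toy data; the nested train_test_split calls are just one convenient way to do it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 5)               # toy feature matrix
y = np.random.randint(0, 2, size=200)    # toy class labels

# 70% training / 30% test, then 70% growing / 30% pruning within training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_grow, X_prune, y_grow, y_prune = train_test_split(
    X_train, y_train, test_size=0.30, random_state=0)
```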
Estimating accuracy with pruning set • Accuracy of hypothesis on new data = probability of correct classification of a new example • Accuracy of hypothesis on new data ≈ proportion of correctly classified examples in the pruning set • Error of a hypothesis = probability of misclassification of a new example • Drawback of using a pruning set: less data for the “growing set”
Reduced error pruning, Quinlan 87 • Use pruning set to estimate accuracy of sub-trees and accuracy at individual nodes • Let T be a sub-tree rooted at node v • Define: gain from pruning at v = (# misclassifications in T) − (# misclassifications at v if T is replaced by a leaf)
Reduced error pruning • Repeat: prune at node with largest gain until only negative gain nodes remain • “Bottom-up restriction”: T can only be pruned if it does not contain a sub tree with lower error than T
Reduced error pruning • Theorem (Esposito, Malerba, Semeraro 1997): REP with bottom-up restriction finds the smallest most accurate sub tree w.r.t. pruning set.
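A minimal sketch of REP, under assumptions not in the slides: a tree is either a class label (leaf) or a dict with a 'test' predicate and two children. It prunes bottom-up recursively (rather than with the largest-gain-first loop on the slide), which automatically respects the bottom-up restriction.

```python
from collections import Counter

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree['yes'] if tree['test'](x) else tree['no']
    return tree

def n_errors(tree, examples):
    return sum(classify(tree, x) != c for x, c in examples)

def rep(tree, pruning_set):
    """Reduced error pruning: replace a sub-tree by a leaf whenever the
    leaf makes no more errors on the pruning set than the sub-tree.
    Sub-trees that receive no pruning examples are left unchanged."""
    if not isinstance(tree, dict) or not pruning_set:
        return tree
    yes = [(x, c) for x, c in pruning_set if tree['test'](x)]
    no = [(x, c) for x, c in pruning_set if not tree['test'](x)]
    tree = {'test': tree['test'],
            'yes': rep(tree['yes'], yes),
            'no': rep(tree['no'], no)}
    leaf = Counter(c for _, c in pruning_set).most_common(1)[0][0]
    # gain = errors(T) - errors(leaf); prune if gain is non-negative
    if n_errors(leaf, pruning_set) <= n_errors(tree, pruning_set):
        return leaf
    return tree

# Toy usage: a single split whose leaves disagree with the pruning set.
tree = {'test': lambda x: x[0] > 0.5, 'yes': 'c1', 'no': 'c2'}
print(rep(tree, [((0.9,), 'c1'), ((0.1,), 'c1'), ((0.2,), 'c1')]))  # 'c1'
```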
Minimal Error Pruning (MEP)Niblett and Bratko 86; Cestnik and Bratko 91 • Does not require a pruning set for estimating error • Estimates error on new data directly from “growing set”, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate) Main principle: Prune so that estimated classification error is minimal
Minimal Error Pruning • Deciding about pruning at node v, the root of a tree T with sub-trees T1, T2, ... reached with branch probabilities p1, p2, ... • E(T) = error of optimally pruned tree T
Static and backed-up errors Define: static error at v: e(v) = 1 − p( class C | v ), where C is the most likely class at v. If T is pruned at v then its error is e(v). If T is not pruned at v then its (backed-up) error is: p1 E(T1) + p2 E(T2) + ...
Minimal error pruning Decision whether to prune or not: prune if static error ≤ backed-up error, i.e. E(T) = min( e(v), Σi pi E(Ti) )
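A sketch of this recursion (the Node structure is hypothetical; static errors use the Laplace estimate described on the following slides):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    class_counts: dict                  # {class: # growing-set examples}
    children: list = field(default_factory=list)   # [(p_i, Node), ...]

def static_error(counts, k):
    """Laplace estimate of e(v) = 1 - p(majority class | v)."""
    return 1 - (max(counts.values()) + 1) / (sum(counts.values()) + k)

def mep(node, k):
    """Return E(T) = min(static, backed-up error), pruning in place."""
    e_static = static_error(node.class_counts, k)
    if not node.children:
        return e_static
    backed_up = sum(p * mep(child, k) for p, child in node.children)
    if e_static <= backed_up:
        node.children = []      # prune: v becomes a leaf
        return e_static
    return backed_up

# Toy usage: a leaf with 8 c1 and 2 c2 examples, k = 2 classes.
leaf = Node({'c1': 8, 'c2': 2})
print(mep(leaf, k=2))           # 1 - 9/12 = 0.25
```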
Minimal error pruning Main question: How to estimate static errors e(v)? Use the Laplace or m-estimate of probability. At a node v: N examples, nC of them in the majority class
Laplace probability estimate pC = ( nC + 1 ) / ( N + k ), where k is the number of classes. Problems with Laplace: • Assumes all classes a priori equally likely • Degree of pruning depends on number of classes
m-estimate of probability pC = ( nC + pCa · m ) / ( N + m ) where: pCa = a priori probability of class C, and m is a non-negative parameter tuned by an expert
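Both estimates in code, using two heads in two coin flips as the running example:

```python
def laplace(n_c, n, k):
    """Laplace estimate: pC = (nC + 1) / (N + k)."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p_prior, m):
    """m-estimate: pC = (nC + pCa * m) / (N + m)."""
    return (n_c + p_prior * m) / (n + m)

print(2 / 2)                       # relative frequency: 1.0
print(laplace(2, 2, k=2))          # 0.75
print(m_estimate(2, 2, 0.5, m=2))  # 0.75: Laplace is the m-estimate
                                   # with m = k and uniform priors
```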
m-estimate Important points: • Takes into account prior probabilities • Pruning not sensitive to number of classes • Varying m: series of differently pruned trees • Choice of m depends on confidence in data
m-estimate in pruning Choice of m: low noise → low m → little pruning; high noise → high m → much pruning. Note: Using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.
Some other pruning methods • Error-complexity pruning, Breiman et al. 84 (CART) • Pessimistic error pruning, Quinlan 87 • Error-based pruning, Quinlan 93 (C4.5)
Error-complexity pruning Breiman et al. 1984, Program CART Considers: • Error rate on the "growing" set • Size of tree • Error rate on the "pruning" set • Minimise error and complexity: i.e. find a compromise between error and size
A sub-tree T with root v: • R(v) = # errors on the "growing" set at node v (v treated as a leaf) • R(T) = # errors on the "growing" set of tree T • NT = # leaves in T
Error complexity cost • Total cost = Error cost + Complexity cost • Total cost = R + α · N • α = complexity cost per leaf
Pruning at v • Cost of T (T unpruned) = R(T) + α · NT • Cost of v (T pruned at v) = R(v) + α • When the costs of T and v are equal: α = ( R(v) − R(T) ) / ( NT − 1 ) = reduction of error per leaf
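In code, the break-even α for one node (the numbers are made up):

```python
def alpha(R_v, R_T, n_leaves):
    """Solve R(v) + a = R(T) + a * N_T for a: the error reduction a
    sub-tree must buy per extra leaf to be worth its complexity."""
    return (R_v - R_T) / (n_leaves - 1)

print(alpha(R_v=10, R_T=4, n_leaves=5))   # 1.5 errors saved per extra leaf
```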
Pruning algorithm • Compute α for each node in the unpruned tree • Repeat: prune the sub-tree with the smallest α, until only the root is left • This gives a series of increasingly pruned trees; estimate their accuracy
Selecting best pruned tree • Finally select the "best" tree from this series • Select the smallest tree within 1 standard error of the minimum error (1-SE rule) • Standard error = sqrt( Rmin · (1 − Rmin) / N ), where Rmin is the minimum error and N the number of examples used to estimate it
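scikit-learn ships this scheme as cost-complexity pruning; a sketch of building the series of pruned trees and applying the 1-SE rule on a held-out pruning set (the dataset choice is purely illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_grow, X_prune, y_grow, y_prune = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One tree per critical alpha: a series of increasingly pruned trees.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_grow, y_grow).ccp_alphas
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_grow, y_grow)
         for a in alphas]

errs = [1 - t.score(X_prune, y_prune) for t in trees]
best = int(np.argmin(errs))                            # the "0-SE rule"
se = np.sqrt(errs[best] * (1 - errs[best]) / len(y_prune))
# alphas ascend, so the largest qualifying index is the smallest tree.
one_se = max(i for i, e in enumerate(errs) if e <= errs[best] + se)
print(trees[best].get_n_leaves(), trees[one_se].get_n_leaves())
```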
Comments • Note: Cost complexity pruning limits selection to a subset of all possible pruned trees. • Consequence: Best pruned tree may be missed • Two ways of estimating error on new data: (a) using pruning set (b) using cross-validation in a rather complicated way
Comments • 1-SE rule tends to overprune; • Simply choosing min. error tree ("0-SE rule") performs better in experiments • Error estimate with cross validation is complicated and based on a debatable assumption
Selecting best tree • Using pruning set: Measure error of candidate pruned trees on pruning set; • Select the smallest tree within 1 standard error of minimum error.
Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.) • Experiments with 14 data sets from the UCI repository • Results: Does pruning improve accuracy? • Generally yes • But the effects of pruning also depend on the domain: • In most domains pruning improves accuracy, in some it makes no difference, and in very few it worsens accuracy
Pruning in rule learning • Ideas from pruning decision trees can be adapted to the learning of if-then rules • Pre-pruning and post-pruning can be combined and reduced error pruning idea applies • Furnkranz (1997) reviews several approaches and evaluates them experimentally
Estimating Probabilities • Setup • n experiments (n = r + s) • r successes • s failures • How likely is it that the next experiment will be a success? • Estimate with relative frequency: p = r / n
Relative Frequency • Works when we have many experiments, but not with small samples • Consider • flipping a coin • we flip a coin twice, and both times it comes up heads • what is the probability of heads on the next flip? • A probability of 1.0 (= 2/2) seems unreasonable
Coins and mushrooms • Probability of head = ? • Probability of mushroom edible = ? • Make one, two ... experiments • Interpret results in terms of probability • Relative frequency does not work well
Coins and mushrooms • We need to consider prior expectations • Prior prob. = 1/2, in both cases not unreasonable • But, is this enough? • Intuition says: our probability estimates for coins and mushrooms still different • Difference lies in prior probability distribution • What are sensible prior distributions for coins and for mushrooms?
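The m-estimate from earlier gives one way to encode this difference: a strong prior (large m) for the coin, a weak one (small m) for the mushrooms. The m values below are only illustrative.

```python
def m_estimate(n_c, n, p_prior, m):
    return (n_c + p_prior * m) / (n + m)

# Two "successes" in two trials; prior probability 1/2 in both cases.
print(m_estimate(2, 2, 0.5, m=100))  # coin, strong prior:   ~0.51
print(m_estimate(2, 2, 0.5, m=2))    # mushroom, weak prior:  0.75
```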
Bayesian Procedure for Estimating Probabilities • Assume initial probability distribution (prior distribution) • Based on some evidence E, update this distribution to obtain posterior distribution • Compute the expected value over posterior distribution. Variance of posterior distribution is related to certainty of this estimate
Bayes Formula P( H | E ) = P( E | H ) · P( H ) / P( E ) • The Bayesian process takes a prior probability and combines it with new evidence to obtain an updated (posterior) probability
Bayes in estimating probabilities • Form of hypothesis H is: P(event) = x • So: P( H | E) = P( P(event)=x | E) • That is: probability that probability is x • May appear confusing!
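A concrete instance, assuming a Beta prior over x (the standard conjugate choice, not spelled out on the slides): prior Beta(a, b), evidence E of r successes and s failures, posterior Beta(a + r, b + s).

```python
from fractions import Fraction

def posterior_mean(r, s, a=1, b=1):
    """E[x] under the posterior Beta(a + r, b + s);
    a = b = 1 is the uniform prior over x = P(success)."""
    return Fraction(a + r, a + b + r + s)

# Two successes, no failures, uniform prior: the posterior mean is
# (r + 1) / (n + 2) -- exactly the Laplace estimate from before.
print(posterior_mean(2, 0))   # 3/4
```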