Announcements
• Get to work on the MP!
• No official class Wednesday 11/29
  • To make up for Jordan's talk
  • We will be here for MP questions
The Bayes Optimal Classifier: getting away from generative models; our first ensemble method!
• H is a parameterized hypothesis space
• h_ML is the maximum likelihood hypothesis given some data
• Can we do better (higher expected accuracy) than h_ML?
  • Yes! We expect h_MAP to outperform h_ML IF…
    • there is an interesting prior P(h) (i.e., not uniform)
• Can we do better (higher expected accuracy) than h_MAP?
  • Yes! Bayes optimal will outperform h_MAP IF… (some assumptions)
Bayes Optimal Classifier
Getting another doctor's second opinion, another's third opinion…
• One doctor is most confident. He is h_ML
• One doctor is most reliable / accurate. She is h_MAP
• But she may only be a little more trustworthy than the others
• What if h_MAP says "+" but *all* other h ∈ H say "-"?
  • If P(h_MAP | D) < 0.5, perhaps we should prefer "-"
• Think of each h_i as casting a weighted vote
  • Weight each h_i by how likely it is to be correct given the training data
  • Not just by P(h), which is already reflected in h_MAP
  • Rather by P(h | D)
• The most reliable joint opinion may contradict h_MAP
Bayes Optimal Classifier: Example
• Assume a space of 3 hypotheses with posteriors P(h1 | D) = 0.4 and, say, P(h2 | D) = P(h3 | D) = 0.3
• Given a new instance x, assume that h1(x) = 1, h2(x) = 0, h3(x) = 0
• In this case, P(f(x) = 1) = 0.4 and P(f(x) = 0) = 0.6, but h_MAP(x) = 1
• We want to determine the most probable classification by combining the predictions of all hypotheses
• We weight each hypothesis by its posterior probability (there are additional lurking assumptions…)
Bayes Optimal Classifier: Example (2)
• Let V be the set of possible classifications
• Bayes optimal classification:
  v_OB = argmax_{v ∈ V} Σ_{h ∈ H} P(v | h) P(h | D)
• In the example:
  Σ_h P(0 | h) P(h | D) = 0·0.4 + 1·0.3 + 1·0.3 = 0.6
  Σ_h P(1 | h) P(h | D) = 1·0.4 + 0·0.3 + 0·0.3 = 0.4
• and the optimal prediction is indeed 0
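To make the formula concrete, here is a minimal Python sketch (not from the original slides) of Bayes optimal classification over a small finite hypothesis space; the 0.4 / 0.3 / 0.3 posterior split is an assumption consistent with the probabilities quoted above.

```python
# A minimal sketch of Bayes optimal classification over a small, finite
# hypothesis space with deterministic hypotheses.

def bayes_optimal(posteriors, predictions, labels=(0, 1)):
    """Return argmax_v sum_h P(v | h) P(h | D), plus the per-label scores.

    posteriors  -- list of P(h_i | D), one per hypothesis
    predictions -- list of h_i(x), the label each hypothesis assigns to x
    """
    score = {v: 0.0 for v in labels}
    for p_h, h_x in zip(posteriors, predictions):
        # Each hypothesis is deterministic here, so P(v | h) is 1 for its
        # predicted label and 0 otherwise.
        score[h_x] += p_h
    return max(score, key=score.get), score

# Example from the slide: h1 predicts 1, h2 and h3 predict 0.
# The 0.4 / 0.3 / 0.3 posteriors are an assumed split consistent with
# P(f(x)=1) = 0.4 and P(f(x)=0) = 0.6.
label, scores = bayes_optimal([0.4, 0.3, 0.3], [1, 0, 0])
print(label, scores)   # 0  {0: 0.6, 1: 0.4} -- the ensemble disagrees with h_MAP = h1
```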
What Assumptions are We Making? (1)
• Will this always work?
  • Hint: we are finding a linear combination of h's
• What if several doctors shared training: med school, classes, instructors, internships…?
• What if some "doctors" were really phone / web site referrals to a single doctor?
• More generally…
  • Significant covariance among doctors (h's) that is not due to accuracy indicates interdependent redundancies
  • We over-weight these opinions
• Bayes optimal looks like marginalization over H (is it?)
What Assumptions (2)
• What does it mean "to work"?
• As |D| grows without bound, Bayes optimal classification should converge to the best answer. Does it?
• Consider the weight vector w as |D| → ∞
  • "Best answer" means better than any other
  • Assume no perfect tie, so there is a best answer
• The best w…
  • …will be all zeros except for a single one (for the best h)
• In general, must this happen? Why? How can we force it to happen?
Bayes Optimal Classifier
• Without additional information we can do no better than Bayes optimal
• The Bayes optimal classifier in general is not a member of H(!)
• Bayes optimal classification makes a strong assumption about "independence" (or non-redundancy) among the h's: their mistakes are uncorrelated
  • Mistakes are uncorrelated – kind of like naïve Bayes, but now among H
• Another strong assumption: some h ∈ H is correct; NOT agnostic
• View it as combining expertise; finding a linear combination (ensemble) of experts
Gibbs Classifier
• Bayes optimal classifiers can be expensive to train
  • Must calculate posteriors for all h ∈ H
• Instead, train and classify according to the current posterior over H
• Training:
  • Assume some prior (perhaps uniform) over H
  • Draw an h
  • Classify with that h
  • Update the posterior of that h
  • Repeat
• Multiple passes through the training set; can draw & update several h's at once
• Effort is focused on "problem" h's (those mistakenly believed to be highly accurate)
  • Training mistakes tend to lower the posteriors of the offending h's; through normalization, this raises the posteriors of everything else
  • Tends (eventually) to exercise h's that are more accurate
• Converges
• Expected error is at worst twice that of the Bayes optimal classifier
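A minimal sketch of the training loop just described, assuming a small finite H. The multiplicative update with factor `beta` is a simplified stand-in for a full Bayesian posterior update; the function names and the choice of `beta` are illustrative, not from the slides.

```python
import random

def gibbs_train(H, data, passes=10, beta=0.5, seed=0):
    """H is a list of classifiers h(x) -> label; data is [(x, y), ...]."""
    rng = random.Random(seed)
    w = [1.0 / len(H)] * len(H)            # start from a uniform prior over H
    for _ in range(passes):
        for x, y in data:
            i = rng.choices(range(len(H)), weights=w)[0]   # draw h ~ current posterior
            if H[i](x) != y:
                w[i] *= beta               # a mistake lowers this h's posterior...
            total = sum(w)
            w = [wi / total for wi in w]   # ...and normalization raises everything else
    return w

def gibbs_classify(H, w, x, seed=None):
    """Classify with a single h drawn from the current posterior."""
    i = random.Random(seed).choices(range(len(H)), weights=w)[0]
    return H[i](x)
```

The attraction is cheapness: classification uses a single sampled h rather than a sum over all of H, yet the expected error is at worst twice that of the Bayes optimal classifier.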
Bagging: Bootstrap AGGregatING
• Variance reduction
• The problem of unstable classifiers
  • Overfitting as low-statistical-confidence choices
  • Finding specious patterns in the data
• "Average" over a number of classifiers
• Bootstrap: data resampling
  • Generate multiple training sets
  • Resample the original training data
  • With replacement
  • The data sets have different "specious" patterns
• Learn a number of classifiers
  • Specious patterns will not correlate
  • The underlying true pattern will be common to many
• Combine the classifiers: label new test examples by a majority vote among the classifiers
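A minimal sketch of the bagging recipe above. `fit_fn` is a hypothetical plug-in base learner: it takes a list of (x, y) pairs and returns a classifier h(x) -> label.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Resample the training data with replacement (same size as the original)."""
    return [rng.choice(data) for _ in data]

def bag_train(data, fit_fn, n_classifiers=25, seed=0):
    """Train one classifier per bootstrap replicate of the training data."""
    rng = random.Random(seed)
    return [fit_fn(bootstrap(data, rng)) for _ in range(n_classifiers)]

def bag_predict(classifiers, x):
    """Label a new example by a majority vote among the bagged classifiers."""
    votes = Counter(h(x) for h in classifiers)
    return votes.most_common(1)[0][0]
```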
Bagging
• Recall that logistic regression can overfit
  • Linearly separable data
  • Overly steep probability fits
• Consider a bagging approach…
  • With many features (dimensions), extreme steepness in some dimension may be common
  • But these will not be systematic
  • So averaging tends to diminish their effect
• Generate a collection of trained but "random" classifiers
  • Sometimes resampling isn't even necessary – consider iterative algorithms
  • Resampling reduces the information in the training set; properly done, the effect can be small…
  • But for iterative algorithms it can be enough simply to permute the data: no reduction in the information / evidence of the training set
• Decision "stumps"
  • A decision tree with just one level (split)
  • Sometimes a few levels, to capture some nonlinearity among features
• Bagging decision stumps
  • Often works surprisingly well
  • A good first thing to try
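As a concrete base learner for the bagging sketch above, here is a simple one-split decision stump. It assumes numeric feature vectors and binary 0/1 labels, and `fit_stump` is an illustrative sketch, not an optimized implementation.

```python
def fit_stump(data):
    """Pick the single (feature, threshold) split minimizing training error."""
    xs, ys = [x for x, _ in data], [y for _, y in data]
    best = None   # (error, feature index, threshold, left label, right label)
    for j in range(len(xs[0])):
        for t in sorted(set(x[j] for x in xs)):
            for left, right in [(0, 1), (1, 0)]:   # try both label orientations
                preds = [left if x[j] <= t else right for x in xs]
                err = sum(p != y for p, y in zip(preds, ys))
                if best is None or err < best[0]:
                    best = (err, j, t, left, right)
    _, j, t, left, right = best
    return lambda x: left if x[j] <= t else right

# Bagged stumps ("a good first thing to try"), reusing the earlier sketch:
# stumps = bag_train(training_data, fit_stump, n_classifiers=50)
# label  = bag_predict(stumps, new_x)
```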
Boosting: Weak to Strong Learning
• A weak learner:
  • given a set of weighted training examples, produces a hypothesis
  • with high probability
  • of accuracy at least *slightly* better than random guessing
  • over any distribution
• Given a training set Z and a hypothesis space H
  • Learn a (sequence of) linear combinations of weak classifiers
  • At each iteration, add a classifier h_i ∈ H
  • Weight h_i by its performance on the weighted Z
  • Each new h is trained on the same Z, but reweighted so that hard examples z_j count more
  • Two sets of weights – one for the data and one for the weak learners
• Classify using a weighted vote of the classifiers
• Builds a "strong" learner: arbitrarily high accuracies can be achieved
Boosting
• A "meta" learning algorithm
  • Any learning algorithm that builds "weak learners" can be "boosted"
• Boosting yields performance as high as desired(!)
• Continued training improves performance even after perfectly classifying the training set(!)
• Seems not to overfit
• Can be overly sensitive to outliers (and noisy data)
• Popular & practical: AdaBoost (for "adaptive boosting")
AdaBoost (from the Freund and Schapire tutorial)
[The original slide shows the AdaBoost pseudocode here.]
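Since the algorithm box is not reproduced above, here is a standard AdaBoost sketch for binary labels in {-1, +1}. `weak_fit(data, w)` is a hypothetical weak learner that accepts example weights and returns a classifier; details (e.g., the stopping test) may differ from the version in the Freund and Schapire tutorial.

```python
import math

def adaboost(data, weak_fit, rounds=50):
    """data is [(x, y), ...] with y in {-1, +1}; returns weak learners and their weights."""
    n = len(data)
    w = [1.0 / n] * n                       # one set of weights over the data...
    hs, alphas = [], []                     # ...and one over the weak learners
    for _ in range(rounds):
        h = weak_fit(data, w)
        eps = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)   # weighted error
        if eps >= 0.5:                      # no better than random guessing: stop
            break
        eps = max(eps, 1e-12)               # guard against a perfect weak learner
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Reweight the data so that hard (misclassified) examples count more next round.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, data)]
        z = sum(w)
        w = [wi / z for wi in w]
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def boosted_predict(hs, alphas, x):
    """Classify by a weighted vote of the weak classifiers."""
    score = sum(a * h(x) for h, a in zip(hs, alphas))
    return 1 if score >= 0 else -1
```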
Boosted C4.5 (decision trees)
[Plot: training error and test error vs. boosting iteration (size of classifier).]
Somewhat surprising behavior! Why surprising?
Boosted Stumps / Boosted Decision Trees
• Error scatter plots; problems from the UCI repository [plots not reproduced]
• What does that line represent? Above? Below?
• Stumps (left) are more efficient to learn
• DTs (right) reduce boosting's outlier sensitivity
Why does boosting work?
• It's complicated
  • Many approaches to boosting (AdaBoost is a standard)
  • Heuristic? No – provably effective
  • Heuristic? Yes – invented and refined, not derived
• Accepted intuition
  • Empirically identifies support vectors
  • Finds a "large margin" strong classifier
  • Better: improves the margin distribution rather than just the margin
Margin Distribution
Number of points as distance from the classifier increases
[Plot contrasting the best (minimum) margin with a better overall margin distribution.]
Boosting as Improving the Margin Distribution
[Plots: training error and test error vs. boosting iteration (size of classifier); cumulative margin distribution for 5 (dotted), 100 (dashed), and 1000 (solid) boosting iterations.]
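To connect the plots to a formula: the quantity on the horizontal axis of a margin-distribution plot is the normalized margin y · Σ_t α_t h_t(x) / Σ_t |α_t|, which lies in [-1, +1]. A small sketch (reusing the `hs`, `alphas` representation from the AdaBoost sketch above, with hypothetical data):

```python
def margins(hs, alphas, data):
    """Sorted normalized margins of the training examples (labels in {-1, +1})."""
    total = sum(abs(a) for a in alphas)
    return sorted(y * sum(a * h(x) for h, a in zip(hs, alphas)) / total
                  for x, y in data)

def cumulative_margin_distribution(hs, alphas, data, theta):
    """Fraction of training examples whose normalized margin is <= theta."""
    ms = margins(hs, alphas, data)
    return sum(m <= theta for m in ms) / len(ms)
```

More boosting rounds tend to shift this whole cumulative curve to the right (larger margins for most points), which is the intuition behind the plots above.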