How to be a Bayesian without believing Yoav Freund Joint work with Rob Schapire and Yishay Mansour
Motivation • Statistician: “Are you a Bayesian or a Frequentist?” • Yoav: “I don’t know, you tell me…” • I need a better answer….
Toy example • Computer receives a telephone call • Measures the pitch of the voice • Decides the gender of the caller (figure: human voice split into male and female pitch ranges)
Generative modeling (figure: probability vs. voice pitch; two class-conditional distributions with parameters mean1, var1 and mean2, var2)
Discriminative approach (figure: no. of mistakes as a function of the voice-pitch threshold)
Discriminative Bayesian approach (figure: prior and posterior probability, and the conditional probability, as functions of voice pitch)
Suggested approach (figure: no. of mistakes vs. voice pitch, with the range split into "definitely female", "unsure", and "definitely male")
Formal Frameworks For stating theorems regarding the dependence of the generalization error on the size of the training set.
The PAC set-up • Learner chooses a classifier set C, where each c ∈ C maps c: X → {-1,+1}, and requests m training examples • Nature chooses a target classifier c ∈ C and a distribution P over X • Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) with yi = c(xi) • Learner generates h: X → {-1,+1} Goal: P( h(x) ≠ c(x) ) < ε for every choice of c and P
The agnostic set-up Vapnik's pattern-recognition problem • Learner chooses a classifier set C, where each c ∈ C maps c: X → {-1,+1}, and requests m training examples • Nature chooses a distribution D over X × {-1,+1} • Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) according to D • Learner generates h: X → {-1,+1} Goal: P_D( h(x) ≠ y ) < P_D( c*(x) ≠ y ) + ε for every D, where c* = argmin_{c ∈ C} P_D( c(x) ≠ y )
Self-bounding learning Freund 97 • Learner selects a concept class C • Nature generates the training set T = (x1,y1), (x2,y2), …, (xm,ym) IID according to a distribution D over X × {-1,+1} • Learner generates h: X → {-1,+1} and a bound ε_T (the bound depends on the training set!) such that, with high probability over the random choice of the training set T, P_D( h(x) ≠ y ) < P_D( c*(x) ≠ y ) + ε_T
Learning a region predictor Vovk 2000 • Learner selects a concept class C • Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) IID according to a distribution D over X × {-1,+1} • Learner generates h: X → { {-1}, {+1}, {-1,+1}, {} } such that, with high probability, P_D( y ∉ h(x) ) < P_D( c*(x) ≠ y ) + ε1 and P_D( h(x) = {-1,+1} ) < ε2
Intuitions The rough idea
A motivating example (figure: a scatter of training points labeled + and −, with three unlabeled query points marked "?")
Distribution of errors (figure: the true and empirical errors of the classifiers, spread over [0, 1/2], in the worst case and in the typical case) • Contenders for the best → predict with their majority vote • Non-contenders → ignore!
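A minimal sketch of this intuition, assuming the hypotheses are plain Python callables; the `slack` parameter that defines a "contender" is an illustrative stand-in, not the tuning used later in the talk.

```python
import numpy as np

def contender_vote(hypotheses, X_train, y_train, x, slack=0.05):
    """Keep every classifier whose training error is within `slack` of the
    best one (the "contenders"), ignore the rest, and predict with the
    contenders' majority vote.  `hypotheses` is a list of functions x -> +/-1."""
    train_errors = np.array([
        np.mean([h(xi) != yi for xi, yi in zip(X_train, y_train)])
        for h in hypotheses])
    best = train_errors.min()
    contenders = [h for h, err in zip(hypotheses, train_errors)
                  if err <= best + slack]          # close to the best -> keep
    votes = sum(h(x) for h in contenders)          # majority vote of contenders
    return +1 if votes >= 0 else -1
```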
Main result Finite concept class
Notation • Data distribution: (x, y) ~ D, a distribution over X × {-1,+1} • Generalization error: ε(c) = P_D( c(x) ≠ y ) • Training set: T = (x1,y1), …, (xm,ym), drawn IID from D • Training error: ε̂(c) = (1/m) |{ i : c(xi) ≠ yi }|
The algorithm • Parameters • Hypothesis weight • Empirical Log Ratio • Prediction rule
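A minimal sketch of an exponential-weights averaging rule of this shape, assuming a finite class of Python callables: each classifier is weighted by exp(-eta * training error), the empirical log ratio compares the total weight voting +1 against the total weight voting -1, and the rule abstains ("unsure") when the ratio is small. The exact normalization and the values of eta and delta are illustrative stand-ins for the tuning given on the next slide.

```python
import numpy as np

def empirical_log_ratio(hypotheses, X_train, y_train, x, eta=10.0):
    """Weight each classifier by exp(-eta * training_error) and return the
    log ratio between the total weight voting +1 and the total weight
    voting -1 at the point x.  (eta is an illustrative value.)"""
    weights = np.array([
        np.exp(-eta * np.mean([h(xi) != yi for xi, yi in zip(X_train, y_train)]))
        for h in hypotheses])
    preds = np.array([h(x) for h in hypotheses])
    w_plus = weights[preds == +1].sum() + 1e-12    # small constant avoids log(0)
    w_minus = weights[preds == -1].sum() + 1e-12
    return np.log(w_plus / w_minus) / eta

def predict(hypotheses, X_train, y_train, x, eta=10.0, delta=0.05):
    """Predict the sign of the empirical log ratio, or abstain ("unsure")
    when it falls inside the band [-delta, +delta]."""
    l = empirical_log_ratio(hypotheses, X_train, y_train, x, eta)
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0   # unsure
```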
Suggested tuning (figure: the recommended parameter settings and the bound they yield)
Main properties • The ELR is very stable: the probability of large deviations is independent of the size of the concept class. • The expected value of the ELR is close to the True Log Ratio (computed from the true hypothesis errors instead of the estimates). • The TLR is a good proxy for the best concept in the class.
McDiarmid's theorem If X1, …, Xm are independent random variables and the function f satisfies the bounded-differences condition | f(x1, …, xi, …, xm) − f(x1, …, xi′, …, xm) | ≤ ci for every i and every xi, xi′, then P( f(X1, …, Xm) − E[f] ≥ t ) ≤ exp( −2t² / Σi ci² )
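A concrete instance (a standard special case, not from the slides): take f to be the training error of a fixed classifier c. Changing one example changes f by at most ci = 1/m, so McDiarmid gives
P( ε̂(c) − ε(c) ≥ t ) ≤ exp( −2t² / (m · (1/m)²) ) = e^(−2mt²),
the familiar Hoeffding bound. The same bounded-differences argument applied to the empirical log ratio (next slide) is what makes its large deviations independent of the size of the concept class.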
Empirical log ratio is stable (figure: the effect on each classifier's training error, and hence on the empirical log ratio, of changing a single training example)
Infinite concept classes Geometry of the concept class
Infinite concept classes • The stated bounds become vacuous. • How can we approximate an infinite class with a finite class? • Unlabeled examples give useful information.
A metric space of classifiers • d(f, g) = P( f(x) ≠ g(x) ) • Neighboring models make similar predictions (figure: classifiers f and g as points in classifier space, at a distance determined by where they disagree in example space)
ε-covers (figure: the classifier class covered by balls in classifier space; the number of neighbors grows as the cover radius ε shrinks)
Computational issues • How do we compute the ε-cover? • We can use unlabeled examples to generate the cover. • Estimate the prediction by ignoring concepts with high error.
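A minimal sketch of one way to do this, assuming classifiers are plain Python callables: the disagreement distance d(f, g) is estimated on the unlabeled sample, and a cover is built greedily. The greedy construction and the value of eps are illustrative assumptions, not necessarily the procedure used in the talk.

```python
import numpy as np

def disagreement(f, g, X_unlabeled):
    """Estimate d(f, g) = P(f(x) != g(x)) on a sample of unlabeled examples."""
    return np.mean([f(x) != g(x) for x in X_unlabeled])

def greedy_epsilon_cover(hypotheses, X_unlabeled, eps=0.05):
    """Greedily keep a hypothesis only if it disagrees with every hypothesis
    already kept on more than an eps fraction of the unlabeled sample;
    the kept hypotheses form an (estimated) eps-cover of the class."""
    cover = []
    for h in hypotheses:
        if all(disagreement(h, g, X_unlabeled) > eps for g in cover):
            cover.append(h)
    return cover
```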
Application: comparing perfect features • 45,000 features • Training examples: 10^2 negative, 2-10 positive, 10^4 unlabeled • More than one feature has zero training error. • Which feature(s) should we use? • How should we combine them?
A typical perfect feature (figure: histograms of the feature's value over the unlabeled, positive, and negative examples; no. of images vs. feature value)
Pseudo-Bayes for a single threshold • The set of possible thresholds is uncountably infinite • Use an ε-cover over the thresholds • This is equivalent to using the distribution of the unlabeled examples as the prior distribution over the set of thresholds.
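A minimal sketch of this construction, assuming the feature is a real number and each candidate classifier is a threshold rule of the form "feature value above t → +1": the unlabeled feature values serve as the candidate thresholds (so the unlabeled distribution plays the role of the prior), and each threshold is weighted by an exponential in its training error, following the averaging rule sketched earlier. The value of eta and the threshold orientation are illustrative.

```python
import numpy as np

def threshold_log_ratio(feature_train, y_train, feature_unlabeled, x_value,
                        eta=10.0):
    """Empirical log ratio for a single threshold feature: candidate
    thresholds are the unlabeled feature values, each weighted by
    exp(-eta * its training error)."""
    thresholds = np.sort(np.asarray(feature_unlabeled))
    y_train = np.asarray(y_train)
    w_plus, w_minus = 1e-12, 1e-12            # small constants avoid log(0)
    for t in thresholds:
        preds = np.where(np.asarray(feature_train) > t, +1, -1)  # "above t -> +1"
        err = np.mean(preds != y_train)
        w = np.exp(-eta * err)
        if x_value > t:
            w_plus += w                       # this threshold votes +1 at x_value
        else:
            w_minus += w                      # this threshold votes -1 at x_value
    return np.log(w_plus / w_minus) / eta
```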
What it will do (figure: the prediction (+1 / 0 / -1) as a function of the feature value, combining the prior weights from the unlabeled examples with the error factor from the negative examples)
Relation to large margins • SVM and AdaBoost search for a linear discriminator with a large margin (figure: a large-margin separator corresponds to a neighborhood of good classifiers)
Relation to bagging • Bagging: generate classifiers from random subsets of the training set, and predict according to the majority vote among them. (Another possibility: flip the labels of a small random subset of the training set.) • Bagging can be seen as a randomized estimate of the log ratio.
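A sketch of that reading of bagging, assuming a user-supplied training routine train_and_fit(X, y) that returns a classifier x → ±1 (the name is not from the slides): the smoothed log ratio of +1 to −1 votes over bootstrap-trained classifiers acts as a randomized stand-in for the empirical log ratio.

```python
import numpy as np

def bagged_log_ratio(train_and_fit, X_train, y_train, x, n_bags=50,
                     rng=np.random.default_rng(0)):
    """Train one classifier per bootstrap resample of the training set and
    return the log ratio of +1 to -1 votes at x (smoothed to avoid log(0))."""
    m = len(X_train)
    votes_plus, votes_minus = 1e-12, 1e-12
    for _ in range(n_bags):
        idx = rng.choice(m, size=m, replace=True)        # bootstrap resample
        h = train_and_fit([X_train[i] for i in idx],
                          [y_train[i] for i in idx])
        if h(x) == +1:
            votes_plus += 1
        else:
            votes_minus += 1
    return np.log(votes_plus / votes_minus)
```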
Bias/Variance for classification • Bias: error of predicting with the sign of the True Log Ratio (infinite training set). • Variance: additional error from predicting with the sign of the Empirical Log Ratio which is based on a finite training sample.
New directions How a measure of confidence can help in practice
Face Detection • Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Using confidence to save time • The detector combines 6000 simple features using AdaBoost. • In most boxes, only 8-9 features are calculated. (figure: all boxes are screened by Feature 1, Feature 2, …; most are quickly labeled "definitely not a face" and only the remaining boxes "might be a face")
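A minimal sketch of the early-exit idea, not Viola-Jones' actual detector: cheap feature scores are accumulated in order, and a box is rejected ("definitely not a face") as soon as the running score falls below a per-stage threshold, so most boxes are dismissed after a handful of features. The feature functions and thresholds are assumed inputs.

```python
def cascade_score(features, thresholds, box):
    """`features` is a list of cheap scoring functions box -> float and
    `thresholds` the per-stage rejection thresholds (both illustrative).
    Most boxes exit after the first couple of stages, which saves time."""
    score = 0.0
    for f, reject_threshold in zip(features, thresholds):
        score += f(box)
        if score < reject_threshold:
            return None          # definitely not a face: stop early
    return score                 # might be a face: full evaluation finished
```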
Selective sampling (figure: unlabeled data feeds a partially trained classifier; a sample of its unconfident examples is sent for labeling, and the resulting labeled examples are used to continue training)
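A sketch of the selective-sampling loop, assuming user-supplied routines label_oracle, train, and confidence (none of these names are from the slides); confidence(model, x) should be small exactly when the prediction is unsure, e.g. the magnitude of the empirical log ratio, and train is assumed to handle an empty label set by returning a default model.

```python
def selective_sampling(X_unlabeled, label_oracle, train, confidence,
                       batch_size=10, rounds=5):
    """Repeatedly query labels only for the examples the current classifier
    is least confident about, then retrain on everything labeled so far."""
    labeled = []
    model = train(labeled)
    for _ in range(rounds):
        # pick the batch_size least-confident unlabeled examples
        pool = sorted(X_unlabeled, key=lambda x: confidence(model, x))
        queries = pool[:batch_size]
        labeled += [(x, label_oracle(x)) for x in queries]
        X_unlabeled = pool[batch_size:]
        model = train(labeled)        # retrain on all labels gathered so far
    return model
```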
Co-training (figure: images that might contain faces; a partially trained color-based classifier uses the color info and a partially trained shape-based classifier uses the shape info, and each passes its confident predictions to the other as training data)
Summary • Bayesian averaging is justifiable even without Bayesian assumptions. • Infinite concept classes: use ε-covers. • Efficient implementations (thresholds, SVM, boosting, bagging, …) are still largely open. • Calibration (recent work of Vovk). • A good measure of confidence is very important in practice. • More than 2 classes (predicting with a subset of the labels).