How to be a Bayesian without believing Yoav Freund Joint work with Rob Schapire and Yishay Mansour
Motivation • Statistician: “Are you a Bayesian or a Frequentist?” • Yoav: “I don’t know, you tell me…” • I need a better answer….
Toy example • Computer receives a telephone call • Measures the pitch of the voice • Decides the gender of the caller (figure: human voice split into male and female pitch ranges)
Generative modeling (figure: probability vs. voice pitch; two class-conditional distributions with parameters mean1, var1 and mean2, var2)
Discriminative approach (figure: no. of mistakes as a function of the voice-pitch threshold)
Discriminative Bayesian approach (figure: prior and posterior probability, and the conditional probability, as functions of voice pitch)
Suggested approach (figure: no. of mistakes vs. voice pitch, with the range split into "definitely female", "unsure", and "definitely male")
Formal Frameworks For stating theorems regarding the dependence of the generalization error on the size of the training set.
The PAC set-up • Learner chooses a classifier set C, where each c ∈ C maps c: X → {-1,+1}, and requests m training examples • Nature chooses a target classifier c ∈ C and a distribution P over X • Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) with yi = c(xi) • Learner generates h: X → {-1,+1} Goal: P( h(x) ≠ c(x) ) < ε for every choice of c and P
The agnostic set-up Vapnik's pattern-recognition problem • Learner chooses a classifier set C, where each c ∈ C maps c: X → {-1,+1}, and requests m training examples • Nature chooses a distribution D over X × {-1,+1} • Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) according to D • Learner generates h: X → {-1,+1} Goal: P_D( h(x) ≠ y ) < P_D( c*(x) ≠ y ) + ε for every D, where c* = argmin_{c ∈ C} P_D( c(x) ≠ y )
Self-bounding learning Freund 97 • Learner selects a concept class C • Nature generates the training set T = (x1,y1), (x2,y2), …, (xm,ym) IID according to a distribution D over X × {-1,+1} • Learner generates h: X → {-1,+1} and a bound ε_T (the bound depends on the training set!) such that, with high probability over the random choice of the training set T, P_D( h(x) ≠ y ) < P_D( c*(x) ≠ y ) + ε_T
Learning a region predictor Vovk 2000 • Learner selects a concept class C • Nature generates the training set (x1,y1), (x2,y2), …, (xm,ym) IID according to a distribution D over X × {-1,+1} • Learner generates h: X → { {-1}, {+1}, {-1,+1}, {} } such that, with high probability, P_D( y ∉ h(x) ) < P_D( c*(x) ≠ y ) + ε1 and P_D( h(x) = {-1,+1} ) < ε2
Intuitions The rough idea
A motivating example (figure: a scatter of training points labeled + and −, with three unlabeled query points marked "?")
Distribution of errors (figure: the true and empirical errors of the classifiers, spread over [0, 1/2], in the worst case and in the typical case) • Contenders for the best → predict with their majority vote • Non-contenders → ignore!
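A minimal sketch of this intuition, assuming the hypotheses are plain Python callables; the `slack` parameter that defines a "contender" is an illustrative stand-in, not the tuning used later in the talk.

```python
import numpy as np

def contender_vote(hypotheses, X_train, y_train, x, slack=0.05):
    """Keep every classifier whose training error is within `slack` of the
    best one (the "contenders"), ignore the rest, and predict with the
    contenders' majority vote.  `hypotheses` is a list of functions x -> +/-1."""
    train_errors = np.array([
        np.mean([h(xi) != yi for xi, yi in zip(X_train, y_train)])
        for h in hypotheses])
    best = train_errors.min()
    contenders = [h for h, err in zip(hypotheses, train_errors)
                  if err <= best + slack]          # close to the best -> keep
    votes = sum(h(x) for h in contenders)          # majority vote of contenders
    return +1 if votes >= 0 else -1
```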
Main result Finite concept class
Notation • Data distribution: (x, y) ~ D, a distribution over X × {-1,+1} • Generalization error: ε(c) = P_D( c(x) ≠ y ) • Training set: T = (x1,y1), …, (xm,ym), drawn IID from D • Training error: ε̂(c) = (1/m) |{ i : c(xi) ≠ yi }|
The algorithm • Parameters • Hypothesis weight • Empirical Log Ratio • Prediction rule
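A minimal sketch of an exponential-weights averaging rule of this shape, assuming a finite class of Python callables: each classifier is weighted by exp(-eta * training error), the empirical log ratio compares the total weight voting +1 against the total weight voting -1, and the rule abstains ("unsure") when the ratio is small. The exact normalization and the values of eta and delta are illustrative stand-ins for the tuning given on the next slide.

```python
import numpy as np

def empirical_log_ratio(hypotheses, X_train, y_train, x, eta=10.0):
    """Weight each classifier by exp(-eta * training_error) and return the
    log ratio between the total weight voting +1 and the total weight
    voting -1 at the point x.  (eta is an illustrative value.)"""
    weights = np.array([
        np.exp(-eta * np.mean([h(xi) != yi for xi, yi in zip(X_train, y_train)]))
        for h in hypotheses])
    preds = np.array([h(x) for h in hypotheses])
    w_plus = weights[preds == +1].sum() + 1e-12    # small constant avoids log(0)
    w_minus = weights[preds == -1].sum() + 1e-12
    return np.log(w_plus / w_minus) / eta

def predict(hypotheses, X_train, y_train, x, eta=10.0, delta=0.05):
    """Predict the sign of the empirical log ratio, or abstain ("unsure")
    when it falls inside the band [-delta, +delta]."""
    l = empirical_log_ratio(hypotheses, X_train, y_train, x, eta)
    if l > delta:
        return +1
    if l < -delta:
        return -1
    return 0   # unsure
```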
Suggested tuning (figure: the recommended parameter settings and the bound they yield)
Main properties • The ELR is very stable: the probability of large deviations is independent of the size of the concept class. • The expected value of the ELR is close to the True Log Ratio (computed from the true hypothesis errors instead of the estimates). • The TLR is a good proxy for the best concept in the class.
McDiarmid's theorem If X1, …, Xm are independent random variables and the function f satisfies the bounded-differences condition | f(x1, …, xi, …, xm) − f(x1, …, xi′, …, xm) | ≤ ci for every i and every xi, xi′, then P( f(X1, …, Xm) − E[f] ≥ t ) ≤ exp( −2t² / Σi ci² )
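A concrete instance (a standard special case, not from the slides): take f to be the training error of a fixed classifier c. Changing one example changes f by at most ci = 1/m, so McDiarmid gives
P( ε̂(c) − ε(c) ≥ t ) ≤ exp( −2t² / (m · (1/m)²) ) = e^(−2mt²),
the familiar Hoeffding bound. The same bounded-differences argument applied to the empirical log ratio (next slide) is what makes its large deviations independent of the size of the concept class.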
Empirical log ratio is stable (figure: the effect on each classifier's training error, and hence on the empirical log ratio, of changing a single training example)
Infinite concept classes Geometry of the concept class
Infinite concept classes • The stated bounds become vacuous. • How can we approximate an infinite class with a finite class? • Unlabeled examples give useful information.
A metric space of classifiers • d(f, g) = P( f(x) ≠ g(x) ) • Neighboring models make similar predictions (figure: classifiers f and g as points in classifier space, at a distance determined by where they disagree in example space)
ε-covers (figure: the classifier class covered by balls in classifier space; the number of neighbors grows as the cover radius ε shrinks)
Computational issues • How do we compute the ε-cover? • We can use unlabeled examples to generate the cover. • Estimate the prediction by ignoring concepts with high error.
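A minimal sketch of one way to do this, assuming classifiers are plain Python callables: the disagreement distance d(f, g) is estimated on the unlabeled sample, and a cover is built greedily. The greedy construction and the value of eps are illustrative assumptions, not necessarily the procedure used in the talk.

```python
import numpy as np

def disagreement(f, g, X_unlabeled):
    """Estimate d(f, g) = P(f(x) != g(x)) on a sample of unlabeled examples."""
    return np.mean([f(x) != g(x) for x in X_unlabeled])

def greedy_epsilon_cover(hypotheses, X_unlabeled, eps=0.05):
    """Greedily keep a hypothesis only if it disagrees with every hypothesis
    already kept on more than an eps fraction of the unlabeled sample;
    the kept hypotheses form an (estimated) eps-cover of the class."""
    cover = []
    for h in hypotheses:
        if all(disagreement(h, g, X_unlabeled) > eps for g in cover):
            cover.append(h)
    return cover
```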
Application: comparing perfect features • 45,000 features • Training examples: 10^2 negative, 2-10 positive, 10^4 unlabeled • More than one feature has zero training error. • Which feature(s) should we use? • How should we combine them?
A typical perfect feature (figure: histograms of the feature's value over the unlabeled, positive, and negative examples; no. of images vs. feature value)
Pseudo-Bayes for a single threshold • The set of possible thresholds is uncountably infinite • Use an ε-cover over the thresholds • This is equivalent to using the distribution of the unlabeled examples as the prior distribution over the set of thresholds.
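A minimal sketch of this construction, assuming the feature is a real number and each candidate classifier is a threshold rule of the form "feature value above t → +1": the unlabeled feature values serve as the candidate thresholds (so the unlabeled distribution plays the role of the prior), and each threshold is weighted by an exponential in its training error, following the averaging rule sketched earlier. The value of eta and the threshold orientation are illustrative.

```python
import numpy as np

def threshold_log_ratio(feature_train, y_train, feature_unlabeled, x_value,
                        eta=10.0):
    """Empirical log ratio for a single threshold feature: candidate
    thresholds are the unlabeled feature values, each weighted by
    exp(-eta * its training error)."""
    thresholds = np.sort(np.asarray(feature_unlabeled))
    y_train = np.asarray(y_train)
    w_plus, w_minus = 1e-12, 1e-12            # small constants avoid log(0)
    for t in thresholds:
        preds = np.where(np.asarray(feature_train) > t, +1, -1)  # "above t -> +1"
        err = np.mean(preds != y_train)
        w = np.exp(-eta * err)
        if x_value > t:
            w_plus += w                       # this threshold votes +1 at x_value
        else:
            w_minus += w                      # this threshold votes -1 at x_value
    return np.log(w_plus / w_minus) / eta
```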
What it will do (figure: the prediction (+1 / 0 / -1) as a function of the feature value, combining the prior weights from the unlabeled examples with the error factor from the negative examples)
Relation to large margins • SVM and AdaBoost search for a linear discriminator with a large margin (figure: a large-margin separator corresponds to a neighborhood of good classifiers)
Relation to bagging • Bagging: generate classifiers from random subsets of the training set, and predict according to the majority vote among them. (Another possibility: flip the labels of a small random subset of the training set.) • Bagging can be seen as a randomized estimate of the log ratio.
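A sketch of that reading of bagging, assuming a user-supplied training routine train_and_fit(X, y) that returns a classifier x → ±1 (the name is not from the slides): the smoothed log ratio of +1 to −1 votes over bootstrap-trained classifiers acts as a randomized stand-in for the empirical log ratio.

```python
import numpy as np

def bagged_log_ratio(train_and_fit, X_train, y_train, x, n_bags=50,
                     rng=np.random.default_rng(0)):
    """Train one classifier per bootstrap resample of the training set and
    return the log ratio of +1 to -1 votes at x (smoothed to avoid log(0))."""
    m = len(X_train)
    votes_plus, votes_minus = 1e-12, 1e-12
    for _ in range(n_bags):
        idx = rng.choice(m, size=m, replace=True)        # bootstrap resample
        h = train_and_fit([X_train[i] for i in idx],
                          [y_train[i] for i in idx])
        if h(x) == +1:
            votes_plus += 1
        else:
            votes_minus += 1
    return np.log(votes_plus / votes_minus)
```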
Bias/Variance for classification • Bias: error of predicting with the sign of the True Log Ratio (infinite training set). • Variance: additional error from predicting with the sign of the Empirical Log Ratio which is based on a finite training sample.
New directions How a measure of confidence can help in practice
Face Detection • Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Using confidence to save time • The detector combines 6000 simple features using AdaBoost. • In most boxes, only 8-9 features are calculated. (figure: all boxes are screened by Feature 1, Feature 2, …; most are quickly labeled "definitely not a face" and only the remaining boxes "might be a face")
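A minimal sketch of the early-exit idea, not Viola-Jones' actual detector: cheap feature scores are accumulated in order, and a box is rejected ("definitely not a face") as soon as the running score falls below a per-stage threshold, so most boxes are dismissed after a handful of features. The feature functions and thresholds are assumed inputs.

```python
def cascade_score(features, thresholds, box):
    """`features` is a list of cheap scoring functions box -> float and
    `thresholds` the per-stage rejection thresholds (both illustrative).
    Most boxes exit after the first couple of stages, which saves time."""
    score = 0.0
    for f, reject_threshold in zip(features, thresholds):
        score += f(box)
        if score < reject_threshold:
            return None          # definitely not a face: stop early
    return score                 # might be a face: full evaluation finished
```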
Selective sampling (figure: unlabeled data feeds a partially trained classifier; a sample of its unconfident examples is sent for labeling, and the resulting labeled examples are used to continue training)
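A sketch of the selective-sampling loop, assuming user-supplied routines label_oracle, train, and confidence (none of these names are from the slides); confidence(model, x) should be small exactly when the prediction is unsure, e.g. the magnitude of the empirical log ratio, and train is assumed to handle an empty label set by returning a default model.

```python
def selective_sampling(X_unlabeled, label_oracle, train, confidence,
                       batch_size=10, rounds=5):
    """Repeatedly query labels only for the examples the current classifier
    is least confident about, then retrain on everything labeled so far."""
    labeled = []
    model = train(labeled)
    for _ in range(rounds):
        # pick the batch_size least-confident unlabeled examples
        pool = sorted(X_unlabeled, key=lambda x: confidence(model, x))
        queries = pool[:batch_size]
        labeled += [(x, label_oracle(x)) for x in queries]
        X_unlabeled = pool[batch_size:]
        model = train(labeled)        # retrain on all labels gathered so far
    return model
```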
Co-training (figure: images that might contain faces; a partially trained color-based classifier uses the color info and a partially trained shape-based classifier uses the shape info, and each passes its confident predictions to the other as training data)
Summary • Bayesian averaging is justifiable even without Bayesian assumptions. • Infinite concept classes: use ε-covers. • Efficient implementations (thresholds, SVM, boosting, bagging, …) are still largely open. • Calibration (recent work of Vovk). • A good measure of confidence is very important in practice. • More than 2 classes (predicting with a subset of the labels).