PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers ICML 2005 François Laviolette and Mario Marchand Université Laval
PLAN
• The “traditional” PAC-Bayes theorem (for the usual data-independent setting)
• The “generalized” PAC-Bayes theorem (for the more general sample compression setting)
• Implications and follow-ups
In particular, for Gibbs classifiers: what if we choose the prior P after observing the data?
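For reference, the “traditional” PAC-Bayes theorem alluded to here is usually stated in the following standard (Seeger/Langford) form; the notation R_S(G_Q), R(G_Q), m and δ is the standard one and is assumed here rather than taken from the slides:

```latex
% Standard (data-independent) PAC-Bayes bound.
% With probability at least 1 - \delta over the draw of S ~ D^m,
% simultaneously for every posterior Q over H:
\[
  \mathrm{kl}\!\left( R_S(G_Q) \,\middle\|\, R(G_Q) \right)
  \;\le\;
  \frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m+1}{\delta}}{m},
\]
% where R_S(G_Q) is the empirical risk of the Gibbs classifier G_Q,
% R(G_Q) its true risk, P the data-independent prior, and
% kl(q||p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)).
```

The prior P must be chosen before seeing the data for this bound to hold, which is exactly the restriction the question above challenges.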
The Gibbs and the majority vote
• We have a bound for G_Q, but we normally use the Bayes classifier B_Q (the Q-weighted majority-vote classifier) instead
• Consequently R(B_Q) ≤ 2 R(G_Q) (this can be improved with the “de-randomization” technique of Langford and Shawe-Taylor, 2003)
• So the PAC-Bayes theorem also gives a bound on the majority-vote classifier
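The factor of 2 follows from a standard one-line argument, sketched here for completeness:

```latex
% If B_Q(x) != y, then the Q-probability that a randomly drawn h errs on (x,y)
% is at least 1/2, so the indicator of a B_Q error is at most twice the
% Q-average error indicator.  Taking the expectation over (x,y) ~ D gives:
\[
  R(B_Q) \;=\; \mathbb{E}_{(x,y)\sim D}\,\mathbb{I}\big[B_Q(x)\neq y\big]
         \;\le\; 2\,\mathbb{E}_{(x,y)\sim D}\,\mathbb{E}_{h\sim Q}\,
                 \mathbb{I}\big[h(x)\neq y\big]
         \;=\; 2\,R(G_Q).
\]
```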
The sample compression setting
• Theorem 1 is valid in the usual data-independent setting, where H is defined without reference to the training data
• Example: H = the set of all linear classifiers h: R^n → {-1,+1}
• In the more general sample compression setting, each classifier is identified by two different sources of information:
   • The compression set: an (ordered) subset of the training set
   • A message string of additional information needed to identify a classifier
• Theorem 1 is not valid in this more general setting
To be more precise:
• In the sample compression setting, there exists a “reconstruction” function R that gives a classifier h = R(σ, S_i) when given a compression set S_i and a message string σ
• Recall that S_i is an ordered subset of the training set S, where the order is specified by i = (i_1, i_2, …, i_|i|)
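Restating the two bullets above in one display (with z_j = (x_j, y_j) denoting the j-th training example; the symbol σ for the message string is assumed here, since the original symbol did not survive extraction):

```latex
% The compression set S_i is the ordered subsequence of S indexed by i,
% and the classifier is obtained from it, plus the message string sigma,
% via the reconstruction function R:
\[
  S_{\mathbf{i}} = \big( z_{i_1}, z_{i_2}, \ldots, z_{i_{|\mathbf{i}|}} \big),
  \qquad
  h = \mathcal{R}\big(\sigma, S_{\mathbf{i}}\big).
\]
```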
Examples
• Set Covering Machines (SCM) [Marchand and Shawe-Taylor, JMLR 2002]
• Decision List Machines (DLM) [Marchand and Sokolova, JMLR 2005]
• Support Vector Machines (SVM)
• Nearest neighbour classifiers (NNC) — see the sketch below
• …
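As a concrete illustration of the last example, here is a minimal sketch of a reconstruction function for a 1-nearest-neighbour classifier, where the compression set is the list of stored prototypes and the message string is empty. This is illustrative only (the function name and interface are hypothetical, not the authors’ code):

```python
import numpy as np

def reconstruct_1nn(message, compression_set):
    """Illustrative reconstruction function R(sigma, S_i) for a 1-NN classifier.

    compression_set: list of (x, y) prototypes kept from the training set.
    message: unused (empty) for plain 1-NN.
    Returns a classifier h: x -> predicted label.
    """
    xs = np.array([x for x, _ in compression_set], dtype=float)
    ys = np.array([y for _, y in compression_set])

    def h(x):
        # Predict the label of the closest prototype in the compression set.
        dists = np.linalg.norm(xs - np.asarray(x, dtype=float), axis=1)
        return ys[np.argmin(dists)]

    return h

# Example: a compression set of two prototypes in R^2.
S_i = [((0.0, 0.0), -1), ((1.0, 1.0), +1)]
h = reconstruct_1nn(message="", compression_set=S_i)
print(h((0.9, 0.8)))  # prints 1, the label of the nearest prototype
```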
Priors in the sample compression setting
• The priors must be data-independent
• We will thus use priors defined over the set of all the parameters (i, σ) needed by the reconstruction function R once a training set S is given. The priors should be written as:
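The formula itself did not survive extraction; one natural and standard way to write such a data-independent prior, factored over the index vector i and the message string σ (an assumed form, the exact notation on the original slide may differ), is:

```latex
% A prior over all (i, sigma) pairs usable by the reconstruction function,
% factored as a distribution over index vectors times a conditional
% distribution over message strings:
\[
  P(\mathbf{i}, \sigma) \;=\; P_{\mathcal{I}}(\mathbf{i})\,
                              P_{\mathcal{M}(\mathbf{i})}(\sigma),
  \qquad
  \sum_{\mathbf{i}} P_{\mathcal{I}}(\mathbf{i}) = 1,
  \quad
  \sum_{\sigma \in \mathcal{M}(\mathbf{i})} P_{\mathcal{M}(\mathbf{i})}(\sigma) = 1.
\]
```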
• The posterior Q (a rescaled version of the prior P) incorporates Occam’s principle of parsimony
• The new PAC-Bayes theorem states that the risk bound for the sample-compressed Gibbs classifier G_Q is lower than the risk bound for any single classifier in its support
Conclusion
• The new PAC-Bayes bound
   • is valid in the more general sample compression setting
   • automatically incorporates Occam’s principle of parsimony
• A sample-compressed Gibbs classifier can have a smaller risk bound than any of its members
The next steps
• Finding derived bounds for particular sample-compressed classifiers such as:
   • majority votes of SCMs and DLMs,
   • SVMs,
   • NNCs.
• Developing new learning algorithms based on the theoretical information given by the bound
• A tight risk bound for majority-vote classifiers?