Minimum Information Inference Naftali Tishby Amir Globerson ICNC, CSE The Hebrew University TAU, Jan. 2, 2005
Talk outline • Classification with probabilistic models: Generative vs. Discriminative • The Minimum Information Principle • Generalization error bounds • Game theoretic motivation • Joint typicality • The MinMI algorithms • Empirical evaluations • Related extensions: SDR and IB
The Classification Problem • Learn how to classify (complex) observations X into (simple) classes Y • Given labeled examples (xi,yi) • Use them to construct a classifier y=g(x) • What is a good classifier? • Denote by p*(x,y) the true underlying law • Want to minimize the generalization error
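A short formalization of the generalization error mentioned above, in standard notation (a sketch, not taken from the slides):

```latex
% Generalization error of a classifier g under the true law p*(x,y)
\mathrm{err}(g) \;=\; \Pr_{(X,Y)\sim p^{*}}\!\big[\, g(X) \neq Y \,\big]
\;=\; \sum_{x,y} p^{*}(x,y)\, \mathbf{1}\{\, g(x) \neq y \,\}
```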
Problem … Generalization • Can't be computed directly • [Diagram: Observed sample (xi,yi), i=1…n → Learned rule y=g(x); Truth p*(x,y)]
Choosing a classifier • Need to limit the search to some set of rules. If every rule is possible we will surely over-fit. Use a parametric family gθ(x), where θ is a parameter. • Would be nice if the true rule is in the family gθ(x) • How do we choose θ for gθ(x)?
Common approach: Empirical Risk Minimization • A reasonable strategy: find the classifier which minimizes the empirical (sample) error (see the formula below) • Does not necessarily provide the best generalization, although theoretical bounds exist. • Computationally hard to minimize directly. Many works minimize upper bounds on the error. • Here we focus on a different strategy.
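The empirical risk minimization objective referred to above, reconstructed in standard notation (the slide's own formula image is missing):

```latex
% Empirical (sample) error over a parametric family g_\theta
\hat{\theta}_{\mathrm{ERM}} \;=\; \arg\min_{\theta}\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\, g_{\theta}(x_i) \neq y_i \,\}
```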
Probabilistic models for classification • Had we known p*(x,y), the optimal predictor would be the Bayes rule y(x) = argmaxy p*(y|x) • But we don't know it. We can try to estimate it. Two general approaches: generative and discriminative.
Generative Models • Assume p(x|y) has some parametric form, e.g. a Gaussian. • Each y has a different set of parameters θy • How do we estimate θy, p(y)? Maximum Likelihood!
Generative Models - Estimation • Easy to see that p(y) should be set to the empirical frequency of the classes • The parameters θy are obtained by collecting all x values for the class y and generating a maximum likelihood estimate.
Example: Gaussians • Assume the class conditional distribution p(x|y) is Gaussian • Then the parameters μy, σy are the empirical mean and variance of the samples in class y. • [Figure: two Gaussian class-conditional densities, y=1 and y=2]
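A minimal Python sketch of the one-dimensional Gaussian generative classifier described here (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def fit_gaussian_generative(x, y):
    """Per-class ML estimates for the Gaussian generative model (sketch).

    x: 1-D array of real-valued observations, y: array of class labels.
    Returns the empirical class priors and per-class (mean, variance).
    """
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}                 # empirical p(y)
    params = {c: (x[y == c].mean(), x[y == c].var()) for c in classes}
    return priors, params

def predict(x_new, priors, params):
    """Classify by the largest p(y) * N(x | mu_y, sigma_y^2)."""
    def log_gauss(x, mu, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(priors, key=lambda c: np.log(priors[c]) + log_gauss(x_new, *params[c]))
```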
Example: Naïve Bayes • Say X=[X1,…,Xn] is an n dimensional observation • Assume the features are conditionally independent: p(x|y) = p(x1|y)···p(xn|y) • Parameters are p(xi=k|y). Calculated by counting how many times xi=k in class y. • These are empirical means of indicator functions.
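A sketch of the counting-based estimation for discrete Naïve Bayes (the Laplace smoothing term is my addition, not mentioned in the slides):

```python
import numpy as np

def fit_naive_bayes(X, y, smoothing=1.0):
    """Counting-based ML estimation for discrete Naive Bayes (sketch).

    X: (n_samples, n_features) integer array, y: class labels.
    Returns the class prior p(y) and the tables p(x_i = k | y).
    """
    classes, counts = np.unique(y, return_counts=True)
    p_y = dict(zip(classes, counts / len(y)))
    p_xi_given_y = {}                          # keyed by (feature index, class)
    for c in classes:
        Xc = X[y == c]
        for i in range(X.shape[1]):
            values = np.unique(X[:, i])        # all values feature i can take
            cnt = {v: np.sum(Xc[:, i] == v) + smoothing for v in values}
            total = sum(cnt.values())
            p_xi_given_y[(i, c)] = {v: cnt[v] / total for v in values}
    return p_y, p_xi_given_y
```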
Generative Classifiers: Advantages • Sometimes it makes sense to assume a generation process for p(x|y)(e.g. speech or DNA). • Estimation is easy. Closed form solutions in many cases (through empirical means). • The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001). • Performance is not bad at all.
Discriminative Classifiers • But, to classify we need only p(y|x). Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known. • Assume a parametric form for p(y|x) directly.
Discriminative Models - Estimation • Choose θy to maximize the conditional likelihood • Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc.).
Example: logistic regression • Assume p(x|y) are Gaussians with different means and the same variance. Then p(y|x) takes a loglinear (sigmoid) form in x • Goal is to estimate ay, by • This is called logistic regression, since the log of the distribution is linear in x
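A small sketch of conditional likelihood maximization for binary logistic regression by gradient ascent (learning rate and iteration count are arbitrary choices, not from the talk):

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize the conditional log-likelihood by gradient ascent (sketch).

    X: (n, d) features, y: labels in {0, 1}. Returns weights a and bias b
    of the model p(y=1 | x) = sigmoid(a . x + b).
    """
    n, d = X.shape
    a, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ a + b)))   # predicted p(y=1|x)
        a += lr * (X.T @ (y - p)) / n            # gradient of the log-likelihood
        b += lr * np.mean(y - p)
    return a, b
```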
Discriminative Naïve Bayes • Assuming p(x|y) is in the Naïve Bayes class, the discriminative distribution has the same factorized form • Similar to Naïve Bayes, but the ψ(x,y) functions are not distributions. This is why we need the additional normalization Z. • Also called a conditional first order loglinear model.
Discriminative: Advantages • Estimates only the relevant distributions (important when X is very complex). • Often outperforms generative models for large enough samples (see Ng and Jordan, 2001). • Can be shown to minimize an upper bound on the classification error.
The best of both worlds… • Generative models (often) employ empirical means which are easy and reliable to estimate. • But they model each class separately, so poor discrimination is obtained. • We would like a discriminative approach based on empirical means.
Learning from Expected values (observations, in physics) • Assume we have some "interesting" observables φ(X) • And we are given their sample empirical means for different classes Y, e.g. the first two class-conditional moments • How can we use this information to build a classifier? • Idea: Look for models which yield the observed expectations, but contain no other information.
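For concreteness, computing the sample class-conditional expected values of the observables (a sketch; `phi` is assumed to hold precomputed feature values):

```python
import numpy as np

def class_conditional_means(phi, y):
    """Empirical means a_y = E_hat[phi(X) | Y = y], one vector per class.

    phi: (n, m) array holding phi(x_i) for each sample, y: class labels.
    """
    return {c: phi[y == c].mean(axis=0) for c in np.unique(y)}
```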
The MaxEnt approach • The Entropy H(X,Y) is a measure of uncertainty (and typicality!) • Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …) • "Least committed" to the observations, most typical. • Yields "nice" exponential forms (see below).
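The exponential form alluded to in the last bullet, written out in standard MaxEnt notation (a reconstruction; λ are the multipliers of the expectation constraints and ν enforces the class marginal):

```latex
% MaxEnt distribution under expectation constraints E[\phi_i(X) | Y=y] = a_{i,y}
p_{\mathrm{ME}}(x,y) \;=\; \frac{1}{Z(\lambda,\nu)}\,
  \exp\Big( \nu_y + \textstyle\sum_i \lambda_{i,y}\, \phi_i(x) \Big)
```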
Occam's Razor in Classification • Minimum assumptions about X and Y imply independence. • Because X behaves differently for different Y, they cannot be independent • How can we quantify their level of dependence? • [Figure: class-conditional densities p(x|y=1), p(x|y=2) with means m1, m2 along X]
Mutual Information (Shannon 48) • The measure of the information shared by two variables • X and Y are independent iff I(X;Y)=0 • Bounds the classification error: eBayes ≤ 0.5(H(Y)-I(X;Y)) (Hellman and Raviv 1970). • Why not minimize it subject to the observation constraints?
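For reference, the definition of mutual information and the error bound cited on this slide:

```latex
I(X;Y) \;=\; \sum_{x,y} p(x,y)\, \log \frac{p(x,y)}{p(x)\,p(y)},
\qquad
e_{\mathrm{Bayes}} \;\le\; \tfrac{1}{2}\, H(Y \mid X) \;=\; \tfrac{1}{2}\big( H(Y) - I(X;Y) \big)
```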
More on Mutual Information… • I(X;Y) is the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way. • Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization). • Quantifies independence in a model free way • Has a natural multivariate extension - I(X1,…,Xn).
MinMI: Problem Setting • Given a sample (x1,y1),…,(xn,yn) • For each y, calculate the empirical expected value of the observables φ(X) • Calculate the empirical marginal p(y) • Find the minimum Mutual Information distribution with the given empirical expected values • The value of the minimum information is precisely the information in the observations!
MinMI Formulation • The (convex) set of constraints F(a) and the information minimizing distribution pMI are written out below • A convex problem. No local minima!
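A reconstruction of the MinMI formulation described by these bullets (the slide's own formula images are missing); a_y denotes the observed class-conditional means and p̂(y) the empirical class prior:

```latex
% Constraint set and the minimum-information distribution
\mathcal{F}(a) \;=\; \Big\{\, p(x,y) \;:\; \textstyle\sum_x p(x \mid y)\,\phi(x) = a_y \;\;\forall y,\;\; p(y) = \hat{p}(y) \,\Big\},
\qquad
p^{\mathrm{MI}} \;=\; \arg\min_{p \,\in\, \mathcal{F}(a)} I(X;Y)
```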
pMI • The problem is convex given p(y) for any empirical means, without specifying p(x). • The minimization generates an auxiliary sparse pMI(x): support assignments.
Characterizing pMI • The solution has an exponential form (see below), where λ(y) are Lagrange multipliers • Via Bayes we obtain pMI(y|x) • It can be used for classification. But how do we find it?
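A sketch of the exponential solution form and the Bayes step referred to above, reconstructed from the surrounding slides (Z_y denotes the per-class normalizer):

```latex
p^{\mathrm{MI}}(x \mid y) \;=\; \frac{1}{Z_y}\, p^{\mathrm{MI}}(x)\,
  e^{\lambda(y) \cdot \phi(x)},
\qquad
p^{\mathrm{MI}}(y \mid x) \;=\; \frac{p(y)\, p^{\mathrm{MI}}(x \mid y)}{p^{\mathrm{MI}}(x)}
\;=\; \frac{p(y)}{Z_y}\, e^{\lambda(y) \cdot \phi(x)}
```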
Careful… I cheated… • What if pMI(x)=0? • Then there is no legal pMI(y|x) … • But we can still define f(y|x) from the exponential form • One can show that it is subnormalized: Σy f(y|x) ≤ 1 • And use f(y|x) for classification! • Solutions are actually very sparse. Many pMI(x) are zero: "Support Assignments"…
A dual formulation • Using convex duality, MinMI can be recast as a dual optimization problem • Called a geometric program • Strict inequalities for x such that p(x)=0 • Avoids dealing with p(x) at all!
A generalization bound • If the estimated means are equal to their true expected values, we can show that the generalization error satisfies a bound in terms of -log2 fMI(y|x) • [Figure: -log2 fMI(y|x) plotted against fMI(y|x) for Y=1]
A Game Theoretic Interpretation • Among all distributions in F(a), why choose MinMI? • The MinMI classifier minimizes the worst case loss in the class • The loss is an upper bound on the generalization error • Minimize a worst case upper bound
MinMI and Joint Typicality • Given a sequence x1,…,xn, the probability that another, independently drawn sequence y1,…,yn looks as if it were drawn from their joint distribution is asymptotically 2-nI(X;Y) • This suggests Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.
I-Projections (Csiszar 75, Amari 82,…) • The I-projection of a distribution q(x) on a set F is the member of F closest to q in KL divergence (see below) • For a set defined by linear constraints the projection has an exponential form • Can be calculated using Generalized Iterative Scaling or gradient methods. Looks familiar?
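The I-projection referred to above, and its exponential form under linear constraints, in standard notation (the slide's formula images are missing):

```latex
p^{\,\mathrm{proj}} \;=\; \arg\min_{p \in \mathcal{F}} D_{\mathrm{KL}}\big(p \,\|\, q\big),
\qquad
\mathcal{F} = \{\, p : \textstyle\sum_x p(x)\,\phi(x) = a \,\}
\;\;\Rightarrow\;\;
p^{\,\mathrm{proj}}(x) \;=\; \frac{1}{Z(\lambda)}\, q(x)\, e^{\lambda \cdot \phi(x)}
```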
The MinMI Algorithm • Initialize pMI(x), e.g. to the uniform distribution • Iterate: • For all y: set pMI(x|y) to be the I-projection of the current pMI(x) on the constraint set for class y • Marginalize: pMI(x) = Σy p(y) pMI(x|y)
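A rough Python sketch of the alternating scheme above on a finite support for X. The inner I-projection is approximated by gradient steps on the multipliers rather than Generalized Iterative Scaling, and all names and step sizes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def minmi(phi, a, p_y, n_outer=100, n_inner=200, lr=0.05):
    """Alternating-projection iteration for MinMI (sketch).

    phi : (n_x, m) array, observables phi(x) on a finite support of X.
    a   : (n_y, m) array, observed class-conditional means a_y.
    p_y : (n_y,) array, empirical class prior.
    """
    n_x, m = phi.shape
    n_y = len(p_y)
    p_x = np.full(n_x, 1.0 / n_x)            # initialize the marginal p(x)
    lam = np.zeros((n_y, m))                  # multipliers lambda(y)

    for _ in range(n_outer):
        p_x_given_y = np.zeros((n_y, n_x))
        for y in range(n_y):
            # Approximate I-projection of p_x onto {p : E_p[phi] = a_y}
            for _ in range(n_inner):
                w = p_x * np.exp(phi @ lam[y])
                p = w / w.sum()
                lam[y] += lr * (a[y] - p @ phi)   # push toward the observed means
            p_x_given_y[y] = p
        p_x = p_y @ p_x_given_y              # marginalize: p(x) = sum_y p(y) p(x|y)
    return p_x, lam
```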
Example: Two moments • Observations are the class conditional mean and variance. • The MaxEnt solution would make p(x|y) a Gaussian. • MinMI solutions are far from Gaussian and discriminate much better. • [Figure: MaxEnt vs. MinMI class-conditional distributions]
Example: Conditional Marginals • Recall that in Naïve Bayes we used the empirical means of the indicator functions of xi=k in each class • We can use the same means for MinMI.
Naïve Bayes Analogs • [Table: classifier forms for Naïve Bayes and the discriminative 1st order loglinear model]
Experiments • 12 UCI datasets with discrete features. Only singleton marginal constraints were used. • Compared to Naïve Bayes and a 1st order loglinear model. • Note: Naïve Bayes and MinMI use exactly the same input. The loglinear model also approximates p(x) and uses more information.
Related ideas • Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR) • Efficient representations of X with respect to Y: The Information Bottleneck approach. • Bounding the information in neural codes from very sparse statistics. • Statistical extension of Support Vector Machines.
Conclusions • MinMI outperforms the discriminative model for small sample sizes • Outperforms the generative model • We presented a method for inferring classifiers based on simple sample means • Unlike generative models, it provides generalization guarantees