
Information geometry



  1. Information geometry

  2. Learning a distribution Toy motivating example: want to model the distribution of English words (in a corpus, or in natural speech, …) Domain: S = {English words} Default choice (without any knowledge): uniform over S. But suppose we are able to collect some simple statistics: Pr(length > 5) = 0.3 Pr(end in ‘e’) = 0.45 Pr(start with ‘s’) = 0.08 etc. Now what distribution should we choose?

  3. Outline The solution – and this talk – involves three intimately related concepts Entropy Exponential families Information projection

  4. Part I: Entropy

  5. Formulating the problem Domain: S = {English words} Measured features for words x ∈ S: T1(x) = 1(length > 5) [= 1 if length > 5, 0 otherwise] T2(x) = 1(x ends in ‘e’) etc. Find a distribution which satisfies the constraints: E T1(x) = 0.3, E T2(x) = 0.45, etc. but which is otherwise as random as possible (ie. makes no assumptions other than these constraints)
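To make the setup concrete, here is a minimal Python sketch of the indicator features and their empirical averages; the word list is a tiny made-up sample, not the corpus from the talk.

```python
# Toy setup: indicator features on words and their empirical averages.
# The word list is a small invented sample, purely for illustration.
words = ["the", "because", "horse", "sentence", "apple", "statistics", "of", "see"]

def T1(x):  # 1(length > 5)
    return 1 if len(x) > 5 else 0

def T2(x):  # 1(x ends in 'e')
    return 1 if x.endswith("e") else 0

# Empirical expectations E T_i(x); these play the role of the
# constraint values b_i = 0.3, 0.45, ... on the slides.
for name, T in [("E T1", T1), ("E T2", T2)]:
    print(name, "=", sum(T(w) for w in words) / len(words))
```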

  6. What is randomness? Let X be a discrete-valued random variable… what is its randomness content? Intuitively: 1. A fair coin has one bit of randomness 2. A biased coin is less random 3. Two independent fair coins have two bits of randomness 4. Two dependent fair coins have less randomness 5. A uniform distribution over 32 possible outcomes has 5 bits of randomness – can describe each outcome using 5 bits

  7. Entropy If X has distribution p(·), its entropy is H(X) = Σx p(x) log (1/p(x)) [logs base 2]. Examples: (i) Fair coin: H = ½ log 2 + ½ log 2 = 1 (ii) Coin with bias ¾: H = ¾ log 4/3 + ¼ log 4 = 0.81 (iii) Coin with bias 0.99: H = 0.99 log 1/0.99 + 0.01 log 1/0.01 = 0.08 (iv) Uniform distribution over k outcomes: H = log k
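As a quick check of these numbers, here is a small Python sketch of the entropy formula (logs base 2); the example distributions are the ones from the slide.

```python
import math

def entropy(p):
    """Shannon entropy in bits: H(p) = sum_x p(x) log2(1/p(x))."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)

print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit
print(entropy([0.75, 0.25]))    # bias-3/4 coin: ~0.81 bits
print(entropy([0.99, 0.01]))    # bias-0.99 coin: ~0.08 bits
print(entropy([1/32] * 32))     # uniform over 32 outcomes: 5.0 bits
```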

  8. Entropy is concave [Figure: the entropy H(p) of a coin with bias p, rising from 0 at p = 0 to 1 bit at p = ½ and back down to 0 at p = 1.]

  9. Properties of entropy Many properties which we intuitively expect from a notion of randomness: 1. Expansibility. If X has distribution (p1, p2, …, pn) and Y has (p1, p2, …, pn, 0), then H(X) = H(Y) 2. Symmetry. eg. Distribution (p,1-p) has the same entropy as (1-p, p). 3. Additivity. If X and Y are independent then H(X,Y) = H(X) + H(Y).

  10. Additivity Quick check: if X ~ p and Y ~ q are independent, then H(X,Y) = Σx,y p(x)q(y) log 1/(p(x)q(y)) = Σx,y p(x)q(y) [log 1/p(x) + log 1/q(y)] = H(X) + H(Y).

  11. Properties of entropy, cont’d 4. Subadditivity. H(X,Y) ≤ H(X) + H(Y) 5. Normalization. A fair coin has entropy one 6. “Small for small probability”. The entropy of a coin with bias p goes to zero as p goes to 0 In fact: Entropy is the only measure which satisfies these six properties! [Aczél-Forte-Ng 1975]

  12. KL divergence Kullback-Leibler divergence (relative entropy): a distance measure between two probability distributions. If p, q have the same domain S: K(p, q) = Σx p(x) log (p(x)/q(x)) … a very fundamental and widely-used distance measure in statistics and machine learning. Warnings: • K(p,q) is not the same as K(q,p) • K(p,q) could be infinite! But at least: K(p,q) ≥ 0, with equality iff p = q

  13. Entropy and KL divergence Say random variable X has distribution p and u is the uniform distribution over domain S: K(p, u) = Σx p(x) log (p(x)|S|) = log |S| − H(X) Therefore, entropy tells us the distance to the uniform distribution! [Also note: H(X) ≤ log |S|]
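A small sketch of the KL divergence and its relation to entropy, on a made-up 4-outcome distribution; it also illustrates the asymmetry warning from the previous slide.

```python
import math

def entropy(p):
    return sum(px * math.log2(1 / px) for px in p if px > 0)

def kl(p, q):
    """K(p, q) = sum_x p(x) log2(p(x)/q(x)); infinite if q(x) = 0 where p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.1, 0.1, 0.1]
u = [0.25] * 4                       # uniform over |S| = 4 outcomes
print(kl(p, u))                      # ~0.643
print(math.log2(4) - entropy(p))     # the same: K(p, u) = log|S| - H(p)
print(kl(u, p))                      # ~0.620: K is not symmetric
```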

  14. Another justification of entropy Let X1, X2, …, Xn be i.i.d. (independent, identically distributed) random variables. Consider the joint distribution over sequences (X1, …, Xn). For large n, put these sequences into two groups: (I) sequences whose probability is roughly 2^(−nH), where H = H(Xi) (II) all other sequences Then group I contains almost all the probability mass! This is the asymptotic equipartition property (AEP).

  15. Asymptotic equipartition [Figure: the space of possible sequences, with the typical set (sequences of probability about 2^(−nH)) inside it.] For large n, the distribution over sequences (X1, …, Xn) looks a bit like a uniform distribution over 2^(nH) possible outcomes. Entropy tells us the “volume” of the typical set.

  16. AEP: examples For large n, the distribution over sequences (X1, …, Xn) looks a bit like a uniform distribution over 2^(nH) possible outcomes. Example: Xi = fair coin Then (X1, …, Xn) = uniform distribution over 2^n outcomes (Group I is everything) Example: Xi = coin with bias ¾ A typical sequence has about 75% heads, and therefore probability around q = (3/4)^(3n/4) (1/4)^(n/4) Notice log q = (3n/4) log ¾ + (n/4) log ¼ = −n H(¼), so the probability of a typical sequence is indeed 2^(−nH)
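The AEP claim can also be checked by simulation. A sketch, assuming a coin with bias 0.75: draw long i.i.d. sequences and compare −(1/n) log₂ p(sequence) with H ≈ 0.81.

```python
import math, random

random.seed(0)
bias, n, trials = 0.75, 10_000, 5
H = bias * math.log2(1 / bias) + (1 - bias) * math.log2(1 / (1 - bias))  # ~0.811 bits

# For a long i.i.d. sequence, -(1/n) log2 p(sequence) should be close to H,
# i.e. the sequence has probability roughly 2^{-nH}.
for _ in range(trials):
    heads = sum(random.random() < bias for _ in range(n))
    log_p = heads * math.log2(bias) + (n - heads) * math.log2(1 - bias)
    print(-log_p / n, "vs H =", H)
```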

  17. Proof of AEP Sketch: −(1/n) log p(X1, …, Xn) = (1/n) Σi log (1/p(Xi)), which by the law of large numbers converges to E log (1/p(X)) = H; so with high probability the sequence probability is close to 2^(−nH).

  18. Back to our main question S = {English words} For x ∈ S, we have features T1(x), …, Tk(x) (eg. T1(x) = 1(length > 5)) Find a distribution p over S which: 1. satisfies certain constraints: E Ti(x) = bi (eg. fraction of words with length > 5 is 0.3) 2. has maximum entropy The “maximum entropy principle”.

  19. Maximum entropy Think of p as a vector of length |S|: maximize H(p) subject to Σx p(x) Ti(x) = bi for each i, Σx p(x) = 1, and p(x) ≥ 0. Maximizing a concave function subject to linear constraints … a convex optimization problem!
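A minimal sketch of this convex program using scipy's SLSQP solver; the 4-outcome domain, the feature matrix, and the targets b are made-up stand-ins for the word example, not data from the talk.

```python
import numpy as np
from scipy.optimize import minimize

# Toy finite domain with two binary features and target expectations b.
T = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # T_i(x) for 4 outcomes
b = np.array([0.3, 0.45])

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # probabilities sum to 1
    {"type": "eq", "fun": lambda p: T.T @ p - b},              # E T(x) = b
]
res = minimize(neg_entropy, x0=np.full(4, 0.25),
               bounds=[(0, 1)] * 4, constraints=constraints, method="SLSQP")
print(res.x)            # the maximum-entropy distribution
print(T.T @ res.x)      # check: expectations ~ [0.3, 0.45]
```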

  20. Alternative formulation Suppose we have a prior π₀, and we want the distribution closest to it (in KL distance) which satisfies the constraints: minimize K(p, π₀) subject to the same constraints. A more general convex optimization problem – to get maximum entropy, choose π₀ to be uniform.

  21. A projection operation Think of this page as the probability simplex (ie. the space of valid probability distributions, viewed as |S|-vectors). [Figure: the prior π₀, the affine subspace L given by the constraints, and the point p where π₀ projects onto L.] p is the I-projection (information projection) of π₀ onto the subspace L

  22. Solution by calculus Use Lagrange multipliers. Solution: p(x) = (1/Z) π₀(x) exp(Σi λi Ti(x)) (Z is a normalizer) This is familiar… the exponential family generated by π₀!

  23. Form of the solution Back to our toy problem: p(x) ∝ exp { λ1 · 1(length > 5) + λ2 · 1(x ends in ‘e’) + … } For instance, if λ2 = 0.81, this says that a word ending in ‘e’ is e^0.81 ≈ 2.25 times as likely as one which doesn’t.

  24. Part II: Exponential families

  25. Exponential families Many of the most common and widely-used probability distributions – such as Gaussian, Poisson, Bernoulli, Multinomial – are exponential families To define an exponential family, start with: Input space S ⊆ R^r Base measure h: R^r → R Features T(x) = (T1(x), …, Tk(x)) The exponential family generated by h and T consists of log-linear models parametrized by θ ∈ R^k: p(x) ∝ e^(θ · T(x)) h(x)

  26. Natural parameter space Input space S ⊆ R^r, base measure h: R^r → R, features T(x) = (T1(x), …, Tk(x)) Log-linear model with parameter θ ∈ R^k: p(x) ∝ e^(θ · T(x)) h(x) Normalize these models to integrate to one: p(x) = e^(θ · T(x) − G(θ)) h(x) where G(θ) = ln Σx e^(θ · T(x)) h(x) [or the appropriate integral] is the log partition function. This integral need not always exist, so define N = {θ ∈ R^k : −∞ < G(θ) < ∞}, the natural parameter space.

  27. Example: Bernoulli S = {0,1} Base measure h = 1 Features T(x) = x Functional form: p(x) ∝ e^(θx) Log partition function: G(θ) = ln(1 + e^θ), defined for all θ ∈ R, so natural parameter space N = R Distribution with parameter θ: p(1) = e^θ/(1 + e^θ), p(0) = 1/(1 + e^θ) We are more accustomed to the parametrization μ ∈ [0,1], with μ = e^θ/(1 + e^θ).
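A short sketch of the Bernoulli family in these terms: the log partition function G, the map G′ from natural parameter θ to expectation parameter μ, and its inverse.

```python
import math

# Bernoulli as an exponential family: p_theta(x) ∝ e^{theta x}, x in {0, 1}.
def G(theta):                      # log partition: ln(1 + e^theta)
    return math.log(1 + math.exp(theta))

def mean_param(theta):             # G'(theta) = e^theta / (1 + e^theta) = mu
    return math.exp(theta) / (1 + math.exp(theta))

def natural_param(mu):             # inverse map: theta = ln(mu / (1 - mu))
    return math.log(mu / (1 - mu))

theta = 1.2
mu = mean_param(theta)
print(mu, natural_param(mu))       # round-trips back to theta = 1.2
# p(1) = e^{theta - G(theta)}, p(0) = e^{-G(theta)}:
print(math.exp(theta - G(theta)), math.exp(-G(theta)))   # equals (mu, 1 - mu)
```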

  28. Parametrization of Bernoulli “Natural” parameter θ ∈ R maps to μ = e^θ/(1 + e^θ), the usual parameter in [0,1], aka the expectation parameter

  29. Example: Poisson The Poisson(λ) distribution over Z+ is given by p(x) = e^(−λ) λ^x / x! Here: S = Z+ Base measure h(x) = 1/x! Feature T(x) = x Functional form: p(x) ∝ e^(θx)/x! Log partition function: G(θ) = ln Σx e^(θx)/x! = e^θ Therefore N = R. Notice θ = ln λ.
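A sketch checking that the natural parametrization θ = ln λ reproduces the usual Poisson pmf; the comparison uses scipy.stats.poisson, and λ = 3 is an arbitrary choice.

```python
import math
from scipy.stats import poisson

lam = 3.0
theta = math.log(lam)            # natural parameter theta = ln(lambda)

# p_theta(x) = e^{theta x - G(theta)} h(x) with h(x) = 1/x! and G(theta) = e^theta.
def p_theta(x):
    return math.exp(theta * x - math.exp(theta)) / math.factorial(x)

for x in range(5):               # matches the usual Poisson(lambda) pmf
    print(x, p_theta(x), poisson.pmf(x, lam))
```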

  30. Example: Gaussian Gaussian with mean μ and variance σ²: p(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)), an exponential family with features T(x) = (x, x²) and natural parameters θ = (μ/σ², −1/(2σ²)).

  31. Properties of exponential families A lot of information in the log-partition function G(θ) = ln Σx e^(θ · T(x)) h(x) • G is strictly convex (eg. recall Poisson: G(θ) = e^θ) • This implies, among other things, that G′ is 1-to-1 • G′(θ) = E T(x) … the mean of the feature values • Check: G″(θ) = var T(x) … the variance of the feature values

  32. Maximum likelihood estimation Exponential family generated by h, T: p(x) = e^(θ · T(x) − G(θ)) h(x) Given data x1, x2, …, xm, find the maximum likelihood θ. Setting derivatives to zero: G′(θ) = (1/m) Σj T(xj) But recall G′(θ) = mean of T(x) under pθ … so just pick the distribution which matches the sample average!

  33. Maximum likelihood, cont’d Example. Fit a Poisson to integer data with mean 7.5. Simple: choose the Poisson with mean 7.5. But what is the natural parameter of this Poisson? Recall: for a Poisson, G(θ) = e^θ So the mean is G′(θ) = e^θ. Choose θ = ln 7.5 Inverting G′ is not always so easy…
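When G′ has no convenient closed-form inverse, a one-dimensional root finder does the job. The sketch below inverts G′ for the Poisson, where the answer θ = ln 7.5 is known, so the numerical result can be checked.

```python
import math
from scipy.optimize import brentq

# Invert G' for the Poisson family: solve G'(theta) = e^theta = sample mean.
# Here the inverse is known (theta = ln(mean)), which lets us check the answer;
# for other families the same one-dimensional root-finding idea applies.
sample_mean = 7.5
theta_hat = brentq(lambda t: math.exp(t) - sample_mean, -10, 10)
print(theta_hat, math.log(7.5))   # both ~ 2.0149
```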

  34. Our toy problem Form of the solution: p(x) ∝ exp { λ1 · 1(length > 5) + λ2 · 1(x ends in ‘e’) + … } We know the expectation parameters and we need the natural parameters…

  35. The two spaces [Figure: two copies of R^k, the natural parameter space N and the expectation parameter space, connected by the 1-1 map G′.] Given data, finding the maximum likelihood distribution is trivial under the expectation parametrization, and is a convex optimization problem under the natural parametrization.

  36. Part III: Information projection

  37. Back to maximum entropy Recall that: exponential families are maximum entropy distributions! Q: Given a prior π₀, and empirical averages E Ti(x) = bi, what is the distribution closest to π₀ that satisfies these constraints? A: It is the unique member of the exponential family generated by π₀ and T which has the given expectations.

  38. Maximum entropy example Q: We are told that a distribution over R has EX = 0 and EX² = 10. What distribution should we pick? A: Features are T(x) = (x, x²). These define the family of Gaussians… so pick the Gaussian N(0,10).
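One way to make the maximum-entropy claim concrete: compare differential entropies (in nats, via the standard closed forms) of a few mean-zero, variance-10 distributions. The Laplace and uniform competitors are my choices for illustration.

```python
import math

# Differential entropies (in nats) of mean-0, variance-10 distributions;
# the Gaussian should come out largest.
var = 10.0

h_gaussian = 0.5 * math.log(2 * math.pi * math.e * var)   # N(0, 10)
b = math.sqrt(var / 2)                                     # Laplace scale: variance 2b^2 = 10
h_laplace = 1 + math.log(2 * b)
w = math.sqrt(12 * var)                                    # uniform width: variance w^2/12 = 10
h_uniform = math.log(w)

print(h_gaussian, h_laplace, h_uniform)   # ~2.57 > ~2.50 > ~2.39
```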

  39. Maximum entropy: restatement Choose: any sample space S ⊆ R^r features T(x) = (T1(x), …, Tk(x)) constraints E T(x) = b and reference prior π₀: R^r → R. If there is a distribution of the form p*(x) = e^(θ · T(x) − G(θ)) π₀(x) satisfying the constraints, then it is the unique minimizer of K(p, π₀) subject to these constraints.

  40. Proof Consider any other distribution p which satisfies the constraints. We will show K(p, π₀) > K(p*, π₀). Hmm… an interesting relation: K(p, π₀) = K(p, p*) + K(p*, π₀)

  41. Geometric interpretation This page is the probability simplex (space of valid probability |S|-vectors). [Figure: the prior π₀, the affine subspace L given by the constraints, the projection p*, and another point p in L.] p* is the I-projection of π₀ onto L K(p, π₀) = K(p, p*) + K(p*, π₀) … Pythagorean thm!

  42. More geometry Let Q be the set of distributions e^(θ · T(x) − G(θ)) π₀(x), θ ∈ R^k L = affine subspace given by constraints p* = I-projection of π₀ onto L [Figure: inside the simplex, L and Q intersect at p*.] dim(simplex) = |S| − 1, dim(L) = |S| − k − 1, dim(Q) = k

  43. Max entropy vs. max likelihood Given data x1, …, xm which define constraints E Ti(x) = bi the following are equivalent: 1. p* is the I-projection of π₀ onto L; that is, p* minimizes K(p, π₀) 2. p* is the maximum likelihood distribution in Q 3. p* ∈ L ∩ Q

  44. An algorithm for I-projection Goal: project the prior π₀ onto the constraint-satisfying affine subspace L = {p : E_p Ti(x) = bi, i = 1, 2, …, k} Define Li = {p : E_p Ti(x) = bi} (just the ith constraint) Algorithm (Csiszár): Let p0 = π₀ Loop until convergence: p(t+1) = I-projection of p(t) onto L(t mod k) Reduce a multidimensional problem to a series of one-dimensional problems.

  45. One-dimensional I-projection [Figure: the prior π₀ and iterates p1, p2, p3, p4 alternating between the constraint sets L1 and L2.] Projecting p(t) onto Li: find λi such that p(t)(x) e^(λi Ti(x)) (renormalized) has E Ti(x) = bi. Equivalently, find λi such that G′(λi) = bi … eg. by line search.
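Here is a minimal sketch of Csiszár's cyclic algorithm on a finite domain, with the one-dimensional projections done by line search (brentq); the 4-outcome domain, features, and targets are the same made-up toy as earlier, and a fixed iteration count stands in for a real convergence test.

```python
import numpy as np
from scipy.optimize import brentq

T = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # T_i(x), 4 outcomes, k = 2
b = np.array([0.3, 0.45])
prior = np.full(4, 0.25)                                       # pi_0 = uniform

def project_one(p, i):
    """One-dimensional I-projection onto L_i = {p : E_p T_i(x) = b_i}.
    Tilt p by e^{lambda T_i(x)} and pick lambda (by line search) so the constraint holds."""
    def gap(lam):
        q = p * np.exp(lam * T[:, i])
        q /= q.sum()
        return q @ T[:, i] - b[i]
    lam = brentq(gap, -50, 50)
    q = p * np.exp(lam * T[:, i])
    return q / q.sum()

p = prior.copy()
for t in range(100):                 # cycle through the constraints
    p = project_one(p, t % len(b))
print(p, T.T @ p)                    # expectations converge to [0.3, 0.45]
```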

  46. Proof of convergence We get closer to p* on each iteration: K(p*, p(t+1)) = K(p*, p(t)) − K(p(t+1), p(t)) [Figure: p(t), its projection p(t+1) onto Li, and p*.]

  47. Other methods for I-projection Csiszár’s method is sequential – one constraint i at a time Iterative scaling (Darroch and Ratcliff) is parallel – all i are updated in each step Many variants on iterative scaling Gradient methods

  48. Postscript: Bregman divergences Projections with respect to KL divergence: much in common with squared Euclidean distance, eg. Pythagorean theorem. Q: What other distance measures also share these properties? A: Bregman divergences. Each exponential family has a “natural” distance measure associated with it, its Bregman divergence. Gaussian: squared Euclidean distance Multinomial: KL divergence
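A sketch of the general definition, B_F(p, q) = F(p) − F(q) − ⟨∇F(q), p − q⟩, checking the two cases named on the slide: F = squared norm gives squared Euclidean distance, and F = negative entropy gives KL divergence. The example points are arbitrary.

```python
import numpy as np

# Bregman divergence generated by a convex F: B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>.
def bregman(F, gradF, p, q):
    return F(p) - F(q) - gradF(q) @ (p - q)

sq = lambda x: x @ x                          # F(x) = ||x||^2
grad_sq = lambda x: 2 * x
negent = lambda p: np.sum(p * np.log(p))      # F(p) = sum p log p (negative entropy)
grad_negent = lambda p: np.log(p) + 1

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(bregman(sq, grad_sq, p, q), np.sum((p - q) ** 2))               # squared Euclidean
print(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p / q)))  # KL divergence
```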

  49. Postscript: Bregman divergences Can define projections with respect to arbitrary Bregman divergences. Many machine learning tasks can then be seen as information projection: boosting, iterative scaling,…
