Maximum Likelihood and the Information Bottleneck By Noam Slonim & Yair Weiss 2/19/2003
Overview • Main contribution • Defines a mapping between ML for mixture models and the iterative IB algorithm • Under some initial conditions, an algorithm for one gives a solution for the other • Of theoretical and practical interest • ML "ideal" vs. IB "real" • Using the other framework's algorithm could improve performance
IB Intuition & Review • Given r.v. X and Y with joint p(x,y) • Re-represent X with clusters T that preserve information about Y • Find a compressed representation T of X via a stochastic mapping q(t|x) • The choice of q(t|x) must minimize the IB functional L_IB = I(T;X) − β·I(T;Y), with |T| and β fixed • minimizing I(T;X) maximizes compression • maximizing I(T;Y) minimizes distortion
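To make the functional concrete, here is a minimal NumPy sketch (mine, not from the slides) that evaluates L_IB = I(T;X) − β·I(T;Y) from a joint p(x,y) and a soft assignment q(t|x); the function names and array conventions are assumptions of this sketch.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats for a joint distribution given as a 2-D array."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa * pb)[mask])))

def ib_functional(pxy, q_t_given_x, beta):
    """L_IB = I(T;X) - beta * I(T;Y) for a soft assignment q(t|x)."""
    px = pxy.sum(axis=1)                     # p(x)
    q_xt = q_t_given_x * px[:, None]         # joint q(x,t) = p(x) q(t|x)
    q_ty = q_t_given_x.T @ pxy               # joint q(t,y) = sum_x q(t|x) p(x,y)
    return mutual_information(q_xt) - beta * mutual_information(q_ty)
```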
IB Review • Additionally, given q(t|x): q(t) = Σ_x p(x) q(t|x) and q(y|t) = (1/q(t)) Σ_x p(x,y) q(t|x) • so fixing q(t|x) determines q(t) and q(y|t) • From the previous paper, minimizing L_IB leads to the self-consistent update q(t|x) ∝ q(t) exp(−β·D_KL[p(y|x) || q(y|t)]) • Use an initial q(t|x) to get q(t), q(y|t), and iterate (sketch below)
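A minimal sketch of this self-consistent iteration, assuming p(x,y) and an initial q(t|x) are given as NumPy arrays; the epsilon smoothing and the fixed iteration count are my own choices rather than anything specified in the paper.

```python
import numpy as np

def iterative_ib(pxy, q_t_given_x, beta, n_iter=100, eps=1e-12):
    """Iterate: q(t|x) -> q(t), q(y|t) -> new q(t|x) proportional to
    q(t) * exp(-beta * KL[p(y|x) || q(y|t)])."""
    px = pxy.sum(axis=1)                              # p(x)
    p_y_given_x = pxy / (px[:, None] + eps)           # p(y|x)
    for _ in range(n_iter):
        qt = q_t_given_x.T @ px                       # q(t) = sum_x p(x) q(t|x)
        q_ty = q_t_given_x.T @ pxy                    # joint q(t,y)
        q_y_given_t = q_ty / (q_ty.sum(axis=1, keepdims=True) + eps)
        # KL[p(y|x) || q(y|t)] for every (x, t) pair
        log_ratio = (np.log(p_y_given_x + eps)[:, None, :]
                     - np.log(q_y_given_t + eps)[None, :, :])
        kl = np.einsum('xy,xty->xt', p_y_given_x, log_ratio)
        logits = np.log(qt + eps)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x
```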
ML for mixture models • Generative process: • for each x, choose a topic t with probability π(t) • Y generated by a multinomial distribution β(y|t) • choose t to maximize this probability • but we don't know π(t) or β(y|t) • We don't have p(x,y) either, just sample counts n(x,y) • Use EM to find π, β that maximize the likelihood of seeing n(x,y) given the t's
EM • Iterative algorithm to compute the ML estimate • E step • denote the posterior over t for each x as q(t|x) • set q(t|x) = π(t) Π_y β(y|t)^n(x,y) / k(x) • k(x) is a normalization factor, so that Σ_t q(t|x) = 1
EM con’t • M step • set • set • Alternative free energy version:
ML ↔ IB mapping • Fairly straightforward mapping • q(t|x) ↔ q(t|x) • π(t) ↔ q(t), β(y|t) ↔ q(y|t) • Since we can't map the corresponding parameter distributions directly, we map q(t|x) and then perform an M-step or an IB-step (sketch below).
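A hedged sketch of the mapping as described on this slide: only q(t|x) is carried across, and the remaining distributions are then recomputed, either by a single M-step (toward ML) or by recomputing q(t) and q(y|t) from q(t|x) (toward IB). The function names and the uniform weighting in the M-step are my own assumptions.

```python
import numpy as np

def ib_to_ml(q_t_given_x, nxy, eps=1e-12):
    """Carry q(t|x) over from an IB solution, then one M-step gives pi, beta."""
    pi = q_t_given_x.mean(axis=0)                               # pi(t)
    counts = q_t_given_x.T @ nxy
    beta = counts / (counts.sum(axis=1, keepdims=True) + eps)   # beta(y|t)
    return pi, beta

def ml_to_ib(q_t_given_x, pxy):
    """Carry q(t|x) over from an ML solution, then recompute q(t), q(y|t)."""
    px = pxy.sum(axis=1)
    qt = q_t_given_x.T @ px                                     # q(t)
    q_ty = q_t_given_x.T @ pxy
    q_y_given_t = q_ty / q_ty.sum(axis=1, keepdims=True)        # q(y|t)
    return qt, q_y_given_t
```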
Observations • When X is uniformly distributed, the mapping is equivalent to mapping the parameter distributions directly • The M-step and the IB-step are mathematically equivalent • When X is uniform, EM is equivalent to the iterative IB algorithm with r = |X| • equivalence of the E-step to the IB step that sets q(t|x) • since the exponent Σ_y n(x,y) log β(y|t) equals −n(x)·D_KL[p(y|x) || β(y|t)] up to a term that does not depend on t
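A toy numerical check of the E-step / IB-step equivalence. This is an identity that follows from the algebra sketched above rather than a quotation from the slides: it assumes every x has the same number of observations (called n_per_x here) and that the IB trade-off parameter is set to that value; all variable names and the random test data are mine.

```python
import numpy as np

# Toy check: if every x has the same number of observations n_per_x, then the
# EM E-step posterior  pi(t) * prod_y beta(y|t)^n(x,y)  is proportional to the
# IB-step update  q(t) * exp(-beta_ib * KL[p(y|x) || q(y|t)])  with
# q(t) = pi(t), q(y|t) = beta(y|t) and beta_ib = n_per_x.
rng = np.random.default_rng(1)
n_x, n_y, n_t, n_per_x = 6, 8, 3, 20
true_p = rng.dirichlet(np.ones(n_y), size=n_x)
nxy = np.array([rng.multinomial(n_per_x, p) for p in true_p])   # n(x) constant
pi = rng.dirichlet(np.ones(n_t))                                # pi(t)
beta_param = rng.dirichlet(np.ones(n_y), size=n_t)              # beta(y|t)
eps = 1e-12

# E-step, computed in log space and normalized over t
log_e = np.log(pi)[None, :] + nxy @ np.log(beta_param + eps).T
e_step = np.exp(log_e - log_e.max(axis=1, keepdims=True))
e_step /= e_step.sum(axis=1, keepdims=True)

# IB-step with q(t) = pi, q(y|t) = beta(y|t), beta_ib = n_per_x
p_hat = nxy / nxy.sum(axis=1, keepdims=True)                    # empirical p(y|x)
kl = np.einsum('xy,xty->xt', p_hat,
               np.log(p_hat + eps)[:, None, :] - np.log(beta_param + eps)[None, :, :])
log_ib = np.log(pi)[None, :] - n_per_x * kl
ib_step = np.exp(log_ib - log_ib.max(axis=1, keepdims=True))
ib_step /= ib_step.sum(axis=1, keepdims=True)

print(np.allclose(e_step, ib_step))   # expected: True
```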
Main Equivalence Claims • When X is uniform and r = |X|, all fixed points of the likelihood L are fixed points of the IB functional L_IB under the mapping above • at the fixed points, F and L_IB differ only by a fixed affine transformation • Any algorithm that finds a fixed point of L induces a fixed point of L_IB. If there is more than one, the one that maximizes L minimizes L_IB.
Claims (2) • For N → ∞ or β → ∞, all the fixed points of L are mapped to the fixed points of L_IB • again, the same relation holds at the fixed points • Again, any algorithm that finds a fixed point in one domain induces one in the other.
Simulations • How do we know when N or β is large enough to use the mapping? • Empirical validation: • Newsgroup clustering experiment • |X| = 500 documents, |Y| = 2000 words, |T| = 10 clusters • N = 43433 word occurrences in one data set, N = 2171 in a pruned set
Simulation results • At small values of N, the differences between the EM and iterative IB solutions are more prominent
Discussion • At higher values of N, EM can converge to a smaller value of L_IB after the mapping than iterative IB does, and vice versa • Mentions an alternative formulation of IB: minimize the KL divergence between the empirical joint distribution and the family of distributions for which the mixture model assumption holds • For smaller sample sizes, the freedom of choosing β in IB seems beneficial
Conclusion • An interesting reformulation of IB in the standard mixture model setting for clustering • Interesting theoretical results, with possible practical advantages from mapping solutions from one framework to the other.