Maximum Likelihood and the Information Bottleneck By Noam Slonim & Yair Weiss 2/19/2003
Overview • Main contribution • Defines a mapping between ML for mixture models and the iterative IB algorithm • Under some initial conditions, an algorithm for one gives a solution for the other • Of theoretical and practical interest • ML "ideal" vs. IB "real" • Using the other framework's algorithm could improve performance
IB Intuition & Review • Given r.v. X and Y with joint p(x,y) • Re-represent X with clusters T that preserve information about Y • Find a compressed representation T of X via a stochastic mapping q(t|x) • The choice of q(t|x) must minimize the IB functional L_IB = I(T;X) − β·I(T;Y), with |T| and β fixed • minimizing I(T;X) maximizes compression • maximizing I(T;Y) minimizes distortion
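To make the functional concrete, here is a minimal NumPy sketch (mine, not from the slides) that evaluates L_IB = I(T;X) − β·I(T;Y) from a joint p(x,y) and a soft assignment q(t|x); the function names and array conventions are assumptions of this sketch.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats for a joint distribution given as a 2-D array."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa * pb)[mask])))

def ib_functional(pxy, q_t_given_x, beta):
    """L_IB = I(T;X) - beta * I(T;Y) for a soft assignment q(t|x)."""
    px = pxy.sum(axis=1)                     # p(x)
    q_xt = q_t_given_x * px[:, None]         # joint q(x,t) = p(x) q(t|x)
    q_ty = q_t_given_x.T @ pxy               # joint q(t,y) = sum_x q(t|x) p(x,y)
    return mutual_information(q_xt) - beta * mutual_information(q_ty)
```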
IB Review • Additionally, given q(t|x): q(t) = Σ_x p(x) q(t|x) and q(y|t) = (1/q(t)) Σ_x p(x,y) q(t|x) • so fixing q(t|x) determines q(t) and q(y|t) • From the previous paper, minimizing L_IB leads to the self-consistent update q(t|x) ∝ q(t) exp(−β·D_KL[p(y|x) || q(y|t)]) • Use an initial q(t|x) to get q(t), q(y|t), and iterate (sketch below)
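A minimal sketch of this self-consistent iteration, assuming p(x,y) and an initial q(t|x) are given as NumPy arrays; the epsilon smoothing and the fixed iteration count are my own choices rather than anything specified in the paper.

```python
import numpy as np

def iterative_ib(pxy, q_t_given_x, beta, n_iter=100, eps=1e-12):
    """Iterate: q(t|x) -> q(t), q(y|t) -> new q(t|x) proportional to
    q(t) * exp(-beta * KL[p(y|x) || q(y|t)])."""
    px = pxy.sum(axis=1)                              # p(x)
    p_y_given_x = pxy / (px[:, None] + eps)           # p(y|x)
    for _ in range(n_iter):
        qt = q_t_given_x.T @ px                       # q(t) = sum_x p(x) q(t|x)
        q_ty = q_t_given_x.T @ pxy                    # joint q(t,y)
        q_y_given_t = q_ty / (q_ty.sum(axis=1, keepdims=True) + eps)
        # KL[p(y|x) || q(y|t)] for every (x, t) pair
        log_ratio = (np.log(p_y_given_x + eps)[:, None, :]
                     - np.log(q_y_given_t + eps)[None, :, :])
        kl = np.einsum('xy,xty->xt', p_y_given_x, log_ratio)
        logits = np.log(qt + eps)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x
```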
ML for mixture models • Generative process: • for each x, choose a topic t with probability π(t) • Y generated by a multinomial distribution β(y|t) • choose t to maximize this probability • but we don't know π(t) or β(y|t) • We don't have p(x,y) either, just sample counts n(x,y) • Use EM to find π, β that maximize the likelihood of seeing n(x,y) given the t's
EM • Iterative algorithm to compute the ML estimate • E step • denote the posterior over t for each x as q(t|x) • set q(t|x) = π(t) Π_y β(y|t)^n(x,y) / k(x) • k(x) is a normalization factor, so that Σ_t q(t|x) = 1
EM con’t • M step • set • set • Alternative free energy version:
ML ↔ IB mapping • Fairly straightforward mapping • q(t|x) ↔ q(t|x) • π(t) ↔ q(t), β(y|t) ↔ q(y|t) • Since we can't map the corresponding parameter distributions directly, we map q(t|x) and then perform an M-step or an IB-step (sketch below).
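A hedged sketch of the mapping as described on this slide: only q(t|x) is carried across, and the remaining distributions are then recomputed, either by a single M-step (toward ML) or by recomputing q(t) and q(y|t) from q(t|x) (toward IB). The function names and the uniform weighting in the M-step are my own assumptions.

```python
import numpy as np

def ib_to_ml(q_t_given_x, nxy, eps=1e-12):
    """Carry q(t|x) over from an IB solution, then one M-step gives pi, beta."""
    pi = q_t_given_x.mean(axis=0)                               # pi(t)
    counts = q_t_given_x.T @ nxy
    beta = counts / (counts.sum(axis=1, keepdims=True) + eps)   # beta(y|t)
    return pi, beta

def ml_to_ib(q_t_given_x, pxy):
    """Carry q(t|x) over from an ML solution, then recompute q(t), q(y|t)."""
    px = pxy.sum(axis=1)
    qt = q_t_given_x.T @ px                                     # q(t)
    q_ty = q_t_given_x.T @ pxy
    q_y_given_t = q_ty / q_ty.sum(axis=1, keepdims=True)        # q(y|t)
    return qt, q_y_given_t
```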
Observations • When X is uniformly distributed, the mapping is equivalent to mapping the parameter distributions directly • The M-step and the IB-step are mathematically equivalent • When X is uniform, EM is equivalent to the iterative IB algorithm with r = |X| • equivalence of the E-step to the IB step that sets q(t|x) • since the exponent Σ_y n(x,y) log β(y|t) equals −n(x)·D_KL[p(y|x) || β(y|t)] up to a term that does not depend on t
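A toy numerical check of the E-step / IB-step equivalence. This is an identity that follows from the algebra sketched above rather than a quotation from the slides: it assumes every x has the same number of observations (called n_per_x here) and that the IB trade-off parameter is set to that value; all variable names and the random test data are mine.

```python
import numpy as np

# Toy check: if every x has the same number of observations n_per_x, then the
# EM E-step posterior  pi(t) * prod_y beta(y|t)^n(x,y)  is proportional to the
# IB-step update  q(t) * exp(-beta_ib * KL[p(y|x) || q(y|t)])  with
# q(t) = pi(t), q(y|t) = beta(y|t) and beta_ib = n_per_x.
rng = np.random.default_rng(1)
n_x, n_y, n_t, n_per_x = 6, 8, 3, 20
true_p = rng.dirichlet(np.ones(n_y), size=n_x)
nxy = np.array([rng.multinomial(n_per_x, p) for p in true_p])   # n(x) constant
pi = rng.dirichlet(np.ones(n_t))                                # pi(t)
beta_param = rng.dirichlet(np.ones(n_y), size=n_t)              # beta(y|t)
eps = 1e-12

# E-step, computed in log space and normalized over t
log_e = np.log(pi)[None, :] + nxy @ np.log(beta_param + eps).T
e_step = np.exp(log_e - log_e.max(axis=1, keepdims=True))
e_step /= e_step.sum(axis=1, keepdims=True)

# IB-step with q(t) = pi, q(y|t) = beta(y|t), beta_ib = n_per_x
p_hat = nxy / nxy.sum(axis=1, keepdims=True)                    # empirical p(y|x)
kl = np.einsum('xy,xty->xt', p_hat,
               np.log(p_hat + eps)[:, None, :] - np.log(beta_param + eps)[None, :, :])
log_ib = np.log(pi)[None, :] - n_per_x * kl
ib_step = np.exp(log_ib - log_ib.max(axis=1, keepdims=True))
ib_step /= ib_step.sum(axis=1, keepdims=True)

print(np.allclose(e_step, ib_step))   # expected: True
```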
Main Equivalence Claims • When X is uniform and r = |X|, all fixed points of the likelihood L are fixed points of the IB functional L_IB under the mapping above • at the fixed points, F and L_IB differ only by a fixed affine transformation • Any algorithm that finds a fixed point of L induces a fixed point of L_IB. If there is more than one, the one that maximizes L minimizes L_IB.
Claims (2) • For N → ∞ or β → ∞, all the fixed points of L are mapped to the fixed points of L_IB • again, the same relation holds at the fixed points • Again, any algorithm that finds a fixed point in one domain induces one in the other.
Simulations • How do we know when N or β is large enough to use the mapping? • Empirical validation: • Newsgroup clustering experiment • |X| = 500 documents, |Y| = 2000 words, |T| = 10 clusters • N = 43433 word occurrences in one data set, N = 2171 in a pruned set
Simulation results • At small values of N, the differences between the EM and iterative IB solutions are more prominent
Discussion • At higher values of N, EM can converge to a smaller value of L_IB after the mapping than iterative IB does, and vice versa • Mentions an alternative formulation of IB: minimize the KL divergence between the empirical joint distribution and the family of distributions for which the mixture model assumption holds • For smaller sample sizes, the freedom of choosing β in IB seems beneficial
Conclusion • An interesting reformulation of IB in the standard mixture model setting for clustering • Interesting theoretical results, with possible practical advantages from mapping solutions from one framework to the other.