From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models
Chenguang Lu (鲁晨光), lcguang@foxmail.com, 2018-11-10
Homepage: http://survivor99.com/ and http://www.survivor99.com/lcg/english/
This PPT may be downloaded from http://survivor99.com/lcg/CM/CM4mix.ppt
Mixture Models
• Sampling distribution: P(X) = ∑j P*(yj)P(X|θj*)
• Predicted distribution: Pθ(X), produced by θ=(μ,σ) and P(Y)
• Aim of the iterations: make the relative entropy (KL divergence) H(P||Pθ) shrink toward 0, starting from Pθ(X) ≠ P(X) and ending with Pθ(X) ≈ P(X).
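For concreteness, here is a minimal Python sketch of the two distributions above on a discretized X, together with H(P||Pθ); the grid and all parameter values are illustrative, not taken from the slides.

```python
import numpy as np
from scipy.stats import norm

# Discretize X so the distributions can be handled as vectors.
x = np.linspace(-10, 70, 801)

# Sampling distribution P(X) = sum_j P*(y_j) P(X|theta_j*)  (true parameters)
true_w, true_mu, true_sd = [0.7, 0.3], [20.0, 45.0], [5.0, 5.0]
P = sum(w * norm.pdf(x, m, s) for w, m, s in zip(true_w, true_mu, true_sd))
P /= P.sum()

# Predicted distribution P_theta(X), produced by guessed theta = (mu, sigma) and P(Y)
guess_w, guess_mu, guess_sd = [0.5, 0.5], [15.0, 40.0], [8.0, 8.0]
P_theta = sum(w * norm.pdf(x, m, s) for w, m, s in zip(guess_w, guess_mu, guess_sd))
P_theta /= P_theta.sum()

# Relative entropy (KL divergence) H(P||P_theta); the iterations aim to drive it toward 0.
print("H(P||P_theta) =", np.sum(P * np.log(P / P_theta)))
```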
The EM Algorithm for Mixture Models
• The popular EM algorithm and its convergence proof.
• The log-likelihood is a negative general entropy; Q is a negative general joint entropy (in short, negative entropy).
• E-step: put P(yj|xi, θ) into Q.
• M-step: maximize Q.
• Popular convergence proof (using Jensen's inequality):
  1) increasing Q maximizes logP(X|θ);
  2) Q is increasing in every M-step and non-decreasing in every E-step.
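For reference, a minimal sketch of the standard E-step and M-step for a one-dimensional Gaussian mixture; this is generic EM, not the author's code, and the sample and starting values are illustrative.

```python
import numpy as np

def em_step(x, w, mu, sd):
    """One EM iteration for a 1-D Gaussian mixture, sample x of shape (N,)."""
    # E-step: responsibilities P(y_j | x_i, theta)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    joint = w * dens                                   # P(y_j) P(x_i|theta_j)
    resp = joint / joint.sum(axis=1, keepdims=True)    # put P(y_j|x_i, theta) into Q

    # M-step: maximize Q, giving the usual weighted updates
    Nj = resp.sum(axis=0)
    w_new = Nj / len(x)
    mu_new = (resp * x[:, None]).sum(axis=0) / Nj
    sd_new = np.sqrt((resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nj)
    return w_new, mu_new, sd_new

# Illustrative run: a two-component sample and rough starting values
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(20, 5, 700), rng.normal(45, 5, 300)])
w, mu, sd = np.array([0.5, 0.5]), np.array([15.0, 40.0]), np.array([8.0, 8.0])
for _ in range(50):
    w, mu, sd = em_step(x, w, mu, sd)
print(w.round(3), mu.round(2), sd.round(2))
```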
The First Problem with the Convergence Proof of the EM Algorithm: Q May Be Greater than Q*
• Assume P(y1)=P(y2)=0.5; µ1=µ1*, µ2=µ2*; σ1=σ2=σ.
• (Figure: logP(XN,Y|θ) over the iterations; Q reaches -6.75N, greater than the target Q*=-6.89N.)
• [1] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977).
• [2] Wu, C. F. J.: On the Convergence Properties of the EM Algorithm. Annals of Statistics 11, 95–103 (1983).
The Second Problem with the EM Algorithm's Convergence Proof
• P(Y|X) from the E-step is not a proper Shannon channel, because the new mixture component P(X|θj+1) = P(X)P(X|θj)/Pθ(X) is not normalized.
• It is possible that ∑i P(xi|θ1+1) > 1.6 while ∑i P(xi|θ0+1) < 0.4.
The CM-EM Algorithm for Mixture Models
• Basic idea: minimize R-G = I(X;Y)-I(X;θ), since minimizing R-G is equivalent to minimizing H(P||Pθ).
• E1-step: the same as the E-step of EM.
• E2-step: modify P(Y) by replacing it with P+1(Y), repeating until P+1(Y) no longer changes.
• MG-step: maximize the semantic mutual information G = I(X;θ); for Gaussian distributions, closed-form updates of μ and σ are used (formulas on the slide).
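The following rough Python sketch shows one CM-EM iteration as I read this slide: the E1-step and E2-step repeatedly update P(Y|X) and P(Y) until P+1(Y) stops changing, and the MG-step then updates the Gaussian parameters. The inner-loop tolerance and the weighted mean/variance form of the MG-step are assumptions of this sketch, not taken verbatim from the author.

```python
import numpy as np
from scipy.stats import norm

def cm_em_step(x, P, w, mu, sd, inner_tol=1e-8):
    """One CM-EM iteration on a discretized sampling distribution.

    x: grid of X values; P: sampling distribution P(X) on the grid;
    w: current P(Y) as an array; mu, sd: current Gaussian parameters.
    """
    dens = np.stack([norm.pdf(x, m, s) for m, s in zip(mu, sd)], axis=1)  # P(x_i|theta_j)

    # E1-step (= EM's E-step) and E2-step: replace P(Y) by
    # P+1(y_j) = sum_i P(x_i) P(y_j|x_i) until P(Y) stops changing.
    while True:
        post = w * dens                                  # P(y_j) P(x_i|theta_j)
        post /= post.sum(axis=1, keepdims=True)          # P(y_j|x_i)
        w_new = P @ post                                 # P+1(y_j)
        if np.abs(w_new - w).max() < inner_tol:
            break
        w = w_new

    # MG-step: maximize G = I(X;theta); for Gaussian components this is taken
    # here to be the weighted mean/variance update (an assumption of this sketch).
    wt = P[:, None] * post                               # P(x_i) P(y_j|x_i)
    Nj = wt.sum(axis=0)
    mu_new = (wt * x[:, None]).sum(axis=0) / Nj
    sd_new = np.sqrt((wt * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nj)
    return w, mu_new, sd_new
```

Used with a grid and a discretized P(X) like those in the first sketch, repeating cm_em_step until H(P||Pθ) falls below a threshold mirrors the iteration pictured on the mixture-model slide.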
Comparing the CM-EM and EM Algorithms for Mixture Models
• Writing Q of EM in cross-entropies: maximizing Q = minimizing H(X|θ) and Hθ(Y).
• The CM-EM does not minimize Hθ(Y); it modifies P(Y) so that P+1(Y) = P(Y), i.e., H(Y+1||Y) = 0.
• Relationship: the E-step of EM = the E1-step of CM-EM; the M-step of EM ≈ the (E2-step + MG-step) of CM-EM.
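The cross-entropy form of Q referred to above was an image on the slide; writing out Q under its usual definition gives roughly the following decomposition, where the names Hθ(Y) and H(X|θ) are my reading of the slide's notation.

```latex
% Q written with cross-entropies (reconstruction; notation assumed)
Q = N\sum_i\sum_j P(x_i)\,P(y_j\mid x_i,\theta)\,\log\big[P(y_j)\,P(x_i\mid\theta_j)\big]
  = -N\big[H_\theta(Y) + H(X\mid\theta)\big],
\quad
H_\theta(Y) = -\sum_j P^{+1}(y_j)\log P(y_j),
\quad
H(X\mid\theta) = -\sum_i\sum_j P(x_i)\,P(y_j\mid x_i,\theta)\,\log P(x_i\mid\theta_j),
\quad
P^{+1}(y_j) = \sum_i P(x_i)\,P(y_j\mid x_i,\theta).
```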
Comparing the CM-EM and MM Algorithms
• Neal and Hinton define F = Q + NH(Y) = -NH(X,Y|θ) + NH(Y) ≈ -NH(X|θ), then maximize F in both the M-step and the E-step.
• CM-EM maximizes G = H(X) - H(X|θ) in the MG-step, so the MG-step is similar to the M-step of the MM algorithm.
• Maximizing F is similar to minimizing H(X|θ) or maximizing G.
• If we replace H(Y) with Hθ(Y) in F, then the M-step of MM is the same as the MG-step.
• However, the E2-step does not maximize G; it minimizes H(Y+1||Y).
An Iterative Example of Mixture Models with R<R* or Q<Q*
• The number of iterations is 5.
• Both CM_G and EM_Q increase monotonically.
• H(Q||P) = R(G)-G → 0.
A Counterexample with R>R* or Q>Q* against the EM Convergence Proof
• True, starting, and ending parameters: (table on slide).
• The number of iterations is 5.
• Excel demo files can be downloaded from: http://survivor99.com/lcg/cc-iteration.zip
Illustrating the Convergence of the CM-EM Algorithm for R>R* and R<R*
• A counterexample against the EM: Q is decreasing.
• The central idea of the CM is:
  1) finding the point G ≈ R on the two-dimensional R-G plane, while also driving R → R* (the EM algorithm neglects R → R*);
  2) minimizing H(Q||P) = R(G)-G (similar to a min-max method).
• Two examples: one starts with R<R* or Q<Q*, the other with R>R* or Q>Q*; both reach the target.
Comparing the Iteration Numbers of the CM-EM, EM, and MM Algorithms
• For the same example used by Neal and Hinton:
  • the EM algorithm needs 36 iterations;
  • the MM algorithm (Neal and Hinton) needs 18 iterations;
  • the CM-EM algorithm needs only 9 iterations.
• References:
  1. Lu, Chenguang: From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models, http://arxiv.org/a/lu_c_3.
  2. Neal, Radford; Hinton, Geoffrey: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf.
Fundamentals for the Convergence Proof 1: Semantic Information Is Defined with Log-normalized-likelihood
• Semantic information conveyed by yj about xi:
• Averaging I(xi;θj) to get the semantic Kullback-Leibler information:
• Averaging I(X;θj) to get the semantic mutual information:
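The three formulas were images on the original slide; based on the title's log-normalized-likelihood definition and the author's related papers on semantic information, they presumably take roughly the following form (a reconstruction, not a verbatim copy).

```latex
% Presumed form of the three measures (the slide's formulas were images)
I(x_i;\theta_j) = \log\frac{T(\theta_j\mid x_i)}{T(\theta_j)}
                = \log\frac{P(x_i\mid\theta_j)}{P(x_i)}
\quad\text{(semantic information of $y_j$ about $x_i$)}

I(X;\theta_j) = \sum_i P(x_i\mid y_j)\,\log\frac{P(x_i\mid\theta_j)}{P(x_i)}
\quad\text{(semantic Kullback-Leibler information)}

I(X;\theta) = \sum_j P(y_j)\sum_i P(x_i\mid y_j)\,\log\frac{P(x_i\mid\theta_j)}{P(x_i)}
\quad\text{(semantic mutual information)}
```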
From Shannon's Channel to the Semantic Channel (yj fixed, X varies)
• The Shannon channel consists of transition probability functions:
• The semantic channel consists of truth functions:
• The semantic mutual information formula:
• We may fix one channel and optimize the other, alternately.
Fundamentals for the Convergence Proof 2: From the R(D) Function to the R(G) Function
• Shannon's information rate-distortion function R(D): the minimum R for a given D.
• Replacing D with G, we obtain the R(G) function.
• All R(G) functions are bowl-like.
• (Figure: R(G) curve with the matching point.)
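For reference, the standard rate-distortion definition and the analogous R(G) definition suggested by the slide can be written as follows; the exact form of the constraint in R(G) is my reading, not a quotation.

```latex
% Rate-distortion function and the presumed R(G) analogue (constraint form assumed)
R(D) = \min_{P(y\mid x):\ \bar{d}(X,Y)\le D} I(X;Y),
\qquad
R(G) = \min_{P(y\mid x):\ I(X;\theta)=G} I(X;Y).
```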
Fundamentals for the Convergence Proof 2: Two Kinds of Mutual Matching
• 1. For maximum mutual information classifications: mutual matching for maximum R and G.
• 2. For mixture models: mutual matching for minimum R-G.
• (Figure: matching point on the R(G) curve.)
The Semantic Channel Matches Shannon's Channel
• Optimize the truth function and the semantic channel:
• When the sample is large enough, the optimized truth function is proportional to the transition probability function:
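The omitted formulas on this slide were images; in the author's related papers the large-sample matching condition takes roughly the following form, so this is a presumed reconstruction rather than a quotation.

```latex
% Presumed matching condition (reconstruction; not verbatim from the slide)
T^*(\theta_j\mid X) \;\propto\; P(y_j\mid X) \;\propto\; \frac{P(X\mid y_j)}{P(X)},
\qquad\text{e.g. } T^*(\theta_j\mid X) = \frac{P(y_j\mid X)}{\max_X P(y_j\mid X)}.
```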
Shannon's Channel Matches the Semantic Channel
• For maximum mutual information classifications: use the classifier (formula on slide).
• For mixture models: use the E1-step and E2-step of CM-EM; repeat until P+1(Y) = P(Y).
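The classifier formula was an image on the slide; it is presumably the maximum-semantic-information classifier used elsewhere in the author's work, along the lines of the following reconstruction.

```latex
% Presumed maximum-semantic-information classifier (the slide's formula was an image)
y_j = h(x_i) = \arg\max_j \log\frac{T(\theta_j\mid x_i)}{T(\theta_j)}
             = \arg\max_j I(x_i;\theta_j).
```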
The Convergence Proof of CM-EM I: Basic Formulas
• Semantic mutual information G = I(X;θ):
• Shannon mutual information R = I(X;Y), where P+1(yj) = ∑i P(xi)P(yj|xi).
• Main formula for mixture models:
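The formulas on this slide were images. Writing the two mutual informations with the quantities already defined (the E-step channel P(yj|xi) = P(yj)P(xi|θj)/Pθ(xi) and P+1(yj) = ∑i P(xi)P(yj|xi)), a short calculation gives the relation below, which is consistent with the equivalence claimed on the CM-EM slide; whether this is exactly the author's main formula is my assumption.

```latex
% Assumed definitions and a derived relation (the slide's own formulas were images)
R = I(X;Y) = \sum_i\sum_j P(x_i)\,P(y_j\mid x_i)\,\log\frac{P(y_j\mid x_i)}{P^{+1}(y_j)},
\qquad
G = I(X;\theta) = \sum_i\sum_j P(x_i)\,P(y_j\mid x_i)\,\log\frac{P(x_i\mid\theta_j)}{P(x_i)}.
% Substituting P(y_j|x_i) = P(y_j)P(x_i|\theta_j)/P_\theta(x_i) gives
R - G = H(P\,\|\,P_\theta) - H(Y^{+1}\,\|\,Y),
\quad\text{so } R - G = H(P\,\|\,P_\theta)\ \text{once the E2-step makes } H(Y^{+1}\,\|\,Y)=0.
```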
The Convergence Proof of CM-EM II: Using the Variational Method
• The convergence proof: proving that Pθ(X) converges to P(X) is equivalent to proving that H(P||Pθ) converges to 0. Since the E2-step makes R = R'' and H(Y+1||Y) = 0, we only need to prove that every step after the start step minimizes R-G.
• Because the MG-step maximizes G without changing R, the remaining work is to prove that the E1-step and E2-step minimize R-G.
• Fortunately, this can be strictly proved by the variational and iterative methods that Shannon (1959) and others (Berger, 1971; Zhou, 1983) used for analyzing the rate-distortion function R(D).
The CM Algorithm: Using Optimized Mixture Models for Maximum Mutual Information Classifications
• Goal: find the best dividing points.
• First assume a boundary z' to obtain P(zj|Y).
• Matching I: obtain T*(θzj|Y) and the information lines I(Y;θzj|X).
• Matching II: apply the classifier (formula on slide).
• If H(P||Pθ) < 0.001, then end; else go to Matching I.
Illustrating the Convergence of the CM Algorithm for Maximum Mutual Information Classifications with the R(G) Function
• Iterative steps and convergence reasons:
  1) For each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood;
  2) For a given P(X) and semantic channel, we can find a better Shannon channel;
  3) Repeating the two steps yields the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood.
• An R(G) function serves as a ladder letting R climb up and then find a better semantic channel and a better ladder.
An Example Showing the Reliability of the CM Algorithm
• A 3×3 Shannon channel demonstrates reliable convergence.
• Even with a pair of bad starting points, convergence is still reliable.
• With good starting points, the number of iterations is 4.
• With very bad starting points, the number of iterations is 11.
• (Figure: the channel at the beginning and after convergence.)
Summary
• The CM algorithm is a new tool for statistical learning. To show its power, we use the CM-EM algorithm to resolve the problems with mixture models.
• In real applications, X may be multi-dimensional; however, the convergence reasons and reliability should be the same.
End. Thank you for listening! Criticism is welcome!
• 2017-8-26: reported at ICIS2017 (the 2nd International Conference on Intelligence Science, Shanghai).
• 2018-11-9: revised for a better convergence proof.
• More papers on the author's semantic information theory: http://survivor99.com/lcg/books/GIT/index.htm