Explore mutual information and maximum likelihood in matching and iterating between semantic and Shannon channels for enhanced testing, estimation, and classification processes.
Semantic Channel and Shannon Channel Mutually Match and Iterate for Tests, Estimations, and Classifications with Maximum Mutual Information and Maximum Likelihood
鲁晨光 Chenguang Lu, lcguang@foxmail.com
Homepage: http://survivor99.com/lcg/; http://www.survivor99.com/lcg/english/
This ppt may be downloaded from http://survivor99.com/lcg/CM/CM4MMIandML.ppt
1. The Tasks of Tests and Estimations For tests and estimations with given P(X) and P(Z|X), or a sample {(x(t), z(t)) | t = 1, 2, …, N}, how do we partition C (i.e., find the boundaries) so that Shannon's mutual information and the average log-likelihood (ALL) are maximized?
Mutual Information and Average Log-Likelihood Change with z' in Tests • As z' moves right, the likelihood provided by y1, P(X|θ1), increases while P(X|θ0) decreases; the Kullback-Leibler information I(X; y1) increases while I(X; y0) decreases. • There is a best dividing point z* that maximizes the mutual information and the average log-likelihood (see the sketch below).
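As a concrete illustration of searching for the best dividing point, here is a minimal Python sketch that brute-forces z* by maximizing Shannon mutual information over candidate dividing points. The prior P(X), the discretized P(Z|X), and the function name are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def mutual_information(p_x, p_z_given_x, z_star):
    """Shannon mutual information I(X;Y) for a binary test that reports
    y1 (positive) when Z >= z_star and y0 (negative) otherwise."""
    p_y1_given_x = p_z_given_x[:, z_star:].sum(axis=1)            # P(y1|x)
    p_y_given_x = np.stack([1.0 - p_y1_given_x, p_y1_given_x], axis=1)
    p_xy = p_x[:, None] * p_y_given_x                              # joint P(x, y)
    p_y = p_xy.sum(axis=0)                                         # marginal P(y)
    terms = np.where(p_xy > 0, p_xy * np.log(p_y_given_x / p_y), 0.0)
    return terms.sum()

# Illustrative numbers: 2 classes (x0 = no disease, x1 = disease), 10 Z bins
p_x = np.array([0.8, 0.2])
z_bins = np.arange(10)
p_z_given_x = np.stack([np.exp(-0.5 * ((z_bins - 3) / 1.5) ** 2),   # Z|x0 around 3
                        np.exp(-0.5 * ((z_bins - 6) / 1.5) ** 2)])  # Z|x1 around 6
p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)

scores = [mutual_information(p_x, p_z_given_x, z) for z in range(1, 10)]
print("best dividing point z* =", 1 + int(np.argmax(scores)))
```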
A Story about Looking for z*: Similar to Catching a Cricket • I study semantic information theory and define: information = log[P(X|θj)/P(X)]. This information criterion is compatible with the likelihood and likelihood-ratio criteria. • When I used the information criterion to find the optimal z* in a test, an interesting thing happened. • For any starting z', my Excel file told me: the best dividing point is the next one! • After I moved to the next point, it still said: the best point is the next one! • … Fortunately, it converges! It is similar to catching a cricket. • Do you know this secret? Does this method converge in every case? • Let us prove the convergence with my semantic information theory.
5. The Research History
1993: 《广义信息论》 (A Generalized Information Theory), University of Science and Technology of China Press;
1994: 《广义熵和广义互信息的编码意义》 (The Coding Meanings of Generalized Entropy and Generalized Mutual Information), 《通信学报》 (Journal on Communications), Vol. 5, No. 6, 37-44;
1997: 《投资组合的熵理论和信息价值》 (Entropy Theory of Portfolio and Information Value), University of Science and Technology of China Press;
1999: A generalization of Shannon's information theory (a short version of the book), Int. J. of General Systems, 28(6): 453-490.
Recently, I found that this theory can be used to improve statistical learning in many aspects. See
http://www.survivor99.com/lcg/books/GIT/ and http://www.survivor99.com/lcg/Recent.html
Home page: http://survivor99.com/lcg/
Blog: http://blog.sciencenet.cn/?2056
6. Important Step: Using the Truth Function to Produce the Likelihood Function • Use the membership function mAj(X) as the truth function of the hypothesis yj = "X is in Aj": T(θj|X) = mAj(X), where θj = Aj (a fuzzy set) serves as a sub-model. • Important step 1: use T(θj|X) and the source P(X) to produce the semantic likelihood function P(X|θj) = P(X)T(θj|X)/T(θj), where the logical probability T(θj) = Σi P(xi)T(θj|xi) is the normalizing constant. • How do we predict the real position from a GPS reading? Is the car on the building? The most possible position is where P(X|θj) is largest (see the sketch below). [Figure: a GPS example with a Gaussian truth function T(θj|X) = exp[-(X-xj)²/(2d²)].]
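A minimal sketch of this semantic Bayesian prediction, assuming a discrete X; the GPS-style numbers (uniform prior, reading xj, deviation d) are illustrative only.

```python
import numpy as np

def semantic_bayes(p_x, truth):
    """Semantic Bayesian prediction: P(X|theta_j) = P(X) T(theta_j|X) / T(theta_j),
    where T(theta_j) = sum_i P(x_i) T(theta_j|x_i) is the logical probability."""
    logical_prob = float(np.sum(p_x * truth))    # T(theta_j), the normalizing constant
    return p_x * truth / logical_prob            # predicted distribution P(X|theta_j)

# Illustrative GPS-like example: positions 0..9, uniform prior, Gaussian truth function
x = np.arange(10)
p_x = np.full(10, 0.1)
x_j, d = 6.0, 1.5                                # hypothetical GPS reading and deviation
truth = np.exp(-(x - x_j) ** 2 / (2 * d ** 2))   # T(theta_j|X), peak value 1
print(semantic_bayes(p_x, truth))
```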
7. Semantic Information Measure Compatible with Shannon's, Popper's, Fisher's, and Zadeh's Thoughts • Use the log of the normalized likelihood to define semantic information: I(xi; θj) = log[P(xi|θj)/P(xi)] = log[T(θj|xi)/T(θj)]. • If T(θj|X) = exp[-|X-xj|²/(2d²)], j = 1, 2, …, n, then I(xi; θj) = log[1/T(θj)] - |xi-xj|²/(2d²), i.e., Bar-Hillel and Carnap's information minus a standardized squared-deviation term. This reflects Popper's thought well: • The smaller the logical probability is, the more information there is; • The larger the deviation is, the less information there is; • A wrong estimation conveys negative information.
8. Semantic Kullback-Leibler Information and Semantic Mutual Information • Important step 2: average I(xi; θj) over the sampling distribution P(X|yj) to get the semantic Kullback-Leibler information I(X; θj) = Σi P(xi|yj) log[T(θj|xi)/T(θj)]. • It has a simple relation to the normalized log-likelihood: maximizing I(X; θj) is equivalent to maximizing the likelihood produced by θj for the sample with label yj. • Averaging I(X; θj) over P(yj) gives the semantic mutual information I(X; Θ) = Σj P(yj) Σi P(xi|yj) log[T(θj|xi)/T(θj)] (see the sketch below).
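A small sketch of these two averages, assuming discrete X and Y; the arrays and their values are illustrative and the function names are mine, not the author's.

```python
import numpy as np

def semantic_kl_information(p_x_given_yj, truth_j, p_x):
    """Semantic Kullback-Leibler information I(X; theta_j): the average of
    log[T(theta_j|x)/T(theta_j)] over the sampling distribution P(X|y_j)."""
    logical_prob = float(np.sum(p_x * truth_j))                    # T(theta_j)
    return float(np.sum(p_x_given_yj * np.log(truth_j / logical_prob)))

def semantic_mutual_information(p_x, p_y_given_x, truths):
    """Semantic mutual information I(X; Theta): average of I(X; theta_j) over P(y_j)."""
    p_xy = p_x[:, None] * p_y_given_x                              # joint P(x, y)
    p_y = p_xy.sum(axis=0)
    total = 0.0
    for j, truth_j in enumerate(truths):
        p_x_given_yj = p_xy[:, j] / p_y[j]                         # sampling distribution
        total += p_y[j] * semantic_kl_information(p_x_given_yj, truth_j, p_x)
    return total

# Tiny usage example with 3 x-values and 2 labels (illustrative numbers)
p_x = np.array([0.5, 0.3, 0.2])
p_y_given_x = np.array([[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]])
truths = [np.array([1.0, 0.4, 0.1]), np.array([0.1, 0.4, 1.0])]
print(semantic_mutual_information(p_x, p_y_given_x, truths))
```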
9. Channels’Matching Algorithm yj不变X变 • The Shannon channel • The semantic channel (consists of truth functions): • The semantic mutual information formula: • We may fix one and optimize another alternatively achieve MLE. X Transition probability function Shannon channel Semantic Channel Sampling distribution Likelihood function
10. The Semantic Channel Matches the Shannon Channel • Optimize the truth function and the semantic channel: T*(θj|X) = P(yj|X)/P(yj|xj*), where xj* makes P(yj|xj*) the maximum of P(yj|X). • When the sample is large enough, the optimized truth function is proportional to the transition probability function; in other words, the semantic channel matches the Shannon channel. • If P(yj|X) or P(yj) is hard to obtain, we may use T*(θj|X) = [P(X|yj)/P(X)] / max_X [P(X|yj)/P(X)]. • With T*(θj|X), the semantic Bayesian prediction is equivalent to the traditional Bayesian prediction: P*(X|θj) = P(X|yj) (see the sketch below). [Figure: the longitudinal normalizing constant linking the semantic channel and the Shannon channel.]
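A minimal sketch of this matching step, assuming a discrete 3×3 channel with illustrative numbers; it also checks numerically that the semantic Bayesian prediction P*(X|θj) reproduces P(X|yj).

```python
import numpy as np

def match_semantic_channel(p_y_given_x):
    """Optimized truth functions: T*(theta_j|X) = P(y_j|X) / max_X P(y_j|X),
    i.e. each transition probability function normalized to peak value 1."""
    return p_y_given_x / p_y_given_x.max(axis=0, keepdims=True)

# Illustrative 3x3 Shannon channel P(y_j|x_i): rows = x, columns = y
p_x = np.array([0.5, 0.3, 0.2])
p_y_given_x = np.array([[0.80, 0.15, 0.05],
                        [0.20, 0.60, 0.20],
                        [0.05, 0.15, 0.80]])
truths = match_semantic_channel(p_y_given_x)          # columns are T*(theta_j|X)

# Check: semantic Bayesian prediction equals traditional Bayesian prediction
j = 1
p_xy = p_x[:, None] * p_y_given_x
p_x_given_yj = p_xy[:, j] / p_xy[:, j].sum()                          # P(X|y_j)
p_x_given_thetaj = p_x * truths[:, j] / np.sum(p_x * truths[:, j])    # P*(X|theta_j)
print(np.allclose(p_x_given_yj, p_x_given_thetaj))                    # True
```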
11. Multi-label Logical Classification and Selective Classification with the ML Criterion • The receivers' logical classification is to obtain the membership functions, i.e., the optimized truth functions; the formula differs according to whether the sample is big enough (see slide 10). • The senders' selective classification is to select a yj (i.e., to make Bayes' decision). • If X is unseen and we can only see the observed condition Z, as in a test or an estimation, we may use a corresponding formula conditioned on Z (a hedged sketch of one such classifier follows below).
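The slide's classification formulas appear only as images. The sketch below is one plausible reading, assuming the sender selects the label whose semantic Kullback-Leibler information (slide 8) is largest, averaged over P(X|Z) when only Z is observed; treat the decision rule itself as an assumption rather than the author's exact formula.

```python
import numpy as np

def classify_given_z(p_x_given_z, truths, p_x):
    """Select the y_j maximizing the average semantic information
    sum_i P(x_i|Z) log[T(theta_j|x_i)/T(theta_j)] (assumed decision rule)."""
    logical_probs = truths.T @ p_x                        # T(theta_j) for each j
    scores = p_x_given_z @ np.log(truths / logical_probs)
    return int(np.argmax(scores))

# Illustrative numbers: 3 x-values, 2 labels
p_x = np.array([0.5, 0.3, 0.2])
truths = np.array([[1.0, 0.1],
                   [0.4, 0.4],
                   [0.1, 1.0]])                           # columns are T(theta_j|X)
p_x_given_z = np.array([0.2, 0.3, 0.5])                   # posterior of X given observed Z
print("selected label index:", classify_given_z(p_x_given_z, truths, p_x))
```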
12. Two Information Amounts Change with z • y0: test-negative; y1: test-positive. • To optimize the semantic channel T(θj|X): T(θ1|x1) = T(θ0|x0) = 1, T(θ1|x0) = b1'* = P(y1|x0)/P(y1|x1), T(θ0|x1) = b0'* = P(y0|x1)/P(y0|x0). • To optimize the classifier, choose the dividing point z* that maximizes the semantic mutual information for j = 1, 2 (see the sketch below).
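A sketch of the resulting "catching the cricket" iteration for a binary test, alternating the semantic-channel matching above with re-selection of the dividing point; the discretization of Z, the helper names, and the example numbers are all illustrative assumptions.

```python
import numpy as np

def cm_for_test(p_x, p_z_given_x, z_start, max_iter=20):
    """Alternate the two steps of slides 10 and 12 for a binary test:
    (1) match the semantic channel to the current Shannon channel,
        T(theta_1|x0) = P(y1|x0)/P(y1|x1), T(theta_0|x1) = P(y0|x1)/P(y0|x0);
    (2) move the dividing point z' to maximize semantic mutual information."""
    z = z_start
    n_bins = p_z_given_x.shape[1]
    for _ in range(max_iter):
        # Step 1: Shannon channel induced by z, then the matched truth values
        p_y1_given_x = p_z_given_x[:, z:].sum(axis=1)          # P(y1|x0), P(y1|x1)
        p_y0_given_x = 1.0 - p_y1_given_x
        b1 = p_y1_given_x[0] / p_y1_given_x[1]                 # T(theta_1|x0)
        b0 = p_y0_given_x[1] / p_y0_given_x[0]                 # T(theta_0|x1)
        truths = np.array([[1.0, b0],
                           [b1, 1.0]])                         # rows: theta_0, theta_1; cols: x0, x1

        # Step 2: pick the dividing point maximizing semantic mutual information G
        def g(z_prime):
            py1 = p_z_given_x[:, z_prime:].sum(axis=1)
            p_y_given_x = np.stack([1.0 - py1, py1])           # rows: y0, y1; cols: x0, x1
            t_theta = truths @ p_x                             # logical probabilities T(theta_j)
            return float(np.sum(p_x * p_y_given_x * np.log(truths / t_theta[:, None])))

        z_new = max(range(1, n_bins), key=g)
        if z_new == z:
            return z                                           # converged
        z = z_new
    return z

# Illustrative usage (same style of numbers as the earlier test sketch)
p_x = np.array([0.8, 0.2])
z_bins = np.arange(10)
p_z_given_x = np.stack([np.exp(-0.5 * ((z_bins - 3) / 1.5) ** 2),
                        np.exp(-0.5 * ((z_bins - 6) / 1.5) ** 2)])
p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
print("converged dividing point z* =", cm_for_test(p_x, p_z_given_x, z_start=1))
```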
13. Using the R(G) Function to Prove Iterative Convergence • Shannon's information theory proposed the rate-distortion function R(D), where R(D) is the minimum rate R for a given distortion D. • Replacing D by the semantic mutual information G, we obtain the R(G) function: the minimum Shannon mutual information R for a given G. • All R(G) functions are bowl-like; the matching point is where R = G. [Figure: an R(G) curve with its matching point.]
14. Using the R(G) Function to Prove the CM Algorithm's Convergence for Tests and Estimations Iterative steps and convergence reasons: • 1) For each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood; • 2) For a given P(X) and semantic channel, we can find a better Shannon channel; • 3) Repeating the two steps yields the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood. An R(G) function serves as a ladder that lets R climb up and then lets us find a better semantic channel, i.e., a better ladder.
15. An Example of Estimation Shows the Reliability of Convergence • A 3×3 Shannon channel is used to show reliable convergence. • Even if a pair of bad starting points is used, the convergence is still reliable. • With good starting points, the number of iterations is 4; with very bad starting points, the number of iterations is 11. [Figure: the channel at the start and after convergence.]
16. The CM Algorithm for Mixture Models • Difference from tests and estimations: here we look for the true Shannon channel, comparing the semantic mutual information G with the Shannon mutual information R. • The main formula for mixture models (derived without Jensen's inequality) uses the updated label distribution P+1(yj) = Σi P(xi)P(yj|xi). • Three steps: 1) Left-step-a; 2) Left-step-b: drive H(Y||Y+1) → 0, using an inner iteration; 3) Right-step: find the guessed Shannon channel that maximizes G. • The CM vs. the EM: Left-step-a ≈ E-step; Left-step-b + Right-step ≈ M-step. (A hedged sketch of the inner iteration follows below.)
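The update formulas on this slide appear only as images. The sketch below is one plausible reading of the inner iteration in Left-step-b, assuming the guessed Shannon channel is P(yj|xi) ∝ P(yj)P(xi|θj) and that H(Y||Y+1) denotes the Kullback-Leibler divergence between P(Y) and P+1(Y); the exact update order is an assumption, and the numbers are illustrative.

```python
import numpy as np

def inner_iteration(p_x, likelihoods, p_y, tol=1e-8, max_iter=500):
    """Drive H(Y||Y+1) -> 0 (assumed reading of Left-step-b): repeatedly form the
    guessed Shannon channel P(y_j|x_i) proportional to P(y_j)P(x_i|theta_j), then
    replace P(y_j) by P+1(y_j) = sum_i P(x_i)P(y_j|x_i) until Y and Y+1 agree."""
    for _ in range(max_iter):
        joint = p_y * likelihoods                        # P(y_j)P(x_i|theta_j), shape (n_x, n_y)
        p_y_given_x = joint / joint.sum(axis=1, keepdims=True)
        p_y_next = p_x @ p_y_given_x                     # P+1(y_j) = sum_i P(x_i)P(y_j|x_i)
        h = float(np.sum(p_y * np.log(p_y / p_y_next)))  # H(Y||Y+1), a KL divergence
        p_y = p_y_next
        if h < tol:
            break
    return p_y, p_y_given_x

# Illustrative two-component example: likelihoods[i, j] = P(x_i|theta_j)
x = np.arange(10)
p_x = np.full(10, 0.1)
likelihoods = np.stack([np.exp(-0.5 * (x - 3.0) ** 2),
                        np.exp(-0.5 * (x - 6.0) ** 2)], axis=1)
likelihoods /= likelihoods.sum(axis=0, keepdims=True)
print(inner_iteration(p_x, likelihoods, p_y=np.array([0.5, 0.5]))[0])
```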
17. Illustrating the Convergence of the CM Algorithm (a counterexample against the EM, in which Q is decreasing). The central idea of the CM is: • Finding the point G ≈ R on the two-dimensional R-G plane, while also driving R → R* (the EM algorithm neglects R → R*); • Minimizing H(Q||P) = R(G) - G (similar to a min-max method). Two examples: one starting with R < R* (or Q < Q*) and one starting with R > R* (or Q > Q*). [Figure: the two iteration paths toward the target.]
18. A Counterexample with R > R* (or Q > Q*) Against the EM • True, starting, and ending parameters are shown in the figure. • The number of iterations is 5. • Excel demo files can be downloaded from: http://survivor99.com/lcg/cc-iteration.zip
19. Illustrating Fuzzy Classification for Mixture Models • After we obtain the optimized P(X|Θ), we need to select Y (i.e., to make a decision or classification) according to X. • The parameter s in the R(G) function reminds us that we may use the corresponding Shannon channel P(yj|X), j = 1, 2, …, n, as the classifying function (a hedged sketch follows below). • When s → ∞, P(yj|X) = 0 or 1.
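The exact classifying function is shown only as an image. A common rate-distortion-style choice, assumed here purely for illustration, is P(yj|X) ∝ P(yj)P(X|θj)^s, which indeed approaches a hard 0/1 classification as s grows.

```python
import numpy as np

def fuzzy_classifier(p_y, likelihoods, s):
    """Assumed s-parameterized classifying function:
    P(y_j|X) proportional to P(y_j) * P(X|theta_j)**s.
    As s grows, the soft assignment approaches a hard 0/1 classification."""
    weights = p_y * likelihoods ** s                 # shape (n_x, n_y)
    return weights / weights.sum(axis=1, keepdims=True)

# Illustrative two-component example
x = np.arange(10)
p_y = np.array([0.4, 0.6])
likelihoods = np.stack([np.exp(-0.5 * (x - 3.0) ** 2),
                        np.exp(-0.5 * (x - 6.0) ** 2)], axis=1)
likelihoods /= likelihoods.sum(axis=0, keepdims=True)
print(np.round(fuzzy_classifier(p_y, likelihoods, s=1), 3))
print(np.round(fuzzy_classifier(p_y, likelihoods, s=20), 3))   # nearly 0/1
```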
20. The Numbers of Iterations for Convergence, for Gaussian mixture models with component number n = 2.
21. MSI in Comparison with MLE and MAP • MSI: Maximum Semantic Information estimation. • MSI has three features: 1) it is compatible with MLE, but it also suits cases with a variable source P(X); 2) it is compatible with traditional Bayesian predictions; 3) it uses truth functions as predictive models, so the models reflect the communication channels' features (for example, GPS devices and medical tests provide Shannon channels and semantic channels). A comparison of the three criteria is sketched below.
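The three criteria appear only as images in the slide. The LaTeX block below restates the standard MLE and MAP criteria and writes MSI as maximization of the semantic Kullback-Leibler information of slide 8; the notation (x denoting the sample labeled yj) is mine, not the author's.

```latex
\begin{align*}
\text{MLE:}\quad \hat{\theta}_j &= \arg\max_{\theta_j}\ \log P(\mathbf{x}\mid \theta_j) \\
\text{MAP:}\quad \hat{\theta}_j &= \arg\max_{\theta_j}\ \bigl[\log P(\mathbf{x}\mid \theta_j) + \log P(\theta_j)\bigr] \\
\text{MSI:}\quad \hat{\theta}_j &= \arg\max_{\theta_j}\ I(X;\theta_j)
  = \arg\max_{\theta_j}\ \sum_i P(x_i\mid y_j)\,\log\frac{T(\theta_j\mid x_i)}{T(\theta_j)}
\end{align*}
```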
22. Summary The Channels' Matching (CM) algorithm is a new tool for statistical learning. It can be used to solve problems with tests, estimations, multi-label logical classifications, and mixture models more conveniently. (End) Thank you for listening! Criticism is welcome! • 2018 IEEE International Conference on Big Data and Smart Computing • January 15-18, 2018, Shanghai, China • More papers about the author's semantic information theory: http://survivor99.com/lcg/books/GIT/index.htm