國立雲林科技大學 National Yunlin University of Science and Technology
General statistical inference for discrete and mixed spaces by an approximate application of the maximum entropy principle
• Authors: Lian Yan and David J. Miller
• IEEE Transactions on Neural Networks, vol. 11, no. 3, May 2000
• Advisor: Dr. Hsu
• Graduate student: Keng-Wei Chang
Outline
• Motivation
• Objective
• Introduction
• Maximum Entropy Joint PMF
• Extensions for More General Inference Problems
• Experimental Results
• Conclusions and Possible Extensions
Motivation
• The maximum entropy (ME) joint probability mass function (pmf) is powerful and does not require explicit statements of conditional independence.
• However, its huge learning complexity has severely limited the use of this approach.
Objective
• Propose an approach that makes learning quite tractable and extends to mixed (discrete and continuous) data.
1. Introduction
• Given a joint probability mass function (pmf), one can compute a posteriori probabilities for a single, fixed feature given knowledge of the remaining feature values.
• This supports statistical classification with some feature values missing.
• It also supports statistical classification for any (e.g., user-specified) discrete feature dimensions given values for the other features, i.e., generalized classification.
1. Introduction
• Multiple Networks Approach
• Bayesian Networks
• Maximum Entropy Models
• Advantages of the Proposed ME Method over BNs
1.1 Multiple Networks Approach
• With multilayer perceptrons (MLPs), radial basis functions, or support vector machines, one would train one network for each feature.
• Example: classifying documents into multiple topics, where one network makes an individual yes/no decision about the presence of each possible topic.
1.1 Multiple Networks Approach
• Several potential difficulties:
• increased learning and storage complexities
• accuracy of inferences suffers because dependencies between features are ignored
• Example: the networks separately predict F1 = 1 and F2 = 1, even though the joint event (F1 = 1, F2 = 1) has zero probability.
1.2 Bayesian Networks
• Handle missing features and capture dependencies between the multiple features.
• The joint pmf is written explicitly as a product of conditional probabilities.
• Versatile tools for inference with a convenient, informative representation.
1.2 Bayesian Networks
• Several difficulties with BNs:
• conditional independence relations between features must be stated explicitly
• optimizing over the set of possible BN structures: sequential, greedy methods may be suboptimal
• in sequential learning, it is unclear where to stop in order to avoid overfitting
1.3 Maximum Entropy Models
• Cheeseman proposed the maximum entropy (ME) joint pmf consistent with arbitrary lower-order probability constraints.
• This is powerful, allowing the joint pmf to express general dependencies between features.
1.3 Maximum Entropy Models
• Several difficulties with ME:
• learning (estimating the ME pmf) is difficult
• Ku and Kullback proposed an iterative algorithm that satisfies one constraint at a time, but satisfying one constraint may cause violation of others
• they only presented results for dimension N = 4 and J = 2 discrete values per feature
• Pearl cites this complexity as the main barrier to using ME
1.4 Advantages of the Proposed ME Method over BNs
• Our approach does not require explicit conditional independence assumptions.
• It provides an effective joint-optimization learning technique.
2. Maximum Entropy Joint PMF
• Consider a random feature vector F = (F1, …, FN) defined over the full discrete feature space.
2. Maximum Entropy Joint PMF
• Constrain the joint pmf to agree with the given pairwise pmfs.
• The ME joint pmf consistent with these pairwise pmfs has the Gibbs form, with one Lagrange multiplier per constraint (see the sketch below).
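The slide's equation is an image that is not reproduced here. As a hedged sketch, the standard Gibbs/exponential form of the ME joint pmf under pairwise marginal constraints is shown below; the multipliers γ and the normalizer Z are assumed notation, not necessarily the paper's.

```latex
% ME joint pmf under pairwise constraints: standard Gibbs/exponential form.
% The multipliers \gamma_{kl}(\cdot,\cdot) and the normalizer Z are assumed
% notation.
P[\mathbf{F} = \mathbf{f}]
  = \frac{1}{Z}\,
    \exp\!\Big( \sum_{k < l} \gamma_{kl}(f_k, f_l) \Big),
\qquad
Z = \sum_{\mathbf{f}'} \exp\!\Big( \sum_{k < l} \gamma_{kl}(f'_k, f'_l) \Big)
```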
2. Maximum Entropy Joint PMF
• Each Lagrange multiplier corresponds to an equality constraint on an individual pairwise probability.
• The joint pmf is thus specified by the set of Lagrange multipliers.
• Although these probabilities also depend on Γ, they can often be tractably computed.
2. Maximum Entropy Joint PMF
• Two major difficulties:
• the optimization requires calculating an intractable cost D
• D requires marginalizations over the joint pmf, which are intractable
• This motivated the approximate ME approach.
2.1 Review of the ME Formulation for Classification
• The joint pmf of the random feature vector still has the intractable form (1).
• Classification does not require computing the full joint pmf, but rather just the a posteriori probabilities; even so, computing them exactly is still not feasible.
2.1 Review of the ME Formulation for Classification
• Here we review a tractable, approximate method:
• Joint PMF Form
• Support Approximation
• Lagrangian Formulation
2.1.1 Joint PMF Form
• The a posteriori class probabilities are obtained from the joint pmf via Bayes' rule (see the standard identity below).
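The slide's formula is an image; as a reference, the standard Bayes'-rule relation between the joint pmf and the a posteriori class probabilities, for a class label C and feature vector F (notation assumed), is:

```latex
% Standard Bayes'-rule identity relating the joint pmf and the
% a posteriori class probabilities (notation assumed)
P[C = c \mid \mathbf{F} = \mathbf{f}]
  = \frac{P[\mathbf{F} = \mathbf{f},\, C = c]}
         {\sum_{c'} P[\mathbf{F} = \mathbf{f},\, C = c']}
```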
2.1.2 Support Approximation
• The approximation may have some effect on the accuracy of the learned model, but it does not sacrifice our inference capability.
• The full feature space is replaced by a small subset, which is computationally feasible.
• Example: for N = 19 features, roughly 40 billion configurations are reduced to about 100 support points — a huge reduction (see the sketch below).
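A minimal Python sketch of the size argument behind the support approximation. The per-feature cardinality `values_per_feature` is assumed purely for illustration; the slide's own numbers depend on the actual cardinalities in the data set.

```python
# Minimal sketch: cost of the full joint space vs. a small support subset.
# `values_per_feature` is an assumed, illustrative value; the slide's example
# (N = 19, ~40 billion configurations reduced to ~100 support points)
# depends on the actual per-feature cardinalities.
num_features = 19          # N
values_per_feature = 4     # J, assumed for illustration
support_size = 100         # |M|, number of retained support points

full_space_size = values_per_feature ** num_features
print(f"full joint space:  {full_space_size:,} configurations")
print(f"reduced support:   {support_size} configurations")
print(f"reduction factor:  {full_space_size // support_size:,}x")
```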
2.1.3 Lagrangian Formulation
• With the pmf restricted to the support set, the joint entropy H is computed over the support points only (see below).
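The slide's entropy expression is an image; the following is the standard definition of the joint entropy restricted to a support set M, using the slides' P_M[·] notation (the exact index set is assumed):

```latex
% Joint entropy of the model pmf restricted to the support set M
% (standard definition; the index set of M is assumed)
H = - \sum_{(\mathbf{f},\, c) \,\in\, M}
      P_M[\mathbf{F} = \mathbf{f},\, C = c]\,
      \log P_M[\mathbf{F} = \mathbf{f},\, C = c]
```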
2.1.3 Lagrangian Formulation
• The constraint cost is measured by the cross entropy (Kullback distance) between the given pairwise pmfs and the model's pairwise pmfs (see below).
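As an assumption about notation (the slide's formula is an image), the cross entropy/Kullback distance between a measured pairwise pmf P[F_k, F_l] and the model's pairwise pmf P_M[F_k, F_l] takes the standard form:

```latex
% Cross entropy / Kullback distance between a measured pairwise pmf and
% the model's pairwise pmf (standard definition; notation assumed)
D_{kl} = \sum_{f_k} \sum_{f_l}
   P[F_k = f_k,\, F_l = f_l]\,
   \log \frac{P[F_k = f_k,\, F_l = f_l]}{P_M[F_k = f_k,\, F_l = f_l]}
```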
2.1.3 Lagrangian Formulation
• For pairwise constraints involving the class label, the constraint pmf is P[Fk, C].
2.1.3 Lagrangian Formulation
• The overall constraint cost D is formed as the sum of all the individual pairwise costs.
• Given D and H, we can form the Lagrangian cost function (a generic form is sketched below).
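The slide's Lagrangian is an image; a generic form built from the stated ingredients (maximize the entropy H while penalizing the total constraint cost D), with the multiplier λ assumed notation, is:

```latex
% Generic Lagrangian trading off entropy against the total constraint cost
% (a sketch under assumed notation, not the paper's exact expression)
\mathcal{L} = -H + \lambda\, D,
\qquad
D = \sum_{k < l} D_{kl}
```

Minimizing such a cost drives the model's pairwise pmfs toward the given constraints while keeping the joint entropy high; the exact form and weighting used in the paper may differ.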
3. Extensions for More General Inference Problems
• General Statistical Inference
• Joint PMF Representation
• Support Approximation
• Lagrangian Formulation
• Discussion
• Mixed Discrete and Continuous Feature Space
3.1.1 Joint PMF Representation
• The a posteriori probabilities now have a corresponding form for any feature to be inferred given the others.
3.1.1 Joint PMF Representation
• With respect to each feature Fi, the joint pmf can be written in a corresponding form.
3.1.2 Support Approximation
• The joint pmf is again reduced to a support set, as in the classification case.
3.1.3 Lagrangian Formulation
• The joint entropy H can be written as before, over the support set.
3.1.3 Lagrangian Formulation
• The pairwise pmf PM[Fk, Fl] can be calculated in two different ways (one standard marginalization is sketched below).
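The two computations on the slide are images; as a hedged illustration, one standard way to obtain the model's pairwise pmf is to marginalize the support-restricted joint pmf over the remaining features (the paper's two exact computations may differ in form):

```latex
% One way to obtain the model's pairwise pmf: marginalize the
% support-restricted joint pmf over all remaining features
% (a sketch under assumed notation)
P_M[F_k = a,\, F_l = b]
  = \sum_{\mathbf{f} \,\in\, M :\; f_k = a,\ f_l = b}
      P_M[\mathbf{F} = \mathbf{f}]
```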
3.1.3 Lagrangian Formulation
• The overall constraint cost D is again formed from the individual pairwise costs.
3.2 Discussion
• Choice of Constraints: encode all probabilities of second order
• Tractability of Learning
• Qualitative Comparison of Methods
3.3 Mixed Discrete and Continuous Feature Space
• The feature vector now contains both discrete and continuous components; our objective is to learn the joint density.
3.3 Mixed Discrete and Continuous Feature Space
• Given our choice of constraints, these probabilities decompose the joint density accordingly.
3.3 Mixed Discrete and Continuous Feature Space
• For continuous features, a conditional mean constraint is placed on Ai given C = c, and a constraint is placed on each pair of continuous features Ai, Aj (a sketch of such constraints is given below).
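The slide's constraint equations are images; as an assumption about their general shape, a conditional mean constraint and a pairwise constraint on continuous features can be written as follows, where the targets μ and m would be estimated from training data (the paper's exact pairwise statistic may differ):

```latex
% Hedged sketch of constraints on continuous features (assumed shape):
% a conditional mean of A_i given the class C, and a second-order statistic
% for a pair of continuous features A_i, A_j.
E[\, A_i \mid C = c \,] = \mu_{i,c},
\qquad
E[\, A_i A_j \,] = m_{ij}
```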
4. Experimental Results
• Evaluation of generalized classification performance.
• Data sets used solely for classification: Mushroom, Congress, Nursery, Zoo, Hepatitis.
• Generalized classification performance on data sets with multiple possible class features: Solar Flare, Flag, Horse Colic.
• Classification performance on data sets with mixed continuous and discrete features: Credit Approval, Hepatitis, Horse Colic.
4. Experimental Results
• The ME method was compared with:
• Bayesian networks (BN)
• decision trees (DT)
• a powerful extension of DT: mixtures of DTs
• multilayer perceptrons (MLP)
4. Experimental Results
• For an arbitrary feature to be inferred, Fi, the method computes the a posteriori probabilities given the remaining feature values.
4. Experimental Results
• We use the following criteria to evaluate all the methods (criterion (2) is sketched below):
• (1) misclassification rate on the test set for the data set's class label
• (2) criterion (1) with a single, randomly chosen feature missing
• (3) average misclassification rate on the test set
• (4) misclassification rate on the test set, based on predicting a pair of randomly chosen features
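A minimal Python sketch of criterion (2). The function `model_predict` is a hypothetical stand-in for the model's inference routine (e.g., the ME a posteriori computation) and is assumed to accept None for a missing feature value.

```python
# Minimal sketch of criterion (2): misclassification rate when one randomly
# chosen feature value is hidden before prediction. `model_predict` is a
# hypothetical inference routine, not the paper's implementation.
import random

def error_rate_with_missing_feature(model_predict, test_set, seed=0):
    rng = random.Random(seed)
    errors = 0
    for features, true_class in test_set:
        masked = list(features)
        masked[rng.randrange(len(masked))] = None  # hide one feature at random
        if model_predict(masked) != true_class:
            errors += 1
    return errors / len(test_set)
```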
5. Conclusions and Possible Extensions
• Regression
• Large-Scale Problems
• Model Selection: Searching for ME Constraints
• Applications
Personal Opinion
• …