Maximum Entropy
"… the fact that a certain probability distribution maximizes entropy subject to certain constraints representing our incomplete information, is the fundamental property which justifies use of that distribution for inference; it agrees with everything that is known, but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom …" (Jaynes, 1990)
[from: A Maximum Entropy Approach to NLP by A. L. Berger, S. A. Della Pietra and V. J. Della Pietra, Computational Linguistics, Vol. 22, No. 1, 1996]
Example
• Let us try to see how an expert would translate the English word 'in' into Italian.
• In = {in, dentro, di, <0>, a}
• If the translator always selects from this list, then P(in) + P(dentro) + P(di) + P(<0>) + P(a) = 1.
• If there is no preference, P(.) = 1/5 for each.
• Suppose we notice that the translator chooses dentro or di in 30% of the cases. This changes our constraints to:
• P(dentro) + P(di) = 0.3 and
• P(in) + P(dentro) + P(di) + P(<0>) + P(a) = 1
Example – cont.
• This changes the distribution as follows (the most uniform p satisfying these constraints):
• P(dentro) = 3/20 and P(di) = 3/20
• P(in) = P(<0>) = P(a) = 7/30
• Suppose we inspect the data further and note another interesting fact: in half of the cases the expert chooses either in or di. So:
• P(dentro) + P(di) = 0.3
• P(in) + P(dentro) + P(di) + P(<0>) + P(a) = 1
• P(in) + P(di) = 0.5
• Which p is, in this case, the most uniform? (See the numerical sketch below.)
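Since symmetry alone no longer gives the answer, a quick numerical check is useful. Below is a minimal sketch (assuming NumPy and SciPy are available; the variable names are illustrative) that finds the maximum-entropy distribution satisfying all three constraints:

```python
import numpy as np
from scipy.optimize import minimize

words = ["in", "dentro", "di", "<0>", "a"]

def neg_entropy(p):
    # Minimizing -H(p) = sum_i p_i log p_i maximizes the entropy.
    return np.sum(p * np.log(p + 1e-12))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[1] + p[2] - 0.3},  # P(dentro) + P(di) = 0.3
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # P(in) + P(di) = 0.5
]

result = minimize(neg_entropy, x0=np.full(5, 0.2),       # start from uniform
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for word, prob in zip(words, result.x):
    print(f"P({word}) = {prob:.4f}")
```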
Example – motivation
• How can we measure the uniformity of a model?
• Even if we can answer that question, how do we determine the parameters of such a model?
• The Maximum Entropy principle answers both questions: model everything that is known and assume nothing about what is unknown.
Aim
• Construct a statistical model of the process that generated the training sample, summarized by its empirical distribution p̃(x, y).
• The model is a conditional distribution p(y|x): given a context x, the probability that the process outputs y.
The expected value of a feature f with respect to the empirical distribution is exactly the statistic we are interested in. For a binary-valued feature function f(x, y), the expected value is:

$$\tilde{p}(f) = \sum_{x,y} \tilde{p}(x,y)\, f(x,y)$$
Constraint: the expected value of f under the model must equal the expected value under the training sample:

$$\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f(x,y)$$
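As a concrete illustration, here is a minimal sketch (names are illustrative, not from the paper) of how the two expectations could be computed from a training sample of (x, y) pairs; for simplicity the model expectation sums only over the outputs observed in training:

```python
from collections import Counter

def empirical_expectation(sample, f):
    """p~(f) = sum_{x,y} p~(x,y) f(x,y), with p~ the relative frequency."""
    n = len(sample)
    return sum(f(x, y) for x, y in sample) / n

def model_expectation(sample, f, p_cond):
    """p(f) = sum_{x,y} p~(x) p(y|x) f(x,y); p_cond(y, x) is the model P(y|x)."""
    n = len(sample)
    xs = Counter(x for x, _ in sample)   # p~(x) via relative counts
    ys = {y for _, y in sample}          # candidate outputs seen in training
    return sum((cx / n) * p_cond(y, x) * f(x, y)
               for x, cx in xs.items() for y in ys)
```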
What does uniform mean?
• The mathematical measure of the uniformity of a conditional distribution p(y|x) is the conditional entropy, here written H(p):

$$H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y|x)\, \log p(y|x)$$

• The factor p̃(x) p(y|x) inside the sum is the joint probability of x and y under the model.
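A minimal sketch of this entropy computation, reusing the p_cond convention from the sketch above:

```python
import math

def conditional_entropy(p_tilde_x, p_cond, ys):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x).
    p_tilde_x: dict mapping x -> p~(x); p_cond(y, x): model P(y|x); ys: outputs."""
    h = 0.0
    for x, px in p_tilde_x.items():
        for y in ys:
            pyx = p_cond(y, x)
            if pyx > 0.0:             # 0 log 0 is taken as 0
                h -= px * pyx * math.log(pyx)
    return h
```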
The Maximum Entropy Principle
• Among all models that satisfy the constraints (expected feature values must equal observed feature values), select the one with maximum entropy:

$$p^{*} = \operatorname*{argmax}_{p \in C} H(p), \qquad C = \{\, p \mid p(f_i) = \tilde{p}(f_i) \text{ for } i = 1, \dots, n \,\}$$
Maximizing the Entropy … Lagrange
• Maximizing H(p) subject to the constraints is a constrained optimization problem; it is solved by introducing one Lagrange multiplier λ_i per feature constraint:

$$\Lambda(p, \lambda) = H(p) + \sum_i \lambda_i \left( p(f_i) - \tilde{p}(f_i) \right)$$

• Setting the derivative with respect to p to zero shows that the solution has an exponential (log-linear) form:

$$p_{\lambda}(y|x) = \frac{1}{Z_{\lambda}(x)} \exp\Big(\sum_i \lambda_i f_i(x,y)\Big), \qquad Z_{\lambda}(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x,y)\Big)$$
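A minimal sketch of the resulting log-linear model (the function name and argument layout are illustrative):

```python
import math

def p_lambda(y, x, lambdas, features, ys):
    """p_lambda(y|x) for a log-linear model; ys is the set of possible outputs."""
    def score(yy):
        return math.exp(sum(lam * f(x, yy) for lam, f in zip(lambdas, features)))
    z = sum(score(yy) for yy in ys)   # normalizing constant Z_lambda(x)
    return score(y) / z
```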
The Algorithm 1 (Improved Iterative Scaling)
• Input: features f_1, …, f_n and the empirical distribution p̃(x, y)
• Output: optimal parameter values λ_i*
• Step 1: start with λ_i = 0 for all i.
• Step 2: for each i, (a) let Δλ_i be the solution of the update equation, (b) set λ_i ← λ_i + Δλ_i.
• Step 3: repeat step 2 until all λ_i have converged.
Step 2a.
• In general, Δλ_i is the solution of

$$\sum_{x,y} \tilde{p}(x)\, p_{\lambda}(y|x)\, f_i(x,y)\, \exp\big(\Delta\lambda_i\, f^{\#}(x,y)\big) = \tilde{p}(f_i), \qquad f^{\#}(x,y) = \sum_i f_i(x,y)$$

• For constant feature counts, i.e. when f^{#}(x, y) = M for every pair (x, y), this has a closed form:

$$\Delta\lambda_i = \frac{1}{M} \log \frac{\tilde{p}(f_i)}{p_{\lambda}(f_i)}$$
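Putting the pieces together, here is a minimal sketch of one iterative-scaling update in the constant-count case; it reuses the empirical_expectation and model_expectation helpers sketched earlier and assumes p_cond is the current model p_λ(y|x):

```python
import math

def gis_step(lambdas, features, sample, p_cond, M):
    """One update pass: lambda_i += (1/M) * log(p~(f_i) / p_lambda(f_i))."""
    new_lambdas = []
    for lam, f in zip(lambdas, features):
        p_emp = empirical_expectation(sample, f)       # observed expectation p~(f_i)
        p_mod = model_expectation(sample, f, p_cond)   # model expectation p_lambda(f_i)
        new_lambdas.append(lam + math.log(p_emp / p_mod) / M)
    return new_lambdas
```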
The Algorithm 1 Revisited
• Input: features f_1, …, f_n and the empirical distribution p̃(x, y)
• Output: optimal parameter values λ_i*
The Algorithm 2 – Feature Selection
• Input: collection F of candidate features and the empirical distribution p̃(x, y)
• Output: set S of active features and a model P incorporating these features
• Idea: start with S empty; at each step add the candidate feature whose inclusion most increases the log-likelihood of the training data, recompute the model, and stop when no candidate yields a significant gain (a sketch follows below).
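A minimal sketch of the greedy selection loop; train_model and log_likelihood are hypothetical helpers (train_model could, for instance, run Algorithm 1 on the given feature set). Note this simplified version refits the full model for each candidate, whereas Berger et al. use an approximate gain for efficiency:

```python
def select_features(candidates, sample, n_features, train_model, log_likelihood):
    """Greedily grow the active feature set by largest log-likelihood gain."""
    active, model = [], train_model([], sample)
    for _ in range(n_features):
        best_f, best_gain, best_model = None, 0.0, None
        for f in candidates:
            trial = train_model(active + [f], sample)   # refit with f added
            gain = log_likelihood(trial, sample) - log_likelihood(model, sample)
            if gain > best_gain:
                best_f, best_gain, best_model = f, gain, trial
        if best_f is None:            # no candidate improves the likelihood
            break
        active.append(best_f)
        candidates = [f for f in candidates if f is not best_f]
        model = best_model
    return active, model
```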