IBM Clustering: after Brown et al

Word-based n-gram models seem to be willfully obtuse: they use the information that words contain, but overlook the information that in certain contexts we’re very likely to get a noun (or adjective, etc.), even if we don’t know which one.

The following model tries to take that into account. We model the probability that the i-th word S[i] is word wn as the probability of wn given that the word is a member of Category k (the category wn belongs to), times the probability that Category k follows the preceding category.
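A minimal sketch of this model in Python, assuming each word belongs to exactly one category and that both factors are estimated by simple relative frequency with no smoothing. The function name `train_class_bigram`, the toy corpus, and the category labels are all made up for illustration.

```python
from collections import Counter

def train_class_bigram(tokens, word2cat):
    """Return p(w, w_prev) = P(w | cat(w)) * P(cat(w) | cat(w_prev)).

    Assumption: every word has exactly one category (word2cat is a plain dict)
    and probabilities are unsmoothed relative frequencies.
    """
    cats = [word2cat[w] for w in tokens]
    word_count = Counter(tokens)
    cat_count = Counter(cats)
    cat_bigram = Counter(zip(cats, cats[1:]))
    prev_count = Counter(cats[:-1])  # how often each category is a left context

    def p_word_given_cat(w):
        return word_count[w] / cat_count[word2cat[w]]

    def p_cat_given_prev(c, c_prev):
        if prev_count[c_prev] == 0:
            return 0.0
        return cat_bigram[(c_prev, c)] / prev_count[c_prev]

    def p_next_word(w, w_prev):
        return p_word_given_cat(w) * p_cat_given_prev(word2cat[w], word2cat[w_prev])

    return p_next_word

# Hypothetical toy data, for illustration only.
tokens = "the dog saw the cat and the cat saw a dog".split()
word2cat = {"the": "DET", "a": "DET", "dog": "N", "cat": "N",
            "saw": "V", "and": "CONJ"}
p = train_class_bigram(tokens, word2cat)
print(p("cat", "a"))
```

Note that the word pair "a cat" never occurs in the toy corpus, so a word bigram model would give it zero probability without smoothing; the category model still assigns it mass because a determiner is always followed by a noun in this data.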

Joint probability of a word/category sequence • We could calculate the joint probability of the sequence(s): C1 C2 C3 C4 W1 W2 W3 W4

Prob (Wi, Ci) = Prob (Wi | Ci ) * Prob (Ci | Ci-1); but if we want this to help us compute a distribution of the probabilities for the next word in a sentence, we have to sum over all relevant category sequences…
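One standard way to do that sum without enumerating every category sequence is the forward recursion familiar from hidden Markov models. The sketch below is an illustration of that general technique, not something the slides spell out; it assumes a word may belong to several categories, and the tables `p_init`, `p_trans`, and `p_emit` are invented toy numbers.

```python
def next_word_dist(history, vocab, cats, p_init, p_trans, p_emit):
    """Distribution over the next word, summing over all category sequences.

    alpha[c] holds P(history so far, current category = c); it is updated one
    word at a time, so the cost is linear in the history length.
    """
    alpha = {c: p_init[c] * p_emit[c].get(history[0], 0.0) for c in cats}
    for w in history[1:]:
        alpha = {
            c: p_emit[c].get(w, 0.0) * sum(alpha[cp] * p_trans[cp][c] for cp in cats)
            for c in cats
        }
    total = sum(alpha.values())
    if total == 0.0:
        raise ValueError("history has zero probability under the model")
    # Probability of each category for the next position, given the history.
    next_cat = {
        c: sum(alpha[cp] * p_trans[cp][c] for cp in cats) / total for c in cats
    }
    return {w: sum(next_cat[c] * p_emit[c].get(w, 0.0) for c in cats) for w in vocab}

# Hypothetical toy model: "fish" is ambiguous between noun and verb.
cats = ["N", "V"]
vocab = ["fish", "people", "eat"]
p_init = {"N": 0.6, "V": 0.4}
p_trans = {"N": {"N": 0.2, "V": 0.8}, "V": {"N": 0.8, "V": 0.2}}
p_emit = {"N": {"fish": 0.5, "people": 0.5}, "V": {"fish": 0.5, "eat": 0.5}}

dist = next_word_dist(["fish"], vocab, cats, p_init, p_trans, p_emit)
print(dist)
```

When every word has exactly one category, the sum collapses to a single term and the next-word probability is just the product of the two factors in the formula.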

In addition, we could look at category trigrams; or we could use this (category-based) method as a back-off strategy…

Category trigrams: Prob (Wi, Ci) = Prob (Wi | Ci ) * Prob (Ci|Ci-2 Ci-1)

How to find categories? • Brown et al 1990 suggest essentially this: Set up 1,000 different “lexical categories”, each with one member: the 1,000 most frequent words. (Why 1,000?) Consider all 1000*999/2 ways of merging two of these categories into one, and pick the merge which minimizes the decrease in the mutual information between adjacent categories that you get when you pass from a system with 1,000 categories to one with 999 categories…. Repeat until you have as few categories as you want.
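The greedy procedure can be sketched as follows. This is a naive brute-force illustration, not an efficient implementation: it recomputes the mutual information from scratch for every candidate merge, and, as a simplifying assumption, it starts with every word type in the toy corpus in its own category rather than only the most frequent words. The function names and the corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def avg_mutual_information(tokens, word2cat):
    """Mutual information (in bits) between adjacent categories in the corpus."""
    cats = [word2cat[w] for w in tokens]
    pairs = Counter(zip(cats, cats[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), k in pairs.items():
        left[a] += k
        right[b] += k
    return sum(k / n * math.log2(k * n / (left[a] * right[b]))
               for (a, b), k in pairs.items())

def greedy_merge(tokens, n_categories):
    """Merge categories pairwise, each time keeping the merge that loses least MI."""
    word2cat = {w: w for w in set(tokens)}  # one category per word type
    while len(set(word2cat.values())) > n_categories:
        best_score, best_map = None, None
        for a, b in combinations(sorted(set(word2cat.values())), 2):
            # Candidate system: everything in category b moves into category a.
            trial = {w: (a if c == b else c) for w, c in word2cat.items()}
            score = avg_mutual_information(tokens, trial)
            if best_score is None or score > best_score:
                best_score, best_map = score, trial
        word2cat = best_map
    return word2cat

# Hypothetical toy corpus, for illustration only.
corpus = ("on monday we met on friday we met "
          "on monday they left on friday they left").split()
clusters = greedy_merge(corpus, 6)
print(clusters)
```

Since the mutual information before the merge is the same for every candidate pair, keeping the merged system with the highest mutual information is the same as minimizing the decrease.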

pointwise mutual information for finding collocations

Mexican hat neighborhood

Examples of inferred categories
• Friday Monday Thursday Tuesday Saturday Sunday weekends Sundays Saturdays
• People guys folks fellows CEOs chaps doubters commies unfortunates blokes
• Down backwards ashore sideways southward northward overboard aloft
• That that heat
• Head body hands eyes voice arm seat eye hair mouth
• Water coal gas liquid acid sand carbon steam shale iron
