Three Basic Problems • Compute the probability of a text: P_m(W_1,N) • Compute maximum probability tag sequence: arg max_{T_1,N} P_m(T_1,N | W_1,N) • Compute maximum likelihood model: arg max_m P_m(W_1,N)
Notation • a_ij = Estimate of P(t_i → t_j) • b_jk = Estimate of P(w_k | t_j) • A_k(i) = P(w_1,k-1, t_k = t_i) (from the Forward algorithm) • B_k(i) = P(w_k+1,N | t_k = t_i) (from the Backward algorithm)
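For concreteness, here is a minimal NumPy sketch of the Forward and Backward recurrences under this notation, using 0-based positions (pi, a, and b are assumed to be the initial tag distribution, transition matrix, and emission matrix; these names are illustrative, not from the slides):

```python
import numpy as np

def forward(words, pi, a, b):
    """A[k][i] ~ A_k(i) = P(words before position k, tag at k = t_i)."""
    N, T = len(words), len(pi)
    A = np.zeros((N, T))
    A[0] = pi                                   # no words emitted yet
    for k in range(1, N):
        # emit the previous word from each tag, then take one transition
        A[k] = (A[k - 1] * b[:, words[k - 1]]) @ a
    return A

def backward(words, a, b):
    """B[k][i] ~ B_k(i) = P(words after position k | tag at k = t_i)."""
    N, T = len(words), a.shape[0]
    B = np.zeros((N, T))
    B[N - 1] = 1.0                              # nothing left to emit
    for k in range(N - 2, -1, -1):
        B[k] = a @ (b[:, words[k + 1]] * B[k + 1])
    return B
```

With these tables, P(W | m) can be read off as (A[-1] * b[:, words[-1]]).sum(); for long texts one would work in log space or rescale each row to avoid underflow.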
EM Algorithm (Expectation-Maximization) • Start with some initial model • Compute the most likely states for each output symbol from the current model • Use this tagging to revise the model, increasing the probability of the most likely transitions and outputs • Repeat until convergence. Note: no labeled training is required!
Estimating transition probabilities Define p_k(i,j) as the probability of traversing the arc t_i → t_j at time k, given the observations:
p_k(i,j) = P(t_k = t_i, t_k+1 = t_j | W, m)
= P(t_k = t_i, t_k+1 = t_j, W | m) / P(W | m)
= A_k(i) · P(w_k | t_i) · a_ij · P(w_k+1 | t_j) · B_k+1(j) / P(W | m)
where P(W | m) = Σ_i' A_k(i') · P(w_k | t_i') · B_k(i')
Expected transitions • Define g_i(k) = P(t_k = t_i | W, m); then: g_i(k) = Σ_j p_k(i,j) • Now note that: • Expected number of transitions from tag i = Σ_k=1..N-1 g_i(k) • Expected number of transitions from tag i to tag j = Σ_k=1..N-1 p_k(i,j)
Reestimation • a'_ij = (expected transitions from tag i to tag j) / (expected transitions from tag i) = Σ_k=1..N-1 p_k(i,j) / Σ_k=1..N-1 g_i(k) • b'_jk = (expected number of times tag t_j emits word w_k) / (expected number of occurrences of tag t_j) = Σ_n: w_n = w_k g_j(n) / Σ_n=1..N g_j(n)
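Under the same assumptions, a sketch of how these expected counts and the re-estimated a' and b' might be computed from the Forward/Backward tables (p and g follow p_k(i,j) and g_i(k) as defined above; the function itself is illustrative):

```python
import numpy as np

def reestimate(words, A, B, a, b):
    """One re-estimation step: returns (a', b') from expected counts."""
    N, T = len(words), a.shape[0]
    # p[k, i, j] = P(t_k = t_i, t_k+1 = t_j | W, m)
    p = np.zeros((N - 1, T, T))
    for k in range(N - 1):
        num = ((A[k] * b[:, words[k]])[:, None]
               * a * (b[:, words[k + 1]] * B[k + 1])[None, :])
        p[k] = num / num.sum()                  # num.sum() = P(W | m)
    # g[k, i] = P(t_k = t_i | W, m)
    g = A * b[:, words].T * B
    g /= g.sum(axis=1, keepdims=True)
    # a'_ij = expected i -> j transitions / expected transitions out of i
    a_new = p.sum(axis=0) / g[:-1].sum(axis=0)[:, None]
    # b'_jw = expected emissions of word w from tag j / expected occurrences of tag j
    b_new = np.zeros_like(b)
    for k, w in enumerate(words):
        b_new[:, w] += g[k]
    b_new /= g.sum(axis=0)[:, None]
    return a_new, b_new
```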
EM Algorithm Outline • Choose an initial model = <a, b, g(1)> • Repeat until results don't improve much: • Compute p_k(i,j) based on the current model, using the Forward & Backward algorithms to compute A and B (Estimation) • Compute the new model <a', b', g'(1)> (Maximization) Note: Only guarantees a local maximum!
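Putting the pieces together, one possible shape for the overall loop, assuming the forward, backward, and reestimate sketches above are in scope (the log-likelihood convergence test and the max_iters cap are illustrative choices, not from the slide):

```python
import numpy as np

def em(words, pi, a, b, max_iters=100, tol=1e-6):
    prev = -np.inf
    for _ in range(max_iters):
        A = forward(words, pi, a, b)            # Estimation: expected counts
        B = backward(words, a, b)
        a, b = reestimate(words, A, B, a, b)    # Maximization: revised model
        # (g(1), the initial tag distribution, could be re-estimated too; kept fixed here)
        A = forward(words, pi, a, b)
        ll = np.log((A[-1] * b[:, words[-1]]).sum())   # log P(W | m)
        if ll - prev < tol:                     # only a local maximum is guaranteed
            break
        prev = ll
    return a, b
```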
Example • Tags: a, b • Words: x, y, z • z can only be tagged b • Text: x y z z y
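A hypothetical run on this toy text, encoding the constraint that z can only be tagged b by giving tag a zero probability of emitting z (a zero emission stays zero under re-estimation); the starting numbers are arbitrary, and em is the sketch above:

```python
import numpy as np

vocab = ['x', 'y', 'z']
text = [vocab.index(w) for w in ['x', 'y', 'z', 'z', 'y']]

pi = np.array([0.5, 0.5])                  # tags: 0 = a, 1 = b
a0 = np.array([[0.6, 0.4],
               [0.4, 0.6]])
b0 = np.array([[0.5, 0.5, 0.0],            # tag a never emits z
               [0.3, 0.3, 0.4]])

a_hat, b_hat = em(text, pi, a0, b0)
```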
Some extensions for HMM POS tagging • Higher-order models: P(t_i1, …, t_in → t_j) • Incorporating text features: • Output prob = P(w_i, f_j | t_k) where f is a vector of features (capitalized, ends in -d, etc.) • Combining labeled and unlabeled training (initialize with labeled, then do EM)
Transformational Tagging • Introduced by Brill (1995) • Tagger: • Construct initial tag sequence for input • Iteratively refine tag sequence by applying “transformation rules” in rank order • Learner: • Construct initial tag sequence • Loop until done: • Try all possible rules, apply the best rule r* to the sequence and add it to the rule ranking
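A minimal sketch of the tagger side: rules are represented here as (from_tag, to_tag, previous_tag) triples and applied in learned rank order; the representation is illustrative, not Brill's actual one.

```python
def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for pos in range(1, len(tags)):
        if tags[pos] == from_tag and tags[pos - 1] == prev_tag:
            new_tags[pos] = to_tag
    return new_tags

def transformation_tagger(words, initial_tag, ranked_rules):
    """Assign initial tags, then apply the ranked transformation rules in order."""
    tags = [initial_tag(w) for w in words]
    for rule in ranked_rules:
        tags = apply_rule(tags, rule)
    return tags
```

Each rule is applied against a snapshot of the current tag sequence; richer templates would simply carry more context in the rule tuple.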
[Diagram: unannotated input text → setting initial state → annotated text; the learning algorithm compares the annotated text against the ground truth for the input text and produces rules]
Learning Algorithm • May assign tag X to word w only if: • w occurred in the corpus with tag X, or • w did not occur in the corpus at all • Try to find the best transformation from some tag X to some other tag Y • Greedy algorithm: choose as the next rule the one that maximizes accuracy on the training set
Transformation Template Change tag A to tag B when: • The preceding (following) tag is Z • The tag two before (after) is Z • One of the two previous (following) tags is Z • One of the three previous (following) tags is Z • The preceding tag is Z and the following is W • The preceding (following) tag is Z and the tag two before (after) is W
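One way such templates might be encoded, purely as an illustration: each template maps a position in a tag sequence to the context value a rule of that template would test.

```python
# Each template maps (tags, pos) to the triggering context value,
# or None when the context does not apply (e.g. at a sequence boundary).
TEMPLATES = {
    'prev_tag':       lambda tags, pos: tags[pos - 1] if pos >= 1 else None,
    'tag_two_before': lambda tags, pos: tags[pos - 2] if pos >= 2 else None,
    'next_tag':       lambda tags, pos: tags[pos + 1] if pos < len(tags) - 1 else None,
    # disjunctive templates ("one of the two previous tags is Z") would return
    # the set of candidate values instead of a single tag
    'one_of_two_before': lambda tags, pos: set(tags[max(0, pos - 2):pos]) or None,
    # conjunctive template: the preceding tag is Z and the following tag is W
    'prev_and_next':  lambda tags, pos: (tags[pos - 1], tags[pos + 1])
                                        if 0 < pos < len(tags) - 1 else None,
}
```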
Initial tag annotation
while transformations can be found, do:
  for each from_tag, do:
    for each to_tag, do:
      for pos = 1 to corpus_size, do:
        if (correct_tag(pos) = to_tag && tag(pos) = from_tag) then num_good_trans(tag(pos – 1))++
        else if (correct_tag(pos) = from_tag && tag(pos) = from_tag) then num_bad_trans(tag(pos – 1))++
      find max_T (num_good_trans(T) – num_bad_trans(T))
      if this is the best score so far, store as best rule: Change from_tag to to_tag if previous tag is T
  Apply best rule to training corpus
  Append best rule to ordered list of transformations
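The loop above, restricted to the "previous tag is T" template, could look roughly like this in Python (a sketch; it reuses the apply_rule helper from the tagger sketch earlier, and all names are illustrative):

```python
from collections import defaultdict

def learn_transformations(correct_tags, tags, tagset, max_rules=50):
    """Greedily learn rules: change from_tag to to_tag if the previous tag is T."""
    rules = []
    for _ in range(max_rules):
        best = None                              # (score, from_tag, to_tag, prev_tag)
        for from_tag in tagset:
            for to_tag in tagset:
                if to_tag == from_tag:
                    continue
                good, bad = defaultdict(int), defaultdict(int)
                for pos in range(1, len(tags)):
                    if tags[pos] != from_tag:
                        continue
                    if correct_tags[pos] == to_tag:
                        good[tags[pos - 1]] += 1     # the change would fix this position
                    elif correct_tags[pos] == from_tag:
                        bad[tags[pos - 1]] += 1      # the change would break this position
                for prev_tag in good:
                    score = good[prev_tag] - bad[prev_tag]
                    if best is None or score > best[0]:
                        best = (score, from_tag, to_tag, prev_tag)
        if best is None or best[0] <= 0:
            break                                 # no transformation improves the training set
        rule = best[1:]
        rules.append(rule)                        # append to the ordered rule list
        tags = apply_rule(tags, rule)             # apply the best rule to the training corpus
    return rules
```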
Some examples 1. Change NN to VB if previous is TO • to/TO conflict/NN with → VB 2. Change VBP to VB if MD in previous three • might/MD vanish/VBP → VB 3. Change NN to VB if MD in previous two • might/MD reply/NN → VB 4. Change VB to NN if DT in previous two • might/MD the/DT reply/VB → NN
Lexicalization New templates to include dependency on surrounding words (not just tags): Change tag A to tag B when: • The preceding (following) word is w • The word two before (after) is w • One of the two preceding (following) words is w • The current word is w • The current word is w and the preceding (following) word is v • The current word is w and the preceding (following) tag is X • etc…
Initializing Unseen Words • How to choose the most likely tag for unseen words? Transformation-based approach: • Start with NP for capitalized words, NN for others • Learn transformations from: Change tag from X to Y if: • Deleting prefix (suffix) x results in a known word • The first (last) characters of the word are x • Adding x as a prefix (suffix) results in a known word • Word W ever appears immediately before (after) the word • Character Z appears in the word
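A tiny sketch of the initial guess and of one of the listed cues (deleting a suffix yields a known word); the lexicon and function names are hypothetical:

```python
def initial_unknown_tag(word):
    """Start with NP (proper noun) for capitalized words, NN for the rest."""
    return 'NP' if word[0].isupper() else 'NN'

def suffix_strip_is_known(word, suffix, lexicon):
    """Template cue: deleting suffix x results in a known word."""
    return word.endswith(suffix) and word[:-len(suffix)] in lexicon

lexicon = {'walk', 'walks', 'walked'}            # hypothetical known vocabulary
initial_unknown_tag('Paris')                     # -> 'NP'
suffix_strip_is_known('walking', 'ing', lexicon) # -> True
```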
Morphological Richness • Parts of speech really include features: • NN2 Noun(type=common,num=plural) This is more visible in other languages with richer morphology: • Hebrew nouns: number, gender, possession • German nouns: number, gender, case, ??? • And so on…