Natural Language Processing – Lesson 5: POS Tagging Algorithms
Ido Dagan, Department of Computer Science, Bar-Ilan University (course 88-680)
Supervised Learning Scheme
"Labeled" Examples → Training Algorithm → Classification Model
New Examples + Classification Model → Classification Algorithm → Classifications
Transformation-Based Learning (TBL) for Tagging • Introduced by Brill (1995) • Can exploit a wider range of lexical and syntactic regularities via transformation rules – each rule consists of a triggering environment and a rewrite rule • Tagger: • Construct an initial tag sequence for the input – the most frequent tag for each word • Iteratively refine the tag sequence by applying the "transformation rules" in rank order • Learner: • Construct an initial tag sequence for the training corpus • Loop until done: • Try all possible rules and compare to the known tags; apply the best rule r* to the sequence and add it to the rule ranking
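As a concrete illustration of the tagging phase, here is a minimal Python sketch. The rule representation – a (from_tag, to_tag, trigger) tuple with a trigger(tags, words, i) predicate – is an assumption made for this sketch, not Brill's actual data structures.

```python
def tbl_tag(words, most_frequent_tag, rules, default="NN"):
    """words: list of tokens; most_frequent_tag: dict word -> tag;
    rules: ordered list of (from_tag, to_tag, trigger) tuples, where
    trigger(tags, words, i) -> bool tests the triggering environment."""
    # Initial state: most frequent tag per word (default tag for unseen words)
    tags = [most_frequent_tag.get(w, default) for w in words]
    # Refine by applying each learned transformation rule, in rank order
    for from_tag, to_tag, trigger in rules:
        for i in range(len(words)):
            if tags[i] == from_tag and trigger(tags, words, i):
                tags[i] = to_tag
    return list(zip(words, tags))
```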
Some examples 1. Change NN to VB if the previous tag is TO • to/TO conflict/NN → VB 2. Change VBP to VB if MD is among the previous three tags • might/MD vanish/VBP → VB 3. Change NN to VB if MD is among the previous two tags • might/MD reply/NN → VB 4. Change VB to NN if DT is among the previous two tags • the/DT reply/VB → NN
Transformation Templates Specify which transformations are possible. For example: change tag A to tag B when: • The preceding (following) tag is Z • The tag two before (after) is Z • One of the two previous (following) tags is Z • One of the three previous (following) tags is Z • The preceding tag is Z and the following is W • The preceding (following) tag is Z and the tag two before (after) is W
Lexicalization New templates to include dependency on surrounding words (not just tags): Change tag A to tag B when: • The preceding (following) word is w • The word two before (after) is w • One of the two preceding (following) words is w • The current word is w • The current word is w and the preceding (following) word is v • The current word is w and the preceding (following) tag is X (Notice: word-tag combination) • etc.
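To make the templates concrete, the sketch below shows how a few of them could be instantiated into trigger functions, and how the example rules from the earlier slide would look in that form. All function names are illustrative assumptions, not taken from Brill's implementation.

```python
def prev_tag_is(z):
    # Template: "The preceding tag is Z"
    return lambda tags, words, i: i > 0 and tags[i - 1] == z

def tag_in_prev(z, k):
    # Template: "One of the k previous tags is Z"
    return lambda tags, words, i: z in tags[max(0, i - k):i]

def prev_word_is(w):
    # Lexicalized template: "The preceding word is w"
    return lambda tags, words, i: i > 0 and words[i - 1] == w

# Example rules from the earlier slide:
rule_nn_vb  = ("NN",  "VB", prev_tag_is("TO"))     # to/TO conflict/NN -> VB
rule_vbp_vb = ("VBP", "VB", tag_in_prev("MD", 3))  # might/MD ... vanish/VBP -> VB
```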
Initializing Unseen Words • How to choose the most likely tag for unseen words? Transformation-based approach: • Start with NP for capitalized words, NN for others • Learn "morphological" transformations of the form: Change tag from X to Y if: • Deleting the prefix (suffix) x results in a known word • The first (last) characters of the word are x • Adding x as a prefix (suffix) results in a known word • Word w ever appears immediately before (after) the word • Character z appears in the word
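As a hedged illustration (not Brill's exact procedure), the initialization and two of the morphological triggers listed above might look like this; KNOWN_WORDS is a placeholder for the training vocabulary.

```python
KNOWN_WORDS = set()  # vocabulary observed in the training corpus (placeholder)

def initial_unknown_tag(word):
    # Start with NP for capitalized words, NN for the rest
    return "NP" if word[:1].isupper() else "NN"

def strip_suffix_known(suffix):
    # Trigger: "Deleting suffix x results in a known word"
    return lambda word: word.endswith(suffix) and word[:-len(suffix)] in KNOWN_WORDS

def has_suffix(suffix):
    # Trigger: "The last characters of the word are x"
    return lambda word: word.endswith(suffix)
```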
TBL Learning Scheme
Unannotated Input Text → Setting Initial State → Annotated Text → Learning Algorithm → Rules
(the Ground Truth for the Input Text is also fed to the Learning Algorithm)
Greedy Learning Algorithm • Initial tagging of training corpus – most frequent tag per word • At each iteration: • Identify rules that fix errors and compute the "error reduction" for each transformation rule: • #errors fixed − #errors introduced • Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting): • Apply the best rule to the training corpus • Append the best rule to the ordered list of transformations
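A compact sketch of this greedy loop, reusing the hypothetical rule representation from the earlier sketches; scoring is brute force and ignores the efficiency tricks of real TBL implementations.

```python
def tbl_learn(words, gold_tags, tags, candidate_rules, threshold=2):
    """words, gold_tags: training corpus and its known tags;
    tags: current tagging (initially the most frequent tag per word);
    candidate_rules: every rule instantiation the templates allow."""
    learned = []
    while True:
        best_rule, best_score = None, 0
        for from_tag, to_tag, trigger in candidate_rules:
            fixed = introduced = 0
            for i in range(len(words)):
                if tags[i] == from_tag and trigger(tags, words, i):
                    if to_tag == gold_tags[i]:
                        fixed += 1          # error fixed
                    elif tags[i] == gold_tags[i]:
                        introduced += 1     # error introduced
            score = fixed - introduced      # "error reduction"
            if score > best_score:
                best_rule, best_score = (from_tag, to_tag, trigger), score
        if best_rule is None or best_score < threshold:
            break                           # threshold guards against overfitting
        from_tag, to_tag, trigger = best_rule
        for i in range(len(words)):         # apply the best rule to the corpus
            if tags[i] == from_tag and trigger(tags, words, i):
                tags[i] = to_tag
        learned.append(best_rule)           # append to the ordered rule list
    return learned
```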
Stochastic POS Tagging • POS tagging: for a given sentence W = w1…wn, find the matching POS tags T = t1…tn • In a statistical framework: T' = arg max_T P(T|W)
Bayes' Rule
T' = arg max_T P(T|W) = arg max_T P(W|T)·P(T) / P(W) = arg max_T P(W|T)·P(T) – the denominator P(W) doesn't depend on the tags
Applying the chaining rule and the Markovian assumptions:
P(W|T) ≈ P(w1|t1)·…·P(wn|tn) – words are independent of each other, and a word's identity depends only on its own tag
P(T) ≈ P(t1|t0)·P(t2|t1)·…·P(tn|tn-1)
Notation: P(t1) = P(t1|t0), where t0 is a sentence-start pseudo-tag
The Markovian assumptions • Limited horizon: P(Xi+1 = tk | X1,…,Xi) = P(Xi+1 = tk | Xi) • Time invariant: P(Xi+1 = tk | Xi) = P(Xj+1 = tk | Xj)
Maximum Likelihood Estimation • To estimate P(wi|ti) and P(ti|ti-1) we can use the maximum likelihood estimates • P(wi|ti) = c(wi,ti) / c(ti) • P(ti|ti-1) = c(ti-1,ti) / c(ti-1) • Note the estimation for i=1: t0 is the sentence-start pseudo-tag
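A minimal sketch of computing these counts and MLE estimates from a tagged corpus. The sentence-start pseudo-tag "<s>" (an assumption here) plays the role of t0 so that the i=1 case is covered.

```python
from collections import Counter

def mle_estimates(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    word_tag, tag_count, tag_bigram = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"                          # t0: sentence-start pseudo-tag
        tag_count[prev] += 1
        for w, t in sent:
            word_tag[(w, t)] += 1             # c(w_i, t_i)
            tag_count[t] += 1                 # c(t_i)
            tag_bigram[(prev, t)] += 1        # c(t_{i-1}, t_i)
            prev = t
    p_word_given_tag = {(w, t): c / tag_count[t] for (w, t), c in word_tag.items()}
    p_tag_given_prev = {(p, t): c / tag_count[p] for (p, t), c in tag_bigram.items()}
    return p_word_given_tag, p_tag_given_prev
```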
Unknown Words • Many words will not appear in the training corpus • Unknown words are a major problem for taggers (!) • Solutions: • Incorporate morphological analysis • Treat words appearing only once in the training data as UNKNOWN
Smoothing for Tagging • For P(ti|ti-1) • Optionally – for P(wi|ti)
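The slide's exact smoothing formulas are not reproduced here, so the sketch below assumes one standard choice: linear interpolation of the bigram tag model with the unigram tag distribution for transitions, and add-one smoothing for emissions. The counts are those from the MLE sketch above.

```python
def smoothed_transition(prev_tag, tag, tag_bigram, tag_count, total_tags, lam=0.9):
    # P(t_i|t_{i-1}) ~ lam * c(t_{i-1}, t_i)/c(t_{i-1}) + (1 - lam) * c(t_i)/N
    ml = tag_bigram.get((prev_tag, tag), 0) / tag_count[prev_tag]
    unigram = tag_count[tag] / total_tags
    return lam * ml + (1 - lam) * unigram

def smoothed_emission(word, tag, word_tag, tag_count, vocab_size):
    # P(w_i|t_i) ~ (c(w_i, t_i) + 1) / (c(t_i) + |V|)   (add-one smoothing)
    return (word_tag.get((word, tag), 0) + 1) / (tag_count[tag] + vocab_size)
```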
Viterbi • Finding the most probable tag sequence can be done with the Viterbi algorithm • No need to calculate every single possible tag sequence (!)
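A straightforward Viterbi sketch for the bigram tagging model; p_trans(prev_tag, tag) and p_emit(word, tag) are assumed to be probability functions, for example wrappers around the smoothed estimates sketched above.

```python
def viterbi(words, tagset, p_trans, p_emit, start="<s>"):
    """Return the most probable tag sequence for words under the bigram model."""
    # delta[i][t]: probability of the best tag sequence for w_1..w_i ending in tag t
    delta = [{t: p_trans(start, t) * p_emit(words[0], t) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tagset:
            prev = max(tagset, key=lambda p: delta[i - 1][p] * p_trans(p, t))
            delta[i][t] = delta[i - 1][prev] * p_trans(prev, t) * p_emit(words[i], t)
            back[i][t] = prev
    # Recover the sequence by following back-pointers from the best final tag
    best = max(tagset, key=lambda t: delta[-1][t])
    tags = [best]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```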
HMMs • Assume a state machine with: • Nodes that correspond to tags • A start state and an end state • Arcs corresponding to transition probabilities – P(ti|ti-1) • A set of observation likelihoods for each state – P(wi|ti)
[HMM example diagram: states AT, NN, NNS, VB, VBZ, RB; emission probabilities such as P(the)=0.4, P(a)=0.3, P(an)=0.2 for AT, P(likes)=0.3, P(flies)=0.1, P(eats)=0.5 for VBZ, and P(like)=0.2, P(fly)=0.3, P(eat)=0.36 for VB; transition probabilities such as 0.6 and 0.4 on the arcs]
HMMs • An HMM is similar to an automaton augmented with probabilities • Note that the states in an HMM do not correspond to the input symbols • The input symbols don't uniquely determine the next state
HMM definition • HMM = (S, K, A, B) • Set of states S = {s1,…,sn} • Output alphabet K = {k1,…,km} • State transition probabilities A = {aij}, i,j ∈ S • Symbol emission probabilities B = b(i,k), i ∈ S, k ∈ K • Start and end states (non-emitting) • Alternatively: initial state probabilities • Note: for a given i, Σj aij = 1 and Σk b(i,k) = 1
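A small container mirroring the (S, K, A, B) definition above; the field names and pseudo-state labels are illustrative assumptions. Each transition row and each emission row should sum to 1, as the note states.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list        # S = {s_1, ..., s_n}, here the POS tags
    alphabet: list      # K = {k_1, ..., k_m}, here the words
    trans: dict         # A: (s_i, s_j) -> a_ij, transition probabilities
    emit: dict          # B: (s_i, k) -> b(i, k), emission probabilities
    start: str = "<s>"  # non-emitting start state
    end: str = "</s>"   # non-emitting end state
```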
Why Hidden? • Because we only observe the input – the underlying states are hidden • Decoding: the problem of part-of-speech tagging can be viewed as a decoding problem: given an observation sequence W = w1,…,wn, find a state sequence T = t1,…,tn that best explains the observations
Homework