Smoothing and Other Methods for Tagging
Zeroes
• When working with n-gram models (and their variants, such as HMMs), zero probabilities can be real show-stoppers
• Examples:
  • Zero probabilities are a problem
    • p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)   (bigram model)
    • one zero and the whole product is zero (see the sketch below)
  • Zero frequencies are a problem
    • p(wn|wn-1) = C(wn-1 wn) / C(wn-1)   (relative frequency)
    • if wn-1 doesn't occur in the dataset, we're dividing by zero
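A minimal sketch (not from the slides) of how a single unseen bigram zeroes out an entire sentence probability; the toy corpus and function names are made up for illustration:

```python
from collections import Counter

# tiny hypothetical training corpus, unsmoothed bigram MLE
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_mle(prev, word):
    # relative frequency: C(prev word) / C(prev); zero for unseen bigrams
    return bigrams[(prev, word)] / unigrams[prev]

sentence = "the cat sat on the rug".split()
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= bigram_mle(prev, word)
print(p)   # 0.0 -- the unseen bigram ('the', 'rug') zeroes the whole product
```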
Smoothing
• Add-One Smoothing
  • add 1 to all frequency counts
• Unigram
  • P(w) = C(w)/N   (before Add-One)
    • N = size of corpus
  • P(w) = (C(w)+1)/(N+V)   (with Add-One; sketched below)
    • V = number of distinct words in corpus
  • adjusted count: C*(w) = (C(w)+1) · N/(N+V)
    • N/(N+V) is the normalization factor adjusting for the effective increase in corpus size caused by Add-One
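A minimal sketch of Add-One unigram estimation, assuming (as on the slide) that V counts only the word types observed in the corpus; names and toy data are illustrative:

```python
from collections import Counter

def addone_unigram_probs(tokens):
    """Add-One (Laplace) unigram probabilities: P(w) = (C(w)+1) / (N+V)."""
    counts = Counter(tokens)
    N = len(tokens)        # size of corpus
    V = len(counts)        # number of distinct words in corpus
    return lambda w: (counts[w] + 1) / (N + V)

p = addone_unigram_probs("the cat sat on the mat".split())
print(p("the"))   # (2+1)/(6+5) ~= 0.27
print(p("rug"))   # unseen word still gets (0+1)/(6+5) > 0
```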
Smoothing
• Bigram
  • P(wn|wn-1) = C(wn-1 wn)/C(wn-1)   (before Add-One)
  • P(wn|wn-1) = (C(wn-1 wn)+1)/(C(wn-1)+V)   (after Add-One; sketched below)
  • adjusted count: C*(wn-1 wn) = (C(wn-1 wn)+1) · C(wn-1)/(C(wn-1)+V)
• N-gram
  • P(wn|wn-k ... wn-1) = (C(wn-k ... wn)+1) / (C(wn-k ... wn-1)+V)
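The same idea for bigrams, as a minimal sketch (illustrative names and toy data):

```python
from collections import Counter

def addone_bigram_probs(tokens):
    """Add-One bigram probabilities: P(wn|wn-1) = (C(wn-1 wn)+1) / (C(wn-1)+V)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)      # number of distinct words in corpus
    return lambda prev, word: (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

p = addone_bigram_probs("the cat sat on the mat".split())
print(p("the", "cat"))   # (1+1)/(2+5) ~= 0.29
print(p("the", "rug"))   # unseen bigram now gets (0+1)/(2+5) > 0
```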
Smoothing
• Add-One Smoothing
  • adjusted count: C*(wn-1 wn) = (C(wn-1 wn)+1) · C(wn-1)/(C(wn-1)+V)
• Remarks: the perturbation problem
  • Add-One causes large changes in some frequencies because of the relative size of V (here V = 1616)
  • e.g., the count for "want to" drops from 786 to 338 after adjustment (worked through below)
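A quick check of that drop, as a sketch: the slide gives C(want to) = 786 and V = 1616; the value C(want) = 1215 is an assumption (it is not on the slide), chosen because it reproduces the quoted figure of 338:

```python
# Reconstituted Add-One count for the bigram "want to"
C_bigram = 786    # C(want to), from the slide
C_prev   = 1215   # C(want) -- assumed, not given on the slide
V        = 1616   # vocabulary size, from the slide

c_star = (C_bigram + 1) * C_prev / (C_prev + V)
print(round(c_star))   # ~338 -- a large drop from 786
```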
Smoothing
• Other smoothing techniques:
  • Add-delta smoothing
    • P(wn|wn-1) = (C(wn-1 wn) + δ) / (C(wn-1) + δV)
    • similar perturbations to Add-One
  • Witten-Bell Discounting
    • equate zero-frequency items with frequency-1 items
    • use the frequency of things seen once to estimate the frequency of things we haven't seen yet
    • smaller impact than Add-One
  • Good-Turing Discounting
    • Nc = number of N-grams with frequency c
    • re-estimate c as c* = (c+1) · Nc+1 / Nc   (sketched below)
  • Will talk about these and other methods later (n-grams)
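A minimal sketch of the Good-Turing count re-estimation (names and toy data are mine; practical implementations also smooth the Nc values, which this sketch does not):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Re-estimate counts with c* = (c+1) * N_{c+1} / N_c."""
    freq_of_freqs = Counter(ngram_counts.values())    # N_c
    restated = {}
    for ngram, c in ngram_counts.items():
        n_c, n_c1 = freq_of_freqs[c], freq_of_freqs[c + 1]
        # fall back to the raw count when N_{c+1} is zero (sparse high counts)
        restated[ngram] = (c + 1) * n_c1 / n_c if n_c1 else c
    return restated

tokens = "the cat sat on the mat the cat slept".split()
print(good_turing_counts(Counter(zip(tokens, tokens[1:]))))
```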
Other Methods for Tagging
Transformation Based Tagging
Transformation Based Tagging
• Explained in Brill 1995
• Basic method:
  • Assign each word its most likely tag (the "stupid tagger"; sketched below)
    • for example, race would be tagged NN rather than VB because
      • P(NN|race) = 0.98
      • P(VB|race) = 0.02
  • Alter the assigned tags using transformations
    • transformations are based on context
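A minimal sketch of that most-likely-tag baseline (the corpus and function names are illustrative, not Brill's):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Learn the most likely tag for each word from a tagged corpus."""
    tag_counts = defaultdict(Counter)              # word -> Counter of tags
    for word, tag in tagged_corpus:
        tag_counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

# hypothetical toy corpus: 'race' is usually NN, occasionally VB
corpus = [("the", "DT"), ("race", "NN"), ("is", "VBZ"),
          ("to", "TO"), ("race", "VB"), ("fast", "RB"),
          ("a", "DT"), ("race", "NN")]
print(train_baseline(corpus)["race"])   # 'NN' -- the initial (sometimes wrong) tag
```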
Transformation Based Tagging
• Transformations based on context
• Contextual "triggers" can be just about anything:
  • Preceding tag (a rule-application sketch follows this list)
    • NN → VB, previous tag is TO
  • One of the preceding n tags
    • VBP → VB, one of the previous three tags is MD (modal, as in "you may read")
  • Next tag
    • JJR → RBR, next tag is JJ ("a more valuable player")
  • One of the preceding n words
    • VBP → VB, a preceding word is n't ("should n't read")
  • And others (morphological triggers), or combinations
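As an illustration, a minimal sketch (not Brill's code) of applying the first rule above, "NN → VB when the previous tag is TO":

```python
def apply_prev_tag_rule(tags, from_tag, to_tag, trigger_prev_tag):
    """Rewrite from_tag as to_tag wherever the preceding tag matches the trigger."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == trigger_prev_tag:
            out[i] = to_tag
    return out

tags = ["TO", "NN", "RB"]                            # 'race' mis-tagged as NN after 'to'
print(apply_prev_tag_rule(tags, "NN", "VB", "TO"))   # ['TO', 'VB', 'RB']
```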
Transformation Based Tagging
• Still a learning-based method (like HMMs)
  • Input: a correctly tagged corpus
  • Output: transformation rules
• Apply rules iteratively to the corpus until some threshold is reached
  • Transformation rules are drawn from a set of hand-written "metarules" (templates), such as:
    • Tag A → Tag B when the preceding word is z
  • The transformation rules output are those that reduce the error to some prespecified threshold (greedy loop sketched below)
• Then apply the most likely tags and the learned transformations to some raw corpus
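A minimal sketch of that greedy learning loop (simplified; candidate rules would be generated from the metarule templates, and every name here is illustrative rather than Brill's):

```python
def tbl_learn(words, gold_tags, current_tags, candidate_rules, min_gain=1):
    """candidate_rules: list of (name, apply_fn) pairs, where
    apply_fn(tags, words) returns a new tag sequence."""
    learned = []
    while True:
        errors = sum(a != b for a, b in zip(current_tags, gold_tags))
        best_rule, best_tags, best_gain = None, None, 0
        for name, apply_fn in candidate_rules:
            new_tags = apply_fn(current_tags, words)
            gain = errors - sum(a != b for a, b in zip(new_tags, gold_tags))
            if gain > best_gain:
                best_rule, best_tags, best_gain = name, new_tags, gain
        if best_rule is None or best_gain < min_gain:
            break                        # stop: no candidate reduces errors enough
        learned.append(best_rule)        # commit the best rule and repeat
        current_tags = best_tags
    return learned, current_tags
```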
Transformation Based Tagging
• Benefits:
  • Can be used for unsupervised learning
    • Brill 1995 describes a tagger that achieves 95.6% accuracy, which is quite high for unsupervised learning
  • Doesn't overtrain, which can happen with HMM taggers
• Tagger available for download from:
  • http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
• Paper from:
  • http://citeseer.ist.psu.edu/brill95transformationbased.html