Corpora and Statistical Methods Lecture 10. Albert Gatt. Part 1. POS Tagging continued. Transformation-based error-driven learning. Transformation-based learning. Approach proposed by Brill (1995) uses quantitative information at training stage outcome of training is a set of rules
Corpora and Statistical Methods, Lecture 10. Albert Gatt
Part 1 POS Tagging continued
Transformation-based learning • Approach proposed by Brill (1995) • uses quantitative information at training stage • outcome of training is a set of rules • tagging is then symbolic, using the rules • Components: • a set of transformation rules • learning algorithm
Transformations • General form: t1 → t2 • “replace t1 with t2 if certain conditions are satisfied” • Examples: • Morphological: Change the tag from NN to NNS if the word has the suffix "s" • dogs_NN → dogs_NNS • Syntactic: Change the tag from NN to VB if the word occurs after "to" • to_TO go_NN → to_TO go_VB • Lexical: Change the tag to JJ if deleting the prefix "un" results in a word. • uncool_XXX → uncool_JJ • uncle_NN -/-> uncle_JJ
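The three example transformations can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the (word, tag) list representation and the tiny lexicon are assumptions made here, not part of Brill's tagger.

```python
def suffix_rule(tagged, from_tag, to_tag, suffix):
    """Morphological: retag from_tag as to_tag when the word ends in suffix."""
    return [(w, to_tag if t == from_tag and w.endswith(suffix) else t)
            for w, t in tagged]

def after_word_rule(tagged, from_tag, to_tag, prev_word):
    """Syntactic: retag from_tag as to_tag when the preceding word is prev_word."""
    out = list(tagged)
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        if tag == from_tag and tagged[i - 1][0] == prev_word:
            out[i] = (word, to_tag)
    return out

def prefix_deletion_rule(tagged, to_tag, prefix, lexicon):
    """Lexical: retag as to_tag when stripping prefix leaves a known word."""
    return [(w, to_tag if w.startswith(prefix) and w[len(prefix):] in lexicon else t)
            for w, t in tagged]

lexicon = {"cool", "dog", "go"}  # hypothetical word list
print(suffix_rule([("dogs", "NN")], "NN", "NNS", "s"))
print(after_word_rule([("to", "TO"), ("go", "NN")], "NN", "VB", "to"))
print(prefix_deletion_rule([("uncool", "XXX"), ("uncle", "NN")], "JJ", "un", lexicon))
```

In this sketch "uncle" keeps its NN tag because "cle" is not in the lexicon, which mirrors the uncle_NN -/-> uncle_JJ case on the slide.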
Learning • Unannotated text is passed to the initial state annotator • e.g. assign each word its most frequent tag in a dictionary • Truth: a manually annotated version of the corpus against which to compare • Learner: learns rules by comparing the output of the initial state annotator to the Truth • Output: rules
Learning algorithm • Simple iterative process: • apply a rule to the corpus • compare to the Truth • if error rate is reduced, keep the results • A priori specifications: • how initial state annotator works • the space of possible transformations • Brill (1995) used a set of initial templates • the function to compare the result of applying the rules to the truth
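The iterative process can be sketched as a greedy loop. This is a minimal sketch under stated assumptions, not Brill's implementation: it uses a single hypothetical template ("change tag a to tag b when the preceding tag is z"), scores rules by raw error count against the Truth, and all function names are invented for illustration.

```python
from itertools import product

def apply_rule(tags, rule):
    """Apply (a, b, z): change a to b when the preceding tag is z.

    Conditions are checked against the input list, not the partially
    rewritten output.
    """
    a, b, z = rule
    return [b if i > 0 and t == a and tags[i - 1] == z else t
            for i, t in enumerate(tags)]

def errors(tags, truth):
    """Number of positions where the current tagging disagrees with the Truth."""
    return sum(t != g for t, g in zip(tags, truth))

def learn_rules(initial, truth, tagset, max_rules=10):
    """Greedily keep the rule that reduces the error count most, until none helps."""
    current, learned = list(initial), []
    candidates = [(a, b, z) for a, b, z in product(tagset, repeat=3) if a != b]
    for _ in range(max_rules):
        best = min(candidates, key=lambda r: errors(apply_rule(current, r), truth))
        if errors(apply_rule(current, best), truth) >= errors(current, truth):
            break  # no candidate reduces the error rate
        current = apply_rule(current, best)
        learned.append(best)
    return learned, current

initial = ["TO", "NN", "DT", "NN"]   # output of a hypothetical initial state annotator
truth = ["TO", "VB", "DT", "NN"]     # manually annotated tags
learned, final = learn_rules(initial, truth, ["TO", "NN", "VB", "DT"])
print(learned, final)
```

The a priori specifications listed on the slide map onto the three arguments that would vary in practice: the initial annotator (`initial`), the space of transformations (`candidates`) and the comparison function (`errors`).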
Non-lexicalised rule templates • Take only tags into account, not the shape of words Change tag a to tag b when: • The preceding (following) word is tagged z. • The word two before (after) is tagged z. • One of the three preceding (following) words is tagged z. • The preceding (following) word is tagged z and the word two before (after) is tagged w. • …
Lexicalised rule templates • Take into account specific words in the context Change tag a to tag b when: • The preceding (following) word is w. • The word two before (after) is w. • The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t. • …
Morphological rule templates • Useful for completely unknown words. Sensitive to the word’s “shape”. Change the tag of an unknown word (from X) to Y if: • Deleting the prefix (suffix) x, |x| ≤ 4, results in a word • The first (last) (1,2,3,4) characters of the word are x. • Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4). • Word w ever appears immediately to the left (right) of the word. • Character z appears in the word. • …
Order-dependence of rules • Rules are triggered by environments satisfying their conditions • E.g. “A → B if preceding tag is A” • Suppose our sequence is “AAAA” • Two possible forms of rule application: • immediate effect: applications of the same transformation can influence each other • result: ABAB • delayed effect: the rule is triggered multiple times from the same initial input • result: ABBB • Brill (1995) opts for the delayed-effect solution
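The difference between the two application modes can be checked with a short Python sketch. The function names are hypothetical; the rule and the input sequence are the ones from the slide.

```python
def apply_delayed(tags, a="A", b="B", prev="A"):
    """Delayed effect: every condition is tested against the original input."""
    return [b if i > 0 and t == a and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def apply_immediate(tags, a="A", b="B", prev="A"):
    """Immediate effect: each rewrite is visible to later positions (left to right)."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

print("".join(apply_immediate(list("AAAA"))))  # ABAB
print("".join(apply_delayed(list("AAAA"))))    # ABBB
```

With immediate effect, the B written at position 2 blocks the rule at position 3, which is why the outputs differ.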
More on Transformation-based tagging • Can be used for unsupervised learning • like HMM-based tagging, the only info available is the allowable tags for each word • takes advantage of the fact that most words have only one tag • E.g. word can = NN in context AT ___ BEZ because most other words in this context are NN • therefore, learning algorithm would learn the rule “change tag to NN in context AT ___ BEZ” • Unsupervised method achieves 95.6% accuracy!!
Limitations of HMMs • An HMM tagger relies on: • P(tag|previous tag) • P(word|tag) • these are combined by multiplication • TBL includes many other useful features which are hard to model in HMM: • prefixes, suffixes • capitalisation • … • Can we combine both, i.e. have HMM-style tagging with multiple features?
The rationale • In order to tag a word, we consider its context or “history” h. We want to estimate a probability distribution p(h,t) from sparse data. • h is encoded in terms of features (e.g. morphological features, surrounding tag features etc) • There are some constraints on these features that we discover from training data. • We want our model to make the fewest possible assumptions beyond these constraints.
Motivating example
• Suppose we wanted to tag the word w.
• Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, …}
• The probabilistic tagging model that makes fewest assumptions assigns a uniform distribution over the tags: $P(t) = \frac{1}{45}$ for every $t \in T$
Motivating example
• Suppose we find that the possible tags for w are NN, JJ, NNS, VB.
• We therefore impose our first constraint on the model: $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$
• (and the prob. of every other tag is 0)
• The simplest model satisfying this constraint: $P(NN) = P(JJ) = P(NNS) = P(VB) = \frac{1}{4}$
Motivating example
• We suddenly discover that w is tagged as NN or NNS 8 out of 10 times.
• Model now has two constraints: $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$ and $P(NN) + P(NNS) = \frac{8}{10}$
• Again, we require our model to make no further assumptions. The simplest distribution splits the probability mass evenly within NN/NNS and evenly among the remaining tags:
• P(NN) = 4/10
• P(NNS) = 4/10
• P(JJ) = 1/10
• P(VB) = 1/10
Motivating example
• We suddenly discover that verbs (VB) occur 1 in every 20 words.
• Model now has three constraints: $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$, $P(NN) + P(NNS) = \frac{8}{10}$ and $P(VB) = \frac{1}{20}$
• Simplest distribution is now:
• P(NN) = 4/10
• P(NNS) = 4/10
• P(JJ) = 3/20
• P(VB) = 1/20
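The distributions in the motivating example can be reproduced numerically. The sketch below is an illustration only: it relies on the fact that, when the constraints fix the total mass of disjoint groups of tags, the maximum entropy choice is to split each group's mass evenly. The helper names and the "skewed" comparison distribution are assumptions introduced here.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {tag: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def spread_evenly(groups):
    """Split each group's total mass evenly among its tags.

    groups: list of (tags, mass) pairs over disjoint tag groups.
    """
    return {t: mass / len(tags) for tags, mass in groups for t in tags}

# One constraint: only NN, JJ, NNS, VB are possible
step2 = spread_evenly([(("NN", "JJ", "NNS", "VB"), 1.0)])
# Two constraints: NN or NNS 8 times out of 10
step3 = spread_evenly([(("NN", "NNS"), 0.8), (("JJ", "VB"), 0.2)])
# Three constraints: VB occurs 1 in 20 times, JJ takes the remainder
step4 = spread_evenly([(("NN", "NNS"), 0.8), (("JJ",), 0.15), (("VB",), 0.05)])

# A hypothetical alternative that satisfies the same three constraints as step4
skewed = {"NN": 0.7, "NNS": 0.1, "JJ": 0.15, "VB": 0.05}

print(step4)
print(entropy(step2), entropy(step3), entropy(step4), entropy(skewed))
```

Each added constraint lowers the entropy of the solution, and the evenly spread distribution has higher entropy than the skewed one even though both satisfy the same constraints.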
What we’ve been doing • Maximum entropy builds a distribution by continuously adding features. • Each feature picks out a subset of the training observations. • For each feature, we add a constraint on our total distribution. • Our task is then to find the best distribution given the constraints.
Features for POS Tagging • Each tagging decision for a word occurs in a specific context or “history” h. • For tagging, we consider as context: • the word itself • morphological properties of the word • other words surrounding the word • previous tags • For each relevant aspect of the context hi, we can define a feature fj that allows us to learn how well that aspect is associated with a tag ti. • Probability of a tag given a context is a weighted function of the features.
Features for POS Tagging • In a maximum entropy model, this information is captured by a binary or indicator feature • each feature fi has a weight αi reflecting its importance • NB: each αi is uniquely associated with a feature
Features for POS Tagging in Ratnaparkhi (1996) • Had three sets of features, for non-rare, rare and all words:
Features for POS Tagging • Given the large number of possible features, which ones will be part of the model? • We do not want redundant features • We do not want unreliable and rarely occurring features (avoid overfitting) • Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data
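The form of the model
• Features fj and their parameters αj are used to compute the probability p(hi, ti): $p(h_i, t_i) = \frac{1}{Z} \prod_{j} \alpha_j^{f_j(h_i, t_i)}$
• where j ranges over features & Z is a normalisation constant
• Transform into a linear equation by taking logs: $\log p(h_i, t_i) = -\log Z + \sum_{j} f_j(h_i, t_i) \log \alpha_j$
Conditional probabilities
• The conditional probabilities can be computed based on the joint probabilities: $p(t_i \mid h_i) = \frac{p(h_i, t_i)}{\sum_{t' \in T} p(h_i, t')}$
• Probability of a sequence of tags given a sequence of words: $p(t_1, \ldots, t_n \mid w_1, \ldots, w_n) = \prod_{i=1}^{n} p(t_i \mid h_i)$
• NB: unlike an HMM, we have one probability here.
• we directly estimate p(t|h)
• model combines all features in hi into a single estimate
• no limit in principle on what features we can take into account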
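A small Python sketch shows how indicator features and their weights could combine into a conditional tag distribution. Everything specific here is an assumption for illustration: the two features, the weights 3.0 and 2.0, the dict representation of the history and the three-tag tagset are invented, and the product-of-weights form is the standard MaxEnt parameterisation rather than anything recovered from the slide images.

```python
import math

def score(features, alphas, h, t):
    """Unnormalised score: product of alpha_j ** f_j(h, t) over all features."""
    return math.prod(a ** f(h, t) for f, a in zip(features, alphas))

def p_tag_given_history(features, alphas, h, tagset):
    """Normalise the scores over the tagset to get p(t | h)."""
    scores = {t: score(features, alphas, h, t) for t in tagset}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

# Hypothetical binary (indicator) features over a history dict and a tag
features = [
    lambda h, t: int(h["word"].endswith("ing") and t == "VBG"),
    lambda h, t: int(h["prev_tag"] == "DT" and t == "NN"),
]
alphas = [3.0, 2.0]  # hypothetical weights, one per feature

h = {"word": "moving", "prev_tag": "TO"}
dist = p_tag_given_history(features, alphas, h, ["VBG", "NN", "JJ"])
print(dist)

# The log of the score is linear in the feature values
log_score = sum(f(h, "VBG") * math.log(a) for f, a in zip(features, alphas))
print(log_score, math.log(score(features, alphas, h, "VBG")))
```

Only the first feature fires for (h, VBG), so VBG gets score 3 against 1 for each of the other tags. Because the normaliser Z cancels when dividing, the conditional distribution needs only a sum over tags, not over histories.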
The use of constraints
• Every feature we have imposes a constraint or expectation on the probability model. We want: $E_p f_j = E_{\tilde{p}} f_j$
• Where:
• the model p’s expectation of fj: $E_p f_j = \sum_{h,t} p(h,t) f_j(h,t)$
• the empirical expectation of fj: $E_{\tilde{p}} f_j = \sum_{h,t} \tilde{p}(h,t) f_j(h,t)$
Why maximum entropy?
• Recall that entropy is a measure of uncertainty in a distribution.
• Without any knowledge, simplest distribution is uniform
• uniform distributions have the highest entropy
• As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints: $p^* = \arg\max_{p \in P} H(p)$
• where P is the set of possible distributions with $E_p f_j = E_{\tilde{p}} f_j$ for every feature fj
• p* is unique and has the form given earlier
• Basically, an application of Occam’s razor: make no further assumptions than necessary.
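Training 1: computing empirical expectation
• Recall that: $E_{\tilde{p}} f_j = \sum_{h,t} \tilde{p}(h,t) f_j(h,t) = \frac{1}{m} \sum_{i=1}^{m} f_j(h_i, t_i)$, where m is the number of training tokens
• Suppose we are interested in the feature: $f_j(h_i, t_i) = 1$ if the current word is moving and $t_i = \mathrm{VBG}$, and 0 otherwise
• In a corpus of 10k words + tags, where the word moving occurs as VBG 20 times: $E_{\tilde{p}} f_j = \frac{20}{10000} = 0.002$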
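The empirical expectation is just a relative frequency, as this Python sketch illustrates. The toy corpus is fabricated to match the counts on the slide (20 occurrences of moving/VBG in 10,000 tokens), and the feature is assumed to fire when the current word is "moving" and the tag is VBG.

```python
def empirical_expectation(feature, corpus):
    """Average value of a feature over the observed (history, tag) pairs."""
    return sum(feature(h, t) for h, t in corpus) / len(corpus)

def f_moving_vbg(h, t):
    """Assumed indicator feature: current word is 'moving' and tag is VBG."""
    return int(h["word"] == "moving" and t == "VBG")

# Toy corpus: 20 tokens of moving/VBG plus 9,980 filler tokens
corpus = [({"word": "moving"}, "VBG")] * 20 + [({"word": "the"}, "DT")] * 9980
expectation = empirical_expectation(f_moving_vbg, corpus)
print(expectation)
```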
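Training 2: computing model expectation
• Recall that: $E_p f_j = \sum_{h,t} p(h,t) f_j(h,t)$
• Requires sum over all possible histories and tags!
• Approximate by computing model expectation of the feature on training data only: $E_p f_j \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t \in T} p(t \mid h_i) f_j(h_i, t)$, where h1, ..., hm are the histories observed in training
Learning the optimal parameters
• Our goal is to learn the best parameter αj for each feature fj, such that: $E_p f_j = E_{\tilde{p}} f_j$
• i.e.: $\sum_{h,t} p(h,t) f_j(h,t) = \sum_{h,t} \tilde{p}(h,t) f_j(h,t)$
• One method is Generalised Iterative Scaling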
Generalised Iterative Scaling: assumptions
• for all (h,t), features sum to a constant value: $\sum_{j} f_j(h,t) = C$
• If this is not the case, we set C to: $C = \max_{h,t} \sum_{j} f_j(h,t)$
• and add a filler feature fl, such that: $f_l(h,t) = C - \sum_{j} f_j(h,t)$
Generalised Iterative Scaling: assumptions (II)
• for all (h,t), there is at least one feature f which is active, i.e.: $\exists f_j : f_j(h,t) = 1$
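Generalised Iterative Scaling
• Input: Features f1, ..., fn and empirical distribution $\tilde{p}(h,t)$
• Output: Optimal parameter values α1, ..., αn
1. Initialise αi = 1 (i.e. log αi = 0) for all i ∈ {1, 2, ..., n}
2. For each i do:
  • set $E_p f_i$ to the model expectation of fi under the current parameters
  • set $\alpha_i \leftarrow \alpha_i \left( \frac{E_{\tilde{p}} f_i}{E_p f_i} \right)^{1/C}$
3. If model has not converged, repeat from (2)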
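The procedure can be sketched in Python for a conditional model p(t|h). This is a hedged illustration rather than Ratnaparkhi's implementation: it assumes the product form p(t|h) ∝ ∏ αj^fj(h,t), approximates the model expectation on the training histories only, runs a fixed number of iterations instead of testing convergence, and uses an invented toy data set in which the history is just the word.

```python
import math

def train_gis(data, tagset, features, iterations=100):
    """Generalised Iterative Scaling for p(t | h) with non-negative features."""
    histories = [h for h, _ in data]
    # Features must sum to a constant C: take the maximum and add a filler feature.
    C = max(sum(f(h, t) for f in features) for h in histories for t in tagset)
    filler = lambda h, t: C - sum(f(h, t) for f in features)
    feats = list(features) + [filler]

    n = len(data)
    empirical = [sum(f(h, t) for h, t in data) / n for f in feats]
    alphas = [1.0] * len(feats)  # multiplicative weights start at 1

    def p_tags(h):
        scores = {t: math.prod(a ** f(h, t) for f, a in zip(feats, alphas))
                  for t in tagset}
        z = sum(scores.values())
        return {t: s / z for t, s in scores.items()}

    for _ in range(iterations):
        dists = [p_tags(h) for h in histories]
        # Model expectation, computed over the training histories only
        model = [sum(d[t] * f(h, t) for h, d in zip(histories, dists) for t in tagset) / n
                 for f in feats]
        # Scale each weight by (empirical / model) ** (1 / C); skip if model is 0
        alphas = [a * (e / m) ** (1.0 / C) if m > 0 else a
                  for a, e, m in zip(alphas, empirical, model)]
    return alphas, p_tags

# Hypothetical toy data: the history is just the word
data = [("can", "N")] * 3 + [("can", "V")] + [("run", "V")] * 3 + [("run", "N")]
features = [
    lambda h, t: int(h == "can" and t == "N"),
    lambda h, t: int(h == "run" and t == "V"),
]
alphas, p_tags = train_gis(data, ["N", "V"], features)
print(p_tags("can"), p_tags("run"))
```

On this toy data the learned model matches the empirical proportions (3 out of 4 for each word's majority tag), which is what satisfying the expectation constraints amounts to here.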
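Tagging sequences
• We want to tag a sequence w1, ..., wn
• This can be decomposed into: $p(t_1, \ldots, t_n \mid w_1, \ldots, w_n) = \prod_{i=1}^{n} p(t_i \mid h_i)$
• The history hi consists of the words around position i (including wi itself) and the previous tags t1, ..., ti−1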
Finding the best tag sequence: beam search (Ratnaparkhi, 1996)
• To find the best sequence of n tags using a beam of size N.
• sij = the jth highest probability tag sequence up to word i.
1. Generate all tags for w1, find the top N tags, and set s1j for 1 ≤ j ≤ N
2. for i = 2 to n do:
  a. for j = 1 to N do: generate tags for wi given s(i-1)j, and append each tag to s(i-1)j to create a new sequence
  b. Find the N highest probability sequences generated by loop 2a, and set sij for 1 ≤ j ≤ N
3. Return sn1
Worked example • Suppose our data consists of the sequence: a, b, c • Assume the correct tags are A, B, C • Assume that N = 1 (i.e. we only ever consider the top most likely tag)
Worked Example • Step 1: generate all possible tags for a: A, B, C • find the most likely tag for a: A (so s11 = A)
Worked Example • Step 2: generate all possible tags for b: A, B, C • merge with s11: A-A, A-B, A-C • Find the most likely sequence: A-B (so s21 = A-B)
Worked Example • Step 3: generate all possible tags for c: A, B, C • merge with s21: A-B-A, A-B-B, A-B-C • Find the most likely: A-B-C (so s31 = A-B-C)
Worked Example • Return s31 (=A-B-C)
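The beam search and the worked example can be reproduced with a short Python sketch. The function name, the `p_tag` callback signature and the toy probabilities (0.8 for the matching uppercase tag, 0.1 otherwise) are assumptions made for illustration; a real tagger would plug in the MaxEnt estimate of p(t|h).

```python
def beam_search(words, tagset, p_tag, beam_size):
    """Keep the beam_size most probable tag sequences after each word.

    p_tag(words, i, prev_tags, tag) returns the probability of tag at position i
    given the words and the tags chosen so far.
    """
    beam = [((), 1.0)]  # (tag sequence so far, its probability)
    for i in range(len(words)):
        candidates = [(seq + (t,), prob * p_tag(words, i, seq, t))
                      for seq, prob in beam for t in tagset]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][0]  # the highest-probability sequence

def toy_p_tag(words, i, prev_tags, tag):
    """Hypothetical model: the uppercase form of the word is the likely tag."""
    return 0.8 if tag == words[i].upper() else 0.1

best = beam_search(["a", "b", "c"], ["A", "B", "C"], toy_p_tag, beam_size=1)
print(best)
```

With `beam_size=1` this is greedy left-to-right tagging, as in the worked example. A larger beam lets a sequence that starts with a lower-probability tag recover later.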
HMM vs MaxEnt • Standard HMMs cannot compute conditional probability directly. • E.g. for tagging: • we want p(t1,n|w1,n) • we obtain it via Bayes’ rule, combining p(w1,n|t1,n) with the prior p(t1,n) • HMMs are generative models which optimise p(w1,n|t1,n) • By contrast, a MaxEnt Markov Model (MEMM) is a discriminative model which optimises p(t1,n|w1,n) directly.
Graphically (after Jurafsky & Martin 2009) • HMM has separate models for P(w|t) and for P(t) • MEMM has a single model to estimate P(t|w)
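More formally…
• With an HMM: $\hat{T} = \arg\max_T P(T \mid W) = \arg\max_T P(W \mid T) P(T) = \arg\max_T \prod_i P(w_i \mid t_i) P(t_i \mid t_{i-1})$
• With a MEMM: $\hat{T} = \arg\max_T P(T \mid W) = \arg\max_T \prod_i P(t_i \mid w_i, t_{i-1})$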
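Adapting Viterbi
• We can easily adapt the Viterbi algorithm to find the best state sequence in a MEMM.
• Recall that with HMMs: $v_t(j) = \max_i v_{t-1}(i) \, P(s_j \mid s_i) \, P(o_t \mid s_j)$
• Adaptation for MEMMs: $v_t(j) = \max_i v_{t-1}(i) \, P(s_j \mid s_i, o_t)$
• where vt(j) is the probability of the best path ending in state sj at time t, and ot is the observation (word) at time t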
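The MEMM version of Viterbi can be sketched in Python as follows. The function name, the start symbol and the toy `p_tag` model are hypothetical; the only point illustrated is that each step multiplies by a single conditional P(tag | previous tag, word) instead of separate transition and emission probabilities.

```python
def viterbi_memm(words, tagset, p_tag, start="<s>"):
    """Best tag sequence under a MEMM.

    p_tag(tag, prev_tag, word) returns P(tag | prev_tag, word).
    """
    # v[i][t]: probability of the best path ending in tag t at position i
    v = [{t: p_tag(t, start, words[0]) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tagset:
            best_prev = max(tagset, key=lambda s: v[i - 1][s] * p_tag(t, s, words[i]))
            v[i][t] = v[i - 1][best_prev] * p_tag(t, best_prev, words[i])
            back[i][t] = best_prev
    # Follow the backpointers from the best final tag
    path = [max(tagset, key=lambda t: v[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

def toy_p_tag(tag, prev_tag, word):
    """Hypothetical conditional model that ignores prev_tag."""
    return 0.8 if tag == word.upper() else 0.1

path = viterbi_memm(["a", "b", "c"], ["A", "B", "C"], toy_p_tag)
print(path)
```

Unlike beam search with a small beam, this search is exact for a model that conditions only on the previous tag, at a cost quadratic in the size of the tagset per word.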
Summary • MaxEnt is a powerful classification model with some advantages over HMM: • direct computation of conditional probabilities from the training data • can handle multiple features • Introduced for POS tagging by Ratnaparkhi (1996). Used for many other applications since then.