Corpora and Statistical Methods Albert Gatt
POS Tagging • Assign each word in continuous text a tag indicating its part of speech. • Essentially a classification problem. • Current state of the art: • taggers typically have 96-97% accuracy • figure evaluated on a per-word basis • in a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence
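A quick back-of-the-envelope check of that last bullet, assuming the 96% per-word figure and 20-word sentences given above (and, for the sentence-level figure, independent errors, which is my simplification):

```python
# Per-word accuracy vs. whole-sentence accuracy (illustrative arithmetic only).
per_word_accuracy = 0.96
avg_sentence_length = 20

expected_errors = (1 - per_word_accuracy) * avg_sentence_length   # ~0.8 errors per sentence
sentence_accuracy = per_word_accuracy ** avg_sentence_length      # ~0.44 if errors were independent

print(f"expected errors per 20-word sentence: {expected_errors:.1f}")
print(f"chance of a fully correct sentence:   {sentence_accuracy:.2f}")
```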
Sources of difficulty in POS tagging • Mostly due to ambiguity: many words have more than one possible tag • need context to make a good guess about POS • context alone won’t suffice • A simple approach which assigns each word only its most common tag already achieves about 90% accuracy!
The information sources • Syntagmatic information: the tags of other words in the context of w • Not sufficient on its own. E.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy • Lexical information (“dictionary”): most common tag(s) for a given word • e.g. in English, many nouns can be used as verbs (flour the pan, wax the car…) • however, their most likely tag remains NN • the distribution of a word’s usages across different POSs is uneven: usually one tag is highly likely, the others much less so
Tagging in other languages (than English) • In English, heavy reliance on context is a good idea, because word order is relatively fixed • In free word order languages, context is a less reliable cue • Compensation: these languages typically have rich morphology • a good source of clues for a tagger
Evaluation and error analysis • Training a statistical POS tagger requires splitting the corpus into training and test data. • Often, we need a development set as well, to tune parameters. • Using (n-fold) cross-validation is a good idea to save data: • randomly divide the data into train + test • train, and evaluate on the test portion • repeat n times and take the average • NB: cross-validation requires the whole corpus to be kept blind (unexamined). • If we want to examine the training data, it is best to have fixed training & test sets, perform cross-validation on the training data, and do the final evaluation on the test set.
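The splitting scheme above can be sketched in a few lines of Python; train_tagger and evaluate are hypothetical placeholders for whatever tagger is being tested:

```python
import random

def n_fold_cross_validation(tagged_sentences, n=10, seed=0):
    """Average test accuracy over n train/test splits (n-fold cross-validation)."""
    data = list(tagged_sentences)
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]               # n roughly equal folds
    accuracies = []
    for i in range(n):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tagger(train)                      # hypothetical training function
        accuracies.append(evaluate(model, test))         # hypothetical evaluation function
    return sum(accuracies) / n
```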
Evaluation • Typically carried out against a gold standard, based on accuracy (% correct). • Ideal to compare the accuracy of our tagger with: • baseline (lower bound): • the standard is to choose the unigram most likely tag • ceiling (upper bound): • e.g. see how well humans do at the same task • humans apparently agree on 96-97% of tags • so it is highly suspect for a tagger to get 100% accuracy
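A sketch of the unigram baseline (most likely tag per word), assuming the training data is a list of (word, tag) pairs; falling back to the overall most frequent tag for unseen words is my own choice, not a prescribed one:

```python
from collections import Counter, defaultdict

def train_unigram_baseline(tagged_words):
    """For each word, remember its most frequent tag in the training data."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_words:
        per_word[word][tag] += 1
        overall[tag] += 1
    most_likely = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]           # fallback for unseen words
    return most_likely, default_tag

def tag_with_baseline(words, most_likely, default_tag):
    return [most_likely.get(w, default_tag) for w in words]
```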
Part 1 HMM taggers
Using Markov models • Basic idea: sequences of tags form a Markov chain: • Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag • Time invariance: the transition probabilities do not change over time (i.e. with position in the sequence)
Implications/limitations • Limited horizon ignores long-distance dependencies • e.g. can’t deal with WH-constructions • Chomsky (1957): this was one of the reasons cited against probabilistic approaches • Time invariance: • e.g. P(finite verb|pronoun) is constant • but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!
Notation • We let ti range over tags • Let wi range over words • Subscripts denote position in a sequence • Superscripts denote types: • wj = an instance of word type j in the lexicon • tj = tag t assigned to word wj • The limited horizon property becomes: P(ti+1 | t1,…,ti) = P(ti+1 | ti)
Basic strategy • Training set of manually tagged text • extract probabilities of tag sequences: • e.g. using the Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005 • Next step: estimate the word/tag probabilities P(w|t) • these are basically symbol emission probabilities
Training the tagger: basic algorithm • Estimate the probability of all possible sequences of 2 tags in the tagset from the training data • For each tag tj and for each word wl, estimate P(wl | tj) • Apply smoothing.
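A minimal sketch of this training step, using relative-frequency (MLE) estimates from tagged sentences; the add-one smoothing on the tag bigrams is only illustrative of the "apply smoothing" step, not a recommendation of a particular scheme:

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """Estimate tag-transition probabilities P(t | t_prev) and emission probabilities P(w | t)."""
    transitions, emissions = Counter(), Counter()
    prev_counts, tag_counts = Counter(), Counter()
    tagset = set()
    for sent in tagged_sentences:
        prev = "<s>"                                   # pseudo-tag marking the sentence start
        for word, tag in sent:
            transitions[(prev, tag)] += 1
            prev_counts[prev] += 1
            emissions[(tag, word)] += 1
            tag_counts[tag] += 1
            tagset.add(tag)
            prev = tag

    def p_trans(t_prev, t):                            # P(t | t_prev), add-one smoothed
        return (transitions[(t_prev, t)] + 1) / (prev_counts[t_prev] + len(tagset))

    def p_emit(t, w):                                  # P(w | t), unsmoothed relative frequency
        return emissions[(t, w)] / tag_counts[t] if tag_counts[t] else 0.0

    return p_trans, p_emit, tagset
```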
Finding the best tag sequence • Given: a sentence of n words • Find: t1,n = the best n tags • Application of Bayes’ rule • denominator can be eliminated as it’s the same for all tag sequences.
Finding the best tag sequence • The expression needs to be reduced to parameters that can be estimated from the training corpus • need to make some simplifying assumptions • words are independent of each other • a word’s identity depends only on its tag
The independence assumption • Probability of a sequence of words given a sequence of tags is computed as a function of each word independently
The identity assumption • Probability of a word given a tag sequence = probability of the word given its own tag
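Putting the last four slides together, the quantity being maximised can be written out as follows (a reconstruction in standard notation of the equations referred to above):

```latex
\hat{t}_{1,n}
  = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n})
  = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})}{P(w_{1,n})}
  \approx \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The final product form uses the independence and identity assumptions for the words, and the limited-horizon assumption for the tag sequence.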
Tagging with the Markov Model • Can use the Viterbi Algorithm to find the best sequence of tags given a sequence of words (sentence) • Reminder: • δj(i): the probability of being in state (tag) j at word i on the best path • ψj(i+1): the most probable state (tag) at word i, given that we’re in state j at word i+1
The algorithm: initialisation • Assume that P(PERIOD) = 1 at the end of a sentence • Set all other tag probabilities to 0
The algorithm: induction step • for i = 1 to n step 1: for all tags tj do: • compute δj(i+1), the probability of tag tj at word i+1 on the best path through word i • compute ψj(i+1), the most probable tag leading to tj at i+1
The algorithm: backtrace • start from the state at n+1 • for j = n to 1 do: retrieve the most probable tags for every point in the sequence • calculate the probability for the sequence of tags selected
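A compact sketch of the Viterbi procedure described in the last three slides, reusing the p_trans / p_emit / tagset returned by the earlier training sketch; the use of log probabilities to avoid underflow is my own addition:

```python
import math

def viterbi(words, tagset, p_trans, p_emit, start="<s>"):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # delta[t]: log-probability of the best path ending in tag t at the current word
    # back[i][t]: the previous tag on that best path (the psi pointers)
    delta = {t: logp(p_trans(start, t)) + logp(p_emit(t, words[0])) for t in tagset}
    back = []
    for word in words[1:]:
        new_delta, pointers = {}, {}
        for t in tagset:
            best_prev = max(tagset, key=lambda tp: delta[tp] + logp(p_trans(tp, t)))
            new_delta[t] = (delta[best_prev] + logp(p_trans(best_prev, t))
                            + logp(p_emit(t, word)))
            pointers[t] = best_prev
        back.append(pointers)
        delta = new_delta

    # backtrace from the best final tag
    last = max(tagset, key=lambda t: delta[t])
    tags = [last]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return list(reversed(tags))
```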
Some observations • The model is a Hidden Markov Model • we only observe words when we tag • In actuality, during training we have a visible Markov Model • because the training corpus provides words + tags
“True” HMM taggers • Applied to cases where we do not have a large training corpus • We maintain the usual MM assumptions • Initialisation: use a dictionary: • set the emission probability for a word/tag pair to 0 if that tag is not listed for the word in the dictionary • Training: apply the forward-backward (Baum-Welch) algorithm to untagged data • Tagging: exactly as before
Part 2 Transformation-based error-driven learning
Transformation-based learning • Approach proposed by Brill (1995) • uses quantitative information at training stage • outcome of training is a set of rules • tagging is then symbolic, using the rules • Components: • a set of transformation rules • learning algorithm
Transformations • General form: t1 → t2 • “replace t1 with t2 if certain conditions are satisfied” • Examples: • Morphological: Change the tag from NN to NNS if the word has the suffix "s" • dogs_NN → dogs_NNS • Syntactic: Change the tag from NN to VB if the word occurs after "to" • to_TO go_NN → to_TO go_VB • Lexical: Change the tag to JJ if deleting the prefix "un" results in a word • uncool_XXX → uncool_JJ • but uncle_NN -/-> uncle_JJ
Learning • Unannotated text is fed to an initial-state annotator • e.g. assign each word its most frequent tag from a dictionary • Truth: a manually annotated version of the corpus against which to compare • Learner: learns rules by comparing the initial state to the Truth • output of the learner: rules
Learning algorithm • Simple iterative process: • apply a rule to the corpus • compare to the Truth • if error rate is reduced, keep the results • A priori specifications: • how initial state annotator works • the space of possible transformations • Brill (1995) used a set of initial templates • the function to compare the result of applying the rules to the truth
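A schematic version of that iterative process. Transformations are treated as opaque objects; apply_rule is a placeholder for applying one transformation to a whole tag sequence (a concrete version is sketched under the order-dependence slide below), and the candidate rules are assumed to have been instantiated from the templates in advance, which simplifies Brill's actual procedure:

```python
def tbl_learn(initial_tags, truth_tags, candidate_rules, max_rules=50):
    """Greedy transformation-based learning: repeatedly pick the rule that most
    reduces the number of tagging errors against the Truth, apply it, record it."""
    def errors(tags):
        return sum(1 for t, gold in zip(tags, truth_tags) if t != gold)

    learned, current = [], list(initial_tags)
    while len(learned) < max_rules:
        best_rule = min(candidate_rules, key=lambda r: errors(apply_rule(r, current)))
        if errors(apply_rule(best_rule, current)) >= errors(current):
            break                                   # no rule reduces the error rate: stop
        current = apply_rule(best_rule, current)
        learned.append(best_rule)
    return learned                                  # an ordered list of transformations
```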
Non-lexicalised rule templates • Take only tags into account, not the shape of words Change tag a to tag b when: • The preceding (following) word is tagged z. • The word two before (after) is tagged z. • One of the three preceding (following) words is tagged z. • The preceding (following) word is tagged z and the word two before (after) is tagged w. • …
Lexicalised rule templates • Take into account specific words in the context Change tag a to tag b when: • The preceding (following) word is w. • The word two before (after) is w. • The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t. • …
Morphological rule templates • Useful for completely unknown words. Sensitive to the word’s “shape”. Change the tag of an unknown word (from X) to Y if: • Deleting the prefix (suffix) x, |x| ≤ 4, results in a word • The first (last) (1,2,3,4) characters of the word are x. • Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4). • Word w ever appears immediately to the left (right) of the word. • Character z appears in the word. • …
Order-dependence of rules • Rules are triggered by environments satisfying their conditions • E.g. “A → B if the preceding tag is A” • Suppose our sequence is “AAAA” • Two possible forms of rule application: • immediate effect: applications of the same transformation can influence each other • result: ABAB • delayed effect: results in ABBB • the rule is triggered multiple times from the same initial input • Brill (1995) opts for this solution
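A small sketch of the two application regimes, using the "A → B if the preceding tag is A" rule and the AAAA sequence from the slide:

```python
def apply_immediate(tags, frm, to, prev):
    """Left-to-right application; earlier changes are visible to later positions."""
    tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            tags[i] = to
    return "".join(tags)

def apply_delayed(tags, frm, to, prev):
    """Triggering positions are found on the original input, then all changed at once."""
    triggered = [i for i in range(1, len(tags)) if tags[i] == frm and tags[i - 1] == prev]
    return "".join(to if i in triggered else t for i, t in enumerate(tags))

print(apply_immediate("AAAA", "A", "B", "A"))   # ABAB
print(apply_delayed("AAAA", "A", "B", "A"))     # ABBB  (the behaviour Brill adopts)
```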
More on Transformation-based tagging • Can be used for unsupervised learning • like HMM-based tagging, the only info available is the allowable tags for each word • takes advantage of the fact that most words have only one tag • E.g. the word can = NN in the context AT ___ BEZ, because most other words in this context are NN • therefore, the learning algorithm would learn the rule “change tag to NN in context AT ___ BEZ” • this unsupervised method achieves 95.6% accuracy!
Part 3 Maximum Entropy models and POS Tagging
Limitations of HMMs • An HMM tagger relies on: • P(tag|previous tag) • P(word|tag) • these are combined by multiplication • TBL includes many other useful features which are hard to model in HMM: • prefixes, suffixes • capitalisation • … • Can we combine both, i.e. have HMM-style tagging with multiple features?
The rationale • In order to tag a word, we consider its context or “history” h. We want to estimate a probability distribution p(h,t) from sparse data. • h is encoded in terms of features (e.g. morphological features, surrounding tag features etc) • There are some constraints on these features that we discover from training data. • We want our model to make the fewest possible assumptions beyond these constraints.
Motivating example • Suppose we wanted to tag the word w. • Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, …} • The probabilistic tagging model that makes the fewest assumptions assigns a uniform distribution over the tags: p(t) = 1/45 for every tag t
Motivating example • Suppose we find that the possible tags for w are NN, JJ, NNS, VB. • We therefore impose our first constraint on the model: p(NN) + p(JJ) + p(NNS) + p(VB) = 1 • (and the prob. of every other tag is 0) • The simplest model satisfying this constraint assigns each of the four tags probability 1/4
Motivating example • We suddenly discover that w is tagged as NN or NNS 8 out of 10 times. • The model now has two constraints: the one above, plus p(NN) + p(NNS) = 8/10 • Again, we require our model to make no further assumptions. The simplest distribution leaves the probabilities for all tags except NN/NNS equal: • P(NN) = 4/10 • P(NNS) = 4/10 • P(JJ) = 1/10 • P(VB) = 1/10
Motivating example • We suddenly discover that verbs (VB) occur 1 in every 20 words. • The model now has three constraints: the two above, plus p(VB) = 1/20 • The simplest distribution is now: • P(NN) = 4/10 • P(NNS) = 4/10 • P(JJ) = 3/20 • P(VB) = 1/20
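Formally, the running example amounts to maximising entropy subject to the constraints accumulated so far (a reconstruction of what the slides state in prose):

```latex
\max_{p}\; H(p) = -\sum_{t \in T} p(t)\log p(t)
\quad\text{subject to}\quad
\begin{cases}
p(\mathrm{NN}) + p(\mathrm{JJ}) + p(\mathrm{NNS}) + p(\mathrm{VB}) = 1\\
p(\mathrm{NN}) + p(\mathrm{NNS}) = 8/10\\
p(\mathrm{VB}) = 1/20
\end{cases}
```

Spreading the remaining probability mass as evenly as the constraints allow gives P(NN) = P(NNS) = 4/10, P(VB) = 1/20 and P(JJ) = 1 - 8/10 - 1/20 = 3/20, i.e. exactly the distribution above.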
What we’ve been doing • Maximum entropy builds a distribution by continuously adding features. • Each feature picks out a subset of the training observations. • For each feature, we add a constraint on our total distribution. • Our task is then to find the best distribution given the constraints.
Features for POS Tagging • Each tagging decision for a word occurs in a specific context or “history” h. • For tagging, we consider as context: • the word itself • morphological properties of the word • other words surrounding the word • previous tags • For each relevant aspect of the context hi, we can define a feature fj that allows us to learn how well that aspect is associated with a tag ti. • Probability of a tag given a context is a weighted function of the features.
Features for POS Tagging • In a maximum entropy model, this information is captured by a binary or indicator feature • each feature fi has a weight αi reflecting its importance • NB: each αi is uniquely associated with a feature
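For instance, a feature associating the suffix "-ing" with the tag VBG might look like this (an illustrative feature in the style of Ratnaparkhi (1996), not one quoted from the paper):

```latex
f_j(h_i, t_i) =
\begin{cases}
1 & \text{if } \mathrm{suffix}(w_i) = \text{``ing''} \text{ and } t_i = \mathrm{VBG}\\
0 & \text{otherwise}
\end{cases}
```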
Features for POS Tagging in Ratnaparkhi (1996) • Ratnaparkhi (1996) had three sets of feature templates: one for non-rare words, one for rare words, and one applying to all words.
Features for POS Tagging • Given the large number of possible features, which ones will be part of the model? • We do not want redundant features • We do not want unreliable and rarely occurring features (avoid overfitting) • Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data
The form of the model • Features fj and their parameters are used to compute the probability p(hi, ti): • where j ranges over features & Z is a normalisation constant • Transform into a linear equation:
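Written out, the product form and its log-linear transformation are standardly given as follows (reconstructed notation, consistent with the description above of Z as a normalisation constant):

```latex
p(h_i, t_i) = \frac{1}{Z}\prod_{j}\alpha_j^{\,f_j(h_i, t_i)}
\qquad\Longrightarrow\qquad
\log p(h_i, t_i) = -\log Z + \sum_{j} f_j(h_i, t_i)\log\alpha_j
```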
Conditional probabilities • The conditional probabilities can be computed from the joint probabilities: p(ti | hi) = p(hi, ti) / Σt' p(hi, t') • Probability of a sequence of tags given a sequence of words: P(t1,n | w1,n) ≈ ∏i p(ti | hi) • NB: unlike an HMM, we have one probability here • we directly estimate p(t|h) • the model combines all features in hi into a single estimate • no limit in principle on what features we can take into account
The use of constraints • Every feature we have imposes a constraint, or expectation, on the probability model. We want the model’s expectation of each feature to equal its empirical expectation: • Ep(fj) = the model p’s expectation of fj • E~p(fj) = the empirical expectation of fj, observed in the training data
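Spelled out, the constraint equates the model's expectation of each feature with its empirical expectation over the n training instances (standard notation; the slide's own symbols are not preserved):

```latex
E_p f_j \;=\; \sum_{h,t} p(h, t)\, f_j(h, t)
\;\;=\;\;
E_{\tilde{p}} f_j \;=\; \frac{1}{n}\sum_{i=1}^{n} f_j(h_i, t_i)
\qquad\text{for every feature } f_j
```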
Why maximum entropy? • Recall that entropy is a measure of uncertainty in a distribution. • Without any knowledge, the simplest distribution is uniform • uniform distributions have the highest entropy • As we add constraints, the MaxEnt principle dictates that we find the model p* with maximum entropy among those satisfying the constraints: p* = argmax H(p) over p in P • where P is the set of possible distributions consistent with the constraints • p* is unique and has the form given earlier • Basically, an application of Occam’s razor: make no further assumptions than necessary.