This lecture discusses parts of speech (POS) tagging, including influential tag sets, tricky cases, tokenization, simple models, corpus-based methods, training models, and Markov models for POS tagging.
Part-of-Speech Tagging – CSCI-GA.2590, Lecture 4. Ralph Grishman, NYU
Parts of Speech Grammar is stated in terms of parts of speech (‘preterminals’): • classes of words sharing syntactic properties: noun verb adjective …
POS Tag Sets Most influential tag sets were those defined for projects to produce large POS-annotated corpora: • Brown corpus • 1 million words from a variety of genres • 87 tags • UPenn Treebank • initially 1 million words of Wall Street Journal • later retagged Brown • first POS tags, then full parses • 45 tags (some distinctions captured in parses)
The Penn POS Tag Set • Noun categories • NN (common singular) • NNS (common plural) • NNP (proper singular) • NNPS (proper plural) • Verb categories • VB (base form) • VBZ (3rd person singular present tense) • VBP (present tense, other than 3rd person singular) • VBD (past tense) • VBG (present participle) • VBN (past participle)
some tricky cases • present participles which act as prepositions: • according/JJ to • nationalities: • English/JJ cuisine • an English/NNP sentence • adjective vs. participle • the striking/VBG teachers • a striking/JJ hat • he was very surprised/JJ • he was surprised/VBN by his wife
Tokenization • any annotated corpus assumes some tokenization • relatively straightforward for English • generally defined by whitespace and punctuation • treat negative contraction as separate token: do | n’t • treat possessive as separate token: cat | ‘s • do not split hyphenated terms: Chicago-based
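A minimal sketch of these tokenization rules in Python (illustrative only; the function name and the exact handling of trailing punctuation are assumptions, not the tokenizer actually used in the course):

```python
import re

def tokenize(text):
    """Whitespace/punctuation tokenizer that splits negative contractions
    ("don't" -> do | n't) and possessives ("cat's" -> cat | 's), but leaves
    hyphenated terms such as "Chicago-based" intact."""
    tokens = []
    for word in text.split():
        # peel trailing sentence punctuation off the word
        core, punct = re.match(r"^(.*?)([.,!?;:]*)$", word).groups()
        if core.lower().endswith("n't"):
            tokens.extend([core[:-3], core[-3:]])   # do | n't
        elif core.lower().endswith("'s"):
            tokens.extend([core[:-2], core[-2:]])   # cat | 's
        elif core:
            tokens.append(core)                     # hyphens are not split
        tokens.extend(punct)                        # each punctuation mark as its own token
    return tokens

print(tokenize("The Chicago-based cat's owner doesn't agree."))
# ['The', 'Chicago-based', 'cat', "'s", 'owner', 'does', "n't", 'agree', '.']
```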
The Tagging Task Task: assigning a POS to each word • not trivial: many words have several tags • a dictionary only lists the possible POS, independent of context • how about using a parser to determine tags? • some analyzers (e.g., partial parsers) assume the input is already tagged
Why tag? • POS tagging can help parsing by reducing ambiguity • Can resolve some pronunciation ambiguities for text-to-speech (“desert”) • Can resolve some semantic ambiguities
Simple Models • Natural language is very complex • we don't know how to model it fully, so we build simplified models which provide some approximation to natural language
Corpus-Based Methods How can we measure 'how good' these models are? • we assemble a text corpus • annotate it by hand with respect to the phenomenon we are interested in • compare it with the predictions of our model • for example, how well the model predicts part-of-speech or syntactic structure
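As a concrete illustration of the comparison step, a minimal scoring function (a sketch; it assumes the gold and predicted tag sequences are token-aligned):

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted POS matches the hand annotation."""
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)

# tagging_accuracy(['DT', 'NN', 'VBZ'], ['DT', 'NN', 'NNS'])  ->  0.666...
```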
Preparing a Good Corpus • To build a good corpus • we must define a task people can do reliably (choose a suitable POS set, for example) • we must provide good documentation for the task • so annotation can be done consistently • we must measure human performance (through dual annotation and inter-annotator agreement) • Often requires several iterations of refinement
Training the model How to build a model? • need a goodness metric • train by hand, by adjusting rules and analyzing errors (ex: Constraint Grammar) • train automatically • develop new rules • build probabilistic model (generally very hard to do by hand) • choice of model affected by ability to train it (NN)
The simplest model • The simplest POS model considers each word separately: • We tag each word with its most likely part-of-speech • this works quite well: about 90% accuracy when trained and tested on similar texts • although many words have multiple parts of speech, one POS typically dominates within a single text type • How can we take advantage of context to do better?
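A hedged sketch of this most-frequent-tag baseline (the function names and the NN fallback for unseen words are illustrative assumptions):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """tagged_corpus: a list of (word, tag) pairs from a hand-tagged corpus.
    Returns a dict mapping each word to its single most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_likely_tag, default='NN'):
    # words not seen in training fall back to the most common open-class tag
    return [most_likely_tag.get(w, default) for w in words]
```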
A Language Model • To see how we might do better, let us consider a related problem: building a language model • a language model can generate sentences following some probability distribution
Markov Model • In principle each word we select depends on all the decisions which came before (all preceding words in the sentence) • But we’ll make life simple by assuming that the decision depends on only the immediately preceding decision • [first-order] Markov Model • representable by a finite state transition network • Tij = probability of a transition from state i to state j
Finite State Network [figure]: a transition network with a start state, an end state, and two word-emitting states, dog (“woof”) and cat (“meow”); the arcs are labeled with transition probabilities (0.30–0.50).
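A small sketch of such a network as a transition table (the probabilities below are made up for illustration, not necessarily those in the figure); note that each step depends only on the current state:

```python
import random

# Hypothetical first-order Markov model over states; T[i][j] is the
# probability of a transition from state i to state j (each row sums to 1).
T = {
    'start': {'dog': 0.5, 'cat': 0.5},
    'dog':   {'dog': 0.3, 'cat': 0.3, 'end': 0.4},
    'cat':   {'dog': 0.3, 'cat': 0.3, 'end': 0.4},
}

def generate_path():
    """Random walk through the network from 'start' until 'end' is reached."""
    state, path = 'start', []
    while state != 'end':
        state = random.choices(list(T[state]), weights=T[state].values())[0]
        if state != 'end':
            path.append(state)
    return path   # e.g. ['dog', 'dog', 'cat']
```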
Our bilingual pets • Suppose our cat learned to say “woof” and our dog “meow” • … they started chatting in the next room • … and we wanted to know who said what
Hidden State Network [figure]: the same network, but now both the dog and cat states can emit either “woof” or “meow”, so which state produced a given word is hidden.
How do we predict • When the cat is talking: ti = cat • When the dog is talking: ti = dog • We construct a probabilistic model of the phenomenon • And then seek the most likely state sequence S
Hidden Markov Model • Assume current word depends only on current tag
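Written out, this is the standard bigram-HMM factorization (reconstructed here, since the slide's formulas are not reproduced in the text):

```latex
P(w_1 \ldots w_n,\; t_1 \ldots t_n) \;\approx\; \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```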
HMM for POS Tagging • We can use the same formulas for POS tagging, with the hidden states corresponding to POS tags
Training an HMM • Training an HMM is simple if we have a completely labeled corpus: • have marked the POS of each word • can directly estimate both P(ti | ti-1) and P(wi | ti) from corpus counts • using the Maximum Likelihood Estimator.
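A minimal sketch of this training step, assuming the labeled corpus is given as sentences of (word, tag) pairs (the function names and the sentence-boundary pseudo-tag '&lt;s&gt;' are assumptions):

```python
from collections import Counter, defaultdict

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def train_hmm(tagged_sentences):
    """Maximum-likelihood estimates of P(tag | previous tag) and P(word | tag)
    from a fully labeled corpus: tagged_sentences is a list of sentences,
    each a list of (word, tag) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = '<s>'                       # pseudo-tag marking the sentence start
        for word, tag in sentence:
            trans[prev][tag] += 1          # count of tag following prev
            emit[tag][word] += 1           # count of word emitted by tag
            prev = tag
    P_trans = {p: normalize(c) for p, c in trans.items()}
    P_emit = {t: normalize(c) for t, c in emit.items()}
    return P_trans, P_emit
```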
Greedy Decoder • simplest decoder (tagger): assigns tags deterministically from left to right • selects ti to maximize P(wi | ti) * P(ti | ti-1) • does not take advantage of right context • can we do better?
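A sketch of the greedy decoder using the tables estimated above (names are assumptions; ties and zero probabilities simply fall back to the first tag in the tag set):

```python
def greedy_tag(words, P_trans, P_emit, tagset):
    """Left-to-right greedy decoding: at each position pick the tag t that
    maximizes P(word | t) * P(t | previous tag), ignoring right context."""
    tags, prev = [], '<s>'
    for w in words:
        best = max(tagset,
                   key=lambda t: P_emit.get(t, {}).get(w, 0.0) *
                                 P_trans.get(prev, {}).get(t, 0.0))
        tags.append(best)
        prev = best
    return tags
```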
Performance • Accuracy with good unknown-word model trained and tested on WSJ is 96.5% to 96.8%
Unknown words • Problem (as with NB) of zero counts … words not in the training corpus • simplest: assume all POS equally likely for unknown words • can make a better estimate by observing that unknown words are very likely open-class words, and most likely nouns • base P(t|w) of an unknown word on the probability distribution of words which occur once in the corpus
Unknown words, cont’d • can do even better by taking into account the form of a word • whether it is capitalized • whether it is hyphenated • its last few letters
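For illustration, those word-form cues might be collected like this (a sketch; the feature names and the three-letter suffix length are invented):

```python
def unknown_word_features(word, suffix_len=3):
    """Surface features of the kind listed above, which could be used to
    estimate P(tag | word) for words never seen in training."""
    return {
        'capitalized': word[:1].isupper(),
        'hyphenated': '-' in word,
        'suffix': word[-suffix_len:].lower(),
    }

# unknown_word_features('Chicago-based')
#   -> {'capitalized': True, 'hyphenated': True, 'suffix': 'sed'}
```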
Trigram Models • in some cases we need to look two tags back to find an informative context • e.g., conjunction (N and N, V and V, …) • but there’s not enough data for a pure trigram model • so combine unigram, bigram, and trigram • linear interpolation • backoff
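The linear interpolation mentioned here has the standard form below, where the λ weights sum to 1 and are typically tuned on held-out data:

```latex
\hat{P}(t_i \mid t_{i-2}, t_{i-1}) \;=\;
  \lambda_3\, P(t_i \mid t_{i-2}, t_{i-1})
+ \lambda_2\, P(t_i \mid t_{i-1})
+ \lambda_1\, P(t_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```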
Domain adaptation • Substantial loss in shifting to a new domain • 8-10% loss in shift from WSJ to the biology domain • adding a small annotated sample (200-500 sentences) in the new domain greatly reduces error • some reduction possible without annotated target data (Blitzer, Structural Correspondence Learning)
Jet Tagger • HMM–based • trained on WSJ • file pos_hmm.txt
Transformation-Based Learning • TBL provides a very different corpus-based approach to part-of-speech tagging • It learns a set of rules for tagging • the result is inspectable
TBL Model • TBL starts by assigning each word its most likely part of speech • Then it applies a series of transformations to the corpus • each transformation states some condition and some change to be made to the assigned POS if the condition is met • for example: • Change NN to VB if the preceding tag is TO. • Change VBP to VB if one of the previous 3 tags is MD.
Transformation Templates • Each transformation is based on one of a small number of templates, such as • Change tag x to y if the preceding tag is z. • Change tag x to y if one of the previous 2 tags is z. • Change tag x to y if one of the previous 3 tags is z. • Change tag x to y if the next tag is z. • Change tag x to y if one of the next 2 tags is z. • Change tag x to y if one of the next 3 tags is z.
Training the TBL Model • To train the tagger, using a hand-tagged corpus, we begin by assigning each word its most common POS. • We then try all possible rules (all instantiations of one of the templates) and keep the best rule -- the one which corrects the most errors. • We do this repeatedly until we can no longer find a rule which corrects some minimum number of errors.
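A compact sketch of this greedy training loop, with one example rule of the kind shown earlier (the rule representation as a callable and the min_gain stopping threshold are assumptions):

```python
def nn_to_vb_after_to(tags):
    """Example transformation: change NN to VB if the preceding tag is TO."""
    return ['VB' if t == 'NN' and i > 0 and tags[i - 1] == 'TO' else t
            for i, t in enumerate(tags)]

def tbl_train(corpus_tags, gold_tags, candidate_rules, min_gain=2):
    """Greedy TBL training: repeatedly pick the rule that corrects the most
    errors on the current tagging, apply it, and stop when no rule corrects
    at least min_gain errors.  Each rule maps a tag sequence to a new one."""
    def errors(tags):
        return sum(1 for t, g in zip(tags, gold_tags) if t != g)

    learned = []
    while True:
        current = errors(corpus_tags)
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            gain = current - errors(rule(corpus_tags))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < min_gain:
            break
        corpus_tags = best_rule(corpus_tags)
        learned.append(best_rule)
    return learned
```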
Some Transformations • the first 9 transformations found for the WSJ corpus
TBL Performance • Performance competitive with good HMM • accuracy 96.6% on WSJ • Compared to HMM, much slower to train, but faster to apply