Part-of-Speech (POS) tagging See Eric Brill, "Part-of-speech tagging", Chapter 17 of R Dale, H Moisl & H Somers (eds), Handbook of Natural Language Processing, New York: Marcel Dekker (2000) D Jurafsky & JH Martin, Speech and Language Processing, Upper Saddle River NJ: Prentice Hall (2000), Chapter 8 CD Manning & H Schütze, Foundations of Statistical Natural Language Processing, Cambridge, Mass: MIT Press (1999), Chapter 10. [skip the maths bits if too daunting]
Word categories • A.k.a. parts of speech (POSs) • Important and useful to identify words by their POS • To distinguish homonyms • To enable more general word searches • POS familiar (?) from school and/or language learning (noun, verb, adjective, etc.)
Word categories • Recall that we distinguished • open-class categories (noun, verb, adjective, adverb) • closed-class categories (preposition, determiner, pronoun, conjunction, …) • While the big four are fairly clear-cut, it is less obvious exactly what and how many closed-class categories there may be
POS tagging • Labelling words for POS can be done by • dictionary lookup • morphological analysis • “tagging” • Identifying POS can be seen as a prerequisite to parsing, and/or a process in its own right • However, there are some differences: • Parsers often work with the simplest set of word categories, subcategorized by feature (or attribute-value) schemes • Indeed the parsing procedure may contribute to the disambiguation of homonyms
POS tagging • POS tagging, per se, aims to identify word-category information somewhat independently of sentence structure … • … and typically uses rather different means • POS tags are generally shown as labels on words: John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC
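As an illustration (not part of the original slides), a minimal sketch using NLTK's off-the-shelf tagger, which produces output in the same word/TAG format, though with the Penn Treebank tagset rather than the tags shown above; it assumes NLTK and its default English tagger model have been downloaded:

```python
# A minimal sketch using NLTK's off-the-shelf POS tagger (Penn Treebank tagset).
# Assumes the default English tagger model has been downloaded, e.g.
# nltk.download('averaged_perceptron_tagger')
import nltk

sentence = "John saw the book on the table ."
tokens = sentence.split()          # simple whitespace tokenisation for this example
tagged = nltk.pos_tag(tokens)      # list of (word, tag) pairs

# Print in the word/TAG format used on the slide
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
# e.g. John/NNP saw/VBD the/DT book/NN on/IN the/DT table/NN ./.
```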
What is a tagger? • Lack of distinction between … • software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger” • the result of running such software, e.g. a tagger for English (based on such-and-such a corpus) • Taggers (even rule-based ones) are almost invariably trained on a given corpus • “Tagging” is usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)
Tagging vs. parsing • Once the tagger is “trained”, the process consists of straightforward look-up, plus local context (and sometimes morphology) • The tagger will attempt to assign a tag to unknown words, and to disambiguate homographs • The “tagset” (list of categories) is usually larger, with more distinctions than the categories used in parsing
Tagset • Parsing usually has basic word-categories, whereas tagging makes more subtle distinctions • E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations • Parser uses maybe 12-20 categories, tagger may use 60-100
Simple taggers • Default tagger has one tag per word, and assigns it on the basis of dictionary lookup • Tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb • Words may be assigned different tags with associated probabilities • Tagger will assign most probable tag unless • there is some way to identify when a less probable tag is in fact correct • Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences – negative rules)
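As a concrete sketch (not from the slides), a simple dictionary-lookup tagger with a default tag for unknown words; the tiny lexicon and the NVB "ambiguity" tag are illustrative assumptions, not a real tagset:

```python
# A minimal dictionary-lookup tagger: one tag per known word, a default for the rest.
# The lexicon and tags below are illustrative assumptions only.
LEXICON = {
    "the": "AT",
    "saw": "NVB",    # ambiguity tag: noun-or-verb, left unresolved
    "book": "NVB",
    "table": "NN",
}

def default_tag(word, default="NN"):
    """Look the word up in the lexicon; fall back to a default tag if unknown."""
    return LEXICON.get(word.lower(), default)

tokens = "John saw the book on the table".split()
print([(w, default_tag(w)) for w in tokens])
```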
Rule-based taggers • Earliest type of tagging: two stages • Stage 1: look up word in lexicon to give list of potential POSs • Stage 2: Apply rules which certify or disallow tag sequences • Rules originally handwritten; more recently Machine Learning methods can be used • cf transformation-based tagging, below
How do they work? • Tagger must be “trained” • Many different techniques, but typically … • Small “training corpus” hand-tagged • Tagging rules learned automatically • Rules define most likely sequence of tags • Rules based on • Internal evidence (morphology) • External evidence (context) • Probabilities
What probabilities do we have to learn? • (a) Individual word probabilities: probability that a given tag t is appropriate for a given word w • Easy (in principle): learn from the training corpus, e.g. run occurs 4800 times in the training corpus, 3600 times as a verb and 1200 times as a noun, so P(verb|run) = 3600/4800 = 0.75 • Problem of “sparse data”: add a small amount to each count (smoothing), so we get no zeros
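A minimal sketch of estimating P(tag | word) from counts, with add-one smoothing over the tagset to avoid zeros; the toy tagged corpus below is an illustrative assumption:

```python
from collections import Counter, defaultdict

# Toy tagged corpus: (word, tag) pairs. Illustrative data only.
corpus = [("the", "AT"), ("run", "VB"), ("was", "VB"), ("run", "NN"),
          ("run", "VB"), ("a", "AT"), ("run", "VB")]

tagset = {tag for _, tag in corpus}
word_tag_counts = defaultdict(Counter)
for word, tag in corpus:
    word_tag_counts[word][tag] += 1

def p_tag_given_word(tag, word, smoothing=1.0):
    """P(tag | word) with add-one (Laplace) smoothing over the tagset."""
    counts = word_tag_counts[word]
    total = sum(counts.values()) + smoothing * len(tagset)
    return (counts[tag] + smoothing) / total

print(p_tag_given_word("VB", "run"))   # close to 3/4 on this toy data
```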
(b) Tag sequence probability: probability that a given tag sequence t1,t2,…,tn is appropriate for a given word sequence w1,w2,…,wn • P(t1,t2,…,tn | w1,w2,…,wn) = ??? • Too hard to calculate for the entire sequence: P(t1,t2,t3,t4,…) = P(t1) P(t2|t1) P(t3|t1,t2) P(t4|t1,t2,t3) … • A subsequence is more tractable • A sequence of 2 or 3 tags should be enough: Bigram model: P(t1,t2) = P(t1) P(t2|t1) Trigram model: P(t1,t2,t3) = P(t1) P(t2|t1) P(t3|t1,t2) N-gram model: condition each tag on the previous n-1 tags
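A minimal sketch of estimating bigram transition probabilities P(ti | ti-1) from counts and using them to score a tag sequence; the toy tag sequence is an illustrative assumption:

```python
from collections import Counter, defaultdict

# Toy tag sequence from an (assumed) training corpus.
tags = ["AT", "NN", "VB", "AT", "NN", "VB", "AT", "JJ", "NN"]

bigram_counts = defaultdict(Counter)
for prev, cur in zip(tags, tags[1:]):
    bigram_counts[prev][cur] += 1

def p_transition(cur, prev):
    """Maximum-likelihood estimate of P(current tag | previous tag)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def sequence_probability(tag_seq):
    """Bigram (first-order Markov) probability of a tag sequence, ignoring P(t1)."""
    p = 1.0
    for prev, cur in zip(tag_seq, tag_seq[1:]):
        p *= p_transition(cur, prev)
    return p

print(sequence_probability(["AT", "NN", "VB"]))
```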
More complex taggers • Bigram taggers assign tags on the basis of sequences of two words (usually assigning the tag to word n on the basis of word n-1) • An nth-order tagger assigns tags on the basis of sequences of n words • As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations
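One way to see this in practice (a sketch, assuming NLTK and the Brown corpus have been downloaded) is NLTK's n-gram taggers, where each tagger backs off to a simpler one when it has no evidence for the current context:

```python
# Sketch: unigram and bigram taggers trained on the Brown corpus, with backoff.
# Assumes the corpus has been downloaded, e.g. nltk.download('brown')
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories="news")
train, test = tagged_sents[:4000], tagged_sents[4000:]

unigram = nltk.UnigramTagger(train)                  # most frequent tag per word
bigram = nltk.BigramTagger(train, backoff=unigram)   # also uses the preceding context

# .accuracy() in recent NLTK versions; older versions use .evaluate()
print(unigram.accuracy(test), bigram.accuracy(test))
print(bigram.tag("John saw the book on the table".split()))
```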
Stochastic taggers • Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 1960s and 1970s) • The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)
(Hidden) Markov Models • Probability calculations imply Markov models: we assume that the choice of tag depends only on the previous tag (or on a short sequence of previous tags) • (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past • Markov chains can be modelled by finite state automata: the next state in a Markov chain always depends on a finite history of previous states • The model is “hidden” because the underlying states (the tags) are not directly observed: we only see the words they emit
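A minimal Viterbi decoding sketch for a toy HMM tagger; the transition and emission tables below are made-up illustrative values, not trained parameters:

```python
# Viterbi decoding for a toy HMM tagger (illustrative probabilities only).
TAGS = ["AT", "NN", "VB"]
START = {"AT": 0.6, "NN": 0.3, "VB": 0.1}                     # P(t1)
TRANS = {"AT": {"AT": 0.01, "NN": 0.89, "VB": 0.10},          # P(t_i | t_i-1)
         "NN": {"AT": 0.10, "NN": 0.30, "VB": 0.60},
         "VB": {"AT": 0.70, "NN": 0.20, "VB": 0.10}}
EMIT = {"AT": {"the": 0.9, "run": 0.0, "dog": 0.0},           # P(w_i | t_i)
        "NN": {"the": 0.0, "run": 0.3, "dog": 0.7},
        "VB": {"the": 0.0, "run": 0.9, "dog": 0.1}}

def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for w in words[1:]:
        new_best = {}
        for t in TAGS:
            prob, path = max(
                ((best[prev][0] * TRANS[prev][t] * EMIT[t].get(w, 0.0),
                  best[prev][1] + [t]) for prev in TAGS),
                key=lambda x: x[0])
            new_best[t] = (prob, path)
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "run"]))   # -> ['AT', 'NN', 'VB']
```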
Supervised vs unsupervised training • Learning tagging rules from a marked-up corpus (supervised learning) gives very good results (98% accuracy) • Though simply assigning each word its most probable tag, and tagging unknowns as “proper noun”, will already give about 90% • But it depends on having a corpus already marked up to a high quality • If this is not available, we have to try something else: • the “forward-backward” algorithm • a kind of “bootstrapping” approach
Forward-backward (Baum-Welch) algorithm • Start with initial probabilities • If nothing known, assume all Ps equal • Adjust the individual probabilities so as to increase the overall probability. • Re-estimate the probabilities on the basis of the last iteration • Continue until convergence • i.e. there is no improvement, or improvement is below a threshold • All this can be done automatically
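A compact sketch of the forward-backward (Baum-Welch) re-estimation loop for a discrete HMM, assuming numpy; the observation sequence, state count and random initialisation are illustrative assumptions (in a real tagger the symbols would be words and the hidden states tags):

```python
# Baum-Welch (forward-backward) for a small discrete HMM: start from (here random)
# probabilities, re-estimate from expected counts, repeat until convergence.
import numpy as np

rng = np.random.default_rng(0)

def normalise(m, axis=-1):
    """Normalise a vector or the rows of a matrix so they sum to 1."""
    return m / m.sum(axis=axis, keepdims=True)

def baum_welch(obs, n_states, n_symbols, n_iter=50, tol=1e-6):
    pi = normalise(rng.random(n_states), axis=0)          # initial state distribution
    A = normalise(rng.random((n_states, n_states)))       # transition probabilities
    B = normalise(rng.random((n_states, n_symbols)))      # emission probabilities
    prev_ll = -np.inf
    T = len(obs)
    for _ in range(n_iter):
        # E-step: forward and backward probabilities
        alpha = np.zeros((T, n_states))
        beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        gamma = normalise(alpha * beta)                    # P(state at t | obs)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :])    # P(states t, t+1 | obs)
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        # M-step: re-estimate the parameters from expected counts
        pi = gamma[0]
        A = normalise(xi.sum(axis=0))
        B = np.zeros_like(B)
        for t, o in enumerate(obs):
            B[:, o] += gamma[t]
        B = normalise(B)
        # Stop when the improvement in log-likelihood falls below the threshold
        ll = np.log(likelihood)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, A, B

obs = [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]   # toy symbol sequence (e.g. word ids)
pi, A, B = baum_welch(obs, n_states=2, n_symbols=3)
print(A.round(2), B.round(2))
```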
Transformation-based tagging • Eric Brill (1993) • Start from an initial tagging, and apply a series of transformations • The transformations themselves are learned from the training data • Captures the tagging data in far fewer parameters than stochastic models • The transformations learned (often) have linguistic “reality”
Transformation-based tagging • Three stages: • Lexical look-up • Lexical rule application for unknown words • Contextual rule application to correct mis-tags • Painting analogy
Transformation-based learning • Change tag a to b when: • Internal evidence (morphology) • Contextual evidence • One or more of the preceding/following words has a specific tag • One or more of the preceding/following words is a specific word • One or more of the preceding/following words has a certain form • Order of rules is important • Rules can change a correct tag into an incorrect tag, so another rule might correct that “mistake”
Transformation-based tagging: examples • if a word is currently tagged NN, and has a suffix of length 1 which consists of the letter 's', change its tag to NNS • if a word has a suffix of length 2 consisting of the letter sequence 'ly', change its tag to RB (regardless of the initial tag) • change VBN to VBD if the previous word is tagged as NN • change VBD to VBN if the following word is 'by'
Transformation-based tagging: example
Example after lexical lookup:
Booth/NP killed/VBN Abraham/NP Lincoln/NP
Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP
He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP
Example after application of contextual rule 'vbn vbd PREVTAG np':
Booth/NP killed/VBD Abraham/NP Lincoln/NP
Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP
He/PPS witnessed/VBD Lincoln/NP killed/VBD by/BY Booth/NP
Example after application of contextual rule 'vbd vbn NEXTWORD by':
Booth/NP killed/VBD Abraham/NP Lincoln/NP
Abraham/NP Lincoln/NP was/BEDZ shot/VBN by/BY Booth/NP
He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP
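A minimal sketch of applying the two contextual rules from the worked example above to an initial tagging; the (from_tag, to_tag, condition) rule representation is a hypothetical simplification, not Brill's actual implementation:

```python
# Sketch: applying simple contextual transformation rules to an initial tagging.
def prevtag(tag):
    return lambda tagged, i: i > 0 and tagged[i - 1][1] == tag

def nextword(word):
    return lambda tagged, i: i + 1 < len(tagged) and tagged[i + 1][0].lower() == word

RULES = [
    ("VBN", "VBD", prevtag("NP")),      # 'vbn vbd PREVTAG np'
    ("VBD", "VBN", nextword("by")),     # 'vbd vbn NEXTWORD by'
]

def apply_rules(tagged):
    """Apply each rule in order over the whole (word, tag) sequence."""
    tagged = list(tagged)
    for from_tag, to_tag, condition in RULES:
        for i, (word, tag) in enumerate(tagged):
            if tag == from_tag and condition(tagged, i):
                tagged[i] = (word, to_tag)
    return tagged

after_lookup = [("Booth", "NP"), ("killed", "VBN"), ("Abraham", "NP"), ("Lincoln", "NP"),
                ("Lincoln", "NP"), ("was", "BEDZ"), ("shot", "VBD"), ("by", "BY"), ("Booth", "NP")]
print(apply_rules(after_lookup))
# killed -> VBD (previous tag NP), shot -> VBN (next word 'by'), as in the slide
```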
Tagging – final word • Many taggers now available for download • Sometimes not clear whether “tagger” means • Software enabling you to build a tagger given a corpus • An already built tagger for a given language • Because a given tagger (2nd sense) will have been trained on some corpus, it will be biased towards that (kind of) corpus • Question of goodness of match between original training corpus and material you want to use the tagger on