Tagging – more details
Reading:
• D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8
• R Dale et al (2000) Handbook of Natural Language Processing, Ch 17
• C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10
POS tagging - overview • What is a “tagger”? • Tagsets • How to build a tagger and how a tagger works • Supervised vs unsupervised learning • Rule-based vs stochastic • And some details
What is a tagger? • Lack of distinction between … • Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger” • The result of running such software, e.g. a tagger for English (based on such-and-such a corpus) • Taggers (even rule-based ones) are almost invariably trained on a given corpus • “Tagging” is usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)
Tagging vs. parsing • Once a tagger is “trained”, the process consists of straightforward look-up, plus local context (and sometimes morphology) • It will attempt to assign a tag to unknown words, and to disambiguate homographs • The “tagset” (list of categories) is usually larger, with more distinctions
Tagset • Parsing usually has basic word-categories, whereas tagging makes more subtle distinctions • E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations • Parser uses maybe 12-20 categories, tagger may use 60-100
Simple taggers • A default tagger has one tag per word, and assigns it on the basis of dictionary lookup • Tags may indicate ambiguity but not resolve it, e.g. nvb for noun-or-verb • Words may be assigned different tags with associated probabilities • The tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct • Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences)
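A minimal sketch of such a dictionary-lookup default tagger; the lexicon, the tag probabilities and the fallback tag below are invented purely for illustration:

```python
# Default tagger: per-word tag probabilities learned elsewhere, most probable
# tag assigned by dictionary lookup (toy lexicon, invented numbers).

LEXICON = {
    "the":  {"DET": 1.0},
    "can":  {"MD": 0.7, "NN": 0.2, "VB": 0.1},
    "rust": {"NN": 0.6, "VB": 0.4},
}
DEFAULT_TAG = "NN"  # fallback for unknown words

def default_tag(word):
    """Return the most probable tag for a word, or the default for unknowns."""
    tags = LEXICON.get(word.lower())
    if not tags:
        return DEFAULT_TAG
    return max(tags, key=tags.get)

print([(w, default_tag(w)) for w in "The can can rust".split()])
```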
What probabilities do we have to learn? • (a) Individual word probabilities: the probability that a given tag t is appropriate for a given word w • Easy (in principle) to learn from a training corpus, e.g. P(t|w) = C(w,t) / C(w) • Problem of “sparse data”: many plausible word-tag pairs never occur in the training corpus • Add a small amount to each count, so that we get no zeros
(b) Tag sequence probability: probability that a given tag sequence t1,t2,…,tn is appropriate for a given word sequence w1,w2,…,wn • P(t1,t2,…,tn | w1,w2,…,wn) = ??? • Too hard to calculate for the entire sequence: by the chain rule, P(t1,t2,t3,t4,…) = P(t1) P(t2|t1) P(t3|t1,t2) P(t4|t1,t2,t3) … • Conditioning on a short subsequence of previous tags is more tractable • A history of 1 or 2 tags should be enough: Bigram model: P(ti|t1,…,ti-1) ≈ P(ti|ti-1) Trigram model: P(ti|t1,…,ti-1) ≈ P(ti|ti-2,ti-1) N-gram model: P(ti|t1,…,ti-1) ≈ P(ti|ti-n+1,…,ti-1) (see the sketch below)
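As a concrete illustration of the bigram model, the sketch below scores one candidate tag sequence by multiplying tag-transition probabilities with per-word lexical probabilities; the sentence, the tagset and all probabilities are invented:

```python
# Scoring one candidate tag sequence under a bigram model:
# P(tags, words) ≈ prod_i P(t_i | t_{i-1}) * P(w_i | t_i).
# "<s>" stands in for the sentence start, so P(t_1 | <s>) plays the role of P(t_1).

TRANS = {("<s>", "DET"): 0.6, ("DET", "NN"): 0.5, ("NN", "VBZ"): 0.3}
EMIT  = {("DET", "the"): 0.7, ("NN", "dog"): 0.01, ("VBZ", "barks"): 0.005}

def bigram_score(words, tags):
    """Joint probability of a (word, tag) sequence under the bigram model."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= TRANS.get((prev, t), 0.0) * EMIT.get((t, w), 0.0)
        prev = t
    return p

print(bigram_score(["the", "dog", "barks"], ["DET", "NN", "VBZ"]))
```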
More complex taggers • Bigram taggers assign tags on the basis of sequences of two words (usually assigning a tag to word_n on the basis of word_n-1) • An nth-order tagger assigns tags on the basis of sequences of n words • As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations
History (1960-2000, approximate chronological order)
• Brown Corpus created (EN-US), 1 million words
• Greene and Rubin: rule-based tagging, ~70%
• Brown Corpus tagged
• LOB Corpus created (EN-UK), 1 million words
• HMM tagging (CLAWS): 93%-95%
• LOB Corpus tagged
• POS tagging separated from other NLP
• DeRose/Church: efficient HMMs, sparse data, 95%+
• Penn Treebank Corpus (WSJ, 4.5M words)
• British National Corpus (tagged by CLAWS)
• Transformation-based tagging (Eric Brill), rule-based: 95%+
• Tree-based statistics (Helmut Schmid): 96%+
• Neural network taggers: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+
How do they work? • Tagger must be “trained” • Many different techniques, but typically … • Small “training corpus” hand-tagged • Tagging rules learned automatically • Rules define most likely sequence of tags • Rules based on • Internal evidence (morphology) • External evidence (context)
Rule-based taggers • Earliest type of tagging: two stages • Stage 1: look up word in lexicon to give list of potential POSs • Stage 2: Apply rules which certify or disallow tag sequences • Rules originally handwritten; more recently Machine Learning methods can be used • cf transformation-based learning, below
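A toy sketch of the two-stage idea; the lexicon, the single handwritten constraint and the tie-breaking choice are all invented:

```python
# Two-stage rule-based tagging: (1) look up candidate tags in a lexicon,
# (2) filter candidates with handwritten constraints on tag sequences.

LEXICON = {"the": {"DET"}, "run": {"NN", "VB"}, "fast": {"RB", "JJ"}}

def disallowed(prev_tag, tag):
    # Example handwritten constraint: no base-form verb directly after a determiner.
    return prev_tag == "DET" and tag == "VB"

def rule_based_tag(words):
    prev, output = "<s>", []
    for w in words:
        candidates = LEXICON.get(w, {"NN"})
        allowed = [t for t in candidates if not disallowed(prev, t)] or list(candidates)
        tag = sorted(allowed)[0]  # arbitrary choice when still ambiguous
        output.append((w, tag))
        prev = tag
    return output

print(rule_based_tag("the run".split()))  # the constraint rules out VB for "run"
```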
Stochastic taggers • Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 1960s and 70s) • The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)
(Hidden) Markov Models • Probability calculations imply Markov models: we assume that the probability of a tag depends only on the previous tag (or a short sequence of previous ones), not on the whole history • (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past • Markov chains can be modelled by finite state automata: the next state in a Markov chain depends only on a finite history of previous states • The model is “hidden” because only the words are observed; the underlying sequence of states (the tags) is not
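The sketch below illustrates the “hidden” part by brute force: for a two-word sentence it enumerates every possible hidden tag sequence and its probability (tagset and all numbers invented). This is exactly the T^L enumeration that the algorithms on the following slides avoid:

```python
# Every hidden tag sequence could have produced the observed words,
# each with some probability under the (toy) HMM.
from itertools import product

TAGS  = ["N", "V"]
INIT  = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT  = {"N": {"flies": 0.2, "like": 0.1}, "V": {"flies": 0.3, "like": 0.4}}

def joint(words, tags):
    """P(words, tags) under the bigram HMM."""
    p = INIT[tags[0]] * EMIT[tags[0]][words[0]]
    for i in range(1, len(words)):
        p *= TRANS[tags[i - 1]][tags[i]] * EMIT[tags[i]][words[i]]
    return p

words = ["flies", "like"]
for tags in product(TAGS, repeat=len(words)):
    print(tags, joint(words, list(tags)))   # every hidden path has a probability
```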
Three stages of HMM training • Estimating likelihoods on the basis of a corpus: Forward-backward algorithm • “Decoding”: applying the process to a given input: Viterbi algorithm • Learning (training): Baum-Welch algorithm or Iterative Viterbi
Forward-backward algorithm • Denote by A_t(s) the “forward” probability P(w1…wt, state s at time t) • Claim: A_t+1(s) = [ Σ_q A_t(q) · P(s|q) ] · P(w_t+1|s) • Therefore we can calculate all A_t(s) in time O(L·T^n) • Similarly, by going backwards, we can get the “backward” probability B_t(s) = P(w_t+1…w_L | state s at time t) • Multiplying, we get A_t(s) · B_t(s) = P(w1…wL, state s at time t) • Note that summing this over all states at a time t gives the likelihood of w1…wL
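A small sketch of the forward and backward passes for a toy two-state HMM (all probabilities invented); the printout confirms that Σ_s A_t(s)·B_t(s) gives the same sentence likelihood at every time t:

```python
# Forward-backward sketch for a tiny HMM.
# A[t][s] = P(w_1..w_t, state s at t);  B[t][s] = P(w_{t+1}..w_L | state s at t).

STATES = ["N", "V"]
INIT  = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT  = {"N": {"flies": 0.2, "like": 0.1}, "V": {"flies": 0.3, "like": 0.4}}

def forward(words):
    A = [{s: INIT[s] * EMIT[s][words[0]] for s in STATES}]
    for w in words[1:]:
        A.append({s: sum(A[-1][q] * TRANS[q][s] for q in STATES) * EMIT[s][w]
                  for s in STATES})
    return A

def backward(words):
    B = [{s: 1.0 for s in STATES}]
    for w in reversed(words[1:]):
        B.insert(0, {s: sum(TRANS[s][q] * EMIT[q][w] * B[0][q] for q in STATES)
                     for s in STATES})
    return B

words = ["flies", "like"]
A, B = forward(words), backward(words)
for t in range(len(words)):
    # Same sentence likelihood recovered at every time step.
    print(t, sum(A[t][s] * B[t][s] for s in STATES))
```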
Viterbi algorithm (aka dynamic programming) (see J&M p177ff) • Denote by Q_t(s) the probability of the best tag sequence for w1…wt that ends in state s • Claim: Q_t+1(s) = max_q Q_t(q) · P(s|q) · P(w_t+1|s) • Otherwise, appending s to that better prefix would give a path better than Q_t+1(s) • Therefore, checking all possible states q at time t, multiplying by the transition probability between q and s and the emission probability of w_t+1 given s, and taking the maximum, gives Q_t+1(s) • We need to store, for each state, the previous state on the best path Q_t(s) • Find the maximal final state, and reconstruct the path from the stored back-pointers • O(L·T^n) instead of T^L
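A corresponding Viterbi sketch over the same kind of toy HMM (probabilities invented), storing a back-pointer to the best previous state and reconstructing the best path at the end:

```python
# Viterbi decoding sketch for a tiny bigram HMM.
# Q[t][s] = probability of the best tag sequence for w_1..w_t that ends in state s.

STATES = ["N", "V"]
INIT  = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT  = {"N": {"flies": 0.2, "like": 0.1}, "V": {"flies": 0.3, "like": 0.4}}

def viterbi(words):
    Q = [{s: INIT[s] * EMIT[s][words[0]] for s in STATES}]
    back = [{}]
    for w in words[1:]:
        Q.append({})
        back.append({})
        for s in STATES:
            # Best previous state q, extending its best path by the step q -> s.
            q = max(STATES, key=lambda q: Q[-2][q] * TRANS[q][s])
            Q[-1][s] = Q[-2][q] * TRANS[q][s] * EMIT[s][w]
            back[-1][s] = q
    # Pick the best final state and follow the back-pointers.
    s = max(STATES, key=lambda s: Q[-1][s])
    path = [s]
    for t in range(len(words) - 1, 0, -1):
        s = back[t][s]
        path.insert(0, s)
    return path

print(viterbi(["flies", "like"]))  # ['N', 'V'] with these toy numbers
```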
Baum-Welch algorithm • Start with an initial HMM • Calculate, using forward-backward, the likelihood of getting our observations given that a certain hidden state was used at time i • Re-estimate the HMM parameters from these quantities • Continue until convergence • Can be shown to improve the likelihood at every iteration
Unsupervised learning • We have an untagged corpus • We may also have partial information such as a set of tags, a dictionary, knowledge of tag transitions, etc. • Use Baum-Welch to estimate both the context probabilities and the lexical probabilities
Supervised learning • Use a tagged corpus • Count the frequencies of tag-word pairs: C(t,w) • Estimate (Maximum Likelihood Estimate): P(w|t) = C(t,w) / C(t) • Count the frequencies of tag n-grams: C(t1…tn) • Estimate (Maximum Likelihood Estimate): P(tn|t1…tn-1) = C(t1…tn) / C(t1…tn-1) • What about small counts? Zero counts?
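A minimal sketch of these counts and MLE estimates over a tiny invented tagged corpus:

```python
# Supervised MLE estimation from a tagged corpus (toy data).
from collections import Counter

tagged = [("the", "DET"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DET"), ("cat", "NN"), ("sleeps", "VBZ")]

word_tag  = Counter(tagged)                      # C(t, w)
tag_count = Counter(t for _, t in tagged)        # C(t)
tags = [t for _, t in tagged]
bigrams = Counter(zip(tags, tags[1:]))           # C(t1, t2)

def p_word_given_tag(w, t):
    """MLE: P(w | t) = C(t, w) / C(t)."""
    return word_tag[(w, t)] / tag_count[t]

def p_tag_given_prev(t2, t1):
    """MLE: P(t2 | t1) = C(t1, t2) / C(t1)."""
    return bigrams[(t1, t2)] / tag_count[t1]

print(p_word_given_tag("dog", "NN"))      # 1/2
print(p_tag_given_prev("NN", "DET"))      # 2/2 = 1.0
```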
Sparse Training Data - Smoothing • Adding a bias: e.g. P(w|t) = (C(t,w) + λ) / (C(t) + λV), for a small λ and vocabulary size V • Compensates for estimation error (Bayesian approach) • Has a larger effect on low-count words • Solves the zero-count problem • Generalized smoothing: a broader family of estimators that reduces to the simple bias for a particular choice of parameters
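A sketch of the additive bias, with a hypothetical smoothed_prob helper and made-up counts; note the relative effect on a zero-count, a low-count and a high-count word:

```python
# Additive (add-lambda) smoothing: adding a small bias to every count
# removes zeros and mostly affects low-count events. Counts are illustrative.

def smoothed_prob(count_tw, count_t, vocab_size, lam=0.5):
    """P(w | t) ≈ (C(t, w) + lam) / (C(t) + lam * V)."""
    return (count_tw + lam) / (count_t + lam * vocab_size)

# An unseen (zero-count) word no longer gets probability zero:
print(smoothed_prob(count_tw=0, count_t=1000, vocab_size=50))
# Compare the shift for a low-count and a high-count word with the same C(t):
print(smoothed_prob(1, 1000, 50), 1 / 1000)      # low count: noticeably raised
print(smoothed_prob(500, 1000, 50), 500 / 1000)  # high count: barely moved
```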
Decision-tree tagging • Not all n-grams are created equal: • Some n-grams contain redundant information that may be expressed well enough with fewer context tags • Some n-grams are too sparse • Solution: decision trees (Schmid, 1994)
Decision Trees • Each node is a binary test on a context tag t_i-k • The leaves store probabilities for t_i • All HMM algorithms can still be used • Learning: • Build the tree from root to leaves • Choose tests for nodes that maximize information gain • Stop when a branch becomes too sparse • Finally, prune the tree
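A sketch of the node-selection criterion: computing the information gain of one hypothetical binary test (“is the previous tag DET?”) over invented (context, tag) samples:

```python
# Choosing a decision-tree node test by information gain over the tag t_i,
# given a binary test on the context (here: just the previous tag).
import math

def entropy(labels):
    total = len(labels)
    counts = {}
    for x in labels:
        counts[x] = counts.get(x, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(samples, test):
    """samples: list of (context, tag); test: context -> bool."""
    tags = [t for _, t in samples]
    yes = [t for c, t in samples if test(c)]
    no  = [t for c, t in samples if not test(c)]
    split = (len(yes) / len(tags)) * entropy(yes) + (len(no) / len(tags)) * entropy(no)
    return entropy(tags) - split

samples = [("DET", "NN"), ("DET", "NN"), ("VB", "NN"), ("VB", "RB"), ("NN", "VB")]
print(information_gain(samples, lambda prev: prev == "DET"))
```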
Transformation-based learning • Eric Brill (1993) • Start from an initial tagging, and apply a series of transformations • The transformations themselves are learned from the training data • Captures the tagging data in far fewer parameters than stochastic models • The transformations learned have linguistic meaning
Transformation-based learning • Examples: Change tag a to b when: • The preceding (following) word is tagged z • The word two before (after) is tagged z • One of the 2 preceding (following) words is tagged z • The preceding word is tagged z and the following word is tagged w • The preceding (following) word is W
Transformation-based Tagger: Learning • Start with an initial tagging • Score the possible transformations by comparing their result to the “truth” • Choose the transformation that maximizes the score, and apply it • Repeat the last two steps until no transformation improves the tagging (see the sketch below)
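A greedy learning sketch using a single rule template from the list above (“change tag a to b when the preceding tag is z”); the toy corpus, initial tagging and tagset are invented:

```python
# Greedy transformation-based learning sketch (Brill-style), one rule template.

truth   = ["DET", "NN", "VB", "DET", "NN"]
current = ["DET", "NN", "NN", "DET", "NN"]   # initial tagging (e.g. most-frequent tag)
TAGS = ["DET", "NN", "VB"]

def apply_rule(tags, a, b, z):
    """Change tag a to b wherever the preceding tag is z (applied simultaneously)."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == z:
            out[i] = b
    return out

def errors(tags):
    return sum(t != g for t, g in zip(tags, truth))

while True:
    best = None  # (error reduction, a, b, z)
    for a in TAGS:
        for b in TAGS:
            for z in TAGS:
                gain = errors(current) - errors(apply_rule(current, a, b, z))
                if best is None or gain > best[0]:
                    best = (gain, a, b, z)
    if best[0] <= 0:
        break  # no transformation improves the tagging any further
    gain, a, b, z = best
    current = apply_rule(current, a, b, z)
    print(f"learned: change {a} to {b} after {z} (net error reduction: {gain})")
```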