
Corpora and Statistical Methods

Corpora and Statistical Methods. Albert Gatt. POS Tagging. Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem. Current state of the art: taggers typically have 96-97% accuracy, a figure evaluated on a per-word basis.


Presentation Transcript


  1. Corpora and Statistical Methods Albert Gatt

  2. POS Tagging • Assign each word in continuous text a tag indicating its part of speech. • Essentially a classification problem. • Current state of the art: • taggers typically have 96-97% accuracy • figure evaluated on a per-word basis • in a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence
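The step from a per-word figure to a per-sentence one can be worked out directly. The second line below assumes, purely for illustration, that tagging errors are independent across words, which real taggers do not guarantee.

```latex
\begin{align*}
\text{expected errors per 20-word sentence} &= 20 \times (1 - 0.96) = 0.8 \\
P(\text{sentence tagged without error}) &= 0.96^{20} \approx 0.44
\end{align*}
```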

  3. Sources of difficulty in POS tagging • Mostly due to ambiguity when words have more than one possible tag. • need context to make a good guess about POS • context alone won’t suffice • A simple approach which assigns only the most common tag to each word performs with 90% accuracy!

  4. The information sources • Syntagmatic information: the tags of other words in the context of w • Not sufficient on its own. E.g. Greene/Rubin 1971 describe a context-only tagger with only 77% accuracy • Lexical information (“dictionary”): most common tag(s) for a given word • e.g. in English, many nouns can be used as verbs (flour the pan, wax the car…) • however, their most likely tag remains NN • distribution of a word’s usages across different POSs is uneven: usually, one is highly likely, the others much less so

  5. Tagging in other languages (than English) • In English, high reliance on context is a good idea, because of fixed word order • Free word order languages make this assumption harder • Compensation: these languages typically have rich morphology • Good source of clues for a tagger

  6. Evaluation and error analysis • Training a statistical POS tagger requires splitting corpus into training and test data. • Often, we need a development set as well, to tune parameters. • Using (n-fold) cross-validation is a good idea to save data. • randomly divide data into train + test • train and evaluate on test • repeat n times and take an average • NB: cross-validation requires the whole corpus to be blind. • To examine the training data, best to have fixed training & test sets, perform cross-validation on training data, and final evaluation on test set.

  7. Evaluation • Typically carried out against a gold standard based on accuracy (% correct). • Ideal to compare accuracy of our tagger with: • baseline (lower bound): • the standard is to choose each word’s most likely (unigram) tag • ceiling (upper bound): • e.g. see how well humans do at the same task • humans apparently agree on 96-97% of tags • means it is highly suspect for a tagger to get 100% accuracy
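The most-frequent-tag baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the toy corpus and the fallback to the overall most frequent tag for unknown words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Map each word to its most frequent tag; also return the overall most frequent tag."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    best_tag = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]  # used for unseen words (assumption)
    return best_tag, default_tag

def accuracy(tagged_sents, best_tag, default_tag):
    """Per-word accuracy of the baseline against gold-standard tags."""
    correct = total = 0
    for sent in tagged_sents:
        for word, gold in sent:
            correct += best_tag.get(word, default_tag) == gold
            total += 1
    return correct / total

# Toy data (hypothetical): "flour" is mostly a noun but occasionally a verb.
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("flour", "NN"), ("is", "BEZ"), ("white", "JJ")],
    [("flour", "VB"), ("the", "AT"), ("pan", "NN")],
    [("the", "AT"), ("flour", "NN"), ("spilled", "VBD")],
]
test = [[("flour", "VB"), ("the", "AT"), ("dog", "NN")]]

best_tag, default_tag = train_baseline(train)
baseline_acc = accuracy(test, best_tag, default_tag)
```

On this toy test sentence the baseline tags the verbal use of "flour" as NN, which is exactly the kind of error that context is needed to fix.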

  8. Part 1 HMM taggers

  9. Using Markov models • Basic idea: sequences of tags are a Markov Chain: • Limited horizon assumption: sufficient to look at previous tag for information about current tag • Time invariance: The probability of a sequence remains the same over time

  10. Implications/limitations • Limited horizon ignores long-distance dependences • e.g. can’t deal with WH-constructions • Chomsky (1957): this was one of the reasons cited against probabilistic approaches • Time invariance: • e.g. P(finite verb|pronoun) is constant • but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!

  11. Notation • We let $t_i$ range over tags • Let $w_i$ range over words • Subscripts denote position in a sequence • Use superscripts to denote word types: • $w^j$ = an instance of word type j in the lexicon • $t^j$ = tag t assigned to word $w^j$ • Limited horizon property becomes: $P(t_{i+1} \mid t_{1,i}) = P(t_{i+1} \mid t_i)$

  12. Basic strategy • Training set of manually tagged text • extract probabilities of tag sequences: • e.g. using Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005 • Next step: estimate the word/tag probabilities $P(w^l \mid t^j)$ • These are basically symbol emission probabilities

  13. Training the tagger: basic algorithm • Estimate probability of all possible sequences of 2 tags in the tagset from training data • For each tag $t^j$ and for each word $w^l$ estimate $P(w^l \mid t^j)$. • Apply smoothing.
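The three training steps above can be sketched as follows. The slides do not say which smoothing method is used, so add-one (Laplace) smoothing is an assumption here, as are the function names, the toy corpus and the use of PERIOD as the sentence-boundary tag.

```python
from collections import Counter

def train_hmm(tagged_sents, start_tag="PERIOD"):
    """Estimate add-one-smoothed P(tag | previous tag) and P(word | tag)."""
    trans = Counter()       # (prev_tag, tag) counts
    emit = Counter()        # (tag, word) counts
    tag_count = Counter()   # how often each tag emits a word
    prev_count = Counter()  # how often each tag occurs as a predecessor
    vocab = set()
    for sent in tagged_sents:
        prev = start_tag    # every sentence starts after a boundary tag
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(set(tag_count) | {start_tag})

    def p_trans(tag, prev):
        return (trans[(prev, tag)] + 1) / (prev_count[prev] + len(tags))

    def p_emit(word, tag):
        # The extra +1 in the denominator reserves mass for unseen words.
        return (emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab) + 1)

    return tags, p_trans, p_emit

# Toy corpus (hypothetical).
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ"), (".", "PERIOD")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", "PERIOD")],
]
tags, p_trans, p_emit = train_hmm(train)
```

Returning closures rather than full probability tables keeps the sketch short; a real tagger would more likely precompute and store the tables.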

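  14. Finding the best tag sequence • Given: a sentence of n words • Find: $t_{1,n}$ = the best n tags • Application of Bayes’ rule: $\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n}) P(t_{1,n})}{P(w_{1,n})}$ • the denominator can be eliminated as it’s the same for all tag sequences, leaving $\arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n}) P(t_{1,n})$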

  15. Finding the best tag sequence • The expression needs to be reduced to parameters that can be estimated from the training corpus • need to make some simplifying assumptions • words are independent of each other • a word’s identity depends only on its tag

  16. The independence assumption • Probability of a sequence of words given a sequence of tags is computed as a function of each word independently: $P(w_{1,n} \mid t_{1,n}) = \prod_{i=1}^{n} P(w_i \mid t_{1,n})$

  17. The identity assumption • Probability of a word given a tag sequence = probability of a word given its own tag: $P(w_i \mid t_{1,n}) = P(w_i \mid t_i)$

  18. Applying these assumptions
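The transcript does not preserve the equation on this slide. A reconstruction that follows from Bayes' rule, the independence and identity assumptions, and the limited horizon property is sketched below; taking $t_0$ to be the sentence-boundary tag is an assumption of this sketch.

```latex
\begin{align*}
\hat{t}_{1,n} &= \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n})
               = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) \\
P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})
  &= \prod_{i=1}^{n} P(w_i \mid t_{1,n}) \prod_{i=1}^{n} P(t_i \mid t_{1,i-1})
     && \text{(independence, chain rule)} \\
  &= \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(identity, limited horizon)} \\
\hat{t}_{1,n} &= \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{align*}
```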

  19. Tagging with the Markov Model • Can use the Viterbi Algorithm to find the best sequence of tags given a sequence of words (sentence) • Reminder: • $\delta_i(j)$: probability of being in state (tag) j at word i on the best path • $\psi_{i+1}(j)$: most probable state (tag) at word i given that we’re in state j at word i+1

  20. The algorithm: initialisation • Assume that P(PERIOD) = 1 at end of sentence: $\delta_1(\text{PERIOD}) = 1$ • Set all other tag probs to 0: $\delta_1(t) = 0$ for $t \neq \text{PERIOD}$

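  21. Algorithm: induction step • for i = 1 to n step 1: • for all tags $t^j$ do: • Probability of tag $t^j$ at i+1 on best path through i: $\delta_{i+1}(t^j) = \max_k [\delta_i(t^k) \, P(w_{i+1} \mid t^j) \, P(t^j \mid t^k)]$ • Most probable tag leading to $t^j$ at i+1: $\psi_{i+1}(t^j) = \arg\max_k [\delta_i(t^k) \, P(w_{i+1} \mid t^j) \, P(t^j \mid t^k)]$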

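  22. Algorithm: backtrace • State at n+1: $X_{n+1} = \arg\max_j \delta_{n+1}(j)$ • for j = n to 1 do: retrieve the most probable tags for every point in sequence: $X_j = \psi_{j+1}(X_{j+1})$ • Calculate probability for the sequence of tags selected: $P(X_1, \ldots, X_n) = \max_j \delta_{n+1}(t^j)$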
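The initialisation, induction and backtrace steps can be put together in a short sketch. It works in log space to avoid underflow, which the slides do not mention, and the probability tables are hypothetical numbers chosen by hand; only the overall shape of the algorithm is taken from the slides.

```python
import math

def log(p):
    """Logarithm that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else -math.inf

def viterbi(words, tags, p_trans, p_emit, start_tag="PERIOD"):
    """Return the most probable tag sequence for `words` and its probability."""
    # Initialisation: all probability mass on the sentence-boundary tag.
    delta = {t: (0.0 if t == start_tag else -math.inf) for t in tags}
    backptrs = []
    # Induction: extend the best path one word at a time.
    for word in words:
        new_delta, psi = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda k: delta[k] + log(p_trans(t, k)))
            psi[t] = best_prev
            new_delta[t] = (delta[best_prev] + log(p_trans(t, best_prev))
                            + log(p_emit(word, t)))
        delta = new_delta
        backptrs.append(psi)
    # Backtrace: start from the best final state and follow the back-pointers.
    last = max(tags, key=lambda t: delta[t])
    path = [last]
    for psi in reversed(backptrs[1:]):
        path.append(psi[path[-1]])
    path.reverse()
    return path, math.exp(delta[last])

# Hypothetical hand-set parameters; missing entries count as probability 0.
trans = {("PERIOD", "AT"): 0.6, ("PERIOD", "NN"): 0.2, ("PERIOD", "VB"): 0.2,
         ("AT", "NN"): 0.9, ("AT", "VB"): 0.1,
         ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "AT"): 0.2,
         ("VB", "AT"): 0.5, ("VB", "NN"): 0.4, ("VB", "VB"): 0.1}
emit = {("AT", "the"): 1.0, ("NN", "flour"): 0.6, ("NN", "pan"): 0.4,
        ("VB", "flour"): 0.3, ("VB", "pan"): 0.1, ("VB", "wax"): 0.6}
tags = ["AT", "NN", "VB", "PERIOD"]

path, prob = viterbi(
    ["flour", "the", "pan"], tags,
    p_trans=lambda tag, prev: trans.get((prev, tag), 0.0),
    p_emit=lambda word, tag: emit.get((tag, word), 0.0),
)
```

With these numbers "flour" is more likely a noun in isolation, but the following article pulls the best path towards the verb reading, so the sequence VB AT NN wins.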

  23. Some observations • The model is a Hidden Markov Model • we only observe words when we tag • In actuality, during training we have a visible Markov Model • because the training corpus provides words + tags

  24. “True” HMM taggers • Applied to cases where we do not have a large training corpus • We maintain the usual MM assumptions • Initialisation: use dictionary: • set emission probability for a word/tag to 0 if it’s not in dictionary • Training: apply to data, use forward-backward algorithm • Tagging: exactly as before

  25. Part 2 Transformation-based error-driven learning

  26. Transformation-based learning • Approach proposed by Brill (1995) • uses quantitative information at training stage • outcome of training is a set of rules • tagging is then symbolic, using the rules • Components: • a set of transformation rules • learning algorithm

  27. Transformations • General form: t1 → t2 • “replace t1 with t2 if certain conditions are satisfied” • Examples: • Morphological: Change the tag from NN to NNS if the word has the suffix "s" • dogs_NN → dogs_NNS • Syntactic: Change the tag from NN to VB if the word occurs after "to" • to_TO go_NN → to_TO go_VB • Lexical: Change the tag to JJ if deleting the prefix "un" results in a word. • uncool_XXX → uncool_JJ • uncle_NN -/-> uncle_JJ

  28. Learning • Unannotated text → Initial state annotator • e.g. assign each word its most frequent tag in a dictionary • Truth: a manually annotated version of the corpus against which to compare • Learner: learns rules by comparing initial state to Truth → rules

  29. Learning algorithm • Simple iterative process: • apply a rule to the corpus • compare to the Truth • if error rate is reduced, keep the results • A priori specifications: • how initial state annotator works • the space of possible transformations • Brill (1995) used a set of initial templates • the function to compare the result of applying the rules to the truth
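The iterative process can be sketched with a single non-lexicalised template, "change tag a to tag b when the preceding tag is z". This is a simplified illustration rather than Brill's actual learner: picking the rule with the fewest remaining errors at each round, the function names and the toy data are all assumptions of the sketch.

```python
def apply_transformation(rule, tags):
    """Delayed-effect application of (a, b, z): change a to b when the previous tag is z."""
    a, b, z = rule
    return [b if i > 0 and t == a and tags[i - 1] == z else t
            for i, t in enumerate(tags)]

def errors(tags, truth):
    """Number of positions where the current tagging disagrees with the truth."""
    return sum(t != g for t, g in zip(tags, truth))

def learn(initial, truth, tagset, max_rules=10):
    """Greedily collect rules that reduce the error count against the truth."""
    current, learned = list(initial), []
    candidates = [(a, b, z) for a in tagset for b in tagset for z in tagset if a != b]
    for _ in range(max_rules):
        best = min(candidates, key=lambda r: errors(apply_transformation(r, current), truth))
        if errors(apply_transformation(best, current), truth) >= errors(current, truth):
            break  # no rule reduces the error count any further
        current = apply_transformation(best, current)
        learned.append(best)
    return learned, current

# Toy corpus (hypothetical): "to race the race".
initial = ["TO", "NN", "AT", "NN"]  # initial state: each word's most frequent tag
truth = ["TO", "VB", "AT", "NN"]    # manually annotated version
learned, final = learn(initial, truth, tagset=["AT", "NN", "TO", "VB"])
```

The output of training is the ordered rule list in `learned`; tagging new text then means running the initial state annotator and applying those rules in order.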

  30. Non-lexicalised rule templates • Take only tags into account, not the shape of words Change tag a to tag b when: • The preceding (following) word is tagged z. • The word two before (after) is tagged z. • One of the three preceding (following) words is tagged z. • The preceding (following) word is tagged z and the word two before (after) is tagged w. • …

  31. Lexicalised rule templates • Take into account specific words in the context Change tag a to tag b when: • The preceding (following) word is w. • The word two before (after) is w. • The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t. • …

  32. Morphological rule templates • Useful for completely unknown words. Sensitive to the word’s “shape”. Change the tag of an unknown word (from X) to Y if: • Deleting the prefix (suffix) x, |x| ≤ 4, results in a word • The first (last) (1,2,3,4) characters of the word are x. • Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4). • Word w ever appears immediately to the left (right) of the word. • Character z appears in the word. • …

  33. Order-dependence of rules • Rules are triggered by environments satisfying their conditions • E.g. “A → B if preceding tag is A” • Suppose our sequence is “AAAA” • Two possible forms of rule application: • immediate effect: applications of the same transformation can influence each other • result: ABAB • delayed effect: results in ABBB • the rule is triggered multiple times from the same initial input • Brill (1995) opts for this solution
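The difference between the two application modes can be checked on the "AAAA" example. The sketch below assumes a single left-to-right pass; the function name and its parameters are made up for the illustration.

```python
def apply_rule(tags, source, target, prev, immediate):
    """Apply 'change source to target if the preceding tag is prev', left to right."""
    original = list(tags)
    result = list(tags)
    for i in range(1, len(result)):
        # Immediate effect reads tags already rewritten in this pass;
        # delayed effect always reads the untouched input.
        context = result if immediate else original
        if original[i] == source and context[i - 1] == prev:
            result[i] = target
    return "".join(result)

immediate_result = apply_rule("AAAA", "A", "B", "A", immediate=True)
delayed_result = apply_rule("AAAA", "A", "B", "A", immediate=False)
```

With immediate effect the rewrite at position 2 removes the trigger for position 3, giving ABAB; with delayed effect every position sees the original input, giving ABBB.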

  34. More on Transformation-based tagging • Can be used for unsupervised learning • like HMM-based tagging, the only info available is the allowable tags for each word • takes advantage of the fact that most words have only one tag • E.g. word can = NN in context AT ___ BEZ because most other words in this context are NN • therefore, learning algorithm would learn the rule “change tag to NN in context AT ___ BEZ” • Unsupervised method achieves 95.6% accuracy!!

  35. Part 3 Maximum Entropy models and POS Tagging

  36. Limitations of HMMs • An HMM tagger relies on: • P(tag|previous tag) • P(word|tag) • these are combined by multiplication • TBL includes many other useful features which are hard to model in HMM: • prefixes, suffixes • capitalisation • … • Can we combine both, i.e. have HMM-style tagging with multiple features?

  37. The rationale • In order to tag a word, we consider its context or “history” h. We want to estimate a probability distribution p(h,t) from sparse data. • h is encoded in terms of features (e.g. morphological features, surrounding tag features etc) • There are some constraints on these features that we discover from training data. • We want our model to make the fewest possible assumptions beyond these constraints.

  38. Motivating example • Suppose we wanted to tag the word w. • Assume we have a set T of 45 different tags: T ={NN, JJ, NNS, NNP, VVS, VB, …} • The probabilistic tagging model that makes fewest assumptions assigns a uniform distribution over the tags: $P(t) = \frac{1}{45}$ for every $t \in T$

  39. Motivating example • Suppose we find that the possible tags for w are NN, JJ, NNS, VB. • We therefore impose our first constraint on the model: $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$ • (and the prob. of every other tag is 0) • The simplest model satisfying this constraint: $P(NN) = P(JJ) = P(NNS) = P(VB) = \frac{1}{4}$

  40. Motivating example • We suddenly discover that w is tagged as NN or NNS 8 out of 10 times. • Model now has two constraints: $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$ and $P(NN) + P(NNS) = \frac{8}{10}$ • Again, we require our model to make no further assumptions. Simplest distribution leaves probabilities for all tags except NN/NNS equal: • P(NN) = 4/10 • P(NNS) = 4/10 • P(JJ) = 1/10 • P(VB) = 1/10

  41. Motivating example • We suddenly discover that verbs (VB) occur 1 in every 20 words. • Model now has three constraints: $P(NN) + P(JJ) + P(NNS) + P(VB) = 1$, $P(NN) + P(NNS) = \frac{8}{10}$ and $P(VB) = \frac{1}{20}$ • Simplest distribution is now: • P(NN) = 4/10 • P(NNS) = 4/10 • P(JJ) = 3/20 • P(VB) = 1/20

  42. What we’ve been doing • Maximum entropy builds a distribution by continuously adding features. • Each feature picks out a subset of the training observations. • For each feature, we add a constraint on our total distribution. • Our task is then to find the best distribution given the constraints.

  43. Features for POS Tagging • Each tagging decision for a word occurs in a specific context or “history” h. • For tagging, we consider as context: • the word itself • morphological properties of the word • other words surrounding the word • previous tags • For each relevant aspect of the context $h_i$, we can define a feature $f_j$ that allows us to learn how well that aspect is associated with a tag $t_i$. • Probability of a tag given a context is a weighted function of the features.

  44. Features for POS Tagging • In a maximum entropy model, this information is captured by a binary or indicator feature • each feature $f_i$ has a weight $\alpha_i$ reflecting its importance • NB: each $\alpha_i$ is uniquely associated with a feature

  45. Features for POS Tagging in Ratnaparkhi (1996) • Had three sets of features, for non-rare, rare and all words:

  46. Features for POS Tagging • Given the large number of possible features, which ones will be part of the model? • We do not want redundant features • We do not want unreliable and rarely occurring features (avoid overfitting) • Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data

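  47. The form of the model • Features $f_j$ and their parameters are used to compute the probability $p(h_i, t_i)$: $p(h_i, t_i) = \frac{1}{Z} \prod_j \alpha_j^{f_j(h_i, t_i)}$ • where j ranges over features & Z is a normalisation constant • Transform into a linear equation: $\log p(h_i, t_i) = \sum_j f_j(h_i, t_i) \log \alpha_j - \log Z$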

  48. Conditional probabilities • The conditional probabilities can be computed based on the joint probabilities: $p(t \mid h) = \frac{p(h, t)}{\sum_{t' \in T} p(h, t')}$ • Probability of a sequence of tags given a sequence of words: $P(t_{1,n} \mid w_{1,n}) = \prod_{i=1}^{n} p(t_i \mid h_i)$ • NB: unlike an HMM, we have one probability here. • we directly estimate p(t|h) • model combines all features in $h_i$ into a single estimate • no limit in principle on what features we can take into account
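How indicator features and their weights combine into p(t|h) can be sketched as follows. The features, the weights and the dictionary representation of a history are hypothetical and set by hand; in a real model the weights would be estimated from training data.

```python
def joint_score(history, tag, features, alphas):
    """Unnormalised p(h, t): product of alpha_j over the features that fire."""
    score = 1.0
    for f, alpha in zip(features, alphas):
        if f(history, tag):
            score *= alpha
    return score

def conditional(history, tagset, features, alphas):
    """p(t | h) = p(h, t) / sum over t' of p(h, t'); the constant Z cancels."""
    scores = {t: joint_score(history, t, features, alphas) for t in tagset}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

# Hypothetical indicator features: each returns True or False for a (history, tag) pair.
features = [
    lambda h, t: h["word"].endswith("ing") and t == "VBG",
    lambda h, t: h["prev_tag"] == "AT" and t == "NN",
    lambda h, t: h["word"][0].isupper() and t == "NNP",
]
alphas = [6.0, 3.0, 4.0]  # hand-picked weights, one per feature

history = {"word": "running", "prev_tag": "AT"}
probs = conditional(history, ["NN", "VBG", "NNP"], features, alphas)
```

Here the suffix feature and the previous-tag feature both fire, for different tags, and the model weighs them against each other in a single estimate rather than multiplying two separately estimated distributions as an HMM does.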

  49. The use of constraints • Every feature we have imposes a constraint or expectation on the probability model. We want: $E_p f_j = E_{\tilde{p}} f_j$ • Where: • $E_p f_j$ is the model p’s expectation of $f_j$ • $E_{\tilde{p}} f_j$ is the empirical expectation of $f_j$

  50. Why maximum entropy? • Recall that entropy is a measure of uncertainty in a distribution. • Without any knowledge, simplest distribution is uniform • uniform distributions have the highest entropy • As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints: $p^* = \arg\max_{p \in P} H(p)$ • where P is the set of possible distributions with $E_p f_j = E_{\tilde{p}} f_j$ for all j • p* is unique and has the form given earlier • Basically, an application of Occam’s razor: make no further assumptions than necessary.
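The claim that each added constraint lowers the attainable entropy can be checked numerically on the distributions from the motivating example. The `skewed` distribution is an extra, made-up candidate that also satisfies P(NN) + P(NNS) = 8/10, included to show that the even split is the higher-entropy choice.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

uniform_45 = [1 / 45] * 45                  # no constraints: uniform over all 45 tags
uniform_4 = [1 / 4] * 4                     # only NN, NNS, JJ, VB are possible
two_constraints = [0.4, 0.4, 0.1, 0.1]      # P(NN) + P(NNS) = 8/10
three_constraints = [0.4, 0.4, 0.15, 0.05]  # additionally P(VB) = 1/20
skewed = [0.7, 0.1, 0.1, 0.1]               # same NN/NNS total, but less even

for name, dist in [("uniform_45", uniform_45), ("uniform_4", uniform_4),
                   ("two_constraints", two_constraints),
                   ("three_constraints", three_constraints), ("skewed", skewed)]:
    print(f"{name}: {entropy(dist):.3f} bits")
```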
