Part of Speech Tagging September 9, 2005
Part of Speech • Each word belongs to a word class. The word class of a word is known as part-of-speech (POS) of that word. • Most POS tags implicitly encode fine-grained specializations of eight basic parts of speech: • noun, verb, pronoun, preposition, adjective, adverb, conjunction, article • These categories are based on morphological and distributional similarities (not semantic similarities). • Part of speech is also known as: • word classes • morphological classes • lexical tags
Part of Speech (cont.) • A POS tag of a word describes the major and minor word classes of that word. • A POS tag of a word gives a significant amount of information about that word and its neighbours. For example, a possessive pronoun (my, your, her, its) will most likely be followed by a noun, and a personal pronoun (I, you, he, she) will most likely be followed by a verb. • Most words have a single POS tag, but some have more than one (2, 3, 4, …) • For example, book/noun or book/verb • I bought a book. • Please book that flight.
Tag Sets • There are various tag sets to choose from. • The choice of tag set depends on the nature of the application. • We may use a small tag set (more general tags) or • a large tag set (finer tags). • Some widely used part-of-speech tag sets: • Penn Treebank has 45 tags • Brown Corpus has 87 tags • C7 tag set has 146 tags • In a tagged corpus, each word is associated with a tag from the tag set used.
English Word Classes • Parts of speech can be divided into two broad categories: • closed class types -- such as prepositions • open class types -- such as nouns, verbs • Closed class words are generally also function words. • Function words play an important role in grammar. • Some function words are: of, it, and, you • Function words are usually very short and occur frequently. • There are four major open classes: • noun, verb, adjective, adverb • A new word may easily enter an open class. • Word classes may differ from one natural language to another, but all natural languages have at least two word classes: noun and verb.
Nouns • Nouns can be divided into: • proper nouns -- names for specific entities such as India, John, Ali • common nouns • Proper nouns do not take an article, but common nouns may. • Common nouns can be divided into: • count nouns -- they can be singular or plural -- chair/chairs • mass nouns -- used when something is conceptualized as a homogeneous group -- snow, salt • Mass nouns cannot take the articles a and an, and they cannot be plural.
Verbs • The verb class includes words referring to actions and processes. • Verbs can be divided into: • main verbs -- open class -- draw, bake • auxiliary verbs -- closed class -- can, should • Auxiliary verbs can be divided into: • copula -- be, have • modal verbs -- may, can, must, should • Verbs have different morphological forms: • non-3rd-person-sg -- eat • 3rd-person-sg -- eats • progressive -- eating • past -- ate • past participle -- eaten
Adjectives • Adjectives describe properties or qualities: • for color -- black, white • for age -- young, old • In Turkish, all adjectives can also be used as nouns. • kırmızı kitap -- red book • kırmızıyı -- the red one (ACC)
Adverbs • Adverbs normally modify verbs. • Adverb categories: • locative adverbs -- home, here, downhill • degree adverbs -- very, extremely • manner adverbs -- slowly, delicately • temporal adverbs -- yesterday, Friday • Because of the heterogeneous nature of adverbs, some adverbs such as Friday may be tagged as nouns.
Major Closed Classes • Prepositions -- on, under, over, near, at, from, to, with • Determiners -- a, an, the • Pronouns -- I, you, he, she, who, others • Conjunctions -- and, but, if, when • Particles -- up, down, on, off, in, out • Numerals -- one, two, first, second
Prepositions • Occur before noun phrases • indicate spatial or temporal relations • Example: • on the table • under the chair • Prepositions are frequent. For example, some frequency counts from a 16-million-word corpus (COBUILD): • of 540,085 • in 331,235 • for 142,421 • to 125,691 • with 124,965 • on 109,129 • at 100,169
Particles • A particle combines with a verb to form a larger unit called a phrasal verb. • go on • turn on • turn off • shut down
Articles • A small closed class • Only three words in the class: a, an, the • Articles mark definiteness or indefiniteness. • They occur very frequently. For example, some frequency counts from a 16-million-word corpus (COBUILD): • the 1,071,676 • a 413,887 • an 59,359 • Almost 10% of the words in this corpus are articles.
Conjunctions • Conjunctions are used to combine or join two phrases, clauses or sentences. • Coordinating conjunctions -- and, or, but • join two elements of equal status • Example: you and me • Subordinating conjunctions -- that, who • combine a main clause with a subordinate clause • Example: • I thought that you might like milk
Pronouns • Shorthand for referring to some entity or event. • Pronouns can be divided into: • personal -- you, she, I • possessive -- my, your, his • wh-pronouns -- who, what
Tag Sets for English • There are several widely used tag sets for English part-of-speech tagging. • The Penn Treebank tag set has 45 tags, including: • IN preposition/subordinating conj. • DT determiner • JJ adjective • NN noun, singular or mass • NNS noun, plural • VB verb, base form • VBD verb, past tense • A sentence from the Brown corpus tagged with the Penn Treebank tag set: • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Part of Speech Tagging • Part of speech tagging is simply assigning the correct part of speech to each word in an input sentence. • We assume that we have the following: • A set of tags (our tag set) • A dictionary that tells us the possible tags for each word (including all morphological variants). • A text to be tagged. • There are different algorithms for tagging: • Rule Based Tagging -- uses hand-written rules • Stochastic Tagging -- uses probabilities computed from a training corpus • Transformation Based Tagging -- uses rules learned automatically
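As a minimal sketch, these three inputs might look as follows in Python (toy data; names such as POSSIBLE_TAGS and candidate_tags are purely illustrative, not taken from any particular tagger):

    # Toy illustration of the three inputs a tagger works with (hypothetical data).
    TAG_SET = {"DT", "IN", "JJ", "NN", "NNS", "VB", "VBD", "TO"}   # our tag set

    # Dictionary: the possible tags for each word form (including morphological variants).
    POSSIBLE_TAGS = {
        "the":    {"DT"},
        "book":   {"NN", "VB"},
        "books":  {"NNS"},
        "race":   {"NN", "VB"},
        "that":   {"DT", "IN"},
        "flight": {"NN"},
    }

    # The text to be tagged, as a list of tokens.
    TEXT = ["book", "that", "flight"]

    def candidate_tags(word):
        """Look up candidate tags for a word; unknown words fall back to the whole tag set."""
        return POSSIBLE_TAGS.get(word.lower(), TAG_SET)

    print([(w, candidate_tags(w)) for w in TEXT])
    # e.g. [('book', {'NN', 'VB'}), ('that', {'DT', 'IN'}), ('flight', {'NN'})]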
How hard is tagging? • Most words in English are unambiguous: they have only a single tag. • But many of the most common words are ambiguous: • can/verb can/auxiliary can/noun • The number of word types in the Brown Corpus: • unambiguous (one tag) 35,340 • ambiguous (2-7 tags) 4,100 • 2 tags 3,760 • 3 tags 264 • 4 tags 61 • 5 tags 12 • 6 tags 2 • 7 tags 1 • While only 11.5% of word types are ambiguous, over 40% of Brown corpus tokens are ambiguous.
Problem Setup • There are M types of POS tags • Tag set: {t1,..,tM}. • The word vocabulary size is V • Vocabulary set: {w1,..,wV}. • We have a word sequence of length n: W = w1,w2…wn • Want to find the best sequence of POS tags: T = t1,t2…tn
Information sources for tagging • All techniques are based on the same observations… • Syntagmatic information: some tag sequences are more probable than others • ART+ADJ+N is more probable than ART+ADJ+VB • Lexical information: knowing the word to be tagged gives a lot of information about the correct tag • “table”: {noun, verb} but not {adj, prep, …} • “rose”: {noun, adj, verb} but not {prep, ...}
Rule-Based Part-of-Speech Tagging • First Stage: uses a dictionary to assign each word a list of potential parts-of-speech. • Second Stage: uses a large list of hand-crafted rules to winnow this list down to a single part-of-speech for each word. • ENGTWOL is a rule-based tagger: • in the first stage, it uses a two-level lexicon transducer • in the second stage, it uses hand-crafted rules (about 1,100 rules)
Sample rules N-IP rule: A tag N (noun) cannot be followed by a tag IP (interrogative pronoun) ... man who … • man: {N} • who: {RP, IP} --> {RP} relative pronoun ART-V rule: A tag ART (article) cannot be followed by a tag V (verb) ...the book… • the: {ART} • book: {N, V} --> {N}
After The First Stage • Example: He had a book. • After the first stage: • he he/pronoun • had have/verb-past have/auxiliary-past • a a/article • book book/noun book/verb
Tagging Rule Rule-1: if (the previous tag is an article) then eliminate all verb tags Rule-2: if (the next tag is verb) then eliminate all verb tags
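A minimal sketch of the two stages on the “He had a book” example, using Rule-1 as the only elimination rule (toy tag names; this is just an illustration, not the ENGTWOL rule formalism):

    # Stage 1: dictionary lookup assigns every candidate tag (toy data).
    candidates = {
        "he":   ["pronoun"],
        "had":  ["verb-past", "auxiliary-past"],
        "a":    ["article"],
        "book": ["noun", "verb"],
    }

    sentence = ["he", "had", "a", "book"]
    tags = [list(candidates[w]) for w in sentence]

    # Stage 2: apply elimination rules.
    # Rule-1: if the previous word is tagged article, eliminate all verb tags.
    for i in range(1, len(sentence)):
        if "article" in tags[i - 1]:
            remaining = [t for t in tags[i] if not t.startswith("verb")]
            tags[i] = remaining or tags[i]   # never eliminate every candidate

    print(list(zip(sentence, tags)))
    # [('he', ['pronoun']), ('had', ['verb-past', 'auxiliary-past']),
    #  ('a', ['article']), ('book', ['noun'])]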
Transformation-based tagging • Due to Eric Brill (1995) • basic idea: • take a non-optimal sequence of tags and • improve it successively by applying a series of well-ordered re-write rules • rule-based • but, rules are learned automatically by training on a pre-tagged corpus
An example • 1. Assign each word its most likely tag • P(NN|race) = .98 • P(VB|race) = .02 • 2. Change some tags by applying transformation rules
Types of context • lots of latitude… • can be: • tag-triggered transformation • The preceding/following word is tagged this way • The word two before/after is tagged this way • ... • word-triggered transformation • The preceding/following word is this word • … • morphology-triggered transformation • The preceding/following word ends with an s • … • a combination of the above • The preceding word is tagged this way AND the following word is this word
Learning the transformation rules • Input: A corpus with each word: • correctly tagged (for reference) • tagged with its most frequent tag (C0) • Output: A bag of transformation rules • Algorithm: • Instantiates a small set of hand-written templates (generic rules) by comparing the reference corpus to C0 • Change tag a to tag b when… • The preceding/following word is tagged z • The word two before/after is tagged z • One of the 2 preceding/following words is tagged z • One of the 2 preceding words is z • …
Learning the transformation rules (con't) • Run the initial tagger and compile the types of errors: • <incorrect tag, desired tag, # of occurrences> • For each error type, instantiate all templates to generate candidate transformations • Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces • Save the transformation that yields the greatest improvement • Stop when no transformation can reduce the error rate by more than a predetermined threshold
Example • if the initial tagger mistags 159 words as verbs instead of nouns • create the error triple: <verb, noun, 159> • Suppose template #3 is instantiated as the rule: • Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner. • When this template is applied to the corpus: • it corrects 98 of the 159 errors • but it also creates 18 new errors • Error reduction is 98-18=80
Learning the best transformations • input: • a corpus with each word: • correctly tagged (for reference) • tagged with its most frequent tag (C0) • a bag of unordered transformation rules • output: • an ordering of the best transformation rules
Learning the best transformations (con’t) • let: • E(Ck) = number of words incorrectly tagged in the corpus at iteration k • v(C) = the corpus obtained after applying rule v to the corpus C • ε = minimum reduction in errors required to continue
for k := 0 step 1 do
    bt := argmin_t E(t(Ck))                          // find the transformation t that minimizes the error rate
    if (E(Ck) - E(bt(Ck))) < ε then goto finished    // bt does not improve the tagging significantly
    Ck+1 := bt(Ck)                                   // apply rule bt to the current corpus
    Tk+1 := bt                                       // keep bt as the current transformation rule
end
finished: the sequence T1 T2 … Tk is the ordered list of transformation rules
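A compact Python sketch of this greedy loop, under simplifying assumptions: the candidate rules are given as a fixed list of plain functions rather than being instantiated from templates, and names such as learn_transformations are illustrative only.

    def errors(tags, gold):
        """Number of positions where the current tagging disagrees with the reference tags."""
        return sum(t != g for t, g in zip(tags, gold))

    def learn_transformations(words, gold, initial_tags, candidate_rules, epsilon=1):
        """Greedy TBL learning: repeatedly keep the rule that removes the most errors.

        candidate_rules: functions rule(words, tags) -> new list of tags.
        Returns the ordered list of selected rules (T1, T2, ...).
        """
        current = list(initial_tags)
        selected = []
        while True:
            best_rule, best_tags, best_gain = None, None, 0
            for rule in candidate_rules:
                new_tags = rule(words, current)
                gain = errors(current, gold) - errors(new_tags, gold)
                if gain > best_gain:
                    best_rule, best_tags, best_gain = rule, new_tags, gain
            if best_rule is None or best_gain < epsilon:
                break                      # no rule improves the tagging enough: stop
            current = best_tags            # apply the best rule to the whole corpus
            selected.append(best_rule)     # keep it as the next transformation
        return selected

    # Example candidate rule: change NN to VB when the previous tag is TO.
    def nn_to_vb_after_to(words, tags):
        out = list(tags)
        for i in range(1, len(out)):
            if out[i] == "NN" and out[i - 1] == "TO":
                out[i] = "VB"
        return out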
Strengths of transformation-based tagging • exploits a wider range of lexical and syntactic regularities • can look at a wider context • conditions tags on preceding/following words, not just preceding tags • can use more context than a bigram or trigram model • transformation rules are easier to understand than matrices of probabilities
How TBL Rules are Applied • Before the rules are applied, the tagger labels every word with its most likely tag. • We get these most likely tags from a tagged corpus. • Example: • He is expected to race tomorrow • he/PRP is/VBZ expected/VBN to/TO race/NN tomorrow/NN • After selecting the most-likely tags, we apply the transformation rules. • Change NN to VB when the previous tag is TO • This rule converts race/NN into race/VB • This may not work in every case • ….. According to race
How TBL Rules are Learned • We assume that we have a tagged corpus. • Brill’s TBL algorithm has three major steps: • Tag the corpus with the most likely tag for each word (unigram model) • Choose a transformation that deterministically replaces an existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate out of all transformations. • Apply the transformation to the training corpus. • These steps are repeated until a stopping criterion is reached. • The resulting tagger works as follows: • first tag using the most-likely tags • then apply the learned transformations in order
Transformations • A transformation is selected from a small set of templates. Change tag a to tag b when - The preceding (following) word is tagged z. - The word two before (after) is tagged z. - One of two preceding (following) words is tagged z. - One of three preceding (following) words is tagged z. - The preceding word is tagged z and the following word is tagged w. - The preceding (following) word is tagged z and the word two before (after) is tagged w.
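As a hedged sketch of what one of these templates looks like once instantiated (illustrative names only; the actual Brill tagger has its own fixed template inventory and rule format):

    def make_prev_tag_rule(a, b, z):
        """Instantiate the template: change tag a to tag b when the preceding word is tagged z."""
        def rule(words, tags):
            out = list(tags)
            for i in range(1, len(out)):
                if out[i] == a and out[i - 1] == z:
                    out[i] = b
            return out
        return rule

    # Two instantiations of the same template with different tag triples:
    rules = [make_prev_tag_rule("NN", "VB", "TO"),
             make_prev_tag_rule("VBP", "VB", "MD")]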
Stochastic POS tagging • Assume that a word’s tag only depends on the previous tags (not following ones) • Use a training set (manually tagged corpus) to: • learn the regularities of tag sequences • learn the possible tags for a word • model this info through a language model (n-gram)
Hidden Markov Model (HMM) Taggers • Goal: maximize P(word|tag) x P(tag|previous n tags) • P(word|tag) -- lexical information • word/lexical likelihood • probability that, given this tag, we have this word • NOT the probability that this word has this tag • modeled through a language model (word-tag matrix) • P(tag|previous n tags) -- syntagmatic information • tag sequence likelihood • probability that this tag follows these previous tags • modeled through a language model (tag-tag matrix)
Tag sequence probability • P(tag|previous n tags) • if we look at the previous (n-1) tags to find the current tag --> n-gram model • trigram model • chooses the most probable tag ti for word wi given: • the previous 2 tags ti-2 & ti-1 and • the current word wi • bigram model • chooses the most probable tag ti for word wi given: • the previous tag ti-1 and • the current word wi • unigram model (just the most-likely tag) • chooses the most probable tag ti for word wi given: • the current word wi
Example • “race” can be VB or NN • “Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV” • “People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN” • let’s tag the word “race” in the 1st sentence with a bigram model.
Example (con’t) • assuming previous words have been tagged, we have: “Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow” • P(race|VB) x P(VB|TO) ? • given that we have a VB, how likely is the current word to be race • given that the previous tag is TO, how likely is the current tag to be VB • P(race|NN) x P(NN|TO) ? • given that we have a NN, how likely is the current word to be race • given that the previous tag is TO, how likely is the current tag to be NN
Example (con’t) • From the training corpus, we found that: • P(NN|TO) = .021 // given that the previous tag is TO, 2.1% chance that the current tag is NN • P(VB|TO) = .34 // given that the previous tag is TO, 34% chance that the current tag is VB • P(race|NN) = .00041 // given that we have an NN, 0.041% chance that this word is "race" • P(race|VB) = .00003 // given that we have a VB, 0.003% chance that this word is "race" • so: P(VB|TO) x P(race|VB) = .34 x .00003 = .0000102 • P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086 • so: VB is more probable!
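The same comparison written out in a few lines of Python, using the probabilities from this slide:

    # Disambiguating "race" after "to/TO" with a bigram HMM score (numbers from the slide).
    p_nn_given_to   = 0.021     # P(NN|TO)
    p_vb_given_to   = 0.34      # P(VB|TO)
    p_race_given_nn = 0.00041   # P(race|NN)
    p_race_given_vb = 0.00003   # P(race|VB)

    score_vb = p_vb_given_to * p_race_given_vb    # 0.0000102
    score_nn = p_nn_given_to * p_race_given_nn    # about 0.0000086
    print("VB" if score_vb > score_nn else "NN")  # -> VB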
Example (con’t) • and by the way: race is a NN 98% of the time!!! • P(VB|race) = 0.02 • P(NN|race) = 0.98 • How are the probabilities found? • using a training corpus of hand-tagged text • long & meticulous work done by linguists
HMM Tagging • But HMM tagging tries to find: • the best sequence of tags for a sentence • not just best tag for a single word • goal: maximize the probability of a tag sequence, given a word sequence • i.e. choose the sequence of tags that maximizes P(tag sequence|word sequence)
HMM Tagging (con’t) • By Bayes’ law: P(tagSeq | wordSeq) = P(tagSeq) x P(wordSeq | tagSeq) / P(wordSeq) • wordSeq is given… • so P(wordSeq) will be the same for all tagSeq • so we can drop it from the equation
Assumptions in HMM Tagging • words are independent of each other • Markov assumption (approximation to a short history) • a word’s probability depends only on its tag • ex. with the bigram approximation: P(w1, …, wn, t1, …, tn) ≈ P(t1) x P(t2|t1) x … x P(tn|tn-1) x P(w1|t1) x P(w2|t2) x … x P(wn|tn) • the P(wi|ti) factors are the emission probabilities; the P(ti|ti-1) factors are the state transition probabilities
The derivation
bestTagSeq = argmax P(tagSeq) x P(wordSeq | tagSeq)
           = argmax over (t1, …, tn) of P(t1, …, tn) x P(w1, …, wn | t1, …, tn)
Assumption 1: independence assumption + chain rule
P(t1, …, tn) x P(w1, …, wn | t1, …, tn) = P(tn | t1, …, tn-1) x P(tn-1 | t1, …, tn-2) x P(tn-2 | t1, …, tn-3) x … x P(t1) x P(w1 | t1, …, tn) x P(w2 | t1, …, tn) x P(w3 | t1, …, tn) x … x P(wn | t1, …, tn)
Assumption 2: Markov assumption -- only look at a short history (ex. bigram)
= P(tn | tn-1) x P(tn-1 | tn-2) x P(tn-2 | tn-3) x … x P(t1) x P(w1 | t1, …, tn) x P(w2 | t1, …, tn) x P(w3 | t1, …, tn) x … x P(wn | t1, …, tn)
Assumption 3: a word’s identity only depends on its tag
= P(tn | tn-1) x P(tn-1 | tn-2) x P(tn-2 | tn-3) x … x P(t1) x P(w1 | t1) x P(w2 | t2) x P(w3 | t3) x … x P(wn | tn)
Emission & Transition probabilities • let • N: number of possible tags (size of the tag set) • V: number of word types (vocabulary size) • from a tagged training corpus, we compute the frequency of: • Emission probabilities P(wi | ti) • stored in an N x V matrix • emission[i,j] = probability that tag i is the correct tag for word j • Transition probabilities P(ti | ti-1) • stored in an N x N matrix • transition[i,j] = probability that tag i follows tag j • In practice, these matrices are very sparse • So these models are smoothed to avoid zero probabilities
Emission probabilities P(wi| ti) • stored in an N x V matrix • emission[i,j] = probability/frequency that tag i is the correct tag for word j
Transition probabilities P(ti | ti-1) • stored in an N x N matrix • transition[i,j] = probability/frequency that tag i follows tag j
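A small sketch of how these counts could be collected from a tagged corpus and turned into unsmoothed relative frequencies (function and variable names are illustrative; a real implementation would also smooth the matrices, as noted above):

    from collections import defaultdict

    def train_hmm(tagged_sentences):
        """Estimate P(word|tag) and P(tag|prev_tag) from sentences of (word, tag) pairs."""
        emission_counts = defaultdict(lambda: defaultdict(int))    # tag -> word -> count
        transition_counts = defaultdict(lambda: defaultdict(int))  # previous tag -> tag -> count
        tag_counts = defaultdict(int)

        for sentence in tagged_sentences:
            prev = "<s>"                                 # sentence-start pseudo-tag
            for word, tag in sentence:
                emission_counts[tag][word.lower()] += 1
                transition_counts[prev][tag] += 1
                tag_counts[tag] += 1
                prev = tag

        emission = {t: {w: c / tag_counts[t] for w, c in words.items()}
                    for t, words in emission_counts.items()}
        transition = {p: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                      for p, nxt in transition_counts.items()}
        return emission, transition

    # Tiny usage example on a toy corpus.
    corpus = [[("the", "DT"), ("race", "NN")],
              [("to", "TO"), ("race", "VB")]]
    emission, transition = train_hmm(corpus)
    print(transition["TO"]["VB"])   # 1.0 in this toy corpus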