Parts of Speech Sudeshna Sarkar 7 Aug 2008
Why Do We Care about Parts of Speech? • Pronunciation • Hand me the lead pipe. • Predicting what words can be expected next • Personal pronoun (e.g., I, she) ____________ • Stemming • -s means singular for verbs, plural for nouns • As the basis for syntactic parsing and then meaning extraction • I will lead the group into the lead smelter. • Machine translation • (E) content +N (F) contenu +N • (E) content +Adj (F) content +Adj or satisfait +Adj
What is a Part of Speech? Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things, and maybe Adjective is the class of words for properties of nouns. Consider: green book. Here book is a Noun and green is an Adjective. Now consider: book worm and This green is very soothing. Now book modifies another noun, and green behaves like a Noun, so the semantic definition breaks down.
How Many Parts of Speech Are There? • A first cut at the easy distinctions: • Open classes: • nouns, verbs, adjectives, adverbs • Closed classes: function words • conjunctions: and, or, but • pronouns: I, she, him • prepositions: with, on • determiners: the, a, an
Part of speech tagging • 8 (ish) traditional parts of speech • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. • This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) • Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS • We’ll use POS most frequently • I’ll assume that you all know what these are
POS examples • N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adjective purple, tall, ridiculous • ADV adverb unfortunately, slowly • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those
Tagsets • Brown corpus tagset (87 tags): http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html • Penn Treebank tagset (45 tags): http://www.cs.colorado.edu/~martin/SLP/Figures/ (Figure 8.6) • C7 tagset (146 tags): http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
POS Tagging: Definition • The process of assigning a part-of-speech or lexical class marker to each word in a corpus: • WORDS: the koala put the keys on the table • TAGS: N, V, P, DET
POS Tagging example • WORD/tag: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
POS tagging: Choosing a tagset • There are many parts of speech and many potential distinctions we can draw • To do POS tagging, we need to choose a standard set of tags to work with • Could pick a very coarse tagset • N, V, Adj, Adv • A more commonly used set is finer grained, the “UPenn TreeBank tagset”, with 45 tags • PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist
Using the UPenn tagset • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Prepositions and subordinating conjunctions marked IN (“although/IN I/PRP..”) • Except the preposition/complementizer “to”, which is just marked “TO”.
POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.
Algorithms for POS Tagging • Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags) • Worse, 40% of the tokens are ambiguous.
Algorithms for POS Tagging • Why can’t we just look them up in a dictionary? • Words that aren’t in the dictionary, e.g.: http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc • One idea: P(ti|wi) = the probability that a random hapax legomenon in the corpus has tag ti • Nouns are more likely than verbs, which are more likely than pronouns • Another idea: use morphology.
Algorithms for POS Tagging - Knowledge • Dictionary • Morphological rules, e.g., • _____-tion • _____-ly • capitalization • N-gram frequencies • to _____ • DET _____ N • But what about rare words, e.g., smelt (two verb forms, to smelt ore and the past tense of smell, and one noun form, a small fish) • Combining these • V _____-ing: I was gracking vs. Gracking is fun.
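A minimal sketch of combining these knowledge sources for unknown words. The tag prior and the suffix rules below are invented for illustration, not estimated from any real corpus:

```python
# Sketch: guessing the tag of an out-of-dictionary word.
# The tag prior and suffix rules below are invented for illustration.

# Idea 1: a prior over tags for unknown words, estimated from hapax legomena.
HAPAX_TAG_PRIOR = {"N": 0.6, "V": 0.2, "ADJ": 0.15, "PRO": 0.05}  # assumed values

# Idea 2: morphology: suffixes and capitalization hint at the tag.
def guess_tag(word: str) -> str:
    if word.endswith("tion"):
        return "N"
    if word.endswith("ly"):
        return "ADV"
    if word[0].isupper():
        return "N"  # likely a proper noun
    # Fall back to the most probable tag for a rare word.
    return max(HAPAX_TAG_PRIOR, key=HAPAX_TAG_PRIOR.get)

print(guess_tag("globalization"))  # N
print(guess_tag("quickly"))        # ADV
print(guess_tag("blorp"))          # N (from the prior)
```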
POS Tagging - Approaches • Approaches • Rule-based tagging • (ENGTWOL) • Stochastic (=Probabilistic) tagging • HMM (Hidden Markov Model) tagging • Transformation-based tagging • Brill tagger • Do we return one best answer or several answers and let later steps decide? • How does the requisite knowledge get entered?
3 methods for POS tagging 1. Rule-based tagging • Example: Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon • Basic Idea: • Assign all possible tags to words (a morphological analyzer is used) • Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may be machine-learned)
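A toy sketch of the assign-then-eliminate idea. The lexicon and the single constraint below are invented and far simpler than ENGTWOL's rule set:

```python
# Sketch: rule-based tagging by elimination, in the spirit of Constraint Grammar.
# Lexicon and constraint are toy examples, not from ENGTWOL.

LEXICON = {
    "the": {"DET"},
    "can": {"V", "N", "AUX"},
    "fish": {"N", "V"},
}

def tag(words):
    # Step 1: assign all possible tags from the lexicon.
    candidates = [set(LEXICON.get(w, {"N"})) for w in words]
    # Step 2: eliminate tags that violate constraints.
    for i in range(1, len(words)):
        # Toy constraint: a word right after a determiner is not a verb.
        if candidates[i - 1] == {"DET"} and len(candidates[i]) > 1:
            candidates[i].discard("V")
            candidates[i].discard("AUX")
    return candidates

print(tag(["the", "can"]))   # [{'DET'}, {'N'}]
print(tag(["the", "fish"]))  # [{'DET'}, {'N'}]
```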
3 methods for POS tagging 2. Transformation-based tagging • Example: Brill (1995) tagger - combination of rule-based and stochastic (probabilistic) tagging methodologies • Basic Idea: • Start with a tagged corpus + dictionary (with most frequent tags) • Set the most probable tag for each word as a start value • Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order (like rule-based taggers) • machine learning is used—the rules are automatically induced from a previously tagged training corpus (like stochastic approach)
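A sketch of applying one transformation of this type. The initial tags come from a most-frequent-tag lookup; the single rule below is the classic "change NN to VB after TO" transformation:

```python
# Sketch: transformation-based (Brill-style) tagging. Start from
# most-frequent tags, then apply learned rewrite rules in order.

words = ["expected", "to", "race"]
tags = ["VBN", "TO", "NN"]  # "race" is most often a noun, so it starts as NN

# A transformation: (from_tag, to_tag, condition on the previous tag)
RULES = [("NN", "VB", "TO")]

for from_tag, to_tag, prev in RULES:
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev:
            tags[i] = to_tag

print(list(zip(words, tags)))
# [('expected', 'VBN'), ('to', 'TO'), ('race', 'VB')]
```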
3 methods for POS tagging 3. Stochastic (=Probabilistic) tagging • Example: HMM (Hidden Markov Model) tagging - a training corpus used to compute the probability (frequency) of a given word having a given POS tag in a given context
Hidden Markov Model (HMM) Tagging • Using an HMM to do POS tagging • HMM is a special case of Bayesian inference • It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)
Hidden Markov Model (HMM) Taggers • Goal: maximize P(word|tag) x P(tag|previous n tags) • P(word|tag): word/lexical likelihood • probability that, given this tag, we have this word • NOT the probability that this word has this tag • modeled through a language model (word-tag matrix) • this is the lexical information • P(tag|previous n tags): tag sequence likelihood • probability that this tag follows these previous tags • modeled through a language model (tag-tag matrix) • this is the syntagmatic information
POS tagging as a sequence classification task • We are given a sentence (an “observation” or “sequence of observations”) • Secretariat is expected to race tomorrow • sequence of n words w1…wn. • What is the best sequence of tags which corresponds to this sequence of observations? • Probabilistic/Bayesian view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Getting to HMM • Let T = t1,t2,…,tn • Let W = w1,w2,…,wn • Goal: Out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn: T^ = argmaxT P(T|W) • Hat ^ means “our estimate of the best = the most probable tag sequence” • Argmaxx f(x) means “the x such that f(x) is maximized”; it maximizes our estimate of the best tag sequence
Getting to HMM • This equation is guaranteed to give us the best tag sequence • But how do we make it operational? How do we compute this value? • Intuition of Bayesian classification: • Use Bayes rule to transform it into a set of other probabilities that are easier to compute • Thomas Bayes: British mathematician (1702-1761)
Bayes Rule • Bayes Rule breaks down any conditional probability P(x|y) into three other probabilities: P(x|y) = P(y|x) P(x) / P(y) • P(x|y): the conditional probability of an event x assuming that y has occurred
Bayes Rule • Applying Bayes Rule: T^ = argmaxT P(T|W) = argmaxT P(W|T) P(T) / P(W) = argmaxT P(W|T) P(T) • We can drop the denominator P(W): it does not change across tag sequences; we are looking for the best tag sequence for the same observation, i.e., for the same fixed sequence of words
Likelihood and prior • T^ = argmaxT P(W|T) P(T), where P(W|T) is the likelihood and P(T) is the prior • Further Simplifications: • 1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it: P(W|T) ≈ ∏i=1..n P(wi|ti) • 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag: P(T) ≈ ∏i=1..n P(ti|ti-1) • 3. The most probable tag sequence estimated by the bigram tagger combines 1 and 2
Likelihood and prior: Further Simplifications • 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag: P(T) ≈ ∏i=1..n P(ti|ti-1) • Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams • Bigrams are used as the basis for simple statistical analysis of text • The bigram assumption is related to the first-order Markov assumption
Likelihood and prior: Further Simplifications • 3. The most probable tag sequence estimated by the bigram tagger: T^ = argmaxT ∏i=1..n P(wi|ti) P(ti|ti-1) • This combines the word-likelihood terms with the bigram tag-transition terms under the two assumptions above
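A brute-force sketch of this argmax, enumerating every tag sequence and scoring it with the two probability tables. The tiny tables below are invented for illustration; real taggers use dynamic programming (e.g., Viterbi) rather than enumeration:

```python
from itertools import product

# Toy probability tables (invented numbers, for illustration only).
TAGS = ["DET", "N", "V"]
P_word_given_tag = {
    ("the", "DET"): 0.5, ("dog", "N"): 0.01,
    ("barks", "V"): 0.005, ("barks", "N"): 0.001,
}
P_tag_given_prev = {
    ("START", "DET"): 0.6, ("DET", "N"): 0.7,
    ("N", "V"): 0.4, ("N", "N"): 0.2,
}

def score(words, tags):
    """Product of P(wi|ti) * P(ti|ti-1), per the bigram tagger equation."""
    p, prev = 1.0, "START"
    for w, t in zip(words, tags):
        p *= P_word_given_tag.get((w, t), 0.0) * P_tag_given_prev.get((prev, t), 0.0)
        prev = t
    return p

words = ["the", "dog", "barks"]
best = max(product(TAGS, repeat=len(words)), key=lambda ts: score(words, ts))
print(best, score(words, best))  # ('DET', 'N', 'V') 4.2e-06
```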
Two kinds of probabilities (1) • Tag transition probabilities p(ti|ti-1) • Determiners likely to precede adjs and nouns • That/DT flight/NN • The/DT yellow/JJ hat/NN • So we expect P(NN|DT) and P(JJ|DT) to be high • But we expect P(DT|JJ) to be low
Two kinds of probabilities (1) • Tag transition probabilities p(ti|ti-1) • Compute P(NN|DT) by counting in a labeled corpus: P(NN|DT) = C(DT, NN) / C(DT), the number of times DT is followed by NN, divided by the number of times DT occurs
Two kinds of probabilities (2) • Word likelihood probabilities p(wi|ti) • P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is” • Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ) • If we were expecting a third person singular verb, how likely is it that this verb would be is?
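A sketch of estimating both kinds of probabilities by counting in a labeled corpus. The two-sentence corpus is invented; real estimates need far more data, plus smoothing:

```python
from collections import Counter

# Tiny labeled corpus (invented): sentences as (word, tag) pairs.
corpus = [
    [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN"), ("is", "VBZ"), ("here", "RB")],
]

tag_count = Counter()   # C(t)
transition = Counter()  # C(t_{i-1}, t_i)
emission = Counter()    # C(t, w)

for sent in corpus:
    prev = "<s>"
    for word, t in sent:
        tag_count[t] += 1
        transition[(prev, t)] += 1
        emission[(t, word)] += 1
        prev = t

def p_transition(t, prev):  # P(t|prev) = C(prev, t) / C(prev)
    return transition[(prev, t)] / sum(c for (p, _), c in transition.items() if p == prev)

def p_emission(word, t):    # P(word|t) = C(t, word) / C(t)
    return emission[(t, word)] / tag_count[t]

print(p_transition("NN", "DT"))  # C(DT,NN)/C(DT) = 1/2 = 0.5
print(p_emission("is", "VBZ"))   # C(VBZ,is)/C(VBZ) = 2/2 = 1.0
```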
An Example: the verb “race” • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • How do we pick the right tag?
Disambiguating “race” • P(NN|TO) = .00047 • P(VB|TO) = .83 • The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect a verb/noun given the previous tag TO?’ • P(race|NN) = .00057 • P(race|VB) = .00012 • Lexical likelihoods from the Brown corpus for ‘race’ given the POS tag NN or VB • P(NR|VB) = .0027 • P(NR|NN) = .0012 • Tag sequence probabilities for the likelihood of an adverbial noun (tomorrow/NR) occurring given the previous tag verb or noun • P(VB|TO) P(NR|VB) P(race|VB) = .00000027 • P(NN|TO) P(NR|NN) P(race|NN) = .00000000032 • Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins
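The same arithmetic in code, using the probabilities quoted above from the Brown corpus:

```python
# Disambiguating "race" after "to": compare the two products
# P(tag|TO) * P(NR|tag) * P(race|tag) for tag in {VB, NN}.

p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"VB path: {p_vb:.2e}")  # ~2.7e-07
print(f"NN path: {p_nn:.2e}")  # ~3.2e-10
print("winner:", "VB" if p_vb > p_nn else "NN")  # VB: the verb wins
```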
Hidden Markov Models • What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM) • Let’s just spend a bit of time tying this into the model • In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.
Definitions • A weighted finite-state automaton adds probabilities to the arcs • The probabilities on the arcs leaving any given state must sum to one • A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through • Markov chains can’t represent inherently ambiguous problems • They are useful for assigning probabilities to unambiguous sequences
Markov chain = “First-order observed Markov Model” • a set of states • Q = q1, q2…qN; the state at time t is qt • a set of transition probabilities A = a01 a02 … an1 … ann • Each aij represents the probability of transitioning from state i to state j • The set of these is the transition probability matrix A • Distinguished start and end states • A special initial probability vector π: πi is the probability that the MM will start in state i; each πi expresses the probability p(qi|START)
Markov chain = “First-order observed Markov Model” Markov Chain for weather: Example 1 • three types of weather: sunny, rainy, foggy • we want to find the following conditional probabilities: P(qn|qn-1, qn-2, …, q1) • i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days • We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences • Problem: the larger n is, the more observations we must collect. Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories
Markov chain = “First-order observed Markov Model” • Therefore, we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1, … qn, the current state depends only on the previous state: P(qn|qn-1, …, q1) ≈ P(qn|qn-1) • The joint probability of certain past and current observations then factorizes as P(q1, …, qn) = ∏i=1..n P(qi|qi-1)
Markov chain = “First-order observed Markov Model” • Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy? • Using the Markov assumption and the probabilities in table 1, this translates into: P(q2=sunny, q3=rainy | q1=sunny) = P(sunny|sunny) x P(rainy|sunny)
The weather figure: specific example • [Figure: Markov Chain for weather (Example 2): numbered weather states with initial and transition probabilities; state 3 = rainy]
Markov chain for weather • What is the probability of 4 consecutive rainy days? • Sequence is rainy-rainy-rainy-rainy • i.e., state sequence is 3-3-3-3 • P(3,3,3,3) = π3 x a33 x a33 x a33 = 0.2 x (0.6)^3 = 0.0432
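A small sketch of this computation. The 0.2 initial probability and 0.6 self-transition for the rainy state come from the example above; the rest of the chain is not shown in the slides, so only the rainy entries appear:

```python
# Probability of a state sequence under a first-order Markov chain:
# P(q1..qn) = pi[q1] * product of a[q_{i-1} -> q_i].

pi = {"rainy": 0.2}            # initial probability (from the example)
a = {("rainy", "rainy"): 0.6}  # transition probability (from the example)

def sequence_prob(states):
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[(prev, cur)]
    return p

print(sequence_prob(["rainy"] * 4))  # 0.2 * 0.6**3 = 0.0432
```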
Hidden Markov Model • For Markov chains, the output symbols are the same as the states. • See sunny weather: we’re in state sunny • But in part-of-speech tagging (and other things) • The output symbols are words • But the hidden states are part-of-speech tags • So we need an extension! • A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. • This means we don’t know which state we are in.
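A sketch of the extension as a data structure: the chain's states become hidden tags, and a separate emission table links them to the observed words. All numbers below are invented for illustration:

```python
# Sketch of an HMM for tagging: hidden states are tags, outputs are words.
# A  = transition probabilities between hidden states (as in the Markov chain),
# B  = emission probabilities of words given states (the new ingredient),
# pi = initial state probabilities. All numbers are illustrative.

hmm = {
    "states": ["DET", "N", "V"],  # hidden: we never observe these directly
    "pi": {"DET": 0.6, "N": 0.3, "V": 0.1},
    "A":  {("DET", "N"): 0.7, ("N", "V"): 0.4},
    "B":  {("DET", "the"): 0.5, ("N", "dog"): 0.01, ("V", "barks"): 0.005},
}

# In a Markov chain, observing "rainy" tells us the state is rainy.
# In an HMM, observing "dog" only gives us an emission; the hidden state
# (the tag) must be inferred from pi, A, and B.
obs = ["the", "dog", "barks"]
p = (hmm["pi"]["DET"] * hmm["B"][("DET", "the")]
     * hmm["A"][("DET", "N")] * hmm["B"][("N", "dog")]
     * hmm["A"][("N", "V")] * hmm["B"][("V", "barks")])
print(f"P(obs, DET N V) = {p:.2e}")  # joint probability of one hidden path
```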