GENERAL AND COMPUTATIONAL LINGUISTICS: PART-OF-SPEECH DISAMBIGUATION
POS tagging: the problem • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Problem: assign a tag to race • Requires: tagged corpus
Ambiguity in POS tagging • The: AT • man: NN, VB • still: NN, VB, RB • saw: NN, VBD • her: PPO, PP$
How hard is POS tagging? • In the Brown corpus: • 11.5% of word TYPES are ambiguous • 40% of word TOKENS are ambiguous
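The type/token distinction behind these two figures can be illustrated with a minimal sketch. The tiny tagged list below is a hypothetical toy example, not Brown corpus data, and the function name `ambiguity` is an assumption introduced here.

```python
from collections import defaultdict

# Hypothetical toy tagged corpus of (word, tag) pairs; NOT Brown corpus data.
tagged = [
    ("the", "AT"), ("man", "NN"), ("saw", "VBD"), ("her", "PPO"),
    ("they", "PPSS"), ("man", "VB"), ("the", "AT"), ("saw", "NN"),
]

def ambiguity(tagged_words):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_per_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_per_word[word.lower()].add(tag)
    # A word type is ambiguous if it occurs with more than one tag.
    ambiguous = {w for w, tags in tags_per_word.items() if len(tags) > 1}
    type_share = len(ambiguous) / len(tags_per_word)
    # A token is ambiguous if its word type is ambiguous.
    token_share = sum(w.lower() in ambiguous for w, _ in tagged_words) / len(tagged_words)
    return type_share, token_share

type_share, token_share = ambiguity(tagged)
print(type_share, token_share)
```

Ambiguous words tend to be frequent ones, which is why the share of ambiguous tokens is typically higher than the share of ambiguous types.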
Frequency + Context • Both the Brill tagger and HMM-based taggers achieve good results by combining • FREQUENCY • I poured FLOUR/NN into the bowl. • Peter should FLOUR/VB the baking tray. • Information about CONTEXT • I saw the new/JJ PLAY/NN in the theater. • The boy will/MD PLAY/VB in the garden.
The importance of context • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Choosing a tagset • The choice of tagset greatly affects the difficulty of the problem • Need to strike a balance between • Getting better information about context (best: introduce more distinctions) • Making it possible for classifiers to do their job (need to minimize distinctions)
Some of the best-known Tagsets • Brown corpus: 87 tags • Penn Treebank: 45 tags • Lancaster UCREL C5 (used to tag the BNC): 61 tags • Lancaster C7: 145 tags
Tagsets for Italian • PAROLE • Si-TAL (Pisa, Venezia, IRST, ....) • TEXTPRO (later)
POS tags in the Brown corpus • Television/NN has/HVZ yet/RB to/TO work/VB out/RP a/AT living/VBG arrangement/NN with/IN jazz/NN ,/, which/WDT comes/VBZ to/IN the/AT medium/NN more/QL as/CS an/AT uneasy/JJ guest/NN than/CS as/CS a/AT relaxed/VBN member/NN of/IN the/AT family/NN ./.
SGML-based POS in the BNC <div1 complete=y org=seq> <head> <s n=00040> <w NN2>TROUSERS <w VVB>SUIT </head> <caption> <s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine <w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser <w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s <w AJ0>soft <w NN2>pastels<c PUN>. <s n=00042> <w NP0>Smart <w CJC>and <w AJ0>acceptable <w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but <w AJ0>soft <w AV0>enough <w PRP>for <w AJ0>relaxed <w NN2>days </caption>
Quick test DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.
POS TAGGED CORPORA IN NLTK
>>> import nltk
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
Exploring tagged corpora • Ch.5, p. 184-189
OTHER POS-TAGGED CORPORA • NLTK: • WAC Corpora: • English: UKWAC • Italian: ITWAC
Markov Model POS tagging • Again, the problem is to find an 'explanation' with the highest probability: • As in the lecture on text classification, this can be 'turned around' using Bayes' Rule:
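A sketch of the two formulas in the standard notation for this model. The symbols are assumptions, not taken from the slide: w_1^n stands for the word sequence w_1 ... w_n and t_1^n for a candidate tag sequence t_1 ... t_n.

```latex
% Most probable tag sequence given the words
\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)

% The same quantity rewritten with Bayes' Rule
\hat{t}_1^n = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
```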
Combining frequency and contextual information • As in the case of spelling, this equation can be simplified: • As we will see, once further simplifications are applied, this equation will encode both FREQUENCY and CONTEXT INFORMATION
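The simplification most likely intended here is dropping the denominator: P(w_1^n) is the same for every candidate tag sequence, so it does not affect which sequence maximizes the expression. A sketch, assuming w_1^n denotes the word sequence and t_1^n a candidate tag sequence (notation introduced for illustration):

```latex
\hat{t}_1^n
  = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
  = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
```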
Three further assumptions • MARKOV assumption: a tag only depends on a FIXED NUMBER of previous tags (here, assume bigrams) • Simplify second factor • INDEPENDENCE assumption: words are independent from each other. • A word’s identity only depends on its own tag • Simplify first factor
The final equations • CONTEXT: the tag-transition term • FREQUENCY: the lexical (word-given-tag) term
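A sketch of how the final equations are usually written for a bigram model. The notation is an assumption: w_i is the i-th word, t_i its tag, and t_0 a sentence-start marker.

```latex
% Independence assumption: each word depends only on its own tag (first factor)
P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)

% Markov (bigram) assumption: each tag depends only on the previous tag (second factor)
P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

% Putting the two together
\hat{t}_1^n = \arg\max_{t_1^n} \prod_{i=1}^{n}
    \underbrace{P(w_i \mid t_i)}_{\text{FREQUENCY}}\;
    \underbrace{P(t_i \mid t_{i-1})}_{\text{CONTEXT}}
```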
Estimating the probabilities Can be done using Maximum Likelihood Estimation as usual, for BOTH probabilities:
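With Maximum Likelihood Estimation, both probabilities are ratios of counts from a tagged corpus: P(t_i | t_i-1) = C(t_i-1, t_i) / C(t_i-1) and P(w_i | t_i) = C(t_i, w_i) / C(t_i). A minimal sketch follows, assuming a hypothetical four-sentence toy corpus and a `<s>` pseudo-tag for sentence starts. The helper names `p_transition` and `p_emission` are introduced here for illustration, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
sentences = [
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("run", "VB")],
    [("the", "DT"), ("run", "NN")],
]

START = "<s>"  # assumed pseudo-tag marking the beginning of a sentence

tag_count = Counter()       # C(t)
bigram_count = Counter()    # C(t_prev, t)
emission_count = Counter()  # C(t, w)

for sent in sentences:
    prev = START
    tag_count[START] += 1
    for word, tag in sent:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emission_count[(tag, word)] += 1
        prev = tag

def p_transition(tag, prev):
    """MLE of P(tag | prev) = C(prev, tag) / C(prev)."""
    return bigram_count[(prev, tag)] / tag_count[prev]

def p_emission(word, tag):
    """MLE of P(word | tag) = C(tag, word) / C(tag)."""
    return emission_count[(tag, word)] / tag_count[tag]

print(p_transition("VB", "TO"), p_emission("race", "VB"))
```

Unseen events get probability 0 under plain MLE, so a real tagger would normally add some form of smoothing.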
An example of tagging with Markov Models • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Problem: assign a tag to race given the subsequences • to/TO race/??? • the/DT race/??? • Solution (for to/TO race/???): choose the tag with the greater of these two probabilities: • P(VB|TO) P(race|VB) • P(NN|TO) P(race|NN)
Tagging with MMs (2) • Actual estimates from the Switchboard corpus: • LEXICAL FREQUENCIES: • P(race|NN) = .00041 • P(race|VB) = .00003 • CONTEXT: • P(NN|TO) = .021 • P(VB|TO) = .34 • The probabilities: • P(VB|TO) P(race|VB) ≈ .00001 • P(NN|TO) P(race|NN) ≈ .000009 • So race is tagged VB after to/TO
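The comparison can be reproduced directly from the four estimates quoted on the slide. A minimal sketch (the variable names are illustrative): multiplying the rounded estimates gives about 1.02e-5 for VB and about 8.6e-6 for NN, so VB wins.

```python
# Estimates quoted on the slide (Switchboard corpus).
lexical = {"NN": 0.00041, "VB": 0.00003}  # P(race | tag)
context = {"NN": 0.021, "VB": 0.34}       # P(tag | TO)

# Score of each candidate tag for "race" after "to/TO": context * frequency.
scores = {tag: context[tag] * lexical[tag] for tag in lexical}
best_tag = max(scores, key=scores.get)

for tag, score in scores.items():
    print(f"P({tag}|TO) * P(race|{tag}) = {score:.2e}")
print("chosen tag:", best_tag)
```

The lexical frequency alone favours NN; it is the context term P(VB|TO) that tips the decision towards VB.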
Computing the most likely sequence of tags • In general, computing the most likely sequence t1 .. tn by enumerating all candidate tag sequences has exponential complexity • It can, however, be solved in polynomial time with DYNAMIC PROGRAMMING: the VITERBI ALGORITHM (Viterbi, 1967) • (Also called a TRELLIS ALGORITHM)
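A minimal sketch of the Viterbi algorithm for a bigram tag model, assuming the probabilities are given as plain dictionaries and the sentence is non-empty. The function signature, the dictionary layout, and the values P(TO | start) = 1 and P(to | TO) = 1 are assumptions made for illustration; the remaining numbers are the Switchboard estimates for race. The running time is proportional to n times the square of the number of tags.

```python
def viterbi(words, tags, p_start, p_trans, p_emit):
    """Most probable tag sequence for a non-empty `words` list.

    p_start[t]      : P(t | sentence start)
    p_trans[(s, t)] : P(t | s), where s is the previous tag
    p_emit[(t, w)]  : P(w | t)
    Missing entries are treated as probability 0.
    """
    # best[i][t]: probability of the best tag path for words[:i+1] ending in t
    best = [{t: p_start.get(t, 0.0) * p_emit.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: best[i - 1][s] * p_trans.get((s, t), 0.0))
            best[i][t] = (best[i - 1][prev] * p_trans.get((prev, t), 0.0)
                          * p_emit.get((t, words[i]), 0.0))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["TO", "VB", "NN"]
p_start = {"TO": 1.0}                              # assumed for illustration
p_trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
p_emit = {("TO", "to"): 1.0,                       # assumed for illustration
          ("VB", "race"): 0.00003, ("NN", "race"): 0.00041}

result = viterbi(["to", "race"], tags, p_start, p_trans, p_emit)
print(result)
```

A practical implementation would work with log probabilities to avoid numerical underflow on long sentences.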
POS TAGGING IN NLTK
DEFAULT POS TAGGER: nltk.pos_tag
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
TEXTPRO • The most widely used NLP tool for Italian • http://textpro.fbk.eu/ • Demo
READINGS • Bird et al, chapter 5, chapter 6.1