Part of Speech (POS) Tagging Lab
CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005
Examples taken from the Bird, Klein and Loper: NLTK Tutorial, Tagging, nltk.sourceforge.net/tutorial/tagging/index.html
CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

Simple Taggers
• Three simple taggers in NLTK
  • Default tagger
  • Regular expression tagger
  • Unigram tagger
• All start with tokenized text.
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT="John saw 3 polar bears .")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> print text_token
<[<John>, <saw>, <3>, <polar>, <bears>, <.>]>
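
For readers without the 2005 NLTK release, the tokenization step above can be approximated in plain Python. This is only a sketch of the idea (split on whitespace) and is not the NLTK API; the variable names are chosen for illustration.

```python
# Plain-Python stand-in for whitespace tokenization (not the NLTK API).
text = "John saw 3 polar bears ."

# str.split() with no arguments splits on runs of whitespace.
tokens = text.split()

print(tokens)  # ['John', 'saw', '3', 'polar', 'bears', '.']
```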

Default Tagger
• Assigns the same tag to every token.
• We create an instance of the tagger and give it the desired tag.
>>> from nltk.tagger import *
>>> my_tagger = DefaultTagger('nn')
>>> my_tagger.tag(text_token)
>>> print text_token
<[<John/nn>, <saw/nn>, <3/nn>, <polar/nn>, <bears/nn>, <./nn>]>
• We’ve just labeled everything as a noun.
• 20-30% accuracy (terrible), but useful as an adjunct to other taggers.
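
The behaviour of a default tagger is simple enough to sketch in a few lines of plain Python. The function name `default_tag` and the (word, tag) pair representation are assumptions made for this illustration; they are not part of NLTK.

```python
def default_tag(tokens, tag="nn"):
    """Pair every token with the same tag, regardless of the token."""
    return [(tok, tag) for tok in tokens]


tokens = "John saw 3 polar bears .".split()
tagged = default_tag(tokens)

print(tagged)
```

Even the number and the full stop come back tagged as nouns, which is why this tagger is only useful as a fallback.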

Regular Expression Tagger
• Takes a list of regular expressions and the tags to assign when they match.
>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
>>> print text_token
<[<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>]>
• This tags cardinal numbers as cd and everything else as nouns (nn).
• Still pretty poor, but may be a useful step in conjunction with other taggers.
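
A minimal plain-Python sketch of the same idea, using the standard `re` module: try each pattern in order and assign the tag of the first one that matches. The names `PATTERNS` and `regexp_tag` are hypothetical, and the decimal point is escaped here so that it matches only a literal ".".

```python
import re

# (pattern, tag) pairs, tried in order; the first match wins.
PATTERNS = [
    (re.compile(r"^[0-9]+(\.[0-9]+)?$"), "cd"),  # cardinal numbers such as 3 or 3.5
    (re.compile(r".*"), "nn"),                   # catch-all: everything else is a noun
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for tok in tokens:
        for pattern, tag in patterns:
            if pattern.match(tok):
                tagged.append((tok, tag))
                break
        else:
            # Only reached if no pattern matched (impossible with the catch-all).
            tagged.append((tok, None))
    return tagged


tagged = regexp_tag("John saw 3 polar bears .".split())
print(tagged)
```

Pattern order matters: if the catch-all came first, no token would ever reach the number pattern.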

Unigram Tagger
• Assigns each word its most frequent tag.
• Must be trained to determine the tag frequencies.
• Assigns None as the tag of any word not seen in the training set.
• About 90% accurate.
• Example training case (from the Brown corpus):
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

Train the Unigram Tagger
>>> from nltk.tagger import *
>>> from nltk.corpus import brown
>>> # Tokenize ten texts from the Brown Corpus
>>> train_tokens = []
>>> for item in brown.items()[:10]:
...     train_tokens.append(brown.read(item))
>>> # Initialise and train a unigram tagger
>>> mytagger = UnigramTagger(SUBTOKENS='WORDS')
>>> for tok in train_tokens:
...     mytagger.train(tok)

And Then Tag New Text
>>> text_token = Token(TEXT="John saw the book on the table")
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> mytagger.tag(text_token)
>>> print text_token
<[<John/np>, <saw/vbd>, <the/at>, <book/None>, <on/in>, <the/at>, <table/nn>]>
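
The train-then-tag flow can be sketched without NLTK: count how often each tag occurs with each word, keep the most frequent tag per word, and return None for unseen words. The functions `train_unigram` and `unigram_tag` and the tiny hand-made training set are assumptions for illustration only; the real lab trains on the Brown corpus.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Return a dict mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def unigram_tag(model, tokens):
    """Tag each token with its learned tag, or None if the word is unseen."""
    return [(tok, model.get(tok)) for tok in tokens]


# Hand-made training data (hypothetical, not taken from the Brown corpus).
training = [
    [("John", "np"), ("saw", "vbd"), ("the", "at"), ("table", "nn")],
    [("the", "at"), ("saw", "nn"), ("on", "in"), ("the", "at"), ("table", "nn")],
    [("Mary", "np"), ("saw", "vbd"), ("John", "np")],
]

model = train_unigram(training)
tagged = unigram_tag(model, "John saw the book on the table".split())
print(tagged)
```

Here "saw" is seen twice as vbd and once as nn, so it is tagged vbd, and "book" never appears in training, so it is tagged None.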

Testing a Tagger
• So how well does the tagger do?
• Split up the inputs into training and testing sets.
>>> train_tokens = []
>>> for item in brown.items()[:10]:      # texts 0-9
...     train_tokens.append(brown.read(item))
>>> unseen_tokens = []
>>> for item in brown.items()[10:12]:    # texts 10-11
...     unseen_tokens.append(brown.read(item))

Train And Test
>>> for tok in train_tokens:
...     mytagger.train(tok)
>>> acc = tagger_accuracy(mytagger, unseen_tokens)
>>> print 'Accuracy = %4.1f%%' % (100 * acc)
Accuracy = 64.6%
• Well below the roughly 90% quoted for unigram taggers, likely because ten texts is a small training set: many words in the test texts are unseen and are tagged None.
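
Accuracy here is simply the fraction of tokens whose predicted tag equals the gold-standard tag. The sketch below shows that calculation in plain Python; `accuracy`, `lookup_tagger`, `LEXICON` and the two gold sentences are hypothetical stand-ins, not NLTK's `tagger_accuracy` or Brown corpus data.

```python
def accuracy(tagger, gold_sentences):
    """Fraction of tokens for which the tagger's tag equals the gold tag."""
    correct = 0
    total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tagger(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    return correct / total if total else 0.0


# A toy tagger: look each word up in a fixed lexicon, None if missing.
LEXICON = {"the": "at", "saw": "vbd", "John": "np"}


def lookup_tagger(words):
    return [(w, LEXICON.get(w)) for w in words]


# Held-out gold-standard sentences (hand-made for this example).
gold = [
    [("John", "np"), ("saw", "vbd"), ("the", "at"), ("book", "nn")],
    [("the", "at"), ("saw", "nn")],
]

acc = accuracy(lookup_tagger, gold)
print("Accuracy = %4.1f%%" % (100 * acc))  # Accuracy = 66.7%
```

Two of the six tokens are wrong: "book" is unseen and gets None, and "saw" is a noun in the second sentence but the lexicon only knows the verb reading.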

More in NLTK
• Error analysis
• Higher-order taggers
  • Bigram
  • Nth-order
• Combining taggers
• Brill tagger
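
One way to combine the three simple taggers is backoff: try the most specific tagger first and fall back to the next one only when it has no answer. The sketch below is plain Python under that assumption; `make_backoff_tagger` and its arguments are hypothetical names and do not reproduce NLTK's own mechanism for combining taggers.

```python
import re


def make_backoff_tagger(lexicon, patterns, default_tag):
    """Build a tagger: word lookup first, then regex patterns, then a default tag."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]

    def tag_tokens(tokens):
        tagged = []
        for tok in tokens:
            tag = lexicon.get(tok)          # 1. unigram-style lookup
            if tag is None:
                # 2. first matching pattern, else 3. the default tag
                tag = next((t for pat, t in compiled if pat.match(tok)),
                           default_tag)
            tagged.append((tok, tag))
        return tagged

    return tag_tokens


# Hand-made lexicon standing in for a trained unigram model.
lexicon = {"John": "np", "saw": "vbd", "the": "at", "on": "in", "table": "nn"}
patterns = [(r"^[0-9]+(\.[0-9]+)?$", "cd")]

backoff_tag = make_backoff_tagger(lexicon, patterns, "nn")
result = backoff_tag("John saw 3 books on the table".split())
print(result)
```

With this arrangement no token is left tagged None: "3" falls through to the number pattern and "books" falls through to the default noun tag.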

For Lab/Homework
• Complete the tagger tutorial from the NLTK tutorial page.
• Tutorial exercises 1, 3, 4, 5 and 10.
• 8.2 (we will compare next time)
• 8.9 (using the NLTK and any higher-order tagger)