1 / 20

Lecture 9 NLTK POS Tagging Part 2

Lecture 9 NLTK POS Tagging Part 2. CSCE 771 Natural Language Processing. Topics Taggers Rule Based Taggers Probabilistic Taggers Transformation Based Taggers - Brill Supervised learning Readings: Chapter 5.4-?. February 3, 2011. Overview. Last Time Overview of POS Tags Today

ozzy
Download Presentation

Lecture 9 NLTK POS Tagging Part 2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 9 NLTK POS Tagging Part 2 CSCE 771 Natural Language Processing • Topics • Taggers • Rule Based Taggers • Probabilistic Taggers • Transformation Based Taggers - Brill • Supervised learning • Readings: Chapter 5.4-? February 3, 2011

  2. Overview • Last Time • Overview of POS Tags • Today • Part of Speech Tagging • Parts of Speech • Rule Based taggers • Stochastic taggers • Transformational taggers • Readings • Chapter 5.4-5.?

  3. Table 5.1: Simplified Part-of-Speech Tagset

  4. Rank tags from most to least common • >>> from nltk.corpus import brown • >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) • >>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged) • >>> print tag_fd.keys() • ['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]

  5. What Tags Precede Nouns? • >>> word_tag_pairs = nltk.bigrams(brown_news_tagged) • >>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N')) • ['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]

  6. Most common Verbs • >>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True) • >>> word_tag_fd = nltk.FreqDist(wsj) • >>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')] • ['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

  7. Rank Tags for words using CFDs • word as a condition and the tag as an event • >>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True) • >>> cfd1 = nltk.ConditionalFreqDist(wsj) • >>> print cfd1['yield'].keys() • ['V', 'N'] • >>> print cfd1['cut'].keys() • ['V', 'VD', 'N', 'VN']

  8. Tags and counts for the word cut • print "ranked tags for the word cut" • cut_tags=cfd1['cut'].keys() • print "Counts for cut" • for c in cut_tags: • print c, cfd1['cut'][c] • ranked tags for the word cut • Counts for cut • V 12 • VD 10 • N 3 • VN 3

  9. P(W | T) – Flipping it around • >>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj) • >>> print cfd2['VN'].keys() • ['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]

  10. List of words for which VD and VN are both events • list1=[w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]] • print list1

  11. Print the 4 word/tag pairs before kicked/VD • idx1 = wsj.index(('kicked', 'VD')) • print wsj[idx1-4:idx1+1]

  12. Table 2.4

  13. Example 5.2 (code_findtags.py) • deffindtags(tag_prefix, tagged_text): • cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text • if tag.startswith(tag_prefix)) • return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions()) • >>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news')) • >>> for tag in sorted(tagdict): • ... print tag, tagdict[tag] • ...

  14. NN ['year', 'time', 'state', 'week', 'home'] • NN$ ["year's", "world's", "state's", "city's", "company's"] • NN$-HL ["Golf's", "Navy's"] • NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"] • NN-HL ['Question', 'Salary', 'business', 'condition', 'cut'] • NN-NC ['aya', 'eva', 'ova'] • NN-TL ['President', 'House', 'State', 'University', 'City'] • NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner'] • NNS ['years', 'members', 'people', 'sales', 'men'] • NNS$ ["children's", "women's", "janitors'", "men's", "builders'"] • NNS$-HL ["Dealers'", "Idols'"]

  15. words following often • import nltk • from nltk.corpus import brown • print "For the Brown Tagged Corpus category=learned" • brown_learned_text = brown.words(categories='learned') • print "sorted words following often" • print sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))

  16. brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True) • tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often'] • fd = nltk.FreqDist(tags) • print fd.tabulate() • VN V VD ADJ DET ADV P , CNJ . TO VBZ VG WH • 15 12 8 5 5 4 4 3 3 1 1 1 1 1

  17. highly ambiguous words • >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) • >>> data = nltk.ConditionalFreqDist((word.lower(), tag) ... for (word, tag) in brown_news_tagged) • >>> for word in data.conditions(): • ... if len(data[word]) > 3: • ... tags = data[word].keys() • ... print word, ' '.join(tags) • ... • best ADJ ADV NP V • better ADJ ADV V DET • ….

  18. Tag Package • http://nltk.org/api/nltk.tag.html#module-nltk.tag

More Related