
Word Bi-grams and PoS Tags



  1. School of Computing FACULTY OF ENGINEERING Word Bi-grams and PoS Tags COMP3310 Natural Language Processing Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

  2. Reminder • FreqDist counts of tokens and their distribution can be useful • Eg find main characters in Gutenberg texts • Eg compare word-lengths in different languages • Humans can predict the next word … • N-gram models are based on counts in a large corpus • Auto-generate a story ... (but it gets stuck in a local maximum) • Grammatical trends: modal verb distribution predicts genre
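
A minimal NLTK sketch (not from the slides) of the kind of FreqDist counting this reminder refers to; the file name 'austen-emma.txt' and the variable names are illustrative, and it assumes the Gutenberg sample data has been downloaded.

  from nltk import FreqDist
  from nltk.corpus import gutenberg

  emma = gutenberg.words('austen-emma.txt')

  # Frequent capitalised tokens are a rough way to spot main characters
  names = FreqDist(w for w in emma if w.istitle() and w.isalpha())
  print(names.most_common(10))

  # Word-length distribution: the kind of profile that can be compared across languages
  lengths = FreqDist(len(w) for w in emma if w.isalpha())
  print(lengths.most_common(5))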

  3. Why do puns make us groan? • He drove his expensive car into a tree and found out how the Mercedes bends. • Isn't the Grand Canyon just gorges? • Time flies like an arrow. Fruit flies like a banana.

  4. Predicting Next Words • One reason puns make us groan is that they play on our assumptions about what the next word will be – human language processing involves predicting the most probable next word • They also exploit • homonymy – same sound, different spelling and meaning (bends, Benz; gorges, gorgeous) • polysemy – same spelling, different meanings • NLP programs can also make use of word-sequence modeling

  5. Auto-generate a Story • How to fix this? Use a random number generator.

  6. Auto-generate a Story • The choice() function picks one item at random from a list (from random import *)
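
The generation code itself appears only as an image on the slides; the sketch below is a hedged reconstruction of the idea it describes – bigram counts plus a random choice() – so that generation no longer gets stuck on the single most frequent successor. The function name generate_story and the corpus file are illustrative.

  from random import choice
  from nltk import ConditionalFreqDist, bigrams
  from nltk.corpus import gutenberg

  # Bigram counts: for each word, how often each successor follows it
  cfd = ConditionalFreqDist(bigrams(gutenberg.words('austen-emma.txt')))

  def generate_story(start, length=20):
      word, out = start, [start]
      for _ in range(length):
          successors = list(cfd[word])   # all words seen after 'word'
          if not successors:
              break
          word = choice(successors)      # random pick avoids the local maximum
          out.append(word)
      return ' '.join(out)

  print(generate_story('The'))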

  7. Part-of-Speech Tagging: Terminology • Tagging • The process of associating labels with each token in a text, using an algorithm to select a tag for each word, eg • Hand-coded rules • Statistical taggers • Brill (transformation-based) tagger • Hybrid tagger: combination, eg by “vote” • Tags • The labels • Tag Set • The collection of tags used for a particular task, eg Brown or LOB tagset Modified from Diane Litman's version of Steve Bird's notes
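
As a quick illustration (not from the slides) of what a statistical tagger produces, NLTK's built-in tagger can be run on a short sentence; this assumes the tokenizer and tagger models have already been fetched with nltk.download().

  import nltk

  tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
  print(nltk.pos_tag(tokens))
  # e.g. [('They', 'PRP'), ('refuse', 'VBP'), ..., ('refuse', 'NN'), ('permit', 'NN')]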

  8. Example from the GENIA corpus • Typically a tagged text is a sequence of white-space separated word/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
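
Such word/tag strings can be turned into (word, tag) pairs with nltk.tag.str2tuple, which splits each token at its final '/'; a small sketch using the GENIA sentence above:

  from nltk.tag import str2tuple

  tagged_text = ("These/DT findings/NNS should/MD be/VB useful/JJ for/IN "
                 "therapeutic/JJ strategies/NNS and/CC the/DT development/NN "
                 "of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN "
                 "costimulatory/NN pathway/NN ./.")

  pairs = [str2tuple(t) for t in tagged_text.split()]
  print(pairs[:4])   # [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD'), ('be', 'VB')]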

  9. What does Tagging do? • Collapses Distinctions • Lexical identity may be discarded • e.g., all personal pronouns tagged with PRP • Introduces Distinctions • Ambiguities may be resolved • e.g. deal tagged with NN or VB • Helps in classification and prediction Modified from Diane Litman's version of Steve Bird's notes

  10. Significance of Parts of Speech • A word’s POS tells us a lot about the word and its neighbors: • Limits the range of meanings (deal), pronunciation (OBject vs obJECT) or both (wind) • Helps in stemming • Limits the range of following words • Can help select nouns from a document for summarization • Basis for partial parsing (chunked parsing) • Parsers can build trees directly on the POS tags instead of maintaining a lexicon Modified from Diane Litman's version of Steve Bird's notes

  11. Choosing a tagset • The choice of tagset greatly affects the difficulty of the problem • Need to strike a balance between • Getting better information about context • Making it possible for classifiers to do their job Slide modified from Massimo Poesio's

  12. Some of the best-known Tagsets • Brown corpus: 87 tags • (more when tags are combined, eg isn’t) • LOB corpus: 132 tags • Penn Treebank: 45 tags • Lancaster UCREL C5 (used to tag the BNC): 61 tags • Lancaster C7: 145 tags Slide modified from Massimo Poesio's

  13. The Brown Corpus • An early digital corpus (1961) • Francis and Kucera, Brown University • Contents: 500 texts, each 2000 words long • From American books, newspapers, magazines • Representing genres: • Science fiction, romance fiction, press reportage, scientific writing, popular lore Modified from Diane Litman's version of Steve Bird's notes

  14. help(nltk.corpus.brown)
  >>> help(nltk.corpus.brown)
   |  paras(self, fileids=None, categories=None)
   |  raw(self, fileids=None, categories=None)
   |  sents(self, fileids=None, categories=None)
   |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
   |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
   |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
   |  words(self, fileids=None, categories=None)

  15. nltk.corpus.brown
  >>> nltk.corpus.brown.words()
  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
  >>> nltk.corpus.brown.tagged_words()
  [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
  >>> nltk.corpus.brown.tagged_sents()
  [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …

  16. Penn Treebank • First large syntactically annotated corpus • 1 million words from the Wall Street Journal • Part-of-speech tags and syntax trees Modified from Diane Litman's version of Steve Bird's notes

  17. help(nltk.corpus.treebank)
   |  parsed(*args, **kwargs)
   |      @deprecated: Use .parsed_sents() instead.
   |  parsed_sents(self, files=None)
   |  raw(self, files=None)
   |  read(*args, **kwargs)
   |      @deprecated: Use .raw() or .sents() or .tagged_sents() or .parsed_sents() instead.
   |  sents(self, files=None)
   |  tagged(*args, **kwargs)
   |      @deprecated: Use .tagged_sents() instead.
   |  tagged_sents(self, files=None)
   |  tagged_words(self, files=None)
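
Sample usage, assuming the NLTK treebank sample has been downloaded (note that recent NLTK versions use the keyword fileids rather than files):

  from nltk.corpus import treebank

  print(treebank.tagged_words()[:6])
  # [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ')]
  print(treebank.parsed_sents()[0])   # bracketed syntax tree for the first WSJ sentence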

  18. How hard is POS tagging? • In the Brown corpus, 12% of word types are ambiguous, but 40% of word tokens are ambiguous – the ambiguous words tend to be common ones Slide modified from Massimo Poesio's
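
Figures like these can be estimated directly from the tagged Brown corpus; below is a sketch (the exact percentages depend on case-folding and how compound tags are treated).

  from nltk import ConditionalFreqDist
  from nltk.corpus import brown

  tagged = brown.tagged_words()
  # For each word type, the set of tags it occurs with
  cfd = ConditionalFreqDist((w.lower(), t) for (w, t) in tagged)

  ambiguous = [w for w in cfd.conditions() if len(cfd[w]) > 1]
  print(len(ambiguous) / len(cfd.conditions()))             # share of ambiguous word types
  print(sum(cfd[w].N() for w in ambiguous) / len(tagged))   # share of ambiguous word tokens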

  19. Tagging with lexical frequencies • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Problem: assign a tag to race given its lexical frequency • Solution: choose the tag that has the greater probability • P(race|VB) • P(race|NN) • Actual estimates from the Switchboard corpus: • P(race|NN) = .00041 • P(race|VB) = .00003 • This suggests we should always tag race/NN (correct 41/44 = 93% of the time) Modified from Massimo Poesio's lecture
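
A hedged sketch of where such estimates come from, computed here from the Brown corpus rather than Switchboard, so the exact numbers will differ:

  from nltk import ConditionalFreqDist
  from nltk.corpus import brown

  # P(word | tag) estimated as count(word with tag) / count(tag)
  cfd = ConditionalFreqDist((tag, word.lower()) for (word, tag) in brown.tagged_words())

  p_race_nn = cfd['NN'].freq('race')
  p_race_vb = cfd['VB'].freq('race')
  print(p_race_nn, p_race_vb)

  # The lexical-probability-only tagger always assigns the likelier tag
  best_tag = 'NN' if p_race_nn > p_race_vb else 'VB'
  print(best_tag)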

  20. Reminder • Puns play on our assumptions about the next word… • … eg they present us with an unexpected homonym (bends) • ConditionalFreqDist() counts word pairs: word bigrams • Used for story generation, speech recognition, … • Parts of Speech: groups words into grammatical categories • … and separates different functions of a word • In English, many words are ambiguous: 2 or more PoS tags • Very simple tagger: choose by lexical probability (only) • Better PoS taggers: to come…
