CSA2050: Natural Language Processing
Tagging 1
• Tagging
• POS and Tagsets
• Ambiguities
• NLTK
CSA3050: Tagging I

Tagging 1 Lecture
• Slides based on Mike Rosner and Marti Hearst notes
• Diane Litman’s version of Steven Bird’s notes
• Additions from NLTK tutorials

Tagging
Mr. Sherlock Holmes, who was usually very X, …
What is the part of speech of X?

Tagging
Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y
What is the part of speech of Y?

Tagging
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table

Tagging Terminology
• Tagging
    • The process of associating labels with each token in a text
• Tags
    • The labels
• Tag Set
    • The collection of tags used for a particular task

Tagging Example
Typically a tagged text is a sequence of white-space separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.

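The base/tag format above is easy to read back into a program. The following is a minimal sketch in plain Python, not NLTK's own reader; the function name `parse_tagged_text` and the short sample string are assumptions made for illustration.

```python
def parse_tagged_text(text):
    """Split whitespace-separated base/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that a word containing "/" stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: keep the token as the word and leave it untagged.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# A shortened version of the example sentence on the slide.
sample = "Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn ./."
tagged = parse_tagged_text(sample)
print(tagged)
```

Splitting on the last slash matters for punctuation tokens such as `./.`, where both the word and the tag are the same character.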
What does tagging do?
• Collapses some distinctions
    • Lexical identity may be discarded
    • e.g. all personal pronouns tagged with PRP
• …but introduces others
    • Ambiguities may be removed
    • e.g. deal tagged with NN or VB
    • e.g. deal tagged with DEAL1 or DEAL2
• Helps classification and prediction

Parts of Speech (POS)
• A word’s POS tells us a lot about the word and its neighbors:
    • Limits the range of meanings (deal), pronunciation (OBject as a noun vs. obJECT as a verb) or both (wind)
    • Helps in stemming
    • Limits the range of following words for Speech Recognition
    • Can help select nouns from a document for IR
    • Basis for partial parsing (chunked parsing)
    • Parsers can build trees directly on the POS tags instead of maintaining a lexicon

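As one illustration of the IR use case above, nouns can be pulled out of a tagged sentence by filtering on the tag. This is a rough sketch, assuming Brown-style lowercase tags in which common-noun tags start with `nn`; the function name `select_nouns` is hypothetical.

```python
def select_nouns(tagged_tokens, noun_prefix="nn"):
    """Return the words whose tag starts with the given noun prefix."""
    return [
        word
        for word, tag in tagged_tokens
        if tag.lower().startswith(noun_prefix)
    ]


# (word, tag) pairs taken from the tagged example earlier in the slides.
tagged_sentence = [
    ("Its", "pp"), ("rotunda", "nn"), ("forms", "vbz"),
    ("a", "at"), ("perfect", "jj"), ("circle", "nn"), (".", "."),
]
index_terms = select_nouns(tagged_sentence)
print(index_terms)
```

Proper nouns carry a different tag (`np` in the example text), so a real indexer would probably pass a wider set of prefixes.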
POS and Tagsets
• The choice of tagset greatly affects the difficulty of the problem
• Need to strike a balance between
    • Getting better information about context (best: introduce more distinctions)
    • Making it possible for classifiers to do their job (need to minimize distinctions)

Common Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the British National Corpus, BNC): 61 tags
• Lancaster C7: 145 tags

Brown Corpus
• The first digital corpus (1961)
    • Francis and Kucera, Brown University
• Contents: 500 texts, each 2000 words long
    • From American books, newspapers, magazines
    • Representing genres:
        • Science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank
• First syntactically annotated corpus
• 1 million words from Wall Street Journal
• Part of speech tags and syntax trees

Penn Treebank
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB   DT   NN     .
Book that flight .

VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?

Penn Treebank – Important Tags

Penn Treebank – Verb Tags

Penn Treebank Example
(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))

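The part-of-speech tags sit at the innermost brackets of such a tree, so they can be read off without building the full structure. The sketch below is a simple regular-expression approach and not the Treebank's or NLTK's reader; the names `PRETERMINAL` and `tagged_leaves` are assumptions, and the tree string repeats the slide's example with the bracket after "plans" closed.

```python
import re

TREE = """
(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))
"""

# An innermost bracket has the shape "(TAG word)" with no nested parentheses.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tagged_leaves(tree_string):
    """Return (word, tag) pairs for every innermost bracket, left to right."""
    return [(word, tag) for tag, word in PRETERMINAL.findall(tree_string)]


leaves = tagged_leaves(TREE)
# Empty elements such as the trace *-1 are tagged -NONE- and are not real words.
words = [word for word, tag in leaves if tag != "-NONE-"]
print(" ".join(words))
```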
Tagging
• Typically the set of tags is larger than basic parts of speech
• Tags often contain some morphological information
• Often referred to as “morphosyntactic labels”

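To make the "morphosyntactic" point concrete, a few Penn Treebank tags can be split into a basic part of speech plus the morphological information they encode. This is a small sketch covering only a handful of tags; the table name `PENN_MORPHOLOGY` and the helper `describe` are assumptions for illustration.

```python
# Each Penn Treebank tag maps to (basic part of speech, morphological detail).
PENN_MORPHOLOGY = {
    "NN": ("noun", "singular or mass"),
    "NNS": ("noun", "plural"),
    "JJ": ("adjective", "base form"),
    "JJR": ("adjective", "comparative"),
    "JJS": ("adjective", "superlative"),
    "VB": ("verb", "base form"),
    "VBD": ("verb", "past tense"),
    "VBG": ("verb", "gerund or present participle"),
    "VBN": ("verb", "past participle"),
    "VBP": ("verb", "non-3rd-person singular present"),
    "VBZ": ("verb", "3rd-person singular present"),
}


def describe(tag):
    """Look up a tag; unknown tags yield (None, None) rather than an error."""
    return PENN_MORPHOLOGY.get(tag, (None, None))


for word, tag in [("seated", "VBN"), ("topics", "NNS"), ("plans", "VBZ")]:
    pos, detail = describe(tag)
    print(f"{word}/{tag}: {pos}, {detail}")
```

Eleven tags here collapse to three basic parts of speech, which is the sense in which a tagset is larger than the basic categories.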
Tagging Ambiguities
N     N-V   V-IN DT N
FRUIT FLIES LIKE A  BANANA

Interpretation 1
(S (NP (N FRUIT) (N FLIES))
   (VP (V LIKE)
       (NP (DT A) (N BANANA))))

Interpretation 2
(S (NP (N FRUIT))
   (VP (V FLIES)
       (PP (IN LIKE)
           (NP (DT A) (N BANANA)))))

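The two interpretations differ only in the tags chosen for FLIES and LIKE. A minimal sketch of how the candidate tag sequences multiply, assuming a hypothetical hand-built lexicon (`LEXICON`) that lists the tags from the ambiguity slide:

```python
from itertools import product

# Possible tags per word, as on the slide: FLIES is N or V, LIKE is V or IN.
LEXICON = {
    "fruit": ["N"],
    "flies": ["N", "V"],
    "like": ["V", "IN"],
    "a": ["DT"],
    "banana": ["N"],
}


def candidate_taggings(words, lexicon):
    """Enumerate every tag sequence the lexicon allows for the sentence."""
    options = [lexicon[word.lower()] for word in words]
    return [tuple(tags) for tags in product(*options)]


sentence = "FRUIT FLIES LIKE A BANANA".split()
candidates = candidate_taggings(sentence, LEXICON)
for tags in candidates:
    print(" ".join(tags))
```

The lexicon alone allows four sequences, and only two of them correspond to the parses shown. The number of candidates grows as the product of the per-word ambiguities, which is why a tagger has to use context to choose.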
Lots of ambiguities…
• He can can a can.
• I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.

Lots of ambiguities…
• In the Brown Corpus
    • 11.5% of word types are ambiguous
    • 40% of word tokens are ambiguous
• Most words in English are unambiguous.
• Many of the most common words are ambiguous.
• Typically ambiguous tags are not equally probable.

Lots of ambiguities…
Brown Corpus
• Unambiguous (1 tag): 35,340 types
• Ambiguous (2-7 tags): 4,100 types
(Table: DeRose, 1988)

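The gap between the type figure and the token figure comes from counting each distinct word once versus counting every occurrence. A minimal sketch of both counts on a toy tagged sentence; the function name `ambiguity_stats` and the toy data are assumptions, and the result says nothing about the real Brown Corpus numbers.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)

    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {word for word, tags in tags_by_word.items() if len(tags) > 1}

    type_share = len(ambiguous) / len(tags_by_word)
    ambiguous_tokens = sum(
        1 for word, _ in tagged_tokens if word.lower() in ambiguous
    )
    token_share = ambiguous_tokens / len(tagged_tokens)
    return type_share, token_share


# "He can can a can." with one plausible Penn-style tagging.
toy = [
    ("He", "PRP"), ("can", "MD"), ("can", "VB"),
    ("a", "DT"), ("can", "NN"), (".", "."),
]
type_share, token_share = ambiguity_stats(toy)
print(type_share, token_share)
```

Only one of the four word types is ambiguous, yet it accounts for half the tokens, which mirrors the pattern that ambiguous words tend to be frequent ones.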
Approaches to Tagging
• Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995)
• Stochastic Tagger: HMM-based Tagger
• Transformation-Based Tagger: Brill Tagger (Brill 1995)

NLTK
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!
• Runs on Python

NLTK Introduction
• The Natural Language Toolkit (NLTK) provides:
    • Basic classes for representing data relevant to natural language processing.
    • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
    • Standard implementations of each task, which can be combined to solve complex problems.
• Two versions: NLTK and NLTK-Lite

NLTK Modules
• nltk.token: processing individual elements of text, such as words or sentences.
• nltk.probability: modeling frequency distributions and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of the parser interface.
• nltk.chunkparser: a regular-expression based surface parser.

Python for NLP
• Python is a great language for NLP:
    • Simple
    • Easy to debug:
        • Exceptions
        • Interpreted language
    • Easy to structure
        • Modules
        • Object oriented programming
    • Powerful string manipulation

Python Modules and Packages
• Python modules “package program code and data for reuse.” (Lutz)
• Similar to library in C, package in Java.
• Python packages are hierarchical modules (i.e., modules that contain other modules).
• Three commands for accessing modules:
    • import
    • from…import
    • reload

Import Command
• The import command loads a module:
    # Load the regular expression module
    >>> import re
• To access the contents of a module, use dotted names:
    # Use the search function from the re module on a string
    >>> text = 'Mr. Sherlock Holmes'
    >>> re.search(r'\w+', text)
• To list the contents of a module, use dir:
    >>> dir(re)
    ['DOTALL', 'I', 'IGNORECASE', …]

from...import
• The from…import command loads individual functions and objects from a module:
    # Load the search function from the re module
    >>> from re import search
• Once an individual function or object is loaded with from…import, it can be used directly:
    # Use the search function from the re module on a string
    >>> text = 'Mr. Sherlock Holmes'
    >>> search(r'\w+', text)

Import vs. from...import
import
• Keeps module functions separate from user functions.
• Requires the use of dotted names.
• Works with reload.
from…import
• Puts module functions and user functions together.
• More convenient names.
• Does not work with reload.

Reload
• If you edit a module, you must use the reload command before the changes become visible in Python:
    >>> import mymodule
    ...
    >>> from importlib import reload  # needed in Python 3; reload is a built-in in Python 2
    >>> reload(mymodule)
• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.

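The difference between the two import styles under reload can be checked with a throwaway module. The following is a sketch for Python 3, where reload lives in `importlib`; the module name `mymodule_demo`, its `greet` function, and the helper `demo_reload` are all hypothetical and exist only for this demonstration.

```python
import importlib
import os
import sys
import tempfile


def demo_reload():
    """Edit a module on disk and compare dotted access with a from-import."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # avoid a stale bytecode cache in the demo
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mymodule_demo.py")
        with open(path, "w") as handle:
            handle.write("def greet():\n    return 'old'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import mymodule_demo
            from mymodule_demo import greet

            before = mymodule_demo.greet()

            # "Edit" the module, then reload it.
            with open(path, "w") as handle:
                handle.write("def greet():\n    return 'new version'\n")
            importlib.reload(mymodule_demo)

            after_dotted = mymodule_demo.greet()  # sees the edited code
            after_direct = greet()                # still the old function
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("mymodule_demo", None)
            sys.dont_write_bytecode = old_flag
    return before, after_dotted, after_direct


result = demo_reload()
print(result)
```

The name bound by from…import keeps pointing at the function object created by the first import, so only the dotted name picks up the edit.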
Next Sessions…
• Rule-Based Tagging
• Stochastic Tagging
• Hidden Markov Models (HMMs)
• N-Grams
• Read Jurafsky and Martin Chapter 4 (PDF)
• Install NLTK