
I256 Applied Natural Language Processing Fall 2009



Presentation Transcript


  1. I256 Applied Natural Language Processing, Fall 2009. Lecture 9: Review. Barbara Rosario

  2. Why NLP is difficult • Fundamental goal: deep understanding of broad language • Not just string processing or keyword matching • Language is ambiguous • At all levels: lexical, phrase, semantic • Language is flexible • New words, new meanings • Different meanings in different contexts • Language is subtle • Language is about human communication • Problem of scale • Many (infinite?) possible words, meanings, contexts • Problem of sparsity • Very difficult to do statistical analysis; most things (words, concepts) have never been seen before • Long-range correlations • Representation of meaning

  3. Linguistics essentials • Important distinction: • study of language structure (grammar) • study of meaning (semantics) • Grammar • Phonology (the study of sound systems and abstract sound units). • Morphology (the formation and composition of words) • Syntax (the rules that determine how words combine into sentences) • Semantics • The study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences http://en.wikipedia.org/wiki/Linguistics

  4. Today’s review • What I hope we achieved • Overall idea of linguistic problems • Overall understanding of “lower level” NLP tasks • POS, WSD, language models, segmentation, etc. • NOTE: These will be used for preprocessing and as features for higher-level tasks • Initial understanding of Stat NLP • Corpora & annotation • Probability theory, GM • Sparsity problem • Familiarity with Python and NLTK • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Semantics • Lexical semantics • Word sense disambiguation (WSD) • Lexical acquisition • Corpus-based statistical approaches to tackle NLP problems • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing

  5. Morphology • Morphology is the study of the internal structure of words, of the way words are built up from smaller meaning units. • Morpheme: • The smallest meaningful unit in the grammar of a language. • Two classes of morphemes • Stems: “main” morpheme of the word, supplying the main meaning (e.g. establish in the example below) • Affixes: add additional meaning • Prefixes: anti-, dis- (antidisestablishmentarianism) • Suffixes: -ment, -arian, -ism (antidisestablishmentarianism) • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Examples: unladylike, dogs, technique

  6. Types of morphological processes • Inflection: • Systematic modification of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural. • Doesn’t change the word class • New grammatical role • Usually produces a predictable, non-idiosyncratic change of meaning. • run → runs | running | ran • hope+ing → hoping; hop+ing → hopping • Derivation: • Ex: compute → computer → computerization • Less systematic than inflection • It can involve a change of meaning • Compounding: • Merging of two or more words into a new word • Downmarket, (to) overtake

  7. Stemming & Lemmatization • The removal of the inflectional ending from words (strip off any affixes) • Laughing, laugh, laughs, laughed → laugh • Problems • Can conflate semantically different words • Gallery and gall may both be stemmed to gall • Regular expressions for stemming • Porter Stemmer • nltk.wordnet.morphy • A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.
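
A minimal NLTK sketch of both steps (not from the original slides; assumes NLTK and its WordNet data are installed):

    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.corpus import wordnet

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ['laughing', 'laugh', 'laughs', 'laughed']])
    # all four reduce to the stem 'laugh'

    lemmatizer = WordNetLemmatizer()                    # dictionary-backed lemmatization
    print(lemmatizer.lemmatize('laughing', pos='v'))    # 'laugh', a known dictionary word
    print(wordnet.morphy('laughing', wordnet.VERB))     # same result via morphy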

  8. Grammar: words: POS • Words of a language are grouped into classes to reflect similar syntactic behaviors • Syntactic or grammatical categories (aka parts-of-speech) • Nouns (people, animals, concepts) • Verbs (actions, states) • Adjectives • Prepositions • Determiners • Open or lexical categories (nouns, verbs, adjectives) • Large number of members; new words are commonly added • Closed or functional categories (prepositions, determiners) • Few members, clear grammatical use

  9. Part-of-speech (English) • [table of common English part-of-speech tags; from Dan Klein’s CS 288 slides]

  10. Terminology • Tagging • The process of associating labels with each token in a text • Tags • The labels • Syntactic word classes • Tag Set • The collection of tags used

  11. Example • Typically a tagged text is a sequence of white-space separated base/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
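
Such a string can be split back into (token, tag) pairs; a small sketch using NLTK's str2tuple:

    import nltk

    tagged_text = ("These/DT findings/NNS should/MD be/VB useful/JJ for/IN "
                   "therapeutic/JJ strategies/NNS ./.")
    pairs = [nltk.tag.str2tuple(t) for t in tagged_text.split()]
    print(pairs[:3])   # [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD')]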

  12. Part-of-speech (English) • [table of English part-of-speech tags, continued; from Dan Klein’s CS 288 slides]

  13. Part-of-Speech Ambiguity • [examples of words that are highly ambiguous as to their part-of-speech tag]

  14. Sources of information • Syntagmatic: tags of the other words • AT JJ NN is common • AT JJ VBP impossible (or unlikely) • Lexical: look at the words • The → AT • Flour → more likely to be a noun than a verb • A tagger that always chooses the most common tag is 90% correct (often used as a baseline) • Most taggers use both
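
The most-common-tag baseline can be reproduced with a lookup tagger; a sketch in the style of the NLTK book, assuming the Brown corpus is installed (held-out accuracy lands in the ballpark of the 90% figure):

    import nltk
    from nltk.corpus import brown

    sents = brown.tagged_sents(categories='news')
    size = int(len(sents) * 0.9)
    train, test = sents[:size], sents[size:]

    # most common tag for each word, estimated from the training portion
    cfd = nltk.ConditionalFreqDist(pair for sent in train for pair in sent)
    likely_tags = {word: cfd[word].max() for word in cfd}

    baseline = nltk.UnigramTagger(model=likely_tags,
                                  backoff=nltk.DefaultTagger('NN'))
    print(baseline.evaluate(test))   # .accuracy(test) in newer NLTK versions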

  15. What does Tagging do? • Collapses Distinctions • Lexical identity may be discarded • e.g., all personal pronouns tagged with PRP • Introduces Distinctions • Ambiguities may be resolved • e.g. deal tagged with NN or VB • Helps in classification and prediction

  16. Why POS? • A word’s POS tells us a lot about the word and its neighbors: • Limits the range of meanings (deal), pronunciation (text-to-speech) (OBject vs. obJECT, REcord vs. reCORD), or both (wind) • Helps in stemming: saw[v] → see, saw[n] → saw • Limits the range of following words • Can help select nouns from a document for summarization • Basis for partial parsing (chunked parsing)

  17. Choosing a tagset • The choice of tagset greatly affects the difficulty of the problem • Need to strike a balance between • Getting better information about context • Making it possible for classifiers to do their job

  18. Tagging methods • Hand-coded • Statistical taggers • N-Gram Tagging • HMM • (Maximum Entropy) • Brill (transformation-based) tagger

  19. Unigram Tagger • Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. • For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe).
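
In NLTK this is the UnigramTagger; a minimal sketch, again assuming the Brown corpus:

    import nltk
    from nltk.corpus import brown

    sents = brown.tagged_sents(categories='news')
    size = int(len(sents) * 0.9)
    train, test = sents[:size], sents[size:]

    tagger = nltk.UnigramTagger(train)
    print(tagger.tag('I frequent this cafe'.split()))
    # 'frequent' receives its overall most likely tag, regardless of context
    print(tagger.evaluate(test))   # .accuracy(test) in newer NLTK versions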

  20. N-Gram Tagging • An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens • A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers. • [figure: the context window used by a trigram tagger]

  21. N-Gram Tagging • Why not 10-gram taggers? • As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. • This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off)
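
The usual answer to sparse data is backoff: consult the bigram context first, and fall back to simpler models when the context was never seen in training. A sketch using the same Brown train/test split as above:

    import nltk
    from nltk.corpus import brown

    sents = brown.tagged_sents(categories='news')
    size = int(len(sents) * 0.9)
    train, test = sents[:size], sents[size:]

    t0 = nltk.DefaultTagger('NN')                 # last resort
    t1 = nltk.UnigramTagger(train, backoff=t0)    # back off to the word alone
    t2 = nltk.BigramTagger(train, backoff=t1)     # use the previous tag when seen in training
    print(t2.evaluate(test))   # typically beats the unigram tagger alone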

  22. Markov Model Tagger • Bigram tagger • Assumptions: • Words are independent of each other • A word’s identity depends only on its tag • A tag depends only on the previous tag

  23. Markov Model Tagger • [figure: a chain of tags t1, t2, …, tn, each emitting a word w1, w2, …, wn]
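
NLTK's supervised HMM trainer embodies exactly these independence assumptions; a toy sketch (real use needs smoothing, since unseen words otherwise get zero emission probability):

    import nltk
    from nltk.corpus import brown
    from nltk.tag import hmm

    train = brown.tagged_sents(categories='news')[:3000]
    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)
    print(tagger.tag('The race for outer space'.split()))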

  24. Rule-Based Tagger • The Linguistic Complaint • Where is the linguistic knowledge of a tagger? • Just massive tables of numbers • Aren’t there any linguistic insights that could emerge from the data? • Could thus use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.

  25. The Brill tagger (transformation-based tagger) • An example of Transformation-Based Learning • Basic idea: do a quick job first (using frequency), then revise it using contextual rules. • Very popular (freely available, works fairly well) • Probably the most widely used tagger (esp. outside NLP) • … but not the most accurate: 96.6% / 82.0% • A supervised method: requires a tagged corpus

  26. Brill Tagging: In more detail • Start with simple (less accurate) rules… learn better ones from tagged corpus • Tag each word initially with most likely POS • Examine set of transformations to see which improves tagging decisions compared to tagged corpus • Re-tag corpus using best transformation • Repeat until, e.g., performance doesn’t improve • Result: tagging procedure (ordered list of transformations) which can be applied to new, untagged text

  27. An example • Examples: • They are expected to race tomorrow. • The race for outer space. • Tagging algorithm: • Tag all uses of “race” as NN (most likely tag in the Brown corpus) • They are expected to race/NN tomorrow • the race/NN for outer space • Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO: • They are expected to race/VB tomorrow • the race/NN for outer space
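
A toy sketch of this two-phase idea (the real Brill tagger learns such rules automatically from a tagged corpus; the mini-lexicon below is invented for illustration):

    def initial_tags(words, lexicon):
        # Phase 1: tag every word with its most likely tag, defaulting to NN
        return [(w, lexicon.get(w.lower(), 'NN')) for w in words]

    def nn_to_vb_after_to(tagged):
        # Phase 2 transformation: change NN to VB when the previous tag is TO
        out = list(tagged)
        for i in range(1, len(out)):
            if out[i][1] == 'NN' and out[i - 1][1] == 'TO':
                out[i] = (out[i][0], 'VB')
        return out

    lexicon = {'they': 'PRP', 'are': 'VBP', 'expected': 'VBN',
               'to': 'TO', 'race': 'NN', 'tomorrow': 'NN'}
    tagged = initial_tags('They are expected to race tomorrow'.split(), lexicon)
    print(nn_to_vb_after_to(tagged))   # ... ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')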

  28. What gets learned? [from Brill 95] • [tables of learned tag-triggered and morphology-triggered transformations] • Rules are linguistically interpretable

  29. Today’s review • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Semantics • Word sense disambiguation (WSD) • Lexical semantics • Lexical acquisition • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing

  30. Phrase structure • Words are organized in phrases • Phrases: grouping of words that are clumped as a unit • Syntax: study of the regularities and constraints of word order and phrase structure

  31. Major phrase types • Sentence (S) (whole grammatical unit). Normally rewrites as a subject noun phrase and a verb phrase • Noun phrase (NP): phrase whose head is a noun or a pronoun, optionally accompanied by a set of modifiers • The smart student of physics with long hair

  32. Major phrase types • Prepositional phrases (PP) • Headed by a preposition and containing a NP • She is [on the computer] • They walked [to their school] • Verb phrases (VP) • Phrase whose head is a verb • [Getting to school on time] was a struggle • He [was trying to keep his temper] • That woman [quickly showed me the way to hide]

  33. Phrase structure grammar • Syntactic analysis of sentences • (Ultimately) to extract meaning: • Mary gave Peter a book • Peter gave Mary a book

  34. Phrase structure parsing • Parsing: the process of reconstructing the derivation(s) or phrase structure trees that give rise to a particular sequence of words • Parse is a phrase structure tree • New art critics write reviews with computers

  35. Phrase structure parsing & ambiguity • The children ate the cake with a spoon • PP Attachment Ambiguity • Why is it important for NLP?
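
Both readings can be recovered mechanically; a small chart-parser sketch with a toy grammar (the grammar is illustrative, not from the lecture):

    import nltk

    grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      NP -> Det N | NP PP
      VP -> V NP | VP PP
      PP -> P NP
      Det -> 'the' | 'a'
      N  -> 'children' | 'cake' | 'spoon'
      V  -> 'ate'
      P  -> 'with'
    """)
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse('the children ate the cake with a spoon'.split()):
        print(tree)   # two trees: the PP attaches to the VP (instrument) or to the NP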

  36. Today’s review • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Text normalization • Segmentation • Semantics • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing

  37. Text Normalization • Stemming • Convert to lower case • Identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. • For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks. • Lemmatization • Make sure that the resulting form is a known word in a dictionary • WordNet lemmatizer only removes affixes if the resulting word is in its dictionary
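
A sketch of that mapping step (the 0.0 and AAA conventions are the ones named above):

    import re

    def normalize(token):
        """Map non-standard tokens to special vocabulary items, else lowercase."""
        if re.fullmatch(r'[A-Z]{2,}', token):        # acronyms: NATO, HMM, ...
            return 'AAA'
        if re.fullmatch(r'\d+(?:\.\d+)?', token):    # decimal numbers: 7, 22.50, ...
            return '0.0'
        return token.lower()

    print([normalize(t) for t in ['The', 'NATO', 'budget', 'grew', '22.50', '%']])
    # ['the', 'AAA', 'budget', 'grew', '0.0', '%']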

  38. Segmentation • Word segmentation • For languages that do not put spaces between words • Chinese, Japanese, Korean, Thai, German (for compound nouns) • Tokenization • Sentence segmentation • Divide text into sentences

  39. Tokenization • Divide text into units called tokens (words, numbers, punctuation) • Pages 124-136, Manning • What is a word? • Graphic word: string of contiguous alphanumeric characters surrounded by white space • $22.50 • Main clue (in English) is the occurrence of whitespace • Problems • Periods: usually remove punctuation but sometimes it’s useful to keep periods (Wash. → wash) • Single apostrophes, contractions (isn’t, didn’t, dog’s: for meaning extraction it could be useful to have 2 separate forms: is + n’t or not) • Hyphenation: • Sometimes best as a single word: co-operate • Sometimes best as 2 separate words: 26-year-old, aluminum-export ban • (RE for tokenization)
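
Many of these decisions can be encoded in one verbose regular expression; a sketch in the style of the NLTK book (keeping contractions and hyphenated words as single tokens is a design choice, not the only option):

    import nltk

    pattern = r'''(?x)                # verbose regexp
        (?:[A-Z]\.)+                  # abbreviations, e.g. U.S.A.
      | \w+(?:[-']\w+)*               # words, incl. 26-year-old, isn't, dog's
      | \$?\d+(?:\.\d+)?%?            # currency and percentages, e.g. $22.50
      | \.\.\.                        # ellipsis
      | [.,;"'?():]                   # separate punctuation tokens
    '''
    print(nltk.regexp_tokenize("A 26-year-old paid $22.50 but isn't happy.", pattern))
    # ['A', '26-year-old', 'paid', '$22.50', 'but', "isn't", 'happy', '.']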

  40. Sentence Segmentation • Sentence: • Something ending with a ., ?, ! (and sometimes also :) • “You reminded me,” she remarked, “of your mother.” • Nested sentences • Note the .” • Sentence boundary detection algorithms • Heuristic (see figure 4.1, page 135, Manning) • Statistical classification trees (Riley 1989) • Probability of a word to occur before or after a boundary, case and length of words • Neural network (Palmer and Hearst 1997) • Part-of-speech distribution of preceding and following words • Maximum Entropy (Mikheev 1998) • For references see Manning

  41. Sentence Segmentation • Sentence: • Something ending with a ., ?, ! (and sometimes also :) • “You reminded me,” she remarked, “of your mother.” • Nested sentences • Note the .” • Sentence boundary detection algorithms • Heuristic (see figure 4.1, page 135, Manning) • Statistical classification trees (Riley 1989) • Probability of a word to occur before or after a boundary, case and length of words • Neural network (Palmer and Hearst 1997) • Part-of-speech distribution of preceding and following words • Maximum Entropy (Mikheev 1998) • Note: MODELS and Features

  42. Segmentation as classification • Sentence segmentation can be viewed as a classification task for punctuation: • Whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence. • We’ll return to this when we cover classification • See Section 6.2, NLTK book • For word segmentation see Section 3.8, NLTK book • Also page 180 of Speech and Language Processing, Jurafsky and Martin
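
The flavor of Section 6.2 of the NLTK book: extract features around every candidate boundary symbol and hand them to a classifier. A sketch of such a feature extractor (the token list is invented for illustration):

    def punct_features(tokens, i):
        """Features for deciding whether tokens[i] (e.g. '.') ends a sentence."""
        return {'punct': tokens[i],
                'prev_word': tokens[i - 1].lower(),
                'prev_word_is_one_char': len(tokens[i - 1]) == 1,
                'next_word_capitalized': tokens[i + 1][0].isupper()}

    tokens = ['Mr', '.', 'Smith', 'arrived', '.', 'He', 'sat', 'down']
    print(punct_features(tokens, 1))   # the '.' after an abbreviation
    print(punct_features(tokens, 4))   # a genuine sentence boundary
    # featuresets like these, labeled with corpus boundaries, can train
    # e.g. nltk.NaiveBayesClassifier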

  43. Today’s review • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Semantics • Lexical semantics • Word sense disambiguation (WSD) • Lexical acquisition • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing

  44. Semantics • Semantics is the study of the meaning of words, constructions, and utterances • Study of the meaning of individual words (lexical semantics) • Study of how the meanings of individual words are combined into the meaning of sentences (or larger units)

  45. Lexical semantics • How words are related to each other • Hyponymy • scarlet, vermilion, carmine, and crimson are all hyponyms of red • Hypernymy • red is a hypernym of scarlet • Antonymy (opposite) • Male, female • Meronymy (part of) • Tire is a meronym of car • Etc.
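
These relations can be queried directly through NLTK's WordNet interface; a minimal sketch (synset names are WordNet's; the results noted in comments are abbreviated):

    from nltk.corpus import wordnet as wn

    red = wn.synset('red.n.01')
    print([s.name() for s in red.hyponyms()])      # includes scarlet/vermilion and crimson synsets
    print([s.name() for s in red.hypernyms()])     # a chromatic-color synset

    car = wn.synset('car.n.01')
    print([s.name() for s in car.part_meronyms()]) # parts of a car (car_door, ...)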

  46. Word Senses • Words have multiple distinct meanings, or senses: • Plant: living plant, manufacturing plant, … • Title: name of a work, ownership document, form of address, material at the start of a film, … • Many levels of sense distinctions • Homonymy: totally unrelated meanings (river bank, money bank) • Polysemy: related meanings (star in sky, star on tv, title) • Systematic polysemy: productive meaning extensions (metonymy such as organizations to their buildings) or metaphor • Sense distinctions can be extremely subtle (or not) • Granularity of senses needed depends a lot on the task Taken from Dan Klein’s cs 288 slides

  47. Word Sense Disambiguation • Determine which of the senses of an ambiguous word is invoked in a particular use of the word • Example: living plant vs. manufacturing plant • How do we tell these senses apart? • “Context” • The manufacturing plant which had previously sustained the town’s economy shut down after an extended labor strike. • Maybe it’s just text categorization • Each word sense represents a topic • Why is it important to model and disambiguate word senses? • Translation • Bank → banca or riva (Italian) • Parsing • For PP attachment, for example • Information retrieval • To return documents with the right sense of bank Adapted from Dan Klein’s cs 288 slides

  48. Features • Bag-of-words (use surrounding words, with no order) • The manufacturing plant which had previously sustained the town’s economy shut down after an extended labor strike. • Bag of words = {after, manufacturing, which, labor, …} • Bag-of-words classification works OK for noun senses • 90% on classic, shockingly easy examples (line, interest, star) • 80% on Senseval-1 nouns • 70% on Senseval-1 verbs
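
A toy sketch of bag-of-words WSD with an NLTK Naive Bayes classifier (the two training contexts and labels are invented for illustration; real systems train on sense-tagged corpora such as Senseval):

    import nltk

    def bow_features(context):
        """Unordered bag-of-words features over a context window."""
        return {'contains(%s)' % w.lower(): True for w in context.split()}

    train = [
        (bow_features('the manufacturing plant shut down after the labor strike'), 'factory'),
        (bow_features('water the plant daily and give it plenty of sunlight'), 'living'),
    ]
    classifier = nltk.NaiveBayesClassifier.train(train)

    test = bow_features('the plant shut down during the strike')
    print(classifier.classify(test))   # 'factory': shut/down/strike point to that sense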

  49. Verb WSD • Why are verbs harder? • Verbal senses less topical • More sensitive to structure, argument choice • Better disambiguated by their arguments (subject, object): importance of local information • For nouns, a wider context is likely to be useful • Verb Example: “Serve” • [function] The tree stump serves as a table • [enable] The scandal served to increase his popularity • [dish] We serve meals for the homeless • [enlist] She served her country • [jail] He served six years for embezzlement • [tennis] It was Agassi's turn to serve • [legal] He was served by the sheriff • Different types of information may be appropriate for different parts of speech Adapted from Dan Klein’s cs 288 slides

  50. Better features • There are smarter features: • Argument selectional preference: • serve NP[meals] vs. serve NP[papers] vs. serve NP[country] • Subcategorization: • [function] serve PP[as] • [enable] serve VP[to] • [tennis] serve <intransitive> • [food] serve NP {PP[to]} • Can capture poorly (but robustly) with local windows… but we can also use a parser and get these features explicitly Taken from Dan Klein’s cs 288 slides
