I256 Applied Natural Language Processing, Fall 2009 • Lecture 9: Review • Barbara Rosario
Why NLP is difficult • Fundamental goal: deep understanding of broad language • Not just string processing or keyword matching • Language is ambiguous • At all levels: lexical, phrase, semantic • Language is flexible • New words, new meanings • Different meanings in different contexts • Language is subtle • Language is about human communication • Problem of scale • Many (infinite?) possible words, meanings, contexts • Problem of sparsity • Very difficult to do statistical analysis; most things (words, concepts) have never been seen before • Long-range correlations • Representation of meaning
Linguistics essentials • Important distinction: • study of language structure (grammar) • study of meaning (semantics) • Grammar • Phonology (the study of sound systems and abstract sound units). • Morphology (the formation and composition of words) • Syntax (the rules that determine how words combine into sentences) • Semantics • The study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences http://en.wikipedia.org/wiki/Linguistics
Today’s review • What I hope we achieved • Overall idea of linguistic problems • Overall understanding of “lower level” NLP tasks • POS, WSD, language models, segmentation, etc • NOTE: Will be used for preprocessing and as features for higher level tasks • Initial understanding of Stat NLP • Corpora & annotation • probability theory, GM • Sparsity problem • Familiarity with Python and NLTK • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Semantics • Lexical semantics • Word sense disambiguation (WSD) • Lexical acquisition • Corpus-based statistical approaches to tackle NLP problems • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing
Morphology • Morphology is the study of the internal structure of words, of the way words are built up from smaller meaning units. • Morpheme: • The smallest meaningful unit in the grammar of a language. • Two classes of morphemes • Stems: “main” morpheme of the word, supplying the main meaning (i.e. establish in the example below) • Affixes: add additional meaning • Prefixes: Antidisestablishmentarianism • Suffixes: Antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Examples: unladylike, dogs, technique
Types of morphological processes • Inflection: • Systematic modification of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural • Doesn't change the word class • New grammatical role • Usually produces a predictable, non-idiosyncratic change of meaning • run → runs | running | ran • hope+ing → hoping, hop+ing → hopping • Derivation: • Ex: compute → computer → computerization • Less systematic than inflection • It can involve a change of meaning • Compounding: • Merging of two or more words into a new word • Downmarket, (to) overtake
Stemming & Lemmatization • The removal of the inflectional ending from words (strip off any affixes) • Laughing, laugh, laughs, laughed laugh • Problems • Can conflate semantically different words • Gallery and gall may both be stemmed to gall • Regular Expressions for Stemming • Porter Stemmer • nltk.wordnet.morphy • A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.
Grammar: words: POS • Words of a language are grouped into classes to reflect similar syntactic behaviors • Syntactic or grammatical categories (a.k.a. parts of speech) • Nouns (people, animals, concepts) • Verbs (actions, states) • Adjectives • Prepositions • Determiners • Open or lexical categories (nouns, verbs, adjectives) • Large number of members; new words are commonly added • Closed or functional categories (prepositions, determiners) • Few members, clear grammatical use
Part-of-speech (English) From Dan Klein’s cs 288 slides
Terminology • Tagging • The process of associating labels with each token in a text • Tags • The labels • Syntactic word classes • Tag Set • The collection of tags used
Example • Typically a tagged text is a sequence of white-space separated base/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
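Such whitespace-separated base/tag text can be split into (word, tag) pairs with NLTK's str2tuple helper; a small sketch:

```python
import nltk

tagged_text = """These/DT findings/NNS should/MD be/VB useful/JJ for/IN
therapeutic/JJ strategies/NNS ./."""

# Split each whitespace-separated base/tag token into a (word, tag) pair
tagged_tokens = [nltk.tag.str2tuple(t) for t in tagged_text.split()]
print(tagged_tokens[:3])   # [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD')]
```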
Part-of-speech (English) From Dan Klein’s cs 288 slides
Part-of-Speech Ambiguity • Some words are highly ambiguous as to their part-of-speech tag
Sources of information • Syntagmatic: tags of the other words • AT JJ NN is common • AT JJ VBP is impossible (or unlikely) • Lexical: look at the words themselves • the → AT • flour is more likely to be a noun than a verb • A tagger that always chooses the most common tag for each word is 90% correct (often used as a baseline) • Most taggers use both
What does Tagging do? • Collapses Distinctions • Lexical identity may be discarded • e.g., all personal pronouns tagged with PRP • Introduces Distinctions • Ambiguities may be resolved • e.g. deal tagged with NN or VB • Helps in classification and prediction
Why POS? • A word's POS tells us a lot about the word and its neighbors: • Limits the range of meanings (deal), of pronunciation for text-to-speech (OBject vs. obJECT, REcord vs. reCORD), or both (wind) • Helps in stemming: saw[v] → see, saw[n] → saw • Limits the range of following words • Can help select nouns from a document for summarization • Basis for partial parsing (chunked parsing)
Choosing a tagset • The choice of tagset greatly affects the difficulty of the problem • Need to strike a balance between • Getting better information about context • Making it possible for classifiers to do their job
Tagging methods • Hand-coded • Statistical taggers • N-Gram Tagging • HMM • (Maximum Entropy) • Brill (transformation-based) tagger
Unigram Tagger • Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. • For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe).
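A minimal sketch of this idea with NLTK's UnigramTagger, trained on the Brown news section (the 90/10 split is an illustrative choice):

```python
import nltk
from nltk.corpus import brown   # needs the Brown corpus: nltk.download('brown')

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train, test = tagged_sents[:size], tagged_sents[size:]

# For every word seen in training, memorize its single most frequent tag
unigram = nltk.UnigramTagger(train)
print(unigram.tag("the race for outer space".split()))
print(unigram.accuracy(test))   # called .evaluate() in older NLTK releases
```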
N-Gram Tagging • An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens • A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
N-Gram Tagging • Why not 10-gram taggers? • As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. • This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off)
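The usual NLTK remedy is backoff: consult the most specific context first and fall back to a less specific tagger when that context was never seen in training. A short sketch, reusing the train/test split from the unigram example above:

```python
import nltk

# Back off from bigram context -> unigram -> a default most-common tag
# (train and test come from the unigram tagger sketch above)
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)
print(t2.accuracy(test))   # typically better than a bigram tagger alone
```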
Markov Model Tagger • Bigram tagger • Assumptions: • Words are independent of each other • A word's identity depends only on its tag • A tag depends only on the previous tag
Markov Model Tagger • [Graphical model: a chain of tags t1 → t2 → … → tn, with each word wi emitted by its tag ti]
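Together these assumptions give the standard bigram HMM factorization; tagging then means finding the most probable tag sequence, computed in practice with the Viterbi algorithm:

```latex
P(t_1,\ldots,t_n,\, w_1,\ldots,w_n) \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
\qquad
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
```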
Rule-Based Tagger • The Linguistic Complaint • Where is the linguistic knowledge of a tagger? • Just massive tables of numbers • Aren't there any linguistic insights that could emerge from the data? • Could thus use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.
The Brill tagger(transformation-based tagger) • An example of Transformation-Based Learning • Basic idea: do a quick job first (using frequency), then revise it using contextual rules. • Very popular (freely available, works fairly well) • Probably the most widely used tagger (esp. outside NLP) • …. but not the most accurate: 96.6% / 82.0 % • A supervised method: requires a tagged corpus
Brill Tagging: In more detail • Start with simple (less accurate) rules… learn better ones from a tagged corpus • Tag each word initially with its most likely POS • Examine the set of transformations to see which most improves tagging decisions compared to the tagged corpus • Re-tag the corpus using the best transformation • Repeat until, e.g., performance doesn't improve • Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text
An example • Examples: • They are expected to race tomorrow. • The race for outer space. • Tagging algorithm: • Tag all uses of “race” as NN (most likely tag in the Brown corpus) • They are expected to race/NN tomorrow • the race/NN for outer space • Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO: • They are expected to race/VB tomorrow • the race/NN for outer space
What gets learned? [from Brill 95] • Tag-triggered transformations • Morphology-triggered transformations • Rules are linguistically interpretable
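A hedged sketch of transformation-based learning with the NLTK 3 API (module and class names differ in older NLTK releases); it reuses the unigram tagger and the train/test split from the tagging sketches above as the initial "quick job":

```python
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Initial pass: the unigram tagger trained earlier; then learn contextual fix-up rules
trainer = BrillTaggerTrainer(unigram, templates=brill24(), trace=0)
brill_tagger = trainer.train(train, max_rules=20)

for rule in brill_tagger.rules()[:5]:
    print(rule)                      # learned, human-readable transformations
print(brill_tagger.accuracy(test))
```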
Today’s review • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Semantics • Word sense disambiguation (WSD) • Lexical semantics • Lexical acquisition • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing
Phrase structure • Words are organized in phrases • Phrases: grouping of words that are clumped as a unit • Syntax: study of the regularities and constraints of word order and phrase structure
Major phrase types • Sentence (S) (whole grammatical unit). Normally rewrites as a subject noun phrase and a verb phrase • Noun phrase (NP): phrase whose head is a noun or a pronoun, optionally accompanied by a set of modifiers • The smart student of physics with long hair
Major phrase types • Prepositional phrases (PP) • Headed by a preposition and containing a NP • She is [on the computer] • They walked [to their school] • Verb phrases (VP) • Phrase whose head is a verb • [Getting to school on time] was a struggle • He [was trying to keep his temper] • That woman [quickly showed me the way to hide]
Phrase structure grammar • Syntactic analysis of sentences • (Ultimately) to extract meaning: • Mary gave Peter a book vs. Peter gave Mary a book (same words, different meanings)
Phrase structure parsing • Parsing: the process of reconstructing the derivation(s) or phrase structure trees that give rise to a particular sequence of words • Parse is a phrase structure tree • New art critics write reviews with computers
Phrase structure parsing & ambiguity • The children ate the cake with a spoon • PP Attachment Ambiguity • Why is it important for NLP?
Today’s review • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Text normalization • Segmentation • Semantics • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing
Text Normalization • Stemming • Convert to lower case • Identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. • For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks. • Lemmatization • Make sure that the resulting form is a known word in a dictionary • WordNet lemmatizer only removes affixes if the resulting word is in its dictionary
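A minimal sketch of this kind of normalization (the 0.0 and AAA target tokens follow the slide; the regular expressions and the example below are illustrative assumptions):

```python
import re
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize(token):
    # Map non-standard words to special vocabulary items, as described above
    if re.fullmatch(r'[A-Z]{2,}', token):       # acronyms -> one shared token
        return 'AAA'
    if re.fullmatch(r'\d+(\.\d+)?', token):     # decimal numbers -> one shared token
        return '0.0'
    token = token.lower()
    return lemmatizer.lemmatize(token)          # keep dictionary forms where possible

print([normalize(t) for t in ['The', 'USA', 'spent', '22.50', 'dollars']])
# ['the', 'AAA', 'spent', '0.0', 'dollar']
```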
Segmentation • Word segmentation • For languages that do not put spaces between words • Chinese, Japanese, Korean, Thai, German (for compound nouns) • Tokenization • Sentence segmentation • Divide text into sentences
Tokenization • Divide text into units called tokens (words, numbers, punctuation) • Pages 124–136, Manning • What is a word? • Graphic word: a string of contiguous alphanumeric characters surrounded by whitespace • $22.50 • The main clue (in English) is the occurrence of whitespace • Problems • Periods: we usually remove punctuation, but sometimes it's useful to keep periods (Wash. vs. wash) • Single apostrophes, contractions (isn't, didn't, dog's): for meaning extraction it could be useful to have two separate forms, e.g. is + n't or not • Hyphenation: • Sometimes best as a single word: co-operate • Sometimes best as two separate words: 26-year-old, aluminum-export ban • (RE for tokenization)
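One standard way to handle several of these cases is a regular-expression tokenizer; the sketch below follows the pattern style used in the NLTK book, and the pattern itself is only illustrative:

```python
import nltk

text = "That U.S.A. poster-print costs $12.40... isn't it?"

pattern = r'''(?x)               # verbose regex: whitespace and comments are ignored
      (?:[A-Z]\.)+               # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?         # currency and percentages, e.g. $12.40, 82%
    | \w+(?:-\w+)*               # words, with optional internal hyphens
    | \.\.\.                     # ellipsis
    | [][.,;"'?():_`-]           # these single characters are tokens of their own
'''

print(nltk.regexp_tokenize(text, pattern))
```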
Sentence Segmentation • Sentence: something ending with '.', '?', '!' (and sometimes also ':') • “You reminded me,” she remarked, “of your mother.” • Nested sentences • Note the .” • Sentence boundary detection algorithms (note the MODELS and the FEATURES each uses) • Heuristic (see figure 4.1, page 135, Manning) • Statistical classification trees (Riley 1989): probability of a word occurring before or after a boundary, case and length of words • Neural network (Palmer and Hearst 1997): part-of-speech distribution of the preceding and following words • Maximum Entropy (Mikheev 1998) • For reference see Manning
Segmentation as classification • Sentence segmentation can be viewed as a classification task for punctuation: • Whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence. • We'll return to this when we cover classification • See Section 6.2 of the NLTK book • For word segmentation see Section 3.8 of the NLTK book • Also page 180 of Speech and Language Processing, Jurafsky and Martin
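A sketch of the classification view, along the lines of Section 6.2 of the NLTK book (the feature names and the train/test split here are illustrative choices):

```python
import nltk
from nltk.corpus import treebank_raw   # needs: nltk.download('treebank')

def punct_features(tokens, i):
    # Context features for the candidate boundary symbol at position i
    return {'next_word_capitalized': tokens[i + 1][:1].isupper(),
            'prev_word': tokens[i - 1].lower(),
            'punct': tokens[i],
            'prev_word_is_one_char': len(tokens[i - 1]) == 1}

# Turn sentence-segmented text into labeled (features, is_boundary) examples
tokens, boundaries, offset = [], set(), 0
for sent in treebank_raw.sents():
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

featuresets = [(punct_features(tokens, i), i in boundaries)
               for i in range(1, len(tokens) - 1) if tokens[i] in '.?!']

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```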
Today’s review • Grammar • Morphology • Part-of-speech (POS) • Phrase level syntax • Lower level text processing • Semantics • Lexical semantics • Word sense disambiguation (WSD) • Lexical acquisition • Corpora • Intro to probability theory and graphical models (GM) • Example for WSD • Language Models (LM) and smoothing
Semantics • Semantics is the study of the meaning of words, constructions, and utterances • Study of the meaning of individual words (lexical semantics) • Study of how the meanings of individual words are combined into the meaning of sentences (or larger units)
Lexical semantics • How words are related to each other • Hyponymy • scarlet, vermilion, carmine, and crimson are all hyponyms of red • Hypernymy (red is a hypernym of scarlet) • Antonymy (opposite) • male, female • Meronymy (part of) • tire is a meronym of car • Etc.
Word Senses • Words have multiple distinct meanings, or senses: • Plant: living plant, manufacturing plant, … • Title: name of a work, ownership document, form of address, material at the start of a film, … • Many levels of sense distinctions • Homonymy: totally unrelated meanings (river bank, money bank) • Polysemy: related meanings (star in sky, star on tv, title) • Systematic polysemy: productive meaning extensions (metonymy such as organizations to their buildings) or metaphor • Sense distinctions can be extremely subtle (or not) • Granularity of senses needed depends a lot on the task Taken from Dan Klein’s cs 288 slides
Word Sense Disambiguation • Determine which of the senses of an ambiguous word is invoked in a particular use of the word • Example: living plant vs. manufacturing plant • How do we tell these senses apart? • “Context” • The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike. • Maybe it's just text categorization • Each word sense represents a topic • Why is it important to model and disambiguate word senses? • Translation • bank → banca or riva (Italian) • Parsing • For PP attachment, for example • Information retrieval • To return documents with the right sense of bank Adapted from Dan Klein's cs 288 slides
Features • Bag-of-words (use the surrounding words with no order) • The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike. • Bag of words = {after, manufacturing, which, labor, …} • Bag-of-words classification works OK for noun senses • 90% on classic, shockingly easy examples (line, interest, star) • 80% on senseval-1 nouns • 70% on senseval-1 verbs
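A hedged sketch of bag-of-words WSD as classification, using the Senseval-2 data shipped with NLTK and a Naive Bayes classifier; this is purely illustrative and not the setup behind the accuracy figures above:

```python
import random
import nltk
from nltk.corpus import senseval   # needs: nltk.download('senseval')

def bow_features(instance):
    # Unordered bag of context words, ignoring position
    words = [w for w in instance.context if isinstance(w, tuple)]
    return {w.lower(): True for (w, tag) in words if w.isalpha()}

instances = senseval.instances('interest.pos')
featuresets = [(bow_features(inst), inst.senses[0]) for inst in instances]

random.seed(0)
random.shuffle(featuresets)
size = int(len(featuresets) * 0.9)
train_set, test_set = featuresets[:size], featuresets[size:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```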
Verb WSD • Why are verbs harder? • Verbal senses are less topical • More sensitive to structure and argument choice • Better disambiguated by their arguments (subject-object): importance of local information • For nouns, a wider context is more likely to be useful • Verb Example: “Serve” • [function] The tree stump serves as a table • [enable] The scandal served to increase his popularity • [dish] We serve meals for the homeless • [enlist] She served her country • [jail] He served six years for embezzlement • [tennis] It was Agassi's turn to serve • [legal] He was served by the sheriff • Different types of information may be appropriate for different parts of speech Adapted from Dan Klein's cs 288 slides
Better features • There are smarter features: • Argument selectional preference: • serve NP[meals] vs. serve NP[papers] vs. serve NP[country] • Subcategorization: • [function] serve PP[as] • [enable] serve VP[to] • [tennis] serve <intransitive> • [food] serve NP {PP[to]} • Can capture poorly (but robustly) with local windows… but we can also use a parser and get these features explicitly Taken from Dan Klein’s cs 288 slides