LING / C SC 439/539 Statistical Natural Language Processing Lecture 5, 1/28/2013
Recommended reading • NLTK book, chapter 6, Learning to Classify Text • http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
Outline • NLP problems as classification tasks • Computation of features • Feature choice and sparse data • Features for zero-frequency data • Written assignment #3
Solve NLP problems with classification • A diverse range of computational problems • Finding sentences in text • Identifying the language of a document • Word sense disambiguation • Part of speech tagging, etc. • Despite their differences, there is a common framework for solving all of these problems: classification
Classification • Classification: given an instance and its features, make a decision about the label to be assigned to it • Features are used to predict the label • Train a classifier (supervised learning): • Take an annotated corpus, where instances are labeled • Choose the features you think are relevant for determining the label • Apply an algorithm to train a classifier that will predict labels for new instances that you have not seen before
Training and testing sets in machine learning • Get an annotated corpus and split it into training and testing sets • Proportions: in NLP, often 90% / 10% • Use the training set to train the classifier • Use the testing set to test the classifier quantitatively (what % of cases did it get right?)
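A minimal sketch of the 90% / 10% split, assuming the annotated corpus is a list of (instance, label) pairs; the function and variable names here are illustrative, not from the course code:

import random

def split_corpus(labeled_instances, train_fraction=0.9):
    # Shuffle a copy so the caller's list is left untouched
    shuffled = labeled_instances[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, testing set)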
Example: spam detection • Training: • Apply an algorithm to learn statistical association between word stems and “spam”/”nonspam” • Creates a classifier, which can be applied to new data to make a decision about the most likely label • Testing: • Take a new document, stem the words, apply the classifier, which chooses a label
Represent an NLP problem numerically • X: instance-feature matrix, size n x m (n rows, m columns) • n training instances Xi • m feature functions fj • Xi,j = fj(Xi), i.e., the value of feature function fj for Xi • Y: vector of labels, length n
Definitions: instances, features, and labels • Instances • X = { X1, X2, …, Xn}: set of all instances in the data • Xi: an individual instance • Features • F = { f1, f2, …, fm }: a set of feature functions • Xi = [ f1(Xi), f2(Xi), …, fm(Xi) ] is a vector of features, i.e., the result of applying each feature function in F to the instance Xi • Features are used to predict the label of an instance • Labels • Set of possible labels: Y = { y1, y2, …, yk } • Binary classification: |Y| = 2 • Multiclass classification: |Y| > 2 • Classification: for an instance Xi, predict a label yi ∈ Y
Next: instances, features, and labels for each of these problems • Spam detection • Finding sentences in text • Identifying the language of a document • Word sense disambiguation • Part of speech tagging
Problem #1: Spam detection • What are the instances? X = { } • What are the labels? Y = { } • What are some relevant features? (informally)
Problem #1: Spam detection • What are the instances? X = set of e-mails • What are the labels? Y = { SPAM, NONSPAM } (binary) • What are some relevant features? (informally) • Word stems in the e-mail • Whether e-mail contains a word that is a monetary amount • Whether e-mail contains a word that is “Nigeria” • Whether e-mail contains a word in { Cialis, Viagra, enhancement }
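The NLTK book (chapter 6, recommended above) trains classifiers from dictionaries of features; a minimal sketch along those lines, assuming labeled_emails is a list of (text, label) pairs, with illustrative feature names:

import nltk

def spam_features(email_text):
    # Map an e-mail to a dictionary of binary features, NLTK-style
    words = email_text.lower().split()
    return {
        'contains(nigeria)': 'nigeria' in words,
        'contains(drug-term)': any(w in ('cialis', 'viagra', 'enhancement') for w in words),
        'has-dollar-amount': any(w.startswith('$') for w in words),
    }

# Assuming labeled_emails = [(text1, 'SPAM'), (text2, 'NONSPAM'), ...]:
# train_set = [(spam_features(text), label) for (text, label) in labeled_emails]
# classifier = nltk.NaiveBayesClassifier.train(train_set)
# classifier.classify(spam_features(new_email_text))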
Problem #2: finding sentences in text • Find individual sentences in a text document • Why: NLP modules such as part of speech taggers and parsers usually take in individual sentences as input 4. Elevator Buttons: Do you ALWAYS wash you hands after coughing or sneezing into them? Think about all the times that you couldn’t wash your hands (or didn’t have hand sanitizer) and then multiply that number by the hundreds of people who may touch the elevator in your apartment building or in your college campus’ student union. It makes us catch a cold just thinking about it. Use your knuckle, elbow, sleeve, or a tissue to push the button instead. You may look a little silly, but raising an eyebrow is better than raising your temperature. If you must touch the button, use antibacterial gel (a full teaspoon) before you touch your face or eat food.
Problem #2: finding sentences in text • What are the instances? X = { } • What are the labels? Y = { } • What are some relevant features? (informally)
Problem #2: finding sentences in text • What are the instances? X = set of occurrences of these characters: . Period ? Question mark ! Exclamation mark \n Newline character • What are the labels? Y = { sentence-boundary, not-sentence-boundary } (binary) • What are some relevant features? (informally) • Whether the next word is capitalized • Whether the character is followed by whitespace • Whether the preceding word is in a list of abbreviations
Difficulties in finding sentence boundaries • Each candidate boundary position must be labeled YES (boundary) or NO (not a boundary): • John came to the United States. He left. • John came to the U.S.A. from China. He left. • John went from China to the U.S.A. He left. • John went from China to the U.S.A.He left. • John went from China to the United States\n • John went from China to the United\n States of America\n • john went from china to the usa he left • john went from china to the usahe left • Abbreviations, missing whitespace, missing capitalization, and missing punctuation make this very hard!
Problem #2: finding sentences in text • What are the instances? X = set of all positions in the document • we want to insert sentence boundary markers into a document, so we label positions rather than marking individual characters as boundaries • What are the labels? Y = { sentence-boundary, not-sentence-boundary } (binary) • What are some relevant features? (see the sketch below) • Whether the character at the current position is in { ., !, ? } • Whether it's the last position in the document • Whether the next word is capitalized • Whether the position is followed by whitespace • Whether the preceding word is in a list of abbreviations
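A sketch of these features as a Python function over character positions; the abbreviation list and feature names are toy examples:

ABBREVIATIONS = { 'Mr.', 'Dr.', 'U.S.A.', 'etc.' }   # toy abbreviation list

def boundary_features(text, i):
    # Features for deciding whether position i in text is a sentence boundary
    before = text[:i+1].split()
    rest = text[i+1:].lstrip()
    return {
        'is-boundary-char':  text[i] in '.!?',
        'is-last-position':  i == len(text) - 1,
        'next-capitalized':  bool(rest) and rest[0].isupper(),
        'followed-by-space': i + 1 < len(text) and text[i+1].isspace(),
        'is-abbreviation':   bool(before) and before[-1] in ABBREVIATIONS,
    }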
Problem #3: predict the language of a document • Example application: • A program that translates a document into English • The user doesn't know what language the document was written in
Problem #3: predict the language of a document • What are the instances? X = { } • What are the labels? Y = { } • What are some relevant features?
Problem #3: predict the language of a document • What are the instances? X = a set of text documents • What are the labels? Y = { German, French, Chinese, … } (some finite set) This is a multiclass classification problem • What are some relevant features? • Probability of a character • Character trigrams • Distribution of consonant clusters • Presence of particular language-specific characters • Average word length
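As one concrete example, character trigram counts can be computed in a few lines of Python (a sketch; turning counts into probabilities is discussed later in this lecture):

from collections import Counter

def char_trigrams(document):
    # Count character trigrams, one of the features listed above
    text = document.lower()
    return Counter(text[i:i+3] for i in range(len(text) - 2))

# e.g. char_trigrams('der Hund') counts 'der', 'er ', 'r h', ' hu', 'hun', 'und'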
Problem #4: word sense disambiguation • Ambiguous word “bank”: • Example ... the south bank of Deer Island ... ... to a bank for liquidation . • Means “land touching water” or “money” • Applications: • Information retrieval / search: want documents within the domain that the user intended • Machine translation • Speech synthesis
Word sense disambiguation is important for machine translation • Translate into Korean: • Iraq lost the battle. Ilakuka centwey ciessta. [Iraq] [battle] [lost] • John lost his computer. John-i computer-lul ilepelyessta. [John] [computer] [misplaced]
WSD is needed in speech synthesis (text-to-speech) • … slightly elevated lead levels • Sense 1: lead role (rhymes with seed) • Sense 2: lead mines (rhymes with bed) • The speaker produces too little bass • Sense 1: string bass (rhymes with vase) • Sense 2: sea bass (rhymes with pass)
Problem #4: word sense disambiguation • What are the instances? X = { } • What are the labels? Y = { } • What are some relevant features?
Problem #4: word sense disambiguation • What are the instances? X = all occurrences of a particular potentially ambiguous word in a document Example: all occurrences of "bank" • What are the labels? Y = { sense1, sense2, …, sensek }, where senses are pre-defined in a lexical resource such as a dictionary or WordNet Example: Y = { river, financial institution } • What are some relevant features? (sketched below) • Part of speech of the word • Capitalization of the word • Immediately neighboring words • Words within an N-word window of context
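A sketch of these WSD features, assuming tokens are 'word/POS' strings as in the tagging examples later in this lecture; the names and default window size are illustrative:

def wsd_features(tokens, i, window=3):
    # Features for the (potentially ambiguous) token at index i
    word, pos = tokens[i].rsplit('/', 1)
    feats = {
        'pos': pos,
        'capitalized': word[0].isupper(),
        'w-1': tokens[i-1].rsplit('/', 1)[0] if i > 0 else '<s>',
        'w+1': tokens[i+1].rsplit('/', 1)[0] if i + 1 < len(tokens) else '</s>',
    }
    # Bag-of-words features for the N-word context window
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            feats['context(%s)' % tokens[j].rsplit('/', 1)[0]] = True
    return feats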
Problem #5: part of speech tagging • Assign a part of speech (POS) tag to each token in a sentence • Important for everything: morphological analysis, syntactic parsing, semantics, etc. • Example sentence • Input: The quick brown fox jumped over the lazy dog . • Output: The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN ./.
Problem #5: part of speech tagging • What are the instances? X = { } • What are the labels? Y = { } • What are some relevant features?
Problem #5: POS tagging (initial attempt) • What are the instances? X = the set of all tokens in a document (where a token is any contiguous sequence of non-whitespace characters) • What are the labels? Y = { NN, NNS, VB, VBZ, … } Penn Treebank POS tag set (36 tags + punctuation tags) • What are some relevant features? • The spelling of the token itself • Whether the token is capitalized • The previous token in the sentence • The next token in the sentence
Resolving POS ambiguity • Example: • My cat Mr. Nels climbed up the tree. • Please look up the information. • “up” is ambiguous • Preposition (IN) in “climbed up the tree” • Verbal particle (RP) in “look up” • Possible solution: look at the preceding word and following word
POS tag of neighboring word is useful for determining POS of current word • Counterexample: • a new look up the junction • Is "up" a verbal particle (RP) or a preposition (IN)? • "look" precedes "up", but "up" is a preposition here • POS tag of the preceding word helps to determine the tag • "look" is a verb (VB) → "up" is likely to be a verbal particle • "look" is a singular noun (NN) → "up" is likely to be a preposition • (This is a real example) http://homepage.ntlworld.com/ms.draper/Products/TurnerTexts/L.html
Later we'll approach POS tagging as sequence labeling • Since labels for words may affect decisions for other words, we want to determine all the POS tags at once; thus, the unit to be labeled is an entire sequence of words • What are the instances? X = { X1, X2, …, Xn }: the set of sentences in a document Xi = a sentence = a sequence of tokens x1, x2, …, xk • What are the labels? • Y = { Y1, Y2, …, Yn }: the set of all possible POS tag sequences for a sentence • Yi = yi1, yi2, …, yik is a POS tag sequence for a sentence Xi • Each yij ∈ a finite set of POS tags
Outline • NLP problems as classification tasks • Computation of features • Categorical features • Binary features • Counts and probabilities • Feature choice and sparse data • Features for zero-frequency data • Written assignment #3
Represent an NLP problem numerically • X: instance-feature matrix, size n x m (n rows, m columns) • n training instances Xi • m feature functions fj • Xi,j = fj(Xi), i.e., the value of feature function fj for Xi • Y: vector of labels, length n • Here is a particular instance Xi and its label Yi. This section will discuss feature values.
Feature functions and features • F = { f1, f2, …, fm }: a set of feature functions • Each has an index • A feature function fj takes in an instance Xi and outputs a feature • Xi = [ f1(Xi), f2(Xi), …, fm(Xi) ] is a vector of features • i.e., the result of applying each feature function to Xi • Row for instance Xi: [ f1(Xi), f2(Xi), f3(Xi), f4(Xi), f5(Xi), …, fm(Xi) ] with label Yi
Common types of feature values • Binary • { True, False } • { yes, no} • { 0, 1 } • Categorical • Members of a finite set, such as a set of words or POS tags • Typically means > 2 possibilities • Whole numbers: { 0, 1, 2, 3, ... } • Used for counts of features in a corpus • Real-valued, nonnegative • Used for probabilities
Example: computing categorical features; see read-features.py or read-features-numpy.py • Corpus Please/VB look/VB up/RP the/DT information/NN ./. a/DT new/JJ look/NN up/IN the/DT junction/NN • Training instances • X1 = occurrence of "up" in first sentence • X2 = occurrence of "up" in second sentence • Feature functions • Let's apply these feature functions to each training instance: • f1(Xi): returns a feature for the current word • f2(Xi): returns a feature for the previous word • f3(Xi): returns a feature for the POS tag of the previous word
Feature functions for categorical features # tokens: list of tokens # i: index of token we're computing a feature for def f_prev_w(tokens, i): prev_w = tokens[i-1].split('/')[0] return 'w-1=' + prev_w def f_curr_w(tokens, i): curr_w = tokens[i].split('/')[0] return 'w=' + curr_w def f_prev_pos(tokens, i): prev_pos = tokens[i-1].split('/')[1] return 'pos-1=' + prev_pos
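As a quick check, applying these functions to the first occurrence of "up" (token index 2 in the first sentence of the corpus above):

tokens = 'Please/VB look/VB up/RP the/DT information/NN ./.'.split()
print(f_curr_w(tokens, 2))    # 'w=up'
print(f_prev_w(tokens, 2))    # 'w-1=look'
print(f_prev_pos(tokens, 2))  # 'pos-1=VB'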
Computed features • Example instance: the word "up" in this token sequence Please/VB look/VB up/RP the/DT information/NN ./. • Features for "up": 3 strings • 'w-1=look' • 'w=up' • 'pos-1=VB' • N.B.: these features are atomic • Not hierarchically constructed, no substructure to them • I've written the features like this to make them easy to read • Could have named them 'feature59', 'xyzq', '12345', etc.
Construct instance-feature matrix X and label matrix Y • Initialization: • X = empty matrix • Y = empty vector • Go through the training data • When you find an instance Xi: • Apply feature functions f1, …, fm • Construct a vector of feature values: Xi = [ f1(Xi), …, fm(Xi) ] • Store features in matrix as X[i, :] • Store label Yi as Y[i]
Completed instance-feature matrix X and label vector Y • Corpus Please/VB look/VB up/RP the/DT information/NN ./. a/DT new/JJ look/NN up/IN the/DT junction/NN • The matrix X (one row per instance, one column per feature function) and label vector Y:

      f_prev_w(Xi)   f_curr_w(Xi)   f_prev_pos(Xi)   Y
X1    'w-1=look'     'w=up'         'pos-1=VB'       'RP'
X2    'w-1=look'     'w=up'         'pos-1=NN'       'IN'
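A sketch of the construction loop from the previous slide, using the feature functions defined above and numpy (in the spirit of read-features-numpy.py; the variable names here are illustrative):

import numpy as np

feature_functions = [f_prev_w, f_curr_w, f_prev_pos]

sent1 = 'Please/VB look/VB up/RP the/DT information/NN ./.'.split()
sent2 = 'a/DT new/JJ look/NN up/IN the/DT junction/NN'.split()
instances = [(sent1, 2), (sent2, 3)]   # the two occurrences of "up"
labels = ['RP', 'IN']

rows = [[f(tokens, i) for f in feature_functions] for (tokens, i) in instances]
X = np.array(rows)     # instance-feature matrix, n x m
Y = np.array(labels)   # label vector, length n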
This example uses categorical features • Possible values of f_prev_pos (45 choices): { 'pos-1=NN', 'pos-1=JJ', 'pos-1=VB', 'pos-1=DT', 'pos-1=RB', 'pos-1=IN', … } • The feature function f_prev_pos(X) returns exactly one of these values, e.g. 'pos-1=NN' or 'pos-1=VB'
Outline • NLP problems as classification tasks • Computation of features • Categorical features • Binary features • Counts and probabilities • Feature choice and sparse data • Features for zero-frequency data • Written assignment #3
Binary features • Many algorithms use binary features (two values) • Less computationally complex, in certain ways • How can we have binary features when our feature functions output more than 2 values? • We can turn a feature function that outputs more than 2 categorical values into a larger set of binary feature functions
Example: turn categorical feature into binary features • Suppose that f_prev_pos has 3 possible values: { 'pos-1=NN', 'pos-1=JJ', 'pos-1=VB'} • For each value, create a new feature function with binary-valued output: • f_prev_pos_NN: { 0, 1 } • f_prev_pos_JJ: { 0, 1 } • f_prev_pos_VB: { 0, 1 } • Remove the old feature function
One 3-valued feature (left) vs. three binary features (right)

f_prev_pos(Xi)    f_prev_pos_NN(Xi)   f_prev_pos_JJ(Xi)   f_prev_pos_VB(Xi)
'pos-1=NN'        1                   0                   0
'pos-1=JJ'        0                   1                   0
'pos-1=VB'        0                   0                   1
Code for a binary feature function def f_prev_pos_NN(tokens, i): prev_pos = tokens[i-1].split('/')[1] if prev_pos=='NN': return 1 else: return 0
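Generalizing this pattern, a sketch that builds one binary feature function per value of any categorical feature function (one-hot encoding); the helper name is illustrative:

def make_binary_functions(feature_function, values):
    # One binary feature function per possible categorical value
    def make_one(value):
        def binary_feature(tokens, i):
            return 1 if feature_function(tokens, i) == value else 0
        return binary_feature
    return [make_one(v) for v in values]

# e.g., replacing the 3-valued f_prev_pos with three binary functions:
# f_NN, f_JJ, f_VB = make_binary_functions(f_prev_pos, ['pos-1=NN', 'pos-1=JJ', 'pos-1=VB'])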
Outline • NLP problems as classification tasks • Computation of features • Categorical features • Binary features • Counts and probabilities • Feature choice and sparse data • Features for zero-frequency data • Written assignment #3
Counts and probabilities as features • We often store counts of occurrences of a particular feature • These counts are often turned into probability distributions (normalized)
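A sketch of the normalization step, turning a Counter of feature counts into a probability distribution:

from collections import Counter

def normalize(counts):
    # Turn counts into a probability distribution (values sum to 1)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

# e.g. normalize(Counter('mississippi')) gives P('i') = P('s') = 4/11, P('p') = 2/11, P('m') = 1/11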
Problem #3: predict the language of a document • What are the instances? X = a set of text documents • What are the labels? Y = { German, French, Chinese, … } (some finite set) This is a multiclass classification problem • What are some relevant features? • Probability of a character • Character trigrams • Distribution of consonant clusters • Presence of particular language-specific characters • Average word length
Training corpus • Training data: a set of documents, and the language they are written in • (doc1, French) • (doc2, English) • (doc3, English) • (doc4, French) • (doc5, Spanish) • …