LING / C SC 439/539 Statistical Natural Language Processing Lecture 4 1/23/2013
Recommended reading • Manning & Schütze • Chapter 2, Mathematical Foundations • 2.1.7, independence • (2.2.2-2.2.3, joint and conditional entropy, mutual information) • Chapter 5, Collocations • Entire chapter • 5.4, Pointwise mutual information
Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2
Collocations • http://grammar.about.com/od/c/g/collocationterm.htm • A collocation is a sequence of words that intuitively forms a single lexical unit • Examples: • New York City • Hillary Clinton • heat sink • iron will • weapons of mass destruction • throw up • make a decision • once upon a time • kick the bucket • Some collocations can be discontinuous: • Make a prompt decision
Types of collocations • Proper names • New York City, Hillary Clinton • Light verbs (have little semantic content on their own) • make a decision • do a favor • Verb-particle constructions / phrasal verbs • throw up, take out, bring in • Terminological expressions • heat sink (a computer part) • Idioms • kick the bucket • throw the baby out with the bath water • Other multi-word phrases • weapons of mass destruction • once upon a time
Properties of collocations • Limited compositionality • Composition: meaning of whole = sum of parts • Example: under a compositional interpretation of “kick the bucket”, you would literally kick a bucket • Non-substitutability • Literal meaning: kick the pail, punch the bucket • iron will → chalk will ??? • Non-modifiability • swiftly kick the bucket • kick the yellow bucket • Cannot translate directly • “make a decision” translated word-for-word into French is faire une décision, which is incorrect • Correct: prendre une décision
Non-collocations • Example: big house • Not a collocation because we can produce counterexamples to the properties of collocations: • Limited compositionality • “big” modifies “house” • Non-substitutability • small house • big apartment • Non-modifiability • big expensive house • Cannot translate directly • French: grande maison
Collocation vs. non-collocation isn’t always clear • Example: light bulb • Fails these properties of collocations: • Limited compositionality • “light” modifies “bulb” • Non-substitutability • flower bulb • Non-modifiability • cheap light bulb • But passes this one: • Cannot translate directly • French: ampoule (one word)
Finding collocations empirically • Look in a big corpus • Manning & Schütze examples: 14 million words of New York Times text • Compute statistics to find collocations (this lecture covers the chi-square test and pointwise mutual information): • N-gram frequency • N-gram frequency with POS filter • Mean-variance method for discontinuous collocations • Hypothesis testing • t-test • Chi-square test • Likelihood ratios, relative frequency ratios • Pointwise mutual information
Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2
Independence • Two random variables A and B are independent if p(A, B) = p(A) * p(B) • i.e., if the joint probability equals the product of the marginal probabilities • “Independent”: one random variable has no effect on the distribution of the other
Independence: example • Flip a fair coin: p(heads) = .5, p(tails) = .5 • Flip the coin twice. • Let X be the random variable for the 1st flip. • Let Y be the random variable for the 2nd flip. • The two flips don’t influence each other, so you would expect that p(X, Y) = p(X) * p(Y) • p(X=heads, Y=tails) = p(X=heads) * p(Y=tails) = .5*.5 = .25
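A quick empirical check (not from the original slides; a sketch in R): simulate many pairs of independent fair-coin flips and compare the empirical joint frequency with the product of the marginals.
> set.seed(1)                                        # for reproducibility
> x <- sample(c("H", "T"), 100000, replace = TRUE)   # 1st flip of each pair
> y <- sample(c("H", "T"), 100000, replace = TRUE)   # 2nd flip of each pair
> mean(x == "H" & y == "T")                          # empirical p(X=heads, Y=tails), close to .25
> mean(x == "H") * mean(y == "T")                    # product of marginals, also close to .25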
Non-independence: example • Suppose a class has a midterm and a final, and the final is cumulative. No one drops out of the class. • Midterm: 200 pass, 130 fail • Final: 180 pass, 150 fail • Contingency table shows marginal total counts • Rate of failure increases over time
p(MIDTERM, FINAL) • This table shows values for joint probability • Divide each cell’s count by total count of 330 • Margins show marginal probabilities • Example: p(MIDTERM=fail) = 0.394
p(MIDTERM) * p(FINAL) • Suppose MIDTERM and FINAL are independent. • Then p(MIDTERM, FINAL) = p(MIDTERM) * p(FINAL) • Expected probabilities assuming independence: For each cell, p(MIDTERM=x, FINAL=y) = p(MIDTERM=x) * p(FINAL=y) Example: p(MIDTERM=fail, FINAL=pass) = p(MIDTERM=fail) * p(FINAL=pass) = .394 * .545 = .215
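The same computation can be sketched in R, using the observed cell counts that appear later in the chi-square example (120, 80, 60, 70; rows = MIDTERM pass/fail, columns = FINAL pass/fail). The table expected under independence is the outer product of the marginal probabilities.
> A <- matrix(c(120, 80, 60, 70), nrow = 2, byrow = TRUE)  # observed counts
> joint <- A / sum(A)                                      # observed joint probabilities
> expected <- outer(rowSums(joint), colSums(joint))        # p(MIDTERM=x) * p(FINAL=y)
> joint[2, 1]                                              # p(fail, pass) = 60/330 = 0.182
> expected[2, 1]                                           # 0.394 * 0.545 = 0.215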
MIDTERM and FINAL are not independent: p(MIDTERM, FINAL) != p(MIDTERM) * p(FINAL) • Observed probabilities vs. joint probabilities under independence
Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(B|A) = p(A, B) / p(A) • Probability of events of B, restricted to events of A • For numerator, only consider events that occur in both A and B
Product rule • Conditional probability: P(B | A) = P(A, B) / P(A) • Product rule: P(A) * P(B | A) = P(A, B) • The product rule generates a joint probability from an unconditional probability and a conditional probability
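A worked check of the product rule on the midterm/final data (using the cell counts 120, 80, 60, 70 shown later): P(MIDTERM=fail) * P(FINAL=pass | MIDTERM=fail) = (130/330) * (60/130) = 60/330 ≈ 0.182 = P(MIDTERM=fail, FINAL=pass)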
Product rule, conditional probability, and independence • Product rule: P(A) * P(B | A) = P(A, B) • Suppose A and B are independent: P(A) * P(B) = P(A, B) • Then p(B | A) = p(B) • Explanation: B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events also in B does not change from the unrestricted sample space.
Conditional probability and independence • B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events in B does not change. • Example: • p(COLOR=blue) = 3/9 = 1/3 • P(COLOR=blue|SHAPE=square) = 1/3 • P(COLOR=blue|SHAPE=circle) = 1/3 • p(COLOR=red) = 6/9 = 2/3 • P(COLOR=red|SHAPE=square) = 2/3 • P(COLOR=red|SHAPE=circle) = 2/3 • Therefore p(COLOR) = p(COLOR|SHAPE)
Independence and collocations • We can use independence to determine if a word sequence is a collocation • Pick two words at random. Count the number of times they occur together in a corpus. • If words are a collocation: words should occur more often together than expected by chance • If words are not a collocation: words should occur together only at chance frequency
Independence and collocations • Example (Manning & Schütze): new companies is not a collocation • Corpus has 14307668 tokens • Observed probabilities: • p(new) = 15828 / 14307668 • p(companies) = 4675 / 14307668 • p(new companies) = 8 / 14307668 = 5.5914 × 10^-7 • p(new companies) under independence: • p(new companies) = p(new) * p(companies) = 3.615 × 10^-7
Independence and collocations • new companies is not a collocation, so new and companies should be independent • But they are not exactly independent in the corpus: • p(new companies) = 8 / 14307668 = 5.5914 × 10^-7 • p(new companies) under independence: p(new companies) = p(new) * p(companies) = 3.615 × 10^-7 • Real-world data is never exactly independent • But is it “close enough” to independent? • Need hypothesis testing • Simply checking for non-independence is not a useful criterion for discovering collocations; we need to test whether the deviation from independence is statistically significant
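These numbers are straightforward to reproduce in R (a sketch based on the counts above):
> N <- 14307668
> p_new <- 15828 / N                  # p(new)
> p_companies <- 4675 / N             # p(companies)
> 8 / N                               # observed p(new companies), about 5.59e-07
> p_new * p_companies                 # expected under independence, about 3.61e-07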
Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2
Hypothesis testing • http://en.wikipedia.org/wiki/Statistical_hypothesis_testing • We’ve gathered data, and now want to test whether some effect is present in the variable(s) • Formulate an alternative hypothesis H1 • Is the effect statistically significant? • Test against a null hypothesis H0 • If the result is statistically significant, reject the null hypothesis (i.e., the observed effect is unlikely to have occurred at random)
Hypothesis testing for collocations • H0: null hypothesis • Words are independent: p(w1 w2) = p(w1)*p(w2) • H1: alternative hypothesis • Words are not independent: p(w1 w2) != p(w1)*p(w2) • Apply various statistical tests to determine whether or not we can reject the null hypothesis • (By the way, hypothesis testing is seen in corpus linguistics, but rarely in applied statistical NLP)
Switch examples: MIDTERM and FINAL are not independent: p(MIDTERM, FINAL) != p(MIDTERM) * p(FINAL) • Observed probabilities vs. joint probabilities under independence • H0: MIDTERM and FINAL are independent • H1: MIDTERM and FINAL are not independent
Observed and expected counts • Observed counts vs. expected counts under independence • CountExp(MIDTERM=x, FINAL=y) = p(MIDTERM=x) * p(FINAL=y) * 330 (330 total students)
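The expected counts can also be obtained directly from R's chisq.test, which returns them as a component of its result (a sketch; observed counts as in the slides below):
> A <- matrix(c(120, 80, 60, 70), nrow = 2, byrow = TRUE)
> chisq.test(A, correct = FALSE)$expected      # about 109.1, 90.9 / 70.9, 59.1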
Visualization of observed and expected counts (different data set) • Is the difference in proportions enough that it is unlikely to have occurred by chance?
Chi-square test for independence of random variables • http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test • Compares observed counts in contingency table with the expected counts, under the null hypothesis that the random variables are independent
Chi-square test • Compare the differences between observed and expected counts through a single number, called the chi-square test statistic. • Degrees of freedom for an r × c contingency table = (r-1)(c-1) • For the 2 × 2 table here, degrees of freedom = (2-1)(2-1) = 1
Calculate chi-square test statistic • Observed counts vs. expected counts under independence Chi-sq = (120-109.09)^2/109.09 + (80-90.91)^2/90.91 + (60-70.91)^2/70.91 + (70-59.09)^2/59.09 ≈ 6.09 Degrees of freedom = (2-1)(2-1) = 1
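The same statistic can be computed by hand in R (a sketch; the expected counts are row total * column total / 330):
> observed <- c(120, 80, 60, 70)
> expected <- c(109.09, 90.91, 70.91, 59.09)
> sum((observed - expected)^2 / expected)      # about 6.09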
Chi-square statistic and independence • The more independent the data is, the lower the value of the chi-square statistic • At the extreme, when observedi = expectedi for all data, X2 = 0. • The less independent the data is, the higher the value of the chi-square statistic
Probability distribution for chi-square statistic, for different degrees of freedom
p-values • For a particular value of degrees of freedom, there is a probability distribution over the chi-square statistic • High values of the statistic have low probability • A p-value is the probability of obtaining a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true • If the p-value is sufficiently low (usually below .05 or .01), the observed data would be very unlikely under the null hypothesis • We reject the null hypothesis and adopt the alternative hypothesis
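In R, the p-value can be read off the chi-square distribution directly; a sketch for the midterm/final statistic computed above:
> pchisq(6.0923, df = 1, lower.tail = FALSE)   # p-value, about 0.014
> qchisq(0.95, df = 1)                         # critical value at the .05 level, about 3.84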
Chi-square test in R (http://www.r-project.org/)
> A = matrix(c(120,80,60,70), nrow=2, ncol=2, byrow=TRUE)
> A
     [,1] [,2]
[1,]  120   80
[2,]   60   70
> # (don't use continuity correction)
> chisq.test(A, correct=FALSE)
        Pearson's Chi-squared test
data:  A
X-squared = 6.0923, df = 1, p-value = 0.01358
• p-value < 0.05, reject null hypothesis: it’s unlikely that MIDTERM and FINAL are independent
Apply Chi-square test to expected counts
> B = matrix(c(109, 91, 71, 59), nrow=2, ncol=2, byrow=TRUE)
> B
     [,1] [,2]
[1,]  109   91
[2,]   71   59
> chisq.test(B, correct=FALSE)
        Pearson's Chi-squared test
data:  B
X-squared = 4e-04, df = 1, p-value = 0.9836
• H0: MIDTERM and FINAL are independent • p-value not < 0.05, so cannot reject null hypothesis
Chi-square test applied to collocation • new occurs 15828 times • companies occurs 4675 times • 14307668 total bigrams • new companies occurs 8 times
Apply Chi-square test to collocation
> C = matrix(c(8, 4667, 15820, 14287181), nrow=2, ncol=2, byrow=TRUE)
> C
      [,1]     [,2]
[1,]     8     4667
[2,] 15820 14287181
> chisq.test(C, correct=FALSE)
        Pearson's Chi-squared test
data:  C
X-squared = 1.5489, df = 1, p-value = 0.2133
• H0: w1=new and w2=companies are independent • p-value not < 0.05, so cannot reject null hypothesis
Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2
Pointwise mutual information • Not the same as mutual information (Manning & Schütze 2.2.3) – we’ll see that later • Let x and y be particular values of random variables (i.e., events) • I(x, y) = log2 [ p(x, y) / (p(x) p(y)) ]
Pointwise mutual information • If log2 p(x) = y, then 2^y = p(x) • Interpret as the number of bits needed to specify p(x) • As p(x) increases, log2 p(x) increases • I(x, y) = log2 [ p(x, y) / (p(x) p(y)) ] = the number of bits needed to specify the ratio between the joint probability and the product of the marginals
Pointwise mutual information, independence, and collocations • I(x, y) = log2 [ p(x, y) / (p(x) p(y)) ] • Suppose x and y are independent. • Then p(x, y) = p(x) * p(y). • Therefore, p(x,y) / p(x)p(y) = 1. • I(x, y) = log2 [ p(x,y) / p(x)p(y) ] = log2 1 = 0. • Suppose x and y occur together more frequently than chance. • Then p(x, y) > p(x) * p(y). • Therefore, p(x,y) / p(x)p(y) > 1. • I(x, y) = log2 [ p(x,y) / p(x)p(y) ] > 0. • The higher the pointwise mutual information is for an n-gram, the more likely it is to be a collocation
Pointwise mutual information • I(x, y) = log2 [ p(x, y) / (p(x) p(y)) ] = log2 p(x, y) – log2 [ p(x) p(y) ] = # of additional bits needed to specify the joint probability of x and y, given the # of bits for the product of the marginals • If the collocation xy occurs more frequently than chance, the log2 p(x, y) term is greater than log2 p(x)p(y)
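A minimal R sketch of this definition, written over raw counts (the function name pmi and its arguments are illustrative, not from the original slides):
> pmi <- function(c_xy, c_x, c_y, N) {
+   log2((c_xy / N) / ((c_x / N) * (c_y / N)))   # log2 [ p(x,y) / (p(x) p(y)) ]
+ }
> pmi(8, 15828, 4675, 14307668)                  # new companies: about 0.63, barely above chance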
Example: Ayatollah Ruhollah • (Manning & Schütze) • p(Ayatollah) = 42 / 14307668 • p(Ruhollah) = 20 / 14307668 • p(Ayatollah, Ruhollah) = 20 / 14307668 (the bigram cannot occur more often than Ruhollah itself)
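Plugging these counts into the definition (a worked check; 18.38 is the value Manning & Schütze report): I(Ayatollah, Ruhollah) = log2 [ (20/14307668) / ((42/14307668) * (20/14307668)) ] = log2 (14307668 / 42) ≈ 18.38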
Rank collocations according to pointwise mutual information • (Manning & Schutze)
Problem: sensitivity to sparse data • Pointwise mutual information for bigrams in entire corpus (23,000 documents) • PMI is high for low-frequency bigrams that are not collocations
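An illustration of the sparsity problem (not from the original slides): suppose each of two words occurs exactly once in the corpus, and that single occurrence happens to be as a bigram. Then p(x, y) = p(x) = p(y) = 1/N, so I(x, y) = log2 [ (1/N) / (1/N * 1/N) ] = log2 N = log2 14307668 ≈ 23.8, which is higher than the score for Ayatollah Ruhollah (≈ 18.4), even though a single co-occurrence is very weak evidence of a collocation.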