
LING / C SC 439/539 Statistical Natural Language Processing




  1. LING / C SC 439/539 Statistical Natural Language Processing Lecture 4 1/23/2013

  2. Recommended reading • Manning & Schütze • Chapter 2, Mathematical Foundations • 2.1.7, independence • (2.2.2-2.2.3, joint and conditional entropy, mutual information) • Chapter 5, Collocations • Entire chapter • 5.4, Pointwise mutual information

  3. Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2

  4. Collocations • http://grammar.about.com/od/c/g/collocationterm.htm • A collocation is a sequence of words that intuitively is a single lexical unit • Examples: • New York City • Hillary Clinton • heat sink • iron will • weapons of mass destruction • throw up • make a decision • once upon a time • kick the bucket • Some collocations can be discontinuous: • Make a prompt decision

  5. Types of collocations • Proper names • New York City, Hillary Clinton • Light verbs (have little semantic content on their own) • make a decision • do a favor • Verb-particle constructions / phrasal verbs • throw up, take out, bring in • Terminological expressions • heat sink (a computer part) • Idioms • kick the bucket • throw the baby out with the bath water • Other multi-word phrases • weapons of mass destruction • once upon a time

  6. Properties of collocations • Limited compositionality • Composition: meaning of whole = sum of parts • Example: under a compositional interpretation of “kick the bucket”, you would literally kick a bucket • Non-substitutability • Literal meaning: kick the pail, punch the bucket • iron will → chalk will ??? • Non-modifiability • swiftly kick the bucket • kick the yellow bucket • Cannot translate directly • “make a decision” translated word-for-word into French is faire une décision, which is incorrect • Correct: prendre une décision

  7. Non-collocations • Example: big house • Not a collocation because we can produce counterexamples to properties of collocations: • Limited compositionality • “big” modifies “house” • Non-substitutability • small house • big apartment • Non-modifiability • big expensive house • Cannot translate directly • French: grande maison

  8. Collocation vs. non-collocation isn’t always clear • Example: light bulb • Fails these properties of collocations: • Limited compositionality • “Light” modifies bulb • Non-substitutability • Flower bulb • Non-modifiability • Cheap light bulb • But passes this one: • Cannot translate directly • French: ampoule (one word)

  9. Finding collocations empirically • Look in a big corpus • Manning & Schütze examples: 14 million words of New York Times text • Compute statistics to find collocations: • N-gram frequency • N-gram frequency with POS filter • Mean-variance method for discontinuous collocations • Hypothesis testing • t-test • Chi-square test • Likelihood ratios, relative frequency ratios • Pointwise mutual information

  10. N-gram frequency

  11. Part of speech filter

  12. N-grams by frequency, after applying part of speech filter
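As a rough illustration of how counts like the ones on these slides are produced, here is a minimal R sketch (not the book's code; the toy sentence, its POS tags, and the simplified adjective/noun tag filter in the style of the Justeson & Katz patterns discussed in Manning & Schütze are all assumptions for illustration):

  # Count bigram frequencies from a token vector, then keep only bigrams
  # whose POS-tag pair matches an adjective/noun + noun pattern.
  tokens <- c("New", "York", "City", "has", "a", "new", "subway", "line")
  tags   <- c("NNP", "NNP", "NNP", "VBZ", "DT", "JJ", "NN", "NN")   # assumed toy tags
  bigrams   <- paste(head(tokens, -1), tail(tokens, -1))            # adjacent word pairs
  tag_pairs <- paste(head(tags, -1), tail(tags, -1))                # their tag pairs
  keep <- grepl("^(JJ|NN|NNP|NNS) (NN|NNP|NNS)$", tag_pairs)        # simplified POS filter
  sort(table(bigrams[keep]), decreasing = TRUE)                     # filtered bigram counts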

  13. Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2

  14. Independence • Two random variables A and B are independent if p(A, B) = p(A) * p(B) • i.e., if the joint probability equals the product of the marginal probabilities • “Independent”: a random variable has no effect on the distribution of another random variable

  15. Independence: example • Flip a fair coin: p(heads) = .5, p(tails) = .5 • Flip the coin twice. • Let X be the random variable for the 1st flip. • Let Y be the random variable for the 2nd flip. • The two flips don’t influence each other, so you would expect that p(X, Y) = p(X) * p(Y) • p(X=heads, Y=tails) = p(X=heads) * p(Y=tails) = .5*.5 = .25
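A quick simulation (an illustrative sketch only) makes the same point: for two independent fair-coin flips, the fraction of (heads, tails) outcomes comes out close to .25.

  set.seed(1)                                        # for reproducibility
  x <- sample(c("H", "T"), 100000, replace = TRUE)   # first flip
  y <- sample(c("H", "T"), 100000, replace = TRUE)   # second flip
  mean(x == "H" & y == "T")                          # close to 0.5 * 0.5 = 0.25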

  16. Non-independence: example • Suppose a class has a midterm and a final, and the final is cumulative. No one drops out of the class. • Midterm: 200 pass, 130 fail • Final: 180 pass, 150 fail • Contingency table shows marginal total counts • Rate of failure increases over time

  17. p(MIDTERM, FINAL) • This table shows values for joint probability • Divide each cell’s count by total count of 330 • Margins show marginal probabilities • Example: p(MIDTERM=fail) = 0.394

  18. p(MIDTERM) * p(FINAL) • Suppose MIDTERM and FINAL are independent. • Then p(MIDTERM, FINAL) = p(MIDTERM) * p(FINAL) • Expected probabilities assuming independence: For each cell, p(MIDTERM=x, FINAL=y) = p(MIDTERM=x) * p(FINAL=y) Example: p(MIDTERM=fail, FINAL=pass) = p(MIDTERM=fail) * p(FINAL=pass) = .394 * .545 = .215
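The observed joint table and the product-of-marginals table can be reconstructed in R from the cell counts that appear on the chi-square slides later in the lecture (120, 80, 60, 70). This is a sketch, assuming rows = MIDTERM and columns = FINAL, as in the chisq.test example below:

  counts <- matrix(c(120, 80, 60, 70), nrow = 2, byrow = TRUE,
                   dimnames = list(MIDTERM = c("pass", "fail"),
                                   FINAL   = c("pass", "fail")))
  joint   <- counts / sum(counts)   # observed joint probabilities
  p_mid   <- rowSums(joint)         # p(MIDTERM): 0.606, 0.394
  p_final <- colSums(joint)         # p(FINAL):   0.545, 0.455
  outer(p_mid, p_final)             # joint probabilities under independence
  # e.g. p(MIDTERM=fail) * p(FINAL=pass) = 0.394 * 0.545 = 0.215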

  19. MIDTERM and FINAL are not independent: p(MIDTERM, FINAL) != p(MIDTERM) * p(FINAL) • Tables compared (not shown): observed joint probability vs. joint probability under independence

  20. Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(B|A) = p(A, B) / p(A) • Probability of events of B, restricted to events of A • For the numerator, only consider events that occur in both A and B • (Venn diagram, not shown: A, B, and their overlap A & B)
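As a quick numeric illustration (a sketch using the same midterm/final counts as above: 130 students fail the midterm, 70 of whom also fail the final):

  # p(FINAL=fail | MIDTERM=fail) = p(MIDTERM=fail, FINAL=fail) / p(MIDTERM=fail)
  (70 / 330) / (130 / 330)    # = 70/130, about 0.538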

  21. Product rule • Conditional probability: P(B | A) = P(A, B) / P(A) • Product rule: P(A) * P(B | A) = P(A, B) • The product rule generates a joint probability from an unconditional probability and a conditional probability

  22. Product rule, conditional probability, and independence • Product rule: P(A) * P(B | A) = P(A, B) • Suppose A and B are independent: P(A) * P(B) = P(A, B) • Then p(B | A) = p(B) • Explanation: B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events also in B does not change from the unrestricted sample space.

  23. Conditional probability and independence • B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events in B does not change. • Example: • p(COLOR=blue) = 3/9 = 1/3 • P(COLOR=blue|SHAPE=square) = 1/3 • P(COLOR=blue|SHAPE=circle) = 1/3 • p(COLOR=red) = 6/9 = 2/3 • P(COLOR=red|SHAPE=square) = 2/3 • P(COLOR=red|SHAPE=circle) = 2/3 • Therefore p(COLOR) = p(COLOR|SHAPE)

  24. Independence and collocations • We can use independence to determine if a word sequence is a collocation • Pick two words at random. Count the number of times they occur together in a corpus. • If words are a collocation: words should occur more often together than expected by chance • If words are not a collocation: words should occur together only at chance frequency

  25. Independence and collocations • Example (Manning & Schütze): new companies is not a collocation • Corpus has 14307668 tokens • Observed probabilities: • p(new) = 15828 / 14307668 • p(companies) = 4675 / 14307668 • p(new companies) = 8 / 14307668 = 5.5914 * 10^-7 • p(new companies) under independence: • p(new companies) = p(new) * p(companies) = 3.615 * 10^-7
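These probabilities are easy to reproduce (a small sketch of the arithmetic, using the counts quoted above):

  N <- 14307668                       # total tokens in the corpus
  p_new       <- 15828 / N
  p_companies <- 4675 / N
  p_observed  <- 8 / N                # observed probability of "new companies", ~5.59e-07
  p_expected  <- p_new * p_companies  # probability under independence, ~3.62e-07
  c(observed = p_observed, expected = p_expected)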

  26. Independence and collocations • new companies is not a collocation, so new and companies should be independent • But they are not independent in the corpus: • p(new companies) = 8 / 14307668 = 5.5914 * 10^-7 • p(new companies) under independence: p(new companies) = p(new) * p(companies) = 3.615 * 10^-7 • Real-world data is never exactly independent • But is it “close enough” to independent? • Need hypothesis testing • Non-independence is too strict of a criterion for discovering collocations

  27. Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2

  28. Hypothesis testing • http://en.wikipedia.org/wiki/Statistical_hypothesis_testing • We’ve gathered data, and now want to test whether some effect has occurred on variable(s) • Formulate an alternative hypothesis H1 • Is the effect statistically significant? • Test against a null hypothesis H0 • If statistically significant, reject the null hypothesis (i.e., the data are unlikely to have occurred at random)

  29. Hypothesis testing for collocations • H0: null hypothesis • Words are independent: p(w1 w2) = p(w1)*p(w2) • H1: alternative hypothesis • Words are not independent: p(w1 w2) != p(w1)*p(w2) • Apply various statistical tests to determine whether or not we can reject the null hypothesis • (By the way, hypothesis testing is seen in corpus linguistics, but rarely in applied statistical NLP)

  30. Switch examples: MIDTERM and FINAL are not independent: p(MIDTERM, FINAL) != p(MIDTERM) * p(FINAL) • Tables compared (not shown): observed joint probability vs. joint probability under independence • H0: MIDTERM and FINAL are independent • H1: MIDTERM and FINAL are not independent

  31. Observed and expected counts • Observed counts vs. expected counts under independence (tables not shown) • Count_expected(MIDTERM=x, FINAL=y) = p(MIDTERM=x) * p(FINAL=y) * 330 (330 total students)

  32. Visualization of observed and expected counts (different data set) • Is the difference in proportions enough that it is unlikely to have occurred by chance?

  33. Chi-square test for independence of random variables • http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test • Compares observed counts in contingency table with the expected counts, under the null hypothesis that the random variables are independent

  34. Chi-square test • Compare the differences between observed and expected counts through a single number, called the chi-square test statistic: X^2 = sum over all cells i of (observed_i - expected_i)^2 / expected_i • Degrees of freedom for an r x c contingency table = (r-1)(c-1); here, with two binary random variables, df = (2-1)(2-1) = 1

  35. Calculate chi-square test statistic • Observed counts vs. expected counts under independence (tables not shown) • Chi-sq = (120-109.09)^2/109.09 + (80-90.91)^2/90.91 + (60-70.91)^2/70.91 + (70-59.09)^2/59.09 = 6.093 • Degrees of freedom = (2-1)(2-1) = 1
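The same statistic can be computed directly in R from the observed table (a sketch; the expected counts come from the products of the marginals):

  obs      <- matrix(c(120, 80, 60, 70), nrow = 2, byrow = TRUE)
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected counts under independence
  sum((obs - expected)^2 / expected)                         # 6.0923, as chisq.test() reports below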

  36. Chi-square statistic and independence • The more independent the data is, the lower the value of the chi-square statistic • At the extreme, when observed_i = expected_i for every cell i, X^2 = 0 • The less independent the data is, the higher the value of the chi-square statistic

  37. Probability distribution for chi-square statistic, for different degrees of freedom

  38. p-values • For a particular number of degrees of freedom, there is a probability distribution over the chi-square statistic • High values of the statistic have low probability • A p-value is the probability of obtaining a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true • If the p-value is sufficiently low (usually below .05 or .01), the observed data would be very unlikely under the null hypothesis • We then reject the null hypothesis and adopt the alternative hypothesis

  39. Chi-square test in R (http://www.r-project.org/)
  > A = matrix(c(120,80,60,70), nrow=2, ncol=2, byrow=TRUE)
  > A
       [,1] [,2]
  [1,]  120   80
  [2,]   60   70
  > # (don't use continuity correction)
  > chisq.test(A, correct=FALSE)
          Pearson's Chi-squared test
  data:  A
  X-squared = 6.0923, df = 1, p-value = 0.01358
  • p-value < 0.05, reject null hypothesis: it’s unlikely that MIDTERM and FINAL are independent

  40. Apply Chi-square test to expected counts
  > B = matrix(c(109, 91, 71, 59), nrow=2, ncol=2, byrow=TRUE)
  > B
       [,1] [,2]
  [1,]  109   91
  [2,]   71   59
  > chisq.test(B, correct=FALSE)
          Pearson's Chi-squared test
  data:  B
  X-squared = 4e-04, df = 1, p-value = 0.9836
  • H0: MIDTERM and FINAL are independent
  • p-value not < 0.05, so cannot reject null hypothesis

  41. Chi-square test applied to collocation • new occurs 15828 times • companies occurs 4675 times • 14307668 total bigrams • new companies occurs 8 times

  42. Apply Chi-square test to collocation
  > C = matrix(c(8, 4667, 15820, 14287181), nrow=2, ncol=2, byrow=TRUE)
  > C
        [,1]     [,2]
  [1,]     8     4667
  [2,] 15820 14287181
  > chisq.test(C, correct=FALSE)
          Pearson's Chi-squared test
  data:  C
  X-squared = 1.5489, df = 1, p-value = 0.2133
  • H0: w1=new and w2=companies are independent
  • p-value not < 0.05, so cannot reject null hypothesis

  43. Outline • Collocations • Probability: independence and product rule • Hypothesis testing and Chi-square test • Pointwise mutual information • Written Assignment #2

  44. Pointwise mutual information • Not the same as mutual information (Manning & Schütze 2.2.3) – we’ll see that later • Let x and y be particular values of random variables (i.e., events) • I(x, y) = log2 [ p(x, y) / p(x)p(y) ]

  45. Pointwise mutual information • log2 p(x) = y means that 2^y = p(x) • Interpret as # of bits needed to specify p(x) • As p(x) increases, log2 p(x) increases • I(x, y) = log2 [ p(x, y) / p(x)p(y) ] = # of bits to specify the ratio between the joint probability and the product of the marginals

  46. Pointwise mutual inf., independence, and collocations • I(x, y) = log2 [ p(x, y) / p(x)p(y) ] • Suppose x and y are independent. • Then p(x, y) = p(x) * p(y). • Therefore, p(x,y) / p(x)p(y) = 1. • I(x, y) = log2 [ p(x,y) / p(x)p(y) ] = log2 1 = 0. • Suppose x and y occur together more frequently than chance. • Then p(x, y) > p(x) * p(y). • Therefore, p(x,y) / p(x)p(y) > 1. • I(x, y) = log2 [ p(x,y) / p(x)p(y) ] > 0. • The higher the pointwise mutual information is for an n-gram, the more likely it is to be a collocation

  47. Pointwise mutual information • I(x, y) = log2 [ p(x, y) / p(x)p(y) ] = log2 p(x, y) – log2 p(x)p(y) = # of additional bits needed to specify the joint distribution of x and y, given the # of bits for the product of the marginals • If the collocation xy occurs more frequently than chance, the log2 p(x, y) term is greater than log2 p(x)p(y)
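Applying the formula to the earlier "new companies" numbers gives a value close to zero (a sketch; N is the corpus size used above):

  pmi <- function(p_xy, p_x, p_y) log2(p_xy / (p_x * p_y))
  N <- 14307668
  pmi(8 / N, 15828 / N, 4675 / N)    # about 0.63 bits: only slightly above chance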

  48. Example: Ayatollah Ruhollah • (Manning & Schütze) • p(Ayatollah) = 42 / 14307668 • p(Ruhollah) = 20 / 14307668 • p(Ayatollah, Ruhollah) = 20 / 14307668
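With these probabilities (assuming the joint count of 20, as in Manning & Schütze), the pointwise mutual information works out as follows (sketch):

  N <- 14307668
  log2((20 / N) / ((42 / N) * (20 / N)))    # = log2(N / 42), about 18.4 bits: a strong collocation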

  49. Rank collocations according to pointwise mutual information • (Manning & Schutze)

  50. Problem: sensitivity to sparse data • Pointwise mutual information for bigrams in entire corpus (23,000 documents) • PMI is high for low-frequency bigrams that are not collocations
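To see why: a bigram that occurs once, formed from two words that each occur only once in the corpus, receives the largest score the measure can give (illustrative sketch, same N as before):

  N <- 14307668
  log2((1 / N) / ((1 / N) * (1 / N)))    # = log2(N), about 23.8 bits, the maximum possible in this corpus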
