Corpora and Statistical Methods – Part 2 Albert Gatt
Preliminaries: Hypothesis testing and the binomial distribution
Permutations • Suppose we have the 5 words {the, dog, ate, a, bone} • How many permutations (possible orderings) are there of these words? • the dog ate a bone • dog the ate a bone • … • In general there are n! orderings of n items, so there are 5! = 120 ways of permuting 5 words.
Binomial coefficient • Slight variation: • How many different choices of three words are there out of these 5? • This is known as an “n choose k” problem, in our case “5 choose 3” = 5!/(3!(5-3)!) • For our problem, this gives us 10 ways of choosing three items out of 5
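A minimal sketch (Python 3.8+, standard library only) confirming both counts above:

```python
from math import factorial, comb

words = ["the", "dog", "ate", "a", "bone"]

n_orderings = factorial(len(words))  # 5! = 120 possible orderings
n_choices = comb(len(words), 3)      # "5 choose 3" = 10 unordered selections

print(n_orderings, n_choices)        # 120 10
```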
Bernoulli trials • A Bernoulli (or binomial) trial is like a coin flip. Features: • There are two possible outcomes (not necessarily with the same likelihood), e.g. success/failure or 1/0. • If the situation is repeated, then the likelihoods of the two outcomes are stable.
Sampling with/out replacement • Suppose we’re interested in the probability of pulling out a function word from a corpus of 100 words. • we pull out words one by one without putting them back • Is this a Bernoulli trial? • we have a notion of success/failure: w is either a function word (“success”) or not (“failure”) • but our chances aren’t the same across trials: they diminish since we sample without replacement
Cutting corners • If the sample (e.g. the corpus) is large enough, then we can assume a Bernoulli situation even if we sample without replacement. • Suppose our corpus has 52 million words • Success = pulling out a function word • Suppose there are 13 million function words • First trial: p(success) = 13,000,000/52,000,000 = .25 • Second trial (after drawing one function word): p(success) = 12,999,999/51,999,999 ≈ .25 • On very large samples, the chances remain effectively stable even without replacement.
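A quick check of this arithmetic, as a minimal Python sketch using the figures above:

```python
# How much does p(success) change between the first and second draw
# when we do not put the word back?
corpus_size = 52_000_000
function_words = 13_000_000

p_first = function_words / corpus_size                # 0.25
p_second = (function_words - 1) / (corpus_size - 1)   # ~0.2499999856 after one success

print(p_first, p_second)
```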
Binomial probabilities - I • Let π represent the probability of success on a Bernoulli trial (e.g. our simple word game on a large corpus). • Then, p(failure) = 1 - π • Problem: what are the chances of achieving success 3 times out of 5 trials? • Assumption: each trial is independent of every other. • (Is this assumption reasonable?)
Binomial probabilities - II • How many ways are there of getting success three times out of 5? • Several: SSSFF, SFSFS, SFSSF, … • To count the number of possible ways of getting k outcomes from n possibilities, we use the binomial coefficient: • C(n,k) = n! / (k!(n-k)!)
Binomial probabilities - III • “5 choose 3” gives 10. • Given independence, each of these sequences is equally likely. • What’s the probability of a sequence? • it’s an AND problem (multiplication rule) • P(SSSFF) = π·π·π·(1-π)·(1-π) = π^3(1-π)^2 • P(SFSFS) = π·(1-π)·π·(1-π)·π = π^3(1-π)^2 • (they all come out the same)
Binomial probabilities - IV • The binomial distribution states that: • given n Bernoulli trials, with probability π of success on each trial, the probability of getting exactly k successes is: • P(X = k) = C(n,k) · π^k · (1-π)^(n-k) • where C(n,k) is the number of different ways of getting k successes, π^k the probability of the k successes, and (1-π)^(n-k) the probability of the remaining failures.
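A minimal Python sketch of the formula, reusing π = 0.25 from the earlier function-word example:

```python
from math import comb

def binomial_pmf(k: int, n: int, pi: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

# e.g. 3 function words in 5 draws, with pi = 0.25
print(binomial_pmf(3, 5, 0.25))   # 10 * 0.25**3 * 0.75**2 ~= 0.0879
```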
Expected value and variance • Expected value of X over n trials: E(X) = nπ • Variance of X over n trials: Var(X) = nπ(1-π) • where π is our probability of success
The logic of hypothesis testing • The typical scenario in hypothesis testing compares two hypotheses: • The research hypothesis • A null hypothesis • The idea is to set up our experiment (study, etc) in such a way that: • If we show the null hypothesis to be false then • we can affirm our research hypothesis with a certain degree of confidence
H0 for collocation studies • There is no real association between w1 and w2, i.e. occurrence of <w1,w2> is no more likely than chance. • More formally: • H0: P(w1 & w2) = P(w1)P(w2) • i.e. the occurrences of w1 and w2 are independent
Some more on hypothesis testing • Our research hypothesis (H1): • <w1,w2> are strong collocates • P(w1 & w2) > P(w1)P(w2) • A null hypothesis H0 • P(w1 & w2) = P(w1)P(w2) • How do we know whether our results are sufficient to affirm H1? • I.e. how big is our risk of wrongly rejecting H0?
The notion of significance • We generally fix a “level of confidence” in advance. • In many disciplines, we’re happy with being 95% confident that the result we obtain is correct. • So we have a 5% chance of error. • Therefore, we state our results at p = 0.05 • “The probability of wrongly rejecting H0 is 5% (0.05)”
Tests for significance • Many of the tests we use involve: • having a prior notion of what the mean/variance of a population is, according to H0 • computing the mean/variance on our sample of the population • checking whether the sample mean/variance differs from the value predicted by H0, at 95% confidence.
The t-test: strategy • obtain the mean (x̄) and variance (s²) for a sample • H0: the sample is drawn from a population with mean μ and variance σ² • estimate the t value: this compares the sample mean/variance to the expected (population) mean/variance under H0 • check if any difference found is significant enough to reject H0
Computing t • calculate the difference between the sample mean and the expected population mean • scale the difference by the variance: • t = (x̄ - μ) / √(s²/N) • Assumption: the population is normally distributed. • If t is big enough, we reject H0: the critical magnitude of t for our sample size N is simply looked up in a table. • Tables tell us what the level of significance is (p-value, or likelihood of making a Type 1 error, i.e. wrongly rejecting H0).
Example: new companies • We think of our corpus as a series of bigrams, and each sample we take is an indicator variable (Bernoulli trial): • value = 1 if a bigram is new companies • value = 0 otherwise • Compute P(new) and P(companies) using standard MLE. • H0: P(new companies) = P(new)P(companies)
Example continued • We have computed the likelihood of our bigram of interest under H0. • Since this is a Bernoulli trial, this is also our expected mean. • We then compute the actual sample probability of <w1,w2> (new companies). • Compute t and check significance
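A sketch of the whole calculation in Python; the counts below are illustrative (roughly the order of magnitude of M&S's NY Times example) and should be treated as assumptions rather than quoted values:

```python
from math import sqrt

N = 14_307_668           # number of bigram positions in the corpus (assumed)
c_new = 15_828           # C(new)            (assumed)
c_companies = 4_675      # C(companies)      (assumed)
c_bigram = 8             # C(new companies)  (assumed)

mu = (c_new / N) * (c_companies / N)   # expected bigram probability under H0
x_bar = c_bigram / N                   # observed sample mean (bigram probability)
s2 = x_bar * (1 - x_bar)               # Bernoulli variance, ~= x_bar for tiny x_bar

t = (x_bar - mu) / sqrt(s2 / N)
print(t)   # ~1.0, below even the 1.645 needed at p = 0.05 (one-tailed): H0 is not rejected
```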
Uses of the t-test • Often used to rank candidate collocations, rather than compute significance. • Stop word lists must be used, else all bigrams will be significant. • e.g. M&S report 824 out of 831 bigrams that pass the significance test. • Reason: • language is just not random • regularities mean that if the corpus is large enough, all bigrams will occur together regularly and often enough to be significant. • Kilgarriff (2005): Any null hypothesis will be rejected on a large enough corpus.
Extending the t-test to compare samples • Variation on the original problem: • which co-occurrence relations best distinguish between two near-synonymous words w1 and w1’? • e.g. strong vs. powerful • Strategy: • find all bigrams <w1,w2> and <w1’,w2> • e.g. strong tea vs. powerful tea, strong support vs. powerful support • check, for each w2, whether it occurs significantly more often with w1 than with w1’. • NB. This is a two-sample t-test
Two-sample t-test: details • H0: for any w2, the probability of <w1,w2> is the same as that of <w1’,w2> • i.e. μ (the expected difference between the sample means) = 0 • Strategy: • extract samples of <w1,w2> and <w1’,w2> • assume they are independent • compute the mean and SD for each sample • compute t • check for significance: is the magnitude of the difference large enough? • Formula: • t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
Simplifying under binomial assumptions • Because the probabilities involved are very small, the variance of the binomial approximately equals the mean: s² = x̄(1-x̄) ≈ x̄ (similarly for the other sample mean). • Therefore, with samples of equal size n: • t ≈ (x̄1 - x̄2) / √((x̄1 + x̄2)/n) = (C(<w1,w2>) - C(<w1’,w2>)) / √(C(<w1,w2>) + C(<w1’,w2>))
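A minimal sketch of the simplified statistic; the counts in the usage line are hypothetical:

```python
from math import sqrt

def simplified_t(c1: int, c2: int) -> float:
    """Two-sample t under the binomial approximation (variance ~ mean).

    c1 = C(<w1, w2>), c2 = C(<w1', w2>); the corpus size cancels out,
    leaving t ~= (c1 - c2) / sqrt(c1 + c2).
    """
    return (c1 - c2) / sqrt(c1 + c2)

# hypothetical counts: 'strong support' 50 times vs. 'powerful support' 10 times
print(simplified_t(50, 10))   # ~5.16, a significant difference
```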
Concrete example: strong vs. powerful (M&S, p. 167); NY Times corpus • [Table: words occurring significantly more often with powerful than with strong, and words occurring significantly more often with strong than with powerful]
Criticisms of the t-test • Assumes that the probabilities are normally distributed. This is probably not the case in linguistic data, where probabilities tend to be very large or very small. • Alternative: chi-squared test (χ²) • compare differences between expected and observed frequencies (e.g. of bigrams)
Example • Imagine we’re interested in whether poor performance is a good collocation. • H0: the frequency of poor performance is no different from the expected frequency if each word occurs independently. • Find the frequencies of bigrams containing poor, of bigrams containing performance, and of poor performance itself. • compare actual to expected frequencies • check if the value is high enough to reject H0
Example continued • The observed frequencies go into a 2×2 contingency table: • rows: w1 = poor vs. w1 ≠ poor; columns: w2 = performance vs. w2 ≠ performance • cell (1,1) holds C(poor performance), cell (1,2) the count of bigrams poor w2 with w2 ≠ performance, and so on. • Expected frequencies need to be computed for each cell, e.g. the expected value for cell (1,1), poor performance: • E(1,1) = (row 1 total × column 1 total) / N
Computing the value • The chi-squared value is the sum of the differences between observed and expected frequencies, scaled by the expected frequencies: • χ² = Σ_ij (O_ij - E_ij)² / E_ij • The value is once again looked up in a table to check whether the degree of confidence (p-value) is acceptable. • If so, we conclude that the dependency between w1 and w2 is significant.
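A sketch with hypothetical counts (a corpus of 1 million bigrams in which poor occurs 1,000 times, performance 5,000 times, and poor performance 30 times):

```python
def chi_square_2x2(o11: int, o12: int, o21: int, o22: int) -> float:
    """Pearson's chi-square for a 2x2 contingency table of observed counts."""
    table = [[o11, o12], [o21, o22]]
    N = o11 + o12 + o21 + o22
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / N   # E(i,j) = row total * column total / N
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# hypothetical counts: rows = (poor, not poor), columns = (performance, not performance)
print(chi_square_2x2(30, 970, 4_970, 994_030))   # ~126: well above the 3.84 cutoff at p = 0.05 (1 df)
```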
More applications of this statistic • Kilgarriff and Rose (1998) use chi-square as a measure of corpus similarity • draw up an n×2 table (n rows, 2 columns) • columns correspond to corpora • rows correspond to individual types • compare the difference in counts between corpora • H0: the corpora are drawn from the same underlying linguistic population (e.g. register or variety) • corpora will be highly similar if the ratio of counts for each word is roughly constant. • This uses lexical variation to compute corpus similarity.
Limitations of t-test and chi-square • Not easily interpretable • a large chi-square or t value suggests a large difference • but makes more sense as a comparative measure, rather than in absolute terms • t-test is problematic because of the normality assumption • chi-square doesn’t work very well for small frequencies (by convention, we don’t calculate it if the expected value for any of the cells is less than 5) • but n-grams will often be infrequent!
Rationale • A likelihood ratio is the ratio of two probabilities • indicates how much more likely one hypothesis is compared to another • Notation: • c1 = C(w1) • c2 = C(w2) • c12 = C(<w1,w2>) • Hypotheses: • H0: P(w2|w1) = p = P(w2|¬w1) • H1: • P(w2|w1) = p1 • P(w2|¬w1) = p2 • p1 ≠ p2
Computing the likelihood ratio • The likelihood of the observed counts under a hypothesis H is L(H). • Writing b(k; n, x) for the binomial probability of k successes in n trials with success probability x: • L(H0) = b(c12; c1, p) · b(c2 - c12; N - c1, p) • L(H1) = b(c12; c1, p1) · b(c2 - c12; N - c1, p2) • where p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1)
Computing the likelihood ratio • We usually compute the log of the ratio: • log λ = log( L(H0) / L(H1) ) • Usually expressed as -2 log λ, because, for very large samples, -2 log λ is roughly equivalent to a χ² value
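A sketch of the computation under this binomial formulation (the constant binomial coefficients cancel in the ratio and are omitted); the counts in the usage line are hypothetical:

```python
from math import log

def log_l(k: int, n: int, x: float) -> float:
    """Log binomial likelihood x**k * (1-x)**(n-k), omitting the constant nCk term."""
    eps = 1e-12                      # guard against log(0)
    x = min(max(x, eps), 1 - eps)
    return k * log(x) + (n - k) * log(1 - x)

def minus_2_log_lambda(c1: int, c2: int, c12: int, N: int) -> float:
    """-2 log lambda for the bigram <w1, w2>; approximately chi-square distributed."""
    p = c2 / N                       # H0: P(w2|w1) = P(w2|~w1) = p
    p1 = c12 / c1                    # H1: P(w2|w1)
    p2 = (c2 - c12) / (N - c1)       # H1: P(w2|~w1)
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
    return -2 * log_lambda

# hypothetical counts: C(w1) = 2000, C(w2) = 1500, C(<w1,w2>) = 40, N = 14.3m
print(minus_2_log_lambda(2_000, 1_500, 40, 14_300_000))
```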
Interpreting the ratio • Suppose that the likelihood ratio for some bigram <w1,w2> is x. This says: • If we make the hypothesis that w2 is somehow dependent on w1, then we expect it to occur x times more than its actual base rate of occurrence would predict. • This ratio is also better for sparse data. • we can use the estimate as an approximate chi-square value even when expected frequencies are small.
Concrete example: bigrams involving powerful (M&S, p. 174) • Source: NY Times corpus (N = 14.3m) • [Table of bigrams with powerful and their -2 log λ scores] • Note: sparse data can still have a high log-likelihood value! • Interpreting -2 log λ as chi-squared allows us to reject H0 even for small samples (e.g. powerful cudgels)
Relative frequency ratios • An extension of the same logic of a likelihood ratio • used to compare collocations across corpora • Let <w1,w2> be our bigram of interest. • Let C1 and C2 be two corpora: • p1 = P(<w1,w2>) in C1 • p2 = P(<w1,w2>) in C2 • r = p1/p2 gives an indication of the relative likelihood of <w1,w2> in C1 and C2.
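A minimal sketch; the counts and corpus sizes in the usage line are hypothetical:

```python
def frequency_ratio(count1: int, size1: int, count2: int, size2: int) -> float:
    """r = p1 / p2, where p_i is the bigram's relative frequency in corpus i."""
    return (count1 / size1) / (count2 / size2)

# hypothetical: a bigram seen 2 times in a 10m-word C1 and 44 times in a 12m-word C2
print(frequency_ratio(2, 10_000_000, 44, 12_000_000))   # ~0.055: much rarer in C1
```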
Example application • Manning and Schütze (p. 176) compare: • C1: NY Times texts from 1990 • C2: NY Times texts from 1989 • The bigram <East, Berliners> occurs 44 times in C2 but only 2 times in C1; normalising by corpus size gives r ≈ 0.03 • The big difference is due to 1989 papers dealing more with the fall of the Berlin Wall.
Summary • We’ve now considered two forms of hypothesis testing: • t-test • chi-square • Also, log-likelihood ratios as measures of relative probability under different hypotheses. • Next, we begin to look at the problem of lexical acquisition.
References • Lapata, M., McDonald, S. & Keller, F. (1999). Determinants of Adjective-Noun Plausibility. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99). • Kilgarriff, A. (2005). Language is never, ever, ever random. Corpus Linguistics and Linguistic Theory 1(2): 263. • Church, K. & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1).