
Statistics for variationists


Presentation Transcript


  1. Statistics for variationists - or - what a linguist needs to know about statistics Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk

  2. Outline • What is the point of statistics? • Variationist corpus linguistics • How inferential statistics works • Introducing z tests • Two types (single-sample and two-sample) • How these tests are related to χ² • ‘Effect size’ and comparing results of experiments • Methodological implications for corpus linguistics

  3. What is the point of statistics? • Analyse data you already have • corpus linguistics • Design new experiments • collect new data, add annotation • experimental linguistics ‘in the lab’ • Try new methods • pose the right question • We are going to focus on z and χ² tests

  4. What is the point of statistics? • Analyse data you already have • corpus linguistics [observational science] • Design new experiments • collect new data, add annotation • experimental linguistics ‘in the lab’ [experimental science] • Try new methods • pose the right question [philosophy of science] • We are going to focus on z and χ² tests [a little maths]

  5. What is ‘inferential statistics’? • Suppose we carry out an experiment • We toss a coin 10 times and get 5 heads • How confident are we in the results? • Suppose we repeat the experiment • Will we get the same result again? • Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment • We infer from the sample to the population • Let us consider one type of experiment • Linguistic alternation experiments

  6. Alternation experiments • A variationist corpus paradigm • Imagine a speaker forming a sentence as a series of decisions/choices. They can • add: choose to extend a phrase or clause, or stop • select: choose between constructions • Choices will be constrained • grammatically • semantically

  7. Alternation experiments • A variationist corpus paradigm • Imagine a speaker forming a sentence as a series of decisions/choices. They can • add: choose to extend a phrase or clause, or stop • select: choose between constructions • Choices will be constrained • grammatically • semantically • Research question: • within these constraints, what factors influence the particular choice?

  8. Alternation experiments • Laboratory experiment (cued) • pose the choice to subjects • observe the one they make • manipulate different potential influences • Observational experiment (uncued) • observe the choices speakers make when they make them (e.g. in a corpus) • extract data for different potential influences • sociolinguistic: subdivide data by genre, etc • lexical/grammatical: subdivide data by elements in surrounding context • BUT the alternate choice is counterfactual

  9. Statistical assumptions • A random sample taken from the population • Not always easy to achieve • multiple cases from the same text and speakers, etc • may be limited historical data available • Be careful with data concentrated in a few texts • The sample is tiny compared to the population • This is easy to satisfy in linguistics! • Observations are free to vary (alternate) • Repeated sampling tends to form a Binomial distribution around the expected mean • This requires slightly more explanation...

  10. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • We toss a coin 10 times, and get 5 heads [chart: frequency F against x = 1 to 9, N = 1]

  11. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • Due to chance, some samples will have a higher or lower score [chart: N = 4]

  12. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • Due to chance, some samples will have a higher or lower score [chart: N = 8]

  13. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • Due to chance, some samples will have a higher or lower score [chart: N = 12]

  14. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • Due to chance, some samples will have a higher or lower score [chart: N = 16]

  15. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • Due to chance, some samples will have a higher or lower score [chart: N = 20]

  16. The Binomial distribution • Repeated sampling tends to form a Binomial distribution around the expected mean • Due to chance, some samples will have a higher or lower score [chart: N = 24]
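The repeated-sampling idea on these slides can be simulated directly. A minimal Python sketch (illustrative, not from the original slides): each "experiment" tosses a fair coin 10 times, and the tally of heads over many repeated experiments clusters Binomially around the expected mean of 5.

```python
import random
from collections import Counter

def sample_heads(n_tosses=10, p=0.5):
    """One experiment: count heads in n_tosses tosses of a coin with P(head) = p."""
    return sum(random.random() < p for _ in range(n_tosses))

random.seed(1)  # reproducible illustration
counts = Counter(sample_heads() for _ in range(10000))

# The scores cluster around the expected mean n * p = 5
mean = sum(k * c for k, c in counts.items()) / 10000
```

Plotting `counts` as a histogram reproduces the bell-like shape shown on the slides.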

  17. Binomial → Normal • The Binomial (discrete) distribution is close to the Normal (continuous) distribution [chart: frequency F against x]

  18. The central limit theorem • Any Normal distribution can be defined by only two variables and the Normal function z • population mean P • standard deviation s = √( P(1 – P) / n ) • With more data in the experiment, s will be smaller • Divide x by 10 for probability scale [chart: F against p = 0.1 to 0.7, tails at ± z.s]

  19. The central limit theorem • Any Normal distribution can be defined by only two variables and the Normal function z • population mean P • standard deviation s = √( P(1 – P) / n ) • 95% of the curve is within ~2 standard deviations of the expected mean • the correct figure is 1.95996! • the critical value of z for an error level of 0.05 [chart: central 95%, 2.5% in each tail]

  20. The central limit theorem • Any Normal distribution can be defined by only two variables and the Normal function z • population mean P • standard deviation s = √( P(1 – P) / n ) • zα/2 — the critical value of z for an error level α of 0.05 [chart: central 95%, 2.5% in each tail]
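The standard deviation formula and critical value above are easy to check numerically. A sketch, where the figures P = 0.5 and n = 100 are illustrative assumptions rather than data from the slides:

```python
import math

def binomial_sd(P, n):
    """Standard deviation of a sample proportion: s = sqrt(P(1 - P) / n)."""
    return math.sqrt(P * (1 - P) / n)

Z = 1.95996  # critical value of z for an error level of 0.05 (two-tailed)

# 95% Normal interval about an assumed population mean P = 0.5, n = 100
P, n = 0.5, 100
s = binomial_sd(P, n)
lower, upper = P - Z * s, P + Z * s  # 95% of sample proportions fall in this range
```

Quadrupling n halves s, which is the "more data, smaller s" point on slide 18.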

  21. The single-sample z test... • Is an observation p > z standard deviations from the expected (population) mean P? • If yes, p is significantly different from P [chart: observation p relative to the interval P ± z.s]

  22. ...gives us a “confidence interval” • P ± z.s is the confidence interval for P • We want to plot the interval about p [chart: interval P ± z.s about P]

  23. ...gives us a “confidence interval” • P ± z.s is the confidence interval for P • We want to plot the interval about p [chart: observation p with interval (w–, w+) about it]

  24. ...gives us a “confidence interval” • The interval about p is called the Wilson score interval • This interval is asymmetric • It reflects the Normal interval about P: • If P is at the upper limit of p, p is at the lower limit of P (Wallis, to appear, a) [chart: observation p with interval (w–, w+)]

  25. ...gives us a “confidence interval” • The interval about p is called the Wilson score interval • To calculate w– and w+ we use this formula: w∓ = ( p + z²/2n ∓ z √( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n ) (Wilson, 1927)
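Wilson's formula translates directly into code. A sketch implementation; the example proportion (1 case in 20) is hypothetical:

```python
import math

def wilson_interval(p, n, z=1.95996):
    """Wilson (1927) score interval (w-, w+) for observed proportion p, sample size n."""
    z2 = z * z
    centre = (p + z2 / (2 * n)) / (1 + z2 / n)
    spread = (z / (1 + z2 / n)) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return centre - spread, centre + spread

# Asymmetric about a skewed observation: p = 0.05, n = 20
w_minus, w_plus = wilson_interval(0.05, 20)
```

Unlike the Normal ("Wald") interval about p, this interval never escapes [0, 1], which matters for the skewed p = 0 or 1 cases plotted on slide 26.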

  26. Plotting confidence intervals • Plotting modal shall/will over time (DCPSE) • Small amounts of data / year • Highly skewed p in some cases • p = 0 or 1 (circled) • Confidence intervals identify the degree of certainty in our results (Wallis, to appear, a) [chart: p(shall | {shall, will}) by year, 1955–1995]

  27. Plotting confidence intervals • Probability of adding successive attributive adjective phrases (AJPs) to an NP in ICE-GB • x = number of AJPs • NPs get longer → adding AJPs is more difficult • The first two falls are significant, the last is not [chart: p against x = 0 to 4]

  28. 2 × 1 goodness of fit χ² test • Same as single-sample z test for P (z² = χ²) • Does the value of a affect p(b)? • IV: A = {a, ¬a} • DV: B = {b, ¬b} [chart: p(b | a) against interval P ± z.s about P = p(b)]

  29. 2 × 1 goodness of fit χ² test • Same as single-sample z test for P (z² = χ²) • Or Wilson test for p (by inversion) • IV: A = {a, ¬a} • DV: B = {b, ¬b} [chart: p(b | a) with Wilson interval (w–, w+) against P = p(b)]

  30. The single-sample z test • Compares an observation with a given value • Compare p(b | a) with p(b) • A “goodness of fit” test • Identical to a standard 2 × 1 χ² test • Note that p(b) is given • All of the variation is assumed to be in the estimate of p(b | a) • Could also compare p(b | ¬a) with p(b)
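The z² = χ² identity for this test can be verified numerically. A sketch with hypothetical frequencies (30 cases out of 100, against a given P = 0.2):

```python
import math

def gof_z(o, n, P):
    """Single-sample z score for observed proportion p = o/n against a given P."""
    p = o / n
    s = math.sqrt(P * (1 - P) / n)
    return (p - P) / s

def gof_chi2(o, n, P):
    """2 x 1 goodness of fit chi-square against the expected split (nP, n(1 - P))."""
    observed = [o, n - o]
    expected = [n * P, n * (1 - P)]
    return sum((ob - ex) ** 2 / ex for ob, ex in zip(observed, expected))

# Hypothetical data: p(b | a) = 30/100 against a given p(b) = 0.2
z = gof_z(30, 100, 0.2)
chi2 = gof_chi2(30, 100, 0.2)  # equals z squared
```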

  31. z test for 2 independent proportions • Method: combine observed values • take the difference (subtract) |p1 – p2| • calculate an ‘averaged’ confidence interval • p1 = p(b | a), p2 = p(b | ¬a) (Wallis, to appear, b) [chart: observations O1, O2]

  32. z test for 2 independent proportions • New confidence interval D = |O1 – O2| • standard deviation s′ = √( p̂(1 – p̂)(1/n1 + 1/n2) ) • p̂ = p(b) • compare z.s′ with x = |p1 – p2| (Wallis, to appear, b) [chart: D about mean x = 0]
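The pooled two-proportion test above can be sketched as follows; the observed frequencies are hypothetical:

```python
import math

def two_proportion_z(o1, n1, o2, n2):
    """z test for two independent proportions using the pooled SD s'."""
    p1, p2 = o1 / n1, o2 / n2
    p_hat = (o1 + o2) / (n1 + n2)  # pooled probability, the estimate of p(b)
    s_prime = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return abs(p1 - p2) / s_prime

# Hypothetical data: p(b | a) = 40/100, p(b | not-a) = 25/100
z = two_proportion_z(40, 100, 25, 100)
significant = z > 1.95996  # differ at the 0.05 error level?
```

Squaring this z score reproduces the standard 2 × 2 χ² statistic for the same table, which is the equivalence slide 33 states.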

  33. z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method!

  34. z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method! • BUT: 2 × 1 and 2 × 2 tests have different purposes • 2 × 1 goodness of fit compares single value a with superset A • assumes only a varies • 2 × 2 test compares two values a, ¬a within a set A • both values may vary [diagram: IV: A = {a, ¬a}, g.o.f. χ² vs 2 × 2 χ²]

  35. z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method! • BUT: 2 × 1 and 2 × 2 tests have different purposes • 2 × 1 goodness of fit compares single value a with superset A • assumes only a varies • 2 × 2 test compares two values a, ¬a within a set A • both values may vary • Q: Do we need χ²? [diagram: IV: A = {a, ¬a}, g.o.f. χ² vs 2 × 2 χ²]

  36. Larger χ² tests • χ² is popular because it can be applied to contingency tables with many values • r × 1 goodness of fit χ² tests (r ≥ 2) • r × c χ² tests for homogeneity (r, c ≥ 2) • z tests have 1 degree of freedom • strength: significance is due to only one source • strength: easy to plot values and confidence intervals • weakness: multiple values may be unavoidable • With larger χ² tests, evaluate and simplify: • Examine χ² contributions for each row or column • Focus on alternation - try to test for a speaker choice

  37. How big is the effect? • These tests do not measure the strength of the interaction between two variables • They test whether the strength of an interaction is greater than would be expected by chance • With lots of data, a tiny change would be significant • Don’t use χ², p or z values to compare two different experiments • A result significant at p<0.01 is not ‘better’ than one significant at p<0.05 • There are a number of ways of measuring ‘association strength’ or ‘effect size’

  38. How big is the effect? • Percentage swing • swing d = p(a | ¬b) – p(a | b) • % swing d% = d / p(a | b) • frequently used (“X increased by 50%”) • may have confidence intervals on change • can be misleading (“+50%” then “-50%” is not zero) • one change, not sequence • over one value, not multiple values

  39. How big is the effect? • Percentage swing • swing d = p(a | ¬b) – p(a | b) • % swing d% = d / p(a | b) • frequently used (“X increased by 50%”) • may have confidence intervals on change • can be misleading (“+50%” then “-50%” is not zero) • one change, not sequence • over one value, not multiple values • Cramér’s φ • φ = √( χ² / N ) (2 × 2), N = grand total • φc = √( χ² / ((k – 1)N) ) (r × c), k = min(r, c) • measures degree of association of one variable with another (across all values)
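Cramér's φ follows directly from the χ² statistic and the grand total N. A sketch for a 2 × 2 table with hypothetical cell counts (rows a, ¬a; columns b, ¬b):

```python
import math

def chi2_stat(table):
    """Pearson chi-square statistic and grand total N for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    chi2 = sum((obs - row_totals[i] * col_totals[j] / N) ** 2
               / (row_totals[i] * col_totals[j] / N)
               for i, row in enumerate(table) for j, obs in enumerate(row))
    return chi2, N

def cramers_phi(table):
    """Cramer's phi = sqrt(chi2 / ((k - 1) N)), k = min(r, c); sqrt(chi2 / N) for 2 x 2."""
    chi2, N = chi2_stat(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / ((k - 1) * N))

phi = cramers_phi([[40, 60], [25, 75]])
```

φ ranges from 0 (no association) to 1 (perfect association), so unlike raw χ², it can be compared across tables of different sizes.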

  40. Comparing experimental results • Suppose we have two similar experiments • How do we test if one result is significantly stronger than another?

  41. Comparing experimental results • Suppose we have two similar experiments • How do we test if one result is significantly stronger than another? • Test swings • z test for two samples from different populations • Uses s′ = √( s1² + s2² ) • Test |d1(a) – d2(a)| > z.s′ (Wallis 2011) [chart: swings d1(a), d2(a)]

  42. Comparing experimental results • Suppose we have two similar experiments • How do we test if one result is significantly stronger than another? • Test swings • z test for two samples from different populations • Uses s′ = √( s1² + s2² ) • Test |d1(a) – d2(a)| > z.s′ • Same method can be used to compare other z or χ² tests (Wallis 2011) [chart: swings d1(a), d2(a)]
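The swing comparison above can be sketched as follows. The swing sizes, sample sizes, and the unpooled standard-deviation estimate for each swing are illustrative assumptions, not figures from the slides:

```python
import math

Z = 1.95996  # critical value of z for an error level of 0.05

def swing_sd(p1, n1, p2, n2):
    """Approximate SD of a swing d = p2 - p1 (unpooled; an illustrative assumption)."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def swings_differ(d1, s1, d2, s2):
    """Test |d1 - d2| > z.s' where s' = sqrt(s1^2 + s2^2)."""
    s_prime = math.sqrt(s1 ** 2 + s2 ** 2)
    return abs(d1 - d2) > Z * s_prime

# Hypothetical experiments: swings of -0.40 and -0.15, n = 200 per condition
s1 = swing_sd(0.6, 200, 0.2, 200)
s2 = swing_sd(0.5, 200, 0.35, 200)
different = swings_differ(-0.40, s1, -0.15, s2)
```

The same |x1 – x2| > z.s′ pattern applies whenever two independent estimates, each with its own standard deviation, are compared.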

  43. Modern improvements on z and χ² • ‘Continuity correction’ for small n • Yates’ χ² test • errs on side of caution • can also be applied to Wilson interval • Newcombe (1998) improves on 2 × 2 χ² test • combines two Wilson score intervals • performs better than χ² and log-likelihood (etc.) for low-frequency events or small samples • However, for corpus linguists, there remains one outstanding problem...

  44. Experimental design • Each observation should be free to vary • i.e. p can be any value from 0 to 1 [chart: p(b | words), p(b | VPs), p(b | tensed VPs) for items b1, b2]

  45. Experimental design • Each observation should be free to vary • i.e. p can be any value from 0 to 1 • However many people use these methods incorrectly • e.g. citation ‘per million words’ • what does this actually mean? [chart: p(b | words), p(b | VPs), p(b | tensed VPs) for items b1, b2]

  46. Experimental design • Each observation should be free to vary • i.e. p can be any value from 0 to 1 • However many people use these methods incorrectly • e.g. citation ‘per million words’ • what does this actually mean? • Baseline should be choice • Experimentalists can design choice into experiment • Corpus linguists have to infer when speakers had opportunity to choose, counterfactually [chart: p(b | words), p(b | VPs), p(b | tensed VPs) for items b1, b2]

  47. A methodological progression • Aim: • investigate change when speakers have a choice • Four levels of experimental refinement: • 1) pmw (words)

  48. A methodological progression • Aim: • investigate change when speakers have a choice • Four levels of experimental refinement: • 1) pmw (words) • 2) select a plausible baseline (tensed VPs)

  49. A methodological progression • Aim: • investigate change when speakers have a choice • Four levels of experimental refinement: • 1) pmw (words) • 2) select a plausible baseline (tensed VPs) • 3) grammatically restrict data or enumerate cases ({will, shall})

  50. A methodological progression • Aim: • investigate change when speakers have a choice • Four levels of experimental refinement: • 1) pmw (words) • 2) select a plausible baseline (tensed VPs) • 3) grammatically restrict data or enumerate cases ({will, shall}) • 4) check each case individually for plausibility of alternation (“Ye shall be saved”)
