
z-squared: the origin and use of χ²



Presentation Transcript


  1. z-squared: the origin and use of χ² - or - what I wish I had been told about statistics (but had to work out for myself) Sean Wallis, Survey of English Usage, University College London, s.wallis@ucl.ac.uk

  2. Outline • What is the point of statistics? • Linguistic alternation experiments • How inferential statistics works • Introducing z tests • Two types (single-sample and two-sample) • How these tests are related to χ² • Comparing experiments and ‘effect size’ • Swing and ‘skew’ • Low frequency events and small samples

  3. What is the point of statistics? • Analyse data you already have • corpus linguistics • Design new experiments • collect new data, add annotation • experimental linguistics in the lab • Try new methods • pose the right question • We are going to focus on z and χ² tests

  4. What is the point of statistics? • Analyse data you already have • corpus linguistics } observational science • Design new experiments • collect new data, add annotation • experimental linguistics in the lab } experimental science • Try new methods • pose the right question } philosophy of science • We are going to focus on z and χ² tests } a little maths

  5. What is ‘inferential statistics’? • Suppose we carry out an experiment • We toss a coin 10 times and get 5 heads • How confident are we in the results? • Suppose we repeat the experiment • Will we get the same result again? • Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment • Infer from the sample to the population • Let us consider one type of experiment • Linguistic alternation experiments

  6. Alternation experiments • Imagine a speaker forming a sentence as a series of decisions/choices. They can • add: choose to extend a phrase or clause, or stop • select: choose between constructions • Choices will be constrained • grammatically • semantically

  7. Alternation experiments • Imagine a speaker forming a sentence as a series of decisions/choices. They can • add: choose to extend a phrase or clause, or stop • select: choose between constructions • Choices will be constrained • grammatically • semantically • Research question: • within these constraints, what factors influence the particular choice?

  8. Alternation experiments • Laboratory experiment (cued) • pose the choice to subjects • observe the one they make • manipulate different potential influences • Observational experiment (uncued) • observe the choices speakers make when they make them (e.g. in a corpus) • extract data for different potential influences • sociolinguistic: subdivide data by genre, etc • lexical/grammatical: subdivide data by elements in surrounding context

  9. Statistical assumptions • A random sample taken from the population • Not always easy to achieve • multiple cases from the same text and speakers, etc • may be limited historical data available • Be careful with data concentrated in a few texts • The sample is tiny compared to the population • This is easy to satisfy in linguistics! • Repeated sampling tends to form a Binomial distribution • This requires slightly more explanation...

  10.–16. The Binomial distribution • Repeated sampling tends to form a Binomial distribution • We toss a coin 10 times, and get 5 heads [Figures: frequency F of observing x heads (x axis 1, 3, 5, 7, 9), plotted as the experiment is repeated N = 1, 4, 8, 12, 16, 20 and 24 times; the histogram fills out as N grows]
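A minimal simulation sketch of the repeated-sampling idea above (my own illustration, not from the slides): toss a fair coin 10 times per experiment, repeat the experiment N times, and tally the frequency F of each head count x.

```python
import random
from collections import Counter

def repeat_experiment(n_repeats, tosses=10, p_heads=0.5):
    """Frequency table F: number of heads x -> count of experiments."""
    freqs = Counter()
    for _ in range(n_repeats):
        heads = sum(random.random() < p_heads for _ in range(tosses))
        freqs[heads] += 1
    return freqs

# Rebuild the N = 1, 4, 8, ... 24 frames from the slides
for n in (1, 4, 8, 12, 16, 20, 24):
    print(n, dict(sorted(repeat_experiment(n).items())))
```

As N grows, the tallies approach the Binomial distribution B(10, 0.5), the bell shape the next slide compares with the Normal curve.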

  17. Binomial → Normal • The Binomial (discrete) distribution tends to match the Normal (continuous) distribution [Figure: Normal curve overlaid on the Binomial histogram, F against x]

  18. The central limit theorem • Any Normal distribution can be defined by only two variables and the Normal function z • population mean x̄ = P • standard deviation s = √(P(1 – P) / n) • With more data in the experiment, s will be smaller • Divide by 10 for probability scale [Figure: Normal curve with tails at z·s either side of the mean, probability scale p = 0.1 to 0.7]

  19. The central limit theorem • Any Normal distribution can be defined by only two variables and the Normal function z • population mean x̄ = P • standard deviation s = √(P(1 – P) / n) • 95% of the curve is within ~2 standard deviations of the mean (the correct figure is 1.95996!) [Figure: 95% of the area lies within P ± z·s, with 2.5% in each tail]
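A quick way to reproduce the 1.95996 figure quoted above, assuming scipy is available (an aside of mine, not part of the deck): it is the Normal quantile that leaves 2.5% in each tail.

```python
from scipy.stats import norm

# Two-tailed 95% interval: 2.5% in each tail
z = norm.ppf(1 - 0.05 / 2)
print(z)  # 1.959963984540054
```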

  20. The single-sample z test... • Is an observation p more than z standard deviations from the expected population mean P? • If yes, the result is significant [Figure: observation p falling in the tail beyond P ± z·s]

  21. ...gives us a “confidence interval” • P ± z·s is the confidence interval for P • Enough for a test [Figure: interval P ± z·s on the Normal curve]

  22. ...gives us a “confidence interval” • P ± z·s is the confidence interval for P • But we need the interval about p [Figure: observation p with interval (w–, w+) about it, alongside P]

  23. ...gives us a “confidence interval” • The interval about p is called the Wilson score interval • This interval is asymmetric • It reflects the Normal interval about P: • If P is at the upper limit of p, p is at the lower limit of P (Wilson, 1927) [Figure: asymmetric interval (w–, w+) about observation p]

  24. ...gives us a “confidence interval” • The interval about p is called the Wilson score interval • To calculate w– and w+ we use this formula: • w–, w+ = ( p + z²/2n ∓ z·√( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n ) (Wilson, 1927)
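A small sketch of that Wilson score calculation in code (the function name and structure are mine, using the standard z = 1.95996 for 95%):

```python
import math

def wilson_interval(p, n, z=1.95996):
    """Wilson (1927) score interval (w_minus, w_plus) for observed p over n cases."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

# e.g. the observed proportion p(b | a) = 20/30 from the experiment below
print(wilson_interval(20 / 30, 30))
```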

  25. Plotting confidence intervals • E.g. Plot the probability of adding successive attributive adjectives to a NP in ICE-GB • You can easily see that the first two falls are significant, but the last is not [Figure: p (0.00 to 0.25) with confidence intervals, against number of adjectives 0 to 4]

  26. A simple experiment • Consider two binary variables, A and B • Each one is subdivided: • A = {a, ¬a} e.g. NP has AJP? {yes, no} • B = {b, ¬b} e.g. speaker gender {male, female} • Does B ‘affect’ A? • We perform an experiment (or sample a corpus) • We find 45 cases (NPs) classified by A and B (table below) • This is a ‘contingency table’

  27. A simple experiment • Consider two binary variables, A and B • Each one is subdivided: • A = {a, ¬a} e.g. NP has AJP? {yes, no} • B = {b, ¬b} e.g. speaker gender {male, female} • Does B ‘affect’ A? • We perform an experiment (or sample a corpus) • We find 45 cases (NPs) classified by A and B (table below) • This is a ‘contingency table’ • Q1. Does B cause a to differ from A? • Does speaker gender affect the decision to include an AJP?

      A = dependent variable (columns), B = independent variable (rows)
            a   ¬a    Σ
        b  20    5   25
       ¬b  10   10   20
        Σ  30   15   45

  28. Does B cause a to differ from A? • Compare column 1 (a) and column 3 (A) • Probability of picking b at random (gender = male) • p(b) = 25/45 = 5/9 = 0.556 [contingency table as above]

  29. Does B cause a to differ from A? • Compare column 1 (a) and column 3 (A) • Probability of picking b at random (gender = male) • p(b) = 25/45 = 5/9 = 0.556 • Next, examine a (has AJP) • New probability of picking b • p(b | a) = 20/30 = 2/3 = 0.667 • Confidence interval for p(b | a) • population standard deviation s = √( p(b)(1 – p(b)) / n ) = √( (5/9 × 4/9) / 30 ) • p ± z·s = (0.489, 0.845) [contingency table as above]

  30. Does B cause a to differ from A? • Compare column 1 (a) and column 3 (A) • Probability of picking b at random (gender = male) • p(b) = 25/45 = 5/9 = 0.556 • Next, examine a (has AJP) • New probability of picking b • p(b | a) = 20/30 = 2/3 = 0.667 • Confidence interval for p(b) • population standard deviation s = √( p(b)(1 – p(b)) / n ) = √( (5/9 × 4/9) / 30 ) • p ± z·s = (0.378, 0.733) • Not significant: p(b | a) is inside the c.i. for p(b) [contingency table as above]
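The same single-sample test, written out as a short script (my own transcription of the arithmetic above):

```python
import math

z = 1.95996              # 95%, two-tailed
p_b = 25 / 45            # given population value p(b) = 0.556
n = 30                   # cases in column a
s = math.sqrt(p_b * (1 - p_b) / n)
lower, upper = p_b - z * s, p_b + z * s

p_b_given_a = 20 / 30    # observed p(b | a) = 0.667
print((round(lower, 3), round(upper, 3)))   # (0.378, 0.733)
print(lower <= p_b_given_a <= upper)        # True -> not significant
```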

  31. Visualising this test • Confidence interval for p(b) • P = expected value, E = expected distribution [Figure: expected distribution E centred on P = p(b) = 0.556 with interval ± z·s = (0.378, 0.733); the observation p(b | a) = 0.667 falls inside it]

  32. The single-sample z test • Compares an observation with a given value • We used it to compare p(b | a) with p(b) • This is a “goodness of fit” test • Identical to a standard 2 × 1 χ² test • No need to test p(¬b | a) against p(¬b) • Note that p(b) is given • All of the variation is assumed to be in the estimation of p(b | a) • Could also compare p(b | ¬a) (no AJP) with p(b) • Q2. Does B cause a to differ from ¬a? • Does speaker gender affect presence / absence of AJP?
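The claimed identity with the 2 × 1 χ² test is easy to verify numerically: z² equals the goodness-of-fit χ² statistic (a check of my own, assuming scipy).

```python
import math
from scipy.stats import chisquare

p_b = 25 / 45
observed = [20, 10]                     # b, ¬b within column a
expected = [30 * p_b, 30 * (1 - p_b)]   # expected under p(b)
chi2, _ = chisquare(observed, f_exp=expected)

z = (20 / 30 - p_b) / math.sqrt(p_b * (1 - p_b) / 30)
print(round(chi2, 4), round(z * z, 4))  # 1.5 1.5 -> identical
```

Hence the title of the talk: χ² with one degree of freedom is z squared.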

  33. z test for 2 independent proportions • Method: combine observed values • take the difference (subtract) |p1 – p2| • calculate an ‘averaged’ confidence interval [Figure: two observed distributions O1 and O2, about p1 = p(b | a) and p2 = p(b | ¬a)]

  34. z test for 2 independent proportions • New confidence interval D = |O1 – O2| • standard deviation s′ = √( p̂(1 – p̂)(1/n1 + 1/n2) ) • pooled p̂ = p(b) = 25/45 = 5/9 • compare z·s′ with x = |p1 – p2| [Figure: difference distribution D centred on mean x = 0, tested against z·s′] [contingency table as above]

  35. Does B cause a to differ from ¬a? • Compare column 1 (a) and column 2 (¬a) • Probabilities (speaker gender = male) • p(b | a) = 20/30 = 2/3 = 0.667 • p(b | ¬a) = 5/15 = 1/3 = 0.333 • Confidence interval • pooled probability estimate p̂ = p(b) = 5/9 = 0.556 • standard deviation s′ = √( p̂(1 – p̂)(1/n1 + 1/n2) ) = √( (5/9 × 4/9)(1/30 + 1/15) ) • z·s′ = 0.308 [contingency table as above]

  36. Does B cause a to differ from ¬a? • Compare column 1 (a) and column 2 (¬a) • Probabilities (speaker gender = male) • p(b | a) = 20/30 = 2/3 = 0.667 • p(b | ¬a) = 5/15 = 1/3 = 0.333 • Confidence interval • pooled probability estimate p̂ = p(b) = 5/9 = 0.556 • standard deviation s′ = √( p̂(1 – p̂)(1/n1 + 1/n2) ) = √( (5/9 × 4/9)(1/30 + 1/15) ) • z·s′ = 0.308 • Significant: |p(b | a) – p(b | ¬a)| > z·s′ [contingency table as above]
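The two-proportion test as a reusable function (a sketch of my own, following the pooled-variance formula above):

```python
import math

def two_proportion_z(f1, n1, f2, n2, z=1.95996):
    """Significant if |p1 - p2| exceeds z times the pooled standard deviation."""
    p1, p2 = f1 / n1, f2 / n2
    p_pooled = (f1 + f2) / (n1 + n2)
    s = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return abs(p1 - p2), z * s, abs(p1 - p2) > z * s

# b within a (20/30) vs. b within ¬a (5/15)
print(two_proportion_z(20, 30, 5, 15))   # (0.333, 0.308, True) -> significant
```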

  37. z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method!

  38. z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method! • BUT: these tests have different purposes • 2 × 1 goodness of fit compares single value a with superset A • assumes only a varies • 2 × 2 test compares two values a, ¬a within a set A • both values may vary [Figure: goodness of fit χ² (a within A) vs. 2 × 2 χ² (a against ¬a)]

  39. z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method! • BUT: these tests have different purposes • 2 × 1 goodness of fit compares single value a with superset A • assumes only a varies • 2 × 2 test compares two values a, ¬a within a set A • both values may vary • Q: Do we need χ²? [Figure: goodness of fit χ² (a within A) vs. 2 × 2 χ² (a against ¬a)]

  40. Larger χ² tests • χ² is popular because it can be applied to contingency tables with many values • r × 1 goodness of fit χ² tests (r ≥ 2) • r × c χ² tests for homogeneity (r, c ≥ 2) • z tests have 1 degree of freedom • strength: significance is due to only one source • strength: easy to plot values and confidence intervals • weakness: multiple values may be unavoidable • With larger χ² tests, evaluate and simplify: • Examine χ² contributions for each row or column • Focus on alternation - try to test for a speaker choice
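For r × c tables, a ready-made homogeneity test is available in scipy (my suggestion, not named on the slides); on the 2 × 2 table above it reproduces the z test result, since χ² = z² here too.

```python
from scipy.stats import chi2_contingency

table = [[20, 5],     # b:  a, ¬a
         [10, 10]]    # ¬b: a, ¬a
chi2, pval, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), round(pval, 4), dof)   # 4.5, ~0.034, 1 -> significant
```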

  41. How big is the effect? • These tests do not measure the strength of the interaction between two variables • They test whether the strength of an interaction is greater than would be expected by chance • With lots of data, a tiny change would be significant

  42. How big is the effect? • These tests do not measure the strength of the interaction between two variables • They test whether the strength of an interaction is greater than would be expected by chance • With lots of data, a tiny change would be significant • Don’t use χ², p or z values to compare two different experiments • A result significant at p<0.01 is not ‘better’ than one significant at p<0.05

  43. How big is the effect? • These tests do not measure the strength of the interaction between two variables • They test whether the strength of an interaction is greater than would be expected by chance • With lots of data, a tiny change would be significant • Don’t use χ², p or z values to compare two different experiments • A result significant at p<0.01 is not ‘better’ than one significant at p<0.05 • There are a number of ways of measuring ‘association strength’ or ‘effect size’

  44. Percentage swing • Compare probabilities of a DV value (a, AJP) across a change in the IV (gender): • swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = –0.3 [contingency table as above]

  45. Percentage swing • Compare probabilities of a DV value (a, AJP) across a change in the IV (gender): • swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = –0.3 • As a proportion of the initial value • % swing d% = d / p(a | b) = –0.3/0.8 [contingency table as above]

  46. Percentage swing • Compare probabilities of a DV value (a, AJP) across a change in the IV (gender): • swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = –0.3 • As a proportion of the initial value • % swing d% = d / p(a | b) = –37.5% • We can even calculate confidence intervals on d or d% • Use the z test for two independent proportions (we are comparing differences in p values) [contingency table as above]
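The swing arithmetic in code form (my own restatement of the example above):

```python
p_a_given_b = 20 / 25       # p(a | b)  = 0.8
p_a_given_not_b = 10 / 20   # p(a | ¬b) = 0.5

d = p_a_given_not_b - p_a_given_b    # swing d = -0.3
d_percent = d / p_a_given_b          # % swing = -0.375 (-37.5%)
print(d, d_percent)
```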

  47. Cramér’s φ • Can be used on any χ² table • Mathematically well defined • Probabilistic (c.f. swing d ∈ [–1, +1], d% = ?) • φ = 0: no relationship between A and B • φ = 1: B strictly determines A • straight line between these two extremes

      φ = 0:                      φ = 1:
            a   ¬a   Σ                  a   ¬a   Σ
        b  0.5  0.5  1              b   1    0   1
       ¬b  0.5  0.5  1             ¬b   0    1   1
        Σ   1    1   2              Σ   1    1   2

  48. Cramér’s φ • Can be used on any χ² table • Mathematically well defined • Probabilistic (c.f. swing d ∈ [–1, +1], d% = ?) • φ = 0: no relationship between A and B • φ = 1: B strictly determines A • straight line between these two extremes } an ‘averaged’ swing [tables as above]

  49. Cramér’s φ • Can be used on any χ² table • Mathematically well defined • Probabilistic (c.f. swing d ∈ [–1, +1], d% = ?) • φ = 0: no relationship between A and B • φ = 1: B strictly determines A • straight line between these two extremes • Based on χ² • φ = √( χ² / N ) (2 × 2), N = grand total • φc = √( χ² / ((k – 1)N) ) (r × c), k = min(r, c)

  50. Cramér’s φ • Can be used on any χ² table • Mathematically well defined • Probabilistic (c.f. swing d ∈ [–1, +1], d% = ?) • φ = 0: no relationship between A and B • φ = 1: B strictly determines A • straight line between these two extremes • Based on χ² • φ = √( χ² / N ) (2 × 2), N = grand total • φc = √( χ² / ((k – 1)N) ) (r × c), k = min(r, c) • Can be used for r × 1 goodness of fit tests • Recalibrate using methods in Wallis (2012) • Better indicator than percentage swing
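Putting the φ formula to work on the running example (a sketch of my own, reusing scipy's χ²):

```python
import math
from scipy.stats import chi2_contingency

table = [[20, 5],
         [10, 10]]
chi2, _, _, _ = chi2_contingency(table, correction=False)

N = sum(sum(row) for row in table)        # grand total, 45
k = min(len(table), len(table[0]))        # min(r, c), 2
phi_c = math.sqrt(chi2 / ((k - 1) * N))   # reduces to sqrt(chi2 / N) for 2 x 2
print(round(phi_c, 3))                    # 0.316
```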
