z-squared: the origin and use of χ² - or - what I wish I had been told about statistics (but had to work out for myself). Sean Wallis, Survey of English Usage, University College London, s.wallis@ucl.ac.uk
Outline • What is the point of statistics? • Linguistic alternation experiments • How inferential statistics works • Introducing z tests • Two types (single-sample and two-sample) • How these tests are related to χ² • Comparing experiments and ‘effect size’ • Swing and ‘skew’ • Low frequency events and small samples
What is the point of statistics? • Analyse data you already have • corpus linguistics (observational science) • Design new experiments • collect new data, add annotation • experimental linguistics in the lab (experimental science) • Try new methods • pose the right question (philosophy of science) • We are going to focus on z and χ² tests (a little maths)
What is ‘inferential statistics’? • Suppose we carry out an experiment • We toss a coin 10 times and get 5 heads • How confident are we in the results? • Suppose we repeat the experiment • Will we get the same result again? • Inferential statistics is a method of inferring the behaviour of future ‘ghost’ experiments from one experiment • Infer from the sample to the population • Let us consider one type of experiment • Linguistic alternation experiments
Alternation experiments • Imagine a speaker forming a sentence as a series of decisions/choices. They can • add: choose to extend a phrase or clause, or stop • select: choose between constructions • Choices will be constrained • grammatically • semantically • Research question: • within these constraints, what factors influence the particular choice?
Alternation experiments • Laboratory experiment (cued) • pose the choice to subjects • observe the one they make • manipulate different potential influences • Observational experiment (uncued) • observe the choices speakers make when they make them (e.g. in a corpus) • extract data for different potential influences • sociolinguistic: subdivide data by genre, etc • lexical/grammatical: subdivide data by elements in surrounding context
Statistical assumptions • A random sample taken from the population • Not always easy to achieve • multiple cases from the same text and speakers, etc • may be limited historical data available • Be careful with data concentrated in a few texts • The sample is tiny compared to the population • This is easy to satisfy in linguistics! • Repeated sampling tends to form a Binomial distribution • This requires slightly more explanation...
The Binomial distribution • Repeated sampling tends to form a Binomial distribution • We toss a coin 10 times, and get 5 heads: [Figure: frequency F of observing x heads (x = 1 to 9), redrawn as the number of repeated experiments grows through N = 1, 4, 8, 12, 16, 20, 24; the histogram fills out the Binomial shape]
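A minimal Python sketch (not from the talk) of this repeated-sampling idea: rerun the ten-toss experiment N times and tally how often each number of heads x is observed; the tallies approach the Binomial shape as N grows.

```python
# Sketch: repeated coin-toss experiments form a Binomial distribution.
import random
from collections import Counter

def repeated_experiments(n_repeats, tosses=10, p_heads=0.5):
    """Return a Counter mapping x (heads observed) to frequency F."""
    results = Counter()
    for _ in range(n_repeats):
        heads = sum(random.random() < p_heads for _ in range(tosses))
        results[heads] += 1
    return results

for n in (1, 4, 8, 12, 16, 20, 24):
    freqs = repeated_experiments(n)
    print(n, sorted(freqs.items()))
```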
Binomial → Normal • The Binomial (discrete) distribution tends to match the Normal (continuous) distribution [Figure: Normal curve F superimposed on the Binomial histogram over x]
The central limit theorem • Any Normal distribution can be defined by only two variables and the Normal function z • population mean x̄ = P • standard deviation s = √( P(1 – P) / n ) • With more data in the experiment, s will be smaller • 95% of the curve is within ~2 standard deviations of the mean (the correct figure is 1.95996!) [Figure: Normal curve on the probability scale p (divide frequencies by 10), with 2.5% tails beyond x̄ ± z·s]
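The two Normal parameters can be computed directly. A small sketch, assuming the coin-tossing example (P = 0.5, n = 10):

```python
# Sketch: the Normal approximation's mean and standard deviation
# on the probability scale, for a fair coin tossed 10 times.
import math

P = 0.5          # population mean on the probability scale
n = 10           # sample size
z = 1.95996      # two-tailed critical value for 95%

s = math.sqrt(P * (1 - P) / n)       # standard deviation
lower, upper = P - z * s, P + z * s  # 95% of samples fall here
print(f"s = {s:.4f}, 95% interval = ({lower:.4f}, {upper:.4f})")
```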
The single-sample z test... • Is an observation > z standard deviations from the expected population mean? • If yes, the result is significant [Figure: observation p falling outside the interval P ± z·s, with 2.5% tails]
...gives us a “confidence interval” • P ± z·s is the confidence interval for P • Enough for a test • But we need the interval about the observation p • The interval about p is called the Wilson score interval (Wilson, 1927) • This interval is asymmetric • It reflects the Normal interval about P: • If P is at the upper limit of p, p is at the lower limit of P • To calculate w– and w+ we use this formula: • w–, w+ = ( p + z²/2n ∓ z·√( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n ) [Figure: asymmetric interval (w–, w+) about the observation p, alongside the Normal interval about P]
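A sketch of the Wilson calculation as a function (the formula above, with z = 1.95996 for 95%):

```python
# Sketch: Wilson (1927) score interval for an observed proportion
# p out of n cases, at a two-tailed critical value z.
import math

def wilson_interval(p, n, z=1.95996):
    """Return (w_minus, w_plus), the Wilson score interval about p."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

print(wilson_interval(0.667, 30))  # interval about p(b | a) = 2/3, n = 30
```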
Plotting confidence intervals • E.g. Plot the probability of adding successive attributive adjectives to an NP in ICE-GB • You can easily see that the first two falls are significant, but the last is not [Figure: p (0.00 to 0.25) against number of adjectives (0 to 4), with confidence intervals]
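A plotting sketch of such a series, assuming matplotlib; the counts below are hypothetical placeholders, not the ICE-GB figures.

```python
# Sketch: plot a falling probability series with Wilson score
# intervals as error bars. Counts are hypothetical.
import math
import matplotlib.pyplot as plt

def wilson_interval(p, n, z=1.95996):
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

cases     = [1000, 180, 18, 2]   # hypothetical NPs with k adjectives
successes = [ 180,  18,  2, 0]   # hypothetical NPs adding one more

ps, err_lo, err_hi = [], [], []
for n, f in zip(cases, successes):
    p = f / n
    w_minus, w_plus = wilson_interval(p, n)
    ps.append(p); err_lo.append(p - w_minus); err_hi.append(w_plus - p)

plt.errorbar(range(len(ps)), ps, yerr=[err_lo, err_hi], fmt='o-')
plt.xlabel('attributive adjectives already in NP'); plt.ylabel('p')
plt.show()
```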
A simple experiment • Consider two binary variables, A and B • Each one is subdivided: • A = {a, ¬a} e.g. NP has AJP? {yes, no} • B = {b, ¬b} e.g. speaker gender {male, female} • Does B ‘affect’ A? • We perform an experiment (or sample a corpus) • We find 45 cases (NPs) classified by A and B: • This is a ‘contingency table’ • Q1. Does B cause a to differ from A? • Does speaker gender affect the decision to include an AJP?

A = dependent variable (columns), B = independent variable (rows)

        a    ¬a    Σ
  b    20     5   25
 ¬b    10    10   20
  Σ    30    15   45
Does B cause a to differ from A? • Compare column 1 (a) and column 3 (A) • Probability of picking b at random (gender = male) • p(b) = 25/45 = 5/9 = 0.556 • Next, examine a (has AJP) • New probability of picking b • p(b | a) = 20/30 = 2/3 = 0.667 • Confidence interval for p(b) • population standard deviation s = √( p(b)(1 – p(b)) / n ) = √( (5/9 × 4/9) / 30 ) • p(b) ± z·s = (0.378, 0.733) • Not significant: p(b | a) is inside the c.i. for p(b)

        a    ¬a    Σ
  b    20     5   25
 ¬b    10    10   20
  Σ    30    15   45
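The slide's arithmetic can be checked directly; a sketch reproducing the interval and the test decision:

```python
# Sketch: single-sample (goodness of fit) z test of p(b | a)
# against the given value p(b).
import math

z = 1.95996
p_b = 25 / 45            # p(b) = 0.556
p_b_given_a = 20 / 30    # p(b | a) = 0.667
n = 30                   # cases in column a

s = math.sqrt(p_b * (1 - p_b) / n)
lower, upper = p_b - z * s, p_b + z * s
print(f"c.i. for p(b) = ({lower:.3f}, {upper:.3f})")        # (0.378, 0.733)
print("significant?", not (lower <= p_b_given_a <= upper))  # False
```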
Visualising this test • Confidence interval for p(b) • P = expected value, E = expected distribution [Figure: expected distribution E centred on P = p(b) = 0.556 with interval (0.378, 0.733); the observation p(b | a) = 0.667 falls inside]
The single-sample z test • Compares an observation with a given value • We used it to compare p(b | a) with p(b) • This is a “goodness of fit” test • Identical to a standard 2 × 1 χ² test • No need to test p(¬b | a) with p(¬b) • Note that p(b) is given • All of the variation is assumed to be in the estimation of p(b | a) • Could also compare p(b | ¬a) (no AJP) with p(b) • Q2. Does B cause a to differ from ¬a? • Does speaker gender affect presence / absence of AJP?
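The equivalence to the 2 × 1 χ² test, the ‘z-squared’ of the title, can be verified numerically. A sketch using column a of the table:

```python
# Sketch: the single-sample z score squared equals the 2x1
# goodness of fit chi-square statistic.
import math

P = 25 / 45                 # given value p(b)
p = 20 / 30                 # observed p(b | a)
n = 30

z_score = (p - P) / math.sqrt(P * (1 - P) / n)

observed = [20, 10]                  # b, ¬b within column a
expected = [n * P, n * (1 - P)]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(z_score ** 2, chi_sq)   # both 1.5
```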
z test for 2 independent proportions • Method: combine observed values • take the difference (subtract) |p1 – p2| • calculate an ‘averaged’ confidence interval [Figure: observed distributions O1 (p1 = p(b | a)) and O2 (p2 = p(b | ¬a)) plotted over p for columns a and ¬a]
z test for 2 independent proportions • New confidence interval about the difference D = |O1 – O2| • standard deviation s′ = √( p̂(1 – p̂) (1/n1 + 1/n2) ) • pooled estimate p̂ = p(b) = 25/45 = 5/9 • compare z·s′ with x̄ = |p1 – p2| [Figure: difference distribution D centred on mean x̄ = 0, with z·s′ marked]

        a    ¬a    Σ
  b    20     5   25
 ¬b    10    10   20
  Σ    30    15   45
Does B cause a to differ from ¬a? • Compare column 1 (a) and column 2 (¬a) • Probabilities (speaker gender = male) • p(b | a) = 20/30 = 2/3 = 0.667 • p(b | ¬a) = 5/15 = 1/3 = 0.333 • Confidence interval • pooled probability estimate p̂ = p(b) = 5/9 = 0.556 • standard deviation s′ = √( p̂(1 – p̂) (1/n1 + 1/n2) ) = √( (5/9 × 4/9) (1/30 + 1/15) ) • z·s′ = 0.308 • Significant: |p(b | a) – p(b | ¬a)| > z·s′

        a    ¬a    Σ
  b    20     5   25
 ¬b    10    10   20
  Σ    30    15   45
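A sketch reproducing the two-proportion test from the same figures:

```python
# Sketch: z test for two independent proportions with a pooled
# probability estimate.
import math

z = 1.95996
p1, n1 = 20 / 30, 30     # p(b | a)
p2, n2 = 5 / 15, 15      # p(b | ¬a)

p_pool = 25 / 45         # pooled estimate p(b)
s_prime = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(f"z.s' = {z * s_prime:.3f}")                   # 0.308
print("significant?", abs(p1 - p2) > z * s_prime)    # True
```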
z test for 2 independent proportions • Identical to a standard 2 × 2 χ² test • So you can use the usual method! • BUT: these tests have different purposes • 2 × 1 goodness of fit compares a single value a with its superset A • assumes only a varies • 2 × 2 test compares two values a, ¬a within a set A • both values may vary • Q: Do we need χ²? [Figure: goodness of fit χ² compares a with A; 2 × 2 χ² compares a and ¬a]
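The corresponding numerical check: the two-proportion z score squared equals the uncorrected 2 × 2 χ² statistic. A sketch:

```python
# Sketch: z squared = chi-square for the 2x2 table
# (no continuity correction).
import math

table = [[20, 5],    # b:  a, ¬a
         [10, 10]]   # ¬b: a, ¬a
N = 45

# 2x2 chi-square from observed and expected cell frequencies
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
chi_sq = sum((table[i][j] - row_totals[i] * col_totals[j] / N) ** 2
             / (row_totals[i] * col_totals[j] / N)
             for i in range(2) for j in range(2))

# two-proportion z score
p1, n1, p2, n2 = 20 / 30, 30, 5 / 15, 15
p_pool = 25 / 45
z_score = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(chi_sq, z_score ** 2)   # both 4.5
```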
Larger χ² tests • χ² is popular because it can be applied to contingency tables with many values • r × 1 goodness of fit χ² tests (r ≥ 2) • r × c χ² tests for homogeneity (r, c ≥ 2) • z tests have 1 degree of freedom • strength: significance is due to only one source • strength: easy to plot values and confidence intervals • weakness: multiple values may be unavoidable • With larger χ² tests, evaluate and simplify: • Examine χ² contributions for each row or column (see the sketch below) • Focus on alternation - try to test for a speaker choice
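A sketch for a larger table, assuming scipy is available; the counts are illustrative. The per-cell contributions help locate which rows or columns drive a significant result.

```python
# Sketch: r x c chi-square test with per-cell contributions.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20,  5],     # illustrative 3 x 2 counts
                     [10, 10],
                     [15, 25]])

chi2, p_value, dof, expected = chi2_contingency(observed)
contributions = (observed - expected) ** 2 / expected

print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
print(contributions.round(3))   # examine cell by cell
```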
How big is the effect? • These tests do not measure the strength of the interaction between two variables • They test whether the strength of an interaction is greater than would be expected by chance • With lots of data, a tiny change would be significant • Don’t use χ², p or z values to compare two different experiments • A result significant at p<0.01 is not ‘better’ than one significant at p<0.05 • There are a number of ways of measuring ‘association strength’ or ‘effect size’
Percentage swing • Compare probabilities of a DV value (a, AJP) across a change in the IV (gender): • swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = –0.3 • As a proportion of the initial value • % swing d% = d / p(a | b) = –0.3/0.8 = –37.5% • We can even calculate confidence intervals on d or d% • Use the z test for two independent proportions (we are comparing differences in p values)

        a    ¬a    Σ
  b    20     5   25
 ¬b    10    10   20
  Σ    30    15   45
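A sketch reproducing the swing arithmetic:

```python
# Sketch: swing d and percentage swing d% for the DV value a.
p_a_given_b = 20 / 25       # p(a | b)  = 0.8
p_a_given_not_b = 10 / 20   # p(a | ¬b) = 0.5

d = p_a_given_not_b - p_a_given_b     # swing = -0.3
d_percent = d / p_a_given_b           # % swing = -0.375
print(f"d = {d:.2f}, d% = {d_percent:.1%}")   # d = -0.30, d% = -37.5%
```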
Cramér’s φ • Can be used on any χ² table • Mathematically well defined • Probabilistic (cf. swing d ∈ [–1, +1], d% = ?) • φ = 0: no relationship between A and B • φ = 1: B strictly determines A • straight line between these two extremes • an ‘averaged’ swing • Based on χ² • φ = √( χ²/N ) (2 × 2), N = grand total • φc = √( χ²/((k – 1)N) ) (r × c), k = min(r, c) • Can be used for r × 1 goodness of fit tests • Recalibrate using methods in Wallis (2012) • Better indicator than percentage swing

φ = 0:                        φ = 1:
        a     ¬a    Σ                 a    ¬a    Σ
  b    0.5   0.5    1           b     1     0    1
 ¬b    0.5   0.5    1          ¬b     0     1    1
  Σ     1     1     2           Σ     1     1    2
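A sketch of the φ calculation from a χ² statistic, using χ² = 4.5 from the 2 × 2 test above:

```python
# Sketch: Cramér's phi from chi-square. For a 2x2 table k = 2,
# so phi = sqrt(chi2/N); phi_c generalises to r x c tables.
import math

def cramers_phi(chi_sq, N, r=2, c=2):
    """phi_c = sqrt(chi2 / ((k - 1) * N)), where k = min(r, c)."""
    k = min(r, c)
    return math.sqrt(chi_sq / ((k - 1) * N))

print(cramers_phi(4.5, 45))   # 0.316 for the 2x2 table above
```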