PALA Summer School 2014: Inferential Statistics Willie van Peer, Ludwig-Maximilians-University Munich, w.vanpeer@gmail.com
Inferential statistics • Also called ‘test statistics’ • sample ---> population? • Tests whether results observed in a sample may be generalized to the population. • Not as a ‘yes’ or ‘no’, but as a probability. • Statistics is a discipline in which such probabilities are investigated.
Sample vs. Population • Suppose you wish to investigate whether ‘free’ or ‘guided’ reading lessons in school yield different pedagogical results. • You will have to make observations, ask questions, in one word: collect data. • But impossible to ask ALL pupils in your country. • So you make a SELECTION: a sample. • But you are not interested in this sample only: • You want to go beyond the sample: to the population (here in the statistical sense!)
This is a generalization • Beyond the sample. • But this can be tricky / dangerous / fatal (!) • Gunners in WWII bombers mostly said the attacks came from behind and above. • Can one generalize these answers? • In order to be able to, the sample has to be representative of the population. • Is that condition fulfilled here? • Of course it is NOT. Think why!
Suppose you have followed the two instruction methods for reading (‘free’ or ‘guided’) in 4 Maribor schools for 2 months. • Your data suggest that the ‘guided’ method yields superior pedagogic effects for boys. • Are you able to generalize your findings to: • All boys in Maribor schools? (Most probably) • All pupils in Maribor schools? (Certainly not) • All Slovene boys? (Difficult, maybe) • All Slovene pupils? (Definitely not) • All pupils? (no way)
A paradox • The paradox of sampling: you need to know what you are in fact trying to find out… • If your sample is not representative, its data will be misleading, but how do you know whether it is representative? • Another serious problem: self-selection of participants! • To avoid sampling problems when asking people in the street: use random numbers, or accost every 4th or 5th person who walks by.
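A minimal Python sketch (illustrative only, not part of the original slides) of the two sampling strategies just mentioned: a simple random sample drawn with random numbers, and a systematic ‘every n-th person’ sample. The pupil list and the sample sizes are invented.

```python
import random

# Hypothetical sampling frame: a list of pupil IDs (invented for illustration).
pupils = [f"pupil_{i}" for i in range(1, 1001)]

random.seed(42)                      # fixed seed so the sketch is reproducible
simple = random.sample(pupils, 50)   # simple random sample of 50 pupils

# Systematic alternative: start at a random point, then take every 20th pupil.
start = random.randrange(20)
systematic = pupils[start::20]
```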
Errors • 2 types: • constant errors (E-group in Maribor, C-group in Ljubljana!) • random errors (the weather, the time of the day/year, the general mood in the country, …) • Constant errors must be brought under control at all costs, e.g. through randomization. This does not eliminate errors, but turns them into random errors. • Random errors cannot be avoided! • If we nevertheless find an effect in the E-group, it is a ‘robust’ effect!
Hence • We must estimate how great the probability is that the effect came about through random errors. • I.e.: how probable is it that the unavoidable random errors alone, rather than the IV, created the observed effect? • When this is not particularly probable, we decide that the IV had an effect on the DV. • But when do we judge something ‘not particularly probable’?
An example • We wish to know whether reading a story with a sad ending is judged more rewarding than reading a story with a ‘happy end’. • Imagine we asked 8 people, 7 of whom said they preferred the sad version, and only 1 the happy version. • How probable is such a result? • To investigate this, let us start from the fact that every informant had two possible choices (prefer ‘sad’, or prefer ‘happy’), both therefore having a probability of 50 %.
Probability • Each informant (VP) has two possible answers: + or - [+ here means: prefer the sad version] • With 2 informants there are 4 possible results (2²): • VP1 + VP2 + • VP1 + VP2 - • VP1 - VP2 + • VP1 - VP2 -
Out of these 4 possibilities • 2 x + occurs once: p = 1/4 = 0.25 • 1 x + occurs twice: p = 2/4 = 0.50 • 0 x + occurs once: p = 1/4 = 0.25
Tree structure for 4 Ss • [Tree diagram: each of S1–S4 branches into + and -, giving every possible answer pattern.] • Now there are 16 possibilities in all: 2⁴
The 16 possibilities are distributed as follows: • 4 x +: 1/16 (p = 0.0625) • 3 x +: 4/16 (p = 0.2500) • 2 x +: 6/16 (p = 0.3750) • 1 x +: 4/16 (p = 0.2500) • 0 x +: 1/16 (p = 0.0625)
For 8 Ss • 8 x +: 1/256 (p = 0.004) • 7 x +: 8/256 (p = 0.031) • 6 x +: 28/256 (p = 0.110) • 5 x +: 56/256 (p = 0.220) • 4 x +: 70/256 (p = 0.270) • 3 x +: 56/256 (p = 0.220) • 2 x +: 28/256 (p = 0.110) • 1 x +: 8/256 (p = 0.031) • 0 x +: 1/256 (p = 0.004)
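Both tables can be reproduced by counting outcomes. A minimal Python sketch (not part of the original slides), using the binomial coefficient to count how many of the 2ⁿ equally likely answer patterns contain exactly k ‘+’ answers:

```python
from math import comb

n = 8                          # number of informants (use n = 4 for the smaller table)
for k in range(n, -1, -1):
    outcomes = comb(n, k)      # how many of the 2**n patterns contain exactly k '+'
    p = outcomes / 2**n
    print(f"{k} x +: {outcomes}/{2**n}   p = {p:.3f}")
```

For k = 7 this gives 8/256 = 0.031, the value used on the next slide.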
p • 7 out of 8 informants said they preferred the ‘sad’ version of the story. • The probability of this we now know: p = 0.031 • p = probability, varying between 0 (NEVER happens) and 1 (ALWAYS happens). • Moving the decimal point 2 places to the right gives the percentage. • p = 0.031 means: 3.1 % probability • = the probability that the results came about through random errors! • I.e. the error probability. • Which must be as low as possible!
Because random errors will NEVER go away • p means the probability that we falsely conclude that the IV had an effect on the DV. • How certain are we therefore? 100 - 3.1 = 96.9 % • When we repeat this experiment 100 times, we will on average find the same results 96.9 times. • In such a situation it is allowed to say that the ending of a story has an effect on readers’ preference.
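In practice this ‘count the + answers’ logic is available directly as an exact binomial test, e.g. in scipy (a sketch, assuming scipy is available; note that the standard test sums the whole tail, i.e. 7 or 8 ‘+’, giving 9/256 ≈ 0.035 rather than the single value 0.031 for exactly 7):

```python
from scipy.stats import binomtest

# 7 of 8 informants preferred the sad ending; H0: both endings equally likely (p = 0.5).
result = binomtest(7, n=8, p=0.5, alternative='greater')   # one-tailed: more 'sad' than chance
print(result.pvalue)   # ≈ 0.035 = P(7 or 8 '+') = 9/256
```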
Graphically • Distribution of the number of + when only random errors are at stake. • Both 0 + and 8 + are rare (p = 0.004 = 0.4 %) • So are 1 + and 7 + (p = 0.031) • How low must p be? • There is no ultimate answer • because random errors remain!
H0 vs. Ha • We are testing a hypothesis. • Usually a hypothesis of difference (between groups) = Ha, the alternative hypothesis. • Its logical opposite is the hypothesis of no difference = H0 (the ‘null hypothesis’). • We try to REJECT the Ha. • If we do not succeed, we reject the null hypothesis (H0) instead. • But watch out: in science we have to be cautious!
Alpha (α) • Choose a ‘significance level’ (= α) • If α is set high: high probability of making a Type 1 error: concluding that the IV had an effect on the DV when in reality it did not. • If α is set very small: high chance of making a Type 2 error: accepting the H0 although it is wrong.
Error types and alpha • Type 1 Error: we falsely accept the Ha (we think the IV had an influence, but it does not). • Type 2 Error: we falsely accept the H0 (we think the IV had no influence, but it does).
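The meaning of α can be made concrete with a small simulation (a sketch, not from the slides; the data and group sizes are invented): when H0 is true by construction, a test at α = .05 still flags a ‘significant’ difference in about 5 % of experiments, which is exactly the Type 1 error rate.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups are drawn from the SAME distribution: H0 is true by construction.
    e_group = rng.normal(loc=100, scale=15, size=50)
    c_group = rng.normal(loc=100, scale=15, size=50)
    if ttest_ind(e_group, c_group).pvalue < alpha:
        false_positives += 1

print(false_positives / n_experiments)   # close to 0.05: the Type 1 error rate equals alpha
```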
However, • since we have no means of knowing whether the H0 is really true or false, • all we can do is reduce the uncertainty of our decision • and thereby reduce the chance of making a Type 1 error. • There are no certainties, only probabilities, in statistics.
Compare to a case in court • When very weak evidence for a crime is accepted by a court of law, a lot of (innocent) people are going to be convicted. • If a court accepts only the strongest form of evidence, a lot of criminals will go free without a conviction. • So … some kind of balance is needed. • And this balance can best be struck if you know a bit about statistics.
A memory-enhancing drug • We select 100 students, 50 of whom get the drug and the other 50 a placebo (without the students themselves knowing who got what!) • We then give them an exam which is heavily dependent on memory. • Results are scored by examiners who do not know which student got which pill. • This is called a double-blind design: • Neither observer nor observed know who is who.
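A rough Python sketch of how such a design might be analysed (the scores below are simulated, not real data; the t-test itself is introduced later in these slides):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2014)

# Hypothetical exam scores (0-100); the drug group is simulated with a slightly higher mean.
drug    = rng.normal(loc=68, scale=10, size=50)
placebo = rng.normal(loc=63, scale=10, size=50)

t, p = ttest_ind(drug, placebo)
print(f"t = {t:.2f}, p = {p:.4f}")   # p < .05 -> reject H0 (no difference between the pills)
```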
How big must the difference be? • This is a somewhat misleading question. • It is like asking “How tall must you be to become a good basketball player?” • Well, ideally as tall as possible. • But there is no clear-cut height below which you cannot dream of it. • So it is a sliding scale. • So it is with p-values: the lower the better! • But statisticians have established a conventional level: p < .05
But does this mean that p = .049 is significant, while p = .051 is not? • That is to fundamentally misunderstand the nature of p-values. • The criterion of .05 is merely a convention. • The lower the p-value, the more confident we are that we may reject the H0. • If that level is marginally above the .05 criterion, it does not mean that the Ha has no plausibility. • It is exactly this sliding scale that makes significance values so informative. • BTW: .05 means one in twenty!
Within this range lie 95 % of all observations. • Outside this range remain 5 % of all observations. • This is the level of random error we are ready to accept. • For observations falling outside this range, we say that the IV had an effect on the DV • and therefore we reject the H0.
Since we know about the normal distribution • We know that 68 % of all values lie within 1 SD of the mean • and 95 % within 2 SD. • Mean = 4.00, SD = 1.44 • 2 SD = 2.88 • I.e. between 1.12 and 6.88. • Our observations lie outside this range. • Hence: significant!
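A quick check of these figures (a sketch assuming scipy; the mean of 4.00 and SD of 1.44 are taken from the slide):

```python
from scipy.stats import norm

mean, sd = 4.00, 1.44

# Proportion of a normal distribution within 1 and 2 SD of the mean.
print(norm.cdf(1) - norm.cdf(-1))    # ≈ 0.68
print(norm.cdf(2) - norm.cdf(-2))    # ≈ 0.95

# The range mean ± 2 SD quoted on the slide.
print(mean - 2 * sd, mean + 2 * sd)  # 1.12 ... 6.88
```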
Region of rejection • p < 0.05 = ‘significant’ / p < 0.01 = ‘highly significant’ / p < 0.001 = ‘very highly significant’ • The significance level leads to a separation between: 1) the area where only random errors had an effect, and 2) the area where the IV had an effect on the DV (the critical region), • where we reject the H0.
NB • A significant difference does NOT imply a value judgment. • It merely tells us how likely it is that the results are due to chance. • Whether this leads to any change (for instance in instruction methods, a new medicine, etc.) has to be decided on grounds other than statistical ones, • e.g. how much it costs (in time, money, learning curve, …), what the consequences are of not changing anything, etc.
Comparison of means • In general: between E- and C-groups • or between 2 E-groups • To compare the two: specific statistical techniques (= tests --> test statistics = inferential statistics) • A matrix of: measurement level + normal distribution (yes/no?) + type of sample (independent / dependent) • Better still: a decision chart (see Scientific Methods for the Humanities, pp. 231 ff.)
Levels of measurement • Nominal: putting a variable into a category, e.g. gender, place of living, political preference, etc. • Ordinal: ordered categories, e.g. education level, EFL proficiency level, preference for a musical composer, price of a car, … [what is lacking is the distance between ranks: is the 2nd composer only half as good as the first one? And the 3rd one?] • Interval: a scaled order, with equal distances. • Ratio: likewise, but now with a zero-point, e.g. age, dividing 100 points among 4 authors, …
Three possibilities • The means of the samples differ • The variance of the samples differs • Both the mean and the variance differ • In each case, we apply statistical tests to estimate the significance of the differences. • When p is below the conventional level of 5 % (error probability), we accept that the sample differences may be generalized to the population.
Variables • Attributes, characteristics, qualities, etc. • E.g. gender, age, nationality, but above all: • the ‘treatment’ (what you think exerts an influence) • = independent variable (IV). • The ‘reactions to the treatment’ (what you expect the influence to be) • = dependent variable (DV). • The IV causes the DV; the DV is caused by the IV.
Kinds of tests • t-test: 1 IV (2 groups), 1 DV • ANOVA: 1 or more IVs (more than 2 groups), 1 DV • MANOVA: 1 or more IVs, more than 1 DV (= GLM) • But these are parametric tests: they presuppose that your data are normally distributed and measured at least at interval level. • How do I know whether my data follow a normal distribution?
The Kolmogorov-Smirnov test • This test takes an ideal (normal) distribution • and projects your distribution onto it • and then gauges whether the two differ significantly from each other. • ‘Significance’ here in the statistical sense! • Meaning: the error probability < .05 • Or, in the table of results: p < .05. • In that case, your data are NOT normally distributed!
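A sketch of how such a normality check might look in Python (the scores are invented; note that, strictly speaking, estimating the mean and SD from the same data calls for a corrected variant such as the Lilliefors test, or the Shapiro-Wilk test as an alternative):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=80)   # hypothetical interval-level data

# Compare the observed distribution with a normal curve using the sample's own mean and SD.
stat, p = kstest(scores, 'norm', args=(scores.mean(), scores.std(ddof=1)))
print(f"D = {stat:.3f}, p = {p:.3f}")
# p < .05 would mean the data deviate significantly from normality -> use non-parametric tests.
```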
What to do in that case? • The parametric tests assume a number of things (interval measurement, normal distribution, etc.) • When these assumptions are not fulfilled: use non-parametric tests! • For 2 independent / dependent samples • and for k [> 2] independent / dependent samples • Independent samples: no overlap between the samples. • Dependent samples: the same people.
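A minimal sketch of the two-sample non-parametric options in scipy (the ratings below are invented; for k > 2 samples the analogous functions are kruskal and friedmanchisquare):

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Hypothetical preference ratings (ordinal, 1-7) from two INDEPENDENT groups.
free   = [4, 5, 3, 6, 4, 5, 2, 4]
guided = [5, 6, 6, 7, 5, 6, 4, 6]
print(mannwhitneyu(free, guided))    # 2 independent samples (Mann-Whitney U)

# Hypothetical ratings from the SAME readers before and after a treatment.
before = [3, 4, 2, 5, 4, 3, 4, 5]
after  = [4, 6, 3, 6, 5, 4, 5, 7]
print(wilcoxon(before, after))       # 2 dependent samples (Wilcoxon signed-rank)
```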
One- or two-tailed? • The Ha says only THAT there will be a difference between the 2 groups (without a direction of the difference) = two-tailed. • WITH a direction (e.g. E > C) = one-tailed. • In the one-tailed case: divide the two-tailed p-value by 2. • But note that this is a controversial issue among statisticians. • E.g.: if you already know the direction of the hypothesis, why do you still need to test for significance?
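In scipy the choice is made explicit through the `alternative` parameter rather than by halving p by hand (a sketch with invented data, assuming the directional hypothesis E > C):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
e_group = rng.normal(loc=72, scale=10, size=40)   # hypothetical scores; Ha: E > C
c_group = rng.normal(loc=68, scale=10, size=40)

two_tailed = ttest_ind(e_group, c_group).pvalue
one_tailed = ttest_ind(e_group, c_group, alternative='greater').pvalue
print(two_tailed, one_tailed)   # the one-tailed p is half the two-tailed p
                                # (when the difference lies in the predicted direction)
```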
F-ratio • F = (between-groups variance) / (within-groups variance), i.e. the ratio of the two mean squares. • When H0 is true: F ≈ 1 • F > 1 indicates an effect • Very high F-values mean very low p-values! • I.e. the result is very unlikely to be due to chance! • Hence: accept the Ha
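A sketch of the F-ratio computed by hand and checked against scipy’s one-way ANOVA (the three groups of scores are invented for illustration):

```python
import numpy as np
from scipy.stats import f_oneway

# Three hypothetical groups of scores.
g1 = np.array([4, 5, 6, 5, 4])
g2 = np.array([6, 7, 6, 8, 7])
g3 = np.array([5, 5, 6, 4, 5])
groups = [g1, g2, g3]

grand_mean = np.concatenate(groups).mean()
k = len(groups)
n_total = sum(len(g) for g in groups)

# Between-groups variance (mean square): spread of the group means around the grand mean.
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
# Within-groups variance (mean square): spread of the scores around their own group mean.
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n_total - k)

print(ms_between / ms_within)   # the F-ratio
print(f_oneway(g1, g2, g3))     # scipy reports the same F together with its p-value
```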