Lecture 4

Lecture 4: An RPG approach to hypothesis testing

Everybody, roll a D20 for an implausibility check

Modifier + dice roll > 19

Guess the modifier: If it quacks like a duck … • Suppose we want to know whether a character has a 0 modifier for a trait whose check is passed only when modifier + D20 roll > 19 (a character with a 0 modifier must roll a 20). • Note whether the check is passed. • If it is passed, assume the modifier is greater than 0. • If it is failed, assume the modifier is 0.

A problem • Characters with very small modifiers will probably fail the check, so we would wrongly assume that their modifier is 0. This is called a Type 2 error. • So the test works best if the character has a large modifier. • A non-significant result does not “prove” that the character has a 0 modifier.

Power • A test is powerful if it rejects with high probability when the null hypothesis is false. • Power is defined against particular alternatives. • The modifier test is powerful against the alternative that the modifier is 16 • The modifier test is weak against the alternative that the modifier is 4.
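The numbers behind these two bullets can be worked out directly from the rule "Modifier + dice roll > 19". The sketch below assumes a fair D20; the function name `pass_probability` is made up for illustration.

```python
from fractions import Fraction

def pass_probability(modifier, sides=20, threshold=19):
    """P(modifier + roll > threshold) for one roll of a fair die."""
    passing = sum(1 for roll in range(1, sides + 1) if modifier + roll > threshold)
    return Fraction(passing, sides)

# Under the null hypothesis (modifier 0) only a natural 20 passes.
print("significance level:", pass_probability(0))    # 1/20
# Power against two particular alternatives.
print("power vs modifier 16:", pass_probability(16))  # 17/20
print("power vs modifier 4:", pass_probability(4))    # 1/4
```

So the check has a 5% chance of a false alarm, 85% power against a modifier of 16, and only 25% power against a modifier of 4.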

Gaining power • Increase the sample size • Use a powerful test (technical stats issue) • Refine the study design to reduce variance
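To illustrate the first bullet, one could let the character attempt the check several times and count the passes. The sketch below is an assumption layered on the dice example, not something from the slides: it uses an exact one-sided binomial test, rejecting "modifier is 0" when the number of passes reaches the smallest count whose probability under the null is at most 5%. The helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, p_null, alpha):
    """Smallest k such that P(X >= k | null) <= alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty tail, so a value is always found
        if binom_tail(n, k, p_null) <= alpha:
            return k

def power(n, p_null, p_alt, alpha=Fraction(1, 20)):
    """Probability of rejecting the null when the pass probability is p_alt."""
    return binom_tail(n, critical_count(n, p_null, alpha), p_alt)

p_null = Fraction(1, 20)  # pass probability with modifier 0
p_alt = Fraction(1, 4)    # pass probability with modifier 4
for n in (1, 20):
    print(n, "rolls: power =", round(float(power(n, p_null, p_alt)), 3))
```

With a single roll the power against a modifier of 4 is 0.25; with 20 rolls it rises to roughly 0.77 under these assumptions.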

Some problems with NHST: Multiple testing

Multiple testing • If I roll the die often enough, I will eventually pass the implausibility check • The same applies to hypothesis testing • Repeated tests on the same data set, within the same study, may yield a spurious “significant” result • A spurious “significant” result is called a Type 1 error
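How quickly does the chance of a spurious pass grow? A minimal sketch, assuming the tests are independent and each has a 5% false-alarm rate (real tests on the same data set are usually correlated, so this is only a rough guide). The function name is made up for illustration.

```python
def familywise_error(n_tests, alpha=0.05):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 14, 20):
    print(n, "tests:", round(familywise_error(n), 3))
```

Under these assumptions, 14 tests are already enough to make at least one spurious “significant” result more likely than not.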

My recommendation • It is best to save the hypothesis test for the primary outcome • Use confidence intervals and effect sizes for secondary outcomes

Other possibilities • Adjust the alpha level of significance to take into account the fact that many tests are being made.
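One common way to make such an adjustment is the Bonferroni correction, which divides alpha by the number of tests. The slide does not name a specific method, so this is offered only as one example; the p-values below are invented purely for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted level alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from four tests in one study (not real data).
p_values = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_significant(p_values))  # [True, False, False, False]
```

Three of these four p-values are below 0.05, but only one survives the adjusted threshold of 0.0125.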

Some problems with NHST: The p-value is not as informative as one might think

What is p (the p-value)?

The correct answer • The correct answer is c) • The p-value is the probability of getting something at least as extreme as what one got, assuming that the null hypothesis is true.

p-value and sample size • The p-value is a function of the sample size • If the null is false (even by a small amount) a large enough sample will yield a small p-value • A very large study will almost certainly yield a significant result, even when nothing interesting is happening. • A very small study will almost certainly yield a non-significant result, even when the intervention is effective.
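A small sketch of this dependence, under simplifying assumptions that are not in the slides: a one-sample z-test with known standard deviation 1, and an observed mean that sits exactly 0.05 standard deviations from the null value at every sample size. The function name is hypothetical.

```python
from math import erfc, sqrt

def two_sided_p(observed_diff, n, sd=1.0):
    """Two-sided z-test p-value for a sample mean that differs from the null by observed_diff."""
    z = observed_diff * sqrt(n) / sd
    return erfc(abs(z) / sqrt(2))

# The same tiny difference, measured with more and more subjects.
for n in (100, 1600, 10000):
    print(n, "subjects: p =", two_sided_p(0.05, n))
```

The difference never changes, yet the p-value goes from clearly non-significant at 100 subjects to below 0.05 at 1600 and to a tiny value at 10000.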

Abuses of NHST • Fishing expeditions • No clear hypothesis • Many measurements of interest • Measurements with a high degree of variability or uncertain distributions • Convenience samples • Cult-like adherence to • With electronic computers, very large databases are available for analysis. • Alternatively, underpowered studies • Relying on the statistician to come up with the research question

A possible solution • Quote the estimate and its confidence interval and/or • Quote an effect size Cohen’s d: • The effect size is independent of the sample size. • Alternate suggested effect size: This statistic falls between 0 and 1. There are rules of thumb for what constitutes a large, medium or small effect.
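For reference, Cohen’s d for two independent groups is usually defined as the difference in means divided by the pooled standard deviation. The sketch below assumes that standard definition; the function name and the toy numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation (assumed definition)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Toy data: means 4 and 3, both sample variances equal to 4, so d = 1 / 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

Because the standard deviation, not the standard error, is in the denominator, d does not shrink or grow systematically as more subjects are added.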

Problems with the effect size • The effect size is sometimes taken to represent some sort of absolute measure of meaningfulness • Measures of meaningfulness need to come from the subject matter.

Advantages of the p-value • The p-value measures the strength of the evidence that you have against the null hypothesis. • The p-value is a pure number (no unit of measurement) • A common standard across all experiments using that methodology

Ideal conditions for NHST • Carefully designed experiments • Everything randomized that should be randomized • One outcome of interest • No more subjects than necessary to achieve good power • Structure of measurements known to be normal (or whatever distribution is assumed by the test)

The take-away: Most important points to remember

The p-value is a function of the sample size • It’s not a measure of truth; it’s a measure of evidence • A non-significant result does not prove the null hypothesis to be true • If the data are matched, analyse them as matched pairs. • The t-test is fairly tolerant of departures from normality • The t-test is sensitive to differences in variance when the sample sizes are unequal (when in doubt, use the Welch test).
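For the last point, here is a minimal sketch of what the Welch test computes: a t statistic that uses each group’s own variance, and the Welch-Satterthwaite approximation for the degrees of freedom. The function name and toy data are hypothetical, and the sketch stops short of the p-value, which needs the t distribution (for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`, if SciPy is available).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two independent samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t, df = welch_t([2, 4, 6], [1, 3, 5])
print(t, df)
```

With equal group sizes and equal variances, as in this toy data, the degrees of freedom match the ordinary two-sample t-test (here 4); they drop below that as the variances or group sizes diverge.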

Vocabulary The following are equivalent • The significance level, α • The probability of a Type 1 error The following are related • The probability of a Type 2 error, β • The power of the test, 1 − β Difference between α and β: • α is set by the experimenter • β is a consequence of the design.

Pop quiz • What is the difference between the significance level of a test and the p-value of that test?
