
Lecture 5


Presentation Transcript


  1. Lecture 5 Hypothesis testing

  2. What should you know? • Confidence intervals (Wald and bootstrap) • p-value • how to find a normal probability (relates to a p-value) • how to find a normal quantile (relates to the confidence interval) • Central limit theorem (i.e. the standard deviation of the mean is $\sigma/\sqrt{n}$ and the distribution is approximately normal) • histogram (good for looking at data, assessing skewness) • quantile plot (good for assessing normality) • box plot (good for comparing samples) • two-sample t-test and its assumptions • power of a test • Type 1 and Type 2 error
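Several of these items can be checked numerically. A minimal sketch in Python (scipy assumed; the numbers are made up for illustration):

```python
from math import sqrt
from scipy.stats import norm

# Normal probability (relates to a p-value): P(Z > 1.96) for standard normal Z.
print(norm.sf(1.96))        # ~0.025

# Normal quantile (relates to a confidence interval): the 0.975 quantile.
print(norm.ppf(0.975))      # ~1.96

# Central limit theorem: the standard deviation of the mean of n
# observations with standard deviation sigma is sigma / sqrt(n).
sigma, n = 2.0, 100
print(sigma / sqrt(n))      # 0.2
```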

  3. Example t-test and confidence interval

  4. Example • Load SomeData1.sav • Do the test • Observe the confidence interval • Do a box plot

  5. Confidence interval • 95% confidence interval for the difference between the means: $(\bar{x}_1 - \bar{x}_2) \pm t_{0.975}\,\mathrm{SE}(\bar{x}_1 - \bar{x}_2)$, where $t_{0.975}$ is the 0.975 quantile of a t distribution with $n_1 + n_2 - 2$ degrees of freedom. • This is a Wald interval: estimate plus/minus quantile × std.err.
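The interval can be computed directly. A hedged sketch, assuming equal variances, with invented sample values:

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 4.9, 6.2, 5.8, 5.5])   # invented sample 1
y = np.array([4.2, 4.8, 4.4, 5.0, 4.1])   # invented sample 2

n1, n2 = len(x), len(y)
# Pooled variance and the standard error of the difference in means.
sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
# 0.975 quantile of a t distribution with n1 + n2 - 2 degrees of freedom.
t975 = stats.t.ppf(0.975, df=n1 + n2 - 2)

diff = x.mean() - y.mean()
print(diff - t975 * se, diff + t975 * se)  # estimate +/- quantile * std.err
```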

  6. Return to the example SomeData1.sav • How would you calculate the pooled standard deviation from this output? • Take the standard error for the difference and divide by $\sqrt{1/n_1 + 1/n_2}$ • Or, use the standard deviations from each sample and do: $s_p = \sqrt{\dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$
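Both recipes are easy to script. A sketch with invented output values (the sample sizes, standard error, and standard deviations are placeholders):

```python
import numpy as np

n1, n2 = 12, 15      # invented sample sizes
se_diff = 0.85       # invented standard error of the difference, from output

# From the standard error of the difference:
sp_from_se = se_diff / np.sqrt(1 / n1 + 1 / n2)

# Or from the two sample standard deviations:
s1, s2 = 2.1, 1.8    # invented
sp_from_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

print(sp_from_se, sp_from_sd)
```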

  7. Why do we care? • When doing a sample size calculation or a meta-analysis, you sometimes need to be able to retrieve the standard deviation from output that displays different information.

  8. Recall the classic hypothesis testing framework • State the hypotheses • Compute the test statistic • Calculate the p-value • If the p-value is less than the significance level (say 0.05), reject the null • Otherwise, do not reject the null
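The whole framework fits in a few lines. A minimal sketch with simulated data (group means and sizes are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=30)   # invented: group A
b = rng.normal(0.5, 1.0, size=30)   # invented: group B, shifted mean

t_stat, p_value = stats.ttest_ind(a, b)   # two-sample t-test, equal variances
alpha = 0.05                              # significance level
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null")
else:
    print(f"p = {p_value:.3f}: do not reject the null")
```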

  9. Technical point • If the p-value is less than $\alpha$, say "there is sufficient evidence to reject the null hypothesis." • If the p-value is greater than $\alpha$, say "there is insufficient evidence to reject the null", because: • Either the null is true • Or the sample size was not large enough to detect the alternative • Or the alternative is very close to the null (so we could not detect it) • Or we got unlucky

  10. Significance What is the probability that we reject the null when the null is true? (i.e. probability of a type 1 error)

  11. Power Is the sample size large enough to reject the Null when the null is false?

  12. Everybody, roll a D20 for an implausibility check

  13. Guess the modifier. If it quacks like a duck … • Suppose we want to know whether a character has a 0 modifier for a trait checked with a D20. • Note whether the check is passed. • If passed, conclude the modifier is greater than 0. • If failed, do not conclude the modifier is greater than 0.

  14. Modifier + dice roll > 19 • Under the null (modifier = 0), only a natural 20 passes, so the probability of passing is 1/20 = 0.05, the usual significance level.
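A quick simulation confirms the pass rate under the null (the number of rolls is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(20)
rolls = rng.integers(1, 21, size=100_000)   # fair D20: integers 1..20

modifier = 0                                # the null hypothesis
pass_rate = np.mean(rolls + modifier > 19)  # only a natural 20 passes
print(pass_rate)                            # ~0.05: the type 1 error rate
```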

  15. A problem • Note that characters with very small modifiers will probably fail the test. This is called a Type 2 error. • So the test works best if the character has a large modifier. • A non-significant result does not “prove” that the character has a 0 modifier.

  16. Power • The power of a test is the probability of rejecting the null when the null is false. • Power is defined against particular alternatives. • The modifier test is powerful against the alternative that the modifier is 16 • The modifier test is weak against the alternative that the modifier is 4.
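The same simulation gives the power against these two alternatives (pass requires modifier + roll > 19):

```python
import numpy as np

rng = np.random.default_rng(5)
rolls = rng.integers(1, 21, size=100_000)   # fair D20 rolls

for modifier in (4, 16):
    power = np.mean(rolls + modifier > 19)  # P(pass) under this alternative
    print(modifier, power)                  # ~0.25 for 4, ~0.85 for 16
```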

  17. Gaining power • Increase the sample size • Use a powerful test (technical stats issue) • Refine the study design to reduce variance These are the same techniques used to narrow the confidence interval.

  18. Some problems with NHST Multiple testing

  19. Multiple testing • If I roll the dice often enough, I will pass the implausibility check • This applies to hypothesis testing • Repeated tests on the same data set, within the same study, may yield a spurious “significant” result • This is called a type 1 error

  20. Example SomeData1.sav

  21. When the null is true • Open SPSS • Go to Transform -> Random Number Generators -> Set Active Generator -> Mersenne Twister • -> Set Starting Point -> Random Start • Load SomeData1.sav • Add a column of random normal values (mean 0, sd 1) • Go to Analyze -> Compare Means -> Independent-Samples T Test • At least one person in the class should get a significant result (p < 0.05)
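The same demonstration can be scripted: when the null is true, roughly 5% of tests still come out "significant". A sketch with invented sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n = 1000, 20
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0, 1, n)   # both groups drawn from the same distribution
    b = rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(false_positives / n_tests)   # ~0.05
```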

  22. My recommendation • It is best to save the hypothesis test for the primary outcome • Use confidence intervals and effect sizes for secondary outcomes

  23. What does the p-value tell me? The p-value is not as informative as one might think

  24. What is p (the p-value)?

  25. The correct answer • The p-value is the probability of getting something at least as extreme as what one got, assuming that the null hypothesis is true.
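"At least as extreme" is a probability computation. For a two-sided z-test, a minimal sketch (the observed statistic is invented):

```python
from scipy.stats import norm

z_observed = 2.1                   # invented test statistic
p = 2 * norm.sf(abs(z_observed))   # P(|Z| >= |z_observed|) under the null
print(p)                           # ~0.036
```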

  26. p-value and sample size • The p-value is a function of the sample size • If the null is false (even by a small amount), a large sample size will yield a small p-value • A large study will almost certainly yield a significant result, even when nothing interesting is happening. • A small study will almost certainly yield a non-significant result, even when the intervention is effective.
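A simulation makes the point: hold a small true effect fixed and let n grow (the effect size and seed are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (20, 200, 2000, 20_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.1, 1.0, n)              # tiny true difference: 0.1 sd
    print(n, stats.ttest_ind(a, b).pvalue)   # p-value shrinks as n grows
```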

  27. How many subjects do I need? • A sample size calculation is an essential part of the design of any study. • The number of subjects you need depends on: • the variance of the data • the design of the study • the clinically meaningful effect that you want to be able to detect • MCID (minimal clinically important difference): the smallest change that a patient (or other subject) would view as personally important.

  28. Calculations • Simple cases can be solved analytically • More complex cases are resolved through simulation • Avoid online power calculators
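For the simple analytic case, statsmodels has power routines; the effect size below is an invented MCID expressed as Cohen's d:

```python
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # invented: meaningful difference of 0.5 sd
    alpha=0.05,        # significance level
    power=0.80,        # desired power
)
print(n_per_group)     # ~64 subjects per group
```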

  29. NHST A history of abuse

  30. Abuses of NHST • Fishing expeditions (NHST used as an exploratory technique) • Many measurements of interest (leads to multiple testing) • Measurements with a high degree of variability and uncertain distributions (normality assumption violated, so p-values not accurate) • Convenience samples (violates assumptions of randomness, independence) • Cult-like adherence to the 0.05 threshold • With electronic computers, very large databases are available for analysis; everything is significant • Alternatively, underpowered studies; nothing is significant • Relying on the statistician to come up with the research question (no clear hypothesis) • RESULT: We are a long way from the scientific method

  31. Possible solutions • Quote the estimate and confidence interval, and/or • Quote an effect size. • Never quote only the decision (reject/accept); quote the p-value

  32. What is an effect size? • A measure of the effect (difference between groups) that does not depend on the sample size. • Cohen's d: $d = (\bar{x}_1 - \bar{x}_2)/s_p$, the mean difference in units of the pooled standard deviation. • An alternative suggested effect size falls between 0 and 1. There are rules of thumb for what constitutes a large, medium, or small effect.

  33. SPSS alert • SPSS does not give you a Cohen’s d or the other effect size for the two-sample comparison. • It does give the mean difference and the confidence interval.
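Since SPSS does not report it, Cohen's d is easy to compute by hand from the formula above (sample values invented):

```python
import numpy as np

x = np.array([5.1, 4.9, 6.2, 5.8, 5.5])   # invented samples
y = np.array([4.2, 4.8, 4.4, 5.0, 4.1])

n1, n2 = len(x), len(y)
sp = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
             / (n1 + n2 - 2))              # pooled standard deviation
d = (x.mean() - y.mean()) / sp             # Cohen's d
print(d)
```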

  34. Problems with the effect size • The effect size is sometimes taken to represent some sort of absolute measure of meaningfulness • Measures of meaningfulness need to come from the subject matter • Quote the p-value, not just the decision (SPSS reports the p-value)

  35. Advantages of the p-value • The p-value measures the strength of the evidence that you have against the null hypothesis. • The p-value is a pure number (no unit of measurement) • A common standard across all experiments using that methodology • Sometimes we need to make a decision: do we introduce the new treatment or not? Hypothesis testing gives an objective criterion.

  36. Ideal conditions for NHST • Carefully designed experiments • Everything randomized that should be randomized • One outcome of interest • No more subjects than necessary to achieve good power • Distribution of measurements known to be normal (or whatever distribution is assumed by the test)

  37. Vocabulary The following are equivalent: • The significance level $\alpha$ • The probability of a type 1 error, $\alpha$. The following are related: • The probability of a type 2 error, $\beta$ • The power of the test, $1 - \beta$. Difference between $\alpha$ and $\beta$: • $\alpha$ is set by the experimenter • $\beta$ is a consequence of the design.

  38. Pop quiz • What is the difference between the significance level of a test and the p-value of that test?

  39. Answer • The significance level (0.05, say) determines whether the null hypothesis is rejected or not. • The p-value (or observed significance level) measures the degree of evidence against the null hypothesis in the data.

  40. The two-sample test Assumptions

  41. Assumptions • The sample means are normally distributed (or nearly so) • Variances are equal • Everything is independent

  42. Normality • The t-test is usually robust with respect to this condition. • If the sample is large enough, this condition will hold (by the central limit theorem). • As a reality check, a bootstrap test or a non-parametric test is possible.

  43. Bootstrap two-sample test • This is a resampling test (strictly speaking, permuting labels gives a permutation test). • The computer repeatedly permutes the group membership labels amongst the cases and calculates the t statistic for the new groups. • If the null hypothesis is true, group membership is irrelevant. • What proportion of the resampled t statistics is more extreme than the "real" one? • This proportion is the p-value of the test.
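A sketch of the label-permutation scheme just described (the sample sizes and data are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0.5, 1, 15)   # invented group 1
y = rng.normal(0.0, 1, 15)   # invented group 2

t_obs = stats.ttest_ind(x, y).statistic
pooled = np.concatenate([x, y])

n_resamples, count = 10_000, 0
for _ in range(n_resamples):
    rng.shuffle(pooled)               # permute group membership labels
    t = stats.ttest_ind(pooled[:15], pooled[15:]).statistic
    if abs(t) >= abs(t_obs):          # at least as extreme as the real one
        count += 1
print(count / n_resamples)            # the resampling p-value
```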

  44. Transforming the data • Sometimes a transformation of the data will produce something closer to normal: • Take logs • Take square roots • Other transformations are possible • My experience: this rarely works, but sometimes it does.

  45. Example: the cloud seeding data • Load clouds.csv into SPSS • Do a t-test of seeded vs unseeded data • Transform with logarithms • Repeat • Notice that there is a significant difference when the data have been transformed. • Question: does it matter whether you use natural or base-10 logarithms? (See the sketch below.)
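A sketch of the comparison, assuming clouds.csv has columns named "Seeded" and "Unseeded" (the column names are a guess):

```python
import numpy as np
import pandas as pd
from scipy import stats

clouds = pd.read_csv("clouds.csv")
seeded = clouds["Seeded"].to_numpy()      # column names assumed
unseeded = clouds["Unseeded"].to_numpy()

print(stats.ttest_ind(seeded, unseeded).pvalue)                  # raw scale
print(stats.ttest_ind(np.log(seeded), np.log(unseeded)).pvalue)  # log scale

# Hint for the question: log10(x) = ln(x) / ln(10), a constant rescaling
# of both samples, so the t statistic and p-value are unchanged.
```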

  46. Check for normality • Quantile plot on each of the two samples (SPSS does not do this easily) • Boxplot (at least gives an idea of symmetry) • Check the residuals (SPSS does not do this easily)

  47. Heteroscedasticity When the variances are unequal

  48. Unequal variances • The two-sample t-test assumes both samples have the same variance (equivalently, the same standard deviation) • Violation of this assumption can be bad, especially when the sample sizes are unequal.
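A standard remedy, not named on the slide, is Welch's t-test, which drops the equal-variance assumption; in scipy it is the equal_var=False option:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0, 1, 10)   # invented: small sample, small variance
b = rng.normal(0, 3, 40)   # invented: larger sample, larger variance

print(stats.ttest_ind(a, b).pvalue)                   # assumes equal variances
print(stats.ttest_ind(a, b, equal_var=False).pvalue)  # Welch's t-test
```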
