  1. BMS 617 Lecture 5 – Testing for equivalence or non-inferiority. Power and sample size. Multiple hypothesis testing. Marshall University Genomics Core Facility

  2. Recap
  Recall: a p-value is the probability of obtaining data at least as extreme as the observed data, under the assumption that the null hypothesis is true.
  A small p-value provides support for the assertion that the null hypothesis is false.
  What does a large p-value tell us?

  3. Example
  • Frye et al. (NEJM 1996;335(4):217-225) published a study comparing Coronary Artery Bypass Grafting (CABG) to Percutaneous Transluminal Coronary Angioplasty (PTCA) as treatments for coronary artery disease (CAD).
  • CABG is a highly invasive procedure, while PTCA is less invasive.
  • They wanted to show that PTCA was not inferior as a treatment.
  • Their main findings were presented as five-year survival rates: "There was no statistically significant difference in the cumulative survival curves for the two treatment groups (… p=0.19 by the log-rank test). The cumulative survival rates were 89.3% for CABG and 86.3% for PTCA. The difference between the groups [is] 2.9 percent with a 95% confidence interval of [-0.2%, 6.0%]."

  4. Frye et al. interpretation
  • Frye et al. wanted to show there was no difference between the two groups
  • They computed the difference between five-year survival rates, and a p-value for this difference
  • Because the p-value was relatively large (certainly more than 0.05), they concluded there was no difference between their groups
  • Remember what the p-value means: assuming there is no difference between the two groups, the chances of seeing data this extreme are 0.19

  5. p-values cannot prove the null hypothesis
  • Because the calculation of the p-value starts with the assumption that the null hypothesis is true, no p-value can be used to prove the null hypothesis
  • At the very most, a "large" p-value says that your data are not inconsistent with the null hypothesis
  • All you have done is fail to disprove it
  • There are many ways to get a large p-value:
  • Too few samples
  • Too much variability within groups

  6. Not statistically significant does not mean "no difference"
  • Frye et al. essentially interpret a high p-value as meaning there is no important difference between the two treatments.
  • This is a misinterpretation of their experimental results
  • However, others have since shown that their conclusions are indeed correct
  • The 95% confidence interval includes the possibility that as many as 6% of all patients would survive 5 years with CABG but not with PTCA.
  • A thorough analysis of the data would need a discussion as to whether this constituted an important difference.
  • Notice how the confidence interval is far more informative than the p-value!

  7. Testing for Equivalence
  • Most experiments aim to show that there is a difference between two sets of values
  • Sometimes an experiment aims to show that two sets of values are equivalent
  • Showing that a new (perhaps cheaper) drug is just as effective as a standard treatment
  • Showing that a new methodology produces the same results as an existing methodology
  • The first step in these kinds of analyses is to define what is meant by "equivalent"
  • Remember, continuous values are never exactly equal
  • We must define a range of differences which we consider to be inconsequential

  8. The Region of Equivalence
  • The range of differences which are considered inconsequential is called the region of equivalence (sometimes equivalence zone or equivalence margin)
  • The definition of the region of equivalence should be made according to scientific (or clinical) considerations
  • Not based on the data
  • As an example, the FDA defines two drug formulations to be equivalent if the ratio of their peak concentrations in blood plasma is between 0.8 and 1.25
  • Definition based on a clinical understanding of drug action
  • To be considered equivalent, the entire 90% confidence interval of this ratio must lie within this region (see the sketch below)
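To make the decision rule concrete, here is a minimal Python sketch of the containment check. The margin [0.8, 1.25] is the FDA bioequivalence region mentioned above; the confidence interval endpoints are made-up numbers for illustration:

```python
def is_equivalent(ci_low, ci_high, margin_low, margin_high):
    """Declare equivalence only if the entire confidence interval
    lies inside the region of equivalence."""
    return margin_low <= ci_low and ci_high <= margin_high

# FDA-style bioequivalence: the whole 90% CI for the ratio of peak
# plasma concentrations must lie within [0.8, 1.25].
# The CI endpoints below are hypothetical.
print(is_equivalent(0.91, 1.12, 0.8, 1.25))  # True: equivalent
print(is_equivalent(0.75, 1.10, 0.8, 1.25))  # False: equivalence not shown
```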

  9. Example
  In a recent experiment, we measured expression of the gene GRHL2 in 51 different breast cancer cell lines. The cell lines were categorized by subtype as "Basal A", "Basal B", or "Luminal". We expect the log of the expression value to be normally distributed, so we will analyze the logs of the data.

  10. Example: Basal A vs Basal B
  Difference between log2 expression values is 2.82, with a 95% confidence interval (for the difference of means) of [1.815, 3.817]. p-value is 2.17 × 10⁻⁵.

  11. Example: Basal A vs Luminal
  Difference between log2 expression values is 0.048, with a 95% confidence interval (for the difference of means) of [-0.856, 1.204]. p-value is 0.726.
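For reference, calculations like those on the last two slides can be sketched in Python with SciPy (version 1.10 or later, which added confidence_interval to the t-test result). The arrays below are simulated stand-ins, not the real cell line data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stand-ins for log2 expression values in two groups
basal_a = rng.normal(loc=8.0, scale=0.5, size=20)
luminal = rng.normal(loc=8.0, scale=0.5, size=25)

res = stats.ttest_ind(basal_a, luminal)              # unpaired two-sample t-test
ci = res.confidence_interval(confidence_level=0.95)  # CI for difference of means
diff = basal_a.mean() - luminal.mean()
print(f"difference = {diff:.3f}, 95% CI = [{ci.low:.3f}, {ci.high:.3f}], p = {res.pvalue:.3f}")
```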

  12. Test of equivalence for GRHL2 expression in Basal A vs Luminal cells
  • We must first decide what change in gene expression we consider to be "equivalent" to no change
  • This is a scientific decision which must be made based on knowledge of how the mRNA is translated to protein
  • Complex!
  • We will settle on a change of 25% (i.e. a fold change of 1.25).
  • Since we are working with log2 data, this translates to a region of equivalence of ±log2(1.25), or [-0.322, 0.322].
  • The 90% confidence interval for the difference in log2 expression is [-0.212, 0.307], with a mean of 0.048.

  13. Concluding Equivalence
  Since the entire confidence interval lies within the region of equivalence, we can conclude that the expression is equivalent. We could state this as: "The expression of GRHL2 in Basal A cells was equivalent to that in Luminal cells, with a region of equivalence of 1.25-fold up or down and a confidence level of 90%."
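Putting the two pieces together, here is a sketch of the equivalence test itself, again on simulated stand-in data, with the margin set to ±log2(1.25) as on the previous slide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
basal_a = rng.normal(loc=8.0, scale=0.5, size=20)  # simulated stand-in data
luminal = rng.normal(loc=8.0, scale=0.5, size=25)

margin = np.log2(1.25)  # region of equivalence: [-0.322, +0.322] on the log2 scale
res = stats.ttest_ind(basal_a, luminal)
ci = res.confidence_interval(confidence_level=0.90)  # 90% CI, as in the FDA rule
equivalent = (-margin <= ci.low) and (ci.high <= margin)
print(f"90% CI = [{ci.low:.3f}, {ci.high:.3f}], equivalent: {equivalent}")
```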

  14. Statistical Power
  • Recall, a p-value is the probability of getting results "at least this strong", assuming the null hypothesis is true
  • This means that the level of statistical significance, α, is the chance of getting a type I error if the null hypothesis is true
  • What if the null hypothesis really is false?
  • We still may or may not get a statistically significant result
  • i.e. we may have a true positive or a false negative result
  • We can ask a similar question about the probabilities of these outcomes: assuming the null hypothesis is false, what are the chances of obtaining a statistically significant result?
  • This quantity is the statistical power. The quantity 1-(power) is denoted β
  • β is the chance of getting a type II error (false negative) if the null hypothesis is false

  15. Statistical Power
  • The statistical power of an experiment depends on three things:
  • The sample size. The larger the sample size, the larger the power.
  • The variance (amount of scatter) of the data. The higher the variance, the lower the power.
  • The size of the difference (or the degree of correlation, etc.) that actually exists. The greater the effect, the higher the power.

  16. When to compute power
  • Power calculations are best performed before the experiment is run.
  • Usually the calculation answers the question: "What sample size do I need in order to achieve a power of …?"
  • It is never useful to compute the power for the effect size observed in an experiment, once the experiment has been completed.
  • Either the experiment found statistical significance for that effect size, or it didn't
  • It is sometimes useful to compute the power after an experiment has been performed, but with an effect size that is determined by scientific reasoning, not by the data itself
  • In particular, if you didn't find statistical significance, you might ask what the chances of finding statistical significance would have been if the effect size had been large enough to be scientifically interesting

  17. Power calculations for the GRHL2 expression experiment
  • There is no point in performing a power calculation for the Basal A vs Basal B comparison
  • Since the results were statistically significant
  • Should still justify that the size of the difference (log2 fold change of 2.82, or fold change of 2^2.82 = 7.06) is scientifically meaningful
  • To calculate the power for the Basal A vs Luminal comparison, we must first decide on a difference we would consider to be scientifically meaningful
  • We are going to answer the question: if there really were a difference of …, how likely is it our experiment would have found a statistically significant result?
  • This amount must be based on scientific reasoning, not on our data
  • What change in mRNA levels causes a detectable/interesting change in protein levels?
  • This quantity is hard to establish

  18. Power for Basal A – Luminal GRHL2 expression
  • Suppose we determined that a change of 25% (i.e. a fold change of 1.25, or a log2 change of log2(1.25) = 0.322) was the minimum to be of scientific interest
  • We could then compute the power to detect this level, given the standard deviation observed in our data and a statistical significance level of 0.05
  • This power is 0.99997
  • Would almost certainly have found this big a difference!
  • If we'd established a change of 10% was important, then our power would be lower: 0.7255
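A power calculation like this can be sketched with statsmodels. The standard deviation and group sizes below are placeholders, not the values from the real experiment, so the printed power will not reproduce the 0.99997 above:

```python
from statsmodels.stats.power import TTestIndPower

diff = 0.322             # minimum scientifically interesting log2 change
sd = 0.5                 # placeholder for the pooled SD observed in the data
effect_size = diff / sd  # Cohen's d: difference in units of standard deviations

analysis = TTestIndPower()
power = analysis.power(effect_size=effect_size, nobs1=20, ratio=25 / 20, alpha=0.05)
print(f"power = {power:.4f}")
```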

  19. Power calculations for experimental design
  • The most usual scenario for performing power calculations is when designing an experiment
  • Remember, the power is a function of
  • The sample size
  • The effect size
  • The variation in the data
  • Typically we do the power calculation to establish a sample size which gives us a reasonable chance (usually 80%) of finding a significant result
  • We have to choose an effect size of interest
  • We also have to "know" how much variation we will see in our experiment
  • The last part is problematic…

  20. Practical advice for sample size calculations
  • Choosing a sample size is often a balance between statistical considerations and resource considerations
  • Time, money, etc.
  • The computation will require an estimate of the standard deviation (or SEM) you will see in your data
  • Use previous data from similar experiments
  • Consider repeating the computation with different values
  • In grant applications, consider presenting a table
  • Given a chosen sample size, here is the power to detect various effect sizes under the assumption of various realistic estimates of the SD.
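A table of this kind can be generated directly. Here is a sketch using statsmodels, with illustrative (assumed) effect sizes and SD estimates, solving for the per-group sample size that gives 80% power at α = 0.05:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
print(f"{'difference':>10} {'SD':>6} {'n per group':>12}")
for diff in (0.25, 0.5, 1.0):   # candidate effect sizes (assumed)
    for sd in (0.5, 1.0):       # realistic SD estimates (assumed)
        # leaving nobs1 unspecified makes solve_power return the sample size
        n = analysis.solve_power(effect_size=diff / sd, power=0.80, alpha=0.05)
        print(f"{diff:>10} {sd:>6} {n:>12.1f}")
```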

  21. Multiple Hypothesis Testing
  • In a study we are currently planning, we will be measuring various biomarkers in pregnant women at 6 weeks of pregnancy
  • Compare the values of the biomarkers between those who develop preeclampsia and those who don't
  • The idea is to be able to develop a test for early prediction of the risk of preeclampsia
  • The study design seems fairly straightforward
  • Measure the values of each of our ~40 biomarkers
  • Compare the values between those who develop preeclampsia and those who don't, and compute the p-value for each
  • Biomarkers that have p<0.05 would be considered predictive
  • What is the problem with this approach?

  22. p-values and multiple hypotheses
  • Remember, the significance level α is the probability of a false positive, assuming the null hypothesis is true
  • Suppose we do 40 such tests, and all the null hypotheses are true
  • i.e. none of our biomarkers are predictive
  • We would hope to have no statistically significant results (i.e. all true negatives)
  • For each test, the chance of a true negative is 0.95 (with α=0.05)
  • So, if the tests are independent, the chance of all true negatives is 0.95 × 0.95 × … × 0.95 = 0.95^40 ≈ 0.129
  • In other words, the chances of at least one false positive would be 0.871, or 87.1%
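The arithmetic on this slide, as a two-line check (assuming the 40 tests are independent):

```python
alpha, n_tests = 0.05, 40
p_all_true_negatives = (1 - alpha) ** n_tests  # 0.95**40
print(round(p_all_true_negatives, 3))          # 0.129: no false positives
print(round(1 - p_all_true_negatives, 3))      # 0.871: at least one false positive
```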

  23. Approaches to dealing with multiple comparisons
  • Here are some approaches to dealing with multiple hypothesis testing:
  • Ignore the problem!
  • This may be OK if you report all your p-values and let the reader interpret them, or if you have one or two "main" tests and the others are related to them
  • Correct your p-values for the problem. There are two ways to do this:
  • Set a significance level so that the chance of any false positives is 0.05, assuming all null hypotheses are true. This is called controlling the family-wise error rate
  • Set a target false discovery rate instead of a false positive rate. This can be done if the number of hypotheses is very large

  24. Controlling the family-wise error rate
  • Suppose you are performing n independent tests
  • Choose a significance level, α (usually 0.05)
  • We will aim for the probability of any false positives to be α, assuming all null hypotheses are true
  • Divide α by the number of comparisons, n
  • For each individual test, compute the p-value
  • The p-value is considered significant if p < α/n
  • This is called a Bonferroni correction
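A Bonferroni correction is a one-line comparison in code; the p-values here are hypothetical:

```python
import numpy as np

alpha, n = 0.05, 40
pvals = np.array([0.0004, 0.011, 0.03, 0.249])  # hypothetical p-values
threshold = alpha / n                           # 0.05 / 40 = 0.00125
print(threshold, pvals < threshold)             # only 0.0004 is significant
```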

  25. High throughput gene expression experiments
  • In high-throughput gene expression experiments (microarray or RNA-Seq), comparisons of expression are made for all genes between two (or more) different conditions
  • A typical application is to determine which genes' expression changes after treatment with a drug, etc.
  • We are effectively performing tens of thousands of different tests simultaneously
  • One for each gene in the genome

  26. Bonferroni correction for high-throughput experiments
  • The Bonferroni correction is not appropriate for high-throughput experiments
  • The level required for significance for each individual gene would be ~0.05/20000 = 0.0000025
  • Very limited power
  • The expression of each gene is not independent of the others, so the Bonferroni correction is overly conservative
  • In many of these experiments, some false positives can be tolerated

  27. Controlling the False Discovery Rate in high-throughput experiments
  • An alternative approach is to control the False Discovery Rate (FDR)
  • A method devised by Benjamini and Hochberg in 1995 allows us to do this
  • The idea is to choose a target FDR
  • Usually called Q
  • If Q is 0.1, then about 10% of the genes we consider "significant" would be false positives, and 90% would be true positives
  • Of course, you don't know which!
  • This is an estimate

  28. The Benjamini-Hochberg procedure
  The idea behind the Benjamini-Hochberg procedure is that if all null hypotheses were true, the p-values would be evenly distributed between 0 and 1.
  Suppose we are performing n tests, and we choose a target FDR of Q. Compute the p-values for each of the tests, and order them with the smallest first.
  If the smallest p-value is less than Q/n, we consider that test to be "positive" (termed a "discovery").
  If so, look at the next smallest p-value. If it is smaller than 2Q/n, consider it a discovery, and move to the next. If that is smaller than 3Q/n, call it positive, and move to the next. Keep going until you find one test that is negative, and then stop.
  (Formally, the standard procedure is "step-up": find the largest k for which the k-th smallest p-value is at most kQ/n, and call all k smallest p-values discoveries, even if some earlier p-value missed its own threshold.)
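Here is a short implementation of the standard step-up form of the rule just described (reject the k smallest p-values, where k is the largest rank whose p-value is at most kQ/n). The p-values at the bottom are hypothetical:

```python
import numpy as np

def benjamini_hochberg(pvals, Q=0.10):
    """Benjamini-Hochberg step-up procedure: returns a boolean mask
    marking which tests are discoveries at target FDR Q."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                     # indices of sorted p-values
    thresholds = Q * np.arange(1, n + 1) / n  # Q/n, 2Q/n, ..., nQ/n
    below = p[order] <= thresholds
    discoveries = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()        # largest rank meeting its threshold
        discoveries[order[: k + 1]] = True    # reject everything up to rank k
    return discoveries

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]  # hypothetical
print(benjamini_hochberg(pvals, Q=0.10))  # all but the last are discoveries
```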

  29. Ways to use the FDR
  • When computing the False Discovery Rate like this, there are three different ways to use it
  • Choose a target FDR (say 5% or 10%) as above, and then see which tests are discoveries within this FDR
  • Choose some number of tests which you consider important (e.g. the 200 genes with the biggest fold change). Compute the FDR which would make all these tests discoveries.
  • For each test, compute the FDR which would just include that test as a discovery. This assigns an FDR to each individual test – this is usually called the q-value for that test. Report the q-values for all tests.
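The third option is available off the shelf: statsmodels' multipletests with method='fdr_bh' returns BH-adjusted p-values, which are commonly reported as per-test q-values. Same hypothetical p-values as in the previous sketch:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]  # hypothetical
reject, qvals, _, _ = multipletests(pvals, alpha=0.10, method='fdr_bh')
print(reject)  # which tests are discoveries at FDR Q = 0.10
print(qvals)   # BH-adjusted p-values, usable as per-test q-values
```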

  30. p-values and high-throughput experiments
  • When dealing with very large numbers of tests, individual p-values become almost meaningless
  • In a genome-wide experiment testing 20,000 genes, even with a significance level of 0.001, we would expect 20 false positives on average
  • But we have little control over the distribution of the number of false positives
  • How likely is it we have 5 false positives? 50? 200?
  • Given this, how do you interpret a p-value for an individual test (gene)?
  • In experiments such as these the p-value tells us very little
  • The FDR is much more useful
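The questions on this last slide can be made quantitative if we assume the tests are independent, by treating the number of false positives as binomial. A quick sketch:

```python
from scipy.stats import binom

n_genes, alpha = 20_000, 0.001
false_pos = binom(n_genes, alpha)  # false-positive count if all nulls are true
print(false_pos.mean())            # expected number: 20
print(false_pos.cdf(5))            # P(5 or fewer): very small
print(false_pos.sf(49))            # P(50 or more): essentially zero
print(false_pos.sf(199))           # P(200 or more): astronomically small
```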
