Learn about statistical inference, hypothesis testing, confidence intervals, and pitfalls of p-values at IRCCS San Raffaele Pisana, Rome, Italy from 28 February - 2 March 2018.
Statistical inference (probability, probability distributions, statistical significance, hypothesis testing, confidence intervals, pitfalls of p-values)
IRCCS San Raffaele Pisana, Rome, Italy, 28 February - 2 March 2018
Statistical inference is the process through which inferences about a population are made based on certain statistics calculated from a sample of data drawn from that population.
From: Principles and Practice of Clinical Research (Third Edition), 2012
Statistical inference is a data analysis technique used to deduce properties of an underlying probability distribution. Inferential statistical analysis uses hypothesis testing and estimation to infer properties of a population. It is assumed that the observed data set is sampled from a larger population.
Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the data and (second) deducing propositions from the model.
The conclusion of a statistical inference is a statistical proposition. Some common forms of statistical proposition are the following:
• a point estimate, i.e. a particular value that best approximates some parameter of interest;
• an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a dataset drawn from a population so that, under repeated sampling of such datasets, such intervals would contain the true parameter value with the probability at the stated confidence level.
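As a minimal sketch of these two forms of proposition, the Python snippet below computes a point estimate (the sample mean) and a 95% confidence interval for a small sample. The data values are invented for illustration, and the use of `scipy` and of a t-based interval are assumptions, not part of the original slides.

```python
import math
import statistics
from scipy import stats

# Hypothetical sample (illustrative values, not study data)
sample = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.6, 4.9, 5.2]

n = len(sample)
mean = statistics.mean(sample)     # point estimate of the population mean
sd = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)             # standard error of the mean

# 95% confidence interval based on the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

print(f"point estimate: {mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is read in the repeated-sampling sense described above: across many such samples, about 95% of intervals built this way would contain the true mean.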
Models and assumptions
Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference. Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.
Statisticians distinguish two main levels of modeling assumptions:
• Fully parametric
• Non-parametric
What is "statistical significance" (p-value)?
The statistical significance of a result is expressed by the p-value: the probability of observing a relationship (e.g., between variables) or a difference (e.g., between means) at least as strong as the one found in the sample if, in the population from which the sample was drawn, no such relationship or difference exists. It is not the probability that the observed result is due to chance.
What is "statistical significance" (p-value)?
The p-value is a decreasing index of the reliability of a result. The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population.
Hypotheses
• H0 (null hypothesis) claims “no difference”
• Ha (alternative hypothesis) contradicts the null
Example: We test whether an exposed population shows an increased level of DNA damage.
• H0: no average increase in DNA damage in the population
• Ha: H0 is wrong (i.e., there is an increased level of DNA damage in the exposed population)
Conclusion of the significance test | Real world: null is true | Real world: null is false
Null is true (H0 not rejected) | Correct decision | Type II error, β (false negative, fn)
Null is false (H0 rejected) | Type I error, α (false positive, fp) | Correct decision
Type I error A type I error occurs when the null hypothesis (H0) is true, but is rejected. It is asserting something that is absent, a false hit. A type I error may be likened to a so-called false positive (a result that indicates that a given condition is present when it actually is not present). The type I error rate or significance level is the probability of rejecting the null hypothesis given that it is true. It is denoted by the Greek letter α (alpha) and is also called the alpha level. Often, the significance level is set to 0.05 (5%), implying that it is acceptable to have a 5% probability of incorrectly rejecting the null hypothesis.
Type II error A type II error occurs when the null hypothesis is false, but erroneously fails to be rejected. A type II error may be compared with a so-called false negative in a test checking for a single condition with a definitive result of true or false. A Type II error is committed when we fail to believe a true alternative hypothesis. In terms of folk tales, an investigator may fail to see the wolf ("failing to raise an alarm"). Again, H0: no wolf. The rate of the type II error is denoted by the Greek letter β (beta) and related to the power of a test (which equals 1−β).
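The two error rates can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not part of the original material: it repeatedly draws samples from a normal distribution with known SD = 1, applies a two-sided z test of H0: μ = 0, and counts how often H0 is rejected. The function name `rejection_rate`, the sample size, and the true mean of 0.5 are illustrative choices.

```python
import math
import random
from statistics import NormalDist, mean

def rejection_rate(true_mean, n=25, alpha=0.05, n_sim=4000, seed=1):
    """Fraction of simulated studies in which H0: mu = 0 is rejected.

    Each study draws n values from a normal distribution with the given
    true mean and known SD = 1, then applies a two-sided z test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = mean(sample) * math.sqrt(n)            # (mean - 0) / (SD / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim

type_i_rate = rejection_rate(true_mean=0.0)   # H0 true: every rejection is a false positive
power = rejection_rate(true_mean=0.5)         # H0 false: every rejection is a correct decision
type_ii_rate = 1 - power                      # H0 false but not rejected

print(f"type I error rate  ~ {type_i_rate:.3f}")
print(f"type II error rate ~ {type_ii_rate:.3f} (power ~ {power:.3f})")
```

When H0 is true, the rejection rate should stay close to the chosen α; when H0 is false, the share of non-rejections estimates β.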
Significance Testing
Also called “hypothesis testing”
Objective: to test a claim about parameter μ
Procedure:
• State hypotheses H0 and Ha
• Calculate test statistic
• Convert test statistic to P-value and interpret
• Consider significance level (optional)
P-value
P-value ≡ the probability that the test statistic would take a value as extreme as or more extreme than the observed test statistic when H0 is true (it is compared with the type I error rate α)
Smaller and smaller P-values → stronger and stronger evidence against H0
Conventions for interpretation
• P > .10: evidence against H0 not significant
• .05 < P ≤ .10: evidence against H0 marginally significant
• .01 < P ≤ .05: evidence against H0 significant
• P ≤ .01: evidence against H0 very significant
1/3/2020 Basics of Significance Testing
Significance Level
α ≡ threshold for “significance”
We set α. For example, if we choose α = 0.05, we require evidence so strong that it would occur no more than 5% of the time when H0 is true.
Decision rule
• P ≤ α: statistically significant evidence
• P > α: nonsignificant evidence
For example, if we set α = 0.01, a P-value of 0.0006 is considered significant.
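A minimal sketch of how a test statistic is converted to a P-value and compared with α is given below. It assumes a z statistic with a standard normal distribution under H0; the helper names `two_sided_p_value` and `is_significant` and the value z = 2.5 are illustrative, not taken from the slides.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z (in either tail) when H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def is_significant(p, alpha=0.05):
    """Decision rule: P <= alpha counts as statistically significant."""
    return p <= alpha

p = two_sided_p_value(2.5)                    # hypothetical observed z statistic
print(f"P = {p:.4f}")
print("significant at alpha = 0.05:", is_significant(p, alpha=0.05))
print("significant at alpha = 0.01:", is_significant(p, alpha=0.01))
print("P = 0.0006 at alpha = 0.01:", is_significant(0.0006, alpha=0.01))
```

The same P-value can be significant at one α and not at another, which is why the threshold should be fixed before looking at the data.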
In a two-tailed test, the rejection region for a significance level of α = 0.05 is split between the two tails of the sampling distribution (2.5% in each) and makes up 5% of the area under the curve (white areas).
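The boundaries of this rejection region can be computed directly. The sketch below assumes a standard normal sampling distribution (a z test); for other distributions the same idea applies with a different quantile function.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()                    # assumed sampling distribution under H0

lower = std_normal.inv_cdf(alpha / 2)        # left boundary of the rejection region
upper = std_normal.inv_cdf(1 - alpha / 2)    # right boundary of the rejection region

# Area under the curve in each tail
left_area = std_normal.cdf(lower)
right_area = 1 - std_normal.cdf(upper)

print(f"reject H0 if z < {lower:.2f} or z > {upper:.2f}")
print(f"total rejection area: {left_area + right_area:.3f}")
```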
Hypothesis testing evaluates at which level of confidence H0 can be rejected
• Choose the study hypothesis to be tested
• Select the proper statistics
• Determine the distribution of the selected statistics
• Perform statistical test in the study groups
• Assess decision rules
• Decide according to the test results:
  - Reject H0 because H1 is (most likely) true
  - Do not reject H0 – May be true
• The t value depends on the sample size n
• With a large sample size, the t value approximates the z value
• As with the normal distribution, an observed value can be transformed into a t value when the mean and SD of the sample are known.
[Table of t critical values]
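Critical values of the kind listed in a t table can be reproduced numerically. The sketch below, which assumes `scipy` is available, prints the two-tailed 5% critical value of Student's t for several degrees of freedom next to the corresponding z value, showing how t approaches z as the sample grows. The particular degrees of freedom are arbitrary illustrative choices.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)            # two-tailed 5% critical value of the normal distribution
for df in (5, 10, 30, 50, 98, 1000):
    t_crit = stats.t.ppf(0.975, df)       # same quantile of Student's t with df degrees of freedom
    print(f"df = {df:>4}: t = {t_crit:.3f}   (z = {z_crit:.3f})")
```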
Hypothesis testing evaluates at which level of confidence H0 can be rejected
• Choose the study hypothesis to be tested: 50 COPD patients on a new rehabilitation strategy walk farther in the six-minute walk test (6MWT) than 50 patients on standard of care (SoC)
• Select the proper statistics: 6MWT has a normal distribution and there are 2 study groups = Student’s t test
• Select the distribution of the selected statistics: the relevant distribution is the t with n > 50; critical threshold t = 2.01 (with 50 + 50 − 2 = 98 degrees of freedom the two-tailed 5% value is about 1.98, so 2.01 is slightly conservative)
• Perform statistical test in the study groups: 350 m with the new treatment vs. 280 m with the standard; t = 2.12
• Assess decision rules: adopt the new treatment if p < 0.05 (5% type I error)
• Decide according to the test results: the observed t value (2.12) is greater than the threshold value (2.01)
  - Reject H0 because H1 is (most likely) true
  - Do not reject H0 – May be true
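The decision in the 6MWT example can be checked numerically. The sketch below takes the group sizes and the reported statistic t = 2.12 from the example and assumes a pooled two-sample t test with a two-tailed alternative (so 98 degrees of freedom); these assumptions are not stated in the slides, and `scipy` is assumed to be available.

```python
from scipy import stats

n_new, n_soc = 50, 50          # patients per group (from the example)
t_observed = 2.12              # test statistic reported in the example
alpha = 0.05

df = n_new + n_soc - 2                           # degrees of freedom of a pooled two-sample t test (assumed)
t_crit = stats.t.ppf(1 - alpha / 2, df)          # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_observed), df)    # two-tailed P-value

print(f"df = {df}, critical t = {t_crit:.3f}, P = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

Comparing the observed t with the critical value and comparing the P-value with α are two views of the same decision rule, so they always agree.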
Statistical issues specifically dealing with biomarkers
• Data transformation
• Statistical power
• Multiple comparison
Measures of Dispersion
• RANGE: highest to lowest values
• STANDARD DEVIATION: how closely values cluster around the mean value
• SKEWNESS: refers to symmetry of curve
Skewness
[Figure: Curve A and Curve B, with mode, median, and mean marked; negative skew]
Transformations are a remedy for outliers, failures of normality, linearity, and homoscedasticity. Yet, caution must still be employed in the usage of transformations due to the increased difficulty of interpretation of transformed variables.
Distribution of MN frequency in 221 NPP workers (figure marks mode, median, and mean)
Distribution of MN frequency in 221 NPP workers: log-transformed data
Distribution of MN frequency in 221 NPP workers: SQR-transformed data
Distribution of MN frequency in 221 NPP workers: Average Square Root (ASR)-transformed data, ½(√x + √(x+1))
Source: Hadjidekova V., Bulanova M., Bonassi S., Neri M. ‘Micronuclei frequency is increased in peripheral blood lymphocytes of nuclear power plant workers’ Radiation Research, 160, 684-90, 2003.
Transforming a dataset is usually an empirical exercise - perform several of the most likely transformations and test for a normal distribution. Plot a histogram of the data to check that it has a continuous frequency distribution. Calculate the mean, median and mode. The closer these values are to each other, the closer the data is to a normal distribution.
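This empirical approach can be sketched as follows. The data are synthetic right-skewed counts generated as a stand-in for a biomarker frequency (they are not the study data), the `+ 1` inside the logarithm is an assumption made to handle zero counts, and `numpy` and `scipy` are assumed to be available.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed counts standing in for a biomarker frequency (not the study data)
rng = np.random.default_rng(0)
x = rng.negative_binomial(n=2, p=0.2, size=221)

transforms = {
    "raw": x.astype(float),
    "log(x + 1)": np.log(x + 1.0),                  # +1 keeps zero counts defined (an assumption)
    "sqrt(x)": np.sqrt(x),
    "ASR": 0.5 * (np.sqrt(x) + np.sqrt(x + 1.0)),   # average square root
}

# Compare mean, median, and skewness: values closer together and a skewness
# nearer to zero suggest a distribution closer to normal
for name, values in transforms.items():
    print(f"{name:>10}: mean = {values.mean():.2f}, "
          f"median = {np.median(values):.2f}, "
          f"skewness = {stats.skew(values):.2f}")
```

A histogram of each transformed variable remains the most direct check; the summary numbers only help to shortlist candidates.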
Distribution of FEV1 in 211 lung cancer patients
The one-sample Kolmogorov-Smirnov test can be used to test that a variable is normally distributed. The One-Sample Kolmogorov-Smirnov Test procedure compares the observed cumulative distribution function for a variable with a specified theoretical distribution, which may be normal, uniform, Poisson, or exponential. The Kolmogorov-Smirnov Z is computed from the largest difference (in absolute value) between the observed and theoretical cumulative distribution functions. This goodness-of-fit test tests whether the observations could reasonably have come from the specified distribution.
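A minimal sketch of this test with `scipy.stats.kstest` is shown below. Both data sets are synthetic, and the normal distribution is specified using the sample mean and SD, which is an assumption of this example: estimating the parameters from the same data makes the reported P-value only approximate (it tends to be too large).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_like = rng.normal(loc=2.5, scale=0.6, size=211)   # synthetic, roughly normal variable
skewed = rng.exponential(scale=1.0, size=211)            # synthetic, clearly non-normal variable

def ks_normal(values):
    """One-sample KS test against a normal distribution with the sample mean and SD."""
    mean, sd = values.mean(), values.std(ddof=1)
    return stats.kstest(values, "norm", args=(mean, sd))

res_normal = ks_normal(normal_like)
res_skewed = ks_normal(skewed)

# statistic = largest absolute difference between observed and theoretical CDFs
print(f"normal-like: D = {res_normal.statistic:.3f}, P = {res_normal.pvalue:.3f}")
print(f"skewed:      D = {res_skewed.statistic:.3f}, P = {res_skewed.pvalue:.3g}")
```

A small P-value indicates that the observations are unlikely to have come from the specified distribution; a large one does not prove normality, it only fails to contradict it.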
Different outcome for different biomarkers
• count, mean, ranks, yes/no, etc.
In general, categorical data are simpler to handle and to interpret.
Examples:
• Age vs. age classes
• Cigarettes smoked per day vs. non-smoker; 1-9; 10-19; 20+
• Frequency of CA vs. low, medium, high
Statistical Power
Statistical power is defined as the probability of correctly rejecting the null hypothesis when it is false. Since beta (the type II error rate) is the probability of failing to reject the null hypothesis when it is false, statistical power is (1 − beta). Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null hypothesis if it is truly false. (As N increases, the standard error shrinks. Sampling error becomes less problematic, and true differences are easier to detect.)
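The dependence of power on sample size can be sketched with a normal approximation. The function below is an illustrative helper (`approx_power` is not a library function) for a two-sided comparison of two group means, where the effect size is the difference in means divided by the common SD; the effect size of 0.5 and the group sizes are arbitrary example values.

```python
import math
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Uses the normal approximation; effect_size is the difference in means
    divided by the common SD.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)   # expected value of the test statistic
    # Probability that the statistic falls in either rejection tail
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (10, 20, 50, 100, 200):
    print(f"n per group = {n:>3}: power ~ {approx_power(0.5, n):.2f}")
```

With a zero effect size the function returns α itself, which is the type I error rate: the test then rejects only by chance.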
But how many subjects should I include in my new study?
There are four interrelated components that influence the conclusions you might reach from a statistical test in a research project:
• sample size, or the number of units (e.g., people) accessible to the study
• effect size, or the salience of the treatment relative to the noise in measurement
• alpha level (or significance level), or the probability of declaring an effect when the observed result is only due to chance
• power, or the probability that you will observe a treatment effect when it occurs
Example:
• Baseline frequency: 1 per 1000
• In exposed subjects: 1.5 per 1000, i.e., 50% higher
• Exposure effect = 0.5 MN‰
• Assuming SD = 1 (equal to the mean of controls), effect size = 0.5/1 = 0.5
• Total population required = 120 (type I error 5%, statistical power 80%)
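A calculation of this kind can be sketched with the usual normal-approximation formula for comparing two means, n per group = 2((z_α + z_β)/d)². The helper `n_per_group` is illustrative, and the two-sided test and equal group sizes are assumptions, since the example does not state how its figure was obtained.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided, two-sample comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

baseline = 1.0      # MN per 1000 in controls (from the example)
exposed = 1.5       # MN per 1000 in exposed subjects (from the example)
sd = 1.0            # assumed SD, as in the example
effect_size = (exposed - baseline) / sd

n = n_per_group(effect_size)
print(f"effect size = {effect_size}, n per group = {n}, total = {2 * n}")
```

Under these assumptions the total lands in the same range as the roughly 120 subjects quoted in the example; the exact figure depends on the approximation used and on whether the test is one- or two-sided.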
Multiple comparison!