Delve into the fundamentals of hypothesis testing, association, and regression in the realm of biostatistics. Explore the significance of statistical analysis in studies on diabetes type 2 and pancreatic cancer. Understand the steps involved in statistical inference and the importance of distinguishing between null and alternative hypotheses.
Hypothesis testing. Association and regression. Georgi Iskrov, PhD, Department of Social Medicine
Outline • Hypothesis testing • Type I and type II errors • Student t test • ANOVA • Parametric vs non-parametric tests • Normality tests • Rank-based tests • Chi-square test • Fisher’s exact test • Correlation analysis • Regression analysis
Importance of biostatistics • Diabetes type 2 study • Experimental group: Mean blood sugar level: 103 mg/dl • Control group: Mean blood sugar level: 107 mg/dl • Pancreatic cancer study • Experimental group: 1-year survival rate: 23% • Control group: 1-year survival rate: 20% Is there a difference? Statistics are needed to quantify differences that are too small to recognize through clinical experience alone.
Statistical inference • Diabetes type 2 study • Experimental group: Mean blood sugar level: 103 mg/dl • Control group: Mean blood sugar level: 107 mg/dl • Increased sample size: • Diabetes type 2 study • Experimental group: Mean blood sugar level: 99 mg/dl • Control group: Mean blood sugar level: 112 mg/dl
Statistical inference • Compare the means of 2 samples/conditions • If the 2 means are statistically different, then the samples are likely to have been drawn from 2 different populations, i.e. the groups really are different. (Figure: two populations with means µ1 and µ2, and their sample means X̄1 and X̄2.)
Statistical inference • Diabetes type 2 study • Experimental group: Mean blood sugar level: 103 mg/dl • Control group: Mean blood sugar level: 107 mg/dl • Increased sample size: • Diabetes type 2 study • Experimental group: Mean blood sugar level: 105 mg/dl • Control group: Mean blood sugar level: 106 mg/dl
Statistical inference • Compare the means of 2 samples/conditions • If 2 samples are taken from the same population, then they should have fairly similar means. (Figure: one population with mean µ, and two sample means X̄1 and X̄2.)
Hypothesis testing • The general idea of hypothesis testing involves: • Making an initial assumption; • Collecting evidence (data); • Based on the available evidence (data), deciding whether to reject or not reject the initial assumption. • Every hypothesis test — regardless of the population parameter involved — requires the above three steps.
Criminal trial • Criminal justice system assumes the defendant is innocent until proven guilty. That is, our initial assumption is that the defendant is innocent. • In the practice of statistics, we make our initial assumption when we state our two competing hypotheses – the null hypothesis (H0) and the alternative hypothesis (HA). Here, our hypotheses are: • H0: Defendant is not guilty (innocent) • HA: Defendant is guilty • In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.
Null hypothesis – H0 • This is the hypothesis under test, denoted as H0. • The null hypothesis is usually stated as the absence of a difference or an effect. • The null hypothesis says there is no effect. • The null hypothesis is rejected if the significance test shows the data are inconsistent with the null hypothesis.
Alternative hypothesis – H1 • This is the alternative to the null hypothesis. It is denoted as H', H1, or HA. • It is usually the complement of the null hypothesis. • If, for example, the null hypothesis says two population means are equal, the alternative says the means are unequal.
Criminal trial • The prosecution team then collects evidence with the hopes of finding sufficient evidence to make the assumption of innocence refutable. • In statistics, the data are the evidence. • The jury then makes a decision based on the available evidence: • If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects H0 and deems the defendant guilty. We behave as if the defendant is guilty. • If there is insufficient evidence, then the jury does not reject H0. We behave as if the defendant is innocent.
Making the decision • Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. • If it is likely, we do not reject the null hypothesis. • If it is unlikely, then we reject the null hypothesis in favor of the alternative hypothesis. • Effectively, then, making the decision reduces to determining likely or unlikely.
Making the decision • In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption: • We could take the critical value approach (favored in many of the older textbooks). • Or, we could take the P-value approach (what is used most often in research, journal articles, and statistical software).
Making the decision • Suppose we find a difference between two groups in survival: • patients on a new drug have a survival of 15 months; • patients on the old drug have a survival of 18 months. • So, the difference is 3 months. • Do we accept or reject the hypothesis of no true difference between the groups (the two drugs)? • Is a difference of 3 a lot, statistically speaking – a huge difference that is rarely seen? • Or is it not much – the sort of thing that happens all the time?
Making the decision • A statistical test tells you how often you would get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. • Suppose the test is done and its result is that P = 0.32. This means that you would get a difference of 3 quite often just by the play of chance – 32 times in 100 – even when there is in reality no true difference between the groups.
Making the decision • On the other hand, if we did the statistical analysis and got P = 0.0001, we would say that a difference of 3 or more would arise by the play of chance only 1 time in 10 000. That is so rare that we want to reject our hypothesis of no difference: there is something different about the new therapy.
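The idea of "how often you would get a difference this large simply by chance" can be made concrete with a permutation test. This is a swapped-in technique, not the test used on these slides: it assumes that, under H0, the group labels are exchangeable, and it enumerates every way of relabeling the pooled values. The function name and the toy values are hypothetical illustrations, not study data.

```python
from itertools import combinations
from statistics import fmean


def exact_permutation_p(group_a, group_b):
    """Two-sided permutation P value for a difference in means.

    Counts the share of all possible relabelings of the pooled values whose
    absolute difference in means is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(fmean(group_b) - fmean(group_a))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # small tolerance so that ties with the observed difference are counted
        if abs(fmean(b) - fmean(a)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Toy values, not study data
p = exact_permutation_p([1, 2, 3], [4, 5, 6])
print(f"P = {p}")
```

With groups [1, 2, 3] and [4, 5, 6], only 2 of the 20 possible relabelings give a difference in means as large as 3, so P = 0.1. Full enumeration is only practical for very small groups.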
Hypothesis testing • Somewhere between 0.32 and 0.0001 we may not be sure whether to reject the null hypothesis or not. • Mostly we reject the null hypothesis when, if the null hypothesis were true, the result we got would have happened less than 5 times in 100 by chance. This is the conventional cutoff of 5%, or P < 0.05. • This cutoff is commonly used, but it is arbitrary: there is no particular reason why we use 0.05 rather than 0.06 or 0.048 or any other value.
Type I and II errors A type I error is the incorrect rejection of a true null hypothesis (also known as a false positive finding). The probability of a type I error is denoted by the Greek letter α (alpha). A type II error is incorrectly retaining a false null hypothesis (also known as a false negative finding). The probability of a type II error is denoted by the Greek letter β (beta).
Level of significance Level of significance (α) – the threshold for declaring if a result is significant. If the null hypothesis is true, α is the probability of rejecting the null hypothesis. α is decided as part of the research design, while P-value is computed from data. α = 0.05 is most commonly used. Small α value reduces the chance of Type I error, but increases the chance of Type II error. Trade-off based on the consequences of Type I (false-positive) and Type II (false-negative) errors.
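A quick simulation can illustrate what α means: when H0 is true, a test run at α = 0.05 should reject in roughly 5% of repeated experiments. The sketch assumes both groups are drawn from the same normal population with a known SD of 1, so a two-sample z-test is appropriate; the function name, sample size, and seed are illustrative choices.

```python
import random
from statistics import NormalDist, fmean


def simulate_type_i_error(n_per_group=20, alpha=0.05, reps=10_000, seed=1):
    """Share of two-sample z-tests that reject H0 when H0 is true.

    Both groups come from the same N(0, 1) population (an assumption chosen
    for illustration), so every rejection is a false positive.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff, about 1.96
    se = (2 / n_per_group) ** 0.5  # SD of the difference in means when sigma = 1
    rejections = 0
    for _ in range(reps):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(a) - fmean(b)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps


rate = simulate_type_i_error()
print(f"Observed type I error rate: {rate:.3f}")
```

Lowering alpha in the call lowers the false-positive rate accordingly, which is the first half of the trade-off described above; the price is lower power.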
Power Power – the probability of rejecting a false null hypothesis. Statistical power is inversely related to β or the probability of making a Type II error (power is equal to 1 – β). Power depends on the sample size, variability, significance level and hypothetical effect size. You need a larger sample when you are looking for a small effect and when the standard deviation is large.
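For planning, a common normal-approximation formula for the sample size per group in a two-sided comparison of two means is n = 2(z(1 − α/2) + z(1 − β))² σ² / δ², where δ is the effect size (difference in means) and σ the common SD. The sketch implements only that approximation (exact t-based power calculations give a slightly larger n); the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / effect^2,
    rounded up to the next whole subject.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)  # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)


for effect in (0.5, 0.25):
    print(f"effect = {effect}: n per group = {n_per_group(effect, sd=1.0)}")
```

For δ = 0.5, σ = 1, α = 0.05 and 80% power this gives 63 per group; halving the effect size roughly quadruples n, in line with the point that small effects and large SDs need large samples.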
Common mistakes • P-value is different from the level of significance α. P-value is computed from data, while α is decided as part of the experimental design. • P-value is not the probability of the null hypothesis being true. P-value answers the following question: if the null hypothesis is true, what is the chance that random sampling will lead to a difference as large as, or larger than, the one observed in the study?
Common mistakes • Do not focus only on whether a result is statistically significant. Look at the size of the effect and its precision as quantified by the confidence interval. • A statistically significant result does not necessarily mean that the finding is scientifically or clinically important. • Lack of a difference may be a meaningful result too! • If you repeat an experiment, expect the P-value to be different. P-values are much less reproducible than you might expect.
Choosing a statistical test Choice of a statistical test depends on: Level of measurement for the dependent and independent variables; Number of groups or dependent measures; Number of units of observation; Type of distribution; Population parameter of interest (mean, variance, differences between means and/or variances).
1-sample t-test • Comparison of a sample mean with a population mean • It is known that the weight of young adult males has a mean value of 70.0 kg with a standard deviation of 4.0 kg. Thus the population mean is µ = 70.0 and the population standard deviation is σ = 4.0. • Data from a random sample of 28 males of similar ages but with a specific enzyme defect: mean body weight of 67.0 kg and sample standard deviation of 4.2 kg. • Question: Does the studied group have a significantly lower body weight than the general population?
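Using the numbers in this example, the test statistic is t = (x̄ − µ) / (s / √n) with n − 1 degrees of freedom. The sketch computes t and a one-sided P value with the standard library only; the t-distribution tail is obtained by Simpson's-rule integration of the t density, a numerical shortcut chosen here rather than a library routine. Because σ = 4.0 is stated as known, a z-test would also be defensible; the sketch follows the slide and uses the sample SD. Function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_tail(t, df, steps=2000):
    """P(T >= |t|): 0.5 minus the area from 0 to |t| (Simpson's rule, even steps)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def one_sample_t(sample_mean, sample_sd, n, mu0):
    """t statistic and degrees of freedom for H0: population mean == mu0."""
    t = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
    return t, n - 1


# Numbers from the enzyme-defect example
t, df = one_sample_t(sample_mean=67.0, sample_sd=4.2, n=28, mu0=70.0)
# HA: mean < 70 kg. Because t is negative, P(T <= t) equals P(T >= |t|) by symmetry.
p_one_sided = t_tail(t, df)
print(f"t = {t:.2f}, df = {df}, one-sided P = {p_one_sided:.4f}")
```

This gives t ≈ −3.78 with 27 df and a one-sided P value below 0.001, so at α = 0.05 the enzyme-defect group's mean weight appears significantly lower than 70 kg.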
2-sample t-test Aim: Compare two means Example: Comparing pulse rate in people taking two different drugs Assumption: Both data sets are sampled from Gaussian distributions with the same population standard deviation Effect size: Difference between two means Null hypothesis: The two population means are identical Meaning of P value: If the two population means are identical, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
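Under the equal-SD assumption stated above, the two-sample statistic uses a pooled variance: t = (x̄1 − x̄2) / √(sp² (1/n1 + 1/n2)), with n1 + n2 − 2 df. The sketch computes t and df only; the P value would then come from a t distribution (for example scipy.stats.t.sf, which is outside this sketch). The function name and toy values are hypothetical.

```python
import math
import statistics


def pooled_two_sample_t(x, y):
    """Student's two-sample t statistic assuming equal population SDs.

    Returns (t, df) with df = n1 + n2 - 2.
    """
    n1, n2 = len(x), len(y)
    # statistics.variance uses the n - 1 denominator (sample variance)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return t, n1 + n2 - 2


# Toy values, not study data
t, df = pooled_two_sample_t([1, 2, 3], [4, 5, 6])
print(f"t = {t:.3f}, df = {df}")
```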
Paired t-test Aim: Compare a continuous variable before and after an intervention Example: Comparing pulse rate before and after taking a drug Assumption: The population of paired differences is Gaussian Effect size: Mean of the paired differences Null hypothesis: The population mean of paired differences is zero Meaning of P value: If there is no difference in the population, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
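The paired t-test is a one-sample t-test on the within-pair differences: t = d̄ / (sd / √n) with n − 1 df, where d̄ and sd are the mean and SD of the differences. A sketch under that definition, with hypothetical before/after pulse rates as toy input:

```python
import math
import statistics


def paired_t(before, after):
    """Paired t statistic: a one-sample t-test on the differences (before - after)."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical pulse rates for 4 people, before and after a drug
t, df = paired_t([70, 72, 68, 75], [68, 70, 67, 72])
print(f"t = {t:.3f}, df = {df}")
```

Pairing removes between-person variability, which is why the differences, not the two raw columns, carry the test.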
One-way ANOVA Aim: Compare three or more means Example: Comparing pulse rate in 3 groups of people, each group taking a different drug Assumption: All data sets are sampled from Gaussian distributions with the same population standard deviation Effect size: Fraction of the total variation explained by variation among group means Null hypothesis: All population means are identical Meaning of P value: If the population means are identical, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
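The quantities behind one-way ANOVA can be computed directly: F is the between-group mean square divided by the within-group mean square, and the effect size named above (fraction of total variation explained by group means) is often reported as η² = SS_between / SS_total. A standard-library sketch with hypothetical toy groups:

```python
import statistics


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    group_means = [statistics.mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    # fraction of total variation explained by differences among group means
    eta_squared = ss_between / (ss_between + ss_within)
    return f, df_between, df_within, eta_squared


# Toy groups (not study data)
f, df1, df2, eta2 = one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"F({df1}, {df2}) = {f:.1f}, eta^2 = {eta2:.2f}")
```

The P value would come from an F distribution with (df_between, df_within) degrees of freedom, which this sketch does not compute.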
Parametric and non-parametric tests Parametric test – assumes that the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings Non-parametric test – distribution-free, makes no assumption about the distribution of the variable in the population
Normality test Normality tests are used to determine whether a data set is well modeled by a normal distribution and to quantify how compatible the data are with a normally distributed underlying random variable. In descriptive statistics terms, a normality test measures the goodness of fit of a normal model to the data: if the fit is poor, then the data are not well modeled in that respect by a normal distribution, without making a judgment on any underlying variable. In frequentist hypothesis testing, the data are tested against the null hypothesis that they are normally distributed.
Normality test Graphical methods An informal approach to testing normality is to compare a histogram of the sample data to a normal probability curve. The empirical distribution of the data (the histogram) should be bell-shaped and resemble the normal distribution. This might be difficult to see if the sample is small.
Normality test Frequentist tests Tests of univariate normality include the following: • D'Agostino's K-squared test • Jarque–Bera test • Anderson–Darling test • Cramér–von Mises criterion • Lilliefors test • Kolmogorov–Smirnov test • Shapiro–Wilk test • Etc.
Normality test Kolmogorov–Smirnov test K–S test is a nonparametric test of the equality of distributions that can be used to compare a sample with a reference distribution (1-sample K–S test), or to compare two samples (2-sample K–S test). K–S statistic quantifies a distance between the empirical distribution of the sample and the cumulative distribution of the reference distribution, or between the empirical distributions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (in the 1-sample case) or that the samples are drawn from the same distribution (in the 2-sample case).
Normality test Kolmogorov–Smirnov test In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using these to define the specific reference distribution changes the null distribution of the test statistic.
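A sketch of the one-sample K–S distance D, comparing a sample's empirical CDF with a reference CDF (here statistics.NormalDist). The second function standardizes with the sample mean and SD, as described above; because this changes the null distribution of D, the usual K–S critical values no longer apply (the Lilliefors test listed earlier addresses this case). Function names are hypothetical, and no P value is computed.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic(sample, reference=NormalDist()):
    """One-sample K-S distance D between the sample's empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = reference.cdf(x)
        # the empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_normality_statistic(sample):
    """Standardize with the sample mean and SD, then compare with N(0, 1).

    Because the mean and SD are estimated from the same data, standard K-S
    critical values do not apply to this D (the Lilliefors test handles this case).
    """
    m, s = mean(sample), stdev(sample)
    return ks_statistic([(x - m) / s for x in sample])


print(ks_statistic([-1.0, 0.0, 1.0]))
```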
Mann–Whitney U test Ordinal data, two independent samples. H0: Two sampled populations are equivalent in location (they have the same mean ranks or medians). The observations from both groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the two samples.
Mann–Whitney U test Aim: Compare the average ranks or medians of two unrelated groups. Example: Comparing pain relief score of patients undergoing two different physiotherapy programmes. Effect size: Difference between the two medians (mean ranks). Null hypothesis: The two population medians (mean ranks) are identical. Meaning of P value: If the two population medians (mean ranks) are identical, what is the chance of observing such a difference (or a bigger one) between medians (mean ranks) by chance alone?
Kruskal–Wallis H test Ordinal data, K independent samples. H0: K sampled populations are equivalent in location (they have the same mean ranks). The observations from all groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the K samples.
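Both rank-based tests above start by pooling and ranking the observations, with tied values sharing their average rank. The sketch computes U for two groups (U1 = R1 − n1(n1 + 1)/2, where R1 is the rank sum of the first group) and H for K groups (H = 12 / (N(N + 1)) × Σ Ri²/ni − 3(N + 1)). It omits the tie correction for H and does not compute P values; function names and data are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U1, U2), where U1 = R1 - n1(n1 + 1)/2 and U1 + U2 = n1 * n2."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[:n1])
    u1 = r1 - n1 * (n1 + 1) / 2
    return u1, n1 * n2 - u1


def kruskal_wallis_h(groups):
    """H statistic without the tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Toy values, not study data
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Complete separation of the groups gives the extreme U1 = 0; when ranks are well mixed, U1 and U2 are both close to n1 × n2 / 2.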
Wilcoxon signed rank test Ordinal data, two related samples. H0: Two sampled populations are equivalent in location (they have the same mean ranks). Takes into account information about the magnitude of differences within pairs and gives more weight to pairs that show large differences than to pairs that show small differences. Based on the ranks of the absolute values of the differences between the two variables.
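A sketch of the Wilcoxon signed rank statistics following the description above: zero differences are dropped, the absolute differences are ranked (ties get average ranks), and the ranks are summed separately for positive and negative differences. Names and toy values are hypothetical; the P value step is not shown.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus) for paired data; zero differences are dropped."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical before/after scores (not study data)
print(wilcoxon_signed_rank([70, 72, 68, 75], [68, 70, 67, 72]))
```

The larger a pair's absolute difference, the larger its rank, which is how the test gives more weight to pairs with large differences.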
Chi-square χ2 test Chi-square χ2 test is used to check for an association between 2 categorical variables. H0: There is no association between the variables. HA: There is an association between the variables. If two categorical variables are associated, it means the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable. Is there an association?
Chi-square χ2 test Let’s say that we want to determine whether there is an association between Place of birth and Alcohol consumption. When we test whether these two variables are associated, we are trying to determine if coming from a particular area makes an individual more likely to consume alcohol. If that is the case, then we can say that Place of birth and Alcohol consumption are related or associated. Assumptions: A large sample of independent observations; All expected counts should be ≥ 1 (no zeros); At least 80% of expected counts should be ≥ 5.
Chi-square χ2 test The following table presents the data on place of birth and alcohol consumption. The two variables of interest, place of birth and alcohol consumption, have r = 4 and c = 2 categories, resulting in 4 x 2 = 8 combinations of categories.
Expected counts For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote: Ri = total count of observations in the i-th row. Cj = total count of observations in the j-th column. N = total number of observations. Oij = observed count for the cell in the i-th row and the j-th column. Eij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 were true. These counts are calculated as Eij = (Ri x Cj) / N.
Expected counts E11 = (695 x 1180) / 1363 E12 = (695 x 183) / 1363 E21 = (281 x 1180) / 1363 E22 = (281 x 183) / 1363 E31 = (159 x 1180) / 1363 E32 = (159 x 183) / 1363 E41 = (228 x 1180) / 1363 E42 = (228 x 183) / 1363
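The expected counts can be computed from the margins alone, Eij = Ri × Cj / N. The sketch uses the row totals (695, 281, 159, 228) and column totals (1180, 183) from the calculations above; the observed cell counts are not needed for this step. The function name is hypothetical.

```python
def expected_counts(row_totals, col_totals):
    """E_ij = R_i * C_j / N for every cell, assuming independence (H0)."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add up to the same N")
    return [[r * c / n for c in col_totals] for r in row_totals]


row_totals = [695, 281, 159, 228]  # R_1..R_4 from the slide (N = 1363)
col_totals = [1180, 183]           # C_1, C_2 from the slide
expected = expected_counts(row_totals, col_totals)
for i, row in enumerate(expected, start=1):
    print(f"row {i}: " + "  ".join(f"{e:7.2f}" for e in row))
```

A useful check: each row of expected counts sums back to its row total, and each column to its column total.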
Chi-square χ2 test The test statistic, χ2 = Σ (Oij − Eij)² / Eij summed over all r x c cells, measures the difference between the observed and the expected counts assuming independence. If the statistic is large, the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a 'large' χ2 gives evidence against H0 and supports HA. To get the corresponding p-value we need to use a χ2 distribution with (r-1) x (c-1) df; for the 4 x 2 table above, that is 3 x 1 = 3 df.
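A standard-library sketch of the whole test for an r × c table of observed counts: expected counts from the margins, the χ2 statistic summed over all cells, (r − 1)(c − 1) df, and the upper-tail P value from a χ2 distribution. The tail probability uses the series expansion of the regularized incomplete gamma function, which is adequate for moderate statistics but is a numerical choice of this sketch, not part of the slides. Function names and the 2 × 2 toy table are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square distribution.

    Uses the series for the regularized lower incomplete gamma function
    P(df/2, x/2); fine for moderate x, less accurate for very large x.
    """
    if x <= 0:
        return 1.0
    s, z = df / 2, x / 2
    term = 1.0 / s
    series = term
    n = 1
    while term > 1e-15 * series:
        term *= z / (s + n)
        series += term
        n += 1
    lower = series * math.exp(-z + s * math.log(z) - math.lgamma(s))
    return max(0.0, 1.0 - lower)


def chi_square_test(observed):
    """Return (statistic, df, p) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            stat += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical 2 x 2 table of counts (not study data)
stat, df, p = chi_square_test([[10, 20], [20, 10]])
print(f"chi2 = {stat:.3f}, df = {df}, P = {p:.4f}")
```

For 1 df the tail probability equals erfc(√(x/2)), which gives an easy cross-check of the series.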
Association is not causation. The observed association between two variables might be due to the action of a third, unobserved variable. Beware!
Limitations • No expected count should be less than 1 • No more than 1/5 of the expected counts should be less than 5 • To correct for this, collect larger samples or combine the categories with small expected counts until their combined expected count is 5 or more • Yates correction • When there is only 1 degree of freedom (a 2 x 2 table), the regular chi-square test should not be used without correction • Apply the Yates correction by subtracting 0.5 from the absolute value of each calculated O − E term, then continue as usual with the new corrected values
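Yates' correction for a 2 × 2 table, as described above, replaces each (O − E)² term with (|O − E| − 0.5)². A sketch with a hypothetical function name and toy table:

```python
def yates_chi_square(observed):
    """Chi-square statistic for a 2 x 2 table with Yates' continuity correction."""
    if len(observed) != 2 or any(len(row) != 2 for row in observed):
        raise ValueError("Yates' correction as written here is for 2 x 2 tables")
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n
            # subtract 0.5 from |O - E| before squaring
            stat += (abs(observed[i][j] - e) - 0.5) ** 2 / e
    return stat


# Hypothetical 2 x 2 table (not study data)
print(f"Yates-corrected chi2 = {yates_chi_square([[10, 20], [20, 10]]):.2f}")
```

For this toy table the uncorrected statistic is 20/3 ≈ 6.67 and the corrected one is 5.4, so the correction makes the test more conservative; the P value is then read from a χ2 distribution with 1 df.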