Understanding Normality Tests in R and Choosing the Right Statistical Test

MASH R workshop 2:

In this session you will learn:

- How to check normality in R and determine when to use a parametric or a non-parametric test.
- How to run the main parametric tests:
  - T-test, paired or unpaired (independent)
  - ANOVA, one-way or repeated measures

Why check normality?

If your data is approximately normally distributed, you will use parametric tests in your statistical analysis. If your data is not normally distributed, you are more likely to use non-parametric tests. The non-parametric tests will be covered in the next session (R Workshop Session 3).

Checking “NORMALITY” for your data

By “normality”, we mean checking whether your column of measurements (or data) is approximately distributed as a “bell shape” when plotted as a histogram. In the example, the histogram corresponding to the data has a bell shape, so the data is approximately normally distributed.

By “bell” shape, we also mean symmetry around the mean. The normal distribution is a symmetric distribution (see R Workshop Session 1), for which Mean = Mode = Median. In a symmetric distribution, it is important to note that:

- You should notice a symmetry between the left-hand side and the right-hand side of the mean (or mode or median).
- The axis of symmetry should pass through the mean.

Don’t forget that your data should be approximately normally distributed, not necessarily exactly normally distributed. In the example histogram, the axis of symmetry passes (approximately) through the mean, separating the left-hand side from the right-hand side, and the mode (68) is approximately equal to the mean (68.30).

Example of skewed data: there is no symmetry, and the mode, the median and the mean no longer coincide.

Plotting histograms in R

Following what you learned in Session 1, you can plot a histogram of your data with the command “hist()”. From the MASH website, download the ‘normR’ data set from https://www.sheffield.ac.uk/mash/statistics/datasets and put the data sets in a folder. Then set this folder as the working directory.
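As a minimal sketch of this step, the block below draws a histogram with hist(). It does not read the real ‘normR’ file: the vector `score` is simulated stand-in data, and the file name in the commented lines is an assumption.

```r
# With the real data you would first do something like (file name assumed):
# setwd("path/to/your/folder")
# normR <- read.csv("normR.csv")

# Simulated stand-in for one column of measurements (hypothetical values)
set.seed(1)
score <- rnorm(40, mean = 68, sd = 5)

# Histogram of the measurements: look for a roughly symmetric bell shape
h <- hist(score, main = "Histogram of score", xlab = "score")

# For symmetric data the mean and the median should be close to each other
mean(score)
median(score)
```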

Example histograms: skewed data and symmetrical data.

Test for normality

There are tests to assess whether or not your data is normally distributed. In all cases, the null hypothesis is “H0: Data normally distributed”. If the p-value is smaller than 0.05, you reject the null hypothesis and therefore conclude that the data is not normally distributed.

- If the sample has fewer than 50 participants, use the Shapiro-Wilk test.
- If the sample has more than 50 participants, use the Kolmogorov-Smirnov test.

Test for normality: Shapiro-Wilk (<50 people)

The size of each sample is less than 50, so I can use the Shapiro-Wilk test in both cases.

- P-value < 0.05: the null hypothesis is rejected, so the data is not normally distributed.
- P-value > 0.05: the null hypothesis is not rejected, so the data can be treated as normally distributed.
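A minimal sketch of the two cases, using the base R function shapiro.test(). The two samples are simulated (hypothetical) stand-ins for the workshop data, one roughly bell-shaped and one strongly skewed.

```r
# Two simulated samples of fewer than 50 observations (hypothetical data)
set.seed(2)
symmetric <- rnorm(30, mean = 68, sd = 5)  # roughly bell-shaped
skewed    <- rexp(30, rate = 1)            # strongly right-skewed

# Shapiro-Wilk test; H0: the data is normally distributed
sw_symmetric <- shapiro.test(symmetric)
sw_skewed    <- shapiro.test(skewed)

# Compare each p-value with 0.05
sw_symmetric$p.value
sw_skewed$p.value
```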

Test for normality: Kolmogorov-Smirnov (>50 people)

If the sample has more than 50 people, the Kolmogorov-Smirnov test is preferred. Let us just assume that our data set has more than 50 people. Watch out! The syntax differs a little from the Shapiro-Wilk test!
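The difference in syntax can be sketched as follows, on simulated (hypothetical) data. Unlike shapiro.test(), ks.test() needs to be told which distribution to compare against, here “pnorm” with the sample mean and standard deviation. Estimating these two parameters from the same data makes the p-value only approximate.

```r
# Simulated sample of more than 50 observations (hypothetical data)
set.seed(3)
x <- rnorm(80, mean = 68, sd = 5)

# Kolmogorov-Smirnov test against a normal distribution whose mean and
# standard deviation are taken from the sample itself
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Compare the p-value with 0.05
ks$p.value
```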

However, be cautious with those tests, because a single outlier can lead to rejecting the null hypothesis even though the rest of your data is perfectly symmetric. We recommend that you check both graphs and tests. If you are still undecided, don’t hesitate to come to MASH and ask us!


You can also show a P-P plot to assess whether your data is normally distributed (or not). It is more common, however, to show the Q-Q plot.
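A Q-Q plot can be sketched with the base R functions qqnorm() and qqline(). The data here is simulated (hypothetical). Points lying close to the line suggest an approximately normal distribution.

```r
# Simulated measurements (hypothetical data)
set.seed(4)
x <- rnorm(40, mean = 68, sd = 5)

# Normal Q-Q plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(x, main = "Normal Q-Q plot")

# Reference line through the first and third quartiles
qqline(x)
```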

Parametric tests

In this session, we will only study parametric tests, that is, tests to use when your data is normally distributed.

Comparing a measurement between 2 independent groups: independent t-test

An independent t-test will detect if there is any statistically significant difference in a measurement (score) between 2 groups (Group 1 and Group 2). We therefore have one categorical variable (Group) and one continuous variable (score).

- You need to check that your measurement is normally distributed in both groups.
- You also need to check if there are any outliers. If so, it is better to remove them! (We can keep outliers that are not too extreme, though.)
- Finally, our last assumption is that the variance in each group is roughly the same. We verify this assumption with Levene’s test.

For the assumptions of each test: go to “LAERD SPSS name_of_the_test”!

Download the Birthweight data set for R (.csv format on the website) and store it in the correct working directory. Then open the .csv file in R.

Once the file is open:

- Show the first 6 rows of the data set.
- Attach the data set. Thanks to this, the system will recognize the variables when you call them directly: “id”, “length”, “Gestation”, etc.
- Translate the binary code into words: 0 means Non-Smoker and 1 means Smoker. The original data set contains 0 and 1 instead of the words “Non-Smoker” and “Smoker”.
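These steps can be sketched as below. To keep the example runnable without the download, it writes a tiny made-up file to a temporary folder and reads it back. The values and the column name `smoker` are assumptions, and the sketch uses `birthweight$...` rather than attach().

```r
# Hypothetical miniature version of the Birthweight file, written to a
# temporary location so the example runs without the real download
toy <- data.frame(
  id          = 1:6,
  length      = c(50, 52, 49, 53, 51, 48),
  Birthweight = c(3.2, 3.6, 2.9, 3.8, 3.4, 2.7),
  Gestation   = c(39, 41, 38, 40, 40, 37),
  smoker      = c(0, 0, 1, 0, 1, 1)
)
path <- file.path(tempdir(), "Birthweight.csv")
write.csv(toy, path, row.names = FALSE)

# Open the .csv file and show the first 6 rows
birthweight <- read.csv(path)
head(birthweight)

# Translate the 0/1 code into labelled categories
birthweight$smoker <- factor(birthweight$smoker,
                             levels = c(0, 1),
                             labels = c("Non-Smoker", "Smoker"))
table(birthweight$smoker)
```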

Is my measurement (Birthweight) normally distributed in both groups, “Smoker” and “Non-Smoker”? I am plotting 2 histograms, in the same plot, representing the distribution of Birthweight in each group. Does this look symmetric to you? If you are not sure, make a Q-Q plot of Birthweight for the 2 groups and see if the points are close to the line.

The 2 Q-Q plots, again drawn in the same plot (first plot, second plot), seem to correspond to a normal distribution: there is no “S” shape around the line and the points are close to the line.
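A minimal sketch of these graphical checks, assuming par(mfrow = ...) is used to place several graphs in the same plot. The data frame `bw` is simulated (hypothetical), with column names modelled on the Birthweight data set.

```r
# Simulated stand-in for the Birthweight data (hypothetical values)
set.seed(5)
bw <- data.frame(
  Birthweight = c(rnorm(20, mean = 3.5, sd = 0.5),
                  rnorm(22, mean = 3.1, sd = 0.5)),
  smoker = factor(rep(c("Non-Smoker", "Smoker"), times = c(20, 22)))
)
non_smoker <- bw$Birthweight[bw$smoker == "Non-Smoker"]
smoker     <- bw$Birthweight[bw$smoker == "Smoker"]

# 2 x 2 grid: histograms on the first row, Q-Q plots on the second
old_par <- par(mfrow = c(2, 2))
hist(non_smoker, main = "Non-Smoker", xlab = "Birthweight")
hist(smoker, main = "Smoker", xlab = "Birthweight")
qqnorm(non_smoker, main = "Q-Q plot: Non-Smoker")
qqline(non_smoker)
qqnorm(smoker, main = "Q-Q plot: Smoker")
qqline(smoker)
par(old_par)  # restore the previous plot layout
```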

We can also perform a normality test. The size of each group does not exceed 50, so we can do a Shapiro-Wilk test for Smoker and for Non-Smoker. Both of these tests have a p-value > 0.05, so we do not reject the null hypothesis. Reminder: the null hypothesis is “My data is normally distributed”. Hence we can conclude that the data is approximately normally distributed in both groups.
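The test for each group can be sketched like this, on simulated (hypothetical) data with column names modelled on the Birthweight data set.

```r
# Simulated stand-in for the Birthweight data (hypothetical values)
set.seed(6)
bw <- data.frame(
  Birthweight = c(rnorm(20, mean = 3.5, sd = 0.5),
                  rnorm(22, mean = 3.1, sd = 0.5)),
  smoker = factor(rep(c("Non-Smoker", "Smoker"), times = c(20, 22)))
)

# One Shapiro-Wilk test per group, each on the subset of rows for that group
sw_non_smoker <- shapiro.test(bw$Birthweight[bw$smoker == "Non-Smoker"])
sw_smoker     <- shapiro.test(bw$Birthweight[bw$smoker == "Smoker"])

sw_non_smoker$p.value
sw_smoker$p.value
```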

Checking for outliers in both groups: no outliers!
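One common way to look for outliers is a boxplot per group, sketched here on simulated (hypothetical) data. Whether the original slide used exactly this call is an assumption. Points drawn beyond the whiskers are the values flagged as outliers.

```r
# Simulated stand-in for the Birthweight data (hypothetical values)
set.seed(7)
bw <- data.frame(
  Birthweight = c(rnorm(20, mean = 3.5, sd = 0.5),
                  rnorm(22, mean = 3.1, sd = 0.5)),
  smoker = factor(rep(c("Non-Smoker", "Smoker"), times = c(20, 22)))
)

# One box per group; boxplot() also returns the flagged values
bp <- boxplot(Birthweight ~ smoker, data = bw, ylab = "Birthweight")

bp$out    # outlying values (empty if there are none)
bp$group  # index of the group each outlier belongs to
```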

Finally, our last assumption is that the variance in each group is roughly the same. We verify this assumption with Levene’s test. In this test, the null hypothesis is “Variance Group 1 = Variance Group 2”. This test is contained in the R package “car”. After running Levene’s test, the p-value is more than 0.05, so we can retain the null hypothesis of equal variances. Last assumption checked!
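To show what Levene’s test computes without requiring the “car” package, the sketch below does the calculation in base R on simulated (hypothetical) data: a one-way ANOVA on the absolute deviations of each value from its group median. With “car” installed, `car::leveneTest(Birthweight ~ smoker, data = bw)` should report the same test, since its default centre is the median.

```r
# Simulated stand-in for the Birthweight data (hypothetical values)
set.seed(8)
bw <- data.frame(
  Birthweight = c(rnorm(20, mean = 3.5, sd = 0.5),
                  rnorm(22, mean = 3.1, sd = 0.5)),
  smoker = factor(rep(c("Non-Smoker", "Smoker"), times = c(20, 22)))
)

# Absolute deviation of each value from the median of its own group
bw$abs_dev <- abs(bw$Birthweight - ave(bw$Birthweight, bw$smoker, FUN = median))

# Levene's test is a one-way ANOVA on these deviations
levene <- anova(lm(abs_dev ~ smoker, data = bw))
levene_p <- levene[["Pr(>F)"]][1]

# p-value above 0.05: no evidence against equal variances
levene_p
```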

We can finally run the independent t-test. The null hypothesis is “The birthweight is the same in both groups”. If Levene’s test fails, you need to set the argument var.equal to FALSE. The p-value is less than 0.05, so we reject the null hypothesis. We conclude that there is a significant difference in the birthweight of the babies between the Smokers and the Non-Smokers (mothers).
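A minimal sketch of the call, using t.test() with a formula on simulated (hypothetical) data. The second call shows the var.equal = FALSE variant to use when the equal-variance assumption does not hold.

```r
# Simulated stand-in for the Birthweight data (hypothetical values)
set.seed(9)
bw <- data.frame(
  Birthweight = c(rnorm(20, mean = 3.5, sd = 0.5),
                  rnorm(22, mean = 3.1, sd = 0.5)),
  smoker = factor(rep(c("Non-Smoker", "Smoker"), times = c(20, 22)))
)

# Independent t-test assuming equal variances (Levene's test passed)
tt <- t.test(Birthweight ~ smoker, data = bw, var.equal = TRUE)
tt$p.value

# If Levene's test fails, use the Welch version instead
tt_welch <- t.test(Birthweight ~ smoker, data = bw, var.equal = FALSE)
tt_welch$p.value
```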

Comparing a measurement between 3 independent groups: one-way ANOVA

A one-way ANOVA will detect if there is any statistically significant difference in a measurement (score) between 3 or more groups (Group 1, Group 2, Group 3, etc.). We therefore have one categorical variable (Group) and one continuous variable (score).

- You need to check that your measurement is normally distributed in each group.
- You also need to check if there are any outliers per group. If so, you will need to remove them!
- Finally, our last assumption is that the variance in each group is roughly the same. We verify this assumption with Levene’s test.

For the assumptions of each test: go to “LAERD SPSS name_of_the_test”!

The one-way ANOVA can be seen as the equivalent of the independent t-test for when you want to compare a measurement between 3 or more independent groups. From the MASH website, download the Diet.csv file into the working directory, i.e. the same directory where the Birthweight.csv file is located.

The research question is: “Which of the 3 diets was the best for losing weight?”

- There are 3 diets, hence 3 groups. The variable Diet is our categorical variable.
- The weight lost will be our measurement (continuous/scale) to compare the 3 diets.
- We therefore need to create another column representing the weight lost, by subtracting “weight6weeks” from “pre.weight”.

You can create the weight lost variable and add it directly to your DietR data set. Each row does the subtraction pre.weight – weight6weeks. E.g. for row 6: 64 – 61.1 = 2.9 (weightlost). Don’t forget to attach the file so that the software recognizes the variables when you call them! A further command defines Diet as a categorical variable.
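A minimal sketch of creating the new column and declaring Diet as categorical. The small data frame `diet` is made up for illustration, apart from one row that reuses the 64 and 61.1 from the example above.

```r
# Small made-up stand-in for the Diet data set
diet <- data.frame(
  Diet         = c(1, 1, 2, 2, 3, 3),
  pre.weight   = c(60, 64, 58, 70, 72, 66),
  weight6weeks = c(57.5, 61.1, 56, 68.2, 66.5, 61)
)

# Weight lost = weight before the diet minus weight after 6 weeks
diet$weightlost <- diet$pre.weight - diet$weight6weeks

# Define Diet as a categorical variable (factor)
diet$Diet <- factor(diet$Diet)

diet
```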

Assumption 1: Your measurement (weightlost) should be approximately normally distributed in each group. If it is not normally distributed in one group, then it is better to choose the non-parametric alternative (Kruskal-Wallis). The non-parametric tests are covered in the next R session (R Workshop 3).


The 3 Shapiro-Wilk tests each show a p-value higher than 0.05, so we do not reject the null hypothesis “Normally distributed data”. Assumption 1 is then checked!
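With 3 groups, the three tests can be run in one go. The sketch below uses tapply() on simulated (hypothetical) data to collect one Shapiro-Wilk p-value per diet. This is one possible way to do it, not necessarily the commands used on the slide.

```r
# Simulated stand-in for the Diet data (hypothetical values)
set.seed(10)
diet <- data.frame(
  Diet = factor(rep(1:3, each = 25)),
  weightlost = c(rnorm(25, mean = 3.3, sd = 2.2),
                 rnorm(25, mean = 3.0, sd = 2.5),
                 rnorm(25, mean = 5.1, sd = 2.4))
)

# Apply the Shapiro-Wilk test within each diet and keep only the p-value
p_values <- tapply(diet$weightlost, diet$Diet,
                   function(v) shapiro.test(v)$p.value)

# One p-value per diet; each should be compared with 0.05
p_values
```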

Assumption 2: No outliers. Your data should have no outliers in the 3 groups. There are 2 outliers in Diet 1! We can see on the boxplot that the outliers lie above 8. We therefore eliminate the participants of Diet 1 whose “weightlost” is more than 8.
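Removing those participants can be sketched with a logical condition, as below. The data frame is made up for illustration. Note that the condition only removes rows from Diet 1, so a value above 8 in another diet is kept.

```r
# Small made-up stand-in for the Diet data set
diet <- data.frame(
  Diet       = factor(c(1, 1, 1, 1, 2, 2, 3, 3)),
  weightlost = c(3.1, 2.4, 8.5, 9.2, 8.3, 2.0, 5.5, 4.9)
)

# Flag the participants of Diet 1 who lost more than 8
is_outlier <- diet$Diet == "1" & diet$weightlost > 8

# Keep every row that is not flagged
diet_clean <- diet[!is_outlier, ]

nrow(diet)
nrow(diet_clean)
```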

Assumption 3: Homogeneity of variance. Basically, it means that the variance is the same in each group. Levene’s test checks this. If the p-value is above 0.05, then you can assume that the variances of the different groups are approximately equal. The p-value indicated here (0.5377) is more than 0.05, so we do not reject the null hypothesis that the groups have similar variance.

Running the ANOVA: if Levene’s test fails, you can replace the parameter “var.equal=TRUE” by “var.equal=FALSE”. The null hypothesis is: “The lost weight is the same in every group”. The p-value is 0.003229, which is smaller than 0.05, so you reject the null hypothesis. You conclude that there is a statistically significant difference in lost weight between the 3 diets.
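The var.equal parameter mentioned above matches the base R function oneway.test(), so the sketch below assumes that function. The data is simulated (hypothetical), so the p-value will not match the 0.003229 reported for the real data.

```r
# Simulated stand-in for the Diet data (hypothetical values)
set.seed(11)
diet <- data.frame(
  Diet = factor(rep(1:3, each = 25)),
  weightlost = c(rnorm(25, mean = 3.3, sd = 2.2),
                 rnorm(25, mean = 3.0, sd = 2.5),
                 rnorm(25, mean = 5.1, sd = 2.4))
)

# One-way ANOVA assuming equal variances (Levene's test passed)
res <- oneway.test(weightlost ~ Diet, data = diet, var.equal = TRUE)
res$p.value

# If Levene's test fails, use the Welch version instead
res_welch <- oneway.test(weightlost ~ Diet, data = diet, var.equal = FALSE)
res_welch$p.value
```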

Now that you know there is a statistically significant difference between the diets, you may want to know which groups differ from each other. You need to run multiple comparison tests, often called post-hoc tests because they are run after the ANOVA test.

There are 3 possible comparisons:

- Group 1 vs. Group 2 (p-value > 0.05)
- Group 3 vs. Group 1 (p-value < 0.05)
- Group 3 vs. Group 2 (p-value < 0.05)

The p-value is lower than 0.05 when comparing Group 3 with Group 1, and Group 3 with Group 2. The p-value is more than 0.05 when comparing Groups 1 and 2. Therefore we conclude that there is a statistically significant difference between Group 3 and the other 2 groups. However, there is no significant difference between Group 1 and Group 2.
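The text does not say which post-hoc procedure was used. As one common option, the sketch below runs Tukey’s HSD with the base R functions aov() and TukeyHSD() on simulated (hypothetical) data. It returns one adjusted p-value per pair of groups.

```r
# Simulated stand-in for the Diet data (hypothetical values)
set.seed(12)
diet <- data.frame(
  Diet = factor(rep(1:3, each = 25)),
  weightlost = c(rnorm(25, mean = 3.3, sd = 2.2),
                 rnorm(25, mean = 3.0, sd = 2.5),
                 rnorm(25, mean = 5.1, sd = 2.4))
)

# Fit the ANOVA model, then compare every pair of diets
fit <- aov(weightlost ~ Diet, data = diet)
posthoc <- TukeyHSD(fit)

# One row per comparison (2-1, 3-1, 3-2); the column "p adj" is the p-value
posthoc$Diet
```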

Comparing a measurement twice on the same group: paired t-test

A paired t-test will detect if there is any statistically significant difference in a measurement (score) for the same group of participants at 2 different times or under 2 different conditions. We therefore have one categorical variable (Time) and one continuous variable (score).

- You need to check that the difference in measurement between time 1 and time 2 is normally distributed.
- You also need to check if there are any outliers in this difference. If so, you will need to remove them!

(No need to check the equality of variances between time 1 and time 2 for this one!)

For the assumptions of each test: go to “LAERD SPSS name_of_the_test”!

You will need to download the Cholesterol file for R from the MASH website and put it in the same directory as the Diet and Birthweight files.

Research question: Is there a statistically significant difference in cholesterol level between Before and After 4 weeks? We have the cholesterol level at “Before” and the cholesterol level at “After4weeks”. In order to validate our assumptions, we need to create the difference between the cholesterol level After4weeks and the cholesterol level Before. We will then check whether there are any outliers and whether the distribution of the difference is approximately normal.

Assumption 1: Your measurement difference should be approximately normally distributed. If this is not the case, then it is better to choose the non-parametric alternative (Wilcoxon). The non-parametric tests are covered in the next R session (R Workshop 3). From the plot alone, it is difficult to conclude! We might need the normality test.

The null hypothesis of the Shapiro-Wilk test is that the difference is normally distributed. The p-value is more than 0.05, therefore we can keep the null hypothesis and conclude that the data is approximately normally distributed. By data, we mean the difference in cholesterol level between After 4 weeks and Before. Normality checked!

Assumption 2: No outliers. The cholesterol difference should have no outliers. One outlier here!
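The two assumption checks for the paired t-test can be sketched together as follows. The data frame `chol` is simulated (hypothetical). Only the column names “Before” and “After4weeks” come from the text, and the column name `difference` is an assumption.

```r
# Simulated stand-in for the Cholesterol data (hypothetical values)
set.seed(13)
chol <- data.frame(Before = rnorm(18, mean = 6.4, sd = 1.2))
chol$After4weeks <- chol$Before - rnorm(18, mean = 0.6, sd = 0.2)

# The paired t-test assumptions concern the difference, not each column
chol$difference <- chol$After4weeks - chol$Before

# Assumption 1: is the difference approximately normally distributed?
hist(chol$difference, main = "Difference After4weeks - Before",
     xlab = "difference")
sw_diff <- shapiro.test(chol$difference)
sw_diff$p.value

# Assumption 2: are there outliers in the difference?
bp <- boxplot(chol$difference, ylab = "difference")
bp$out
```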

Running the paired t-test: the p-value is very small! 0.00000000001958. There is strong evidence against the null hypothesis (no difference between After and Before). Therefore there is a statistically significant difference between Before and After.
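A minimal sketch of the paired t-test itself, using t.test() with paired = TRUE on simulated (hypothetical) data. The p-value will therefore not match the one quoted above.

```r
# Simulated stand-in for the Cholesterol data (hypothetical values)
set.seed(14)
chol <- data.frame(Before = rnorm(18, mean = 6.4, sd = 1.2))
chol$After4weeks <- chol$Before - rnorm(18, mean = 0.6, sd = 0.2)

# Paired t-test; H0: the mean difference between After4weeks and Before is 0
pt <- t.test(chol$After4weeks, chol$Before, paired = TRUE)

pt$p.value
pt$estimate  # mean of the differences After4weeks - Before
```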

Comparing a measurement 3 or more times on the same group: repeated measures ANOVA

A repeated measures ANOVA will detect if there is any statistically significant difference in a measurement (score) for the same group of participants at 3 or more different times or under 3 or more different conditions. We therefore have one categorical variable (Time) and one continuous variable (score).

- You need to check that your measurement at each time is normally distributed.
- You also need to check if there are any outliers at each time. If so, you will need to remove them!
- Assumption of sphericity: this assumption will be computed by our function. Sphericity means that all possible differences between times have the same variance. We will not go into too much detail, but this assumption needs to be checked:

Variance(time2 – time1) = Variance(time3 – time1) = Variance(time3 – time2)

where time3, time2 and time1 are the measurements made at these times.
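To make the sphericity condition concrete, the sketch below computes the three variances by hand on simulated (hypothetical) measurements. This is only an illustration of the formula, not a formal sphericity test. The names `time1`, `time2` and `time3` follow the formula above.

```r
# Simulated measurements for the same 18 participants at 3 times
# (hypothetical values)
set.seed(15)
n <- 18
time1 <- rnorm(n, mean = 6.4, sd = 1.2)
time2 <- time1 - rnorm(n, mean = 0.5, sd = 0.2)
time3 <- time2 - rnorm(n, mean = 0.1, sd = 0.2)

# Variance of every pairwise difference between times
diff_vars <- c(
  "time2 - time1" = var(time2 - time1),
  "time3 - time1" = var(time3 - time1),
  "time3 - time2" = var(time3 - time2)
)

# Under sphericity these three values should be roughly equal
diff_vars
```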

Assumption 1: Measurement normally distributed at each time.


Each p-value is more than 0.05, so we do not reject the null hypothesis that the cholesterol is normally distributed at each time. Assumption 1 checked!
