R session 3

MASH. R session 3. How to check normality in R and determine when to use a parametric or a non-parametric test.

kangelia

Presentation Transcript


  1. MASH R session 3

  2. In this session you will learn: How to check normality in R and determine when to use a parametric or a non-parametric test. How to run the main non-parametric tests: - Mann-Whitney U test (unpaired, 2 groups) and Wilcoxon signed-rank test (paired, 2 groups). - Kruskal-Wallis test (unpaired, 3+ groups) and Friedman test (paired, 3+ groups). - Chi-square test (to test the association between 2 categorical variables). - McNemar test (the paired Chi-square test, i.e. to test whether a categorical variable changes between two paired measurements).

  3. Why check normality? If your data is approximately normally distributed, you will use parametric tests in your statistical analysis. If your data is not normally distributed, you are more likely to use non-parametric tests. The non-parametric tests are covered in this session.

  4. Checking normality of your data: tests for normality. There are tests to assess whether or not your data is normally distributed. In all cases, the null hypothesis is “H0: the data is normally distributed”. If the p-value is smaller than 0.05, you reject the null hypothesis and conclude that the data is not normally distributed.

  5. Type of measurement where you use non-parametric tests. Example of skewed data: no symmetry; the figure marks the mode, the median and the mean.

  6. Checking normality: remember the 3 methods: plot a histogram, run a normality test (Shapiro-Wilk test), plot a Q-Q plot. If the p-value < 0.05, the null hypothesis is rejected: the data is not normally distributed.
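
The three checks can be sketched in a few lines of R. This is only an illustration on simulated, deliberately skewed data; the variable `x` is an assumption and not part of the session's data sets.

```r
# Simulated right-skewed sample (an assumption, not the session's data)
set.seed(1)
x <- rexp(100, rate = 1)

# 1. Histogram: look for asymmetry
hist(x, main = "Histogram of x", xlab = "x")

# 2. Shapiro-Wilk test: H0 is "the data is normally distributed"
sw <- shapiro.test(x)
print(sw$p.value)

# 3. Q-Q plot: the points should follow the line if the data is normal
qqnorm(x)
qqline(x)
```

A p-value below 0.05 in step 2 would lead you to reject normality for `x`.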

  7. Open RStudio first.

  8. Mann-Whitney: Comparing the measurement between 2 independent groups. The Mann-Whitney U test is the non-parametric version of the independent t-test. Use it when: • You want to investigate whether a score differs significantly between 2 independent groups. • The measurement is skewed in one group or in both. In that case, the t-test's assumption of normally distributed data is violated and we use the Mann-Whitney U test instead.

  9. Mann-Whitney: Comparing the measurement between 2 independent groups. Download the Titanic data set and save it in your current R working directory.

  10. Mann-Whitney: Comparing the measurement between 2 independent groups.

  11. Mann-Whitney: Comparing the measurement between 2 independent groups. Remember that you must look at the data in both groups separately! These 2 histograms both look skewed on the left.

  12. Mann-Whitney: Comparing the measurement between 2 independent groups. The Q-Q plot shows the beginning of an S-shape, but it is not that obvious! We need a normality test to conclude.

  13. Mann-Whitney: Comparing the measurement between 2 independent groups. We need to remove the NA values from the data. The p-value is very small, so we reject the normal distribution for the people who did not survive. Here, the KS test does not reject normality for the age of those who survived.
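
A minimal sketch of removing NA values before a Kolmogorov-Smirnov (KS) test follows. The `age` vector is simulated, and comparing against a normal distribution with the sample's own mean and standard deviation is an assumption about how the test was set up on the slide.

```r
set.seed(2)
# Hypothetical ages with some missing values (not the Titanic data)
age <- c(rnorm(60, mean = 30, sd = 12), NA, NA, NA)

# Drop the NA entries before testing
age_clean <- age[!is.na(age)]

# KS test against a normal distribution with the sample's mean and sd
ks <- ks.test(age_clean, "pnorm", mean = mean(age_clean), sd = sd(age_clean))
print(ks$p.value)
```

Because the mean and standard deviation are estimated from the same sample, the resulting p-value tends to be on the conservative side.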

  14. Mann-Whitney: Comparing the measurement between 2 independent groups. By convention, if one group is normally distributed but the other is skewed, we should use the non-parametric test. In addition, since the two histograms look skewed, we are confident in using a non-parametric test (Mann-Whitney U test). In any case, you can also perform a parametric test (here an independent t-test) afterwards and compare the two results.

  15. Mann-Whitney: Comparing the measurement between 2 independent groups. To perform a Mann-Whitney U test, we call the function “wilcox.test” and specify “continuous variable ~ group variable” as the first parameter. Then we specify “paired=FALSE” because we have unpaired data: the ages of the people who survived and the ages of the people who did not survive. The p-value is bigger than 0.05, so we conclude that there is no significant difference in age between people who survived and people who did not. Try now with the fare of the ticket instead of the age!
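
As a rough illustration, the call might look like the sketch below. The two age vectors are simulated and hypothetical. The sketch passes two vectors rather than a formula, because some recent R versions appear to reject the `paired` argument when a formula is used.

```r
set.seed(3)
# Hypothetical ages for two independent groups (not the Titanic data)
age_died     <- rexp(40, rate = 1 / 30)
age_survived <- rexp(35, rate = 1 / 28)

# Unpaired comparison of the two groups: Mann-Whitney U test
mw <- wilcox.test(age_died, age_survived, paired = FALSE)
print(mw$p.value)
```

R reports this test as a "Wilcoxon rank sum" test, which is another name for the Mann-Whitney U test.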

  16. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. Non-parametric alternative to the One-Way ANOVA (see Session 2). In the same data set (titanicR), we will investigate whether there is a significant difference in age between the 3 classes on the Titanic.

  17. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. Let us first download the Ice Cream .csv file from the MASH website.

  18. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. Store the file in your current RStudio working directory. Attach the data, otherwise RStudio will not recognize the variable names! Research Question: Does the favourite ice cream of the participants influence their video score?

  19. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. It is not clearly specified in the CSV file, but each number in the variable “ice_cream” represents the favourite flavour of a participant. The coding is: - 1: Vanilla - 2: Chocolate - 3: Strawberry. If you want, you can create a factor variable that takes string values instead of numbers and add it to the data frame “icecreamR”. Video score is a score obtained after playing certain video games. Assumptions for the test: there is no real main assumption. You should make sure that your sample size is big enough, that your dependent variable is continuous or ordinal, and that your independent variable is categorical with 3 or more groups.
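
One possible way to build such a factor is sketched below. The small data frame is a hypothetical stand-in for “icecreamR”, and the column names `video` and `flavour` are assumptions.

```r
# Hypothetical stand-in for the icecreamR data frame
icecreamR <- data.frame(ice_cream = c(1, 2, 3, 2, 1, 3),
                        video     = c(12, 15, 22, 14, 11, 25))

# Map the codes 1, 2, 3 to their flavour names
icecreamR$flavour <- factor(icecreamR$ice_cream,
                            levels = c(1, 2, 3),
                            labels = c("Vanilla", "Chocolate", "Strawberry"))

print(table(icecreamR$flavour))
```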

  20. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. Looks skewed on the right. This histogram looks skewed on the left.

  21. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. The points seem to draw an “S” shape around the line. This is a sign of a skewed distribution.

  22. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. The samples are not very large, so we can use a Shapiro-Wilk test. One sample seems to be skewed to the right. The other 2 samples appear normal. In that case, we should favour the non-parametric test: Kruskal-Wallis and not One-Way ANOVA.

  23. Kruskal-Wallis: Comparing the measurement between 3 or more independent groups. The p-value of the Kruskal-Wallis test strongly rejects the null hypothesis of no difference between the 3 types of ice cream.
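
A minimal sketch of running the test with `kruskal.test` is given here. The data is simulated, and the data frame `scores` with columns `video` and `flavour` is hypothetical.

```r
set.seed(4)
# Hypothetical video scores for 3 independent groups of 20 participants
scores <- data.frame(
  flavour = factor(rep(c("Vanilla", "Chocolate", "Strawberry"), each = 20)),
  video   = c(rexp(20, 1 / 10), rexp(20, 1 / 10), rexp(20, 1 / 25))
)

# Kruskal-Wallis test: dependent variable ~ grouping variable
kw <- kruskal.test(video ~ flavour, data = scores)
print(kw)
```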

  24. Kruskal-Wallis: Pairwise comparisons. If we detect a difference between the groups, we can run pairwise comparisons (post-hoc tests) for Kruskal-Wallis. The multiple comparisons reveal a statistically significant difference in video game score for the pair Strawberry-Chocolate and the pair Strawberry-Vanilla.
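
The transcript does not show which function the slide uses for the pairwise comparisons. One base-R option is `pairwise.wilcox.test`, sketched here on simulated data with a Bonferroni correction; both the data and the choice of correction are assumptions.

```r
set.seed(4)
# Hypothetical video scores for 3 independent groups of 20 participants
flavour <- factor(rep(c("Vanilla", "Chocolate", "Strawberry"), each = 20))
video   <- c(rexp(20, 1 / 10), rexp(20, 1 / 10), rexp(20, 1 / 25))

# Pairwise Mann-Whitney tests with adjusted p-values
pw <- pairwise.wilcox.test(video, flavour, p.adjust.method = "bonferroni")
print(pw$p.value)
```

Each cell of the printed matrix is the adjusted p-value for one pair of groups.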

  25. Kruskal-Wallis: Pairwise comparisons. We can draw boxplots to get a better insight into the video game score for the different favourite ice cream flavours.

  26. Kruskal-Wallis: Pairwise comparisons. Strawberry is the group that differs the most from the 2 other groups.

  27. Wilcoxon Signed-rank test. This test is the non-parametric version of the paired t-test. The main assumption is to have 2 paired measurements, i.e. a score measured on the same group of people at 2 different times. On the MASH website, download the video data set for R. As usual, make sure you copy it into RStudio's current working directory.

  28. Wilcoxon Signed-rank test. Usually, you would conduct a Wilcoxon signed-rank test if the difference between the 2 measurements looks skewed. If not, you would choose the paired t-test. If Difference = Measurement (time 2) - Measurement (time 1) is not normally distributed: Wilcoxon signed-rank test. Assumption: the difference between the measurements at the 2 time points is skewed. No other assumption.

  29. Wilcoxon Signed-rank test. Research Question: Do members of the public prefer the demonstration technique to the old video C? We have the score of the demonstration technique video (TotalDDEMO) and the score of the old video C (TotalCOld). These two scores are measured on the same group of people but at different times, so we have 2 paired measurements.

  30. Wilcoxon Signed-rank test. The distribution is clearly skewed! We will run a Wilcoxon signed-rank test!

  31. Wilcoxon Signed-rank test. The p-value is equal to 0.0000933, which means that there is very strong evidence against the null hypothesis. The null hypothesis is that there is no difference between the score of video C and the score of video D, so we reject it. There is a statistically significant difference between the 2 scores.
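
A sketch of the paired workflow follows: look at the distribution of the differences, then run the signed-rank test. The two score vectors reuse the variable names from the session but hold simulated values, not the MASH video data.

```r
set.seed(5)
# Hypothetical paired scores for the same 30 people
TotalCOld  <- rnorm(30, mean = 12, sd = 3)
TotalDDEMO <- TotalCOld + rexp(30, rate = 1 / 3) - 1

# Inspect the differences between the two measurements
difference <- TotalDDEMO - TotalCOld
hist(difference, main = "Difference between video D and video C")

# Paired comparison: Wilcoxon signed-rank test
ws <- wilcox.test(TotalDDEMO, TotalCOld, paired = TRUE)
print(ws$p.value)
```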

  32. Friedman test: comparing 3 or more paired measurements when the measurement is skewed. In the same video data set, we will investigate the following research question: Which video score is the best between video A, video B, video C and video D? The names of the variables in the data set are: - TotalAGen for video A score. - TotalBdoc for video B score. - TotalCOld for video C score. - TotalDDEMO for video D score. Assumption: The distribution of some or all of these scores should be skewed.

  33. Friedman test: comparing 3 or more paired measurements when the measurement is skewed.

  34. Friedman test: comparing 3 or more paired measurements when the measurement is skewed. The Friedman test function in R requires a matrix as its parameter. A matrix is simply a table whose columns represent, respectively, the Video A score, Video B score, Video C score and Video D score. The p-value reads 0.000000005452 and therefore strongly suggests rejecting the null hypothesis that there is no difference between the video scores. Therefore, there is a very strong statistically significant difference between the video scores.
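
Building the matrix and calling `friedman.test` could look like the sketch below. The column names come from the session, but the values are simulated and the data frame name `video` is an assumption.

```r
set.seed(6)
n <- 20
# Hypothetical scores: one row per participant, one column per video
video <- data.frame(
  TotalAGen  = rexp(n, rate = 1 / 10),
  TotalBdoc  = rexp(n, rate = 1 / 12),
  TotalCOld  = rexp(n, rate = 1 / 8),
  TotalDDEMO = rexp(n, rate = 1 / 20)
)

# friedman.test expects a matrix: rows are participants, columns are conditions
score_matrix <- as.matrix(video[, c("TotalAGen", "TotalBdoc", "TotalCOld", "TotalDDEMO")])
fr <- friedman.test(score_matrix)
print(fr)
```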

  35. Friedman test: which pair is the most different? Post-hoc test. The package PMCMR needs to be installed.

  36. Chi-square test: Testing if 2 categorical variables are associated. The Chi-square test studies the association between 2 categorical variables. E.g. we want to investigate whether survival and class of passengers on the Titanic are significantly associated. If so, we can conclude that your chance of survival depends on your class. The null hypothesis is that “There is no association”. As usual, if the p-value of the test is less than 0.05, you reject the null hypothesis and conclude that there is a statistically significant association. The Chi-square test is considered a non-parametric test!

  37. Chi-square test: Testing if 2 categorical variables are associated. In the Titanic data set. Research Question: Is survival on the Titanic significantly related to whether you are a man or a woman? • We will test if there is a significant association between 2 categorical variables: • Survived (0: Died, 1: Survived) • Gender (0: Male, 1: Female)

  38. Chi-square test: Testing if 2 categorical variables are associated. Pronounced “Ky” square, as in “Sky news”, because it comes from the Greek letter “χ”. Let us have a closer look at the data. Before running the Chi-square test, we always create a table of descriptive statistics. Don’t forget to attach the data! Otherwise RStudio will not recognize “survived” and “Gender”. Create the cross table: the table counts the number of people per category. It gives you e.g. the number of people who have gender “0” (Male) and who did not survive (survived “0”). That number is 682.
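
A cross table of this kind can be produced with `table`. The sketch uses a tiny made-up data frame named `titanic` (an assumption) and refers to the columns with `$` instead of attaching the data.

```r
# Tiny hypothetical stand-in for the Titanic data (0/1 codes as in the session)
titanic <- data.frame(survived = c(0, 0, 1, 1, 0, 1, 0, 0),
                      Gender   = c(0, 0, 1, 1, 0, 1, 1, 0))

# Rows: survived (0 = Died, 1 = Survived); columns: Gender (0 = Male, 1 = Female)
tbl <- table(titanic$survived, titanic$Gender)
print(tbl)
```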

  39. Chi-square test: Testing if 2 categorical variables are associated. You can make it prettier by adding names.

  40. How to show the names? The first color “blue” will match the first legend value “Died”. You have to tell R! Voilà!

  41. You need to create another variable, alongside Gender, that contains the names “Male” and “Female” instead of the numbers “0” and “1”. You will then create a new cross table “tbl2”, which will be the first parameter of the barplot function. OK!

  42. In fact, we just created another variable, newgender, that contains the English words rather than the 0 and 1. These new values line up with the rows of the Titanic data set and could be inserted into the Titanic data frame as an extra column.
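
The steps from the last few slides might be sketched as follows. The vectors are made up, the second bar color “red” is an assumption, and `ifelse` is only one of several ways to build `newgender`.

```r
# Hypothetical 0/1 codes (0 = Died / Male, 1 = Survived / Female)
survived <- c(0, 0, 1, 1, 0, 1, 0, 0)
Gender   <- c(0, 0, 1, 1, 0, 1, 1, 0)

# New variable holding the words instead of the numbers
newgender <- ifelse(Gender == 0, "Male", "Female")

# New cross table, used as the first parameter of barplot
tbl2 <- table(survived, newgender)

# Colors and legend entries follow the row order of tbl2: "0" (Died) first
barplot(tbl2, beside = TRUE,
        col = c("blue", "red"),
        legend.text = c("Died", "Survived"))
```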

  43. Chi-square test: Testing if 2 categorical variables are associated. Assumption of the Chi-square test: the expected count in each cross category should be more than 5 people. However, this assumption can be relaxed if no more than 20% of the cross categories have an expected count of less than 5.

  44. Chi-square test: Testing if 2 categorical variables are associated. To check this assumption, you need to call a function from a new package called gmodels. You should know how to install this package in RStudio (from the first workshop). The first two parameters are the two categorical variables. At the end of the function call, you specify that you want the expected values.

  45. Chi-square test: Testing if 2 categorical variables are associated. The expected count of each cross category is shown as the second value in each cell. All those expected values are more than 5, so the assumption of the Chi-square test is satisfied!
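
The slides use the gmodels package for this check. As an alternative that needs no extra package, base R also exposes the expected counts through `chisq.test`. The sketch below uses made-up counts.

```r
# Hypothetical 2x2 table of counts (rows: survived, columns: Gender)
tbl <- matrix(c(30, 10, 12, 28), nrow = 2,
              dimnames = list(survived = c("0", "1"), Gender = c("0", "1")))

# Expected counts under the null hypothesis of no association
expected <- chisq.test(tbl)$expected
print(expected)

# Share of cells with an expected count below 5 (should not exceed 20%)
print(mean(expected < 5))
```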

  46. Chi-square test: Testing if 2 categorical variables are associated. From this very, very small p-value, we can very strongly reject the null hypothesis and conclude that there is a statistically significant association between Gender and Survival. (Which goes hand in hand with the film!) Note that the p-value is not “less than 2.2”: 2.2e-16 means 0.00000000000000022!
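
For illustration, the test itself can be run with `chisq.test` on a cross table. The counts below are made up, and the last line simply prints the scientific notation from the slide as a plain decimal.

```r
# Hypothetical 2x2 table of counts (rows: survived, columns: Gender)
tbl <- matrix(c(30, 10, 12, 28), nrow = 2,
              dimnames = list(survived = c("0", "1"), Gender = c("0", "1")))

# Chi-square test of association
chi <- chisq.test(tbl)
print(chi$p.value)

# Scientific notation written out in full
print(format(2.2e-16, scientific = FALSE))
```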

  47. Chi-square test: Testing if 2 categorical variables are associated. If the assumption on the expected count per cross category is violated, in the case of a 2x2 table, it is possible to run another test called Fisher’s exact test. However, if you have another kind of cross table (e.g. 3x2), you will have to analyse the descriptive statistics.
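
A minimal sketch of Fisher's exact test on a small 2x2 table follows, using `fisher.test` from base R. The counts are made up and chosen so that the expected counts are low.

```r
# Hypothetical 2x2 table with small counts
small_tbl <- matrix(c(3, 1, 1, 4), nrow = 2,
                    dimnames = list(survived = c("0", "1"), Gender = c("0", "1")))

# Fisher's exact test of association
fi <- fisher.test(small_tbl)
print(fi$p.value)
```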

  48. McNemar test: McNemar’s test can be seen as a paired Chi-square test. (There is no “y” in “McNemar”.) In a McNemar’s test, we are interested in how a binary response variable changes across repeated measures. Research Question: We have 50 participants, consisting of 25 smokers and 25 non-smokers at time 1. All participants watch an emotive video showing the impact of smoking-related cancers. 2 weeks after the video intervention (time 2), the same participants are asked whether they are smokers or non-smokers. Do we have fewer smokers at time 2, 2 weeks after the video showing?
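
To make the setup concrete, here is a sketch using `mcnemar.test` from base R. Only the 25/25 split at time 1 comes from the slide. The time 2 answers are invented for illustration and are not the results of the study.

```r
# Time 1: 25 smokers and 25 non-smokers (from the research question)
before <- c(rep("Smoker", 25), rep("Non-smoker", 25))

# Time 2: invented answers, in the same participant order as `before`
after <- c(rep("Smoker", 15), rep("Non-smoker", 10),   # the 25 initial smokers
           rep("Non-smoker", 23), rep("Smoker", 2))    # the 25 initial non-smokers

# Paired cross table: rows are time 1, columns are time 2
tbl <- table(Before = before, After = after)
print(tbl)

# McNemar's test only uses the participants who changed category
mc <- mcnemar.test(tbl)
print(mc$p.value)
```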

  49. McNemar test: From the MASH website, download the smoker .csv file for R and paste it into RStudio's current working directory. Load the data in RStudio. You can either keep the existing column name or change it to a better name, for example “Before”.

  50. McNemar test: Changing column name. The function “colnames” takes a data frame as argument and returns a vector of strings, one for each column name of the data frame. We want to modify the first column name (column #1): we simply assign another name to the first element, without forgetting the inverted commas. Voilà!
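
Renaming the first column can be sketched as below. The data frame `smoker` and its original column names `X1` and `X2` are hypothetical, since the transcript does not show the real ones.

```r
# Hypothetical data frame with uninformative column names
smoker <- data.frame(X1 = c(1, 0, 1), X2 = c(0, 0, 1))
print(colnames(smoker))

# Replace the name of column #1 (note the inverted commas around the new name)
colnames(smoker)[1] <- "Before"
print(colnames(smoker))
```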
