570 likes | 726 Views
R for psychologists. Professor Thom Baguley, Psychology, Nottingham Trent University Thomas.Baguley@ntu.ac.uk. 0. Overview. 1. What is R ? 2. Why use R ? 3. R basics 4. R graphics 5. Linear models in R 6. ANOVA and ANCOVA 6. Writing your own R functions
E N D
R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University Thomas.Baguley@ntu.ac.uk
0. Overview 1. What is R? 2. Why use R? 3. R basics 4. R graphics 5. Linear models in R 6. ANOVA and ANCOVA 6. Writing your own R functions 8. Simulation and bootstrapping
1. What is R? R is a software environment for statistical analysis “the lingua franca of computational statistics” De Leeuw & Mair (2007)
2. Why use R? It is: • free • open source • cross-platform (Mac, Linux, PC) It has: • excellent basic statistics functions • powerful and versatile graphics • hundreds of user-contributed packages • a large community of users
3. R basics Some common R objects: • characters (text strings e.g., 'a' or ’1’) • numbers (numbers like 2 or 1e+3) • vectors (1D set of numbers or other elements) • data frames (like vectors organized in columns) • matrices (r by c array of numbers) • lists (objects that contain other objects)
Assignment Input: > vector1 <- 6 > vector1 Output: [1] 6
Arithmetic > 6 + 6 * 2 [1] 18 > vector1^2 [1] 36
Calling functions > c(1, 9, 25) > log(vector1) [1] 1 9 25 [1] 1.791759 > rnorm(1, 100, 15) ?rnorm [1] 123.5336 help(rnorm)
Loading data > dat1 <- read.csv('pride.csv') > library(foreign) > h05 <- read.spss('Hayden_2005.sav', to.data.frame = TRUE)
Addressing the contents of an object > vector1[1] [1] 6 > days <- h05$days > days[1:9] [1] 2864 1460 2921 2921 2921 1460 2921 1460 31
Addressing the contents of an object > vector1[1] [1] 6 > days <- h05$days > days[1:9] [1] 2864 1460 2921 2921 2921 1460 2921 1460 31 Or combine the steps withh05$days[1:9]
Data frames > is.data.frame(h05) [1] TRUE > dim(h05) > names(h05) [1] 43 2 [1] "names" "days” • this data frame has 43 rows and 2 named columns • it can be helpful to think of a data frame as a set of named variables (vectors) bound into columns
Matrix objects (matrices) > cells <- c(3677,56,3924,31) > cat.names <- list(c("Before", "After"), c("Alive", "Dead")) > checklist <- matrix(cells, 2, 2, byrow=TRUE, dimnames= cat.names) > checklist Alive Dead Before 3677 56 After 3924 31
The power of objects > chisq.test(checklist) Pearson's Chi-squared test with Yates' continuity correction data: checklist X-squared = 8.1786, df = 1, p-value = 0.004239 > chisq.test(c(2, 18)) Chi-squared test for given probabilities data: c(2, 18) X-squared = 12.8, df = 1, p-value = 0.0003466
Calling functions: defaults and named arguments > chisq.test(checklist, correct=FALSE) Pearson's Chi-squared test data: checklist X-squared = 8.8072, df = 1, p-value = 0.003000 • the default argument (for matrix input is)correct=TRUE • setting correct=FALSE over-rides the default • because unnamed arguments to functions are matched in order, this argument must be named (otherwise naming the arguments is optional)
Exercise 1 (R Basics) • entering data • working with objects • simple statistical functions
4. R graphics > boxplot(h05$days) n = 43
> hist(h05$days) > plot(density(h05$days))
Distribution functions e.g., Normal distribution dnorm(x, mean = 0, sd = 1) pnorm(q, mean = 0, sd = 1, lower.tail = TRUE) qnorm(p, mean = 0, sd = 1, lower.tail = TRUE) rnorm(n, mean = 0, sd = 1)> rdat <- rnorm(30, 100, 15)
> curve(dnorm(x), xlim=c(-4,4), col='purple') > curve(dt(x, 1), col='red', add=TRUE, lty=3)
Exercise 2 (R Graphics) • exploratory plots • plot parameters • plotting distribution functions • plotting a serial position curve (optional)
5. Linear models in R > expenses <- read.csv('expenses.csv') > t.test(majority ~ problem, data = expenses) Welch Two Sample t-test data: majority by problem t = -3.7477, df = 505.923, p-value = 0.000199 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -2044.457 -638.160 sample estimates: mean in group 0 mean in group 1 7092.712 8434.021 Data courtesy of Mark Thompson http://markreckons.blogspot.com/
> expenses <- read.csv('expenses.csv') > lm(majority ~ problem, data = expenses) Call: lm(formula = majority ~ problem, data = expenses) Coefficients: (Intercept) problem 7093 1341
> lm.out <- lm(majority ~ problem, data = expenses) > max(rstandard(lm.out)) [1] 2.629733 > AIC(lm.out) [1] 12674.85 > lm.out$coefficients (Intercept) problem 7092.712 1341.308> ?lm
> summary(lm.out) Call: lm(formula = majority ~ problem, data = expenses) Residuals: Min 1Q Median 3Q Max -8397.0 -3403.5 -260.9 3118.8 11543.3 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7092.7 218.9 32.397 < 2e-16 *** problem 1341.3 357.0 3.758 0.000187 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4395 on 644 degrees of freedom Multiple R-squared: 0.02145, Adjusted R-squared: 0.01993 F-statistic: 14.12 on 1 and 644 DF, p-value: 0.0001871
> glm(problem~I(majority/10000), family='binomial', data = expenses) Call: glm(formula = problem ~ I(majority/10000), family = "binomial", data = expenses) Coefficients: (Intercept) I(majority/10000) -1.0394 0.6878 Degrees of Freedom: 645 Total (i.e. Null); 644 Residual Null Deviance: 855.5 Residual Deviance: 841.6 AIC: 845.6
> lrm.out <- glm(problem ~ I(majority/10000), family='binomial', data = expenses) > plot(election$majority, lrm.out$fitted, col='dark green')
Exercise 3 (Linear models) - using a formula in a call to a model function • linear models • plotting a regression line • generalized linear models (optional) • plotting confidence bands (optional)
6. ANOVA > factor1 <- gl(10,4)> factor 1[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4Levels: 1 2 3 4> library(foreign)> diag.data <- read.spss("diagram.sav", to.data.frame = T)> diag.fit <- lm(descript ~ group, data=diag.data)
Regression model (dummy coding) > diag.fit Call: lm(formula = descript ~ group, data = diag.data) Coefficients: (Intercept) groupPicture groupFull Diagram groupSegmented 17.8 0.5 4.6 9.5
aov() > aov(diag.fit) Call: aov(formula = diag.fit) Terms: group Residuals Sum of Squares 583.7 2440.2 Deg. of Freedom 3 36
anova() > summary(aov(diag.fit)) > anova(diag.fit) Analysis of Variance Table Response: descript Df Sum Sq Mean Sq F value Pr(>F) group 3 583.7 194.567 2.8704 0.04977 * Residuals 36 2440.2 67.783 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Factorial ANOVA and ANCOVA > lm(DV ~ factor1 + factor2 + factor1:factor2, data=data2) > lm(DV ~ factor1 * factor2, data=data2) > lm(descript ~ group + time, diag.data) > lm(descript ~ 0 + group + scale(time, scale=F), diag.data) > diag.ancov <- lm(descript ~ group + scale(time, scale=F), diag.data)
Type I and Type II Sums of squares > drop1(diag.ancov, test = 'F') Single term deletions Model: descript ~ group + scale(time, scale = F) Df Sum of Sq RSS AIC F value Pr(F) <none> 2154.4 169.46 group 3 821.45 2975.9 176.38 4.4484 0.009478 ** scale(time, scale = F) 1 285.78 2440.2 172.44 4.6427 0.038149 * - sequential sums of squares (Type I) tests are given by default • hierarchical sums of squares (Type II) tests are given by the drop1() function [Software such as SAS and SPSS defaults to unique (Type III) SS]
Repeated measures (within-subjects) factors • a weakness in the base R installation, but easily done using packages such as ez, nlme or lme4 (the latter two able to fit a wider range of repeated measures models) > pride.long <- read.csv("pride_long.csv") > pride.mixed <- aov(accuracy ~ emotion*condition + Error(factor(participant)/emotion), pride.long) > summary(pride.mixed) [Software such as SAS and SPSS defaults to unique (Type III) SS]
Multiple comparisons and contrasts • you can run contrasts by changing default contrasts in R • Fisher LSD, Tukey’s HSD and p adjustment (e.g., Hochberg, Holm, Bonferroni, FDR) in R base installation • powerful modified Bonferroni (e.g., Shaffer, Westfall) and general linear contrasts available in multcomp package • flexible contrast and estimable functions in gmodels package
Exercise 4 (ANOVA) - factors • aov() and anova() • ANCOVA • repeated measures (within-subjects) • contrasts and multiple comparisons (optional)
7. Writing your own functions sd.pool <- function(sd.1, n.1, sd.2, n.2 = n.1){ num <- (n.1-1)*sd.1^2+(n.2-1)*sd.2^2 denom <- n.1+n.2-2 output <- sqrt(num/denom) output } > sd.pool(6.1, 20, 5.3, 25) [1] 5.66743
Exercise 5 (Write a function) • pick a simple statistical procedure • write a function to automate it e.g., pooled standard deviation
8. Simulation and bootstrapping • distribution functions > rnorm(100, 10, 1) • R boot package • sample() and replicate()
The bootstrap (and other Monte Carlo methods) Bootstrapping involves repeatedly resampling with replacement from a data set e.g., Data set = {0,1} 7 simulated samples: {1,0}, {0,1}, {0,0}, {0,0}, {0,1}, {0,0}, {1,1} Bootstrapping effectively treats the sample distribution as the population distribution > replicate(7, sample(x,2,replace=TRUE))
Bootstrapping using R > library(boot) > x1.boot <- boot(samp,function(x,i) median(x[i]),R=10000) > boot.ci(x1.boot) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 10000 bootstrap replicates CALL : boot.ci(boot.out = x1.boot) Intervals : Level Normal Basic 95% (-0.1747, 0.6948 ) (-0.1648, 0.6614 ) Level Percentile BCa 95% (-0.1538, 0.6724 ) (-0.2464, 0.6620 ) Calculations and Intervals on Original Scale Normal 95% CI for mean (using t) [-2.00, 1.33]
Exercise 6 (Bootstrapping) • resample data • construct a simple percentile bootstrap • run a BCa bootstrap using the bootpackage (optional)
9. Advanced statistical modelling in R Multilevel modeling nlme package lme4 package MCMCglmm package Meta-analysis meta package metafor package … and many, many more other packages
metaforpackage > m.e <- c(10.7, 16.24, 10.03, 3.65, 5.73) > n.e <- c(31, 57, 9, 17, 10) > m.c <- c(2.87, 6.83, 7.18, 4.65, 7.47) > n.c <- c(14, 52, 9, 18, 10) > sd.pooled <- c(7.88, 9.84, 8.67, 6.34, 5.74) > diff <- m.c - m.e > se.diffs <- sd.pooled * sqrt(1/n.e + 1/n.c)
> install.packages('metafor') > library(metafor) > meta.out <- rma(yi=diff, sei=se.diffs, method = 'FE') > meta.out Fixed-Effects Model (k = 5) Test for Heterogeneity: Q(df = 4) = 21.0062, p-val = 0.0003 Model Results: estimate se zval pval ci.lb ci.ub -4.1003 1.0750 -3.8141 0.0001 -6.2073 -1.9933 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1