540 likes | 756 Views
An Introduction to R! Roni Kobrosly Epidemiology Nuts and Bolts Talk November 2009. My goal is to. Explain the pros and cons of R Introduce you to some basic syntax Live demo of basic epidemiologic statistical analyses
E N D
An Introduction to R! Roni Kobrosly Epidemiology Nuts and Bolts Talk November 2009
My goal is to... • Explain the pros and cons of R • Introduce you to some basic syntax • Live demo of basic epidemiologic statistical analyses • Basic descriptive analysis and hypothesis testing, linear regression, logistic regression, survival analysis • Show you some helpful R resources!
Things you should have in front of you • Data dictionaries for STAR and NCCTG Lung Cancer datasets • A R reference sheet, by Tom Short • A copy of this presentation
The pros of using R! • It is 100% FREE • Huge community of users, which means lots of help and support. • Available on virtually all platforms: OS X, Unix/Linux, and Windows XP, Vista, and 7! • Documentation is available in many languages • Publication quality graphics with default options
The pros of using R! cont. • R library database has virtually every type of statistical analysis that has ever been conceived. The most cutting edge biostatistical ideas are always implemented in R first. • R is very flexible. All models and graphics can be completely customized and modified for the analysis you’re working on.
The cons of R ... • The spreadsheet-style data manipulation is poor for editing • Steep learning curve • The programming is largely done by statisticians not computer scientists … so can be slow during computationally intensive tasks
Throughout the presentation... • My narrative bullets will use “this” font style • R code and output will be shown with this font style: women <- c(5,7,3,13,5) men <- c(3,8,8,2,3) t.test(women,men,paired=F)
R terminology • Vector: One dimensional data object (numeric, character, or logical) • Matrix: Two dimensional data object with elements of all the same data type (numeric, character, logical) • Array: Like a matrix, but can have more than two dimensions • Dataframe: This is a dataset as we know it. The first line contains variable names, and columns can have different data types. All columns must be of same length
R terminology cont. • Package/Library: downloadable set of software that can contain data and functions. All basic R functions are contained in the core R software. If you want to do something fancy, will you need to install packages. • Factor: A nominal variable • Ordered factor: An ordinal variable
Access R help files • If you ever want to learn about what a specific function does or its syntax, using the “help()” function • It will also provide a data dictionary for datasets • If you end up trying R, you will be using this function often! help(glm) help(STAR)
R syntax • This table shows the most common operators. • R is object oriented, so you will use the assignment operator and list indexing alot.
Using the R workspace • You can type individual commands in the console • You can also create an R script and save lines of code, like you have done with SAS. • Press CTRL+R to run highlighted script lines • Press CTRL+L to clear the console of output • Press the up arrow to retrieve the last command you just typed into the console
Writing commands in R... • Variables, datasets, and output can be saved as objects with complex names like: Roni.Data.Final.02_02_2009 • Unlike SAS, R is case-sensitive. So the variables... HAPPYvar Happyvar andhappyvar … are recognized as different objects
Writing commands in R... • Spaces don't matter in R: weight/(height^2) is interpreted the same as... weight / (height ^ 2) • Comments in R code can be made with “##” ## This is cumulative incidence (incid / total.at.risk) → c.incid
Writing commands in R... • Use a semi-colon to have R run a batch of commands in the console: factorial(4); factorial(3); factorial(2); factorial(1) This code will display the factorials of 4,3,2, and 1
Basic command line operations • Simple arithmetic: ((2 + 4)^2)/3 • Save this result to variable “a” ((2 + 4)^2)/3 -> a • Creating list of consecutive numbers (1-15) 1:15 -> nums • Create a vector c(2,4,6,7) -> num.list or c('a','b','c') -> char.list
Basic command line operations • Accessing the third number in the vector num.list[3] • Logic: • Is element 3 of the num.list vector > 10? num.list[3] > 10 • Are any of the elements of the num.list vector = 0? num.list[1] == 0 | num.list[2] == 0 | num.list[3] == 0 | num.list[4] == 0
Basic command line operations • Create a 3x3 matrix matrix( c(2,4,5,6,7,8,9,10,11), nrow=3, byrow=T) -> my.matrix • Access the element in the 3rd row and 1st column my.matrix[3,1] • Change the element in the 3rd row and 1st column to the value '300' 300 -> my.matrix[3,1]
Basic command line operations • See what variables are saved ls() • Remove the “nums” variable rm(nums) • Remove all saved variables rm(list=ls())
Working with vectors • Lets create two numeric data objects of the same length. This is data from four, male subjects: weight <- c(175, 232, 145, 193) height <- c(72, 75, 70, 64) • Simply type these variables names into the console, and R will show you the contents of these vectors. • We will create BMI values for these subjects... but there is a problem, they are in units of pounds and inches...
Working with vectors • To convert vectors, just multiply their assigned names by a constant. Lets save these results to new variables: weight*0.454 -> kg.weight height*0.0254 -> m.height • We can now create a vector of BMI values: kg.weight / (m.height^2) -> BMI • When you are operating with vectors like this, the vectors must be of equal length!
Working with vectors • Finally, lets combine metric weight, height, and BMI vectors into one matrix (so we can see them side by side). matrix(c(kg.weight,m.height,BMI), ncol=3) -> final
Misc. useful functions • Pick 5 random integers between 1 and 100 sample(1:100, 5, replace=F) -> lottery • Generate 100 random numbers from normal distribution rnorm(n=100, mean=70, sd=6) • Find left-sided area under curve of Z distribution pnorm(110, mean=100, sd=10) • Find value for 95%, left-sided area under Z distribution qnorm(0.95, mean=100, sd=10)
Misc. useful functions • Find left-sided area under curve of t distribution pt(2.015, df=5) • Find value for 95%, left-sided area under t distribution qt(0.95, df=5) • Find left-sided area under curve of chi-square distribution pchisq(3.84,df=1) • Find value for 95%, left-sided area under chi-square distribution qchisq(0.95,df=1)
Let's install packages • The “epicalc” package is short for “Epidemiological Calculator.” • This package is a hodge-podge of epidemiological analysis tools, utilities, and datasets to test these on. • The “epiR” package is also helpful • You will need an internet connection to get any package! You only need to download a package once!
Let's install a package • It is easy to do from the console: • Type: install.packages() • A window titled, “CRAN mirror” will appear, giving you a choice of servers to download the package from. Just choose anything from the U.S. (it will download faster). • Then, you can choose what package to download. R will automatically download any pre-requisite packages too.
The “epicalc” package • Once the installation is complete, type this to activate the package: library(epicalc) • Let's use the “cci” function to calculate an odds ratio and 95% CI • Type: help(cci) to understand the syntax • An example: cci(145,200,300,280, graph=F)
The “epiR” package • The “epi.studysize” function calculates power or minimal sample size needed for various epidemiologic study types. • This will give you power to detect an approximate odds ratio of 2.0 using a two-sided 0.05 test when 188 cases and 940 controls are available, assuming a 0.30 prevalence of smoking in the male population epi.studysize(treat = 2/100, control = 1/100, n = 1128, power = NA, sigma = 0.30, r = 0.2, conf.level = 0.95, sided.test = 2, method = "case.control")
Other helpful functions • These two packages (epicalc and epiR) also provide functions relating to imputation, Mantel-Haenszel and CMH tests, tests for homogeneity across strata, matched case-control analysis, ROC curves, descriptive analysis, calculating adjusted rates, and much more!
Lets invoke a dataset • From the “epicalc” package, lets load a simple dataset with data on marriage • Type: data(Marryage) • If you type in Marryage, R will print out every observation of the dataset • For large datasets, type head(Marryage) to see the first six observations and variable names … or try tail(Marryage) to see the last six observations.
The “Marryage” dataset • To isolate a single variable for analysis or editting, use the “$” operator: Marryage$birthyr • To avoid having to constantly specify the dataset of variables, type this: attach(Marryage) • detach(Marryage), will reverse this option and will again have to specify the dataset
On reading and writing datasets not from a package ... • There are so many file formats that data can be stored in: .txt .dat .csv .ods .xls .sas7bdat .dta .sav The list goes on and on... • I would recommend using the .CSV file format to store and transfer your files. It will never become obsolete! • R can read in datafiles from SAS, SPSS, and Stata using the “foreign” package
How to read/import datasets • Let's say you've just finished cleaning a dataset in Excel, you're ready to analyze it, and it is saved to your desktop as a .CSV file. read.csv(“C:/Documents and Settings/Roni/Desktop/mydata.csv”, header=T) -> mydata • IMPORTANT: notice how specifiying the file location involves the use of forward-slashes (/), not back-slashes (\)!!
How to write/export datasets • Writing/exporting datasets from R to your desktop, for instance, is also very easy: write.csv(mydata, “C:/Documents and Settings/Roni/Desktop/mynewdata.csv”)
Don't like the way R looks? • If R doesn't look SAS-ish (SASy?) enough for you, try out these meta-packages that changes the appearance and capabilities of the R console and output: • Java GUI R (JGR): Very sleek setup that helps you complete input commands! With color-coded text! • Rcommander: Simple, yet elegant console
User-created functions • You can create your own functions in R! • These are much more versatile than macros in SAS • The R support community posts lots of homemade functions on the message boards.
The STAR dataframe • Tennessee’s Student Teacher Achievement Ratio (STAR) project was a large, four-year study of the effect of reduced class size • Lets take the Star dataset from the “mlmRev” package. • There are > 26000 observations! • Lets randomly choose 1000 observations to analyze.
Preparing the STAR data • Lets invoke the mlmRev package and the data: install.packages(), select the mlmRev package, then type: library(mlmRev); data(star) • See what variables STAR contains: head(star) • Lets removing all observations with missing data: na.omit(star) -> star2
Preparing the STAR data • We need to know how many observations there are: length(star2$id) • There are now 22815 observations, lets randomly choose 5000 of these: sample(1:22815, 5000, replace=F) -> xyz star2[xyz, ] -> star3
Descriptive analysis • Try this: summary(star3) • This will provide basic univariate analysis output for all continuous variables, and one-way frequency tables for categorical variables. • For a histogram of a continuous variable, try this: hist(star3$read, breaks=20) • For user-defined quantiles (deciles for ex.): quantile(star3$read, c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))
Preparing the STAR data • Other useful functions for univariate analysis • Standard Deviation: sd() • Variance: var() • Show a Q-Q Plot: qqnorm() • Show a Boxplot: boxplot() • Frequency Tables: table()
Preparing the STAR data • Let's dichotomize the “read” variable at the median, and create a dummy variable for “read” values above and below the median: median(star3$read) -> cutoff 1 -> star3$readdummy[star3$read <= cutoff] 0 -> star3$readdummy[star3$read > cutoff] • Lets attach this new dataset ... we are ready for analysis! attach(star3)
Some bivariate analysis • Basic correlation between reading and math scores: cor.test(read,math,method=”spearman”) • T-test to see if math scores differ between males and females: t.test(math~sx) • ANOVA to see if reading scores differ among the four school types: anova(lm(read~schtype)) • Chi-sq to see if ethnicity is associated with school type: chisq.test(eth,schtype)
Sub-setting your data • To examine subsets of data, you don't use a “where” statement like SAS. To see if math scores differ between males and females, only among non-White students: t.test(math[eth != 'W'] ~ sx[eth != 'W']) • What is the mean reading scores for 3rd grade, inner city students? mean(read[gr == '3' & schtype == 'inner'])
Linear Regression • Try simple linear regression. Use reading scores to predict math scores. Here is the syntax: lm(math ~ read, data=star3) -> SLRmodel summary(SLRmodel) • Let's make a scatterplot and include the regression line: plot(read,math,main=”Neat Scatterplot”) abline(SLRmodel, col="red")
Multiple Linear Regression • Let's try multiple linear regression in predicting math scores: lm(math ~ read + eth + sx + schtype, data=star3) -> MLRmodel summary(MLRmodel) • Notice how R deals with factor variables (like “eth,” “sx,” and “schtype”) • You can specifiy the reference group for factor variables with the group() function, or just use dummy variables
Comparing Models • R allows you to easily compare nested models through a likelihood ratio test. You need to obtain the “lmtest” package and use the “lrtest()” function. install.packages() library(lmtest) lrtest(SLRmodel, MLRmodel)
Logistic Regression • Let's use that dichtomous reading variable we made for a logistic regression model. You now have to use the glm() function: glm(readdummy ~ math + schtype, family = binomial(“logit”)) -> logmodel summary(logmodel) • Note: These are not odds ratios that are printed, they are log odds of having a low reading score! Take the anti-log to get the odds ratios: exp(logmodel$coeff)
Survival Curves and Cox Regression • R can generate beautiful Kaplan-Meier survival curves and can perform Cox proportional hazards regression (with or without time-dependent covariates) • You will need the “survival” package. It is part of the core R software, so you don't need to download it. Just type: library(survival) • Invoke the North Central Cancer Treatment Group (NCCTG) lung cancer dataset by typing: data(lung)
Survival Curves and Cox Regression • Type: head(lung) and examine the data dictionary you've got to get a sense of what's in this dataset. Attach the dataset, so we don't have to continuously type “lung$” before each variable: attach(lung) • This will print out the standard KM survival function output: summary(survfit(Surv(time,status==2) ~ 1 , data=lung))