An Introduction to R! Roni Kobrosly Epidemiology Nuts and Bolts Talk November 2009

An Introduction to R! Roni Kobrosly Epidemiology Nuts and Bolts Talk November 2009

My goal is to... • Explain the pros and cons of R • Introduce you to some basic syntax • Live demo of basic epidemiologic statistical analyses • Basic descriptive analysis and hypothesis testing, linear regression, logistic regression, survival analysis • Show you some helpful R resources!

Things you should have in front of you • Data dictionaries for STAR and NCCTG Lung Cancer datasets • A R reference sheet, by Tom Short • A copy of this presentation

The pros of using R! • It is 100% FREE • Huge community of users, which means lots of help and support. • Available on virtually all platforms: OS X, Unix/Linux, and Windows XP, Vista, and 7! • Documentation is available in many languages • Publication quality graphics with default options

The pros of using R! cont. • R library database has virtually every type of statistical analysis that has ever been conceived. The most cutting edge biostatistical ideas are always implemented in R first. • R is very flexible. All models and graphics can be completely customized and modified for the analysis you’re working on.

The cons of R ... • The spreadsheet-style data manipulation is poor for editing • Steep learning curve • The programming is largely done by statisticians not computer scientists … so can be slow during computationally intensive tasks

Throughout the presentation... • My narrative bullets will use “this” font style • R code and output will be shown with this font style: women <- c(5,7,3,13,5) men <- c(3,8,8,2,3) t.test(women,men,paired=F)

R terminology • Vector: One dimensional data object (numeric, character, or logical) • Matrix: Two dimensional data object with elements of all the same data type (numeric, character, logical) • Array: Like a matrix, but can have more than two dimensions • Dataframe: This is a dataset as we know it. The first line contains variable names, and columns can have different data types. All columns must be of same length

R terminology cont. • Package/Library: downloadable set of software that can contain data and functions. All basic R functions are contained in the core R software. If you want to do something fancy, will you need to install packages. • Factor: A nominal variable • Ordered factor: An ordinal variable

Access R help files • If you ever want to learn about what a specific function does or its syntax, using the “help()” function • It will also provide a data dictionary for datasets • If you end up trying R, you will be using this function often! help(glm) help(STAR)

R syntax • This table shows the most common operators. • R is object oriented, so you will use the assignment operator and list indexing alot.

Using the R workspace • You can type individual commands in the console • You can also create an R script and save lines of code, like you have done with SAS. • Press CTRL+R to run highlighted script lines • Press CTRL+L to clear the console of output • Press the up arrow to retrieve the last command you just typed into the console

Writing commands in R... • Variables, datasets, and output can be saved as objects with complex names like: Roni.Data.Final.02_02_2009 • Unlike SAS, R is case-sensitive. So the variables... HAPPYvar Happyvar andhappyvar … are recognized as different objects

Writing commands in R... • Spaces don't matter in R: weight/(height^2) is interpreted the same as... weight / (height ^ 2) • Comments in R code can be made with “##” ## This is cumulative incidence (incid / total.at.risk) → c.incid

Writing commands in R... • Use a semi-colon to have R run a batch of commands in the console: factorial(4); factorial(3); factorial(2); factorial(1) This code will display the factorials of 4,3,2, and 1

Basic command line operations • Simple arithmetic: ((2 + 4)^2)/3 • Save this result to variable “a” ((2 + 4)^2)/3 -> a • Creating list of consecutive numbers (1-15) 1:15 -> nums • Create a vector c(2,4,6,7) -> num.list or c('a','b','c') -> char.list

Basic command line operations • Accessing the third number in the vector num.list[3] • Logic: • Is element 3 of the num.list vector > 10? num.list[3] > 10 • Are any of the elements of the num.list vector = 0? num.list[1] == 0 | num.list[2] == 0 | num.list[3] == 0 | num.list[4] == 0

Basic command line operations • Create a 3x3 matrix matrix( c(2,4,5,6,7,8,9,10,11), nrow=3, byrow=T) -> my.matrix • Access the element in the 3rd row and 1st column my.matrix[3,1] • Change the element in the 3rd row and 1st column to the value '300' 300 -> my.matrix[3,1]

Basic command line operations • See what variables are saved ls() • Remove the “nums” variable rm(nums) • Remove all saved variables rm(list=ls())

Working with vectors • Lets create two numeric data objects of the same length. This is data from four, male subjects: weight <- c(175, 232, 145, 193) height <- c(72, 75, 70, 64) • Simply type these variables names into the console, and R will show you the contents of these vectors. • We will create BMI values for these subjects... but there is a problem, they are in units of pounds and inches...

Working with vectors • To convert vectors, just multiply their assigned names by a constant. Lets save these results to new variables: weight*0.454 -> kg.weight height*0.0254 -> m.height • We can now create a vector of BMI values: kg.weight / (m.height^2) -> BMI • When you are operating with vectors like this, the vectors must be of equal length!

Working with vectors • Finally, lets combine metric weight, height, and BMI vectors into one matrix (so we can see them side by side). matrix(c(kg.weight,m.height,BMI), ncol=3) -> final

Misc. useful functions • Pick 5 random integers between 1 and 100 sample(1:100, 5, replace=F) -> lottery • Generate 100 random numbers from normal distribution rnorm(n=100, mean=70, sd=6) • Find left-sided area under curve of Z distribution pnorm(110, mean=100, sd=10) • Find value for 95%, left-sided area under Z distribution qnorm(0.95, mean=100, sd=10)

Misc. useful functions • Find left-sided area under curve of t distribution pt(2.015, df=5) • Find value for 95%, left-sided area under t distribution qt(0.95, df=5) • Find left-sided area under curve of chi-square distribution pchisq(3.84,df=1) • Find value for 95%, left-sided area under chi-square distribution qchisq(0.95,df=1)

Let's install packages • The “epicalc” package is short for “Epidemiological Calculator.” • This package is a hodge-podge of epidemiological analysis tools, utilities, and datasets to test these on. • The “epiR” package is also helpful • You will need an internet connection to get any package! You only need to download a package once!

Let's install a package • It is easy to do from the console: • Type: install.packages() • A window titled, “CRAN mirror” will appear, giving you a choice of servers to download the package from. Just choose anything from the U.S. (it will download faster). • Then, you can choose what package to download. R will automatically download any pre-requisite packages too.

The “epicalc” package • Once the installation is complete, type this to activate the package: library(epicalc) • Let's use the “cci” function to calculate an odds ratio and 95% CI • Type: help(cci) to understand the syntax • An example: cci(145,200,300,280, graph=F)

The “epiR” package • The “epi.studysize” function calculates power or minimal sample size needed for various epidemiologic study types. • This will give you power to detect an approximate odds ratio of 2.0 using a two-sided 0.05 test when 188 cases and 940 controls are available, assuming a 0.30 prevalence of smoking in the male population epi.studysize(treat = 2/100, control = 1/100, n = 1128, power = NA, sigma = 0.30, r = 0.2, conf.level = 0.95, sided.test = 2, method = "case.control")

Other helpful functions • These two packages (epicalc and epiR) also provide functions relating to imputation, Mantel-Haenszel and CMH tests, tests for homogeneity across strata, matched case-control analysis, ROC curves, descriptive analysis, calculating adjusted rates, and much more!

Lets invoke a dataset • From the “epicalc” package, lets load a simple dataset with data on marriage • Type: data(Marryage) • If you type in Marryage, R will print out every observation of the dataset • For large datasets, type head(Marryage) to see the first six observations and variable names … or try tail(Marryage) to see the last six observations.

The “Marryage” dataset • To isolate a single variable for analysis or editting, use the “$” operator: Marryage$birthyr • To avoid having to constantly specify the dataset of variables, type this: attach(Marryage) • detach(Marryage), will reverse this option and will again have to specify the dataset

On reading and writing datasets not from a package ... • There are so many file formats that data can be stored in: .txt .dat .csv .ods .xls .sas7bdat .dta .sav The list goes on and on... • I would recommend using the .CSV file format to store and transfer your files. It will never become obsolete! • R can read in datafiles from SAS, SPSS, and Stata using the “foreign” package

How to read/import datasets • Let's say you've just finished cleaning a dataset in Excel, you're ready to analyze it, and it is saved to your desktop as a .CSV file. read.csv(“C:/Documents and Settings/Roni/Desktop/mydata.csv”, header=T) -> mydata • IMPORTANT: notice how specifiying the file location involves the use of forward-slashes (/), not back-slashes (\)!!

How to write/export datasets • Writing/exporting datasets from R to your desktop, for instance, is also very easy: write.csv(mydata, “C:/Documents and Settings/Roni/Desktop/mynewdata.csv”)

Don't like the way R looks? • If R doesn't look SAS-ish (SASy?) enough for you, try out these meta-packages that changes the appearance and capabilities of the R console and output: • Java GUI R (JGR): Very sleek setup that helps you complete input commands! With color-coded text! • Rcommander: Simple, yet elegant console

User-created functions • You can create your own functions in R! • These are much more versatile than macros in SAS • The R support community posts lots of homemade functions on the message boards.

The STAR dataframe • Tennessee’s Student Teacher Achievement Ratio (STAR) project was a large, four-year study of the effect of reduced class size • Lets take the Star dataset from the “mlmRev” package. • There are > 26000 observations! • Lets randomly choose 1000 observations to analyze.

Preparing the STAR data • Lets invoke the mlmRev package and the data: install.packages(), select the mlmRev package, then type: library(mlmRev); data(star) • See what variables STAR contains: head(star) • Lets removing all observations with missing data: na.omit(star) -> star2

Preparing the STAR data • We need to know how many observations there are: length(star2$id) • There are now 22815 observations, lets randomly choose 5000 of these: sample(1:22815, 5000, replace=F) -> xyz star2[xyz, ] -> star3

Descriptive analysis • Try this: summary(star3) • This will provide basic univariate analysis output for all continuous variables, and one-way frequency tables for categorical variables. • For a histogram of a continuous variable, try this: hist(star3$read, breaks=20) • For user-defined quantiles (deciles for ex.): quantile(star3$read, c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))

Preparing the STAR data • Other useful functions for univariate analysis • Standard Deviation: sd() • Variance: var() • Show a Q-Q Plot: qqnorm() • Show a Boxplot: boxplot() • Frequency Tables: table()

Preparing the STAR data • Let's dichotomize the “read” variable at the median, and create a dummy variable for “read” values above and below the median: median(star3$read) -> cutoff 1 -> star3$readdummy[star3$read <= cutoff] 0 -> star3$readdummy[star3$read > cutoff] • Lets attach this new dataset ... we are ready for analysis! attach(star3)

Some bivariate analysis • Basic correlation between reading and math scores: cor.test(read,math,method=”spearman”) • T-test to see if math scores differ between males and females: t.test(math~sx) • ANOVA to see if reading scores differ among the four school types: anova(lm(read~schtype)) • Chi-sq to see if ethnicity is associated with school type: chisq.test(eth,schtype)

Sub-setting your data • To examine subsets of data, you don't use a “where” statement like SAS. To see if math scores differ between males and females, only among non-White students: t.test(math[eth != 'W'] ~ sx[eth != 'W']) • What is the mean reading scores for 3rd grade, inner city students? mean(read[gr == '3' & schtype == 'inner'])

Linear Regression • Try simple linear regression. Use reading scores to predict math scores. Here is the syntax: lm(math ~ read, data=star3) -> SLRmodel summary(SLRmodel) • Let's make a scatterplot and include the regression line: plot(read,math,main=”Neat Scatterplot”) abline(SLRmodel, col="red")

Multiple Linear Regression • Let's try multiple linear regression in predicting math scores: lm(math ~ read + eth + sx + schtype, data=star3) -> MLRmodel summary(MLRmodel) • Notice how R deals with factor variables (like “eth,” “sx,” and “schtype”) • You can specifiy the reference group for factor variables with the group() function, or just use dummy variables

Comparing Models • R allows you to easily compare nested models through a likelihood ratio test. You need to obtain the “lmtest” package and use the “lrtest()” function. install.packages() library(lmtest) lrtest(SLRmodel, MLRmodel)

Logistic Regression • Let's use that dichtomous reading variable we made for a logistic regression model. You now have to use the glm() function: glm(readdummy ~ math + schtype, family = binomial(“logit”)) -> logmodel summary(logmodel) • Note: These are not odds ratios that are printed, they are log odds of having a low reading score! Take the anti-log to get the odds ratios: exp(logmodel$coeff)

Survival Curves and Cox Regression • R can generate beautiful Kaplan-Meier survival curves and can perform Cox proportional hazards regression (with or without time-dependent covariates) • You will need the “survival” package. It is part of the core R software, so you don't need to download it. Just type: library(survival) • Invoke the North Central Cancer Treatment Group (NCCTG) lung cancer dataset by typing: data(lung)

Survival Curves and Cox Regression • Type: head(lung) and examine the data dictionary you've got to get a sense of what's in this dataset. Attach the dataset, so we don't have to continuously type “lung$” before each variable: attach(lung) • This will print out the standard KM survival function output: summary(survfit(Surv(time,status==2) ~ 1 , data=lung))

An Introduction to R! Roni Kobrosly Epidemiology Nuts and Bolts Talk November 2009

An Introduction to R! Roni Kobrosly Epidemiology Nuts and Bolts Talk November 2009

Presentation Transcript

Hyperglycemia in Hospitalized Patients

Epidemiology of Communicable Diseases

Cancer PowerPoint #2 November 3, 2009

Epidemiology of Hypertension

Student Conduct and Discipline Nuts and Bolts Training

An Introduction to CDMA

City of Portland The Nuts and Bolts of Insurance

Teach Epidemiology

Teach Epidemiology

Descriptive Epidemiology

small talk BIG TALK

Nuts and Bolts of WIDA

Epidemiology

Chapter 1: Introduction

Teach Epidemiology

Nuts and Bolts of WIDA

Teach Epidemiology

Tire, Wheel, and Wheel Bearing Fundamentals

Nuts and Bolts of Dysplasia in IBD

School-Based Health Care (SBHC) 101: Nuts and Bolts

Teach Epidemiology