
An Introduction to R





  1. An Introduction to R Xiaozhi Zhou

  2. What is R? • The R system for statistical computing is an environment for data manipulation, calculation and graphical display. • R can be regarded as an implementation of the S language (S-Plus system) which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. • Strengths • free and open source, supported by a strong user community • highly extensible and flexible • implementation of modern statistical methods • moderately flexible graphics with intelligent defaults • Weaknesses • slow or impossible with large data sets • non-standard programming paradigms

  3. Installing R • Main source: • http://www.R-project.org • The base system is available for Windows, Mac OS X and Linux • Current version at the time of the presentation: 2.8.0 (2008-10-20)

  4. Documentation • An Introduction to R: gives an introduction to the language and how to use R for doing statistical analysis and graphics. • R Data Import/Export: describes the import and export facilities available either in R itself or via packages which are available from CRAN. • R Installation and Administration: hints for installing R on special platforms • Writing R Extensions: covers how to create your own packages, write R help files, and the foreign language (C, C++, Fortran, ...) interfaces. • The R Reference Index: contains all help files of the R standard and recommended packages in printable form.

  5. Installing Packages • A huge amount of additional functionality is implemented in add-on packages (which can be downloaded from the official home page) • Packages can be installed directly from the R prompt: • R> install.packages("package name") • or from the menu: 'Packages' - 'Install package(s)' • The package functionality is available after attaching the package by: • R> library("package name")
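
The install-then-attach sequence above can be wrapped so that a package is only downloaded when it is missing. This is a minimal sketch, not a standard idiom from the slides: the helper name ensure_package is hypothetical, and the CRAN mirror URL is an assumption that can be replaced by any mirror.

```r
# Hypothetical helper: install a package only if it is not already available,
# then attach it. The name ensure_package is illustrative, not part of R.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)   # downloads from CRAN
  }
  library(pkg, character.only = TRUE)      # attach using a name held in a variable
  invisible(TRUE)
}

# "stats" ships with every R installation, so nothing is downloaded here.
ensure_package("stats")
```

Calling ensure_package("survey") would install the survey package on first use and simply attach it afterwards.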

  6. Help • Online help via the official home page • Help systems in R: • R> help(help) or use ?help for short • R> help.start() - opens the help in HTML format • R> help("mean") or use ?mean for short • R> help(package="survey") • R> help.search("keyword") - searches the help pages for a keyword

  7. R commands, case sensitivity, recall commands • Case sensitive: • A and a are different symbols and would refer to different variables • Elementary commands consist of either expressions or assignments • Expressions: evaluated and printed, after which the value is lost • Assignments: evaluate an expression and pass the value to a variable, but the result is not automatically printed • Commands are separated either by a newline or by a ';' • Elementary commands can be grouped together into one compound expression by '{' and '}' • A comment starts with a # and runs to the end of the line • Recall previous commands: vertical arrow keys ↑ ↓
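
The rules on this slide can be tried directly at the prompt. The following short sketch uses made-up variable names and shows an expression, an assignment, case sensitivity, the ';' separator and a compound expression.

```r
# Expression: evaluated and printed, but the value is not kept
2 + 3

# Assignment: the value is stored in x and nothing is printed
x <- 2 + 3

# Names are case sensitive, so x and X are different variables
X <- 10
x + X                 # 15

# Two commands on one line, separated by ';'
a <- 1; b <- 2

# A compound expression in braces; its value is that of the last expression
total <- { a <- a + 1; a + b }
total                 # 4
```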

  8. Data import and export • Reading data from files • read.table() function: reads an entire data frame • R> survey <- read.table("survey.csv", header=TRUE, sep=",", row.names=1) • header=TRUE: the first line of the file is a line of column headings • sep=",": columns are separated by a comma • row.names=1: the first column is interpreted as row names, not as a variable • scan() function: reads in vectors as a list • R> inp <- scan("input.dat", list("",0,0))
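
To see what these options do without needing survey.csv or input.dat, the sketch below writes a tiny file to a temporary location and reads it back both ways. The file contents and column names are invented for illustration.

```r
# Write a small comma-separated file to a temporary location so the example
# is self-contained; the column names and values are made up.
path <- tempfile(fileext = ".csv")
writeLines(c("id,age,score",
             "s1,23,4.5",
             "s2,31,3.8",
             "s3,27,4.1"), path)

# header = TRUE: first line holds column names
# sep = ",": fields are separated by commas
# row.names = 1: first column supplies row labels instead of being a variable
survey <- read.table(path, header = TRUE, sep = ",", row.names = 1)
str(survey)

# scan() with a list template reads each field into a separate list component:
# "" means character, 0 means numeric. skip = 1 drops the header line.
inp <- scan(path, what = list("", 0, 0), sep = ",", skip = 1)
inp[[2]]              # the age column as a numeric vector
```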

  9. Reading a CSV file (e.g. exported from Excel) • R> read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="", ...) • file - the name of the file which the data are to be read from • header - a logical value indicating whether the file contains the names of the variables as its first line • sep - the field separator character. Values on each line of the file are separated by this character • quote - the set of quoting characters • dec - the character used in the file for decimal points • Check help(read.csv) for more information • R> read.csv("F:/houses.data.csv", header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="") • X Price Floor Area Rooms Age Cent.heat • 1 1 52.00 111 830 5 6.2 No • 2 2 54.75 128 710 5 7.5 No • 3 3 57.50 101 1000 5 4.2 No • 4 4 57.50 131 690 5 8.8 No • 5 5 59.75 93 900 5 1.9 Yes
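
The slides cover import only, so here is a hedged sketch of the matching export step with write.csv(), followed by reading the file back. The data frame reuses a few values from the slide, typed in by hand; the temporary path is an assumption made so the example runs anywhere. The X column in the slide's output is what read.csv() typically produces when the first header field is empty, for instance in a file saved with row labels, and row.names = FALSE on export avoids it.

```r
# First three rows from the slide, rebuilt by hand so no external file is needed
houses <- data.frame(Price = c(52.00, 54.75, 57.50),
                     Floor = c(111, 128, 101),
                     Area  = c(830, 710, 1000),
                     Cent.heat = c("No", "No", "No"))

path <- tempfile(fileext = ".csv")
write.csv(houses, path, row.names = FALSE)   # export without a row-label column

# read.csv() already defaults to header = TRUE and sep = ","
houses2 <- read.csv(path)
str(houses2)
```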

  10. Simple manipulations; numbers and vectors • Vectors and assignment • R> x <- c(10.4, 5.6, 3.1, 6.4, 21.7) • R> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x • R> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7)) • R> y <- 2*x-3.5 • Vector arithmetic • R> sum((x-mean(x))^2)/(length(x)-1) • Generating regular sequences • R> seq(-5, 5, by=.2) -> s3 • R> s4 <- seq(length=51, from=-5, by=.2) • R> s5 <- rep(x, times=5) • R> s6 <- rep(x, each=5)
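
The vector-arithmetic line on this slide is the sample variance written out by hand. A short sketch, using the same x, checks it against the built-in var() and contrasts the times and each arguments of rep().

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Arithmetic is element by element; shorter operands are recycled
y <- 2 * x - 3.5

# The hand-written sample variance agrees with the built-in var()
v <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(v, var(x))          # TRUE

# times repeats the whole vector, each repeats every element in turn
rep(1:3, times = 2)           # 1 2 3 1 2 3
rep(1:3, each = 2)            # 1 1 2 2 3 3

# seq() by step: from -5 to 5 in steps of 0.2 gives 51 values
length(seq(-5, 5, by = 0.2))  # 51
```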

  11. Simple manipulations; numbers and vectors • Logical vectors • R> temp <- x>13 (logical vectors hold the values TRUE, FALSE and NA) • R> str(temp) • logi [1:5] FALSE FALSE FALSE FALSE TRUE • Character vectors • R> labs <- paste(c("X", "Y"), 1:10, sep="") • R> str(labs) • chr [1:10] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10" • Missing values • R> z <- c(1:3, NA); is.na(z) • is.na(xx) is TRUE both for NA and NaN values • is.nan(xx) is only TRUE for NaNs
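
A brief sketch of how these pieces are typically used: a logical vector as an index, paste() recycling, and the difference between NA and NaN. The vector z is made up for the example.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# A logical vector can be used as an index to select elements
big <- x > 13
x[big]                        # 21.7
sum(big)                      # TRUE counts as 1, so this counts the matches: 1

# paste() recycles the shorter vector c("X", "Y") against 1:4
paste(c("X", "Y"), 1:4, sep = "")   # "X1" "Y2" "X3" "Y4"

# NA is a missing value; NaN is an undefined numeric result such as 0/0
z <- c(1, NA, 0/0)
is.na(z)                      # FALSE TRUE TRUE
is.nan(z)                     # FALSE FALSE TRUE

# Most summaries return NA unless missing values are removed explicitly
mean(z, na.rm = TRUE)         # 1
```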

  12. Simple summary statistics • R> x<- c(10.4, 5.6, 3.1, 6.4, 21.7) • R> summary(x) • Min. 1st Qu. Median Mean 3rd Qu. Max. • 3.10 5.60 6.40 9.44 10.40 21.70 • R> str(x) • num [1:5] 10.4 5.6 3.1 6.4 21.7

  13. Lists and data frames • R> Lst <- list(name="Fred",wife="Mary",no.children=3, child.ages=c(4, 7, 9)) • R> str(Lst) • List of 4 • $ name: chr "Fred" • $ wife: chr "Mary" • $ no.children: num 3 • $ child.ages: num [1:3] 4 7 9 • Components are accessed as Lst$component_name: • R> str(Lst$child.ages) • num [1:3] 4 7 9 • attach() and detach() make the components visible by name and hide them again • R> attach(Lst) • R> str(child.ages) • num [1:3] 4 7 9
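
The slide title mentions data frames but only shows a list, so the sketch below adds a small data frame built from the same list. The child labels "A", "B", "C" are invented; with() is shown as an alternative to attach() that leaves nothing on the search path.

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

# Three equivalent ways to reach a component
Lst$name
Lst[["name"]]
Lst[[1]]

# A data frame is a list of equal-length columns; the child labels are made up
kids <- data.frame(child = c("A", "B", "C"), age = Lst$child.ages)
kids$age                       # 4 7 9
kids[kids$age > 5, ]           # rows where age exceeds 5

# with() evaluates an expression inside the data frame without attach(),
# so nothing is left on the search path afterwards
with(kids, mean(age))
```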

  14. Calculate simple correlation • Arrange the data in a .csv file and read the file into memory • R> weight <- read.csv("F:/height-weight.csv", header=TRUE, row.names=1) • R> height <- read.csv("F:/height-weight.csv", header=TRUE, row.names=2) • R> cor(height, weight, method = "pearson") • Weight • Height 0.9936593
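
Reading the file twice is not required. A common alternative, sketched here under the assumption that the data sit in one data frame with columns Height and Weight, is to pass the two columns to cor() directly. The numbers below are hypothetical stand-ins, not the contents of height-weight.csv.

```r
# Hypothetical height (cm) and weight (kg) values, used only so the example runs
hw <- data.frame(Height = c(150, 160, 165, 172, 180, 188),
                 Weight = c(52, 58, 63, 70, 77, 85))

# Pearson correlation between two columns of one data frame
r <- cor(hw$Height, hw$Weight, method = "pearson")
r

# cor() on the whole data frame returns the full correlation matrix
cor(hw)

# cor.test() adds a confidence interval and a p-value
cor.test(hw$Height, hw$Weight)
```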

  15. Conduct a linear regression analysis • The basic function is: lm(formula, data) • Example from help(lm): plant weight data from Annette Dobson (1990) "An Introduction to Generalized Linear Models" • R> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14) • R> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69) • R> group <- gl(2,10,20, labels=c("Ctl","Trt")) • gl(): Generate factors by specifying the pattern of their levels. • R> weight <- c(ctl, trt) • R> anova(lm.D9 <- lm(weight ~ group)) • Analysis of Variance Table • Response: weight • Df Sum Sq Mean Sq F value Pr(>F) • group 1 0.6882 0.6882 1.4191 0.249 • Residuals 18 8.7293 0.4850

  16. Conduct a linear regression analysis • R> summary(lm.D90 <- lm(weight ~ group - 1)) • Call: • lm(formula = weight ~ group - 1) • Residuals: • Min 1Q Median 3Q Max • -1.0710 -0.4938 0.0685 0.2462 1.3690 • Coefficients: • Estimate Std. Error t value Pr(>|t|) • groupCtl 5.0320 0.2202 22.85 9.55e-15 *** • groupTrt 4.6610 0.2202 21.16 3.62e-14 *** • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • Residual standard error: 0.6964 on 18 degrees of freedom • Multiple R-squared: 0.9818, Adjusted R-squared: 0.9798 • F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16
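
The two fits on these slides describe the same model in different parameterisations. The sketch below, using the slide's data, shows how the coefficients relate to the group means; the object names fit1 and fit0 are chosen here for brevity.

```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

# With an intercept: the intercept is the Ctl mean and groupTrt is the
# difference between the Trt and Ctl means
fit1 <- lm(weight ~ group)
coef(fit1)

# Without an intercept (- 1): one coefficient per group, equal to the group means
fit0 <- lm(weight ~ group - 1)
coef(fit0)
tapply(weight, group, mean)

# Both parameterisations give the same fitted values
all.equal(fitted(fit1), fitted(fit0))
```

Because the no-intercept summary tests each group mean against zero and computes R-squared about zero, its very small p-values and large R-squared answer a different question from the ANOVA table, which tests whether the two group means differ.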

  17. Conduct a generalized linear regression (Poisson) • Logistic regression is a special case of a Generalized Linear Model (GLM), fitted with glm() and family=binomial() • The example here instead fits a Poisson log-linear model to count data, using family=poisson() • R> counts <- c(18,17,15,20,10,20,25,13,12) • R> outcome <- gl(3,1,9) • Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 • R> treatment <- gl(3,3) • Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 • R> print(d.AD <- data.frame(treatment, outcome, counts)) • R> glm.D93 <- glm(counts ~ outcome + treatment, family=poisson()) • R> anova(glm.D93) • Analysis of Deviance Table • Model: poisson, link: log • Response: counts • Terms added sequentially (first to last) • Df Deviance Resid. Df Resid. Dev • NULL 8 10.5814 • outcome 2 5.4523 6 5.1291 • treatment 2 0.0000 4 5.1291

  18. Conduct a generalized linear regression (Poisson) • R> summary(glm.D93) • Call: • glm(formula = counts ~ outcome + treatment, family = poisson()) • Deviance Residuals: • 1 2 3 4 5 6 7 8 9 • -0.67125 0.96272 -0.16965 -0.21999 -0.95552 1.04939 0.84715 -0.09167 -0.96656 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) 3.045e+00 1.709e-01 17.815 <2e-16 *** • outcome2 -4.543e-01 2.022e-01 -2.247 0.0246 * • outcome3 -2.930e-01 1.927e-01 -1.520 0.1285 • treatment2 8.717e-16 2.000e-01 4.36e-15 1.0000 • treatment3 4.557e-16 2.000e-01 2.28e-15 1.0000 • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for poisson family taken to be 1) • Null deviance: 10.5814 on 8 degrees of freedom • Residual deviance: 5.1291 on 4 degrees of freedom • AIC: 56.761 • Number of Fisher Scoring iterations: 4
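
The model fitted on these slides uses family=poisson(), which is a log-linear model for counts. For a logistic regression proper, the response is binary and the family is binomial(). The sketch below is not from the slides: it uses the mtcars data set that ships with R, and the choice of am as response and wt as predictor is an assumption made purely for illustration.

```r
# mtcars ships with R; am is 0 (automatic) or 1 (manual), wt is weight in 1000 lbs
fit <- glm(am ~ wt, data = mtcars, family = binomial())
summary(fit)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
exp(coef(fit))

# Predicted probability of a manual transmission for a car with wt = 3
# (the value 3 is chosen only for illustration)
p <- predict(fit, newdata = data.frame(wt = 3), type = "response")
p
```

type = "response" returns probabilities; without it, predict() returns values on the log-odds scale.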

  19. Simple graphs - Histogram • The basic function is hist(variables) • R> test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8) • R> hist(test.data) • R> hist(log(test.data))
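
hist() chooses the bins automatically, and it can be useful to see what it chose. A small sketch with the slide's data follows; the explicit break points in the second call are an arbitrary choice for illustration.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4,
               2.5, 2.7, 3.1, 2.6, 2.8)

# plot = FALSE returns the binning without drawing anything
h <- hist(test.data, plot = FALSE)
h$breaks                      # bin boundaries chosen by the default rule
h$counts                      # number of observations in each bin

# Explicit breaks and labels; the break points here are an arbitrary choice
hist(test.data, breaks = seq(1, 6, by = 1),
     main = "Histogram of test.data", xlab = "Value")
```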

  20. Simple graphs - Scatter Plot • The basic function is plot(y ~ x), or equivalently plot(x, y): the variable on the left of ~ is drawn on the vertical axis • R> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14) • R> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69) • R> plot(ctl ~ trt)
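
In the formula form, the variable on the left of ~ goes on the vertical axis. The sketch below draws the same scatter plot both ways and overlays a least-squares line; adding the line is an extra step not shown on the slide.

```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)

# Formula interface: ctl (left of ~) is on the vertical axis, trt on the horizontal
plot(ctl ~ trt, xlab = "trt", ylab = "ctl", main = "ctl against trt")

# Equivalent call with explicit x and y arguments
plot(trt, ctl)

# Add the least-squares line from the matching regression to the current plot
fit <- lm(ctl ~ trt)
abline(fit, lty = 2)
```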

  21. Simple graphs - Boxplot • The basic function is boxplot(variables) • R> counts <- c(18,17,15,20,10,20,25,13,12) • R> outcome <- gl(3,1,9) • R> boxplot(counts~outcome) • R> test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8) • R> boxplot(test.data, xlab="Single sample", ylab="Value axis", col="lightblue") • R> title(main="Plot with outlier", font.main= 4)
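
Plots drawn at the prompt appear on screen. To keep one, a graphics device such as pdf() can be opened before plotting and closed afterwards. This is a minimal sketch using the slide's data; the temporary file path and the figure size are assumptions.

```r
counts  <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome <- gl(3, 1, 9)

# Open a PDF device, draw, then close it so the file is written
out <- tempfile(fileext = ".pdf")     # temporary path, chosen for illustration
pdf(out, width = 6, height = 4)
boxplot(counts ~ outcome, xlab = "Outcome", ylab = "Count", col = "lightblue")
dev.off()

# boxplot() also returns the statistics it drew: one column per group holding
# the lower whisker, lower hinge, median, upper hinge and upper whisker
b <- boxplot(counts ~ outcome, plot = FALSE)
b$stats
```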

  22. References • B. S. Everitt and T. Hothorn (2006). A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC, Boca Raton, FL. • W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. http://www.R-project.org
