A gentle introduction to R – how to load in data and produce summary statistics

shiri

Presentation Transcript


  1. A gentle introduction to R – how to load in data and produce summary statistics • BRC MH Bioinformatics group

  2. Tutorial outline • How to install R on your own computers • It's free • But it's already installed on these computers • Loading data from Excel • Plotting • Summary statistics

  3. Files • Data and slides on: • http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012

  4. Show file extensions

  5. Show file extensions • Uncheck ‘hide extensions for known file types’ • Click ‘Apply’

  6. Installing R – skip as already installed

  7. Installing R – skip as already installed

  8. Installing R – skip as already installed

  9. Installing R – skip as already installed • Follow the installation instructions specific to your operating system

  10. Starting R on these computers

  11. Help files

  12. Loading help files • A useful function is read.table() • It allows you to read data from spreadsheets into R • To see its help file you can use ?read.table • You can use ?function_name for any function to see its help file

  13. Loading data into R from Excel

  14. From Excel • Open testdata.xls

  15. From Excel • Save the file as a comma-separated values file (.csv): go to File > Save As > Other Formats

  16. From Excel

  17. R working directory • To open a file you will need to point R towards the folder that contains it. • You can do this with setwd(), but we’ll do it using the mouse • Suppose you have the file in My Documents

  18. Browsing folders • To check that you are in the right folder, type getwd() • To see the files in this folder, type list.files() • To list the current variables, type ls() • Nothing should be loaded yet
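
The folder steps on slides 17 and 18 can also be done entirely from the console. The sketch below is only an illustration: it uses a temporary directory and an empty placeholder file (both assumptions, not part of the tutorial data) so that it runs anywhere.

```r
# Hypothetical folder for illustration: a temporary directory stands in
# for the folder that holds the tutorial data.
data.dir <- tempdir()
file.create(file.path(data.dir, "testdata.csv"))  # empty placeholder file

old.dir <- setwd(data.dir)   # setwd() returns the previous directory
getwd()                      # confirm the current working directory
files <- list.files()        # files in this folder
print(files)
ls()                         # variables currently defined in the session

setwd(old.dir)               # go back to where we started
```

In your own session you would pass the path of the folder that contains testdata.csv to setwd() instead of tempdir().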

  19. Loading data • To follow along with this section, make sure your R working directory is the folder that contains the tutorial data

  20. Read the contents of the file testdata.csv into an R variable my.data with: my.data <- read.csv('testdata.csv') • read.csv is a wrapper for read.table; calling read.table directly lets you specify more details about your file, e.g.: my.data <- read.table('testdata.csv', sep=',', header=TRUE)
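
To see that the two calls give the same result, here is a minimal sketch. It writes a small made-up file to a temporary location (the values are invented; only the column names follow the tutorial) and reads it both ways.

```r
# Toy data standing in for testdata.csv; the values are made up.
csv.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight,gender",
             "170,65,female",
             "182,80,male",
             "165,58,female"), csv.file)

via.csv   <- read.csv(csv.file)
via.table <- read.table(csv.file, sep = ",", header = TRUE)

print(via.csv)
identical(via.csv, via.table)   # both calls return the same data frame here
```

read.csv simply fills in sep and header (and a few other defaults) for you, which is why the shorter call is usually enough for well-formed csv files.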

  21. read.table() • sep : Column separator • header : Does the first row of the file contain column headers? • skip : Number of rows to skip at the top of the file • ?read.table for other useful parameters
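
As a rough illustration of these three parameters together, the sketch below reads a small made-up table that has two lines of notes above the header and uses semicolons as separators. The contents are invented for the example; read.table's text argument is used so that no file is needed.

```r
# Made-up file contents: two lines of notes, then a semicolon-separated table.
raw.lines <- c("Study export notes",
               "Units: cm and kg",
               "height;weight",
               "170;65",
               "182;80")

# skip = 2 drops the two note lines, header = TRUE uses the next line
# as column names, sep = ";" splits the columns.
skipped <- read.table(text = raw.lines, sep = ";", header = TRUE, skip = 2)
print(skipped)
```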

  22. Looking at loaded data

  23. Check your new variable is in the R environment: ls() • Take a look at the top few lines: head(my.data) • Generate some basic summary stats: summary(my.data)

  24. Check the dimensions of your dataset: dim(my.data) • Number of rows and columns: nrow(my.data), ncol(my.data) • Row and column names: rownames(my.data), colnames(my.data)
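
If you do not have the tutorial file to hand, the same inspection functions can be tried on a small stand-in. The data frame below is a toy example with invented values; only the column names are taken from the tutorial.

```r
# Toy data frame with the tutorial's column names; values are made up.
my.data <- data.frame(height = c(170, 182, 165, 176),
                      weight = c(65, 80, 58, 72),
                      gender = c("female", "male", "female", "male"))

head(my.data, n = 2)   # first two rows
summary(my.data)       # per-column summary statistics
dim(my.data)           # number of rows, then number of columns
nrow(my.data)
ncol(my.data)
colnames(my.data)
rownames(my.data)      # default row names are "1", "2", ...
```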

  25. Subsetting Data

  26. Look at the first row: my.data[1,] • Look at the first column: my.data[,1] • Look at the third column of row 10: my.data[10,3]

  27. Look at the first column for rows 100 to 110: my.data[100:110,1] • Same as above, but save to a variable: my.subset <- my.data[100:110,1] • Look at rows 30, 40, 50 and 60: my.data[c(30,40,50,60),] • Same as above, but pre-defining the index vector: my.indices <- c(30, 40, 50, 60); my.data[my.indices,]
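
The positional indexing on slides 26 and 27 follows the pattern my.data[rows, columns]. A minimal sketch on a five-row toy data frame (invented values, smaller indices than on the slides) shows what each form returns.

```r
# Toy data frame; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165, 176, 159),
                      weight = c(65, 80, 58, 72, 54),
                      gender = c("female", "male", "female", "male", "female"))

first.row   <- my.data[1, ]       # a one-row data frame
first.col   <- my.data[, 1]       # a plain vector (here: the heights)
one.cell    <- my.data[3, 2]      # row 3, column 2
row.range   <- my.data[2:4, 1]    # column 1 for rows 2 to 4
my.indices  <- c(1, 3, 5)
picked.rows <- my.data[my.indices, ]   # rows 1, 3 and 5, all columns

print(first.row)
print(row.range)
print(picked.rows)
```

Leaving the row or column position empty means "all of them", which is why my.data[1, ] returns every column of row 1.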

  28. You can subset on names instead of indices: • Look at the column named 'weight' for row 1: my.data[1,'weight'] • Look at the columns named 'height' and 'weight' for row 1: my.data[1,c('weight','height')] • Same as above, but pre-defining the column names vector: cols <- c('weight','height'); my.data[1,cols]

  29. Negative indices exclude elements: • Look at all columns except the second for row 1: my.data[1,-2] • Extract all rows except 1-100: my.new.data <- my.data[-1:-100,] • Extract all rows except 35, 67, 101: my.indices <- -1 * c(35, 67, 101); my.new.data <- my.data[my.indices,]
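
Name-based and negative indexing can be tried on a small stand-in as well. The sketch below uses a five-row toy data frame with invented values and correspondingly smaller indices than the slides.

```r
# Toy data frame; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165, 176, 159),
                      weight = c(65, 80, 58, 72, 54),
                      gender = c("female", "male", "female", "male", "female"))

cols <- c("weight", "height")
by.name <- my.data[1, cols]          # row 1, columns chosen by name

no.second <- my.data[1, -2]          # row 1 without the second column
kept      <- my.data[-1:-2, ]        # all rows except 1 and 2
my.indices <- -1 * c(2, 4)
kept.odd  <- my.data[my.indices, ]   # all rows except 2 and 4

print(by.name)
print(no.second)
print(kept.odd)
```

Note that the remaining rows keep their original row names, so kept.odd is labelled 1, 3 and 5 rather than 1, 2 and 3.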

  30. Quiz!

  31. How tall is the person in the 7th row? • What gender is the person in the 300th row? • For the people in rows 20-30, who is the heaviest? • For the people in rows 110, 350, 219, 74, who is the tallest? • Save all rows except 500-600 in a variable my.new.data • How many males and females are in this new dataset?

  32. Formatting problems

  33. Data isn't comma-separated? • Specify the separator in read.table • Tab-delimited text is another common format, for which you can use sep="\t" • Load "testdata.txt", a tab-delimited version of the data
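
A minimal sketch of reading tab-delimited text is shown below. It writes a tiny made-up file to a temporary location rather than using testdata.txt, so the file name and values here are assumptions for illustration only.

```r
# Write a small tab-delimited file; "\t" is the tab character.
tsv.file <- tempfile(fileext = ".txt")
writeLines(c("height\tweight",
             "170\t65",
             "182\t80"), tsv.file)

tab.data <- read.table(tsv.file, sep = "\t", header = TRUE)
print(tab.data)
```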

  34. Data has extra header information at the top? • Either delete this data in Excel before exporting to csv • Or use the skip=N argument to read.table • Have a look at "testdata_1.csv" in Excel and then load it into R using read.table

  35. Factors are inconsistently named • R will just read in the data you give it • If you are not consistent in naming the levels of your factors, R will treat them as different levels • R is case sensitive: 'MyLevel' != 'mylevel' • Load the data from testdata_2.csv and have a look at the gender variable. Try to fix the problems in Excel and reload.
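
The slide suggests fixing the labels in Excel. As an alternative, the clean-up can be sketched in R itself. The labels below are invented to mimic typical inconsistencies; they are not the actual contents of testdata_2.csv.

```r
# Made-up labels with mixed case and a stray trailing space.
gender <- c("Male", "male", "FEMALE", "female", "Female ")
table(gender)                       # R sees five different labels

# Strip surrounding whitespace and convert to lower case.
clean <- tolower(trimws(gender))
table(clean)                        # now only "female" and "male"

gender.factor <- factor(clean, levels = c("female", "male"))
levels(gender.factor)
```

Doing the fix in a script has the advantage that it can be rerun whenever the spreadsheet is exported again.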

  36. Measurements and units in a single column • If you store values like 10kg, R will not interpret this as a numeric column • Try loading the file 'testdata_3.csv': what has happened to the weight and height information? • Try loading it again so that the two columns are loaded as character vectors • Have a look at the sub() function and see if you can fix the problem
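
One possible approach with sub() is sketched below on invented values (not the actual contents of testdata_3.csv). When reading a real file, passing stringsAsFactors = FALSE to read.csv is one way to get the columns as character vectors first.

```r
# Made-up columns where the unit is stored together with the number.
raw <- data.frame(weight = c("65kg", "80kg", "58kg"),
                  height = c("170cm", "182cm", "165cm"),
                  stringsAsFactors = FALSE)

is.numeric(raw$weight)   # the column is character, not numeric

# Remove the unit at the end of each string, then convert to numbers.
raw$weight <- as.numeric(sub("kg$", "", raw$weight))
raw$height <- as.numeric(sub("cm$", "", raw$height))

summary(raw)             # both columns are now summarised as numbers
```

The "$" in the pattern anchors the match to the end of the string, so only a trailing unit is removed.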

  37. Excel has just screwed up your data • Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version. Avoid opening large datasets in Excel; use R instead • Excel tries to be helpful by formatting elements for you. Try the following, then open the file in Excel, save it as csv and reload it into R. What has happened? my.genes <- c('MASH1','SOX2','OCT4'); write.csv(my.genes, file='mygenes.csv')

  38. Plotting

  39. Drawing histograms • Optional exercises: 1) Try drawing a histogram of height 2) Try to label the x axis [hint: read the help file]

  40. Drawing normal QQ plots • qqnorm(my.data$weight); qqline(my.data$weight)

  41. Drawing scatterplots • plot(height~weight,data=my.data) • plot(height~weight,data=my.data,col=as.numeric(as.factor(gender))) • Optional exercises: try these. Do you understand this plot?

  42. Drawing boxplots • boxplot(height~gender,data=my.data)

  43. Saving plots • JPEGs: jpeg("boxplot.jpg"); boxplot(height~gender,data=my.data); dev.off() • PDFs: pdf("boxplot.pdf"); boxplot(height~gender,data=my.data); dev.off()
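
The pattern on this slide is always the same: open a graphics device, draw, then close the device with dev.off() so the file is written. A self-contained sketch using toy data (invented values) and a temporary output path (an assumption made so the example does not clutter your working directory):

```r
# Toy data; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165, 176, 159, 185),
                      gender = factor(c("female", "male", "female",
                                        "male", "female", "male")))

out.file <- file.path(tempdir(), "boxplot.pdf")

pdf(out.file)                                   # open the PDF device
boxplot(height ~ gender, data = my.data,
        xlab = "gender", ylab = "height")       # drawn into the file
dev.off()                                       # close the device

file.exists(out.file)                           # check the file was written
```

Until dev.off() is called the file may be incomplete, so it is worth closing the device straight after the plotting commands.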

  44. Summary statistics

  45. Functions Covered • read.table() • head() • dim() • write.table() • mean() • sd() • cor() • cor.test() • t.test() • shapiro.test() • wilcox.test() • kruskal.test() • lm() • anova() • coefficients() • fitted() • residuals() • NB: to find help, type ?function, e.g. ?cor • http://www.statmethods.net/index.html

  46. Writing tables
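
The code for this slide is not in the transcript. As a rough sketch of how write.table() from the function list is typically used, the example below writes a toy data frame (invented values) to a temporary csv file and reads it back. The argument choices are illustrative, not necessarily those used in the tutorial.

```r
# Toy data frame; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165),
                      weight = c(65, 80, 58),
                      gender = c("female", "male", "female"))

out.file <- tempfile(fileext = ".csv")

# row.names = FALSE avoids an extra unnamed first column;
# quote = FALSE writes text values without quotation marks.
write.table(my.data, file = out.file, sep = ",",
            row.names = FALSE, quote = FALSE)

readLines(out.file)              # the raw lines that were written
reloaded <- read.csv(out.file)   # read the file back in
print(reloaded)
```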

  47. Calculate Mean and SD
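
The code for this slide is not in the transcript. A minimal sketch of mean() and sd() on invented values follows; the per-group summary with tapply() is an addition for illustration and is not taken from the slides.

```r
# Made-up heights and matching gender labels.
height <- c(170, 182, 165, 176, 159)
gender <- c("female", "male", "female", "male", "female")

mean(height)   # arithmetic mean
sd(height)     # sample standard deviation (n - 1 in the denominator)

# Mean height within each gender group.
group.means <- tapply(height, gender, mean)
print(group.means)
```

With the tutorial data you would use my.data$height and my.data$gender in place of the toy vectors.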

  48. Correlate phenotypes and test for group differences
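
The code for this slide is not in the transcript. The sketch below shows one way to use cor(), cor.test() and t.test() from the function list, assuming height and weight are the phenotypes and gender defines the groups. The data are invented, so the numbers it prints say nothing about the tutorial dataset.

```r
# Toy data; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165, 176, 159, 185),
                      weight = c(65, 80, 58, 72, 54, 84),
                      gender = factor(c("female", "male", "female",
                                        "male", "female", "male")))

# Pearson correlation between two phenotypes, with a significance test.
r <- cor(my.data$height, my.data$weight)
print(r)
print(cor.test(my.data$height, my.data$weight))

# Welch two-sample t-test for a difference in mean height between groups.
tt <- t.test(height ~ gender, data = my.data)
print(tt)
```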

  49. It is always important to check model assumptions before making statistical inferences
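
The transcript does not say which checks were shown here. One common choice, sketched below on invented data, is to test normality with shapiro.test() and to fall back on a rank-based test such as wilcox.test() when normality looks doubtful. This is an assumption about the workflow, not the tutorial's stated method.

```r
# Toy data; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165, 176, 159, 185),
                      gender = factor(c("female", "male", "female",
                                        "male", "female", "male")))

# Shapiro-Wilk test: a small p-value suggests departure from normality.
sw <- shapiro.test(my.data$height)
print(sw)

# Rank-based alternative to the t-test that does not assume normality.
wt <- wilcox.test(height ~ gender, data = my.data)
print(wt)
```

A normal QQ plot, as drawn earlier with qqnorm() and qqline(), is a useful visual companion to the formal test.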

  50. Linear regression
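
The code for this slide is not in the transcript. A minimal sketch of lm() together with anova(), coefficients(), fitted() and residuals() from the function list follows, assuming weight is regressed on height. The data are invented for illustration.

```r
# Toy data; values are made up for illustration.
my.data <- data.frame(height = c(170, 182, 165, 176, 159, 185),
                      weight = c(65, 80, 58, 72, 54, 84))

# Fit weight as a linear function of height.
fit <- lm(weight ~ height, data = my.data)

coefficients(fit)        # intercept and slope
anova(fit)               # analysis of variance table for the model
head(fitted(fit))        # predicted weights
head(residuals(fit))     # observed minus predicted weights
```

Plotting residuals(fit) against fitted(fit) is a quick way to look for patterns that the straight-line model has missed.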
