Learn the grammar of graphics, ggplot syntax, data-aes-geoms, facets & themes, scales, colors, labels & legends, stats, and ggplot extension packages for effective data visualization using ggplot in R.
R for Epi Workshop Module 3: Data visualization with ggplot. Mike Dolan Fliss, MSW, MPS, PhD Candidate in Epidemiology, UNC Gillings School of Global Public Health
Module Outline • The grammar of graphics • ggplot syntax • Data-aes-geoms • Facets & Themes • Preparing data to Plot • Scales • Colors • Labels & legends • Stats • ggplot extension packages
Same MO as before: Code along! • Recommend typing it out • Tab-complete is your speed and typo friend • Copy and paste always available • Mostly follow-along, with a little bit of free-work at the end of the module. A little bit of live coding. • Minimum to start:
library(tidyverse)
setwd("D:/<YOUR DIR HERE>/R Workshop/")
births_sm = readRDS("births_final.rds")
To start: New script, load data, headers
#........................................
# IPH Workshop: R for Epi
# Examples for modules 3 and 4
# Mike Dolan Fliss
#........................................
library(tidyverse)
setwd("D:/<YOUR FILE PATH HERE!>/R Workshop/")
# ^ Above is Mike’s – make your own!
births_sm = readRDS("births_final.rds")
#........................................
# Module 3 ####
#........................................
# _ Slide: ggplot components ####
1. Grammar of Graphics Underlying theory
Grammar of Graphics • A consistent, theoretical framework for describing data visualizations. • Based on a book (1999) by Leland Wilkinson • Helps you think about and plan graphics outside of R… but implemented deeply in R’s ggplot2 package. • May also be familiar if you’ve worked in Tableau – Wilkinson now works for Tableau
Grammar of graphics components https://medium.com/tdebeus/think-about-the-grammar-of-graphics-when-improving-your-graphs-18e3744d8d18
Grammar of graphics components data aesthetic mapping geometric object statistical transformations scales coordinate system position adjustments faceting
Anatomy of a ggplot data aesthetic mapping geometric object statistical transformations scales coordinate system position adjustments faceting
2. ggplot2 syntax An implementation of gg!
ggplot components (are the same!) data aesthetic mapping geometric object statistical transformations scales coordinate system position adjustments faceting
ggplot components Or minimally,
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
e.g., using the mpg dataset:
ggplot(mpg) + geom_point(aes(displ, hwy, color = class))
* Read the “+” here as “and” … And a note on overloaded operators!
Versus base / qplot shorthand
hist(mpg$cyl)
ggplot(mpg)+geom_histogram(aes(x=cyl))
ggplot(mpg, aes(cyl))+geom_histogram()
qplot(mpg$cyl, geom="histogram")
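A minimal sketch of the “+” idea, using the mpg dataset that ships with ggplot2: each + adds one component to a plot object, and nothing is drawn until the object is printed. The object names p_base, p_points and pts are illustrative only.

```r
library(ggplot2)

# Start with data only: a valid but empty plot object
p_base = ggplot(data = mpg)

# "+" adds a geom layer together with its aesthetic mappings
p_points = p_base +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# layer_data() returns the data as the first layer will draw it:
# one row per point, with the mapped aesthetics filled in
pts = layer_data(p_points, 1)
head(pts[, c("x", "y", "colour")])

# Typing p_points at the console draws the plot
```

Keeping intermediate objects like this is optional; it mainly helps when several variants of a plot share the same base.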
3. Data, aes, geoms The flexible power of aesthetic mappings
data • ggplot likes “long”, well-structured data.frames • ggplot “stats” can make quick transformations • dplyr will help with complicated transformations • tidyr will help go from wide to long • Extensions allow ggplot to understand other kinds of data (e.g. maps, network data)
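To make “wide to long” concrete, here is a small sketch using tidyr's pivot_longer(). The table and its values are made up for illustration and are not taken from the births data.

```r
library(tidyr)

# Hypothetical wide table: one row per county, one column per measure
wide = data.frame(
  county      = c("A", "B", "C"),
  pct_preterm = c(10.2, 12.5, 9.8),
  pct_pnc5    = c(81.0, 76.4, 88.1)
)

# Long form: one row per county-measure pair, which ggplot can map
# directly (e.g. x = county, y = percent, fill = measure)
long = pivot_longer(wide, cols = c(pct_preterm, pct_pnc5),
                    names_to = "measure", values_to = "percent")
long
```

In the long form, the measure name is itself a variable, so it can be mapped to an aesthetic or used for faceting.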
geoms …and many more in sister packages. Like other key packages, a cheat sheet is built into RStudio.
Aesthetic Mappings: Data → Aesthetics → Geometry • Data: numbers & factors (characters coerced), e.g. meduc, mage, cigdur, wksgest, preterm_f, pnc5_f, county_name, raceeth_f …and more • Aesthetics: x, y (the minimum for geom_point(); the rest are defaulted), color (name, rgb), fill, size, linetype (int or name), alpha, height, width, shape, angle ….and more https://cran.r-project.org/web/packages/ggplot2/vignettes/ggplot2-specs.html
Cheatsheets Cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ... Or right in RStudio! (One more benefit of using current RStudio IDE version)
Sampling for prototyping speed Graphs can take a few moments, and for prototyping we want speed. Let’s use some dplyr to generate a sample to play with, and convert it to a dplyr tibble for the dataset viewing benefits.
set.seed(1) # for replicability
table(complete.cases(births_sm)) # test how it works
births_10k = births_sm %>%
  filter(complete.cases(.)) %>% # dot syntax
  as_tibble() %>%
  sample_n(10000)
births_10k %>% head(2) # helpful when doing EDA
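The same pattern can be tried without the workshop file. This sketch swaps in R's built-in airquality data (which has missing values) as a stand-in; the sample size of 50 and the name aq_sample are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(1)                         # for replicability
table(complete.cases(airquality))   # how many rows have no missing values?

aq_sample = airquality %>%
  filter(complete.cases(.)) %>%     # "." stands for the piped-in data frame
  as_tibble() %>%
  sample_n(50)                      # 50 is an arbitrary illustrative size

aq_sample %>% head(2)
```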
Basic geometries and mappings • Let’s create a scatterplot of wksgest and mage using ggplot and geom_point.
ggplot(births_10k, aes(mage, wksgest))+geom_point()
Basic geometries and mappings D’oh! Overplotting! Use the geom_jitter() geometry instead.
ggplot(births_10k, aes(mage, wksgest))+geom_jitter()
Basic geometries and mappings Let’s try colors. Map cigdur to color. That’s it!
ggplot(births_10k, aes(mage, wksgest, color=cigdur))+
  geom_jitter()
Basic geometries and mappings Fancier: change color to preterm_f, change the point character to a “.” and alpha to 0.5. Note the global aes!
ggplot(births_10k, aes(mage, wksgest, color=preterm_f))+
  geom_jitter(pch=".", alpha=0.5)
# ^ Typical chained spacing in last two examples, like dplyr
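The last example mixes two things worth separating: aesthetics mapped to a variable (inside aes()) and aesthetics set to a constant (outside aes(), like pch and alpha). A small sketch with the mpg dataset; the color "steelblue" is an arbitrary choice.

```r
library(ggplot2)

# Mapped: color varies with a data column, so it goes inside aes()
p_mapped = ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class))

# Set: one constant color for every point, so it goes outside aes()
p_set = ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "steelblue", alpha = 0.5)

# Compare the colors each layer will actually draw
length(unique(layer_data(p_mapped)$colour))  # one color per class
unique(layer_data(p_set)$colour)             # a single constant color
```

A common slip is writing aes(color = "steelblue"), which maps a one-value variable to the color scale instead of setting the color.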
Aesthetic inheritance Subsequent geometric layers will inherit the aesthetic mappings from the original ggplot() call unless they’re overwritten. Meaning these are equivalent:
ggplot(births_10k, aes(mage, wksgest))+
  geom_jitter()
ggplot(births_10k)+
  geom_jitter(aes(mage, wksgest))
#^ equivalent
Aesthetic inheritance 2 …but there’s good reason to be intentional about this – like when we want multiple geometries to use the same mappings
ggplot(births_10k, aes(mage, wksgest, color=cigdur))+
  geom_jitter(alpha=0.1) +
  geom_smooth(method="lm")
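Where a mapping lives changes what later layers do. A sketch with the mpg dataset, comparing color mapped globally against color mapped only in the point layer; the object names are illustrative.

```r
library(ggplot2)

# color mapped globally: geom_smooth() inherits it and fits one line per drv
p_global = ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# color mapped only in geom_point(): geom_smooth() sees no grouping
# and fits a single overall line
p_local = ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv), alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# Number of fitted lines in the smooth layer (layer 2) of each plot
n_lines_global = length(unique(layer_data(p_global, 2)$group))
n_lines_local  = length(unique(layer_data(p_local, 2)$group))
```

Neither version is more correct; the choice depends on whether the question is about each group or about the overall trend.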
Connecting dplyr to ggplot While we’re at it, we may want to send dplyr output (without saving) directly into ggplot. Example:
ggplot(births_10k, aes(mage, wksgest, color=cigdur))+
  geom_jitter(alpha=0.1) +
  geom_smooth() # smooth is unhappy. Why?
births_10k %>%
  filter(cigdur %in% c("Y", "N")) %>% # Note new logical operator, %in%!
  ggplot(aes(mage, wksgest, color=cigdur))+
  geom_jitter(alpha=0.1) +
  geom_smooth()
# Sends the dplyr'ed tibble / data.frame into
# the first param of ggplot... the data param!
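For anyone without the births file loaded, the same filter-then-plot pattern can be sketched on the mpg dataset. The choice of drv values is arbitrary, and the intermediate mpg_4f object is kept only so the filter result can be inspected.

```r
library(dplyr)
library(ggplot2)

# %in% tests membership in a set of values
c("4", "f", "r") %in% c("4", "f")   # TRUE TRUE FALSE

# Keep two drive types, then pipe the result straight into ggplot().
# Note the switch from %>% (between dplyr steps) to + (between ggplot layers).
mpg_4f = mpg %>% filter(drv %in% c("4", "f"))

p = mpg_4f %>%
  ggplot(aes(displ, hwy, color = drv)) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x)
```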
Common Visual EDA* workflow: • Pick some variables: mage, wksgest • Consider a geometry: e.g. geom_bin2d • Check the help file (F1, or type it in) • Peruse, maybe run the example code (can run right within help window) • Write our own! * EDA: Exploratory data analysis
Common Visual EDA workflow: • Pick some variables: e.g. mage, wksgest • Consider a geometry: e.g. geom_bin2d • Check the help file (F1, or type it in) • Peruse, maybe run the example code (can run right within help window). • Write our own!
ggplot(births_10k, aes(mage, wksgest))+
  geom_bin2d()
Note: Some blanks seem to be a function of our unassigned (so, default) bin width. We could do something like bins = max(births_10k$wksgest)-min(births_10k$wksgest) to fix this!
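One possible way to act on the note above, sketched on the mpg dataset rather than the births data: when both variables are whole numbers, setting binwidth to 1 in each direction is an alternative to computing bins by hand. This is a swapped-in approach (binwidth instead of bins), and it assumes integer-valued variables.

```r
library(ggplot2)

# cty and hwy are whole numbers, so a 1 x 1 bin holds exactly one
# (cty, hwy) combination and no bin straddles two values
p_bins = ggplot(mpg, aes(cty, hwy)) +
  geom_bin2d(binwidth = c(1, 1))

# Each row of the layer data is one non-empty bin with its count
bin_counts = layer_data(p_bins)
sum(bin_counts$count)  # every observation lands in exactly one bin
```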
4. Facets & Themes Small multiples and overall look and feel
Facets (wrap/grid): small multiples • Facets take an R formula object (e.g. . ~ x, y ~ x) and split your graph into small multiples based on that. • Can also “free the scales” so they aren’t shared across plots with scales = "free_x" and/or "free_y"
ggplot(births_10k, aes(mage, wksgest))+
  geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
  geom_smooth()+
  facet_wrap( ~ raceeth_f)
Facets (grid/wrap): small multiples • Facets take an R formula object (e.g. . ~ x, y ~ x) and split your graph into small multiples based on that. • Can also “free the scales” so they aren’t shared across plots with scales = "free_x" and/or "free_y"
ggplot(births_10k, aes(mage, wksgest))+
  geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
  geom_smooth()+
  facet_grid( ~ raceeth_f)
Facets (grid/wrap): small multiples • Facets take an R formula object (e.g. . ~ x, y ~ x) and split your graph into small multiples based on that. • Can also “free the scales” so they aren’t shared across plots with scales = "free_x" and/or "free_y"
ggplot(births_10k, aes(mage, wksgest))+
  geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
  geom_smooth()+
  facet_grid(methnic ~ mrace)
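The facet slides mention freeing the scales but do not show it. A short sketch on the mpg dataset; the faceting variables here are chosen only for illustration.

```r
library(ggplot2)

base_plot = ggplot(mpg, aes(displ, hwy)) + geom_point(alpha = 0.5)

# facet_wrap(): one panel per level of class, wrapped into rows;
# scales = "free_y" lets each panel pick its own y-axis range
p_wrap = base_plot + facet_wrap(~ class, scales = "free_y")

# facet_grid(): rows by drv, columns by cyl, following the y ~ x formula
p_grid = base_plot + facet_grid(drv ~ cyl)

# Count the panels that actually contain data
n_wrap_panels = length(unique(layer_data(p_wrap)$PANEL))
```

Free scales make within-panel patterns easier to see but make comparisons across panels harder, so shared scales are usually the safer default.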
Themes • Change the theme with theme_NAME(), e.g. theme_minimal(). • Can define your own themes, or tweak existing ones. • See https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html for more themes. More on ggplot extensions later!
ggplot(births_10k, aes(mage, wksgest))+
  geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
  geom_smooth()+
  facet_grid( ~ raceeth_f)+
  theme_minimal()
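As a sketch of “tweak existing ones”: a complete theme can be combined with theme() overrides and stored for reuse. The name my_theme and the particular tweaks are arbitrary choices for illustration.

```r
library(ggplot2)

# A reusable theme: start from theme_minimal() and override a few elements
my_theme = theme_minimal(base_size = 14) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())

# Add it to a plot like any other component
p_themed = ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  my_theme
```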
5. Preparing data to plot Thinking ahead
Back to data Sometimes we need to transform our data to get at the question we have in mind. We saw this with cigdur earlier. e.g. What is the association of preterm birth and maternal age?* * Again, we’re not approaching this from a stricter causal inference epidemiology frame in this workshop…
Preparing to plot: preterm/mage Might first try this as a plot of the relationship of preterm birth and maternal age using:
ggplot(births_sm)+geom_histogram(aes(x=mage, fill=preterm_f), position="fill")
Maybe suggestive of some meaning… but YUCK!
Preparing to plot: preterm/mage Let’s try again, but first create a new dataset to plot with. Use dplyr, group the original dataset by (mage), and summarize it into two variables: pct_preterm and n. Then create a plot of the relationship of preterm birth and maternal age.
magepreterm_df = births_sm %>%
  group_by(mage) %>%
  summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100,
            n=n())
ggplot(magepreterm_df, aes(mage, pct_preterm))+
  geom_point(aes(size = n))+
  geom_smooth(aes(weight=n))
# ^ Note the weight & size aesthetics!
Preparing to plot: preterm/mage Could do this all in one pass, if we don’t need to keep working with the temporary dataset. Readable!
births_sm %>%
  group_by(mage) %>%
  summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100,
            n=n()) %>%
  ggplot(aes(mage, pct_preterm))+
  geom_point(aes(size = n))+
  geom_smooth(aes(weight=n))
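The summarise-then-plot pattern can be tried without the births file. The sketch below uses simulated data, not real results: the sample size, the risk formula, and the "Term" level name are all assumptions made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the births data (NOT real results):
# preterm risk is made to rise slightly with maternal age
set.seed(1)
mage = sample(15:45, 5000, replace = TRUE)
preterm_f = ifelse(runif(5000) < 0.08 + 0.002 * (mage - 15),
                   "Preterm", "Term")
fake_births = data.frame(mage, preterm_f)

# One row per maternal age: percent preterm and group size
by_age = fake_births %>%
  group_by(mage) %>%
  summarise(pct_preterm = mean(preterm_f == "Preterm") * 100,
            n = n())

# size = n shows how much data sits behind each point;
# weight = n lets larger groups count for more in the smoother
p_age = ggplot(by_age, aes(mage, pct_preterm)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n), method = "loess", formula = y ~ x)
```

With real data, ages at the extremes tend to have small n, which is the situation where the weight aesthetic matters most.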
Other plots to explore Starting to speak to some of the reality of the data. How might we interpret these? What do you see?
Plots & Interpretation
ggplot(births_10k, aes(wksgest))+
  geom_histogram(bins=25)
* Note we don’t (yet) know how to clean up axes, labels, legends, etc. Just aiming for the functional look of the plot this time (see upper right.) But this is typical of iterative development – aim for visual communication, then hone details.
Plots & Interpretation
births_10k %>%
  filter(cigdur %in% c("N", "Y")) %>%
  ggplot(aes(wksgest, fill=cigdur))+
  geom_density(adjust=2, alpha=.5)+theme_minimal()
Plots & Interpretation
births_10k %>%
  filter(cigdur %in% c("N", "Y")) %>%
  ggplot(aes(cigdur, wksgest, color=cigdur, fill=cigdur))+
  geom_boxplot(alpha=.5)+
  facet_wrap(~pnc5_f)+
  theme_minimal()
Plots & Interpretation
county_stats = births_sm %>%
  group_by(county_name, cores) %>%
  summarise(pct_preterm = mean(preterm, na.rm=T)*100,
            pct_earlyPNC = mean(pnc5, na.rm=T)*100,
            n=n())
ggplot(county_stats, aes(pct_earlyPNC, pct_preterm, size=n))+
  geom_point()+
  geom_smooth()
Plots & Interpretation
births_sm %>%
  group_by(raceeth_f) %>%
  summarise(preterm = mean(preterm, na.rm=T),
            pnc5 = mean(pnc5, na.rm=T)) %>%
  gather(measure, percent, -raceeth_f) %>%
  mutate(percent = percent * 100) %>%
  ggplot(aes(measure, percent, fill=raceeth_f, group=raceeth_f,
             label=round(percent, 1))) +
  geom_bar(stat="identity", position="dodge")+
  geom_text(position=position_dodge(.9), aes(y=percent+3), hjust=0.5)
Plots & Interpretation
births_10k %>%
  filter(cigdur %in% c("N", "Y")) %>%
  ggplot(aes(wksgest, mage, color=cigdur))+
  geom_density_2d(alpha=.5)+theme_minimal()
Worth a skim: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html • http://chartmaker.visualisingdata.com/