
R for Epi Workshop Module 4: Demos

Mike Dolan Fliss, MSW, MPS. PhD Candidate in Epidemiology, UNC Gillings School of Global Public Health.




  1. R for Epi Workshop Module 4: Demos. Mike Dolan Fliss, MSW, MPS, PhD Candidate in Epidemiology, UNC Gillings School of Global Public Health

  2. MO this time: It’s been a long, dense day: Just sit back and enjoy! I’ll demonstrate some fancier packages and workflows to give a sense of the breadth of what’s possible with R. Some has code, some doesn’t. I use much of this in practice in research or in work with the state. Question and answer after: either save up or interject!

  3. Module Outline • Data from the internet, packages, APIs • Data out to excel tables • Maps • Models • purrr • Markdown • Shiny websites • Projects, packages & git • RStudio tips • …Questions?

  4. 1. Data from the internet Base R, googlesheets, rvest, APIs, and more

  5. Data from the internet Data in a table or file somewhere on the internet, served over HTTP or FTP? Just open it with the URL as the file name. read_csv("http://bit.ly/nc_county_names") It’s that simple. Also, URLs in your comments are clickable!

  6. Data from the internet - google Data in Google Drive? No problem: googledrive::drive_download(), and the upcoming googlesheets4 package. It validates your credentials in a web browser, and you have access to all your own data. Great for collaborating on datasets, connecting “live” as the data develops, etc.

  7. Data from the internet - google Recent use cases • Collaborating on the recoding of a dataset of private name rebranding (e.g. Gillings) of public health schools • Collaborating with others on a case definition for child maltreatment – read the newest candidate case def, then apply to real data locally

  8. Data from the internet - google

  9. Data from the internet - google
#..........................
# Pull directly from web ####
google_sheet_location = "[REDACTED]"
gs = googlesheets::gs_url(google_sheet_location)
case_def = googlesheets::gs_read(gs)
names(case_def) = tolower(names(case_def))
case_def
# Note this is an outdated method. See: https://github.com/jennybc/googlesheets#readme
# In short, use googledrive or eventually googlesheets4
#..........................
# Case def parts histogram ####
case_def %>%
  group_by(maltreatment_type) %>%
  summarise(n = n()) %>%
  filter(!is.na(maltreatment_type)) %>%
  ggplot(aes(maltreatment_type, n, label = n, fill = maltreatment_type)) +
  geom_col(show.legend = F) +
  geom_text(size = 5, nudge_y = 1) +
  labs(title = "Case definition parts per maltreatment type")

  10. Data from the internet - scraping Data in a webpage? Scrape it off into some clean tables with rvest. Requires a bit of knowledge of HTML and CSS, though. Example: opioid policy dashboard - SEP programs measure https://www.ncdhhs.gov/divisions/public-health/north-carolina-safer-syringe-initiative/syringe-exchange-programs-north

  11. Data from the internet - scraping
library(rvest)
SEP_website = "https://www.ncdhhs.gov/divisions/public-health/north-carolina-safer-syringe-initiative/syringe-exchange-programs-north"
SEPs = read_html(SEP_website) %>%
  html_nodes(".field-item p") %>%
  html_text()
# Extract county info from "serving..." text
counties = SEPs %>%
  tolower() %>%
  gsub(pattern = " ", replacement = " ") %>%  # likely a non-breaking space in the original
  str_extract("serving(.*)count") %>%
  gsub(pattern = "serving ", replacement = "") %>%
  gsub(pattern = " count", replacement = "")
# Save as table and write it out.
SEP_tbl = tibble(SEP_text = SEPs, counties = counties)
write_csv(SEP_tbl, "SEP_tbl.csv")

  12. Data from the internet - scraping

  13. Data from the internet - sFTP Data on some sFTP server somewhere? R’s got you covered with secure connections. Example: communicating with a state sFTP server through the firewall, using a whitelisted IP address and a public/private key, to run some cleaning scripts on data submitted by partners

  14. Data from the internet - sFTP
# Uses RCurl (getURL, ftpUpload, getURLContent); sftp_location, project_folder,
# and sftp_curl_options are defined elsewhere.
# Get base directory list
base_dir_names = getURL(url = sftp_location, verbose = T, port = 22,
                        .opts = sftp_curl_options, dirlistonly = TRUE) %>%
  str_split("\n")
# Get files in folder
test_filenames = getURL(url = paste0(sftp_location, "/mdfliss/"), verbose = T, port = 22,
                        .opts = sftp_curl_options, dirlistonly = TRUE) %>%
  str_split("\n")
test_filenames = getURL(url = paste0(sftp_location, project_folder), verbose = T, port = 22,
                        .opts = sftp_curl_options, dirlistonly = TRUE) %>%
  str_split("\n")
# Test an upload - works!
upload_test = ftpUpload("testfile.csv",
                        to = paste0(sftp_location, project_folder, "testfile.csv"),
                        .opts = sftp_curl_options)
str(upload_test)  # if 0, no errors, then it worked.
# Test a download - works!
test_dl = getURLContent(url = paste0(sftp_location, project_folder, "testfile.csv"),
                        verbose = T, .opts = sftp_curl_options) %>%
  read_csv
test_dl

  15. Data from the internet - packages Many common data resources that have APIs also have R packages built to access them, e.g. tidycensus and others for the census API.
chapel_hill_data = tidycensus::get_acs(geography = "place", state = "NC", survey = "acs5",
                                       year = 2016, output = "wide",
                                       variables = c("B03002_001E", "B03002_003E", "B03002_004E"),
                                       key = "[REDACTED]") %>%
  rename(totpop = B03002_001E, WnH = B03002_003E, BnH = B03002_004E) %>%
  select(GEOID, NAME, totpop, WnH, BnH) %>%
  filter(grepl("Chapel Hill", NAME))

  16. Data from the internet - packages Works for the underlying shapefiles as well!
nc_counties = tidycensus::get_acs(geography = "county", state = "NC", survey = "acs5",
                                  year = 2016, output = "wide", geometry = T,
                                  variables = c("B03002_001E"),
                                  key = "[REDACTED]") %>%
  rename(totpop = B03002_001E) %>%
  select(GEOID, NAME, totpop)
plot(nc_counties %>% select(totpop))

  17. Data from the internet - APIs If there’s no package wrapping the API you need, you can write your own access code. Example from a state alcohol & public health Tableau dashboard:

  18. Data from the internet - APIs
# Read from CDC community guide. Mike Dolan Fliss, 11/27/2018
library(httr)
library(jsonlite)
library(tidyverse)
setwd("D:/User/Dropbox (Personal)/Work/IVPB/CDC Community Guide API")
api_key = [REDACTED STRING]  # If you're reading this, you should request your own key!
url = "https://www.thecommunityguide.org/search/api/json/"
filters_url = paste0("https://www.thecommunityguide.org/terms/api/json?key=", api_key)
get_cdc_guide_table = function(term){
  search_url = paste0(url, term, "?key=", api_key)
  raw = GET(url = search_url)  # execute the search
  json_object = fromJSON(rawToChar(raw$content))  # convert from raw bytes to JSON, then list
  str(json_object, max.level = 2)  # structure of the returned JSON object
  content_tbl = as_tibble(json_object$records)  # cast to friendly table format
  return(content_tbl)
}
get_cdc_guide_table("alcohol") %>% write_csv("alcohol_tbl.csv")
get_cdc_guide_table("violence") %>% write_csv("violence_tbl.csv")
get_cdc_guide_table("*") %>% write_csv("all_tbl.csv")

  19. Data from the internet - APIs

  20. Data from the internet - APIs https://public.tableau.com/profile/nc.injury.and.violence.prevention.branch#!/vizhome/NCAlcoholDataDashboard/Story

  21. Side note: data from elsewhere? Not internet, but the haven, readxl, sas7bdat and other packages help you read from and write to Stata, SAS, Excel, etc. Got a SQL server? You can directly write SQL in R… but also, dplyr verbs are overloaded to automatically generate SQL when attached to an object representing a connection to a server. Got a … whatever? R can probably read data from it.
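A minimal sketch of those readers, with hypothetical file names (swap in your own); haven handles Stata/SAS, readxl handles Excel:

```r
library(haven)   # Stata, SAS, SPSS
library(readxl)  # Excel

# Hypothetical file names -- placeholders, not files from this workshop.
births_stata = read_dta("births.dta")                # Stata
births_sas   = read_sas("births.sas7bdat")           # SAS
births_xlsx  = read_excel("births.xlsx", sheet = 1)  # Excel
write_dta(births_stata, "births_out.dta")            # haven writes back out, too
```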

  22. 2. Data out Clipboard, excel, table one, email…

  23. Data out - clipboard You can send data out in non-standard ways… like the clipboard! Often nice to paste into Excel, an email, etc. And remember, literally everything is an object, and almost everything can be turned into a table, so you can send summaries, model results, etc. out as clean tables.
write.table(county_stats, "clipboard", sep="\t", row.names = F)
You can also bring data in that way, though I rarely do. R is almost always platform independent… but if you’re working with the file system, your code has to care which platform it’s on. So this command is a little different on a Mac: https://stackoverflow.com/questions/14547069/how-to-write-from-r-to-the-clipboard-on-a-mac

  24. Data out - Excel My favorite workflows for custom formatted excel tables: • Output dplyr table to a specific range or clipboard • Custom design a table (or smarter dashboard) • Link raw data to the designed table with live links • Use the live camera tool to have a picture of your table you can paste into an email or otherwise.

  25. Data out - Excel Sky’s the limit: Construct multiple tables of results and inject them precisely the range you want them in an excel workbook of final tables. Rerun your analysis, and that excel workbook is constantly up to date. Extra fancy: put all the data you need into a smartly designed Excel book, then use VBA to automate spitting out multiple well-designed PDFs.
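One way to sketch the “inject into a precise range” step is the openxlsx package (the workbook name and sheet name here are placeholders, and county_stats is the summary table from earlier):

```r
library(openxlsx)

# Write the raw results into a known range of a known sheet...
wb = createWorkbook()
addWorksheet(wb, "raw_data")
writeData(wb, "raw_data", county_stats, startRow = 2, startCol = 2)
saveWorkbook(wb, "final_tables.xlsx", overwrite = TRUE)
# ...then a separate, hand-formatted sheet can reference raw_data with live links,
# so rerunning the analysis refreshes the designed tables.
```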

  26. Data out - Excel

  27. Data out - Excel shell.exec("my_workbook.xlsx") will open that file in its default application (Windows only)… or run a script from another language, for instance.

  28. Data out: tableone
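This slide demos the tableone package; a hedged sketch, reusing births_sm variable names from earlier slides (assumed here):

```r
library(tableone)

# Stratified "Table 1" of selected variables; vars and strata are assumptions.
tab1 = CreateTableOne(vars = c("mage", "preterm", "pnc5"),
                      strata = "raceeth",
                      data = births_sm,
                      factorVars = c("preterm", "pnc5"))
print(tab1, showAllLevels = TRUE)
# The printed matrix can be captured and written out as a clean table:
# write.csv(print(tab1, printToggle = FALSE), "table1.csv")
```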

  29. Data out - Email Have a longer program? Have your R script email you the results when it finishes. A few packages for this: mailR, gmailr, etc. https://github.com/jennybc/send-email-with-r https://github.com/jimhester/gmailr
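For instance, a minimal mailR sketch (addresses, SMTP host, and credentials are placeholders, not a working configuration):

```r
library(mailR)

# Placeholder addresses and SMTP settings -- fill in your own.
send.mail(from = "me@example.com",
          to = "me@example.com",
          subject = "Analysis finished",
          body = "Your long-running R script is done.",
          smtp = list(host.name = "smtp.example.com", port = 587,
                      user.name = "me@example.com", passwd = "[REDACTED]",
                      ssl = TRUE),
          authenticate = TRUE,
          send = TRUE)
```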

  30. Data out – Ping! Data’s ready? Want a notification as you’re working? library(beepr) beepr::beep() beepr::beep(sound = "fanfare")

  31. 3. Maps sp, sf, spatial analysis

  32. Maps • R is great at spatial analysis and merging data, and can make pretty good maps or work in tandem with your dedicated GIS. • This is not a mapping lecture! So just the basics: • ESRI formats & parts of a spatial object • sp vs. sf • Choropleths • Spatial analysis

  33. Maps Using county_stats from before, we can join to spatial data we loaded before (or read from shapefile, etc.). You already know how, actually:
county_stats = births_sm %>%
  group_by(county_name, cores) %>%
  summarise(pct_preterm = mean(preterm, na.rm = T) * 100,
            pct_earlyPNC = mean(pnc5, na.rm = T) * 100,
            n = n())
nc_counties_wdata = nc_counties %>%
  left_join(county_stats %>% mutate(GEOID = as.character(cores + 37000)))

  34. Maps Now we have results bound with spatial data that we can map, whether quickly in base R… plot(nc_counties_wdata %>% select(pct_preterm)) plot(nc_counties_wdata %>% select(pct_preterm, pct_earlyPNC, totpop))

  35. Maps …or get easier control with ggplot…
ggplot() +
  geom_sf(data = nc_counties_wdata, aes(fill = pct_preterm)) +
  scale_fill_viridis_c(option = "cividis") +
  labs(title = "% Preterm Birth", subtitle = "Data from 2012 births, SCHS") +
  theme_minimal()

  36. Maps Or for complete multi-layer control in a true GIS (like ArcGIS or free QGIS), just export as a layer! st_write(nc_counties_wdata, "nc_counties_wdata.shp", delete_dsn = T) Great GIS workflow – carefully design the look and feel based on your layers, then use R to rebuild the layers “underneath” QGIS or ArcGIS. Close, rerun, reopen your GIS project and all the data has been updated, everything refreshed.

  37. Maps Spatial Analysis: • Geocoding • Buffers, distances • Neighbors • Cluster analysis, space-time analysis Examples Galore: Title VI Complaint. My first big R project
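The buffer, distance, and neighbor bullets can be sketched with sf, assuming nc_counties (county polygons in a projected CRS, from earlier slides) and a hypothetical point layer sep_sites:

```r
library(sf)

# Hypothetical layers: sep_sites (points) and nc_counties (polygons), same projected CRS.
site_buffers = st_buffer(sep_sites, dist = 10000)                # 10 km buffers (CRS units)
site_dists   = st_distance(sep_sites, st_centroid(nc_counties))  # site-to-centroid distances
county_nbs   = st_touches(nc_counties)                           # neighbors: shared borders
```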

  38. Maps – Title VI complaint, public comment

  39. Maps – Alcohol Outlets

  40. 4. Models Basic lm/glms, broom

  41. Models Models are not only easy to run; it’s also easy to capture their outputs. glm(data=births_sm, preterm ~ pnc5_f) # RD. Same as lm(...)

  42. Models Model families and link functions can be specified:
glm(data = births_sm, preterm_f ~ pnc5_f, family = binomial("identity")) # RD
glm(data = births_sm, preterm ~ pnc5_f, family = poisson("log"))         # RR
glm(data = births_sm, preterm_f ~ pnc5_f, family = binomial("logit"))    # OR

  43. Models Model objects can be (and typically are) stored for later summarization or results extraction. Interaction, etc. is easy.
preterm_model = glm(data = births_sm, preterm ~ pnc5_f*raceeth + mage + I(mage^2))
summary(preterm_model)
confint(preterm_model)

  44. Models Broom provides methods to collapse the results of models into rectangular data.frames. And not just glms, but dozens of other model types, like mixed / multi-level models, etc. broom::tidy(preterm_model) %>% bind_cols(broom::confint_tidy(preterm_model))

  45. Models So, so many models. • Generalized Linear Models (glms) • Tree-based (e.g. recursive partitioning trees) • Multi-level / mixed models (see nlme and lme4) • Geographically weighted models (gwr) • Think of something, there’s probably a package for it, maybe even an associated journal article with the math and rationale
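For example, a hedged multi-level sketch with lme4 (random intercept for county; reuses the births_sm names from the earlier Models slides as an assumption):

```r
library(lme4)

# Random intercept for county; binomial/logit as on the earlier slides.
ml_model = glmer(preterm ~ pnc5_f + mage + (1 | county_name),
                 data = births_sm, family = binomial("logit"))
summary(ml_model)
broom.mixed::tidy(ml_model)  # the broom.mixed companion package tidies mixed models
```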

  46. 5. purrr Power tools for lists, with so many applications

  47. Purrr – power tools for lists You used an apply statement earlier to act on many congenital anomaly variables at once. purrr provides consistent, strong tools for these kinds of actions. • Minimizes repeated code • Enables analyses hard to do any other way • Enables higher-level problem solving • Replaces loops (for, while, etc.) • Enables parallel / distributed computing
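A tiny, self-contained example of the loop-replacement bullet, using the built-in mtcars data:

```r
library(purrr)

# The for-loop way: accumulate a named vector of column means.
means = c()
for (col in names(mtcars)) means[col] = mean(mtcars[[col]])

# The purrr way: a data.frame is a list of columns, so map over it.
means2 = map_dbl(mtcars, mean)

all.equal(unname(means), unname(means2))  # TRUE
```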

  48. Purrr side note – why lists? R’s object types derive from more basic classes. E.g. • A vector is a single-level list of homogeneous type • A data.frame is a list of vectors, each with their own (different) homogeneous type, but equal length • An sf map object is a data.frame (actually, tibble) of regular data, with the last “column” a vector of lists of geographic data (e.g. points, projection information) • A model is a list of model-related parts. • You can create lists of anything – functions, other lists, whatever. $ or [[…]] to subset lists. You’ve been doing this already!
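A base-R illustration of the “everything is a list” point:

```r
# Lists can hold anything -- vectors, functions, other lists.
parts = list(nums = 1:3, fn = mean, nested = list(a = "hi"))
parts$nums           # 1 2 3
parts[["fn"]](1:10)  # 5.5 -- a function stored in a list, then called
parts$nested$a       # "hi"

# A fitted model is itself a named list of parts:
m = lm(mpg ~ wt, data = mtcars)
names(m)             # "coefficients", "residuals", ...
m$coefficients       # ...so $ pulls out any piece
```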

  49. Purrr – Intro motivation: stop repeating yourself. Stop repeating yourself.
map_lgl(births_sm, is.numeric) # following?
births_sm[, map_lgl(births_sm, is.numeric)] %>%
  mutate_all(.funs = scale) %>%
  tibble()

  50. Purrr – many real world uses • Act similarly on all the variables of a dataset. • …or files of a directory • …or neighbors of spatial objects • …or stratified datasets • …or model results Organizing a multi-part study / simulation: Aim 1 of my dissertation is a simulation study. I load parameters in a tibble (powered up data.frame), then run the simulations using purrr, storing the resulting data, models, and summary statistics for each simulation in an ordered way.
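The “files of a directory” bullet, for example, is one map call (the directory and file pattern here are hypothetical):

```r
library(tidyverse)  # purrr, readr

# Read every CSV in data/ into one stacked tibble, tagged by source file.
monthly = list.files("data/", pattern = "\\.csv$", full.names = TRUE) %>%
  set_names() %>%
  map_dfr(read_csv, .id = "source_file")
```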
