Data Manipulation with R: A Comprehensive Guide

R與資料(數據)操作Data Manipulation with R 講者：陳慶文 Ching-Wen Chen, Ph. D. 國立高雄第一科技大學資訊管理系教授 Dept. of I.M., NKFUST中華民國106年3月5日

為何用R • 公開碼源、自由使用且免費 • 進入門檻低(但不易專精) • 參予「生產」套件的使用者多(不必受制於「菁英」所寫的內容)，且品質有「人」把關 • 論壇多，找尋答案容易。 • 「我」不喜歡使用套裝軟體，因為「貴」。$會阻礙知識的獲取與傳播，不利於「不富裕」的使用者學習。

R是什麼 • R is a language and environment for statistical computing and graphics. • The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools. • R is an integrated suite of software facilities for data manipulation, calculation and graphical display. • R can be extended (easily) via packages. • Read yourselves: https://www.r-project.org/about.html

資料操作的範疇 Data Scientist vs. Data Engineer

Packages in Use • eight core packages (base, stats,…) • Documentation for package ‘base’ version 3.4.0 • dplyr(main focus) • ggplot2

Data from package {nycflights13} • Description Airline on-time data for all flights departing NYC in 2013. Also includes useful 'metadata' on airlines, airports, weather, and planes • URL http://github.com/hadley/nycflights13 • airlines • airports • flights • planes • weather

Relational data

dplyr: two table verbs • Mutating joins • Filtering joins • Set operations

Mutating joins

dplyr: Single table verbs • filter() (and slice()) • arrange() • select() (and rename()) • distinct() • mutate() (and transmute()) • summarise() • sample_n() (and sample_frac())

Filter rows with filter() library(nycflights13) library(dplyr) filter(flights, month == 1, day == 1) View(filter(flights, month == 1, day == 1)) flights[flights$month == 1 & flights$day == 1, ] filter(flights, month == 1 | month == 2) slice(flights, 1:10) slice(flights, 11101:11110)

Arrange rows with arrange() arrange(flights, year, month, day) arrange(flights, desc(arr_delay)) flights[order(flights$year, flights$month, flights$day), ] flights[order(flights$arr_delay, decreasing = TRUE), ] flights[order(-flights$arr_delay), ]

Select columns with select() select(flights, year, month, day) select(flights, year:day) select(flights, -(year:day)) select(flights, tail_num = tailnum) rename(flights, tail_num = tailnum)

Extract distinct (unique) rows distinct(flights, tailnum) distinct(flights, origin, dest)

Add new columns with mutate() mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60) mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60)) transform(flights, gain = arr_delay - delay, gain_per_hour = gain / (air_time / 60)) transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )

Summarise values with summarise() summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) sample_n(flights, 10) sample_frac(flights, 0.01)

Grouped operations by_tailnum <- group_by(flights, tailnum) delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delay <- filter(delay, count > 20, dist < 2000)

Plot:delay vs. distance library(ggplot2) ggplot(delay, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area()

A handful of others destinations <- group_by(flights, dest) summarise(destinations, planes = n_distinct(tailnum), flights = n()) daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, flights = n()) (per_month <- summarise(per_day, flights = sum(flights))) (per_year <- summarise(per_month, flights = sum(flights)))

Chaining a1 <- group_by(flights, year, month, day) a2 <- select(a1, arr_delay, dep_delay) a3 <- summarise(a2, arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE)) a4 <- filter(a3, arr > 30 | dep > 30) a4

Pipes filter( summarise( select( group_by(flights, year, month, day), arr_delay, dep_delay ), arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ), arr > 30 | dep > 30 ) flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30)

dplyr and pipes: the basics • Verbs make up the majority of the data manipulation • Select certain columns of data. • Filter your data to select specific rows. • Arrange the rows of your data into an order. • Mutate your data frame to contain new columns. • Summarise chunks of you data in some way.

The data pantheria <- "http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt" download.file(pantheria, destfile = "mammals.txt") mammals <- read.table("mammals.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE) str(mammals) View(head(mammals)) # cleaning data names(mammals) <- sub("X[0-9._]+", "", names(mammals)) names(mammals) <- sub("MSW05_", "", names(mammals)) mammals <- dplyr::select(mammals, Order, Binomial, AdultBodyMass_g, AdultHeadBodyLen_mm, HomeRange_km2, LitterSize) names(mammals) <- gsub("([A-Z])", "_\\L\\1", names(mammals), perl = TRUE) names(mammals) <- gsub("^_", "", names(mammals), perl = TRUE) mammals[mammals == -999] <- NA names(mammals)[names(mammals) == "binomial"] <- "species" mammals <- dplyr::tbl_df(mammals) # for prettier printing

Looking at the data library(dplyr) mammals glimpse(mammals)

Selecting columns select(mammals, adult_head_body_len_mm) select(mammals, adult_head_body_len_mm, litter_size) select(mammals, adult_head_body_len_mm:litter_size) select(mammals, -adult_head_body_len_mm) select(mammals, contains("body")) View(select(mammals, starts_with("adult"))) select(mammals, ends_with("g")) select(mammals, 1:3)

Filtering rows View(filter(mammals, adult_body_mass_g > 1e7)[ , 1:3]) filter(mammals, species == "Balaena mysticetus") filter(mammals, order == "Carnivora" & adult_body_mass_g < 200)

Arranging rows arrange(mammals, adult_body_mass_g)[ , 1:3] arrange(mammals, desc(adult_body_mass_g))[ , 1:3] arrange(mammals, order, adult_body_mass_g)[ , 1:3]

Mutating columns glimpse(mutate(mammals, adult_body_mass_kg = adult_body_mass_g / 1000)) glimpse(mutate(mammals, g_per_mm = adult_body_mass_g / adult_head_body_len_mm)) glimpse(mutate(mammals, g_per_mm = adult_body_mass_g / adult_head_body_len_mm, kg_per_mm = g_per_mm / 1000))

Summarising columns summarise(mammals, mean_mass = mean(adult_body_mass_g, na.rm = TRUE)) # summarise with group_by: head(summarise(group_by(mammals, order), mean_mass = mean(adult_body_mass_g, na.rm = TRUE)))

Piping data # ?magrittr::`%>%` x <- rnorm(10) x %>% max # is the same thing as: max(x) mammals %>% arrange(adult_body_mass_g)

Piping data (Cont.) ## Say we wanted to find the species with the highest body-mass-to-length ratio: mammals %>% mutate(mass_to_length = adult_body_mass_g / adult_head_body_len_mm) %>% arrange(desc(mass_to_length)) %>% select(species, mass_to_length)

Piping data (Cont.) ## The above is equivalent to: select( arrange( mutate(mammals, mass_to_length = adult_body_mass_g / adult_head_body_len_mm), desc(mass_to_length)), species, mass_to_length)

Piping data (Cont.) ## taxonomic orders have a median litter size greater than 3 mammals %>% group_by(order) %>% summarise(median_litter = median(litter_size, na.rm = TRUE)) %>% filter(median_litter > 3) %>% arrange(desc(median_litter)) %>% select(order, median_litter)

Good reference of dplyr dplyr: A Grammar of DataManipulation

結語 Manipulating data is that process of re-sorting, rearranging and otherwise moving your research data, without fundamentally changing it. One of the key characteristics of a manipulation technique versus related techniques like transformation is that the underlying data remains unchanged. The main thing we’re doing is changing the relationship – logical or physical – that one piece of data has with another.

謝謝聆聽! Q&A

Data Manipulation with R: A Comprehensive Guide

Data Manipulation with R: A Comprehensive Guide

Presentation Transcript

RPG IV

Oracle

Advanced data manipulation

SQL- Data Manipulation Language

Optimal Manipulation of Voting Rules

Data Structures and Manipulation

Data Manipulation Language (DML)

3D Object Manipulation in a Single Photograph using Stock 3D Models

Discovering accessibility, display, and manipulation of data in a data portal

Chapter 5 Data Manipulation and Transaction Control Oracle 10 g : SQL

SQL Data Manipulation Language(DML) in Oracle's Dialect

Chapter 2: Data Manipulation

21. Direct Manipulation

Leveraging WebFOCUS MAINTAIN to Maximize Data Retrieval and Manipulation

Information manipulation possibilities with EMu or What can we do with all this data?

Action Research Data Manipulation and Crosstabs

Data Manipulation, part two

LECTURE Notes

SQL