370 likes | 379 Views
Learn how to manipulate and analyze data using the R programming language. This session covers key packages, functions, and techniques for data manipulation in R.
E N D
R與資料(數據)操作Data Manipulation with R 講者:陳慶文 Ching-Wen Chen, Ph. D. 國立高雄第一科技大學資訊管理系教授 Dept. of I.M., NKFUST中華民國106年3月5日
為何用R • 公開碼源、自由使用且免費 • 進入門檻低(但不易專精) • 參予「生產」套件的使用者多(不必受制於「菁英」所寫的內容),且品質有「人」把關 • 論壇多,找尋答案容易。 • 「我」不喜歡使用套裝軟體,因為「貴」。$會阻礙知識的獲取與傳播,不利於「不富裕」的使用者學習。
R是什麼 • R is a language and environment for statistical computing and graphics. • The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools. • R is an integrated suite of software facilities for data manipulation, calculation and graphical display. • R can be extended (easily) via packages. • Read yourselves: https://www.r-project.org/about.html
資料操作的範疇 Data Scientist vs. Data Engineer
Packages in Use • eight core packages (base, stats,…) • Documentation for package ‘base’ version 3.4.0 • dplyr(main focus) • ggplot2
Data from package {nycflights13} • Description Airline on-time data for all flights departing NYC in 2013. Also includes useful 'metadata' on airlines, airports, weather, and planes • URL http://github.com/hadley/nycflights13 • airlines • airports • flights • planes • weather
dplyr: two table verbs • Mutating joins • Filtering joins • Set operations
dplyr: Single table verbs • filter() (and slice()) • arrange() • select() (and rename()) • distinct() • mutate() (and transmute()) • summarise() • sample_n() (and sample_frac())
Filter rows with filter() library(nycflights13) library(dplyr) filter(flights, month == 1, day == 1) View(filter(flights, month == 1, day == 1)) flights[flights$month == 1 & flights$day == 1, ] filter(flights, month == 1 | month == 2) slice(flights, 1:10) slice(flights, 11101:11110)
Arrange rows with arrange() arrange(flights, year, month, day) arrange(flights, desc(arr_delay)) flights[order(flights$year, flights$month, flights$day), ] flights[order(flights$arr_delay, decreasing = TRUE), ] flights[order(-flights$arr_delay), ]
Select columns with select() select(flights, year, month, day) select(flights, year:day) select(flights, -(year:day)) select(flights, tail_num = tailnum) rename(flights, tail_num = tailnum)
Extract distinct (unique) rows distinct(flights, tailnum) distinct(flights, origin, dest)
Add new columns with mutate() mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60) mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60)) transform(flights, gain = arr_delay - delay, gain_per_hour = gain / (air_time / 60)) transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )
Summarise values with summarise() summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) sample_n(flights, 10) sample_frac(flights, 0.01)
Grouped operations by_tailnum <- group_by(flights, tailnum) delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delay <- filter(delay, count > 20, dist < 2000)
Plot:delay vs. distance library(ggplot2) ggplot(delay, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area()
A handful of others destinations <- group_by(flights, dest) summarise(destinations, planes = n_distinct(tailnum), flights = n()) daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, flights = n()) (per_month <- summarise(per_day, flights = sum(flights))) (per_year <- summarise(per_month, flights = sum(flights)))
Chaining a1 <- group_by(flights, year, month, day) a2 <- select(a1, arr_delay, dep_delay) a3 <- summarise(a2, arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE)) a4 <- filter(a3, arr > 30 | dep > 30) a4
Pipes filter( summarise( select( group_by(flights, year, month, day), arr_delay, dep_delay ), arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ), arr > 30 | dep > 30 ) flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30)
dplyr and pipes: the basics • Verbs make up the majority of the data manipulation • Select certain columns of data. • Filter your data to select specific rows. • Arrange the rows of your data into an order. • Mutate your data frame to contain new columns. • Summarise chunks of you data in some way.
The data pantheria <- "http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt" download.file(pantheria, destfile = "mammals.txt") mammals <- read.table("mammals.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE) str(mammals) View(head(mammals)) # cleaning data names(mammals) <- sub("X[0-9._]+", "", names(mammals)) names(mammals) <- sub("MSW05_", "", names(mammals)) mammals <- dplyr::select(mammals, Order, Binomial, AdultBodyMass_g, AdultHeadBodyLen_mm, HomeRange_km2, LitterSize) names(mammals) <- gsub("([A-Z])", "_\\L\\1", names(mammals), perl = TRUE) names(mammals) <- gsub("^_", "", names(mammals), perl = TRUE) mammals[mammals == -999] <- NA names(mammals)[names(mammals) == "binomial"] <- "species" mammals <- dplyr::tbl_df(mammals) # for prettier printing
Looking at the data library(dplyr) mammals glimpse(mammals)
Selecting columns select(mammals, adult_head_body_len_mm) select(mammals, adult_head_body_len_mm, litter_size) select(mammals, adult_head_body_len_mm:litter_size) select(mammals, -adult_head_body_len_mm) select(mammals, contains("body")) View(select(mammals, starts_with("adult"))) select(mammals, ends_with("g")) select(mammals, 1:3)
Filtering rows View(filter(mammals, adult_body_mass_g > 1e7)[ , 1:3]) filter(mammals, species == "Balaena mysticetus") filter(mammals, order == "Carnivora" & adult_body_mass_g < 200)
Arranging rows arrange(mammals, adult_body_mass_g)[ , 1:3] arrange(mammals, desc(adult_body_mass_g))[ , 1:3] arrange(mammals, order, adult_body_mass_g)[ , 1:3]
Mutating columns glimpse(mutate(mammals, adult_body_mass_kg = adult_body_mass_g / 1000)) glimpse(mutate(mammals, g_per_mm = adult_body_mass_g / adult_head_body_len_mm)) glimpse(mutate(mammals, g_per_mm = adult_body_mass_g / adult_head_body_len_mm, kg_per_mm = g_per_mm / 1000))
Summarising columns summarise(mammals, mean_mass = mean(adult_body_mass_g, na.rm = TRUE)) # summarise with group_by: head(summarise(group_by(mammals, order), mean_mass = mean(adult_body_mass_g, na.rm = TRUE)))
Piping data # ?magrittr::`%>%` x <- rnorm(10) x %>% max # is the same thing as: max(x) mammals %>% arrange(adult_body_mass_g)
Piping data (Cont.) ## Say we wanted to find the species with the highest body-mass-to-length ratio: mammals %>% mutate(mass_to_length = adult_body_mass_g / adult_head_body_len_mm) %>% arrange(desc(mass_to_length)) %>% select(species, mass_to_length)
Piping data (Cont.) ## The above is equivalent to: select( arrange( mutate(mammals, mass_to_length = adult_body_mass_g / adult_head_body_len_mm), desc(mass_to_length)), species, mass_to_length)
Piping data (Cont.) ## taxonomic orders have a median litter size greater than 3 mammals %>% group_by(order) %>% summarise(median_litter = median(litter_size, na.rm = TRUE)) %>% filter(median_litter > 3) %>% arrange(desc(median_litter)) %>% select(order, median_litter)
Good reference of dplyr dplyr: A Grammar of DataManipulation
結語 Manipulating data is that process of re-sorting, rearranging and otherwise moving your research data, without fundamentally changing it. One of the key characteristics of a manipulation technique versus related techniques like transformation is that the underlying data remains unchanged. The main thing we’re doing is changing the relationship – logical or physical – that one piece of data has with another.
謝謝聆聽! Q&A