630 likes | 767 Views
Introduction to Geospatial Analysis in R. SURF – 24 April 2012 Daniel Marlay. Synopsis.
E N D
Introduction toGeospatial Analysis in R SURF – 24 April 2012 Daniel Marlay
Synopsis • This month's talk is going to look at the geo-spatial capabilities of R. We'll look at how to import common geographical data formats into R and some of the free geographic data sources and map layers available. We'll then look at how to create maps in R using this data, and some of the ways to style it to display our data. We'll look at how R stores geographic data and how we can perform queries against that - for example identifying which points fall into a particular region. Finally, we'll take a brief look at modeling geospatial data and some of the issues to be aware of.
Introduction • There are extensive geospatial capabilities in R • I’ve just started to scratch the surface • This presentation will give a little bit of theory • Most of the content is a walk through of doing geospatial analysis in R • I’ve picked data sets that are freely available • Trying this yourself is the best way to learn • And maybe we’ll learn something about the way Australians vote…
R Geospatial Packages • sp – provides a generic set of functions, classes and methods for handling spatial data • rgdal – provides an R interface into the Geospatial Data Abstraction Library (GDAL) which is used to read and write geospatial data from R
Types of Geospatial Data • Vector data • Points • Lines • Areas • Bitmap • Often used for image data (e.g. aerial photos) • Needs to be registered to a coordinate system • “Labelled” data • Has geographic information, but needs to be matched before it can be used
Setting up the R Environment ## Set working directory to where the data is. Update as required if running this yourself setwd("C:\\Documents and Settings\\marlada\\My Documents\\AQUA Internal\\Thought Leadership\\201204 - SURF Geospatial Analysis Presentation"); ## Load the relevant libraries library(sp); # Basic R classes for handling geographic data library(rgdal); # Library for using the Geographic Data Abstraction Layer library(nlme); # Library that gives us generalised least squares
Read In Census Data (1/3) ## Read in and clean the census data (Note: a lot of this cleaning could be done more easily in Excel) EducationLevel <- read.csv("EducationData.csv",skip=6,na.strings=""); EducationLevel <- EducationLevel[c(-1,-2),c(-1,-27)]; # Remove leading and trailing blank columns and blank second row EducationLevel <- EducationLevel[-(97:100),]; # Remove trailing blank lines #### Create some useable column names EduDataCols <- paste(c(rep("Male",8),rep("Female",8),rep("Total",8)), rep(c("NotStated","InadDescr","Postgrad","GradDipCert","Bachelor","Diploma","Certificate","NA"),3), sep="."); colnames(EducationLevel) <- c("SED",EduDataCols);
Read In Census Data (2/3) #### Recode the data into character and numeric data to avoid weird errors from factors EducationLevel[,1] <- as.character(EducationLevel[,1]); for (col in EduDataCols) { EducationLevel[,col] <- as.numeric(as.character(EducationLevel[,col])); } #### Eyeball the data to make sure it is ok. summary(EducationLevel); head(EducationLevel,10); tail(EducationLevel,10);
Read In Electoral Data (1/2) ## Read in the electoral data ElectionResults <- read.csv("2011NSWElectionResults.csv"); #### Eyeball data to make sure it is ok summary(ElectionResults); head(ElectionResults); tail(ElectionResults);
Read In SED Geography (1/3) ## Read in the state electoral division boundaries (geography) and explore the SpatialPolygonsDataFrame class SED <- readOGR("C:\\Documents and Settings\\marlada\\My Documents\\AQUA Internal\\Thought Leadership\\201204 - SURF Geospatial Analysis Presentation\\Geographies","SED06aAUST_region"); #### Have an initial look at the SED data set that we've just read in summary(SED); plot(SED);
Examining the SpatialPloygonsDataFrame (1/2) #### SED is a SpatialPolygonsDataFrame, an S4 object. We can have a look at how it is constructed mode(SED); slotNames(SED); summary(SED@data); summary(SED@polygons); SED@plotOrder; SED@bbox; SED@proj4string;
Simple Mapping of SpatialPolygonsDataFrames (1/2) #### Let's now look at some more mapping, we've seen that we can plot all of Australia plot(SED[SED$STATE_2006 == "1",]); # Plot NSW plot(SED[SED$STATE_2006 == "1",],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)); # Plot Sydney - xlim and ylim from google maps ;-) plot(SED[SED$STATE_2006 == "1",],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)); # Plot Sydney and put on some electoral district names text(coordinates(SED[SED$STATE_2006 == "1",]),labels=(SED[SED$STATE_2006 == "1",])$NAME_2006,cex=0.5);
Thematic Mapping (1/8) ## Thematic mapping SED.NSW <- SED[SED$STATE_2006 == "1",]; # subset of SED for convenience #### Create a ThemeData data set with a summary of the data we are interested in - proportion of people with a tertiary education ThemeData <- data.frame(SED = as.character(EducationLevel$SED), PropTertiaryEd = (EducationLevel$Total.Postgrad + EducationLevel$Total.GradDipCert + EducationLevel$Total.Bachelor + EducationLevel$Total.Diploma + EducationLevel$Total.Certificate) / (EducationLevel$Total.Postgrad + EducationLevel$Total.GradDipCert + EducationLevel$Total.Bachelor + EducationLevel$Total.Diploma + EducationLevel$Total.Certificate + EducationLevel$Total.NA), stringsAsFactors=FALSE); hist(ThemeData$PropTertiaryEd); # Histogram of the proportions to work out the appropriate cut points ThemeData$PropTertiaryEdFact <- cut(ThemeData$PropTertiaryEd,c(0,0.25,0.3,0.35,0.4,0.5,1.0)); # Create a factor for the proportion variable levels(ThemeData$PropTertiaryEdFact) <- c("25% or Less","25% to 30%","30% to 35%","35% to 40%","40% to 50%","More than 50%");
Thematic Mapping (3/8) #### Display a thematic map for all of NSW bands <- length(levels(ThemeData$PropTertiaryEdFact)); pal <- heat.colors(bands); plot(SED.NSW,col=pal[ThemeData$PropTertiaryEdFact[match(SED.NSW$NAME_2006,ThemeData$SED)]]); # Note the use of match() to get the right rows legend("bottomright", legend=levels(ThemeData$PropTertiaryEdFact), fill=pal, title="Prop. with Tertiary Ed.",inset=0.01); #### Display a thematic map for Sydney plot(SED.NSW,col=pal[ThemeData$PropTertiaryEdFact[match(SED.NSW$NAME_2006,ThemeData$SED)]],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)); legend("bottomright", legend=levels(ThemeData$PropTertiaryEdFact), fill=pal, title="Prop. with Tertiary Ed.",inset=0.01);
Thematic Mapping (5/8) #### Now we'll add the election results to our ThemeData data set rownames(ElectionResults) <- as.character(ElectionResults$District); # Adding rownames allows us to index by them when matching ThemeData$PropGreenVote <- ElectionResults[ThemeData$SED,"GRN"] / ElectionResults[ThemeData$SED,"Total"]; # Create a green vote proportion variable hist(ThemeData$PropGreenVote,breaks=20); # Have a look at the distribution ThemeData$PropGreenVoteFact <- cut(ThemeData$PropGreenVote,c(0,0.05,0.06,0.08,0.1,0.15,1.0)); # Create a factor levels(ThemeData$PropGreenVoteFact) <- c("Less than 5%","5% to 6%","6% to 8%","8% to 10%","10% to 15%","More than 15%");
Thematic Mapping (7/8) #### And do some thematic maps of the election results bands <- length(levels(ThemeData$PropGreenVoteFact)); pal <- heat.colors(bands); plot(SED.NSW,col=pal[ThemeData$PropGreenVoteFact[match(SED.NSW$NAME_2006,ThemeData$SED)]]) legend("bottomright", legend=levels(ThemeData$PropPropGreenVoteFactFact), fill=pal, title="Prop. Voted Green",inset=0.01) plot(SED.NSW,col=pal[ThemeData$PropGreenVoteFact[match(SED.NSW$NAME_2006,ThemeData$SED)]],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)) legend("bottomright", legend=levels(ThemeData$PropGreenVoteFact), fill=pal, title="Prop. Voted Green",inset=0.01)
Geographic Querying (1/4) ## Demonstration of geographic querying #### Read in the Localities layer from the TOPO 2.5M data set Locs <- readOGR("C:\\Documents and Settings\\marlada\\My Documents\\AQUA Internal\\Thought Leadership\\201204 - SURF Geospatial Analysis Presentation\\Geographies\\localities","aus25lgd_p"); Mtns <- Locs[Locs$LOCALITY == "6",]; # Select only mountains plot(Mtns) #### Use the over function to find a list of mountains in SEDs with more than 10% green votes over(SED.NSW[!is.na(ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)]) & ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)] > 0.10,], Mtns); # Only gets one mountain per SED over(SED.NSW[!is.na(ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)]) & ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)] > 0.10,], Mtns,returnList=TRUE); # Gets all mountains, but in a less useful format do.call("rbind",over(SED.NSW[!is.na(ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)]) & ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)] > 0.10,], Mtns,returnList=TRUE)); # Gives us something a bit more useable