820 likes | 833 Views
Explore the reasons for modeling vegetation and learn about predictive models using parametric and nonparametric techniques, with examples and applications in forest inventory data and digital environmental data. This article discusses the importance of predictor data extraction, model generation, and prediction for vegetation mapping with statistical models and GIS tools.
E N D
Outline • Model types • Predictive models • Predictor data • Predictive model types (parametric/nonparametric) • Model Example (Tree-based/Random Forests) • Modeling dataset • Response/Predictor data • Discussion – scale of predictors • Predictor data extraction • Data exploration • Summary statistics/NA values • attaching data frame • Predictor variables • Response variables (binary, continuous) • Model generation • Tree-based models • Random Forests (classification/regression trees) • Variable importance/proximity • Model prediction • Polygon data • Predictor data (clipping, stacking) • Apply model and display map • Modelmap • Build model • Model diagnostics • Make map
Vegetation Modeling ## In general, there are 3 reasons we model vegetation: # 1. Explanatory: to explain why something is happening… find a pattern or a cause. # 2. Descriptive: to see an association between variables, not aimed at prediction. # 3. Predictive: to predict an occurrence based on known attributes. ## Examples: Is the height of a tree related to its diameter? Did the last 20 years of drought affect tree mortality rates? Are forest disturbances changing the carbon cycle? Can we predict the distribution of vegetation 100 years from now based on climate models? Can we predict the current distribution of vegetation across the landscape from the spectral signature of remotely-sensed data? ## An article on different reasons for modeling. Shmueli, G. 2010. To Explain or to Predict? Statistical Science 25(3):289-310.
Predictive Vegetation Models ## Predicting the distribution of vegetation across the landscape from available maps of environmental variables, such as geology, topography, and climate and spectral data from remotely-sensed imagery or products. ## Statistical models are built by finding relationships between vegetation data and the available digital data and then applied across given digital landscapes. Integrate forest inventory data . . . .and generate maps of forest attributes. . .with available digital data . . • Satellite imagery • DEMs • Soils y = f(x1 + x2 + xn) . .using flexible statistical models and GIS tools . .
Predictor Data • ## Digital environmental data and remotely-sensed products are becoming increasingly more available at many different scales. • Remotely-sensed data and derived products • Snapshot of what is on the ground • Caution: represents current vegetation distributions based on reflectance patterns, but does not explain potential occurrences of vegetation. • Topography variables (elevation, aspect, slope) • No direct physiological influence on vegetation • Caution: these variables are local to the model domain, be careful when extrapolating over space or time. • Climate variables (temperature, precipitation) • Directly related to physiological responses of vegetation • Surrogates for topographical variables • Caution: these variables are better for extrapolating over space and time, but recognize other limitations, such as current location and dispersal range, and species competition may also change through time.. • Other surrogates (geology, soils, soil moisture availability, solar radiation) • Resources directly used by plants for growth • Caution: similar to climate variables
Predictive Model Types Model Objective: Find a relationship between vegetation data and predictor data for prediction purposes. The model, in its simplest form, looks like the following: y ~ f (x1 + x2 + x3 + …) + ε, where y is a response variable, the x's are predictor variables, and ε is the associated error. • There are many different types of models • Parametric models – make assumptions about the underlying distribution of the data. • Maximum likelihood classification • Discriminant analysis • General linear models (ex. linear regression) • . . . • Nonparametric models – make no assumptions about the underlying data distributions. • Generalized additive models • Machine-learning models • Artificial neural networks • . . .
Predictive Model Types Parametric Models • Parametric models – make assumptions about the underlying distribution of the data • Assumptions: • The shape of the distribution of the underlying population is bell-shaped (normal). • Errors are independent. • Errors are normally distributed with mean of 0 and a constant variance. • Advantages: • Easy to interpret • High power with low sample sizes • Disadvantages: • If sample data are not from a normally-distributed population, may lead to incorrect conclusions. • Examples: • Maximum likelihood classification • Discriminant Analysis • Linear Regression • Multiple linear regression • Generalized linear models (parametric/nonparametric)
Predictive Model Types Parametric Models – Regression Example • Previous example: Using regression to fill in missing values. • ## Import data and subset for sp19 only • path <- "C:/Peru/Rworkshop" # Where workshop material is • setwd(path) • tree <- read.csv("PlotData/tree.csv", stringsAsFactors=FALSE) • sp19 <- tree[( tree$SPCD==19 & !is.na(tree$DIA) ),] • # Start off with a scatter plot • par(mfrow=c(1,2)) • plot(sp19$DIA,sp19$HT, xlab ="Diameter", ylab ="Height", • main = "Subalpine Fir") • abline(lm(sp19$HT ~ sp19$DIA)) • # We saw some heteroscedasticity (unequal variance), and transformed data to log scale. • plot(log(sp19$DIA), log(sp19$HT), xlab ="Diameter", ylab ="Height", • main = "Subalpine Fir, log scale") • abline(lm(log(sp19$HT) ~ log(sp19$DIA))) • # We looked at summary of models and saw lower residual error and higher R2 values. • r.mod <- lm(HT~DIA,data=sp19) • summary(r.mod) • r.mod.log.ht.dia <- lm(log(HT)~log(DIA),data=sp19) • summary(r.mod.log.ht.dia)
Predictive Model Types Parametric Models – Regression Example Previous example cont.. # Then we looked at the residuals versus the fitted values and normal QQ-plots. par(mfrow=c(2,2)) plot(r.mod$fitted,r.mod.log.ht.dia$residuals, xlab="Fitted",ylab="Residuals", main ="Fitted versus Residuals") abline(h=0) qqnorm(r.mod$residuals, main="Normal Q-Q Plot") qqline(r.mod$residuals) plot(r.mod.log.ht.dia$fitted,r.mod.log.ht.dia$residuals, xlab="Fitted",ylab="Residuals", main ="Fitted versus Residuals, log scale") abline(h=0) qqnorm(r.mod.log.ht.dia$residuals, main="Normal Q-Q Plot, log scale") qqline(r.mod.log.ht.dia$residuals) # NOTE: # Using transformations, we were able to work with data that was nonlinear, with unequal # variance structure using a parametric model.
Predictive Model Types Nonparametric Models • Nonparametric models – make no assumptions about the underlying distribution of the data. • Note: Vegetation data typically are not normally distributed across the landscape, therefore it is most often better to use a nonparametric model. • ## Advantages: • # If sample data are not from a normally-distributed population, using a parametric may lead to incorrect conclusions. • ## Disadvantages: • # Need larger sample sizes to have the same power as parametric statistics • # Harder to interpret • # Can overfit data • ## Examples: • # Generalized additive models • # Classification and regression trees (i.e. CART) • # Artificial neural networks (ANN) • # Multivariate adaptive regression splines (MARS) • # Ensemble modeling (i.e. Random Forests)
Random Forests • Random Forests (Breiman, 2001) Generates a series of classification and regression tree models.. .. sampling, with replacement, from training data (bootstrap) .. selecting predictor variables at random for each node .. outputting the class that most frequently results .. and calculating an out-of-bag error estimate .. and measuring variable importance through permutation randomForest – Liaw & Wiener ModelMap – Freeman & Frescino
Landsat TM Elevation Aspect Slope Prediction (% Tree crown cover) Model Example Objective: Using Random Forests to find relationships between forest inventory data and six spatial predictor layers. The models will be used to make predictions across a continuous, pixel-based surface. 60 TM B3 8000’ Elev 160° Aspect 15% Slope 120 TM B3 10500’ Elev 10° Aspect 12 % Slope 80 TM B3 9200’ Elev 95° Aspect 20% Slope Landsat TM Elevation Aspect Slope Extract data from each layer at each sample plot location. 10 % cover 80 % cover 35 % cover Generate spatially explicit maps of forest attributes based on cell by cell predictions.
Model Example The model form: y ~ f (x1 + x2 + x3 + x4 + x5 + x6) + ε, where y is forest inventory data, and the x's are the predictor variables listed below, including satellite spectral data and topographic variables. ## For this example, # We will look at 2 responses: # Presence of aspen Binary response of 0 and 1 values (1=presence) # Total carbon Continuous response # With 5 predictor variables: # Landsat Thematic Mapper, band 5 30-m spectral data, band 5 # Landsat Thematic Mapper, NDVI 30-m spectral data, NDVI # Classified forest/nonforest map 250-m classified MODIS, resampled to 30 m # Elevation 30-m DEM # Slope 30-m DEM –derived # Aspect 30-m DEM - derived
Study Area Utah, USA
Study Area Model data set: Uinta Mountains, Utah,USA Highest East-West oriented mountain range in the contiguous U.S. - up to 13,528 ft (4,123 m) Apply model to: High Uinta Wilderness ## Vegetation 5 different life zones: shrub-montane aspen lodgepole pine spruce-fir alpine
Data for Modeling • # Load libraries • library(rgdal) # GDAL operations for spatial data • library(raster) # Analyzing gridded spatial data • library(rpart) # Recursive partitioning and regression trees • library(car) # For book (An R Companion to Applied Regression) • library(randomForest) # Generates Random Forest models • library(PresenceAbsence) # Evaluates results of presence-absence models • library(ModelMap) # Generates and applies Random Forest models
Response Data • # Forest Inventory data (Model response) • # We have compiled this data before, let's review. • options(scipen=6) • plt <- read.csv("PlotData/plt.csv", stringsAsFactors=FALSE) • tree <- read.csv("PlotData/tree.csv", stringsAsFactors=FALSE) • ref <- read.csv("PlotData/ref_SPCD.csv", stringsAsFactors=FALSE) • ## The plt table contains plot-level data, where there is 1 record (row) for each plot. This table has the coordinates (fuzzed) of each plot, which we will need later. • dim(plt) ## Total number of plots • head(plt) ## Display first six plot records. • ## The tree table contains tree-level data, where there is 1 record (row) for each tree on a plot. We need to summarize the tree data to plot-level for modeling. • dim(tree) ## Total number of trees • head(tree) ## Display first six tree records. • # First, let's add the species names to the table. • # Merge species names to table using a reference table of species codes • tree <- merge(tree, ref, by="SPCD") • head(tree)
Response Data • # Forest Inventory data (Model response) • ## We have 2 responses: • # Presence of aspen Binary response of 0 and 1 values (1=presence) • # Total carbon Continuous response • ## Let's compile presence of aspen and append to plot table. • # First, create a table of counts by species and plot. • spcnt<- table(tree[,c("PLT_CN", "SPNM")]) • head(spcnt) • # For this exercise, we don't care about how many trees per species, we just want presence • # or absence, therefore, we need to change all values greater than 1 to 1. • spcnt[spcnt > 1] <- 1 • # We are only interested in aspen presence, so let's join the aspen column to the plot table. • spcntdf <- data.frame(PLT_CN=row.names(spcnt), ASPEN=spcnt[,"aspen"]) • plt2 <- merge(plt, spcntdf, by.x="CN", by.y="PLT_CN") • dim(plt) • dim(plt2) # Notice there are fewer records (plots with no trees) • plt2 <- merge(plt, spcntdf, by.x="CN", by.y="PLT_CN", all.x=TRUE) • dim(plt2)
Response Data • ## Forest Inventory data (Model response) • ## Now, let's compile total carbon by plot and append to plot table. • # First, create a table of counts by plot. • pcarbon<- aggregate(tree$CARBON_AG, list(tree$PLT_CN), sum) • names(pcarbon) <- c("PLT_CN", "CARBON_AG") • # Carbon is stored in FIA database with units of pounds. Let's add a new variable, CARBON_KG, with conversion to kg. • pcarbon$CARBON_KG <- round(pcarbon$CARBON_AG* 0.453592) • # Now we can join this column to the plot table (plt2). • plt2 <- merge(plt2, pcarbon, by.x="CN", by.y="PLT_CN", all.x=TRUE) • dim(plt2) • head(plt2) • # Change NA values to 0 values. • plt2[is.na(plt2[,"ASPEN"]), "ASPEN"] <- 0 • plt2[is.na(plt2[,"CARBON_KG"]), "CARBON_KG"] <- 0 • plt2$CARBON_AG <- NULL • head(plt2)
Response Data • # We need to extract data from spatial layers, so let's convert the plot table to a SpatialPoints object in R. • ## We know the projection information, so we can add it to the SpatialPoints object. • prj4str <- "+proj=longlat+ellps=GRS80 +datum=NAD83 +no_defs" • ptshp<- SpatialPointsDataFrame(plt[,c("LON","LAT")], plt, proj4string = CRS(prj4str)) • ## Display the points • plot(ptshp)
Predictor Data ## Predictor variables: # Landsat Thematic Mapper, band 5 30-m spectral data, band 5, resampled to 90 m # Landsat Thematic Mapper, NDVI 30-m spectral data, NDVI, resampled to 90 m # Classified forest/nonforest map 250-m classified MODIS, resampled to 90 m # Elevation 30-m DEM, resampled to 90 m # Slope 90-m DEM – derived in a following slides # Aspect 90-m DEM – derived in a following slides ## Set file names b5fn <- "SpatialData/uintaN_TMB5.img" # Landsat TM–Band5 ndvifn <- "SpatialData/uintaN_TMndvi.img" # Landsat TM–NDVI fnffn <- "SpatialData/uintaN_fnfrcl.img" # Forest type map (reclassed) elevfn<- "SpatialData/uintaN_elevm.img" # Elevation (meters) Note: If you don't have uintaN_fnfrcl.img, follow steps on last slide of this presentation (Appendix 1) ## Check rasters rastfnlst <- c(b5fn, ndvifn, fnffn, elevfn) rastfnlst sapply(rastfnlst, raster)
Predictor Data # TM Band 5 (uintaN_TMB5.img) # TM NDVI (uintaN_TMndvi.img)
Predictor Data # Forest/Nonforestmap (uintaN_fnf.img) # DEM (uintaN_elevm.img)
Predictor Data ## Now, let's generate slope from DEM. Save it to your SpatialData folder. help(terrain) help(writeRaster) slpfn <- "SpatialData/uintaN_slp.img" slope <- terrain(raster(elevfn), opt=c('slope'), unit='degrees', filename=slpfn, datatype='INT1U', overwrite=TRUE) plot(slope, col=topo.colors(6)) # Add slope file name to rastfnlst rastfnlst <- c(rastfnlst, slpfn) rastfnlst
Predictor Data ## We can also generate aspect from DEM. Save it to your SpatialData folder. help(terrain) ## This is an intermediate step, so we are not going to save it. aspectr<- terrain(raster(elevfn), opt=c('aspect'), unit='radians') aspectr # Note: Make sure to use radians, not degrees plot(aspectr, col=terrain.colors(6))
Predictor Data • ## Aspect is a circular variable. There are a couple ways to deal with this: • ## 1. Convert the values to a categorical variable (ex. North, South, West, East) • ## We derived aspect in radians. First convert radians to degrees. • aspectd <- round(aspectr * 180/pi) • aspectd • ## Now, create a look-up table of reclass values. • help(reclassify) • frommat <- matrix(c(0,45, 45,135, 135,225, 225,315, 315,361), 5, 2) • frommat • frommat <- matrix(c(0,45, 45,135, 135,225, 225,315, 315,361), 5, 2, byrow=TRUE) • frommat • tovect <- c(1, 2, 3, 4, 1) • rclmat <- cbind(frommat, tovect) • rclmat • ## Reclassify raster to new values. • aspcl<- reclassify(x=aspectd, rclmat, include.lowest=TRUE) • aspcl • unique(aspcl) • bks<- c(0,sort(unique(aspcl))) # Break points • cols <- c("dark green", "wheat", "yellow", "blue") # Colors • labs <- c("North", "East", "South", "West") # Labels • lab.pts<- bks[-1]-diff(bks)/2 # Label position • plot(aspcl, col=cols, axis.args=list(at=lab.pts, labels=labs), breaks=bks)
Predictor Data • ## 2. Convert to a linear variable (ex. solar radiation index; Roberts and Cooper 1989) • aspval <- (1 + cos(aspectr+30))/2 ## Roberts and Cooper 1989 • aspval • plot(aspval) • ## Let's multiply by 100 and round so it will be an integer (less memory) • aspval <- round(aspval * 100) • aspval • plot(aspval) • # Save this layer to file • aspvalfn <- "SpatialData/uintaN_aspval.img" • writeRaster(aspval, filename=aspvalfn, datatype='INT1U', overwrite=TRUE) • # Add aspvalto rastfnlst • rastfnlst <- c(rastfnlst, aspvalfn) • ## Converts aspect into solar radiation equivalents, with a correction of 30 degrees to reflect ## the relative heat of the atmosphere at the time the peak radiation is received. • ## Max value is 1.0, occurring at 30 degrees aspect; min value is 0, at 210 degrees aspect. • ## Roberts, D.W., and S. V. Cooper. 1989. Concepts and techniques in vegetation mapping. In Land classifications based on vegetation: applications for resource management. D. Ferguson, P. Morgan, and F. D. Johnson, editors. USDA Forest Service General Technical Report INT-257, Ogden, Utah, USA.
Discussion - Scale of Predictors ## Tools in R for handling scale issues. See help on functions for further details. # focal - applies a moving window function across pixels without changing the resolution. # Note: It works but takes a lot of time on large rasters. # aggregate - aggregates pixels to lower resolution. # resample – resamples pixels to match extent or pixel size of another raster.
Predictor Data Extraction 60 TM B3 8000’ Elev 160° Aspect 15% Slope 120 TM B3 10500’ Elev 10° Aspect 12 % Slope • ## The next step is to extract the values of each raster at each sample point location. 80 TM B3 9200’ Elev 95° Aspect 20% Slope Landsat TM Elevation Aspect Slope Extract data from each layer at each sample plot location. 10 % cover 80 % cover 35 % cover
Predictor Data Extraction • ## We need to check the projections of the rasters. If the projections are different, • ## reproject the points to the projection of the rasters, it is much faster. • ## We will use the plt2 table with LON/LAT coordinates and the response data attached. • head(plt2) • ## We know the LON/LAT coordinates have the following projection: • prj4str <- "+proj=longlat+ellps=GRS80 +datum=NAD83 +no_defs" • # Check projections of each raster.. • sapply(rastfnlst, function(x){ projection(raster(x)) }) • ## ReprojectSpatialPoints object to match raster projections. • help(project) • rast.prj <- projection(raster(rastfnlst[1])) • xy <- cbind(plt$LON, plt$LAT) • xyprj <- project(xy, proj=rast.prj)
Predictor Data Extraction • ## Extract values (raster package) • help(extract) • # Let's extract values from 1 layer. • tmp <- extract(raster(elevfn), xyprj) • head(tmp) • # Now, let's create a function to extract, so we can extract from all the rasters at the same time. • extract.fun <- function(rast, xy){ extract(raster(rast), xy) } • # Now, apply this function to the vector list of raster file names. • rastext<- sapply(rastfnlst, extract.fun, xyprj) • # Look at the output and check the class. • head(rastext) • class(rastext)
Predictor Data Extraction • ## Extract values (raster package) cont.. change names • # Let's make the column names shorter. • colnames(rastext) • # Use the rastfnlst vector of file names to get new column names. • # First, get the base name of each raster, without the extension. • cnames <- unlist(strsplit(basename(rastfnlst), ".img")) • cnames • # We could stop here, but let's make the names even shorter and remove • # 'uintaN_' from each name. • cnames2 <- substr(cnames, 8, nchar(cnames)) • cnames2 • # Now, add names to matrix. Because the output is a matrix, we will use colnames. • colnames(rastext) <- cnames2 • head(rastext)
Predictor Data Extraction • # Now, let's append this matrix to the plot table with the response data (plt2). • head(plt2) • # We just want the response variables, so let's extract these columns along with the unique identifier of the table (CN, ASPEN, CARBON_KG). • modeldat<- cbind(plt2[,c("CN", "ASPEN", "CARBON_KG")], rastext) • head(modeldat) • # Check if this is a data frame • is.data.frame(modeldat) • dim(modeldat) • # Let's also append the projected xy coordinates for overlaying with raster layers. • modeldat<- cbind(xyprj, modeldat) • head(modeldat) • colnames(modeldat)[1:2] <- c("X", "Y") • head(modeldat)
Model Data Exploration • ## What to look for: • # NA values • # Outliers • # Correlations • # Non-normal distributions • # Changes in variability • # Clustering • # Non-linear data structures
Model Data Exploration Summary statistics • ## Summary statistics • str(modeldat) • summary(modeldat) • head(modeldat) • dim(modeldat) • # We need to convert categorical variables to factors • modeldat$ASPEN <- factor(modeldat$ASPEN) • modeldat$fnfrcl<- factor(modeldat$fnfrcl) • ## Now, display summary statistics again and notice changes for ASPEN and fnfrcl • str(modeldat) • summary(modeldat) • head(modeldat) # notice head does not show which variables are factors
Model Data Exploration NA Values • ## Check for NA values. • modeldat[!complete.cases(modeldat),] • modeldat.NA <- modeldat[!complete.cases(modeldat),] • dim(modeldat.NA) • modeldat.NA • # We can overlay plots with NA values on raster. • plot(raster(aspvalfn)) • points(modeldat.NA, pch=20) • # Most R model functions will handle or remove NA values, but for this example, let's remove the plots with NA values from our dataset now. • modeldat <- modeldat[complete.cases(modeldat),] • dim(modeldat)
Model Data Exploration attach data frame • ## Attaching a data frame. This is useful if you are exclusively working with 1 data frame. • ## Caution: data frame variable names must be unique to data frame. • ## Let's save modeldat object and clean up before we attach the data frame. • save(modeldat, file="Outfolder/modeldat.Rda") • # Now, remove all objects except modeldat. • ls()[ls() != "modeldat"] • rm(list = ls()[ls() != "modeldat"]) • ls() • # .. and attach modeldat • attach(modeldat) • head(modeldat) • ASPEN # Display column vector without using $ • # Notes: • # To load the saved model object • # load(file="Outfolder/modeldat.Rda") • # Make sure to detach data frame when done using it.. • # detach(modeldat)
Model Data Exploration Predictors • ## Check for outliers and correlations among predictors. • ## Let's look at an example using 2 predictors (elevm, TMB5) • preds <- cbind(elevm, TMB5) • ## Correlation between predictor variables, to determine strength of relationship • cor(preds) • round(cor(preds),4) • ## Scatterplots • plot(preds) • ## Add a regression line • abline(lm(TMB5 ~ elevm)) • ## Now let's add a smoother line for more information about data trend • lines(loess.smooth(elevm, TMB5), col="red") • ## Another way • library(car) • scatterplot(TMB5 ~ elevm) • help(scatterplot)
Model Data Exploration Predictors • ## Correlation cont.. • # We can now see a non-linear trend in the data where, TMB5 spectral values decrease as elevations rise up to around 3000 meters, then begins to increase in value as elevations continue to rise. • # This suggests a non-parametric relationship. • help(cor) • # The default uses pearson correlation coefficient, but spearman may be a better choice for nonlinear data structures. • round(cor(preds, method="spearman"),4) • ## Let's look at all predictors at once • names(modeldat) • preds <- modeldat[-c(1:5)] • head(preds) • ## Correlation between predictor variables (to minimize number of predictors) • cor(preds) • str(preds) • ## Note Error : 'x' must be numeric (remove factor variable from analysis) • round(cor(preds[,-3], method="spearman"),4)
Model Data Exploration Predictors • ## Scatterplots • pairs(preds) • help(pairs) #select Scatterplot Matrices from graphics package • ## In help doc, scroll down to Examples and copy and paste the 2 functions: • # panel.hist • # panel.cor • pairs(preds, lower.panel = panel.smooth, upper.panel = panel.cor) • ## Change cor function to spearman within panel.cor • panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) • { usr <- par("usr"); on.exit(par(usr)) • par(usr= c(0, 1, 0, 1)) • r <- abs(cor(x, y, method="spearman")) • txt <- format(c(r, 0.123456789), digits = digits)[1] • txt <- paste0(prefix, txt) • if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) • text(0.5, 0.5, txt, cex = cex.cor * r) • } • pairs(preds, lower.panel = panel.smooth, upper.panel = panel.cor) • ## Another way • scatterplotMatrix(preds[,-3])
Model Data Exploration Predictors • ## Check for outliers • plot(TMB5, TMndvi) • identify(TMB5, TMndvi) # Click on outliers and press esc key to escape • ## Display the outliers • modeldat[c(58,60),] • plot(raster("SpatialData/uintaN_TMB5.img")) • points(modeldat[c(58,60),], pch=20) • ## Let's remove these outliers from our dataset • modeldat <- modeldat[-c(58,60),] • dim(modeldat) • length(TMB5) • ## Detach and reattach data frame • detach(modeldat) • attach(modeldat) • dim(modeldat) • length(TMB5) • save(modeldat, file="Outfolder/modeldat.Rda") ## Resave modeldat object • # load(file="Outfolder/modeldat.Rda") • ## Check scatterplot again • plot(TMB5, TMndvi) • ## ..and for all predictors • preds <- modeldat[-c(1:5)] • pairs(preds, lower.panel = panel.smooth, upper.panel = panel.cor)
Model Data Exploration Predictors • ## We can also look at the distribution of each predictor across its range of values and its diversion from normality using the cumulative density function we looked at earlier. • ## This is only meaningful for continuous predictors • plot(sort(elevm)) • lines(c(1, length(elevm)), range(elevm), col=2) • ## Now, let's make a little function to look at the rest of the continuous predictors • pdistn <- function(x){ • plot(sort(x)) • lines(c(1, length(x)), range(x), col=2) • } • par(ask=TRUE) # Press enter to go to next display • pdistn(elevm) • pdistn(slp) • pdistn(TMB5) • pdistn(TMndvi) • pdistn(aspval) • par(ask=FALSE) • ## For categorical predictors, let's use table to look at the distribution of samples by value. • table(fnfrcl)
Model Data Exploration Response • ## Let's look at the distribution and amount of variability of the sample response data. • ## Check for normality (bell-shaped curve) • ## Histograms and density functions • par(mfrow=c(2,1)) • hist(CARBON_KG) • hist(CARBON_KG, breaks=5) • ## Overlay density function (smoothed) on the histogram with continuous response. • hist(CARBON_KG, breaks=10, prob=TRUE) • lines(density(CARBON_KG)) • ## Data with lots of 0 values tend to have this shape. This is not a normal distribution • ## Therefore, using a nonparametric model, such as Random Forests is a good idea. • ## Let's look at the distribution if there were no 0 values (plots without trees) • hist(CARBON_KG[CARBON_KG>0], breaks=10, prob=TRUE) • lines(density(CARBON_KG[CARBON_KG>0])) • ## The shape of the distribution is similar, with much more higher values than lower values. • par(mfrow=c(1,1))
Model Data Exploration Data distributions – Binomial response • ## Let's look at the distribution of the sample response data with a couple predictors. • ## We will start with presence/absence of aspen. We know this is a binomial distribution with 0 and 1 values. • ## Let's explore a little more. • ## Plot elevation as a function of ASPEN, bar representing median value. • boxplot(elevm~ ASPEN, main="Aspen Presence/Absence") • ## Add names • boxplot(elevm ~ ASPEN, main="Aspen Presence/Absence", names=c("Absence", "Presence")) • # Note: the bold line is median value • ## Add points with mean values • means <- tapply(elevm, ASPEN, mean) • points(means, col="red", pch=18) • # Note: overall, presence of aspen tends to be at the lower elevations of the sample plots. • # The distributions are slightly skewed with the median elevm values higher than mean elevm values.
Model Data Exploration Data distributions – Binomial response • ## Other ways to explore relationships between the response and predictors. • ## We can look at the differences of presence vs absence with 2 predictors. • plot(elevm, ASPEN) • ## We can look at the differences of presence vs absence with 2 predictors (elevation and slope). • par(mfrow=c(1,2)) • plot(elevm[ASPEN==1], slp[ASPEN==1]) • plot(elevm[ASPEN==0], slp[ASPEN==0]) • ## Now, let's make it more meaningful by adding labels and using the same scale for each. • xlab<- "Elevation" • ylab <- "Slope" • xlim <- c(2000, 4000) • ylim <- c(0,40) • plot(elevm[ASPEN==1], slp[ASPEN==1], xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, main="Aspen Present") • plot(elevm[ASPEN==0], slp[ASPEN==0], xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim, main="No Aspen") • par(mfrow=c(1,1))
Model Data Exploration Data distributions – Binomial response cont. • ## Other ways to explore relationships between the response and predictors (1 plot) • ## We can also color the points based on a factor (ASPEN) • plot(slp, elevm, col=ASPEN, pch=20, xlab="Slope", ylab="Elevation") • ## Add a legend (using default colors) • legend(x=35, y=2200, legend=levels(ASPEN), col=1:length(ASPEN), pch=20) • help(legend) • ## Now, do it again using your own color choice • palette(c("blue", "green")) # Change color palette first • plot(slp, elevm, col=ASPEN, pch=20, xlab="Slope", ylab="Elevation") • legend(x=35, y=2200, legend=levels(ASPEN), col=1:length(ASPEN), pch=20) • palette("default") # Change color palette back to default colors • ## Or.. • plot(slp, elevm, col=c("red", "blue"), pch=20, xlab="Slope", ylab="Elevation") • legend(x=35, y=2200, legend=levels(ASPEN), col=c("red", "blue"), pch=20) • ## Another way: • scatterplot(slp ~ elevm|ASPEN, data=modeldat) • ## For categorical predictors, use table function again • table(ASPEN, fnfrcl)
Model Data Exploration Data distributions – Continuous response • ## Other ways to explore relationships between the response and predictors. • ## We can look at relationship between elevation and CARBON_KG • plot(elevm, CARBON_KG, xlab="Elevation", ylab="Carbon(kg)") • ## Without 0 values • plot(elevm[CARBON_KG>0], CARBON_KG[CARBON_KG>0], xlab="Elevation", • ylab="Carbon (kg)") • ## Add regression and smoother lines. • par(mfrow=c(2,1)) • ## With 0 values • plot(elevm, CARBON_KG, xlab="Elevation", ylab="Carbon(kg)") • line.lm <- lm(CARBON_KG ~ elevm) • abline(line.lm, col="red") • line.sm <- lowess(elevm, CARBON_KG) • lines(line.sm, col="blue") • ## Without 0 values • plot(elevm[CARBON_KG>0], CARBON_KG[CARBON_KG>0],xlab="Elevation",ylab="Carbon(kg)") • line.lm.w0 <- lm(CARBON_KG[CARBON_KG>0] ~ elevm[CARBON_KG>0]) • abline(line.lm.w0, col="red") • line.sm <- lowess(elevm[CARBON_KG>0], CARBON_KG[CARBON_KG>0]) • lines(line.sm, col="blue") • par(mfrow=c(1,1))