Chapter 8 – Spatial contiguity (geostatistical + lattice data)

[Figure: data posting map of the Gigante soil pH values, x = 0–500 m, y = 0–800 m]

EDA involves methods of describing and visualizing data and their structure in order to formulate hypotheses and check the validity of assumptions. R is a particularly useful tool for EDA. In what follows, some common techniques are introduced.

Data posting – A map on which each data location is plotted along with its corresponding data value.

gigantemap.r(soil87.dat) # soil87.dat is regularly spaced at 87 locations
text(soil87.dat$gx, soil87.dat$gy, as.character(soil87.dat$pHFsrf), col=4, cex=0.8)
[Figure: symbol map of the pH classes (1–3) over the same study area]

Symbol map – Similar to the data posting map, but each value is replaced by a symbol representing the class to which the datum belongs.

z=soil87.dat$pHFsrf
summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.48    5.23    5.47   5.499    5.83    6.97
z[z<=5.23]=1
z[z>5.23&z<=5.83]=2
z[z>5.83]=3
gigantemap.r(soil87.dat)
text(soil87.dat$gx, soil87.dat$gy, as.character(z), col=4)
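The three-way reclassification above can also be done in a single pass with base R's cut(), which avoids any dependence on the order of the assignments. A minimal sketch, with simulated values standing in for soil87.dat$pHFsrf:

```r
# cut() assigns each value to a class in one step; the breaks are the
# quartile-based cutpoints used on the slide (5.23 and 5.83).
set.seed(1)
z <- rnorm(87, mean = 5.5, sd = 0.5)               # simulated stand-in data
cls <- cut(z, breaks = c(-Inf, 5.23, 5.83, Inf), labels = 1:3)
table(cls)                                         # number of sites per class
```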
Contour map, image and 3-D plot
• Contour/image maps are useful for revealing overall trends in the data. However, the smoothing out of the variability in the data can sometimes be misleading.
• Contour map (contour, image and persp are base R functions):
1. >pH.mat=gigante.mat.r(soil87.dat$gx,soil87.dat$gy,soil87.dat$pHFsrf) # interpolate a surface for the z values
2. >gigantemap.r(soil87.dat) # draw a location map
3. >contour(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat), add=T) # add a contour map
• Image:
1. >gigantemap.r(soil87.dat) # draw a location map
2. >image(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat), add=T) # add an image
• 3-D perspective plot:
>persp(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat))
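Without the course's gigante.mat.r/gigantemap.r helpers, the same three displays can be sketched on a simulated regular grid using base R graphics only (the surface below is assumed data, not the soil87 values):

```r
# Simulated pH-like surface on a regular grid: a smooth trend plus noise.
set.seed(2)
x <- seq(0, 500, by = 25)
y <- seq(0, 800, by = 25)
z <- outer(x, y, function(x, y) 5.5 + 0.001 * x - 0.0005 * y) +
     matrix(rnorm(length(x) * length(y), 0, 0.05), length(x))
contour(x, y, z)                      # contour map
image(x, y, z, add = TRUE)            # overlay an image (colour) map
persp(x, y, z, theta = 30, phi = 25)  # 3-D perspective plot
```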
H-scatter plot
Locations close to each other are likely to have more similar values than locations far apart. The h-scatter plot is a useful tool for displaying this spatial continuity. An h-scatter plot shows all possible pairs of data values whose locations are separated by a certain distance in a particular direction. Suppose a set of sample data, z(s), comes from a systematic grid. The h-scatter plot is a scatter plot in which the x-axis is labeled z(s) and the y-axis is labeled z(s+h), where h is a spatial lag. If spatial continuity is strong, the cloud of points (z(s), z(s+h)) should fall closely along the 1:1 diagonal line. As h increases, the data values become less similar and the cloud on the h-scatter plot becomes fatter and more diffuse.
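A minimal h-scatter sketch for a one-dimensional transect, using simulated autocorrelated values from arima.sim rather than the soil87 data:

```r
# z(s) along a regular transect; plot z(s) against z(s+h) for lag h.
set.seed(42)
z <- as.numeric(arima.sim(list(ar = 0.8), n = 200))  # autocorrelated series
h <- 1
x1 <- z[1:(length(z) - h)]          # z(s)
x2 <- z[(1 + h):length(z)]          # z(s+h)
plot(x1, x2, xlab = "z(s)", ylab = "z(s+h)")
abline(0, 1, lty = 2)               # 1:1 line: points hug it when h is small
cor(x1, x2)                         # shrinks toward 0 as h grows
```

Increasing h (e.g. h <- 5) fattens the cloud and lowers the correlation, exactly the behaviour described above.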
[Figure: regularly vs irregularly spaced sample locations; a distance lag of 5 m with a 0.5 m distance tolerance and a 20° angle tolerance in the easterly direction]

Distance lag and angle tolerance for the h-scatter plot
Construction of an h-scatter plot is straightforward for regularly spaced sample locations. However, if the locations are irregularly spaced, for a given h there may be very few sample pairs in some situations and very many in others. To overcome this problem, we specify a tolerance on both the distance and the direction of h. In the illustration, any location falling within the shaded area is considered to be within 5 m of a given location in an easterly direction.
Stationarity and isotropy
The assumption of intrinsic stationarity of a random field is essential for geostatistical analysis. It is defined through first differences as follows:

E[Z(s+h) − Z(s)] = 0 and Var[Z(s+h) − Z(s)] = 2γ(h), for all s and h,

where γ(h) is called the (semi)variogram. Intrinsic stationarity basically implies a process with a constant mean and with a variance of the differences that depends only on the magnitude and direction of h. Furthermore, if the variance 2γ(h) is independent of direction, i.e., it depends only on the magnitude ||h||, the random field is isotropic. Geostatistical data are typically composed of both large-scale trend and small-scale random variation. Geostatistics (variogram/kriging) is designed to describe the small-scale variation, so it is necessary to remove the trend and anisotropy before conducting geostatistical analysis and spatial prediction.
Median polishing
Median polishing is a resistant method appropriate for gridded data. It is based on an additive decomposition:

data = grand effect + row effect + column effect + residual
z_ij = z.. + (z_i. − z..) + (z._j − z..) + (z_ij − z_i. − z._j + z..)

where a subscript dot denotes the mean (median) over that subscript. The algorithm iterates successive sweeps of medians out of rows, then out of columns, accumulating them in the row, column, and grand effects. At each step of the algorithm the above relation is preserved; the residuals are what is left after the algorithm converges. The method cannot remove interaction effects between rows and columns.
Note: The residuals are the data that we should use in geostatistical analysis. It has been thought that working with residuals introduces bias into variogram estimation, but the bias is generally small (particularly when the median is used instead of the mean).
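Base R implements this algorithm as stats::medpolish(). A small sketch on a made-up 3×3 grid, checking that the decomposition reassembles the data exactly:

```r
# medpolish() sweeps medians out of rows, then columns, until convergence.
m <- matrix(c(5.1, 5.6, 5.3,
              5.4, 6.0, 5.5,
              4.9, 5.5, 5.2), nrow = 3, byrow = TRUE)
mp <- medpolish(m, trace.iter = FALSE)
mp$overall                         # grand effect
mp$row; mp$col                     # row and column effects
mp$residuals                       # these go into the variogram analysis
# The additive decomposition holds exactly:
max(abs(mp$overall + outer(mp$row, mp$col, "+") + mp$residuals - m))
```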
Trend surface analysis
Trend surface analysis is appropriate for (non-)gridded data. It basically fits the data using multiple regression, with the x and y coordinates as independent variables (interactions of x and y or higher-order terms are usually needed).
Data model:
>grid=expand.grid(x=1:50,y=1:25)
>x=grid$x
>y=grid$y
>z=10+0.1*x+0.0012*y^2+rnorm(1250,0,1) # generate the data
>x2=x^2
>y2=y^2
>xy=x*y
>z.lm=lm(z~x+y+xy+x2+y2) # fit the data z
Mapping z:
>x0=1:50
>y0=1:25
>image(x0,y0,matrix(z,nrow=50),col=topo.colors(100),main="Species distribution",xlab="x",ylab="y")
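The same quadratic surface can be fit in a single formula using I(), without building x2, y2 and xy by hand (the object names below are my own, not the slide's):

```r
# Regenerate the slide's data model and fit the full quadratic trend surface.
set.seed(7)
grid <- expand.grid(x = 1:50, y = 1:25)
z <- 10 + 0.1 * grid$x + 0.0012 * grid$y^2 + rnorm(1250, 0, 1)
fit <- lm(z ~ x + y + I(x * y) + I(x^2) + I(y^2), data = grid)
coef(fit)[["(Intercept)"]]   # close to the true value 10
```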
Trend surface analysis (cont'd)
> summary(z.lm)
Call:
lm(formula = z ~ x + y + xy + x2 + y2)
Residuals:
     Min       1Q   Median       3Q      Max
-3.06491 -0.65335  0.03657  0.66170  2.88715
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.002e+01  1.478e-01  67.841   <2e-16 ***
x            1.078e-01  8.404e-03  12.826   <2e-16 ***
y           -8.981e-03  1.706e-02  -0.527    0.599
xy           7.934e-05  2.614e-04   0.304    0.762
x2          -1.473e-04  1.461e-04  -1.008    0.314
y2           1.273e-03  5.862e-04   2.172    0.030 *
---
Mapping the z.lm residuals:
> image(x0,y0,matrix(resid(z.lm),nrow=50),col=topo.colors(100),main="Residuals",xlab="x",ylab="y")
Spatial autocorrelation
• Several techniques are available to detect spatial autocorrelation (e.g., the variogram and the correlogram). Here we introduce four indices; the variogram is deferred to the next chapter. These indices deal with binary, ordinal and continuous data.
1. Mantel test
W_ij = ||s_i − s_j||   U_ij = ||z(s_i) − z(s_j)||
β is the slope of the regression U_ij = βW_ij + e_ij.
Mantel test (cont'd)
W_ij = ||s_i − s_j||   U_ij = ||z(s_i) − z(s_j)||
We do not usually use β for the Mantel test; instead we use M2 (or M1) directly to test for spatial patterning. Several methods can be used to test for significance. The basic idea is to randomly assign z1, z2, …, zn to the study area and then compare the observed M2obs against the random M2's to see how extreme it is. If M2obs is very different from the majority of the random M2's, we reject the assumption of random distribution.
Mantel test – significance test
• Permutation test: Under the hypothesis of random distribution, the observed z(s_i) can be thought of as a random assignment of values to the spatial locations s. There are in total n! possible assignments (maps). Enumerating M2 for the n! arrangements yields its null distribution, from which the probability of exceeding M2obs can be ascertained.
• Monte Carlo test: Enumerating all n! permutations is computationally very expensive. Instead, we can independently sample a small subset of random assignments (e.g., 99) to construct a null distribution for M2.

[Figure: null distribution of M2 with the observed value M2obs marked]
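The Monte Carlo recipe can be sketched directly in R. Here the Mantel-type statistic is taken as the correlation between the distance and dissimilarity matrices (one common normalized form; the slide's exact M2 may differ), on simulated data with a built-in spatial trend:

```r
# 99 random reassignments of z to the locations give the null distribution.
set.seed(1)
n  <- 30
xy <- cbind(runif(n), runif(n))
z  <- xy[, 1] + rnorm(n, 0, 0.2)              # values trend with x
W  <- as.matrix(dist(xy))                     # ||s_i - s_j||
mantel.stat <- function(zz)
  cor(W[lower.tri(W)], as.matrix(dist(zz))[lower.tri(W)])
m.obs <- mantel.stat(z)                       # observed statistic
m.sim <- replicate(99, mantel.stat(sample(z)))
p <- (1 + sum(m.sim >= m.obs)) / (1 + length(m.sim))  # one-sided p-value
```

With a strong trend in the data, m.obs sits in the upper tail of the 99 permuted values and p is small.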
Measures on lattices
1. Spatial connectivity (contiguity)
Defining spatial connectivity is critical to the analysis of lattice data. There are several ways to define connectivity: by contiguity (rook, bishop or queen neighbourhoods) and by distance. The connectivity (weight) matrix is

w_ij = 1 if sites i and j are connected; w_ij = 0 if they are not connected.

[Figure: rook, bishop and queen contiguity patterns; county lattice of North Carolina]
2. Black-Black and Black-White join counts
With w_ij = 1 if sites i and j are connected (0 if they are not) and x_i = 1 if site i is occupied (0 if it is not occupied), the join-count statistics are

BB = (1/2) Σ_i Σ_j w_ij x_i x_j   and   BW = (1/2) Σ_i Σ_j w_ij (x_i − x_j)².
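Under these definitions, and assuming rook connectivity on a small 4×4 lattice with made-up occupancy data, the join counts can be computed directly:

```r
# Two 2x2 black blocks on a 4x4 grid; rook joins counted via a binary
# weight matrix (each join appears twice in w, hence the division by 2).
xy  <- expand.grid(x = 1:4, y = 1:4)
occ <- c(1,1,0,0, 1,1,0,0, 0,0,1,1, 0,0,1,1)   # 1 = black (occupied)
d <- as.matrix(dist(xy))
w <- (d > 0 & d <= 1) * 1                      # rook neighbours
BB <- sum(w * outer(occ, occ)) / 2
BW <- sum(w * outer(occ, occ, function(a, b) (a - b)^2)) / 2
c(BB = BB, BW = BW)
```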
Test for significance
A permutation or Monte Carlo test can be used to test the significance of the BB or BW statistics. Typically, significance is tested by assuming an asymptotic normal distribution, with the randomization moments of BB and BW given by Cliff & Ord (1981):
Cliff, A.D. & Ord, J.K. 1981. Spatial Processes: Models and Applications. Pion, London.
3. Moran's I
The association of the z values at locations i and j can be measured in several ways: covariance-based (Moran's I) or distance-based (Geary's c).

I(h) = (n / W) · Σ_i Σ_j w_ij (z_i − z̄)(z_j − z̄) / Σ_i (z_i − z̄)²

where z_i is the data value at location i, h is the distance between locations i and j, w_ij takes 1 if the pair (i, j) pertains to distance class h (the one for which the coefficient is computed) and 0 otherwise, and W is the sum of the w_ij. Moran's I behaves like Pearson's correlation coefficient because its numerator is a covariance term. Thus, it is sensitive to extreme values.
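The formula can be checked by hand on the same 4×4 example used later with spdep (binary rook weights; variable names are my own):

```r
# Hand-rolled Moran's I for z = 1:16 on a 4x4 grid with rook neighbours.
xy <- expand.grid(x = 1:4, y = 1:4)
z  <- 1:16
d  <- as.matrix(dist(xy))
w  <- (d > 0 & d <= 1) * 1      # w_ij = 1 for rook neighbours, else 0
n  <- length(z); W <- sum(w)
zc <- z - mean(z)
I  <- (n / W) * sum(w * outer(zc, zc)) / sum(zc^2)
I                               # positive: values vary smoothly over the grid
```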
Testing Moran's I
If the z_i are iid normal, it can be shown that, as n becomes large, Moran's I is asymptotically normally distributed with the first two moments

E(I) = −1 / (n − 1)
Var(I) = (n²S₁ − nS₂ + 3W²) / (W²(n² − 1)) − E(I)²

where S₁ = (1/2) Σ_i Σ_j (w_ij + w_ji)² and S₂ = Σ_i (Σ_j w_ij + Σ_j w_ji)².
Two types of test:
1. Normal test if the data are iid normally distributed.
2. Permutation (randomization) test for any data distribution.
4. Geary's c

c(h) = ((n − 1) / (2W)) · Σ_i Σ_j w_ij (z_i − z_j)² / Σ_i (z_i − z̄)²

• Unlike Moran's I, this coefficient is a distance-type function because its numerator sums the squared differences between values at the various pairs of locations.
• In fact, Geary's c corresponds in form to the Durbin-Watson statistic used to test for autocorrelation in the residuals of a regression (particularly with time-series data).
• A side note on the effects of autocorrelation on regression analysis:
• OLS regression coefficients are still unbiased, but they are no longer minimum-variance estimates, i.e., the estimates are inefficient.
• If the residuals are positively autocorrelated, the residual mean square error (MSE) may seriously underestimate σ², so the standard errors may be too small. The confidence intervals are then shorter than they really should be (see Cressie p. 13-15), and hypothesis tests on individual regression coefficients may indicate that one or more regressors contribute significantly to the model when they really do not, inflating the Type I error. Generally, underestimating σ² gives a false impression of accuracy.
• The confidence intervals and hypothesis tests based on the t- and F-distributions are not really appropriate.
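Geary's c can likewise be computed by hand on the 4×4 rook-lattice example, matching the formula above:

```r
# Hand-rolled Geary's c for z = 1:16 on a 4x4 grid with rook neighbours.
xy <- expand.grid(x = 1:4, y = 1:4)
z  <- 1:16
d  <- as.matrix(dist(xy))
w  <- (d > 0 & d <= 1) * 1
n  <- length(z); W <- sum(w)
c.stat <- ((n - 1) / (2 * W)) *
          sum(w * outer(z, z, function(a, b) (a - b)^2)) /
          sum((z - mean(z))^2)
c.stat   # well below 1, indicating positive autocorrelation
```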
Testing Geary's c
Similar to Moran's I, c is asymptotically normally distributed if the z_i are iid normal and n is large. The first two moments are

E(c) = 1
Var(c) = ((2S₁ + S₂)(n − 1) − 4W²) / (2(n + 1)W²)

Two types of test:
1. Normal test if the data are iid normally distributed.
2. Permutation (randomization) test for any data distribution.
spdep functions for Moran's I and Geary's c
• > xy.dat=expand.grid(x=1:4, y=1:4) # x and y locations
• > xy.z=1:16 # illustrative data
• > xy.nb=dnearneigh(as.matrix(xy.dat), 0, 1) # create the neighborhood structure
• > str(xy.nb) # view the neighbors
• > plot(xy.nb, xy.dat) # plot the neighborhood
• > xy.W=nb2listw(xy.nb) # generate a weight for each neighbor (equal weights within a neighborhood by default)
• > moran(xy.z, xy.W, length(xy.nb), Szero(xy.W)) # compute Moran's I
• > geary(xy.z, xy.W, length(xy.nb), length(xy.nb)-1, Szero(xy.W)) # compute Geary's c
sp.correlogram computes the correlogram (the autocorrelation coefficient plotted against distance). However, care is needed for its proper use. When using sp.correlogram, only use the basic nb object that connects adjacent cells; e.g., for regular grids we only need xy.nb=dnearneigh(as.matrix(xy.dat), 0, 1) to define the nb (higher orders of neighborhood are not needed). If a higher-order neighborhood structure is used, e.g., xy.nb=dnearneigh(as.matrix(xy.dat), 0, 3), sp.correlogram will still compute a correct correlogram, but the lags will have a slightly different tolerance band.
Styles in nb2listw
>xy.W=nb2listw(xy.nb, style="W")
It is helpful to understand the different weighting styles in nb2listw. There are five options for style: "W", "B", "C", "U" and "S"; "W" is the default. You can use >str(xy.W) to view the difference between the styles.
"W": row standardised. It weighs the neighbors of each focal point independently from the neighbors of the other focal points, and each neighbor of a focal point gets the same weight. For example, if a focal point has two neighbors, the weight for each neighbor is 0.5; if a focal point has four neighbors, the weight for each neighbor is 0.25. The weights sum over all focal points to the total number of focal points, n.
"B": binary weights, i.e., each neighbor of a focal point contributes 1 to the focal point. The weights sum over all focal points to the total number of neighbor links.
"C": globally standardised. Each link gets the weight n divided by the total number of links (the sum calculated under "B"), so the weights sum to n.
"U": equal to "C" divided by n, so the weights sum to 1.
"S": the variance-stabilising coding scheme proposed by Tiefelsdorf et al. (1999, p. 167-168). The weights sum to n.
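The "B" vs "W" distinction can be seen without spdep by building the weight matrices directly; a sketch on the 4×4 rook lattice:

```r
# Binary vs row-standardised weights: each row of Wm sums to 1, so the
# grand total of Wm equals n, while B sums to the number of links.
xy <- expand.grid(x = 1:4, y = 1:4)
d  <- as.matrix(dist(xy))
B  <- (d > 0 & d <= 1) * 1       # "B"-style binary weights
Wm <- B / rowSums(B)             # "W"-style row-standardised weights
c(links = sum(B), total.W = sum(Wm))
```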
Let's compute the correlogram for the Gigante surface-water pH values
> soil87.xy=data.frame(x=soil87.dat$gx,y=soil87.dat$gy) # x and y locations
> soil87.nb=dnearneigh(as.matrix(soil87.xy), 0, 60) # create the neighborhood structure
> plot(soil87.nb, soil87.xy) # plot the nb structure
> soil87.pH.correlogram=sp.correlogram(soil87.nb, soil87.dat$pHFsrf, order=18, method="I", zero.policy=T) # compute the correlogram
> soil87.pH.correlogram
> plot(soil87.pH.correlogram, main="Gigante surface water pH values")
[Figure: the neighborhood structure and the resulting correlogram]
R function for Moran's I and Geary's c
The R program autocorrelation.r computes both I and c for a given data set at a particular distance class.
>autocorrelation.r(data,x,y,dist1,dist2,k=0)
# data – the data array whose autocorrelation we want to test
# x – the x-coordinates for the data array
# y – the y-coordinates for the data array
# dist1 and dist2 – used to define neighbors; locations falling within (dist1, dist2) of a focal location are defined as its neighbors, excluding dist1 and dist2 themselves
# k – the number of simulations for computing the permutation distribution under the null hypothesis of no spatial autocorrelation (default k=0)
Example:
>autocorrelation.r(soil87.dat$pHFsrf,soil87.dat$gx,soil87.dat$gy,50,100,k=99)
LISAs (Local Indicators of Spatial Autocorrelation)
• We can rewrite Mantel's statistic M2 in terms of the contribution of each individual site:

M2 = Σ_i M_i, where M_i = Σ_j W_ij U_ij.

• M_i can be thought of as the autocorrelation measure associated with the ith site, a local measure.
• Definition of a LISA: (1) it measures the extent of spatial autocorrelation around a particular site, and (2) its sum is proportional to a global measure of spatial autocorrelation such as BB, I or c.
• Applications:
• Detect regions where the autocorrelation is markedly different from other areas.
• Detect local spatial clusters (hotspots). There are many definitions of hot (cold) spots. One way is to use a histogram and identify as hot (cold) spots those sites where the observations are extreme. LISAs enable the hotspot definition to take into account the values surrounding a site: a hot spot can be defined as one or several contiguous sites where the LISAs are unusually large or small.
• Distinguish spatial outliers from distributional outliers. A spatial outlier may not be a distributional outlier.
Anselin, L. 1995. Local indicators of spatial association – LISA. Geographical Analysis 27:93-115.
Moran's LISAs
A local version of Moran's I can be written as

I_i = ((z_i − z̄) / m₂) Σ_j w_ij (z_j − z̄), with m₂ = Σ_i (z_i − z̄)² / n.

It is easy to see that the sum of the local Moran's I_i is proportional to the global I (scaled by W, the total weight):

Σ_i I_i = W · I.

Anselin (1995) derives the mean and variance of I_i under the randomization assumption. The spdep package implements this as "localmoran".
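The proportionality Σ_i I_i = W·I can be verified numerically on the same 4×4 rook-lattice example (hand-rolled rather than spdep's localmoran):

```r
# Local Moran's I_i and the check that its sum equals W times the global I.
xy <- expand.grid(x = 1:4, y = 1:4)
z  <- 1:16
d  <- as.matrix(dist(xy))
w  <- (d > 0 & d <= 1) * 1
n  <- length(z); W <- sum(w)
zc <- z - mean(z)
m2 <- sum(zc^2) / n
Ii <- (zc / m2) * as.numeric(w %*% zc)   # local Moran's I for each site
I  <- (n / W) * sum(w * outer(zc, zc)) / sum(zc^2)
c(sum(Ii), W * I)                        # the two quantities agree
```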
A few technical comments • Spatial autocorrelation analysis should not be performed with fewer than ~30 data points because the number of location pairs in each distance class would then become too small to produce reliable results. • There are two ways of dividing distance into classes: (1) by forming equal distance classes (as we did for the soil pH values in soil87.dat), (2) by forming classes with equal frequencies (i.e., distance h is not equally spaced, but for each distance class the number of locations are approximately equal). • Spatial autocorrelation analysis should not be performed with a data containing a lot of 0’s, because the degree of autocorrelation would then be overestimated and would reflect the fact that the locations share their absence for that variable, which is not what is intended in most applications. • Euclidean distances between locations may not be the best way to express spatial relationships in applications. Such a spatial structure should be defined depending on the particular purposes of a study. • Legendre, P. and Fortin, M.-J. 1989. Spatial pattern and ecological analysis. Vegetatio 80:107-138.