Introduction to Spatial Statistics. Geostatistics Group: Faye Belshe, Smitri Bhotika, Mike Gil, Mike Hyman, Kenny Lopiano, Jada White. Slides contributed by Dr. Christman and Dr. Young. October 9, 2009
Types of Spatial Data • Continuous Random Field • Lattice Data • Point Pattern Data Note: Each type of data is analyzed differently
Geostatistics • Geostatistical analysis is distinct from other spatial models in the statistics literature in that it assumes the region of study is continuous • Observations could be taken at any point within the study area • Interpolation at points in between observed locations makes sense
Spatial Autocorrelation • Spatial modeling is based on the assumption that observations close in space tend to co-vary more strongly than those far from each other • Positively co-vary: nearby values tend to be similar • E.g. elevation (or depth) tends to be similar for locations close together • Negatively co-vary: nearby values tend to be opposite • E.g. density of an organism that is highly spatially clustered, where observations in between clusters are low and values within clusters are high
Covariance • Definition: two variables are said to co-vary if their correlation coefficient is not zero, i.e. $\rho_{XY} = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y) \neq 0$, where $\rho_{XY}$ is the correlation coefficient between X and Y and $\sigma_X$ ($\sigma_Y$) is the standard deviation of X (Y) • Consider this in the context of a single variable • E.g. do nearest neighbors have non-zero covariance?
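As a small illustration of the definition, the sketch below computes a sample correlation coefficient in plain Python and applies it to values paired with their nearest neighbour's value. The function name `correlation` and the toy numbers are assumptions made for illustration, not data from the slides.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient: Cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical toy data: the value at each site and the value at its nearest neighbour.
site_values = [1.0, 2.0, 3.0, 4.0, 5.0]
neighbour_values = [1.2, 1.9, 3.4, 3.8, 5.1]
r = correlation(site_values, neighbour_values)
```

A value of `r` well away from zero would suggest that neighbouring observations co-vary, which is the single-variable question posed on the slide.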
Continuous Data – Geostatistics Notation Z(s) is the random process at location s = (x, y). z(s) is the observed value of the process at location s = (x, y). D is the study region. The sample is the set {z(s) : s ∈ D}. We say that it is a partial realization of the random spatial process {Z(s) : s ∈ D}
Conceptual Model
$$Z(s) = \mu(s) + W(s) + \eta(s) + \varepsilon(s)$$
where
• $\mu(s)$ is the mean structure; called large-scale non-spatial trend
• W(s) is a zero-mean, stationary process whose autocorrelation range is larger than min{||s_i – s_j|| : i ≠ j; i, j = 1, 2, …, n}; called smooth small-scale variation
• $\eta(s)$ is a zero-mean, stationary process whose autocorrelation range is smaller than min{||s_i – s_j|| : i ≠ j; i, j = 1, 2, …, n} and which is independent of W(s); called micro-scale variation or measurement error
• $\varepsilon(s)$ is the random noise term with zero mean and constant variance and which is independent of W(s) and $\eta(s)$
Simpler Conceptual Model
$$Z(s) = \mu(s) + \delta(s) + \varepsilon(s)$$
where
• $\mu(s)$ is the mean structure; called large-scale non-spatial trend
• $\delta(s) = W(s) + \eta(s)$ is a zero-mean, stationary process with autocorrelation which combines the smooth small-scale and micro-scale variation
• $\varepsilon(s)$ is the random noise term with zero mean and constant variance which is independent of W(s) and $\eta(s)$
Graphical Concept with Trend Red line indicates large-scale trend Green line shows how the data are arranged around the trend Note that there is a pattern to the points around the red line. The pattern implies possible positive autocorrelation in Z(x). Finally, there is white noise.
Graphical Concept without Trend Red line indicates a constant mean, i.e. no large-scale trend Green line shows how the data are arranged around the mean Again, the pattern of the green line implies possible positive autocorrelation in Z(x)
Important Point • The model indicates that Z can be decomposed into large-scale variation, small + micro-scale variation, and noise • The reality is that any estimated decomposition is not unique • E.g. in the graph just shown, we could have instead added a sinusoidal aspect to the large-scale trend and hence captured much of the apparent autocorrelation
Example Red line indicates large-scale trend captured by a sinusoidal + linear trend Green line shows how the data are arranged around the trend Note that now there is no obvious pattern and so the remaining unexplained variation is likely white noise in Z(x).
Modeling • Ultimately we want to do modeling of Z using the geostatistical model • Requires estimates of the model components • the mean • the small-scale variation and the covariances among Z values at different locations • Any “leftovers”, i.e. the unexplained or residual variability
Important Point • The choice of approach (detailed fit of a trend vs. large-scale trend + autocorrelation) to estimating/predicting Z depends strongly on the reason for and uses of the model • E.g. if you are interested in predicting Z at unsampled locations within the study area, then any model that uses covariates to estimate large-scale trend must also have the covariates known for the unsampled locations • E.g. if you are interested in understanding the reasons for the spatial distribution of Z then you may or may not want to incorporate a spatial correlation component
Correlation Structure (Semivariogram) • Now, to assess spatial autocorrelation we look at the behavior of the following: $\gamma_{ij} = \tfrac{1}{2}\big[Z(s_i) - Z(s_j)\big]^2$ for every possible pair of locations in the dataset (N locations yields N(N-1)/2 pairs). • Correlated: we would expect Z(s_i) to be similar in value to Z(s_j) and hence the squared difference to be small. • Independent: we would expect the squared difference to be relatively large since the two numbers would vary according to the population variability.
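The pairwise quantity described above can be sketched in a few lines of standard-library Python. The function name `variogram_cloud`, the factor of one half (chosen to match the "semi" convention used later in the slides) and the toy data are assumptions for illustration.

```python
import math
from itertools import combinations

def variogram_cloud(coords, values):
    """Return (distance, half squared difference) for every pair of locations."""
    cloud = []
    for i, j in combinations(range(len(coords)), 2):
        dist = math.dist(coords[i], coords[j])
        gamma_ij = 0.5 * (values[i] - values[j]) ** 2
        cloud.append((dist, gamma_ij))
    return cloud

# Hypothetical toy data: four locations (x, y) and their observed z values.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
values = [1.0, 1.5, 0.5, 4.0]
cloud = variogram_cloud(coords, values)  # 4 locations give 4 * 3 / 2 = 6 pairs
```

Plotting the second element of each tuple against the first gives the variogram cloud.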
Plot (Variogram Cloud) Variogram cloud for a dataset of 400 observations Looking for pattern, i.e. is there a trend in γ with respect to distance between two locations
Empirical Variogram • The variogram cloud is usually very uninformative • Difficult to discern trend or pattern • More pertinent is to calculate the average values of γ for different distances • Problem is we don't usually have discrete distances between locations (happens only when data are on a perfect grid). • A common method for averaging γ at specific distances is to bin the distances into intervals (called lag distances), i.e. use all points within some bin width around a given distance value
Continuous Data – Geostatistics Because we do not usually have lots of values at discrete distances, a common method for averaging the values at discrete distances is to use all points within some bin width around a given distance value. So we choose several levels of h (distances) and calculate the empirical (semi)variogram:
$$\hat{\gamma}(h) = \frac{1}{2\,|N(h)|} \sum_{(s_i, s_j) \in N(h)} \big[Z(s_i) - Z(s_j)\big]^2$$
where N(h) is the set of all pairs of locations that are a distance of h apart within a tolerance region around h, i.e.
$$N(h) = \{(s_i, s_j) : h - \Delta \le \|s_i - s_j\| \le h + \Delta\}$$
for a chosen tolerance $\Delta$, and |N(h)| is the number of pairs in N(h).
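The binning procedure can be sketched as follows. This is a minimal standard-library version of the classical estimator; the function name `empirical_semivariogram`, the parameter `tol` (the half-width of each distance bin) and the toy transect are assumptions for illustration.

```python
import math
from itertools import combinations

def empirical_semivariogram(coords, values, lags, tol):
    """Classical estimator: for each lag h, average half the squared differences
    over all pairs whose separation distance lies within tol of h."""
    result = []
    for h in lags:
        total, count = 0.0, 0
        for i, j in combinations(range(len(coords)), 2):
            dist = math.dist(coords[i], coords[j])
            if abs(dist - h) <= tol:
                total += (values[i] - values[j]) ** 2
                count += 1
        gamma_hat = total / (2 * count) if count else float("nan")
        result.append((h, gamma_hat, count))
    return result

# Hypothetical transect: five equally spaced points along a line.
coords = [(float(x), 0.0) for x in range(5)]
values = [1.0, 2.0, 2.0, 4.0, 5.0]
semivariogram = empirical_semivariogram(coords, values, lags=[1.0, 2.0, 3.0], tol=0.5)
```

Each returned tuple holds the lag, the estimate and the number of pairs in the bin; the pair count is worth keeping because bins with few pairs give unstable estimates.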
Empirical Semivariogram • This plot is called an omnidirectional classical empirical semivariogram • Omnidirectional because the direction between the pairs of locations was ignored • Classical because of the equation used to estimate the mean (alternatives exist that are robust to outliers or to failure of assumptions of the model) • Semi because of the division by 2 in the equation used Graph based on a set of 20 distance lags
Important Points • The constantly increasing semi-variogram indicates that there is a problem with this dataset • Ideally, it should at some distance level off at the variance of the process implying that at some distance the relationship between 2 locations is the same regardless of the distance between them (i.e. observations are independent at large distances) • This graph indicates that • The data imply correlation exists at all distances (and therefore the study region is small relative to the range of autocorrelation) or • The data have a large-scale trend which may account for most of the seeming autocorrelation (small-scale trend)
Semivariogram Empirical semivariogram for a different dataset in which there was no large-scale trend but definite autocorrelation Note the rise and then leveling off of the γ(h) values as distance increases We’ll cover shapes for variograms in more detail later
Semivariogram Empirical semivariogram for a different dataset in which there was no large-scale trend and no autocorrelation Note that the γ(h) values are more-or-less the same regardless of distance
Important Points • If the empirical semivariogram increases with the distance between locations, then the correlation between points is decreasing as distance increases • The distance at which it flattens to a constant value is the distance beyond which any two points are independent. The value of γ at that plateau is the variance of the spatial process • At this point in our analyses, the number of lag distances you use is not that critical but when we try to fit a curve to the empirical semivariogram later the number of lags becomes very important
Important Point About Directionality • Another point to consider is whether the pattern of autocorrelation, i.e. the shape of the curve describing the semivariogram, is the same in every direction. • Can’t tell from the omnidirectional plot. • Need to check if there is a directional effect
Directional Semivariograms • To check directionality in the covariance, plot γ for each h for different directions • Modify the sets of locations over which the averaging occurs • Typically done using a set of binned directions (wedges of the compass) • Requires that you modify the definition of neighborhood
Directional Semivariograms EXAMPLE: calculate mean variability for the angles 0°, 22.5°, 45°, 67.5°, 90°, and 112.5° with a tolerance of 11.25° on each side.
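A directional version of the classical estimator can be sketched by restricting the averaging to pairs whose direction falls inside an angular wedge. The function name, the parameter names and the convention that angles are measured counter-clockwise from the x-axis and taken modulo 180° are all assumptions for illustration; software packages differ on the angle convention.

```python
import math
from itertools import combinations

def directional_semivariogram(coords, values, lags, lag_tol, angle, angle_tol):
    """Classical estimator restricted to pairs whose direction (in degrees,
    counter-clockwise from the x-axis, modulo 180) is within angle_tol of angle."""
    result = []
    for h in lags:
        total, count = 0.0, 0
        for i, j in combinations(range(len(coords)), 2):
            dx = coords[j][0] - coords[i][0]
            dy = coords[j][1] - coords[i][1]
            dist = math.hypot(dx, dy)
            pair_angle = math.degrees(math.atan2(dy, dx)) % 180.0
            diff = abs(pair_angle - angle) % 180.0
            diff = min(diff, 180.0 - diff)  # angular distance on a half circle
            if abs(dist - h) <= lag_tol and diff <= angle_tol:
                total += (values[i] - values[j]) ** 2
                count += 1
        gamma_hat = total / (2 * count) if count else float("nan")
        result.append((h, gamma_hat, count))
    return result

# Hypothetical 3 x 3 grid whose values change only along the x direction.
coords = [(float(x), float(y)) for x in range(3) for y in range(3)]
values = [c[0] for c in coords]
along_x = directional_semivariogram(coords, values, [1.0], 0.25, 0.0, 11.25)
along_y = directional_semivariogram(coords, values, [1.0], 0.25, 90.0, 11.25)
```

In this toy grid the estimate along x is positive while the estimate along y is zero, which is the kind of directional difference the wedge-based plots are meant to reveal.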
Need for Assumptions in Order to Proceed Beyond This Point • The data that are collected are a partial observation of the spatial surface (e.g. map) that we are interested in • In addition, it is usually assumed that there is some “super process” that created the particular surface for which we have this partial view • To estimate the spatial autocorrelation we need to make some assumptions. • Otherwise, we don’t have sufficient information to make any inferences.
Two Assumptions • Stationarity, specifically second-order stationarity • Isotropy
Stationarity • The mean of the process is constant, i.e. no trend: $\mu(s) = \mu$ for all $s \in D$ (1) • The covariance between any pair of points depends only on the distance (and possibly direction) between the points, NOT the location of the points in space: $\mathrm{Cov}\big(Z(s_i), Z(s_j)\big) = C(s_i - s_j)$, where C(.) is the covariance function • This implies that the variance of Z is constant everywhere • If both conditions are met then the spatial process we are studying is said to be second-order stationary.
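Relationship between Semivariogram and Correlation
Assuming intrinsic stationarity, we have
$$2\gamma(h) = \mathrm{Var}\big(Z(s+h) - Z(s)\big).$$
Now, assuming that $\mathrm{Cov}\big(Z(s+h), Z(s)\big) = C(h)$ (second-order stationarity), we have
$$\gamma(h) = C(0) - C(h),$$
where $C(0) = \mathrm{Var}\big(Z(s)\big)$. Thus,
$$\rho(h) = \frac{C(h)}{C(0)} = 1 - \frac{\gamma(h)}{C(0)}.$$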
Isotropy • The covariance between any pair of points does not depend on direction but only on distance: $\mathrm{Cov}\big(Z(s_i), Z(s_j)\big) = C(\|s_i - s_j\|)$. If this holds then the spatial process is said to be isotropic
Non-Constant Mean • Two ways to handle a trend when it does exist: • Detrend the data using regression (or similar) with covariates and then use the residuals from the trend analysis for the spatial autocorrelation analysis • E.g. disease rates as a function of population density • Universal kriging (UK) which allows for estimating the trend as a global polynomial in s = (x, y) and estimating the spatial autocorrelation simultaneously • UK ignores other explanatory covariates which can be advantageous or not depending on the purpose of your study
Non-Constant Variance • To account for heterogeneity (non-constant variance), • estimate variability in smaller subregions of the study area • Need to make decisions about the size and extent of the subregions • Need sufficient numbers of observations within each subregion • Transform or standardize your data so that the variability of the transformed values is constant over the region
Anisotropy • Two types of anisotropy • Geometric • the range over which correlation is non-zero depends on direction • The variance is constant over all directions • This type can be adjusted for in geostatistical analyses • Zonal • Anything not geometric anisotropy • Anisotropy implies that the spatial process evolves differentially throughout the study region
Variography • Fitting a valid semivariogram function to the empirical semivariogram • Now we are interested in describing the variogram as an equation in which variance is a function of the distance. • We shall assume that the spatial process is second-order stationary and isotropic in the following.
Semivariogram We have already seen how to obtain the empirical variogram. $\gamma(h)$ is the semivariogram and is the primary quantity of interest because, under second-order stationarity, it determines the covariance function: $C(h) = C(0) - \gamma(h)$. Now we are interested in describing the semivariogram as a function of the distance. We shall assume that the spatial process is second-order stationary and isotropic in the following.
Semivariogram Semivariogram Models have the following properties: 1) Many are not linear in their parameters 2) Must be “conditionally negative-definite”, i.e. the function must satisfy $\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j \gamma(s_i - s_j) \le 0$ for any real numbers $a_1, \dots, a_m$ satisfying $\sum_{i=1}^{m} a_i = 0$ 3) If $\gamma(h) \to c_0 > 0$ as $h \to 0$, there is microscale variation which is assumed to be due to measurement error (ME) or a process occurring at the microscale. ME is measurable only if we have replicate values at each location in the sample.
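The conditional negative-definiteness condition can be checked numerically for a given model. The sketch below evaluates the double sum for an exponential semivariogram without nugget, using random locations and random weights centred to sum to zero. The model choice, parameter values and function names are assumptions for illustration; a single numerical check illustrates the property but does not prove it.

```python
import math
import random

def exponential_semivariogram(h, sill=1.0, range_param=2.0):
    """Exponential model without nugget: gamma(h) = sill * (1 - exp(-h / range_param))."""
    return sill * (1.0 - math.exp(-h / range_param))

def quadratic_form(coords, weights, gamma):
    """Double sum over i and j of a_i * a_j * gamma(||s_i - s_j||)."""
    return sum(
        a_i * a_j * gamma(math.dist(s_i, s_j))
        for a_i, s_i in zip(weights, coords)
        for a_j, s_j in zip(weights, coords)
    )

random.seed(0)
coords = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(8)]
raw = [random.gauss(0, 1) for _ in coords]
mean_raw = sum(raw) / len(raw)
weights = [a - mean_raw for a in raw]  # centred so that the weights sum to zero
q = quadratic_form(coords, weights, exponential_semivariogram)  # should not be positive
```

The condition matters because the variance of a weighted combination with zero-sum weights equals minus this double sum, so a positive value would correspond to a negative variance.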
Semivariogram Semivariogram Models have the following properties: • If $\gamma(h)$ is constant for every h except h = 0 where $\gamma(0) = 0$, then Z(s) and Z(t) are uncorrelated for any pair of locations s and t • $\lim_{\|h\| \to \infty} \gamma(h) / \|h\|^2 = 0$, i.e. $\|h\|^2$ is increasing faster than $\gamma(h)$ as $\|h\|$ increases
Characteristics of the Semivariogram • It is 0 when the separation distance is 0 (γ(0) = 0). • Nugget effect: variation between two points very close together. • May be measurement error • May be indicative of an erratic process (gold ore). • The sill corresponds to the overall variance of the data. • Data separated by distances less than the range are spatially autocorrelated (less variation between close observations than between far observations).
Estimating the Semivariogram • Take all pairwise differences in the data: Z(s_i) - Z(s_j), where s = (x, y) is a point in the 2-D plane. • Compute the Euclidean distance between the spatial locations: $\|s_i - s_j\| = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$ • Average half the squared differences over pairs that fall in the same distance class; • “Binning”: like a 2-D histogram.
Modeling the Semivariogram • The semivariogram measures variation among units h units apart. • Note: We do not want negative standard errors. • So, we model the semivariogram with selected parametric functions ensuring all standard errors are nonnegative. • We estimate the nugget, sill, and range parameters of the model that best fit the empirical semivariogram (nonlinear least squares problem).
Selected semivariogram models
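Covariogram Models (written here in one common parameterization, for distance h > 0, partial sill $\sigma^2$ and range parameter a)
• Spherical Model: $C(h) = \sigma^2 \left[1 - \frac{3h}{2a} + \frac{h^3}{2a^3}\right]$ for $h \le a$, and $C(h) = 0$ for $h > a$
• Gaussian Model: $C(h) = \sigma^2 \exp(-h^2 / a^2)$
• Exponential Model: $C(h) = \sigma^2 \exp(-h / a)$
• Power Model: $C(h) = \sigma^2 \rho^{h}$, which is simply a reparameterization of the exponential model with $\rho = \exp(-1/a)$.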
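The named models can be written as short functions. The sketch below uses common textbook parameterizations, which are an assumption here and may differ from the ones on the original slides by a constant rescaling of the range parameter. All function and parameter names are illustrative.

```python
import math

def spherical_cov(h, partial_sill, range_param):
    """Spherical covariogram: reaches exactly zero at h = range_param."""
    if h >= range_param:
        return 0.0
    r = h / range_param
    return partial_sill * (1.0 - 1.5 * r + 0.5 * r ** 3)

def exponential_cov(h, partial_sill, range_param):
    """Exponential covariogram: decays towards zero without reaching it."""
    return partial_sill * math.exp(-h / range_param)

def gaussian_cov(h, partial_sill, range_param):
    """Gaussian covariogram: very smooth behaviour near h = 0."""
    return partial_sill * math.exp(-((h / range_param) ** 2))

def power_cov(h, partial_sill, rho):
    """Power covariogram: the exponential model with rho = exp(-1 / range_param)."""
    return partial_sill * rho ** h

def semivariogram_from_cov(cov, h, partial_sill, param, nugget=0.0):
    """gamma(h) = C(0) - C(h) with C(0) = nugget + partial_sill, and gamma(0) = 0."""
    if h == 0:
        return 0.0
    return nugget + partial_sill - cov(h, partial_sill, param)

# The power and exponential forms agree once rho is matched to the range parameter.
same = math.isclose(exponential_cov(3.0, 2.0, 1.5), power_cov(3.0, 2.0, math.exp(-1 / 1.5)))
```

The helper `semivariogram_from_cov` converts any of these covariograms into the corresponding semivariogram, adding an optional nugget.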
Covariogram vs. Semivariogram The covariogram and semivariogram are related: $\gamma(h) = C(0) - C(h)$
The fitted semivariogram model Estimates: nugget=0.084, sill=0.269, range=110.3 miles
Common methods for fitting these functions to a set of empirical semivariogram means:
1) Choose the most likely candidate model
2) Estimate the parameters of the model using one of:
• non-linear least squares estimation – allows for the estimation of parameters that enter the equation non-linearly but ignores any dependence among the empirical variogram values
• non-linear weighted least-squares – generalized least squares in which the variance-covariance of the variogram data points is accounted for in the estimation procedure
• maximum likelihood – assumes the data are Normally distributed, but the estimators are likely to be highly biased, especially in small samples (the usual remedy is jackknifing)
• restricted maximum likelihood – maximizes a slightly altered likelihood function which reduces the bias of the MLEs
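To make the least-squares idea concrete, the sketch below fits an exponential semivariogram with nugget to a set of binned values by minimizing a sum of squared residuals weighted by the number of pairs in each bin. A crude grid search stands in for a proper non-linear optimizer, and weighting by pair counts is one simple choice rather than the generalized least squares described above. The data are synthetic values generated from known parameters, and all names are hypothetical.

```python
import math

def exponential_model(h, nugget, partial_sill, range_param):
    """Exponential semivariogram with nugget, for h > 0."""
    return nugget + partial_sill * (1.0 - math.exp(-h / range_param))

def weighted_sse(params, lags, gammas, counts):
    """Sum of squared residuals, each weighted by the number of pairs in its bin."""
    return sum(
        n * (g - exponential_model(h, *params)) ** 2
        for h, g, n in zip(lags, gammas, counts)
    )

def grid_search_fit(lags, gammas, counts, nuggets, sills, ranges):
    """Return the (nugget, partial_sill, range) triple on the grid with the smallest weighted SSE."""
    candidates = [(c0, c1, a) for c0 in nuggets for c1 in sills for a in ranges]
    return min(candidates, key=lambda p: weighted_sse(p, lags, gammas, counts))

# Synthetic, noise-free "empirical" values generated from known parameters.
true_params = (0.1, 0.9, 3.0)
lags = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0]
gammas = [exponential_model(h, *true_params) for h in lags]
counts = [40, 60, 70, 65, 50, 30, 15]

value_grid = [round(0.1 * k, 1) for k in range(21)]  # 0.0, 0.1, ..., 2.0
range_grid = [0.5 * k for k in range(1, 21)]         # 0.5, 1.0, ..., 10.0
fit = grid_search_fit(lags, gammas, counts, value_grid, value_grid, range_grid)
```

With real data the residuals would not vanish and a numerical optimizer would normally replace the grid, but the objective being minimized has the same form.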
Properties of Variogram Models • If $\gamma(h) \to c_0 > 0$ as $h \to 0$ then there is microscale variation • Usually assumed to be due to measurement error (ME) • ME is measurable only if we have replicate values at each location in the sample • When fitting a variogram function, you may estimate a non-zero value for $c_0$ even when you do not have replicate observations at sites. This is called the nugget. • If $\gamma(h)$ is constant for every h except h = 0 where $\gamma(0) = 0$, then Z(s_i) and Z(s_j) are uncorrelated for any pair of locations s_i and s_j