Lecture #2: quantitative regionalization and cluster detection, with special reference to local statistics. Spatial statistics in practice. Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden.
Topics for today’s lecture
• Multivariate grouping, and location-allocation modeling.
• Going from the global to the local: variability and heterogeneity.
• Impacts of spatial autocorrelation on histograms.
• The LISA and Getis-Ord statistics.
• Cluster analysis: multivariate analysis, cluster detection, and spider diagrams.
• An overview of geographic and space-time clusters.
• Regression diagnostics and geographic clusters.
Multivariate grouping goals
• If groups are unknown, to identify the latent natural groups of areal units
• If groups are known, to assess similarities and differences among the groups
• To determine the group centroids and groups of geographical points that result from minimizing some function of standard distance
Conventional cluster analysis: distances to minimize
• Single linkage: distances are measured between pairs of closest (nearest neighbor) areal units, one from each of two clusters, in attribute space
• Complete linkage: distances are measured between pairs of most distant (furthest neighbor) areal units, one from each of two clusters, in attribute space
• This criterion often gives the best grouping results
• Average linkage: distances are measured between all possible pairs of areal units, one from each of two clusters, in attribute space, and then averaged
• Centroid method: squared distances are measured between each areal unit and all cluster means, in attribute space
• Ward’s algorithm: based upon ANOVA, areal units are allocated to clusters in order to minimize within-cluster variances, and maximize between-cluster variance
• This criterion relates to location-allocation
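The single, complete, and average linkage criteria can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance in attribute space; the function names and the two small clusters are hypothetical and do not come from the lecture.

```python
from itertools import product
from math import dist


def pair_distances(cluster_a, cluster_b):
    """Attribute-space distances for every pair with one unit from each cluster."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest (nearest neighbor) pair."""
    return min(pair_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the most distant (furthest neighbor) pair."""
    return max(pair_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all between-cluster pair distances."""
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)


# Hypothetical attribute vectors for two clusters of areal units.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
```

An agglomerative procedure would merge, at each step, the pair of clusters for which the chosen criterion is smallest.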
Contemporary cluster analysis criteria
• One- or two-stage density: areal unit groupings are based upon nonparametric probability density estimation (kth nearest neighbor, uniform kernel, Wong’s hybrid); utilizes single linkage
• EML (equal variance maximum likelihood): areal unit groupings are based upon maximizing the likelihood of mixtures of identical spherical multivariate normal distributions, possibly with unequal mixing proportions (i.e., sampling probabilities)
• Flexible-beta: areal unit groupings are based upon a weighting involving scalar beta, which usually falls between 0 and -1 (a common default value is -0.25, with -0.5 appearing to be more suitable for data with many outliers)
• McQuitty’s method: areal unit groupings are based upon weighted average linkage, the weighted pair-group arithmetic averages
• Gower’s median method: areal unit groupings are based upon weighted pair-group centroids, where distance may or may not be squared
Clustering with PCA/FA
• Although PCA & FA are used most frequently to deal with multicollinearity across attribute variables (R-mode), these techniques also can be used to handle redundant information across areal units (Q-mode; e.g., the eigenfunctions of geographic weight matrix C)
• Linear combinations extracted from matrix $(I - \mathbf{1}\mathbf{1}^{T}/n)\,C\,(I - \mathbf{1}\mathbf{1}^{T}/n)$ or $(I - \mathbf{1}\mathbf{1}^{T}/n)\,D^{*}\,(I - \mathbf{1}\mathbf{1}^{T}/n)$ identify the range of possible distinct map patterns (i.e., uncorrelated and orthogonal)
Legendre et al. method
• A comparison of the two procedures is in [reference not preserved in this transcript]
• Links directly to the semivariogram plot
• D* is a truncated distance-based matrix, where the truncation is determined by the length of a minimum spanning tree articulating the set of locations
Properties
$E_j$ is the map pattern with spatial autocorrelation level $MC_j$
• The extreme eigenvalues define $MC_{max}$ and $MC_{min}$ (not necessarily 1, -1)
• As eigenvalues go from the largest positive to the largest negative value, map patterns become more fragmented
• Positive eigenvalues denote:
  • Global trends with relatively large values
  • Regional trends with intermediate values
  • Local trends with relatively small values
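The link between an eigenvalue and the MC of its map pattern can be written out explicitly. The following is a sketch under stated assumptions: $C$ is the connectivity matrix from the slides, $n$ is the number of areal units, and $\lambda_j$ (a symbol not used in the slides) is the eigenvalue paired with $E_j$. It holds for eigenvectors with $\lambda_j \neq 0$, which are orthogonal to $\mathbf{1}$.

```latex
\[
  \Bigl(I - \tfrac{\mathbf{1}\mathbf{1}^{T}}{n}\Bigr)\, C \,
  \Bigl(I - \tfrac{\mathbf{1}\mathbf{1}^{T}}{n}\Bigr) E_j = \lambda_j E_j ,
  \qquad
  MC_j
  = \frac{n}{\mathbf{1}^{T} C\, \mathbf{1}} \,
    \frac{E_j^{T} C\, E_j}{E_j^{T} E_j}
  = \frac{n}{\mathbf{1}^{T} C\, \mathbf{1}} \, \lambda_j .
\]
```

Under this relationship the largest and smallest eigenvalues give $MC_{max}$ and $MC_{min}$, which is why these bounds depend on $C$ and need not equal 1 and -1.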
Selected ideal map patterns
[Figure: eigenvector map patterns labeled global, regional, and local; panel values: MC ~ 1, MC = 0.9, MC = 0.7, MC = -0.6, MC = 0.5, MC = 0.25]
SA impacts on Gaussian RVs
Principal impact: variance inflation, with heavier tails and increased kurtosis.
[Figure: standard normal curves by SA map pattern; panel labels: MC = 1.12, GR = 0.08; MC = 1.00, GR = 0.18; MC = 0.28, GR = 0.77; MC = 0.00, GR = 1.00; MCmax = 1.18]
Unstandardized normal curve
[Figure panels: map pattern generated; autoregressive generated]
• Kurtosis increases from 0.01 (roughly 0) to 0.73. The variance of kurtosis is 24/n. Therefore, spatial autocorrelation here has induced increased relative peakedness (from the sign of the kurtosis statistic), with z = 7.3.
• Kurtosis increases from 0.04 (roughly 0) to 2.79. The variance of kurtosis is 24/n. Therefore, spatial autocorrelation here has induced increased relative peakedness (from the sign of the kurtosis statistic), with z = 27.8.
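The z-scores quoted for kurtosis follow from dividing the statistic by the square root of 24/n. The sketch below illustrates that arithmetic. The sample size is not stated on this slide, so n = 2382 is an assumption, backed out from E(MC) = -0.00042 = -1/(n - 1) reported elsewhere in the lecture. The function names are hypothetical.

```python
from math import sqrt


def excess_kurtosis(values):
    """Moment-based excess kurtosis (roughly 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def kurtosis_z(kurtosis, n):
    """z-score of a kurtosis statistic using the large-sample variance 24/n."""
    return kurtosis / sqrt(24.0 / n)


n_assumed = 2382  # assumption: inferred from E(MC) = -1/(n - 1), not stated here
z_first = kurtosis_z(0.73, n_assumed)
z_second = kurtosis_z(2.79, n_assumed)
print(round(z_first, 1), round(z_second, 1))
```

Under the assumed n, the two values round to the z-scores reported on the slide.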
Typical case: MC/MCmax = 0.6
[Figure: two map patterns; panel labels: MC = 0.61, GR = 0.50; MC = 0.80, GR = 0.34]
E(MC) = -0.00042; E(GR) = 1
Transformations to normal approximations
Torturing the data: conforming to a bell-shaped curve
• Box-Cox power transformations
• Manly’s exponential transformation
• Percentage adjustments (also arcsine)
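As an illustration of the first item, a Box-Cox power transformation can be sketched as follows. The `shift` argument is an assumption added here to cover translated logarithms such as LN(elevation + 17.5), which appears later in the lecture; the function name and sample values are hypothetical.

```python
from math import log


def box_cox(values, lam, shift=0.0):
    """Box-Cox transform: (y**lam - 1) / lam, or log(y) when lam is 0.

    y is the shifted value (value + shift) and must be positive.
    """
    transformed = []
    for v in values:
        y = v + shift
        if y <= 0:
            raise ValueError("Box-Cox requires positive values after shifting")
        transformed.append(log(y) if lam == 0 else (y ** lam - 1.0) / lam)
    return transformed


# Hypothetical skewed values; lam = 0.5 is a square-root-type transformation.
print(box_cox([1.0, 4.0, 9.0], 0.5))
# lam = 0 with a shift reproduces a translated log such as LN(x + 17.5).
print(box_cox([0.0, 100.0], 0, shift=17.5))
```

In practice the exponent is usually chosen by searching for the value that makes the transformed data look most nearly normal.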
Constant variance
• Attribute: variable transformations often stabilize the variance of a variable across its measurement range
• Mean/median split gives a heuristic assessment of constant variance (equal variability of high and low values)
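The median split heuristic can be sketched as a ratio of the variance of the high values to the variance of the low values. This is a minimal illustration; the function name and the data are hypothetical, and each half needs at least two values with nonzero spread in the lower half.

```python
from statistics import median, variance


def median_split_variance_ratio(values):
    """Variance of values above the median divided by variance of those at or below it."""
    cut = median(values)
    low = [v for v in values if v <= cut]
    high = [v for v in values if v > cut]
    return variance(high) / variance(low)


# Hypothetical variable whose spread grows with its level.
raw = [1, 2, 3, 4, 10, 20, 30, 40]
ratio = median_split_variance_ratio(raw)
print(ratio)
```

A ratio far from 1 suggests unequal variability of high and low values, which is the situation a variance-stabilizing transformation is meant to address.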
Constant variance
• Geographic: variable transformations often stabilize the variance of a variable across the geographic landscape over which it is distributed
• Quadrants of the plane/established areal unit groupings give a heuristic assessment of constant variance across a geographic landscape
[Figure panels: plane quadrants; provinces]
Non-normal random variables (RVs)
• Poisson: the mean equals the variance (built-in heterogeneity)
  • overdispersion: the variance is greater than the mean
  • assuming a gamma-distributed mean results in a negative binomial random variable
• Binomial: variance equals (1-p) times the mean [i.e., Np(1-p)]
  • overdispersion: the variance is greater than Np(1-p)
  • employ a quasi-likelihood estimation
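The variance relationships listed above can be checked numerically. The sketch below assumes the usual parameterization in which a gamma-distributed Poisson mean with shape k gives a negative binomial variance of mu + mu^2/k; the function names and counts are hypothetical.

```python
from statistics import mean, variance


def poisson_dispersion(counts):
    """Sample variance over sample mean; values well above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)


def binomial_variance(n_trials, p):
    """Binomial variance Np(1 - p), i.e., (1 - p) times the mean Np."""
    return n_trials * p * (1.0 - p)


def negative_binomial_variance(mu, shape):
    """Variance of a Poisson count whose mean is gamma distributed with the given shape."""
    return mu + mu ** 2 / shape


# Hypothetical counts with one large value, giving variance far above the mean.
counts = [0, 0, 0, 8]
print(poisson_dispersion(counts))
print(binomial_variance(10, 0.5))
print(negative_binomial_variance(2.0, 4.0))
```

The negative binomial variance always exceeds its mean, which is why it is a common choice when Poisson counts are overdispersed.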
Spatial autocorrelation impacts on Poisson RVs
Overdispersion occurs when var(Y) > E(Y).
[Figure panels: iid; weak positive spatial autocorrelation; strong positive spatial autocorrelation]
Impacts of typical spatial autocorrelation levels
[Figure: Poissonness plots for a hexagonal tessellation and an irregular tessellation]
Spatial autocorrelation impacts on binomial RVs
• variance increases
• shape goes to uniform, then to sinusoidal
[Figure panels: global; global & regional; global & regional & local; autoregressive]
Going from the global to the local
Paralleling statistics concerning data outliers, and leverage and influential points, spatial heterogeneity in georeferenced data is addressed by focusing on individual areal units. The emphasis shifts from global trends to local exceptions, to better understand local deviations from global model descriptions by exploiting tensions between global trends and informative local details latent in empirical data:
• adaptation of conventional diagnostic statistics (e.g., Unwin and Wrigley, 1987)
• spatializing existing statistical techniques (e.g., Fotheringham et al., 2002)
• Anselin’s (1995) seminal paper about local indicators of spatial association (i.e., LISA statistics)
• Getis and Ord’s (1992, 1995) Gi and Gi* statistics
Goals of global versus local analysis
• Identify clustering
• Identify particular clusters (significant local clusters in the absence of global autocorrelation)
• Distinguish between homogeneity and heterogeneity (e.g., spatial outliers: highs surrounded by lows, and vice versa)
• Identify hot/cold spots
• Analyze local instability (local deviations from global pattern of spatial autocorrelation)
LISA: local indicators of spatial association
[Figure panels: selevation; area; competition]
Goal: to assess spatial correlation heterogeneity
[Figure: LISA z-scores; reported statistics: ANOVA F = 2845; Pr(> χ²) = 0.4]
LISA for PR LN(elevation + 17.5)
• The slope is the unstandardized MC
• MC = 0.51; GR = 0.49
[Figure panels: Pr(LISA); Bonferroni; Sidak]
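Because one LISA value is tested per areal unit, the Bonferroni and Sidak panels refer to multiple-testing adjustments of the significance level. A minimal sketch of the two per-test thresholds follows; the overall alpha of 0.05 and the count of 100 areal units are hypothetical, and the function names are not from the lecture.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni adjustment."""
    return alpha / n_tests


def sidak_alpha(alpha, n_tests):
    """Per-test significance level under the Sidak adjustment."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


n_units = 100  # hypothetical number of areal units, one LISA test each
print(bonferroni_alpha(0.05, n_units))
print(sidak_alpha(0.05, n_units))
```

The Sidak threshold is always slightly larger than the Bonferroni one, so it is marginally less conservative.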
LISA maps
• Cannot distinguish between H-H and L-L clusters
• Conventional clustering fails to preserve contiguity
[Figure: significant LISA]
… with contiguity proclivity
• Clustering the geocoding coordinate pair coupled with zLISA values
• Clustering the geocoding coordinate pair with frequencies proportional to zLISA values
Getis-Ord Gi [Gi* includes i (i.e., j = i)]
• contiguity based upon distance band defined by dr
• dr may be obtained from a semivariogram plot
• one statistic for each areal unit
• Gi(dr) > 0 signifies clustering of high values
• Gi(dr) < 0 signifies clustering of low values
• LISA fails to make this particular distinction
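The sign behavior described above can be illustrated with the standardized Gi* form from Ord and Getis (1995), using binary weights inside a distance band dr. This is a sketch under those assumptions, not the lecture's own computation; the coordinates, values, and function names are hypothetical.

```python
from math import dist, sqrt


def band_weights(coords, d_r):
    """Binary weights: 1 if two units lie within distance d_r (unit i included, as in Gi*)."""
    return [[1.0 if dist(p, q) <= d_r else 0.0 for q in coords] for p in coords]


def getis_ord_gi_star(values, weights):
    """Standardized Gi* score for each areal unit.

    Undefined (division by zero) if a unit's band covers every unit.
    """
    n = len(values)
    x_bar = sum(values) / n
    s = sqrt(sum(v * v for v in values) / n - x_bar ** 2)
    scores = []
    for w in weights:
        sum_w = sum(w)
        sum_w2 = sum(wij * wij for wij in w)
        numer = sum(wij * xj for wij, xj in zip(w, values)) - x_bar * sum_w
        denom = s * sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numer / denom)
    return scores


# Hypothetical units along a line: high values at one end, low at the other.
coords = [(float(i), 0.0) for i in range(6)]
values = [10.0, 9.0, 1.0, 1.0, 1.0, 0.0]
scores = getis_ord_gi_star(values, band_weights(coords, d_r=1.0))
print([round(g, 2) for g in scores])
```

Units in the high-valued end receive positive scores and units in the low-valued end receive negative scores, which is the distinction between hot and cold spots.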
A Gi-based analysis: complete linkage
[Figure panels: Gi; Gi clusters]
A relationship between LISA and Gi for the same geographic connectivity matrix C
The quadratic trend is why LISA cannot distinguish between HH and LL clusters, while Gi can.
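The reason LISA cannot separate HH from LL clusters can be seen directly in Anselin's local Moran statistic, which multiplies a unit's deviation from the mean by the weighted sum of its neighbors' deviations. The sketch below assumes a binary connectivity matrix for four units along a line; the data and function name are hypothetical.

```python
def local_moran(values, weights):
    """Local Moran I_i = (z_i / m2) * sum_j w_ij * z_j, with z the deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi * zi for zi in z) / n
    return [
        z[i] / m2 * sum(wij * zj for wij, zj in zip(weights[i], z))
        for i in range(n)
    ]


# Hypothetical chain of four units: two highs next to two lows.
values = [10.0, 10.0, 0.0, 0.0]
C = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
lisa = local_moran(values, C)
print(lisa)
```

The high unit with a high neighbor and the low unit with a low neighbor receive the same positive value, because a product of two negative deviations is as positive as a product of two positive ones.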
Geographic & space-time clusters: an overview
• Global cluster tests search for spatial clusters anywhere in a study area but do not necessarily identify where the clusters occur, and are used to identify departures from spatial randomness when overall spatial pattern is considered.
• Local cluster tests identify locations at which there is some excess/deficit (a hot/cold spot) anywhere within a study area.
• Focused cluster tests determine whether there is an excess near a pre-specified location, called a focus, and are used to detect clustering near, say, putative hazards (e.g., a toxic waste dump).
Spider diagrams
• allocation to AAR centroids
• allocation to cluster (U, V, z-LISA) centroids
Regression diagnostics: each observation’s influence on parameter estimates and predicted values
• PRESS: global measure that should roughly equal the mean squared error (MSE) for a trend line (equivalent to cross-validation)
• Leverage: measures degree of influence of areal unit $Cz_{Y,i}$ value on an MC trend line (marked: > 2/n)
• Studentized residual: measures whether ith areal unit causes a significant shift in its corresponding regression intercept (i.e., is an outlier; marked: > 2)
• Cook’s D: measures influence of ith areal unit on an MC estimate (analogous to DFFITS; marked: > 2)
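Leverage and PRESS for a straight trend line can be sketched with the standard simple-regression formulas. In a Moran scatterplot the x values would be the standardized attribute and the y values its spatially lagged counterpart; the numbers below are hypothetical stand-ins, and the function name is not from the lecture.

```python
def simple_regression_diagnostics(x, y):
    """Slope, leverages, PRESS, and MSE for a least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage of each observation: diagonal of the hat matrix.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # PRESS: sum of squared leave-one-out prediction errors.
    press = sum((e / (1.0 - h)) ** 2 for e, h in zip(residuals, leverage))
    sse = sum(e * e for e in residuals)
    return {
        "slope": slope,
        "leverage": leverage,
        "press": press,
        "sse": sse,
        "mse": sse / (n - 2),
    }


# Hypothetical scatterplot coordinates.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1]
diag = simple_regression_diagnostics(x, y)
print(diag["slope"], diag["press"], diag["mse"])
```

Observations far from the mean of x have the largest leverage, and PRESS is never smaller than the ordinary residual sum of squares.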
Moran scatterplot for LN(elevation + 17.5)
Barranquitas is a spatial outlier, again!
[Figure labels: marked values; mean of C1]
Spatial autocorrelation in diagnostic statistics: eigenvector covariates
MC(E2) = 1.04926
Map legend: dark red, very high; light red, high; gray, medium; light green, low; dark green, very low
[Figure labels: MC = 1.04926; R2]
What have we learned today?