Model Based Geostatistics Archie Clements University of Queensland School of Population Health
Overview • Introduction to geostatistics • Assumptions • Variogram components • Variogram models • Kriging • Assumptions • Model-based geostatistics • Principles • Building the model • Prediction • Validation • Applications: parasitic disease control in Africa
Spatial variation [Figure: plot with X, Y and Z axes]
First and second order variation • First-order variation: • Trend • Large-scale variation • Can be due to large-scale environmental drivers (e.g. temperature for vector-borne diseases) • Second-order variation: • Localised variation: clustering • Modelled using geostatistics
Spatial dependence • Observations close in space are more similar than observations far apart • The variance of pairs of observations that are close together (small h) tends to be smaller than the variance of pairs far apart (large h) • Basis of the semivariogram • Spatial decomposition of the sample variance
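Semivariance: statistical notation • $\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ z(s_i) - z(s_i + h) \right]^2$, where $N(h)$ is the number of pairs of locations separated by lag $h$ • Semivariance is half the average squared difference of values observed at locations separated by a given distance (and direction) • It is a function of distance (and direction): distance is grouped into bins, direction into sectors of the compass ("azimuth")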
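The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the slides: the function name `empirical_semivariogram`, its arguments and the fixed-width binning are assumptions, and direction is ignored (an omnidirectional semivariogram).

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, values, bin_width, max_lag):
    """Omnidirectional empirical semivariogram (illustrative sketch).

    Returns a list of (mean_lag, semivariance, n_pairs), one per non-empty bin.
    """
    # Per bin: [sum of lags, sum of squared differences, number of pairs]
    bins = defaultdict(lambda: [0.0, 0.0, 0])
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            if h >= max_lag:
                continue  # ignore pairs at or beyond the maximum lag
            acc = bins[int(h // bin_width)]
            acc[0] += h
            acc[1] += (values[i] - values[j]) ** 2
            acc[2] += 1
    # Semivariance is HALF the average squared difference within each bin
    return [
        (lag_sum / count, sq_sum / (2 * count), count)
        for _, (lag_sum, sq_sum, count) in sorted(bins.items())
    ]
```

Calling it with a list of (x, y) tuples, a matching list of observed values, a bin width and a maximum lag returns one row per distance bin, which can then be plotted as semivariance against lag.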
Modelling spatial correlation: semivariogram [Figure: semivariance plotted against lag (h), with the nugget, partial sill and sill marked]
Nugget • Random variation (white noise); non-spatial measurement error • Microvariation (spatial variation at a scale smaller than the smallest bin) • If no spatial correlation: • Nugget = sill (flat semivariogram)
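To make the roles of the nugget, partial sill and range concrete, here is a sketch of one common parametric semivariogram model, the spherical model. The choice of the spherical form and the parameter names are assumptions made for illustration; the slides do not say which model was fitted.

```python
def spherical_semivariogram(h, nugget, partial_sill, range_):
    """Spherical semivariogram model (illustrative sketch).

    nugget       : semivariance just above lag zero
    partial_sill : sill minus nugget
    range_       : lag at which the sill is reached
    """
    if h <= 0:
        return 0.0  # by definition the semivariance at lag zero is zero
    if h >= range_:
        return nugget + partial_sill  # flat at the sill beyond the range
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)
```

With `partial_sill=0` the function returns the nugget at every positive lag, which is the flat semivariogram described above for data with no spatial correlation.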
Semivariogram: decisions to be made • How many/what sized bins? • Depends on density of data points • For regularly spaced (grid-sampled) data, bin size = size of cells in the grid • For irregular sampling, modify according to range of spatial correlation (big range, big bins; small range, small bins) • What maximum lag (h) to use? • The semivariogram should be estimated only up to half the length of the shortest side of the study area • Which parametric model to use? • Visual fit • Statistical fit
Schistosoma mansoni, Uganda Omnidirectional semivariograms
Anisotropy • Spatial dependence is different in different directions • Semivariogram calculated in one direction is different from semivariogram calculated in another direction • Should check for anisotropy and, if present, accommodate it in interpolation • Range or sill (or both) can differ
Trended and skewed data • Data should be de-trended • Polynomials (regression on XY coordinates) • Generalised linear models (regression on covariates) • Generalised additive models (can over-fit) • If directional variograms are calculated and the range in one direction is more than 3 times the range in the perpendicular direction, this is a sign of trend • If skewed, consider transformation (e.g. log transformation, normal score transformation) • Otherwise, extreme values overly influence the interpolated map • Have to back-transform interpolated values • Called “disjunctive Kriging”
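As one example of de-trending by regression on XY coordinates, the sketch below fits a first-order polynomial trend surface by ordinary least squares and returns the residuals, which would then be used for variography. The helper names (`solve`, `fit_linear_trend`, `detrend`) are hypothetical, and a first-order surface is an assumption; higher-order polynomials or covariate-based models follow the same pattern.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_trend(coords, values):
    """Least-squares fit of z = b0 + b1*x + b2*y via the normal equations."""
    design = [[1.0, x, y] for x, y in coords]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * z for row, z in zip(design, values)) for i in range(3)]
    return solve(xtx, xty)


def detrend(coords, values):
    """Return the trend coefficients and the de-trended residuals."""
    b0, b1, b2 = fit_linear_trend(coords, values)
    residuals = [z - (b0 + b1 * x + b2 * y) for (x, y), z in zip(coords, values)]
    return (b0, b1, b2), residuals
```

After interpolating the residuals, the fitted trend has to be added back at each prediction location to recover values on the original scale.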
Non-stationarity • Spatial correlation structure cannot be generalised to the whole study area • Why does it occur? • Different factors may operate in different parts of the study area • Different ecological zones with different disease epidemiology • Need to estimate the spatial correlation structure separately in each homogeneous zone
Kriging • The prediction is a weighted sum of the measured values: $\hat{Z}(s_0) = \sum_{i=1}^{N} \lambda_i Z(s_i)$ • $Z(s_i)$ is the measured value at the ith location • $\lambda_i$ is the weight attributed to the measured value at the ith location (calculated using the semivariogram) • $s_0$ is the prediction location • For formulae on how the weights are estimated using the variogram: http://en.wikipedia.org/wiki/Kriging • Prediction standard error/variance gives an indication of the precision of the prediction
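The following is a minimal sketch of ordinary Kriging at a single prediction location, to show how the weights follow from a fitted semivariogram. It assumes an exponential semivariogram with made-up default parameters and solves the standard ordinary Kriging system (semivariances between data locations, bordered by a Lagrange multiplier that forces the weights to sum to one). All function names are hypothetical.

```python
import math


def exponential_variogram(h, nugget=0.0, partial_sill=1.0, range_param=1.0):
    """Exponential semivariogram; the default parameter values are illustrative only."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_param))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ordinary_kriging(coords, values, target, variogram=exponential_variogram):
    """Return (prediction, kriging variance, weights) at one target location."""
    n = len(values)
    # Left-hand side: semivariances between data locations, plus the unbiasedness row/column
    a = [[variogram(math.dist(coords[i], coords[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Right-hand side: semivariances between each data location and the target
    b = [variogram(math.dist(c, target)) for c in coords] + [1.0]
    solution = solve(a, b)
    weights, lagrange = solution[:n], solution[n]
    prediction = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + lagrange
    return prediction, variance, weights
```

With a zero nugget, predicting at a sampled location returns the observed value with zero variance, and the prediction variance grows as the target moves away from the data.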
Geostatistics summary • Geostatistics involves 3 steps: • Exploratory data analysis • Definition of a variogram • Using the variogram for interpolation (Kriging) • Technique applicable for: • Point-referenced data • Spatially continuous processes: • Disease risk • Rainfall, elevation, temperature, other climate variables • Wildlife, vegetation, geology (mineral deposits)
Bayesian model-based geostatistics • Seminal paper: Diggle, Tawn and Moyeed (1998). Model-based geostatistics. Appl. Stat. 47(3): 299-350 • Observed a need for addressing non-Gaussian observational error • Idea is “to embed linear Kriging methodology within a more general distributional framework” • Generalised linear models with an unobserved Gaussian process in the linear predictor • Implemented in a Bayesian framework
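One way to write down a generalised linear model with an unobserved Gaussian process in the linear predictor is given below for prevalence data. The binomial likelihood, logit link and exponential correlation function are assumptions chosen for illustration; the framework admits other likelihoods, links and correlation functions.

$$
\begin{aligned}
Y_i \mid p_i &\sim \text{Binomial}(n_i, p_i), \\
\text{logit}(p_i) &= \alpha + \mathbf{x}_i^{\top}\boldsymbol{\beta} + S(s_i), \\
S = \big(S(s_1), \dots, S(s_N)\big)^{\top} &\sim \text{MVN}\big(\mathbf{0},\ \sigma^2 R(\phi)\big), \qquad R_{ij} = \exp(-\phi\, d_{ij}).
\end{aligned}
$$

Here $Y_i$ is the number positive out of $n_i$ examined at location $s_i$, $\mathbf{x}_i$ are covariates (the deterministic, first-order component), $S(s_i)$ is the spatial random effect (the second-order component), $d_{ij}$ is the distance between locations $i$ and $j$, $\sigma^2$ plays the role of the partial sill and $\phi$ controls how quickly correlation decays with distance. In a Bayesian implementation, priors are placed on $\alpha$, $\boldsymbol{\beta}$, $\sigma^2$ and $\phi$.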
Advantages of the Bayesian approach • Natural framework for incorporation of parameter uncertainty into spatial prediction • Can build uncertainty into parameters using priors • Non-informative • Informative (based on exploratory analysis, additional sources of information) • Convenient for modelling hierarchical data structures
Predictions • Can predict at specified validation locations (with observed outcomes for comparison) • Can predict at non-sampled locations, e.g. a prediction grid • Might be interested in • outcome • spatial random effect • Standard error of predicted outcome
Validation • Jack-knifing; sampling with replacement • Remove one observation, do prediction at that location and store predicted value • Repeat for all observations • Compare predicted to observed using statistical measures of fit (RMSE) and discriminatory performance (AUC) • Not feasible with MBG other than with very small datasets • Cross-validation; sampling without replacement • Set aside a subset for validation (ideally 50%) • Use remaining data to “train” model • Compare predicted and observed for the validation subset using statistical measures • Can then recombine the validation and training subsets for final model build • External validation: using other prospective or retrospective dataset
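The validation measures named above can be sketched as follows. This is an illustrative outline under stated assumptions: the function names are hypothetical, the split is a simple random hold-out, and AUC is computed by direct pairwise comparison (equivalent to the Mann-Whitney statistic), which is adequate for survey-sized datasets.

```python
import math
import random


def train_validation_split(n_obs, validation_fraction=0.5, seed=0):
    """Randomly split observation indices into training and validation sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_val = round(n_obs * validation_fraction)
    return indices[n_val:], indices[:n_val]


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def auc(labels, scores):
    """Area under the ROC curve.

    labels : 1 if the observed value exceeds the threshold of interest, else 0
    scores : predicted values (or predicted probabilities) at the same locations
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Probability that a random positive is scored above a random negative (ties count 0.5)
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

For prevalence maps, the labels would typically indicate whether observed prevalence at a validation location exceeds an intervention threshold, and the scores would be the model's predictions at those locations.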
Model-based geostatistics summary Model-based geostatistics involves: • Visual and exploratory data analysis • Variography (to determine if there is second-order spatial variation) • Variable selection (for deterministic component) • Building model (e.g. in WinBUGS) • Model selection (e.g. using DIC) • Prediction and validation
Schistosomiasis • 779 million people at risk • 207 million infected • Most in Africa • Significant illness and mortality • Two main forms in Africa: • Urinary schistosomiasis caused by Schistosoma haematobium • Intestinal schistosomiasis caused by S. mansoni
Life cycle of Schistosoma haematobium • Adult worm in human bladder wall • Eggs in urine • Miracidia • Sporocysts in snail • Cercariae released
Diagnosis of infection • S. haematobium: • Microscopic examination of urine slides: Presence of eggs and egg counts • Macrohaematuria (visible blood) • Microhaematuria (invisible blood) – tested using chemical reagent strips • Blood in urine questionnaire • S. mansoni and soil-transmitted helminths: • microscopic examination of stool samples
School-based control programmes • School-aged children have highest prevalence (proportion infected) and intensity (severity) of infection • Education system is convenient for control; central location to access target population
World Health Organisation guidelines: treat communities every two years where prevalence in school-age children is >10% and annually where prevalence is >50% • How do we determine which schools should be targeted? • No surveillance • Need to do surveys
Field survey: northwest Tanzania Lake Victoria • 153 schools surveyed • 60 children per school • What about non-sampled locations? Need to predict (interpolate) values
Uncertainty [Maps: lower and upper bounds of the 95% PI]
Co-ordinated surveys in 3 contiguous countries • 418 schools • >26,000 children Probability that prevalence is >50% Clements et al. EID 2008
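A map of the probability that prevalence exceeds a threshold can be derived from posterior samples of predicted prevalence at each prediction location. The sketch below assumes such samples are already available (for example, from MCMC output) as a list of numbers per location; the function names and the nearest-rank interval are illustrative assumptions.

```python
import math


def exceedance_probability(draws, threshold):
    """Proportion of posterior draws in which predicted prevalence exceeds the threshold."""
    return sum(d > threshold for d in draws) / len(draws)


def percentile_interval(draws, level=0.95):
    """Lower and upper bounds of a central interval, by a simple nearest-rank rule."""
    ordered = sorted(draws)
    n = len(ordered)
    alpha = (1.0 - level) / 2.0
    lower = ordered[math.floor(alpha * (n - 1))]
    upper = ordered[math.ceil((1.0 - alpha) * (n - 1))]
    return lower, upper
```

Applying `exceedance_probability(draws, 0.5)` at every cell of a prediction grid gives one value per cell, which can be mapped directly; the interval bounds can be mapped the same way to show uncertainty.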
Other outcomes: co-infection East Africa: Brooker and Clements, Int. J. Parasitol., in press S. mansoni mono-infection: 7.9% Hookworm mono-infection: 40.5% Co-infection: 8.1%
Model for co-infection: $Y_{ijk} \sim \text{Multinomial}(p_{ijk}, n_{ijk})$
Co-infection [Maps: hookworm mono-infection; S. mansoni-hookworm co-infection; S. mansoni mono-infection]
Other outcomes: Intensity of infection • Prevalence is used (currently) for disease control planning • Intensity of infection (eggs/ml urine or /g faeces) is more indicative of: • Morbidity (anaemia, urinary tract and hepatic pathology) • Transmission
Intensity of S. mansoni infection, East Africa Clements et al. Parasitol 2006
Conclusions • In disease control we need an evidence-based framework for deciding where to allocate limited control resources • Maps are useful tools for highlighting sub-national variation; targeting interventions; advocacy (national and local); integrated control programmes; estimating heterogeneities in disease burden • Model-based geostatistics enables rich inference from spatial data, including quantification of uncertainty