Measuring spatial clustering in disease patterns.

Peter Congdon, Queen Mary University of London p.congdon@qmul.ac.uk http://www.geog.qmul.ac.uk/staff/congdonp.html http://webspace.qmul.ac.uk/pcongdon/

Background: spatial correlation • Tobler’s First Law of Geography: “Everything is related to everything else, but near things are more related than distant things” • Spatial correlation: values in nearby spatial units are more similar than values in distant units • Common feature of geographically configured datasets (spatial econometrics, area health, political science, etc.) • Correlation can be positive or negative, but positive correlation is most common • Spatial correlation indices measure correlation while also accounting for distance between (or contiguity of) spatial units • Reference (null) pattern: spatial randomness. Values observed at one location do not depend on values observed at neighbouring locations

Background: spatial heterogeneity • Michael Goodchild, in “Challenges in geographical information science” (Proc. R. Soc. A, 2011), mentions a second principle of spatial data: spatial heterogeneity • One example of such heterogeneity is local variation in the degree of spatial dependence, leading to local indices of spatial association (LISA measures)

Background: observation types • My focus is on spatial lattice data: N areal subdivisions (e.g. administrative areas) which taken together constitute the entire study region • This differs from point data (e.g. mineral readings in geostatistics), where the major focus is on interpolating a response between observed locations

Global Indices of Spatial Association • Moran Index (for N areas, and continuous centred data Zi)

Spatial Weights • Possible options for spatial weights W=[wij] • Adjacency/contiguity: wij=1 if area j is adjacent to area i; otherwise wij=0 • Distance-based: wij decreases with the distance dij between locations i and j, e.g. the inverse distance wij=1/dij
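The formula for the Moran index did not survive on the slide above. A minimal sketch of the standard definition, I = (N/S0) ∑i∑j wij Zi Zj / ∑i Zi², with S0 the sum of all weights, is given here. The four-area "line" map and the function name are hypothetical, chosen only for illustration.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran index for values x and a spatial weight matrix W (zero diagonal)."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()          # centred data Z_i
    s0 = W.sum()              # S0: sum of all spatial weights
    return (len(x) / s0) * (z @ W @ z) / (z @ z)

# Hypothetical map: four areas in a row, binary contiguity weights
W_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])

i_trend = morans_i([1, 2, 3, 4], W_line)        # smooth gradient: positive correlation
i_alternating = morans_i([1, 4, 1, 4], W_line)  # high/low alternation: negative correlation
```

On this toy map the smooth gradient gives I = 1/3 and the alternating pattern gives I = -1, matching the positive/negative distinction made in the background slide.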

Global Indices of Spatial Association: Binary data
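The formulas for this slide were lost in extraction. A sketch of the usual global join-count statistics for binary indicators bi (BB = ½∑i∑j wij bi bj, WW = ½∑i∑j wij (1-bi)(1-bj), BW = ½∑i∑j wij (bi-bj)²) follows. It assumes a symmetric weight matrix; the toy map is hypothetical.

```python
import numpy as np

def global_join_counts(b, W):
    """Global BB, WW and BW join counts for 0/1 indicators b and symmetric weights W."""
    b = np.asarray(b, dtype=float)
    W = np.asarray(W, dtype=float)
    bb = 0.5 * (b @ W @ b)                                  # joins linking two 1s
    ww = 0.5 * ((1 - b) @ W @ (1 - b))                      # joins linking two 0s
    bw = 0.5 * np.sum(W * (b[:, None] - b[None, :]) ** 2)   # discordant joins
    return bb, ww, bw

# Hypothetical map: four areas in a row, indicators 1,1,0,0
W_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
bb, ww, bw = global_join_counts([1, 1, 0, 0], W_line)
```

With binary contiguity weights the three counts sum to the total number of joins (three on this map).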

Background: Area health data and spatial correlation • Health data with full population coverage (as opposed to survey data) are often only available for geographic aggregates • These may be small neighbourhoods, such as English lower super output areas (LSOAs), with an average population of 1500-2000 • Small area units (with relatively homogeneous social structure, physical environment and other exposures) are preferable for epidemiological inferences in terms of reducing ecological bias

Background: Area health data and spatial correlation • Examples of area health data (e.g. for electoral wards, LSOAs): mortality data by cause, cancer incidence data, health prevalence data • Spatial correlation in area health outcomes reflects clustering in risk factors (observed and unobserved), such as deprivation/affluence, health behaviours, environmental factors, neighbourhood social capital, etc

Bayesian Relative Risk Models for Area Spatial Data • Bayesian models for area disease risks now widely applied (e.g. to detect a smooth underlying risk surface over the region) • Assume observed disease counts yi are Poisson distributed, yi ~ Po(ei ri), where ei are expected counts and ri are relative risks • Relative risks ri have average 1 when sum(expected)=sum(observed) • Expected counts (in the demographic sense) are obtained by applying region-wide disease rates to each small area population
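As a rough illustration of how expected counts arise, the sketch below applies a single region-wide crude rate to each area population. This is a simplification: in practice age- and sex-specific rates would normally be used. The populations and counts are invented.

```python
import numpy as np

# Hypothetical data for three small areas
pop = np.array([1500.0, 2000.0, 1800.0])   # area populations
y = np.array([3, 8, 4])                    # observed disease counts

region_rate = y.sum() / pop.sum()          # region-wide crude rate
e = region_rate * pop                      # expected counts e_i
crude_rr = y / e                           # crude relative risks y_i / e_i

# Because sum(e) equals sum(y), the e-weighted average relative risk is 1
weighted_mean_rr = np.average(crude_rr, weights=e)
```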

Bayesian Relative Risk Models for Area Spatial Data • One option for modelling area relative risks: the convolution scheme (Besag et al, 1991) • log(ri)=α+si+ui • Spatial error: si ~ conditional autoregressive (CAR) prior • Heterogeneity/overdispersion error: ui ~ unstructured normal white noise

Neighbourhood Clustering in Elevated Risk • Consider binary risk measures: bi=1 if relative risk ri>1, bi=0 otherwise. • These binary indicators are latent (unknown) as ri are latent. • Can use other thresholds (e.g. ri>1.5) • Interest often in posterior exceedance probabilities of elevated disease risk Ei=Pr(ri>1|y)=Prob(bi=1|y) in each area separately. • Possible rules: area i a hotspot if Ei > 0.9 or if Ei>0.8. Suitable threshold that Ei must exceed may depend on data frequency (higher thresholds can be set for more frequent data)
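A minimal sketch of how exceedance probabilities are obtained from MCMC output: bi is evaluated at each iteration and Ei is its posterior mean. The "posterior draws" here are simulated from a lognormal purely as a stand-in for real sampler output, and the 0.9 rule is one of the possible thresholds mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC output: rows are iterations, columns are areas (hypothetical values)
r_draws = rng.lognormal(mean=np.log([0.9, 1.2, 1.75]), sigma=0.2, size=(4000, 3))

b_draws = (r_draws > 1.0).astype(int)   # b_i = 1 if r_i > 1 at that iteration
E = b_draws.mean(axis=0)                # E_i = Pr(r_i > 1 | y), estimated by averaging b_i
hotspot = E > 0.9                       # one possible hotspot rule
```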

Neighbourhood Clustering in Elevated Risk • “Hotspot” detection does not measure broader local clustering in relative risks. Can have high risk and low risk clusters • High risk cluster centre: area i is embedded in a high risk cluster if both area i and all surrounding areas j have elevated risk (Ei and Ej both high) • By contrast, high risk outlier: high risk area i (Ei high), but all adjacent areas j are low risk (Ej low) • Also, cluster edge area: high risk area i (Ei high), but adjacent areas j are a mix of high and low risk

Neighbourhood Clustering in Elevated Risk • Low risk cluster centre: area i embedded in low risk cluster: both area i and surrounding areas j have low risk (Ei and Ej both low) • By contrast, low risk outlier: low risk area i (Ei low), but all adjacent areas are high risk

Spatial Scan Clusters • Best known approach to spatial clustering of lattice data is the spatial scan method: produces lists of areas in a cluster at a given significance level, e.g. under a Poisson model for the data • Spatial scan: a circle (or ellipse) of varying size systematically scans the study region (moving window) • Each geographic unit (e.g. census tract, LSOA) is a potential cluster centre • Clusters are reported for those circles (or other area shapes) where total observed values within the circle significantly exceed expected values
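A simplified sketch of the moving-window idea, assuming the Poisson log likelihood ratio commonly used with the spatial scan statistic. Windows are approximated by each centre plus its nearest centroids, and the Monte Carlo significance test is omitted. Function names, coordinates and counts are hypothetical.

```python
import numpy as np

def poisson_llr(c, e, total):
    """Log likelihood ratio for a window with c observed and e expected cases."""
    if c <= e:
        return 0.0   # only windows with an excess of cases are of interest
    inside = c * np.log(c / e)
    outside = (total - c) * np.log((total - c) / (total - e)) if total > c else 0.0
    return float(inside + outside)

def most_likely_cluster(y, e, coords, max_size):
    """Scan windows made of each centre and its nearest centroids (up to max_size areas)."""
    y, e, coords = (np.asarray(a, dtype=float) for a in (y, e, coords))
    total = y.sum()
    best_llr, best_members = 0.0, ()
    for centre in range(len(y)):
        dist = np.hypot(*(coords - coords[centre]).T)
        order = np.argsort(dist, kind="stable")   # areas sorted by distance from centre
        for k in range(1, max_size + 1):
            members = order[:k]
            llr = poisson_llr(y[members].sum(), e[members].sum(), total)
            if llr > best_llr:
                best_llr = llr
                best_members = tuple(sorted(int(m) for m in members))
    return best_llr, best_members

# Hypothetical example: five areas in a row, excess cases in areas 2 and 3 (0-based)
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
best_llr, best_members = most_likely_cluster([5, 5, 20, 15, 5], [10] * 5, coords, max_size=3)
```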

Stochastic Approach to Measuring Clustering in Elevated Risk • Method to be described provides measure of cluster status for each area in situation where relative health risks ri (and binary health status bi) are unknowns • Can be considered a method of cluster detection, included in MCMC updating • Includes high risk and low risk clustering in single perspective, and also encompasses outliers (isolated high or low risk hotspot)

Synthetic Data • Known adjacency structure: 113 middle level super output areas (MSOAs) in Outer NE London • 15 out of 113 areas have high RR (ri circa 1.75). Remainder have below average RR (ri circa 0.9). • High risk areas are located in three high risk clusters • Known yi and ei, and hence known crude relative risks (yi/ei), but whether (latent) RRs significantly elevated or not depends on amount of information in data (data frequency)

Synthetic Data • Assess Ei and bi (using Besag et al convolution model) according to different expected cases: ei=20.39, or ei=58.77 • For ei=20.39, yi are either 18 or 36 (to ensure sums of observed and expected are the same) • For ei=58.77, yi are either 52 or 103
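The arithmetic behind the two synthetic scenarios can be checked directly from the figures quoted above (113 areas, 15 of them high risk). The variable names are illustrative.

```python
N_AREAS, N_HIGH = 113, 15

# expected count per area -> (count in background areas, count in high risk areas)
scenarios = {20.39: (18, 36), 58.77: (52, 103)}

summary = {}
for e_i, (y_low, y_high) in scenarios.items():
    total_obs = N_HIGH * y_high + (N_AREAS - N_HIGH) * y_low
    total_exp = N_AREAS * e_i
    summary[e_i] = {
        "total_observed": total_obs,
        "total_expected": round(total_exp, 2),
        "crude_rr_high": round(y_high / e_i, 2),   # y_i / e_i in high risk areas
        "crude_rr_low": round(y_low / e_i, 2),     # y_i / e_i in background areas
    }
```

Observed and expected totals agree to within rounding in both scenarios (2304 and 6641 cases), and the crude relative risks are about 1.75-1.77 in the high risk areas and 0.88 elsewhere.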

Synthetic Data. Average e=20.39, Known RRs

Local Join-Counts to Detect Clustering in Relative Disease Risk • As mentioned above, global join counts (BB-WW-BW) measure global spatial clustering in binary risk indicators bi (note BW statistic combines two types of discrepancy) • To detect local clustering in risk (or outlier status), use local versions of global BB statistics.

Local Join-Counts to describe local clustering • Local version of BB statistic: summation only over neighbours of area i (not double summation): J11i = bi ∑j wij bj • wij either distance based or contiguity based (wij=1 if areas i and j adjacent, wij=0 otherwise) • J11i measures high risk “cluster embeddedness” or high risk cluster centre status. J11i will be high for areas surrounded by other high risk areas

Local Join-Counts to describe local clustering • Local version of BW statistic: J10i = bi ∑j wij (1-bj) • Measures high risk outlier status: when area i has elevated risk, but all neighbours have low risk • Also tends to increase for high risk cluster edges: area i has elevated risk, but many neighbours have low risk

Local Join-Counts for low risk clustering • Local version of WW statistic: J00i = (1-bi) ∑j wij (1-bj), high when area i and its neighbours both have low risk • Finally, local WB statistic: J01i = (1-bi) ∑j wij bj. Measures situation of low risk area discrepant from its neighbours
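The four local join counts can be computed together for a given set of indicators. A sketch, with a hypothetical four-area map and function name:

```python
import numpy as np

def local_join_counts(b, W):
    """Local join counts J11, J10, J01, J00 for 0/1 indicators b and weights W."""
    b = np.asarray(b, dtype=float)
    W = np.asarray(W, dtype=float)
    nb_high = W @ b          # sum_j w_ij b_j: weighted count of high risk neighbours
    nb_low = W @ (1 - b)     # sum_j w_ij (1 - b_j): weighted count of low risk neighbours
    return {
        "J11": b * nb_high,         # high risk area, high risk neighbours
        "J10": b * nb_low,          # high risk area, low risk neighbours
        "J01": (1 - b) * nb_high,   # low risk area, high risk neighbours
        "J00": (1 - b) * nb_low,    # low risk area, low risk neighbours
    }

# Hypothetical map: four areas in a row, indicators 1,1,0,0
W_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
J = local_join_counts([1, 1, 0, 0], W_line)
```

Here area 2 (high risk, one high and one low risk neighbour) has J11=1 and J10=1, the pattern described above for a cluster edge.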

Local Join-Counts under Binary Spatial Weights • Consider binary weights wij • Denote areas adjacent to area i as its “neighbourhood” Ni • Li = number of areas adjacent to area i (number of areas in neighbourhood Ni) • Common high risk joins formula (local BB count) is now J11i = bi ∑j∈Ni bj • Local BW count: J10i = bi ∑j∈Ni (1-bj) • Also: J01i = (1-bi) ∑j∈Ni bj • J00i = (1-bi) ∑j∈Ni (1-bj)

Local Join-Counts under Binary Spatial Weights • Simple to show (and self-evident): Li = J11i+J10i+J01i+J00i • Multinomial sampling: denominators Li known, but {J11i, J10i, J01i, J00i} are unknowns in modelling situation with relative disease risks ri and risk indicators bi as unknowns
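The identity follows by pairing terms. A short derivation in the notation of this slide (binary weights, Ni the neighbourhood of area i):

```latex
\begin{aligned}
J_{11i} + J_{10i} &= b_i \sum_{j \in N_i} \bigl[b_j + (1 - b_j)\bigr] = b_i L_i, \\
J_{01i} + J_{00i} &= (1 - b_i) \sum_{j \in N_i} \bigl[b_j + (1 - b_j)\bigr] = (1 - b_i) L_i, \\
J_{11i} + J_{10i} + J_{01i} + J_{00i} &= b_i L_i + (1 - b_i) L_i = L_i .
\end{aligned}
```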

Probabilities of Local Clustering • Proportion π11i of joins representing joint high risk, defined by E(J11i)=Li π11i • Estimate during MCMC run (J11i and bi varying by iteration) as π11i = J11i/Li = bi ∑j∈Ni bj/Li • π11i estimates the probability that area i is a member of a high risk cluster • As π11i → Ei, area i is likely to be a cluster centre • The term ∑j∈Ni bj/Li → 1 when all adjacent areas have definitive high risk

Probabilities of Local Clustering • Proportion π10i of local joins that are (1,0) pairs, defined by E(J10i)=Li π10i • Estimates probability that area i is a high risk local outlier • Estimate during MCMC run: π10i = J10i/Li = bi ∑j∈Ni (1-bj)/Li

Decomposition of Exceedance Probability • Can show that Ei = Pr(ri>1|y) = π11i+π10i • Have J11i+J10i = bi ∑j∈Ni bj + bi ∑j∈Ni (1-bj) = bi Li, so that E(J11i)+E(J10i) = E(bi)Li = Ei Li • Also by definition E(J11i)+E(J10i) = Li π11i + Li π10i
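A sketch of how the four probabilities can be estimated from stored MCMC draws of bi, together with a numerical check of the decomposition. The draws are simulated Bernoulli indicators standing in for real sampler output, and the map is hypothetical.

```python
import numpy as np

def cluster_probabilities(b_draws, A):
    """Posterior means of J/L from indicator draws (iterations x areas) and binary adjacency A."""
    b = np.asarray(b_draws, dtype=float)
    A = np.asarray(A, dtype=float)
    L = A.sum(axis=1)            # L_i: number of neighbours of area i
    nb_high = b @ A.T            # per iteration: number of high risk neighbours of each area
    nb_low = L - nb_high         # per iteration: number of low risk neighbours
    return {
        "E": b.mean(axis=0),
        "pi11": (b * nb_high).mean(axis=0) / L,
        "pi10": (b * nb_low).mean(axis=0) / L,
        "pi01": ((1 - b) * nb_high).mean(axis=0) / L,
        "pi00": ((1 - b) * nb_low).mean(axis=0) / L,
    }

# Hypothetical map (four areas in a row) and stand-in indicator draws
A_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
rng = np.random.default_rng(1)
b_draws = rng.random((2000, 4)) < np.array([0.95, 0.9, 0.5, 0.1])
p = cluster_probabilities(b_draws, A_line)
```

Because the identity J11i+J10i = bi Li holds at every iteration, E equals pi11 + pi10 exactly (up to floating point error), not just approximately.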

Synthetic Data. Average e=20.39, Known RRs

Synthetic Data Example: Cluster Focus • Area 25: cluster centre. So also is area 23, in terms of having only high risk neighbours • Areas 27 and 28: cluster edges (have as many background risk neighbours as high risk neighbours) • Areas 22, 23, 25, 27, 28 all have true RR of 1.77; surrounding areas have RR of 0.88

Cluster Focus (simulation with average ei=20.39, and bi=1 if ri>1)

Cluster Focus (simulation with average ei=58.77, and bi=1 if ri>1)

Cluster Centres and Edges • Cluster centre status verified: π11i ≈ Ei for areas 25 and 23 • Cluster edge status becomes clearer with more frequent data (for areas 27 and 28)

Cluster Focus (simulation with average ei=20.39): Map of High Risk Cluster Probabilities π11i

Cluster Focus (simulation with average ei=58.77): Map of High Risk Cluster Probabilities π11i

Another simulation where clustering pattern known: cluster centre status under uneven risk scenario • Performance of π11i in measuring cluster centre status for contrasting situations • (1) EVEN RISK. High risk characterises all neighbours surrounding area i (so area i is cluster centre), and risk is evenly distributed among neighbours • (2) UNEVEN RISK. High risk is not common to all neighbours, but unevenly concentrated among a few neighbours, so area i is no longer a cluster centre, and possibly a cluster edge

Even risk vs uneven risk scenarios

WinBUGS code
  model {
    for (i in 1:N) {
      # Poisson likelihood; convolution model for log relative risk
      y[i] ~ dpois(mu[i]); mu[i] <- e[i]*r[i]
      log(r[i]) <- alph+s[i]+u[i]; u[i] ~ dnorm(0,tau.u)
      # binary indicator of elevated risk
      b[i] <- step(r[i]-1)
      # joins and join counts
      for (j in C[i]+1:C[i+1]) {
        j11[i,j] <- b[i]*b.map[j]; j10[i,j] <- b[i]*(1-b.map[j])
        j01[i,j] <- (1-b[i])*b.map[j]; j00[i,j] <- (1-b[i])*(1-b.map[j])
      }
      J11[i] <- sum(j11[i,C[i]+1 : C[i+1]]); J10[i] <- sum(j10[i,C[i]+1 : C[i+1]])
      J01[i] <- sum(j01[i,C[i]+1 : C[i+1]]); J00[i] <- sum(j00[i,C[i]+1 : C[i+1]])
      # proportions of local joins of each type
      pi.L[1,i] <- J11[i]/L[i]; pi.L[2,i] <- J10[i]/L[i]
      pi.L[3,i] <- J01[i]/L[i]; pi.L[4,i] <- J00[i]/L[i]
    }
    # neighbourhood vector of risks and indicators
    for (i in 1:NN) { wt[i] <- 1; r.map[i] <- r[map[i]]; b.map[i] <- b[map[i]] }
    # priors
    alph ~ dflat(); tau.s ~ dgamma(1,0.001); rho ~ dexp(1); tau.u <- rho*tau.s
    s[1:N] ~ car.normal(map[], wt[], L[], tau.s)
  }
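The code above relies on data vectors map, L and C that are not shown on the slide. Judging from the loop bounds, map appears to be the stacked list of neighbour IDs, L the number of neighbours per area, and C the cumulative neighbour count before each area (so C has N+1 entries, starting at 0). Under that assumption, which is inferred rather than stated in the source, the vectors could be prepared as follows. The helper name and example map are hypothetical.

```python
def bugs_adjacency(neighbours):
    """Build map, L, C (1-based area IDs) from a list of neighbour lists, one per area."""
    L = [len(nb) for nb in neighbours]            # L[i]: number of neighbours of area i
    C = [0]
    for n in L:
        C.append(C[-1] + n)                       # cumulative counts: assumed meaning of C
    adj = [j for nb in neighbours for j in nb]    # stacked neighbour IDs ("map" in the model)
    return {"N": len(neighbours), "NN": len(adj), "L": L, "C": C, "map": adj}

# Hypothetical map: four areas in a row, IDs 1..4
data = bugs_adjacency([[2], [1, 3], [2, 4], [3]])
```

With this layout the neighbours of area i occupy positions C[i]+1 to C[i+1] of map (in the 1-based indexing used by WinBUGS), which is the range the inner loop runs over.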

Real Example: Suicide in North West England • Suicide counts {yi,ei} for 922 small areas (middle level super output areas, MSOAs) in NW England over 5 years (2006-10) • Model: yi ~ Po(ei ri), relative risks ri averaging 1, log(ri)=α+si+ui, si ~ CAR, ui ~ WN (white noise) • Overdispersion: ui needed as well as spatial term • Monitor exceedance and high risk clustering with bi=1 if ri>1, bi=0 otherwise • Spatial interactions wij binary, based on adjacency

Smoothed Suicide Risk • Note small expected values ei, average 3.5: this impedes strong inferences about elevated risk, and so also about clustering

Real Example: Suicide in North West England • Flexscan (developed by Toshiro Tango) detects five significant clusters (p value under 0.05): the most likely cluster (albeit of irregular shape) consists of 9 areas in Blackpool

High Suicide Risk Cluster, Blackpool and Surrounds

Real Example: Suicide in North West England, Areas within the Flexscan cluster

Exceedance Probs for Blackpool Suicide Cluster (ARCMAP area IDs) • Possible questions: What is the most plausible cluster centre (if any)? Which areas are more likely to be cluster edges? • Of the two areas inside the doughnut, area 7 has the higher exceedance prob (E7=0.72, E4=0.48) • Area 9 has E9=0.98; five of its 6 neighbours have Ej>0.8, and the other neighbour has Ej=0.72. Area 9 has the highest π11i, namely 0.87 • Area 6 has four neighbours, only two with Ej>0.8 and two with Ej below 0.5 (E4=0.48, E41=0.26). It has π11i=0.54 and π10i=0.34, suggesting a cluster edge
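The figures quoted for areas 9 and 6 can be combined using the decomposition Ei=π11i+π10i. The ratio π11i/Ei computed below is only an illustrative summary of how much of the exceedance probability comes from joint high risk joins; it is not a statistic defined in the slides.

```python
# Values quoted on the slide
E9, pi11_9 = 0.98, 0.87
pi11_6, pi10_6 = 0.54, 0.34

pi10_9 = round(E9 - pi11_9, 2)      # implied outlier/edge component for area 9
E6 = round(pi11_6 + pi10_6, 2)      # implied exceedance probability for area 6

# Illustrative (hypothetical) summary: share of E_i due to high risk joins
share_9 = round(pi11_9 / E9, 2)
share_6 = round(pi11_6 / E6, 2)
```

Area 9 has nearly nine tenths of its exceedance probability in the π11 component, consistent with a cluster centre, while for area 6 the share is about six tenths, consistent with the cluster edge reading.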

Local Join-Counts for Bivariate Clustering • Local BB statistic for two outcomes A, B with event counts yAi, yBi • Binary indicators: bABi=1 if both rAi>1 and rBi>1; bABi=0 otherwise • Bivariate high risk clustering assessed using local bivariate join count J11ABi = bABi ∑j wij bABj

Local Join-Counts for Bivariate Clustering • J11ABi high in bivariate high risk cluster: when area i, and neighbours j of area i, both have high risk on both outcomes • Bivariate high risk clustering probability π11ABi, proportion of joins that are joint high risk, defined by E(J11ABi)=Li π11ABi • Estimate during MCMC run via π11ABi=J11ABi/Li
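A sketch of the bivariate version, assuming binary adjacency weights and stored draws of the two sets of relative risks. The two "iterations" of draws below are invented to keep the arithmetic easy to follow.

```python
import numpy as np

def bivariate_pi11(rA_draws, rB_draws, A):
    """Estimate pi11AB_i from draws of r_Ai and r_Bi (iterations x areas), binary adjacency A."""
    rA = np.asarray(rA_draws, dtype=float)
    rB = np.asarray(rB_draws, dtype=float)
    A = np.asarray(A, dtype=float)
    bAB = ((rA > 1) & (rB > 1)).astype(float)   # b_ABi = 1 if both risks are elevated
    L = A.sum(axis=1)                           # number of neighbours per area
    J11AB = bAB * (bAB @ A.T)                   # per-iteration local bivariate join count
    return J11AB.mean(axis=0) / L

# Hypothetical map (four areas in a row) and two stand-in iterations
A_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])
rA = [[1.5, 1.4, 1.2, 0.8],
      [1.5, 1.4, 0.9, 0.8]]
rB = [[1.3, 1.6, 1.1, 0.7],
      [1.3, 1.6, 1.2, 0.9]]
pi11AB = bivariate_pi11(rA, rB, A_line)
```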

Two outcomes: Likelihood and Prior • NW England, MSOAs: yA suicide deaths, yB self-harm hospitalisations • Self-harm much more frequent than suicide: average expected count eBi is 93 • Likelihood: yAi ~ Po(eAi rAi), yBi ~ Po(eBi rBi) • Assume correlated spatial effects: log(rAi)=αA+sAi+uAi; log(rBi)=αB+sBi+uBi, with uAi ~ WN, uBi ~ WN, and (sAi, sBi) ~ BVCAR (bivariate CAR)
