What is the true shape of a disease cluster? The multi-objective genetic scan

Geoinfo 2006
Luiz Duczmal, André L.F. Cançado, Ricardo C.H. Takahashi
Univ. Federal Minas Gerais, Brazil: Statistics Dept., Electrical Engineering Dept., Mathematics Dept.

Irregularly shaped spatial disease clusters occur commonly in epidemiological studies, but their geographic delineation is poorly defined. Current spatial scan software usually displays only one of the many possible cluster solutions, which range in shape from the most compact round cluster to the most irregular one, depending on how strongly the penalization parameter restricts the freedom of shape. Even when a fairly complete set of solutions is available, the choice of the most appropriate parameter setting is left to the practitioner, whose decision is often subjective.

We propose quantitative criteria for choosing the best cluster solution through multi-objective optimization, by finding the Pareto-set in the solution space. Two competing objectives are involved in the search: regularity of shape and scan statistic value. Instead of sequentially running a cluster-finding algorithm with varying degrees of penalization, the complete set of solutions is found in parallel by a genetic algorithm.

The cluster significance concept is extended to this set in a natural and unbiased way, and is employed as a decision criterion for choosing the optimal solution. The Gumbel distribution is used to approximate the empirical scan statistic distribution, speeding up the significance estimation. The method is fast, with good power of detection. An application to breast cancer clusters is discussed.

Keywords: spatial scan statistic, disease clusters, geometric compactness penalty correction, Pareto-sets, multi-objective optimization, vector optimization, Gumbel distribution, genetic algorithm.
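As a rough illustration of the Pareto-set idea in the abstract, the sketch below filters a list of candidate solutions down to the non-dominated ones. It assumes each candidate is summarized by two numbers to be maximized, a shape-regularity score and a scan statistic value; the names `Solution`, `dominates`, `pareto_set` and the sample values are hypothetical and not taken from the talk.

```python
from typing import List, Tuple

# Each candidate is (shape regularity, scan statistic value); both are maximized.
Solution = Tuple[float, float]


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is at least as good as b in both objectives and better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_set(solutions: List[Solution]) -> List[Solution]:
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]


# Made-up candidates, for illustration only.
candidates = [(1.00, 10.0), (0.80, 14.0), (0.75, 12.0), (0.40, 20.0), (0.35, 19.0)]
front = pareto_set(candidates)
```

In this toy input, `(0.75, 12.0)` and `(0.35, 19.0)` are dropped because another candidate is both more regular and has a higher statistic. The quadratic scan is enough for a sketch; it says nothing about how the genetic algorithm in the talk maintains its solution set.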

Spatial Scan Statistics, Kulldorff (1997)
• Map with m regions
• Total population N
• C cases
Under the null hypothesis there is no cluster in the map, and the number of cases in each region is Poisson distributed.

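For each circle centered at the centroid of each region, let z be the collection of regions that lie inside it. Let
$c_z$ = number of cases inside z
$\mu_z$ = expected number of cases inside z
The scan statistic is defined as
$$L(z) = \left(\frac{c_z}{\mu_z}\right)^{c_z} \left(\frac{C - c_z}{C - \mu_z}\right)^{C - c_z}$$
if $c_z > \mu_z$, and one otherwise.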
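The statistic can be evaluated on the log scale to avoid overflow. The sketch below assumes Kulldorff's (1997) Poisson form, with `c_z` and `mu_z` standing for the observed and expected cases inside z and `total_cases` for C; these argument names are chosen here for illustration.

```python
import math


def kulldorff_llr(c_z: float, mu_z: float, total_cases: float) -> float:
    """Log likelihood ratio LLR(z) = log L(z) for the Poisson scan statistic.

    Returns 0.0 (that is, L(z) = 1) when the zone has no excess of cases.
    Assumes mu_z > 0 and c_z <= total_cases.
    """
    if c_z <= mu_z:
        return 0.0
    inside = c_z * math.log(c_z / mu_z)
    rest = total_cases - c_z
    # When every case falls inside the zone, the outside term is taken as 0.
    outside = rest * math.log(rest / (total_cases - mu_z)) if rest > 0 else 0.0
    return inside + outside


# Hypothetical zone: 20 observed cases where 10 are expected, out of C = 100.
example_llr = kulldorff_llr(20, 10, 100)
example_l = math.exp(example_llr)  # back to the likelihood ratio L(z)
```

Working with LLR(z) instead of L(z) does not change which zone is the most likely cluster, since the logarithm is monotone.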

The collection (or zone) z with the highest L(z) is the most likely cluster. We sweep through all the m² possible circular zones, looking for the highest L(z) value. We need to compare this value against the maximum L(z) for maps with cases distributed randomly under the null hypothesis. The whole procedure is repeated thousands of times, once for each set of randomly distributed cases (Monte Carlo, Dwass (1957)).
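A minimal sketch of this Monte Carlo procedure follows, under several assumptions not stated in the slides: circular zones are built as the k nearest centroids around each region, null cases are drawn with probability proportional to population, and the Gumbel approximation mentioned in the abstract is fitted by the method of moments. All function names are hypothetical.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def llr(c: float, mu: float, total: float) -> float:
    """Poisson log likelihood ratio of a zone; 0 when there is no excess."""
    if c <= mu:
        return 0.0
    rest = total - c
    outside = rest * math.log(rest / (total - mu)) if rest > 0 else 0.0
    return c * math.log(c / mu) + outside


def circular_zones(centroids: Sequence[Tuple[float, float]]) -> List[List[int]]:
    """For each centroid, the k nearest regions for k = 1..m-1 (about m^2 zones)."""
    m = len(centroids)
    zones = []
    for xi, yi in centroids:
        order = sorted(range(m), key=lambda j: (centroids[j][0] - xi) ** 2
                       + (centroids[j][1] - yi) ** 2)
        zones.extend(order[:k] for k in range(1, m))
    return zones


def max_llr(cases: Sequence[int], pops: Sequence[int], zones: List[List[int]]) -> float:
    """Highest LLR over all zones for one map of cases."""
    total_c, total_n = sum(cases), sum(pops)
    best = 0.0
    for zone in zones:
        c = sum(cases[j] for j in zone)
        mu = total_c * sum(pops[j] for j in zone) / total_n
        best = max(best, llr(c, mu, total_c))
    return best


def null_maxima(pops, zones, total_c, replications, rng) -> List[float]:
    """Max LLR for maps whose cases are scattered at random under the null."""
    regions = list(range(len(pops)))
    values = []
    for _ in range(replications):
        sim = [0] * len(pops)
        for j in rng.choices(regions, weights=pops, k=total_c):
            sim[j] += 1
        values.append(max_llr(sim, pops, zones))
    return values


def monte_carlo_p(observed: float, null_values: Sequence[float]) -> float:
    """Rank-based p-value: (1 + number of null maxima >= observed) / (R + 1)."""
    exceed = sum(v >= observed for v in null_values)
    return (exceed + 1) / (len(null_values) + 1)


def gumbel_p(observed: float, null_values: Sequence[float]) -> float:
    """Upper-tail p-value from a Gumbel fitted by moments (an assumed fit)."""
    beta = statistics.stdev(null_values) * math.sqrt(6) / math.pi
    loc = statistics.mean(null_values) - 0.5772156649 * beta
    return -math.expm1(-math.exp(-(observed - loc) / beta))


# Toy 3 x 3 grid with an excess of cases in the central region.
centroids = [(x, y) for y in range(3) for x in range(3)]
pops = [1000] * 9
cases = [10, 10, 10, 10, 40, 10, 10, 10, 10]
zones = circular_zones(centroids)
observed = max_llr(cases, pops, zones)
null_values = null_maxima(pops, zones, sum(cases), 199, random.Random(0))
p_rank = monte_carlo_p(observed, null_values)
p_gumbel = gumbel_p(observed, null_values)
```

The rank-based p-value cannot go below 1/(R + 1), which is why thousands of replications are needed for small significance levels; a fitted distribution such as the Gumbel can give smaller p-values from fewer replications.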

Extreme example of an irregularly shaped cluster

Penalty function to control the freedom of shape (joint work with Kulldorff and Huang)

A(z) = area of the zone z
H(z) = perimeter of the convex hull of z
Intuitively, the convex hull of a planar object is the region enclosed by a rubber band stretched around it.
Compactness: K(z) = the area of z divided by the area of the circle with perimeter H(z), that is,
$$K(z) = \frac{4 \pi A(z)}{H(z)^2}$$

Compactness for some common shapes
Circle: K(z) = 1
Square: K(z) = π/4
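The compactness values above can be checked numerically. The sketch below assumes, for simplicity, that a zone is given as a single simple polygon (a list of vertices); real zones are unions of map regions, so this is only an illustration. The helper names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon by the shoelace formula."""
    n = len(vertices)
    twice = sum(vertices[i][0] * vertices[(i + 1) % n][1]
                - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    return abs(twice) / 2.0


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull by Andrew's monotone chain, in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain: List[Point] = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def perimeter(vertices: List[Point]) -> float:
    """Total edge length of a closed polygon."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))


def compactness(vertices: List[Point]) -> float:
    """K = area of the zone / area of the circle with perimeter H = 4*pi*A / H^2."""
    hull_perimeter = perimeter(convex_hull(vertices))
    return 4.0 * math.pi * polygon_area(vertices) / hull_perimeter ** 2


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
l_shape = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0), (1.0, 2.0), (0.0, 2.0)]
k_square = compactness(square)    # equals pi / 4
k_l_shape = compactness(l_shape)  # lower: the hull encloses the missing corner
```

A non-convex shape such as the L is penalized twice: its area is smaller than that of its hull, and the hull perimeter is still long.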

Penalty function for the log of the likelihood ratio (LLR(z)):
$$K(z) \cdot LLR(z)$$
Generalized compactness correction:
$$K(z)^{a} \cdot LLR(z)$$
a = 1 : full compactness correction
a = 0.5 : medium compactness correction
a = 0.0 : no compactness correction
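To make the effect of the exponent concrete, the short sketch below reads the generalized correction as K(z) raised to the power a, times LLR(z), which is the reading consistent with the three cases listed above (a = 1 gives K(z)·LLR(z), a = 0 leaves LLR(z) unchanged). The function name and the sample numbers are hypothetical.

```python
def penalized_llr(llr: float, compactness: float, a: float = 1.0) -> float:
    """Compactness-corrected statistic K**a * LLR, with 0 <= a <= 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a is expected to lie between 0 and 1")
    return (compactness ** a) * llr


# An irregular zone with LLR = 20 and K = 0.25, under the three settings.
full = penalized_llr(20.0, 0.25, a=1.0)    # 5.0
medium = penalized_llr(20.0, 0.25, a=0.5)  # 10.0
none = penalized_llr(20.0, 0.25, a=0.0)    # 20.0
```

Since K(z) is at most 1, a larger a shrinks the statistic of irregular zones more strongly, so compact zones win the comparison more often.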

The Elliptic Scan Statistic (joint work with Kulldorff, Huang and Pickle)
The scanning window has variable location, size, shape and angle. A penalty function may be used.

Breast Cancer Mortality Rates (Pickle et al., Atlas of United States Mortality, NCHS, 1996)
[Figure: most likely cluster. Panels: circular; elliptical, axis ratio = 2; elliptical, axis ratio = 5]

[Figures: penalty correction, scale from 1 to 0, for the circular, elliptical, and irregular scans]

No penalty correction = disaster! (irregular scan)

Extreme example of an irregularly shaped cluster (joint work with Martin Kulldorff and Lan Huang)

[Map: homicide average 1998-2002, Minas Gerais State, Brazil, in homicides per 100,000 inhabitants per year; 853 municipalities. Source: DATASUS. Map by Ricardo Tavares. Scale bar: 100 km]

Genetic Algorithms (joint work with Cançado, Takahashi and Bessegato)
OBJECTIVE: Find a quasi-optimal solution for a maximization problem.
• Initial population.
• Random crossing-over of parents and offspring generation.
• Selection of children and parents for the next generation.
• Random mutation.
• Repeat the previous steps for a predefined number of generations or until there is no improvement in the functional.
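The loop outlined above can be sketched generically as follows. This is a toy genetic algorithm over bit strings, not the zone-based operators of the talk: the encoding, the one-point crossover, the parameter values, and the choice to mutate children before selection are all assumptions made for illustration.

```python
import random
from typing import Callable, List


def genetic_maximize(fitness: Callable[[List[int]], float], n_bits: int,
                     pop_size: int = 30, generations: int = 100,
                     mutation_rate: float = 0.02, patience: int = 15,
                     seed: int = 0) -> List[int]:
    """Toy genetic algorithm that maximizes `fitness` over bit strings (n_bits >= 2)."""
    rng = random.Random(seed)
    # Initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    stale = 0
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # Random crossing-over of two parents at a random cut point.
            parent_a, parent_b = rng.sample(population, 2)
            cut = rng.randrange(1, n_bits)
            child = parent_a[:cut] + parent_b[cut:]
            # Random mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        # Selection: the fittest among parents and children survive.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
        if fitness(population[0]) > fitness(best):
            best, stale = population[0], 0
        else:
            stale += 1
            if stale >= patience:  # stop when the functional no longer improves
                break
    return best


# Toy functional: the number of ones in the string.
best_string = genetic_maximize(sum, n_bits=20)
```

Because parents compete with their children in the selection step, the best value found never decreases from one generation to the next.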

We minimize the graph-related operations by means of fast offspring generation and fast evaluation of Kulldorff's scan likelihood ratio statistic. This algorithm is more than ten times faster and exhibits less variance than a similar approach using simulated annealing, and thus gives better confidence intervals for the Monte Carlo significance evaluation of the most likely cluster found.

Incidence of Malaria Deaths in the Brazilian Amazon (1998-2002)

Initial population construction
• Start at a region of the map.
• Add the neighbor which forms the highest LLR 2-cell zone.
• Keep adding, one at a time, the neighbor which forms the highest LLR zone.
• Stop when no higher LLR zone can be formed. (In the example shown, it is impossible to form a higher LLR 5-cell zone.)
• Start at another region of the map.
• Repeat the previous steps for all the regions of the map.
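The construction above amounts to a greedy growth from each region. The sketch below assumes the map is given as an adjacency dictionary and that a zone stops growing as soon as no neighbor increases its LLR; the data structures, function names, and the five-region toy map are hypothetical.

```python
import math
from typing import Dict, List, Set


def llr(c: float, mu: float, total: float) -> float:
    """Poisson log likelihood ratio of a zone; 0 when there is no excess."""
    if c <= mu:
        return 0.0
    rest = total - c
    outside = rest * math.log(rest / (total - mu)) if rest > 0 else 0.0
    return c * math.log(c / mu) + outside


def zone_llr(zone: Set[str], cases: Dict[str, int], pops: Dict[str, int]) -> float:
    """LLR of a zone, with expected cases proportional to its population."""
    total_c, total_n = sum(cases.values()), sum(pops.values())
    c = sum(cases[r] for r in zone)
    mu = total_c * sum(pops[r] for r in zone) / total_n
    return llr(c, mu, total_c)


def grow_zone(seed: str, neighbors: Dict[str, List[str]],
              cases: Dict[str, int], pops: Dict[str, int]) -> Set[str]:
    """Greedily add the neighbor giving the highest LLR until none improves it."""
    zone = {seed}
    best = zone_llr(zone, cases, pops)
    while True:
        frontier = {n for r in zone for n in neighbors[r]} - zone
        if not frontier:
            break
        # sorted() only makes tie-breaking deterministic.
        candidate = max(sorted(frontier), key=lambda n: zone_llr(zone | {n}, cases, pops))
        value = zone_llr(zone | {candidate}, cases, pops)
        if value <= best:
            break  # impossible to form a higher LLR zone with one more cell
        zone.add(candidate)
        best = value
    return zone


def initial_population(neighbors, cases, pops) -> List[Set[str]]:
    """One greedily grown zone per region of the map."""
    return [grow_zone(seed, neighbors, cases, pops) for seed in neighbors]


# Toy map: five regions in a row, with an excess of cases in B and C.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
pops = {r: 100 for r in neighbors}
cases = {"A": 1, "B": 10, "C": 12, "D": 1, "E": 1}
population = initial_population(neighbors, cases, pops)
```

Every zone built this way is connected by construction, since each added region is adjacent to the current zone.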

THE OFFSPRING GENERATION (a simple example)

THE OFFSPRING GENERATION (a simple example) Another possible numbering

THE OFFSPRING GENERATION (a more sophisticated example)

One instance of two parent trees

Advantages:
• The offspring generation is very inexpensive;
• All the children zones are automatically connected;
• Random mutations are easy to implement;
• The selection for the next generation is straightforward;
• Fast evolution convergence;
• The variance between different test runs is small.

Population Evolution Performance

Irregularly shaped clusters benchmark, Northeast US counties map. Duczmal L, Kulldorff M, Huang L. (2006) Evaluation of spatial scan statistics for irregularly shaped clusters. J. Comput. Graph. Stat.

Power evaluation of the genetic algorithm, compared to the simulated annealing algorithm.

Cluster of high incidence of breast cancer. São Paulo State, Brazil, 2002. Population adjusted for age and under-reporting.

Data source: DATASUS, G.L. Souza
Compactness correction: 1.0
Cluster cases: 2,924
Cluster population: 346,024
Incidence: 0.00845
LLR: 298.9
p-value: 0.001
[Map scale bar: 0 to 100 km]

• The genetic algorithm for disease cluster detection is fast and exhibits less variance compared to similar approaches;
• The potential use for epidemiological studies and syndromic surveillance is encouraged;
• The need for penalty functions for the irregularity of the cluster's shape is clearly demonstrated by the power evaluation tests;
• The power of detection of clusters is similar to that of the simulated annealing algorithm;
• The flexibility of shape control gives the practitioner more insight into the geographic cluster delineation.

Northeast US counties map with observed cases: age-adjusted female breast cancer, 1995.
Kulldorff M., Feuer E.J., Miller B.A., Freedman L.S. (1997) Breast cancer clusters in the Northeast United States: a geographic analysis. American Journal of Epidemiology, 146:161-170.
[Map legend, percent below/above expected: > 20%; 12% to 20%; 4% to 12%; -4% to +4%; -12% to -4%; -20% to -12%; < -20%]
