Hot Spot Detection in a Network Space: Geocomputational Approaches Ikuho Yamada, Ph.D. Department of Geography & School of Informatics IUPUI October 3, 2005 Fall 2005 Talk Series on Network and Complex Systems
Introduction • Clusters in a spatial phenomenon = hot spots, where the occurrence or level of the phenomenon is higher than expected. • Detecting hot spots is useful for • Understanding the nature of the phenomenon itself: • Factors influencing the phenomenon; • Decision making in related policies/planning: • Remedial/preventive actions; • Regional development planning; • New facility design, etc.
Introduction (cont.) • Potential problem: • The spatial distribution of the phenomenon may be affected by a transportation network; • E.g., vehicle crashes, retail facilities, crime locations, … • Analytical results derived without considering the network’s influence will be misleading, especially for detailed micro-scale data and local-scale analysis. → Analysis should be based on a network space rather than the Euclidean space. [Figure: an apparent cluster in Euclidean space that is not a cluster on the network]
Objectives • Is there any clustering tendency? • Where are the clusters? • How large are the clusters? • What causes the clusters? [Figure: workflow. Data (highway network, vehicle crash locations) → Stage 1: detecting local clusters, i.e., black spots (clusters of crashes), answering Questions 1, 2, & 3 → Stage 2: identifying influencing factors with a classifier to determine cluster or not (e.g., decision tree), answering Question 4]
Objectives (cont.) • Stage 1: Cluster detection in the network space • To develop exploratory spatial data analysis methods for network-based local cluster detection, named local indicators of network-constrained clusters (LINCS); • Event-based data → K-function; • Link-attribute-based data → Moran’s I and Getis & Ord’s G statistics.
Objectives (cont.) • Stage 2: Influencing factor identification • To examine the applicability of inductive learning techniques for constructing models that explain the clusters in relation to the characteristics of the network space; • Decision tree induction algorithms; • Feedforward neural networks; • Discrete choice/regression models, as examples of traditional statistical methods.
Outline • Constraints imposed by the network space • Stage 1 — Development of LINCS • Network K-function for event-based data • Stage 2 — Inductive learning • Decision tree induction to model relationships between the detected clusters using the network attributes • Case study: • 1997 vehicle crash data in Buffalo, NY • Conclusions
Constraints imposed by the network space • Location constraint: • Some spatial phenomena occur only on the links of the network. • E.g., vehicle crashes, retail facilities, geocoded addresses (crime locations, patient residences, …); • Movement constraint: • Movement between locations is restricted to the network links; • E.g., One can get to a gas station only by driving along the streets; • Distance between locations is more appropriately represented by the network (shortest-path) distance than by the Euclidean (straight-line) distance.
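The movement constraint above is the key point: the distance that matters is measured along the network, not through space. A minimal sketch of the contrast, using Dijkstra's algorithm on a hypothetical four-corner street block (all names and weights here are illustrative, not from the study data):

```python
import heapq
import math

def shortest_path_distance(graph, src, dst):
    """Dijkstra's algorithm over a weighted adjacency dict {node: {nbr: length}}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return math.inf

# Hypothetical unit square block: four corners joined by streets along the edges.
coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 1)}
graph = {
    "A": {"B": 1.0, "D": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"A": 1.0, "C": 1.0},
}

euclid = math.dist(coords["A"], coords["C"])       # straight line: sqrt(2) ~ 1.414
network = shortest_path_distance(graph, "A", "C")  # along the streets: 2.0
```

Even on this tiny example the network distance (2.0) exceeds the Euclidean distance (about 1.414) by over 40%, which is why distance-based cluster statistics must be recomputed in the network space.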
Network constraints (cont.) [Figures illustrating the location constraint and the movement constraint]
Global Network K-function (Okabe & Yamada 2001) • Extension of Ripley’s planar K-function (1976) to determine whether a point pattern has a clustering/dispersal tendency significantly different from random with respect to the network; • For a set of network-constrained events P, K(h) = (1/ρ) E[number of other events of P within network (shortest-path) distance h of a typical event], where ρ is the intensity of points per unit network length. [Figure: events within vs. not within network distance h of a target event]
Global Net K-function (cont.) An example of random distribution in a network space
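Given precomputed shortest-path distances between all event pairs, the empirical global network K-function reduces to counting pairs within each distance band and rescaling by the intensity. A minimal sketch (the line-network example data are hypothetical):

```python
def network_k(pairdist, total_length, hs):
    """Empirical global network K-function.
    pairdist: symmetric matrix of shortest-path distances between events;
    total_length: total length of the network; hs: distances to evaluate."""
    n = len(pairdist)
    rho = n / total_length  # intensity: events per unit network length
    return [sum(1 for i in range(n) for j in range(n)
                if i != j and pairdist[i][j] <= h) / (n * rho)
            for h in hs]

# Hypothetical line network of length 10 with events at positions 1, 2, 3, 8;
# on a single line, the shortest-path distance is just the position difference.
positions = [1.0, 2.0, 3.0, 8.0]
pairdist = [[abs(a - b) for b in positions] for a in positions]
k = network_k(pairdist, 10.0, [1.0, 2.0, 7.0])  # -> [2.5, 3.75, 7.5]
```

Under complete spatial randomness on the network, K(h) grows roughly linearly in h; the three tightly spaced events push the small-h values above that baseline, which is exactly the signal the significance envelopes test for.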
Global Net K-function (cont.) • Weakness of the global K-function in determining the scale of clustering: • If there is a strong cluster with radius R, K(h) tends to exceed the upper significance envelope, indicating clustering, even for h ≥ R. • Incremental K-function: • Instead of examining the total number of events within distance h, examine the increment in the number of events per unit distance: IncK(h_t) = K(h_t) − K(h_{t−1}); • It can identify the scale of clustering more accurately than the original K-function. [Figure: two patterns with similar K(h) but different IncK(h_t)]
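The incremental K-function is simply the first difference of the global K-function over the chosen distance steps; a one-line sketch:

```python
def incremental_k(k_values):
    """IncK(h_t) = K(h_t) - K(h_{t-1}); at the first distance IncK equals K itself."""
    return [k_values[0]] + [b - a for a, b in zip(k_values, k_values[1:])]

# With hypothetical K values [2.0, 5.0, 5.0, 6.0] at unit distance steps,
# the increments are [2.0, 3.0, 0.0, 1.0]: growth stalls past the second step,
# suggesting the clustering scale is about two distance units.
inc = incremental_k([2.0, 5.0, 5.0, 6.0])
```

Because a cluster of radius R stops contributing new pairs for h beyond R, IncK drops back toward its random-pattern level there, while the cumulative K stays inflated.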
Local Network K-function • Local indicator of clustering tendency: • Decomposition of the global K-function: K(h) = (1/n) Σ_i LK_i(h), where LK_i(h) = (1/ρ) × (number of other events within network distance h of event i); • This indicator is defined only at event locations, i.e., only at limited locations in the network; • Introduction of reference points: • Distributed over the network at a constant interval; indicator values are calculated for them; • c.f., the regular grid used in planar-space analyses such as the Geographical Analysis Machine (GAM).
Local Net K-function (cont.) • Local network K-function: LK_j(h) = (1/ρ) × (number of events within network distance h of reference point j), where j = 1, …, m, and m is the number of reference points; • For an observed pattern, local K-function values are obtained at the reference points for a range of distances h. This yields the LINCS method for event-based data (KLINCS).
Example of the KLINCS analysis • The incremental K-function can serve as an indicator of the scale of clustering, helping us determine which scale(s) of the local K-function to examine closely; [Figure: incremental K-function peaking at distance 2, the clustering scale in this example]
KLINCS (cont.) • Results of the local network K-function: • Significance at individual reference points is determined by comparison with 1,000 simulated random patterns on the network (0.1% significance level); • Obs. LK_j(h) ≥ the largest simulated LK_j(h) → clustering; • Obs. LK_j(h) ≤ the smallest simulated LK_j(h) → dispersal.
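The envelope test above can be sketched as follows; `simulate_once` stands in for "generate one random pattern on the network and compute its LK values", which the real method would implement with network-constrained random points (the uniform stand-in here is purely illustrative):

```python
import random

def classify_points(observed, simulate_once, n_sims=1000, seed=42):
    """Label each reference point by comparing its observed LK_j(h) against the
    min/max envelope of n_sims simulated random patterns (with 1000 simulations
    the envelope test corresponds to a 0.1% significance level)."""
    rng = random.Random(seed)
    sims = [simulate_once(rng) for _ in range(n_sims)]
    labels = []
    for j, obs in enumerate(observed):
        sim_vals = [s[j] for s in sims]
        if obs >= max(sim_vals):
            labels.append("cluster")
        elif obs <= min(sim_vals):
            labels.append("dispersal")
        else:
            labels.append("random")
    return labels

# Hypothetical stand-in: three reference points whose simulated LK values
# fall uniformly in [1, 2] under the random null.
fake_sim = lambda rng: [rng.uniform(1.0, 2.0) for _ in range(3)]
labels = classify_points([5.0, 0.0, 1.5], fake_sim)
```

With the stand-in null, the three observed values 5.0, 0.0, and 1.5 come out as "cluster", "dispersal", and "random" respectively: above the whole envelope, below it, and inside it.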
LINCS for link-attribute-based data • Moran’s I statistic (1948): • A global measure of spatial autocorrelation: dependence of a variable value at a location on the values at nearby locations in a spatial context; • Its local version is given by LISA (local indicators of spatial association) by Anselin (1995); • Network Moran’s I (Black 1992): • A measure of network autocorrelation: dependence between a variable value at a given link and the values of links connected to it in a network context; • Its local version yields the LINCS method ILINCS. • Getis and Ord’s local G statistics (1992): • A local measure of concentration of variable values around a region; • Applicable to link-attribute-based data (Berglund and Karlström 1999), yielding GLINCS.
Relationship between the I and G statistics [Figure: how the two statistics relate the value of the target link i to the values of the links in the neighborhood of link i]
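A compact sketch of both local statistics on link attributes, assuming a precomputed binary neighbourhood matrix (the three-link chain and its crash rates are hypothetical; real applications would also row-standardize weights and attach significance tests):

```python
def local_moran_and_g(values, w):
    """Local Moran's I_i (cross-product of standardized values with the
    neighbourhood) and Getis-Ord G_i* (share of the attribute total found in
    link i's neighbourhood, including link i itself).
    w[i][j] = 1 if links i and j are network neighbours, else 0."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    z = [(v - mean) / sd for v in values]
    total = sum(values)
    moran = [z[i] * sum(w[i][j] * z[j] for j in range(n) if j != i)
             for i in range(n)]
    g_star = [(values[i] + sum(w[i][j] * values[j] for j in range(n) if j != i)) / total
              for i in range(n)]
    return moran, g_star

# Hypothetical chain of three connected links 0-1-2 with crash rates 1, 2, 6:
w = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
moran, g_star = local_moran_and_g([1.0, 2.0, 6.0], w)
```

The contrast on link 2 illustrates the slide's point: its local I is negative (a high value next to a lower one), while its G* is high (its neighbourhood holds most of the attribute total), so the two statistics flag different aspects of the same pattern.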
From LINCS to inductive learning • Question: What causes the detected clusters? • LINCS gives a measure of clustering tendency for each spatial unit (ref. point or link segment). • Network data include attributes that may be related to the cause of the clusters. • E.g., travel speed, traffic volume, … • Spatial attributes can also be assigned to the spatial units. • E.g., distance from the closest intersection, travel time from the closest police station, average income of the area, …
LINCS to IL (cont.) • The spatial units can be categorized based on their LINCS values; • E.g., cluster/random/dispersion; large cluster/medium cluster/small cluster/random; cluster center/cluster fringe/random. [Figure: spatial units carry LINCS results (clustering/random/dispersion) together with network and spatial attributes; inductive learning (decision tree induction, feedforward neural network) searches for causal relationships between them]
Inductive learning • A means to model relationships between input variables and outcome (classification) without relying on prior knowledge: (Gahegan 2000) • Learns from a set of instances for which desired outcome is known; • Predicts outcomes for new instances with known input variables.
Decision tree • A way of representing classification rules in a hierarchical manner (Witten & Frank 2000; Thill & Wheeler 2000): • Node: a test on an attribute; • Leaf node: specification of a class. • Decision tree induction: • A recursive process of splitting a set of instances with correct class information (the training set) into subsets based on a particular attribute; • E.g., CHAID (Kass 1980), CART (Breiman et al. 1984), C4.5 (Quinlan 1993).
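The core of the recursive splitting step is choosing, at each node, the attribute that best separates the classes. A minimal information-gain sketch in the ID3/C4.5 family (C4.5 itself refines this with the gain ratio; the link attributes and labels below are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(instances, labels):
    """Pick the attribute index whose split yields the highest information gain."""
    base = entropy(labels)
    best_attr, best_gain = None, -1.0
    for a in range(len(instances[0])):
        groups = {}
        for x, y in zip(instances, labels):
            groups.setdefault(x[a], []).append(y)  # partition by attribute value
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        if base - remainder > best_gain:
            best_attr, best_gain = a, base - remainder
    return best_attr, best_gain

# Hypothetical spatial units: (speed class, near an intersection?) vs. KLINCS label.
X = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
y = ["cluster", "cluster", "random", "random"]
attr, gain = best_split(X, y)  # speed class separates the classes perfectly
```

Here attribute 0 (speed class) achieves the maximum gain of 1 bit, so it would become the root test; induction then recurses on each resulting subset until the leaves are (nearly) pure.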
Other techniques of modeling • Feedforward neural network with back-propagation (Thill & Mozolin 2000; Demuth & Beale 2000): • A way of deriving a mapping from multiple input variables to a classification from a training dataset. • Discrete choice model, as an example of traditional statistical modeling: • A way to analyze the relationship between a set of independent variables and a dependent variable that is binary or a discrete choice among a small set of alternatives; • Probit model / logit model.
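For the binary case, the logit model maps a linear combination of the independent variables to a probability through the logistic function. A sketch (the coefficients would be estimated from the training data; those shown are placeholders):

```python
import math

def logit_prob(x, beta):
    """Binary logit: P(y = 1 | x) = 1 / (1 + exp(-(beta[0] + beta[1:] . x))),
    with beta[0] the intercept and beta[1:] the attribute coefficients."""
    u = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-u))

# With placeholder coefficients and a unit at the attribute origin,
# the predicted cluster probability is exactly 0.5 when the utility is 0.
p = logit_prob([0.0, 0.0], [0.0, 0.5, -0.3])
```

The probit model replaces the logistic function with the standard normal CDF but is otherwise analogous.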
Data • 1997 vehicle crash data for the Buffalo, NY area (from the New York State Department of Transportation): • NY State highways; • Milepost system with a resolution of 0.1 mile; • 1,658 crashes in the study region; • Mileposts are used as reference points; scale of analysis = 0.1 mile; • Monte Carlo simulation with 1,000 trials (0.1% significance level). [Figure: crash distribution in the study region]
Stage 1: Global scale results • Under the null hypothesis: • Crash probability uniform over the network; • Crash probability proportional to traffic volume (Annual Average Daily Traffic, AADT). [Figures: K-function results at the 0.1-mile and 0.1–0.5-mile scales]
Stage 1: Local scale results • KLINCS at the 0.1-mile scale, not adjusted for AADT: cluster: 125 ref. points; random: 1,327 ref. points; dispersion: 0 ref. points (total: 1,452). • KLINCS at the 0.1-mile scale, adjusted for AADT: cluster: 110 ref. points; random: 1,304 ref. points; dispersion: 38 ref. points (total: 1,452).
Stage 1 local results (cont.) • ILINCS at the 0.1-mile scale, adjusted for AADT: positive autocorrelation: 23 links; not significant: 1,462 links; negative autocorrelation: 0 links (total: 1,485). • GLINCS at the 0.1-mile scale, adjusted for AADT: high-valued cluster: 19 links; not significant: 1,438 links; low-valued cluster: 28 links (total: 1,485).
Stage 1 local results (cont.) • Comparison of KLINCS at the 0.1-mile scale (adjusted for AADT) with the Priority Investigation Locations (PILs) designated by NYSDOT.
Stage 2: Inductive learning results • AADT-adjusted KLINCS classification • Decision tree by the C4.5 induction algorithm with 24 attributes
Stage 2 results (cont.) • AADT-adjusted GLINCS model • Dependent variable = degree of significant clustering (0~1000) • Model tree, where each leaf node represents a linear model
Stage 2 results (cont.) • Accuracy for the test set: • Not much difference among the three models, especially over all instances; • Because 90% of the instances are “random,” the modeling processes fit the models more closely to the random instances to reduce total errors → weighting schemes are needed to emphasize underrepresented classes.
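One common weighting scheme for this kind of imbalance is inverse-frequency class weights, so each class contributes equal total weight to the fitting criterion. A sketch (the 90/10 split mirrors the proportions reported above; the scheme itself is a standard remedy, not necessarily the one used in the study):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count): across the dataset every class
    then carries the same total weight, so rare classes are no longer drowned out."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# 90% "random" vs. 10% "cluster", as in the test-set discussion above:
weights = inverse_frequency_weights(["random"] * 90 + ["cluster"] * 10)
# each cluster instance now counts 5.0, each random instance about 0.556
```

Plugging such weights into the error measure penalizes misclassified cluster instances more heavily, steering the induced model away from the trivial "everything is random" solution.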
Conclusions • This research proposes a comprehensive framework for a network-based spatial cluster analysis when the phenomenon of interest is constrained by a network space; • Event-based data & link-attribute-based data; • Detection of local clusters (stage 1) • The LINCS methods can detect clusters without detecting spurious clusters caused merely by the network constraints; • Identification of influencing factors (stage 2) • Inductive learning techniques are useful to construct robust models to explain the detected clusters in relation to the network’s attributes.
Conclusions (cont.) • The combination of exploratory spatial data analysis and inductive learning modeling is a powerful tool • to reveal latent relationships between the distributions of spatial phenomena and the characteristics of physical/social environments; and then • to assist spatial decision-making processes by providing guidance on where/what to focus attention; • Stage 1 → spatial focus; Stage 2 → contextual focus. • The case study showed relatively good correspondence between the LINCS results and the PILs, which supports the effectiveness of the LINCS methods.
Thank you! Questions and suggestions are welcome.