Spatial Data Mining: Three Case Studies

Spatial Data Mining:Three Case Studies For additional details www.cs.umn.edu/~shekhar/problems.html Shashi Shekhar, University of Minnesota Presented to UCGIS Summer Assembly 2001

Background • NSF workshop on GIS and DM (3/99) • Spatial data[1, 8] - traffic, bird habitats, global climate, logistics, ... • For spatial patterns - outliers, location prediction, associations, sequential associations, trends, …

Framework • Problem statement: capture special needs • Data exploration: maps, new methods • Try reusing classical methods • from data mining, spatial statistics • If reuse is not possible, invent new methods • Validation, Performance tuning

Case 1: Spatial Outliers • Problem: stations different from neighbors [SIGKDD 2001] • Data - space-time plot, distr. Of f(x), S(x) • Distribution of base attribute: • spatially smooth • frequency distribution over value domain: normal • Classical test - Pr.[item in population] is low • Q? distribution of diff.[f(x), neighborhood agg{f(x)}] • Insight: this statistic is distributed normally! • Test: (z-score on the statistics) > 2 • Performance - spatial join, clustering methods

Spatial outlier detection[4] Spatial outlier A data point that is extreme relative to it neighbors Given A spatial graph G={V,E} A neighbor relationship (K neighbors) An attribute function f: V -> R An aggregation function f aggr : R k -> R Confidence level threshold  Find O = {vi | vi V, vi is a spatial outlier} Objective Correctness: The attribute values of vi is extreme, compared with its neighbors Computational efficiency Constraints Attribute value is normally distributed Computation cost dominated by I/O op.

Spatial outlier detection Spatial Outlier Detection Test 1. Choice of Spatial Statistic S(x) = [f(x)–E y N(x)(f(y))] Theorem: S(x) is normally distributed if f(x) is normally distributed 2. Test for Outlier Detection | (S(x) - s) / s | >  Hypothesis I/O cost determined by clustering efficiency f(x) S(x) Spatial outlier and its neighbors

Spatial outlier detection Results 1. CCAM achieves higher clustering efficiency (CE) 2. CCAM has lower I/O cost 3. Higher CE leads to lower I/O cost 4. Page size improves CE for all methods CE value I/O cost Cell-Tree CCAM Z-order

Case 2: Location Prediction • Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000 • Problem: predict nesting site in marshes • given vegetation, water depth, distance to edge, etc. • Data - maps of nests and attributes • spatially clustered nests, spatially smooth attributes • Classical method: logistic regression, decision trees, bayesian classifier • but, independence assumption is violated ! Misses auto-correlation ! • Spatial auto-regression (SAR), Markov random field bayesian classifier • Open issues: spatial accuracy vs. classification accurary • Open issue: performance - SAR learning is slow!

Location Prediction[6, 7, 8] Given: 1. Spatial Framework 2. Explanatory functions: 3. A dependent function 4. A family of function mappings: Find: A function Objective:maximize classification_accuracy Constraints: Spatial Autocorrelation exists Nest locations Distance to open water Water depth Vegetation durability

Evaluation: Changing Model • Linear Regression • Spatial Regression • Spatial model is better

Evaluation: Changing measure New measure:

Case 3: Spatial Association Rules • Citation: Symp. On Spatial Databases 2001 • Problem: Given a set of boolean spatial features • find subsets of co-located features, e.g. (fire, drought, vegetation) • Data - continuous space, partition not natural, no reference feature • Classical data mining approach: association rules • But, Look Ma! No Transactions!!! No support measure! • Approach: Work with continuous data without transactionizing it! • confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)] • support: cardinality of spatial join of instances of fire, drought, dry veg. • participation: min. fraction of instances of a features in join result • new algorithm using spatial joins and apriori_gen filters

Co-location Patterns[2, 3] Can you find co-location patterns from the following sample dataset? Answers: and

Co-location Patterns Can you find co-location patterns from the following sample dataset?

Co-location Patterns Spatial Co-location A set of features frequently co-located Given A set T of K boolean spatial feature types T={f1,f2, … , fk} A set P of N locations P={p1, …, pN } in a spatial frame work S, pi P is of some spatial feature in T A neighbor relation R over locations in S Find Tc = subsets of T frequently co-located Objective Correctness Completeness Efficiency Constraints R is symmetric and reflexive Monotonic prevalence measure Reference Feature Centric Window Centric Event Centric

Co-location Patterns Comparison with association rules Participation index Participation ratio pr(fi, c) of feature fi in co-location c = {f1, f2, …, fk}: fraction of instances of fi with feature {f1, …, fi-1, fi+1, …, fk} nearby 2.Participation index = min{pr(fi, c)} Algorithm Hybrid Co-location Miner

Conclusions & Future Directions • Spatial domains may not satisfy assumptions of classical methods • data: auto-correlation, continuous geographic space • patterns: global vs. local, e.g. spatial outliers vs. outliers • data exploration: maps and albums • Open Issues • patterns: hot-spots, blobology (shape), spatial trends, … • metrics: spatial accuracy(predicted locations), spatial contiguity(clusters) • spatio-temporal dataset • scale and resolutions sentivity of patterns • geo-statistical confidence measure for mined patterns

References • S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Liu, “Spatial Databases: Accomplishments and Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999. • S. Shekhar and Y. Huang, “Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001. • S. Shekhar, Y. Huang, and H. Xiong, “Performance Evaluation of Co-location Miner”, the IEEE International Conference on Data Mining (ICDM’01), Nov. 2001. (submitted) • S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications“, the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001. • S. Shekhar, S. Chawla, the book“Spatial Database: Concepts, Implementation and Trends”. (To be published in 2001) • S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations”, Proc. Int. Confi. on 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000. • S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Modeling Spatial Dependencies for Mining Geospatial Data”, First SIAM International Conference on Data Mining, 2001. • S. Shekhar, P.R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla, “Spatial Contextual Classification and Prediction Models for Mining Geospatial Data”, IEEE Transactions on Multimedia, 2001. (Submitted) Some papers are available on the Web sites: http://www.cs.umn.edu/research/shashi-group/

Spatial Data Mining: Three Case Studies