A spatial neighborhood graph approach to Regional Co-location pattern discovery: summary of results. Pradeep Mohan*, Shashi Shekhar , Zhe Jiang University of Minnesota, Twin-Cities, MN. James A. Shine, James P. Rogers, Nicole Wayant
A spatial neighborhood graph approach to Regional Co-location pattern discovery: summary of results. Pradeep Mohan*, Shashi Shekhar, Zhe Jiang (University of Minnesota, Twin-Cities, MN); James A. Shine, James P. Rogers, Nicole Wayant (US Army ERDC, Topographic Engineering Center, Alexandria, VA). *Contact: mohan@cs.umn.edu
Outline • Motivation • Problem Formulation • Computational Approach • Conclusions and Future work
Motivation: Spatial Heterogeneity, the second law of Geography Spatial Heterogeneity (Goodchild, 2003; Goodchild, 2004) • Expectations vary across space. • Global models may not explain locally observed phenomena. • Need for place-based analysis. Spatial Heterogeneity in Retail • Traditional data mining: Which pairs of items sell together frequently? • Answer: Diaper in transaction → Beer in transaction. • Is this association true everywhere? Answer: In blue-collar neighborhoods. Global Spatial Data Mining – Global Co-location patterns • Which pairs of spatial features are located together frequently? Example: Gas stations and convenience stores. Our Focus: • Where do certain pairs of spatial features co-locate frequently? Example: Assaults happen frequently around downtown bars.
Applications • Crime analysis • Localizing frequent crime patterns; opportunities for crime vary across space. • Predicting localities of the next crime. Question: Do downtown bars lead to assaults more frequently? (Image courtesy: www.startribune.com) • Public Health • Localizing elevated disease risks around putative sources (e.g. mining areas). Question: Where does high asbestos concentration often lead to lung cancer? (Image courtesy: www.amazon.com) • Ecology • Localizing symbiotic relationships between different species of plants / animals. Question: Where are plover birds frequently found in the vicinity of a crocodile?
Regional co-location patterns (RCP) • Input: Spatial Features, Crime Reports. • Output: RCP (e.g. < (Bar, Assaults), Downtown >) • Subsets of spatial features. • Frequently located in certain regions of a study area.
Outline • Motivation • Problem Formulation • Basic Concepts • Problem Statement • Challenges • Related Work • Computational Approach • Conclusions and Future work
Basic Concepts: Neighborhoods Prevalence Locality • Subset of the spatial framework containing instances of a pattern. • Simple representation to visualize: convex hull. • Other representations are possible. Neighborhood Graph • Given: a spatial neighbor relation (spatial neighborhood size). • Nodes: individual event instances. • Edges: present if the neighbor relation is satisfied between two instances. • Based on the event-centric model (Huang, 2004).
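The neighborhood graph described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Euclidean distance threshold as the neighbor relation, and the names `build_neighborhood_graph`, `instances`, and `radius` are hypothetical.

```python
from itertools import combinations
from math import dist


def build_neighborhood_graph(instances, radius):
    """Return an adjacency dict {instance_id: set(neighbor_ids)}.

    instances: dict mapping instance id -> (feature_type, (x, y)).
    radius: spatial neighborhood size (assumed Euclidean, same units as x, y).
    """
    graph = {iid: set() for iid in instances}
    # Brute-force pairwise check; a spatial index would replace this for large data.
    for (a, (_, pa)), (b, (_, pb)) in combinations(instances.items(), 2):
        if dist(pa, pb) <= radius:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy data (hypothetical): three nearby events and one isolated event.
instances = {
    "A1": ("A", (0.0, 0.0)),
    "B1": ("B", (0.5, 0.0)),
    "C1": ("C", (0.5, 0.8)),
    "A2": ("A", (5.0, 5.0)),
}
graph = build_neighborhood_graph(instances, radius=1.0)
```

Nodes are individual event instances and an edge exists only when the neighbor relation holds, so `A2` ends up with no neighbors in this toy input.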
Basic Concepts: Quantifying regional interestingness Regional Participation Ratio (RPR) • Conditional probability of observing a pattern instance within a locality given an instance of a feature within that locality. Regional Participation Index (RPI) • Quantifies the local fraction participating in a relationship. [Worked examples for both measures appeared as figures on the original slide.]
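The formulas for these measures were figures on the slide and are not recoverable here, so the following is only a hedged sketch. It assumes that RPR is the number of distinct instances of a feature participating in the pattern inside a locality divided by a per-feature total supplied by the caller (the conclusions describe the measure as a "local fraction of the global count", which suggests the dataset-wide count), and that RPI is the minimum RPR over the pattern's features, by analogy with the participation index of the event-centric model. All function and variable names are hypothetical.

```python
def regional_participation_ratio(pattern_instances, feature, feature_total):
    """Fraction of `feature` instances that take part in the pattern.

    pattern_instances: list of dicts {feature_type: instance_id}, one per
        pattern instance found inside the locality.
    feature_total: denominator count for `feature` (assumed: dataset-wide count).
    """
    participating = {inst[feature] for inst in pattern_instances}
    return len(participating) / feature_total


def regional_participation_index(pattern_instances, feature_totals):
    """Minimum RPR over the pattern's features (assumed definition)."""
    return min(
        regional_participation_ratio(pattern_instances, f, total)
        for f, total in feature_totals.items()
    )


# Toy example (hypothetical counts): pattern {A, B} inside one locality.
ab_instances = [{"A": "A1", "B": "B1"}, {"A": "A1", "B": "B2"}]
totals = {"A": 4, "B": 8}
rpr_a = regional_participation_ratio(ab_instances, "A", totals["A"])  # 1 of 4
rpr_b = regional_participation_ratio(ab_instances, "B", totals["B"])  # 2 of 8
rpi_ab = regional_participation_index(ab_instances, totals)
```

Passing the denominators in explicitly keeps the sketch neutral about whether the count is taken over the whole dataset or over the locality.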
Detailed Statement (example parameters: Prevalence Threshold = 0.25, Spatial Neighborhood Size = 1 mile) • Given: • A spatial framework. • A collection of boolean spatial event types and their instances. • A minimum interestingness threshold, Pθ. • A symmetric and transitive neighbor relation R (e.g. based on spatial neighborhood size). • Find: • All RCPs with prevalence ≥ Pθ. • Objective: • Minimize computational cost. • Constraints: • Spatial framework is heterogeneous. • Interest measure captures spatial heterogeneity. • Completeness: All prevalent RCPs are reported. • Correctness: Only prevalent RCPs are reported.
Challenges • Conflicting requirements • Interest measure captures spatial heterogeneity while supporting scalable algorithms. • Exponential search space • Candidate pattern set cardinality is exponential in the number of event types. [Illustration: Spatial Data Mining (e.g. RCP) positioned between Statistical Rigor and Computational Scalability.]
Challenges (continued) • Exponential search space, illustrated by the candidate lattice over event types A, B, C: {NULL}; A, B, C; AB, AC, BC; ABC.
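The size of this lattice can be checked directly. The sketch below simply enumerates subsets of event types with the standard library; `candidate_patterns` and `min_size` are hypothetical names introduced for illustration.

```python
from itertools import combinations


def candidate_patterns(event_types, min_size=2):
    """Yield every subset of event types with at least `min_size` members."""
    types = sorted(event_types)
    for k in range(min_size, len(types) + 1):
        for combo in combinations(types, k):
            yield combo


# Full lattice for A, B, C, including {NULL} and the single features.
full_lattice = list(candidate_patterns("ABC", min_size=0))

# Co-location candidates need at least two features.
pairs_and_up = list(candidate_patterns("ABC"))

# With n event types there are 2**n - n - 1 candidates of size >= 2.
count_for_ten = sum(1 for _ in candidate_patterns("ABCDEFGHIJ"))
```

Three event types give 8 lattice nodes and 4 multi-feature candidates; ten event types already give 1013 candidates, before considering that each candidate may hold in many different localities.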
Contributions • Regional Co-location Patterns • Neighborhood-based formulation • Interest Measure • Captures the local fraction of events participating in patterns. • Shows attractive computational properties and honors spatial heterogeneity. • Computational Approach • Computational structure: pattern space enumeration • Performance enhancement: maximal-locality-based pruning strategies • Experimental Evaluation • Performance evaluation using a real dataset (Lincoln, NE) • Real-world case study.
Related Work Approaches for Regional Co-location Pattern discovery: • Zoning based (Celik et al., 2007) • Fitness function clustering (Eick et al., 2008) • Spatial neighborhood based (our work) Limitations of the zoning-based and fitness-function-clustering approaches • Reports one pattern per interesting region based on a criterion (e.g. Max). • Computational structure and pruning strategies not explored. • Clustering is based on real-valued attributes.
Outline • Motivation • Problem Formulation • Computational Approach • Pattern Space Enumeration • Performance Tuning • Experimental Evaluation • Conclusions and Future work
Computational Approach: Pattern Space Enumeration (Prevalence Threshold = 0.25) Key Idea • Enumerate the entire pattern space (expensive!). • Examine each pattern and prune. Steps: Compute neighborhoods → Identify candidate RCP instances → Accept (✔) or prune (✕) each RCP. [Figure: pattern lattice over {Null}, A, B, C in which each candidate RCP is annotated with its interest value; candidates at 0.16 are pruned (✕), candidates at 0.25 and 0.33 are accepted (✔).]
Performance Tuning: Key Ideas Key Idea • The interest measure shows special pruning properties in certain subsets of the spatial framework. Maximal Locality: Key Properties • Collection of connected instances. • Maximal localities are mutually disjoint. • Each contains several RCPs. Key Observations • RPI is anti-monotone within maximal localities. • Pruning a co-location, {AB}, prunes all its supersets (e.g. {ABC}, {ABCD}, etc.). • RPI within a maximal locality is an upper bound on the RPI of its constituent prevalence localities.
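One way to read "collection of connected instances" that are "mutually disjoint" is as the connected components of the neighborhood graph. The sketch below makes that assumption explicit; it is an interpretation of the slide, not a confirmed definition, and `maximal_localities` is a hypothetical name.

```python
from collections import deque


def maximal_localities(graph):
    """Return the connected components of an adjacency dict as a list of sets.

    Assumption: a maximal locality is a connected component of the
    neighborhood graph {instance_id: set(neighbor_ids)}.
    """
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component = set()
        queue = deque([start])
        # Breadth-first search collects everything reachable from `start`.
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Toy neighborhood graph (hypothetical): two clusters and one isolated event.
toy_graph = {
    "A1": {"B1"}, "B1": {"A1", "C1"}, "C1": {"B1"},
    "A2": {"B2"}, "B2": {"A2"},
    "C2": set(),
}
localities = maximal_localities(toy_graph)
```

Under this reading the disjointness property holds by construction, since every instance is assigned to exactly one component.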
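Performance Tuning: Maximal Locality Pruning (Prevalence Threshold = 0.25) Steps: Compute maximal localities (ML1, ML2, ML3) → evaluate patterns within each maximal locality → check surviving patterns in their prevalence localities. [Figure: lattice over {Null}, A, B, C evaluated per maximal locality. RPI values shown include {AB} 0.167, {AC} 0.25, {BC} 0.167 and {AB} 0.25, {AC} 0.25, {BC} 0.33; some branches yield no RCP. Candidate RCPs shown: <{AC}, PL1({AC})> with 0.25; <{BC}, PL3({BC})> and <{BC}, PL4({BC})> with 0.167 each, both below the threshold (✕). {ABC} is pruned automatically due to the anti-monotonicity of RPI.] Completeness • Pruning a pattern within a maximal locality does not prune any valid RCPs, due to the upper bound property of RPI. Correctness • Accepting a pattern involves additional checks so that only prevalent RCPs are reported.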
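The anti-monotone pruning inside one maximal locality can be sketched as a level-wise (Apriori-style) search. This is an illustration under assumptions, not the authors' algorithm: `prune_within_locality` and `rpi_bound` are hypothetical names, and the toy values reuse one triple of numbers from the slide ({AB} 0.167, {AC} 0.25, {BC} 0.167) without claiming which maximal locality they belong to.

```python
from itertools import combinations


def prune_within_locality(event_types, rpi_bound, threshold):
    """Level-wise search over patterns inside one maximal locality.

    rpi_bound(pattern) -> RPI of the pattern within the maximal locality,
    treated as an upper bound for its prevalence localities.
    Returns (survivors, evaluated); survivors still need a per-locality check.
    """
    survivors, evaluated = [], []
    level = [frozenset(c) for c in combinations(sorted(event_types), 2)]
    while level:
        kept = set()
        for pattern in level:
            evaluated.append(pattern)
            if rpi_bound(pattern) >= threshold:
                kept.add(pattern)
        survivors.extend(sorted(kept, key=sorted))
        size = len(level[0]) + 1
        # Join surviving patterns; keep a candidate only if every
        # (size - 1)-subset survived, so supersets of pruned patterns vanish.
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        level = [
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return survivors, evaluated


# Toy upper bounds for one maximal locality (values borrowed from the slide).
toy_bounds = {
    frozenset("AB"): 0.167,
    frozenset("AC"): 0.25,
    frozenset("BC"): 0.167,
}
survivors, evaluated = prune_within_locality("ABC", toy_bounds.__getitem__, 0.25)
```

With a threshold of 0.25 only {AC} survives, and {ABC} is never evaluated at all, which mirrors the "{ABC}: Pruned Automatically" note. A surviving pattern is only a candidate: its RPI in each prevalence locality must still be checked before it is reported.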
Experimental Evaluation: Spatial Neighborhood Size • What is the effect of spatial neighborhood size on the performance of the different algorithms? • Fixed parameters: Dataset size: 7498 instances; # Features: 5; Prevalence Threshold: 0.07 [Plots: # of RCPs examined and run time vs. spatial neighborhood size.] Trends • Run time: ML Pruning outperforms PS Enumeration by a factor of 1.5 to 5. • # of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 4.13 to 19.
Experimental Evaluation: Feature Types • What is the effect of the number of feature types on the performance of the different algorithms? • Fixed parameters: Dataset size: 7498 instances; Spatial neighborhood size: 800 feet; Prevalence Threshold: 0.07 [Plots: # of RCPs examined and run time vs. number of feature types.] Trends • Run time: ML Pruning outperforms PS Enumeration by a factor of 1.2. • # of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 1.6 to 3.5.
Real Dataset Case Study Q: Where do assaults frequently occur around bars? Are there other factors? Dataset: Lincoln, NE, crime data (Winter '07); Neighborhood Size = 0.25 miles; Prevalence Threshold = 0.07 [Maps: RCP of Larceny, Bars and Assaults; RCP of Larceny and Assaults; RCP of Bar and Assaults.] Observations • Assaults are more likely to be found in areas reporting larceny crimes (e.g. 47.6% vs 21.1%). • Bars in downtown are more likely to be crime prone than bars in other areas (e.g. 21.1%, 20.1%).
Conclusion and Future Work • Conclusions • Neighborhood-based formulation of regional spatial patterns. • Regional Participation Index: measures the local fraction of the global count. • Vector representation for prevalence localities (convex for simplicity; other representations are possible). • Future Work • Other representations for prevalence localities. • Statistical interpretation: LISA statistics, variants of local Ripley's K, multiple hypothesis testing. • Interpretation using predictive methods (e.g. Geographically Weighted Regression). • Acknowledgement: • Reviewers of ACM GIS. • Members of the Spatial Database and Spatial Data Mining group, UMN. • U.S. Department of Defense. • Mr. Tom Casady and Kim Koffolt.