SSCP: Mining Statistically Significant Co-location Patterns
Sajib Barua and Jörg Sander
Dept. of Computing Science, University of Alberta, Canada
Outline
• Introduction
• Related work
• Motivation
• Proposed Method
• Experimental evaluation
  • Synthetic data
  • Real data
• Conclusions
Definition
• Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.
• Examples: {Shopping mall, parking}, {Nile crocodile, Egyptian plover}
Event Centric Model
• Co-location is defined based on a spatial relationship R.
• A co-location type C is a set of n different spatial features f1, f2, …, fn.
• Example: {A2, B1, C1, D1} form a clique under a relation R.
  • {A2, B1, C1} is an instance of co-location {A, B, C}
  • {A2, B1, D1} is an instance of co-location {A, B, D}
  • {A2, C1, D1} is an instance of co-location {A, C, D}
  • {B1, C1, D1} is an instance of co-location {B, C, D}
  • {A2, B1, C1, D1} is an instance of co-location {A, B, C, D}
[Figure: spatial instances A2, B1, B2, C1, C2, C3, and D1]
Prevalence Measure
• Participation ratio (PR) of a feature in a co-location type C is the fraction of its instances participating in any instance of C.
• Participation index (PI) is the minimum participation ratio in C.
• PR and PI are anti-monotonic.
• Example (instances A1, A2, B1, B2, C1, C2, C3):
  • PI({A, B}) = min{1/2, 1/2} = 0.5
  • PI({B, C}) = min{1, 2/3} = 0.67
  • PI({A, C}) = min{1/2, 1/3} = 0.33
  • PI({A, B, C}) = min{1/2, 1/2, 1/3} = 0.33
  • PI({A, B, C}) <= each of PI({A, B}), PI({B, C}), and PI({A, C})
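The PR and PI definitions can be sketched in a few lines of Python. The neighbor pairs below are a hypothetical relation R, chosen only so that the results agree with the ratios on this slide (1/2, 2/3, 1/3); the actual point layout of the figure is not recoverable, and the function names are illustrative.

```python
from itertools import combinations, product

# Hypothetical instances per feature and hypothetical neighbor pairs (relation R).
instances = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2", "C3"]}
neighbors = {frozenset(p) for p in
             [("A2", "B1"), ("A2", "C3"), ("B1", "C3"), ("B2", "C1")]}

def is_clique(group):
    """True if every pair of instances in the group is related under R."""
    return all(frozenset(pair) in neighbors for pair in combinations(group, 2))

def participation_ratios(colocation):
    """PR of each feature: fraction of its instances found in any clique instance."""
    feats = sorted(colocation)
    participating = {f: set() for f in feats}
    # Brute force over one instance per feature; fine for a toy example.
    for group in product(*(instances[f] for f in feats)):
        if is_clique(group):
            for f, inst in zip(feats, group):
                participating[f].add(inst)
    return {f: len(participating[f]) / len(instances[f]) for f in feats}

def participation_index(colocation):
    """PI: the minimum participation ratio over the features of the co-location."""
    return min(participation_ratios(colocation).values())

for c in [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C"}]:
    print(sorted(c), round(participation_index(c), 2))
```

Enumerating every combination of instances is exponential in the pattern size; it is used here only to keep the definition visible, not as an efficient mining strategy.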
Related Work
• Spatial statistics
  • Ripley's K function, distance-based measures, co-variogram function.
• Spatial data mining
  • Koperski et al. [4] mine spatial association rules.
  • Morimoto [5] also looks for frequently occurring patterns.
  • Shekhar et al. [2] introduce three models to materialize transactions.
  • Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].
Limitations of the Existing Methods
• Spatial statistics
  • Defined only for pairs.
• Co-location mining
  • Only one global threshold for PI is used.
  • No guideline for setting the PI-threshold.
  • Do not address the spatial auto-correlation and feature abundance effects.
A simple threshold can report meaningless patterns or can miss meaningful patterns.
Motivation
• A has fewer instances; B is abundant.
• A & B have a true spatial dependency.
• Assume PI-threshold = 0.4.
• Existing co-location mining algorithms will not report {A, B}.
Motivation
• A & B are abundant. Both are randomly distributed.
• They do not have any true spatial dependency.
• Assume PI-threshold = 0.4.
• Existing co-location mining algorithms will report {A, B}.
Motivation
• A & B are auto-correlated.
• They do not have any true spatial dependency.
• Assume PI-threshold = 0.4.
• Existing co-location mining algorithms will report {A, B}.
Our Idea
• Our approach uses a statistical test.
• Spatial dependency is measured using PI.
• Example (#○ = 12, #∆ = 12): if features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?
Generate Artificial Data Sets
[Figure: the observed data alongside artificial data sets generated under the null model]
p-value computation
• If p <= α, PIobs is statistically significant at level α.
• Example: PIobs = 0.41, p-value = 0.163, α = 0.05. Since 0.163 > 0.05, PIobs is not significant.
Auto-correlated Feature
• A & B are auto-correlated.
• They do not have any true spatial dependency.
Modeling Auto-correlation
• Auto-correlation is modeled as a cluster process: a Poisson cluster process [9].
• Auto-correlation is measured in terms of the intensity and type of distribution of a parent process and of the offspring process around each parent.
Estimating Summary Statistics
• Estimate the summary statistics.
  • Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).
  • Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).
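To make the κ and µ parameters concrete, here is a minimal sketch of sampling one auto-correlated feature from a Matérn cluster process on the unit square. It assumes the usual parameterization (parent intensity κ, Poisson mean µ offspring per parent, cluster radius r) and is deliberately simplified: parents are drawn only inside the window, offspring that fall outside are discarded, and the point count is not forced to equal the observed count as the null model requires. The helper names are illustrative.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson(lam) variate by Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def matern_cluster(kappa, mu, r, rng):
    """Sample a simplified Matérn cluster process on the unit square."""
    points = []
    for _ in range(poisson(kappa, rng)):          # number of parents (window area = 1)
        px, py = rng.random(), rng.random()       # parent location, not itself a point
        for _ in range(poisson(mu, rng)):         # number of offspring of this parent
            rho = r * math.sqrt(rng.random())     # sqrt makes the draw uniform in the disc
            theta = 2 * math.pi * rng.random()
            x, y = px + rho * math.cos(theta), py + rho * math.sin(theta)
            if 0 <= x <= 1 and 0 <= y <= 1:       # drop offspring outside the window
                points.append((x, y))
    return points

# Parameter values taken from the fitted model reported later in the slides.
sample = matern_cluster(kappa=40, mu=5, r=0.05, rng=random.Random(0))
print(len(sample))
```

A production simulation would also place parents in a buffer around the window to avoid thinning near the border; that step is omitted here for brevity.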
Null Model Design
• The artificial data sets maintain the following properties of the observed data:
  • same number of instances for each feature, and
  • similar spatial distribution for each individual feature.
p-value computation
• Estimate the p-value: the probability, under the null model, of a PI-value equal to or higher than PIobs.
• Use randomization tests, where a large number of data sets conforming to the null hypothesis are generated.
• How many simulations do we need?
  • Diggle suggested 500 simulations for α = 0.01 [10].
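The randomization test can be sketched for the simplest case: two features that are each uniformly distributed over the unit square. The estimator (count + 1) / (simulations + 1) is a common Monte Carlo convention and is an assumption here, since the slide's own formula did not survive extraction. The neighborhood radius and all function names are likewise illustrative, and auto-correlated features would need a cluster-process generator instead of `csr_points`.

```python
import math
import random

def pair_pi(points_a, points_b, radius):
    """PI of a 2-feature co-location: the smaller of the two participation ratios."""
    a_in, b_in = set(), set()
    for i, p in enumerate(points_a):
        for j, q in enumerate(points_b):
            if math.dist(p, q) <= radius:      # neighbor relation R: within radius
                a_in.add(i)
                b_in.add(j)
    return min(len(a_in) / len(points_a), len(b_in) / len(points_b))

def csr_points(n, rng):
    """n points placed uniformly at random on the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def randomization_p_value(pi_obs, n_a, n_b, radius, n_sims, rng):
    """Share of null-model data sets whose PI is at least the observed PI."""
    at_least = 0
    for _ in range(n_sims):
        # Each artificial data set keeps the observed number of instances per feature.
        pi_sim = pair_pi(csr_points(n_a, rng), csr_points(n_b, rng), radius)
        if pi_sim >= pi_obs:
            at_least += 1
    return (at_least + 1) / (n_sims + 1)

alpha = 0.05
p = randomization_p_value(pi_obs=0.41, n_a=12, n_b=12, radius=0.1,
                          n_sims=499, rng=random.Random(1))
print(p, "significant" if p <= alpha else "not significant")
```

With 499 simulations the smallest attainable p-value is 1/500, which is why the number of simulations has to grow as α shrinks.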
Improving Runtime: Data Generation
• In a simulation, we only generate the feature instances of those clusters that are close enough to instances of other features (either auto-correlated or not auto-correlated).
• This saves time in the artificial data generation step of a simulation.
Improving Runtime: PI-value Computation
• In a simulation Ri, the PI-value of a co-location C does not always need to be computed.
• Procedure:
  • In each simulation, compute the PI-values of all possible 2-size subsets.
  • For a co-location C of size k (> 2), look up the simulated PI-values of its 2-size subsets. If a subset C' is found whose simulated PI-value is < PIobs(C), the simulated PI-value of C is not required to be computed.
  • Otherwise, the PI-value of C is computed for simulation Ri.
An Example
Four features A, B, C, D
• {A, B, C}: if the simulated PI-value of {A, B} < PIobs({A, B, C}), then the simulated PI-value of {A, B, C} < PIobs({A, B, C}). No need to compute the simulated PI-value of {A, B, C}.
• A simulated PI-value of {A, B, C} < PIobs({A, B, C}) does not imply that the simulated PI-value of {A, B, C, D} < PIobs({A, B, C, D}).
• {A, B, C, D}: decided by checking its 2-size subsets.
The worst case complexity is O(2^n).
• The size of the largest co-location is much smaller.
• The largest co-location size is predictable.
• If PIobs(C) = 0, we do not compute the PI-value of C in the simulations.
• Our pruning strategies.
All these keep the actual cost in practice less than the worst case cost.
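The 2-size subset check can be sketched as a lookup that runs before any expensive PI computation. The function names, the dictionary of simulated pair PI-values, and the `compute_pi` callback are hypothetical; the sketch only illustrates the anti-monotonicity argument from the slides.

```python
from itertools import combinations

def can_skip(colocation, pi_obs, pair_pi_sim):
    """True if some 2-size subset already has a simulated PI below PIobs(C).

    By anti-monotonicity the simulated PI of C cannot exceed that of any
    subset, so it must also be below PIobs(C) and need not be computed.
    """
    return any(pair_pi_sim[frozenset(pair)] < pi_obs
               for pair in combinations(sorted(colocation), 2))

def simulated_pi_reaches_observed(colocation, pi_obs, pair_pi_sim, compute_pi):
    """Decide whether this simulation counts towards the p-value of C."""
    if can_skip(colocation, pi_obs, pair_pi_sim):
        return False                       # pruned: no PI computation for C
    return compute_pi(colocation) >= pi_obs

# Hypothetical simulated PI-values of the 2-size subsets in one simulation.
pair_pi_sim = {frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.7}
print(can_skip({"A", "B", "C"}, 0.33, pair_pi_sim))   # {A, B} is below 0.33
```

Note that the comparison uses PIobs of the larger pattern C, not PIobs of the pair, which is why a pruned {A, B, C} says nothing about {A, B, C, D}.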
Experimental Results (1)
Negative association:
• Features ○ and ∆ with 40 instances each.
• This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between these two features.
Result:
• PIobs = 0.55 and p-value = 0.931 > 0.05 (α), hence {○, ∆} will not be reported.
Experimental Results (2)
Auto-correlation:
• #○ = 100 and #∆ = 120.
• ∆: independently and uniformly distributed over the space.
• ○: spatially auto-correlated.
• In our generated data, ∆ is found in most clusters of ○.
• The summary statistics of ○ are estimated by fitting the model of a Matérn cluster process [9] (κ = 40, µ = 5, r = 0.05).
Results:
• PIobs({○, ∆}) = 0.49; an existing algorithm will report the pattern if a threshold <= 0.49 is chosen.
• p-value = 0.383 > 0.05 (α); {○, ∆} is not reported.
Experimental Results (3)
Multiple features: #○ = 40, #∆ = 40, #+ = 118, #x = 40, and # = 30.
• Study area = unit square; co-location neighborhood radius = 0.1
• Features ○ and ∆ are negatively associated.
• Feature + is spatially auto-correlated.
• Features +, ○, and x are positively associated.
• Feature is randomly distributed.
Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, }, {○, x, }, {+, x, }, and {○, +, x, }.
Runtime Comparison (1)
• Features ○, ∆, +: auto-correlated and strongly associated. Each has 400 instances.
• Feature x: randomly distributed, with 20 instances.
• Our algorithm finds all co-locations of features ○, ∆, and x.
• The number of instances of each auto-correlated feature is increased:
  • the number of clusters is kept the same, and
  • the number of instances per cluster is increased by a factor k.
[Figures: Runtime comparison; Speedup]
Runtime Comparison (2)
• The number of clusters for features ○, ∆, and + is increased by a factor k, but the number of instances per cluster is kept the same.
• The total number of instances of x is increased by the same factor k.
[Figures: Runtime comparison; Speedup]
Ants Data
• ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).
• PIobs({Cataglyphis, Messor}) = min{24/29, 30/68} = 0.44.
• p-value = 0.142 > 0.05 (α); co-location {○, ∆} is not significant.
• R. D. Harkness also did not find any clear association between these two species.
• An existing algorithm will report {○, ∆} if PI-threshold <= 0.44.
Toronto Address Repository Data
Found Co-locations
Conclusions
• A new definition for co-location pattern.
• Does not depend on a global threshold.
• Statistically meaningful.
• Runtime cost of randomization tests is reduced.
• Investigate other prevalence measures to check if they allow additional pruning techniques.
• Removing redundant patterns.
References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)
2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results. In: Proc. SSTD, pp. 236-256 (2001)
3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)
4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In: SSD, pp. 47-66 (1995)
5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In: SIGKDD, pp. 353-358 (2001)
6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In: Proc. GIS, pp. 241-249 (2004)
7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)
8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In: Proc. GIS, pp. 250-259 (2008)
9. Illian et al.: Statistical Analysis and Modeling of Spatial Point Patterns.
10. Diggle, P. J.: Statistical Analysis of Spatial Point Patterns (2003)
Questions?