
Parameter-Free Spatial Data Mining Using MDL.

This paper presents a parameter-free spatial data mining method using Minimum Description Length (MDL) that simultaneously finds patterns of spatial correlation and feature co-occurrence. The approach leverages quadtree compression and binary matrices to handle spatial data efficiently without relying on user input parameters. By exploiting spatial locality and employing a greedy search for local optima, the methodology iteratively groups rows and columns based on non-uniform priors for spatial coherence. The algorithms INNER and OUTER are detailed to demonstrate the process of grouping the data effectively. The complexity of the methods is analyzed in terms of computational efficiency. This technique is particularly valuable for applications involving biodiversity data, geographical data, environmental data, and historical and linguistic data analysis.

kmojica

Presentation Transcript


  1. Parameter-Free Spatial Data Mining Using MDL. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos. International Conference on Data Mining 2005

  2. Problems: • Finding patterns of spatial correlation and feature co-occurrence. • Automatically • That is, parameter-free. • Simultaneously • For example: • Spatial locations on a grid. • Features correspond to species present in specific cells. • Each pair of cell and species is 0 or 1, depending on whether the species is present in that cell. • Feature co-occurrence: • Cohabitation of species. • Spatial correlation: • Natural habitats for species.

  3. Motivation: • Many applications • Biodiversity Data • As we just demonstrated. • Geographical Data • Presence of facilities on city blocks. • Environmental Data • Occurrence of events (storms, drought, fire, etc.) in various locations. • Historical and Linguistic Data • Occurrence of words in different languages/countries, historical events in a set of locations. • Existing methods either: • Detect one pattern, but not both, or • Require user-input parameters.

  4. Background • Minimum Description Length (MDL): • Let L(D|M) denote the code length required to represent data D given (using) model M. Let L(M) be the code length required to describe the model itself. • The total code length is then: • L(D, M) = L(D|M) + L(M) • This was used in SLIQ and is the intuitive notion behind the connection between data mining and data compression. • The best model minimizes L(D, M), resulting in optimal compression. • Choosing the best model is a problem in its own right. • This will be explored further in the next paper I present.
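The trade-off in L(D, M) = L(D|M) + L(M) can be made concrete with a toy example. The sketch below is only illustrative: the Bernoulli data code and the per-block model charge of log2(len + 1) bits are assumptions chosen for simplicity, not the encoding used in the paper, and the names `bernoulli_bits` and `total_bits` are hypothetical.

```python
import math

def bernoulli_bits(ones, n):
    """L(D|M): bits to encode n binary symbols under a Bernoulli(ones/n) model."""
    if ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return -(ones * math.log2(p) + (n - ones) * math.log2(1 - p))

def total_bits(blocks):
    """L(D, M) = L(D|M) + L(M) for a model that splits D into the given blocks.

    L(M) is charged as log2(len(block) + 1) bits per block, enough to state
    the block's count of ones (an illustrative assumption).
    """
    data = sum(bernoulli_bits(sum(b), len(b)) for b in blocks)
    model = sum(math.log2(len(b) + 1) for b in blocks)
    return data + model

D = [0] * 16 + [1] * 16
one_block = total_bits([D])                  # cheap model, expensive data
two_blocks = total_bits([D[:16], D[16:]])    # costlier model, free data
```

Here the two-block model pays slightly more for L(M) but nothing for L(D|M), so its total is smaller and MDL prefers it.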

  5. Background • Quadtree Compression • Quadtrees: • Used to index and reason about contiguous variable size grid regions (among other applications, mostly spatial). • Used for 2D data; the 3D analogue is an octree. • “Full Quadtree”: All nodes have either 0 or 4 children. • Thus, all internal nodes correspond to a partitioning of a rectangular region into 4 subregions. • Each quadtree’s structure corresponds to a unique partitioning. • Transmission: • If we only care about the structure (spatial partitioning), we can transmit a 0 for internal nodes and a 1 for leaves in depth-first order. • If we transmit the values as well, the cost is the number of leaves times the entropy of the leaf value distribution.
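The depth-first structure code described on this slide can be sketched in a few lines. This is a minimal illustration, assuming a square grid whose side is a power of two and integer cell values; `build`, `structure_bits`, and `leaves` are hypothetical helper names.

```python
def build(grid, r=0, c=0, size=None):
    """Return a leaf value, or a list of 4 subtrees, for the square at (r, c)."""
    if size is None:
        size = len(grid)
    first = grid[r][c]
    if all(grid[i][j] == first
           for i in range(r, r + size) for j in range(c, c + size)):
        return first  # uniform region becomes a leaf
    h = size // 2
    # Children in the order: top-left, top-right, bottom-left, bottom-right.
    return [build(grid, r, c, h), build(grid, r, c + h, h),
            build(grid, r + h, c, h), build(grid, r + h, c + h, h)]

def structure_bits(node):
    """Depth-first structure code: 0 for an internal node, 1 for a leaf."""
    if not isinstance(node, list):
        return "1"
    return "0" + "".join(structure_bits(child) for child in node)

def leaves(node):
    """Leaf values in the same depth-first order."""
    if not isinstance(node, list):
        return [node]
    return [v for child in node for v in leaves(child)]

grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tree = build(grid)
bits = structure_bits(tree)  # "011101111"
```

The tree has 7 leaves and 2 internal nodes, so the structure code is 9 bits long, one bit per node.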

  6. Example

  7. Quadtree Encoding • Let T be a quadtree with m leaf nodes, of which m_p have value p. • The total codelength is the structure cost (one bit per node) plus the value cost (the number of leaves times the entropy of the leaf value distribution). • If we know the distribution of the leaf values, we can calculate this in constant time. • Updating the tree requires O(log n) time in the worst case, as part of the tree may require pruning.
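The slide's equation is not preserved in the transcript. A plausible reconstruction from the transmission scheme on slide 5 is sketched below: a full quadtree with m leaves has (m - 1)/3 internal nodes, hence (4m - 1)/3 nodes and structure bits, and the leaf values cost m times their entropy. The function names are hypothetical, and the formula should be read as an assumption rather than a quotation from the paper.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def quadtree_codelength(leaf_values):
    """Structure bits plus value bits for a full quadtree with these leaves."""
    m = len(leaf_values)
    structure = (4 * m - 1) / 3  # one bit per node (leaves plus internal nodes)
    values = m * entropy(Counter(leaf_values).values())
    return structure + values

uniform = quadtree_codelength([0, 0, 0, 0])  # 5 structure bits, 0 value bits
mixed = quadtree_codelength([0, 1, 0, 1])    # 5 structure bits, 4 value bits
```

Only m and the counts m_p enter the computation, which is why the codelength can be evaluated in constant time once those counts are maintained.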

  8. Binary Matrices / Bi-groupings: • Bi-grouping: • Simultaneous grouping of m rows and n columns into k and l disjoint row and column groups. • Let D denote an m × n binary matrix. • The cost of transmitting D is given as follows: • Recall the MDL Principle: L(D) = L(D|M) + L(M). • Let {Q_x, Q_y} be a bi-grouping. • Lemma (we will skip the proof): • The codelength for transmitting an m-to-k mapping Q_x where m_p symbols are mapped to the value p is approximately:
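The lemma's formula is not preserved in the transcript, so the sketch below uses only the standard entropy-coding estimate: labelling m items with group sizes m_p costs about m · H(m_1/m, ..., m_k/m) bits, and a block of binary cells costs its size times the entropy of its density. Both terms are assumptions made for illustration; any lower-order terms in the paper's lemma are omitted, and `mapping_bits` and `block_bits` are hypothetical names.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def mapping_bits(labels):
    """Leading term for an m-to-k mapping: m * H(m_1/m, ..., m_k/m)."""
    return len(labels) * entropy(Counter(labels).values())

def block_bits(D, rows, cols):
    """Sum over (row group, column group) blocks of size * H(density)."""
    ones, size = Counter(), Counter()
    for i, row in enumerate(D):
        for j, v in enumerate(row):
            ones[rows[i], cols[j]] += v
            size[rows[i], cols[j]] += 1
    return sum(size[b] * entropy([ones[b], size[b] - ones[b]]) for b in size)

D = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]
rows, cols = [0, 0, 1, 1], [0, 0, 1]
grouped = mapping_bits(rows) + mapping_bits(cols) + block_bits(D, rows, cols)
flat = block_bits(D, [0] * 4, [0] * 3)  # a single block, no mapping cost
```

With the bi-grouping, every block is pure, so the data cost vanishes and only the two mappings are paid for; the ungrouped matrix costs 12 bits.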

  9. Methodology • Exploiting spatial locality: • Bi-grouping as presented is nonspatial! • To make it spatial, assign a non-uniform prior to possible groupings. • That is, adjacent cells are more likely to belong to the same group. • Row groups correspond to spatial groupings. • “Neighborhoods” • “Habitats” • Row groupings should demonstrate spatial coherence. • Column groups correspond to “families”. • “Mountain birds” • “Sea birds” • Intuition • Alternately group rows and columns iteratively until the total cost L(D) stops decreasing. • Finding the global optimum is very expensive. • So our approach will use a greedy search for local optima.

  10. Algorithms • INNER: • Group given the number of row and column groups. Start with an arbitrary bi-grouping of matrix D into k row groups and l column groups. do { For each row i from 1 to m, assign the row to the group p, 1 ≤ p ≤ k, for which the “cost gain” is maximized. Repeat for columns, producing the next bi-grouping. t += 2 } while (L(D) is decreasing)
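The alternating loop of INNER can be illustrated with a brute-force toy. The sketch makes several simplifying assumptions: it is non-spatial (no quadtree prior), it scores only the block data cost, and it recomputes the full cost for every trial move instead of using incremental cost gains, so it does not have the linear running time claimed for the real algorithm. All function names are hypothetical.

```python
import math
from collections import Counter

def H2(ones, size):
    """Bits to encode `size` binary cells containing `ones` ones."""
    if ones in (0, size):
        return 0.0
    p = ones / size
    return -(ones * math.log2(p) + (size - ones) * math.log2(1 - p))

def cost(D, rows, cols):
    """Data cost of D under the bi-grouping (rows, cols)."""
    ones, size = Counter(), Counter()
    for i, row in enumerate(D):
        for j, v in enumerate(row):
            ones[rows[i], cols[j]] += v
            size[rows[i], cols[j]] += 1
    return sum(H2(ones[b], size[b]) for b in size)

def regroup(D, labels, other, k, by_rows):
    """Move each row (or column) to the group that minimizes the total cost."""
    for i in range(len(labels)):
        def trial(p):
            labels[i] = p
            return cost(D, labels, other) if by_rows else cost(D, other, labels)
        labels[i] = min(range(k), key=trial)

def inner(D, rows, cols, k, l):
    """Alternate row and column regrouping until the cost stops decreasing."""
    best = cost(D, rows, cols)
    while True:
        regroup(D, rows, cols, k, by_rows=True)
        regroup(D, cols, rows, l, by_rows=False)
        now = cost(D, rows, cols)
        if now >= best:
            return rows, cols, now
        best = now

D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
rows, cols, bits = inner(D, [0, 0, 1, 0], [0, 0, 1, 0], 2, 2)
```

Starting from a bi-grouping with one misplaced row and one misplaced column, the loop recovers the two diagonal blocks. A symmetric start such as rows = cols = [0, 1, 0, 1] leaves every move cost-neutral, which shows how the greedy search can stall in a local optimum.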

  11. Algorithms • OUTER: • Finds the number of row and column groups. Start with k0 = l0 = 1. Split the row group p* with the maximum per-row entropy, holding the columns fixed. Move each row in p* to a new group kT + 1 iff doing so would decrease the per-row entropy of p*, resulting in a new grouping. Assign the grouping to the result of INNER. If the cost does not decrease, return. Otherwise, increment t and repeat. Finally, perform this again for the columns.
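The row-split step of OUTER can be sketched as follows. This is an interpretation of the slide's description rather than the paper's exact procedure: rows are examined one at a time, p* is updated as rows leave it, and the last remaining row is never moved. The helper names and the choice to use the index k for the new group are assumptions.

```python
import math
from collections import Counter

def H2(ones, size):
    """Bits to encode `size` binary cells containing `ones` ones."""
    if ones in (0, size):
        return 0.0
    p = ones / size
    return -(ones * math.log2(p) + (size - ones) * math.log2(1 - p))

def group_bits(D, cols, members):
    """Bits to encode the rows in `members` as one row group."""
    ones, size = Counter(), Counter()
    for i in members:
        for j, v in enumerate(D[i]):
            ones[cols[j]] += v
            size[cols[j]] += 1
    return sum(H2(ones[q], size[q]) for q in size)

def split_rows(D, rows, cols, k):
    """Split the row group with the highest per-row entropy into two groups."""
    groups = {p: [i for i, g in enumerate(rows) if g == p] for p in range(k)}
    per_row = {p: group_bits(D, cols, m) / len(m) for p, m in groups.items() if m}
    p_star = max(per_row, key=per_row.get)
    keep = list(groups[p_star])
    for i in groups[p_star]:
        if len(keep) == 1:
            break  # never empty the original group
        rest = [r for r in keep if r != i]
        # Move row i only if p* becomes cheaper per row without it.
        if group_bits(D, cols, rest) / len(rest) < group_bits(D, cols, keep) / len(keep):
            keep = rest
            rows[i] = k  # index of the newly created group
    return rows

D = [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
new_rows = split_rows(D, [0, 0, 0, 0], [0, 0, 0, 0], k=1)
```

In the full algorithm the result of such a split would be handed to INNER and kept only if the total cost decreases.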

  12. Complexity • INNER is linear with respect to nonzero elements in D. • Let nnz denote those elements. • Let k be the number of row groupings and l be the number of column groupings. • Row swaps are performed in the quadtree and take O(log m) time each, where m is the number of cells. • Let T be the iterations required to minimize the cost. • O(nnz * (k + l + log m) * T) • OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz. • Let n be the number of row splits. • O((k + l)^2 * nnz + (k + l) * n * log m)

  13. Experiments • NoisyRegions • Three features (“species”) on a 32x32 grid. • So D has 32x32 = 1024 rows. • And 3 columns. • 3% of the cells, chosen at random, have a wrong species, also randomly chosen. • The spatial and non-spatial groupings are shown to the right. • Recall: Bi-grouping is not spatial by default. • Spatial grouping reduces the total codelength. • The approach is not quite perfect due to the heuristic nature of the algorithm.

  14. Experiments • Birds • 219 Finnish bird species over 3813 10x10km habitats. • Species are the features, habitats are cells. • So our matrix is 3813x219. • The spatial grouping is clearly more coherent. • Spatial grouping reveals Boreal zones: • South Boreal: Light Blue and Green. • Mid Boreal: Yellow. • North Boreal: Red. • Outliers are (correctly) grouped alone. • Species with specialized habitats. • Or those reintroduced into the wild.

  15. Other approaches • Clustering • k-means • Variants using different estimates of central tendency: • k-medoids, k-harmonic means, spherical k-means, … • Variants determining k based on some criteria: • X-means, G-means, … • BIRCH • CURE • DENCLUE • LIMBO • Also information-theoretic. • Approaches either lossy, parametric, or aren’t easily adaptable to spatial data.

  16. Room for improvement: • Complexity • O(n * log m) cost for reevaluating the quadtree codelength. • O(log m) worst-case time for each reevaluation/row swap * n swaps. • However, the average-case complexity is probably much better. • If we know something about the data distribution, we might be able to reduce this. • Faster convergence • Fewer iterations, reducing the scaling factor T. • Rather than stopping only when there is no decrease in cost, perhaps stop when we fall below a threshold? (Introduces a parameter) • Accuracy • The search will only find local optima, leading to errors. • We can employ some approaches used in annealing or genetic algorithms to attempt to find the global optimum. • Randomly restarting in the search space, for example. • Stochastic gradient descent – similar to what we’re already doing, actually.

  17. Conclusion • Simultaneous and automatic discovery of spatial correlation and feature co-occurrence. • Easy to exploit spatial locality. • Parameter-free. • Utilizes MDL: • Minimizes the sum of the model cost and the data cost given the model. • Efficient. • Almost linear with the number of entries in the matrix.

  18. References • S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos, "Parameter-Free Spatial Data Mining Using MDL", in Proceedings of ICDM, Houston, TX, U.S.A., November 27-30, 2005. • M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.

  19. Thanks! Any questions?
