Parameter-Free Spatial Data Mining Using MDL. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos. International Conference on Data Mining 2005
Problems: • Finding patterns of spatial correlation and feature co-occurrence. • Automatically • That is, parameter-free. • Simultaneously • For example: • Spatial locations on a grid. • Features correspond to species present in specific cells. • Each (cell, species) pair is 0 or 1, depending on whether that species is present in that cell. • Feature co-occurrence: • Cohabitation of species. • Spatial correlation: • Natural habitats for species.
Motivation: • Many applications • Biodiversity Data • As we just demonstrated. • Geographical Data • Presence of facilities on city blocks. • Environmental Data • Occurrence of events (storms, drought, fire, etc.) in various locations. • Historical and Linguistic Data • Occurrence of words in different languages/countries, historical events in a set of locations. • Existing methods either: • Detect one pattern, but not both, or • Require user-input parameters.
Background • Minimum Description Length (MDL): • Let L(D|M) denote the code length required to represent data D given (using) model M. Let L(M) be the complexity required to describe the model itself. • The total code length is then: • L(D, M) = L(D|M) + L(M) • This was used in SLIQ and is the intuitive notion behind the connection between data mining and data compression. • The best model minimizes L(D, M), resulting in optimal compression. • Choosing the best model is a problem in its own right. • This will be explored further in the next paper I present.
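To make the two-part principle concrete, here is a minimal Python sketch (not from the paper): it assumes a naive fixed charge of 32 bits per model parameter and a Bernoulli (entropy) data cost, then compares two toy models of a random binary matrix. The toy models and the per-parameter charge are illustrative assumptions, not the paper's encoding.

```python
import numpy as np

def bernoulli_entropy_bits(p):
    """Shannon entropy (bits per symbol) of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def total_codelength(data_bits, num_params, bits_per_param=32):
    """Two-part MDL cost L(D, M) = L(D|M) + L(M); the model cost here is a
    naive fixed charge per parameter (an assumption for illustration)."""
    return data_bits + num_params * bits_per_param

# Toy model selection: describe a 0/1 matrix with one global density
# (1 parameter) or with one density per row (one parameter per row).
D = np.random.rand(64, 8) < 0.2

global_model = total_codelength(D.size * bernoulli_entropy_bits(D.mean()), 1)
per_row_model = total_codelength(
    sum(D.shape[1] * bernoulli_entropy_bits(row.mean()) for row in D),
    D.shape[0])

best = "global density" if global_model <= per_row_model else "per-row densities"
print(f"global: {global_model:.1f} bits, per-row: {per_row_model:.1f} bits -> {best}")
```

The model that minimizes the total code length wins, which is exactly the selection criterion used throughout the paper.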
Background • Quadtree Compression • Quadtrees: • Used to index and reason about contiguous, variable-size grid regions (among other applications, mostly spatial). • Used for 2D data; the 3D analogue is the octree, and higher dimensions use 2^k-ary trees. • “Full Quadtree”: All nodes have either 0 or 4 children. • Thus, all internal nodes correspond to a partitioning of a rectangular region into 4 subregions. • Each quadtree’s structure corresponds to a unique partitioning. • Transmission: • If we only care about the structure (spatial partitioning), we can transmit a 0 for internal nodes and a 1 for leaves in depth-first order. • If we transmit the values as well, the cost is the number of leaves times the entropy of the leaf value distribution.
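A small Python sketch of the structure-only transmission just described (one bit per node, in depth-first order). The node class and example tree are illustrative, not the paper's data structures.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class QuadNode:
    value: Optional[int] = None                  # leaf value (e.g., a group label)
    children: Optional[List["QuadNode"]] = None  # exactly 4 children if internal

def structure_bits(node: QuadNode, out: List[int]) -> None:
    """Emit the structure of a full quadtree in depth-first order:
    0 for an internal node, 1 for a leaf."""
    if node.children is None:
        out.append(1)
    else:
        out.append(0)
        for child in node.children:
            structure_bits(child, out)

# Root split once into four leaves: the structure codes as 0 1 1 1 1.
root = QuadNode(children=[QuadNode(value=v) for v in (0, 0, 1, 1)])
bits: List[int] = []
structure_bits(root, bits)
print(bits)  # [0, 1, 1, 1, 1]
```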
Quadtree Encoding • Let T be a quadtree with m leaf nodes, of which mp have value p. • The total codelength is one structure bit per node plus the leaf-value cost: L(T) = |T| + Σp mp log2(m / mp), where |T| denotes the total number of nodes. • If we know the distribution of the leaf values, we can calculate this in constant time. • Updating the tree requires O(log n) time in the worst case, as part of the tree may require pruning.
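A sketch of that codelength in Python, following the two components from these slides (one structure bit per node, plus the number of leaves times the entropy of the leaf-value distribution). It recomputes from the leaf counts rather than maintaining them incrementally as the paper does.

```python
import math
from collections import Counter

def quadtree_codelength(num_nodes: int, leaf_values) -> float:
    """L(T) = |T| + sum_p mp * log2(m / mp): one structure bit per node plus
    m * H(leaf value distribution) bits for the m leaf values."""
    counts = Counter(leaf_values)
    m = sum(counts.values())
    value_bits = sum(mp * math.log2(m / mp) for mp in counts.values())
    return num_nodes + value_bits

# The 5-node tree from the previous sketch: 4 leaves with values [0, 0, 1, 1].
print(quadtree_codelength(num_nodes=5, leaf_values=[0, 0, 1, 1]))  # 5 + 4*1 = 9.0 bits
```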
Binary Matrices / Bi-groupings: • Bi-grouping: • Simultaneous grouping of m rows and n columns into k and l disjoint row and column groups. • Let D denote an m x n binary matrix. • The cost of transmitting D is given as follows: • Recall the MDL Principle: L(D) = L(D|M) + L(M). • Let {Qx, Qy} be a bi-grouping. • Lemma (we will skip the proof): • The codelength for transmitting an m-to-k mapping Qx where mp symbols are mapped to the value p is approximately: Σp mp log2(m / mp), i.e., m times the entropy of the group-size distribution.
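A sketch of how the pieces combine into a total bi-grouping cost, assuming (for illustration) per-block Bernoulli data costs on top of the mapping codelengths from the lemma; the paper's exact encoding of the blocks may differ in its details.

```python
import math
import numpy as np
from collections import Counter

def mapping_bits(groups) -> float:
    """Lemma above: an m-to-k mapping costs roughly sum_p mp * log2(m / mp) bits."""
    counts = Counter(groups)
    m = len(groups)
    return sum(mp * math.log2(m / mp) for mp in counts.values())

def block_bits(n_ones: int, n_total: int) -> float:
    """Bits to transmit a 0/1 block of known density: n_total * H(density)."""
    if n_total == 0 or n_ones in (0, n_total):
        return 0.0
    p = n_ones / n_total
    return -n_total * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bigrouping_cost(D: np.ndarray, row_groups, col_groups) -> float:
    """L(M) for the two mappings plus L(D|M) summed over the k*l blocks."""
    total = mapping_bits(row_groups) + mapping_bits(col_groups)
    for p in set(row_groups):
        rows = [i for i, g in enumerate(row_groups) if g == p]
        for q in set(col_groups):
            cols = [j for j, g in enumerate(col_groups) if g == q]
            block = D[np.ix_(rows, cols)]
            total += block_bits(int(block.sum()), block.size)
    return total

# Example: a 4x4 matrix split into 2 row groups and 2 column groups.
D = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
print(bigrouping_cost(D, [0, 0, 1, 1], [0, 0, 1, 1]))  # pure blocks: only mapping cost remains
```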
Methodology • Exploiting spatial locality: • Bi-grouping as presented is nonspatial! • To make it spatial, assign a non-uniform prior to possible groupings. • That is, adjacent cells are more likely to belong to the same group. • Row groups correspond to spatial groupings. • “Neighborhoods” • “Habitats” • Row groupings should demonstrate spatial coherence. • Column groups correspond to “families”. • “Mountain birds” • “Sea birds” • Intuition • Alternately group rows and columns iteratively until the total cost L(D) stops decreasing. • Finding the global optimum is very expensive. • So our approach will use a greedy search for local optima.
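To illustrate the spatial prior, here is a small Python sketch (an illustration, not the paper's exact encoding): it charges a label grid its quadtree codelength, built by recursively splitting until regions are uniform, and shows that a spatially coherent row-group map costs far fewer bits than the same labels scattered at random. It assumes a square grid with power-of-two side.

```python
import math
import numpy as np
from collections import Counter

def grid_quadtree_codelength(labels: np.ndarray) -> float:
    """Quadtree cost of a label grid (square, power-of-two side assumed):
    split a region into 4 quadrants until it is uniform, then charge one
    structure bit per node plus m * H(leaf values) bits for the m leaves."""
    leaves, nodes = [], 0

    def build(r0, r1, c0, c1):
        nonlocal nodes
        nodes += 1
        block = labels[r0:r1, c0:c1]
        if (block == block.flat[0]).all() or (r1 - r0) == 1:
            leaves.append(int(block.flat[0]))     # uniform (or 1x1) region -> leaf
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2   # split into four quadrants
        for rs, cs in [(r0, c0), (r0, cm), (rm, c0), (rm, cm)]:
            build(rs, rm if rs == r0 else r1, cs, cm if cs == c0 else c1)

    build(0, labels.shape[0], 0, labels.shape[1])
    counts = Counter(leaves)
    m = len(leaves)
    return nodes + sum(mp * math.log2(m / mp) for mp in counts.values())

# A coherent 8x8 map (left half group 0, right half group 1) vs. the same
# labels scattered at random: the coherent map compresses much better.
coherent = np.zeros((8, 8), dtype=int)
coherent[:, 4:] = 1
scattered = np.random.permutation(coherent.ravel()).reshape(8, 8)
print(grid_quadtree_codelength(coherent), grid_quadtree_codelength(scattered))
```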
Algorithms • INNER: • Groups the rows and columns, given the number of row and column groups (k and l). Start with an arbitrary bi-grouping of matrix D into k row groups and l column groups. do { For each row i from 1 to m, assign it to the row group p, 1 ≤ p ≤ k, that maximizes the “cost gain”, i.e., the decrease in the total codelength L(D). Repeat for the columns, producing the next bi-grouping. t += 2 } while (L(D) is decreasing) • A sketch of this alternating loop follows below.
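A compact sketch of the alternating loop mentioned above, not the paper's implementation: it recomputes the full cost for every candidate move instead of evaluating the per-row cost gain incrementally, so it is far slower than the complexity quoted on the next slides. cost_fn is any function returning the total codelength of a bi-grouping, e.g., the bigrouping_cost sketch above.

```python
import numpy as np

def inner(D: np.ndarray, row_groups, col_groups, cost_fn):
    """Alternate between rows and columns: greedily move each element to the
    group that lowers the total cost, and stop once a full pass no longer
    decreases the cost (the "while L(D) is decreasing" loop)."""
    rows, cols = list(row_groups), list(col_groups)
    k, l = max(rows) + 1, max(cols) + 1
    best = cost_fn(D, rows, cols)
    improved = True
    while improved:
        improved = False
        for groups, n_groups in ((rows, k), (cols, l)):   # rows first, then columns
            for i in range(len(groups)):
                keep = groups[i]
                for p in range(n_groups):                 # try every candidate group
                    groups[i] = p
                    cost = cost_fn(D, rows, cols)
                    if cost < best:                       # accept only improving moves
                        best, keep, improved = cost, p, True
                groups[i] = keep
    return rows, cols, best
```

Starting from an arbitrary bi-grouping, each pass can only lower the cost, so the loop terminates at a local optimum, matching the greedy search described on the methodology slide.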
Algorithms • OUTER: • Finds the number of row and column groups. Start with k0 = l0 = 1. Split the row group p* with the maximum per-row entropy, holding the columns fixed. Move each row in p* to a new group kT + 1 iff doing so would decrease the per-row entropy of p*, resulting in a new candidate grouping. Refine that grouping with INNER. If the total cost does not decrease, return the current grouping. Otherwise, increment T and repeat. Finally, perform this again for the columns. • A skeleton of this search follows below.
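A skeleton of the outer search, again only a sketch: split_fn is a hypothetical helper that would implement the entropy-based split described above (split the highest per-row-entropy group, moving rows out iff that lowers its per-row entropy), and inner_fn is the INNER sketch from the previous slide.

```python
import numpy as np

def outer(D: np.ndarray, cost_fn, inner_fn, split_fn):
    """Start from a single row group and a single column group; repeatedly try
    a split of the highest-entropy group, refine with INNER, and keep it only
    while the total cost decreases. Rows are grown first, then columns, as on
    the slide. cost_fn, inner_fn, and split_fn are assumed helpers (see lead-in)."""
    rows = [0] * D.shape[0]
    cols = [0] * D.shape[1]
    best = cost_fn(D, rows, cols)
    for axis in ("rows", "cols"):
        while True:
            cand_rows = split_fn(D, rows, cols, "rows") if axis == "rows" else rows
            cand_cols = split_fn(D, rows, cols, "cols") if axis == "cols" else cols
            new_rows, new_cols, cost = inner_fn(D, cand_rows, cand_cols, cost_fn)
            if cost >= best:          # the extra group did not pay off: stop growing
                break
            rows, cols, best = new_rows, new_cols, cost
    return rows, cols, best
```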
Complexity • INNER is linear with respect to the number of nonzero elements in D. • Let nnz denote that number. • Let k be the number of row groups and l be the number of column groups. • Row swaps are performed in the quadtree and take O(log m) time each, where m is the number of cells. • Let T be the number of iterations required to minimize the cost. • O(nnz * (k + l + log m) * T) • OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz. • Let n be the number of row splits. • O((k + l)² * nnz + (k + l) * n * log m)
Experiments • NoisyRegions • Three features (“species”) on a 32x32 grid. • So D has 32x32 = 1024 rows. • And 3 columns. • In 3% of the cells, chosen at random, the species is flipped to a randomly chosen wrong one. • The spatial and non-spatial groupings are shown to the right. • Recall: Bi-grouping is not spatial by default. • Spatial grouping reduces the total codelength. • The approach is not quite perfect due to the heuristic nature of the algorithm.
Experiments • Birds • 219 Finnish bird species over 3813 10x10km habitats. • Species are the features, habitats are cells. • So our matrix is 3813x219. • The spatial grouping is clearly more coherent. • Spatial grouping reveals Boreal zones: • South Boreal: Light Blue and Green. • Mid Boreal: Yellow. • North Boreal: Red. • Outliers are (correctly) grouped alone. • Species with specialized habitats. • Or those reintroduced into the wild.
Other approaches • Clustering • k-means • Variants using different estimates of central tendency: • k-medoids, k-harmonic means, spherical k-means, … • Variants determining k based on some criteria: • X-means, G-means, … • BIRCH • CURE • DENCLUE • LIMBO • Also information-theoretic. • These approaches are either lossy, parametric, or not easily adaptable to spatial data.
Room for improvement: • Complexity • O(n * log m) cost for reevaluating the quadtree codelength. • O(log m) worst-case time for each reevaluation/row swap, times n swaps. • However, the average-case complexity is probably much better. • If we know something about the data distribution, we might be able to reduce this. • Faster convergence • Fewer iterations, reducing the scaling factor T. • Rather than stopping only when there is no decrease in cost, perhaps stop when the improvement falls below a threshold? (This introduces a parameter.) • Accuracy • The search only finds local optima, leading to errors. • We could employ approaches used in simulated annealing or genetic algorithms to attempt to find the global optimum. • Random restarts in the search space, for example. • Stochastic gradient descent – similar in spirit to what we’re already doing.
Conclusion • Simultaneous and automatic detection of spatial correlation and feature cohabitation. • Spatial locality is easy to exploit. • Parameter-free. • Utilizes MDL: • Minimizes the sum of the model cost and the data cost given the model. • Efficient. • Almost linear in the number of nonzero entries in the matrix.
References • S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos, "Parameter-Free Spatial Data Mining Using MDL", in Proceedings of the IEEE International Conference on Data Mining (ICDM), Houston, TX, USA, November 27-30, 2005. • M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
Thanks! Any questions?