260 likes | 409 Views
Discovering Interesting Regions in Spatial Data Sets. Christoph F. Eick for Data Mining Class Motivation: Examples of Region Discovery Region Discovery Framework A Fitness For Hotspot Discovery Other Fitness Functions A Family of Clustering Algorithms for Region Discovery Summary .
E N D
Discovering Interesting Regions inSpatial Data Sets Christoph F. Eick for Data Mining Class Motivation: Examples of Region Discovery Region Discovery Framework A Fitness For Hotspot Discovery Other Fitness Functions A Family of Clustering Algorithms for Region Discovery Summary
Discovering Interesting Regions inSpatial Data Sets Christoph F. Eick for Data Mining Class Motivation: Examples of Region Discovery Region Discovery Framework A Fitness For Hotspot Discovery Other Fitness Functions A Family of Clustering Algorithms for Region Discovery Summary
1. Motivation: Examples of Region Discovery Application 1: Supervised Clustering [EVJW07] Application 2: Regional Association Rule Miningand Scoping [DEWY06, DEYWN07] Application 3: Find Interesting Regions with respect to a Continuous Variables [CRET08] Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08] Application 5: Find “representative” regions (Sampling) b=1.01 RD-Algorithm b=1.04 Wells in Texas: Green: safe well with respect to arsenic Red: unsafe well
2. Region Discovery Framework • We assume we have spatial or spatio-temporal datasets that have the following structure: (x,y,[z],[t];<non-spatial attributes>) e.g. (longitude, lattitude, class_variable) or (longitude, lattitude, continous_variable) • Clustering occurs in the (x,y,[z],[t])-space; regions are found in this space. • The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself. • For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same
Region Discovery Framework Continued The algorithms we currently investigate solve the following problem: Given: A dataset O with a schema R A distance function d defined on instances of R A fitness function q(X) that evaluates clustering X={c1,…,ck} as follows: q(X)= cXreward(c)=cXinterestingness(c)*size(c) with b>1 Objective: Find c1,…,ck O such that: • cicj= if ij • X={c1,…,ck} maximizes q(X) • All cluster ciX are contiguous (each pair of objects belonging to ci has to be delaunay-connected with respect to ci and to d) • c1,…,ck O • c1,…,ck are usually ranked based on the reward each cluster receives, and low reward clusters are frequently not reported
Challenges for Region Discovery • Recall and precision with respect to the discovered regions should be high • Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets” • Detection of regions at different levels of granularities (from very local to almost global patterns) • Detection of regions of arbitrary shapes • Necessity to cope with very large datasets • Regions should be properly ranked by relevance (reward); in many application only the top-k regions are of interest • Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.
3. Fitness Function for Supervised Clustering Class of Interest: Unsafe_Well Prior Probability: 20% γ1 = 0.5, γ2 = 1.5; R+ = 1, R-= 1; β = 1.1, =1. 10% 30%
4. Fitness Functions for Other Region Discovery Tasks 4.1 Creating Contour Maps for Water Temperature (Temp) Fig. 1: Sea Surface Temperature on July 7 2002 Var=2.2 Reward: 48,5 Rank: 3 Mean=11.2 A single region and its summary Examples in the data set WT have the form: (x,y,temp); var(c,temp) denotes the variance of variable temp in region c interestingness(c)= IF var(c,temp)>var(WT,temp) THEN 0 ELSE min(1, log20(var(WT,temp)/var(c,temp))) with being a parameter (with default 1) Basically, regions receive rewards if their variance is lower than the variance of the variable temperature for the whole data set, and regions whose variance is at least 20 times less receive the maximum reward of 1.
4.2 Finding Regions with High Water Temperature Differences Examples in the data set WT have the form: (x,y,Temp) Fitness function: Let c be a cluster to be evaluated interestingness(c)= IF var(c,temp)<var(WT,temp) THEN 0 ELSE min(1, log20(var(c,temp)/var(WT,temp))) ) with being a parameter (with default 1)
4.3 Programming Project Fitness Functions Purity r1 r2 (6, 2, 2) r3 (0, 0, 5) (2,2,1) We assume we have 3 classes; in r1 we have 6 objects of class1, 3 objects of class 2, and 2 objects of class1 We assume th=0.5 and =2 i(r1)= (0.6-0.5)**2=0.01 i(r2)=(1-0.5)**2=0.25 i(r3)=0 q(X)=q({r1,r2,r3})= 0.01*10b + 0.25*5b
Programming Project Fitness Function Variance r3 Var(r3)=1100 r1 var(r1)=80 r2 Var(r2)=200 O Var(O)=100 r4 Var(r4)=20 We assume =1 and th=1.5 i(r1)= 0 i(r2)=(2-1.5)=0.5 i(r3)=(11-1.5)=9.5 i(r4)=0
Interestingness Function Binary Co-location r3 r1 (1,1) (-1, 1) (1, 0.6) r2 (-1, -4) (-.0.5, -1) (-0.5,0) R4 (1,-1) (1, 1) (0.3, -0.1) We assume =1, th=0.1 and A={B1,B2} i(r1)= |1-1-0.6|/3 -0.1=0.1 i(r2)=|4+0.5+0|/3-0.1=1.4 i(r3)=… i(r4)=0 because |-1+1-0.03|/3=0.01<0.1 Meaning: z-value of B1 is -1, and z-value of B2 is -4 Binary Co-location: i(o,{B1,B2})=zB1(o)*zB2(o)
Programming Project Function MSE r1 r2 (2,2) (4,4) (-1,-1) (-7,-7) (-4,-4) MSE(r1)=(1**2+1**2+1**2+1**2+1**2)/2=2 MSE(r2)=(3**2+3**2+3**2+3**2+1**2+0+0)/3=12
4.4 Regional Co-location Mining R1 R2 Regional Co-location R3 R4 Task: Find Co-location patterns for the following data-set. Global Co-location: and are co-located in the whole dataset
A Reward Function for Binary Co-location Task: Find regions in which the density of 2 or more classes is elevated. In general, multipliers lC are computed for every region r, indicating how much the density of instances of class C is elevated in region r compared to C’s density in the whole space, and the interestness of a region with respect to two classes C1 and C2 is assessed proportional to the product lC1*lC2 Example: Binary Co-Location Reward Framework; lC(r)=p(C,r)/prior(C) C1,C2 = 1/((prior(C1)+prior(C2)) “maximum multiplier” kC1,C2(r) = IF lC1(r)<1 or lC2(r )<1 THEN 0 ELSE sqrt((lC1(r)–1)*(lC2(r)–1))/(C1,C2 –1) interestingness(r)= maxC1,C2;C1C2 (kC1,C2(c))
The Ultimate Vision of the Presented Research DomainExpert Spatial Databases Measure ofInterestingness Acquisition Tool Database Integration Tool Fitness Function Data Set Family of Clustering Algorithms Region DiscoveryDisplay Ranked Set of Interesting Regions and their Properties Visualization Tools Architecture Region Discovery Engine
How to Apply the Suggested Methodology • With the assistance of domain experts determine structure of dataset to be used. • Acquire measure of interestingness for the problem of hand (this was purity, variance, MSE, probability elevation of two or more classes in the examples discussed before) • Convert measure of interestingness into a reward-based fitness function. The designed fitness function should assign a reward of 0 to “boring” regions. It is also a good idea to normalize rewards by limiting the maximum reward to 1. • After the region discovery algorithm has been run, rank and visualize the top k regions with respect to rewards obtained (interestingness(c)*size(c)), and their properties which are usually task specific.
5. A Family of Clustering Algorithms for Region Discovery • Supervised Partitioning Around Medoids (SPAM). • Representative-based Clustering Using Randomized Hill Climbing (CLEVER) • Supervised Clustering using Evolutionary Computing (SCEC) • Single Representative Insertion/Deletion Hill Climbing with Restart (SRIDHCR) • Supervised Clustering using Multi-Resolution Grids (SCMRG) • Agglomerative Clustering (MOSAIC) • Supervised Clustering using Density Estimation Techniques (SCDE) • Clustering using Density Contouring (DCONTOUR) Remark: For a more details about SCEC, SPAM, SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCMRG
CLEVER Separate Slideshow
Steps of Grid-based Clustering Algorithms Basic Grid-based Algorithm • Define a set of grid-cells • Assign objects to the appropriate grid cell and compute the density of each cell. • Eliminate cells, whose density is below a certain threshold t. • Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function) Simple version of a grid-based algorithm: Merge cells greedily as long as merging improves q(X).
Advantages of Grid-based Clustering Algorithms • fast: • No distance computations • Clustering is performed on summaries and not individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects) • Easy to determine which clusters are neighboring • Shapes are limited to union of grid-cells
Ideas SCMRG (Divisive, Multi-Resolution Grids) Cell Processing Strategy 1. If a cell receives a reward that is larger than the sum of its rewards its ancestors: return that cell. 2. If a cell and its ancestor do not receive any reward: prune 3. Otherwise, process the children of the cell (drill down)
Parameters SCMRG Separate Transparency!
6. Summary • A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced. • The framework find interesting places and their associated patterns. • The framework extracts regional knowledge from spatial datasets • The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.
Why should people use Region Discovery Engines (RDE)? RDE: finds sub-regions with special characteristics in large spatial datasets and presents findings in an understandable form. This is important for: • Focused summarization • Find interesting subsets in spatial datasets for further studies • Identify regions with unexpected patterns; because they are unexpected they deviate from global patterns; therefore, their regional characteristics are frequently important for domain experts • Without powerful region discovery algorithms, finding regional patters tends to be haphazard, and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary • Exploratory data analysis for a mostly unknown dataset • Co-location statistics frequently blurred when arbitrary region definitions are used, hiding the true relationship of two co-occurring phenomena that become invisible by taking averages over regions in which a strong relationship is watered down, by including objects that do not contribute to the relationship (example: High crime-rates along the major rivers in Texas) • Data set reduction; focused sampling