DATA STRUCTURES USED IN SPATIAL DATA MINING


What is spatial data?
• It can broadly be defined as data that covers multidimensional points, lines, rectangles, polygons, cubes and other geometric objects.
• Spatial data occupies a certain amount of space, called its spatial extent, which is characterized by location and boundary.
• Uses:
  – Geographic Information Systems
  – CAD/CAM
  – Multimedia applications: content-based image retrieval, fingerprint matching, MRI (digitized medical images)

Features of spatial data
• Specific features of spatial data are rich data types, implicit spatial relationships among the variables, observations that are not independent, and spatial autocorrelation among the features.
• It has two distinct types of attributes: spatial attributes and non-spatial attributes. Spatial attributes define the spatial location and extent of spatial objects.

Types of spatial data
• Region data: It has a spatial extent with a location and a boundary. Region data is typically a geometric approximation of an actual data object.
• Point data: It consists of a collection of points in a multidimensional space. A point does not cover any area of space.

What is spatial data mining?
• It is defined as the non-trivial search for interesting and unexpected spatial patterns in spatial databases.
• Spatial data mining is needed to gain a new understanding of the geographic processes behind critical questions such as: How is the health of planet Earth? What are the effects of human activity on the environment and ecology?

Spatial data in GIS
• A geographic information system (GIS) is any system for capturing, storing, analyzing and managing data, and associated attributes, that are spatially referenced to Earth.
• There are two broad methods used to store data in a GIS: raster and vector. In a GIS, geographical features are often expressed as vectors, by treating those features as geometric shapes such as points, chains and polygons.

Spatial data structures used in GIS
To handle spatial data efficiently, as required in computer-aided design and geo-data applications, a database system needs an index mechanism that helps it retrieve data items quickly according to their spatial locations.
• Quad tree
• k-d tree
• R-tree
• R+-tree
• R*-tree

Quad trees
• A quad tree is used to partition and index 2D space.
• Each node of a quad tree is associated with a rectangular region of space. The top node is associated with the entire target space.
• Each internal node splits its region into four disjoint subspaces along the axes.
• Each of these subspaces is split recursively until there is at most one object inside each of them.

Division of space by quadtree
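The recursive splitting described for quad trees can be sketched in a few lines of Python. This is a minimal illustration and not a production index: the class name `QuadNode`, the half-open region convention, and the choice to split each region at its midpoint are assumptions made for the example, and every inserted point is assumed to lie inside the root region.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class QuadNode:
    # Region covered by this node: [x0, x1) x [y0, y1) (assumed convention).
    x0: float
    y0: float
    x1: float
    y1: float
    point: Optional[Point] = None                  # object held by a leaf
    children: List["QuadNode"] = field(default_factory=list)

    def insert(self, p: Point) -> None:
        if self.children:                          # internal node: descend
            self._child_for(p).insert(p)
        elif self.point is None:                   # empty leaf: store here
            self.point = p
        elif self.point == p:                      # duplicate: nothing to do
            return
        else:                                      # occupied leaf: split in four
            old, self.point = self.point, None
            mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
            self.children = [
                QuadNode(self.x0, self.y0, mx, my),    # index 0: low x, low y
                QuadNode(mx, self.y0, self.x1, my),    # index 1: high x, low y
                QuadNode(self.x0, my, mx, self.y1),    # index 2: low x, high y
                QuadNode(mx, my, self.x1, self.y1),    # index 3: high x, high y
            ]
            self._child_for(old).insert(old)
            self._child_for(p).insert(p)

    def _child_for(self, p: Point) -> "QuadNode":
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        return self.children[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)]

    def depth(self) -> int:
        # Number of levels from this node down to its deepest leaf.
        return 1 + max((c.depth() for c in self.children), default=0)
```

Inserting (1, 1), (6, 6) and (5, 7) into `QuadNode(0, 0, 8, 8)` splits the root once and then splits its upper-right quadrant again, because the last two points fall into the same quadrant after the first split.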

k-d trees
• At each node, a k-d tree partitions the space into two subspaces according to one coordinate of the node's splitting point.
• Let level(node) be the length of the path from the root to a node, and suppose the axes are numbered from 0 to k − 1. Every node at level level(node) splits the space according to coordinate number (level(node) mod k).
• The partitioning is therefore done along one dimension at the top level of the tree, along another dimension at the next level, and so on, cycling through the dimensions.

Division of space by a k-d tree
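The rule "split on coordinate (level mod k)" can be illustrated with a short Python sketch. The names `KDNode`, `build_kdtree` and `contains` are hypothetical, and choosing the median point as the splitting point is an assumption of this example; the slides do not say how splitting points are picked.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class KDNode:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[KDNode]" = None,
                 right: "Optional[KDNode]" = None) -> None:
        self.point = point      # splitting point stored at this node
        self.axis = axis        # coordinate used for the split: level mod k
        self.left = left        # points with a smaller coordinate on `axis`
        self.right = right      # points with an equal or larger coordinate


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                            # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                         # median split (assumption)
    # Move left past ties so the left subtree is strictly smaller on `axis`.
    while mid > 0 and pts[mid - 1][axis] == pts[mid][axis]:
        mid -= 1
    return KDNode(pts[mid], axis,
                  build_kdtree(pts[:mid], depth + 1),
                  build_kdtree(pts[mid + 1:], depth + 1))


def contains(node: Optional[KDNode], target: Point) -> bool:
    # Exact-match lookup: follow one branch per level.
    while node is not None:
        if node.point == target:
            return True
        if target[node.axis] < node.point[node.axis]:
            node = node.left
        else:
            node = node.right
    return False
```

For the 2D points (2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), the root splits on x at (7, 2) and its two children split on y, so a lookup inspects only one node per level.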

R-trees
• An R-tree is a balanced tree structure with the indexed objects stored in leaf nodes.
• The structure is completely dynamic, with no need for periodic restructuring.
• Let M be the maximum number of entries in one node and let m ≤ M/2. Then m specifies the minimum number of entries allowed in a node, except for the root.

R-trees (continued)
• Every non-leaf node has between m and M children, unless it is the root.
• The root node has at least two children, unless it is a leaf.
• For each index record (I, tuple-id) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object.
• For each (I, child-ptr) entry in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node.

Division of space by R-trees
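The role of the bounding rectangles in an R-tree can be sketched as follows. This is a search-only illustration under stated assumptions: rectangles are 2D tuples (xmin, ymin, xmax, ymax), the names `RNode`, `mbr`, `make_leaf`, `make_parent` and `search` are hypothetical, and the tree is assembled by hand, so insertion, node splitting and the m/M fan-out limits are not implemented.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]    # (xmin, ymin, xmax, ymax)


def mbr(rects: Sequence[Rect]) -> Rect:
    # Smallest rectangle that spatially contains all the given rectangles.
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class RNode:
    is_leaf: bool
    # Leaf entries are (I, tuple_id); non-leaf entries are (I, child_node).
    entries: List[Tuple[Rect, Any]]


def make_leaf(objects: List[Tuple[Rect, str]]) -> Tuple[Rect, RNode]:
    return mbr([r for r, _ in objects]), RNode(True, list(objects))


def make_parent(children: List[Tuple[Rect, RNode]]) -> Tuple[Rect, RNode]:
    return mbr([r for r, _ in children]), RNode(False, list(children))


def search(node: RNode, query: Rect) -> List[str]:
    # Descend only into entries whose rectangle intersects the query window.
    hits: List[str] = []
    for rect, item in node.entries:
        if intersects(rect, query):
            if node.is_leaf:
                hits.append(item)
            else:
                hits.extend(search(item, query))
    return hits
```

Because rectangles of sibling entries may overlap in an R-tree, a query can descend into more than one child; subtrees whose rectangle misses the query window are skipped entirely.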

R+-tree
• It is an extension of the R-tree in which the bounding rectangles of nodes at the same level do not overlap.
• This feature decreases the number of branches that must be searched, which reduces search time at the cost of increased space consumption.
• Data objects are allowed to be split, so that different parts of one object can be stored in several nodes of the same tree level.

R+-tree (continued)
• The root has at least two children, unless it is a leaf.
• All leaves are at the same level.
• There is no constraint on the minimum number of entries in each node.

Division of space by R+-tree
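The idea of splitting an object across non-overlapping nodes can be illustrated with a small Python sketch. It assumes the node regions are already given as disjoint rectangles and that objects have positive area; `clip` and `distribute` are hypothetical helper names. Storing the clipped piece in each leaf is a simplification chosen for clarity, and an actual implementation may instead store the full object rectangle or identifier in every leaf it touches.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]    # (xmin, ymin, xmax, ymax)


def clip(rect: Rect, region: Rect) -> Optional[Rect]:
    # Part of `rect` that lies inside `region`, or None if there is none.
    x0, y0 = max(rect[0], region[0]), max(rect[1], region[1])
    x1, y1 = min(rect[2], region[2]), min(rect[3], region[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)


def distribute(objects: List[Tuple[Rect, str]],
               regions: List[Rect]) -> List[List[Tuple[Rect, str]]]:
    # One leaf per region; an object crossing a boundary appears in each
    # leaf it touches, which is where the extra space consumption comes from.
    leaves: List[List[Tuple[Rect, str]]] = [[] for _ in regions]
    for rect, object_id in objects:
        for i, region in enumerate(regions):
            piece = clip(rect, region)
            if piece is not None:
                leaves[i].append((piece, object_id))
    return leaves
```

With two regions split at x = 5, an object spanning x from 4 to 7 is stored once in each leaf, while a point query at any location needs to look into only one region.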

R*-tree
• The R*-tree is a modification of the R-tree. The R-tree tries to minimize the area of all nodes of the tree, whereas the R*-tree combines more criteria:
  – the area covered by a bounding rectangle;
  – the margin of a rectangle: minimizing the margin of a bounding rectangle favors squares;
  – the overlap between rectangles: minimizing the overlap between rectangles decreases the number of paths that must be searched.
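The three R*-tree criteria are simple geometric quantities. The sketch below shows how they could be computed for 2D rectangles given as (xmin, ymin, xmax, ymax); the function names are assumptions for this example, margin is taken here to be the perimeter, and how an actual R*-tree weighs the criteria during insertion and splitting is not shown.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]    # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r: Rect) -> float:
    # Sum of the edge lengths (perimeter) of the rectangle.
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlap(a: Rect, b: Rect) -> float:
    # Area shared by the two rectangles, 0 if they do not overlap.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0
```

For example, the rectangles (0, 0, 4, 4) and (0, 0, 8, 2) have the same area, but the square has the smaller margin, which is why minimizing margin favors square-like bounding rectangles.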

Conclusion
• New techniques are needed for spatial data mining (SDM) because of spatial autocorrelation and the continuity of space.
• The indexing structures discussed above are useful for spatial data represented in a vector space. For metric spaces, the M-tree, vp-tree and mvp-tree are used.
• The main aim of all these indexing structures is to minimize disk accesses.


THANK YOU

QUERIES ???
