Spatial Access Methods & Query Processing. Matei Lunca GIA 2004 Richardson Van Oosterom - Advances In Spatial Data Handling. Contents. Extend RDBMS for GIS/GIA Trees Query types The curse of dimensionality Approximate matches. Geographic Information Retrieval. Spatial Access Methods
Spatial Access Methods & Query Processing Matei Lunca GIA 2004 Richardson Van Oosterom - Advances In Spatial Data Handling
Contents • Extend RDBMS for GIS/GIA • Trees • Query types • The curse of dimensionality • Approximate matches
Geographic Information Retrieval • Spatial Access Methods • Algorithms for storing and finding spatial data; the data is 3+-D with strong relationships and therefore cannot be stored in ordinary structures such as B-Trees • Query Processing • Data structures and DB search operations in this context • GIS questions such as "buffer around a river"
Extending RDBMS for GIS/GIA • In GIS, objects are organized by location and extent in space • Because spatial objects can be arbitrarily complex, access methods for 2D objects, such as minimum bounding rectangles, are needed • Curse of dimensionality!
Requirements of spatial access methods • Dynamic: random access and queries must be supported • Space efficient: complex spatial data often cannot be partitioned because of relations between objects, so data blocks may be large and not fit in memory • Efficiency independent of operators/distribution: needed when multiple DBs storing different types of data are joined • Compatible with concurrency
Practical requirements • Costs of computing and communicating data • Minimize external access costs (I/O) • Indexing = Trees • Pointers at leaves/nodes • Searching = going down tree • Fast for range queries • Hashing = address buckets • No ordering needed
Challenges in Indexing • Most DBs support • B+-Trees • Hash tables • Few DBs support • R-Trees • Region quadtree • Why is implementation so difficult? • Integration with query optimizer • Providing query operators that utilize the index • Cost model (efficiency known before implementation) • Concurrency control and recovery techniques
Space Driven VS Data Driven • Space Driven Trees • Decomposition independent from data insertion order • Region quadtree • Data Driven Trees • Space decomposed based on input data • Point quadtree • K-D Tree
Space/Data Driven Structures • Space driven structures – Grids • Twin grid file • Shuffles points between the primary and secondary file to minimize the total size • Multilayer grid file • Uses two or more grid files, storing objects in the first grid file where no splitting across hyperplanes is needed • Data driven structures - R-Tree
Trees • X-Tree • TR*-Tree • IQ, PX & MDX-Trees • PX-Tree • TV-Tree • VAM-Split Trees
Trees: X-Tree • Adapts R*-Trees to high-dimensional data • Overlap-free split based on split history • R/R*-Trees lead to high overlap • This diminishes the advantages of hierarchical partitions • When the split algorithm would lead to an unbalanced directory, the X-Tree omits the split and the node becomes a supernode • Supernodes are nodes enlarged to a multiple of the block size; they are scanned linearly, which avoids splits that would result in an inefficient structure
Trees: X-Tree (2) • Dynamically use overlap-minimizing splits • Supernodes accessed sequentially if no good split decision found for a directory node
Trees: TR*-Tree • Improved R*-Tree • Represent exact geometry spatial attributes • Reduce memory operations • Store components of 1 decomposed object • Internal node • Pointer child node • Minimum bounding rectangle of trapezoids in child • Leaf node • Trapezoids
Trees: TR*-Tree (2) • Representation of Bavaria
Trees: IQ-, PX- & MDX-Trees • IQ-Tree • Index structure for query processing in high-dimensional data spaces • Compresses data to improve query processing • PX-Tree & Multi-Disc X-Tree • Parallel access method • Short response time & high query throughput
Trees: TV-Tree • R-Tree-like + varying length feature vector • Telescope vector • Divide attributes into • Those common to all subtree items • Those used for branching • Those ignored • Knowledge about the behaviour of single attributes (their selectivity) is necessary
Trees: VAM-Split Trees • VAM-Split R-Tree • VAM-Split KD-Tree • Static index structures • All objects must be available when index is created • Splits are performed at maximum variance value • Built in memory before permanently stored on disk • Size limited to the amount of (virtual) memory available
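A minimal sketch of the "split at maximum variance" idea for a static, in-memory point set. It assumes the split position is the median along the chosen dimension, which the slide does not specify; `max_variance_dim` and `build` are hypothetical names.

```python
from statistics import pvariance


def max_variance_dim(points):
    """Return the index of the coordinate with the largest variance."""
    dims = len(points[0])
    return max(range(dims), key=lambda d: pvariance([p[d] for p in points]))


def build(points, leaf_size=2):
    """Recursively split a complete (static) point set into nested dicts."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    d = max_variance_dim(points)
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2  # assumption: split at the median position
    return {
        "dim": d,
        "split": ordered[mid][d],
        "left": build(ordered[:mid], leaf_size),
        "right": build(ordered[mid:], leaf_size),
    }


# All objects must be known up front: the tree is built once, in memory.
points = [(0.0, 1.0), (10.0, 1.2), (20.0, 0.9), (30.0, 1.1)]
tree = build(points)
```

Here the x-coordinate spreads far more than the y-coordinate, so the root splits on dimension 0.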
Other Trees • The Cell Tree • Levels of data split by arbitrary hyperplanes • Concave objects decomposed into convex pieces, which are indexed in every cell that they overlap • The K-D Tree • Levels of data are split along different dimensions into non-overlapping cells • Objects indexed in all cells they intersect
Other Trees (2) • Generalized BD Tree • Stores objects as hierarchy of minimum bounding boxes • The P-Tree • Hyperplanes split space hierarchically by polytopes = multidimensional boxes with nonrectangular sides • R-Tree special case in which all polytopes are boxes • R-files • Divide space into hierarchy of nested boxes in which objects are indexed in lowest cell which contains them
Cost Models • Curse of dimensionality performance deteriorations • Cost model for query processing in high-dimensional data spaces for careful optimization of parameters of an index • Data space quantization • Data compression - VA File, IQ Tree • Reduce I/O by representing attributes in less bits • Page size • Dimension assignment
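A small sketch of data space quantization in the spirit of the VA File: each attribute is replaced by the number of the grid cell it falls in, so it needs only a few bits. The function names, the unit data space [0, 1] and the equal-width cells are assumptions made for illustration.

```python
def quantize(value, bits, lo=0.0, hi=1.0):
    """Map a value in [lo, hi] to a cell number that fits in `bits` bits."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(idx, cells - 1)  # the upper bound hi belongs to the last cell


def approximate(point, bits):
    """Replace every attribute of a point by its cell number."""
    return tuple(quantize(v, bits) for v in point)


def cell_bounds(code, bits, lo=0.0, hi=1.0):
    """Return the (lower, upper) interval each cell number stands for."""
    width = (hi - lo) / (1 << bits)
    return tuple((lo + c * width, lo + (c + 1) * width) for c in code)


code = approximate((0.1, 0.5, 1.0), bits=2)  # 2 bits per attribute
bounds = cell_bounds(code, bits=2)
```

A query can first be evaluated against the compact codes (using the cell bounds) and only fetch exact vectors for the remaining candidates, which is where the I/O saving comes from.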
High-dimensional data spaces & massive data sets • Exotic data, cardinality/dimensionality++ • Terabyte, petabyte • Common problem: overfit the data • Common challenge: fit model/pattern robustly • Compression, statistics, stochastic analysis, discrete mathematics, harmonic analysis • Complexity & noisiness lead to constructing statistical/fuzzy models
The Pyramid-Technique • Maps data from D-dimensional space to 1D so B+-Trees can be used to manage data • Data space is divided into 2·D pyramids (two per dimension) that share the center of the data space as their top • Pyramids partitioned into data pages of B+-Tree • No inverse transformation needed because both the data and the D-dimensional key are stored
The Pyramid-Technique (2) • Complex queries • Pyramid value calculated from query input • Querying the tree with this value • Result = D-dimensional points sharing pyramid value that must be scanned for the search item • Efficient query processing only in < 8 D
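A hedged sketch of the mapping and of a point query. It assumes points in the unit data space [0, 1]^D and uses a sorted Python list with `bisect` as a stand-in for the B+-Tree; `pyramid_value`, `build_index` and `pyramid_point_query` are hypothetical names.

```python
import bisect


def pyramid_value(point):
    """Map a D-dimensional point in [0, 1]^D to a single number.

    The integer part is the pyramid number (0 .. 2*D - 1), the fractional
    part is the height of the point inside that pyramid.
    """
    d = len(point)
    # The dimension in which the point deviates most from the center.
    j_max = max(range(d), key=lambda j: abs(0.5 - point[j]))
    pyramid = j_max if point[j_max] < 0.5 else j_max + d
    height = abs(0.5 - point[j_max])
    return pyramid + height


def build_index(points):
    """Sorted (pyramid value, point) pairs; stands in for the B+-Tree."""
    return sorted((pyramid_value(p), p) for p in points)


def pyramid_point_query(index, query):
    """Look up the pyramid value, then scan the points that share it."""
    pv = pyramid_value(query)
    hits = []
    for key, p in index[bisect.bisect_left(index, (pv,)):]:
        if key != pv:
            break
        if p == query:  # the stored D-dimensional key decides
            hits.append(p)
    return hits


index = build_index([(0.25, 0.625), (0.5, 0.875), (0.75, 0.5)])
```

Because the full point is stored next to its pyramid value, no inverse transformation is needed to answer the query.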
Query processing • Direct VS indirect spatial search • Direct = locating objects in a geographical area • Indirect = queries based on non-spatial attributes • Show the geography that meets non-spatial requirements
Query processing steps • Query input • Filter step • Spatial index • Candidate set • Refinement step • Load spatial extent • Test spatial extent • Hits/false drops • Query result output
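The filter and refinement steps can be sketched for a point query over polygons. This is an illustration only: a linear scan over minimum bounding rectangles stands in for the spatial index, the exact test is a standard ray-casting point-in-polygon check, and all names are hypothetical.

```python
def mbr(polygon):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains(box, pt):
    xmin, ymin, xmax, ymax = box
    return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax


def point_in_polygon(pt, polygon):
    """Exact test: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for k in range(n):
        (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_refine_point_query(objects, pt):
    # Filter step: cheap test on the approximations gives the candidate set.
    candidates = [n for n, poly in objects.items() if mbr_contains(mbr(poly), pt)]
    # Refinement step: load and test the exact spatial extent.
    hits = [n for n in candidates if point_in_polygon(pt, objects[n])]
    false_drops = [n for n in candidates if n not in hits]
    return hits, false_drops


objects = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(1, 1), (3, 1), (3, 3), (1, 3)],
}
```

For the query point (3.5, 3.5) the triangle passes the filter step because its bounding rectangle contains the point, but the refinement step rejects it, so it is a false drop.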
Query types • Point query/point-in-polygon query • Parameter: coordinates • What objects exist at these coordinates? • Window/range query • Parameter: region defined by coordinates • What objects are located in this region? • Distance and Buffer Zone queries • Parameters: buffer object and distance • What objects are there within given distance from buffer?
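The three query types, sketched over a toy set of point objects. Simplifying assumptions: every object is a single point, the buffer object is also a point (a river would need a distance-to-line test), and the data is scanned linearly instead of going through an index. The sample data and function names are invented.

```python
from math import dist

cities = {"A": (1.0, 1.0), "B": (4.0, 5.0), "C": (9.0, 9.0)}


def point_query(objs, xy):
    """What objects exist at these coordinates?"""
    return sorted(n for n, p in objs.items() if p == xy)


def window_query(objs, xmin, ymin, xmax, ymax):
    """What objects are located in this rectangular region?"""
    return sorted(
        n for n, (x, y) in objs.items()
        if xmin <= x <= xmax and ymin <= y <= ymax
    )


def distance_query(objs, buffer_point, distance):
    """What objects lie within the given distance of the buffer object?"""
    return sorted(n for n, p in objs.items() if dist(p, buffer_point) <= distance)
```

For example, `window_query(cities, 0, 0, 5, 5)` selects A and B, and a buffer of 5.5 around A also reaches B but not C.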
Query types (2) • Path queries (network structure required) • Parameters: network locations • What is the shortest route from A to B? • Join and Range queries • Spatial objects and relationships • Spatial predicates: points, windows, buffers, paths • Overlaying roads and waterworks GIS layers and displaying the result according to relative height (river, bridge, aqueduct) is a spatial join
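A path query needs a network structure. As a sketch, the network below is a dictionary of weighted edges and the shortest route is found with Dijkstra's algorithm; the slides do not name a particular algorithm, so this choice, the graph and the function name are assumptions.

```python
import heapq


def shortest_route(graph, start, goal):
    """Dijkstra's algorithm; returns (cost, path) or None if unreachable."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None


# Hypothetical road network: node -> {neighbour: travel cost}
network = {
    "A": {"X": 2, "Y": 5},
    "X": {"B": 4, "Y": 1},
    "Y": {"B": 1},
    "B": {},
}
route = shortest_route(network, "A", "B")
```

The cheapest route from A to B here goes via X and Y with a total cost of 4, although both direct-looking alternatives cost 6.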
Query types (3) • Feature approach – feature vectors • Neighborhood search • Spatial-Query-by-Sketch • Multimedia (2D) search instead of alphanumeric
Similarity search • Approximate surface by parametric functions • Assigning appropriate class to query object • Section Coding – each polygon’s circumcircle is decomposed into sectors & normalized • Similarity = distance feature vectors
Similarity search (2) • Shape Histograms (feature vectors!) • Bins = complete & disjoint cells of space • Shell Model • Concentric uniform shells around the center • Independent of rotation around the center • Sector Model • Distribute uniformly on surface (Voronoi)
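A sketch of the shell model in 2D: points are binned by their distance to the center, so the resulting histogram does not change when the shape is rotated around that center. Equal-width shells, the sample shape and the function names are assumptions for illustration.

```python
from math import cos, dist, radians, sin


def shell_histogram(points, center, max_radius, shells):
    """Count points per concentric shell of equal width around the center."""
    hist = [0] * shells
    width = max_radius / shells
    for p in points:
        k = min(int(dist(p, center) / width), shells - 1)
        hist[k] += 1
    return hist


def rotate(points, degrees):
    """Rotate 2D points around the origin."""
    a = radians(degrees)
    return [(x * cos(a) - y * sin(a), x * sin(a) + y * cos(a)) for x, y in points]


shape = [(0.5, 0.0), (0.0, 2.5), (-2.5, 0.0), (3.5, 0.0)]
h_original = shell_histogram(shape, (0.0, 0.0), max_radius=4.0, shells=4)
h_rotated = shell_histogram(rotate(shape, 37), (0.0, 0.0), max_radius=4.0, shells=4)

# Similarity = distance between the two feature vectors (0 means identical).
dissimilarity = dist(h_original, h_rotated)
```

The histogram is the feature vector; comparing two shapes reduces to a distance computation between two short vectors.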
Special Query Types • Spatial continuous queries • In dynamic environments continuous polling is necessary, because otherwise query results become meaningless • Result, expiry time given current motion vector, and change that can cause expiration • Spatio-temporal queries • Spatiotemporal Database Systems (STDBS) track and present data about moving objects, such as GPS • Probabilistic models are also available that attempt to plot future values in order to give faster response
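A small sketch of how an expiry time could be derived from the current motion vector. It assumes a rectangular query window, an object that is currently inside it and linear motion at constant velocity; the slides do not prescribe this model, and `expiry_time` is a hypothetical name.

```python
def expiry_time(pos, vel, window):
    """Time until a moving point leaves the window (inf if it never does).

    Assumes pos is inside window = (xmin, ymin, xmax, ymax) and that the
    object keeps its current velocity.
    """
    xmin, ymin, xmax, ymax = window
    t = float("inf")
    for p, v, lo, hi in ((pos[0], vel[0], xmin, xmax), (pos[1], vel[1], ymin, ymax)):
        if v > 0:
            t = min(t, (hi - p) / v)  # reaches the upper boundary
        elif v < 0:
            t = min(t, (lo - p) / v)  # reaches the lower boundary
    return t
```

A result such as "object inside the window, valid for the next 8 time units" only has to be recomputed when the expiry time passes or the motion vector changes.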
Query pre-processing • Pre-optimize index structure • With specific knowledge: if we use a TIN for river network studies, valleys are more important and could be stored at high nodes in the tree • Avoid characteristic areas: don't store the exact geometry of a chasm, but mark it as a no-go area
Query processing strategies • Parallel searches (nice split) • In varying data structures • Shape-based strategy • Models the direction region • Converts processing of direction predicates into processing of topological operations between open shapes and closed geometry objects • Eliminates computation related to world boundary
Key points • Defining/representing spatial context • Space Driven VS Data Driven • Each application has its own technique • Tree • Hashing • 3D histogram • Approximate/Fuzzy approach