1.09k likes | 1.27k Views
Multidimensional Access Methods. Volker Gaede IC-Parc, Imperial College, London Oliver Gunther Humboldt-Universitat, Berlin ACM Computing Surveys, June 1998 Mohammad Reza Kolahdouzan, kolahdoz@usc.edu. Outline. Organization of Spatial Data What is Special About Spatial?
E N D
Multidimensional Access Methods Volker Gaede IC-Parc, Imperial College, London Oliver Gunther Humboldt-Universitat, Berlin ACM Computing Surveys, June 1998 Mohammad Reza Kolahdouzan, kolahdoz@usc.edu
Outline • Organization of Spatial Data • What is Special About Spatial? • Definitions and Queries • Basic Data Structures • One-Dimensional Access Method • Main Memory Structures • Point Access Methods • Multidimensional Hashing • Hierarchical Access Methods • Space Filling Curves for Point Data • Spatial Access Methods • Transformation • Overlapping Regions • Clipping • Multiple Layers • Comparative Studies, Conclusion
What is special about Spatial Data? • Basic properties: • Complex structure • object could be point or thousands of polygons • variable tuple size • Dynamic • different operations (insert, delete, update) are interleaved • Large • e.g.: gigabytes for geographic maps • Integration of secondary and tertiary memory • Several proposals but no standard algebra • no standard set of operators • operators depends of domain (application specific) • Not closed operators • result of an operator can return any kind of object • Expensive computational costs
What is special about Spatial Data? ... • Special physical layer support is required for “Search” operators (for search, update, … ) • Multidimensional access methods required. • Hard to do: no total ordering for spatial objects that preserves spatial proximity (i.e., no mapping from 2,3,... to dimensions to 1) • One solution: • single key structure, per dimension • problem: each index is traversed independently of others
What is special about Spatial Data? ... • Requirements for multidimensional access methods: • Dynamic: keep track of changes (inserts, deletes, updates, ... ) • Secondary/tertiary storage management: not possible to have everything in main memory • Broad range of supported operations: not sacrifice one for others • Independence of the input data(distribution)and insertion sequence • Simplicity • Scalability • Time efficiency of search: goal is to meet one-dimensional B-tree: logarithmic worst case • Space efficiency • Concurrency and recovery: multiple concurrent accesses • minimum impact for integration
Definitions • Multidimensional access method: support search in spatial DB • Point access methods (PAMs): designed to perform spatial search in point DBs • point could have 2,3,… dimensions, but no extension. • Spatial access methods (SAMs): can manage extended objects (lines, polygons, ...) • Universe, Original Space: a d-dimensional Euclidean space, E(d) • Any point has a unique location, defined by its d coordinates • d-dimensional polytope, P: intersection of some finite number of closed half-spaces in E(d), such that the dimension of the smallest affine subspace containing P is d. • D-dimensional polyhedron: • union of some polytopes. • Divides the points in space to: its interior, boundary, and exterior (mutually disjoint)
Definitions ... • Examples for polyhedron: • line (polyline) = 1-dimensional polyhedron • region (polygon) = 2-dimensional polyhedron • A set of k-dimensional polyhedra forms a data type, (e.g., {point, line, region, … } • Spatial attribute of a spatial data describes its extension o.G • synonyms: • geometry, shape, spatial extension • shape descriptor, shape description, shape information, geometric description
Definitions ... • Simple solutions for indexing: • Abstracting complex object with simpler shape, like bounding box or sphere, and providing a link to actual data • produces candidate solutions • based on application, refinement may be necessary • Decomposition, represent a shape with union of simpler shapes (e.g., convex polygons with bounded number of vertices).
Definitions ... • What is efficiency? • Space: minimize number of occupied bytes • Time: not clear! • We care about elapsed time, but it depends on implementation, hardware, … • Solution: number of disk accesses for a search. • Based on assumption that search is I/O bound (B-tree) • May not be true for spatial objects • Should still minimize number of disk accesses
Queries • No standard query language • Application domain dependent set of operators • Some more common operators (intersection, … ) • Queries are usually in term of extension to SQL
Queries ... • Common Queries • Query 1: Exact Match Query EMQ, Object Query • returns objects that have similar extension
Queries ... • Common Queries ... • Query 2: Point Query PQ • returns an object that contains a point
Queries ... • Common Queries ... • Query 3: Window Query WQ, Range query • returns objects that have at least one point in an interval • intervals are iso-oriented (parallel to coordinate axes)
Queries ... • Common Queries ... • Query 4: Intersection Query IQ, Region query, Overlap Query • returns objects that have at least one common point with another object
Queries ... • Common Queries ... • Query 5: Enclosure Query EQ • returns objects that enclose another object
Queries ... • Common Queries ... • Query 6: Containment Query CQ • returns objects that are contained in another object • contain and enclosure are dual
Queries ... • Common Queries ... • Query 7: Adjacency Query AQ • returns adjacent objects to another one
Queries ... • Common Queries ... • Query 8: Nearest Neighbor Query NNQ • returns an object that has the minimum distance to another one • distance between two objects: distance between their closest points
Queries ... • Common Queries ... • Query 9: Spatial join • two collection R and S of spatial data • spatial predicate θ (e.g., intersects(), contains(), adjacent(), … ) • find pair of objects
One Dimensional Access Methods • Important foundation of mutlidimensional methods • Access methods: • linear hashing • extendible hashing • Hierarchical (e.g., B-tree) • Hierarchical: • scalable • behave well for skewed data • nearly independent data distribution • Hashing: • performance depends on distribution, and hashing function
One dimensional Access Methods ... • Linear hashing: • divide universe [A,B) of possible hash values into binary intervals • size of intervals: (B-A)/2^k and (B-A)/2^(k+1) • if a bucket is full, split an interval (not necessarily the same bucket) to two equal subintervals • no guarantee for relieving overflowed bucket • put overload of a bucket in overflow page, link it to original • Extendible hashing: • binary intervals (cells), like linear • use index in a directory for each cell, no overflow pages • if a bucket at maximal depth exceeds capacity, splits all cells into two • for skewed data, this may lead to different directory entries for same data region • pro: exact match takes at most two page accesses • con: poor space utilization, non-incremental growth (doubling) • solution: bounded-index extendible hashing
One dimensional Access Methods ... • B-tree: • organize data hierarchically • balanced • relate to nesting of intervals • interior nodes: point to immediate mutually disjoint descendants • leaf nodes: data items • comparison: • B-tree: adaptive (size of interval depends on data) • hash methods outperform B-tree for uniformly distributed data for: • exact match queries, insertion, deletion
Main Memory Structures • Early methods • only work in main memory, not optimized for secondary memory • are adapted in real methods • our running example: • points, polygons (their centroids, or MBBs)
Main Memory Structures ... • k-d-tree • binary search tree • can only handle points (polygons with their centroids) • recursively divide the universe using (d-1)dimensional hyperplanes • hyperplanes: • iso-oriented • contain at least on data point • Interior nodes has one or two descendants, used to guide search • insert and search are straight forward • delete may require reorganizing a sub-tree • cons: • structure sensitive to ordering of insertion • data is all over the tree
Main Memory Structures ... • k-d-tree
Main Memory Structures ... • Adaptive k-d-tree • solves the problems of k-d-tree • tries to splits with about the same number of elements in each side • hyperplanes need not contain a data point • split till each subspace has some specific number of elements • result: • splitting points not part of the data • all data in leaves • interior nodes contain dimension and coordinates of split • cons: • difficult to keep the tree balanced (when frequent inserts and deletes) • tree still depends on order of insertion • pro: • very well for static environments • con for all k-d-trees: • for some certain distributions, no hyper-plane can split the data points evenly
Main Memory Structures ... • Adaptive k-d-tree
Main Memory Structures ... • BSP-tree (binary space partitioning) • like k-d-tree splits the world • splits independent of other subspaces • hyper-planes have arbitrary orientations (not iso-oriented) • decomposes until number of objects is less than a threshold • interior node: hyper-planes • leaf nodes: subspaces, links to contained objects • search for a point: straight forward • range search: generalization of point search • cons: • generally not balanced, may have very deep subtress • higher space requirements (info of the hyperplanes) • pros: • adapt well for different distributions
Main Memory Structures ... • BSP-tree (binary space partitioning)
Main Memory Structures ... • BD-tree • binary tree • represents subdivisions of data into interval shaped regions • encode regions in bit strings (DZ expressions, z values, ST_Morton_Number) • associate bit strings to nodes • finding z values: • left and down of hyperplanes : 0 • right and up of hyperplanes: 1 • splitting: • make interval-shaped excision: one node is excision (interval-shaped) part, the other one is remainder of the subspace • result: each data bucket contain at least 1/3 of the original entries • searching: • first finds full prefix • then traverses the tree based on prefix
Main Memory Structures ... • BD-tree
Main Memory Structures ... • The quadtree • closely related to k-d-tree • split with iso-oriented hyperplanes • difference: not binary tree any more • each node has 2^d descendants (in d dimensional space) • each node is interval shaped partitions • partitions don’t have to be equal size • decompose till a threshold • many variants • search is similar to k-d-tree • for point query: traverses one subtree • for region query: goes to often more than one subtree • cons: • not necessarily balanced • deeper subtrees for dense regions
Main Memory Structures ... • point quadtree • constructed by inserting points one by one • how to insert: • search for a point • if not found, add it to the leaf node which search is stopped at • divide corresponding partition to 2^d subspaces, new point in center • deletion: requires reconstructing subtree
Main Memory Structures ... • region quadtree • regular decomposition of the universe • result: 2^d subspaces are equal • very good for search operations
Point Access Methods • Previous access methods designed for main memory • Can be used for secondary memory, but performance is very below optimum (no control over how OS accesses disks) • PAMs: organize points in to buckets, buckets in to subspaces • Different approaches for PAMs: • hashing (extended, linear) • hierarchical (tree based) • space-filling curves • lots of hybrid methods are also available
Point Access Methods ... • Another taxonomy for PAMs: based on properties of buckets: • disjoint of overlapped • cover complete universe, or partial cover • interval (box), or arbitrary shapes
Point Access Methods ... • Multidimensional Extended Hashing • Heuristics: hashing functions that preserves ordering to some extend • idea: objects close in space should be stored close in disk • Grid File: • superimposes d-dimensional orthogonal grid on the universe. • Grids not regular, different shapes and sizes for cells • grid directory links cells to data buckets • grid directory stored in disk, grid is in main memory • result: no more than 2 disk accesses for exact match queries • insert data: if bucket is full, requires cell splitting
Point Access Methods ... • Grid File • con: super-linear growth even for uniformly distributed data (lower left part, 4 cells are pointed to one bucket) • exact match point query: • use scales to locate cell containing search point • if not in memory, one disk access to get it • retrieve bucket that have the objects in cell, and compare to search point (one more access) • range query: • test all cells that overlap the search region
Point Access Methods ... • EXCELL • like grid file • splits the universe regularly • if insert require cell splitting, splits all cells (double the directory size)
Point Access Methods ... • Two level Grid file • idea: use a second grid file to manage grid directory • first level is root directory (coarsened level of the second level) • second level, actual grid directory • first level points to second, second to data • pro: • splits are confined to subdirectory regions, not effecting their surroundings: slower directory growth. • Second level in main memory: still 2 accesses for exact match
Point Access Methods ... • Two level Grid file
Point Access Methods ... • Twin Grid File • idea: second grid file: increase space utilization • objective: reduce total number of buckets (or cells) in two files • relation between files not hierarchical (like two-level) • both files spans the whole universe • data is distributed between two files dynamically • pro: • space utilization: 90% (original grid 69%) • superior to original grid for larger search spaces • con: inferior to original grid for smaller query ranges • inserting data: • if bucket overflows, first try to redistribute the data between files • could lead to bucket overflow or underflow in one • make the shift permanent if the number of buckets is minimized • otherwise split the bucket • search requires checking both files
Point Access Methods ... • Twin Grid File
Point Access Methods ... • Multidimensional Linear Hashing • no directory (unlike extended), or very small directory • result: • relatively little storage requirements • can keep the related information in main memory • early techniques: • fail to support range queries • multidimensional order-preserving linear hashing with partial expansions (MOLHPE)! • idea : partially extending the buckets without expanding the file size at the same time • guarantees modest file growth • very good for uniformly distributed data, fails for nonuniform distributions
Point Access Methods ... • Hierarchical Access Methods • based on binary of multiway tree structure • like hashing, stores data in bucket • each bucket is leaf of a node, and a disk page • interior nodes of the tree guide search • search: top-down tree traversal • difference between different methods: characteristics of the regions
Point Access Methods ... • Hierarchical Access Methods: • k-d-B-tree • combination of adaptive k-d-tree and B-tree • partition the universe like adaptive k-d • associates subspaces to tree nodes • interior nodes are intervals • nodes in same level are mutually disjoint • perfectly balanced (like B-tree) • search straightforward, like k-d-tree • insert: search, find the right bucket, if required split and move half the data to it. • Deletion: search, remove, if necessary merge node with siblings • con: no minimum space utilization can be guaranteed
Point Access Methods ... • Hierarchical Access Methods: • k-d-B-tree
Point Access Methods ... • Hierarchical Access Methods: • LSD tree (??) • directory is organized same as adaptive k-d-tree • better adaptation to data distribution (in compare to fixed binary partitioning) • external balancing property: heights of external subtrees differ at most by one (??) • combines two split strategies to accommodate skewed data: • data-dependent : based on data, tries to achieve most balanced structure (equal number of data in both sides of split) • distribution-dependent: split at fixed dimension and position (know distribution is assumed)
Point Access Methods ... • Hierarchical Access Methods: • LSD tree (??)
Point Access Methods ... • Hierarchical Access Methods: • Buddy tree: • dynamic hashing scheme with tree structure (hybrid) • tree is made by consecutive insertions • cut the universe equally with iso-oriented hyperplanes • interior nodes: a partition and an interval (MBB of points or intervals below node) • intervals in same level nodes are mutually disjoint • leaves are data (like other trees!!) • each directory node has at least two entries • so: may not be balanced • when a node splits, MBB of two intervals are computed to reflect the current situation • so: tries to achieve high selectivity at directory level • except for root, only one pointer refers to each directory page. • so: guarantees linear growth