Progressive Computation of Constrained Subspace Skyline Queries Evangelos Dellis1 Akrivi Vlachou1 Ilya Vladimirskiy1 Bernhard Seeger1 Yannis Theodoridis2 1 Department of Computer Science, University of Marburg, Germany 2 Department of Computer Science, University of Piraeus, Greece
Overview • Introduction • Motivation - Related Work • Basic STA • Improved Pruning • Indexing using Low-dimensional R-trees • Experimental Evaluation • Conclusions – Future Work
Finding A Hotel Close to the Beach • Which one is better? • i or h? (i, because its price and distance dominate those of h) • i or k?
Skyline Queries • Retrieve points not dominated by any other point: • A point p dominates another point q if p is as good as or better than q in all dimensions and strictly better in at least one dimension.
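The dominance definition above can be sketched in a few lines of Python. This is a minimal brute-force illustration, assuming that smaller values are better in every dimension; the hotel tuples are made-up values, not data from the slides.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and strictly better somewhere.

    Assumption: smaller is better in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Brute-force skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach).
hotels = [(50, 8), (60, 3), (70, 4), (90, 1), (95, 2)]
result = skyline(hotels)
print(result)  # [(50, 8), (60, 3), (90, 1)]
```

The quadratic scan is only meant to make the definition concrete; it is not one of the algorithms discussed in this talk.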
Skyline of Manhattan • Which buildings can we see? • Higher or nearer (a building dominates another building if it is higher, closer to the river, and has the same x position)
SQL Extension • SQL syntax: • Examples: a) Find a hotel that is cheap and close to the beach. b) Find salespersons who were very successful in 1999 and have a low salary.
Motivation • Constrained Skyline (car database): • A user may only be interested in records within the price range from 3 thousand to 7 thousand euros and with a mileage reading between 20K and 100K. • The traditional skyline (dashed line) fails to return interesting points.
Motivation (continued) • Subspace Skyline: • A car database could contain many other attributes of the cars: • horsepower, age, fuel consumption, etc. • A customer who is sensitive to price and mileage (a 2-dimensional subspace) would like to pose a skyline query on those attributes only, rather than on the whole data space. • While the dimensionality of the data space might be rather high, skyline queries generally refer to a low-dimensional subspace. • Constrained subspace skyline queries generalize all meaningful skyline queries over a given dataset.
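A constrained subspace skyline query can be illustrated by brute force: restrict the data to the constrained region, then test dominance only on the queried dimensions. The sketch below assumes smaller is better; the function name, the `constraints` dictionary layout, and the car tuples are hypothetical and chosen only for illustration.

```python
def dominates(p, q, dims):
    """p dominates q on the subspace `dims` (smaller is better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def constrained_subspace_skyline(points, dims, constraints):
    """Skyline on `dims` among the points inside the constrained region.

    constraints maps a dimension index to an inclusive (low, high) range.
    """
    region = [
        p for p in points
        if all(low <= p[d] <= high for d, (low, high) in constraints.items())
    ]
    return [p for p in region if not any(dominates(q, p, dims) for q in region)]


# Hypothetical cars: (price in thousand euros, mileage in thousand km, age in years).
cars = [(2, 10, 12), (4, 60, 5), (5, 30, 7), (6, 90, 2), (6, 25, 3), (9, 15, 1)]

# Price between 3 and 7, mileage between 20 and 100, skyline on (price, mileage).
constrained = constrained_subspace_skyline(cars, (0, 1), {0: (3, 7), 1: (20, 100)})
unconstrained = constrained_subspace_skyline(cars, (0, 1), {})
print(constrained)    # [(4, 60, 5), (5, 30, 7), (6, 25, 3)]
print(unconstrained)  # [(2, 10, 12)]
```

In this toy data the unconstrained skyline consists of a single car that lies outside the requested ranges, so filtering the traditional skyline afterwards would return nothing.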
Related Work • SKYCUBE [VLDB 2005, SIGMOD 2006]: • The Skyline Cube (SKYCUBE) consists of the skylines in all possible (2^d − 1) non-empty subspaces. • Drawback: It is not possible to pre-calculate the points of the full space skyline and their duplicates, since the result depends on the given constraints (static). • SUBSKY [ICDE 2006]: • Transforms the multi-dimensional data into one-dimensional values, and therefore permits indexing the dataset with a B+-tree. • Drawbacks: • It is unable to answer constrained subspace skyline queries, as all points have to be transformed in a pre-processing step. • It does not deliver the skyline points progressively.
Related Work (Continued) • BBS [SIGMOD 2003, TODS 2005]: • all points are indexed in an R-tree. • mindist(MBR) = the L1 distance between its lower-left corner and the origin (NN). • Keep a heap of index entries and objects, ordered by mindist. • Is still the most efficient method for (constrained subspace) skyline retrieval!
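The heap ordering used by BBS can be sketched as follows. This only illustrates the mindist key and the resulting visiting order, not the full algorithm (no R-tree, no node expansion, no dominance checks). It assumes non-negative coordinates so that the L1 distance to the origin is the coordinate sum; the entry labels and corners are hypothetical.

```python
import heapq


def mindist(lower_left):
    """L1 distance from the origin to an MBR's lower-left corner (or to a point)."""
    return sum(lower_left)


# Hypothetical heap entries: (label, lower-left corner). A point is its own corner.
entries = [
    ("node N1", (2, 6)),
    ("node N2", (5, 1)),
    ("point a", (3, 4)),
    ("point b", (1, 9)),
]

heap = [(mindist(corner), label) for label, corner in entries]
heapq.heapify(heap)

# Entries leave the heap in ascending mindist order.
visit_order = [heapq.heappop(heap)[1] for _ in range(len(entries))]
print(visit_order)  # ['node N2', 'point a', 'node N1', 'point b']
```

Because a dominating point always has a strictly smaller coordinate sum than the point it dominates, processing entries in this order means a point is only reached after every point that could dominate it.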
Related Work (Continued) Shortcomings of BBS: • Maintaining a high-dimensional index to support constrained skyline queries in arbitrary dimensionality is not suitable: • It has been shown that the performance of such high-dimensional indexes deteriorates with an increasing number of dimensions. (Curse of Dimensionality) • The performance of low-dimensional constrained skyline queries degrades when the indexed space has a high dimensionality while the query space has a low one. (Random Grouping Effect) • Only low-dimensional indexes, e.g. R-trees, seem to perform well in practice and for that reason have found their place in commercial database management systems (DBMS).
Our Approach • We vertically partition the data space into several low-dimensional subspaces and index each of these subspaces using an R-tree. • A constrained skyline query is then partitioned into several sub-queries, each of which is processed on the corresponding index using incremental NN search. • TA-INDEX [DAWAK 2005]: An algorithm for vertically partitioned nearest neighbor queries.
Contributions • We present a threshold-based skyline algorithm (called STA), which exploits multiple indexes. • We propose different pruning strategies to identify dominated regions and to discard irrelevant sub-trees of the indexes. • A workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes is presented.
Problem Definition Constrained Subspace Skyline Queries: • For a point p∈Dc in the dimension set S΄: • the dominance region contains points which are dominated by p. • the anti-dominance region refers to the set of points dominating p. • A point p∈D is said to dominate another point q∈D on subspace S΄ if: • on every dimension di∈S΄, pi ≤ qi; and • on at least one dimension dj∈S΄, pj < qj.
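The two regions can be made concrete by classifying a small dataset relative to a point p on a subspace S΄. This is a brute-force sketch assuming smaller is better; the function `regions` and the sample tuples are hypothetical.

```python
def dominates(p, q, dims):
    """p dominates q on the subspace `dims` (smaller is better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def regions(p, points, dims):
    """Split `points` into p's dominance and anti-dominance regions on `dims`."""
    dominance = [q for q in points if dominates(p, q, dims)]       # dominated by p
    anti_dominance = [q for q in points if dominates(q, p, dims)]  # dominating p
    return dominance, anti_dominance


data = [(1, 1, 9), (2, 5, 0), (3, 3, 3), (4, 4, 1), (5, 2, 7)]
p = (3, 3, 3)
dominance, anti_dominance = regions(p, data, dims=(0, 1))
print(dominance)       # [(4, 4, 1)]
print(anti_dominance)  # [(1, 1, 9)]
```

Points such as (2, 5, 0) and (5, 2, 7) fall in neither region because they are incomparable with p on the first two dimensions. p belongs to the skyline of the subspace exactly when its anti-dominance region contains no data point.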
One-point Pruning • Observation: A point p is a skyline point in S΄ if and only if there exists no point q that belongs to the anti-dominance area of p for all dimension sets Si΄ (1 ≤ i ≤ n). • Pruning with the nearest neighbor: we need to prune objects that are not part of the skyline. • Because the nearest neighbor is a member of the skyline, no point dominates it. • Among all skyline points, it is one whose dominance region has a large volume, and hence it is expected to prune a large percentage of the data points.
STA: A Threshold-based Skyline Algorithm • Our algorithm works in two steps: • Filter step: • All retrieved points are organized in a priority queue (heap) based on their Manhattan distance according to the dimension set S΄. • We use the Manhattan distance of the last reported point of Si΄ as a threshold to speed up the filtering phase. • Refinement step: (domination test) • The refinement step begins when the first constrained nearest neighbor based on S΄ is returned by the filter step. This point is guaranteed to be a skyline point. • In the next iteration, where another candidate is found, the refinement step needs to determine whether this candidate is a skyline point or not. • The dominance test is performed in a way similar to traditional window queries using a main-memory R-tree whose dimensionality is equal to the query dimensionality.
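The filter and refinement steps can be sketched with in-memory stand-ins. This is a simplified illustration under several assumptions, not the authors' implementation: each index is simulated by a list sorted on the partial Manhattan distance (standing in for incremental NN search on an R-tree), indexes are read round robin, constraints are omitted, coordinates are non-negative, and the dominance test scans a plain list instead of a main-memory R-tree. The function name `sta_skyline` and its parameters are hypothetical.

```python
import heapq


def dominates(p, q):
    """p dominates q (smaller is better in every dimension)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def sta_skyline(points, groups):
    """Return ids of skyline points on the union of `groups`, in reporting order.

    groups: one tuple of dimension indexes per (simulated) low-dimensional index.
    """
    # One stream per index: (partial L1 distance, point id) in ascending order.
    streams = [
        sorted((sum(p[d] for d in dims), pid) for pid, p in enumerate(points))
        for dims in groups
    ]
    query_dims = [d for dims in groups for d in dims]
    last = [0] * len(groups)   # partial distance of the last point read per index
    seen, heap, skyline = set(), [], []

    def refine(pid):
        """Dominance test of a candidate against the skyline found so far."""
        cand = tuple(points[pid][d] for d in query_dims)
        if not any(dominates(s, cand) for _, s in skyline):
            skyline.append((pid, cand))

    for pos in range(len(points)):
        # Filter step: read one point from every index (round robin).
        for i, stream in enumerate(streams):
            last[i], pid = stream[pos]
            if pid not in seen:
                seen.add(pid)
                full = sum(points[pid][d] for d in query_dims)
                heapq.heappush(heap, (full, pid))
        # Any unseen point has a full distance of at least this threshold.
        threshold = sum(last)
        while heap and heap[0][0] <= threshold:
            refine(heapq.heappop(heap)[1])

    while heap:  # every point has been seen; flush remaining candidates
        refine(heapq.heappop(heap)[1])
    return [pid for pid, _ in skyline]


points = [(1, 5, 2, 2), (2, 2, 2, 2), (3, 3, 3, 3), (5, 1, 1, 4), (1, 1, 9, 9)]
result = sta_skyline(points, groups=[(0, 1), (2, 3)])
print(result)  # [1, 0, 3, 4]
```

In this run the first skyline point (id 1) is reported after only two rounds of reads, which is the progressive behaviour the threshold is meant to enable. Candidates are refined in ascending Manhattan distance, so a candidate only needs to be compared against points already in the skyline.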
Index Scheduling • Round Robin strategy: • Inefficient • We are interested in more advanced strategies that result in a fast increase of the threshold. • We choose the index that will increase the partial distance the most, as this is most beneficial for our threshold. • Strategies for index scheduling for nearest neighbor search on a vertically partitioned data set have been studied in [DAWAK 2005].
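One way to read the "largest increase" rule is as a greedy choice over the indexes. The sketch below is an assumption about how such a scheduler could look, not the strategy from [DAWAK 2005]: it presumes that the partial distance of the next point of each index can be peeked at (or estimated by a lower bound), and the function name and argument layout are hypothetical.

```python
def pick_index(last_dist, next_dist):
    """Return the position of the index whose next read raises the threshold most.

    last_dist[i]: partial distance of the last point reported by index i.
    next_dist[i]: partial distance of its next point, or None if exhausted.
    Returns None when every index is exhausted.
    """
    best, best_gain = None, -1
    for i, (last, nxt) in enumerate(zip(last_dist, next_dist)):
        if nxt is None:
            continue
        gain = nxt - last  # how much the threshold grows if index i is read
        if gain > best_gain:
            best, best_gain = i, gain
    return best


choice = pick_index(last_dist=[4, 4], next_dist=[6, 5])
print(choice)  # 0
```

Reading index 0 raises the threshold by 2 instead of 1, so more queued candidates can be released per page access than with a fixed round-robin order.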
Improved Pruning • Motivating example: • Non-uniform distributions: points form clusters • Need: Pruning using multiple points • Simultaneous pruning: • We are not able to prune simultaneously in both subspaces using the same point.
Multiple-point Pruning • Observation: when points lying in the dominance region of a point are not discarded in at least one subspace, then we are able, under certain conditions, to discard points in all remaining subspaces, while we guarantee no false dismissals. • we use the points that are retrieved as local constrained nearest neighbors from an index, for pruning in all other indexes. • Example: 4-dimensional data space is divided into two 2-dimensional subspaces. When the point p1 is retrieved from subspace S1 then the dominance area of the point p1 in subspace S2 is used for pruning.
Avoiding False Hits • Unfortunately, by following this strategy some skyline points are falsely discarded. • Case 1: Let the point q in the projection S2 collapse on the point q1. The point p is not a skyline point in S, since it is dominated by q in all dimension sets of S. • Case 2: On the other hand, if the point q in the projection S2 collapses on the point q2, then point p may be discarded falsely, since it is a potential skyline point. • Solution: To discard points from the dominance area of p in S2, the point p and a point qi must be dominated by the projection of the same point in S2 and S1 respectively. This condition must hold for each point qi which belongs to the discarded area of S1.
Random Grouping Effect • Random Grouping Effect: Since not all dimensions are used for splitting the axes during the index creation for a leaf node, when a query that requires projection is posed to the index the performance of the index corresponds to a random low-dimensional index, • i.e. an index that groups the points into leaf nodes in a mostly random manner. • Example: consider a 10-d data space and assume that we are interested in retrieving the skyline of any 2-d subspace. • If only two dimensions are used for splitting, then the probability that the chosen dimensions have been used for splitting is very small. • Thus, the query performance is similar to the performance of a 2-d index, where the data points were grouped together randomly.
Number of Indexes • If every leaf node is split at least once in each dimension, we need a total of at least 2^d leaf nodes. • Well-performing index: every leaf node is split by each dimension once (L ≥ 2^d). (Defines a maximum dimensionality for a low-dimensional index) • Example: 32-d Color dataset, 68,040 records. • Our formula suggests 2 indexes • In this way we index high-dimensional datasets more effectively, by avoiding performance degradation due to the random grouping effect.
Dimension Assignment Algorithm • Number of Distinct Values: a quality measure of a subspace Si • points whose projections coincide to a low-dimensional point, so that it is dominated by some duplicate point in the query-dimensional space. • DAA: a greedy algorithm to distribute the attributes over the n indexes. • restrict the random grouping effect • maximize the number of distinct values
Workload-adaptive Extension • User preferences are correlated: • use multiple indexes, which are built on the most preferred subspaces • Simple, but very powerful extension: • associate some probability with each subspace (the frequency with which it is queried) • weight the cost estimation of each dimension set by its probability. • This extension allows us to examine the performance of our algorithm under a workload, which is closer to real applications, instead of picking random subspaces.
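The weighting step amounts to an expected cost over the workload. The sketch below assumes that a cost estimate per queried subspace is already available for each candidate index layout; the layout names, frequencies, and cost values are hypothetical.

```python
def expected_cost(frequencies, costs):
    """Workload-weighted cost: sum over subspaces of frequency * estimated cost."""
    return sum(frequencies[s] * costs[s] for s in frequencies)


# Hypothetical workload: how often each subspace is queried.
frequencies = {("price", "mileage"): 0.8, ("age", "horsepower"): 0.2}

# Hypothetical estimated page accesses per subspace for two candidate layouts.
layouts = {
    "A": {("price", "mileage"): 40, ("age", "horsepower"): 120},
    "B": {("price", "mileage"): 90, ("age", "horsepower"): 60},
}

best = min(layouts, key=lambda name: expected_cost(frequencies, layouts[name]))
print(best)  # A
```

Layout A wins although it is worse on the rarely queried subspace, because the frequently queried subspace dominates the weighted sum (about 56 versus 84 expected page accesses).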
Experimental Evaluation Datasets: • Three data sets from real-world applications: • The NBA dataset contains 17,000 13-dimensional points, where each point corresponds to the statistics of a player in 13 categories. • The Color moments dataset contains 9-dimensional features of 68,040 photo images extracted from the Corel Draw database. • The Color histogram dataset consists of 32-dimensional features, each representing the histogram of an image. • Additionally, we generated 10-dimensional uniform datasets with a cardinality of 10,000, 50,000 and 100,000 data points. Implementation Details: • We compare our algorithm against the current state-of-the-art method BBS. • We set the page size for each R-tree to 4K, and each dimension was represented by a real number. • Measurement: The number of disk I/Os (page accesses)
Examination of Constrained Subspace Skylines • Effect of Constrained Region: • Varying constrained region from 50% to 100% of each axis. • We examine subspaces with dimensionality of dsub=3. • Uniform dataset: full space dimensionality of 10-d and a cardinality of 50,000 points. • Observation: the performance of our algorithm is not affected significantly by the size of the constrained region.
Examination of Constrained Subspace Skylines • Effect of Subspace Dimensionality • We vary the query subspace dimensionality from 2 to 4. • We keep the constrained region constant (60% of the values of each requested axis). • Figure panels: a) 10-d Uniform Dataset, 50k; b) 9-d Color Dataset, 68k • These results demonstrate that the STA algorithm leads to substantially fewer page accesses than BBS.
Scalability with the Dataset Cardinality • We use uniform datasets with a dimensionality of 10. • We vary the cardinality between 10,000 and 100,000 points. • We set the constrained region to cover 60% of each axis. • In addition, we request the skyline of 3-dimensional subspaces. • The proposed method scales better with cardinality than BBS.
Scalability with Full-space Dimensionality • Varying the Full-space Dimensionality: • We set the constrained region to cover 60% of each axis. In addition, we request the skyline of 3-dimensional subspaces. • Uniform datasets with a dimensionality of 10, 20 and 30. • Real datasets with a dimensionality of 9, 13 and 32. • Figure panels: a) Uniform Datasets; b) Real Datasets • In both cases our algorithm consistently outperforms BBS in this experiment.
Adaptation to the Query Workload • Query workload using the “80-20” law: • 20% of the attributes contribute to 80% of the queries • 32-dimensional Color histogram dataset, which consists of 68,040 records • Figure panels: a) I/O cost; b) CPU cost • Scalability using the “80-20” law: • Subspace skyline with dsub = 3 • Constrained Region: 60% of each axis
Conclusions – Future Work • We addressed the problem of Constrained Subspace Skyline Queries and presented a threshold-based skyline algorithm, which exploits multiple indexes. • We proposed different pruning strategies to identify dominated regions and to discard irrelevant sub-trees of the indexes. • We presented a workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes. • An extensive performance evaluation shows the superiority of our proposed technique over related work. • Future work may include: • Examination of STA using external queues • Development of a cost model for Constrained Subspace Skyline Queries
References • SKYCUBE [VLDB 2005, SIGMOD 2006]: • Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J., Zhang, Q.: Efficient Computation of the Skyline Cube. Very Large Data Bases Conference (VLDB), Trondheim, Norway, August 30 - September 2, 2005. • Pei, J., Jin, W., Ester, M., Tao, Y.: Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces. Very Large Data Bases Conference (VLDB), Trondheim, Norway, August 30 - September 2, 2005. • Xia, T., Zhang, D.: Refreshing the Sky: The Compressed Skycube with Efficient Support for Frequent Updates. To appear in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD), Chicago, IL, USA, 2006. • SUBSKY [ICDE 2006]: • Tao, Y., Xiao, X., Pei, J.: SUBSKY: Efficient Computation of Skylines in Subspaces. IEEE International Conference on Data Engineering (ICDE), Atlanta, Georgia, USA, April 3-7, 2006. • BBS [SIGMOD 2003, TODS 2005]: • Papadias, D., Tao, Y., Fu, G., Seeger, B.: An Optimal and Progressive Algorithm for Skyline Queries. ACM Conference on the Management of Data (SIGMOD), San Diego, CA, June 9-12, 2003. • Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive Skyline Computation in Database Systems. ACM Transactions on Database Systems, 30(1): 41-82, 2005. • TA-INDEX [DAWAK 2005]: • Dellis, E., Seeger, B., Vlachou, A.: Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Copenhagen, Denmark, 2005.
Thank You Questions?