Stabbing the Sky: Efficient Skyline Computation over Sliding Windows

Stabbing the Sky: Efficient Skyline Computation over Sliding Windows COMP9314 Lecture Notes

Outline • Introduction • n-of-N Queries • (n1, n2)-of-N Queries • Performance Evaluation • Conclusions Xuemin Lin@dbg.unsw.edu.au

Skyline Skyline Query: • Input: a set of points in d- dimensional space. • Output: points not dominated by another point. • (x1, x2, …, xd) dominates (y1, y2, …, yd) iff xi<=yi (1<=i<=d) & ∃k, xk<yk. Xuemin Lin@dbg.unsw.edu.au

Applications Multi-criteria decision making… Stock Trading Example: • What are the top deals? Xuemin Lin@dbg.unsw.edu.au

Skyline Query Over Sliding Window Stock Trading Example • Top deals of a stock in the last 5 mins? last 4 mins, … • Top deals of a stock in the last 10K deals? … Queries: • n-of-N model (n <= N): the most recent n elements • (n1, n2)-of-N model • One-time queries • Continuous queries Xuemin Lin@dbg.unsw.edu.au

Challenges Insertions & deletions (possibly high speed). On-line information • memory requirement • processing speed Existing techniques do not support n-of-N: [Borzsonyi et al (ICDE01), Tan et al (VLDB01), Kossman et al (VLDB 02), Papadias et al (SIGMOD03), Kapoor (SIAM J. comp00)] • support the computation of whole dataset • O (n logd-2 n) for d >= 4 & O (n log n ) otherwise Xuemin Lin@dbg.unsw.edu.au

Results n-of-N: • keep N’ (N’  N) elements where N’ = O (logd N) if data distribution on each dimension is independent. • a novel encoding scheme, with O (N’) space, leads to n-of-N query time O ( log N’ + s ) instead of O (n logd-2 n). • a new trigger based technique for continuously processing an n-of-N query. • trigger update time: O ( log s). • result update time: O (logδ) where δ is a result change. (n1, n2)-of-N: similar results. Xuemin Lin@dbg.unsw.edu.au

n-of-N Queries • e is redundant Point e in PN (the most recent N elements) iff • e expires w.r.t PN,or • ∃e’ s.t. e’  e, and e’ is younger than e N = 6 Xuemin Lin@dbg.unsw.edu.au

Optimality Theorem: Non-redundant Points (RN) vs. n-of-N Skyline Query Result (Qn,N) • (PN – RN)does not appear in any Qn,N • Qn,N must be a subset of RN • xRNn, xQn,N • RN = O(logd-1N) for “independent” distributions • Only need to keep RN – the minimum number of elements to be kept. Xuemin Lin@dbg.unsw.edu.au

Querying RN • critical dominance: e  e’ where e is the youngest. • dominance graph GRN: RN and the critical dominance relationships. Xuemin Lin@dbg.unsw.edu.au

Querying RN e  Qn,Niff • e is a root in GRN or • e’ e in GRN & e’ has expired  t(e’) < M – n + 1 <= t(e) Xuemin Lin@dbg.unsw.edu.au

Querying RN: Optimal Algorithm To answer an n-of-N Query, encode the GRN using intervals: • Stab the intervals by (M-n+1). • For all returned intervals (x,y), return point whose timestamp is y • Technique: Use an interval tree index to achieve optimal O(log|RN|+s) query time e.g., n=6 (0,3] (0,4] (3,7] (4,5] (4,6] Xuemin Lin@dbg.unsw.edu.au

Maintaining RN new element enewarrives: • If the oldest eold RN expires, remove eold and update RN and GRN(interval tree). • find D  RNdominated by enew, update RN and GRN • Depth-first search on a R-tree of RN • find ecenew, update GRN • Best-first search on the R-tree of RN enew = 8 eold = 3 D = {6} e = 4 Xuemin Lin@dbg.unsw.edu.au

Continuous n-of-N Query Trigger-based algorithm: • Deletion: Qn,N – {eold}, and Qn,N – {D} • Insertion: Qn,N  {enew} if (e’cenew and t(e’ ) >= M-n+1) • Maintain a min-heap of Qn,N for efficiency enew = 8 M = 8 eold = 3 D = {6} e’ = 4 M = 7 n = 4, N=5 Q4,5 = {4,7} M = 8 n = 4, N=5 Q4,5 = {5,7,8} Q5,5 = {3,4} Q5,5 = {4,7} Xuemin Lin@dbg.unsw.edu.au

(n1,n2)-of-N Query More complicated than n-of-N Query • PN needs to be kept! • (Old) critical dominance: t (ae) = max { t (e’): e’  e & t (e’) < t (e) } • backward critical dominance: t (be) = min { t (e’): e’  e & t (e’) > t (e)} • e Q(n1,n2),N iff ae < M-n2+1 eM-n1+1 < be • CBC dominance graph: PN & the two kinds of dominance relationships (2,4)-of-7: {4, 6} a5 = 3, b5 = 6 (M-n2+1, M-n1+1] = (4,6] Xuemin Lin@dbg.unsw.edu.au

Processing (n1,n2)-of-N Query Encode the CBC dominance graph: • e ((ae, e], be) • build an interval tree on (ae, e] only Stab using M-n2+1 against the interval tree and check e <= M-n1+1 < be) on-the-fly: • O(logN+s*), sub-optimal (1,6] (3,4] (3,5] ... (2,4)-of-7: ??? (M-n2+1, M-n1+1] = (4,6] Candidates: {4, 5, 6} (2,4)-of-7: {4, 6} Xuemin Lin@dbg.unsw.edu.au

More on (n1,n2)-of-N Query Maintenance: Similar to that of n-of-N query, but • Always expires the oldest element in PN, and maintain the interval tree and the R-tree on RN. • Implementation-wise: Use two interval trees to index RN and PN-RN, respectively. Continuous queries • More complicated • A new skyline point might not be a skyline in the previous result, • nor critically dominated by a skyline point in the previous result • nor a newly arrived point • Basic idea • Maintain additional Candidate Solutions (minimization) & triggers • Details in the full paper Xuemin Lin@dbg.unsw.edu.au

Experiment Setup • Hardware • P4 2.8G CPU, 1G Memory • Datasets • Correlated, independent, and anti-correlated • d = 2 to 5, N = 106 • Algorithms • KLP, nN, mnN, cnN, n12N, mn12N • Metrics • Processing time  Streaming rate Xuemin Lin@dbg.unsw.edu.au

n-of-N Query • Varying dimensionality • M up to 2M, N = 1M, n uniformly from [1K, 1M], #queries = 1000 Xuemin Lin@dbg.unsw.edu.au

n-of-N Query (cont’d) • Varying n • for correlated, independent, and anti-correlated datasets Xuemin Lin@dbg.unsw.edu.au

Maintenance Costs 2d and 5d datasets, measure average and max time, N = i * 105 Xuemin Lin@dbg.unsw.edu.au

Scalability M (total number) = 2M, N = 1M, #queries = 2M independent anti-correlated Xuemin Lin@dbg.unsw.edu.au

Continuous n-of-N Queries • 2d & 5d datasets • N = 10K and 1M • 10 queries with n = i*(N/10) • measures cnN avg, cnN max, nN avg, nN max Xuemin Lin@dbg.unsw.edu.au

(n1,n2)-of-N Queries Varying dimensionality • M up to 2M, N = 1M, #queries = 1000 • restricting n2 – n1 >= 500 Scalability • M = 2M, N = 1M, #queries = 2M Xuemin Lin@dbg.unsw.edu.au

Maintenance • 2d and 5d datasets • measure average and max time • N = i * 105 Xuemin Lin@dbg.unsw.edu.au

Conclusions • Efficient algorithms for various sliding windows skyline queries • Keep only minimum number of points • Encode and index those points • Maintain all the data structures • The proposed solutions • have theoretical guarantee on the performance, and • have demonstrated efficiency and scalability in the experiments • Future work • Improve the current solution for (n1,n2)-of-N queries • Approximate skyline queries Xuemin Lin@dbg.unsw.edu.au

Q&A Thank You! Xuemin Lin@dbg.unsw.edu.au

Reference • [ICDE01]S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. ICDE, 2001. • [VLDB01]K. Tan, P. Eng, and B. Ooi. Efficient progressive skyline computation. VLDB, 2001. • [VLDB 02]D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. VLDB, 2002. • [SIGMOD03] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal progressive alogrithm for skyline queries. SIGMOD, 2003. • [SIAM J. comp00] S. Kapoor. Dynamic maintenance of maxima of 2-d point sets. SIAM J. Comput., 2000. Xuemin Lin@dbg.unsw.edu.au

Stabbing the Sky: Efficient Skyline Computation over Sliding Windows