Innovative algorithm to compute skylines in a sliding window with uncertain data streams, balancing space and time efficiency.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei Wang (UNSW & NICTA) Jeffrey Xu Yu (CUHK)
Outline • Background • Framework • Algorithms • Experiment • Conclusion
Background • Elements continuously arrive with occurrence probabilities [Figure: stream of elements labelled id (probability): 1 (0.1), 2 (0.1), 3 (0.4), 4 (0.8), 5 (0.1), 6 (0.5)] • Problem: how to continuously compute skylines in a sliding window of size N (elements)? • Sliding window: N = 5
Background Multi-criteria decision making regarding uncertain data: • Online auction • Financial market • … …
Related work • Probabilistic skyline computation: probabilistic skyline (VLDB07); probabilistic reverse skyline (SIGMOD08) • Uncertain stream processing: probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07); frequent items on uncertain streams (SIGMOD08); top-k queries over uncertain sliding window (VLDB08) • … …
Models and Problem Definition • Model: DS is a stream of elements; each element a lies in a d-dimensional space and has an occurrence probability P(a) in (0, 1] • The skyline probability of an element a is Psky(a) = P(a) * ∏ (1 - P(a')), with the product taken over all elements a' in DS that dominate a • Problem definition: retrieve, from the most recent N elements, those with skyline probability no less than a given threshold q
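The model above can be sketched as a brute-force computation. This is only an illustration, not the algorithm of the talk: the `Element` type, the function names, and the "smaller is better in every dimension" dominance convention are assumptions made for the example.

```python
from typing import NamedTuple


class Element(NamedTuple):
    point: tuple   # coordinates in the d-dimensional space
    prob: float    # occurrence probability P(a), in (0, 1]


def dominates(p, q):
    """Assumed convention: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def skyline_probability(a, window):
    """Psky(a) = P(a) * product of (1 - P(b)) over every b in the window dominating a."""
    result = a.prob
    for b in window:
        if dominates(b.point, a.point):
            result *= 1.0 - b.prob
    return result


def probabilistic_skyline(stream, n, q):
    """Elements among the most recent n whose skyline probability is at least q."""
    window = list(stream)[-n:]
    return [a for a in window if skyline_probability(a, window) >= q]


# Hypothetical 2-d data: each element is dominated by all the ones before it.
window = [Element((1, 1), 0.3), Element((2, 2), 0.4), Element((3, 3), 0.9)]
result = probabilistic_skyline(window, n=3, q=0.29)
```

Here the third element has Psky = 0.9 * 0.7 * 0.6 = 0.378, so it qualifies for q = 0.29, while the second (0.4 * 0.7 = 0.28) does not. The cost is quadratic in the window size, which is what the candidate set and the R-trees are meant to avoid.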
Challenges and Contributions • Space efficiency • Contribution: space reduction from O(N) to O(ln^{d-1} N) • Time efficiency • Contribution: R-tree based efficient incremental algorithms
Outline • Background and Preliminaries • Framework • Algorithms • Experiment • Conclusion
Framework: what to keep? • Window size N: 5; probability threshold q: 0.5 • Pold(2) = 1 - P(1) • Pnew(2) = (1 - P(3)) * (1 - P(4)) • Pnew(2) < q, so element 2 will never become a skyline point in the window [Figure: elements labelled id (probability): 1 (0.1), 2 (0.1), 3 (0.4), 4 (0.8), 5 (0.1)]
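A small sketch of the Pold / Pnew split used on this slide. The coordinates are hypothetical (the slide gives only probabilities); they are chosen so that elements 1, 3 and 4 dominate element 2, and dominance is assumed to mean "smaller is better in every dimension".

```python
def dominates(p, q):
    """Assumed convention: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def split_probabilities(window, i):
    """Return (Pold, Pnew) for window[i]; window is a list of (point, prob) in arrival order.

    Pold multiplies (1 - P) over dominating elements that arrived earlier,
    Pnew over dominating elements that arrived later.
    """
    point, _ = window[i]
    p_old = p_new = 1.0
    for j, (other, prob) in enumerate(window):
        if j != i and dominates(other, point):
            if j < i:
                p_old *= 1.0 - prob
            else:
                p_new *= 1.0 - prob
    return p_old, p_new


# Probabilities from the slide; coordinates are made up for illustration.
window = [((1, 1), 0.1), ((5, 5), 0.1), ((2, 3), 0.4), ((3, 2), 0.8), ((6, 1), 0.1)]
q = 0.5
p_old, p_new = split_probabilities(window, 1)   # element 2
can_be_dropped = p_new < q
```

With these inputs Pold(2) is 0.9 and Pnew(2) is about 0.12. Since Psky(a) = P(a) * Pold(a) * Pnew(a) never exceeds Pnew(a), and every element contributing to Pnew(a) expires after a does, an element with Pnew(a) < q can be dropped safely.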
Framework: what to keep? • Candidate set S_{N,q}: the elements a among the most recent N with Pnew(a) ≥ q • Correctness: (1) no missing skyline points; (2) no false hits to determine S_{N,q}; (3) no false positives to determine skyline results; (4) no false negatives to determine skyline results • The probability computed from S_{N,q} may not be accurate, but it satisfies the threshold requirement.
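The correctness claims can be checked empirically with a brute-force sketch. It assumes that the candidate set consists of the window elements with Pnew ≥ q and that dominance means "smaller is better"; the data is random and purely illustrative.

```python
import random


def dominates(p, q):
    """Assumed convention: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def p_new(window, i):
    """Product of (1 - P) over elements newer than window[i] that dominate it."""
    point = window[i][0]
    result = 1.0
    for other, prob in window[i + 1:]:
        if dominates(other, point):
            result *= 1.0 - prob
    return result


def p_sky(elements, i):
    """Skyline probability of elements[i], using only the given elements."""
    point, prob = elements[i]
    result = prob
    for j, (other, other_prob) in enumerate(elements):
        if j != i and dominates(other, point):
            result *= 1.0 - other_prob
    return result


def candidate_set(window, q):
    return [window[i] for i in range(len(window)) if p_new(window, i) >= q]


def skyline_from(elements, q):
    return [elements[i] for i in range(len(elements)) if p_sky(elements, i) >= q]


rng = random.Random(0)
window = [((rng.random(), rng.random()), rng.uniform(0.05, 1.0)) for _ in range(200)]
q = 0.3
candidates = candidate_set(window, q)
exact = skyline_from(window, q)        # uses every element in the window
approx = skyline_from(candidates, q)   # uses the candidate set only
```

The probabilities computed from `candidates` alone can be larger than the true ones, because some dominating elements have been dropped, yet the two result lists are expected to coincide, which is the sense in which the threshold requirement is still satisfied.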
Framework • Space required for S_{N,q}: S_{N,q} is the minimum information to be maintained to get a correct answer. • Example (window size N: 4; probability threshold q: 0.5): Psky(3) = 0.9 * (1 - 0.4) * (1 - 0.3) < q versus Psky(3) = 0.9 > q [Figure: two panels showing elements 1, 2, 3 (0.9) and 4 (0.8); the other probabilities shown are 0.4 and 0.3]
Space of Candidate Set • Theorem: the candidate set requires poly-logarithmic space in the average case under uniform distributions, O(f(q) ln^{d-1} N).
Outline • Background and Preliminaries • Framework • Algorithms • Experiment • Conclusion
Algorithms • We maintain two R-trees • R1: SKY_{N,q} (skylines) • R2: S_{N,q} - SKY_{N,q} (candidates minus skylines)
Algorithms • Window size N: 13; probability threshold q: 0.2 • R1: SKY_{N,q} • R2: S_{N,q} - SKY_{N,q} • Not in S_{N,q}: 1(.1) [Figure: R-tree layout over elements labelled id(probability): 6(.8), 8(.2), 5(.8), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1)]
Algorithms • New element arrives: check Psky and Pnew on R1; check Pnew on R2; handle elements with Pnew < q • Old element expires: update Pold; check Psky on R2
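The two events can be sketched without R-trees, using a plain list of candidates. This is a simplified stand-in under stated assumptions, not the incremental algorithm of the talk: the class name `SlidingSkyline` is hypothetical, dominance means "smaller is better", Pnew is maintained incrementally, and Pold is recomputed from the surviving candidates at query time instead of being updated in place.

```python
from collections import deque


def dominates(p, q):
    """Assumed convention: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


class SlidingSkyline:
    """Keeps only candidates (Pnew >= q) among the most recent n elements."""

    def __init__(self, n, q):
        self.n, self.q = n, q
        self.count = 0
        self.candidates = deque()   # records [seq, point, prob, p_new], oldest first

    def insert(self, point, prob):
        self.count += 1
        # Old elements expire once they fall out of the window.
        oldest_allowed = self.count - self.n + 1
        while self.candidates and self.candidates[0][0] < oldest_allowed:
            self.candidates.popleft()
        # The new element lowers Pnew of everything it dominates; prune below q.
        kept = deque()
        for rec in self.candidates:
            if dominates(point, rec[1]):
                rec[3] *= 1.0 - prob
            if rec[3] >= self.q:
                kept.append(rec)
        kept.append([self.count, point, prob, 1.0])   # a new element has Pnew = 1
        self.candidates = kept

    def skyline(self):
        """Sequence numbers of candidates with P * Pold * Pnew >= q."""
        result = []
        for seq, point, prob, p_new in self.candidates:
            p_old = 1.0
            for seq2, point2, prob2, _ in self.candidates:
                if seq2 < seq and dominates(point2, point):
                    p_old *= 1.0 - prob2
            if prob * p_old * p_new >= self.q:
                result.append(seq)
        return result


# Hypothetical coordinates; probabilities follow the N = 5, q = 0.5 framework example.
sw = SlidingSkyline(n=5, q=0.5)
for point, prob in [((1, 1), 0.1), ((5, 5), 0.1), ((2, 3), 0.4), ((3, 2), 0.8), ((6, 1), 0.1)]:
    sw.insert(point, prob)
candidates_after_five = [rec[0] for rec in sw.candidates]
skyline_after_five = sw.skyline()
sw.insert((0.5, 0.5), 0.6)   # dominates everything still in the window
skyline_after_six = sw.skyline()
```

Element 2 is pruned as soon as element 4 arrives, because its Pnew drops to about 0.12. The sixth element pushes Pnew of the remaining older candidates to 0.4, so they are pruned as well.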
Algorithms: new element arrives • Window size N: 13; probability threshold q: 0.2 • Delete an entry • Before update: Pnew: (1, 1); Psky: (0.8, 0.8); global Pnew = 1 - 0.2 • After update: global Pnew *= 1 - 0.8 • Delete from R1 [Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} - SKY_{N,q}) over elements 6(.8), 8(.2), 5(.8), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1); arriving element 14(0.8)]
Algorithms: new element arrives • Window size N: 13; probability threshold q: 0.2 • Move an entry from R1 to R2 • Before update: Pnew: (1, 1); Psky: (0.24, 0.6); global Pnew = 1 • After update: global Pnew *= 1 - 0.8; min Pnew = 0.2 ≥ q; max Psky = 0.12 < q • Move from R1 to R2 [Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} - SKY_{N,q}) over elements 8(.2), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1); arriving element 14(0.8)]
Algorithms: new element arrives • Window size N: 13; probability threshold q: 0.2 • Before update: Pnew: (0.9, 1); global Pnew = 1 • After update: global Pnew *= 1 - 0.8; min Pnew < q; max Pnew ≥ q • Drill down and delete 2 [Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} - SKY_{N,q}) over elements 8(.2), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1); arriving element 14(0.8)]
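The three cases walked through on these slides (delete an entry, move an entry from R1 to R2, drill down) can be summarised as a decision on the aggregate (min, max) values stored with an intermediate entry. The sketch below is an interpretation under assumptions: `classify_entry` is a hypothetical helper, the entry is taken to be fully dominated by the new element so that one factor applies to everything beneath it, and exact rationals are used so that the boundary case Pnew = q compares the way the slide states it.

```python
from fractions import Fraction as F


def classify_entry(pnew_range, psky_range, new_prob, q, in_r1):
    """Decide what to do with an entry fully dominated by the arriving element.

    pnew_range / psky_range are (min, max) over the elements under the entry;
    psky_range is only consulted for entries of R1 (the skyline tree).
    """
    factor = 1 - new_prob                      # the lazily applied "global Pnew" update
    min_pnew, max_pnew = (p * factor for p in pnew_range)
    if max_pnew < q:
        return "delete"                        # no element below can stay a candidate
    if min_pnew < q:
        return "drill down"                    # some elements below must be deleted
    if in_r1:
        min_psky, max_psky = (p * factor for p in psky_range)
        if max_psky < q:
            return "move to R2"                # still candidates, no longer skylines
        if min_psky < q:
            return "drill down"
    return "keep"


q, p14 = F("0.2"), F("0.8")                    # threshold and P(14) from the slides
# Entry with Pnew (1, 1) and a pending global Pnew of 1 - 0.2, folded in here.
delete_case = classify_entry((F("0.8"), F("0.8")), (F("0.8"), F("0.8")), p14, q, in_r1=True)
move_case = classify_entry((F(1), F(1)), (F("0.24"), F("0.6")), p14, q, in_r1=True)
drill_case = classify_entry((F("0.9"), F(1)), None, p14, q, in_r1=False)
```

With binary floating point, 1 - 0.8 evaluates to slightly less than 0.2, so the "min Pnew = 0.2 ≥ q" case would be misclassified; that is the only reason `Fraction` appears here. Entries that the new element dominates only partially are not covered by this helper and would always need a drill-down.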
Algorithms: new element arrives • Window size N: 13; probability threshold q: 0.2 • Update Pold: update Pold of 12 & 13; global Pold /= (1 - 0.1) [Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} - SKY_{N,q}) over elements 8(.2), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1); arriving element 14(0.8)]
Algorithms: new element arrives • Window size N: 13; probability threshold q: 0.2 • Insert new element: Pnew = 1; compute Psky [Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} - SKY_{N,q}) over elements 8(.2), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 4(.1), 14(0.8)]
Algorithms: old element expires • Delete it from R1 or R2 • Update Pold of the remaining elements: record the global Pold on intermediate entries fully dominated by it • Check Psky after the update
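A list-based sketch of the expiry step, assuming each record stores its own Pold and Pnew and that dominance means "smaller is better". The record layout and the helper names are hypothetical. The division mirrors the "Pold /= 1 - P" update; the guard for P = 1 is an addition of this sketch, since dividing by zero is not possible.

```python
def dominates(p, q):
    """Assumed convention: p dominates q if p <= q everywhere and p < q somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def recompute_p_old(window, index):
    """Fallback: rebuild Pold of window[index] from the older records still present."""
    target = window[index]["point"]
    p_old = 1.0
    for rec in window[:index]:
        if dominates(rec["point"], target):
            p_old *= 1.0 - rec["prob"]
    return p_old


def expire_oldest(window, q):
    """Remove the oldest record, fix Pold of what it dominated, return new skylines."""
    expired = window.pop(0)
    promoted = []
    for index, rec in enumerate(window):
        if not dominates(expired["point"], rec["point"]):
            continue
        before = rec["prob"] * rec["p_old"] * rec["p_new"]
        if expired["prob"] < 1.0:
            rec["p_old"] /= 1.0 - expired["prob"]     # undo the expired factor
        else:
            rec["p_old"] = recompute_p_old(window, index)
        after = rec["prob"] * rec["p_old"] * rec["p_new"]
        if before < q <= after:
            promoted.append(rec)                      # crosses the threshold
    return promoted


# Hypothetical data: the oldest element dominates the second but not the third.
window = [
    {"point": (1, 1), "prob": 0.4, "p_old": 1.0, "p_new": 1.0},
    {"point": (2, 2), "prob": 0.9, "p_old": 1.0 - 0.4, "p_new": 1.0},
    {"point": (0.5, 3), "prob": 0.5, "p_old": 1.0, "p_new": 1.0},
]
promoted = expire_oldest(window, q=0.6)
```

In this example the second element has Psky = 0.54 before the expiry and 0.9 afterwards, so it crosses the threshold q = 0.6; in the two-tree layout that corresponds to moving it from R2 to R1.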
Algorithms: old element expires • Window size N: 13; probability threshold q: 0.2 • Pold(7) /= 1 - P(3) • global Pold /= 1 - P(4) [Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} - SKY_{N,q}) over elements 8(.2), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 4(.1), 14(0.8)]
Algorithms: handling multiple thresholds • Continuous queries: users specify k probability thresholds q_1, …, q_k (q_i < q_{i-1}) • Solution: instead of maintaining R1, we maintain R_1, …, R_k, each corresponding to a confidence value • Ad-hoc queries: users issue a query to retrieve skylines with probability at least q' (q' ≥ q_k) • Solution: find an R_i with q_i ≤ q' < q_{i-1}. Then all elements in {R_j : j < i - 1} are results, and we search R_{i-1} to output the qualified skylines
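The ad-hoc query idea can be sketched with plain lists in place of R-trees. The bucket indexing here is an assumption of the sketch and may be offset from the indexing on the slide: bucket i (0-based) holds the elements whose skyline probability lies in [q_i, q_{i-1}), with the thresholds sorted in descending order. The function names and the data are hypothetical.

```python
def build_buckets(psky_by_id, thresholds):
    """thresholds: descending list q_1 > ... > q_k.

    Each element goes into the first bucket whose threshold it reaches;
    elements below q_k are not kept.
    """
    buckets = [[] for _ in thresholds]
    for elem_id, psky in psky_by_id.items():
        for i, qi in enumerate(thresholds):
            if psky >= qi:
                buckets[i].append((elem_id, psky))
                break
    return buckets


def adhoc_query(buckets, thresholds, q_prime):
    """Ids of elements with skyline probability >= q_prime, for q_prime >= q_k."""
    if q_prime < thresholds[-1]:
        raise ValueError("q' must be at least the smallest maintained threshold")
    result = []
    for i, qi in enumerate(thresholds):
        if qi > q_prime:
            # Everything here has probability >= q_i > q', so no per-element check.
            result.extend(elem_id for elem_id, _ in buckets[i])
        else:
            # First bucket with q_i <= q': the only one that needs filtering.
            result.extend(elem_id for elem_id, p in buckets[i] if p >= q_prime)
            break
    return result


thresholds = [0.8, 0.5, 0.2]
psky_by_id = {"a": 0.9, "b": 0.6, "c": 0.55, "d": 0.3, "e": 0.1}
buckets = build_buckets(psky_by_id, thresholds)
at_least_058 = adhoc_query(buckets, thresholds, 0.58)
at_least_02 = adhoc_query(buckets, thresholds, 0.2)
```

Only one bucket is ever scanned element by element, which is the point of keeping one structure per threshold: the buckets above it are reported wholesale and the buckets below it are skipped.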
Experiment • Data set: • Real: stock transactions. 2-d. probability assigned randomly. Size: 2 million • Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2d to 5d; 2 million • Default values: p : 0.3; d: 3; N : 1M; spatial distribution: anti-correlated; probability: uniform;
Experiment: space • The space used is about 0.1% of the sliding window size for 2-d data, and around 89% of the space is saved even for 5-d data.
Experiment: space • The size of S_{N,q} decreases as Pu increases, while the size of SKY_{N,q} increases with it.
Experiment: time • Maintenance time increases with the number of probability thresholds; query time decreases with it.
Conclusion • We characterize a candidate set of minimum size and propose time-efficient techniques to maintain it. • We extend the framework to handle multiple thresholds.