Probabilistic Skyline Operator over Sliding Windows

Innovative algorithm to compute skylines in a sliding window with uncertain data streams, balancing space and time efficiency.

jryder

Presentation Transcript


  1. Probabilistic Skyline Operator over Sliding Windows. Wenjie Zhang, University of New South Wales & NICTA, Australia. Joint work with Xuemin Lin, Ying Zhang, Wei Wang (UNSW & NICTA) and Jeffrey Xu Yu (CUHK).

  2. Outline • Background • Framework • Algorithms • Experiment • Conclusion

  3. Background • Elements continuously arrive with occurrence probabilities. [Figure: stream elements 1 to 6 plotted with their occurrence probabilities, e.g. 3 (0.4), 4 (0.8), 5 (0.1), 6 (0.5)] • Problem: how to continuously compute skylines in a sliding window of size N (elements)? • Example sliding window: N = 5

  4. Background Multi-criteria decision making regarding uncertain data: • Online auction • Financial market • … …

  5. Related work • Probabilistic skyline computation: probabilistic skyline (VLDB07); probabilistic reverse skyline (SIGMOD08) • Uncertain stream processing: probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07); frequent items on uncertain streams (SIGMOD08); top-k queries over uncertain sliding window (VLDB08) • … …

  6. Models and Problem Definition • Model: DS is a stream of elements; each element a lies in a d-dimensional space and has an occurrence probability P(a) in (0, 1]. The skyline probability of an element a is Psky(a) = P(a) * ∏ (1 – P(a')), where the product is taken over all elements a' that dominate a. • Problem definition: from the most recent N elements, retrieve those whose skyline probability is no less than a given threshold q.
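A brute-force reading of this definition may help fix ideas. The sketch below assumes that the skyline probability of a is P(a) times the product of (1 – P(a')) over all dominating a' (consistent with the Psky(3) example on slide 11), and that "dominates" means no worse in every dimension and strictly better in at least one, with smaller values preferred. The function names and the sample coordinates are hypothetical.

```python
from math import prod

def dominates(a, b):
    """Assumed convention: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline_probability(i, window):
    """P(a) times the product of (1 - P(a')) over all a' in the window dominating a."""
    point, p = window[i]
    return p * prod(1.0 - pj
                    for j, (other, pj) in enumerate(window)
                    if j != i and dominates(other, point))

def probabilistic_skyline(stream, n, q):
    """Elements among the most recent n whose skyline probability is at least q."""
    window = stream[-n:]
    return [window[i] for i in range(len(window))
            if skyline_probability(i, window) >= q]

# Hypothetical 2-d stream: (coordinates, occurrence probability), oldest first.
stream = [((1, 1), 0.4), ((2, 2), 0.3), ((3, 3), 0.9), ((0, 5), 0.8)]
result_n4 = probabilistic_skyline(stream, n=4, q=0.5)
result_n2 = probabilistic_skyline(stream, n=2, q=0.5)
```

This costs O(N^2) dominance checks per query, which is what the candidate set and the R-tree based algorithms on the later slides are meant to avoid.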

  7. Challenges and Contributions • Space efficiency • Contribution: space reduction from O(N) to O(ln^(d-1) N) • Time efficiency • Contribution: R-tree based efficient incremental algorithms

  8. Outline • Background and Preliminaries • Framework • Algorithms • Experiment • Conclusion

  9. Framework: what to keep? Window size N: 5; probability threshold q: 0.5. Pold(2) = 1 – P(1); Pnew(2) = (1 – P(3)) * (1 – P(4)). Since Pnew(2) < q, element 2 will never become a skyline point in the window. [Figure: elements 1 (0.1), 2 (0.1), 3 (0.4), 4 (0.8), 5 (0.1)]
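The split into an "old" and a "new" factor can be sketched as follows. This is an illustration under assumptions: Pold(a) is taken as the product of (1 – P(a')) over dominating elements that arrived before a, Pnew(a) as the same product over dominating elements that arrived after a, and smaller coordinates are preferred. The coordinates are made up so that element 1 dominates element 2 from the old side and elements 3 and 4 dominate it from the new side, as in the slide; `Fraction` is used so that comparisons against q are exact.

```python
from fractions import Fraction as F

def dominates(a, b):
    """Assumed convention: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def old_new_factors(window):
    """For each element (oldest first) return (Pold, Pnew)."""
    factors = []
    for i, (pt, _) in enumerate(window):
        p_old, p_new = F(1), F(1)
        for j, (other, pj) in enumerate(window):
            if j == i or not dominates(other, pt):
                continue
            if j < i:
                p_old *= 1 - pj   # dominating element arrived earlier
            else:
                p_new *= 1 - pj   # dominating element arrived later
        factors.append((p_old, p_new))
    return factors

# Probabilities follow slide 9; coordinates are hypothetical.
window = [((1, 5), F(1, 10)),   # element 1
          ((5, 5), F(1, 10)),   # element 2
          ((5, 1), F(2, 5)),    # element 3
          ((3, 3), F(4, 5)),    # element 4
          ((6, 0), F(1, 10))]   # element 5
q = F(1, 2)
factors = old_new_factors(window)
# Elements whose Pnew is still at least q are worth keeping.
kept = [k + 1 for k, (_, p_new) in enumerate(factors) if p_new >= q]
```

Because newer elements expire after older ones, Pnew(a) can only shrink while a stays in the window, which is why an element with Pnew below q can be discarded for good.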

  10. Framework: what to keep? • Candidate set SN,q: the elements in the window whose Pnew is no less than q • Correctness: • (1) no missing skyline points • (2) no false hits to determine SN,q • (3) no false positives to determine skyline results • (4) no false negatives to determine skyline results • Note: a probability computed from SN,q may not be accurate, but it still satisfies the threshold requirement.

  11. Framework • Space required for SN,q: SN,q is the minimum information to be maintained to get a correct answer. Example (window size N: 4; probability threshold q: 0.5): Psky(3) = 0.9 * (1 – 0.4) * (1 – 0.3) < q in one window state; Psky(3) = 0.9 > q in another. [Figure: two window states containing element 3 (0.9)]
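Written out, the arithmetic behind the two Psky(3) values on this slide is as follows (q = 0.5 as stated on the slide; nothing else is assumed):

```latex
\begin{align*}
P_{sky}(3) &= 0.9 \times (1 - 0.4) \times (1 - 0.3) = 0.9 \times 0.6 \times 0.7 = 0.378 < 0.5 = q \\
P_{sky}(3) &= 0.9 > 0.5 = q
\end{align*}
```

The same element falls on either side of the threshold depending on which dominating elements are present, so it cannot be dropped just because its current skyline probability is below q.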

  12. Space of Candidate Set • Theorem: under uniform distributions, the candidate set requires poly-logarithmic space in the average case, O(f(q) ln^(d-1) N).

  13. Outline • Background and Preliminaries • Framework • Algorithms • Experiment • Conclusion

  14. Algorithms • We maintain two R-trees • R1: SKYN,q (the skylines) • R2: SN,q – SKYN,q (the candidates that are not skylines)

  15. Algorithms Window size N: 13; probability threshold q: 0.2. [Figure: elements 1(.1), 2(.1), 3(.4), 4(.1), 5(.8), 6(.8), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1), each marked as in R1 (SKYN,q), in R2 (SN,q – SKYN,q), or not in SN,q]

  16. Algorithms • New element arrives: check Psky & Pnew on R1; check Pnew on R2; handle elements with Pnew < q • Old element expires: update Pold; check Psky on R2
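The arrival and expiry steps can be sketched without R-trees, using a plain list as the candidate set. This is a simplified illustration, not the slides' algorithm: it assumes the candidate set holds the in-window elements with Pnew no less than q, keeps Pnew incrementally, and recomputes the old-side factor on demand from the surviving candidates instead of updating Pold in place. The class and method names are hypothetical, and smaller coordinates are assumed to be preferred.

```python
def dominates(a, b):
    """Assumed convention: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

class WindowSkyline:
    """List-based stand-in for the two R-trees (hypothetical helper)."""

    def __init__(self, n, q):
        self.n, self.q = n, q
        self.clock = 0          # arrival counter used as a timestamp
        self.cands = []         # candidate entries, oldest first

    def insert(self, point, p):
        t = self.clock
        self.clock += 1
        # Old element expires: keep only timestamps t-n+1 .. t.
        self.cands = [c for c in self.cands if c["t"] > t - self.n]
        # New element arrives: scale Pnew of every candidate it dominates.
        for c in self.cands:
            if dominates(point, c["pt"]):
                c["pnew"] *= 1 - p
        # Candidates whose Pnew dropped below q can never qualify again.
        self.cands = [c for c in self.cands if c["pnew"] >= self.q]
        self.cands.append({"t": t, "pt": point, "p": p, "pnew": 1})

    def skyline(self):
        """Candidates whose skyline probability (within the candidate set) is >= q."""
        out = []
        for i, c in enumerate(self.cands):
            psky = c["p"] * c["pnew"]
            for older in self.cands[:i]:      # old-side factor, recomputed
                if dominates(older["pt"], c["pt"]):
                    psky *= 1 - older["p"]
            if psky >= self.q:
                out.append((c["pt"], c["p"]))
        return out

ws = WindowSkyline(n=3, q=0.5)
for pt, p in [((2, 2), 0.9), ((1, 1), 0.8), ((0, 3), 0.6), ((4, 0), 0.7)]:
    ws.insert(pt, p)
current = ws.skyline()
```

Each arrival scans the whole candidate list here; the R-trees in the slides exist to avoid that scan by handling whole groups of entries at once.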

  17. Algorithms: new element arrives Window size N: 13; probability threshold q: 0.2; new element: 14(0.8). Delete an entry: • Before update: Pnew: (1, 1); Psky: (0.8, 0.8); global Pnew = 1 – 0.2 • After update: global Pnew *= 1 – 0.8 • Delete from R1 [Figure: R1 (SKYN,q) and R2 (SN,q – SKYN,q) as element 14 arrives]

  18. Algorithms: new element arrives Window size N: 13; probability threshold q: 0.2; new element: 14(0.8). Move an entry from R1 to R2: • Before update: Pnew: (1, 1); Psky: (0.24, 0.6); global Pnew = 1 • After update: global Pnew *= 1 – 0.8; min Pnew = 0.2 ≥ q; max Psky = 0.12 < q • Move from R1 to R2 [Figure: R1 (SKYN,q) and R2 (SN,q – SKYN,q) as element 14 arrives]

  19. Algorithms: new element arrives Window size N: 13; probability threshold q: 0.2; new element: 14(0.8). • Before update: Pnew: (0.9, 1); global Pnew = 1 • After update: global Pnew *= 1 – 0.8; min Pnew < q; max Pnew ≥ q • Drill down and delete 2 [Figure: R1 (SKYN,q) and R2 (SN,q – SKYN,q) as element 14 arrives]
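The three cases on slides 17 to 19 can be read as one decision rule applied to an R-tree entry that the new element fully dominates. The sketch below is an interpretation, not the slides' code: it assumes each entry stores (min, max) ranges of Pnew and Psky over its subtree plus a lazily applied "global Pnew" multiplier, that the stored Psky range is the current value before the arrival, and that `classify_entry` and its action labels are hypothetical. `Fraction` keeps comparisons such as 0.2 ≥ q exact, which floating point would not.

```python
from fractions import Fraction as F

def classify_entry(pnew_range, psky_range, global_pnew, p_new, q, in_r1):
    """Return (action, updated global Pnew) for an entry fully dominated by
    a new element with occurrence probability p_new."""
    factor = 1 - p_new
    global_pnew = global_pnew * factor      # recorded once on the entry
    pnew_min = pnew_range[0] * global_pnew
    pnew_max = pnew_range[1] * global_pnew
    if pnew_max < q:
        return "delete", global_pnew        # nothing below can stay a candidate
    if pnew_min < q:
        return "drill_down", global_pnew    # mixed: inspect children
    if in_r1:
        if psky_range[1] * factor < q:
            return "move_to_R2", global_pnew    # no longer skylines
        if psky_range[0] * factor < q:
            return "drill_down", global_pnew    # some children leave R1
    return "keep", global_pnew

q, p14 = F(1, 5), F(4, 5)   # threshold 0.2, new element 14 with P = 0.8
slide17 = classify_entry((F(1), F(1)), (F(4, 5), F(4, 5)), F(4, 5), p14, q, True)
slide18 = classify_entry((F(1), F(1)), (F(6, 25), F(3, 5)), F(1), p14, q, True)
slide19 = classify_entry((F(9, 10), F(1)), None, F(1), p14, q, False)
```

Recording the factor on the entry rather than on every element below it is what lets a single arrival update a whole subtree in one step.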

  20. Algorithms: new element arrives Window size N: 13; probability threshold q: 0.2. Update Pold: • Update Pold of 12 & 13 • global Pold /= (1 – 0.1) [Figure: R1 (SKYN,q) and R2 (SN,q – SKYN,q) after element 2 is deleted]

  21. Algorithms: new element arrives Window size N: 13; probability threshold q: 0.2. Insert new element: • Pnew = 1 • Compute Psky [Figure: R1 (SKYN,q) and R2 (SN,q – SKYN,q) with element 14(0.8) inserted]

  22. Algorithm: old element expires • Delete it from R1 or R2. • Update Pold of remaining elements: • Record globalPold on intermediate entries fully dominated by it • Check Psky after update

  23. Algorithms: old element expires Window size N: 13; probability threshold q: 0.2. • Pold(7) /= 1 – P(3) • global Pold /= 1 – P(4) [Figure: R1 and R2 while old elements expire]

  24. Algorithms: handling multiple thresholds • Continuous queries • Users specify k probability thresholds q1, …, qk. (qi < qi-1) • Solution: instead of maintaining R1, we maintain R1, …, Rk, each corresponding to a confidence value. • Ad-hoc queries • Users issue a query: retrieve skylines with probability at least q’ (q’ ≥ qk) • Solution: find an Ri with qi ≤ q’ < qi-1. Then all elements in {Rj: j < i -1} are results. We search Ri-1 to output qualified skylines
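One way to picture the ad-hoc query is with plain lists standing in for the R-trees. In this sketch the thresholds are sorted in descending order and bucket i holds the elements whose skyline probability is at least thresholds[i] but below thresholds[i-1]; this 0-based bucket indexing is chosen for the illustration and may not match the slides' Ri numbering. The function names and sample values are hypothetical.

```python
def build_buckets(psky, thresholds):
    """Place each element in the first (highest) threshold bucket it reaches.
    Elements below the lowest threshold are not kept at all."""
    buckets = [[] for _ in thresholds]
    for elem, p in psky.items():
        for i, t in enumerate(thresholds):
            if p >= t:
                buckets[i].append(elem)
                break
    return buckets

def adhoc_query(buckets, thresholds, psky, q_prime):
    """Skylines with probability at least q_prime (q_prime >= lowest threshold)."""
    if q_prime < thresholds[-1]:
        raise ValueError("q' must be at least the smallest maintained threshold")
    result = []
    for i, t in enumerate(thresholds):
        if t >= q_prime:
            result.extend(buckets[i])       # whole bucket qualifies
        else:
            # Boundary bucket: only part of it reaches q'; later buckets cannot.
            result.extend(e for e in buckets[i] if psky[e] >= q_prime)
            break
    return result

psky = {"a": 0.9, "b": 0.55, "c": 0.45, "d": 0.25, "e": 0.1}
thresholds = [0.8, 0.5, 0.2]                # q1 > q2 > q3
buckets = build_buckets(psky, thresholds)
answer = adhoc_query(buckets, thresholds, psky, q_prime=0.4)
```

Only the single boundary bucket needs to be searched element by element; every bucket above it is reported wholesale, which is the saving the slide describes.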

  25. Experiment • Data set: • Real: stock transactions. 2-d. probability assigned randomly. Size: 2 million • Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2d to 5d; 2 million • Default values: p : 0.3; d: 3; N : 1M; spatial distribution: anti-correlated; probability: uniform;

  26. Experiment: space Space usage is about 0.1% of the sliding window size for 2-d data; around 89% of the space is saved even for 5-d data.

  27. Experiment: space The size of SN,q decreases as Pu increases, while the size of SKYN,q increases with it.

  28. Experiment: space

  29. Experiment: time

  30. Experiment: time Maintenance time increases with the number of probability thresholds; query time decreases with it.

  31. Conclusion • We characterize a candidate set with minimum size and propose time efficient techniques. • We extend the framework to handle multiple thresholds.

  32. Thanks !
