Randomized Multi-pass Streaming Skyline Algorithm. Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu. Georgia Tech. VLDB 2009.
In one sentence: “We develop a streaming algorithm for the skyline problem with a near-optimal worst-case guarantee.”
I want a cheap hotel nearby
[Figure: hotels plotted with axes Price and Distance; labels: de la Cite Mercure Park & Suites Athena du Helder; a "dominates" arrow marks one hotel dominating another]
Problem definition
• Given distinct d-dimensional points
• (a1, …, ad) dominates (b1, …, bd) if ai ≤ bi for all i and ai' < bi' for some i'
• Skyline = set of undominated points
Example: points (1, 3), (5, 2), (3, 2)
• (3, 2) dominates (5, 2)
• Skyline = { (1, 3), (3, 2) }
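The definition above can be checked on the slide's three-point example with a brute-force sketch. This is only an illustration of the definition, not the streaming algorithm of the talk; the function names `dominates` and `skyline` are chosen here for illustration, and the quadratic scan assumes all points fit in memory.

```python
def dominates(a, b):
    """True if a dominates b: a_i <= b_i for all i and a_i < b_i for some i."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Brute force: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 3), (5, 2), (3, 2)]
print(skyline(points))  # [(1, 3), (3, 2)]
```

Smaller is better in every coordinate here, which matches the "cheap and nearby" hotel example.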
Skyline algorithms
RAM
• DD&C: Kung et al. FOCS'75
• LD&C: Bentley et al. JACM'78
• FLET: Bentley et al. SODA'90
Disk (External), Preprocessing
• BBS: Papadias et al. SIGMOD'03
• NN: Kossmann et al. VLDB'02
Disk (External), Non-preprocessing
• SD&C: Borzsonyi et al. ICDE'01
• BNL: Borzsonyi et al. ICDE'01
• SFS: Chomicki et al. ICDE'03
• LESS: Godfrey et al. VLDB'05
Our Goal “Non-preprocessing external algorithm with worst-case guarantee” What is the model of external algorithms?
Models for external algorithms
• CPU processing ≠ I/O
• Sequential I/O ≠ Random I/O
Multi-pass Streaming Model: # of random I/O's = # of passes
Streaming model naturally forces us to minimize the number of random I/O's
Multi-pass Streaming model
• Huge hard disk: (1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)
• Small RAM
• The points are scanned in order, one full scan per pass: 1st pass, 2nd pass, 3rd pass
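The model can be mimicked in a few lines: the data may only be read front to back, and the cost measure is the number of scans. This is a minimal sketch; the class `MultiPassStream` and its `scan` method are hypothetical names introduced here, and an in-memory tuple stands in for the hard disk.

```python
class MultiPassStream:
    """Data that can only be read sequentially; counts how many passes were made."""

    def __init__(self, points):
        self._points = tuple(points)  # stands in for the huge hard disk
        self.passes = 0

    def scan(self):
        """One sequential pass over all points."""
        self.passes += 1
        yield from self._points


disk = MultiPassStream([(1, 2), (3, 7), (5, 3), (2, 5), (4, 1), (9, 9)])

# Pass 1: remember only one number (small RAM).
best_x = min(p[0] for p in disk.scan())
# Pass 2: use what was learned in pass 1.
winners = [p for p in disk.scan() if p[0] == best_x]

print(winners, disk.passes)  # [(1, 2)] 2
```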
Our Goal “Non-preprocessing streaming algorithm with worst-case guarantee” ("external" is now made precise as "streaming")
Main results
Theory
• RAND uses O(log n) passes & O(m) space
• Every algorithm that uses 1 pass needs Ω(n) space
RAND: Almost optimal multi-pass streaming algorithm for skyline
n = # of points and m = skyline size
Next: RAND algorithm
Later: Experimental result
Algorithms: Main Idea Suppose m is known. Theorem: In 3 passes and m space, we can find skyline points that “dominate” at least n/2 points, with high probability
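A short calculation shows how such a theorem would give the O(log n) pass bound from the main results. It is a sketch under two assumptions not stated on the slide: every round succeeds (the failure probability is ignored), and n_k denotes the number of points still left after k rounds.

```latex
n_0 = n, \qquad n_k \le \tfrac{1}{2}\, n_{k-1}
\;\Longrightarrow\; n_k \le \frac{n}{2^{k}},
\qquad\text{so } n_k < 1 \text{ once } k > \log_2 n .

\text{rounds} \le \lfloor \log_2 n \rfloor + 1,
\qquad
\text{passes} \le 3\,\bigl(\lfloor \log_2 n \rfloor + 1\bigr) = O(\log n).
```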
Eliminate-Points algorithm
1. Sample x = 2m ln(mn log n) points p1, p2, …, px
2. Go through the stream, replace each pi by a point dominating it
3. For each pi, delete pi and all points it dominates
Output p1, p2, …, px and repeat
Example stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
• Sampled point: (4, 4)
• During the pass, (4, 4) is replaced by (3, 4), which is then replaced by (3, 3)
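The three steps can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the names `eliminate_points` and `skyline_by_rounds` are hypothetical, a Python list stands in for the stream, the surviving points are kept in memory for simplicity, and the sample size `x` is passed in directly instead of being computed from the formula.

```python
import random


def dominates(a, b):
    """a dominates b: no worse in every coordinate, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def eliminate_points(stream, samples):
    """Steps 2 and 3 for given samples: returns (points found, surviving points)."""
    current = list(samples)
    # Step 2: one pass; each sample moves to any stream point that dominates it.
    for q in stream:
        for i, p in enumerate(current):
            if dominates(q, p):
                current[i] = q
    found = set(current)
    # Step 3: one more pass; drop the found points and everything they dominate.
    survivors = [q for q in stream
                 if q not in found and not any(dominates(p, q) for p in found)]
    return found, survivors


def skyline_by_rounds(points, x, seed=0):
    """Repeat sample / replace / delete until no points remain."""
    rng = random.Random(seed)
    remaining, result = list(points), set()
    while remaining:
        samples = rng.sample(remaining, min(x, len(remaining)))  # Step 1
        found, remaining = eliminate_points(remaining, samples)
        result |= found
    return result


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
found, survivors = eliminate_points(stream, [(4, 4)])
print(found, survivors)  # {(3, 3)} [(1, 5)]
```

On the slide's example the sample (4, 4) climbs to (3, 3), and deleting (3, 3) together with everything it dominates leaves only (1, 5) for the next round.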
Analysis Theorem: Eliminate-Points algorithm deletes at least n/2 points with high probability
Analysis
• Draw trees: Each point points to its first dominating point
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
• Tree rooted at (1, 5): (4, 5) points to (1, 5)
• Tree rooted at (3, 3): (3, 4) and (4, 3) point to (3, 3); (4, 4) points to (3, 4)
• Note: There will be m trees, each rooted by a skyline point
• Claim: A tree in which some element is sampled will be deleted. Example: sampling (4, 4) leads to (3, 3), and the whole tree rooted at (3, 3) is deleted
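The trees for the six-point example can be built mechanically. The sketch below assumes that "first dominating point" means the first point in stream order that dominates the given one; `build_forest` and `root_of` are names introduced here for illustration.

```python
from collections import Counter


def dominates(a, b):
    """a dominates b: no worse in every coordinate, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def build_forest(stream):
    """Map each point to the first point in the stream that dominates it (None for roots)."""
    return {p: next((q for q in stream if dominates(q, p)), None) for p in stream}


def root_of(p, parent):
    """Follow parent pointers up to the root of p's tree."""
    while parent[p] is not None:
        p = parent[p]
    return p


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
parent = build_forest(stream)
tree_sizes = Counter(root_of(p, parent) for p in stream)

print(parent[(4, 4)])    # (3, 4)
print(dict(tree_sizes))  # {(1, 5): 2, (3, 3): 4}
```

The roots are exactly the undominated points, so the number of trees equals the skyline size m (here m = 2).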
Analysis
• There are m trees, each rooted by a skyline point (trees 1, 2, …, m-1, m)
• A big tree has a bigger chance of being sampled … and deleted
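The "bigger chance" statement can be made concrete with a small calculation. This sketch assumes the x samples are independent uniform draws with replacement, which the slides do not specify; `miss_probability` is a hypothetical helper, and the tree sizes 4 and 2 out of n = 6 are taken from one reading of the six-point example.

```python
def miss_probability(tree_size, n, x):
    """Chance that none of x uniform draws (with replacement) lands in a tree."""
    return (1 - tree_size / n) ** x


n, x = 6, 2
big = miss_probability(4, n, x)    # tree with 4 of the 6 points
small = miss_probability(2, n, x)  # tree with 2 of the 6 points
print(big < small)  # True
```

Under this assumption the miss probability shrinks geometrically in x, and it shrinks faster for larger trees, so the trees most likely to be hit are the ones whose deletion removes the most points.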