Progressive Computation of The Min-Dist Optimal-Location Query

Donghui Zhang, Yang Du, Tian Xia, Yufei Tao*
Northeastern University
* Chinese University of Hong Kong
VLDB’06, Seoul, Korea

Motivation
• “What is the optimal location in the Boston area to build a new McDonald’s store?”
• Suppose a customer drives to the closest McDonald’s.
• Optimality: minimize the AVG driving distance.
Optimal Location Query

min-dist OL
[Figure: four objects whose nearest existing sites are at distances 200, 200, 600, and 600; candidate site l1 is at distance 30 from the first two objects, candidate site l2 at distance 30 from the last two]
• Without any new site: AD = (200+200+600+600)/4 = 400.
• With new site l1: AD(l1) = (30+30+600+600)/4 = 315.
• With new site l2: AD(l2) = (200+200+30+30)/4 = 115.

Formal Definition
• Given a set S of sites, a set O of objects, and a query range Q,
• min-dist OL is a location $l \in Q$ which minimizes
  $AD(l) = \frac{1}{|O|} \sum_{o \in O} d_{NN}(o, S \cup \{l\})$
  where $d_{NN}(o, S \cup \{l\})$ is the distance between o and its nearest site.

L1 Distance
• d(o, s) = |o.x - s.x| + |o.y - s.y|
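The definition can be checked numerically with a minimal sketch. The function names and the coordinates below are hypothetical; the coordinates are chosen only so that the L1 distances reproduce the 400 / 315 / 115 example from the earlier slides.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def l1(a: Point, b: Point) -> float:
    """L1 (Manhattan) distance between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def average_distance(objects: Sequence[Point], sites: Sequence[Point],
                     new_site: Optional[Point] = None) -> float:
    """AD when new_site is None, otherwise AD(new_site)."""
    all_sites = list(sites) + ([new_site] if new_site is not None else [])
    # Each object contributes the distance to its nearest site.
    return sum(min(l1(o, s) for s in all_sites) for o in objects) / len(objects)


# Hypothetical layout: two objects 200 away from site 1, two 600 away from site 2.
objects = [(0, 0), (60, 0), (1000, 0), (1060, 0)]
sites = [(30, 170), (1030, 570)]
print(average_distance(objects, sites))             # 400.0
print(average_distance(objects, sites, (30, 0)))    # 315.0
print(average_distance(objects, sites, (1030, 0)))  # 115.0
```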

Challenges
• There are infinitely many locations in Q. How to produce a finite set of candidates (while keeping optimality)?
• How to avoid computing AD(l) for all candidates?

Solution Highlights
• Algorithm to compute AD(l).
• Theorems to limit #candidates.
• Lower bound of AD(l) for all locations l in a cell C.
• Progressive algorithm.

1. Compute AD(l)
• Remember: $AD(l) = \frac{1}{|O|} \sum_{o \in O} d_{NN}(o, S \cup \{l\})$
• Define: $AD = \frac{1}{|O|} \sum_{o \in O} d_{NN}(o, S)$
• Let RNN(l) be the objects “attracted” by l.
• AD(l) = AD if RNN(l) = ∅
• AD(l) = AD - ? (average savings for customers in RNN(l))
[Figure: a location l with RNN(l) = ∅, where AD(l) = AD, and a location l with RNN(l) = {o7, o8}, where AD(l) < AD]

1. Compute AD(l)
• Theorem: $AD(l) = AD - \frac{1}{|O|} \sum_{o \in RNN(l)} \bigl( d_{NN}(o, S) - d(o, l) \bigr)$
• S and O are “static” versus l.
  • AD can be pre-computed.
  • So can $d_{NN}(o, S)$.
• To compute AD(l):
  • Find RNN(l).
  • ∀o ∈ RNN(l), compute d(o, l).
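A minimal sketch of this computation follows. It assumes a plain linear scan in place of a real RNN query on a spatial index; `precompute` and `ad_with_new_site` are hypothetical names, and the test data is an invented layout.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def l1(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def precompute(objects: Sequence[Point], sites: Sequence[Point]):
    """Static part: dNN(o, S) for every object, and AD."""
    d_nn = [min(l1(o, s) for s in sites) for o in objects]
    return d_nn, sum(d_nn) / len(objects)


def ad_with_new_site(l: Point, objects: Sequence[Point],
                     d_nn: List[float], ad: float) -> float:
    """AD(l) = AD minus the savings of the objects in RNN(l), divided by |O|."""
    savings = 0.0
    for o, d in zip(objects, d_nn):
        d_l = l1(o, l)
        if d_l < d:              # o is "attracted" by l, i.e. o is in RNN(l)
            savings += d - d_l
    return ad - savings / len(objects)


objects = [(0, 0), (60, 0), (1000, 0), (1060, 0)]   # hypothetical layout
sites = [(30, 170), (1030, 570)]
d_nn, ad = precompute(objects, sites)
print(ad, ad_with_new_site((30, 0), objects, d_nn, ad))
```

Only the objects in RNN(l) contribute to the loop's result, which is why an index that returns RNN(l) directly would avoid touching the rest of O.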

2. Limit #candidates
• Theorem: within the X/Y range of Q, draw horizontal and vertical grid lines through the objects. Only the intersections need to be considered!
[Figure: query range Q with grid lines through the objects, giving 5x6=30 candidates]

2. Limit #candidates
• Proof idea: suppose the OL is not at an intersection; moving it to one produces a better (or equal) result.
• Consider RNN(l).
• Move l to the right by δ ⇒ saves total dist.
[Figure: location l shifted by δ]
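The candidate set from the theorem can be sketched as below. Including the border lines of Q is an assumption made here so that the set is never empty and a location pushed against the border still lands on a candidate; `grid_candidates` and the rectangle encoding `(x_lo, y_lo, x_hi, y_hi)` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)


def grid_candidates(objects: Sequence[Point], q: Rect) -> List[Point]:
    """Intersections of the grid lines through objects, restricted to Q."""
    x_lo, y_lo, x_hi, y_hi = q
    # Vertical lines: x-coordinates of objects inside Q's X range (plus Q's border).
    xs = sorted({x for x, _ in objects if x_lo <= x <= x_hi} | {x_lo, x_hi})
    # Horizontal lines: y-coordinates of objects inside Q's Y range (plus Q's border).
    ys = sorted({y for _, y in objects if y_lo <= y <= y_hi} | {y_lo, y_hi})
    return [(x, y) for x in xs for y in ys]


cands = grid_candidates([(1, 5), (2, 9), (7, 3), (20, 20)], (0, 0, 10, 10))
print(len(cands))
```

An object outside Q in one dimension still contributes a line in the other dimension, which matches "within the X/Y range of Q" on the slide.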

2. VCU(Q)
• A spatial region, enclosing the objects closer to Q than to sites in S.
• It’s the Voronoi cell of Q versus sites in S.

2. Further Limit #candidates
• Only consider objects in VCU(Q).
[Figure: with only the objects in VCU(Q), the grid shrinks from 5x6=30 candidates to 4x4=16 candidates]
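One way to apply this filter, as a sketch: keep an object only if its L1 distance to the rectangle Q is smaller than its distance to the nearest existing site. This is a per-object membership test under that reading of "closer to Q than to sites in S"; it does not construct the VCU region itself, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)


def l1(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dist_to_rect(p: Point, q: Rect) -> float:
    """L1 distance from p to the closest point of rectangle q (0 if inside)."""
    x_lo, y_lo, x_hi, y_hi = q
    dx = max(x_lo - p[0], 0, p[0] - x_hi)
    dy = max(y_lo - p[1], 0, p[1] - y_hi)
    return dx + dy


def objects_in_vcu(objects: Sequence[Point], sites: Sequence[Point],
                   q: Rect) -> List[Point]:
    """Objects that some location in Q could attract away from their nearest site."""
    return [o for o in objects
            if dist_to_rect(o, q) < min(l1(o, s) for s in sites)]


kept = objects_in_vcu([(9, 9), (1, 1), (11, 11)], [(0, 0)], (10, 10, 12, 12))
print(kept)
```

Objects that fail the test can never be in RNN(l) for any l in Q, so dropping them does not change which location is optimal.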

Naïve Algorithm
• Derive candidates.
• Compute AD(l) for each.
• Pick smallest.
• Not efficient! Too many candidates! To compute AD(l) for each one, need to:
  • compute RNN(l)
  • retrieve all these objects…
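For reference, the naïve algorithm might look like the sketch below. It assumes in-memory lists instead of an index, includes Q's border lines among the grid lines (an assumption), and uses a hypothetical object layout.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)


def l1(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def naive_ol(objects: Sequence[Point], sites: Sequence[Point], q: Rect):
    """Evaluate AD(l) at every grid intersection in Q and return the best one."""
    x_lo, y_lo, x_hi, y_hi = q
    xs = sorted({x for x, _ in objects if x_lo <= x <= x_hi} | {x_lo, x_hi})
    ys = sorted({y for _, y in objects if y_lo <= y <= y_hi} | {y_lo, y_hi})
    d_nn = [min(l1(o, s) for s in sites) for o in objects]

    def ad(l: Point) -> float:
        # Every candidate triggers a full pass over the objects: the costly part.
        return sum(min(d, l1(o, l)) for o, d in zip(objects, d_nn)) / len(objects)

    best = min(((x, y) for x in xs for y in ys), key=ad)
    return best, ad(best)


objects = [(0, 0), (60, 0), (1000, 0), (1060, 0)]   # hypothetical layout
sites = [(30, 170), (1030, 570)]
print(naive_ol(objects, sites, (0, -50, 1100, 50)))
```

The cost is (#candidates) x (cost of one AD(l)), which is what the progressive algorithm tries to avoid.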

Progressive Idea
• Treat Q as a cell and consider its corners.
• Divide the cell.
• Recursively divide a sub-cell.
• Able to check all candidates.

Progressive Idea
• Q: What do you save?
• A: Cell pruning, if its lower bound ≥ AD(l0) of some candidate l0.
[Figure: AD(l0) = 50; suppose 60 is a lower bound for AD(l), ∀l ∈ C; then cell C can be pruned]

3. LB(C): lower bound for AD(l), ∀l ∈ C
[Figure: cell C with corners c1, c2, c3, c4, where AD(c1)=1000, AD(c2)=3000, AD(c3)=4000, AD(c4)=2500; c1/c4 and c2/c3 are the diagonally opposite pairs]
• Theorem: $LB(C) = \frac{\max\{AD(c_1) + AD(c_4),\ AD(c_2) + AD(c_3)\}}{2} - \frac{p}{4}$ is a lower bound, where p is the perimeter of C.
• e.g. LB(C) = 3500 - p/4
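A sketch of this bound, assuming the diagonal pairs are (c1, c4) and (c2, c3) as the example value 3500 = (3000+4000)/2 suggests. The intuition, stated here as a reading rather than the authors' proof: moving the new site by L1 distance δ changes AD by at most δ, and for any l in C the L1 distances to two opposite corners add up to p/2. The function name is hypothetical.

```python
def lower_bound(ad_c1: float, ad_c2: float, ad_c3: float, ad_c4: float,
                perimeter: float) -> float:
    """LB(C) from the AD values at the four corners of cell C.

    Assumes c1/c4 and c2/c3 are the diagonally opposite corner pairs.
    """
    return max(ad_c1 + ad_c4, ad_c2 + ad_c3) / 2 - perimeter / 4


# Corner values from the slide, with a hypothetical perimeter of 40.
print(lower_bound(1000, 3000, 4000, 2500, 40))  # 3490.0
```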

3. LB(C): lower bound for AD(l), ∀l ∈ C
• A better lower bound Theorem:
• Compared with the previous lower bound:
  • Higher quality since the lower bound is larger.
  • More computation.

4. The Progressive Algorithm
1. Maintain a heap of cells ordered by LB(). Initially one cell: Q.
2. Maintain the best candidate lopt.
3. Pick the cell with minimum LB() and partition it.
4. Compute AD() for the corners of sub-cells.
5. Compute LB() for the sub-cells.
6. Insert sub-cell ci into the heap if LB(ci) < AD(lopt).
7. Goto 3.
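The loop above can be sketched in a few dozen lines. Several simplifications are assumptions of this sketch, not the authors' implementation: cells are index ranges over the candidate grid lines (with Q's border lines included), each step splits a cell in two rather than into a batch, the corner-based lower bound is used, and AD() is a linear scan rather than an R*-tree query.

```python
import heapq
from functools import lru_cache


def progressive_ol(objects, sites, q):
    """Return (location, AD(location)) for the best candidate in Q = (x_lo, y_lo, x_hi, y_hi)."""
    def l1(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    d_nn = [min(l1(o, s) for s in sites) for o in objects]
    x_lo, y_lo, x_hi, y_hi = q
    xs = sorted({x for x, _ in objects if x_lo <= x <= x_hi} | {x_lo, x_hi})
    ys = sorted({y for _, y in objects if y_lo <= y <= y_hi} | {y_lo, y_hi})

    @lru_cache(maxsize=None)
    def ad(i, j):
        l = (xs[i], ys[j])
        return sum(min(d, l1(o, l)) for o, d in zip(objects, d_nn)) / len(objects)

    def corners(c):
        i0, i1, j0, j1 = c
        return [(i0, j0), (i1, j0), (i0, j1), (i1, j1)]

    def lb(c):
        i0, i1, j0, j1 = c
        p = 2 * ((xs[i1] - xs[i0]) + (ys[j1] - ys[j0]))
        return max(ad(i0, j0) + ad(i1, j1), ad(i1, j0) + ad(i0, j1)) / 2 - p / 4

    root = (0, len(xs) - 1, 0, len(ys) - 1)
    best = min(corners(root), key=lambda ij: ad(*ij))
    heap = [(lb(root), root)]                      # step 1
    while heap:
        bound, (i0, i1, j0, j1) = heapq.heappop(heap)   # step 3
        if bound >= ad(*best):
            break            # no remaining cell can beat lopt
        if i1 - i0 <= 1 and j1 - j0 <= 1:
            continue         # all candidates in this cell are corners, already evaluated
        if i1 - i0 >= j1 - j0:
            m = (i0 + i1) // 2
            subs = [(i0, m, j0, j1), (m, i1, j0, j1)]
        else:
            m = (j0 + j1) // 2
            subs = [(i0, i1, j0, m), (i0, i1, m, j1)]
        for c in subs:
            best = min(corners(c) + [best], key=lambda ij: ad(*ij))  # steps 2, 4
            if lb(c) < ad(*best):                                    # steps 5, 6
                heapq.heappush(heap, (lb(c), c))
    return (xs[best[0]], ys[best[1]]), ad(*best)
```

At any point, AD(lopt) and the smallest LB() in the heap bracket the optimal value, so the loop can be stopped early with a bounded error.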

Progressiveness
• The algorithm quickly reports a candidate OL with a confidence interval, and keeps refining.
• User may choose to terminate any time.
[Figure: over time, the interval between the upper bound AD(best candidate), initially AD(best corner of Q), and the lower bound Min{ LB(C) | C in heap }, initially LB(Q), shrinks; AD(real OL) is inside the interval]

Batch Partitioning
• To partition a cell, partition it into multiple sub-cells at once.
• Reason: computing AD(l) requires accessing the R*-tree of objects; each access should serve the computation of multiple AD(l) values.
• Tradeoff: partitioning too finely is wasteful, since some candidates could have been pruned.
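A sketch of what a batch split could look like, assuming cells are index ranges `(i0, i1, j0, j1)` over the sorted candidate grid lines and `k` is the number of pieces per dimension. Both the encoding and the function names are hypothetical; the slides do not specify how the sub-cells are laid out.

```python
def split_range(lo: int, hi: int, k: int):
    """Split the index range [lo, hi] into at most k consecutive pieces sharing endpoints."""
    n = max(1, min(k, hi - lo))
    cuts = sorted({lo + t * (hi - lo) // n for t in range(n + 1)})
    pieces = list(zip(cuts, cuts[1:]))
    return pieces or [(lo, hi)]      # a degenerate range stays as it is


def batch_partition(cell, k: int):
    """Partition a cell into up to k x k sub-cells along existing grid lines."""
    i0, i1, j0, j1 = cell
    return [(a, b, c, d)
            for a, b in split_range(i0, i1, k)
            for c, d in split_range(j0, j1, k)]


sub_cells = batch_partition((0, 10, 0, 1), 4)
print(sub_cells)
```

All corners of the returned sub-cells could then be evaluated in one pass over the index, with `k` controlling the tradeoff named above.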

Performance Setup
• O: 123,593 postal addresses in the northeastern part of the US. Stored using an R*-tree.
• S: randomly select 100 sites from O.
• Buffer: 128 pages.
• Dell Pentium IV 3.2GHz.
• Query size: 1% in each dimension.


Effect of VCU Computation


Comparison of Lower Bounds

Effect of Batch Partitioning


Progressiveness
• Each step: partition a cell into 40 sub-cells.
• After 200 steps, accurate answer.
• After 20 steps, answer is 1% away from optimal.

Conclusions
• Introduced the min-dist optimal-location query.
• Proved theorems to limit the number of candidates.
• Presented lower-bound estimators.
• Proposed a progressive algorithm.
Q & A
