SLICE: Reviving Regions-Based Pruning for Reverse k Nearest Neighbors Queries. Shiyu Yang 1 , Muhammad Aamir Cheema 2,1 , Xuemin Lin 1,3 , Ying Zhang 4,1. 1 The University of New South Wales, Australia 2 Monash University, Australia 3 East China Normal University, China
SLICE: Reviving Regions-Based Pruning for Reverse k Nearest Neighbors Queries
Shiyu Yang 1, Muhammad Aamir Cheema 2,1, Xuemin Lin 1,3, Ying Zhang 4,1
1 The University of New South Wales, Australia
2 Monash University, Australia
3 East China Normal University, China
4 University of Technology, Sydney, Australia
Introduction
• k Nearest Neighbor (kNN) Query
  • Find the k facilities closest to the query user.
• Reverse k Nearest Neighbor (RkNN) Query
  • Find every user for which the query facility is one of the k closest facilities.
  • RkNNs are the potential customers of a facility.
[Figure: users u1, u2, u3 and facilities f1, f2, f3; k = 1]
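The RkNN definition above can be made concrete with a brute-force check. This is only a minimal sketch for illustration, not the method of the talk; the function name `rknn_brute_force` and the representation of points as 2D tuples are assumptions.

```python
import math

def rknn_brute_force(query, facilities, users, k):
    """Return the users for which `query` is one of their k closest facilities.

    `facilities` holds the other facilities (the query facility excluded).
    """
    result = []
    for u in users:
        d_q = math.dist(u, query)
        # Count the facilities that are strictly closer to u than the query.
        closer = sum(1 for f in facilities if math.dist(u, f) < d_q)
        if closer < k:
            result.append(u)
    return result
```

Every user is compared against every facility, so the cost is proportional to the number of users times the number of facilities; pruning techniques aim to avoid exactly this.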
Related Work
• Timeline: Six-regions (SIGMOD 2000), TPL (VLDB 2004), FINCH (VLDB 2008), Boost (SIGMOD 2010), InfZone (ICDE 2011)
• Regions-based: Six-regions (SIGMOD 2000)
• Half-space: TPL (VLDB 2004), FINCH (VLDB 2008), InfZone (ICDE 2011)
Related Work
• Regions-based Pruning: Six-regions (SIGMOD 2000)
  • Divide the whole space centred at the query q into six equal regions.
  • Find the k-th nearest facility of q in each region.
  • The k-th nearest facility of q in each region defines the area that can be pruned.
  • User points that cannot be pruned are verified by a range query.
[Figure: six regions around q with facilities a, b, c, d and users u1, u2; k = 2]
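The Six-regions filter can be sketched as follows. This is an illustrative in-memory version under simplifying assumptions (2D tuples, no spatial index, hypothetical function name `six_regions_candidates`); it relies on the known property that a facility in the same 60° region as a user and closer to q than the user is also closer to the user than q is.

```python
import math

def six_regions_candidates(q, facilities, users, k):
    """Return the users that survive Six-regions pruning (still to be verified)."""
    def region_of(p):
        ang = math.atan2(p[1] - q[1], p[0] - q[0]) % (2 * math.pi)
        return min(int(ang // (math.pi / 3)), 5)  # guard against rounding up to 6

    # Distances from q to the facilities, grouped by region.
    dists = [[] for _ in range(6)]
    for f in facilities:
        dists[region_of(f)].append(math.dist(f, q))

    # Pruning radius per region: distance to the k-th nearest facility, if any.
    radius = []
    for ds in dists:
        ds.sort()
        radius.append(ds[k - 1] if len(ds) >= k else math.inf)

    return [u for u in users if math.dist(u, q) <= radius[region_of(u)]]
```

The returned users are only candidates; each one still needs a verification step.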
Related Work
• Half-space Pruning: the space that is contained in k half-spaces can be pruned
• TPL (VLDB 2004)
  • Find the nearest facility f in the unpruned area.
  • Draw the bisector between q and f, and prune using the resulting half-space.
  • Iteratively access the nearest facility in the unpruned area.
[Figure: bisectors between q and facilities a, b, c, d; k = 2]
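The half-space test itself is a sign check against the perpendicular bisector of q and f. The sketch below is illustrative only; the function names are hypothetical, and it checks a single point against a given list of facilities rather than reproducing TPL's index traversal.

```python
def in_half_space(p, q, f):
    """True if p lies strictly on f's side of the perpendicular bisector of q and f."""
    mx, my = (q[0] + f[0]) / 2, (q[1] + f[1]) / 2  # midpoint of segment qf
    # Project (p - midpoint) onto the direction from q to f.
    return (p[0] - mx) * (f[0] - q[0]) + (p[1] - my) * (f[1] - q[1]) > 0

def pruned_by_half_spaces(p, q, facilities, k):
    """True if p lies in at least k half-spaces, i.e. p cannot be an RkNN of q."""
    return sum(in_half_space(p, q, f) for f in facilities) >= k
```

Each test costs one dot product per facility, which is why the number of facilities considered for pruning matters for the overall cost.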
Related Work
• Half-space Pruning: InfZone (ICDE 2011)
  • The influence zone corresponds to the unpruned area when the bisectors of all the facilities have been considered for pruning.
  • A point p is an RkNN of q if and only if p lies inside the unpruned area.
  • No verification phase.
• Half-space pruning is expensive, especially when k is large.
[Figure: influence zone of q with facilities a, b, c, d; k = 2]
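Related Work: Regions-based vs. SLICE vs. Half-space
• Pruning cost (m is the number of facilities considered for pruning)
  • Regions-based: O(m log k)
  • SLICE: O(m log m)
  • Half-space: O(km²)
• Pruning power
  • Regions-based: Low
  • SLICE: High
  • Half-space: High
• Verification cost
  • Regions-based: range query
  • SLICE: O(k)
  • Half-space: O(log m)
• Can regions-based pruning do better?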
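Notations
• Partition: P
• Subtended angle: ∠a
• Maximum (minimum) subtended angle w.r.t. P: $\theta_{max}$ ($\theta_{min}$)
• Upper (lower) arc of a facility f w.r.t. P
  • Center: q
  • Radius: $r^{U} = \frac{dist(f,q)}{2\cos\theta_{max}}$ for the upper arc, $r^{L} = \frac{dist(f,q)}{2\cos\theta_{min}}$ for the lower arc
[Figure: partition P, facility f, point p, query q, angles θmax and θmin, upper and lower arcs]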
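Observation: Pruning
• A facility f prunes every point p ∈ P for which dist(p,q) > $r^{U}$ (the radius of the upper arc), provided $\theta_{max} < 90^\circ$.
• Let a = dist(p,f), b = dist(p,q), c = dist(f,q), and let θ be the angle between qp and qf. We can prove a < b.
  • $a^2 = b^2 + c^2 - 2bc\cos\theta$
  • $b > r^{U} = \frac{c}{2\cos\theta_{max}}$
  • $c^2 - 2bc\cos\theta < c^2 - 2c \cdot \frac{c}{2\cos\theta_{max}} \cdot \cos\theta = c^2\left(1 - \frac{\cos\theta}{\cos\theta_{max}}\right) \le 0$, because $\theta \le \theta_{max} < 90^\circ$
  • Hence $a^2 < b^2$, i.e. p is closer to f than to q.
• A facility f prunes the area outside its upper arc for every partition P for which $\theta_{max} < 90^\circ$.
[Figure: triangle p, f, q with sides a, b, c, angles θ and θmax, and the upper arc]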
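The observation can be checked numerically. The sketch below assumes the upper arc radius is dist(f,q) / (2 cos θmax), which is what the law-of-cosines argument requires; the function names are hypothetical.

```python
import math

def upper_arc_radius(f, q, theta_max):
    """Radius of the upper arc of f; infinite when theta_max >= 90 degrees (no pruning)."""
    if theta_max >= math.pi / 2:
        return math.inf
    return math.dist(f, q) / (2 * math.cos(theta_max))

def closer_to_facility(p, q, f):
    """True if p is strictly closer to f than to q, i.e. f prunes p."""
    return math.dist(p, f) < math.dist(p, q)
```

For example, with q = (0, 0), f = (2, 0) and θmax = 60°, the radius is approximately 2, and any point whose direction from q is within 60° of f and whose distance from q exceeds that radius is closer to f than to q.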
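Comparison with Six-regions
• Area pruned: [Figure: area pruned around q by a facility f at distance dist(f,q), Six-regions vs. SLICE]
• Partitions pruned
  • Six-regions: one
  • SLICE: every partition with $\theta_{max} < 90^\circ$
• No. of partitions
  • Six-regions: 6
  • SLICE: any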
Pruning Algorithm
• Divide the space into t partitions.
• Compute the upper arc of each facility w.r.t. each partition.
• In each partition, the area outside the k-th smallest upper arc (the bounding arc, with radius rB) can be pruned.
• Users in the pruned area are pruned.
• Users in the unpruned area are verified by accessing significant facilities.
[Figure: facilities f1, f2 and users u1, u2 around q; k = 2]
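The pruning steps can be sketched in memory as follows. This is a simplified illustration, not the implementation from the talk: it scans all facilities instead of traversing an index, assumes the upper arc radius is dist(f,q) / (2 cos θmax), and uses hypothetical helper names.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def theta_max(f_angle, start, width):
    """Largest angle between direction f_angle and any direction in [start, start + width]."""
    if (f_angle + math.pi - start) % (2 * math.pi) <= width:
        return math.pi  # the direction opposite to f falls inside the partition
    return max(angle_diff(f_angle, start), angle_diff(f_angle, start + width))

def bounding_arc_radii(q, facilities, k, t):
    """Radius rB of the k-th smallest upper arc in each of the t partitions."""
    width = 2 * math.pi / t
    radii = []
    for i in range(t):
        arcs = []
        for f in facilities:
            ang = math.atan2(f[1] - q[1], f[0] - q[0])
            tm = theta_max(ang, i * width, width)
            if tm < math.pi / 2:  # f prunes part of this partition
                arcs.append(math.dist(f, q) / (2 * math.cos(tm)))
        arcs.sort()
        radii.append(arcs[k - 1] if len(arcs) >= k else math.inf)
    return radii

def slice_candidates(q, facilities, users, k, t=12):
    """Users that are not pruned and therefore still need verification."""
    radii = bounding_arc_radii(q, facilities, k, t)
    width = 2 * math.pi / t
    out = []
    for u in users:
        ang = math.atan2(u[1] - q[1], u[0] - q[0]) % (2 * math.pi)
        i = min(int(ang // width), t - 1)
        if math.dist(u, q) <= radii[i]:
            out.append(u)
    return out
```

Sorting all upper arcs per partition is done here for brevity; keeping only the k smallest (for example in a bounded heap) would avoid the full sort.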
Significant Facility Verification
• Significant facility: a facility f that prunes at least one point p ∈ P lying inside the bounding arc of P.
  • A significant facility cannot be in the red area.
• Verification for a candidate
  • Regions-based: issuing a range query for each candidate; high I/O cost
  • SLICE: accessing significant facilities (O(k)); no additional I/O cost
[Figure: partition P with query q, points M and N, and the red area]
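Verification against significant facilities can be sketched for a single partition as follows. This is a hedged illustration: it assumes a facility is significant when its lower arc radius, taken as dist(f,q) / (2 cos θmin), is smaller than the bounding arc radius `r_b`, and all function names are hypothetical.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def theta_min(f_angle, start, width):
    """Smallest angle between direction f_angle and any direction in [start, start + width]."""
    if (f_angle - start) % (2 * math.pi) <= width:
        return 0.0  # f's direction lies inside the partition
    return min(angle_diff(f_angle, start), angle_diff(f_angle, start + width))

def significant_facilities(q, facilities, start, width, r_b):
    """(lower arc radius, facility) pairs for the partition, sorted by radius."""
    sig = []
    for f in facilities:
        ang = math.atan2(f[1] - q[1], f[0] - q[0])
        tmin = theta_min(ang, start, width)
        if tmin < math.pi / 2:
            r_low = math.dist(f, q) / (2 * math.cos(tmin))
            if r_low < r_b:  # f can prune some point inside the bounding arc
                sig.append((r_low, f))
    sig.sort(key=lambda pair: pair[0])
    return sig

def verify(u, q, sig, k):
    """True if candidate u (inside the bounding arc of its partition) is an RkNN of q."""
    d = math.dist(u, q)
    count = 0
    for r_low, f in sig:
        if r_low >= d:
            break  # this and all later facilities cannot prune u
        if math.dist(u, f) < d:
            count += 1
            if count >= k:
                return False
    return True
```

Because the list is sorted by lower arc radius, the loop can stop at the first facility whose lower arc lies at or beyond the candidate, so only facilities that could possibly prune the candidate are examined.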
Theoretical Analyses
• Number of significant facilities
  • 2.34k (θ → 0)
  • 9k (θ = 60°)
  • More analyses can be found in the paper.
• I/O Cost
  • Pruning phase: same as a circular range query centered at q with radius 2rB
  • Verification phase: same as a circular range query centered at q with radius rB
Experiments
• Data sets:
  • Synthetic data:
    • Size: 50,000, 100,000, 150,000 or 200,000
    • Distribution: Uniform or Normal
  • Real data: 175,812 points in North America
• Algorithms: Six-regions, InfZone and SLICE
• Page size is 4KB and the number of buffers for Six-regions is 10
• Number of partitions for SLICE is 12
Experiments
• Effect of different values of k
[Figures: CPU cost and I/O cost]
Experiments
• Effect of data distribution
• Effect of % users
Experiments
• Effect of partitions
• Number of significant facilities
[Figures: axes labelled "Number of partitions" and "Value of k"]
Thanks! Q&A