Efficient Dual-Resolution Layer Indexing for Top-k Queries

On Top-n Reverse Top-k Queries: Variants, Algorithms, and Applications 陳良弼 Arbee L.P. Chen National Chengchi University 9/21/2012 at NCHU

IEEE International Conference on Data Engineering (ICDE) • A premier international conference on databases • Inaugural conference held in Los Angeles in 1984 • Held in Taiwan in 1995

ICDE2012 Research Papers Distribution • System Aspects: Privacy and Security 8%; Storage Management and Performance 7%; Entity Resolution/Versioning 7% • Query Processing 31%: Top-k Query 9%; Distributed/Parallel/Map-Reduce 8%; Location-aware 5%; Execution Plan 5%; Graph Indexing 4%

Text/Web/Keyword Search 19% • Stream/Trajectory/Sequence/Spatio-Temporal 10% • Social Media 7% • Uncertain Database 6% • Data Mining 5%

Efficient Dual-Resolution Layer Indexing for Top-k Queries, ICDE2012 (Figure: nine hotels, H1 to H9, plotted as points.)

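(price, distance to the airport) (Figure: hotels H1 to H9, each annotated with its attribute pair and score; each score is the equal-weight sum 0.5 × price + 0.5 × distance: (0.45, 0.6) 0.525; (0.6, 0.2) 0.4; (0.55, 0.4) 0.475; (0.55, 0.3) 0.425; (0.7, 0.4) 0.55; (0.3, 0.7) 0.5; (0.3, 0.6) 0.45; (0.5, 0.5) 0.5; (0.2, 0.7) 0.45.)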

(price, distance to the airport) (Figure: the subset H1, H4, H5, H6, and H7, with attribute pairs and scores: (0.6, 0.2) 0.4; (0.55, 0.4) 0.475; (0.55, 0.3) 0.425; (0.3, 0.6) 0.45; (0.2, 0.7) 0.45.)

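Answering Why-not Questions on Top-k Queries, ICDE2012 • Top-k query: Top-2 with weights (0.4, 0.5, 0.1) over (Cleanliness, Deliciousness, Parking spaces) (Figure: points p1 to p6 with attribute vectors and scores: p1 (95, 80, 40) 82; p2 (70, 20, 30) 41; p5 (50, 90, 60) 71; p6 (58, 20, 30) 36.2; p3 and p4 label the points (75, 70, 50), score 70, and (85, 60, 60), score 69.)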

(Cleanliness, Deliciousness, Parking spaces) • Why-not question: under Top-2 with weights (0.5, 0.4, 0.1), the scores become p1 83.5, p2 46, p5 67, and p6 40, with 70.5 for (75, 70, 50) and 71.7 for (85, 60, 60) • Why is p5 not in my top-2 result? • Does p5 not exist? • Should I revise my query to look for the top-5 hotels? • Should I change my weights?

The Min-dist Location Selection Query, ICDE2012 • Minimize the nearest facility distance (Figure: points c1 to c8, f1, f2, p1, and p2.)

(Figure: nearest facility distance for c1 to c8, with f1, f2, and p1.)

(Figure: nearest facility distance for c1 to c8, with f1, f2, and p2.)

Introduction: kNN (k-Nearest Neighbors) Queries • Assume k = 3 • kNN(q) = {a, b, c} (Figure: query point q with points a, b, and c.)

Introduction: RkNN (Reverse k-Nearest Neighbors) Queries • Assume k = 3 • RkNN(q) = {a, …} (Figure: query point q with points a and d.)

Introduction: BRkNN (Bi-chromatic Reverse k-Nearest Neighbors) Queries • Two types of data • Assume k = 3 • BRkNN(q) = {a, …} (Figure: query point q with points a and d.)
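
The three query types can be illustrated with a brute-force sketch. It assumes 2-D points and Euclidean distance; the function names `knn` and `brknn` and the sample coordinates are hypothetical and do not come from the slides. In the bichromatic case the sketch treats a point of the second type as a member of BRkNN(q) when q is among that point's k nearest neighbors of the first type.

```python
import math

def knn(p, sites, k):
    """Return the k sites nearest to point p (ties broken by list order)."""
    return sorted(sites, key=lambda s: math.dist(p, s))[:k]

def brknn(q, sites, others, k):
    """Bichromatic reverse kNN: the points in `others` whose k nearest
    sites (the data type that q belongs to) include q."""
    return [p for p in others if q in knn(p, sites, k)]

# Hypothetical data: three sites and four points of the other type.
sites = [(0.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
others = [(1.0, 0.0), (3.0, 1.0), (9.0, 0.0), (5.0, 0.0)]

print(knn((1.0, 0.0), sites, 2))             # two nearest sites of (1, 0)
print(brknn((0.0, 0.0), sites, others, 1))   # who has (0, 0) as 1NN
print(brknn((4.0, 0.0), sites, others, 2))   # who has (4, 0) in their 2NN
```

With this data only (1, 0) has the site (0, 0) as its nearest site, while all four points have (4, 0) among their two nearest sites. The scan costs one kNN search per point, which is the cost that filter-refinement approaches aim to avoid.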

Application I • Which location is the best? (Figure legend: shop; customer.)

Top-n Reverse kNN Queries • Given two types of data, G (goal) and C (condition) • Retrieve the n data points from G that have the largest BRkNN values • Example: n = 2, k = 2 • BR2NN value of g1 = 4 • BR2NN value of g2 = 9 • BR2NN value of g3 = 5 • BR2Top-2 = {g2, g3} (Figure: goal points g1, g2, and g3 among the condition points.)
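
A minimal brute-force sketch of the top-n reverse kNN query follows. It assumes Euclidean distance in 2-D; the helper names `brknn_values` and `top_n_brknn` and the coordinates are hypothetical, so the resulting values differ from the 4, 9, and 5 in the slide's example.

```python
import math
from collections import Counter

def brknn_values(goals, conditions, k):
    """BRkNN value of each goal: the number of condition points that
    have the goal among their k nearest goal points."""
    values = Counter({g: 0 for g in goals})
    for c in conditions:
        nearest = sorted(goals, key=lambda g: math.dist(c, goals[g]))[:k]
        values.update(nearest)
    return values

def top_n_brknn(goals, conditions, k, n):
    """Names of the n goals with the largest BRkNN values."""
    values = brknn_values(goals, conditions, k)
    return [g for g, _ in values.most_common(n)]

# Hypothetical coordinates for three goal points and five condition points.
goals = {"g1": (0.0, 0.0), "g2": (5.0, 0.0), "g3": (10.0, 0.0)}
conditions = [(1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (6.0, 1.0), (9.0, 0.0)]

print(dict(brknn_values(goals, conditions, k=2)))  # {'g1': 2, 'g2': 5, 'g3': 3}
print(top_n_brknn(goals, conditions, k=2, n=2))    # ['g2', 'g3']
```

Each condition point contributes one count to each of its k nearest goals, so the values always sum to k times the number of condition points.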

Voronoi Diagram of G (Figure legend: goal point (VD-node); condition point.)

A Filter-Refinement Framework for Solving BRkNN Queries • Assume k = 2 • Lower-bound region of VDi: layer 0 • Upper-bound region of VDi: layer 0 to layer (k-1) (Figure: VDi at layer 0, surrounded by layer 1.)

Filter Phase • Assume k = 2 • Construct bisectors layer by layer to reduce the region around VDi

Refinement Phase • Assume k = 2 • For a data point p, check the VDs at layer 1 to layer 2 to determine whether VDi is one of the 2NN of p (Figure: p and VDi.)

Refinement Phase • Assume k = 2 • VDi: (VD13, 1.2) (VD26, 1.4) (VD27, 1.7) (VD3, 1.7) (VD4, 1.8) (VD30, 2.1) (VD5, 2.5) (VD7, 4.8) • dist(p, VDi) = 0.9 and dist(VDi, VD30) = 2.1, so dist(p, VD30) > 1.2 …

Refinement Phase • Assume k = 2 • VDi: (VD13, 1.2) (VD26, 1.4) (VD27, 1.7) (VD3, 1.7) (VD4, 1.8) (VD30, 2.1) (VD5, 2.5) (VD7, 4.8) • dist(p, VDi) = 0.9 and dist(VDi, VD30) = 2.1, so dist(p, VD30) > 1.2 • Condition: dist(VDi, VDj) > 2·dist(VDi, p) …
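
The condition dist(VDi, VDj) > 2·dist(VDi, p) can be justified with the triangle inequality. The derivation below is a sketch of that reasoning, assuming dist is a metric; it is not quoted from the slides.

```latex
\begin{align*}
\mathrm{dist}(p, VD_j) &\ge \mathrm{dist}(VD_i, VD_j) - \mathrm{dist}(VD_i, p) \\
                       &>   2\,\mathrm{dist}(VD_i, p) - \mathrm{dist}(VD_i, p) \\
                       &=   \mathrm{dist}(VD_i, p).
\end{align*}
```

Any VDj that satisfies the condition is therefore farther from p than VDi is, so it cannot displace VDi from the kNN of p. With dist(VDi, p) = 0.9 the threshold is 1.8, and in the sorted list only VD30 (2.1), VD5 (2.5), and VD7 (4.8) exceed it. For VD30, the first line gives dist(p, VD30) ≥ 2.1 − 0.9 = 1.2, which is larger than 0.9.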

Application II: Maximum Coverage BRkNN Queries • Retrieve 2 points from dataset G • Assume k = 2

BRkNN value = 9

BRkNN value = 8

total = 12

total = 14

Maximum Coverage BRkNN Queries • Given: • A set of goal points (G) • A set of condition points (C) • k: the k value of BRkNN • Goal: • Find n points from G, g1, g2, …, gn, which maximize $\left|\bigcup_{i=1}^{n} \mathrm{BRkNN}(g_i, G, C)\right|$

Application III • Find n Most Favorite Products based on Reverse Top-k Queries

Which are the most favorite packages? (Figure: the Airlines and Hotels tables combined into all candidate packages.)

Top-k Queries (Customer’s View) • All candidate packages scored against the customer preferences • C1: (a1, h1): 0.8×0 + 0.2×0.2 + 0.4×0.5 + 0.6×0.1 + 0.4×0.2 = 0.38; (a1, h2): 0.8×0 + 0.2×0.2 + 0.4×0.5 + 0.6×0.1 + 0.6×0.2 = 0.42; … • C2: (a1, h1): 0.8×0.1 + 0.2×0.3 + 0.4×0.1 + 0.6×0.3 + 0.4×0.2 = 0.44; (a1, h2): 0.8×0.1 + 0.2×0.3 + 0.4×0.1 + 0.6×0.3 + 0.6×0.2 = 0.48; …
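
The arithmetic on this slide is a weighted sum. The sketch below reproduces it, reading the first factor of each term as a package attribute value and the second as a customer weight (the second factors sum to 1 for both customers). That reading, the assumption that a higher score is better, and the function names `score` and `top_k` are not stated in the slides.

```python
def score(attributes, weights):
    """Weighted sum of a package's attribute values under one customer's weights."""
    return sum(a * w for a, w in zip(attributes, weights))

def top_k(packages, weights, k):
    """Names of the k packages with the highest score for one customer."""
    ranked = sorted(packages, key=lambda name: score(packages[name], weights),
                    reverse=True)
    return ranked[:k]

# Factors copied from the sums on the slide (assumed split: values, then weights).
packages = {
    ("a1", "h1"): (0.8, 0.2, 0.4, 0.6, 0.4),
    ("a1", "h2"): (0.8, 0.2, 0.4, 0.6, 0.6),
}
c1 = (0.0, 0.2, 0.5, 0.1, 0.2)
c2 = (0.1, 0.3, 0.1, 0.3, 0.2)

for name, attrs in packages.items():
    # Rounded to two decimals to hide floating-point noise.
    print(name, round(score(attrs, c1), 2), round(score(attrs, c2), 2))
```

The loop prints 0.38 and 0.44 for (a1, h1) and 0.42 and 0.48 for (a1, h2), matching the slide.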

Reverse Top-k Queries (Travel Agency’s View) • All candidate packages and customer preferences • Retrieve the customers whose top-2 favorites contain (a1, h2) → {c3} • The number of customers in the reverse top-k result of a product is a good estimate of how favored the product is in the market
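
A reverse top-k query can be sketched by running a top-k query for every customer and keeping those whose result contains the product. The code below is a brute-force illustration only: it assumes weighted-sum scoring with higher scores preferred, and the products, customers, and function names are hypothetical.

```python
def score(attributes, weights):
    """Weighted-sum score of one product for one customer."""
    return sum(a * w for a, w in zip(attributes, weights))

def top_k(products, weights, k):
    """Set of the k best-scoring product names for one customer."""
    ranked = sorted(products, key=lambda p: score(products[p], weights),
                    reverse=True)
    return set(ranked[:k])

def reverse_top_k(product, products, customers, k):
    """Customers whose top-k products contain `product`."""
    return {c for c, w in customers.items() if product in top_k(products, w, k)}

# Hypothetical products (two attributes) and customer weight vectors.
products = {"x": (0.9, 0.1), "y": (0.6, 0.6), "z": (0.1, 0.9)}
customers = {"c1": (1.0, 0.0), "c2": (0.6, 0.4), "c3": (0.0, 1.0)}

for name in products:
    print(name, sorted(reverse_top_k(name, products, customers, k=2)))
```

For k = 2 this yields {c1, c2} for x, all three customers for y, and {c3} for z, so y would be the most favored product under the slide's estimate.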

All candidate packages and customer preferences • k (#packages considered by customers) = 2 • n (#packages to be offered by the travel agency) = 2 • Reverse top-2 results: (a1, h2): {c3} (a1, h5): {c3, c4} (a2, h5): {c4} (a3, h2): {c2} (a3, h5): {c2, c4} (a3, h6): {c1, c5} (a4, h6): {c5} (a5, h6): {c1}

Problem Definition of n-k MFP • Given a set of component tables T1, T2, …, and Tx, which form the set of candidate products P, a set of customers C with different preferences on the products, and two positive integers k and n • RTOPk(cp, P, C): the set of customers whose top-k favorites contain the candidate product cp • Retrieve the minimum subset P’ of P such that $|P'| \le n$ and $\left|\bigcup_{cp \in P'} \mathrm{RTOPk}(cp, P, C)\right|$ is maximized • Maximum coverage problem: NP-hard
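
Because maximum coverage is NP-hard, a common workaround is the greedy heuristic: repeatedly pick the product that covers the most customers not yet covered. The slides do not say which selection algorithm is used, so the sketch below is an assumption. The function name `greedy_cover` is hypothetical; the reverse top-2 sets are the ones listed on the earlier slide with k = 2 and n = 2.

```python
def greedy_cover(rtopk, n):
    """Pick up to n products greedily, each time taking the product that
    adds the most not-yet-covered customers (first listed wins ties)."""
    chosen, covered = [], set()
    for _ in range(n):
        best = max(rtopk, key=lambda cp: len(rtopk[cp] - covered))
        if not rtopk[best] - covered:
            break  # no remaining product adds a new customer
        chosen.append(best)
        covered |= rtopk[best]
    return chosen, covered

# Reverse top-2 result of each candidate package, as listed on the slide.
rtopk = {
    ("a1", "h2"): {"c3"},
    ("a1", "h5"): {"c3", "c4"},
    ("a2", "h5"): {"c4"},
    ("a3", "h2"): {"c2"},
    ("a3", "h5"): {"c2", "c4"},
    ("a3", "h6"): {"c1", "c5"},
    ("a4", "h6"): {"c5"},
    ("a5", "h6"): {"c1"},
}

chosen, covered = greedy_cover(rtopk, n=2)
print(chosen)           # [('a1', 'h5'), ('a3', 'h6')]
print(sorted(covered))  # ['c1', 'c3', 'c4', 'c5']
```

On this input the greedy choice covers four of the five customers, which is the best possible here since no package covers more than two. In general the greedy heuristic gives an approximation, not a guaranteed optimum.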

Skyline • An object p is said to dominate another object q if and only if p is larger than or equal to q on all dimensions and p is larger than q on at least one dimension • Given a set of multi-dimensional objects, the skyline consists of the objects that are not dominated by any other object (Figure: objects plotted on axes A1 and A2.)
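
The dominance test and the skyline definition translate directly into code. This is a quadratic-time sketch with hypothetical sample points; practical skyline algorithms avoid comparing every pair.

```python
def dominates(p, q):
    """p dominates q: p >= q on every dimension and p > q on at least one."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def skyline(objects):
    """Objects that are not dominated by any other object."""
    return [p for p in objects if not any(dominates(q, p) for q in objects)]

# Hypothetical 2-D objects (larger is better on both dimensions).
objects = [(1, 5), (3, 4), (2, 2), (5, 1), (4, 3), (3, 3)]
print(skyline(objects))  # [(1, 5), (3, 4), (5, 1), (4, 3)]
```

Here (2, 2) and (3, 3) drop out because (3, 4) is at least as large on both dimensions and strictly larger on one. An object never dominates itself, so the inner loop needs no special case for p itself.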

Property 1 • Only the component tuples dominated by at most (k-1) other tuples in the same component table can be part of a top-k product for a customer c (Figure: the Airlines and Hotels tables.)
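
Property 1 suggests a filter that counts, for each tuple, how many tuples of the same table dominate it. The sketch below assumes larger values are better on every dimension; the function name `k_skyband` follows the common name for this set, which the slides do not use, and the table contents are hypothetical.

```python
def dominates(p, q):
    """p dominates q: p >= q on every dimension and p > q on at least one."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def k_skyband(table, k):
    """Tuples dominated by at most k-1 other tuples of the same table."""
    return [t for t in table
            if sum(dominates(o, t) for o in table) <= k - 1]

# Hypothetical component table with two attributes per tuple.
table = [(1, 5), (3, 4), (2, 2), (5, 1), (4, 3), (3, 3)]
print(k_skyband(table, 1))  # the skyline: [(1, 5), (3, 4), (5, 1), (4, 3)]
print(k_skyband(table, 3))  # also keeps (3, 3), which has two dominators
```

The tuple (2, 2) has three dominators, so it survives only for k of 4 or more. For k = 1 the filter reduces to the skyline.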

Reduce component tables (Figure: the reduced Airlines and Hotels tables.)

Property 2 • The candidate products in the n-k MFP must be in Skyline(P) • For any two candidate products cp1 and cp2 in P, if cp1 dominates cp2, RTOPk(cp2, P, C) ⊆ RTOPk(cp1, P, C) • For any candidate product cp in P, if cp ∉ Skyline(P), cp ∉ n-k MFP (Figure: candidate products plotted on axes A1 and A2.)

Property 2 (cont.) • A candidate product cp ∈ Skyline(P) if and only if cp belongs to the set of candidate products generated from Skyline(T1), Skyline(T2), …, and Skyline(Tx) [VLDB’09] • Only the skyline tuples of each component table can be part of a candidate product in the n-k MFP (Figure: the Airlines and Hotels tables.)

Property 3: The upper bounds of the remaining candidate packages • RTOPk(cp, Skyline(P), C) is an upper bound (a superset) of RTOPk(cp, P, C) • Only the customers in RTOPk(cp, Skyline(P), C) can become members of RTOPk(cp, P, C)

Refinement • The top-2 favorites of C3: {(a1, h5), (a1, h2)} • The top-2 favorites of C4: {(a1, h5), (a2, h5), (a3, h5)} • P’ : {(a1, h5)}

Refinement • The top-2 favorites of C1: {(a3, h6), (a4, h6)} • The top-2 favorites of C5: {(a3, h6), (a4, h6)} • P’ : {(a1, h5)} grows to P’ : {(a1, h5), (a3, h6)}

Application IV • Find Most Favorite Products by Top-k Reverse Skyline Queries (Figure: products and the user preferences u1 and u2 plotted on axes Mileage and Year; k = 1.)

Thank you for your attention!
