A Unified Approach for Computing Top-k Pairs in Multidimensional Space
Presented by: Muhammad Aamir Cheema (1)
Joint work with Xuemin Lin (1), Haixun Wang (2), Jianmin Wang (3), Wenjie Zhang (1)
(1) University of New South Wales, Australia
(2) Microsoft Research Asia
(3) Tsinghua University, China
Introduction
• Top-k Pairs Query: given a scoring function f() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.
• Examples:
  • k-closest pairs: f(ou,ov) = dist(ou,ov); Answer (k=1) = (o1,o2)
  • f(ou,ov) = (ou.x + ov.x) + (ou.y + ov.y); Answer (k=1) = (o4,o5)
  • k-furthest pairs: f(ou,ov) = -dist(ou,ov); Answer (k=1) = (o2,o4)
[Figure: objects o1 to o5 plotted against the x-axis and y-axis]
Related Work
K-Closest Pairs Queries
• Computational geometry [M. Smid, Handbook on Comp. Geometry]
• Database community
  • [Hjaltason et al., SIGMOD 1998]
  • [Corral et al., SIGMOD 2000]
  • [Yang et al., IDEAS 2002]
  • [Shan et al., SSTD 2003]
K-Furthest Pairs Queries
• [Supowit, SODA 1990]
• [Katoh et al., IJCGA 1995]
• [Corral et al., DKE 2004]
Top-k Queries
• Fagin's Algorithm [Fagin, PODS 1996]
• Threshold Algorithm [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
• No Random Access Algorithm [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]
Motivation
• No existing work for more general queries
  • Other Lp distances (e.g., Manhattan distance)?
  • More general scoring functions
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
  • Chromatic queries
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id AND a.manager <> b.manager
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
• No existing unified algorithm
  • One framework that answers a broad class of top-k pairs queries
Problem Definition (Preliminaries)
• Monotonic function
  • f() is monotonic if f(x1,…,xN) ≤ f(y1,…,yN) whenever xi ≤ yi for every 1 ≤ i ≤ N
• Examples:
  • f(x1,…,xN) = x1 + x2 + … + xN (summation)
  • f(x1,…,xN) = (x1 + x2 + … + xN) / N (average)
Problem Definition (Preliminaries)
• Loose monotonic function
  • s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
    • for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
    • for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases
• Examples: s1(x,y) = |x - y|, s2(x,y) = x + y
• Loose monotonic functions are more general than monotonic functions
[Figure: a fixed x on a number line from -∞ to ∞, with values of s1(x,y) = |x - y| and s2(x,y) = x + y shown for sample y on either side of x]
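The loose monotonic definition can be spot-checked numerically. The sketch below is illustrative only: the helper name `looks_loose_monotonic` is hypothetical, "monotonically" is read as non-strict, and passing the check on a finite grid is evidence, not a proof. The third function in the usage note is an assumed counterexample, not one from the slides.

```python
def _monotone(seq):
    """True if seq is entirely non-decreasing or entirely non-increasing."""
    non_decreasing = all(a <= b for a, b in zip(seq, seq[1:]))
    non_increasing = all(a >= b for a, b in zip(seq, seq[1:]))
    return non_decreasing or non_increasing


def looks_loose_monotonic(s, xs, ys):
    """Check the two loose-monotonicity conditions on a finite grid of samples."""
    ys = sorted(ys)
    for x in xs:
        # Condition 1: y > x, visited as y increases.
        right = [s(x, y) for y in ys if y > x]
        # Condition 2: y < x, visited as y decreases.
        left = [s(x, y) for y in reversed(ys) if y < x]
        if not (_monotone(right) and _monotone(left)):
            return False
    return True
```

On the integer grid -5..5, `|x - y|` and `x + y` pass, while `(y - x - 2) ** 2` fails because for y > x it first decreases and then increases.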
Problem Definition
• Return the k pairs of objects with the smallest scores.
• SCORE(a,b) = f( s1(a,b), …, sd(a,b) )
  • si() is called a local scoring function and can be any loose monotonic function of the user's choice.
  • f() is called the global scoring function and can be any monotonic function that involves an arbitrary set of attributes.
• Example:
  • s1(a,b) = |a.sold - b.sold|
  • s2(a,b) = -|a.salary - b.salary|
  • f() = s1(a,b) + s2(a,b)
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
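To make the SCORE definition concrete, here is a minimal Python sketch of the query semantics. It enumerates every pair, so it corresponds to a naive evaluation, not to the algorithm presented in these slides. The dictionary layout of an agent (`id`, `sold`, `salary`) and the function name `top_k_pairs` are assumptions made for illustration.

```python
import heapq
from itertools import combinations


def s1(a, b):
    # Local scoring function on the "sold" attribute.
    return abs(a["sold"] - b["sold"])


def s2(a, b):
    # Local scoring function on the "salary" attribute (negated difference).
    return -abs(a["salary"] - b["salary"])


def score(a, b):
    # Global scoring function f: here the sum of the local scores.
    return s1(a, b) + s2(a, b)


def top_k_pairs(agents, k):
    """Return the ids of the k pairs with the smallest scores (brute force)."""
    agents = sorted(agents, key=lambda a: a["id"])
    pairs = combinations(agents, 2)  # each unordered pair once, a.id < b.id
    best = heapq.nsmallest(k, pairs, key=lambda p: score(*p))
    return [(a["id"], b["id"]) for a, b in best]
```

This touches all N(N-1)/2 pairs, which is the cost the source-based framework is meant to avoid.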
Problem Definition
• Return the k pairs of objects with the smallest scores among the valid pairs.
• Chromatic queries: let each object be assigned a color.
  • Homochromatic queries: pairs containing objects of the same color
  • Heterochromatic queries: pairs containing objects of different colors
• Non-chromatic query:
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
• Heterochromatic query (the manager acts as the color):
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id AND a.manager <> b.manager
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
• Homochromatic query:
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id AND a.manager = b.manager
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
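The notion of a valid pair can be expressed as a predicate applied before ranking. The sketch below is again a brute-force illustration of the query semantics only. It assumes each agent is a dictionary with a `manager` field playing the role of the color, and the names `top_k_valid_pairs`, `homochromatic` and `heterochromatic` are hypothetical.

```python
import heapq
from itertools import combinations


def score(a, b):
    # Same example score as in the queries: |sold diff| - |salary diff|.
    return abs(a["sold"] - b["sold"]) - abs(a["salary"] - b["salary"])


def homochromatic(a, b):
    # Valid only if both objects have the same color (manager).
    return a["manager"] == b["manager"]


def heterochromatic(a, b):
    # Valid only if the objects have different colors (managers).
    return a["manager"] != b["manager"]


def top_k_valid_pairs(agents, k, is_valid=lambda a, b: True):
    """Return ids of the k smallest-score pairs among those accepted by is_valid."""
    agents = sorted(agents, key=lambda a: a["id"])
    valid = (p for p in combinations(agents, 2) if is_valid(*p))
    best = heapq.nsmallest(k, valid, key=lambda p: score(*p))
    return [(a["id"], b["id"]) for a, b in best]
```

Leaving `is_valid` at its default gives the non-chromatic query.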
Contributions
Unified algorithm (internal and external)
• k-closest pairs, k-furthest pairs and variants (any Lp distance)
• queries involving any arbitrary subset of attributes
• chromatic and non-chromatic queries
• skyline pairs queries and rank-based top-k pairs queries
No pre-built indexes required
• efficiently builds a simple data structure on the fly
• can answer queries involving filtering conditions on objects
      SELECT a.id, b.id FROM AGENT a, AGENT b
      WHERE a.id < b.id AND a.age > 40 AND b.age > 40
      ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
      LIMIT k;
Known memory requirement
• existing R-tree based approaches may require arbitrarily large heaps
• our algorithm requires O(k) space + 2d buffer pages
Efficient
• Theoretically optimal for d ≤ 2
• Experimentally
Framework
• Each local scoring function acts as a source of pairs: s1(a,b), s2(a,b), …, sd(a,b)
• A top-k algorithm (e.g., FA, TA, NRA) combines the sources using f( s1(a,b), s2(a,b), …, sd(a,b) )
• How to efficiently create and maintain these sources?
Creating/maintaining sources
Naïve approach
• Create all possible pairs: O(N^2)
• Sort them according to their local scores: O(N^2 log N)
• Space requirement: O(N^2)
Features of our approach
• Optimal internal memory algorithm
  • requires O(N) space
  • returns the first pair in O(N log N)
  • each next best pair is returned in O(log N)
• Optimal external memory algorithm
  • B = number of elements that can be stored in one disk page
  • M = internal memory used (minimum M = 2B)
  • returns the first pair in O((N/B) log_{M/B} (N/B))
  • each next best pair is returned in O(log_{M/B} (N/B))
Creating/maintaining sources
• Initialize
  • sort the objects
  • for each object ou
    • create its best pair (ou,ov)
    • insert (ou,ov) in the heap
• getNextPair()
  • report the top pair (ou,ov) of the heap
  • create the next best pair of ou
  • insert the new pair into the heap and delete (ou,ov)
[Figure: six objects (o1 to o6) at values 6, 12, 14, 15, 20, 30 on a number line, annotated with pair scores under s(x,y) = |x - y|]
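The Initialize / getNextPair() steps can be sketched for the local scoring function s(x,y) = |x - y| as follows. This is a minimal illustration, not the authors' implementation: it assumes each object is paired only with objects to its right in sorted order (so every unordered pair appears once), and the generator name `closest_pairs_source` is hypothetical.

```python
import heapq


def closest_pairs_source(values):
    """Yield (score, u, v) in non-decreasing order of |values[u] - values[v]|.

    u and v are indices into `values`; the heap never holds more than
    one pending pair per object, so space stays O(N).
    """
    # Initialize: sort the objects by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    # Best pair of the object at sorted position p is its right neighbour.
    heap = [
        (values[order[p + 1]] - values[order[p]], p, p + 1)
        for p in range(n - 1)
    ]
    heapq.heapify(heap)

    # getNextPair(): report the top pair, then create the next best pair
    # of the same object (one step further to the right) and insert it.
    while heap:
        score, p, q = heapq.heappop(heap)
        yield score, order[p], order[q]
        if q + 1 < n:
            nxt = values[order[q + 1]] - values[order[p]]
            heapq.heappush(heap, (nxt, p, q + 1))
```

With the values 6, 12, 14, 15, 20, 30, the first reported pair is the one with score 1 (values 14 and 15). For a fixed object the scores of its pairs to the right never decrease, which is why one heap entry per object is enough.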
Homochromatic Queries
[Figure: six colored objects (o1 to o6) at values 6, 12, 14, 15, 20, 30 on a number line]
Heterochromatic Queries
• Let (ou,ov) be the pair
• ox = the object next to ov
• If ou and ox have different colors
  • (ou,ox) is the next best pair
• else
  • oy = the adjacent object of ox
  • (ou,oy) is the next best pair
[Figure: six colored objects (o1 to o6) at values 6, 12, 14, 15, 20, 30 on a number line]
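A possible reading of the heterochromatic rule, sketched for s(x,y) = |x - y|: when the next object ox has the same color as ou, jump to the nearest following object whose color differs from ox. The `skip` array that stores this jump target is an assumption of this sketch (the slide only says "the adjacent object of ox"), as are the function name and the choice to pair each object only with objects to its right.

```python
import heapq


def heterochromatic_source(values, colors):
    """Yield (score, u, v) for different-colored pairs, smallest |difference| first."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)

    # skip[p] = first sorted position after p whose color differs from
    # the color at p (n if there is none). Built right to left in O(N).
    skip = [n] * n
    for p in range(n - 2, -1, -1):
        q = p + 1
        skip[p] = q if colors[order[q]] != colors[order[p]] else skip[q]

    def first_valid(p, q):
        # Smallest position >= q whose color differs from the color at p.
        if q < n and colors[order[q]] == colors[order[p]]:
            q = skip[q]  # same color as p, so skip[q] also differs from p
        return q

    heap = []
    for p in range(n - 1):
        q = first_valid(p, p + 1)
        if q < n:
            heap.append((values[order[q]] - values[order[p]], p, q))
    heapq.heapify(heap)

    while heap:
        score, p, q = heapq.heappop(heap)
        yield score, order[p], order[q]
        q = first_valid(p, q + 1)  # next best valid pair of the same object
        if q < n:
            heapq.heappush(heap, (values[order[q]] - values[order[p]], p, q))
```

Because the jump is a single array lookup, creating the next valid pair costs O(1) plus one heap operation, in line with the O(log N) per-pair bound stated for the non-chromatic source.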
Experiments
• K-closest pairs queries [Corral et al., SIGMOD 2000]
• Data size: two datasets, each containing 100K objects
• k: 10
Experiments
• Naive: join the dataset with itself using a nested loop (block nested loop for the external memory algorithm)
• Scoring function:
  • Local scoring function is either sum or absolute difference (chosen randomly)
  • Global scoring function is a weighted aggregate (weights are chosen randomly and negative weights are allowed)
Complexity
Internal memory algorithm =
External memory algorithm =
• d = number of local scoring functions involved
• N = total number of objects
• V = total number of valid pairs (at most N^2)
• M = internal memory used by the algorithm
• B = number of entries one disk page can store