
A Unified Approach for Computing Top-k Pairs in Multidimensional Space

A Unified Approach for Computing Top-k Pairs in Multidimensional Space. Presented by: Muhammad Aamir Cheema (University of New South Wales, Australia). Joint work with Xuemin Lin (University of New South Wales, Australia), Haixun Wang (Microsoft Research Asia), Jianmin Wang (Tsinghua University, China), and Wenjie Zhang (University of New South Wales, Australia).




  1. A Unified Approach for Computing Top-k Pairs in Multidimensional Space
     • Presented by: Muhammad Aamir Cheema¹
     • Joint work with Xuemin Lin¹, Haixun Wang², Jianmin Wang³, Wenjie Zhang¹
     • ¹ University of New South Wales, Australia
     • ² Microsoft Research Asia
     • ³ Tsinghua University, China

  2. Introduction
     • Top-k Pairs Query: given a scoring function f() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.
     • Examples:
       • k-closest pairs: f(ou,ov) = dist(ou,ov); answer (k=1) = (o1,o2)
       • f(ou,ov) = (ou.x + ov.x) + (ou.y + ov.y); answer (k=1) = (o4,o5)
       • k-furthest pairs: f(ou,ov) = -dist(ou,ov); answer (k=1) = (o2,o4)
     [Figure: objects o1 to o5 plotted against the x-axis and y-axis]
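The three example queries on this slide can be reproduced with a brute-force sketch. The coordinates below are hypothetical (the slide's figure did not survive in the transcript) and were chosen only so that the answers agree with those stated on the slide. `top_k_pairs` is an illustrative helper that enumerates every pair; it is not the algorithm the talk proposes.

```python
import heapq
from itertools import combinations
from math import dist

def top_k_pairs(objects, score, k):
    """Brute force: score all N*(N-1)/2 pairs and keep the k smallest."""
    pairs = combinations(sorted(objects), 2)
    return heapq.nsmallest(
        k, pairs, key=lambda p: score(objects[p[0]], objects[p[1]])
    )

# Hypothetical coordinates, not taken from the slide.
objects = {
    "o1": (5.0, 6.0), "o2": (6.0, 7.0), "o3": (4.0, 3.5),
    "o4": (2.0, 0.5), "o5": (1.0, 2.0),
}

closest = top_k_pairs(objects, dist, 1)
coord_sum = top_k_pairs(objects, lambda a, b: (a[0] + b[0]) + (a[1] + b[1]), 1)
furthest = top_k_pairs(objects, lambda a, b: -dist(a, b), 1)
print(closest, coord_sum, furthest)
```

Negating the distance turns the k-furthest pairs query into a smallest-score query, which is why one routine covers all three cases.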

  3. Related Work
     • K-Closest Pairs Queries
       • Computational geometry [M. Smid, Handbook on Comp. Geometry]
       • Database community: [Hjaltason et al., SIGMOD 1998], [Corral et al., SIGMOD 2000], [Yang et al., IDEAS 2002], [Shan et al., SSTD 2003]
     • K-Furthest Pairs Queries: [Supowit, SODA 1990], [Katoh et al., IJCGA 1995], [Corral et al., DKE 2004]
     • Top-k Queries
       • Fagin's Algorithm [Fagin, PODS 1996]
       • Threshold Algorithm [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
       • No Random Access Algorithm [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]

  4. Motivation
     • No existing work for more general queries
       • Other Lp distances (e.g., Manhattan distance)?
       • More general scoring functions:
         SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
       • Chromatic queries:
         SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager <> b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
     • No existing unified algorithm
       • One framework that answers a broad class of top-k pairs queries

  5. Problem Definition (Preliminaries)
     • Monotonic function
       • f() is monotonic if f(x1,…,xN) ≤ f(y1,…,yN) whenever xi ≤ yi for every 1 ≤ i ≤ N
       • Examples:
         • f(x1,…,xN) = x1 + x2 + … + xN (summation)
         • f(x1,…,xN) = (x1 + x2 + … + xN) / N (average)

  6. Problem Definition (Preliminaries)
     • Loose monotonic function
       • s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
         • for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
         • for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases
     • Loose monotonic functions are more general than monotonic functions
     [Figure: number line from -∞ to ∞ with a fixed x and varying y, illustrating s1(x,y) = |x - y| and s2(x,y) = x + y]
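The two conditions can be spot-checked numerically. The sketch below samples y on an integer grid for one fixed x, so it is evidence rather than a proof; the helper names and the `wave` counterexample are illustrative and do not appear on the slides.

```python
import math

def _monotone(values):
    """True if the sequence is non-decreasing or non-increasing."""
    steps = list(zip(values, values[1:]))
    return all(a <= b for a, b in steps) or all(a >= b for a, b in steps)

def looks_loose_monotonic(s, x, ys):
    """Sampled check of both conditions for one fixed x (not a proof)."""
    right = [s(x, y) for y in sorted(ys) if y > x]                # y increasing
    left = [s(x, y) for y in sorted(ys, reverse=True) if y < x]   # y decreasing
    return _monotone(right) and _monotone(left)

ys = range(-5, 10)
abs_diff = lambda x, y: abs(x - y)   # s1 from the slide
total = lambda x, y: x + y           # s2 from the slide
wave = lambda x, y: math.sin(y)      # hypothetical non-example

results = [looks_loose_monotonic(s, 2, ys) for s in (abs_diff, total, wave)]
print(results)  # [True, True, False]
```

Note that |x - y| is not monotonic in the sense of the previous slide, since |2 - 1| = 1 exceeds |2 - 2| = 0 even though (2,1) ≤ (2,2) componentwise, yet it passes the loose monotonic check.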

  7. Problem Definition
     • Return the k pairs of objects with the smallest scores.
     • SCORE(a,b) = f( s1(a,b), …, sd(a,b) )
     • si() is called a local scoring function and can be any loose monotonic function of the user's choice.
     • f() is called the global scoring function and can be any monotonic function that involves an arbitrary set of attributes.
     • Example: s1(a,b) = |a.sold - b.sold|, s2(a,b) = -|a.salary - b.salary|, f() = s1(a,b) + s2(a,b)
       SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
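As a minimal sketch, the example query can be evaluated by brute force with the local and global scoring functions written out separately. The AGENT rows below are hypothetical, and the exhaustive enumeration stands in for the query semantics only, not for the proposed algorithm.

```python
import heapq
from itertools import combinations

# Hypothetical AGENT rows; the slides do not give any data.
AGENTS = [
    {"id": 1, "sold": 10, "salary": 50},
    {"id": 2, "sold": 12, "salary": 90},
    {"id": 3, "sold": 30, "salary": 55},
    {"id": 4, "sold": 11, "salary": 52},
]

def s1(a, b):
    """Local score on 'sold': absolute difference."""
    return abs(a["sold"] - b["sold"])

def s2(a, b):
    """Local score on 'salary': negated absolute difference."""
    return -abs(a["salary"] - b["salary"])

def f(*local_scores):
    """Global score: summation, a monotonic function."""
    return sum(local_scores)

def score(a, b):
    return f(s1(a, b), s2(a, b))

def top_k_pairs(rows, k):
    """Brute-force equivalent of the SQL: all pairs with a.id < b.id, k smallest."""
    pairs = combinations(sorted(rows, key=lambda r: r["id"]), 2)
    best = heapq.nsmallest(k, pairs, key=lambda p: score(*p))
    return [(a["id"], b["id"], score(a, b)) for a, b in best]

print(top_k_pairs(AGENTS, 2))  # [(1, 2, -38), (2, 4, -37)]
```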

  8. Problem Definition
     • Return the k pairs of objects with the smallest scores among the valid pairs.
     • Chromatic queries: let each object be assigned a color.
       • Homochromatic queries: pairs containing objects of the same color
       • Heterochromatic queries: pairs containing objects of different colors
     • Non-chromatic query:
       SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
     • Heterochromatic query:
       SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager <> b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
     • Homochromatic query:
       SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager = b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
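The notion of a valid pair can be illustrated with a predicate passed to a brute-force evaluator. In this sketch the `manager` attribute plays the role of the color, as in the SQL on the slide; the rows and helper names are hypothetical.

```python
import heapq
from itertools import combinations

# Hypothetical AGENT rows, listed in increasing id order.
AGENTS = [
    {"id": 1, "sold": 10, "salary": 50, "manager": "A"},
    {"id": 2, "sold": 12, "salary": 90, "manager": "A"},
    {"id": 3, "sold": 30, "salary": 55, "manager": "B"},
    {"id": 4, "sold": 11, "salary": 52, "manager": "B"},
]

def score(a, b):
    return abs(a["sold"] - b["sold"]) - abs(a["salary"] - b["salary"])

def homochromatic(a, b):
    return a["manager"] == b["manager"]

def heterochromatic(a, b):
    return a["manager"] != b["manager"]

def top_k_valid_pairs(rows, k, valid):
    """Keep only valid pairs, then return the k with the smallest scores."""
    pairs = (p for p in combinations(rows, 2) if valid(*p))
    best = heapq.nsmallest(k, pairs, key=lambda p: score(*p))
    return [(a["id"], b["id"], score(a, b)) for a, b in best]

print(top_k_valid_pairs(AGENTS, 1, homochromatic))    # [(1, 2, -38)]
print(top_k_valid_pairs(AGENTS, 2, heterochromatic))  # [(2, 4, -37), (2, 3, -17)]
```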

  9. Contributions
     • Unified algorithm (internal and external)
       • k-closest pairs, k-furthest pairs and variants (any Lp distance)
       • queries involving any arbitrary subset of attributes
       • chromatic and non-chromatic queries
       • skyline pairs queries and rank-based top-k pairs queries
     • No pre-built indexes required
       • efficiently builds a simple data structure on the fly
       • can answer queries involving filtering conditions on objects, e.g.:
         SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.age > 40 AND b.age > 40 ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
     • Known memory requirement
       • existing R-tree based approaches may require arbitrarily large heaps
       • our algorithm requires O(k) space + 2d buffer pages
     • Efficient
       • Theoretically optimal for d ≤ 2
       • Experimentally

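  10. Framework
      • Top-k algorithms (e.g., FA, TA, NRA) read from d sources, one per local scoring function: s1(a,b), s2(a,b), …, sd(a,b)
      • The sources are combined by the global scoring function f( s1(a,b), s2(a,b), …, sd(a,b) )
      • How to efficiently create and maintain these sources?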

  11. Creating/maintaining sources
      • Naïve approach
        • Create all possible pairs: O(N^2)
        • Sort them according to their local scores: O(N^2 log N)
        • Space requirement: O(N^2)
      • Features of our approach
        • Optimal internal memory algorithm
          • requires O(N) space
          • returns the first pair in O(N log N)
          • each next best pair is returned in O(log N)
        • Optimal external memory algorithm
          • B = number of elements that can be stored in one disk page
          • M = internal memory used (minimum M = 2B)
          • returns the first pair in O((N/B) log_{M/B}(N/B))
          • each next best pair is returned in O(log_{M/B}(N/B))

  12. Creating/maintaining sources
      • Initialize
        • sort the objects
        • for each object ou
          • create its best pair (ou,ov)
          • insert (ou,ov) in the heap
      • getNextPair()
        • report the top pair (ou,ov) of the heap
        • create the next best pair of ou
        • insert the new pair in the heap and delete (ou,ov)
      [Figure: example for s(x,y) = |x - y| with objects o1 to o6 and values 6, 12, 14, 15, 20, 30]
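The Initialize and getNextPair() steps can be sketched for s(x,y) = |x - y| as follows. Two assumptions are made here that the slide does not state: each object is paired only with objects to its right in sorted order, so that every unordered pair is produced exactly once, and the values 6, 12, 14, 15, 20, 30 are assigned to o1 through o6 in that order. The class name `PairSource` is illustrative.

```python
import heapq

class PairSource:
    """Emits pairs in ascending order of s(x, y) = |x - y| using an O(N) heap."""

    def __init__(self, values):
        # Initialize: sort the objects by value.
        self.objs = sorted(values.items(), key=lambda kv: kv[1])
        self.heap = []
        # Best pair of the object at index i is its right neighbour (i, i + 1).
        for i in range(len(self.objs) - 1):
            self._push(i, i + 1)

    def _push(self, i, j):
        if j < len(self.objs):
            s = abs(self.objs[i][1] - self.objs[j][1])
            heapq.heappush(self.heap, (s, i, j))

    def get_next_pair(self):
        """Report the top pair, then insert the next best pair of the same ou."""
        if not self.heap:
            return None
        s, i, j = heapq.heappop(self.heap)
        self._push(i, j + 1)  # |x_i - x_j| grows as j moves right
        return self.objs[i][0], self.objs[j][0], s

# Values from the slide; their assignment to o1..o6 is an assumption.
VALUES = {"o1": 6, "o2": 12, "o3": 14, "o4": 15, "o5": 20, "o6": 30}
source = PairSource(VALUES)
first_three = [source.get_next_pair() for _ in range(3)]
print(first_three)  # [('o3', 'o4', 1), ('o2', 'o3', 2), ('o2', 'o4', 3)]
```

Because the heap never holds more than one pair per object, this sketch stays within the O(N) space and O(log N) per-pair cost quoted on the previous slide.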

  13. Homochromatic Queries
      [Figure: colored objects o1 to o6 on a number line with values 6, 12, 14, 15, 20, 30]

  14. Heterochromatic Queries
      • Let (ou,ov) be the pair
      • ox = the object next to ov
      • If ou and ox have different colors
        • (ou,ox) is the next best pair
      • else
        • oy = the adjacent object of ox
        • (ou,oy) is the next best pair
      [Figure: colored objects o1 to o6 on a number line with values 6, 12, 14, 15, 20, 30]
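One way to read the rule above is sketched below for s(x,y) = |x - y|. It interprets "the adjacent object of ox" as the next object whose color differs from that of ox, found through a precomputed `next_diff` array; that interpretation, the rightward-only pairing, and the colors assigned to the objects are all assumptions, since the slide's figure is not available.

```python
import heapq

def heterochromatic_pairs(objects):
    """Yield (ou, ov, score) for different-colored pairs, smallest |x - y| first.

    objects: list of (name, value, color) tuples.
    """
    objs = sorted(objects, key=lambda o: o[1])
    n = len(objs)
    color = [o[2] for o in objs]

    # next_diff[j] = first index after j whose color differs from color[j] (n if none).
    next_diff = [n] * n
    for j in range(n - 2, -1, -1):
        next_diff[j] = j + 1 if color[j + 1] != color[j] else next_diff[j + 1]

    heap = []

    def push(i, j):
        if j < n:
            heapq.heappush(heap, (abs(objs[i][1] - objs[j][1]), i, j))

    for i in range(n):
        push(i, next_diff[i])            # best valid partner to the right of ou

    while heap:
        s, i, j = heapq.heappop(heap)
        yield objs[i][0], objs[j][0], s
        x = j + 1                        # ox: the object next to ov
        if x < n and color[x] == color[i]:
            x = next_diff[x]             # oy: skip the run that shares ou's color
        push(i, x)

# Values from the slide; name assignment and colors are hypothetical.
OBJECTS = [
    ("o1", 6, "red"), ("o2", 12, "red"), ("o3", 14, "blue"),
    ("o4", 15, "red"), ("o5", 20, "blue"), ("o6", 30, "blue"),
]
pairs = list(heterochromatic_pairs(OBJECTS))
print(pairs[:3])  # [('o3', 'o4', 1), ('o2', 'o3', 2), ('o4', 'o5', 5)]
```

The skip costs constant time per pair because `next_diff` is built once during initialization.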

  15. Experiments
      • K-closest pairs queries [Corral et al., SIGMOD 2000]
      • Data size: two datasets, each containing 100K objects
      • k: 10

  16. Experiments
      • Naive: join the dataset with itself using a nested loop (block nested loop for the external memory algorithm)
      • Scoring function:
        • Local scoring function is either sum or absolute difference (chosen randomly)
        • Global scoring function is a weighted aggregate (weights are chosen randomly and negative weights are allowed)

  17. Number of Objects

  18. Number of attributes (d)

  19. Value of k

  20. Number of colors

  21. Thanks

  22. Complexity
      • Internal memory algorithm =
      • External memory algorithm =
      • d = number of local scoring functions involved
      • N = total number of objects
      • V = total number of valid pairs (N^2 at most)
      • M = internal memory used by the algorithm
      • B = the number of entries one disk page can store
