
Computer Science and Engineering

Ranking Complex Objects in a Multi-dimensional Space. Wenjie Zhang, Ying Zhang, Xuemin Lin. The University of New South Wales, Australia.


Presentation Transcript


  1. Computer Science and Engineering Ranking Complex Objects in a Multi-dimensional Space Wenjie Zhang, Ying Zhang, Xuemin Lin The University of New South Wales, Australia

  2. Ranking Queries • Retrieve a limited number of the best-qualified results from a large data set • Covers a broad range of queries: ranking by value, similarity, relevance, k nearest neighbors, etc. • What does "best" mean? • If a ranking function over certain dimensions is specified: top-k query • If no ranking function is available: skyline, dominating, minimal regret ratio, etc.
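A top-k query with a given ranking function can be sketched in a few lines. This is only an illustration: the player records, the `linear_score` weights, and the function name `top_k` are hypothetical and do not come from the slides.

```python
import heapq

def top_k(records, score, k):
    """Return the k records with the highest score, best first."""
    return heapq.nlargest(k, records, key=score)

# Hypothetical records: (name, points, rebounds).
players = [
    ("p1", 20.4, 10.4),
    ("p2", 19.5, 9.9),
    ("p3", 25.0, 4.0),
    ("p4", 12.0, 12.5),
]

def linear_score(record):
    # Assumed ranking function: equal weight on points and rebounds.
    return 0.5 * record[1] + 0.5 * record[2]

best_two = top_k(players, linear_score, 2)
print([name for name, _, _ in best_two])  # ['p1', 'p2']
```

Changing the weights in `linear_score` can change the answer, which is why the skyline and related operators are used when no ranking function is available.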

  3. Complex Objects • Objects that cannot be modeled by a single d-dimensional value • Focus of this talk: • Uncertain objects: multiple d-dimensional instances per object, with exclusive semantics • Multi-valued objects: multiple d-dimensional instances per object, with inclusive semantics • Spatial + text

  4. Applications • [Table columns: Ranking Dimension, Probability] • How to get the top-2 highest temperatures?

  5. Applications • Who is a better player according to #rebounds and #points?

  6. Applications • How to get the mobile holder nearest to a location P at time t?

  7. Applications • 11 restaurants with spatial locations • (t1, t2, t3) = (sushi, seafood, coffee) are the textual keywords of each restaurant • Find the restaurant nearest to the query with sushi & seafood • [Figure labels: p1 (t1,t2), p2 (t1,t2), p3 (t1,t3), p4 (t1), p5 (t2,t3), p6 (t2,t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)]

  8. Why is ranking complex objects hard? • Interpreting the semantics of ranking is difficult • E.g., top-k ranking over uncertain objects has been studied since 2007, with more than 10 ranking models proposed • Computationally expensive • Must handle multiple instances and extra information (e.g., text)

  9. Uncertain Data • [Table columns: Ranking Dimension, Probability] • How to get the top-2 highest temperatures?

  10. Top-k Ranking Queries • U-topk: [Soliman, Ilyas, Chang, ICDE07], [Yi, Li, Srivastava, Kollios, 08] • U-kRanks: [Soliman, Ilyas, Chang, ICDE07], [Lian, Chen, 08] • PT-k: [Hua, Pei, Zhang, Lin, SIGMOD08] • Global-topk: [Zhang, Chomicki, 08] • Expected ranks: [Cormode, Li, Yi, ICDE09] • Unified Ranking Approach: [Li, Saha, Deshpande, VLDB09] • Representative top-k: [Ge, Zdonik, Madden, SIGMOD09] • Top-k with Data Cleaning: [Mo, Cheng, Cheung, Li, Yang, ICDE13] • Top-k Oracle: [Song, Li, Ge, ICDE13]

  11. U-topk

  12. U-kRanks

  13. PT-k

  14. Global Top-k • Based on PT-k: return the k tuples with the highest top-k probabilities • Top-2 answer: (R3, R5)
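The idea of returning the k tuples with the highest top-k probability can be sketched by brute force over possible worlds. This sketch assumes a simplified model in which every tuple exists independently with its own probability (no exclusion rules) and scores are distinct; the tuple names, scores, and probabilities are hypothetical and are not the data behind the (R3, R5) answer on the slide.

```python
from itertools import product

def topk_probabilities(tuples, k):
    """tuples: list of (name, score, existence_probability).

    Returns {name: probability that the tuple is among the k
    highest-scoring tuples of a randomly drawn possible world}.
    """
    probs = {name: 0.0 for name, _, _ in tuples}
    # Enumerate all 2^n possible worlds (exponential, illustration only).
    for present in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        for (_, _, p), here in zip(tuples, present):
            world_prob *= p if here else 1.0 - p
        world = [t for t, here in zip(tuples, present) if here]
        world.sort(key=lambda t: t[1], reverse=True)
        for name, _, _ in world[:k]:
            probs[name] += world_prob
    return probs

def global_topk(tuples, k):
    """Return the k tuple names with the highest top-k probability."""
    probs = topk_probabilities(tuples, k)
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Hypothetical uncertain readings: (name, score, probability).
readings = [("t1", 30, 0.3), ("t2", 28, 0.9), ("t3", 25, 0.8), ("t4", 20, 1.0)]
print(global_topk(readings, 2))  # ['t2', 't3']
```

Note that t1 has the highest score but a low existence probability, so it is not part of the answer under this semantics.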

  15. Expected Rank • The expected rank of a value across all possible worlds

  16. How to evaluate if a ranking model is good? • Check whether the properties of the original operator are retained • Top-k operator • Value-invariance: ranking is determined by the relative order, not by the scores themselves • Exact k: return exactly k results • Unique rank: each item has one and only one rank position • Containment: the top-(k+1) list contains the top-k list • Stability: if an element is in the top-k list, it stays in the list after its score or probability is increased

  17. How to evaluate if a ranking model is good ?

  18. Unified Ranking Approach

  19. Representative Top-k • Based on top-k • Return samples of the distribution • [Figure annotations: much higher score with similar prob.; higher scores with large total prob.]

  20. Top-k Oracle • Select an arbitrary number of top-k results (sorted by score) to form a top-k oracle • Query evaluation is then executed against this oracle • Any previous top-k semantics can be plugged in

  21. Top-k with Data Cleaning • Cleaning: acquire exact/newest information for uncertain records at extra cost. E.g., to collect the sensor reading again • To select the entities to be cleaned with limited budget to achieve highest quality. • Any top-k semantics could be plugged in

  22. What if the ranking function is not given? • Top-k: a pre-given function specifies how to rank • What if no ranking function is available? Use the skyline

  23. Skylines • Skyline: candidates for the best options in multi-criteria decision applications • n-dimensional numeric space D = (D1, …, Dn) • On each dimension, a user preference ≺ is defined • Given two points, u dominates v (u ≺ v) if • ∀ Di (1 ≤ i ≤ n), u.Di ⪯ v.Di • ∃ Dj (1 ≤ j ≤ n), u.Dj ≺ v.Dj • Skyline: the points not dominated by any other point
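The dominance definition translates directly into code. The sketch below assumes that smaller values are preferred on every dimension and uses a quadratic scan rather than any indexed algorithm; the sample points are hypothetical.

```python
def dominates(u, v):
    """u dominates v: u is no worse on every dimension and strictly
    better on at least one (assumption: smaller values are preferred)."""
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better

def skyline(points):
    """Return the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical 2-dimensional points.
points = [(1, 4), (2, 3), (4, 1), (3, 5), (5, 5)]
print(skyline(points))  # [(1, 4), (2, 3), (4, 1)]
```

Here (3, 5) and (5, 5) are both dominated by (1, 4), while the three remaining points are pairwise incomparable.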

  24. Skylines • A skyline building is either close to the viewing point, or higher than those in front of it.

  25. Probabilistic Skyline • Consider game-by-game statistics • Conventional methods compute the skyline on an aggregate, e.g., the mean • Limitations • Affected by outliers • Lose the data distribution • Probabilistic skylines [Pei, Jiang, Lin, Yuan, VLDB07] • An instance has a probability to represent the object • An object has a probability to be in the skyline

  26. Uncertain Objects • An uncertain object is represented as • Continuous case: a probability density function (PDF) • Discrete case: a set of instances, each with a probability to appear • U = {u1, …, un}, 0 < p(ui) ≤ 1 and Σ1≤i≤n p(ui) = 1 • Without loss of generality, assume equal probability, p(ui) = 1 / |U|

  27. Probabilistic Skyline • Assume each instance takes equal probability (0.5) to appear. • Possible world: W = {ai, bj, ck} (i, j, k = 1 or 2) with probability Pr(W) = 0.5 × 0.5 × 0.5 = 0.125 • ΣW∈Ω Pr(W) = 1, where Ω is the set of all possible worlds. • Skyline of a possible world • SKY({a1, b1, c1}) = {a1, b1} • Skyline probability • Pr(B) = 4 × 0.125 = 0.5 • Pr(A) = 1 • Pr(C) = 0
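The possible-world computation can be reproduced by brute-force enumeration. The instance coordinates below are hypothetical (the figure with the real ones is not part of the transcript); they were chosen so that the resulting probabilities match the values on the slide, and smaller values are assumed to be preferred.

```python
from itertools import product

def dominates(u, v):
    """u dominates v, assuming smaller values are preferred."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))

def skyline_probabilities(objects):
    """objects: {name: [instance, ...]}, instances equally likely.

    Enumerates every possible world (one instance per object) and adds
    the world's probability to each object whose chosen instance is
    not dominated within that world.
    """
    names = list(objects)
    world_prob = 1.0
    for name in names:
        world_prob /= len(objects[name])
    probs = {name: 0.0 for name in names}
    for world in product(*(objects[name] for name in names)):
        for name, inst in zip(names, world):
            if not any(dominates(other, inst) for other in world):
                probs[name] += world_prob
    return probs

# Hypothetical instances a1, a2, b1, b2, c1, c2.
objects = {
    "A": [(1, 4), (2, 3)],
    "B": [(4, 1), (3, 5)],
    "C": [(5, 5), (6, 4)],
}
print(skyline_probabilities(objects))  # {'A': 1.0, 'B': 0.5, 'C': 0.0}
```

With these coordinates, b1 is never dominated and b2 always is, which gives the 4 × 0.125 = 0.5 for B; the enumeration is exponential in the number of objects and is meant only to illustrate the semantics.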

  28. Probabilistic Skyline Brand-Agg (20.39, 2.67, 10.37) Ewing-Agg (19.48, 1.71, 9.91)

  29. Anything missed? • A desired property of skylines • They provide a minimum candidate set for all monotonic scoring functions • If an object is preferred by some scoring function, it is in the skyline; if an object is not preferred by any scoring function, it is not in the skyline • Probabilistic skylines miss this property • Borrow an idea from an important statistical tool: stochastic orders

  30. Expected Utility & Stochastic Order • Expected utility principle: given a set 𝒰 of uncertain objects and a decreasing utility function f, select U in 𝒰 to maximize E[f(U)]. • Stochastic order: given a family ℱ of utility functions, U ≺ℱ V if E[f(U)] ≥ E[f(V)] for each f in ℱ. • Decreasing multiplicative functions: ℱ = { f(x1, …, xd) = Π1≤i≤d fi(xi) }, where each fi is nonnegative and decreasing. • Lower orthant order: the stochastic order defined over the family of decreasing multiplicative functions.
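For a discrete uncertain object, the expected utility is a probability-weighted sum over its instances. The sketch below assumes that a multiplicative utility is a product of one nonnegative decreasing function per dimension; the particular factors (`exp(-x)` and `1 / (1 + x)`) and the two objects are hypothetical.

```python
import math

def expected_utility(instances, f):
    """instances: list of (point, probability); returns E[f(U)]."""
    return sum(p * f(x) for x, p in instances)

def multiplicative(factors):
    """Build f(x) = f1(x1) * ... * fd(xd) from per-dimension factors."""
    def f(x):
        return math.prod(fi(xi) for fi, xi in zip(factors, x))
    return f

# Assumed factors: both are nonnegative and decreasing for x >= 0.
f = multiplicative([lambda x: math.exp(-x), lambda x: 1.0 / (1.0 + x)])

# Hypothetical uncertain objects with two equally likely instances each.
U = [((1, 1), 0.5), ((2, 2), 0.5)]
V = [((2, 2), 0.5), ((3, 3), 0.5)]

print(expected_utility(U, f) > expected_utility(V, f))  # True
```

U is preferred here because each of its instances is at least as good as the matching instance of V and the utility is decreasing; a single comparison like this covers only one f, whereas the stochastic order requires the inequality for every f in the family.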

  31. Example • Utility function: • : nonnegative decreasing • : nonnegative decreasing e.g. • B never preferred by the expected utility principle! • 2. Psky(A) = 1, Psky (B) = 0.5, Psky (C) = 0.01

  32. Stochastic Order I: lower orthant order • Given U and V, U stochastically dominates V (U ≺sd V) if U.cdf(x) ≥ V.cdf(x) for every x and there exists y such that U.cdf(y) > V.cdf(y). • U.cdf(x): probability mass of U in the rectangular region R((0, 0, …, 0), x); see the shaded region. • Stochastic skyline: the objects in the given set of uncertain objects that are not stochastically dominated by any other object. • Problem statement: efficiently compute the stochastic skyline in the discrete case.

  33. Minimality of stochastic skyline Stochastic skyline removes all objects not preferred by any non-negative decreasing functions!

  34. Testing if U ≺sd V • Violation point: a point x in R^d_+ is a violation point regarding U ≺sd V if U.cdf(x) < V.cdf(x). • Testing algorithm: if there are no violation points, then U ≺sd V. • Testing the instances alone is not enough.

  35. Reduce to Grid Points • Test if U.cdf ≥ V.cdf against grid points only (see (a)). • Testing the switching grid points only (see solid lines (b)).
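A naive version of the grid-point test can be sketched as follows. It assumes equally likely instances and builds the grid from the coordinates of all instances of U and V on each dimension, which is an assumption of this sketch; it checks every grid point and does not implement the switching-point reduction from the slide.

```python
from fractions import Fraction
from itertools import product

def cdf(instances, x):
    """Probability mass of the object inside the box R((0, ..., 0), x),
    assuming equally likely instances."""
    inside = sum(1 for u in instances if all(a <= b for a, b in zip(u, x)))
    return Fraction(inside, len(instances))

def stochastically_dominates(U, V):
    """True if U.cdf >= V.cdf at every grid point and > at some point."""
    d = len(U[0])
    # Both cdfs are step functions that change only at these coordinates.
    axes = [sorted({p[i] for p in U + V}) for i in range(d)]
    strictly_larger_somewhere = False
    for x in product(*axes):
        cu, cv = cdf(U, x), cdf(V, x)
        if cu < cv:
            return False  # x is a violation point
        if cu > cv:
            strictly_larger_somewhere = True
    return strictly_larger_somewhere

# Hypothetical objects.
U = [(1, 1), (2, 2)]
V = [(2, 2), (3, 3)]
A = [(1, 3), (3, 1)]
B = [(2, 2)]
print(stochastically_dominates(U, V))  # True
print(stochastically_dominates(A, B))  # False: (2, 2) is a violation point
```

In the second example the violation point (2, 2) is not an instance of A, which illustrates why testing instances alone is not enough. The grid has up to (|U| + |V|)^d points, so this brute-force check is exponential in the dimensionality d.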

  36. Algorithm • Given a rectangular region R(x, y), if U.cdf(x) ≥ V.cdf(y), then there is no violation point in R(x, y). • Partition-based testing algorithm: • Get the switching points • Initial check • Iteratively partition the grid to discard non-promising sub-grids

  37. Complexity • The algorithm runs in O(dm log m + m^d (T(Uartree) + T(Vartree))) time, where m is the number of instances in V. • The testing problem is NP-complete regarding d. • Proof idea: convert (the decision version of) the minimal set cover problem to a special case of the testing problem.

  38. Usual Order • The lower orthant order helps retrieve minimum candidate sets for monotonic multiplicative functions. • What about more general monotonic functions, such as linear functions?

  39. Usual Order • [Figure regions: 2 ≤ r ≤ 3, l ≤ 1; r ≤ 2, l ≤ 3; r ≤ 3, l ≤ 3] • E[f(A)], E[f(B)], E[f(C)]?

  40. Usual Order • Lower Set:

  41. Usual Order

  42. General Stochastic Skyline

  43. Verification Algorithm • Verification: determine whether U ≺uo V • Naively: test U.cdf(S) ≥ V.cdf(S) against every lower set S (there are infinitely many lower sets) • From infinite to finite: test all subsets of V (still exponential)

  44. Max-flow • Given a road network, the weight along an edge shows its capacity. • Question: what is the maximum flow from source to destination? • [Figure: network with edge capacities]

  45. Max flow • Max-flow / min-cut theorem: for any network with a single source and a single destination node, the maximum flow from source to destination equals the minimum cut value over all cuts in the network. • Ford-Fulkerson algorithm
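A compact way to compute a maximum flow is the Ford-Fulkerson method with breadth-first search for augmenting paths (the Edmonds-Karp variant). The sketch below is a generic implementation, not the verification algorithm of the talk, and the small network with its capacities is hypothetical rather than the one in the slide's figure.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """capacity: {(u, v): c} for directed edges; returns the max flow value."""
    if source == sink:
        return 0
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge for undoing flow
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical network.
network = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(network, "s", "t"))  # 5
```

The result agrees with the max-flow / min-cut theorem: the cut separating s from all other nodes has capacity 3 + 2 = 5, and no cut in this network is smaller.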

  46. Verification • Time Complexity: O(tG + mnlogm) • tG: time to construct GU, V • m: number of arcs • n: number of nodes

  47. Verification Compression: R-tree based level-by-level dominance checking

  48. Verification FD: {(U1, V1), (U2, V2), (u1, v6), (u2, v6)} Step 1: get full dominance list FD

  49. Verification

  50. Framework • U ≺uo V (and U ≺lo V) is transitive: if U ≺uo V, then V can be removed, since U ≺uo W for any W such that V ≺uo W • Apply the standard filtering paradigm
