Computer Science and Engineering

Ranking Complex Objects in a Multi-dimensional Space • Wenjie Zhang, Ying Zhang, Xuemin Lin • The University of New South Wales, Australia

Ranking Queries • To retrieve a limited number of best-qualified results from a large set of data • A broad range of queries: ranking by value, similarity, relevance, k nearest neighbor, etc. • What does "best" mean? • A ranking function is specified over certain dimensions: top-k query • No ranking function is available: skyline, dominating, minimal regret ratio, etc.

Complex Objects • Objects that cannot be modeled by a single d-dimensional value • Focus of this talk: • Uncertain objects: multiple d-dimensional instances per object. Exclusive semantics • Multi-valued objects: multiple d-dimensional instances per object. Inclusive semantics. • Spatial + text

Applications • How to get the top-2 highest temperatures? • [Table columns: ranking dimension, probability]

Applications Who is a better player according to #rebounds and #points ?

Applications How to get the mobile holder nearest to a location P at time t ?

Applications • 11 restaurants with spatial locations • (t1, t2, t3) = (sushi, seafood, coffee) are the textual keywords of each restaurant • Find the restaurant nearest to the query with sushi & seafood • [Figure: p1 (t1,t2), p2 (t1,t2), p3 (t1,t3), p4 (t1), p5 (t2,t3), p6 (t2,t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)]

Why is ranking complex objects hard? • Interpreting the semantics of ranking • E.g., top-k ranking over uncertain objects has been studied since 2007, with more than 10 ranking models proposed • Computationally expensive • Must handle multiple instances and extra information (e.g., text)

Uncertain Data • How to get the top-2 highest temperatures? • [Table columns: ranking dimension, probability]

Top-k Ranking Queries • U-topk: [Soliman, Ilyas, Chang, ICDE07], [Yi, Li, Srivastava, Kollios, 08] • U-kRanks: [Soliman, Ilyas, Chang, ICDE07], [Lian, Chen, 08] • PT-k: [Hua, Pei, Zhang, Lin, SIGMOD08] • Global-topk: [Zhang, Chomicki, 08] • Expected ranks: [Cormode, Li, Yi, ICDE09] • Unified Ranking Approach: [Li, Saha, Deshpande, VLDB09] • Representative top-k: [Ge, Zdonik, Madden, SIGMOD09] • Top-k with Data Cleaning: [Mo, Cheng, Cheung, Li, Yang, ICDE13] • Top-k Oracle: [Song, Li, Ge, ICDE13]

U-topk

U-kRanks

PT-k

Global Top-k • Based on PT-k: return the k tuples with the highest top-k probabilities • Top-2 answer: (R3, R5)
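A minimal sketch of the top-k probability behind PT-k and Global Top-k, assuming independent tuples that each exist with a given probability. The tuple names, scores and probabilities below are hypothetical and are not the R1..R5 example from the slide; the brute-force enumeration of possible worlds is for illustration only.

```python
from itertools import product

# Hypothetical uncertain tuples: name -> (score, existence probability).
# Tuples are assumed independent; the values are illustrative only.
TUPLES = {"t1": (30, 0.3), "t2": (25, 0.9), "t3": (20, 0.5), "t4": (15, 0.8)}


def topk_probabilities(tuples, k):
    """Pr(tuple is among the k highest-scored tuples of a possible world)."""
    names = list(tuples)
    prob = {n: 0.0 for n in names}
    # Each possible world is one choice of present/absent per tuple.
    for present in product([False, True], repeat=len(names)):
        weight = 1.0
        for n, here in zip(names, present):
            p = tuples[n][1]
            weight *= p if here else 1.0 - p
        world = [n for n, here in zip(names, present) if here]
        world.sort(key=lambda n: tuples[n][0], reverse=True)
        for n in world[:k]:
            prob[n] += weight
    return prob


def global_topk(tuples, k):
    """Global Top-k: the k tuples with the highest top-k probability."""
    prob = topk_probabilities(tuples, k)
    return sorted(prob, key=prob.get, reverse=True)[:k]


def pt_k(tuples, k, threshold):
    """PT-k: every tuple whose top-k probability reaches the threshold."""
    prob = topk_probabilities(tuples, k)
    return [n for n in tuples if prob[n] >= threshold]


print(global_topk(TUPLES, 2))
print(pt_k(TUPLES, 2, 0.35))
```

Note that t1 has the highest score but a low existence probability, so it is not part of the Global Top-2 answer on this data.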

Expected Rank • The expected rank of a value across all possible worlds
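A minimal sketch of expected ranks by enumerating possible worlds. It assumes independent tuples and the following convention, which is not spelled out on the slide: in a world that contains the tuple, its rank is the number of present tuples with a higher score (ranks start at 0); in a world that does not contain it, its rank is the size of that world. The data is hypothetical.

```python
from itertools import product

# Hypothetical tuples: name -> (score, existence probability), independent.
TUPLES = {"t1": (30, 0.3), "t2": (25, 0.9), "t3": (20, 0.5), "t4": (15, 0.8)}


def expected_ranks(tuples):
    """Expected rank of every tuple over all possible worlds."""
    names = list(tuples)
    er = {n: 0.0 for n in names}
    for present in product([False, True], repeat=len(names)):
        weight = 1.0
        for n, here in zip(names, present):
            p = tuples[n][1]
            weight *= p if here else 1.0 - p
        world = [n for n, here in zip(names, present) if here]
        for n in names:
            if n in world:
                # Number of present tuples that beat n in this world.
                rank = sum(1 for m in world if tuples[m][0] > tuples[n][0])
            else:
                # Assumed convention: an absent tuple ranks after the whole world.
                rank = len(world)
            er[n] += weight * rank
    return er


def rank_by_expected_rank(tuples):
    """Tuples ordered from smallest (best) to largest expected rank."""
    er = expected_ranks(tuples)
    return sorted(er, key=er.get)


print(rank_by_expected_rank(TUPLES))
```

The top-k answer under this semantics is simply the first k entries of the returned list.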

How to evaluate if a ranking model is good? • Check whether the properties of the original operator are retained • Top-k operator • Value-invariance: ranking is determined by relative order, not by the scores themselves • Exact k: return exactly k results • Unique rank: each item has one and only one rank position • Containment: the top-(k+1) list contains the top-k list • Stability: if an element is in the top-k list, it should stay in the list after its score or probability is increased

How to evaluate if a ranking model is good ?

Unified Ranking Approach

Representative Top-k • Based on Top-k • Return samples of the distribution • [Figure annotations: much higher score with similar prob.; higher scores with large total prob.]

Top-k Oracle • Select an arbitrary number of top-k results (sort based on score) to form a top-k oracle • Query evaluation is then executed based on this Oracle • Any previous top-k semantics could be plugged-in

Top-k with Data Cleaning • Cleaning: acquire exact or up-to-date information for uncertain records at extra cost, e.g., by collecting the sensor reading again • Select the entities to clean under a limited budget so as to achieve the highest quality • Any top-k semantics could be plugged in

What if ranking function is not given ? • Top-k: a pre-given function to specify how to rank • What if ranking functions are not available ? --- Skyline

Skylines • Skyline: candidates for the best options in multi-criteria decision applications • n-dimensional numeric space D = (D1, …, Dn) • On each dimension, a user preference ≺ is defined • A point u dominates a point v (u ≺ v) if • ∀ Di (1 ≤ i ≤ n), u.Di ⪯ v.Di • ∃ Dj (1 ≤ j ≤ n), u.Dj ≺ v.Dj • Skyline: the points not dominated by any other point
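The dominance definition above can be sketched as follows, assuming that smaller values are preferred on every dimension (the slide leaves the preference ≺ abstract). The quadratic scan and the sample points are illustrative only.

```python
def dominates(u, v):
    """u dominates v: no worse on every dimension, strictly better on at least one."""
    no_worse = all(a <= b for a, b in zip(u, v))
    better = any(a < b for a, b in zip(u, v))
    return no_worse and better


def skyline(points):
    """Points that are not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical 2-dimensional points.
points = [(1, 4), (2, 3), (4, 1), (3, 5), (5, 5)]
print(skyline(points))  # [(1, 4), (2, 3), (4, 1)]
```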

Skylines • A skyline building is either close to the viewing point, or higher than those in front of it.

Probabilistic Skyline • Consider game-by-game statistics • Conventional methods compute the skyline on an aggregate, e.g., the mean • Limitations • Affected by outliers • Lose the data distribution • Probabilistic skylines [Pei, Jiang, Lin, Yuan, VLDB07] • An instance has a probability to represent the object • An object has a probability to be in the skyline

Uncertain Objects • An uncertain object is represented as • Continuous case: a probability density function (PDF) • Discrete case: a set of instances, each with a probability to appear • U = {u1, …, un}, 0 < p(ui) ≤ 1 and Σ_{1≤i≤n} p(ui) = 1 • Without loss of generality, assume equal probability, p(ui) = 1 / |U|

Probabilistic Skyline • Assume each instance takes equal probability (0.5) to appear • Possible world: W = {ai, bj, ck} (i, j, k = 1 or 2) with probability Pr(W) = 0.5 × 0.5 × 0.5 = 0.125 • Σ_{W∈Ω} Pr(W) = 1, where Ω is the set of all possible worlds • Skyline of a possible world • SKY({a1, b1, c1}) = {a1, b1} • Skyline probability • Pr(B) = 4 × 0.125 = 0.5 • Pr(A) = 1 • Pr(C) = 0
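A minimal sketch of the skyline probability computed by enumerating possible worlds. The instance coordinates are hypothetical: the slide's figure is not available, so they were chosen only to reproduce Pr(A) = 1, Pr(B) = 0.5 and Pr(C) = 0, assuming smaller values are preferred.

```python
from itertools import product

# Hypothetical instances (two per object, each with probability 0.5).
OBJECTS = {
    "A": [(1, 4), (2, 3)],  # a1, a2
    "B": [(4, 1), (3, 5)],  # b1, b2
    "C": [(5, 5), (6, 6)],  # c1, c2
}


def dominates(u, v):
    """u dominates v: no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def skyline_probabilities(objects):
    """Pr(object is in the skyline), summed over all possible worlds."""
    names = list(objects)
    prob = {n: 0.0 for n in names}
    # A possible world picks exactly one instance per object.
    for world in product(*(objects[n] for n in names)):
        weight = 1.0
        for n in names:
            weight *= 1.0 / len(objects[n])
        for n, inst in zip(names, world):
            if not any(dominates(other, inst) for other in world):
                prob[n] += weight
    return prob


print(skyline_probabilities(OBJECTS))  # {'A': 1.0, 'B': 0.5, 'C': 0.0}
```

With these coordinates the world {a1, b1, c1} has skyline {a1, b1}, matching the slide. Enumeration is exponential in the number of objects and only serves to make the definition concrete.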

Probabilistic Skyline Brand-Agg (20.39, 2.67, 10.37) Ewing-Agg (19.48, 1.71, 9.91)

Anything missed? • A desired property of skylines • Provides a minimum candidate set for all monotonic scoring functions • If an object is preferred by some scoring function, it is in the skyline; if an object is not preferred by any scoring function, it is not • Probabilistic skylines lack this property • Borrow an idea from an important statistical tool: stochastic orders

Expected Utility & Stochastic Order • Expected Utility Principle: given a set of uncertain objects and a decreasing utility function f, select the object U from the set that maximizes E[f(U)] • Stochastic Order: given a family ℱ of utility functions, U ≺ℱ V if E[f(U)] ≥ E[f(V)] for each f in ℱ • Decreasing Multiplicative Functions: ℱ = { f(x) = ∏_{1≤i≤d} fi(xi) }, where each fi is nonnegative and decreasing • Lower orthant order: the stochastic order defined over the family of decreasing multiplicative functions

Example • Utility function: • : nonnegative decreasing • : nonnegative decreasing e.g. • B never preferred by the expected utility principle! • 2. Psky(A) = 1, Psky (B) = 0.5, Psky (C) = 0.01

Stochastic Order I: lower orthant order • Given U & V, U stochastically dominates V (U ≺sd V) if U.cdf(x) ≥ V.cdf(x) for every x and there exists y such that U.cdf(y) > V.cdf(y) • U.cdf(x): probability mass of U in the rectangular region R((0,0,…,0), x); see the shaded region • Stochastic Skyline: the objects that are not stochastically dominated by any other object • Problem Statement: efficiently compute the stochastic skyline in the discrete case

Minimality of stochastic skyline Stochastic skyline removes all objects not preferred by any non-negative decreasing functions!

Testing if U ≺sd V • Violation point: a point x in R^d_+ is a violation point regarding U ≺sd V if U.cdf(x) < V.cdf(x) • Testing algorithm: if there are no violation points (and U ≠ V), then U ≺sd V • Testing only at the instances is not enough

Reduce to Grid Points • Test if U.cdf ≥ V.cdf against grid points only (see (a)). • Testing the switching grid points only (see solid lines (b)).
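A minimal sketch of the grid-point test, assuming equal instance probabilities and a grid built from the coordinates of both objects on every dimension. Since both cdfs are step functions that change only at those coordinates, checking the grid points is enough. This brute-force version visits every grid point and does not include the switching-point refinement; the sample objects are hypothetical.

```python
from fractions import Fraction
from itertools import product


def cdf(obj, x):
    """Probability mass of obj in the box from the origin to x."""
    inside = sum(1 for inst in obj if all(a <= b for a, b in zip(inst, x)))
    return Fraction(inside, len(obj))


def grid_points(u, v):
    """All grid points spanned by the instance coordinates of u and v."""
    d = len(u[0])
    axes = [sorted({inst[i] for inst in u + v}) for i in range(d)]
    return product(*axes)


def stochastically_dominates(u, v):
    """U ≺sd V: U.cdf >= V.cdf at every grid point and > at some grid point."""
    strict = False
    for x in grid_points(u, v):
        cu, cv = cdf(u, x), cdf(v, x)
        if cu < cv:
            return False  # x is a violation point
        if cu > cv:
            strict = True
    return strict


# Hypothetical objects, two instances each.
A = [(1, 1), (2, 2)]
B = [(2, 3), (3, 3)]
U = [(1, 3), (3, 1)]
V = [(2, 2), (4, 4)]
print(stochastically_dominates(A, B))  # True
print(stochastically_dominates(U, V))  # False: (2, 2) is a violation point
```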

Algorithm • Given a rectangular region R(x, y), if U.cdf(x) ≥ V.cdf(y), then there is no violation point in R(x, y) • Partition-based testing algorithm: • Get switching points • Initial check • Iteratively partition the grid to throw away non-promising sub-grids
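A simplified sketch of the partitioning idea, assuming equal instance probabilities. It uses only the pruning rule stated above (if U.cdf at the lower corner is at least V.cdf at the upper corner, the whole sub-grid is discarded) and halves the longest side otherwise. The switching-point step and the initial check are omitted, so this is not the full algorithm from the slides.

```python
from fractions import Fraction


def cdf(obj, x):
    """Probability mass of obj in the box from the origin to x."""
    inside = sum(1 for inst in obj if all(a <= b for a, b in zip(inst, x)))
    return Fraction(inside, len(obj))


def has_violation(u, v):
    """True if some grid point x has U.cdf(x) < V.cdf(x)."""
    d = len(u[0])
    axes = [sorted({inst[i] for inst in u + v}) for i in range(d)]

    def point(idx):
        return tuple(axes[i][j] for i, j in enumerate(idx))

    def search(lo, hi):
        # Pruning rule: every z in the sub-grid satisfies
        # U.cdf(z) >= U.cdf(lo) and V.cdf(hi) >= V.cdf(z).
        if cdf(u, point(lo)) >= cdf(v, point(hi)):
            return False
        if lo == hi:
            return True  # a single grid point where U.cdf < V.cdf
        # Split the longest side of the sub-grid into two halves.
        i = max(range(d), key=lambda j: hi[j] - lo[j])
        mid = (lo[i] + hi[i]) // 2
        left_hi = hi[:i] + (mid,) + hi[i + 1:]
        right_lo = lo[:i] + (mid + 1,) + lo[i + 1:]
        return search(lo, left_hi) or search(right_lo, hi)

    lo = tuple(0 for _ in range(d))
    hi = tuple(len(a) - 1 for a in axes)
    return search(lo, hi)


# Hypothetical objects, two instances each.
A = [(1, 1), (2, 2)]
B = [(2, 3), (3, 3)]
print(has_violation(A, B))  # False: no point contradicts A ≺sd B
print(has_violation(B, A))  # True
```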

Complexity • The algorithm runs in O(dm log m + m^d (T(Uartree) + T(Vartree))), where m is the number of instances in V • NP-complete regarding d • Convert (the decision version of) the minimal set cover problem to a special case of the testing problem

Usual Order • Lower orthant order helps retrieve minimum candidate sets for monotonic multiplicative functions • How about more general monotonic functions, like linear functions?

Usual Order 2 ≤ r ≤ 3, l≤ 1 r ≤ 2, l≤ 3 r ≤ 3, l ≤ 3 E[f(A)], E[f(B)], E[f(C)] ?

Usual Order • Lower Set:

Usual Order

General Stochastic Skyline

Verification Algorithm • Verification: determine whether U ≺uo V • Naively: test U.cdf(S) ≥ V.cdf(S) against every lower set S (an infinite number of lower sets) • From infinite to finite: all subsets of V (still exponential)

Max-flow • Given a road network, the weight along an edge shows its capacity • Question: what is the maximum flow from source to destination? • [Figure: example network with edge capacities]

Max-flow • Max-flow / min-cut theorem: for any network with a single source and a single destination node, the maximum flow from source to destination equals the minimum cut value over all cuts in the network • Ford-Fulkerson algorithm

Verification • Time Complexity: O(tG + mnlogm) • tG: time to construct GU, V • m: number of arcs • n: number of nodes
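A minimal sketch of verification by max-flow, under an assumed construction of the graph GU,V that the slides do not spell out: source → each instance u of U with capacity p(u), u → v whenever u is no worse than v on every dimension, and each instance v of V → sink with capacity p(v). Under this assumption, U.cdf(S) ≥ V.cdf(S) holds for every lower set S exactly when the maximum flow equals 1. Equal instance probabilities and smaller-is-better are assumed; the max-flow routine is the Edmonds-Karp variant of Ford-Fulkerson, and the function names are hypothetical.

```python
from collections import deque
from fractions import Fraction


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none is left."""
    residual = {a: dict(nbrs) for a, nbrs in capacity.items()}
    for a, nbrs in capacity.items():
        for b in nbrs:
            residual.setdefault(b, {}).setdefault(a, Fraction(0))
    flow = Fraction(0)
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            a = queue.popleft()
            for b, cap in residual[a].items():
                if cap > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return flow
        path, b = [], sink
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        push = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= push
            residual[b][a] += push
        flow += push


def usual_order_leq(u, v):
    """Non-strict usual-order test: can all mass of U be matched to V?"""
    cap = {"s": {}}
    for i, a in enumerate(u):
        cap["s"][("u", i)] = Fraction(1, len(u))
        # Capacity 1 acts as "unbounded" because the total mass is 1.
        cap[("u", i)] = {("v", j): Fraction(1) for j, b in enumerate(v)
                         if all(x <= y for x, y in zip(a, b))}
    for j in range(len(v)):
        cap[("v", j)] = {"t": Fraction(1, len(v))}
    return max_flow(cap, "s", "t") == 1


# Hypothetical objects, two instances each.
print(usual_order_leq([(1, 1), (2, 2)], [(2, 3), (3, 3)]))  # True
print(usual_order_leq([(1, 3), (3, 1)], [(2, 2), (4, 4)]))  # False
```

The test is non-strict: it also returns True when U and V are identical, so a strict U ≺uo V check would additionally require U ≠ V.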

Verification Compression: R-tree based level-by-level dominance checking

Verification • Step 1: get the full dominance list FD • FD: {(U1, V1), (U2, V2), (u1, v6), (u2, v6)}

Verification

Framework • U ≺uo V (and likewise U ≺lo V) is transitive: if U ≺uo V, then V can be removed, since for any W such that V ≺uo W we also have U ≺uo W • Apply the standard filtering paradigm
