Answering Top-k Queries Using Views

Answering Top-k Queries Using Views By: Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris Tsirogiannis (Univ. of Toronto) Presented By: Kushal Shah Lipsa Patel

Views • Definition: Views • Declaring Views • Advantages of using Views

Views • A view may be thought of as a table, that is derived from one or more underlying base table. • Two kinds: • Virtual: Not stored in the database; just a query for constructing the relation. • Materialized: Actually constructed and stored.

Declaring Views • Materialized: • CREATE [MATERIALIZED] • VIEW <name> AS <query>; • Virtual: Default

Advantages of using Views • If we have several tables in a DB and we want to view only specific columns from specific tables we can go for views. • Suffice the needs of security: Sometimes allowing specific users to see only specific columns based on the permission that we can configure on the views.

Answering Top-k Queries Using Views By: Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris Tsirogiannis (Univ. of Toronto) Presented By: Kushal Shah Lipsa Patel

Top-k Query • Top-k Query Processing – Definition • Top-k Example • Algorithms for Top-k Query Processing

Top-k Query Processing Top-k query processing = Finding k objects that have the highest overall Score

Top-k Example • R • Users preferences regarding the ordering of the tuples of a relation can be expressed as a scoring functions on the attributes of a relation, eg • fq = 3x1 + 2x2 + 5x3 • The top-k problem is to find the k tuples with the highest score according to a given scoring function.

Algorithms for Top-k Query Processing • How?  Which algorithms? – Related Work How we complement existing approaches? • TA [Fagin] • PREFER [Hristidis] Stores the multiple copies of a relation and each copy is ordered according to a different scoring function. In order to answer a top-k query the algorithm utilizes a single copy with a scoring function which is closest to the scoring function of the query.

M Object ID Attribute 1 Attribute 2 d c b a (d, 0.9) (a, 0.9) 0.9 0.85 (a, 0.85) (b, 0.8) 0.8 0.7 (b, 0.7) (c, 0.72) 0.72 0.2 . . . . . . . . 0.6 0.9 . . . . . . . . . . . . (c, 0.2) (d, 0.6) N Example – Simple Database model Sorted L1 Sorted L2

ID L2 L1 (d, 0.9) (a, 0.9) (b, 0.8) (a, 0.85) (c, 0.72) (b, 0.7) A1 Min(A1,A2) A2 . . . . . . . . (d, 0.6) (c, 0.2) Example – Threshold Algorithm Step 1: - parallel sorted access to each list For each object seen: - get all grades by random access - determine Min(A1,A2) - amongst 2 highest seen ? keep in buffer a 0.9 0.85 0.85 d 0.9 0.6 0.6

ID L2 L1 a: 0.9 d: 0.9 a: 0.85 b: 0.8 a 0.9 b: 0.7 c: 0.72 0.9 d A2 Min(A1,A2) A1 . . . . . . . . d: 0.6 c: 0.2 Example – Threshold Algorithm Step 2: - Determine threshold value based on objects currently seen under sorted access. T = min(L1, L2) - 2 objects with overall grade ≥ threshold value ? stop else go to next entry position in sorted list and repeat step 1 0.85 0.85 0.6 0.6 T = min(0.9, 0.9) = 0.9

ID L2 L1 (a, 0.9) (d, 0.9) (b, 0.8) (a, 0.85) (c, 0.72) (b, 0.7) A1 A2 Min(A1,A2) . . . . . . . . (d, 0.6) (c, 0.2) Example – Threshold Algorithm Step 1 (Again): - parallel sorted access to each list For each object seen: - get all grades by random access - determine Min(A1,A2) - amongst 2 highest seen ? keep in buffer a 0.9 0.85 0.85 0.9 d 0.6 0.6 b 0.8 0.7 0.7

ID L2 L1 a: 0.9 d: 0.9 a: 0.85 b: 0.8 a 0.9 b: 0.7 c: 0.72 0.7 b A2 Min(A1,A2) A1 . . . . . . . . d: 0.6 c: 0.2 Example – Threshold Algorithm Step 2 (Again): - Determine threshold value based on objects currently seen. T = min(L1, L2) - 2 objects with overall grade ≥ threshold value ? stop else go to next entry position in sorted list and repeat step 1 0.85 0.85 0.7 0.8 T = min(0.8, 0.85) = 0.8

ID L2 L1 c d: 0.9 a: 0.9 a: 0.85 b: 0.8 a 0.9 c: 0.72 b: 0.7 0.7 b A1 A2 Min(A1,A2) . . . . . . . . c: 0.2 d: 0.6 Example – Threshold Algorithm Situation at stopping condition 0.85 0.85 0.7 0.8 0.72 0.2 0.2 T = min(0.72, 0.7) = 0.7

Related Work for Top-k Query Processing • TA: Sequential as well as Random Access • PREFER

Approach for Top-k Query Processing • Top-k Query Answering using Views • Views are Materialized (incurring space overhead) • Advantages of using views: increased performance because views are small in size • Space-Performance tradeoff

Example Views Top-5 query using function f1 = 2x1 + 5x2 Three attribute relation R Top-5 query using function f2 = x2 + 2x3 • Top-k ranking queries in SQL-like syntax: SELECT TOP[k] FROM R ORDER BY Score(q) Score(q) - function that assigns numeric score to any tuple t • Ranking Views: Views only aim to rank • A ranking view is the materialized result of a previously asked top-k query. Can we answer new top-k queries efficiently using ranking views? Let’s see

Formal Definitions • Ranking Queries • Ranking Views

Ranking Queries • Ranking Queries: Top-k ranking queries in SQL-like syntax: Select Top[k] from R where Range(q) Order By Score(q) • A ranking query may be expressed as a triple Q = (Score(q), k, Range(q)), where • Score(q)= Function that assigns numeric score to any tuple t • Range(q) = defines selection condition for the tuples of R • Semantics: Retrieve the k tuples with the top scores satisfying the selection condition.

Ranking Views • Materialized Ranking View V: • for a previously executed query • Q1 = (ScoreQ1, k1, RangeQ1), • the corresponding materialized ranking view is a set of k(tid, scoreQ(tid)) pairs, • ordered by decreasing values of scoreQ(tid).

Problems we are going to solve • Top-k Query Answer using Views • View Selection

Top-k Query Answer using Views • Given: Set U of views • Query Q • Obtain an answer to Q combining all the information conveyed by the views in U • Solution: Algorithm named LPTA

Problems we are going to solve • Top-k Query Answer using Views • View Selection

View Selection • Problem: Given a collection of views V={V1…Vr} – base views and a query Q, determine the most efficientsubset U of V to execute Q on. • Input to LPTA: subset U • Obtaining an answer to ranking query: Running TA on base views. • Find the subset U that when utilized by LPTA • 1. Provide answer to query • 2. Provide answer faster than running TA on the base views V

Outline • LPTA Algorithm • View Selection Problem

LPTA: Linear Programming Adaptation of the Threshold Algorithm • 1. Scoring function of Query: Q - fQ = 3x1 + 10x2 • 2. Scoring function of Views: V1 – fv1 = 2x1 + 5x2 • Subset of Views U V2 – fv2 = x1 + 2x2 LPTA for Top-k Query Answer using Views Top-1 Query • View is a set of pairs of (tuple identifier, score). • The LPTA algorithm requires sorted access on each view in non-increasing order of that score.

LPTA Example R V1 V2 Top-3 Query f2 = x2 + 2x3 Top-5 Query f1 = 2x1 + 5x2 Answer Top-2 Query using LPTA

LPTA Setting • The algorithm initializes the top-k buffer to empty. V1 V2 Top-2 Buffer For each tid read, random access on R to retrieve tuple and compute score acc to query function f3 = 3x1 + 10x2 + 5x3 Top-2 Buffer Check for stopping Condition

Check for Stopping Condition • The unseen tuples in the view have satisfy the following inequalities: • The domain of each attribute of R [1,100] • 0<=X1,X2,X3<=100---------------------------(1) • 2x1 + 5x2 <= sd1-------------------------------(2) • x2 + 2x3 <= sd2---------------------------------(3) • sd1 = 527 and sd2 = 219 • Unseenmax = Solution to the linear program where we maximize the function f3 = 3x1 + 10x2 + 5x3 subject to these inequalities. • The solution of the linear program gives the maximum score of any unseen tuple. • Unseenmax = the maximum possible score (with respect to the ranking query’s scoring function) of any tuple not yet visited in the views. • The algorithm terminates when the top-k buffer is full and Unseenmax <= topkmin

Calculating Unseenmax • Unseenmax = Solution to the linear program where we maximize the function f3 = 3x1 + 10x2 + 5x3 subject to these inequalities. • A linear programming problem may be defined as the problem of maximizing or minimizing a linear function subject to linear constraints. The constraints may be equalities or inequalities. Here is a simple example. • Find numbers x1 and x2 that maximize the sum x1 + x2 subject to the constraints • x1 ≥ 0, x2 ≥ 0, and • x1 + 2x2 ≤ 4 • 4x1 + 2x2 ≤ 12 • −x1 + x2 ≤ 1 Objective Function

Maximize the function This system of inequalities defines a convex region. Occasionally, the maximum occurs along an entire edge or face of the constraint set, but then the maximum occurs at a corner point as well. Convex region

LPTA - Example stopping condition Score of a tuple with respect to the query: project that tuple to the vector of the query Score of a tuple with respect to a view: project that tuple to the the vector of the view Sweeping line perpendicular to V1 from infinity to origin Normalized Domain[0,1] V1 X1 Q R(X1, X2) Top-1 query V1 V2 V2 X2 Views and top-k query represented by vectors denoting the direction of increasing score Max posssible score of any tuple not yet visited in the views with respect to the scoring func of query UNSEENMAX

LPTA - Example (cont’) The algorithm will stop early if the scoring function of the views is “similar” to the scoring function of the query. stopping condition X1 V1 Q Top-1 R(X1, X2) V1 V2 V2 X2

LPTA Algorithm – Pseudo Code • There is Sequential as well as Random Access. • Sequential access on views • Random Access on base table to find the tuple

Comparison of LPTA with TA • LPTA becomes TA when the set of views U = set of base views • Execution cost: Both have Sequential as well as Random Access • These I/O Operations play a significant role – overshadow the costs of CPU operations such as updated top-k buffer, testing for stopping condition & so on.

Determining Factor for performance LPTA versus TA • Highly correlated: every sequential access incurs a random access. • As a result the determining factor for the performance is (distance from the beginning of the view each algorithm has to traverse (read sequentially) before coming into a halt with the correct answer) X (the number of views participating in the process). • d=number of lock-step r = no of views • Running Cost: • O(dr)

Outline • LPTA Algorithm • View Selection Problem

View Selection Problem • Given a collection of views V = {V1,…,Vr} and a Query Q, determine the most efficient subset U C V to execute Q on. • Conceptual discussion of View Selection • Two attribute relation (in two dimension) • Multi attribute relation (for any dimension) • Domain of each attribute is normalized to [0,1] • M-attribute relation is refer as m-dimension

View Selection – Two Dimension(same side) Two views V1 and V2 and Query Q are represented by vectors. Both the view vectors are to the same side (clockwise) of the query vector Sorted access to V1 – sweeping line 1 to V1 from infinity to origin AB 1 Q passes through M & intersect unit square ABR – Top-k tuples ABPOT – Remaining tuples Q Min top-k tuple A Stopping condition for V1: sweepline crosses AB1 b’coz convex polygon AB1POT – unseen tuples and score(unseen) <= Score(M) V1 M V2 B Square OPRT B1 B2 Number of sorted access V1 = NumTuples(AB1R) V2 = NumTuples(AB2R)

View Selection – Two Dimension(same side) Conclusion • V2 is slower compared to V1 • If several views in two dimension are available & • all their vectors are to one side of query vector, • then it is optimal for LPTA to use the vector that is closet to the query vector.

Estimating the Number of Tuples • Estimating and Comparing the Number of Tuples by simply comparing the areas of respective triangles. • Such approach: Need to have an uniform distribution within the triangles, which is often quite unrealistic. • In our approach for view selection, • utilize the conceptual conclusions + borrow knowledge of actual data distribution.

View Selection – Two Dimension(either side of query) A1 A B B1 Can use only V1 or only V2 for execution Q Min top-k tuple V1 If uses only v1 to answer the query the stopping condition will be reached once the sweepline perpendicular to v1 crosses position A1B/ For V2 - AB1 M V2

View Selection – Two Dimension(either side of query) A1 A B B1 Running LPTA on both V1 and V2, rather than just running on only one of V1 or V2? Two views are better than one The intersection point of the sweep lines perpendicular to v1 and v2 is on the line AB A11 Q V1 A21 Min top-k tuple The stopping condition is reached when the sweeplines resp crosses A11B11 and A21B21 such that 1) intersection pt of A11B11 and A21B21 is on line AB M V2 B11 B21 2) NumTuples(A11B11R) = NumTuples(A21B21PR) since algo sweeps each view in lock-step

LPTA on both Views versus One • For two views the position of each sweepline is before the respective stopping positions if only one view has been used. • Total number of sorted accesses for two views: • NumTuples (A11B11R) + NumTuples (A21B21R) = 2 NumTuples (A11B11R) • If Min (NumTuples (A1BR), NumTuples (AB1PR), 2 NumTuples (A11B11R)) = NumTuples (A1BR) - Use V1 • If Min (NumTuples (A1BR), NumTuples (AB1PR), 2 NumTuples (A11B11R))= NumTuples (AB1PR) - Use V2 • Else use both V1, V2

Theorem for Two Dimensional Case • Theorem 1: Set of Views = {V1,…,Vr} Query = Q • Two Dimensional dataset • Va = Closest to query in Anticlockwise • Vc = Closest to query in Clockwise • So they are on either side of the query • Optimal execution of LPTA requires the use of either Va or Vc i.e., the use of subset from {Va , Vc }

View Selection – Higher Dimension • Extension of Theorem 1 • Theorem 2: Set of Views = {V1,…,Vr} Query = Q • m-dimensional dataset • Optimal execution of LPTA requires the use of subset of views U C V such that |U| <= m

Outline • LPTA Algorithm • View Selection Problem • Cost Estimation Framework

Cost Estimation Framework – Running LPTA • Cost Estimation Framework: The cost of running LPTA when a specific set of views is used to answer a query. • Cost = total number of sequential accesses in a view • Uses 2 views to answer a query V1 Min top-k tuple A Q Cost = 6 sequential accesses V2 Can we find that cost without actually running LPTA? B

Answering Top-k Queries Using Views