Answering Top-k Queries Using Views By Gautam Das,Dimitrios Gunopulos , Nick Koudas , Dimitris Tsirogiannis

Answering Top-k Queries Using ViewsBy GautamDas,DimitriosGunopulos,Nick Koudas, DimitrisTsirogiannis Presented by:Yesha Gupta Referenceaitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt

Top-k query • To return k highest ranked values from a relation • Query: Return top-2 values for function 3x1+10x2+5x3 m R n

Views (1/2) Base table R with three attributes Top-5 query using function f1 = 2x1 + 5x2 • Base ViewVX1 ordered by Score m R n

Views (2/2) • Advantages of Views • faster query response time • space-performance tradeoff to improve efficiency • Can we use view to answer top-k query? Yes • Challenges in using views • guarantee of getting top-k query • which views to select to answer a given query

Ranking Queries • Top-k ranking queries : Select top[k] from relation(R) where order by • Expressed as triple • Assigns a numeric score to any tuple t • Selection condition for the tuples

Ranking Views • Materialized Ranking views : previously executed ranking query • A set of k pairs, ordered by decreasing values of • Relation R with m attributes, base views ,

Related work in Top-k query (1/2) • TA [Fagin et. al. ‘96] • deterministic stopping condition • always the correct top-k set m • List2 (X2) • List1 (X1) R n Number of sorted access = Number of random access

Related work in Top-k query (2/2) • PREFER [Hristidis et. al. ‘01] • A prototype system and available at http://db.ucsd.edu/PREFER/ • Stores multiple copies of base relation R as materialized views • Utilizes only one • This paper complements both the approaches

Problems for views • Problem 1 (Top-k query answer using views) • Input: a set U of views {v1,v2,…}, Query Q • Output: top-k set • Algorithm: LPTA • Problem 2 (View Selection) • Input: a set V of views (Materialized Ranking Views and Base Views), Query Q • Output: the most efficient subset to execute Q on • Algorithm: SelectViews

Outline • LPTA • View Selection problem • Conceptual discussion • Cost estimation problem • Various Select Views methods • General queries and views • Experiment • Conclusion

LPTA Setting • Linear additive scoring functions e.g. • Output Top-2 set; Q = (fQ, 2, *) • Set of Views: • Materialized views V1 ,V2 • Sorted access on pairs • Random access on the base table R

LPTA execution R V1 V2 Top-k buffer for Q = (fQ, 2, *) topkmin = 996 Unseenmax = ? 7 1248 6 996

Calculate Unseenmax • The unseen tuples in the view have satisfy the following inequalities: • The domain of each attribute of R [1,100] • 1<=X1,X2,X3<=100 (1) • 2x1 + 5x2 <= 527 (2) • x2 + 2x3 <= 219 (3) • Unseenmax= Solution to the linear program where we maximize the function subject to these inequalities. • Unseenmax= 1338

stopping condition LPTA – General Example(1/3) V1 X1 Q Top-1 R(X1, X2) V1 V2 V2 X2

d iteration LPTA – General Example(2/3) Linear Programming adaptation of TA Q: V1 V2

LPTA – General Example(3/3) stopping condition X1 V1 Q Top-1 R(X1, X2) V1 V2 V2 X2

LPTA execution R V1 V2 Top-k buffer for Q = (fQ, 2, *) topkmin = 996 Unseenmax = 953.5 Unseenmax = 1338 Yes Is Unseenmax ≤ topkmin ? No

LPTA • LPTA becomes TA when the set of views U = set of base views • Execution cost: Both have Sequential as well as Random Access • Every sequential access incurs random access • # of sequential access can be considered as running cost • Running cost = O(dr) d= #of lock-step iterations and r= # of views

View Selection Problem (1/5) • Given a collection of views and a query Q, determine the most efficient subset to execute Q on. • Conceptual discussion • Two dimensions • Higher dimensions

A B1 B B2 View Selection – 2d (2/5) Q Min top-k tuple V1 M V2

A1 A B B2 View Selection - 2d (3/5) Q Min top-k tuple V1 M V2

View Selection - 2d (4/5) • Theorem: If is a set of views for a two dimensional dataset and Q is a query. Let and be the closet view vectors in anticlockwise and clockwise order respectively. Then the optimal execution of LPTA requires a subset of views from .

View Selection - Higher d (5/5) Theorem: If is a set of views for an -dimensional dataset and Q a query, the optimal execution of LPTA requires a subset of views such that . Question: How do we select the optimal subset of views?

Cost Estimation Framework • What is the cost of running LPTA when a specific set of views is used to answer a query? • Cost = number of sequential accesses V1 Min top-k tuple A Q Cost = 6 sequential accesses V2 Can we find that cost without actually running LPTA? B

Simulation of LPTA on Histograms Equi – depth Histogram • Base ViewVX1 ordered by Score HVX1 Let’s say, each histogram has 2 bucketsSo b = 5There are 10 tuples in the view, so n=10 Each bucket will represent n/b data points or attribute valuesn/b = 10/5 = 2 b buckets

Simulation of LPTA on Histograms • Estimate query score distribution by convolution Domain [0,1]

topkmin Cost Simulation of LPTA on Histograms • Use HQ to estimate the score of the k highest tuple (topkmin). • Simulate LPTA in a bucket by bucket lock step to estimate the cost. HQ: approximates the score distribution of the query Q HQ HV1 HV2 b buckets n/b tuples per bucket

Estimate cost • Number of buckets visited (d) = 3 • Number of views (r̕) = 2 • Number of tuples per bucket (n/b) = 56/7 = 8 • Number of tuples in last scanned bucket (n̕) = 3 • Number of sorted access = ((d-1) n/b + n̕) r̕ = ((3-1) 8 + 3) 2 = 38 • Running time = Ο((d-1) + logn̕)

View Selection Algorithms • Excustive(E): Check all possible subsets where and select subset of views with smallest cost • Greedy (SV): Keep expanding the set of views to use until the estimated cost stops reducing.

T Selected Views Select Views Spherical (SVS) Requires the solution of a single linear program. Q

Select Views By Angle (SVA) Select Views By Angle (SVA): Sort the views by increasing angle with respect to Q. V4 V3 V2 Selected Views Q V1

General Queries and Views • Views that materialize their top-k tuples. • Convolute the view histograms • Truncate the view histograms • Run EstimateCost() • Accommodating range conditions • Select the views that cover the range conditions. • Truncate each attribute’s histogram. • Convolute histogram

Experiments • Datasets (Uniform, Zipf, Real) • Experiments: • Performance comparison of LPTA, PREFER and TA • Accuracy of the cost estimation framework • Performance of LPTA using each of the view selection algorithms • Scalability of the LPTA algorithm

Performance comparison of LPTA, PREFER and TA Real dataset, 2d Uniform dataset, 3d

Cost Estimation Accuracy 2d (buckets = 0.5% of n) (buckets = 1% of n)

Performance of LPTA using View Selection Algorithms 500K tuples, top-100 (3d) (2d)

Scalability Experiments on LPTA (2d, uniform dataset) (500K tuples, top-100)

Conclusions • Using views for top-k query answering • LPTA: linear programming adaptation of TA • View selection problem, cost estimation framework, view selection algorithms • Experimental evaluation

References • aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt • Answering Top-k Queries Using Views: Gautam Das, DimitriosGunopulos, Nick Koudas , DimitrisTsirogiannis

Thank you

Answering Top-k Queries Using Views By Gautam Das,Dimitrios Gunopulos , Nick Koudas , Dimitris Tsirogiannis