1 / 18

Join Synopses for Approximate Query Answering

Join Synopses for Approximate Query Answering. Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande. Contents. Introduction Need for approximate answers Problem with joins Join synopses Allocation Maintenance of join synopses

basil-black
Download Presentation

Join Synopses for Approximate Query Answering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande

  2. Contents • Introduction • Need for approximate answers • Problem with joins • Join synopses • Allocation • Maintenance of join synopses • Experimental Evaluation

  3. Introduction This paper • demonstrates difficulty of providing good approximate answers to join queries • proposes join synopses as the efficient solution for this problem • presents strategy for allocating available space for join synopses • provides efficient algorithm for maintaining join synopses in presence of updates to the base relations

  4. Why Approximate answers ? • Reduce overhead for large DBs and improve response time • Reduce access to the base relation • Example of Approximate answers • Initial queries in the data mining which are used to determine what the interesting queries are • Queries requesting numerical answers and full precision of exact answer not needed e.g. total, average • The research in this paper was conducted while developing efficient approximate query answering system, Aqua.

  5. Aqua System • improve response time by avoiding frequent access to original data • maintains smaller sized statistical summaries, called synopses, on warehouse. • sits on the top of the DBMS. Collects all synopses, uses it to answer queries posed by user • There key components • Statistic Collection • Query Rewriting • Maintenance parses sql input, rewrite queries for scaling certain operators to fit for synopses Keep synopses up to date during updating of original data

  6. Problem with Joins • Natural set of synopses for an approximate query includes uniform random samples of each base relations • Non-uniform result samples - For the join to be uniform random sample, probabilities of tuples in join samples must be equal • Small join result Join of relations R & S on attribute X Probabilities of tuples a1 and a2 being selected should be same as prob. of tuples a1 and b1 selected in join R.X S.X a a b b a b a1 a2 prob. (a1,a2)=(1/r)*(1/r)*(1/r)=1/r3 b1 prob. (a1,b1)=(1/r)*(1/r)*(1/r) *(1/r)=1/r4 To get uniform join samples is very difficult uniform random sampling

  7. Join Synopses • Naïve way - execute all possible join queries and collect samples • Join synopses - samples are taken from small set of distinguished joins • Can obtain random samples of all possible joins in the schema • This is scheme is for foreign key joins • Modeled database schema as a graph • vertex - base relation • directed edge (u to v) – if u has at least one attribute which is foreign key in v

  8. Join Synopses • Key result proved - There is 1-1 correspondence between a tuple in relation ‘r’ & a tuple in the output of any foreign key join involving ‘r’ & any of its descendents in the graph. • A sample Sr of a relation ‘r’ can be used to produce another relation J(Sr) called a join synopsis of ‘r’. ( provides random samples). • Join synopses of R is simply a sample of R where as for C it is the join of N, R and sample of C.

  9. Join Synopses • For each node u in database schema G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1xr2x..xrk with source r1. • Let Su be a uniform random sample of r1. • The join synopsis J(Su) is the output of Suxr2xr3…..xrk. • J(Su) is a uniform random sample of J(u) with |Su| tuples. • Thus we can extract from our synopsis a uniform random sample of the output of any k-way foreign key join. • From 1 join synopsis for a node whose foreign key join has k relations, we can extract a URS of the output of between k-1 & pow(2,k-1)-1 distinct foreign key joins.

  10. Allocation • Allocate space among various join synopses when certain properties of query workload are known. • Identify heuristics for the common case when such properties are not known. • Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign key joins. • For each relation Ri, find fraction Fi of queries in S for which Ri is the source relation in a foreign key join. • It is known that the error bounds are inversely proportional to sqrt(n).(n- number of tuples in join sample). • Select join synopsis sizes so as to minimize the average relative error.

  11. Allocation • The average relative error bound over the queries is proportional to sum(fi/sqrt(ni)) • ni is selected so as to minimize the above equation for the total memory allocated for join synopses • For each relation Ri if si = size of single join synopses tuple then join synopses size is chosen so as sum (nisi) <= Total memory allocated • In the absence of query work load information heuristic strategies can be used. • EqJoin • CubeJoin • PropJoin divides space equally amongst the relations divides space proportional to their join synopses tuple sizes divides space proportional to cube root of their join synopses tuple sizes

  12. Maintenance of Join Synopses • Need to maintain the join synopses when base relation is updated (insert or delete) • does not require frequent access to base relation • If a new tuple is inserted • Let Pu be the probability of newly arrived tuple for relation u in random sample Su • Let uxr2xr3x….xrk be the max foreign key join with source u. • We add ‘T’ (new tuple) to Su with probability Pu. • If ‘T’ is added to Su, we add to J(Su) the tuple Txr2xr3x….rk

  13. Maintenance of Join Synopses • If T is added to Su and Su exceeds its target size, then select uniformly at random a tuple T’ to evict from Su and remove the tuple in J(Su) corresponding to T’. • On delete of a tuple T from u • T is in Su delete the tuple from Su and remove the tuple from J(Su) corresponding to T • If sample becomes too small due to many deletions repopulate by scanning relation u. • This algorithm performs lookups with the base relation with small probability Pu

  14. Experimental Evaluation • Two classes of experiments • Accuracy experiments • Maintenance experiments • Accuracy Experiments • Compares accuracy of techniques based on join synopses and based on base samples • parameters varied - query selectivity and total space allocated to precomputed summaries (summary size/join synopses size) • Maintenance Experiments • Study cost of keeping join synopses up to date in presence of insertions/deletions to the underlying data.

  15. Experimental Evaluation • Ran results on TPC-D decision support benchmark • Query used is an aggregate that is computed on join of Lineitem, Customer, Order, Supplier, Nation and Region. • The query used is Query selectivity is varied using these parameters • region parameter is set to ‘ASIA’ and selection predicate is on o_orderdate column to the range [1/1/94, 1/1/95]

  16. Experimental Evaluation Accuracy Experiments

  17. Experimental Evaluation Maintenance Experiments tuples inserted in lineitem table

  18. Conclusion • Provides uniform random sampling for joins in the database having foreign key joins. • Focus on computing approximate answers to aggregates computed on multi-way joins. • Join synopses can be maintained effectively during updates.

More Related