Join Synopses for Approximate Query Answering

Join Synopses for Approximate Query Answering SwarupAcharya Phillip B. Gibbons ViswanathPoosala Sridhar Ramaswamy Presented By VinayHoskere

Contents • Introduction • AQUA • Problem with joins • Join synopses • Allocation • Maintenance of join synopses • Experimental Evaluation

Introduction • Traditional query processing: Exact answers, minimize response time, maximize throughput • Data warehouses: full precision of exact answer not needed, less time, minimum number of accesses to base data. • Random Sampling techniques: generate approximate answers • Foreign keys joins: Large tables increase the size of base relation. Schemes for providing approximate join aggregates that rely on using random samples of base relations suffer from disadvantages

AQUA • Approximate query answering : Improves response time, avoids access to original database • Maintains smaller sized statistical summaries – “SYNOPSES” • Provides confidence bounds. • It has 3 components • Statistics Collection • Query Rewriting • Maintenance. • It sits on top of a DBMS.

Problem with Joins • Natural set of synopses would be random samples from each of the base relation • Non Uniform Result Sample • Small join result sizes • Using samples on base relations is not feasible.

Join Synopses • Naïve way - execute all possible join queries and collect samples • Join synopses - samples are taken from small set of distinguished joins • This scheme is for foreign key joins • Model database schema as a graph • vertex - base relation • directed edge (u to v) – if u has at least one attribute which is foreign key in v

Definition

Join Synopses • There is a 1-1 correspondence between a tuple in a relation ‘r’ & a tuple in the output of any foreign key join involving ‘r’ & any of its descendants in the graph. • The subgraph of G on the ‘k’ nodes in any k-way foreign key join must be a connected subgraph with a single root node

Definition

Join Synopses • For each node u in G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1xr2x..xrk with source r1. • Let Su be a uniform random sample of r1. • The join synopsis J(Su) is the output of Suxr2xr3…..xrk. • J(Su) is a uniform random sample of J(u) with |Su| tuples. • Thus we can extract from our synopsis a uniform random sample of the output of any k-way foreign key join.

Allocation • Optimal strategy for allocating the available space among the various join synopses when certain properties of the query work load are known . • Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign key joins. • For each relation Ri, find fraction Fi of queries in S for which Riis the source relation in a foreign key join. • The average relative error bound over the queries is proportional to sum(Fi/sqrt(ni)).

Allocation • Heuristic allocation: When properties of work load are not known. • There are 3 procedures- • EqJoin – Divides up the space allotted equally amongst relations • CubeJoin – Divides up the space in proportion to the cube root of their join synopsis tuple size. • PropJoin – Divides up the space in proportion to their join synopsistuple size.

Maintenance of Join Synopses • Need to maintain the join synopses when base relation is updated (insert or delete) • does not require frequent access to base relation • If a new tuple is inserted • Let Pu be the probability of newly arrived tuple for relation u in random sample Su • Let uxr2xr3…..xrkbe the max foreign key join with source u. • We add ‘T’ (new tuple) to Su with probability Pu. • If ‘T’ is added to Su, we add to J(Su) the tupleTxr2xr3…..xrk

Maintenance of Join Synopses • If T is added to Su and Su exceeds its target size, then select uniformly at random a tuple T’ to evict from Su and remove the tuple in J(Su) corresponding to T’. • On delete of a tuple T from u • T is in Su delete the tuple from Su and remove the tuple from J(Su) corresponding to T • If sample becomes too small due to many deletions repopulate by scanning relation u. • This algorithm performs lookups with the base relation with small probability Pu

Experimental Evaluation • Test bed – TPC-D decision support benchmark. DB of around 300 MB. • Machine – 296MHz UltraSPARC-II, 256 MB of memory, Solaris 5.6. • Query used is based on Q5 & an aggregate computed on join of Lineitem, Customer, Order, Supplier, Nation, Region. • The query used is

Experimental Evaluation

Conclusion • Approximate query answering is becoming increasingly essential in data warehouses. • One of the fundamental problems faced here : computing approximate answers to aggregates on multi way joins. • Join synopses : Solution for schemas that involve foreign key joins. • Provides better performance than schemas based on base samples • Can be maintained efficiently during updates • Approximating answers for group by, rank and set valued queries still remains.

Thank You.

Join Synopses for Approximate Query Answering