Query Sampling in DB2

Query Sampling in DB2

Motivation • Data volume is growing fast • Many algorithms do not scale up with data volume • For exploratory analysis users often want just approximate answers

Solution: Sampling • Processing less data => huge performance improvement • Approximate answers often suffice

Current Support of Sampling in DB2: The Rand() Function RAND() returns a uniform random number between 0 and 1 SELECT * FROM original query WHERE rand() < 0.01 Advantage: User can easily specify the size of sample he wants Disadvantage: The RAND() operator does not provide any optimization as it is applied to the query result

Current Support of Sampling in DB2: The TABLESAMPLE operator • Can place sampling clause after any SQL table reference • SELECT … • FROM T TABLESAMPLE BERNOULLI(10.0) • WHERE … • General form of sampling clause • TABLESAMPLE samplingMethod(p) • samplingMethod is one of the two: • BERNOULLI: row-level Bernoulli sampling • SYSTEM: page-level (efficient) sampling method • p = inclusion probability for each row (%) = (expected) sampling fraction

Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.) • Advantage: By pushing sampling to the bottom of a query tree, can provide huge performance improvements • Disadvantage: How to extrapolate the sampling rate at the base table to the query result? • Joins • Group by • Count (DISTINCT) • Subqueries

Extrapolation Problem Let R be a base relation referenced in query Q. Then: uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables) Example (CMN99): Let Q = R S, where: R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3),…, (a2, bk)} S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3),…, (a1, ck)} sample(R, rate1) sample(S, rate2) cannot generate sample(R S, rate3) for any reasonable values of rate1 and rate2

Solution • Push sampling to the bottom of the query tree – effectively replacing RAND() with TABLESAMPLE - whenever possible • Selections • Projections without duplicate removal • Foreign key joins • Some types of group by (using materialized views)

Foreign Key Joins • A two-way join r1 r2 is aforeign key join if a join attribute is the foreign key in r1 (this definition can be easily extended to n-way join). • The relation r1 is called a source relation. • A random sample of r1 r2 can be produced by joining a random sample of r1 (the source relation) with r2.

FK-Join Query Transformation select * from ( select * from F, D where F.foreignkey = D.primarykey) where RAND()<0.1 select * from F TABLESAMPLE bernoulli (10), D where F.foreignkey = D.primarykey)

Sampling group by queries using materialized views select * from (select (exp1, exp2,…) from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2 and pred1, pred2,… group by x1, x2) where RAND()<0.1 create view MQT as ( select DISTINCT D1.x1, D2.x2 from F, D1, D2 where F.y1=D1.x1 and Fy2=D2.x2) select (exp1, exp2,…) from F, MQT TABLESAMPLE bernoulli (10.0) where F.y1=MQT.x1 and F.y2=MQT.x2 and pred1, pred2,… group by x1, x2

Experiments • Modified 12 queries from TPCH • 62% performance improvement for 10% sample • 76% improvement for 1% sample • Defined several MQTs and queries in TPCH schema • 95% improvement for 1% sample

Query Sampling in DB2