120 likes | 219 Views
Query Sampling in DB2. Motivation. Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want just approximate answers. Solution: Sampling. Processing less data => huge performance improvement Approximate answers often suffice.
E N D
Motivation • Data volume is growing fast • Many algorithms do not scale up with data volume • For exploratory analysis users often want just approximate answers
Solution: Sampling • Processing less data => huge performance improvement • Approximate answers often suffice
Current Support of Sampling in DB2: The Rand() Function RAND() returns a uniform random number between 0 and 1 SELECT * FROM original query WHERE rand() < 0.01 Advantage: User can easily specify the size of sample he wants Disadvantage: The RAND() operator does not provide any optimization as it is applied to the query result
Current Support of Sampling in DB2: The TABLESAMPLE operator • Can place sampling clause after any SQL table reference • SELECT … • FROM T TABLESAMPLE BERNOULLI(10.0) • WHERE … • General form of sampling clause • TABLESAMPLE samplingMethod(p) • samplingMethod is one of the two: • BERNOULLI: row-level Bernoulli sampling • SYSTEM: page-level (efficient) sampling method • p = inclusion probability for each row (%) = (expected) sampling fraction
Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.) • Advantage: By pushing sampling to the bottom of a query tree, can provide huge performance improvements • Disadvantage: How to extrapolate the sampling rate at the base table to the query result? • Joins • Group by • Count (DISTINCT) • Subqueries
Extrapolation Problem Let R be a base relation referenced in query Q. Then: uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables) Example (CMN99): Let Q = R S, where: R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3),…, (a2, bk)} S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3),…, (a1, ck)} sample(R, rate1) sample(S, rate2) cannot generate sample(R S, rate3) for any reasonable values of rate1 and rate2
Solution • Push sampling to the bottom of the query tree – effectively replacing RAND() with TABLESAMPLE - whenever possible • Selections • Projections without duplicate removal • Foreign key joins • Some types of group by (using materialized views)
Foreign Key Joins • A two-way join r1 r2 is aforeign key join if a join attribute is the foreign key in r1 (this definition can be easily extended to n-way join). • The relation r1 is called a source relation. • A random sample of r1 r2 can be produced by joining a random sample of r1 (the source relation) with r2.
FK-Join Query Transformation select * from ( select * from F, D where F.foreignkey = D.primarykey) where RAND()<0.1 select * from F TABLESAMPLE bernoulli (10), D where F.foreignkey = D.primarykey)
Sampling group by queries using materialized views select * from (select (exp1, exp2,…) from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2 and pred1, pred2,… group by x1, x2) where RAND()<0.1 create view MQT as ( select DISTINCT D1.x1, D2.x2 from F, D1, D2 where F.y1=D1.x1 and Fy2=D2.x2) select (exp1, exp2,…) from F, MQT TABLESAMPLE bernoulli (10.0) where F.y1=MQT.x1 and F.y2=MQT.x2 and pred1, pred2,… group by x1, x2
Experiments • Modified 12 queries from TPCH • 62% performance improvement for 10% sample • 76% improvement for 1% sample • Defined several MQTs and queries in TPCH schema • 95% improvement for 1% sample