1 / 12

Query Sampling in DB2

Query Sampling in DB2. Motivation. Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want just approximate answers. Solution: Sampling. Processing less data => huge performance improvement Approximate answers often suffice.

yale
Download Presentation

Query Sampling in DB2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Sampling in DB2

  2. Motivation • Data volume is growing fast • Many algorithms do not scale up with data volume • For exploratory analysis users often want just approximate answers

  3. Solution: Sampling • Processing less data => huge performance improvement • Approximate answers often suffice

  4. Current Support of Sampling in DB2: The Rand() Function RAND() returns a uniform random number between 0 and 1 SELECT * FROM original query WHERE rand() < 0.01 Advantage: User can easily specify the size of sample he wants Disadvantage: The RAND() operator does not provide any optimization as it is applied to the query result

  5. Current Support of Sampling in DB2: The TABLESAMPLE operator • Can place sampling clause after any SQL table reference • SELECT … • FROM T TABLESAMPLE BERNOULLI(10.0) • WHERE … • General form of sampling clause • TABLESAMPLE samplingMethod(p) • samplingMethod is one of the two: • BERNOULLI: row-level Bernoulli sampling • SYSTEM: page-level (efficient) sampling method • p = inclusion probability for each row (%) = (expected) sampling fraction

  6. Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.) • Advantage: By pushing sampling to the bottom of a query tree, can provide huge performance improvements • Disadvantage: How to extrapolate the sampling rate at the base table to the query result? • Joins • Group by • Count (DISTINCT) • Subqueries

  7. Extrapolation Problem Let R be a base relation referenced in query Q. Then: uniform random sample of Q ≠ Q evaluated over a uniform random sample of R (and possibly other tables) Example (CMN99): Let Q = R S, where: R(A,B) = {(a1, b0), (a2, b1), (a2, b2), (a2, b3),…, (a2, bk)} S(A,C) = {(a2, c0), (a1, c1), (a1, c2), (a1, c3),…, (a1, ck)} sample(R, rate1) sample(S, rate2) cannot generate sample(R S, rate3) for any reasonable values of rate1 and rate2

  8. Solution • Push sampling to the bottom of the query tree – effectively replacing RAND() with TABLESAMPLE - whenever possible • Selections • Projections without duplicate removal • Foreign key joins • Some types of group by (using materialized views)

  9. Foreign Key Joins • A two-way join r1 r2 is aforeign key join if a join attribute is the foreign key in r1 (this definition can be easily extended to n-way join). • The relation r1 is called a source relation. • A random sample of r1 r2 can be produced by joining a random sample of r1 (the source relation) with r2.

  10. FK-Join Query Transformation select * from ( select * from F, D where F.foreignkey = D.primarykey) where RAND()<0.1 select * from F TABLESAMPLE bernoulli (10), D where F.foreignkey = D.primarykey)

  11. Sampling group by queries using materialized views select * from (select (exp1, exp2,…) from F, D1, D2 where F.y1=D1.x1 and F.y2=D2.x2 and pred1, pred2,… group by x1, x2) where RAND()<0.1 create view MQT as ( select DISTINCT D1.x1, D2.x2 from F, D1, D2 where F.y1=D1.x1 and Fy2=D2.x2) select (exp1, exp2,…) from F, MQT TABLESAMPLE bernoulli (10.0) where F.y1=MQT.x1 and F.y2=MQT.x2 and pred1, pred2,… group by x1, x2

  12. Experiments • Modified 12 queries from TPCH • 62% performance improvement for 10% sample • 76% improvement for 1% sample • Defined several MQTs and queries in TPCH schema • 95% improvement for 1% sample

More Related