Spring 2007, CSE 6392-003: Data Exploration and Analysis in Relational Databases. On Random Sampling over Joins. Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya. By: Lekhendro, 2/27/2007
Outline • Semantics and algorithms of sampling • The difficulty of join sampling • Classification of join problems • Two previous sampling strategies • New strategies for join sampling • Experimental results
Semantics of Sampling • Sampling with Replacement (WR) • Sampling without Replacement (WoR) • Independent Coin Flips (CF) • Stream Sampling • Sequential • Non sequential (could be on materialized data) • Weighted and Unweighted Sampling
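The first three semantics can be contrasted in a few lines. This is a minimal sketch over an in-memory list; the function names and the `rng` parameter are illustrative and do not come from the slides.

```python
import random

def sample_wr(items, r, rng=random):
    """With replacement: r independent uniform draws; duplicates are allowed."""
    return [rng.choice(items) for _ in range(r)]

def sample_wor(items, r, rng=random):
    """Without replacement: r distinct positions of the input."""
    return rng.sample(items, r)

def sample_cf(items, f, rng=random):
    """Coin flips: keep each item independently with probability f.
    The sample size is itself random, with expectation f * len(items)."""
    return [x for x in items if rng.random() < f]
```

WR and WoR return exactly r items, while CF fixes only the sampling fraction, so its output size varies from run to run.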
Sequential WR Sampling • I. Black-Box U1: Given relation R with n tuples, generate an unweighted WR sample of size r.
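A minimal sketch of one way to meet this specification in a single sequential pass, assuming n is known before the scan starts. Each tuple is emitted a binomially distributed number of times, so every one of the r sample slots ends up uniform over R. The names `wr_sample_known_n` and `binomial` are illustrative, not taken from the slides.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` independent coin flips with bias p."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def wr_sample_known_n(stream, n, r, rng=None):
    """One pass over `stream` (assumed to hold exactly n tuples)."""
    rng = rng or random.Random()
    remaining = r              # sample slots still to be filled
    out = []
    for i, t in enumerate(stream):
        if remaining == 0:
            break
        # Each unfilled slot picks t with probability 1/(n - i),
        # i.e. uniformly among the tuples not yet scanned.
        copies = binomial(remaining, 1.0 / (n - i), rng)
        out.extend([t] * copies)
        remaining -= copies
    return out
```

The output comes out in stream order, and the scan can stop early once all r slots are filled.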
Sequential WR Sampling: continued… • II. Black-Box U2: Given relation R with n tuples, generate an unweighted WR sample of size r.
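A minimal sketch of a reservoir-style variant, assuming n is not known in advance. It keeps r slots and lets each arriving tuple overwrite each slot independently with probability 1/N, where N is the number of tuples seen so far. The name `wr_sample_unknown_n` is illustrative.

```python
import random

def wr_sample_unknown_n(stream, r, rng=None):
    """One pass over `stream`; the stream length need not be known."""
    rng = rng or random.Random()
    reservoir = [None] * r     # placeholders until the first tuple arrives
    seen = 0
    for t in stream:
        seen += 1
        for j in range(r):
            # The first tuple fills every slot (probability 1/1);
            # later tuples replace a slot with probability 1/seen.
            if rng.random() < 1.0 / seen:
                reservoir[j] = t
    return reservoir
```

Unlike a scheme that emits tuples as they stream by, this one produces its sample only after the whole input has been scanned, and it does r coin flips per tuple.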
Sequential WR Sampling: continued… • III. Black-Box WR1: Given relation R with n tuples, generate a weighted WR sample of size r.
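A minimal sketch of a weighted one-pass sampler, assuming the total weight W of the relation is known up front and weights are non-negative integers (so the running subtraction is exact). Each tuple t with weight w(t) is emitted a binomial number of times with success probability w(t) / (W - D), where D is the weight already scanned. All names are illustrative.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` independent coin flips with bias p."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def weighted_wr_known_total(stream, total_weight, r, rng=None):
    """`stream` yields (tuple, weight) pairs whose weights sum to total_weight."""
    rng = rng or random.Random()
    remaining = r              # sample slots still to be filled
    scanned_weight = 0         # D: weight of the tuples seen so far
    out = []
    for t, w in stream:
        if remaining == 0:
            break
        # Each unfilled slot picks t with probability w / (W - D).
        copies = binomial(remaining, w / (total_weight - scanned_weight), rng)
        out.extend([t] * copies)
        remaining -= copies
        scanned_weight += w
    return out
```

With floating-point weights the last probability may fall slightly below 1 through rounding, so a production version would need to clamp or use exact arithmetic.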
Sequential WR Sampling: continued… • IV. Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
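A minimal sketch of a weighted reservoir variant, assuming the total weight is not known in advance. Each arriving tuple overwrites each of the r slots with probability w(t) / W, where W is the running weight total including t. The name `weighted_wr_unknown_total` is illustrative.

```python
import random

def weighted_wr_unknown_total(stream, r, rng=None):
    """`stream` yields (tuple, weight) pairs; the total weight is not needed."""
    rng = rng or random.Random()
    reservoir = [None] * r
    running_weight = 0
    for t, w in stream:
        if w <= 0:
            continue           # zero-weight tuples can never be sampled
        running_weight += w
        for j in range(r):
            # The first positive-weight tuple fills every slot (w / w == 1).
            if rng.random() < w / running_weight:
                reservoir[j] = t
    return reservoir
```

The probability that a slot finally holds tuple t telescopes to w(t) divided by the overall total, which is the weighted WR requirement.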
The Difficulty of Join Sampling • Suppose that we have the relations
Classification of the Problem: • Case A: No information is available for either R1 or R2. • Case B: No information is available for R1, but indexes and/or statistics are available for R2. • Case C: Indexes/statistics are available for both R1 and R2.
Previous Sampling Strategies • I. Strategy Naive-Sample:
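The naive approach can be sketched as: compute the full join, then draw a WR sample of its output. The sketch below assumes both relations are in-memory lists of tuples whose first field is the join attribute, and it materializes the join for simplicity where a pipelined version would feed the join output to a sequential sampler. The name `naive_sample` is illustrative.

```python
import random
from collections import defaultdict

def naive_sample(r1, r2, r, rng=None):
    """Join r1 and r2 on field 0, then sample r join tuples with replacement."""
    rng = rng or random.Random()
    index = defaultdict(list)          # simple hash join on the join attribute
    for t2 in r2:
        index[t2[0]].append(t2)
    joined = [(t1, t2) for t1 in r1 for t2 in index.get(t1[0], [])]
    if not joined:
        return []                      # empty join: nothing to sample
    return [rng.choice(joined) for _ in range(r)]
```

The cost is dominated by computing the whole join, which is exactly what a sampling strategy would like to avoid.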
Previous Strategies: continued… • II. Strategy Olken-Sample:
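A minimal sketch of the rejection-sampling idea usually associated with Olken-style join sampling, assuming random access to R1 and an index on R2: pick a random R1 tuple, pick a random matching R2 tuple, and accept the pair with probability m2(v) / M, where m2(v) is the number of R2 tuples with join value v and M is the largest such count. The in-memory dictionary stands in for the index, and all names are illustrative.

```python
import random
from collections import defaultdict

def olken_sample(r1, r2, r, rng=None):
    """Rejection sampling of r join tuples; field 0 is the join attribute."""
    rng = rng or random.Random()
    index = defaultdict(list)          # stands in for an index on r2
    for t2 in r2:
        index[t2[0]].append(t2)
    if not any(t1[0] in index for t1 in r1):
        return []                      # empty join: the loop would never end
    max_freq = max(len(bucket) for bucket in index.values())   # M
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)            # uniform random tuple of r1
        matches = index.get(t1[0])
        if not matches:
            continue                   # no join partner: reject
        t2 = rng.choice(matches)       # uniform among the matching r2 tuples
        # Accepting with probability m2(v) / M makes every join tuple
        # equally likely in each iteration.
        if rng.random() < len(matches) / max_freq:
            out.append((t1, t2))
    return out
```

When the frequencies in R2 are skewed, most iterations are rejected, which is the main cost of this scheme.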
New Strategies for Join Sampling • I. Strategy Stream-Sample:
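A minimal sketch of the stream-based idea, assuming an index and frequency statistics on R2 only: draw a weighted WR sample of R1 in which each tuple's weight is the number of R2 tuples it joins with, then pair each sampled R1 tuple with one random matching R2 tuple. Here `random.choices` is a stand-in for a sequential weighted sampler over the R1 stream, and the dictionary stands in for the index; the names are illustrative.

```python
import random
from collections import defaultdict

def stream_sample(r1, r2, r, rng=None):
    """Sample r join tuples without rejection; field 0 is the join attribute."""
    rng = rng or random.Random()
    index = defaultdict(list)          # stands in for index + statistics on r2
    for t2 in r2:
        index[t2[0]].append(t2)
    # Weight of an r1 tuple = number of join tuples it contributes.
    weights = [len(index.get(t1[0], ())) for t1 in r1]
    if sum(weights) == 0:
        return []                      # empty join
    s1 = rng.choices(r1, weights=weights, k=r)   # weighted WR sample of r1
    # Each sampled r1 tuple has at least one match, since its weight is > 0.
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]
```

Because the R1 sample is already weighted by join fan-out, no candidate is ever rejected, and R1 itself needs no index.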
New Strategies: continued… • II. Strategy Group-Sample:
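A minimal sketch of a group-based variant, assuming frequency statistics on R2 are available but no index: take a weighted WR sample S1 of R1 (weight = number of matching R2 tuples), join S1 with R2 by scanning R2 once, and keep one random join partner per sampled R1 tuple. The `Counter` built here stands in for precomputed statistics, and all names are illustrative.

```python
import random
from collections import Counter, defaultdict

def group_sample(r1, r2, r, rng=None):
    """Sample r join tuples using statistics on r2 and one scan of r2."""
    rng = rng or random.Random()
    freq = Counter(t2[0] for t2 in r2)             # stands in for statistics
    weights = [freq.get(t1[0], 0) for t1 in r1]
    if sum(weights) == 0:
        return []                                  # empty join
    s1 = rng.choices(r1, weights=weights, k=r)     # weighted WR sample of r1
    slots_by_key = defaultdict(list)               # sample slots per join value
    for slot, t1 in enumerate(s1):
        slots_by_key[t1[0]].append(slot)
    partner = [None] * r
    seen = Counter()                               # matches scanned per value
    for t2 in r2:                                  # single scan, no lookups
        slots = slots_by_key.get(t2[0])
        if not slots:
            continue
        seen[t2[0]] += 1
        for slot in slots:
            # Size-1 reservoir per slot: uniform over that slot's group.
            if rng.random() < 1.0 / seen[t2[0]]:
                partner[slot] = t2
    return [(t1, partner[slot]) for slot, t1 in enumerate(s1)]
```

Each sample slot forms its own group, so duplicate R1 tuples in S1 still receive independently chosen partners.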
New Strategies: continued… • III. Strategy Frequency-Partition-Sample:
Summary • The difficulty of join sampling. • Classification of the problem of random sampling over joins into 3 cases. • Different strategies:
Summary: continued… • When indexes/statistics are NOT provided in both operands • The Frequency-Partition strategy outperformed the other strategies. • When indexes/statistics are provided in both operands • The Stream strategy is the best among them. • The Stream strategy is also applicable when indexes/statistics are available only in the inner relation.