On Random Sampling over Joins
Surajit Chaudhuri (Microsoft Research), Rajeev Motwani (Stanford University), Vivek Narasayya (Microsoft Research)
Compiled by: Arjun Dasgupta
CONTENTS • The difficulty of join sampling • Semantics and algorithms of sampling • Two previous sampling strategies • New strategies for join sampling • Experimental results
SAMPLE(R1 ⋈ R2, f) ≠ SAMPLE(R1, f) ⋈ SAMPLE(R2, f)
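A small sketch of why the inequality holds. It assumes that each SAMPLE(·, f) keeps every tuple independently with probability f (coin-flip semantics, an assumption made here) and uses hypothetical toy relations R1 and R2 that are not from the slides. A join tuple survives in the join of the two samples only if both of its source tuples survive, which happens with probability f², so the join of samples is expected to be much smaller than a sample of the join.

```python
from collections import Counter


def join_size(r1, r2):
    """Number of tuples in the equi-join of r1 and r2 on their first field."""
    m2 = Counter(a for a, _ in r2)  # frequency of each join value in r2
    return sum(m2[a] for a, _ in r1)


def expected_sizes(r1, r2, f):
    """Expected size of a sample of the join vs. the join of two samples."""
    j = join_size(r1, r2)
    # A join tuple needs both source tuples to be kept: probability f * f.
    return f * j, f * f * j


# Hypothetical toy relations, each tuple is (join value, payload).
R1 = [(1, "x")] + [(2, f"b{i}") for i in range(9)]
R2 = [(2, "y")] + [(1, f"c{i}") for i in range(9)]

sample_of_join, join_of_samples = expected_sizes(R1, R2, 0.1)
print(sample_of_join, join_of_samples)  # roughly 1.8 versus roughly 0.18
```

The gap grows as f shrinks, and the surviving join tuples are also correlated with each other, so rescaling the join of samples does not repair it.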
STRATEGY USED • Obtain SAMPLE(R1 ⋈ R2, f) from non-uniform samples of R1 and R2
The Difficulty of Join Sampling -Example: • Suppose that we have the relations
TECHNIQUES FOR SAMPLING • Black-Box U1 (unweighted) • Black-Box U2 (unweighted) • Black-Box WR1 (weighted) • Black-Box WR2 (weighted)
Black-Box U2: Given relation R with n tuples, generate an unweighted WR sample of size r.
1. N ← 0.
2. Initialize reservoir array A[1..r] with r dummy values.
3. While tuples are streaming by do begin
(a) get next tuple t;
(b) N ← N + 1;
(c) for j = 1 to r, set A[j] to t with probability 1/N
end.
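The steps of Black-Box U2 can be sketched in Python as follows. This is a minimal illustration, assuming N is a running count of the tuples seen so far. The function name `black_box_u2` and the `rng` parameter are hypothetical names introduced here.

```python
import random


def black_box_u2(stream, r, rng=random):
    """Unweighted with-replacement (WR) reservoir sample of size r."""
    n_seen = 0                 # N: number of tuples seen so far (assumed counter)
    reservoir = [None] * r     # r dummy values
    for t in stream:
        n_seen += 1
        for j in range(r):
            # Each slot independently switches to t with probability 1/N.
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = t
    return reservoir


sample = black_box_u2(range(1000), 5, random.Random(42))
```

Because every slot is updated independently, the same tuple may occupy several slots, which is what makes this a with-replacement sample. The stream length does not need to be known in advance.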
Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
1. W ← 0.
2. Initialize reservoir array A[1..r] with r dummy values.
3. While tuples are streaming by do begin
(a) get next tuple t with weight w(t);
(b) W ← W + w(t);
(c) for j = 1 to r, set A[j] to t with probability w(t)/W
end.
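A matching sketch of Black-Box WR2, assuming W is the running total of the weights seen so far and that the stream yields (tuple, weight) pairs. The name `black_box_wr2` and the decision to skip non-positive weights are choices made for this illustration.

```python
import random


def black_box_wr2(weighted_stream, r, rng=random):
    """Weighted with-replacement (WR) reservoir sample of size r."""
    total = 0.0                # W: running sum of weights (assumed counter)
    reservoir = [None] * r     # r dummy values
    for t, w in weighted_stream:
        if w <= 0:
            continue           # a zero-weight tuple can never be selected
        total += w
        for j in range(r):
            # Each slot independently switches to t with probability w(t)/W.
            if rng.random() < w / total:
                reservoir[j] = t
    return reservoir


weighted = [("a", 1.0), ("b", 3.0), ("c", 0.0)]
sample = black_box_wr2(weighted, 4, random.Random(7))
```

With all weights equal, this reduces to the unweighted reservoir of Black-Box U2.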
The Classification of the Problem: • Case A: No information is available for either R1 or R2. • Case B: No information is available for R1, but indexes and/or statistics are available for R2. • Case C: Indexes/statistics are available for both R1 and R2.
Previous Sampling Strategies
Strategy Naive-Sample:
1. Compute the join J = R1 ⋈ R2.
2. As the tuples of J stream by, use Black-Box U1 or U2 to produce the sample S of size r.
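A minimal sketch of Naive-Sample, assuming relations are Python lists of tuples joined on their first field, and using a U2-style reservoir for the second step. The helper names `join_stream` and `naive_sample` are hypothetical.

```python
import random
from collections import defaultdict


def join_stream(r1, r2):
    """Yield every tuple of the equi-join of r1 and r2 on the first field."""
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    for t1 in r1:
        for t2 in index.get(t1[0], ()):
            yield (t1, t2)


def naive_sample(r1, r2, r, rng=random):
    """Step 1: compute the full join. Step 2: reservoir-sample r tuples WR."""
    n_seen = 0
    reservoir = [None] * r
    for pair in join_stream(r1, r2):
        n_seen += 1
        for j in range(r):
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = pair
    return reservoir


R1 = [(1, "a"), (2, "b")]
R2 = [(1, "x"), (1, "y"), (3, "z")]
sample = naive_sample(R1, R2, 4, random.Random(0))
```

The sample is correct, but the cost is that of computing the entire join even when r is tiny.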
Previous Sampling Strategies
Strategy Olken-Sample:
1. Let M be an upper bound on m2(v) for all values v of the join attribute A, where m2(v) is the number of tuples in R2 with A = v.
2. repeat
(a) Sample a tuple t1 ∈ R1 uniformly at random.
(b) Sample a random tuple t2 ∈ R2 from among all tuples t ∈ R2 that have t.A = t1.A.
(c) Output t1 ⋈ t2 with probability m2(t2.A)/M, and with the remaining probability reject the sample.
until r tuples have been produced.
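The rejection loop of Olken-Sample can be sketched as below. The sketch assumes relations are lists of tuples joined on their first field, that an in-memory dictionary stands in for the index on R2, and that m2(v) is the number of R2 tuples with join value v. The function name and the early return for an empty join are additions made for this illustration.

```python
import random
from collections import defaultdict


def olken_sample(r1, r2, r, rng=random):
    """Rejection sampling over the join; needs random access to r1 and an index on r2."""
    index = defaultdict(list)            # stand-in for an index on R2
    for t2 in r2:
        index[t2[0]].append(t2)
    if not any(t1[0] in index for t1 in r1):
        return []                        # empty join: the loop would never end
    upper = max(len(group) for group in index.values())   # M
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)              # (a) uniform tuple from R1
        matches = index.get(t1[0], [])
        if not matches:
            continue                     # m2 = 0, so always rejected
        t2 = rng.choice(matches)         # (b) uniform matching tuple from R2
        if rng.random() < len(matches) / upper:   # (c) accept w.p. m2/M
            out.append((t1, t2))
    return out


R1 = [(1, "a"), (2, "b"), (4, "d")]
R2 = [(1, "x"), (1, "y"), (2, "z")]
sample = olken_sample(R1, R2, 3, random.Random(1))
```

The acceptance step cancels the bias toward join values that are rare in R2, at the price of discarding some iterations.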
New Strategies for Join Sampling
Strategy Stream-Sample:
1. Use Black-Box WR1 or WR2 to produce a WR sample S1 ⊆ R1 of size r, where the weight w(t) for a tuple t ∈ R1 is set to m2(t.A), the number of tuples in R2 that join with t.
2. While tuples of S1 are streaming by do begin
(a) get next tuple t1 and let v = t1.A;
(b) sample a random tuple t2 ∈ R2 from among all tuples t ∈ R2 that have t.A = v;
(c) output t1 ⋈ t2.
end.
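A sketch of Stream-Sample under the same assumptions as before: lists of tuples joined on the first field, and m2(v) taken to be the frequency of v in R2. For brevity, `random.choices` stands in for Black-Box WR1/WR2 here; it produces a weighted with-replacement sample but is not a streaming reservoir. Each join tuple is produced with probability m2(t1.A)/|J| × 1/m2(t1.A) = 1/|J|, so the output is uniform over the join.

```python
import random
from collections import defaultdict


def stream_sample(r1, r2, r, rng=random):
    """Weighted WR sample of r1, then one uniform matching r2 tuple per sampled tuple."""
    index = defaultdict(list)            # stand-in for an index on R2
    for t2 in r2:
        index[t2[0]].append(t2)
    weights = [len(index.get(t1[0], ())) for t1 in r1]    # w(t) = m2(t.A)
    if sum(weights) == 0:
        return []                        # empty join, nothing to sample
    s1 = rng.choices(r1, weights=weights, k=r)   # stand-in for WR1/WR2
    # Tuples with weight 0 are never drawn, so every lookup below is non-empty.
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]


R1 = [(1, "a"), (2, "b"), (4, "d")]
R2 = [(1, "x"), (1, "y"), (2, "z")]
sample = stream_sample(R1, R2, 5, random.Random(3))
```

Unlike the rejection loop of Olken-Sample, every sampled tuple of R1 yields exactly one output tuple.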
New Strategies for Join Sampling • Strategy Stream-Sample is more efficient than Olken-Sample: 1. No information is required for R1 (Case B). 2. No tuple is rejected after the join is computed. 3. Only one iteration is needed for each output tuple.
New Strategies for Join Sampling
Strategy Group-Sample:
1. Use Black-Box WR1 or WR2 to produce a WR sample S1 ⊆ R1 of size r, where the weight w(t) for a tuple t ∈ R1 is set to m2(t.A), the number of tuples in R2 that join with t.
2. Let S1 consist of the tuples s1, ..., sr. Produce S2 = S1 ⋈ R2, whose tuples are grouped by the tuples of S1 that generated them.
3. Use r invocations of Black-Box U1 or U2 to sample r tuples, one from each group.
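A sketch of Group-Sample, again assuming lists of tuples joined on the first field and m2(v) equal to the frequency of v in R2. Two simplifications are made here and are not from the slides: `random.choices` stands in for Black-Box WR1/WR2, and each group is materialized in memory so that `rng.choice` can replace the per-group Black-Box U1/U2 call.

```python
import random
from collections import Counter


def group_sample(r1, r2, r, rng=random):
    """Weighted WR sample S1 of r1, join S1 with r2 in one scan, pick one per group."""
    m2 = Counter(t2[0] for t2 in r2)             # frequency statistics on R2
    weights = [m2[t1[0]] for t1 in r1]           # w(t) = m2(t.A)
    if sum(weights) == 0:
        return []                                # empty join
    s1 = rng.choices(r1, weights=weights, k=r)   # stand-in for WR1/WR2
    groups = [[] for _ in s1]                    # group i belongs to s1[i]
    for t2 in r2:                                # single scan of R2, no index
        for i, s in enumerate(s1):
            if s[0] == t2[0]:
                groups[i].append((s, t2))
    # Every sampled tuple has weight > 0, so each group is non-empty.
    return [rng.choice(group) for group in groups]


R1 = [(1, "a"), (2, "b"), (4, "d")]
R2 = [(1, "x"), (1, "y"), (2, "z")]
sample = group_sample(R1, R2, 4, random.Random(5))
```

The design trades the index on R2 for one pass over R2 plus frequency statistics, which is why the groups are filled during a scan rather than by lookup.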
New Strategy for Join Sampling • Strategy Frequency-Partition-Sample
Summary • The difficulty of join sampling: example. • The classification of the problem: 3 cases. • Previous strategies: Naive-Sample, Olken-Sample. • New strategies: Stream-Sample, Group-Sample, Frequency-Partition-Sample. • Conclusion: The new strategies are better than the earlier techniques.