Evaluating Window Joins over Unbounded Streams. Author: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas University of Wisconsin-Madison CS Dept. Presenter : Yang Ying-Chia 楊 應 甲 ( R01922018) CSIE, National Taiwan University. Outline. Abstract Background Introduction Related Work
Evaluating Window Joins over Unbounded Streams • Authors: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas (University of Wisconsin-Madison, CS Dept.) • Presenter: Yang Ying-Chia 楊應甲 (R01922018), CSIE, National Taiwan University
Outline • Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion
Abstract – Problem and Solution • Problem: Process joins over unbounded streams. • Solution: Moving Window Join • Queries have “window predicates”
Abstract – Central Point of the Thesis • The paper proposes a unit-time-basis cost model for evaluating moving window joins. • Using this cost model, it proposes strategies for maximizing the efficiency of processing joins in different scenarios.
Background • Join • Nested Loops Join (NLJ) • Hash Join (HJ) • Moving Window Join
Background – Moving Window Join • Instead of joining all tuples of A with all tuples of B, we join all tuples that have arrived on A in the last t1 seconds with all tuples that have arrived on B in the last t2 seconds.
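The definition above can be sketched as a small symmetric nested-loops join over two sliding windows. This is only an illustration: the function name `window_join`, the `(timestamp, stream, key)` event layout, and the equality predicate on `key` are assumptions made for the example, not details taken from the paper.

```python
from collections import deque


def window_join(events, t1, t2):
    """Join tuples of stream A seen in the last t1 seconds with tuples of
    stream B seen in the last t2 seconds.

    events: iterable of (timestamp, stream, key), sorted by timestamp,
            where stream is "A" or "B" (hypothetical layout).
    Returns a list of ((a_ts, a_key), (b_ts, b_key)) result pairs.
    """
    windows = {"A": deque(), "B": deque()}
    spans = {"A": t1, "B": t2}
    results = []
    for ts, stream, key in events:
        # Invalidate tuples that have fallen out of either window.
        for name, win in windows.items():
            while win and win[0][0] <= ts - spans[name]:
                win.popleft()
        # Probe the opposite window with the newly arrived tuple.
        other = "B" if stream == "A" else "A"
        for other_ts, other_key in windows[other]:
            if other_key == key:
                if stream == "A":
                    results.append(((ts, key), (other_ts, other_key)))
                else:
                    results.append(((other_ts, other_key), (ts, key)))
        # Insert the new tuple into its own window.
        windows[stream].append((ts, key))
    return results
```

Each arrival triggers the three operations the cost model later accounts for: invalidation of expired tuples, a probe of the opposite window, and an insert into the tuple's own window.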
Introduction – Questions • How can we measure the efficiency of a moving window join evaluation strategy, since the traditional metric of execution time to completion does not apply? • Can an algorithm for a moving window join take advantage of asymmetries in the rates of the input streams? • How can we deal with cases in which an input stream is so fast that the system cannot keep up? • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?
Introduction – The Three Scenarios • One stream is much faster than the other. • System resources are insufficient to keep up with the input streams. • Memory is limited.
Related Work • Predicate grouping and group optimization techniques • Adaptive query processing and query scrambling • Symmetric hash join and symmetric nested loops join • Diag-Join for data warehouse environments • Rate-based streaming query optimization framework
Estimating the Cost of Moving Window Joins • Cost model • Cost of a single join operation
Cost of Nested Loop Join A to B • Number of tuples accessed in a time unit • Number of tuples accessed to search for matches in window B • Number of tuple insertions and invalidations • Cost of accessing a single tuple
Cost of Hash Join A to B • Cost of accessing a single tuple in a specific hash table implementation • The cost of probe(b) and invalidate(b) is a function of the hash bucket size in window B
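The slide formulas themselves are not preserved here, so the following is only a sketch of what a unit-time cost of this shape could look like: tuples accessed per time unit multiplied by a per-tuple access cost. Every name and term below (`window_size`, `nlj_one_way_cost`, `hj_one_way_cost`, `full_join_cost`, the uniform bucket size, the constant-time insert) is an assumption made for illustration, not the paper's exact model.

```python
def window_size(rate, span):
    """Expected number of tuples in a window: arrival rate times window length."""
    return rate * span


def nlj_one_way_cost(rate_a, rate_b, span_b, c_n):
    """Assumed unit-time cost of joining A to B with nested loops."""
    probes = rate_a * window_size(rate_b, span_b)  # each A tuple scans all of window B
    maintenance = 2 * rate_b  # assumed: one insert and one invalidation per B tuple
    return (probes + maintenance) * c_n


def hj_one_way_cost(rate_a, rate_b, span_b, n_buckets, c_h):
    """Assumed unit-time cost of joining A to B with a hash table on window B."""
    bucket = window_size(rate_b, span_b) / n_buckets  # assumed uniform bucket size
    probes = rate_a * bucket         # each A tuple scans one bucket of window B
    inserts = rate_b                 # assumed constant-time insert
    invalidations = rate_b * bucket  # assumed: each expiry scans one bucket
    return (probes + inserts + invalidations) * c_h


def full_join_cost(cost_a_to_b, cost_b_to_a):
    """A full join is modeled as the sum of its two one-way directions."""
    return cost_a_to_b + cost_b_to_a
```

Under these assumptions, an asymmetric combination such as HNJ would be costed by passing a hash cost for one direction and a nested-loops cost for the other to `full_join_cost`.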
Cost of Full Join • Symmetric Join • HHJ, NNJ
Cost of Full Join • Asymmetric Join • HNJ
Cost Curves for Full Joins • [Figure: cost curves for the full join combinations] • Selectivity factors: σa = 1/|A| = 1/Nkey(A), σb = 1/|B| = 1/Nkey(B)
Observation from the Previous Graphs • When the speed difference between the input streams is minimal, HJ outperforms every other join combination. • As the speed gap increases, the cost of HJ increases considerably and exceeds that of HNJ at around 70 tuples/sec and 140 tuples/sec. • Here we have a performance crossover point.
Estimating the Weight Factors • The crossover points can be calculated by equating the two cost formulas. • For two given streams, we can determine when NLJ will outperform HJ, depending on the ratio of the arrival rates of the input streams. …
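Equating two cost formulas can be done numerically when no closed form is at hand. The sketch below uses plain bisection on the difference of two cost functions. The helper name `crossover` and the two toy cost functions are hypothetical and chosen only so the crossing is easy to check by hand; they are not the paper's formulas or measurements.

```python
def crossover(cost_x, cost_y, lo, hi, iterations=60):
    """Find the rate in [lo, hi] where cost_x and cost_y are equal,
    assuming cost_x - cost_y changes sign exactly once on the interval."""
    diff_lo = cost_x(lo) - cost_y(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        diff_mid = cost_x(mid) - cost_y(mid)
        if (diff_mid < 0) == (diff_lo < 0):
            lo, diff_lo = mid, diff_mid  # crossing lies in the upper half
        else:
            hi = mid                     # crossing lies in the lower half
    return (lo + hi) / 2


# Toy cost curves (illustrative only): one strategy is cheap at low rates
# but grows faster, the other has a higher fixed cost but grows slowly.
def toy_cost_fast_growth(rate):
    return 2 * rate + 10


def toy_cost_slow_growth(rate):
    return rate + 50


point = crossover(toy_cost_fast_growth, toy_cost_slow_growth, 0, 100)
```

For these toy curves the two costs meet at a rate of 40; below that the fast-growing strategy is cheaper, above it the slow-growing one wins.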
Recall the three scenarios • One stream is much faster than the other. • System resources are insufficient to keep up with the input streams. • Memory is limited.
Exploiting Asymmetry in Input Stream Speeds • Assumptions: • The two time windows are fixed. • The aggregate speed of the two streams is less than the system’s service rate μ (i.e., λa + λb < μ). • The following inequality determines the likely winner between NLJ and HJ: • If the inequality holds, NLJ will outperform HJ; otherwise, HJ outperforms NLJ.
Observation from the Previous Graphs • HHJ costs the least until the input rate reaches about 70 tuples/sec; then HNJ takes over. Hence, either HHJ or HNJ is the winner. • Both hash join output rates decrease drastically after passing their thrashing point.
Maximizing the Number of Result Tuples with Limited Computing Resources • This scenario arises under the following conditions: • The system evaluates very expensive predicates. • The aggregate speed of the input streams exceeds the join operator’s service rate, i.e., λa + λb > μ. • Hence, not all answer tuples can be generated and the input streams need to be “regulated”. • But under what policy?
Performance Comparison between Policies • The winner is the equal distribution strategy! • Regardless of time window sizes and window selectivity factors.
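One way to see why an equal split could win regardless of window sizes and selectivity is sketched below. It assumes an output-rate model in which each arriving tuple matches a fixed fraction of the opposite window, giving a rate proportional to λa · λb · (Ta + Tb). That model, the function names, and the grid search are assumptions for illustration, and the sketch further assumes both streams arrive faster than μ/2 so that any split of μ is feasible.

```python
def output_rate(rate_a, rate_b, span_a, span_b, selectivity):
    """Assumed result tuples per time unit: every A arrival matches a fraction
    of window B (rate_b * span_b tuples) and every B arrival matches a
    fraction of window A (rate_a * span_a tuples)."""
    return selectivity * rate_a * rate_b * (span_a + span_b)


def best_split(mu, span_a, span_b, selectivity, steps=100):
    """Grid-search how much of the service rate mu to give to stream A;
    the remainder goes to stream B. Returns the best admitted rate for A."""
    def rate_for(i):
        admitted_a = mu * i / steps
        admitted_b = mu * (steps - i) / steps
        return output_rate(admitted_a, admitted_b, span_a, span_b, selectivity)

    best_i = max(range(steps + 1), key=rate_for)
    return mu * best_i / steps
```

In this model the window sizes and the selectivity only scale the product λa · λb, so the maximizing split is μ/2 for each stream whatever their values.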
Maximizing the Number of Result Tuples with Limited Memory • Assumptions: • The two time window sizes can be adjusted to fully utilize available memory. • The two arrival rates are constant. • Hence, memory allocation strategies are necessary. But under what policy? Will equal distribution win again?
Performance Comparison between Policies • The winner is the Max A strategy, which allocates all memory to the slower stream. • Keep the slower stream in memory and let the faster one probe against it and pass by.
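The same assumed output-rate model, λa · λb · σ · (Ta + Tb), gives an intuition for this result. If a window holding m tuples of a stream with rate λ spans m / λ seconds, the output rate becomes σ · (λb · memA + λa · memB), which is linear in the allocation, so all memory should go to the window whose coefficient is larger, that is, the window of the slower stream. The function names and the model are assumptions for illustration, not the paper's derivation.

```python
def output_rate(rate_a, rate_b, mem_a, mem_b, selectivity=0.01):
    """Assumed result tuples per time unit when window A holds mem_a tuples
    and window B holds mem_b tuples (so each window spans mem / rate seconds)."""
    span_a = mem_a / rate_a
    span_b = mem_b / rate_b
    return selectivity * rate_a * rate_b * (span_a + span_b)


def best_memory_for_a(rate_a, rate_b, total_mem, steps=100):
    """Grid-search how many of total_mem tuple slots to give to window A."""
    def rate_for(i):
        mem_a = total_mem * i / steps
        return output_rate(rate_a, rate_b, mem_a, total_mem - mem_a)

    best_i = max(range(steps + 1), key=rate_for)
    return total_mem * best_i / steps
```

With stream A at 10 tuples/sec and stream B at 100 tuples/sec, the search assigns every slot to window A, the slower stream; swapping the rates flips the answer.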
Maximizing the Number of Result Tuples with Limited Memory • Another assumption: • Variable time windows • Variable arrival rates
Performance Comparison between Policies • The best policy is either to maximize stream A’s time window together with B’s arrival rate or, alternatively, to maximize B’s time window together with A’s arrival rate.
Conclusion • A unit-time-basis cost model for analyzing the expected performance of moving window joins is introduced. • The proposed cost model divides the join cost into two independent terms, each corresponding to one of the two join directions. • This work can be extended to a cost model that goes beyond single joins and covers full query plans. • Algorithms other than NLJ and HJ can be modeled and evaluated.
The End Thanks for your attention