This research paper explores the challenges of evaluating window joins over punctuated streams and proposes the PWJoin approach that combines optimizations enabled by punctuations and sliding windows. It also introduces a state design that facilitates constraint-exploiting optimizations.
Evaluating Window Joins over Punctuated Streams
Luping Ding and Elke A. Rundensteiner
Database Systems Research Group, Worcester Polytechnic Institute
{lisading, rundenst}@cs.wpi.edu
CIKM'04
Stream Data Processing
• Online Transaction Management
• Sensor Network Monitoring
• Network Usage Analysis
• Online Auction
[Figure: Register Continuous Queries; Streaming Data, Stream Query Engine, Streaming Result]
New Challenges in Stream Context
• Potentially infinite data streams vs. stateful operators (e.g., join, distinct, …)
• Problem: potentially unbounded state
• Reason: no hint on which data is no longer useful
Example: Symmetric Hash Join [WA93]
• Memory overflow resolution: state relocation
  • Example: XJoin [UF00], Hash-Merge Join [MLA04]
• Problems
  • Join state still grows with no bound
  • Delivery of some join results may be highly deferred
[Figure: symmetric hash join over streams A and B with in-memory states SA and SB (insert, probe) and memory overflow]
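The insert-and-probe behaviour named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code of [WA93] or CAPE: the class name, the `on_tuple` method, and the single-key tuple layout are assumptions made for the example. It also shows why the state grows without bound when nothing is ever removed.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Toy symmetric hash join on one join key (hypothetical names)."""

    def __init__(self):
        # One hash table per input stream: join key -> list of payloads.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def on_tuple(self, side, key, payload):
        other = "B" if side == "A" else "A"
        # Probe the opposite state for tuples with the same join key.
        matches = self.state[other].get(key, [])
        if side == "A":
            results = [(key, payload, m) for m in matches]
        else:
            results = [(key, m, payload) for m in matches]
        # Insert the new tuple into its own state; it is never removed,
        # which is exactly the unbounded-state problem.
        self.state[side][key].append(payload)
        return results

    def state_size(self):
        return sum(len(v) for s in self.state.values() for v in s.values())


join = SymmetricHashJoin()
join.on_tuple("A", 175, "a1")
print(join.on_tuple("B", 175, "b1"))  # joins with the stored A tuple
print(join.state_size())
```

Every call adds one tuple to the state, so the size equals the number of tuples seen so far.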
Avoiding Unbounded State
• Solution: exploit constraints to detect no-longer-useful data
• Sliding window [MWA+03]
  • Identifies a bounded set of input data based on time
• K-constraint [BW04]
  • Models clustered or ordered data arrival patterns
• Punctuation [TMS+03]
  • Dynamically announces the termination of certain values
Sliding Window [KNV03]
[Figure: sliding windows Wa and Wb over Stream A and Stream B along a timeline]
Punctuation
• Meta-knowledge embedded inside data streams
• An ordered set of patterns corresponding to attributes of tuples
  • Wildcard (*), constant (9), list ({1,2,3}), range ([1, 20]), empty ()
• Semantics: tuples after a punctuation p will NOT match p
Example (Bid stream):
  …
  180  Marlie     820.00   Nov-13-03 11:02:00
  182  Ultrasale  1000.00  Nov-13-03 11:05:00
  180  Jocelyn    850.00   Nov-13-03 11:14:00
  180  *          *        *                    (punctuation: no more tuples will contain Item_id 180)
  181  pcfan      50.00    Nov-13-03 11:36:00
  …
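The five pattern kinds and the matching rule can be illustrated with a short Python sketch. The encoding is an assumption made for this example only: `"*"` stands for the wildcard, a `set` for a list pattern, a 2-tuple for an inclusive range, `None` for the empty pattern, and any other value for a constant. The function names are hypothetical.

```python
WILDCARD = "*"


def pattern_matches(pattern, value):
    """Return True if a single attribute value matches a single pattern."""
    if pattern is None:                      # empty pattern: matches nothing
        return False
    if isinstance(pattern, (set, frozenset)):  # list pattern, e.g. {1, 2, 3}
        return value in pattern
    if isinstance(pattern, tuple):           # range pattern, assumed inclusive
        low, high = pattern
        return low <= value <= high
    if pattern == WILDCARD:                  # wildcard: matches anything
        return True
    return pattern == value                  # constant pattern


def punctuation_matches(punct, tup):
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return len(punct) == len(tup) and all(
        pattern_matches(p, v) for p, v in zip(punct, tup)
    )


# The punctuation from the Bid stream example: no more bids for item 180.
punct = (180, WILDCARD, WILDCARD, WILDCARD)
print(punctuation_matches(punct, (180, "Jocelyn", 850.00, "Nov-13-03 11:14:00")))
print(punctuation_matches(punct, (181, "pcfan", 50.00, "Nov-13-03 11:36:00")))
```

Under the stated semantics, a tuple for which `punctuation_matches` returns True cannot arrive after the punctuation, which is what lets an operator discard state that only such tuples could use.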
Punctuation-Aware Join [DMR+04]
[Figure: join on item_id between Stream A (attributes A, B) and Stream B (attributes A, C), with join states SA and SB. A punctuation (175, *) announces that no more tuples will have A = 175.]
Window and Punctuation Occur Simultaneously
  SELECT A.item_id, Count(*)
  FROM Auction [Range 24 Hours] A, Bid B
  WHERE A.item_id = B.item_id
  GROUP BY A.item_id
[Figure: query plan. Auction Stream and Bid Stream feed Join on item_id, followed by Group-by item_id (count(*)); outputs Out1 (item_id) and Out2 (item_id, count). Annotations: contains punctuations on item_id; applies a 24-hour window on Auction stream.]
Optimization Opportunities
• Maintain smaller state than either pure window join or pure punctuation-exploiting join
  • Bid tuples that have been joined don’t need to be maintained in state
• Drop tuples without affecting precision of result
  • Bid tuples outside the 24-hour window of the corresponding Auction tuple don’t need to be processed
• Produce some aggregate results earlier
  • Aggregate results for some Auction tuples can be produced in less than 24 hours
Our Approach: PWJoin
• Punctuation-exploiting Window Join
• Features of PWJoin:
  • Include optimizations enabled by punctuations and by sliding windows individually
  • Accomplish optimizations enabled by interactions of the two constraint types
  • Employ a state design that effectively facilitates constraint-exploiting optimizations
PWJoin Basics and Issue
• On receiving a new tuple ta from stream A: probe B state, invalidate tuples from B state, insert ta into A state
• On receiving a new punctuation pa from stream A: purge tuples from B state, insert pa into A state
• Issue: how to design PWJoin state to facilitate all search-based operations?
  • Invalidate conducts a time-based search
  • Probe and Purge need a value-based search
PWJoin State with Two-dimensional Index
[Figure: PWJoin state. An I-Node index (hash table) holds I-Nodes with fields Key, Head, Tail and PunctFlag (e.g., none, punctuated). Tuples are stored in T-Nodes with fields tuple, NextValueListTNode and NextTimeListTNode. A time list runs from Window Begin to Window End, alongside a punctuation time list.]
Facilitating Search-based Operations
• Search-based operations
  • Invalidate: probe the time list and stop when encountering a time-valid tuple
  • Probe: probe the I-Node index and join with tuples in the value list of the matching I-Node
  • Purge: probe the I-Node index and delete tuples in the value list of the matching I-Node
• Avoid access to irrelevant tuples
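The three operations can be sketched over a simplified version of the two-dimensional state. This is an illustration under stated assumptions, not the PWJoin implementation: a `deque` stands in for the time list, a `dict` of deques for the I-Node index and its value lists, a `live` flag replaces unlinking T-Nodes from the time list, and a tuple with timestamp `ts` is assumed to expire once `ts <= now - window`. All class and method names are hypothetical.

```python
from collections import deque


class JoinState:
    """Simplified join state indexed by time and by join value."""

    def __init__(self, window):
        self.window = window
        self.time_list = deque()  # entries [ts, key, payload, live], arrival order
        self.index = {}           # join key -> deque of entries (value list)

    def insert(self, ts, key, payload):
        entry = [ts, key, payload, True]
        self.time_list.append(entry)
        self.index.setdefault(key, deque()).append(entry)

    def invalidate(self, now):
        """Time-based search: scan from the oldest tuple, stop at the first
        time-valid one. Returns the number of tuples expired."""
        expired = 0
        while self.time_list and self.time_list[0][0] <= now - self.window:
            entry = self.time_list.popleft()
            if entry[3]:  # skip tuples already removed by purge
                entry[3] = False
                bucket = self.index[entry[1]]
                bucket.popleft()  # oldest live tuple of this key is at the front
                if not bucket:
                    del self.index[entry[1]]
                expired += 1
        return expired

    def probe(self, key):
        """Value-based search: only the matching value list is touched."""
        return [entry[2] for entry in self.index.get(key, ())]

    def purge(self, key):
        """Value-based search: delete the whole value list for this key."""
        bucket = self.index.pop(key, ())
        for entry in bucket:
            entry[3] = False  # lazily dropped from the time list later
        return len(bucket)

    def __len__(self):
        return sum(len(bucket) for bucket in self.index.values())


state = JoinState(window=10)
state.insert(1, 175, "b1")
state.insert(8, 175, "b3")
print(state.probe(175))
```

Neither `probe` nor `purge` scans tuples with other join values, and `invalidate` never looks past the first time-valid tuple, which is the "avoid access to irrelevant tuples" point of the slide.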
Punctuation Propagation
• An operator may propagate punctuations to benefit downstream operators
• The join on item_id propagates punctuations on item_id
• The group-by on item_id (count(*)) can be unblocked by punctuations propagated by the join operator
[Figure: query plan with Auction Stream and Bid Stream; example propagated punctuation (180, *, *) over Item_id, Bidder_id, Bid_price]
Optimizations Enabled by Combined Constraints
[Figure: two panels over Stream S1 and Stream S2. Left: Early Punctuation Propagation, with propagation point 1 and propagation point 2. Right: Tuple Dropping.]
Achieving Optimizations by Combined Constraints
• Early propagation
  • Invalidate punctuations in the punctuation time list as tuples are invalidated
  • Expired punctuations can be propagated
• Tuple dropping
  • When early propagation happens, set the PunctFlag of the matching I-Node to “propagated”
  • Drop new tuples that match an I-Node whose PunctFlag is “propagated”
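One way to read these two steps is sketched below in Python. It is an interpretation rather than the authors' code: punctuations are reduced to a single join value, the expiry rule `ts <= now - window` is an assumption, and the class and method names are hypothetical. Only the flag values "punctuated" and "propagated" come from the slides.

```python
from collections import deque


class PunctuationTracker:
    """Tracks punctuations by arrival time and flags propagated join values."""

    def __init__(self, window):
        self.window = window
        self.punct_time_list = deque()  # (ts, key) in arrival order
        self.punct_flag = {}            # key -> "punctuated" or "propagated"

    def on_punctuation(self, ts, key):
        self.punct_time_list.append((ts, key))
        self.punct_flag[key] = "punctuated"

    def invalidate(self, now):
        """Expire punctuations that slid out of the window and return the
        join values whose punctuations can now be propagated."""
        propagated = []
        while self.punct_time_list and self.punct_time_list[0][0] <= now - self.window:
            _, key = self.punct_time_list.popleft()
            self.punct_flag[key] = "propagated"
            propagated.append(key)
        return propagated

    def should_drop(self, key):
        """A new tuple is dropped if its join value was already propagated."""
        return self.punct_flag.get(key) == "propagated"


tracker = PunctuationTracker(window=10)
tracker.on_punctuation(5, 180)
print(tracker.invalidate(now=15))
print(tracker.should_drop(180))
```

In a full operator this bookkeeping would live inside the I-Nodes of the join state; it is kept separate here only to keep the example short.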
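Memory Cost Analysis
$$|S_b|_T = |S_b|_T^{\text{insert}} - |S_b|_T^{\text{purge}} = |S_b|_T^{\text{arrive}} - |S_b|_T^{\text{purge}} = b\,T_b - b\,T_b\left(\frac{p_a T}{NK_{b,T}}\right)$$
• $b$: tuple input rate of stream B
• $p_a$: punctuation input rate of stream A
• $NK_{b,T}$: number of distinct join values that have occurred in stream B up to the T'th time unit
• $T_b$: time window on stream B
• The first term, $b\,T_b$, is the state of a pure window join; the subtracted term is the saving by punctuation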
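A quick numeric check of the cost formula may help. The function names and the input values below are made up for illustration and are not taken from the experiments; the sketch also assumes the purged fraction pa·T / NKb,T stays at or below 1.

```python
def window_join_state(b, Tb):
    """State size of stream B under a pure window join: b * Tb."""
    return b * Tb


def pwjoin_state(b, Tb, pa, T, NK):
    """State size of stream B under the slide's formula:
    b*Tb - b*Tb * (pa*T / NK)."""
    return b * Tb - b * Tb * (pa * T / NK)


# Hypothetical inputs: 100 tuples per time unit, a window of 10 time units,
# 2 punctuations per time unit, 20 elapsed time units, 80 distinct join values.
print(window_join_state(b=100, Tb=10))
print(pwjoin_state(b=100, Tb=10, pa=2, T=20, NK=80))
```

With these inputs, half of the distinct join values have been punctuated (2 × 20 / 80), so the formula halves the window-join state.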
Experimental Setup
• Experimental system
  • CAPE [RDS+04]: Continuous Query Processing System
  • Stream benchmark: generates synthetic data streams
  • 733 MHz Intel(R) Celeron CPU, 512 MB RAM, Windows 2000
• Experiments
  • Compare memory overhead and tuple output rate of PWJoin with a pure window join
  • Compare punctuation output rate of PWJoin with PJoin
PWJoin vs. WJoin: Memory and Tuple Output Rate
Streams A, B: punct-asc-100-40
PWJoin vs. PJoin: Punctuation Output Rate
Stream A: punct-asc-100-40, Stream B: punct-random-30-40
Window: 1 second
Related Work
• Pipelined join solutions
  • Symmetric Hash Join [WA93], XJoin [UF00], Hash-Merge Join [MLA04], Ripple Joins [HH99]
• Constraint-exploiting stream query optimization
  • Window joins [KNV03, GO03, GGO04, HFA+03, ZRH04]
  • Punctuation [TMS+03], PJoin [DMR+04]
  • k-Constraint-exploiting algorithm [BW04]
Conclusion
• Proposed the PWJoin algorithm
• Designed a storage structure for PWJoin state
• Derived a cost model for PWJoin
• Conducted an experimental study to explore the effectiveness of PWJoin
Thanks
• Nishant Mehta (developing stream generator)
• Prof. Leonidas Fegaras (feedback on paper)
• CAPE Group Members
• WPI Database Research Group
CAPE Project: http://davis.wpi.edu/~dsrg/CAPE/
References
• [KNV03] J. Kang, J. F. Naughton and S. D. Viglas. Evaluating Window Joins over Unbounded Streams. ICDE’03.
• [UF00] T. Urhan and M. Franklin. XJoin: A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(2), 2000.
• [HH99] P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. SIGMOD’99.
• [GO03] L. Golab and M. T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB’03.
• [GGO04] L. Golab, S. Garg and M. T. Ozsu. On Indexing Sliding Windows over On-line Data Streams. EDBT’04.
• [RDS+04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta. CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Demo, 2004.
• [BW04] S. Babu and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams.
• [TMS+03] P. A. Tucker, D. Maier, T. Sheard and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. TKDE, 15(3), 2003.
• [DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman. Joining Punctuated Streams. EDBT’04.
• [MWA+03] R. Motwani, J. Widom, A. Arasu et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR’03.
PWJoin vs. WJoin: Irrelevant Punctuations
Stream A: punct-asc-100-40, Stream B: punct-random-30-40
Window: 2 seconds