1 / 38

Supporting Streaming Updates in an Active Data Warehouse

Supporting Streaming Updates in an Active Data Warehouse. Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis , Alkis Simitsis, Nils-Erik Frantzell. Forecast. Problem in active data warehousing :

graceland
Download Presentation

Supporting Streaming Updates in an Active Data Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Supporting Streaming Updates in an Active Data Warehouse Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell

  2. Forecast • Problem in active data warehousing: • the join between a fast stream of source updates and a disk-based relationunder the constraint of limited memory • Solution: • themesh join, a novel join operator that operates under minimum assumptions for the stream and the relation • Features: • a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate ICDE 2007, Constantinople 18/4/2007

  3. Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007

  4. Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007

  5. ETL workflows DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2  Aggregate2 FTP2 Sources DW ICDE 2007, Constantinople 18/4/2007

  6. Active Data Warehousing • Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software • Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data ICDE 2007, Constantinople 18/4/2007

  7. Issues around Active Warehousing • Smooth upgrade of the software at the (legacy) source • minimal modification of the software configuration at the source side • Minimal overhead of the source system • No data losses are allowed in the long run • Maximum freshness of data • the response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable • Scalability at the warehouse side • the architecture should scale up with respect to the number of sources and data consumers at the DW • if possible, cover issues like checkpointing, index maintenance ICDE 2007, Constantinople 18/4/2007

  8. Grand view of an Active DW ICDE 2007, Constantinople 18/4/2007

  9. Problem statement • Joining a fast stream of updates with a persistent relation within limited memory bounds is of particular importance in the Active Warehousing setting • Example practical cases: • Surrogate Key assignment • Duplicate detection • … ICDE 2007, Constantinople 18/4/2007

  10. Example: Surrogate Key ICDE 2007, Constantinople 18/4/2007

  11. Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007

  12. Operation of Mesh-Join ICDE 2007, Constantinople 18/4/2007

  13. (Not really any) Assumptions • No assumption of any order in either the stream or the relation • No indexes are necessarily present • Limited memory is available • The join condition is arbitrary (equality, similarity, range, etc.) • The join relationship is general (i.e., many-to-many, one-to-many, or many-to-one) • The result is exact. … But .. • The relation remains fixedthroughout the join ICDE 2007, Constantinople 18/4/2007

  14. Architecture of Mesh-Join ICDE 2007, Constantinople 18/4/2007

  15. ICDE 2007, Constantinople 18/4/2007

  16. Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007

  17. Critical issues • The important measures are: • the stream rate λ • the available memory M • the service rate μ of the join • The main challenge is to interrelate these metrics in a cost formula, so as to be able to tune the system • minimize M, given a desirable rate μ • maximize μ, give a constraint of available memory M ICDE 2007, Constantinople 18/4/2007

  18. NR b = # iterations a stream tuple must “see” Cost model: Memory wrt b, s Size of hash H Size of queue Q Size of b buffers Size of w buffers ICDE 2007, Constantinople 18/4/2007

  19. Cost model: cost of an iteration wrt b, s ICDE 2007, Constantinople 18/4/2007

  20. Cost model Interrelated M, μ, λvia w, s Cloop = function (w, b) M = function (w, b) ICDE 2007, Constantinople 18/4/2007

  21. Tuning: M,μ as a function of b ICDE 2007, Constantinople 18/4/2007

  22. Minimize M, given a desirable rate μ • Minimize w => minimize M • Minimum wmin = λcloop • In this case λ = μ • Thus, M is a function only of b, computed by simple calculus ICDE 2007, Constantinople 18/4/2007

  23. Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007

  24. Experimental methodology • Synthetic data set: Zipf distribution, skew in [0,1], 10% of R as available memory, 3.5M rows, domain of 1.35M values • Real data set: cloud cover data, 10M rows, domain of 36,000 values • INL as an opponent, based on a clustered B+, in Berkeley DB • Platform: Pentium IV 3GHz, 1GB main memory, 7200 RPM disk ICDE 2007, Constantinople 18/4/2007

  25. Predicted and measuredperformance (synthetic data) ICDE 2007, Constantinople 18/4/2007

  26. Performance for varyingmemory (synthetic data) ICDE 2007, Constantinople 18/4/2007

  27. Performance for varyingdata skew (synthetic data) ICDE 2007, Constantinople 18/4/2007

  28. Performance for varyingmemory (real-life data) ICDE 2007, Constantinople 18/4/2007

  29. Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007

  30. Conclusions • We have proposed themesh join, a join operator particularly fit for active data warehousing that operates under minimum assumptions for the stream and the relation • We have presented a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate ICDE 2007, Constantinople 18/4/2007

  31. Other capabilities & Possible extensions • Approximate processing • Ordered join output • Tuning for join conditions other than equality • Dynamic tuning for changes in the stream rate • Possible Extensions • multi-way joins • other active ETL operators ICDE 2007, Constantinople 18/4/2007

  32. Thank you for your attention! … many thanks to our hosts! Figures of the Antikythera mechanism by Rupert Russell <rupert@giant.net.au>URL: http://www.giant.net.au/users/rupert/kythera/kythera.htm This research was co-funded by the European Union in the framework of the program “Pythagoras IΙ” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF). ICDE 2007, Constantinople 18/4/2007

  33. Questions? ICDE 2007, Constantinople 18/4/2007

  34. Backup Slides ICDE 2007, Constantinople 18/4/2007

  35. Related work • Applications of Symmetric Hash-Joins over windows of streaming inputs that fit in M/M • Chandrasekaran, Franklin @ VLDBJ, 2003 • Golab, Ozsu @ VLDB 2003 • Hammad, Franklin, Aref, Elmagarmid @ VLDB 2003 • Viglas, Naughton, Burger @ VLDB 2003 • Joins of streamed bounded relations: Xjoin variants that flush overflow tuples to disk • Dittrich, Seeger, Taylor, Widmayer @ VLDB 2002 • Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis @ SIGMOD 2005 ICDE 2007, Constantinople 18/4/2007

  36. Involved Measures ICDE 2007, Constantinople 18/4/2007

  37. Cost model I/O per stream tuple I/O per second ICDE 2007, Constantinople 18/4/2007

  38. Loops of Mesh Join ICDE 2007, Constantinople 18/4/2007

More Related