380 likes | 476 Views
Supporting Streaming Updates in an Active Data Warehouse. Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis , Alkis Simitsis, Nils-Erik Frantzell. Forecast. Problem in active data warehousing :
E N D
Supporting Streaming Updates in an Active Data Warehouse Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell
Forecast • Problem in active data warehousing: • the join between a fast stream of source updates and a disk-based relationunder the constraint of limited memory • Solution: • themesh join, a novel join operator that operates under minimum assumptions for the stream and the relation • Features: • a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate ICDE 2007, Constantinople 18/4/2007
Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007
Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007
ETL workflows DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2 Aggregate2 FTP2 Sources DW ICDE 2007, Constantinople 18/4/2007
Active Data Warehousing • Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software • Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data ICDE 2007, Constantinople 18/4/2007
Issues around Active Warehousing • Smooth upgrade of the software at the (legacy) source • minimal modification of the software configuration at the source side • Minimal overhead of the source system • No data losses are allowed in the long run • Maximum freshness of data • the response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable • Scalability at the warehouse side • the architecture should scale up with respect to the number of sources and data consumers at the DW • if possible, cover issues like checkpointing, index maintenance ICDE 2007, Constantinople 18/4/2007
Grand view of an Active DW ICDE 2007, Constantinople 18/4/2007
Problem statement • Joining a fast stream of updates with a persistent relation within limited memory bounds is of particular importance in the Active Warehousing setting • Example practical cases: • Surrogate Key assignment • Duplicate detection • … ICDE 2007, Constantinople 18/4/2007
Example: Surrogate Key ICDE 2007, Constantinople 18/4/2007
Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007
Operation of Mesh-Join ICDE 2007, Constantinople 18/4/2007
(Not really any) Assumptions • No assumption of any order in either the stream or the relation • No indexes are necessarily present • Limited memory is available • The join condition is arbitrary (equality, similarity, range, etc.) • The join relationship is general (i.e., many-to-many, one-to-many, or many-to-one) • The result is exact. … But .. • The relation remains fixedthroughout the join ICDE 2007, Constantinople 18/4/2007
Architecture of Mesh-Join ICDE 2007, Constantinople 18/4/2007
Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007
Critical issues • The important measures are: • the stream rate λ • the available memory M • the service rate μ of the join • The main challenge is to interrelate these metrics in a cost formula, so as to be able to tune the system • minimize M, given a desirable rate μ • maximize μ, give a constraint of available memory M ICDE 2007, Constantinople 18/4/2007
NR b = # iterations a stream tuple must “see” Cost model: Memory wrt b, s Size of hash H Size of queue Q Size of b buffers Size of w buffers ICDE 2007, Constantinople 18/4/2007
Cost model: cost of an iteration wrt b, s ICDE 2007, Constantinople 18/4/2007
Cost model Interrelated M, μ, λvia w, s Cloop = function (w, b) M = function (w, b) ICDE 2007, Constantinople 18/4/2007
Tuning: M,μ as a function of b ICDE 2007, Constantinople 18/4/2007
Minimize M, given a desirable rate μ • Minimize w => minimize M • Minimum wmin = λcloop • In this case λ = μ • Thus, M is a function only of b, computed by simple calculus ICDE 2007, Constantinople 18/4/2007
Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007
Experimental methodology • Synthetic data set: Zipf distribution, skew in [0,1], 10% of R as available memory, 3.5M rows, domain of 1.35M values • Real data set: cloud cover data, 10M rows, domain of 36,000 values • INL as an opponent, based on a clustered B+, in Berkeley DB • Platform: Pentium IV 3GHz, 1GB main memory, 7200 RPM disk ICDE 2007, Constantinople 18/4/2007
Predicted and measuredperformance (synthetic data) ICDE 2007, Constantinople 18/4/2007
Performance for varyingmemory (synthetic data) ICDE 2007, Constantinople 18/4/2007
Performance for varyingdata skew (synthetic data) ICDE 2007, Constantinople 18/4/2007
Performance for varyingmemory (real-life data) ICDE 2007, Constantinople 18/4/2007
Roadmap • Motivation & Problem statement • The Mesh-Join Algorithm • Cost model & Tuning • Experiments • Conclusions ICDE 2007, Constantinople 18/4/2007
Conclusions • We have proposed themesh join, a join operator particularly fit for active data warehousing that operates under minimum assumptions for the stream and the relation • We have presented a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate ICDE 2007, Constantinople 18/4/2007
Other capabilities & Possible extensions • Approximate processing • Ordered join output • Tuning for join conditions other than equality • Dynamic tuning for changes in the stream rate • Possible Extensions • multi-way joins • other active ETL operators ICDE 2007, Constantinople 18/4/2007
Thank you for your attention! … many thanks to our hosts! Figures of the Antikythera mechanism by Rupert Russell <rupert@giant.net.au>URL: http://www.giant.net.au/users/rupert/kythera/kythera.htm This research was co-funded by the European Union in the framework of the program “Pythagoras IΙ” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF). ICDE 2007, Constantinople 18/4/2007
Questions? ICDE 2007, Constantinople 18/4/2007
Backup Slides ICDE 2007, Constantinople 18/4/2007
Related work • Applications of Symmetric Hash-Joins over windows of streaming inputs that fit in M/M • Chandrasekaran, Franklin @ VLDBJ, 2003 • Golab, Ozsu @ VLDB 2003 • Hammad, Franklin, Aref, Elmagarmid @ VLDB 2003 • Viglas, Naughton, Burger @ VLDB 2003 • Joins of streamed bounded relations: Xjoin variants that flush overflow tuples to disk • Dittrich, Seeger, Taylor, Widmayer @ VLDB 2002 • Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis @ SIGMOD 2005 ICDE 2007, Constantinople 18/4/2007
Involved Measures ICDE 2007, Constantinople 18/4/2007
Cost model I/O per stream tuple I/O per second ICDE 2007, Constantinople 18/4/2007
Loops of Mesh Join ICDE 2007, Constantinople 18/4/2007