380 likes | 536 Views
Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates. Alkis Simitsis 1 , Panos Vassiliadis 2 , Manolis Terrovitis 1 , Spiros Skiadopoulos 1 (1) National Technical University of Athens {asimi,mter , spiros}@dbnet . ece . ntua . gr (2) University of Ioannina
E N D
Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis1, Panos Vassiliadis2, Manolis Terrovitis1, Spiros Skiadopoulos1 (1) National Technical University of Athens {asimi,mter,spiros}@dbnet.ece.ntua.gr (2) University of Ioannina pvassil@cs.uoi.gr
Outline • Background & Motivation • Database updates and composite transformations • Zooming in and out the Architecture Graph • Conclusions and Future Work DaWaK'05, Copenhagen, August 2005
Outline • Background & Motivation • Database updates and composite transformations • Zooming in and out the Architecture Graph • Conclusions and Future Work DaWaK'05, Copenhagen, August 2005
Extract-Transform-Load (ETL) DaWaK'05, Copenhagen, August 2005
Motivation DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2 Aggregate2 FTP2 Sources DW DaWaK'05, Copenhagen, August 2005
Background • Traditional workflow modeling has treated workflows as graphs with control-flow semantics • We take advantage of the data-centric, script-based nature of ETL activities to model their internals as graphs, too, as a graph, which we call Architecture Graph • Our previous efforts [DMDW’02] handled simple cleanings and transformations and templates to simplify the definition of scenarios [CAiSE’03] DaWaK'05, Copenhagen, August 2005
A1 A1 A2 A2 A3 A3 A4 A4 Background [DMDW’02, CAiSE’03] POPULATED_FIELD NotNull_A1 IN.A1 IN OUT IN A1 A1 A2 A2 R A3 A3 A4 A4 Legend: NotNull_A1 SK_A2 Black-box model: no semantics in the graph R … DaWaK'05, Copenhagen, August 2005
Why is it important? • What part of the scenario is affected if we delete an attribute? Which attributes/tables are involved in the population of an attribute? • Straightforward to follow the data propagation chain • How “good” is my design of the ETL scenario? • Detection of inconsistencies • Detection of important, vulnerable or uselessattributes • Well-defined quality measurement theory based on graphs DaWaK'05, Copenhagen, August 2005
Vulnerable point in the scenario PKEY SOURCE PKEY PKEY PKEY PKEY SK SKEY SUPPKEY SUPPKEY SUPPKEY AddSPK1 SK1 Graph-based modeling of data centric workflows is important!!! IN OUT IN OUT LOOKUP_PS DaWaK'05, Copenhagen, August 2005
Contribution • We incorporate update semantics in our graph-based modeling of ETL activities. • We also cover complex transformationslike negation, aggregation and self-joins. • We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction. • Long version and ER’05: details for adding internal semantics DaWaK'05, Copenhagen, August 2005
Outline • Background & Motivation • Database updates and composite transformations • Zooming in and out the Architecture Graph • Conclusions and Future Work DaWaK'05, Copenhagen, August 2005
Updates and Transformations • In this paper, we consider adding graph modeling techniques for several kinds of activities: • Update (INS, UPD, DEL) activities • Aggregates • Rules employing negation and aliases • Functions • We use LDL++ as language that declaratively describes the semantics DaWaK'05, Copenhagen, August 2005
Updates to the database • An update expression is of the form head <- query part, update part • with the following semantics: • a query to the database for the tuples that abide by the query part • we update the predicate of the update part as specified in the rule. raise1(Name, Sal, NewSal) <- employee(Name, Sal), Sal = 1100, (a) NewSal = Sal * 1.1, (b) - employee(Name, Sal), (c) + employee(Name, NewSal). (d) DaWaK'05, Copenhagen, August 2005
Updates to the database DaWaK'05, Copenhagen, August 2005
Updates to the database • A side-effect rule is treated as an activity, with the corresponding node. The output schema of the activity is derived from the structure of the predicate of the head of the rule. • For every predicate with a + or – in the body of the rule, a respective provider edge from the output schema of the side-effect activity is assumed. • For every predicate that appears in the rule without a + or – tag, we assume the respective input schema. Provider edges from this predicate towards these schemata are added as usual. The same applies for the attributes of the input and output schemata of the side effect activity. DaWaK'05, Copenhagen, August 2005
Aggregation • Aggregation in LDL: • grouping of values to a bag and • application of an aggregate function over the values of the bag. R16: aggregate1.a_in(skey,suppkey,date,qty,cost)<- dw.partsupp(skey,suppkey,date,qty,cost) R17: temp(skey,day,<cost>) <- aggregate1.a_in(skey,suppkey,date,qty,cost). R18: aggregate1.a_out(skey,day,min_cost) <- temp(skey,day,all_costs), aggr(min,all_costs,min_cost). R19: v1(skey,day,min_cost) <- aggregate1.a_out(skey,day,min_cost). DaWaK'05, Copenhagen, August 2005
Aggregation DaWaK'05, Copenhagen, August 2005
Aggregation • Relations which create a set from the values of a field employ a pair of regulator edges through an intermediate node ‘<>’. • Provider relations for attributes used as groupers are tagged with ‘g’. • One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max). DaWaK'05, Copenhagen, August 2005
Outline • Background & Motivation • Database updates and composite transformations • Zooming in and out the Architecture Graph • Conclusions and Future Work DaWaK'05, Copenhagen, August 2005
Zooming in and out • There is a principled way of zooming in and out, in various levels of abstraction: • The attribute level • The schema level • The activity level DaWaK'05, Copenhagen, August 2005
Zooming in and out • For each node x of the architecture graph G(V,E) representing a schema: • for each provider edge (xa,y) or (y,xa), involving an attribute of x and an entity y, external to x, introduce the respective provider edge between x and y (unless it already exists, of course); • remove the provider edges (xa,y) and (y,xa) of the previous step; • remove the nodes of the attributes of x and the respective part-of edges. DaWaK'05, Copenhagen, August 2005
Ga Zooming in and out Gs DaWaK'05, Copenhagen, August 2005
Ga Zooming in and out Gs DaWaK'05, Copenhagen, August 2005
Outline • Background & Motivation • Database updates and composite transformations • Zooming in and out the Architecture Graph • Conclusions and Future Work DaWaK'05, Copenhagen, August 2005
Summary • We have incorporated update semantics in our graph-based modeling of ETL activities. • We also cover complex transformationslike negation, aggregation and self-joins. • We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and outat multiple levels of abstraction • Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf DaWaK'05, Copenhagen, August 2005
Arktos II On-going/Future Work • This work is part of the Arktos II project • Future work includes research in • What-if analysis of ETL scenarios • Measures for the quality of the design of ETL scenarios http://www.cs.uoi.gr/~pvassil/projects/arktos_II DaWaK'05, Copenhagen, August 2005
Arktos II Thank you! DaWaK'05, Copenhagen, August 2005 http://www.cs.uoi.gr/~pvassil/projects/arktos_II
Backup Slides DaWaK'05, Copenhagen, August 2005
Vision – big picture • The major research goal is to be able to have a metadata repository that incorporates metadata • on the static part of an information system, i.e., tables, constraints, query forms, etc • on the dynamic part, i.e., data-centric software modules • We invest on a graph-based modeling approach, based on the flexibility of graphs as modeling tools DaWaK'05, Copenhagen, August 2005
Integer Preliminaries • Data types • Constants • Attributes • RecordSets • Function types • Functions 1 PKEY R $2€ my$2€ DaWaK'05, Copenhagen, August 2005
Relationships • Instance-Of Relationships • Part-Of Relationships • Regulator Relationships • Provider Relationships • Derived Provider Relationships DaWaK'05, Copenhagen, August 2005
Parameters Activity Output Input 1 Input 2 Rejected Rows Activities • Name • Input Schemata • Output Schema • Rejections Schema • Parameter List • Output/Rejection Operational Semantics DaWaK'05, Copenhagen, August 2005
Importance Metrics • Dependency: the in-degree of the node with respect to the provider edges; • Responsibility: the out-degree of the node with respect to the provider edges; • Degree: dependency + responsibility Local vs. Transitive DaWaK'05, Copenhagen, August 2005
Functions Functions are treated as any other predicate in LDL, with the following special characteristics: • The function involves a list of parameters, the last of which is the return value of the function. • All function parameters referenced in the body of the rule either as homonyms with attributes, of other predicates or through equalities with such attributes, are linked through equality regulator relationships with these attributes. • The return value is possibly connected to the output through a provider relationship (or with some other predicate of the body, through a regulator relationship). DaWaK'05, Copenhagen, August 2005
Aliases & Negation • Alias relationships. An alias relationship is introduced whenever the same predicate appears in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that due to the fact that intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules • Negation. When a predicates appears negated in a rule body, then the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body. DaWaK'05, Copenhagen, August 2005
Activity semantics in LDL R06: a_in1(pkey,suppkey,date,qty,cost)<- ps(pkey,suppkey,date,qty,cost). R07: a_in2(pkey,source,skey)<- lookUp(l_pkey,source,l_skey), pkey=l_pkey, skey=l_skey,source=1. R08: a_out(pkey,suppkey,date,qty,cost,skey)<- a_in1(pkey,date,qty,cost), a_in2(pkey,source,l_skey). R09: D2E.a_in(skey,suppkey,date,qty,cost)<- sk.a_out(pkey,suppkey,date,qty,cost,skey) DaWaK'05, Copenhagen, August 2005
Activities DaWaK'05, Copenhagen, August 2005
Zooming in and out DaWaK'05, Copenhagen, August 2005