The BioMap Data Warehouse

1. The BioMap Data Warehouse Integration of Relational & XML Data Using AutoMed

3. Data Integration Approaches Both-As-View (BAV) approach GAV & LAV approaches BAV approach Comparison of integration approaches BAV advantages

4. GAV & LAV Approaches Global-As-View (GAV) approach: describe GS constructs with view definitions over LSi constructs Local-As-View (LAV) approach: describe LSi constructs with view definitions over GS constructs

5. GAV Example
student(id,name,left,degree) = [{x,y,z,w} | ⟨x,y,z,w,_⟩ ∈ ug ∧ ⟨x,_,_,_,_⟩ ∉ phd ∨ ⟨x,y,z,w,_⟩ ∈ phd ∧ w = 'phd']
monitors(sno,id) = [{x,y} | ⟨x,_,_,_,y⟩ ∈ ug ∧ ⟨x,_,_,_,_⟩ ∉ phd ∨ ⟨x,y⟩ ∈ supervises]
staff(sno,sname,dept) = [{x,y,z} | ⟨x,y,z,w,_⟩ ∈ tutor ∧ ⟨x,_,_⟩ ∉ supervisor ∨ ⟨x,y,z⟩ ∈ supervisor]

6. LAV Example
tutor(sno,sname) = [{x,y} | ⟨x,y,_⟩ ∈ staff ∧ ⟨x,z⟩ ∈ monitors ∧ ⟨z,_,_,w⟩ ∈ student ∧ w ≠ 'phd']
ug(id,name,left,degree,sno) = [{x,y,z,w,v} | ⟨x,y,z,w⟩ ∈ student ∧ ⟨v,x⟩ ∈ monitors ∧ w ≠ 'phd']
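
The two LAV views can be read as joins over the global schema. The following is a minimal sketch in Python, assuming the lost comparison on the slide is "w is not 'phd'" and using invented tuples; the function names tutor_view and ug_view are hypothetical and are not part of AutoMed.

```python
# Global-schema extents; every tuple here is invented for illustration.
staff = {(10, "Lee", "CS"), (11, "Kim", "Bio")}                # staff(sno, sname, dept)
monitors = {(10, 1), (11, 2)}                                  # monitors(sno, id)
student = {(1, "Ann", 2004, "bsc"), (2, "Bob", 2005, "phd")}   # student(id, name, left, degree)


def tutor_view(staff, monitors, student):
    """tutor(sno, sname): staff who monitor at least one non-PhD student."""
    return {
        (sno, sname)
        for (sno, sname, _dept) in staff
        for (m_sno, sid) in monitors
        if m_sno == sno
        for (s_id, _name, _left, degree) in student
        if s_id == sid and degree != "phd"
    }


def ug_view(monitors, student):
    """ug(id, name, left, degree, sno): non-PhD students with their monitor."""
    return {
        (sid, name, left, degree, sno)
        for (sid, name, left, degree) in student
        if degree != "phd"
        for (sno, m_id) in monitors
        if m_id == sid
    }
```

Each local construct is defined purely in terms of global constructs, which is the defining property of LAV.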

7. Both-As-View (BAV) (1/3) Schema transformation approach For each pair (LSi,GS): incrementally modify LSi/GS to match GS/LSi

8. Both-As-View (BAV) (2/3) Common Data Model: Hypergraph Data Model (HDM) Constructs are nodes, edges & constraints It avoids the semantic mismatches that may occur between constructs of higher-level modelling languages

9. Both-As-View (BAV) (3/3) Modify using primitive schema transformations add/delete rename extend/contract Supply transformations with queries add(⟨⟨table,attrib3⟩⟩, q), where q: [{t,(a1+a2)} | {t,a1} ← ⟨⟨table,attrib1⟩⟩; {t,a2} ← ⟨⟨table,attrib2⟩⟩] extend(⟨⟨table,attrib3⟩⟩, q1, q2)
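
As an illustration of an add transformation that carries a query, here is a minimal sketch in Python. It assumes a schema can be modelled as a dictionary from scheme (a tuple of names) to extent (a set of tuples); this representation and the helper names add and q are hypothetical and do not reflect the AutoMed API.

```python
def add(schema, scheme, query):
    """Return a new schema extended with `scheme`, whose extent is
    derived from the existing constructs by `query`."""
    if scheme in schema:
        raise ValueError(f"construct {scheme} already exists")
    extended = dict(schema)
    extended[scheme] = query(schema)
    return extended


def q(schema):
    """attrib3 = attrib1 + attrib2, joined on the table key t."""
    attrib1 = schema[("table", "attrib1")]
    attrib2 = schema[("table", "attrib2")]
    return {(t1, a1 + a2) for (t1, a1) in attrib1 for (t2, a2) in attrib2 if t1 == t2}


# Invented extents for illustration.
s = {
    ("table", "attrib1"): {(1, 2), (2, 5)},
    ("table", "attrib2"): {(1, 3), (2, 7)},
}
s_new = add(s, ("table", "attrib3"), q)
```

Because q fully derives the new construct from existing ones, the step is an add; when only bounds on the extent can be given, the slide's extend with two queries applies instead.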

10. Example (1/2)
S1 → Sg: add(⟨⟨monitors⟩⟩, q1) add(⟨⟨monitors,sno⟩⟩, q2) add(⟨⟨monitors,id⟩⟩, q3) add(⟨⟨tutor,dept#⟩⟩, q4) rename(⟨⟨ug⟩⟩, ⟨⟨student⟩⟩) rename(⟨⟨tutor⟩⟩, ⟨⟨staff⟩⟩) delete(⟨⟨student,sno⟩⟩, q5)
S2 → Sg: can be derived similarly

11. Example (2/2) Automatically derivable reverse transformations add(C,q)/extend(C,q1,q2) : delete(C,q)/contract(C,q1,q2) delete/contract : add/extend rename(C1,C2) : rename(C2,C1)
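
The reversal rules on this slide are mechanical, so they can be sketched in a few lines of Python. The Step type and function names below are hypothetical, and construct and query names are kept as plain strings.

```python
from typing import NamedTuple, Tuple


class Step(NamedTuple):
    op: str                 # add, delete, extend, contract or rename
    args: Tuple[str, ...]   # construct name followed by query names


INVERSE_OP = {"add": "delete", "delete": "add",
              "extend": "contract", "contract": "extend"}


def reverse_step(step):
    """add <-> delete and extend <-> contract keep their arguments;
    rename swaps its two arguments."""
    if step.op == "rename":
        old, new = step.args
        return Step("rename", (new, old))
    return Step(INVERSE_OP[step.op], step.args)


def reverse_pathway(pathway):
    """Reverse each step and apply the steps in the opposite order."""
    return [reverse_step(step) for step in reversed(pathway)]


# The S1 -> Sg pathway from the previous slide.
s1_to_sg = [
    Step("add", ("<<monitors>>", "q1")),
    Step("add", ("<<monitors,sno>>", "q2")),
    Step("add", ("<<monitors,id>>", "q3")),
    Step("add", ("<<tutor,dept#>>", "q4")),
    Step("rename", ("<<ug>>", "<<student>>")),
    Step("rename", ("<<tutor>>", "<<staff>>")),
    Step("delete", ("<<student,sno>>", "q5")),
]
sg_to_s1 = reverse_pathway(s1_to_sg)
```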

12. BAV vs. LAV, GAV & GLAV BAV approach subsumes other integration approaches: Can be used to derive GAV & LAV view definitions (ICDE'03) Comparison with GAV, LAV & GLAV in DBIS'04

13. Schema Evolution Example Define the evolution of the global or local schema as a schema transformation pathway from the old to the new schema

14. Types Of Integration Virtual integration Materialised integration Hybrid integration

15. AutoMed Tools Data Lineage Tracing (DLT) Incremental View Maintenance (IVM) Schema matching tool Transformation pathway optimisation XML transformation/integration tool

16. Outline The AutoMed toolkit The BioMap integration Automatic XML data transformation/integration

17. Integration Outline Wrapping of sources Translation of source and global schemas into the XML schema type used within AutoMed Domain expert provides mappings between sources & global schema Automatic schema transformation/integration algorithm

18. Relational-to-XMLDSS

19. Integration Outline Wrapping of sources Translation of source and global schemas into the XML schema type used within AutoMed Domain expert provides mappings between sources & global schema Automatic schema transformation/integration algorithm

20. Outline The AutoMed toolkit The BioMap integration Automatic XML data transformation/integration

21. Outline Semantic Heterogeneity Schema Matching Ontologies Structural Heterogeneity XML schema type in AutoMed Schema transformation Schema integration

22. Semantic Heterogeneity Problem definition Schema Matching Data mining Neural networks Machine learning (LSD) Ontologies (RDFS/OWL)

23. Schema Matching (1/2) Types: 1-1, 1-n, n-1, n-m Subset, superset, equivalence Use schema matching output to create the intermediate schemas used by the schema restructuring / schema integration algorithms

24. Schema Matching (2/2) Necessary transformations: add attributes day, month, year in S delete attribute dob from S The reverse transformation pathway describes an n-1 match
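
The dob example can be sketched as a pair of queries, one for each direction. The slide does not say how dob is stored, so the ISO 'YYYY-MM-DD' string format, the (key, value) extents and the function names below are all assumptions made for illustration.

```python
def split_dob(dob_extent):
    """1-n direction: derive the extents of day, month and year from dob.
    dob_extent is a set of (key, 'YYYY-MM-DD') pairs (assumed format)."""
    day, month, year = set(), set(), set()
    for key, dob in dob_extent:
        y, m, d = (int(part) for part in dob.split("-"))
        day.add((key, d))
        month.add((key, m))
        year.add((key, y))
    return day, month, year


def merge_dob(day, month, year):
    """n-1 direction: rebuild dob from the three attributes, joined on key."""
    months, years = dict(month), dict(year)
    return {(key, f"{years[key]:04d}-{months[key]:02d}-{d:02d}") for key, d in day}


# Invented data for illustration.
dob = {("p1", "1980-02-29"), ("p2", "1975-11-03")}
day, month, year = split_dob(dob)
```

Since dob can be recomputed from day, month and year, the original attribute can be removed with a delete rather than a contract.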

25. Structural Heterogeneity Problem: Same information can be represented in many different ways Ancestor-descendant ↔ different branches Elements & attributes not clearly distinguished in XML model Ordering policy

26. Aims XML-specific solution: Insert-remove-rename operations on elements, attributes, edges Efficient 'move' (node/subtree) operation Element-to-attribute, attribute-to-element transformations Avoid loss of data due to structural incompatibilities Automation

27. XML DataSource Schema (1/2) Basic characteristics: Structure-only representation XML format → ease of traversal & manipulation Automatically derived from an XML file XMLDSS from other schema types (DTD, XML Schema)
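
To give a feel for what "structure-only" and "automatically derived from an XML file" can mean, here is a rough sketch in Python using the standard xml.etree.ElementTree module. It is not the XMLDSS derivation algorithm: it only collects element names, attributes and parent-child edges, and it identifies elements by tag name alone, which a real XMLDSS need not do. The function name structure_summary is hypothetical.

```python
import xml.etree.ElementTree as ET


def structure_summary(xml_text):
    """Collect the element names, (element, attribute) pairs and
    parent-child edges that occur in a document, ignoring the data."""
    root = ET.fromstring(xml_text)
    elements, attributes, edges = set(), set(), set()
    stack = [root]
    while stack:
        node = stack.pop()
        elements.add(node.tag)
        for name in node.attrib:
            attributes.add((node.tag, name))
        if node.text and node.text.strip():
            edges.add((node.tag, "PCDATA"))   # text content becomes a PCDATA child
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return elements, attributes, edges


elements, attributes, edges = structure_summary('<A id="1"><B>x</B><B>y</B><C/></A>')
```

The two B elements collapse into a single schema construct, so the summary describes structure rather than instances.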

28. XML DataSource Schema (2/2)

29. Schema Transformation Target schema T given Source schema S is transformed to match the structure of T

30. Algorithm Growing phase: traverse the target schema and issue an add/extend transformation for every construct that does not exist in the source schema. Shrinking phase: traverse the source schema and issue a delete/contract transformation for every construct that does not exist in the target schema. Completeness of algorithm
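
The two phases can be sketched over schemas represented as lists of construct names in traversal order (parents before children, nodes before their edges). This representation, the function name restructure and the choice to walk the source in reverse during the shrinking phase (so that edges go before the nodes they reference) are assumptions of this sketch, not details given on the slide.

```python
def restructure(source, target):
    """Return the list of transformations that turns `source` into `target`.
    Each schema is a list of construct names in traversal order."""
    source_set, target_set = set(source), set(target)
    # Growing phase: add whatever the target has and the source lacks.
    growing = [("add/extend", c) for c in target if c not in source_set]
    # Shrinking phase: remove whatever the source has and the target lacks.
    shrinking = [("delete/contract", c)
                 for c in reversed(source) if c not in target_set]
    return growing + shrinking


# Schemas shaped like Example 1: S1 has A -> B, S2 has A -> C -> B.
s1 = ["<A>", "<B>", "<A,B>"]
s2 = ["<A>", "<C>", "<A,C>", "<B>", "<C,B>"]
pathway = restructure(s1, s2)
```

Whether a given step is an add or an extend (and likewise delete or contract) depends on whether the construct can be described by a query over the rest of the schema, which this sketch does not decide.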

31. Transformation Types AutoMed primitive transformations: add/extend delete/contract rename Schema level: Insert, remove or rename schema constructs Move element/subtree Element ↔ attribute

32. Example 1 Insert element C: ext(<C>,Void,Any) ext(<A,C>,Void,Any) ext(<C,B>,Void,Any) del(<A,B>,q) Remove element C: add(<A,B>,q) con(<C,B>,Void,Any) con(<A,C>,Void,Any) con(<C>,Void,Any)

Let us now see some examples of transformations at the schema level. The figure shows the insertion of element C in schema S1 or, in the other direction, the removal of element C from schema S2. One thing to remember when inserting or removing constructs in our graph environment is that, because all edges are defined by referencing the nodes they connect, one must be careful to leave the graph in a consistent state. As an example, consider the removal of element C from schema S2. One cannot remove element C before removing its incoming and outgoing edges, as this would leave the graph in an inconsistent state.
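
The consistency rule described in the notes can be enforced by a small graph structure that refuses to drop a node while edges still reference it. The SchemaGraph class below is a hypothetical sketch for illustration and is not an AutoMed class.

```python
class SchemaGraph:
    """Nodes plus directed parent-child edges; edges may only reference
    nodes that exist, so removal order matters."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, parent, child):
        if parent not in self.nodes or child not in self.nodes:
            raise ValueError("both endpoints must exist before adding an edge")
        self.edges.add((parent, child))

    def remove_edge(self, parent, child):
        self.edges.remove((parent, child))

    def remove_node(self, node):
        dangling = sorted(edge for edge in self.edges if node in edge)
        if dangling:
            raise ValueError(f"remove edges {dangling} before node {node!r}")
        self.nodes.remove(node)


# Build a schema shaped like S2: A -> C -> B.
s2 = SchemaGraph()
for name in ("A", "C", "B"):
    s2.add_node(name)
s2.add_edge("A", "C")
s2.add_edge("C", "B")
```

Removing C then requires remove_edge("C", "B") and remove_edge("A", "C") first, which mirrors the order of the contract steps.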

33. Example 2 Insert/remove edge: move operation

This slide illustrates the basic move operation. In order to move element C from being the child of element B to being the child of element A, we simply insert an edge from element A to element C, then we remove the edge from element B to element C. Since it is possible to describe both the inserted and the removed edge using the rest of the schema constructs, the transformations are add and delete respectively.

34. Example 3 Move: add(<root,B>,q3) add(<B,A>, [{b,a} | {a,b} ← <A,B>]) delete(<A,B>, [{a,b} | {b,a} ← <B,A>]) Complete: add(<B'>, <B>++q1) add(<A,B'>, <A,B>++q2) delete(<A,B>, <A,B'>) delete(<B>, <B'>) rename(<B'>, <B>)

This slide shows another move operation example. This time the transformations are not add and delete, but extend and contract. Furthermore, if we want to avoid loss of data, the algorithm has to create synthetic structure. The reason is that in the data source of S1 there may be instances of A that do not have instances of B as children. Therefore, when migrating the data of data source S1 to data source S2, if the algorithm does not create synthetic structure, these instances of A will be lost. In this example, the pathway consists of two transformations. The first is a composite transformation named complete, which extends schema S1 with element B and then extends schema S1 with the edge from element root to element B. The second is an extend transformation that inserts the edge from element B to element A.

35. Example 1 - revisited Actually, this can also be treated with an add/delete transformation

36. Example 4 Element-to-attribute transformation: insert(<A,A:B>,q) remove(<A,B>,q) remove(<B,PCDATA>,q) remove(<B>,q) Attribute-to-element transformation: insert(<B>,q) insert(<A,B>,q) insert(<B,PCDATA>,q) remove(<A,A:B>,q)

This slide illustrates the element-to-attribute and attribute-to-element transformations. Element B in schema S1 is transformed into attribute B of element A in schema S2. The algorithm first creates the attribute name from the element name. Then the algorithm adds the attribute's extent, since it is possible to describe it using element B, the PCDATA node and the edge from element B to the PCDATA node. In order to transform attribute B into an element, the algorithm first creates the element name using the attribute name. Then it inserts the edge from element A to element B and the edge from element B to the PCDATA node.
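
At the data level, the effect of these two transformations can be sketched with Python's standard xml.etree.ElementTree module. This only illustrates the result on a document; it is not the schema-level algorithm from the slide, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def element_to_attribute(root, parent_tag, child_tag):
    """Turn the text of the first <child_tag> child of each <parent_tag>
    into an attribute of the parent, then drop the child element."""
    for parent in list(root.iter(parent_tag)):
        child = parent.find(child_tag)
        if child is not None:
            parent.set(child_tag, (child.text or "").strip())
            parent.remove(child)


def attribute_to_element(root, parent_tag, attr_name):
    """Inverse direction: create a child element named after the attribute
    and move the attribute value into its text."""
    for parent in list(root.iter(parent_tag)):
        if attr_name in parent.attrib:
            child = ET.SubElement(parent, attr_name)
            child.text = parent.attrib.pop(attr_name)


doc = ET.fromstring("<root><A><B>42</B></A></root>")
element_to_attribute(doc, "A", "B")   # A now carries B="42" and has no B child
```

Calling attribute_to_element(doc, "A", "B") afterwards restores the original shape of the document.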

37. Schema Integration Augment with missing constructs Remove redundant constructs

38. Materialisation Strategy: Materialise root and its attributes Consider all edges (ep,ec) in a depth-first way Materialise ec and its attributes
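
The depth-first strategy can be sketched as a traversal that emits constructs in the order they would be materialised. The dictionary-based schema representation and the function name materialisation_order are assumptions made for this sketch.

```python
def materialisation_order(root, children, attributes):
    """Return (element, attributes) pairs in materialisation order:
    the root first, then the child end of every edge, depth-first.
    `children` maps an element to its ordered child elements and
    `attributes` maps an element to its attribute names."""
    order = [(root, tuple(attributes.get(root, ())))]

    def visit(parent):
        for child in children.get(parent, ()):   # each edge (ep, ec)
            order.append((child, tuple(attributes.get(child, ()))))
            visit(child)

    visit(root)
    return order


# Invented schema for illustration.
children = {"root": ["A", "D"], "A": ["B", "C"]}
attributes = {"A": ["id"]}
order = materialisation_order("root", children, attributes)
```

Here the whole subtree under A is materialised before D, which is what distinguishes the depth-first order from a level-by-level one.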

39. Conclusions XML-specific transformation & integration algorithms: element ↔ attribute transformations move operation No loss of data by synthetically creating missing structure Automation, if sources have previously been semantically reconciled

40. Future Work Ontologies instead of schema matching XMLDSS Constraints Support for XML databases XQuery capability for XML wrapper
