170 likes | 304 Views
Using AutoMed Metadata in Data Warehousing Environments. Hao Fan Alexandra Poulovassilis School of Computer Science & Information Systems Birkbeck college, University of London ACM International Workshop on Data Warehousing and OLAP 7 th November 2003. Outline. What is AutoMed?
E N D
Using AutoMed Metadata in Data Warehousing Environments Hao Fan Alexandra Poulovassilis School of Computer Science & Information Systems Birkbeck college, University of London ACM International Workshop on Data Warehousing and OLAP 7th November 2003
Outline • What is AutoMed? • Creating AutoMed DW Metadata • Using AutoMed DW Metadata • Comparison of AutoMed and Conceptual Data Model (CDM) approaches • Conclusion
What is AutoMed • HDM (Hypergraph Data Model) schemas consist of a set of Nodes, Edges and Constraints • Transformation Pathways • add/extend • delete/contract • rename • IQL language (See http://www.doc.ic.ac.uk/automed/techreports/ for a technical report on “The Automed Intermediate Query Language”.)
A Data Integration Example 7. contractHierarchy (<<Salary, Person,id,id>>); 8. contractHierarchy (<<Salary,Job,job_id,job_id>>); 9. contractAtt (<<Salary, salary>>); 10. contractAtt (<<Salary, dept_id>>); 11. contractFact (<<Salary, id, job_id>>); 12. contractAtt (<<Job, job_descr>>); 13. contractDim (<<Job, job_id>>); 14. contractAtt (<<Person, name>>); 15. contractDim (<<Person, id>>); 1. addRel (<<Dept,id>>, <<Dept_id>>); 2. addAtt (<<Dept,dept_name>>, <<_,Dept_id,name>>); 3. addAtt (<<Dept,total_salary>>, gc sum [(d,s)|(i,j,s)<-<<Salary,salary>>; (i',j',d)<-<Salary,dept_id>>; i=i'; j=j']); 4. delEdge (<<_,Dept_id,name>>, <<Dept,dept_name>>); 5. delNode (<<name>>,[n|(d,n)<-<<Dept,dept_name>>); 6. delNode (<<Dept_id>>, <<Dept,id>>);
Create Automed metadata repository Any DBMS supporting JDBC Specify data models All data Models used in DW schemas e.g., RDB, XML, Multi-Dim, etc. Extract data source schemas Define transformation pathways Manually Automatically Creating AutoMed Metadata
Transforming Single-source cleansing Multi-source cleansing Integration Summarizing Creating data marts Creating AutoMed Transformation Pathways AutoMed Transformation Pathways can be used for the following data warehousing activities:
adds a new construct `temp’ to the schema, whose extent consists of clean data; contracts the dirty construct , C, which is being cleaned adds a new construct, C, derived from the the data in`temp’ ; deletes or contracts the `temp’ construct. Data Cleansing The general pathway used for Data Cleansing:
Single-source Cleansing Person (id, name, address, zip, city, country) addRel (<<Temp,id,address,zip>>, toolCall 'QuickAddress Batch' '<<Person,address>>' '<<Person,zip>' ' <<Person,city>>'); contractAtt (<<Person,zip>>); contractAtt (<<Person,address>>); addAtt (<<Person,zip>>, [(i,z)|(i,a,z)<-<<Temp,id,address,zip>>]); addAtt (<<Person,address>>, [(i,a)|(i,a,z)<-<<Temp,id,address,zip>>]); deleteRel (<<Temp,id,address,zip>>, [(i,a,z)|(i,a)<-<<Person,address>>; (i',z)<-<<Person,zip>>;i=i']);
Multi-source Cleansing Person (id, maritalStatus) Emp (id, name, maritalStatus) addAtt(<<Person,maritalStatus_new>>, <<Person,maritalStatus>>-- [(i,s)|(i,s)<-<<Person,maritalStatus>>; (i',s')<- <<Emp,maritalStatus>>; i = i'; not (s = s')]); contractAtt(<<Person,maritalStatus>>); renameAtt (<<Person,maritalStatus_new>>, <<Person,maritalStatus>>);
Incremental View Maintenance Data Lineage Tracing Using AutoMed Metadata
Incremental View Maintenance S GS TP = tp1, …, tpr D V i r 1 1 i r tp2, …, tpi tp1 tpi+1, …, tpr Using AutoMed Metadata for IVM • See H. Fan. Incremental view maintenance and data lineage tracing in heterogeneous database environments. In proc. BNCOD’02 PhD Summer school, Sheffied, 2002.
Data Lineage Tracing Algorithms Fully Materialized Pathway Fully Virtual Pathway Partially Materialized Pathway Using AutoMed Metadata for DLT • Data Lineage • Affect-Pool • Origin-Pool • DLT formulae • qsAP(t) • qsOP(t) • SeeH. Fan and A. Poulovassilis. Tracing data lineage using schema transformation pathways. In knowledge Transformation for the Semantic Web, IOS Press, 2003.
Semantic mismatches Tightly coupled with the CDM Not straightforward to reuse the integration effort if a source schema is changed No semantic mismatch Possible to extend data warehouse views into a different data model Easily reuse the trans-formation and integration efforts if a source schema is changed - see Section 5 of the paper Discussion Conceptual Data Model: AutoMed:
Conclusion • AutoMed metadata can be used for expressing data warehousing activities, including data cleansing; • AutoMed metadata can be used for incrementally maintaining the DW data and data lineage tracing; • Compared with CDM, AutoMed has several advantages; • In contrast to commercial ETL tools, AutoMed metadata provides sufficient information for IVM and DLT. Limitations: • Not all data warehouse metadata can be captured by AutoMed • Currently, transformation pathways are created manually. However, we are investigating automatic/semi-automatic generation techniques
Acknowledge Thank you!