470 likes | 618 Views
Carlos Aldeias Gabriel David Cristina Ribeiro. DWXML A Preservation Format for Data Warehouses. Introduction. Outline. Motivation. DW Preservation. DWXML. DBPreserve Suite. Conclusions. Introduction Motivation Data Warehouse Preservation DWXML Definition
E N D
Carlos Aldeias Gabriel David Cristina Ribeiro DWXMLA Preservation Format for Data Warehouses
Introduction Outline Motivation DW Preservation DWXML DBPreserve Suite Conclusions Introduction Motivation Data Warehouse Preservation DWXML Definition DBPreserve Suite Application Conclusions
Introduction Introduction Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Companies, institutions and governments rely increasingly on On-Line Analytical Processing (OLAP) • Major benefits for analysis and decision support • Selective extraction and analysis of data from different perspectives • Most systems are structured using Data Warehouses • OLAP types: • ROLAP – Relational OLAP • MOLAP – Multidimensional OLAP • HOLAP – Hybrid OLAP
Introduction Preservation Concern Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Data Warehouse as a digital object • Different from conventional digital objects: data warehouses are complex digital objects • They are based on a dimensional model: Star schema, facts, dimensions with levels and hierarchies, bridges and datamarts • They are often implemented on relational databases (ROLAP), keeping data in tables, views and schemas • Data vs. Metadata • The primary data stored into tables must be archived as well as the metadata, both at the relational and dimensional levels • Technologies are evolving continually • Data Warehouses created with today’s technologies may not be accessible with the upcoming versions
Introduction Relevant Works Motivation DW Preservation DWXML DBPreserve Suite Conclusions InterPARES
Introduction Databases / Data Warehouses Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Is Data Warehouse just another name for Database
Introduction Data Warehouse:Dimensional Model Concepts Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Project Motivation DW Preservation DWXML DBPreserve Suite Conclusions [CCSDS, 2002] [Rahman, 2010]
Introduction Data Warehouse Preservation Motivation DW Preservation DWXML DBPreserve Suite Conclusions Existing preservation approaches don´t comply with data warehouse preservation requirements Regarding data warehouses implemented with relational database technologies, some efforts can be reused Although, they still lack an important metadata layer that describes the data warehouse structure and entities
Introduction Data Warehouse Metadata Motivation DW Preservation DWXML DBPreserve Suite Conclusions Example from a case study, implemented using Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production • Star Schema - fact table is surrounded by dimensional tables • Bridge Tables
Introduction DW Metadata – Fact Tables Motivation DW Preservation DWXML DBPreserve Suite Conclusions • A fact table is the center of a star schema • Consists of facts of a business process • Facts • Measures : • ADDITIVE • NON ADDITIVE • SEMI ADDITIVE
Introduction DW Metadata – Dimensions Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Usually represented by one or more dimensional tables • Levels • Hierarchies • Attributes They give the context and meaning to the facts Represent the relevant vectors of analysis of the business process facts
Introduction DW Metadata: Create Dimension Motivation DW Preservation DWXML DBPreserve Suite Conclusions CREATEDIMENSIONclass_dim LEVEL class IS (IPDW_CLASS.CLASS_ID) LEVEL course IS (IPDW_CLASS.COURSE_ID) HIERARCHYclass_rollup( class CHILD OF course) ATTRIBUTE class DETERMINES (IPDW_CLASS.CODE, IPDW_CLASS.ACRONYM, IPDW_CLASS.NAME, IPDW_CLASS.TYPE) ATTRIBUTE course DETERMINES (IPDW_CLASS.COUR_CODE, IPDW_CLASS.COUR_ACRONYM, IPDW_CLASS.COUR_NAME, IPDW_CLASS.COUR_TYPE, IPDW_CLASS.COURSE_PREVIOUS_COD); Project’s case study implements the dimensional model using Oracle Database 11g
Introduction DW Metadata – Bridge Table Motivation DW Preservation DWXML DBPreserve Suite Conclusions Bridge tables are used to resolve a many to many relationship between a fact and a dimension Also used to flatten out a hierarchy in a dimension
Introduction DW Metadata – Snowflake Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Snowflake schema is similar to a star schema, but one or more dimension tables are partially normalized • Sub-dimensions
Introduction DW Metadata – Datamart Motivation DW Preservation DWXML DBPreserve Suite Conclusions Subset of a data warehouse Typically, a set of star and snowflake schemas
Introduction Data Warehouse Preservation Format Proposal Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Analysis of relational database preservation formats • DBML (Database Markup Language) [Ramalho, 2007] • SIARD Format (Software Independent Archiving of Relational Databases) [SFA, 2008] • Analysis on Data Warehouse XML representation • XCube (for multidimensional schemas) [Hummer, 2003]
Introduction Data Warehouse Preservation Format Proposal Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Decision on extending the SIARD Format • Separates metadata from primary data • Segmented representation of primary data • Ready to use application that creates a SIARD format from a relational database (MSAccess, MSSQL and Oracle) • Add a metadata layer regarding the dimensional model perspective • Extracting data warehouse metadata from data dictionary • Defining a XML structure for the dimensional model • Embedding it into the SIARD format
Introduction Relational Database Preservation with SIARD Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Header folder for metadata • Content folder for primary data • Organized in directories • Single XML file for each data table • SIARD Suite – set of tools formigrating, editing and reactivating databases
Introduction Extending SIARD format Motivation DW Preservation DWXML DBPreserve Suite Conclusions Add a XML file with the extra metadata layer for data warehouse characterization Add the corresponding schema No action on the primary data Data in the DW ingested to the SIARD Suite as a relational database
Introduction DWXML Definition Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML - Stars Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML - FactTable Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML - Dimension Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML – Level and Attribute Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML –Hierarchy Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML –Datamart Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML –Schema Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML –Table Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML –View Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DWXML –Sample Motivation DW Preservation DWXML DBPreserve Suite Conclusions <?xml version="1.0"encoding="UTF-8"?> <dwxmlversion="1.0"xsi:noNamespaceSchemaLocation="dw.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <stars> <star> <name>IPDW_ANSWERS_STAR</name> <description>Star related to the answers</description> <factTable> <schema>CALDEIAS</schema> <name>IPDW_ANSWERS</name> <facts> <fact> <name>ANSWER</name> <column>ANSWER</column> <measure>ADDITIVE</measure> </fact> </facts> </factTable> <ray> <dimension> <schema>CALDEIAS</schema> <name>IPDW_QUESTION</name> </dimension> </ray> <ray> ... </ray> </star> </stars> ... </dwxml>
Introduction DBPreserve Suite ApplicationFeatures Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Integrates the SiardFromDb application to build the SIARD format of the data warehouse • Extracts metadata for characterization of the dimensional model • Schemas, dimensions, hierarchies, levels, attributes, tables, table comments, primary and foreign keys, views • Sorts the tables according to their role in the data warehouse • Proposes a DWXML description based on the extracted metadata • DWXML editing using GUI • Graphical representation of star schemas and dimensions and their relationships • Creates, views and embeds the DWXML file into the SIARD format • Access and retrieves the primary data
Introduction DBPreserve Suite Architecture Motivation DW Preservation DWXML DBPreserve Suite Conclusions Metadata Module SIARD Module Output Module DWXML Module Connection Module SIARDfromDB OJDBC, … JDOM Netbeans Platform 7 RC1 | JDK 7
Introduction DBPreserve Suite Connection to the DW Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite SIARDfromDB Integration Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite Metadata Extraction Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite DWXML Proposal Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite Schemas Viewer: Stars Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite Schemas Viewer: Dimensions Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite DWXML Editing Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite DWXML Embedding and Viewing Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite Primary Data retrieval from XML Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction DBPreserve Suite Case Study Results Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Extraction times: • SIARD data: 2h30m • SIARD metadata : 4 min • DWXML metadata : 3 sec • Data Warehouse • 17 tables (one with more than 2M records) • Data size: 115 MB • SIARD Format • 17 XML files with primary data (one with 323 MB) • SIARD metadata size: 71 KB • DWXML metadata size: 86 KB • Total size: 360 MB
Introduction Conclusions Motivation DW Preservation DWXML DBPreserve Suite Conclusions • Definition of DWXML, a representation of the dimensional model of a DW • Design and implementation of DBPreserve Suite • Extraction of the metadata that describes the dimensional model • Manual adjustments of the dimensional model • Generation of the XML file and embedding into SIARD format file • Primary data browse • The result is compliant with the SIARD Suite tools (just the relational level)
Introduction References Motivation DW Preservation DWXML DBPreserve Suite Conclusions
Introduction References Motivation DW Preservation DWXML DBPreserve Suite Conclusions