Data Warehouses Mert Terzihan CmpE 422 Database Systems
Outline
• What is a Data Warehouse?
• Architecture
• Design and Implementation of a DW
• Logical Design
• Metadata in DW
• Data Marts
• Data Staging
• Testing Data Warehouses
Data Warehouse
• "Subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making"
• Historical data for decision support
• Separate from the organization's operational databases (OLTP)
Design and Implementation
• Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers, and tools.
• Integrate the servers, storage, and client tools.
• Design the warehouse schema and views.
• Define the physical warehouse organization, data placement, partitioning, and access methods.
Design and Implementation
• Connect the sources using gateways, ODBC drivers, or other wrappers.
• Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
• Populate the repository with the schema and view definitions, scripts, and other metadata.
• Design and implement end-user applications.
• Roll out the warehouse and applications.
Design and Implementation: Bottom-Up Design
• Data marts are created first; the DW is then built from them
• Drill across: integration of multiple data marts
• Advantages: faster and easier implementation, less risk of failure
• Disadvantages: each data mart has a narrow view of the data; redundant data across data marts
Design and Implementation: Top-Down Design
• The DW is designed using a normalized enterprise data model
• Advantages: enterprise-wide view of data, centralized rules and control
• Disadvantages: takes longer to build, higher risk of failure
Logical Design: Dimensional Data Modeling
• The critical operation is reading: queries dominate over updates
• Join operations slow queries down, and normalization means more tables and more joins
• Minimizing join operations gives quick answers to queries
• A centralized fact table keeps every dimension only one join away
Logical Design: Dimensional Data Modeling
Star Schema
• Fact Table:
 - Dimension attributes (FK-PK relationships with dimension tables)
 - Degenerate dimensions (dimension attributes not related to any dimension table)
 - Measures (values to be aggregated)
• Dimension Tables
Logical Design: Dimensional Data Modeling • Star Schema (diagram)
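To make the structure above concrete, here is a minimal sketch of a star schema created through Python's sqlite3. The table and column names (fact_job, dim_ship_date, job_number) are illustrative assumptions chosen to match the example queries on the next slide, not a schema taken from the slides themselves.

```python
# A minimal star schema sketch using Python's sqlite3.
# Table and column names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per shipping day, with the hierarchy
# day -> month -> quarter -> year stored as attributes.
cur.execute("""
    CREATE TABLE dim_ship_date (
        ship_date_key INTEGER PRIMARY KEY,
        ship_date     TEXT,
        ship_month    TEXT,
        ship_quarter  TEXT,
        ship_year     INTEGER
    )
""")

# Fact table: a foreign key to the dimension, a degenerate dimension
# (job_number, which has no dimension table of its own), and measures.
cur.execute("""
    CREATE TABLE fact_job (
        ship_date_key INTEGER REFERENCES dim_ship_date(ship_date_key),
        job_number    TEXT,   -- degenerate dimension
        cost          REAL,   -- measure
        sell          REAL    -- measure
    )
""")
conn.commit()
```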
Star Schema: Example Queries
• Selecting rows from the fact table:
 - Select cost and sell for all jobs where the Ship Month is January 2005
• Grouping rows for summary information:
 - Select the total cost and sell for each Ship Month in the Ship Year 2005
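These two queries map directly onto the star schema. Below is a hedged sketch using the assumed fact_job and dim_ship_date tables from the earlier code block; representing the Ship Month as a month name plus a year column is one of several reasonable encodings.

```python
# SQL for the two example queries, written as Python strings so they can be
# run with cur.execute(...) against the assumed schema above.

# 1. Selecting rows: cost and sell for all jobs shipped in January 2005.
select_rows = """
    SELECT f.job_number, f.cost, f.sell
    FROM fact_job AS f
    JOIN dim_ship_date AS d ON f.ship_date_key = d.ship_date_key
    WHERE d.ship_month = 'January' AND d.ship_year = 2005
"""

# 2. Grouping rows: total cost and sell per Ship Month in Ship Year 2005.
summarize = """
    SELECT d.ship_month, SUM(f.cost) AS total_cost, SUM(f.sell) AS total_sell
    FROM fact_job AS f
    JOIN dim_ship_date AS d ON f.ship_date_key = d.ship_date_key
    WHERE d.ship_year = 2005
    GROUP BY d.ship_month
"""
```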
Star Schema
• Dimension tables let users examine data at different levels of detail (a hierarchy), e.g. grouping rows by Ship Date, Month, Quarter, or Year
• Drill-down operation: examine more detailed data by moving down a hierarchy
• Roll-up operation: summarize details by moving up a hierarchy
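Roll-up and drill-down amount to grouping by coarser or finer attributes of the hierarchy. A small sketch, still assuming the dim_ship_date columns from the earlier blocks:

```python
# Roll-up / drill-down as a change of grouping level along the assumed
# date hierarchy (ship_date -> ship_month -> ship_quarter -> ship_year).

# Starting point: totals per month.
per_month = """
    SELECT d.ship_year, d.ship_month, SUM(f.sell) AS total_sell
    FROM fact_job AS f
    JOIN dim_ship_date AS d ON f.ship_date_key = d.ship_date_key
    GROUP BY d.ship_year, d.ship_month
"""

# Roll-up: move up the hierarchy and summarize per year.
per_year = """
    SELECT d.ship_year, SUM(f.sell) AS total_sell
    FROM fact_job AS f
    JOIN dim_ship_date AS d ON f.ship_date_key = d.ship_date_key
    GROUP BY d.ship_year
"""

# Drill-down goes the other way, e.g. GROUP BY ship_year, ship_quarter,
# ship_month, ship_date to see individual shipping days.
```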
Logical Design: Dimensional Data Modeling
• Snowflake Schema
 - Normalization of dimension tables
 - Each hierarchical level has its own table
Logical Design: Dimensional Data Modeling • Snowflake Schema (diagram)
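For contrast with the star schema sketch, the following shows how the same date dimension might be snowflaked into one table per hierarchy level; table and column names remain illustrative assumptions.

```python
# Snowflake sketch: the date dimension normalized into one table per
# hierarchy level. Names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_year (
        year_key INTEGER PRIMARY KEY,
        year     INTEGER
    );
    CREATE TABLE dim_quarter (
        quarter_key INTEGER PRIMARY KEY,
        quarter     TEXT,
        year_key    INTEGER REFERENCES dim_year(year_key)
    );
    CREATE TABLE dim_month (
        month_key   INTEGER PRIMARY KEY,
        month       TEXT,
        quarter_key INTEGER REFERENCES dim_quarter(quarter_key)
    );
    CREATE TABLE dim_ship_date (
        ship_date_key INTEGER PRIMARY KEY,
        ship_date     TEXT,
        month_key     INTEGER REFERENCES dim_month(month_key)
    );
""")
conn.commit()
# Rolling up to year now needs extra joins
# (fact -> date -> month -> quarter -> year), illustrating the extra joins
# that normalizing dimension tables brings.
```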
Metadata in Data Warehouse
• Administrative metadata: information for setting up and using a DW
• Business metadata: business terms and definitions, ownership of data, and charging policies
• Operational metadata: information collected during the operation of the DW
Data Marts
• Scaled-down DW
• Serves data to the users
• Oriented to a specific business line or team
• Advantages: increased speed; reduced cost, complexity, and risk
• Disadvantages: non-centralized, limited scope
Data Staging
• Data coming from the sources (operational databases, the web, etc.) needs to be changed, converted, and made ready in a format suitable for storage, querying, and analysis.
• Data Extraction
• Data Transformation
• Data Loading
• Together these form the ETL process
Data Staging: Extract
• Extract data from different sources (relational databases, flat files, etc.)
• Each source may have a different data format
• Parse and convert the data for transformation
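As an illustration of the extract step, the sketch below pulls rows from two differently formatted sources, a CSV flat file and a SQLite operational database, into a common in-memory form; the file, table, and field names are hypothetical.

```python
# Extract sketch: read two source formats into a common representation
# (a list of dicts). File and table names are hypothetical.
import csv
import sqlite3

def extract_from_csv(path):
    """Read a flat file; every value arrives as a string."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_from_sqlite(db_path, table):
    """Read a table from an operational database."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    conn.close()
    return [dict(r) for r in rows]

# Usage (hypothetical sources):
# orders = extract_from_csv("orders_export.csv")
# jobs = extract_from_sqlite("operational.db", "jobs")
```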
Data Staging: Transform
• Apply a series of rules to the extracted data to prepare it for loading
• Cleaning (removing duplicates, supplying default values for missing elements, etc.)
• Standardization (data types, synonyms, etc.)
• Combining
• Summarizing
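A minimal sketch of the cleaning and standardization rules applied to the extracted rows; the field names, defaults, and synonym mapping are hypothetical.

```python
# Transform sketch: cleaning (duplicates, defaults) and standardization
# (data types, synonyms). Field names and defaults are hypothetical.

DEFAULTS = {"region": "UNKNOWN", "cost": 0.0}

def clean(rows):
    """Remove exact duplicate rows and fill in defaults for missing values."""
    seen, result = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue                     # duplicate row, drop it
        seen.add(key)
        for field, default in DEFAULTS.items():
            if not row.get(field):
                row[field] = default
        result.append(row)
    return result

def standardize(rows):
    """Force consistent data types and vocabulary (synonyms)."""
    synonyms = {"Jan": "January", "01": "January"}   # hypothetical mapping
    for row in rows:
        row["cost"] = float(row["cost"])
        row["sell"] = float(row["sell"])
        row["ship_month"] = synonyms.get(row["ship_month"], row["ship_month"])
    return rows
```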
Data Staging: Load
• Load the data into the DW
• Additional preprocessing is required:
 - Checking integrity constraints
 - Sorting and aggregation
 - Other computations to build the tables stored in the data warehouse
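A sketch of the load step against the assumed star schema from the earlier blocks: rows that fail a referential-integrity check are skipped, and the rest are bulk-inserted into the fact table.

```python
# Load sketch: check referential integrity against the dimension table,
# then bulk-insert into the fact table of the assumed star schema.
import sqlite3

def load_facts(conn, rows):
    cur = conn.cursor()
    valid_keys = {k for (k,) in cur.execute(
        "SELECT ship_date_key FROM dim_ship_date")}
    facts = []
    for row in rows:
        # Integrity check: every fact must reference an existing dimension row.
        if row["ship_date_key"] not in valid_keys:
            continue            # or route to an error table in a real pipeline
        facts.append((row["ship_date_key"], row["job_number"],
                      row["cost"], row["sell"]))
    cur.executemany(
        "INSERT INTO fact_job (ship_date_key, job_number, cost, sell) "
        "VALUES (?, ?, ?, ?)", facts)
    conn.commit()
```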
Testing Data Warehouses
• Check the quality of the data
• Data completeness: ensures that all the expected data is loaded
• Data transformation: ensures that all data is transformed correctly according to business rules and/or design specifications
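Such checks can be automated. The sketch below shows one completeness check (source vs. warehouse row counts) and one transformation-rule check; the sell >= cost rule is a hypothetical business rule used only for illustration.

```python
# Testing sketch: a completeness check and a transformation-rule check
# against the assumed warehouse tables. The business rule is hypothetical.
import sqlite3

def check_completeness(conn, source_table, target_table):
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    assert src == tgt, f"row count mismatch: {src} in source, {tgt} loaded"

def check_transformation_rule(conn):
    bad = conn.execute(
        "SELECT COUNT(*) FROM fact_job WHERE sell < cost").fetchone()[0]
    assert bad == 0, f"{bad} rows violate the sell >= cost business rule"
```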
Challenges for Testing a Data Warehouse
• Data selection from multiple source systems
• Volume and complexity of the data
• Inconsistent and redundant data in the data warehouse
• Non-availability of a comprehensive test bed
• Criticality of the data for the business
References
• T. Teorey, Database Modeling and Design, Morgan Kaufmann Publishers
• S. Chaudhuri, An Overview of Data Warehousing and OLAP Technology, Microsoft Research
• M. A. Shahzad, Data Warehousing with Oracle
Thank you for listening...