Data Warehouse
Why Data Warehouse and OLAP?
• The Intelligence Stage of decision making needs correct data
• The data should also be clean and organized, with fast access
• Often the data should only be read in subsequent stages and not changed (in other words, decision makers are users of data, not generators of data)
• The data for decision making comes from multiple sources: OLTP databases, XML files, flat files such as CSV, PDF documents, etc.
• These sources need to be combined into one version of the truth
• These needs cannot be easily served by traditional operational databases, hence the concept of the data warehouse / data mart
Business Intelligence Process
[Pyramid diagram: increasing potential to support business decisions as you move up, with a different role at each layer]
• Making Decisions (End User)
• Data Presentation: visualization techniques (Business Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Exploration: statistical analysis, querying and reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: paper, files, information providers, database systems, OLTP (DBA)
Data Warehouse: Definitions
Data warehouse
There is no single, rigorous definition, as the term can encompass many aspects of data management. Some definitions found in the literature:
• A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
• A decision support database that is maintained separately from the organization's operational database
• A consistent database source that brings together information from multiple sources for decision support queries
• A solid platform of consolidated, historical data that supports information processing and analysis
Data warehousing
A process by which data from multiple sources are extracted, transformed, and loaded into a data warehouse in a planned operation
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  The major task of a traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
  Aims at reliable, efficient processing of a large number of transactions and at ensuring data consistency
• OLAP (on-line analytical processing)
  The major task of a data warehouse system
  Data analysis and decision making
  Aims at efficient multidimensional processing of large data volumes: fast, interactive answers to large aggregate queries
• Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries
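The contrast shows up directly in the kinds of statements each workload runs. A minimal sketch using Python's sqlite3 module; the table, columns, and values are invented for illustration:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                    [("East", "2011-01", 100.0),
                     ("East", "2011-02", 150.0),
                     ("West", "2011-01", 80.0)])

    # OLTP style: a short transaction that updates one current record.
    with con:
        con.execute("UPDATE sales SET amount = 110.0 "
                    "WHERE region = 'East' AND month = '2011-01'")

    # OLAP style: a read-only aggregate query over the accumulated history.
    for row in con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)   # ('East', 260.0) and ('West', 80.0)

The OLTP statement touches one current row inside a short transaction; the OLAP query reads the whole history and aggregates it.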
Data Warehousing Characteristics
• Subject oriented
• Integrated
• Time variant (time series)
• Nonvolatile
• Web based
• Relational/multidimensional
• Client/server
• Real-time
• Includes metadata
Data Warehouse Architecture
• Inmon (Bill Inmon) model: the EDW approach
• Kimball (Ralph Kimball) model: the data mart approach
• Which model is best? There is no one-size-fits-all strategy to data warehousing
• One alternative is the hosted warehouse
Categories/Types of Data Warehouses
• Data mart: a departmental data warehouse that stores only relevant data
• Dependent data mart: a subset that is created directly from a data warehouse
• Independent data mart: a small data warehouse designed for a strategic business unit or a department
• Operational data store: a type of database often used as an interim area for a data warehouse, especially for customer information files
• Operational data mart: a small-scale data mart typically used by a single department or functional area in an organization
Various OLAPs
• ROLAP (relational OLAP): the analysis cube is supported by a multidimensional data warehouse (or data mart) built on a relational database
• MOLAP (multidimensional OLAP): the analysis cube is supported by a data warehouse built on a native multidimensional data store
• HOLAP (hybrid OLAP): a combination of the two, chosen to optimize performance
Your task: search the web to find the relative advantages and disadvantages of each.
Data Warehousing Architectures
Issues to consider when deciding which architecture to use:
• Which database management system (DBMS) should be used?
• Will parallel processing and/or partitioning be used?
• Will data migration tools be used to load the data warehouse?
• What tools will be used to support data retrieval and analysis?
Data Warehouse Development and Implementation
Data Warehouse and BI Life Cycle
[Diagram: structured data (OLTP) and unstructured data flow into the enterprise data warehouse, which feeds business intelligence: reporting, ad-hoc query, and analysis]
Data Warehouse and BI Life Cycle: Phase I
[Diagram: source data is extracted, transformed, and loaded (data integration) through an operational data store (ODS) into the enterprise data warehouse, which supplies reporting tools. Cycle: Develop, Discover, Deliver]
Data Warehouse and BI Life Cycle: Phase II
[Diagram: the enterprise data warehouse feeds departmental data marts and subject-wise multidimensional cubes, which support business intelligence. Cycle: Develop, Discover, Deliver]
Data Warehouse OLAP Cube
[Figure: a data warehouse feeding an OLAP cube; one highlighted cell holds total sales for Customer 3, Product C, for July]
From: http://oraclezine.blogspot.com/2009/01/data-warehousing-and-olap-cube.html, accessed January 20, 2011
Data Integration and the Extraction, Transformation, and Load (ETL) Process
• Extraction: reading data from one or more source databases or files
• Transformation: converting the extracted data from its previous form into the form it needs to take so that it can be placed into a data warehouse (or simply another database)
• Load: putting the data into the data warehouse
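The three steps compose naturally into a pipeline. A minimal sketch in Python; the file name, column names, and target table are hypothetical, and a real ETL job would add logging, error handling, and incremental loads:

    import csv
    import sqlite3

    def extract(path):
        # Extraction: read the source rows (here, a CSV file).
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transformation: standardize codes and convert types.
        for row in rows:
            yield {
                "region": row["region"].strip().upper(),
                "amount": round(float(row["amount"]), 2),
            }

    def load(rows, con):
        # Load: write the conformed rows into the warehouse table.
        con.executemany(
            "INSERT INTO sales_fact (region, amount) VALUES (:region, :amount)",
            rows)
        con.commit()

    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, amount REAL)")
    load(transform(extract("sales_2011.csv")), con)   # hypothetical input file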
Data Warehouse Schema: Dimensional Modeling
• The star schema
  Dimension tables contain the dimensions for analysis. Example: time, region, salesperson, etc.
  Fact tables contain the measures and aggregates. Example: average sales, total commission, total sales, etc.
• The snowflake schema
  Very similar to the star schema, with a central fact table, but the dimensions are organized hierarchically. Example: a listing agent is part of a listing company; one city can have multiple zip codes; etc.
  Reduces redundant data, but can be inefficient for queries that do not follow the hierarchy. (See the sketch below.)
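To make the difference concrete, here is the same listing-agent dimension modeled both ways, sketched as SQL DDL in Python strings; the table and column names are illustrative, not from an actual dataset:

    import sqlite3

    star_dimension = """
    -- Star: one denormalized dimension table; company attributes repeat.
    CREATE TABLE agent_dim (
        agent_key     INTEGER PRIMARY KEY,   -- surrogate key
        agent_name    TEXT,
        company_name  TEXT,                  -- repeated for every agent
        company_city  TEXT
    );
    """

    snowflake_dimension = """
    -- Snowflake: the hierarchy is split out, removing the repetition
    -- at the cost of an extra join in every query.
    CREATE TABLE company_dim (
        company_key   INTEGER PRIMARY KEY,
        company_name  TEXT,
        company_city  TEXT
    );
    CREATE TABLE agent_dim (
        agent_key     INTEGER PRIMARY KEY,
        agent_name    TEXT,
        company_key   INTEGER REFERENCES company_dim (company_key)
    );
    """

    con = sqlite3.connect(":memory:")
    con.executescript(snowflake_dimension)   # either script runs as-is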
Steps for DW OLAP Design
1. Decide on your information needs
   Example: I want to know total sales by month, by region, and by salesperson
2. Decide the sources of data for each information need
   Total sales is not available; it must be calculated from unit price and quantity sold
   Most of the unit and quantity data are in OLTP, but some are in CSV files for stores not connected to the database
   The monthly information can be derived from the date, which is in OLTP
   The region information is also in OLTP transaction records, via a region code
   The salesperson information is maintained in an Excel file and is coded by region
3. Declare the grain of the fact table (preferably at the most atomic level)
   The grain is the lowest level of information you are interested in. The finer the grain (tending toward atomic), the more dimensions apply. An atomic grain is preferred because it allows easy roll-ups (see the sketch below), but it can take a lot of space and processing. Often the grain of a fact table is much coarser, with the ability to drill down.
   Examples of grains: sales for each customer, for each day, for each store; monthly total sales for a particular region for all female customers
4. Add dimensions for "everything you know" about this grain
5. Add numeric measured facts true to the grain
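The roll-up advantage of an atomic grain can be seen in a few lines of Python. The facts below are kept at the first example grain (customer, day, store) and aggregated on demand to the second (monthly totals per region); the rows are made-up sample data:

    from collections import defaultdict

    atomic_facts = [
        # (customer, date,        store, region, amount)
        ("ALFI",    "2011-01-05", "S01", "East", 120.0),
        ("VINET",   "2011-01-17", "S01", "East",  80.0),
        ("HANAR",   "2011-02-03", "S02", "West",  95.0),
    ]

    # Roll up to the coarser grain: monthly total sales per region.
    monthly_by_region = defaultdict(float)
    for customer, date, store, region, amount in atomic_facts:
        month = date[:7]                     # "2011-01"
        monthly_by_region[(region, month)] += amount

    print(dict(monthly_by_region))
    # {('East', '2011-01'): 200.0, ('West', '2011-02'): 95.0}

Had the facts been stored at the monthly grain instead, drilling back down to individual customers or days would be impossible.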
Creating a Star* Schema
• Identify the dimensions (typically the "analyze by" attributes). In our real estate listing example: the city name, bedrooms, listing agent, etc.
• Identify the measures. In our case: the average price per square foot, the total number of houses in a city, the average price, etc.
• Identify the attributes of each dimension. Attributes are the properties of a dimension. For example, if the listing agent is a dimension, then the first name, last name, phone number, etc. of a listing agent are its attributes.
* The snowflake schema design is very similar, with the hierarchy of each dimension separated into another table.
Star Schema, continued
• Create the dimension tables with a surrogate primary key (PK). Include all the necessary attributes.
• Decide on the measures and calculations. Those go in the fact table.
• For each PK in the dimension tables, create a foreign key (FK) in the fact table.
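Putting the three steps together, a minimal star schema sketch for the real-estate listing example, using Python's sqlite3; the names and attributes are illustrative:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")   # enforce the FK constraints on load
    con.executescript("""
    CREATE TABLE city_dim (
        city_key   INTEGER PRIMARY KEY,   -- surrogate key, not the city name
        city_name  TEXT NOT NULL,
        state      TEXT
    );
    CREATE TABLE agent_dim (
        agent_key  INTEGER PRIMARY KEY,   -- surrogate key
        first_name TEXT,
        last_name  TEXT,
        phone      TEXT
    );
    CREATE TABLE listing_fact (
        city_key   INTEGER NOT NULL REFERENCES city_dim (city_key),
        agent_key  INTEGER NOT NULL REFERENCES agent_dim (agent_key),
        list_price REAL,                  -- measures live in the fact table
        sq_ft      REAL
    );
    """)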
Implementing the Star Schema
1. Extract data from multiple sources
2. Integrate, transform, and restructure data
3. Load data into dimension tables and fact tables
The Star Schema Data Load
[Diagram: heterogeneous data sources (Northwind OLTP, internal files, external files) are extracted by DTS into a staging area; DTS transforms the data and then loads the Sales, Financial, and Inventory stars in the data warehouse]
Verifying the Dimension Source Data
• Verifying accuracy of source data: integrating data from multiple sources; applying business rules; checking structural requirements
• Correcting invalid data: transforming data; reassigning data values
• Managing invalid data: rejecting invalid data; saving invalid data to a log
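A sketch of this verify / correct / manage pattern in Python; the business rule, field names, and log file are invented for the example:

    valid_regions = {"EAST", "WEST", "NORTH", "SOUTH"}

    def clean(rows):
        """Split rows into conformed rows and rejects to be logged."""
        good, rejected = [], []
        for row in rows:
            region = str(row.get("region", "")).strip().upper()  # correct casing
            if region in valid_regions and row.get("amount") is not None:
                row["region"] = region        # reassign the corrected value
                good.append(row)
            else:
                rejected.append(row)          # manage: reject invalid data
        return good, rejected

    good, rejected = clean([{"region": " east ", "amount": 10.0},
                            {"region": "NOWHERE", "amount": 5.0}])
    with open("rejected_rows.log", "a") as log:  # save invalid data to a log
        for row in rejected:
            log.write(repr(row) + "\n")

Rows that can be corrected (here, region codes in the wrong case) are fixed and kept; the rest are logged rather than silently dropped.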
Dimension Data Load Examples
[Example: source files deliver buyer records in inconsistent formats, e.g. buyer_name "Barr, Adam" with reg_id 2, "Chai, Sean" with reg_id 4, or with the region coded as Roman numerals II and IV. DTS transforms these into one conformed dimension with buyer_code, buyer_first, buyer_last, and a standardized reg_id]
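A sketch of the two transformations the example illustrates, in Python; the Roman-numeral mapping is assumed from the sample values on the slide:

    ROMAN = {"II": 2, "IV": 4, "VI": 6}   # assumed code mapping

    def conform(buyer_name, reg_id):
        # Split "Last, First" into separate attributes.
        last, first = [part.strip() for part in buyer_name.split(",", 1)]
        # Map Roman-numeral region codes to the standard numeric codes.
        region = ROMAN.get(str(reg_id).strip())
        if region is None:
            region = int(reg_id)          # already numeric
        return {"buyer_first": first, "buyer_last": last, "reg_id": region}

    print(conform("Barr, Adam", "II"))
    # {'buyer_first': 'Adam', 'buyer_last': 'Barr', 'reg_id': 2}
    print(conform("Chai, Sean", 4))
    # {'buyer_first': 'Sean', 'buyer_last': 'Chai', 'reg_id': 4}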
Maintaining Integrity of the Dimension
• Assigning a surrogate key to each record: defines the dimension's primary key; relates to the foreign key fields of the fact table
• Loading one record per application key: maintains uniqueness in the dimension; depends on how you manage changing dimension data; maintains the integrity of the fact table
Managing Changing Dimension Data
• One of the problems in a DW is dealing with changed data on reload and refresh. (OLTP is operational and thus does not face this problem.) These change areas are called slowly changing dimensions (SCDs).
• Dimensions with changing column values: inserts of new data; updates of existing data
• Slowly changing dimension design solutions:
  Type 1: overwrite the dimension record
  Type 2: write another dimension record
  Type 3: add attributes to the dimension record
Examining the Star Schema
[Diagram: the sales star schema, with the central fact table linked to the surrounding dimension tables]
Type 1: Overwriting the Dimension Record
[Example, product dimension: before, product key 001 is "Rice Puffs", size 10 oz., Bag, Grocery / Dry Goods / Snacks. After, the same record reads 12 oz.: the existing record is changed in place and the old value is lost]
Type 2: Writing Another Dimension Record
[Example, product dimension: before, product key 001 is "Rice Puffs", 10 oz., effective 05-01-1995. After, record 001 is kept unchanged and a new record with surrogate key 731 is added for "Rice Puffs", 12 oz., effective 10-15-1998]
Type 3: Adding Attributes in the Dimension Record
[Example, product dimension with columns for the current product size and date, the previous size and date, and the 2nd-previous size and date. After the change, record 001 holds current size 12 oz. (10-15-1998), previous size 10 oz. (05-01-1995), and 2nd-previous size 11 oz. (03-20-1994): the additional history is stored in the existing record]
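The three strategies can be compared side by side in a few lines of Python; the dimension is modeled as a list of dicts, with keys and values following the slide example:

    dim = [{"product_key": 1, "name": "Rice Puffs", "size": "10 oz.",
            "effective_date": "1995-05-01"}]

    def scd_type1(dim, name, new_size):
        # Type 1: overwrite in place; history is lost.
        for row in dim:
            if row["name"] == name:
                row["size"] = new_size

    def scd_type2(dim, name, new_size, date):
        # Type 2: keep the old row, add a new row with a new surrogate key.
        new_key = max(r["product_key"] for r in dim) + 1
        dim.append({"product_key": new_key, "name": name, "size": new_size,
                    "effective_date": date})

    def scd_type3(dim, name, new_size, date):
        # Type 3: add columns that preserve a limited amount of history.
        for row in dim:
            if row["name"] == name:
                row["previous_size"] = row["size"]
                row["size"], row["size_date"] = new_size, date

    scd_type2(dim, "Rice Puffs", "12 oz.", "1998-10-15")
    print(len(dim))   # 2 -- both versions of the product are retained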
Verifying the Fact Table Source Data
• Verifying accuracy of source data: integrating data from multiple sources; applying business rules; checking structural requirements; creating calculated fields
• Correcting invalid data: transforming data; reassigning data values
• Managing invalid data: rejecting invalid data; saving invalid data to a log
Assigning Foreign Keys
[Example: a source row (customer_id ALFI, product_id 123, order date 1/1/2000, quantity_sales 400, amount_sales 10,789) is matched against the dimension tables: customer_dim maps ALFI to cust_key 201, product_dim maps 123 to prod_key 25, and time_dim maps 1/1/2000 to time_key 134. The resulting fact row stores only the surrogate keys and the measures]
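A sketch of this lookup step in Python; the key values mirror the slide's example, and the dimension tables are reduced to plain lookup dicts:

    customer_dim = {"ALFI": 201}      # application key -> surrogate key
    product_dim  = {123: 25}
    time_dim     = {"1/1/2000": 134}

    source_row = {"customer_id": "ALFI", "product_id": 123,
                  "order_date": "1/1/2000", "quantity": 400, "amount": 10789}

    fact_row = {
        "cust_key": customer_dim[source_row["customer_id"]],
        "prod_key": product_dim[source_row["product_id"]],
        "time_key": time_dim[source_row["order_date"]],
        "quantity_sales": source_row["quantity"],
        "amount_sales": source_row["amount"],   # measures copied through
    }
    print(fact_row)
    # {'cust_key': 201, 'prod_key': 25, 'time_key': 134,
    #  'quantity_sales': 400, 'amount_sales': 10789}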
Defining Measures
[Example: some measures are loaded directly from the source system (e.g., qty), while others are calculated during the load (e.g., total_sales = price × qty: VINET at 0.55 × 32 = 17.60, ALFI at 1.10 × 48 = 52.80, HANAR at 0.98 × 9 = 8.82)]
• Loading measures from the source system
• Calculating additional measures
Maintaining Data Integrity
• Adhering to the fact table grain: a fact table can only have one grain; you must load a fact table with data at the same level of detail as defined by the grain
• Enforcing column constraints: NOT NULL constraints; FOREIGN KEY constraints
Implementing Staging Tables
[Diagram: a staging area holding market_stage, sales_stage, shipments_stage, and inventory_stage tables]
Staging tables serve to:
• Centralize and integrate source data
• Break up complex data transformations
• Facilitate error recovery
DTS Functionality
• Accessing heterogeneous data sources
• Importing, exporting, and transforming data
• Creating reusable transformations and functions
• Automating data loads
• Managing metadata
• Customizing and extending functionality
Defining DTS Packages
A DTS package:
• Identifies data sources and destinations
• Defines tasks or actions
• Implements transformation logic
• Defines the order of operations
Identifying Package Components
• Connections: access data sources and destinations
• Tasks: describe data transformations or functions
• Steps: define the order of task operations, or workflow
• Global variables: store data that can be shared across tasks
Creating Packages
• Using the DTS Import/Export Wizard
  Perform ad-hoc table and data transfers
  Develop a prototype package
• Using DTS Package Designer
  Edit packages created with the DTS Import/Export Wizard
  Create packages with a wide range of functionality
• Programming DTS applications
  Directly access the functionality of the DTS object model
  Requires Microsoft Visual Basic or Microsoft Visual C++
Using DTS to Populate the Sales Star
• Populating the Sales star dimensions
• Populating the Sales star fact table
Populating the Sales Star Dimensions
[Diagram: DTS loads product_dim from tab-delimited product files, customer_dim from the Northwind OLTP database, and time_dim from a SQL Server stored procedure]
Populating the Sales Star Fact Table
[Diagram: DTS loads the sales data file into the sales_stage staging table, then joins it with product_dim, customer_dim, and time_dim to load sales_fact]
Why Not Just Excel Pivot Tables?
• For small projects, pivot tables are excellent
• However, if you insert or delete records, you have to refresh the pivot table and may have to change the table reference (though the Table feature makes this easy)
• A calculation cannot be defined once for one field and be reflected in all records; you have to copy it to every record
• No pre-processing can be done: the pivot table is refreshed and recalculated every time, which can lead to inefficiencies
• There is no built-in integration mechanism with a data source (it can be done by connecting to cube servers and ODBC-compliant databases, but that is an extra step and extra work)
• Selective drill-down is difficult; it is mainly all or none (grouping can help, but you have to be skilled with pivot tables)