650 likes | 853 Views
The Data Warehouse Environment. Data Warehouse Usage. Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing and Interactive Analysis
E N D
Data Warehouse Usage • Three kinds of data warehouse applications • Information processing • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs • Analytical processing and Interactive Analysis • multidimensional analysis of data warehouse data • supports basic OLAP operations, slice-dice, drilling, pivoting • Data mining • knowledge discovery from hidden patterns • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
Why Separate Data Warehouse? • Performance • Op dbs designed & tuned for known OLTP uses & workloads. • Complex OLAP queries would degrade performance. • Special data organization, access & implementation methods needed for multidimensional views & queries. • Function • Missing data: Decision support requires historical data, which op dbs do not typically maintain. • Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. • Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
What are Operational Systems? • They are OLTP systems • Run mission critical applications • Need to work with stringent performance requirements for routine tasks • Run the business in real time • Based on up-to-the-second data • Optimized to handle large numbers of simple read/write transactions • Optimized for fast response to predefined transactions • Used by people who deal with customers, products -- clerks, salespeople etc. • They are increasingly used by customers
RDBMS used for OLTP • Database Systems have been used traditionally for OLTP • clerical data processing tasks • detailed, up to date data • structured repetitive tasks • read/update a few records
Data Industry Usage Technology Volumes Customer All Track Legacy application, flat Small-medium File Customer files, main frames Details Account Finance Control Legacy applications, Large Balance account hierarchical databases, activities mainframe Point-of- Retail Generate ERP, Client/Server, Very Large Sale data bills, manage relational databases stock Call Telecommu- Billing Legacy application, Very Large Record nications hierarchical database, mainframe Production Manufactu- Control ERP, Medium Record ring Production relational databases, AS/400 Examples of Operational Data
OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse • Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) • e.g., average amount spent on phone calls between 9AM-5PM in Charlotte during the month of December
OLTP Application Oriented Used to run business Detailed data Current up to date Isolated Data Repetitive access Clerical User Warehouse (DSS) Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Ad-hoc access Knowledge User (Manager) OLTP vs Data Warehouse
OLTP Performance Sensitive Few Records accessed at a time (tens) Read/Update Access No data redundancy Database Size 100MB -100 GB Data Warehouse Performance relaxed Large volumes accessed at a time(millions) Mostly Read (Batch Update) Redundancy present Database Size 100 GB - few terabytes OLTP vs Data Warehouse
OLTP Transaction throughput is the performance metric Thousands of users Managed in entirety Data Warehouse Query throughput is the performance metric Hundreds of users Managed by subsets OLTP vs Data Warehouse
OLTP Systems are used to “run” a business The Data Warehouse helps to “optimize” the business To summarize ...
Why Now? • Data is being produced • ERP provides clean data • The computing power is available • The computing power is affordable • The competitive pressures are strong • Commercial products are available
Subject Orientation • DW is organized by major subject areas and entities of the business organization • Data warehouse model aligns with the corporate logical data model • Example of major subject areas for Insurance • Customer • Product • Transaction activity • Claim • Policy • Account
Application-Orientation Subject-Orientation Operational Database Data Warehouse Credit Card Customer Loans Vendor Product Trust Savings Activity Application-Orientation vs. Subject-Orientation
Integrated data • There is not application consistency in the operational data • As data from different systems is entered into the DW, entities and attributes are encoded using a consistent key or measurement
Time • Data warehouse is nothing more than a sophisticated series of snapshots, taken at one moment in time • The key structure of the DW always contains some element of time
RelationalDatabases ExtractionCleansing Optimized Loader ERP Systems Data Warehouse Engine AnalyzeQuery Purchased Data LegacyData Metadata Repository Data Warehouse Architecture
Components of the Warehouse • Data Extraction and Loading • The Warehouse • Analyze and Query -- OLAP Tools • Metadata • Data Mining tools
Loading the Warehouse Extracting, cleaning/transforming the data before it is loaded
Data Extraction and Cleansing • Extract data from existing operational and legacy data • Issues: • Sources of data for the warehouse • Data quality at the sources • Merging different data sources • Data transformation • How to propagate updates (on the sources) to the warehouse • Terabytes of data to be loaded
Source Data Operational/ Source Data Sequential • Typically host based, legacy applications • Customized applications, COBOL, 3GL, 4GL • Point of Contact Devices • POS, ATM, Call switches • External Sources • Need to go through ETL – Extract, Transform, Load Legacy Relational External
Data Quality - The Reality • Legacy systems not well documented – if at all • Outside sources with questionable quality procedures • Production systems with no built-in integrity checks and no integration • Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan • “And get it done quickly, we do not have time to worry about corporate standards...”
Data Integration Across Sources Trust Credit card Savings Loans Same data different name Different data Same name Data found here nowhere else Different keys same data
Data Integrity Problems • Same person, different spellings • Different account numbers generated by different applications for the same customer • Required fields left blank • Invalid product codes collected at point of sale • manual entry leads to mistakes • “in case of a problem use 9999999”
Data Quality - The Reality • Tempting to think that creating a data warehouse is simply extracting operational data and entering into a data warehouse • Nothing could be farther from the truth • Warehouse data comes from disparate questionable sources
Scrubbing Data • Sophisticated transformation tools. • Used for cleaning the quality of data • Clean data is vital for the success of the warehouse
Loads • After extracting, scrubbing, cleaning, validating etc. need to load the data into the warehouse • Issues • huge volumes of data to be loaded • small time window available when warehouse can be taken off line (usually nights) • when to build index and summary tables • allow system administrators to monitor, cancel, resume, change load rates • recover gracefully -- restart after failure from where you were and without loss of data integrity
Data -- Heart of the Data Warehouse • Heart of the data warehouse is the data itself! • Single version of the truth • Corporate memory • Data is organized in a way that represents business -- subject orientation
Data Warehouse Structure • Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables
Data Warehouse Design Issues • Major DW design issues: • Granularity and Partitioning • Determining the level of Granularity is a major design issue in the DW environment • Granularity profoundly affects the volume of data that resides in the DW and the type of query that can be answered
Granularity • Granularity refers to the level of detail or summarization held in the units of data in the DW • Lower level of Granularity ---> more detail - transaction data • High level of Granularity ----> less detail - summarized data
Data Granularity in Warehouse • High Level of Granularity • reduces storage costs • reduces CPU usage • increases performance since smaller number of records to be processed • design around traditional high level reporting needs • tradeoff with volume of data to be stored and detailed usage of data
Granularity in Warehouse • But … • Can not answer some questions with summarized data • Did John call Jane last month? • Not possible to answer if only the total duration of calls by John over a month is maintained and individual call details are not. • Detailed data too voluminous
Granularity and Data Analysis LOW LEVEL OF GRANULARITY HIGH LEVEL OF GRANULARITY LOW LEVEL OF DETAIL HIGH LEVEL OF DETAIL The details of every phone call made by a customer for a month The summary of phone calls made by a customer for a month 40,000 bytes per month 200 records per month 200 bytes 1 record per month The level of granularity is determined by what questions/queries can be answered and what resources are required to answer a question
Granularity and Data Analysis LOW LEVEL OF GRANULARITY HIGH LEVEL OF GRANULARITY HIGH LEVEL OF DETAIL LOW LEVEL OF DETAIL The details of every phone call made by a customer for a month The summary of phone calls made by a customer for a month 40,000 bytes per month 200 records per month 200 bytes 1 record per month Did John call Mary in Boston last week? 01 activityrecord. 02 month 02 cumcalls 02 avglength …... 01 activityrecord. 02 date 02 time 02 to whom 02 city …... YES UNKNOWN TRADEOFF
Granularity in Warehouse • How about having a dual level of granularity? • Store summary data on disks • 95% of DSS processing done against this data • Store detail on tapes • 5% of DSS processing against this data
Dual Levels of Granularity • The dual level of granularity design option should be the default for almost every organization that is building a warehouse. Low level of detail High level of detail Flexibility- small volumes of data Easy to manipulated Answer to any question Large volumes of data More DASD and resources
Partitioning • All currenty detail data in the DW is partitioned. • Data is partitioned when data of a like structure is divided into more than one physical unit data. • Separate small physical untis can be handled independently. • Partition of the DW data can be done at the system or at the application level.
Partitioning • Small units of data can be: • restructured • indexed • sequentially scanned, if needed • reorganized • recovered • monitored
Partitioning • All current detail DW data will be partitioned in order to provide flexible access to data • Choices for partitioning data are strictly up to the developer. However, TIME is always a mandatory criterion for partitioning • by date • by line of business • by geography • etc.
Structuring Data in the DW • Simple Cumulative Structure • Rolling Summary Data • Simple Direct • Continuous
Purging Warehouse Data • There are several ways in which data is purged or the detail of data is transformed: • Data is added to a rolling summary file where detail is lost. • Data is transferred to a bulk medium from a high-performance medium such as DASD. • Data is actually purged
Data Warehouse vs. Data Marts What comes first
Information Less IndividuallyStructured History Normalized Detailed DepartmentallyStructured Data Warehouse OrganizationallyStructured More Data From the Data Warehouse to Data Marts
Data Warehouse and Data Marts OLAP Data Mart Lightly summarized Departmentally structured Organizationally structured Atomic Detailed Data Warehouse Data
Data Mart Centric Data Sources Data Marts Data Warehouse
Problems with Data Mart Centric Solution If you end up creating multiple warehouses, integrating them is a problem
True Warehouse Data Sources Data Warehouse Data Marts