520 likes | 624 Views
DATAWAREHOUSE. Introduction. Information. Data. What is Data Warehousing?. A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research]. What are the users saying.
E N D
DATAWAREHOUSE Introduction www.notesvillage.com
Information Data What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research] www.notesvillage.com
What are the users saying... • Data should be integrated across the enterprise • Summary data has a real value to the organization • Historical data holds the key to understanding data over time • What-if capabilities are required www.notesvillage.com
What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin] www.notesvillage.com
Data, Data everywhereyet ... • I can’t find the data I need • data is scattered over the network • many versions, subtle differences • I can’t get the data I need • need an expert to get the data • I can’t understand the data I found • available data poorly documented • I can’t use the data I found • results are unexpected • data needs to be transformed from one form to other www.notesvillage.com
Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom--otions have the biggest impact on revenue? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? A producer wants to know…. www.notesvillage.com
Necessity • Need for Strategic Information • Retain consumer base • Increase customer base by 20% • Enhance consumer satisfaction level • Launch two new products • Information Crisis • How to maintain lots of data in the organization • IT resources and strategy is not effective for data • Opportunities and Risks: • Quick decisions • Managerial analysis • Crisis management www.notesvillage.com
History of DSS • Ad hoc reports • Small applications • Information centers • Decision support system www.notesvillage.com
Disadvantages • Many adhoc requests . • Supplementary report problems • No interactivity of the user • Flexible informations. www.notesvillage.com
OPERATIONAL VS INFORMATIONAL • Current Values • Optimized for transaction • High frequency • Read Update Delete • Large user • Response time is very fast • Archived value • Optimized for complex queries • Low • Read • Adhoc,Random • Relatively more several seconds www.notesvillage.com
Scenerio 1 • ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Banglore. The Sales Manager wants quarterly sales report. Each branch has a separate operational system. www.notesvillage.com
Scenario 1 : ABC Pvt Ltd. Mumbai Delhi Sales per item type per branch for first quarter. Sales Manager Chennai Banglore www.notesvillage.com
Solution 1:ABC Pvt Ltd • Extract sales information from each database. • Store the information in a common repository at a single site. www.notesvillage.com
Solution 1:ABC Pvt Ltd . Mumbai Data Warehouse Report Delhi Query & Analysis tools Sales Manager Chennai Banglore www.notesvillage.com
Scenario 2 : One Stop Shopping Data Entry Operator Operational Database Report Wait Management Data Entry Operator www.notesvillage.com
Solution 2 • Extract data needed for analysis from operational database. • Store it in warehouse. • Refresh warehouse at regular interval so that it contains up to date information for analysis. • Warehouse will contain data with historical perspective. www.notesvillage.com
Solution 2 Data Entry Operator Report Operational database Data Warehouse Extract data Manager Transaction Data Entry Operator www.notesvillage.com
Scenario 3 Cakes & Cookies is a small,new company.President of the company wants his company should grow.He needs information so that he can make correct decisions. www.notesvillage.com
Solution 3 • Improve the quality of data before loading it into the warehouse. • Perform data cleaning and transformation before loading the data. • Use query analysis tools to support adhoc queries. www.notesvillage.com
Solution 3 Expansion sales Data Warehouse Query and Analysis tool President time Improvement www.notesvillage.com
Definitions of a Data Warehouse “A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process” 1. - W.H. Inmon “A copy of transaction data, specifically structured for query and analysis” 2. - Ralph Kimball www.notesvillage.com
Data Warehouse • For organizational learning to take place, data from many sources must be gathered together and organized in a consistent and useful way – hence, Data Warehousing (DW) • DW allows an organization (enterprise) to remember what it has noticed about its data • Data Mining techniques make use of the data in a DW www.notesvillage.com
What is Data Warehouse?? • Inmons’s definition A data warehouse is -subject-oriented, -integrated, -time-variant, -nonvolatile collection of data in support of management’s decision making process www.notesvillage.com
Subject-oriented • Data warehouse is organized around subjects such as sales, product, customer. • It focuses on modeling and analysis of data for decision makers. • Excludes data not useful in decision support process. www.notesvillage.com
Integrated Data • Integration • Data Warehouse is constructed by integrating multiple heterogeneous sources. • Data Preprocessing are applied to ensure consistency. RDBMS Data Warehouse Legacy System www.notesvillage.com Flat File
Integration • In terms of data. • encoding structures. • Measurement of • attributes. • physical attribute. • of data • naming conventions. • Data type format www.notesvillage.com
Time-variant • Provides information from historical perspective e.g. past 5-10 years • Data is stored as snapshots • Every key structure contains either implicitly or explicitly an element of time • Allows analysis for the past • Relates information to the present • Enables forecasts for the future www.notesvillage.com
Nonvolatile • Data once recorded cannot be updated. • Data warehouse requires two operations in data accessing • Initial loading of data • Access of data www.notesvillage.com
Data Granularity • Operational system is kept at the lowest level • Data is summarized at different levels • The lower the level the more data granularity. www.notesvillage.com
Operational v/s Information System www.notesvillage.com
Data Mart • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse. • A Data Mart typically reflects the business rules of a specific business unit within an enterprise. www.notesvillage.com
Data Mart Data Mart Data Mart Data Warehouse to Data Mart Decision Support Information Data Warehouse Decision Support Information Decision Support Information www.notesvillage.com
Meta-data Repository Layer Warehouse Management Layer Data Marts vs Data Warehouses Internet/Intranet Layer 11 direct queries virtual queries Operational Data Layer ad hoc queries Virtual DW Core DW Layer Data Feed/Data Mining/Indexing Layer Coarse DW Presentation/ Desktop Access Layer Data Staging and Quality Layer Data Access Layer Central DW 2a Data Mart Layer External Data Layer Distributed DW 3 2b Non-operational Data Layer 1 4 7 2c 6 5 8 9 10 Application Messaging (Transport) Layer www.notesvillage.com
Operational vs. InformationalSystems Operational Systems Information Delivery System Data Warehouse Data Marts Informational Systems www.notesvillage.com
Virtual Date Warehouse • A Virtual Data Warehouse approach is often chosen when there are infrequent demands for data and management wants to determine if/how users will use operational data. • One of the weaknesses of a Virtual Data Warehouse approach is that user queries a made against operational DBs. • One way to minimize this problem is to build a “Query Monitor” to check the performance characteristics of a query before executing it. www.notesvillage.com
A Coarse Data Warehouse is often chosen when the organization has a relatively clean/new operational system and management wants to make the operational data more easily available for just that system. • A Central Data Warehouse is often chosen when the organization has a clear understanding about it Information Access needs and wants to provide “quality”, “integrated” , information to its knowledge workers • A Distributed Data Warehouse is similar in most respects to a Central Data Warehouse, except that the data is distributed to separate mini-Data Warehouses (Data Marts )on local or specialized servers www.notesvillage.com
Meta-data Repository Layer Warehouse Management Layer Central Data Warehouse Internet/Intranet Layer 11 direct queries virtual queries Operational Data Layer ad hoc queries Virtual DW Core DW Layer Data Feed/Data Mining/Indexing Layer Coarse DW Presentation/ Desktop Access Layer Data Staging and Quality Layer Data Access Layer Central DW 2a Data Mart Layer External Data Layer Distributed DW 3 2b Non-operational Data Layer 1 4 7 2c 6 5 8 9 10 Application Messaging (Transport) Layer www.notesvillage.com
Top-down approach • Single, Central storage of data about the content • Centralized Rule and Content • May seek quick result if implemented with Iteration • Needs high level cross functional skills • High outlay with proof of concepts • High exposure to risk/Failures www.notesvillage.com
Data Warehouse Architectures • Generic Two-Level Architecture • Independent Data Mart • Dependent Data Mart and Operational Data Store • Logical Data Mart and Real-Time Data Warehouse • Three-Layer architecture All involve some form of extraction, transformation and loading (ETL) www.notesvillage.com
Process Architecture • These interrelated sub-processes are sometimes referred to as an “ETL” process. • 1)Extract- Since data for the data warehouse can come from different sources and may be of different types, the plan to extract the data along with appropriate compression and encryption techniques is an important requirement for consideration. • 2)Transform- Transformation of data with appropriate conversion, aggregation and cleaning besides de-normalization and surrogate key management is also an important process to be planned for building a data warehouse. • 3)Load- Steps to be considered to load data with optimization by considering the multiple areas where the data is targeted to be loaded and retrieved is also an important part of the data warehouse architecture plan www.notesvillage.com
Data Model Architecture • In Data Model Architecture (also known as Dimensional Data Model), there are 3 main data modeling styles for enterprise warehouses: • 3rd Normal Form - Top Down Architecture, Top Down Implementation • Federated Star Schemas - Bottom Up Architecture, Bottom Up Implementation • Data Vault - Top Down Architecture, Bottom Up Implementation . www.notesvillage.com
Technology Architecture • Scalability and flexibility is required in all facets. The extent of these features are largely depend upon organizational size, business requirements, nature of business etc. • Technology or Technical architecture primary evolved from derivations from the process architecture, meta data management requirements based on business rules and security levels implementations and technology tool specific evaluation. • Besides these, the Technology architecture also looks into the various technology implementation standards in database management, database connectivity protocols (ODBC, JDBC, OLE DB etc), Middleware (based on ORB, RMI, COM/DOM etc.), Network protocols (DNS, LDAP etc) and other related technologies. www.notesvillage.com
Information Architecture • Information Architecture is the process of translating the information from one form to another in a step by step sequence so as to manage the storage, retrieval, modification and deletion of the data in the data warehouse. www.notesvillage.com
Resource Architecture • Resource architecture is related to software architecture in that many resources come from software resources. Resources are important because they help determine performance. • Workload is the other part of the equation. If you have enough resources to complete the workload in the right amount of time, then performance will be high. www.notesvillage.com
Figure 11-2: Generic two-level data warehousing architecture L One, company-wide warehouse T E Periodic extraction data is not completely current in warehouse www.notesvillage.com
Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality Figure 11-10: Steps in data reconciliation (cont.) Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data www.notesvillage.com
Data marts: Mini-warehouses, limited in scope L T E Separate ETL for each independent data mart Data access complexity due to multiple data marts Figure 11-3 Independent data mart data warehousing architecture www.notesvillage.com
Figure 11-4 Dependent data mart with operational data store:a three-level architecture ODS provides option for obtaining current data L T E Simpler data access Single ETL for enterprise data warehouse (EDW) Dependent data marts loaded from EDW www.notesvillage.com