490 likes | 505 Views
Learn the fundamentals of data warehousing, including its definition, benefits, and characteristics. Understand the data management challenges in large enterprises and explore the query-driven and warehousing approaches. Discover the ETL process and its challenges in building a data warehouse.
E N D
Intro. to Data Warehouse รศ.ดร. วรพจน์ กรีสุระเดช WorapojKreesuradej, Ph.D. Associate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Web: www.it.kmitl.ac.th/dme Email: worapoj@it.kmitl.ac.th
Book • PaulrajPonniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001. • Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002.
Definition of DW • “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon • A decision support database that is maintained separately from the organization’s operational databases. • A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc.
R. Kimball’s definition of a DW • A data warehouse is a copy of transactional data specifically structured for querying and analysis.
Problem: Data Management in Large Enterprises • Vertical fragmentation of informational systems • Result of application (user)-driven development of operational systems Sales Planning Suppliers Num. Control Stock Mngmt Debt Mngmt Inventory ... ... ... Sales Administration Finance Manufacturing ...
? Source Source Problem: Data Management in Large Enterprises • Two Approaches for accessing data: • Query-Driven (Lazy) • Warehouse (Eager)
The Need for DW • Query-driven (lazy, on-demand) Clients Metadata Integration System . . . Wrapper Wrapper Wrapper . . . Source Source Source
Disadvantages of Query-Driven Approach • Delay in query processing • Inefficient and potentially expensive for frequent queries • Competes with local processing at sources
The Warehousing Approach • Information integrated in advance • Stored in wh for direct querying and analysis Clients Data Warehouse Metadata Integration System . . . Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor . . . Source Source Source
Advantages of Warehousing Approach • High query performance • Doesn’t interfere with local processing at sources • Information copied at warehouse • Can modify, annotate, summarize, restructure, etc. • Can store historical information • Security, no auditing
Subject Oriented • Data Warehouse is designed around “subjects” rather than processes • A company may have • Retail Sales System • Outlet Sales System • Catalog Sales System • DW will have a Sales Subject Area
OLTP Systems Catalog Sales System Retail Sales System Outlet Sales System Sales Subject Area Data Warehouse Subject-Oriented Sales Information Subject Oriented
Integrated • Heterogeneous Source Systems • Need to Integrate source data • For Example: Product codes could be different in different systems • Arrive at common code in DW
Integrated • Information integrated in advance • Stored in DW for direct querying and analysis Clients Data Warehouse Metadata Integration System . . . Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor . . . Source Source Source
Non-Volatile • Operational update of data does not occur in the data warehouse environment. • Does not require transaction processing, recovery, and concurrency control mechanisms • Requires only two operations in data accessing: • initial loading of data and access of data.
Non-Volatile(Read-Mostly) Write OLTP USER Read DW USER Read
Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems. • Operational database: current value data. • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Time Variant • Most business analysis has a time component • Trend Analysis (historical data is required) Sales 2001 2002 2003 2004
Data Warehousing Process Overview • The major components of a data warehousing process • Data sources • Data extraction • Data loading • Comprehensive Database/Data Store • Data Mart • Metadata • Middleware tools /information delivery tools
ETL • Data Extraction • Data Cleaning and Transformation • Convert from legacy/host format to warehouse format • Load • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
The ETL Process Source Systems Staging Area DW Database Transform Extract Load
Data Staging Area • A storage area where extracted data is cleaned, transformed and deduplicated. • Initial storage for data • Need not be based on Relational model • Mainly sorting and Sequential processing • Does not provide data access to users • Analogy – kitchen of a restaurant
ETL Process Issues & Challenges • Consumes 70-80% of project time • Heterogeneous Source Systems • Little or no control over source systems • Source systems scattered • Different currencies, measurement units • Ensuring data quality
Comprehensive Database/Data Store • Mostly a relational DB • Oracle, DB2, Sybase, SQL Server • New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)
Data Warehouse Design • OLTP Systems are Data Capture Systems • “DATA IN” systems • DW are “DATA OUT” systems OLTP DW
Dimensional Modeling • Facts are stored in FACT Tables • Dimensions are stored in DIMENSION tables • Dimension tables contains textual descriptors of business • Fact and dimension tables form a Star Schema • “BIG” fact table in center surrounded by “SMALL” dimension tables
Data mart • Data mart = subset of DW for community users, e.g. accounting department • Sometimes exist as Multidimensional Database • Info mart = summarized data + report for community users
Meta Data • Data about data • Needed by both information technology personnel and users • IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. • Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
Information Delivery Tools • Tools • Query & reporting • OLAP • Data mining, visualization, segmentation, clustering • New developments: text mining, web mining & personalization • Mining multimedia data
Information Delivery Tools • Commercial tools • Crystal Report, Impromptu, WebFocus • Increasingly common mode of delivery: Web-enabled
Data Warehouse Architecture • Data Flow Architecture • System Architecture
Data Flow Architecture • Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files • MDB=Multidimensional databases
System Architectures • Three parts of the data warehouse • The data warehouse that contains the data and associated software • Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse • Client (front-end) software that allows users to access and analyze data from the warehouse
Data Warehouse Development • Data warehouse development approaches • Inmon Model: EDW approach, Enterprise-wide warehouse, top down • Kimball Model: Data mart approach, Data mart, bottom up • Which model is best? • There is no one-size-fits-all strategy to data warehousing • When properly executed, both result in an enterprise-wide data warehouse, but with different architectures
The Data Mart Strategy • The most common approach • Begins with a single mart and architected marts are added over time for more subject areas • Relatively inexpensive and easy to implement • Can be used as a proof of concept for data warehousing • Can perpetuate the “silos of information” problem • Can postpone difficult decisions and activities • Requires an overall integration plan
The Enterprise-wide Strategy • A comprehensive warehouse is built initially • An initial dependent data mart is built using a subset of the data in the warehouse • Additional data marts are built using subsets of the data in the warehouse • Like all complex projects, it is expensive, time consuming, and prone to failure • When successful, it results in an integrated, scalable warehouse
Data Warehouse Development • Some best practices for implementing a data warehouse (Weir, 2002): • Project must fit with corporate strategy and business objectives • There must be complete buy-in to the project by executives, managers, and users • It is important to manage user expectations about the completed project • The data warehouse must be built incrementally • Build in adaptability
Data Warehouse Development • Some best practices for implementing a data warehouse (Weir, 2002): • The project must be managed by both IT and business professionals • Develop a business/supplier relationship • Only load data that have been cleansed and are of a quality understood by the organization • Do not overlook training requirements • Be politically aware