1 / 49

Intro. to Data Warehouse

Learn the fundamentals of data warehousing, including its definition, benefits, and characteristics. Understand the data management challenges in large enterprises and explore the query-driven and warehousing approaches. Discover the ETL process and its challenges in building a data warehouse.

Download Presentation

Intro. to Data Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Intro. to Data Warehouse รศ.ดร. วรพจน์ กรีสุระเดช WorapojKreesuradej, Ph.D. Associate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Web: www.it.kmitl.ac.th/dme Email: worapoj@it.kmitl.ac.th

  2. Book • PaulrajPonniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001. • Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002.

  3. Definition of DW • “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon • A decision support database that is maintained separately from the organization’s operational databases. • A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc.

  4. R. Kimball’s definition of a DW • A data warehouse is a copy of transactional data specifically structured for querying and analysis.

  5. Problem: Data Management in Large Enterprises • Vertical fragmentation of informational systems • Result of application (user)-driven development of operational systems Sales Planning Suppliers Num. Control Stock Mngmt Debt Mngmt Inventory ... ... ... Sales Administration Finance Manufacturing ...

  6. ? Source Source Problem: Data Management in Large Enterprises • Two Approaches for accessing data: • Query-Driven (Lazy) • Warehouse (Eager)

  7. The Need for DW • Query-driven (lazy, on-demand) Clients Metadata Integration System . . . Wrapper Wrapper Wrapper . . . Source Source Source

  8. Disadvantages of Query-Driven Approach • Delay in query processing • Inefficient and potentially expensive for frequent queries • Competes with local processing at sources

  9. The Warehousing Approach • Information integrated in advance • Stored in wh for direct querying and analysis Clients Data Warehouse Metadata Integration System . . . Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor . . . Source Source Source

  10. Advantages of Warehousing Approach • High query performance • Doesn’t interfere with local processing at sources • Information copied at warehouse • Can modify, annotate, summarize, restructure, etc. • Can store historical information • Security, no auditing

  11. Characteristics of DW

  12. Subject Oriented • Data Warehouse is designed around “subjects” rather than processes • A company may have • Retail Sales System • Outlet Sales System • Catalog Sales System • DW will have a Sales Subject Area

  13. OLTP Systems Catalog Sales System Retail Sales System Outlet Sales System Sales Subject Area Data Warehouse Subject-Oriented Sales Information Subject Oriented

  14. Integrated • Heterogeneous Source Systems • Need to Integrate source data • For Example: Product codes could be different in different systems • Arrive at common code in DW

  15. Integrated • Information integrated in advance • Stored in DW for direct querying and analysis Clients Data Warehouse Metadata Integration System . . . Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor . . . Source Source Source

  16. Non-Volatile • Operational update of data does not occur in the data warehouse environment. • Does not require transaction processing, recovery, and concurrency control mechanisms • Requires only two operations in data accessing: • initial loading of data and access of data.

  17. Non-Volatile(Read-Mostly) Write OLTP USER Read DW USER Read

  18. Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems. • Operational database: current value data. • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

  19. Time Variant • Most business analysis has a time component • Trend Analysis (historical data is required) Sales 2001 2002 2003 2004

  20. Data Warehousing Process Overview

  21. Data Warehousing Process Overview • The major components of a data warehousing process • Data sources • Data extraction • Data loading • Comprehensive Database/Data Store • Data Mart • Metadata • Middleware tools /information delivery tools

  22. ETL • Data Extraction • Data Cleaning and Transformation • Convert from legacy/host format to warehouse format • Load • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition

  23. The ETL Process Source Systems Staging Area DW Database Transform Extract Load

  24. Data Staging Area • A storage area where extracted data is cleaned, transformed and deduplicated. • Initial storage for data • Need not be based on Relational model • Mainly sorting and Sequential processing • Does not provide data access to users • Analogy – kitchen of a restaurant

  25. ETL Process Issues & Challenges • Consumes 70-80% of project time • Heterogeneous Source Systems • Little or no control over source systems • Source systems scattered • Different currencies, measurement units • Ensuring data quality

  26. Comprehensive Database/Data Store • Mostly a relational DB • Oracle, DB2, Sybase, SQL Server • New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)

  27. Data Warehouse Design • OLTP Systems are Data Capture Systems • “DATA IN” systems • DW are “DATA OUT” systems OLTP DW

  28. Dimensional Modeling • Facts are stored in FACT Tables • Dimensions are stored in DIMENSION tables • Dimension tables contains textual descriptors of business • Fact and dimension tables form a Star Schema • “BIG” fact table in center surrounded by “SMALL” dimension tables

  29. Star Schema

  30. Star Schema

  31. Data mart • Data mart = subset of DW for community users, e.g. accounting department • Sometimes exist as Multidimensional Database • Info mart = summarized data + report for community users

  32. Meta Data • Data about data • Needed by both information technology personnel and users • IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. • Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.

  33. Information Delivery Tools • Tools • Query & reporting • OLAP • Data mining, visualization, segmentation, clustering • New developments: text mining, web mining & personalization • Mining multimedia data

  34. Information Delivery Tools • Commercial tools • Crystal Report, Impromptu, WebFocus • Increasingly common mode of delivery: Web-enabled

  35. Data Warehouse Architecture • Data Flow Architecture • System Architecture

  36. Data Flow Architecture

  37. Data Flow Architecture

  38. Data Flow Architecture • Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files • MDB=Multidimensional databases

  39. System Architectures • Three parts of the data warehouse • The data warehouse that contains the data and associated software • Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse • Client (front-end) software that allows users to access and analyze data from the warehouse

  40. System Architectures

  41. System Architectures

  42. System Architecture

  43. System Architecture

  44. Data Warehouse Development • Data warehouse development approaches • Inmon Model: EDW approach, Enterprise-wide warehouse, top down • Kimball Model: Data mart approach, Data mart, bottom up • Which model is best? • There is no one-size-fits-all strategy to data warehousing • When properly executed, both result in an enterprise-wide data warehouse, but with different architectures

  45. The Data Mart Strategy • The most common approach • Begins with a single mart and architected marts are added over time for more subject areas • Relatively inexpensive and easy to implement • Can be used as a proof of concept for data warehousing • Can perpetuate the “silos of information” problem • Can postpone difficult decisions and activities • Requires an overall integration plan

  46. The Enterprise-wide Strategy • A comprehensive warehouse is built initially • An initial dependent data mart is built using a subset of the data in the warehouse • Additional data marts are built using subsets of the data in the warehouse • Like all complex projects, it is expensive, time consuming, and prone to failure • When successful, it results in an integrated, scalable warehouse

  47. DW Lifecycle (Ralph Kimball )

  48. Data Warehouse Development • Some best practices for implementing a data warehouse (Weir, 2002): • Project must fit with corporate strategy and business objectives • There must be complete buy-in to the project by executives, managers, and users • It is important to manage user expectations about the completed project • The data warehouse must be built incrementally • Build in adaptability

  49. Data Warehouse Development • Some best practices for implementing a data warehouse (Weir, 2002): • The project must be managed by both IT and business professionals • Develop a business/supplier relationship • Only load data that have been cleansed and are of a quality understood by the organization • Do not overlook training requirements • Be politically aware

More Related