1 / 26

Data Warehousing

Data Warehousing. DSCI 4103 Dr. Mennecke. Introduction and Chapter 1. Introduction:. Definitions Legacy Systems Dimensions Data Dependencies Model Dimensional Model. Ship Type. Shipper. Ship To. Product. District Credit. Order Item. ContactLocat. Product Line. Sales Order.

ora-thomas
Download Presentation

Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehousing DSCI 4103 Dr. Mennecke Introduction and Chapter 1

  2. Introduction: • Definitions • Legacy Systems • Dimensions • Data Dependencies Model • Dimensional Model

  3. ShipType Shipper Ship To Product DistrictCredit OrderItem ContactLocat. ProductLine SalesOrder Cust.Locat. ProductGroup Contract ContractType Customer Contact SalesRep SalesDistrict SalesRegion SalesDivision An ER Model

  4. A Dimensional Model Time Market Product

  5. Why Data Warehouses? • To meet the long sought after goal of providing the user with more flexible data bases containing data that can be accessed “every which way.”

  6. OLTP vs. OLAP • OLTP (Online transaction processing) has been the standard reason for IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access. • OLAP (Online analytical processing) is a set of procedures for defining and using a dimension framework for decision support

  7. The Goals for and Characteristics of a DW • Make organizational data accessible • Facilitate consistency • Adaptable and yet resilient to change • Secure and reliable • Designed with a focus on supporting decision making

  8. The Goals for and Characteristics of a DW • Generate an environment in which data can be sliced and diced in multiple ways • It is more than data, it is a set of tools to query, analyze, and present information • The DW is the place where operational data is published (cleaned up, assembled, etc.)

  9. Basic elements of the data warehouse Operational Source Systems DataStaging Area DataPresentation Area DataAccessTools • Services: • Clean, combine, and standardizeConform DimensionsNo user query services • Data Store: • Flat files and relational tables • Processing: • Sorting and sequential processing • Data Mart #1 • DimensionalAtomic and summary dataBased on a single business process Ad hoc query tools Report Writers Analytical Applications Modeling: Forecasting Scoring Data Mining Extract Load Access DW Bus:Conformed facts and dimensions Extract • Data Mart #2 • Similar design Extract Load Access

  10. Data Staging Area • Extract-Transformation-Load • Extract: Reading the source data and copying the data to the staging area • Transformation: • Cleaning • Combining • Duplicating • Assigning keys • Load: present data to the bulk loading facilities of the data mart

  11. Organization of data in the presentation area of the data warehouse • Data in the warehouse are dimensional, not normalized relations • However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs • Data should be atomic someplace in the warehouse; even if the presentation is aggregate • Uses the bus architecture to support a decentralized set of data marts

  12. Updates to a data warehouse • For many years, the dogma stated that data warehouses are never updated. • This is unrealistic since labels, titles, etc. change. • Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)

  13. Dimensional Modeling Terms and Concepts • Fact table • Dimension tables

  14. Fact Tables • Fact table: a table in the data warehouse that contains • Numerical performance measures • Foreign keys that tie the fact table to the dimension tables

  15. Fact Tables • Each row records a measurement describing a transaction • Where? • When? • Who? • How much? • How many? • The level of detail represented by this data is referred to as the grain of the data warehouse • Questions can only be asked down to a level corresponding with the grain of the data warehouse

  16. Fact Tables • Fact tables contain numeric data that can be one of three types • Additive • Semi-additive • Non-additive • Fact tables contain foreign keys • A group of foreign keys will be used to create a concatenated primary key • Fact tables generally don’t contain textual data

  17. Dimension tables • Tables containing textual descriptors of the business • Dimension tables are usually wide (e.g., 100 columns) • Dimension tables are usually shallow (100s of thousand or a few million rows) • Values in the dimensions usually provide • Constraints on queries (e.g., view customer by region) • Report headings

  18. Dimension tables • The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes • Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations • Product part  Product  Brand • Dimensions are usually highly denormalized

  19. Dimension tables • The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support

  20. Bringing together facts and dimensions – Building the dimensional Model • Start with the normalized ER Model • Group the ER diagram components into segments based on common business processes and model each as a unit • Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table • Denormalize the other tables as needed and designate one field as a primary key

  21. Sales Fact Time Dimension Product Dimension time_key product_key store_key dollars_sold units_sold dollars_cost time_key day_of_Week month quarter year holiday_flag product_key description brand category Store Dimension store_key store_name address floor_plan_type A Dimensional Model

  22. So, What is a DW? • A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisionsW.H. Inmon (the father of DW)

  23. Subject Oriented • Data in a data warehouse are organized around the major subjects of the organization

  24. Integrated • Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment

  25. Non-Volatile • Once added to the DW, data are not changed (barring the existence of major errors)

  26. Time Variant • The DW captures data at a specific moment, thus, it is a snap-shot view of the organization at that moment in time. As these snap-shots accumulate, the analyst is able to examine the organization over time (a time series!) • The snap-shot is called a production data extract

More Related