280 likes | 435 Views
Data Warehousing. DSCI 4103 Dr. Mennecke. Introduction and Chapter 1. Introduction:. Definitions Legacy Systems Dimensions Data Dependencies Model Dimensional Model. Ship Type. Shipper. Ship To. Product. District Credit. Order Item. ContactLocat. Product Line. Sales Order.
E N D
Data Warehousing DSCI 4103 Dr. Mennecke Introduction and Chapter 1
Introduction: • Definitions • Legacy Systems • Dimensions • Data Dependencies Model • Dimensional Model
ShipType Shipper Ship To Product DistrictCredit OrderItem ContactLocat. ProductLine SalesOrder Cust.Locat. ProductGroup Contract ContractType Customer Contact SalesRep SalesDistrict SalesRegion SalesDivision An ER Model
A Dimensional Model Time Market Product
Why Data Warehouses? • To meet the long sought after goal of providing the user with more flexible data bases containing data that can be accessed “every which way.”
OLTP vs. OLAP • OLTP (Online transaction processing) has been the standard reason for IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access. • OLAP (Online analytical processing) is a set of procedures for defining and using a dimension framework for decision support
The Goals for and Characteristics of a DW • Make organizational data accessible • Facilitate consistency • Adaptable and yet resilient to change • Secure and reliable • Designed with a focus on supporting decision making
The Goals for and Characteristics of a DW • Generate an environment in which data can be sliced and diced in multiple ways • It is more than data, it is a set of tools to query, analyze, and present information • The DW is the place where operational data is published (cleaned up, assembled, etc.)
Basic elements of the data warehouse Operational Source Systems DataStaging Area DataPresentation Area DataAccessTools • Services: • Clean, combine, and standardizeConform DimensionsNo user query services • Data Store: • Flat files and relational tables • Processing: • Sorting and sequential processing • Data Mart #1 • DimensionalAtomic and summary dataBased on a single business process Ad hoc query tools Report Writers Analytical Applications Modeling: Forecasting Scoring Data Mining Extract Load Access DW Bus:Conformed facts and dimensions Extract • Data Mart #2 • Similar design Extract Load Access
Data Staging Area • Extract-Transformation-Load • Extract: Reading the source data and copying the data to the staging area • Transformation: • Cleaning • Combining • Duplicating • Assigning keys • Load: present data to the bulk loading facilities of the data mart
Organization of data in the presentation area of the data warehouse • Data in the warehouse are dimensional, not normalized relations • However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs • Data should be atomic someplace in the warehouse; even if the presentation is aggregate • Uses the bus architecture to support a decentralized set of data marts
Updates to a data warehouse • For many years, the dogma stated that data warehouses are never updated. • This is unrealistic since labels, titles, etc. change. • Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)
Dimensional Modeling Terms and Concepts • Fact table • Dimension tables
Fact Tables • Fact table: a table in the data warehouse that contains • Numerical performance measures • Foreign keys that tie the fact table to the dimension tables
Fact Tables • Each row records a measurement describing a transaction • Where? • When? • Who? • How much? • How many? • The level of detail represented by this data is referred to as the grain of the data warehouse • Questions can only be asked down to a level corresponding with the grain of the data warehouse
Fact Tables • Fact tables contain numeric data that can be one of three types • Additive • Semi-additive • Non-additive • Fact tables contain foreign keys • A group of foreign keys will be used to create a concatenated primary key • Fact tables generally don’t contain textual data
Dimension tables • Tables containing textual descriptors of the business • Dimension tables are usually wide (e.g., 100 columns) • Dimension tables are usually shallow (100s of thousand or a few million rows) • Values in the dimensions usually provide • Constraints on queries (e.g., view customer by region) • Report headings
Dimension tables • The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes • Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations • Product part Product Brand • Dimensions are usually highly denormalized
Dimension tables • The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support
Bringing together facts and dimensions – Building the dimensional Model • Start with the normalized ER Model • Group the ER diagram components into segments based on common business processes and model each as a unit • Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table • Denormalize the other tables as needed and designate one field as a primary key
Sales Fact Time Dimension Product Dimension time_key product_key store_key dollars_sold units_sold dollars_cost time_key day_of_Week month quarter year holiday_flag product_key description brand category Store Dimension store_key store_name address floor_plan_type A Dimensional Model
So, What is a DW? • A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisionsW.H. Inmon (the father of DW)
Subject Oriented • Data in a data warehouse are organized around the major subjects of the organization
Integrated • Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment
Non-Volatile • Once added to the DW, data are not changed (barring the existence of major errors)
Time Variant • The DW captures data at a specific moment, thus, it is a snap-shot view of the organization at that moment in time. As these snap-shots accumulate, the analyst is able to examine the organization over time (a time series!) • The snap-shot is called a production data extract