350 likes | 620 Views
What is a Data Warehouse. by W. H. Inmon http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/. What is a Data Warehouse?. A data warehouse is a: subject-oriented , integrated , time-variant , nonvolatile , collection of data in support of management's decision making process
E N D
What is a Data Warehouse by W. H. Inmon http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/
What is a Data Warehouse? • A data warehouse is a: subject-oriented, integrated, time-variant, nonvolatile, collection of data in support of management's decision making process • The data comes from the operational environment • The data warehouse is always a physically separate store
Difference Between Operational Systems and Data and the Data Warehouse (DW) • DW is oriented around the major subjects of the enterprise • The data-driven, subject orientation is in contrast to the more classical process/functional orientation of applications • The DW world focuses on data modeling and database design exclusively • DW data excludes data that will not be used for DSS processing • DW data spans a spectrum of time and the relationships found in the data warehouse are many
The data warehouse has a strong subject orientation Operational Data warehouse Customer Loans Vendor Savings Product Bank card Activity Trust An application orientation A subject orientation
Integration • Data found within the DW is integrated • ALWAYS • WITH NO EXCEPTIONS • consistent naming conventions • consistent measurement of variables • consistent encoding structures • consistent physical attributes of data • data needs to be stored in the DW in a singular, globally-acceptable fashion
? When data is moved to the DW from the application-oriented operational environment, the data is integrated before entering the DW Operational Data warehouse appl A - m, f appl B - 1, 0 appl C - x, y appl D - male, female m, f pipeline cm appl A - pipeline cm appl B - pipeline inches appl C - pipeline mcf appl D - pipeline yds balance dec fixed (13,2) appl A - balance dec fixed (13,2) appl B - balance pic 9(9)v99 appl C - balance dec fixed (11,0) appl D - balance pic s9(7)v99 comp 3 appl A - description appl B - description appl C - description appl D - description description appl A - bal-on-hand appl B - current-balance appl C - cash-in-house appl D - balance balance appl A - date (Julian) appl B - date (yymmdd) appl C - date (mmddyy) appl D - date (absolute) date (Julian)
Integration • The collective ability of many application designers to create inconsistent applications is legendary • The integration affects almost every aspect of design - the physical characteristics of data, the dilemma of having more than one source of data, the issue of inconsistent naming standards, inconsistent date formats, and so forth
Time Variancy • All data in the data warehouse is accurate as of some moment in time (i.e., not "right now") • In the operational environment data is accurate as of the moment of access • Data found in the warehouse is said to be "time variant
Time Variancy Operational Data warehouse • Current value data: • time horizon -- 60 - 90 days • key may or may not have an • element of time • data can be updated • Snapshot data: • time horizon -- 5 - 10 years • key contains an element of • time • once snapshot is made, • record cannot be updated
Nonvolatile Change Replace Insert Insert Load Replace Replace Access Change Operational Data warehouse Data is updated on a record-by-record basis regularly Data is loaded into the warehouse and is accessed there, but once the snapshot of data is made, the data in the warehouse does not change
Nonvolatile • The basic manipulation of data that occurs in the data warehouse is simple • There are only two kinds of operations • the initial loading of data • the access of data • There is no update of data • The need to be cautious of the update anomaly is no factor • Liberties can be taken to optimize the access of data
Nonvolatile • Another consequence is in the technology • Technologies to support: • record-by-record update in an on-line mode • backup and recovery • transaction and data integrity • detection and remedy of deadlock are quite complex and unnecessary for data warehouse processing • DW environment is VERY, VERY different from the classical operational environment
Nonvolatile • The source of nearly all data warehouse data is the operational environment • It is a temptation to think that there is massive redundancy of data between the two environments • In fact there is a MINIMUM of data redundancy • data is filtered; much data never passes out of the operational environment • the time horizon of data is very different • the data warehouse contains summary data • data undergoes a fundamental transformation as it passes into the data warehouse
The Structure of the Warehouse • Data warehouses have a distinct structure • Different components of the data warehouse are: • meta data • current detail data • older detail data • lightly summarized data • highly summarized data • The major concern is the current detail data • the most recent happenings are always of great interest • voluminous, stored at the lowest level of granularity • disk storage is fast to access but expensive and complex to manage
There are different levels of summarization and detail that demark the data warehouse Highly summarized Lightly summarized META DATA Current data Older detail data
The Structure of the Warehouse • Older detail data is stored on some form of mass storage • it is infrequently accessed • it is stored at a level of detail consistent with current detailed data • Lightly summarized data is distilled from the low level of detail found at the current detailed level • it is almost always stored on disk storage • the design issues are: • what unit of time is the summarization done over • what attributes will the lightly summarized data contain
The Structure of the Warehouse • Highly summarized data is compact and easily accessible • Meta data plays a special and very important role in the data warehouse • It is used as: • a directory to help locate the contents • a guide to the mapping of data as the data is transformed from the operational to the DW environment • a guide to the algorithms used for summarization
An example of the levels of summarization that might be found in the data warehouse national sales by month 1988-1996 monthly sales by product line 1993-1996 national sales by week 1986-1996 weekly sales by subproduct 1988-1996 META DATA sales detail 1995-1996 sales detail 1985-1994
An Example of the Data Warehouse • Old sales detail is that detail about sales that is older than 1995 • The current value detail contains data from 1995 to 1996 • The sales detail is summarized weekly by subproduct line and by region to produce the lightly summarized stores of data • The weekly sales detail is further summarized monthly along even broader lines to produce the highly summarized data • Meta data contains (at the least!): • the structure of the data • the algorithms used for summarization • the mapping from the operational environment to the data warehouse
Old Detail Storage Medium • A wide variety of storage media that should be considered for storing older detail data • photo optical storage • CD-ROM • micro fiche • magnetic tape • mass storage • It is entirely likely that other storage media will serve the needs
The Flow of Data Inside the Data Warehouse Summarization process Operational environment Aging process
Flow of Data • As data enters the data warehouse from the operational environment, it is transformed • Upon entering the data warehouse, data goes into the current detail level of detail • It resides there and is used there until one of three events occurs: • it is purged • it is summarized, and/or • it is archived
The higher the levels of summarization, the more the usage of the data
Summarized Data • The more summarized the data, the quicker and more efficient it is to get to the data • The DSS analyst in a pre-data warehouse environment has used data at the detailed level • One of the tasks of the data architect is to wean the DSS user from constantly using data at the lowest level of detail • installing a chargeback system • pointing out very good response time when dealing with data at a high level of summarization
Other Considerations • Data at the higher levels of summarization can be freely indexed • Data at the lower levels of detail is so voluminous that it can be indexed sparingly • The data model and formal design applies almost exclusively to the current level of detail • The data modeling activities do not apply to the levels of summarization
Indexes and Data Model Data model
Partitioning of DW Data • Partitioning can be done in two ways • at the DBMS level • the DBMS is aware of the partitions and manages them accordingly • the automatic management of the partitions is inflexible • at the application level • the responsibility for the management of the partitions is left up to the programmer • provides flexibility in the management of data in the data warehouse
vendor/supplier history 91 - present parts/order history 91 92 93 94 95 parts manufacture history q1 93 q2 93 q3 93 q4 93 q1 94 q2 94 q3 94 q4 94 q1 95 q2 95 q3 95 q4 95 q1 96 94 95 92 93 The internal structuring of data in a sample data warehouse current detailed data customer history order/customer 88 - present part part/order part part part/assembly parts shipments parts bill of material assembly history 91 92 93 94 95 87 - present
An Example of a Data Warehouse • The levels of summarization are not shown, nor is the old detail archive shown • There are tables of the same type divided over time • For different types of tables there are different units of time physically dividing the units of data • Different tables are linked by means of a common identifier
Other Anomalies • Public summary data is summary data that has been calculated outside the boundaries of the data warehouse but is used throughout the corporation • Another anomaly is that of external data • Another exceptional type of data sometimes found in a data warehouse is that of permanent detail data stored for ethical or legal reasons • the medium the data is stored on must be as safety proof as possible • the data must be able to be restored • the data needs special treatment in the indexing to be accessible
Summary A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision needs There are four levels of data warehouse data: • old detail • current detail • lightly summarized data • highly summarized data Meta data is also an important part of the data warehouse environment