630 likes | 1k Views
Granularity in the Data Warehouse. Chapter 4. Raw Estimates. The single most important design issue facing the data warehouse developer is determining the proper level of granularity of the data that will reside in the data warehouse.
E N D
Granularity in the Data Warehouse Chapter 4
Raw Estimates • The single most important design issue facing the data warehouse developer is determining the proper level of granularity of the data that will reside in the data warehouse. • Granularity is also important to the warehouse architect because it affects all the environments that depend on the warehouse for data. • Granularity affects how efficiently data can be shipped to the different environments and determines the types of analysis that can be done. • The primary issue of granularity is that of getting it at the right level. The level of granularity needs to be neither too high nor too low.
Raw Estimates • The starting point for determining the appropriate level of granularity is to do a raw estimate of the number of rows of data and the DASD (direct access storage device) that will be in the data warehouse. • The raw estimate of the number of rows of data that will reside in the data warehouse tells the architect a great deal. • If there are only 10,000 rows, almost any level of granularity will do. • If there are 10 million rows, a low level of granularity is possible. • If there are 10 billion rows, not only is a higher level of granularity needed, but a major portion of the data will probably go into overflow storage. • Figure shows an algorithmic path to calculate the space occupied by a data warehouse.
Input to the Planning Process The estimate of rows and DASD then serves as input to the planning process, as shown in Figure
Data in Overflow • Once the raw estimate as to the size of the data warehouse is made, the next step is to compare the total number of rows in the warehouse environment to the charts shown in Figure. • Depending on how many total rows will be in the warehouse environment, different approaches to design, development, and storage are necessary.
Data in Overflow On the five-year horizon, the totals shift by about an order of magnitude or perhaps even more. The theory is that after five years, these factors will be in place: ■■ There will be more expertise available in managing the data warehouse volumes of data. ■■ Hardware costs will have dropped to some extent. ■■ More powerful software tools will be available. ■■ The end user will be more sophisticated.
Overflow Storage • Data in the data warehouse environment grows at a rate never before seen by IT professionals. The combination of historical data and detailed data produces a growth rate that is phenomenal. • As data grows large, a natural subdivision of data occurs between actively used data and inactively used data. Inactive data is sometimes called dormant data or infrequently used data. • At some point in the life of the data warehouse, the vast majority of the data in the warehouse becomes stale and unused. At this point, it makes sense to start separating the data onto different storage media.
Overflow Storage • Figure shows that a data monitor is needed to determine the usage of data. The data monitor tells where to place data by determining what data is and is not being used in the data warehouse. • The movement between disk storage and near-line storage is controlled by means of software called a cross-media storage manager (CMSM). • The data in alternate storage or near-line storage can be accessed directly by means of software that has the intelligence to know where data is located in near-line storage. • These three software components are the minimum required for alternate storage or near-line storage to be used effectively.
Some Feedback Loop Techniques • Build the first parts of the data warehouse in very small, very fast steps, and carefully listen to the end users’ comments at the end of each step of development. Be prepared to make adjustments quickly. • If available, use prototyping and allow the feedback loop to function using observations gleaned from the prototype. • Look at how other people have built their levels of granularity and learn from their experience. • Go through the feedback process with an experienced user who is aware of the process occurring. • Under no circumstances should you keep your users in the dark as to the dynamics of the feedback loop. • Look at whatever the organization has now that appears to be working, and use those functional requirements as a guideline. • Execute joint application design (JAD) sessions and simulate the output to achieve the desired feedback.
Some Feedback Loop Techniques Granularity of data can be raised in many ways, such as the following: • Summarize data from the source as it goes into the target. • Average or otherwise calculate data as it goes into the target. • Push highest and/or lowest set values into the target. • Push only data that is obviously needed into the target. • Use conditional logic to select only a subset of records to go into the target. The ways that data may be summarized or aggregated are limitless. When building a data warehouse, keep one important point in mind: In classical requirements systems development, it is unwise to proceed until the vast majority of the requirements are identified. But in building the data warehouse, it is unwise not to proceed if at least half of the requirements for the data warehouse are identified.
The Data Warehouseand Technology Chapter 5
Some basic requirement of technology supporting Data warehousing
Managing Large Amounts of Data • In the ideal case, the data warehouse developer builds a data warehouse under the assumption that the technology that houses the data warehouse can handle the volumes required. • When the designer has to go to extraordinary lengths in design and implementation to map the technology to the data warehouse, then there is a problem with the underlying technology. • When technology is an issue, it is normal to engage more than one technology. • The ability to participate in moving dormant data to overflow storage is perhaps the most strategic capability that a technology can have.
Managing Multiple Media • In conjunction with managing large amounts of data efficiently and cost effectively, the technology underlying the data warehouse must handle multiple storage media. It is insufficient to manage a mature data warehouse on DASD alone. • Following is a hierarchy of storage of data in terms of speed of access and cost of storage:
Indexing and Monitoring Data • Of course, the designer uses many practices to make data as flexible as possible, such as spreading data across different storage media and partitioning data. But the technology that houses the data must be able to support easy indexing as well. • Unlike the monitoring of transaction processing, where the transactions themselves are monitored, data warehouse activity monitoring determines what data has and has not been used. Monitoring data warehouse data determines such factors as the following: • If a reorganization needs to be done • If an index is poorly structured • If too much or not enough data is in overflow • The statistical composition of the access of the data • Available remaining space
Interfaces to Many Technologies Extremely important component of the data warehouse is the ability both to receive data from and to pass data to a wide variety of technologies. The interface to different technologies requires several considerations: ■■ Does the data pass from one DBMS to another easily? ■■ Does it pass from one operating system to another easily? ■■ Does it change its basic format in passage (EBCDIC, ASCII, and so forth)? ■■ Can passage into multidimensional processing be done easily? ■■ Can selected increments of data, such as changed data capture (CDC) be passed rather than entire tables? ■■ Is the context of data lost in translation as data is moved to other environments?
Programmer or Designer Controlof Data Placement Because of efficiency of access and update, the programmer or designer must have specific control over the placement of data at the physical block or page level, as shown in Figure
Metadata Management Metadata becomes even more important in the data warehouse than in the classical operational environment. Metadata is vital because of the fundamental difference in the development life cycle that is associated with the data warehouse. Typically, the technical metadata that describes the data warehouse contains the following: • Data warehouse table structures • Data warehouse table attribution • Data warehouse source data (the system of record)
Metadata Management • Mapping from the system of record to the data warehouse • Data model specification • Extract logging • Common routines for access of data • Definitions and/or descriptions of data • Relationships of one unit of data to another Metadata comes in different varieties: • Business metadata is that metadata that is of use and value to the business person. • Technical metadata is that metadata that is of use and value to the technician.
Language Interface The data warehouse must have a rich language specification. The languages used by the programmer and the DSS end user to access data inside the data warehouse should be easy to use and robust. Typically, the language interface to the data warehouse should do the following: ■■ Be able to access data a set at a time ■■ Be able to access data a record at a time ■■ Specifically ensure that one or more indexes will be used in the satisfaction of a query ■■ Have an SQL interface ■■ Be able to insert, delete, or update data
Efficient Loading of Data An important technological capability of the data warehouse is the ability to load the data warehouse efficiently, as shown in Figure
Data is loaded into a data warehouse in two fundamental ways: a record at a time through a language interface or en masse with a utility.
Efficient Loading of Data Loading data by means of a utility is much faster. In addition, indexes must be efficiently loaded at the same time the data is loaded. As the burden of the volume of loading becomes an issue, the load is often parallelized. When this happens, the data being loaded is divided into one of several job streams. Another related approach to the efficient loading of very large amounts of data is staging the data prior to loading
Efficient Index Utilization Technology can support efficient index access in several ways: ■■ Using bit maps ■■ Having multileveled indexes ■■ Storing all or parts of an index in main memory ■■ Compacting the index entries when the order of the data being indexed allows such compaction ■■ Creating selective indexes and range indexes
Compaction of Data • The ability of DW to manage large amounts of data necessitates the ability to compact data. • Of course, when data is compacted, it can be stored in a minimal amount of space which makes the access of the data very efficient. • Another advantage is that the programmer gets the most out of a given I/O when data is stored compactly. • Of course, there is always the corresponding issue of decompaction of data on access. • As a rule, in the data warehouse environment, I/O resources are much scarcer than CPU resources, so decompaction of data is not a major issue since it consumes CPU cycles not the I/O ones.
Variable-Length Data Another simple but vital technological requirement of the data warehouse environment is the ability to manage variable-length data efficiently, as seen in Figure
Because of the variety of data found in the data warehouse, variable-length structuring of data must be supported.
Lock Management A standard part of database technology is the lock manager, which ensures that two or more people are not updating the same record at the same time. But an update is not done in the data warehouse; instead, data is stored in a series of snapshot records. When a change occurs, a new snapshot record is added, rather than an update being done. One of the effects of the lock manager is that it consumes a fair amount of resources, even when data is not being updated. Merely turning the lock manager on requires overhead. Therefore, to streamline the data warehouse environment, being able to selectively turn the lock manager off and on is necessary.
Other Technological Features It is noteworthy that many features of DBMS technology found in the classical transaction processing play only a small role (if they play a role at all) in the support of the data warehouse environment. Some of those features include the following: ■■ Transaction integrity ■■ High-speed buffering ■■ Row- or page-level locking ■■ Referential integrity ■■ VIEWs of data ■■ Partial block loading Indeed, whenever a transaction-based DBMS is used in the data warehouse environment, it is desirable to turn off such features, because they interfere with the efficient processing of data inside the data warehouse.
DBMS Types and the Data Warehouse With the advent of data warehousing and the recognition of DSS as an integral part of the modern information systems infrastructure, a new class of DBMS has arisen. This class can be called a data warehouse-specific DBMS. The data warehouse-specific DBMS is optimized for data warehousing and DSS processing.This specific type of DBMS differs from the general purpose DBMS in various regards. • No overhead of update. • No freespace for update is needed. • A much more robust and sophisticated indexing structure is needed. • A data warehouse-specific DBMS allows data to be physically optimized for DSS access and analysis.
Multidimensional DBMS and the Data Warehouse One of the technologies often discussed in the context of the data warehouse is multidimensional DBMS processing (sometimes called OLAP processing). Multidimensional database management systems, or data marts, provide an information system with the structure that allows an organization to have very flexible access to data, to slice and dice data any number of ways, and to dynamically explore the relationship between summary and detail data. Multidimensional DBMSs offer both flexibility and control to the end user, and as such they fit well in a DSS environment.
Though tempting to think that multidimensional DBMS technology should be the database technology for the data warehouse, in all but the most unusual cases, this is a mistake. Consider the differences between the multidimensional DBMS and the data warehouse: ■■ The data warehouse holds massive amounts of data; the multidimensional DBMS holds at least an order of magnitude less data. ■■ The data warehouse is geared for a limited amount of flexible access; the multidimensional DBMS is geared for very heavy and unpredictable access and analysis of data. ■■ The data warehouse contains data with a very lengthy time horizon (from 5 to 10 years); the multidimensional DBMS holds a much shorter time horizon of data. ■■ The data warehouse allows analysts to access its data in a constrained fashion; the multidimensional DBMS allows unfettered access. ■■ Instead of the data warehouse being housed in a multidimensional DBMS, the multidimensional DBMS and the data warehouse enjoy a complementary relationship.
Multidimensional DBMSs come in several flavors. Some multidimensional DBMSs operate on a foundation of relational technology, and some operate on a technological foundation optimal for “slicing and dicing” the data, where data can be thought of as existing in multidimensional cubes. The latter technological foundation is sometimes called a cube or OLAP foundation. Both foundations can support multidimensional DBMS data marts. But there are some differences between the two types of technological foundations.
Following is the relational foundation for multidimensional DBMS data marts: Strengths: ■■ Can support a lot of data. ■■ Can support dynamic joining of data. ■■ Has proven technology. ■■ Is capable of supporting general-purpose update processing. ■■ If there is no known pattern of usage of data, then the relational structure is as good as any other. Weaknesses: ■■ Has performance that is less than optimal. ■■ Cannot be purely optimized for access processing.
Following is the cube foundation for multidimensional DBMS data marts: Strengths: ■■ Performance that is optimal for DSS processing. ■■ Can be optimized for very fast access of data. ■■ If pattern of access of data is known, then the structure of data can be optimized. ■■ Can easily be sliced and diced. ■■ Can be examined in many ways. Weaknesses: ■■ Cannot handle nearly as much data as a standard relational format. ■■ Does not support general-purpose update processing. ■■ May take a long time to load. ■■ If access is desired on a path not supported by the design of the data, the structure is not flexible. ■■ Questionable support for dynamic joins of data.
Context and Content An Example: Suppose a manager asks for a report from the data warehouse for 1995. The report is generated, and the manager is pleased. In fact, the manager is so pleased that a similar report for 1990 is requested. Because the data warehouse carries historical information, such a request is not hard to accommodate. The report for 1990 is generated. Now the manager holds the two reports—one for 1995 and one for 1990—in his hands and declares that the reports are a disaster. The data warehouse architect examines the reports and sees that the financial statement for 1995 shows $50 million in revenue, while the report for 1990 shows a value of $10,000 for the same category. The manager declares that there is no way that any account or category could have increased in value that much in five years’ time.
Before giving up, the data warehouse architect points out to the manager that there are other relevant factors that do not show up in the report. • In 1990, there was a different source of data than in 1995. • In 1990, the definition of a product was not the same as in 1995. • In 1990, there were different marketing territories than in 1995. • In 1990, there were different calculations, such as for depreciation, than in 1995. • In addition, there were many different external considerations, such as a difference in inflation, taxation, economic forecasts, and so forth. • Once the context of the reports is explained to the manager, the contents now appear to be quite acceptable.