200 likes | 254 Views
DATA WAREHOUSING. Introduction. Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information gap:
E N D
Introduction • Modern organizations have huge amounts of data but are starving for information – facing information gap! • Reasons for information gap: • Fragmented information systems, uncoordinated (sometimes inconsistent) databases /islands of information due to time and resources constraints • Most systems support operational /transaction processingrather than informational processing. • Bridging the information gap are data warehouses!
Introduction (Contd.) • A data warehouse bridges this information gap as • it consolidates and integrates information from internal and external sources and • arranges it in a meaningful format for making accurate business decisions • The two noticeable pioneers in the DWH field are Ralph Kimball and Bill Inmon. • The term Data Warehouse was coined by Bill Inmon in 1990
What is a Data Warehouse? • Inmon defines a data warehouse as follows: • "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". • Description of the above terms according to Inmon • Subject Oriented: • Data that gives information about a particular subject instead of about a company's ongoing operations. • Integrated: • Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: • All data in the data warehouse is identified with a particular time dimension. The time factor can be used to study trends and changes. • Non-volatile • Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
What is a Data Warehouse? • This definition by Inmon remains reasonably accurate almost ten years later. However, • a single-subject data warehouse is typically referred to as a data mart, while data warehouses are generally enterprise in scope. • data warehouses can be volatile. • Due to the large amount of storage required for a data warehouse, (multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse. • For instance, if three years of data are decided on and loaded into the warehouse, every month the oldest month will be "rolled off" the database, and the newest month added.
What is a Data Warehouse? • Ralph Kimball provided a much simpler definition of a data warehouse. • a data warehouse is "a copy of transaction data specifically structured for query and analysis". • This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.
The Need for Data warehousing • Two factors drive the need for data warehousing • Need of company-wide view of data • Need to separate operational* and informational* systems * See slide notes
Data warehousing? • It is the process of creating, populating, and then querying a data warehouse • Components of data warehousing are • Source Systems Identification • Data warehouse Design and Creation • Data Acquisition • Changed data capture • Data Cleansing • Data Aggregation • Business Intelligence • Multidimensional Analysis Tools • Query Tools • Data Mining Tools • Data Visualization Tools • Metadata Management
Components of Data warehousing • Source System Identification • Appropriate data must be located to build a data warehouse. • This data comes from • Current OLTP systems (providing day-to-day information) • `Legacy systems (providing historical data for prior periods) • Data Warehouse Design and Creation • The design is often an iterative process • Requires effort to understand database schema and a great deal of user interaction
Components of Data warehousing • Data Acquisition • Involves moving company data from source systems into warehouse • Time consuming activity • Performed using ETL (Extract/Transform/ Load) Tools • Data acquisition is an ongoing, scheduled process. Warehouse is refreshed monthly • Changed data capture • Periodic update of warehouse from transactional systems is complicated as it is difficult to identify which records in source have changed since last update. • Some technologies used in this area are Replication servers, Publish/Subscribe, Triggers and Stored procedures, and Database Log analysis.
Components of Data warehousing • Data Cleansing • Typically performed in conjunction with data acquisition (part of “T” in “ETL”) • A complicated process that validates and if required corrects data before its loaded into warehouse • Example • The entries for "Customer Name" may appear differently in various source systems for the same customer. • one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". • A decision must be made as to which is correct, and then the data cleansing tool will change the others to match the rule. • This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files" or "sequential files").
Components of Data warehousing • Data Aggregation • If required, it is performed during “T” phase of “ETL” • Data warehouse can be designed to store data at detail level (each individual transaction), at some aggregate level (summary data) or a combination of both. • The advantage of summarized data is that typical queries against the warehouse run faster. • The disadvantage is that information, which may be needed to answer a query, is lost during aggregation • Business Intelligence (BI) • Contains technologies such as Decision Support Systems (DSS), Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more.
Components of Data warehousing • BI can be broken down into four broad fields • Multidimensional Analysis Tools • Allow user to look at data from various different angles • Often use a multidimensional database called cube • Query Tools • Allow user to issue SQL queries against warehouse and get results • Data Mining Tools • Automatically search for patterns in data • Driven by complex statistical formulas • Data Visualization Tools • Graphically represent data including 3D data pictures
Components of Data warehousing • Metadata management • Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place • The datatype (e.g., string or integer) of the column is metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data • Metadata is stored in metadata repository • Metadata is useful in almost all components of data warehousing discussed earlier
Data Mining • Definition: • Knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence and computer graphics • Goals • To explain observed events or conditions, such as why sales of a product have increased in a particular area • To confirm hypothesis, such as, whether two-income families are more likely to buy family medical cover than single-income families • To analyze data for new or unexpected relationships, such as what spending patterns are likely to accompany credit card fraud.
Data Mining Techniques • Case-based reasoning • Derives rules from real world case examples • Rule discovery • Searches for patterns and correlations in large data sets • Signal Processing • Identifies clusters of observations with similar characteristics • Neural nets • Develops predictive models based on principles modeled after the human brain • Fractals • Compresses large databases without losing information
Some Data Mining Applications • Analysis of business trends • Identifying markets with above average or below average growth • Target marketing • Identifying customers for promotional activity • Usage Analysis • Identifying usage patterns of products and services • Product Affinity • Identifying products that are purchased concurrently or characteristics of shoppers • For further application types - refer to book page 437 table 11.5
Reading Assignment • Read these concepts from Book • The ETL process • Definitions • @ctive data warehouse • Data mart • Independent data mart • dependent data mart • EDW • Star Schema • Snowflake Schema • OLAP • MOLAP • ROLAP
Resources • http://www.intranetjournal.com/features/datawarehousing.html • Modern Database Management, 6/e