Chapter 1: Introduction to Data Mining, Warehousing, and Visualization
Modern Data Warehousing, Mining, and Visualization: Core Concepts, by George M. Marakas
Spring 2012
Modern Data Warehousing, Mining & Visualization, 2003, George Marakas
Objectives
• What is the purpose and motivation for developing a Data Warehouse (DW)?
• Position of DW within IT infrastructure
• Relationship between DW and business data mart
• What can a DW do?
• Foundations for Data Mining
• Steps in a typical data mining project
• What is a “Correlation”? [KEY CONCEPT]
• History of Data Visualization vis-à-vis DW
1-1: The Modern Data Warehouse
• A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting.
• Note that the data warehouse contains a copy of the transactions. The copy is not updated or changed later by the transaction system.
• Also note that this data is specially structured, and may have been transformed when it was placed in the warehouse.
1-2: Data Warehouse Roles and Structures
The DW has the following primary functions:
• It is a direct reflection of the business rules of the enterprise.
• It is the collection point for strategic information.
• It is the historical store of strategic information.
• It is the source of information later delivered to data marts.
• It is the source of stable data regardless of how the business processes may change.
Elements of a DW: Extract, Transform, Store [ETS]
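The three ETS elements can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the slides: the sample rows, the `sales_fact` table, the function names, and the use of an in-memory SQLite database as a stand-in for the warehouse.

```python
import sqlite3

# Hypothetical operational rows: (order_id, customer, amount_cents, order_date)
SOURCE_ROWS = [
    (1, " alice ", 1250, "2003-01-15"),
    (2, "BOB", 399, "2003-01-15"),
    (3, "Alice", 2000, "2003-02-01"),
]


def extract():
    """Extract: read rows from the (simulated) transaction system."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: tidy customer names, convert cents to dollars, derive the month."""
    return [
        (order_id, customer.strip().title(), cents / 100.0, date[:7])
        for order_id, customer, cents, date in rows
    ]


def store(conn, rows):
    """Store: append the transformed copy to a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER, customer TEXT, amount REAL, month TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def monthly_totals(conn):
    """A typical reporting query against the warehouse copy."""
    cur = conn.execute(
        "SELECT month, SUM(amount) FROM sales_fact GROUP BY month ORDER BY month"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
store(conn, transform(extract()))
totals = monthly_totals(conn)
```

The source rows are never modified. The warehouse holds a separately structured copy, which is the property the definition of a data warehouse emphasizes.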
Position of the Data Warehouse Within the Organization – Figure 1-2
Data Mining Example: Service Quality vs. Training (Courtesy: MicroStrategy, 2005)
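Since “correlation” is flagged as a key concept, a small worked sketch may help. It computes the Pearson correlation coefficient, one common measure of how closely two variables move together. The numbers below are invented for illustration only and are not the MicroStrategy data; `pearson_r`, `training_hours` and `service_scores` are hypothetical names.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative data: hours of training vs. a service quality score.
training_hours = [2, 4, 6, 8, 10]
service_scores = [55, 60, 70, 72, 85]

r = pearson_r(training_hours, service_scores)
print(f"Pearson r = {r:.3f}")
```

A value near +1 indicates that the two measures rise together, a value near -1 that one falls as the other rises, and a value near 0 that there is little linear relationship. A strong correlation alone does not show that training causes better service.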
Examples of Common DW Applications – Table 1-1
Comparison of Typical DW Costs and Benefits
1-4: The Cost of DW
• Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs.
• The initial costs can be further divided into hardware and software costs.
• Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse).
• Cost of a Data Warehouse, rule of thumb: $1 million per 1 terabyte of data.
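The rule of thumb and the one-time versus recurring split can be turned into simple arithmetic. This is only a sketch: the $1 million per terabyte figure is the slide's rule of thumb, while the function names and the example recurring cost are hypothetical.

```python
# Rule of thumb quoted on the slide: $1 million per terabyte of data.
COST_PER_TERABYTE_USD = 1_000_000


def estimate_dw_cost(terabytes, cost_per_tb=COST_PER_TERABYTE_USD):
    """Rough warehouse cost from data volume, using the per-terabyte rule of thumb."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * cost_per_tb


def total_cost(one_time, recurring_per_year, years):
    """One-time initial costs plus recurring costs accumulated over a period."""
    return one_time + recurring_per_year * years


initial = estimate_dw_cost(2.5)  # a hypothetical 2.5 TB warehouse
# Hypothetical recurring cost of $300,000 per year over three years.
three_year_total = total_cost(initial, 300_000, 3)
print(f"initial: ${initial:,.0f}; three-year total: ${three_year_total:,.0f}")
```

Keeping the per-terabyte rate as a parameter makes it easy to substitute a current figure for the one quoted on the slide.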
Expenditures Associated with Building a DW – Table 1-3
1-5: Data Mining: Farmers and Explorers
• Every corporation has two types of DW users.
• Farmers [traditional statistical hypothesis testing] know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.
• Explorers [data mining] are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless “golden” nuggets.
• Cost justification for the DW is usually based on the results obtained by farmers, since the results of explorers are unpredictable.
1-6: Foundations of Data Mining
• Data mining is the process of using raw data to infer important business relationships.
• Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.
• It is a collection of powerful techniques intended for analyzing large datasets.
• There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.
1-6 & 1-7: The Foundations of Data Mining
• Data mining has roots in practice dating back over 30 years, using standard statistics [e.g., biostatistics].
• In the early 1960s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.
• By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.
• Also, DSS tools came into popular use in the 1980s, with tools such as Lotus 1-2-3 and Excel.
Data Mining – A General Approach
Although all data mining endeavors are unique, they possess a common set of process steps:
• Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools
• Exploration – looking at summary data, sampling and applying intuition [data visualization is useful here]
• Analysis – each discovered pattern is analyzed for significance and trends
A General Approach (continued)
• Interpretation – once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.
• Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.
The Data Warehouse and Data Mining
• Data mining does not require the use of a data warehouse (DW); however, DWs are designed with data mining in mind.
• The data in the DW is integrated and stable (non-volatile).
• Data changes continuously in an operational database.
• If multiple analyses are run in sequence, the data needs to be held constant (as in a DW).
Volumes of Data – The Biggest Challenge
• The largest challenge a “data miner” may face is the sheer volume of data in the warehouse.
• It is quite important, then, that summary data also be available to get the analysis started.
• A major problem is that this sheer volume may mask the important relationships the analyst is interested in.
• The ability to overcome the volume and visualize the data becomes quite important.
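One way to get an analysis started on a large table is to compute summary data and draw a random sample, as the points above suggest. The sketch below uses randomly generated rows as a hypothetical stand-in for a warehouse fact table; the region names, row count and function names are all assumptions.

```python
import random
import statistics


def summarize(values):
    """Summary statistics for one group of numeric values."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }


def sample_rows(rows, k, seed=0):
    """Draw a reproducible random sample of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical stand-in for a large fact table: (region, sale_amount).
rng = random.Random(42)
rows = [
    (rng.choice(["North", "South", "East", "West"]), round(rng.uniform(10, 500), 2))
    for _ in range(100_000)
]

# Summary data: one small record per region instead of 100,000 raw rows.
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)
summary = {region: summarize(amounts) for region, amounts in by_region.items()}

# A 1% sample is small enough to plot or inspect by hand.
subset = sample_rows(rows, 1_000)
```

The summary gives the analyst a first overview, and the sample is a manageable subset for visualization before queries are run against the full volume.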
1-9: Foundations of Data Visualization [DV]
• One of the earliest known examples of data visualization was in London during the 1854 cholera epidemic. A map (next slide) helped to identify the source of the disease.
• Modern visualization techniques grew from the twin technologies of computer graphics and high-performance computing in the 1970s and 1980s.
Dr. John Snow used a map to show that the source of cholera was a water pump, thus proving the disease was waterborne.
[Map label: Broad Street Pump]
DV: Opportunity and Timing
• Alternative input devices (light pen, sketch pad and mouse) began to appear in the 1960s.
• In the 1970s, flight simulators became much more realistic when graphics replaced film.
• In the same decade, special effects computers became entrenched in the entertainment industry.
• In the 1980s, visualization grew more dynamic with applications like the animation of weather patterns.
Data Visualization – Sales by Region (Typical Spreadsheet Graphic)
Data Visualization – Total Precipitation
DV & DM: Future Success Drivers
• In the 1990s, rapid advances in chip technology, in both CPUs and graphics processors, put data visualization everywhere.
• Ongoing reductions in the cost of computing
  • Each new generation brings a 10X-100X performance-cost improvement.
  • A new generation arrives approximately every 18 months [Moore’s Law].
• Web-based e-commerce
  • Business-to-consumer commerce [B to C; and C:C]
  • Generates billions and even trillions of characters per reporting period
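The compounding behind these figures can be sketched as simple arithmetic. The 18-month generation length and the 10X lower bound come from the slide. The default factor of 2 is an assumption reflecting the way Moore's Law is commonly stated (a doubling per period), and `improvement_factor` is a hypothetical helper name.

```python
def improvement_factor(years, factor_per_generation=2.0, months_per_generation=18):
    """Cumulative improvement after `years`, compounding once per generation."""
    generations = (years * 12) / months_per_generation
    return factor_per_generation ** generations


# Three years is two 18-month generations.
doubling = improvement_factor(3)          # assumes a doubling per generation
slide_low = improvement_factor(3, 10.0)   # the slide's 10X lower bound
print(f"3 years at 2X per generation: {doubling:.0f}X")
print(f"3 years at 10X per generation: {slide_low:.0f}X")
```

Treating the per-generation factor as a parameter shows how sensitive the long-run estimate is to that assumption.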
The End