CS 345: Topics in Data Warehousing

CS 345:Topics in Data Warehousing Tuesday, September 28, 2004

Outline of Today’s Class • What is data warehousing? • Transaction processing vs. data analysis • Course logistics • Data integration

A Brief History of Information Technology • The “dark ages”: paper forms in file cabinets • Computerized systems emerge • Initially for big projects like Social Security • Same functionality as old paper-based systems • The “golden age”: databases are everywhere • Most activities tracked electronically • Stored data provides detailed history of activity • The next step: use data for decision-making • The focus of this course! • Made possible by omnipresence of IT • Identify inefficiencies in current processes • Quantify likely impact of decisions

Databases for Decision Support • 1st phase: Automating existing processes makes them more efficient. • Automation → Lots of well-organized, easily accessed data • 2nd phase: Data analysis allows for better decision-making. • Analyze data → better understanding • Better understanding → better decisions • “Data Entry” vs. “Thinking” • Data analysts are decision-makers: managers, executives, etc.

OLTP: On-Line Transaction Processing Many short transactions (queries + updates) Examples: Update account balance Enroll in course Add book to shopping cart Queries touch small amounts of data (one record or a few records) Updates are frequent Concurrency is biggest performance concern OLAP: On-Line Analytical Processing Long transactions, complex queries Examples: Report total sales for each department in each month Identify top-selling books Count classes with fewer than 10 students Queries touch large amounts of data Updates are infrequent Individual queries can require lots of resources OLTP vs. OLAP

Why OLAP & OLTP don’t mix (1) Different performance requirements • Transaction processing (OLTP): • Fast response time important (< 1 second) • Data must be up-to-date, consistent at all times • Data analysis (OLAP): • Queries can consume lots of resources • Can saturate CPUs and disk bandwidth • Operating on static “snapshot” of data usually OK • OLAP can “crowd out” OLTP transactions • Transactions are slow → unhappy users • Example: • Analysis query asks for sum of all sales • Acquires lock on sales table for consistency • New sales transaction is blocked

Why OLAP & OLTP don’t mix (2) Different data modeling requirements • Transaction processing (OLTP): • Normalized schema for consistency • Complex data models, many tables • Limited number of standardized queriesand updates • Data analysis (OLAP): • Simplicity of data model is important • Allow semi-technical users to formulate ad hocqueries • De-normalized schemas are common • Fewer joins → improved query performance • Fewer tables → schema is easier to understand

Why OLAP & OLTP don’t mix (3) Analysis requires data from many sources • An OLTP system targets one specific process • For example: ordering from an online store • OLAP integrates data from different processes • Combine sales, inventory, and purchasing data • Analyze experiments conducted by different labs • OLAP often makes use of historical data • Identify long-term patterns • Notice changes in behavior over time • Terminology, schemas vary across data sources • Integrating data from disparate sources is a major challenge

Data Warehouses • Doing OLTP and OLAP in the same database system is often impractical • Different performance requirements • Different data modeling requirements • Analysis queries require data from many sources • Solution: Build a “data warehouse” • Copy data from various OLTP systems • Optimize data organization, system tuning for OLAP • Transactions aren’t slowed by big analysis queries • Periodically refresh the data in the warehouse

Course Logistics • Course web site:http://cs345.stanford.edu • Course format will be lecture-based • As opposed to a paper-reading course • Prerequisite: • Knowledge of SQL

Assigned Work • Five homework assignments • One problem set • Four programming assignments • Not a lot of code to write • Emphasis will be on interacting with Oracle • Course project • Open-ended • Focus on a topic of your choosing • Any of these types: • Research project, or… • Programming project, or… • Survey of research literature • May be done individually or in groups of two • Final Exam

High-Level Course Outline • Logical Database Design • How should the data be modeled? • Designing the data warehouse schema • Query Processing • Analysis queries are hard to answer efficiently • What techniques are available to the DBMS? • Physical Database Design • How should the data be organized on disk? • What data structures should be used? • Data Mining • What use is all this data? • Which questions should we ask our data warehouse?

Additional Topics • Related topics to be touched on briefly: • Data integration • Data cleaning • Approximate query answering • Data lineage • Data visualization • Incremental maintenance of materialized views • Answering queries using views • Indexing special data types (spatial, text, geographic) • Metadata management • Projects can be done in these areas

The Textbook • “The Data Warehouse Toolkit”by Ralph Kimball and Margy Ross • Written by well-known data warehouse designer • Clearly written and readable • Lots of generic but realistic examples • Semi-technical (no math!) • Business-focused • We’ll use it for the first one-third of the course • Get the second edition!

Course Objectives • Gain practical understanding of how data warehouses are built and used • Gain exposure to data modeling “best practices” • Learn techniques used to process complex queries over very large data sets • Understand the performance trade-offs that come from alternative data structures • Learn commonly-used methods for mining and analysis of large data sets • Become familiar with current research directions in data warehousing and related areas

Loading the Data Warehouse Data is periodically extracted Data is cleansed and transformed Users query the data warehouse Data Staging Area Data Warehouse Source Systems (OLTP)

Data Integration is Hard • Data warehouses combine data from multiple sources • Data must be translated into a consistent format • Data integration represents ~80% of effort for a typical data warehouse project! • Some reasons why it’s hard: • Metadata is poor or non-existent • Data quality is often bad • Missing or default values • Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California) • Inconsistent semantics • What is an airline passenger?

Rewritten Queries Query Extraction Query Answer Answer Federated Databases • An alternative to data warehouses • Data warehouse • Create a copy of all the data • Execute queries against the copy • Federated database • Pull data from source systems as needed to answer queries • “lazy” vs. “eager” data integration Mediator Warehouse Source Systems Source Systems Data Warehouse Federated Database

Warehouses vs. Federation • Advantages of federated databases: • No redundant copying of data • Queries see “real-time” view of evolving data • More flexible security policy • Disadvantages of federated databases: • Analysis queries place extra load on transactional systems • Query optimization is hard to do well • Historical data may not be available • Complex “wrappers” needed to mediate between analysis server and source systems • Data warehouses are much more common in practice • Better performance • Lower complexity • Slightly out-of-date data is acceptable

CS 345: Topics in Data Warehousing