1.16k likes | 2.73k Views
Data Warehouses. Richard Goerwitz. What’s a Data Warehouse?. People disagree on what a DW is Instead of defining one up front I’m going to Talk about key concepts used by data warehouse architects Show you an example of a small data warehouse component
E N D
Data Warehouses Richard Goerwitz
What’s a Data Warehouse? • People disagree on what a DW is • Instead of defining one up front I’m going to • Talk about key concepts used by data warehouse architects • Show you an example of a small data warehouse component • Talk about how you can repeat what I’ve done
Key Terms, Acronyms • By the end of this talk, you’ll know what these things all mean: • OLTP • OLAP • ETL • Star schema • Conformed dimension • Data mart • Data warehouse
OLTP • OLTP = online transaction processing • The process of moving data around to handle day-to-day affairs • Scheduling classes • Registering students • Recording grades • Recording payments, etc. • Systems supporting this kind of activity are called transactional systems
Transactional Systems • Transactional systems are optimized primarily for the here and now • They allow for constant change • Don’t record all previous data states • Name changes • Course rosters, etc. • Support many simultaneous users • Permit often heavy read/write access
Transactional Databases • Databases that support transactional systems must be heavily normalized • This means they should - • Shun duplicate data (1NF) • Keep dependent data with the stuff it depends on (2NF) • Keep data dependencies straightforward within each database table (3NF) • Structure tables so that changes only “lock up” a small part of the DB at a time
Carleton Databases • Here are some major DBs at Carleton • Informix – Card access • UniData 6.1 – SIS, financials, etc. • MS SQL Server 2000 – Asset tracking, etc. • Oracle 9i – Alumni, document management • MySQL 4.x – web system back ends • PostgreSQL 8.x – courses like this one • You tell me - • Which ones are transactional in nature?
OLAP • OLAP = online analytical processing • The process of analyzing data derived from OLTP (and other) systems • What might we analyze? • Class enrollments, by department • Courses taught, by department • Who is majoring in what • All of the above, correlated against an additional time dimension
OLAP Support Systems • OLAP support systems are optimized for analysis • What does this mean? • They are geared mainly for read access • They support fewer simultaneous users • They hold snapshots of OLTP data • Provide history • Give us time depth to our analyses • They change relatively slowly
ETL • When OLAP support systems change, the changes are typically made via periodic (e.g., nightly) ETL processes • ETL = extract, transform, load • ETL means specifically - • Extracting data from various sources • Transforming and cleaning the data • Loading data into databases used for analysis and reporting
Star Schemas • ETL processes need to load data into a database • How does this database look? • It’s structurally simple • Separates data into fact tables and dimension tables • Reorganizes fact and dimension tables into a series of star-like schemas • (Don’t worry, I’ll be showing you a graphic example of one of these.)
Data Marts • A data mart is really just a series of star schemas • Dimension tables in a series of star schemas must be identical or at least conformed • By conformed I mean • The tables are proper subsets of each other • They are hierarchized the same way internally
Data Warehouses • So what is a data warehouse (DW)? • As noted, people use this term loosely, in various ways • But in general we can say that a DW: • Encompasses a number of smaller data marts • Derives its data from multiple back-end systems • Is optimized for OLAP, not OLTP • Holds historical snapshots • Must be heavily documented