Design and Maintenance of Data Warehouses

Design and Maintenance of Data Warehouses Timos Sellis National Technical University of Athens KDBS Laboratoryhttp://www.dbnet.ece.ntua.gr/ Many thanks to P. Vassiliadis and A. Tsois

Outline • What’s and Why’s for DW’s • DW architecture • DW Schema • Back End of the DW • Front End of the DW • DW Servers • Metadata Repository • Conclusions

OLTP • On-line transaction processing (OLTP) is the traditional way of using a database • Legacy systems: relational, hierarchical, network databases / COBOL applications / … • Short transactions (read/update few records) with ACID properties • Normally, only the last version of data stored in the database

DSS & OLAP • Decision support systems - help the executive, manager, analyst make faster and better decisions. • What where the sales volumes by region and product category for the last year? • Will a 10% discount increase sales volumes sufficiently? • On-line analytical processing (OLAP) is an element of decision support systems (DSS)

OLTP vs. OLAP Chaudhuri & Dayal @VLDB’96

Data Warehouse • A decision support database that is maintained separately from the organization’s operational database. • S. Chaudhuri, U. Dayal, VLDB’96 tutorial • A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making. • W.H. Inmon, Building the Data Warehouse, 1992

Reasons for Building Data Warehouses • Semantic Reconciliation • Dispread data sources within the same organization • Different encoding of the same entities • DW encompasses the full volume of these data under a single, reconciled schema • Keeps the history of these data, too

Reasons for Building Data Warehouses • Performance • OLAP applications need different organization of data • Complex OLAP queries would degrade OLTP performance • Availability • Separation increases availability • Possibly the only way to query the dispread data sources

Reasons for Building Data Warehouses • Data Quality • The validity of source data is not guaranteed (data can be missing, inconsistent, out of date, violating business and database rules…) • Errors in data reach a minimum 10% in most data stores • Can lead to wasting of resources of 25-40% • DW acts as a data cleaning buffer …. and the market is there!

The Market Estimated sales in millions of dollars [ShTy98] (*estimates are from [Pend00]).

Client Client Query & Analysis Metadata Warehouse Integration Source Source Source Data Warehouse Architecture A Simple View

Quality Issues Quality Issues Data Warehouse Architecture Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

Two / Three Tier Architecture • Warehouse database server • almost always relational (RDBMS) • Data Marts / OLAP server • Relational OLAP (ROLAP) • Multidimensional OLAP (MOLAP) • Clients • Query and reporting tools • Analysis tools / Data mining tools

Data Warehouse Architecture • Enterprise warehouse: collects all information about subjects • requires extensive business modeling • may take years to design and build • Data Marts: Departmental subsets that focus on selected subjects • Virtual warehouse: views over operational dbs

How to build the DW • Top – down • Single integrated enterprise model • Reduce all sources (and clients, if necessary) to the central model • Time consuming; labor intensive; slow to produce results • Enhances the risk of the DW project due to late delivery of results • Provides a consistent, global view of the enterprise data

How to build the DW • Bottom – up • Build smaller data marts first • Progressively combine pairwise • Fails to provide a global view of the enterprise data • Possibly enhances the risk since a complete integration might prove impossible late in the project • Early delivery of results • Less time consuming, less labor intensive

Quality Issues Quality Issues Data Warehouse Back-End Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

Design: Global-As-View Integration • Preintegration. What schemata to integrate and in which order • Schema Comparison. To determine the correlations among concepts of different schemata and to detect possible naming, semantic, structural, … conflicts • Schema Conforming. Conflict resolution for heterogeneous schemata • Schema Merging and Restructuring. Production of a single conformed schema

Design: Local-As-View Integration • Works the other way around. • Main deliverable is a central conceptual model, produced by interactively examining user needs and existing schemata • All source and client schemata are expressed in terms of the central data warehouse schema and not the other way around.

PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 U PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2  Aggregate2 Sources DW DW = Materialized Views? Simple View of a DW

DWMaterialized Views ! DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2  Aggregate2 FTP2 Sources DW

Operational Processes • Data extraction, transform & load • Originally treated as the ‘refreshment’ problem • Requires to transform, clean, integrate data from different sources. • Build/refresh derived data and views • Service queries • Monitor the warehouse

The Refreshment Problem • Propagate updates on source data to the warehouse • Issues: • when to refresh • on every update • periodically • refresh policy set by administrator • how to refresh

Refreshment Techniques • Full extract from base tables • Incremental techniques • detect changes on base tables • snapshots • transaction shipping • active rules • logical correctness • transactional correctness • Currently, in practice we use ETL tools/scripts (see next)…

Data Extraction • Can take snapshot or differentials (new/deleted/updated) of source data • Transfer, encryption, compression are also involved • Time window and source system overhead involved • In general, faced with the requirement of minimal changes to existing configuration of sources

Data Transformation • Schema Reconciliation: conflicts at the schema level (different attributes for the same information) • ValueIdentification&Reconciliation: different (same) id’s for same (different) objects (use surrogate keys)

Data Cleaning • OffendingData: duplicates, integrity/business rules/format violations … • Incompleteness: missing data • Renicing: esp. addresses

Data Loading • This final stage may still require additional preprocessing: • sorting, summarizing, performing computations • Issues: • huge volumes of data to be loaded • small time window • when to build indexes and summary tables • restart after failure with no loss of data integrity

Loading Techniques • Cannot use SQL language interface to update or append data. • record at a time • too slow since it uses random disc I/O • can make rollback segment or log file to burst • Use batch load utility • sort input records on a clustering key • sequential I/O 100 times faster than random I/O • build index at the same time • use parallelism to accelerate load operations

Incremental Loading • Use incremental loads during refresh to reduce data volume (e.g. Redbrick) • insert only updated tuples • incremental load conflicts with queries • break into sequence of shorter transactions • coordinate this sequence of transactions: must ensure consistency between base and derived tables and indices.

Quality Issues Quality Issues Data Warehouse Front-End Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

Front End Tools • Ad hoc query and reporting • Example: MS Excel, ProReports • OLAP: ‘Multidimensional spreadsheet’ • pivot tables, drill down, roll up, slice, dice • Data Mining

Basic ideas for OLAP • Several numeric measures that are analyzed • sales, budget, revenue, inventory • Dimensions • contexts in which a measure appears • Example: store, product, date information associated with a sale. • each context is a dimension and the measure is a point in a multi-dimensional world

Basic ideas for OLAP • Nature of Analysis • aggregation (total sales, percent-to-total) • comparison (budget vs. expense) • ranking (top 10) • access to detailed and aggregate data • complex criteria specification • visualization

Basic ideas for OLAP • Attributes • information associated with a dimension • example: owner of store, county in which the store is located • Attribute Hierarchies • Attributes of a dimension are often related in a a hierarchical way • example: street í city ícountry

Sales volume Region Product Month Multidimensional Data Dimensions: Product, Region, DateHierarchical summarization paths: Country Year Industry Category Region Quarter City Week Product Month Day Office

Operations • Roll up: summarize data • Drill down: go from higher level summary to lower level summary or detailed data • Slice and dice: select and project • Pivot: re-orient cube

Sales volume Products Store1 Store2 Sales volume $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 Products Store1 Store2 Q1 $14,1$2,65$6,9$2,6 ElectronicsToysClothingCosmetics $12,8$1,8$7,2$1,6 Year 1996 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 Q2 Roll up

Sales volume Sales volume Electronics Store1 Store2 Products Store1 Store2 $1,4$0,6$2,0$1,2 VCRCamcorderTVCD player $1,4$0,6$2,4$1,2 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 Q1 Q1 $2,4$3,3$2,2$1,0 VCRCamcorderTVCD player $2,4$1,3$2,5$1,0 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 Q2 Q2 Drill down

Sales volume Sales volume Products Q1 Q2 Products Store1 Store2 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $8,9$0,75$4,6$1,5 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 Store 1 Q1 $5,6$1,4$2,6$1,1 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 Store 2 Q2 Pivot

Sales volume Sales volume Products Store1 Store2 Products Store1 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 $5,2$1,9 ElectronicsToys Q1 Q1 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $8,9$0,75 ElectronicsToys $7,2$0,4$4,6$0,5 Q2 Q2 Slice and Dice

Quality Issues Quality Issues Data Warehouse Server Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

Data Warehouse Servers - Outline • Server Technology: ROLAP & MOLAP • Indexing Techniques • Query Processing and Optimization

Database Servers • Relational and Specialized Relational DBMS • Relational OLAP (ROLAP) DBMS • Multidimensional OLAP (MOLAP) DBMS

Relational DBMS • Features that support DSS • Specialized Indexing techniques • Specialized Join and Scan Methods • Data Partitioning and use of Parallelism • Complex Query Processing • Intelligent Processing of Aggregates • Extensions to SQL and their processing

ROLAP Servers • Exploits services of a relational engine effectively • Key functionality • needs aggregation navigation logic • ability to generate multi statement SQL • optimize for each individual database backend • Additional services • cost-based query governor • design tool for DSS schema • performance analysis tool

Database Schemata for DW & ROLAP • Star Schema • Snowflake Schema • Fact Constellation • Aggregated data

Star Schema • A star schema consists of one central fact table and several denormalized dimension tables. • The measures of interest for OLAP are stored in the fact table (e.g. Dollar Amount, Units in the table SALES). • For each dimension of the multidimensional model there exists a dimension table (e.g. Geography, Product, Time, Account) with all the levels of aggregation and the extra properties of these levels.

Star Schema Stanford Technology Group, Inc., 1996 Time Geography Time Code Geography Code Quarter Code Region Code Quarter Name Region Manager SALES Month Code State Code Geography Code Month Name City Code Time Code Date ..... Account Code Product Code Dollar Amount Account Product Units Account Code Product Code KeyAccount Code Product Name KeyAccountName Brand Code Account Name Brand Name Account Type Prod. Line Code Account Market Prod. Line Name

Snowflake Schema • The normalized version of the star schema • Explicit treatment of dimension hierarchies (each level has its own table) • Easier to maintain, slower in query answering

Design and Maintenance of Data Warehouses

Design and Maintenance of Data Warehouses

Presentation Transcript

Data Warehouses and Data Cubes

Data Warehouses

Data Warehouses

Data Warehouses and OLAP

Data Warehouses, OLAP and Data Mining

Data Warehouses, Decision Support and Data Mining

Data Warehouses and OLAP

Conceptual, Logical, and Physical Design of Data Warehouses

Data Warehouses

DATABASE SYSTEMS, DATA WAREHOUSES, AND DATA MARTS

Data Warehouses and OLAP

Data Warehouses

Data Warehouses and OLAP

Data Warehouses

A Comparsion of Databases and Data Warehouses

DATABASE SYSTEMS, DATA WAREHOUSES, AND DATA MARTS

Chapter 15 Data Warehouses and Data Mining

Data Warehouses and OLAP Data Management

Data Warehouses

Data warehouses

Databases and Data Warehouses