250 likes | 529 Views
Advanced Databases Data Warehouses Dr Theodoros Manavis tmanavis@ist.edu.gr. Which are our lowest/highest margin customers ?. Who are my customers and what products are they buying?. What is the most effective distribution channel?.
E N D
Advanced Databases Data Warehouses Dr TheodorosManavis tmanavis@ist.edu.gr
Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom--otions have the biggest impact on revenue? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? A producer wants to know….
Data, Data everywhere yet ... • I can’t find the data I need • data is scattered over the network • many versions, subtle differences • I can’t get the data I need • need an expert to get the data • I can’t understand the data I found • available data poorly documented • I can’t use the data I found • results are unexpected • data needs to be transformed from one form to other
OLTP Application Oriented Used to run business Detailed data Current up to date Isolated Data Repetitive access Clerical User Warehouse (DSS) Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Ad-hoc access Knowledge User (Manager) OLTP vs Data Warehouse
Application-Orientation Subject-Orientation Operational Database Data Warehouse Credit Card Customer Loans Vendor Product Trust Savings Activity Application-Orientation vs. Subject-Orientation
OLTP Performance Sensitive Read/Update Access Database Size 100MB -100 GB Data Warehouse Performance relaxed Mostly Read (Batch Update) Database Size 100 GB - few terabytes OLTP vs Data Warehouse
OLTP Transaction throughput is the performance metric Thousands of users Data Warehouse Query throughput is the performance metric Hundreds of users OLTP vs Data Warehouse
Terabytes -- 10^12 bytes: Petabytes -- 10^15 bytes: Exabytes -- 10^18 bytes: Zettabytes -- 10^21 bytes: Zottabytes -- 10^24 bytes: Walmart -- 24 Terabytes Geographic Information Systems National Medical Records Weather images Intelligence Agency Videos Very Large Data Bases
Data Warehousing Concepts • A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993).
Subject-oriented Data • The warehouse is organized around the major subjects of the business (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and product sales). • This is reflected in the need to store decision-support data rather than application-oriented data.
Integrated Data • The data warehouse integrates corporate application-oriented data from different source systems, which often includes data that is inconsistent. • The integrated data source must be cleaned up and made consistent to present a unified view of the data to the users.
Time-variant Data • Data in the warehouse is only accurate and valid at some point in time or over some time interval. • Time is associated with all data . The data represents a series of snapshots of historical data.
Non-volatile Data • Data in the warehouse is not normally updated in real-time (RT) but is refreshed from operational systems on a regular basis. It is non-volatile as you can’t change history!) • However, there is an emerging trend is towards RT or near RT DWs) • New data is always added as a supplement to the database, rather than a replacement.
Why Do We Need Data Warehouses? • Consolidation of information resources • Improved query performance • Separate research and decision support functions from the operational systems • Foundation for data mining, data visualization, advanced reporting and OLAP tools
Benefits of Data Warehousing • Potential high returns on investment • Competitive advantage – possible advantage over rival brands • Increased productivity of corporate decision-makers (they can make more money/profit)
Design Differences Operational System Data Warehouse Star Schema ER Diagram
Data Warehouses, Data Marts, and Operational Data Stores • Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group. • Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated. SOURCE: Ralph Kimball
Operational Data Sources • Main sources are online transaction processing (OLTP) databases. • Also include sources such as personal databases and spreadsheets, Enterprise Resource Planning (ERP) files, and web usage log files.
Operational Data Store (ODS) • Holds current and integrated operational data for analysis. • Often structured and supplied with data in the same way as the data warehouse. • May act as staging area for data to be moved into the warehouse. • Often created when legacy operational systems are found to be incapable of achieving reporting requirements.
Warehouse Manager • Performs all the operations associated with the management of the data in the warehouse such as: • Analysis of data to ensure consistency. • Transformation and merging of source data from temporary storage into data warehouse tables. • Creation of indexes and views on base tables. • Generation of denormalizations, (if necessary). • Generation of aggregations, (if necessary). • Backing-up and archiving data.
ETL Manager • Data for a DW must be extracted from one or more data sources, transformed into a form that is easy to analyze and consistent with data already in the warehouse, and then finally loaded into the DW. • Nowadays there are tools that automate the extraction, transformation, and loading (ETL) processes.
Data Mart • A database that contains a subset of corporate data to support the analytical requirements of a particular business unit (such as the Sales department) or to support users who share the same requirements to analyse a particular business process (such as property sales).