1.2k likes | 1.23k Views
Data Warehouse & OLAP Introduction. Sesi: 03-04 Dosen Pembina : Danang Junaedi. Interpretation/ Evaluation. Data Mining. Knowledge. Transformation. Patterns / Models. Preprocessing. Transformed Data. Selection. Preprocessed Data. Target Data. Data Sources.
E N D
Data Warehouse & OLAP Introduction Sesi: 03-04 Dosen Pembina : Danang Junaedi
Interpretation/ Evaluation Data Mining Knowledge Transformation Patterns / Models Preprocessing Transformed Data Selection Preprocessed Data Target Data Data Sources Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P, From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996. The Knowledge Discovery Process
Data Sources • Relational Databases • Data Warehouses • WWW • Audio • Video • Printed Materials : :
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000 Multidimensional Data Cube
Evolution of DSS • Transaction Processing Systems (TPS) • Operational data stores and OLTP • Batch reports, hard to find and analyze information, inflexible and expensive, reprogram every new request (circa 60’s) • MIS • Management reporting from transactions in TPS • Still inflexible, not integrated with desktop tools (circa 70’s) • DSS • Combine data with analytic models or expert rules • Integration with desktop tools (80’s) • Data Warehousing • Data integrated after (cleaning and scrubbing) from multiple sources (both internal and external to the organization) • OLAP is the technology used to study the data in terms of operations on a multi-dimensional data set • Data warehousing also supports processing of data by analytic methods and permits data mining (90’s)
Applications • Retail - inventory management, promotions • Manufacturing - order shipment • Insurance – policy and claims tracking • Telecommunications - call analysis • Financial – account tracking • CRM/eCRM – customer profiling, clickstream analysis • Healthcare – disease management, patient and physician profiling
Databases for Decision Support • Transaction Processing systems are optimized for performance • Data they capture are too detailed to be of use for decision support purposes • Online Analytical Processing (OLAP) imposes very different demands on databases than does Online Transaction Processing (OLTP)
Integration System World Wide Web Personal Databases Digital Libraries Scientific Databases Heterogeneous Database Integration • Collects and combines information from disparate sources • Provides integrated view, and a uniform user interface • Supports sharing of data between entities
Why look at data in this way? • What would be the demand for services (forecasting)? • Who are our key customers/patients, and • What are the margins/outcomes? (profitable customers/satisfied patients) • How do we market to them/treat them? • What pricing/treatment strategy is desirable? • What are their preferences? • What type of customer/patient services are required? • What services when packaged result in higher/better sales/revenues/margins/outcomes, efficient workflow? • Which promotion/patient education/counseling works or does not work and why? • What is the inventory/patient turnover? • Which channel/technology is more effective/profitable? • Why do margins/outcomes differ from one place to another or one patient to another?
Data Warehousing and Industry • One of the hottest topic in IS. • Over 90% of larger companies either have a DW or are starting one. • Warehousing is big business • $2 billion in 1995 • $3.5 billion in early 1997 • $8 billion in 1998 [Metagroup] • over $200 billion over next 5 years.
Data Warehousing and Industry (2) • A 1996 study of 62 data warehousing projects showed: • An average return on investment of 321%, with an average payback period of 2.73 years. • WalMart has largest warehouse • 900-CPU, 2,700 disk, 23 TB Teradata system • ~7TB in warehouse • 40-50GB per day
Why Data Warehousing? • Advance of information technology. • Data collected in huge amounts. • Need to make good use of data? • Architecture and tools to • Bring together scattered information from multiple sources to provide consistent data source for decision support. • Support information processing by providing a solid platform of consolidated, historical data for analysis.
Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom--otions have the biggest impact on revenue? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? A producer wants to know….
Data, Data everywhere yet ... • I can’t find the data I need • data is scattered over the network • many versions, subtle differences • I can’t get the data I need • need an expert to get the data • I can’t understand the data I found • available data poorly documented • I can’t use the data I found • results are unexpected • data needs to be transformed from one form to other
What are the users saying... • Data should be integrated across the enterprise • Summary data has a real value to the organization • Historical data holds the key to understanding data over time • What-if capabilities are required
What is a Data Warehouse? • Defined in many different ways non-rigorously. • A DB for decision support. • Maintained separatelyfrom an organization’s operational database. • A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.— W. H. Inmon • 90% of major organizations have or are building some kind of data warehouse. • A decision support database that is maintained separately from the organization’s operational databases.
Data Warehouse • A data warehouse is a • subject-oriented • integrated • time-varying • non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996
Information Data What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996]
Data Warehousing -- It is a process • Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible • A decision support database maintained separately from the organization’s operational database
Explorers, Farmers and Tourists Tourists: Browse information harvested by farmers Farmers: Harvest information from known access paths Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
Data Warehouse Architecture OLAP / Decision support/ Data cubes/ data mining User queries Relational database (warehouse) Data extraction programs Data cleaning/ scrubbing Data source Data source Data source
RelationalDatabases ExtractionCleansing Optimized Loader ERP Systems Data Warehouse Engine AnalyzeQuery Purchased Data LegacyData Metadata Repository Data Warehouse Architecture
OLAP servers Data Warehouse Architecture Clinical System Payroll System (e.g., AS400) Meta-Data Data Mining Tools Oracle Financials on HP 9000 Data Warehouse Billing System Transformation Integration Access, Files (Industry Reports) Other Internal Data External Data Excel Web Other
Extraction, Transformation, & Load (ETL) • ETL is a set of tools and techniques used to populate a data warehouse • Extraction • Extract data from sources (e.g., operational DBMSs, file systems, Web pages) • Transformation • Clean data • Convert from legacy/host format to warehouse format (e.g., convert “surname” to “last name”)
Extraction, Transformation, & Load (ETL) • Load • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition • Huge volumes of data to be loaded, yet small time window (usually at night) when the warehouse can be taken off-line • Techniques: batch, sequential load often too slow; incremental, parallel loading techniques may be used • Refresh • Propagate updates from sources to the warehouse • When to refresh - on every update, periodically (e.g., every 24 hours), or after “significant” events • How to refresh – full extract from base tables vs. incremental techniques
Data Mart • A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications. • An independent data mart is created directly from source systems. • A dependent data mart is populated from a data warehouse.
Data Warehouse vs. Data Marts • Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization. • Requires extensive business modeling • May take years to design and build • Data Marts: departmental subsets that focus on selected subjects: Marketing data mart: customer, products, sales. • Faster roll out, but complex integration in the long run.
Data Warehouse for Decision Support & OLAP • Putting Information technology to help the knowledge worker make faster and better decisions • Which of my customers are most likely to go to the competition? • What product promotions have the biggest impact on revenue? • How did the share price of software companies correlate with profits over last 10 years?
Decision Support • Used to manage and control business • Data is historical or point-in-time • Optimized for inquiry rather than update • Use of the system is loosely defined and can be ad-hoc • Used by managers and end-users to understand the business and make judgements
The Complete Decision Support System (Source: Franconi) Information Sources Data Warehouse Server (Tier 1) OLAP Servers (Tier 2) Clients (Tier 3) e.g., MOLAP Analysis Semistructured Sources Data Warehouse serve extract transform load refresh etc. Query/Reporting serve e.g., ROLAP Operational DB’s serve Data Mining Data Marts
Data Mining works with Warehouse Data • Data Warehousing provides the Enterprise with a memory • Data Mining provides the Enterprise with intelligence
We want to know ... • Given a database of 100,000 names, which persons are the least likely to default on their credit cards? • Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer? • If I raise the price of my product by Rs. 2, what is the effect on my ROI? • If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result? • If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues? • Which of my customers are likely to be the most loyal? Data Mining helps extract such information
Why Separate Data Warehouse? • Performance • Op dbs designed & tuned for known txs & workloads. • Complex OLAP queries would degrade perf. for op txs. • Special data organization, access & implementation methods needed for multidimensional views & queries. • Function • Missing data: Decision support requires historical data, which op dbs do not typically maintain. • Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. • Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
Data Warehouse vs. Heterogeneous DBMS • Traditional heterogeneous DB integration: • Build wrappers/mediators on top of heterogeneous databases • Query driven approach • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set • Complex information filtering, compete for resources • Data warehouse: update-driven, high performance • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS • OLTP (on-line transaction processing) • Major task of traditional relational DBMS • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. • OLAP (on-line analytical processing) • Major task of data warehouse system • Data analysis and decision making • Distinct features (OLTP vs. OLAP): • User and system orientation: customer vs. market • Data contents: current, detailed vs. historical, consolidated • Database design: ER + application vs. star + subject • View: current, local vs. evolutionary, integrated • Access patterns: update vs. read-only but complex queries
RDBMS used for OLTP • Database Systems have been used traditionally for OLTP • clerical data processing tasks • detailed, up to date data • structured repetitive tasks • read/update a few records • isolation, recovery and integrity are critical
Operational Systems • Run the business in real time • Based on up-to-the-second data • Optimized to handle large numbers of simple read/write transactions • Optimized for fast response to predefined transactions • Used by people who deal with customers, products -- clerks, salespeople etc. • They are increasingly used by customers
Data Industry Usage Technology Volumes Customer All Track Legacy application, flat Small-medium File Customer files, main frames Details Account Finance Control Legacy applications, Large Balance account hierarchical databases, activities mainframe Retail Generate Very Large Point-of- ERP, Client/Server, bills, manage Sale data relational databases stock Telecomm- Billing Legacy application, Very Large Call unications hierarchical database, Record mainframe Production Manufact- Control ERP, Medium Record uring Production relational databases, AS/400 Examples of Operational Data
Application-Orientation Subject-Orientation Operational Database Data Warehouse Credit Card Customer Loans Vendor Product Trust Savings Activity Application-Orientation vs. Subject-Orientation
OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse • Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) • e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
OLTP vs Data Warehouse OLTP Warehouse (DSS) Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Ad-hoc access Knowledge User (Manager) • Application Oriented • Used to run business • Detailed data • Current up to date • Isolated Data • Repetitive access • Clerical User