320 likes | 652 Views
Introduction to Data Warehouse and Data Mining MIS 2502 Data Analytics. Importance of data. What organizations do with data? Transaction Processing (E)commerce: Amazon.com; PNC Bank B2B systems: Supply Chain Management Web Search Decision Making Financial reporting Inventory management
E N D
Introduction to Data Warehouse and Data MiningMIS 2502Data Analytics
Importance of data What organizations do with data? • Transaction Processing • (E)commerce: Amazon.com; PNC Bank • B2B systems: Supply Chain Management • Web Search • Decision Making • Financial reporting • Inventory management • Budget allocation • Customer Relationship Management • Target Marketing • Product Design and Promotions • Fraud Detection
Decision Making with Databases • Databases are used for transaction processing • Data from transaction processing is used for tactical decision making • Database provides basic reporting function • But…
The Need for Data Analysis • Different managers require different data and data may come from other part of the organization or outside the organization • External and internal forces require tactical and strategic decisions • Search for competitive advantage • Business environments are dynamic • Decision-making cycle time is reduced
Some Questions Analysts Need to Answers • Sales analysis: • What are the sales by quarter and geography? • How do sales compare in two different stores in the same state? • Profitability analysis: • Which is the most profitable store in the state CA? • Which product lines are the highest revenue producers this year? • Which products and product lines are the most profitable this quarter? • Sale force analysis • Which salesperson is the best revenue producer this year?Do salesperson X meet his sale target this quarter?
Operational vs. Decision Support Data • Operational data • Relational, normalized database • Optimized to support transactions • Real time updates • DSS • Snapshot of operational data • Summarized • Large amounts of data • Data analyst viewpoint • Timespan • Granularity • Dimensionality
Data Warehouse • Integrated • Centralized • Holds data retrieved from entire organization • Subject-Oriented • Optimized to give answers to diverse questions • Used by all functional areas • Time Variant • Flow of data through time • Projected data • Non-Volatile • Data never removed • Always growing
Data Warehouse • Extraction, transformation, and loading (ETL) – a process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse • Data mart – contains a subset of data warehouse information
ETL – Extraction, Transformation, Load Extract data from source systems • Transform: cleanse data for consistency and output exceptions • Apply business rules • Selecting certain columns to load (not null records) • Translating coded values (1, M, male = 0 ) • Derive new calculated value (sale_amount = qty * unit_price) • Join data from multiple sources (lookup, merge) • Aggregate (rollup/summarize data) • Transpose/pivot (turning columns into rows) • Data validation. Load: data into repository
Data in a Data Warehouse • Data for a data warehouse is obtained from a variety of databases • E.g. customer database, transaction database, accounts database • Data in data warehouse is multidimensional
Multidimensional Analysis • Cube – common term for the representation of multidimensional information
Star Schema • Data-modeling technique • Maps multidimensional decision support into relational database • Yield model for multidimensional data analysis while preserving relational structure of operational DB • Four Components: • Facts • Dimensions • Attributes • Attribute hierarchies
Simple Star Schema Figure 13.12
Slice and Dice View of Sales Figure 13.14
Star Schema Representation • Facts and dimensions represented by physical tables in data warehouse DB • Fact table related to each dimension table (M:1) • Fact and dimension tables related by foreign keys • Subject to the primary/foreign key constraints
Star Schema for Sales Sales fact table and its four dimensions: location, time, product, and customer . Allows sales to be aggregated by time, geographic location, product, and by customer.
Data Warehouse to Data Marts • Given the large size of a data warehouse, organizations create data marts • Subject oriented data • Subset of data in a data warehouse • Used for focused decision-making
Online Analytical Processing (OLAP) • Advanced data analysis environment • Supports decision making, business modeling, and operations research activities • Characteristics of OLAP • Use multidimensional data analysis techniques • Provide advanced database support • Provide easy-to-use end-user interfaces • Support client/server architecture
OLAP Client/Server Architecture Figure 13.6
Data Mining • Seeks to discover patterns or relationships within the data • Data mining tools automatically search data for patterns and relationships • Data mining tools • Analyze data • Uncover problems or opportunities • Form computer models based on findings • Predict business behavior with models • Require minimal end-user intervention
Business Intelligence AB113 - Information Technology
MS SQL 2008 Architecture Relational Model Dimensional Model/Star Schema
Back Room—Data prepared from many sources Front Room—Information presented
Multidimensional Analysis and Data Mining • Differences between databases and data warehouse/data mart? • Data mining – the process of analyzing data to extract information not offered by the raw data alone • To perform data mining users need data-mining tools • Data-mining tool – uses a variety of techniques to find patterns and relationships in large volumes of information and infers rules that predict future behavior and guide decision making • Business intelligence – taking data from multiple sources and turn it into useful and easy to understand information to support decision-making efforts for various kinds of people.