1 / 19

Data Warehousing/Mining

Data Warehousing/Mining. MSC636 Winter 2003. Timeliness Sufficiency Detail/aggregation Understandability Bias Relevance. Comparability Reliability Redundancy Cost efficiency Quantifiability Format. Data/Information Quality Issues. Internal data Shared sources

vance
Download Presentation

Data Warehousing/Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehousing/Mining MSC636 Winter 2003

  2. Timeliness Sufficiency Detail/aggregation Understandability Bias Relevance Comparability Reliability Redundancy Cost efficiency Quantifiability Format Data/Information Quality Issues

  3. Internal data Shared sources TPS/operational data Financial Accounting Production HRM Data warehouses/data marts Private/personal data External data Particularly for EIS Commercial/ subscription databases Industry data Governmental/agency/public access data Primary data collection Sources of Data

  4. Data Management • DBMS software well developed • Hierarchical • Simple, predefined relationships • Efficient large-volume processing • Network • Complex, pre-defined relationships • Efficient, but inflexible • Relational • Flexible • Efficient-enough

  5. Data Warehouses • Database system extracted from TPS and other data sources • Specifically designed to support DSS/OLAP/DM applications • Data warehousing • Implies type of data being collected • Implies interest in exploiting data and building knowledge about customers, markets, other environments

  6. Characteristics of DW • Separate from TPS data (for speed) • Subject-oriented (vs. application oriented) • Integrated (filtered & cleaned) • Non-volatile (no user updates) • Time variant (historical, time stamped) • Summarized (various levels of aggregation) • Larger (historical, intentionally non-normalized, 50 GB to 1+ TB) • Data sources (internal & external)

  7. Related Concepts • Data store--> operational databases • Data mart--> smaller scope, devoted to SBU or department (baby step?) • Data warehouse--> generally enterprise-level strategy • Benefits of integrating info needs • Challenges of analysis, implementation, & management

  8. Meta-Data in DW • Descriptions of the data in the DW; how it was acquired, aggregated, processed; and where it was collected • Comparable to repository & data dictionaries • Tables, attributes, & keys • Data sources • Coding, domains, aliases • Record counts, table sizes • Transformation mapping • Explanation of business rules & changes • Algorithms for summarization • Data ownership • Recording DW use history • Logical data models from TPS data

  9. Why Warehouse Data? • Data quality • Data “clean-up” • Less variability in extracted data (spider web) • Data access (30% in Sentry Market survey) • Dependability (cabinet effect) • Simplified access to multiple sources • Summarized and historical data • System performance • Less interference with TPS/OLTP processing • IS staff productivity • UPS, HP: programmer reductions

  10. Why? • Sentry Market survey 1996 • 55%-->DW vital to business objectives • 30%-->Improved enterprise-wide access to data top benefit • Other top reasons: improved data accuracy, better data control, better consistency, & cost savings • >50% of Fortune 500 have DW projects underway or completed • Assumption that DW will be used, decisions will be made differently

  11. Justifying the DW Project • ROI, NPV used to allocate resources to projects • DW costs are uncertain & benefits are qualitative • Suggested approaches: value analysis, intangible value analysis, business value-added analysis, scenario approach • 1996 paper (IDC): avg 3-yr ROI 401% • 62 non-random selection, completed projects • 85% fail to meet objectives • Higher than expected costs, poor utilization, poor planning • 40% never fully implemented • Investigating DW vital, implementation optional

  12. Warehousing Woes • Turf battles, user resistance, & power struggles • Crossing organizational boundaries • Issues of information ownership • High-level champion • Assigning “stewardship” for data with business units with IS assuming technical & advisor responsibilities • Establish conflict resolution process • Demonstrate benefits & roll out continuous progress • Encourage user involvement & management expectations • 7 Deadly Sins • Technology issues (lack of standardization)

  13. OLAP • Approach to DSS (E.F. Codd, 1993) • On-Line Analytical Processing • vs. On-Line Transactional Processing • Requires rapid response to complex queries • Consolidation, drill-down, slicing & dicing • Synonymous with multi-dimensional analysis, BI systems, data analysis systems • Supported by warehoused data • Multi-dimensional databases (MOLAP) • Modified relational databases (ROLAP)

  14. Multi-Dimensional DBMS • Data related in database via • Measurement of interest (sales, inventory, profit) • Dimension 1: aggregated by category (product, region, segment, etc.) • Dimension 2: aggregated by time period (daily, weekly, monthly, etc.) • Data organized in cubes (hypercubes) rather than related tables • Worksheet analogy (sheets, books, more books) • Sparcity creates inefficient storage (50 GB)

  15. Relational OLAP • De-normalized tables • Star schema • Dimension tables (entity related) • Fact tables (creates “slices” of derived data regarding dimensions) • Able to handle more data than MOLAP • Focus on speed justifies redundancies • Beta vs. VHS?

  16. MD-DBMS Realities • Larger storage requirement • up to 40% larger than RDBMS • More expensive DBMS software • up to 50% higher than RDBMS • Implementation slower & maintenance more difficult • Standards not yet set & compatibility limited • IBM, Microsoft, Sybase, Oracle, MicroStrategy, Arbor Essbase, Cognos • Web-enabled OLAP servers • www.cognos.com-->PowerPlay, web-enabled OLAP tool

  17. Data Mining • Competitive advantage of knowledge discovery • Algorithmic models for finding patterns in data and inferring rules from those patterns (one aspect of OLAP) • Siftware, data digging, BI, KDS • IBM, SAS, Hyperion, Cognos • Designed for functional specialists and/or analysts • Future integration into DBMS and industry-specific products

  18. Data Mining • ANN, AI, inferential & predictive statistics • Identify unpredictable patterns not amenable to traditional abstract or conceptual modeling • Classification (assign individuals to categories; supervised training; fraud testing, buyer categories) • Associations (correlation between events or measures; market basket analysis--things bought together) • Sequences (ordering of events; explanatory) • Clustering (creating new groups based on known or unknown attributes)

  19. Data Visualization • Process of turning numerical data into meaningful images • Beyond simple charting/graphing • Tools to integrate analysis & visual presentation in OLAP environment • Relies on human capacity to interpret images more easily than numbers • Geographic Information Systems • Specialized database that stores, retrieves, and analyzes spatial data • Marketing utility

More Related