190 likes | 313 Views
Data Warehousing/Mining. MSC636 Winter 2003. Timeliness Sufficiency Detail/aggregation Understandability Bias Relevance. Comparability Reliability Redundancy Cost efficiency Quantifiability Format. Data/Information Quality Issues. Internal data Shared sources
E N D
Data Warehousing/Mining MSC636 Winter 2003
Timeliness Sufficiency Detail/aggregation Understandability Bias Relevance Comparability Reliability Redundancy Cost efficiency Quantifiability Format Data/Information Quality Issues
Internal data Shared sources TPS/operational data Financial Accounting Production HRM Data warehouses/data marts Private/personal data External data Particularly for EIS Commercial/ subscription databases Industry data Governmental/agency/public access data Primary data collection Sources of Data
Data Management • DBMS software well developed • Hierarchical • Simple, predefined relationships • Efficient large-volume processing • Network • Complex, pre-defined relationships • Efficient, but inflexible • Relational • Flexible • Efficient-enough
Data Warehouses • Database system extracted from TPS and other data sources • Specifically designed to support DSS/OLAP/DM applications • Data warehousing • Implies type of data being collected • Implies interest in exploiting data and building knowledge about customers, markets, other environments
Characteristics of DW • Separate from TPS data (for speed) • Subject-oriented (vs. application oriented) • Integrated (filtered & cleaned) • Non-volatile (no user updates) • Time variant (historical, time stamped) • Summarized (various levels of aggregation) • Larger (historical, intentionally non-normalized, 50 GB to 1+ TB) • Data sources (internal & external)
Related Concepts • Data store--> operational databases • Data mart--> smaller scope, devoted to SBU or department (baby step?) • Data warehouse--> generally enterprise-level strategy • Benefits of integrating info needs • Challenges of analysis, implementation, & management
Meta-Data in DW • Descriptions of the data in the DW; how it was acquired, aggregated, processed; and where it was collected • Comparable to repository & data dictionaries • Tables, attributes, & keys • Data sources • Coding, domains, aliases • Record counts, table sizes • Transformation mapping • Explanation of business rules & changes • Algorithms for summarization • Data ownership • Recording DW use history • Logical data models from TPS data
Why Warehouse Data? • Data quality • Data “clean-up” • Less variability in extracted data (spider web) • Data access (30% in Sentry Market survey) • Dependability (cabinet effect) • Simplified access to multiple sources • Summarized and historical data • System performance • Less interference with TPS/OLTP processing • IS staff productivity • UPS, HP: programmer reductions
Why? • Sentry Market survey 1996 • 55%-->DW vital to business objectives • 30%-->Improved enterprise-wide access to data top benefit • Other top reasons: improved data accuracy, better data control, better consistency, & cost savings • >50% of Fortune 500 have DW projects underway or completed • Assumption that DW will be used, decisions will be made differently
Justifying the DW Project • ROI, NPV used to allocate resources to projects • DW costs are uncertain & benefits are qualitative • Suggested approaches: value analysis, intangible value analysis, business value-added analysis, scenario approach • 1996 paper (IDC): avg 3-yr ROI 401% • 62 non-random selection, completed projects • 85% fail to meet objectives • Higher than expected costs, poor utilization, poor planning • 40% never fully implemented • Investigating DW vital, implementation optional
Warehousing Woes • Turf battles, user resistance, & power struggles • Crossing organizational boundaries • Issues of information ownership • High-level champion • Assigning “stewardship” for data with business units with IS assuming technical & advisor responsibilities • Establish conflict resolution process • Demonstrate benefits & roll out continuous progress • Encourage user involvement & management expectations • 7 Deadly Sins • Technology issues (lack of standardization)
OLAP • Approach to DSS (E.F. Codd, 1993) • On-Line Analytical Processing • vs. On-Line Transactional Processing • Requires rapid response to complex queries • Consolidation, drill-down, slicing & dicing • Synonymous with multi-dimensional analysis, BI systems, data analysis systems • Supported by warehoused data • Multi-dimensional databases (MOLAP) • Modified relational databases (ROLAP)
Multi-Dimensional DBMS • Data related in database via • Measurement of interest (sales, inventory, profit) • Dimension 1: aggregated by category (product, region, segment, etc.) • Dimension 2: aggregated by time period (daily, weekly, monthly, etc.) • Data organized in cubes (hypercubes) rather than related tables • Worksheet analogy (sheets, books, more books) • Sparcity creates inefficient storage (50 GB)
Relational OLAP • De-normalized tables • Star schema • Dimension tables (entity related) • Fact tables (creates “slices” of derived data regarding dimensions) • Able to handle more data than MOLAP • Focus on speed justifies redundancies • Beta vs. VHS?
MD-DBMS Realities • Larger storage requirement • up to 40% larger than RDBMS • More expensive DBMS software • up to 50% higher than RDBMS • Implementation slower & maintenance more difficult • Standards not yet set & compatibility limited • IBM, Microsoft, Sybase, Oracle, MicroStrategy, Arbor Essbase, Cognos • Web-enabled OLAP servers • www.cognos.com-->PowerPlay, web-enabled OLAP tool
Data Mining • Competitive advantage of knowledge discovery • Algorithmic models for finding patterns in data and inferring rules from those patterns (one aspect of OLAP) • Siftware, data digging, BI, KDS • IBM, SAS, Hyperion, Cognos • Designed for functional specialists and/or analysts • Future integration into DBMS and industry-specific products
Data Mining • ANN, AI, inferential & predictive statistics • Identify unpredictable patterns not amenable to traditional abstract or conceptual modeling • Classification (assign individuals to categories; supervised training; fraud testing, buyer categories) • Associations (correlation between events or measures; market basket analysis--things bought together) • Sequences (ordering of events; explanatory) • Clustering (creating new groups based on known or unknown attributes)
Data Visualization • Process of turning numerical data into meaningful images • Beyond simple charting/graphing • Tools to integrate analysis & visual presentation in OLAP environment • Relies on human capacity to interpret images more easily than numbers • Geographic Information Systems • Specialized database that stores, retrieves, and analyzes spatial data • Marketing utility