350 likes | 498 Views
COMP 3503 Data Warehousing. with Daniel L. Silver. Agenda. Beginning at the End Data Consolidation and DW Overview Technology of Data Warehousing Management Issues Consolidation and Cleaning Process. The KDD Process. Interpretation and Evaluation. Data Mining. Knowledge.
E N D
COMP 3503Data Warehousing with Daniel L. Silver
Agenda • Beginning at the End • Data Consolidation and DW Overview • Technology of Data Warehousing • Management Issues • Consolidation and Cleaning Process
The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources
Let’s Begin at the End of Data Preparation • DM methods require data arranged into a row and column table structure similar to an Excel spreadsheet Columns = Variables Rows = Units of Analysis
The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Warehouse
Data Consolidation Garbage in Garbage out • The quality of results relates directly to quality of the data • 50%-70% of KDD process effort will be spent on data consolidation, cleansing and preprocessing • Major justification for a corporate Data Warehouse
Data Consolidation & Warehousing From data sources to consolidated data repository Analysis and Info Sharing RDBMS Inflow Data Consolidation and Cleansing Legacy DBMS Warehouse or Datamart Flat Files Upflow Downflow Outflow Metaflow External
Data Consolidation –The Process • Collect & Consolidate • Define requirements - Generate data model • Identify authoritative sources (internal/external) • Extract required data (ETL) • Load/Integrate into working database (ODS) • Generate meta-data = data about the data (MDR) • Clean - Measure data quality at the source • Completeness - Accuracy - Integrity • Load only clean data into warehouse • Schedule periodic source checking/cleansing
Data Warehousing The strategic collection, cleansing, and consolidation of organizational data to meet operational, analytical, and communication needs.
Operational DB Data from transactions Application oriented Specific Current Details Changes continually Data Warehouse Data from internal systems Subject Oriented Holistic integrated Current + historical Details + Summaries Stable Data Warehousing Major DW Framework suppliers/consultants: Teradata, Oracle, SAP/Sybase, Microsoft, HP/Vertica
Data Warehousing - Technology Choices for a Warehouse Repository • Relational DBMS (Oracle, IBM,Sybase, MS) • Supports very large, multipurpose databases • Multidimensional access via ROLAP methods • Slow for massive/complex data analysis • Multi-dimensionsal DBMS (MS, Oracle, IBM, SAP, MicroStrategy) • Fast, full feature OLAP • Size limitations - 5 GB of raw data (100 GB total) • Standards still evolving, proprietary systems
Data Warehousing DW Data Models • RDBMS (DB2, Oracle, Sybase, Ingres) • Data stored in tables - select, join, project • Based on logical normalized data model • Think “independent spreadsheet tables with links to each other” • SQL = Structured Query Language • e.g. SELECT CUSTOMER-NAME FROM CUSTOMER-MASTER WHERE CITY = “HALIFAX” AND TOTAL-PURCHASES > “500.00”
Data Warehousing DW Data Models • MDBMS (MS, Oracle, IBM, SAP) • Multi-dimensional matrix (spreadsheet) of many variables with various levels of abstraction • Think “Rubics Cube” • Rotate, select range, roll-up, drill-down through levels • 10% data, 90% indices - facilitates rapid numerical analysis and visualization • OLAP = On-Line Analytical Processing
Data Warehousing Future DW Data Models • Object-Oriented (ODBMS = POET, ODE) • classes (includes data types and functions) • objects are instantiations of classes • inheritance of class properties • used for modeling complex and large objects • better suited for distributed computing • Object Relation (DBMS = Omniscience) • best of relation and object oriented: SQL • transactions + object classes • Deductive (DRBMS) • Extension of RDBMS, embeds logic within database
Data Warehousing OLAP Knowledge Workers “The Ideal Picture” Stats IDT Data Marts & Analytical Pocessors ANN One or more central repositories Data Warehouse Operational feedback from analytics Extraction Transformation Load Operational Data Store (ODS) Source Systems and Operational Users
Data Warehouse Enterprise Wide RDBMS Normalized Full Time Horizon Read-only /Non-Volatile No End-user Access Data Marts Functional Area Subsets RDBMS or MDBMS Aggregated / Summarized Data Limited Time Horizon Analysis / Responsive End-user Access Data Warehouse / Data Mart This slide courtesy Anders Stjarne
Data Warehousing The Importance of Metadata • Data about the data • Provides the key link between business users and data – a schema for the data • Name, ID, description, source • Data type, size, range, default value • Metric scale (nominal, ordinal, interval, continuous) • A good DW and DM project should provide • Ability to browse accurate and accessible metadata encouraging use of the DW • Provides appropriate levels of security • Secure load and update
Data Warehousing - Meta Data • What is Meta Data? • Any data about data • Describes / defines the meaning and representation form • Provides a context for understanding and interpreting • Why Meta Data? • To understand and track data – from its source & over time • Basis for integration and sharing • Who needs / uses Meta Data? • Users of the data – Mgr/DSS analyst starts here to plan query • Tools / IS apps which record, read, and process the data • Conflicting objectives: • Consistency, uniformity, standards across the organization • vs. End user autonomy, freedom to manipulate & analyze • Historical efforts – mainframe data catalogue or data dictionary This slide courtesy Anders Stjarne
Data Warehousing Management Issues • Data Warehouse projects that have completed have enjoyed up to 400% ROI • Historically 75% of Data Warehouse projects did not complete as originally envisioned • projects too large, objectives not clear • 50% become Data Mart projects - scaled down: specific, achievable, measurable objectives • 25% die a slow death
Data Warehousing • DW is not a project – it is an on-going set of organizational activities • Must be driven by business objectives • Rapid change in technology and business requirements => demands short cycles, flexibility • Keys to success are strategic planning with tactical deployment (informed bottom-up approach): • few data sources -> single Data Mart • multiple data sources -> multiple Data Marts • evolve to enterprise Data Warehouse • DW are now becoming cloud-based (distributed and web enabled)
Relationship between DW and DM? Strategic Tactical Rationale for data consolidation Analysis Query/Reporting OLAP Data Mining Data Warehousing Source of consolidated data
Consolidating Internal Sources • Define business requirements, subject areas • Generate data model • Identify authoritative sources: • operational systems (process control, POS) • billing, accounting, administrative systems • branch offices, PCs, filing cabinets => paper! • Identify and involve data source stakeholders and operational system’s maintenance personnel
Consolidating Internal Sources • ETL = Extraction, Transformation and Load [ref: http://en.wikipedia.org/wiki/Extract,_transform,_load] • A process in data warehousing that involves: • Extracting data from outside sources • Transforming it to fit business needs • Loading it into the data warehouse.
Data Cleaning • Act of detecting and correcting (or removing) corrupt or inaccurate attributes or records
Data Cleaning Process • Identify authoritative data sources • Measure data quality • Completeness - all of the data • Accuracy - correct • Integrity- consistent, no ambiguities • Identify inconsistent, missing, incomplete, redundant, and incorrect values • Clean data whenever possible at the source • Load only clean data into warehouse • Schedule periodic source checking/cleansing
Data Cleaning and Preparation Missing or Invalid Data • Missing or Invalid data values may be due to • Equipment malfunction • Inconsistent with other recorded data and thus deleted • Data not entered due to misunderstanding • Not be considered important at the time of entry • Changes in the data not recorded • What to do? • Inferred (imputed) values • Remove the row or column
Data Cleaning and Preparation Noisy Data • When two or more examples conflict in terms of the dependent variable • Incorrect attribute values may be due to • Faulty data collection instruments • Data entry problems • Data transmission problems • Technology limitations (cheap thermometer) • Inconsistency in naming convention • What to do? • Ignore • Remove the row
Data Cleaning and Preparation • Data cleaning is part of the ETL and makes use of an Operation Data Store • Automated cleaning tools are available • Trillum Software (Harte-Hanks) www.trilliumsoft.com • Winpure http://www.winpure.com/Article--DataCleaningTool.html • Vality -> Ascential -> IBM Websphere Data Intergratorhttp://www-306.ibm.com/software/data/integration/ascential.html • Consolidate internal with external sources • May decide to prepare data for exploratory analysis using spreadsheets, OLAP, visualization software
Data Enrichment Requirements • Behavioral • purchase from related businesses (Air Miles) • Eg. number of vehicles, travel frequency • Demographic • Eg. age, gender, marital status, children, income level • Psychographic • Eg. “risk taker”, “conservative”, “cultured“, “hi-tech adverse”, “credit worthy”, “trustworthy”
Data Enrichment Sources • Statistics Canada (census, enumeration area) • Software for under $1000 • Data from $100 - $4000 (region - nation) • http://www.statcan.ca/ • Association joint surveys • Equifax (based on postal code segments) • Data from $500 - $50,000 • http://www.equifax.com/ • Loyalty management organizations • LMG - http://www.loyalty.co.uk • http://www.loyalty.ca • http://www.loyalty.com • Independent survey ($20,000 - $200,000)