160 likes | 318 Views
Database Support to Data Mining. Olson/Shi -- Chapter 3. Data Warehousing. “Data warehouses are used to store massive quantities of data in a manner that can be easily updated and allow quick retrieval of specific types of data.” Integrate information from a variety of sources. DW Processes.
E N D
Database Support to Data Mining Olson/Shi -- Chapter 3
Data Warehousing • “Data warehouses are used to store massive quantities of data in a manner that can be easily updated and allow quick retrieval of specific types of data.” • Integrate information from a variety of sources.
DW Processes • Warehouse generation – design data warehouse and load data • Data management – store data • Information analysis – use data to support organizational decision making
Data Warehousing • Provide business users views of data appropriate to mission • Consolidate & reconcile data • Give macro views of critical aspects • Timely & detailed access to information • Provide specific information to groups • Ability to identify trends McGraw-Hill Irwin 2007
DW cont… • Within a DW – data is organized around subjects • Data comes from operational and external sources • Integrated into a common format • DW must be scalable: • In terms of size and configuration (based on changing conditions)
Data Marts • Granularity of data • Is the level of data detail • Data warehouse – tends to be fine granularity • OLAP – tends to aggregate to coarse granularity McGraw-Hill Irwin 2007
OLAP • Extract data & generate reports based on dimensions important to users • Multidimensional spreadsheet • Hypercube – term to reflect ability to sort on many dimensions • Many forms • MOLAP – multidimensional • ROLAP – relational (uses SQL) • DOLAP – desktop • WOLAP – web enabled • HOLAP - hybrid McGraw-Hill Irwin 2007
Data Warehouse Implementation • Reliable, comprehensive source of clean data • Accurate, complete, in correct format • Processes • System development • Data acquisition – supported by DW generation (ETL) • Data extraction for use McGraw-Hill Irwin 2007
Data Warehouse Generation • Extract data from sources • Transform • Clean • Load into data warehouse • 60-80% of effort in operating data warehouse McGraw-Hill Irwin 2007
Data Extraction Routines • Interpret data formats • Identify changed records • Copy information to intermediate file McGraw-Hill Irwin 2007
Data Transformation • Consolidate data from multiple sources • Filter to eliminate unnecessary details • Clean data • eliminate incorrect entries • eliminate duplications • Convert & translate data into proper format • Aggregate data as designed McGraw-Hill Irwin 2007
Data Management • Retrieve information • Extraction programs • Problems: • Required data not available • Initial data warehouse scope too broad • Not enough time to do prototyping, or needs analysis • Insufficient senior direction McGraw-Hill Irwin 2007
Meta Data • Data to keep track of data & used to describe organization of data warehouse • Life cycle: • Manage meta data • Design data warehouse • Ensure data quality • Manage system during operations McGraw-Hill Irwin 2007
Business Meta Data • What data are available • Source of each data element • Frequency of data updates • Location of specific data • Predefined reports & queries • Methods of data access McGraw-Hill Irwin 2007
Technical Meta Data • Data source • (internal or external) • Data preparation features • (transformation & aggregation rules) • Logical structure of data • Physical structure & content of data warehouse • Data ownership • Security aspects • (access rights, restrictions) • System information • (date of last update, retention policy, data usage) McGraw-Hill Irwin 2007
Data Quality • Data warehouse projects tend to fail because of data quality issues • Impacted by the fact that data may be: • Corrupt or missing from original sources • Not be transferred into or out of DW • Inconsistent as a result of unresolved data cleaning processes • Data Integrity – meaningless, corrupt, or redundant data should not be entered into the data warhouse • Examples (starting on p. 46) – Walmart, Summers Rubber Co.