1 / 16

Database Support to Data Mining

Database Support to Data Mining. Olson/Shi -- Chapter 3. Data Warehousing. “Data warehouses are used to store massive quantities of data in a manner that can be easily updated and allow quick retrieval of specific types of data.” Integrate information from a variety of sources. DW Processes.

eadoin
Download Presentation

Database Support to Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Database Support to Data Mining Olson/Shi -- Chapter 3

  2. Data Warehousing • “Data warehouses are used to store massive quantities of data in a manner that can be easily updated and allow quick retrieval of specific types of data.” • Integrate information from a variety of sources.

  3. DW Processes • Warehouse generation – design data warehouse and load data • Data management – store data • Information analysis – use data to support organizational decision making

  4. Data Warehousing • Provide business users views of data appropriate to mission • Consolidate & reconcile data • Give macro views of critical aspects • Timely & detailed access to information • Provide specific information to groups • Ability to identify trends McGraw-Hill Irwin 2007

  5. DW cont… • Within a DW – data is organized around subjects • Data comes from operational and external sources • Integrated into a common format • DW must be scalable: • In terms of size and configuration (based on changing conditions)

  6. Data Marts • Granularity of data • Is the level of data detail • Data warehouse – tends to be fine granularity • OLAP – tends to aggregate to coarse granularity McGraw-Hill Irwin 2007

  7. OLAP • Extract data & generate reports based on dimensions important to users • Multidimensional spreadsheet • Hypercube – term to reflect ability to sort on many dimensions • Many forms • MOLAP – multidimensional • ROLAP – relational (uses SQL) • DOLAP – desktop • WOLAP – web enabled • HOLAP - hybrid McGraw-Hill Irwin 2007

  8. Data Warehouse Implementation • Reliable, comprehensive source of clean data • Accurate, complete, in correct format • Processes • System development • Data acquisition – supported by DW generation (ETL) • Data extraction for use McGraw-Hill Irwin 2007

  9. Data Warehouse Generation • Extract data from sources • Transform • Clean • Load into data warehouse • 60-80% of effort in operating data warehouse McGraw-Hill Irwin 2007

  10. Data Extraction Routines • Interpret data formats • Identify changed records • Copy information to intermediate file McGraw-Hill Irwin 2007

  11. Data Transformation • Consolidate data from multiple sources • Filter to eliminate unnecessary details • Clean data • eliminate incorrect entries • eliminate duplications • Convert & translate data into proper format • Aggregate data as designed McGraw-Hill Irwin 2007

  12. Data Management • Retrieve information • Extraction programs • Problems: • Required data not available • Initial data warehouse scope too broad • Not enough time to do prototyping, or needs analysis • Insufficient senior direction McGraw-Hill Irwin 2007

  13. Meta Data • Data to keep track of data & used to describe organization of data warehouse • Life cycle: • Manage meta data • Design data warehouse • Ensure data quality • Manage system during operations McGraw-Hill Irwin 2007

  14. Business Meta Data • What data are available • Source of each data element • Frequency of data updates • Location of specific data • Predefined reports & queries • Methods of data access McGraw-Hill Irwin 2007

  15. Technical Meta Data • Data source • (internal or external) • Data preparation features • (transformation & aggregation rules) • Logical structure of data • Physical structure & content of data warehouse • Data ownership • Security aspects • (access rights, restrictions) • System information • (date of last update, retention policy, data usage) McGraw-Hill Irwin 2007

  16. Data Quality • Data warehouse projects tend to fail because of data quality issues • Impacted by the fact that data may be: • Corrupt or missing from original sources • Not be transferred into or out of DW • Inconsistent as a result of unresolved data cleaning processes • Data Integrity – meaningless, corrupt, or redundant data should not be entered into the data warhouse • Examples (starting on p. 46) – Walmart, Summers Rubber Co.

More Related