
Course Overview

Discover the basics, techniques, and real-world applications of data mining. Explore future trends and essential reference books in the field for insightful knowledge discovery.


Presentation Transcript


  1. Data Mining: The Discovery Technology for Knowledge Management. Yike Guo, Dept. of Computing, Imperial College

  2. Course Overview
     Goal:
     • Basic concepts of data mining
     • Basic data mining techniques
     • Data mining procedure in real-world applications
     • Future research trends in data mining
     Reference books:
     • Advances in Knowledge Discovery and Data Mining, U. M. Fayyad and G. Piatetsky-Shapiro, AAAI/MIT Press, 1996
     • Predictive Data Mining: A Practical Guide, Sholom M. Weiss and Nitin Indurkhya, Morgan Kaufmann Publishers, Inc., 1997
     • Data Mining Techniques, Wiley Computer Publishing, 1997

  3. What does the data say?

     Day  Outlook   Temperature  Humidity  Wind    Play Tennis
     1    Sunny     Hot          High      Weak    No
     2    Sunny     Hot          High      Strong  No
     3    Overcast  Hot          High      Weak    Yes
     4    Rain      Mild         High      Weak    Yes
     5    Rain      Cool         Normal    Weak    Yes
     6    Rain      Cool         Normal    Strong  No
     7    Overcast  Cool         Normal    Strong  Yes
     8    Sunny     Mild         High      Weak    No
     9    Sunny     Cool         Normal    Weak    Yes
     10   Rain      Mild         Normal    Weak    Yes
     11   Sunny     Mild         Normal    Strong  Yes
     12   Overcast  Mild         High      Strong  Yes
     13   Overcast  Hot          Normal    Weak    Yes
     14   Rain      Mild         High      Strong  No

  4. Turning Data into Knowledge

  5. Data Mining
     [Diagram labels: Machine Learning; Statistics; Databases; High Performance Infrastructure & Distributed Computing; Enabling Technology; Decision Support; Knowledge Discovery; Data Mining]

  6. Why Data Mining
     Limitation of traditional database querying:
     • Most queries of interest to data owners are difficult to state in a query language:
       • “find me all records indicating fraud” => “tell me the characteristics of fraud” (summarisation)
       • “find me who is likely to buy product X” (classification problem)
       • “find all records that are similar to records in table X” (clustering problem)
     • Supporting analysis and decision making with traditional (SQL) queries becomes infeasible (the query formulation problem).

  7. Relational Database Revisited
     • Terabyte databases, consisting of billions of records, are becoming common.
     • The relational data model is the de facto standard.
     • A relational database: a set of relations.
     • A relation: a set of homogeneous tuples.
     • Relations are created, updated and queried using SQL.
     • Query = keyword-based search:
       SELECT telephone_number FROM telephone_book WHERE last_name = 'Smith'
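
To make the "keyword-based search" on slide 7 concrete, here is a minimal sketch that runs the slide's query with Python's built-in sqlite3 module. The table and column names come from the slide; the rows are invented for illustration, and the ORDER BY clause is added only to make the output deterministic.

```python
import sqlite3

# In-memory database. Table and column names are taken from the slide;
# the three rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)")
conn.executemany(
    "INSERT INTO telephone_book VALUES (?, ?)",
    [("Smith", "555-0101"), ("Jones", "555-0102"), ("Smith", "555-0103")],
)

# The query from the slide, with the search key passed as a parameter.
rows = conn.execute(
    "SELECT telephone_number FROM telephone_book "
    "WHERE last_name = ? ORDER BY telephone_number",
    ("Smith",),
).fetchall()
numbers = [row[0] for row in rows]
print(numbers)
```

The query can only return records whose key matches exactly; it cannot say what the matching records have in common, which is the limitation the following slides build on.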

  8. SQL: Relational Querying Language
     • Provides a well-defined set of operations: scan, join, insert, delete, sort, aggregate, union, difference.
     • Scan: applies a predicate P to relation R
         For each tuple tr from R
           if P(tr) is true, tr is inserted in the output stream
     • Join: composes two relations R and S
         For each tuple tr from R
           For each tuple ts from S
             if the join attribute of tr equals the join attribute of ts
               form an output tuple by concatenating tr and ts
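
The scan and join pseudocode on slide 8 can be sketched in Python as follows. This is an illustrative nested-loop version, not how a production engine would execute a join; the `customers` and `orders` relations and their attribute names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]  # a tuple of a relation, represented as a dict


def scan(relation: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Emit every tuple tr of the relation for which P(tr) is true."""
    for tr in relation:
        if predicate(tr):
            yield tr


def nested_loop_join(r: Iterable[Row], s: Iterable[Row], attr: str) -> Iterator[Row]:
    """For each tr in R and ts in S, emit the combined tuple when the join
    attributes are equal. Merging dicts keeps the shared attribute once."""
    s_rows: List[Row] = list(s)  # S is traversed once per tuple of R
    for tr in r:
        for ts in s_rows:
            if tr[attr] == ts[attr]:
                yield {**tr, **ts}


# Hypothetical sample relations.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"cust_id": 1, "item": "book"},
    {"cust_id": 1, "item": "pen"},
    {"cust_id": 3, "item": "lamp"},
]

selected = list(scan(orders, lambda t: t["cust_id"] == 1))
joined = list(nested_loop_join(customers, orders, "cust_id"))
print(selected)
print(joined)
```

The nested loop costs |R| x |S| comparisons, which is why real systems prefer hash or sort-merge joins on large relations.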

  9. The Query Formulation Problem
     Consider the query: what kinds of weather conditions are suitable for playing tennis?
     • It is not solvable via query optimisation.
     • It has not received much attention in the database field or in traditional statistical approaches.
     • Such problems are inductive in nature: they call for learning from data rather than searching the data.
     • The natural solution is a train-by-example approach that constructs inductive models as the answers.
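
As one possible illustration of "learning from data", the sketch below scores each attribute of the Play Tennis table from slide 3 by information gain, the criterion used by ID3-style decision tree learners. The slides do not name a specific algorithm at this point, so the choice of information gain is an assumption made for the example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

# The 14 records from slide 3; the last field is the Play Tennis label.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, col):
    """Entropy reduction obtained by splitting the rows on column `col`."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)

# Class counts in each branch of the best split.
by_outlook = {
    value: Counter(r[-1] for r in ROWS if r[0] == value)
    for value in ("Sunny", "Overcast", "Rain")
}
print(best, round(gains[best], 3))
print(by_outlook)
```

On this table the Outlook attribute gives the largest gain, and every Overcast day is a "Yes", so an inductive model already yields a readable partial answer to the weather question without anyone having to formulate it as a query.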

  10. Why Data Mining Now
      • Data explosion
        • Business data: organisations such as supermarket chains, credit card companies, investment banks and government agencies routinely generate daily volumes of 100 MB of data.
        • Scientific data: scientific and remote sensing instruments collect data at rates of gigabytes per day, far beyond human analysis abilities.
      • Data wasting
        • Only a small portion (5%-10%) of the collected data is ever analysed.
        • Data that may never be analysed continues to be collected, at great expense.
      • We are drowning in data, but starving for knowledge!

  11. What is Data Mining?
      Data mining: a non-trivial data analysis process for identifying valid, useful and understandable patterns from databases.

  12. • Data: a set of facts F (records in a database).
      • Pattern: an expression E in a language L describing the data in a subset F_E of F, where E is simpler than the enumeration of all the facts in F_E. F_E is also called a class, and E is also called a model or knowledge.
      • Data mining process: data mining is a multi-step process involving multiple choices, iteration and evaluation. It is non-trivial because there is no closed-form solution; it always involves intensive search.
      • Valid: E is true (with high probability) for F.
      • Useful: patterns are not trivial inductive properties of the data.
      • Understandable: patterns should be understandable by the data owners, to aid in understanding the data and the domain.
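
A small sketch of how the definitions on slide 12 might be read operationally, using the Humidity and Play Tennis columns of the table on slide 3. The pattern E chosen here ("Humidity = Normal implies Play = Yes") and the way validity is measured (the fraction of F_E for which the conclusion holds) are assumptions made for illustration, not definitions taken from the slides.

```python
# Facts F: (Humidity, Play Tennis) pairs for days 1..14 of slide 3.
F = [
    ("High", "No"), ("High", "No"), ("High", "Yes"), ("High", "Yes"),
    ("Normal", "Yes"), ("Normal", "No"), ("Normal", "Yes"), ("High", "No"),
    ("Normal", "Yes"), ("Normal", "Yes"), ("Normal", "Yes"), ("High", "Yes"),
    ("Normal", "Yes"), ("High", "No"),
]

# Hypothetical pattern E: "Humidity = Normal  =>  Play Tennis = Yes".
# F_E is the subset of facts that E talks about.
F_E = [fact for fact in F if fact[0] == "Normal"]

# Validity read as the probability that E's conclusion holds on F_E.
validity = sum(1 for _, play in F_E if play == "Yes") / len(F_E)
print(len(F_E), round(validity, 3))
```

The single expression E stands in for seven enumerated records, which is the sense in which a pattern is "simpler than the enumeration" of the facts it describes.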

  13. How Data Mining Works
      [Diagram labels: Operational Data; Historical Data (Data Warehouse); Data; Data Mining System; Predictive Models; Knowledge; Decision Support System; Business Intelligence; Business Action; Decision Evaluation Feedback]

  14. Data Warehousing
      • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
      • A data warehouse is:
        • A decision support database that is maintained separately from the organization’s operational databases.
        • It integrates data from multiple heterogeneous sources to support the continuing need for structured and/or ad hoc queries, analytical reporting, and decision support.

  15. Modeling Data Warehouses
      • Modeling data warehouses: dimensions and measurements
        • Star schema: a single object (fact table) in the middle, connected radially to a number of objects (dimension tables).
        • Snowflake schema: a refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
        • Fact constellations: multiple fact tables share dimension tables.
      • Storage of selected summary tables:
        • Independent summary tables storing pre-aggregated data, e.g., total sales by product by year.
        • Encoding aggregated tuples in the same fact table and the same dimension tables.

  16. Example of Star Schema
      [Diagram: a central Sales Fact Table linked to four dimension tables.]
      • Sales Fact Table: Time_Key, Product_Key, Store_Key, Location_Key; measures: unit_sales, dollar_sales, Yen_sales
      • Time Dimension Table: Time_Key, many time attributes
      • Product Dimension Table: Product_Key, many product attributes
      • Store Dimension Table: Store_Key, many store attributes
      • Location Dimension Table: Location_Key, many location attributes
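
A minimal sketch of the star schema on slide 16 using sqlite3. The key and measure names (Time_Key, Product_Key, unit_sales, dollar_sales) come from the slide; the `year` and `product_name` attributes, the lower-case table names and all rows are hypothetical, and the Store and Location dimensions are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member, keyed by a surrogate key.
    CREATE TABLE time_dim    (Time_Key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE product_dim (Product_Key INTEGER PRIMARY KEY, product_name TEXT);

    -- Fact table: foreign keys to the dimensions plus numeric measures.
    CREATE TABLE sales_fact (
        Time_Key     INTEGER REFERENCES time_dim(Time_Key),
        Product_Key  INTEGER REFERENCES product_dim(Product_Key),
        unit_sales   INTEGER,
        dollar_sales REAL
    );

    INSERT INTO time_dim VALUES (1, 1996), (2, 1997);
    INSERT INTO product_dim VALUES (10, 'tea'), (11, 'coffee');
    INSERT INTO sales_fact VALUES
        (1, 10, 5, 50.0), (1, 11, 2, 30.0), (2, 10, 7, 70.0), (2, 11, 1, 15.0);
    """
)

# "Total sales by product by year": join the fact table to its dimensions
# and aggregate the measure.
summary = conn.execute(
    """
    SELECT p.product_name, t.year, SUM(f.dollar_sales)
    FROM sales_fact AS f
    JOIN time_dim    AS t ON f.Time_Key = t.Time_Key
    JOIN product_dim AS p ON f.Product_Key = p.Product_Key
    GROUP BY p.product_name, t.year
    ORDER BY p.product_name, t.year
    """
).fetchall()
print(summary)
```

Storing the result of such a query as its own table is one way to realise the pre-aggregated summary tables mentioned on slide 15.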

  17. Example of a Snowflake Schema
      [Diagram labels: Sales Fact Table (Time_Key, Product_Key, Store_Key, Location_Key; measures: unit_sales, dollar_sales, Yen_sales); Time Dimension Table (Time_Key, many time attributes); Product Dimension Table (Product_Key, Supplier_Key); Supplier_Key; Store Dimension Table (many store attributes); Location Dimension Table (Location_Key); Country (Location_Key); Region (Location_Key)]

  18. A Star-Net Query Model
      [Diagram: dimensions Customer Orders, Shipping Method, Customer, Time, Product, Geography, Promotion and Organization, with levels marked along them, including CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, ANNUALLY, QTRLY, DAILY, PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP, SALES PERSON, DISTRICT, DIVISION, REGION, COUNTRY]

  19. View of Warehouses and Hierarchies
      • Importing data
      • Table browsing
      • Dimension creation
      • Dimension browsing
      • Cube building
      • Cube browsing

  20. Construction of Data Cubes
      [Figure: a data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, ..., sum) and Discipline (Comp_Method, Database, ..., sum); labelled cells include “All, All, All” and “All Amount, Comp_Method, B.C.”]
      • Each dimension contains a hierarchy of values for one attribute.
      • A cube cell stores aggregate values, e.g., count, sum, max, etc.
      • A “sum” cell stores dimension summation values.
      • Sparse-cube technology and MOLAP/ROLAP integration.
      • “Chunk”-based multi-way aggregation and single-pass computation.
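
To show what the "sum" cells on slide 20 contain, here is a naive sketch that materialises every cell of a small cube, using "All" as the roll-up marker. The dimension values echo the slide's figure, but the rows and counts are invented, and this brute-force approach (2^n cells touched per input row) is not the chunk-based multi-way aggregation the slide refers to.

```python
from collections import defaultdict
from itertools import product

DIMENSIONS = ("amount", "province", "discipline")

# Hypothetical base data: (amount, province, discipline, count).
rows = [
    ("0-20K", "B.C.", "Database", 3),
    ("0-20K", "Ontario", "Database", 2),
    ("20-40K", "B.C.", "Comp_Method", 4),
    ("20-40K", "Prairies", "Database", 1),
]


def build_cube(rows, n_dims):
    """Aggregate the measure into every cell, where each dimension is either
    kept at its base value or rolled up to "All"."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(
                "All" if rolled else value for value, rolled in zip(dims, mask)
            )
            cube[cell] += measure
    return dict(cube)


cube = build_cube(rows, len(DIMENSIONS))
print(cube[("All", "All", "All")])           # grand total
print(cube[("All", "B.C.", "Comp_Method")])  # all amounts, one province and discipline
```

Once the cube is built, any roll-up is a dictionary lookup, which is the trade of storage for response time that OLAP systems rely on.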

  21. OLAP: On-Line Analytical Processing
      • A multidimensional, LOGICAL view of the data.
      • Interactive analysis of the data: drill, pivot, slice_dice, filter.
      • Summarization and aggregations at every dimension intersection.
      • Retrieval and display of data in 2-D or 3-D crosstabs, charts, and graphs, with easy pivoting of the axes.
      • Analytical modeling: deriving ratios, variance, etc., involving measurements or numerical data across many dimensions.
      • Forecasting, trend analysis, and statistical analysis.
      • Requirement: quick response to OLAP queries.
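
The slice and pivot operations listed on slide 21 can be sketched on a handful of fact records. Everything here (the `facts` list, the dimension names `region`, `product`, `year`, and the helper names) is hypothetical and only meant to show what the operations do to a crosstab.

```python
from collections import defaultdict

# Hypothetical fact records with three dimensions and one measure.
facts = [
    {"region": "North", "product": "tea", "year": 1996, "sales": 50},
    {"region": "North", "product": "coffee", "year": 1996, "sales": 30},
    {"region": "South", "product": "tea", "year": 1996, "sales": 20},
    {"region": "South", "product": "tea", "year": 1997, "sales": 70},
]


def slice_facts(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim] == value]


def crosstab(facts, row_dim, col_dim, measure="sales"):
    """Sum the measure at every (row, column) intersection."""
    table = defaultdict(lambda: defaultdict(int))
    for f in facts:
        table[f[row_dim]][f[col_dim]] += f[measure]
    return {row: dict(cols) for row, cols in table.items()}


by_region = crosstab(facts, "region", "product")
pivoted = crosstab(facts, "product", "region")  # pivot: swap the axes
only_1996 = crosstab(slice_facts(facts, "year", 1996), "region", "product")
print(by_region)
print(pivoted)
print(only_1996)
```

Pivoting changes only the presentation, while slicing changes which facts contribute to each cell.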

  22. OLAP Architecture
      • Logical architecture:
        • OLAP view: a multidimensional and logical presentation of the data in the data warehouse/mart to the business user.
        • Data store technology: the technology options for how and where the data is stored.
      • Three service components:
        • data store services,
        • OLAP services, and
        • user presentation services.
      • Two data store architectures:
        • Multidimensional data store: multidimensional OLAP (MOLAP).
        • Relational data store: relational OLAP (ROLAP).

  23. Dimension Browsing
      [Screenshots: browsing the Product and Location dimensions]

  24. Decision Support with Data Warehouse
      Ad hoc queries: Q: How many customers do we have in London? A: 32776

  25. Report and Spreadsheet

  26. OLAP: Q: What are the sales figures for Y in the different regions?

  27. Statistics: Q: Is there a relation between age and buying behaviour? A: Older clients buy more.

  28. Data Mining: Q: What factors influence buying behaviour?
      [Decision tree diagram: a root split on Age (Young, Middle, Old), further splits on Hair color and Wage, and Y/N leaves]
      • A1: Young men in sports cars buy 3 times as much audio equipment (clustering/regression).
      • A2: Older women with dark hair more often buy rinse (classification).
      • A3: Buyers of cars are also the buyers of houses (association).
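
Answer A3 on slide 28 is an association rule. A minimal sketch of the two standard measures for such a rule, support and confidence, is shown below; the five purchase records are invented purely for illustration.

```python
# Hypothetical purchase histories, one set of items per customer.
baskets = [
    {"car", "house"},
    {"car", "house", "boat"},
    {"car"},
    {"house"},
    {"car", "house"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"car", "house"}, baskets)
rule_confidence = confidence({"car"}, {"house"}, baskets)
print(round(rule_support, 2), round(rule_confidence, 2))
```

In this toy data three of the five customers bought both items and three of the four car buyers also bought a house, so the rule "car implies house" has support 0.6 and confidence 0.75.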

  29. Example Data Mining Applications
      • Commercial:
        • Fraud detection: identify fraudulent transactions
        • Loan approval: establish the creditworthiness of a customer requesting a loan
        • Investment analysis: predict a portfolio's return on investment
        • Marketing and sales data analysis: identify potential customers; establish the effectiveness of a sales campaign
      • Medical:
        • Drug effect analysis: learn drug effects from patient records
        • Disease causality analysis
      • Political policy:
        • Election policy: people’s voting patterns
        • Social policy: tax/benefit policy
      • Manufacturing:
        • Manufacturing process analysis: identify the causes of manufacturing problems
        • Experiment result analysis: summarise experiment results and create predictive models

  30. Scientific data analysis: cataloguing in surveys, basic processing needed before higher-level science analysis can occur, and scientific discovery over large data sets.
      [Diagram labels: Theory; Experiments; Simulation; Numerical Computing (Iterative Equation Solving); Data Assimilation (Data Warehousing); Data Mining (Statistical Computing and Machine Learning)]
      • Numerical computing: simulating real-world systems based on the underlying theory
      • Data assimilation: comprehending, consolidating and warehousing the simulation/experiment data
      • Data mining: analysing the warehoused simulation/experiment data for knowledge discovery

  31. Related Fields:
      • Machine learning: inductive reasoning
      • Statistics: sampling, statistical inference, error estimation
      • Pattern recognition: neural networks, clustering
      • Knowledge acquisition, statistical expert systems
      • Data visualisation
      • Databases: OLAP, parallel DBMS, deductive databases
      • Data warehousing: collection and cleaning of transactional data for on-line retrieval
