1 / 56

Introduction to KDD

Introduction to KDD. Evolution of Database Technology. 1950s: First computers, use of computers for census. 1960s: Data collection, database creation (hierarchical and network models) 1970s: Relational data model, relational DBMS implementation

Download Presentation

Introduction to KDD

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to KDD

  2. Evolution of Database Technology • 1950s: First computers, use of computers for census. • 1960s: Data collection, database creation (hierarchical and network models) • 1970s: Relational data model, relational DBMS implementation • 1980s: Ubiquitous RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc) • 1990s: Data mining and data warehousing, massive media digitization, multimedia databases and Web technology

  3. Data Rich but Information Poor Databases are too big. Data Mining can help discover knowledge Terrorbytes

  4. What should we do? • We are not trying to find the needle in the hay stock because DBMSs know how to do that. • We are merely trying to understand the consequences of the presence of the needle if it exists.

  5. What Led Us To This? • Necessity is the Mother of Invention • Technology is available to help us collect data. • Barcode, scanners, satellites, cameras, etc. • Technology is available to help us store data. • Database, data warehouses, variety of responses • We are starving for knowledge (competitive edge, research, etc). • We are swamped by data that continually pour on us. • We do not know what to do with this data. • We need to interpret this data in search for new knowledge.

  6. What is our Need? Data Extract interesting knowledge (rules, regularities, patterns, constraints) from data in large collections. Knowledge

  7. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  8. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  9. Data Collected (SOURCE) • Business transactions • Scientific data • Medical and personal data • Surveillance video and pictures • Satellite sensing • Games

  10. Data Collected (Format) • Digital Media • CAD and Software Engineering • Virtual Worlds • Text Reports and Memos • The World Wide Web

  11. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  12. Knowledge Discovery Process of non-trivial extraction of implicit, previously unknown and potentially useful information from large collections of data.

  13. Many Steps in KDD Process • Gathering the data together. • Cleanse the data and fit it together. • Select the necessary data. • Crunch and squeeze the data to extract the essence of it. • Evaluate the output and use it.

  14. So What is Data Mining? • In theory, Data Mining is a STEP in the knowledge discovery process. It is the extraction of implicit information from a large data set. • In practice, data mining and knowledge discovery are becoming synonymous.

  15. So What is Data Mining? • There are other equivalent terms: KDD, knowledge extraction, discovery of regularities, patterns discovery, data archeology, data dredging, business intelligence, information harvesting • Shouldn’t it be called knowledge mining instead of data mining?

  16. Data Mining: A KDD Process Knowledge Pattern Evaluation Data mining: the core of knowledge discovery process. Data Mining Task-relevant Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  17. Data Mining: A KDD Process Knowledge Pattern Evaluation Data Mining Task-relevant Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  18. Data Mining: A KDD Process Knowledge Multiple data sources, often heterogeneous, may be combined in a common source Pattern Evaluation Data Mining Task-relevant Data Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  19. Data Mining: A KDD Process Knowledge Noise data and irrelevant data are removed from the collections Pattern Evaluation Data Mining Task-relevant Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  20. Data Mining: A KDD Process Data relevant to the analysis is decided on and retrieved from the data collection Knowledge Pattern Evaluation Data Mining Task-relevant Data Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  21. Data Mining: A KDD Process Selected data is transformed into forms appropriate for the mining procedure Knowledge Pattern Evaluation Data Mining Task-relevant Data Data Selection and Data transformation Data Warehouse Data Cleaning Data Integration Databases

  22. Data Mining: A KDD Process Clever techniques are applied to extract patterns potentially useful Knowledge Pattern Evaluation Data Mining Task-relevant Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  23. Data Mining: A KDD Process Strictly interesting patterns representing knowledge are identified based on given measures Knowledge Pattern Evaluation Data Mining Task-relevant Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  24. Data Mining: A KDD Process Knowledge Knowledge representation: Uses visualization techniques to help users understand and interpret the data mining results Pattern Evaluation Data Mining Task-relevant Data Selection and transformation Data Warehouse Data Cleaning Data Integration Databases

  25. KDD is an Iterative Process

  26. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  27. What Kind of Data Can be Mined? • Flat Files • Heterogeneous and legacy databases • Relational databases And other DB: Object-oriented and object-relational data bases • Transactional databases Transaction(TID, Timestamp, {item1, items,..})

  28. What Kind of Data Can be Mined? Data warehouses

  29. What Kind of Data Can be Mined? Multimedia Databases

  30. What Kind of Data Can be Mined? Spatial Databases

  31. What Kind of Data Can be Mined? Time-Series Databases

  32. What Kind of Data Can be Mined? Text Documents

  33. What Kind of Data Can be Mined? World Wide Web • Components of WWW: • Content of the web • Structure of the web • Usage of the web

  34. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  35. What can be discovered? • What can be discovered depends on the data mining task employed. • Descriptive DM • Describe general properties • Predictive DM Tasks • Infer on available data

  36. Data Mining Functionality • Characterization Summarization of general features of objects in a target class (concept description) Ex: Characterize graduate students in Science • Discrimination Comparison of general features of objects between a target class and a contrasting class (concept description) Ex: Compare students is Science and students in Arts

  37. Data Mining Functionality • Association Studies the frequency of items occurring together in transactional databases Ex: buys(x, bread)  buys(x, milk) • Prediction Predicts some unknown or missing attribute values based on other information Ex: Forecast the sale value for next week based on available data.

  38. Data Mining Functionality • Classification Organizes data in given classes based on attribute values (supervised classification) Ex: classify students based on final results • Clustering Organizes data in classes based on attribute values (unsupervised classification) Ex: group crime locations to find distribution patterns.

  39. Data Mining Functionality • Outlier analysis Identifies and explains exceptions (surprises) • Time-series analysis: Analyzes trends and deviations, regression, sequential pattern, similar sequences

  40. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  41. Is all that is Discovered Interesting? • A data mining operation may generate thousands of patterns, not all of them are interesting. • Data Mining results are sometimes so large that we may need to mine it too (Meta-Mining?) How to measure? INTERESTINGNESS

  42. Interestingness • Objective interestingness measures: • Based on statistics and structure of patterns, e.g., support, confidence, etc. • Subjective interestingness measures: • Based on user’s beliefs in data, e.g., unexpectedness, novelty

  43. Interestingness Measures: A pattern is interesting if it is: • Easily understood by humans • Valid on new or test data with some degree of certainty • Potentially useful • Novel, or validates some hypothesis that a user seeks to confirm

  44. Introduction: Outline • What kind of information are we collecting? • What are Data Mining and Knowledge Discovery? • What kind of data can be mined? • What can be discovered? • Is all that is discovered interesting and useful? • Are there application examples?

  45. Potential and/or Successful Applications • Business data analysis and decision support • Marketing focalization • Recognizing specific market segments that respond to particular characteristics • Return on mailing campaign (target marketing) • Customer profiling • Segmentation of customer for marketing strategies and/or product offerings • Customer behavior understanding • Customer retention and loyalty

  46. Potential and/or Successful Applications • Business data analysis and decision support • Market analysis and management • Provide summary information for decision-making • Market basket analysis, cross selling, market segmentation • Risk analysis and management • What if analysis • Forecasting • Pricing analysis, competitive analysis • Time-series analysis (ex. Stock market)

  47. Potential and/or Successful Applications • Fraud detection • Detection of telephone fraud -Telephone call model: destination of the call, duration, time of the day or week. Analyze patterns that deviate from an expected norm. • Detecting automotive and health insurance fraud • Detection of credit-card fraud • Detecting suspicious money transactions (money laundering)

More Related