
Data Mining Demystified

Data Mining Demystified. John Aleshunas Fall Faculty Institute October 2006. Prediction is very hard, especially when it's about the future. - Yogi Berra. Data Mining Stories.

Presentation Transcript


  1. Data Mining Demystified John Aleshunas Fall Faculty Institute October 2006

  2. Prediction is very hard, especially when it's about the future. - Yogi Berra

  3. Data Mining Stories • “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection • The NSA is using data mining to analyze telephone call data to track al-Qaeda activities • Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores

  4. Preview • Why data mining? • Example data sets • Data mining methods • Example application of data mining • Social issues of data mining

  5. Why Data Mining? • Database systems have been around since the 1970s • Organizations have a vast digital history of the day-to-day pieces of their processes • Simple queries no longer provide satisfying results • They take too long to execute • They cannot help us find new opportunities Source: Han

  6. Why Data Mining? • Data doubles about every year while useful information seems to be decreasing • Vast data stores overload traditional decision making processes • We are data rich, but information poor Source: Han

  7. Data Mining: a definition Simply stated, data mining refers to the extraction of knowledge from large amounts of data.

  8. Data Mining Models: A Taxonomy • Predictive: Classification, Regression, Time Series Analysis, Prediction • Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery Source: Dunham

  9. Example Datasets • Iris • Wine • Diabetes

  10. Iris Dataset • Published by R.A. Fisher (1936) • 150 instances • Three species (Setosa, Virginica, Versicolor), 50 instances each • 4 measurements (petal width, petal length, sepal width, sepal length) • One species (Setosa) is easily separable, the others are not – noisy data Source: Fisher

  11. Iris Dataset Analysis

  12. Wine Dataset • This data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. • 178 instances with 13 constituents measured in each of the three types of wine. Source: UCI Machine Learning Repository

  13. Wine Dataset Analysis

  14. Diabetes Dataset • Data is based on a population of women of Pima Indian heritage who were at least 21 years old and living near Phoenix in 1990 • 768 instances • 9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes) • Dataset has many missing values; only 532 instances are complete Source: UCI Machine Learning Repository

  15. Diabetes Dataset Analysis

  16. Classification • Classification builds a model using a training dataset with known classes of data • That model is used to classify new, unknown data into those classes

  17. Classification Techniques • K-Nearest Neighbors • Decision Tree Classification (ID3, C4.5)

  18. K-Nearest Neighbors Example [Figure: points labeled A and B, with an unknown point X to be classified] • Easy to explain • Simple to implement • Sensitive to the selection of the classification population • Not always conclusive for complex data
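The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the presentation's results: the function name `knn_classify`, the toy coordinates, and the choice of Euclidean distance with k = 3 are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """Return the majority class among the k training points nearest to query.

    training is a list of (point, label) pairs; each point and the query are
    equal-length tuples of numbers.
    """
    # Rank the labelled points by Euclidean distance to the query point.
    nearest = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    # Each of the k neighbours casts one vote for its own class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data echoing the slide: classes A and B, unknown point X.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
x = (2.5, 2.0)
label = knn_classify(training, x, k=3)
```

Changing `k` or the set of labelled points can change the vote, which is the sensitivity to the classification population noted on the slide.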

  19. K-Nearest Neighbors Example Source: Indelicato

  20. Decision Tree Example (C4.5) • C4.5 is a decision tree generating algorithm, based on the ID3 algorithm. It contains several improvements, especially needed for software implementation. • Choice of best splitting attribute is based on an entropy calculation. • These improvements include: • Choosing an appropriate attribute selection measure. • Handling training data with missing attribute values. • Handling attributes with differing costs. • Handling continuous attributes.
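The entropy calculation behind the choice of splitting attribute can be sketched as follows. This is a simplified illustration for categorical attributes only: information gain is the ID3 criterion, and the gain ratio shown here is the normalized measure usually associated with C4.5. The function names and the four-row toy table are assumptions made for the example, not data from the presentation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of class labels or attribute values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attribute):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute takes a single value, so it cannot split
    return information_gain(rows, labels, attribute) / split_info


# Hypothetical toy table: "petal" separates the classes, "sepal" does not.
rows = [
    {"petal": "short", "sepal": "wide"},
    {"petal": "short", "sepal": "narrow"},
    {"petal": "long", "sepal": "wide"},
    {"petal": "long", "sepal": "narrow"},
]
labels = ["setosa", "setosa", "other", "other"]
best = max(["petal", "sepal"], key=lambda a: gain_ratio(rows, labels, a))
```

A tree builder would pick `best` as the root test and repeat the calculation on each partition.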

  21. Decision Tree Example (C4.5) • Iris dataset: Accuracy 97.67% • Wine dataset: Accuracy 86.7% Source: Siedler

  22. Decision Tree Example (C4.5) Diabetes dataset • C4.5 produces a complex tree (195 nodes) • The simplified (pruned) tree reduces the classification accuracy

  23. Association Rules Association rules are used to show the relationships between data items. Purchasing one product when another product is purchased is an example of an association rule. They do not represent any causality or correlation.

  24. Association Rule Techniques • Market Basket Analysis • Terminology • Transaction database • Association rule – implication {A, B} ═> {C} • Support - % of transactions in which {A, B, C} occurs • Confidence – ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}
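Support and confidence as defined above can be computed directly from a transaction database. The sketch below is illustrative only: the function names and the five toy transactions are assumptions, and the rule checked is the slide's {A, B} ═> {C}.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical transaction database with five market baskets.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]
rule_support = support(transactions, {"A", "B", "C"})          # 2 of 5 baskets
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})  # 2 of the 3 with {A, B}
```

Note that `confidence` divides by the antecedent's support, so it is only defined for antecedents that occur at least once in the database.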

  25. Association Rule Example: 1984 United States Congressional Voting Records Database Attribute Information: 1. Class Name: 2 (democrat, republican) 2. handicapped-infants: 2 (y,n) 3. water-project-cost-sharing: 2 (y,n) 4. adoption-of-the-budget-resolution: 2 (y,n) 5. physician-fee-freeze: 2 (y,n) 6. El-Salvador-aid: 2 (y,n) 7. religious-groups-in-schools: 2 (y,n) 8. anti-satellite-test-ban: 2 (y,n) 9. aid-to-Nicaraguan-contras: 2 (y,n) 10. MX-missile: 2 (y,n) 11. immigration: 2 (y,n) 12. synfuels-corporation-cutback: 2 (y,n) 13. education-spending: 2 (y,n) 14. superfund-right-to-sue: 2 (y,n) 15. crime: 2 (y,n) 16. duty-free-exports: 2 (y,n) 17. export-administration-act-south-africa: 2 (y,n) Rules: • {budget resolution = no, MX-missile = no, aid to El Salvador = yes} ═> {Republican}, confidence 91.0% • {budget resolution = yes, MX-missile = yes, aid to El Salvador = no} ═> {Democrat}, confidence 97.5% • {crime = yes, right-to-sue = yes, physician fee freeze = yes} ═> {Republican}, confidence 93.5% • {crime = no, right-to-sue = no, physician fee freeze = no} ═> {Democrat}, confidence 100.0% Source: UCI Machine Learning Repository

  26. Clustering Clustering is similar to classification in that data are grouped. Unlike classification, the groups are not predefined; they are discovered. Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.

  27. Clustering Techniques • K-Means Clustering • Neural Network Clustering (SOM)

  28. K-Means Example • The K-Means algorithm is a method to cluster objects based on their attributes into k partitions. • It assumes that the k clusters exhibit normal distributions. • The objective it tries to achieve is to minimize the variance within the clusters.
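One common way to pursue that objective is Lloyd's algorithm, which alternates between assigning points to the nearest mean and recomputing the means. The sketch below assumes that variant; the function name `k_means`, the toy points, and the hand-picked starting centroids are illustrative and not taken from the presentation.

```python
from math import dist


def k_means(points, centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and mean-update steps.

    points and centroids are lists of equal-length tuples. Returns the final
    centroids and, for each point, the index of the cluster it belongs to.
    """
    centroids = list(centroids)
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break  # no centroid moved, so the clustering has converged
        centroids = updated
    return centroids, assignment


# Hypothetical toy data: two visibly separate groups, k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, so practical implementations usually run the algorithm from several initializations and keep the clustering with the lowest within-cluster variance.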

  29. K-Means Example [Figure: a dataset partitioned into Cluster 1, Cluster 2, and Cluster 3 around Mean 1, Mean 2, and Mean 3]

  30. K-Means Example • Iris dataset, only the petal width attribute: Accuracy 95.33% • Iris dataset, all attributes: Accuracy 66.0% • Iris dataset, all attributes: Accuracy 90.67%

  31. Self-Organizing Map Example • The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map. • SOM is especially good for visualizing high-dimensional data. • SOM maps input vectors onto a two-dimensional grid of nodes. • Nodes that are close together have similar attribute values and nodes that are far apart have different attribute values.
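The mapping of input vectors onto a grid of nodes can be sketched with a small training loop. This is a minimal illustration under stated assumptions: a rectangular grid, a Gaussian neighbourhood, and exponential decay of the learning rate and radius are common choices but are not confirmed by the presentation, and every name and parameter value below is hypothetical.

```python
import random
from math import dist, exp


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vector):
    """Grid coordinates of the node whose weights are closest to vector."""
    return min(weights, key=lambda node: squared_distance(weights[node], vector))


def train_som(data, rows, cols, epochs=200, learning_rate=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight list}."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius = max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        # Shrink the learning rate and neighbourhood radius as training proceeds.
        decay = exp(-epoch / epochs)
        rate = learning_rate * decay
        sigma = radius * decay
        for vector in rng.sample(data, len(data)):
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                # Nodes close to the winner on the grid are pulled hardest.
                influence = exp(-dist(node, bmu) ** 2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


# Hypothetical toy data: two groups of 2-D vectors scaled to [0, 1].
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som(data, rows=3, cols=3)
positions = {vector: best_matching_unit(som, vector) for vector in data}
```

After training, `positions` gives the grid cell each input lands on, which is the basis for the map visualizations on the following slides.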

  32. Self-Organizing Map Example [Figure: input vectors with components X, Y, and Z]

  33. Self-Organizing Map Example Iris Data

  34. Self-Organizing Map Example Wine Data

  35. Self-Organizing Map Example Diabetes Data

  36. NFL Quarterback Analysis • Data from 2005 for 42 NFL quarterbacks • Preprocessed data to normalize for a full 16 game regular season • Used SOM to cluster individuals based on performance and descriptive data Source: McKee

  37. NFL Quarterback Analysis The SOM Map Source: McKee

  38. NFL Quarterback Analysis QB Passing Rating Overall Clustering Source: McKee

  39. NFL Quarterback Analysis The SOM Map Source: McKee

  40. Data Mining Stories - Revisited • Credit card fraud detection • NSA telephone network analysis • Supply chain management

  41. Social Issues of Data Mining • Impacts on personal privacy and confidentiality • Classification and clustering are similar to profiling • Association rules resemble logical implications • Data mining is an imperfect process subject to interpretation

  42. Conclusion • Why data mining? • Example data sets • Data mining methods • Example application of data mining • Social issues of data mining

  43. What on earth would a man do with himself if something did not stand in his way? - H.G. Wells I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”

  44. References • Dunham, Margaret, Data Mining: Introductory and Advanced Topics, Pearson Education, Inc., 2003 • Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188, 1936 • Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006 • Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004 • McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006 • Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J., UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science, 1998 • Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004
