Data Mining Demystified
John Aleshunas
Fall Faculty Institute, October 2006
Prediction is very hard, especially when it's about the future. - Yogi Berra
Data Mining Stories
• “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection
• The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
• Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores
Preview
• Why data mining?
• Example data sets
• Data mining methods
• Example application of data mining
• Social issues of data mining
Why Data Mining?
• Database systems have been around since the 1970s
• Organizations have a vast digital history of the day-to-day pieces of their processes
• Simple queries no longer provide satisfying results
  • They take too long to execute
  • They cannot help us find new opportunities
Source: Han
Why Data Mining?
• Data doubles about every year, while useful information seems to be decreasing
• Vast data stores overload traditional decision-making processes
• We are data rich, but information poor
Source: Han
Data Mining: a definition
Simply stated, data mining refers to the extraction of knowledge from large amounts of data.
Data Mining Models: A Taxonomy
• Predictive: classification, regression, time series analysis, prediction
• Descriptive: clustering, summarization, association rules, sequence discovery
Source: Dunham
Example Datasets
• Iris
• Wine
• Diabetes
Iris Dataset
• Created by R.A. Fisher (1936)
• 150 instances
• Three cultivars (Setosa, Virginica, Versicolor), 50 instances each
• 4 measurements (petal width, petal length, sepal width, sepal length)
• One cultivar (Setosa) is easily separable; the other two are not - noisy data
Source: Fisher
Wine Dataset
• The result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars
• 178 instances, with 13 constituents measured in each of the three types of wines
Source: UCI Machine Learning Repository
Diabetes Dataset
• Data from a population of women of Pima Indian heritage who were at least 21 years old and living near Phoenix in 1990
• 768 instances
• 9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)
• The dataset has many missing values; only 532 instances are complete
Source: UCI Machine Learning Repository
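As a practical aside, the first two datasets ship with scikit-learn and can be loaded directly; a minimal sketch, assuming scikit-learn is installed (the Pima diabetes data is not bundled with scikit-learn and must be obtained separately, e.g. from the UCI Machine Learning Repository cited in the references):

```python
# Minimal sketch: load the Iris and Wine datasets from scikit-learn's
# bundled copies. The Pima diabetes data is not bundled with scikit-learn.
from sklearn.datasets import load_iris, load_wine

iris = load_iris()
print(iris.data.shape, list(iris.target_names))  # (150, 4) ['setosa', 'versicolor', 'virginica']

wine = load_wine()
print(wine.data.shape, len(wine.target_names))   # (178, 13) 3
```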
Classification
• Classification builds a model from a training dataset whose classes are known
• That model is then used to classify new, unknown data into those classes
Classification Techniques
• K-Nearest Neighbors
• Decision Tree Classification (ID3, C4.5)
K-Nearest Neighbors Example
[Figure: a point X of unknown class surrounded by labeled points of classes A and B]
• Easy to explain
• Simple to implement
• Sensitive to the selection of the classification population
• Not always conclusive for complex data
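A minimal k-nearest-neighbors sketch in Python, assuming scikit-learn; the choice of k = 5 and the 70/30 split are illustrative assumptions, not values from the example above:

```python
# Sketch: classify Iris flowers by majority vote among the 5 nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an arbitrary choice here
knn.fit(X_train, y_train)                  # "training" just stores the points
print(f"Test accuracy: {knn.score(X_test, y_test):.2%}")
```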
K-Nearest Neighbors Example Source: Indelicato
Decision Tree Example (C4.5)
• C4.5 is a decision-tree-generating algorithm based on the ID3 algorithm, with several improvements needed for practical software implementation
• The best splitting attribute is chosen using an entropy calculation
• The improvements include:
  • Choosing an appropriate attribute selection measure
  • Handling training data with missing attribute values
  • Handling attributes with differing costs
  • Handling continuous attributes
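C4.5 itself is not available in scikit-learn, but a rough sketch of entropy-based decision tree induction is possible with its CART implementation; the depth limit and split below are illustrative assumptions, and the resulting tree will not match C4.5's output exactly:

```python
# Sketch: an entropy-based decision tree on Iris. scikit-learn implements
# CART, a relative of C4.5; criterion="entropy" keeps the split measure
# comparable to the one described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.2%}")
print(export_text(tree, feature_names=iris.feature_names))  # readable rules
```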
Decision Tree Example (C4.5)
• Iris dataset: accuracy 97.67%
• Wine dataset: accuracy 86.7%
Source: Seidler
Decision Tree Example (C4.5): Diabetes dataset
• C4.5 produces a complex tree (195 nodes)
• The simplified (pruned) tree reduces the classification accuracy
Association Rules
Association rules are used to show the relationships between data items. Purchasing one product when another product is purchased is an example of an association rule. They do not represent any causality or correlation.
Association Rule Techniques
• Market Basket Analysis
• Terminology:
  • Transaction database
  • Association rule: an implication {A, B} ⇒ {C}
  • Support: the percentage of transactions in which {A, B, C} occurs
  • Confidence: the ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}
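A small worked sketch of these two measures for the rule {A, B} ⇒ {C}; the five-transaction database is invented purely for illustration:

```python
# Toy transaction database, invented for this example.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"A", "B"}, {"C"}
supp = support(antecedent | consequent)  # {A, B, C} occurs in 2 of 5 baskets
conf = supp / support(antecedent)        # of 3 {A, B} baskets, 2 also have C
print(f"support = {supp:.2f}, confidence = {conf:.2f}")  # 0.40, 0.67
```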
Association Rule Example
1984 United States Congressional Voting Records Database
Attribute Information (all two-valued):
1. Class Name (democrat, republican)
2. handicapped-infants (y, n)
3. water-project-cost-sharing (y, n)
4. adoption-of-the-budget-resolution (y, n)
5. physician-fee-freeze (y, n)
6. El-Salvador-aid (y, n)
7. religious-groups-in-schools (y, n)
8. anti-satellite-test-ban (y, n)
9. aid-to-Nicaraguan-contras (y, n)
10. MX-missile (y, n)
11. immigration (y, n)
12. synfuels-corporation-cutback (y, n)
13. education-spending (y, n)
14. superfund-right-to-sue (y, n)
15. crime (y, n)
16. duty-free-exports (y, n)
17. export-administration-act-south-africa (y, n)
Rules:
• {budget resolution = no, MX-missile = no, aid to El Salvador = yes} ⇒ {Republican}, confidence 91.0%
• {budget resolution = yes, MX-missile = yes, aid to El Salvador = no} ⇒ {Democrat}, confidence 97.5%
• {crime = yes, right-to-sue = yes, physician fee freeze = yes} ⇒ {Republican}, confidence 93.5%
• {crime = no, right-to-sue = no, physician fee freeze = no} ⇒ {Democrat}, confidence 100.0%
Source: UCI Machine Learning Repository
Clustering
Clustering is similar to classification in that data are grouped. Unlike classification, the groups are not predefined; they are discovered. Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.
Clustering Techniques
• K-Means Clustering
• Neural Network Clustering (SOM)
K-Means Example
• The K-Means algorithm is a method for clustering objects into k partitions based on their attributes
• It assumes that the k clusters exhibit normal distributions
• Its objective is to minimize the variance within the clusters
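A minimal K-Means sketch on the Iris data, assuming scikit-learn; choosing k = 3 to match the three cultivars is an assumption, and the adjusted Rand index (not used in these slides) is one way to score agreement between clusters and true classes:

```python
# Sketch: cluster Iris into k = 3 partitions and compare against the cultivars.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
print(f"Agreement with cultivars (ARI): {adjusted_rand_score(y, km.labels_):.2f}")
```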
K-Means Example
[Figure: a dataset partitioned into Cluster 1, Cluster 2, and Cluster 3, each centered on Mean 1, Mean 2, and Mean 3]
K-Means Example
• Iris dataset, only the petal width attribute: accuracy 95.33%
• Iris dataset, all attributes: accuracy 66.0%
• Iris dataset, all attributes: accuracy 90.67%
Self-Organizing Map Example
• The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map
• SOM is especially good for visualizing high-dimensional data
• SOM maps input vectors onto a two-dimensional grid of nodes
• Nodes that are close together have similar attribute values; nodes that are far apart have different attribute values
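A from-scratch sketch of the SOM training loop in NumPy, run on the Iris data; the 8x8 grid, learning-rate schedule, and neighborhood width are illustrative assumptions, not values from the examples that follow:

```python
# Sketch: train a small SOM on standardized Iris data with plain NumPy.
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
X = load_iris().data
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each attribute

rows, cols, dim = 8, 8, X.shape[1]            # 8x8 grid is an arbitrary size
weights = rng.normal(size=(rows, cols, dim))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                            indexing="ij"), axis=-1)

n_iter = 2000
for t in range(n_iter):
    x = X[rng.integers(len(X))]
    # Best-matching unit: the node whose weight vector is closest to the input
    bmu = np.unravel_index(
        np.linalg.norm(weights - x, axis=-1).argmin(), (rows, cols))
    # Learning rate and neighborhood radius both decay over time
    lr = 0.5 * (1 - t / n_iter)
    sigma = max(0.5, 3.0 * (1 - t / n_iter))
    # Gaussian neighborhood on the 2-D grid pulls nearby nodes toward x
    grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
    weights += lr * h[..., None] * (x - weights)

# Each flower's best-matching node gives its position on the 2-D map
def bmu_of(x):
    return np.unravel_index(
        np.linalg.norm(weights - x, axis=-1).argmin(), (rows, cols))

print(bmu_of(X[0]), bmu_of(X[50]), bmu_of(X[100]))  # one flower per cultivar
```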
Self-Organizing Map Example
[Figure: three-dimensional input vectors (X, Y, Z) mapped onto the two-dimensional SOM grid]
Self-Organizing Map Example: Iris Data
Self-Organizing Map Example: Wine Data
Self-Organizing Map Example: Diabetes Data
NFL Quarterback Analysis
• Data from 2005 for 42 NFL quarterbacks
• Preprocessed the data to normalize for a full 16-game regular season
• Used a SOM to cluster individuals based on performance and descriptive data
Source: McKee
NFL Quarterback Analysis The SOM Map Source: McKee
NFL Quarterback Analysis: QB Passing Rating and Overall Clustering
Source: McKee
NFL Quarterback Analysis The SOM Map Source: McKee
Data Mining Stories - Revisited
• Credit card fraud detection
• NSA telephone network analysis
• Supply chain management
Social Issues of Data Mining
• Impacts on personal privacy and confidentiality
• Classification and clustering are similar to profiling
• Association rules resemble logical implications
• Data mining is an imperfect process, subject to interpretation
Conclusion
• Why data mining?
• Example data sets
• Data mining methods
• Example application of data mining
• Social issues of data mining
What on earth would a man do with himself if something did not stand in his way? - H.G. Wells

I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”
References
• Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
• Fisher, R.A., “The Use of Multiple Measurements in Taxonomic Problems”, Annals of Eugenics 7, pp. 179-188, 1936
• Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
• Indelicato, Nicolas, “Analysis of the K-Nearest Neighbors Algorithm”, MATH 4500: Foundations of Data Mining, 2004
• McKee, Kevin, “The Self Organized Map Applied to 2005 NFL Quarterbacks”, MATH 4200: Data Mining Foundations, 2006
• Newman, D.J., Hettich, S., Blake, C.L., & Merz, C.J., UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science, 1998
• Seidler, Toby, “The C4.5 Project: An Overview of the Algorithm with Results of Experimentation”, MATH 4500: Foundations of Data Mining, 2004