University of Rhode Island, Department of Computer Science and Statistics, March 30, 2007
An Overview and Example of Data Mining
Daniel T. Larose, Ph.D., Professor of Statistics; Director, Data Mining @CCSU; Editor, Wiley Series on Methods and Applications in Data Mining
larosed@ccsu.edu
www.math.ccsu.edu/larose
URI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - Larose
Overview
• Part One: A Brief Overview of Data Mining
• Part Two: An Example of Data Mining: Modeling Response to Direct Mail Marketing
• But first, a shameless plug …
Master of Science in DM at CCSU: Faculty
• Dr. Roger Bilisoly (from Ohio State Univ., Statistics): Text Mining, Intro to Data Mining
• Dr. Darius Dziuda (from Warsaw Polytechnic Univ., CS): Data Mining for Genomics and Proteomics, Biomarker Discovery
• Dr. Zdravko Markov (from Sofia Univ., CS): Data Mining (CS perspective), Machine Learning
• Dr. Daniel Miller (from UConn, Statistics): Applied Multivariate Analysis, Mathematical Statistics II, Intro to Data Mining
• Dr. Krishna Saha (from Univ. of Windsor, Statistics): Intro to Data Mining using R
• Dr. Daniel Larose (Program Director) (from UConn, Statistics): Intro to Data Mining, Data Mining Methods, Applied Data Mining, Web Mining
Master of Science in DM at CCSU: Program (36 credits)
• Core Courses (27 credits). All available online.
  • Stat 521 Introduction to Data Mining (4 cr)
  • Stat 522 Data Mining Methods (4 cr)
  • Stat 523 Applied Data Mining (4 cr)
  • Stat 525 Web Mining
  • Stat 526 Data Mining for Genomics and Proteomics
  • Stat 527 Text Mining
  • Stat 416 Mathematical Statistics II
  • Stat 570 Applied Multivariate Analysis
• Electives (6 credits; choose two)
  • CS 570 Topics in Artificial Intelligence: Machine Learning
  • CS 580 Topics in Advanced Database: Data Mining
  • Stat 455 Experimental Design
  • Stat 551 Applied Stochastic Processes
  • Stat 567 Linear Models
  • Stat 575 Mathematical Statistics III
  • Stat 529 Current Issues in Data Mining
• Capstone Requirement: Stat 599 Thesis (3 credits)
Master of Science in DM at CCSU
• Only MS in DM that is entirely online.
• Some courses available on campus.
• Students must come to CCSU to present their thesis.
• We reach students in about 30 US states and a dozen foreign countries.
• Half of our students already have master’s degrees.
• About 15% already have Ph.D.’s.
• Typical student is a mid-career professional.
• Backgrounds are diverse: Computer Science, Engineering, Finance, Chemistry, Database Admin, Statistics, etc.
• www.ccsu.edu/datamining
Graduate Certificate in Data Mining (18 credits)
• Required Courses (12 credits)
  • Stat 521 Introduction to Data Mining
  • Stat 522 Data Mining Methods and Models
  • Stat 523 Applied Data Mining
• Elective Courses (6 credits; choose two)
  • Stat 525 Web Mining
  • Stat 526 Data Mining for Genomics and Proteomics
  • Stat 527 Text Mining
  • Stat 529 Current Issues in Data Mining
  • Some other graduate-level data mining or statistics course, with approval of advisor.
• No Mathematical Statistics requirement.
Material for Part I Drawn From: Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005)
• Chapter 1. An Introduction to Data Mining
• Chapter 2. Data Preprocessing
• Chapter 3. Exploratory Data Analysis
• Chapter 4. Statistical Approaches to Estimation and Prediction
• Chapter 5. K-Nearest Neighbor
• Chapter 6. Decision Trees
• Chapter 7. Neural Networks
• Chapter 8. Hierarchical and K-Means Clustering
• Chapter 9. Kohonen Networks
• Chapter 10. Association Rules
• Chapter 11. Model Evaluation Techniques
Material for Part II Drawn From: Data Mining Methods and Models (Wiley, 2006)
• Chapter 1. Dimension Reduction Methods
• Chapter 2. Regression Modeling
• Chapter 3. Multiple Regression and Model Building
• Chapter 4. Logistic Regression
• Chapter 5. Naïve Bayes Classification and Bayesian Networks
• Chapter 6. Genetic Algorithms
• Chapter 7. Case Study: Modeling Response to Direct-Mail Marketing
No Material Drawn From: Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, April 2007)
• Part One: Web Structure Mining
  • Information Retrieval and Web Search
  • Hyperlink-Based Ranking
• Part Two: Web Content Mining
  • Clustering
  • Evaluating Clustering
  • Classification
• Part Three: Web Usage Mining
  • Data Preprocessing
  • Exploratory Data Analysis
  • Association Rules, Clustering, and Classification for Web Usage Mining
• With Dr. Zdravko Markov, Computer Science, CCSU
Call for Book Proposals: Wiley Series on Methods and Applications in Data Mining
• Suggested topics:
  • Data Mining in Bioinformatics
  • Emerging Techniques in Data Mining (e.g., SVM)
  • Data Mining with Evolutionary Algorithms
  • Drug Discovery Using Data Mining
  • Mining Data Streams
  • Visual Analysis in Data Mining
• Books in press:
  • Data Mining for Genomics and Proteomics, by Darius Dziuda
  • Practical Text Mining Using Perl, by Roger Bilisoly
• Contact Series Editor at larosed@ccsu.edu
What is Data Mining?
• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
• David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001.
Why Data Mining?
• “We are drowning in information but starved for knowledge.”
• John Naisbitt, Megatrends, 1984.
• “The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.”
• Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.
Need for Human Direction
• Automation is no substitute for human supervision and input.
• Humans need to be actively involved at every phase of the data mining process.
• “Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.”
• Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.
“Data Mining is Easy to Do Badly”
• Black box software
  • Powerful, “easy-to-use” data mining algorithms
  • Their power and ease of use make misuse dangerous.
  • Too easy to point and click your way to disaster.
• What is needed:
  • An understanding of the underlying algorithmic and statistical model structures.
  • An understanding of which algorithms are most appropriate in which situations and for which types of data.
CRISP-DM: Cross-Industry Standard Process for Data Mining
CRISP: DM as a Process
• Business / Research Understanding Phase: Enunciate your objectives
• Data Understanding Phase: EDA
• Data Preparation Phase: Preprocessing
• Modeling Phase: Fun and interesting!
• Evaluation Phase: Confluence of results? Objectives met?
• Deployment Phase: Use results to solve problem. If desired, use lessons learned to reformulate business / research objective.
What About Data Dredging?
• “A sufficiently exhaustive search will certainly throw up patterns of some kind. Many of these patterns will simply be a product of random fluctuations, and will not represent any underlying structure.”
• David J. Hand, Data Mining: Statistics and More? The American Statistician, May 1998.
Guarding Against Data Dredging: Cross-Validation is the Key
• Partition the data into a training set and a test set.
• If a pattern shows up in both data sets, it is less likely to represent noise.
• More generally, one may use n-fold cross-validation.
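The partitioning idea on this slide can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names, the 25% test fraction, and the fixed seed are hypothetical choices, not taken from the presentation.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def n_fold_indices(n_records, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; every record is tested exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx
```

A pattern found on the training portion would then be re-checked on the held-out portion; with the n-fold variant, the check is repeated once per fold.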
Inference and Huge Data Sets
• Hypothesis tests become extremely sensitive at the huge sample sizes prevalent in data mining applications.
• Even very tiny effects will be found statistically significant.
• So data mining tends to de-emphasize inference.
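A small worked example may make the sample-size point concrete. The sketch below assumes a one-sample z-test for a mean shift measured in standard deviations; the effect size of 0.01 and the sample sizes are hypothetical values chosen for illustration.

```python
import math

def two_sided_p_value(effect_size_sd, n):
    """Two-sided p-value of a one-sample z-test.

    effect_size_sd is the true mean shift in standard-deviation units,
    so the test statistic is z = effect_size_sd * sqrt(n).
    """
    z = effect_size_sd * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny effect, tested at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(0.01, n))
```

The effect stays the same while only n changes, yet the p-value falls from clearly non-significant to vanishingly small, which is why significance alone says little at data mining scale.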
Need for Transparency and Interpretability
• Data mining models should be transparent.
• Results should be interpretable by humans.
• Decision trees are transparent.
• Neural networks tend to be opaque.
• If a customer complains about why he/she was turned down for credit, we should be able to explain why, without saying “Our neural net said so.”
Part Two: Modeling Response to Direct Mail Marketing
Business Understanding Phase:
• Clothing Store Purchase Data
• Results of a direct mail marketing campaign
• Task: Construct a classification model
  • For classifying customers as either responders or non-responders to the marketing campaign,
  • To reduce costs and increase return-on-investment
Data Understanding: The Clothing Store Dataset
List of fields in the dataset (28,799 customers, 51 fields)
Data Preparation and EDA Phase
• Not covered in this presentation.
Modeling Strategy
• Apply principal components analysis to address multicollinearity.
• Apply cluster analysis. Briefly profile clusters.
• Balance the training data set.
• Establish baseline model performance
  • In terms of expected profit per customer contacted.
• Apply classification algorithms to training data set:
  • CART
  • C5.0 (C4.5)
  • Neural networks
  • Logistic regression
Modeling Strategy continued
• Evaluate each model using test data set.
• Apply misclassification costs in line with cost-benefit table.
• Apply overbalancing as a surrogate for misclassification costs.
  • Find best overbalancing proportion.
• Combine predictions from four models
  • Using model voting.
  • Using mean response probabilities.
Principal Components Analysis (PCA)
• Multicollinearity does not degrade prediction accuracy.
• But it muddles the individual predictor coefficients.
• Interested in predictor characteristics, customer profiling, etc.?
  • Then PCA is required.
• But if interested solely in classification (prediction, estimation),
  • PCA is not strictly required.
Report Two Model Sets:
• Model Set A:
  • Includes principal components
  • All-purpose model set
• Model Set B:
  • Includes correlated predictors, not principal components
  • Use restricted to classification
Principal Components Analysis (PCA)
• Seven correlated variables.
• Two components extracted.
• Account for 87% of variability.
Principal Components Analysis (PCA)
• Principal Component 1: Purchasing Habits
  • Customer general purchasing habits
  • Expect component to be strongly indicative of response
• Principal Component 2: Promotion Contacts
  • Unclear whether component will be associated with response
• Components validated by test data set
BIRCH Clustering Algorithm
• Requires only one pass through data set
• Scalable for large data sets
• Benefit: Analyst need not pre-specify number of clusters
• Drawback: Sensitive to initial records encountered
  • Leads to widely variable cluster solutions
  • Requires “outer loop” to find consistent cluster solution
• Zhang, Ramakrishnan and Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1, 1997.
BIRCH Clusters
Cluster 3 shows:
• Higher response for flag predictors
• Higher averages for numeric predictors
BIRCH Clusters
Cluster 3 has the highest response rate (red).
• Cluster 1: 7.6%
• Cluster 2: 7.1%
• Cluster 3: 33.0%
Balancing the Data
• For “rare” classes, balancing provides a more equitable distribution.
• Drawback: loss of data.
  • Here, 40% of non-responders were randomly omitted.
  • All responders were retained.
  • Proportion of responders increases from 16.58% to 24.76%.
• The test data set should never be balanced.
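The balancing step described above can be sketched as random omission of non-responders. The function names and the `is_responder` callback are hypothetical, and the sketch is not the procedure used in the original software; it only illustrates the arithmetic behind the shift in the responder share.

```python
import random

def balance_by_omission(records, is_responder, omit_fraction, seed=0):
    """Keep every responder; drop each non-responder with probability omit_fraction."""
    rng = random.Random(seed)
    return [r for r in records
            if is_responder(r) or rng.random() >= omit_fraction]

def expected_responder_share(responder_share, omit_fraction):
    """Expected responder proportion after omitting a fraction of non-responders."""
    kept_non_responders = (1 - responder_share) * (1 - omit_fraction)
    return responder_share / (responder_share + kept_non_responders)

# Starting from 16.58% responders and omitting 40% of non-responders.
print(expected_responder_share(0.1658, 0.40))
```

The expected share lands close to the figure reported on the slide; the exact value in any one run depends on the random draw. Only the training records would be passed through `balance_by_omission`, since the test set is left unbalanced.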
False Positive vs. False Negative: Which is Worse?
• For direct mail marketing, a false negative error is probably worse than a false positive.
• Generate misclassification costs based on the observed data.
• Construct cost-benefit table.
Decision Cost / Benefit Analysis
Establish Baseline Model Performance
• Benchmarks
  • “Don’t Send a Marketing Promotion to Anyone” Model
  • “Send a Marketing Promotion to Everyone” Model
• Will compare candidate models against this baseline error rate.
Model Set A (With 50% Balancing)
• No model beats benchmark of $2.63 profit per customer
• Misclassification costs had not been applied
• Now define FN cost = $28.40, FP cost = $2
• Outperformed baseline “Send to everyone” model
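To show how a profit-per-customer figure can be derived from a confusion matrix and a cost table, here is a minimal sketch. Only the FN cost ($28.40) and FP cost ($2) come from the slide. The true-positive profit of $26.40 (the $28.40 less the $2 mailing cost), the zero value for true negatives, and the customer counts are assumptions made for illustration, since the cost-benefit table itself is not reproduced in the text.

```python
def mean_profit_per_customer(tp, fp, fn, tn,
                             fn_cost=28.40,    # from the slide
                             fp_cost=2.00,     # from the slide
                             tp_profit=26.40,  # ASSUMPTION: 28.40 minus 2.00 mailing cost
                             tn_profit=0.0):   # ASSUMPTION: no contact, no gain or loss
    """Average profit per customer implied by a confusion matrix and cost table."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")
    total = tp * tp_profit + tn * tn_profit - fp * fp_cost - fn * fn_cost
    return total / n

# Hypothetical test set: 1,000 customers, 166 of whom would respond.
send_to_everyone = mean_profit_per_customer(tp=166, fp=834, fn=0, tn=0)
send_to_no_one = mean_profit_per_customer(tp=0, fp=0, fn=166, tn=834)
print(send_to_everyone, send_to_no_one)
```

A candidate model's confusion matrix on the test set can be scored with the same function and compared against the two benchmark models.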
Model Set A: Effect of Misclassification Costs
• For the 447 highlighted records:
  • Only 20.8% responded.
  • But the model predicts positive response.
  • This is due to the high false negative misclassification cost.
Model Set A: PCA Component 1 is Best Predictor
• First principal component ($F-PCA-1), Purchasing Habits, represents both the root node split and the secondary split
• Most important factor for predicting response
Over-Balancing as a Surrogate for Misclassification Costs
• Software limitation:
  • Neural network and logistic regression models in Clementine lack methods for applying misclassification costs.
• Over-balancing is an alternative method that can achieve similar results.
  • It starves the classifier of instances of non-response.
Over-Balancing as a Surrogate for Misclassification Costs
• Neural network model results
• Three over-balanced models outperform baseline
• Properly applied, over-balancing can be used as a surrogate for misclassification costs
Over-Balancing as a Surrogate for Misclassification Costs
• Apply 80% - 20% over-balancing to the other models.
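Over-balancing to a target proportion can be sketched as follows. The sketch assumes that "80% - 20%" means 80% responders and 20% non-responders in the training data, and that the target is reached by discarding non-responders (in keeping with "starving the classifier of instances of non-response"). Both readings are assumptions, and the function name and `is_responder` callback are hypothetical.

```python
import random

def over_balance(records, is_responder, responder_share=0.80, seed=0):
    """Keep all responders and just enough randomly chosen non-responders
    so that responders make up responder_share of the result."""
    if not 0 < responder_share <= 1:
        raise ValueError("responder_share must be in (0, 1]")
    rng = random.Random(seed)
    responders = [r for r in records if is_responder(r)]
    non_responders = [r for r in records if not is_responder(r)]
    # Number of non-responders that yields the target mix.
    n_keep = round(len(responders) * (1 - responder_share) / responder_share)
    n_keep = min(n_keep, len(non_responders))
    kept = responders + rng.sample(non_responders, n_keep)
    rng.shuffle(kept)
    return kept
```

As with ordinary balancing, this would be applied to the training data only; model performance is still measured on the untouched test set.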
Combination Models: Voting
• Smooths out strengths and weaknesses of each model
• Each model supplies a prediction for each record
• Count the votes for each record
• Disadvantage of combination models:
  • Lack of easy interpretability
• Four competing combination models…
Combination Models: Voting
Mail a promotion only if:
• All four models predict response
  • Protects against false positives
  • All four classification algorithms must agree on a positive prediction
• At least three models predict response
• At least two models predict response
• Any model predicts response
  • Protects against false negatives
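The four voting rules above differ only in the minimum number of positive votes required, which a short sketch can make explicit. The function and dictionary names are hypothetical, and the example predictions are made up for illustration.

```python
def mail_promotion(predictions, min_votes):
    """Return True if at least min_votes of the models predict response."""
    votes = sum(1 for p in predictions if p)
    return votes >= min_votes

# The four competing combination models, keyed by their vote threshold.
VOTING_RULES = {
    "all four": 4,        # strictest: guards against false positives
    "at least three": 3,
    "at least two": 2,
    "any": 1,             # most lenient: guards against false negatives
}

# Hypothetical predictions for one customer from the four classifiers.
predictions = [True, True, False, True]
decisions = {name: mail_promotion(predictions, k)
             for name, k in VOTING_RULES.items()}
```

Raising the threshold trades false positives for false negatives, so the preferred rule depends on the misclassification costs.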
Combination Models: Voting
• None beat the logistic regression model: $2.96 profit per customer
• Perhaps combination models will do better with Model Collection B…
Model Collection B: Non-PCA Models
• Models retain correlated variables
• Use restricted to prediction only
• Since the correlated variables are highly predictive, expect Collection B to outperform the PCA models
Model Collection B: CART and C5.0
• Using misclassification costs, and 50% balancing
• Both models outperform the best PCA model
Model Collection B: Over-Balancing
• Apply over-balancing as a surrogate for misclassification costs for all models
• Best performance thus far.
Combination Models: Voting
• Combine the four models via voting and 80%-20% over-balancing
• Synergy: Combination model outperforms any individual model.
Combining Models Using Mean Response Probabilities
• Combine the confidences that each model reports for its decisions
• Allows finer tuning of the decision space
• Derive a new variable:
  • Mean Response Probability (MRP): Average of response confidences of the four models.
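A minimal sketch of the MRP calculation follows. It assumes each model reports a predicted class together with a confidence in that class, so a non-response prediction with confidence c is converted to a response probability of 1 - c. That conversion, the function names, and the example values and threshold are assumptions for illustration rather than details given in the presentation.

```python
def response_probability(predicts_response, confidence):
    """Convert a (prediction, confidence) pair into a probability of response."""
    return confidence if predicts_response else 1.0 - confidence

def mean_response_probability(model_outputs):
    """Average the response probabilities across the models."""
    probs = [response_probability(p, c) for p, c in model_outputs]
    return sum(probs) / len(probs)

def mail_by_mrp(model_outputs, threshold):
    """Mail the promotion when the MRP reaches the chosen threshold."""
    return mean_response_probability(model_outputs) >= threshold

# Hypothetical outputs of the four models for one customer.
outputs = [(True, 0.9), (True, 0.6), (False, 0.7), (False, 0.8)]
mrp = mean_response_probability(outputs)
```

Unlike voting, which allows only four cut-offs, the threshold on MRP can be set anywhere between 0 and 1, which is what permits finer tuning of the decision.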