Explore the Cross-Industry Standard Process for Data Mining (CRISP-DM), a comprehensive model for identifying actionable insights across various sectors. Learn about the key phases, from business understanding to deployment, and how to convert business objectives into data-mining goals using techniques like association, classification, and clustering. Dive into a detailed example of data mining in action, highlighting data selection, preprocessing, transformation, model analysis, and interpretation to achieve measurable success criteria and market competitiveness.
Data Mining Processes • Identify actionable results
CRISP-DM • Cross-Industry Standard Process for Data Mining • One of the first comprehensive attempts at a standard process model for data mining • Independent of industry sector & technology
CRISP-DM Phases • Business (or problem) understanding • Data understanding • Data preparation • Transform & create the data set for modeling • Modeling • Evaluation • Check that models are good; evaluate to ensure nothing is missing • Deployment
Business Understanding • Solve a specific problem • A clear definition helps • Set measurable success criteria • Convert business objectives into a set of data-mining goals • State what is to be achieved in technical terms
Data Understanding • Related data can come from many sources • Internal • ERP (or MIS) • Data warehouse • External • Government data • Commercial data • Created • Research
Data Preparation • Clean data • Fix formats, fill gaps, filter outliers & redundancies • Unify numerical scales • Nominal data: code • Ordinal data: nominal code or scale • Cardinal data: already numeric
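A minimal sketch of this kind of cleaning and coding, assuming pandas; the column names and values are illustrative, not taken from the slides:

```python
import pandas as pd

# Hypothetical raw records; column names are illustrative only.
df = pd.DataFrame({
    "income":  ["High", "Low", "Medium", None],   # ordinal, with a gap
    "student": ["Yes", "No", "Yes", "No"],        # nominal
    "age":     [25, 45, 38, 51],                  # cardinal (already numeric)
})

# Clean: drop incomplete rows and redundancies.
df = df.dropna().drop_duplicates()

# Unify numerical scales: ordinal data -> ordered codes,
# nominal data -> simple codes, cardinal data left as-is.
df["income"]  = df["income"].map({"Low": 1, "Medium": 2, "High": 3})
df["student"] = df["student"].map({"No": 1, "Yes": 2})
print(df)
```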
Modeling • Data treatment • Training set • Test set • Possibly others • Techniques • Association • Classification • Clustering • Prediction • Sequential patterns
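A sketch of the training/test split, assuming scikit-learn is available; the toy arrays stand in for a prepared data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared data set.
X = np.arange(20).reshape(10, 2)   # 10 cases, 2 attributes
y = np.array([0, 1] * 5)           # binary outcome

# Hold out a test set; the rest trains the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```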
Evaluation • Does model meet business objectives? • Any important business objectives not addressed? • Does model make sense? • Is model actionable?
Deployment • Ongoing monitoring & maintenance • Evaluate performance against success criteria • Market reaction & competitor changes
Example • Training set for computer purchase • 16 records • 5 attributes • Goal: find a classifier for consumer behavior
Data Selection • Gender has a weak relationship with purchase • Based on correlation • Drop gender • Selected attribute set: {Age, Income, Student, Credit}
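A hedged illustration of correlation-based attribute selection with pandas; the data values and the 0.2 cutoff are made up for the example, not from the slides:

```python
import pandas as pd

# Hypothetical coded data; 'buys' is the outcome.
df = pd.DataFrame({
    "gender": [1, 2, 1, 2, 1, 2, 1, 2],
    "age":    [3, 1, 1, 3, 3, 1, 1, 3],
    "buys":   [0, 1, 1, 0, 0, 1, 1, 0],
})

# Correlation of each attribute with the outcome.
corr = df.corr()["buys"].drop("buys")
print(corr)   # gender ~ 0.0 (weak), age strongly correlated

# Drop attributes whose relationship with purchase is weak.
weak = corr[corr.abs() < 0.2].index
df = df.drop(columns=weak)
```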
Data Preprocessing • Income is unknown in Case 15 • Credit is not available in Case 16 • Drop these noisy cases
Data Transformation • Assign numerical values to each attribute • Age: ≤30 = 3, 31–40 = 2, >40 = 1 • Income: High = 3, Medium = 2, Low = 1 • Student: Yes = 2, No = 1 • Credit: Excellent = 2, Fair = 1
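This coding table can be applied with plain dictionaries; a small sketch using the slide's values (the sample record is hypothetical):

```python
# Numeric coding from the slide, applied with simple dictionaries.
age_code     = {"<=30": 3, "31-40": 2, ">40": 1}
income_code  = {"High": 3, "Medium": 2, "Low": 1}
student_code = {"Yes": 2, "No": 1}
credit_code  = {"Excellent": 2, "Fair": 1}

record = {"age": "31-40", "income": "High", "student": "No", "credit": "Fair"}
coded = {
    "age":     age_code[record["age"]],
    "income":  income_code[record["income"]],
    "student": student_code[record["student"]],
    "credit":  credit_code[record["credit"]],
}
print(coded)   # {'age': 2, 'income': 3, 'student': 1, 'credit': 1}
```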
Data Mining • Categorize the output: Buys = C1, Doesn't buy = C2 • Conduct the analysis • The model says cases A8 and A12 don't buy; the rest do • Of the actual buyers, 8 classified correctly and 1 not • Of the actual non-buyers, 4 classified correctly and 1 not
Data Interpretation • Test on independent data
Measures • Correct classification rate: 9/10 = 0.90 • Cost function (cost of error): model says buy but actual is no → $20; model says no but actual is buy → $200 • Total cost: 1 × $20 + 0 × $200 = $20
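A short sketch computing both measures; the predicted/actual vectors are hypothetical but chosen to reproduce the slide's 9/10 accuracy and single $20 false positive:

```python
# Error costs from the slide: says buy but actual no -> $20,
# says no but actual buy -> $200.
COST_FP, COST_FN = 20, 200

def evaluate(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    rate = correct / len(actual)
    cost = fp * COST_FP + fn * COST_FN
    return rate, cost

# Hypothetical test outcomes: 9/10 correct, one false positive.
pred   = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(evaluate(pred, actual))   # (0.9, 20)
```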
Goals • Avoid broad concepts: • Gain insight; discover meaningful patterns; learn interesting things • Attainment of these can't be measured • Narrow and specify: • Identify customers likely to renew; reduce churn • Rank-order by propensity to…
Goals • Description: what is • Understand • Explain • Discover knowledge • Prescription: what should be done • Classify • Predict
Goals • Method A: four rules, explains 70% • Method B: fifty rules, explains 72% • Which is best? • To gain understanding: Method A is better (minimum description length, MDL) • To reduce the cost of a mailing: Method B is better
Measurement • Accuracy • How well does the model describe the observed data? • Confidence levels • The proportion of the time the true value falls between the lower and upper limits • Comprehensibility • Of the whole model or its parts?
Measuring Predictive Performance • Classification & prediction: error rate = incorrect / total; requires the evaluation set to be representative • Estimators based on predicted − actual: MAD (Mean Absolute Deviation), MSE (Mean Squared Error), MAPE (Mean Absolute Percent Error) • variance = Σ(predicted − actual)²; standard deviation = √variance • Distance: how far off the predictions are
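A sketch of these estimators in plain Python; following the slide's loose definition, the standard deviation here is taken as the square root of the mean squared error, and the sample values are made up:

```python
import math

def error_measures(predicted, actual):
    errs = [p - a for p, a in zip(predicted, actual)]
    n = len(errs)
    mad  = sum(abs(e) for e in errs) / n          # Mean Absolute Deviation
    mse  = sum(e * e for e in errs) / n           # Mean Squared Error
    mape = sum(abs(e) / abs(a)                    # Mean Absolute Percent Error
               for e, a in zip(errs, actual)) / n * 100
    std  = math.sqrt(mse)                         # spread of the errors
    return mad, mse, mape, std

print(error_measures([105, 98, 110], [100, 100, 100]))
```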
Statistics • Population: the entire group studied • Sample: a subset drawn from the population • Bias: the difference between the sample average & the population average • Mean, median, mode • Distribution • Significance • Correlation, regression
Classification Models • LIFT = probability in class for the sample divided by probability in class for the population • If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5 • The best lift is not necessarily best overall: a sufficient sample size is needed, and as the list grows longer, confidence increases but lift falls
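The lift calculation itself is a one-liner; a sketch using the slide's example numbers:

```python
def lift(sample_rate, population_rate):
    """Lift = class probability in the targeted sample / in the population."""
    return sample_rate / population_rate

# Slide's example: population 20% in class, targeted sample 30%.
print(lift(0.30, 0.20))   # 1.5
```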
Measuring Impact • Ideal: dollars (NPV, Net Present Value) generated by the expenditure • A mass mailing may be better • Depends on: • Fixed cost • Cost per recipient • Cost per respondent • Value of a positive response
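These four factors combine into a simple net-value calculation for a mailing; a sketch with purely hypothetical figures:

```python
def campaign_profit(n_contacted, response_rate, fixed_cost,
                    cost_per_recipient, cost_per_respondent,
                    value_per_response):
    """Net value of a mailing given the four cost/value factors."""
    respondents = n_contacted * response_rate
    revenue = respondents * value_per_response
    cost = (fixed_cost + n_contacted * cost_per_recipient
            + respondents * cost_per_respondent)
    return revenue - cost

# Hypothetical figures: targeted list of 10,000 at a 3% response rate.
print(campaign_profit(10_000, 0.03, 5_000, 0.50, 2.00, 60.00))  # 7400.0
```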
Bottom Line • Return on investment
Example Application • Telephone industry • Problem: Unpaid bills • Data mining used to develop models to predict nonpayment as early as possible
Telephone Bill Study • The billing-period sequence was analyzed • Two months of usage, then the bill is received; payment is due in the month of billing; service is disconnected if the bill is unpaid within a given period
1: Business Understanding • Predict which customers will become insolvent • In time for the firm to take preventive measures (and avoid losing good customers) • Hypothesis: insolvent customers change their calling habits & phone usage during a critical period before & immediately after the end of the billing period
2: Data Understanding • Static customer information available in files • Bills, payments, usage • Used data warehouse to gather & organize data • Coded to protect customer privacy
Creating the Target Data Set • Customer files • Customer information • Disconnects • Reconnections • Time-dependent data • Bills • Payments • Usage • 100,000 customers over a 17-month period • Stratified sampling to ensure all groups were appropriately represented
3: Data Preparation • Filtered out incomplete data • Deleted inexpensive calls • Reduced data volume by about 50% • Low number of fraudulent cases • Cross-checked against phone disconnects • Lagged data made synchronization necessary
Data Reduction & Projection • Information grouped by account • Customer data aggregated by 2-week periods • Discriminant analysis on 23 categories • Calculated average owed by category (significant) • Identified extra charges (significant) • Investigated payment by installments (not significant)
Choosing the Data-Mining Function • Classes: • Most probably solvent (99.3%) • Most probably insolvent (0.7%) • Costs of error widely different • A new data set was created through stratified sampling • Retained all insolvent cases • Altered the distribution to 90% solvent • Used 2,066 cases total • Critical period identified • The last 15 two-week periods before service interruption • Variables defined by counting measures within two-week periods • 46 variables as candidate discriminant factors
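A sketch of this kind of stratified resampling with pandas: keep every rare-class case and downsample the majority to the slide's 90/10 target (the frame and counts are hypothetical):

```python
import pandas as pd

# Hypothetical frame: 'insolvent' marks the rare class.
df = pd.DataFrame({"insolvent": [1] * 10 + [0] * 990})

# Retain all insolvent cases; downsample the solvent majority
# so the new set is roughly 90% solvent / 10% insolvent.
insolvent = df[df["insolvent"] == 1]
solvent   = df[df["insolvent"] == 0].sample(n=9 * len(insolvent),
                                            random_state=0)
balanced  = pd.concat([insolvent, solvent]).sample(frac=1, random_state=0)
print(balanced["insolvent"].value_counts())   # 90 solvent, 10 insolvent
```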
4: Modeling • Discriminant Analysis • Linear model • SPSS – stepwise forward selection • Decision Trees • Rule-based classifier • Neural Networks • Nonlinear model
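A sketch of the three model families using scikit-learn stand-ins; the original study used SPSS, and the synthetic data and settings here are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the coded customer variables.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "discriminant analysis": LinearDiscriminantAnalysis(),           # linear
    "decision tree": DecisionTreeClassifier(random_state=0),         # rule-based
    "neural network": MLPClassifier(max_iter=1000, random_state=0),  # nonlinear
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))
```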
Data Mining • Training set: about two-thirds of the data • The rest used for testing • Discriminant analysis • Used 17 variables • Equal costs: 0.875 correct • Unequal costs: 0.930 correct • Rule-based classifier: 0.952 correct • Neural network: 0.929 correct
5: Evaluation • First objective: maximize accuracy in predicting insolvent customers • The decision-tree classifier was best • Second objective: minimize the error rate for solvent customers • The neural network was close to the decision tree • All three models were used on a case-by-case basis
6: Implementation • Every customer was examined using all 3 algorithms • If all 3 agreed, that classification was used • If they disagreed, the customer was categorized as unclassified • Correct on test data: 0.898 • Only 1 actually solvent customer would have been disconnected
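The agreement rule itself is simple to express; a minimal sketch, with hypothetical per-customer labels standing in for the three classifiers' outputs:

```python
def classify_by_agreement(predictions):
    """Return the common label if all classifiers agree, else 'unclassified'."""
    return predictions[0] if len(set(predictions)) == 1 else "unclassified"

# Hypothetical outputs from the three models for two customers.
print(classify_by_agreement(["insolvent", "insolvent", "insolvent"]))  # insolvent
print(classify_by_agreement(["insolvent", "solvent", "insolvent"]))    # unclassified
```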