Explore the Cross-Industry Standard Process for Data Mining (CRISP-DM), a comprehensive model for identifying actionable insights across various sectors. Learn about the key phases, from business understanding to deployment, and how to convert business objectives into data-mining goals using techniques like association, classification, and clustering. Dive into a detailed example of data mining in action, highlighting data selection, preprocessing, transformation, model analysis, and interpretation to achieve measurable success criteria and market competitiveness.
Data Mining Processes • Identify actionable results
CRISP-DM • Cross-Industry Standard Process for Data Mining • One of the first comprehensive attempts at a standard process model for data mining • Independent of industry sector & technology
CRISP-DM Phases • Business (or problem) understanding • Data understanding • Data preparation • Transform & create the data set for modeling • Modeling • Evaluation • Check that models are good; evaluate to ensure nothing is missing • Deployment
Business Understanding • Solve a specific problem • A clear definition helps • Set measurable success criteria • Convert business objectives into a set of data-mining goals • State what is to be achieved in technical terms
Data Understanding • Related data can come from many sources • Internal • ERP (or MIS) • Data warehouse • External • Government data • Commercial data • Created • Research
Data Preparation • Clean data • Fix formats, fill gaps, filter outliers & redundancies • Unify numerical scales • Nominal data: code • Ordinal data: nominal code or scale • Cardinal data: already numeric
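A minimal sketch of this kind of cleaning and coding, assuming pandas; the column names and values are illustrative, not taken from the slides:

```python
import pandas as pd

# Hypothetical raw records; column names are illustrative only.
df = pd.DataFrame({
    "income":  ["High", "Low", "Medium", None],   # ordinal, with a gap
    "student": ["Yes", "No", "Yes", "No"],        # nominal
    "age":     [25, 45, 38, 51],                  # cardinal (already numeric)
})

# Clean: drop incomplete rows and redundancies.
df = df.dropna().drop_duplicates()

# Unify numerical scales: ordinal data -> ordered codes,
# nominal data -> simple codes, cardinal data left as-is.
df["income"]  = df["income"].map({"Low": 1, "Medium": 2, "High": 3})
df["student"] = df["student"].map({"No": 1, "Yes": 2})
print(df)
```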
Modeling • Data treatment • Training set • Test set • Possibly others • Techniques • Association • Classification • Clustering • Prediction • Sequential patterns
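A sketch of the training/test split, assuming scikit-learn is available; the toy arrays stand in for a prepared data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared data set.
X = np.arange(20).reshape(10, 2)   # 10 cases, 2 attributes
y = np.array([0, 1] * 5)           # binary outcome

# Hold out a test set; the rest trains the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```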
Evaluation • Does model meet business objectives? • Any important business objectives not addressed? • Does model make sense? • Is model actionable?
Deployment • Ongoing monitoring & maintenance • Evaluate performance against success criteria • Market reaction & competitor changes
Example • Training set for computer purchase • 16 records • 5 attributes • Goal: find a classifier for consumer behavior
Data Selection • Gender has a weak relationship with purchase • Based on correlation • Drop gender • Selected attribute set: {Age, Income, Student, Credit}
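A hedged illustration of correlation-based attribute selection with pandas; the data values and the 0.2 cutoff are made up for the example, not from the slides:

```python
import pandas as pd

# Hypothetical coded data; 'buys' is the outcome.
df = pd.DataFrame({
    "gender": [1, 2, 1, 2, 1, 2, 1, 2],
    "age":    [3, 1, 1, 3, 3, 1, 1, 3],
    "buys":   [0, 1, 1, 0, 0, 1, 1, 0],
})

# Correlation of each attribute with the outcome.
corr = df.corr()["buys"].drop("buys")
print(corr)   # gender ~ 0.0 (weak), age strongly correlated

# Drop attributes whose relationship with purchase is weak.
weak = corr[corr.abs() < 0.2].index
df = df.drop(columns=weak)
```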
Data Preprocessing • Income is unknown in Case 15 • Credit is not available in Case 16 • Drop these noisy cases
Data Transformation • Assign numerical values to each attribute • Age: ≤30 = 3, 31–40 = 2, >40 = 1 • Income: High = 3, Medium = 2, Low = 1 • Student: Yes = 2, No = 1 • Credit: Excellent = 2, Fair = 1
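This coding table can be applied with plain dictionaries; a small sketch using the slide's values (the sample record is hypothetical):

```python
# Numeric coding from the slide, applied with simple dictionaries.
age_code     = {"<=30": 3, "31-40": 2, ">40": 1}
income_code  = {"High": 3, "Medium": 2, "Low": 1}
student_code = {"Yes": 2, "No": 1}
credit_code  = {"Excellent": 2, "Fair": 1}

record = {"age": "31-40", "income": "High", "student": "No", "credit": "Fair"}
coded = {
    "age":     age_code[record["age"]],
    "income":  income_code[record["income"]],
    "student": student_code[record["student"]],
    "credit":  credit_code[record["credit"]],
}
print(coded)   # {'age': 2, 'income': 3, 'student': 1, 'credit': 1}
```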
Data Mining • Categorize the output: Buys = C1, Doesn't buy = C2 • Conduct the analysis • The model says cases A8 and A12 don't buy; the rest do • Of the actual buyers, 8 classified correctly and 1 not • Of the actual non-buyers, 4 classified correctly and 1 not
Data Interpretation • Test on independent data
Measures • Correct classification rate: 9/10 = 0.90 • Cost function (cost of error): model says buy but actual is no → $20; model says no but actual is buy → $200 • Total cost: 1 × $20 + 0 × $200 = $20
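A short sketch computing both measures; the predicted/actual vectors are hypothetical but chosen to reproduce the slide's 9/10 accuracy and single $20 false positive:

```python
# Error costs from the slide: says buy but actual no -> $20,
# says no but actual buy -> $200.
COST_FP, COST_FN = 20, 200

def evaluate(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    rate = correct / len(actual)
    cost = fp * COST_FP + fn * COST_FN
    return rate, cost

# Hypothetical test outcomes: 9/10 correct, one false positive.
pred   = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(evaluate(pred, actual))   # (0.9, 20)
```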
Goals • Avoid broad concepts: • Gain insight; discover meaningful patterns; learn interesting things • Attainment of these can't be measured • Narrow and specify: • Identify customers likely to renew; reduce churn • Rank-order by propensity to…
Goals • Description: what is • Understand • Explain • Discover knowledge • Prescription: what should be done • Classify • Predict
Goals • Method A: four rules, explains 70% • Method B: fifty rules, explains 72% • Which is best? • To gain understanding: Method A is better (minimum description length, MDL) • To reduce the cost of a mailing: Method B is better
Measurement • Accuracy • How well does the model describe the observed data? • Confidence levels • The proportion of the time the true value falls between the lower and upper limits • Comprehensibility • Of the whole model or its parts?
Measuring Predictive Performance • Classification & prediction: error rate = incorrect / total; requires the evaluation set to be representative • Estimators based on predicted − actual: MAD (Mean Absolute Deviation), MSE (Mean Squared Error), MAPE (Mean Absolute Percent Error) • variance = Σ(predicted − actual)²; standard deviation = √variance • Distance: how far off the predictions are
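A sketch of these estimators in plain Python; following the slide's loose definition, the standard deviation here is taken as the square root of the mean squared error, and the sample values are made up:

```python
import math

def error_measures(predicted, actual):
    errs = [p - a for p, a in zip(predicted, actual)]
    n = len(errs)
    mad  = sum(abs(e) for e in errs) / n          # Mean Absolute Deviation
    mse  = sum(e * e for e in errs) / n           # Mean Squared Error
    mape = sum(abs(e) / abs(a)                    # Mean Absolute Percent Error
               for e, a in zip(errs, actual)) / n * 100
    std  = math.sqrt(mse)                         # spread of the errors
    return mad, mse, mape, std

print(error_measures([105, 98, 110], [100, 100, 100]))
```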
Statistics • Population: the entire group studied • Sample: a subset drawn from the population • Bias: the difference between the sample average & the population average • Mean, median, mode • Distribution • Significance • Correlation, regression
Classification Models • LIFT = probability in class for the sample divided by probability in class for the population • If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5 • The best lift is not necessarily best overall: a sufficient sample size is needed, and as the list grows longer, confidence increases but lift falls
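The lift calculation itself is a one-liner; a sketch using the slide's example numbers:

```python
def lift(sample_rate, population_rate):
    """Lift = class probability in the targeted sample / in the population."""
    return sample_rate / population_rate

# Slide's example: population 20% in class, targeted sample 30%.
print(lift(0.30, 0.20))   # 1.5
```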
Measuring Impact • Ideal: dollars (NPV, Net Present Value) generated by the expenditure • A mass mailing may be better • Depends on: • Fixed cost • Cost per recipient • Cost per respondent • Value of a positive response
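These four factors combine into a simple net-value calculation for a mailing; a sketch with purely hypothetical figures:

```python
def campaign_profit(n_contacted, response_rate, fixed_cost,
                    cost_per_recipient, cost_per_respondent,
                    value_per_response):
    """Net value of a mailing given the four cost/value factors."""
    respondents = n_contacted * response_rate
    revenue = respondents * value_per_response
    cost = (fixed_cost + n_contacted * cost_per_recipient
            + respondents * cost_per_respondent)
    return revenue - cost

# Hypothetical figures: targeted list of 10,000 at a 3% response rate.
print(campaign_profit(10_000, 0.03, 5_000, 0.50, 2.00, 60.00))  # 7400.0
```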
Bottom Line • Return on investment
Example Application • Telephone industry • Problem: Unpaid bills • Data mining used to develop models to predict nonpayment as early as possible
Telephone Bill Study • The billing-period sequence was analyzed • Two months of usage, then the bill is received; payment is due in the month of billing; service is disconnected if the bill is unpaid within a given period
1: Business Understanding • Predict which customers will become insolvent • In time for the firm to take preventive measures (and avoid losing good customers) • Hypothesis: insolvent customers change their calling habits & phone usage during a critical period before & immediately after the end of the billing period
2: Data Understanding • Static customer information available in files • Bills, payments, usage • Used data warehouse to gather & organize data • Coded to protect customer privacy
Creating the Target Data Set • Customer files • Customer information • Disconnects • Reconnections • Time-dependent data • Bills • Payments • Usage • 100,000 customers over a 17-month period • Stratified sampling to ensure all groups were appropriately represented
3: Data Preparation • Filtered out incomplete data • Deleted inexpensive calls • Reduced data volume by about 50% • Low number of fraudulent cases • Cross-checked against phone disconnects • Lagged data made synchronization necessary
Data Reduction & Projection • Information grouped by account • Customer data aggregated by 2-week periods • Discriminant analysis on 23 categories • Calculated average owed by category (significant) • Identified extra charges (significant) • Investigated payment by installments (not significant)
Choosing the Data-Mining Function • Classes: • Most probably solvent (99.3%) • Most probably insolvent (0.7%) • Costs of error widely different • A new data set was created through stratified sampling • Retained all insolvent cases • Altered the distribution to 90% solvent • Used 2,066 cases total • Critical period identified • The last 15 two-week periods before service interruption • Variables defined by counting measures within two-week periods • 46 variables as candidate discriminant factors
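A sketch of this kind of stratified resampling with pandas: keep every rare-class case and downsample the majority to the slide's 90/10 target (the frame and counts are hypothetical):

```python
import pandas as pd

# Hypothetical frame: 'insolvent' marks the rare class.
df = pd.DataFrame({"insolvent": [1] * 10 + [0] * 990})

# Retain all insolvent cases; downsample the solvent majority
# so the new set is roughly 90% solvent / 10% insolvent.
insolvent = df[df["insolvent"] == 1]
solvent   = df[df["insolvent"] == 0].sample(n=9 * len(insolvent),
                                            random_state=0)
balanced  = pd.concat([insolvent, solvent]).sample(frac=1, random_state=0)
print(balanced["insolvent"].value_counts())   # 90 solvent, 10 insolvent
```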
4: Modeling • Discriminant Analysis • Linear model • SPSS – stepwise forward selection • Decision Trees • Rule-based classifier • Neural Networks • Nonlinear model
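A sketch of the three model families using scikit-learn stand-ins; the original study used SPSS, and the synthetic data and settings here are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the coded customer variables.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "discriminant analysis": LinearDiscriminantAnalysis(),           # linear
    "decision tree": DecisionTreeClassifier(random_state=0),         # rule-based
    "neural network": MLPClassifier(max_iter=1000, random_state=0),  # nonlinear
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))
```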
Data Mining • Training set: about two-thirds of the data • The rest used for testing • Discriminant analysis • Used 17 variables • Equal costs: 0.875 correct • Unequal costs: 0.930 correct • Rule-based classifier: 0.952 correct • Neural network: 0.929 correct
5: Evaluation • First objective: maximize accuracy in predicting insolvent customers • The decision-tree classifier was best • Second objective: minimize the error rate for solvent customers • The neural network was close to the decision tree • All three models were used on a case-by-case basis
6: Implementation • Every customer was examined using all 3 algorithms • If all 3 agreed, that classification was used • If they disagreed, the customer was categorized as unclassified • Correct on test data: 0.898 • Only 1 actually solvent customer would have been disconnected
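The agreement rule itself is simple to express; a minimal sketch, with hypothetical per-customer labels standing in for the three classifiers' outputs:

```python
def classify_by_agreement(predictions):
    """Return the common label if all classifiers agree, else 'unclassified'."""
    return predictions[0] if len(set(predictions)) == 1 else "unclassified"

# Hypothetical outputs from the three models for two customers.
print(classify_by_agreement(["insolvent", "insolvent", "insolvent"]))  # insolvent
print(classify_by_agreement(["insolvent", "solvent", "insolvent"]))    # unclassified
```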