The CRISP Data Mining Process

The Data Mining Process
[Process diagram: Business understanding, Data evaluation, Data preparation, Modeling, Evaluation, Deployment, arranged around the Data]
Data Mining

Business Understanding
• Project objectives
• Project requirements
• DM problem formulation
• Preliminary plan

Case Study
• Data mining project done for a large insurance company
• Consider the use of data mining to improve understanding of customer databases
• Led by the data warehousing team, which also wanted to improve its expertise

Business Objectives
• Understand what coverage packages are of interest to a customer group
  • Targeting of new customers
  • Cross-selling opportunities to existing customers
• Understand why a customer group terminates coverage
  • Know in advance what groups are likely to terminate
  • Understand what factors influence termination

What are the Goals?
• The business goals
  • Improve customer retention
  • Increase cross-selling
• Success criteria
  • Customer turnover rate
  • Amount of cross-selling

Data Mining Problems
• Classify new and existing customers as either interested or not interested in a particular coverage
• Classify existing customers as either likely or unlikely to terminate coverage

The Data Mining Process
[Process diagram: Business objectives, Data evaluation, Data preparation, Modeling, Evaluation, Deployment, arranged around the Data]

Data Evaluation
[Slide label: Data warehousing team]
• Initial data collection
• Data quality
• Initial insights
• Interesting subsets

Case Study: Data Evaluation
• Data was extracted from selected customer databases by company personnel
• Coverage programs with few customers were selected for the pilot project
• Five separate files were extracted, one for each of five coverage programs

Data Preparation
[Diagram: Raw Data → Finished Data Set]
• Technical tasks:
  • Data selection
  • Attribute selection
  • Data cleaning

Case Study: Data Preparation
• Some initial formatting of data in MS Excel
• Cleaning of data file
• Combine headers/instances
• Add a new attribute: interest (yes/no)
  • Must create the no-interest cases
• End up with a CSV-formatted file
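
The labeling step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `customer_id` and `state_code`, the toy records, and the rule that any customer not holding the coverage becomes a "no" case are all hypothetical, since the slides do not say how the no-interest cases were built.

```python
import csv
import io


def build_interest_rows(customers, purchased_ids):
    """Attach a yes/no 'interest' label to every customer record.

    Assumption: customers whose id is not in purchased_ids serve as
    the no-interest cases.
    """
    rows = []
    for record in customers:
        labeled = dict(record)  # copy so the input records stay untouched
        holds_coverage = record["customer_id"] in purchased_ids
        labeled["interest"] = "yes" if holds_coverage else "no"
        rows.append(labeled)
    return rows


def to_csv(rows):
    """Serialize the labeled rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical toy data, not taken from the case study.
customers = [
    {"customer_id": "C1", "state_code": "IN"},
    {"customer_id": "C2", "state_code": "OH"},
    {"customer_id": "C3", "state_code": "IN"},
]
purchased_ids = {"C1", "C3"}

labeled_rows = build_interest_rows(customers, purchased_ids)
csv_text = to_csv(labeled_rows)
```

The resulting text has one header line followed by one line per customer, which is the kind of file a CSV loader expects.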

Weka Data Mining Software
• Data in CSV format loaded into Weka:
  • Data preprocessing
  • Attribute selection
  • Modeling
    • Classification
    • Clustering
    • Association rule mining
  • Visualization

Data Preprocessing in Weka
• Initial data inspection
  • Missing values
  • Useless attributes
  • Numeric attributes as nominal
• Some helpful Weka filters
  • RemoveUseless
  • ReplaceMissingValues
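
To make the two cleaning steps concrete, here is a plain-Python sketch of the ideas behind them. It does not call Weka and is not Weka's implementation: it only drops attributes that never vary and fills gaps with the mean (numeric) or the most frequent value (nominal). The attribute names and records are hypothetical.

```python
from collections import Counter

MISSING = None  # marker used for a missing value in this sketch


def remove_constant_attributes(rows):
    """Drop attributes whose observed (non-missing) values never vary."""
    keep = [
        name
        for name in rows[0]
        if len({row[name] for row in rows if row[name] is not MISSING}) > 1
    ]
    return [{name: row[name] for name in keep} for row in rows]


def replace_missing(rows):
    """Fill numeric gaps with the mean and nominal gaps with the mode."""
    filled = [dict(row) for row in rows]
    for name in rows[0]:
        observed = [row[name] for row in rows if row[name] is not MISSING]
        if not observed:
            continue  # nothing to learn a fill value from
        if all(isinstance(value, (int, float)) for value in observed):
            fill = sum(observed) / len(observed)
        else:
            fill = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[name] is MISSING:
                row[name] = fill
    return filled


# Hypothetical toy records: 'source' never varies, two values are missing.
raw = [
    {"case_size": 10, "state_code": "IN", "source": "pilot"},
    {"case_size": None, "state_code": "IN", "source": "pilot"},
    {"case_size": 30, "state_code": None, "source": "pilot"},
    {"case_size": 20, "state_code": "OH", "source": "pilot"},
]

cleaned = replace_missing(remove_constant_attributes(raw))
```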

Data Preprocessing in Weka
• Data reduction:
  • Instance dimension
    • RemovePercentage and Resample filters
  • Attribute dimension
    • Remove redundant attributes
    • Remove irrelevant attributes
    • Identify most important attributes
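
Reduction along the instance dimension amounts to keeping only a share of the rows. The sketch below shows the general idea with a seeded random subsample; it is loosely analogous to what such filters are used for and makes no claim to reproduce Weka's behavior. The function name `subsample` and its parameters are hypothetical.

```python
import random


def subsample(rows, percentage, seed=1, with_replacement=False):
    """Return a random subsample holding `percentage` percent of the rows."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    size = round(len(rows) * percentage / 100)
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    if with_replacement:
        return [rng.choice(rows) for _ in range(size)]
    return rng.sample(rows, size)


# Hypothetical toy data: 20 customer records.
rows = [{"customer_id": i} for i in range(20)]
reduced = subsample(rows, 25)
```

Fixing the seed keeps the reduced data set stable across runs, which helps when results from different modeling attempts are compared.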

Attribute Selection Methods
• Three main methods used:
  • InfoGain
  • ChiSquared
  • Relief
• Combined results from complementary methods
• Final pruning of attribute list to twenty attributes
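
As an illustration of the first method, information gain scores a nominal attribute by how much it reduces the entropy of the class label. The sketch below is a textbook computation in plain Python, not Weka's evaluator, and the toy records (with attribute names borrowed from the deck) are made up for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def info_gain(rows, attribute, target="interest"):
    """Entropy of the target minus its expected entropy after the split."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy([row[target] for row in rows]) - remainder


def rank_attributes(rows, attributes, target="interest"):
    """Order attributes from most to least informative."""
    return sorted(
        attributes, key=lambda name: info_gain(rows, name, target), reverse=True
    )


# Hypothetical toy data: the flag separates the classes, the state does not.
rows = [
    {"small_group_flag": "Y", "state_code": "IN", "interest": "yes"},
    {"small_group_flag": "Y", "state_code": "OH", "interest": "yes"},
    {"small_group_flag": "N", "state_code": "IN", "interest": "no"},
    {"small_group_flag": "N", "state_code": "OH", "interest": "no"},
]

ranking = rank_attributes(rows, ["state_code", "small_group_flag"])
```

Rankings from several such scores can then be compared before the attribute list is pruned.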

Selected Attributes
• Location
  • Tax State
  • Contract State
  • State Code
  • Zip Code

Selected Attributes
• Size
  • Case Size Range
• Industry
  • Industry Classification
  • Industry Classification Name
  • SIC Code

Selected Attributes
• Timing
  • New Sale Flag
  • Decision Maker Effective Month
  • Decision Maker Effective Year
  • Next Renewal Month
  • Next Renewal Year

Selected Attributes
• Internal
  • Agency Number
  • Office Name
  • Pricing Category Code
  • Product Line Name
  • Small Group Flag

Relevance of Attribute Selection
• Improved modeling
  • Faster model induction
  • Higher accuracy
  • Easier to interpret models
• Structural knowledge gained from the selection of attributes

Most Important Attributes
• What attributes affect the purchasing decision of a customer group?
• E.g., the five most important factors that determine whether a customer group purchases a particular insurance coverage:
  • Agency Number
  • Small Group Flag
  • Zip Code
  • Decision Maker Effective Year
  • Next Renewal Month

Customer Segmentation
• Unique groups of customers
  • Similar characteristics
  • Similar behavior in terms of interest in coverage
• For example, separate predictive models for customer segments for a particular type of insurance

Customer Segments Used for Modeling
• Results
  • Three segments for one database
  • Two segments for two databases
  • One segment for two databases
• Continue modeling for each segment independently

Modeling
• Select modeling technique(s)
• Calibrate modeling techniques
• Make adjustments to data

Modeling
• Mathematical models for predicting whether a customer is interested in a coverage
• Understand why a customer is interested
• For example: if a customer's state is Indiana and the office is Indianapolis_Office1, then the customer is interested in Coverage_3
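
The example rule above translates directly into code, which is part of what makes rule-style models easy to inspect. In this sketch the dictionary keys `state` and `office` are assumed names for the two attributes; the rule itself is the one quoted on the slide.

```python
def interested_in_coverage_3(customer):
    """Apply the example rule: Indiana state and Indianapolis_Office1 office."""
    return (
        customer.get("state") == "Indiana"
        and customer.get("office") == "Indianapolis_Office1"
    )


# Hypothetical customers used only to exercise the rule.
matching = {"state": "Indiana", "office": "Indianapolis_Office1"}
other_office = {"state": "Indiana", "office": "Some_Other_Office"}

matching_result = interested_in_coverage_3(matching)
other_result = interested_in_coverage_3(other_office)
```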

Modeling Techniques
• Three modeling techniques tried for predicting customer interest:
  • Decision trees
  • Artificial neural networks (ANN)
  • Support vector machines (SVM)
• Decision trees have the advantage of transparency
• ANN and SVM did not have significantly better prediction accuracy

Insurance Coverage Interest (Type 6)
[Decision tree figure: splits on Small Group Flag (Y, N) and Product Line Name (Group_1, Group_2); leaves are Yes or No]

Insurance Coverage Interest (Type 7)
[Decision tree figure, some branches omitted: splits on Pricing Category Code (A2, A4, Others), Industry Classification Name (Transportation_and Public_Utilities, Legal_Services), Next Renewal Year (thresholds 2000 and 2002), and Agency Number (threshold 430); branch labels also include Group_1 and Group_2; leaves are Yes or No]

Accuracy of Predicting Customer Interest
Coverage    Accuracy
Type 1      84.0%
Type 2      97.2%
Type 3      98.3%
Type 4      99.5%
Type 5      88.4%
Type 6      100%
Type 7      76.3%
Type 8      85.0%
Type 9      94.8%

Modeling
• Mathematical models for predicting whether a customer will terminate coverage
• Why do customers terminate a specific type of coverage?
• What are the important factors in a customer's decision to terminate coverage?

Who Terminates Type 3 Coverage?
• Correct for 95% of customers
[Decision tree figure: splits on Customer Effective Year, Coverage Effective Year, and Next Renewal Month; leaves are Active or Terminated]

Who Terminates Type 1 Coverage?
• Decision tree based on:
  • Distribution number
  • Underwriting department number
  • Price category
  • Rate type
  • Rate Plan Year
• Predicts 96.3% of terminations correctly

Accuracy of Predicting Termination
Model       Accuracy
Type 1      96.3%
Type 2      96.5%
Type 3      95.3%
Type 4      88.9%
Type 5      88.3%
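
For reference, an accuracy figure of this kind is usually the share of cases where the predicted class matches the actual one. The slides do not say how the reported numbers were measured (for example, on training data or held-out data), so the sketch below only shows the arithmetic on made-up labels.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one case is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels: four of the five predictions match.
actual = ["terminated", "active", "active", "terminated", "active"]
predicted = ["terminated", "active", "terminated", "terminated", "active"]

share_correct = accuracy(actual, predicted)
```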

Evaluation
• Data analysis results in a good model
• Are business objectives being achieved?
• Is there an important business issue that has not been considered?
• Should the results be used?

Deployment
• Incorporate the results in the organization's decision-making process
  • Report
  • Decision support system
  • Personalization of web pages
• Repeatable data mining process