Special Challenges With Large Data Mining Projects

This seminar by Beth Fitzgerald discusses the challenges and best practices in predictive modeling for large data mining projects. Topics include project overview, data preparation, modeling procedures, and implementation.


Presentation Transcript


  1. Special Challenges With Large Data Mining Projects CAS PREDICTIVE MODELING SEMINAR Beth Fitzgerald ISO October 2006

  3. Agenda • Project Overview • Prior to Modeling • Modeling • Business Issues

  4. Development of a Model - Project Overview • Data • Statistical Tools • Computer Capacity • Team Skills • Data management • Analytical/statistical • Technology • Business Knowledge

  5. Prior to Modeling • Formulate the Problem • Evaluate Possible Data Sources • Prepare the Data • Develop Understanding of Modeling Procedures and Diagnostics • Explore the Data with Simple Modeling Techniques

  6. What percent of a model building project is the data preparation and data management? • 25% • 50% • 75% • 85%

  7. Formulate the Problem • What problem are you trying to solve? • What results do you expect to see? • How will you know if the results are reasonable?

  8. Prepare the Data • Do quality checks at the level of detail needed for the project • Understand how to prepare individual variables for use in models • Be practical about the number of classification categories models can handle • Decide on truncation and bucketing of continuous variables • Create new variables
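
The truncation and bucketing step on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable (building age), the cap of 100, and the bucket edges are invented for the example and do not come from the presentation.

```python
from bisect import bisect_right


def truncate(value, lower, upper):
    """Clamp a continuous value into the range [lower, upper]."""
    return max(lower, min(value, upper))


def bucket(value, edges):
    """Return the bucket index for value, given ascending bucket edges.

    Bucket 0 holds values below edges[0]; bucket len(edges) holds values
    at or above edges[-1]. A value equal to an edge goes to the upper bucket.
    """
    return bisect_right(edges, value)


# Hypothetical example: building age in years, capped at 100,
# then grouped into 4 buckets: <10, 10-24, 25-49, 50+.
AGE_EDGES = [10, 25, 50]
raw_ages = [3, 10, 42, 250]

capped_ages = [truncate(a, 0, 100) for a in raw_ages]
age_buckets = [bucket(a, AGE_EDGES) for a in capped_ages]

print(capped_ages)  # [3, 10, 42, 100]
print(age_buckets)  # [0, 1, 2, 3]
```

Truncating before bucketing keeps an extreme or miskeyed value (250 here) from needing its own category, which is one practical way to limit the number of classification levels a model has to handle.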

  9. Develop Understanding of Modeling Procedures and Diagnostics • Basic modeling training – GLM, Data Mining • What software is available? • What software/models work for my data investigation, modeling problem, etc.? • What computer capacity do I need? • Learn how to use software • Learn how to interpret the diagnostics

  10. Development of a Model • Analyze historical policy and loss data • Policy level detail • Location level detail • Link policy and loss data with external and/or internal data: • Specific business risk data – operational, financial • Specific location data – demographic, weather • Other data – building, vehicle, agency • Need link between policy detail and other data
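
The link between policy detail and loss data on slide 10 might look like the sketch below. The record layout and field names (`policy_id`, `premium`, `incurred`) are assumptions made for illustration; the presentation does not specify a schema.

```python
from collections import defaultdict


def link_policy_losses(policies, claims):
    """Attach total incurred loss to each policy record.

    Returns (linked, unmatched): linked is a list of policy dicts with an
    added 'incurred' field (0.0 when the policy has no claims); unmatched
    lists claim policy ids that have no policy record, which usually
    signals a key or data-quality problem worth investigating.
    """
    incurred = defaultdict(float)
    for claim in claims:
        incurred[claim["policy_id"]] += claim["incurred"]

    known_ids = {p["policy_id"] for p in policies}
    linked = [dict(p, incurred=incurred.get(p["policy_id"], 0.0)) for p in policies]
    unmatched = sorted(set(incurred) - known_ids)
    return linked, unmatched


# Hypothetical records, invented for this example.
policies = [
    {"policy_id": "P1", "premium": 1000.0},
    {"policy_id": "P2", "premium": 1500.0},
]
claims = [
    {"policy_id": "P1", "incurred": 200.0},
    {"policy_id": "P1", "incurred": 50.0},
    {"policy_id": "P9", "incurred": 999.0},  # no matching policy record
]

linked, unmatched = link_policy_losses(policies, claims)
```

Reporting the unmatched keys rather than silently dropping them makes it easier to balance the linked data back to the source totals.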

  11. Explore the Data with Simple Modeling Techniques • Start with a sample of the data • Try different classical analyses on the sample, such as: • regression • linear models • correlation matrices • Make use of graphical options to explore the data
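
As a rough sketch of the exploration on slide 11, the code below draws a random sample and computes a Pearson correlation matrix using only the standard library. The column names and the generated data are invented for the example; a real project would more likely use a statistics package.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """columns: dict of name -> list of values. Returns a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical population of risks; insured_value is built as a linear
# function of building_age purely so the example has a known relationship.
rng = random.Random(0)
population = [
    {"building_age": i % 60, "insured_value": 100 + 2 * (i % 60), "loss": rng.random()}
    for i in range(1000)
]

# Start with a sample of the data rather than the full file.
sample = rng.sample(population, 200)
columns = {name: [row[name] for row in sample] for name in sample[0]}
matrix = correlation_matrix(columns)
```

Pairs of predictors with correlations near 1 or -1 (here `building_age` and `insured_value`, by construction) are candidates for variable elimination before fitting more complex models.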

  12. Data Management Issues • Matching additional internal policy information to premium/loss data • Different points in time • Tracking & balancing audited exposures • Different summarization keys – handling of mid-term endorsements • Address scrubbing • Matching to external data for correct point in time • Significance of missing values within variable
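
Matching to external data "for the correct point in time" (slide 12) is often handled as an as-of lookup: for each policy, take the most recent external record effective on or before the policy date. The sketch below assumes a simple history of `(effective_date, value)` pairs; the dates and grades are invented for illustration.

```python
from bisect import bisect_right
from datetime import date


def as_of(history, when):
    """Return the value in force on `when`, or None if none applies yet.

    history: list of (effective_date, value) pairs sorted by effective_date.
    A record effective exactly on `when` is treated as already in force.
    """
    effective_dates = [d for d, _ in history]
    i = bisect_right(effective_dates, when)
    return history[i - 1][1] if i else None


# Hypothetical external history for one location.
location_grade_history = [
    (date(2004, 1, 1), "A"),
    (date(2005, 6, 1), "B"),
]

grade_at_inception = as_of(location_grade_history, date(2005, 3, 15))
grade_at_renewal = as_of(location_grade_history, date(2006, 3, 15))
grade_before_history = as_of(location_grade_history, date(2003, 12, 31))
```

Returning `None` when no record was yet in force keeps the gap visible as a missing value, so its significance can be assessed rather than masked by a later value.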

  13. Modeling Activities • Selection of Predictors – variable elimination, variable transformation • Start with classical models prior to evaluating more complex models • Methodology Understanding and Evaluation • Evaluation of Model Performance

  14. Data Mining Techniques: balance good fit with explanatory power • Generalized Linear Models • Classification Trees • Regression Trees • Multivariate Adaptive Regression Splines • Neural Networks

  15. Data Mining Process • Data Linking • Data Gathering • Data Cleansing • Analyze Variables • Evaluation • Business Knowledge • Determine Predictive Variables • Data Mining

  16. Model Performance • Lift Curve Analysis • Score all risks in sample • Rank risks by score from Bad to Good • Compare loss ratio of risks in each decile to loss ratio for all risks
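
The lift curve steps on slide 16 can be sketched as follows. The assumptions here are not from the presentation: a higher score is taken to mean a worse risk, deciles are equal-count groups, and each risk is a `(score, premium, loss)` tuple with invented values.

```python
def decile_lift(risks, n_bins=10):
    """Compare each bin's loss ratio to the overall loss ratio.

    risks: list of (score, premium, loss) tuples; a higher score is
    assumed to mean a worse risk. Returns (overall_loss_ratio, table),
    where table[0] is the worst-scoring bin.
    """
    ranked = sorted(risks, key=lambda r: r[0], reverse=True)  # bad to good
    overall_lr = sum(r[2] for r in ranked) / sum(r[1] for r in ranked)

    n = len(ranked)
    table = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        premium = sum(r[1] for r in chunk)
        loss = sum(r[2] for r in chunk)
        loss_ratio = loss / premium if premium else float("nan")
        table.append({
            "bin": b + 1,
            "loss_ratio": loss_ratio,
            "relativity": loss_ratio / overall_lr,  # above 1.0 = worse than average
        })
    return overall_lr, table


# Hypothetical scored sample: 20 risks, 100 premium each, with loss
# rising with score so the model appears to separate risks well.
scored_risks = [(score, 100.0, 5.0 * score) for score in range(1, 21)]

overall_lr, lift_table = decile_lift(scored_risks)
for row in lift_table:
    print(row["bin"], round(row["loss_ratio"], 3), round(row["relativity"], 2))
```

A model with useful lift shows relativities well above 1.0 in the worst deciles and well below 1.0 in the best ones; a flat table suggests the score is not separating risks.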

  17. Sample Lift Curve Analysis

  18. Business Issues • Model uses information from a third-party vendor • Model needs to be accessible electronically • Technology Issues • Implementation Decisions

  19. Technology Issues • Develop/Modify Systems • Integrate into underwriting/rating workflow • Decision process • Agency system • Decide on technology • Web-based interface • API, FTP, MQ, TCP/IP, HTTPS webservices

  20. Implementation of Model Solution focus/usage: • Suitability of risk for underwriting decision • Source for additional pricing factors • Consistency in underwriting/pricing decisions • Compliance with regulations based on implementation decision • Consider model alone or model with other information available from application

  21. Implementation of Model Workflows: • Underwriting • New Business • Renewal business • Rating • Pricing • Coverage Adjustment

  22. Business Implementation of Model • Strategic Plan - need management involvement • Prepare Announcement/Training Material for Internal & External Customers • Coordinate Implementation • Monitor Feedback/Adjust Implementation

  23. Future Plans • Determine Process for Updates to Model • Use of Updated Data • Use of New Data Variables • Use of New Techniques
