
Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches

This presentation reviews the GIRO Data Quality Working Party paper, discusses data quality in predictive modeling, and focuses on exploratory data analysis as a crucial part of data preparation. It covers the working party's literature review, horror stories, survey, experiment, and concluding remarks, offering practical insight into data management in the insurance industry.


Presentation Transcript


  1. Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches • CAS Annual Meeting • Louise Francis, Francis Analytics and Actuarial Data Mining, Inc. • www.data-mines.com • Louise.francis@data-mines.com

  2. Objectives • Review GIRO Data Quality Working Party Paper • Discuss Data Quality in Predictive Modeling • Focus on a key part of data preparation: Exploratory data analysis

  3. GIRO Working Party Paper • Literature Review • Horror Stories • Survey • Experiment • Actions • Concluding Remarks

  4. Insurance Data Management Association • The IDMA is an American organisation which promotes professionalism in the Data Management discipline through education, certification and discussion forums • The IDMA web site: • Suggests publications on data quality, • Describes a data certification model, and • Contains Data Management Value Propositions which document the value to various insurance industry stakeholders of investing in data quality • http://www.idma.org

  5. Horror Stories – Non-Insurance • Heart-and-Lung Transplant – wrong blood type • Bombing of the Chinese Embassy in Belgrade • Mars Climate Orbiter – confusion between imperial and metric units • Fidelity Mutual Fund – withdrawal of dividend • Porter County, Indiana – Tax Bill and Budget Shortfall

  6. Horror Stories – Rating/Pricing • Examples faced by ISO: • Exposure recorded in units of $10,000 instead of $1,000 • Large insurer reporting personal auto data as miscellaneous and hence missed from ratemaking calculations • One company reporting all its Florida property losses as fire (including hurricane years) • Mismatched coding for policy and claims data

  7. Horror Stories - Katrina • US weather models underestimated Katrina's costs by approx. 50% (Westfall, 2005) • A 2004 RMS study highlighted exposure data that was: • Out-of-date • Incomplete • Mis-coded • Many flood victims had no flood insurance after being told by agents that they were not in flood-risk areas.

  8. Data Quality Survey of Actuaries • Purpose: Assess the impact of data quality issues on the work of general insurance actuaries • 2 questions: • percentage of time spent on data quality issues • proportion of projects adversely affected by such issues

  9. Results - Percentage of Time

  10. Results - Percentage of Projects

  11. Survey Conclusions • Data quality issues have a significant impact on the work of general insurance actuaries • about a quarter of their time is spent on such issues • about a third of projects are adversely affected • The impact varies widely between different actuaries, even those working in similar organisations • Limited evidence to suggest that the impact is more significant for consultants

  12. Hypothesis • The uncertainty of actuarial estimates of ultimate incurred losses based on poor data is significantly greater than that based on good data

  13. Data Quality Experiment • Examine the impact of incomplete and/or erroneous data on actuarial estimates of ultimate losses and loss reserves • Use real data with simulated limitations and/or errors and observe the potential error in the actuarial estimates

  14. Data Used in Experiment • Real data for primary private passenger bodily injury liability business for a single no-fault state • Eighteen (18) accident years of fully developed data; thus, true ultimate losses are known

  15. Actuarial Methods Used • Paid chain ladder models • Incurred chain ladder models • Frequency-severity models • Inverse power curve for tail factors • No judgment used in applying methods

  16. Completeness of Data Experiments • Vary the size of the sample; that is: • All years • Use only 7 accident years • Use only the last 3 diagonals

  17. Data Error Experiments • Simulated data quality issues: • Misclassification of losses by accident year • Early years not available • Late processing of financial information • Paid losses replaced by crude estimates • Overstatements followed by corrections in the following period • Definition of reported claims changed
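
None of the error-injection mechanics appear in the transcript, so the following Python sketch only illustrates the flavour of the experiment: it injects one of the listed issues, misclassification of losses by accident year, into hypothetical claim-level data. The field names, error rate, and loss distribution are all illustrative assumptions, not the working party's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical claim-level data: accident year and paid loss amount.
claims = pd.DataFrame({
    "accident_year": rng.integers(1990, 2008, size=1000),
    "paid_loss": rng.lognormal(mean=8, sigma=1.5, size=1000).round(2),
})
clean = claims.groupby("accident_year")["paid_loss"].sum()

# Misclassification of losses by accident year: shift a small fraction
# of claims into an adjacent year. The 5% rate is illustrative.
mask = rng.random(len(claims)) < 0.05
claims.loc[mask, "accident_year"] += rng.choice([-1, 1], size=mask.sum())
dirty = claims.groupby("accident_year")["paid_loss"].sum()

# The difference by year measures the distortion the error introduces.
print((dirty - clean).dropna().round(2))
```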

  18. Measure Impact of Data Quality • Compare Estimated to Actual Ultimates • Use Bootstrapping to evaluate the effect of different samples on results
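
The transcript does not show how the bootstrap was applied. As a minimal sketch of the resampling idea: draw many samples with replacement, re-estimate each time, and measure the spread. The data and the stand-in estimator below are hypothetical; the working party bootstrapped reserve estimates.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical severity sample standing in for the experiment's data.
losses = rng.lognormal(mean=8, sigma=1.2, size=500)

def estimate(sample):
    # Stand-in for an actuarial estimator (the working party used
    # chain-ladder ultimates); here simply the mean severity.
    return sample.mean()

# Bootstrap: resample with replacement and re-estimate many times.
boot = np.array([
    estimate(rng.choice(losses, size=len(losses), replace=True))
    for _ in range(2000)
])

# The spread of the bootstrap distribution measures how sensitive the
# estimate is to the particular sample drawn; wider for dirtier data.
print("estimate:", round(estimate(losses), 2))
print("bootstrap std dev:", round(boot.std(), 2))
```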

  19. Results – Experiment 1 • More data generally reduces the volatility of the estimation errors

  20. Results – Experiment 2 • Extreme volatility in the estimates, especially those based on paid data • Actuaries' ability to recognise and account for data quality issues is critical • Actuarial adjustments to the data may never fully correct for data quality issues

  21. Results – Bootstrapping • Less dispersion in results for error free data • Standard deviation of estimated ultimate losses greater for the modified data (data with errors) • Confirms original hypothesis

  22. Conclusions Resulting from Experiment • Greater accuracy and less variability in actuarial estimates when: • Quality data used • Greater number of accident years used • Data quality issues can erode or even reverse the gains of increased volumes of data: • If errors are significant, more data may worsen estimates due to the propagation of errors for certain projection methods • Significant uncertainty in results when: • Data is incomplete • Data has errors

  23. Predictive Modeling & Exploratory Data Analysis

  24. Modeling Process

  25. Data Preprocessing

  26. Exploratory Data Analysis • Typically the first step in analyzing data • Makes heavy use of graphical techniques • Also makes use of simple descriptive statistics • Purpose • Find outliers (and errors) • Explore structure of the data

  27. Example Data • Private passenger auto • Some variables are: • Age • Gender • Marital status • Zip code • Earned premium • Number of claims • Incurred losses • Paid losses

  28. Some Methods for Numeric Data • Visual • Histograms • Box and Whisker Plots • Stem and Leaf Plots • Statistical • Descriptive statistics • Data spheres

  29. Histograms • Can do them in Microsoft Excel

  30. Histograms of Age Variable: Varying Window Size

  31. Formula for Window Width
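
The slide's formula is not reproduced in the transcript. One widely used rule for histogram window (bin) width is Freedman-Diaconis, h = 2 * IQR * n^(-1/3); treating it as the slide's formula is an assumption. A sketch:

```python
import numpy as np

def window_width(x):
    """Freedman-Diaconis histogram window width: h = 2 * IQR * n**(-1/3).
    One common rule; the slide's exact formula is not shown in the
    transcript."""
    x = np.asarray(x)
    q75, q25 = np.percentile(x, [75, 25])
    return 2 * (q75 - q25) * len(x) ** (-1 / 3)

# Hypothetical driver-age variable.
rng = np.random.default_rng(seed=1)
ages = rng.normal(40, 12, size=1000).clip(16, 90)
print("suggested window width:", round(window_width(ages), 2))
```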

  32. Example of Suspicious Value

  33. Box and Whisker Plot

  34. Plot of Heavy Tailed Data: Paid Losses

  35. Heavy Tailed Data – Log Scale

  36. Box and Whisker Example
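
As a sketch of the point made in slides 33-36: a box-and-whisker plot of heavy-tailed paid losses is dominated by a few huge values on a linear scale, and a log scale makes the bulk of the distribution visible. The lognormal sample below is a hypothetical stand-in for real paid losses.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated lognormal losses as a stand-in for real heavy-tailed paid losses.
rng = np.random.default_rng(seed=2)
paid = rng.lognormal(mean=8, sigma=2, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.boxplot(paid)
ax1.set_title("Paid Losses")          # a few huge values crush the box
ax2.boxplot(paid)
ax2.set_yscale("log")                 # log scale spreads out the bulk
ax2.set_title("Paid Losses (log scale)")
plt.tight_layout()
plt.show()
```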

  37. Descriptive Statistics: Analysis ToolPak

  38. Descriptive Statistics • License Year has a minimum and maximum that are impossible
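
A minimal illustration of this check, assuming a hypothetical license-year field and illustrative plausibility bounds: simple descriptive statistics (the analogue of the Analysis ToolPak output) expose the impossible values at once.

```python
import pandas as pd

# Hypothetical policy data; license_year contains two impossible values.
df = pd.DataFrame({"license_year": [1988, 1995, 2003, 99, 2250, 1979]})

# Descriptive statistics expose the impossible minimum (99) and
# maximum (2250) immediately.
print(df["license_year"].describe())

# Flag records outside a plausible range; the bounds are illustrative.
print(df[(df["license_year"] < 1900) | (df["license_year"] > 2010)])
```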

  39. Data Spheres: The Mahalanobis Distance Statistic
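
The statistic itself is standard: the Mahalanobis distance of a record x from the multivariate mean xbar, with sample covariance matrix S, is D(x) = sqrt((x - xbar)' S^-1 (x - xbar)). Large distances flag records that are jointly unusual even when no single variable looks extreme. A sketch on hypothetical two-variable data:

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the multivariate mean:
    D_i = sqrt((x_i - xbar)' S^-1 (x_i - xbar)), S = sample covariance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, S_inv, centered))

# Hypothetical two-variable example (age, paid loss) with one bad record.
rng = np.random.default_rng(seed=3)
X = np.column_stack([rng.normal(40, 10, size=200),
                     rng.lognormal(mean=8, sigma=1, size=200)])
X[0] = [200.0, 5e7]  # an impossible record
d = mahalanobis_distances(X)
print("indices of largest distances:", np.argsort(d)[-5:])
```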

  40. Screening Many Variables at Once • Plot of longitude and latitude of zip codes in the data • Examination of outliers indicated drivers in CA and PR even though policies were written in only one mid-Atlantic state

  41. Records With Unusual Values Flagged

  42. Categorical Data: Data Cubes

  43. Categorical Data • Data Cubes • Usually frequency tables • Search for missing values coded as blanks

  44. Categorical Data • Table highlights inconsistent coding of marital status
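
A sketch of such a table in Python, on hypothetical records contrived to show both problems named in slides 43-44: inconsistent marital-status codes and blanks standing in for missing values.

```python
import pandas as pd

# Hypothetical records with inconsistent marital-status coding and
# blanks standing in for missing values.
df = pd.DataFrame({
    "marital_status": ["M", "S", "Married", "M", "", "S", "m", "D"],
    "gender":         ["F", "M", "F", "", "M", "F", "M", "F"],
})

# A simple two-way frequency table (one face of a data cube):
# inconsistent codes ("M" vs "Married" vs "m") and blank levels
# show up immediately as their own rows and columns.
print(pd.crosstab(df["marital_status"], df["gender"], margins=True))
```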

  45. Screening for Missing Data

  46. Blanks as Missing
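
A minimal sketch of this screen, with hypothetical field names and values: whitespace-only strings are converted to NaN so the standard missing-value count catches them.

```python
import numpy as np
import pandas as pd

# Hypothetical fields where blanks disguise missing values.
df = pd.DataFrame({
    "zip_code": ["19103", "", "08540", "   ", "10001"],
    "age":      [34, 51, np.nan, 27, 45],
})

# Convert blank and whitespace-only strings to NaN so the standard
# missing-value screen counts them as missing.
df = df.replace(r"^\s*$", np.nan, regex=True)
print(df.isna().sum())
```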

  47. Conclusions • Anecdotal horror stories illustrate the potentially dramatic impact of data quality problems • The data quality survey suggests data quality issues impose a significant cost on actuarial work • The Data Quality Working Party urges actuaries to become data quality advocates within their organizations • Techniques from Exploratory Data Analysis can be used to detect data glitches

  48. Library for Getting Started • Dasu, T. and Johnson, T., Exploratory Data Mining and Data Cleaning, Wiley, 2003 • Francis, L.A., "Dancing with Dirty Data: Methods for Exploring and Cleaning Data", CAS Winter Forum, March 2005, www.casact.org • The Data Quality Working Party paper will be posted on the GIRO web site in about nine months
