
Dimension Reduction in Workers Compensation

Dimension Reduction in Workers Compensation. CAS predictive Modeling Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. Louise_francis@msn.com www.data-mines.com. Objectives. Answer questions: What is dimension reduction and why use it?



Presentation Transcript


  1. Dimension Reduction in Workers Compensation CAS Predictive Modeling Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. Louise_francis@msn.com www.data-mines.com

  2. Objectives • Answer questions: What is dimension reduction and why use it? • Introduce key methods of dimension reduction • Illustrate with examples in Workers Compensation • There will be some formulas, but emphasis is on insight into basic mechanisms of the procedures

  3. Introduction • “How do mere observations become data for analysis?” • “Specific variable values are never immutable characteristics of the data” • Jacoby, Data Theory and Dimensional Analysis, Sage Publications • Many of the dimension reduction/measurement techniques originated in the social sciences and dealt with how to create scales from responses on attitudinal and opinion surveys

  4. Unsupervised learning • Dimension reduction methods are generally unsupervised learning • Supervised learning • A dependent or target variable • Unsupervised learning • No target variable • Group like variables or like records together

  5. The Data • BLS Economic indexes • Components of inflation • Employment data • Health insurance inflation • Texas Department of Insurance closed claim data for 2002 and 2003 • Employment related injury • Excludes small claims • About 1800 records

  6. What is a dimension? • Jacoby – The number of separate and interesting sources of variation • In many studies each variable is a dimension • However, we can also view each record in a database as a dimension

  7. Dimensions

  8. The Two Major Categories of Dimension Reduction • Variable reduction • Factor Analysis • Principal Components Analysis • Record reduction • Clustering • Other methods tend to be developments on these

  9. Principal Components Analysis • A form of dimension (variable) reduction • Suppose we want to combine all the information related to the “inflation” dimension of insurance costs • Medical care costs • Employment (wage) costs • Other • Energy • Transportation • Services

  10. Principal Components • These variables are correlated but not perfectly correlated • We replace many variables with a weighted sum of the variables • These are then used as independent variables in a predictive model

  11. Factor Analysis: A Latent Factor

  12. Factor/Principal Components Analysis • Linear methods – use linear correlation matrix • Correlation matrix decomposed to find a smaller number of factors that are related to the same underlying drivers • Highly correlated variables tend to have high loadings on the same factor

  13. Factor/Principal Components Analysis

  14. Factor/Principal Components Analysis • Uses eigenvectors and eigenvalues • R is the correlation matrix, V the matrix of eigenvectors, lambda the eigenvalues
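
The eigen-decomposition on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the data are simulated stand-ins for correlated inflation indexes, and the names `X`, `R`, `V`, `eigenvalues`, and `scores` are assumptions made for the example. It checks the relation R V = V Λ and forms component scores as weighted sums of the standardized variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 4 correlated indexes that share
# one common driver plus independent noise (not the BLS data from the talk).
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 4))

# Correlation matrix of the columns.
R = np.corrcoef(X, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order, so reorder to put the largest component first.
eigenvalues, V = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# Defining relation: R V = V Lambda.
assert np.allclose(R @ V, V @ np.diag(eigenvalues))

# Principal component scores: weighted sums of the standardized variables,
# with the eigenvector entries as weights.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Z @ V
```

The columns of `scores` are uncorrelated, and the variance of each column equals the corresponding eigenvalue, which is why a few leading columns can stand in for the original variables in a predictive model.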

  15. Inflation Data

  16. Factor Rotation • Find simpler more easily interpretable factors • Use notion of factor complexity

  17. Factor Rotation • Quartimax Rotation • Maximize q • Varimax Rotation • Maximizes the variance of squared loadings for each factor rather than for each variable
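
A varimax rotation can be sketched with the commonly used SVD-based iteration. This is an illustrative implementation under stated assumptions, not the routine used for the slides: the function names `varimax` and `varimax_criterion`, the tolerance, and the example loading matrix are all hypothetical.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return (loadings ** 2).var(axis=0).sum()

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)  # sum of squares per factor
        gradient = loadings.T @ (rotated ** 3 - rotated * col_ss / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                    # nearest orthogonal matrix
        if s.sum() < objective * (1 + tol):  # no meaningful improvement
            break
        objective = s.sum()
    return loadings @ rotation, rotation

# Hypothetical loadings: a near-simple structure turned by 30 degrees,
# so every variable loads on both factors before rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.1],
                   [0.0, 0.9], [0.1, 0.8], [0.0, 0.6]])
theta = np.deg2rad(30)
turn = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ turn

rotated, rotation = varimax(unrotated)
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the way that variance is divided among the factors changes.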

  18. Varimax Rotation

  19. Plot of Loadings on Factors

  20. How Many Factors to Keep? • Eigenvalues provide information on how much variance is explained • Proportion explained by a given component = corresponding eigenvalue/n, where n is the number of variables • Use Scree Plot • Rule of thumb: keep all factors with eigenvalues > 1
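
The two retention rules can be illustrated on a small correlation matrix. The matrix below is made up for the example (three strongly related variables and one nearly unrelated one), and the variable names are assumptions.

```python
import numpy as np

# Hypothetical correlation matrix for n = 4 variables.
R = np.array([[1.0, 0.8, 0.7, 0.1],
              [0.8, 1.0, 0.6, 0.1],
              [0.7, 0.6, 1.0, 0.1],
              [0.1, 0.1, 0.1, 1.0]])
n = R.shape[0]

# Eigenvalues of a correlation matrix sum to n, sorted here largest first.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Proportion of total variance explained by each component.
proportion = eigenvalues / n
cumulative = np.cumsum(proportion)

# Rule of thumb from the slide: keep components with eigenvalue > 1.
keep = eigenvalues > 1.0

for i, (ev, pr, cu, k) in enumerate(zip(eigenvalues, proportion, cumulative, keep), 1):
    print(f"component {i}: eigenvalue={ev:.3f} proportion={pr:.3f} "
          f"cumulative={cu:.3f} keep={k}")
```

Plotting `eigenvalues` against the component number gives the scree plot; the eigenvalue > 1 rule keeps only components that explain more variance than a single standardized variable does.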

  21. WC Severity vs Factor 1

  22. WC Severity vs Factor 2

  23. What About Categorical Data? • Factor analysis is performed on numeric data • You could code data as binary dummy variables • Categorical Variables from Texas data • Injury • Cause of loss • Business Class • Health Insurance (Y/N)
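
Coding a categorical variable as binary dummy variables can be sketched as follows. The injury labels are invented for illustration and are not the categories in the Texas data.

```python
import numpy as np

# Hypothetical injury codes for five claims.
injury = np.array(["strain", "fracture", "strain", "burn", "fracture"])

# np.unique returns the sorted distinct levels and, with return_inverse,
# the integer code of each record's level.
levels, codes = np.unique(injury, return_inverse=True)

# One column per level: a 1 marks the level that the record belongs to.
dummies = np.eye(len(levels), dtype=int)[codes]

print(levels)   # column order of the dummy matrix
print(dummies)
```

Each row contains exactly one 1, so the columns always sum to a constant. For correlation-based methods it is common to drop one column per variable to avoid that exact linear dependence.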

  24. Optimal Scaling • A method of dealing with categorical variables • Can be used to model nonlinear relationships • Uses regression to • Assign numbers to categories • Fit regression coefficients • Y*=f(X*) • In each round of fitting, a new Y* and X* are created

  25. Variable Correlations

  26. Visualizations of Scaled Variables

  27. Can we use scaled variables in prediction?

  28. Tree Using Optimal Scaling Scores

  29. Tree for Subrogation

  30. Row Reduction: Cluster Analysis • Records are grouped in categories that have similar values on the variables • Examples • Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing • Text analysis: Use words that tend to occur together to classify documents • Fraud modeling • Territory definition • Note: no dependent variable used in analysis

  31. Clustering • Common methods: k-means, hierarchical • No dependent variable – records are grouped into classes with similar values on the variables • Start with a measure of similarity or dissimilarity • Maximize dissimilarity between members of different clusters
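
A bare-bones k-means loop shows the mechanism: assign each record to the nearest center, then move each center to the mean of its records. This is a sketch under stated assumptions. The data are simulated, the function names are hypothetical, and the farthest-first initialization is a simple deterministic choice made for the example, not part of the method described in the talk.

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic start: first record, then repeatedly the farthest record."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=50):
    centers = farthest_first(X, k)
    for _ in range(n_iter):
        # Euclidean distance from every record to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its records (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of 50 records, 2 variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])

labels, centers = kmeans(X, k=2)
```

In practice the variables are usually standardized first so that no single variable dominates the distance, and several starting points are tried because k-means can settle on a local optimum.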

  32. Dissimilarity (Distance) Measure – Continuous Variables • Euclidean Distance • Manhattan Distance
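
The two distance measures differ only in how coordinate differences are combined. A small worked example, with made-up records `a` and `b`:

```python
import numpy as np

# Two hypothetical records measured on three continuous variables.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

diff = a - b                          # [-3, -4, 0]

# Euclidean: square root of the sum of squared differences.
euclidean = np.sqrt((diff ** 2).sum())   # sqrt(9 + 16 + 0) = 5

# Manhattan: sum of absolute differences.
manhattan = np.abs(diff).sum()           # 3 + 4 + 0 = 7

print(euclidean, manhattan)
```

Euclidean distance weights large differences on a single variable more heavily, while Manhattan distance treats all differences linearly.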

  33. Binary Variables

  34. Binary Variables • Simple Matching • Rogers and Tanimoto
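
Both binary measures are built from the 2x2 table of agreements and disagreements between two records. The sketch below uses the standard textbook definitions; the two records are invented, and the function names are hypothetical.

```python
import numpy as np

def match_counts(x, y):
    """Counts of (1,1), (1,0), (0,1) and (0,0) pairs across the variables."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))
    b = int(np.sum(x & ~y))
    c = int(np.sum(~x & y))
    d = int(np.sum(~x & ~y))
    return a, b, c, d

def simple_matching(x, y):
    """Share of variables on which the two records agree."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)

def rogers_tanimoto(x, y):
    """Like simple matching, but disagreements count double."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + d + 2 * (b + c))

# Two hypothetical claims described by six yes/no indicators.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]

sm = simple_matching(x, y)   # 4 agreements out of 6
rt = rogers_tanimoto(x, y)   # 4 / (4 + 2 * 2)
print(sm, rt)
```

These are similarities; one minus each value gives the corresponding dissimilarity for use in clustering.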

  35. Example: Texas Data • Data from 2002 and 2003 closed claim database by Texas Ins Dept • Only claims over a threshold included • Variables used for clustering: • Report Lag • Settlement Lag • County (ranked by how often in data) • Injury • Cause of Loss • Business class

  36. Results Using Only Numeric Variables • Used Euclidean distance measure

  37. Two Stage Clustering With Categorical Variables • First compute dissimilarity measures • Then get clusters • Find optimum number of clusters

  38. Loadings of Injuries on Cluster

  39. Age and Cluster

  40. County vs Cluster

  41. Means of Financial Variables by Cluster

  42. Tying Things Together: Multidimensional Scaling • A mathematical way to connect clustering and factor analysis • Data can be decomposed into key row dimensions times a diagonal weight matrix times key column dimensions
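
The decomposition described on this slide reads like the singular value decomposition (SVD): row dimensions times a diagonal weight matrix times column dimensions. The sketch below assumes that reading and uses simulated data; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data matrix: 6 records (rows) by 3 variables (columns).
X = rng.normal(size=(6, 3))

# X = U diag(s) Vt: U holds row dimensions, s the weights,
# and Vt the column dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
reconstructed = U @ np.diag(s) @ Vt

# Link to factor/principal components analysis: the squared singular
# values are the eigenvalues of X'X.
eig_xtx = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

# Keeping only the largest weight gives a rank-1 approximation of the data.
rank1 = s[0] * np.outer(U[:, 0], Vt[0])
```

The column side (`Vt`) corresponds to the variable reduction of factor and principal components analysis, while the row side (`U`) positions the records in the same reduced space, which is the sense in which the decomposition ties the two views together.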

  43. Modern dimension reduction • Hidden layer in neural networks is like a nonlinear principal components analysis • Projection Pursuit Regression – a nonlinear PCA • Kohonen self-organizing maps – a kind of neural network that does clustering • These can be understood as enhancements of factor analysis or clustering

  44. Kohonen SOM for Fraud

  45. Recommended References • Hatcher, 1994, A Step-by-Step Approach to Using the SAS System for Factor Analysis and Structural Equation Modeling, SAS Publications • Jacoby, 1991, Data Theory and Dimensional Analysis, Sage Publications • Kaufman and Rousseeuw, 1990, Finding Groups in Data, Wiley • Kim and Mueller, 1978, Factor Analysis: Statistical Methods and Practical Issues, Sage Publications

  46. Questions?
