
CAS Predictive Modeling Seminar Practical Issues in Model Design


  1. CAS Predictive Modeling Seminar: Practical Issues in Model Design
     Chuck Boucek, (312) 879-3859

  2. Overview
     • Data usually does not seamlessly fit into model assumptions
     • The focus of this presentation is the impact that selected issues have on the design matrix
     • Agenda
       • Overview of the Design Matrix
       • Non-linearity in predictors
       • Missing data

  3. What is the Design Matrix?
     Design Matrix | Non-Linearity | Missing Data
     • Representation of the predictor variables used to construct the model

  4. How is GLM Fit to Data?
     Design Matrix x Coefficients = Linear Predictors

       | 1  1  0  0  125  .033 |     | a1 |     | LP1 |
       | 0  1  0  0  235  .032 |     | a2 |     | LP2 |
       | 1  1  1  0  240  .034 |  x  | a3 |  =  | LP3 |
       | 0  1  1  0  350  .044 |     | a4 |     | LP4 |
       | 1  1  0  1  100  .023 |     | a5 |     | LP5 |
       | 0  1  0  1  110  .025 |     | a6 |     | LP6 |

     • Linear predictors are transformed to estimates of the response data via the inverse link function
     • Family and link function determine the form of the MLE
     • Family: Gaussian, Link: identity, MLE:
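
The matrix product on this slide can be sketched in a few lines of Python with NumPy. The design matrix below is transcribed from the slide; the coefficient values and the response vector are hypothetical and chosen only for illustration. For the Gaussian family with identity link, maximizing the likelihood is equivalent to minimizing the sum of squared residuals, so an ordinary least-squares solve stands in for the GLM fit in this sketch.

```python
import numpy as np

# Design matrix transcribed from the slide: six observations, six columns.
X = np.array([
    [1, 1, 0, 0, 125, 0.033],
    [0, 1, 0, 0, 235, 0.032],
    [1, 1, 1, 0, 240, 0.034],
    [0, 1, 1, 0, 350, 0.044],
    [1, 1, 0, 1, 100, 0.023],
    [0, 1, 0, 1, 110, 0.025],
])

# Hypothetical coefficients a1..a6 (illustrative values, not from the slides).
a = np.array([0.10, -1.50, 0.20, -0.30, 0.001, 5.0])

# Linear predictors LP1..LP6: one per row of the design matrix.
lp = X @ a

# The inverse link maps linear predictors to fitted means.
inverse_link = {"identity": lambda eta: eta, "log": np.exp}
mu_identity = inverse_link["identity"](lp)
mu_log = inverse_link["log"](lp)

# Gaussian family with identity link: the MLE is the least-squares solution.
# The response values are hypothetical. With six rows and six columns this
# toy fit is saturated, so the fitted values reproduce y exactly.
y = np.array([-1.0, -1.2, -0.8, -0.9, -1.5, -1.4])
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ a_hat
```

In a realistic model there are far more rows than columns, and the least-squares fit no longer passes through every observation.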

  5. Non-Linearity – Description of Issue

  6. Non-Linearity – Description of Issue
     • GLMs fit linear patterns to data
     • Produces poor fit for certain predictor variables
     • Splines can address non-linearity within a GLM

  7. Natural Cubic Spline Characteristics
     • 3rd degree polynomial between the knots
     • Continuous in value, first derivative and second derivative at the knots
     • Linear outside of the boundary knots

  8. GLM with a Natural Spline
     • Two columns are added to the design matrix
     • These columns are the spline basis
     • Two additional coefficients are needed
     • GLM is fit with same MLE and link function

       | 1    0.6    0.0  1  0  0  125  .033 |     | a1 |     | LP1 |
       | 0   98.4   66.1  1  0  0  235  .032 |     | a2 |     | LP2 |
       | 1  109.8   75.1  1  1  0  240  .034 |  x  | a3 |  =  | LP3 |
       | 0  497.3  401.3  1  1  0  350  .044 |     | a4 |     | LP4 |
       | 1    0.0    0.0  1  0  1  100  .023 |     | a5 |     | LP5 |
       | 0    0.0    0.0  1  0  1  110  .025 |     | a6 |     | LP6 |
                                                   | a7 |
                                                   | a8 |
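
A minimal sketch of how two spline-basis columns can be produced, assuming the truncated-power form of the natural cubic spline basis given in The Elements of Statistical Learning (listed under Sources). The knot locations and predictor values are hypothetical. The presentation does not say how its basis was computed, and other software typically uses a different, equivalent basis, so this sketch is not expected to reproduce the numbers on the slide.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis without the constant column.

    With K knots this returns K - 1 columns: x itself plus K - 2
    nonlinear columns that are linear beyond the boundary knots.
    """
    x = np.asarray(x, dtype=float)
    knots = np.sort(np.asarray(knots, dtype=float))
    K = len(knots)

    def d(k):
        # Scaled difference of truncated cubics at knot k and the last knot.
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - knots[-1], 0.0) ** 3
        return num / (knots[-1] - knots[k])

    cols = [x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

# Hypothetical predictor values and knots (two boundary knots, one interior).
x = np.array([125.0, 235.0, 240.0, 350.0, 100.0, 110.0])
knots = [100.0, 200.0, 350.0]

# Three knots give two basis columns, so two extra coefficients are needed.
B = natural_spline_basis(x, knots)
```

The columns of `B` would be appended to the remaining columns of the design matrix, and the GLM is then fit exactly as before.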

  9. GLM with Natural Spline
     • Proper reasonability testing
     • Statistical Significance
     • Time Consistency Plot

  10. Time Consistency Plot

  11. Missing Data – Description of Issue
      • Missing data can present unique challenges in model creation

  12. Missing Data – Description of Issue
      • What methodologies exist for addressing missing data?

  13. Methodology #1
      • Listwise Deletion: Eliminate any row in the design matrix with missing values
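
As a small illustration of listwise deletion, the sketch below drops every row of a design matrix that contains a missing value. The matrix is hypothetical, and missing entries are assumed to be coded as NaN.

```python
import numpy as np

# Hypothetical design matrix; np.nan marks a missing value.
X = np.array([
    [1.0, 125.0],
    [0.0, np.nan],
    [1.0, 240.0],
    [np.nan, 350.0],
])

# A row is complete when none of its entries is missing.
complete = ~np.isnan(X).any(axis=1)

# Listwise deletion keeps only the complete rows.
X_complete = X[complete]
```

The same `complete` mask would also be applied to the response vector so that rows stay aligned.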

  14. Methodology #2
      • Mean Imputation: Replace missing values with mean of values where data is present
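
Mean imputation can be sketched the same way. The data below is hypothetical, and missing entries are again assumed to be coded as NaN.

```python
import numpy as np

# Hypothetical design matrix; np.nan marks a missing value.
X = np.array([
    [1.0, 125.0],
    [0.0, np.nan],
    [1.0, 240.0],
    [np.nan, 350.0],
])

# Column means computed only over the values that are present.
col_means = np.nanmean(X, axis=0)

# Replace each missing entry with the mean of its own column.
X_imputed = np.where(np.isnan(X), col_means, X)
```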

  15. Methodology #3
      • Linear Mean Imputation: Create spline basis excluding missing values and mean impute on spline basis

  16. Methodology #3
      • Linear Mean Imputation: Create spline basis excluding missing values and mean impute on spline basis
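
One possible reading of Linear Mean Imputation is sketched below: the spline basis is computed from the observed values only, and each basis column is then filled with its own mean in the rows where the predictor is missing. The basis construction (truncated-power natural cubic spline), the knots and the data are all assumptions made for illustration; the presentation does not give these details.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis (truncated-power form), no constant column."""
    x = np.asarray(x, dtype=float)
    knots = np.sort(np.asarray(knots, dtype=float))
    K = len(knots)

    def d(k):
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - knots[-1], 0.0) ** 3
        return num / (knots[-1] - knots[k])

    return np.column_stack([x] + [d(k) - d(K - 2) for k in range(K - 2)])

# Hypothetical predictor with two missing values and hypothetical knots.
x = np.array([125.0, 235.0, np.nan, 350.0, 100.0, np.nan])
knots = [100.0, 200.0, 350.0]
observed = ~np.isnan(x)

# Build the basis from the observed values only (three knots, two columns).
basis = np.full((len(x), len(knots) - 1), np.nan)
basis[observed] = natural_spline_basis(x[observed], knots)

# Mean impute on the basis: each missing row gets the column means.
basis[~observed] = basis[observed].mean(axis=0)
```

Imputing on the basis columns rather than on the raw predictor means a missing row contributes the average fitted spline effect, not the spline evaluated at the average predictor value.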

  17. Methodology #4
      • Single imputation: Use other predictor variables to build a model and impute missing values
      • Example: Model Pop Density based on AOI
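
The Pop Density example can be sketched as a simple regression on the rows where both variables are present. The data values are hypothetical, and a straight-line model is assumed purely for illustration; the presentation does not specify the form of the imputation model.

```python
import numpy as np

# Hypothetical data: AOI is fully observed, Pop Density has missing values.
aoi = np.array([100.0, 150.0, 200.0, 250.0, 300.0, 350.0])
pop_density = np.array([900.0, 800.0, np.nan, 600.0, np.nan, 400.0])
observed = ~np.isnan(pop_density)

# Fit Pop Density = b0 + b1 * AOI on the rows where Pop Density is present.
A = np.column_stack([np.ones(observed.sum()), aoi[observed]])
coef, *_ = np.linalg.lstsq(A, pop_density[observed], rcond=None)

# Single imputation: fill each missing value with its model prediction.
imputed = pop_density.copy()
imputed[~observed] = coef[0] + coef[1] * aoi[~observed]
```

Because each missing value is replaced by a point prediction, the imputed values carry no residual noise, which is the gap that multiple imputation addresses.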

  18. Methodology #5
      • Multiple Imputation: Use other predictor variables to model missing values
      • Multiple imputations are created based on distribution of residuals in estimates of missing values

  19. Multiple Imputation Process
      1. Choose starting values for the mean and covariance matrix of the predictor variables
      2. Use the mean and covariance matrix to estimate regression parameters
      3. Use the regression parameters to estimate missing values, adding a random draw from the residual normal distribution for that variable
      4. Use the resulting data set to compute a new mean and covariance matrix
      5. Make a random draw from the posterior distribution of the means and covariances
      6. Using the random draw from step 5, go back to step 2 and cycle through the process until convergence is achieved
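
The cycle on this slide can be sketched for the simplest case: one fully observed variable `x` and one variable `y` with missing values, both assumed jointly normal. The data, the number of iterations, the burn-in and the spacing between saved imputations are hypothetical. The posterior draw assumes a noninformative prior (an inverse-Wishart draw for the covariance, then a normal draw for the mean), which is one common choice and not necessarily the one used in the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x is fully observed, y has roughly 30% missing values.
n = 200
x = rng.normal(50.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 3.0, n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

def draw_mean_cov(data, rng):
    """Draw a mean and covariance from their posterior, assuming a
    multivariate normal model with a noninformative prior."""
    m, p = data.shape
    center = data.mean(axis=0)
    s = (data - center).T @ (data - center)      # sum-of-squares matrix
    prec = np.linalg.inv(s)
    prec = (prec + prec.T) / 2.0                 # enforce exact symmetry
    z = rng.multivariate_normal(np.zeros(p), prec, size=m - 1)
    cov = np.linalg.inv(z.T @ z)                 # inverse-Wishart draw
    mean = rng.multivariate_normal(center, cov / m)
    return mean, cov

# Step 1: starting values from the complete cases.
start = np.column_stack([x[~missing], y_obs[~missing]])
mean, cov = start.mean(axis=0), np.cov(start, rowvar=False)

y_filled = y_obs.copy()
slopes, imputations = [], []
for it in range(500):
    # Step 2: regression of y on x implied by the current mean and covariance.
    slope = cov[0, 1] / cov[0, 0]
    intercept = mean[1] - slope * mean[0]
    resid_sd = np.sqrt(cov[1, 1] - cov[0, 1] ** 2 / cov[0, 0])
    slopes.append(slope)
    # Step 3: predicted value plus a random draw from the residual distribution.
    noise = rng.normal(0.0, resid_sd, missing.sum())
    y_filled[missing] = intercept + slope * x[missing] + noise
    # Steps 4 and 5: recompute from the completed data, then draw from the posterior.
    mean, cov = draw_mean_cov(np.column_stack([x, y_filled]), rng)
    # Step 6 is the loop itself; keep widely spaced completed data sets.
    if it >= 100 and it % 100 == 0:
        imputations.append(y_filled.copy())
```

Each saved data set in `imputations` would be modeled separately and the results combined; the recorded `slopes` are the kind of parameter series that convergence diagnostics examine.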

  20. Multiple Imputation Process
      • Assumptions underlying multiple imputation algorithms
        • Data is Missing At Random: Missingness of predictor variable “V” cannot depend on value of “V” but can depend on values of other predictor variables.
        • Data is distributed with a Multi-Variate Normal distribution
      • Two issues that must be addressed
        • Initial convergence of iterations
        • Correlation of consecutive iterations

  21. Time Series Plot
      • Initial convergence is assessed via a time series plot

  22. Auto Correlation Plot
      [Figure: autocorrelation plot; vertical axis ACF from 0.0 to 1.0, horizontal axis lag from 0 to 100]
      • Spread between iterations is assessed via an autocorrelation plot
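
The quantity shown in an autocorrelation plot can be computed directly from a series of parameter draws. In the sketch below the series is a hypothetical autocorrelated chain standing in for the output of an imputation run, and the 0.05 cutoff used to pick a spacing between iterations is an assumption, not a value from the presentation.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series at lags 0..max_lag."""
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    denom = np.dot(s, s)
    return np.array([np.dot(s[:len(s) - k], s[k:]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical chain of parameter draws with strong serial correlation.
rng = np.random.default_rng(0)
chain = np.empty(5000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.8 * chain[t - 1] + rng.normal()

acf = autocorrelation(chain, max_lag=100)

# Smallest lag at which the autocorrelation falls below the assumed cutoff;
# imputations spaced at least this far apart are close to uncorrelated.
spacing = int(np.argmax(np.abs(acf) < 0.05))
```

Plotting `chain` against the iteration number gives the time series plot used to judge initial convergence, and plotting `acf` against lag gives the autocorrelation plot.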

  23. Testing of Missing Value Methods
      • Method #1
        • Created training and holdout data sets
        • Both contained missing data
        • Built models of claim frequency under different missing value analysis methods with the training data set
        • Identical predictor variables in all models
        • Compared results (deviance) of methods in data set where all data is present
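
The deviance comparison can be illustrated with a small helper. The presentation does not state which family was used for claim frequency; the sketch assumes a Poisson deviance, and the observed counts and both sets of predictions are hypothetical.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu)).

    The y * log(y / mu) term is taken as zero when y is zero.
    Predicted means mu must be strictly positive.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

# Hypothetical claim counts and predictions from two competing models.
y = np.array([0, 1, 0, 2, 0, 1])
mu_model_a = np.array([0.2, 0.8, 0.1, 1.5, 0.3, 0.9])
mu_model_b = np.full(6, y.mean())  # constant prediction as a baseline

dev_a = poisson_deviance(y, mu_model_a)
dev_b = poisson_deviance(y, mu_model_b)
```

A lower deviance on the comparison data set indicates the better-fitting missing value method.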

  24. Testing of Missing Value Options
      • Method #2
        • Created a model of missing probability
        • Limited modeling database to observations in which all data was present
        • Randomly generated missing values based on missing probability
        • 100 iterations
        • Built models of claim frequency under different missing value analysis methods
        • Identical predictor variables in all models
        • Compared results (deviance) of methods in data set where all data is present
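
The step of randomly generating missing values can be sketched as follows. The presentation fit its own model of missing probability; here a logistic model with made-up coefficients stands in for it, and the two predictors are simulated. Missingness depends only on the fully observed predictor, which is consistent with the Missing At Random assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical complete data: both predictors are present in every row.
aoi = rng.uniform(50.0, 500.0, n)
pop_density = rng.lognormal(6.0, 1.0, n)

# Hypothetical missing-probability model: depends on aoi, not on pop_density.
p_missing = 1.0 / (1.0 + np.exp(-(-2.0 + 0.004 * aoi)))

# 100 iterations, each with a fresh random pattern of missing values.
datasets = []
for _ in range(100):
    mask = rng.random(n) < p_missing
    datasets.append(np.where(mask, np.nan, pop_density))

missing_share = float(np.mean([np.isnan(d).mean() for d in datasets]))
```

Each element of `datasets` would then be run through every missing value method, with the resulting models scored against the complete data.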

  25. Performance of Missing Value Methods
      • Single Imputation/Multiple Imputation
      • Linear Mean Imputation
      • Mean Imputation
      • Listwise Deletion

  26. Missing Data Framework
      • Questions
        • What is the level of missing data?
        • What can be inferred about the missing data mechanism?
        • What is the size of the modeling database in which all values are present?
        • Will the data continue to be missing when the model is applied?

  27. Missing Data Framework
      • Actions
        • For low proportions of missing data: Listwise Deletion
        • For higher proportions of missing data in a large modeling database: Listwise Deletion with oversampling
        • For mid to small modeling databases: employ imputation
          • Initial exploration with Linear Mean Imputation
          • Fit final model with Single Imputation or Multiple Imputation

  28. Sources
      • Splines
        • Hastie, Tibshirani and Friedman: The Elements of Statistical Learning
      • Missing Data
        • Paul Allison: Missing Data
        • J.L. Schafer: Analysis of Incomplete Multivariate Data
        • Insightful Corporation: Analyzing Data with Missing Values in S-Plus
