CAS Predictive Modeling Seminar
Practical Issues in Model Design
Chuck Boucek (312) 879-3859
Overview
• Data usually does not fit seamlessly into model assumptions
• The focus of this presentation is the impact that selected issues have on the design matrix
• Agenda
  • Overview of the Design Matrix
  • Non-linearity in predictors
  • Missing data
Design Matrix | Non-Linearity | Missing Data
What is the Design Matrix?
• Representation of the predictor variables used to construct the model
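Design Matrix x Coefficients = Linear Predictors
$$
\begin{bmatrix}
1 & 1 & 0 & 0 & 125 & 0.033 \\
0 & 1 & 0 & 0 & 235 & 0.032 \\
1 & 1 & 1 & 0 & 240 & 0.034 \\
0 & 1 & 1 & 0 & 350 & 0.044 \\
1 & 1 & 0 & 1 & 100 & 0.023 \\
0 & 1 & 0 & 1 & 110 & 0.025
\end{bmatrix}
\times
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{bmatrix}
=
\begin{bmatrix} LP_1 \\ LP_2 \\ LP_3 \\ LP_4 \\ LP_5 \\ LP_6 \end{bmatrix}
$$
How is GLM Fit to Data?
• Linear predictors are transformed to an estimate of the response data via the inverse link function
• Family and link function determine the form of the MLE
• Family: Gaussian, Link: identity, MLE: ordinary least squares (minimize the sum of squared differences between the response and the linear predictor)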
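As a minimal sketch of the matrix product above, the following computes one linear predictor per row and passes it through an inverse link. The coefficient values and the choice of a log link are assumptions made only for illustration; the slide does not give fitted coefficients.

```python
import math

# Design matrix from the slide: six observations, six predictor columns.
X = [
    [1, 1, 0, 0, 125, 0.033],
    [0, 1, 0, 0, 235, 0.032],
    [1, 1, 1, 0, 240, 0.034],
    [0, 1, 1, 0, 350, 0.044],
    [1, 1, 0, 1, 100, 0.023],
    [0, 1, 0, 1, 110, 0.025],
]

# Hypothetical coefficients a1..a6, chosen only for illustration.
a = [0.1, -2.0, 0.2, 0.3, 0.001, 5.0]


def linear_predictors(design, coefs):
    """Return one linear predictor per row: the dot product of row and coefficients."""
    return [sum(x * c for x, c in zip(row, coefs)) for row in design]


def inverse_log_link(lp):
    """Inverse of the log link (an assumed link): maps a linear predictor to a positive mean."""
    return math.exp(lp)


lp = linear_predictors(X, a)
fitted = [inverse_log_link(v) for v in lp]
```

With the identity link named on the slide, the fitted values would simply equal `lp`.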
Non-Linearity – Description of Issue
• GLMs fit linear patterns to data
• This produces a poor fit for predictor variables whose relationship to the response is not linear
• Splines can address non-linearity within a GLM
Natural Cubic Spline Characteristics
• 3rd degree polynomial between the knots
• Continuous value, first derivative and second derivative at the knots
• Linear outside of the boundary knots
GLM with a Natural Spline
• Two columns are added to the design matrix
• These columns are the spline basis
• Two additional coefficients are needed
• GLM is fit with the same MLE and link function
$$
\begin{bmatrix}
1 & 0.6 & 0.0 & 1 & 0 & 0 & 125 & 0.033 \\
0 & 98.4 & 66.1 & 1 & 0 & 0 & 235 & 0.032 \\
1 & 109.8 & 75.1 & 1 & 1 & 0 & 240 & 0.034 \\
0 & 497.3 & 401.3 & 1 & 1 & 0 & 350 & 0.044 \\
1 & 0.0 & 0.0 & 1 & 0 & 1 & 100 & 0.023 \\
0 & 0.0 & 0.0 & 1 & 0 & 1 & 110 & 0.025
\end{bmatrix}
\times
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \end{bmatrix}
=
\begin{bmatrix} LP_1 \\ LP_2 \\ LP_3 \\ LP_4 \\ LP_5 \\ LP_6 \end{bmatrix}
$$
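One way to produce two spline columns is the truncated power construction for natural cubic splines described in Hastie, Tibshirani and Friedman, sketched below with three knots. The knot locations and predictor values are hypothetical, and the resulting numbers will not match the slide, whose basis values were presumably produced by a different parameterization.

```python
def natural_spline_columns(x, knots):
    """Natural cubic spline basis columns for one value x, excluding the intercept.

    Returns a linear term plus len(knots) - 2 cubic terms that are built
    to be linear beyond the last knot and zero below the first knot.
    """
    def pos_cube(u):
        return u ** 3 if u > 0 else 0.0

    last = knots[-1]

    def d(k):
        return (pos_cube(x - knots[k]) - pos_cube(x - last)) / (last - knots[k])

    n_knots = len(knots)
    return [float(x)] + [d(k) - d(n_knots - 2) for k in range(n_knots - 2)]


knots = [10, 30, 50]        # hypothetical knot locations
z = [12, 35, 36, 58, 5, 8]  # hypothetical values of the non-linear predictor

# Original six-column design matrix from the earlier slide.
base_rows = [
    [1, 1, 0, 0, 125, 0.033],
    [0, 1, 0, 0, 235, 0.032],
    [1, 1, 1, 0, 240, 0.034],
    [0, 1, 1, 0, 350, 0.044],
    [1, 1, 0, 1, 100, 0.023],
    [0, 1, 0, 1, 110, 0.025],
]

# Append the two spline columns to every row: six columns become eight.
augmented = [row + natural_spline_columns(v, knots) for row, v in zip(base_rows, z)]
```

The columns are appended at the end here for simplicity; their position in the matrix does not affect the fit.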
GLM with Natural Spline
• Proper reasonability testing
• Statistical Significance
• Time Consistency Plot
Time Consistency Plot
Missing Data – Description of Issue
• Missing data can present unique challenges in model creation
• What methodologies exist for addressing missing data?
Methodology #1
• Listwise Deletion: Eliminate any row in the design matrix with missing values
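A minimal sketch of listwise deletion, assuming missing entries are represented as `None` (a representation chosen here for illustration) and using made-up rows:

```python
def listwise_delete(rows):
    """Keep only the rows that contain no missing (None) entries."""
    return [row for row in rows if all(v is not None for v in row)]


rows = [
    [1, 125, 0.033],
    [0, None, 0.032],   # dropped: second column is missing
    [1, 240, None],     # dropped: third column is missing
    [0, 350, 0.044],
]
complete = listwise_delete(rows)
```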
Methodology #2
• Mean Imputation: Replace missing values with the mean of the values where data is present
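Mean imputation can be sketched for a single column as follows, again assuming `None` marks a missing entry. The column values are hypothetical.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


pop_density = [125, None, 240, 350, None, 110]  # hypothetical values
filled = mean_impute(pop_density)
```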
Methodology #3
• Linear Mean Imputation: Create the spline basis excluding missing values and mean impute on the spline basis
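The sketch below is one reading of this method, not necessarily the presenter's implementation: the spline columns are computed only for rows where the predictor is present, and rows with a missing predictor receive the column means of those spline columns. The three-knot natural spline basis, the knot locations and the data are all assumptions.

```python
def spline_columns(x, knots):
    """Natural cubic spline columns (linear term plus cubic terms), no intercept."""
    def pos_cube(u):
        return u ** 3 if u > 0 else 0.0

    last = knots[-1]

    def d(k):
        return (pos_cube(x - knots[k]) - pos_cube(x - last)) / (last - knots[k])

    n_knots = len(knots)
    return [float(x)] + [d(k) - d(n_knots - 2) for k in range(n_knots - 2)]


def linear_mean_impute(values, knots):
    """Build spline columns from observed values; fill missing rows with column means."""
    observed = [spline_columns(v, knots) for v in values if v is not None]
    n_cols = len(observed[0])
    col_means = [sum(r[j] for r in observed) / len(observed) for j in range(n_cols)]
    return [list(col_means) if v is None else spline_columns(v, knots) for v in values]


rows = linear_mean_impute([20, None, 40], knots=[10, 30, 50])
```

Under this reading, a missing row contributes the average of the spline terms rather than the spline evaluated at the average predictor value.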
Methodology #4
• Single Imputation: Use other predictor variables to build a model and impute missing values
• Example: Model population density based on AOI
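A minimal sketch of single imputation, assuming a simple one-predictor least squares line as the imputation model. The slide does not say which model form was used, and the AOI and population density values below are made up.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def single_impute(target, predictor):
    """Fit on rows where target is present, then predict the missing targets."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, target)]


aoi = [100, 200, 300, 400, 250]            # hypothetical, fully observed
pop_density = [120, 220, 320, 420, None]   # hypothetical, one value missing
filled = single_impute(pop_density, aoi)
```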
Methodology #5
• Multiple Imputation: Use other predictor variables to model missing values
• Multiple imputations are created based on the distribution of residuals in the estimates of missing values
Multiple Imputation Process
1. Choose starting values for the mean and covariance matrix of the predictor variables
2. Use the mean and covariance matrix to estimate regression parameters
3. Use the regression parameters to estimate missing values, adding a random draw from the residual normal distribution for that variable
4. Use the resulting data set to compute a new mean and covariance matrix
5. Make a random draw from the posterior distribution of the means and covariances
6. Using the random draw from step 5, go back to step 2 and cycle through the process until convergence is achieved
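The loop below is a deliberately simplified sketch of this cycle for two variables, one fully observed and one partly missing. It starts from a mean fill, re-estimates a regression on the completed data, and re-imputes each missing value as the prediction plus a random normal residual. It omits the posterior draw of the means and covariances, so it illustrates the cycling idea rather than the full algorithm. The data, function names, iteration count and seed are all assumptions.

```python
import math
import random


def fit_line(xs, ys):
    """Least squares intercept, slope and residual standard deviation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    return intercept, slope, math.sqrt(sse / (n - 2))


def impute_chain(x, v, n_iter, rng):
    """Run one chain and return the completed copy of v after n_iter cycles."""
    missing = [i for i, val in enumerate(v) if val is None]
    observed = [val for val in v if val is not None]
    start = sum(observed) / len(observed)          # starting value: observed mean
    filled = [start if val is None else val for val in v]
    for _ in range(n_iter):
        intercept, slope, sd = fit_line(x, filled)  # re-estimate on completed data
        for i in missing:                           # prediction plus random residual
            filled[i] = intercept + slope * x[i] + rng.gauss(0.0, sd)
    return filled


def multiple_impute(x, v, m=5, n_iter=50, seed=0):
    """Return m completed data sets, each from its own chain."""
    rng = random.Random(seed)
    return [impute_chain(x, v, n_iter, rng) for _ in range(m)]


x = [1, 2, 3, 4, 5, 6]                    # hypothetical, fully observed
v = [2.1, 3.9, None, 8.2, None, 12.1]     # hypothetical, two values missing
datasets = multiple_impute(x, v)
```

Each completed data set would then be modeled separately and the results combined.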
Multiple Imputation Process
• Assumptions underlying multiple imputation algorithms
  • Data is Missing At Random: missingness of predictor variable “V” cannot depend on the value of “V” but can depend on the values of other predictor variables
  • Data follows a multivariate normal distribution
• Two issues that must be addressed
  • Initial convergence of iterations
  • Correlation of consecutive iterations
Time Series Plot
• Initial convergence is assessed via a time series plot
Autocorrelation Plot
[Figure: ACF (0.0 to 1.0) plotted against lag (0 to 100)]
• Spread between iterations is assessed via an autocorrelation plot
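The quantity on the vertical axis of such a plot can be computed as in the sketch below, which uses the usual sample autocorrelation. The toy alternating series stands in for a parameter trace across iterations and is chosen only because its values are easy to check by hand.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 0 through max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    c0 = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / c0
        for lag in range(max_lag + 1)
    ]


trace = [1, -1, 1, -1, 1, -1, 1, -1]   # toy series, not a real imputation trace
acf = autocorrelation(trace, 2)
```

A lag at which the autocorrelation has decayed close to zero suggests how many iterations to leave between imputations so that they are approximately independent.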
Testing of Missing Value Methods
• Method #1
  • Created training and holdout data sets
  • Both contained missing data
  • Built models of claim frequency under different missing value analysis methods with the training data set
  • Identical predictor variables in all models
  • Compared results (deviance) of methods in the data set where all data is present
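The slide does not state which deviance was used. For claim frequency a Poisson deviance is a common choice, so the sketch below assumes it; the claim counts and the two sets of predictions are hypothetical.

```python
import math


def poisson_deviance(observed, predicted):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), with y * ln(y) = 0 at y = 0."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total


claims = [0, 1, 0, 2]                    # hypothetical claim counts
pred_method_a = [0.2, 0.9, 0.3, 1.6]     # hypothetical predictions under one method
pred_method_b = [0.7, 0.7, 0.7, 0.7]     # hypothetical predictions under another

dev_a = poisson_deviance(claims, pred_method_a)
dev_b = poisson_deviance(claims, pred_method_b)
```

On the same data, the method with the lower deviance fits better.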
Testing of Missing Value Options
• Method #2
  • Created a model of missing probability
  • Limited modeling database to observations in which all data was present
  • Randomly generated missing values based on missing probability
  • 100 iterations
  • Built models of claim frequency under different missing value analysis methods
  • Identical predictor variables in all models
  • Compared results (deviance) of methods in data set where all data is present
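The random generation step might look like the sketch below. The per-row missing probabilities stand in for the output of the missing-probability model and are hypothetical, as are the column values and the seed.

```python
import random


def mask_values(column, missing_prob, rng):
    """Return a copy of column with each entry set to None with its own probability."""
    return [None if rng.random() < p else v for v, p in zip(column, missing_prob)]


rng = random.Random(0)
column = [125, 235, 240, 350, 100, 110]          # fully observed values
missing_prob = [0.1, 0.5, 0.2, 0.9, 0.0, 1.0]    # hypothetical model output

# One masked copy per iteration, 100 iterations as on the slide.
replicates = [mask_values(column, missing_prob, rng) for _ in range(100)]
```

Because the true values are known for every masked entry, each missing value method can be scored against the complete data.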
Performance of Missing Value Methods
• Single Imputation/Multiple Imputation
• Linear Mean Imputation
• Mean Imputation
• Listwise Deletion
Missing Data Framework
• Questions
  • What is the level of missing data?
  • What can be inferred about the missing data mechanism?
  • What is the size of the modeling database in which all values are present?
  • Will the data continue to be missing when the model is applied?
Missing Data Framework
• Actions
  • For low proportions of missing data: Listwise Deletion
  • For higher proportions of missing data in a large modeling database: Listwise Deletion with oversampling
  • For mid to small modeling databases: employ imputation
    • Initial exploration with Linear Mean Imputation
    • Fit final model with Single Imputation or Multiple Imputation
Sources
• Splines
  • Hastie, Tibshirani and Friedman: The Elements of Statistical Learning
• Missing Data
  • Paul Allison: Missing Data
  • J.L. Schafer: Analysis of Incomplete Multivariate Data
  • Insightful Corporation: Analyzing Data with Missing Values in S-Plus