Building useful models: Some new developments and easily avoidable errors Michael Babyak, PhD
What is a model? • Y = f(x1, x2, x3, …, xn) • Y = a + b1x1 + b2x2 + … + bnxn • Y = e^(a + b1x1 + b2x2 + … + bnxn)
“All models are wrong, some are useful” -- George Box • A useful model is • Not very biased • Interpretable • Replicable (predicts in a new sample)
Some Premises • “Statistics” is a cumulative, evolving field • Newer is not necessarily better, but should be entertained in the context of the scientific question at hand • Data analytic practice resides along a continuum, from exploratory to confirmatory. Both are important, but the difference has to be recognized. • There’s no substitute for thinking about the problem
Statistics is a cumulative, evolving field: How do we know this stuff? • Theory • Simulation
Concept of Simulation • Specify a population model: Y = bX + error • Draw many samples and estimate b in each: bs1, bs2, bs3, bs4, …, bsk-1, bsk • Evaluate how the estimates behave across samples
Simulation Example • True model: Y = .4X + error • Draw many samples, estimate b in each (bs1, bs2, …, bsk), and evaluate the estimates against the known true value of .4
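The simulation idea on these slides can be sketched in a few lines of code. This is an illustrative mock-up, not material from the talk: the function name, sample size, and number of replications are my own choices.

```python
import random
import statistics

def simulate_slopes(true_b=0.4, n=100, k=1000, seed=1):
    """Draw k samples from y = true_b*x + error, estimate the slope in
    each, and return the k estimates (the sampling distribution)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [true_b * x + rng.gauss(0, 1) for x in xs]
        # Ordinary least-squares slope: cov(x, y) / var(x)
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        estimates.append(sxy / sxx)
    return estimates

bs = simulate_slopes()
# The mean of the k estimates lands close to the true 0.4,
# and their spread shows the sampling variability of the estimator.
print(round(statistics.fmean(bs), 2), round(statistics.stdev(bs), 2))
```

Evaluating the pile of estimates against the known truth (bias, spread, coverage) is exactly the "Evaluate" step in the diagram.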
Ingredients of a Useful Model • Correct probability model • Based on theory • Good measures/no loss of information • Comprehensive • Parsimonious • Tested fairly • Flexible
Correct Model • Gaussian: General Linear Model • Multiple linear regression • Binary (or ordinal): Generalized Linear Model • Logistic Regression • Proportional Odds/Ordinal Logistic • Time to event: • Cox Regression or parametric survival models
Generalized Linear Model • Normal outcome: General Linear Model/Linear Regression (ANOVA/t-test, ANCOVA, regression with transformed DV) • Binary/Binomial outcome: Logistic Regression (chi-square) • Count, heavy skew, lots of zeros: Poisson, ZIP, negative binomial, gamma • Can be applied to clustered (e.g., repeated measures) data
Factor Analytic Family • Structural Equation Models • Partial Least Squares • Latent Variable Models (Confirmatory Factor Analysis) • Common Factor Analysis • Principal Components • Multiple regression
Use Theory • Theory and expert information are critical in helping sift out artifact • Numbers can look very systematic when they are in fact random • http://www.tufts.edu/~gdallal/multtest.htm
Measure well • Adequate range • Representative values • Watch for ceiling/floor effects
Using all the information • Preserving cases in data sets with missing data • Conventional approaches: • Use only complete cases • Fill in with the mean or median • Use a missing-data indicator in the model
Missing Data • Imputation or related approaches are almost ALWAYS better than deleting incomplete cases • Multiple Imputation • Full Information Maximum Likelihood
Modern Missing Data Techniques • Preserve more information from original sample • Incorporate uncertainty about missingness into final estimates • Produce better estimates of population (true) values
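As a rough illustration of why preserving incomplete cases pays off, here is a minimal sketch using stochastic regression imputation, a simplified cousin of the multiple-imputation idea (real MI software also pools between-imputation uncertainty into standard errors via Rubin's rules). All names and numbers below are illustrative assumptions, not from the talk.

```python
import random
import statistics

def slope(pairs):
    """OLS slope predicting the second element of each pair from the first."""
    mx = statistics.fmean(p[0] for p in pairs)
    my = statistics.fmean(p[1] for p in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    return sxy / sxx

rng = random.Random(2)
n = 200

# Toy data: y = .5x + noise, with about a third of the x values
# missing completely at random.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
x_obs = [xi if rng.random() > 1 / 3 else None for xi in x]

# Complete-case analysis: every row with a missing x is discarded.
complete = [(xi, yi) for xi, yi in zip(x_obs, y) if xi is not None]

# Stochastic regression imputation, repeated m times and pooled:
# regress x on y in the complete cases, fill each gap with the
# prediction plus a random residual draw, re-estimate, and average.
b = slope([(yi, xi) for xi, yi in complete])          # x-on-y slope
a = (statistics.fmean(xi for xi, _ in complete)
     - b * statistics.fmean(yi for _, yi in complete))
resid_sd = statistics.stdev(xi - (a + b * yi) for xi, yi in complete)

m_estimates = []
for _ in range(20):
    filled = [(xi if xi is not None else a + b * yi + rng.gauss(0, resid_sd), yi)
              for xi, yi in zip(x_obs, y)]
    m_estimates.append(slope(filled))
pooled = statistics.fmean(m_estimates)
```

The pooled estimate uses all n cases and stays close to the true slope of .5, while the complete-case analysis silently discards roughly a third of the sample.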
Don’t waste information from variables • Use all the information about the variables of interest • Don’t create “clinical cutpoints” before modeling • Model with ALL the data first, then use prediction to make decisions about cutpoints
Dichotomizing for Convenience = Dubious Practice (C.R.A.P.*) • *Convoluted Reasoning and Anti-intellectual Pomposity • Streiner & Norman, Biostatistics: The Bare Essentials
Implausible measurement assumption • [Figure: persons A, B, and C placed along a continuous depression score that has been cut into “not depressed” vs. “depressed” groups]
Loss of power • Sometimes, through sampling error, you can get a “lucky cut” • http://psych.colorado.edu/~mcclella/MedianSplit/ • http://www.bolderstats.com/jmsl/doc/medianSplit.html
Dichotomization, by definition, reduces the magnitude of the estimate by a minimum of about 30% • Dear Project Officer, In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about three or four hundred thousand dollars’ worth of subject recruitment and testing money, we are confident that you will understand. Sincerely, Dick O. Tomi, PhD (Prof. Richard Obediah Tomi, PhD)
Power to detect a non-zero b-weight when x is continuous versus dichotomized • True model: y = .4x + e
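The attenuation behind this power loss is easy to reproduce with a quick simulation. This is a sketch under assumed parameters (the sample size, seed, and `corr` helper are my own): with a true model y = .4x + e, a median split shrinks the observed correlation by roughly the classic factor of about 0.8, meaning roughly a third of the explained variance is thrown away.

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)
n = 10_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.4 * xi + rng.gauss(0, 1) for xi in x]

# Median-split x into a 0/1 indicator and compare effect sizes.
cut = statistics.median(x)
x_dich = [1 if xi > cut else 0 for xi in x]

r_cont = corr(x, y)       # population value is about .37 here
r_dich = corr(x_dich, y)  # attenuated by a factor of roughly 0.8
print(round(r_cont, 2), round(r_dich, 2))
```

A smaller effect size at the same n means lower power for the test of the dichotomized predictor.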
Dichotomizing will obscure non-linearity • [Figure: outcome plotted for Low vs. High CESD score groups]
Dichotomizing will obscure non-linearity: same data as the previous slide modeled continuously • [Figure: outcome plotted across the full range of CESD scores]
Type I error rates for the relation between x2 and y after dichotomizing two continuous predictors. Maxwell and Delaney calculated the effect of dichotomizing two continuous predictors as a function of the correlation between them. The true model is y = .5x1 + 0x2, where all variables are continuous. If x1 and x2 are dichotomized, the Type I error rate for the relation between x2 and y increases as the correlation between x1 and x2 increases.
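Maxwell and Delaney's point can be illustrated with partial correlations (a sketch, not their actual design; the sample size, the correlation of .7, and all names are my own assumptions): dichotomizing a correlated confounder leaves residual confounding, which shows up as a spurious association for the null predictor.

```python
import random
import statistics

def corr(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(xs, ys, zs):
    """Correlation between x and y with z partialed out of both."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

rng = random.Random(4)
n = 20_000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.7 * a + rng.gauss(0, (1 - 0.49) ** 0.5) for a in x1]  # corr(x1, x2) ~ .7
y = [0.5 * a + rng.gauss(0, 1) for a in x1]                   # x2 has NO true effect

m1, m2 = statistics.median(x1), statistics.median(x2)
d1 = [1 if a > m1 else 0 for a in x1]
d2 = [1 if a > m2 else 0 for a in x2]

# Adjusting for continuous x1, the x2-y relation is near zero, as it should be.
p_cont = partial_corr(x2, y, x1)
# Adjusting only for dichotomized x1, a spurious x2 "effect" appears,
# because the median split no longer fully controls x1.
p_dich = partial_corr(d2, y, d1)
print(round(p_cont, 3), round(p_dich, 3))
```

Run many times with significance tests, this residual association is what drives the inflated Type I error rates on the slide.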
Is it ever a good idea to categorize quantitatively measured variables? • Yes: • when the variable is truly categorical • for descriptive/presentational purposes • for hypothesis testing, if enough categories are used • However, using many categories can lead to problems of multiple significance tests and still runs the risk of misclassification
CONCLUSIONS • Cutting: • Doesn’t always make measurement sense • Almost always reduces power • Can fool you with too much power in some instances • Can completely miss important features of the underlying function • Modern computing/statistical packages can “handle” continuous variables • Want to make good clinical cutpoints? Model first, decide on cuts afterward.
Sample size and the problem of underfitting vs overfitting • Model assumption is that “ALL” relevant variables be included—the “antiparsimony principle” • Tempered by fact that estimating too many unknowns with too little data will yield junk
Sample Size Requirements • Linear regression: minimum of N = 50 + 8 per predictor (Green, 1990) • Logistic regression: minimum of 10–15 events per predictor among the smallest outcome group (Peduzzi et al., 1990a) • Survival analysis: minimum of 10–15 events per predictor (Peduzzi et al., 1990b)
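These rules of thumb are simple enough to encode as a quick calculator. The function names are my own; the formulas are the ones cited above (Green's N ≥ 50 + 8k, and the 10–15 events-per-predictor rule).

```python
def green_min_n(predictors: int) -> int:
    """Green (1990) rule of thumb for linear regression: N >= 50 + 8k."""
    return 50 + 8 * predictors

def min_events(predictors: int, per_predictor: int = 10) -> int:
    """Events-per-variable rule for logistic/survival models (Peduzzi et al.):
    at least 10-15 events, counted in the smaller outcome group, per predictor."""
    return per_predictor * predictors

print(green_min_n(5))   # 90
print(min_events(5))    # 50
```

So a 5-predictor linear regression wants at least 90 cases, and a 5-predictor logistic model wants at least 50 events in the rarer outcome group.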
Consequences of inadequate sample size • Lack of power for individual tests • Unstable estimates • Spurious good fit—lots of unstable estimates will produce spurious ‘good-looking’ (big) regression coefficients
All-noise, but good fit • [Figure: R-squares from a population model of completely random variables, plotted against the events-per-predictor ratio]
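A sketch of how such an all-noise simulation might look (pure-stdlib OLS via the normal equations; the specific n, p, and repetition counts are illustrative choices, not from the talk): with 10 random predictors and only 30 cases, the average R-square lands near p/(n-1) ≈ .34 even though nothing is real.

```python
import random

def r_squared(X, y):
    """R^2 from OLS of y on the columns of X (plus an intercept),
    solving the normal equations with Gaussian elimination."""
    n, p = len(X), len(X[0])
    Z = [[1.0] + row for row in X]          # prepend intercept column
    k = p + 1
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]                 # X'X
    c = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]  # X'y
    for col in range(k):                    # elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for cc in range(col, k):
                A[r][cc] -= f * A[col][cc]
            c[r] -= f * c[col]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):          # back substitution
        b[r] = (c[r] - sum(A[r][cc] * b[cc] for cc in range(r + 1, k))) / A[r][r]
    yhat = [sum(bj * zj for bj, zj in zip(b, Z[i])) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

rng = random.Random(5)
p, n, reps = 10, 30, 200
r2s = []
for _ in range(reps):
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]     # y is PURE noise
    r2s.append(r_squared(X, y))
print(round(sum(r2s) / reps, 2))  # near p/(n-1), despite zero true signal
```

As the n/p ratio grows, this spurious R-square shrinks toward zero, which is the pattern the slide's figure shows.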
Simulation: number of events per predictor • True model: Y = .5*x1 + 0*x2 + .2*x3 + 0*x4, where r(x1, x4) = .4 • N/p = 3, 5, 10, 20, 50
Peduzzi’s simulation: number of events per predictor • P(survival) = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC • Events/p = 2, 5, 10, 15, 20, 25 • % relative bias = ((estimated b – true b)/true b)*100
Approaches to variable selection • “Stepwise” automated selection • Pre-screening using univariate tests • Combining or eliminating redundant predictors • Fixing some coefficients • Theory, expert opinion and experience • Penalization/Random effects • Propensity Scoring • “Matches” individuals on multiple dimensions to improve “baseline balance” • Tibshirani’s “Lasso”
Any variable selection technique based on looking at the data first will likely be biased
“I now wish I had never written the stepwise selection code for SAS.” • --Frank Harrell, author of the forward and backward selection algorithms for SAS PROC REG
Automated Selection: Derksen and Keselman (1992) Simulation Study • Studied backward and forward selection • Some authentic variables and some noise variables among candidate variables • Manipulated correlation among candidate predictors • Manipulated sample size
Automated Selection: Derksen and Keselman (1992) Simulation Study • “The degree of correlation between candidate predictors affected the frequency with which the authentic predictors found their way into the model.” • “The greater the number of candidate predictors, the greater the number of noise variables were included in the model.” • “Sample size was of little practical importance in determining the number of authentic variables contained in the final model.”
Simulation results: number of noise variables included • [Figure: count of noise variables entering the final model, by sample size; 20 candidate predictors, 100 samples]
Simulation results: R-square from noise variables • [Figure: R-square attributable to noise variables, by sample size; 20 candidate predictors, 100 samples]
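A related effect is easy to demonstrate for univariate pre-screening, one of the selection approaches listed earlier (a sketch, not Derksen and Keselman's actual design; all parameters are illustrative): screening 20 pure-noise candidates at p < .05 lets about one noise variable per model through, before any stepwise machinery even starts.

```python
import random

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(6)
n, p, reps = 100, 20, 100

counts = []
for _ in range(reps):
    y = [rng.gauss(0, 1) for _ in range(n)]       # outcome is pure noise
    kept = 0
    for _ in range(p):
        x = [rng.gauss(0, 1) for _ in range(n)]   # candidate is pure noise
        r = corr(x, y)
        # t test for a correlation: t = r * sqrt((n-2)/(1-r^2))
        t = abs(r) * ((n - 2) / (1 - r * r)) ** 0.5
        if t > 1.98:                              # ~ two-sided .05 cutoff, df = 98
            kept += 1
    counts.append(kept)
avg_noise_kept = sum(counts) / reps
print(round(avg_noise_kept, 2))  # about 20 * .05 = 1 noise variable per model
```

Every screened-in noise variable then inflates the final model's apparent R-square, which is the pattern in the figures above.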
SOME of the problems with stepwise variable selection:
1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Statistics in Medicine).
4. It yields p-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman).
9. It allows us to not think about the problem.
10. It uses a lot of paper.