HSRP 734: Advanced Statistical Methods June 12, 2008.
General Considerations for Multivariable Analyses • Variable Selection • Residuals • Influence diagnostics • Multicollinearity
An Effective Modeling Cycle • (1) Preliminary Analysis • (2) Candidate Model Selection • (3) Assumption Validation • (4) Collinearity and Influential Observation Detection • (5) Model Revision (if revision is needed, return to step 2; if not, continue) • (6) Prediction Testing
Overview • Model building: applies outside of Logistic regression • Model diagnostics: specific to Logistic regression
Model selection • “Proper model selection rejects a model that is far from reality and attempts to identify a model in which the error of approximation and the error due to random fluctuations are well balanced.” - Shibata, 1989
Model building • Models are just that: approximating models of a truth • How best to quantify approximation? • Depends upon study goals (prediction, explanatory, exploratory)
Principle of Parsimony • “Everything should be made as simple as possible, but no simpler.” – Albert Einstein • Choose a model with “the smallest # of parameters for adequate representation of the data.” – Box & Jenkins
Principle of Parsimony • Bias vs. Variance trade-off as # of variables/parameters increases • Collect sample to learn about population (make inference) • Models are just that: approximating models of a truth • Balance errors of underfitting and overfitting
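The underfitting/overfitting balance can be made concrete with a small simulation. This is an illustrative sketch (simulated data, numpy, arbitrary polynomial degrees), not an example from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# The true mechanism is simple: y = 2x + noise
n = 20
x = np.linspace(0, 1, n)
y = 2 * x + rng.normal(0, 0.3, n)

# A fresh sample from the same mechanism, for honest out-of-sample error
x_new = np.linspace(0, 1, 50)
y_new = 2 * x_new + rng.normal(0, 0.3, 50)

train_mse, test_mse = {}, {}
for degree in (1, 3, 5):
    coefs = np.polyfit(x, y, degree)
    train_mse[degree] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_mse[degree] = np.mean((y_new - np.polyval(coefs, x_new)) ** 2)

# Training error can only decrease as parameters are added, so it rewards
# overfitting; out-of-sample error is what actually penalizes the extra terms.
print(train_mse)
print(test_mse)
```

Because the training error of nested least-squares fits is monotone in the number of parameters, it cannot by itself balance the two kinds of error; some penalty or out-of-sample check is needed.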
Why include multiple predictors in a model? • Interaction (effect modification) • Confounding • Increase precision (reduce unexplained variance) • Method of adjustment • Exploratory for unknown correlates
Interpreting Coefficients • When there is more than 1 variable in the model, the interpretation changes • Continuous: “β1: For a unit change in X, there is a β1 change in Y, adjusting for the other variables in the model.”
Relationship between Variables: Exposure (X), Disease (Y), Third Variable (Z) • Two main complications: • Confounding (a bias) • Interaction / Effect Modification (useful information)
Interaction vs. Confounding • Confounding is a BIAS we want to REMOVE • Interaction is a PROPERTY we want to UNDERSTAND • Confounding • Apparent relationship of X (exposure of interest) with Y is distorted due to the relationship of Z (confounder) with X (and Y) • Interaction • Relationship between X and Y differs by the level of Z (when X and Z interact)
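A quick simulation makes the confounding distortion concrete. The variable roles and effect sizes below are invented for illustration (simulated data, numpy):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Z is a confounder: it drives both the exposure X and the outcome Y
z = rng.normal(size=n)
x = z + rng.normal(size=n)                   # exposure depends on Z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # the true effect of X on Y is 1.0

# Crude (unadjusted) slope of Y on X is distorted by Z
cov = np.cov(x, y)
crude = cov[0, 1] / cov[0, 0]

# Adjusting for Z (multiple regression) recovers the true effect
X = np.column_stack([np.ones(n), x, z])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Crude slope is pulled toward 2 by the X-Z and Y-Z relationships;
# the adjusted slope is near the true value of 1.
print(round(crude, 2), round(adjusted, 2))
```

The crude slope here is biased away from the truth even though nothing is wrong with the data; the distortion comes entirely from omitting Z.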
Model building • Science vs. Art • Different philosophies • Some agreement on what is worse • Not many agree on a best approach
Model building: Two approaches • Data-based approach • Non-data based
How do you decide what predictor variables to include? Well, what is the goal?
Selecting Predictor Variables • Different goals of analyses: • Estimate and test a treatment group effect • Explore which of a set of predictors are associated with an outcome • Maximize the variation explained/Best predict an outcome
Rule of Model Parsimony • Include just enough variables and no more. • Use a smaller number of variables if they accomplish the same goal (about the same c statistic or precision in the treatment effect)
Variable Selection • Mechanics: • Automatic selection based on p-values • Select based on AIC or BIC • Select based on predictive ability • Select based on theoretical or prior literature considerations • Select based on changes in treatment group effects (confounding, interaction, precision)
Data-based: Using p-values • Popular (Remember Johnny from Cobra Kai?) • Selection methods: Forward, Backwards, Stepwise • Bivariate screening, then multivariable on those initially significant
Automatic Selection • Select predictor variables based on p-value cutoff rules • The cutoff is not necessarily set at p < 0.05 • Three types: Forward, Backwards, Stepwise
Forward Selection • Start off with no predictors in the model • First, add the most significant variable with p < pcutoff • Next, add the most significant variable with p < pcutoff, given a model with the 1st variable already included • Stop when no additional variables have p < pcutoff
Backwards Elimination • Start off with all the predictors in the model • First, remove the least significant variable with p > pcutoff • Next, remove the least significant variable with p > pcutoff, given a reduced model with the 1st variable already removed • Stop when no remaining variables have p > pcutoff
Stepwise Selection • Start with no predictors in the model • 1st step: Proceed with Forward Selection • 2nd step: Add in the most significant variable with p<pFcutoff or remove the least significant variable in the model with p>pBcutoff • Continue until there are no more predictors with p<pFcutoff or p>pBcutoff
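The forward algorithm above can be sketched in a few lines. This is an illustrative sketch only: OLS with normal-approximation p-values stands in for whatever model is actually being fit, and the data and variable names are simulated:

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """OLS fit; two-sided p-values from a normal approximation to the t-test."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return [math.erfc(abs(z) / math.sqrt(2)) for z in beta / se]

def forward_select(predictors, y, p_enter=0.05):
    """At each step, add the candidate with the smallest p-value, provided
    it is below the entry cutoff; stop when no candidate qualifies."""
    n = len(y)
    selected, remaining = [], list(predictors)
    while remaining:
        best_name, best_p = None, 1.0
        for name in remaining:
            cols = [np.ones(n)] + [predictors[v] for v in selected + [name]]
            p = ols_pvalues(np.column_stack(cols), y)[-1]  # p-value of the new term
            if p < best_p:
                best_name, best_p = name, p
        if best_p >= p_enter:
            break
        selected.append(best_name)
        remaining.remove(best_name)
    return selected

rng = np.random.default_rng(1)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.5 * x1 + 1.0 * x2 + rng.normal(size=n)   # x3 is pure noise

selected = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected)
```

Backwards elimination and stepwise selection are small variations on the same loop: start from the full model and drop terms with p > pcutoff, or alternate add and drop steps with separate entry and removal cutoffs.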
Criticisms of P-value based Model Building • Automates the process rather than incorporating substantive thinking about the problem • Multiple comparisons issue • If multicollinearity is present, selection among the collinear variables is essentially arbitrary • β’s and SEβ’s are biased (Harrell Jr., 2001) • Test statistics don’t have the right distribution (Grambsch & O’Brien, 1991)
Selection methods using p-values • If using these methods, some preference is given to Backwards Elimination • Some evidence of it performing better than Forward Selection (Mantel, 1970) • At least the initial full model is accurate
Non P-value based Methods • Theoretical Considerations • Prior Literature Considerations • Information Criteria: AIC, BIC
Theoretical Considerations • Adjust for theoretically associated predictors, regardless of p-value • One line of logic is, why would you want to examine the association of P with outcome, without adjusting for T?
Prior Literature Considerations • Adjust for predictors in prior literature, regardless of p-value • One line of logic is, why would you want to examine the association of P with outcome, without adjusting for L? • Example: Outcome=Survival, P=New treatment, L=Patient age
Information Criteria: AIC, BIC • Use non p-value based criteria that maximize the relative fit of competing models • Complex theoretical motivation for IC • Can be used for complicated modeling: non-nested models, functional form of same predictors, etc.
Data-based: Using AIC • AIC is an unbiased estimator of the theoretical distance from a model to the unknown true mechanism that actually generated the data • How is this so??? • If you are really curious…
Data-based: Using AIC • Useful for selecting the best model out of a candidate model set (not great if all candidates are poor) • The absolute size of an AIC value is not meaningful; only its size relative to the other AICs matters • Models need not be nested, but must be fit to the same data (same sample size) (Burnham & Anderson, 2002)
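A minimal sketch of AIC-based selection among candidate models. The candidates, data, and effect sizes are invented; the Gaussian-likelihood AIC formula for OLS is used, dropping constants shared by all candidates (which cancel in comparisons):

```python
import numpy as np

def ols_aic(X, y):
    """Gaussian-likelihood AIC for an OLS fit: n*log(RSS/n) + 2*k.
    Additive constants common to all candidate models are dropped."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(7)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 * x1 + 0.8 * x2 + rng.normal(size=n)   # x3 is irrelevant

ones = np.ones(n)
candidates = {
    "x1 only":      np.column_stack([ones, x1]),
    "x1 + x2":      np.column_stack([ones, x1, x2]),
    "x1 + x2 + x3": np.column_stack([ones, x1, x2, x3]),
}
aics = {name: ols_aic(X, y) for name, X in candidates.items()}
best = min(aics, key=aics.get)
print(best)
```

Note that all three candidates use the same n and the same y, as the slide requires; only the differences between the AIC values, not their absolute sizes, drive the choice.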
Treatment Effect Approach • Adjust for any and all confounders and effect-modifiers (interactions), regardless of p-values • From theoretical & prior literature • Goal is to get most accurate and precise estimate of treatment effect
Model Building for Treatment Effect Goal • Omitting important confounders or interactions can obscure the picture of the outcome-exposure relationship
Still will consider Parsimony • If we include many covariates (not confounders or interactions) perhaps some will only add “noise” to model • Noise added could obscure picture of outcome-exposure relationship
Data-based: Prediction goal • When Parsimony matters: find the most accurate model that is also most parsimonious (smallest # of predictors) • When it doesn’t matter: pure accuracy is the goal, at any cost • Example: Quality control • Plausible but not typical
Best Predictive Model Approach • Adjust for any predictors that non-trivially increase c statistic (trivial is subject specific) • P-values are not considered • Goal is to maximize predictive ability of model • Future prediction is utmost; “Manage what you measure”
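The c statistic itself is simple to compute by hand: it is the fraction of case/control pairs in which the case received the higher predicted probability, with ties counted one-half. A small sketch with made-up predicted probabilities:

```python
from itertools import product

def c_statistic(case_scores, control_scores):
    """Concordance (c) statistic: the fraction of case/control pairs in
    which the case has the higher predicted probability (ties count 1/2)."""
    concordant, pairs = 0.0, 0
    for case, control in product(case_scores, control_scores):
        pairs += 1
        if case > control:
            concordant += 1
        elif case == control:
            concordant += 0.5
    return concordant / pairs

# Hypothetical predicted probabilities for events (cases) and non-events
cases = [0.9, 0.8, 0.4]
controls = [0.3, 0.5]
print(c_statistic(cases, controls))  # 5 of the 6 pairs are concordant
```

A model that assigns probabilities at random has c near 0.5; c = 1 means every event outranks every non-event, which is why "non-trivially increases the c statistic" is a natural entry rule under this goal.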
Book on Model building • Chapters 6, 7 • Basically takes the approach of trying to accurately establish the outcome-exposure relationship
Book recommendations • Multistage strategy: • Determine variables under study from research literature and/or that are clinically or biologically meaningful • Assess interaction prior to confounding • Assess for confounding • Additional considerations for precision
Book recommendations • Use backwards elimination of modeling terms • Retain lower-order terms if higher-order terms are significant: • Keep both variables if their 2-way interaction is significant • Keep lower power terms if the highest power is significant
Model building • We will focus on treatment effect goal • Will consider book guidelines
Note about Model Building • Differences between “Best” model and nearest competitors may be small • Ordering among “Very Good” models may not be robust to independent challenges with new data
Note about Model Building • Be careful not to overstate importance of variables included in “Best” model • Remember that “Best” model odds ratios & p-values tend to be biased away from the null • Cross-validation approaches allow estimation of prediction errors associated with variable selection and also provide comparisons between sets of best models
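The cross-validation idea in the last bullet can be sketched as follows. This is a minimal k-fold example for an OLS model on simulated data; the fold count, seed, and effect sizes are arbitrary choices for illustration:

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """k-fold cross-validated mean squared prediction error for OLS:
    fit on k-1 folds, score on the held-out fold, average over folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(3)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)   # noise variance is 1

X = np.column_stack([np.ones(n), x1, x2])
cv_mse = kfold_cv_mse(X, y)
print(round(cv_mse, 2))  # estimates out-of-sample error (near the noise variance here)
```

Because each observation is predicted by a model that never saw it, this error estimate is not biased downward by the selection process the way in-sample fit statistics of a "Best" model are.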
After selecting a model • Want to check modeling fit and diagnostics to ensure adequacy • Could be worried about: • Influential data points • Correlated predictor variables • Leaving out variables or using wrong form • Overall model fit and prediction value
Problems to check for • Convergence problems • Model goodness-of-fit • Functional form (confounding, interaction, higher order for continuous) • Multicollinearity • Outlier effects
Convergence problems • SAS usually converges, but sometimes you will get a message: “There is possibly a quasi-complete separation in the sample points. The ML estimate may not exist. Validity of the model fit is questionable.”
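What separation does to the fit can be seen directly. The sketch below is a hand-built Newton-Raphson iteration in numpy (not SAS's actual algorithm) on a tiny, completely separated data set; the slope estimate keeps growing instead of converging, because the likelihood improves all the way out to infinity:

```python
import numpy as np

# Complete separation: every x < 0 has y = 0 and every x > 0 has y = 1,
# so the logistic MLE for the slope does not exist (it diverges).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

beta = 0.0
path = [beta]
for _ in range(15):                        # Newton-Raphson on the slope only
    p = 1 / (1 + np.exp(-beta * x))        # fitted probabilities
    gradient = np.sum((y - p) * x)         # score
    hessian = np.sum(p * (1 - p) * x ** 2) # observed information
    beta += gradient / hessian
    path.append(beta)

# Each iteration improves the likelihood slightly while pushing beta further out
print([round(b, 1) for b in path])
```

This never-settling sequence is exactly why SAS stops and prints the warning above: the estimate wanders off rather than converging, and any reported coefficients and standard errors from the last iteration are untrustworthy.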