1. HSRP 734: Advanced Statistical Methods, June 12, 2008
2. General Considerations for Multivariable Analyses
3. An Effective Modeling Cycle
4. Overview Model building: principles apply beyond logistic regression
Model diagnostics: specific to logistic regression
5. Model Building
6. Model selection “Proper model selection rejects a model that is far from reality and attempts to identify a model in which the error of approximation and the error due to random fluctuations are well balanced.”
- Shibata, 1989
7. Model building Models are just that: approximating models of a truth
How best to quantify approximation?
Depends upon study goals (prediction, explanatory, exploratory)
8. Principle of Parsimony “Everything should be made as simple as possible, but no simpler.” – Albert Einstein
Choose a model with “the smallest # of parameters for adequate representation of the data.” – Box & Jenkins
9. Principle of Parsimony Bias vs. Variance trade-off as # of variables/parameters increases
Collect sample to learn about population (make inference)
Models are just that: approximating models of a truth
Balance errors of underfitting and overfitting
10. Why include multiple predictors in a model? Interaction (effect modification)
Confounding
Increase precision (reduce unexplained variance)
Method of adjustment
Exploratory for unknown correlates
11. Interpreting Coefficients When you have more than one variable in the model, the interpretation is different
Continuous: “β1: For a unit change in X, there is a β1 change in Y, adjusting for the other variables in the model.”
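In the logistic setting specifically, β1 is a change in the log odds, so the adjusted effect is usually reported as an odds ratio; a minimal worked form with two predictors:

\[ \operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]

so \( e^{\beta_1} \) is the adjusted odds ratio for a one-unit increase in X1, holding X2 fixed.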
12. Relationship between Variables
13. Interaction vs. Confounding Confounding is a BIAS we want to REMOVE
Interaction is a PROPERTY we want to UNDERSTAND
Confounding
Apparent relationship of X (exposure of interest) with Y is distorted due to the relationship of Z (confounder) with X (and Y)
Interaction
Relationship between X and Y differs by the level of Z (when X and Z interact)
14. Model building Science vs. Art
Different philosophies
Some agreement on what is worse
Not many agree on a best approach
15. Model building: Two approaches Data-based approach
Non-data-based approach
16. How do you decide what predictor variables to include?
17. Selecting Predictor Variables
18. Rule of Model Parsimony
19. Variable Selection
20. Data-based: Using p-values Popular
(Remember Johnny from Cobra Kai?)
Selection methods:
Forward, Backwards, Stepwise
Bivariate screening, then multivariable on those initially significant
21. Automatic Selection
22. Forward Selection
23. Backwards Elimination
24. Stepwise Selection
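A minimal SAS sketch of the three automatic methods in PROC LOGISTIC; the dataset chd_study, its variables, and the 0.10 thresholds are hypothetical and illustrative only:

  /* Forward: variables enter one at a time while p < SLENTRY */
  proc logistic data=chd_study descending;
    model chd = age chol sbp smoke / selection=forward slentry=0.10;
  run;

  /* Backward: start from the full model, remove terms while p > SLSTAY */
  proc logistic data=chd_study descending;
    model chd = age chol sbp smoke / selection=backward slstay=0.10;
  run;

  /* Stepwise: forward entry with a backward re-check at each step */
  proc logistic data=chd_study descending;
    model chd = age chol sbp smoke / selection=stepwise slentry=0.10 slstay=0.10;
  run;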
25. Criticisms of P-value based Model Building Automates the process rather than incorporating substantive thinking about the problem
Multiple comparisons issue
If multicollinearity is present, selection among collinear variables is made arbitrarily
β’s and SEβ’s are biased (Harrell Jr., 2001)
Test statistics don’t have the right distribution (Grambsch & O’Brien, 1991)
26. Selection methods using p-values If using these methods, some preference is given to backward elimination
There is some evidence it performs better than forward selection (Mantel, 1970)
At least the initial full model is accurate
27. Non P-value based Methods
28. Theoretical Considerations
29. Prior Literature Considerations
30. Information Criteria: AIC, BIC
31. Data-based: Using AIC AIC is an unbiased estimator of the theoretical distance from a model to the unknown true mechanism that actually generated the data
32. Data-based: Using AIC How is this so???
If you are really curious…
33. A Gross Simplification of AIC
34. Data-based: Using AIC Useful for selecting the best model out of a candidate model set (not great if all candidates are poor)
A single AIC value is not important in itself; what matters is its size relative to the other models’ AICs
Models need not be nested but must be fit to the same sample (Burnham & Anderson, 2002)
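For reference, the usual forms of the two criteria, with k the number of estimated parameters and n the sample size (smaller is better):

\[ \mathrm{AIC} = -2\ln L + 2k, \qquad \mathrm{BIC} = -2\ln L + k\ln n \]

Because only relative size matters, candidates are often ranked by \( \Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min} \). PROC LOGISTIC prints AIC and SC (the BIC) in its “Model Fit Statistics” table.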
35. Treatment Effect Approach
36. Model Building for Treatment Effect Goal If we omit important confounders or interactions, we can obscure the picture of the outcome-exposure relationship
37. Still will consider Parsimony If we include many covariates that are neither confounders nor part of interactions, some may only add “noise” to the model
That added noise can likewise obscure the picture of the outcome-exposure relationship
38. Data-based: Prediction goal When parsimony matters: find the most accurate model that is also the most parsimonious (smallest # of predictors)
When it doesn’t matter: pure accuracy is the goal, at any cost
Example: quality control
Plausible, but not the typical situation
39. Best Predictive Model Approach
40. Book on Model building Chapters 6, 7
The book basically takes the approach of trying to establish the outcome-exposure relationship accurately
41. Book recommendations Multistage strategy:
Determine variables under study from research literature and/or that are clinically or biologically meaningful
Assess interaction prior to confounding
Assess for confounding
Additional considerations for precision
42. Book recommendations Use backward elimination of modeling terms
Retain lower-order terms if higher-order terms are significant (see the sketch below):
Keep both variables if their 2-way interaction is significant
Keep lower-power terms if the highest power is significant
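A sketch of such a hierarchically well-formulated model in SAS (names hypothetical): if age*smoke or the quadratic age*age is retained, the lower-order terms stay in regardless of their own p-values.

  proc logistic data=chd_study descending;
    /* keep age and smoke while age*smoke is in the model;
       keep age while age*age is in the model */
    model chd = age smoke age*smoke age*age;
  run;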
43. Model building We will focus on the treatment effect goal
We will consider the book’s guidelines
44. Note about Model Building Differences between “Best” model and nearest competitors may be small
Ordering among “Very Good” models may not be robust to independent challenges with new data
45. Note about Model Building Be careful not to overstate importance of variables included in “Best” model
Remember that “Best” model odds ratios & p-values tend to be biased away from the null
Cross-validation approaches allow estimation of prediction errors associated with variable selection and also provide comparisons between sets of best models
46. SAS Lab: ICW
47. Model Diagnostics
48. After selecting a model Want to check model fit and diagnostics to ensure adequacy
Could be worried about:
Influential data points
Correlated predictor variables
Omitted variables or the wrong functional form
Overall model fit and predictive value
49. Problems to check for Convergence problems
Model goodness-of-fit
Functional form
(confounding, interaction, higher-order terms for continuous predictors)
Multicollinearity
Outlier effects
50. Convergence problems SAS usually converges, but sometimes you will get a message:
“There is possibly a quasicomplete separation in the sample points. The ML estimate may not exist. Validity of the model fit is questionable.”
51. Convergence problems Quasi-complete separation = occurs whenever there is complete separation except for a single value of the predictor
Complete separation = some linear combination of the predictors perfectly predicts the outcome
Problem is they’re too good!
Example: CHD=1 whenever Gender=Male
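A toy illustration of the slide’s example, with made-up data: every male has CHD = 1, so the gender effect separates the events, and PROC LOGISTIC should write the quasi-complete separation note to the log.

  /* Hypothetical data: CHD = 1 whenever gender = M; females are mixed */
  data sep_demo;
    input gender $ chd @@;
    datalines;
  M 1 M 1 M 1 M 1 F 1 F 0 F 0 F 0
  ;
  run;

  proc logistic data=sep_demo descending;
    class gender (param=ref ref='F');
    model chd = gender;  /* the ML estimate for gender does not exist */
  run;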
52. Quasi-complete separation Typically easy to diagnose. Why?
SAS prints a log warning.
SE’s are gigantic, OR’s or CI’s are extreme.
What to do about it?
53. Quasi-complete separation Options:
If continuous, create groups
If multi-group categorical, collapse groups
If dichotomous, group another way if possible
Drop variable
Drop cases from analyses
54. Diagnostics Model fit:
Hosmer & Lemeshow goodness of fit
c statistic (area under ROC curve)
Generalized R-square
Residual analyses:
Examine for outliers in X space (hii’s)
Examine for odd combinations of Y, X
Examine for influential points on β’s (on all coefficients or on specific ones)
55. Hosmer-Lemeshow Goodness-of-fit LACKFIT option in LOGISTIC
Generate predicted probabilities from the fitted model
Group the observations into g intervals (usually 10) based on the size of the predicted probabilities and compare observed to expected frequencies
Calculate a chi-square statistic with df = # of intervals - 2
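In SAS this is a single option on the MODEL statement; a minimal sketch with hypothetical names:

  /* LACKFIT prints the Hosmer-Lemeshow partition (usually 10 groups)
     and the chi-square test with df = # of groups - 2 */
  proc logistic data=chd_study descending;
    model chd = age chol sbp / lackfit;
  run;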
56. Considerations for H-L GOF test Is a conservative test
Low power to detect specific types of lack of fit (e.g., nonlinearity in a predictor variable)
Highly dependent on how the observations are grouped
Use caution in concluding the model is adequate when the p-value is large
57. Generalized R-square
58. Area under ROC curve: c statistic The Receiver Operating Characteristic (ROC) curve is a plot of the proportion of correctly predicted events (Sensitivity) against 1-proportion of correctly predicted non-events (1-Specificity)
The sharper the initial rise of the ROC curve, the better the model predicts
59. Area under ROC curve: c statistic The c statistic is the area under the ROC curve and quantifies predictive ability; it equals the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, so c = 0.5 is chance-level prediction
Examples for c (Ashton, 1995):
Good = 0.831
Bad = 0.493
60. [ROC curve figure: c = 0.696]
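PROC LOGISTIC prints c in its “Association of Predicted Probabilities and Observed Responses” table; a sketch (hypothetical names; the ROC plot assumes SAS 9.2 or later with ODS graphics):

  ods graphics on;
  proc logistic data=chd_study descending plots(only)=roc;
    /* OUTROC= also saves the (sensitivity, 1-specificity) coordinates */
    model chd = age chol sbp / outroc=roc_pts;
  run;
  ods graphics off;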
61. Multicollinearity Diagnosing multicollinearity is similar to what was done for regression
This is because it is a problem of the predictor variables
One approach: just use the VIFs from an analogous linear regression model
Better approach: weight by the predicted probabilities in an initial step (sketched below)
62. Multicollinearity
A VIF > 7 warrants attention
A VIF > 10 indicates multicollinearity
What do you do if you have it?
Combine variables in an index
Consider data reduction (e.g., PCA)
Drop variables
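A sketch of both approaches described above (names hypothetical); the weighted second run uses the information weight p(1 - p) evaluated at the predicted probabilities, one common choice:

  /* Crude approach: VIFs from an analogous linear regression */
  proc reg data=chd_study;
    model chd = age chol sbp / vif;
  run;

  /* Better approach: get predicted probabilities first... */
  proc logistic data=chd_study descending;
    model chd = age chol sbp;
    output out=pred p=phat;
  run;

  data pred;
    set pred;
    w = phat*(1 - phat);  /* logistic information weight */
  run;

  /* ...then a weighted linear regression for the VIFs */
  proc reg data=pred;
    weight w;
    model chd = age chol sbp / vif;
  run;
  quit;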
63. hii: extreme points in X space hii’s are the leverage values, the diagonal elements of the hat matrix
Observations with unusual combinations of predictor values show up as large hii’s
64. Deviance residuals: obs not explained well by model Deviance residuals can identify cases that are not explained well by the model
The sum of the squared deviance residuals is the deviance, D = -2lnL
Why not plot di vs. hii? (see the formula below and the plotting sketch after the C-bar slide)
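For a binary outcome the deviance residual has a standard closed form, stated here for reference:

\[ d_i = \operatorname{sign}(y_i - \hat{p}_i)\,\sqrt{-2\big[\,y_i\ln\hat{p}_i + (1-y_i)\ln(1-\hat{p}_i)\,\big]}, \qquad \sum_i d_i^2 = D = -2\ln L \]

A SAS sketch of the d_i versus h_ii plot follows the C-bar slide below.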
65. DFBETAs: influential points on β’s Measures how much each regression coefficient changes with the ith case deleted
The actual change is divided by the SEβ
If one case changes βk substantially, then that observation is highly influential
66. C-bar: confidence interval displacement Measure of the overall change in all the coefficients with the ith case deleted
Similar to Cook’s distance in linear regression
If one case changes the β’s substantially, then that observation is highly influential
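A minimal SAS sketch that saves all of the diagnostics above and draws the d_i versus h_ii plot (names hypothetical; the INFLUENCE option on the MODEL statement prints the same quantities, and SGPLOT assumes SAS 9.2 or later):

  proc logistic data=chd_study descending;
    model chd = age chol sbp;
    /* leverage, deviance residuals, DFBETAs for every coefficient,
       and the C-bar confidence interval displacement */
    output out=diag h=lev resdev=devres dfbetas=_all_ cbar=cbar;
  run;

  /* cases in the upper-right are both unusual in X and poorly fit */
  proc sgplot data=diag;
    scatter x=lev y=devres;
  run;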
67. SAS Lab: ICW
68. Looking ahead Extensions & Advanced methods
Review with Q&A
Exam 1: June 26th