Biostatistics Case Studies 2014 Session 4: Regression Models and Multivariate Analyses Youngju Pak, PhD. Biostatistician ypak@labiomed.org
What and Why? • Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once, in contrast to univariate or bivariate analysis. • Richer data and advances in computational technology => data reduction or classification • e.g., Factor Analysis, Principal Component Analysis (PCA) • Several variables are potentially correlated to some degree => potential confounding can bias the result • e.g., Analysis of Covariance (ANCOVA), Multiple Linear or Generalized Linear Regression Models
What and Why? • Many variables are interrelated, with multiple dependent and independent variables • e.g., Multivariate Analysis of Variance (MANOVA), Path Models, Structural Equation Models (SEM), Partial Least Squares (PLS) Models • This session will focus on multiple regression models.
Why regression models? • To reduce “random noise” in the data => better variance estimates, by accounting for additional sources of variability in the dependent variable • e.g., ANCOVA • To determine an optimal set of predictors => predictive models • e.g., variable selection procedures for multiple regression models • To adjust for potential confounding effects • e.g., regression models with covariates
Actual mathematical Models • ANOVA: $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$, where $Y_{ij}$ represents the jth observation (j = 1, 2, …, n) on the ith treatment (i = 1, 2, …, l levels). The errors $\epsilon_{ij}$ are assumed to be normally and independently distributed (NID), with mean zero and variance $\sigma^2$. • ANCOVA with k covariates: $Y_{ij} = \mu + \tau_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \dots + \beta_k X_{kij} + \epsilon_{ij}$ • MANOVA (with p outcome variables): $Y_{(n \times p)} = X_{(n \times [q+1])} B_{([q+1] \times p)} + E_{(n \times p)}$
Actual mathematical Models • Simple Linear Regression Models (SLR): $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\beta_0 + \beta_1 X_i = \mu_Y$ (the true mean value of Y) • $\varepsilon$ = “error” (random noise due to random sampling error), assumed to follow a normal distribution with mean = 0 and variance = $\sigma^2$ • $\beta_0$ & $\beta_1$ = intercept & slope, often called regression (or beta) coefficients • Y = Dependent Variable (DV) • X = Independent Variable (IV), e.g., Y = insulin sensitivity, X = fatty acid in percentage • Multiple Linear Regression Models (MLR) • Simple Logistic Models (SL) • Multiple Logistic Models (ML)
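As a rough illustration of how the intercept and slope of an SLR model are estimated by ordinary least squares, here is a minimal sketch in plain Python. The function name `fit_slr` and the data values are made up for illustration; they are not the insulin sensitivity data behind the SPSS output.

```python
def fit_slr(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # estimated slope
    b0 = mean_y - b1 * mean_x    # estimated intercept
    return b0, b1


# Hypothetical values, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.1, 6.9, 9.2, 10.8, 13.0]
b0, b1 = fit_slr(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The slope is read as the change in the mean of Y per one-unit increase in X, which is how the SPSS coefficient is interpreted.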
SLR: Example SPSS output • Two-sided p-value = 0.002. Thus, there is statistically significant evidence (alpha = 0.05) to conclude that the true slope is not zero => Fatty Acid (%) is significantly related to insulin sensitivity. • Mean insulin sensitivity increases by 37.208 units as Fatty Acid (%) increases by one percentage point.
Checking the assumptions using a residual plot • The plot should look “RANDOM”: no special pattern should appear if the assumptions are met.
Actual mathematical Models • Multiple Linear Regression Models (MLR): $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$, where $\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k = \mu_Y$ (the true mean value of Y) • Assumptions are the same as for SLR, with one addition: the Xs must not be highly correlated with each other. If they are, this is called “multicollinearity”, which makes the model very unstable. • Diagnosis for multicollinearity • Variance Inflation Factor (VIF) = 1 => OK • VIF < 5 => tolerable • VIF > 5 => problematic => remove the variable that has a high VIF, or do PCA • Multiple Linear Regression Models (MLR) • Simple Logistic Models (SL) • Multiple Logistic Models (ML)
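The VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the other predictors. With only two predictors, R_j^2 is simply the squared Pearson correlation between them, which allows a small plain-Python sketch. The function names, the data, and the handling of a VIF of exactly 5 (not specified on the slide) are assumptions made for illustration.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


def label_vif(vif):
    """Apply the slide's rule of thumb; VIF == 5 is treated as tolerable (assumption)."""
    if vif > 5:
        return "problematic"
    if math.isclose(vif, 1.0):
        return "OK"
    return "tolerable"


# Hypothetical, nearly collinear predictors
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 2.0, 3.0, 4.0, 6.0]
vif = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.1f} ({label_vif(vif)})")
```

With more than two predictors, each R_j^2 has to come from a full regression of X_j on the remaining Xs, which statistical packages report directly.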
MLR: Example • $\mu_Y = -56.935 + 1.634 X_1 + 0.249 X_2$, where $X_1$ = flexibility and $X_2$ = leg strength • 1.634 * Flexibility • For every 1 degree increase in flexibility, MEAN punt distance increases by 1.634 feet, adjusting for leg strength. • 0.249 * Strength • For every 1 lb increase in strength, MEAN punt distance increases by 0.249 feet, adjusting for flexibility.
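To make “adjusting for” concrete, the fitted equation above can be evaluated at two flexibility values while leg strength is held fixed. This is only a sketch: the coefficients come from the slide, but the function name and the input values (90 degrees, 150 lb) are hypothetical.

```python
def mean_punt_distance(flexibility, strength):
    """Estimated mean punt distance (feet) from the fitted MLR equation on the slide."""
    return -56.935 + 1.634 * flexibility + 0.249 * strength


# Hypothetical inputs: strength held constant, flexibility raised by 1 degree
base = mean_punt_distance(flexibility=90.0, strength=150.0)
plus_one = mean_punt_distance(flexibility=91.0, strength=150.0)
print(f"difference = {plus_one - base:.3f} feet")
```

Because strength is the same in both calls, the difference equals the flexibility coefficient, whatever strength value is chosen.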
What do we mean by “adjusted for”? • What if the covariates are categorical? • e.g., • Mean % gain without adjustment for gender • Exercise & Diet: (20% x 10 + 10% x 40) / 50 = 12% • Exercise only: (15% x 40 + 5% x 10) / 50 = 13% • Mean % gain with adjustment for gender • Exercise & Diet: Male avg. x 0.5 + Female avg. x 0.5 = 20% x 0.5 + 10% x 0.5 = 15% • Exercise only: Male avg. x 0.5 + Female avg. x 0.5 = 15% x 0.5 + 5% x 0.5 = 10%
Why different? • % gain for males is 10 percentage points higher than for females in both groups => potential confounding • However, the two groups are unbalanced in terms of gender, i.e., 80% male in the exercise-only group vs. 20% male in the exercise & diet group => this dilutes the “treatment effect” • For continuous covariates such as baseline age, a similar adjustment is performed based on the correlation between % gain and baseline age.
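The weighted-average arithmetic in the gender example can be checked with a short plain-Python sketch. The cell means and sample sizes are read off the slide's arithmetic (for example, 10 males and 40 females in the exercise & diet group); the dictionary layout and function names are illustrative assumptions.

```python
# (mean % gain, number of subjects) per gender within each group, from the slide
groups = {
    "Exercise & Diet": {"male": (0.20, 10), "female": (0.10, 40)},
    "Exercise only": {"male": (0.15, 40), "female": (0.05, 10)},
}


def crude_mean(cells):
    """Unadjusted mean: each gender is weighted by its own sample size."""
    total_n = sum(n for _, n in cells.values())
    return sum(mean * n for mean, n in cells.values()) / total_n


def adjusted_mean(cells, weights):
    """Adjusted mean: each gender gets the same fixed weight in every group."""
    return sum(cells[gender][0] * w for gender, w in weights.items())


equal_weights = {"male": 0.5, "female": 0.5}
for name, cells in groups.items():
    print(f"{name}: crude = {crude_mean(cells):.0%}, "
          f"adjusted = {adjusted_mean(cells, equal_weights):.0%}")
```

The crude means follow each group's gender mix, while the adjusted means compare the groups at a common 50/50 mix, which is why the ordering of the two groups reverses.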
Graphical illustration: Adjusting for a continuous covariate * Changes in adiponectin (a glucose-regulating protein) between two groups
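Multiple Logistic Regression Models • The model: $\text{logit}(\pi) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$, where $\pi = P(\text{event} = 1)$ and $\text{logit}(\pi) = \ln[\pi / (1 - \pi)]$ • or $\pi = e^{LP} / (1 + e^{LP})$, where $LP = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is the linear predictor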
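The two forms of the model above are inverses of each other, which a small plain-Python sketch can show. The function names, coefficients, and covariate values are hypothetical and chosen only for illustration.

```python
import math


def logit(p):
    """Log-odds: ln[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


def inv_logit(lp):
    """Inverse logit; algebraically equal to e^LP / (1 + e^LP)."""
    return 1.0 / (1.0 + math.exp(-lp))


def predicted_probability(betas, xs):
    """P(event = 1) for coefficients [b0, b1, ..., bk] and covariates [x1, ..., xk]."""
    lp = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return inv_logit(lp)


# Hypothetical coefficients and covariate values
betas = [-1.5, 0.8, 0.03]
xs = [1.0, 20.0]
p = predicted_probability(betas, xs)
print(f"P(event = 1) = {p:.3f}, logit = {logit(p):.3f}")
```

Applying `logit` to the predicted probability recovers the linear predictor, so the model is linear on the log-odds scale even though it is not linear on the probability scale.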
Interpretation of the coefficients in logistic regression models • For a continuous predictor, the exponentiated coefficient ($e^{\beta}$) represents the multiplicative change in the odds of Y = 1 for a one-unit increase in X, i.e., the odds ratio for X+1 vs. X. • Similarly, for a binary nominal predictor, $e^{\beta}$ represents the odds ratio for one group (X = 1) relative to the other (X = 0). • Remember, a multiple logistic model has other covariates. Hence, the interpretation of each coefficient applies with the other covariates adjusted for.
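The odds-ratio reading of a logistic coefficient can be verified numerically: the ratio of the odds at X+1 to the odds at X equals e^β no matter which X is used. The sketch below uses plain Python with hypothetical coefficients and a single predictor.

```python
import math


def odds(b0, b1, x):
    """Odds of Y = 1 at covariate value x under logit(pi) = b0 + b1 * x."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)


# Hypothetical coefficients, for illustration only
b0, b1 = -2.0, 0.4

ratio_low = odds(b0, b1, 6.0) / odds(b0, b1, 5.0)     # X = 5 -> 6
ratio_high = odds(b0, b1, 11.0) / odds(b0, b1, 10.0)  # X = 10 -> 11
print(f"OR (5 -> 6) = {ratio_low:.4f}")
print(f"OR (10 -> 11) = {ratio_high:.4f}")
print(f"exp(b1) = {math.exp(b1):.4f}")
```

In a model with several covariates the same ratio holds as long as the other covariates are kept at the same values in the numerator and the denominator.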
Other Models • Ordinal logistic regression for ordinal responses such as cancer stage I, II, III, IV: assumes the odds ratio is constant across all cut-points of the response (proportional odds). • Poisson regression when responses are count data such as the number of pregnancies: overdispersion is common, and sometimes a negative binomial distribution is used instead. • Mixed models: commonly used for repeated-measures ANOVA or ANCOVA. Time is used as a within-subject factor and a random factor. Mixed models are also used for nested designs. • Cox proportional hazards models: multivariate models for survival data.
General Linear Model vs. Generalized Linear Model (GLM) • A linear model => General Linear Model • e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model • A nonlinear model => Generalized Linear Model • e.g., Logistic, Ordinal Logistic, Poisson. All of these use a link function for the response variable (Y), such as a logit link (logistic) or a log link (Poisson). • GEE (Generalized Estimating Equation) models are an extension of GLM.
Variable Selection Procedures • Forward • Add the new predictor that has the lowest p-value, and keep repeating this step until no more predictors can be added at the 0.05 alpha level • Backward • Start with a full model containing all predictors, eliminate the predictor with the highest p-value, and keep repeating this step until no more predictors can be eliminated at the 0.05 alpha level • Stepwise • Combination of Forward and Backward • Level of stay: 0.01, level of entry: 0.05 usually used • Many simulation studies show that Backward is the most recommendable.
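The backward procedure described above is essentially a loop, sketched here in plain Python. The model fitting itself is left to a caller-supplied function: `fit_pvalues` is a hypothetical callback that takes the current list of predictors and returns a p-value for each. The toy callback below returns fixed p-values, which is a simplification; in a real analysis the p-values change every time the model is refitted.

```python
def backward_eliminate(predictors, fit_pvalues, alpha=0.05):
    """Drop the predictor with the highest p-value until all remaining ones are <= alpha."""
    current = list(predictors)
    while current:
        pvals = fit_pvalues(current)  # refit with the predictors still in the model
        worst = max(current, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:
            break                     # nothing left to eliminate
        current.remove(worst)
    return current


def toy_pvalues(current):
    """Stand-in for a real model fit; the p-values here are made up and fixed."""
    fixed = {"age": 0.30, "bmi": 0.01, "sex": 0.08}
    return {name: fixed[name] for name in current}


selected = backward_eliminate(["age", "bmi", "sex"], toy_pvalues)
print(selected)
```

Forward selection follows the same pattern in reverse: start from an empty model and add the candidate with the lowest p-value while it stays below the entry level.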
Bariatric Surgery • Roux-en-Y gastric bypass, • Sleeve gastrectomy, • Gastric banding, • Biliopancreatic diversion.
Table 1 Figure 1 Appendix ?
Factors Associated with Achieving The Primary End Points at 3 Years