Biostat Review

Biostat Review November 29, 2012

Objectives • Review hw#8 • Review of last two lectures • Linear regression • Simple and multiple • Logistic regression

Review hw#8

Simple linear regression • The objective of regression analysis is to predict or estimate the value of the response that is associated with a fixed value of the explanatory variable.

Simple linear regression
• The regression line equation is ŷ = α̂ + β̂x
• The “best” line is the one that finds the α and β that minimize the sum of the squared residuals Σeᵢ² (hence the name “least squares”)
• Interpretation of results: the slope β is the change in the mean value of y that corresponds to a one-unit increase in x
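For a single predictor, the least-squares criterion has a closed-form solution. The sketch below is written in Python rather than Stata only so that it is self-contained; the five data points are made up for illustration and are not from the FEV data.

```python
# Toy data (hypothetical, not from the FEV dataset)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares estimates for y = alpha + beta * x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residuals e_i = y_i - yhat_i, and the sum of squares that the fit minimizes
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}, SSE = {sse:.3f}")
```

Any other choice of intercept and slope would give a larger sum of squared residuals on these points.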

Simple linear regression example: Regression of FEV on age
FEV = α̂ + β̂ age
Stata syntax: regress yvar xvar
. regress fev age
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =  872.18
       Model |  280.919154     1  280.919154           Prob > F      =  0.0000
    Residual |  210.000679   652  .322086931           R-squared     =  0.5722
-------------+------------------------------           Adj R-squared =  0.5716
       Total |  490.919833   653  .751791475           Root MSE      =  .56753
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .222041   .0075185    29.53   0.000     .2072777    .2368043
       _cons |   .4316481   .0778954     5.54   0.000      .278692    .5846042
------------------------------------------------------------------------------
• β̂ = Coef. for age
• α̂ = _cons (short for constant)

Interpretation of coefficients
• Age: for every one-unit (one-year) increase in age, the mean FEV increases by 0.22
• Constant: when age = 0, the estimated mean FEV is 0.432; this is the mean FEV at age 0, not the overall mean FEV

Model Fit
• R² is the proportion of the variability in y that is explained by the regression on x
• Remember that R² tells us the fit of the model, with values closer to 1 indicating a better fit
• You can get this number from your Stata output (next slide)

. regress fev age
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =  872.18
       Model |  280.919154     1  280.919154           Prob > F      =  0.0000
    Residual |  210.000679   652  .322086931           R-squared     =  0.5722
-------------+------------------------------           Adj R-squared =  0.5716
       Total |  490.919833   653  .751791475           Root MSE      =  .56753
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .222041   .0075185    29.53   0.000     .2072777    .2368043
       _cons |   .4316481   .0778954     5.54   0.000      .278692    .5846042
------------------------------------------------------------------------------
=.75652
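The fit statistics in the upper-right block of this output can be recomputed from the sums of squares in the upper-left block. A minimal sketch in Python (used here only as a calculator; the numbers are copied from the output above):

```python
# Sums of squares and degrees of freedom copied from the regress output
ss_model = 280.919154
ss_residual = 210.000679
ss_total = 490.919833
df_model = 1
df_residual = 652

# R-squared: share of the total variability explained by the model
r_squared = ss_model / ss_total

# F statistic: model mean square divided by residual mean square
ms_model = ss_model / df_model
ms_residual = ss_residual / df_residual
f_stat = ms_model / ms_residual

# Root MSE: square root of the residual mean square
root_mse = ms_residual ** 0.5

print(f"R-squared = {r_squared:.4f}, F = {f_stat:.2f}, Root MSE = {root_mse:.5f}")
```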

Model fit
• Can also look at residuals
• Residuals are the differences between the observed y values and the regression line at each value of x: eᵢ = yᵢ - ŷᵢ
• If all the points lie along a straight line, the residuals are all 0
• If there is a lot of variability at each level of x, the residuals are large
• The sum of the squared residuals is what was minimized in the least squares method of fitting the line

Scatter plot of outcome vs. predictor (figure)
• The difference between an observed point and the fitted line is the residual

Use of residual plots for model fit
• A residual plot is a scatter plot
• Y-axis: residuals
• X-axis: fitted (predicted) values of the outcome variable
• Stata code to get a residual plot:
. regress outcomevar predictorvar
. rvfplot

rvfplot, title(Fitted values versus residuals for regression of FEV on age)

Why look at residual plot
• This plot shows that as the fitted value of FEV increases, the spread of the residuals increases; this suggests heteroscedasticity
• Note that heteroscedasticity does not bias the estimates of the parameters, but it does reduce the precision of the estimates (and therefore reduces power)
• Homoscedasticity: constant variability of y across all values of x (the same standard deviation of y at each value of x)
• Homoscedasticity is an important assumption if you are going to use linear regression!

Residual plots
• Of note:
• rvfplot gives you residuals vs. fitted values (of the outcome)
• rvpplot predictorvar gives you residuals vs. the predictor
• Can use either one

Data transformation
• If you have heteroscedasticity in your data, you can transform your data
• Something to note:
• Heteroscedasticity does not bias your β’s, but it does make your standard errors larger
• Transforming your data does not inherently change your data
• Log transformation is the most common way to deal with heteroscedasticity

Log transformation of FEV data
• Do we still have heteroscedasticity?
• Sort of…

Log transformation Stata output

Interpretation of regression coefficients for transformed y value
• The regression equation is: ln(FEV) = α̂ + β̂ age = 0.051 + 0.087 age
• So a one-year change in age corresponds to a 0.087 change in ln(FEV)
• The change is on a multiplicative scale, so if you exponentiate the coefficient, you get a percent change in y
• e^0.087 ≈ 1.09, so a one-year increase in age corresponds to a 9% increase in FEV
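The back-transformation can be checked with a few lines of Python (used here only as a calculator). The slope 0.087 is the one quoted on the slide; the five-year comparison is an added illustration, not something from the lecture.

```python
import math

beta_age = 0.087  # slope from the ln(FEV) regression quoted on the slide

# Multiplicative change in FEV for a one-year increase in age
ratio = math.exp(beta_age)
pct_change = (ratio - 1) * 100

# Illustration only: a five-year difference multiplies FEV by exp(5 * beta),
# which is the one-year ratio raised to the fifth power
ratio_5yr = math.exp(5 * beta_age)

print(f"1-year ratio = {ratio:.3f} ({pct_change:.1f}% increase)")
print(f"5-year ratio = {ratio_5yr:.3f}")
```

Because the effect is multiplicative, percent changes compound over several years rather than add.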

Categorical variable/predictor
• The previous example had a continuous predictor (age)
• You can also perform regression with a categorical predictor/variable
• If the predictor is dichotomous, the convention is to code it 0 vs. 1
• E.g., sex is dichotomous: 0 for female, 1 for male

Categorical independent variable
• Remember that the regression equation is μ_{y|x} = α + βx
• The only values x can take are 0 and 1
• μ_{y|0} = α and μ_{y|1} = α + β
• So the estimated mean FEV for females is α̂ and the estimated mean FEV for males is α̂ + β̂
• When we conduct the hypothesis test of the null hypothesis β = 0, what are we testing?
• What other test have we learned that tests the same thing?
• The two-sample t-test (with equal variances)
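A small numerical check shows why the slope on a 0/1 predictor is just a difference in group means. The sketch below uses Python and six made-up outcome values (not the FEV data).

```python
# Hypothetical outcome values for two groups coded x = 0 and x = 1
x = [0, 0, 0, 1, 1, 1]
y = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit of y = alpha + beta * x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Group means computed directly
group0 = [yi for xi, yi in zip(x, y) if xi == 0]
group1 = [yi for xi, yi in zip(x, y) if xi == 1]
mean_0 = sum(group0) / len(group0)
mean_1 = sum(group1) / len(group1)

print(f"alpha_hat = {alpha_hat}, mean of group 0 = {mean_0}")
print(f"beta_hat  = {beta_hat}, difference in means = {mean_1 - mean_0}")
```

The intercept equals the mean of the group coded 0, and the slope equals the difference between the two group means, which is the quantity the t-test compares.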

Categorical variable/predictor
• What if you have more than two categories within a predictor (non-dichotomous)?
• One category is set as the reference

Categorical independent variables
• E.g., race group = White, Asian/PI, Other
• Suppose Race = White is set as the reference category (this is convention)

Categorical independent variables • Dummy/indicator variables • Since all variables in regression analysis have to have a numerical value, for dichotomous variables we typically assign them 1 (yes/present) vs. 0 (no/not present) • Since they do not have any quantitative meaning they are called dummy/indicator variables

Categorical independent variables
• Then the regression equation is: y = α + β₁x_Asian/PI + β₂x_Other + ε
• For race group = White (reference): ŷ = α̂ + β̂₁·0 + β̂₂·0 = α̂
• For race group = Asian/PI: ŷ = α̂ + β̂₁·1 + β̂₂·0 = α̂ + β̂₁
• For race group = Other: ŷ = α̂ + β̂₁·0 + β̂₂·1 = α̂ + β̂₂
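The indicator coding behind these three equations can be sketched in a few lines of Python. The function names and the coefficient values are hypothetical, chosen only to show the arithmetic; they are not estimates from any dataset in the lecture.

```python
LEVELS = ["White", "Asian/PI", "Other"]  # "White" is the reference category


def indicators(race):
    """Return (x_asian_pi, x_other); the reference group is coded (0, 0)."""
    if race not in LEVELS:
        raise ValueError(f"unknown race group: {race}")
    return (int(race == "Asian/PI"), int(race == "Other"))


# Hypothetical coefficients, for illustration only
alpha_hat, beta1_hat, beta2_hat = 3.0, -0.2, 0.1


def predict(race):
    """Predicted mean outcome for one race group."""
    x_asian_pi, x_other = indicators(race)
    return alpha_hat + beta1_hat * x_asian_pi + beta2_hat * x_other


for level in LEVELS:
    print(level, indicators(level), predict(level))
```

A predictor with three categories needs only two indicator variables, because the reference group is identified by both indicators being 0.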

Categorical independent variables
• In Stata, prefix the variable name with “i.” (i.variable) to identify it as a categorical variable
• Stata takes the lowest number as the reference group
• You can change this with the prefix “b#.variable” (or “ib#.variable”), where # is the numeric value of the group that you want to be the reference group

Multiple regression
• Additional explanatory variables might add to our understanding of a dependent variable
• We can posit the population equation μ_{y|x₁,x₂,...,x_q} = α + β₁x₁ + β₂x₂ + ... + β_q x_q
• α is the mean of y when all the explanatory variables are 0
• βᵢ is the change in the mean value of y that corresponds to a 1-unit change in xᵢ when all the other explanatory variables are held constant

Multiple regression • Stata command (just add the additional predictors) • regress outcomevar predictorvar1 predictorvar2…

Multiple regression
. regress fev age ht
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) = 1067.96
       Model |  376.244941     2  188.122471           Prob > F      =  0.0000
    Residual |  114.674892   651  .176151908           R-squared     =  0.7664
-------------+------------------------------           Adj R-squared =  0.7657
       Total |  490.919833   653  .751791475           Root MSE      =   .4197
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0542807   .0091061     5.96   0.000     .0363998    .0721616
          ht |   .1097118   .0047162    23.26   0.000      .100451    .1189726
       _cons |  -4.610466   .2242706   -20.56   0.000    -5.050847   -4.170085
------------------------------------------------------------------------------
• R² never decreases as you add more variables to the model
• The Adj R-squared accounts for the addition of variables and is comparable across models with different numbers of parameters
• Note that the beta for age changed (from 0.222 in the simple regression to 0.054 here)
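The adjusted R-squared in this output follows from the usual formula 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of predictors. A minimal check in Python (used only as a calculator), with values copied from the output above:

```python
# Values copied from the "regress fev age ht" output
ss_residual = 114.674892
ss_total = 490.919833
n = 654  # number of observations
k = 2    # number of predictors (age and ht)

r_squared = 1 - ss_residual / ss_total

# Adjusted R-squared penalizes each additional predictor
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R-squared = {r_squared:.4f}, Adj R-squared = {adj_r_squared:.4f}")
```

With 654 observations and only two predictors the penalty is small, which is why the two values differ only in the fourth decimal place.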

How do you interpret the coefficients?
• Age: when height is held constant, every 1-unit (in this case, 1-year) increase in age corresponds to a 0.054 increase in mean FEV
• Height: what about this variable?
• What about the constant? How do you interpret that? Does it make sense?
• You will learn more transformations to get it to make sense in future courses

You can fit both continuous and categorical predictors
. regress fev age smoke
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) =  443.25
       Model |  283.058247     2  141.529123           Prob > F      =  0.0000
    Residual |  207.861587   651  .319295832           R-squared     =  0.5766
-------------+------------------------------           Adj R-squared =  0.5753
       Total |  490.919833   653  .751791475           Root MSE      =  .56506
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .2306046   .0081844    28.18   0.000     .2145336    .2466755
       smoke |  -.2089949   .0807453    -2.59   0.010    -.3675476   -.0504421
       _cons |   .3673731   .0814357     4.51   0.000     .2074647    .5272814
------------------------------------------------------------------------------
• The model is fêv = α̂ + β̂₁ age + β̂₂ x_smoke
• So for non-smokers, we have fêv = α̂ + β̂₁ age (because x_smoke = 0)
• For smokers, fêv = α̂ + β̂₁ age + β̂₂ (because x_smoke = 1)
• So β̂₂ is the mean difference in FEV for smokers versus non-smokers at each age

When you have one continuous variable and one dichotomous variable, you can think of fitting two lines that differ only in their y-intercept, by the coefficient of the dichotomous variable (in this case smoke)
• E.g., β̂₂ = -0.209
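The two parallel lines can be written out with the coefficients from the regress fev age smoke output. This is a sketch in Python; the ages 10 and 15 are arbitrary values picked for illustration.

```python
# Coefficients copied from the "regress fev age smoke" output
alpha_hat = 0.3673731
beta_age = 0.2306046
beta_smoke = -0.2089949


def predicted_fev(age, smoker):
    """Predicted mean FEV for a given age and smoking status."""
    x_smoke = 1 if smoker else 0
    return alpha_hat + beta_age * age + beta_smoke * x_smoke


# The gap between the smoker and non-smoker lines is the same at every age
gaps = []
for age in (10, 15):
    gap = predicted_fev(age, True) - predicted_fev(age, False)
    gaps.append(gap)
    print(f"age {age}: non-smoker {predicted_fev(age, False):.3f}, "
          f"smoker {predicted_fev(age, True):.3f}, gap {gap:.3f}")
```

Both lines share the slope for age, so the vertical gap between them is the smoke coefficient regardless of age.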

Logistic regression
• Linear regression: continuous outcome
• Logistic regression: dichotomous outcome
• E.g., disease vs. no disease, or alive vs. dead
• Model the probability of the disease

Logistic regression
• Need an equation that will follow the rules of probability
• Specifically, the probability needs to be between 0 and 1
• A model of the form p = α + βx would be able to take on negative values or values greater than 1
• p = e^(α + βx) is an improvement because it cannot be negative, but it could still be greater than 1

Logistic regression
• How about the function p = e^(α + βx) / (1 + e^(α + βx))?
• This function = 0.5 when α + βx = 0
• The function models the probability slowly increasing over the value of x, until there is a steep rise, and then another leveling off

Logistic regression
• The logistic function: p = e^(α + βx) / (1 + e^(α + βx))
• Now check this out: 1 - p = 1 / (1 + e^(α + βx))
• So the odds of disease/success are p/(1 - p) = e^(α + βx)
• And therefore ln(p/(1 - p)) = α + βx

Logistic regression
• So instead of assuming that the relationship between x and p is linear, we are assuming that the relationship between ln(p/(1 - p)) and x is linear
• ln(p/(1 - p)) is called the logit function
• It is a transformation
• While the outcome is not linear, the other side of the equation, α + βx, is linear
• This is an example of a generalized linear model
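The properties claimed for the logistic and logit functions are easy to verify numerically. A small Python sketch, using the algebraically equivalent form 1/(1 + e^(-z)) with z standing for α + βx; the grid of z values is arbitrary.

```python
import math


def logistic(z):
    """Logistic function: e^z / (1 + e^z), written as 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds: ln(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# z plays the role of alpha + beta * x
z_values = [-5, -2, -1, 0, 1, 2, 5]
probabilities = [logistic(z) for z in z_values]

for z, p in zip(z_values, probabilities):
    print(f"z = {z:>2}: p = {p:.4f}, logit(p) = {logit(p):.4f}")
```

Every probability stays strictly between 0 and 1, p is exactly 0.5 at z = 0, and applying the logit recovers the linear predictor z.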

Logistic regression
• Stata code:
. logistic outcomevar predictorvar1 predictorvar2…, coef
• The coef option gives you the coefficient, β
• This β, when you are interpreting it, is actually ln(OR)
• To get the odds ratio, raise e to the power β
• Odds ratio = e^β
• Or you could just use this Stata code instead (omit the coef option):
. logistic outcomevar predictorvar1 predictorvar2…

Interpret these coefficients
. logistic coldany i.rested_mostly, coef
Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      19.71
                                                  Prob > chi2     =     0.0000
Log likelihood =  -323.5717                       Pseudo R2       =     0.0296
------------------------------------------------------------------------------
     coldany |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.rested_m~y |  -.9343999   .2187794    -4.27   0.000      -1.3632   -.5056001
       _cons |  -.2527658   .1077594    -2.35   0.019    -.4639704   -.0415612
------------------------------------------------------------------------------

Interpret these coefficients
• Cold data (from previous slide)
• β = -0.934
• The natural log of the odds ratio of getting a cold, comparing someone who was rested with someone who was not rested, is -0.934
• If you exponentiate it (e^-0.934), you get 0.39
• Therefore another way of interpreting this is that the odds of getting a cold for someone who was rested are 0.39 times the odds for someone who was not rested
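The conversion from coefficient to odds ratio can be reproduced by hand. The sketch below uses Python as a calculator with the two coefficients from the logistic coldany i.rested_mostly, coef output; the predicted probabilities at the end are an added illustration, not something computed on the slides.

```python
import math

# Coefficients copied from the "logistic ..., coef" output
beta_rested = -0.9343999  # log odds ratio, rested vs. not rested
cons = -0.2527658         # log odds of a cold in the reference (not rested) group

# Odds ratio: exponentiate the coefficient
odds_ratio = math.exp(beta_rested)

# Odds in each group, and their ratio
odds_not_rested = math.exp(cons)
odds_rested = math.exp(cons + beta_rested)

# Illustration: convert odds back to probabilities with p = odds / (1 + odds)
p_not_rested = odds_not_rested / (1 + odds_not_rested)
p_rested = odds_rested / (1 + odds_rested)

print(f"odds ratio = {odds_ratio:.4f}")
print(f"P(cold | not rested) = {p_not_rested:.3f}, P(cold | rested) = {p_rested:.3f}")
```

An odds ratio below 1 means lower odds in the rested group; it is a ratio of odds, not of probabilities.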

Or get Stata to calculate the odds ratio for you!
Stata syntax: logistic depvar indepvar
. logistic coldany i.rested_mostly
Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      19.71
                                                  Prob > chi2     =     0.0000
Log likelihood =  -323.5717                       Pseudo R2       =     0.0296
------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.rested_m~y |   .3928215   .0859413    -4.27   0.000     .2558409    .6031435
------------------------------------------------------------------------------
• Odds Ratio = e^β̂ (here e^-0.934 ≈ 0.393)

Interpretation when you have a continuous variable
. logistic coldany age
Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      23.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -322.05172                       Pseudo R2       =     0.0356
------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9624413   .0081519    -4.52   0.000     .9465958    .9785521
------------------------------------------------------------------------------
• Interpretation of the coefficients: the odds ratio is for a one-unit change in the predictor
• For this example, 0.962 is the odds ratio for a one-year difference in age

When you want to change the value of the unit
• So instead of looking at the change in the outcome for 1 unit, let’s say you want to look at the change in the outcome for 10 units

Continuous explanatory variable
• To find the OR for a 10-year change in age:
. logistic coldany age, coef
Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      23.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -322.05172                       Pseudo R2       =     0.0356
------------------------------------------------------------------------------
     coldany |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0382822     .00847    -4.52   0.000    -.0548831   -.0216813
       _cons |    .906605   .3167295     2.86   0.004     .2858265    1.527383
------------------------------------------------------------------------------
• OR for a 10-year change in age = exp(10*-.0382) = 0.682

Or you can also generate a new variable
• To find the OR for a 10-year change in age:
. gen age_10=age/10
(2 missing values generated)
. logistic coldany age_10
Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      23.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -322.05172                       Pseudo R2       =     0.0356
------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      age_10 |   .6819344   .0577599    -4.52   0.000     .5776247    .8050807
------------------------------------------------------------------------------
• This is nice because Stata will calculate your confidence interval as well!
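The rescaled odds ratio and its confidence interval can also be obtained by hand from the per-year coefficient: multiply the coefficient and its confidence limits by 10 on the log-odds scale, then exponentiate. A Python sketch, using the values from the logistic coldany age, coef output:

```python
import math

# Per-year coefficient and its 95% confidence limits (log-odds scale),
# copied from the "logistic coldany age, coef" output
beta_age = -0.0382822
ci_low, ci_high = -0.0548831, -0.0216813

# One-year odds ratio
or_1yr = math.exp(beta_age)

# Ten-year odds ratio: scale on the log-odds scale, then exponentiate
or_10yr = math.exp(10 * beta_age)
or_10yr_ci = (math.exp(10 * ci_low), math.exp(10 * ci_high))

print(f"OR per 1 year   = {or_1yr:.4f}")
print(f"OR per 10 years = {or_10yr:.4f}, "
      f"95% CI ({or_10yr_ci[0]:.4f}, {or_10yr_ci[1]:.4f})")
```

These values agree with the age_10 row that Stata reports, and the ten-year odds ratio is the one-year odds ratio raised to the tenth power, not multiplied by 10.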

Interpret this output
. logistic coldany age_10 i.smoke
Logistic regression                               Number of obs   =        504
                                                  LR chi2(2)      =      23.89
                                                  Prob > chi2     =     0.0000
Log likelihood = -321.99014                       Pseudo R2       =     0.0358
------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      age_10 |   .6835216   .0580647    -4.48   0.000     .5786864     .807349
     1.smoke |   1.128027   .3863511     0.35   0.725     .5764767     2.20728
------------------------------------------------------------------------------

Correct interpretations
• 0.684 is the odds ratio of getting a cold for a ten-year increase in age, holding smoking status constant
• 1.13 is the odds ratio of getting a cold for a person who smokes compared with someone who does not smoke, holding age constant

Stata notes
• logit depvar indepvars: gives you coefficients
• logit depvar indepvars, or: gives you odds ratios
• logistic depvar indepvars: gives you odds ratios
• logistic depvar indepvars, coef: gives you coefficients
