Regression: (1) Simple Linear Regression Hal Whitehead BIOL4062 / 5062
Regression • Purposes of regression • Simple linear regression • Formula • Assumptions • If assumptions hold, what can we do? • Testing assumptions • When assumptions do not hold
Regression One Dependent Variable Y Independent Variables X1, X2, X3, ...
Purposes of Regression 1. Relationship between Y and X's 2. Quantitative prediction of Y 3. Relationship between Y and X controlling for C 4. Which of X's are most important? 5. Best mathematical model 6. Compare regression relationships: Y1 on X, Y2 on X 7. Assess interactive effects of X's
Simple regression: one X • Multiple regression: two or more X's
Simple linear regression Y = β0 + β1X + Error
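A minimal Python sketch of what the formula means: Y values scatter randomly about the straight line β0 + β1X. All numbers below are illustrative, not taken from the lecture data.

    import numpy as np

    rng = np.random.default_rng(1)
    beta0, beta1, sigma = 0.23, -0.0035, 0.05      # illustrative parameter values only
    x = rng.uniform(10, 60, size=25)               # the independent variable (e.g. age)
    error = rng.normal(0.0, sigma, size=25)        # random scatter around the line
    y = beta0 + beta1 * x + error                  # the dependent variable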
Assumptions of simple linear regression 1. Existence 2. Independence 3. Linearity 4. Homoscedasticity 5. Normality 6. X measured without error
Assumptions of simple linear regression 1. For any fixed value of X, Y is a random variable with a certain probability distribution having finite mean and variance (Existence) [Figure: probability distribution of Y at a fixed X]
Assumptions of simple linear regression 2. The Y values are statistically independent of one another (Independence)
Assumptions of simple linear regression 3. The mean value of Y given X is a straight-line function of X (Linearity) [Figure: probability distributions of Y at fixed values of X]
Assumptions of simple linear regression 4. The variance of Y is the same for all X (Homoscedasticity) [Figure: probability distributions of Y at fixed values of X]
Assumptions of simple linear regression 5. For any fixed value of X, Y has a normal distribution (Normality) [Figure: probability distributions of Y at fixed values of X]
Assumptions of simple linear regression 6. There are no measurement errors in X (X measured without error)
Assumptions of simple linear regression 1. Existence 2. Independence 3. Linearity 4. Homoscedasticity 5. Normality 6. X measured without error
If assumptions hold, what can we do? 1. Estimate β0 (intercept), β1 (slope), together with measures of uncertainty 2. Describe quality of fit (variation of data around straight line) by estimate of σ² or r² 3. Tests of slope and intercept 4. Prediction and prediction bands 5. ANOVA Table
Parameters estimated using least-squares • Age-specific pregnancy rates of female sperm whales (from Best et al. 1984 Rep. int. Whal. Commn. Spec. Issue) Find the line which minimizes the sum of squared residuals
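A minimal numpy sketch of these least-squares formulas, using placeholder x and y arrays rather than the Best et al. data:

    import numpy as np

    def least_squares_fit(x, y):
        """Closed-form least-squares estimates for Y = b0 + b1*X + error."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        xbar, ybar = x.mean(), y.mean()
        b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)   # slope
        b0 = ybar - b1 * xbar                                            # intercept
        return b0, b1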
1. Estimate β0 (intercept), β1 (slope), together with measures of uncertainty • Age-specific pregnancy rates of female sperm whales (from Best et al. 1984 Rep. int. Whal. Commn. Spec. Issue)
1. Estimate β0 (intercept), β1 (slope), together with measures of uncertainty • β0 = 0.230 (SE 0.028) • 95% c.i.: 0.164; 0.296 • β1 = -0.0035 (SE 0.0009) • 95% c.i.: -0.0056; -0.0013
2. Describe quality of fit by estimate of σ² or r² σ² = 0.0195 r² = 0.679 r² (adjusted) = 0.633 (Proportion of variance accounted for by the regression)
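A hedged sketch (generic data, not the whale pregnancy rates) of how the standard errors, 95% confidence intervals, the estimate of σ², r², and adjusted r² reported above can be computed:

    import numpy as np
    from scipy import stats

    def fit_summary(x, y):
        """Simple linear regression: estimates, SEs, 95% CIs, sigma^2, r^2, adjusted r^2."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        xbar = x.mean()
        sxx = np.sum((x - xbar) ** 2)
        b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
        b0 = y.mean() - b1 * xbar
        resid = y - (b0 + b1 * x)
        s2 = np.sum(resid ** 2) / (n - 2)                    # estimate of sigma^2
        se_b1 = np.sqrt(s2 / sxx)
        se_b0 = np.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
        tcrit = stats.t.ppf(0.975, n - 2)                    # two-sided 95%
        ci_b0 = (b0 - tcrit * se_b0, b0 + tcrit * se_b0)
        ci_b1 = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)
        return dict(b0=b0, b1=b1, se_b0=se_b0, se_b1=se_b1,
                    ci_b0=ci_b0, ci_b1=ci_b1, s2=s2, r2=r2, r2_adj=r2_adj)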
3. Tests of slope and intercept a) Slope = 0 {Equivalent to r=0} b) Slope = Predetermined constant c) Intercept = 0 d) Intercept = Predetermined constant e) Compare slopes f) Compare intercepts {Assume same slope} (tests use t-distribution)
3a) Slope = 0 {Equivalent to r = 0} Does pregnancy rate change with age? H0: β1 = 0 H1: β1 ≠ 0 P = 0.006 Does pregnancy rate decline with age? H0: β1 = 0 H1: β1 < 0 P = 0.003
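A small sketch of the t-tests behind these P-values; the inputs are the slope estimate, its standard error, and the sample size (the function name is just illustrative):

    from scipy import stats

    def slope_test(b1, se_b1, n, b1_null=0.0, alternative="two-sided"):
        """t-test of H0: slope = b1_null, with n - 2 degrees of freedom."""
        t = (b1 - b1_null) / se_b1
        df = n - 2
        if alternative == "two-sided":
            p = 2 * stats.t.sf(abs(t), df)
        elif alternative == "less":          # H1: slope < b1_null (e.g. a decline with age)
            p = stats.t.cdf(t, df)
        else:                                # "greater": H1: slope > b1_null
            p = stats.t.sf(t, df)
        return t, p

    # slope_test(b1, se_b1, n)               -> H0: slope = 0 (two-sided)
    # slope_test(b1, se_b1, n, b1_null=3.0)  -> H0: slope = 3 (e.g. the test in 3b below)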
3b) Slope = Predetermined constant β1 = 2.868 (SE 0.058) 95% c.i.: 2.752; 2.984 Does shape change with length? H0: β1 = 3 H1: β1 ≠ 3 P < 0.05 (isometry: weight ∝ length³) Weights and Lengths of Cetacean Species Whitehead & Mann In Cetacean Societies 2000
3c) Intercept = 0 β0 = 0.436 (SE 0.080) 95% c.i.: 0.276; 0.596 Is birth length proportional to length? H0: β0 = 0 H1: β0 ≠ 0 P < 0.001
3e) Compare slopes β1 (m, mysticetes) = 2.528 (SE 0.409) β1 (o, odontocetes) = 2.962 (SE 0.094) Does shape change differently with length for odontocetes and mysticetes? H0: β1(m) = β1(o) H1: β1(m) ≠ β1(o) P = 0.146 Weights and Lengths of Cetacean Species Whitehead & Mann 2000
3f) Compare intercepts {Assume same slope} β0 (m) = 2.528 (SE 0.409) β0 (o) = 2.962 (SE 0.094) Are odontocetes and mysticetes equally fat? H0: β0(m) = β0(o) H1: β0(m) ≠ β0(o) P = 0.781 [Figure: Log(Weight) vs Log(Length) by ORDER (m, o)]
4. Prediction and prediction bands 95% Confidence Bands for Regression Line 95% Prediction Bands From: http://www.tufts.edu/~gdallal/slr.htm
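A minimal numpy/scipy sketch of how the two kinds of bands are computed: the confidence band is for the mean of Y at each new X, the wider prediction band is for a single new observation of Y.

    import numpy as np
    from scipy import stats

    def bands(x, y, x_new, level=0.95):
        """Confidence band for the fitted line and prediction band for new Y at x_new."""
        x, y, x_new = (np.asarray(a, float) for a in (x, y, x_new))
        n = len(x)
        xbar = x.mean()
        sxx = np.sum((x - xbar) ** 2)
        b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
        b0 = y.mean() - b1 * xbar
        s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
        yhat = b0 + b1 * x_new
        t = stats.t.ppf(0.5 + level / 2.0, n - 2)
        se_line = np.sqrt(s2 * (1.0 / n + (x_new - xbar) ** 2 / sxx))       # for the line
        se_new = np.sqrt(s2 * (1.0 + 1.0 / n + (x_new - xbar) ** 2 / sxx))  # for a new Y
        return (yhat, yhat - t * se_line, yhat + t * se_line,
                yhat - t * se_new, yhat + t * se_new)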
5. ANOVA Table
Analysis of Variance
Source       Sum-of-Squares    df    Mean-Square    F-ratio     P
Regression   286.27             1    286.27         2475.07     <0.001
Residual     5.32              46    0.12
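A sketch of the sums of squares behind such a table, with 1 regression and n - 2 residual degrees of freedom (placeholder x and y):

    import numpy as np
    from scipy import stats

    def anova_table(x, y):
        """ANOVA decomposition for simple linear regression."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        xbar = x.mean()
        b1 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)
        b0 = y.mean() - b1 * xbar
        yhat = b0 + b1 * x
        ss_reg = np.sum((yhat - y.mean()) ** 2)        # regression SS, 1 df
        ss_res = np.sum((y - yhat) ** 2)               # residual SS, n - 2 df
        f = (ss_reg / 1.0) / (ss_res / (n - 2))
        p = stats.f.sf(f, 1, n - 2)
        return ss_reg, ss_res, f, p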
If assumptions hold, what can we do? 1. Estimate β0 (intercept), β1 (slope), together with measures of uncertainty 2. Describe quality of fit (variation of data around straight line) by estimate of σ² or r² 3. Tests of slope and intercept 4. Prediction and prediction bands 5. ANOVA Table
Testing assumptions: diagnostics • Use residuals to look at the assumptions of regression: e(i) = Y(i) [observed] - (β0 + β1X(i)) [expected]
Residuals • Residual: e(i) = Y(i) - (β0 + β1X(i)) • Standardized residuals: e(i)/S {S is the standard deviation of the residuals, with degrees of freedom adjusted for the fitted parameters} • Studentized residuals: e(i) / [S √(1 - h(i))] {h(i) is the "leverage value" of observation i: h(i) = 1/n + (X(i) - X̄)² / [(n-1)S(X)²], where X̄ = ΣX(i)/n} • Jackknifed residuals: e(i) / [S(-i) √(1 - h(i))] {The residual standard deviation S(-i) is recalculated with observation i deleted}
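A numpy sketch of these diagnostics for simple regression; it uses the leverage formula above and, for the jackknifed version, recomputes the residual standard deviation with each observation left out (placeholder data):

    import numpy as np

    def residual_diagnostics(x, y):
        """Leverage and standardized / studentized / jackknifed residuals."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        xbar = x.mean()
        sxx = np.sum((x - xbar) ** 2)
        b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
        b0 = y.mean() - b1 * xbar
        e = y - (b0 + b1 * x)
        h = 1.0 / n + (x - xbar) ** 2 / sxx            # leverage h(i)
        s = np.sqrt(np.sum(e ** 2) / (n - 2))          # residual standard deviation S
        standardized = e / s
        studentized = e / (s * np.sqrt(1 - h))
        s_minus_i = np.sqrt((np.sum(e ** 2) - e ** 2 / (1 - h)) / (n - 3))   # S(-i)
        jackknifed = e / (s_minus_i * np.sqrt(1 - h))
        return h, standardized, studentized, jackknifed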
Use Residuals to: a) look for outliers which we may wish to remove b) examine normality c) check for linearity d) check for homoscedasticity e) check for some kinds of non-independence
a) Should outliers be removed? Yes, if the “outlier” was probably not produced by the process being studied (measurement error, different species, ...). No, if the “outlier” was probably produced by the process being studied (e.g. an extreme specimen).
b) Using residuals to examine normality • Lilliefors test for normality: P=0.62 • Lilliefors test for normality (excluding Bowhead whale): P=0.68
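One way to run a Lilliefors test in Python is via the statsmodels package (assuming it is installed); `resid` below is stand-in data, not the whale residuals:

    import numpy as np
    from statsmodels.stats.diagnostic import lilliefors   # requires statsmodels

    rng = np.random.default_rng(0)
    resid = rng.normal(size=30)                # stand-in for the regression residuals
    stat, p = lilliefors(resid, dist="norm")
    print(f"Lilliefors statistic = {stat:.3f}, P = {p:.2f}")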
e) Use residuals to check for some kinds of non-independence • Durbin-Watson D statistic: 1.48 • values well below 2 indicate positive autocorrelation • First-order autocorrelation: 0.26 (Data: days spent following sperm whales)
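Both statistics can be computed directly from the residuals kept in time (or sequence) order; a short sketch, with illustrative function names:

    import numpy as np

    def durbin_watson(resid):
        """D = sum of squared successive differences / sum of squared residuals.
        D near 2: no autocorrelation; D well below 2: positive autocorrelation."""
        resid = np.asarray(resid, float)
        return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    def first_order_autocorrelation(resid):
        resid = np.asarray(resid, float) - np.mean(resid)
        return np.sum(resid[:-1] * resid[1:]) / np.sum(resid ** 2)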
Use Residuals to: a) look for outliers which we may wish to remove b) examine normality c) check for linearity d) check for homoscedasticity e) check for some kinds of non-independence
Assumptions of simple linear regression 1. Existence 2. Independence 3. Linearity 4. Homoscedasticity 5. Normality 6. X measured without error
When assumptions do not hold: 1. Existence: Forget it!
When assumptions do not hold: 2. Independence: • collect data differently • reduce the size of the data set • add additional terms to the regression model • (e.g. autocorrelation term, species effect) More a problem for testing than prediction
When assumptions do not hold: 3. Linearity: • Transform either X or Y or both variables, e.g.: Log(Y) = β0 + β1 Log(X) + E • Polynomial regression: Y = β0 + β1X + β2X² + ... + E • Non-linear regression, e.g.: Y = c + EXP(β0 + β1X) + E • Piecewise linear regression: Y = β0 + β1X·[X>XK] + E, where [X>XK] = 0 if X < XK and [X>XK] = 1 if X > XK
[Figure: example curves for the piecewise, log-log, polynomial, and exponential models above]
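A brief numpy sketch of fitting some of these alternatives with made-up data (np.polyfit returns coefficients from the highest power down; XK is an assumed, known breakpoint):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(1, 10, 40)                                       # made-up X
    y = np.exp(0.5 + 0.8 * np.log(x)) * rng.lognormal(0, 0.1, 40)    # made-up Y

    # Transformation: Log(Y) = b0 + b1*Log(X) + E
    b1_log, b0_log = np.polyfit(np.log(x), np.log(y), 1)

    # Polynomial regression: Y = b0 + b1*X + b2*X^2 + E
    c2, c1, c0 = np.polyfit(x, y, 2)

    # Piecewise linear regression: Y = b0 + b1*X*[X > XK] + E
    XK = 5.0
    step = (x > XK).astype(float)                            # the 0/1 indicator [X > XK]
    design = np.column_stack([np.ones_like(x), x * step])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)        # coef = (b0, b1)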
When assumptions do not hold: 4. Homoscedasticity: • Transformations of the Y variable • Weighted regressions (if we know that some observations are more accurate than others)
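A minimal sketch of a weighted straight-line fit, assuming the weights w are known (larger w for observations believed to be more accurate):

    import numpy as np

    def weighted_line_fit(x, y, w):
        """Weighted least squares for Y = b0 + b1*X: minimizes sum(w * residual^2)."""
        x, y, w = (np.asarray(a, float) for a in (x, y, w))
        xw = np.sum(w * x) / np.sum(w)                  # weighted mean of X
        yw = np.sum(w * y) / np.sum(w)                  # weighted mean of Y
        b1 = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
        b0 = yw - b1 * xw
        return b0, b1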
When assumptions do not hold: 5. Normality: • Transformations of the Y variable • Non-normal error structures (e.g. Poisson) Small departures from normality are not especially important, unless doing a test
When assumptions do not hold: 6. X measured without error: • Major axis regression • Reduced major axis, or geometric mean, regression
Major axis regression: • Minimize sum of squares of perpendicular distances from observations to regression line • Only if variables are in same units {First principal component of covariance matrix}
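A sketch of both alternatives: the reduced major axis slope from the ratio of standard deviations, and the major axis slope from the first principal component of the 2×2 covariance matrix (illustrative function names):

    import numpy as np

    def rma_fit(x, y):
        """Reduced major axis (geometric mean) regression: slope = sign(r) * sd(Y)/sd(X)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        r = np.corrcoef(x, y)[0, 1]
        b1 = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
        b0 = y.mean() - b1 * x.mean()
        return b0, b1

    def major_axis_fit(x, y):
        """Major axis regression: minimizes perpendicular distances to the line
        (use only when X and Y are in the same units)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))   # eigenvalues in ascending order
        v = eigvecs[:, -1]                                # direction of largest variance
        b1 = v[1] / v[0]
        b0 = y.mean() - b1 * x.mean()
        return b0, b1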