Linear Regression

Ecole Nationale Vétérinaire de Toulouse. Linear Regression. Didier Concordet, d.concordet@envt.fr. ECVPT Workshop, April 2011. Can be downloaded at http://www.biostat.envt.fr/.

Presentation Transcript


  1. Ecole Nationale Vétérinaire de Toulouse. Linear Regression. Didier Concordet, d.concordet@envt.fr. ECVPT Workshop, April 2011. Can be downloaded at http://www.biostat.envt.fr/

  2. An example

  3. About the straight line: $Y = a + b x$, where a = intercept and b = slope. (Figure: panels of Y against x illustrating b > 0, b = 0, b < 0 and a = 0.)

  4. Questions • How to obtain the best straight line? • Is this straight line the best curve to use? • How to use this straight line?

  5. How to obtain the best straight line? Proceed in three main steps: • write a (statistical) model • estimate the parameters • inspect the data graphically

  6. Write a model. A statistical model has two parts: • Mean model: functional relationship • Variance model: assumptions on the residuals

  7. Write a model: $Y_i = a + b x_i + e_i$, where $a + b x_i$ is the mean model and $e_i$ is the residual (error term)

  8. Assumptions on the residuals • the $x_i$'s are not random variables: they are known with high precision • the $e_i$'s have a constant variance (homoscedasticity) • the $e_i$'s are independent • the $e_i$'s are normally distributed (normality)

  9. Homoscedasticity (Figure: two panels, one showing homoscedasticity and one showing heteroscedasticity.)

  10. Normality (Figure: Y plotted against x.)

  11. Estimate the parameters. A criterion is needed to estimate the parameters: a statistical model together with a criterion.

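  12. How to estimate the "best" a and b? Intuitive criterion: minimum of $\sum_i (Y_i - (a + b x_i))$, but positive and negative deviations compensate. Reasonable criterion: minimum of $\sum_i (Y_i - (a + b x_i))^2$. Linear model + homoscedasticity + normality: least squares criterion (L.S.)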

  13. The least squares criterion: $SS(a, b) = \sum_{i=1}^{n} (Y_i - (a + b x_i))^2$

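  14. Result of optimisation: $\hat{b} = \frac{\sum_i (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_i (x_i - \bar{x})^2}$ and $\hat{a} = \bar{Y} - \hat{b}\,\bar{x}$. The estimates $\hat{a}$ and $\hat{b}$ change with samples: $\hat{a}$ and $\hat{b}$ are random variables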
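The closed-form least squares estimates (slope = Sxy / Sxx, intercept = mean of Y minus slope times mean of x) can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the slides: the function name `least_squares` and the five data points are hypothetical.

```python
def least_squares(x, y):
    """Return (a_hat, b_hat) minimising sum((y_i - a - b*x_i)**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx              # estimated slope
    a_hat = y_bar - b_hat * x_bar  # estimated intercept
    return a_hat, b_hat


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = least_squares(x, y)
print(f"intercept = {a_hat:.3f}, slope = {b_hat:.3f}")
```

Running the same function on another sample drawn from the same line would give slightly different estimates, which is what "random variables" means here.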

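  15. Balance sheet. True mean straight line: $E(Y) = a + b x$. Estimated straight line: $\hat{Y} = \hat{a} + \hat{b} x$. Mean predicted value for the ith observation: $\hat{Y}_i = \hat{a} + \hat{b} x_i$. ith residual: $\hat{e}_i = Y_i - \hat{Y}_i$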

  16. Example
      Dep Var: HPLC   N: 18

      Effect      Coefficient   Std Error   t        P (2 Tail)
      CONSTANT    20.046        3.682       5.444    0.000
      CONCENT     2.916         0.069       42.030   0.000

      CONSTANT is the intercept and CONCENT is the slope. Estimated straight line: HPLC = 20.046 + 2.916 × CONCENT

  17. Example

  18. Example

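  19. Residual variance. By construction $\sum_i \hat{e}_i = 0$, but $\sum_i \hat{e}_i^2 \neq 0$, where $\hat{e}_i = Y_i - \hat{Y}_i$ is the ith residual. The residual variance is defined by $\hat{\sigma}^2 = \frac{1}{n-2} \sum_i \hat{e}_i^2$; $\hat{\sigma}$ is the standard error of estimate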
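As a rough sketch of the residual variance, the snippet below fits a line to hypothetical data, checks that the residuals sum to zero, and divides the residual sum of squares by n − 2 (two parameters are estimated). The data and variable names are assumptions made for illustration.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_hat = y_bar - b_hat * x_bar

# Residuals: observed value minus mean predicted value.
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

residual_sum = sum(residuals)                                # zero up to rounding
residual_variance = sum(e ** 2 for e in residuals) / (n - 2)
std_error_of_estimate = math.sqrt(residual_variance)

print(f"sum of residuals       = {residual_sum:.2e}")
print(f"residual variance      = {residual_variance:.4f}")
print(f"standard error of est. = {std_error_of_estimate:.4f}")
```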

  20. Example
      Dep Var: HPLC   N: 18
      Multiple R: 0.996   Squared multiple R: 0.991
      Adjusted squared multiple R: 0.991   Standard error of estimate: 8.282

      Effect      Coefficient   Std Error   t        P (2 Tail)
      CONSTANT    20.046        3.682       5.444    0.000
      CONCENT     2.916         0.069       42.030   0.000

  21. Questions • How to obtain the best straight line? • Is this straight line the best curve to use? • How to use this straight line?

  22. Is this model the best one to use? • Tools to check the mean model: scatterplot of residuals vs fitted values; test(s) • Tools to check the variance model: scatterplot of residuals vs fitted values; probability plot (P-plot)

  23. Checking the mean model: scatterplot of residuals vs fitted values. (Figure: two residual plots centred on 0.) • Structure in the residuals: change the mean model • No structure in the residuals: OK

  24. Checking the mean model: tests. Two cases: • No replication: try a polynomial model (quadratic first) • Replications: test of lack of fit

  25. Without replication: try another mean model and test the improvement. Example: $Y_i = a + b x_i + c x_i^2 + e_i$. If the test on c is significant (c ≠ 0), then keep this model.
      Dep Var: HPLC   N: 18
      Multiple R: 0.996   Squared multiple R: 0.991
      Adjusted squared multiple R: 0.991   Standard error of estimate: 8.539

      Effect             Coefficient   Std Error   t       P (2 Tail)
      CONSTANT           21.284        6.649       3.201   0.006
      CONCENT            2.842         0.335       8.486   0.000
      CONCENT*CONCENT    0.001         0.003       0.227   0.824

      Here the test on c is not significant (P = 0.824), so the quadratic term brings no improvement over the straight line.
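The test on the quadratic coefficient can be sketched without a statistics package. The version below is only an illustration on hypothetical data: it obtains the coefficient of x² by regressing the straight-line residuals on the part of x² that x does not explain (an equivalent route to fitting the full quadratic model), and the helper name `simple_fit` and the tabulated t quantile are assumptions.

```python
import math


def simple_fit(x, y):
    """Least-squares line of y on x; returns (intercept, slope, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, resid


# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.2, 4.9, 7.1, 9.0]
n = len(x)

_, _, e_lin = simple_fit(x, y)                     # residuals of the straight line
_, _, z = simple_fit(x, [xi ** 2 for xi in x])     # x^2 with its linear part removed

szz = sum(zi ** 2 for zi in z)
c_hat = sum(zi * ei for zi, ei in zip(z, e_lin)) / szz   # quadratic coefficient

ss_lin = sum(e ** 2 for e in e_lin)                # residual SS, straight line
ss_quad = ss_lin - c_hat ** 2 * szz                # residual SS, quadratic model
s2 = ss_quad / (n - 3)                             # three parameters estimated
t_c = c_hat / math.sqrt(s2 / szz)                  # t statistic for c = 0

T_CRIT = 4.303  # tabulated 97.5% quantile of Student's t with n - 3 = 2 df
print(f"c_hat = {c_hat:.5f}, t = {t_c:.3f}")
print("keep quadratic model" if abs(t_c) > T_CRIT else "keep straight line")
```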

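  26. Departure from linearity. With replications: perform a test of lack of fit. Principle: compare the residual variance of the straight line to the pure error variance (the variability between replicates at the same x); if the residual variance is much larger than the pure error variance, then change the model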

  27. Test of lack of fit: how to do it? Three steps: 1) Linear regression 2) One-way ANOVA 3) if the lack-of-fit test is significant, then change the model

  28. Test of lack of fit: example. Three steps: 1) Linear regression 2) One-way ANOVA
      Dep Var: HPLC   N: 18
      Analysis of Variance

      Source     Sum-of-Squares   df   Mean-Square   F-ratio   P
      CONCENT    121251.776       5    24250.355     289.434   0.000
      Error      1005.427         12   83.786

      3) The lack-of-fit test is not significant: we keep the straight line
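The lack-of-fit arithmetic can be retraced from the printed outputs alone. The sketch below assumes the usual decomposition (residual sum of squares of the straight line = lack-of-fit sum of squares + pure error sum of squares) and recovers the residual sum of squares from the standard error of estimate 8.282 reported for the linear regression; the critical value is taken from standard F tables and is an assumption, not a number from the slides.

```python
# Numbers read from the regression and ANOVA outputs in the slides.
n = 18
std_error_of_estimate = 8.282    # linear regression output
ss_pure_error = 1005.427         # "Error" row of the one-way ANOVA
df_pure_error = 12

# Residual sum of squares of the straight line (n - 2 degrees of freedom).
df_residual = n - 2
ss_residual = std_error_of_estimate ** 2 * df_residual

# Lack of fit = what the straight line leaves unexplained beyond pure error.
ss_lack_of_fit = ss_residual - ss_pure_error
df_lack_of_fit = df_residual - df_pure_error

f_lack_of_fit = ((ss_lack_of_fit / df_lack_of_fit)
                 / (ss_pure_error / df_pure_error))

F_CRIT = 3.26  # tabulated 95% quantile of F(4, 12); assumption from standard tables
print(f"F lack of fit = {f_lack_of_fit:.3f} on ({df_lack_of_fit}, {df_pure_error}) df")
print("change the model" if f_lack_of_fit > F_CRIT else "keep the straight line")
```

Because the inputs are rounded to three decimals, the F ratio is approximate, but it is far below the critical value.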

  29. Checking the variance model: homoscedasticity. Scatterplot of residuals vs fitted values. (Figure: two residual plots centred on 0.) • No structure in the residuals but heteroscedasticity: change the model (criterion) • Homoscedasticity: OK

  30. What to do with heteroscedasticity? Scatterplot of residuals vs fitted values: model the dispersion. (Figure: residual plot centred on 0.) The standard deviation of the residuals increases with $\hat{Y}$: it increases with x

  31. What to do with heteroscedasticity? Estimate the slope and the intercept again, but with weights inversely proportional to the variance: minimise $\sum_i w_i (Y_i - (a + b x_i))^2$ with $w_i \propto 1/\mathrm{Var}(e_i)$, and check that the weighted residuals $\sqrt{w_i}\,(Y_i - \hat{Y}_i)$ are homoscedastic
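A minimal sketch of weighted least squares follows. It assumes, purely for illustration, that the standard deviation of the residuals is proportional to x, so the weights are taken as 1/x²; the function name and the data are hypothetical.

```python
import math


def weighted_least_squares(x, y, w):
    """Return (a_hat, b_hat) minimising sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Hypothetical data whose scatter grows with x.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [5.1, 7.8, 14.6, 24.9, 52.3]
w = [1.0 / xi ** 2 for xi in x]   # assumption: Var(e_i) proportional to x_i**2

a_hat, b_hat = weighted_least_squares(x, y, w)

# Weighted residuals: these, not the raw residuals, should look homoscedastic.
weighted_residuals = [math.sqrt(wi) * (yi - (a_hat + b_hat * xi))
                      for wi, xi, yi in zip(w, x, y)]
print(f"intercept = {a_hat:.3f}, slope = {b_hat:.3f}")
print([round(r, 3) for r in weighted_residuals])
```

With all weights equal, the function reduces to ordinary least squares.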

  32. Checking the variance model: normality. (Figure: two probability plots of the residuals against the expected value for a normal distribution.) • No curvature: normality • Curvature: non-normality. Is it so important?
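The coordinates of such a probability plot can be computed with the standard library. This sketch pairs the sorted residuals with the expected normal quantiles at plotting positions (i − 0.5)/n; that choice of plotting position is one common convention and an assumption here, as are the residual values.

```python
from statistics import NormalDist


def normal_scores(residuals):
    """Pair each ordered residual with its expected standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))


# Hypothetical residuals, for illustration only.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]

pairs = normal_scores(residuals)
for expected, observed in pairs:
    print(f"expected {expected:+.3f}   residual {observed:+.3f}")
```

Plotting the second column against the first gives the probability plot: points close to a straight line suggest normality, marked curvature suggests non-normality.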

  33. What to do with non-normality? Try to model the distribution of the residuals. In general, this is difficult with few observations. If enough observations are available, non-normality does not affect the results much.

  34. An interesting index: R². R² = squared correlation coefficient = % of the dispersion of the $Y_i$'s explained by the straight line (the model). 0 ≤ R² ≤ 1. If R² = 1, all the residuals $\hat{e}_i$ = 0: the straight line explains all the variation of the $Y_i$'s. If R² = 0, the slope is 0: the straight line does not explain any variation of the $Y_i$'s

  35. An interesting index: R². R² and R (correlation coefficient) are not designed to measure linearity! Example: Multiple R: 0.990 Squared multiple R: 0.980 Adjusted squared multiple R: 0.980
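A small numerical illustration of this warning, on made-up data rather than the data behind the slide: the points below lie exactly on the curve Y = x², which is not a straight line, yet a straight-line fit still yields a high R².

```python
def r_squared(x, y):
    """R^2 of the least-squares straight line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    a_hat = y_bar - b_hat * x_bar
    ss_res = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical, deliberately curved data: Y = x^2 with no noise.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

r2 = r_squared(x, y)
print(f"R^2 of a straight line fitted to Y = x^2: {r2:.3f}")
```

The residuals of this fit show a clear U-shaped structure, which is what the residual plot detects and R² does not.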

  36. Questions • How to obtain the best straight line? • Is this straight line the best curve to use? • How to use this straight line?

  37. How to use this straight line? • Direct use, for a given x: predict the mean Y; construct a confidence interval of the mean Y; construct a prediction interval of Y • Reverse use, calibration (approximate results), for a given Y: predict the mean X; construct a confidence interval of the mean X; construct a prediction interval of X

  38. For a given x, predict the mean Y by $\hat{a} + \hat{b} x$, where $\hat{a}$ and $\hat{b}$ are the estimated intercept and slope. Example:

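  39. Confidence interval of the mean Y: $\hat{a} + \hat{b} x \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$, where $\hat{a}$ and $\hat{b}$ are the estimated intercept and slope and $\hat{\sigma}$ is the standard error of estimate. There is a probability 1 − α that a + b x belongs to this interval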

  40. Confidence interval of the mean Y (Figure: lower limit L and upper limit U of the interval, with the value 30 marked.)

  41. Example

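  42. Prediction interval of Y: $\hat{a} + \hat{b} x \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$, where $\hat{a}$ and $\hat{b}$ are the estimated intercept and slope and $\hat{\sigma}$ is the standard error of estimate. 100(1 − α)% of the measurements carried out at this x belong to this interval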
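The confidence interval of the mean Y and the prediction interval of Y differ only by the extra "1 +" under the square root. The sketch below computes both on hypothetical data under the assumptions of the model (homoscedastic, normal residuals); the function names are illustrative, and the Student quantile is passed in from standard tables rather than computed.

```python
import math


def fit_line(x, y):
    """Return (a_hat, b_hat, sigma_hat, x_bar, sxx) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a_hat = y_bar - b_hat * x_bar
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ss_res / (n - 2))   # standard error of estimate
    return a_hat, b_hat, sigma_hat, x_bar, sxx


def interval(x0, x, y, t_quantile, new_observation=False):
    """Confidence interval of the mean Y at x0, or prediction interval of Y."""
    a_hat, b_hat, sigma_hat, x_bar, sxx = fit_line(x, y)
    n = len(x)
    centre = a_hat + b_hat * x0
    extra = 1.0 if new_observation else 0.0   # variability of a single new measurement
    half = t_quantile * sigma_hat * math.sqrt(extra + 1.0 / n + (x0 - x_bar) ** 2 / sxx)
    return centre - half, centre + half


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_975_3DF = 3.182   # tabulated 97.5% quantile of Student's t with n - 2 = 3 df

conf = interval(3.0, x, y, T_975_3DF)
pred = interval(3.0, x, y, T_975_3DF, new_observation=True)
print(f"95% confidence interval of the mean Y at x = 3: [{conf[0]:.3f}, {conf[1]:.3f}]")
print(f"95% prediction interval of Y at x = 3:          [{pred[0]:.3f}, {pred[1]:.3f}]")
```

The prediction interval is always wider than the confidence interval, and both are narrowest at the mean of the x values.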

  43. Prediction interval of Y (Figure: lower limit L and upper limit U of the interval, with the value 30 marked.)

  44. Example

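  45. Reverse use: for a given Y = y0, predict the mean X by $\hat{x}_0 = (y_0 - \hat{a}) / \hat{b}$, where $\hat{a}$ and $\hat{b}$ are the estimated intercept and slope. Example: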
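Reverse use amounts to solving the estimated line for x. As a sketch, the snippet below plugs in the intercept and slope printed in the regression output of the slides; the response value `y0 = 100.0` is a hypothetical reading chosen for illustration, not a value from the presentation.

```python
# Estimates taken from the regression output in the slides.
a_hat = 20.046   # CONSTANT (intercept)
b_hat = 2.916    # CONCENT (slope)


def inverse_predict(y0, a_hat, b_hat):
    """Point estimate of x for an observed response y0 (classical calibration)."""
    return (y0 - a_hat) / b_hat


y0 = 100.0   # hypothetical HPLC response
x0_hat = inverse_predict(y0, a_hat, b_hat)
print(f"predicted concentration for HPLC = {y0}: {x0_hat:.3f}")
```

The line is still fitted with Y regressed on x; only the fitted equation is inverted, which is different from refitting with x and Y swapped.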

  46. For a given Y = y0: a confidence interval of the mean X (Figure: y0 marked on the Y axis, with the limits L and U on the X axis.)

  47. Confidence interval of the mean X. There is a probability 1 − α that the mean X belongs to [L, U]. L and U are the values of x at which the upper and lower confidence limits of the mean Y equal y0

  48. Example

  49. What you should no longer believe • One can fit the straight line by inverting x and Y • If the correlation coefficient is high, the straight line is the best model • Normality of the $x_i$'s is required to perform a regression • Normality of the $e_i$'s is essential to perform a good regression
