
Week 11


Presentation Transcript


  1. Week 11 Regression Models and Inference

  2. Generalising from data • Data from 15 lakes in central Ontario. • Zinc concentrations in aquatic plant Eriocaulon septangulare (mg per g dry weight) & zinc concentrations in the lake sediment (mg per g dry weight).

  3. Generalising from data • No interest in specific lakes • How are plant & sediment Zn related in general? • How accurately can you predict plant Zn from sediment Zn?

  4. Model for regression data • Sample ‘represents’ a larger ‘population’ • Distinguish between regn lines for sample and population • Sample regn line (least squares) is an estimate of popn regn line. • How do you model randomness in sample?

  5. Sample regression line (revision) • Least squares line: ŷ = b0 + b1x • b0 intercept — predicted y when x = 0. • b1 slope — increase (or decrease) expected for y when x increases by one unit. • ŷ (predicted y or estimated y): fitted value for the ith individual
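
The least-squares formulas on this slide can be sketched in a few lines of Python. The (x, y) numbers below are invented for illustration; they are not the Ontario lakes data.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])

# Least-squares slope and intercept: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Fitted value for the ith individual: yhat_i = b0 + b1 * x_i
yhat = b0 + b1 * x
```

A property worth noticing: the mean of the fitted values always equals the mean of the observed y, because the least-squares line passes through (x̄, ȳ).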

  6. Height and handspan • Heights (inches) and Handspans (cm) of 167 college students. Handspan = -3 + 0.35 Height Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.

  7. Residuals (revision) • Vertical distance from data point to LS line • Person 70 in tall with handspan 23 cm • residual = yi − ŷi = 23 − 21.5 = 1.5 cm
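
The residual arithmetic from the slide, as a tiny sketch (the line and the numbers are taken from the handspan example above):

```python
# Fitted line from the slide: Handspan = -3 + 0.35 * Height
b0, b1 = -3.0, 0.35

height, handspan = 70.0, 23.0   # one observed person
fitted = b0 + b1 * height        # predicted handspan at 70 inches
residual = handspan - fitted     # observed minus fitted
```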

  8. Model: population regn line • E(Y) = β0 + β1x: mean or expected value of Y for individuals in the population who all have the same x. • β0 intercept of line in the population. • β1 slope of line in the population. • β1 = 0 means no linear relationship. • β0 and β1 are estimated by the sample LS values b0 and b1.

  9. Model: distribution of ‘errors’ • Error = vertical distance of value from population regn line • Assume errors all have normal(0, σ) distns • Constant standard deviation σ

  10. Linear regression model • y = Mean + Error • Error is population equivalent of residual • Error is called “Deviation” in textbook • Y = β0 + β1x + ε • Error distribution ε ~ normal(0, σ)

  11. Understanding parameters

  12. Model assumptions • Linear relationship • No curvature • No outliers • Constant error standard deviation • Normal errors

  13. Checking assumptions • Data should be in a symmetric band of constant width round a straight line • Prices of Mazda cars in Melbourne paper

  14. Transformations • Transformation of Y (or X) may help • Model regression line:

  15. Parameter estimates • Least squares estimates b0 and b1 are estimates of β0 and β1 • Best estimate of error s.d. σ is s = √( Σ(yi − ŷi)² / (n − 2) ) • ‘Typical’ size of residuals
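
A minimal sketch of computing s from the residuals, using NumPy and the same invented (x, y) data as earlier; note the divisor n − 2, not n:

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])
n = len(x)

# Least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# s estimates sigma: residual sum of squares over n - 2 degrees of freedom
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
```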

  16. Minitab estimates • Data: x = heights (inches), y = weights (pounds) of n = 43 male students. • Standard deviation s = 24.00 (pounds): roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for that height.

  17. Interpreting s • About 95% of the points (crosses) lie in a band ±2s on each side of the least squares line. • s = 24, so the band is ±48.

  18. Inference about regn slope • Regn slope, β1, is usually the most important parameter • Expected increase in Y for unit increase in x • Point estimate is LS slope, b1 • How variable? What is the std error of the estimate?

  19. Inference: 95% C.I. for slope • Same pattern as earlier C.I.s: estimate ± t* × std. error • Value of t*: approx 2 for large n; bigger for small n; use t-tables with (n − 2) degrees of freedom
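
A sketch of the slope C.I. using the standard error s.e.(b1) = s/√Sxx; SciPy is assumed to be available for the t* quantile, and the data are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

# 95% C.I.: b1 +/- t* x s.e.(b1), with t* from (n - 2) d.f.
se_b1 = s / np.sqrt(Sxx)
tstar = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - tstar * se_b1, b1 + tstar * se_b1)
```

With only n = 5 points, t* (3 d.f.) is about 3.18, noticeably bigger than the "approx 2" that applies for large n.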

  20. Example • Driver age and maximum legibility distance of new highway sign Average Distance = 577 – 3.01 × Age

  21. 95% C.I. from Minitab • Point estimate: reading distance decreases by 3.01 ft per year of age • 95% Confidence interval: n = 30 points, t* (28 d.f.) = 2.05

  22. Interpretation • With 95% confidence, we estimate that, in the population of drivers represented by this sample, the mean sign-reading distance decreases by between 2.14 and 3.88 ft per 1-year increase in age.

  23. Importance of zero slope • If slope β1 = 0, Y is normal with mean β0 and st devn σ • Response distribution does not depend on x • It is therefore important to test whether β1 = 0

  24. Test for zero slope • Hypotheses: H0: β1 = 0 HA: β1 ≠ 0 • Test statistic: t = b1 / s.e.(b1) • p-value: tail area of t-distn (n − 2 d.f.)
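
The t statistic and two-sided p-value can be sketched as follows (invented data; SciPy assumed available):

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

# t statistic for H0: beta1 = 0, and two-sided p-value with n - 2 d.f.
t = b1 / (s / np.sqrt(Sxx))
p = 2 * stats.t.sf(abs(t), df=n - 2)
```

The same test is what `scipy.stats.linregress` reports as its p-value.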

  25. Minitab: Age vs reading distance • p-value ≈ 0.000: the probability is virtually 0 that the observed slope could be as far from 0, or farther, if there were no linear relationship in the population • Extremely strong evidence that distance and age are related

  26. Testing zero correlation • H0: ρ = 0 (x and y are not correlated.) HA: ρ ≠ 0 (x and y are correlated.) where ρ = population correlation • Same test as for zero regression slope. • Can be performed even when a regression relationship makes no sense, e.g. leaf length & width
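
A sketch showing that testing ρ = 0 via the standard statistic t = r√(n − 2)/√(1 − r²) gives the same p-value as SciPy's built-in correlation test, illustrating "same test as for zero regression slope" (data invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])
n = len(x)

# Sample correlation r, then t = r * sqrt(n - 2) / sqrt(1 - r^2)
r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)

# SciPy's built-in test of zero correlation for comparison
r2, p2 = stats.pearsonr(x, y)
```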

  27. Significance and Importance • With very large n, weak relationships (low correlation) can be statistically significant. Moral: With a large sample size, saying two variables are significantly related may only mean the correlation is not precisely 0. Look at a scatterplot of the data and examine the correlation coefficient, r.

  28. Prediction of new Y at x • If you knew the values of β0, β1 and σ: • Prediction error = Y − (β0 + β1x) • New value has s.d. σ • 95% prediction interval: (β0 + β1x) ± 2σ

  29. Prediction of new Y at x • In practice, you must use estimates • Prediction error has two components • New value still has s.d. σ, with σ² estimated by s² • Also, the prediction ŷ itself is random • Combining these, s.e.(pred) = s √( 1 + 1/n + (x − x̄)² / Sxx )

  30. Prediction of new Y at x • Prediction interval: ŷ ± t* × s.e.(pred), where t* is from t tables, (n − 2) d.f. • Narrowest when x is near x̄
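
The prediction-interval recipe can be sketched as below (invented data; SciPy assumed available). Predicting at x0 = x̄ makes the interval as narrow as it can be:

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x0 = 40.0                     # predict one new Y at this x (here x0 = xbar)
yhat0 = b0 + b1 * x0
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
tstar = stats.t.ppf(0.975, df=n - 2)
pi = (yhat0 - tstar * se_pred, yhat0 + tstar * se_pred)
```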

  31. Reading distance and age • Minitab output • 95% confident that a 21-year-old will read sign between 407 and 620 ft

  32. Estimating mean Y at x • Different from estimating a new individual’s Y • Only takes into account variability in ŷ • 95% CI for mean Y at x: ŷ ± t* × s √( 1/n + (x − x̄)² / Sxx ), where t* is from t tables, (n − 2) d.f.

  33. Height and weight • 95% CI • For average of all college men of ht x • 95% PI • For one new college man of ht x
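
The CI/PI contrast can be sketched at a single x (invented data; SciPy assumed available): the PI is always wider than the CI because of the extra "1 +" under the square root, which accounts for the new individual's own variability.

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented for illustration only
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([30.0, 55.0, 90.0, 120.0, 150.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
tstar = stats.t.ppf(0.975, df=n - 2)

x0 = 50.0
yhat0 = b0 + b1 * x0
# CI for the MEAN of Y at x0: no "1 +" term inside the square root
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
# PI for ONE NEW Y at x0: includes the individual's own s.d.
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)

ci = (yhat0 - tstar * se_mean, yhat0 + tstar * se_mean)
pi = (yhat0 - tstar * se_pred, yhat0 + tstar * se_pred)
```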
