Week 11 Regression Models and Inference
Generalising from data • Data from 15 lakes in central Ontario. • Zinc concentrations in aquatic plant Eriocaulon septangulare (mg per g dry weight) & zinc concentrations in the lake sediment (mg per g dry weight).
Generalising from data • No interest in the specific lakes • How are plant & sediment Zn related in general? • How accurately can you predict plant Zn from sediment Zn?
Model for regression data • The sample ‘represents’ a larger ‘population’ • Distinguish between the regression lines for the sample and the population • The sample regression line (least squares) is an estimate of the population regression line • How do you model the randomness in the sample?
Sample regression line (revision) • Least squares line: ŷ = b0 + b1x • b0 intercept — predicted y when x = 0 • b1 slope — increase (or decrease) expected for y when x increases by one unit • ŷ is the predicted y or estimated y; ŷi is the fitted value for the ith individual
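For illustration, a minimal Python sketch of fitting the least squares line; the x and y values below are made-up toy numbers, not the Ontario lake data.

```python
import numpy as np

# Toy data (illustrative values only, not the lake measurements)
x = np.array([20, 35, 50, 65, 80, 95], dtype=float)   # e.g. sediment zinc
y = np.array([15, 30, 38, 55, 72, 80], dtype=float)   # e.g. plant zinc

# numpy.polyfit with degree 1 returns [slope, intercept] = [b1, b0]
b1, b0 = np.polyfit(x, y, 1)

y_hat = b0 + b1 * x          # fitted values for each observation
print(f"least squares line: y_hat = {b0:.2f} + {b1:.2f} x")
```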
Height and handspan • Heights (inches) and handspans (cm) of 167 college students • Handspan = −3 + 0.35 × Height • Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.
Residuals (revision) • Vertical distance from a data point to the LS line • Person 70 in tall with handspan 23 cm: fitted handspan = −3 + 0.35 × 70 = 21.5 cm • residual = yi − ŷi = 23 − 21.5 = 1.5 cm
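The residual arithmetic for this student can be checked directly; a small sketch using the fitted handspan equation from the previous slide:

```python
# Fitted line from the handspan example: handspan = -3 + 0.35 * height
b0, b1 = -3.0, 0.35

height, observed_handspan = 70, 23          # person 70 in tall, handspan 23 cm
fitted = b0 + b1 * height                   # -3 + 0.35*70 = 21.5 cm
residual = observed_handspan - fitted       # 23 - 21.5 = 1.5 cm
print(fitted, residual)                     # 21.5 1.5
```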
Model: population regression line • E(Y) = β0 + β1x, the mean or expected value of y for individuals in the population who all have the same x • β0 intercept of the line in the population • β1 slope of the line in the population; β1 = 0 means no linear relationship • β0 and β1 are estimated by the sample LS values b0 and b1
Model: distribution of ‘errors’ • Error = vertical distance of a value from the population regression line • Assume the errors all have normal(0, σ) distributions • Constant standard deviation σ
Linear regression model • y = Mean + Error • Error is the population equivalent of a residual • Error is called “Deviation” in the textbook • Y = β0 + β1x + ε • Error distribution: ε ~ normal(0, σ)
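A sketch of what Y = β0 + β1x + ε means operationally: simulate responses by adding normal(0, σ) errors to a population line. The parameter values below are arbitrary choices for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 5.0, 2.0, 3.0        # arbitrary population parameters
x = np.linspace(0, 10, 50)

errors = rng.normal(loc=0.0, scale=sigma, size=x.size)   # epsilon ~ normal(0, sigma)
y = beta0 + beta1 * x + errors                           # Y = mean + error
```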
Model assumptions • Linear relationship • No curvature • No outliers • Constant error standard deviation • Normal errors
Checking assumptions • Data should lie in a symmetric band of constant width around a straight line • Example: prices of Mazda cars in a Melbourne paper
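One common way to check this (not shown on the slide) is a residual-versus-fitted plot; a minimal sketch with simulated data that does satisfy the model:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 60)
y = 5 + 2 * x + rng.normal(0, 3, x.size)    # simulated data satisfying the model

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

# Residuals should form a symmetric band of roughly constant width about zero
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```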
Transformations • A transformation of Y (or X) may help • The regression line is then modelled on the transformed scale
Parameter estimates • Least squares estimates b0 and b1 are estimates of β0 and β1 • Best estimate of the error s.d. σ is s = √( Σ(yi − ŷi)² / (n − 2) ) • s is the ‘typical’ size of the residuals
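The formula for s can be computed directly from the residuals. A minimal sketch, using simulated data loosely patterned on the height/weight example on the next slide (the numbers are illustrative, not the real data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(60, 75, 43)                      # simulated heights
y = -250 + 5.5 * x + rng.normal(0, 24, x.size)   # simulated weights

n = x.size
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# s = sqrt( sum of squared residuals / (n - 2) )
s = np.sqrt(np.sum(residuals**2) / (n - 2))
print(round(s, 2))
```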
Minitab estimates • Data: x = heights (in inches), y = weights (pounds) of n = 43 male students • Standard deviation s = 24.00 (pounds): roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for that height
Interpreting s • About 95% of the data points lie in a band ±2s on each side of the least squares line • With s = 24, the band is ±48 pounds
Inference about the regression slope • The regression slope β1 is usually the most important parameter • Expected increase in Y for a unit increase in x • Point estimate is the LS slope b1 • How variable is it? What is the standard error of the estimate?
Inference: 95% C.I. for slope • Same pattern as earlier C.I.s: estimate ± t* × std. error • Value of t*: approx 2 for large n, bigger for small n; use t-tables with (n – 2) degrees of freedom
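A sketch of the estimate ± t* × std. error calculation, using scipy for both the least squares fit and the t multiplier; the data are simulated, loosely patterned on the sign-reading example that follows, and the noise level is an arbitrary assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(18, 80, 30)                       # simulated driver ages
y = 577 - 3 * x + rng.normal(0, 50, x.size)       # simulated distances (sigma assumed)

res = stats.linregress(x, y)                      # gives slope b1 and its std error
t_star = stats.t.ppf(0.975, df=x.size - 2)        # t* with n - 2 d.f.

ci = (res.slope - t_star * res.stderr,
      res.slope + t_star * res.stderr)
print(ci)                                          # 95% C.I. for the slope
```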
Example • Driver age and the maximum legibility distance of a new highway sign • Average Distance = 577 – 3.01 × Age
95% C.I. from Minitab • Point estimate: reading distance decreases by 3.01 ft per year of age • 95% confidence interval: estimate ± t* × std. error, with n = 30 points, so t* (28 d.f.) = 2.05
Interpretation • With 95% confidence, we estimate that, in the population of drivers represented by this sample, the mean sign-reading distance decreases by between 3.88 and 2.14 ft per 1-year increase in age.
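As a quick arithmetic check, a minimal sketch that recomputes t* and reconstructs the reported interval. The slope's standard error is not shown on the slides, so it is back-calculated here from the reported endpoints.

```python
from scipy import stats

slope, lower, upper = -3.01, -3.88, -2.14        # values reported on the slides
t_star = stats.t.ppf(0.975, df=28)               # ~2.05 for n = 30

# Back out the standard error implied by the reported interval
se = (upper - lower) / (2 * t_star)              # ~0.42 ft per year
print(round(t_star, 2), round(se, 3))

# Reconstruct the interval: estimate +/- t* x std. error
print(slope - t_star * se, slope + t_star * se)  # ~(-3.88, -2.14)
```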
Importance of zero slope • If the slope is β1 = 0, Y is normal with mean β0 and standard deviation σ: the response distribution does not depend on x • It is therefore important to test whether β1 = 0
Test for zero slope • Hypotheses: H0: β1 = 0, HA: β1 ≠ 0 • Test statistic: t = b1 / s.e.(b1) • p-value: tail area of the t-distribution with (n – 2) d.f.
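A sketch of the test statistic and its two-sided p-value; the data are simulated for illustration, and the standard error of the slope comes from scipy's linregress.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 25)
y = 1 + 0.8 * x + rng.normal(0, 2, x.size)       # simulated data with a real slope

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr                  # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=x.size - 2)   # two-sided tail area

print(round(t_stat, 2), p_value)                 # linregress also reports res.pvalue
```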
Minitab: Age vs reading distance • p-value = 0.000: the probability is virtually 0 that the observed slope could be as far from 0, or farther, if there were no linear relationship in the population • Extremely strong evidence that distance and age are related
Testing zero correlation • H0: ρ = 0 (x and y are not correlated); HA: ρ ≠ 0 (x and y are correlated), where ρ = population correlation • Same test as for zero regression slope • Can be performed even when a regression relationship makes no sense, e.g. leaf length & width
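The same test phrased as a correlation test; scipy's pearsonr returns r and the two-sided p-value directly. The leaf measurements below are invented toy values for illustration.

```python
import numpy as np
from scipy import stats

# Toy measurements (illustrative only), e.g. leaf lengths and widths
length = np.array([4.1, 5.0, 5.6, 6.2, 7.0, 7.5, 8.1])
width  = np.array([1.9, 2.2, 2.6, 2.7, 3.1, 3.3, 3.6])

r, p_value = stats.pearsonr(length, width)       # H0: rho = 0 vs HA: rho != 0
print(round(r, 3), round(p_value, 4))
```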
Significance and Importance • With very large n, weak relationships (low correlation) can be statistically significant. Moral: With a large sample size, saying two variables are significantly related may only mean the correlation is not precisely 0. Look at a scatterplot of the data and examine the correlation coefficient, r.
Prediction of a new Y at x • If you knew the values of β0, β1 and σ: the prediction would be β0 + β1x • Prediction error = the new value’s deviation from this mean • The new value has s.d. σ • 95% prediction interval: approximately β0 + β1x ± 2σ
Prediction of a new Y at x • In practice, you must use the estimates b0, b1 and s • Prediction error has two components • The new value still has variance, estimated by s² • Also, the prediction ŷ itself is random, with its own standard error • Combining these, s.e.(prediction) = √( s² + s.e.(ŷ)² )
Prediction of a new Y at x • Prediction interval: ŷ ± t* × s.e.(prediction), where t* is from t-tables with (n – 2) d.f. • Narrowest when x is near x̄
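A sketch of the whole prediction-interval calculation done by hand, combining the two variance components; the data are simulated, again loosely patterned on the sign-reading example, and the formula for s.e.(ŷ) is the standard one for a fitted mean at x0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(18, 80, 30)
y = 577 - 3 * x + rng.normal(0, 50, x.size)      # simulated ages and distances

n = x.size
b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))

x0 = 21.0                                        # predict for one new x value
y_hat = b0 + b1 * x0

# Standard error of the fitted mean at x0, then of a new observation
se_fit = s * np.sqrt(1 / n + (x0 - x.mean())**2 / np.sum((x - x.mean())**2))
se_pred = np.sqrt(s**2 + se_fit**2)

t_star = stats.t.ppf(0.975, df=n - 2)
print(y_hat - t_star * se_pred, y_hat + t_star * se_pred)   # 95% prediction interval
```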
Reading distance and age • Minitab output • 95% confident that a 21-year-old will be able to read the sign at a distance between 407 and 620 ft
Estimating the mean Y at x • Different from estimating a new individual’s Y • Only takes into account the variability of the estimated mean ŷ • 95% CI for mean Y at x: ŷ ± t* × s.e.(ŷ), where t* is from t-tables with (n – 2) d.f.
Height and weight • 95% CI: for the average weight of all college men of height x • 95% PI: for one new college man of height x
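statsmodels can produce both intervals at once, which makes the CI-versus-PI contrast concrete. A hedged sketch with simulated height/weight-style data; all numbers are illustrative, not the Minitab output from the slides.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
height = rng.uniform(64, 76, 43)
weight = -250 + 5.5 * height + rng.normal(0, 24, height.size)   # simulated data

X = sm.add_constant(height)
fit = sm.OLS(weight, X).fit()

new_X = sm.add_constant(np.array([70.0]), has_constant="add")   # one new height
pred = fit.get_prediction(new_X).summary_frame(alpha=0.05)

# mean_ci_*: 95% CI for the average weight of all men of height 70
# obs_ci_*:  95% PI for one new man of height 70 (always wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```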