Week 11 Regression Models and Inference
Generalising from data • Data from 15 lakes in central Ontario. • Zinc concentrations in aquatic plant Eriocaulon septangulare (mg per g dry weight) & zinc concentrations in the lake sediment (mg per g dry weight).
Generalising from data • No interest in the specific lakes • How are plant & sediment Zn related in general? • How accurately can you predict plant Zn from sediment Zn?
Model for regression data • The sample ‘represents’ a larger ‘population’ • Distinguish between the regression lines for the sample and the population • The sample regression line (least squares) is an estimate of the population regression line • How do you model the randomness in the sample?
Sample regression line (revision) • Least squares line: ŷ = b0 + b1x • b0 intercept — predicted y when x = 0 • b1 slope — increase (or decrease) expected for y when x increases by one unit • ŷ is the predicted y or estimated y; ŷi is the fitted value for the ith individual
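For illustration, a minimal Python sketch of fitting the least squares line; the x and y values below are made-up toy numbers, not the Ontario lake data.

```python
import numpy as np

# Toy data (illustrative values only, not the lake measurements)
x = np.array([20, 35, 50, 65, 80, 95], dtype=float)   # e.g. sediment zinc
y = np.array([15, 30, 38, 55, 72, 80], dtype=float)   # e.g. plant zinc

# numpy.polyfit with degree 1 returns [slope, intercept] = [b1, b0]
b1, b0 = np.polyfit(x, y, 1)

y_hat = b0 + b1 * x          # fitted values for each observation
print(f"least squares line: y_hat = {b0:.2f} + {b1:.2f} x")
```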
Height and handspan • Heights (inches) and handspans (cm) of 167 college students • Handspan = −3 + 0.35 × Height • Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.
Residuals (revision) • Vertical distance from a data point to the LS line • Person 70 in tall with handspan 23 cm: fitted handspan = −3 + 0.35 × 70 = 21.5 cm • residual = yi − ŷi = 23 − 21.5 = 1.5 cm
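The residual arithmetic for this student can be checked directly; a small sketch using the fitted handspan equation from the previous slide:

```python
# Fitted line from the handspan example: handspan = -3 + 0.35 * height
b0, b1 = -3.0, 0.35

height, observed_handspan = 70, 23          # person 70 in tall, handspan 23 cm
fitted = b0 + b1 * height                   # -3 + 0.35*70 = 21.5 cm
residual = observed_handspan - fitted       # 23 - 21.5 = 1.5 cm
print(fitted, residual)                     # 21.5 1.5
```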
Model: population regression line • E(Y) = β0 + β1x, the mean or expected value of y for individuals in the population who all have the same x • β0 intercept of the line in the population • β1 slope of the line in the population; β1 = 0 means no linear relationship • β0 and β1 are estimated by the sample LS values b0 and b1
Model: distribution of ‘errors’ • Error = vertical distance of a value from the population regression line • Assume the errors all have normal(0, σ) distributions • Constant standard deviation σ
Linear regression model • y = Mean + Error • Error is the population equivalent of a residual • Error is called “Deviation” in the textbook • Y = β0 + β1x + ε • Error distribution: ε ~ normal(0, σ)
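A sketch of what Y = β0 + β1x + ε means operationally: simulate responses by adding normal(0, σ) errors to a population line. The parameter values below are arbitrary choices for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 5.0, 2.0, 3.0        # arbitrary population parameters
x = np.linspace(0, 10, 50)

errors = rng.normal(loc=0.0, scale=sigma, size=x.size)   # epsilon ~ normal(0, sigma)
y = beta0 + beta1 * x + errors                           # Y = mean + error
```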
Model assumptions • Linear relationship • No curvature • No outliers • Constant error standard deviation • Normal errors
Checking assumptions • Data should lie in a symmetric band of constant width around a straight line • Example: prices of Mazda cars in a Melbourne paper
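One common way to check this (not shown on the slide) is a residual-versus-fitted plot; a minimal sketch with simulated data that does satisfy the model:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 60)
y = 5 + 2 * x + rng.normal(0, 3, x.size)    # simulated data satisfying the model

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

# Residuals should form a symmetric band of roughly constant width about zero
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```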
Transformations • A transformation of Y (or X) may help • The regression line is then modelled on the transformed scale
Parameter estimates • Least squares estimates b0 and b1 are estimates of β0 and β1 • Best estimate of the error s.d. σ is s = √( Σ(yi − ŷi)² / (n − 2) ) • s is the ‘typical’ size of the residuals
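The formula for s can be computed directly from the residuals. A minimal sketch, using simulated data loosely patterned on the height/weight example on the next slide (the numbers are illustrative, not the real data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(60, 75, 43)                      # simulated heights
y = -250 + 5.5 * x + rng.normal(0, 24, x.size)   # simulated weights

n = x.size
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# s = sqrt( sum of squared residuals / (n - 2) )
s = np.sqrt(np.sum(residuals**2) / (n - 2))
print(round(s, 2))
```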
Minitab estimates • Data: x = heights (in inches), y = weights (pounds) of n = 43 male students • Standard deviation s = 24.00 (pounds): roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for that height
Interpreting s • About 95% of the data points lie in a band ±2s on each side of the least squares line • With s = 24, the band is ±48 pounds
Inference about the regression slope • The regression slope β1 is usually the most important parameter • Expected increase in Y for a unit increase in x • Point estimate is the LS slope b1 • How variable is it? What is the standard error of the estimate?
Inference: 95% C.I. for slope • Same pattern as earlier C.I.s: estimate ± t* × std. error • Value of t*: approx 2 for large n, bigger for small n; use t-tables with (n – 2) degrees of freedom
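A sketch of the estimate ± t* × std. error calculation, using scipy for both the least squares fit and the t multiplier; the data are simulated, loosely patterned on the sign-reading example that follows, and the noise level is an arbitrary assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(18, 80, 30)                       # simulated driver ages
y = 577 - 3 * x + rng.normal(0, 50, x.size)       # simulated distances (sigma assumed)

res = stats.linregress(x, y)                      # gives slope b1 and its std error
t_star = stats.t.ppf(0.975, df=x.size - 2)        # t* with n - 2 d.f.

ci = (res.slope - t_star * res.stderr,
      res.slope + t_star * res.stderr)
print(ci)                                          # 95% C.I. for the slope
```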
Example • Driver age and the maximum legibility distance of a new highway sign • Average Distance = 577 – 3.01 × Age
95% C.I. from Minitab • Point estimate: reading distance decreases by 3.01 ft per year of age • 95% confidence interval: estimate ± t* × std. error, with n = 30 points, so t* (28 d.f.) = 2.05
Interpretation • With 95% confidence, we estimate that, in the population of drivers represented by this sample, the mean sign-reading distance decreases by between 3.88 and 2.14 ft per 1-year increase in age.
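As a quick arithmetic check, a minimal sketch that recomputes t* and reconstructs the reported interval. The slope's standard error is not shown on the slides, so it is back-calculated here from the reported endpoints.

```python
from scipy import stats

slope, lower, upper = -3.01, -3.88, -2.14        # values reported on the slides
t_star = stats.t.ppf(0.975, df=28)               # ~2.05 for n = 30

# Back out the standard error implied by the reported interval
se = (upper - lower) / (2 * t_star)              # ~0.42 ft per year
print(round(t_star, 2), round(se, 3))

# Reconstruct the interval: estimate +/- t* x std. error
print(slope - t_star * se, slope + t_star * se)  # ~(-3.88, -2.14)
```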
Importance of zero slope • If the slope is β1 = 0, Y is normal with mean β0 and standard deviation σ: the response distribution does not depend on x • It is therefore important to test whether β1 = 0
Test for zero slope • Hypotheses: H0: β1 = 0, HA: β1 ≠ 0 • Test statistic: t = b1 / s.e.(b1) • p-value: tail area of the t-distribution with (n – 2) d.f.
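A sketch of the test statistic and its two-sided p-value; the data are simulated for illustration, and the standard error of the slope comes from scipy's linregress.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 25)
y = 1 + 0.8 * x + rng.normal(0, 2, x.size)       # simulated data with a real slope

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr                  # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=x.size - 2)   # two-sided tail area

print(round(t_stat, 2), p_value)                 # linregress also reports res.pvalue
```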
Minitab: Age vs reading distance • p-value = 0.000: the probability is virtually 0 that the observed slope could be as far from 0, or farther, if there were no linear relationship in the population • Extremely strong evidence that distance and age are related
Testing zero correlation • H0: ρ = 0 (x and y are not correlated); HA: ρ ≠ 0 (x and y are correlated), where ρ = population correlation • Same test as for zero regression slope • Can be performed even when a regression relationship makes no sense, e.g. leaf length & width
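The same test phrased as a correlation test; scipy's pearsonr returns r and the two-sided p-value directly. The leaf measurements below are invented toy values for illustration.

```python
import numpy as np
from scipy import stats

# Toy measurements (illustrative only), e.g. leaf lengths and widths
length = np.array([4.1, 5.0, 5.6, 6.2, 7.0, 7.5, 8.1])
width  = np.array([1.9, 2.2, 2.6, 2.7, 3.1, 3.3, 3.6])

r, p_value = stats.pearsonr(length, width)       # H0: rho = 0 vs HA: rho != 0
print(round(r, 3), round(p_value, 4))
```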
Significance and Importance • With very large n, weak relationships (low correlation) can be statistically significant. Moral: With a large sample size, saying two variables are significantly related may only mean the correlation is not precisely 0. Look at a scatterplot of the data and examine the correlation coefficient, r.
Prediction of a new Y at x • If you knew the values of β0, β1 and σ: the prediction would be β0 + β1x • Prediction error = the new value’s deviation from this mean • The new value has s.d. σ • 95% prediction interval: approximately β0 + β1x ± 2σ
Prediction of a new Y at x • In practice, you must use the estimates b0, b1 and s • Prediction error has two components • The new value still has variance, estimated by s² • Also, the prediction ŷ itself is random, with its own standard error • Combining these, s.e.(prediction) = √( s² + s.e.(ŷ)² )
Prediction of a new Y at x • Prediction interval: ŷ ± t* × s.e.(prediction), where t* is from t-tables with (n – 2) d.f. • Narrowest when x is near x̄
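A sketch of the whole prediction-interval calculation done by hand, combining the two variance components; the data are simulated, again loosely patterned on the sign-reading example, and the formula for s.e.(ŷ) is the standard one for a fitted mean at x0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(18, 80, 30)
y = 577 - 3 * x + rng.normal(0, 50, x.size)      # simulated ages and distances

n = x.size
b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))

x0 = 21.0                                        # predict for one new x value
y_hat = b0 + b1 * x0

# Standard error of the fitted mean at x0, then of a new observation
se_fit = s * np.sqrt(1 / n + (x0 - x.mean())**2 / np.sum((x - x.mean())**2))
se_pred = np.sqrt(s**2 + se_fit**2)

t_star = stats.t.ppf(0.975, df=n - 2)
print(y_hat - t_star * se_pred, y_hat + t_star * se_pred)   # 95% prediction interval
```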
Reading distance and age • Minitab output • 95% confident that a 21-year-old will be able to read the sign at a distance between 407 and 620 ft
Estimating the mean Y at x • Different from estimating a new individual’s Y • Only takes into account the variability of the estimated mean ŷ • 95% CI for mean Y at x: ŷ ± t* × s.e.(ŷ), where t* is from t-tables with (n – 2) d.f.
Height and weight • 95% CI: for the average weight of all college men of height x • 95% PI: for one new college man of height x
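statsmodels can produce both intervals at once, which makes the CI-versus-PI contrast concrete. A hedged sketch with simulated height/weight-style data; all numbers are illustrative, not the Minitab output from the slides.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
height = rng.uniform(64, 76, 43)
weight = -250 + 5.5 * height + rng.normal(0, 24, height.size)   # simulated data

X = sm.add_constant(height)
fit = sm.OLS(weight, X).fit()

new_X = sm.add_constant(np.array([70.0]), has_constant="add")   # one new height
pred = fit.get_prediction(new_X).summary_frame(alpha=0.05)

# mean_ci_*: 95% CI for the average weight of all men of height 70
# obs_ci_*:  95% PI for one new man of height 70 (always wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```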