Confidence Intervals

Confidence Intervals • Underlying model: Unknown parameter • We know how to calculate point estimates • E.g. regression analysis • But different data would change our estimates. • So, we treat our estimates as random variables • Want a measure of how confident we are in our estimate. • Calculate “Confidence Interval”

What is it? • If we know how the data were sampled, we can construct a Confidence Interval for an unknown parameter, θ. • A 95% C.I. gives a range constructed so that the true θ is in the interval 95% of the time (over repeated samples). • A 100(1-α)% C.I. captures the true θ a fraction (1-α) of the time. • Smaller α: more sure the true θ falls in the interval, but the interval is wider.
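The "95% of the time" statement can be checked by simulation. The sketch below is illustrative only: the true mean, the standard deviation and the sample size are made-up values, the data are assumed Normal, and the critical values 2.365 and 1.895 are the standard t-table entries for 7 degrees of freedom. It draws many samples, builds a t-based interval from each, and counts how often the interval contains the true mean.

```python
import random
import statistics

def coverage(true_mean=50.0, true_sd=15.0, n=8, t_crit=2.365,
             trials=4000, seed=0):
    """Fraction of simulated t-intervals that contain true_mean.

    All defaults are hypothetical values chosen for illustration.
    t_crit=2.365 is the 97.5th percentile of t with 7 d.f. (n=8).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5  # standard error of the mean
        if mean - t_crit * se <= true_mean <= mean + t_crit * se:
            hits += 1
    return hits / trials

cov_95 = coverage(t_crit=2.365)  # 95% interval
cov_90 = coverage(t_crit=1.895)  # 90% interval: narrower, captures less often
print(f"95% intervals captured the true mean in {cov_95:.1%} of trials")
print(f"90% intervals captured the true mean in {cov_90:.1%} of trials")
```

The two runs use the same simulated samples, so the comparison shows the trade-off directly: the smaller α gives the wider interval and the higher capture rate.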

Example 1: Lead in Water • Lead in drinking water causes serious health problems. • To test contamination, require a control site. • Problems: • Lead concentration in control site? • Estimate 95% confidence interval

Example 2: Gas Market • Recall U.S. gas market question: • By how much does gas consumption decrease when price increases? • Our linear model: consumption = β0 + β1·PG + β2·Y + β3·PNC + β4·PUC + ε • Estimate of β1: -.04237. • How confident are we in this estimate? • Construct 90% C.I. for this estimate

If Data ~ N(μ, σ²) • Since we don’t know σ, use the t-distribution. • 95% C.I. for μ: x̄ ± t97.5 · s • x̄ is the sample mean; s is the standard error of the mean. • t97.5 is the critical value (97.5th percentile) of the t-distribution. • Draw on board (Prob = 2.5% in each tail)

t-distribution • Similar to the Normal distribution • Requires “degrees of freedom”. • d.f. = (# data points) - (# estimated parameters). • E.g. mean of lead concentration, 8 samples, one parameter: d.f.=7. • The higher the d.f., the closer t is to the Normal distribution.
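A small numerical check of the last point, as a sketch: the critical values in the dictionary below are copied from a standard t-table (they are not computed here), and the helper name `degrees_of_freedom` is just an illustrative choice. The Normal critical value comes from Python's standard library.

```python
from statistics import NormalDist

def degrees_of_freedom(n_points, n_params):
    """d.f. = number of data points minus number of estimated parameters."""
    return n_points - n_params

# 97.5th-percentile critical values from a standard t-table, keyed by d.f.
T_975 = {7: 2.365, 10: 2.228, 30: 2.042, 120: 1.980}

# Matching critical value of the standard Normal distribution (about 1.96).
z_975 = NormalDist().inv_cdf(0.975)

# How far each t critical value sits above the Normal one.
gaps = [T_975[df] - z_975 for df in sorted(T_975)]

print("lead example d.f. =", degrees_of_freedom(8, 1))
for df, gap in zip(sorted(T_975), gaps):
    print(f"d.f.={df:>3}  t97.5={T_975[df]:.3f}  excess over Normal={gap:.3f}")
```

The excess shrinks steadily as the degrees of freedom grow, which is why the Normal value is a reasonable approximation for large samples but not for 8 measurements.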

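If Distribution Unknown • Can use “Bootstrapping”. • Draw a sample with replacement from the observed data (typically the same size as the original data set) • Calculate its mean • Repeat many times • Draw a histogram of the resampled means • Calculate the empirical 95% C.I. from the 2.5th and 97.5th percentiles • Requires no previous knowledge of the underlying process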
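The steps above can be sketched as a percentile bootstrap. This is a minimal illustration, not the procedure used to produce the lecture's numbers: the function name `bootstrap_ci`, the number of resamples, the seed, and the eight measurements are all hypothetical, and each resample is assumed to have the same size as the original data.

```python
import random
import statistics

def bootstrap_ci(data, level=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for the mean of data."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, record the mean, repeat many times.
    means = sorted(
        statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the empirical distribution of resampled means.
    alpha = 1.0 - level
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical measurements, for illustration only (not the lecture's data).
sample = [44.1, 58.3, 39.7, 61.2, 47.5, 66.4, 35.9, 54.0]
low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"bootstrap 95% C.I. = [{low:.2f}, {high:.2f}]")
```

The sorted list `means` is the data behind the histogram mentioned above; the interval is simply its central 95%.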

Lead Concentration • 8 lead measurements: • Mean=51.39, s=5.75, t97.5=2.365 • Lower=51.39-(5.75)(2.365) • Upper= 51.39+(5.75)(2.365) • C.I. = [37.8,65.0] • Using bootstrapped samples: • C.I. = [40.8,62.08]
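The t-based interval for the lead example can be reproduced in a few lines. The numbers are the ones quoted above (mean, standard error of the mean, and t97.5 for 7 d.f.); the helper name `t_interval` is an illustrative choice.

```python
def t_interval(estimate, std_error, t_crit):
    """Return (lower, upper) = estimate -/+ t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Values from the lead example: mean 51.39, standard error 5.75, t97.5 = 2.365.
lower, upper = t_interval(51.39, 5.75, 2.365)
print(f"95% C.I. = [{lower:.1f}, {upper:.1f}]")  # prints: 95% C.I. = [37.8, 65.0]
```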

Gas Regression: S-Plus
Coefficients:
                 Value  Std. Error     t value   Pr(>|t|)
(Intercept) -0.0898134   0.0507787  -1.7687217  0.0867802
PG          -0.0423712   0.0098406  -4.3057672  0.0001551
Y            0.0001587   0.0000068  23.4188561  0.0000000
PNC         -0.1013809   0.0617077  -1.6429209  0.1105058
PUC         -0.0432496   0.0241442  -1.7913093  0.0830122
Residual standard error: 0.02680668 on 31 degrees of freedom
Multiple R-Squared: 0.9678838
F-statistic: 233.5615 on 4 and 31 degrees of freedom, the p-value is 0

Gas Price Response • β1=-.04237, s=.00984 • 90% C.I.: t95=1.695 (d.f.=36-5=31, matching the 31 residual degrees of freedom in the regression output) • C.I. = [-.0591,-.0256] • Using bootstrapped samples: • C.I. = [-.063,-.026] • Response is probably between 2.5 gallons and 6 gallons.
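The same arithmetic applies to a regression coefficient. The sketch below uses the PG estimate and standard error from the S-Plus output together with the t95 value quoted above; the function name `coef_interval` is hypothetical. The result agrees with the quoted interval to within about 0.0001, with the small difference most likely due to rounding of the inputs.

```python
def coef_interval(coef, std_error, t_crit):
    """Two-sided interval for a regression coefficient: coef -/+ t_crit * s."""
    margin = t_crit * std_error
    return coef - margin, coef + margin

# PG row of the regression output; t95 = 1.695 as quoted for the 90% C.I.
lower, upper = coef_interval(-0.0423712, 0.0098406, 1.695)
print(f"90% C.I. for the price coefficient = [{lower:.4f}, {upper:.4f}]")

# The whole interval lies below zero, so at this confidence level
# the estimated price response is negative.
print("interval excludes zero:", upper < 0)
```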

Interpretation & Other Facts • We are 95% confident that the true average lead concentration lies in this range: intervals constructed this way capture the true value 95% of the time. • Likewise, we are 90% confident that the true value of β1 lies in its range. • Can also calculate a “confidence region” for 2 or more parameters jointly.
