Class 9: Thurs., Oct. 7

Class 9: Thurs., Oct. 7 • Inference in regression (Ch 10.1-10.2) • Confidence intervals for slope • Hypothesis test for slope • Confidence intervals for mean response • Prediction intervals • Confidence intervals and the polls • I will e-mail HW 5 to you by tomorrow. It will be due Tuesday, Oct. 19th.

CPS Wage-Education Data for March 1988

Inference Based on Sample • The whole Current Population Survey (25,631 men ages 18-70) is a random sample from the U.S. population (roughly 75 million men ages 18-70). • In most regression analyses, the data we have is a sample from some larger (hypothetical) population. We are interested in the true regression line for the larger population. • Inference Questions: • How accurate is the least squares estimate of the slope for the true slope in the larger population? • What is a plausible range of values for the true slope in the larger population based on the sample? • Is it plausible that the slope equals a particular value (e.g., 0) based on the sample? • Regression Applet: Link on web site under Fun Links. Link entitled Simple Linear Regression.
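The first inference question (how accurate is the least squares slope?) can be explored by simulation, in the spirit of the regression applet. The sketch below is only illustrative: the population intercept, slope, noise level and range of X are hypothetical values, not the CPS data, and the function names are made up for this example.

```python
import random
import statistics

def least_squares_slope(x, y):
    """Least squares slope of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

def simulate_sample_slopes(true_intercept, true_slope, noise_sd,
                           sample_size=25, repetitions=2000, seed=0):
    """Draw repeated random samples and return the slope estimated from each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(repetitions):
        # Hypothetical X values (e.g., years of education between 8 and 18)
        x = [rng.uniform(8, 18) for _ in range(sample_size)]
        y = [true_intercept + true_slope * xi + rng.gauss(0, noise_sd) for xi in x]
        slopes.append(least_squares_slope(x, y))
    return slopes

# Hypothetical population values, chosen only for illustration.
slopes = simulate_sample_slopes(true_intercept=5.0, true_slope=0.1, noise_sd=0.5)
print("mean of estimated slopes:", round(statistics.mean(slopes), 4))
print("SD of estimated slopes (typical error):", round(statistics.stdev(slopes), 4))
```

The spread of the estimated slopes across repeated samples is what the standard error of the slope tries to measure from a single sample.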

Figure panels: Full Data Set; Random Sample of Size 25

Confidence Intervals • Confidence interval: A range of values that are plausible for a parameter given the data. • 95% confidence interval: An interval produced by a procedure that, in repeated samples, contains the true parameter 95% of the time. • Approximate 95% confidence interval: Estimate of parameter ± 2*SE(Estimate of parameter). • Approximate 95% confidence interval for slope: $\hat{\beta}_1 \pm 2\,SE(\hat{\beta}_1)$ • For wage-education data, , approximate 95% CI = • Interpretation of 95% confidence interval: It is most plausible that the true slope is in the 95% confidence interval. It is possible but unlikely that the true slope is outside the 95% confidence interval; the confidence interval will fail to contain the true slope only 5% of the time in repeated samples.
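As a rough check on the "95% of the time in repeated samples" interpretation, the sketch below computes the approximate interval (slope estimate ± 2*SE) and counts how often it contains the true slope in simulated samples. The population values are hypothetical, the helper names are invented for this example, and the SE formula used is the standard one for simple linear regression, RMSE divided by the square root of the sum of squared deviations of X.

```python
import math
import random

def fit_line(x, y):
    """Return (intercept, slope, se_slope) from a least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # RMSE: residual standard deviation with n - 2 degrees of freedom
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    return intercept, slope, rmse / math.sqrt(sxx)

def approx_slope_ci(x, y):
    """Approximate 95% CI for the slope: estimate +/- 2*SE."""
    _, slope, se = fit_line(x, y)
    return slope - 2 * se, slope + 2 * se

def coverage_rate(true_slope=0.1, sample_size=25, repetitions=2000, seed=1):
    """Fraction of simulated samples whose interval contains the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # Hypothetical population: intercept 5.0, noise SD 0.5, X between 8 and 18
        x = [rng.uniform(8, 18) for _ in range(sample_size)]
        y = [5.0 + true_slope * xi + rng.gauss(0, 0.5) for xi in x]
        lower, upper = approx_slope_ci(x, y)
        hits += lower <= true_slope <= upper
    return hits / repetitions

rate = coverage_rate()
print("share of intervals containing the true slope:", rate)
```

With samples of size 25 the share should come out near, though typically slightly below, 95%, because the multiplier 2 is only an approximation to the exact value.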

Conf. Intervals for Slope in JMP • After Fit Line, right click in the parameter estimates table, go to Columns and click on Lower 95% and Upper 95%. • The exact 95% confidence interval is close to but not equal to the approximate interval $\hat{\beta}_1 \pm 2\,SE(\hat{\beta}_1)$.

Hypothesis Testing • Simple Linear Regression Model: $E(Y|X) = \beta_0 + \beta_1 X$ • Is the slope equal to 0? • Null hypothesis $H_0: \beta_1 = 0$ • Alternative (research) hypothesis $H_a: \beta_1 \neq 0$ • Test statistic: $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$ • Rough rule: Reject $H_0$ if |t|>=2. Do not reject $H_0$ if |t|<2. • P-values: Find the p-value for the test. The p-value is a measure of the credibility of the null hypothesis. Small p-values give you evidence against the null hypothesis. Large p-values mean the data give no strong evidence against the null hypothesis. • The generally followed rule is to reject the null hypothesis if the p-value is less than 0.05 and not to reject it if the p-value is greater than 0.05.

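Hypothesis Testing in JMP: • The test statistic is a standard error counter: it counts how many standard errors the estimate lies from the hypothesized value. It is the relationship between $\hat{\beta}_1$ and $SE(\hat{\beta}_1)$ that matters, not the size of $\hat{\beta}_1$ itself. • Testing $H_0: \beta_1 = \beta_1^*$ vs. $H_a: \beta_1 \neq \beta_1^*$. Use test statistic $t = (\hat{\beta}_1 - \beta_1^*) / SE(\hat{\beta}_1)$. Reject null hypothesis if |t|>=2.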
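The "standard error counter" idea can be sketched in a few lines. The numbers below are hypothetical, the function names are invented for this example, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for fairly large samples; JMP reports the exact t-based p-value.

```python
from statistics import NormalDist

def slope_t_statistic(estimate, std_error, null_value=0.0):
    """Number of standard errors between the estimate and the hypothesized slope."""
    return (estimate - null_value) / std_error

def rough_decision(t, cutoff=2.0):
    """Rough rule: reject the null hypothesis when |t| >= 2."""
    return "reject H0" if abs(t) >= cutoff else "do not reject H0"

def approx_two_sided_p_value(t):
    """Two-sided p-value from a normal approximation (assumes large n)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Hypothetical slope estimate and standard error, for illustration only.
t = slope_t_statistic(estimate=0.5, std_error=0.25)
print("t =", t)
print("decision:", rough_decision(t))
print("approximate p-value:", round(approx_two_sided_p_value(t), 4))
```

The same estimate with a larger standard error would give a smaller |t|, which is why the size of the estimate alone does not settle the test.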

Logic of Hypothesis Testing: Hypoth. Testing in the Courtroom • Null hypothesis: The defendant is innocent • Alternative hypothesis: The defendant is guilty • The goal of the procedure is to determine whether there is enough evidence to conclude that the alternative hypothesis is true. The burden of proof is on the alternative hypothesis. • A small p-value indicates that there is strong evidence against the null hypothesis. A p-value > 0.05 does not show that the null hypothesis is true, only that there is not strong evidence against the null hypothesis.

Car Price Example • A used-car dealer wants to understand how odometer reading affects the selling price of used cars. • The dealer randomly selects 100 three-year old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning. • carprices.JMP contains the price and number of miles on the odometer of each car.

The used-car dealer has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Ford Tauruses, all equipped with automatic transmission, air conditioning and AM/FM cassette tape players. All of the cars in this lot have about 40,000 miles on the odometer. The dealer would like an estimate of the average selling price of all cars of this type with 40,000 miles on the odometer, i.e., E(Y|X=40,000). • The least squares estimate is $\hat{E}(Y|X=40{,}000) = \hat{\beta}_0 + \hat{\beta}_1 \cdot 40{,}000$.

Confidence Interval for Mean Response • Confidence interval for E(Y|X=40,000): A range of plausible values for E(Y|X=40,000) based on the sample. • Approximate 95% Confidence interval: $\hat{E}(Y|X=X_0) \pm 2\,SE\{\hat{E}(Y|X=X_0)\}$, where $SE\{\hat{E}(Y|X=X_0)\} = RMSE\sqrt{\frac{1}{n} + \frac{(X_0-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$ • Notes about formula for SE: The standard error becomes smaller as the sample size n increases, and the standard error is smaller the closer $X_0$ is to $\bar{X}$. • In JMP, after Fit Line, click red triangle next to Linear Fit and click Confid Curves Fit. Use the crosshair tool by clicking Tools, Crosshair to find the exact values of the confidence interval endpoints for a given X0.
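A minimal sketch of the approximate interval for the mean response follows. It assumes the standard simple-regression standard error, RMSE * sqrt(1/n + (X0 - mean of X)^2 / sum of squared deviations of X), and uses a small set of hypothetical odometer readings and prices rather than the values in carprices.JMP; the function name is invented for this example.

```python
import math

def mean_response_ci(x, y, x0):
    """Approximate 95% CI for E(Y|X=x0): estimate +/- 2*SE.

    Returns (estimate, se, lower, upper).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    estimate = intercept + slope * x0
    # SE shrinks as n grows and as x0 moves toward the mean of x
    se = rmse * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return estimate, se, estimate - 2 * se, estimate + 2 * se

# Hypothetical data: odometer readings (miles) and auction prices (dollars).
odometer = [20000, 25000, 30000, 35000, 40000, 45000, 50000]
price = [15900, 15500, 15300, 14800, 14700, 14200, 14000]

estimate, se, lower, upper = mean_response_ci(odometer, price, 40000)
print("estimated mean price at 40,000 miles:", round(estimate, 1))
print("approximate 95% CI:", round(lower, 1), "to", round(upper, 1))
```

Calling the function at an X0 far from the average odometer reading gives a visibly larger standard error, which matches the second note about the formula.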

A Prediction Problem • The used-car dealer is offered a particular 3-year-old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car. • Best prediction based on least squares estimate: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot 40{,}000$

Range of Selling Prices for Particular Car • The dealer is interested in the range of selling prices that this particular car with 40,000 miles on it is likely to have. • Under the simple linear regression model, Y|X follows a normal distribution with mean $\beta_0 + \beta_1 X$ and standard deviation $\sigma$. The price of a car with 40,000 miles on it will be in the interval $\beta_0 + \beta_1 \cdot 40{,}000 \pm 2\sigma$ about 95% of the time. • Class 5: We substituted the least squares estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and RMSE for $\beta_0$, $\beta_1$ and $\sigma$ and said that the price of a car with 40,000 miles on it will be in the interval $\hat{\beta}_0 + \hat{\beta}_1 \cdot 40{,}000 \pm 2*RMSE$ about 95% of the time. This is a good approximation but it ignores potential error in the least squares estimates.

Prediction Interval • 95% Prediction Interval: An interval that has approximately a 95% chance of containing the value of Y for a particular unit with X=X0, where the particular unit is not in the original sample. • Approximate 95% prediction interval: $(\hat{\beta}_0 + \hat{\beta}_1 X_0) \pm 2\,RMSE\sqrt{1 + \frac{1}{n} + \frac{(X_0-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$ • In JMP, after Fit Line, click red triangle next to Linear Fit and click Confid Curves Indiv. Use the crosshair tool by clicking Tools, Crosshair to find the exact values of the prediction interval endpoints for a given X0.
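The prediction interval can be sketched the same way as the interval for the mean response; the only change is an extra "1 +" under the square root, which accounts for the variation of an individual unit around the line. The sketch assumes the standard formula RMSE * sqrt(1 + 1/n + (X0 - mean of X)^2 / sum of squared deviations of X) and uses hypothetical data, not carprices.JMP; the function name is invented for this example.

```python
import math

def prediction_interval(x, y, x0):
    """Approximate 95% prediction interval for a new Y at X=x0.

    Returns (prediction, se_pred, lower, upper).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    prediction = intercept + slope * x0
    # The leading 1 reflects the scatter of one new unit around the true line
    se_pred = rmse * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return prediction, se_pred, prediction - 2 * se_pred, prediction + 2 * se_pred

# Hypothetical data: odometer readings (miles) and auction prices (dollars).
odometer = [20000, 25000, 30000, 35000, 40000, 45000, 50000]
price = [15900, 15500, 15300, 14800, 14700, 14200, 14000]

prediction, se_pred, lower, upper = prediction_interval(odometer, price, 40000)
print("predicted price at 40,000 miles:", round(prediction, 1))
print("approximate 95% prediction interval:", round(lower, 1), "to", round(upper, 1))
```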

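Comparison of Confidence Intervals for Mean Response and Prediction Intervals • Confidence Interval for Mean Response: $(\hat{\beta}_0 + \hat{\beta}_1 X_0) \pm 2\,RMSE\sqrt{\frac{1}{n} + \frac{(X_0-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$ • Prediction Interval: $(\hat{\beta}_0 + \hat{\beta}_1 X_0) \pm 2\,RMSE\sqrt{1 + \frac{1}{n} + \frac{(X_0-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$ • The prediction interval is wider than the confidence interval for mean response because it is trying to predict the Y for a particular unit with X=X0 rather than the mean for all units with X=X0. • As the sample size (n) becomes large, the width of the confidence interval for mean response goes to zero, whereas the prediction interval approaches (predicted value) ± 2*RMSE, i.e., its half-width goes to 2*RMSE.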
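The large-sample behavior of the two intervals can be seen numerically. The sketch below computes each half-width in units of RMSE, assuming the standard formulas 2*sqrt(1/n + d) for the mean response and 2*sqrt(1 + 1/n + d) for a prediction, where d = (X0 - mean of X)^2 / sum of squared deviations of X. The evenly spaced odometer readings are a hypothetical design, and the function name is invented for this example.

```python
import math

def half_width_multipliers(x, x0):
    """Half-widths of the two approximate 95% intervals at x0, in units of RMSE.

    Returns (mean_response_multiplier, prediction_multiplier).
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    return 2 * math.sqrt(core), 2 * math.sqrt(1 + core)

results = {}
for n in (10, 100, 1000, 10000):
    # Hypothetical design: n readings spread evenly from 20,000 to 60,000 miles
    x = [20000 + 40000 * i / (n - 1) for i in range(n)]
    results[n] = half_width_multipliers(x, 40000)
    print(n, "CI half-width / RMSE:", round(results[n][0], 4),
          "PI half-width / RMSE:", round(results[n][1], 4))
```

The first column shrinks toward zero as n grows while the second settles near 2, because no amount of data removes the unit-to-unit scatter around the line.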

Confidence Intervals and the Polls • Margin of Error = 2*SE(Estimate). • 95% CI for Bush-Kerry difference: • 95% CI for difference between Bush and Kerry’s proportions:

Why Do the Polls Sometimes Disagree So Much?

Validity of Confidence Interval • Polls are conducted by attempting to randomly sample U.S. citizens of voting age. • Mean Estimated Difference in Vote Proportion: Average Estimated Difference in Vote Proportion from repeated random samples. • SE(Estimated Difference in Vote Proportion) is the “typical” amount by which the Estimated Difference in Vote Proportion for one random sample differs from the Mean Estimated Difference in Vote Proportion. • CI for True Difference in Vote Proportion = Estimated Difference in Vote Proportion ± 2*SE(Estimated Difference in Vote Proportion) • The confidence interval’s “95% guarantee” (that 95% of the time it will contain the true difference in vote proportion) holds only if the mean estimated difference in vote proportion equals the true difference in vote proportion. • When the mean estimated difference in vote proportion does not equal the true difference in vote proportion, there is bias.
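The effect of bias on the "95% guarantee" can be illustrated with a small simulation. Everything numeric here is hypothetical: two candidates labeled A and B, made-up population shares, and a made-up biased poll that reaches A's supporters too often. The standard error of the difference assumes both shares come from one simple random sample of size n, in which case SE = sqrt((pA + pB - (pA - pB)^2) / n).

```python
import math
import random

def difference_ci(share_a, share_b, n):
    """Approximate 95% CI for (A's proportion - B's proportion) from one poll."""
    diff = share_a - share_b
    # Both shares come from the same respondents, so they are negatively correlated
    se = math.sqrt((share_a + share_b - diff ** 2) / n)
    return diff - 2 * se, diff + 2 * se

def coverage(true_shares, sampled_shares, n=1000, repetitions=1000, seed=2):
    """Fraction of simulated polls whose CI contains the true A-minus-B difference.

    true_shares: population shares for (A, B, other)
    sampled_shares: shares among the people the poll actually reaches
    """
    rng = random.Random(seed)
    true_diff = true_shares[0] - true_shares[1]
    hits = 0
    for _ in range(repetitions):
        answers = rng.choices(("A", "B", "other"), weights=sampled_shares, k=n)
        lower, upper = difference_ci(answers.count("A") / n,
                                     answers.count("B") / n, n)
        hits += lower <= true_diff <= upper
    return hits / repetitions

truth = (0.48, 0.47, 0.05)     # hypothetical population shares
reached = (0.50, 0.45, 0.05)   # hypothetical poll that over-reaches A's supporters

unbiased = coverage(truth, truth)
biased = coverage(truth, reached)
print("coverage with unbiased sampling:", unbiased)
print("coverage with biased sampling:", biased)
```

In this setup the interval is computed the same way in both cases; only the sampling differs, and the biased poll's intervals miss the true difference far more often than 5% of the time.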

Sources of Bias • See Ch 3.3: pages 252-254 • Undercoverage: some groups in the population are left out of the process of choosing the sample (for an opinion poll conducted by telephone, people without a residential phone are not covered). • Nonresponse: An individual chosen for the sample can’t be contacted or does not cooperate. Major problem in telephone surveys. • Response bias: Respondent’s or interviewer’s behavior may cause bias. Respondents may lie, especially if asked about illegal or unpopular behavior. Race or sex of interviewer can affect responses. • Wording of questions: Has very important influence on survey results. UN Experiment.

Voting Polls • The polls try to predict what will happen in the election. Thus, they must address the question, who is likely to vote? • In exit polling from the 2000 election, 39% of respondents identified themselves as Democrats, 34% as Republicans and 33% as Independents. In 1996, the composition was 39% Democrat, 34% Republican and 27% Independent. • Should the polls adjust their results so that they reflect a voter composition of more Democrats than Republicans? • Gallup doesn’t do much adjustment. • LA Times poll, Zogby’s poll make adjustments.
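To see why the adjustment question matters, the sketch below reweights a poll to a target party composition. The within-party support figures and the raw sample mix are hypothetical; the target mix is the 1996 exit-poll composition quoted in the slide. This is a simplified illustration of the idea, not a description of how any particular polling organization adjusts its results.

```python
def weighted_support(support_by_group, composition):
    """Overall support = sum over groups of (group share) * (support within group)."""
    total_share = sum(composition.values())
    return sum(composition[g] * support_by_group[g] for g in composition) / total_share

# Hypothetical support for one candidate within each party-identification group.
support = {"Democrat": 0.10, "Republican": 0.90, "Independent": 0.50}

# Hypothetical party mix among the poll's respondents.
sample_mix = {"Democrat": 0.33, "Republican": 0.39, "Independent": 0.28}

# Target mix: the 1996 exit-poll composition (39% / 34% / 27%).
target_mix = {"Democrat": 0.39, "Republican": 0.34, "Independent": 0.27}

unadjusted = weighted_support(support, sample_mix)
adjusted = weighted_support(support, target_mix)
print("unadjusted estimate:", round(unadjusted, 3))
print("adjusted to target mix:", round(adjusted, 3))
```

With these made-up inputs the same responses give noticeably different headline numbers depending on the assumed party mix, which is one reason polls that adjust and polls that do not can disagree.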
