Comparing Two Proportions

Statistics Comparing Two Proportions

Be able to state the null and alternative hypotheses for testing the difference between two population proportions. Know how to examine your data for violations of conditions that would make inference about the difference between the two population proportions unwise or invalid. Understand that the formula for the standard error of the difference between two independent sample proportions is based on the principle that when finding the sum or difference of two independent random variables, their variances add. What you will learn

Variances of independent random variables added— • The variance of a sum or difference of independent random variables is the sum of the variances of those variables. Terms

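Sampling distribution-- • The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is, under appropriate assumptions, modeled by a Normal model with mean $\mu = p_1 - p_2$ and standard deviation $SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$, where $q = 1 - p$. Terms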

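Two-proportion z-interval-- • A two-proportion z-interval gives a confidence interval for the true difference in proportions, $p_1 - p_2$, in two independent groups. • The confidence interval is $(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2)$, where $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$ and $z^*$ is a critical value from the standard Normal model corresponding to a specified confidence level. Terms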

Pooling-- • When we have data from different sources that we believe are homogeneous, we can get a better estimate of the common proportion and its standard deviation. We can combine, or pool, the data into a single group for the purpose of estimating the common proportion. The resulting pooled standard error is based on more data and is thus more reliable (if the null hypothesis is true and the groups are truly homogeneous). Terms

Two-proportion z-test-- • Test the null hypothesis $H_0: p_1 - p_2 = 0$ by referring the statistic $z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{pooled}(\hat{p}_1 - \hat{p}_2)}$ to a standard Normal model. Terms

Who do you think is more intelligent, men or women? • Gallup poll of 520 women and 506 men. • 28% of the men thought men were more intelligent. • 14% of the women thought men were more intelligent. • Questions comparing two percentages are much more common than questions about isolated percentages. • Example– Treatment is better than placebo control • Example– This year’s results are better than last year’s. Example

We know the difference between the two proportions in the random sample is 14 percentage points, but what is the true difference? We would like to estimate the true difference and attach a margin of error. For this we need to determine the standard deviation of the sampling distribution model for the difference in the proportions. Comparing two proportions

Remember– The variance of the sum or difference of two independent random variables is the sum of their variances. (Chapter 16). Why will this work? Comparing two proportions

How does this work? Consider grabbing a box of cereal. It claims there are 16 ounces in the box. We know that this is not exact because there is some variance from box to box. When you pour 2 ounces of cereal in a bowl, there will be further variance from bowl to bowl. How much cereal is left in the box? Comparing two proportions

According to our rule, the variance of the amount of cereal left in the box is the sum of the two variances. We need the standard deviation, not the variance, so we take the square root of that sum. Comparing two proportions

Here are the formulas: $Var(X \pm Y) = Var(X) + Var(Y)$ and $SD(X \pm Y) = \sqrt{Var(X) + Var(Y)}$. These formulas apply only when X and Y are independent. Comparing two proportions
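The cereal illustration can be checked with a short simulation. This is only a sketch: the means and standard deviations below are hypothetical values chosen for illustration (the slides give none), and the box fill and bowl pour are assumed to be independent and roughly Normal.

```python
import math
import random
import statistics


def sd_of_sum_or_difference(sd_x, sd_y):
    """SD of X + Y or X - Y for independent X and Y (the variances add)."""
    return math.sqrt(sd_x ** 2 + sd_y ** 2)


# Hypothetical values, not taken from the slides.
BOX_MEAN, BOX_SD = 16.0, 0.3    # ounces in a full box
BOWL_MEAN, BOWL_SD = 2.0, 0.2   # ounces poured into a bowl

rng = random.Random(0)  # fixed seed so the run is repeatable
left_in_box = [
    rng.gauss(BOX_MEAN, BOX_SD) - rng.gauss(BOWL_MEAN, BOWL_SD)
    for _ in range(100_000)
]

theoretical_sd = sd_of_sum_or_difference(BOX_SD, BOWL_SD)
simulated_sd = statistics.pstdev(left_in_box)

print(f"theoretical SD of cereal left: {theoretical_sd:.3f}")
print(f"simulated SD of cereal left:   {simulated_sd:.3f}")
```

Even though the amounts are subtracted, the spread of what is left in the box is larger than either spread alone, which is what the rule predicts.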

The samples can have different sizes and different proportion values. We use subscripts to keep the different values straight. In comparing males and females, we could use the subscripts of M and F or 1 and 2. Comparing two proportions

The standard deviations of the sample proportions are: $SD(\hat{p}_1) = \sqrt{\frac{p_1 q_1}{n_1}}$ and $SD(\hat{p}_2) = \sqrt{\frac{p_2 q_2}{n_2}}$ Comparing two proportions

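The variance of the difference in the proportions is: $Var(\hat{p}_1 - \hat{p}_2) = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$ The standard deviation is: $SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$ Comparing two proportions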

Since we usually don’t know the true values of p1 and p2, we use the sample proportions from the data we are given. We use them to estimate the variances and find the standard error: $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$ Comparing two proportions
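As a quick illustration, the standard error for the Gallup example (28% of 506 men, 14% of 520 women) might be computed as follows. The function name `se_difference` is made up for this sketch.

```python
import math


def se_difference(p1_hat, n1, p2_hat, n2):
    """Standard error of p1_hat - p2_hat for two independent samples."""
    var1 = p1_hat * (1 - p1_hat) / n1
    var2 = p2_hat * (1 - p2_hat) / n2
    return math.sqrt(var1 + var2)  # variances add, then take the square root


# Gallup poll: men first, women second.
se_gallup = se_difference(0.28, 506, 0.14, 520)
print(f"SE of the difference: {se_gallup:.4f}")
```

The result is about 0.025, or 2.5 percentage points.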

Within each group the data should be based on results for independent individuals. • Randomization Condition– • The data in each group should be drawn independently and at random from a homogeneous population or generated by a randomized comparative experiment. • The 10% Condition— • If the data are sampled without replacement, the sample should not exceed 10% of the population. Independence Assumptions

Since we are comparing two groups, we need to add the Independent Groups Assumption. • This is the most important assumption. • Independent Groups Assumption-- • The two groups we are comparing must also be independent of each other. Usually, the independence of the groups from each other is evident in the way data were collected. Independence Assumptions

Each of the groups must be big enough. • Success/Failure Condition— • Both groups are big enough that at least 10 successes and at least 10 failures have been observed in each. Sample Size condition
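A minimal sketch of the Success/Failure check, applied to the Gallup groups. Because only percentages are reported, the counts are reconstructed by rounding n times the sample proportion. The helper name and the small third example are hypothetical.

```python
def success_failure_counts(p_hat, n, minimum=10):
    """Return (successes, failures, condition_met) for one group."""
    successes = round(p_hat * n)   # counts must be whole numbers
    failures = n - successes
    return successes, failures, successes >= minimum and failures >= minimum


men = success_failure_counts(0.28, 506)
women = success_failure_counts(0.14, 520)
too_small = success_failure_counts(0.05, 100)  # hypothetical group that fails

for label, result in [("men", men), ("women", women), ("too small", too_small)]:
    print(label, result)
```

Both Gallup groups pass easily, while a group with only 5 observed successes would not.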

The sampling distribution model for a difference between two independent proportions. • Provided that the sampled values are independent, the samples are independent, and the sample sizes are large enough, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is modeled by a Normal model with mean $\mu = p_1 - p_2$ and standard deviation $SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$ Sampling Distribution

If we have the sampling distribution model and the standard deviation, we have what we need to find the margin of error for the differences in proportions. Sampling Distribution

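Two-proportion z-interval-- • When the conditions are met, we are ready to find the confidence interval for the difference of two proportions, $p_1 - p_2$. The confidence interval is $(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2)$, where we find the standard error of the difference, $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$, from the observed proportions. The critical value z* depends on the particular confidence level, C, that you specify. Sampling Distribution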
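The interval can be sketched in a few lines of Python, here using the Gallup example from earlier in the slides. The function name is hypothetical; the critical value comes from the standard library's `statistics.NormalDist`, and the conditions are assumed to have been checked already.

```python
import math
from statistics import NormalDist


def two_proportion_z_interval(p1_hat, n1, p2_hat, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from two independent samples."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # z* leaves (1 - confidence) / 2 in each tail of the standard Normal model.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    margin = z_star * se
    return diff - margin, diff + margin


# Gallup poll: 28% of 506 men vs. 14% of 520 women.
low, high = two_proportion_z_interval(0.28, 506, 0.14, 520)
print(f"95% CI for p_men - p_women: ({low:.3f}, {high:.3f})")
```

With these inputs the interval runs from roughly 9 to 19 percentage points, so the 14-point sample difference comes with a margin of error of about 5 points.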

Consider this example-- The National Sleep Foundation asked a random sample of 1010 U.S. adults questions about their sleep habits. The study ensured that there was an equal number of men and women. The question about snoring had 995 respondents; 37% of them reported that they snored at least a few nights a week during the past year. Of the 184 people under 30, 26% snored, compared with 39% of the 811 in the older group. Can the difference really be 13%, or is it due to the natural fluctuations in the sample that was chosen? Pooling

This type of question calls for a hypothesis test. What would be the null hypothesis? $H_0: p_1 - p_2 = 0$ or $H_0: p_1 = p_2$. What would be the alternative hypothesis? $H_A: p_1 - p_2 \neq 0$ Pooling

The hypothesis is about a new parameter– the difference in proportions. We need to find the standard error for that. But we can actually do better than the standard error. Pooling

The proportions and the standard deviations are linked. There are two proportions in the standard error formula, but look at the null hypothesis. It claims the proportions are equal. To test the hypothesis, we assume that the null hypothesis is true. This means that there is a single value for $\hat{p}$ in the SE formula. Pooling

How can we do this? If the null hypothesis is true, then among all adults the two groups have the same proportion. Overall, we saw 48 + 318 = 366 snorers out of a total of 184 + 811 = 995 adults who responded to the question. The overall proportion of snorers was $\hat{p}_{pooled} = 366/995 = 0.3678$. Pooling

Pooling-- Combining the counts to get an overall proportion. Whenever we have data from different sources or different groups but we believe that they really came from the same underlying population, we can pool them to get better estimates. Pooling

When we have only proportions and not the counts, as in the snoring example, we have to reconstruct the number of successes by multiplying the sample sizes by the proportions. If these calculations don’t come out to whole numbers, round first. There must have been a whole number of successes to begin with. (This is the only time you round in the middle of a calculation.) Pooling

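We can then put the pooled value into the formula, substituting it for both sample proportions in the standard error formula: $SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_{pooled} \hat{q}_{pooled}}{n_1} + \frac{\hat{p}_{pooled} \hat{q}_{pooled}}{n_2}}$ Pooling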

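Snoring-- $SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{(0.3678)(0.6322)}{184} + \frac{(0.3678)(0.6322)}{811}} \approx 0.0394$ Pooling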
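The pooled calculation for the snoring data can be sketched as below, using the counts quoted in the slides (48 of 184 under 30, 318 of 811 in the older group). The function name is hypothetical, and the two-sided P-value is an assumption of this sketch, since the slides do not state the alternative hypothesis.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test of H0: p1 - p2 = 0 (two-sided)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)            # combine the counts
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided: an assumption
    return p_pooled, se_pooled, z, p_value


# Older group first, so the difference is older minus under 30.
p_pooled, se_pooled, z, p_value = two_proportion_z_test(318, 811, 48, 184)
print(f"pooled proportion = {p_pooled:.4f}")
print(f"pooled SE         = {se_pooled:.4f}")
print(f"z                 = {z:.2f}")
print(f"P-value           = {p_value:.4f}")
```

The observed difference sits more than three pooled standard errors from zero, so a difference this large would be very unusual if the two age groups really snored at the same rate.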

A presidential candidate fears he has a problem with women voters. His campaign staff plans to run a poll to assess the situation. They’ll randomly sample 300 men and 300 women, asking if they have a favorable impression of the candidate. Obviously, the staff can’t know this, but suppose the candidate has a positive image with 59% of males but with only 53% of females. Example-- #1 Page 507

What kind of sampling design is his staff planning to use? This is a stratified random sample, stratified by gender. Example-- #1 Page 507

What difference would you expect the poll to show? We would expect the difference in proportions in the sample to be the same as the difference in proportions in the population, with the percentage of the respondents with a favorable impression of the candidate 6% higher among males. Example-- #1 Page 507

Of course, sampling error means the poll won’t reflect the difference perfectly. What’s the standard error for the difference in the proportions? The standard deviation of the difference in proportions is: $SD(\hat{p}_M - \hat{p}_F) = \sqrt{\frac{(0.59)(0.41)}{300} + \frac{(0.53)(0.47)}{300}} \approx 0.040$ Example-- #1 Page 507

Sketch a sampling model for the size of the difference in proportions of men and women with favorable impressions of this candidate that might appear in a poll like this. [Figure: Normal model for the difference in proportion with favorable impression (Male – Female); axis marked at -6%, -2%, 2%, 6%, 10%, 14%, and 18%, with the 68%, 95%, and 99.7% regions labeled.] Example-- #1 Page 507

Could the campaign be misled by the poll, concluding that there really is no gender gap? Explain. The campaign could certainly be misled by the poll. According to the model, a poll showing little difference could occur relatively frequently. A difference of zero is only 1.5 standard deviations below the expected difference in proportions. Example-- #1 Page 507
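The reasoning in this example can be put in numbers with a short sketch. It takes the supposed true proportions (59% and 53%) and the planned sample sizes (300 each) from the exercise; the variable names are illustrative only.

```python
import math
from statistics import NormalDist

p_men, p_women, n = 0.59, 0.53, 300

expected_diff = p_men - p_women
sd_diff = math.sqrt(p_men * (1 - p_men) / n + p_women * (1 - p_women) / n)

# Normal sampling model for the difference (men minus women) in the poll.
model = NormalDist(mu=expected_diff, sigma=sd_diff)

z_zero = (0 - expected_diff) / sd_diff   # how far "no gap" is from the center
prob_no_gap_or_reversed = model.cdf(0)   # chance the poll shows women >= men

print(f"SD of the difference: {sd_diff:.4f}")
print(f"z-score of a zero difference: {z_zero:.2f}")
print(f"P(poll difference <= 0): {prob_no_gap_or_reversed:.3f}")
```

Under this model, a poll that shows no gap at all, or even a gap in the other direction, would turn up in a noticeable share of samples, which is why the campaign could be misled.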

In October 2000 the U.S. Department of Commerce reported the results of a large-scale survey on high school graduation. Researchers contacted more than 25,000 Americans aged 24 years to see if they had finished high school; 84.9% of the 12,460 males and 88.1% of the 12,678 females indicated that they had high school diplomas. Example-- #4 Page 508

Are the assumptions and conditions necessary for inference satisfied? Explain. • Randomization condition-- • Assume that the samples are representative of all 24-year-old Americans. • 10% condition-- • Although large, the samples are less than 10% of all 24-year-old Americans. • Independent samples condition-- • The sample of men and the sample of women were drawn independently of each other. • Success/Failure condition-- • The samples are very large, certainly large enough for the methods of inference to be used. Example-- #4 Page 508

Create a 95% confidence interval for the difference in graduation rates between males and females. $(\hat{p}_F - \hat{p}_M) \pm z^* \sqrt{\frac{\hat{p}_F \hat{q}_F}{n_F} + \frac{\hat{p}_M \hat{q}_M}{n_M}} = (0.881 - 0.849) \pm 1.960 \sqrt{\frac{(0.881)(0.119)}{12678} + \frac{(0.849)(0.151)}{12460}} = (0.024, 0.040)$ Example-- #4 Page 508

Interpret your confidence interval. We are 95% confident that the proportion of 24-year-old American women who have graduated from high school is between 2.4 and 4.0 percentage points higher than the proportion of American men the same age who have graduated from high school. Example-- #4 Page 508

Does this provide strong evidence that girls are more likely than boys to complete high school? Explain. Since the interval for the difference in proportions of high school graduates does not contain 0, there is strong evidence that women are more likely than men to complete high school. Example-- #4 Page 508

The painful wrist condition called carpal tunnel syndrome can be treated with surgery or less invasive wrist splints. In September 2002, Time magazine reported on a study of 176 patients. Among the half that had surgery, 80% showed improvement after three months, but only 54% of those who used the wrist splints improved. Example– #6 Page 508

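What’s the standard error of the difference in the two proportions? $SE(\hat{p}_{surg} - \hat{p}_{splint}) = \sqrt{\frac{(0.80)(0.20)}{88} + \frac{(0.54)(0.46)}{88}} \approx 0.068$ Example– #6 Page 508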

Construct a 95% confidence interval for this difference. • Randomization condition– • It’s not clear whether or not this study was an experiment. If so, assume that the subjects were randomly allocated to treatment groups. If not, assume that the subjects are representative of all carpal tunnel sufferers. • 10% condition-- • 88 subjects in each group are less than 10% of all carpal tunnel sufferers. • Independent samples condition-- • The improvement rates of the two groups are not related. • Success/Failure condition-- • $n\hat{p}$ (surgery) = 70, $n\hat{q}$ (surgery) = 18, $n\hat{p}$ (splints) = 48, and $n\hat{q}$ (splints) = 40 are all greater than 10, so the samples are large enough. • Since the conditions have been satisfied, we will find a two-proportion z-interval. Example– #6 Page 508

$(\hat{p}_{surg} - \hat{p}_{splint}) \pm z^* \times SE(\hat{p}_{surg} - \hat{p}_{splint}) = (0.80 - 0.54) \pm 1.960 \sqrt{\frac{(0.80)(0.20)}{88} + \frac{(0.54)(0.46)}{88}} = (0.126, 0.394)$ Example– #6 Page 508

State an appropriate conclusion. • We are 95% confident that the proportion of patients who show improvement in carpal tunnel syndrome with surgery is between 12.6% and 39.4% higher than the proportion who show improvement with wrist splints. Example– #6 Page 508
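The intervals reported in Examples #4 and #6 can be reproduced with a short check. This is a sketch under stated assumptions: the helper name is made up, and the male graduation rate is taken as 84.9%, which is the value that reproduces the stated interval of 2.4% to 4.0%.

```python
import math
from statistics import NormalDist


def z_interval_for_difference(p1_hat, n1, p2_hat, n2, confidence=0.95):
    """Return (low, high, margin_of_error) for p1 - p2."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * se
    diff = p1_hat - p2_hat
    return diff - margin, diff + margin, margin


# Example #4: females minus males (male rate assumed to be 84.9%).
grad_low, grad_high, grad_me = z_interval_for_difference(0.881, 12678, 0.849, 12460)

# Example #6: surgery minus wrist splints, 88 patients per group.
carpal_low, carpal_high, carpal_me = z_interval_for_difference(0.80, 88, 0.54, 88)

print(f"Graduation:    ({grad_low:.3f}, {grad_high:.3f}), ME = {grad_me:.4f}")
print(f"Carpal tunnel: ({carpal_low:.3f}, {carpal_high:.3f}), ME = {carpal_me:.4f}")
```

The contrast between the two margins of error shows the effect of sample size: the graduation survey, with more than 12,000 people per group, pins the difference down to within about a percentage point, while the carpal tunnel study, with 88 per group, leaves a margin of roughly 13 points.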
