Chapter 18 Sampling distribution models. math2200.
Sample proportion • Kerry vs. Bush in 2004 • A Gallup Poll • 49% for Kerry • A Rasmussen Poll • 45.9% for Kerry • Why are the answers different? • The sample proportion estimates the population proportion • There is randomness due to sampling
Modeling the Distribution of Sample Proportions • Imagine what would happen to the sample proportions if we were to actually draw many samples. • What would the histogram of all the sample proportions look like? • We would expect the histogram of the sample proportions to center at the true proportion, p, in the population. • The histogram is unimodal, symmetric, and centered at p. • A normal model?
Model • Let X be the number of people voting for Bush in a sample of size n • Then X has a binomial model, Binomial(n,p) • p: the proportion of people for Bush in the entire population, and q = 1 - p • When n is large, we can use the normal approximation • Normal model with mean np and variance npq
Modeling sample proportion • Sample proportion is X/n • Normal model with mean p and variance pq/n, that is, standard deviation sqrt(pq/n)
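The claim that X/n has mean p and standard deviation sqrt(pq/n) can be checked by simulation. The following is a minimal sketch, not part of the original slides: the function name `simulate_sample_proportions`, the seed, and the values p = 0.49, n = 1000 and 2000 repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return each sample's proportion X/n."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_samples):
        # X counts the successes in one sample, i.e. one Binomial(n, p) draw.
        x = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(x / n)
    return proportions

# Illustrative (assumed) values, not data from the slides.
p, n = 0.49, 1000
props = simulate_sample_proportions(p, n, num_samples=2000)

theoretical_sd = (p * (1 - p) / n) ** 0.5
print("mean of simulated proportions:", round(statistics.mean(props), 4))
print("sd of simulated proportions:  ", round(statistics.stdev(props), 4))
print("sqrt(pq/n):                   ", round(theoretical_sd, 4))
```

If the normal model is a good description, the first printed value should sit close to p and the second close to sqrt(pq/n).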
Example • Back to Kerry vs. Bush • Assume that the population proportion voting for Kerry is 49% • X/n has a normal model with mean 0.49 and standard deviation sqrt(0.49 × 0.51/1000) = 0.0158 (n=1000) • 49% is at the center of this model, and 45.9% lies about 2 standard deviations below it, at the edge of the range that covers 95% of samples • So both 49% and 45.9% can plausibly arise from sampling variation alone
Conditions • Normal model is an approximation to the exact model • Use it only when n is large • For example, if n=2, then X/n can only be 0, 0.5 or 1 • Randomization Condition: The sample should be a simple random sample of the population. • 10% Condition: If the sample is drawn without replacement, then the sample size, n, must be no larger than 10% of the population. • Success/Failure Condition: The sample size has to be big enough so that both np and nq are greater than 10.
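The arithmetic behind the poll example can be reproduced in a few lines. This is a sketch under the slide's own assumption that p = 0.49 and n = 1000. The variable names and the two-sided tail calculation are additions for illustration, not something the slides specify.

```python
from statistics import NormalDist

p, n = 0.49, 1000                      # assumed true proportion and sample size
sd = (p * (1 - p) / n) ** 0.5          # sqrt(pq/n)
model = NormalDist(mu=p, sigma=sd)     # normal model for X/n

results = {}
for label, observed in [("Gallup", 0.49), ("Rasmussen", 0.459)]:
    z = (observed - p) / sd
    # Two-sided tail: chance that a sample proportion lands at least this far from p.
    tail = 2 * (1 - model.cdf(p + abs(observed - p)))
    results[label] = (z, tail)
    print(f"{label}: z = {z:.2f}, two-sided tail probability = {tail:.3f}")
```

A z-score near 0 means the poll matches the assumed p. A z-score near -2 means the poll is unusual under this model but not implausible.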
A Sampling Distribution Model for a Proportion • Before we observe the value of the sample proportion, it is a random variable that has a distribution due to sampling variations. • This distribution is called the sampling distribution model for sample proportions. • We never actually take repeated samples from the same population and make a histogram. We only imagine or simulate them. • Still, sampling distribution models are important because • they act as a bridge from the real world of data to the imaginary model of the statistic and • enable us to say something about the population when all we have is data from the real world.
An example • 13% of the population is left-handed. • A 200-seat school auditorium was built with 15 “leftie seats” • In a class of n=90 students, what’s the probability that there will NOT be enough seats for the left-handed students? • Let X be the number of left-handed students in the class • We want to find P(X>15) = P(X/n>0.167)
Check the conditions • n is large enough • randomization • 10% condition • The population should have at least 900 students • Success/failure condition • np=11.7>10, nq=78.3>10 • Normal model for X/n • Mean = 0.13 • Sd = sqrt(pq/n) = 0.035 • z = (0.167 - 0.13)/0.035 ≈ 1.06 • P(X/n>0.167) = P(Z>1.06) = 0.1446
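The left-handed seating calculation can be sketched as follows. The exact binomial comparison is an addition that the slides do not perform, and the variable names are illustrative. Because the code keeps unrounded intermediate values, its normal-model answer will be close to, but not identical to, the 0.1446 obtained from the rounded figures.

```python
from math import comb
from statistics import NormalDist

p, n, seats = 0.13, 90, 15
q = 1 - p

# Success/Failure Condition as stated in the slides: np and nq both above 10.
conditions_ok = n * p > 10 and n * q > 10

# Normal model for the sample proportion X/n.
sd = (p * q / n) ** 0.5
cutoff = seats / n                                   # 15/90, about 0.167
normal_prob = 1 - NormalDist(mu=p, sigma=sd).cdf(cutoff)

# Exact Binomial(n, p) probability P(X > 15), for comparison only.
exact_prob = 1 - sum(comb(n, k) * p**k * q**(n - k) for k in range(seats + 1))

print("conditions satisfied:", conditions_ok)
print(f"normal approximation P(X/n > {cutoff:.3f}) = {normal_prob:.4f}")
print(f"exact binomial       P(X > {seats})     = {exact_prob:.4f}")
```

Comparing the two printed probabilities gives a sense of how much is lost by using the normal approximation at this sample size.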
Sample Mean • The distribution of sample means tends toward a normal model when n is large
Central limit theorem (CLT) • If the observations are drawn • independently • from the same population (distribution), then the sampling distribution of the sample mean becomes approximately normal as the sample size increases. • We do not need to know the shape of the population distribution.
CLT • Suppose the population distribution has mean μ and standard deviation σ • The sample mean has mean μ and standard deviation σ/sqrt(n) • Let X1, …, Xn be n independently and identically distributed random variables • E(X1) = μ • Var(X1) = σ² • Then as n increases, the distribution of (X1+…+Xn)/n tends to a normal model with mean μ and standard deviation σ/sqrt(n)
The Fundamental Theorem of Statistics • The Central Limit Theorem (CLT) • The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. • The larger the sample, the better the approximation will be.
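A small simulation can illustrate the CLT. The sketch below assumes a strongly right-skewed population, an exponential distribution with μ = 1 and σ = 1. That choice, the sample sizes, the seed, and the helper name `sample_means` are all illustrative and do not come from the slides.

```python
import random
import statistics

def sample_means(n, num_samples, rng):
    """Return num_samples sample means, each from n independent exponential(1) draws."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

rng = random.Random(1)
mu, sigma = 1.0, 1.0      # mean and sd of the assumed exponential(1) population

results = {}
for n in (2, 10, 50):
    means = sample_means(n, 5000, rng)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"n={n:>2}: mean of sample means = {results[n][0]:.3f}, "
          f"sd = {results[n][1]:.3f}, sigma/sqrt(n) = {sigma / n ** 0.5:.3f}")
```

The simulated standard deviations should track σ/sqrt(n). A histogram of `means` for each n (not drawn here) would show the skew fading as n grows.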
Example • Suppose the population distribution of adult weights has mean 175 pounds and sd 25 pounds • the shape is unknown • An elevator has a weight limit of 10 persons or 2000 pounds • What’s the probability that the 10 people who get on the elevator overload its weight limit?
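Let Xi, i=1,2,…,10 be the weight of the ith person in the elevator • Then we want to know P(X1+…+X10>2000) = P((X1+…+X10)/10 > 200) • From the CLT (check the requirement first), we know the distribution of (X1+…+X10)/10 is approximately normal with mean 175 pounds and standard deviation 25/sqrt(10) ≈ 7.91 pounds • Then P((X1+…+X10)/10 > 200) ≈ P(Z > (200 - 175)/7.91) = P(Z > 3.16) ≈ 0.0008
Standard error • Using the CLT, we know the distribution of the sample proportion is approximately normal with mean p and standard deviation sqrt(pq/n) • However, we do not know p in practice. • Using the CLT, we know the distribution of the sample mean is approximately normal with mean μ and standard deviation σ/sqrt(n) • However, we do not know μ and σ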
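The elevator probability follows directly from the CLT statement with μ = 175, σ = 25 and n = 10. The sketch below is one way to carry out the arithmetic, with illustrative variable names. It assumes the normal approximation is acceptable at n = 10, which depends on the weight distribution not being too skewed.

```python
from statistics import NormalDist

mu, sigma = 175, 25        # population mean and sd of adult weights (pounds)
n, limit = 10, 2000        # number of riders and total weight limit (pounds)

mean_limit = limit / n     # total > 2000 is the same event as sample mean > 200
sd_mean = sigma / n ** 0.5 # standard deviation of the sample mean, sigma/sqrt(n)
z = (mean_limit - mu) / sd_mean

# Upper-tail probability under the standard normal model.
prob_overload = 1 - NormalDist().cdf(z)

print(f"sd of sample mean = {sd_mean:.2f} pounds")
print(f"z = {z:.2f}")
print(f"P(overload) = {prob_overload:.5f}")
```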
Standard Error • When we don’t know p or σ, we’re stuck, right? • Nope. We will use sample statistics to estimate these population parameters. • Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error.
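Standard Error (cont.) • For a sample proportion, the standard error is sqrt(p̂q̂/n), where p̂ = X/n is the sample proportion and q̂ = 1 - p̂ • For the sample mean, the standard error is s/sqrt(n), where s is the sample standard deviation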
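Both standard errors come from substituting sample estimates into the sampling-distribution formulas: the sample proportion in place of p, and the sample standard deviation s in place of σ. The helper names `se_proportion` and `se_mean` and the data below are hypothetical, chosen only to illustrate the calculation.

```python
import statistics

def se_proportion(successes, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    return (p_hat * (1 - p_hat) / n) ** 0.5

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample sd."""
    return statistics.stdev(values) / len(values) ** 0.5

# Hypothetical poll: 490 of 1000 respondents favor a candidate.
print(f"SE of proportion: {se_proportion(490, 1000):.4f}")

# Hypothetical sample of ten adult weights in pounds.
weights = [160, 182, 175, 150, 201, 168, 190, 177, 165, 183]
print(f"SE of mean: {se_mean(weights):.2f}")
```

Note that `statistics.stdev` uses the n - 1 divisor, which is the usual definition of the sample standard deviation s.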
What Can Go Wrong? • Don’t confuse the sampling distribution with the distribution of the sample. • When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics. • The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples—the one you got and the ones you didn’t get.
What Can Go Wrong? (cont.) • Beware of observations that are not independent. • The CLT depends crucially on the assumption of independence. • You can’t check this with your data—you have to think about how the data were gathered. • Watch out for small samples from skewed populations. • The more skewed the distribution, the larger the sample size we need for the CLT to work.
Summary • Sample proportions and sample means are statistics • They are random because samples vary • Their distributions can be approximated by a normal model using the CLT • Be aware of when the CLT can be used • n is large • If the population distribution is not symmetric, a much larger n is needed • The CLT is about the distribution of the sample mean, not the distribution of the sample itself