Introduction to Inference

Did you ever cook a big pot of soup? When I was growing up, my mother used to make a rather large pot of soup or chili (eight kids). How did she know the soup was going to be good? She would taste a sample to infer what the whole pot would taste like. In statistics we might like to know about a population. Just as my mom took only a sample to infer about the whole pot, we may have only a sample of data, and from the sample we will make an inference about the population. Recall from a previous section that if the distribution of the variable in the population is, say, normal, then we can make probability statements about certain events happening (like the probability that a brand of tire will go 42,000 miles or more before it fails). But to do that we also need to know the population mean and population standard deviation. With only a sample, we will use techniques to infer what the population values might be.

Thought experiment: When you and I think about a coin we usually say the coin has a 50% chance of coming up heads when we flip it, right? But if you flip a coin 10 times, will it come up heads exactly 5 times? Maybe, but probably not. When we say the coin has a 50% chance of coming up heads, what we really mean is that if we flip it a really large number of times (say a million times), the relative frequency of heads will be about 0.50. Summary here: if we flip the coin a lot, it will come up heads about 50% of the time. But we hardly ever do this. For a while we will think about some ideas that we hardly ever carry out in practice, but the ideas underlie all the actual inference we conduct.
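The long-run idea can be tried out on a computer. What follows is only a minimal sketch: the function name `heads_relative_frequency`, the seed, and the flip counts are illustrative choices, not part of the notes. It flips a simulated fair coin more and more times and prints the relative frequency of heads.

```python
import random


def heads_relative_frequency(num_flips, rng):
    """Flip a simulated fair coin num_flips times; return the fraction of heads."""
    heads = sum(1 for _ in range(num_flips) if rng.random() < 0.5)
    return heads / num_flips


# Fixed seed (an arbitrary choice) so that the run is repeatable.
rng = random.Random(2024)
for num_flips in (10, 100, 10_000, 1_000_000):
    freq = heads_relative_frequency(num_flips, rng)
    print(f"{num_flips:>9} flips: relative frequency of heads = {freq:.4f}")
```

With only 10 flips the relative frequency can easily land far from 0.50; with a very large number of flips it should settle close to 0.50.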

Another thought experiment: In the family I was born into there are a total of ten people: mom and dad and eight kids. This family could be considered a population (we usually have bigger populations we are interested in, but we use this as an example). Here is a data set with the age and dominant hand of each person in the family. (This is a made-up family, and the ages have changed over time.)

Person    Age   Dominant hand
Mom       76    right
Dad       77    right
Greg      53    right
Kevin     52    right
Bill      50    right
Bob       49    right
Steve     47    right
Chuck     46    right
Patty     45    left
George    43    right

The population mean age in the family is 53.8 and the population proportion of right-handed folks is 0.9. We could also find the population standard deviation of the age, but I didn’t here. Let’s focus on samples of size three. Say in a random sample we pick Greg, Bill, and Patty. The sample mean age here would be (53 + 50 + 45)/3 = 49.33 and the sample proportion of right-handed people would be 2/3 = 0.67. You will notice that each of these sample point estimates is wrong, in the sense that it does not match the population parameter. But let’s take a different sample of three, say Mom, Kevin, and Steve. We would have a sample mean age of (76 + 52 + 47)/3 = 58.33 and a sample proportion of right-handers of 1.0. Again, neither estimate matches the population parameter.

Before I talked about flipping a coin a lot. By analogy (and the analogy is not perfect), if we take a lot of samples of size 3 we would get many different sample means and sample proportions. If we looked at the distribution of sample means (and sample proportions separately) we would begin to see a pattern in the sample means. The pattern is a distribution called the sampling distribution of sample means and the sampling distribution comes from a repeated sampling context. In practice we only take 1 sample, but there are theoretical distributions of the statistics that a great deal is known about (and we will look at). This is similar to knowing a coin will come up heads 50% of the time.
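Because this family is so small, we do not have to imagine repeated sampling; we can list every possible sample. The sketch below assumes samples of three distinct people drawn without regard to order (the notes do not say how the samples are drawn), and the variable names are illustrative. It computes the mean age of every such sample and then looks at the whole collection of sample means.

```python
from itertools import combinations
from statistics import mean

# Ages from the family table in the notes.
ages = {"Mom": 76, "Dad": 77, "Greg": 53, "Kevin": 52, "Bill": 50,
        "Bob": 49, "Steve": 47, "Chuck": 46, "Patty": 45, "George": 43}

age_values = list(ages.values())
population_mean = mean(age_values)

# One sample mean for every possible group of three family members.
sample_means = [mean(sample) for sample in combinations(age_values, 3)]

print("population mean age:", population_mean)
print("number of possible samples of size 3:", len(sample_means))
print("smallest sample mean:", round(min(sample_means), 2))
print("largest sample mean:", round(max(sample_means), 2))
print("mean of all the sample means:", round(mean(sample_means), 2))
```

Individual sample means range widely, but the mean of all the sample means comes out equal to the population mean. That is the kind of pattern a sampling distribution describes.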

Samples have statistics we might use to learn about population parameters. In a repeated sampling context, the statistics have patterns that we call sampling distributions. Next we turn to thinking about the population mean of a variable when we take a sample and calculate the sample mean.

Confidence Interval for the Population Mean

What a way to start a section of notes – but anyway. Imagine you are at the ground level in front of my house at the curb. The picture below is the view of a sprinkler turned on full blast. The one thing bad about the picture is the sprinkler does not shoot in both directions at once. It shoots left and then right. But I put both for illustrative purposes.

When I put my sprinkler in the center of my yard I can cover the middle 95% of the front yard. Let’s think about an experiment we could undertake. Say you are outside my house late at night when all the lights are out and you are blindfolded. Then we spin you around a lot. Your job then is to put the sprinkler down in the yard. What is the probability that the center of the yard will get wet? Did you say .95? Sure you did and here is why. If, when in the center, 95% of the yard can be hit, then putting the sprinkler at different places in the yard would mean that 95% of the time the center of the yard should be hit. Hope this helps you understand confidence intervals. If not, well, sorry.

Overview

• In this section we study one of the two basic inference methods: confidence intervals (hypothesis testing is the other).
• Confidence intervals are used when our interest is estimating an unknown population parameter.

Another story

Say there are five people in a room and the ages of the people are 18, 19, 20, 21, 22. If this is a population, the population mean is 20. Now let’s think about samples of size 2. Say I got the first two people: 18 and 19. The sample mean would be 18.5, which is not the population mean. Some samples of size two will have sample mean = population mean, some won’t. In the real world we do not know the population mean, but the properties of the distribution of sample means help us learn about the population.
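To see how often a sample of size 2 hits the population mean in this story, we can list all the samples. This is a small sketch that assumes the two people are distinct and that order does not matter; the variable names are illustrative.

```python
from itertools import combinations

ages = [18, 19, 20, 21, 22]
population_mean = sum(ages) / len(ages)

# Every possible pair of distinct people in the room.
samples = list(combinations(ages, 2))
matches = [s for s in samples if sum(s) / 2 == population_mean]

for s in samples:
    print(s, "sample mean =", sum(s) / 2)
print(len(matches), "of", len(samples), "samples have a mean equal to the population mean")
```

Only the pairs (18, 22) and (19, 21) have a sample mean of exactly 20; the other eight samples miss.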

From my example you see that sometimes the sample mean will not be the population mean. So a confidence interval builds in a margin of error around our point estimate in the hope that the interval will include the population mean. The way we calculate the interval is:
1) Take the sample mean.
2) Calculate another value, the margin of error, which I will explain later.
3) Get two numbers: the sample mean minus the margin of error, and the sample mean plus the margin of error.
This interval, from a low value to a high value, is hoped to contain the true unknown population mean.

Summary: the confidence interval for the population mean runs from the sample mean minus the margin of error to the sample mean plus the margin of error.

From the last slide I now reiterate some ideas. The line represents sample means. In our sample we get the one represented by the vertical marker. Then we calculate another value, the margin of error (I show you how later). Subtract this number from the sample mean to get the lower limit of the interval, and add it to the sample mean to get the upper limit of the interval.

[Figure: a number line of sample means, marked from left to right with the lower limit, the sample mean (X), and the upper limit.]

• An example of when we do confidence intervals is when we want to estimate the unknown population mean.
• The inference is based on the sampling distribution of the statistic of interest, here the sample mean.

• In a previous section we saw that the sampling distribution of sample means has properties based on the parameters of the population. Namely, the sampling distribution of sample means
  1) has a normal distribution (exactly so when the population itself is normal, and approximately so when the sample size is large),
  2) has the same mean as the population from which the sample is drawn, and
  3) has a standard deviation, called the standard error, equal to the standard deviation of the population from which the sample was drawn divided by the square root of the sample size.
• These properties will be exploited in this section.
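Properties 2 and 3 can be checked with a quick simulation. This is only a sketch under assumed values: a normal population with mean 50 and standard deviation 10, samples of size 25, and 20,000 repeated samples. None of these numbers come from the notes.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Illustrative values only; none of these numbers come from the notes.
POP_MEAN, POP_SD, SAMPLE_SIZE, NUM_SAMPLES = 50.0, 10.0, 25, 20_000

rng = random.Random(7)  # fixed seed (arbitrary) so the run is repeatable

# Draw many samples and keep the mean of each one.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)

print(f"mean of the sample means: {mean_of_means:.3f} (population mean is {POP_MEAN})")
print(f"standard deviation of the sample means: {sd_of_means:.3f}")
print(f"population sd / sqrt(sample size): {theoretical_se:.3f}")
```

The mean of the simulated sample means should sit close to the population mean, and their standard deviation should sit close to 10/5 = 2.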

Note that the distribution of sample means is “thinner” than the population distribution because of property 3 on the previous screen.

[Figure: the distribution of the quantitative variable in the population, together with the distribution of sample means. The center of both is the mean value of the variable in the population, which is also the mean of the sampling distribution. Two values are marked on the sample means distribution: one standard error on the low side of the mean and one standard error on the high side of the mean.]

Estimating with confidence - overview

• When we do not know the value of a population parameter, we may want to estimate it.
• The population mean is estimated by the sample mean. In fact, we say the sample mean is a point estimate of the population mean.

• Now, when we look just at the sampling distribution of the sample mean, we know this is the long run pattern of the sample mean.
• This is similar to the idea that we don’t know what will come up on the next flip of a coin, but we know heads will come up 50% of the time in the long run.

Estimating with confidence - confidence interval

• A property we learned earlier, combined with our more precise notion of the 68 - 95 - 99.7 rule, is that 95% of sample means lie within 1.96 standard errors of the population mean.
• Imagine 1.96 standard deviations of the sampling distribution (that is, 1.96 standard errors) is the length of my sprinkler in one direction. If in the center 95% of the yard can get wet, then by putting the sprinkler at other parts of the yard the center will get wet 95% of the time.

This is a visual of where the middle 95% of sample means will fall.

[Figure: the distribution of sample means. The mean of the distribution of sample means is the population mean. Starting at the population mean and subtracting 1.96 times the standard deviation of the sample means gives the lower end of the middle 95%; starting at the population mean and adding 1.96 times that standard deviation gives the upper end.]

Even if we do not know the value of the population mean, the center of the distribution of sample means will still be located at the population mean. 95% of the sample means will be within 1.96 standard deviations of the mean. 1.96 standard deviations can be thought of as a distance. We will use this distance to help us make up an interval where we think the unknown population mean will be located.

To get a confidence interval for the unknown population mean we
1. Calculate the sample mean.
2. Calculate 1.96 standard errors.
3. Take the sample mean and subtract 1.96 standard errors. Take the sample mean and add 1.96 standard errors.
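The three steps can be written compactly in symbols. The notation here is an assumption introduced for this display, since the notes spell everything out in words: $\bar{x}$ is the sample mean, $\sigma$ is the population standard deviation, and $n$ is the sample size, so that $\sigma/\sqrt{n}$ is one standard error.

```latex
\[
\text{lower limit} = \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}},
\qquad
\text{upper limit} = \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}
\]
```

The value 1.96 is specific to a 95% interval; other confidence levels use a different multiplier.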

This is the same picture as before, but with new information: one example sample mean is marked.

[Figure: the distribution of sample means with one example sample mean marked.]

95% of the sample means are within 1.96 standard deviations of the true mean. So if we use the same 1.96 standard deviations as a length and place an interval of that length on each side of whatever sample mean we get, the interval should contain the unknown population mean 95% of the time.
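The "95% of the time" claim can be checked by simulation, which is the computer version of setting the sprinkler down at random over and over. The sketch below assumes a normal population with mean 100 and standard deviation 15 and samples of size 36; these values are made up for illustration. It counts how often the interval around the sample mean contains the population mean.

```python
import random
from math import sqrt

# Illustrative values only; none of these numbers come from the notes.
POP_MEAN, POP_SD, SAMPLE_SIZE, NUM_SAMPLES = 100.0, 15.0, 36, 10_000

# 1.96 standard errors: the "length of the sprinkler" in one direction.
margin = 1.96 * POP_SD / sqrt(SAMPLE_SIZE)

rng = random.Random(11)  # fixed seed (arbitrary) so the run is repeatable
hits = 0
for _ in range(NUM_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    # Does the interval around this sample mean contain the population mean?
    if sample_mean - margin <= POP_MEAN <= sample_mean + margin:
        hits += 1

coverage = hits / NUM_SAMPLES
print(f"share of intervals containing the population mean: {coverage:.3f}")
```

The share should come out near 0.95, though not exactly 0.95, because the simulation itself is a finite sample of intervals.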

Example

Say a company is interested in customer satisfaction. It has created a survey such that from the consumer the company gets a score that measures satisfaction. The company would like to know the population mean score. Say the population standard deviation is 20 (this is a heroic assumption, but let’s use it) and a sample of size 100 has been taken. The standard deviation of the sampling distribution of sample means (the standard error) is then 20/√100 = 20/10 = 2. 1.96 standard errors is then 1.96(2) = 3.92.

Say the sample mean (x bar) = 82 (I pulled this number out of thin air; in problems given to you the context will dictate what to use). The 95% confidence interval is found by two calculations:
1) the low value of the interval is 82 - 3.92 = 78.08, and
2) the high value of the interval is 82 + 3.92 = 85.92.
The interval is typically reported by writing (78.08, 85.92).

The way we report what this interval means is to say: “We can be 95% confident that the true unknown population mean is in the interval (78.08, 85.92).” But what we really mean is, “We got these numbers by a method that gives correct results 95% of the time.”
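The arithmetic in this example can be wrapped in a small helper. The function name `z_confidence_interval` and its parameter names are hypothetical, chosen for this sketch; it assumes, as the example does, that the population standard deviation is known.

```python
from math import sqrt


def z_confidence_interval(sample_mean, pop_sd, n, z_star=1.96):
    """Return (low, high) for a confidence interval for the population mean.

    Assumes the population standard deviation pop_sd is known.
    The default z_star of 1.96 gives a 95% interval.
    """
    standard_error = pop_sd / sqrt(n)
    margin = z_star * standard_error
    return sample_mean - margin, sample_mean + margin


# The customer satisfaction example: x bar = 82, sigma = 20, n = 100.
low, high = z_confidence_interval(82, 20, 100)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
# prints: 95% confidence interval: (78.08, 85.92)
```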

Estimating with confidence - critical z

• The z of 1.96 was the z to get a 95% confidence interval. 1.96 is called the critical z, or z*.

[Figure: normal curve divided into four areas, from left to right .025, .475, .475, and .025.]

.025 is the area to the right of the critical z = 1.96. Let’s call this area to the right of the critical z the upper p critical value.

• What if we want a 90% confidence interval?

[Figure: normal curve divided into four areas, from left to right .05, .45, .45, and .05.]

The z we should use is 1.645. Notice that in the Z table an area of .05 in the upper tail is equivalent to an area of .9500 to the left. This does not show up in the table exactly. Tradition says to go halfway between the Zs of 1.64 and 1.65.

Let’s redo the example we did before, but with a 90% confidence interval. Sample mean (x bar) = 82, standard error = 2, and 1.645 standard errors is 1.645(2) = 3.29. The 90% confidence interval is found by two calculations:
1) the low value of the interval is 82 - 3.29 = 78.71, and
2) the high value of the interval is 82 + 3.29 = 85.29.
Or (78.71, 85.29).

Note that when we went from a 95% to a 90% interval the interval shrank. The 90% interval leaves us less confident and we get a smaller interval. This also means a 95% interval leaves us more confident and gives us a bigger interval. If you want to be 100% sure the interval includes the unknown mean, guess that the interval runs from minus infinity to infinity. You can be sure the number is in that rather large interval, but this is not very practical.
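The trade-off between confidence and width shows up directly in the numbers. The sketch below reuses the customer satisfaction example (x bar = 82, population standard deviation 20, n = 100) with the critical z values 1.645 and 1.96 from the text, plus 2.576, the standard table value for a 99% interval. The variable names are illustrative.

```python
from math import sqrt

SAMPLE_MEAN, POP_SD, N = 82, 20, 100
standard_error = POP_SD / sqrt(N)

# Confidence level (in percent) -> critical z.
critical_z = {90: 1.645, 95: 1.96, 99: 2.576}

intervals = {}
for level, z_star in critical_z.items():
    margin = z_star * standard_error
    intervals[level] = (SAMPLE_MEAN - margin, SAMPLE_MEAN + margin)
    low, high = intervals[level]
    print(f"{level}% interval: ({low:.2f}, {high:.2f}), width = {high - low:.2f}")
```

Each step up in confidence uses a larger critical z, so the interval gets wider around the same sample mean.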

99% confidence interval

The z to use if you want a 99% confidence interval is 2.576.
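Instead of reading critical z values from a table, they can be computed. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `critical_z` is a hypothetical choice for illustration.

```python
from statistics import NormalDist


def critical_z(confidence):
    """Return z* so that the middle `confidence` share of the standard
    normal curve lies between -z* and +z*."""
    upper_tail = (1 - confidence) / 2  # area to the right of z*
    return NormalDist().inv_cdf(1 - upper_tail)


for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence: z* = {critical_z(confidence):.3f}")
```

Rounded to three decimals, the results agree with the table values 1.645, 1.960, and 2.576.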

Estimating with confidence - summary

• A C% confidence interval means we can be C% confident the unknown parameter lies within z* standard errors of the sample mean.
• This really means we arrived at these numbers by a method that gives correct results C% of the time.
• Here C is the Confidence Coefficient.

Level of significance - alpha

The book uses the Greek letter alpha to stand for what is called the level of significance. Alpha = 1 - Confidence Coefficient. If we are, for example, 95% confident the interval includes the unknown mean, then alpha = 1 - 0.95 = 0.05: there is a 5% probability the interval will not include the unknown mean.
