Explore how sampling methods can lead to biased results when trying to generalize data from samples to populations. Learn about simple random sampling and its importance in statistical analysis.
Statistic and Parameter • So far in our study of tests of significance we’ve focused on process probabilities. • Is the probability that Buzz will push the correct button more than 0.50? • Is the probability that scissors will be thrown less than 1/3? • With Buzz, the statistic was the proportion of times Buzz pushed the correct button in the trials we looked at (15/16) and the parameter was Buzz’s probability (long-term proportion) of pushing the correct button. • We used this sample statistic to tell us something about the parameter.
Generalization • Now we want to focus more on finite populations instead of an infinite process. • Typically the entire population is not measured for what we are interested in. • We therefore take samples (some subgroup of the population) to give us the information we want about a population. • We want to generalize the information from the sample to the population.
Sampling from a Finite Population Section 2.1
Sampling Students Example 2.1A
Sampling Students • We will look at data collected by the registrar’s office at the College of the Midwest for ALL students for Spring 2011, focusing on the two variables in the spreadsheet below, which shows the first 8 students.
Sampling Students • What type of variable is “On campus”? • What type is Cumulative GPA?
Sampling Students • Here are graphs (a histogram and a bar graph) representing all of the 2919 students at the College of the Midwest for our two variables of interest.
Sampling Students • We usually don’t have information on an entire population (a census) like we do here. • We usually need to make inferences about a population based on a sample. • Suppose a researcher asks the first 30 students he finds on campus one morning (say, 8:00 am outside of Phelps) whether they live on campus. This would be a quick and convenient way to get a sample.
Sampling Students For this scenario: • What is the population? • What is the sample? • What is the parameter? • What is the statistic? • Do you think this quick and convenient sampling method will result in a similar sample proportion to the population proportion?
Sampling Students • The researcher’s sampling method might overestimate the proportion of students who live on campus, because the sample is taken early in the morning, when many students who live off campus might not have arrived yet. • We call this sampling method biased. • A sampling method is biased if statistics from samples consistently over- or underestimate the population parameter.
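The idea that bias shows up across many samples, not in any single one, can be sketched with a small simulation. The population counts (2265 of 2919 students on campus) come from these slides; the early-morning behavior is not measured anywhere in them, so the 3-to-1 encounter weights below are purely an assumption for illustration.

```python
import random

random.seed(1)

# Population from the slides: 2919 students, 2265 of whom live on campus.
N_STUDENTS, N_ON_CAMPUS = 2919, 2265
population = [True] * N_ON_CAMPUS + [False] * (N_STUDENTS - N_ON_CAMPUS)
pi = N_ON_CAMPUS / N_STUDENTS  # population proportion, about 0.776

# ASSUMPTION (not from the slides): at 8:00 am an on-campus student is
# three times as likely to be encountered as an off-campus student.
weights = [3 if on_campus else 1 for on_campus in population]


def srs_proportion(n):
    """Sample proportion from a simple random sample of n students."""
    return sum(random.sample(population, n)) / n


def morning_proportion(n):
    """Sample proportion from the hypothetical early-morning method."""
    # random.choices draws with replacement, a simplification for this sketch.
    return sum(random.choices(population, weights=weights, k=n)) / n


REPS = 1000
srs_mean = sum(srs_proportion(30) for _ in range(REPS)) / REPS
morning_mean = sum(morning_proportion(30) for _ in range(REPS)) / REPS

print(f"population proportion      : {pi:.3f}")
print(f"average p-hat, SRS         : {srs_mean:.3f}")
print(f"average p-hat, morning walk: {morning_mean:.3f}")
```

Under these assumed weights the morning method lands above the parameter on average, while the random samples center on it. A different weighting would change the size of the gap but not the lesson.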
Sampling Students • Bias is a property of a sampling method, not the sample • A method must consistently produce non-representative results to be considered biased • Sampling bias also depends on what is measured • Would the morning sampling method be biased in estimating the average GPA of students at the college? • What about estimating the proportion of students with black hair?
Biased Sampling Let’s take a look at a biased sampling method before we get back to our example.
ESPN Website: What is college basketball's fiercest rivalry? • Connecticut vs. Tennessee (Women) • Duke vs. North Carolina • Hope vs. Calvin • Illinois vs. Missouri • Indiana vs. Purdue • Louisville vs. Kentucky • Penn vs. Princeton • Philadelphia's Big 5 • Oklahoma vs. Oklahoma State • Xavier vs. Cincinnati http://proxy.espn.go.com/chat/sportsnation/polling?event_id=1194
ESPN Website: What is college basketball's fiercest rivalry? • 75.1% Hope vs. Calvin • 9.3% Duke vs. North Carolina • 5.4% Indiana vs. Purdue • 5.2% Philadelphia's Big 5 • 1.7% Penn vs. Princeton • 1.5% Oklahoma vs. Oklahoma State • 0.7% Louisville vs. Kentucky • 0.6% Connecticut vs. Tennessee (Women) • 0.3% Illinois vs. Missouri • 0.3% Xavier vs. Cincinnati • Total Votes: 46,084
Random Sample • To get a sample that represents its population: • You can’t let people select themselves into the sample (as in the basketball poll). • You can’t choose a convenient sample that is clearly not representative of the population (for example, using this class when you are interested in the proportion of college students who major in the social sciences).
Random Sample • A simple random sample is the easiest way to ensure that your sampling method is unbiased. • Remember that a sampling method is biased if statistics from samples consistently over- or underestimate the population parameter. • Hence, an unbiased method of sampling does not have a tendency to over- or underestimate the population parameter.
Simple Random Sample • A simple random sample is like drawing names out of a hat. • Technically, a simple random sample is a way of randomly selecting members of a population so that every sample of a certain size from a population has the same chance of being chosen.
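The "names out of a hat" definition can be checked on a toy example. This is a minimal sketch: the four names are hypothetical, and Python's `random.sample` (which draws without replacement) stands in for the hat.

```python
import random
from collections import Counter

random.seed(2)

# A hypothetical "hat" holding four names; samples have size 2.
hat = ["Ana", "Ben", "Cal", "Dee"]

counts = Counter()
DRAWS = 60_000
for _ in range(DRAWS):
    # random.sample draws without replacement; sort so that
    # ("Ana", "Ben") and ("Ben", "Ana") count as the same sample.
    sample = tuple(sorted(random.sample(hat, 2)))
    counts[sample] += 1

# With 4 names there are 6 possible samples of size 2, so each should
# turn up in roughly 1/6 of the draws.
for sample, count in sorted(counts.items()):
    print(sample, round(count / DRAWS, 3))
```

Every one of the six possible pairs should appear with a relative frequency near 1/6, which is what "every sample of a certain size has the same chance" means in practice.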
Sampling Students • Suppose we have a computer randomly choose 30 students from the list of all students in the college, with the following results: • The proportion of these 30 students that live on campus is p̂ = 0.80. • The average cumulative GPA for these 30 students is x̄ = 3.24. (x̄ is the symbol we use to represent the sample average or mean.)
Sampling Students • Proportion: We have seen that the symbol that corresponds to p̂ but describes the population, not the sample, is π. • Mean or average: Likewise, the symbol that corresponds to x̄ but describes the population, not the sample, is μ. • How do we know if p̂ and x̄ are close to the population values? • A different sample of 30 students would probably give different values.
Sampling Students • We took 5 different SRSs of 30 students • Each sample gives different statistics • This is sampling variability • The values don’t change much: • Average GPAs from 3.22 to 3.40 • Sample proportions from 0.63 to 0.83
Sampling Students • Population parameters: • μ = 3.288 (the mean GPA of ALL students) • π ≈ 0.776 (2265/2919, the proportion of ALL students who live on campus). • The statistics tend to be close to the parameters, with some below and some above.
Sampling Students • We took 1000 SRSs and have graphs of the 1000 sample means (for the GPAs) and 1000 sample proportions (for living on campus). • The mean of each distribution falls near the population parameter
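A rough text version of the proportion graph can be rebuilt from the counts given in these slides (2265 of 2919 students live on campus). The individual GPAs are not available here, so this sketch covers only the on-campus variable, and the exact bars depend on the arbitrary random seed.

```python
import random
from collections import Counter

random.seed(3)

# On-campus status for all 2919 students (2265 live on campus).
population = [1] * 2265 + [0] * (2919 - 2265)

# 1000 simple random samples of 30 students, one proportion per sample.
proportions = [sum(random.sample(population, 30)) / 30 for _ in range(1000)]
mean_of_proportions = sum(proportions) / len(proportions)

# Crude text histogram: one '#' per 10 samples (rounded down).
counts = Counter(round(p, 2) for p in proportions)
for value in sorted(counts):
    print(f"{value:.2f} {'#' * (counts[value] // 10)}")

print(f"mean of sample proportions: {mean_of_proportions:.3f}")
print(f"population proportion     : {2265 / 2919:.3f}")
```

The bars pile up around the population proportion, and the mean of the 1000 sample proportions should sit very close to it.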
Sampling Students • If we took all possible random samples of 30 students from this population, the averages of the statistics would match the parameters exactly. • This distribution of statistics is called a sampling distribution, and it is what we are approximating with our null distributions and our theory-based distributions. • Statistics computed from SRSs cluster around the parameter, so this is an unbiased sampling method: there is no tendency to over- or underestimate the parameter.
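Listing every possible sample of 30 from 2919 students is not feasible, but the "match exactly" claim can be verified on a tiny population. The six values below are hypothetical, chosen only to keep the enumeration small; exact fractions avoid rounding error.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical tiny population (say, six word lengths); not data from the slides.
population = [2, 3, 3, 4, 5, 7]
n = 3

population_mean = Fraction(sum(population), len(population))

# Every possible sample of size n, each equally likely under an SRS.
# combinations() works by position, so the two 3s count as different members.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(len(sample_means), "possible samples")
print("population mean     :", population_mean)
print("mean of sample means:", mean_of_sample_means)
```

The 20 sample means average to exactly the population mean of 4, even though most individual samples miss it.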
Sampling Students • We can generalize when we use simple random sampling because it creates: • A sample that is representative of the population • A sample statistic that is close to the parameter
Sampling Students • If the researcher at the College of the Midwest uses 75 students instead of 30 with the same early-morning sampling method, will it be less biased? • No, selecting more students in the same manner doesn’t fix the tendency to oversample students who live on campus. • A smaller sample that is random is actually more accurate.
Sampling Students • What is an advantage of a larger sample size? • Less sample-to-sample variability • Statistics from different samples cluster more closely around the center of the distribution
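Both points, that a larger sample reduces variability but does not remove bias, can be seen in one sketch. As before, the early-morning method is modeled with an assumed 3-to-1 chance of meeting an on-campus student; that weighting is hypothetical and not taken from the slides.

```python
import random
import statistics

random.seed(4)

population = [True] * 2265 + [False] * (2919 - 2265)
pi = 2265 / 2919  # population proportion living on campus

# ASSUMPTION (not from the slides): on-campus students are three times
# as likely to be encountered early in the morning.
weights = [3 if on_campus else 1 for on_campus in population]


def morning_proportions(n, reps=1000):
    """Sample proportions from repeated hypothetical early-morning samples."""
    return [
        sum(random.choices(population, weights=weights, k=n)) / n
        for _ in range(reps)
    ]


results = {}
for n in (30, 75):
    props = morning_proportions(n)
    results[n] = (statistics.fmean(props), statistics.pstdev(props))
    print(f"n = {n}: mean p-hat = {results[n][0]:.3f}, SD = {results[n][1]:.3f}")

print(f"population proportion = {pi:.3f}")
```

Going from 30 to 75 students shrinks the spread of the sample proportions, yet their center stays well above the population proportion: the larger sample is more consistent, not less biased.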
Notation Check Statistics • x̄ (x-bar) Sample Average or Mean • p̂ (p-hat) Sample Proportion Parameters • μ (mu) Population Average or Mean • π (pi) Population Proportion Remember that statistics summarize a sample and parameters summarize a population.
Learning Objectives for Section 2.1 • Identify the (finite) population and the sample in a statistical study. • Identify parameters and statistics in a statistical study. • Be able to fill in a data table where rows are the observational units and columns are the variables. • Identify when a sampling method might be biased and understand what happens when a sampling method is biased. • Recognize that the types of statistics and graphs used for categorical and quantitative variables differ, and be able to identify which statistics (proportions, means, SDs) and graphs (bar graph, dotplot, histogram) are appropriate for each type of variable.
Learning Objectives for Section 2.1 • State that collecting a representative sample from a population allows for generalizing results of inference procedures from the sample statistic(s) to the population parameter(s). • Recognize that small random samples can be representative of the population; you do not have to have a large proportion of the population in your sample to be representative.
Exploration 2.1A: Sampling Words • We need to sample from a population of interest if it is very large or if it is difficult to measure every single member of the population. • If we were interested in High School GPA for Hope students we would not need to sample. The registrar’s office has all that information. If we were interested in something that has not already been collected, we might want to sample.
Exploration 2.1A: Sampling Words • That being said, in this activity we will be using the words in the Gettysburg Address as our population. • There are fewer than 300 words in this speech, and we could easily look at the entire speech to find out average word length, proportion of words that contain an e, etc. • We will be sampling from this speech not to get information about the population, but to help us learn some things about sampling.
Only picture of Lincoln at Gettysburg. (There is another picture in which there is some dispute as to whether or not two blurry images are that of Lincoln.) (Edward Everett spoke for over two hours. Lincoln followed with his two-minute speech.)
Exploration 2.1A • Select what you think is a representative sample of 10 words from the Gettysburg Address (pg 109). Record your words in the table in question 2. • Make dotplots of both average length and proportion containing e on the board.
Exploration 2.1A • Select a random sample of 10 words from the Gettysburg Address (pg 112). • Again we will make dotplots of both average length and proportion containing e on the board. • Which sample is more representative of the population?
Exploration 2.1A • We should have seen that our simple random sample gave us an unbiased estimate of the population mean and proportion while the self-selected sample was biased.
Exploration 2.1A Are these sampling methods biased? • Close our eyes and blindly point a pencil at 10 words. • Cut all the words out of the book, put them in a hat and draw out 10. • Put all the words on the same size pieces of paper, put them in a hat and draw out 10.
Exploration 2.1A • Now let’s go to the Sampling Words applet and see how: • The sample size changes the variability in the sampling distribution. • The population size doesn’t change the sampling distribution.
Central Limit Theorem • The idea that the distribution of sample means is approximately normal (with predictable mean and standard deviation) when the sample size is large enough is known as the Central Limit Theorem. • In the Gettysburg Address example, we saw that even when the population distribution was skewed and the sample size was fairly small, the distribution of sample means was fairly symmetric.
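The effect described here can be sketched without the applet. The population below is a hypothetical right-skewed set of "word lengths" generated for illustration, not the actual Gettysburg Address data, and the skewness helper is a simple moment-based measure (0 means symmetric).

```python
import random
import statistics

random.seed(5)


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# HYPOTHETICAL right-skewed population of word lengths (not real data):
# mostly short words, with a long tail of longer ones.
population = [1 + int(random.expovariate(0.4)) for _ in range(2000)]

# 2000 simple random samples of 20 words, one sample mean per sample.
sample_means = [
    statistics.fmean(random.sample(population, 20)) for _ in range(2000)
]

pop_skew = skewness(population)
means_skew = skewness(sample_means)

print(f"skewness of population  : {pop_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
print(f"population mean         : {statistics.fmean(population):.2f}")
print(f"mean of sample means    : {statistics.fmean(sample_means):.2f}")
```

The sample means should be far less skewed than the population they came from, and their center should sit close to the population mean.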
Predicting Mean and SD for a Sampling Distribution • Let’s also look at the Sampling Words applet to take samples from different distributions so we can see the Central Limit Theorem at work.
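The slide names the prediction without writing it down. The standard results, stated here as background rather than taken from the slides, assume a simple random sample of size n that is small relative to the population, with σ denoting the population standard deviation (a symbol the slides have not introduced):

```latex
\mu_{\bar{x}} = \mu, \qquad \sigma_{\bar{x}} \approx \frac{\sigma}{\sqrt{n}}
\qquad\text{and}\qquad
\mu_{\hat{p}} = \pi, \qquad \sigma_{\hat{p}} \approx \sqrt{\frac{\pi(1-\pi)}{n}}
```

For the on-campus example, with π ≈ 0.776 and n = 30, this gives a predicted standard deviation of about 0.076 for the sample proportions.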
Review of Section 2.1 • A sampling method is biased if statistics from samples consistently over or under-estimate the population parameter.
Review of Section 2.1 • A simple random sample is the easiest way to ensure that your sampling method is unbiased. • Therefore, if we have a simple random sample, we can generalize our results to the population from which it was drawn. • Even small samples can be representative of a very large population.
Review of Section 2.1 • We saw biased and unbiased sampling in the Gettysburg Address exploration. We also saw that: • When we increase sample size, the variability of our sampling distribution decreases. • This variability can be predicted. • Changing the population size has no effect on variability.
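The last two points can be checked with a short simulation. This is a sketch under stated assumptions: the larger population is a hypothetical college 100 times the size of the College of the Midwest with the same on-campus proportion, and the prediction uses the standard formula sqrt(π(1 − π)/n), which does not appear in the slides.

```python
import math
import random
import statistics

random.seed(6)

N_SAMPLE = 30
pi = 2265 / 2919  # on-campus proportion from the slides


def sd_of_sample_proportions(scale, reps=2000):
    """SD of p-hat over repeated SRSs from a population `scale` times as large."""
    population = [1] * (2265 * scale) + [0] * (654 * scale)
    props = [sum(random.sample(population, N_SAMPLE)) / N_SAMPLE for _ in range(reps)]
    return statistics.pstdev(props)


predicted = math.sqrt(pi * (1 - pi) / N_SAMPLE)
sd_small = sd_of_sample_proportions(1)    # 2,919 students
sd_large = sd_of_sample_proportions(100)  # 291,900 students (hypothetical)

print(f"predicted SD          : {predicted:.3f}")
print(f"SD, population 2,919  : {sd_small:.3f}")
print(f"SD, population 291,900: {sd_large:.3f}")
```

Both simulated standard deviations should fall close to the predicted value, regardless of the hundredfold difference in population size.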
[Figures: population distribution of word lengths; distribution of average word length from samples of size 20] When we sample from a population, calculate a sample mean, and then repeat this process over and over again, the distribution of sample means will look bell-shaped under certain conditions.
Section 2.2: Inference for a Single Quantitative Variable • Using methods similar to what we did in the last section, we will see how a null distribution for a single quantitative variable can be obtained and even predicted.
Example 2.2: Estimating Elapsed Time • Students in a stats class (for their final project) collected data on students’ perception of time. • Subjects were told that they’d listen to music and be asked questions when it was over. • The researchers played 10 seconds of the Jackson 5’s “ABC” and asked subjects how long they thought it lasted. • Can students accurately estimate the length?