Learn to estimate sampling error in population studies for precise results. Understand sampling distribution, standard error, and confidence intervals.
Ch 4 Estimating with uncertainty
Recall: • In chapter 3, we learned about • mean, median, mode • (different measures of population or sample location/central tendency) • and standard deviation / variance / interquartile range • (different measures of population or sample spread)
The problem: • when we wish to learn about a population: • we take samples • but we can’t tell how much sampling error is involved • (recall: sampling error is that error due to chance in sampling a variable population) • we must estimate how much sampling error is involved • this will give us an idea of the precision of our estimate
But how do we estimate the sampling error of our estimate? • (we are going to ignore other sources of error – for the moment) • we can study the magnitude of sampling error by pretending it is occurring in a known population • In fact, many researchers have done this • so we have a large body of work that helps us to understand – and PREDICT – the magnitude of sampling error
Our motivating example: • the human genome • DNA sequences of all 23 human chromosomes • published in 2005 • (www.ensembl.org for genome information) • 20290 genes (build 35, they’re on 37 now)
Frequency distribution of gene length (number of nucleotides) • [Figure: histogram of gene lengths; y-axis: relative frequency] • Note: this is the population. µ=2622; σ=2036.9
To study the magnitude of sampling error, • we sample from the population • Let’s choose 100 genes (with associated lengths) at random • (recall how we would do this if we had the spreadsheet of genes in front of us)
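100 genes chosen at random: a sample • This sample has Ybar = 2411.8 (compared to µ=2622) and s = 1463.5 (compared to σ=2036.9) • In this particular sample, both estimates fall below the population parameters. That gap is sampling error: another random sample would give different values, and could land above the parameters instead.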
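The sampling step can be sketched in Python. The gene-length spreadsheet is not reproduced here, so the population below is a synthetic right-skewed stand-in (log-normal draws), not the real genome data; the names `population` and `draw_sample`, the log-normal parameters, and the seed are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch is repeatable

# Synthetic stand-in for the 20290 gene lengths (NOT the real genome data).
population = [random.lognormvariate(7.5, 0.8) for _ in range(20290)]

def draw_sample(pop, n):
    """Simple random sample of n values, drawn without replacement."""
    return random.sample(pop, n)

sample = draw_sample(population, 100)

mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population SD (divides by N)
ybar = statistics.fmean(sample)        # sample mean
s = statistics.stdev(sample)           # sample SD (divides by n - 1)

print(f"population: mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"sample:     Ybar = {ybar:.1f}, s = {s:.1f}")
```

Changing the seed gives a different sample, and with it a different Ybar and s, which is the sampling error the slides describe.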
What if we were to repeat this sampling process, over and over? • Every time, we would choose a number of genes at random (could be 100, could be any number), and for that sample, we would calculate a mean (Ybar) and a standard deviation (s). • [Figure: sampling distribution of the sample mean; y-axis: frequency]
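Sampling distribution of the mean • can be interpreted as a probability distribution: it shows how likely each value of Ybar is for a single random sample • Where does the population mean fall? (At the center: the sample means cluster around µ.)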
Effect of samples of different sizes • Larger samples yield more precise estimates: the sampling distribution has lower spread (and so lower sampling error)
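Repeated sampling at several sample sizes can be simulated as a rough check on this claim. This is a minimal sketch under assumptions: the population is again a synthetic right-skewed stand-in rather than the real gene lengths, and `sampling_distribution`, the sample sizes, and the number of repeats are illustrative choices.

```python
import random
import statistics

random.seed(2)  # arbitrary seed so the sketch is repeatable

# Synthetic right-skewed stand-in for the gene-length population (an assumption).
population = [random.lognormvariate(7.5, 0.8) for _ in range(20290)]

def sampling_distribution(pop, n, repeats=1000):
    """Sample means from `repeats` independent random samples of size n."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(repeats)]

mu = statistics.fmean(population)
print(f"population mean = {mu:.1f}")

spread = {}
for n in (20, 100, 500):
    means = sampling_distribution(population, n)
    spread[n] = statistics.stdev(means)  # spread of the sampling distribution
    print(f"n = {n:3d}: mean of sample means = {statistics.fmean(means):.1f}, "
          f"SD of sample means = {spread[n]:.1f}")
```

The mean of the sample means should sit close to the population mean at every n, while the SD of the sample means shrinks as n grows.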
Standard error • a measure of sampling error: the standard deviation of an estimate's sampling distribution • easy to calculate for the sample mean: σ_Ybar = σ/√n • standard error decreases as n increases
Standard error of Ybar • σ is rarely known, so we estimate the standard error from the sample • also easy to calculate: SE_Ybar = s/√n • Recall our sample of 100 genes from the human genome: • This sample has Ybar = 2411.8 and s = 1463.5 and SE = 1463.5/√100 = 1463.5/10 = 146.3 • Report: Ybar = 2411.8 ± 146.3 • This allows us to report an estimate of the error in the Ybar estimate
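The calculation on this slide can be reproduced in a few lines. The function name `standard_error` is an assumption for illustration; the input values are the ones quoted for the sample of 100 genes.

```python
import math

def standard_error(s, n):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

ybar, s, n = 2411.8, 1463.5, 100  # values quoted on the slide
se = standard_error(s, n)
print(f"Ybar = {ybar} ± {se:.2f}")

# Quadrupling the sample size halves the standard error (same s assumed).
se_400 = standard_error(s, 400)
print(f"SE with n = 400 would be {se_400:.2f}")
```

With the rounded inputs this gives about 146.35, which agrees with the slide's 146.3 to within rounding.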
Another way to estimate error/precision: • confidence intervals • Note that the sampling distribution of the mean was approximately normal, even though the frequency distribution of the human genome data was right-skewed • Remember also, from chapter 3, the rule of thumb about normally distributed data: about 95% of data will fall within 2 standard deviations of the mean • We extend that here, using the standard error rather than plain old s: • about 95% of sample means will fall within 2 standard errors of the population mean
Rule of thumb in practice • About 95% of sample means will fall within 2 SE of the population mean • In practice: for any sample you take, the approximate 95% confidence interval is (Ybar − 2SE, Ybar + 2SE) • For about 95% of confidence intervals calculated this way, the population mean falls inside the confidence interval • Recall our sample of 100 genes from the human genome project: • CI: (2411.8 − 2×146.3, 2411.8 + 2×146.3) = (2119.2, 2704.4), which does contain µ = 2622
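The 2SE rule can be sketched in code, along with a rough simulation of how often such intervals capture the population mean. The helper names (`two_se_interval`, `covers`), the synthetic right-skewed population, and the number of trials are assumptions for illustration; the first calculation uses the numbers from the slide.

```python
import math
import random
import statistics

def two_se_interval(ybar, se):
    """Approximate 95% confidence interval from the 2SE rule of thumb."""
    return (ybar - 2 * se, ybar + 2 * se)

# Slide example: Ybar = 2411.8, SE = 146.3
low, high = two_se_interval(2411.8, 146.3)
print(f"CI: ({low:.1f}, {high:.1f})")  # CI: (2119.2, 2704.4)

# Coverage check on a synthetic population (NOT the real genome data).
random.seed(3)
population = [random.lognormvariate(7.5, 0.8) for _ in range(20290)]
mu = statistics.fmean(population)

def covers(pop, pop_mean, n):
    """Draw one sample of size n and report whether its 2SE interval holds pop_mean."""
    sample = random.sample(pop, n)
    ybar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    lo, hi = two_se_interval(ybar, se)
    return lo <= pop_mean <= hi

trials = 2000
coverage = sum(covers(population, mu, 100) for _ in range(trials)) / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The simulated fraction should land near 0.95, though it can fall somewhat short when the population is strongly skewed and n is modest, since the rule relies on the sampling distribution being close to normal.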
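Pseudoreplication • from “pseudo” and “replicate” • A replicate is an additional measurement; pseudoreplication is treating measurements that are not independent as if they were independent replicates • Example: We are interested in average blood pressure in men over age 65 • We find 10 men over age 65, and take their blood pressures, once in the morning, and once in the evening (total of 20 measurements). • Are these 20 measurements independent of one another? • No: two readings from the same man are more alike than readings from different men, so the sample size is 10 men, not 20 measurements • Sampling design is still really important!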
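The effect of pseudoreplication on the standard error can be illustrated with made-up numbers. Everything in this sketch is hypothetical: the blood-pressure values, the between-man and within-man spreads, and the variable names are invented for illustration only.

```python
import math
import random
import statistics

random.seed(4)  # arbitrary seed so the sketch is repeatable

# Hypothetical data: 10 men, each with his own typical systolic pressure,
# measured twice (morning, evening) with small within-person variation.
men = []
for _ in range(10):
    typical = random.gauss(140, 15)
    men.append((typical + random.gauss(0, 4), typical + random.gauss(0, 4)))

# Pseudoreplicated analysis: treats all 20 readings as independent.
readings = [bp for pair in men for bp in pair]
naive_se = statistics.stdev(readings) / math.sqrt(len(readings))

# Analysis on the real units of replication: one averaged value per man.
per_man = [statistics.fmean(pair) for pair in men]
honest_se = statistics.stdev(per_man) / math.sqrt(len(per_man))

print(f"SE treating 20 readings as independent: {naive_se:.2f}")
print(f"SE using one mean per man (n = 10):     {honest_se:.2f}")
```

Treating the 20 readings as independent gives a smaller standard error, which overstates the precision of the estimate; averaging within each man first uses the correct sample size of 10.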