Z-distribution • Central limit theorem
Sweet demonstration of the sampling distribution of the mean
R-code – sampling distribution exact
data.set <- c(6,4,5,3,10,3,5,3,6,5,4,8,7,2,8,5,8,5,4,0)
mean(data.set)
sd(data.set)*sqrt(19/20)  # population standard deviation (divides by N instead of N - 1)
(sd(data.set)*sqrt(19/20))/sqrt(20)
sample_size <- 5
samps <- combn(data.set, sample_size)  # every possible sample of size 5, one per column
xbars <- colMeans(samps)               # mean of each sample
barplot(table(xbars))
R-code (sampling distribution simulated)
data.set <- c(6,4,5,3,10,3,5,3,6,5,4,8,7,2,8,5,8,5,4,0)
sample_size <- 3
number_of_samples <- 20
samples <- replicate(number_of_samples, sample(data.set, sample_size, replace = TRUE))  # one sample per column
out <- colMeans(samples)
mean(out)
sd(out)
barplot(table(out))
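As a rough sketch of what the simulation is approximating, the spread of the simulated sample means can be compared with the theoretical standard error σ/√n for sampling with replacement. The seed and the three values for the number of samples are assumptions chosen only for illustration.

```r
data.set <- c(6,4,5,3,10,3,5,3,6,5,4,8,7,2,8,5,8,5,4,0)
sigma <- sd(data.set) * sqrt(19/20)   # population SD, as on the previous slide
sample_size <- 3
theoretical_se <- sigma / sqrt(sample_size)

set.seed(1)  # assumption: fixed seed so that reruns give the same numbers
numbers_of_samples <- c(20, 200, 20000)  # assumption: illustrative values
simulated_se <- sapply(numbers_of_samples, function(k) {
  samples <- replicate(k, sample(data.set, sample_size, replace = TRUE))
  sd(colMeans(samples))  # spread of the k simulated sample means
})

# Simulated standard errors next to the theoretical one
data.frame(numbers_of_samples, simulated_se, theoretical_se)
```

With only 20 samples the simulated value can be noticeably off; with many samples it should settle close to σ/√n.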
Statistical inference • If we can’t conduct a census, we collect data from a sample of the population. • Goal: draw conclusions about that population.
Demonstration problem • You sample 36 apples from your farm’s harvest of over 200 000 apples. The mean weight of the sample is 112 grams (with a 40 gram sample standard deviation). • What is the probability that the mean weight of all 200 000 apples is between 100 and 124 grams?
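What is the question? • We would like to know the probability that the population mean is within 12 of the sample mean: $P(|\mu - \bar{x}| < 12)$. • But this is the same thing as the probability that the sample mean is within 12 of the population mean: $P(|\bar{x} - \mu| < 12)$. • But this is the same thing as $P(|\bar{x} - \mu_{\bar{x}}| < 12)$, because the mean of the sampling distribution $\mu_{\bar{x}}$ equals the population mean $\mu$. • So, if I am able to say how many standard deviations of the sampling distribution I am away from $\mu_{\bar{x}}$, I can use the Z-table to figure out the probability.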
Slight complication • There is one caveat, can you see it? • We don’t know the standard deviation of the sampling distribution (the standard error). We only know it equals $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, but $\sigma$ is unknown. • What we’re going to do is estimate $\sigma$. The best thing we can use is the sample standard deviation $s$, which equals 40. • $SE = s / \sqrt{n} = 40 / \sqrt{36} = 6.67$. This is our best estimate of the standard error. • Now you finish the example. What is the probability that the population mean lies within 12 of the sample mean if the SE equals 6.67? • $z = 12 / 6.67 = 1.8$, and the Z-table gives 92.82%.
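The apple example can be checked in a few lines of R. This is only a sketch of the arithmetic on the slide; the variable names are made up for illustration, and `pnorm()` stands in for the Z-table lookup.

```r
n <- 36            # sample size
xbar <- 112        # sample mean in grams
s <- 40            # sample standard deviation in grams
half_width <- 12   # "within 12 of the sample mean"

se <- s / sqrt(n)              # estimated standard error, about 6.67
z <- half_width / se           # number of standard errors, 1.8
prob <- pnorm(z) - pnorm(-z)   # area between -z and +z under the standard normal curve
prob                           # about 0.928
```

The result differs from 92.82% only in the fourth decimal place, because a printed Z-table rounds the area at z = 1.8 to 0.9641.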
This is neat! • You sample 36 apples from your farm’s harvest of over 200 000 apples. The mean weight of the sample is 112 grams (with a 40 gram sample standard deviation). What is the probability that the population mean weight of all 200 000 apples is between 100 and 124 grams? • We started with very little information (we know just the sample statistics), but we can infer that, with a probability of 92.82%, the population mean lies within 12 of our sample mean!
Point vs. interval estimate • You sample 36 apples from your farm’s harvest of over 200 000 apples. The mean weight of the sample is 112 grams (with a 40 gram sample standard deviation). • Goal: estimate the population mean. • The population mean is estimated by the sample mean, i.e. we say the population mean equals 112 g. This is called a point estimate (Czech: bodový odhad). • However, we can do better. We can estimate that the true population mean lies, with 95% confidence, within the interval $112 \pm 1.96 \cdot 6.67$, i.e. roughly (98.9, 125.1) g (interval estimate).
Confidence interval • This type of result is called a confidence interval (Czech: interval spolehlivosti, konfidenční interval). • The number of standard errors you add/subtract depends on the confidence level, e.g. 95% (Czech: hladina spolehlivosti). • $\bar{x} \pm z^{*} \cdot SE$, where $z^{*}$ is the critical value (Czech: kritická hodnota) and $z^{*} \cdot SE$ is the margin of error (Czech: možná odchylka).
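To make the pieces concrete, here is a small sketch that builds a 95% confidence interval for the apple example from the critical value and the margin of error. The variable names are illustrative, and it assumes the normal approximation used so far.

```r
xbar <- 112; s <- 40; n <- 36   # apple sample statistics from the demonstration problem
conf_level <- 0.95

se <- s / sqrt(n)                          # standard error
z_star <- qnorm(1 - (1 - conf_level) / 2)  # critical value, about 1.96
margin <- z_star * se                      # margin of error
ci <- c(lower = xbar - margin, upper = xbar + margin)
round(ci, 2)
```

Changing `conf_level` changes only the critical value; the standard error stays the same.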
Confidence level • The desired level of confidence is set by the researcher (not determined by the data). • If you want to be 95% confident in your results, you add/subtract 1.96 standard errors (the empirical rule says about 2 standard errors). • 95% confidence interval (Czech: 95% interval spolehlivosti)
Confidence level   80%    90%    95%    99%
Critical value     1.28   1.64   1.96   2.58
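The critical values in this table do not have to be memorised; as a sketch, they can be recovered from the standard normal quantile function `qnorm()`. For a confidence level c, the two tails share the remaining 1 - c, so the critical value is the quantile at 1 - (1 - c)/2.

```r
conf_levels <- c(0.80, 0.90, 0.95, 0.99)
z_star <- qnorm(1 - (1 - conf_levels) / 2)  # two-sided critical values
data.frame(confidence_level = conf_levels, critical_value = round(z_star, 2))
```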
Small sample size confidence intervals • The blood pressure of 7 patients was measured after they had been given a new drug for 3 months. They had blood pressure increases of 1.5, 2.9, 0.9, 3.9, 3.2, 2.1 and 1.9. Construct a 95% confidence interval for the true expected blood pressure increase for all patients in the population.
CLT consequence • A change in blood pressure is a biological process. It’s going to be a sum of thousands or millions of microscopic processes. • Generally, biological and physical processes can be viewed as being affected by a large number of random subprocesses with individually small effects. • The sum of all these random components creates a random variable that converges to a normal distribution regardless of the underlying distribution of the processes causing the small effects. • Thus, the Central Limit Theorem explains the ubiquity of the famous "Normal distribution" in the measurements domain.
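A small simulation can illustrate this claim. In the sketch below each "measurement" is the sum of 1000 small random effects drawn from a strongly skewed (exponential) distribution; the choice of distribution, the counts and the seed are all assumptions made for illustration.

```r
set.seed(42)        # assumption: fixed seed for reproducibility
n_effects <- 1000   # small random effects per measurement
n_obs <- 5000       # number of simulated measurements

# Each effect is exponential with rate 1, shifted to have mean 0 (its variance is 1)
totals <- replicate(n_obs, sum(rexp(n_effects, rate = 1) - 1))

# Standardize the sums: the variance of a sum of n_effects such terms is n_effects
z <- totals / sqrt(n_effects)

# For a normal distribution about 95% of values fall within 1.96 standard deviations
share_within <- mean(abs(z) < 1.96)
share_within
```

Although the individual effects are far from normal, the share of standardized sums within ±1.96 should come out close to 0.95, and `hist(z)` should look bell-shaped.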
We will assume that our population distribution is normal, with mean $\mu$ and standard deviation $\sigma$. • We don’t know anything about this distribution, but we have a sample. Let’s figure out everything we can about this sample: • $n = 7$, $\bar{x} \approx 2.34$, $s \approx 1.04$ • We’ve been estimating the true population standard deviation $\sigma$ with our sample standard deviation $s$. • However, we are estimating our standard deviation with $n$ of only 7! This is probably not going to be a very good estimate. • In general, if $n < 30$ this is considered a bad estimate.
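The sample statistics for the seven blood pressure increases can be computed directly; this sketch only uses the data given in the problem, with illustrative variable names.

```r
bp <- c(1.5, 2.9, 0.9, 3.9, 3.2, 2.1, 1.9)  # blood pressure increases from the problem

n <- length(bp)     # sample size
xbar <- mean(bp)    # sample mean
s <- sd(bp)         # sample standard deviation (divides by n - 1)
se <- s / sqrt(n)   # estimated standard error of the mean

round(c(n = n, xbar = xbar, s = s, se = se), 3)
```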
William Sealy Gosset aka Student • 1876-1937 • an employee of the Guinness brewery • his 1908 papers addressed the brewer's concern with small samples • "The probable error of a mean". Biometrika 6 (1): 1–25. March 1908. • "Probable error of a correlation coefficient". Biometrika 6 (2/3): 302–310. September 1908.
Student t-distribution • Instead of assuming a sampling distribution is normal we will use a Student t-distribution. • It gives a better estimate of your confidence interval if you have a small sample size. • It looks very similar to a normal distribution, but it has fatter tails to indicate the higher frequency of outliers which come with a small data set.
Student t-distribution • df: degrees of freedom (Czech: stupeň volnosti)
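To see the fatter tails in numbers, the following sketch compares the t-distribution with the standard normal for a few degrees of freedom. The particular df values and the cut-off of 3 are arbitrary choices for illustration.

```r
df <- c(2, 6, 10, 30, 100)   # assumption: illustrative degrees of freedom

t_star <- qt(0.975, df)      # 95% two-sided critical values of the t-distribution
z_star <- qnorm(0.975)       # the corresponding normal critical value, about 1.96

tail_t <- 2 * pt(-3, df)     # P(|T| > 3) for each df
tail_z <- 2 * pnorm(-3)      # P(|Z| > 3) for the standard normal

data.frame(df, t_star = round(t_star, 3), tail_t = round(tail_t, 4))
```

The critical values are always larger than the normal one and shrink towards it as df grows, while the tail probabilities stay above the normal tail probability.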
Back to our case • Because the sample size is small, the sampling distribution of the mean won’t be normal. Instead, it will have a Student t-distribution with $df = n - 1 = 6$. • Construct a 95% confidence interval, please.
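One possible way to work the exercise is sketched below: the interval is built by hand from the t critical value and then compared with R's built-in `t.test()`. Variable names are illustrative.

```r
bp <- c(1.5, 2.9, 0.9, 3.9, 3.2, 2.1, 1.9)  # blood pressure increases from the problem

n <- length(bp)
df <- n - 1                       # degrees of freedom, 6
t_star <- qt(0.975, df)           # critical value for 95% confidence, about 2.447
margin <- t_star * sd(bp) / sqrt(n)
ci <- mean(bp) + c(-1, 1) * margin
round(ci, 2)                      # roughly 1.38 to 3.31

# The built-in one-sample t-test reports the same interval
ci_builtin <- as.numeric(t.test(bp, conf.level = 0.95)$conf.int)
```

Using the normal critical value 1.96 instead of 2.447 would give a narrower interval that overstates how precisely the mean is known from only seven patients.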
Just to summarize, the margin of error depends on • the confidence level (95% is common) • the sample size • as the sample size increases, the margin of error decreases • For a bigger sample we have a smaller interval in which we’re pretty sure the true population mean lies. • the variability of the data (i.e. on σ) • more variability increases the margin of error • The margin of error does not measure anything other than chance variation. • It doesn’t measure any bias or errors that happen during the process. • It does not tell you anything about the correctness of your data!!!
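These three dependencies can be tried out with a small helper. `margin_of_error()` is a hypothetical function written for this illustration (it is not part of base R), and it assumes the normal critical value is appropriate.

```r
# Hypothetical helper: margin of error = critical value * sigma / sqrt(n)
margin_of_error <- function(sigma, n, conf_level = 0.95) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  z_star * sigma / sqrt(n)
}

by_n     <- margin_of_error(40, c(36, 144, 576))         # larger samples
by_level <- margin_of_error(40, 36, c(0.80, 0.95, 0.99)) # higher confidence
by_sigma <- margin_of_error(c(20, 40, 80), 36)           # more variability

round(by_n, 2); round(by_level, 2); round(by_sigma, 2)
```

Quadrupling the sample size halves the margin of error, while raising the confidence level or the variability widens it.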