This lecture explains the concepts of sampling distribution, standard error, and t-distributions in basic quantitative methods in social sciences. It also discusses the use of t-distributions when population standard deviation is unknown.
Basic Quantitative Methods in the Social Sciences (AKA Intro Stats) 02-250-01 Lecture 7
In Review… • Sampling distribution = The distribution of a statistic over repeated sampling from a specified population. • Standard error = The standard deviation of a sampling distribution (tells us how much variability we will get over repeated sampling) • If we know the shape and parameters (e.g., mean and standard deviation) of the sampling distribution of a statistic, we can derive the position of a particular statistic in the overall distribution.
More Review… • John got a 76% on the last midterm in statistics. The class mean was 65%, and the standard deviation was 6. • Can we determine the position of John’s score in the class distribution? Yes! We can calculate a Z-score. • SO: We know the position of John’s score (i.e., z-score), the probability of this score occurring in the class (which we get from the z-score), and the amount of sampling error.
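As a quick check of the John example, here is a minimal sketch of the z-score calculation. It assumes the class scores are roughly normally distributed; the function name `z_score` is just an illustrative choice.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Values from the slide: John's score, the class mean, the class standard deviation.
z = z_score(76, 65, 6)

# Proportion of the class expected to score below John, assuming normal scores.
proportion_below = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"proportion scoring below John = {proportion_below:.4f}")
```

John's score sits a little under two standard deviations above the class mean.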
And More Review… • What if we know that our class has a mean on the midterm of 68% with a Standard Deviation of 7, and we want to know if the first 3 rows did better than the rest of class… • Can we consider the first 3 rows as a sample of the class (population) and do a Z-test? YES! Why? We know the POPULATION PARAMETERS (mean and standard deviation – so the modified Z formula will work).
More Review…. • Crucial to understand: We can do this because we KNOW the standard deviation of the population. • What if we want to know how our class mean (that is, 65%) compares with introductory statistics courses across the country? • Can we calculate the Z-score of our class mean to find out its position?
More Review… • NO! Why not? Because we do not know the population standard deviation, so we cannot compute the standard deviation of the sampling distribution of the mean (i.e., our class is no longer the population. Now our class is a sample in a larger population of statistics classes).
Last Review Slide… • Central Limit Theorem: Given a population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the mean (the distribution of sample means) will have a mean equal to $\mu$ and a standard deviation equal to $\sigma_{\bar{X}} = \sigma/\sqrt{N}$. The distribution will approach the normal distribution as N (sample size) increases.
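The standard error part of the theorem can be checked with a small simulation. This is only a sketch: the population is assumed to be normal with mean 68 and standard deviation 7 (the class values used earlier), and the sample size, number of repetitions, and seed are arbitrary choices.

```python
import math
import random
import statistics

def simulate_sample_means(pop_mean, pop_sd, n, reps, seed=0):
    """Draw `reps` samples of size n from a normal population; return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

# Hypothetical population: mean 68, standard deviation 7.
means = simulate_sample_means(pop_mean=68, pop_sd=7, n=25, reps=5000)

observed_se = statistics.stdev(means)   # spread of the simulated sample means
theoretical_se = 7 / math.sqrt(25)      # sigma / sqrt(N)

print(f"mean of sample means: {statistics.fmean(means):.2f}")
print(f"observed SE: {observed_se:.3f}, theoretical SE: {theoretical_se:.3f}")
```

The two standard errors should agree closely, and the mean of the sample means should sit near the population mean.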
Introduction to t-Distributions • The reality is, we rarely know the population standard deviation ($\sigma$). • The t-distributions are a “family of theoretical distributions” that can be used when: • We are dealing with interval or ratio data • Our data is normally distributed • The population standard deviation ($\sigma$) is unknown.
t-Distributions continued.. • Review (ok, I lied, this is the last review slide): • A normal distribution is a population of z-scores where z is defined as: $z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$, with $\sigma_{\bar{X}} = \sigma/\sqrt{N}$ • Note: Here we know the population’s S.D. ($\sigma$)
t-Distributions continued.. • A t-distribution is a population of t-scores where t is defined as: $t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$, with $s_{\bar{X}} = s/\sqrt{N}$ • $\bar{X}$ (X-bar) is the mean of a random sample • Do you see the standard error of the mean in the formula?
t-Distributions continued…. • t-distributions are similar to the normal distribution in that they are unimodal and symmetrical. They have a mean of 0, negative values below the mean, and positive values above the mean.
t-distributions and degrees of freedom • Because the definition of t involves a term obtained from a sample (that is, the estimated standard error of the mean), which in turn involves the degrees of freedom associated with the sample, there is a different t distribution for each number of degrees of freedom (and therefore for each sample size). • The t-distribution can be found in Table E.6 (p.444 in Howell). You will notice an extra column not in the normal curve table (df).
t-distributions and degrees of freedom (continued) • Here we see a representation of a set of t distributions. • Note that if the df is large, t and z are nearly the same, and they diverge as the df gets smaller.
Basic Properties of t-Curves • Property 1: The total area under a t-curve is equal to 1. • Property 2: A t-curve extends indefinitely in both directions, approaching, but never touching the horizontal axis as it does so. • Property 3: A t-curve is symmetrical about 0. • Property 4: As the number of degrees of freedom becomes larger, t-curves look increasingly like the standard normal curve. • Property 5: Every t-score has a certain probability of occurrence in a specific t-distribution. As such, the values of t which enclose or cut-off given proportions of the appropriate t-distribution can be calculated.
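Property 4 can be illustrated numerically. The sketch below evaluates the standard density formula for Student's t at the centre and in the tail for a few arbitrarily chosen df values and compares them with the standard normal curve; `t_pdf` is an illustrative helper, not something from the textbook.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Work on the log scale so large df values do not overflow the gamma function.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

normal = NormalDist()  # standard normal: mean 0, standard deviation 1

for df in (2, 5, 30, 1000):
    # Height of the curve at the centre (t = 0) and in the tail (t = 3).
    print(f"df={df:>4}: pdf(0)={t_pdf(0, df):.4f}  pdf(3)={t_pdf(3, df):.4f}")
print(f"normal : pdf(0)={normal.pdf(0):.4f}  pdf(3)={normal.pdf(3):.4f}")
```

With small df the t-curve is lower in the middle and higher in the tails than the normal curve; as df grows the two sets of numbers converge.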
z table vs. t table • z table: Gives the area above and below each specified value of z. • t table: A different t distribution is defined for each possible number of degrees of freedom. Gives values of t that cut off particular critical areas, for example, the .05 and .01 levels of significance.
Intro to Confidence Intervals • Recall: Although random sampling has no inherent bias, we cannot expect any given sample to perfectly represent its population. Why? Sampling error! • SO: The sample mean will almost always be a different value than the population mean.
A Confidence Interval Is: • A score interval calculated by a procedure with a specified probability of producing an interval containing the parameter (i.e., from the population). • Can we make a statement about how confident we are that a sample mean is close to the (unknown) population mean?
Example: • 25 people around Windsor are approached at random and asked to rate how good a job Jean Chretien is doing as Prime Minister, on a scale of 1 (he stinks) to 20 (he’s great). The mean rating was 8, and the standard deviation was 7.558. How confident are we that this mean of 8 is close to the mean of the overall population of Ontario? SO:
$\bar{X}$ = 8, s = 7.558, n = 25 • The sample mean (8) is deemed to be the mean of a distribution which conforms to the t distribution at df = n-1. • By choosing t values which enclose a specified proportion of that t distribution, we can construct an interval of plausible values of $\mu$.
Don’t worry, it’s not as complicated as that last sentence seemed • If we choose critical values for t at the 0.05 significance level (two-tailed), there is a 0.95 probability that the score interval we generate will contain $\mu$. • This score interval is termed the 95% confidence interval (95% C.I.)
Confidence Limits on Mean • Sample mean (8) is a point estimate • We want an interval estimate • Probability that an interval computed this way includes $\mu$ = 0.95
How do we get the t value? • n = 25, so df = n-1 = 24 • Look at the t distribution table for df = 24 at the 0.05 level of significance (for two tails). • The critical value for t = 2.064
$s_{\bar{X}} = s/\sqrt{n}$ = 7.558 / 5 = 1.5116 • $t_{.05}\, s_{\bar{X}}$ = 2.064 (1.5116) = 3.12 • $CI = \bar{X} \pm t_{.05}\, s_{\bar{X}} = 8 \pm 3.12$, so 4.88 to 11.12 • SO: We can be 95% confident that the interval 4.88 to 11.12 inclusive contains the population mean rating on Jean Chretien.
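The same arithmetic as a minimal sketch. The critical value 2.064 is typed in by hand from the t table (df = 24, two-tailed .05) rather than computed, and `confidence_interval` is an illustrative name.

```python
import math

def confidence_interval(mean, s, n, t_crit):
    """Return (lower, upper) limits: mean +/- t_crit * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    margin = t_crit * standard_error
    return mean - margin, mean + margin

# Slide values: sample mean 8, s = 7.558, n = 25.
# t_crit = 2.064 is the two-tailed .05 table value for df = 24.
lower, upper = confidence_interval(mean=8, s=7.558, n=25, t_crit=2.064)
print(f"95% CI: {lower:.2f} to {upper:.2f}")  # 95% CI: 4.88 to 11.12
```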
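That is to say…. • If we took 100 samples (25 people in each sample) from the same population and computed a 95% confidence interval from each, we would expect about 95 of those intervals to contain the population mean. Our interval, 4.88 to 11.12, is one such interval.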
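What the 95% refers to can be seen by simulation. In this sketch the population is assumed to be normal with mean 8 and standard deviation 7.558 (hypothetical values borrowed from the example), and the critical value 2.064 is again the table value for df = 24. It counts how often an interval built from a fresh sample of 25 contains the population mean.

```python
import math
import random
import statistics

def coverage(pop_mean, pop_sd, n, t_crit, reps, seed=1):
    """Fraction of simulated intervals (mean +/- t_crit * s / sqrt(n)) containing pop_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        m = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
        if m - t_crit * se <= pop_mean <= m + t_crit * se:
            hits += 1
    return hits / reps

# Hypothetical normal population; t_crit = 2.064 is the table value for df = 24.
rate = coverage(pop_mean=8, pop_sd=7.558, n=25, t_crit=2.064, reps=4000)
print(f"share of intervals containing the population mean: {rate:.3f}")
```

The share should come out close to 0.95. Each sample produces a different interval, and the confidence level describes how often the procedure captures the population mean.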
What if we wanted to be 99% confident? • $t_{.01}\, s_{\bar{X}}$ = 2.797 (1.5116) = 4.23 • SO: $8 \pm 4.23$ = 3.77 to 12.23 • SO: If we took 100 samples of 25 people and computed an interval this way from each, about 99 of the intervals would contain the population mean.
Things to Remember… • Other things being equal, increasing the confidence level (say from 95% to 99%) increases the size of the confidence interval. Why? • Because greater certainty (confidence) comes at the cost of less precision (a wider interval), and greater precision (a smaller interval) comes with less certainty.
Things to Remember (continued) • Other things being equal, an increase in the size of the sample standard deviation increases the size of the confidence interval. Why? • More variable data indicate more sampling error which in turn means less certainty can be attached to the accuracy of a particular estimate.
Things to Remember (continued) • Other things being equal, an increase in the size of the sample decreases the size of the confidence interval. Why? • Because larger samples provide more stable (less variable) estimates which in turn means that on average, sampling error is less and greater certainty can be attached to the accuracy of an estimate.
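The three "things to remember" can be checked with a few lines of arithmetic. This sketch uses only critical values quoted from the t table in this lecture. The doubled standard deviation is a made-up comparison value, and `ci_width` is an illustrative name.

```python
import math

def ci_width(s, n, t_crit):
    """Full width of the interval mean +/- t_crit * s / sqrt(n)."""
    return 2 * t_crit * s / math.sqrt(n)

# Two-tailed table values used in this lecture:
# df = 24 -> 2.064 (.05) and 2.797 (.01); df = 15 -> 2.131 (.05).
baseline = ci_width(s=7.558, n=25, t_crit=2.064)            # 95%, n = 25
higher_confidence = ci_width(s=7.558, n=25, t_crit=2.797)   # 99%, n = 25
larger_sd = ci_width(s=2 * 7.558, n=25, t_crit=2.064)       # s doubled (hypothetical)
smaller_sample = ci_width(s=7.558, n=16, t_crit=2.131)      # 95%, n = 16

print(f"baseline (95%, n=25): {baseline:.2f}")
print(f"99% confidence:       {higher_confidence:.2f}")
print(f"doubled s:            {larger_sd:.2f}")
print(f"n = 16:               {smaller_sample:.2f}")
```

Each change (more confidence, more variable data, a smaller sample) widens the interval relative to the baseline.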
One more example: • 16 University of Windsor students were polled regarding how much they pay for rent each month, producing a mean of $500.00 a month, with a standard deviation of $60.00. Compute 95% confidence limits for the population mean of University of Windsor students.
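Here we go… • $t_{.05}$ (with df = 15) = 2.131 • $s_{\bar{X}} = s/\sqrt{n}$ = 60/4 = 15 • $t_{.05}\, s_{\bar{X}}$ = 2.131 (15) = 31.97 • SO: $500 \pm 31.97$ = 468.03 to 531.97 • SO: We can be 95% confident that the interval $468.03 to $531.97 contains the population mean monthly rent; if we took 100 samples of 16 people and computed an interval from each, about 95 of the intervals would contain the population mean.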
One sample t-tests: Rationale • Sometimes we know the population mean ($\mu$) of a variable, and we wish to determine whether the mean of a sample differs significantly from the population mean. Assumptions: • Normal population or large sample. • The population’s standard deviation is not known.
Why t and not z?: Review • Gosset noted that when we use the sample’s standard deviation instead of the population’s (which we do not know), the distribution of the resulting statistic changes as a function of sample size. If n is large, it is very close to the normal distribution. But with smaller samples, s tends to underestimate $\sigma$ (its sampling distribution is skewed), so the statistic is more spread out than z, and comparing it with the normal distribution would give us too many “significant” results. • To compensate, we compare our t-value with its own distribution.
Say we know that the average cell phone user uses 3000 minutes of cellular air time each year ($\mu$). • Dr. Z hypothesizes that business executives spend more time on their cell phone each year than does the average cell phone user. She interviews a sample of 20 business executives, and finds that they use on average 3500 minutes of cellular air time each year, with a standard deviation of 300 minutes. Did this sample of business executives use significantly more cellular air time than the average cell phone user? Test at the .01 level of significance.
Hypothesis testing with the one sample t-test • We can test the null hypothesis: • H0: The mean number of cell phone minutes used per year by business executives does not differ from the mean of the average cell phone user.
Let’s try it! • We know $\mu$, we know $\bar{X}$; what else do we need? • $s_{\bar{X}} = s/\sqrt{n}$ = 300 / $\sqrt{20}$ = 300 / 4.4721 = 67.0826 • Luckily, we know that s = 300. If we didn’t, we would need to calculate it from the raw data.
Calculating t… • $t = \frac{\bar{X} - \mu}{s_{\bar{X}}} = \frac{3500 - 3000}{67.0826} = 7.453 = t_{obt}$
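The calculation can be sketched as a small function. The name `one_sample_t` is illustrative, and the inputs are the values from the cell phone example.

```python
import math

def one_sample_t(sample_mean, pop_mean, s, n):
    """Return (t, df), where t = (sample_mean - pop_mean) / (s / sqrt(n)) and df = n - 1."""
    standard_error = s / math.sqrt(n)  # estimated standard error of the mean
    return (sample_mean - pop_mean) / standard_error, n - 1

# Cell phone example: sample mean 3500, population mean 3000, s = 300, n = 20.
t_obt, df = one_sample_t(sample_mean=3500, pop_mean=3000, s=300, n=20)
print(f"t_obt = {t_obt:.3f}, df = {df}")  # t_obt = 7.454, df = 19
```

Carried at full precision the result rounds to 7.454. The hand calculation gives 7.453 because the square root of 20 was rounded to 4.4721 first.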
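One tailed or two tailed? • The null hypothesis is $H_0: \mu = \mu_0$ (where $\mu_0$ is the known population mean) and the alternative hypothesis is one of the following: • $H_a: \mu \neq \mu_0$: 2-tailed test. • $H_a: \mu < \mu_0$: left-tailed test • $H_a: \mu > \mu_0$: right-tailed test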
Do we want to use a two-tailed, a left tailed, or a right-tailed test in this example? Since our hypothesis is that business executives use more cellular air time…… We’ll use a one-tailed test (in this case, a right-tailed test)
Is it significant? P values revisited p-value for a t-test if the test is (a) two-tailed, (b) left-tailed, (c) right-tailed:
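For readers without a table to hand, the three p-values can be approximated numerically. This is a sketch only: it integrates the standard t density with Simpson's rule instead of using a statistics library, and the names `t_pdf`, `t_cdf`, and `p_value` are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry about 0 (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the curve between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def p_value(t_obt, df, tail):
    """tail is 'two', 'left', or 'right'."""
    if tail == "left":
        return t_cdf(t_obt, df)            # area below t_obt
    if tail == "right":
        return 1 - t_cdf(t_obt, df)        # area above t_obt
    return 2 * (1 - t_cdf(abs(t_obt), df)) # both tails beyond |t_obt|

# Table check: t = 2.064 with df = 24 should cut off about .05 in two tails.
print(f"two-tailed p at t=2.064, df=24: {p_value(2.064, 24, 'two'):.4f}")
# Cell phone example: right-tailed test with t_obt = 7.454, df = 19.
print(f"right-tailed p at t=7.454, df=19: {p_value(7.454, 19, 'right'):.2e}")
```

The right-tailed and left-tailed p-values for the same t always sum to 1, and the two-tailed value is twice the smaller of the two.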
Refer to the t-table… • Remember, df = n – 1 = 19. • As mentioned, we’ll use a one-tailed test. • If we set our level of significance at .01, the critical t-value is 2.539 (this is called tcrit). • tobt (7.454) > tcrit (2.539). Therefore: We reject the H0. • Can we state our conclusion in words?
The size of t and the Decision about H0 are affected by….. • The actual obtained difference • The magnitude of sample variance • The sample size • The significance level (.05? .01?) • Whether the test is one- or two-tailed
Underlying Assumptions • Most statistical tests make certain presumptions regarding the data and the distributions of whatever parameters are at issue – These are known as the underlying assumptions of the test. • If the underlying assumptions are violated, the validity of the test may be compromised, such that for instance, the probability of a Type I error might be higher than the alpha level of the test.
Underlying Assumptions • The one-sample t-test assumes that the raw data were a random sample; that is, the raw scores must be independent of each other. This assumption must not be violated, or the t-test is worthless. • The one-sample t-test assumes that the dependent variable is normally distributed in the population. This assumption can be (and usually is) violated to some degree: the t-test is a “robust” test, so it tolerates some violation of the normality assumption.
A second Approach: Confidence Intervals Re-visited • An alternative to the one-sample t-test is to calculate a confidence interval around the sample mean. If the population mean specified by H0 falls outside that interval, H0 is rejected. • It’s logical: 95% of intervals built this way contain the true mean of the population the sample came from. If the hypothesized population mean falls outside the interval, the sample mean is significantly different from it (two-tailed, at the .05 level).
Let’s look at our example: • $\bar{X} \pm t_{.01}\, s_{\bar{X}}$ = 3500 ± 2.861 (67.0826) = 3500 ± 191.9233 = 3308.08 to 3691.92 • So: Since $\mu$ = 3000 is outside the interval, we reject H0: we are 99% confident that the interval 3308.08 to 3691.92 contains the mean number of minutes of cellular air time that business executives use per year.
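The confidence interval approach can be sketched the same way. The critical value 2.861 is the two-tailed .01 table value for df = 19, entered by hand, and `ci_rejects_h0` is an illustrative name.

```python
import math

def ci_rejects_h0(sample_mean, s, n, t_crit, pop_mean):
    """Return ((lower, upper), reject), where reject is True if pop_mean lies outside the interval."""
    se = s / math.sqrt(n)
    lower = sample_mean - t_crit * se
    upper = sample_mean + t_crit * se
    return (lower, upper), not (lower <= pop_mean <= upper)

# Cell phone example: mean 3500, s = 300, n = 20, hypothesized population mean 3000.
# t_crit = 2.861 is the two-tailed .01 table value for df = 19.
(lower, upper), reject = ci_rejects_h0(3500, 300, 20, 2.861, 3000)
print(f"99% CI: {lower:.2f} to {upper:.2f}; reject H0: {reject}")
```

Note that this interval corresponds to a two-tailed test at the .01 level, which is a stricter criterion than the one-tailed critical value of 2.539 used in the t-test above.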
Another Example: A random sample of 65 freshman college students was selected to participate in a new look-say teaching program designed to increase reading speed in French. The final exam consisted of a French passage that the students translated. The time required for each student to complete the translation was recorded. The sample statistics were $\bar{X}$ = 302 sec and s = 56 sec. According to department records, the mean for students in conventional classes was 320 sec. Let alpha ($\alpha$) = .05. • Hypothesis: the new program will increase reading speed.