Hypothesis Testing in a Regression Context
LIR 832
Review: What Do These Regression Terms Mean?

Regression Analysis: weekearn versus Education, age, female, hours

The regression equation is
weekearn = -1053 + 65.1 Education + 7.07 age - 230 female + 18.3 hours

44839 cases used; 10319 cases contain missing values

Predictor    Coef       SE Coef   T        P
Constant     -1053.01   19.43     -54.20   0.000
Educatio     65.089     1.029     63.27    0.000
age          7.0741     0.1929    36.68    0.000
female       -229.786   4.489     -51.19   0.000
hours        18.3369    0.2180    84.11    0.000

S = 459.0   R-Sq = 31.9%   R-Sq(adj) = 31.9%
Topics of the Day…
• 1. Populations and samples in the context of regression.
• 2. The distribution of the error term in a regression model.
• 3. Hypothesis testing.
   • One-tailed.
   • Two-tailed.
• 4. Tests of group variables.
Review: Populations and Samples
• In earlier lectures, we learned that there is a true parameter in a population (μ) that we try to estimate with a sample statistic (x-bar). In regression, there are similarities.
• In other words, there is a "true" relationship between the variables in the population that we are trying to estimate: yi = β0 + β1*x1i + εi.
• Since we often cannot see the entire population, we estimate this relationship by finding the equation within a sample: ŷi = b0 + b1*x1i.
Review: Populations and Samples
• As with all sample results, there are many different samples that might be drawn from a population.
• These samples will typically provide somewhat different estimates of the coefficients.
• This is, once more, a byproduct of sampling variation.
Review: Populations and Samples
• Estimate a simple regression model of weekly earnings for all of the data on managers and professionals (what we'll consider the "population"), then take random 10% sub-samples of the data and compare the estimates.
• Weekly Earnings = β0 + β1*education + ε
• Upon generating five random sub-samples of the data, we find five somewhat different sets of coefficient estimates (a simulated version of this exercise is sketched below).
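A minimal Python sketch of this exercise, run on synthetic data since the course data set is not included here (all population parameters below are invented for illustration):

import numpy as np

rng = np.random.default_rng(832)

# A synthetic "population" of 50,000 workers (parameters invented for illustration).
n_pop = 50_000
education = rng.integers(10, 21, size=n_pop)                   # years of schooling
weekearn = -489 + 88 * education + rng.normal(0, 530, n_pop)   # true relationship plus noise

# Fit the population regression, then five random 10% sub-samples.
b1_pop, b0_pop = np.polyfit(education, weekearn, 1)   # returns [slope, intercept]
print(f"population:   b0 = {b0_pop:8.1f}   b1 = {b1_pop:6.2f}")

for s in range(5):
    idx = rng.choice(n_pop, size=n_pop // 10, replace=False)   # random 10% sub-sample
    b1, b0 = np.polyfit(education[idx], weekearn[idx], 1)
    print(f"sub-sample {s + 1}: b0 = {b0:8.1f}   b1 = {b1:6.2f}")

Each sub-sample slope comes out near, but not exactly equal to, the population slope: sampling variability in action.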
Populations and Samples
• Important point: sampling variability is responsible for the fact that the sample regression coefficients do not exactly reproduce the population regression coefficients, which are what we are after.
• Q: Why does this happen?
• A: We pull samples out of populations. We typically take a single sample, but, in fact, there are many, many samples we could pull from a given population.
Populations and Samples: Example
• Q: How many unique samples (no two samples have the exact same individuals) can be drawn from a population of 15 with sample size 5?
• A: The number of combinations of 15 individuals taken 5 at a time: 15! / (5! * 10!) = 3,003 unique samples.
Populations and Samples: Example #2
• Q: Now suppose we are taking samples of 5,000 from our 50,000-person data set on managers and professionals. How many unique samples can we draw?
• A: The number of combinations of 50,000 taken 5,000 at a time: an astronomically large number, with thousands of digits.
• As a result, we have many possible samples upon which to estimate a regression model. This will produce many different sample regression coefficients (b)… which may differ from the population coefficients (β).
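Both counts are binomial coefficients; a quick check with the Python standard library:

import math

# Unique samples of size 5 from a population of 15.
print(math.comb(15, 5))   # 3003

# Unique samples of 5,000 from 50,000: far too large to print in full,
# so report its order of magnitude via the log-gamma function.
log10_count = (math.lgamma(50_001) - math.lgamma(5_001) - math.lgamma(45_001)) / math.log(10)
print(f"about 10^{log10_count:.0f} possible samples")   # a number with over 7,000 digits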
The Distribution of Sample Regression Coefficients (b)
• By the Gauss-Markov Theorem, so long as we have at least 31 degrees of freedom in our regression:
• A.) The average of the b's is β.
• B.) The variance of b declines as the sample size becomes large.
• C.) The b's are normally distributed.
• The Gauss-Markov Theorem can be thought of as the Central Limit Theorem for regression; this allows us to engage in inference from samples to populations.
Normality of Sample Regression Coefficients (b)
• Q: What does normality of the sample regression coefficients (b) buy us?
• A: It allows us to do inference using a normal distribution, as we have already done.
• Thus, we can use z-transformations and hypothesis testing just as before.
Recap to This Point…
1. We wish to know about the population regression (i.e., the "true" model).
2. We do not observe the population; instead, we use samples to make inferences about the population and the population regression.
3. Samples are characterized by sampling variability.
   a.) Samples do not exactly reproduce the population regression coefficients.
   b.) Sample regression coefficients do not exactly reproduce one another.
   c.) The Gauss-Markov Theorem tells us that the sample b's are normally distributed around the population β.
Recap to This Point…
4. Because of sampling variability, we cannot take the sample results as being exactly equal to the population results.
   a.) We can ask, in a probabilistic sense, whether the sample results are consistent with a set of beliefs about the unobserved population.
   b.) Is the b we get from our sample sufficiently close to our belief about β in the population (in the sense of being within 1.28 or 1.64 or 2.33 standard errors of our guess about β) that we can conclude that the results are consistent with our beliefs?
5. Because our b ~ N(β, var(b)), we can use the hypothesis-test apparatus we have previously used with means to test our beliefs about coefficients.
Reviewing the Steps of Hypothesis Testing
1.) We lay out what we believe as our alternative hypothesis, βalternative.
2.) We lay out what we do not believe as our null hypothesis, βnull.
3.) We draw our sample and estimate the regression.
4.) We form our t-statistic as: t* = (b − βnull) / se(b).
Reviewing the Steps of Hypothesis Testing
5.) We compare our estimated sample t* to our t critical points. If we have at least 31 degrees of freedom and we are doing one-tailed tests, these are:
   10%   1.282
   5%    1.645
   1%    2.326
6.) Using our critical points, we reject or do not reject the null.
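As a sketch, these steps can be wrapped in a small Python helper (the function name and structure are inventions here; the cut points are the infinite-df values from the slide):

# One-tailed test of Ha: beta > beta_null, using the infinite-df cut points
# (valid so long as we have at least 31 degrees of freedom).
CUT_POINTS = {0.10: 1.282, 0.05: 1.645, 0.01: 2.326}

def one_tailed_t_test(b: float, se_b: float, beta_null: float) -> None:
    """Form t* = (b - beta_null) / se(b) and compare it to each critical point."""
    t_star = (b - beta_null) / se_b
    print(f"t* = {t_star:.3f}")
    for alpha, cut in CUT_POINTS.items():
        verdict = "reject" if t_star > cut else "do not reject"
        print(f"  {alpha:.0%} test (cut point {cut}): {verdict} the null")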
Example: Education and Earnings
• An example from our managerial education data. Suppose we believe that managers typically earn more than $80 per week per year of education. We have the following setup:
• 1. Alternative hypothesis: βalternative > $80 per week per year of education.
• Note: This happens to be true in this instance, but we don't typically know the true population value.
• 2. Null hypothesis: βnull ≤ $80 per week per year of education.
Example: Education and Earnings
• 3. Before taking our sample, we can set this up and think about what we are going to be doing. We have sufficient observations that we can use the "infinite degrees of freedom" cut points once we have converted this to a t-distribution (or z, if df > 31):
• Cut points: 10%: 1.282; 5%: 1.645; 1%: 2.326.
Example: Education and Earnings

Regression Analysis: weekearn versus Education

The regression equation is
weekearn = -489 + 88.2 Education

4792 cases used; 741 cases contain missing values

Predictor    Coef      SE Coef   T       P
Constant     -488.51   56.85     -8.59   0.000
Educatio     88.162    3.585     24.59   0.000

S = 531.7   R-Sq = 11.2%   R-Sq(adj) = 11.2%

Analysis of Variance
Source            DF     SS           MS           F        P
Regression        1      170910932    170910932    604.62   0.000
Residual Error    4790   1354011361   282675
Total             4791   1524922293
Example: Education and Earnings
• Computing t: t* = (b − βnull) / se(b) = (88.162 − 80) / 3.585 ≈ 2.28.
• Note: The standard error of the coefficient fills the place of the standard deviation in our hypothesis tests of means.
Example: Education and Earnings
• We can reject in a 10% and a 5% test, but not in a 1% test. Alternatively, the tail probability from a z-table is 1.16%.
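A quick check of these numbers with scipy, assuming the coefficient and standard error from the output above:

from scipy.stats import norm

b, se_b, beta_null = 88.162, 3.585, 80.0
t_star = (b - beta_null) / se_b
p_one_tailed = norm.sf(t_star)   # upper-tail probability under the normal approximation

print(f"t* = {t_star:.3f}")                   # about 2.277
print(f"one-tailed p = {p_one_tailed:.4f}")   # about 0.011, matching the slide's ~1.16% z-table figure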
Another Sample Regression…

Regression Analysis: weekearn versus Education

The regression equation is
weekearn = -460 + 85.9 Education

4652 cases used; 773 cases contain missing values

Predictor    Coef      SE Coef   T       P
Constant     -460.15   56.45     -8.15   0.000
Educatio     85.933    3.565     24.10   0.000

S = 525.2   R-Sq = 11.1%   R-Sq(adj) = 11.1%

Analysis of Variance
Source            DF     SS           MS           F        P
Regression        1      160290124    160290124    581.01   0.000
Residual Error    4650   1282842668   275880
Total             4651   1443132792
A Graphical Overview
• Note that we convert each possible value of b by applying our z-transform and carry along the associated probability. So each possible value of b is converted by z = (b − βnull) / se(b),
• and we work with a standard normal. The cut points are simply the points that are 1.28, 1.645, and 2.33 standard errors above the mean.
Extreme Sample Example

Regression Analysis: weekearn versus Education

The regression equation is
weekearn = -333 + 79.2 Education

4719 cases used; 726 cases contain missing values

Predictor    Coef      SE Coef   T       P
Constant     -333.24   58.12     -5.73   0.000
Educatio     79.208    3.665     21.61   0.000

S = 539.5   R-Sq = 9.0%   R-Sq(adj) = 9.0%

Analysis of Variance
Source            DF     SS           MS           F        P
Regression        1      135988912    135988912    467.19   0.000
Residual Error    4717   1373012666   291078
Total             4718   1509001578
Research Example
• An example from the article "The Gender Earnings Gap Among College Educated Workers" (ILRR, July 1997).
• One hypothesis is that getting a college degree (completing the 4th year of college) has a positive return.
• This hypothesis reflects human capital and "sheepskin" theories.
Research Example: Hypothesis Testing
• Consider the first hypothesis (call the coefficient on obtaining a college degree βdegree).
• Alternative (what we believe): βdegree > 0
• Null (what we do not believe): βdegree ≤ 0
Research Example: Hypothesis Testing
Take a sample and run the regression (we have this from the article for 1979).
• The estimate of βdegree is 0.1214 for men (with an SE of 0.036).
• The estimate of βdegree is 0.3175 for women (with an SE of 0.0485).
Research Example: Hypothesis Testing
We cannot use this as is; we need to convert it to a standard normal distribution. We do this with our old friend the z-transform: t* = (b − 0) / se(b).
• Men first: t* = 3.39.
• The t is signed correctly to support the alternative of βdegree > 0.
• The t* of 3.39 is well above the 10% critical point (1.282), the 5% critical point (1.645), and the 1% critical point (2.326), so we reject the null.
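Repeating the z-transform in code for both 1979 coefficients (values from the slide; the small gap between the computed 3.37 and the reported 3.39 presumably reflects rounding in the published standard error):

from scipy.stats import norm

# (coefficient, standard error) for the college-degree variable, 1979.
estimates = {"men": (0.1214, 0.036), "women": (0.3175, 0.0485)}

for group, (b, se_b) in estimates.items():
    t_star = (b - 0.0) / se_b   # null: beta_degree <= 0
    p = norm.sf(t_star)
    print(f"{group}: t* = {t_star:.2f}, one-tailed p = {p:.6f}")

# men:   t* ~ 3.37 (slide reports 3.39), p well below 0.01 -> reject at all three levels
# women: t* ~ 6.55, p effectively zero -> reject at all three levels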
Research Example: Hypothesis Testing
• Q: Would you reject the null using the same tests for women from 1979 data?
• Q: Would you reject the null using the same tests for men from 1986 data?
• Q: Would you reject the null using the same tests for women from 1986 data?
Research Example: Hypothesis Testing
• How about testing the effect of college GPA in 1979 on ln weekly earnings? Do this for both men and women.
• 1. Set up the alternative and the null.
• 2. Calculate the t* from the sample data.
• 3. Compare the t* to the critical values.
• 4. Interpret the test.
Research Example: Hypothesis Testing • You don’t always have to test against a β of zero, as you can test against any β. • Example from the article: • Hypothesis: The return to college graduation is more than 10%. • Alternative is that the return is > 10% • Null is that the return ≤10%.
Research Example: Hypothesis Testing
• If we test this hypothesis for men in 1986, we use b = 0.1764 and se(b) = 0.0654.
• Thus, the test statistic is: t* = (0.1764 − 0.1000) / 0.0654 = 1.168.
• Therefore, we cannot reject the null hypothesis in any of our one-tailed tests.
Research Example: Hypothesis Testing
• Q: Test the alternative hypothesis that the return to a college education is more than 10 percent for women in 1986.
• Q: Test that the return to job tenure is more than 2 percent for men and women in 1986.
Two-Tailed Hypothesis Tests
• As with tests on means, sometimes we do not have a clear idea of the effect of a factor.
• Example: the effect of marriage on men's earnings.
• One view: Suddenly, men become responsible (wife, child, house, the full catastrophe). They have to earn money to support their family.
• Second view: Now that men have a second earner in the family, they don't have to do anything but drink espresso in the corner café.
• Thus, we use two-tailed hypothesis tests when we do not know the direction of the effect.
Example: Marriage on Earnings
• Let's take the case of the effect of marriage on male earnings:
• Step 1: Lay out what we believe as our alternative hypothesis.
• We believe that marriage has an effect on male earnings, but we don't know the direction of the effect: Ha: βmarriage ≠ 0.
• Step 2: Lay out what we do not believe as our null hypothesis.
• Marriage has no effect on male earnings: H0: βmarriage = 0.
Example: Marriage on Earnings
• Step 3: Using data from College Earnings, we form our t-statistic, t* = (b − 0) / se(b) (note that, in this case, the table already gives us the value of the t-statistic).
Example: Marriage on Earnings
• We compare our estimated sample t* to our t critical points. Previously, we learned that our critical points were (so long as we had at least 31 degrees of freedom):
   10%   1.282
   5%    1.645
   1%    2.326
• It's a little different when we are doing two-tailed tests. Now, if we want to put 10% in our rejection region, it can be for a t that is "too large" or for a t that is "too small." So, to get a 10% total area, we need to place 5% in the upper tail and 5% in the lower tail.
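A sketch recovering both sets of cut points from the standard normal with scipy (ppf is the inverse CDF):

from scipy.stats import norm

for alpha in (0.10, 0.05, 0.01):
    one_tailed = norm.ppf(1 - alpha)        # all of alpha in one (upper) tail
    two_tailed = norm.ppf(1 - alpha / 2)    # alpha split across both tails
    print(f"alpha = {alpha:.0%}: one-tailed {one_tailed:.3f}, two-tailed {two_tailed:.3f}")

# alpha = 10%: one-tailed 1.282, two-tailed 1.645
# alpha =  5%: one-tailed 1.645, two-tailed 1.960
# alpha =  1%: one-tailed 2.326, two-tailed 2.576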
SELECTED VALUES OF THE t DISTRIBUTION
(That value t* such that Pr(t > t*) = α, where t ~ t(df))

One-Tailed   .25    .10    .05    .025    .01    .005    .001
Two-Tailed   .50    .20    .10    .05     .02    .01     .002
df
 1          1.00   3.08   6.31   12.71   31.82   63.66   318.31
 2           .82   1.89   2.92    4.30    6.96    9.92    22.33
 3           .76   1.64   2.35    3.18    4.54    5.84    10.21
 4           .74   1.53   2.13    2.78    3.75    4.60     7.17
 5           .73   1.48   2.02    2.57    3.36    4.03     5.89
 6           .72   1.44   1.94    2.45    3.14    3.71     5.21
 7           .71   1.42   1.90    2.36    3.00    3.50     4.78
 8           .71   1.40   1.86    2.31    2.90    3.36     4.50
 9           .70   1.38   1.83    2.26    2.82    3.25     4.30
10           .70   1.37   1.81    2.23    2.76    3.17     4.14
11           .70   1.36   1.80    2.20    2.72    3.11     4.02
12           .70   1.36   1.78    2.18    2.68    3.06     3.93
13           .69   1.35   1.77    2.16    2.65    3.01     3.85
14           .69   1.34   1.76    2.14    2.62    2.98     3.79
15           .69   1.34   1.75    2.13    2.60    2.95     3.73
16           .69   1.34   1.75    2.12    2.58    2.92     3.69
17           .69   1.33   1.74    2.11    2.57    2.90     3.65
18           .69   1.33   1.73    2.10    2.55    2.88     3.61
19           .69   1.33   1.73    2.09    2.54    2.86     3.58
20           .69   1.32   1.72    2.09    2.53    2.84     3.55
21           .69   1.32   1.72    2.08    2.52    2.83     3.53
22           .69   1.32   1.72    2.07    2.51    2.82     3.50
23           .68   1.32   1.71    2.07    2.50    2.81     3.48
24           .68   1.32   1.71    2.06    2.49    2.80     3.47
25           .68   1.32   1.71    2.06    2.48    2.79     3.45
26           .68   1.32   1.71    2.06    2.48    2.78     3.44
27           .68   1.31   1.70    2.05    2.47    2.77     3.42
28           .68   1.31   1.70    2.05    2.47    2.76     3.41
29           .68   1.31   1.70    2.04    2.46    2.76     3.40
30           .68   1.31   1.70    2.04    2.46    2.75     3.38
 ∞           .67   1.28   1.64    1.96    2.33    2.58     3.09
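For reference, rows of this table can be regenerated with scipy's t-distribution; a short sketch:

from scipy.stats import norm, t

ONE_TAILED_ALPHAS = [0.25, 0.10, 0.05, 0.025, 0.01, 0.005, 0.001]

# A few representative degrees of freedom.
for df in (1, 5, 10, 30):
    row = [t.ppf(1 - a, df) for a in ONE_TAILED_ALPHAS]
    print(df, "  ".join(f"{v:7.2f}" for v in row))

# The infinite-df row is the standard normal:
print("inf", "  ".join(f"{norm.ppf(1 - a):7.2f}" for a in ONE_TAILED_ALPHAS))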