Outline for Session 3 • Multiple Regression • Analysis of Variance • One-way ANOVA (not regression) • Middle Section of the Output • Hypothesis Testing • Significance of the Whole Model (the F test) • Full/Reduced Model Trick Applied Regression -- Prof. Juran
Multiple Regression

The multiple regression model (equations 3.1 and 3.2 in RABE) extends simple regression to p independent variables:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

Applied Regression -- Prof. Juran
Example: Supervisor Ratings (Table 3.2, RABE; www.ilr.cornell.edu/~hadi/RABE/Data/P054.txt)

Variable: Description
Y: Overall rating of job being done by supervisor
X1: Handles employee complaints
X2: Does not allow special privileges
X3: Opportunity to learn new things
X4: Raises based on performance
X5: Too critical of poor performance
X6: Rate of advancing to better jobs

Applied Regression -- Prof. Juran
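As a minimal sketch, this model can be fit in Python with statsmodels. This assumes the file at the URL above is plain text, whitespace-delimited, with a header row naming the columns Y, X1, …, X6; check the file's actual layout before relying on it:

```python
import pandas as pd
import statsmodels.api as sm

# RABE supervisor data; assumed whitespace-delimited with a header row
url = "http://www.ilr.cornell.edu/~hadi/RABE/Data/P054.txt"
data = pd.read_csv(url, sep=r"\s+")

y = data["Y"]                                # overall supervisor rating
X = sm.add_constant(data.drop(columns="Y"))  # X1..X6 plus an intercept column

model = sm.OLS(y, X).fit()
print(model.summary())  # top, middle (ANOVA/F test), and bottom sections
```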
What’s Different? • More than one independent variable • Less intuitive interpretation of some of the statistics (β1, R, etc.) • More difficult to study relationships between the variables graphically • A “significant” model may contain insignificant independent variables Applied Regression -- Prof. Juran
Bottom Section Revisited Applied Regression -- Prof. Juran
Middle Section: Analysis of Variance Applied Regression -- Prof. Juran
Analysis of Variance (ANOVA) • Not just a regression procedure • Not covered in B6014 (except in the context of regression) • A collection of statistical models in which variance is partitioned into components due to different explanatory variables. Applied Regression -- Prof. Juran
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

“In relation to any experiment we may speak of this hypothesis as the ‘null hypothesis,’ and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

(R. A. Fisher)

Applied Regression -- Prof. Juran
One-way ANOVA

Example: mortgage interest rates sampled in four neighborhoods (12 observations in all). Sample mean = 5.883%; sample standard deviation = 1.898%.

Applied Regression -- Prof. Juran
One-way ANOVA

The difference we are studying (in this case the different neighborhoods) is called a treatment. Is a significant portion of the overall variability in mortgage rates attributable to the treatment, as opposed to random sampling error?

The procedure has three steps (a short scipy sketch follows below):
1. Measure the total variability in X.
2. Divide that variability into two parts: the part that is just random sampling error, and the part that can be “explained” by the qualitative treatments.
3. Perform a hypothesis test to see whether the “explained” part is significantly greater than zero.

Two assumptions about the data: (a) the sample data come from populations that are normally distributed, and (b) the populations have the same variance.

Applied Regression -- Prof. Juran
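A quick sketch of the whole procedure using scipy's built-in one-way ANOVA. Only Devon is named in these slides; the other neighborhood names and all rate values below are made-up placeholders, not the actual data:

```python
from scipy.stats import f_oneway

# Placeholder mortgage rates (as decimals) for four neighborhood samples
devon   = [0.0712, 0.0704, 0.0726]
ashton  = [0.0745, 0.0698, 0.0731]
burnley = [0.0610, 0.0642, 0.0588]
clifton = [0.0525, 0.0501, 0.0539]

# f_oneway partitions the variability and returns the F statistic and its p-value
f_stat, p_value = f_oneway(devon, ashton, burnley, clifton)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```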
From B6014: the last step in the sample variance calculation is to divide the sum of the squared errors (the total squared differences between individual data and the sample mean) by n – 1. If we work back one step from the sample variance here, we get

(n – 1)s² = 11 × (0.01898)² ≈ 0.00396

This number, the total squared deviations of individual data from the mean, is called the total sum of squares (or total sum of squared errors, which we’ll symbolize SST). Applied Regression -- Prof. Juran
Step 2 is to divide the overall variability into “explained” and “unexplained”. We begin by calculating the sample means for interest rates in each of these neighborhoods. Applied Regression -- Prof. Juran
For each of these groups (neighborhoods), we square the difference between its mean interest rate and the grand mean, and then multiply that squared error by the number of observations r. Summing these terms over the groups gives SSC = r Σ (group mean – grand mean)². The between-columns sum of squares (symbolized SSC) represents a measure of the “explained” variability in X. Applied Regression -- Prof. Juran
Now we need a measure of the “unexplained” error, the amount of variability that is not explained by differences in neighborhoods. For this measure, we find the total squared differences between individual data and their respective column means. In English: the total squared differences between individual loan rates and the average rates within the neighborhoods. For example, in the Devon neighborhood the sample mean was 0.07140, so the squared differences for Devon are of the form (xᵢ – 0.07140)², one for each Devon observation. Applied Regression -- Prof. Juran
We add all of these squared differences within the different neighborhoods to get an overall measure of unexplained variability called the error sum of squares, which we will symbolize with SSE. Applied Regression -- Prof. Juran
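Steps 1 and 2 by hand, in a short self-contained sketch (the same placeholder data as the earlier sketch), confirming that the total variability splits cleanly into explained and unexplained pieces:

```python
import numpy as np

# Placeholder data as before: four groups (neighborhoods) of three rates each
groups = [
    [0.0712, 0.0704, 0.0726],   # "Devon"
    [0.0745, 0.0698, 0.0731],
    [0.0610, 0.0642, 0.0588],
    [0.0525, 0.0501, 0.0539],
]

all_rates = np.concatenate(groups)
grand_mean = all_rates.mean()

# SST: squared deviations of every observation from the grand mean
sst = ((all_rates - grand_mean) ** 2).sum()

# SSC: r * (group mean - grand mean)^2, summed over the groups
ssc = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)

# SSE: squared deviations of each observation from its own group mean
sse = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)

assert np.isclose(sst, ssc + sse)  # the variability partition holds
```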
We’ve done quite a bit of number crunching here, but the results can be summarized fairly concisely. Step 3 is to perform a hypothesis test to see whether these four neighborhoods have the same population mean interest rate:
• Determine degrees of freedom for each of the sources of variation.
• Divide the sum of squares for each source by its degrees of freedom.
• Create a ratio of “explained” to “unexplained” variability, called the F statistic.
Applied Regression -- Prof. Juran
SSC: Here we have four neighborhoods, so the degrees of freedom associated with the explained variability is 4 – 1 = 3.
SSE: We have 12 data and 4 groups, so the degrees of freedom associated with the unexplained variability is 12 – 4 = 8.
SST: Either subtract 1 from the number of data, or add together the degrees of freedom from the explained and unexplained variability. Either way, we get 11 in this case.
Applied Regression -- Prof. Juran
Now we divide the between-groups (explained) sum of squares and the within-groups (unexplained) sum of squares by their respective degrees of freedom, to get statistics we call mean squares. We symbolize them with MSC and MSE, respectively. Applied Regression -- Prof. Juran
Finally, we divide the mean squares between groups by the mean squares within groups to get a new statistic, called F in honor of Ronald Fisher. Applied Regression -- Prof. Juran
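In symbols, with c groups and n total observations, the mean squares and the F ratio follow directly from the definitions above:

```latex
\mathrm{MSC} = \frac{\mathrm{SSC}}{c-1}, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{n-c}, \qquad
F = \frac{\mathrm{MSC}}{\mathrm{MSE}}
```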
Here is a generic picture of the F distribution (the shape changes with different degrees of freedom). Applied Regression -- Prof. Juran
F is a continuous random variable, like z and t. Unlike z and t, F can never be less than zero. z and t have two tails and are symmetrical, but F is one-tailed. Like t, the shape of F varies with degrees of freedom. F has degrees of freedom both in its numerator (c – 1, where c is the number of groups) and in its denominator ((r – 1)c, where r is the number of observations per group). Applied Regression -- Prof. Juran
Step 2: Test Statistic

The test statistic is F, specified with the degrees of freedom as discussed above. In our case there are 3 degrees of freedom in the numerator and 8 degrees of freedom in the denominator. Applied Regression -- Prof. Juran
Step 3: Decision Rule

It should make sense that a relatively small value for F should be viewed as evidence consistent with the null hypothesis; if the ratio of explained variability to unexplained is small, then it suggests that there is not much difference between neighborhoods when it comes to interest rates. Conversely, if the F ratio is relatively large, it suggests that a significant proportion of the variation in interest rates is in fact attributable to differences in neighborhoods. But what exactly constitutes “relatively small” or “relatively large”? We can look up a critical F value in an F table, just as we do with a z table or t table. It’s a little trickier because of the two different types of degrees of freedom, and F tables are usually set up for a particular upper tail probability. Applied Regression -- Prof. Juran
From an F table for 5% upper tails, we look up 3 numerator degrees of freedom and 8 denominator degrees of freedom and get a critical value of 4.066. If the null hypothesis were true, we would collect a sample of data with an F larger than 4.066 only 5% of the time. Applied Regression -- Prof. Juran
Our decision rule is: Reject the null hypothesis if F is greater than 4.066. Graphically, the rejection region is the 5% upper tail of the F(3, 8) distribution, beyond 4.066. Applied Regression -- Prof. Juran
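The same critical value can be pulled from scipy instead of a printed table (a one-line sketch; 0.95 is one minus the 5% upper-tail probability):

```python
from scipy.stats import f

critical_value = f.ppf(0.95, dfn=3, dfd=8)  # 95th percentile of F(3, 8)
print(round(critical_value, 3))             # 4.066
```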
Step 4: As we’ve already seen, the F ratio for our data is 0.6933, which is well within the “non-reject” region. We conclude that there is no significant difference between the average interest rates in the four different neighborhoods. Applied Regression -- Prof. Juran
The p-value here is 0.5815, meaning that if the null hypothesis were true we would see an F ratio greater than 0.6933 more than 58% of the time. We couldn’t reject this null hypothesis without taking a 58% risk of committing a Type I error. Applied Regression -- Prof. Juran
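The p-value is just the upper-tail probability beyond the observed F ratio, which scipy computes directly:

```python
from scipy.stats import f

p_value = f.sf(0.6933, dfn=3, dfd=8)  # P(F > 0.6933) with 3 and 8 degrees of freedom
print(round(p_value, 4))              # approximately 0.5815, matching the slide
```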
One-way ANOVA is built into Excel, along with a few other basic versions of ANOVA. Applied Regression -- Prof. Juran
Back to Regression. Middle Section: Supervisor Model Applied Regression -- Prof. Juran
Middle Section: Analysis of Variance

The major purpose of this section is to conduct a “regression” version of the F test, a hypothesis test of the overall significance of the regression model. Some preliminary calculations: the variability in Y is partitioned into explained error and unexplained error. Each of these sets of errors has a number of degrees of freedom associated with it: with k independent variables and n observations, the explained (regression) variability has k degrees of freedom, the unexplained (residual) variability has n – k – 1, and the total has n – 1. Applied Regression -- Prof. Juran
Middle Section: Analysis of Variance In the ANOVA (Analysis of Variance) section of the regression output, these numbers are reported in the first column. Applied Regression -- Prof. Juran
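A minimal sketch of the regression version of the F test, built from the same sum-of-squares logic. Here ssr and sse stand for the explained and unexplained sums of squares reported in the ANOVA section of the output; statsmodels reports the identical statistic as model.fvalue in the earlier sketch:

```python
from scipy.stats import f

def regression_f_test(ssr: float, sse: float, n: int, k: int) -> tuple[float, float]:
    """Overall F test: H0 says all k slope coefficients are zero."""
    msr = ssr / k                # mean square for regression (explained)
    mse = sse / (n - k - 1)      # mean square for error (unexplained)
    f_stat = msr / mse
    p_value = f.sf(f_stat, k, n - k - 1)  # upper-tail probability
    return f_stat, p_value
```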
The Variability Partition Applied Regression -- Prof. Juran
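In symbols, the partition this slide refers to is (with ŷᵢ the fitted value for observation i and ȳ the mean of Y):

```latex
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{SST (total)}}
= \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SSR (explained)}}
+ \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE (unexplained)}}
```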