To extend the comparison of population means beyond the two groups tested by the two-sample t-test, we use a one-way analysis of variance (ANOVA). The question we want to answer in analysis of variance is whether or not some categorical independent variable has a relationship to some quantitative dependent variable. If we used the mean of the dependent variable to estimate the score for every individual in the sample across all of the groups, we know that we would end up with a cumulative measure of error called the total sums of squares. Analysis of variance tests whether or not the cumulative error would decrease if we used the group mean to estimate the scores for the members in each group instead of the mean of the dependent variable. If the reduction in error using group means is large relative to the amount of error remaining after using group means, we will have a large statistical value for the F-ratio that has a low probability of occurring only by chance (using the F-distribution to compute the probability). If the probability of the F-ratio or F-test is less than the alpha we have set, we reject the ANOVA null hypothesis that the means for all groups are the same.
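The comparison of errors described above can be sketched in a few lines of Python. This is only an illustration: the three groups and their scores are made-up values (not taken from any data file), and the variable names are our own. It shows the total sum of squares, the error remaining after using group means, and the resulting F-ratio; computing the probability from the F-distribution is left to the statistical package.

```python
# Hypothetical scores for three groups (illustrative values only).
groups = {
    "low": [4.0, 5.0, 6.0],
    "middle": [6.0, 7.0, 8.0],
    "high": [8.0, 9.0, 10.0],
}


def mean(values):
    return sum(values) / len(values)


all_scores = [y for scores in groups.values() for y in scores]
grand_mean = mean(all_scores)

# Total sum of squares: error when the overall mean estimates every score.
ss_total = sum((y - grand_mean) ** 2 for y in all_scores)

# Within-group sum of squares: error remaining when each group mean is used.
ss_within = sum(
    (y - mean(scores)) ** 2 for scores in groups.values() for y in scores
)

# Between-group sum of squares: the reduction in error from using group means.
ss_between = ss_total - ss_within

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F-ratio: reduction in error relative to remaining error, each per degree of freedom.
f_ratio = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_within, ss_between, f_ratio)
```

For these made-up scores the group means remove most of the error, so the F-ratio is large.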
Rejecting the ANOVA null hypothesis that the group means are all the same does not tell us which specific group means were different from each other. To answer that question, the one-way analysis of variance requires a second step: using post hoc tests to compare each pair of group means to identify specific differences. We might think we could use multiple t-tests to identify differences in pairs of group means, but this inflates our desired alpha error rate. If we set alpha to 0.05, the probability that we would make a correct decision for an individual test is 1 - 0.05 = 0.95. If we did three related t-tests, each with alpha set at 0.05, the probability of making correct decisions on all three of the tests is 0.95 x 0.95 x 0.95 = 0.857375. The probability that we are making at least one error is 1 - 0.857375 = 0.142625. Thus, the probability that we would erroneously reject a null hypothesis has increased from our desired 0.05 to an actual error rate of 0.142625. To hold our error rate to 0.05, we would have to divide the error rate by the number of tests, i.e. 0.05 / 3 = 0.0167. If we set alpha to 0.0167, the probability that we would make a correct decision for an individual test is 1 - 0.0167 = 0.9833. Carrying the unrounded value of 0.05 / 3 through the calculation, the probability that we would make three correct decisions would be (1 - 0.05 / 3) x (1 - 0.05 / 3) x (1 - 0.05 / 3) = 0.950829. The probability of making an error is 1 - 0.950829 = 0.049171. (The result falls slightly below 0.05 because dividing alpha by the number of tests is slightly conservative.) This is what the Bonferroni post hoc test does, and that is the post hoc test we will use in our assignments. There are numerous post hoc tests that differ in their sensitivity to differences between groups and their strategy to avoid inflating the error rates.
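The arithmetic above can be checked with a short Python sketch. The function name is our own, and the calculation assumes the tests can be multiplied together as if they were independent, which is the same simplification used in the text.

```python
def familywise_error(alpha, tests):
    """Probability of at least one erroneous rejection across the tests,
    assuming the tests are independent."""
    return 1 - (1 - alpha) ** tests


alpha = 0.05
tests = 3

# Error rate if each of the three t-tests uses alpha = 0.05.
uncorrected = familywise_error(alpha, tests)

# Bonferroni: divide alpha by the number of tests.
bonferroni_alpha = alpha / tests
corrected = familywise_error(bonferroni_alpha, tests)

print(round(uncorrected, 6))       # 0.142625
print(round(bonferroni_alpha, 4))  # 0.0167
print(round(corrected, 6))         # 0.049171
```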
The assumptions or conditions required for ANOVA are equality of variance and normality across the groups. We will use the Levene test of homogeneity of variance to test for equality of variance and the Shapiro-Wilk test of normality, applied to the standardized residuals. If these test results do not support conformity to required assumptions, we will try a log transformation of the dependent variable if it is skewed to the right, and a square transformation if it is skewed to the left. Only the quantitative dependent variable can be transformed. We would not expect a categorical independent variable to be different if it was transformed (all we would do is change the number codes for the groups). If we satisfy the assumptions, we interpret the F-test for group differences if its probability is less than alpha. If we reject the null hypothesis for the F-test we examine the post hoc tests. It is possible that none, one, two, or all of the paired differences will be significant on the post hoc tests. We do not interpret post hoc tests if we fail to reject the null hypothesis, even if there are statistically significant differences between pairs of groups. If we reject the null, we examine the data for outliers (large standardized residuals) to see if they had a role in our statistical results. If our results change after omitting outliers, we would have to make a decision about reporting the results including outliers or the results excluding the outliers.
These problems use a revised version of world2007.sav named world2007R.sav. The new file included three variables that I recoded as categorical variables:
• Popchange divides the data for pgrowth into three groups:
  • countries with a declining population
  • countries with a population growth less than 2%
  • countries with a population growth of 2% or more
• EconGrowth divides the data for gdpgrow into three groups:
  • countries where the rate of growth in GDP was less than 4%
  • countries where the rate of growth in GDP was between 4% and 7%
  • countries where the rate of growth in GDP was greater than 7%
• Urbaniz divided the data for urbanpop into three categories:
  • countries where the urban percent of the population was in the lowest third
  • countries where the urban percent of the population was in the middle third
  • countries where the urban percent of the population was in the highest third
We will treat these three ordinal variables as categorical and use them as the independent variables in our one-way analysis of variance problems.
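As an illustration of this kind of recoding, the grouping for Popchange might be sketched in Python as follows. The function name, the numeric codes 1 to 3, and the treatment of a growth rate of exactly 0 (placed in the middle group) are our assumptions; the slides do not state how the file was actually recoded.

```python
def recode_popchange(pgrowth):
    """Assign a population growth rate (in percent) to one of three groups.

    The codes 1, 2, 3 and the handling of exactly 0 are illustrative
    assumptions, not taken from world2007R.sav.
    """
    if pgrowth is None:
        return None  # missing data stays missing
    if pgrowth < 0:
        return 1  # declining population
    if pgrowth < 2:
        return 2  # population growth less than 2%
    return 3  # population growth of 2% or more


# Hypothetical growth rates for five countries.
sample_rates = [-0.4, 0.9, 2.0, 3.1, None]
codes = [recode_popchange(rate) for rate in sample_rates]
print(codes)  # [1, 2, 3, 3, None]
```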
The introductory statement in the question indicates:
• The data set to use (world2007R.sav)
• The task to accomplish (a one-way analysis of variance)
• The variables to use in the analysis: the independent variable degree of urbanization [urbaniz] and the dependent variable share of unemployed youth to total unemployed (per cent for males) [ythunemm]
• The alpha level of significance for the hypothesis test: 0.05
These problems also contain a second paragraph of instructions that provides the formulas to use if the analysis requires us to re-express or transform the variable to satisfy the conditions for analysis of variance.
One-way analysis of variance requires a quantitative dependent variable and a categorical independent variable.
"Share of unemployed youth to total unemployed (per cent for males)" [ythunemm] is quantitative, satisfying the level of measurement requirement for the dependent variable. "Degree of urbanization" [urbaniz] is categorical, satisfying the level of measurement requirement for the independent variable. Mark the check box for a correct answer.
The next statement asks about the size of the sample. To answer this question, we run the one-way ANOVA in SPSS.
While SPSS has a procedure for one-way analysis of variance, we will use the General Linear Model procedure, since that is what we will have to use when we next move to two-factor analysis of variance. To compute a one-way analysis of variance, select General Linear Model > Univariate from the Analyze menu. “Univariate” indicates that we have a single (uni=one) dependent variable.
First, move the dependent variable, ythunemm, to the Dependent Variable text box. Second, move the independent variable, urbaniz, to the Fixed Factor(s) list box. Third, click on the Options button to request basic statistics. A categorical variable is a “fixed factor” when all of the possible values are included in the data set. A categorical variable is a “random factor” when only a random subset of possible values is included in the data set.
First, mark the check boxes for Descriptive statistics, Homogeneity tests, and Residual plot. Second, move the independent variable urbaniz to the Display Means for list box. Third, click on the Continue button to close the dialog box.
Next, click on the Save button to instruct SPSS to compute the standardized residuals.
First, click on the Standardized check box to create standardized residuals in the SPSS Data Editor. Second, click on the Continue button to close the dialog box.
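To make the saved variable less of a black box, here is a minimal Python sketch of standardized residuals for a one-way design. It assumes a standardized residual is the raw residual (score minus group mean) divided by the square root of the within-group mean square; that is our working assumption about the calculation, and the scores are made-up values.

```python
import math

# Hypothetical scores for three groups (illustrative values only).
groups = {
    "low": [4.0, 5.0, 6.0],
    "middle": [6.0, 7.0, 8.0],
    "high": [8.0, 9.0, 10.0],
}


def mean(values):
    return sum(values) / len(values)


n_cases = sum(len(scores) for scores in groups.values())

# Within-group (error) sum of squares and mean square.
ss_within = sum(
    (y - mean(scores)) ** 2 for scores in groups.values() for y in scores
)
mse = ss_within / (n_cases - len(groups))

# Residual = score minus its group mean; standardized = residual / sqrt(MSE).
standardized = {
    name: [(y - mean(scores)) / math.sqrt(mse) for y in scores]
    for name, scores in groups.items()
}

print(standardized["low"])
```

Cases with large standardized residuals in absolute value are the ones we would later examine as possible outliers.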
Next, click on the Post Hoc button to instruct SPSS to compute the post hoc tests.
First, move the independent variable to the Post Hoc Tests for list box. Second, click on the check box for the Bonferroni post hoc test. Third, click on the Continue button to close the dialog box.
Our selections are complete. Click on the OK button to produce the output.
Our initial impetus for running the analysis of variance was to identify the available sample. The number of cases with valid data to analyze the relationship between "degree of urbanization" and "share of unemployed youth to total unemployed (per cent for males)" was 92.
Since 92 cases had valid data for both variables, we mark the check box for a correct answer.
The next question concerns the conformity of the data to the conditions or assumptions required for a one-way analysis of variance. Making inferences about population means based on a one-way analysis of variance requires equal variance of the dependent variable across groups defined by the independent variable and a normal distribution for the residuals. If we do not satisfy the assumption of equal variance and normal distribution of the residuals, we can re-express the dependent variable if it is skewed to see if we can satisfy the condition of equal variance and normality using a transformed variable.
The uniformity of the variance of the dependent variable across groups defined by the independent variable is evaluated with the Levene Test of Equality of Error Variances. The Levene statistic tests the null hypothesis that the variances for all of the groups are equal. When the probability of the Levene statistic is less than or equal to alpha, we reject the null hypothesis, supporting a finding that the variances of one or more groups are different and we do not satisfy the assumption of equal variances. In this problem, the interpretation of equal variance is supported by the Levene statistic of 0.629 with a probability of p = .535, greater than the alpha of p = .050. The null hypothesis is not rejected. The assumption of equal variance is supported.
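The idea behind the Levene statistic can be sketched in Python. This sketch uses the mean-centered form, which amounts to a one-way ANOVA F-ratio computed on each score's absolute deviation from its own group mean; whether a given package centers on the mean or the median is something to confirm in its documentation. The function name and the example scores are our own, and the probability for the statistic (from the F-distribution) is not computed here.

```python
def mean(values):
    return sum(values) / len(values)


def levene_statistic(groups):
    """Mean-centered Levene statistic for a list of groups of scores."""
    # Absolute deviation of each score from its own group mean.
    deviations = [[abs(y - mean(g)) for y in g] for g in groups]
    all_deviations = [z for d in deviations for z in d]
    grand = mean(all_deviations)

    k = len(groups)          # number of groups
    n = len(all_deviations)  # total number of cases

    # One-way ANOVA sums of squares computed on the deviations.
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in deviations)
    within = sum((z - mean(d)) ** 2 for d in deviations for z in d)

    return ((n - k) / (k - 1)) * between / within


# Hypothetical groups: the first set has equal spreads, the second does not.
equal_spread = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
unequal_spread = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 10.0, 20.0]]

print(levene_statistic(equal_spread))
print(levene_statistic(unequal_spread))
```

When the spreads are identical the statistic is zero; it grows as the groups' spreads diverge.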
The residual plot indicates that the spreads (heights) of the points for each group are similar, reinforcing the interpretation that the equal variance condition is satisfied.
The one-way analysis of variance expects the residuals (which we created with the Univariate command) to have a normal distribution. The distribution of the residuals is evaluated with the Shapiro-Wilk test. To compute the Shapiro-Wilk test, select Descriptive Statistics > Explore from the Analyze menu.
While we are interested in the normality test only for the standardized residuals, we include the dependent variable ythunemm so that we have its skewness value in case we have to transform the variable. First, move the dependent variable and the variable for standardized residuals to the Dependent List. Second, click on the Plots button to request the normality test.
First, mark the check box for Normality plots with tests. This produces the Shapiro-Wilk test. Second, clear the Stem-and-leaf check box and mark the Histogram check box. Third, click on the Continue button to close the dialog box.
Next, click on the Options button to change how cases with missing data are handled.
First, mark the option button to Exclude cases pairwise. Second, click on the Continue button to close the dialog box. The default option is to exclude cases listwise, which means a case will be excluded if it is missing data on either of the variables. Since standardized residuals will only be computed for cases that had valid data for both the independent variable urbaniz and the dependent variable ythunemm, it is possible that some cases with legitimate data for ythunemm would be excluded by the Explore procedure because they did not have a valid value for urbaniz. Changing the option to pairwise exclusion includes all of the valid values for each variable independent of missing data on the other variables.
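The difference between the two exclusion rules can be shown with a small Python sketch. The four cases and their values are made up for illustration; ZRE_1 stands for the saved standardized residual, with None marking a missing value.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"ythunemm": 41.2, "ZRE_1": 0.35},
    {"ythunemm": 28.7, "ZRE_1": None},  # no residual because urbaniz was missing
    {"ythunemm": None, "ZRE_1": None},
    {"ythunemm": 35.0, "ZRE_1": -0.80},
]
variables = ["ythunemm", "ZRE_1"]

# Listwise: a case is used only if it has valid data on every variable.
listwise_cases = [
    case for case in cases if all(case[v] is not None for v in variables)
]
listwise_counts = {v: len(listwise_cases) for v in variables}

# Pairwise: each variable uses all of its own valid values.
pairwise_counts = {
    v: sum(1 for case in cases if case[v] is not None) for v in variables
}

print(listwise_counts)  # {'ythunemm': 2, 'ZRE_1': 2}
print(pairwise_counts)  # {'ythunemm': 3, 'ZRE_1': 2}
```

Under listwise exclusion the second case's valid ythunemm value is dropped; under pairwise exclusion it is kept.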
Our selections are complete. Click on the OK button to produce the output.
The Shapiro-Wilk statistic tests the null hypothesis that the distribution of the residuals is normal. When the probability of the Shapiro-Wilk statistic is less than or equal to alpha, we reject the null hypothesis, supporting a finding that the residuals are not normally distributed and we do not satisfy the assumption of normality. In this problem, the normality of the distribution of residuals is not supported by the Shapiro-Wilk statistic of 0.970 with a probability of p = .033, less than the alpha of p = .050. The null hypothesis is rejected, and the assumption of normal residuals is not supported.
The histogram does not show a major departure from normality. There is one value that appears to be an outlier. The Shapiro-Wilk test is very sensitive to departures from normality.
The normality plot also does not show a serious departure from normality. The outlier is more evident.
Since we violated the normality condition, we do not mark the check box for the statement. We will re-express the dependent variable if it is skewed to see if the relationship using transformed variables satisfies the conditions required for analysis of variance.
The next question asks us to identify the correct transformation to use for our effort to satisfy the ANOVA conditions.
When the raw data does not satisfy the conditions of equal variance and normality, we examine the skewness of the variable to identify skewing that might be corrected with re-expression. The skewness for "share of unemployed youth to total unemployed (per cent for males)" [ythunemm] was 0.608. Since the skew for the dependent variable "share of unemployed youth to total unemployed (per cent for males)" [ythunemm] (0.608) was equal to or greater than 0, we attempt to correct violation of assumptions by re-expressing "share of unemployed youth to total unemployed (per cent for males)" on a logarithmic scale. NOTE: in these problems we use skewness to determine which transformation to use. We are not using the +1/-1 criteria to determine normality.
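The rule for choosing a transformation can be sketched in Python. The skewness function below uses the adjusted (sample) formula that statistical packages commonly report; we assume, but have not verified against the output, that it matches the value shown by the Explore procedure. The function names and the sample scores are our own.

```python
import math


def skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum of cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def choose_transformation(skew):
    """Log for skew equal to or greater than 0, square for negative skew."""
    return "log" if skew >= 0 else "square"


# The skewness reported for ythunemm in this problem.
print(choose_transformation(0.608))  # log

# Hypothetical scores with a long right tail and a long left tail.
right_tail = [1.0, 2.0, 2.0, 3.0, 10.0]
left_tail = [1.0, 8.0, 9.0, 9.0, 10.0]
print(choose_transformation(skewness(right_tail)))  # log
print(choose_transformation(skewness(left_tail)))   # square
```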
Since the logarithm was the correct transformation, we mark the check box.
The next question asks us to evaluate the analysis of variance for the re-expressed variable. We will first re-express the variable and then run the one-way analysis of variance again.
To re-express the variable, select the Compute Variable command from the Transform menu.
First, type the name for the transformed variable in the Target Variable text box. Second, type the formula for the transformation in the Numeric Expression text box. The formula is provided in the second paragraph of the problem statement. Third, click on the OK button to close the dialog box.
To repeat the one-way analysis of variance, click on the Dialog Recall tool button.
Replace the raw dependent variable ythunemm with the log transformed variable LG_ythunemm. Since that is the only required change, click on the OK button to produce the output.
In this problem, the interpretation of equal variance is supported by the Levene statistic of 1.835 with a probability of p = .166, greater than the alpha of p = .050. The null hypothesis is not rejected, and the assumption of equal variance is supported.
The transformed variable shows slight narrowing in the category at the right, but the Levene statistic indicates that the data meets the criteria.
To run the normality test again, select the Explore command from the drop down menu for the Dialog Recall tool button.
First, remove the variables from the Dependent List and add the variable for the standardized residual from the second run of the Univariate command, ZRE_2. Second, click on the OK button to generate the output. Each time we run the Univariate command with the option to save standardized residuals, SPSS creates a new variable with a higher number at the end, e.g. ZRE_1, ZRE_2, etc. It is up to the analyst to select the correct variable for the test.
In this problem, the normality of the distribution of residuals is supported by the Shapiro-Wilk statistic of 0.977 with a probability of p = .104, greater than the alpha of p = .050. The null hypothesis is not rejected, and the assumption of normal residuals is supported.
Since we satisfied both the equal variance condition and the normality condition, we mark the check box as correct.
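The decision rule we have applied to the two assumption tests can be summarized in a short Python sketch, using the probabilities reported in this problem for the raw and the log-transformed dependent variable. The function and dictionary names are our own.

```python
ALPHA = 0.05


def rejects_null(p_value, alpha=ALPHA):
    """We reject the null hypothesis when p is less than or equal to alpha."""
    return p_value <= alpha


# Probabilities reported in this problem.
results = {
    "raw": {"levene_p": 0.535, "shapiro_wilk_p": 0.033},
    "log": {"levene_p": 0.166, "shapiro_wilk_p": 0.104},
}


def assumptions_met(result):
    """Both assumptions hold only if neither null hypothesis is rejected."""
    equal_variance = not rejects_null(result["levene_p"])
    normal_residuals = not rejects_null(result["shapiro_wilk_p"])
    return equal_variance and normal_residuals


for name, result in results.items():
    print(name, assumptions_met(result))
# raw False
# log True
```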
Since we satisfied the conditions for ANOVA, we make the statistical decision about the null hypothesis.
When the p-value for the F-test is less than or equal to alpha, we reject the null hypothesis that the means of the populations represented by the groups in the sample were all equal, and we interpret the results of the test. If the p-value is greater than alpha, we fail to reject the null hypothesis and do not interpret the result. The p-value for the ANOVA test (p < .001) was less than or equal to the alpha level of significance (p = .050) supporting the conclusion to reject the null hypothesis. At least one of the means of the populations represented by the groups in the sample was different from the other means. There are multiple F statistics and sig. values in the output. We use the one on the row naming the independent variable.
Because we rejected the null hypothesis, we mark the check box for a correct answer. Since we rejected the null hypothesis, we can interpret the post hoc tests. Had we failed to reject the null hypothesis, we would have halted the analysis.