Regression & Correlation Interval & Ratio Level Association
An Example • Explaining variation in % of state’s 2000 population receiving food stamps
Dependent Variable • State-to-state variation in % of state’s 2000 population receiving food stamps
Independent Variables • Such population characteristics as • Income • Education • Measures of need, e.g. • Unemployment rate • % living below poverty line
Independent Variables, cont. • Other • Teen pregnancies • % covered by health insurance
Interval/Ratio Data • Carries a lot of information • It may be added & subtracted (ratio data may also be multiplied & divided) • It may (in theory) assume an infinite number of values (by going out to the right of the decimal) • It is also called “continuous” data
Interval/Ratio Data, Continuous Data • Can be found in surveys (income, years of education, etc.) • Is more commonly found in data sets containing aggregate or ecological data (data which summarizes large numbers of individual cases)
Interval/Ratio Data: Analyzing It • Can collapse (recode) it into categories • Can use regression and correlation to analyze it directly
Regression: Explaining & Predicting • Case scores on independent variable (X) and dependent variable (Y) can be plotted onto a graph, creating a scattergram, or a scatterplot • A line (the regression line) can then be drawn through the points on the scattergram, in order to summarize them
Regression: Explaining & Predicting • A regression equation describes a regression line • Simple regression equations have the form: • Y’ = a + bXi
Y’ = a + bXi • Y’ • A predicted value of the dependent variable • Xi • A given value of the independent variable • b • The slope of the regression line • The change in Y’ for each one-unit increase in X (the steepness of the line) • a.k.a. the regression coefficient
Y’ = a + bXi • a • The “Y intercept” • The point at which the regression line crosses the Y axis • The value of Y when X is zero
Regression: Explaining & Predicting, cont. • The line which produces the least error in predicting the dependent variable (the smallest sum of squared differences between actual and predicted Y values) is the best line (the least squares criterion) • The computing formulas used to obtain slopes and intercepts are designed to satisfy this criterion
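The computing formulas mentioned here can be sketched in a few lines. This is a minimal illustration, assuming the usual least-squares formulas b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄; the function name `fit_line` and the scores are made up for the example and are not the food stamp data.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y' = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation in X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: makes the line pass through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Made-up scores, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))
```

For these made-up scores the intercept comes out near 2.2 and the slope near 0.6.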
Regression: Explaining & Predicting, cont. • They allow us to predict values of the dependent variable from given values of the independent variable • They show how the two variables are related (i.e. they explain the dependent variable’s behavior in terms of the independent variable)
Example: Food Stamps & Teenage Mothers • Dependent variable (Y): • % of state’s 2000 population receiving food stamps • Independent variable (X): • % of births to mothers under 20 in 1997 • Equation: • Food stamps % = 1.238 + .396(% of births to mothers under 20)
Food Stamps & Teenage Mothers, cont. • If 15% of a state’s births are to mothers under the age of 20, what percentage of that state’s population would you predict would be receiving Food Stamps? • Food Stamps % = 1.238 + .396(15) • Food Stamps % = 1.238 + 5.94 • Food Stamps % = 7.178%
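The same prediction can be checked with a short script. The intercept and slope are copied from the slide's equation; the function name `predict_food_stamps` is only an illustrative choice.

```python
# Coefficients taken from the slide's equation.
INTERCEPT = 1.238
SLOPE = 0.396

def predict_food_stamps(pct_teen_births):
    """Predicted % of population on food stamps, given % of births to mothers under 20."""
    return INTERCEPT + SLOPE * pct_teen_births

print(round(predict_food_stamps(15), 3))  # 7.178
```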
Food Stamps & Teenage Mothers, cont. • If the percentage of births to mothers under the age of 20 in that state were to decline by 3 percentage points, what effect might that have on percentage of population receiving Food Stamps? • Change in Food Stamps % = .396(-3) • Change in Food Stamps % = -1.188 • Food Stamps % = Decrease by about 1.19 percentage points (the intercept, 1.238, is the same before and after, so it drops out of the change)
Food Stamps & Teenage Mothers, cont. • Food stamps % = 1.238 + .396(% of births to mothers under 20) • Food Stamps % and births to mothers under 20 are positively associated. As % of births to mothers under 20 decreases, percent of population receiving food stamps also decreases (the positive slope tells us that)
Explaining Food Stamps & Teenage Mothers • How much the percent of population receiving food stamps decreases is indicated by the slope’s size (magnitude) • A one percentage point change in births to mothers under 20 is associated with a change of (roughly) .396 percentage points in percent of population receiving aid.
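The slope's role can be seen by comparing predictions at different values of X. The sketch below reuses the slide's coefficients; the X values (10, 11, 12, 15) are arbitrary examples.

```python
INTERCEPT = 1.238
SLOPE = 0.396

def predict(pct_teen_births):
    return INTERCEPT + SLOPE * pct_teen_births

# A one-point change in X moves the prediction by the slope.
one_point = predict(11) - predict(10)
# A three-point drop (e.g. from 15 to 12) moves it by three times the slope.
three_point_drop = predict(12) - predict(15)
print(round(one_point, 3), round(three_point_drop, 3))
```

The intercept cancels in each difference, so the changes work out to about .396 and -1.188 percentage points.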
Slopes • Are Key, But • Their magnitude is affected by both the strength of association between the two variables, and by the scale (units of measurement) of the variables • They are not standardized • Two slopes may not be easily compared
Slopes, but • Sometimes we are interested in measuring strength of association, not in explaining &/or predicting • To deal with this, we use the correlation coefficient
Correlation • Is a summary association measure for interval/ratio data (used like Cramér’s V, Somers’ D, etc.) • Is a standardized slope • Is easily calculated • Is routinely reported with regression equations
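One way to see what "standardized slope" means: changing the units of X changes the slope but leaves the correlation alone. The sketch below assumes the usual formulas (slope = Sxy / Sxx, r = Sxy / √(Sxx · Syy)); the function name and the data are made up.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return (b, r): the least-squares slope and Pearson's r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx
    r = sxy / sqrt(sxx * syy)  # equivalently b * sqrt(sxx / syy)
    return b, r

# Made-up data: the same X expressed as a percentage and as a proportion.
pct = [10, 12, 15, 18, 20]
prop = [p / 100 for p in pct]
ys = [5.0, 6.5, 7.0, 8.5, 9.0]

b_pct, r_pct = slope_and_r(pct, ys)
b_prop, r_prop = slope_and_r(prop, ys)
print(round(b_pct, 3), round(b_prop, 3))  # slopes differ by a factor of 100
print(round(r_pct, 3), round(r_prop, 3))  # r is the same either way
```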
Correlation • Lots Of Names, One Statistic • Pearson’s r • Correlation coefficient • Pearson’s Product Moment Correlation Coefficient
Correlation, Cont. • Is often reported by itself, without bothering to first calculate slopes & intercepts • Ranges from -1.0 (perfect negative association) through 0.0 (no linear association) to +1.0 (perfect positive association) • When squared (the coefficient of determination), shows the amount (%) of variation explained
Correlation • r² shows the amount (%) of explained variation: • r = .30, r² = .09 • r = .50, r² = .25 • r = .608, r² = .37
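The "explained variation" reading of the squared correlation can be checked numerically. This sketch computes it two ways on made-up scores: as 1 minus the share of variation left over around the regression line, and as the square of Pearson's r. The function name is hypothetical.

```python
def r_squared_two_ways(xs, ys):
    """Return (explained share of variation, squared Pearson's r)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)  # total variation in Y
    b = sxy / sxx
    a = my - b * mx
    # Unexplained variation: squared prediction errors around the line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    explained_share = 1 - sse / syy
    r_sq = sxy ** 2 / (sxx * syy)
    return explained_share, r_sq

# Made-up scores, for illustration only.
explained, r_sq = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(explained, 3), round(r_sq, 3))
```

Both numbers come out near .60 here, i.e. about 60% of the variation in these made-up Y scores is explained by X.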
Getting Correlations Without Scattergrams • There is a correlation function in many statistical software packages, and some spreadsheets • They will produce a correlation matrix, which shows the correlation of each selected variable with all other selected variables
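A correlation matrix of this kind can be sketched in plain Python. The variable names and values below are invented for illustration; a statistics package would normally do this with a single built-in command.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """columns: dict mapping variable name -> list of scores."""
    names = list(columns)
    return {
        row: {col: pearson_r(columns[row], columns[col]) for col in names}
        for row in names
    }

# Hypothetical variable names and made-up values for five cases.
data = {
    "food_stamps": [5.0, 6.5, 7.0, 8.5, 9.0],
    "teen_births": [10, 12, 15, 18, 20],
    "hs_grads": [88, 85, 84, 80, 79],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in data])
```

Each variable correlates 1.0 with itself, and the matrix is symmetric, so only the entries on one side of the diagonal carry new information.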
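Standard Error of Estimate • A “goodness of fit” measure • Analogous to standard deviation • Defines a range above & below the regression line (plus or minus one standard error) within which roughly 68% of all actual cases fall, if the prediction errors are normally distributed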
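A minimal sketch of the calculation, assuming the common formula √(Σ(Y − Y’)² / (n − 2)); some texts divide by n instead, so treat the divisor as an assumption. The scores and the fitted line are made up.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys, a, b):
    """sqrt(sum of squared prediction errors / (n - 2))."""
    errors = [y - (a + b * x) for x, y in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in errors) / (len(xs) - 2))

# Made-up scores and the least-squares line fitted to them (a = 2.2, b = 0.6).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

se = standard_error_of_estimate(xs, ys, a, b)
# Count the cases whose actual Y lies within one standard error of the line.
within = sum(1 for x, y in zip(xs, ys) if abs(y - (a + b * x)) <= se)
print(round(se, 3), within, "of", len(xs))
```

With only five made-up cases the share inside the band will not match 68% closely; the rule of thumb applies to larger data sets.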
Multiple Regression & Partial Correlation • Multivariate analysis for interval & ratio level data • Involves the introduction of additional independent variables (controls) into a bivariate association • Yields summary statistics that are comparable to those found in simple regression
Multiple Regression: Results • Multiple regression equation • Y’ = a + b1X1 + b2X2 + … + bnXn • Each slope indicates the relationship between its corresponding independent variable and the dependent variable independent of the effect of all other independent variables in the equation
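How such an equation is obtained can be sketched by solving the least-squares "normal equations" directly. This is an illustration only: statistical packages use more numerically careful methods, and the function name `fit_multiple` and the data are invented.

```python
def fit_multiple(rows, ys):
    """Least-squares fit of Y' = a + b1*X1 + ... + bn*Xn.

    rows: list of [x1, ..., xn] lists; ys: list of Y scores.
    Returns [a, b1, ..., bn].
    """
    # Design matrix with a leading 1 for the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) coef = X'y, stored as an augmented matrix.
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Made-up data built from Y = 1 + 2*X1 - 3*X2, so the fit should recover it.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coef = fit_multiple(rows, ys)
print([round(c, 3) for c in coef])
```

Because the made-up Y scores follow the equation exactly, the recovered intercept and slopes should be very close to 1, 2 and -3.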
Multiple Regression Equation • Size of slopes is affected by • Strength of association • Scale of independent variable(s) • Number of independent variables in the equation
Multiple Regression: Results, cont. • Squared multiple correlation coefficient: R² • Shows the % of variation in dependent variable explained by all independent variables acting together • Significance
Example: Food Stamps • Criteria For Assessing Obtained Equation(s) • Do a good job of explaining variation in dependent variable (i.e. maximize R2) • Keep number of independent variables down to a reasonable minimum, a.k.a. • Parsimony • Elegance • Efficiency
Example: Food Stamps • Selecting Independent Variables • Start with a set of interesting variables, then winnow down • Considerations: • Variables that are strongly associated with the dependent variable (large correlation coefficients), or that should be in theory, are good starting points
Example: Food Stamps • Selecting Independent Variables, cont. • Avoid using several independent variables which measure the same concept (strongly correlated with each other, have important theoretical similarities) • Try to use independent variables which make significant contributions to the final equation • A “t” of about 2.0 or greater (in absolute value) indicates significance at roughly the .05 level • Remember, this will change as you add or delete variables
Selecting Independent Variables, cont. • A beta (a standardized slope) • Shows the influence of its associated independent variable on the dependent variable, independent of the effects of all other independent variables in the equation • Is expressed in standard deviation units • Can drop independent variables with small betas (or add ones with large betas), then recompute. This is a form of stepwise regression
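A beta can be obtained from an unstandardized slope by rescaling it with the standard deviations of the two variables, assuming the usual relation beta = b × (s_X / s_Y). The function name and scores below are made up.

```python
from statistics import stdev

def beta_weight(b, xs, ys):
    """Standardized slope: the raw slope rescaled into standard deviation units."""
    return b * stdev(xs) / stdev(ys)

# Made-up scores; 0.6 is the least-squares slope for these data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta = beta_weight(0.6, xs, ys)
print(round(beta, 3))
```

With a single independent variable, as in this toy case, the beta equals Pearson's r; in a multiple regression each slope gets its own beta, which is what makes them comparable across variables measured in different units.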
Resulting Equation • % Food Stamps = 16.7 + .343(Teen Moms) - .157(% HS) - .103(Health Insurance) • R² = .443 • Prob. = .000
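The equation can be used to see what "independent of the other variables" means for a slope. In the sketch below the coefficients are copied from the slide, but the input values are placeholders only: the slides do not state the exact units of each variable, so the absolute predictions should not be read as real estimates.

```python
# Coefficients copied from the slide's equation.
INTERCEPT = 16.7
B_TEEN_MOMS = 0.343
B_PCT_HS = -0.157
B_HEALTH_INS = -0.103

def predict_food_stamps(teen_moms, pct_hs, health_ins):
    return (INTERCEPT + B_TEEN_MOMS * teen_moms
            + B_PCT_HS * pct_hs + B_HEALTH_INS * health_ins)

# Placeholder inputs: two hypothetical states that differ only in Teen Moms.
state_a = predict_food_stamps(teen_moms=12, pct_hs=40, health_ins=30)
state_b = predict_food_stamps(teen_moms=13, pct_hs=40, health_ins=30)
print(round(state_b - state_a, 3))
```

Holding the other two variables fixed, a one-unit difference in Teen Moms shifts the prediction by that variable's slope, about .343.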
Multiple Correlation Coefficient • R² • Shows the % of variation in dependent variable explained by all independent variables acting together
Partial Correlation Coefficient • rxy.z • Shows correlation between dependent variable & a single independent variable, controlling for the effect of a third (fourth, etc.) variable
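For a single control variable, the partial correlation can be computed from the three bivariate correlations. The sketch assumes the standard first-order formula r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)); the input correlations are made up.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up bivariate correlations.
controlled = partial_r(0.50, 0.40, 0.30)
print(round(controlled, 3))
```

If the control variable is unrelated to both X and Y (both of its correlations are zero), the partial correlation is simply the original correlation.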
Interpreting Partial Correlation • rxy.z² shows the amount (%) of variation explained by the independent variable, independent of the controls: • rxy.z = .30, rxy.z² = .09 • rxy.z = .50, rxy.z² = .25 • rxy.z = .43, rxy.z² = .185