Regression & Correlation: Extended Treatment

Regression & Correlation: Extended Treatment • Overview • The Scatter Diagram • Bivariate Linear Regression • Prediction Error • Coefficient of Determination • Correlation Coefficient • Anova and the F statistic • Multiple Regression

Overview
• Nominal dependent variable, nominal independent variable: considers the distribution of one variable across the categories of another variable.
• Nominal dependent variable, interval independent variable: considers how a change in a variable affects a discrete outcome.
• Interval dependent variable, nominal independent variable: considers the difference between the mean of one group on a variable and the mean of another group.
• Interval dependent variable, interval independent variable: considers the degree to which a change in one or two variables results in a change in another.

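Overview
• Nominal dependent variable, nominal independent variable: Lambda. You already know how to deal with two nominal variables.
• Nominal dependent variable, interval independent variable: Logistic Regression. This cell is not covered in this course.
• Interval dependent variable, nominal independent variable: Anova and F-Test.
• Interval dependent variable, interval independent variable: Regression and Correlation. TODAY!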

General Examples Does a change in one variable significantly affect another variable? Do two scores co-vary positively (high on one score, high on the other; low on one, low on the other)? Do two scores co-vary negatively (high on one score, low on the other; low on one, high on the other)? Does a change in two or more variables significantly affect another variable?

Specific Examples Does getting older significantly influence a person’s political views? Does marital satisfaction increase with length of marriage? How does an additional year of education affect one’s earnings? How do education and seniority affect one’s earnings?

Scatter Diagrams • Scatter Diagram (scatterplot) – A visual method used to display a relationship between two interval-ratio variables. • Typically, the independent variable is placed on the X-axis (horizontal axis), while the dependent variable is placed on the Y-axis (vertical axis).

Scatter Diagram Example • The data…

Scatter Diagram Example

A Scatter Diagram Example of a Negative Relationship

Linear Relationships • Linear relationship – A relationship between two interval-ratio variables in which the observations displayed in a scatter diagram can be approximated with a straight line. • Deterministic (perfect) linear relationship – A relationship between two interval-ratio variables in which all the observations (the dots) fall along a straight line. The line provides a predicted value of Y (the vertical axis) for any value of X (the horizontal axis).

Graph the data below and examine the relationship:

The Seniority-Salary Relationship

Example: Education & Prestige Does education predict occupational prestige? If so, then the higher the respondent’s level of education, as measured by number of years of schooling, the greater the prestige of the respondent’s occupation. Take a careful look at the scatter diagram on the next slide and see if you think that there exists a relationship between these two variables…

Scatterplot of Prestige by Education

Example: Education & Prestige • The scatter diagram data can be approximated by a straight line, so there is a linear relationship between these two variables. • In addition, since occupational prestige rises as years of education increase, we can also say that the relationship is a positive one.

Take your best guess? If you know nothing else about a person except that he or she lives in the United States, and I asked you to guess his or her age, what would you guess? The mean age for U.S. residents. Now if I tell you that this person owns a skateboard, would you change your guess? (Of course!) With quantitative analyses we are generally trying to predict, or take our best guess at, the value of the dependent variable. One way to assess the relationship between two variables is to consider the degree to which the extra information of the second variable makes your guess better. If someone owns a skateboard, that is likely to indicate to us that s/he is younger, and we may be able to guess closer to the actual value.

Take your best guess? • Similar to the example of age and the skateboard, we can take a much better guess at someone’s occupational prestige, if we have information about her/his years or level of education.
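The "best guess" idea can be made concrete: among all single-number guesses, the mean gives the smallest sum of squared errors. The following is a minimal sketch, assuming a small made-up list of ages; the data and the helper name `sum_squared_error` are illustrative and do not come from the slides.

```python
from statistics import mean

# Hypothetical ages, invented for illustration only.
ages = [12, 15, 22, 34, 41, 58, 67]

def sum_squared_error(guess, values):
    """Sum of squared differences between one guess and every value."""
    return sum((v - guess) ** 2 for v in values)

best_guess = mean(ages)

# Compare the mean against guesses that are 5 years too low and too high.
for guess in (best_guess - 5, best_guess, best_guess + 5):
    print(f"guess={guess:6.2f}  squared error={sum_squared_error(guess, ages):9.2f}")
```

Regression extends the same logic: instead of one guess for everyone, the guess is allowed to change with the value of the independent variable.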

Equation for a Straight Line $Y = a + bX$ where a = intercept, b = slope (rise / run), Y = dependent variable, X = independent variable

Bivariate Linear Regression Equation $\hat{Y} = a + bX$ • Y-intercept (a) – The point where the regression line crosses the Y-axis, or the value of Y when X = 0. • Slope (b) – The change in variable Y (the dependent variable) with a unit change in X (the independent variable). The estimates of a and b are chosen by ordinary least squares (OLS) so that the sum of the squared differences between the observed and predicted values, $\sum (Y - \hat{Y})^2$, is minimized. Under the standard regression assumptions, these estimates are the Best Linear Unbiased Estimators (BLUE) of the intercept and slope.

SPSS Regression Output: 1996 GSS Education & Prestige Now let’s interpret the SPSS output...

The Regression Equation Prediction Equation: $\hat{Y} = 6.120 + 2.762X$ The intercept (6.120) is the predicted value of Y when X is zero.

The Regression Equation Prediction Equation: $\hat{Y} = 6.120 + 2.762X$ The slope (2.762) is the predicted change in Y for each additional year of education.

Interpreting the regression equation $\hat{Y} = 6.120 + 2.762X$ • If a respondent had zero years of schooling, this model predicts that his or her occupational prestige score would be 6.120 points. • For each additional year of education, our model predicts a 2.762 point increase in occupational prestige.
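To see the interpretation in action, the prediction equation can be written as a small function. This is only a sketch: the coefficients are the ones reported on the slide, while the function name `predict_prestige` and the example values of years of education are chosen here for illustration.

```python
# Coefficients taken from the slide's prediction equation.
INTERCEPT = 6.120  # a: predicted prestige at zero years of schooling
SLOPE = 2.762      # b: predicted change in prestige per additional year

def predict_prestige(years_of_education):
    """Predicted occupational prestige: Y-hat = a + bX."""
    return INTERCEPT + SLOPE * years_of_education

# Illustrative inputs: no schooling, high school, four-year degree.
for years in (0, 12, 16):
    print(years, round(predict_prestige(years), 3))
```

Moving from any value of X to X + 1 always raises the prediction by exactly the slope, which is what "for each additional year" means.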

Ordinary Least Squares • Least-squares line (best fitting line) – A line where the error sum of squares, or $\sum e^2$, is at a minimum. • Least-squares method – The technique that produces the least-squares line.

Estimating the slope: b • The bivariate regression coefficient, or the slope of the regression line, can be obtained from the observed X and Y scores as the covariance of X and Y divided by the variance of X: $b = \frac{S_{XY}}{S_X^2}$

Covariance and Variance Covariance: $S_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$ Variance of X: $S_X^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$ • Covariance of X and Y: a measure of how X and Y vary together. Covariance will be close to zero when X and Y are unrelated. It will be greater than zero when the relationship is positive and less than zero when the relationship is negative. • Variance of X: we have talked a lot about variance in the dependent variable. This is simply the variance for the independent variable.

Estimating the Intercept The regression line always goes through the point corresponding to the mean of both X and Y, by definition. So we utilize this information to solve for a: $a = \bar{Y} - b\bar{X}$
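The slope and intercept calculations can be sketched in a few lines: the slope is the covariance of X and Y divided by the variance of X, and the intercept is whatever value makes the line pass through the two means. The seniority and salary figures below are hypothetical and are not the data from the slides.

```python
from statistics import mean

# Hypothetical data: years of seniority (x) and salary in $1,000s (y).
x = [1, 3, 5, 7, 9]
y = [30, 34, 41, 44, 51]

x_bar, y_bar = mean(x), mean(y)
n = len(x)

# Sample covariance of X and Y, and sample variance of X (both divide by N - 1).
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b = cov_xy / var_x       # slope
a = y_bar - b * x_bar    # intercept: forces the line through (x_bar, y_bar)

print(f"b = {b:.3f}, a = {a:.3f}")
```

Because both covariance and variance divide by N - 1, that factor cancels in the ratio, so the slope depends only on the sums of cross-products and squared deviations.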

Back to the original scatterplot:

A Representative Line

Other Representative Lines

Calculating the Regression Equation

The Least Squares Line!

Summary: Properties of the Regression Line • Represents the predicted values for Y for any and all values of X. • Always goes through the point corresponding to the mean of both X and Y. • It is the best fitting line in that it minimizes the sum of the squared deviations. • Has a slope that can be positive or negative.
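Two of these properties are easy to check numerically: the least-squares line passes through the point of means, and any other line gives a larger sum of squared deviations. The sketch below uses invented data, and the three alternative lines are arbitrary choices made for illustration.

```python
from statistics import mean

# Hypothetical data, invented for illustration.
x = [2, 4, 6, 8, 10]
y = [5, 9, 10, 15, 16]

x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

def sse(intercept, slope):
    """Sum of squared vertical distances from the points to a line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

least_squares_sse = sse(a, b)

# Other candidate lines: nudge the intercept, the slope, or both.
other_lines = [(a + 1, b), (a, b + 0.2), (a - 1, b - 0.1)]
for intercept, slope in other_lines:
    print(f"a={intercept:.2f} b={slope:.2f} SSE={sse(intercept, slope):.2f}")
print(f"least squares: a={a:.2f} b={b:.2f} SSE={least_squares_sse:.2f}")
```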

Prediction Errors Back to our original data… Consider the prediction of Y for one country: Norway. Norway’s observed Y = 73

Take your best guess? If you didn’t know the percentage of citizens in Norway who agreed to pay higher prices for environmental protection (Y), what would you guess? The mean for Y, or $\bar{Y} = 56.45$ (the horizontal line in Figure 8). With this prediction the error for Norway is: $Y - \bar{Y} = 73 - 56.45 = 16.55$

IMPROVING THE PREDICTION • Let’s see if we can reduce the error of prediction for Norway by using the linear regression equation: • The new error of prediction is: $Y - \hat{Y} = 10.83$ • Have we improved the prediction? • Yes! By 5.72 (16.55 - 10.83 = 5.72)

SUM OF SQUARED DEVIATIONS • We have looked only at Norway. To measure the deviations from the mean for all the cases, we square the deviations and sum them; we call the result the total sum of squares or SST: $SST = \sum (Y - \bar{Y})^2$ • The sum of squared deviations from the regression line is called the error sum of squares or SSE: $SSE = \sum (Y - \hat{Y})^2$

MEASURING THE IMPROVEMENT IN PREDICTION • The improvement in the prediction error resulting from our use of the linear prediction equation is called the regression sum of squares or SSR. It is calculated by subtracting SSE from SST or: • SSR=SST-SSE
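The three sums of squares can be computed side by side to show how they fit together. This is a sketch on invented data rather than the GNP data from the slides; the variable names are chosen here for readability.

```python
from statistics import mean

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [3, 4, 8, 7, 11, 12]

x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]   # predicted values from the regression line

sst = sum((yi - y_bar) ** 2 for yi in y)                # error using the mean alone
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error using the regression line
ssr = sst - sse                                         # improvement in prediction

print(f"SST={sst:.2f}  SSE={sse:.2f}  SSR={ssr:.2f}")
```

For a least-squares line with an intercept, SSR computed this way also equals the sum of squared distances between the predicted values and the mean of Y.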

EXAMPLE: GNP AND WILLINGNESS TO PAY MORE Calculating the error sum of squares (SSE)

Example: GNP and Willingness to Pay More • We already have the total sum of squares (SST) from Table 4: SST = 3,032.7 • The regression sum of squares or SSR is thus: • SSR = SST - SSE = 3,032.7 - 2,625.92 = 406.78

Coefficient of Determination • Coefficient of Determination ($r^2$) – A PRE measure reflecting the proportional reduction of error that results from using the linear regression model. • The total sum of squares (SST) measures the prediction error when the independent variable is ignored (E1): E1 = SST • The error sum of squares (SSE) measures the prediction errors when using the independent variable and the linear regression equation (E2): E2 = SSE

Coefficient of Determination Thus... $r^2 = \frac{E1 - E2}{E1} = \frac{SST - SSE}{SST} = \frac{3{,}032.7 - 2{,}625.92}{3{,}032.7} = 0.13$ $r^2 = 0.13$ means: by using GNP and the linear prediction rule to predict Y (the percentage willing to pay higher prices), the error of prediction is reduced by 13 percent (0.13 × 100). $r^2$ also reflects the proportion of the total variation in the dependent variable, Y, explained by the independent variable, X.
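The 0.13 figure can be reproduced from the sums of squares reported in the GNP example. The sketch below uses only those two numbers; the variable names are chosen here for illustration.

```python
# Sums of squares reported on the slides for the GNP example.
SST = 3032.7    # total sum of squares (E1)
SSE = 2625.92   # error sum of squares (E2)

SSR = SST - SSE          # regression sum of squares
r_squared = SSR / SST    # proportional reduction of error: (E1 - E2) / E1

print(f"SSR = {SSR:.2f}, r^2 = {r_squared:.2f}")
```

Because SSE can never exceed SST for a least-squares line, this ratio always falls between 0 and 1.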

Coefficient of Determination $r^2$ can also be calculated using this equation…

The Correlation Coefficient • Pearson’s Correlation Coefficient (r) – The square root of $r^2$, taking the sign of the slope b. It is a measure of association between two interval-ratio variables. • Symmetrical measure – No specification of independent or dependent variables. • Ranges from –1.0 to +1.0. The sign (±) indicates direction. The closer the number is to ±1.0, the stronger the association between X and Y.

The Correlation Coefficient • r = 0 means that there is no linear association between the two variables. • r = +1 means a perfect positive correlation. • r = –1 means a perfect negative correlation.
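The three reference values of r can be reproduced with a short calculation. This is a sketch using the standard covariance-based formula for Pearson's r; the function name `pearson_r` and the three small data sets are invented here to produce r = +1, r = –1, and r = 0.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's r: covariance of X and Y over the product of their standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    # The N - 1 divisors cancel, so raw sums are enough.
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
perfect_positive = [2 * xi + 1 for xi in x]    # every point on a rising line
perfect_negative = [10 - 3 * xi for xi in x]   # every point on a falling line
no_linear_trend = [2, 1, 4, 1, 2]              # covariance with x is exactly zero

for label, y in [("positive", perfect_positive),
                 ("negative", perfect_negative),
                 ("none", no_linear_trend)]:
    print(label, round(pearson_r(x, y), 3))
```

Swapping the two arguments gives the same result, which is what makes r a symmetrical measure.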

Testing the Significance of $r^2$ using Anova • $r^2$ is an estimate based on sample data. • We test it for statistical significance to assess whether the linear relationship it expresses could be zero in the population, that is, whether the sample result could be due to chance alone. • This technique, analysis of variance (Anova), is based on the regression sum of squares (SSR) and the error sum of squares (SSE).

Determining df • There are degrees of freedom (df) associated with both the regression sum of squares (SSR) and the error sum of squares (SSE). • For SSR, df = K, where K is the number of independent variables in the regression equation. In the bivariate case df = 1. • For SSE, df = N - (K + 1). In the bivariate case df = N - (1 + 1) = N - 2.
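The slides shown here stop at the degrees of freedom, so the following is only a sketch of the usual next step: each sum of squares is divided by its df to give a mean square, and the F ratio is the regression mean square over the error mean square. SSR and SSE are the values from the GNP example; the sample size n = 16 is an assumption made for illustration, and the function name `f_statistic` is hypothetical.

```python
def f_statistic(ssr, sse, n, k=1):
    """F ratio for a regression with k independent variables and n cases."""
    df_regression = k            # df for SSR
    df_error = n - (k + 1)       # df for SSE
    msr = ssr / df_regression    # mean square regression
    mse = sse / df_error         # mean square error
    return msr / mse, df_regression, df_error

# SSR and SSE come from the GNP example on the slides;
# n = 16 is an ASSUMED sample size, used here only for illustration.
f_value, df1, df2 = f_statistic(ssr=406.78, sse=2625.92, n=16, k=1)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

The resulting F value would then be compared against the F distribution with those two df to judge statistical significance.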
