This class agenda covers topics such as correlation, simple linear regression, SPSS tutorial, testing significance of correlations, and assumptions. It also includes examples and resources for further learning.
Psych 706: Stats II, Class #6
Agenda • Assignment #3 due 4/5 • Correlation (Review) • Simple Linear Regression • Review Exam #1 tests • SPSS Tutorial: Simple Linear Regression
Correlation • Pearson's correlation: standardized measure of covariance • Bivariate • Partial • Assumptions: linearity and normality (outliers are a big deal here) • When assumptions are not met for Pearson's, use other bivariate measures: • Spearman's rho – rank-orders the data • Kendall's tau – use for small sample sizes or lots of tied ranks • Testing the significance of correlations (see the formula sketch below): • Is one correlation different from zero? • Are correlations significantly different between two samples? http://www.quantpsy.org/corrtest/corrtest.htm • Are correlations significantly different within one sample? http://quantpsy.org/corrtest/corrtest2.htm
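The quantpsy pages linked above are calculators; a sketch of the standard textbook formulas behind these questions (an assumption on my part, not taken from the slides): the usual test of whether a single correlation differs from zero is t = r * sqrt(N - 2) / sqrt(1 - r^2) with df = N - 2, which is the test SPSS reports alongside each correlation. To compare correlations from two independent samples, each r is first Fisher-transformed, z_r = 0.5 * ln((1 + r) / (1 - r)), and then Z = (z_r1 - z_r2) / sqrt(1/(N1 - 3) + 1/(N2 - 3)) is compared to +/-1.96 for a two-tailed test at p < .05. Comparing correlations within one sample (the second calculator) uses a related but more involved test for dependent correlations (e.g., Steiger's z).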
Correlation example • Class #4 on Blackboard: Album Sales.spv • Do the following predictors share variance with the following outcome? • X1 = Advertising budget • X2 = Number of plays on the radio • X3 = Rated attractiveness of band members (0 = hideous potato heads, to 10 = gorgeous sex objects) • Y = Number of albums sold • Right now we are not going to worry about assumptions (linearity, etc.)
SPSS bivariate correlations • Analyze → Correlate → Bivariate • Move the variables you want to correlate into the Variables box • Click two-tailed and flag significant correlations • Click Pearson and/or Spearman's and/or Kendall's tau
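The same analysis can be run from a syntax window; a sketch, assuming the variables in the Album Sales file are named Adverts, Airplay, Attract, and Sales (check Variable View for the actual names):

* Pearson correlation matrix (two-tailed, significant correlations flagged).
CORRELATIONS
  /VARIABLES=Adverts Airplay Attract Sales
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
* Non-parametric equivalents: Spearman's rho and Kendall's tau.
NONPAR CORR
  /VARIABLES=Adverts Airplay Attract Sales
  /PRINT=BOTH TWOTAIL NOSIG
  /MISSING=PAIRWISE.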
SPSS partial correlations • Analyze → Correlate → Partial • Move the variables you want to correlate (Album Sales and Radio Plays) into the Variables box • Put Band Attractiveness in the Controlling For box • Click two-tailed and display actual significance level
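The partial correlation as syntax; a sketch, assuming the same hypothetical variable names (Sales, Airplay, Attract); /STATISTICS=CORR also prints the zero-order correlations for comparison:

* Partial correlation of Sales and Airplay, controlling for Attract.
PARTIAL CORR
  /VARIABLES=Sales Airplay BY Attract
  /SIGNIFICANCE=TWOTAIL
  /STATISTICS=CORR
  /MISSING=LISTWISE.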
SPSS partial correlations • The correlation between Album Sales and Radio Plays decreased from .599 (bivariate correlation) to .580 when removing the variance shared with Band Attractiveness, and the correlation is still significant • Conclusion: Radio Plays shares significant unique variance with Album Sales that is not shared with Band Attractiveness
How is regression related to correlation? • Correlation indicates the strength (and direction) of the relationship between two variables, X and Y • In regression analyses, you can easily compare the degree to which multiple X variables predict Y within the same statistical model • In this graph, since there is only one X variable, the data in the scatterplot can be quantified either way: as a correlation (standardized) or as a regression equation (unstandardized)
Simple regression • Correlation is standardized, but regression is not • As a result, we include an intercept in the model • Equation for a straight line ("linear model") • Outcome = Intercept + Predictor Variable(s) + Error • Y = b0 + b1X + E, where b0 and b1 are the regression coefficients and b1 is the slope
Equation for a straight line • b0 • Intercept (expected mean value of Y when X = 0) • Point at which the regression line crosses the Y-axis (ordinate) • b1 • Regression coefficient for the predictor • Gradient (slope) of the regression line • Direction/Strength of Relationship
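A quick made-up example: if b0 = 50 and b1 = 2, the line is Y = 50 + 2X, so a case with X = 10 has a predicted value of 50 + 2(10) = 70; the error E for that case is its observed Y minus that predicted 70.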
Assumptions of the linear model • Linearity and Additivity • Errors (also called residuals) should be independent of each other AND normally distributed • Homoscedasticity • Predictors should be uncorrelated with “external variables” • All predictor variables must be quantitative/continuous or categorical • Outcome variable must be quantitative/continuous • No multicollinearity (no perfect correlation between predictor variables if there’s more than one) • BIGGEST CONCERN: Outliers!!!!
How good is our regression model? • The regression line is only a model based on the data. • This model might not reflect reality. • We need some way of testing how well the model fits the observed data. • Enter SUMS OF SQUARES!
Sums of squares • SS total = squared differences between each data point and the mean of Y
Sums of squares • SS total = squared differences between each data point and the mean of Y • SS model = squared differences between the regression line (the model's predicted values) and the mean of Y
Sums of squares • SS total = squared differences between each data point and the mean of Y • SS model = squared differences between the regression line (the model's predicted values) and the mean of Y • SS residual = squared differences between each data point and the regression line
Sums of squares • SS total = squared differences between each data point and the mean of Y • SS model = squared differences between the regression line (the model's predicted values) and the mean of Y • SS residual = squared differences between each data point and the regression line • R² = SS model / SS total • F = MS model / MS residual
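A made-up example of how the pieces fit together: if SS total = 100 and SS model = 60, then SS residual = 100 - 60 = 40 and R² = 60 / 100 = .60, meaning the model accounts for 60% of the variance in Y. Each MS is its SS divided by its degrees of freedom (df model = number of predictors; df residual = N - number of predictors - 1), and F = MS model / MS residual tests whether the model explains significantly more variance than expected by chance.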
One-Way ANOVA Review [Figure: individual scores shown as pink, green, and blue points; the three group means as pink, green, and blue lines; the grand mean (overall mean of all scores regardless of group) as a black line] • SS Total = difference between each score and the grand mean • SS Model = difference between each group mean and the grand mean • SS Residual = difference between each score and its group mean
Regression • ANOVA (F test) is used to test the OVERALL regression model: whether all predictor variables together (X1, X2, X3) share significant variance with the outcome variable (Y) • t-tests are used to test SIMPLE effects: whether individual predictors' slopes (X1, X2, or X3) are significantly different from zero • This is similar to ANOVA testing whether there is an OVERALL difference between groups and post-hoc comparisons testing SIMPLE effects between specific groups
What is the difference between One-Way ANOVA and simple regression? • They are exactly the same calculations but presented in a different way • In both you have one dependent variable, Y • In ANOVA, your independent variable, X, is required to be categorical • In simple regression, your independent variable, X, can be categorical or continuous • Would it be helpful to see an example of how they are the same next week at the start of class?
Regression Example • Class #4 on Blackboard: Album Sales.spv • How do the following predictors separately and together influence the following outcome? • X1 = Advertising budget • X2 = Number of plays on the radio • X3 = Rated attractiveness of band members (0 = hideous potato heads, to 10 = gorgeous sex objects) • Y = Number of albums sold
Regression assumptions, Part 1 • Linearity and Normality, Outliers • Skewness/Kurtosis z-score calculations • Histograms • Boxplots • Transformations if needed • Scatterplots between all variables • Multicollinearity • Bivariate correlations between predictors should be less than perfect (r < .9) • Non-Zero Variance • Predictors should all have some variance in them (not all the same score) • Type of Variables Allowed • Predictors must be scale/continuous or categorical • Outcome must be scale/continuous • Homoscedasticity • Variance around the regression line should be about the same for all values of the predictor variable (look at scatterplots)
Regression assumptions, part 2 • Errors (also called residuals) should be independent of each other AND normally distributed • Predictors should be uncorrelated with “external variables” = DIFFICULT TO CHECK!!!
Checking assumptions • You could check the assumptions at the same time as running the regression • I like to check assumptions as much as possible BEFORE running the regression so that I can more easily focus on what the actual results are telling me • You can also select extra options in the regression analysis to get a lot of info on assumptions
THIS IS THE PLAN • We are going to check assumptions for all variables in our Album Sales SPSS file as if we were going to run a multiple regression with three predictors • However, we’re going to save that multiple regression for next week • Today we’ll run a simple linear regression first and interpret the output to get you used to looking at the results
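As a preview of the tutorial, here is what the simple regression looks like as syntax (Analyze → Regression → Linear generates the same thing); a sketch, assuming Sales as the outcome and Adverts as the single predictor (substitute whichever predictor we use in class):

* Simple linear regression predicting Sales from Adverts.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI(95) R ANOVA
  /DEPENDENT Sales
  /METHOD=ENTER Adverts.

The Model Summary table gives R and R², the ANOVA table gives the overall F test, and the Coefficients table gives b0 (Constant), b1, and their t-tests.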
Create histograms [Histograms of Y (Album Sales), X1 (Adverts), X2 (Radio Plays), and X3 (Band Attractiveness)]
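A sketch of the corresponding syntax (Graphs → Legacy Dialogs → Histogram does the same thing), again assuming hypothetical variable names; (NORMAL) overlays a normal curve on each histogram:

* Histograms with a normal curve for the outcome and the three predictors.
GRAPH /HISTOGRAM(NORMAL)=Sales.
GRAPH /HISTOGRAM(NORMAL)=Adverts.
GRAPH /HISTOGRAM(NORMAL)=Airplay.
GRAPH /HISTOGRAM(NORMAL)=Attract.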
Divide skewness & kurtosis by their standard errors. Cutoff: anything beyond z = +/-1.96 (p < .05) is problematic [Skewness/kurtosis statistics and standard errors shown for Y, X1, X2, and X3]
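The skewness/kurtosis statistics and their standard errors can be requested in one pass; a sketch (the z-scores are then computed by hand as statistic divided by its standard error):

* Skewness, kurtosis, and their standard errors; suppress the frequency tables.
FREQUENCIES VARIABLES=Sales Adverts Airplay Attract
  /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
  /FORMAT=NOTABLE.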
Next steps • X2 (No. of Plays on Radio) and Y (Album Sales) look normally distributed • Problems with normality for X1 (Adverts) and X3 (Band Attractiveness) • Let's look at boxplots to view outliers/extreme scores • Let's transform the data and see if that fixes the skewness/outlier problem
Box plots [Boxplots of X1 (Adverts) and X3 (Band Attractiveness), showing the outliers/extreme scores]
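Boxplots can be generated via Analyze → Descriptive Statistics → Explore, or with syntax like this sketch:

* Boxplots for the two problem variables.
EXAMINE VARIABLES=Adverts Attract
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.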
By transforming Adverts, the outliers are no longer outliers! [Panels: X1 (Adverts) before and after transformation]
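The slides don't say which transformation was used; a sketch of one common choice, a log transform (the +1 only matters if the variable contains zeros):

* Log-transform the advertising budget to pull in the high outliers.
COMPUTE LogAdverts = LG10(Adverts + 1).
EXECUTE.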
Transformed Band Attractiveness is still skewed, with outliers [Panels: X3 (Band Attractiveness) after transformation]
Let's transform attractiveness scores into z-scores • Analyze → Descriptive Statistics → Descriptives • Put the original Attractiveness variable in the box • Check Save Standardized Values as Variables • New variable created: Zscore: Attractiveness of Band • Plot a histogram of the z-scores • 4 outliers > 3 SD!!!
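The same steps as syntax; a sketch: /SAVE creates the standardized variable, which SPSS names by prefixing a Z (e.g., ZAttract):

* Save z-scores of Attract, then plot them.
DESCRIPTIVES VARIABLES=Attract
  /SAVE.
GRAPH /HISTOGRAM=ZAttract.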
Outliers: a couple of options • You have 200 data points, which is a lot – you could calculate power with the 4 outliers removed and see how much it might affect your ability to find an effect… • You could remove them from the analysis entirely • Document the subject #s, etc., and the reason for removal • Save the data file with a new name (AlbumSales_Minus4outliersOnAttract.sav) • You could replace the 4 outliers with the next highest score on Attract, which is a '3', or you could replace them with the mean score (both reduce variability, though) – see the sketch below • Document this change • Save the file with a new name (AlbumSales_4outliersOnAttractmodified.sav)
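A sketch of the replace-with-the-next-highest-score option, assuming the 4 outliers are the only cases with |z| > 3 on Attract (using the ZAttract variable created above) and Attract_mod is a new working copy:

* Copy Attract and recode the 4 outlying cases to the next highest score (3).
COMPUTE Attract_mod = Attract.
IF (ABS(ZAttract) > 3) Attract_mod = 3.
EXECUTE.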
Outliers: another option • We could leave the outliers in the data set and run a bunch of extra tests in our regression to see if any of these data points cause undue influence on our overall model • We'll get to those tests during next class • Essentially, you could run the regression with and without the outliers included in the model and see what happens • Data → Select Cases → If condition is satisfied: 3 > ZAttract > -3 • This means include all data points if the z-score value of Attractiveness is within 3 SD
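The Select Cases dialog generates (roughly) the following syntax; a sketch, noting that the condition has to be written with & because SPSS doesn't allow chained inequalities:

* Filter out cases more than 3 SD from the mean on Attract.
USE ALL.
COMPUTE filter_$ = (ZAttract > -3 & ZAttract < 3).
FILTER BY filter_$.
EXECUTE.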
Next steps • Let’s say we went with deleting the 4 outliers • Now let’s look at other potential outliers using scatterplots • This will also show us the relationships between the variables (positive versus negative) • This will also let us check the homoscedasticity assumption: The variance around the regression line should be about the same for all values of the predictor variable
Scatterplots: homoscedasticity check [Scatterplots of Y (Album Sales) against each predictor: X1 (Adverts), X2 (Radio Plays), and X3 (Band Attractiveness)]
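A sketch of the scatterplot syntax (Graphs → Legacy Dialogs → Scatter/Dot is the menu route), one plot of the outcome against each predictor:

* Outcome vs. each predictor, to eyeball linearity and homoscedasticity.
GRAPH /SCATTERPLOT(BIVAR)=Adverts WITH Sales.
GRAPH /SCATTERPLOT(BIVAR)=Airplay WITH Sales.
GRAPH /SCATTERPLOT(BIVAR)=Attract WITH Sales.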
Multicollinearity check: bivariate correlations • Pearson's (parametric) [Correlation matrix of Y, X1, X2, and X3; check that no correlation between predictors approaches r = .9]
Just checking out the relationship of predictors to the outcome variable • Pearson's (parametric) [Same matrix, focusing on the correlations of X1, X2, and X3 with Y]
Multicollinearity check: bivariate correlations • Kendall's tau (non-parametric) [Correlation matrix of Y, X1, X2, and X3]
Just checking out the relationship of predictors to the outcome variable • Kendall's tau (non-parametric) [Same matrix, focusing on the correlations of X1, X2, and X3 with Y]
Non-zero variance assumption • Analyze → Descriptive Statistics → Frequencies • Move the variables into the box, click Statistics, select Variance and Range [Output: variance and range for Y, X1, X2, and X3]
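As syntax; a sketch with the same hypothetical variable names:

* Variance and range for every variable; suppress the frequency tables.
FREQUENCIES VARIABLES=Sales Adverts Airplay Attract
  /STATISTICS=VARIANCE RANGE MINIMUM MAXIMUM
  /FORMAT=NOTABLE.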
Predictor variables must be quantitative/continuous or categorical • Look at the Measure column in SPSS Variable View: are X1, X2, and X3 set as Scale or Nominal variables?