
Correlation and Regression Analysis Workshop

Learn to analyze bivariate data through scatterplots, correlation coefficients, and linear regression models in SPSS. Interpret results and understand relationships between continuous variables. Explore real-world examples and datasets.


Presentation Transcript


  1. Correlation and Regression – Mathematics & Statistics Help, University of Sheffield

  2. Learning outcomes
  By the end of this session you should know about:
  • Approaches to analysis for simple continuous bivariate data
  By the end of this session you should be able to:
  • Construct and interpret scatterplots in SPSS
  • Identify when it is appropriate to use correlation
  • Calculate a correlation coefficient in SPSS
  • Interpret a correlation coefficient
  • Identify when it is appropriate to use linear regression
  • Run a simple regression model in SPSS
  • Interpret the results of a linear regression model

  3. Download the slides from the MASH website MASH > Resources > Statistics Resources > Workshop materials

  4. Association between two continuous variables: correlation or regression?
  Two basic questions:
  • Is there a relationship? No causation is implied, simply association. Use CORRELATION.
  • How can we use the value of one variable to predict the value of the other variable? May be causal, may not be. Use REGRESSION.

  5. Correlation: are two continuous variables associated?
  When examining the relationship between two continuous variables, ALWAYS look at the scatterplot first to see the pattern of the relationship between them.

  6. Scatterplot
  Relationship between two continuous variables. Explores the way the two co-vary (correlate):
  • Positive / negative
  • Linear / non-linear
  • Strong / weak
  • Presence of outliers
  [Scatterplot: a linear relationship with one labelled outlier]

  7. Scatterplots

  8. Correlation Coefficient r
  Measures the strength of a linear relationship between two continuous variables and can take values between −1 and +1.
  [Three example scatterplots: r = 0.9, r = 0.01, r = −0.9]
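
SPSS computes r for you, but as a rough illustration of the formula (covariance of the two variables divided by the product of their spreads), here is a pure-Python sketch with made-up data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data showing a strong positive linear relationship
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(pearson_r(x, y), 3))  # → 0.999
```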

  9. Correlation: Interpretation
  Cohen (1992) suggested the following interpretation of the size of the coefficient:
  • |r| around 0.1: small
  • |r| around 0.3: medium
  • |r| around 0.5: large
  Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
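
These benchmarks can be coded as a small helper (an illustrative sketch, not SPSS output; the cut-offs follow Cohen's 1992 values of 0.1, 0.3 and 0.5):

```python
def cohen_strength(r):
    """Classify |r| using Cohen's (1992) benchmarks: 0.1 small, 0.3 medium, 0.5 large."""
    r = abs(r)
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "negligible"

# The gestation / birth weight correlation from later in the deck
print(cohen_strength(0.708))  # → large
```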

  10. Relationship is not assumed to be a causal one – it may be caused by other factors Does chocolate make you clever or crazy? • A paper in the New England Journal of Medicine claimed there was a relationship between chocolate and Nobel Prize winners http://www.nejm.org/doi/full/10.1056/NEJMon1211064

  11. Chocolate and serial killers What else is related to chocolate consumption? http://www.replicatedtypo.com/chocolate-consumption-traffic-accidents-and-serial-killers/5718.html

  12. Dataset for today: Birthweight_reduced_data
  Factors affecting birth weight of babies.
  • Mother smokes = 1
  • Standard gestation = 40 weeks

  13. Exercise 1: Gestational age and birthweight
  Draw a line of best fit through the data (with roughly half the points above and half below) and describe the relationship. Is the relationship:
  • strong / weak?
  • positive / negative?
  • linear?

  14. Exercise 2: Interpretation Interpret the following correlation coefficients using Cohen’s classification and explain what they mean. Which correlations seem meaningful?

  15. Scatterplot in SPSS
  Graphs → Legacy Dialogs → Scatter/Dot

  16. Scatterplot in SPSS
  Graphs → Legacy Dialogs → Scatter/Dot

  17. Correlation in SPSS
  Analyze → Correlate → Bivariate → Pearson
  Use Spearman's correlation for ordinal variables or skewed scale data.

  18. Scatterplot and correlation SPSS output using reduced baby weight data set Pearson correlation r = 0.708 Strong relationship

  19. Hypothesis test for the correlation coefficient
  • Can be done; the null hypothesis is that the population correlation ρ = 0
  • Not recommended, as the p-value is strongly influenced by the number of observations
  • Better to use Cohen's interpretation

  20. Hypothesis test: Influence of sample size

  21. And so what do correlations of 0.63 (n=10) and 0.16 (n=150) look like? Correlation=0.63, p=0.048 (n=10) Correlation=0.16, p=0.04 (n=150)
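
The influence of sample size can be seen from the test statistic for H0: ρ = 0, which is t = r√((n − 2)/(1 − r²)) on n − 2 degrees of freedom. A quick Python sketch (illustrative only) shows that the deck's two very different correlations give similarly borderline t statistics:

```python
import math

def t_for_r(r, n):
    """t statistic for testing H0: population correlation rho = 0 (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Strong r with a small sample vs weak r with a large sample
print(round(t_for_r(0.63, 10), 2))   # → 2.29
print(round(t_for_r(0.16, 150), 2))  # → 1.97
```

Both values sit near the significance threshold, which is why the slides recommend judging strength with Cohen's scale rather than the p-value.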

  22. Points to note
  • Do not assume causality
  • Be careful comparing the correlation coefficient, r, from different studies with different n
  • Do not assume the scatterplot looks the same outside the range of the axes
  • Use Cohen's scale to interpret, rather than the p-value
  • Always examine the scatterplot!

  23. Exercise 3a: Scatterplot
  Use Transform > Recode into Different Variables to construct a variable for maternal smoking status (non-smoker / smoker). Construct a scatterplot for birthweight and gestational age, using Set Markers by to distinguish between smokers and non-smokers.
  • Is there evidence of a linear relationship?
  • Interpret the correlation coefficient. What does it mean?
  Note:
  • Think about which variable should be on the x axis (horizontal) and which should be on the y axis (vertical)
  • If you double-click on the graph you can open the Chart Editor and edit the chart, for example to change the colours used for smokers and non-smokers

  24. Exercise 3b: Scatterplot & Correlation
  Construct a scatterplot and calculate Pearson's correlation coefficient for birthweight and maternal pre-pregnancy weight.
  • Is there evidence of a linear relationship?
  • Interpret the correlation coefficient. What does it mean?
  Note: think about which variable should be on the x axis (horizontal) and which should be on the y axis (vertical).

  25. Association between two continuous variables: correlation or regression?
  Two basic questions:
  • Is there a relationship? No causation is implied, simply association. Use CORRELATION.
  • How can we use the value of one variable to predict the value of the other variable? May be causal, may not be. Use REGRESSION.

  26. Simple linear regression
  • Regression quantifies the relationship between two continuous variables
  • It involves estimating the best straight line with which to summarise the association
  • The relationship is represented by an equation, the regression equation
  • It is useful when we want to look for significant relationships between two variables, or predict the value of one variable for a given value of the other

  27. Independent / dependent variables
  Does attendance have an association with exam score? Does temperature have an impact on the growth rate of a cell culture?
  The INDEPENDENT (explanatory / predictor) variable (x) affects the DEPENDENT (outcome) variable (y).

  28. Does gestational age have an association with birth weight?

  29. Regression
  Simple linear regression looks at the relationship between two continuous variables by producing an equation for a straight line of the form
  y = a + bx
  where y is the dependent variable, x is the independent variable, a is the intercept and b is the slope. You can use this to predict the value of the dependent (outcome) variable for any value of the independent (explanatory) variable.

  30. Birth weight example equation
  Birth weight (y) = −3.03 + 0.16 × gestational age (x)
  Here a = −3.03 (intercept) and b = 0.16 (slope), i.e. for every extra week of gestation, birth weight increases by 0.16 kg.
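
The fitted equation can be used for prediction directly. A minimal Python sketch of the deck's equation (coefficients taken from the slide):

```python
def predict_birth_weight(gestation_weeks):
    """Predicted birth weight (kg) from the fitted equation y = -3.03 + 0.16x."""
    return -3.03 + 0.16 * gestation_weeks

# A baby born at the standard 40 weeks of gestation
print(round(predict_birth_weight(40), 2))  # → 3.37
```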

  31. Birth weight example – Slope
  The slope b is the average change in the Y variable for a change of one unit in the X variable. Here b = 0.16, so an extra 0.16 kg for every extra week of gestation.

  32. Birth weight example – Intercept
  [Scatterplot: Y = response variable (dependent), X = predictor / explanatory variable (independent)]

  33. Estimating the best fitting line
  • We try to fit the “best” straight line
  • The standard way to do this is the method of least squares, carried out by computer
  • Residuals are the differences between the observed and predicted values for each observation
  • The least squares method chooses the line that minimises the sum of the squared residuals over all points

  34. Line of best fit Residuals = observed - predicted
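
The least-squares recipe above fits in a few lines of Python (an illustrative sketch with made-up data; SPSS does this for you):

```python
def least_squares(x, y):
    """Least-squares slope and intercept: minimises the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    a = my - b * mx
    return a, b

# Made-up data lying close to the line y = 1 + 2x
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # observed - predicted
print(round(a, 2), round(b, 2))  # → 1.06 1.97
```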

  35. Hypothesis testing in regression
  • Regression finds the best straight line with which to summarise an association
  • It is useful when we want to look for significant relationships between variables
  • The slope is tested for significance. If there is no relationship, the gradient of the line (b) would be 0, i.e. the regression line would be a horizontal line crossing the y axis at the average value of the y variable

  36. Regression in SPSS
  Analyze → Regression → Linear

  37. Output from SPSS: key regression table
  Y = −3.029 + 0.162X, p-value < 0.001
  • As p < 0.05, gestational age is a significant predictor of birth weight
  • Weight increases by 0.16 kg for each week of gestation

  38. Output from SPSS: ANOVA table ANOVA compares the null model (mean birth weight for all babies) with the regression model Null model: y = 3.31 Regression model: y = -3.02 + 0.16x

  39. Output from SPSS: ANOVA table
  Does a model containing gestational age predict significantly more accurately than just using the mean birth weight for all babies? Yes, as p < 0.001.
  Total degrees of freedom = number of subjects included in the analysis − 1

  40. How reliable are predictions? Using R²
  How much of the variation in birth weight is explained by the model including gestational age? R² gives the proportion of the variation in birth weight explained by the model: R² = 0.502, i.e. about 50%, so predictions using the model are fairly reliable. Which other variables may help improve the fit of the model? Compare models using adjusted R², as this adjusts for the number of variables in the model.
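
R² can be computed from the residuals as 1 − SS_res/SS_tot, and for simple linear regression it equals r². A small Python sketch (illustrative, not SPSS output):

```python
def r_squared(y, y_pred):
    """Proportion of variation in y explained by the model: 1 - SS_res / SS_tot."""
    my = sum(y) / len(y)
    ss_res = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Perfect predictions explain all the variation ...
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 1.0
# ... and in simple regression R^2 is just r squared: 0.708^2 is about 0.50
print(round(0.708 ** 2, 2))  # → 0.5
```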

  41. Exercise 4 Investigate whether mother’s pre-pregnancy weight and birth weight are associated using a simple linear regression

  42. Exercise 4: regression
  Adjusted R² = ______. Does the model result in reliable predictions?
  ANOVA p-value = ______. Is the model an improvement on the null model (where every baby is predicted to be the mean weight)?

  43. Exercise 4: Regression
  Pre-pregnancy weight coefficient and p-value:
  Regression equation:
  Interpretation:

  44. Assumptions for regression
  • The relationship between the variables is linear
  • The residuals are approximately normally distributed
  • The residuals have constant variance (homoscedasticity)
  • The observations are independent

  45. Checking assumptions: normality of residuals
  Residuals are the observed values minus the values predicted by the model (fitted values): Y_obs − Y_fit, i.e. the vertical lines on the plot below. It is the residuals that need to be normally distributed, not the data.

  46. Checking assumptions: normality of residuals
  • Use standardised residuals to check the assumptions. Outliers are values < −3 or > 3
  • Select a histogram of the residuals and a scatterplot of predicted values vs residuals
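
SPSS produces standardised residuals for you; as a rough sketch of the idea (residuals rescaled to mean 0 and SD 1, with made-up values), in Python:

```python
import math

def standardised_residuals(residuals):
    """Scale residuals to mean 0 and SD 1; values beyond +/-3 suggest outliers."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [(r - mean) / sd for r in residuals]

# Made-up residuals from a fitted model
res = [0.2, -0.1, 0.3, -0.4, 0.1, -0.1, 0.0]
outliers = [z for z in standardised_residuals(res) if abs(z) > 3]
print(len(outliers))  # → 0: no standardised residual lies beyond +/-3
```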

  47. Checking assumptions: normality Histogram looks approximately normally distributed When writing up, just say ‘normality checks were carried out on the residuals and the assumption of normality was met’

  48. Predicted values against residuals
  Are there any patterns as the predicted values increase? There is a problem with homoscedasticity if the scatter is not random, for example a funnel or curved shape in the residual plot.

  49. Exercise 5 Re-run the regression model, but this time, produce the residual plots. Do you think that the assumptions of normality of residuals and homogeneity of variance are met?
