Bivariate Relationships

Statistics Using SPSS: An Integrative Approach, Second Edition. Sharon Lawner Weinberg and Sarah Knapp Abramowitz. Chapter 5: Bivariate Relationships

Summarizing the Relationship Between Two Variables: An Overview

The Relationship Between Two Scale Variables: What the Scatterplot Tells Us • Whether the relationship appears linear • If it does appear linear, it also tells us: • The direction and nature of the linear relationship • The relative strength of the linear relationship

Overview: Examples Using the Scatterplot to Describe the Relationship Between Two Scale Variables • Hamburg data set: FAT and CALORIES. • States data set: PERTAK (percentage of eligible students taking the SAT) and SATV (average verbal SAT score for the state). • Currency data set: BILLVALUE (bill denomination) and the number of bills in circulation. • Marijuana data set: YEAR and the percentage of students reporting that they ever smoked marijuana, 1987-1999.

Creating the Scatterplot Example: Using the Hamburg data set, describe the relationship between the FAT and the CALORIES of a burger. Solution: To obtain the scatterplot between FAT and CALORIES for the Hamburg data set using SPSS, go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis and FAT into the box labeled x-axis and click OK.

Scatterplot: FAT vs. CALORIES

Interpreting the Scatterplot of FAT vs. CALORIES • A line appears to fit the data well; i.e., there is not a simple curve that would provide a better fit, so a linear model is appropriate. • The direction of the linear relationship is positive because the slope of the line representing the data is positive. The nature of the linear relationship is that burgers that are relatively high in fat tend also to be relatively high in calories. • The strength of the linear relationship appears to be strong because the points cluster tightly around the line.

Editing the Scatterplot to Label Points Go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis, FAT into the box labeled x-axis, and NAME into the box labeled Label Cases by, and click OK. Double click on the graph to put it in the Chart Editor. Click on Elements, Show Data Labels. Move NAME to the Displayed box and remove Count. Click Apply, Close.

Labeled Scatterplot: FAT vs CALORIES

Scatterplot: PERTAK vs. SATV

Interpreting the Scatterplot of SATV vs. PERTAK • Although the points have a curvilinear shape, a line would appear to represent these points reasonably well, and so we will use it in this case. • The direction of the linear relationship is negative because the slope of the line representing the data is negative. The nature is that states with a relatively low percentage of students taking the SAT tend to have higher SAT Verbal scores, on average. • The strength of the linear relationship is more moderate than for the hamburger example because the points in this case do not cluster as tightly around the line.

Scatterplot: Denomination (BILLVALUE) vs. number of bills in circulation. Note: Use Transform, Compute to combine variables to create a variable for the number of bills in circulation.

Interpreting the Scatterplot of BILLVALUE vs. NUMBER • Because the points have a “cloud like” formation, neither a simple curve nor a line is a good fit for these data. • We conclude that there is little or no relationship between the bill value and the number in circulation.

Scatterplot: Year vs. percentage of high school seniors reporting that they smoked marijuana at least once: 1987-1999. Note: Use Select Cases to restrict to the appropriate years.

Interpreting the Scatterplot of YEAR vs. MARIJUANA • A simple curve (or two lines) provides a better fit for the data than a single line and is therefore more appropriate than a line for modeling the data. • The relationship between marijuana use and year is non-linear.

Quantifying the Linear Relationship between Two Scale Variables: Pearson Product Moment Correlation Coefficient • Often called, simply, correlation, and symbolized by the letter r. • Before calculating, use a scatterplot to verify that the relationship between the variables appears to be linear. • Calculated as the average of the product of the z-scores. • This summary statistic measures the direction, nature, and strength of the linear relationship. • Direction: Look at the sign of r (positive or negative). • Nature: Look at the sign of r (positive means that high scores on one variable correspond to high scores on the other and low with low; negative means that low scores on one variable correspond to high scores on the other and vice versa). • Strength: Look at the magnitude (absolute value) of r. In the social sciences, a good rule of thumb comes from Cohen’s scale: |r| < .1 little or no relationship; .1 <= |r| < .3 weak; .3 <= |r| < .5 moderate; |r| >= .5 strong.
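The definition of r as the average product of z-scores can be sketched in a few lines of Python. This is only an illustration, not the SPSS computation itself: the fat and calorie values are made up (they are not the Hamburg data set), the function names pearson_r and cohen_label are hypothetical, and the z-scores assume the standard deviation with divisor n so that the plain average of the products equals r. The strength labels read Cohen's moderate band as .3 to .5.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r as the average product of z-scores (SD computed with divisor n)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    z_products = [((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                  for a, b in zip(x, y)]
    return sum(z_products) / n

def cohen_label(r):
    """Strength label from the magnitude of r, using Cohen's rule of thumb."""
    size = abs(r)
    if size < 0.1:
        return "little or no"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"

# Made-up illustrative values, NOT the Hamburg data set.
fat = [10, 20, 30, 40, 50]
calories = [250, 340, 460, 540, 660]

r = pearson_r(fat, calories)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f}: {cohen_label(r)} {direction} linear relationship")
```

The sign of r gives the direction and nature, and only the absolute value is passed through the strength scale, so r = -.86 and r = .86 receive the same label.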

Obtaining the Pearson Correlation Using SPSS To use SPSS to obtain the correlation coefficient between CALORIES and FAT, click Analyze on the Main Menu Bar, Correlate, and Bivariate. Move the two variables, CALORIES and FAT, into the Variables box and click OK.

Interpreting the Pearson Correlation Coefficients • The correlation between FAT and CALORIES is .997 indicating a very strong positive linear relationship: burgers that are relatively high in fat tend also to be relatively high in calories and burgers that are relatively low in fat tend also to be relatively low in calories. • The correlation between SAT Verbal and the percentage of students taking the SAT is -.86 indicating a strong negative linear relationship: states that have a relatively high verbal SAT average tend to have a relatively low percentage of students taking the SAT and states that have a relatively low verbal SAT average tend to have a relatively high percentage of students taking the SAT.

Other Properties of Correlation • The strength of the correlation is measured on an ordinal scale. • Correlation does not imply causation; i.e., when two variables are correlated, it is not necessarily true that changing one will result in a predictable change in the other. • A linear transformation applied to one variable does not change the magnitude of the correlation. The sign of the correlation will change, however, if the transformation involves multiplication by a negative number. • Restricting the range of one of the variables can increase or decrease the magnitude of the correlation.
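The linear transformation property can be checked numerically. The sketch below uses made-up values and a hypothetical helper pearson_r (written here with the equivalent sum-of-cross-products formula); it is an illustration of the property, not a proof.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative values.
x = [2, 4, 5, 7, 9, 12]
y = [1, 3, 2, 6, 5, 9]

r_original = pearson_r(x, y)
# Multiply by a positive number and add a constant: r is unchanged.
r_rescaled = pearson_r([3 * v + 10 for v in x], y)
# Multiply by a negative number: same magnitude, opposite sign.
r_flipped = pearson_r([-2 * v + 5 for v in x], y)

print(r_original, r_rescaled, r_flipped)
```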

Relationships between Two Ordinal Variables or One Ordinal and One Scale Variable: Scatterplot and Spearman Rank Correlation Coefficient The Spearman correlation, called Spearman’s rho, is a special case of the Pearson correlation computed on ranked data. Example: Describe the relationship, or indicate that there is not one, between the amount of time spent in school on homework (HWKIN12) and the amount of time spent out of school on homework (HWKOUT12) in twelfth grade for students in the NELS data set.

Scatterplot: HWKIN12 and HWKOUT12

Obtaining the Spearman Rank Correlation Coefficient Using SPSS Click Analyze, Correlate, Bivariate. Move the variables HWKIN12 and HWKOUT12 into the Variables box. Click Spearman and click off Pearson in the Correlation Coefficients box. Click OK. Note that when using SPSS, we do not need to transform the data to rankings to obtain the Spearman correlation coefficient. SPSS does this transformation for us.
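To make the ranking step concrete, here is a minimal Python sketch of Spearman's rho as a Pearson correlation computed on ranks. It assumes tied values receive the average of the ranks they span, a common convention; the helper names average_ranks, pearson_r, and spearman_rho are hypothetical, and the values are made up rather than taken from the NELS data set.

```python
from math import sqrt

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up values: y always increases with x, but not along a line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson_r(x, y), spearman_rho(x, y))
```

Because only the ordering of the values matters, rho equals 1 for any strictly increasing relationship, while the Pearson r of the raw values falls below 1 when the relationship is not linear.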

Interpreting the Spearman Rank Correlation Coefficient • The Spearman correlation is interpreted in the same way as the Pearson correlation. • In this case, Spearman’s rho = .40, indicating a moderate positive relationship. • Twelfth grade students in the NELS data set who spend a relatively large amount of time doing homework in school tend also to spend a relatively large amount of time doing homework outside of school, and students who spend a relatively small amount of time doing homework in school tend also to spend a relatively small amount of time doing homework outside of school.

Relationships between One Scale and One Dichotomous Variable Example using the Hamburg data set: Describe the relationship between calories and cheese.

Interpreting the Correlation When One Variable is Scale and One is Dichotomous • The correlation between CALORIES and CHEESE is r = .51. • The correlation is positive indicating that high scores on one variable are associated with high scores on the other. • CHEESE is coded with 0 (a relatively low score) representing the absence of cheese and 1 (a relatively high score) representing the presence of cheese. • Burgers with cheese tend to be higher in calories than those without cheese. • This special case of Pearson correlation is sometimes called the point biserial correlation.

Description of the Impeach Data Set • On February 12, 1999, for only the second time in the nation’s history, the U.S. Senate voted on whether to remove a President, based on impeachment articles passed by the U.S. House. • Dozens of political talk shows featured analyses of why senators may have voted the way they did, but such discourse was rarely (if ever) informed by systematic statistical analysis of the votes. • Professor Alan Reifman of Texas Tech University created this data set about the senators to be used as part of such an analysis. The relevant variable descriptions appear in the following table.

Variables in the Impeach Data Set

Scatterplot Example: Describe the relationship between conservatism score and the vote on perjury

Interpreting the Correlation between Senators’ Conservatism and Their Vote on Perjury • The correlation between VOTE1 and conservatism is r = .87, indicating a strong relationship between the two variables. • The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other. • VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty. • Senators who are more conservative tended to vote guilty on perjury.

Scatterplot Example: Describe the relationship between conservatism score and the vote on obstruction of justice

Interpreting the Correlation between Senators’ Conservatism and Their Vote on Obstruction of Justice • The correlation between VOTE2 and conservatism is r = .94, indicating a strong relationship between the two variables and a stronger relationship than that between VOTE1 and conservatism. • The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other. • VOTE2 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty. • Senators who are more conservative tended to vote guilty on obstruction of justice.

Relationships between Two Dichotomous Variables Example: Is there a relationship between whether or not the senator is first-term and his or her vote on perjury? Solutions via: • Clustered bar graph • Pearson • Crosstabulation

Using SPSS to Obtain a Clustered Bar Graph Click Graphs on the main menu bar, Legacy Dialogs, and Bar. Change from Simple to Clustered and click Define. Put VOTE1 in the Category Axis box and NEWBIE in the Define Clusters by box. Click OK.

Clustered Bar Graph

Using SPSS to Obtain the Contingency Table To obtain the frequencies of each of the four cells (a contingency table or cross-tabulation), click Analyze on the main menu bar, Descriptive Statistics, Crosstabs. Put VOTE1 in the Row(s) box and NEWBIE in the Column(s) box. Click OK.

Contingency Table

Contingency Table Analysis • First term senators tended to vote guilty and more established senators tended to vote not guilty. • Any of the following alternatives may be used to provide statistical support: • Approximately 62.9 percent (39/62*100) of the non-first term senators voted not guilty whereas 42.1 percent (16/38*100) of the first term senators voted not guilty. • Approximately 37.1 percent (23/62*100) of the non-first term senators voted guilty whereas 57.9 percent (22/38*100) of the first term senators voted guilty. • Approximately 70.9 percent (39/55*100) of the not guilty votes came from non-first term senators whereas 51.1 percent (23/45*100) of the guilty votes came from non-first term senators. • Approximately 29.1 percent (16/55*100) of the not guilty votes came from first term senators whereas 48.9 percent (22/45*100) of the guilty votes came from first term senators.
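The percentages quoted above can be reproduced from the four cell counts. The sketch below uses the counts stated in the bullets (39, 16, 23, 22); the variable names are hypothetical and the layout mirrors VOTE1 in rows and NEWBIE in columns.

```python
# Cell counts taken from the percentages quoted in the analysis above.
counts = {
    "not guilty": {"non-first term": 39, "first term": 16},
    "guilty": {"non-first term": 23, "first term": 22},
}
terms = ["non-first term", "first term"]

# Totals for each column (senator type) and each row (vote).
col_totals = {t: sum(counts[v][t] for v in counts) for t in terms}
row_totals = {v: sum(counts[v].values()) for v in counts}

# Percent of each senator type casting each vote (column percentages).
pct_within_term = {v: {t: 100 * counts[v][t] / col_totals[t] for t in terms}
                   for v in counts}
# Percent of each vote coming from each senator type (row percentages).
pct_within_vote = {v: {t: 100 * counts[v][t] / row_totals[v] for t in terms}
                   for v in counts}

for vote in counts:
    for term in terms:
        print(f"{vote:10s} | {term:14s} | "
              f"{pct_within_term[vote][term]:5.1f}% of {term} senators | "
              f"{pct_within_vote[vote][term]:5.1f}% of {vote} votes")
```

Column percentages answer "how did each group vote?", while row percentages answer "who cast each kind of vote?"; either set supports the same conclusion.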

Correlation Analysis • The correlation between VOTE1 and NEWBIE is r = .20. • The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other. • VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty. • NEWBIE is coded with 0 representing non-first term and 1 representing first term. • First term senators tended to vote guilty on perjury and more established senators tended to vote not guilty. • This special case of Pearson correlation is sometimes called the phi coefficient.
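As a check on the reported value, the phi coefficient can be computed directly from the four cell counts of the VOTE1 by NEWBIE table (39, 16, 23, 22, as given in the contingency table analysis) and compared with a Pearson correlation on the 0/1 codes. The sketch uses the standard 2x2 formula for phi; the variable names and the helper pearson_r are hypothetical.

```python
from math import sqrt

# Cell counts n[NEWBIE][VOTE1], with 0/1 coding as described above.
n00 = 39  # non-first term, not guilty
n10 = 16  # first term, not guilty
n01 = 23  # non-first term, guilty
n11 = 22  # first term, guilty

# Phi from the 2x2 table: cross-product difference over the
# square root of the product of the four marginal totals.
phi = (n11 * n00 - n10 * n01) / sqrt(
    (n10 + n11) * (n00 + n01) * (n01 + n11) * (n00 + n10)
)

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Expand the table into one 0/1 pair per senator and correlate the codes.
newbie = [0] * n00 + [1] * n10 + [0] * n01 + [1] * n11
vote1 = [0] * n00 + [0] * n10 + [1] * n01 + [1] * n11
r = pearson_r(newbie, vote1)

print(f"phi = {phi:.2f}, Pearson r on 0/1 codes = {r:.2f}")
```

Both routes give the same number, which is why phi is described as a special case of the Pearson correlation.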

Relationships between Other Variable Types • Nominal non-dichotomous or ordinal with fewer than about five categories by dichotomous. • Example: Are there regional differences in how the senators tended to vote on obstruction of justice? • Nominal non-dichotomous or ordinal with fewer than about five categories by scale. • Example: Are there regional differences in the typical conservatism score of the senators?

Clustered Bar Graph: Graphically Representing Vote on Obstruction vs Region

Contingency Table: Tabulating Vote on Obstruction of Justice by Region

Contingency Table Analysis • Senators from the northeast tended to vote not guilty, while those from the south and west tended to vote guilty and those from the midwest were equally likely to vote guilty or not guilty. • In particular, approximately 83.3 percent (15/18*100) of the senators from the northeast voted not guilty whereas 50.0 percent (12/24*100) from the midwest, 40.6 percent (13/32*100) from the south, and 38.5 percent (10/26*100) from the west voted not guilty. • Alternatively, in terms of voting guilty, approximately 16.7 percent (3/18*100) of the senators from the northeast voted guilty whereas 50.0 percent (12/24*100) from the midwest, 59.4 percent (19/32*100) from the south, and 61.5 percent (16/26*100) from the west voted guilty.

Boxplots: Graphically Representing Conservatism Score by Region

Compare Means or Medians: Comparing Conservatism Scores by Region

Analysis Based on Medians • Because the data are noticeably skewed for the northeast region, a more appropriate comparison of conservatism across regions is via the median, although results based on the means in this example yield the same result. • According to the values of the median, the most conservative senators come from the south (72), followed by the west (64), the midwest (50), and the northeast (19.5).

Selection • The table on the following slide provides guidelines for choosing the appropriate statistic(s) and graphs for describing the relationship between two variables. • Other combinations may be correct.
