Bivariate Relationships

Statistics Using SPSS: An Integrative Approach, Second Edition. Sharon Lawner Weinberg and Sarah Knapp Abramowitz. Chapter 5: Bivariate Relationships

Summarizing the Relationship Between Two Variables: An Overview

The Relationship Between Two Scale Variables: What the Scatterplot Tells Us • Whether the relationship appears linear • If linear, it also tells us: • The direction (or nature) of the linear relationship • The relative strength of the linear relationship

A Scatterplot Example: Hamburger Data Set

Interpreting the Fat vs. Calories Scatterplot • A line, as opposed to a simple curve, appears to fit the data well; i.e., a linear model appears appropriate for these data. • The direction of the linear relationship is positive because the line has a positive slope; i.e., burgers that are relatively high in fat tend also to be relatively high in calories. • The strength of the linear relationship appears to be strong because the points cluster tightly around the line.

Creating the Scatterplot (for FAT and CALORIES) in SPSS Go to Graphs on the main menu bar, Legacy Dialogs, and then Scatter. Click Define. Put CALORIES into the box labeled y-axis and FAT into the box labeled x-axis and click OK.

Labeling the Points of the FAT vs. CALORIES Scatterplot Go to Graphs on main menu bar, Legacy Dialogs, then Scatter. Click Define. Put CALORIES into box labeled y-axis, FAT into box labeled x-axis, and NAME in box labeled label cases by. Click OK. Double click on the graph to put it into Chart Editor. Click Elements, Show Data Labels. Move Name to Displayed box, Eliminate count. Click Apply, Close.

Labeled Scatterplot: Hamburger Data Set

Scatterplot Example: States Data Set (Percentage of eligible students in the state taking the SAT (PERTAK) vs. the average state verbal SAT score (SATV))

Interpreting the SATV vs. PERTAK Scatterplot • Although the points have a curvilinear shape, a line would appear to represent these points reasonably well, and so we will use it in this case. • The direction of the linear relationship is negative because the slope of the line is negative; i.e., states with a relatively low percentage of students taking the SAT tend to have higher average SAT Verbal scores. • The strength of the linear relationship is more moderate than in the hamburger example because the points in this case do not cluster as tightly around the line.

Scatterplot Example: Currency Data Set (Denomination (BILLVALUE) vs. Number of bills in circulation (NUMBER)). Note: Need to combine variables to create a variable for the number of bills in circulation.

Interpreting the Denomination vs. Number Scatterplot • The points have a “cloud like” formation, and so neither a simple curve nor a line provides a good fit to these data. • It appears, therefore, that there is little or no relationship between the value of a bill and its number in circulation.

Scatterplot Example: Marijuana Data Set (Year vs. Percentage of students reporting ever having used marijuana from 1987-1999). Note: Use Select Cases to restrict cases to the appropriate years.

Interpreting the Year vs. Marijuana Scatterplot • A simple curve (or two lines) provides a better fit to the data than a single line and is therefore more appropriate than a line for modeling the data. • We conclude that the relationship between marijuana use and year is non-linear.

Quantifying a Linear Relationship between Two Scale Variables: Pearson Product Moment Correlation Coefficient • Called simply the correlation, and symbolized by the letter r. • Before calculating, use a scatterplot to verify that the relationship between the variables appears to be linear. • Measures the direction, nature (the interpretation of the direction), and strength of the linear relationship. • Direction (and Nature): Measured by the sign of r (positive or negative). • Strength: Measured by the absolute magnitude of r. • Rule of thumb (Cohen’s scale): |r| < .1, little or no relationship; .1 <= |r| < .3, weak relationship; .3 <= |r| < .5, moderate relationship; |r| >= .5, strong relationship.
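A minimal Python sketch of the computation behind r, together with the rule-of-thumb labels (cutoffs at .1, .3, and .5 applied to the absolute value of r). The function names and the data values are hypothetical and chosen only for illustration; they are not from the Hamburger data set, and SPSS remains the tool used in this chapter.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cohen_label(r):
    """Rule-of-thumb strength label based on the absolute value of r."""
    size = abs(r)
    if size < 0.1:
        return "little or no relationship"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"

# Hypothetical values for illustration only (not the Hamburger data).
fat = [10, 15, 20, 25, 30]
calories = [250, 310, 350, 420, 460]
r = pearson_r(fat, calories)
print(round(r, 3), cohen_label(r))
```

The sign of the returned value gives the direction of the relationship; the label depends only on its magnitude, so r = -.86 and r = .86 receive the same label.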

Obtaining the Pearson Correlation Using SPSS Click Analyze on the Main Menu Bar, Correlate, and Bivariate. Move the two relevant variables into the Variables Box. Click OK.

Interpreting Pearson Correlation Coefficients • The correlation between FAT and CALORIES is .997, representing a very strong positive linear relationship between these two variables. • => Burgers that are relatively high in fat tend also to be relatively high in calories and burgers that are relatively low in fat tend also to be relatively low in calories. • The correlation between the Average SATVs of states and the percentage of students in these states who take the SAT is -.86, representing a strong negative linear relationship between these two variables. • => States with a relatively high average SATV tend to have a relatively low percentage of students taking the SAT; states with a relatively low average SATV tend to have a relatively high percentage of students taking the SAT.

Other Properties of the Correlation • Correlation values are on an ordinal scale. • Correlation does not imply causation; i.e., when two variables are correlated, it is not necessarily true that changing one will result in a predictable change in the other. • A linear transformation applied to one variable does not change the magnitude of the correlation. The sign of the correlation will change, however, if the transformation involves multiplication by a negative number. • Restricting the range of one of the variables can either increase or decrease the magnitude of the correlation.
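The linear-transformation property can be checked numerically. The sketch below uses hypothetical scores and arbitrary, illustrative transformation constants; it shows only that rescaling one variable leaves the magnitude of r unchanged and that a negative multiplier reverses the sign.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for illustration only.
x = [2, 4, 5, 7, 9, 10]
y = [1, 3, 2, 6, 7, 9]

r_original = pearson_r(x, y)
# Positive multiplier plus a constant (e.g., a change of units): r is unchanged.
r_rescaled = pearson_r([1.8 * v + 32 for v in x], y)
# Negative multiplier: same magnitude, opposite sign.
r_reflected = pearson_r([-3 * v + 5 for v in x], y)

print(round(r_original, 3), round(r_rescaled, 3), round(r_reflected, 3))
```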

Quantifying the Relationship between Two Ordinal Variables or Between One Ordinal and One Scale Variable: The Spearman Rank Correlation Coefficient The Spearman correlation, called Spearman’s rho, is a special case of the Pearson correlation computed on ranked data. Example: Quantify the relationship between the amount of time spent in school on homework (HWKIN12) and the amount of time spent out of school on homework (HWKOUT12) in twelfth grade for students in the NELS data set.

Beginning with a Scatterplot: HWKIN12 and HWKOUT12

Obtaining the Spearman Rank Correlation Coefficient using SPSS Click Analyze, Correlate, Bivariate. Move the variables HWKIN12 and HWKOUT12 into the Variables box. Click Spearman and remove the check from Pearson in the Correlation Coefficients box. Click OK. Note: When using SPSS, there is no need to transform the data to rankings to obtain the Spearman correlation because SPSS automatically will do this transformation for us.
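To make the ranking step concrete, here is a minimal Python sketch of what "Pearson computed on ranked data" amounts to. Tied values receive the average of the ranks they occupy, which is the usual convention; the homework codes are hypothetical and are not the NELS values, and the helper names are illustrative only.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two sets of ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical ordinal homework codes for six students (not the NELS data).
hwk_in = [0, 1, 1, 2, 3, 4]
hwk_out = [1, 0, 2, 2, 5, 4]
print(round(spearman_rho(hwk_in, hwk_out), 2))
```

Because only the ranks enter the calculation, any relationship that is perfectly increasing, linear or not, yields rho = 1.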

Interpreting the Spearman Rank Correlation Coefficient • The Spearman correlation is interpreted in the same way as the Pearson correlation. • In this example, Spearman’s rho = .40. • => Twelfth grade students in the NELS data set who spend a relatively large amount of time doing homework in school tend also to spend a relatively large amount of time doing homework outside of school, and students who spend a relatively small amount of time doing homework in school tend also to spend a relatively small amount of time doing homework outside of school.

Quantifying the Relationship between One Scale and One Dichotomous Variable – The Point Biserial Correlation Coefficient Example using the Hamburger data set: Calories vs. Cheese.

Interpreting the Relationship between Calories and Cheese • r = .51 between CALORIES and CHEESE. In this case r is called a point biserial correlation, a special case of the Pearson correlation coefficient. • Note: The sign of the correlation is positive. CHEESE is coded 0 (a relatively low score on the cheese metric) to represent the absence of cheese and 1 (a relatively high score on the cheese metric) to represent the presence of cheese. • => Burgers with cheese tend to be higher in calories than those without cheese.
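A short Python sketch of why the 0/1 coding determines the sign: the point biserial correlation is the ordinary Pearson formula applied to a dichotomous variable, so swapping the codes reverses the sign without changing the magnitude. The six burgers below are hypothetical and are not the Hamburger data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical burgers: cheese coded 0 = absent, 1 = present.
cheese = [0, 0, 0, 1, 1, 1]
calories = [270, 320, 430, 330, 450, 530]

# Point biserial correlation: Pearson's r with one dichotomous variable.
r_pb = pearson_r(cheese, calories)

# Reversing the coding (1 = absent, 0 = present) flips only the sign.
cheese_reversed = [1 - c for c in cheese]
r_pb_reversed = pearson_r(cheese_reversed, calories)

print(round(r_pb, 2), round(r_pb_reversed, 2))
```

With the original coding, a positive value means the group coded 1 (with cheese) has the higher mean on the scale variable.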

Another Example of the Point Biserial Correlation: Impeach Data Set On February 12, 1999, for only the second time in the nation’s history, the U.S. Senate voted on whether to remove a President, based on impeachment articles passed by the U.S. House. Dozens of political talk shows featured analyses of why senators may have voted the way they did, but such discourse was rarely (if ever) informed by systematic statistical analysis of the votes. Professor Alan Reifman of Texas Tech University created a relevant data set about the senators’ voting to be used for such an analysis.

The Impeach Data Set

Conservatism Score vs. Vote on Perjury Scatterplot

Interpreting the Relationship between Senators’ Conservatism and Their Vote on Perjury • r = .87 between Conservatism and VOTE1. • Note: The sign of the correlation is positive. VOTE1 is coded 0 (a relatively low score on the voting metric) to represent not guilty of perjury and 1 (a relatively high score on the voting metric) to represent guilty of perjury. • => Senators who are more conservative tended to vote guilty on perjury.

Conservatism Score vs. Vote on Obstruction of Justice Scatterplot

Interpreting the Relationship between Senators’ Conservatism and Their Vote on Obstruction of Justice • r = .94 between Conservatism and VOTE2. • Note: The sign of the correlation is positive. VOTE2 is coded 0 (a relatively low score on the voting metric) to represent not guilty of obstruction of justice and 1 (a relatively high score on the voting metric) to represent guilty of obstruction of justice. • => Senators who are more conservative tended to vote guilty on obstruction of justice.

Quantifying a Relationship between Two Dichotomous Variables – The Phi Coefficient Example: Is there a relationship between the political party of a senator (Democrat or Republican) and his/her vote on obstruction of justice? Note: the Phi Coefficient is also a special case of the Pearson. Rather than use a scatterplot to represent the data graphically, we use a clustered bar graph.

Using SPSS to Obtain a Clustered Bar Graph • Click Graphs on the main menu bar, Legacy Dialogs, and Bar. Change from Simple to Clustered and click Define. Put VOTE1 in the Category Axis box and NEWBIE in the Define Clusters By box. Click OK.

Graphically Representing First-Term and Vote on Perjury: The Clustered Bar Graph

Tabulating the Relationship between First-Term and Vote on Perjury: The Contingency Table To obtain the frequencies of each of the four cells (a contingency table or cross-tabulation), click Analyze on the main menu bar, Descriptive Statistics, Crosstabs. Put VOTE1 in the Row(s) box and NEWBIE in the Column(s) box. Click OK.

Analyzing and Interpreting the Contingency Table • First term senators tended to vote guilty and more established senators tended to vote not guilty. • Any of the following alternatives may be used to provide statistical support: • Approximately 62.9 percent (39/62*100) of the non-first term senators voted not guilty whereas 42.1 percent (16/38*100) of the first term senators voted not guilty. • Approximately 37.1 percent (23/62*100) of the non-first term senators voted guilty whereas 57.9 percent (22/38*100) of the first term senators voted guilty. • Approximately 70.9 percent (39/55*100) of the not guilty votes came from non-first term senators whereas 51.1 percent (23/45*100) of the guilty votes came from non-first term senators. • Approximately 29.1 percent (16/55*100) of the not guilty votes came from first term senators whereas 48.9 percent (22/45*100) of the guilty votes came from first term senators.
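The percentages above can be reproduced from the four cell counts implied by the quoted fractions (39, 16, 23, and 22). The Python sketch below is one way to do that arithmetic; the variable names are illustrative, and SPSS Crosstabs can report the same row and column percentages directly.

```python
# Cell counts taken from the fractions quoted above.
# Rows = VOTE1 (0 = not guilty, 1 = guilty);
# columns = NEWBIE (0 = non-first term, 1 = first term).
table = [[39, 16],
         [23, 22]]

row_totals = [sum(row) for row in table]        # votes: not guilty, guilty
col_totals = [sum(col) for col in zip(*table)]  # non-first term, first term

# Column percentages: how each seniority group split its votes.
col_pct = [[100 * table[i][j] / col_totals[j] for j in range(2)] for i in range(2)]
# Row percentages: where each type of vote came from.
row_pct = [[100 * table[i][j] / row_totals[i] for j in range(2)] for i in range(2)]

for i, vote in enumerate(["not guilty", "guilty"]):
    for j, group in enumerate(["non-first term", "first term"]):
        print(f"{vote:>10} | {group:<14} | "
              f"{col_pct[i][j]:5.1f}% of group | {row_pct[i][j]:5.1f}% of vote")
```

The first two alternatives in the list use the column percentages (conditioning on seniority); the last two use the row percentages (conditioning on the vote).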

Quantifying and Interpreting the Relationship between First-Term and Vote on Perjury: The Phi Coefficient • The correlation between VOTE1 and NEWBIE is r = .20. • VOTE1 is coded with 0 (a relatively low score) representing not guilty and 1 (a relatively high score) representing guilty. • NEWBIE is coded with 0 representing non-first term and 1 representing first term. • The sign of the correlation is positive, so high scores on one variable are associated with high scores on the other. • First term senators tended to vote guilty on perjury and more established senators tended to vote not guilty.
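As a check on the value r = .20, the sketch below computes phi from the four cell counts of the contingency table using the standard 2 x 2 formula, and then confirms that the same number results from applying the Pearson formula to the 0/1 codes, one pair per senator. This is an illustrative Python calculation with assumed variable names, not SPSS output.

```python
import math

# 2 x 2 counts from the contingency table.
# First index = VOTE1 (0 = not guilty, 1 = guilty);
# second index = NEWBIE (0 = non-first term, 1 = first term).
n = [[39, 16],
     [23, 22]]

row0, row1 = sum(n[0]), sum(n[1])
col0, col1 = n[0][0] + n[1][0], n[0][1] + n[1][1]

# Phi from the cell counts: difference of diagonal products over the
# square root of the product of the four marginal totals.
phi = (n[0][0] * n[1][1] - n[0][1] * n[1][0]) / math.sqrt(row0 * row1 * col0 * col1)

# Cross-check: expand the table to one (vote, newbie) pair per senator
# and compute the ordinary Pearson correlation on the 0/1 codes.
vote, newbie = [], []
for v in (0, 1):
    for b in (0, 1):
        vote += [v] * n[v][b]
        newbie += [b] * n[v][b]

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(vote, newbie)
print(round(phi, 2), round(r, 2))
```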

Relationships between Other Variable Types • Dichotomous vs. Non-dichotomous Nominal or Ordinal (with limited categories): Vote on obstruction of justice vs. census region • Scale vs. Non-dichotomous Nominal or Ordinal (with limited categories): Conservatism vs. census region.

Graphically Representing Vote on Obstruction of Justice vs. Census Region: The Clustered Bar Graph

Tabulating Vote on Obstruction of Justice vs. Census Region: The Contingency Table

Analyzing and Interpreting the Contingency Table • Senators from the Northeast tended to vote Not Guilty; those from the South and West tended to vote Guilty; and those from the Midwest were equally likely to vote Guilty or Not Guilty. • In particular, approximately 83.3 percent (15/18*100) of the senators from the Northeast voted Not Guilty whereas 50.0 percent (12/24*100) from the Midwest, 40.6 percent (13/32*100) from the South, and 38.5 percent (10/26*100) from the West voted Not Guilty. • Alternatively, in terms of voting Guilty, approximately 16.7 percent (3/18*100) of the senators from the Northeast voted Guilty whereas 50.0 percent (12/24*100) from the Midwest, 59.4 percent (19/32*100) from the South, and 61.5 percent (16/26*100) from the West voted Guilty.
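The regional percentages follow from dividing each cell count by its region's total and multiplying by 100. A minimal Python sketch of that arithmetic, using the counts that appear in the fractions above (the dictionary layout and names are illustrative):

```python
# Counts by census region, taken from the fractions quoted above.
votes = {
    "Northeast": {"Not Guilty": 15, "Guilty": 3},
    "Midwest":   {"Not Guilty": 12, "Guilty": 12},
    "South":     {"Not Guilty": 13, "Guilty": 19},
    "West":      {"Not Guilty": 10, "Guilty": 16},
}

pct_not_guilty = {}
for region, counts in votes.items():
    total = counts["Not Guilty"] + counts["Guilty"]
    # Percentage of the region's senators who voted Not Guilty.
    pct_not_guilty[region] = 100 * counts["Not Guilty"] / total
    print(f"{region:<9} n={total:2d}  "
          f"Not Guilty {pct_not_guilty[region]:.1f}%  "
          f"Guilty {100 - pct_not_guilty[region]:.1f}%")
```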

Graphically Representing Conservatism Score vs. Census Region: The Boxplot

Summary Tabulation of Conservatism Scores by Census Region: A Comparison by Means & Medians

Analyzing and Interpreting these Summary Data: A Preference for Medians • Because the data are noticeably skewed for the Northeast region, a more appropriate comparison of conservatism across regions is via the median, although the means in this example yield the same ordering of regions. • According to the median values, the most conservative senators come from the South (Median=72), followed by the West (Median=64), the Midwest (Median=50), and finally, the Northeast (Median=19.5).
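The preference for the median under skew can be illustrated with a small Python sketch. The scores below are hypothetical (they are not the Impeach data) and were chosen so that two very high values create a long upper tail; the point is only to show how much more the mean moves than the median.

```python
from statistics import mean, median

# Hypothetical conservatism scores for one region (not the Impeach data):
# most values are low, with two very high scores forming a long upper tail.
scores = [4, 8, 12, 16, 20, 24, 80, 92]

# The same scores with the two extreme values set aside.
scores_without_tail = [s for s in scores if s < 50]

print("with tail:   ", mean(scores), median(scores))
print("without tail:", mean(scores_without_tail), median(scores_without_tail))
```

In this made-up example the two extreme scores raise the mean from 14 to 32 but the median only from 14 to 18, which is why the median is the more representative summary when a distribution is noticeably skewed.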

Summarizing The Possibilities
