Scatterplots & Correlations

Scatterplots & Correlations Chapter 4

What we are going to cover • Explanatory (Independent) and Response (Dependent) variables • Displaying relationships with scatterplots • Interpreting scatterplots • Adding categorical variables to scatterplots • Measuring linear associations with correlation • Important facts and issues with correlations

Starting with some terminology • Response variables (Dependent) = Ys • Explanatory variables (Independent) = Xs • When stating relationships we generally state the dependent variable first. • When graphically depicting a relationship we generally place the dependent variable on the y axis. • In most stats software the dialogue boxes follow this convention and ask you to enter the dependent or response variable first.

Here is the Scatterplot dialogue box for EXCEL with the publisher’s plugin

A scatterplot displays the relationship between two quantitative variables measured on the same individual or event, etc.

Just as we began our discussion of the distribution of individual variables by graphically depicting them, so we do when we are interested in relationships between variables • Scatterplots are a great way to do this depiction.

Adjusting your graph (art and science) • This is the original Excel scatterplot • This is my adjusted Excel scatterplot

Once again let’s look for patterns of regularity and outliers • Using the four step method • State the problem (in this case, does the percent of students taking the SAT influence math scores?) • Plan: we can try to observe this with a scatterplot • Solve (interpret the plot): notice there is something of a line sloping downward from left to right, and some clustering • Conclude: there does appear to be a negative association between the variables; as the percent of students taking the SAT in a state increases, the average math score of the state declines

We can also group data in a scatterplot • As can be seen, the data in the previous chart has been grouped by region (a nominal variable) in this example • In the last class I did the same thing when I divided my data on income into two separate sets for men and women and made side by side box plots

Measuring Linear Correlations • Just as in the past lesson, where we moved from depicting data in graphs to summarizing them with numbers, so we can do the same with associations. • A statistic which is commonly used to measure the strength of an association when data is measured at the interval and ratio level is “r” (Pearson’s r).

Pearson’s r really just builds on what we did with descriptive statistics. For each point we take its distance from the mean of x divided by the standard deviation of x, multiply that by its distance from the mean of y divided by the standard deviation of y, and then average these products over all points (dividing by n − 1). In other words, it is based on standardized values.
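In symbols, "based on standardized values" can be written as follows. The notation is an assumption added here, since the slides give no symbols: n is the number of points, x̄ and ȳ are the means, and s_x and s_y are the sample standard deviations of x and y.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each bracket is the standardized value (z-score) of one coordinate, so r is roughly the average product of the paired standardized values.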

Some important points about “r” • Correlations are symmetrical statistics; they will produce the same result whichever variable you tag as explanatory and which as response • Because “r” uses standardized values it does not change if you rescale variables (for example, by changing the units of measurement) • A negative signed “r” indicates a negative association, a positive sign indicates a positive association. • r varies between -1 and 1. • Values approaching 0 indicate no linear association. • Values approaching -1 indicate a near perfect negative linear relationship • Values approaching 1 indicate a near perfect positive linear relationship.
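The points above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the data (hours studied and test scores) are made up for illustration and are not the SAT data from the slides.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of products of standardized values over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)          # sample standard deviations
    zx = [(x - mx) / sx for x in xs]       # standardized x values
    zy = [(y - my) / sy for y in ys]       # standardized y values
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical illustrative data
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 71]

r_xy = pearson_r(hours, score)                      # x explanatory, y response
r_yx = pearson_r(score, hours)                      # roles swapped: same r
minutes = [h * 60 for h in hours]                   # rescale hours to minutes
r_rescaled = pearson_r(minutes, score)              # same r again
r_flipped = pearson_r(hours, [-s for s in score])   # reversing y flips the sign

print(round(r_xy, 3), round(r_yx, 3), round(r_rescaled, 3), round(r_flipped, 3))
```

Swapping the variables or changing the units leaves r unchanged, while reversing the direction of one variable only changes its sign.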

Some warnings • As noted, Pearson’s r only works if both variables are measured at least at the interval level • Do a scatterplot first. • r only works with linear (or nearly linear) relationships. As curvature enters the picture, r becomes less useful • Outliers (extreme high and low values) will distort r • Correlations do not provide a total summary of relationships; you should usually also provide the means of x and y and their standard deviations so people can evaluate the usefulness of the correlation
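Two of these warnings are easy to see with small made-up data sets. This is a sketch in standard-library Python; `pearson_r` is a helper defined here for illustration, and none of the numbers come from the slides.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r computed from standardized values (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Warning 1: curvature. y is perfectly determined by x (y = x squared),
# yet the relationship is not linear, so r comes out at essentially zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x]
r_curve = pearson_r(x, y_curve)

# Warning 2: outliers. Eight points with almost no linear pattern...
x2 = [1, 2, 3, 4, 5, 6, 7, 8]
y2 = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
r_without = pearson_r(x2, y2)

# ...plus a single extreme point is enough to produce a large r.
r_with = pearson_r(x2 + [20], y2 + [20])

print(round(r_curve, 3), round(r_without, 3), round(r_with, 3))
```

In both cases a scatterplot would reveal the problem immediately, which is why the plot should come before the number.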

Spearman’s rho (a correlation for ordinal data) • Spearman’s rho (or rank order correlation) is a correlation you can use with ordinal data. As with “r” it varies between -1 and 1 and a value approaching 0 indicates no meaningful relationship between the variables. • It is very handy and is used in a number of situations. For example, in sports very elaborate computer programs are used to rank players and/or teams. We could use rho to analyze whether the rankings reliably predict who wins (for example in tennis). • Another common use is when you are looking for associations among opinion data which is collected at the ordinal level. • We won’t calculate this. Enough to say that most programs that do “r” will have a nearby function for rho.
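Although we won't calculate rho by hand, one common way to obtain it is to replace each value with its rank and then compute Pearson's r on the ranks. The sketch below assumes that approach; the helper names and the tennis-style seedings and finishing positions are hypothetical.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r computed from standardized values (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical example: pre-tournament seeding vs. final finishing position
seed = [1, 2, 3, 4, 5, 6, 7, 8]
finish = [2, 1, 4, 3, 6, 5, 8, 7]
rho = spearman_rho(seed, finish)

# A curved but always-increasing relationship still gives rho = 1
rho_monotone = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])

print(round(rho, 3), round(rho_monotone, 3))
```

Because only the ordering matters, rho suits ordinal data such as rankings or opinion scales, where the gaps between categories have no fixed size.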

The following table is from Cohn, CJPS 38:2 (2005), 415-434.

Some things you will note • In the previous table beside “rho” there was a number titled “significance”. • As with most statistics, “r” and “rho” have known distributions for a given data set size (degrees of freedom; for a correlation this is N-2). • Significance answers the question: given the degrees of freedom, how likely are we to see a score this large for the statistic if there were really no association?

A score of 0.05 or less means there is a 5% or smaller chance of seeing a correlation at least this strong if we randomly drew results, that is, if there were really no association between the variables. In other words, results like these would be unusual under chance alone, so we treat them as evidence of a genuine association. • The score in the table was 0.000 (that is, less than 0.0005 after rounding). This means there is almost no chance a rho of this strength could occur with this many cases by simple random chance. • Therefore, there is strong evidence that the association reported between the variables is genuine. • You will hear more about significance as the course proceeds.
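One way to make "could occur by simple random chance" concrete is a permutation (shuffling) test: scramble one variable many times and count how often chance alone gives a correlation at least as strong as the one observed. This is an illustrative technique swapped in here, not the method statistical packages use to produce the significance score in the table (they rely on a known theoretical distribution). The function names and the data are hypothetical.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r computed from standardized values (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Share of random shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)              # break any real link between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical data with a strong positive association
x = list(range(1, 13))
y = [3, 5, 4, 8, 9, 11, 10, 14, 15, 15, 18, 20]

r = pearson_r(x, y)
p = permutation_p_value(x, y)
print(round(r, 3), p)
```

A result of 0.0 here only means that none of the 5,000 shuffles matched the observed strength, much as a reported significance of 0.000 means "smaller than the rounding shows" rather than exactly zero.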
