Introduction to Data Analysis: Bivariate Linear Regression and Correlations
This week’s lecture (1) • Previously we looked at differences between means (descriptive inferences), but often we want to look at how one variable is related to another. • We are also often interested in causal relationships between variables. • We normally have a variable we are interested in predicting (say attitudes to abortion), and variables that we think cause it (say age or religiosity). • This week we examine how to go about testing relationships between variables and making causal inferences.
This week’s lecture (2) • This week is therefore an introduction to a technique called linear regression. • This forms the basis of a large proportion of statistical analysis. • Reading: Agresti and Finlay, Ch. 9.
Our example for the day • We’re going to start with an experiment as an example. • Let’s go back to the 1950s and imagine a psychology experiment unencumbered by ethics committees. • We’re interested in whether high levels of pain affect memory tasks. • We ask 8 people to remember 30 objects, and then ask them to recall those objects whilst being given electric shocks. • Each person gets a different level of shock, varying from 20v up to 160v in increments of 20v.
Two types of variable • For our experiment, we have two variables. • Number of items recalled. This is the dependent variable. This is what we are interested in predicting. • Level of shock. This is the independent variable. This is what we think predicts the dependent variable. • Here, we have a clear causal ‘story’. • We change the shock and people’s memory changes. • Generally in social science, causality can be more difficult to ascertain.
Back to the shocks • Here’s our data as a table, with each person’s score and level of shock. • It appears that more painful shocks impede memory. • More usefully, given that these are both interval-level variables, we can produce a scatter-plot (or scatter graph).
A scatter-plot for our data • We could fit a line “by eye”; the line allows us to read off a predicted value of memory given a level of shock. • We’ve tried to minimize the differences between each point and the line when drawing it.
Describing the line • We can describe a line like this with a simple equation. • There are two components to this: • The slope of the line (e.g. how many fewer objects a person can remember as the shock is increased by 1v). • The intercept of the line (e.g. how many objects a person can remember when the shock is 0v).
Equation of the line (1) • The slope, rise over run between two points on the line: (10 - 17) / (120 - 50) = -7 / 70 = -0.10. • The intercept is 27.
Equation of the line (2) • So in our case the slope = -0.10 and the intercept = 27. • For every 1v increase in the shock, the score on the memory test decreases by 0.10. • If someone receives a 10v shock we predict that they will score 26 (i.e. 27 – 10*0.10 = 27 – 1 = 26). • If someone receives a 90v shock we predict that they will score 18 (i.e. 27 – 90*0.10 = 27 – 9 = 18).
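As a sketch (not from the original slides), these predictions are easy to reproduce in Python; the function name predict_score is ours:

```python
# Predictions from the line fitted "by eye": score = intercept + slope*voltage,
# with intercept = 27 and slope = -0.10 read off the scatter-plot.
def predict_score(voltage):
    intercept = 27.0
    slope = -0.10
    return intercept + slope * voltage

print(predict_score(10))   # 26.0, as above
print(predict_score(90))   # 18.0, as above
```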
Equation of the line (3) • We can generalise this for all lines of this sort: Y = α + βX. • Y is the dependent variable, X is the independent variable. • α is the intercept, and β is the slope of the line. • Positive βs mean that as X increases, Y increases. • Negative βs mean that as X increases, Y decreases.
Equation of the line (4) • There are some specific values of β that we might be interested in. • If β = 0 then there is no relationship between the two variables: as X increases, Y does not change. • This is an important value for the slope, and one we’ll return to later on. • If β = 1 and α = 0 then Y = X. If we increase X by one unit, Y increases by one unit, and when X is zero so is Y.
A statistical model? • Kind of. • We have used a sample to predict something about the population. • e.g. we think that if we give someone a 100v shock then they will remember 17 objects. • But… • Fitting a line by eye to the data seems a bit ad hoc. • If we’ve got a sample, there’s going to be sampling error, so we want to know a range (maybe like a confidence interval) for where our line could fall given different samples.
Fitting a line (and not by eye) • Given that different people (due to wonky eyes and so forth) will fit different lines, we need a general way to fit a line. • When we were fitting the line by eye, we were trying to minimize the deviations from the data points to the line, so why not do that in a more rigorous manner? • We want to make the Y-values of the line (e.g. our prediction of the memory score) as close to the actual Y-values of all of the observations as possible. • Some of the points are higher and some lower though…
Highs and lows • For our line fitted “by eye”, some observations lie above the line and some below. James (above the line) does better than we predict, but Tessa (below the line) does worse.
Least-squares • For standard deviations, we squared the differences between the mean and the individual observations. • So we want to do the same here, and minimize the sum of the squared deviations from the line. • The line that best manages this is called the Ordinary Least Squares (OLS) line.
OLS line • We can calculate it using the differences between the means of X and Y and the individual scores for each observation. • For our shock experiment the mean of X (level of shock) is 90v and the mean of Y (memory score) is 13 ¾ objects remembered.
Calculating β and α • So, we want to make the squared deviations as small as possible. Some fancy math (which I won’t bother with) allows us to calculate both the β and α. • For β: β = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)², so we need to know the mean X and mean Y.
I could calculate this… • …but I can’t be bothered, and in reality we have computers to do it for us. • So assuming we pressed the appropriate button in SPSS/STATA/Excel, the β would be equal to -0.12. • Our guess “by eye” wasn’t bad, but it wasn’t the best slope because it did not minimize the squared deviations. • What about α? There is another simple formula.
I can calculate this • Even a lazy man like me can manage this one: α = Ȳ - βX̄ = 13.75 - (-0.12*90) = 13.75 + 10.8 = 24.55.
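A minimal Python sketch of the same least-squares arithmetic. The shock levels are those stated in the lecture; the memory scores are hypothetical values chosen to match the quoted statistics (mean score 13¾, b ≈ -0.12, r ≈ -0.95), since the original data table isn’t reproduced here:

```python
# OLS "by hand": b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²), a = ȳ - b*x̄.
def ols_fit(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

shock = [20, 40, 60, 80, 100, 120, 140, 160]   # volts (from the lecture)
score = [20, 19, 18, 17, 15, 12, 6, 3]         # objects recalled (hypothetical)
a, b = ols_fit(shock, score)
print(round(b, 2), round(a, 2))   # -0.12 24.68 (the lecture, rounding b first, gets 24.55)
```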
Our OLS line • So since we now know β and α, we have a line that predicts the score on the memory test when we give people any level of shock. • e.g. if we gave someone a 100v shock, then we predict that they will score about 12½ on the test. • Because the slope is constant (i.e. it’s a straight line), for every 1v we increase the shock we expect the memory score to decrease by 0.12. This is therefore a linear relationship.
Causation • I’ve been talking about an experiment, in which we know that the only thing that has changed is the level of electric shock, but what about the real world? • We don’t always know what causes what. • Do political attitudes cause party identification or vice versa? • What about other stuff? • Are older people more anti-abortion because of their age, or because of their religiosity? • More on both of these next week.
Straight lines • Is a straight line the best way of representing all relationships between variables? • Some relationships may be non-linear. • e.g. sex and happiness. Increases from zero times a week up to 20 times may increase happiness, but the increase from 20 to 50 times a week might be somewhat exhausting. • More on this next week. (Figure: an actual non-linear relationship against the fitted OLS linear relationship.)
What about the samples? • The line we are generating is based on a sample, so is this a real relationship? • Are people’s memories really affected by how much pain they are in? They appear to be in our sample, but what about the population we are interested in (in this case presumably all people in the world). • It could be that, by chance, the people that we randomly picked to receive the highest shocks were also the least good at remembering things. • More on this in 10 minutes.
Hypothesis testing (again) • What we want to do is test the null hypothesis that β is zero in the population using our sample. • Remember, if β is zero then the line is completely flat: as X increases, Y does not change. • e.g. as pain increases, memory stays the same. • We go about testing this in a similar way to last week. • Again the crucial insight is that if we sampled lots of people, then we would have a distribution of different Ys for each level of X.
Some terminology • I’ve been talking Greek, but we only have a sample, so: • Let’s call the population line coefficients (i.e. the slope and intercept) β and α. • Let’s call the coefficients from our particular sample b and a. • We want to estimate the true regression coefficients β and α, but all we have is a sample from the population, giving us a line with coefficients b and a.
A bit to add to the equation • In the population, relationships between social science variables will never be deterministic. • Our measures aren’t perfect (our test score clearly isn’t a brilliant measure of memory). • There’s inherent variability (some people will have naturally better memories than others, some people will be hungover, etc.). • Therefore we need to add an ‘error’ term to the equation: Y = α + βX + ε.
‘Errors’ • Best way of thinking about this: • There’s a distribution of Y values for each value of X. • If I went out and gave lots of people 100v shocks, some would score more on the memory test than others, and some would score less. • This is bound to be the case because the relationship is probabilistic: there is variability in the values of Y at each value of X. • The true regression line will go through the mean of each of these distributions for all of the X values (what we call the conditional mean).
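A tiny simulation makes the idea concrete; the values of α, β, and the error spread here are hypothetical, not from the lecture:

```python
import random

# The probabilistic model Y = alpha + beta*X + error, with hypothetical
# alpha = 27, beta = -0.10 and normally distributed errors (sd = 2).
random.seed(1)
alpha, beta, sd = 27.0, -0.10, 2.0

def simulate_score(voltage):
    return alpha + beta * voltage + random.gauss(0, sd)

# Several people given the same 100v shock score differently: a whole
# distribution of Y values at a single value of X.
print([round(simulate_score(100), 1) for _ in range(5)])
```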
(Figure: the distributions of Y given X=40, X=80, and X=120, with the regression line for the population passing through the mean of Y for each value of X.)
Assumptions • There are some important assumptions lying behind all this: • The distribution of values of Y at each value of X is normal. • The spread of the distribution of values of Y is the same at each value of X. • The true relationship between the variables in the population is linear. • The sample is random. • For large samples we don’t need to worry too much about the first assumption; the latter three are important though.
(Figure: the estimated and true regression lines against the distributions of Y given X=40, X=80, and X=120. e2 is the difference between observed and predicted for the estimated regression; ε3 is the difference between observed and predicted for the true regression.)
Estimates and the truth • Unless we’re having an especially lucky day, our estimated regression line is unlikely to be the same as the true regression line. • So we want to know how ‘close’ we are. If we took lots of separate samples and then calculated lots of separate regression lines, we would get a distribution of slope coefficients (the b). • Fortunately for us, the sampling distribution of b is normal, as long as the sample size is large(ish), and the mean of all the possible bs is β.
Standard error of b • There is a formula for the standard error of b as well: SE(b) = s / √Σ(Xi - X̄)². • The SE depends on the variability of the Y observations around our estimated line (the residual standard deviation, s). • It also depends on how much spread there is in our X values: when the X-values are spread out, the SE is smaller. This makes sense; the more the observations are bunched together, the more the ‘errors’ will obscure the small bit of the line we are investigating.
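A sketch of that formula in Python (it takes the fitted a and b as inputs so it stands alone):

```python
import math

# SE(b) = s / sqrt(Sxx), where Sxx = sum((x - x̄)²) measures the spread of X
# and s = sqrt(SSE / (n - 2)) is the standard deviation of the residuals.
def slope_se(x, y, a, b):
    n = len(x)
    x_bar = sum(x) / n
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # squared deviations from the line
    s = math.sqrt(sse / (n - 2))                # residual standard deviation
    sxx = sum((xi - x_bar) ** 2 for xi in x)    # more spread in X -> smaller SE
    return s / math.sqrt(sxx)
```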
CIs for b (1) • For largeish samples (say 40+), we can thus calculate CIs around the slope with reference to the normal distribution, and work out a range in which we think the slope will fall. • For a 95% CI the appropriate number of standard errors each side is 1.96 (obviously). So with 95% confidence, the slope lies in b ± 1.96*SE(b).
CIs for b (2) • Let’s take a new example (with a larger sample). Let’s say that we sample 100 people aged over 18, ask them their age in years, and the number of toffees eaten last month. • Our sophisticated piece of theory suggests that as people age they develop a liking for toffees. • We can calculate a b for these data, a SE for b and a confidence interval around b.
95% CIs for b (3) • In our case, b = 0.5 (i.e. for every year older someone is they consume ½ a toffee extra per month) and the SE of b = 0.2. • So, with 95% confidence, the slope of our line will lie between 0.11 and 0.89 (i.e. 0.5 ± 1.96*0.2). • We might want to make specific hypothesis tests about the line as well.
Hypothesis testing • The particular null hypothesis that we normally want to test is whether β is zero in the population (i.e. could the true line be flat?). • We simply work out the p-value of this null hypothesis by calculating the z-statistic (i.e. how many SEs from zero is the b?). • In our case the b is 2.5 SEs more than zero, and the probability of this is 0.012 (or 1.2%). • We can reject the null hypothesis, and conclude that toffee consumption does increase with age.
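Both the interval and the test can be checked in a few lines of Python, using the b and SE quoted above and the normal approximation:

```python
import math

b, se = 0.5, 0.2                        # slope and standard error from the toffee example

# 95% confidence interval: b ± 1.96*SE(b)
low, high = b - 1.96 * se, b + 1.96 * se
print(round(low, 2), round(high, 2))    # 0.11 0.89

# Two-sided p-value for H0: beta = 0
z = b / se                              # b is 2.5 SEs from zero
p = math.erfc(z / math.sqrt(2))         # equals 2*(1 - Phi(z)) for the standard normal
print(round(p, 3))                      # 0.012 -> reject H0 at the 5% level
```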
(Figure: our regression line compared with the flat null-hypothesis line, with the distributions of Y given X=25 and X=45 under H0.)
Some warnings… • It is ‘dangerous’ to make predictions without examining the data, and/or applying some common sense. • We’ll look at some of these warnings over the following weeks, but one in particular, related to prediction, is worth mentioning now.
Extrapolation • We can use our regression equation to predict values for any level of X, but do we want to do this? • Generally speaking we don’t want to extrapolate too far from the observed values. • i.e. if we pick values of X that we don’t have any data on, we don’t know that our linear model (i.e. our straight line) still holds. • We need to apply common sense (unlike the Daily Telegraph).
By this logic, in 2640 women will finish the race before they start…
Correlation (1) • We know how our variables are related. • Memory is worse if you’re subjected to electric shocks; toffee consumption increases with age. • But to what degree are they related? • The value of b depends on the measures; i.e. people eat ½ a toffee extra as they age one year. If we want to compare the degree to which variables are related we need to do something else. • The ‘something else’ is called correlation, which is usefully unit-free. • Unit-free means that it doesn’t depend on the units of measurement of X and Y.
Correlation (2) • This is closely related to regression and the formula to calculate it relies on the same numbers (sums of the differences between observations and means) as the equation for b. • Unlike b, the correlation (called r) doesn’t make a distinction between the dependent and independent variables. • Let’s go back to the memory example.
Correlation (3) • (Figure: the memory scatter-plot divided at the mean shock (90v) and mean memory score (13¾); points where X and Y are both above, or both below, their means contribute +ve scores to the top line of the correlation formula.)
Correlation (4) • If X is greater than its mean and Y is greater than its mean, we get a positive number. • If X decreases when Y increases, we get a negative number. • If we add all these numbers up (and standardize the sum so it runs from -1 to +1) then we have our correlation. • A correlation of 1 tells us that the variables would lie precisely on an upward-sloping straight line on a scatter-plot. • A correlation of -1 tells us that the variables would lie precisely on a downward-sloping straight line on a scatter-plot. • A correlation of zero tells us there is no (linear) relationship between the variables. • For our example the correlation is -0.95, which shows the two variables are very strongly negatively correlated.
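A short sketch of the calculation, reusing the hypothetical shock/score data from the OLS snippet earlier:

```python
import math

# Pearson correlation:
# r = sum((x - x̄)(y - ȳ)) / sqrt(sum((x - x̄)²) * sum((y - ȳ)²)); unit-free, in [-1, +1].
def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

shock = [20, 40, 60, 80, 100, 120, 140, 160]   # volts (from the lecture)
score = [20, 19, 18, 17, 15, 12, 6, 3]         # objects recalled (hypothetical)
print(round(pearson_r(shock, score), 2))       # -0.95
```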
Isn’t this all a bit simplistic? • YES. • It is all very well analysing the relationship between two variables, but in the real world relationships are more complicated. • We need to ‘control’ for other factors: how do we include other independent variables? • What about categorical independent variables: how do we include these? • What about non-linear relationships? • We need a general approach to testing hypotheses using statistical data by ‘building’ models. • All to be dealt with over the next two weeks…