Statistical Association Rachad Antonius Notes to accompany Interpreting Quantitative Data with IBM-SPSS, 2nd Edition SAGE 2012
Two exciting aspects of social statistics • The study of the links between observed phenomena • The generalization of observed links from a sample to a whole population.
Bivariate and multivariate statistics (crosstabulation, correlation, analysis of variance (ANOVA)) • The study of the links between observed phenomena • The generalization of observed links from a sample to a whole population: statistical inference applied to associations, i.e. the study of statistically significant associations
But what is the meaning of ‘link’? It could mean several things; it refers to: • The notion of ‘co-variation’ • The notion of tendency (as opposed to the notion of ‘rule’) • The notion of strength of the statistical relationship • The notion of prediction • The notion of explanatory factor • The notion of causality
When you want to analyze a statistical relationship, six questions come to mind • Is there a relationship between the variables? • What is its strength? • What is its direction? What is its shape? • If the relationship has been observed on a sample, is it valid for the whole population? • Does it indicate a causal relationship? Is it an explanation? • Are there intervening variables that may explain the relationship?
In order to assert that there is a statistical association, • You must describe it and measure it, on the basis of the observations made on a sample • You must determine whether it could be due to chance, or whether it reflects something about the population (this involves the notion of level of significance)
The description of a statistical association … … depends on the level of measurement of the variables: nominal, ordinal, or numerical scale (each level will require different methods for analyzing the relationship)
If the variables are measured by a nominal scale (these could be qualitative variables, or quantitative variables having a very limited number of categories) You must produce a crosstabulation, and see whether the percentages of cases belonging to a given category of the dependent variable vary across the categories of the independent variable. This can be illustrated directly with IBM-SPSS
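The logic of such a crosstabulation can be sketched in plain Python. This is only an illustration of what IBM-SPSS's Crosstabs procedure reports, and the variables (sex as independent, smoking as dependent) and counts are hypothetical:

```python
from collections import Counter

# Hypothetical survey records: (sex, smoker?) pairs -- illustrative only.
records = [("men", "yes")] * 30 + [("men", "no")] * 70 \
        + [("women", "yes")] * 20 + [("women", "no")] * 80

counts = Counter(records)                       # cell counts of the crosstab
totals = Counter(sex for sex, _ in records)     # column totals

# Column percentages: within each category of the independent variable (sex),
# what percentage of cases falls in each category of the dependent variable?
pct = {(sex, ans): 100 * n / totals[sex] for (sex, ans), n in counts.items()}

print(pct[("men", "yes")], pct[("women", "yes")])
```

If the percentage of smokers differs noticeably between the categories of the independent variable (30% vs 20% here), that is the descriptive sign of an association.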
Numerical scales: correlations • Each dot on the diagram represents a case, and the data related to this case on the X variable and the Y variable are given on the X- and Y-axes. • You get a scatter diagram, which may exhibit a trend. In the example shown, the trend is clear: as X increases, Y decreases. • We consider the relationship to be linear if a straight line going through the scatter diagram captures this trend better than any other type of curve. • Not all correlations are linear, of course! Sometimes, the trend is better represented by a second-degree equation, or by a logarithmic transformation (which means that the scatter diagram of (x, log y) is linear). • Several data points do not follow the pattern: they are far from the line representing the trend. • When the points of the scatter diagram are close to the line that represents the trend, we say that the correlation is strong. The closer the points, the stronger the correlation. In this case, we will be able to predict reasonably well the value of Y when X is given.
How are all these measures done? • Given a set of data points on two quantitative variables, we (I mean statisticians) come up with an equation that represents the regression line, that is, the line that best fits the points in the scatter diagram. • The deviations from the line are studied carefully. The correlation coefficient is based on these deviations. • We will then examine the conditions that must be verified on the equation and on the deviations if we are to make a correct inference based on this data.
The regression equation • Data points are labelled 1, 2, 3, etc., and the letter i refers to an unspecified data point. • The values taken by case i on the X and Y variables are therefore represented by (xi , yi) (i takes successively the values 1, 2, 3, …, n). • The y value of a given point i can thus be expressed as: yi = a + b xi + ei • ei is the deviation of the observed y value from the regression line, which is like an abstract model of the statistical relationship. • The regression line itself gives the predicted value: predicted yi = a + b xi
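The least-squares estimates of a and b can be sketched in a few lines of plain Python. This is an illustration of the computation, not the book's SPSS procedure, and the data are made up:

```python
from statistics import mean

# Made-up data, roughly following y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

mx, my = mean(x), mean(y)

# Least-squares slope b and intercept a
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

predicted = [a + b * xi for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]   # the deviations e_i
```

The `errors` list holds exactly the ei of the slide: the vertical distances between each observed point and the fitted line.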
[Diagram: an individual data point with values (xi, yi); the exact value of y; ei, the error when estimating y; the regression line; the predicted value of y; the exact value of x.]
Example of a regression equation Suppose you have a situation where the regression equation is given by: • Infant mortality rate = 127.20 − 1.13 * (female literacy rate) (the cases here are individual countries) • What this means is that when the literacy rate of women increases by 1% in a country, you expect infant mortality to decrease by 1.13 deaths for every 1000 births. • General equation: predicted value of y = a + b x • The coefficient b gives us the magnitude of the effect of x on y: an increase of 1 unit in X is accompanied by a change of b units in Y. • Warning: this is not a measure of the strength of the relationship (i.e. the precision of the prediction)! It is just a measure of the trend.
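The slide's equation can be applied directly as a prediction rule. A minimal sketch (the function name is ours, the coefficients are the slide's):

```python
def predicted_infant_mortality(female_literacy_pct):
    """Predicted deaths per 1000 live births, using the slide's equation."""
    return 127.20 - 1.13 * female_literacy_pct

# A country with 80% female literacy: 127.20 - 1.13*80 = 36.8 per 1000
print(predicted_infant_mortality(80))
```

Note that this is the predicted value only; actual countries scatter around it, with deviations ei.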
Standardization: Neutralizing the effect of the unit of measurement • Warning: the units in which X and Y are measured may give a false impression about that trend. Solution: standardize X and Y. xs = (x − mx)/sx and ys = (y − my)/sy, and the standardized regression line is: predicted ys = β xs The Greek letter β indicates a standardized coefficient. There is no constant a because the standardized line goes through the origin of the axes. • Standardizing the variables neutralizes the effect of the unit of measure used. This is very useful when you want to compare the effect of several independent variables on one dependent variable.
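The effect of standardization can be checked numerically: refitting the line on z-scores gives a slope β that equals b rescaled by the two standard deviations. A plain-Python sketch with made-up data (same illustrative values as before):

```python
from statistics import mean, stdev

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)

# z-scores: subtract the mean, divide by the standard deviation
xs = [(xi - mx) / sx for xi in x]
ys = [(yi - my) / sy for yi in y]

# Slope of the standardized line (no intercept: both means are now 0)
beta = sum(xi * yi for xi, yi in zip(xs, ys)) / sum(xi * xi for xi in xs)

# Unstandardized slope, for comparison
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
```

Here β = b · (sx/sy), so it no longer depends on the units of either variable; with a single predictor, β is in fact Pearson's r.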
An example of the output of IBM-SPSS. File used: World95.sav. Variables: Infant mortality rate (per 1000 live births) and Female literacy
Output (continued). File used: World95.sav. Note that the B coefficients were the ones used in the regression equation in slide number 14.
Meaning of the output • Pearson correlation coefficient r = −0.843 • Correlation coefficient squared: R2 or r2 = 0.71 • Sig. (2-tailed) 0.000 • r varies from −1 to +1. It tells us how well the cloud of points of the scatter diagram fits the regression line, which represents the general trend. Therefore, it tells us how good a prediction based on this correlation is. • r2 is the percentage of variation in Y that can be attributed to X. In our example: about 70% of the variation of infant mortality rates found in different countries can be attributed to the variation in the female literacy rate.
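The quantities SPSS reports here can be recomputed by hand. A plain-Python sketch of Pearson's r and r2, on hypothetical literacy/mortality figures (not the World95.sav data):

```python
from math import sqrt
from statistics import mean

x = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]     # hypothetical female literacy rates
y = [105.0, 88.0, 70.0, 55.0, 37.0, 20.0]    # hypothetical infant mortality rates

mx, my = mean(x), mean(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)   # Pearson correlation: between -1 and +1
r_squared = r ** 2          # share of Y's variation attributable to X
```

With this near-linear decreasing cloud, r is close to −1 and r2 close to 1, matching the slide's reading: strong negative correlation, precise predictions.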
Meaning of the output (continued) • The level of significance (the Sig. column in the output) gives the probability of getting the observed results in the sample in a situation where there is no association for the whole population. In other words, it tells us the probability of being wrong if we claimed that there is a statistical association at the level of the population. If that probability is small, we may be willing to take the risk! • In our example, this probability is less than 1 in a thousand. (Note: 0.000 does not mean 0. It means that there may be some non-zero decimals, but they are further down. The first three decimals are zero.) • You may have noted that in the example given above, we do not have a sample: we have the set of ALL countries! Does the notion of level of significance apply here? Theoretically, no. But practically, it is used as a measure of the certainty of the statistical relationship.
In summary: • The effect of X on Y is given by the coefficient b. • The strength of the relationship is given by the correlation coefficient r. • The percentage of variation in Y that is explained by X is given by R2. • The probability that the observed relationship is due to chance is given by Sig.
The general case (when you have more than one independent variable) • Predicted y = b0 + b1x1 + b2x2 + … + bkxk • If the x's are standardized: predicted ys = β1x1 + β2x2 + … + βkxk (no constant term, since standardized variables have mean 0) • The b's measure the effect size of every variable x when the others are held constant. • The β's measure the effect of each variable when all the variables are transformed into standardized values (mean = 0 and st. dev. = 1). • The standardization procedure allows us to compare the effects of the various independent variables x1, x2, …, xk on the dependent variable y.
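Fitting the general case means solving the least-squares normal equations. A self-contained plain-Python sketch with two predictors and made-up, noise-free data (so the known coefficients are recovered exactly); the tiny solver is ours, not part of any SPSS routine:

```python
def solve(A, v):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3 + 2 * a - 1.5 * b for a, b in zip(x1, x2)]   # exact relation, no noise

# Design matrix with a column of 1s for the intercept b0
X = [[1.0, a, b] for a, b in zip(x1, x2)]
n = len(X)
XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(3)] for i in range(3)]
Xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(3)]

b0, b1, b2 = solve(XtX, Xty)   # recovers 3, 2, -1.5
```

In real data the deviations would be non-zero, but the b's would still be the coefficients that hold the other predictors constant, exactly as the slide describes.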
Interpreting outputs of an ANOVA This is similar to the correlation just examined, except more variables were added: in addition to the Female literacy rate, the Gross domestic product of the country, the percentage living in cities, the average daily calorie intake, and the average female life expectancy were all added.
We see that the model as a whole is statistically significant, but within it, only two effects are statistically significant: the effect of the Average female life expectancy, and the effect of the percentage of females who read. Average female life expectancy has a stronger explanatory power than female literacy (as seen from the beta coefficients) but this does not mean that there is a causal relationship !
Qualitative vs quantitative data • How do we determine whether there is a statistical association between these two sorts of variables? • Examples: Income and Sex, Age and Sex, or Age at first marriage and Sex. • Let us first compare the means of the two groups, and their distributions.
Mean income for men and women We see that there is a big difference. What if the difference were smaller?
Mean age for men and women There is a difference, a very small one, between the men and women of THIS sample. Does this reflect a real difference at the level of the whole population? Or is it simply due to chance? (Samples are not all an exact replica of the population!)
Perform a t-test If you say that there is a REAL difference in the population, you are running a 9% chance of being wrong. This is too risky. It is safer to assume that there is no age difference between the men and women in the population where this sample comes from.
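The statistic behind this test can be sketched in plain Python. The ages below are hypothetical, and only the t statistic itself is computed; SPSS goes one step further and converts t into the Sig. value:

```python
from statistics import mean, variance

# Hypothetical ages for the two groups of the sample
men =   [24, 30, 28, 35, 27, 31, 29, 33]
women = [26, 29, 31, 34, 28, 30, 32, 29]

n1, n2 = len(men), len(women)
m1, m2 = mean(men), mean(women)

# Pooled estimate of the common variance (equal-variances t-test)
sp2 = ((n1 - 1) * variance(men) + (n2 - 1) * variance(women)) / (n1 + n2 - 2)

# Two-sample t statistic: difference of means over its standard error
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5
```

Here the difference in means is tiny relative to the spread within each group, so |t| is small and the corresponding Sig. would be large, just like the 9% case on the slide: too risky to claim a real difference.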
How about Age at first marriage ? The difference is more visible. What should we expect about the significance level ? WARNING : this depends on the sample size. Do not jump too fast to conclusions.
T-test on the difference of Age at first marriage between men and women If you say that the age at first marriage differs in the whole population, you run almost no risk of being wrong. More precisely: your risk is less than 1 in 1000.
Conclusions • The fact that there is a difference between the means of two groups tells you that the individuals of this sample differ, on average. • You ask: does this difference reflect a similar difference in the bigger population? (Assuming of course that this is a random sample.) • The t-test tells you how risky it is to state that there is a real difference in the population. The significance level is the risk of making the false claim that they differ. • A significance level of 5% (or 0.05) or less is usually an acceptable level of risk.
What if there are two explanatory variables? Effect of Sex on Income:
Main and interaction effects • Sex and Education each have a main effect on income. • Could it be that they interact? This would mean that for men, education affects income differently than it does for women. We call this an interaction effect. • T-tests help us determine whether variables A and B have a main effect on variable C. • There are other tests that allow us to determine whether there is also an interaction effect. • The principle of such tests is the same as for the t-test. Each test produces a significance level which indicates how risky it is to assert that there is an effect at the level of the whole population.
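What an interaction effect means can be made concrete with a toy table of group means. The incomes below are invented (in thousands, say), purely to show the arithmetic the tests are probing:

```python
# Hypothetical mean incomes for the four Sex x Education groups
incomes = {
    ("men",   "no degree"): 30, ("men",   "degree"): 50,
    ("women", "no degree"): 28, ("women", "degree"): 38,
}

# Effect of getting a degree, computed separately within each sex
effect_men   = incomes[("men",   "degree")] - incomes[("men",   "no degree")]
effect_women = incomes[("women", "degree")] - incomes[("women", "no degree")]

# If these two effects differ, Sex and Education interact
interaction = effect_men - effect_women
```

A degree raises the (hypothetical) men's incomes by 20 but the women's by only 10; the non-zero gap between the two effects is exactly what a significant interaction term in the output reflects.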
Determining the significance levels of the main and interaction effects We see that each variable has a significant main effect and that the interaction effect is also significant. This means that getting a higher degree does not have the same effect on income for men as for women.