UK FHS Historical sociology (2014). Quantitative Data Analysis I. Contingency tables: multivariate analysis and elaboration – introduction to third level of data sorting. Jiří Šafr, jiri.safr(AT)seznam.cz. Updated 5/6/2014.
Multivariate analysis: 3rd level of data sorting in a contingency table. More detailed description and elaboration (introduction 1.)
Third level of data sorting in contingency table • A contingency table analysis is used to examine the relationship between two categorical variables (bivariate), • but it can be organized within levels of a third variable. If our goal is elaboration (rather than detailed description), we call this variable a test variable or factor. We aim to control for its effects. • If a third variable is introduced, it forms separate layers or strata in the table.
3rd level of data sorting in contingency table • We simultaneously analyse relationships among several variables (mostly several independent, i.e. explanatory, variables). • The principle is the same as in bivariate analysis. • The goal of the 3rd level of data sorting is in principle: • More detailed description (in sub-subgroups) • Elaboration of relationships → searching for causal relations, deeper understanding of context, distinguishing between substantive and false relations, controlling for the effect of the 3rd variable (X↔Y / Z) • This also holds for any 3rd level of data sorting in general, i.e. also for means in subgroups and linear association (scatter plots, correlation, regression). We will explain it on contingency tables first.
Principle of multivariate analysis: 3rd level of data sorting (2×2×2 table) Church attendance by gender and age, USA 1990 [Table: church attendance by age and gender, columns summing to 100 %; annotated differences: 9 % points and 16 % points] Source: General Social Survey, NORC [Babbie 1997: 391] Dependent variable: attendance at religious services, sorted simultaneously by 2 independent variables: age and gender. Both older men and older women go to church more frequently than the young (i.e. religiosity rises with age). In each age category women attend church more often than men. It seems that gender has a slightly larger effect on church attendance than age. Age and gender each have an independent effect on church attendance: within each category of one independent variable, the attributes of the other still influence people's behaviour. The two independent variables also have a cumulative effect on behaviour: older women attend church the most, young men the least. [Babbie 1997: 391-392]
Simplification of the 2×2×2 table [Table: church attendance by age and gender, showing only the % attending weekly; annotation: 100 % → 70 % less often] Source: General Social Survey, NORC [Babbie 1997: 391] We show only the "positive" category of the variable ("attend weekly"). However, we do not lose any information: the frequencies in brackets report the base for each percentage, from which the omitted category can be completed. [Babbie 1997: 391]
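That nothing is lost in the simplified table can be checked with a little arithmetic. A minimal sketch in Python, assuming a dichotomous variable; the percentage and the base used below are made-up illustrative values, not the GSS figures, and `complete_cell` is a hypothetical helper name:

```python
def complete_cell(percent_shown, base):
    """Recover both categories of a dichotomy from one percentage and its base."""
    count_shown = round(base * percent_shown / 100)   # e.g. "attend weekly"
    count_omitted = base - count_shown                # e.g. "less often"
    percent_omitted = 100 - percent_shown
    return count_shown, count_omitted, percent_omitted

# Hypothetical cell: 30 % in the shown category, base of (200) respondents.
shown, omitted, pct_omitted = complete_cell(30, 200)
print(shown, omitted, pct_omitted)
```

The base in brackets is what makes the reconstruction possible; without it only the omitted percentage, not the counts, could be recovered.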
3rd level of data sorting (2×2×2 table) → description/exploration Who fails more often: male or female students living in a dormitory? [Table annotations: 15 percent difference; only 1 percent difference] Compared with their male counterparts, female students living in a dormitory fail more often. But their share is the same as among female students who live elsewhere (i.e. living in a dormitory apparently has no effect on results among women; among men the effect is positive: male dormitory residents are more successful in the exam and also the most successful of all). Source: adapted from [Kapr, Šafář 1969: 152]
Introduction to elaboration: 3rd level of data sorting → controlling for the factor
Testing / controlling for the effect of a 3rd variable (factor) → Elaboration • Constructing separate tables split by the categories of the third variable holds the tested factor constant. → The relationship between the two variables is then net, i.e. cleaned of the distorting effect of this factor.
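Splitting a table by a third variable can be sketched in a few lines of Python. This is only an illustration under assumed data: the records, category labels and the function name `layered_column_percents` are hypothetical.

```python
from collections import Counter, defaultdict

def layered_column_percents(records):
    """records: iterable of (x, y, z) tuples.
    Returns {z: {x: {y: percent}}}: within each layer z, the percentage
    distribution of y inside each category of x."""
    counts = defaultdict(Counter)              # (z, x) -> counts of y
    for x, y, z in records:
        counts[(z, x)][y] += 1
    result = defaultdict(dict)
    for (z, x), y_counts in counts.items():
        base = sum(y_counts.values())          # base for the percentages
        result[z][x] = {y: 100 * n / base for y, n in y_counts.items()}
    return dict(result)

# Hypothetical records: (x = age group, y = voted?, z = education)
records = (
    [("young", "yes", "low")] * 3 + [("young", "no", "low")] * 7 +
    [("old", "yes", "low")] * 5 + [("old", "no", "low")] * 5 +
    [("young", "yes", "high")] * 6 + [("young", "no", "high")] * 4 +
    [("old", "yes", "high")] * 8 + [("old", "no", "high")] * 2
)
tables = layered_column_percents(records)
for z, layer in tables.items():
    print(z, layer)
```

Each layer is an ordinary bivariate table in which the third variable does not vary, which is what "holding the factor constant" means here.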
3rd level of data sorting: controlling for the effect of a 3rd variable: interpretation and layout of a (2×3×3) table Is voter turnout related to age even when the effect of education is controlled for? For ordinal independent variables we compare the percentage differences between the extreme categories separately within the categories of the control factor. Differences between the extreme age categories in percentage points: 14 % 13 % 30 % We ask: 1. Do we find differences between X (age) and Y (voted) within the categories of the control variable Z (education)? Compare with the 2nd-level (bivariate) table for X and Y. 2. Are the differences between the extreme categories of X (age) the same within each category of the control variable Z (education)? For primary and secondary education the differences between the youngest and the oldest are about the same, whereas for university education the difference is larger. → Education thus partially intervenes in the relationship between voter turnout and age.
Interaction and additive effect Interaction effect: the effect of one variable on another is contingent on the value of a third variable. (The remainder to 100 % is the % who did not vote.) The effect of age on voting differs across categories of education: for juniors there is no difference, whereas for seniors the % difference in voting rises with higher education. Voting is highest among older university graduates. Additive effect: the effects of both variables add together to produce the final result. The percentage-point difference between the age categories stays the same in every category of education: the effect of age is similar in each education category, only at a "different level". [Treiman 2009: 26-28]
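The distinction can be made concrete by computing the age difference within each education category. A minimal Python sketch; the percentages are invented for illustration (they are not Treiman's figures), and the 2-point `tolerance` is an arbitrary assumption, not a statistical test:

```python
def age_effects(pct_voted):
    """pct_voted: {education: {"young": %, "old": %}}.
    Returns the percentage-point difference old - young per education category."""
    return {edu: cells["old"] - cells["young"] for edu, cells in pct_voted.items()}

def pattern(pct_voted, tolerance=2):
    """'additive' if the age difference is about the same in every education
    category (within `tolerance` points), otherwise 'interaction'."""
    diffs = list(age_effects(pct_voted).values())
    return "additive" if max(diffs) - min(diffs) <= tolerance else "interaction"

# Hypothetical percentages who voted
additive = {"primary":    {"young": 40, "old": 55},
            "secondary":  {"young": 50, "old": 65},
            "university": {"young": 60, "old": 75}}
interaction = {"primary":    {"young": 50, "old": 52},
               "secondary":  {"young": 50, "old": 65},
               "university": {"young": 50, "old": 85}}

print(age_effects(additive), pattern(additive))
print(age_effects(interaction), pattern(interaction))
```

In the additive case the difference is constant while the overall level shifts with education; in the interaction case the size of the age difference itself depends on education.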
Testing the effect of an additional factor • We compare the strength of the association in the original table with the association found in the new tables that control for the 3rd factor. • If the association between the original variables disappears or is substantially weakened in the new tables → the association in the original table is a function of the third factor. • Below we show how to uncover a hidden relationship quickly using coefficients of association within subgroups of the 3rd (control) factor (Lambda, Phi, Cramér's V for nominal variables, and ordinal correlation for ordinal ones). • In QDA II we will also show how to standardize (reweight) the table by factor Z, i.e. as if everyone in the categories of X had the same shares in the categories of Z (e.g. the same education).
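The idea behind standardizing by Z can be sketched as follows. This is one common approach (direct standardization, weighting by the distribution of Z in the whole sample), assumed here for illustration; it is not necessarily the exact procedure taught in QDA II. The counts and function names are hypothetical.

```python
def crude_percent(counts, x):
    """counts: {(x, z): (n_positive, n_total)}. Plain percentage for category x."""
    pos = sum(p for (xi, _), (p, _) in counts.items() if xi == x)
    tot = sum(t for (xi, _), (_, t) in counts.items() if xi == x)
    return 100 * pos / tot

def standardized_percent(counts, x):
    """Percentage for category x as if x had the same distribution of z as the
    whole sample: within-z percentages weighted by the overall shares of z.
    Assumes every (x, z) combination is present with a non-zero total."""
    z_totals = {}
    for (_, z), (_, n_total) in counts.items():
        z_totals[z] = z_totals.get(z, 0) + n_total
    n_all = sum(z_totals.values())
    return sum(
        (z_totals[z] / n_all) * (100 * counts[(x, z)][0] / counts[(x, z)][1])
        for z in z_totals
    )

# Hypothetical counts: (age, education) -> (voted, total)
counts = {("young", "low"): (20, 50),  ("young", "high"): (105, 150),
          ("old", "low"):   (60, 150), ("old", "high"):   (35, 50)}

for age in ("young", "old"):
    print(age, crude_percent(counts, age), standardized_percent(counts, age))
```

In this made-up example the crude percentages differ between the age groups only because the groups differ in education; once both are given the same education structure, the difference vanishes.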
Why do we conduct elaboration? 1. To detect and describe interaction (additive) effects, and when doing this we can reveal: 2. Spurious association (false association/correlation) 3. Suppressed (hidden) association. The following two examples explain this. The aim is the net relationship between two variables when the effect of a 3rd variable is controlled for. Coefficients of association (e.g. Lambda, used here) are explained later or in 3. Contingency tables and analysis of categorical data.
Example I: Spurious association (false association/correlation) 1. Bivariate relationship [Table: Religiosity (High, Low, Total) by Preference for meal (Hamburger, Caviar, Total)] Seemingly strong association, but … Source: [Disman 1993: 219-223]
2. After controlling for the effect of education (3rd level of data sorting): people with low education [Table: Religiosity (High, Low, Total) by Preference for meal (Hamburger, Caviar, Total)] No association for people with low education: 0 % point difference (also Lambda = 0). Source: [Disman 1993: 219-223]
2. After controlling for the effect of education (3rd level of data sorting): people with high education [Table: Religiosity (High, Low, Total) by Preference for meal (Hamburger, Caviar, Total)] The association disappears when we control for the effect of education → education is the factor in the background that influences both religiosity and preference for food. Source: [Disman 1993: 219-223]
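How an association can appear in the bivariate table and vanish inside each education layer is easy to reproduce. A Python sketch with invented counts (these are not Disman's numbers; `pct_diff` is a hypothetical helper):

```python
def pct_diff(table):
    """table: {"high": (n_hamburger, n_caviar), "low": (n_hamburger, n_caviar)},
    keyed by religiosity. Returns % preferring hamburger among the highly
    religious minus the same % among the less religious."""
    def pct_hamburger(row):
        hamburger, caviar = table[row]
        return 100 * hamburger / (hamburger + caviar)
    return pct_hamburger("high") - pct_hamburger("low")

# Hypothetical counts per education layer
low_edu  = {"high": (144, 16), "low": (36, 4)}     # mostly religious, mostly hamburger
high_edu = {"high": (4, 36),   "low": (16, 144)}   # mostly non-religious, mostly caviar

# The bivariate table is simply the sum of the two layers.
total = {r: tuple(a + b for a, b in zip(low_edu[r], high_edu[r]))
         for r in ("high", "low")}

print("bivariate:", pct_diff(total))
print("low education:", pct_diff(low_edu))
print("high education:", pct_diff(high_edu))
```

Education drives both variables in these data, so pooling the layers produces a large percentage difference that exists in neither layer on its own.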
Example II: Suppressed (hidden) association 1. Bivariate relationship [Table: Would buy, Would not buy, Total by Package A, Package B, Total] At first glance no association, but … Source: [Disman 1993: 219-223]
2. When gender is controlled for (3rd level of data sorting) [Tables, separately for men and for women: Would buy, Would not buy, Total by Package A, Package B, Total] Controlling for the 3rd variable (factor) revealed a suppressed association (false independence) between the two variables. Reason for this bias → the relationship between the variables exists only in a part of the population (among women).
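A suppressed association can be reproduced the same way. The Python sketch below uses invented counts (not Disman's) chosen so that the pooled table shows no difference although one exists among women; the Phi coefficient is computed with the standard 2×2 formula, and the function name is hypothetical.

```python
from math import sqrt

def buy_diff_and_phi(a, b, c, d):
    """2x2 table: a = would buy / Package A, b = would buy / Package B,
    c = would not buy / Package A, d = would not buy / Package B.
    Returns (% would buy with A minus % would buy with B, Phi coefficient)."""
    diff = 100 * a / (a + c) - 100 * b / (b + d)
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return diff, phi

# Hypothetical counts (a, b, c, d)
men = (30, 10, 120, 40)
women = (40, 60, 10, 90)
everyone = tuple(m + w for m, w in zip(men, women))   # pooled bivariate table

for label, cells in (("everyone", everyone), ("men", men), ("women", women)):
    diff, phi = buy_diff_and_phi(*cells)
    print(label, round(diff, 1), round(phi, 3))
```

In these data women and men were shown the two packages in different proportions and have different overall buying rates, which is what allows the pooled table to hide the effect found among women.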
When examining relationships in elaboration, coefficients of association/ordinal correlation can help us find interaction or suppressed effects.
Ordinal correlation for ordinal variables: bivariate "zero-order" table/correlation (4o×4o table) When our data come from a random sample (i.e. not the whole population), we must in addition first test the null hypothesis that the coefficient is zero in the whole population (i.e. that it differs from zero only in our sample). Approx. Significance (also p) is here < 5 % → we reject the null hypothesis that Gamma/Tau-b is zero in the whole population. More on this in QDA II. CROSSTABS income4 BY edu4 /STATISTICS GAMMA BTAU. Source: data [ISSP 2007, ČR]
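For readers who want to see what the two coefficients measure, here is a plain-Python sketch of Goodman and Kruskal's Gamma and Kendall's Tau-b computed from a contingency table with ordered rows and columns. It uses the textbook pair-counting definitions and does not compute the significance test that SPSS reports; the 3×3 table is invented, not the ISSP data.

```python
from math import sqrt

def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered r x c table."""
    rows, cols = len(table), len(table[0])
    C = D = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):          # only rows below cell (i, j)
                for l in range(cols):
                    if l > j:                      # below and to the right
                        C += table[i][j] * table[k][l]
                    elif l < j:                    # below and to the left
                        D += table[i][j] * table[k][l]
    return C, D

def gamma(table):
    C, D = concordant_discordant(table)
    return (C - D) / (C + D)                       # ties are ignored

def tau_b(table):
    C, D = concordant_discordant(table)
    n = sum(map(sum, table))
    pairs = n * (n - 1) / 2
    tied_rows = sum(r * (r - 1) / 2 for r in map(sum, table))
    tied_cols = sum(c * (c - 1) / 2 for c in map(sum, zip(*table)))
    return (C - D) / sqrt((pairs - tied_rows) * (pairs - tied_cols))

# Hypothetical table: rows = income (low to high), columns = education (low to high)
table = [[20, 10, 5],
         [10, 20, 10],
         [5, 10, 20]]
print(round(gamma(table), 3), round(tau_b(table), 3))
```

Gamma drops all tied pairs from the denominator, while Tau-b corrects for them, so Gamma is never smaller in absolute value than Tau-b on the same table.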
Is the strength of the relationship (ordinal correlation) identical for men and women? → We can compute conditional association/correlation coefficients separately in the categories of the control variable, i.e. the factor (gender). Here a 4o×4o×2 table.
Ordinal correlation for ordinal variables in 3rd level of data sorting (separately for men and women) → gender [s30] is the controlling factor. First-order conditional table/correlation: CROSSTABS prijem4 BY vzd4 BY s30 /STATISTICS GAMMA BTAU. Among women education has a slightly stronger effect, but on the whole women earn less than men regardless of education level (see also the graph with means of income). In QDA II we will further compute partial ordinal correlation (GAMMA). Source: data [ISSP 2007, ČR]
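Conditional coefficients are simply the same coefficient computed inside each category of the control variable. A small Python sketch for Gamma, counting concordant and discordant pairs directly from individual records; the records are invented (ordinal codes 1 to 3), not the ISSP data, and the function names are hypothetical:

```python
from itertools import combinations

def gamma_from_pairs(pairs):
    """Goodman and Kruskal's Gamma from a list of (x, y) ordinal codes."""
    C = D = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            C += 1          # concordant pair
        elif s < 0:
            D += 1          # discordant pair (ties, s == 0, are ignored)
    return (C - D) / (C + D)

def conditional_gammas(records):
    """records: iterable of (x, y, z). Returns {z: Gamma of x and y within z}."""
    layers = {}
    for x, y, z in records:
        layers.setdefault(z, []).append((x, y))
    return {z: gamma_from_pairs(pairs) for z, pairs in layers.items()}

# Hypothetical records: (education, income, gender)
records = [
    (1, 1, "men"), (1, 2, "men"), (2, 2, "men"),
    (2, 3, "men"), (3, 2, "men"), (3, 3, "men"),
    (1, 1, "women"), (1, 1, "women"), (2, 2, "women"),
    (2, 2, "women"), (3, 3, "women"), (3, 3, "women"),
]
print(conditional_gammas(records))
```

Comparing the coefficients across layers shows whether the strength of the relationship differs between groups; it says nothing about differences in the level of income between them.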
Types of contingency tables with 3 variables and coefficients of association/correlation (o = ordinal, n = nominal) Generally you can always use association (no direction, just the strength of mutual dependence) → coefficients of association. • 2×2×2 (similarly 2×2×3n): all dichotomous → coefficients of association and also the special point-biserial or tetrachoric correlation • 2×3o×3n or 2×3o×2: dependent variable dichotomous, independent ordinal, control nominal → ordinal correlation in groups of the control factor (without the possibility of considering linear trends in the strength of association/correlation) • 2×3n×3o: dependent variable dichotomous, independent nominal, control factor ordinal → only coefficients of association (but we can consider a linear trend in the strength of association across categories of the control factor) • 3o×3o×3o (similarly 2×2×3o): all ordinal → ordinal correlation (we can consider a linear trend in the strength of correlation across categories of the control factor) + coefficients of partial correlation (i.e. the net correlation of X↔Y when the effect of Z is controlled; more on this in QDA II). The same holds for more than 3 categories (e.g. 4o or 4n).
Coefficients of association in (bivariate) multivariate analysis in SPSS within CROSSTABS • Within CROSSTABS we can compute several measures of association and correlation for variables Y × X (bivariate) as well as separately in the categories of the controlling factor Z → this can help us quickly assess interaction and reveal a "false" relationship. • For nominal variables (Y, X, Z as controlling factor), coefficients of association (they range from 0 to 1 → no direction): CROSSTABS var1 BY var2 BY var3-controlling /CELLS COL /STATISTICS CC PHI. Coefficients of association: CC = contingency coefficient, PHI = Cramér's V (+ its equivalent for dichotomous variables, Phi); there are also other coefficients of association and correlation (e.g. Lambda). • For ordinal variables (Y, X) and a nominal/ordinal controlling factor (Z), in addition to the association coefficients, ordinal correlation (ranging from -1 through 0 to 1 → they determine direction): CROSSTABS var1 BY var2 /CELLS COL /STATISTICS CC PHI GAMMA CORR BTAU. Correlation coefficients: GAMMA = Goodman and Kruskal's Gamma, BTAU = Kendall's Tau-b, CORR = Spearman's Rho (+ Pearson's correlation coefficient R for ratio variables) • Note: if we do not find a correlation, it does not mean that there is no (strong) relationship (association). Moreover, with ordinal variables, comparing correlations with coefficients of association can help indicate the form of the relationship (nonlinearity). • Note: in the case of means in subgroups (MEANS) we can compute the coefficient Eta² (for ratio × nominal variable): MEANS var1-dependent-numeric BY var2-independent-categ. BY var3-controlling-categorial /CELLS MEAN STDDEV COUNT /STATISTICS ANOVA. More on coefficients of association and correlation can be found in 2. Korelace a asociace: vztahy mezi kardinálními/ordinálními znaky (Correlation and association: relationships between cardinal/ordinal variables) at http://metodykv.wz.cz/AKD2_korelace.ppt
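To show what the chi-square-based coefficients contain, the following Python sketch computes Cramér's V and the contingency coefficient from their standard formulas, separately for each layer of a control factor. The layer tables are invented, and the code assumes no row or column total is zero.

```python
from math import sqrt

def chi_square(table):
    """Pearson chi-square of an r x c table of counts."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    n = sum(map(sum, table))
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k))

def contingency_coefficient(table):
    n = sum(map(sum, table))
    chi2 = chi_square(table)
    return sqrt(chi2 / (chi2 + n))

# Hypothetical X-by-Y tables, one per category of the control factor Z
layers = {"Z = 1": [[10, 5], [5, 10]],
          "Z = 2": [[8, 8], [7, 7]]}
for z, table in layers.items():
    print(z, round(cramers_v(table), 3), round(contingency_coefficient(table), 3))
```

For a 2×2 table Cramér's V equals the absolute value of Phi. Coefficients that differ clearly between layers point to interaction, while coefficients near zero in every layer (after a non-zero bivariate value) point to a spurious relationship.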
Note: check counts (absolute frequencies) when sorting data at a higher level • When doing 3rd level of data sorting, always check the counts in the individual cells of the table with caution, notably in small samples. CROSSTABS var1 BY var2 BY var3 /CELLS COL COUNT. • If the frequencies are too small, interpreting the table makes no sense from either the statistical or the substantive point of view. → You can collapse sparse categories.
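Checking for sparse cells can be automated. A minimal Python sketch that lists every cell of the three-way table, including empty ones, whose count falls below a threshold; the threshold of 5 is only an assumed rule of thumb, and the records and function name are hypothetical.

```python
from collections import Counter
from itertools import product

def sparse_cells(records, min_count=5):
    """records: iterable of (var1, var2, var3) tuples.
    Returns {cell: count} for all combinations of the observed categories
    whose count is below min_count (empty cells are reported with 0)."""
    records = list(records)
    counts = Counter(records)
    categories = [sorted(set(column)) for column in zip(*records)]
    return {cell: counts[cell]
            for cell in product(*categories)
            if counts[cell] < min_count}

# Hypothetical records: (voted?, age group, education)
records = (
    [("yes", "young", "low")] * 12 + [("no", "young", "low")] * 3 +
    [("yes", "old", "low")] * 9 + [("no", "old", "low")] * 8 +
    [("yes", "young", "high")] * 10 + [("no", "young", "high")] * 6 +
    [("yes", "old", "high")] * 7
)
sparse = sparse_cells(records)
print(sparse)
```

Cells flagged this way are candidates for collapsing categories before any percentages or coefficients are interpreted.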