INTRODUCTION TO CATEGORICAL DATA ANALYSIS ODDS RATIO, MEASURES OF ASSOCIATION, TESTS OF INDEPENDENCE, LOGISTIC REGRESSION AND POLYTOMOUS LOGISTIC REGRESSION
DEFINITION • Categorical data are data whose measurement scale consists of a set of categories. • e.g. marital status: never married, married, divorced, widowed — nominal • e.g. attitude toward some policy: strongly disapprove, disapprove, approve, strongly approve — ordinal • SOME VISUALIZATION TECHNIQUES: jittering, mosaic plots, bar plots, etc. • Correlation between ordinal or nominal measurements is usually referred to as association.
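The nominal/ordinal distinction maps directly onto R's factor types. A minimal sketch (variable names are illustrative, not from the slides):

```r
# Nominal scale: an unordered factor
marital <- factor(c("married", "never married", "divorced", "widowed"))
levels(marital)                  # stored alphabetically; order carries no meaning

# Ordinal scale: an ordered factor, so comparisons between levels are meaningful
attitude <- factor(c("approve", "disapprove", "strongly approve"),
                   levels = c("strongly disapprove", "disapprove",
                              "approve", "strongly approve"),
                   ordered = TRUE)
attitude[1] < attitude[3]        # TRUE: "approve" ranks below "strongly approve"
```

Ordered factors matter later: functions such as mosaic displays and ordinal association measures respect the declared level order.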
Creating and manipulating frequency tables (https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf) • Case form: a data frame containing individual observations, with one or more factors used as the classifying variables. • The Arthritis data is available in case form in the vcd package. There are two explanatory factors: Treatment and Sex. Age is a numeric covariate, and Improved is the response, an ordered factor with levels None < Some < Marked. Excluding Age, we would have a 2 × 2 × 3 contingency table for Treatment, Sex and Improved.
> library(vcd)
> names(Arthritis) # show the variables
[1] "ID" "Treatment" "Sex" "Age" "Improved"
> str(Arthritis) # show the structure
'data.frame': 84 obs. of 5 variables:
 $ ID       : int 57 46 77 17 36 23 75 39 33 55 ...
 $ Treatment: Factor w/ 2 levels "Placebo","Treated": 2 2 2 2 2 2 2 2 2 2 ...
 $ Sex      : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age      : int 27 29 30 32 46 58 59 59 63 63 ...
 $ Improved : Ord.factor w/ 3 levels "None"<"Some"<..: 2 1 1 3 3 3 1 3 1 1 ...
> head(Arthritis, 5) # first 5 observations, same as Arthritis[1:5,]
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
Frequency form: a data frame containing one or more factors, and a frequency variable, often called Freq or count.
> # Agresti (2002), Table 3.11, p. 106
> GSS <- data.frame(expand.grid(sex=c("female", "male"),
+                   party=c("dem", "indep", "rep")),
+                   count=c(279,165,73,47,225,191))
> GSS
     sex party count
1 female   dem   279
2   male   dem   165
3 female indep    73
4   male indep    47
5 female   rep   225
6   male   rep   191
> names(GSS)
[1] "sex" "party" "count"
> str(GSS)
'data.frame': 6 obs. of 3 variables:
 $ sex  : Factor w/ 2 levels "female","male": 1 2 1 2 1 2
 $ party: Factor w/ 3 levels "dem","indep",..: 1 1 2 2 3 3
 $ count: num 279 165 73 47 225 191
> sum(GSS$count)
[1] 980
Table form: a matrix, array or table object, whose elements are the frequencies in an n-way table. • The HairEyeColor data is stored in table form in vcd.
> str(HairEyeColor)
 table [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"
> sum(HairEyeColor) # number of cases
[1] 592
> sapply(dimnames(HairEyeColor), length) # table dimension sizes
Hair  Eye  Sex
   4    4    2
Enter frequencies in a matrix, and assign dimnames, giving the variable names and category labels. Note that, by default, matrix() fills the result by columns, unless you specify byrow=TRUE.
> ## A 4 x 4 table: Agresti (2002), Table 2.8, p. 57, Job Satisfaction
> JobSat <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4)
> dimnames(JobSat) = list(income=c("< 15k", "15-25k", "25-40k", "> 40k"),
+    satisfaction=c("VeryD", "LittleD", "ModerateS", "VeryS"))
> JobSat
        satisfaction
income   VeryD LittleD ModerateS VeryS
  < 15k      1       3        10     6
  15-25k     2       3        10     7
  25-40k     1       6        14    12
  > 40k      0       1         9    11
JobSat is a matrix, not an object of class("table"), and some functions are happier with tables than matrices. You can coerce it to a table with as.table().
> JobSat <- as.table(JobSat)
> str(JobSat)
 table [1:4, 1:4] 1 2 1 0 3 3 6 1 10 10 ...
 - attr(*, "dimnames")=List of 2
  ..$ income      : chr [1:4] "< 15k" "15-25k" "25-40k" "> 40k"
  ..$ satisfaction: chr [1:4] "VeryD" "LittleD" "ModerateS" "VeryS"
Ordered factors and reordered tables • In table form, the values of the table factors are ordered by their position in the table. Thus in the JobSat data, both income and satisfaction represent ordered factors, and the positions of the values in the rows and columns reflect their ordered nature. • Imagine that the Arthritis data were read from a text file. By default, Improved will be ordered alphabetically: Marked, None, Some — not what we want. In this case, the function ordered() (and others) can be useful.
> Arthritis <- read.csv("arthritis.txt", header=TRUE)
> Arthritis$Improved <- ordered(Arthritis$Improved, levels=c("None", "Some", "Marked"))
• With Improved (the response in these data) ordered this way, a mosaic display of Treatment and Improved shows a clearly interpretable pattern.
Finally, there are situations where, particularly for display purposes, you want to re-order the dimensions of an n-way table, or change the labels for the variables or levels. This is easy when the data are in table form: aperm() permutes the dimensions, and assigning to names and dimnames changes variable names and level labels respectively.
> UCB <- aperm(UCBAdmissions, c(2, 1, 3))
> dimnames(UCB)[[2]] <- c("Yes", "No")
> names(dimnames(UCB)) <- c("Sex", "Admit?", "Department")
> ftable(UCB)
              Department   A   B   C   D   E   F
Sex    Admit?
Male   Yes               512 353 120 138  53  22
       No                313 207 205 279 138 351
Female Yes                89  17 202 131  94  24
       No                 19   8 391 244 299 317
structable() • For 3-way and larger tables the structable() function in vcd provides a convenient and flexible tabular display. The variables assigned to the rows and columns of a two-way display can be specified by a model formula.
> structable(HairEyeColor) # show the table: default
             Eye Brown Blue Hazel Green
Hair  Sex
Black Male          32   11    10     3
      Female        36    9     5     2
Brown Male          53   50    25    15
      Female        66   34    29    14
Red   Male          10   10     7     7
      Female        16    7     7     7
Blond Male           3   30     5     8
      Female         4   64     5     8
> HairEyeColor
, , Sex = Male
       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8
, , Sex = Female
       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8
> structable(Hair + Sex ~ Eye, HairEyeColor) # specify col ~ row variables
      Hair Black        Brown        Red          Blond
      Sex  Male Female  Male Female  Male Female  Male Female
Eye
Brown        32     36    53     66    10     16     3      4
Blue         11      9    50     34    10      7    30     64
Hazel        10      5    25     29     7      7     5      5
Green         3      2    15     14     7      7     8      8
> HSE <- structable(Hair + Sex ~ Eye, HairEyeColor) # save structable object
> mosaic(HSE) # plot it
table() • You can generate frequency tables from factor variables using the table() function, tables of proportions using the prop.table() function, and marginal frequencies using margin.table().
> n <- 500
> A <- factor(sample(c("a1","a2"), n, rep=TRUE))
> B <- factor(sample(c("b1","b2"), n, rep=TRUE))
> C <- factor(sample(c("c1","c2"), n, rep=TRUE))
> mydata <- data.frame(A, B, C)
> # 2-Way Frequency Table
> attach(mydata)
The following objects are masked _by_ .GlobalEnv: A, B, C
> mytable <- table(A, B) # A will be rows, B will be columns
> mytable # print table
    B
A     b1  b2
  a1 137 116
  a2 119 128
> margin.table(mytable, 1) # A frequencies (summed over B)
A
 a1  a2
253 247
> margin.table(mytable, 2) # B frequencies (summed over A)
B
 b1  b2
256 244
> prop.table(mytable) # cell proportions
    B
A       b1    b2
  a1 0.274 0.232
  a2 0.238 0.256
> prop.table(mytable, 1) # row proportions
    B
A           b1        b2
  a1 0.5415020 0.4584980
  a2 0.4817814 0.5182186
> prop.table(mytable, 2) # column proportions
    B
A           b1        b2
  a1 0.5351562 0.4754098
  a2 0.4648438 0.5245902
> # 3-Way Frequency Table
> mytable <- table(A, B, C)
> ftable(mytable)
      C  c1 c2
A  B
a1 b1    73 64
   b2    63 53
a2 b1    68 51
   b2    65 63
ODDS RATIO - EXAMPLE • Chinook salmon captured in 1999 • VARIABLES:
- SEX: M or F (nominal)
- MODE OF CAPTURE: hook & line or net (nominal)
- RUN: early run (before July 1) or late run (after July 1) (ordinal)
- AGE: interval (continuous variable)
- LENGTH (eye to fork of tail, in mm): interval (continuous variable)
• What are the odds that a captured fish is female? Consider Success = Female (because females are heavier).
CHINOOK SALMON EXAMPLE • Computing the odds that a captured fish is female separately for hook & line and for net captures gives an estimated odds ratio of about 1.77. • The odds that a captured fish is female are 77% ((1.77 - 1) = 0.77) higher with hook & line compared to net.
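The slide's count tables are not reproduced here, so the sketch below uses hypothetical capture counts, chosen only so that the odds ratio comes out near the slide's 1.77:

```r
# Hypothetical counts (the slide's actual fish counts are not shown)
hook <- c(female = 62, male = 35)     # hook & line captures
net  <- c(female = 50, male = 50)     # net captures

odds_hook <- unname(hook["female"] / hook["male"])  # 62/35, about 1.77
odds_net  <- unname(net["female"] / net["male"])    # 50/50 = 1.00
OR <- odds_hook / odds_net                          # odds ratio
round(OR, 2)                                        # about 1.77
```

The odds ratio is a ratio of two odds, not of two probabilities, which is why the "77% higher" statement refers to odds, not to the chance of capture being female.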
ODDS RATIO • In general, for a 2 x 2 table with counts n11, n12 under condition 1 (row 1) and n21, n22 under condition 2 (row 2):
OR = odds(success | condition 1) / odds(success | condition 2) = (n11/n12) / (n21/n22) = (n11 n22) / (n12 n21)
INTERPRETATION OF OR • What does OR = 1 mean? The odds of success are equal under both conditions, e.g. no matter which mode of capture is used.
INTERPRETATION OF OR • OR > 1: the odds of success are higher under condition 1. • OR < 1: the odds of success are lower under condition 1.
SHAPE OF OR • The range of OR is 0 ≤ OR < ∞. • ln(OR) has a more symmetric distribution than OR (i.e., closer to a normal distribution). • OR = 1 corresponds to ln(OR) = 0. • A (1 - α)100% confidence interval for ln(OR): ln(OR) ± z(α/2) SE(ln OR), where SE(ln OR) = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22). • A (1 - α)100% confidence interval for OR is obtained by exponentiating the endpoints: exp[ln(OR) ± z(α/2) SE(ln OR)]. This interval is non-symmetric around OR.
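The interval computation can be sketched in R. The 2 x 2 counts below are hypothetical, not the salmon data:

```r
# Wald CI for ln(OR), then exponentiate for a (non-symmetric) CI for OR
tab <- matrix(c(62, 35,        # condition 1: successes, failures (hypothetical)
                50, 50),       # condition 2: successes, failures
              nrow = 2, byrow = TRUE)
OR <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
se <- sqrt(sum(1 / tab))                      # SE of ln(OR)
ci_log <- log(OR) + c(-1, 1) * qnorm(0.975) * se
exp(ci_log)                                   # 95% CI for OR itself
```

Note that the interval is built on the log scale, where the sampling distribution is roughly normal, and only then transformed back; this is why the CI for OR is not symmetric about the point estimate.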
CHINOOK SALMON EXAMPLE (Contd.) • With 95% confidence, the odds that a captured fish is female are about 30% to 140% greater with hook & line than with a net.
OTHER MEASURES OF ASSOCIATION FOR 2x2 TABLES • Relative Risk = P(success | condition 1) / P(success | condition 2) = [n11/(n11 + n12)] / [n21/(n21 + n22)]
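The contrast between relative risk (a ratio of conditional probabilities) and the odds ratio can be sketched on a hypothetical 2 x 2 table:

```r
# Relative risk compares probabilities; the odds ratio compares odds
tab <- matrix(c(62, 35,                  # hypothetical counts
                50, 50), nrow = 2, byrow = TRUE)
p1 <- tab[1, 1] / sum(tab[1, ])          # P(success | condition 1)
p2 <- tab[2, 1] / sum(tab[2, ])          # P(success | condition 2)
RR <- p1 / p2                            # relative risk
OR <- (p1 / (1 - p1)) / (p2 / (1 - p2))  # odds ratio, for comparison
c(RR = RR, OR = OR)
```

For rare outcomes (small p1, p2) the two measures are close; for common outcomes, as here, the odds ratio is farther from 1 than the relative risk.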
MEASURES OF ASSOCIATION FOR IxJ TABLES • Pearson χ² in contingency tables: χ² = Σ (Observed - Expected)² / Expected, summed over all IJ cells. • EXAMPLE: Instrument failure
PEARSON χ² IN CONTINGENCY TABLES • Question: are the type of failure and the location of failure independent?
H0: Type and location are independent
H1: They are not independent
• We will compare sample frequencies (i.e. observed values) with the expected frequencies under H0. • Remember that if events A and B are independent, then P(A ∩ B) = P(A)P(B). • If type and location are independent, then P(T1 and L1) = P(T1)P(L1) = (97/200)(111/200)
PEARSON χ² IN CONTINGENCY TABLES • Cells ~ Multinomial(n, p1, ..., p6), so E(Xi) = n pi • Expected frequency E11 = n × prob. = 200(97/200)(111/200) = 53.84
• E12 = 200(97/200)(42/200) = 20.37
• E13 = (97 × 47)/200 = 22.80
• E21 = (103 × 111)/200 = 57.17
• E22 = (103 × 42)/200 = 21.63
• E23 = (103 × 47)/200 = 24.20
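All six expected counts can be reproduced at once from the margins quoted on the slide, using outer():

```r
# E_ij = (row total x column total) / n for the instrument-failure table
row_tot <- c(97, 103)          # failure-type margins from the slide
col_tot <- c(111, 42, 47)      # location margins from the slide
n <- 200
E <- outer(row_tot, col_tot) / n
E   # compare with E11 = 53.84, ..., E23 = 24.20 above (two-decimal rounding)
```

This is exactly what chisq.test() computes internally; the fitted values are available as chisq.test(x)$expected.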
CRAMER'S V • It adjusts the Pearson χ² for n, I and J: V = sqrt[χ² / (n × min(I-1, J-1))]. In the previous example, I = 2 and J = 3, so min(I-1, J-1) = 1 and V = sqrt(χ²/200).
EXAMPLE
> (HairEye <- margin.table(HairEyeColor, c(1, 2)))
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
> chisq.test(HairEye)
 Pearson's Chi-squared test
data: HairEye
X-squared = 138.29, df = 9, p-value < 2.2e-16
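Cramér's V, sqrt[χ² / (n × min(I-1, J-1))], can be computed directly from this χ² output:

```r
# Cramer's V for the 4 x 4 Hair x Eye table
HairEye <- margin.table(HairEyeColor, c(1, 2))
chi2 <- unname(chisq.test(HairEye)$statistic)   # 138.29
n <- sum(HairEye)                               # 592 cases
V <- sqrt(chi2 / (n * (min(dim(HairEye)) - 1))) # min(I-1, J-1) = 3
round(V, 3)                                     # about 0.28
```

The assocstats() function in vcd reports Cramér's V (along with the phi coefficient and contingency coefficient) without the manual arithmetic.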
Correlation between a continuous and categorical variable • Point biserial Correlation: The point biserial correlation coefficient, rpbi, is a special case of Pearson’s correlation coefficient. It measures the relationship between two variables: • One continuous variable (must be ratio scale or interval scale). • One naturally binary variable. • Many different situations call for analyzing a link between a binary variable and a continuous variable. For example: • Does Drug A or Drug B improve depression? • Are women or men likely to earn more as nurses?
Point biserial correlation • The formula for the point biserial correlation coefficient is:
r(pbi) = [(M1 - M0) / Sn] × sqrt(p q)
• M1 = mean score (on the continuous variable) of the group coded "1" on the binary variable.
• M0 = mean score of the group coded "0".
• Sn = standard deviation of the continuous variable for the entire sample (population form, dividing by n).
• p = proportion of cases in the "0" group.
• q = proportion of cases in the "1" group.
• The point biserial calculation assumes that the continuous variable is normally distributed and homoscedastic.
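With the population form of the standard deviation, the point biserial formula reproduces Pearson's r exactly. A sketch with toy data (variable names and values are illustrative):

```r
x <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)            # binary group indicator
y <- c(10, 12, 11, 13, 15, 14, 16, 15, 17, 14)  # continuous score
M1 <- mean(y[x == 1]); M0 <- mean(y[x == 0])    # group means
p  <- mean(x == 1);    q  <- 1 - p              # group proportions
n  <- length(y)
Sn <- sd(y) * sqrt((n - 1) / n)                 # SD with the n divisor
r_pbi <- (M1 - M0) / Sn * sqrt(p * q)
all.equal(r_pbi, cor(x, y))                     # identical to Pearson's r
```

This agreement is why cor.test(yos, disease) on the next slide is a legitimate way to compute and test the point biserial coefficient.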
Example • In the example below, two variables are created. The first, yos, refers to years of smoking. The second, disease, refers to the presence of a certain disease, with 0 indicating absence and 1 indicating presence.
yos <- sample(1:30, 100, replace = TRUE)
disease <- sample(0:1, 100, replace = TRUE)
cor.test(yos, disease)
 Pearson's product-moment correlation
data: yos and disease
t = 0.62294, df = 98, p-value = 0.5348
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1352848 0.2560616
sample estimates:
       cor
0.06280213
> library(ltm)
> biserial.cor(yos, disease)
[1] -0.06280213
This is the same calculation as used for the Pearson correlation! The direction (sign) of the correlation coefficient is based purely on the coding of the categorical variable, i.e. which level is coded as 0 and which as 1.
Biserial correlation • Biserial correlation is almost the same as point biserial correlation, except that the dichotomous variable is assumed to have an underlying continuity. For example, depression level can be measured on a continuous scale, but can be classified dichotomously as high/low. • The formula is:
rb = [(Y1 - Y0) × (p q / y)] / σy, where
• Y0 = mean score for data pairs with x = 0,
• Y1 = mean score for data pairs with x = 1,
• q = proportion of data pairs with x = 0,
• p = proportion of data pairs with x = 1,
• y = height of the standard normal density at the point dividing p and q,
• σy = population standard deviation.
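A sketch of this formula, where the normal ordinate y is dnorm(qnorm(p)); the helper function and toy data are illustrative:

```r
# Biserial correlation per the formula above; assumes x dichotomizes
# a latent normal variable at the p/q split
biserial_r <- function(y, x) {
  Y1 <- mean(y[x == 1]); Y0 <- mean(y[x == 0])
  p  <- mean(x == 1);    q  <- 1 - p
  ord <- dnorm(qnorm(p))                 # normal ordinate at the cut point
  n  <- length(y)
  sy <- sd(y) * sqrt((n - 1) / n)        # population standard deviation
  (Y1 - Y0) * p * q / (ord * sy)
}

x <- c(0, 0, 0, 1, 1, 1, 1, 0, 1, 0)
y <- c(10, 12, 11, 15, 14, 16, 15, 13, 17, 14)
biserial_r(y, x)   # larger in magnitude than the point biserial value
```

Since sqrt(pq)/ord is at least about 1.25 (its value at p = 0.5), the biserial coefficient always exceeds the point biserial in absolute value.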
Example • An example might be test performance vs anxiety, where anxiety is designated as either high or low. Presumably, anxiety can take on any value in between, perhaps beyond, but it may be difficult to measure. We further assume that anxiety is normally distributed.
> prop.table(table(disease))
disease
   0    1
0.43 0.57
> library(polycor)
> polyserial(yos, disease)
[1] 0.07915729
CORRELATION BETWEEN ORDINAL VARIABLES • Correlation coefficients are used to quantitatively describe the strength and direction of a relationship between two variables. • When both variables are at least interval measurements, one may report the Pearson product-moment coefficient of correlation, also known as the correlation coefficient and denoted by r. • The Pearson correlation coefficient is only appropriate for describing linear correlation; its appropriateness can be examined through scatter plots. • A statistic that measures the correlation between two rank measurements is Spearman's ρ, a nonparametric analog of Pearson's r. • Spearman's ρ is appropriate for skewed continuous or ordinal measurements. It can also be used to determine the relationship between one continuous and one ordinal variable. • Statistical tests are available to test hypotheses on ρ. H0: there is no correlation between the two variables (H0: ρ = 0).
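A quick sketch (data illustrative): Spearman's ρ is just Pearson's r applied to the ranks of the observations.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)                 # monotone trend with local swaps
rho <- cor(x, y, method = "spearman")
all.equal(rho, cor(rank(x), rank(y)))    # same as Pearson on ranks
cor.test(x, y, method = "spearman")      # tests H0: rho = 0
```

Because only ranks enter the computation, ρ is unchanged by any monotone transformation of either variable, which is what makes it suitable for skewed or ordinal data.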
Rank-biserial correlation coefficient • The rank-biserial correlation coefficient, rrb, is used for dichotomous nominal data vs rankings (ordinal). The formula is usually expressed as rrb = 2(Y1 - Y0)/n, where n is the number of data pairs, and Y0 and Y1, again, are the mean Y scores for data pairs with an x score of 0 and 1, respectively. Here the Y scores are ranks. This formula assumes no tied ranks are present. This may be the same as Somers' D, for which an online calculator is available.
Coefficient of nonlinear relationship (eta) • It is often useful to measure a relationship irrespective of whether it is linear. The eta correlation ratio, or eta coefficient, gives us that ability. This statistic is interpreted similarly to the Pearson coefficient, but can never be negative. It utilizes equal-width intervals and always equals or exceeds |r|. However, even though r is the same whether we regress y on x or x on y, two different values of eta can be obtained.
MEASURES OF ASSOCIATION FOR IxJ TABLES FOR TWO ORDINAL VARIABLES • These measures are built from concordant pairs (C: pairs of observations ordered the same way on both variables) and discordant pairs (D: pairs ordered in opposite ways). • Gamma = (C - D)/(C + D). • Kendall's tau-b adjusts for ties: tau-b = (C - D)/sqrt[(C + D + Tx)(C + D + Ty)], where Tx and Ty are the numbers of pairs tied only on the row and only on the column variable, respectively.
• Why are there multiple measures of association? Statisticians over the years have thought of varying ways of characterizing what a perfect relationship is: tau-b = 1, gamma = 1 versus tau-b < 1, gamma = 1. Either of these might be considered a perfect relationship, depending on one's reasoning about what relationships between variables look like.
Rule of Thumb • Gamma tends to overestimate strength, but gives an idea of the upper boundary. • If the table is square, use tau-b; if rectangular, use tau-c. • Pollock (and we agree): τ < .1 is weak; .1 < τ < .2 is moderate; .2 < τ < .3 is moderately strong; .3 < τ < 1 is strong.
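For a 2 x 2 table the pair counts are simple products of diagonal cells, so gamma can be sketched directly (counts illustrative):

```r
# gamma = (C - D) / (C + D), with C = concordant and D = discordant pairs
tab <- matrix(c(20,  5,
                10, 15), nrow = 2, byrow = TRUE)
C <- tab[1, 1] * tab[2, 2]   # pairs ordered the same way on both variables
D <- tab[1, 2] * tab[2, 1]   # pairs ordered in opposite directions
gamma <- (C - D) / (C + D)
gamma                        # (300 - 50) / 350 = 5/7
```

For 2 x 2 tables this statistic coincides with Yule's Q; for larger ordinal tables, C and D sum over all cell pairs that are jointly concordant or discordant.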
MEASUREMENT OF AGREEMENT FOR IxI TABLES • Probability of agreement: Po = Σ pii, the total probability on the main diagonal. • Probability of agreement expected by chance under independence: Pe = Σ pi+ p+i. • Cohen's kappa: κ = (Po - Pe) / (1 - Pe). κ = 0 when agreement equals that expected by chance; κ = 1 for perfect agreement.
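The kappa computation can be sketched on a 4 x 4 agreement table. The counts below are illustrative, chosen only to total 118 and give κ near 0.5 so they are consistent with the example that follows; they need not be the slide's exact data:

```r
# Cohen's kappa from a square rater-by-rater frequency table
tab <- matrix(c(22,  2,  2,  0,
                 5,  7, 14,  0,
                 0,  2, 36,  0,
                 0,  1, 17, 10), nrow = 4, byrow = TRUE)
p  <- tab / sum(tab)                    # cell proportions
Po <- sum(diag(p))                      # observed agreement
Pe <- sum(rowSums(p) * colSums(p))      # agreement expected under independence
kappa <- (Po - Pe) / (1 - Pe)
round(kappa, 2)                         # about 0.49
```

The Kappa() function in vcd computes the same statistic together with a standard error, so a confidence interval for κ is available without hand calculation.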
EXAMPLE (COHEN'S KAPPA, or Index of Inter-rater Reliability) • Two pathologists examined 118 samples and categorized them into 4 groups. Below is the 4x4 table of their decisions.
EXAMPLE (Contd.) The difference between the observed agreement and that expected under independence is about 50% of the maximum possible difference.
EVALUATION OF KAPPA • If the obtained κ is less than .70, conclude that the inter-rater reliability is not satisfactory. • If the obtained κ is greater than .70, conclude that the inter-rater reliability is satisfactory. • Interpretation of kappa, after Landis and Koch (1977).
Mantel-Haenszel test and conditional association • Use the mantelhaen.test(X) function to perform a Cochran-Mantel-Haenszel chi-square test of the null hypothesis that two nominal variables are conditionally independent, A ⊥ B | C, in each stratum, assuming that there is no three-way interaction. X is a 3-dimensional contingency table, where the last dimension refers to the strata.
The UCBAdmissions data serves as an example of a 2 × 2 × 6 table, with Dept as the stratifying variable.
> ## UC Berkeley Student Admissions
> mantelhaen.test(UCBAdmissions)
 Mantel-Haenszel chi-squared test with continuity correction
data: UCBAdmissions
Mantel-Haenszel X-squared = 1.4269, df = 1, p-value = 0.2323
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.7719074 1.0603298
sample estimates:
common odds ratio
        0.9046968
• The results show no evidence of association between admission and gender when adjusted for department. However, we can easily see that the assumption of equal association across the strata (no 3-way association) is probably violated. For 2 × 2 × k tables, this can be examined from the odds ratios for each 2 × 2 table (oddsratio()), and tested by using woolf_test() in vcd.
> library(vcd)
> oddsratio(UCBAdmissions, log=FALSE)
odds ratios for Admit and Gender by Dept
        A         B         C         D         E         F
0.3492120 0.8025007 1.1330596 0.9212838 1.2216312 0.8278727
> lor <- oddsratio(UCBAdmissions) # capture log odds ratios
> summary(lor)
z test of coefficients:
   Estimate Std. Error z value  Pr(>|z|)
A -1.052076   0.262708 -4.0047 6.209e-05 ***
B -0.220023   0.437593 -0.5028    0.6151
C  0.124922   0.143942  0.8679    0.3855
D -0.081987   0.150208 -0.5458    0.5852
E  0.200187   0.200243  0.9997    0.3174
F -0.188896   0.305164 -0.6190    0.5359
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> woolf_test(UCBAdmissions)
 Woolf-test on Homogeneity of Odds Ratios (no 3-way assoc.)
data: UCBAdmissions
X-squared = 17.902, df = 5, p-value = 0.003072
H0: OR(stratum 1) = OR(stratum 2) = ... = OR(stratum K) "common association"
H1: at least one differs from the others "there is effect modification"
We can visualize the odds ratios of Admission for each department with fourfold displays using fourfold(). The cell frequencies nij of each 2×2 table are shown as a quarter circle whose radius is proportional to nij , so that its area is proportional to the cell frequency. Confidence rings for the odds ratio allow a visual test of the null of no association; the rings for adjacent quadrants overlap iff the observed counts are consistent with the null hypothesis. col <- c("#99CCFF", "#6699CC", "#F9AFAF", "#6666A0", "#FF0000", "#000080") fourfold(UCB,mfrow=c(2,3), color=col)