Lecture 5 – Categorical Data and Survival Analyses

OUTLINE • Definition • Common CDA • Descriptive summaries • Tests of Association • Modeling • Extensions • Other examples in CDA

What is Categorical Data Analysis? • Statistical analysis of data that are categorical (cannot be summarized with mean +/- SD) • Includes dichotomous, ordinal, nominal outcomes • Examples: Disease prevalence, Discharge location, Treatment adherence (yes/no)

Examples of Studies with CDA • MI after CABG • Diagnostic studies looking at sensitivity and specificity of a new test/procedure • Discharge location after a new surgical intervention

How to analyze words? • Order vs. no order • For continuous outcomes: break down mean +/- SD for two groups • Do the same for categorical outcomes: break down outcome %’s for two groups

How to analyze words? • Comparing length of stay after CABG: • New Trt = 19.2 +/- 2.7 • SOC = 21.3 +/- 3.3 • Comparing prevalence of MI: • New Trt = 16% • SOC = 24% • Are these differences statistically significant? clinically significant?

Choice of End Point • Some designs have a binary response variable • MI after 3 years • Overall Survival • Time to CVD • Time to recurrent MI • Can Dichotomize as 1 year rate (Yes/No)

What is Categorical Data Analysis? • paper example

Common CDA • Descriptive summaries • Tests for association • Modeling

Descriptive summaries

Let’s Talk Data…

Descriptive Summaries in CDA • Nominal – Categorical data measured in unordered categories (summarize with %’s) • Ordinal – Categorical data measured in ordered categories (summarize with %’s) • Continuous – Quantitative data measured on a continuum (summarize with many measures)

Types of Data • Nominal – Categorical data measured in unordered categories: Race, Blood Type, Gender • Ordinal – Categorical data measured in ordered categories: Cancer Stages, Socio-economic Status (low, medium, high), Likert (unlikely, neutral, likely) • Continuous – Quantitative data measured on a continuum: Serum Creatinine, Height/Weight/BMI, Diastolic Blood Pressure, Tumor measurements

What the data might look like…

Compare Categorical Outcomes between groups • How to assess if a predictor is associated with a categorical outcome? • Intuitive?: Get the %’s of the outcome prevalence within each predictor group. • Example: New Trt and MI. • New Trt response rate = 16% • SOC response rate = 24%

Contingency Tables

CDA Summary with Contingency Table • Research question: Is there a relationship between Group and heart attack (MI)? • Better to convert the table into percentages (easier to see)

What the data might look like…

Step 1. Break down the frequencies

Step 2. Get the different %’s

Row vs. Column %’s: It’s your choice • Row %’s: • 40% of New trt patients had MI vs. 80% of Old trt patients had MI • Col %’s: • 75% of No MI were in the New trt group vs. 33% of MI were in New trt group • P-value for test of association is the same!
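
The row and column percentages quoted above can be reproduced with a short Python sketch. The counts below are an assumption: 20 patients per group is not stated in the slides, but it is one table consistent with the quoted percentages. The names `counts`, `row_percents`, and `col_percents` are hypothetical.

```python
# Hypothetical 2x2 counts chosen to match the percentages quoted above
# (20 patients per group is an assumption, not stated in the slides).
counts = {
    "New trt": {"MI": 8, "No MI": 12},
    "Old trt": {"MI": 16, "No MI": 4},
}

def row_percents(table):
    """Percent of each row (group) falling in each column (outcome)."""
    return {
        row: {col: 100 * n / sum(cells.values()) for col, n in cells.items()}
        for row, cells in table.items()
    }

def col_percents(table):
    """Percent of each column (outcome) falling in each row (group)."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

row_pct = row_percents(counts)
col_pct = col_percents(counts)

# Row %'s: MI rate within each treatment group.
print(round(row_pct["New trt"]["MI"]), round(row_pct["Old trt"]["MI"]))    # 40 80
# Col %'s: share of New trt patients within each outcome.
print(round(col_pct["New trt"]["No MI"]), round(col_pct["New trt"]["MI"]))  # 75 33
```

Both views come from the same four counts, which is why the test of association gives the same p-value either way.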

Tests for Association

CDA tests for Association • Is there a significant association between Group and MI? • What is a good way to test for an association between the two?

Test for significant differences • The most common tests are the Chi-square test and Fisher’s Exact test. • Research question: Is there an association between treatment group and MI? • To answer this: Compare what you would expect if there were no association to what you observed

Expect if no relationship?

Same % with MI by Group

Test for significant differences • Having exactly the same response % in both groups would favor “no association” • There is another general way to calculate what you “expect” • Use Row totals, Column totals, Grand total to calculate “Expected” frequencies

Observed vs. Expected Frequencies • Observed frequencies = actual counts • “Expected” frequencies: = Row total x Column total / Grand total (why?)
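
One way to answer the "why?": under no association, the chance of landing in a given cell is the product of its row share and column share, so the expected count is Grand total x (Row total / Grand total) x (Column total / Grand total) = Row total x Column total / Grand total. A minimal Python sketch, using assumed counts (8 and 12 in one group, 16 and 4 in the other; these are illustrative, not taken from the slides):

```python
# Assumed observed counts. Rows: New trt, Old trt. Columns: MI, No MI.
observed = [[8, 12], [16, 4]]

def expected_counts(table):
    """Expected cell counts under no association: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
print(expected)  # [[12.0, 8.0], [12.0, 8.0]]
```

Note that both rows of the expected table have the same MI percentage (12 of 20), which is exactly what "no association" means.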

What you actually observed in Study

“Expected” frequencies

Chi-square test • Quantify if the actual frequencies are far enough away from the Expected (assuming no association) • We can quantify using the Chi-square test statistic • We can get the p-value to determine if there is a significant association.

Chi-square test for association in RxC table • H0: There is no association between rows and columns • The classic Pearson’s chi-squared test of independence: X² = sum over all cells of (Observed - Expected)² / Expected • For an RxC table, df = (R-1) x (C-1); for a 2x2 table, df = (2-1) x (2-1) = 1 • Conservatively, we require expected counts ≥ 5 in every cell (i, j)

Chi-square Test • Associated P-value for this Chi-square value is p=0.0098. • Thus, we conclude group and MI are significantly associated (given α = 0.05).
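
As a rough check, the Pearson statistic and its p-value can be computed with the Python standard library alone. The counts below are an assumption (20 patients per group, chosen to be consistent with 40% vs. 80% MI); the sketch applies no continuity correction, and the helper names are hypothetical. For df = 1 the upper-tail chi-square probability equals erfc(sqrt(x/2)), which is what `chi2_pvalue_df1` uses.

```python
import math

# Assumed observed counts. Rows: New trt, Old trt. Columns: MI, No MI.
observed = [[8, 12], [16, 4]]

def pearson_chi_square(table):
    """Return (statistic, df) for Pearson's chi-square test, no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def chi2_pvalue_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

stat, df = pearson_chi_square(observed)
p_value = chi2_pvalue_df1(stat)
print(round(stat, 3), df, round(p_value, 4))  # 6.667 1 0.0098
```

Under these assumed counts the result agrees with the p = 0.0098 quoted above. Statistical packages often apply a continuity correction to 2x2 tables by default, which would give a larger p-value.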

“Expected” frequencies

Fisher’s Exact Test • Fisher’s Exact test tests similar hypotheses to the Chi-square test. • Use Fisher’s Exact test when the assumptions of the Chi-square test are not satisfied. • That is, when any Expected count is < 5 (in practice, when cell counts are small).
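
A minimal sketch of the two-sided Fisher’s Exact test for a 2x2 table, using only the Python standard library. It fixes the row and column totals, computes the hypergeometric probability of every possible table, and sums the probabilities of tables no more likely than the one observed (one common definition of the two-sided p-value). The small table used here is hypothetical and was chosen so that every expected count is below 5.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )

# Hypothetical small study: 3 of 4 with the event in one group, 1 of 4 in the other.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_small, 4))  # 0.4857
```

With only 8 patients, even a 75% vs. 25% split is not statistically significant, which illustrates how little information small cell counts carry.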

Confidence Intervals for %’s

Confidence Interval for %’s You conduct your follow-up after CABG study and accrue 40 patients. After 3 years 20 out of all 40 patients have had a MI. Q1. What is your best guess at the true (population) MI rate at 3 years? A. Based on your sample, 20/40 = 50%

Sampling Variability • Population: MI at 3 yrs = ? • Sample: MI = 50% • Inference goes from the sample to the population

Sampling Variability • Population: MI at 3 yrs = ? • Sample: MI = 44% • Inference goes from the sample to the population

Confidence Interval for %’s A good way to make inference about the range of plausible values of the population % is to calculate a Confidence Interval (CI). Q2. How much precision do you have in terms of estimating the MI rate at 3 yrs. in the population based on your sample?

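95% Confidence Intervals • 95% Confidence Interval for Mean: mean +/- 1.96 x SD / √n • 95% Confidence Interval for Proportion (Standard “Wald” CI): p +/- 1.96 x √(p(1-p)/n), where p is the sample proportion and n is the sample size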

Confidence Interval for %’s Q2. How much precision do you have in terms of estimating the MI rate in the population based on your sample? (Remember, 20 of 40 total had MI) A. A 95% Wilson CI for the population MI rate is (35.2%, 64.8%). Thus, if we repeated our study over and over again, each time drawing a sample of 40 patients and computing a 95% CI, approximately 95% of those intervals would contain the true population MI rate at 3 yrs. For our sample, the plausible range for the true rate is 35.2% to 64.8%.
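
The interval for 20 of 40 can be checked with a short Python sketch that computes both the standard Wald interval and the Wilson (score) interval. It assumes the usual z = 1.96 for 95% confidence; the function names are hypothetical.

```python
import math

def wald_ci(x, n, z=1.96):
    """Standard Wald interval: p +/- z * sqrt(p(1-p)/n)."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(x, n, z=1.96):
    """Wilson (score) interval for a binomial proportion."""
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

wald_lo, wald_hi = wald_ci(20, 40)
wilson_lo, wilson_hi = wilson_ci(20, 40)
print(round(100 * wald_lo, 1), round(100 * wald_hi, 1))      # 34.5 65.5
print(round(100 * wilson_lo, 1), round(100 * wilson_hi, 1))  # 35.2 64.8
```

At p = 50% the two intervals are close; they differ more when the observed proportion is near 0% or 100%.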

Confidence Interval for %’s • What’s interesting is that there are “lucky” and “unlucky” combinations of p (response rate) and N (sample size) • That is, for a given sample size: for some p you may have higher ability to make inference; for some p you may have less ability! • Not to scald the Wald, but not all CI’s are created equal • Paper
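
One way to see the "lucky" and "unlucky" combinations is to compute the exact coverage of the Wald interval: for a given true p and N, add up the binomial probabilities of every outcome whose interval contains p. The sketch below is illustrative only; the choice of N = 40 and the list of p values are assumptions, and `wald_coverage` is a hypothetical helper.

```python
from math import comb, sqrt

def wald_coverage(p, n, z=1.96):
    """Exact probability that the Wald interval from a Binomial(n, p) sample contains p."""
    total = 0.0
    for x in range(n + 1):
        p_hat = x / n
        half = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half <= p <= p_hat + half:
            # Add the probability of observing exactly x events.
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total

# Coverage of the nominal 95% Wald interval for several true rates, N = 40.
for p in (0.05, 0.10, 0.20, 0.50):
    print(p, round(wald_coverage(p, 40), 3))
```

A nominal 95% interval ideally covers the truth 95% of the time. For rare outcomes the Wald interval falls well short: when x = 0 the interval collapses to a single point, so those samples never cover the true rate.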

Modeling in CDA

Modeling in CDA • Modeling is done with variations of Logistic Regression: • Dichotomous • Ordinal (Proportional odds) • Nominal (Generalized logit) • Conditional (Matched-pairs) • Exact (small sample size/rare outcome) • Longitudinal (GEE, GLMM) • Simple (1 predictor) vs. Multivariable (>1 predictor/adjusted)

Why use adjusted analysis? • Do you think patient demographics or clinical characteristics at baseline would affect MI? • What if half of the patients are all <30 yrs. old and half are all >80 yrs. old? • What are some possible confounders of response? Effect modifiers? • These are testable in adjusted analyses.

You may not need adjusted analysis. • Trials typically have well-defined, specific patient populations of interest. • Thus, inclusion/exclusion criteria might have removed variability from potential confounders • A well-designed, well-executed trial usually does not require intensive and complex analysis.

What is Logistic Regression? • In a nutshell: A statistical method used to model dichotomous (binary) outcomes using predictor variables; extensions handle other categorical outcomes. Used when the research question is focused on whether or not an event occurred, rather than when it occurred (time course information is not used).
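
For the simplest case, one binary predictor, the logistic regression estimates can be written down directly from the group proportions, because the fitted model reproduces the observed rates exactly. The sketch below assumes the model logit P(MI) = b0 + b1 x new_trt and uses hypothetical counts (8 of 20 with MI on the new treatment, 16 of 20 on the old); real analyses with several predictors need an iterative fitting routine from a statistics package.

```python
import math

# Hypothetical counts: events and totals per group (an assumption for illustration).
old_mi, old_n = 16, 20
new_mi, new_n = 8, 20

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

def expit(x):
    """Inverse of logit: converts log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

p_old = old_mi / old_n
p_new = new_mi / new_n

# Model: logit P(MI) = b0 + b1 * new_trt, with new_trt = 1 for New trt, 0 for Old trt.
b0 = logit(p_old)                 # log-odds of MI in the reference (Old trt) group
b1 = logit(p_new) - logit(p_old)  # log odds ratio, New trt vs. Old trt
odds_ratio = math.exp(b1)

print(round(b0, 3), round(b1, 3), round(odds_ratio, 3))  # 1.386 -1.792 0.167
print(round(expit(b0 + b1), 2))  # 0.4, the observed MI rate on New trt
```

An odds ratio below 1 means lower odds of MI on the new treatment; here the odds are about one sixth of those on the old treatment under the assumed counts.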
