Biostats Methods I or: A Biostats Refresher or: A t-test isn’t all you need Leann Myers Dept. of Biostatistics & Bioinformatics Tidewater 2031 myersl@tulane.edu
Two Assumptions and a Truth
1. Everyone here has a basic understanding of stats.
2. Most everyone here doesn’t want another lecture on stats.
Truth: I can’t teach you statistics in an hour (or even 2).
Outline
Basic descriptive statistics
Choosing a method of analysis
Simple comparisons
Realistic comparisons
Data files: Excel doesn’t always
General linear models: interpreting regression equations
Power and sample size
“Case” studies
Bare Bones Statistics Data are generally either continuous or categorical. Different techniques are used to analyze different types of outcomes. Summarizing data is the most basic use of statistics.
Summary Statistics
Continuous Data:
group 1: 27 27 27 27 27
group 2: 7 17 27 37 47
Mean: the mean is 27 in both groups
Standard deviation: SD1 = 0, SD2 = 15.81
Standard Error: weight that SD by the sample size (SE = SD/√n): SE1 = 0, SE2 = 7.07
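As a quick check on these summary statistics, here is a minimal sketch that computes the mean, the sample SD (n - 1 denominator) and the standard error directly from the two groups listed on the slide. The function name `summarize` is just an illustrative choice.

```python
import math
import statistics

group1 = [27, 27, 27, 27, 27]
group2 = [7, 17, 27, 37, 47]

def summarize(scores):
    """Return (mean, sample SD, standard error of the mean)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)        # sample SD, n - 1 in the denominator
    se = sd / math.sqrt(len(scores))     # SD weighted by the sample size
    return mean, sd, se

for name, scores in [("group 1", group1), ("group 2", group2)]:
    mean, sd, se = summarize(scores)
    print(f"{name}: mean={mean:.2f}  SD={sd:.2f}  SE={se:.2f}")
```

Both groups share the same mean, so only the SD and SE reveal how differently the scores are spread.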
Summary Statistics
Continuous Data:
Mean: the mean is 150
Standard deviation: SD = 20
Subject’s score is 250. I’m 99% certain that subject didn’t come from this sample.
If the data are basically normally distributed, more than 99% of the scores fall between 90 and 210 (± 3 SDs).
This goes back to the standard normal distribution:
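The reasoning on this slide can be sketched with the standard library's `NormalDist`, assuming the scores really are close to normal. The variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 150, 20
score = 250

# How many SDs is the subject's score from the mean?
z = (score - mean) / sd

std_normal = NormalDist()  # mean 0, SD 1

# Share of a normal distribution that lies within 3 SDs of the mean
within_3sd = std_normal.cdf(3) - std_normal.cdf(-3)

# Chance of a score at least this far from the mean, in either direction
beyond_score = 2 * (1 - std_normal.cdf(abs(z)))

print(f"z = {z:.1f}")
print(f"within ±3 SDs: {within_3sd:.4f}")
print(f"as extreme as {score}: {beyond_score:.2e}")
```

A score 5 SDs out is far rarer than the ±3 SD band suggests, which is why the subject looks like an outsider.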
Standard Normal Distribution Those distributions you learned in intro ARE good for something.
Two Groups We’re usually not concerned about a single score, but instead about comparing different groups. For the mean, the “SD” is the Standard Error. If you assume both groups have the same SE, then looking at overlap in the curves clues you in about significance.
Basic Statistics Tests Student’s t-test is a variation of that concept, allowing smaller sample sizes ANOVA is an extension of the concept Most statistics can be turned into z-scores and assessed:
It’s all normal If you compare two proportions, you are comparing two means. You can generate a z (standard normal deviate) or two; back to the graphs. If you compare a regression coefficient to its null value (0), which is the typical null hypothesis with regression, you’re computing a z or a t; back to the graphs. The primary limitations are the necessary assumptions of normality, or at least something close, and equality of variances (SEs).
Or is it all abnormal? Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105:156–166 Do you know how to tell if your data are normally distributed? Possible solutions—transform, cutpoint, use a different test
Summary Statistics
Continuous Data: Median (and other percentiles)
group 1: 27 27 27 27 27   med1 = 27
group 2: 7 17 27 37 47    med2 = 27
group 3: 0 2 3 60 70      med3 = 3
mean = 27 in all three groups
Quick test of symmetry: mean ≈ median
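The quick symmetry check can be tried on the three groups from the slide. This is only a sketch: the tolerance used to decide "approximately equal" is an arbitrary assumption, not a rule from the talk.

```python
import statistics

groups = {
    "group 1": [27, 27, 27, 27, 27],
    "group 2": [7, 17, 27, 37, 47],
    "group 3": [0, 2, 3, 60, 70],
}

def looks_symmetric(scores, tolerance=1.0):
    """Rough check: is the mean within `tolerance` of the median?"""
    return abs(statistics.mean(scores) - statistics.median(scores)) <= tolerance

for name, scores in groups.items():
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    print(f"{name}: mean={mean}, median={median}, "
          f"symmetric-looking={looks_symmetric(scores)}")
```

Group 3 has the same mean as the others, but its median is far lower, which flags the skew.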
Summary Statistics
Continuous Data: Range: maximum value - minimum value
group 1: 27 27 27 27 27
group 2: 7 17 27 37 47
group 3: 0 2 3 60 70
You can use the range to estimate the SD: range/6 (assuming symmetry)
Summary Statistics If you decide to report the median instead of the mean: (and why would you do this?) then the range or interquartile range (IQR) is a more appropriate measure of dispersion than the Standard Deviation
Summary Statistics
Categorical Data: Frequencies and proportions
To compare two proportions, turn the difference into a z
H0: p1 = p2, or equivalently H0: p1 – p2 = 0
z = [(p1 – p2) – E(p1 – p2)] / SE(p1 – p2)
(p1 – p2) is computed from the data
E(p1 – p2) is given by the H0
SE(p1 – p2) is computed from the binomial distribution
Summary Statistics Categorical Data To compare two proportions, turn the difference into a z but most people opt for a χ² test instead
Summary Statistics
Categorical Data
Gender    Disease    Non-disease
Male      40         60
Female    25         75
Pearson’s χ² statistic can be used to determine if the proportion of those with the disease is the same for both genders. If you used z, you’ll get the same answer. For simple tables with 1 df, z² = χ²
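To see that the two routes agree, the sketch below computes both the pooled two-proportion z and Pearson's chi-square for the gender by disease table on this slide. It uses plain Python with no continuity correction, and the function names are illustrative.

```python
import math
from statistics import NormalDist

# Rows: male, female. Columns: disease, non-disease.
table = [[40, 60], [25, 75]]

def two_proportion_z(table):
    """z for H0: p1 = p2, using the pooled proportion for the SE."""
    (a, b), (c, d) = table
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    pooled = (a + c) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def pearson_chi2(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

z = two_proportion_z(table)
chi2 = pearson_chi2(table)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p from z

print(f"z = {z:.3f}, z squared = {z**2:.3f}, chi-square = {chi2:.3f}")
print(f"two-sided p = {p_value:.4f}")
```

The squared z and the chi-square statistic match to rounding error, which is the 1 df identity stated on the slide.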
Summary Statistics
Categorical Data: Odds ratios (ψ)
If the probability of an event (disease) = p, then the probability of not having the disease = 1 – p
The odds of having the disease are p / (1 – p)
Summary Statistics: Odds ratios
If you have two groups of subjects (males and females), compute the odds for each group:
Gender    Disease    Non-disease    p(disease)    p(non-dis)
Male      40         60             40/100        60/100
Female    25         75             25/100        75/100
Put one over the other and you have an odds ratio.
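Worked through with the counts on this slide, the odds for each group and their ratio can be sketched as follows. The helper name `odds` is illustrative.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

p_disease_male = 40 / 100
p_disease_female = 25 / 100

odds_male = odds(p_disease_male)      # same as 40 / 60
odds_female = odds(p_disease_female)  # same as 25 / 75

# One group's odds over the other's is the odds ratio
odds_ratio = odds_male / odds_female

print(f"odds (male)   = {odds_male:.3f}")
print(f"odds (female) = {odds_female:.3f}")
print(f"odds ratio    = {odds_ratio:.2f}")
```

Which group goes on top is a choice; swapping them gives the reciprocal.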
Summary Statistics: Odds ratios How do you interpret the OR? CI
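One common way to attach a confidence interval to an odds ratio is the log (Woolf) method. The slide does not say which method the talk used, so treat this as an assumption. The counts are the male/female table from the previous slides, and 1.96 is the usual normal multiplier for 95%.

```python
import math

# 2x2 counts: a, b = male disease / non-disease; c, d = female disease / non-disease
a, b, c, d = 40, 60, 25, 75

odds_ratio = (a * d) / (b * c)

# Assumed method: the SE of log(OR) is the square root of the summed reciprocal counts
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

z_crit = 1.96  # multiplier for a 95% interval
lower = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
upper = math.exp(math.log(odds_ratio) + z_crit * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

An OR of 1 means equal odds in the two groups, so an interval that excludes 1 points to a difference at roughly the 5% level.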
Summary Statistics Time to event data: These are continuous data but a special case because of censoring. If you do a study and follow people for 2 years, some will have the event and some won’t. For people who have the event, you have a score (such as days to event). For people who don’t, you have censored data: number of days event-free, but not number of days to event.
Summary Statistics Time to event data Typical summary statistic is median time to event, adjusted for censoring Typical graphical display is a Survival (Failure time) curve
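A minimal Kaplan-Meier sketch shows how censoring enters the survival curve and the median time to event. The follow-up times below are made up purely for illustration (eight hypothetical subjects followed for up to 730 days), and the function names are not from the talk.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is True if the event happened at times[i], False if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Handle everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # censored subjects leave the risk set too
    return curve

def median_time(curve):
    """First time at which survival drops to 0.5 or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical data: days of follow-up and whether the event occurred
days = [100, 200, 300, 400, 500, 730, 730, 730]
had_event = [True, True, False, True, True, False, False, False]

curve = kaplan_meier(days, had_event)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.3f}")
print("median time to event:", median_time(curve))
```

Censored subjects never count as events, but they still shrink the number at risk, which is how the estimate adjusts for censoring.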
Choosing a method of analysis Choosing the right test is key. The choice of the statistical test depends on the type of outcome, not the independent variables or predictors. It also depends on how many responses the subject gives: if you measure them once, you have one outcome. Multiple times, and you have repeated measures and need to use methods that accommodate correlation between scores.
Basic Statistical Tests
Type of data               Measured once        Measured > once
Continuous (normal)        t-tests, ANOVA,      RM ANOVA, GEE,
                           linear regression    mixed models
Continuous (non-normal)    Wilcoxon,            Friedman’s test,
                           Kruskal-Wallis,      SOL
                           NL regression
Basic Statistical Tests
Type of data               Measured once          Measured > once
Categorical                χ², FET,               GEE
                           logistic regression,
                           Poisson regression
Time to event (survival)   Kaplan-Meier,          -----------
                           Cox PH models
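The two tables can be read as a lookup keyed by outcome type and by how often each subject is measured. The sketch below simply encodes the slide entries. The key strings and the function name are hypothetical labels, and the result is a starting point rather than a substitute for checking assumptions.

```python
# Entries copied from the slide tables; key strings are illustrative labels.
BASIC_TESTS = {
    ("continuous, normal", "once"): ["t-test", "ANOVA", "linear regression"],
    ("continuous, normal", "repeated"): ["RM ANOVA", "GEE", "mixed models"],
    ("continuous, non-normal", "once"): ["Wilcoxon", "Kruskal-Wallis", "NL regression"],
    # The slide also lists "SOL" in this cell
    ("continuous, non-normal", "repeated"): ["Friedman's test"],
    ("categorical", "once"): ["chi-square", "FET", "logistic regression",
                              "Poisson regression"],
    ("categorical", "repeated"): ["GEE"],
    ("time to event", "once"): ["Kaplan-Meier", "Cox PH models"],
    ("time to event", "repeated"): [],  # shown as dashes on the slide
}

def suggest_tests(outcome_type, measured):
    """Look up candidate methods by outcome type and measurement frequency."""
    return BASIC_TESTS[(outcome_type, measured)]

print(suggest_tests("continuous, normal", "repeated"))
print(suggest_tests("categorical", "once"))
```

The lookup is keyed on the outcome, not on the predictors, which mirrors the point made on the previous slide.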
Basic Hypothesis Testing
Research questions are usually tested in the null, so setting up a null hypothesis is the first formal step.
1. H0: μ1 = μ2
2. H0: p1 = p2
3. H0: β = 0
4. H0: ρ = 0
Based on the null hypothesis and the data, you pick the right test.
Basic Hypothesis Testing
Compute the test statistic.
Assess the test statistic on appropriate degrees of freedom (this is when you get the magic p-value).
Conclude.
What does that magic p-value mean?
Basic Hypothesis Testing Suppose we’re testing the difference in two proportions, H0: p1 = p2 we compute z, z = 2.01 p = 0.044 What does that mean?
Basic Hypothesis Testing
There are two possible conclusions:
• Assume the null hypothesis is true and that there really is no difference between the proportions. The probability of getting | z | as big or bigger than 2.01 in that case is 0.044. Our sample is just strange (random chance rears its ugly head).
• The null hypothesis is not true. There is a difference between the two proportions.
Which do you choose?
Basic Hypothesis Testing Suppose we’re testing the difference in two proportions, H0: p1 =p2 we compute z, z = 1.01 p = 0.3125 What does that mean?
Basic Hypothesis Testing
There are two possible conclusions:
• Assume the null hypothesis is true and that there really is no difference between the proportions. The probability of getting | z | as big or bigger than 1.01 in that case is 0.3125.
• The null hypothesis is not true. There is a difference between the two proportions. Our sample is strange and we didn’t find it (random chance rears its ugly head).
Which do you choose?
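The two p-values quoted in these examples are two-sided tail areas of the standard normal distribution. A minimal sketch that reproduces them from the z values (the function name is illustrative):

```python
from statistics import NormalDist

def two_sided_p(z):
    """P(|Z| >= |z|) when the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

for z in (2.01, 1.01):
    print(f"z = {z}: p = {two_sided_p(z):.4f}")
```

The p-value only says how surprising the data would be if the null were true; choosing between the two conclusions is still a judgment call against a preset cutoff.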
Simple comparisons
Variable          TRT           Control       p
Age (m ± sd)      27 ± 5        24 ± 6        ns
BMI (m ± sd)      30.8 ± 7.1    30.7 ± 5.4    ns
% male            53            47            ns
Ethnicity (%)                                 0.05
  white           25            34
  AA              60            48
  other           15            18
...
Simple comparisons
                Drink in Past 30 days
                Yes n (%)       No n (%)        p value
Total n (%)     2278 (48.1)     2460 (51.9)
Gender                                          0.7653
  Male          1116 (48.3)     1193 (51.7)
  Female        1162 (47.8)     1267 (52.2)
Ethnicity                                       <0.0001
  AA            586 (37.8)      966 (62.2)
  White         1544 (53.3)     1350 (46.7)
  Other         148 (50.9)      144 (49.1)
Age                                             0.18
  Mean          15.4 ± 0.8      15.3 ± 0.8
Simple comparisons These are very simple comparisons, usually assessed by a t-test or one-way ANOVA for continuous outcomes and χ² for categorical outcomes. Excel will do some of these, if you ask nicely. Life isn’t usually that simple.
Realistic comparisons
Fairly simple orthopedic study: compared good vs. bad outcomes (based on IKDS scores) following graft
Sex: male vs female
Age (mean ± SD)
Site: medial vs. lateral
Graft source: MTF vs Cryolite
Fixation Technique: Suture only vs All others
Antalgic Gait: Yes / No
Effusion: Yes / No
Jointline TTP: Yes / No
Articular Grade: 1 or 2 / 3 or 4
Additional Surgery: Yes / No
Workman’s Comp.: Yes / No
SF-12 scores*: PCS (mean ± SD), MCS (mean ± SD)
Realistic comparisons This is a more typical study: we’re interested in the simultaneous impact of a number of predictors on outcome, not simply each individual one. You can’t analyze these with a t-test or a χ². You can’t analyze these with introductory level biostats. Most of the applicable techniques are some form of the general linear model (GLM), and these are generally regression-based methods. More on these next time.
What can you do? Think Planning and articulating the study are key to appropriate analysis What is the point of the study? What are the research questions? Think in terms of hypotheses that can be tested Define the outcomes operationally Write it down: if you can think it through enough for coherent writing, you’re on the right track.
Think Define the outcomes Can they be transformed if necessary? SF 36 vs. serum creatinine Can categories be combined? RESIST Define the test groups Can categories be combined? Define the covariates
Think Think about the design/data collection establish protocols and forms Missing data why are they missing? how much missing? Data entry and QC
Data entry and QC Most people who enter their own data use Excel. That’s good and that’s bad. It’s good because Excel is an easy format. It’s bad because Excel lets you get carried away.
Good Data Files Do not include identifying information in the data file. ID info should be kept in a separate file, with links to ID number. Keep variable names simple and descriptive Avoid special characters in variable names Don’t start variable names with numbers or special characters
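The naming advice can be turned into a quick automated check. The sketch below assumes "simple" means a letter first, then only letters, digits and underscores. That rule and the example column names are illustrative assumptions, not something the slide specifies.

```python
import re

# Assumed rule: start with a letter, then letters, digits or underscores only
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_clean_name(name):
    """True if the variable name follows the assumed naming rule."""
    return VALID_NAME.fullmatch(name) is not None

# Hypothetical column headers from a spreadsheet
columns = ["subject_id", "age_years", "1st_visit", "BMI (kg/m2)", "%male"]

for name in columns:
    status = "ok" if is_clean_name(name) else "rename"
    print(f"{name!r}: {status}")
```

Running a check like this before importing into a statistics package catches names the package would otherwise reject or silently alter.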
Good Data Files--Text Don’t include long comments. Whenever possible, use numeric codes rather than text. Don’t mix the two. Check your spelling and capitalization. When imported, M F m f reflect four genders. Black and Balck are two different ethnicities.
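A short sketch of why text entries cause trouble and how a quick tabulation exposes them. The values and the numeric coding scheme (1 = male, 2 = female) are hypothetical.

```python
from collections import Counter

# Hypothetical raw entries as typed into a spreadsheet
gender_raw = ["M", "F", "m", "f"]
ethnicity_raw = ["Black", "Balck", "Black", "White"]

# Software treats each distinct string as its own category
print("distinct gender values:", sorted(set(gender_raw)))

# Normalizing case, then mapping to numeric codes, removes the ambiguity
GENDER_CODES = {"M": 1, "F": 2}  # illustrative coding scheme
gender_coded = [GENDER_CODES[value.strip().upper()] for value in gender_raw]
print("coded gender values:", gender_coded)

# A frequency table makes rare misspellings stand out
ethnicity_counts = Counter(ethnicity_raw)
print("ethnicity counts:", dict(ethnicity_counts))
```

Tabulating every text column once after data entry is a cheap way to find stray categories before analysis.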
Good Data Files--simplify Don’t put summaries, graphs, etc. in the actual data file, or at least not on the same worksheet. Document all coding schemes. Think about separate files for different types of info.
Good Data Files
Perfect data entry
Double entry is best
Random sample double entry is better than nothing
Services available for entry
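Double entry only helps if the two copies are actually compared. A minimal sketch of that comparison, assuming each copy is stored as a mapping from record ID to field values. The layout, field names and values are hypothetical.

```python
def find_discrepancies(first_entry, second_entry):
    """Compare two independently keyed copies of the same records.

    Each argument maps record ID -> dict of field values.
    Returns a list of (record ID, field, first value, second value).
    """
    problems = []
    for record_id in sorted(set(first_entry) | set(second_entry)):
        a = first_entry.get(record_id)
        b = second_entry.get(record_id)
        if a is None or b is None:
            # Record was keyed in one copy but not the other
            problems.append((record_id, "<missing record>", a, b))
            continue
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                problems.append((record_id, field, a.get(field), b.get(field)))
    return problems

# Hypothetical example: the second typist transposed the digits of one age
first = {1: {"age": 27, "sex": 1}, 2: {"age": 42, "sex": 2}}
second = {1: {"age": 27, "sex": 1}, 2: {"age": 24, "sex": 2}}

for record_id, field, v1, v2 in find_discrepancies(first, second):
    print(f"record {record_id}, {field}: {v1} vs {v2}")
```

Each flagged value is then checked against the paper form. For random sample double entry, the same comparison runs on just the re-keyed subset.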
Detouring just a little Next time: regression power/sample size