Statistics and More

Statistics and More Shahrzad Bazargan-Hejazi 1-2-2015

Descriptive Statistics • For most papers in the health sciences, the goal of analysis should be to use the simplest statistics possible to make the results of the study clear. • Most research studies do not require the use of complex statistics like regression (and using advanced statistical tests incorrectly is never helpful).

FIGURE 1. Analytic Plan

Types of Variables A variable is a characteristic that can be assigned more than one value. The value of a variable for an individual does not have to vary (change) over time, but the response among individuals within a population should be something that might differ.

Types of Variables There are several ways to classify variables: • Ratio variables • Interval variables • Continuous variables • Discrete variables • Ordinal variables (ranked variables) • Nominal variables (categorical variables) • Binomial variables

FIGURE 2. Types of Variables

Measures of Central Tendency There are several ways to report the average response to a variable in a population: For ratio and interval variables, the central tendency can be described using means, medians, and modes. For ordinal variables, a median or mode can be reported. A mode can be reported for categorical variables.
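As a minimal sketch of these three measures, the following uses Python's standard `statistics` module. The ages and blood types are hypothetical values chosen for illustration and do not come from the presentation.

```python
import statistics

# Hypothetical ages in years (illustrative data, not from the source).
ages = [21, 22, 22, 23, 25, 27, 30, 34, 48]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted responses
mode_age = statistics.mode(ages)      # most frequently reported value

# A categorical (nominal) variable has no mean or median, only a mode.
blood_types = ["O", "A", "O", "B", "A", "O"]
mode_blood_type = statistics.mode(blood_types)

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
print(f"most common blood type: {mode_blood_type}")
```

In this made-up sample the single high value (48) pulls the mean above the median, which is one reason the median is often preferred for skewed data.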

FIGURE 3. Example of a Mean, Median, and Mode

Measures of Spread Measures of spread, also called “dispersion,” are used to describe the variability and range of responses. • Range • Median • Quartiles • Interquartile range (IQR)
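A short sketch of how these quantities might be computed with Python's standard `statistics` module. The scores are hypothetical, and the `"inclusive"` quartile method is an assumption: several quartile conventions exist, and different software packages can return slightly different values.

```python
import statistics

# Hypothetical test scores (illustrative data, not from the source).
scores = [2, 4, 4, 5, 7, 9, 10, 12]

value_range = max(scores) - min(scores)

# Quartiles split the sorted responses into four equal parts.
# method="inclusive" is one of several conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range: spread of the middle 50% of responses

print(f"range={value_range}, Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```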

Measures of Spread A normal distribution of responses has a bell-shaped curve with one peak in the middle. Not all numeric variables have a normal distribution. The distribution may instead be left-skewed, right-skewed, bimodal, or uniform.

FIGURE 4. Sample Histogram

Standard Deviation For variables with a relatively normal distribution, the standard deviation describes the narrowness or wideness of the range of responses. • About 68% of responses fall within one standard deviation above or below the mean. • About 95% of responses are within two standard deviations above or below the mean. • More than 99% of responses are within three standard deviations above or below the mean.
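These percentages can be checked against an idealized normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

# Idealized standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of responses within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

Real data are only approximately normal, so observed proportions will usually differ somewhat from these theoretical values.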

Z-scores A z-score indicates how many standard deviations away from the sample mean an individual’s response is. An individual whose age is exactly the mean age in the population will have a z-score of 0. A person whose age is one standard deviation above the mean in the population will have a z-score of 1. A person whose age is two standard deviations below the population mean will have a z-score of –2.
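A minimal sketch of the calculation, z = (value - mean) / standard deviation. The ages are hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption made for this example.

```python
import statistics

# Hypothetical ages in years (illustrative data, not from the source).
ages = [30, 35, 40, 45, 50]

mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation (assumption)

# z-score: distance from the mean, measured in standard deviations.
z_scores = [(age - mean_age) / sd_age for age in ages]

for age, z in zip(ages, z_scores):
    print(f"age {age}: z = {z:+.2f}")
```

The person whose age equals the mean (40) has a z-score of 0, and ages below the mean have negative z-scores.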

FIGURE 5. Example of the Distribution of Responses for a Normally Distributed Numeric Variable

Categorical Responses A histogram or boxplot cannot be used to display the responses to categorical variables. The distribution of responses must instead be displayed in a bar chart (or, less often, a pie chart).

FIGURE 6. Sample Bar Chart

FIGURE 7. Common Descriptive Statistics by Variable Type

Statistical Consultation If answering the study question adequately requires the use of elaborate analytic techniques, invite a statistical expert to serve as a collaborator and as a coauthor on the resulting paper.

Comparative Statistics • Comparative statistics compare groups of participants by sex or age, by exposure or disease status, or by other characteristics.

FIGURE 8. Analytic Plan for Comparing Groups

Hypotheses for Statistical Tests Comparative statistical tests usually are designed to test for difference rather than for sameness. Statistical test questions are usually phrased in terms of differences: Are the means different? Are the proportions different? Are the distributions different?

Hypotheses for Statistical Tests The null hypothesis (H0) describes the expected result of a statistical test if there is no difference between the two values being compared. The alternative hypothesis (Ha) describes the expected result if there is a difference.

FIGURE 9. Examples of Hypotheses for Statistical Tests

Interpreting P-values A p-value, or probability value, is compared with a preset significance level (α) to decide whether the null hypothesis (H0) will be rejected. The standard is to use a significance level of α = 0.05, or 5%. Any statistical test with a result that falls within the 5% of most extreme results expected by chance when H0 is true (p < 0.05) will result in the rejection of the null hypothesis.
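To make the decision rule concrete, the sketch below assumes a test whose statistic follows a standard normal distribution when H0 is true (a z-test). The function name and the example statistic are hypothetical, and other tests use other reference distributions.

```python
from statistics import NormalDist

ALPHA = 0.05  # conventional significance level

def two_sided_p_value(z_statistic):
    """Probability of a result at least this extreme if H0 is true (z-test)."""
    tail = 1 - NormalDist().cdf(abs(z_statistic))
    return 2 * tail  # both tails count as "extreme"

z = 2.5  # hypothetical test statistic
p = two_sided_p_value(z)
decision = "reject H0" if p < ALPHA else "fail to reject H0"
print(f"z = {z}, p = {p:.4f}: {decision}")
```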

FIGURE 10. Interpreting p-Values

Interpreting Confidence Intervals Confidence intervals (CIs) provide information about the expected value of a measure in a source population based on the value of that measure in a study population. The width of the interval is related to the sample size of the study. A larger sample size will yield a narrower confidence interval.
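The link between sample size and interval width can be sketched for a mean, assuming the large-sample normal approximation. The function name and the numbers (mean 120, standard deviation 15) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Return (lower, upper) for a mean using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    margin = z * sample_sd / sqrt(n)           # margin shrinks as n grows
    return sample_mean - margin, sample_mean + margin

for n in (25, 100, 400):
    lower, upper = mean_confidence_interval(120.0, 15.0, n)
    print(f"n={n:>3}: 95% CI = ({lower:.1f}, {upper:.1f}), width = {upper - lower:.1f}")
```

Because the margin depends on the square root of n, quadrupling the sample size halves the width of the interval.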

FIGURE 11. Interpreting Confidence Intervals (CIs)

Measures of Association Some of the most common types of comparative analysis are the odds ratio (OR) used for case-control studies and the rate ratio (RR) used for cohort studies. The reference group for an OR or RR should be well-defined. The 95% confidence interval provides information about the statistical significance of the tests: an interval that excludes 1 indicates a statistically significant association at α = 0.05.
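As an illustration, an odds ratio and an approximate 95% confidence interval can be computed from a 2×2 table. The counts below are hypothetical, and the log-based (Woolf) interval is one common approximation chosen for this sketch, not necessarily the method used in the presentation.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR), the basis of the Woolf (log) interval.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical case-control counts; the unexposed are the reference group.
or_value, ci_lower, ci_upper = odds_ratio_ci(a=40, b=60, c=20, d=80)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

In this made-up table the whole interval lies above 1, so the association would be called statistically significant at the 5% level.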

FIGURE 12. Example of Odds Ratios for a Case-Control Study of Acute Myocardial Infarction

Selecting an Appropriate Test Statistical analysts must select a test that is appropriate to the goal of the analysis and the types of variables being analyzed.

Selecting an Appropriate Test • Parametric tests assume that the variables being examined have particular (usually normal) distributions and that the variances of those variables are similar in the population groups being compared. • Nonparametric tests do not make assumptions about the distributions of responses.

Selecting an Appropriate Test • Parametric tests are typically used for ratio and interval variables with relatively normal distributions of responses. • Nonparametric tests are used for ranked variables, categorical variables, and when the distribution of a ratio or interval variable is non-normal.

Two-Sample Tests Independent populations: populations in which each individual can be a member of only one of the population groups being compared. A variety of statistical tests can be used to compare independent populations. The appropriate test to use depends on the type of variable being examined.

FIGURE 14. Tests for Comparing Two or More Groups

FIGURE 15. Examples of Tests for Comparing Males and Females in a Study Population

FIGURE 16. Simplified Version of Figure 27-12

Paired Tests A different set of tests is used when the goal is to compare before-and-after results in the same individuals.
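One such test for a numeric outcome is the paired t-test, which analyzes each person's before-and-after difference. The sketch below computes only the test statistic, using hypothetical resting heart rates. The Python standard library has no t distribution, so the p-value would come from a statistics package or a t table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical resting heart rates for the same five people (illustrative).
before = [80, 75, 90, 85, 70]
after = [76, 74, 85, 83, 69]

# Paired tests work on within-person differences, not on two separate groups.
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = mean(differences)
standard_error = stdev(differences) / sqrt(n)

t_statistic = mean_diff / standard_error
degrees_of_freedom = n - 1

print(f"mean change = {mean_diff:.1f}, t = {t_statistic:.2f}, df = {degrees_of_freedom}")
```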

FIGURE 17. Tests for Comparing Matched Populations

FIGURE 18. Examples of Tests for Comparing Pretest and Post-Test Results for Participants in a 3-Month Exercise Program

A Brief Guide to Advanced Health Statistics User-friendly statistical software programs have made it possible for nearly everyone to run advanced statistical analyses, but these programs still require the user to select appropriate tests and correctly decipher what the output means.

Confounding • Multivariate statistical models can be used to examine the interactions that may occur among variables. • This can be especially helpful when a third variable may be concealing or distorting the true relationship between two other variables. • Several different types of third variable effects might occur, including confounding and effect modification.

Confounding • To be a confounder or effect modifier, the third variable must be independently associated with both an exposure (or predictor) variable and an outcome variable. • A crude odds ratio (or other measure of association) for the relationship between the exposure and the outcome should be calculated, along with a separate measure of association for each level of the third variable, such as separate odds ratios for males and females.

Confounding • If the crude and stratum-specific ORs are all similar, then report a crude OR. • If the stratum-specific ORs are equivalent to one another but different from the crude OR, the third variable is a confounder. Report an adjusted OR. • If the stratum-specific ORs are different from one another and different from the crude OR, the third variable is an effect modifier. Report stratum-specific ORs.
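The comparison of crude and stratum-specific ORs can be sketched with hypothetical counts built so that sex acts as a confounder. The Mantel-Haenszel estimator used for the adjusted OR is one common choice and an assumption here, since the presentation does not name an adjustment method.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed/unexposed cases; c, d = controls."""
    return (a * d) / (b * c)

# Hypothetical 2x2 tables (a, b, c, d) for each level of the third variable.
strata = {
    "males": (40, 10, 20, 10),
    "females": (10, 40, 10, 80),
}

# The crude table ignores the third variable by pooling the strata.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude_or = odds_ratio(*crude_table)
stratum_ors = {name: odds_ratio(*table) for name, table in strata.items()}

def mantel_haenszel_or(tables):
    """Adjusted OR pooled across strata (Mantel-Haenszel estimator)."""
    tables = list(tables)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

adjusted_or = mantel_haenszel_or(strata.values())

print(f"crude OR = {crude_or:.1f}")
print(f"stratum-specific ORs = {stratum_ors}")
print(f"adjusted (Mantel-Haenszel) OR = {adjusted_or:.1f}")
```

In this constructed example the stratum-specific ORs agree with each other but differ from the crude OR, which matches the confounding pattern, so the adjusted OR would be the one to report.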

FIGURE 19. Confounding and Effect Modification

Regression • Regression is often the easiest way to adjust for confounding variables or interaction terms. • Regression models seek to understand the relationship between one or more predictor (independent) variables and one outcome (dependent) variable. • The models allow the effect of one predictor variable on the outcome to be examined while controlling for other predictor variables (keeping their values constant).

FIGURE 21. Steps in Fitting a Regression Model

Linear Regression • A linear regression model is used when the outcome variable is a ratio or interval variable. • Simple linear regression models examine whether there is a linear relationship between one predictor variable and the outcome variable. • The regression model finds the best-fit line for the data points, and the equation for that line can be used to predict the expected value of the outcome variable for various values of the predictor variable.

Linear Regression • The r² for the model, which is the square of the correlation coefficient, provides information about how well the regression model predicts the variation in the values of the outcome variable. • The value of r² ranges from 0 to 1, with larger values indicating a better model fit.
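A minimal sketch of a simple linear regression fitted by ordinary least squares, including r² as the squared correlation coefficient. The data points and the `predict` helper are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean

# Hypothetical predictor (x) and outcome (y) values, illustrative only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Sums of squares and cross-products around the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

slope = s_xy / s_xx                    # change in y per one-unit change in x
intercept = y_bar - slope * x_bar      # expected y when x is 0
r_squared = s_xy ** 2 / (s_xx * s_yy)  # squared correlation coefficient

def predict(x_new):
    """Expected outcome on the best-fit line for a given predictor value."""
    return intercept + slope * x_new

print(f"y = {intercept:.2f} + {slope:.2f}x, r^2 = {r_squared:.3f}")
print(f"predicted y at x = 6: {predict(6):.2f}")
```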

FIGURE 22. Example of a Simple Linear Regression Model

Linear Regression • Multiple linear regression models examine the effects of several predictor variables on the value of the outcome variable. • The resulting equation can be used to examine the effect of each predictor variable on the outcome variable while controlling for the other predictors by holding their values constant. • Multiple linear regression models can have both continuous and categorical predictor variables.
