Understanding Matching in Observational Studies - EPI 200B Winter 2010

Matching Beate Ritz, MD, Ph.D. EPI 200B Winter 2010 NOTE: many of the following slides are based on the lecture notes provided by Dr. Hal Morgenstern (Epi Methods I and II); some of the examples are taken from ME3 Chapter 11

Matching: partial restriction in the selection of study subjects Matching is usually seen as one part of a strategy to control for confounders in observational studies. It is a partial restriction in the selection of subjects: in most observational studies, matching restricts the eligibility of comparison subjects by choosing them to be similar to index subjects with respect to one or more matching variables.

Matching Example: Target population (ME3 p.172) Stratum-specific RR = 10 in both sex strata. Both exposure and male sex are risk factors for the disease: • Within sex, the exposed have 10 times the risk of the unexposed ((0.005/0.0005)=(0.001/0.0001)=10) • Within exposure level, males have 5 times the risk of females ((0.005/0.001)=(0.0005/0.0001)=5) There is also substantial confounding, since 90% of the exposed individuals are male and only 10% of the unexposed are male. The crude and sex-specific risk ratios are therefore different (33 versus 10).
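The crude and sex-specific risk ratios quoted on this slide can be checked from the stated risks and sex distributions alone. The following is a minimal sketch; the variable and function names are illustrative and do not come from ME3.

```python
# Risks stated on the slide, keyed by (sex, exposure group).
risk = {
    ("male", "exposed"): 0.005,
    ("male", "unexposed"): 0.0005,
    ("female", "exposed"): 0.001,
    ("female", "unexposed"): 0.0001,
}
# Proportion male in each exposure group (90% of exposed, 10% of unexposed).
p_male = {"exposed": 0.9, "unexposed": 0.1}


def crude_risk(exposure):
    """Risk in one exposure group, averaged over its own sex distribution."""
    pm = p_male[exposure]
    return pm * risk[("male", exposure)] + (1 - pm) * risk[("female", exposure)]


# Sex-specific risk ratios (exposed vs. unexposed within each sex).
stratum_rr = {
    sex: risk[(sex, "exposed")] / risk[(sex, "unexposed")]
    for sex in ("male", "female")
}
# Crude risk ratio, ignoring sex.
crude_rr = crude_risk("exposed") / crude_risk("unexposed")

print("sex-specific RR:", stratum_rr)
print("crude RR:", round(crude_rr, 1))
```

The crude ratio comes out near 33 only because the exposed group is mostly male; within each sex the ratio stays at 10.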

Matched cohort (Example ME3 p.172) Suppose a cohort study draws its exposed cohort from the target population and matches the unexposed to the exposed by sex: • First, we select 10% of the exposed, independent of sex • Then, we draw a comparison group of unexposed subjects from the target population, matching by sex • This prevents an association between sex and exposure in the study cohort; i.e., the expected RR (=10) is the same within sex strata and overall
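The two selection steps above can be traced with expected counts. This sketch assumes a target population of 1,000,000 exposed people (900,000 male, 100,000 female); that size is an assumption chosen to agree with the stated 90% male share, and the names are illustrative.

```python
# Assumed number of exposed people by sex in the target population (hypothetical size).
exposed_target = {"male": 900_000, "female": 100_000}
risk_exposed = {"male": 0.005, "female": 0.001}
risk_unexposed = {"male": 0.0005, "female": 0.0001}

# Step 1: select 10% of the exposed, independent of sex.
cohort_exposed = {sex: n // 10 for sex, n in exposed_target.items()}
# Step 2: select unexposed subjects matched 1:1 to the exposed by sex,
# so the unexposed cohort has the same sex distribution.
cohort_unexposed = dict(cohort_exposed)

# Expected numbers of cases in each cohort.
cases_exposed = sum(n * risk_exposed[sex] for sex, n in cohort_exposed.items())
cases_unexposed = sum(n * risk_unexposed[sex] for sex, n in cohort_unexposed.items())

# Crude risk ratio in the matched cohort, ignoring sex.
crude_rr = (cases_exposed / sum(cohort_exposed.values())) / (
    cases_unexposed / sum(cohort_unexposed.values())
)
print("expected crude RR in matched cohort:", round(crude_rr, 2))
```

Because sex is no longer associated with exposure in the study cohort, the crude ratio equals the sex-specific ratio of 10.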

Matched case-control study: crude results (Example ME3 p.173) In a case-control study one would • First, identify (all) cases (N=4740) from the source population • Then, sample 4740 controls from the source population, 1:1 matched to cases by sex NOTE: The crude odds ratio is much less than the true risk ratio (5 rather than 10)!

Matched case-control study: stratified results (Example ME3 p.173) Stratifying by sex in the matched case-control study gives the correct result, OR=10. Thus, unlike the cohort matching, the case-control matching has not eliminated confounding by sex in the crude point estimate of the risk ratio. The discrepancy between the crude results and the stratum-specific results arises from a bias that is introduced by selecting controls according to a factor related to exposure (sex, the matching factor). The bias behaves like confounding: the crude estimate is biased and stratification removes it. However, note that this bias is not a reflection of the confounding by sex in the source population; it even differs in direction from that confounding (the crude risk ratio of 33 in the source population lies above 10, whereas the crude odds ratio of 5 in the matched study lies below 10). Case-control matching has substituted a selection bias for the initial confounding in the source population.
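The crude and sex-stratified odds ratios of the matched case-control study can be reproduced with expected counts. This sketch assumes a source population of 1,000,000 exposed and 1,000,000 unexposed people; those sizes are an assumption, chosen because they yield the 4740 expected cases quoted above. Exposure is coded 1 (exposed) or 0 (unexposed), and all names are illustrative.

```python
# Assumed source population counts by (sex, exposure); hypothetical sizes.
pop = {
    ("male", 1): 900_000, ("male", 0): 100_000,
    ("female", 1): 100_000, ("female", 0): 900_000,
}
# Risks stated in the target-population example.
risk = {
    ("male", 1): 0.005, ("male", 0): 0.0005,
    ("female", 1): 0.001, ("female", 0): 0.0001,
}

# Expected cases in each sex-by-exposure cell.
cases = {cell: pop[cell] * risk[cell] for cell in pop}

# Expected controls: as many per sex as there are cases of that sex (1:1 matching),
# with exposure distributed as in the source population of that sex.
controls = {}
for sex in ("male", "female"):
    n_controls = cases[(sex, 1)] + cases[(sex, 0)]
    p_exposed = pop[(sex, 1)] / (pop[(sex, 1)] + pop[(sex, 0)])
    controls[(sex, 1)] = n_controls * p_exposed
    controls[(sex, 0)] = n_controls * (1 - p_exposed)


def odds_ratio(a, b, c, d):
    """a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)


stratum_or = {
    sex: odds_ratio(cases[(sex, 1)], cases[(sex, 0)],
                    controls[(sex, 1)], controls[(sex, 0)])
    for sex in ("male", "female")
}
crude_or = odds_ratio(
    cases[("male", 1)] + cases[("female", 1)],
    cases[("male", 0)] + cases[("female", 0)],
    controls[("male", 1)] + controls[("female", 1)],
    controls[("male", 0)] + controls[("female", 0)],
)
print("sex-specific OR:", {s: round(v, 2) for s, v in stratum_or.items()})
print("crude OR:", round(crude_or, 2))
```

Matching pulls the exposure distribution of the controls toward that of the cases, which is why the crude odds ratio falls to about 5 while each sex stratum still gives 10.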

Why case-control matching introduces bias (Example ME3 p.176) Controls are supposed to provide an estimate of the distribution of the exposure in the source population. Matching on a factor associated with exposure makes the control series more similar to the cases with respect to exposure; this biases the crude estimate towards the null, no matter what the direction of the association between the matching factor and exposure.

Matching: logistically convenient and cost-efficient. Matching variables are usually known risk factors for the disease (and thus potential confounders) or "logistical" factors related to the selection of comparison subjects, e.g. • time of disease diagnosis, • place of residence, • the facility where a diagnosis is made or medical care is sought Although the latter type of matching variable may be a proxy confounder, matching on such variables is usually done to facilitate the process of obtaining eligible comparison subjects. Thus, matching on these non-confounders is logistically convenient and cost-efficient. In addition, matching on the time of the case's diagnosis is the most common method of density sampling controls in a case-control study.

Matching: done differently in case-control and cohort studies • In case-control studies, controls (the comparison group) are matched to cases (the index group). • In cohort and cross-sectional studies, unexposed subjects (the comparison group) are usually matched to exposed subjects (the index group). • In randomized trials, matching is also known as blocked randomization: matching is done before randomization, then randomization is done within each matched set (block).

Matching: reasons Matching is more common in case-control studies and randomized trials than in cohort or cross-sectional studies. The reasons are both statistical and practical: matching can often enhance the cost efficiency of conducting a case-control study, although it is not guaranteed to do so; it also increases the efficiency of typical randomized trials. The most common strategies for matching in an observational study are individual matching and frequency matching.

Types of Matching: Individual and frequency matching Individual matching: One or more comparison subjects are selected separately for each index subject, selecting only those comparison subjects that are similar to the corresponding index subject on one or more matching variables. Frequency matching: The number selected in each matching category of the comparison group is made proportional to the number in that category of the index group.

Types of Matching: Fixed and variable ratio matching Matching may also be classified by the ratio of comparison to index subjects in each matched set: • Fixed-ratio: The ratio of comparison to index subjects is the same for all matched sets–e.g., pairwise or 1:1 matching, 1:2 matching, etc. • Variable-ratio: The ratio of comparison to index subjects varies among matched sets (by design or, more commonly, because of nonparticipation of certain selected subjects).

Individual Matching Methods There are two common methods for individual matching: • Category matching: Each comparison subject is selected from the matching category (stratum) to which the index subject belongs (e.g., white males, ages 30-34). • Caliper matching: Each set of comparison subjects is selected to have values on the matching variable that are “close” to the corresponding value of the index subject. There are two common methods of caliper matching: • Fixed caliper: The tolerance for eligibility is the same for all matched sets (e.g., age of index subject ± 2 years). • Variable caliper: The tolerance for eligibility varies among matched sets (e.g., nearest-neighbor matching–i.e., selecting an available comparison subject whose value on the matching variable is closest to the index subject). This is done to avoid not getting a match for certain index subjects, or to get the best possible match. It is possible to use both category and caliper matching on different variables in the same study.
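The difference between fixed-caliper and nearest-neighbor matching can be sketched in a few lines. The function below is a hypothetical illustration, not a procedure from the lecture: it matches greedily, 1:1 and without replacement, in the order the index subjects are listed, and the ages are made-up values.

```python
def greedy_match(index_values, pool_values, caliper=None):
    """Greedy 1:1 matching without replacement on one numeric variable.

    caliper=None  -> nearest-neighbor matching (no fixed tolerance).
    caliper=2     -> fixed caliper: only pool members within +/- 2 are eligible.
    Returns {index position: matched pool position, or None if no match}.
    """
    available = set(range(len(pool_values)))
    matches = {}
    for i, value in enumerate(index_values):
        eligible = [
            j for j in available
            if caliper is None or abs(pool_values[j] - value) <= caliper
        ]
        if not eligible:
            matches[i] = None  # no comparison subject within the tolerance
            continue
        # Closest eligible subject; ties broken by pool position.
        best = min(eligible, key=lambda j: (abs(pool_values[j] - value), j))
        available.remove(best)
        matches[i] = best
    return matches


index_ages = [30, 52, 70]      # hypothetical index subjects
pool_ages = [31, 45, 58, 29]   # hypothetical potential comparison subjects

fixed = greedy_match(index_ages, pool_ages, caliper=2)
nearest = greedy_match(index_ages, pool_ages)
print("fixed caliper (+/- 2 years):", fixed)
print("nearest neighbor:", nearest)
```

With the fixed caliper, two index subjects go unmatched; nearest-neighbor matching finds a partner for everyone, but at the price of age differences of 6 and 25 years.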

Individual Matching • Direct matching: Matched sets are created from direct measurement of the matching variables–e.g., matching on age and sex. • Natural matching: Matched sets are created from family or social-network relationships–e.g., comparison subjects are twins, siblings, friends, or neighbors of the index subjects. With this method, it is often unclear on what factors the members of matched sets are made similar.

Individual Matching A special form of natural matching is matching subjects to themselves–i.e., index subjects serve as their own comparison subjects in one of these ways: • making within-person comparisons of the outcome over time–e.g., before vs. after first exposure in a cohort study involving recurrent outcomes (e.g., asthma attacks) within individuals • comparing similar anatomical parts in the same subjects–e.g., right eye vs. left eye, where one eye is diseased and the other is not (case-control sampling), or one eye is treated and the other is not (randomly chosen) • comparing exposure frequency among cases (acute events) shortly before they became cases with their exposure frequency prior to or after the disease period–i.e., a case-crossover design, in which cases serve as their own controls.

Individual Matching In case-control studies with caliper or natural matching there may be overlap among categories of the matching variables, so that certain potential controls are represented in more than one category. This overlap can result in selection bias if exposure status is associated with inclusion in multiple categories. E.g., a problem with friend matching (selecting controls from the friends named by cases) is that certain individuals belong to multiple friendship networks and, therefore, have a higher probability of being selected as controls. Bias arises if people with many friends also have a higher exposure probability (e.g., alcohol consumption or social-support level). The bias from overlapping calipers tends to be small in real examples, however, so caliper matching remains common.

Frequency Matching • Frequency matching of comparison subjects cannot be completed until all index subjects (cases) are identified (or you know the expected distribution of the matching variables for the index subjects). • The end result of frequency matching is essentially equivalent to category matching. • The main difference is that frequency matching, unlike category matching, may involve any selection ratio of comparison to index subjects – even a ratio less than one (i.e., fewer total comparison subjects than index subjects).
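The quota logic of frequency matching can be sketched as follows. The helper name, the category labels and the counts are all hypothetical; the only point is that the number of comparison subjects per category is proportional to the number of index subjects, for any overall ratio.

```python
from collections import Counter


def frequency_match_quota(index_categories, ratio=1.0):
    """Number of comparison subjects to select per matching category.

    index_categories: the matching category of every index subject.
    ratio: overall ratio of comparison to index subjects (may be below 1).
    """
    counts = Counter(index_categories)
    return {category: round(n * ratio) for category, n in counts.items()}


# Hypothetical index group: 6 cases in one category, 4 in another.
case_categories = ["M 30-34"] * 6 + ["F 30-34"] * 4

quota_two_per_case = frequency_match_quota(case_categories, ratio=2)
quota_half_per_case = frequency_match_quota(case_categories, ratio=0.5)
print(quota_two_per_case)
print(quota_half_per_case)
```

The quotas can only be fixed once the category counts of the index group are known (or can be anticipated), which is why frequency matching cannot be completed before the index subjects are identified.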

Countermatching A method called ‘countermatching’ may have certain statistical advantages in population-based case-control studies in which we have a surrogate E* for exposure status for the entire source population. E.g., in occupational studies, we might know the location of employment for all workers in the source population (E*=1 if in a location with exposure, E*=0 otherwise), but we do not have the resources to determine the amount or timing of exposure for each worker in the source population. With countermatching, for each case with E*=1 we select a control with E*=0, and for each case with E*=0 we select a control with E*=1. In other words, controls get matched to cases with the opposite value of E*. The researcher then collects more detailed information on exposure status (and possibly confounders) for all subjects in the case-control study. Countermatching may at first seem counterintuitive. How can we benefit by making comparison subjects different from index subjects? To understand how countermatching can improve statistical efficiency we must consider more closely how matching can improve statistical efficiency.
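The selection rule of countermatching can be sketched as below. Everything here is hypothetical (the function name, the identifiers and the tiny data set), and the sketch covers only the selection of controls, not the analysis of countermatched data.

```python
import random


def countermatch(cases, noncases, seed=0):
    """Pair each case with one control that has the opposite surrogate value E*.

    cases, noncases: lists of (person_id, e_star) with e_star in {0, 1}.
    Controls are sampled at random without replacement.
    Returns a list of (case_id, control_id) pairs.
    """
    rng = random.Random(seed)
    pool = {
        0: [pid for pid, e_star in noncases if e_star == 0],
        1: [pid for pid, e_star in noncases if e_star == 1],
    }
    pairs = []
    for case_id, e_star in cases:
        candidates = pool[1 - e_star]  # opposite value of E*
        if not candidates:
            raise ValueError(f"no control with E*={1 - e_star} left for {case_id}")
        control_id = rng.choice(candidates)
        candidates.remove(control_id)
        pairs.append((case_id, control_id))
    return pairs


# Hypothetical source population with the surrogate E* known for everyone.
cases = [("c1", 1), ("c2", 0), ("c3", 1)]
noncases = [("n1", 0), ("n2", 0), ("n3", 1), ("n4", 1), ("n5", 0)]

pairs = countermatch(cases, noncases)
print(pairs)
```

Every matched set produced this way contains one person with E*=1 and one with E*=0, so each set is discordant on the surrogate by construction.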

Why Match? The methodologic issues involved in matching are more complex than they first appear, which has led to widespread misunderstanding of matching. To appreciate the real statistical advantage of matching, we must consider how matched data are analyzed. As we see throughout this section, the analysis of matched data should usually take the matching variables into consideration through some form of stratification. Thus, matching of index and comparison subjects may not be sufficient to control for confounding due to the matching variables; stratified analysis or analogous methods may also be needed to complete the control of matching factors. Although matching is used as part of a strategy to control for confounding, confounding control is not the major reason for matching, because we can always control for measured confounders in the analysis without matching, via stratification or model fitting.

Major Statistical Advantage of Matching In both case-control and cohort studies we aim to reduce the variance of adjusted estimators at a given sample size. This goal is especially important when there is a limited number of index subjects. This gain in precision occurs when the matching variable (M) is associated with both exposure status (E) and disease occurrence (D) in the source population, so that we would need to control for M as a confounder even if matching were not done. Thus, the major statistical reason for matching is not to control for confounders, which can be done in the analysis, but to produce a more efficient study (one that yields an estimator with a smaller variance for a given sample size) than if we had not matched.

Major Statistical Advantage of Matching This gain in statistical efficiency (variance reduction), when it occurs, is obtained by equalizing (matching or ‘balancing’) the ratio of comparison to index subjects across strata of the matching variable. However, such equalization does not always result in greater efficiency (especially in case-control studies), so it is important to understand when it helps and when it does not. Natural matching also offers logistical advantages in that it is often easier to find controls from among sibs, friends and neighbors of cases.

Example: Source Population Stable source population: Relation between low dietary beta-carotene (the exposure) and the incidence of lung cancer (D), by smoking status (the covariate).

Example: Unmatched Case-Control Study Unmatched (population-based) density case-control study: Expected results taking all 52 cases and 52 randomly selected controls. Note that 12/52 = 23% of controls are smokers, reflecting the distribution of person-time in the source population (24,000/104,000). The proportions of smoking and nonsmoking controls that are exposed are also equal to the corresponding proportions of person-years in the source population.

Example: Unmatched Case-Control Study OR= mOR = 3.00 95% CL= (1.16, 7.73) Comment: Note the confounding by smoking in the source population and in the unmatched case-control study. This confounding is controlled using stratified analysis–e.g., the M-H method for estimating the common effect.

Example: Matched Case-Control Study Matched case-control study: All 52 cases and 52 controls matched on smoking (expected results of pairwise matching). Note the same 1:1 ratio of controls to cases for smokers and nonsmokers in the total sample.

Example: Matched Case-Control Study: Conclusion Although matching eliminated most confounding due to smoking, it introduced a selection bias (the crude estimator is still biased) that is controllable by controlling for smoking. It also produced a more precise estimate of the common odds ratio with the same sample size than did an unmatched design. Note that in the matched study, the ratio of cases to controls is the same for smokers and nonsmokers–i.e., the strata are balanced with respect to disease. In the unmatched study, however, the strata are quite unbalanced.

Example: Matched Case-Control Study: Comments In observational studies, a gain in statistical efficiency due to matching tends to occur when • the matching factor is one that must be controlled in the unmatched design (a confounder), and • the matching factor is strongly associated with the outcome (in a case-control study) or with the exposure (in a cohort study). Nonetheless, matching can lead to a loss of efficiency, although when there are few matching categories, the gains and the losses are usually not dramatic.

Analysis of Matched Data When analyzing matched data, we must take the matching into consideration through some form of stratification. This objective is achieved by one of two methods, depending on the type of matching: • Ordinary stratified analysis: With category or frequency matching, the most efficient analytic strategy for estimating the effect is to conduct a general stratified analysis as described in ME2 Ch. 15. That is, we ignore matched sets (even if the comparison subjects are individually matched to index subjects) and re-stratify on all matching variables in the analysis (as in the previous example).

Analysis of Matched Data • Matched analysis: With caliper or natural matching or a mixture of caliper and category matching, we usually conduct a stratified analysis by treating each matched set as a separate stratum (“matched analysis”). This type of analysis preserves the matched sets and usually yields small numbers within strata, demanding use of sparse data methods such as Mantel-Haenszel methods or conditional maximum likelihood.
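Treating each matched set as its own stratum can be illustrated with the Mantel-Haenszel odds ratio. In the sketch below the function name and the pair counts are hypothetical; for 1:1 matched pairs the estimator reduces to the ratio of the two kinds of exposure-discordant pairs, because concordant pairs add nothing to either sum.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio.

    strata: iterable of (a, b, c, d) with
      a = exposed cases, b = unexposed cases,
      c = exposed controls, d = unexposed controls.
    """
    strata = list(strata)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical 1:1 matched case-control data, one stratum per matched pair.
pairs = (
    [(1, 0, 0, 1)] * 30    # case exposed, control unexposed (discordant)
    + [(0, 1, 1, 0)] * 10  # case unexposed, control exposed (discordant)
    + [(1, 0, 1, 0)] * 40  # both exposed (concordant, contributes nothing)
    + [(0, 1, 0, 1)] * 20  # both unexposed (concordant, contributes nothing)
)

or_mh = mantel_haenszel_or(pairs)
print("matched-pair M-H odds ratio:", or_mh)  # 30 / 10 = 3.0
```

Only the 40 discordant pairs carry information here, which is the sense in which matched analysis is a sparse-data method.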

Analysis of Matched Data cont. We could use matched analysis with (individual) category matching, but ordinary stratified analysis yields more precise estimates of effect. Similarly, we could use ordinary stratified analysis with caliper matching (ignoring the matched sets), but this strategy might leave residual confounding from the matching variables (since mutually exclusive strata were not used to select matched comparison subjects) if the categories are too wide, and it might lead to a loss in precision. In any case, so-called “matched analysis methods” are nothing more than analysis methods for sparse data.

Selection of Matching Variables: Statistical Considerations • The selection of specific matching variables when designing a study depends on several statistical considerations as well as the study design. • As noted previously, the statistical purpose of matching is to improve statistical precision when controlling for confounders. • Although we can also control for confounders in the analysis without matching, the efficiency loss by stratification is particularly troublesome when the confounder is a nominal variable with many categories–e.g., occupations or neighborhoods.

Selection of Matching Variables: Statistical Considerations • Matching on such a confounder helps to prevent strata in which there are no cases or no controls – such strata get discarded by most analysis methods, so that any subject in those strata contributes nothing to the analysis (uninformative strata). • Without matching, therefore, it might not be possible to control adequately in the analysis for this type of variable. We would have to combine strata, possibly producing residual confounding.

Selection of Matching Variables: Statistical Considerations cont. In certain situations, however, it may be counterproductive to match on a given variable. That is: • Unlike in a cohort study, matching on a variable M in a case-control study usually precludes estimating its effect on disease occurrence because the M-D association is altered artificially by matching. • This is one reason for matching only on known risk factors for the disease in case-control studies. • In both case-control and cohort studies, matching on an intermediate variable or a factor that is affected by both the exposure and the disease will lead to a bias in both matched and crude analyses.

Selection of Matching Variables: Statistical Considerations • Sometimes matching can result in a loss of statistical efficiency, relative to not matching, instead of a gain. • This problem–called overmatching–occurs under different conditions in cohort versus case-control designs, and more readily in case-control designs.

Overmatching In case-control studies, statistical overmatching is most likely to occur when the matching variable is not a risk factor for the disease but is strongly associated with exposure status in the source population. Furthermore, if the matching variable is not taken into account in the analysis, the effect estimate will usually be biased, as illustrated in the next example. Often, however, logistical considerations (how to recruit controls) may override concerns about statistical overmatching – for example, neighborhood matching is usually done for logistical reasons, without regard to its statistical impact. Unfortunately, if too many factors are chosen for matching, it may become difficult or impossible to find a match for each case; e.g., there may often be no cooperative potential control in a neighborhood who is of the same sex and race as the case and also close in age. Fortunately, it is possible to partially match controls (e.g., loosen age or race matching requirements for some controls), as long as one controls the matching factor closely in the analysis.

Example: Overmatching In A Case-Control Study Source population: Expected relation between neuroleptic (antipsychotic) drug exposure and the incidence (D) of tardive dyskinesia (TD), by psychiatric diagnosis (C, which is not a risk factor for TD among the unexposed).

Example: Overmatching In A Case-Control Study cont. Unmatched case-control study: All 150 cases and 150 randomly selected controls (expected results). Comment: Because diagnosis is not a risk factor for TD in the unexposed base population, it does not confound the exposure effect. Thus, the expected estimate of the cOR is unbiased in an unmatched case-control study.

Example: Overmatching In A Case-Control Study cont. Matched case-control study: All 150 cases and 150 controls matched on diagnosis (expected results).

Example: Overmatching In A Case-Control Study cont. Conclusion: By matching on diagnosis (a non-risk factor), diagnosis becomes associated with disease status in the unexposed study population (even though these two variables are not associated in the total study population). That is, matching on diagnosis introduces a selection bias. Since the crude estimate of effect (1.57) is biased, we must stratify in the analysis on diagnosis; but by forcing us to stratify, the matching results in a loss of statistical efficiency (statistical overmatching) because diagnosis is strongly associated with exposure status. Note that the 95% confidence interval is wider in the matched study than in the unmatched study of the same size.

Overmatching In A Case-Control Study cont. Comment: Overmatching also results in efficiency loss when doing a matched analysis of case-control data, because the proportion of discordant matched sets is reduced. Implication: In a case-control study, we do not want controls to be similar (matched) to cases on all factors other than the exposure. Such a strategy could result in severe overmatching, since certain matching variables may be strongly associated with exposure status but not be risk factors. In summary, making controls similar to cases in a case-control study does not eliminate the need to control a factor, and may force us to control a factor that we would not otherwise have had to control. Making controls similar to cases is therefore a poor heuristic for controlling confounding. Matching needs to be done with some thought to the likely gains and losses that result from matching on each possible candidate, including both statistical and logistical considerations.

Statistical Overmatching and Study Design If a particular variable is not a risk factor for the disease (or is a weak risk factor) but is associated with exposure status in the base population, matching on this variable can result in a loss of statistical efficiency in a case-control study but not in a cohort study. This difference in overmatching between case-control and cohort designs can be depicted as a type of confounding by the matching variable. Case-control study: Matching on a factor related to exposure in a case-control study introduces a source of association between disease status and the matching variable conditional on exposure. Suppose this factor was unrelated to disease to begin with, except through its association with exposure. After matching, it will be related to both exposure and disease, and thus become an “induced confounder”. Because we now have to control for this factor, and this control usually increases variance, we suffer a loss of statistical efficiency compared with the expected results if matching had not been used. The stronger the M-E association, the greater the power loss from overmatching, and with rare exceptions the greater the efficiency loss (variance increase) as well.

Statistical Overmatching and Study Design Cohort study: In a cohort study, fixed-ratio matching does not introduce an M-D association conditional on exposure. Instead, fixed-ratio cohort matching eliminates any association between exposure status and the matching variable in the total cohort (source population). Thus, the matching variable is not a confounder in this type of cohort study. If the matching variable is not a risk factor, the unmatched variance estimator will be unbiased as well, so we don’t have to control for it and there will be no loss of efficiency from matching on it.

Statistical Overmatching and Study Design As in a case control study, matching in a cohort study can result in a loss of statistical efficiency (i.e., statistical overmatching) when estimating the risk or rate ratio even if the matching variable is a confounder in the source population (and hence without matching a stratified analysis must be done to get an unbiased point estimate). Usually, however, this efficiency loss does not occur when estimating the risk or rate difference. Thus, the conditions for overmatching in a cohort study are very different from the conditions for overmatching in a case-control study. While it may be difficult in practice to predict when cohort matching on a factor will increase variance, the increase is most likely to occur when the confounding (in the common RR or IDR) by the factor in the source population is toward the null value–i.e., when the ratio measure is biased toward one in the source population if the factor is not controlled. For example, we might expect a loss in statistical efficiency by matching on a confounder when cIR = 1.57 and mIR = 2.00 or when cIR = 0.70 and mIR = 0.50.

Example: Overmatching in a Cohort Study Source population: Relation between neuroleptic drug exposure and the incidence (D) of TD, by psychiatric diagnosis (which is a negative confounder in this population because schizophrenics are at lower risk of TD, unlike the two previous examples).

Example: Overmatching in a Cohort Study Unmatched cohort study: Randomly sample 10% of all exposed persons and unexposed persons in the source population (expected results). Comment: The crude risk ratio is confounded toward the null by diagnosis because schizophrenics are less likely than other patients to get TD (among the unexposed) but more likely to be exposed to neuroleptics.

Example: Overmatching in a Cohort Study Matched cohort study: Randomly sample 10% of all exposed persons and select an equal number of unexposed subjects matched on diagnosis (expected results). Comments: Even though matching on diagnosis made diagnosis a nonconfounder, it resulted in some loss of statistical efficiency (a larger variance and hence a wider 95% confidence interval). This efficiency loss occurred because of the smaller number of cases in the matched design (i.e., 33 vs. 36).

Matching in Experiments Experiments: Matching in experiments (more often called blocking) is done before exposure. It affects exposure assignment but does not affect the marginal matching-factor distribution in the experimental cohort, except insofar as the investigator discards potential subjects during this process (e.g., because they have no match, or because a constant number within blocks is wanted). Exposure plays no role in this selection process because no one is exposed until after selection. In contrast, in nonexperimental cohort studies, exposure precedes selection, and when the unexposed are matched to the exposed, exposure affects the matching-factor distribution in the cohort. In randomized experiments, matching does NOT lead to a loss in large-sample statistical efficiency, whereas, as shown above, it can lead to efficiency loss in a nonexperimental study.

Matching and Cost Efficiency: Trade-off between statistical considerations and study costs Typically, the process of matching in an observational study adds to the cost of data collection because of the need to identify eligible matches for index subjects by collecting information on the matching variables for persons who do not get into the study. Thus, the decision to match on a specific variable (M) depends on whether the gain in statistical efficiency by matching on a risk factor is worth the added cost. One way to judge whether one method of estimating a parameter is more cost efficient than a second method is to assess whether the first method yields a more precise estimate for the same study cost, or whether it costs less to obtain the same degree of precision. Matching, therefore, is more cost efficient than not matching if it costs less to achieve a given level of precision by matching than by simply increasing the sample size without matching, provided the latter strategy is feasible. For example, if a matching strategy increases efficiency by 10% but adds over 10% to the total cost of data collection, the matching strategy is not cost efficient.
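The 10% example above can be written as a simple comparison. This is only a sketch under two simplifying assumptions that are not stated on the slide: the variance of the unmatched estimator is inversely proportional to sample size, and the cost of the unmatched study is proportional to sample size. The function name is hypothetical.

```python
def matching_is_cost_efficient(relative_efficiency, relative_cost):
    """Compare matching against enlarging an unmatched study.

    relative_efficiency: variance(unmatched) / variance(matched) at equal sample size.
    relative_cost: cost(matched) / cost(unmatched) at equal sample size.

    Under the stated assumptions, an unmatched study needs `relative_efficiency`
    times the sample size, and hence that multiple of the cost, to reach the
    precision of the matched study.
    """
    return relative_cost < relative_efficiency


# Matching increases efficiency by 10% but adds 15% to the cost: not cost efficient.
print(matching_is_cost_efficient(1.10, 1.15))
# Matching increases efficiency by 10% and adds only 5% to the cost: cost efficient.
print(matching_is_cost_efficient(1.10, 1.05))
```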

Matching and Cost Efficiency: Trade-off between statistical considerations and study costs The extra cost of matching tends to be least when data on the matching variables are readily available on all persons in the source population before they are selected–e.g., on computer files. In fact, it is possible that matching on such variables as place of residence or place of diagnosis, especially in case-control studies, may reduce data-collection costs. E.g., suppose we wish to conduct a population-based case-control study of lung cancer among adults in Los Angeles County, where the county tumor registry is used to identify new cases. It is not very practical to select controls randomly from all adult residents of Los Angeles County because there is no convenient list of eligible residents. One commonly used alternative is to individually match controls to cases on residential neighborhood and possibly other factors–i.e., controls are selected from the same neighborhoods as cases. Even if “neighborhood” is not a risk factor for lung cancer (conditional on exposure status and other covariates), the matching can facilitate the selection of controls in such a way as to reduce the likelihood of response bias (e.g., if the probability of cooperation depends on SES).
