Bias and Confounding in Information Accuracy

Precision and Validity Information Bias Dr. Jørn Olsen Epi 200B January 21 and 26, 2010

Bias and confounding (Last, Dictionary) Bias: Deviation of results or inference from truth, or processes leading to such deviations. Any trend in the collection, analysis, interpretation, publication, or review of data that can lead to conclusions that are systematically different from the truth.

Bias and confounding (Last, Dictionary) Confounding: A situation in which the effects of two processes are not separated. "Confounder", "confounding factor" and "confounding variable" are poor terms: confounding is study specific, and no variable is always a confounder.

Dictionary; IEA/Last: • Information bias (observational bias): A flaw in measuring exposure or outcome data that results in different quality (accuracy) of information between comparison groups

Information Bias and Other Method Problems • Information: exposures, end points, confounders, modifiers • For discrete variables: classification error/misclassification • Differential/non-differential information bias

Data accuracy • Data are almost never 100% accurate • Coding errors, measurement errors • We ask questions that cannot be answered correctly, e.g. exposure to ETS last year

Non-differential: the misclassification does not depend upon the value of other variables. Example: diagnosis has the same sensitivity and specificity among exposed and non-exposed, or exposure is reported with the same sensitivity and specificity among cases and controls.

Non-differential misclassification better than differential • Non-differential misclassification can often be achieved in follow-up studies • Exposures are recorded prior to disease occurrence • Diseases may be recorded by doctors who do not ask about exposures

Recall bias: misclassification of the exposure. A serious problem in case-control studies or cross-sectional studies based upon recall.

Recall bias: Hungarian case-control surveillance of congenital abnormalities (Epidemiology 2001; 12: 461-66).
Drug use was recorded from two sources:
• self-reported data (interview, memory aids)
• log-book: medicine prescribed by ANC doctors
Sensitivity = a/(a+c); specificity = d/(b+d)
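The two formulas on this slide can be written out as a small helper. This is a minimal sketch: the cell layout (a and d are the concordant cells, b and c the discordant ones) is the conventional 2x2 arrangement and is assumed here, and the counts are hypothetical, not taken from the Hungarian study.

```python
def sensitivity_specificity(a, b, c, d):
    """Return (sensitivity, specificity) from a 2x2 validation table.

    Assumed layout (rows = source being evaluated, columns = reference):
        a = positive in both sources
        b = positive in the evaluated source only
        c = positive in the reference only
        d = negative in both sources
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(a=40, b=10, c=60, d=890)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.3f}")
```

Computing the pair separately for cases and for controls shows whether the misclassification looks differential.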

A low sensitivity is expected if mothers provide a complete recall since only ANC prescribed drugs are in the log book.

Short-term drugs

Long-term drugs

What to do to reduce differential information bias? • Use blinding if possible: "blind till it hurts" (Cochrane). • Use of hospital controls may, in some cases, help to reduce information bias. • The disease used to identify the comparison group must NOT be associated with the exposure under study (must not be a cause or a preventive factor).

For case-control studies • First study is important • No disclosure of study hypothesis • Use biomarkers of exposure if possible • Use secondary data collected prior to the disease • Use neutral interviewers

Differential misclassification of the endpoint: sometimes a problem in follow-up studies

Is this follow-up study vulnerable to differential misclassification of DVT?

Follow-up studies are usually less vulnerable to differential recall bias because the exposure is recorded before the end point, but knowing the hypothesis may introduce bias if the exposure is a suspected cause of the disease under study. Blind the clinicians, if possible.

It is often stated that non-differential misclassification leads to bias towards no association (RR = IRR = OR = 1, RD = IRD = 0). The first argument for that was provided by Bross in the 1950s. Non-differential misclassification is not the same as random misclassification (random is only non-differential in the long run). Random misclassification (blinding) can be very differential by chance in a small study.

$P$ = proportion of smokers; $P_l$ and $P_r$, where $l$ = lung cancer and $r$ = reference

$TP = P \times \text{sens}$
$FN = P \times (1 - \text{sens})$
$FP = (1 - P) \times (1 - \text{spec})$
$TN = (1 - P) \times \text{spec}$

If we take interest in the difference between $P_l$ and $P_r$: $D = P_l - P_r$ (normally we would take an interest in exposure odds, for example)

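We are only able to estimate $\hat{P}_l$ and $\hat{P}_r$, and thereby $\hat{D} = \hat{P}_l - \hat{P}_r$. In case of non-differential misclassification, $FP_l = FP_r = FP$ and $FN_l = FN_r = FN$, where FN and FP here denote the error rates $1 - \text{sens}$ and $1 - \text{spec}$.

Then $\hat{D} = D\,(1 - (FN + FP))$ (check it out), meaning $\hat{D} \neq D$ if $FN + FP \neq 0$ (sens + spec < 2):
• $FN + FP < 1.0$: $|\hat{D}| < |D|$ (but same sign)
• $FP + FN = 1.0$: $\hat{D} = 0$ (like flipping a coin)
• $FN + FP = 2$: $\hat{D} = -D$ (coding!)
Also true for ORs.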
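The "check it out" can be done numerically. The sketch below assumes that FN and FP in the formula are the error rates 1 - sens and 1 - spec, and that the observed proportion of smokers is true positives plus false positives; the proportions 0.60 and 0.30 are hypothetical.

```python
def observed_proportion(p, sens, spec):
    """Proportion classified as smokers = true positives + false positives."""
    return p * sens + (1 - p) * (1 - spec)


def observed_difference(p_l, p_r, sens, spec):
    """Observed difference under non-differential misclassification."""
    return observed_proportion(p_l, sens, spec) - observed_proportion(p_r, sens, spec)


# Hypothetical true proportions of smokers (illustration only).
p_l, p_r = 0.60, 0.30
d_true = p_l - p_r

for sens, spec in [(1.0, 1.0), (0.8, 0.9), (0.5, 0.5), (0.0, 0.0)]:
    fn_rate, fp_rate = 1 - sens, 1 - spec
    d_obs = observed_difference(p_l, p_r, sens, spec)
    d_formula = d_true * (1 - (fn_rate + fp_rate))
    # The direct calculation and the closed-form expression should agree.
    assert abs(d_obs - d_formula) < 1e-12
    print(f"sens={sens}, spec={spec}: D_hat={d_obs:+.3f} (true D={d_true:+.3f})")
```

The four settings correspond to no error, attenuation with the same sign, the coin-flip case, and the fully reversed coding.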

Non-differential misclassification of a dichotomous variable will, in most cases, bias values towards no association (but there are other sources of error in a study and the combined effect may be away from the null). Non-differential misclassification of a variable with more than two categories can cause bias away from the null, but mainly in rather unusual situations. Misclassification of a confounder can cause bias in any direction.

When estimating relative effect measures, a high specificity is wanted.

True cohort data

If sensitivity is 0.8 but specificity is 1

If sensitivity is 1 but specificity is 0.80

If sensitivity is 0.8 and specificity is 0.9
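The three scenarios above can be illustrated with expected counts. This is only a sketch under stated assumptions: the cohort below (1000 exposed with 100 cases, 1000 unexposed with 50 cases, true RR = 2) is hypothetical and is not the data from the slides, and the disease is assumed to be misclassified non-differentially.

```python
def observed_rr(n_exp, cases_exp, n_unexp, cases_unexp, sens, spec):
    """Expected risk ratio when disease is misclassified non-differentially."""
    def observed_cases(n, cases):
        # True cases that are detected plus non-cases wrongly labelled as cases.
        return cases * sens + (n - cases) * (1 - spec)

    risk_exp = observed_cases(n_exp, cases_exp) / n_exp
    risk_unexp = observed_cases(n_unexp, cases_unexp) / n_unexp
    return risk_exp / risk_unexp


# Hypothetical cohort (illustration only); true RR = 2.
cohort = dict(n_exp=1000, cases_exp=100, n_unexp=1000, cases_unexp=50)

for sens, spec in [(1.0, 1.0), (0.8, 1.0), (1.0, 0.8), (0.8, 0.9)]:
    rr = observed_rr(**cohort, sens=sens, spec=spec)
    print(f"sens={sens}, spec={spec}: RR = {rr:.2f}")
```

With perfect specificity the risk ratio stays at 2 even though a fifth of the cases are missed, whereas imperfect specificity pulls it towards 1 (about 1.17 and 1.26 in the last two settings).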

The corresponding case-cohort studies would produce the following (similar) results (if done right in this situation as a case-cohort study).


If we get a reference pathologist to eliminate all FP cases, we would get (for the last table)

Adjusting for misclassification is possible if sens and spec are known

Example: sens = 0.44, spec = 0.94, based upon comparison with the "gold standard" (clinical diagnosis)

Exp P (M) = (350/1777 + 0.94 - 1) / (0.44 + 0.94 - 1) = 0.360 (640 with the disease)
Exp P (F) = (277/2064 + 0.94 - 1) / (0.44 + 0.94 - 1) = 0.195 (403 with the disease)
Ratio of the corrected proportions (M/F) = 0.360 / 0.195 = 1.85
In case of differential misclassification, use sex-specific sens and spec.
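The correction used in this example can be reproduced in a few lines. The function name is only illustrative; the formula, the counts and the values of sens and spec are those given on the slide.

```python
def corrected_prevalence(n_positive, n_total, sens, spec):
    """Back-calculate the true proportion from the observed proportion."""
    observed = n_positive / n_total
    return (observed + spec - 1) / (sens + spec - 1)


sens, spec = 0.44, 0.94

p_male = corrected_prevalence(350, 1777, sens, spec)
p_female = corrected_prevalence(277, 2064, sens, spec)

print(f"males:   {p_male:.3f} -> about {p_male * 1777:.0f} with the disease")
print(f"females: {p_female:.3f} -> about {p_female * 2064:.0f} with the disease")
print(f"ratio of corrected proportions (M/F): {p_male / p_female:.2f}")
```

The denominator sens + spec - 1 must not be zero; when it is close to zero the corrected value becomes very unstable.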

Misclassification of a confounder may bias a result in any direction (Greenland & Robins. Am J Epidemiol 1985:122;495-506) Let this be the true data:

The confounder has an effect (OR=2) The exposure has no effect (OR=1)

Now assume exposure and disease status are recorded without error. Only the confounder is non-differentially misclassified (sens = 0.8 and spec = 0.9). We then get:

When stratifying on the confounder True data

Miscl data
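The effect of stratifying on a misclassified confounder can be sketched with expected counts. The numbers below are hypothetical and are not the Greenland & Robins data; they are only chosen so that the confounder has OR = 2 for the disease, the exposure has OR = 1 within each true stratum, and the confounder is more common among the exposed.

```python
def odds_ratio(table):
    """table = (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases)."""
    a, b, c, d = table
    return (a * d) / (b * c)


def misclassify_confounder(stratum_pos, stratum_neg, sens, spec):
    """Expected strata after non-differential misclassification of the confounder."""
    pairs = list(zip(stratum_pos, stratum_neg))
    # Each cell of the observed "confounder present" stratum mixes both true strata.
    obs_pos = tuple(sens * p + (1 - spec) * n for p, n in pairs)
    obs_neg = tuple((1 - sens) * p + spec * n for p, n in pairs)
    return obs_pos, obs_neg


# Hypothetical true data (illustration only).
conf_present = (400, 400, 100, 100)  # exposure OR = 1
conf_absent = (100, 200, 400, 800)   # exposure OR = 1

crude = tuple(p + n for p, n in zip(conf_present, conf_absent))
obs_present, obs_absent = misclassify_confounder(
    conf_present, conf_absent, sens=0.8, spec=0.9
)

print("true stratum ORs:", odds_ratio(conf_present), odds_ratio(conf_absent))
print("crude OR:", odds_ratio(crude))
print("stratum ORs with misclassified confounder:",
      round(odds_ratio(obs_present), 2), round(odds_ratio(obs_absent), 2))
```

In this particular configuration the stratum-specific odds ratios end up between the true value of 1 and the crude value of 1.5: stratifying on the misclassified confounder removes only part of the confounding.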

Misclassification is likely if we ask for sensitive data (alcohol intake), if we ask for data that cannot easily be recalled (diet), if the relevant time window is short (teratology), if we give little attention to the data collection, or perhaps if we give too much attention to the data collection.

Regression towards the mean. Misclassification arises for a group of people because we oversample large random errors; this selection leads to misclassification. $\widehat{IQ} = IQ + \varepsilon$, with $\sum \varepsilon = 0$ for all in the study but not for those selected from extreme parts of the distribution ($\sum \varepsilon > 0$). Their measured IQs may be unusual because their IQs are unusual or because their measurement errors were large, or both. In a new round of measuring IQ one would expect $\sum \varepsilon$ to be zero (or at least closer to 0).
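A small simulation makes the selection effect visible. All settings here are assumptions made for illustration: true IQ is drawn from a normal distribution with mean 100 and SD 15, the measurement error has SD 10, and "extreme" means a first measurement of 130 or more.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

N = 100_000
true_iq = [random.gauss(100, 15) for _ in range(N)]
first = [t + random.gauss(0, 10) for t in true_iq]   # measured IQ = true IQ + error
second = [t + random.gauss(0, 10) for t in true_iq]  # new measurement, new errors

# Select people whose FIRST measurement was extreme (arbitrary cut-off).
selected = [i for i in range(N) if first[i] >= 130]


def mean(values):
    return sum(values) / len(values)


mean_error_all = mean([first[i] - true_iq[i] for i in range(N)])
mean_error_selected = mean([first[i] - true_iq[i] for i in selected])
mean_first = mean([first[i] for i in selected])
mean_second = mean([second[i] for i in selected])

print(f"mean error, everyone:       {mean_error_all:+.2f}")
print(f"mean error, selected group: {mean_error_selected:+.2f}")
print(f"selected group, first vs second measurement: {mean_first:.1f} vs {mean_second:.1f}")
```

The errors average out over everyone but are clearly positive in the selected group, and the second measurement of that group falls back towards the overall mean.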

Regression towards the mean comes in many different forms. Assume you want to predict PTB and collect data on a number of potential risk factors. • You select those who have the highest RR and claim you can predict 60% of PTB using these markers. When you apply these ‘predictors’ in a new data source, you are in for a disappointment. Why?

• Misclassification has an impact on estimates of effect sizes and power
• A smaller study with better quality data may be preferable to a large study with poor quality data
• Use blinding to avoid differential misclassification
• Estimate misclassification/repeated measures

Capture-recapture to estimate completeness of recording (the degree of underreporting). If you have two different data sources (parental reporting of febrile seizures and hospitalizations for febrile seizures) you may be able to estimate the actual coverage of each data source.

The arguments come from biologists and go like this: You want to know the number of salmon in a given lake; you can empty the lake and count all the salmon. Or:
1. You catch some salmon (M1) in the lake, give them a mark and throw them back into the lake.
2. You make another catch of salmon (M2) and note how many had the mark, i.e. were caught in the first catch (M3).
3. Now you know M1, M2 and M3 and you are ready to estimate the total number of salmon in the lake, N.

$P_1$ (first catch) $= M_1 / N$
$P_2$ (second catch) $= M_2 / N$
$M_3 = N \times P_1 \times P_2 = N \times \frac{M_1}{N} \times \frac{M_2}{N} = \frac{M_1 \times M_2}{N}$
$N = \frac{M_1 \times M_2}{M_3}$

Say, in our study, we had parental reports for 100 children with FS and 75 hospital reports. Our estimate of the total number of children with FS in the study would be (if 50 were registered with FS both places) (100 x 75)/50 = 150
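The same calculation as a short sketch, which also gives the estimated coverage of each source. The function name is illustrative. The estimate rests on the assumption built into the derivation above, namely that being recorded in one source is independent of being recorded in the other.

```python
def capture_recapture(m1, m2, m3):
    """Estimate the total N from two sources: N = (M1 x M2) / M3."""
    if m3 == 0:
        raise ValueError("no overlap between the sources; N cannot be estimated")
    return m1 * m2 / m3


parental, hospital, both = 100, 75, 50
n_total = capture_recapture(parental, hospital, both)

print(f"estimated children with FS: {n_total:.0f}")
print(f"coverage of parental reports: {parental / n_total:.0%}")
print(f"coverage of hospital records: {hospital / n_total:.0%}")
```

With these numbers the parental reports cover about two thirds and the hospital records half of the estimated 150 children.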
