Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Faculty of Nutrition Texas A&M University http://stat.tamu.edu/~carroll
Outline • Problem: Can more efficient inference be done assuming gene (G) and environment (X) independence? • Gene-Environment independence: the case-only method • Profile likelihood approach • Efficiency gains • Example • Conclusions
Acknowledgment • This work is joint with Nilanjan Chatterjee, National Cancer Institute • Papers in: Biometrika, Genetic Epidemiology http://dceg.cancer.gov/people/ChatterjeeNilanjan.html
Outline • Theoretical Methods: • With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood • (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data • (This allows) generalization to any parametric model for G given X.
A Little Terminology • Epidemiologists: Case control sample • Econometricians: Choice-based sample • These are exactly the same problems • Subjects have two choices (or disease states) • Subjects have their covariates sampled conditional on their choices, i.e., • Random sample from those with disease • Random sample from those without disease
Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment: X • Strata: S • We are interested in main effects for G and (X,S) along with their interaction
Prospective Models • Simplest logistic model • General logistic model • The function m(G,X,b1) is completely general
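The model formulas for this slide are not present in the text. A sketch consistent with the bullets follows, assuming the notation $\beta_0$ for the intercept, $\beta_1$ for the slide's b1, and $H$ for the logistic distribution function. The individual coefficient names $\beta_G$, $\beta_X$, $\beta_{GX}$ are assumptions for illustration, and strata S are omitted for brevity.

```latex
\begin{align*}
\text{Simplest: } & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
   = \beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X \\
\text{General: } & \mathrm{pr}(D=1 \mid G, X)
   = H\{\beta_0 + m(G, X, \beta_1)\}, \qquad H(u) = \frac{1}{1 + e^{-u}}
\end{align*}
```

In this notation the simplest model is the special case $m(G, X, \beta_1) = \beta_G G + \beta_X X + \beta_{GX} G X$.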
Case-Control Data • Case-control data are not a random sample • We observe (G,X) given D, i.e., we observe the covariates given the response, not vice-versa • If we had a random sample, linear logistic regression would be used to fit the model • Obvious idea: ignore the sampling plan and pretend you have a random sample
Case-Control Data • Known Fact: The intercept is not identified, rest of the model is identified • Retrospective odds is given as
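Alternative Derivation: Ignore Sampling Plan • Consider a prospective study • Let Δ = 1 mean selection into the study (a selection indicator, distinct from the disease indicator D) • Pretend that selection depends only on disease status • Then compute the probability of disease given selection, G and X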
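The "pretend" and "compute" steps can be written out as follows. This is a sketch under assumed notation: $\Delta = 1$ denotes selection into the study, $n_1$ and $n_0$ are the numbers of cases and controls, $\pi_d = \mathrm{pr}(D=d)$ in the population, and the risk model is $\mathrm{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G,X,\beta_1)\}$ with $H$ the logistic function.

```latex
\begin{align*}
\text{Pretend: } & \mathrm{pr}(\Delta = 1 \mid D = d, G, X) \propto \frac{n_d}{\pi_d} \\
\text{Compute: } & \mathrm{pr}(D = 1 \mid \Delta = 1, G, X)
  = \frac{(n_1/\pi_1)\, e^{\beta_0 + m(G,X,\beta_1)}}
         {(n_0/\pi_0) + (n_1/\pi_1)\, e^{\beta_0 + m(G,X,\beta_1)}} \\
  &= H\{\beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(G, X, \beta_1)\}
\end{align*}
```

Only the intercept is shifted, by $\log(n_1/n_0) - \log(\pi_1/\pi_0)$, which is why $\beta_1$ can be estimated while ignoring the sampling plan and $\beta_0$ cannot be recovered without knowing $\pi_1$.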
Case-Control Data • Fact: All parameters except the intercept can be estimated consistently while ignoring the sampling plan • Standard Errors: Those computed ignoring the sampling plan are asymptotically correct
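A small simulation can illustrate this fact. The sketch below is not from the talk: the parameter values, sample sizes, and function names are all hypothetical, G is taken as binary and X as standard normal, and the logistic fit is a plain Newton-Raphson routine written with NumPy.

```python
import numpy as np

def fit_logistic(design, y, n_iter=25):
    """Newton-Raphson fit of an ordinary (prospective) logistic regression."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (y - p)
        hess = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def simulate_case_control(rng, beta, n_pop=400_000, n_cases=2000, n_controls=2000):
    """Draw a population, then sample fixed numbers of cases and controls."""
    g = rng.binomial(1, 0.3, n_pop)          # hypothetical carrier frequency
    x = rng.normal(size=n_pop)               # hypothetical environmental exposure
    design = np.column_stack([np.ones(n_pop), g, x, g * x])
    d = rng.binomial(1, 1.0 / (1.0 + np.exp(-design @ beta)))
    cases = rng.choice(np.flatnonzero(d == 1), n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(d == 0), n_controls, replace=False)
    keep = np.concatenate([cases, controls])
    return design[keep], d[keep]

# Hypothetical truth: intercept, G effect, X effect, G*X interaction
true_beta = np.array([-4.0, 0.5, 0.4, 0.3])
rng = np.random.default_rng(0)
design, d = simulate_case_control(rng, true_beta)

# Fit as if the case-control data were a random sample
beta_hat = fit_logistic(design, d)
print("slopes   :", beta_hat[1:], "truth:", true_beta[1:])
print("intercept:", beta_hat[0], "truth:", true_beta[0])
```

The slope estimates should land close to the true values, while the fitted intercept is far from the true one because it absorbs the sampling fractions.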
Case-Control Data • The intercept is determined by pr(D=1) in the population, hence not identified from these data • Little Known Fact: Adding information about pr(D=1) adds no information about the remaining (non-intercept) parameters
Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies
G-E Independence: Discussion • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction • If False: Possible severe bias (Albert, et al., 2001, our own simulations)
G-E Independence: Discussion • It is reasonable in many problems • Example: Environment is a treatment in a randomized study under nested case-control sampling • Example: Reasonable when exposure is not directly controlled by individual behavior • Radiation exposure for A-bomb survivors • Carcinogenic exposure of employees • Pesticide exposure in a rural community
Generalizations • I have phrased this problem as one where G and X are independent given strata • This makes sense contextually in genetic epidemiology • All the results I will describe go through if you can write down a probability model for G given (X,S): I do this in the Israeli Study.
Generalizations • If G is binary, it is natural to apply our approach • Posit a parametric or semiparametric model for G given (X,S) • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G and (X,S) interactions.
Gene-Environment Independence • Rare Disease Approximation: Rare disease for all values of (G,X) • May be unreasonable for important genes such as BRCA1/2 • Case-only estimate of multiplicative interaction (Piegorsch et al., 1994)
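For binary G and X, the case-only estimate is the odds ratio between G and X computed among cases alone. The sketch below is illustrative only: the function name and the cell counts are hypothetical, and the confidence interval uses the usual Wald (Woolf) standard error for a log odds ratio.

```python
import math

def case_only_interaction(n_g1x1, n_g1x0, n_g0x1, n_g0x0, z=1.96):
    """Case-only estimate of the multiplicative G-X interaction.

    Arguments are counts of CASES in the four (G, X) cells. Under G-X
    independence and a rare disease, the G-X odds ratio among cases
    approximates the multiplicative interaction odds ratio.
    Returns (estimate, lower, upper).
    """
    log_or = math.log((n_g1x1 * n_g0x0) / (n_g1x0 * n_g0x1))
    se = math.sqrt(1 / n_g1x1 + 1 / n_g1x0 + 1 / n_g0x1 + 1 / n_g0x0)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# Hypothetical counts among cases, for illustration only
estimate, lower, upper = case_only_interaction(30, 70, 20, 180)
print(f"case-only interaction OR = {estimate:.2f} ({lower:.2f}, {upper:.2f})")
```

Controls never enter the calculation, which is why no main effect can be recovered from this estimate.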
Gene-Environment Independence: Case-Only Analysis • Positive Consequence: Often much more powerful than standard analysis • The power advantage of this method has often led researchers to discard information on controls • Negative Consequence: No ability to estimate other risk parameters, which are often of greater interest (see example later) • Restrictions: Can only handle multiplicative interaction; requires the disease to be rare for all values of (G,X)
Gene-Environment Independence • Fact: gain in power for inference about a multiplicative interaction • Consequence: There is thus (Fisher) information in the assumption • Conjecture: Can handle general models and improve efficiency for all parameters • We do this via a semiparametric profile likelihood approach • We start though from a different likelihood
Prentice-Pyke Calculation • Methodology: Start with the retrospective likelihood • The distribution of (X,G) in the population is left unspecified • Semiparametric MLE is usual logistic regression
Environment and Gene Expression • Methodology: Start with the retrospective likelihood • Note how independence of G and X is used here, see the red expressions • We do not want to model the often multivariate distribution of X • Gene distribution model can be standard
Environment and Gene Expression • Methodology: Compute a profile estimate • Parametric/semiparametric distribution for G • Nonparametric distribution for X (possibly high dimensional) • Result: Explicit profile likelihood
Environment and Gene Expression • Methodology: Treat the probability masses of the nonparametric distribution of X as distinct parameters • Let G have parametric structure • Construct the profile likelihood, having estimated these masses as functions of the data and the other parameters • The result is a function of the remaining parameters only: this function can be calculated explicitly!
Profile Likelihood • Result:
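Alternative Derivation • Consider a prospective study • Let Δ = 1 mean selection into the study (a selection indicator, distinct from the disease indicator D) • Pretend that selection depends only on disease status • Then compute the joint probability of D and G given selection and X • This is exactly our profile pseudo-likelihood!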
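Carrying this device through under G-X independence gives an explicit expression. The following is a sketch under assumed notation, not a transcription of the slide: $\Delta = 1$ denotes selection, $q(g \mid \theta)$ is the parametric model for G, $n_d$ are the case and control counts, $\pi_d = \mathrm{pr}(D=d)$, and the names $S$ and $\kappa$ are introduced here for convenience.

```latex
\begin{align*}
\mathrm{pr}(D=d, G=g \mid \Delta=1, X=x)
  &\propto \frac{n_d}{\pi_d}\,\mathrm{pr}(D=d \mid G=g, X=x)\, q(g \mid \theta) \\
  &\propto S(d, g, x) \equiv
     q(g \mid \theta)\,
     \frac{\exp[\,d\{\kappa + m(g, x, \beta_1)\}\,]}
          {1 + \exp\{\beta_0 + m(g, x, \beta_1)\}}, \\
\kappa &= \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0), \\
L &= \prod_{i=1}^{n}
     \frac{S(d_i, g_i, x_i)}{\sum_{d^*}\sum_{g^*} S(d^*, g^*, x_i)} .
\end{align*}
```

The distribution of X cancels because everything is conditional on X = x, so it never has to be modeled.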
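Alternative Derivation • We compute: the joint probability of (D, G) given X and selection into the study • Standard approach computes: the probability of D given (G, X) and selection into the study • It is this insight that allows us to greatly generalize the work past independence of G and X.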
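To make the contrast concrete, here is a minimal sketch of a log pseudo-likelihood built from the joint probability of (D, G) given X and selection. It assumes a binary G independent of X with pr(G=1) = expit(theta), a linear-logistic risk model with a G*X interaction, and a shifted intercept kappa that absorbs the sampling fractions. All names are hypothetical, and this is not the authors' code.

```python
import numpy as np

def pseudo_loglik(beta0, beta, theta, kappa, d, g, x):
    """Sum over subjects of log pr(D=d_i, G=g_i | selected, X=x_i).

    beta  : (G effect, X effect, G*X interaction)
    theta : log-odds of pr(G=1), assumed not to depend on X
    kappa : intercept shifted by the case-control sampling fractions
    """
    def log_s(dd, gg):
        # Unnormalised log weight of the cell (D=dd, G=gg) at each x
        m = beta[0] * gg + beta[1] * x + beta[2] * gg * x
        log_q = np.where(gg == 1, theta, 0.0) - np.log1p(np.exp(theta))
        return log_q + dd * (kappa + m) - np.log1p(np.exp(beta0 + m))

    # Normalise over the four (d, g) combinations at each observed x
    cells = np.stack([log_s(dd, gg) for dd in (0, 1) for gg in (0, 1)])
    log_norm = np.logaddexp.reduce(cells, axis=0)
    return float(np.sum(log_s(d, g) - log_norm))

# Tiny hypothetical data set, for illustration only
d = np.array([1, 0, 1])
g = np.array([1, 0, 0])
x = np.array([0.2, -1.0, 0.7])
value = pseudo_loglik(-3.0, np.array([0.5, 0.4, 0.3]), -1.0, 0.1, d, g, x)
print(value)
```

In practice this function would be passed (negated) to a numerical optimiser over all parameters. The standard approach would instead condition on G and normalise over d only.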
Computation • Intercept: The logistic intercept, and hence pr(D=1), is weakly identified by itself • Disease rate: If pr(D=1) is known, or a good bound for it is specified, there can be significant gains in efficiency • This does not happen for a regular case-control study
Interesting Technical Point • Profile pseudo-likelihood acts like a likelihood • Information Asymptotics are (almost) exact • Missing G data handled seamlessly (see next) • Missing genotype • Unphased haplotype data
Missing Data • We have a formal likelihood for (D, G) given X • If G is missing, this suggests using the corresponding formal likelihood for D given X alone • Result: Inference as if the data were a random sample with missing data
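In symbols, one way to write the two kinds of contribution is sketched below. The notation is assumed: $\Delta = 1$ denotes selection into the study, and G is taken to be discrete so that the missing value can be summed out.

```latex
\[
L_i =
\begin{cases}
\mathrm{pr}(D = d_i,\, G = g_i \mid \Delta = 1,\, X = x_i), & G_i \text{ observed}, \\[4pt]
\displaystyle\sum_{g} \mathrm{pr}(D = d_i,\, G = g \mid \Delta = 1,\, X = x_i)
  = \mathrm{pr}(D = d_i \mid \Delta = 1,\, X = x_i), & G_i \text{ missing}.
\end{cases}
\]
```

This is the same form a random sample with missing G would produce, which is what makes the treatment seamless.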
Measurement Error • The likelihood formulation also allows us to deal with measurement error in the environmental variables
First Simulation • MSE Efficiency of Profile method: 0.02 < pr(D=1) < 0.07
Israeli Ovarian Cancer Study • Population based case-control study • Study the interplay of BRCA1/2 mutations (G) and two known risk factors (E or X) of ovarian cancer: • oral contraceptive (OC) use • parity. • Missing Data: Approximately 50% of the controls were not genotyped, and 10% of the cases
Israeli Ovarian Cancer Study • Results reported in Modan et al., NEJM (2001). • Their analysis involves • The assumption that parity and OC use are independent of BRCA1/2 mutation status • Simple but approximate methods for exploiting the G and E independence assumption (including the case-only estimate of interaction) • A risk model adjusted for Age, Race, Family History, History of Gynecological Surgery
Israeli Ovarian Cancer Study • Disease risk model including same covariates as Modan et al (2001) • In addition, we explicitly adjusted for the possibility of both G and E being related to S • FH = family history (breast cancer = 1, ovarian or >= 2 breast cancer = 2)
Israeli Ovarian Cancer Study • Question: Can carriers be protected via OC-use? • The logarithm of the odds ratio is the sum of • The main effect for OC-use • The interaction term between OC-use and being a carrier, i.e., interaction between gene and environment • Note how this involves main effects and interactions
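Because the log odds ratio is a sum of two coefficients, its standard error needs their covariance as well as their variances. A minimal sketch follows. The function name and every number in the example are hypothetical and are not estimates from the Israeli study.

```python
import math

def carrier_oc_odds_ratio(b_oc, b_int, var_oc, var_int, cov_oc_int, z=1.96):
    """Odds ratio for OC use among carriers, with a Wald interval.

    log OR = (main effect of OC use) + (OC-by-carrier interaction);
    var(log OR) = var_oc + var_int + 2 * cov_oc_int.
    Returns (estimate, lower, upper).
    """
    log_or = b_oc + b_int
    se = math.sqrt(var_oc + var_int + 2.0 * cov_oc_int)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# Hypothetical coefficient estimates and covariance entries
estimate, lower, upper = carrier_oc_odds_ratio(
    b_oc=-0.05, b_int=0.08, var_oc=0.0004, var_int=0.0009, cov_oc_int=-0.0003
)
print(f"OR among carriers = {estimate:.3f} ({lower:.3f}, {upper:.3f})")
```

A case-only analysis supplies only the interaction term, so this combined quantity cannot be formed from it.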
Israeli Ovarian Cancer Study • Question: Is there a carrier/OC interaction? • This is the only question the case-only method can answer
Israeli Ovarian Cancer Study • Interaction of OC and BRCA1/2:
Israeli Ovarian Cancer Study • Main Effect of BRCA1/2:
Israeli Ovarian Cancer Study • Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) • No evidence for protective effect • Not available from case-only analysis • The interval is half as long as the one from the usual analysis
Features of the Method • Allows estimation of all parameters of logistic regression model and can be used to examine interaction in alternative scales • Can be used to estimate OR for non-rare diseases • Important for studying major genes such as BRCA1/2
Features of the Method • Allows incorporation of external information on Pr(D=1) • Unlike with logistic regression in case-control studies, this information improves efficiency of estimation
Colorectal Adenoma Study • PLCO Study: 772 cases, 772 controls • Three SNPs in the calcium-sensing receptor region • HWE assumed • Interest in the interaction of number of copies of one haplotype (GCG) and calcium intake from diet
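Under Hardy-Weinberg equilibrium the two haplotypes a person carries are independent draws, so the number of copies of a given haplotype is Binomial(2, p), where p is its population frequency. A short sketch of that model for G follows. The function name and the frequency 0.25 are hypothetical and are not values from the PLCO data.

```python
def haplotype_copy_probs(freq):
    """pr(0, 1, 2 copies) of a haplotype with population frequency freq,
    assuming Hardy-Weinberg equilibrium (independent pairing of haplotypes)."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    return ((1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2)

# Hypothetical haplotype frequency, for illustration only
probs = haplotype_copy_probs(0.25)
print(probs)
```

This kind of one-parameter distribution is an example of the parametric model for G that the method requires, with the frequency estimated jointly with the risk parameters.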
Colorectal Adenoma Study • Method #1: Write down the prospective likelihood and apply missing data techniques • A standard analysis • If ignoring the case-control sampling scheme works for ordinary logistic regression, it should work for missing haplotype regression too, right? • Wrong! Biased estimates and standard errors • Method #2: Our method
Conclusions • Standard case-control (choice-based) studies • Specify a model for G given X, e.g., G-E independence in population after conditioning on strata • No assumptions made about X (high dimensional) • All parameters estimable, no rare-disease assumption • Handle missing G data • Large gains in efficiency versus usual method • Large gains in efficiency for effects of environment given the gene
Conclusions • Theoretical Methods: • With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood • (Key insight) Equivalent to a device of pretending that the study is a regular random sample subject to missing data • (This allows) generalization to any parametric model for G given X.
Acknowledgment • Two graduate students have worked on this project: • Iryna Lobach, Yale • Christie Spinka, U of Missouri
Thanks! http://stat.tamu.edu/~carroll