Gene-Environment Case-Control Studies

Gene-Environment Case-Control Studies. Raymond J. Carroll, Department of Statistics, Faculty of Nutrition, Texas A&M University, http://stat.tamu.edu/~carroll. Outline. Problem: Can more efficient inference be done assuming gene (G) and environment (X) independence?

Presentation Transcript


  1. Gene-Environment Case-Control Studies • Raymond J. Carroll • Department of Statistics, Faculty of Nutrition, Texas A&M University • http://stat.tamu.edu/~carroll

  2. Outline • Problem: Can more efficient inference be done assuming gene (G) and environment (X) independence? • Gene-Environment independence: the case-only method • Profile likelihood approach • Efficiency gains • Example • Conclusions

  3. Acknowledgment • This work is joint with Nilanjan Chatterjee, National Cancer Institute • Papers in: Biometrika, Genetic Epidemiology • http://dceg.cancer.gov/people/ChatterjeeNilanjan.html

  4. Outline • Theoretical Methods: • With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood • (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data • (This allows) generalization to any parametric model for G given X.

  5. A Little Terminology • Epidemiologists: Case control sample • Econometricians: Choice-based sample • These are exactly the same problems • Subjects have two choices (or disease states) • Subjects have their covariates sampled conditional on their choices, i.e., • Random sample from those with disease • Random sample from those without disease

  6. Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment: X • Strata: S • We are interested in main effects for G and (X,S) along with their interaction

  7. Prospective Models • Simplest logistic model • General logistic model • The function m(G,X,b1) is completely general
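
As a sketch of the two models named on this slide, assuming H is the logistic distribution function, writing \beta_0 for the intercept and \beta_1 for the transcript's b1, and using the illustrative coefficient names \beta_G, \beta_X, \beta_{GX} (assumptions, not taken from the slide):

```latex
\begin{align*}
\text{Simplest logistic model:}\quad
  & \operatorname{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X),
  \qquad H(u) = \frac{1}{1+e^{-u}}, \\
\text{General logistic model:}\quad
  & \operatorname{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\}.
\end{align*}
```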

  8. Case-Control Data • Case-control data are not a random sample • We observe (G,X) given D, i.e., we observe the covariates given the response, not vice-versa • If we had a random sample, linear logistic regression would be used to fit the model • Obvious idea: ignore the sampling plan and pretend you have a random sample

  9. Case-Control Data • Known Fact: The intercept is not identified, rest of the model is identified • Retrospective odds is given as
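
A sketch of the retrospective odds under the general logistic model, obtained from Bayes' rule. The notation \beta_0 for the intercept and \beta_1 for the remaining parameters is an assumption; the slide's own formula is not in the transcript.

```latex
\begin{align*}
\frac{\operatorname{pr}(G, X \mid D=1)}{\operatorname{pr}(G, X \mid D=0)}
  &= \frac{\operatorname{pr}(D=1 \mid G, X)}{\operatorname{pr}(D=0 \mid G, X)}
     \cdot \frac{\operatorname{pr}(D=0)}{\operatorname{pr}(D=1)} \\
  &= \exp\Bigl[\beta_0 - \log\frac{\operatorname{pr}(D=1)}{\operatorname{pr}(D=0)}
     + m(G, X, \beta_1)\Bigr].
\end{align*}
```

The intercept \beta_0 enters only in combination with pr(D=1), which the case-control data do not determine, while m(G, X, \beta_1) is unaffected.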

  10. Alternative Derivation: Ignore Sampling Plan • Consider a prospective study • Let Δ = 1 mean selection into the study • Pretend • Then compute
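
One way to fill in the two steps, as a sketch: write \Delta for the selection indicator, n_1 and n_0 for the numbers of cases and controls, and \kappa for the shifted intercept (all notational assumptions).

```latex
\begin{align*}
\text{Pretend:}\quad
  & \operatorname{pr}(\Delta = 1 \mid D = d, G, X) \propto \frac{n_d}{\operatorname{pr}(D = d)}, \\
\text{Then compute:}\quad
  & \operatorname{pr}(D = 1 \mid \Delta = 1, G, X) = H\{\kappa + m(G, X, \beta_1)\}, \\
  & \kappa = \beta_0 + \log\frac{n_1}{n_0} - \log\frac{\operatorname{pr}(D=1)}{\operatorname{pr}(D=0)},
\end{align*}
```

where H(u) = 1/(1+e^{-u}). Only the intercept changes, which is why ordinary logistic regression recovers everything except \beta_0.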

  11. Case-Control Data • Fact: all parameters except the intercept can be estimated consistently while ignoring the sampling plan • Standard errors: those computed ignoring the sampling plan are asymptotically correct
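
The claim that only the intercept is affected can be checked numerically. The sketch below assumes a hypothetical population with one discrete exposure and illustrative parameter values. It applies selection probabilities proportional to n_d / pr(D=d) and compares the log odds of disease among selected subjects with the population model.

```python
import math


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical population: X takes the values 0, 1, 2 with these probabilities.
x_probs = {0: 0.5, 1: 0.3, 2: 0.2}
beta0, beta_x = -3.0, 0.8   # illustrative true intercept and slope
n1, n0 = 500, 500           # illustrative numbers of cases and controls


def risk(x):
    """Population disease risk pr(D=1 | X=x)."""
    return expit(beta0 + beta_x * x)


pi1 = sum(p * risk(x) for x, p in x_probs.items())  # pr(D=1)
pi0 = 1.0 - pi1

# Selection probabilities proportional to n_d / pr(D=d), scaled to sum to one.
total = n1 / pi1 + n0 / pi0
mu1, mu0 = (n1 / pi1) / total, (n0 / pi0) / total


def selected_logit(x):
    """Log odds of disease among selected subjects with X=x."""
    r = risk(x)
    p = mu1 * r / (mu1 * r + mu0 * (1.0 - r))
    return logit(p)


kappa = beta0 + math.log(n1 / n0) - math.log(pi1 / pi0)
slope = selected_logit(1) - selected_logit(0)
intercept = selected_logit(0)
print(f"slope={slope:.4f} (true {beta_x}), intercept={intercept:.4f}, kappa={kappa:.4f}")
```

The slope among selected subjects equals the population slope, and the intercept equals the shifted value kappa rather than beta0.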

  12. Case-Control Data • The intercept is determined by pr(D=1) in the population, hence not identified from these data • Little Known Fact: Adding information about pr(D=1) adds no information about

  13. Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies

  14. G-E Independence: Discussion • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction • If false: possible severe bias (Albert et al., 2001; our own simulations)

  15. G-E Independence: Discussion • It is reasonable in many problems • Example: Environment is a treatment in a randomized study under nested case-control sampling • Example: Reasonable when exposure is not directly controlled by individual behavior • Radiation exposure for A-bomb survivors • Carcinogenic exposure of employees • Pesticide exposure in a rural community

  16. Generalizations • I have phrased this problem as one where G and X are independent given strata • This makes sense contextually in genetic epidemiology • All the results I will describe go through if you can write down a probability model for G given (X,S): I do this in the Israeli Study.

  17. Generalizations • If G is binary, it is natural to apply our approach • Posit a parametric or semiparametric model for G given (X,S) • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G and (X,S) interactions.

  18. Gene-Environment Independence • Rare Disease Approximation: Rare disease for all values of (G,X) • May be unreasonable for important genes such as BRCA1/2 • Case-only estimate of multiplicative interaction (Piegorsch et al., 1994)

  19. Gene-Environment Independence: Case-Only Analysis • Positive Consequence: Often much more powerful than standard analysis • Power advantage of this method often has led researchers to discard information on controls • Negative Consequence: no ability to estimate other risk parameters, which are often of greater interest (see example later) • Restrictions: Can only handle multiplicative interaction, requires rare disease in all values of (G,X)
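
The case-only idea can be illustrated on exact population probabilities. The sketch below assumes a binary gene and a binary exposure that are independent in the population, a rare disease, and hypothetical parameter values. It computes the G-X odds ratio among cases and among controls.

```python
import math


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


# Hypothetical frequencies and logistic parameters (illustrative only).
p_g, p_x = 0.1, 0.3
beta0, beta_g, beta_x, beta_gx = -6.0, 1.0, 0.5, 0.7


def risk(g, x):
    """pr(D=1 | G=g, X=x) under the logistic model with interaction."""
    return expit(beta0 + beta_g * g + beta_x * x + beta_gx * g * x)


def cell_prob(g, x, d):
    """Population probability of (G=g, X=x, D=d), with G and X independent."""
    pg = p_g if g else 1.0 - p_g
    px = p_x if x else 1.0 - p_x
    r = risk(g, x)
    return pg * px * (r if d else 1.0 - r)


def gx_odds_ratio(d):
    """G-X odds ratio within disease group d (normalising constants cancel)."""
    return (cell_prob(1, 1, d) * cell_prob(0, 0, d)) / (
        cell_prob(1, 0, d) * cell_prob(0, 1, d)
    )


case_only_log_or = math.log(gx_odds_ratio(1))
control_log_or = math.log(gx_odds_ratio(0))
print(f"case-only log OR: {case_only_log_or:.4f} (interaction {beta_gx})")
print(f"control log OR:   {control_log_or:.4f}")
```

The difference between the two log odds ratios equals the interaction exactly. The case-only method replaces the control term by zero, which is accurate only when G and X are independent and the disease is rare in every (G,X) cell.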

  20. Gene-Environment Independence • Fact: gain in power for inference about a multiplicative interaction • Consequence: There is thus (Fisher) information in the assumption • Conjecture: Can handle general models and improve efficiency for all parameters • We do this via a semiparametric profile likelihood approach • We start though from a different likelihood

  21. Prentice-Pyke Calculation • Methodology: Start with the retrospective likelihood • The distribution of (X,G) in the population is left unspecified • Semiparametric MLE is usual logistic regression

  22. Environment and Gene Expression • Methodology: Start with the retrospective likelihood • Note how independence of G and X is used here, see the red expressions • We do not want to model the often multivariate distribution of X • Gene distribution model can be standard
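
A sketch of the retrospective likelihood contribution under G-X independence, ignoring strata for brevity. The symbols q(g \mid \theta) for the gene model and F for the unspecified distribution of X are notational assumptions.

```latex
\[
\operatorname{pr}(G = g, X = x \mid D = d)
  = \frac{\operatorname{pr}(D = d \mid g, x)\; q(g \mid \theta)\; dF(x)}
         {\sum_{g^*} \int \operatorname{pr}(D = d \mid g^*, x^*)\; q(g^* \mid \theta)\; dF(x^*)} .
\]
```

Independence is what allows the joint distribution of (G, X) to factor into q(g \mid \theta) and F(x), so only the gene model needs a parametric form.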

  23. Environment and Gene Expression • Methodology: Compute a profile estimate • Parametric/semiparametric distribution for G • Nonparametric distribution for X (possibly high dimensional) • Result: Explicit profile likelihood

  24. Environment and Gene Expression • Methodology: Treat the point masses of the nonparametric distribution of X as distinct parameters • Let G have parametric structure • Construct the profile likelihood, having estimated those point masses as functions of the data and the other parameters • The result is a function of the remaining parameters: this function can be calculated explicitly!

  25. Profile Likelihood • Result:
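
A sketch of the explicit form, derived here from the selection-indicator device in the talk's alternative derivation rather than copied from the slide. The symbols S, \Omega, \kappa, q(g \mid \theta), n_1 and n_0 are notational assumptions.

```latex
\begin{align*}
L(\Omega) &= \prod_{i=1}^{n}
  \frac{S(D_i, G_i, X_i, \Omega)}{\sum_{d=0}^{1} \sum_{g} S(d, g, X_i, \Omega)}, \\
S(d, g, x, \Omega) &= q(g \mid \theta)\,
  \frac{\exp[\, d\{\kappa + m(g, x, \beta_1)\}\,]}{1 + \exp\{\beta_0 + m(g, x, \beta_1)\}}, \\
\kappa &= \beta_0 + \log\frac{n_1}{n_0} - \log\frac{\operatorname{pr}(D=1)}{\operatorname{pr}(D=0)} .
\end{align*}
```

Here \Omega collects (\beta_0, \beta_1, \theta, \kappa), and n_1, n_0 are the numbers of cases and controls.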

  26. Alternative Derivation • Consider a prospective study • Let Δ = 1 mean selection into the study • Pretend • Then compute • This is exactly our profile pseudo-likelihood!

  27. Alternative Derivation • We compute: • Standard approach computes • It is this insight that allows us to greatly generalize the work past independence of G and X.
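
The contrast between the two computations can be sketched in code. The example below assumes a binary gene with carrier frequency theta, a linear predictor with a G-X interaction, and hypothetical parameter values and toy records. It evaluates the joint probability of (D, G) given X and selection, and shows that conditioning further on G gives the ordinary logistic model with a shifted intercept kappa.

```python
import math


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def m(g, x, par):
    """Illustrative risk function: main effects plus G-X interaction."""
    return par["b_g"] * g + par["b_x"] * x + par["b_gx"] * g * x


def s_term(d, g, x, par):
    """Unnormalised pr(D=d, G=g | selected, X=x) for a binary gene."""
    q = par["theta"] if g == 1 else 1.0 - par["theta"]
    lin = m(g, x, par)
    return q * math.exp(d * (par["kappa"] + lin)) / (1.0 + math.exp(par["beta0"] + lin))


def joint_prob(d, g, x, par):
    """pr(D=d, G=g | selected, X=x): one pseudo-likelihood contribution."""
    denom = sum(s_term(dd, gg, x, par) for dd in (0, 1) for gg in (0, 1))
    return s_term(d, g, x, par) / denom


def disease_prob_given_g(g, x, par):
    """pr(D=1 | selected, G=g, X=x): what the standard approach models."""
    p1, p0 = joint_prob(1, g, x, par), joint_prob(0, g, x, par)
    return p1 / (p1 + p0)


def log_pseudo_lik(data, par):
    """Sum of log contributions over (D, G, X) records."""
    return sum(math.log(joint_prob(d, g, x, par)) for d, g, x in data)


# Hypothetical parameter values and toy records, for illustration only.
par = dict(beta0=-4.0, kappa=0.1, b_g=1.0, b_x=0.5, b_gx=0.7, theta=0.1)
data = [(1, 1, 1.2), (1, 0, 0.3), (0, 0, -0.5), (0, 1, 0.0)]
value = log_pseudo_lik(data, par)
print(f"log pseudo-likelihood: {value:.4f}")
```

In practice the function would be maximised over the parameters with a numerical optimiser; that step is omitted here.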

  28. Computation • Intercept: The logistic intercept, and hence pr(D=1), is weakly identified by itself • Disease rate: If pr(D=1) is known, or a good bound for it is specified, can have significant gains in efficiency. • This does not happen for a regular case-control study

  29. Interesting Technical Point • Profile pseudo-likelihood acts like a likelihood • Information Asymptotics are (almost) exact • Missing G data handled seamlessly (see next) • Missing genotype • Unphased haplotype data

  30. Missing Data • We have a formal likelihood: • If gene is missing, suggests the formal likelihood • Result: Inference as if the data were a random sample with missing data
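
A sketch of how a missing genotype could enter, assuming S(d, g, x, \Omega) denotes the unnormalised joint probability of (D = d, G = g) given X = x and selection into the study (notation assumed, not taken from the slide):

```latex
\[
L_i(\Omega) =
\begin{cases}
\dfrac{S(D_i, G_i, X_i, \Omega)}{\sum_{d} \sum_{g} S(d, g, X_i, \Omega)}
  & \text{if } G_i \text{ is observed}, \\[2.5ex]
\dfrac{\sum_{g} S(D_i, g, X_i, \Omega)}{\sum_{d} \sum_{g} S(d, g, X_i, \Omega)}
  & \text{if } G_i \text{ is missing}.
\end{cases}
\]
```

The missing-genotype contribution is the observed-data contribution summed over the possible values of G, as in a random sample with missing data.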

  31. Measurement Error • The likelihood formulation also allows us to deal with measurement error in the environmental variables

  32. Advertisement

  33. First Simulation • MSE Efficiency of Profile method: 0.02 < pr(D=1) < 0.07

  34. Israeli Ovarian Cancer Study • Population based case-control study • Study the interplay of BRCA1/2 mutations (G) and two known risk factors (E or X) of ovarian cancer: • oral contraceptive (OC) use • parity. • Missing Data: Approximately 50% of the controls were not genotyped, and 10% of the cases

  35. Israeli Ovarian Cancer Study • Results reported in Modan et al., NEJM (2001). • Their analysis involves • Assumption that parity and OC use are independent of BRCA1/2 mutation status • Simple but approximate methods for exploiting the G and E independence assumption (including case-only estimate of interaction) • Risk model adjusted for Age, Race, Family History, History of Gynecological Surgery

  36. Israeli Ovarian Cancer Study • Disease risk model including same covariates as Modan et al (2001) • In addition, we explicitly adjusted for the possibility of both G and E being related to S • FH = family history (breast cancer = 1, ovarian or >= 2 breast cancer = 2)

  37. Israeli Ovarian Cancer Study • Question: Can carriers be protected via OC-use? • The logarithm of the odds ratio is the sum of • The main effect for OC-use • The interaction term between OC-use and being a carrier, i.e., interaction between gene and environment • Note how this involves main effects and interactions
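
The sum described here can be sketched as a small calculation. The function below combines a main effect and an interaction into an odds ratio with a Wald interval. All numbers are hypothetical placeholders, not estimates from the study, and the function name is an assumption.

```python
import math


def carrier_odds_ratio(b_x, b_gx, var_x, var_gx, cov, z=1.959964):
    """Odds ratio for exposure among carriers, with a Wald confidence interval.

    b_x: main effect of the exposure (log odds scale)
    b_gx: gene-exposure interaction (log odds scale)
    var_x, var_gx, cov: their estimated variances and covariance
    """
    log_or = b_x + b_gx
    se = math.sqrt(var_x + var_gx + 2.0 * cov)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical estimates, for illustration only.
est, lo, hi = carrier_odds_ratio(b_x=-0.20, b_gx=0.25, var_x=0.01, var_gx=0.04, cov=-0.008)
print(f"OR among carriers: {est:.3f} ({lo:.3f}, {hi:.3f})")
```

Because the calculation needs the main effect as well as the interaction, it cannot be carried out with an estimate of the interaction alone.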

  38. Israeli Ovarian Cancer Study • Question: Is there a carrier/OC interaction? • The case-only method can answer only this question

  39. Israeli Ovarian Cancer Study • Interaction of OC and BRCA1/2:

  40. Israeli Ovarian Cancer Study • Main Effect of BRCA1/2:

  41. Israeli Ovarian Cancer Study • Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) • No evidence for protective effect • Not available from case-only analysis • Length of the interval is ½ that of the usual analysis

  42. Features of the Method • Allows estimation of all parameters of logistic regression model and can be used to examine interaction in alternative scales • Can be used to estimate OR for non-rare diseases • Important for studying major genes such as BRCA1/2

  43. Features of the Method • Allows incorporation of external information on Pr(D=1) • Unlike with logistic regression in case-control studies, this information improves efficiency of estimation

  44. Colorectal Adenoma Study • PLCO Study: 772 cases, 772 controls • Three SNPs in the calcium-sensing receptor region • HWE assumed • Interest in the interaction of number of copies of one haplotype (GCG) and calcium intake from diet
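
Under Hardy-Weinberg equilibrium, the number of copies of a given haplotype follows a Binomial(2, p) distribution, which supplies a parametric model for G. A minimal sketch, with a hypothetical haplotype frequency and an assumed function name:

```python
def copies_distribution(p):
    """Distribution of the number of copies (0, 1, 2) of a haplotype under HWE.

    p is the population frequency of the haplotype.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("haplotype frequency must lie in [0, 1]")
    return {0: (1.0 - p) ** 2, 1: 2.0 * p * (1.0 - p), 2: p ** 2}


# Hypothetical frequency, for illustration only.
dist = copies_distribution(0.25)
mean_copies = sum(k * prob for k, prob in dist.items())
print(dist, mean_copies)
```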

  45. Colorectal Adenoma Study • Method #1: Write down the prospective likelihood and apply missing data techniques • A standard analysis • If ignoring the case-control sampling scheme works for ordinary logistic regression, it should work for missing haplotype regression too, right? • Wrong! Biased estimates and standard errors • Method #2: Our method

  46. Colorectal Adenoma Study

  47. Conclusions • Standard case-control (choice-based) studies • Specify a model for G given X, e.g., G-E independence in population after conditioning on strata • No assumptions made about X (high dimensional) • All parameters estimable, no rare-disease assumption • Handle missing G data • Large gains in efficiency versus usual method • Large gains in efficiency for effects of environment given the gene

  48. Conclusions • Theoretical Methods: • With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood • (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data • (This allows) generalization to any parametric model for G given X.

  49. Acknowledgment • Two graduate students have worked on this project • Iryna Lobach, Yale • Christie Spinka, U of Missouri

  50. Thanks! http://stat.tamu.edu/~carroll
