Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Faculties of Nutrition and Toxicology Texas A&M University http://stat.tamu.edu/~carroll
Outline • Problem: Case-Control Studies with Gene-Environment relationships • Efficient formulation when genes are observed • Measurement errors in environmental variables • Haplotype modeling and Robustness
Acknowledgment • This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)
Acknowledgment • Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan)
Software • SAS and Matlab Programs Available at my web site under the software button • Examples are given in the programs http://stat.tamu.edu/~carroll
Some Personal History • I was born in Japan • The coffee table is still in my house
Some Personal History • My father lived in Seoul for 2 months in 1948 and 1 year in 1968 • He took many photos of sights there, especially in 1948
Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction
Prospective Models • Simplest logistic model • General logistic model • The function m(G,X,b1) is completely general
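The two model formulas did not survive in this transcript. A sketch of what they plausibly look like, assuming H is the logistic distribution function and that b1 on the slide corresponds to \beta_1; the coefficient names \beta_0, \beta_G, \beta_X, \beta_{GX} are assumed notation, not taken from the slides:

```latex
% Simplest logistic model: main effects for G and X plus their interaction
\mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX} G X),
\qquad H(v) = \frac{1}{1 + e^{-v}}

% General logistic model
\mathrm{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\}
```

Under this notation the simplest model is the special case m(G, X, \beta_1) = \beta_G G + \beta_X X + \beta_{GX} G X.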
Likelihood Function • The likelihood is • Note how the likelihood depends on two things: • The distribution of (X,G) in the population • The probability of disease in the population • Neither can be estimated from the case-control study
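The likelihood formula is missing from this transcript. Since cases and controls are sampled conditionally on D, it is presumably the retrospective likelihood; a sketch by Bayes' rule, with F_{G,X} (density or mass function f_{G,X}) as assumed notation for the population distribution of (G,X):

```latex
\mathrm{pr}(G=g, X=x \mid D=d)
  = \frac{\mathrm{pr}(D=d \mid g, x)\, f_{G,X}(g, x)}{\mathrm{pr}(D=d)},
\qquad
\mathrm{pr}(D=d) = \int \mathrm{pr}(D=d \mid g, x)\, dF_{G,X}(g, x)
```

The two population quantities named on the slide appear explicitly: f_{G,X} in the numerator and pr(D=d) in the denominator.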
When G is observed • The usual choice is ordinary logistic regression • It is semiparametric efficient if nothing is known about the distribution of G, X in the population • Why semiparametric: what is unknown is the distribution of (G,X) in the population
When G is observed • Logistic regression is thus robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions
Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies
G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction • Part of this talk is to model the distribution of G given X
Gene-Environment Independence • If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • The reason is that you are putting a constraint on the retrospective likelihood
More Efficiency, G Observed • A constraint on the population is to posit a parametric or semiparametric model for G given X • Consequences: • More efficient estimation of G effects • Much more efficient estimation of interactions between G and (X,S), where S denotes the strata.
The Formulation • In the most general semiparametric setting, we have • Question: What methods do we have to construct estimators?
Methodology • We have developed two new ways of thinking about this problem • In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation
The Hard Way • Treat X as a discrete random variable whose mass points are the observed data points • Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.
The Hard Way • The maximization is not trivial to do correctly • Result: an explicit profile likelihood that does not involve the distribution of X
Pretend Missing Data Formulation • The following simple trick can be shown to be legitimate and semiparametric efficient • Equivalently, we compute a semiparametric profiled likelihood • Semiparametric because the distribution of X is not modeled
Pretend Missing Data Formulation • The idea is to create a “pretend” study, which is one of random sampling with missing data • We use an MAR regime. • The “pretend” study mimics the case-control study
Pretend Missing Data Formulation • Suppose you have a large but finite population of size N • Then, there are N pr(D=1) with the disease • There are N pr(D=0) without the disease
Pretend Missing Data Formulation • In a case-control sample, we randomly select n1 with the disease, and n0 without. • The fraction of people with disease status D=d that we observe is nd / {N pr(D=d)}
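Pretend Missing Data Formulation • Then let’s make up a “pretend” study, that has random sampling with missing data • I take a random sample • I get to observe (D,X,G) when D=d with probability nd / {N pr(D=d)} • I will say that R=1 if I observe (D,X,G), writing R for the observation indicator. Then pr(R=1 | D=d, X, G) = nd / {N pr(D=d)}, which depends only on D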
Pretend Missing Data Formulation • In this pretend missing data formulation, ordinary logistic regression is simply • We have a model for G given X, hence we compute • This has a simple explicit form, as follows
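The two expressions on this slide are missing from the transcript. A sketch of what they follow from, by Bayes' rule in the pretend study; the symbols R (observation indicator), \pi_d = pr(D=d), and q(g | x, \theta) (the model for G given X) are assumed notation, and G is taken to be discrete so that sums over g make sense:

```latex
% Selection probability in the pretend study
\mathrm{pr}(R=1 \mid D=d, G, X) = \frac{n_d}{N \pi_d}

% Ordinary logistic regression: disease status given (G, X) among the observed
\mathrm{pr}(D=d \mid G=g, X=x, R=1)
  = \frac{(n_d/\pi_d)\, \mathrm{pr}(D=d \mid g, x)}
         {\sum_{d'=0}^{1} (n_{d'}/\pi_{d'})\, \mathrm{pr}(D=d' \mid g, x)}

% With a model for G given X: joint law of (D, G) given X among the observed
\mathrm{pr}(D=d, G=g \mid X=x, R=1)
  = \frac{(n_d/\pi_d)\, \mathrm{pr}(D=d \mid g, x)\, q(g \mid x, \theta)}
         {\sum_{d'=0}^{1} \sum_{g'} (n_{d'}/\pi_{d'})\, \mathrm{pr}(D=d' \mid g', x)\, q(g' \mid x, \theta)}
```

The population size N cancels from both ratios, and the distribution of X never enters.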
Result • Define • This is the intercept that ordinary logistic regression actually estimates • It only gets the slope right
Result • Define • Further define
Result • Then, the semiparametric efficient profiled likelihood function is • Trivial to compute.
Result • In the rare disease case, we have the further simplification that
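The definitions on the Result slides are missing from this transcript. A reconstruction sketch, assuming the general logistic model pr(D=1 | G, X) = H{\beta_0 + m(G, X, \beta_1)} and the assumed notation \pi_d = pr(D=d), q(g | x, \theta) for the model of G given X, and \Omega for the full parameter vector:

```latex
% Intercept that ordinary logistic regression estimates
\kappa = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0)

% Building block
S(d, g, x, \Omega)
  = \frac{\exp[d\{\kappa + m(g, x, \beta_1)\}]}
         {1 + \exp\{\beta_0 + m(g, x, \beta_1)\}}\; q(g \mid x, \theta)

% Profiled log-likelihood
\mathcal{L}(\Omega)
  = \sum_{i=1}^{n_0+n_1} \log
    \frac{S(D_i, G_i, X_i, \Omega)}
         {\sum_{d=0}^{1} \sum_{g} S(d, g, X_i, \Omega)}

% Rare disease: 1 + \exp\{\beta_0 + m(g, x, \beta_1)\} \approx 1, so
S(d, g, x, \Omega) \approx \exp[d\{\kappa + m(g, x, \beta_1)\}]\; q(g \mid x, \theta)
```

The form of S comes from multiplying pr(D=d | g, x) by the selection probability n_d / (N \pi_d) and by q(g | x, \theta), then dividing out the factor n_0 / (N \pi_0) that is common to numerator and denominator.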
Interesting Technical Point • Profile pseudo-likelihood acts like a likelihood • Information Asymptotics are (almost) exact
Typical Simulation Example • MSE Efficiency of Profile method compared to ordinary logistic regression
Consequence #1 • We have a formal likelihood: • This is also a legitimate semiparametric profile likelihood • Anything you can do with a likelihood you can do with a semiparametric profile likelihood
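Because the result is a formal likelihood, it can be coded directly and handed to any optimizer. A minimal sketch in Python, not the SAS or Matlab programs mentioned earlier, under several simplifying assumptions: G is binary, X is a scalar, G is independent of X with pr(G=1) = theta, and the risk model is the simplest logistic form with main effects and one interaction. The function and argument names (`s_term`, `profile_loglik`, `kappa`, `beta0`, `beta`, `theta`) are hypothetical.

```python
import numpy as np

def s_term(d, g, x, kappa, beta0, beta, theta):
    """S(d, g, x): exp[d(kappa + m)] / (1 + exp(beta0 + m)) * q(g).

    Assumes binary g, scalar x, and G independent of X with pr(G=1) = theta.
    """
    bg, bx, bgx = beta
    m = bg * g + bx * x + bgx * g * x          # simplest logistic risk model
    q = np.where(g == 1, theta, 1.0 - theta)   # model for G (independent of X)
    return np.exp(d * (kappa + m)) / (1.0 + np.exp(beta0 + m)) * q

def profile_loglik(D, G, X, kappa, beta0, beta, theta):
    """Sum over subjects of log[ S(D_i,G_i,X_i) / sum_{d,g} S(d,g,X_i) ]."""
    D, G, X = (np.asarray(a, dtype=float) for a in (D, G, X))
    num = s_term(D, G, X, kappa, beta0, beta, theta)
    den = sum(s_term(d, g, X, kappa, beta0, beta, theta)
              for d in (0, 1) for g in (0, 1))
    return float(np.sum(np.log(num / den)))

# Usage: evaluate at one parameter value for four subjects
D = [1, 0, 1, 0]
G = [1, 0, 0, 1]
X = [0.2, -1.0, 0.5, 1.3]
value = profile_loglik(D, G, X, kappa=0.0, beta0=-3.0,
                       beta=(0.0, 0.0, 0.0), theta=0.5)
```

With all slopes zero, kappa = 0 and theta = 0.5, each of the four (d, g) cells has probability 1/4, which gives a quick sanity check on the normalization.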
Consequences #2-#3 • Measurement Error in the Gene: • Handle misclassification of a covariate (the gene) as in any likelihood problem (see later) • Measurement Error in the Environment : • The structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.
Advertisement Lobach, et al., Biometrics, in press
Consequences #4-#5 • Flexible Modeling of Covariate Effects: • Modeling some components by penalized regression splines • The LASSO and other likelihood-based methods apply • Model Averaging: • Can entertain/average various risk models • Bayesian methods are asymptotically correct
Consequence #6 • Model Robustness: • One can model average/select/LASSO various models for the distribution of G given X • Main Point: Our method results in a legitimate likelihood, hence can be treated as such
Modeling the Gene • Now turn to models for the gene • Given such models likelihood calculations can be used for model fitting • We will consider haplotypes
Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}
Haplotypes • Unfortunately, we cannot presently observe the two haplotypes • We can only observe genotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b)
Missing Haplotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b) • However, this is also consistent with a different diplotype, namely Hdip = {(am,Bm), (Af,bf)} • Note that the number of copies of the (a,b) haplotype differs in these two cases • The true diplotype (haplotype pair) is missing
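The ambiguity can be made concrete by enumerating every diplotype consistent with an unphased genotype. A small sketch in Python, assuming biallelic sites coded as the number of copies (0, 1 or 2) of one allele; the function name `compatible_diplotypes` and the 0/1 allele coding are hypothetical choices for illustration.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype: tuple with one entry per site, each 0, 1 or 2 (allele count).
    Haplotypes are tuples of 0/1 alleles, one entry per site.
    """
    haplotypes = list(product((0, 1), repeat=len(genotype)))
    pairs = set()
    for h1 in haplotypes:
        for h2 in haplotypes:
            # The pair is compatible if allele counts add up at every site
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                pairs.add(tuple(sorted((h1, h2))))  # unordered pair
    return sorted(pairs)

# Double heterozygote: the case described on the slide
double_het = compatible_diplotypes((1, 1))
```

For the double heterozygote `(1, 1)` this yields two diplotypes, `{(0,0), (1,1)}` and `{(0,1), (1,0)}`, matching the two phase assignments above; any genotype with at most one heterozygous site has exactly one compatible diplotype.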
Missing Haplotypes • The likelihood in terms of the diplotype is • We observe the genotypes G • The likelihood of the observed data is
Missing Haplotypes • The likelihood of the observed data is • Note how easy this was: it is really the profiled semiparametric likelihood of the observed data
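The two likelihoods are missing from this transcript. A sketch of their likely form, treating the diplotype as the "gene" in the profiled likelihood and summing over the diplotypes compatible with the observed genotype. The notation is assumed: R is the observation indicator of the pretend study, \mathcal{H}_g is the set of diplotypes consistent with genotype g, and S(d, h, x, \Omega) = \exp[d\{\kappa + m(h, x, \beta_1)\}] q(h \mid x, \theta) / [1 + \exp\{\beta_0 + m(h, x, \beta_1)\}].

```latex
% Likelihood contribution if the diplotype h were observed
\mathrm{pr}(D=d, H^{\mathrm{dip}}=h \mid X=x, R=1)
  = \frac{S(d, h, x, \Omega)}{\sum_{d'=0}^{1} \sum_{h'} S(d', h', x, \Omega)}

% Likelihood contribution of the observed genotype g
\mathrm{pr}(D=d, G=g \mid X=x, R=1)
  = \frac{\sum_{h \in \mathcal{H}_g} S(d, h, x, \Omega)}
         {\sum_{d'=0}^{1} \sum_{h'} S(d', h', x, \Omega)}
```

The denominator is unchanged because summing over all genotypes and their compatible diplotypes is the same as summing over all diplotypes.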
Haplotypes • Danyu Lin has a nice EM-based program for estimating haplotype frequencies • It accepts data in text format with SAS missing data conventions • The program is flexible, and for example it can assume Hardy-Weinberg equilibrium (HWE) http://www.bios.unc.edu/~lin/hapstat/
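For readers who want to see the idea behind EM estimation of haplotype frequencies, here is a generic textbook-style sketch in Python. It is not Danyu Lin's program and makes no claim about how that program works. It assumes Hardy-Weinberg equilibrium, biallelic sites coded as allele counts 0/1/2, no covariates and no missing genotypes; the function name `em_haplotype_freqs` is hypothetical.

```python
from itertools import product
from collections import defaultdict

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes under HWE."""
    n_sites = len(genotypes[0])
    haplotypes = list(product((0, 1), repeat=n_sites))
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            # Ordered haplotype pairs compatible with this genotype
            pairs = [(h1, h2) for h1 in haplotypes for h2 in haplotypes
                     if all(a + b == gi for a, b, gi in zip(h1, h2, g))]
            # E-step: posterior weight of each pair under HWE
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haplotypes}
    return freq

# Usage: five subjects of each homozygous type plus two double heterozygotes
data = [(0, 0)] * 5 + [(2, 2)] * 5 + [(1, 1)] * 2
est = em_haplotype_freqs(data)
```

In this toy data set the homozygotes establish that haplotypes (0,0) and (1,1) are common, so the EM iterations assign the double heterozygotes almost entirely to that phase.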
Haplotype Fitting • Models that assume haplotype-environment independence are straightforward to fit via EM • Danyu Lin’s program can do this as well as our SAS program • The remaining issue is how to gain robustness against deviations from this assumed independence
Robustness • We build robustness by specifying models for diplotypes given the environmental variables • We first run a program to get a preliminary estimate of haplotype frequency • We use the most frequent haplotype as a reference haplotype