Gene-Environment Case-Control Studies

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Faculties of Nutrition and Toxicology Texas A&M University http://stat.tamu.edu/~carroll

Outline • Problem: Case-Control Studies with Gene-Environment relationships • Efficient formulation when genes are observed • Measurement errors in environmental variables • Haplotype modeling and Robustness

Acknowledgment • This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

Acknowledgment • Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan)

Software • SAS and Matlab Programs Available at my web site under the software button • Examples are given in the programs http://stat.tamu.edu/~carroll

Some Personal History • I was born in Japan • The coffee table is still in my house

Some Personal History • My father lived in Seoul for 2 months in 1948 and 1 year in 1968 • He took many photos of sights there, especially in 1948

Joonghwa moon at Deoksugung, 1948

Joonghwa moon at Deoksugung, today

The Prices of Drinks Were Pretty Low

Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction

Prospective Models • Simplest logistic model: logit pr(D=1 | G,X) = β0 + βg G + βx X + βgx G X • General logistic model: logit pr(D=1 | G,X) = β0 + m(G,X,β1) • The function m(G,X,β1) is completely general
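A minimal sketch of the two prospective models, assuming scalar G and X. The function names and the particular choice of m (main effects plus a single G-by-X interaction) are illustrative assumptions, not taken from the slides.

```python
import math

def expit(t):
    """Logistic distribution function H(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def m_simple(g, x, beta1):
    """Illustrative m: main effects for G and X plus their interaction."""
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x

def prob_disease(g, x, beta0, beta1, m=m_simple):
    """General prospective model: pr(D=1 | G, X) = H{beta0 + m(G, X, beta1)}."""
    return expit(beta0 + m(g, x, beta1))
```

Passing a different function as `m` changes the risk model without touching the rest of the code, which is the sense in which m is left general.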

Likelihood Function • The likelihood is the retrospective one: each subject contributes pr(G,X | D) = pr(D | G,X) pr(G,X) / pr(D) • Note how the likelihood depends on two things: • The distribution of (X,G) in the population • The probability of disease in the population • Neither can be estimated from the case-control study

When G is observed • The usual choice is ordinary logistic regression • It is semiparametric efficient if nothing is known about the distribution of G, X in the population • Why semiparametric: what is unknown is the distribution of (G,X) in the population

When G is observed • Logistic regression is thus robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions

Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies

G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction • Part of this talk is to model the distribution of G given X

Gene-Environment Independence • If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • The reason is that you are putting a constraint on the retrospective likelihood

More Efficiency, G Observed • A constraint on the population is to posit a parametric or semiparametric model for G given X • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G and (X,S) interactions, where S denotes the strata

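The Formulation • In the most general semiparametric setting, we have the logistic risk model for D given (G,X), a parametric or semiparametric model for G given X, and a completely unspecified distribution for X • Question: What methods do we have to construct estimators?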

Methodology • We have developed two new ways of thinking about this problem • In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation

The Hard Way • Treat X as a discrete random variable whose mass points are the observed data points • Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.

The Hard Way • The maximization is not trivial to do correctly • Result: an explicit profile likelihood that does not involve the distribution of X

Pretend Missing Data Formulation • The following simple trick can be shown to be legitimate and semiparametric efficient • Equivalently, we compute a semiparametric profiled likelihood • Semiparametric because the distribution of X is not modeled

Pretend Missing Data Formulation • The idea is to create a “pretend” study, which is one of random sampling with missing data • We use an MAR regime. • The “pretend” study mimics the case-control study

Pretend Missing Data Formulation • Suppose you have a large but finite population of size N • Then, there are N pr(D=1) with the disease • There are N pr(D=0) without the disease

Pretend Missing Data Formulation • In a case-control sample, we randomly select n1 with the disease, and n0 without. • The fraction of people with disease status D=d that we observe is nd / {N pr(D=d)}, where nd is n1 or n0

Pretend Missing Data Formulation • Then let’s make up a “pretend” study that has random sampling with missing data • I take a random sample • I get to observe (D,X,G) when D=d with probability nd / {N pr(D=d)}, where nd is n1 or n0 • I will say that an indicator, call it Δ, equals 1 if I observe (D,X,G). Then pr(Δ=1 | D=d) is proportional to nd / pr(D=d)
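The observation probabilities in the pretend study can be sketched as follows. The function name and the numbers in the usage note are hypothetical; the only ingredients are the population size N, the disease probability in the population, and the case-control sample sizes n1 and n0.

```python
def selection_probabilities(N, p_disease, n1, n0):
    """Probability of observing a subject with D=d in the pretend study.

    Returns {d: n_d / (N * pr(D=d))}: the case-control sample sizes
    divided by the expected number of people with D=d in the population.
    """
    probs = {
        1: n1 / (N * p_disease),
        0: n0 / (N * (1.0 - p_disease)),
    }
    for d, p in probs.items():
        if not 0.0 < p <= 1.0:
            raise ValueError(f"selection probability for D={d} is {p}, not in (0, 1]")
    return probs
```

For example, `selection_probabilities(1_000_000, 0.01, 500, 500)` observes cases far more often than controls, which is how the pretend study mimics case-control sampling.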

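Pretend Missing Data Formulation • In this pretend missing data formulation, ordinary logistic regression is simply the model for D given (G,X) among those observed • We have a model for G given X, hence we compute the joint probability of (D,G) given X among those observed • This has a simple explicit form, as follows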

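Result • Define κ = β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)}, where β0 is the intercept of the prospective logistic model • This is the intercept that ordinary logistic regression actually estimates • Ordinary logistic regression only gets the slope right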
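A small sketch of the intercept that ordinary logistic regression estimates from case-control data, using the standard shift by the log sampling ratio minus the log population odds of disease. The function and argument names are illustrative.

```python
import math

def case_control_intercept(beta0, n1, n0, p_disease):
    """Intercept targeted by ordinary logistic regression under case-control sampling.

    beta0     : intercept of the prospective logistic model
    n1, n0    : numbers of sampled cases and controls
    p_disease : pr(D=1) in the population
    """
    log_sampling_ratio = math.log(n1 / n0)
    log_population_odds = math.log(p_disease / (1.0 - p_disease))
    return beta0 + log_sampling_ratio - log_population_odds
```

With equal numbers of cases and controls and a rare disease, the shift is large and positive, so the fitted intercept says little about β0 unless pr(D=1) is known.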

Result • Define • Further define

Result • Then, the semiparametric efficient profiled likelihood function is • Trivial to compute.
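The following is a hedged sketch of how such a profiled likelihood can be evaluated, not the author's SAS or Matlab code. It assumes a binary gene with pr(G=1 | X) = theta for every X (gene-environment independence) and the illustrative choice m(g,x,β1) = b_g g + b_x x + b_gx g x. Each subject contributes the probability of (D,G) given X among those observed in the pretend study, which is proportional to the helper `S` below; `S`, `theta`, and `kappa` are names assumed here, with `kappa` the intercept that ordinary logistic regression estimates.

```python
import math

def profile_loglik(data, beta0, beta1, kappa, theta):
    """Log profiled likelihood for records (d, g, x) with binary d and g.

    Assumptions (not from the slides): pr(G=1 | X) = theta for all X, and
    m(g, x, beta1) = b_g*g + b_x*x + b_gx*g*x.
    """
    b_g, b_x, b_gx = beta1

    def S(d, g, x):
        # Model for G given X, times the selection-weighted disease model.
        m = b_g * g + b_x * x + b_gx * g * x
        q = theta if g == 1 else 1.0 - theta
        return q * math.exp(d * (kappa + m)) / (1.0 + math.exp(beta0 + m))

    total = 0.0
    for d, g, x in data:
        # Normalize over all (d, g) combinations at this subject's x.
        denom = sum(S(dd, gg, x) for dd in (0, 1) for gg in (0, 1))
        total += math.log(S(d, g, x) / denom)
    return total
```

Because the result is an ordinary function of (beta0, beta1, kappa, theta), it can be handed to any general-purpose optimizer; no estimate of the distribution of X is needed.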

Result • In the rare disease case, we have the further simplification that
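One way to write out the simplification, offered as a hedged reconstruction rather than the slide's own display. Here q(g | x) denotes the model for G given X and S is a label assumed for the per-subject kernel; κ = β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)} is the intercept that ordinary logistic regression estimates. When the disease is rare, exp{β0 + m(g,x,β1)} is small for every (g,x), so the denominator of S is close to 1.

```latex
\begin{align*}
S(d,g,x) &= q(g \mid x)\,
  \frac{\exp\left[d\{\kappa + m(g,x,\beta_1)\}\right]}
       {1 + \exp\{\beta_0 + m(g,x,\beta_1)\}}
  \;\approx\; q(g \mid x)\,\exp\left[d\{\kappa + m(g,x,\beta_1)\}\right], \\
L &= \prod_{i=1}^{n}
  \frac{S(D_i,G_i,X_i)}{\sum_{d=0}^{1}\sum_{g} S(d,g,X_i)} .
\end{align*}
```

Under the approximation, β0 drops out of S and enters only through κ.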

Interesting Technical Point • Profile pseudo-likelihood acts like a likelihood • Information Asymptotics are (almost) exact

Typical Simulation Example • MSE Efficiency of Profile method compared to ordinary logistic regression

Typical Empirical Example

Consequence #1 • We have a formal likelihood: • This is also a legitimate semiparametric profile likelihood • Anything you can do with a likelihood you can do with a semiparametric profile likelihood

Consequences #2-#3 • Measurement Error in the Gene: • Handle misclassification of a covariate (the gene) as in any likelihood problem (see later) • Measurement Error in the Environment: • The structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.

Advertisement Lobach, et al., Biometrics, in press

Consequences #4-#5 • Flexible Modeling of Covariate Effects: • Modeling some components by penalized regression splines • The LASSO and other likelihood-based methods apply • Model Averaging: • Can entertain/average various risk models • Bayesian methods are asymptotically correct

Consequence #6 • Model Robustness: • One can model average/select/LASSO various models for the distribution of G given X • Main Point: Our method results in a legitimate likelihood, hence can be treated as such

Modeling the Gene • Now turn to models for the gene • Given such models, likelihood calculations can be used for model fitting • We will consider haplotypes

Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}

Haplotypes • Unfortunately, we cannot presently observe the two haplotypes • We can only observe genotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b)

Missing Haplotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b) • However, this is also consistent with a different diplotype, namely Hdip = {(am,Bm), (Af,bf)} • Note that the number of copies of the (a,b) haplotype differs in these two cases • The true diplotype, that is, the haplotype pair, is missing
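The ambiguity can be made concrete by enumerating every diplotype consistent with an unphased genotype. This is a minimal sketch; the data layout (one unordered allele pair per site) and the function name are assumptions made for illustration.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of (allele, allele) pairs, one unordered pair per site,
    e.g. [("A", "a"), ("B", "b")].
    """
    diplotypes = set()
    # At each site, either keep or swap which allele goes on the first haplotype.
    for flips in product((False, True), repeat=len(genotype)):
        h1 = tuple(p[1] if f else p[0] for p, f in zip(genotype, flips))
        h2 = tuple(p[0] if f else p[1] for p, f in zip(genotype, flips))
        diplotypes.add(tuple(sorted((h1, h2))))  # the pair is unordered
    return sorted(diplotypes)
```

For the doubly heterozygous genotype `[("A", "a"), ("B", "b")]` this returns two diplotypes, one containing the (a,b) haplotype and one not; a genotype that is heterozygous at no more than one site returns a single diplotype, so its phase is known.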

Missing Haplotypes • The likelihood in terms of the diplotype is • We observe the genotypes G • The likelihood of the observed data is

Missing Haplotypes • The likelihood of the observed data is the diplotype likelihood summed over all diplotypes consistent with the observed genotypes G • Note how easy this was: it is really the profiled semiparametric likelihood of the observed data

Haplotypes • Danyu Lin has a nice EM-based program for estimating haplotype frequencies • It accepts data in text format with SAS missing data conventions • The program is flexible, and for example it can assume Hardy-Weinberg equilibrium (HWE) http://www.bios.unc.edu/~lin/hapstat/
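To illustrate the idea of EM estimation of haplotype frequencies, here is a generic textbook-style sketch that assumes Hardy-Weinberg equilibrium. It is not Danyu Lin's program and does not reproduce its input format or options; all names and the genotype layout (one unordered allele pair per site) are assumptions.

```python
from itertools import product
from collections import defaultdict

def phasings(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for flips in product((False, True), repeat=len(genotype)):
        h1 = tuple(p[1] if f else p[0] for p, f in zip(genotype, flips))
        h2 = tuple(p[0] if f else p[1] for p, f in zip(genotype, flips))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies, assuming HWE."""
    candidates = [phasings(g) for g in genotypes]
    haplotypes = sorted({h for c in candidates for pair in c for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for c in candidates:
            # E-step: weight each phasing by its HWE probability
            # (factor 2 for a pair of distinct haplotypes).
            weights = [(1.0 if h1 == h2 else 2.0) * freq[h1] * freq[h2]
                       for h1, h2 in c]
            total = sum(weights)
            for (h1, h2), w in zip(c, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        new_freq = {h: counts[h] / (2.0 * len(genotypes)) for h in haplotypes}
        change = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if change < tol:
            break
    return freq
```

Subjects whose phase is known anchor the frequencies, and each doubly heterozygous subject is split fractionally between its possible diplotypes at every iteration.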

Haplotype Fitting • Models that assume haplotype-environment independence are straightforward to fit via EM • Danyu Lin’s program can do this as well as our SAS program • The remaining issue is how to gain robustness against deviations from this assumed independence

Robustness • We build robustness by specifying models for diplotypes given the environmental variables • We first run a program to get a preliminary estimate of haplotype frequency • We use the most frequent haplotype as a reference haplotype
