Gene-Environment Case-Control Studies

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll

Advertising • Training: We are finishing Year 08 of an NCI-funded R25T training program • http://www.stat.tamu.edu/b3nc • We train statistically and computationally oriented post-docs in the biology of nutrition and cancer • Active seminar series

Outline • Problem: Case-Control Studies with Gene-Environment relationships • Efficient formulation when genes are observed • Haplotype modeling and Robustness • Applications

Acknowledgment • This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

Software • SAS and Matlab programs are available at my web site under the software button • Examples are given in the programs • Papers are in Biometrika (2005), Genetic Epidemiology (2006), Biostatistics (2007), Biometrics (2008) and JASA (2009) • R programs are available from the NCI • http://stat.tamu.edu/~carroll

Basic Problem Formalized • Gene and Environment • Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?

Basic Problem Formalized • Gene and Environment • Question: For people carrying a particular haplotype in the VDR pathway, do higher levels of serum Vitamin D protect against prostate cancer?

Basic Problem Formalized • Gene and Environment • Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?

Prospective and Retrospective Studies • D = disease status (binary) • X = environmental variables • Smoking status • Vitamin D • Oral contraceptive use • G = gene status • Mutation or not • Multiple or single SNP • Haplotypes

Prospective and Retrospective Studies • Prospective: Classic random sampling of a population • You measure gene and environment on a cohort • You then follow up people for disease occurrence

Prospective and Retrospective Studies • Prospective Studies: • Expensive: disease states are rare, so large sample sizes needed • Time-consuming: you have to wait for disease to develop • They Exist: Framingham Heart Study, NIH-AARP Diet and Health Study, Women’s Health Initiative, etc.

Prospective and Retrospective Studies • Prospective Studies: • Daunting Task: Only very large, very expensive prospective studies can find gene-environment interactions • Data Access: Access to the Framingham Heart Study requires a university commitment to security

Prospective and Retrospective Studies • Retrospective Studies: Usually called case-control studies • Find a population of cases, i.e., people with a disease, and sample from it. • Find a population of controls, i.e., people without the disease, and sample from it.

Prospective and Retrospective Studies • Retrospective Studies: So named because the gene G and the environment X are sampled after disease status is ascertained • Microarray studies on humans: most are case-control studies • Genome Wide Association Studies (GWAS): most are case-control studies

Prospective and Retrospective Studies • Case-control Studies: • Fast: no need to wait for disease to develop • Cheap: sample sizes are much smaller • Subtle: The controls need to be representative of the population of people without the disease.

Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction as they affect development of disease

Basic Problem Formalized • 99.9999% of analyses of case-control data use logistic regression • Closely related to Fisher’s Linear Discriminant Analysis (LDA) • Difference: we want to understand what targets affect disease, not just predict disease

Logistic Regression • Logistic Function: H(x) = exp(x) / {1 + exp(x)} ≈ exp(x) when exp(x) is small • The approximation works for rare diseases
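
A minimal sketch of the rare-disease approximation, assuming the standard logistic function and an approximation of the form H(x) ≈ exp(x). The function names and the two disease probabilities are illustrative choices, not taken from the talk. The relative error of the approximation works out to exp(x), i.e., the odds of disease, so it is negligible only when disease is rare.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function H(x) = exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def relative_error_of_exp(x: float) -> float:
    """Relative error made when H(x) is replaced by exp(x)."""
    return (math.exp(x) - logistic(x)) / logistic(x)


def logit(p: float) -> float:
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical disease probabilities: one rare, one common.
for p in (0.001, 0.3):
    x = logit(p)
    print(f"pr(D=1)={p}: H(x)={logistic(x):.4f}, exp(x)={math.exp(x):.4f}, "
          f"relative error={relative_error_of_exp(x):.4f}")
```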

Prospective Models • Simplest logistic model without an interaction • The effect of having a mutation (G=1) versus not (G=0) is

Prospective Models • Simplest logistic model with an interaction • The effect of having a mutation (G=1) versus not (G=0) is
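
The two prospective models can be written out as follows. This is a sketch in assumed notation: the coefficient names \beta_0, \beta_G, \beta_X, \beta_{GX} are illustrative, H is the logistic function, and the "effect" of the mutation is expressed as an odds ratio.

```latex
\begin{align*}
\text{Without interaction:}\quad
  \Pr(D=1 \mid G, X) &= H(\beta_0 + \beta_G G + \beta_X X), \\
  \text{odds ratio for } G=1 \text{ versus } G=0 &= \exp(\beta_G); \\[4pt]
\text{With interaction:}\quad
  \Pr(D=1 \mid G, X) &= H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX} G X), \\
  \text{odds ratio for } G=1 \text{ versus } G=0 &= \exp(\beta_G + \beta_{GX} X).
\end{align*}
```

In the first model the gene effect is the same at every level of X. In the second it changes with X, and \beta_{GX} measures that change.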

Empirical Observations • Logistic regression is in every statistical package • Unfortunately, logistic regression is not efficient for understanding interactions • Much larger sample sizes are required for interactions than for just gene effects • Most gene-environment interaction case-control studies fail for this reason

Empirical Observations • Statistical Theory: There is a lovely statistical theory available • It says: ignore the fact that you have a case-control sample, and pretend you have a prospective study • It all works out: don’t worry, be happy!

Empirical Observations • Statistical Theory: Ordinary logistic regression applied to a case-control study makes no assumptions about the population distribution of (G,X) • Remember: we do not have a sample from a population, only a case-control sample • Logistic regression is robust to assumptions about the population distribution of (G,X)

Likelihood Function • The likelihood is • Note how the likelihood depends on two things: • The distribution of (X,G) in the population • The probability of disease in the population • Neither can be estimated from the case-control study
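
A sketch of the retrospective likelihood in assumed notation (the indices d and i, and the sample sizes n_1 cases and n_0 controls, are labels chosen for illustration). It follows from Bayes' rule, since (G, X) are sampled conditional on disease status.

```latex
\[
L = \prod_{d=0}^{1} \prod_{i=1}^{n_d} \Pr(G_{di}, X_{di} \mid D = d),
\qquad
\Pr(G = g, X = x \mid D = d)
  = \frac{\Pr(D = d \mid g, x)\, \Pr(G = g, X = x)}{\Pr(D = d)} .
\]
```

The numerator contains the population distribution of (G, X) and the denominator contains the population probability of disease, which are the two ingredients named on the slide.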

When G is observed • Logistic regression is thus robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions

Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies

G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

Gene-Environment Independence • If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • This is NOT TRUE for prospective studies, only true for retrospective studies.

Gene-Environment Independence • The reason is that you are putting a constraint on the retrospective likelihood
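
One classical way to see what the independence constraint buys is the case-only calculation, sketched here. This is not the semiparametric method of this talk. It assumes binary G and X, G-E independence in the population, and hypothetical coefficient values. Under those assumptions the G-X odds ratio among cases alone is close to the interaction odds ratio when the disease is rare, and it drifts away when the disease is common.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def case_only_odds_ratio(beta0, beta_g, beta_x, beta_gx, p_g, p_x):
    """Exact G-X odds ratio among cases when G and X are independent
    in the population (all parameter values are hypothetical)."""

    def joint_in_cases(g: int, x: int) -> float:
        # Unnormalised pr(G=g, X=x | D=1) = pr(D=1 | g, x) * pr(G=g) * pr(X=x)
        p_disease = logistic(beta0 + beta_g * g + beta_x * x + beta_gx * g * x)
        p_gene = p_g if g == 1 else 1.0 - p_g
        p_env = p_x if x == 1 else 1.0 - p_x
        return p_disease * p_gene * p_env

    return (joint_in_cases(1, 1) * joint_in_cases(0, 0)) / (
        joint_in_cases(1, 0) * joint_in_cases(0, 1)
    )


true_interaction_or = 2.0
beta_gx = math.log(true_interaction_or)

# Rare disease: baseline risk is roughly 0.1%.
rare = case_only_odds_ratio(-7.0, 0.5, 0.3, beta_gx, p_g=0.05, p_x=0.3)
# Common disease: baseline risk is roughly 38%.
common = case_only_odds_ratio(-0.5, 0.5, 0.3, beta_gx, p_g=0.05, p_x=0.3)

print(f"true interaction OR = {true_interaction_or}")
print(f"case-only OR, rare disease   = {rare:.4f}")
print(f"case-only OR, common disease = {common:.4f}")
```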

Gene-Environment Independence • Our Methodology: Is far more general than assuming that genetic status and environment are independent • We have developed capacity for modeling the distribution of genetic status given strata and environmental factors • I will skip this and just pretend G-E independence here

More Efficiency, G Observed • Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G-E interactions.

The Formulation • Any logistic model works • Question: What methods do we have to construct estimators?

Methodology • I won’t give you the full methodology, but it works as follows. • Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people

Pretend Missing Data Formulation • Suppose you have a large but finite population of size N • Then, there are N pr(D=1) with the disease • There are N pr(D=0) without the disease

Pretend Missing Data Formulation • In a case-control sample, we randomly select n1 with the disease, and n0 without. • The fraction of people with disease status D=d that we observe is n_d / {N pr(D=d)}

Pretend Missing Data Formulation • Pretend you randomly sample a population • You observe a person who has D=d, and that person's (G,X), with the probability n_d / {N pr(D=d)} • Statisticians know how to deal with missing data, e.g., compute probabilities for what you actually see

Pretend Missing Data Formulation • In this pretend missing data formulation, ordinary logistic regression is simply • We have a model for G given X, hence we compute
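
The pretend missing data argument can be checked numerically. The sketch below assumes a selection fraction of n_d / {N pr(D=d)} for disease status d, and it treats pr(D=1) as known purely for illustration. The population size, sample sizes and function names are hypothetical. It shows that the probability of disease among the people actually observed is again logistic, with the intercept shifted by the log ratio of the selection fractions and the other coefficients unchanged.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def selection_fractions(N: int, p_disease: float, n1: int, n0: int):
    """Fractions of diseased and non-diseased people that enter the sample."""
    pi1 = n1 / (N * p_disease)
    pi0 = n0 / (N * (1.0 - p_disease))
    return pi1, pi0


def prob_disease_given_selected(eta: float, pi1: float, pi0: float) -> float:
    """pr(D=1 | G, X, selected) by Bayes' rule; eta is the linear predictor
    of the population logistic model."""
    p = logistic(eta)
    return pi1 * p / (pi1 * p + pi0 * (1.0 - p))


# Hypothetical numbers: 1% disease prevalence, 500 cases and 500 controls.
pi1, pi0 = selection_fractions(N=1_000_000, p_disease=0.01, n1=500, n0=500)
intercept_shift = math.log(pi1 / pi0)

for eta in (-6.0, -4.0, -2.0):
    direct = prob_disease_given_selected(eta, pi1, pi0)
    shifted = logistic(eta + intercept_shift)
    print(f"eta={eta}: Bayes' rule={direct:.6f}, shifted logistic={shifted:.6f}")
```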

Methodology • Our method has an explicit form, i.e., no integrals or anything nasty • It is easy to program the method to estimate the logistic model • It is likelihood based. Technically, a semiparametric profile likelihood

Methodology • We can handle missing gene data • We can handle error in genotyping • We can handle measurement errors in environmental variables, e.g., diet

Methodology • Our method results in much more efficient statistical inference

More Data • What does more efficient statistical inference mean? • It means, effectively, that you have more data • In cases where G is a simple mutation, our method is typically equivalent to having 3 times more data
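
A small arithmetic sketch of how "more data" can be quantified, assuming the usual large-sample rule that the variance of an estimator is inversely proportional to sample size. The standard errors below are made-up numbers for illustration.

```python
import math


def effective_sample_multiplier(se_standard: float, se_new: float) -> float:
    """How many times more data the standard analysis would need to match
    the new method, assuming variance is proportional to 1 / sample size."""
    return (se_standard / se_new) ** 2


# A method with half the standard error is worth 4 times the data.
print(effective_sample_multiplier(se_standard=0.20, se_new=0.10))

# Conversely, "3 times more data" means the standard error shrinks
# by a factor of 1 / sqrt(3).
se_ratio_for_three_times = 1.0 / math.sqrt(3.0)
print(round(se_ratio_for_three_times, 3))
```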

How much more data: Typical Simulation Example • The increase in effective sample size when using our methodology

Real Data Complexities • The Israeli Ovarian Cancer Study • G = BRCA1/2 mutation (very deadly) • X includes • age, • ethnic status (below), • parity, • oral contraceptive use • Family history • Smoking • Etc.

Real Data Complexities • In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases • Also, among Jewish citizens, Israel has two dominant ethnic types • Ashkenazi (European) • Sephardic (North African)

Real Data Complexities • The gene mutation BRCA1/2 is frequent among the Ashkenazi, but rare among the Sephardic • Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X • Gene-Environment independence fails here • What can be done? Model pr(G=1 | X) as binary with different probabilities!

Israeli Ovarian Cancer Study • Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?

Typical Empirical Example

Israeli Ovarian Cancer Study • Main Effect of BRCA1/2:

Israeli Ovarian Cancer Study • Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) • No evidence for protective effect • Not available from case-only analysis • Length of interval is ½ the length of the usual analysis

Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}
