Gene-Environment Case-Control Studies. Raymond J. Carroll, Department of Statistics, Center for Statistical Bioinformatics, Institute for Applied Mathematics and Computational Science, Texas A&M University. http://stat.tamu.edu/~carroll
Advertising • Training: We are finishing Year 08 of an NCI-funded R25T training program • http://www.stat.tamu.edu/b3nc • We train statistically and computationally oriented post-docs in the biology of nutrition and cancer • Active seminar series
Outline • Problem: Case-Control Studies with Gene-Environment relationships • Efficient formulation when genes are observed • Haplotype modeling and Robustness • Applications
Acknowledgment • This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)
Software • SAS and Matlab programs are available at my web site under the software button • Examples are given in the programs • Papers are in Biometrika (2005), Genetic Epidemiology (2006), Biostatistics (2007), Biometrics (2008) and JASA (2009) • R programs are available from the NCI • http://stat.tamu.edu/~carroll
Basic Problem Formalized • Gene and Environment • Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?
Basic Problem Formalized • Gene and Environment • Question: For people carrying a particular haplotype in the VDR pathway, do higher levels of serum Vitamin D protect against prostate cancer?
Basic Problem Formalized • Gene and Environment • Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?
Prospective and Retrospective Studies • D = disease status (binary) • X = environmental variables • Smoking status • Vitamin D • Oral contraceptive use • G = gene status • Mutation or not • Multiple or single SNP • Haplotypes
Prospective and Retrospective Studies • Prospective: Classic random sampling of a population • You measure gene and environment on a cohort • You then follow up people for disease occurrence
Prospective and Retrospective Studies • Prospective Studies: • Expensive: disease states are rare, so large sample sizes needed • Time-consuming: you have to wait for disease to develop • They Exist: Framingham Heart Study, NIH-AARP Diet and Health Study, Women’s Health Initiative, etc.
Prospective and Retrospective Studies • Prospective Studies: • Daunting Task: Only very large, very expensive prospective studies can find gene-environment interactions • Data Access: Access to the Framingham Heart Study requires a university commitment to security
Prospective and Retrospective Studies • Retrospective Studies: Usually called case-control studies • Find a population of cases, i.e., people with a disease, and sample from it. • Find a population of controls, i.e., people without the disease, and sample from it.
Prospective and Retrospective Studies • Retrospective Studies: So called because the gene G and the environment X are sampled after disease status is ascertained • Microarray studies on humans: most are case-control studies • Genome Wide Association Studies (GWAS): most are case-control studies
Prospective and Retrospective Studies • Case-control Studies: • Fast: no need to wait for disease to develop • Cheap: sample sizes are much smaller • Subtle: The controls need to be representative of the population of people without the disease.
Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction as they affect development of disease
Basic Problem Formalized • 99.9999% of analyses of case-control data use logistic regression • Closely related to Fisher’s Linear Discriminant Analysis (LDA) • Difference: we want to understand what targets affect disease, not just predict disease
Logistic Regression • Logistic Function: • The approximation works for rare diseases
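For reference, the standard logistic function and the rare-disease approximation mentioned on this slide can be written as below. The symbol H is an assumed name for the logistic function.

```latex
H(x) = \frac{\exp(x)}{1 + \exp(x)}, \qquad
H(x) \approx \exp(x) \quad \text{when } \exp(x) \ll 1 .
```

When the disease is rare, H(x) is small for every covariate value, which is exactly the regime where exp(x) is small and the approximation holds.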
Prospective Models • Simplest logistic model without an interaction • The effect of having a mutation (G=1) versus not (G=0) is
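A minimal sketch of this model, assuming a single environmental variable X, a binary G, and hypothetical coefficient names β0, βX and βG, with H(x) = exp(x) / {1 + exp(x)}:

```latex
\mathrm{pr}(D = 1 \mid G, X) = H(\beta_0 + \beta_X X + \beta_G G), \qquad
\frac{\mathrm{odds}(D = 1 \mid G = 1, X)}{\mathrm{odds}(D = 1 \mid G = 0, X)} = \exp(\beta_G).
```

Here odds(D = 1 | G, X) = pr(D = 1 | G, X) / pr(D = 0 | G, X). Without an interaction, the odds ratio for the mutation is the same at every level of X.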
Prospective Models • Simplest logistic model with an interaction • The effect of having a mutation (G=1) versus not (G=0) is
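A minimal sketch of the interaction model, under the same assumptions (a single X, a binary G, hypothetical coefficient names, and H the logistic function):

```latex
\mathrm{pr}(D = 1 \mid G, X) = H(\beta_0 + \beta_X X + \beta_G G + \beta_{GX}\, G X), \qquad
\frac{\mathrm{odds}(D = 1 \mid G = 1, X)}{\mathrm{odds}(D = 1 \mid G = 0, X)} = \exp(\beta_G + \beta_{GX} X).
```

With an interaction, the odds ratio for the mutation changes with X, and β_GX is the parameter that gene-environment studies are trying to estimate.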
Empirical Observations • Logistic regression is in every statistical package • Unfortunately, logistic regression is not efficient for understanding interactions • Much larger sample sizes are required for interactions than for just gene effects • Most gene-environment interaction case-control studies fail for this reason
Empirical Observations • Statistical Theory: There is a lovely statistical theory available • It says: ignore the fact that you have a case-control sample, and pretend you have a prospective study • It all works out: don’t worry, be happy!
Empirical Observations • Statistical Theory: Ordinary logistic regression applied to a case-control study makes no assumptions about the population distribution of (G,X) • Remember: we do not have a sample from a population, only a case-control sample • Logistic regression is robust to assumptions about the population distribution of (G,X)
Likelihood Function • The likelihood is • Note how the likelihood depends on two things: • The distribution of (X,G) in the population • The probability of disease in the population • Neither can be estimated from the case-control study
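One standard way to write the retrospective likelihood, given as a sketch: f(g, x) is an assumed name for the joint distribution of (G, X) in the population, and n0 and n1 are the numbers of sampled controls and cases.

```latex
L = \prod_{i=1}^{n_0 + n_1} \mathrm{pr}(G_i, X_i \mid D_i), \qquad
\mathrm{pr}(G = g, X = x \mid D = d)
  = \frac{\mathrm{pr}(D = d \mid g, x)\, f(g, x)}{\mathrm{pr}(D = d)},
\qquad
\mathrm{pr}(D = d) = \sum_{g'} \int \mathrm{pr}(D = d \mid g', x')\, f(g', x')\, dx' .
```

The numerator carries the population distribution f(g, x) and the denominator is the population probability of disease status d, which are the two quantities the slide points to.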
When G is observed • Logistic regression is thus robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions
Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies
G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction
Gene-Environment Independence • If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • This is NOT TRUE for prospective studies, only true for retrospective studies.
Gene-Environment Independence • The reason is that you are putting a constraint on the retrospective likelihood
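To see why an independence constraint adds information, the sketch below uses the simpler case-only idea, not the authors' estimator. With binary G and X that are independent in the population and a rare disease, the G-X odds ratio among cases alone is close to the interaction odds ratio. All numerical values and names are hypothetical.

```python
import math

def expit(t):
    """Logistic function H(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical parameters: rare disease (very negative intercept).
beta0, beta_x, beta_g, beta_gx = -6.0, 0.3, 1.0, -0.5
pr_g1, pr_x1 = 0.1, 0.4  # assumed population frequencies of G=1 and X=1

def pr_disease(g, x):
    """pr(D=1 | G=g, X=x) under the logistic model with interaction."""
    return expit(beta0 + beta_x * x + beta_g * g + beta_gx * g * x)

def pr_gx(g, x):
    """Population pr(G=g, X=x), assuming G-E independence."""
    pg = pr_g1 if g == 1 else 1.0 - pr_g1
    px = pr_x1 if x == 1 else 1.0 - pr_x1
    return pg * px

def gx_odds_ratio(d):
    """Odds ratio between G and X among people with disease status d."""
    def cell(g, x):
        p = pr_disease(g, x)
        return (p if d == 1 else 1.0 - p) * pr_gx(g, x)
    return cell(1, 1) * cell(0, 0) / (cell(1, 0) * cell(0, 1))

log_or_cases = math.log(gx_odds_ratio(1))
log_or_controls = math.log(gx_odds_ratio(0))
print("interaction beta_gx     :", beta_gx)
print("log G-X OR among cases  :", round(log_or_cases, 4))
print("log G-X OR among controls:", round(log_or_controls, 4))
```

Under these assumptions the cases alone nearly recover the interaction, and the controls show almost no G-X association. Ordinary logistic regression cannot exploit this because it leaves the distribution of (G, X) unrestricted.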
Gene-Environment Independence • Our methodology is far more general than assuming that genetic status and environment are independent • We have developed capacity for modeling the distribution of genetic status given strata and environmental factors • I will skip this and just pretend G-E independence here
More Efficiency, G Observed • Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G-E interactions.
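As an illustration of what a genetic model supplies, the sketch below computes genotype probabilities under Hardy-Weinberg Equilibrium for a biallelic locus. The function names and the allele frequency are hypothetical, and the carrier probability assumes a dominant coding of G.

```python
def hwe_genotype_probs(q):
    """Return [pr(0 copies), pr(1 copy), pr(2 copies)] of the variant allele
    under Hardy-Weinberg Equilibrium, where q is the variant allele frequency."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]

def carrier_prob(q):
    """pr(at least one copy of the variant), i.e. pr(G=1) under a dominant coding."""
    return 1.0 - (1.0 - q) ** 2

probs = hwe_genotype_probs(0.1)  # hypothetical allele frequency
print(probs, carrier_prob(0.1))
```

Under HWE a single parameter q determines the whole genotype distribution, which is part of why adding a genetic model can sharpen the estimates.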
The Formulation • Any logistic model works • Question: What methods do we have to construct estimators?
Methodology • I won’t give you the full methodology, but it works as follows. • Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people
Pretend Missing Data Formulation • Suppose you have a large but finite population of size N • Then there are about N1 = N × pr(D=1) people with the disease • There are about N0 = N × pr(D=0) people without the disease
Pretend Missing Data Formulation • In a case-control sample, we randomly select n1 with the disease, and n0 without. • The fraction of people with disease status D=d that we observe is n_d / N_d, that is, n1 / N1 for cases and n0 / N0 for controls
Pretend Missing Data Formulation • Pretend you randomly sample a population • You observe a person who has D=d, along with their (G, X), with probability n_d / N_d • Statisticians know how to deal with missing data, e.g., compute probabilities for what you actually see
Pretend Missing Data Formulation • In this pretend missing data formulation, ordinary logistic regression is simply a model for pr(D=d | G, X, observed) • We have a model for G given X, hence we compute pr(D=d, G=g | X, observed)
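The pretend missing data formulation can be checked numerically on a toy population. The sketch below assumes binary G and X, G-E independence, and hypothetical parameter values. It verifies that conditioning on being observed only shifts the logistic intercept, and it builds the joint probability of (D, G) given X and being observed.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical model and design values.
beta0, beta_x, beta_g, beta_gx = -4.0, 0.4, 0.8, -0.3
pr_g1, pr_x1 = 0.2, 0.5           # assumed population frequencies
N, n1, n0 = 1_000_000, 500, 500   # population size, sampled cases and controls

def linear_predictor(g, x):
    return beta0 + beta_x * x + beta_g * g + beta_gx * g * x

def pr_disease(g, x):
    return expit(linear_predictor(g, x))

def pr_g(g):
    return pr_g1 if g == 1 else 1.0 - pr_g1

def pr_x(x):
    return pr_x1 if x == 1 else 1.0 - pr_x1

cells = [(g, x) for g in (0, 1) for x in (0, 1)]
pi1 = sum(pr_disease(g, x) * pr_g(g) * pr_x(x) for g, x in cells)  # pr(D=1)
N1, N0 = N * pi1, N * (1.0 - pi1)   # expected numbers with and without disease
s1, s0 = n1 / N1, n0 / N0           # observed fractions n_d / N_d

def pr_disease_given_observed(g, x):
    """pr(D=1 | G=g, X=x, observed): what ordinary logistic regression fits."""
    p = pr_disease(g, x)
    return s1 * p / (s1 * p + s0 * (1.0 - p))

# Only the intercept changes: kappa = beta0 + log(n1/n0) - log(pi1/(1-pi1)).
kappa = beta0 + math.log(n1 / n0) - math.log(pi1 / (1.0 - pi1))
shift_errors = [
    abs(logit(pr_disease_given_observed(g, x))
        - (kappa + linear_predictor(g, x) - beta0))
    for g, x in cells
]

def pr_dg_given_x_observed(d, g, x):
    """pr(D=d, G=g | X=x, observed), using the model for G given X."""
    def weight(dd, gg):
        p = pr_disease(gg, x)
        return (s1 * p if dd == 1 else s0 * (1.0 - p)) * pr_g(gg)
    total = sum(weight(dd, gg) for dd in (0, 1) for gg in (0, 1))
    return weight(d, g) / total

joint_total = sum(pr_dg_given_x_observed(d, g, 1) for d in (0, 1) for g in (0, 1))
print("max intercept-shift error:", max(shift_errors))
```

The slopes, including the interaction, are untouched by the sampling; only the intercept absorbs n1/n0 and the population disease odds. This sketch fixes pr(G=1) by hand, whereas a real analysis would estimate it along with the regression parameters.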
Methodology • Our method has an explicit form, i.e., no integrals or anything nasty • It is easy to program the method to estimate the logistic model • It is likelihood based. Technically, a semiparametric profile likelihood
Methodology • We can handle missing gene data • We can handle error in genotyping • We can handle measurement errors in environmental variables, e.g., diet
Methodology • Our method results in much more efficient statistical inference
More Data • What does more efficient statistical inference mean? • It means, effectively, that you have more data • When G is a simple mutation, our method is typically equivalent to having 3 times more data
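The link between efficiency and "more data" can be made concrete under the usual large-sample assumption that standard errors shrink like 1/sqrt(n). The helper names below are hypothetical, and "3 times more data" is read as a sample size multiplier of 3.

```python
import math

def interval_length_ratio(sample_size_multiplier):
    """Ratio of new to old confidence interval length when the effective
    sample size is multiplied by the given factor (assumes se ~ 1/sqrt(n))."""
    return 1.0 / math.sqrt(sample_size_multiplier)

def sample_size_multiplier(length_ratio):
    """Effective sample size multiplier implied by a ratio of interval lengths."""
    return 1.0 / length_ratio ** 2

ratio_for_3x = interval_length_ratio(3.0)       # about 0.58
multiplier_for_half = sample_size_multiplier(0.5)
print(round(ratio_for_3x, 3), multiplier_for_half)
```

Under this assumption, three times the data shortens intervals to roughly 58% of their original length, and an interval half as long corresponds to four times the data.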
How much more data: Typical Simulation Example • The increase in effective sample size when using our methodology
Real Data Complexities • The Israeli Ovarian Cancer Study • G = BRCA1/2 mutation (very deadly) • X includes • age, • ethnic status (below), • parity, • oral contraceptive use • Family history • Smoking • Etc.
Real Data Complexities • In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases • Also, among Jewish citizens, Israel has two dominant ethnic types • Ashkenazi (European) • Sephardic (North African)
Real Data Complexities • The gene mutation BRCA1/2 is frequent among the Ashkenazi, but rare among the Sephardic • Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X • Gene-Environment independence fails here • What can be done? Model pr(G=1 | X) as binary with different probabilities!
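One way to write such a model, as a sketch: E is an assumed indicator of ethnic status taken from X, and θ_A and θ_S are hypothetical names for the two carrier probabilities.

```latex
\mathrm{pr}(G = 1 \mid X) =
\begin{cases}
\theta_A, & E = \text{Ashkenazi}, \\
\theta_S, & E = \text{Sephardic},
\end{cases}
\qquad 0 < \theta_S, \theta_A < 1 .
```

Under this sketch, G is assumed independent of the remaining components of X within each ethnic stratum, so independence is imposed only after conditioning on strata.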
Israeli Ovarian Cancer Study • Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?
Israeli Ovarian Cancer Study • Main Effect of BRCA1/2:
Israeli Ovarian Cancer Study • Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) • No evidence for protective effect • Not available from case-only analysis • The confidence interval is ½ the length of the one from the usual analysis
Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}
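The definitions above can be sketched in code. The representation below (haplotypes as tuples of alleles, a diplotype as an unordered pair) and all function names are assumptions for illustration. It also shows the standard fact that genotyping each site separately does not reveal which alleles travel together, so more than one diplotype can be compatible with the same unphased genotype.

```python
from itertools import product

def diplotype(h_mother, h_father):
    """Unordered pair of haplotypes; sorting removes parent-of-origin order."""
    return tuple(sorted((tuple(h_mother), tuple(h_father))))

def unphased_genotype(dip):
    """Per-site allele pairs, which is what site-by-site genotyping reports."""
    h1, h2 = dip
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))

def compatible_diplotypes(genotype):
    """All diplotypes that would produce the given unphased genotype."""
    dips = set()
    # At each site, either allele may sit on the first haplotype.
    for choice in product(*[(pair, pair[::-1]) for pair in genotype]):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        dips.add(diplotype(h1, h2))
    return sorted(dips)

h_m = ("A", "B")   # haplotype from the mother
h_f = ("a", "b")   # haplotype from the father
dip = diplotype(h_m, h_f)
geno = unphased_genotype(dip)
candidates = compatible_diplotypes(geno)
print(geno)
print(candidates)
```

For a person heterozygous at both sites, the true diplotype {(A,B), (a,b)} and the alternative {(A,b), (a,B)} give the same unphased genotype.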