Gene-Environment Case-Control Studies

Gene-Environment Case-Control Studies. Raymond J. Carroll, Department of Statistics, Center for Statistical Bioinformatics, Institute for Applied Mathematics and Computational Science, Texas A&M University. http://stat.tamu.edu/~carroll


Presentation Transcript


  1. Gene-Environment Case-Control Studies • Raymond J. Carroll • Department of Statistics • Center for Statistical Bioinformatics • Institute for Applied Mathematics and Computational Science • Texas A&M University • http://stat.tamu.edu/~carroll

  2. Advertising • Training: We are finishing Year 08 of an NCI-funded R25T training program • http://www.stat.tamu.edu/b3nc • We train statistically and computationally oriented post-docs in the biology of nutrition and cancer • Active seminar series

  3. Outline • Problem: Case-Control Studies with Gene-Environment relationships • Efficient formulation when genes are observed • Haplotype modeling and Robustness • Applications

  4. Acknowledgment • This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

  5. Software • SAS and Matlab programs are available at my web site under the software button • Examples are given in the programs • Papers are in Biometrika (2005), Genetic Epidemiology (2006), Biostatistics (2007), Biometrics (2008) and JASA (2009) • R programs available from the NCI http://stat.tamu.edu/~carroll

  6. Basic Problem Formalized • Gene and Environment • Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?

  7. Basic Problem Formalized • Gene and Environment • Question: For people carrying a particular haplotype in the VDR pathway, do higher levels of serum Vitamin D protect against prostate cancer?

  8. Basic Problem Formalized • Gene and Environment • Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?

  9. Prospective and Retrospective Studies • D = disease status (binary) • X = environmental variables • Smoking status • Vitamin D • Oral contraceptive use • G = gene status • Mutation or not • Multiple or single SNP • Haplotypes

  10. Prospective and Retrospective Studies • Prospective: Classic random sampling of a population • You measure gene and environment on a cohort • You then follow up people for disease occurrence

  11. Prospective and Retrospective Studies • Prospective Studies: • Expensive: disease states are rare, so large sample sizes needed • Time-consuming: you have to wait for disease to develop • They Exist: Framingham Heart Study, NIH-AARP Diet and Health Study, Women’s Health Initiative, etc.

  12. Prospective and Retrospective Studies • Prospective Studies: • Daunting Task: Only very large, very expensive prospective studies can find gene-environment interactions • Data Access: Access to the Framingham Heart Study requires a university commitment to security

  13. Prospective and Retrospective Studies • Retrospective Studies: Usually called case-control studies • Find a population of cases, i.e., people with a disease, and sample from it. • Find a population of controls, i.e., people without the disease, and sample from it.

  14. Prospective and Retrospective Studies • Retrospective Studies: So called because the gene G and the environment X are sampled after disease status is ascertained • Microarray studies on humans: most are case-control studies • Genome Wide Association Studies (GWAS): most are case-control studies

  15. Prospective and Retrospective Studies • Case-control Studies: • Fast: no need to wait for disease to develop • Cheap: sample sizes are much smaller • Subtle: The controls need to be representative of the population of people without the disease.

  16. Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction as they affect development of disease

  17. Basic Problem Formalized • 99.9999% of analyses of case-control data use logistic regression • Closely related to Fisher’s Linear Discriminant Analysis (LDA) • Difference: we want to understand what targets affect disease, not just predict disease

  18. Logistic Regression • Logistic Function: H(x) = exp(x) / {1 + exp(x)}, which is approximately exp(x) when H(x) is small • The approximation works for rare diseases
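
A small numerical check of the rare-disease approximation may help here. The sketch below uses the standard logistic function H(x) = exp(x) / {1 + exp(x)} and compares it with exp(x); the function names and the grid of x values are illustrative choices, not taken from the talk.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function H(x) = exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def relative_error_of_exp(x: float) -> float:
    """Relative error made when H(x) is replaced by exp(x)."""
    h = logistic(x)
    return (math.exp(x) - h) / h


# Illustrative linear predictors, from a very rare to a common outcome.
for x in (-6.9, -4.6, -2.2, 0.0):
    print(
        f"x={x:5.1f}  H(x)={logistic(x):.4f}  exp(x)={math.exp(x):.4f}  "
        f"relative error={relative_error_of_exp(x):.4f}"
    )
```

Algebraically the relative error equals exp(x) itself, so the approximation is good only when the disease probability is small.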

  19. Prospective Models • Simplest logistic model without an interaction • The effect of having a mutation (G=1) versus not (G=0) is

  20. Prospective Models • Simplest logistic model with an interaction • The effect of having a mutation (G=1) versus not (G=0) is
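
The two prospective models on slides 19 and 20 can be written out as follows. This is the textbook form of a logistic model with and without a G-by-X product term; the coefficient names are assumed notation, since the slide formulas are not in the transcript.

```latex
\begin{align*}
\text{No interaction:}\quad
  & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    = \beta_0 + \beta_G G + \beta_X X, \\
  & \frac{\text{odds of disease at } G=1}{\text{odds of disease at } G=0}
    = \exp(\beta_G). \\[6pt]
\text{With interaction:}\quad
  & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    = \beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X, \\
  & \frac{\text{odds of disease at } G=1}{\text{odds of disease at } G=0}
    = \exp(\beta_G + \beta_{GX} X).
\end{align*}
```

In the second model the effect of the mutation depends on the environment X, which is what a gene-environment interaction means on this scale.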

  21. Empirical Observations • Logistic regression is in every statistical package • Unfortunately, logistic regression is not efficient for understanding interactions • Much larger sample sizes are required for interactions than for just gene effects • Most gene-environment interaction case-control studies fail for this reason

  22. Empirical Observations • Statistical Theory: There is a lovely statistical theory available • It says: ignore the fact that you have a case-control sample, and pretend you have a prospective study • It all works out: don’t worry, be happy!

  23. Empirical Observations • Statistical Theory: Ordinary logistic regression applied to a case-control study makes no assumptions about the population distribution of (G,X) • Remember: we do not have a sample from a population, only a case-control sample • Logistic regression is robust: to assumptions about the population distribution of (G,X)
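
The claim that a case-control sample can be analyzed as if it were prospective can be checked exactly in the simplest setting, a single binary gene. The sketch below is only an illustration with made-up parameter values (the intercept, log odds ratio and carrier frequency are hypothetical). It computes the odds ratio from the population joint distribution and from the retrospective distributions pr(G | D) that a case-control study actually samples.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def odds_ratio(p11: float, p10: float, p01: float, p00: float) -> float:
    """Odds ratio from a 2x2 table indexed as p[d][g]."""
    return (p11 * p00) / (p10 * p01)


# Hypothetical population values, chosen only for illustration.
beta0, beta_g = -5.0, 0.8   # intercept and log odds ratio for G
pr_g1 = 0.2                 # carrier frequency in the population

# Population joint distribution pr(D=d, G=g).
joint = {}
for g, pg in ((1, pr_g1), (0, 1.0 - pr_g1)):
    risk = logistic(beta0 + beta_g * g)
    joint[(1, g)] = risk * pg
    joint[(0, g)] = (1.0 - risk) * pg

# A case-control study samples G separately within cases and controls,
# so it sees pr(G=g | D=d) rather than the joint distribution.
pr_d = {d: joint[(d, 0)] + joint[(d, 1)] for d in (0, 1)}
retro = {(d, g): joint[(d, g)] / pr_d[d] for (d, g) in joint}

or_population = odds_ratio(joint[(1, 1)], joint[(1, 0)], joint[(0, 1)], joint[(0, 0)])
or_case_control = odds_ratio(retro[(1, 1)], retro[(1, 0)], retro[(0, 1)], retro[(0, 0)])

print(or_population, or_case_control, math.exp(beta_g))
```

Both odds ratios equal exp(beta_g), and no assumption about the distribution of G was needed. The intercept, by contrast, is not recoverable from the retrospective distributions alone.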

  24. Likelihood Function • The likelihood is • Note how the likelihood depends on two things: • The distribution of (X,G) in the population • The probability of disease in the population • Neither can be estimated from the case-control study
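
The retrospective likelihood described on this slide follows from Bayes' rule. The display below is a reconstruction under assumed notation (F for the population distribution of (G, X), subscript i for subjects), not a copy of the slide.

```latex
\begin{align*}
\mathrm{pr}(G=g, X=x \mid D=d)
  &= \frac{\mathrm{pr}(D=d \mid g, x)\,\mathrm{pr}(G=g, X=x)}{\mathrm{pr}(D=d)}, \\
\mathrm{pr}(D=d)
  &= \int \mathrm{pr}(D=d \mid g, x)\, dF(g, x), \\
L &= \prod_{i=1}^{n_0+n_1} \mathrm{pr}(G_i, X_i \mid D_i).
\end{align*}
```

The numerator contains the population distribution of (G, X) and the denominator the population probability of disease, which are the two quantities the slide points to.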

  25. When G is observed • Logistic regression is thus robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions

  26. Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies

  27. G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

  28. Gene-Environment Independence • If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • This is NOT TRUE for prospective studies, only true for retrospective studies.

  29. Gene-Environment Independence • The reason is that you are putting a constraint on the retrospective likelihood
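
One way to see how an independence constraint adds information is the classical case-only calculation, which is simpler than, and not the same as, the profile-likelihood method of this talk. If G and X are independent in the population and the disease is rare, the G-X odds ratio among cases alone is approximately exp of the interaction coefficient. The sketch below checks this exactly for binary G and X; all parameter values are hypothetical.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logistic-model coefficients for a rare disease.
beta0, beta_g, beta_x, beta_gx = -6.0, 0.5, 0.3, 0.7
# Hypothetical marginal frequencies; G and X are independent by construction.
pr_g1, pr_x1 = 0.1, 0.4

# Quantities proportional to pr(G=g, X=x | D=1).
cases = {}
for g in (0, 1):
    for x in (0, 1):
        pg = pr_g1 if g == 1 else 1.0 - pr_g1
        px = pr_x1 if x == 1 else 1.0 - pr_x1
        risk = logistic(beta0 + beta_g * g + beta_x * x + beta_gx * g * x)
        cases[(g, x)] = risk * pg * px

# G-X odds ratio computed from cases only; the marginal frequencies cancel.
case_only_or = (cases[(1, 1)] * cases[(0, 0)]) / (cases[(1, 0)] * cases[(0, 1)])

print(case_only_or, math.exp(beta_gx))
```

The case-only odds ratio recovers the interaction but says nothing about the main effects of G or X, which is one reason a full retrospective likelihood is still needed.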

  30. Gene-Environment Independence • Our Methodology: Is far more general than assuming that genetic status and environment are independent • We have developed capacity for modeling the distribution of genetic status given strata and environmental factors • I will skip this and just pretend G-E independence here

  31. More Efficiency, G Observed • Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G-E interactions.
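
As a concrete instance of a genetic model, Hardy-Weinberg Equilibrium fixes the genotype distribution at a biallelic locus through a single allele frequency. The helper below is a minimal sketch; the function name and the example frequency are illustrative.

```python
def hardy_weinberg(q: float) -> dict:
    """Genotype probabilities under Hardy-Weinberg Equilibrium.

    q is the frequency of the variant allele; keys are the number of
    variant alleles carried (0, 1 or 2).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return {0: (1.0 - q) ** 2, 1: 2.0 * q * (1.0 - q), 2: q ** 2}


# Example with a hypothetical variant-allele frequency of 10%.
probs = hardy_weinberg(0.1)
print(probs)
```

With the genotype distribution reduced to one parameter, and G assumed independent of X, the distribution of G given X has very few unknowns.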

  32. The Formulation • Any logistic model works • Question: What methods do we have to construct estimators?

  33. Methodology • I won’t give you the full methodology, but it works as follows. • Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people

  34. Pretend Missing Data Formulation • Suppose you have a large but finite population of size N • Then, there are approximately N × pr(D=1) people with the disease • There are approximately N × pr(D=0) people without the disease

  35. Pretend Missing Data Formulation • In a case-control sample, we randomly select n1 with the disease, and n0 without. • The fraction of people with disease status D=d that we observe is n_d / {N × pr(D=d)}
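
The sampling fractions are easy to compute for a toy population. The sketch below assumes a population of N people of whom a fraction pr(D=1) have the disease; the population size, disease rate and sample sizes are hypothetical numbers chosen only to show the scale of the oversampling of cases.

```python
def sampling_fraction(n_d: int, population_size: int, pr_d: float) -> float:
    """Fraction of people with disease status d who end up in the sample."""
    return n_d / (population_size * pr_d)


# Hypothetical study: 1% disease rate, 500 cases and 500 controls.
N = 1_000_000
pr_disease = 0.01
n1, n0 = 500, 500

pi1 = sampling_fraction(n1, N, pr_disease)        # fraction of cases observed
pi0 = sampling_fraction(n0, N, 1.0 - pr_disease)  # fraction of controls observed

print(pi1, pi0, pi1 / pi0)
```

In this toy setting a person with the disease is 99 times more likely to be observed than a person without it.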

  36. Pretend Missing Data Formulation • Pretend you randomly sample a population • You observe a person who has D=d, together with that person's (G, X), with probability n_d / {N × pr(D=d)} • Statisticians know how to deal with missing data, e.g., compute probabilities for what you actually see

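  37. Pretend Missing Data Formulation • In this pretend missing data formulation, ordinary logistic regression is simply the probability of D given (G, X) among the people who are observed • We have a model for G given X, hence we compute the joint probability of (D, G) given X among the people who are observed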
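
The two probabilities involved can be written out under the pretend missing data setup. The display is a sketch in assumed notation: \pi_d is the probability of observing a person with D=d, H is the logistic function, and \eta(g,x) is the linear predictor of the logistic model.

```latex
\begin{align*}
\pi_d &= \frac{n_d}{N\,\mathrm{pr}(D=d)}, \\[4pt]
\mathrm{pr}(D=1 \mid g, x, \text{observed})
  &= \frac{\pi_1 H\{\eta(g,x)\}}
          {\pi_1 H\{\eta(g,x)\} + \pi_0 \left[1 - H\{\eta(g,x)\}\right]}
   = H\{\eta(g,x) + \log(\pi_1/\pi_0)\}, \\[4pt]
\mathrm{pr}(D=d, G=g \mid x, \text{observed})
  &= \frac{\pi_d\,\mathrm{pr}(D=d \mid g, x)\,\mathrm{pr}(G=g \mid x)}
          {\sum_{d'}\sum_{g'} \pi_{d'}\,\mathrm{pr}(D=d' \mid g', x)\,\mathrm{pr}(G=g' \mid x)}.
\end{align*}
```

The second line shows why ordinary logistic regression still works: selection only shifts the intercept by \log(\pi_1/\pi_0). The third line is where a model for G given X enters.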

  38. Methodology • Our method has an explicit form, i.e., no integrals or anything nasty • It is easy to program the method to estimate the logistic model • It is likelihood based. Technically, a semiparametric profile likelihood

  39. Methodology • We can handle missing gene data • We can handle error in genotyping • We can handle measurement errors in environmental variables, e.g., diet

  40. Methodology • Our method results in much more efficient statistical inference

  41. More Data • What does more efficient statistical inference mean? • It means, effectively, that you have more data • In cases where G is a simple mutation, our method is typically equivalent to having 3 times more data
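
To translate "k times more data" into interval lengths, one can use the usual large-sample rule that standard errors shrink like one over the square root of the sample size. The sketch below assumes that rule and nothing else; the function names are illustrative.

```python
import math


def interval_length_ratio(sample_size_multiplier: float) -> float:
    """Confidence-interval length relative to the baseline analysis,
    assuming standard errors scale as 1 / sqrt(sample size)."""
    return 1.0 / math.sqrt(sample_size_multiplier)


def sample_size_multiplier(length_ratio: float) -> float:
    """Effective sample-size gain implied by a given interval-length ratio."""
    return 1.0 / length_ratio ** 2


print(interval_length_ratio(3.0))   # interval length with 3 times more data
print(sample_size_multiplier(0.5))  # data gain implied by a half-length interval
```

Under this rule, three times more data shortens intervals to about 58% of their original length, and halving an interval's length corresponds to about four times the data.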

  42. How much more data: Typical Simulation Example • The increase in effective sample size when using our methodology

  43. Real Data Complexities • The Israeli Ovarian Cancer Study • G = BRCA1/2 mutation (very deadly) • X includes • age, • ethnic status (below), • parity, • oral contraceptive use • Family history • Smoking • Etc.

  44. Real Data Complexities • In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases • Also, among Jewish citizens, Israel has two dominant ethnic types • Ashkenazi (European) • Sephardic (North African)

  45. Real Data Complexities • The gene mutation BRCA1/2 is frequent among the Ashkenazi, but rare among the Sephardic • Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X • Gene-Environment independence fails here • What can be done? Model pr(G=1 | X) as binary with different probabilities!

  46. Israeli Ovarian Cancer Study • Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?

  47. Typical Empirical Example

  48. Israeli Ovarian Cancer Study • Main Effect of BRCA1/2:

  49. Israeli Ovarian Cancer Study • Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) • No evidence for protective effect • Not available from case-only analysis • Length of interval is ½ the length of the usual analysis

  50. Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}
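
The haplotype and diplotype definitions can be made concrete with a small data-structure sketch. It also shows one reason haplotype analyses need a model: standard genotyping typically reports the unordered pair of alleles at each site without phase, so different diplotypes can produce the same observed genotype. The function names and the two-site example are illustrative assumptions.

```python
from itertools import product


def diplotype(h_mother: tuple, h_father: tuple) -> tuple:
    """A diplotype is the unordered pair of the two inherited haplotypes."""
    return tuple(sorted((h_mother, h_father)))


def unphased_genotype(dip: tuple) -> tuple:
    """Per-site unordered allele pairs, i.e. the genotype without phase."""
    h1, h2 = dip
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))


def compatible_diplotypes(genotype: tuple) -> list:
    """All diplotypes that would produce the given unphased genotype."""
    found = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(site[c] for site, c in zip(genotype, choice))
        h2 = tuple(site[1 - c] for site, c in zip(genotype, choice))
        found.add(diplotype(h1, h2))
    return sorted(found)


# Two sites: the mother transmits (A, B) and the father transmits (a, b).
h_m = ("A", "B")
h_f = ("a", "b")
dip = diplotype(h_m, h_f)
geno = unphased_genotype(dip)
options = compatible_diplotypes(geno)

print(dip)
print(geno)
print(options)
```

For this doubly heterozygous genotype there are two compatible diplotypes, {(A,B), (a,b)} and {(A,b), (a,B)}, and the genotype alone cannot tell them apart.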
