Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

EPI293: Design and analysis of gene association studies
Winter Term 2008
Lecture 5: Gene-environment, gene-gene interaction and “pathway” analyses
Peter Kraft, pkraft@hsph.harvard.edu, Bldg 2 Rm 207, 2-4271

Terwilliger & Weiss (2000) Nat Genet 26:151-157

[Figure: folate-mediated one-carbon metabolism pathway, running from dietary folate and dietary choline through homocysteine, methionine, SAM and SAH to DNA synthesis (dTMP) and DNA methylation. Labeled enzymes include GCP2, RFC1, TYMS, DHFR, MTHFD, MTHFS, FTCD, GART, ATIC, AMT, CBS, SHMT, BHMT, MTR, MTRR, MTHFR, MAT1A, SAHH and DNMT1; cofactors include B2, B6 and cobalamin. Slide courtesy of Stephanie Chiuve.]

Thomas (2005) CEBP 14:557

Guiding Principle: KISS
• “Keeping it Simple is Stupid”
  For every complex problem there is a simple, easy to understand, incorrect answer. (Quoted in Ulrich (2006) CEBP 15:827)
• “Keep it Simple, Stupid”
  Looking at genes and environmental factors marginally is informative, and is often all we can do reliably.
  Cf. Haldane’s “Defense of Beanbag Genetics”
  See letter by Ambrosone and response by Pharoah in JNCI (2007) 99:487-9

“One of the important functions of beanbag genetics is to show what kinds of numerical data are needed [to test hypotheses about genetic effects]. Their collection will be expensive [Haldane cites an example where 25,000 subjects or more would be needed]. Insofar as Professor Mayr succeeds in convincing politicians and business executives who control research funding, we will not get the data.” (Haldane, “Defense…”)
Or worse, we’ll get a lot of underpowered studies analyzed using poorly understood, ad hoc methods, which will only contribute noise and confusion. Better to know we don’t know than to think we know when we don’t.

Outline
• Gene-environment interaction background
• Study designs
• Analytic methods (case-control data)
  • “Basic Table”
  • Testing for “interaction”
  • Testing for association, incorporating possible G-E interaction
• Gene-gene interaction background
• Analytic methods
  • Simple pairwise interactions
  • Empirical Bayes hierarchical models
  • Toxicokinetic models
  • Combining information across multiple SNPs
  • Machine learning methods

“The influences of diet and diseases” might “mask” some “inborn errors of metabolism.” Furthermore, “idiosyncrasies as regards drugs” may be due to “inborn errors of metabolism.” Garrod (1902) Lancet 2:1616

“There are exactly four possibilities, shown in Table 3. The enumeration is so simple that no one has ever troubled to make it.”
JBS Haldane (1938) Heredity and Politics
[Table 3, with factors labeled X and Y, not reproduced]

[Figure: patterns of gene-environment interaction, with labeled examples: xeroderma pigmentosum, PKU, G6PD and fava beans, alpha-1 antitrypsin, sickle cell.]
Figure from Khoury et al. (1988) AJHG 42:89

110 cases / 110 controls Roberts-Thomson (1996) Lancet 347:1372

[Figure: RR by red meat servings.]
212 cases / 221 controls, PHS, men 60 years or older
Chen (1998) Cancer Res 58:3307

DEFINITION and NOTA BENE
“By interaction or effect modification we mean a variation in some measure of the effect of an exposure on disease risks across the levels of [...] a modifier. [...]”
“The definition of interaction depends on the measure of association used.”
From Thomas (2004) Oxford University Press. Emphasis added.

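Absolute risks $I_{ge}$; relative risks $RR_{ge} = I_{ge}/I_{00}$ (first index genotype, second index exposure)
• Supra- (sub-) multiplicative interaction
  • $RR_{11} / (RR_{01} \times RR_{10}) \neq 1$ (greater than 1 for supra-, less than 1 for sub-multiplicative)
• Supra- (sub-) additive interaction
  • $(I_{11} - I_{00}) - [(I_{01} - I_{00}) + (I_{10} - I_{00})] \neq 0$ (greater than 0 for supra-, less than 0 for sub-additive)
• When the null model has no obvious biological interpretation, testing for interaction may not be helpful
Thompson (1991) J Clin Epidemiol 44:221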

Simple example
G = 1 if carrier, 0 if non-carrier
E = 1 if exposed, 0 if unexposed
Risk of disease: $p_{GE} = b_0 + b_g G + b_e E + b_{ge} GE$
Log odds of disease: $\log\frac{p_{GE}}{1 - p_{GE}} = \beta_0 + \beta_g G + \beta_e E + \beta_{ge} GE$
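A small numeric sketch of the two models above. The coefficient values and helper names are illustrative assumptions, not from the lecture. It builds four cell risks that are exactly additive on the risk scale ($b_{ge} = 0$) and then computes the interaction term they imply on the log odds scale.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

def risks_additive(b0, bg, be, bge=0.0):
    """Cell risks p_GE = b0 + bg*G + be*E + bge*G*E for binary G and E."""
    return {(g, e): b0 + bg * g + be * e + bge * g * e
            for g in (0, 1) for e in (0, 1)}

def implied_risk_interaction(p):
    """b_ge implied by the four cell risks (risk scale)."""
    return p[(1, 1)] - p[(1, 0)] - p[(0, 1)] + p[(0, 0)]

def implied_logit_interaction(p):
    """beta_ge implied by the same four cell risks (log odds scale)."""
    return (logit(p[(1, 1)]) - logit(p[(1, 0)])
            - logit(p[(0, 1)]) + logit(p[(0, 0)]))

# Hypothetical values: baseline risk 5%, carrier adds 10%, exposure adds 15%.
p = risks_additive(b0=0.05, bg=0.10, be=0.15)
b_ge = implied_risk_interaction(p)
beta_ge = implied_logit_interaction(p)
print("risk-scale interaction b_ge:", b_ge)
print("log-odds-scale interaction beta_ge:", beta_ge)
```

With these made-up numbers the risk-scale interaction is zero (up to rounding) while the log odds interaction is not, which is the sense in which the presence of "interaction" depends on the scale.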

[Figure: two panels, unexposed vs. exposed, with separate lines for carriers and noncarriers. Log odds of disease (axis -3 to 0): $\beta_{ge} \neq 0$. Risk of disease (axis 0 to 0.5): $b_{ge} = 0$.]

[Figure: two panels, unexposed vs. exposed. Risk of disease (axis 0 to 0.5): $b_{ge} \neq 0$. Log odds of disease (axis -3 to 0): $\beta_{ge} = 0$.]

“It can be useful to note that the relation between individual and joint [genetic and environmental] effects can take different forms, which can depend on the biologic mechanism underlying the interaction. However [...] predicting the biologic mechanism from such epidemiologic data is difficult and perhaps not productive.”
Botto and Khoury in: Khoury et al. (2004) Oxford University Press
The standard “test for interaction” ($H_0$: $\beta_{ge} = 0$) is in fact a test for departure from a specific model of interaction (additive on the log odds scale).

[Figure: diagram relating G, E, an intermediate X, and outcome Y; plot of Y against X marking the four combinations G=0,E=0; G=1,E=0; G=0,E=1; G=1,E=1.]
Unless the relationship between G, E and X, or between X and Y, is well known, the model is unidentifiable: a given relationship between G, E and Y could be due to a non-additive relationship between G, E and X or to a non-linear relationship between X and Y.
After Thompson (1991)

“Crossover” effects are “non-removable,” i.e. monotonic transformations of scale will not eliminate the “interaction.” However, we do not have appropriate statistical methods for testing the specific null hypothesis of “no crossover interaction.”
[Figure: risk of disease (axis 0 to 0.5), unexposed vs. exposed; $b_{ge} \neq 0$.]

Study designs *Via matching on ethnicity, “genomic control”

Weinberg and Umbach (2000) Am J Epidemiol

Case-Only Analysis
Based on the genotype-exposure table in CASES.
Genotypic odds ratios for exposure from this table are equal to interaction relative risks only if genotype and exposure are not correlated in the general population, i.e. if $P(G,E) = P(G)P(E)$.
(Also have to assume a log-linear risk model: $\Pr(D \mid G,E) = a\,B_G\,C_E\,D_{G,E}$, where $B$, $C$ and $D$ equal 1 for the reference genotypes or exposures.)
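A minimal sketch of why the case-only odds ratio recovers the interaction relative risk, assuming binary G and E, the log-linear risk model above, and G-E independence in the population. All numeric values and the function name `case_only_or` are hypothetical.

```python
def case_only_or(a, B, C, D_ge, qg, qe):
    """Genotype-exposure odds ratio among cases.

    Assumes Pr(D|G,E) = a * B[g] * C[e] * D_ge[g][e] and Pr(G,E) = Pr(G)Pr(E),
    with qg = Pr(G=1) and qe = Pr(E=1).
    """
    pg = {0: 1.0 - qg, 1: qg}
    pe = {0: 1.0 - qe, 1: qe}
    # Pr(G=g, E=e | case) is proportional to Pr(D|g,e) * Pr(g) * Pr(e).
    w = {(g, e): a * B[g] * C[e] * D_ge[g][e] * pg[g] * pe[e]
         for g in (0, 1) for e in (0, 1)}
    return (w[(1, 1)] * w[(0, 0)]) / (w[(1, 0)] * w[(0, 1)])

# Hypothetical parameters; reference levels have B, C and D equal to 1.
B = {0: 1.0, 1: 1.5}                                   # genotype main effect
C = {0: 1.0, 1: 2.0}                                   # exposure main effect
D_ge = {0: {0: 1.0, 1: 1.0}, 1: {0: 1.0, 1: 1.8}}      # interaction relative risk
or_cases = case_only_or(a=0.01, B=B, C=C, D_ge=D_ge, qg=0.35, qe=0.30)
print("case-only OR:", or_cases, "interaction RR:", D_ge[1][1])
```

The baseline risk, main effects, and marginal frequencies all cancel in the cross-product ratio, leaving only `D_ge[1][1]` (up to floating-point rounding). If G and E were correlated in the population, their population odds ratio would multiply into the result and the equality would fail.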

Basic 6x2 Table Like 2x3 disease-genotype table, this presentation is “closest to the data” and makes no assumption about genetic model or how the gene and exposure jointly influence risk

Testing “interaction” (standard)
• Compare “main effects only” model to “main effects plus interaction” model
• Usually called the “test for interaction,” this is actually a test of departure from a specified model for interaction (additive on the log odds scale for logistic regression)
Say E is dichotomous (0, 1) and G is also coded 0, 1 (e.g. dominant coding). Then in SAS speak, we want to compare
  model caco = g e;
to
  model caco = g e g*e;
Tests: $OR_{11} / (OR_{10} \times OR_{01}) = 1$
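For binary G and E the interaction term can also be read directly off the 2x2x2 table of counts. The sketch below is a Wald test of the interaction coefficient in the saturated logistic model, not the likelihood ratio model comparison described on the slide; the counts and the function name are made up for illustration.

```python
import math

def interaction_wald_test(counts):
    """Wald test of multiplicative interaction from a 2x2x2 table.

    counts[(g, e)] = (cases, controls) for binary g and e; all cells must be > 0.
    Returns (OR11 / (OR10 * OR01), z statistic, two-sided p-value).
    """
    ref_cases, ref_controls = counts[(0, 0)]
    log_or = {}
    var = 0.0
    for key, (cases, controls) in counts.items():
        # Each of the eight cell counts contributes 1/n to the variance.
        var += 1.0 / cases + 1.0 / controls
        log_or[key] = math.log(cases * ref_controls / (controls * ref_cases))
    beta_ge = log_or[(1, 1)] - log_or[(1, 0)] - log_or[(0, 1)]
    z = beta_ge / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return math.exp(beta_ge), z, p_value

# Hypothetical counts: (cases, controls) in each genotype-exposure cell.
counts = {(0, 0): (100, 200), (1, 0): (60, 80),
          (0, 1): (90, 120), (1, 1): (90, 60)}
ratio, z, p_value = interaction_wald_test(counts)
print("interaction OR:", ratio, "z:", z, "p:", p_value)
```

In large samples the Wald and likelihood ratio tests usually agree closely; the closed form here only works because both factors are binary and the interaction model is saturated.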

Testing “interaction” (a little fancier)
• Often researchers are interested in departures from an additive (on the incidence scale) interaction
  • Somehow, this scale has become identified with “biologically independent effects,” although there are biologically realistic scenarios of “independent effects” that lead to a multiplicative interaction; for discussion, see Rothman & Greenland “Modern Epidemiology” and VanderWeele and Robins (2007) Epidemiology 18:329
  • This scale has direct public health relevance
• We can use a clever trick to test for non-additivity
  • $(I_{11} - I_{00}) - [(I_{01} - I_{00}) + (I_{10} - I_{00})] = 0 \iff RR_{11} = RR_{10} + RR_{01} - 1$
• This is no longer a generalized linear model
  • Can’t fit using standard logistic regression software, e.g.
  • Have to use custom code (e.g. PROC NLMIXED)
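The equivalence between additivity of risk differences and the relative risk constraint follows by dividing through by the baseline risk. This is a worked step, assuming $RR_{ge} = I_{ge}/I_{00}$ with $I_{00} > 0$:

```latex
\begin{aligned}
I_{11} - I_{00} &= (I_{01} - I_{00}) + (I_{10} - I_{00}) \\
\frac{I_{11}}{I_{00}} - 1 &= \left(\frac{I_{01}}{I_{00}} - 1\right) + \left(\frac{I_{10}}{I_{00}} - 1\right) \\
RR_{11} &= RR_{10} + RR_{01} - 1
\end{aligned}
```

In case-control data the absolute risks are not estimable, so in practice the constraint is imposed on odds ratios, which approximate relative risks when the disease is rare.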

Testing “interaction” (a little fancier)

Null model (interaction constrained to be additive on risk scale):

    proc nlmixed data=twosnp;
      parms a=0 b1=0 b2=0;
      /* linear predictor (log odds) in each G-E cell */
      if (g eq 0) and (e eq 0) then eta = a;
      if (g eq 0) and (e eq 1) then eta = a + b2;
      if (g eq 1) and (e eq 0) then eta = a + b1;
      /* additivity constraint: OR11 = OR10 + OR01 - 1 */
      if (g eq 1) and (e eq 1) then eta = a + log(exp(b1) + exp(b2) - 1);
      /* Bernoulli log likelihood for case-control status */
      ll = caco*eta - log(1 + exp(eta));
      model caco ~ general(ll);
    run;

Alternative model (interaction not constrained):

    proc nlmixed data=twosnp;
      parms a=0 b1=0 b2=0 b3=0;
      if (g eq 0) and (e eq 0) then eta = a;
      if (g eq 0) and (e eq 1) then eta = a + b2;
      if (g eq 1) and (e eq 0) then eta = a + b1;
      /* b3 is free: joint effect not constrained */
      if (g eq 1) and (e eq 1) then eta = a + b3;
      ll = caco*eta - log(1 + exp(eta));
      model caco ~ general(ll);
    run;

Compare $-2 \log L_{null} + 2 \log L_{alt}$ to chi-square with 1 d.f.

Screening for stratum-specific effects
• Is this gene associated with risk of disease in any exposure subgroup?
• Can also ask: Is this exposure associated with risk of disease among individuals with any genotype?
Compare two models:
Null: $\log\frac{p_{GE}}{1 - p_{GE}} = \beta_0 + \beta_e E$
Alternative: $\log\frac{p_{GE}}{1 - p_{GE}} = \beta_0 + \beta_e E + \beta_g G + \beta_{ge} GE$
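A minimal sketch of this model comparison for binary G and E, written in Python for illustration only. With both factors binary, each model is saturated over its own strata, so the maximized likelihoods come straight from observed proportions and no iterative fitting is needed. The alternative adds two parameters, so the statistic is referred to a chi-square with 2 d.f. The counts and function names are hypothetical.

```python
import math

def binom_loglik(cases, controls):
    """Maximized Bernoulli log likelihood in one stratum (p-hat = cases / n)."""
    n = cases + controls
    ll = 0.0
    if cases > 0:
        ll += cases * math.log(cases / n)
    if controls > 0:
        ll += controls * math.log(controls / n)
    return ll

def joint_g_ge_test(counts):
    """2-d.f. likelihood ratio test of G and GxE jointly, adjusting for E.

    counts[(g, e)] = (cases, controls) for binary g and e.
    """
    # Alternative model: one free probability per (g, e) cell.
    ll_alt = sum(binom_loglik(*counts[(g, e)]) for g in (0, 1) for e in (0, 1))
    # Null model: one free probability per exposure stratum (genotype ignored).
    ll_null = 0.0
    for e in (0, 1):
        cases = counts[(0, e)][0] + counts[(1, e)][0]
        controls = counts[(0, e)][1] + counts[(1, e)][1]
        ll_null += binom_loglik(cases, controls)
    lr = max(2.0 * (ll_alt - ll_null), 0.0)  # guard against rounding below zero
    p_value = math.exp(-lr / 2.0)            # chi-square survival function, 2 d.f.
    return lr, p_value

# Hypothetical counts in which the gene matters only among the exposed.
counts = {(0, 0): (100, 200), (1, 0): (50, 100),
          (0, 1): (80, 120), (1, 1): (70, 50)}
lr, p_value = joint_g_ge_test(counts)
print("LR statistic:", lr, "p-value:", p_value)
```

A gene with no marginal effect but a strong effect in one exposure stratum can still be picked up by this test, which is the motivation for screening with it.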

"True" alternative model
$\Pr(G,E) = \Pr(G)\Pr(E)$, with $\Pr(G) = q_g$ and $\Pr(E) = q_e$
Assumption of G-E independence is not required for validity of the G-GE test, as long as E is measured accurately!

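Power and sample size calculations
The test statistic has an asymptotic noncentral $\chi^2(\delta)$ distribution, where the noncentrality parameter $\delta$ [formula not recovered] is evaluated conditional on the ascertainment scheme.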
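Once a noncentrality parameter is in hand, power follows from the noncentral chi-square tail. The sketch below covers only 1-d.f. tests, where the noncentral chi-square is the square of a shifted standard normal; it takes $\delta$ as an input and does not compute it from design parameters. The function name is an assumption.

```python
from statistics import NormalDist

def power_1df(delta, alpha=0.05):
    """Power of a 1-d.f. chi-square test with noncentrality parameter delta."""
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)  # square root of the chi-square critical value
    shift = delta ** 0.5
    # P((Z + shift)^2 > crit^2) for standard normal Z.
    return nd.cdf(shift - crit) + nd.cdf(-shift - crit)

for delta in (0.0, 4.0, 7.85, 10.5):
    print(delta, round(power_1df(delta), 3))
```

For the 2-d.f. G-GE test the same idea applies, but the tail probability needs a general noncentral chi-square routine rather than this normal shortcut.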

[Figures, each with N=900, pg=0.35, pe=0.30, ORe=2]
• G
• G-GE
• GE
• diff(G-GE, G)

What about misclassified E?
$f_{b,q,s,d}(D,G,E) = \sum_X P_b(D \mid G,X)\, P_s(E \mid X)\, P_{d,q_e}(X \mid G)\, P_{q_g}(G)$
• b: penetrance parameters
• s: sensitivity, specificity
• q: exposure prevalence, allele frequency
• d: exposure odds ratio(s) by genotype
See also: Garcia-Closas, Thompson et al. 1998; Garcia-Closas, Rothman et al. 1999
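A sketch of the sum over the unobserved exposure X for the all-binary case. The parameterization is an assumption made for illustration: a logistic penetrance in G and X, `qe` as the prevalence of X among non-carriers, and `d` as the odds ratio for X comparing carriers to non-carriers. The function name and numeric values are hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def joint_dge(b, sens, spec, qg, qe, d):
    """Pr(D, G, E) when E is a misclassified measurement of exposure X.

    b = (b0, bg, be, bge): logistic penetrance in terms of G and X.
    sens, spec: Pr(E=1 | X=1) and Pr(E=0 | X=0).
    """
    b0, bg, be, bge = b
    f = {}
    for dis in (0, 1):
        for g in (0, 1):
            p_g = qg if g else 1.0 - qg
            p_x1 = expit(math.log(qe / (1.0 - qe)) + math.log(d) * g)
            for e in (0, 1):
                total = 0.0
                for x in (0, 1):  # sum over the unobserved exposure X
                    p_d1 = expit(b0 + bg * g + be * x + bge * g * x)
                    p_d = p_d1 if dis else 1.0 - p_d1
                    p_e1 = sens if x else 1.0 - spec
                    p_e = p_e1 if e else 1.0 - p_e1
                    p_x = p_x1 if x else 1.0 - p_x1
                    total += p_d * p_e * p_x * p_g
                f[(dis, g, e)] = total
    return f

# Hypothetical values: 80% sensitivity, 90% specificity, no G-X association.
b = (-3.0, 0.2, 0.7, 0.3)
f = joint_dge(b, sens=0.8, spec=0.9, qg=0.10, qe=0.25, d=1.0)
f_perfect = joint_dge(b, sens=1.0, spec=1.0, qg=0.10, qe=0.25, d=1.0)
print(sum(f.values()))
```

Comparing odds ratios computed from `f` with those from `f_perfect` shows how much the observed G, E and GE effects are attenuated by measurement error under the assumed parameters.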

[Figures (two slides): G-GE, GE and G; n=1200, qg=10%, qe=25%, ORe=2]

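E as an intermediate
So far we've discussed this ... [causal diagram over G, E and D; arrows not recovered]
But if: [causal diagram over G, E and D] or [causal diagram over G, E and D]
Then conditioning on E can reduce or eliminate power to detect G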

Take home message
• Which test/analysis is most appropriate will depend on the goals of the analysis
  • Are you screening for genetic (environmental) factors, allowing for possible effect modification by environmental (genetic) factors? Is E on the causal pathway from G to D?
  • Are you trying to describe the risk pattern across G-E strata? What scale is most relevant? E.g. departures from additivity on the absolute risk scale are relevant as they provide support for targeted interventions.
• It is extremely difficult to argue from observational data back to biologic mechanism… we are epidemiologists, not cell biologists
