Bias Adjustment in Whole-Genome Scans

Fei Zou (fzou@bios.unc.edu), Department of Biostatistics, Carolina Center for Genome Sciences, University of North Carolina at Chapel Hill

Bias correction for estimates of genetic risk
• A number of recent papers (e.g., Garner, 2007) have observed that when genome-wide significance thresholds are applied in testing, estimates of risk are inflated (the “winner’s curse” or “Beavis effect”).
• The magnitude of odds ratio estimates may be upwardly biased, posing a challenge for replication/confirmation or extension to additional populations: follow-up studies planned around the inflated estimates will be underpowered.

Original versus Corrected Odds-Ratio Estimates for Three Published Genetic Association Studies

Some approaches have been proposed for improved estimation:
• Maximizing the conditional likelihood for genotype outcomes, given declared significance (Zöllner and Pritchard, 2007).
• Bootstrapping of genotype-phenotype values to provide an empirical correction (Sun and Bull, 2005; Yu et al., 2007).
• Both require the original data and may be computationally prohibitive.

An approximate conditional likelihood approach (Ghosh et al., 2008; Zhong and Prentice, 2008; similar results appear in the clinical trials literature)
Assume we have a parameter of interest $\beta$.
Assume that $z = \hat{\beta}/\mathrm{SE}(\hat{\beta})$, or a similar Wald-like statistic, is used to declare significance.
Declare multiple-test-corrected significance if $|z| > c$, where $c$ is the significance threshold.
Defining $\mu = \beta/\mathrm{SE}(\hat{\beta})$, we have, approximately, $z \sim N(\mu, 1)$.

Example (figure panels: z = 5.2 and z = 5.33). Using c = 5.0 (similar to a genome scan threshold, with nominal $\alpha = 5.7 \times 10^{-7}$), the desired shrinkage is apparent. If the observed z is well above the threshold, the unconditional and conditional likelihoods are similar.
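The shrinkage in this example can be reproduced numerically. The sketch below assumes the setup described for the conditional likelihood approach: z is approximately N(μ, 1) and significance is declared when |z| > c, so the conditional likelihood is the normal density of z divided by P(|z| > c). Because this truncated normal is an exponential family in μ, the conditional log-likelihood is concave and its maximizer solves E[z | |z| > c] = observed z; the code uses that fact with a simple bisection. The function names are illustrative and are not taken from the cited papers.

```python
import math


def upper_tail(x):
    """P(Z > x) for a standard normal Z, stable in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def selection_prob(mu, c):
    """P(|z| > c) when z ~ N(mu, 1)."""
    return upper_tail(c - mu) + upper_tail(c + mu)


def conditional_loglik(mu, z, c):
    """Log density of z given |z| > c, viewed as a function of mu."""
    unconditional = -0.5 * (z - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)
    return unconditional - math.log(selection_prob(mu, c))


def conditional_mean(mu, c):
    """E[z | |z| > c] when z ~ N(mu, 1); increasing in mu."""
    return mu + (normal_pdf(c - mu) - normal_pdf(c + mu)) / selection_prob(mu, c)


def conditional_mle(z, c, iterations=100):
    """Maximize the conditional likelihood by solving E[z | |z| > c] = z."""
    if abs(z) <= c:
        raise ValueError("z must exceed the threshold in absolute value")
    sign = 1.0 if z > 0 else -1.0
    target = abs(z)
    lo, hi = 0.0, target  # the solution lies strictly between 0 and |z|
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if conditional_mean(mid, c) < target:
            lo = mid
        else:
            hi = mid
    return sign * 0.5 * (lo + hi)


if __name__ == "__main__":
    for z_obs in (5.2, 5.33, 8.0):
        print(z_obs, round(conditional_mle(z_obs, 5.0), 3))
```

With c = 5, an observed z just above the threshold is pulled strongly toward zero, while a z far above the threshold is left almost unchanged.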

Thus we have defined a new “$\mu$ version” of the problem, for which we use the density of z conditional on significance ($|z| > c$) as an approximate conditional likelihood for $\mu$.
Here the m.l.e. of the conditional likelihood may not be optimal in any sense. It can be shown theoretically that no unbiased estimator of $\mu$ exists.
Estimators considered:
• Naïve estimator (equal to z)
• Conditional m.l.e.
• A low-m.s.e. estimator
• A compromise estimator

At any time we may convert back to $\beta$ using $\beta = \mu \cdot \mathrm{SE}(\hat{\beta})$. Clearly this approach can be applied in a variety of settings. In genome-wide association studies, $\beta$ might be the log odds ratio for a one-parameter genetic model, i.e., recessive, additive, or dominant action of the SNP genotype, and c will typically be in the range of 5 to 6.
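To see where a threshold in this range comes from, the sketch below converts between a two-sided nominal significance level and the threshold c on the z scale, using only the Python standard library. The figure of 500,000 tests in the usage lines is a hypothetical illustration of a Bonferroni-style correction, not a number taken from the slides.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_alpha(c):
    """Nominal two-sided significance level implied by the rule |z| > c."""
    return 2.0 * STANDARD_NORMAL.cdf(-c)


def threshold_for_alpha(alpha):
    """Threshold c such that P(|Z| > c) = alpha for a standard normal Z."""
    return -STANDARD_NORMAL.inv_cdf(alpha / 2.0)


if __name__ == "__main__":
    # c = 5.0 corresponds to a nominal level of roughly 5.7e-7.
    print(two_sided_alpha(5.0))
    # Hypothetical scan: family-wise level 0.05 spread over 500,000 tests.
    print(threshold_for_alpha(0.05 / 500_000))
```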

Alternatively, Zhong and Prentice (2008) propose the standard likelihood-ratio (LR) approach for the confidence interval.

Performance (expectation and m.s.e.) and confidence intervals, $\mu$ version
95% confidence bounds, obtained by inverting the test procedure using the conditional density for z (given significance)

Confidence interval, $\beta$ version
95% confidence bounds, obtained by inverting the test procedure using the conditional density for z (given significance)
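One way to carry out such an inversion is sketched below. It assumes z is approximately N(μ, 1), that significance means |z| > c, and that the interval is equal-tailed: the lower bound is the μ at which the conditional probability of a z at least as large as the observed one equals 0.025, and the upper bound is the μ at which the conditional probability of a z at most as large equals 0.025. The equal-tailed construction and all function names are assumptions made for illustration; the slides do not state which inversion was used.

```python
import math


def upper_tail(x):
    """P(Z > x) for a standard normal Z, stable in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def selection_prob(mu, c):
    """P(|z| > c) when z ~ N(mu, 1)."""
    return upper_tail(c - mu) + upper_tail(c + mu)


def cond_upper(z, mu, c):
    """P(Z >= z | |Z| > c) for z > c; increasing in mu."""
    return upper_tail(z - mu) / selection_prob(mu, c)


def cond_lower(z, mu, c):
    """P(Z <= z | |Z| > c) for z > c; decreasing in mu."""
    below_minus_c = upper_tail(c + mu)                    # P(Z < -c)
    between_c_and_z = upper_tail(c - mu) - upper_tail(z - mu)
    return (below_minus_c + between_c_and_z) / selection_prob(mu, c)


def bisect(func, target, lo, hi, increasing, iterations=200):
    """Solve func(mu) = target for a monotone func on [lo, hi]."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def conditional_ci(z, c, level=0.95):
    """Equal-tailed interval for mu, given that |z| > c was observed."""
    if abs(z) <= c:
        raise ValueError("z must exceed the threshold in absolute value")
    a = abs(z)
    tail = (1.0 - level) / 2.0
    span = a + 10.0  # search range wide enough to bracket both bounds
    lower = bisect(lambda m: cond_upper(a, m, c), tail, -span, span, True)
    upper = bisect(lambda m: cond_lower(a, m, c), tail, -span, span, False)
    return (lower, upper) if z > 0 else (-upper, -lower)


if __name__ == "__main__":
    print(conditional_ci(5.2, 5.0))
    print(conditional_ci(8.0, 5.0))
```

For a z barely above the threshold, the lower bound can fall below zero; for a z far above the threshold, the interval approaches the usual z ± 1.96.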

All of the performance results in the idealized $\mu$ version of the problem carry over to the realistic version of the problem.
• Ghosh et al. (2008) provide simulations under a variety of genetic models, under a “worst-case” scenario with 500 cases and 500 controls.
• Confidence interval coverage is shown to be accurate.

[Figure: confidence interval coverage relative to the 95% nominal level, for n = 1000, n = 5000, and n = 10000; dominant model, prevalence of disease = 0.01]

A related problem arises often in genomics applications:
• For example, given a SNP that is significant in a genome scan for primary phenotype 1, we may want to perform inference about its effect on secondary phenotype 2 (which is correlated with phenotype 1), e.g., Type II diabetes and obesity.
• Another example: we run a genome scan for SNP effects (G), as well as environmental (E) and G×E effects. We only care about the E and G×E effects if the SNP is declared significant for G.

Two-$\mu$ version of the problem
• Assume the corresponding z’s are bivariate normal with correlation $\rho$.
• The bias for $\mu_2$ is $\rho$ times the bias in $\mu_1$.
• The bias does not depend on $\mu_2$.
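The two bias statements can be checked with a small Monte Carlo sketch. It assumes unit-variance bivariate normal z statistics with correlation ρ, selection on the primary statistic only (|z1| > c), and the naïve estimators z1 and z2. The parameter values in the usage lines are hypothetical and chosen so that a reasonable share of draws passes the threshold.

```python
import math
import random


def simulate_secondary_bias(mu1, mu2, rho, c, n_draws=200_000, seed=1):
    """Bias of the naive estimators z1 and z2 among draws with |z1| > c."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    sum1 = sum2 = 0.0
    kept = 0
    for _ in range(n_draws):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        z1 = mu1 + e1
        if abs(z1) <= c:
            continue  # not significant for the primary phenotype
        z2 = mu2 + rho * e1 + scale * e2  # correlation rho with z1
        sum1 += z1
        sum2 += z2
        kept += 1
    return sum1 / kept - mu1, sum2 / kept - mu2, kept


if __name__ == "__main__":
    bias1, bias2, kept = simulate_secondary_bias(4.0, 1.0, 0.5, 5.0)
    print(kept, round(bias1, 3), round(bias2, 3), round(0.5 * bias1, 3))
```

Changing `mu2` while keeping the seed fixed leaves the estimated secondary bias essentially unchanged, in line with the statement that the bias does not depend on $\mu_2$.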

More generally

The correlation structure is typically estimated from the likelihood.
Obtain multivariate versions of the earlier point estimates.

Two Binary Traits
• Two dichotomous traits $Y_1$ and $Y_2$ following the Palmgren (bivariate logistic) model, which includes a parameter used to induce dependence between the traits, and a SNP having a dominant effect on each trait
• $X$ denotes the dichotomous SNP genotype
• We examined one SNP effect over the range −0.7 (OR 0.5) to 0.7 (OR 2), with a second parameter fixed at 0.3.
• Disease prevalence = 0.1 for each trait
• c = 5, MAF = 0.25
• Correlation between the estimators can be determined from the data

Dichotomous primary and continuous secondary traits
• $Y_1$: dichotomous; $Y_2$: continuous
• $X$ denotes the SNP genotype
• We examined one SNP effect over the range −0.34 (OR 0.5) to 0.34 (OR 2), with a second parameter fixed at 0.3.

Binary Y2

Quantitative Y2

The previous results were applied to the situation where the data were sampled prospectively.
• For genome-wide association studies, a much more common situation is one in which the data are sampled retrospectively based on phenotype Y1, typically case-control status.
• Problem: under the retrospective sampling scheme, the relationship between (dichotomous) Y2 and X becomes complicated and is no longer logistic (Scott and Wild, 1995), even though the relationship between Y1 and X is still logistic.
• An appropriate conditional likelihood with respect to the retrospective sampling scheme has to be considered when analyzing the secondary phenotype Y2 for retrospective data:
• Scott and Wild (1995), Lee and Scott (1997)
• Lin and Zeng (2008) for GWAS secondary phenotype analysis
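The distortion caused by retrospective sampling can be illustrated with an exact calculation on a toy model. The sketch below does not use the Palmgren model; as a simplifying assumption it lets Y1 follow a logistic model in X and lets Y2 follow a logistic model in X and Y1, with a hypothetical parameter `gamma` linking the two traits. It then compares the marginal log odds ratio of Y2 on X in the population with the one seen in a sample containing equal numbers of cases and controls for Y1. All parameter values are hypothetical.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def secondary_log_odds_ratios(a1, b1, a2, b2, gamma, carrier_freq,
                              case_fraction=0.5):
    """Return (population, case-control) log odds ratios of Y2 on X.

    Toy model: P(Y1=1 | X) = expit(a1 + b1*X),
               P(Y2=1 | X, Y1) = expit(a2 + b2*X + gamma*Y1).
    """
    p_x = {0: 1.0 - carrier_freq, 1: carrier_freq}
    p_case = {x: expit(a1 + b1 * x) for x in (0, 1)}
    prevalence = sum(p_x[x] * p_case[x] for x in (0, 1))
    # Sampling weights that turn population probabilities into
    # case-control sample probabilities.
    weight = {1: case_fraction / prevalence,
              0: (1.0 - case_fraction) / (1.0 - prevalence)}
    pop, sample = {}, {}
    for x in (0, 1):
        num_pop = den_pop = num_cc = den_cc = 0.0
        for y1 in (0, 1):
            p_x_y1 = p_x[x] * (p_case[x] if y1 == 1 else 1.0 - p_case[x])
            p_y2 = expit(a2 + b2 * x + gamma * y1)
            num_pop += p_x_y1 * p_y2
            den_pop += p_x_y1
            num_cc += weight[y1] * p_x_y1 * p_y2
            den_cc += weight[y1] * p_x_y1
        pop[x] = num_pop / den_pop      # P(Y2=1 | X=x) in the population
        sample[x] = num_cc / den_cc     # P(Y2=1 | X=x) in the sample
    return (logit(pop[1]) - logit(pop[0]),
            logit(sample[1]) - logit(sample[0]))


if __name__ == "__main__":
    # Dominant coding with MAF = 0.25 gives carrier frequency 1 - 0.75**2.
    print(secondary_log_odds_ratios(-3.2, 0.7, -1.5, 0.3, 1.5, 0.4375))
```

When `gamma` is zero the two log odds ratios coincide; when the traits are linked and the SNP affects Y1, the case-control value drifts away from the population value.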

Bias in Secondary Phenotype Analysis
Simulation setup: bivariate logistic model; P(Y1 = 1) = 0.05 and P(Y2 = 1) = 0.2; MAF = 0.25.

We directly maximize the retrospective log-likelihood: assuming that is known.

Binary Y2 (retrospective)

Binary Y2 (prospective)

Quantitative Y2 (retrospective)

Quantitative Y2 (prospective)

Gene by environment interaction
• A dichotomous trait, a single SNP having a dominant effect on the trait, an environmental exposure, and their interaction
• E is dichotomous (0, 1); G is dichotomous (0, 1)
• We examined the SNP effect over the range −0.7 (OR 0.5) to 0.7 (OR 2) and fixed the two remaining effects at 0.3 and 0.2, respectively.
• Disease prevalence = 0.01
• c = 5, MAF = 0.25, ncases = ncontrols = 500
• Correlation between the coefficients can be determined from the likelihood

[Figure panels: Gene × Environment, Environment, Gene]

The use of our approximate conditional likelihood has additional advantages. The focus on the Wald statistic means that we can:
• Use other parameterizations (not necessarily odds ratios).
• Use the approach even when covariates are fitted in the model. This is a key advantage, as correction for population stratification is often performed using covariates.
• Apply the approach to published summary tables. We need only c, $\hat{\beta}$, and $\mathrm{SE}(\hat{\beta})$. The last value, if not provided directly, can be inferred from z, the p-value, or a published odds ratio confidence interval.
• What about the extra bias in reporting only the most significant SNP in a genome-wide association study?
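As a sketch of the summary-table use case, the code below starts from a published odds ratio and its 95% confidence interval, recovers the standard error on the log scale, forms z, computes the conditional m.l.e. of μ under the assumptions z ~ N(μ, 1) and significance |z| > c, and converts back to an odds ratio via β = μ · SE. The input numbers are hypothetical, the helper names are illustrative, and the conversion and the use of the conditional m.l.e. (rather than another of the estimators) are assumptions of this sketch.

```python
import math


def upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def conditional_mean(mu, c):
    """E[z | |z| > c] when z ~ N(mu, 1)."""
    prob = upper_tail(c - mu) + upper_tail(c + mu)
    return mu + (normal_pdf(c - mu) - normal_pdf(c + mu)) / prob


def conditional_mle(z, c, iterations=100):
    """Conditional m.l.e. of mu: solve E[z | |z| > c] = z by bisection."""
    if abs(z) <= c:
        raise ValueError("z must exceed the threshold in absolute value")
    sign = 1.0 if z > 0 else -1.0
    lo, hi = 0.0, abs(z)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if conditional_mean(mid, c) < abs(z):
            lo = mid
        else:
            hi = mid
    return sign * 0.5 * (lo + hi)


def corrected_odds_ratio(odds_ratio, ci_lower, ci_upper, c, ci_z=1.959964):
    """Return (corrected OR, z, SE) from a published OR and its 95% CI."""
    beta_hat = math.log(odds_ratio)
    # A symmetric CI on the log scale has half-width ci_z * SE.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * ci_z)
    z = beta_hat / se
    mu_tilde = conditional_mle(z, c)
    return math.exp(mu_tilde * se), z, se


if __name__ == "__main__":
    # Hypothetical published result: OR 1.50, 95% CI (1.29, 1.75), c = 5.
    print(corrected_odds_ratio(1.50, 1.29, 1.75, 5.0))
```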

References – Bias correction in risk estimates
Allison DB, Fernandez JR, Heo M, Zhu S, Etzel C, Beasley TM, Amos CI (2002) Bias in estimates of quantitative-trait-locus effect in genome scans: demonstration of the phenomenon and a method-of-moments procedure for reducing bias. Am J Hum Genet 70:575-585
Garner C (2007) Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol 31:288-295
Ghosh A, Zou F, Wright FA (2008) Estimating odds ratios in genome scans: an approximate conditional likelihood approach. Am J Hum Genet 82:1064-1074
Goring HH, Terwilliger JD, Blangero J (2001) Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69:1357-1369
Rothman N, Skibola CF, Wang SS, Morgan G, Lan Q, Smith MT, Spinelli JJ, et al. (2006) Genetic variation in TNF and IL10 and risk of non-Hodgkin lymphoma: a report from the InterLymph Consortium. Lancet Oncol 7:27-38
Sun L, Bull SB (2005) Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol 28:352-367
Wald A (1943) Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society 54:426-482
Wang WY, Barratt BJ, Clayton DG, Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109-118
Yu K, Chatterjee N, Wheeler W, Li Q, Wang S, Rothman N, Wacholder S (2007) Flexible design for following up positive findings. Am J Hum Genet 81:540-551
Zhong H, Prentice RL (2008) Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics, Feb 28 2008 (Epub)
Zöllner S, Pritchard JK (2007) Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet 80:605-615

Collaborators
• Arpita Ghosh
• Fred A. Wright
