Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mapping. Fei Zou, Department of Biostatistics. Email: fzou@bios.unc.edu
Outline • Introduction • Experimental crosses • Existing QTL Mapping Methods • Bayesian semi-parametric QTL Mapping • Results • Remarks and Conclusions
http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt
Overview • One gene one trait: very unlikely • The vast majority of biological traits are caused by complex polygenes • Potentially interacting with each other • Most traits have significant environmental exposure components • Potentially interacting with polygenes
Experimental Crosses • F2: parents P1 (AA) × P2 (BB) give F1 (AB); F1 × F1 gives F2 (AA, AB, BB) • Backcross (BC): F1 (AB) × P1 (AA) gives BC (AA, AB)
QTL Data Format • Genotype coding: 0 = homozygous AA, 1 = heterozygous AB, 2 = homozygous BB • Marker positions are recorded alongside the genotypes
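The coding above can be illustrated with a small sketch. The marker positions, genotypes and trait values below are made-up toy values, and the helper names `decode` and `genotype_counts` are hypothetical; only the 0/1/2 coding comes from the slide.

```python
# Toy data set (hypothetical values): 4 individuals genotyped at 3 markers.
# Coding follows the slide: 0 = AA, 1 = AB, 2 = BB.
GENOTYPE_LABELS = {0: "AA", 1: "AB", 2: "BB"}

marker_positions_cM = [0.0, 5.0, 10.0]  # assumed positions on one chromosome
genotypes = [
    [0, 1, 1],
    [2, 2, 1],
    [1, 1, 0],
    [0, 0, 0],
]
phenotypes = [10.2, 14.8, 12.1, 9.7]  # made-up trait values, one per individual


def decode(row):
    """Translate one individual's numeric codes into genotype labels."""
    return [GENOTYPE_LABELS[g] for g in row]


def genotype_counts(marker_index):
    """Count how many individuals carry each genotype at one marker."""
    counts = {"AA": 0, "AB": 0, "BB": 0}
    for row in genotypes:
        counts[GENOTYPE_LABELS[row[marker_index]]] += 1
    return counts


print(decode(genotypes[1]))
print(genotype_counts(0))
```

Each row is one individual and each column one marker, so the marker positions and the trait vector line up with the columns and rows of the genotype matrix respectively.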
Linkage Analysis • Data structure: • Marker data (genotypes plus positions) • Phenotypic trait(s) • Other nongenetic covariates, such as age, gender, and environmental conditions • Quantitative trait locus (QTL): a particular region of the genome containing one or more genes that are associated with the trait being assayed or measured
QTL Mapping of Experimental Crosses • Single QTL Mapping • Single marker analysis (Sax, 1923 Genetics) • Interval mapping: Lander & Botstein (1989, Genetics) • Multiple QTL mapping • Composite interval mapping (Zeng 1993 PNAS, 1994 Genetics; Jansen & Stam, 1994 Genetics) • Multiple interval mapping (Kao et al., 1999 Genetics) • Bayesian analysis (Satagopan et al., 1997 Genetics)
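Single QTL Interval Mapping • For a backcross, the model assumes a normally distributed trait whose mean depends on the QTL genotype (AA or AB) • QTL analysis: test whether the two genotype means differ • If QTL genotypes were observed, the analysis would be trivial: a simple t-test! • However, the QTL position is unknown, and therefore the QTL genotypes are unobserved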
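To make the "simple t-test" remark concrete, here is a minimal sketch for the idealized case where QTL genotypes are observed. The simulated effect size, sample size and the function name `two_sample_t` are assumptions chosen for illustration, not values from the talk.

```python
import math
import random


def two_sample_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    ss_a = sum((y - mean_a) ** 2 for y in group_a)
    ss_b = sum((y - mean_b) ** 2 for y in group_b)
    df = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / df
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_b - mean_a) / se, df


rng = random.Random(1)
# Simulated backcross: QTL genotype 0 = AA or 1 = AB, each with probability 1/2.
genotype = [rng.randint(0, 1) for _ in range(200)]
effect = 1.0  # assumed QTL effect, in units of the residual standard deviation
trait = [10.0 + effect * g + rng.gauss(0.0, 1.0) for g in genotype]

aa = [y for y, g in zip(trait, genotype) if g == 0]
ab = [y for y, g in zip(trait, genotype) if g == 1]
t_stat, df = two_sample_t(aa, ab)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

In real data the genotype vector is not available at the QTL itself, which is what motivates interval mapping.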
Interval Mapping • For a QTL between two markers • QTL genotypes are missing, but marker genotypes can be used to infer the conditional probabilities of the QTL genotypes for a given QTL position • A profile likelihood (LOD score) is calculated across the whole genome or candidate regions using the EM algorithm • In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the position with the highest LOD score.
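The two steps above (conditional QTL genotype probabilities from flanking markers, then an EM-based LOD score) can be sketched for a backcross as follows. This is a minimal illustration under stated assumptions: the Haldane map function with no interference, a two-component normal mixture with common variance, and made-up simulation settings. The function names are hypothetical.

```python
import math
import random


def haldane(d_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))


def prob_ab(left, right, d_left_cM, d_right_cM):
    """P(QTL genotype = AB | flanking marker genotypes); 0 = AA, 1 = AB."""
    r1, r2 = haldane(d_left_cM), haldane(d_right_cM)

    def step(a, b, r):
        return 1.0 - r if a == b else r

    w_aa = step(left, 0, r1) * step(0, right, r2)
    w_ab = step(left, 1, r1) * step(1, right, r2)
    return w_ab / (w_aa + w_ab)


def normal_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def lod_score(y, p_ab, n_iter=200):
    """LOD at one putative position: mixture likelihood (fitted by EM) vs. no QTL."""
    n = len(y)
    mean = sum(y) / n
    var0 = sum((v - mean) ** 2 for v in y) / n
    ll_null = sum(math.log(normal_pdf(v, mean, var0)) for v in y)
    sd = math.sqrt(var0)
    mu_aa, mu_ab, var = mean - 0.5 * sd, mean + 0.5 * sd, var0
    for _ in range(n_iter):
        # E-step: posterior probability that each individual is AB at the QTL.
        w = []
        for v, p in zip(y, p_ab):
            a = p * normal_pdf(v, mu_ab, var)
            b = (1.0 - p) * normal_pdf(v, mu_aa, var)
            w.append(a / (a + b))
        # M-step: weighted means and a common variance.
        sw = sum(w)
        mu_ab = sum(wi * v for wi, v in zip(w, y)) / sw
        mu_aa = sum((1.0 - wi) * v for wi, v in zip(w, y)) / (n - sw)
        var = sum(wi * (v - mu_ab) ** 2 + (1.0 - wi) * (v - mu_aa) ** 2
                  for wi, v in zip(w, y)) / n
    ll_alt = sum(math.log(p * normal_pdf(v, mu_ab, var)
                          + (1.0 - p) * normal_pdf(v, mu_aa, var))
                 for v, p in zip(y, p_ab))
    return (ll_alt - ll_null) / math.log(10.0)


# Simulated example: markers 5 cM apart, QTL 2 cM from the left marker.
rng = random.Random(2)
y, p = [], []
for _ in range(200):
    left = rng.randint(0, 1)
    qtl = 1 - left if rng.random() < haldane(2.0) else left
    right = 1 - qtl if rng.random() < haldane(3.0) else qtl
    y.append(10.0 + 1.5 * qtl + rng.gauss(0.0, 1.0))  # assumed effect of 1.5
    p.append(prob_ab(left, right, 2.0, 3.0))
lod = lod_score(y, p)
print(f"LOD at the simulated QTL position: {lod:.1f}")
```

Repeating `lod_score` over a grid of putative positions gives the LOD profile that is compared against the genome-wide threshold.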
Multiple QTL Mapping • Most complex traits are caused by multiple (potentially interacting) genes, which also interact with environmental stimuli • Drawbacks of single QTL interval mapping in this setting: • Ghost QTL (Lander & Botstein 1989) • Low power
Multiple QTL Mapping • Composite interval mapping (Zeng 1993, 1994; Jansen & Stam 1994): searching for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust for the effects of other QTLs outside the region • Open issues: which background markers to include, window size, etc. • Multiple interval mapping (Kao et al. 1999): fitting multiple QTLs simultaneously • Computationally intensive; how many QTLs to include?
Multiple QTL Mapping • Bayesian methods (Stephens and Fisch 1998 Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetical Research; Yi et al. 2003 Genetics): treat the number of QTLs as a parameter by using the reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika) • Drawback: the model dimension changes between moves, and the acceptance probability for such dimension changes may, in practice, not be handled correctly (Ven 2004 Genetics)
Multiple QTL Mapping • Alternatively, multiple QTL mapping can be viewed as a variable selection problem • Forward and stepwise selection procedures (Broman and Speed 2002 JRSSB) • LASSO, etc. • Bayesian QTL mapping • Xu (2003 Genetics), Wang et al. (2005 Genetics), Huang et al. (2007 Genetics): Bayesian shrinkage • Yi et al. (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA) • Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat) • Software: R/qtlbim by Yi’s group
Multiple QTL Mapping • Limitations of existing QTL mapping methods • They do not model covariates at all, or model covariate effects only linearly • They do not model interactions at all, or model only lower order interactions, such as two-way interactions
Multiple QTL mapping is a very large variable selection problem: for p potential genes, with p in the hundreds or thousands, there are $2^p$ possible main effect models, $\binom{p}{2}$ possible two-way interactions and $\binom{p}{k}$ possible higher order (k > 2) interactions.
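A quick calculation shows how fast this model space grows. The sketch assumes the usual counts: every subset of the p genes is a candidate main effect model (2^p of them), and the number of k-way interactions is the binomial coefficient "p choose k". The function name and the choice p = 100, k = 3 are illustrative.

```python
import math


def model_space_sizes(p, k):
    """Counts of candidate models/terms for p genes and interaction order k."""
    return {
        "main_effect_models": 2 ** p,
        "two_way_interactions": math.comb(p, 2),
        "k_way_interactions": math.comb(p, k),
    }


sizes = model_space_sizes(p=100, k=3)
for name, count in sizes.items():
    print(f"{name}: {count:,}")
```

Even at p = 100 the number of main effect models already exceeds 10^30, so exhaustive enumeration is out of reach long before interactions are considered.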
Semiparametric Multiple (Potentially Interacting) QTL Mapping • Goal: map multiple potentially interacting QTLs without explicitly modeling all potential main and higher order interaction effects • Semiparametric model: the trait is modeled through an unspecified function of the QTL genotypes and of all non-genetic factors/covariates • Example: a function that combines genes 1 and 2 in one term and gene 3 and covariate 1 in another implicitly models the two-way interaction between genes 1 and 2 and the gene-environment interaction between gene 3 and covariate 1.
Bayesian Semi/non-parametric Methods • Dirichlet process (Muller et al. 1996) • Splines (Smith and Kohn 1996; Denison et al. 1998; DiMatteo et al. 2001) • Wavelets (Abramovich et al. 1998 JRSSB) • Kernel models (Liang et al. 2007) • Gaussian process (Neal 1996; 1997) • Gaussian process priors have large support in the space of all smooth functions through an appropriate choice of covariance kernel. • Gaussian processes are flexible for curve estimation because of their flexible sample path shapes • Gaussian processes are related to smoothing splines (Wahba 1978 JRSSB)
Prior Specification on the Unspecified Function • A Gaussian process prior, such that all possible finite-dimensional distributions are multivariate normal with mean 0 and a covariance function that depends on a set of hyperparameters
One hyperparameter defines the vertical scale of variation, i.e., it controls the magnitude of the exponential part of the covariance. • A set of hyperparameters, one per input, relates to length scales, which characterize the distance in that particular direction over which y is expected to vary significantly • A further hyperparameter controls the smoothness of the fitted function: at one extreme the posterior mean almost interpolates the data, while at the other it stays centered around the prior mean function • When the hyperparameter for input xj equals 0, y is expected to be an essentially constant function of xj, which is therefore deemed irrelevant (MacKay 1998).
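The exact covariance formula did not survive on this slide, so the following is only a sketch of one common form that matches the description: a vertical scale multiplying an exponential of weighted squared distances, with one nonnegative weight per input. The names `scale` and `rates` and the formula itself are assumptions, not the talk's notation.

```python
import math


def gp_covariance(x, x_prime, scale, rates):
    """Assumed kernel: scale * exp(-sum_j rates[j] * (x[j] - x_prime[j])**2)."""
    weighted_sq_dist = sum(r * (a - b) ** 2 for r, a, b in zip(rates, x, x_prime))
    return scale * math.exp(-weighted_sq_dist)


def covariance_matrix(X, scale, rates):
    """Covariance of the function values at every pair of rows in X."""
    return [[gp_covariance(xi, xj, scale, rates) for xj in X] for xi in X]


# Two points that differ only in the second input.
a, b = [0.0, 0.0], [0.0, 1.0]

# With a zero weight on input 2, the two function values are perfectly
# correlated: the function cannot vary along that input.
print(gp_covariance(a, b, scale=2.0, rates=[1.0, 0.0]))

# With a positive weight, the covariance drops below the prior variance.
print(gp_covariance(a, b, scale=2.0, rates=[1.0, 1.0]))
```

Under this parameterization a weight of zero removes an input from the kernel entirely, which is the behavior the slide describes for an irrelevant variable.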
Priors on the Per-Input Hyperparameters • The original papers on the Gaussian process (MacKay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on these parameters. However, each such hyperparameter does provide information about the relevance of its QTL, with a value near zero indicating an irrelevant QTL. • For variable selection purposes, we can impose Gamma mixture priors on these hyperparameters
Prior Specifications • Inverse Gamma distributions are used for the priors of the other two hyperparameters.
Simulations • Setup: • backcross population • 200 or 500 individuals • 151 evenly spaced markers at 5 cM intervals • Four QTLs with varying heritabilities: • Main effect model: all four QTLs act additively • Main effects plus two-way interactions • Four-way interactions only
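A setup like this can be mimicked with a short simulation. The sketch below makes several assumptions that the slide does not state: all 151 markers sit on a single linkage group, recombination follows the Haldane map function, and the QTL locations and effect sizes are arbitrary placeholders (the heritabilities used in the talk were not recoverable). Only the additive main effect scenario is shown.

```python
import math
import random


def simulate_backcross(n_individuals, n_markers, spacing_cM, rng):
    """Backcross genotypes (0 = AA, 1 = AB) along one chromosome."""
    r = 0.5 * (1.0 - math.exp(-2.0 * spacing_cM / 100.0))  # Haldane
    data = []
    for _ in range(n_individuals):
        row = [rng.randint(0, 1)]
        for _ in range(n_markers - 1):
            prev = row[-1]
            # Switch genotype with probability r between adjacent markers.
            row.append(1 - prev if rng.random() < r else prev)
        data.append(row)
    return data


def additive_trait(genotypes, qtl_index, effects, rng, noise_sd=1.0):
    """Main effect model: QTL effects add up, plus normal noise."""
    return [
        sum(b * row[j] for j, b in zip(qtl_index, effects)) + rng.gauss(0.0, noise_sd)
        for row in genotypes
    ]


rng = random.Random(3)
genotypes = simulate_backcross(200, 151, 5.0, rng)
qtl_index = [20, 60, 100, 140]   # hypothetical QTL locations (marker indices)
effects = [0.8, 0.6, 0.4, 0.2]   # hypothetical effect sizes
trait = additive_trait(genotypes, qtl_index, effects, rng)
print(len(genotypes), len(genotypes[0]), len(trait))
```

The interaction scenarios would replace the sum in `additive_trait` with products of QTL genotypes, for example a single four-way product for the last scenario.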
Real Data Analysis • A mouse study • # samples: 187 backcross samples • # markers: 85 with average marker distance 20 cM • Phenotypes: inguinal, gonadal, retroperitoneal and mesenteric fat pad weights
Remarks • For studies with a large number of samples and/or markers, MCMC converges very slowly • We employed the hybrid Monte Carlo method, which merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation. • We also computed the maximum a posteriori (MAP) estimate via the conjugate gradient method (Hestenes and Stiefel 1952 J. Research of National Bureau of Standards) • This yields a point estimate only
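For readers unfamiliar with hybrid (Hamiltonian) Monte Carlo, here is a minimal one-dimensional sketch: a leapfrog simulation of the dynamics proposes a move, and a Metropolis accept/reject step corrects for discretization error. The target is a standard normal chosen purely for illustration, not the posterior used in the talk. The step size, trajectory length and function names are assumptions.

```python
import math
import random


def hmc_sample(log_density, grad_log_density, x0, n_samples, step, n_leapfrog, rng):
    """Hybrid Monte Carlo for a one-dimensional target density."""
    samples = []
    x = x0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # fresh momentum each iteration
        x_new, p_new = x, p
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new += 0.5 * step * grad_log_density(x_new)
        for i in range(n_leapfrog):
            x_new += step * p_new
            if i < n_leapfrog - 1:
                p_new += step * grad_log_density(x_new)
        p_new += 0.5 * step * grad_log_density(x_new)
        # Metropolis step on the total energy (potential + kinetic).
        h_old = -log_density(x) + 0.5 * p * p
        h_new = -log_density(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples


# Illustrative target: standard normal, up to an additive constant.
def log_density(x):
    return -0.5 * x * x


def grad_log_density(x):
    return -x


rng = random.Random(0)
draws = hmc_sample(log_density, grad_log_density, x0=3.0, n_samples=5000,
                   step=0.2, n_leapfrog=10, rng=rng)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
```

Because each proposal follows the gradient over several leapfrog steps, consecutive draws can be far apart, which is why the method helps when plain random-walk MCMC mixes slowly.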
Real Study: Cardiovascular Disease • 2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease • 820 individuals • Non-genetic covariates: gender, smoking status, age
Remarks • Semiparametric mapping is powerful in mapping multiple QTL, including those interacting at higher orders • Picks up genes related to the trait regardless of whether they act through marginal main effects or joint epistatic effects • Cannot readily differentiate genetic contributions • main effect? interaction? or both? • Follow-up: a fine-tuned parametric model with the selected genes
Remarks and Future Research • How to extend the methodology to human genome-wide association (GWA) studies, where hundreds of thousands of markers are available? • Is it possible? • Potential solutions: pathway analysis; data reduction techniques • How to extend the method to human pedigree analysis, where a mixed effects model is used for correlated family members? • Use the inheritance vector: so far the results are very promising
Acknowledgement • Joint work with • Hanwen Huang • Haibo Zhou • Fuxia Cheng • Ina Hoeschele • Funding support • NIH R01 GM074175