210 likes | 227 Views
Bayesian inference of genome structure and application to base composition variation Nick Smith and Paul Fearnhead, University of Lancaster. Plan of talk. Examples of genome structure Patterns of base composition in vertebrate genomes Recursive segmentation
E N D
Bayesian inference of genome structure and application to base composition variationNick Smith and Paul Fearnhead, University of Lancaster
Plan of talk • Examples of genome structure • Patterns of base composition in vertebrate genomes • Recursive segmentation • Bayesian approach to segmentation inference • Application to simulated data • Application to real data • Conclusions and future directions
Examples of genome structure • Variation in base composition (GC%) in the human genome
Variation in substitution rate, polymorphism, repeat element density, recombination rate across human chromosome 22
General research goals • How can we analyse structure in the genome? Could partition genome into uniform windows, but better to partition genome according to structure in the data. • Base composition: Partition genome according to GC content, testing differences in GC structure among chromosomes and species
Base composition in the vertebrate genome • 1970’s: Physical analysis of large regions of DNA (~100 kb) revealed striking heterogeneity in GC content, especially in warm-blooded vertebrates; idea of isochores (>300 kb) • Full genome sequences: Sliding window analyses show significant variation in GC at a variety of scales. Long-range autocorrelations up to many Mb.
Recursive segmentation for “isochore” identification • Current standard method is recursive segmentation, e.g.Isofinder which generates isochore maps shown in UCSC genome browser. • Isofinder: Apply “coarse graining” to filter out short-scale variation (exons, repetitive elements, CpG islands), choose potential cutting point as maximum t-statistic between right and left means, test significance of t-test by Monte Carlo methods, make cut if p value below threshold. Repeat process for resultant sub-sequences. 10’s Mb 1-10kb
Statistical methods for segmentation • Problems with recursive segmentation: (1) depends on significance threshold, (2) no idea of uncertainty in segmentation, (3) possible loss of information. • Various methods for DNA segmentation have been considered in the statistics literature. • Have adapted a Bayesian model for simultaneously inferring multiple changepoints which divide the data into segments. Model uses a combination of perfect simulation with recursions and MCMC.
Bayesian model • Apply coarse graining to get data (y1:n), number of GC bases in non-overlapping windows • 2-level model • (1) Model data within segments as N(i,2), where i is segment mean and 2 gives variance within segments. • (2) Model segment means as N(, 2 2), where is grand mean and 2 is ratio of variance among segment means to variance within segments. • Can calculate P(t,s) = Pr(yt:s|t,s in same segment). • Model distribution of segment lengths as negative binomial, g(r,k), r is inverse of mean segment length, k is shape parameter. • Define recursions with Q(t) = Pr(yt:n|changepoint at t-1) using P, Q and g functions. Fundamental assumption: independence of segments. Proceed from Q(n) to Q(1) = Pr(y1:n). • Posterior distributions of changepoints are given by formulae with P, Q and g functions, other information easily follows.
Tuning priors and MCMC algorithm • Problem is we don’t know correct values for underlying priors. • Priors tuned by maximising product of Q(1) and hyperpriors. • MCMC algorithm designed to overcome dependence on priors and possible mis-specifications of model. • Start with priors at tuned values (reference priors). • Store N simulations of changepoints based on reference priors. • Three updates at each new step of MCMC algorithm (N steps) (1) Metropolis-Hastings update changepoints dependent on current values of priors, if (2) Update , r, and i’s given changepoints and and (Gibbs) (3) Update and given i’s (Gibbs)
Simulated data 1: Idealised simulations • Idealised simulations: strong segmentation signal and correct model. • Simulated discretised data (5000 data points, 5 kb windows) using r = 0.01, k = 2, = 140, = 2, = 2000. • For 100 simulations, compared true values of parameters from simulations with estimated parameter values (mean of posterior distribution): highly accurate inference of r (mean estimate 0.0104, mean error 0.0005), (140.03, 0.23) and (2.02, 0.043). • Inference of k = 2 was supported by Q(1) values: average 2DlogQ(1) was 3 to k = 1 and 10 to k = 3. • Compared Bayesian and Isofinder results for a single example. • True number of changepoints 58. Bayesian: mean r implied 59.9, best set of changepoints had 53. Isofinder: number of changepoints varied with threshold probability from 96 with p = 0.05 to 64 with p = 0.01 to 52 with p = 0.001.
Simulated data 2: Correlations between segments • In real DNA data there appear to correlations between isochores so important to test sensitivity of method given fundamental assumption of recursions. • Simulated data in which there are correlations between segments, positive 1-lag autocorrelations with segment GC and length (r = 0.30), negative correlation between GC and length (r = -0.30), other simulation details as before. • Inference of parameters remains good: r (mean estimate 0.0096, mean true 0.0098), (140.16, 140.04) and (1.93, 1.94). • Correlations between segments are underestimated: GC (0.203, 0.248), length (0.114, 0.270), GC-length (-0.297,-0.378).
Simulated data 3: multimodal distribution • The model segment means as a single normal. • Simulated data according to model of four isochore classes, reduced to 70 and increased to 3.5. Parameters chosen so that distribution of segment means is multimodal. • Inference according to Bayesian model recreated the multimodal structure of segment means. But acceptance probability very low.
Non-parametric test of changepoint inference • Want some n-p way of evaluating the method, the most fundamental issue is whether changepoints are inferred correctly. • Compare two sets of changepoints to get mean minimum distance between both sets of changepoints (I to nearest T and vice versa). • For each of set of inferred changepoints, compare to true changepoints: get inferred-true distance. For randomised sets of changepoints get random-true distance. • Ratio of inferred-true/random-true gives performance: idealised ratio = 0.238, correlations ratio = 0.256, multimodal ratio = 0.263. Performance robust to model mis-specification. • Comparison of methods for idealised simulation example: Isofinder with p = 0.005 ratio = 0.279, Bayesian best set of changepoints ratio = 0.116.
Real data: human chromosome 1 • 36.8 Mb, 5 kb windows, 7358 data points, mean 2068, standard deviation 277. • Q(1) analysis supports k = 1, geometric distribution. • Means and 95% CIs from posterior distributions: r 0.062 (0.055-0.069), 1.91 (1.77-2.07), 138 (135-140). • Mean segment length 80 kb, lots of short segments. • Strong neighbouring segment correlations: GC 0.21, length 0.11, GC-length –0.29
Human chromosome 1: levels of structure Higher level structure: • Can enforce higher k and lower r, k = 5 and r = 0.02 gives mean segment size of 185 kb. • Significant partial autocorrelations between segment means up to 9-lag. • Applying Bayesian model to inferred segment mean output gives three super-changepoints at 10 Mb (2008 in 5kb data), 15 Mb (3018 in 5kb data) and 17 Mb (3419 in 5kb data). Effect of “coarse graining” window size: • Segment size positively correlated with window size (3 kb - 60 kb; 5 kb - 80 kb; 10 kb - 115 kb). • Same process for Isofinder, with p = 0.95, 3 kb windows give 85 kb segments and 5 kb windows gives 115 kb segments. • Three super-changepoints at 10, 15 and 17 Mb robust to window size.
Real data: mouse chromosome 1 • 23.89 Mb, 5kb windows, 4777 data points, mean 2050, standard deviation 180. • Means and 95% CIs from posterior distributions: r 0.042 (0.034-0.047), 1.71 (1.57-2.00), 103 (101-105). • Less variation in base composition in mouse genome than in human genome: due to longer segments (r 0.042 vs. 0.0062, 120 kb vs. 80 kb), less variation within segments ( 103 vs. 138) and less variation among segments ( 176 vs. 264).
Preliminary conclusions on GC analyses • Methods which partition genome according to data are preferable to sliding windows methods. • Bayesian method does not require the threshold parameter of recursive segmentation methods, allows improved statistical analyses, seems to perform better than recursive segmentation. • Bayesian method appears to be robust to model mis-specification.
Future work - correlated genome structure • One exciting possibility of the Bayesian method is extendsion to multiple genomic features, e.g. the issue of GC-substitution rate (K) covariation. • Accounting for uncertainty in inference of K. • Do GC and K changepoints tend to be close together, and do the changes between segments covary? • Higher level: What about super-changepoints in GC and K? • Lower level: Do GC and covary K within segments?