6th July, 2006. Population Genetic Structure Analysis of The Emerging Marine Pathogen, Vibrio vulnificus. Bisharat et al. Analyses by RM Harding.
Here is where we started: Why two clusters? How are they related to each other? What is their origin? Gene order for MLST: glp-gyrB-mdh-metG-purM (large chromosome), dtdS-lysA-pntA-pyrC-tnaA (small chromosome).
Potential explanations
• Historical structure. Cluster II isolates represent a clonal expansion of some ancient hybrid (e.g. an ancient horizontal transfer event from V. cholerae?) and so share some sequence similarity by descent. Free recombination between isolates has since been erasing this difference.
• Support from association of human disease cases with cluster II and prevalence of environmental origin of cluster I strains. Enriched, non-random sampling of human-disease associated isolates predicts lower diversity and higher LD within cluster II, particularly surrounding genes associated with pathogenicity phenotype.
• Population structure due to isolation between geographic locations or hosts. Lack of recombinant hybrids because members of each cluster rarely meet.
• No evidence of geographic structure. Ubiquitous presence in marine and estuarine environments provides no clue indicating host structure.
• Genetic structure due to adaptation and ongoing selection. Lack of recombinant hybrids because of their low fitness.
• What would be the nature of positive evidence?
Evidence against Hypothesis 1: Each individual MLST locus splits the isolates into the same two groups, more or less, which confirms that clustering of isolates is not due to proximity to a particular divergent locus. Splits trees also suggest that isolates have recombined with each other. But has recombination mainly occurred within cluster I? Perhaps the rate of recombination (within MLST genes) is so low for cluster II that clonal identity (LD) is being maintained across genes? So that would be why all genes split into two clusters.
Table 2. Estimates of rates of mutation and recombination, and their ratios. *Maximum likelihood estimates from likelihoods combined over 5 genes.
Results so far
• Although rates of recombination are variable between genes, averages over within-locus recombination rates provide no evidence for a lower rate of recombination for cluster II on either chromosome. Likewise, there is no evidence of lower diversity (estimated as θ) for cluster II on either chromosome.
• This observation is compatible with the assumptions of house-keeping function and neutrality for diversity within genes, for both clusters.
• But, using DnaSP, let’s check nucleotide diversity (estimated as π) and Tajima’s D (difference between θ and π) for any evidence of either selected differences between clusters or clonal expansion within clusters.
Tajima’s D
(LC = large chromosome, SC = small chromosome, TD = Tajima’s D, ns = not significant)
LC, both clusters: TD = -0.18 (ns), π = 0.239
LC, cluster I: TD = -0.44 (ns), π = 0.211
LC, cluster II: TD = -0.15 (ns), π = 0.251
SC, both clusters: TD = -0.18 (ns), π = 0.213
SC, cluster I: TD = -1.27 (ns), π = 0.165
SC, cluster II: TD = -0.66 (ns), π = 0.214
• No evidence for either selected differences between clusters or clonal expansion within clusters.
• If sequence divergence between the clusters were particularly deep (relative to the diversity expected for random assortment), then Tajima’s D between clusters would be large and positive (>1.5).
• If diversity reflected clonal expansion, Tajima’s D within clusters would be large and negative (<-1.5).
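For readers who want to see how the statistic behaves, here is a minimal sketch of Tajima’s D computed directly from aligned sequences. It is not the DnaSP implementation used for the numbers above; it assumes gap-free, equal-length sequences and an infinite-sites reading of the data (each polymorphic site counted once), and the function name `tajimas_d` is illustrative only.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (no gaps assumed)."""
    n = len(seqs)
    if n < 4:
        # For n = 3 the variance term is exactly zero, so D is undefined.
        raise ValueError("need at least 4 sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    # S: number of segregating (polymorphic) sites.
    S = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if S == 0:
        return 0.0  # convention chosen for this sketch; D is undefined without variation

    # k: mean number of pairwise differences (the pi-based estimate, summed over sites).
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # D = (pi-based estimate - Watterson estimate) / estimated standard deviation.
    return (k - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
```

Two deeply diverged groups at intermediate frequency push D positive, while an excess of rare variants (as after a clonal expansion) pushes it negative, which is the logic behind the ±1.5 guide values on the slide.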
Linkage disequilibrium between loci and fit to an LDhat model
• Given that the splits trees are more or less concordant in their splits between clusters, we have to expect chromosome-wide LD for isolates when combined across clusters.
• The next slide shows the significant LD between SNP pairs (Fisher’s exact test from DnaSP) on the lower diagonal for the large chromosome (pale blue for significant against a yellow background).
• Yellow above the diagonal shows a good fit to a conversion tract model for recombination. The best fit was found for 300 bp tract lengths. Red indicates hotspots, i.e. more recombination than expected given the model, only observed within MLST loci. Blue indicates unexpected LD.
• Keep in mind that the average length of MLST loci on the large chromosome is 460 bp. If the best fit is much smaller than this, then the scale used for inter-locus distances is irrelevant. The inter-locus distances just have to be larger than any intra-locus distance. I used 1 kb between each pair of loci.
• The fit isn’t great, but not bad. The lack of fit, which is mainly due to an excess of apparent hotspots within loci, can be explained by model misspecification rather than anything of biological significance. They aren’t really hotspots. It’s the estimates that are too small, because of problems with applying the model.
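The pairwise LD test mentioned above can be sketched as follows. This is an illustrative stand-alone version of a two-sided Fisher’s exact test on the 2x2 table of two-site haplotype counts, not the DnaSP code; the function names and the toy haplotypes in the usage note are assumptions made for the example, and both sites are assumed to be biallelic.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def snp_pair_table(haplotypes, i, j):
    """Count the four two-site haplotypes at biallelic sites i and j."""
    alleles_i = sorted({h[i] for h in haplotypes})
    alleles_j = sorted({h[j] for h in haplotypes})
    if len(alleles_i) != 2 or len(alleles_j) != 2:
        raise ValueError("both sites must be biallelic")
    counts = [[0, 0], [0, 0]]
    for h in haplotypes:
        counts[alleles_i.index(h[i])][alleles_j.index(h[j])] += 1
    return counts[0][0], counts[0][1], counts[1][0], counts[1][1]
```

For example, `fisher_exact_two_sided(*snp_pair_table(haps, 0, 1))` gives a very small p-value when only two of the four possible two-site haplotypes are present (complete LD), and a p-value of 1 when all four are equally frequent.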
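[Figure: LD (below the diagonal) and model fit (above the diagonal), large chromosome, both clusters combined. 184 SNPs; tract length 300 bp, ML(r)=5. Axis labels: mdh, gyrB, glp, purM, metG.]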
Context and interpretation: large chromosome
• The best fitting tract length is 300 bp and the ML recombination rate is ML(r)=5 over all 5 loci concatenated with 1 kb intervals (5/6301 for rate per bp), whether the average tract length is 300 bp or 500 bp. Compare with ML(r)=8 for the average of the 5 loci given in the earlier Table (8/460 bp for the rate per bp), both for cluster I and cluster II.
• The estimated recombination rate over the chromosome is low to accommodate all the LD across clusters between loci.
• The model fit highlights departures from expectations of rates based on ML(r)=5. Since these expectations for the rate are low, recombination between some intra-locus SNP pairs is judged high (red). But it’s not too bad: the resulting poor fit (and model failure) is mainly within loci, not between.
• For comparison, the best fitting average tract length for analyses of individual loci within clusters (reported in the table) was estimated at ~500 bp, but keep in mind that this is a lower bound set by the physical intra-locus distances (average locus length is 460 bp).
• Given the average locus length of 460 bp, there must be a switch between roughly two haplotype blocks within each locus, and I think I can see that in the LD patterns below the diagonal.
• Now, focusing on the SNPs that segregate between the clusters, take a look in the next slide at the haplotypes (judged by a small subset of SNPs sharing lots of LD). In the following slide the rows (haplotypes) are organised by a UPGMA tree on large chromosome loci. The same two clusters are observed as when all loci are combined, but the ordering within the clusters is different.
Moving on to the small chromosome
• Next slide is the LD and model fit for the small chromosome, both clusters combined. I found the best fit for tract lengths of 110 bp.
• The LD within loci appears broken into even smaller blocks than on the large chromosome.
• Again there is a lot of inter-locus LD. Individual blocks align, repeating the segregation pattern between the two clusters. There must be somewhat more LD because the estimated recombination rate is lower than for the large chromosome: ML(r)=2 over all 5 loci concatenated with 1 kb intervals (2/6025 for rate per bp), whether the tract length is 110 bp or 500 bp. Compare with ML(r)=8.5 for cluster I or ML(r)=11 for cluster II, for the average of the 5 loci given in the earlier Table, assuming a tract length of 500 bp (divide by 405 bp for the rate per bp).
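The per-bp comparisons quoted for the two chromosomes are simple ratios, and a short sketch makes them easy to compare side by side. The spans 6301 and 6025 are taken as given from the slides; the helper `concatenated_span` only illustrates an assumption, namely that each span is five loci plus four 1 kb spacers (this reproduces 6025 exactly for a 405 bp mean and gives 6300 for a 460 bp mean, one short of the quoted 6301, presumably because the mean locus length is rounded).

```python
def concatenated_span(n_loci, mean_locus_bp, spacer_bp=1000):
    """Length of n loci joined end to end with a fixed spacer between neighbours."""
    return n_loci * mean_locus_bp + (n_loci - 1) * spacer_bp


def rate_per_bp(ml_rate, span_bp):
    """Convert an ML(r) estimate over a span into a rate per bp."""
    return ml_rate / span_bp


# (ML(r), span in bp) pairs as quoted on the slides.
estimates = {
    "large, concatenated": (5, 6301),
    "large, per-locus mean": (8, 460),
    "small, concatenated": (2, 6025),
    "small, per-locus mean, cluster I": (8.5, 405),
    "small, per-locus mean, cluster II": (11, 405),
}
per_bp = {name: rate_per_bp(r, span) for name, (r, span) in estimates.items()}
for name, value in per_bp.items():
    print(f"{name}: {value:.5f} per bp")
```

On these figures the within-locus rates per bp are more than an order of magnitude higher than the rates fitted to the concatenated loci, which is the gap the slides attribute to LD between clusters dragging the chromosome-wide estimate down.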
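[Figure: LD (below the diagonal) and model fit (above the diagonal), small chromosome, both clusters combined. 252 SNPs; tract length 110 bp, ML(r)=2. Axis labels: pntA, tnaA, lysA, dtdS, pyrC.]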
Context and interpretation: small chromosome
• Now the model fit is much worse, with so-called ‘hotspots’ dispersed between as well as within loci. The main reason for these ‘hotspots’ is model misspecification. The estimated ML recombination rates, within and between loci, are too low.
• Even judging against a model that sets the recombination rates too low, there is excess inter-locus LD between pyrC and other genes.
• The inter-locus LD on the small chromosome must be more substantial in magnitude (not just significance) than on the large chromosome, leading to particularly poor model fit. Given that the recombination tract-length LDhat model worked reasonably well for the large chromosome, what is different about the small chromosome?
• Suppression of recombination between clusters for the small chromosome?
• Loss of hybrid isolates by selection?
• Next slide shows the major segregating haplotypes. The haplotype blocks do look shorter than for the large chromosome.
Main observations on the large and small chromosomes
• Estimated recombination tract length is short (300 bp for the large chromosome and 110 bp for the small chromosome).
• Although there is a lot of inter-locus LD due to chromosome-wide segregation into the same two clusters of isolates, the LD has been broken into short haplotype blocks within loci.
• For comparison, Jolley et al. reported an average tract length of 1.1 kb for a mixed set of disease-associated and carried Neisseria meningitidis.
• I think the tract length estimates are reasonable and more robust to model misspecification than the recombination rate estimates. (Gil agrees.)
• Why are the blocks short when we focus on the sites segregating between the clusters?
• The SNP variation segregating between clusters is generally older than the variation within clusters and there has been more time for recombination events to break up the LD into short blocks.
• The nature of the recombination process between clusters is different to the recombination process(es) acting within clusters, on the small chromosome in particular.
Large chromosome, keeping the clusters separate
• For the large chromosome, both cluster I and cluster II isolates show LD within loci, and some but not lots of LD between loci.
• For cluster I, the model fit improves as the tract length is increased to 4000 bp, overlapping up to 3 loci.
• For cluster II, the model fit improves as the tract length is increased to 2000 bp, overlapping adjacent pairs of loci.
• The ML recombination rate is lower for cluster II compared with cluster I. There does appear to be more inter-locus LD on cluster II.
• I’ve got no way of testing whether 2000 and 4000 are significantly different fits (probably not). What is most interesting is that there is information that recombination tract lengths within clusters on the large chromosome overlap adjacent pairs of MLST loci, which physically are a long way apart, between 0.35 Mb (mdh and gyrB) and 1.1 Mb (purM and metG).
• The LD plots below the diagonal provide some information to explain why the best fitting tract lengths overlap adjacent loci. For both clusters there is lots of LD within loci, and some LD, but not as much, between some pairs of loci, e.g. any locus with metG. For cluster II in particular, the inter-locus LD is as likely between more distant comparisons as between adjacent loci. We do need to remember that the chromosome is circular and metG is equally far from mdh and purM, and actually furthest from glp. Perhaps the recombination probability does continue to increase with increasing inter-locus distance up to 1.6 Mb, i.e. overlapping any 3 neighbouring loci.
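The point about circularity is easy to get wrong when reading distances off a linear map, so here is a minimal sketch of the shortest-path distance on a circular chromosome. The locus names, coordinates and chromosome length below are hypothetical placeholders for illustration, not the real V. vulnificus map positions.

```python
def circular_distance(pos_a, pos_b, chromosome_length):
    """Shortest distance between two positions on a circular chromosome."""
    direct = abs(pos_a - pos_b) % chromosome_length
    return min(direct, chromosome_length - direct)


def pairwise_circular_distances(positions, chromosome_length):
    """Shortest circular distance for every pair of named loci."""
    names = sorted(positions)
    return {
        (a, b): circular_distance(positions[a], positions[b], chromosome_length)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical coordinates in kb, for illustration only.
example_positions = {"locusA": 100, "locusB": 1200, "locusC": 3100}
example_distances = pairwise_circular_distances(example_positions, chromosome_length=3300)
```

In this toy layout locusA and locusC look 3000 kb apart on a linear map but are only 300 kb apart around the circle, which is the kind of relationship that makes a locus "equally far" from two others that sit on opposite sides of it.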
[Figure: LD and model fit, large chromosome, cluster I. 117 SNPs; ML(r) C4000: 91. Axis labels: mdh, gyrB, glp, purM, metG.]
[Figure: LD and model fit, large chromosome, cluster II. 134 SNPs; ML(r) C2000: 32 (best fit); ML(r) C4000: 53.]
Small chromosome, keeping the clusters separate
• For the small chromosome, both cluster I and cluster II isolates show most LD within loci, some LD between loci, but more LD between pyrC and other loci for cluster II than cluster I.
• For cluster I, the model fit improves as the tract length is increased to 800 bp, i.e. not extending between any loci.
• For cluster II, the model fit improves as the tract length is increased to 4500 bp, overlapping sets of 3 loci.
• The ML recombination rate is higher for cluster II than for cluster I (comparing estimates for tract lengths of 4000 bp). But, judging by eye, the higher ML rate leads to more excess inter-locus LD and a poorer model fit for cluster II than for cluster I. It is also poorer than the model fit for either cluster on the large chromosome.
• So, whatever it is that is different about the small chromosome compared with the large chromosome, cluster II on the small chromosome is the oddest.
[Figure: LD and model fit, small chromosome, cluster I. 133 SNPs; ML(r) C800 = 21 (best fit); ML(r) C4000 = 65. Axis labels: pntA, tnaA, lysA, dtdS, pyrC.]
[Figure: LD and model fit, small chromosome, cluster II. 166 SNPs; ML(r) C4000 = 92. The best model fit is with 4500 bp, but the picture doesn’t look any different to this one. Axis labels: pntA, tnaA, lysA, dtdS, pyrC.]
A check on variable inter-locus distances for the small chromosome
• The MLST loci on the small chromosome are very unevenly distributed; dtdS is close to pyrC (and a lot of inter-locus LD is evident). Also tnaA is relatively close to lysA. Remembering that we have a circular chromosome, pyrC is as close to, or as far from, pntA as it is to lysA.
• So I tried varying the inter-locus distances to roughly reflect these relationships. The new results showed that for cluster II, the average tract length remained long at 4000 bp or more, potentially covering any three loci. Also the fit improved a little from 2058 for the picture shown (with 1 kb inter-locus distances) to 2022 (variable inter-locus distances), but this difference, spread over the whole matrix of SNP pairs, is too small to alter the highlighted pattern of highs (red) and lows (blue).
• The fits with tract length of 4000 bp (or more) are better than fits with tract lengths of 2000 bp at 2068 (1 kb inter-locus distances) or 2031 (variable inter-locus distances of either 1 or 3 kb). But I doubt that any of the different tract length estimates varying from 2000 upwards are meaningful improvements.
• However, it would be interesting to know if a best fit of 800 bp for cluster I on the small chromosome, rather than say 2000+ bp, is meaningful. The observation of 800 bp is in the same ballpark as the 1.1 kb estimated by Gil for Jolley et al.’s paper on Neisseria. It indicates recombination tract lengths covering the extent of individual MLST loci, but not overlapping them.
Adjust the focus to cluster I on the small chromosome
• Cluster I isolates for the small chromosome have the lowest diversity (π = 0.165 compared with π > 0.2 for cluster II or either cluster on the large chromosome).
• Cluster I isolates for the small chromosome also show the least inter-locus LD, giving tract lengths of only 800 bp.
• Also, I think cluster I haplotypes on the small chromosome show the least intrusion of diversity from the other cluster.
• Perhaps selection as adaptation to the environment is acting more strongly on the small chromosome than on the large chromosome and eliminating diversity, including diversity that happens to increase the probability of pathogenesis.
Selection for adaptation to the environment
• If selection for environmental adaptation is the important evolutionary process, then perhaps there are two adaptive basic phenotypes. The more common (and more fit?) basic phenotype is associated with cluster I MLSTs. On this background biotype 2 has evolved and propensity to human pathogenicity has been lost.
• The less common basic phenotype found in the environment is associated with cluster II MLSTs. Despite its being less common, we have obtained a large sample because its frequency is enriched by sampling disease cases.
• The observation that diversity segregates between these clusters means that both types have higher fitness than most hybrids between them (until some unusual hybrid arises, like biotype 3).
Conclusions
• The recombination rate for concatenated loci seems to be comparable and high within both clusters of both chromosomes. However, the model fit is poor for cluster II of the small chromosome, suggesting that, against model expectations, there is excess inter-locus LD, mainly extending from pyrC.
• Excess inter-locus LD extending from pyrC on the small chromosome is detectable for the full sample combining clusters, as well as for cluster II isolates separately.
• In a related point, many of the theoretically expected recombinants (between clusters) for the small chromosome are missing from cluster I, and this looks to me like there has been selection against them.
• Are the long recombination tracts detectable by the LDhat model for the large chromosome, within clusters, due to (a) recombination from the other cluster, or (b) recombination between similar, cluster-specific haplotypes (by conjugation?)?