Fine mapping of recombination in S. cerevisiae: high-density tiling arrays and semi-supervised clustering. Richard Bourgon, 20 February 2007, bourgon@ebi.ac.uk
Meiotic recombination • Meiosis • A diploid parent divides twice, yielding 4 haploid daughter cells. From Molecular Biology of the Cell, Fourth Edition. • Programmed double-strand breaks (DSBs) • Only 2 strands and 1 DSB are shown. • Distribution of DSBs is not uniform! • Crossover • Each Holliday junction may resolve in one of two ways. Some resolution patterns produce a crossover; others do not. • Regions of gene conversion may occur. • Pre-ligation unwinding • Gene conversion may still occur. Based on B de Massy, TRENDS in Genetics 19(9), 2003
Research questions • In S. cerevisiae, where do hotspots occur and what are the local recombination rates? • Do hotspots (binary) or recombination rates (quantitative) correlate with features of DNA sequence or chromatin structure? • How large are conversion regions? Can we identify the various resolution patterns? • What is the relative frequency of the various DSB-repair pathways? • Does the observed pattern among recombination events concur with current models for “interference”? Do mutations impact interference?
Saccharomyces cerevisiae microarray data • Two strains • S96, isogenic to the common laboratory strain S288c. Fully sequenced. • YJM789, isogenic to the clinical isolate YJM145. Almost fully sequenced. • In alignable regions… • ≈ 67,000 known interrogated SNPs • ≈ 6,000 known interrogated insertions and deletions • Tiling microarrays • ≈ 6.5 M 5µ features (perfect/mismatch pairs) • Non-repetitive S96 sequence interrogated every 8 bases, on both strands • ≈ 4% of probes are specific to YJM789 sequence, at known polymorphic positions • For details see David et al., PNAS 103(14), 2006. • Data • 25 parental genomic DNA hybridizations (13 S96 and 12 YJM789) • 200 offspring genomic DNA hybridizations (50 tetrads × 4 spores/tetrad)
“Single feature polymorphisms” • Hybridization to a short oligonucleotide is quantitatively sensitive to the number and position of mismatches. • Differential hybridization provides a means of detecting polymorphisms, even when only the reference genome sequence is known: SFPs. • Winzeler et al., Science 281(5380), 1998. • Brem et al., Science 296(5568), 2002. • Steinmetz et al., Nature 416(6878), 2002. • Borevitz et al., Genome Research 13(3), 2003. • Given parental behavior, segregants may be genotyped via supervised classification.
Single-probe methods • Polymorphism detection • Winzeler et al. (and others): ANOVA testing $\mu_1 = \mu_2$ (equal mean intensity in the two parental strains). Equivalent to a two-sample t-test assuming common variance. • Borevitz et al.: a moderated t-test using the SAM adjustment. • Brem et al.: a moderated t-test. Then, cluster all data (parental and segregant), ignoring genotype labels for parents. Discard SFPs for which clusters don’t separate the parental data. • Segregant genotyping • ANOVA and t-test methods use the estimated posterior probability of class membership, with a uniform prior on the classes: $P(Y = k \mid x) = f_k(x) / \sum_j f_j(x)$, where $f_k$ is the estimated density of the probe intensity $x$ in parental class $k$. • Brem et al. augment this: the class priors and parameters are estimated from clustered data.
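The single-probe recipe above (a common-variance t-test on the parents, then a posterior under a uniform prior for a segregant) can be sketched in a few lines. This is only an illustration: the intensities are made-up numbers, the function names `pooled_t` and `posterior_first_class` are hypothetical, and normal class densities with a shared variance are assumed.

```python
import math
from statistics import mean


def pooled_t(a, b):
    """Two-sample t statistic assuming a common variance in both groups."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s2 = ss / (na + nb - 2)  # pooled variance estimate
    t = (ma - mb) / math.sqrt(s2 * (1 / na + 1 / nb))
    return t, ma, mb, s2


def posterior_first_class(x, mu_a, mu_b, var):
    """P(class a | x) for two normal classes, common variance, uniform prior."""
    # d = log f_b(x) - log f_a(x); the normalising constants cancel.
    d = ((x - mu_a) ** 2 - (x - mu_b) ** 2) / (2 * var)
    if d > 700:  # avoid overflow in exp; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(d))


# Hypothetical log intensities for one probe on the parental arrays.
s96 = [10.1, 9.9, 10.0, 10.2]
yjm789 = [8.0, 8.1, 7.9, 8.2]
t, mu_s, mu_y, s2 = pooled_t(s96, yjm789)

# A segregant whose intensity sits near the S96 parents.
p_segregant = posterior_first_class(9.8, mu_s, mu_y, s2)
```

With a uniform prior the posterior depends only on the ratio of the two class densities, so a segregant exactly midway between the parental means gets probability 0.5.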
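Probe sets: SNP interrogation • Probe sets: groups of probes which each map exactly to a unique location, and which interrogate a common polymorphism. • Example: a G/C SNP between S96 and YJM789, with six 25-mer probes written as base-by-base complements of the target and aligned beneath it. Probes 1, 2, 4, 5 and 6 match S96; probe 3 matches YJM789 at the same offset as probe 2.
S96:    CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA
YJM789: CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA
1:      GGAGGACTGGCCCTAACTTCACTAT
2:          GACTGGCCCTAACTTCACTATTTGT
3:          GACTGGCCCTAACTTGACTATTTGT
4:              GGCCCTAACTTCACTATTTGTACAG
5:                  CTAACTTCACTATTTGTACAGATCG
6:                      CTTCACTATTTGTACAGATCGCAAT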
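As a quick check of the probe set on this slide, the sketch below locates each probe on both parental sequences. It assumes the probes are printed as base-by-base complements of the displayed strand (not reverse complements); the helper name `locate` is hypothetical.

```python
S96 = "CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA"
YJM789 = "CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA"

PROBES = {
    1: "GGAGGACTGGCCCTAACTTCACTAT",
    2: "GACTGGCCCTAACTTCACTATTTGT",
    3: "GACTGGCCCTAACTTGACTATTTGT",
    4: "GGCCCTAACTTCACTATTTGTACAG",
    5: "CTAACTTCACTATTTGTACAGATCG",
    6: "CTTCACTATTTGTACAGATCGCAAT",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def locate(probe, target):
    """0-based offset where the probe pairs perfectly with the target, else None."""
    offset = target.translate(COMPLEMENT).find(probe)
    return offset if offset >= 0 else None


# Positions at which the two parental sequences differ.
snp_positions = [i for i, (a, b) in enumerate(zip(S96, YJM789)) if a != b]

# For each probe: (offset on S96, offset on YJM789).
hits = {n: (locate(p, S96), locate(p, YJM789)) for n, p in PROBES.items()}
```

Every probe overlaps the single polymorphic position, so each one is a perfect match to exactly one parent; probe 3 is the only YJM789-specific probe.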
A multi-probe method • Gresham et al., Science 311(5769), 2006: • Model the decrease in a given probe’s intensity in the presence of a single SNP, as a function of • Position within the probe, • Probe response to reference sequence, • Probe GC content, and • Nucleotides surrounding the SNP position. • Fit model parameters using two sequenced strains with known SNPs. • To genotype a segregant or new strain at a given base, assume probes in a probe set are independent and compute a “likelihood ratio” of the SNP genotype vs. the reference genotype, with the variance assumed to be common for both genotypes.
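Under the independence assumption, the log likelihood ratio for a probe set is a sum of per-probe terms. The sketch below assumes normal densities with one shared variance; the expected intensities are invented numbers standing in for the output of a fitted intensity model, and the function names are hypothetical, not taken from Gresham et al.

```python
import math


def log_normal_pdf(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def log_likelihood_ratio(observed, mu_ref, mu_snp, var):
    """Sum over probes of log f(x | SNP) - log f(x | reference)."""
    return sum(
        log_normal_pdf(x, ms, var) - log_normal_pdf(x, mr, var)
        for x, mr, ms in zip(observed, mu_ref, mu_snp)
    )


# Hypothetical expected log intensities for four probes covering one base.
mu_ref = [10.0, 9.5, 10.2, 9.8]   # if the sample carries the reference base
mu_snp = [9.0, 7.5, 8.0, 9.2]     # if the sample carries the SNP
var = 0.04                        # assumed common to both genotypes

llr_polymorphic = log_likelihood_ratio([9.1, 7.6, 8.1, 9.1], mu_ref, mu_snp, var)
llr_reference = log_likelihood_ratio([10.1, 9.4, 10.1, 9.9], mu_ref, mu_snp, var)
```

A positive value favours the SNP genotype and a negative value the reference genotype. Because the terms are simply added, correlation between probes is ignored.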
An alternative multi-probe method • Residual correlation remains after centering log intensities for each probe within inferred genotype class: a multivariate approach is justified. • Parental arrays are informative, but do not always provide an ideal model. Supervised classification of offspring (i) wastes information, and (ii) may be misleading. • Clear division into two distributions is necessary but not sufficient. Quantitative aspects of the inferred clusters are useful.
Semi-supervised, model-based clustering • C Fraley and AE Raftery, JASA 97(458), 2002; mclust package in R: • Observations $(X_i, Y_i)$ with latent class variable $Y_i$. • Assume $X \mid Y$ multivariate normal. • Initialize the $Y_i$ with model-based, hierarchical agglomeration. • Compute MLEs for Gaussian means and covariance matrices via the EM algorithm. • Optionally, select cluster count and covariance structure via BIC. • Proposed procedure: • Provisional fits • Fit multivariate Gaussians to parental data. (Dimension reduction may be required.) • Unsupervised clustering of offspring data, by EM algorithm. • Final fit • Semi-supervised clustering of all arrays, by EM algorithm, provided that the probe set is “well behaved.”
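Maximum likelihood and mixture distributions • For mixture distributions, the log-likelihood is $\ell(\theta, \pi) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(x_i; \theta_k)$, which cannot typically be optimized via differentiation. • Assume there are latent variables: in the mixture case, a class label $Y_i$ for each observation. • Using indicator variables gives a tractable full log-likelihood: $\ell_{\text{full}}(\theta, \pi) = \sum_{i=1}^{n} \sum_{k=1}^{K} 1\{Y_i = k\} \left[ \log \pi_k + \log f(x_i; \theta_k) \right]$.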
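The EM algorithm • Initialize with some $\theta^{(0)}$ and $\pi^{(0)}$. • Produce an expression for $J^{(0)}(\theta, \pi) = E[\ell_{\text{full}}(\theta, \pi) \mid X; \theta^{(0)}, \pi^{(0)}]$, the expected full log-likelihood given the data, as a function of $\theta$ and $\pi$. • Maximize $J^{(0)}(\theta, \pi)$, and define $\theta^{(1)}$ and $\pi^{(1)}$ to be the maximizers. • Etc., etc.: $(\theta^{(m+1)}, \pi^{(m+1)}) = \arg\max_{\theta, \pi} J^{(m)}(\theta, \pi)$.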
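Normal mixtures • Multivariate normal full log-likelihood: $\ell_{\text{full}} = \sum_{i} \sum_{k} 1\{Y_i = k\} \left[ \log \pi_k - \tfrac{1}{2} \log \lvert \Sigma_k \rvert - \tfrac{1}{2} (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) \right] + \text{const}$. • This is easy to maximize in $\mu_k$ and $\Sigma_k$. With each indicator replaced by its conditional expectation $\hat{z}_{ik} = E[1\{Y_i = k\} \mid x_i]$: $\hat{\mu}_k = \sum_i \hat{z}_{ik} x_i / \sum_i \hat{z}_{ik}$, $\hat{\Sigma}_k = \sum_i \hat{z}_{ik} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top} / \sum_i \hat{z}_{ik}$, and $\hat{\pi}_k = \tfrac{1}{n} \sum_i \hat{z}_{ik}$. • Updating the conditional expectations is also easy: $\hat{z}_{ik} = \pi_k \phi(x_i; \mu_k, \Sigma_k) / \sum_j \pi_j \phi(x_i; \mu_j, \Sigma_j)$, where $\phi$ is the multivariate normal density.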
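The EM updates for a normal mixture, with the semi-supervised twist that arrays of known class keep their labels, can be sketched as follows. To stay dependency-free this is a univariate, two-class simplification of the multivariate fit discussed in the talk; the data are invented, the function name is hypothetical, and the sketch assumes at least one labelled array in each class.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def semi_supervised_em(x, labels, n_iter=100, min_var=1e-6):
    """EM for a two-component univariate normal mixture.

    labels[i] is 0 or 1 for arrays of known class (parents) and None for
    arrays of unknown class (offspring). Known labels are never updated.
    """
    n = len(x)
    # Conditional expectations z[i][k]; unlabelled arrays start undecided.
    z = [[0.5, 0.5] if lab is None else [float(lab == 0), float(lab == 1)]
         for lab in labels]
    for _ in range(n_iter):
        # M-step: weighted proportions, means and variances.
        pi, mu, var = [], [], []
        for k in range(2):
            nk = sum(zi[k] for zi in z)
            mk = sum(zi[k] * xi for zi, xi in zip(z, x)) / nk
            vk = sum(zi[k] * (xi - mk) ** 2 for zi, xi in zip(z, x)) / nk
            pi.append(nk / n)
            mu.append(mk)
            var.append(max(vk, min_var))  # guard against a collapsing cluster
        # E-step: update conditional expectations of unlabelled arrays only.
        for i, lab in enumerate(labels):
            if lab is None:
                w = [pi[k] * normal_pdf(x[i], mu[k], var[k]) for k in range(2)]
                total = sum(w)
                if total > 0:
                    z[i] = [wk / total for wk in w]
    return pi, mu, var, z


# Hypothetical log intensities: 3 + 3 parental arrays, then 6 offspring.
x = [10.0, 10.2, 9.8, 8.0, 8.1, 7.9, 9.9, 10.1, 8.2, 7.8, 10.3, 8.05]
labels = [0, 0, 0, 1, 1, 1] + [None] * 6
pi, mu, var, z = semi_supervised_em(x, labels)
calls = [max(range(2), key=lambda k: zi[k]) for zi in z]
```

Clamping the parental rows is what makes the fit semi-supervised: the parents anchor which cluster is which, while the offspring still contribute to the estimated means and variances.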
Relating to Brem et al., 2002 • The method of Brem et al., 2002, may be seen as a univariate version of the EM algorithm, with premature termination: • Set the conditional expectations $\hat{z}_{ik}$ to 0 or 1 based on the outcome of the k-means clustering. Since probe sets which fail to separate the parental data are discarded, this gives the correct value when $Y_i$ is known. • Compute $\theta^{(1)}$ and $\pi^{(1)}$ only. Do not iterate. • Assign each offspring a genotype of $\arg\max_k P(Y_i = k \mid x_i; \theta^{(1)}, \pi^{(1)})$. • The multivariate version is required by modern, high-density arrays. • Iteration improves the fit for clusters with some overlap.
Violation of model assumptions • Except for spontaneous mutation, offspring should possess one parent genotype or the other at each polymorphic locus. • Probe behavior is not always this simple. Naive clustering can go wrong.
Concordance between fit types • One way to detect probe sets that violate the model assumptions is to compare different fits: • Parental arrays only • Offspring arrays only (possibly comparing different cluster counts) • Semi-supervised (possibly comparing different cluster counts)
Setting filter thresholds • Identification of aberrant probe sets is straightforward: BIC improvement when fitting >2 clusters, or measures of dissimilarity between parental-only and semi-supervised fits, are typically large. • On the other hand, there is no obvious threshold for cluster overlap filtering: this is not a standard testing/model selection problem.
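A BIC comparison between cluster counts might look like the sketch below. It is a univariate illustration with invented intensities; the parameters come from a hard partition rather than a full EM fit, the helper names are hypothetical, and the sign convention (larger is better, 2 log L minus a penalty) is an assumption chosen to match common model-based clustering software.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_from_partition(groups):
    """Weight, mean and variance of each group of a hard partition."""
    n = sum(len(g) for g in groups)
    params = []
    for g in groups:
        mu = sum(g) / len(g)
        var = sum((v - mu) ** 2 for v in g) / len(g)
        params.append((len(g) / n, mu, var))
    return params


def mixture_loglik(x, params):
    """Log-likelihood of x under a univariate normal mixture."""
    return sum(
        math.log(sum(w * normal_pdf(v, mu, var) for w, mu, var in params))
        for v in x
    )


def bic(loglik, n_components, n_obs):
    # K means + K variances + (K - 1) free mixing proportions.
    n_params = 3 * n_components - 1
    return 2 * loglik - n_params * math.log(n_obs)


# Hypothetical aberrant probe set: an unexpected third intensity level.
high = [10.0, 10.1, 9.9, 10.05]
middle = [9.0, 9.1, 8.9, 9.05]
low = [8.0, 8.1, 7.9, 8.05]
x = high + middle + low

bic2 = bic(mixture_loglik(x, fit_from_partition([high + middle, low])), 2, len(x))
bic3 = bic(mixture_loglik(x, fit_from_partition([high, middle, low])), 3, len(x))
bic_improvement = bic3 - bic2
```

A large positive `bic_improvement` is the kind of signal that would flag a probe set as aberrant; where to put the cut-off is a separate choice that this sketch does not address.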
Tetrad-level events • Most gene conversion and crossover events can be automatically inferred. • Gaps arise from (i) lack of polymorphism, or (ii) regions which feature repeats or which could not be aligned given current sequence data.
Tetrad-level events • At many inferred crossover locations, polymorphism resolution is high enough to detect associated gene conversion tracts. • Currently, ≈4,300 inferred crossovers, ≈4,600 high-confidence conversions.
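The talk does not spell out how tetrad-level events are inferred, so the following is only a generic illustration of the idea. Given genotype calls for the four spores of one tetrad (0 for S96, 1 for YJM789), markers that do not segregate 2:2 are flagged as candidate conversions, and a crossover is called between consecutive 2:2 markers when exactly two spores switch genotype. The function name and the toy tetrad are hypothetical, and missing calls are not handled.

```python
def classify_tetrad(tetrad):
    """tetrad: four equal-length lists of 0/1 genotype calls, one per spore.

    Returns (non_mendelian, crossovers): indices of markers that do not
    segregate 2:2, and pairs of consecutive 2:2 markers between which
    exactly two spores switch genotype.
    """
    n_markers = len(tetrad[0])
    counts = [sum(spore[m] for spore in tetrad) for m in range(n_markers)]
    mendelian = [m for m in range(n_markers) if counts[m] == 2]
    non_mendelian = [m for m in range(n_markers) if counts[m] != 2]

    crossovers = []
    for a, b in zip(mendelian, mendelian[1:]):
        switched = sum(spore[a] != spore[b] for spore in tetrad)
        if switched == 2:
            crossovers.append((a, b))
    return non_mendelian, crossovers


# Toy tetrad with six markers: a 3:1 marker at index 1 and one crossover.
tetrad = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
non_mendelian, crossovers = classify_tetrad(tetrad)
```

Comparing only 2:2 markers keeps a conversion tract from being mistaken for a pair of closely spaced crossovers in a single spore.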
Summary and future work • Summary • Semi-supervised clustering is natural given the experimental structure. • The multivariate Gaussian assumptions are not perfect, but are not grossly violated. • EM works well when data behave as expected, but this is not always the case. Post-processing filters can detect this. • Parental data are often not a faithful indicator of offspring behavior! Supervised classification may experience problems for some polymorphisms. • Future work • Exploration and recovery of aberrant probe sets • Unanticipated polymorphism detection • Application to a single sequenced genome • The interesting biological questions! Hotspots, conversion/crossover ratio, sizes, spacing and interference. • New arrays from a mutant which is deficient in the putative interference-generating pathway: how do interference patterns change?
Acknowledgements • EMBL Heidelberg • Eugenio Mancera Ramos • Lars Steinmetz • EBI • Wolfgang Huber • EBI, and Higgins Lab, University College, Dublin • Paul McGettigan