
Richard Bourgon 20 February 2007 bourgon@ebi.ac.uk

Fine mapping of recombination in S. cerevisiae: high-density tiling arrays and semi-supervised clustering.


Presentation Transcript


  1. Fine mapping of recombination in S. cerevisiae: high-density tiling arrays and semi-supervised clustering • Richard Bourgon • 20 February 2007 • bourgon@ebi.ac.uk

  2. Meiotic recombination • Meiosis • A diploid parent divides twice, yielding 4 haploid daughter cells. From Molecular Biology of the Cell, Fourth Edition.

  3. Meiotic recombination • Meiosis • A diploid parent divides twice, yielding 4 haploid daughter cells. • Programmed double-stranded breaks • Only 2 strands and 1 DSB are shown. • Distribution of DSBs is not uniform!

  4. Meiotic recombination • Meiosis • A diploid parent divides twice, yielding 4 haploid daughter cells. • Programmed double-stranded breaks • Only 2 strands and 1 DSB are shown. • Distribution of DSBs is not uniform! • Crossover • Each Holliday junction may resolve in one of two ways. Some resolution patterns produce a crossover. • Regions of gene conversion may occur. Based on B de Massy, TRENDS in Genetics 19(9), 2003

  5. Meiotic recombination • Meiosis • A diploid parent divides twice, yielding 4 haploid daughter cells. • Programmed double-stranded breaks • Only 2 strands and 1 DSB are shown. • Distribution of DSBs is not uniform! • Crossover • Each Holliday junction may resolve in one of two ways. Some resolution patterns do not produce a crossover. • Regions of gene conversion may occur. Based on B de Massy, TRENDS in Genetics 19(9), 2003

  6. Meiotic recombination • Meiosis • A diploid parent divides twice, yielding 4 haploid daughter cells. • Programmed double-stranded breaks • Only 2 strands and 1 DSB are shown. • Distribution of DSBs is not uniform! • Crossover • Each Holliday junction may resolve in one of two ways. Some resolution patterns do not produce a crossover. • Regions of gene conversion may occur. • Pre-ligation unwinding • Gene conversion may still occur. Based on B de Massy, TRENDS in Genetics 19(9), 2003

  7. Research questions • In S. cerevisiae, where do hotspots occur and what are the local recombination rates? • Do hotspots (binary) or recombination rates (quantitative) correlate with features of DNA sequence or chromatin structure? • How large are conversion regions? Can we identify the various resolution patterns? • What is the relative frequency of the various DSB-repair pathways? • Does the observed pattern among recombination events concur with current models for “interference”? Do mutations impact interference?

  8. Saccharomyces cerevisiae microarray data • Two strains • S96, isogenic to the common laboratory strain S288c. Fully sequenced. • YJM789, isogenic to the clinical isolate YJM145. Almost fully sequenced. • In alignable regions… • ≈ 67,000 known interrogated SNPs • ≈ 6,000 known interrogated insertions and deletions • Tiling microarrays • ≈ 6.5 M 5 µm features (perfect/mismatch pairs) • Non-repetitive S96 sequence interrogated every 8 bases, on both strands • ≈ 4% of probes are specific to YJM789 sequence, at known polymorphic positions • For details see David et al., PNAS 103(14), 2006. • Data • 25 parental genomic DNA hybridizations (13 S96 and 12 YJM789) • 200 offspring genomic DNA hybridizations (50 tetrads × 4 spores/tetrad)

  9. Polymorphism density

  10. “Single feature polymorphisms” • Hybridization to a short oligonucleotide is quantitatively sensitive to the number and position of mismatches. • Differential hybridization provides a means of detecting polymorphisms, even when only the reference genome sequence is known: SFPs. • Winzeler et al., Science 281(5380), 1998. • Brem et al., Science 296(5568), 2002. • Steinmetz et al., Nature 416(6878), 2002. • Borevitz et al., Genome Research 13(3), 2003. • Given parental behavior, segregants may be genotyped via supervised classification.

  11. Single-probe methods • Polymorphism detection • Winzeler et al. (and others): ANOVA testing $\mu_1 = \mu_2$, i.e. equality of the two parental class means. Equivalent to a two-sample t-test assuming common variance. • Borevitz et al.: a moderated t-test using the SAM adjustment. • Brem et al.: a moderated t-test. Then, cluster all data (parental and segregant) ignoring genotype labels for parents. Discard SFPs for which clusters don’t separate the parental data. • Segregant genotyping • ANOVA and t-test methods use the estimated posterior probability of class membership, with a uniform prior on the classes: $P(Y = k \mid x) = \hat f_k(x) / \sum_j \hat f_j(x)$, where $\hat f_k$ is the fitted density for class $k$. • Brem et al. augment this: the class parameters are estimated from the clustered data.
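
The single-probe approach above can be sketched in a few lines. This is a minimal illustration, not the code used by any of the cited papers: the function names and the log2 intensities are hypothetical, and the moderated (SAM-style) variants are not shown. It computes the pooled-variance two-sample t statistic from parental arrays, then the posterior probability of class membership for a segregant under a uniform prior.

```python
import math
from statistics import mean

def pooled_t(a, b):
    """Two-sample t statistic assuming a common variance in both classes."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s2 = ss / (na + nb - 2)                      # pooled variance estimate
    t = (ma - mb) / math.sqrt(s2 * (1 / na + 1 / nb))
    return t, ma, mb, s2

def posterior_first_class(x, ma, mb, s2):
    """P(class a | x) for two normal classes with common variance s2.

    With a uniform prior the prior terms cancel, so only the two
    class densities enter the ratio.
    """
    la = -((x - ma) ** 2) / (2 * s2)
    lb = -((x - mb) ** 2) / (2 * s2)
    m = max(la, lb)                              # guard against underflow
    wa, wb = math.exp(la - m), math.exp(lb - m)
    return wa / (wa + wb)

# Hypothetical log2 intensities for one probe on parental arrays.
s96 = [10.1, 10.3, 9.9, 10.2]
yjm = [8.0, 8.2, 7.9, 8.1]

t, m_s96, m_yjm, s2 = pooled_t(s96, yjm)
p_s96 = posterior_first_class(9.8, m_s96, m_yjm, s2)   # one segregant
print(f"t = {t:.1f}, P(S96 | x = 9.8) = {p_s96:.3f}")
```

A segregant whose intensity lies exactly midway between the two parental means gets a posterior of 0.5, which is the situation where single-probe genotyping is least reliable.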

  12. Probe sets: SNP interrogation • Probe sets: groups of probes which each map exactly to a unique location, and which interrogate a common polymorphism. • S96: CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA • YJM789: CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA • Probe 1: GGAGGACTGGCCCTAACTTCACTAT • Probe 2: GACTGGCCCTAACTTCACTATTTGT • Probe 3: GACTGGCCCTAACTTGACTATTTGT • Probe 4: GGCCCTAACTTCACTATTTGTACAG • Probe 5: CTAACTTCACTATTTGTACAGATCG • Probe 6: CTTCACTATTTGTACAGATCGCAAT
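
The probe set on this slide can be checked mechanically. The sketch below uses only the sequences shown on the slide; the helper names are hypothetical. It assumes, as the slide's layout suggests, that each probe is written as the base-wise complement of the displayed strand so that it lines up under the target. It locates the SNP, then reports which strain each probe matches perfectly and where the SNP falls within the probe.

```python
S96 = "CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA"
YJM789 = "CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA"

PROBES = {
    1: "GGAGGACTGGCCCTAACTTCACTAT",
    2: "GACTGGCCCTAACTTCACTATTTGT",
    3: "GACTGGCCCTAACTTGACTATTTGT",
    4: "GGCCCTAACTTCACTATTTGTACAG",
    5: "CTAACTTCACTATTTGTACAGATCG",
    6: "CTTCACTATTTGTACAGATCGCAAT",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def snp_positions(a, b):
    """0-based positions at which two aligned sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def locate(probe, target):
    """0-based offset at which the probe pairs perfectly with target, or -1."""
    return target.translate(COMPLEMENT).find(probe)

snp = snp_positions(S96, YJM789)[0]
for pid, seq in sorted(PROBES.items()):
    off_s96, off_yjm = locate(seq, S96), locate(seq, YJM789)
    strain, off = ("S96", off_s96) if off_s96 >= 0 else ("YJM789", off_yjm)
    print(f"probe {pid}: perfect match to {strain} at offset {off}, "
          f"SNP at probe position {snp - off}")
```

Probes 1, 2, 4, 5 and 6 match the S96 allele at offsets 0, 4, 8, 12 and 16, so the SNP sits at a different position in each; probe 3 is the YJM789-specific counterpart of probe 2.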

  13. Probe set sizes

  14. Marginal behavior

  15. Marginal behavior

  16. A multi-probe method Gresham et al., Science 311(5769), 2006: • Model the decrease in a given probe’s intensity in the presence of a single SNP, as a function of • Position within the probe, • Probe response to reference sequence, • Probe GC content, and • Nucleotides surrounding the SNP position. • Fit model parameters using two sequenced strains with known SNPs. • To genotype a segregant or new strain at a given base, assume probes in a probe set are independent and compute a “likelihood ratio”: the likelihood of the observed intensities given the SNP vs. given the reference sequence, with the error variance assumed to be common for both genotypes.
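
As a rough illustration of the independence assumption, the log-likelihood ratio for a probe set is just a sum of per-probe terms. The sketch below is not the published implementation: it assumes normally distributed log intensities with a single common standard deviation, and the expected intensities under each genotype (which in the real method come from the fitted intensity-decrease model) are hypothetical numbers.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and sd sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood_ratio(observed, expected_ref, expected_snp, sigma):
    """Sum over probes of log L(SNP) - log L(reference).

    Treating probes as independent is what lets the per-probe
    log-likelihoods simply add up.
    """
    return sum(
        normal_logpdf(x, s, sigma) - normal_logpdf(x, r, sigma)
        for x, r, s in zip(observed, expected_ref, expected_snp)
    )

# Hypothetical values for a three-probe set.
expected_ref = [10.0, 10.0, 10.0]   # predicted intensity, reference sequence
expected_snp = [8.5, 8.0, 9.0]      # predicted intensity, SNP present
sigma = 0.5                         # common standard deviation (assumed)

llr_snp_like = log_likelihood_ratio([8.6, 8.1, 9.2], expected_ref, expected_snp, sigma)
llr_ref_like = log_likelihood_ratio([9.9, 10.1, 10.0], expected_ref, expected_snp, sigma)
print(llr_snp_like, llr_ref_like)   # positive favors the SNP, negative the reference
```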

  17. An alternative multi-probe method • Residual correlation remains after centering log intensities for each probe within inferred genotype class: a multivariate approach is justified.

  18. An alternative multi-probe method • Residual correlation remains after centering log intensities for each probe within inferred genotype class: a multivariate approach is justified. • Parental arrays are informative, but do not always provide an ideal model. Supervised classification of offspring (i) wastes information, and (ii) may be misleading.

  19. An alternative multi-probe method • Residual correlation remains after centering log intensities for each probe within inferred genotype class: a multivariate approach is justified. • Parental arrays are informative, but do not always provide an ideal model. Supervised classification of offspring (i) wastes information, and (ii) may be misleading.

  20. An alternative multi-probe method • Residual correlation remains after centering log intensities for each probe within inferred genotype class: a multivariate approach is justified. • Parental arrays are informative, but do not always provide an ideal model. Supervised classification of offspring (i) wastes information, and (ii) may be misleading. • Clear division into two distributions is necessary but not sufficient. Quantitative aspects of the inferred clusters are useful.

  21. Semi-supervised, model-based clustering C Fraley and AE Raftery, JASA 97(458), 2002: mclust package in R: • Observations $(X, Y)_i$ with latent class variable $Y$. • Assume $X \mid Y$ multivariate normal. • Initialize the $Y$ with model-based, hierarchical agglomeration. • Compute MLEs for Gaussian means and covariance matrices via EM algorithm. • Optionally, select cluster count and covariance structure via BIC. Proposed procedure: • Provisional fits • Fit multivariate Gaussians to parental data. (Dimension reduction may be required.) • Unsupervised clustering of offspring data, by EM algorithm. • Final fit • Semi-supervised clustering of all arrays, by EM algorithm, provided that the probe set is “well behaved.”
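
The final, semi-supervised fit can be sketched as an EM loop in which parental arrays keep their known class labels and only offspring memberships are re-estimated. This is a simplified stand-in for the procedure on the slide, not the mclust implementation: it is restricted to two probes (so the 2×2 covariance can be inverted by hand), it initializes from the parental fit rather than from hierarchical agglomeration, it runs a fixed number of iterations, and all intensities and function names are hypothetical.

```python
import math

def gauss2(x, mu, cov):
    """Bivariate normal density with mean mu and 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form computed with the explicit inverse of a 2x2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def m_step(xs, z):
    """Weighted mixing proportion, mean and covariance for each class."""
    params = []
    for k in range(len(z[0])):
        w = [zi[k] for zi in z]
        sw = sum(w)
        mu = [sum(wi * x[j] for wi, x in zip(w, xs)) / sw for j in range(2)]
        cov = [[sum(wi * (x[r] - mu[r]) * (x[c] - mu[c]) for wi, x in zip(w, xs)) / sw
                for c in range(2)] for r in range(2)]
        params.append((sw / len(xs), mu, cov))
    return params

def e_step(xs, params, labels):
    """Membership weights; labeled (parental) arrays stay fixed at 0 or 1."""
    z = []
    for x, lab in zip(xs, labels):
        if lab is not None:
            z.append([1.0 if k == lab else 0.0 for k in range(len(params))])
        else:
            dens = [tau * gauss2(x, mu, cov) for tau, mu, cov in params]
            total = sum(dens)
            z.append([d / total for d in dens])
    return z

def semi_supervised_em(xs, labels, n_iter=25):
    """Start from the parental-only fit, then iterate over all arrays."""
    lab_x = [x for x, lab in zip(xs, labels) if lab is not None]
    lab_z = [[1.0 if k == lab else 0.0 for k in range(2)]
             for lab in labels if lab is not None]
    params = m_step(lab_x, lab_z)
    for _ in range(n_iter):
        z = e_step(xs, params, labels)
        params = m_step(xs, z)
    return params, z

# Hypothetical log2 intensities for a two-probe set.
parents0 = [(10.0, 10.1), (10.2, 9.9), (9.8, 10.0), (10.1, 10.3)]   # S96
parents1 = [(8.0, 8.4), (8.2, 8.6), (7.9, 8.7), (8.1, 8.3)]         # YJM789
offspring = [(9.9, 10.2), (10.3, 10.0), (8.1, 8.5), (7.8, 8.4), (10.0, 9.8), (8.3, 8.7)]

xs = parents0 + parents1 + offspring
labels = [0] * 4 + [1] * 4 + [None] * len(offspring)
params, z = semi_supervised_em(xs, labels)
calls = [max(range(2), key=lambda k: zi[k]) for zi in z[len(parents0) + len(parents1):]]
print(calls)
```

Keeping the parental weights fixed is the only difference from unsupervised EM; passing `None` for every label would turn the same loop into the offspring-only provisional fit, given some other starting point.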

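  22. Maximum likelihood and mixture distributions • For mixture distributions, the log-likelihood is $\ell(\mu, \Sigma) = \sum_i \log \sum_k \tau_k\, f(x_i; \mu_k, \Sigma_k)$, with mixing proportions $\tau_k$ and class densities $f$, which cannot typically be optimized via differentiation.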

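  23. Maximum likelihood and mixture distributions • For mixture distributions, the log-likelihood is $\ell(\mu, \Sigma) = \sum_i \log \sum_k \tau_k\, f(x_i; \mu_k, \Sigma_k)$, with mixing proportions $\tau_k$ and class densities $f$, which cannot typically be optimized via differentiation. • Assume there are latent variables; in the mixture case, a class label $Y_i$ for each observation. • Using indicator variables $z_{ik} = 1\{Y_i = k\}$ gives a tractable full log-likelihood: $\ell_{\text{full}}(\mu, \Sigma) = \sum_i \sum_k z_{ik} \left[ \log \tau_k + \log f(x_i; \mu_k, \Sigma_k) \right]$.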

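  24. The EM algorithm • Initialize with some $\mu^{(0)}$ and $\Sigma^{(0)}$. • Produce an expression for $J^{(0)}(\mu, \Sigma) = E\left[ \ell_{\text{full}}(\mu, \Sigma) \mid X;\ \mu^{(0)}, \Sigma^{(0)} \right]$, the conditional expectation of the full log-likelihood, as a function of $\mu$ and $\Sigma$. • Maximize $J^{(0)}(\mu, \Sigma)$, and define $\mu^{(1)}$ and $\Sigma^{(1)}$ to be the maximizers. • Iterate: $(\mu^{(t+1)}, \Sigma^{(t+1)}) = \arg\max_{\mu, \Sigma} J^{(t)}(\mu, \Sigma)$.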

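  25. Normal mixtures • Notation: $\tau_k$ are the mixing proportions, $z_{ik}$ indicates membership of observation $i$ in class $k$, $\hat z_{ik}$ is its conditional expectation, and $\phi$ is the multivariate normal density. • Multivariate normal full log-likelihood: $\ell_{\text{full}}(\mu, \Sigma) = \sum_i \sum_k z_{ik} \left[ \log \tau_k - \tfrac{1}{2} \log |2\pi \Sigma_k| - \tfrac{1}{2} (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) \right]$. • This is easy to maximize in $\mu_k$ and $\Sigma_k$: $\hat\mu_k = \sum_i \hat z_{ik} x_i / \sum_i \hat z_{ik}$ and $\hat\Sigma_k = \sum_i \hat z_{ik} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top / \sum_i \hat z_{ik}$. • Updating the conditional expectations is also easy: $\hat z_{ik} = \tau_k\, \phi(x_i; \hat\mu_k, \hat\Sigma_k) / \sum_j \tau_j\, \phi(x_i; \hat\mu_j, \hat\Sigma_j)$.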

  26. Relating to Brem et al., 2002 • The method of Brem et al., 2002, may be seen as a univariate version of the EM algorithm, with premature termination: • Set the class-membership weights $\hat z_{ik}$ to 0 or 1 based on the outcome of the k-means clustering. Since probe sets which fail to separate the parental data are discarded, this gives the correct value when $Y_i$ is known. • Compute $\mu^{(1)}$ and $\Sigma^{(1)}$ only. Do not iterate. • Assign each offspring a genotype of $\arg\max_k P(Y_i = k \mid x_i;\ \mu^{(1)}, \Sigma^{(1)})$. • A multivariate version is required by modern, high-density arrays. • Iteration improves the fit for clusters with some overlap.

  27. Examples

  28. Violation of model assumptions • Except for spontaneous mutation, offspring should possess one parent genotype or the other at each polymorphic locus. • Probe behavior is not always this simple. Naive clustering can go wrong.

  29. Violation of model assumptions • Except for spontaneous mutation, offspring should possess one parent genotype or the other at each polymorphic locus. • Probe behavior is not always this simple. Naive clustering can go wrong.

  30. Concordance between fit types • One way to detect such violations is to compare different fits: • Parental arrays only • Offspring arrays only (possibly comparing different cluster counts) • Semi-supervised (possibly comparing different cluster counts)
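
One simple way to compare two fits is to compare the genotype calls they produce for the same arrays. The slides do not say which concordance measure was used, so the fraction-of-agreement measure, the 0.95 threshold and the calls below are all assumptions made for illustration.

```python
def concordance(calls_a, calls_b):
    """Fraction of arrays given the same genotype call by two fits."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both fits must call the same set of arrays")
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)

def is_suspect(calls_a, calls_b, threshold=0.95):
    """Flag a probe set whose fits disagree too often (threshold is hypothetical)."""
    return concordance(calls_a, calls_b) < threshold

# Hypothetical calls for eight offspring arrays at one probe set.
parental_fit = ["S96", "S96", "YJM789", "YJM789", "S96", "YJM789", "S96", "YJM789"]
semi_supervised_fit = ["S96", "S96", "YJM789", "S96", "S96", "YJM789", "S96", "S96"]

print(concordance(parental_fit, semi_supervised_fit))   # 6 of 8 arrays agree
print(is_suspect(parental_fit, semi_supervised_fit))
```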

  31. Genotyping results — no polymorphism filtering

  32. Genotyping results — concordance and overlap filtering

  33. Genotyping results — concordance and overlap filtering

  34. Setting filter thresholds • Identification of aberrant probe sets is straightforward: BIC improvement when fitting >2 clusters, or measures of dissimilarity between parental-only and semi-supervised fits, are typically large. • On the other hand, there is no obvious threshold for cluster overlap filtering: this is not a standard testing/model selection problem.
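
The BIC comparison mentioned above can be made concrete. The sketch uses the convention of Fraley and Raftery (2 × log-likelihood minus parameter count × log n, so larger is better) and counts parameters for unconstrained covariance matrices. The log-likelihood values and the choice of a two-probe set are hypothetical; n = 225 corresponds to the 25 parental plus 200 offspring arrays.

```python
import math

def n_params(g, d):
    """Free parameters of a g-component, d-dimensional Gaussian mixture
    with unconstrained covariances: (g - 1) mixing proportions,
    g * d means, and g * d * (d + 1) / 2 covariance entries."""
    return (g - 1) + g * d + g * d * (d + 1) // 2

def bic(loglik, g, d, n):
    """BIC as 2 * loglik - n_params * log(n); larger values are preferred."""
    return 2 * loglik - n_params(g, d) * math.log(n)

n_arrays = 225                    # 25 parental + 200 offspring hybridizations
loglik_2, loglik_3 = -310.0, -305.0   # hypothetical maximized log-likelihoods

improvement = bic(loglik_3, 3, 2, n_arrays) - bic(loglik_2, 2, 2, n_arrays)
print(f"BIC improvement from a third cluster: {improvement:.1f}")
```

With these made-up numbers the third cluster gains 5 log-likelihood units but costs 6 extra parameters, so the improvement is negative and two clusters are retained; an aberrant probe set would show a large positive improvement instead.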

  35. Tetrads

  36. Tetrad-level events

  37. Tetrad-level events • Most gene conversion and crossover events can be automatically inferred. • Gaps arise from (i) lack of polymorphism, or (ii) regions which feature repeats or which could not be aligned given current sequence data.
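
A simplified version of this inference can be written directly from the segregation pattern at each marker. The rules below are an assumed heuristic, not the procedure used in the talk: a marker that does not segregate 2:2 is reported as a conversion, and a crossover is reported between consecutive 2:2 markers when exactly two spores switch genotype. The tetrad is hypothetical, with 0 for the S96 allele and 1 for the YJM789 allele.

```python
def segregation(tetrad, m):
    """Segregation ratio at marker m as 'S96 count:YJM789 count'."""
    ones = sum(spore[m] for spore in tetrad)
    return f"{4 - ones}:{ones}"

def infer_events(tetrad):
    """Return (conversion markers, crossovers) for one tetrad.

    Each crossover is (left marker, right marker, switching spores).
    """
    n = len(tetrad[0])
    conversions = [m for m in range(n) if segregation(tetrad, m) != "2:2"]
    mendelian = [m for m in range(n) if segregation(tetrad, m) == "2:2"]
    crossovers = []
    for left, right in zip(mendelian, mendelian[1:]):
        switched = [s for s, spore in enumerate(tetrad) if spore[left] != spore[right]]
        if len(switched) == 2:        # reciprocal exchange between two spores
            crossovers.append((left, right, switched))
    return conversions, crossovers

# Hypothetical tetrad: four spores genotyped at six ordered markers.
tetrad = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

conversions, crossovers = infer_events(tetrad)
print("conversion markers:", conversions)
print("crossovers:", crossovers)
```

Here marker 2 segregates 1:3 and lies between the two markers that bracket the exchange, the pattern expected for a conversion tract associated with a crossover.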

  38. Tetrad-level events • At many inferred crossover locations, polymorphism resolution is high enough to detect associated gene conversion tracts. • Currently, ≈4,300 inferred crossovers, ≈4,600 high-confidence conversions.

  39. Summary and future work • Summary • Semi-supervised clustering is natural given the experimental structure. • The multivariate Gaussian assumptions are not perfectly satisfied, but are not grossly violated. • EM works well when data behave as expected, but this is not always the case. Post-processing filters can detect this. • Parental data are often not a faithful indicator of offspring behavior! Supervised classification may experience problems for some polymorphisms. • Future work • Exploration and recovery of aberrant probe sets • Unanticipated polymorphism detection • Application to a single sequenced genome • The interesting biological questions! Hotspots, conversion/crossover ratio, sizes, spacing and interference. • New arrays from a mutant which is deficient in the putative interference-generating pathway: how do interference patterns change?

  40. Acknowledgements • EMBL Heidelberg • Eugenio Mancera Ramos • Lars Steinmetz • EBI • Wolfgang Huber • EBI, and Higgins Lab, University College, Dublin • Paul McGettigan
