Meiotic recombination mapping with tiling microarrays: genotype and rate inference Richard Bourgon 22 November 2007 bourgon@ebi.ac.uk
Overview • Meiotic recombination • Genotyping with tiling microarrays • Previous single-probe approaches • Multivariate approaches • Post-processing • SSC vs. SNPscanner • Recombination event inference • Tetrad level results • Local rate inference • Biology
Meiotic recombination Meiosis… • Two divisions, yielding 4 haploid daughter cells. From Molecular Biology of the Cell, Fourth Edition.
Meiotic recombination Meiosis… • Two divisions, yielding 4 haploid daughter cells. • Double-strand breaks (DSBs) initiate recombination. • Both crossover and non-crossover resolutions are possible. [Figure: DSB repair pathways. Labels: DSB, strand resection (3’ ends), single-end invasion, D-loop, SDSA, DSBR, invading strand unwound, second end capture, nicked HJ, dHJ, noncrossover, crossover.]
Research questions • In S. cerevisiae, where do hotspots occur and what are the local recombination rates? Do crossovers and non-crossovers exhibit the same pattern? • Do hotspots (binary) or recombination rates (quantitative) correlate with features of DNA sequence or chromatin structure? • How large are conversion regions? Can we identify the various resolution patterns? • What is the relative frequency of the various DSB-repair pathways? • Does the observed pattern among recombination events concur with current models for “interference”? Do mutations impact interference? • Are hotspots mutagenic or conservative? Are there biases in gene conversion?
“Single feature polymorphisms” • Hybridization efficiency depends on the number and position of mismatches. • Differential hybridization provides a means of detecting polymorphisms, even when only the reference genome sequence is known: SFPs. • Winzeler et al., Science 281(5380), 1998. • Brem et al., Science 296(5568), 2002. • Steinmetz et al., Nature 416(6878), 2002. • Borevitz et al., Genome Research 13(3), 2003. • Given parental behavior, genotype segregants via supervised classification.
Single-probe methods • Polymorphism detection • Winzeler et al. (and others): ANOVA testing $\mu_1 = \mu_2$ (equal mean intensity in the two parental classes). • Borevitz et al.: moderated t-test using the SAM adjustment. • Brem et al.: moderated t-test. Then cluster all data (parental and segregant) and discard SFPs for which clusters don’t separate the parental data. • Segregant genotyping • ANOVA and t-test methods use the estimated posterior probability of class membership, with a uniform prior on the classes: $\hat{P}(Y = k \mid X = x) = \hat{f}_k(x) / (\hat{f}_1(x) + \hat{f}_2(x))$, where $\hat{f}_k$ is the estimated density of class $k$. • Brem et al. augment this with estimates obtained from clustered data.
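The uniform-prior posterior for a single probe can be sketched as follows. This is a minimal illustration, not the procedure of any of the cited papers: it assumes Gaussian class densities with a pooled variance estimated from the parental arrays, and the intensity values and the names `posterior_class1`, `s96_parents`, and `yjm_parents` are hypothetical.

```python
import math
from statistics import mean, variance

def posterior_class1(x, class1, class2):
    """Posterior probability that intensity x belongs to class 1,
    assuming a uniform prior and Gaussian classes with a pooled variance."""
    mu1, mu2 = mean(class1), mean(class2)
    n1, n2 = len(class1), len(class2)
    pooled = ((n1 - 1) * variance(class1) + (n2 - 1) * variance(class2)) / (n1 + n2 - 2)
    # log f2(x) - log f1(x); normalizing constants cancel under a common variance
    d = ((x - mu1) ** 2 - (x - mu2) ** 2) / (2 * pooled)
    if d > 700:  # exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(d))

# Hypothetical log intensities of one probe on parental arrays
s96_parents = [10.0, 10.2, 9.8]
yjm_parents = [8.0, 8.1, 7.9]

p_mid = posterior_class1(9.0, s96_parents, yjm_parents)    # halfway between the class means
p_high = posterior_class1(10.0, s96_parents, yjm_parents)  # at the S96 class mean
```

A segregant intensity halfway between the two parental means gets a posterior near 1/2, which is the kind of call the later filtering step would discard.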
Saccharomyces cerevisiae microarray data • Two strains • S96, isogenic to the common laboratory strain S288c. • YJM789, isogenic to the clinical isolate YJM145. • In alignable regions: ≈ 56,000 SNPs and thousands of insertions and deletions. • Tiling microarrays • ≈ 6.5 M features of 5 µm, tiling non-repetitive S96 every 4 bases. • ≈ 4% of probes are specific to YJM789 sequence. • Data • 25 parental genomic DNA hybridizations. • 208 wild-type offspring hybridizations. • 20 msh4 mutant offspring hybridizations. • 20 mms4 mutant offspring hybridizations.
Probe sets: SNP interrogation
S96:    CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA
YJM789: CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA
1:      GGAGGACTGGCCCTAACTTCACTAT
2:          GACTGGCCCTAACTTCACTATTTGT
3:          GACTGGCCCTAACTTGACTATTTGT
4:              GGCCCTAACTTCACTATTTGTACAG
5:                  CTAACTTCACTATTTGTACAGATCG
6:                      CTTCACTATTTGTACAGATCGCAAT
• Probe set: group of probes which each exactly map to a unique location, and which interrogate a common polymorphism. • Marker: one or more polymorphisms interrogated by the same probe set.
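The probe set on this slide can be checked mechanically. As written, each probe reads as the base-wise complement of the target strand in the same left-to-right order, so a plain complement (not a reverse complement) is enough to locate it. The sketch below uses only the sequences shown on the slide; the helper name `matching_strains` is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

S96 = "CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA"
YJM789 = "CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA"

PROBES = {
    1: "GGAGGACTGGCCCTAACTTCACTAT",
    2: "GACTGGCCCTAACTTCACTATTTGT",
    3: "GACTGGCCCTAACTTGACTATTTGT",
    4: "GGCCCTAACTTCACTATTTGTACAG",
    5: "CTAACTTCACTATTTGTACAGATCG",
    6: "CTTCACTATTTGTACAGATCGCAAT",
}

def matching_strains(probe):
    """Names of the strains whose complemented sequence contains the probe exactly."""
    hits = []
    for name, seq in (("S96", S96), ("YJM789", YJM789)):
        if probe in seq.translate(COMPLEMENT):
            hits.append(name)
    return hits

# 0-based positions at which the two strains differ
snp_positions = [i for i, (a, b) in enumerate(zip(S96, YJM789)) if a != b]

# Strain(s) each probe matches perfectly
targets = {pid: matching_strains(seq) for pid, seq in PROBES.items()}

# 0-based start of each probe along the complemented S96 sequence (-1 if absent)
offsets = {pid: S96.translate(COMPLEMENT).find(seq) for pid, seq in PROBES.items()}
```

Here the two strains differ at a single base, probes 1, 2, 4, 5, and 6 match S96 only, and probe 3 matches YJM789 only, so all six probes interrogate the same SNP.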
A multi-probe method: SNPscanner Gresham et al., Science 311(5769), 2006: • Model the decrease in a given probe’s intensity in the presence of a single SNP, as a function of • Position within the probe, • Probe response to reference sequence, • Probe GC content, and • Nucleotides surrounding the SNP position. • Fit model parameters using two sequenced strains with known SNPs. • To genotype a segregant or new strain at a given base, assume probes in a probe set are independent and compute a likelihood ratio: $x_j \sim N(\mu_j, \sigma_j^2)$ vs. $x_j \sim N(\mu_j - \delta_j, \sigma_j^2)$, with $\sigma_j^2$ assumed to be common for both genotypes. Here $x_j$ is the intensity of probe $j$, $\mu_j$ its response to the reference sequence, and $\delta_j$ the modeled decrease.
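A likelihood ratio of this kind, for independent Gaussian probes with a variance shared by both genotypes, can be sketched as below. This is an illustration of the general form only, not the SNPscanner implementation: the reference means, shifts, and variances are hypothetical numbers, whereas in the published method the shift comes from the fitted intensity-decrease model.

```python
def log_likelihood_ratio(x, ref_mean, shift, var):
    """log L(variant) - log L(reference) for one probe set.

    Probes are treated as independent Gaussians. The variant mean of probe j
    is ref_mean[j] - shift[j], and var[j] is shared by both genotypes, so the
    normalizing constants cancel.
    """
    llr = 0.0
    for xj, mj, dj, vj in zip(x, ref_mean, shift, var):
        log_ref = -(xj - mj) ** 2 / (2 * vj)
        log_alt = -(xj - (mj - dj)) ** 2 / (2 * vj)
        llr += log_alt - log_ref
    return llr

# Hypothetical three-probe set
ref_mean = [10.0, 9.0, 11.0]
shift = [2.0, 1.0, 0.5]
var = [0.25, 0.25, 0.25]

llr_reference_like = log_likelihood_ratio(ref_mean, ref_mean, shift, var)
llr_variant_like = log_likelihood_ratio([8.0, 8.0, 10.5], ref_mean, shift, var)
```

A positive value favors the variant genotype and a negative value favors the reference. Because the variances are diagonal and shared, each probe contributes independently, which is the assumption the multivariate approach on the following slides relaxes.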
An alternative multi-probe method • Residual correlation remains after centering log intensities for each probe within inferred genotype class: a multivariate approach is justified. • Parental arrays are informative, but do not always provide an ideal model. Supervised classification of offspring (i) wastes information, and (ii) may be misleading. • Clear division into two distributions is necessary but not sufficient. Quantitative aspects of the inferred clusters are useful.
Semi-supervised, model-based clustering (SSC) Semi-supervised clustering via EM algorithm: • Assume a two-component mixture, with $\pi_1 = \pi_2 = 1/2$. • $(X_i, Y_i)$ with latent class variable $Y$. $Y$ is known for parental arrays. • Assume $X \mid Y$ multivariate normal. • Begin with E-step: initialize the unknown $Y$ with any simple clustering scheme: k-means, hierarchical agglomeration, etc. • Iterate: estimate parameters, then $E(Y_i \mid X_i)$, then parameters again, etc. • Classify segregant $i$ based on final estimated $E(Y_i \mid X_i)$. For diagnostic purposes only: • Multivariate Gaussian fit to dimension-reduced parental data. • Unsupervised clustering of offspring data, by EM algorithm, with $k \in \{2, 3\}$.
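The EM scheme above can be sketched for a two-probe probe set, where the covariance is 2x2 and can be inverted by hand. This is a minimal sketch under stated assumptions, not the author's code: the function names, the data, and the crude threshold used for initialization (standing in for k-means or similar) are all hypothetical, and no safeguard against degenerate covariances is included.

```python
import math

def fit_gaussian(points, weights):
    """Weighted mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    total = sum(weights)
    mx = sum(w * p[0] for p, w in zip(points, weights)) / total
    my = sum(w * p[1] for p, w in zip(points, weights)) / total
    sxx = sum(w * (p[0] - mx) ** 2 for p, w in zip(points, weights)) / total
    syy = sum(w * (p[1] - my) ** 2 for p, w in zip(points, weights)) / total
    sxy = sum(w * (p[0] - mx) * (p[1] - my) for p, w in zip(points, weights)) / total
    return (mx, my), (sxx, sxy, syy)

def log_density(p, mean, cov):
    """Log of the bivariate normal density at point p."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2 * math.pi)

def ssc(labeled, unlabeled, init, n_iter=50):
    """Semi-supervised EM for a two-class Gaussian mixture with equal weights.

    labeled: list of (point, class) with class in {0, 1} (parental arrays).
    unlabeled: list of points (segregants).
    init: starting P(class 1) for each unlabeled point.
    """
    points = [p for p, _ in labeled] + list(unlabeled)
    fixed = [float(y) for _, y in labeled]  # parental labels never change
    resp = list(init)
    for _ in range(n_iter):
        # M-step: class-specific, non-diagonal covariances from all arrays
        w1 = fixed + resp
        w0 = [1.0 - w for w in w1]
        params = [fit_gaussian(points, w0), fit_gaussian(points, w1)]
        # E-step: with pi_1 = pi_2 = 1/2 the mixing proportions cancel
        resp = []
        for p in unlabeled:
            d = log_density(p, *params[0]) - log_density(p, *params[1])
            resp.append(0.0 if d > 700 else 1.0 / (1.0 + math.exp(d)))
    return params, resp

# Hypothetical log intensities for a two-probe probe set
parents = [((10.0, 10.1), 0), ((10.2, 9.9), 0), ((9.8, 10.0), 0), ((10.1, 10.3), 0),
           ((8.0, 8.4), 1), ((8.2, 8.7), 1), ((7.9, 8.5), 1), ((8.1, 8.3), 1)]
offspring = [(9.9, 10.2), (10.3, 10.0), (8.1, 8.6), (7.8, 8.4), (10.0, 9.8), (8.3, 8.5)]
init = [1.0 if p[0] < 9.0 else 0.0 for p in offspring]  # crude starting clusters

params, resp = ssc(parents, offspring, init)
calls = [int(r > 0.5) for r in resp]
```

Both labeled and unlabeled arrays contribute to every parameter estimate, which is the point of the semi-supervised formulation: offspring data refine the class models rather than merely being classified by them.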
Filtering • Array level • Excess “genotype switching”. • Large RMS residual (Mahalanobis) to assigned class. • Probe set level • High estimated misclassification rate. • Aberrant cluster behavior. • Very unusual genotype ratio. • Call level • Intermediate posterior probability of class membership. • Large residual to assigned class.
SSC vs. SNPscanner • SSC • Multivariate Gaussians. • Class-specific, non-diagonal covariance matrices. • Parameter estimates via EM, using labeled and unlabeled data. • Data-based estimates for both classes. • Both S288c- and YJM789-specific probes used. • SNPscanner • Multivariate Gaussians. • Common, diagonal covariance matrices for the two classes. • Parameter estimation using parental data only. • Data-based parameter estimation for reference class. • Model-based shift gives variant class mean estimate. • Only S288c-specific probes may be used.
Genotype call comparison: SNPscanner vs. SSC • Filter for both methods: • Remove bad arrays and aberrant probe sets. • Remove probe sets with poorly separated clusters. • Drop calls falling between two observed clusters. • Only consider polymorphisms with at least one S288c-specific probe. • Compute concordance rate between the two methods.
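The final step, a concordance rate over calls that survive filtering in both methods, can be sketched as below. The representation is an assumption made for illustration: calls are lists aligned by polymorphism, with `None` marking a call removed by any of the filters, and the example calls are invented.

```python
def concordance(calls_a, calls_b):
    """Fraction of agreeing calls among positions called by both methods.

    None marks a filtered call. Returns None if no position is shared.
    """
    shared = [(a, b) for a, b in zip(calls_a, calls_b)
              if a is not None and b is not None]
    if not shared:
        return None
    return sum(a == b for a, b in shared) / len(shared)

# Hypothetical calls: "S" for the S288c-like allele, "Y" for the YJM789 allele
ssc_calls = ["S", "S", "Y", None, "Y", "S"]
snpscanner_calls = ["S", "Y", "Y", "Y", None, "S"]

rate = concordance(ssc_calls, snpscanner_calls)
```

Only four of the six positions are called by both methods here, so the rate is computed over those four.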
Definitions • Marker and inter-marker intervals (IMIs). • Recombination event intervals: • Conversions: midpoints of IMIs immediately beyond genotype change. • Crossovers: midpoints of IMIs immediately before return to 2:2 ratio.
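As a simplified illustration of placing events at inter-marker interval midpoints, the sketch below scans the calls of a single segregant along one chromosome and reports the midpoint of every IMI across which the genotype changes. It is deliberately narrower than the tetrad-level definitions above, which depend on the segregation ratio across all four spores; the function name, positions, and calls are hypothetical.

```python
def genotype_change_midpoints(positions, calls):
    """Midpoints of inter-marker intervals across which the call changes.

    positions: marker coordinates in increasing order.
    calls: genotype call per marker; None marks a filtered call, which is skipped.
    """
    kept = [(pos, c) for pos, c in zip(positions, calls) if c is not None]
    midpoints = []
    for (p1, c1), (p2, c2) in zip(kept, kept[1:]):
        if c1 != c2:
            midpoints.append((p1 + p2) / 2)
    return midpoints

# Hypothetical marker positions (bp) and calls for one segregant
positions = [100, 250, 400, 900, 1200]
calls = ["S", "S", "Y", None, "S"]

events = genotype_change_midpoints(positions, calls)
```

Skipping filtered calls widens the affected interval (here 400 to 1200), so the midpoint estimate becomes less precise wherever marker density drops.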
High marker density adjustment (crossovers) [Figure: chromosome axis marked at -M, -w, 0, +w, +M.] • An inter-marker interval $I$ at $[-w, w]$, centered at 0. • Chromosome at $[-M, M]$, with . • $Y_j$: symmetric extension of recombination interval, given a DSB at $j$. • Assuming that two recombination intervals cannot overlap $I$, • For crossovers, “involvement” is equivalent to detection.