Fine mapping of recombination in S. cerevisiae
Wolfgang Huber
EMBL - EBI
• The maths of marker genotyping: sensitivity, specificity, data QA/QC
• Event classification: cross-overs, conversions… and weirdness
• Event rates: biological significance
Single-reporter methods
• De novo polymorphism detection
  • Winzeler et al., Science 281, 1998 (and others): ANOVA, testing for equal mean intensity between the strains.
  • Borevitz et al., Genome Research 13, 2003: moderated t-test (SAM).
  • Brem et al., Science 296, 2002: moderated t-test, then cluster all data (parental and segregant) and discard SFPs for which the clusters don't separate the parental data.
• Segregant genotyping (using polymorphisms)
  • Use the estimated posterior probability of class membership (uniform prior on the classes).
  • Brem et al. augment this: the class parameters are estimated from clustered data.
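The posterior-probability rule for a single reporter can be sketched as follows. This is a minimal illustration, not the code of any of the cited papers: it assumes Gaussian class-conditional densities, and the function names and the intensity values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_class1(x, mean0, sd0, mean1, sd1):
    """Posterior probability of class 1 under a uniform prior on the classes.

    With equal priors, the prior terms cancel and only the two
    class-conditional densities remain (Gaussian here, by assumption).
    """
    p0 = gaussian_pdf(x, mean0, sd0)
    p1 = gaussian_pdf(x, mean1, sd1)
    return p1 / (p0 + p1)

# Hypothetical log2 intensities of one reporter in three segregants.
x = np.array([8.0, 9.0, 10.0])
post = posterior_class1(x, mean0=8.0, sd0=0.5, mean1=10.0, sd1=0.5)
```

An intensity halfway between two classes of equal spread gets a posterior of 0.5, which is the kind of call a later filter would treat as ambiguous.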
But we have multiple reporters per SNP: probe sets
Probe set: a set of reporters that exactly and uniquely map to a location and interrogate one polymorphism.
Reporters:
1: GGAGGACTGGCCCTAACTTCACTAT
2: GACTGGCCCTAACTTCACTATTTGT
3: GACTGGCCCTAACTTGACTATTTGT
4: GGCCCTAACTTCACTATTTGTACAG
5: CTAACTTCACTATTTGTACAGATCG
6: CTTCACTATTTGTACAGATCGCAAT
Genomic sequence:
S96: CCTCCTGACCGGGATTGAAGTGATAAACATGTCTAGCGTTA
YJM789: CCTCCTGACCGGGATTGAACTGATAAACATGTCTAGCGTTA
Multivariate analysis of probe set data: parallel coordinate plots
[Figure: parallel coordinate plots; axes: log2 intensity, reporters in probe set]
Multivariate methods
SNPScanner: Gresham et al., Science 311, 2006:
• Model probe intensity x_i with and without presence of a SNP as a function of
  • probe GC content
  • position of the SNP within the probe
  • nucleotides surrounding the SNP
• Fit model parameters using two sequenced strains with known SNPs.
• To genotype a segregant or new strain at a given base, compute a likelihood ratio.
• Assumption: the covariance matrix is diagonal and the same for both genotypes.
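Under the stated assumption (diagonal covariance, identical for both genotypes), the likelihood ratio reduces to a difference of variance-scaled squared distances. The sketch below illustrates only that arithmetic; it is not the SNPScanner implementation. The predicted means would come from the fitted intensity model, and the function name and all numbers are hypothetical.

```python
import numpy as np

def log_likelihood_ratio(x, mean_ref, mean_snp, var):
    """log[ p(x | SNP) / p(x | no SNP) ] for the probes covering one base.

    Assumes independent Gaussian probes (diagonal covariance) with the same
    per-probe variances under both genotypes, so the normalising constants
    cancel and only the squared distances to the two predicted means remain.
    """
    x, mean_ref, mean_snp, var = (
        np.asarray(a, dtype=float) for a in (x, mean_ref, mean_snp, var)
    )
    ll_snp = -0.5 * np.sum((x - mean_snp) ** 2 / var)
    ll_ref = -0.5 * np.sum((x - mean_ref) ** 2 / var)
    return ll_snp - ll_ref

# Hypothetical predicted log2 intensities for three overlapping probes.
mean_ref = [10.0, 10.0, 10.0]   # no SNP: perfect match
mean_snp = [8.0, 8.0, 8.0]      # SNP present: reduced hybridisation
var = [0.25, 0.25, 0.25]

llr_snp = log_likelihood_ratio([8.1, 7.9, 8.2], mean_ref, mean_snp, var)
llr_ref = log_likelihood_ratio([10.0, 9.8, 10.1], mean_ref, mean_snp, var)
```

A positive value favours the SNP, a negative value the reference sequence. Because the probes are treated as independent, correlated neighbouring probes contribute to the sum as if each were separate evidence.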
But
• neighbouring probes' data are not independent
• variances for the two genotypes are often quite different
• training data are often not representative
• the likelihood ratio test generates too many false positives
This motivates a generalized multi-probe method.
GTS (genotyping by semi-supervised clustering)
An instance of the EM algorithm applied to multivariate Gaussian mixture modeling: simultaneously estimate class shapes and object class membership.
R package ss.genotyping
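The idea of semi-supervised EM for a two-class multivariate Gaussian mixture can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the code of the ss.genotyping package: class membership of the parental arrays is held fixed, class priors are taken as uniform, each class gets its own full covariance matrix, and a small ridge term (an assumption added for numerical stability) is put on the diagonal. All names and the simulated data are hypothetical.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Row-wise log density of a multivariate normal distribution."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def semi_supervised_em(X, labels, n_iter=50, ridge=1e-6):
    """Two-class Gaussian mixture fitted by EM with some labels known.

    X:      (arrays x reporters) intensities for one probe set
    labels: 0 or 1 for arrays of known genotype (parents), -1 for unknown
    Returns an (arrays x 2) matrix of class membership probabilities.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    known = np.flatnonzero(labels >= 0)
    unknown = np.flatnonzero(labels < 0)
    # Unknown arrays start with zero weight, so the first M-step
    # uses the parental arrays only.
    resp = np.zeros((n, 2))
    resp[known, labels[known]] = 1.0
    for _ in range(n_iter):
        # M-step: weighted mean and covariance ("class shape") per class.
        weights = resp.sum(axis=0)
        means = (resp.T @ X) / weights[:, None]
        logp = np.empty((n, 2))
        for k in range(2):
            diff = X - means[k]
            cov = (resp[:, k, None] * diff).T @ diff / weights[k]
            logp[:, k] = log_gauss(X, means[k], cov + ridge * np.eye(d))
        # E-step: update memberships of the unknown arrays only.
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
        resp[unknown] = post[unknown]
    return resp

# Simulated example: 13 + 12 parental arrays, 40 segregants, 3 reporters.
rng = np.random.default_rng(0)
truth = np.array([0] * 13 + [1] * 12 + [0, 1] * 20)
centers = np.array([[8.0, 8.0, 8.0], [11.0, 11.0, 11.0]])
X = centers[truth] + rng.normal(scale=0.3, size=(truth.size, 3))
labels = np.where(np.arange(truth.size) < 25, truth, -1)
resp = semi_supervised_em(X, labels)
calls = resp.argmax(axis=1)
```

Because the segregants contribute to the class means and covariances from the second iteration on, the fitted class shapes are not tied to the parental arrays alone.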
Filtering
• Probe sets
  • Aberrant probe sets
  • Weakly separating probe sets
  • Imbalanced probe sets
• Genotype calls
  • Ambiguous individual genotype calls (z)
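Two of these filters can be sketched from a matrix of posterior probabilities. The definitions used here are assumptions for illustration only: a call is "ambiguous" when its largest posterior falls below a threshold, and a probe set is "imbalanced" when the fraction of one genotype among the called segregants is far from one half. The function name and both thresholds are hypothetical.

```python
import numpy as np

def filter_calls(post1, min_confidence=0.99, max_imbalance=0.25):
    """post1: (probe sets x segregants) posterior probability of genotype 1.

    Returns integer calls (0, 1, or -1 for "no call") and a boolean vector
    marking the probe sets that pass the balance check.
    """
    post1 = np.asarray(post1, dtype=float)
    calls = np.where(post1 >= 0.5, 1, 0)
    confidence = np.maximum(post1, 1.0 - post1)
    calls[confidence < min_confidence] = -1        # ambiguous individual calls
    n_called = (calls >= 0).sum(axis=1)
    n_one = (calls == 1).sum(axis=1)
    # Fraction of genotype 1 among called segregants; NaN if nothing was called.
    frac1 = np.where(n_called > 0, n_one / np.maximum(n_called, 1), np.nan)
    balanced = np.abs(frac1 - 0.5) <= max_imbalance
    return calls, balanced

# Hypothetical posteriors: two probe sets, four segregants.
post1 = [[0.999, 0.001, 0.6, 0.9999],
         [0.999, 0.9995, 0.9999, 0.998]]
calls, balanced = filter_calls(post1)
```

In this toy input the third call of the first probe set is dropped as ambiguous, and the second probe set is flagged because every segregant receives the same genotype.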
Benchmark: SNPScanner vs GTS
• 233 Affymetrix yeast tiling arrays from the Steinmetz group:
  • 13 S288, 12 YJM789: training data
  • 52 tetrads of crosses: to be genotyped
• Same post-processing/filter
GTS vs SNPScanner
[Figure: axes: arrays, genomic position (markers)]
Three adjacent cross-overs involving three chromosomes (chr 1, wt_47)
A cross-over plus two long conversions, involving all four chromosomes (chr 3, wt_19)
Three adjacent conversions involving three chromosomes (chr 3, wt_38)
Cross-over accompanied by multiple conversions (chr 4, wt_36)
Event classification
• An automatic algorithm takes tetrad-level genotype traces and assigns them to events: cross-over, conversion, complex cross-over, complex conversion, ...
• R package recombination.genotyping
• Still need manual curation: we are just beginning to understand the spectrum of possible event types!
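A much simplified first pass over one tetrad can be sketched as follows. It is not the algorithm of the recombination.genotyping package and does not attempt to recognise complex events. It assumes 0/1 genotype calls for the four spores at ordered markers, treats runs of markers that deviate from 2:2 segregation as candidate conversion tracts, and treats a change of spore pattern between consecutive 2:2 markers as a candidate cross-over. The function name and the example matrix are hypothetical.

```python
import numpy as np

def scan_tetrad(geno):
    """geno: (4 spores x markers) array of 0/1 genotype calls on one chromosome.

    Returns (tracts, crossovers):
      tracts      list of (first_marker, last_marker) runs deviating from 2:2
      crossovers  list of (left_marker, right_marker) pairs of 2:2 markers
                  between which the spore pattern changes
    """
    geno = np.asarray(geno)
    is_2to2 = geno.sum(axis=0) == 2
    tracts, start = [], None
    for m, ok in enumerate(is_2to2):
        if not ok and start is None:
            start = m                      # a non-2:2 run begins
        elif ok and start is not None:
            tracts.append((start, m - 1))  # the run ended at the previous marker
            start = None
    if start is not None:
        tracts.append((start, geno.shape[1] - 1))
    mendelian = np.flatnonzero(is_2to2)
    crossovers = [
        (int(a), int(b))
        for a, b in zip(mendelian[:-1], mendelian[1:])
        if not np.array_equal(geno[:, a], geno[:, b])
    ]
    return tracts, crossovers

# Hypothetical tetrad: 3:1 segregation at marker 1, exchange between markers 2 and 3.
geno = [[0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]]
tracts, crossovers = scan_tetrad(geno)
```

Grouping nearby tracts and cross-overs into single events, and deciding which combinations count as complex, is the part that still calls for the curation mentioned above.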
Genetic Interactions
Genotypes at pairs of loci on different chromosomes are unlinked, but the population shows evidence of selection.
[Figure legend: over-representation, under-representation]
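One way to look for such over- or under-representation is a chi-square test of independence on the 2 x 2 table of genotype combinations at two loci. The sketch below is an assumption about how this could be done, not a description of the analysis behind the slide. It also ignores that the four spores of a tetrad are not independent observations, so the p-value is only indicative. The function name and the genotype vectors are hypothetical.

```python
import math
import numpy as np

def pairwise_association(g1, g2):
    """Chi-square test of independence between 0/1 genotypes at two loci.

    Returns the observed and expected 2 x 2 tables, the chi-square
    statistic and its p-value (1 degree of freedom).
    """
    g1, g2 = np.asarray(g1), np.asarray(g2)
    observed = np.array(
        [[np.sum((g1 == a) & (g2 == b)) for b in (0, 1)] for a in (0, 1)],
        dtype=float,
    )
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    chi2 = float(np.sum((observed - expected) ** 2 / expected))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return observed, expected, chi2, p_value

# Hypothetical segregant genotypes at two loci on different chromosomes.
g1 = np.array([0] * 50 + [1] * 50)
g2 = np.array([0] * 40 + [1] * 10 + [0] * 10 + [1] * 40)
observed, expected, chi2, p_value = pairwise_association(g1, g2)
```

Cells where the observed count exceeds the expected count correspond to over-represented genotype combinations, and cells below expectation to under-represented ones.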
Acknowledgements
• EMBL HD
  • Lars Steinmetz
  • Julien Gagneur
  • Zhenyu Xu
  • Sandra Clauder-Münster
  • Fabiana Perocchi
  • Wu Wei
• EBI
  • Elin Axelsson
  • Ligia Bras
  • Alessandro Brozzi
  • Tony Chiang
  • Audrey Kauffmann
  • Paul McGettigan
  • Greg Pau
  • Oleg Sklyar
  • Mike Smith
  • Jörn Tödling
  • Jitao Zhang
• Richard Bourgon
• Eugenio Mancera Ramos
• The contributors to the R and Bioconductor projects
Summary
• Semi-supervised clustering is natural given the experimental structure.
• Parental data are often not a faithful indicator of offspring behavior! Supervised classification may experience problems for some polymorphisms.
• The multivariate Gaussian model is adequate.
• EM works well when data behave as expected, but this is not always the case. Hence the importance of fit diagnostics, QA/QC, and post-processing filters.
Outlook
• Hotspots, conversion/crossover ratio, sizes, spacing and interference.
• Msh4 mutant data (deficient in the putative interference-generating pathway): how do interference patterns change?
• Unanticipated polymorphism detection (de novo in segregants; in unsequenced strains)