Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?
Jen Taylor, Bioinformatics Team, CSIRO Plant Industry
CSIRO. Newton Meeting July 2010 - Sequence coverage
Assumptions
• Every k-mer has an equal chance of being sequenced
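If every k-mer really is equally likely to be sequenced, read starts are uniform along the genome, mean depth is (number of reads × read length) / genome length, and per-base depth is approximately Poisson. The sketch below simulates that baseline so observed read density can be compared against it. The function names and the toy sizes are illustrative assumptions, not part of the talk.

```python
import math
import random


def expected_depth(n_reads, read_len, genome_len):
    """Mean per-base depth if read starts are uniform over the genome."""
    return n_reads * read_len / genome_len


def simulate_depth(n_reads, read_len, genome_len, seed=0):
    """Per-base depth for reads whose start positions are drawn uniformly."""
    rng = random.Random(seed)
    diff = [0] * (genome_len + 1)  # difference array: +1 at start, -1 past end
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        diff[start] += 1
        diff[start + read_len] -= 1
    depth, running = [], 0
    for d in diff[:genome_len]:
        running += d
        depth.append(running)
    return depth


def poisson_uncovered_fraction(mean_depth):
    """Fraction of bases expected to have zero depth under a Poisson model."""
    return math.exp(-mean_depth)


if __name__ == "__main__":
    depth = simulate_depth(n_reads=2000, read_len=36, genome_len=36000)
    observed_zero = sum(d == 0 for d in depth) / len(depth)
    mean = expected_depth(2000, 36, 36000)
    print(mean, observed_zero, poisson_uncovered_fraction(mean))
```

Large departures of real read density from this baseline are what the following slides examine.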
Read density
Deviations from Assumptions?
Impacts on read coverage - Outline
• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
  • Mismatches
  • Multiple read mappings
• Hamming edit distances and k-mer space
Assumptions: Digestion
[Figure panels: Illumina, SOLiD. Source: http://seq.molbiol.ru/sch_lib_fr.html]
ChIPSeq
[Figure labels: MNase Linker Digest; Remove Nucleosomes; Sequence & Align]
ChIPSeq - Nucleosome
• Sample: MNase digested, size fractionated
• Control: MNase digested, random sizes
araTha9 Aligned Reads: 36-mer Monomer Composition
araTha9 Aligned Reads: 5’ +/- 16 bp Monomer Composition
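A monomer composition profile of this kind can be computed by tallying base frequencies at each position across equal-length sequences (aligned reads, or genomic windows centred on the read 5’ end). This is a minimal sketch; `position_composition` is a hypothetical helper and the input is assumed to be a list of plain strings.

```python
from collections import Counter


def position_composition(seqs):
    """Return, for each position, the fraction of A, C, G and T across seqs.

    All sequences must have the same length. Bases other than A/C/G/T
    (for example N) count toward the total but are not reported.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    counts = [Counter() for _ in range(length)]
    for seq in seqs:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for i, base in enumerate(seq.upper()):
            counts[i][base] += 1
    total = len(seqs)
    return [{b: c[b] / total for b in "ACGT"} for c in counts]


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGA", "TCGA", "ACGT"]
    for pos, freqs in enumerate(position_composition(toy_reads)):
        print(pos, freqs)
```

A flat profile is what the equal-chance assumption predicts; a skew near the 5’ end would be consistent with digestion site preference.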
MNase Site Preferencing (Flick et al., J. Mol. Biology 1986)
araTha9 Control MNase Site Preferencing
ChIPSeq
[Figure labels: Sequence & Align; MNase Digest; Remove Nucleosomes]
araTha9 Control MNase Site Preferencing
Nucleosome potentials – Read Density
[Figure: normalised read density against base coordinate; scale bar 1 Kb]
Nucleosome potentials
[Figure: MNase potential and normalised read density]
Nucleosome potentials
[Figure: MNase potential and normalised read density]
Nucleosome potential
MNase biases aiding interpretation?
• Can aid identification in a local sequence?
• Dependent upon local sequence context
• Cautionary tale about analysing sequence contexts of ChIPSeq data
• Nucleotide composition analyses must take digestion site preferences into account
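One way to take digestion preference into account is to compare k-mer frequencies around sample read ends with the same frequencies in a background, such as the MNase-digested control or the genome itself. The sketch below computes a log2 enrichment per k-mer. The function names, the pseudocount and the choice of background are assumptions for illustration, not the method used in the talk.

```python
import math
from collections import Counter


def kmer_frequencies(seqs, k):
    """Relative frequency of each k-mer (A/C/G/T only) across seqs."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def log2_enrichment(observed, background, pseudo=1e-6):
    """log2(observed / background) per k-mer, with a small pseudocount."""
    kmers = set(observed) | set(background)
    return {
        kmer: math.log2(
            (observed.get(kmer, 0.0) + pseudo) / (background.get(kmer, 0.0) + pseudo)
        )
        for kmer in kmers
    }


if __name__ == "__main__":
    cut_sites = ["AT", "AT", "TA", "AT"]   # toy windows at read 5' ends
    control = ["ATTA"]                     # toy background sequence
    scores = log2_enrichment(
        kmer_frequencies(cut_sites, 2), kmer_frequencies(control, 2)
    )
    for kmer in sorted(scores, key=scores.get, reverse=True):
        print(kmer, round(scores[kmer], 2))
```

K-mers that are enriched in both sample and control point to enzyme preference rather than to a property of the nucleosome sample.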
Impacts on read coverage - Outline
• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
  • Mismatches
  • Multiple read mappings
• Hamming edit distances and k-mer space
Hamming Edit Distances
• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k
• Computed for all possible k-mers (k = 36, 65) in the Arabidopsis genome
  • All vs. all, both strands
  • Minimum HE distance
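The definition above can be sketched directly. The brute-force version below is quadratic in the number of k-mers, so it only suits toy sequences; a genome-scale all-vs-all comparison needs indexing that is not shown here. Skipping the comparison of a k-mer with its own position on the opposite strand is an assumption of this sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a, b):
    """Number of substitutions needed to turn a into b (equal lengths)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_hamming_distances(genome, k):
    """For each k-mer start, the minimum HE distance to any other k-mer
    position in the genome, considering both strands."""
    genome = genome.upper()
    fwd = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    rev = [reverse_complement(kmer) for kmer in fwd]
    result = []
    for i, kmer in enumerate(fwd):
        best = k  # the distance between two k-mers can never exceed k
        for j in range(len(fwd)):
            if j == i:
                continue  # assumption: a site is not compared with itself
            best = min(best, hamming(kmer, fwd[j]), hamming(kmer, rev[j]))
        result.append(best)
    return result


if __name__ == "__main__":
    print(min_hamming_distances("ACGTTGCAAGGCTA", 4))
```

A minimum distance of 0 marks a k-mer that is not unique in the genome; small non-zero values mark k-mers that a mismatch-tolerant aligner can confuse with another site.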
Arabidopsis Minimum Hamming Edit Distances: 36-mer
Alignment issues
[Figure: series for hg18, dm3, araTha9, ce6 and sacCer6; axis ticks 0 to 14]
Alignment artefacts: aligner properties
Breakdown of sequencing run
Hamming edits and Ubsalign: HE difference
Read                     Genome sequence               HE
AGATTAGCCTGGTACTGCTA     ...AGCTTAGCCTGGTACTGGTA...    2
AGATTAGCCTGGTACTGCTA     ...AGCTTAGCCGGGTACTGGTA...    3
AGATTAGCCTGGTACTGCTA     No Alignment
Hamming edits and Ubsalign: HE difference
Read                     Genome sequence               HE
AGATTAGCCTGGTACTGCTA     ...AGATTAGCCTGGTACTGCTA...    0
AGATTAGCCTGGTACTGCTA     ...AGCTTAGCCGGGTACTGCTA...    2
AGATTAGCCTGGTACTGCTA     No Alignment
Hamming edits and Ubsalign: HE difference
Read                     Genome sequence               HE
AGATTAGCCTGGTACTGCTA     ...AGCTTAGCCTGGTACTGCTA...    1
AGATTAGCCTGGTACTGCTA     ...AGCTTAGCCGGGTTCTGGTA...    4
AGATTAGCCTGGTACTGCTA     Alignment !
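One possible reading of these three examples is that a read is reported only when its best candidate site beats the runner-up by a minimum HE differential. The sketch below encodes that reading. The rule itself, the function name `call_alignment` and the threshold of 3 are assumptions chosen to be consistent with the slide examples; they are not a description of how Ubsalign works.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def call_alignment(read, candidates, min_differential=3):
    """Pick a site for read, or None if the choice is ambiguous.

    candidates is a list of (label, site_sequence) pairs. The best site is
    returned only if the second-best site is at least min_differential
    substitutions worse (an assumed rule, see the note above).
    """
    scored = sorted((hamming(read, seq), label) for label, seq in candidates)
    if not scored:
        return None
    if len(scored) == 1:
        return scored[0][1]
    (best_he, best_label), (second_he, _) = scored[0], scored[1]
    if second_he - best_he >= min_differential:
        return best_label
    return None


if __name__ == "__main__":
    read = "AGATTAGCCTGGTACTGCTA"
    examples = [
        [("site1", "AGCTTAGCCTGGTACTGGTA"), ("site2", "AGCTTAGCCGGGTACTGGTA")],
        [("site1", "AGATTAGCCTGGTACTGCTA"), ("site2", "AGCTTAGCCGGGTACTGCTA")],
        [("site1", "AGCTTAGCCTGGTACTGCTA"), ("site2", "AGCTTAGCCGGGTTCTGGTA")],
    ]
    for candidates in examples:
        print(call_alignment(read, candidates))
```

With HE pairs of (2, 3), (0, 2) and (1, 4), only the last pair clears a differential of 3, which matches the outcomes shown on the slides.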
Testing Aligner Accuracy
• Simulated reads
  • Known correct location
  • 25 million, 50 million
  • Perfect match, up to 5 mismatches, up to 10 mismatches
  • Errors biased toward the 3’ end
• Numbers of:
  • Correctly aligned reads
  • Incorrectly aligned reads
  • Unalignable reads
• Speed
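A test harness along these lines needs two pieces: a read simulator that records the true origin of each read, and a tally that compares an aligner's calls with that truth. The sketch below shows both. The linear 3’ weighting, the function names and the dictionary layout are assumptions; the talk does not specify its error model.

```python
import random


def simulate_read(genome, read_len, max_mismatches, rng):
    """Draw one read with up to max_mismatches substitutions, placed
    preferentially toward the 3' end. Returns (true_start, read)."""
    start = rng.randrange(len(genome) - read_len + 1)
    bases = list(genome[start:start + read_len])
    n_err = rng.randint(0, min(max_mismatches, read_len))
    weights = [i + 1 for i in range(read_len)]  # assumed linear 3' bias
    positions = set()
    while len(positions) < n_err:
        positions.add(rng.choices(range(read_len), weights=weights)[0])
    for p in positions:
        bases[p] = rng.choice([b for b in "ACGT" if b != bases[p]])
    return start, "".join(bases)


def score_alignments(truth, calls):
    """Count correct, incorrect and unaligned reads.

    truth maps read id to true start; calls maps read id to the reported
    start, with missing or None entries treated as unaligned.
    """
    tally = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for read_id, true_start in truth.items():
        called = calls.get(read_id)
        if called is None:
            tally["unaligned"] += 1
        elif called == true_start:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally


if __name__ == "__main__":
    rng = random.Random(42)
    genome = "".join(rng.choice("ACGT") for _ in range(5000))
    truth = {}
    for n in range(5):
        start, read = simulate_read(genome, 36, 5, rng)
        truth[f"read{n}"] = start
        print(start, read)
```

In a repetitive genome a read can match equally well at a site other than its origin, so "incorrect" here means "not the simulated origin", which is a stricter standard than "not a best match".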
Alignment artefacts: Managing mismatch thresholds
Alignment artefacts: Managing mismatch thresholds
How does this affect interpretation?
• Incorporation of edit differentials
• Leads to gains in the number of alignable reads
• Increased information
• Determination of the alignment
• Gains of 5-10% in mappable sites
• Hamming edit distributions provide useful information
Hamming distance variability
Read Deserts
Read Deserts
Sequence deserts
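A simple way to locate read deserts is to scan a per-base depth track for runs of zero coverage. This is a minimal sketch; `find_deserts` and the `min_length` cutoff are illustrative assumptions, and the talk does not state how its deserts were defined.

```python
def find_deserts(depth, min_length=1):
    """Return half-open (start, end) intervals where depth is zero for at
    least min_length consecutive bases."""
    deserts = []
    start = None
    for i, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = i  # a zero-depth run begins here
        elif start is not None:
            if i - start >= min_length:
                deserts.append((start, i))
            start = None
    # close a run that extends to the end of the track
    if start is not None and len(depth) - start >= min_length:
        deserts.append((start, len(depth)))
    return deserts


if __name__ == "__main__":
    toy_depth = [1, 0, 0, 0, 2, 0, 3, 0, 0]
    print(find_deserts(toy_depth, min_length=2))
```

Running the same scan on a track built only from uniquely mappable k-mers would help separate deserts caused by non-unique sequence from deserts that reflect the sample.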
Impacts on read coverage - Conclusions
• Sample preparation
  • MNase Digestion
  • Local biases present
• Alignment
  • Parameter choices
  • Mismatch thresholds: generally too low relative to the uniqueness of k-mers in the genome
  • Multiple read mappings: can drive ‘absence’ of mapped reads
• Hamming edit distances and k-mer space
  • K-mers have unique and genome-specific properties
  • Can be used to inform results of alignment
Acknowledgements
• CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James
• CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison
• CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis
• CMIS / TBCP: Paul Greenfield
Paired end data – sample preparation
[Figure labels: C, insert, G, insert, A, T]
Control and sample read density
[Figure panels: Control, Sample]