Next-generation sequencing – the informatics angle

Gabor T. Marth, Boston College Biology Department. AGBT 2008, Marco Island, FL, February 6, 2008

T1. Roche / 454 FLX system • pyrosequencing technology • variable read-length • the only new technology with >100bp reads • tested in many published applications • supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer • fixed-length short-read sequencer • read properties are very close to those of traditional capillary sequences • very low INDEL error rate • tested in many published applications • paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system • fixed-length short-read sequencer • employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy • requires color-space informatics • published applications underway / in review • paired-end read protocols support up to 10kb separation size

2-base encoding (color assigned to each pair of adjacent bases):

                2nd Base
                A  C  G  T
1st Base    A   0  1  2  3
            C   1  0  3  2
            G   2  3  0  1
            T   3  2  1  0
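The 2-base encoding on this slide is equivalent to XOR-ing 2-bit base codes (A=0, C=1, G=2, T=3). The following is a minimal sketch of that idea, not code from any SOLiD toolkit; the function names `encode_colors` and `decode_colors` are hypothetical.

```python
# Illustrative sketch only; names are hypothetical.
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode_colors(seq):
    """Return the color (0-3) of each pair of adjacent bases in seq."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    seq = [first_base]
    for color in colors:
        # XOR-ing the previous base code with the color gives the next base.
        seq.append(BASES[CODE[seq[-1]] ^ color])
    return "".join(seq)


if __name__ == "__main__":
    reference = encode_colors("ATGGC")
    with_snp = encode_colors("ATCGC")  # one base changed
    print(reference, with_snp)
    print(decode_colors("A", reference))
```

A single base change alters two adjacent colors, whereas a single color measurement error alters only one. That difference is what makes the encoding usable for error reduction and SNP calling.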

T4. Helicos / Heliscope system • experimental short-read sequencer system • single molecule sequencing • no amplification • variable read-length • error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs 1. sequence alignment 2. dealing with non-unique mapping 3. looking for allelic differences

A2. Structural variation detection • copy number (for amplifications, deletions) from depth of read coverage • structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
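Copy number from depth of read coverage can be sketched as a window-based depth ratio. This is a simplified illustration under stated assumptions, not the method of any particular tool: fixed-size windows, the median window depth taken as the normal (diploid) state, and hypothetical function names.

```python
from collections import Counter
from statistics import median


def window_depths(read_starts, region_length, window=1000):
    """Count read start positions falling in each fixed-size window."""
    n_windows = (region_length + window - 1) // window
    counts = Counter(
        pos // window for pos in read_starts if 0 <= pos < region_length
    )
    return [counts.get(i, 0) for i in range(n_windows)]


def copy_number_estimates(depths, ploidy=2):
    """Scale each window depth by the median depth.

    Assumption: the median window represents the normal copy number.
    """
    baseline = median(depths)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")
    return [ploidy * depth / baseline for depth in depths]
```

Windows with an estimate near 1 would suggest a heterozygous deletion and windows near 3 or more an amplification, provided coverage is deep and even enough for the ratio to be meaningful.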

A3. Identification of protein-bound DNA (CHIP-SEQ) • chromatin structure (Mikkelsen et al. Nature 2007) • transcription factor binding sites (Robertson et al. Nature Methods, 2007) [Figure: aligned reads shown along the genome sequence]

A4. Novel transcript discovery (genes) • novel transcripts in known genes • novel genes / exons [Figure: panels labeled Known exon 1 / Known exon 2 and Inferred exon 1 / Inferred exon 2]

A5. Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

A6. Expression profiling by tag counting (Jones-Rhoades et al. PLoS Genetics, 2007) [Figure: aligned reads shown over two genes]
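Tag counting can be sketched as counting aligned reads per gene and scaling by the total. The per-million normalization and the name `tags_per_million` are assumptions made for illustration; the slide does not specify a normalization.

```python
from collections import Counter


def tags_per_million(read_genes):
    """Convert per-read gene assignments into tags per million assigned reads.

    read_genes: the gene each aligned read was assigned to, or None if the
    read could not be assigned to a gene.
    """
    counts = Counter(gene for gene in read_genes if gene is not None)
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}
```

Scaling by the total lets samples sequenced to different depths be compared, which is why representational biases matter so much for counting applications.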

A7. De novo organismal genome sequencing (Lander et al. Nature 2001) [Figure labels: short reads; read pairs; longer reads; assembled sequence contigs]

C1. Read length • 20-35 bp (variable) • 25-35 bp (fixed) • 25-40 bp (fixed) • ~250 bp (variable) [Figure: read-length ranges plotted on an axis of read length [bp], 0 to 300]

When does read length matter? • longer reads are needed where one must use parts of reads for mapping: de novo sequencing; novel transcript discovery • short reads are often sufficient where the entire read length can be used for mapping: SNPs, short-INDELs, SVs; CHIP-SEQ; short RNA discovery; counting (mRNA, miRNA) [Figure: example reads aacttagacttaca, gacttacatacgta and accgattactatacta shown with Known exon 1 and Known exon 2]

C2. Read error rate • error rate dictates how many errors the aligner should tolerate • the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned • in applications where the specific alleles are also essential, error rate is even more important • error rate is typically 0.4 - 1%
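The link between error rate and aligner tolerance can be sketched with a binomial model. This is an illustration under a simplifying assumption, namely independent errors at a uniform per-base rate (real error rates grow with each cycle, as the next slide notes); the function name is hypothetical.

```python
from math import comb


def fraction_within_tolerance(read_length, error_rate, max_errors):
    """Probability that a read carries at most max_errors sequencing errors.

    Assumes independent errors at a uniform per-base rate.
    """
    return sum(
        comb(read_length, k)
        * error_rate**k
        * (1 - error_rate) ** (read_length - k)
        for k in range(max_errors + 1)
    )


if __name__ == "__main__":
    # 35 bp reads (an assumed length) at the two ends of the 0.4 - 1% range.
    for rate in (0.004, 0.01):
        for tolerated in (0, 1, 2):
            frac = fraction_within_tolerance(35, rate, tolerated)
            print(f"rate={rate:.3f} tolerated={tolerated} fraction={frac:.4f}")
```

Under this model roughly 70% of 35 bp reads are error-free at a 1% error rate, so an aligner that tolerates no mismatches would discard a substantial share of the data.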

C3. Error rate grows with each cycle • this phenomenon limits useful read length

C4. Substitutions vs. INDEL errors • where INDEL errors dominate: gapped alignment necessary; good SNP discovery accuracy; short-INDEL discovery difficult • where substitution errors dominate: SNP discovery may require higher coverage for allele confirmation; INDELs can be discovered with very high confidence!

C5. Quality values are important for allele calling • PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles • inaccurate or poorly calibrated base quality values hinder allele calling • Q-values should be accurate … and high!
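The standard PHRED relationship between a quality value Q and the error probability p is Q = -10 log10(p). A minimal sketch of the conversion in both directions, with hypothetical function names:

```python
from math import log10


def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value implied by an error probability (p > 0)."""
    return -10 * log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

A base called at Q20 is expected to be wrong 1 time in 100, and at Q30 1 time in 1000, which is why high and accurate Q-values make a mismatch far more likely to be a true alternate allele.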

Quality values should be well-calibrated • the assigned base quality value should be calibrated to represent the actual base quality in every sequencing cycle
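One way to check calibration is to compare assigned quality values with the observed error rate, cycle by cycle. The sketch below assumes that mismatches against a trusted reference have already been scored as errors; the input layout and the name `empirical_quality` are hypothetical.

```python
from collections import defaultdict
from math import log10


def empirical_quality(observations):
    """Observed PHRED quality for each (cycle, assigned Q) combination.

    observations: iterable of (cycle, assigned_q, is_error) tuples.
    Returns a dict keyed by (cycle, assigned_q); the value is None when no
    errors were observed, since the empirical quality is then unbounded.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cycle, assigned_q, is_error in observations:
        totals[(cycle, assigned_q)] += 1
        errors[(cycle, assigned_q)] += bool(is_error)
    return {
        key: (-10 * log10(errors[key] / n) if errors[key] else None)
        for key, n in totals.items()
    }
```

If bases assigned Q20 in a late cycle show an empirical quality near 10, the assigned values overstate accuracy there and would need recalibration before allele calling.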

C6. Representational biases / library complexity • fragmentation biases • PCR amplification biases • sequencing biases [Figure: after sequencing, some fragments have low/no representation and others high representation]

Dispersal of read coverage • this affects variation discovery (deeper starting read coverage is needed) • its major impact is on counting applications
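One simple way to quantify dispersal of read coverage is the variance-to-mean ratio of per-window read counts. This measure is an assumption chosen for illustration rather than something the slide prescribes; under ideal random (Poisson) sampling the ratio is close to 1.

```python
from statistics import mean, pvariance


def dispersion_index(depths):
    """Variance-to-mean ratio of per-window read counts."""
    average = mean(depths)
    if average == 0:
        raise ValueError("mean depth is zero; dispersion is undefined")
    return pvariance(depths) / average
```

Values well above 1 indicate that some regions are heavily over-represented and others under-represented, so more total coverage is needed before every region reaches a usable depth.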

Amplification errors • an early amplification error gets propagated onto every clonal copy • many reads come from clonal copies of a single fragment • early PCR errors in “clonal” read copies lead to false positive allele calls

C7. Paired-end reads • circularization: fragment length 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007) • fragment amplification: fragment length 100 - 600 bp; fragment length limited by amplification efficiency • paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
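Rescuing a non-uniquely mapping end with the fragment length constraint can be sketched as follows. This is a simplified illustration: it ignores strand, orientation and read length, works on a single chromosome, and uses the hypothetical name `rescue_mate`.

```python
def rescue_mate(anchor_pos, candidate_positions, min_fragment, max_fragment):
    """Pick the candidate position consistent with the fragment length range.

    anchor_pos: position of the uniquely mapped end.
    candidate_positions: all map positions of the ambiguous end.
    Returns the single consistent position, or None if zero or several
    candidates fit the constraint.
    """
    consistent = [
        pos
        for pos in candidate_positions
        if min_fragment <= abs(pos - anchor_pos) <= max_fragment
    ]
    return consistent[0] if len(consistent) == 1 else None
```

For example, with one end anchored at position 10,000 and a 100 - 600 bp fragment range, candidates at 500, 10,300 and 250,000 leave only 10,300 as a consistent placement.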

Paired-end reads for SV discovery • longer fragments tend to have wider fragment length distributions • SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std) • longer fragments increase the chance of spanning SV breakpoints and/or entire events
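The effect of the fragment length distribution width can be worked through numerically. The sketch assumes a 3 kb mean fragment length (the sweet spot quoted for circularization) and a detection cutoff of 3 standard deviations for a single read pair; both values and the function names are assumptions for illustration, and real callers typically combine evidence from several pairs.

```python
def deletion_z_score(deletion_size, mean_fragment, std_fraction):
    """Standard deviations by which a deletion-spanning pair's mapped
    distance exceeds the expected fragment length."""
    return deletion_size / (std_fraction * mean_fragment)


def is_detectable(deletion_size, mean_fragment, std_fraction, z_cutoff=3.0):
    """True if the excess mapped distance reaches the assumed cutoff."""
    z = deletion_z_score(deletion_size, mean_fragment, std_fraction)
    return z >= z_cutoff


if __name__ == "__main__":
    for std_fraction in (0.10, 0.30):
        z = deletion_z_score(2000, 3000, std_fraction)
        print(std_fraction, round(z, 1), is_detectable(2000, 3000, std_fraction))
```

Under these assumptions a 2 kb deletion stands out by about 6.7 standard deviations at 10% std but only about 2.2 at 30% std, consistent with the slide's statement that it would be detected in the first case and missed in the second.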

C8. Technologies / properties / applications

Thanks Michael Stromberg MOSAIK talk Thursday, 7:40PM Chip Stewart Michele Busby Aaron Quinlan Damien Croteau-Chonka Eric Tsung Derek Barnett Weichun Huang http://bioinformatics.bc.edu/marthlab Michael Egholm David Bentley Francisco de la Vega Kristen Stoops Ed Thayer Clive Brown Elaine Mardis
