Data analysis methods for next-generation sequencing technologies

Gabor T. Marth, Boston College Biology Department. Epigenomics & Sequencing Meeting, July 14-15, 2008, Boston, MA

T1. Roche / 454 FLX system • pyrosequencing technology • variable read-length • the only new technology with >100bp reads • tested in many published applications • supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer • fixed-length short-read sequencer • read properties are very close to those of traditional capillary sequences • very low INDEL error rate • tested in many published applications • paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system • fixed-length short-read sequencer • employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy • requires color-space informatics • published applications underway / in review • paired-end read protocols support up to 10kb separation size

2-base encoding (rows: 1st base, columns: 2nd base)
      A  C  G  T
  A   0  1  2  3
  C   1  0  3  2
  G   2  3  0  1
  T   3  2  1  0
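A minimal sketch of the 2-base encoding idea, assuming the standard di-base color code in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names are illustrative and are not part of any vendor software. It shows why the encoding helps SNP calling: a true single-base substitution changes two adjacent colors, whereas an isolated color change is more likely a measurement error.

```python
# 2-bit codes chosen so that colour = code(first base) XOR code(second base).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Return the colour (0-3) of every adjacent base pair in seq."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and its colours."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def color_differences(ref, read):
    """Colour positions at which two equal-length sequences disagree."""
    pairs = zip(encode_colors(ref), encode_colors(read))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


if __name__ == "__main__":
    ref = "AACGTTAGCATA"
    snp = "AACGTTCGCATA"  # A -> C substitution at base index 6
    print(encode_colors(ref))
    print(color_differences(ref, snp))  # two adjacent colour positions
```

Because decoding chains through every color, one wrong color corrupts all downstream bases, which is one reason alignment is usually done in color space rather than after decoding.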

T4. Helicos / Heliscope system • experimental short-read sequencer system • single molecule sequencing • no amplification • variable read-length • error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs 1. sequence alignment 2. dealing with non-unique mapping 3. looking for allelic differences

A2. Structural variation detection • copy number (for amplifications, deletions) from depth of read coverage • structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
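The paired-end signal described above can be sketched as a simple classifier. This is only an illustration: it assumes a library in which the two ends normally map to opposite strands of the same chromosome, and `min_fragment` / `max_fragment` are hypothetical bounds on the expected fragment length that a real caller would estimate from the data.

```python
def classify_read_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                       min_fragment, max_fragment):
    """Label one mapped read pair by the structural event it suggests.

    Assumes concordant pairs map to the same chromosome, on opposite
    strands, between min_fragment and max_fragment bases apart.
    """
    if chrom1 != chrom2:
        return "translocation"      # ends land on different chromosomes
    if strand1 == strand2:
        return "inversion"          # unexpected relative orientation
    span = abs(pos2 - pos1)
    if span > max_fragment:
        return "deletion"           # sample lacks sequence present in reference
    if span < min_fragment:
        return "insertion"          # sample carries extra sequence
    return "concordant"


if __name__ == "__main__":
    print(classify_read_pair("chr1", 1000, "+", "chr1", 1300, "-", 100, 600))
    print(classify_read_pair("chr1", 1000, "+", "chr1", 9000, "-", 100, 600))
```

A single discordant pair is weak evidence; in practice several pairs supporting the same event would be required before reporting it.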

A3. Identification of protein-bound DNA (CHIP-SEQ) • chromatin structure (Mikkelsen et al. Nature 2007) • transcription factor binding sites (Robertson et al. Nature Methods, 2007) [figure: aligned reads along the genome sequence]

A4. Novel transcript discovery (genes) Mortazavi et al. Nature Methods

A5. Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

A6. Expression profiling by tag counting (Jones-Rhoades et al. PLoS Genetics, 2007) [figure: aligned reads counted per gene]

A7. De novo organismal genome sequencing (Lander et al. Nature 2001) [figure: short reads, read pairs and longer reads combined into assembled sequence contigs]

C1. Read length [bp] • 20-35 (variable) • 25-35 (fixed) • 25-40 (fixed) • ~200-450 (variable)

When does read length matter? • longer reads are needed where one must use parts of reads for mapping: de novo sequencing (e.g. overlapping reads aacttagacttaca and gacttacatacgta), novel transcript discovery (e.g. read accgattactatacta spanning known exon 1 and known exon 2) • short reads are often sufficient where the entire read length can be used for mapping: SNPs, short-INDELs, SVs; CHIP-SEQ; short RNA discovery; counting (mRNA, miRNA)

C2. Read error rate • error rate dictates the stringency of the read mapper • the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned • error rate typically 0.4 - 1%
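To make the link between error rate and mapper stringency concrete, here is a small sketch that computes the fraction of reads expected to carry at most `k` errors. It assumes errors are independent and uniform along the read (a simplification, since the error rate grows with each cycle), and the function name is illustrative.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read contains k or fewer base errors."""
    return sum(
        comb(read_length, i)
        * error_rate ** i
        * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


if __name__ == "__main__":
    # Per-base error rates in the 0.4 - 1% range quoted above, 35 bp reads.
    for rate in (0.004, 0.01):
        for k in (0, 1, 2):
            share = prob_at_most_k_errors(35, rate, k)
            print(f"rate={rate:.3f} k={k} fraction={share:.4f}")
```

A mapper that tolerates only `k` mismatches loses the remaining reads, while raising `k` makes more reference locations look like acceptable matches, which is the trade-off the slide describes.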

Error rate grows with each cycle • this phenomenon limits useful read length

Substitutions vs. INDEL errors

C3. Representational biases / library complexity • fragmentation biases • PCR amplification biases • sequencing biases • outcome ranges from low/no representation to high representation

Dispersal of read coverage • this affects variation discovery (deeper starting read coverage is needed) • it should have its major impact on counting applications
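One way to quantify dispersal of coverage is to compare observed depths with the idealized case in which reads land uniformly at random, so per-base depth is roughly Poisson. The sketch below rests on that assumption; both function names are illustrative.

```python
from math import exp, factorial
from statistics import mean, pvariance


def poisson_fraction_at_least(mean_coverage, min_depth):
    """Fraction of bases expected at depth >= min_depth under a Poisson model."""
    below = sum(
        exp(-mean_coverage) * mean_coverage ** k / factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


def dispersion_index(depths):
    """Variance-to-mean ratio of observed depths (about 1 for Poisson data)."""
    return pvariance(depths) / mean(depths)


if __name__ == "__main__":
    # Ideal case: share of bases with at least 4 reads at 10x mean coverage.
    print(poisson_fraction_at_least(10.0, 4))
    # Toy depths with much broader spread than a Poisson would give.
    print(dispersion_index([0, 0, 10, 10]))
```

A dispersion index well above 1 means more bases fall below any given depth threshold than the Poisson estimate predicts, which is why deeper starting coverage is needed.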

Amplification errors • an early amplification error gets propagated onto every clonal copy • many reads come from clonal copies of a single fragment • early PCR errors in “clonal” read copies lead to false positive allele calls

C4. Paired-end reads • circularization: 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007) • fragment amplification: fragment length 100 - 600 bp; fragment length limited by amplification efficiency • paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
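The "rescue" idea in the last bullet can be sketched as follows. This is an illustration only: it assumes one end maps uniquely (the anchor), the other end has several candidate positions on the same chromosome, and `min_fragment` / `max_fragment` are hypothetical bounds on the library's fragment length.

```python
def rescue_mate(anchor_pos, candidate_positions, min_fragment, max_fragment):
    """Pick the mate position consistent with the fragment length, if unique.

    Returns the single candidate whose distance from the uniquely mapped
    anchor lies within [min_fragment, max_fragment], or None when zero or
    several candidates satisfy the constraint.
    """
    consistent = [
        pos for pos in candidate_positions
        if min_fragment <= abs(pos - anchor_pos) <= max_fragment
    ]
    return consistent[0] if len(consistent) == 1 else None


if __name__ == "__main__":
    # Three repeat copies; only one is a plausible distance from the anchor.
    print(rescue_mate(1000, [1250, 50000, 900000], 100, 600))
```

Returning `None` when several candidates fit keeps the rescue conservative: the end stays ambiguous rather than being assigned arbitrarily.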

Technologies / properties / applications

Resequencing-based SNP discovery (i) base calling (ii) micro-repeat analysis (iii) read mapping (pair-wise alignment to genome reference) (iv) read assembly (v) SNP calling (vi) SNP validation (vii) data viewing, hypothesis generation [figure labels: REF, IND]

The “toolbox” • base callers • microrepeat finders • read mappers • SNP callers • structural variation callers • assembly viewers

Reference-sequence guided read mapping: …you get the pieces… AND they give you the cover on the box • some pieces are more unique than others

MOSAIK: an anchored aligner / assembler (Michael Stromberg) Step 1. initial short-hash scan for possible read locations Step 2. evaluation of candidate locations with the Smith-Waterman (SW) method
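The two steps above follow a general seed-and-evaluate pattern, which can be sketched in a few lines. This is not MOSAIK's implementation: the hash size `k`, the scoring values, the window padding and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Step 1a: hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1b: implied read start for every k-mer hit between read and reference."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            starts.add(ref_pos - offset)
    return sorted(starts)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def map_read(read, reference, index, k, pad=2):
    """Score each candidate window and return (best score, start) or None."""
    scored = []
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        scored.append((smith_waterman_score(read, reference[lo:hi]), start))
    return max(scored) if scored else None


if __name__ == "__main__":
    reference = "TTGACCGATTACAGGCTAAGCTTCGATCGGA"
    index = build_kmer_index(reference, 5)
    print(map_read("CGATTACTGGCT", reference, index, 5))  # read with one mismatch
```

The hash scan keeps the expensive dynamic-programming step confined to a handful of windows, and the padding around each window leaves room for short gaps.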

Non-unique mapping, gapped alignments 1. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented) 2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles

Read types aligned, paired-end read strategy 3. Aligns and co-assembles customary read types: ABI/capillary, Illumina/Solexa, AB/SOLiD, Roche/454, Helicos/Heliscope [figure: co-assembly of ABI/capillary, 454 FLX, 454 GS20 and Illumina reads] 4. Paired-end read alignments

Other mainstream read mappers • ELAND (Tony Cox, Illumina): the “official” read mapper supplied by Illumina, fast • MAQ (Heng Li + Richard Durbin, Sanger): the most widely used read mapper, low RAM footprint • SOAP (Beijing Genomics Institute): a new mapper developed for human next-gen reads • SHRIMP (Michael Brudno, University of Toronto): full Smith-Waterman

Speed

Polymorphism / mutation detection [figure labels: sequencing error, polymorphism]

Determining genotype directly from sequence
individual 1 (A/C): AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2 (C/C): AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3 (A/A): AACGTTAGCATA AACGTTAGCATA
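A minimal sketch of how a genotype might be inferred from the bases observed at one site, assuming a diploid individual, two candidate alleles, a uniform prior over the three genotypes, and a single fixed per-base error rate. These are simplifying assumptions for illustration, not the method used by the software presented here.

```python
def genotype_likelihoods(bases, allele_a, allele_b, error_rate=0.01):
    """Return P(observed bases | genotype) for a/a, a/b and b/b."""

    def p_base(base, allele):
        # A read shows the true allele unless a sequencing error occurred;
        # an error is assumed equally likely to yield any of the other 3 bases.
        return 1 - error_rate if base == allele else error_rate / 3

    likelihoods = {}
    genotypes = ((allele_a, allele_a), (allele_a, allele_b), (allele_b, allele_b))
    for first, second in genotypes:
        p = 1.0
        for base in bases:
            # Each read samples one of the two chromosomes with equal chance.
            p *= 0.5 * p_base(base, first) + 0.5 * p_base(base, second)
        likelihoods[f"{first}/{second}"] = p
    return likelihoods


def call_genotype(bases, allele_a, allele_b, error_rate=0.01):
    """Most likely genotype under a uniform prior."""
    likelihoods = genotype_likelihoods(bases, allele_a, allele_b, error_rate)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Bases seen at the variable site in the three individuals above.
    for name, bases in (("individual 1", "AACC"),
                        ("individual 2", "CCCC"),
                        ("individual 3", "AA")):
        print(name, call_genotype(bases, "A", "C"))
```

With only two reads, as for individual 3, the homozygous call is much less certain than with deeper coverage, since a heterozygote would show the same two bases about a quarter of the time.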

Software [screenshot labels: SNP, INS]

Data visualization • aid software development: integration of trace data viewing, fast navigation, zooming/panning • facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays • promote hypothesis generation: integration of annotation tracks Weichun Huang

Applications 1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster) 2. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans) 3. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis) (image from Nature Biotech.)

Our software is available for testing http://bioinformatics.bc.edu/marthlab/Beta_Release

Credits Elaine Mardis (Washington University) Andy Clark (Cornell University) Doug Smith (Agencourt) Research supported by: NHGRI (G.T.M.) BC Presidential Scholarship (A.R.Q.) Michael Stromberg Chip Stewart Michele Busby Aaron Quinlan Damien Croteau-Chonka Eric Tsung Derek Barnett Weichun Huang http://bioinformatics.bc.edu/marthlab

Accuracy • as is the case for all heuristic alignment algorithms, accuracy and speed are option- and parameter-dependent

C3. Quality values are important for allele calling • PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles • inaccurate or not well calibrated base quality values hinder allele calling • Q-values should be accurate … and high!
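For reference, the PHRED scale relates a quality value Q to an estimated error probability P by Q = -10 * log10(P). The short sketch below converts in both directions; the function names are illustrative.

```python
from math import log10


def phred_to_error_prob(q):
    """Estimated probability that a base with PHRED quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality corresponding to error probability p (0 < p <= 1)."""
    return -10 * log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

On this scale every 10 points is a tenfold drop in error probability, so a miscalibration of even a few points noticeably shifts how much weight an allele caller should give a discordant base.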

Software tools for next-gen sequence analysis

Next-generation sequencing technologies and applications
