Informatics tools for next-generation sequence analysis

Gabor T. Marth, Boston College Biology Department. University of Michigan, October 20, 2008

Next-gen. sequencers offer vast throughput
[Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp)]
• Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
• 454 pyrosequencer (100-400 Mb in 200-450 bp reads)
• ABI capillary sequencer

Next-gen sequencing enables new applications
• organismal resequencing & de novo sequencing
• transcriptome sequencing for transcript discovery and expression profiling (Ruby et al. Cell, 2006; Jones-Rhoades et al. PLoS Genetics, 2007)
• epigenetic analysis, e.g. DNA methylation (Meissner et al. Nature 2008)

Large-scale individual human resequencing

Technologies

Roche / 454 system • pyrosequencing technology • variable read-length • the only new technology with >100bp reads

Illumina / Solexa Genome Analyzer • fixed-length short-read sequencer • very high throughput • read properties are very close to traditional capillary sequences

AB / SOLiD system
• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
2-base encoding (color for each 1st base / 2nd base pair):
1st base \ 2nd base   A  C  G  T
A                     0  1  2  3
C                     1  0  3  2
G                     2  3  0  1
T                     3  2  1  0
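The 2-base encoding on this slide can be sketched in a few lines. This is a minimal illustration, not the vendor's software: it assumes the standard convention that each color is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), and the function names `encode_colors` and `decode_colors` are hypothetical.

```python
# Assumed 2-bit base codes; the color of a dinucleotide is the XOR of the two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {v: k for k, v in BASE_CODE.items()}

def encode_colors(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode_colors(first_base, colors):
    """Rebuild the base sequence from the first base and the color calls."""
    bases = [first_base]
    for c in colors:
        # XOR is its own inverse: next base code = previous base code XOR color.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ c])
    return "".join(bases)

first, colors = encode_colors("ATGGCA")
decoded = decode_colors(first, colors)
```

Because every decoded base depends on the previous one, a single wrong color call changes all downstream bases when decoding naively, which is one reason color-space data is usually analyzed in color space.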

Helicos / Heliscope system • short-read sequencer • single molecule sequencing • no amplification • variable read-length • error rate reduced with 2-pass template sequencing

Data characteristics

Read length [bp] by platform: 20-60 (variable); 25-50 (fixed); 25-70 (fixed); ~200-450 (variable)

Representational biases • “dispersed” coverage distribution • this affects genome resequencing (deeper starting read coverage is needed) • will have a major impact on counting applications

Amplification errors • an early amplification error is propagated into every clonal copy • many reads come from clonal copies of a single fragment • early PCR errors in “clonal” read copies lead to false positive allele calls

Read quality

Error rate (Illumina)

Error rate (454)

Per-read errors (Solexa)

Per read errors (454)

Base quality values not well calibrated

Tools for genome resequencing

The resequencing informatics pipeline (reads from an individual, IND, analyzed against the reference, REF): (i) base calling (ii) read mapping (iii) read assembly (iv) SNP and short INDEL calling (v) SV calling (vi) data validation, hypothesis generation

The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

1. Base calling • diverse chemistry & sequencing error profiles • output: base sequence and base quality (Q-value) sequence
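For reference, the relationship between a base quality value and an error probability can be written out as below. The sketch assumes the Q-values are Phred-scaled (Q = -10 log10 P_error), which is the usual convention but is not stated on the slide; the function names are illustrative.

```python
import math

def error_prob_to_q(p_error):
    """Phred-scaled quality for a given base-call error probability."""
    return -10.0 * math.log10(p_error)

def q_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10.0 ** (-q / 10.0)

q30 = error_prob_to_q(0.001)   # one expected error in 1,000 base calls
p20 = q_to_error_prob(20)      # Q20 corresponds to a 1% error probability
```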

454 pyrosequencer error profile • multiple bases in a homopolymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs
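To make the "single scalar signal" problem concrete, here is a toy sketch of inferring the homopolymer length from one flow signal. It is not the method used by any particular base caller: the Gaussian signal model, the flat prior, and the parameters `sd_base` and `sd_per_base` are all assumptions chosen for illustration.

```python
import math

def homopolymer_posterior(signal, max_len=8, sd_base=0.15, sd_per_base=0.05):
    """Posterior over homopolymer length n = 0..max_len for one flow signal.

    Assumed model: signal ~ Normal(mean=n, sd=sd_base + sd_per_base * n),
    with a flat prior over n. The noise grows with n, so long runs are
    harder to resolve.
    """
    weights = []
    for n in range(max_len + 1):
        sd = sd_base + sd_per_base * n
        z = (signal - n) / sd
        # Gaussian density up to a constant factor shared by every n.
        weights.append(math.exp(-0.5 * z * z) / sd)
    total = sum(weights)
    return [w / total for w in weights]

clear = homopolymer_posterior(2.1)      # signal close to an integer
ambiguous = homopolymer_posterior(4.5)  # signal halfway between 4 and 5
best_clear = max(range(len(clear)), key=clear.__getitem__)
```

A signal near 2 gives a confident call, while a signal halfway between 4 and 5 splits the probability between the two lengths; a wrong choice there shows up as an inserted or deleted base.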

454 base quality values • the native 454 base caller assigns base quality values that are too low

PYROBAYES: determine base number

PYROBAYES: Performance • assigned quality values predict measured error rate better • higher fraction of bases are high quality

Base quality value calibration
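One simple way to check calibration is to group base calls by their assigned quality value and compare it with the quality implied by the observed error rate. The sketch below assumes Phred-scaled values and assumes that errors have already been identified (for example by aligning reads to a trusted reference); the function name and the add-one smoothing are illustrative choices.

```python
import math
from collections import defaultdict

def empirical_quality(observations):
    """Map each assigned Q-value to the Q-value implied by its observed error rate.

    observations: iterable of (assigned_q, is_error) pairs, one per base call.
    """
    counts = defaultdict(lambda: [0, 0])  # assigned Q -> [bases, errors]
    for q, is_error in observations:
        counts[q][0] += 1
        counts[q][1] += int(is_error)
    table = {}
    for q, (n_bases, n_errors) in sorted(counts.items()):
        rate = (n_errors + 1) / (n_bases + 2)  # add-one smoothing avoids log10(0)
        table[q] = -10.0 * math.log10(rate)
    return table

# Toy data: 998 bases assigned Q10, of which only 9 are actually wrong.
observations = [(10, i < 9) for i in range(998)]
table = empirical_quality(observations)
```

In this toy data the bases labelled Q10 behave like Q20 bases, so the assigned values are too pessimistic and a recalibration table would map 10 to about 20.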

Recalibrated base quality values (Illumina)

2. Read mapping • Read mapping is like doing a jigsaw puzzle… • …you get the pieces… • …and they give you the picture on the box • Unique pieces are easier to place than others…

Non-uniqueness of reads confounds mapping • RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length • Reads from repeats cannot be uniquely mapped back to their true region of origin
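The effect of read length on uniqueness can be illustrated by counting how many read-length substrings of a reference occur exactly once. This is a toy sketch on a made-up sequence: it considers the forward strand only and exact matches only, and the function name is hypothetical.

```python
from collections import Counter

def unique_fraction(reference, read_len):
    """Fraction of start positions whose read_len-mer occurs exactly once."""
    kmers = [reference[i:i + read_len]
             for i in range(len(reference) - read_len + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

reference = "ACGTACGTTTGACCA"            # made-up sequence containing a short repeat
short_reads = unique_fraction(reference, 4)
long_reads = unique_fraction(reference, 8)
```

With 4-base reads the repeated ACGT cannot be placed uniquely, while 8-base reads span the repeat and every position becomes unique.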

Strategies to deal with non-unique mapping
• mapping to multiple loci requires the assignment of alignment probabilities (figure example: one read with candidate locations at probabilities 0.8, 0.19 and 0.01)
• non-unique read mapping: optionally either report only uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
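Alignment probabilities such as the 0.8 / 0.19 / 0.01 example on this slide can be obtained by normalizing per-locus alignment likelihoods. The sketch below assumes equal prior probability for every candidate locus and uses a Phred-scaled mapping quality, a common convention that the slide does not specify; the likelihood values and function names are invented for illustration.

```python
import math

def alignment_probabilities(likelihoods):
    """Normalize per-locus alignment likelihoods into probabilities (flat prior)."""
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

def mapping_quality(probs, cap=60):
    """Phred-scaled probability that the best location is wrong (assumed convention)."""
    p_wrong = 1.0 - max(probs)
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))

# Hypothetical likelihoods of one read at three candidate loci.
probs = alignment_probabilities([8e-4, 1.9e-4, 1e-5])
mapq = mapping_quality(probs)
```

A caller that only wants uniquely mapped reads can filter on the mapping quality, while one that reports all locations can weight each placement by its probability.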

Paired-end reads help unique read placement
PE (paired-end) • fragment amplification: fragment length 100 - 600 bp • fragment length limited by amplification efficiency
MP (mate-pair) • circularization: 500 bp - 10 kb (sweet spot ~3 kb) • fragment length limited by library complexity (Korbel et al. Science 2007)
• PE reads are now the standard for genome resequencing

MOSAIK

INDEL alleles/errors – gapped alignments 454

Aligning multiple read types together (ABI/capillary, 454 FLX, 454 GS20, Illumina) • Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics

Aligner speed

3. Polymorphism / mutation detection • sequencing error vs. polymorphism

New challenges for SNP calling • deep alignments of 100s / 1,000s of individuals • trio sequences

Rare alleles in 100s / 1,000s of samples

Allele discovery is a multi-step sampling process: Population → Samples → Reads

Capturing the allele in the sample
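The first sampling step can be quantified directly: an allele at population frequency f is present in a sample of n diploid individuals with probability 1 - (1 - f)^(2n), assuming chromosomes are drawn independently from the population. The function name below is illustrative.

```python
def capture_probability(allele_freq, n_individuals, ploidy=2):
    """P(at least one copy of the allele is present among the sampled chromosomes)."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes

rare_in_100 = capture_probability(0.01, 100)    # 1% allele, 100 individuals
rare_in_1000 = capture_probability(0.01, 1000)  # same allele, 1,000 individuals
```

A 1% allele is present in roughly 87% of 100-person samples, so rare alleles can be missed before any sequencing is done.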

Allele calling in the reads • sample size • individual read coverage • base call • base quality

Allele calling in deep sequence data
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
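A first-pass question for a column like the one on this slide (2 C calls among 14 reads) is how likely that many non-reference calls would be if all of them were sequencing errors. The sketch below uses a plain binomial tail with one fixed per-base error rate; that error rate and the function name are assumptions, and a real caller would use per-base quality values instead.

```python
from math import comb

def prob_errors_alone(n_reads, n_alt, error_rate):
    """P(at least n_alt non-reference calls | every such call is a sequencing error)."""
    return sum(
        comb(n_reads, k) * error_rate ** k * (1.0 - error_rate) ** (n_reads - k)
        for k in range(n_alt, n_reads + 1)
    )

reads = (["aatgtagtaAgtacctac"] * 4
         + ["aatgtagtaCgtacctac"] * 2
         + ["aatgtagtaAgtacctac"] * 8)
column = [read[9] for read in reads]   # the upper-case position in each read
n_alt = column.count("C")
p_value = prob_errors_alone(len(reads), n_alt, error_rate=0.01)
```

Under a 1% error rate, two or more miscalls in 14 reads happen less than 1% of the time, so the C allele is a plausible variant. This treats every error as producing the observed alternate base, which overstates the error explanation and is therefore conservative.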

More samples or deeper coverage / sample? • Shallower read coverage from more individuals… • …or deeper coverage from fewer samples? • simulation analysis by Aaron Quinlan

Analysis indicates a balance

SNP calling in trios • the child inherits one chromosome from each parent • there is a small probability for a mutation in the child
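The two statements above translate into a prior on the child's genotype given the parents' genotypes. The sketch below is a simplified biallelic model: each parent transmits one of its two alleles with probability 1/2, and the transmitted allele flips to the other allele with probability `mutation_rate`. The value 1e-8 is only a placeholder, and the function names are hypothetical.

```python
from itertools import product

def transmit(genotype):
    """Each of a parent's two alleles is passed on with probability 1/2."""
    return {allele: genotype.count(allele) / 2 for allele in set(genotype)}

def child_genotype_probs(father, mother, mutation_rate=1e-8, alleles="AC"):
    """P(child genotype | parental genotypes) at a biallelic site."""
    probs = {}
    for (fa, pf), (ma, pm) in product(transmit(father).items(),
                                      transmit(mother).items()):
        for f_out, m_out in product(alleles, repeat=2):
            p = pf * pm
            # An allele either arrives unchanged or mutates to the other allele.
            p *= (1.0 - mutation_rate) if f_out == fa else mutation_rate
            p *= (1.0 - mutation_rate) if m_out == ma else mutation_rate
            genotype = "".join(sorted(f_out + m_out))
            probs[genotype] = probs.get(genotype, 0.0) + p
    return probs

het_by_hom = child_genotype_probs("AC", "AA")
hom_by_hom = child_genotype_probs("AA", "AA")
```

A child's apparent heterozygous call with two homozygous-reference parents is a priori far more likely to be a sequencing error than a new mutation, which is why joint calling across the trio helps.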

SNP calling in trios
[Figure labels: P=0.86, P=0.79]
father, mother:
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
child:
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

Determining genotype directly from sequence
individual 1 (A/C):
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 2 (C/C):
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 3 (A/A):
AACGTTAGCATA
AACGTTAGCATA
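The three individuals on this slide can be genotyped with a small likelihood calculation. This is a sketch under simple assumptions, not a specific tool's algorithm: a biallelic A/C site, one fixed per-base error rate, independent reads, and a flat prior over the three genotypes. The function names are illustrative.

```python
from math import comb

def genotype_likelihoods(n_a, n_c, error_rate=0.01):
    """P(observed read counts | genotype) for a biallelic A/C site."""
    n = n_a + n_c
    # Probability that a single read shows 'A' under each genotype.
    p_a = {"A/A": 1.0 - error_rate, "A/C": 0.5, "C/C": error_rate}
    return {g: comb(n, n_a) * p ** n_a * (1.0 - p) ** n_c for g, p in p_a.items()}

def call_genotype(n_a, n_c, error_rate=0.01):
    """Most probable genotype and its posterior probability under a flat prior."""
    lik = genotype_likelihoods(n_a, n_c, error_rate)
    total = sum(lik.values())
    best = max(lik, key=lik.get)
    return best, lik[best] / total

individual_1 = call_genotype(n_a=2, n_c=2)
individual_2 = call_genotype(n_a=0, n_c=4)
individual_3 = call_genotype(n_a=2, n_c=0)
```

The calls agree with the slide (A/C, C/C, A/A), but individual 3 has only two reads, so its homozygous call carries noticeably less confidence than the four-read call for individual 2: a heterozygote would show two identical reads a quarter of the time.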

4. Structural variation discovery
