Next-generation sequencing: informatics & software aspects

Gabor T. Marth, Boston College Biology Department. Harvard Nanocourse, October 7, 2009

New sequencing technologies…

… offer vast throughput … & many applications [Figure: bases per machine run (1 Mb to 100 Gb) vs. read length (10 bp to 1,000 bp). Illumina/Solexa, AB/SOLiD sequencers: 10-50 Gb in 25-100 bp reads. Roche/454 pyrosequencer: 100 Mb-1 Gb in 200-450 bp reads. ABI capillary sequencer.]

Genome resequencing for variation discovery: SNPs, short INDELs, structural variations • the most immediate application area

Genome resequencing for mutational profiling Organismal reference sequence • likely to change “classical genetics” and mutational analysis

De novo genome sequencing Lander et al. Nature 2001 • difficult problem with short reads • promising, especially as reads get longer

Identification of protein-bound DNA • Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007) • Transcription factor binding sites (Robertson et al. Nature Methods 2007) • DNA methylation (Meissner et al. Nature 2008) • natural applications for next-gen. sequencers

Transcriptome sequencing: transcript discovery Mortazavi et al. Nature Methods 2008 Ruby et al. Cell, 2006 • high-throughput, but short reads pose challenges

Transcriptome sequencing: expression profiling Cloonan et al. Nature Methods, 2008 Jones-Rhoades et al. PLoS Genetics, 2007 • high-throughput, short-read sequencing should make a major impact, and could potentially replace expression microarrays

… & enable personal genome sequencing

The re-sequencing informatics pipeline • (i) base calling • (ii) read mapping • (iii) SNP and short INDEL calling • (iv) SV calling • (v) data viewing, hypothesis generation [Figure labels: REF, IND]

The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers

Base error characteristics vary [Figure panels: 454, Illumina]

Read lengths vary [Figure: read length [bp], axis 0 to 400, one bar per platform: 25-60 (variable), 25-50 (fixed), 25-100 (fixed), ~200-450 (variable)]

Sequence traces are machine-specific Base calling is increasingly left to machine manufacturers

Representational biases

Fragment duplication

2. Read mapping Read mapping is like a jigsaw: …you get the pieces… and they give you the picture on the box. Unique pieces are easier to place than others…

Multiply-mapping reads • “Traditional” repeat masking does not capture repeats at the scale of the read length • Reads from repeats cannot be uniquely mapped back to their true region of origin
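A minimal sketch of why reads from repeats cannot be placed uniquely: it indexes every substring of read length in a toy reference and reports all exact-match positions of a read. The reference string, the reads, and the function names are made up for illustration; real mappers tolerate mismatches and use far more compact indexes.

```python
from collections import defaultdict

def build_index(reference, read_length):
    """Map every substring of length read_length to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - read_length + 1):
        index[reference[start:start + read_length]].append(start)
    return index

def map_read(read, index):
    """Return all exact-match positions of a read (empty list if none)."""
    return index.get(read, [])

# Toy reference that contains the 6-mer "ACGTAC" twice.
reference = "TTACGTACGGCCAACGTACTT"
index = build_index(reference, 6)

unique_hits = map_read("GGCCAA", index)   # one position: uniquely mappable
repeat_hits = map_read("ACGTAC", index)   # two positions: origin is ambiguous
```

A read with more than one hit is exactly the "multiply-mapping" case: sequence alone cannot say which copy it came from.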

Dealing with multiple mapping

Paired-end (PE) reads • fragment length: 1 – 10 kb • fragment length: 100 – 600 bp • PE reads are now the standard for whole-genome short-read sequencing (Korbel et al. Science 2007)

Gapped alignments (for INDELs)
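To illustrate what a gapped alignment buys for INDELs, here is a minimal sketch of textbook global alignment (Needleman-Wunsch with a linear gap penalty). It is not the algorithm of any particular mapper; the scoring values and the function name are assumptions chosen for the example, and production mappers typically use local alignment with affine gap penalties.

```python
def global_align(ref, read, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_ref, aligned_read) for a global gapped alignment."""
    n, m = len(ref), len(read)

    def sub(i, j):
        # Substitution score for ref[i-1] vs read[j-1].
        return match if ref[i - 1] == read[j - 1] else mismatch

    # Fill the dynamic-programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a_ref, a_read = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a_ref.append(ref[i - 1]); a_read.append(read[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a_ref.append(ref[i - 1]); a_read.append("-"); i -= 1   # deletion in read
        else:
            a_ref.append("-"); a_read.append(read[j - 1]); j -= 1  # insertion in read
    return score[n][m], "".join(reversed(a_ref)), "".join(reversed(a_read))

# The read lacks one T relative to the reference (a 1-bp deletion).
score, aligned_ref, aligned_read = global_align("ACGTTACG", "ACGTACG")
```

An ungapped mapper would report a run of mismatches downstream of the deletion; the gapped alignment instead explains the read with a single gap.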

The MOSAIK read mapper Michael Strömberg • gapped mapper • option to report multiple map locations • aligns 454, Illumina, SOLiD, Helicos reads • works with standard file formats (SRF, FASTQ, SAM/BAM)

Alignment post-processing • quality value re-calibration • duplicate fragment removal
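Duplicate fragment removal can be sketched as follows, assuming that fragments with identical chromosome, start, end, and strand are copies of one original molecule and that the copy with the highest quality score is the one to keep. The dictionary fields and the `qual` score are hypothetical names for this example, not the format of any specific tool.

```python
def remove_duplicates(fragments):
    """Keep the highest-quality fragment for each (chrom, start, end, strand)."""
    best = {}
    for frag in fragments:
        key = (frag["chrom"], frag["start"], frag["end"], frag["strand"])
        # Replace the stored fragment only if this copy has a better score.
        if key not in best or frag["qual"] > best[key]["qual"]:
            best[key] = frag
    return list(best.values())

fragments = [
    {"chrom": "chr1", "start": 100, "end": 350, "strand": "+", "qual": 30},
    {"chrom": "chr1", "start": 100, "end": 350, "strand": "+", "qual": 37},  # duplicate
    {"chrom": "chr1", "start": 120, "end": 360, "strand": "+", "qual": 25},
]
kept = remove_duplicates(fragments)
```

Removing duplicates this way keeps amplification artifacts from being counted as independent evidence for an allele.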

Data storage requirements

Alignment visualization • too much data – indexed browsing • too much detail – color coding, show/hide

SNP calling: old problem, new data [Figure labels: sequencing error, polymorphism]

Allele calling in next-gen data • New technologies are well suited to accurate SNP calling, and some also to short-INDEL detection [Figure panels: SNP, INS]
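One common way to separate sequencing errors from true alleles is to score each diploid genotype by the likelihood of the observed bases given their quality values. The sketch below is a generic illustration of that idea, not the method of any caller named in these slides; it assumes Phred-scaled qualities, independent reads, and errors spread evenly over the three other bases.

```python
import math
from itertools import combinations_with_replacement

def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)

def genotype_log_likelihoods(bases, quals):
    """Log-likelihood of the pileup under each of the 10 diploid genotypes."""
    result = {}
    for a1, a2 in combinations_with_replacement("ACGT", 2):
        log_l = 0.0
        for b, q in zip(bases, quals):
            e = phred_to_error(q)
            p1 = 1 - e if b == a1 else e / 3   # read drawn from allele 1
            p2 = 1 - e if b == a2 else e / 3   # read drawn from allele 2
            log_l += math.log(0.5 * p1 + 0.5 * p2)
        result[a1 + a2] = log_l
    return result

def call_genotype(bases, quals):
    """Return the maximum-likelihood genotype (no prior applied)."""
    likelihoods = genotype_log_likelihoods(bases, quals)
    return max(likelihoods, key=likelihoods.get)

het_call = call_genotype("AAAAGGGG", [30] * 8)    # balanced alleles
error_call = call_genotype("AAAAAAAG", [30] * 7 + [5])  # one low-quality G
```

A full caller would multiply these likelihoods by genotype priors; the sketch stops at the likelihood step.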

Trio sequencing • the child inherits one chromosome from each parent • there is a small probability for a de novo (germ-line or somatic) mutation in the child
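The inheritance rule above gives a simple consistency check: a child genotype is explainable without a de novo event only if one allele can come from each parent. A minimal sketch, with genotypes written as two-letter strings (an encoding assumed for this example):

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child can receive one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

consistent = is_mendelian_consistent("AG", "AA", "GG")   # A from mother, G from father
violation = is_mendelian_consistent("AG", "AA", "AA")    # G is in neither parent
```

A violation flags either a candidate de novo mutation or, far more often, a genotyping error in one of the three samples.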

Standard data formats SRF/FASTQ GLF/VCF SAM/BAM
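As an illustration of the read-level format, the sketch below parses four-line FASTQ records and decodes the quality string. It assumes the Phred+33 (Sanger) encoding and well-formed four-line records; other encodings were in use, so the offset is an assumption, as is the function name.

```python
def parse_fastq(lines, offset=33):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue                      # skip blank lines between records
        seq = next(it).strip()
        next(it)                          # '+' separator line
        qual = next(it).strip()
        # Each quality character encodes one Phred value as ASCII minus offset.
        yield header[1:], seq, [ord(c) - offset for c in qual]

records = list(parse_fastq(["@read1", "ACGT", "+", "II5!"]))
```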

Human genome polymorphism projects common SNPs

Human genome polymorphism discovery

The 1000 Genomes Project: Pilot 1, Pilot 2, Pilot 3

1000G Pilot 3 – exon sequencing • Targets: 1K genes / 10K targets • Capture: solid / liquid phase • Sequencing: 454 / Illumina; SE / PE • Data producers: Baylor, Broad, Sanger, Wash. U. • Informatics methods: multiple read mapping & SNP calling programs

Coverage varies

On/off target capture [Figure labels: Target region; SNP (outside target region); ref allele*: 45%; non-ref allele*: 54%]

Fragment duplication – revisited

Reference allele bias • ref allele*: 54% • non-ref allele*: 45% • (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
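A measurement of this kind can be sketched by pooling read counts over known heterozygous sites and comparing the reference-allele fraction with the 50% expected without bias. The counts below are hypothetical, not the NA07346 data, and the z-score uses a normal approximation to a Binomial(n, 0.5) null, which is an assumption of this example.

```python
import math

def ref_allele_bias(site_counts):
    """site_counts: (ref_reads, non_ref_reads) per heterozygous site.

    Returns the pooled reference-allele fraction and a z-score against 0.5.
    """
    ref = sum(r for r, _ in site_counts)
    total = sum(r + a for r, a in site_counts)
    fraction = ref / total
    # Under no bias the ref count has mean total/2 and variance total/4.
    z = (ref - total / 2) / math.sqrt(total / 4)
    return fraction, z

counts = [(11, 9), (16, 14), (27, 23)]   # hypothetical per-site read counts
fraction, z_score = ref_allele_bias(counts)
```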

SNP calling findings • based on a method comparison / testing exercise • 80 samples drawn from the 4 Centers • read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)

Overlap between call sets (Broad calls, BC calls)
# SNP calls:    413      452      2,296
# dbSNPs:       24       172      1,862
% dbSNPs:       5.81%    38.05%   81.10%
Ts/Tv ratio:    0.23     1.21     3.40
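The % dbSNPs figures follow from the SNP-call and dbSNP counts, and Ts/Tv is the ratio of transitions (A<->G, C<->T) to transversions among the calls. The sketch below recomputes the percentages from the counts on the slide; the Ts/Tv helper is shown on a made-up list of substitutions, since the individual calls are not given here.

```python
# (number of SNP calls, number already in dbSNP) for the three call subsets
call_sets = [(413, 24), (452, 172), (2296, 1862)]
percent_dbsnp = [round(100 * known / total, 2) for total, known in call_sets]

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """snps: list of (ref_base, alt_base). Assumes at least one transversion."""
    ts = sum(1 for snp in snps if snp in TRANSITIONS)
    tv = len(snps) - ts
    return ts / tv

example_ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

A low dbSNP fraction combined with a low Ts/Tv ratio is the usual sign that a subset of calls is enriched for false positives.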

The 1000G Structural Variation Discovery Effort

Structural variation detection (Feuk et al. Nature Reviews Genetics, 2006)

SV detection – resolution [Figure: relative numbers of events vs. CNV event length [bp]; curves for expected CNVs and the ranges detectable by karyotype, micro-array, and sequencing]

Detection Approaches • Read Depth: good for big CNVs • Paired-end: all types of SV • Split-Reads: good break-point resolution • deNovo Assembly: ~ the future [Figure labels: Reference, Sample, Lmap, read, contig] SV slides courtesy of Chip Stewart, Boston College
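The paired-end approach can be sketched as a test on the mapped fragment length (Lmap): a pair whose mates map much farther apart than the library's fragment length suggests a deletion in the sample, and a pair that maps much closer together suggests an insertion. The library mean, standard deviation, and the three-standard-deviation cutoff below are assumptions for illustration only.

```python
def classify_pair(l_map, lib_mean, lib_sd, n_sd=3.0):
    """Label one read pair by its mapped fragment length (Lmap)."""
    if l_map > lib_mean + n_sd * lib_sd:
        return "deletion candidate"    # mates map farther apart than expected
    if l_map < lib_mean - n_sd * lib_sd:
        return "insertion candidate"   # mates map closer together than expected
    return "concordant"

# Hypothetical library: mean fragment length 400 bp, standard deviation 30 bp.
labels = [classify_pair(l, 400, 30) for l in (395, 1900, 250, 460)]
```

In practice a single discordant pair is weak evidence; callers look for clusters of pairs that support the same event.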

Read depth (RD)

Statistical & systematic biases

Single molecule sequencing? GC Bias Coverage bias
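One generic way to reduce GC-related coverage bias is to rescale each window's read count by the typical count of windows with similar GC content. The sketch below is an assumed, simplified version of that idea (binning GC fraction to one decimal and using medians), not the correction used in these slides.

```python
from collections import defaultdict
from statistics import median

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_corrected_counts(windows):
    """windows: list of (sequence, read_count); bin medians must be non-zero."""
    bins = defaultdict(list)
    for seq, count in windows:
        bins[round(gc_fraction(seq), 1)].append(count)
    overall = median(count for _, count in windows)
    # Scale each count so that every GC bin has the same median depth.
    return [count * overall / median(bins[round(gc_fraction(seq), 1)])
            for seq, count in windows]

# Toy data: GC-rich windows are over-represented two-fold.
windows = [("ATATATATAT", 50), ("TTAATTAATT", 50),
           ("GCGCGCGCGC", 100), ("GGCCGGCCGG", 100)]
corrected = gc_corrected_counts(windows)
```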

RD resolution [Figure: Illumina observed read counts (per kb) vs. expected read counts (per kb), density on a log10 scale, with bands for CN = 1, CN = 2, CN = 3]
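The relation between observed and expected read counts and copy number can be sketched as a simple ratio: a diploid region yields the expected count, so copy number is roughly twice the observed-to-expected ratio. The function name and the rounding to the nearest integer are assumptions of this example; real data are noisy enough that a statistical model is needed to separate adjacent copy-number states.

```python
def estimate_copy_number(observed, expected, ploidy=2):
    """Estimate integer copy number from read counts per window."""
    return round(ploidy * observed / expected)

# Hypothetical read counts per kb, with an expected diploid count of 100.
copy_numbers = [estimate_copy_number(obs, 100) for obs in (48, 100, 150)]
```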
