High Throughput Sequencing: Technologies & Applications

High Throughput Sequencing: Technologies & Applications Michael Brudno CSC 2431 – Algorithms for HTS University of Toronto 06/01/2010

High Throughput Sequencers 100 Gb AB/SOLiDv3, Illumina/GAII short-read sequencers (10+Gb in 50-100 bp reads, >100M reads, 4-8 days) 10 Gb 454 GS FLX pyrosequencer 1 Gb (100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours) bases per machine run 100 Mb ABI capillary sequencer 10 Mb (0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours) 1 Mb 10 bp 100 bp 1,000 bp read length From Gabor Marth, BC

Sequencing chemistries DNA base extension DNA ligation Church, 2005

Massively parallel sequencing Church, 2005

Features of HTS data • Short (for now) sequence reads • 200-400bp: 454 (Roche) • 35-100bp Solexa(Illumina), SOLiD(AB) • Huge amount of sequence per run • Up to 10s of gigabases per run • Huge number of reads per run • Up to 100’s of millions • Higher error (compared with Sanger) • Different error profile

The Raw Data • Machine Readouts are different • Read length, accuracy, and error profiles are variable. • All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improves

454 Pyrosequencer error profile • multiple bases in a homo-polymeric run are incorporated in a single incorporation test  the number of bases must be determined from a single scalar signal the majority of errors are INDELs • error rates are nucleotide-dependent

Illumina/Solexa base accuracy • Error rate grows as a function of base position within the read • A large fraction of the reads contains 1 or 2 errors

AB SOLiD System dibase sequencing 2nd Base A C G T 0 1 2 3 A 1 0 3 2 1st Base C 2 3 0 1 G 3’ 3’ 3’ 5’ 5’ 5’ 1 3 2 0 T N N N A Tz z z N N N G Az z z N N N T Gz z z 2-base, 4-color: 16 probe combinations • 4 dyes to encode 16 2-base combinations • Detect a single color indicates 4 combinations & eliminates 12 • Each color reflects position, not the base call • Each base is interrogated by two probes • Dual interrogation eases discrimination • errors (random or systematic) vs. SNPs (true polymorphisms)

The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of two bases is known. Converting dibase (color) into letters 2nd Base A C G T 0 1 2 3 A 1 0 3 2 1st Base 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 2 3 3 AA AC AC AA AG AT AA AG AG CC CA CA CC CT CG CC CT CT GG GT GT GG GA GC GG GA GA TT TG TG TT TC TA TT TC TC A A C A A G C C T C C C A C C T A A G A G G T G G A T T C T T T G T T C G G A G C 2 3 0 1 G 1 3 2 0 T 4 Possible Sequences

SOLiD error checking code A C G G T C G T C G T G T G C G T A C G G T C G T C G T G T G C G T No change A C G G T C G C C G T G T G C G T SNP A C G G T C G T C G T G T G C G T Measurement error

SOLiD Error rate & QVs

Pacific Biosystems (PacBio)

Current and future application areas Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery reference genome DEL SNP De novo genome sequencing • Short-read sequencing will be (at least) an alternative to microarrays for: • DNA-protein interaction analysis (CHiP-Seq) • novel transcript discovery • quantification of gene expression • epigenetic analysis (methylation profiling)

What’s in it for us? Image ManagementBase calling Data StorageCloud Computing Data ManagementData Integrity Data representation for Biologists Read MappingAssembly Probabilistic Models Variant Calling

Fundamental informatics challenges 1. Interpreting machine readouts – base calling, base error estimation 3. Dealing with non-uniqueness in the genome: resequenceability 2. Alignment of billions of reads

Informatics challenges (cont’d) 4. SNP and short INDEL, and structural variation discovery 5. Data visualization 6. Data storage & management

High Throughput Sequencing: Technologies & Applications Questions?

SHRiMP: SHort Read Mapping Package • Fast Mapping Algorithm • - Spaced seed hashing • - Vectored (very fast) Smith Waterman • - Handles micro insertions/deletions • Specialized algorithm for aligning color-space (AB SOLiD) reads • Computes p-values (and other statistics)

Regular Smith-Waterman A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

Fast Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC BLAST FASTA Altschul et al 1990 Pearson 1987

SHRiMP Hashing • SHRiMP uses spaced seeds • Vectored Smith-Waterman Genome Reads

Vectored Instructions 1 5 1 5 5 6 9 4 9 4 9 13 + = max = 3 8 3 8 8 11 6 4 6 4 6 10 • Modern computers provide with capacity for performing same operation on several elements (SIMD) • Can we take advantage of vectorized instruction in Smith-Waterman?

Vectorizing Smith-Waterman (1st try) A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

Vectorizing Smith-Waterman (Wozniak) A C T A G A C T T G T C C A G T Current Previous Penultimate Wozniak, 1997

Vectorizing Smith-Waterman (SHRiMP) A C T A G A C T T G T G A C C T + ---+ A C T A G A C T T G T C C A G T Current Previous Penultimate

SHRiMP Speed SW within SHRiMP while mapping 50,000 reads against a 4Mb contig of C. savignyi SHRiMP performance for mapping 11,200 AB SOLiD 25 bp reads to 180Mb Ciona savignyi genome

Color-space (dibase) Se quencing 0 0 2 G A 1 3 3 1 C T 2 0 0

Mapping reads in Color-space INDELS TGAGTTA 122103 TGA-TTA 12-303 TGAGTTTA 1221003 TGAGTATA 1221333 SNPs TGAGTT 12210 TGACTT 12120 TGAATT 12030 TGATTT 12300 G: TTGAGTTATGGAT 012210331023 R: 012120331023 TTGACTTATGGAT

Mapping reads in Letter Space 0 0 2 G A 1 3 3 1 C T 2 0 0 G: TGACTTATGGAT ||||| TTGAGTCGCAAGC CCAGACTATGGAT R:012212331023 |||||||

SOLiD Translations • Given the following read, there are 4 translations (we need an initial base):

SOLiD Translations • Reads begin with a known primer (‘T’) • The translation is: T T G A G C G T T C

SOLiD Translations • What if we had a sequencing error? • The right translation was: T T G A G C G T T C

Colour-space Smith-Waterman Letter • Think of 4 SW matrices stacked above one another • If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices Genome Read Frame 1 Frame 2 Frame 3 Frame 4

Combined Color/Letter Space SW 0 0 2 G A 1 3 3 1 C T 2 0 0 A C 3 2

SHRiMP on Ciona savignyi • C. savignyi is a chordate • with a very large SNP rate (5%) • Mapped 22 million AB SOLiD • reads to the reference • C. savignyi genome • (6 hours on 200 CPUs). G: 1123724 TA-ACCACGGTCACACTTGCATCAC 1123701 || |||||||||| |||X||||||| T: TACACCACGGTCAGACTtGCATCAC R: 0 T0311101130121221211313211 24

SHRiMP Summary • Fast mapping of short reads to a genome • -- Handles indels & color-space reads • -- Easy to parallelize • -- Small memory footprint • Computation of p-values & other statistics for hits • Publicly available & free

Acknowledgments Stanford Stephen Rumble UofT Phil Lacroute Anton Valouev Arend Sidow http://compbio.cs.toronto.edu/shrimp FUNDING: NSERC, CFI, NIH

Why is color-space good? • SNP discovery • Error correction with letter & color reads (assembly) • Can fix errors without (explicit) overlap • Don’t just do everything in color space! R1: 0 TAGACCACGGTCACACTTGCATCAC 24 || |||||||||| |||X||||||| T: TACACCACGGTCAGACTtGCATCAC R2: 0 T0311101130121221211313211 24 T: TACACCACGGTCAGACTTGCATCAC R1: T0311101130121221211013211 24R2: T2113013122121101321103111 24 R3: T2212110132110311121130131 24

What are structural variations? Various examples of structural variations

Type of Structural Variations (1) Insertion A REF

Type of Structural Variations (2) Deletion A REF

Type of Structural Variations (3) Inversion 3’ 5’ A 5’ 3’ REF 3’ 5’

Type of Structural Variations (4) Translocation chr1 chr2

Clone-end Sequencing Approaches • 1. “Fine-scale structural variation of the • human genome”[Tuzun et al, 2005] • Mapping matepairs onto the reference genome • If mappings of matepairs are not consistent, then • there exist structural variations. • 2. “Paired-End mappings Reveals Extensive • Structural Variation in the Human Genome” [Korbel et al, 2007] • Proposed high-throughput and massive paired end • mapping technique • Detailed types of structural variations

Motivation Reads can map to many locations on the genome. How do we choose between them? Tuzun & Korbel used scores which are combinationof several factors. (e.g. length, identity, quality of the sequences, concordance)

Probabilistic Framework (1) We play with p(Y) to describe our probabilistic framework p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes

High Throughput Sequencing: Technologies & Applications