1 / 73

High Throughput Sequencing: Technologies & Applications

High Throughput Sequencing: Technologies & Applications. Michael Brudno CSC 2431 – Algorithms for HTS University of Toronto 06/01/2010. High Throughput Sequencers. 100 Gb. AB/SOLiDv3, Illumina/GAII short-read sequencers. (10+Gb in 50-100 bp reads, >100M reads, 4-8 days).

ursa-deleon
Download Presentation

High Throughput Sequencing: Technologies & Applications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. High Throughput Sequencing: Technologies & Applications Michael Brudno CSC 2431 – Algorithms for HTS University of Toronto 06/01/2010

  2. High Throughput Sequencers 100 Gb AB/SOLiDv3, Illumina/GAII short-read sequencers (10+Gb in 50-100 bp reads, >100M reads, 4-8 days) 10 Gb 454 GS FLX pyrosequencer 1 Gb (100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours) bases per machine run 100 Mb ABI capillary sequencer 10 Mb (0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours) 1 Mb 10 bp 100 bp 1,000 bp read length From Gabor Marth, BC

  3. Sequencing chemistries DNA base extension DNA ligation Church, 2005

  4. Massively parallel sequencing Church, 2005

  5. Features of HTS data • Short (for now) sequence reads • 200-400bp: 454 (Roche) • 35-100bp Solexa(Illumina), SOLiD(AB) • Huge amount of sequence per run • Up to 10s of gigabases per run • Huge number of reads per run • Up to 100’s of millions • Higher error (compared with Sanger) • Different error profile

  6. The Raw Data • Machine Readouts are different • Read length, accuracy, and error profiles are variable. • All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improves

  7. 454 Pyrosequencer error profile • multiple bases in a homo-polymeric run are incorporated in a single incorporation test  the number of bases must be determined from a single scalar signal the majority of errors are INDELs • error rates are nucleotide-dependent

  8. Illumina/Solexa base accuracy • Error rate grows as a function of base position within the read • A large fraction of the reads contains 1 or 2 errors

  9. AB SOLiD System dibase sequencing 2nd Base A C G T 0 1 2 3 A 1 0 3 2 1st Base C 2 3 0 1 G 3’ 3’ 3’ 5’ 5’ 5’ 1 3 2 0 T N N N A Tz z z N N N G Az z z N N N T Gz z z 2-base, 4-color: 16 probe combinations • 4 dyes to encode 16 2-base combinations • Detect a single color indicates 4 combinations & eliminates 12 • Each color reflects position, not the base call • Each base is interrogated by two probes • Dual interrogation eases discrimination • errors (random or systematic) vs. SNPs (true polymorphisms)

  10. The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of two bases is known. Converting dibase (color) into letters 2nd Base A C G T 0 1 2 3 A 1 0 3 2 1st Base 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 2 3 3 AA AC AC AA AG AT AA AG AG CC CA CA CC CT CG CC CT CT GG GT GT GG GA GC GG GA GA TT TG TG TT TC TA TT TC TC A A C A A G C C T C C C A C C T A A G A G G T G G A T T C T T T G T T C G G A G C 2 3 0 1 G 1 3 2 0 T 4 Possible Sequences

  11. SOLiD error checking code A C G G T C G T C G T G T G C G T A C G G T C G T C G T G T G C G T No change A C G G T C G C C G T G T G C G T SNP A C G G T C G T C G T G T G C G T Measurement error

  12. SOLiD Error rate & QVs

  13. Pacific Biosystems (PacBio)

  14. Current and future application areas Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery reference genome DEL SNP De novo genome sequencing • Short-read sequencing will be (at least) an alternative to microarrays for: • DNA-protein interaction analysis (CHiP-Seq) • novel transcript discovery • quantification of gene expression • epigenetic analysis (methylation profiling)

  15. What’s in it for us? Image ManagementBase calling Data StorageCloud Computing Data ManagementData Integrity Data representation for Biologists Read MappingAssembly Probabilistic Models Variant Calling

  16. Fundamental informatics challenges 1. Interpreting machine readouts – base calling, base error estimation 3. Dealing with non-uniqueness in the genome: resequenceability 2. Alignment of billions of reads

  17. Informatics challenges (cont’d) 4. SNP and short INDEL, and structural variation discovery 5. Data visualization 6. Data storage & management

  18. High Throughput Sequencing: Technologies & Applications Questions?

  19. SHRiMP: SHort Read Mapping Package • Fast Mapping Algorithm • - Spaced seed hashing • - Vectored (very fast) Smith Waterman • - Handles micro insertions/deletions • Specialized algorithm for aligning color-space (AB SOLiD) reads • Computes p-values (and other statistics)

  20. Regular Smith-Waterman A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

  21. Fast Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC BLAST FASTA Altschul et al 1990 Pearson 1987

  22. SHRiMP Hashing • SHRiMP uses spaced seeds • Vectored Smith-Waterman Genome Reads

  23. Vectored Instructions 1 5 1 5 5 6 9 4 9 4 9 13 + = max = 3 8 3 8 8 11 6 4 6 4 6 10 • Modern computers provide with capacity for performing same operation on several elements (SIMD) • Can we take advantage of vectorized instruction in Smith-Waterman?

  24. Vectorizing Smith-Waterman (1st try) A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

  25. Vectorizing Smith-Waterman (Wozniak) A C T A G A C T T G T C C A G T Current Previous Penultimate Wozniak, 1997

  26. Vectorizing Smith-Waterman (SHRiMP) A C T A G A C T T G T G A C C T + ---+ A C T A G A C T T G T C C A G T Current Previous Penultimate

  27. SHRiMP Speed SW within SHRiMP while mapping 50,000 reads against a 4Mb contig of C. savignyi SHRiMP performance for mapping 11,200 AB SOLiD 25 bp reads to 180Mb Ciona savignyi genome

  28. Color-space (dibase) Se quencing 0 0 2 G A 1 3 3 1 C T 2 0 0

  29. Mapping reads in Color-space INDELS TGAGTTA 122103 TGA-TTA 12-303 TGAGTTTA 1221003 TGAGTATA 1221333 SNPs TGAGTT 12210 TGACTT 12120 TGAATT 12030 TGATTT 12300 G: TTGAGTTATGGAT 012210331023 R: 012120331023 TTGACTTATGGAT

  30. Mapping reads in Letter Space 0 0 2 G A 1 3 3 1 C T 2 0 0 G: TGACTTATGGAT ||||| TTGAGTCGCAAGC CCAGACTATGGAT R:012212331023 |||||||

  31. SOLiD Translations • Given the following read, there are 4 translations (we need an initial base):

  32. SOLiD Translations • Reads begin with a known primer (‘T’) • The translation is: T T G A G C G T T C

  33. SOLiD Translations • What if we had a sequencing error? • The right translation was: T T G A G C G T T C

  34. Colour-space Smith-Waterman Letter • Think of 4 SW matrices stacked above one another • If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices Genome Read Frame 1 Frame 2 Frame 3 Frame 4

  35. Combined Color/Letter Space SW 0 0 2 G A 1 3 3 1 C T 2 0 0 A C 3 2

  36. Combined Color/Letter Space SW 0 0 2 G A 1 3 3 1 C T 2 0 0 A C 3 2

  37. SHRiMP on Ciona savignyi • C. savignyi is a chordate • with a very large SNP rate (5%) • Mapped 22 million AB SOLiD • reads to the reference • C. savignyi genome • (6 hours on 200 CPUs). G: 1123724 TA-ACCACGGTCACACTTGCATCAC 1123701 || |||||||||| |||X||||||| T: TACACCACGGTCAGACTtGCATCAC R: 0 T0311101130121221211313211 24

  38. SHRiMP Summary • Fast mapping of short reads to a genome • -- Handles indels & color-space reads • -- Easy to parallelize • -- Small memory footprint • Computation of p-values & other statistics for hits • Publicly available & free

  39. Acknowledgments Stanford Stephen Rumble UofT Phil Lacroute Anton Valouev Arend Sidow http://compbio.cs.toronto.edu/shrimp FUNDING: NSERC, CFI, NIH

  40. Acknowledgments Stanford Stephen Rumble UofT Phil Lacroute Anton Valouev Arend Sidow http://compbio.cs.toronto.edu/shrimp FUNDING: NSERC, CFI, NIH

  41. Why is color-space good? • SNP discovery • Error correction with letter & color reads (assembly) • Can fix errors without (explicit) overlap • Don’t just do everything in color space! R1: 0 TAGACCACGGTCACACTTGCATCAC 24 || |||||||||| |||X||||||| T: TACACCACGGTCAGACTtGCATCAC R2: 0 T0311101130121221211313211 24 T: TACACCACGGTCAGACTTGCATCAC R1: T0311101130121221211013211 24R2: T2113013122121101321103111 24 R3: T2212110132110311121130131 24

  42. What are structural variations? Various examples of structural variations

  43. Type of Structural Variations (1) Insertion A REF

  44. Type of Structural Variations (2) Deletion A REF

  45. Type of Structural Variations (3) Inversion 3’ 5’ A 5’ 3’ REF 3’ 5’

  46. Type of Structural Variations (4) Translocation chr1 chr2

  47. Clone-end Sequencing Approaches • 1. “Fine-scale structural variation of the • human genome”[Tuzun et al, 2005] • Mapping matepairs onto the reference genome • If mappings of matepairs are not consistent, then • there exist structural variations. • 2. “Paired-End mappings Reveals Extensive • Structural Variation in the Human Genome” [Korbel et al, 2007] • Proposed high-throughput and massive paired end • mapping technique • Detailed types of structural variations

  48. Motivation Reads can map to many locations on the genome. How do we choose between them? Tuzun & Korbel used scores which are combinationof several factors. (e.g. length, identity, quality of the sequences, concordance)

  49. Probabilistic Framework (1) We play with p(Y) to describe our probabilistic framework p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes

More Related