Explore the history, basics, and advancements in high-throughput sequencing technologies. Understand Solexa, SOLiD, 454, and Helicos platforms and learn about data analysis and interpretation. Join us for an informative session.
Canadian Bioinformatics Workshops www.bioinformatics.ca
Informatics on High Throughput Sequencing Data
Introduction to next-gen sequencing
Francis Ouellette, francis@oicr.on.ca
July 25th 2008
Outline
• Sequencing DNA
• Next Generation Technologies
• Solexa
• SOLiD
• 454
• Helicos
• AB’s color space
• What next, & things to keep in mind!
Adapted from John McPherson, OICR Biological Research
History of DNA Sequencing
• 1870 Miescher: Discovers DNA
• 1940 Avery: Proposes DNA as ‘Genetic Material’
• 1953 Watson & Crick: Double Helix Structure of DNA
• 1965 Holley: Sequences Yeast tRNA(Ala)
• 1970 Wu: Sequences Cohesive End DNA
• 1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
• 1980 Messing: M13 Cloning
• 1986 Hood et al.: Partial Automation
• 1990 to 2002: Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
• 2002 to 2008: Next Generation Sequencing; Improved enzymes and chemistry; Improved image processing
Efficiency (bp/person/year), increasing along the timeline: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
Basics of the “old” technology
• Clone the DNA.
• Generate a ladder of labeled (colored) molecules that differ in length by one nucleotide.
• Separate the mixture on some matrix.
• Detect the fluorochrome by laser.
• Interpret the peaks as a string of DNA.
• Strings are 500 to 1,000 letters long.
• One machine generates 57,000 nucleotides/run.
• Assemble all strings into a genome.
Basics of the “new” technology
• Get DNA.
• Attach it to something.
• Extend and amplify the signal with some color scheme.
• Detect the fluorochrome by microscopy.
• Interpret a series of spots as short strings of DNA.
• Strings are 30 to 300 letters long.
• Multiple images are interpreted as 0.4 to 1.2 Gb/run (1,200,000,000 letters/day).
• Map or align the strings to one or many genomes.
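To put the two throughput figures side by side, here is a rough back-of-the-envelope sketch. It takes the 57,000 nucleotides/run quoted for the older machines and the 1,200,000,000 letters quoted as the upper figure for the newer ones, and assumes a 3 Gb (human-sized) genome sequenced at 1x coverage. The function name and the 1x coverage target are illustrative assumptions, not part of the original slides.

```python
import math

def runs_for_coverage(genome_size_bp, bases_per_run, coverage=1.0):
    """Return the number of whole runs needed to reach the given coverage."""
    return math.ceil(genome_size_bp * coverage / bases_per_run)

GENOME_BP = 3_000_000_000          # assumption: human-sized genome (3 Gb)
OLD_BASES_PER_RUN = 57_000         # "old" technology figure from the slide
NEW_BASES_PER_RUN = 1_200_000_000  # "new" technology, upper figure from the slide

old_runs = runs_for_coverage(GENOME_BP, OLD_BASES_PER_RUN)
new_runs = runs_for_coverage(GENOME_BP, NEW_BASES_PER_RUN)
print(f"old technology: {old_runs:,} runs; new technology: {new_runs} runs")
```

Real projects need far more than 1x coverage, so both numbers scale up by the chosen coverage; the ratio between them stays the same.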
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Differences between the various platforms:
• Nanotechnology used.
• Resolution of the image analysis.
• Chemistry and enzymology.
• Signal to noise detection in the software
• Software/images/file size/pipeline
• Cost $$$
Next Generation DNA Sequencing Technologies
[Figure; only the label “3 Gb ==” survives]
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Solexa-based Whole Genome Sequencing Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Solexa-based Whole Genome Sequencing
Solexa flow cell: ~50M clusters are sequenced per flow cell.
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Roche / 454 : GS FLX
• Real Time Sequencing by Synthesis
• Chemiluminescence detection in picotiter plates
• Amplification: emulsion PCR
• Pyrosequencing
• up to 400,000 reads / run
• on average 250 bases / read (and longer)
• up to 100 Mb / run
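The three throughput figures on this slide are consistent with each other, as a quick check shows. This is only a sketch of the arithmetic; the variable names are illustrative, and 1 Mb is taken as 1,000,000 bases.

```python
# Figures quoted on the slide for the Roche / 454 GS FLX.
reads_per_run = 400_000      # "up to 400,000 reads / run"
mean_read_length_bp = 250    # "on average 250 bases / read"

# Yield per run = number of reads x average read length.
yield_bp = reads_per_run * mean_read_length_bp
yield_mb = yield_bp / 1_000_000  # assumption: 1 Mb = 1,000,000 bases

print(f"{yield_bp:,} bases per run = {yield_mb:.0f} Mb per run")
```

The product matches the "up to 100 Mb / run" figure.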
Roche / 454 : GS FLX
• Made for de novo sequencing.
• Too expensive for resequencing.
• For example, this platform will be used a lot by laboratories sequencing new bacterial genomes.
• Baylor Genome Center is involved in the Sea Urchin, Bee, and Platypus genomes: they have a number of 454 machines.
Single Molecule Sequencing (Helicos Biosciences Corp.)
[Figure labels: microscope slide; single DNA molecule; primer; dNTP-Cy3; super-cooled TIRF microscope]
Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh
Helicos Approximate Data Production per Run at Current Peak Throughput (1 strand/µm²)
Values are given as Single Pass (7 day run) | Dual Pass (14 day run).
• Image Data: 35 TB | 60 TB
• Diagnostic Images: 350 GB | 600 GB
• Object Table: 3.5 TB | 6 TB
• Sequence Data: 350 GB | 600 GB
• Log Files: 350 GB | 600 GB
• Total (w/o full image stack): ~4.5 TB | ~7.8 TB
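The totals on this slide can be reproduced by adding every row except the full image stack. The sketch below assumes decimal units (1 TB = 1,000 GB); the dictionary layout and names are illustrative only.

```python
# Per-run data volumes in GB, as (single pass, dual pass).
# Assumption: decimal units, 1 TB = 1,000 GB.
data_gb = {
    "Image Data":        (35_000, 60_000),
    "Diagnostic Images": (350, 600),
    "Object Table":      (3_500, 6_000),
    "Sequence Data":     (350, 600),
    "Log Files":         (350, 600),
}

# The slide's total excludes the full image stack, so skip "Image Data".
kept = [sizes for name, sizes in data_gb.items() if name != "Image Data"]
single_total_gb = sum(single for single, _ in kept)
dual_total_gb = sum(dual for _, dual in kept)

print(f"single pass: {single_total_gb / 1000} TB, dual pass: {dual_total_gb / 1000} TB")
```

The sums come to 4.55 TB and 7.8 TB, in line with the approximate totals on the slide. Keeping the full image stack would add another 35 TB or 60 TB per run.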
It’s more complicated!
• Get files with quality scores
• Get files with mismatches
• Need to align them to a reference genome
• Multiple tools do this today … and there will be more later.
• What do you do? Do it all!
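To make "align reads to a reference while tolerating mismatches" concrete, here is a deliberately naive sketch: it slides the read along the reference and reports every position where the number of mismatches stays within a limit. This is not how any particular alignment tool works (real tools index the genome and also use the quality scores); the function names and the toy sequences are made up for illustration.

```python
def count_mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for each placement within the limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

# Toy data (hypothetical): the read differs from the reference at one base.
reference = "ACGTTGCATGCCGTAACGTTGACC"
read = "GCATGACG"

hits = map_read(read, reference, max_mismatches=1)
print(hits)
```

With these toy sequences the read maps at position 5 (0-based) with one mismatch; setting `max_mismatches=0` returns no hits. The scan costs time proportional to read length times genome length, which is why dedicated tools are needed for millions of reads.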
Things to keep in mind
• Everyone is still learning. If you don’t know, ask; they probably won’t know either, and you can figure it out together!
• The technology is changing: this workshop next year will be totally different!
• We can only do so much in two days: you will need to find things, find people who can help you, and teach your friends!
Other factors
• Changing technology
• New and disappearing companies?
• Changing price structure
• Cost of machine
• Cost of operation (reagents/people)
• Service from the company
• 1 machine vs (2 or 3 machines) vs 40 machines.
• Changing software and processing
Questions? • Coffee break!