Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module 1 Introduction to next-gen sequencing. FRANCIS OUELLETTE Informatics on High Throughput Sequencing Data July 2009. Overview. “next-gen” or “next-next-gen”: why are we here? What kinds of sequencing are we doing?
Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 1 Introduction to next-gen sequencing FRANCIS OUELLETTE Informatics on High Throughput Sequencing Data July 2009
Overview • “next-gen” or “next-next-gen”: why are we here? • What kinds of sequencing are we doing? • How does DNA sequencing work? • Trying to stay away from vendor-specific challenges, but can we really? • Where next?
History of DNA Sequencing (adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998))
Timeline figure; dates are the positions on the slide's time axis.
• 1870: Miescher discovers DNA
• 1940: Avery proposes DNA as ‘genetic material’
• 1953: Watson & Crick: double helix structure of DNA
• 1965: Holley sequences yeast tRNA(Ala)
• 1970: Wu sequences cohesive end DNA
• 1977: Sanger: dideoxy chain termination; Gilbert: chemical degradation
• 1980: Messing: M13 cloning
• 1986: Hood et al.: partial automation
• 1990: Cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes
• 2009: Next generation sequencing; improved enzymes and chemistry; new image processing
Efficiency axis (bp/person/year), rising along the timeline: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000 (around 2002); 100,000,000,000 (2009)
Why are we sequencing? • Before Next-generation: • Reductionist perspective on life • DNA, RNA, (proteins), (populations), sampling, averages, consensus • Problems: sampling, averages, consensus. • After Next-generation: • We are still reductionist, but better • Genome sequence and structure • Less cloning/PCR • Single molecules (for some)
Basics of the “old” technology • Clone the DNA. • Generate a ladder of labeled (colored) molecules that are different by 1 nucleotide. • Separate mixture on some matrix. • Detect fluorochrome by laser. • Interpret peaks as string of DNA. • Strings are 500 to 1,000 letters long • 1 machine generates 57,000 nucleotides/run • Assemble all strings into a “whole”.
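The per-run figure on this slide can be sanity-checked with simple arithmetic. The sketch below assumes a 96-read run (the read count quoted for the capillary sequencer on the platform-comparison slide) and uses illustrative read lengths; the function name is hypothetical.

```python
def bases_per_run(reads_per_run: int, read_length_bp: int) -> int:
    """Total bases produced by one run: number of reads times read length."""
    return reads_per_run * read_length_bp


# Assumption: 96 reads per run, one read per capillary.
READS_PER_RUN = 96

# Illustrative read lengths within the 500 to 1,000 letter range on the slide.
for read_length in (500, 600, 1000):
    total = bases_per_run(READS_PER_RUN, read_length)
    print(f"{read_length} bp reads -> {total:,} bases per run")
```

At roughly 600 bp per read, 96 reads give 57,600 bases, which is consistent with the 57,000 nucleotides/run quoted above.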
Differences between the various platforms: • Nanotechnology used. • Resolution of the image analysis. • Chemistry and enzymology. • Signal to noise detection in the software • Software/images/file size/pipeline • Cost $$$
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk Next Generation DNA Sequencing Technologies
Next-gen sequencers (from John McPherson, OICR). Plot of bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp): • AB/SOLiDv3, Illumina/GAII short-read sequencers: 10+ Gb in 50-100 bp reads, >100M reads, 4-8 days • 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours • ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours
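To compare the platforms on this slide on one scale, the yield per run can be divided by the run time. This is a rough sketch using only the ranges quoted on the slide; the `Platform` class is hypothetical, and because the short-read yield is given only as "10+ Gb", 10 Gb is used as an assumed value for both ends of its range.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    bases_per_run: tuple  # (low, high) bases per run
    hours_per_run: tuple  # (shortest, longest) run time in hours

    def bases_per_hour(self) -> tuple:
        """Return (pessimistic, optimistic) throughput in bases per hour."""
        low = self.bases_per_run[0] / self.hours_per_run[1]
        high = self.bases_per_run[1] / self.hours_per_run[0]
        return low, high


platforms = [
    Platform("ABI capillary", (40_000, 80_000), (1, 3)),
    Platform("454 GS FLX", (100_000_000, 500_000_000), (5, 10)),
    # Assumption: "10+ Gb" treated as exactly 10 Gb; 4-8 days = 96-192 hours.
    Platform("SOLiDv3 / GAII short-read", (10_000_000_000, 10_000_000_000), (96, 192)),
]

for p in platforms:
    low, high = p.bases_per_hour()
    print(f"{p.name}: {low:,.0f} to {high:,.0f} bases/hour")
```

Under these assumptions the short-read machines and the 454 overlap in bases per hour; the short-read advantage on the plot comes from the much larger total per run, at the cost of shorter reads and multi-day runs.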
2009/10 Promises? (from John McPherson, OICR). Plot of bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp): • AB SOLiDv3: 120 Gb, 100 bp reads • Illumina GAII: 90 Gb, 175 bp reads • 454 GS FLX Titanium: 0.4-0.6 Gb, 100-400 bp reads • ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads
Solexa-based Whole Genome Sequencing Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Sample AB data Lab
>443_1087_001_F3
T12111121313231331100020021211112211
>443_1087_002_F3
T01121100201303232033213132212320123
>443_1087_003_F3
T21333200110101330330011101121132111
>443_1087_004_F3
T21322103331203331001002121021323111
>443_1088_005_F3
T32311301011311231133321301012223110
>443_1088_006_F3
T13211113031122103020002220012122101
>443_1088_007_F3
T21112301301221022023212000311310313
>443_1088_008_F3
T12133033210200001231010301011012031
>443_1088_009_F3
T23330012121212103111123012012320300
>443_1088_010_F3
T10213330331021322130123311011312110
• Get sequence assignment from instructor • Work with people at your table. • Use info from lecture notes (Panel E) • BLAST sequence at NCBI • What is it?
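For checking a hand decoding, here is a minimal sketch of converting one of these reads to bases. It assumes the standard SOLiD two-base encoding, in which each digit records the transition between adjacent bases and the leading letter is the last base of the primer rather than part of the sample. Panel E of the lecture notes is not reproduced here, so verify the result against it; the function name is hypothetical.

```python
BASES = "ACGT"  # With A=0, C=1, G=2, T=3, each color is the XOR of two neighboring bases.


def decode_colorspace(read: str) -> str:
    """Translate a color-space read (primer base followed by digits 0-3) into bases.

    The primer base itself is not included in the returned sequence.
    """
    previous = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            raise ValueError(f"cannot decode color {color!r}")
        previous ^= int(color)  # color 0 keeps the base, 1-3 switch it
        decoded.append(BASES[previous])
    return "".join(decoded)


# First read from the lab sheet above.
print(decode_colorspace("T12111121313231331100020021211112211"))
```

Because each base depends on the one before it, a single miscalled color changes every base downstream of it, so a decoded read with one color error will look nothing like its true sequence from that point on.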
Roche / 454 : GS FLX • Also known as “pyrosequencing” • http://www.454.com/products-solutions/system-features.asp • 500 million bp/run • 10 hr run • 400-500 bp/read & > 1 M reads
Roche / 454 : GS FLX • Made for de novo sequencing. • Too expensive for resequencing. • For example, this platform will be used a lot by laboratories doing new bacterial genomes. • Baylor Genome Center is involved in the Sea Urchin, Bee, and Platypus genomes: they have a number of 454 machines.
It’s more complicated! • Get files with quality scores • Get files with mismatches • Need to align them to a reference genome • Multiple tools do this today … and there will be more later. • What do you do? Do it all!
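To make "align to a reference" and "mismatches" concrete, here is a toy sketch that slides a read along a reference and reports the ungapped position with the fewest mismatches. It is an illustration only: the function names are hypothetical, it ignores quality scores and indels, and real alignment tools index the reference rather than scanning it.

```python
def count_mismatches(read: str, window: str) -> int:
    """Number of positions where the read and a same-length window differ."""
    return sum(a != b for a, b in zip(read, window))


def best_alignment(read: str, reference: str) -> tuple:
    """Return (position, mismatches) for the best ungapped placement of the read.

    Ties are resolved in favor of the leftmost position.
    """
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mismatches = count_mismatches(read, reference[pos:pos + len(read)])
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


reference = "TTACGTTT"
print(best_alignment("ACGT", reference))  # exact match
print(best_alignment("ACCT", reference))  # one mismatching base
```

With short reads, many placements can tie or nearly tie, which is one reason the quality scores and the choice of alignment tool matter.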
Things to keep in mind • Everyone is learning: if you don’t know, ask. They probably won’t know either, and you can figure it out together! • The technology is changing: this workshop will be totally different next year! • We can only do so much in two days: you will need to find things, find people who can help you, and teach your friends!
Other factors • Changing technology • New and disappearing companies? • Changing price structure • Cost of machine • Cost of operation (reagents/people) • Service from the company • 1 machine vs (2 or 3 machines) vs 40 machines. • Changing software and processing