The Nuts and Bolts of Bacterial Genome Sequencing. Lisa Crossman & the Pathogen Sequencing Unit. Dr. Fred Sanger Double Nobel laureate and developer of the dideoxy sequencing method, first published in December 1977. [Credit: Wellcome Images].
The Nuts and Bolts of Bacterial Genome Sequencing Lisa Crossman & the Pathogen Sequencing Unit
Dr. Fred Sanger: double Nobel laureate and developer of the dideoxy sequencing method, first published in December 1977. [Credit: Wellcome Images] “Fred Sanger is a quiet giant, whose discoveries and inventions transformed our research world.” (A. Bradley, WTSI)
Sequence centre contributions to the finished human genome sequence
Human sequences
• The Human Genome Project (2000)
• Celera (2000)
• First working drafts published 2001
• Race for the ‘$1,000 genome’ (~£510), 2004
• James Watson ($2 million) (2007)
• J. Craig Venter (2007)
Francis Crick, 1958 Watson & Crick
Sanger Sequencing DNA extraction Sequencing reactions Automated DNA Sequencing Finishing Annotation and analysis Publication in journal Publication on the internet
Two stage strategy for producing long DNA sequences
Overview: Target DNA molecule → Randomly produced DNA fragments → Many overlapping sequences (reads) → Assembled reads (draft sequence) → Finished sequence
Random shotgun: BAC → Prepare multiple copies → Purified BAC DNA → Physically fragment DNA → Subclone random fragments → Generate reads from random subclones → Assemble sequence → Prefinished sequence
Sequence finishing: Directed finishing → Final assembly → Finished sequence
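The step from "many overlapping sequences (reads)" to "assembled reads (draft sequence)" can be illustrated with a toy greedy overlap assembler. This is only a sketch: the function names, the `min_len` threshold and the example reads are hypothetical, reads are assumed to be error-free and on the same strand, and production assemblers use far more sophisticated methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap; return the resulting contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the gaps would need directed finishing
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical overlapping reads sampled from one short target sequence.
print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]))
```

If the reads do not overlap everywhere, the function returns several contigs, which mirrors the "prefinished sequence" stage where gaps remain to be closed.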
How to purify DNA
Methods for DNA purification
Many sources of DNA:
• bacteria, animal cells, blood, soil, plant cells
• properties (solubility, charge) not sequence dependent
• thus “generic” purification methods are possible
• chemically stable, particularly at alkaline pH
• intracellular & associated with stabilizing proteins
• fragile & subject to mechanical shear
– use a kit: Millipore's Montage Plasmid Miniprep 96 kit or Promega's Wizard SV 96 Plasmid DNA Purification System
Automated DNA purification
• Beckman Coulter Biomek FX Laboratory Automation Workstation: 2 x 96-well plates (192 plasmid preps) in 70 minutes, using the Promega Wizard SV96 kit
• PerkinElmer MiniTrak liquid handling system: 12 x 384-well boxes (4608 plasmid preps) in less than 1 h
Levels of automation Colony picking robots
DNA sequencing from: Sanger, F., Nicklen, S. & Coulson, A. R., Proc. Natl. Acad. Sci. USA 74, 5463 (1977)
DNA sequencing using the dideoxy-mediated chain termination method of Sanger et al. • DNA to be sequenced acts as a template for the enzymatic synthesis of new DNA starting at a defined primer site • Incorporation of a dideoxynucleotide blocks further chain elongation
Dideoxy sequencing • Denature ds DNA • Anneal Primer 5’ 3’ to template DNA 3’ 5’ • Enzyme, dNTPs and buffer added at the optimum temperature will initiate chain elongation • Addition of dideoxynucleotides will terminate elongation
Ratios of deoxy and dideoxynucleotides • Are such that a finite probability is created for a dideoxynucleotide to be incorporated in place of the usual deoxynucleotide at each nucleotide position on the growing chain resulting in a population of truncated fragments
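The chain-termination idea on the slides above can be sketched in a few lines: each dideoxynucleotide yields fragments that end wherever that base is incorporated, and sorting all fragments by length recovers the sequence. The function names and the example strand are hypothetical, and the sketch assumes an idealised reaction in which every possible truncated fragment is produced.

```python
def termination_lengths(new_strand):
    """For each ddNTP, list the lengths of the fragments it terminates.

    new_strand is the newly synthesised strand (5' to 3'), counted from the
    first base added after the primer.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes


def read_ladder(lanes):
    """Read the sequence back by ordering all fragments from shortest to longest."""
    fragments = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in fragments)


lanes = termination_lengths("GATTACA")  # hypothetical strand
print(lanes)
print(read_ladder(lanes))
```

Separating fragments by size, on a slab gel or in a capillary, performs the sorting step physically.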
Label location • The label can be incorporated into: • The oligonucleotide primer used to initiate the sequencing reaction • The deoxynucleotides in chain elongation • The dideoxynucleotides used in chain termination
Radio (manual) and fluorescent (automated) label sequencing
Automated DNA sequence analyzer: Sequencing Production Capacity
Specification for Applied Biosystems 3730xl DNA Analyzer (96 samples per run)

Run module   Capillary length    Runs/day   LOR*      Phred Q20    Phred Q20
             to detector (cm)                         bases/read   bases/day
Rapid        36                  40         550       500          1,920,000
Standard     36                  24         700       650          1,497,600
Long-Read    50                  12         > 1,000   > 800        > 921,600

* Length of read with 98.5% basecalling accuracy, less than 2% N's, using pGEM-3Zf(+) as template.
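The daily throughput figures in the 3730xl specification follow from multiplying runs per day, samples per run and Phred Q20 bases per read. A small sketch of that arithmetic, using the numbers from the slide (the function and variable names are illustrative only):

```python
SAMPLES_PER_RUN = 96  # samples per run on the 3730xl, as stated on the slide


def q20_bases_per_day(runs_per_day, q20_bases_per_read, samples_per_run=SAMPLES_PER_RUN):
    """Phred Q20 bases per day = runs/day x samples/run x Q20 bases/read."""
    return runs_per_day * samples_per_run * q20_bases_per_read


# (runs per day, Phred Q20 bases per read); Long-Read values are lower bounds (> 800).
run_modules = {"Rapid": (40, 500), "Standard": (24, 650), "Long-Read": (12, 800)}

for name, (runs, q20) in run_modules.items():
    print(f"{name}: {q20_bases_per_day(runs, q20):,} Q20 bases/day")
```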
Original Sanger method (1977)
• Chain termination with dideoxy nucleotides
• DNA radiolabelled
• Detection of DNA fragments by autoradiography
• Data format not digital
• Single sequencing reaction
• DNA fragments separated by electrophoresis in polyacrylamide slab gels
• Manual gel pouring
• Manual sample loading
DNA sequencing in capillary analysers (e.g. ABI 3730xl), 1999 to present
• Chain termination with dideoxy nucleotides
• DNA labelled with fluorescent dyes
• Fluorescent detection of DNA fragments
• Data in digital form
• Thermostable DNA polymerase used for cycle sequencing
• DNA fragments separated by electrophoresis in a liquid matrix in capillaries
• Automated filling of capillaries
• Automated sample loading
Large scale DNA sequencing facility Every day 120,000 DNA sequences 60,000 plasmid preps
What do we mean by finished sequence? • A closed consensus sequence without gaps that meets our finishing criteria and therefore has an overall accuracy of at least 99.99%. • The sequence may contain (small) regions that do not meet our finishing criteria. These are likely to be of lower quality but they will have been characterized and should be identified in the annotation. • The sequence has been checked by an experienced finisher. • When the finishing is finished, no further finishing is being done.
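To put the 99.99% accuracy criterion in context, it can be related to the Phred scale used elsewhere in this talk, where Q = -10 log10(P) and P is the probability of an erroneous base. The sketch below is illustrative only: the function names are hypothetical and the 5 Mb genome size is an assumed example, not a figure from the slides.

```python
import math


def phred_to_error_probability(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Phred-scaled equivalent of an overall accuracy (e.g. 0.9999)."""
    return -10 * math.log10(1 - accuracy)


def expected_errors(genome_length_bp, accuracy):
    """Expected number of erroneous bases at a given overall accuracy."""
    return genome_length_bp * (1 - accuracy)


print(phred_to_error_probability(20))       # Q20: error probability per base
print(accuracy_to_phred(0.9999))            # Phred-scaled value of 99.99% accuracy
print(expected_errors(5_000_000, 0.9999))   # assumed 5 Mb bacterial genome
```

On this scale a Q20 base has a 1 in 100 chance of being wrong, whereas 99.99% accuracy corresponds to about 1 error in 10,000 bases.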
DNA extraction Sequencing reactions Automated DNA Sequencing Finishing Annotation and analysis Publication in journal Publication on the internet
What do we mean by a finished and annotated sequence? • A closed consensus sequence in which coding sequences have been identified, systematically numbered and analysed. • Vital metabolic genes and previously sequenced genes have been identified. • The sequence and annotation have been checked by an experienced annotator. • A full analysis has been carried out and the genome sequence is deposited in the sequence databases.
Next (New) Generation Sequencing Technologies
Technological breakthroughs… driven by the race for the $1,000 (human) genome
• 454 (Roche)
• Solexa (Illumina)
• And others…
Next Generation Sequencing Technologies
• Pyrosequencing
• 454 sequencing
  – Clonal amplification on beads
  – PicoTiter plate (1.6 M wells)
  – Sequencing-by-synthesis
  – Chemiluminescent detection
• No cloning required
• Increased performance
  – 20,000,000 bp per run (4.5 hours)
  – 2 Mb genome, 10x coverage
• Current performance (ABI 3730)
  – 48,000 bp per run
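The "2 Mb genome, 10x coverage" figure is simply bases per run divided by genome size. The sketch below reproduces that arithmetic and, as an illustrative comparison, counts how many runs at 48,000 bp per run would give the same coverage; the function names are hypothetical and read overlap, quality and failures are ignored.

```python
import math


def coverage(bases_per_run, genome_size_bp, runs=1):
    """Average fold coverage = total bases sequenced / genome size."""
    return runs * bases_per_run / genome_size_bp


def runs_needed(target_coverage, genome_size_bp, bases_per_run):
    """Smallest whole number of runs that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size_bp / bases_per_run)


GENOME_BP = 2_000_000            # 2 Mb genome, from the slide
PYROSEQ_RUN_BP = 20_000_000      # per 4.5 hour run, from the slide
CAPILLARY_RUN_BP = 48_000        # ABI 3730 per run, from the slide

print(coverage(PYROSEQ_RUN_BP, GENOME_BP))
print(runs_needed(10, GENOME_BP, PYROSEQ_RUN_BP))
print(runs_needed(10, GENOME_BP, CAPILLARY_RUN_BP))
```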
The GS20 Sequencing Machine: Reagent Drawer; CCD Camera and Sequencing Plate Housing; Computery Bits
Developments in Technology Pyrosequencing
454 • Genome fragmented into 300-500 bp fragments • Ends are polished and adapters (A and B) ligated: 4 nucleotide “key” + sequencing primer + PCR primer • Fragments immobilised onto magnetic, streptavidin-coated beads • A+B fragments then isolated as sstDNA library (isolate AB fragments only)
emPCR A) Anneal single-stranded template to an excess of DNA capture beads B) Emulsify beads and PCR reagents in water-in-oil microreactors C) Break microreactors and enrich for DNA-positive beads
Depositing DNA Beads into the PicoTiter™ Plate (44 μm wells): Load beads into PicoTiter™ Plate • Centrifugation • Load Enzyme Beads
Reagent Flow Across PicoTiter™ Plate: Reagent Cassette • Peristaltic Pump • Sequencing plate in front of CCD. The four nucleotides are washed in series over the plate.
Pyrosequencing Signal Generation
• Each of the hundreds of thousands of beads, carrying millions of copies of DNA, is sequenced in parallel.
• If a complementary nucleotide is flowed into a well, polymerase extends the strand by adding a nucleotide.
• Addition of one or more nucleotides generates a light signal, which is recorded.
• The process continues until a defined number of nucleotide flow cycles is completed.
[Figure: DNA capture bead containing millions of copies of a single clonal fragment, with annealed primer. Repeated dNTP flow sequence: G, T, C, A. Signal cascade: PPi + APS → ATP (sulfurylase); ATP + luciferin → light + oxyluciferin (luciferase).]
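The flow-by-flow logic described above can be sketched as a toy flowgram generator. It assumes an idealised, noise-free signal proportional to the number of bases incorporated per flow; the function names, the example read and the use of G, T, C, A as the repeating flow order are illustrative assumptions, not the instrument's actual software.

```python
def flowgram(read, flow_order="GTCA", max_cycles=100):
    """Return (nucleotide, bases incorporated) for each flow.

    read is the strand being synthesised, 5' to 3'. A homopolymer run is
    incorporated within a single flow, giving a proportionally larger signal.
    """
    signals = []
    pos = 0
    for _ in range(max_cycles):
        if pos >= len(read):
            break
        for base in flow_order:
            run = 0
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
    return signals


def decode(signals):
    """Rebuild the read from idealised flow signals."""
    return "".join(base * count for base, count in signals)


signals = flowgram("TTAGCC")  # hypothetical read
print(signals)
print(decode(signals))
```

Because a homopolymer shows up only as a stronger signal in one flow, long runs of the same base are harder to count accurately than in this idealised model.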
Illumina (Solexa) machine (from http://www.gatc-biotech.com)
Illumina (Solexa) Sequencing Dense lawn of primers
Next Generation sequencing technologies

                        454                Solexa             Sanger
Data generation         25 Mbp/run         3,3000 Mbp/run     0.25 Mbp/run
Read length             240 bp             35 bp              800 bp
Read pair information   no                 no                 yes
Homopolymeric runs      <5                 accurate           accurate
Cloning bias            no                 no                 yes
De novo genomes         hybrid?            hybrid?            yes
Current cost            $100/Mb (~£50)     $5/Mb (~£2.55)     $500/Mb (~£255)

Sources: www.454.com and Margulies et al. (2005) Nature 437, 376-80; www.solexa.com
Even newer sequencing technologies • ABI SOLiD (bead/light, interrogates every 3,4 base) • Roche GS FLX (bead/light, longer reads)
New Challenges…. Genome Sequence Finishing Sanger sequence New Generation sequencing technologies
New Challenges • Data handling: WTSI will generate ~100 terabytes of processed sequence data per year; the global repository is currently only 75 TB. Each machine of the newer generation sequencing technologies can generate 1,000,000,000,000 bytes/day of raw data (~1 TB, equivalent to approximately 10 laptops).
New Challenges - Annotation De novo annotation Deep resequencing Metagenomics Comparative genomics
Artemis free genome viewer & analysis tool www.sanger.ac.uk/Software/Artemis
Escherichia coli
• Workhorse of modern molecular biology
• Human commensal organism found in the gut
• Gram negative, optimum growth temperature 37°C, motile
• Indicator of faecal contamination in the environment
• Some strains can cause severe infections: E. coli O157:H7
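You sequenced one E. coli, you’ve done ‘em all?
[Figure: Venn diagram of CDS shared among E. coli O157:H7 (EDL933), K12, EAEC and CFT073. Region counts as extracted: 66, 190, 226, 3166, 114, 152, 240; label "Unique". 4902 CDS total; 748 = 15%.]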
Yersinia pestis (Cole, 2001)
• Primarily a pathogen of rodents
• Evolved from the gastrointestinal pathogen Y. pseudotuberculosis
  – 16S rRNA: identical; DNA-DNA hybridisations: highly related
  – Diverged 1,500-20,000 years ago
• Employ an insect vector
• Infect multiple hosts
• Become a blood borne intracellular pathogen
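Bacterial diversity is large: enteric genome content correlates with pathogenicity and host range
[Figure: gene differences (unique / shared / unique) between neighbouring enteric genomes at inter-species, inter-genus and inter-strain levels; tree scale to 100 Mya.]

Genome A (disease)                          Unique to A   Shared   Unique to B   Genome B (disease)
Yersinia pestis (plague)                    1335 (33%)    2686     1460 (35%)    Yersinia enterocolitica (gastroenteritis)
Yersinia enterocolitica (gastroenteritis)   1708 (41%)    2438     1876 (43%)    Escherichia coli O157:H7 (gastroenteritis)
Escherichia coli O157:H7 (gastroenteritis)  1387 (26%)    3953     528 (12%)     Escherichia coli K12 (non-pathogen)
Escherichia coli K12 (non-pathogen)         1220 (28%)    3094     1505 (33%)    Salmonella enterica Typhi (typhoid fever)
Salmonella enterica Typhi (typhoid fever)   601 (13%)     3998     479 (11%)     Salmonella enterica Typhimurium (gastroenteritis)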
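The percentages in this comparison appear to be each genome's unique genes as a fraction of its unique plus shared genes. The sketch below checks that reading. It assumes the slide's numbers form (unique, shared, unique) triples for neighbouring genomes, which is a reconstruction of the figure rather than something stated on the slide; the names are illustrative.

```python
def unique_percent(unique, shared):
    """Unique genes as a rounded percentage of the genes compared (unique + shared)."""
    return round(100 * unique / (unique + shared))


# (genome A, unique to A, shared, unique to B, genome B); grouping is an assumption.
comparisons = [
    ("Y. pestis", 1335, 2686, 1460, "Y. enterocolitica"),
    ("Y. enterocolitica", 1708, 2438, 1876, "E. coli O157:H7"),
    ("E. coli O157:H7", 1387, 3953, 528, "E. coli K12"),
    ("E. coli K12", 1220, 3094, 1505, "S. enterica Typhi"),
    ("S. enterica Typhi", 601, 3998, 479, "S. enterica Typhimurium"),
]

for a, unique_a, shared, unique_b, b in comparisons:
    print(f"{a}: {unique_percent(unique_a, shared)}% unique | "
          f"{b}: {unique_percent(unique_b, shared)}% unique")
```

Under this reading, the two E. coli strains differ from each other by a larger share of their genes than the two Salmonella serovars do, which fits the slide's point that strains of one species can vary widely in gene content.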