Introduction to Next Generation Sequencing. Part 1: Introduction.
About the assignment • The practicals are the subparts of the assignment • Write up in more or less ‘paper format’ or in standard prac format • Due end of day on 16 March (Sunday)
What is sequencing? Finding the sequence of a DNA/RNA molecule. What can we sequence? (Image: http://cancergenome.nih.gov/newsevents/multimedialibrary/images/CancerBiology)
Front matter: Central Dogma of Molecular Biology Reverse Transcription
Why does that matter? DNA sequencing exploits the physicochemical properties of DNA and the enzymes involved in its replication
The DNA molecule • The two strands of DNA are different • One is called the sense strand and it is the plan to make a protein • The other strand is the antisense strand
Connecting the DNA molecule • The two strands of DNA are said to be antiparallel • One strand is oriented in a 5’ to 3’ direction • The other strand is oriented in the opposite 3’ to 5’ direction • Diagram: two paired strands labelled sense and antisense, one running 5’ to 3’ and the other 3’ to 5’
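Because the strands are antiparallel, the partner of a strand is its reverse complement, not just its complement. A minimal sketch, assuming uppercase A/C/G/T only; the example sequence and the function name are made up for illustration:

```python
# Complement table for the four DNA bases (uppercase only, an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the partner strand, written in its own 5' to 3' direction."""
    # Complement every base, then reverse, because the partner strand
    # runs in the opposite direction.
    return seq.translate(COMPLEMENT)[::-1]

sense = "ATGGCCTTA"                      # hypothetical sense strand, 5' -> 3'
antisense = reverse_complement(sense)    # partner strand, also written 5' -> 3'
print(antisense)                         # TAAGGCCAT
```

Applying the function twice returns the original strand, which is a quick sanity check.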
Introns and Exons • Introns: non-coding sequences in the DNA that are NOT used to make a protein • Exons: coding sequences in the DNA that are expressed, i.e. used to make mRNA and ultimately a protein
Sanger Method • Fred Sanger was originally a protein chemist • Made his first mark in sequencing proteins (Nobel Prize, 1958) • Made his second mark in sequencing RNA • Dideoxy sequencing (method published 1977; Nobel Prize, 1980)
Sanger Method: Dideoxy Chain Termination 300-500 bases
Capillary Method - Fluorescent Dyes 800-1000 bases
Automated Sequencing • Leroy Hood developed fluorescent color labels for the 4 terminator nucleotide bases (late 80s). • This allowed all 4 bases to be sequenced in a single reaction and sorted in a single gel lane. • Hood also pioneered direct data collection by computer. • Improvements in this technology have since enabled sequencing of billion-base genomes in < 1 year.
Automated sequencing machines use 4 colors, so they can read all 4 bases at once.
Sanger can do whole genomes (painfully). Figure: genome → short fragments of DNA → short DNA sequences (reads) → overlapping reads assembled into the sequenced genome.
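The idea of rebuilding a genome from overlapping short reads can be sketched with a toy greedy merge. This is only an illustration under strong assumptions (error-free reads, no repeats, same strand); real assemblers are far more involved, and the reads and function names below are invented for the example:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; the remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

# Three hypothetical reads taken from the toy genome ATGCGTACCTGAAGTC
reads = ["ATGCGTAC", "GTACCTGA", "CTGAAGTC"]
print(greedy_assemble(reads))  # ['ATGCGTACCTGAAGTC']
```

With repeats or sequencing errors the greedy choice can merge the wrong pair, which is one reason whole-genome assembly from short fragments is painful.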
2001: The HGP consortium publishes its working draft in Nature (15 February), and Celera publishes its draft in Science (16 February).
Sequencing the Human Genome (figure: log10(price) against year, 2000-2014) • 2001: Human Genome Project, 2.7G$, 11 years • 2001: Celera, 100M$, 3 years • 2007: 454, 1M$, 3 months • 2008: ABI SOLiD, 60K$, 2 weeks • 2009: Illumina, Helicos, 40-50K$ • 2010: 7K$, a few days • 2014: 1000$, 24 hrs?
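The figure plots price on a log10 scale, so each unit on the axis is a tenfold change in cost. A small sketch that recomputes the plotted values from the prices listed on the slide (the labels for the 2010 and 2014 points are placeholders, since the slide names no platform for them):

```python
import math

# (year, label, approximate cost in US dollars) as listed on the slide
costs = [
    (2001, "Human Genome Project", 2.7e9),
    (2001, "Celera", 100e6),
    (2007, "454", 1e6),
    (2008, "ABI SOLiD", 60e3),
    (2010, "(platform not named)", 7e3),
    (2014, "(projected)", 1e3),
]

def log10_price(dollars: float) -> float:
    """Value plotted on the y-axis of the cost figure."""
    return math.log10(dollars)

for year, label, dollars in costs:
    print(f"{year}  {label:<22} log10(price) = {log10_price(dollars):.2f}")

# Overall fold reduction between the first and last entries
fold_drop = costs[0][2] / costs[-1][2]
print(f"fold reduction 2001 -> 2014: {fold_drop:,.0f}")
```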
Sanger vs NGS • ‘Sanger sequencing’ has been the only DNA sequencing method for 30 years but… • NGS has the ability to process millions of sequence reads in parallel rather than 96 at a time (at a small fraction of the cost)
High Throughput Sequencing • Massively parallel sequencing • Sequencing millions of molecules in parallel • Do not need prior knowledge of what you’re sequencing • We will work with Illumina data only
454 = Paradigm Shift • Standard ABI “Sanger” sequencing: 96 samples/day; read length ~650 bp; total = 450,000 bases • 454: ~400,000 different templates (reads)/day; read length ~250 bp; total = 100,000,000 bases of sequence data
Solexa ups the Game • Solexa (Illumina GA) • 60,000,000 different sequence templates (yes that is an insane 60 million reads) • originally 36 bp read length (much longer now) • 4 billion bases of DNA per run (3 days)
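The throughput figures on these slides are just number of reads times read length. A quick arithmetic check, using only the slide's numbers: 96 reads of 650 bp come to about 62,400 bases, so the 450,000-base Sanger total presumably reflects several 96-sample runs, and 4 billion bases from 60 million reads implies reads longer than the original 36 bp.

```python
def total_bases(reads: int, read_length: int) -> int:
    """Sequence yield = number of reads x read length."""
    return reads * read_length

sanger_96 = total_bases(96, 650)             # one set of 96 Sanger reads
roche_454 = total_bases(400_000, 250)        # 454, per day
solexa_36 = total_bases(60_000_000, 36)      # Solexa run at the original 36 bp

print(f"Sanger, 96 reads : {sanger_96:>13,}")   # 62,400
print(f"454, per day     : {roche_454:>13,}")   # 100,000,000
print(f"Solexa, 36 bp    : {solexa_36:>13,}")   # 2,160,000,000

# Read length implied by the quoted 4 billion bases per run
implied_length = 4_000_000_000 / 60_000_000
print(f"implied Solexa read length: {implied_length:.1f} bp")

# Number of 96-read sets needed to reach the quoted 450,000 bases
print(f"96-read sets for 450,000 bases: {450_000 / sanger_96:.1f}")
```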
Some NGS milestones • 454 Life Sciences/Roche • Genome Sequencer FLX: currently produces 400-600 million bases per day per machine • Published 1 million bases of Neanderthal DNA in 2006 • May 2007 published complete genome of James Watson (3.2 billion bases ~20x coverage) • Solexa/Illumina • 10 Gb (gigabases) per machine/week • May 2008 published complete genomes for 3 HapMap subjects (14x coverage) • ABI SOLiD • 20 Gb (gigabases) per machine/week
The general NGS workflow • Random shearing of the DNA • Adding adaptors and barcodes • Size selection • Amplification • Sequencing
It’s all about nanotechnology • Each system works differently, but they are all based on similar principles: • Shear target DNA into small pieces • Bind individual DNA molecules to a solid surface • Amplify each molecule into a cluster • Copy one base at a time and detect different signals for A, C, T, & G bases • Requires very precise high-resolution imaging of tiny features (Solexa has 800 images @ 4 megapixels each)
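Coverage ("20x", "14x") is the total number of bases sequenced divided by the genome size. A minimal sketch of that relationship, assuming a 3.2 billion base human genome and the per-week yields quoted above; the function names are illustrative:

```python
HUMAN_GENOME = 3.2e9  # bases, as quoted on the slide

def coverage(bases_sequenced: float, genome_size: float = HUMAN_GENOME) -> float:
    """Average number of times each genome position is read."""
    return bases_sequenced / genome_size

def bases_needed(target_coverage: float, genome_size: float = HUMAN_GENOME) -> float:
    """Total sequence required to reach a target average coverage."""
    return target_coverage * genome_size

need_20x = bases_needed(20)   # 6.4e10 bases
need_14x = bases_needed(14)   # 4.48e10 bases

# Machine-weeks at 10 gigabases per machine per week (Solexa/Illumina figure above)
weeks_14x = need_14x / 10e9
print(f"20x needs {need_20x:.2e} bases; 14x needs {need_14x:.2e} bases")
print(f"14x at 10 Gb/week: about {weeks_14x:.1f} machine-weeks")
```

This is average coverage only; actual coverage varies along the genome.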
454 Sequencing Overview • Prepare library of single stranded DNA, 200-500 bp long and ligate adapters • Perform emulsion PCR, amplifying a single DNA template molecule in each microreactor (bead). • Sequence all clonally amplified sample fragments in parallel using pyrosequencing technology • Analyze sequence results • CLEAN data • Align overlapping sequence of individual reads to define contigs (Shotgun) • Order and orient contigs, create scaffolds (Paired End) • Identify variants (Amplicon) • Determine gene expression patterns (Transcriptome)
Emulsion Based Clonal Amplification • Mix DNA library & capture beads (limited dilution) • Create “water-in-oil” emulsion: micro-reactors containing adapter-carrying library DNA, PCR reagents and emulsion oil • Perform emulsion PCR • “Break micro-reactors” • Isolate DNA-containing beads • Generation of millions of clonally amplified sequencing templates on each bead From: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt)
Depositing DNA Beads into the PicoTiter™Plate • Load beads into PicoTiter™Plate (figure label: 44 μm) • Load enzyme beads Adapted from: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt)
Sequencing By Synthesis • Reagent flow over PicoTiterPlate wells • Photons generated are captured by camera • Sequencing image created Adapted from: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt)
FLX Sequencing Reaction www.roche-applied-science.com
Different Library Preparation Methods for Different Project Aims • Shotgun Library Preparation for de novo or resequencing of genomic DNA or long PCR product • Paired End Library Preparation provides regions of sequence a known distance apart, allowing for ordering of contigs and analysis of genetic rearrangement. • Amplicon Library Preparation, e.g. for detection of rare variants.
454: Shotgun Library Preparation • Create random DNA fragments, 300-800 bp, by nebulization with compressed N2 • Ligate universal adapters “A” and “B” • Select for “A”-“B” fragments • Remove second strand • Attach to library beads via “B” adapter at calculated concentration to yield a single template molecule per library bead • Proceed to emPCR Images from: https://www.roche-applied-science.com/
454: Amplicon Library Preparation • Target amplicon of 200-500 bp • 200 bp for uni-directional reads • 500 bp requires bi-directional reads • Amplify using fusion primers that include template specific primer and primers A and B • Purify and quantify • Proceed to emPCR
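Why the template concentration has to be "calculated": if templates land on beads at random, the number per bead is often modelled as Poisson distributed. That model is an assumption of this sketch, not something stated on the slide, and the ratios below are arbitrary examples.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k templates on a bead when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# lam = average number of template molecules per bead (hypothetical values)
for lam in (0.1, 0.5, 1.0, 2.0):
    empty = poisson_pmf(0, lam)       # beads with no template: wasted
    single = poisson_pmf(1, lam)      # beads with exactly one template: usable
    mixed = 1.0 - empty - single      # beads with two or more templates: mixed signal
    print(f"lam={lam:<4} empty={empty:.2f} single={single:.2f} mixed={mixed:.2f}")
```

Under this model a low ratio leaves most beads empty, while a high ratio gives many beads carrying more than one template, so the dilution is a trade-off.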
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Illumina (Solexa) Applications • Resequencing: characterise different related species or strains • Transcriptome analysis: no chip/array required; random priming of RNA • DNA methylation analysis: sequencing bisulfite-converted DNA; methylation-sensitive restriction digest enriched fragments • Examine chromatin modifications: quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq) (Computational Biology Research Group)
454 vs Solexa • 454: homopolymer runs (AAAAA..) are a problem • Read length: 400 bp • Number of reads: 400,000 • Per-base cost greater • Suited to de novo assembly, metagenomics • Solexa: read length: 40 bp • Number of reads: millions • Per-base cost cheaper • Ideal for applications requiring short reads: ncRNA
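How bisulfite sequencing reveals methylation: unmethylated cytosines are converted and read as T, while methylated cytosines still read as C. A toy sketch on a single strand, with an invented reference and methylation pattern; real analyses also have to handle both strands, incomplete conversion and alignment:

```python
def bisulfite_convert(seq: str, methylated_positions) -> str:
    """Unmethylated C reads as T after conversion; methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference: str, converted_read: str):
    """Positions where a reference C is still read as C, i.e. was protected."""
    return {
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted_read))
        if ref_base == "C" and read_base == "C"
    }

reference = "ACGTCCGATCGA"     # hypothetical reference; C at positions 1, 4, 5, 9
methylated = {1, 9}            # hypothetical methylated cytosines

read = bisulfite_convert(reference, methylated)
print(read)                                # ACGTTTGATCGA
print(sorted(call_methylation(reference, read)))  # [1, 9]
```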
Problem: Huge Amount of Image Data • Raw image data huge: 1-2 TB for the Solexa, more for ABI-SOLID, less for 454 • The images are immediately processed into intensity data (spots w/ location and brightness) • Intensity data is then processed into basecalls (A, C, T, or G plus a quality score for each) • Basecall data is on the order of 5-10 GB per run (or a week of runs for 454)
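The quality score attached to each basecall is usually a Phred score, Q = -10 log10(p), where p is the estimated probability that the call is wrong. A short sketch of decoding such scores from a quality string, assuming the Phred+33 ASCII encoding (some older Illumina pipelines used an offset of 64, so the offset is a parameter); the quality string is made up:

```python
def phred_to_error_prob(q: int) -> float:
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_quality_string(qual: str, offset: int = 33) -> list:
    """Convert an ASCII quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual]

bases = "ACGTA"       # hypothetical basecalls
qual = "II5+!"        # hypothetical quality string, one character per base

scores = decode_quality_string(qual)
for base, q in zip(bases, scores):
    print(f"{base}  Q{q:<3} P(error) = {phred_to_error_prob(q):.4f}")
```

Q20 corresponds to a 1 in 100 chance of a wrong call and Q40 to 1 in 10,000.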
Adapted from John McPherson, OICR 2009/10. Figure: bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp), both on log scales • AB SOLiDv3: 120 Gb, 100 bp reads • Illumina HiSeq: 100 Gb, 150 bp reads • 454 GS FLX Titanium: 0.4-0.6 Gb, 100-400 bp reads • ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads
Storage becoming a real problem Kahn, 2011, Science
Lower Cost = More Innovation • As sequencing becomes cheaper, more investigators can use it for routine assays • Leads to variations and absolutely novel applications • Replacement of other really good technologies
Lower Cost = More samples • More patients in association studies • More replicates in all other assays • More permutations
Bioinformatics is the Bottleneck • Sequencing is a commodity – can easily be outsourced • Bioinformatics is the essential point of the science • Data analysis and discovery of meaning in results • As the data throughput increases, the cost and time spent on analysis increase more than linearly