Introduction to Next Generation Sequencing

About the assignment • The practicals are the subparts of the assignment • Write up in more or less ‘paper format’ or in standard prac format • Due end of day on 16 March (Sunday)

Part 1: Introduction

What is sequencing? • Finding the sequence of a DNA/RNA molecule What can we sequence? http://cancergenome.nih.gov/newsevents/multimedialibrary/images/CancerBiology

Front matter: Central Dogma of Molecular Biology Reverse Transcription

Why does that matter? DNA sequencing exploits the physicochemical properties of DNA and the enzymes involved in its replication

The DNA molecule • The two strands of DNA are different • One is called the sense strand and it is the plan to make a protein • The other strand is the antisense strand

Connecting the DNA molecule • The two strands of DNA are said to be antiparallel • One strand is oriented in a 5’ to 3’ direction • The other strand is oriented in the opposite 3’ to 5’ direction [Figure: sense and antisense strands drawn antiparallel, with 5’ and 3’ ends labelled]
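The antiparallel arrangement can be sketched in a few lines of Python. This is a minimal illustration, assuming a made-up nine-base sense strand: because the partner strand runs in the opposite direction, writing it 5’ to 3’ means complementing every base and then reversing the result.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]

sense = "ATGGCCTTA"                      # hypothetical sense strand, 5' to 3'
antisense = reverse_complement(sense)    # partner strand, also written 5' to 3'
print(sense, antisense)
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.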

Replication of DNA

Introns and Exons • Introns: non-coding sequences in the DNA that are NOT used to make a protein • Exons: coding sequences in the DNA that are expressed, i.e. used to make mRNA and ultimately a protein
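As a rough sketch of the distinction, the snippet below joins the exons of a toy gene and discards the intron. The gene sequence and the exon coordinates are invented for illustration; real annotations come from a genome database.

```python
def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) and drop everything else."""
    return "".join(gene[start:end] for start, end in sorted(exons))

# Hypothetical gene: exon (upper case) + intron (lower case) + exon.
gene = "ATGGCC" + "gtaagtttcag" + "GATTAA"
exons = [(0, 6), (17, 23)]   # assumed coordinates of the two exons

mature = splice(gene, exons)
print(mature)
```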

Introns and Exons

Transcription

Translation

Sanger Method • Fred Sanger, 1958 • Was originally a protein chemist • Made his first mark in sequencing proteins • Made his second mark in sequencing RNA • 1980 dideoxy sequencing

Sanger Method: Dideoxy Chain Termination • Read length: 300-500 bases

Capillary Method - Fluorescent Dyes • Read length: 800-1000 bases

Automated Sequencing • Leroy Hood developed fluorescent color labels for the 4 terminator nucleotide bases (late 80s). • This allowed all 4 bases to be sequenced in a single reaction and sorted in a single gel lane. • Hood also pioneered direct data collection by computer. • Improvements in this technology have enabled sequencing of billion-base genomes in < 1 year.

Automated sequencing machines use 4 colors, so they can read all 4 bases at once.
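The idea of reading all four bases in one lane can be sketched as follows. This is a toy model, not the instrument's software: the colour assignment is hypothetical, and each fragment is reduced to its length and the dye on its terminator base.

```python
import random

# Hypothetical dye assignment; real chemistries use their own colours.
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {colour: base for base, colour in DYE.items()}

def terminated_fragments(synthesised: str) -> list[tuple[int, str]]:
    """One (length, dye colour) pair for each possible termination position."""
    return [(i + 1, DYE[base]) for i, base in enumerate(synthesised)]

def read_lane(fragments: list[tuple[int, str]]) -> str:
    """Separation sorts fragments by length; read the dye colour of each in turn."""
    return "".join(BASE_FOR_DYE[colour] for _, colour in sorted(fragments))

fragments = terminated_fragments("GATTACA")
random.shuffle(fragments)        # all fragments are mixed in a single reaction
print(read_lane(fragments))
```

Because every fragment length occurs once and carries the colour of its last base, sorting by length recovers the sequence regardless of the order in which the fragments were produced.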

Sanger can do whole genomes (painfully) [Figure: genome → short fragments of DNA → short DNA sequences → sequenced genome]
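Putting short sequences back together into a genome can be illustrated with a greedy overlap merge. This is a deliberately simplified sketch, assuming error-free reads from one strand, no repeats, and no read contained inside another; production assemblers are far more elaborate. The reads and the minimum overlap are invented for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                            # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

reads = ["CTGAAGT", "ATGCGTAC", "GTACCTGA"]   # hypothetical short reads
print(greedy_assemble(reads))
```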

2001: The HGP consortium publishes its working draft in Nature (15 February), and Celera publishes its draft in Science (16 February).

Sequencing the Human Genome: cost and time
• 2001: Human Genome Project, 2.7 G$, 11 years
• 2001: Celera, 100 M$, 3 years
• 2007: 454, 1 M$, 3 months
• 2008: ABI SOLiD, 60 K$, 2 weeks
• 2009: Illumina, Helicos, 40-50 K$
• 2010: 7 K$, a few days
• 2014: 1000$, 24 hrs?
[Figure: log10(price) versus year, 2000 to 2014]
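The log10(price) axis of the chart can be recomputed from the quoted prices. The snippet below is only arithmetic on the figures given on the slide (the 2014 value carries a question mark in the source and is used as stated).

```python
import math

# Prices as quoted on the slide, in US dollars.
price_usd = {
    "2001 Human Genome Project": 2.7e9,
    "2001 Celera": 1e8,
    "2007 454": 1e6,
    "2008 ABI SOLiD": 6e4,
    "2010": 7e3,
    "2014 (projected on the slide)": 1e3,
}

log_price = {label: math.log10(price) for label, price in price_usd.items()}
for label, value in log_price.items():
    print(f"{label}: log10(price) = {value:.2f}")

# Overall drop between the first and last entries.
fold_drop = price_usd["2001 Human Genome Project"] / price_usd["2014 (projected on the slide)"]
print(f"fold reduction: {fold_drop:,.0f}")
```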

Sanger vs NGS • ‘Sanger sequencing’ has been the only DNA sequencing method for 30 years but… • NGS has the ability to process millions of sequence reads in parallel rather than 96 at a time (at a small fraction of the cost)

High Throughput Sequencing Massively parallel sequencing • Sequencing millions of molecules in parallel • Do not need prior knowledge of what you’re sequencing We will work with Illumina data only

454 = Paradigm Shift
• Standard ABI “Sanger” sequencing
  • 96 samples/day
  • Read length ~650 bp
  • Total = 450,000 bases
• 454
  • ~400,000 different templates (reads)/day
  • Read length ~250 bp
  • Total = 100,000,000 bases of sequence data

Solexa ups the Game • Solexa (Illumina GA) • 60,000,000 different sequence templates (yes that is an insane 60 million reads) • originally 36 bp read length (much longer now) • 4 billion bases of DNA per run (3 days)
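The throughput figures on the last two slides are products of read count and read length, which is easy to check. The sketch below uses only the numbers quoted on the slides. Two of the products do not match the quoted totals: 96 × 650 gives 62,400 rather than 450,000 (the larger figure would fit roughly seven 96-sample runs per day, which is an assumption, not something the slide states), and 60 million × 36 bp gives about 2.2 billion, so the 4 billion per run presumably reflects the longer reads the slide mentions.

```python
def total_bases(reads: int, read_length_bp: int) -> int:
    """Raw yield: number of reads times bases per read."""
    return reads * read_length_bp

sanger = total_bases(96, 650)              # ABI capillary, per 96 samples
roche_454 = total_bases(400_000, 250)      # 454, per day
solexa_36bp = total_bases(60_000_000, 36)  # Solexa at the original 36 bp

# Read length implied by the quoted 4 billion bases per Solexa run.
implied_read_length = 4_000_000_000 / 60_000_000

print(sanger, roche_454, solexa_36bp, round(implied_read_length, 1))
```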

Some NGS milestones
• 454 Life Sciences/Roche
  • Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
  • Published 1 million bases of Neanderthal DNA in 2006
  • May 2007: published complete genome of James Watson (3.2 billion bases, ~20x coverage)
• Solexa/Illumina
  • 10 GB per machine/week
  • May 2008: published complete genomes for 3 HapMap subjects (14x coverage)
• ABI SOLiD
  • 20 GB per machine/week
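Coverage figures such as “20x” relate total sequenced bases to genome size: coverage = total bases / genome size. The sketch below applies that definition to the numbers on the slide, taken at face value; the 500 million bases per day is simply the midpoint of the quoted 400-600 million range, chosen here for illustration.

```python
def coverage(total_bases: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size

def bases_needed(genome_size: int, target_coverage: int) -> int:
    """Total bases required to reach a target average coverage."""
    return genome_size * target_coverage

genome_size = 3_200_000_000                  # bases, as quoted on the slide
needed = bases_needed(genome_size, 20)       # bases for ~20x coverage
machine_days = needed / 500_000_000          # assumed midpoint of 400-600 Mb/day

print(needed, coverage(needed, genome_size), machine_days)
```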

Technology Overview

The general NGS workflow Random shearing of the DNA Adding adaptors and barcodes Size selection Amplification Sequencing
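The first four workflow steps can be mimicked in silico. This is a toy model under stated assumptions: the adaptor and barcode sequences are invented, shearing is modelled with exponentially distributed fragment sizes, and amplification is an idealised doubling per PCR cycle.

```python
import random

def shear(dna: str, mean_len: int, rng: random.Random) -> list[str]:
    """Break dna at random points; fragment sizes average roughly mean_len."""
    fragments, pos = [], 0
    while pos < len(dna):
        size = max(1, round(rng.expovariate(1 / mean_len)))
        fragments.append(dna[pos:pos + size])
        pos += size
    return fragments

def add_adaptors(fragment: str, barcode: str, left: str = "ACAC", right: str = "GTGT") -> str:
    """Attach hypothetical adaptor sequences and a sample barcode."""
    return left + barcode + fragment + right

def size_select(molecules: list[str], shortest: int, longest: int) -> list[str]:
    """Keep only molecules within the chosen length window."""
    return [m for m in molecules if shortest <= len(m) <= longest]

def amplify(molecules: list[str], cycles: int) -> dict[str, int]:
    """Idealised PCR: every molecule doubles in each cycle."""
    return {m: 2 ** cycles for m in molecules}

rng = random.Random(0)                                   # fixed seed for repeatability
genome = "".join(rng.choice("ACGT") for _ in range(2000))
fragments = shear(genome, mean_len=100, rng=rng)
library = [add_adaptors(f, barcode="TTAGGC") for f in fragments]
selected = size_select(library, shortest=60, longest=200)
copies = amplify(selected, cycles=10)
print(len(fragments), "fragments,", len(selected), "kept after size selection")
```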

It’s all about nanotechnology • Each system works differently, but they are all based on similar principles: • Shear target DNA into small pieces • Bind individual DNA molecules to a solid surface • Amplify each molecule into a cluster • Copy one base at a time and detect different signals for A, C, T, & G bases • Requires very precise high-resolution imaging of tiny features • (Solexa has 800 images @ 4 megapixels each)

454 Sequencing Overview • Prepare library of single stranded DNA, 200-500 bp long and ligate adapters • Perform emulsion PCR, amplifying a single DNA template molecule in each microreactor (bead). • Sequence all clonally amplified sample fragments in parallel using pyrosequencing technology • Analyze sequence results • CLEAN data • Align overlapping sequence of individual reads to define contigs (Shotgun) • Order and orient contigs, create scaffolds (Paired End) • Identify variants (Amplicon) • Determine gene expression patterns (Transcriptome)

Emulsion Based Clonal Amplification • Mix DNA library & capture beads (limited dilution) • Add PCR reagents and emulsion oil to create a “water-in-oil” emulsion of micro-reactors • Perform emulsion PCR on the adapter-carrying library DNA • “Break micro-reactors” • Isolate DNA-containing beads • Generation of millions of clonally amplified sequencing templates on each bead From: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt )

Depositing DNA Beads into the PicoTiter™ Plate • Load beads into PicoTiter™ Plate (figure label: 44 μm) • Load enzyme beads Adapted from: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt )

Sequencing By Synthesis: reagent flow and image capture • Reagent flow over PicoTiterPlate wells • Photons generated are captured by camera • Sequencing image created Adapted from: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt )
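Reagent flow and light capture can be sketched as a “flowgram”: one nucleotide is flowed at a time, and the signal in a well is proportional to the number of bases incorporated in that flow. The code below is an idealised, noise-free model; the flow order TACG and the example read are assumptions for illustration. It also shows why homopolymers are hard: a run such as CCC is a single flow whose length must be judged from signal intensity alone.

```python
def flowgram(read: str, flow_order: str = "TACG", cycles: int = 1) -> list[tuple[str, int]]:
    """Ideal signal per nucleotide flow: the number of bases incorporated."""
    signals, pos = [], 0
    for _ in range(cycles):
        for nucleotide in flow_order:
            run = 0
            # A homopolymer run is incorporated within one flow.
            while pos < len(read) and read[pos] == nucleotide:
                run += 1
                pos += 1
            signals.append((nucleotide, run))
    return signals

def call_bases(signals: list[tuple[str, int]]) -> str:
    """Turn per-flow signals back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)

signals = flowgram("TTACCCG")      # hypothetical strand being synthesised
print(signals)
print(call_bases(signals))
```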

FLX Sequencing Reaction www.roche-applied-science.com

Different Library Preparation Methods for Different Project Aims • Shotgun Library Preparation for de novo sequencing or resequencing of genomic DNA or long PCR product • Paired End Library Preparation provides regions of sequence a known distance apart, allowing for ordering of contigs and analysis of genetic rearrangement. • Amplicon Library Preparation, e.g. for detection of rare variants.

454: Shotgun Library Preparation • Create random DNA fragments, 300-800 bp, by nebulization with compressed N2 • Ligate universal adapters “A” and “B” • Select for “A”-“B” fragments • Remove second strand • Attach to library beads via “B” adapter at calculated concentration to yield a single template molecule per library bead • Proceed to emPCR Images from: https://www.roche-applied-science.com/
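Why the concentration has to be calculated can be seen with a simple model. Assuming, as a modelling choice not taken from the slides, that template molecules land on beads independently, the number of templates per bead follows a Poisson distribution with mean lam (templates per bead). Beads with exactly one template give clean reads, empty beads are wasted, and beads with several templates give mixed signal.

```python
import math

def poisson(k: int, lam: float) -> float:
    """Probability of exactly k templates on a bead when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def bead_fractions(lam: float) -> tuple[float, float, float]:
    """Fractions of beads that are empty, single-template, or multi-template."""
    empty = poisson(0, lam)
    single = poisson(1, lam)
    return empty, single, 1 - empty - single

for lam in (0.1, 0.5, 1.0, 2.0):
    empty, single, multiple = bead_fractions(lam)
    print(f"lam={lam}: empty={empty:.2f} single={single:.2f} multiple={multiple:.2f}")
```

Under this model, a low template-to-bead ratio keeps mixed beads rare at the cost of many empty beads, which is the trade-off behind limiting dilution.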

454: Amplicon Library Preparation • Target amplicon of 200-500 bp • 200 bp for uni-directional reads • 500 bp requires bi-directional reads • Amplify using fusion primers that include template-specific primer and primers A and B • Purify and quantify • Proceed to emPCR

Illumina

From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4

Sequencing by Synthesis (SBS)

Illumina (Solexa) Applications
• Resequencing
  • Characterise different related species or strains
• Transcriptome analysis
  • No chip/array required!
  • Random priming of RNA
• DNA methylation analysis
  • Sequencing bisulfite-converted DNA
  • Methylation-sensitive restriction digest enriched fragments
• Examine chromatin modifications
  • Quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)
Computational Biology Research Group
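For the methylation application, the principle behind sequencing bisulfite-converted DNA can be shown in silico: unmethylated cytosines are converted and read as T, while methylated cytosines are protected and still read as C. The sketch below is a single-strand toy model with an invented reference and an invented methylated position; it ignores incomplete conversion and the second strand.

```python
def bisulfite_convert(dna: str, methylated: set[int]) -> str:
    """Unmethylated C reads as T after conversion; methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(dna)
    )

def call_methylation(reference: str, converted: str) -> set[int]:
    """Positions where a reference C is still read as C are called methylated."""
    return {
        i for i, (ref, obs) in enumerate(zip(reference, converted))
        if ref == "C" and obs == "C"
    }

reference = "ACGTCCGA"          # hypothetical reference sequence
converted = bisulfite_convert(reference, methylated={1})
print(converted, call_methylation(reference, converted))
```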

454 vs Solexa
• 454
  • Homopolymers (AAAAA..)
  • Read length: 400 bp
  • Number of reads: 400,000
  • Per-base cost greater
  • De novo assembly, metagenomics
• Solexa
  • Read length: 40 bp
  • Number of reads: millions
  • Per-base cost cheaper
  • Ideal for applications requiring short reads: ncRNA

Problem: Huge Amount of Image Data • Raw image data huge: 1-2 TB for the Solexa, more for ABI-SOLID, less for 454 • The images are immediately processed into intensity data (spots w/ location and brightness) • Intensity data is then processed into basecalls (A, C, T, or G plus a quality score for each) • Basecall data is on the order of 5-10 GB per run (or a week of runs for 454)
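The quality score attached to each basecall is conventionally a Phred score, Q = -10 · log10(p), where p is the estimated probability that the call is wrong. The sketch below converts between the two and decodes an ASCII-encoded quality string; the offset of 33 is an assumption (it is the common FASTQ convention, but some older pipelines used 64), and the example string is invented.

```python
import math

def phred(p_error: float) -> float:
    """Phred quality score for a given error probability."""
    return -10 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Error probability implied by a Phred score."""
    return 10 ** (-q / 10)

def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string (assumed Phred+33) into scores."""
    return [ord(char) - offset for char in qual]

scores = decode_quality("II5!")          # hypothetical quality string
print(scores, [error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance of a wrong call and a score of 30 to 1 in 1000.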

Bases per machine run versus read length. Adapted from John McPherson, OICR 2009/10
• AB SOLiD v3: 120 Gb, 100 bp reads
• Illumina HiSeq: 100 Gb, 150 bp reads
• 454 GS FLX Titanium: 0.4-0.6 Gb, 100-400 bp reads
• ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads
[Figure: log-log plot of bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp)]

Stein Genome Biology 2010 11:207

Storage becoming a real problem Kahn, 2011, Science

Lower Cost = More Innovation • As sequencing becomes cheaper, more investigators can use it for routine assays • Leads to variations and absolutely novel applications • Replacement of other really good technologies

Lower Cost = More samples • More patients in association studies • More replicates in all other assays • More permutations

Bioinformatics is the Bottleneck • Sequencing is a commodity – can easily be outsourced • Bioinformatics is the essential point of the science • Data analysis and discovery of meaning in results • As the data throughput increases, the cost and time spent on analysis increase more than linearly
