Next generation sequencing: an overview

A I Bhat, Indian Institute of Spices Research, Calicut

DNA sequencing
• Chain termination method (Sanger et al., 1977): the sequence of a single-stranded DNA molecule is determined by enzymatic synthesis of complementary polynucleotide chains that terminate at specific nucleotide positions.
• Chemical degradation method (Maxam and Gilbert, 1977): the sequence of a double-stranded DNA molecule is determined by treatment with chemicals that cut the molecule at specific nucleotide positions.

Chain termination method

Dye-terminator sequencing
• Labels the chain-terminating ddNTPs, which permits sequencing in a single reaction.
• Each of the four dideoxynucleotide chain terminators is labelled with a different fluorescent dye (ddA green, ddT red, ddG yellow and ddC blue), each fluorescing at a different emission wavelength.
• The fragment terminating at each base position is detected on the gel with a laser beam.
• Owing to its greater expediency and speed, dye-terminator sequencing is now the mainstay of automated sequencing.

Capillary electrophoresis: view of a dye-terminator read
• The Sanger method can sequence only 1000–1200 bp in one reaction.

Genome sequencing
• 1970s: bacteriophage
• 1992: the first eukaryotic chromosome sequence (yeast)
• 1995: the bacterium Haemophilus influenzae, followed by several other bacteria and archaea
• Many eukaryotes, several plants and their pathogens
• 2006: human genome
• Until 2006, all genome sequencing used Sanger chemistry

Shotgun sequencing
• Used in the Human Genome Project
• Genomic DNA is broken down enzymatically or mechanically
• Fragments are cloned into sequencing vectors
• Clones are sequenced individually
• Numerous DNA fragments are sequenced: the birth of genome informatics and next generation sequencing

Whole genome sequencing

• The core philosophy of the massively parallel sequencing used in next-generation sequencing (NGS) is adapted from shotgun sequencing.
• NGS workflow: break the entire genome into small pieces, ligate the DNA to designated adapters, then read the fragments in parallel during DNA synthesis (sequencing-by-synthesis).
• Coverage: the number of short reads that overlap each other within a specific genomic region.
• Sufficient coverage is critical for accurate assembly of the genomic sequence.
• To ensure correct identification of genetic variants, short-read coverage of at least 30× is recommended in whole-genome scans (Zhang et al., 2011. J Genet Genomics, 38:95-109).
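
As a rough illustration of the 30× recommendation, mean coverage is commonly estimated as total sequenced bases divided by genome size. The sketch below applies that estimate; the genome size (about 3 Gb) and the 100 bp read length are assumed example values, and the function names are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target depth."""
    return math.ceil(target_coverage * genome_size / read_length)


genome_size = 3_000_000_000  # assumed: approximate human genome size, in bp
read_length = 100            # assumed: short-read length, in bp

n = reads_needed(30, read_length, genome_size)
print(n)                                           # 900000000
print(mean_coverage(n, read_length, genome_size))  # 30.0
```

This is an average only; actual depth varies from region to region, so some regions fall below the mean.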

Next generation sequencing
• Enables a genome to be sequenced within hours to days.
• The 454 FLX Pyrosequencer from Roche Applied Science was the first next-generation sequencer to become commercially available, in 2004.
• The Solexa 1G Genetic Analyzer from Illumina was commercialized in 2006.
• The SOLiD (Supported Oligonucleotide Ligation and Detection) System from Applied Biosystems was launched in 2007.
Next-next generation or third generation sequencing
• Single molecule sequencing

Platforms on NGS technologies

Next (2nd) generation platforms

Roche GS-FLX 454 Genome Sequencer
• Longest reads (600 bp) among the short-read NGS platforms
• Generates ~400–600 Mb of sequence reads per run
• Used for de novo assembly of microbial genomes and in metagenomics
• Reported raw base accuracy is very good (over 99%)

Chemistry
• Nucleotide incorporation releases pyrophosphate (PPi).
• ATP sulfurylase quantitatively converts PPi to ATP in the presence of adenosine 5′-phosphosulfate.
• This ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light in an amount proportional to the amount of ATP.
• The light produced in the luciferase-catalyzed reaction is detected by a camera and analyzed by software.
• Unincorporated nucleotides and ATP are degraded by apyrase, and the reaction can restart with another nucleotide.

Illumina/Solexa Genome Analyzer
• Superior data quality and suitable read lengths have made it the system of choice for many genome sequencing projects.
• The majority of published NGS papers used the Genome Analyzer.
• Uses a proprietary reversible terminator-based method that detects single bases as they are incorporated into growing DNA strands.
• A fluorescently labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base.
• Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias.
• The end result is true base-by-base sequencing, reported to give highly accurate data for a broad range of applications.

Solexa-based Whole Genome Sequencing (adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome”, http://tinyurl.com/5f3alk)

ABI SOLiD platform
• The latest model is the 5500xl SOLiD system (previously known as SOLiD4hq).
• Can generate over 2.4 billion reads per run with a raw base accuracy of 99.94%.
• The SOLiD4 platform probably provides the best data quality as a result of its sequencing-by-ligation approach, but the DNA library preparation procedures prior to sequencing can be tedious and time consuming.
• Preferred for re-sequencing rather than de novo sequencing (Zhang et al., 2011).

Next generation sequencing using Roche 454
Sample preparation
• Nucleic acid isolation
• Double-stranded cDNA synthesis
• Rapid library preparation
• Fragmentation (nebulization/shearing) into smaller fragments of 400 to 1000 bp
• Addition of adapters
• Removal of small fragments (<300 bp)
• Library quality assessment

Emulsion based clonal amplification (emPCR)
• Preparation of reagents and of emulsion oil
• Preparation of amplification mix (addition of additive, amplification mix, primers, enzyme mix and PPiase)
• DNA library capture (one molecule of DNA per bead and one bead per aqueous microreactor, insulated from other beads by the surrounding oil)
• Emulsification (shaking the captured library to form a water-in-oil mixture)
• Amplification (emulsified beads are clonally amplified)
• Bead recovery and enrichment

Sequencing
• Clonally amplified fragments are loaded onto a PicoTiter Plate device for sequencing (the diameter of the plate wells allows only one bead per well).
• After addition of sequencing enzymes, the fluidics subsystem of the sequencing instrument flows individual nucleotides in a fixed order across all wells.
• Addition of one (or more) nucleotide(s) complementary to the template strand results in a chemiluminescent signal recorded by the CCD camera within the instrument.
• During the nucleotide flow, thousands of beads, each carrying millions of copies of a single-stranded DNA molecule, are sequenced in parallel.
• Each 10-h sequencing run will typically produce over 1,000,000 flowgrams (one flowgram per bead).
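
The flow-based readout described above can be sketched as a toy simulation for a single bead. It assumes an ideal, noise-free signal equal to the number of nucleotides incorporated in each flow, and a cyclic flow order of T, A, C, G; the function name and the example template are hypothetical.

```python
def flowgram(template: str, flow_order: str = "TACG", n_flows: int = 16) -> list[int]:
    """Return the ideal signal (bases incorporated) for each nucleotide flow.

    The template is given in the order in which the polymerase reads it,
    so the strand being synthesized is its base-by-base complement.
    """
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    to_synthesize = [complement[base] for base in template]

    position = 0
    signals = []
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # A homopolymer stretch is extended within one flow,
        # which gives a proportionally larger signal.
        while position < len(to_synthesize) and to_synthesize[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append(incorporated)
    return signals


# Template AATCGG is copied as TTAGCC over two rounds of T, A, C, G flows.
print(flowgram("AATCGG", n_flows=8))  # [2, 1, 0, 1, 0, 0, 2, 0]
```

A signal of 0 means the flowed nucleotide did not match the next template base; a signal of 2 marks a two-base homopolymer.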

• Base calling (to check quality of each read)
• Trimming primer sequence
• Production of contigs
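
The last two steps can be illustrated with a minimal sketch: strip a known primer from the start of each read, then merge reads greedily by their longest suffix-prefix overlap. This is a simplified stand-in, not the instrument software; the primer sequence, the reads and the function names are hypothetical, and exact (error-free) overlaps are assumed.

```python
def trim_primer(read: str, primer: str) -> str:
    """Remove the primer from the start of a read if it is present."""
    return read[len(primer):] if read.startswith(primer) else read


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


primer = "TCAG"  # hypothetical primer sequence
raw_reads = ["TCAGATGGCGT", "TCAGGCGTGCA", "TCAGTGCAATC"]
trimmed = [trim_primer(r, primer) for r in raw_reads]
print(trimmed)                   # ['ATGGCGT', 'GCGTGCA', 'TGCAATC']
print(greedy_assemble(trimmed))  # ['ATGGCGTGCAATC']
```

Real assemblers tolerate sequencing errors and handle repeats; the greedy merge here only shows how overlapping reads collapse into a contig.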

NGS platforms under development (3rd generation sequencers)
• Aim: single DNA molecule sequencing (without amplification)
• Provide accurate data with long reads
• Fluorescence-based single molecule sequencing (Pacific Biosciences; US Genomics)
• Nanotechnologies for single molecule sequencing (Oxford Nanopore Technologies, Nabsys, BioNanomatrix, Electronic Biosciences, Cracker Bio)
• Electronic detection for single molecule sequencing (Reveo, Intelligent Biosystems)
• Electron microscopy for single molecule sequencing (LightSpeed Genomics, Halcyon Molecular, ZS Genetics)

Single Molecule Sequencing (Helicos Biosciences, USA)
• Billions of single molecules of sample DNA are captured on an application-specific proprietary surface and serve as templates for sequencing-by-synthesis.
• Polymerase and one fluorescently labeled nucleotide (C, G, A or T) are added.
• The polymerase catalyzes the sequence-specific incorporation of fluorescent nucleotides into nascent complementary strands on all the templates.
• After a wash step, which removes all free nucleotides, the incorporated nucleotides are imaged and their positions recorded.
• The fluorescent group is removed in a highly efficient cleavage process, leaving behind the incorporated nucleotide.
• The process continues through each of the other three bases.
• Multiple four-base cycles produce complementary strands more than 25 bases long on billions of templates, giving a read of more than 25 bases from each individual template.

Ion Sequencing (Rothberg et al., Life Technologies: Nature, July 2011)
• Non-optical method of DNA sequencing of genomes
• Sequence data are obtained by directly sensing the ions produced by template-directed DNA polymerase synthesis, using all-natural nucleotides, on a massively parallel semiconductor-sensing device (ion chip).
• The ion chip contains 1.2 million ion-sensitive wells, which provide confinement and allow parallel, simultaneous detection of independent sequencing reactions.
• Performance of the system was shown by sequencing three bacterial genomes and one human genome.
• World’s smallest solid-state pH meter

• DNA is fragmented, ligated to adapters, and clonally amplified onto beads.
• Sequencing primers and DNA polymerase are then bound to the templates, which are pipetted into the chip’s loading port.
• Individual beads are loaded into individual sensor wells by spinning; the well depth allows only a single bead to occupy a well.
• All four nucleotides are provided in a stepwise fashion during an automated run.
• When the nucleotide in the flow is complementary to the template base directly downstream of the sequencing primer, it is incorporated into the nascent strand by the bound polymerase.
• This extends the sequencing primer by one base (or more, if a homopolymer stretch is directly downstream of the primer) and results in hydrolysis of the incoming nucleotide triphosphate, which causes the net liberation of a single proton for each nucleotide incorporated during that flow.
• The released protons shift the pH of the surrounding solution in proportion to the number of nucleotides incorporated in the flow (0.02 pH units per single base incorporation).
• This shift is detected by the sensor at the bottom of each well, converted to a voltage and digitized by off-chip electronics.
• Signal generation and detection occur over 4 s.
• After the flow of each nucleotide, a wash ensures that nucleotides do not remain in the well.
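
Going the other way, from measured signal to sequence, a minimal sketch can convert the pH shift of each flow into a base count using the 0.02 pH units per base quoted above. The flow order (T, A, C, G), the example measurements and the function names are assumptions for illustration; real base callers model noise and signal decay far more carefully.

```python
PH_SHIFT_PER_BASE = 0.02  # pH units per single-base incorporation (figure from the text)


def bases_incorporated(ph_shift: float) -> int:
    """Estimate how many nucleotides were incorporated in one flow."""
    return round(ph_shift / PH_SHIFT_PER_BASE)


def decode_flows(ph_shifts: list[float], flow_order: str = "TACG") -> str:
    """Turn per-flow pH shifts into the sequence of the synthesized strand."""
    read = []
    for flow, shift in enumerate(ph_shifts):
        nucleotide = flow_order[flow % len(flow_order)]
        # A shift near 0.04 means a two-base homopolymer, near 0 means no match.
        read.append(nucleotide * bases_incorporated(shift))
    return "".join(read)


# Hypothetical, slightly noisy measurements for eight flows (T, A, C, G, T, A, C, G).
shifts = [0.041, 0.019, 0.0, 0.022, 0.001, 0.0, 0.038, 0.0]
print(decode_flows(shifts))  # TTAGCC
```

Because the signal grows linearly with homopolymer length while the noise does not shrink, long homopolymers are the hardest calls for this kind of readout.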

Sequencing methods

Mining NGS data to obtain meaningful information
• An average NGS experiment generates gigabytes to terabytes of raw data.
• Existing bioinformatics tools fall into several general categories: (1) alignment of reads to a reference sequence; (2) de novo assembly; (3) reference-based assembly; (4) genetic variation detection (such as SNVs and indels); (5) genome annotation; (6) utilities for data analysis.
• The most important step in NGS data analysis is successful assembly or alignment of reads to a reference genome.
• After successful alignment and assembly, the next step is to interpret the large number of putative novel genetic variants (or mutations) that are present by chance.
• Recognition of functional variants is at the center of NGS data analysis and bioinformatics.
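
Categories (1) and (4) can be illustrated together with a toy example: place each read on the reference at the ungapped position with the fewest mismatches, pile up the aligned bases, and report positions where most reads disagree with the reference. The reference, the reads, the thresholds and the function names are all hypothetical; production aligners and variant callers use indexed search, gapped alignment and base qualities.

```python
from collections import Counter


def best_alignment(read: str, reference: str):
    """Return (offset, mismatches) of the ungapped placement with fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(r != w for r, w in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def call_snvs(reads, reference, max_mismatches=1, min_depth=2):
    """Report (position, reference base, variant base) supported by most reads."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        hit = best_alignment(read, reference)
        if hit is None or hit[1] > max_mismatches:
            continue  # read is longer than the reference or aligns too poorly
        offset = hit[0]
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1

    snvs = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little coverage to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n > depth / 2:
            snvs.append((pos, reference[pos], base))
    return snvs


reference = "ACGTTAGCCATGGA"
reads = ["ACGTTATC", "GTTATCCA", "TATCCATG", "CCATGGA"]  # sample carries G>T at position 6
print(call_snvs(reads, reference))  # [(6, 'G', 'T')]
```

The `min_depth` threshold is where coverage matters: a position seen in only one read cannot be distinguished from a sequencing error.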

Thanks
