Sequencing and Assembly

Sequencing and Assembly Cont’d

Steps to Assemble a Genome Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig 1. Find overlapping reads 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs 4. Derive consensus sequence ..ACGATTACAATAGGTT..

2. Merge Reads into Contigs • Overlap graph: • Nodes: reads r1…..rn • Edges: overlaps (ri, rj, shift, orientation, score) Reads that come from two regions of the genome (blue and red) that contain the same repeat Note: of course, we don’t know the “color” of these nodes

repeat region 2. Merge Reads into Contigs We want to merge reads up to potential repeat boundaries Unique Contig Overcollapsed Contig

2. Merge Reads into Contigs • Remove transitively inferable overlaps • If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2) r r1 r2 r3

2. Merge Reads into Contigs

2. Merge Reads into Contigs repeat boundary??? sequencing error • Ignore “hanging” reads, when detecting repeat boundaries b a … b a

Overlap graph after forming contigs Unitigs: Gene Myers, 95

Repeats, errors, and contig lengths • Repeats shorter than read length are easily resolved • Read that spans across a repeat disambiguates order of flanking regions • Repeats with more base pair diffs than sequencing error rate are OK • We throw overlaps between two reads in different copies of the repeat • To make the genome appear less repetitive, try to: • Increase read length • Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate  decreases effective repeat content  increases contig length

3. Link Contigs into Supercontigs Normal density Too dense  Overcollapsed Inconsistent links Overcollapsed?

3. Link Contigs into Supercontigs Find all links between unique contigs Connect contigs incrementally, if  2 forward-reverse links supercontig (aka scaffold)

3. Link Contigs into Supercontigs • Fill gaps in supercontigs with paths of repeat contigs • Complex algorithmic step • Exponential number of paths • Forward-reverse links

4. Derive Consensus Sequence TAGATTACACAGATTACTGA TTGATGGCGTAA CTA Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)

Some Assemblers • PHRAP • Early assembler, widely used, good model of read errors • Overlap O(n2)  layout (no mate pairs)  consensus • Celera • First assembler to handle large genomes (fly, human, mouse) • Overlap  layout  consensus • Arachne • Public assembler (mouse, several fungi) • Overlap  layout  consensus • Phusion • Overlap  clustering  PHRAP  assemblage  consensus • Euler • Indexing  Euler graph  layout by picking paths  consensus

Quality of assemblies—mouse Terminology:N50 contig length If we sort contigs from largest to smallest, and start Covering the genome in that order, N50 is the length Of the contig that just covers the 50th percentile. 7.7X sequence coverage

Quality of assemblies—dog 7.5X sequence coverage

Quality of assemblies—chimp 3.6X sequence Coverage Assisted Assembly

History of WGA 1997 • 1982: -virus, 48,502 bp • 1995: h-influenzae, 1 Mbp • 2000: fly, 100 Mbp • 2001 – present • human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes Let’s sequence the human genome with the shotgun strategy That is impossible, and a bad idea anyway Phil Green Gene Myers

$985 deCODEme (November 2007) $399 Personal Genome Service (November 2007) $2,500 Health Compass service (April 2008) Genetic Information Nondiscrimination Act (May 2008) $350,000 Whole-genome sequencing (November 2007)

Applications Whole-genome sequencing Comparative genomics Genome resequencing Structural variation analysis Polymorphism discovery Metagenomics Environmental sequencing Gene expression profiling Genotyping Population genetics Migration studies Ancestry inference Relationship inference Genetic screening Drug targeting Forensics

New sequencing applications Sequencing applications Increase in sequencing data output Demand for more sequencing Sequencing technology improvement

Sequencing technology Sanger sequencing $10.00 $1.00 Cost per finished bp: $0.10 $0.01 1975 1980 1990 2000 2008 Fred Sanger Read length: 15 – 200 bp 500 – 1,000 bp Throughput: “grad-student years” 2 ∙ 106 bp/day

Sequencing technology Sanger sequencing 3 ∙ 109 bp 1x coverage 10x coverage × 3 ∙ 109 bp = 40 years 2 ∙ 106 bp/day = $30 million 10x coverage × 3 ∙ 109 bp × $0.001/bp

Pyrosequencing on a chip • Mostafa Ronaghi, Stanford Genome Technologies Center • 454 Life Sciences

Sequencing technology Next-generation sequencing “short reads” Read length: 250 bp Throughput: 300 Mb/day Cost: ~ 10,000 bp/$ De novo: yes Genome Sequencer / FLX

Single Molecule Array for Genotyping—Solexa

Polony Sequencing

Sequencing technology Next-generation sequencing Genome Analyzer SOLiD Analyzer “microreads” Read length: ~ 35 bp Throughput: 300 – 500 Mb/day Cost: ~ 100,000 bp/$ De novo: yes

Molecular Inversion Probes

Illumina Genotype Arrays

Sequencing technology Next-generation sequencing “SNP chips” Infinium Assay GeneChip Array genotypes Read length: 1bp Throughput: 1 – 2 Mb/day Cost: 5,000 bp/$ De novo: no

Nanopore Sequencing http://www.mcb.harvard.edu/branton/index.htm

Sequencing technology Next-generation sequencing

Sequencing technology ?

Sequencing and Assembly

Sequencing and Assembly

Presentation Transcript

Introduction to Genomic Sequencing Assembly Annotation Diego Martinez

DNA Sequencing and Assembly

Sequencing

Cloning and Sequencing

Computational assembly for prokaryotic sequencing projects

Next Generation Sequencing, Assembly, and Alignment Methods

Genome Sequencing and Assembly High throughput Sequencing

Sequencing techniques and genome assembly

Concepts and methods in genome sequencing and sequence assembly

Genome Sequencing and Assembly

CUGI Pilot Sequencing/Assembly Projects

Sequencing and control

Whole Genome Sequencing, Assembly and Annotation

Genome Sequencing and Assembly High throughput Sequencing

Genome Sequencing and Assembly Progress

Concepts and methods in sequencing and genome assembly

NGS Bioinformatics Workshop 2.1 Next Generation Sequencing and Sequence Assembly Algorithms

Amplicon -Based Quasipecies Assembly Using Next Generation Sequencing

Sequencing and Assembly

CSCI2950-C Lecture 3 DNA Sequencing and Fragment Assembly

Sequencing technology and assembly

CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly