Genome Assembly: a brief introduction

Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg

Shotgun DNA Sequencing (Technology)
• DNA target sample
• SHEAR & SIZE SELECT (e.g., 10Kbp ± 8% std. dev.)
• LIGATE & CLONE into vector
• SEQUENCE from primers: end reads (mates), 550bp

Whole Genome Shotgun Sequencing
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
• Collect 10X sequence in a 1-to-1 ratio of two types of read pairs, short (2Kbp) and long (10Kbp): ~35 million reads for Human.
• Collect another 20X in clone coverage of 50Kbp end sequence pairs (BAC 5’ and BAC 3’ ends): ~1.2 million pairs for Human.
• Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

Sequencing Factory

Celera’s Sequencing Factory (circa 2001)
• 300 ABI 3700 DNA Sequencers
• 50 Production Staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents

Human Data (April 2000)
• Collected 27.27 million reads = 5.11X coverage
• 21.04 million are paired (77%) = 10.52 million pairs

  Insert size   Pairs     True*    Std. dev.
  2Kbp          5.045M    98.6%    <6%
  10Kbp         4.401M    98.6%    <8%
  50Kbp         1.071M    90.0%    <15%
  * validated against finished Chrom. 21 sequence

• The clones cover the genome 38.7X
• Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)

Pairs Give Order & Orientation
• Assembly without pairs results in contigs (consensus of 15-30Kbp, built from reads) whose order and orientation are not known.
• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean & std. dev. are known).
[Figure: contigs with unknown order, then a scaffold linked by pairs; one link is labeled "2-pair"]

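Anatomy of a WGS Assembly
[Figure: a chromosome with STS markers; STS-mapped scaffolds; within a scaffold, contigs separated by gaps (mean & std. dev. known) and linked by read pairs (mates); within a contig, the consensus over reads (of several haplotypes), with SNPs and external “reads”]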

Assembly gaps
• sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
• physical gap - no information is known about the adjacent contigs, nor about the DNA spanning the gap

Shotgun sequencing statistics

Typical contig coverage Imagine raindrops on a sidewalk

Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
E(#islands) = N e^{-cσ}
E(island size) = L((e^{cσ} – 1) / c + 1 – σ)
contig = island with 2 or more reads

Example
• Genome size: 1 Mbp
• Read length: 600
• Detectable overlap: 40
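The example parameters can be plugged into the Lander-Waterman formulas directly. The sketch below does that for a few coverage values; the function name `lander_waterman` and the chosen coverage values are illustrative assumptions, not part of the original slides.

```python
import math

def lander_waterman(G, L, T, c):
    """Expected number of reads, islands, and island size at coverage c."""
    N = c * G / L                      # from c = NL / G
    sigma = 1 - T / L                  # σ = 1 – T/L
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return N, islands, island_size

# Slide example: 1 Mbp genome, 600 bp reads, 40 bp detectable overlap.
# The coverage values below are arbitrary illustration points.
for c in (1, 3, 5, 8):
    N, islands, size = lander_waterman(G=1_000_000, L=600, T=40, c=c)
    print(f"c={c}: N={N:.0f} reads, E(#islands)={islands:.1f}, "
          f"E(island size)={size:.0f} bp")
```

The expected number of islands falls quickly as coverage grows, which is the "raindrops on a sidewalk" intuition: at high coverage nearly every position is hit and islands merge.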

Experimental data Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

Assembly paradigms
• Overlap-layout-consensus
  - greedy (TIGR Assembler, phrap, CAP3...)
  - graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap: Greedy
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
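The greedy loop above can be sketched in a few lines. This is a toy illustration, not how TIGR Assembler or phrap is implemented: it scores an overlap by its length, allows only exact suffix-prefix matches, and ignores reverse complements and contained reads. The names `overlap_len`, `greedy_assemble`, and `min_len` are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags                 # no more merges can be done
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags)
                 if k not in (best_i, best_j)] + [merged]

contigs = greedy_assemble(["ACGTTGCA", "TGCAAGGC", "AGGCTTAC"])
print(contigs)
```

Because each merge is chosen locally, a repeat longer than the overlap can cause the wrong pair to be merged first; that weakness is what the graph-based approaches try to avoid.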

Overlap-layout-consensus
• Main entity: read
• Relationship between reads: overlap
[Figure: overlap graph and layout of reads 1-9; consensus example over the aligned sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure label: Genome]

Implementation details

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases), overhang (6 bases), % identity = 18/19 = 94.7%
• overlap - region of similarity between sequences
• overhang - unaligned ends of the sequences
• The assembler screens merges based on:
  - length of overlap
  - % identity in overlap region
  - maximum overhang size
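The numbers on this slide can be reproduced from the two sequences. The sketch below assumes the overlap coordinates are already known (an ungapped overlap starting at offset 14 in the first sequence and offset 8 in the second); the function names and the `max_overhang` value are hypothetical, while the 40 bp and 6% mismatch thresholds are the ones quoted for the Overlapper stage.

```python
def overlap_stats(a, b, a_start, b_start, length):
    """Identity and overhangs for an ungapped overlap of the given length."""
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    return {
        "length": length,
        "identity": matches / length,
        "overhang_a": len(a) - (a_start + length),  # unaligned tail of a
        "overhang_b": b_start,                      # unaligned head of b
    }

def passes_screen(stats, min_len, min_identity, max_overhang):
    """Apply the three screening criteria listed on the slide."""
    return (stats["length"] >= min_len
            and stats["identity"] >= min_identity
            and max(stats["overhang_a"], stats["overhang_b"]) <= max_overhang)

a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
stats = overlap_stats(a, b, a_start=14, b_start=8, length=19)
print(stats)
# max_overhang=10 is an arbitrary illustrative cutoff.
print(passes_screen(stats, min_len=40, min_identity=0.94, max_overhang=10))
```

With a 40 bp minimum this 19-base overlap would be rejected on length alone, even though its identity is above 94%.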

All pairs alignment
• Needed by the assembler
• Try all pairs - must consider ~n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  - Build a table of k-mers contained in sequences (single pass through the genome)
  - Generate the pairs from k-mer table (single pass through k-mer table)
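A minimal sketch of the k-mer table idea follows. It only proposes candidate pairs that share at least one exact k-mer; a real overlapper would then align each candidate and would also index reverse complements. The function name `candidate_pairs` and the tiny `k=4` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=4):
    """Return pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer.
    for idx, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].add(idx)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ACGTTGCA", "TGCAAGGC", "AGGCTTAC", "CCCCGGGG"]
print(sorted(candidate_pairs(reads)))
```

Only pairs that actually share sequence are generated, so the work scales with the number of true (or repeat-induced) overlaps rather than with all n^2 pairs. Very frequent k-mers are usually skipped, since they come from repeats and would bring the quadratic cost back.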

Assembly Pipeline
Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
• Find all overlaps ≥ 40bp allowing 6% mismatch.
• An overlap between A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap.

Assembly Pipeline
Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
• Compute all overlap-consistent sub-assemblies: Unitigs (Uniquely Assembled Contigs)

OVERLAP GRAPH
• Edge types between fragments A and B: Regular Dovetail, Prefix Dovetail, Suffix Dovetail
• Edges are annotated with deltas of overlaps
[Figure: example overlap graph]

The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps:
[Figure: fragments A, B, C; the overlap A-C is dropped because the overlaps A-B and B-C already imply it]

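The Unitig Reduction
2. Collapse “Unique Connector” Overlaps:
[Figure: fragments A and B joined by a unique connector overlap are collapsed into one node; the figure carries the labels 412, 352 and 45]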
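The two reduction steps can be sketched on a plain directed graph of overlaps. This is a simplified illustration that assumes an acyclic graph and ignores the overlap deltas the real edges carry, so it is not the Celera Assembler's algorithm; `remove_transitive` and `unitig_chains` are hypothetical names.

```python
def remove_transitive(edges):
    """Step 1: drop A->C whenever some B gives both A->B and B->C."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    reduced = set()
    for a, c in edges:
        inferable = any(c in succ.get(b, ()) for b in succ[a] if b != c)
        if not inferable:
            reduced.add((a, c))
    return reduced

def unitig_chains(edges):
    """Step 2: collapse edges A->B where A has one successor and B one predecessor."""
    succ, pred, nodes = {}, {}, set()
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
        pred.setdefault(b, set()).add(a)
        nodes.update((a, b))

    def unique_next(a):
        nxt = succ.get(a, set())
        if len(nxt) == 1:
            (b,) = nxt
            if len(pred[b]) == 1:
                return b
        return None

    def has_unique_prev(b):
        prv = pred.get(b, set())
        if len(prv) == 1:
            (a,) = prv
            return len(succ[a]) == 1
        return False

    chains = []
    for start in sorted(nodes):
        if has_unique_prev(start):
            continue                    # interior of some chain, not a start
        chain = [start]
        while (nxt := unique_next(chain[-1])) is not None:
            chain.append(nxt)
        chains.append(chain)
    return chains

overlaps = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("C", "E")}
reduced = remove_transitive(overlaps)
print(sorted(reduced))
print(unitig_chains(reduced))
```

In this toy graph the chain A, B, C becomes one unitig, and it ends at C because C has two possible successors, which is the kind of branching a repeat boundary produces.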

Identifying Unique DNA Stretches
• Read arrival intervals differ between a unique DNA unitig and a repetitive DNA unitig.
• Discriminator Statistic is log-odds ratio of probability unitig is unique DNA versus 2-copy DNA.
[Figure: distributions of the statistic for unique and for repetitive unitigs on an axis from -10 to +10, with regions labeled Definitely Repetitive, Don’t Know, Definitely Unique]
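One way to make the log-odds idea concrete is to model read arrivals in a unitig as a Poisson process. If a unique unitig of length ρ is expected to contain λ = ρN/G reads, a collapsed 2-copy unitig is expected to contain 2λ, and the natural-log odds for observing k reads simplify to λ – k ln 2. The sketch below is that derivation only, under the Poisson assumption; the function name and the example numbers are illustrative, and the slides do not give the exact scaling used for the ±10 axis.

```python
import math

def discriminator(unitig_len, n_reads, total_reads, genome_len):
    """ln P(k reads | unique) - ln P(k reads | 2-copy), Poisson arrivals."""
    lam = unitig_len * total_reads / genome_len  # expected reads if unique
    return lam - n_reads * math.log(2)

# Illustrative numbers: 1 Mbp genome, 13,333 reads (about 8X with 600 bp reads),
# and a 10 Kbp unitig, which should hold about 133 reads if it is unique.
print(discriminator(10_000, 133, 13_333, 1_000_000))   # positive: looks unique
print(discriminator(10_000, 266, 13_333, 1_000_000))   # negative: looks repetitive
```

Short unitigs give values near zero either way, which corresponds to the "Don’t Know" region: there are too few arrivals to decide.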

Assembly Pipeline
Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
• Scaffold U-unitigs with confirmed pairs (mated reads)

Assembly Pipeline
Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
• Fill repeat gaps with doubly anchored positive unitigs (Unitig>0)

REPEATS

Handling repeats
• Repeat detection
  - pre-assembly: find fragments that belong to repeats
    - statistically (most existing assemblers)
    - repeat database (RepeatMasker)
  - during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  - post-assembly: find repetitive regions and potential mis-assemblies
    - Reputer, RepeatMasker
    - "unhappy" mate-pairs (too close, too far, mis-oriented)
• Repeat resolution
  - find DNA fragments belonging to the repeat
  - determine correct tiling across the repeat
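The "unhappy" mate-pair check can be sketched as a simple classifier. It assumes both mates are placed in the same contig, that a correctly placed pair has the left read forward and the right read reversed, and that "too close" or "too far" means outside mean ± 3 std. dev. of the library; the 3 std. dev. cutoff, the function name, and the argument names are assumptions.

```python
def classify_mate_pair(left_start, left_forward, right_end, right_forward,
                       mean, sd, n_sd=3.0):
    """Label a placed mate pair as happy, mis-oriented, too close, or too far."""
    # Expected orientation: left read forward, right read reverse (facing in).
    if not left_forward or right_forward:
        return "mis-oriented"
    span = right_end - left_start       # implied clone length in the assembly
    if span < mean - n_sd * sd:
        return "too close"
    if span > mean + n_sd * sd:
        return "too far"
    return "happy"

# Illustrative 2Kbp library with a 120 bp std. dev.
print(classify_mate_pair(1000, True, 3050, False, mean=2000, sd=120))
print(classify_mate_pair(1000, True, 1900, False, mean=2000, sd=120))
print(classify_mate_pair(1000, True, 6000, False, mean=2000, sd=120))
print(classify_mate_pair(1000, True, 3000, True, mean=2000, sd=120))
```

A single unhappy pair may just be a chimeric clone; clusters of them at one location are the stronger sign of a mis-assembly.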

Statistical repeat detection
Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)
Problem 1: assumption of uniform distribution of fragments - leads to false positives
- non-random libraries
- poor clonability regions
Problem 2: repeats with low copy number are missed - leads to false negatives
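The arrival-rate arithmetic is read length divided by coverage: 800 / 8 = 100 bp between read starts. A rough sketch of flagging contigs whose read count is well above that expectation follows; the 1.5x cutoff, the function names, and the contig data are made up for illustration.

```python
def expected_spacing(read_len, coverage):
    """Average distance between read starts under uniform sampling."""
    return read_len / coverage

def flag_repeats(contigs, read_len, coverage, ratio_cutoff=1.5):
    """Return names of contigs holding far more reads than expected."""
    spacing = expected_spacing(read_len, coverage)
    flagged = []
    for name, length, n_reads in contigs:
        expected_reads = length / spacing
        if n_reads / expected_reads >= ratio_cutoff:
            flagged.append(name)
    return flagged

print(expected_spacing(800, 8))
# (name, contig length in bp, number of reads): hypothetical values
contigs = [("c1", 10_000, 98), ("c2", 5_000, 110), ("c3", 20_000, 210)]
print(flag_repeats(contigs, read_len=800, coverage=8))
```

A fixed ratio like this shows both problems from the slide: a region that clones unusually well can exceed the cutoff without being a repeat, and a 2-copy repeat in a short contig may not exceed it at all.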

Mis-assembled repeats
• collapsed tandem
• excision
• rearrangement
