Genome De Novo Assemblies and Applications in NGS Sequencing

Zemin Ning, The Wellcome Trust Sanger Institute

Outline of the Talk:
• My academic background
• Challenges in genome assemblies from pure Illumina reads
• The Phusion2 pipeline
• The Tasmanian devil genome project
• The Devil genome assembly
• Other assemblies: human, bamboo, miscanthus, etc.

Powder Simulation

Hair Dynamics / Genetics and Human Hair Structure (figure panels: East Asian, Caucasian, African)

Informatics Projects Involved
• SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
  • ssaha2 – alignment tool for Solexa, 454 and ABI capillary reads
  • ssahaSNP – SNP/indel detection, mainly for ABI capillary reads
  • ssahaEST – EST or cDNA alignment
  • ssaha_SV – structural variation (CNVs) detection
  • ssaha_pileup – SNP/indel detection from next-gen data
• Phusion & Phusion2
  • Development and maintenance of the pipeline
  • Production of WGS assemblies: mouse, zebrafish, human (Venter genome), C. briggsae, rice, Schisto, sea lamprey, gorilla, malaria and many bacterial genomes
• TraceSearch
  • Public sequence search facility for all the traces
• Fuzzypath
  • Short read assembler

Challenges in Whole Genome Assembly using Pure Illumina Reads
• Short read length: 2x36; 2x54; 2x75; 2x100
• Large genome and huge datasets
  • For human: 100 Gb at 30x
• Repetitive/duplication structures: Alus, LINEs, SVAs
  • 30-40% of the genome in human and mouse; 50-60% in rice and other plant genomes
• Tandem repeats: how many copies are there?
  • TATATATATATATATATATATATATATA
  • GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
  • GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
  • AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT

De Bruijn vs Read Overlap (figure labels: missing sequences; missing from de Bruijn contigs)

Phusion2 Assembly Pipeline (flow diagram; box labels in extraction order: Assembly, Data Process, Solexa Reads, Supercontig, Long Insert Reads, PRono, Fuzzypath, Contigs, Reads Group, 2x75 or 2x100, Base Correction, Velvet, Phrap, RP_Assemble)

Kmer Word Hashing

Contiguous Base Hash, K = 12
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
 TGGCGTGCAGTC
  GGCGTGCAGTCC
   GCGTGCAGTCCA
    CGTGCAGTCCAT

Gap-Hash 4x3
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG
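The two hashing schemes on this slide can be reproduced with a few lines of Python. This is only a sketch: the function names and the 2-bit base encoding (A=0, C=1, G=2, T=3) are assumptions for illustration, not the Phusion2 source. The gap-hash layout (three blocks of 4 bases, each separated by 3 skipped bases) is inferred from the example words on the slide.

```python
# Sketch only: names and the 2-bit encoding are illustrative assumptions.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmers(seq, k):
    """Return every contiguous k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_kmers(seq, block=4, gap=3, n_blocks=3):
    """Return gapped words: n_blocks blocks of `block` bases,
    with `gap` skipped bases between consecutive blocks."""
    step = block + gap
    span = n_blocks * block + (n_blocks - 1) * gap
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + b * step: start + b * step + block]
                 for b in range(n_blocks)]
        words.append("".join(parts))
    return words

def encode(word):
    """Pack a word into an integer, 2 bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | BASE_CODE[base]
    return value

read = "ATGGCGTGCAGTCCATGTTCGGATCA"
contiguous = kmers(read, 12)
gapped = gapped_kmers(read)
hashes = [encode(w) for w in contiguous]
```

The first entries of `contiguous` and `gapped` match the words listed on the slide; `hashes` holds the integer form that a hash table or sorted array would store.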

Word use distribution for the mouse sequence data at ~7.5 fold (plot: real data curve vs Poisson curve, with the useful region marked)

Sorted List of Each k-Mer and Its Read Indices

k-mer         Read
ACAGAAAAGC    10h06.p1c
ACAGAAAAGC    12a04.q1c
ACAGAAAAGC    13d01.p1c
ACAGAAAAGC    16d01.p1c
ACAGAAAAGC    26g04.p1c
ACAGAAAAGC    33h02.q1c
ACAGAAAAGC    37g12.p1c
ACAGAAAAGC    40d06.p1c
ACAGAAAAGG    16a02.p1c
ACAGAAAAGG    20a10.p1c
ACAGAAAAGG    22a03.p1c
ACAGAAAAGG    26e12.q1c
ACAGAAAAGG    30e12.q1c
ACAGAAAAGG    47a01.p1c

(figure labels: High bits, Low bits; field widths 2k and 64 - 2k)
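One way to obtain such a sorted list is to pack each k-mer and its read index into a single 64-bit word and sort the words as plain integers. The sketch below assumes the k-mer sits in the high 2k bits and the read index in the low 64 - 2k bits; that layout, the function names, and the integer read indices are assumptions made for illustration.

```python
# Sketch only: bit layout, names and read indices are illustrative assumptions.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value

def pack(kmer, read_index):
    """Place the k-mer in the high 2k bits and the read index in the low bits."""
    low_bits = 64 - 2 * len(kmer)
    if read_index >= (1 << low_bits):
        raise ValueError("read index does not fit in the low bits")
    return (encode(kmer) << low_bits) | read_index

def unpack(word, k):
    """Split a packed word back into (k-mer code, read index)."""
    low_bits = 64 - 2 * k
    return word >> low_bits, word & ((1 << low_bits) - 1)

# Hypothetical (k-mer, read index) pairs in arbitrary input order.
entries = [("ACAGAAAAGG", 8), ("ACAGAAAAGC", 0),
           ("ACAGAAAAGG", 9), ("ACAGAAAAGC", 1)]
packed = sorted(pack(kmer, idx) for kmer, idx in entries)
decoded = [unpack(word, 10) for word in packed]
```

Because the k-mer occupies the most significant bits, an ordinary integer sort brings all reads sharing a k-mer next to each other, which is the layout shown in the table.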

Relation Matrix: R(i,j) – number of kmer words shared between read i and read j

i \ j    1    2    3    4    5    6    …    N
1        -   41    0    0    0    0
2       41    -   37    0    0    0
3        0   37    -    0   22    0
4        0    0    0    -    0   27
5        0    0   22    0    -    0
6        0    0    0   27    0    -
…
N

Group 1: (1,2,3,5)
Group 2: (4,6)
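The grouping on this slide amounts to finding connected components in the graph whose edges are the non-zero entries of R(i,j). A minimal union-find sketch follows; the function name and the `min_shared` threshold are hypothetical, and the slide does not state the cut-off Phusion2 actually uses.

```python
# Sketch only: function name and min_shared threshold are illustrative assumptions.
def group_reads(shared, n_reads, min_shared=1):
    """Cluster reads 1..n_reads that are linked by at least min_shared k-mers.

    shared maps a read pair (i, j) to R(i,j), the number of shared k-mer words.
    """
    parent = list(range(n_reads + 1))  # index 0 unused; reads are 1-based

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for read in range(1, n_reads + 1):
        groups.setdefault(find(read), []).append(read)
    return sorted(groups.values())

# Non-zero entries of the relation matrix on the slide.
shared = {(1, 2): 41, (2, 3): 37, (3, 5): 22, (4, 6): 27}
groups = group_reads(shared, 6)
strict_groups = group_reads(shared, 6, min_shared=30)
```

With the default threshold the result is the two groups on the slide, (1,2,3,5) and (4,6). Raising `min_shared` splits weakly linked reads off, which is one way to keep repeat-induced links from merging unrelated regions.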

Paired Reads Separated by “NN”

Error Bases Correction

Mis-assembly errors: Contig Breaking

Read Pair Guided Local Assembler: track read pairs to walk through repetitive regions (figure labels: A1, A2)

Tasmanian devil (images: Tasmanian devil, Wallaby, Opossum)

Tasmanian devil facial tumour disease (DFTD)
• Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
• Transmitted by biting
• Commonly metastasises
• First observed in 1996
• Primarily affects adults >1yr
• Death in 4 – 6 months

DFTD samples (map)
• Map annotations: Area still DFTD free; DFTD originated here c.1996
• Site labels: Narawntapu; Mt William (2); Upper Natone; Frankford; Wisedale (?); Railton; St Mary’s (2); West Pencil Pine (3); Reedy Marsh; Trowunna (2); Bronte Park; Coles Bay; Tarraleah; Kempton (2); Mangalore; Fentonbury (no host); Nugent (2); Forestier (33)
• Other labels: 2006, 2007, 2008; 4, 14, 13

DFTD samples for sequencing (map)
• Map annotations: Area still DFTD free; DFTD originated here c.1996
• Site labels: Narawntapu 2007; Mt William 2007 or 2008; Upper Natone 2007; Reedy Marsh 2007; Coles Bay; Mangalore 2007; Forestier 2007
• Legend: Strain 1, tetraploid; Strain 2; Strain 3; “Evolved”; Unknown strain

Sequencing T. Devil on Illumina: Strategy
• Tumour or normal genomic DNA
• Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb
• Sequencing: 2x100bp reads (short insert); 2x50bp mate pairs
• Sequencing performed at Illumina
• Alignment using bwa, ssaha2; de novo assembly
• Somatic mutations; germline variants

Genome Assembly – T. Devil

Solexa reads:
Number of read pairs: 528 Million
Finished genome size: 3.5 Gb
Read length: 2x100bp
Estimated read coverage: ~30X
Insert size: 410/50-600 bp
Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
Number of reads clustered: 458 Million

Assembly features - stats:
                              Contigs      Supercontigs
Total number of contigs:      1,246,970    792,099
Total bases of contigs:       3.22 Gb      3.62 Gb
N50 contig size:              9,642        434,642
Largest contig:               96,919       4,150,712
Averaged contig size:         2,578        4,564
Contig coverage on genome:    ~92%         >99%
Ratio of placed PE reads:     ~92%         ?
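The estimated read coverage follows from the other figures on the slide: total sequenced bases divided by genome size. A small sketch of that arithmetic is given here; the function name is hypothetical and the calculation ignores read trimming and filtering.

```python
# Sketch only: a back-of-envelope coverage estimate from the slide's numbers.
def read_coverage(read_pairs, read_length, genome_size):
    """Fold coverage = (pairs x 2 reads per pair x bases per read) / genome size."""
    return read_pairs * 2 * read_length / genome_size

# Tasmanian devil: 528 million pairs of 2x100bp reads, 3.5 Gb genome.
devil_coverage = read_coverage(528_000_000, 100, 3_500_000_000)
```

For the devil data this gives about 30.2, in line with the ~30X quoted on the slide.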

Figure labels: Brown Bear; Dog; Macropus eugenii (Wallaby); Monodelphis domestica (Opossum); Sminthopsis macroura (Dunnart)

Pipeline of Contig Gap Closure

Human Assembly - Yoruba NA18507

Solexa reads:
Number of read pairs: 560 Million
Finished genome size: 3.0 Gb
Read length: 2x100bp
Estimated read coverage: ~37X
Insert size: 500/50-700 bp
Number of reads clustered: 499 Million

Assembly features - contig stats:
Total number of contigs: 1,142,077
Total bases of contigs: 2.92 Gb
N50 contig size: 12,875
Largest contig: 140,463
Averaged contig size: 2,561
Contig coverage over the genome: ~94%
Mis-assembly errors: ?
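The N50 contig size quoted in these tables is the length L such that contigs of length L or longer hold at least half of all assembled bases. A minimal sketch of the calculation follows; the function name and the toy contig lengths are made up for illustration.

```python
# Sketch only: standard N50 definition applied to a toy list of contig lengths.
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig")

# Toy example: 32 bases in total, so the two 8-base contigs reach the halfway mark.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

N50 is much less sensitive to the large number of very short contigs than the averaged contig size, which is why the two statistics differ so widely in the tables.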

Bamboo Genome Assembly (Tetraploid)

Solexa reads:
Number of read pairs: 359 Million
Finished genome size: 2.0 Gb
Read length: 2x120bp
Estimated read coverage: ~43X
Insert size: 500/50-700 bp
Number of reads clustered: 316 Million

Assembly features - contig stats:
Total number of contigs: 733,465
Total bases of contigs: 1.91 Gb
N50 contig size: 8,163
Largest contig: 117,250
Averaged contig size: 2,592
Contig coverage over the genome: ~92%
Mis-assembly errors: ?

Genome Assembly – Miscanthus

Solexa reads:
Number of read pairs: 502 Million
Finished genome size: 2.0 Gb
Read length: 2x76bp
Estimated read coverage: ~35X
Insert size: 410/50-600 bp
Mate pair data: 5Kb
Number of reads clustered: 438 Million

Assembly features - stats:
                              Contigs      Supercontigs
Total number of contigs:      2,241,465    2,090,385
Total bases of contigs:       1.64 Gb      1.92 Gb
N50 contig size:              4,301        29,076
Largest contig:               71,161       730,290
Averaged contig size:         732          919
Contig coverage on genome:    ~85%         >95%
Ratio of placed PE reads:     ~82%         ?

Melanoma cell line COLO-829 (Paul Edwards, Departments of Pathology and Oncology, University of Cambridge)

Plots of INDELs/SVs size distribution for all events detected by Pindel at single-base resolution. Left, insertions from 1bp to 60 bp. Right, deletions from 1bp to 1Mb.

Homozygous/Heterozygous Indels (figure labels: Insertion, Deletion, Ref, H1, B, B1, B2)
(a) Insertion: solid lines – reads whose alignment terminates at the breakpoint; dashed line – reads whose alignment crosses the breakpoint.
(b) Deletion: solid line – read whose alignment terminates at the breakpoint; dashed lines – reads whose alignment crosses the breakpoint.

Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.

Acknowledgements:
• Elizabeth Murchison
• Erin Pleasance
• Mike Stratton
• Kai Ye
• Dirk Evers
• Ole Schulz-Trieglaff
• Qi Feng
• Bin Han
