WGS Assembly and Reads Clustering

WGS Assembly and Reads Clustering Zemin Ning Production Software Group Informatics Division

Outline of the Talk: • Whole Genome Shotgun Sequencing • Insert Sizes • Repeats in the Genomes • Kmer Words Hashing and Distribution • Relational Matrix • Profile of unique kmer words • Phusion Steps • How to run Phusion – parameter selections

WGS Sequencing: The WGS method begins by fragmenting the genome into many pieces of various sizes. This fragmentation can be done in several ways, including physically shaking the DNA and cutting it with restriction enzymes. Depending on the size of the resulting fragment, various hosts are used to clone these regions. Clone-by-Clone Sequencing • ADV. Easy assembly • DIS. Build library & physical map; redundant sequencing Whole Genome Shotgun (WGS) • ADV. No mapping, no redundant sequencing • DIS. Difficult to assemble and resolve repeats

cut many times at random Whole Genome Shotgun Sequencing genome plasmids (2 – 10 Kbp) forward-reverse paired reads known dist cosmids (40 Kbp) ~500 bp ~500 bp

Automatic Sequencing

Base Calling - Phred Idealized traces would consist of evenly spaced, nonoverlapping peaks. Real traces deviate from this ideal due to imper- fections of the sequencing reactions, of gel electro-phoresis, and of trace processing. The first 50 or so peaks and peaks over 500 or so are particularly noisy. Quality: high – no ambiguities medium – some ambiguities Poor – low confidence

Historical Context 1995: H.influenzae sequenced using TIGR by Craig Venter. H. influenzae is the first free living organism to be sequnced. It has roughly 2 million base pairs. The sequencing used a shotgun method that assembled 25,000 fragments of 500 bp each. 1997: Whole Genome Shotgun paper written by Weber & Meyers. This is the first time that a shotgun method has been suggested for sequencing the human genome. By this time, the public Human Genome Project has already started using a clone-by-clone method. 1997: Phil Green writes review against WGS. 1998: Celera founded. Celera entered into a competition with the public Human Genome Project to sequence the human genome first. Celera’s main advantage was using the Whole Genome Shotgun method, which had a chance of failing, but if successful would produce faster results.

1999: Fly genome (180Mbs) sequenced by Celera using the Celera assembler. The genome is available by subscription to Celera’s database 2001: Human Genome published. The genome was sequenced using data from the public Human Genome Project and Celera. The public effort used the clone-by-clone method, while Celera used the Whole Genome Shotgun method. Celera gives access to the genome through subscription to the database. The results from the public project are free to access. 2001: Mouse Genome sequenced by Celera using the Whole Genome Shotgun method. It is made available by Celera on a subscription basis. 2002: The Mouse Genome published. Whitehead’s ARACHNE and Sanger’s Phusion were involved.

Whole Genome Assemblers TIGR Assembler G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995) PHRAP P. Green (1996) Celera Assembler CAP3 X. Huang, A. Madan, Genome Res 9, 868-877 (1999) RePS J. Wang et al. Genome Res 12, 824-831 (2002) Phusion (Sanger) J.C. Mullikin, Z. Ning, Genome Res 13, 81-90 (2003) Arachne (Whitehead/MIT) Euler (UCSD, USC) P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001) most assemblers follow the same approach: overlap – layout - consensus

Unique Section Repetitive Section Depth Depth Unique and Repetitive DNA Sections A X’ B X’’ C

Repetitive Contig and Read Pairs Depth Depth Depth Grouped Reads by Phusion

ATGGCGTGCAGTCCATGTTCGGATCA ATGGCGTGCAGT TGGCGTGCAGTC GGCGTGCAGTCC GCGTGCAGTCCA CGTGCAGTCCAT ATGGCGTGCAGTCCATGTTCGGATCA ATGGGCAGATGT TGGCCAGTTGTT GGCGAGTCGTTC GCGTGTCCTTCG Kmer Word Hashing Contiguous Base Hash K = 12 Gap-Hash 4x3

Useful Region Real Data Curve Poisson Curve Word use distribution for the mouse sequence data at ~7.5 fold

Sorted List of Each k-Mer and Its Read Indices ACAGAAAAGC 10h06.p1c High bits Low bits ACAGAAAAGC 12a04.q1c ACAGAAAAGC 13d01.p1c ACAGAAAAGC 16d01.p1c ACAGAAAAGC 26g04.p1c ACAGAAAAGC 33h02.q1c ACAGAAAAGC 37g12.p1c ACAGAAAAGC 40d06.p1c ACAGAAAAGG 16a02.p1c ACAGAAAAGG 20a10.p1c ACAGAAAAGG 22a03.p1c ACAGAAAAGG 26e12.q1c ACAGAAAAGG 30e12.q1c ACAGAAAAGG 47a01.p1c 64 -2k 2k

Relation Matrix: R(i,j) – number of kmer words shared between read i and read j 1 2 3 4 5 6 … j … N 227 0 0 0 0 1 2 227 187 0 0 0 3 0 187 0 170 0 4 0 0 0 0 213 Group 2: (4,6) 5 0 0 170 0 0 6 0 0 0 213 0 i R(i,j) Group 1: (1,2,3,5) N

Relation Matrix: R(i,j) – Implementation 1 2 3 4 5 6 … j … 500 1 2 3 4 Number of shared kmer words (< 63) 5 . . . Read index R(i,j) N

Phusion Iterations – Cutting The Weakest Link

This graph shows the effect of k-mer on relative contig N50 size for C. briggsae assemblies. At k = 15, 4 ^ 15 is about 10 times the genome size.

Profile of Unique kmer Words Non-unique sequence Unique sequence P=Kmer P=P2 P=P1 TCGGATCATCCGTTAACGT ATGGCGTGCAGTCCATT Quality values are reset over the read

Phusion Steps • Hashing the kmer words; • Calculate kmer words distribution; • Get the list of kmer words – only use those occur 2-D times; • Combine the kmer words with read index; • Sort the combined list; • Build up relational matrix; • Group the reads; • Output.

Phusion command line for Zfish ./phusion –kmer 18 –depth 13 [-fill 6] [-gap 5] -match 6 -match2 6 -matrix 500 -break 1 -set 12000 mates fasta/fastq files

Phusion2 ?

Acknowledgements: • Jim Mullkin • Richard Durbin • David Jaffe – Broad Institute

WGS Assembly and Reads Clustering

WGS Assembly and Reads Clustering

Presentation Transcript

Hybrid error correction and de novo assembly of single-molecule sequencing reads

Groundbreaking Reads

Souhegan Reads!

Good Reads

BOLAHUN READS

Good Reads

IDAHO READS!

Top Reads and More…

Raw reads

Winter Reads

DATUM : WGS 84

Arlington Reads

Linebacker Reads

WGS Relationship Marketing

Raw Reads

Assembly of Solexa tomato reads

Quick Reads

America Reads

Genetics_Eng_7.1.6_Genetic_Testing_and_Screening_Whole_Genome_Sequencing_(WGS)

Lancashire Reads

Human Chromosome Y Sequence Assembly using Oxford Nanopore Reads

Lancashire Reads