Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities
Zemin Ning
The Wellcome Trust Sanger Institute
Outline of the Talk:
• Euler Path and Sequence Reconstruction
• Euler Hash Table
• Read Extension
• Using Base Qualities and Read Pairs
• Repeat Junctions and Single Base Variation
• Assembly Results
• Future Work
Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
• Vertices: k-tuples from the spectrum, shown in red (8);
• Edges: overlapping k-tuples (7);
• Path: visits all vertices, corresponding to the sequence.
Sequence Reconstruction - Euler path approach
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA
• Vertices: correspond to (k-1)-tuples (7): AT, TG, GG, GC, CG, GT, CA;
• Edges: correspond to k-tuples from the spectrum (8);
• Path: visits all EDGES, corresponding to the sequence.
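The Euler path idea on this slide can be sketched in a few lines. This is a minimal illustration using Hierholzer's algorithm, which the slide does not name; the function name `euler_reconstruct` is hypothetical and the code is not the assembler's own implementation. It assumes the spectrum admits an Euler path.

```python
from collections import defaultdict

def euler_reconstruct(kmers):
    """Rebuild a sequence by walking every k-tuple (edge) exactly once."""
    graph = defaultdict(list)    # (k-1)-tuple -> successor (k-1)-tuples
    balance = defaultdict(int)   # out-degree minus in-degree per vertex
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # An Euler path starts at the vertex with one extra outgoing edge;
    # fall back to the first prefix if the graph is balanced (a cycle).
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])
    # Hierholzer's algorithm, iterative form.
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell the sequence: first vertex, then the last base of each later one.
    return path[0] + "".join(v[-1] for v in path[1:])

spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequence = euler_reconstruct(spectrum)
```

Because the spectrum has two valid Euler paths, the walk returns one of the two sequences listed on the slide; which one depends on the order in which edges are taken.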
SSAHA Type Hash Table
S1 = (ATGCAGGTCC), S2 = (ATCAGGTCC), S3 = (ATGCTAGGTCC), S4 = (ATGCAGTTCC)
Each entry is: read index, offset, E of the next k-tuple (-1 at the end of a read).

E    k-tuple   Indices, offsets and links to the next
7    ATG       1,1,28   3,1,28   4,1,28
8    ATC       2,1,29
10   AGT       4,5,38
11   AGG       1,5,42   2,4,42   3,6,42
19   TAG       3,5,11
24   TTC       4,7,32
28   TGC       1,2,45   3,2,46   4,2,45
29   TCA       2,2,51
32   TCC       1,8,-1   2,7,-1   3,9,-1   4,8,-1
38   GTT       4,6,24
40   GTC       1,7,32   2,6,32   3,8,32
42   GGT       1,6,40   2,5,40   3,7,40
45   GCA       1,3,51   4,3,51
46   GCT       3,3,53
51   CAG       1,4,11   2,3,11   4,4,10
53   CTA       3,4,19
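A table of this shape can be built with a short script. The sketch below is an illustration, not the talk's implementation: the names `encode` and `build_table` are hypothetical, and the base order (A=0, T=1, G=2, C=3, base 4, plus one) is an assumption inferred from the E column, for example ATG = 0*16 + 1*4 + 2 + 1 = 7.

```python
from collections import defaultdict

# Assumed base order, inferred from the E values in the table.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}

def encode(kmer):
    """Map a k-tuple to its integer key E (base 4, shifted by one)."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value + 1

def build_table(reads, k=3):
    """Return {E: [(read index, offset, E of next k-tuple or -1), ...]}."""
    table = defaultdict(list)
    for index, read in enumerate(reads, start=1):
        keys = [encode(read[i:i + k]) for i in range(len(read) - k + 1)]
        for offset, key in enumerate(keys, start=1):
            # keys is 0-based, so keys[offset] is the k-tuple after this one
            link = keys[offset] if offset < len(keys) else -1
            table[key].append((index, offset, link))
    return dict(table)

reads = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
table = build_table(reads)
```

With the four example reads, `table[7]` holds the three ATG entries and `table[32]` holds the four read-terminating TCC entries with link -1.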
Point to the Next - Hash Table Links
S1 = (ATGCAGGTCC), S2 = (ATCAGGTCC), S3 = (ATGCTAGGTCC), S4 = (ATGCAGTTCC)
The third field of each entry is the E value of the next k-tuple in the same read, e.g. ATG (1,1,28) points to TGC (E = 28); -1 marks the end of a read.

E    k-tuple   Indices, offsets and links to the next
7    ATG       1,1,28   3,1,28   4,1,28
8    ATC       2,1,29
10   AGT       4,5,38
11   AGG       1,5,42   2,4,42   3,6,42
19   TAG       3,5,11
24   TTC       4,7,32
28   TGC       1,2,45   3,2,46   4,2,45
29   TCA       2,2,51
32   TCC       1,8,-1   2,7,-1   3,9,-1   4,8,-1
38   GTT       4,6,24
40   GTC       1,7,32   2,6,32   3,8,32
42   GGT       1,6,40   2,5,40   3,7,40
45   GCA       1,3,51   4,3,51
46   GCT       3,3,53
51   CAG       1,4,11   2,3,11   4,4,10
53   CTA       3,4,19
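To show how the links are used, the sketch below chases them for a single read. Only the table rows touched by reads 1 and 4 are copied in; the function name `follow_links` and the dictionary layout are hypothetical choices for this illustration.

```python
# Rows for reads 1 and 4 only: E -> (k-tuple, [(read, offset, link), ...])
TABLE = {
    7:  ("ATG", [(1, 1, 28), (4, 1, 28)]),
    10: ("AGT", [(4, 5, 38)]),
    11: ("AGG", [(1, 5, 42)]),
    24: ("TTC", [(4, 7, 32)]),
    28: ("TGC", [(1, 2, 45), (4, 2, 45)]),
    32: ("TCC", [(1, 8, -1), (4, 8, -1)]),
    38: ("GTT", [(4, 6, 24)]),
    40: ("GTC", [(1, 7, 32)]),
    42: ("GGT", [(1, 6, 40)]),
    45: ("GCA", [(1, 3, 51), (4, 3, 51)]),
    51: ("CAG", [(1, 4, 11), (4, 4, 10)]),
}

def follow_links(table, read_index, start_key):
    """Spell out one read by chasing its links until the -1 terminator."""
    sequence = table[start_key][0]
    key, offset = start_key, 1
    while True:
        entries = table[key][1]
        # Match on read index AND offset so repeated k-tuples stay unambiguous.
        link = next(l for r, o, l in entries if (r, o) == (read_index, offset))
        if link == -1:
            return sequence
        sequence += table[link][0][-1]   # each step adds one new base
        key, offset = link, offset + 1

s1 = follow_links(TABLE, 1, 7)
s4 = follow_links(TABLE, 4, 7)
```

Reads 1 and 4 share the path ATG, TGC, GCA, CAG and then diverge at CAG, where the links point to AGG (11) and AGT (10) respectively.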
Repeat Graph
[Figure labels: Sequence, Repeat (three copies), reads]
Assembly Strategy
• Genome/Chromosome
• Forward-reverse paired reads: 30-40 bp each, known distance ~500 bp
• Extend Solexa reads to long reads of 1-2 kb
• Capillary reads assembler: Phrap/Phusion
Repetitive Contig and Read Pairs
[Figure labels: Depth, Contig start, Pair read position, Current read position, Insert length]
For each hit read in the contig, contig index and offset are stored.
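One way the stored contig index and offset could be used is to test whether a read's mate already sits about one insert length upstream in the same contig. The sketch below is an assumption about that check, not the talk's code: `pair_supports`, the `placements` dictionary, the read identifiers and the 20 % tolerance are all hypothetical.

```python
def pair_supports(placements, mate_id, contig, current_offset,
                  insert_length=500, tolerance=0.2):
    """True if the mate was placed in this contig roughly one insert
    length before the current read position."""
    if mate_id not in placements:
        return False
    mate_contig, mate_offset = placements[mate_id]
    if mate_contig != contig:
        return False
    distance = current_offset - mate_offset
    return abs(distance - insert_length) <= tolerance * insert_length

# read id -> (contig index, offset), stored when the read was hit
placements = {"pair7/1": (0, 100)}
supported = pair_supports(placements, "pair7/1", contig=0, current_offset=590)
```

In a repetitive region, a read whose mate is missing or lies at the wrong distance would then count as weaker evidence for the extension than one whose mate fits.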
Handling of Repeat Junctions
A = A1 + A2
B = B1 + B2
Handling of Single Base Variations
Paths: A + B1 and A + B2
B1 = B2
S = A + B1
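The rule S = A + B1 can be illustrated with a small check on the two branches. This is a sketch under a stated assumption: two branches count as a single base variation when they have the same length and differ at no more than one position. The names `hamming` and `resolve_branch` are hypothetical, and the talk does not say how the equivalence B1 = B2 is actually decided.

```python
def hamming(x, y):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def resolve_branch(a, b1, b2, max_mismatches=1):
    """Return A + B1 if the branches differ by a single base, else None.

    None means the branch is left unresolved here, for example because
    it is a genuine repeat junction that needs read-pair evidence.
    """
    if len(b1) == len(b2) and hamming(b1, b2) <= max_mismatches:
        return a + b1
    return None

merged = resolve_branch("ATGCAG", "GTCC", "GTAC")
unresolved = resolve_branch("ATGCAG", "GTCC", "TTAA")
```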
S. suis P1/7 Solexa Assembly
Solexa reads:
  Number of reads: 3,084,185
  Finished genome size: 2,007,491 bp
  Read length: 39 and 36 bp
  Estimated read coverage: ~40X
  Estimated Kmer coverage: 14X
  Number of vector reads: ?
Assembly features - contig stats:
  Total number of contigs: 362
  Total bases of contigs: 1,938,732 bp
  N50 contig size: 10,849
  Largest contig: 33,388
  Average contig size: 5,356
  Contig coverage over the genome: ~97 %
  Contig extension errors: 1
  Mis-assembly errors: 3
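For readers unfamiliar with the contig statistics quoted on these slides, the sketch below computes N50 with the usual definition: the length of the contig at which the running total, taken from the longest contig down, first reaches half of all assembled bases. The contig lengths in the example are toy values, not data from the talk; only the average is checked against the slide's own totals.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

toy_contigs = [2, 3, 4, 5, 6]          # toy lengths for illustration
toy_n50 = n50(toy_contigs)

# Average contig size is total bases over contig count; with the slide's
# figures, 1,938,732 bp over 362 contigs rounds to 5,356.
average = round(1_938_732 / 362)
```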
S. suis P1/7 Shredded Read Assembly
Shredded reads:
  Number of reads: 1,338,161
  Finished genome size: 2,007,491 bp
  Read length: 36
  Estimated read coverage: 24X
  Insert size: 500 bp
Assembly features:
                             Paired_Data   Not_Paired
  Number of contigs:         35            317
  Total assembled bases:     1.996 Mb      1.956 Mb
  N50 contig size:           243,039       13,929
  Largest contig:            474,070       33,460
  Average contig size:       57,043        6,168
  Contig coverage:           >99.0 %       >99.0 %
  Contig extension errors:   0             0
  Mis-assembly errors:       3             2
S. Typhi 6979 Solexa Assembly
Solexa reads:
  Number of reads: 5,142,190
  Finished genome size: 4,809,037 bp
  Read length: 41
  Estimated read coverage: ~15X
Assembly features - contig stats:
  Total number of contigs: 3,126
  Total bases of contigs: 4,633,241 bp
  N50 contig size: 2,460
  Largest contig: 15,325
  Average contig size: 1,482
  Contig coverage over the genome: ~97.5 %
  Mis-assembly errors: 0
S. Typhi CT18 Shredded Read Assembly
Shredded reads:
  Number of reads: 4,808,788
  Finished genome size: 4,809,037 bp
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 65
  Total bases of contigs: 4,800,992 bp
  N50 contig size: 158,460
  Largest contig: 489,849
  Average contig size: 73,861
  Contig coverage over the genome: ~99.0 %
  Mis-assembly errors: 3
PF_3D7 Shredded Read Assembly
Shredded reads:
  Number of reads: 11,630,428
  Finished genome size: 23.5 Mb
  Read length: 40
  Estimated read coverage: 20X
Assembly features - contig stats:
  Total number of contigs: 29,313
  Total bases of contigs: 17.17 Mb
  N50 contig size: 1,355
  Largest contig: 14,136
  Average contig size: 585
  Contig coverage over the genome: ~72.8 %
  Mis-assembly errors: ?
Clone Level Assembly with Shredded Error Free Reads
• Genome/Chromosome
• Shred reads with given coverage
• Forward-reverse paired reads: ~40 bp each, known distance ~500 bp
• Organize reads into small groups, each covering a 200 kb clone
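The shredding step above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the talk: the function name `shred` is hypothetical, read starts are drawn uniformly at random, the "known distance" is treated as the outer fragment length, and a pair is assigned to a clone by the start of its fragment. The number of pairs follows from coverage = reads x read length / genome length.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def shred(genome, coverage=40, read_length=40, insert=500,
          clone_size=200_000, seed=0):
    """Sample error-free forward-reverse pairs and group them by clone."""
    rng = random.Random(seed)
    # Two reads per pair, so pairs = coverage * G / (2 * read_length).
    n_pairs = coverage * len(genome) // (2 * read_length)
    groups = {}
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        forward = fragment[:read_length]
        # The reverse read is the reverse complement of the fragment's far end.
        reverse = fragment[-read_length:].translate(COMPLEMENT)[::-1]
        groups.setdefault(start // clone_size, []).append((forward, reverse))
    return groups

# Toy genome of 2,000 random bases, split into two 1,000 bp "clones".
toy_rng = random.Random(1)
genome = "".join(toy_rng.choice("ACGT") for _ in range(2000))
groups = shred(genome, coverage=10, clone_size=1000)
```

Each group can then be assembled on its own, which keeps the k-tuple table small compared with assembling the whole chromosome at once.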
Human Chromosome X
Shredded reads:
  Number of reads: 156 million
  Chromosome length: 156 Mb
  Number of clones: 774
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 28,204
  Total bases of contigs: 148 Mb
  N50 contig size: 30,968
  Largest contig: 173,157
  Average contig size: 5,254
Zebrafish Chromosome 5
Shredded reads:
  Number of reads: 70.2 million
  Chromosome length: 70.3 Mb
  Number of clones: 351
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 22,405
  Total bases of contigs: 67.5 Mb
  N50 contig size: 9,587
  Largest contig: 70,757
  Average contig size: 3,012
Plasmodium Chr14
Shredded reads:
  Number of reads: 3.2 million
  Chromosome length: 3.29 Mb
  Number of clones: 16
  Read length: 40
  Estimated read coverage: 40X
Assembly features:
                             Original data   Replacing “TATATA…”
  Total number of contigs:   1,960           1,333
  Total bases of contigs:    2.86 Mb         3.05 Mb
  N50 contig size:           2,924           4,596
  Largest contig:            18,366          23,345
  Average contig size:       1,461           2,287
Acknowledgements:
• Ian Goodhead and Chris Clee
• James Bonfield
• Yong Gu and Adam Spargo
• Daniel Zerbino (EBI)
• Tony Cox
• Richard Durbin