FuzzyPath Assemblies - from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes
Zemin Ning
The Wellcome Trust Sanger Institute
Genome/Chromosome Assembly Strategy
- Solexa reads assembler: extends forward-reverse paired reads (30-70 bp each, known distance ~500 bp) into long reads of 1-2 Kb
- Capillary reads assembler: Phrap/Phusion
Kmer Extension & Repeat Junctions
[Figure: kmer extension paths at a repeat junction, labelled A = A1 + A2 and B = B1 + B2.]
Handling of Single Base Variations
[Figure: path A followed by two branches, B1 and B2. With B1 = B2, the extended sequence is S = A + B1.]
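Number of Mismatches between Two Kmers
Each base is encoded in 2 bits (A = 00, C = 01, G = 10, T = 11), so the XOR of two encoded kmers is nonzero only at mismatching positions.

Kmer_1         ACGTAACTAACAGTT   00 01 10 11 00 00 01 11 00 00 01 00 10 11 11
Kmer_2         ACGTAACTCACAGTT   00 01 10 11 00 00 01 11 01 00 01 00 10 11 11
Kmer_1^Kmer_2  ACGTAACT ACAGTT   00 00 00 00 00 00 00 00 01 00 00 00 00 00 00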
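The XOR comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not FuzzyPath's actual code: the function names `encode_kmer` and `count_mismatches` are hypothetical, and the 2-bit codes are the ones read off the slide. Because a mismatch such as A vs T sets both bits of a pair, the two bits of each pair are folded into one before counting.

```python
# 2-bit base codes as they appear on the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer: str) -> int:
    """Pack a kmer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(kmer_1: str, kmer_2: str) -> int:
    """Count mismatching positions between two equal-length kmers using XOR."""
    if len(kmer_1) != len(kmer_2):
        raise ValueError("kmers must have the same length")
    diff = encode_kmer(kmer_1) ^ encode_kmer(kmer_2)
    # Mask with a 1 in the low bit of every 2-bit pair (binary 0101...01).
    low_bits = (4 ** len(kmer_1) - 1) // 3
    # Fold the high bit of each pair onto its low bit, then keep low bits only,
    # so every mismatching base contributes exactly one set bit.
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


if __name__ == "__main__":
    # The two kmers from the slide differ only at position 9 (A vs C).
    print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))
```

A tolerance on this count is one plausible way to treat two kmers as "fuzzy" matches, though the slides do not state the threshold used.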
Kmer Extension & Repeat Junctions
[Figure: pileup of other reads (454, Sanger, etc.) at a repeat junction between A1 and A2, with the resulting consensus.]
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads
S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55X
- Number of 454 reads: 100,000
- Read coverage of 454: 10X
Assembly features (contig stats):
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Average contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
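For readers unfamiliar with the contig statistics quoted on these slides, the sketch below computes them from a list of contig lengths. It assumes the common definition of N50 (the length of the contig at which the running total, taken from the largest contig down, first reaches half of all assembled bases); the slides do not say which definition was used. The function names and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 under the common definition (an assumption here)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig where the running sum covers half the bases.
        if 2 * running >= total:
            return length


def contig_stats(contig_lengths):
    """Summarise an assembly with the same fields the slides report."""
    total = sum(contig_lengths)
    return {
        "contigs": len(contig_lengths),
        "total_bases": total,
        "n50": n50(contig_lengths),
        "largest": max(contig_lengths),
        "average": total // len(contig_lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths, not taken from any assembly in the talk.
    print(contig_stats([100, 80, 60, 40, 20]))
```

As a consistency check on the slide itself, 1,999,817 bp over 73 contigs gives an average of about 27,394 bp, matching the reported value.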
Salmonella Senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5X
- Insert size: 170/50-300 bp
Assembly features (contig stats):
                              Solexa      454
Total number of contigs:      75          390
Total bases of contigs:       4.80 Mbp    4.77 Mbp
N50 contig size:              139,353     25,702
Largest contig:               395,600     62,040
Average contig size:          63,969      12,224
Contig coverage on genome:    ~99.8%      99.4%
Contig extension errors:      0           (not given)
Mis-assembly errors:          0           4
E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95X
- Insert size: 170/50-300 bp
Assembly features (contig stats):
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Average contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2
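The estimated read coverage on these slides is consistent with the usual back-of-envelope formula: total sequenced bases divided by genome size. The sketch below reproduces the E. coli 042 figure under one assumption that the slide does not state outright, namely that "Number of reads" counts read pairs, so each entry contributes 2 x 36 bases. The function name is hypothetical.

```python
def estimated_coverage(num_reads, bases_per_read, genome_size_bp):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * bases_per_read / genome_size_bp


if __name__ == "__main__":
    # E. coli 042 numbers from the slide; 2 x 36 bp per entry is an assumption
    # (it treats the read count as a count of pairs).
    depth = estimated_coverage(7_055_348, 2 * 36, 5_350_000)
    print(f"~{depth:.0f}X")
```

Under the same assumption, the Salmonella figure works out the same way: 6,000,000 x 74 bp over 4.8 Mbp gives 92.5X.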
The Malaria Genome Project
Datasets with Various GC Content
[Table: only the GC column survives. Values in slide order: 68.0%, 50.5%, 19.0%, 68.0%, 19.0%, 50.8%, 19.0%, 19.0%, 19.0%, 19.0%.]
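GC content is taken here to mean the percentage of G and C bases among the called bases of a sequence. A minimal sketch follows; the function name is hypothetical, and excluding ambiguous bases such as N from the denominator is a choice made for this example, not something the slides specify.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / called


if __name__ == "__main__":
    # Toy sequence, not from any dataset in the talk.
    print(gc_content("ATGCATGCNN"))
```

A genome near 19% GC, like the malaria datasets listed on this slide, is dominated by A and T bases, which is what makes it a hard case for short-read assembly.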
Malaria 3D7 Assemblies
Solexa reads:                 2x36 bp     2x76 bp
Number of reads:              14.0 M      9.77 M
Finished genome size:         23 Mbp      23 Mbp
Estimated read coverage:      43X         64X
Insert size:                  170 bp      170 bp
Assembly features:
Total number of contigs:      26,926      22,839
Total bases of contigs:       19.2 Mbp    21.1 Mbp
N50 contig size:              1,456       1,621
Largest contig:               9,106       9,825
Average contig size:          706         923
Contig coverage on genome:    ~83.5%      91.7%
Contig extension errors:      ?           ?
Mis-assembly errors:          ?           ?
Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
- Number of reads: 6,346,317
- Finished genome size: 4.7 Mbp
- Read length: 33 bp
- Estimated read coverage: ~40X
- Shredded reference of SpA: 10X
Assembly features (contig stats):
- Total number of contigs: 66
- Total bases of contigs: 4,615,704 bp
- N50 contig size: 168,793
- Largest contig: 401,700
- Average contig size: 69,934
- Contig coverage over the genome: ~98%
- Contig extension errors: 0
- Mis-assembly errors: 2
Acknowledgements:
- Yong Gu
- Ben Blackburne
- Hannes Ponstingl
- Daniel Turner
- Michael Quail
- Tony Cox
- Richard Durbin