Francisella tularensis

Data Sources
  NCBI
    Genome Projects
    Short Read Archive (SRA)
    Trace Archive (TA)
    TA FTP
  Broad Institute
    Microbial Sequencing Center
    Download Sequence
  Baylor College of Medicine
    Microbial Genome Projects
    FTP

Assembly methods
  Whole Genome Assembly (WGA): Celera
    Version: July 18th 2007
    Script: runCA-OBT.pl
    Input: Sanger traces
  Comparative assembly: AMOS
    Version: 2.0.2
    Script: AMOScmp
    Input: Sanger/454 traces
  454
    Version: 1.1.01.20
    Script: runProject
    Input: 454 traces

Whole Genome Assembly (WGA)
  De novo assembly: does not use a reference genome
  Requires high-quality, long, paired Sanger reads
  Cannot handle 454 data (yet)
  Max error rate = 1.5%
  Overlap-based trimming of the reads
  Re-estimates library insert size (mean, stdev)
  Resolves surrogates
  Creates contigs and scaffolds

AMOScmp
  Comparative assembly: uses a reference genome
  Can assemble low-quality/short reads
  Rearrangements/indels can create breaks in the assembly
  Can use both Sanger and 454 unpaired reads
  Max error rate = 20%

454 de novo assembler
  Does not use a reference genome
  Uses 454 unpaired/paired-end reads
  Requires 454 flowgrams
  Creates contigs
  Creates scaffolds (if paired-end reads are available)

Software used
  Assemblers: WGA, AMOScmp, 454
  Data download: ftp, wget, query_tracedb
  Data formatting: tarchive2ca (AMOS), sfftools
  Trimming: Lucy, veraTrim, OBT
  Alignment, repeats, SNPs: mummer
  Sequence manipulation: EMBOSS
  Assembly validation: cavalidate, amosvalidate (AMOS)
  Contig joining: autoJoiner
  Scripts: Perl
  Wiki site (tracks the progress of the assemblies): https://wiki.umiacs.umd.edu/cbcb/index.php/Francisella_tularensis

Francisella tularensis holarctica OSU18
  The only complete genome for which we have most of the traces available.
  Published: "Chromosome Rearrangement and Diversification of Francisella tularensis Revealed by the Type B (OSU18) Genome Sequence"; J Bacteriol. 2006 October; 188(19): 6977–6985.
  "Sequencing and assembly of the F. tularensis subsp. holarctica strain OSU18 genome were accomplished by the whole-genome shotgun (WGS) method, similar to a previously described method (22). Briefly, the WGS clones were sequenced using ABI 3730 sequencers, and the sequence bases were called using the Applied Biosystems sequencing analysis software KB Basecaller. The WGS reads were assembled by using Atlas (11) and Phrap (7). The initial WGS assembly resulted in 132 contigs in 33 scaffolds with approximately 26× sequence coverage. Gaps between contigs and scaffolds were closed by sequencing PCR products that spanned gaps or by sequencing small insert libraries generated from the PCR products. Low-quality regions were resequenced using clones or PCR products spanning the regions to ensure that the Phrap quality score for each base was equal to or greater than 30."
  This relatively deep data set should enable further studies involving new sequencing, comparative genomics, and proteomics strategies and technologies.

Francisella tularensis holarctica OSU18
Alignments
  OSU18 vs OSU18 (complete genome)
    many ~1 KB dispersed repeats
    one 30 KB two-copy repeat
  OSU18 vs other strains (SchuS4, U112) (complete genomes)
    large rearrangements
    indels
    SNPs
    common repeats
  OSU18 (complete genome) vs OSU18 (assemblies)
    assembly gaps
    errors
  OSU18 (complete genome) vs OSU18 (traces)
    read coverage: 4 regions with 0X coverage
    trimming
    library insert sizes: original values (2 KB) are underestimates

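The read-coverage check listed above (finding regions with 0X coverage) can be sketched as a sweep over read placements on the reference. This is a minimal Python sketch, not the procedure used for these slides: the function name `zero_coverage_regions` and the toy placements are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def zero_coverage_regions(genome_length, placements):
    """Return the (start, end) intervals of the reference covered by no read.

    placements: iterable of (start, end) read placements on the reference,
    assumed 1-based and inclusive.
    """
    gaps = []
    next_uncovered = 1  # leftmost position not yet covered by any read
    for start, end in sorted(placements):
        if start > next_uncovered:
            # Nothing covers next_uncovered .. start-1.
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= genome_length:
        # Uncovered tail at the end of the reference.
        gaps.append((next_uncovered, genome_length))
    return gaps


# Toy example (hypothetical data): a 100 bp reference and three reads.
toy_gaps = zero_coverage_regions(100, [(1, 30), (20, 50), (61, 90)])
print(toy_gaps)
```

On real data the placements would come from aligning the trimmed reads to the complete genome; the sketch only covers the interval bookkeeping.
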
Francisella tularensis holarctica OSU18
Trimming
  Default (Factory): CLIP values from NCBI TA
  Lucy: quality trimming
  veraTrim: 5' vector trimming
  Overlap Based Trimming (OBT): quality and vector trimming

Francisella tularensis holarctica OSU18 Mummerplot: OSU18 vs OSU18; red lines beside the main diagonal are repeats
Francisella tularensis holarctica OSU18 Mummerplot: OSU18 (Type B) vs SchuS4 (Type A); many rearrangements
Francisella tularensis holarctica OSU18 Mummerplot: OSU18 (Type B) vs U112 (novicida); many rearrangements
Francisella tularensis holarctica OSU18
WGA metrics:
[Scaffolds]
TotalScaffolds=160
TotalContigsInScaffolds=163
TotalBasesInScaffolds=2104125
MeanBasesInScaffolds=13151
MinBasesInScaffolds=1028
MaxBasesInScaffolds=454878
[Contigs]
TotalContigsInScaffolds=163
TotalBasesInScaffolds=2104125
MeanContigLength=12909
MinContigLength=1028
MaxContigLength=454878
[BigContigs_greater_10000]
TotalBigContigs=19
BigContigLength=1869172
MeanBigContigLength=98377
MinBigContig=13002
MaxBigContig=454878
BigContigsPercentBases=88.83
...
[Mates]
ReadsWithNoMate=3052(4.54%)
ReadsWithGoodMate=54516(81.10%)
ReadsWithBadShortMate=0(0.00%)
ReadsWithBadLongMate=26(0.04%)
ReadsWithSameOrientMate=122(0.18%)
ReadsWithOuttieMate=42(0.06%)
ReadsWithBothChaffMate=1744(2.59%)
ReadsWithChaffMate=1034(1.54%)
[Reads]
TotalUsableReads=67220
AvgClearRange=791
ContigReads=61480(91.46%)
BigContigReads=60524(90.04%)
SmallContigReads=956(1.42%)
SingletonReads=2768(4.12%)
ChaffReads=2768(4.12%)
[Coverage]
ContigsOnly=23.31
AllReads=25.29

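Several of the WGA metrics above are derived from the totals, so they can be cross-checked with a few lines of arithmetic. The Python sketch below is only a consistency check on the reported numbers; the variable names are illustrative and are not part of the assembler's output.

```python
# Totals copied from the WGA metrics above.
total_scaffolds = 160
total_contigs = 163
total_bases = 2104125
total_big_contigs = 19
big_contig_bases = 1869172
total_usable_reads = 67220
contig_reads = 61480

# Derived values, rounded the same way as in the report.
mean_bases_in_scaffolds = round(total_bases / total_scaffolds)
mean_contig_length = round(total_bases / total_contigs)
mean_big_contig_length = round(big_contig_bases / total_big_contigs)
big_contigs_percent_bases = round(100 * big_contig_bases / total_bases, 2)
contig_reads_percent = round(100 * contig_reads / total_usable_reads, 2)

print(mean_bases_in_scaffolds, mean_contig_length, mean_big_contig_length)
print(big_contigs_percent_bases, contig_reads_percent)
```

The computed values agree with MeanBasesInScaffolds, MeanContigLength, MeanBigContigLength, BigContigsPercentBases, and the ContigReads percentage reported above.
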
Francisella tularensis holarctica OSU18 Mummerplot: OSU18 (complete) vs OSU18 (WGA); 2 inversions and several indels
Francisella tularensis holarctica OSU18 Mummerplot: OSU18 (complete) vs OSU18 (WGA); SNPs are represented as red dots
Francisella tularensis holarctica OSU18
AMOScmp metrics:
[Scaffolds]
TotalScaffolds 1
TotalContigsInScaffolds 22
MeanContigsPerScaffold 22.00
MinContigsPerScaffold 22
MaxContigsPerScaffold 22
TotalBasesInScaffolds 1882156
MeanBasesInScaffolds 1882156
MaxBasesInScaffolds 1882156
N50ScaffoldBases 1882156
TotalSpanOfScaffolds 1895684
MeanSpanOfScaffolds 1895684
MinScaffoldSpan 1895684
MaxScaffoldSpan 1895684
IntraScaffoldGaps 21
MeanSequenceGapSize 613.91
...
[Contigs]
TotalContigs 22
TotalBasesInContigs 1926006
MeanContigSize 87545.73
MinContigSize 99
MaxContigSize 567200
N50ContigBases 465176
...
[BigContigs_greater_10000]
TotalBigContigs 10
BigContigLength 1918334
MeanBigContigSize 191833.40
MinBigContig 10681
MaxBigContig 567200
BigContigsPercentBases 99.60
[SmallContigs]
TotalSmallContigs 12
SmallContigLength 7672
MeanSmallContigSize 639.33
MinSmallContig 99
MaxSmallContig 2153
SmallContigsPercentBases 0.40
[Reads]
TotalReads 68460
ReadsInContigs 66471
BigContigReads 66367
SmallContigReads 104
SingletonReads 1989

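The N50 values in the metrics above follow the usual idea of "the length at which half of the bases are reached". Below is a minimal Python sketch of the common definition; it is an assumption that the validation tool uses exactly this convention, and the function name `n50` and the toy contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together contain at least half of the total bases."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# With a single scaffold, N50 is simply that scaffold's size, which matches
# N50ScaffoldBases = TotalBasesInScaffolds = 1882156 in the metrics above.
single_scaffold_n50 = n50([1882156])

# Toy contig lengths (hypothetical data): total 250, half 125.
toy_n50 = n50([100, 80, 50, 20])
print(single_scaffold_n50, toy_n50)
```

Recomputing N50ContigBases would need the 22 individual contig lengths, which the slide does not list.
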
Francisella tularensis holarctica OSU18 Hawkeye: original AMOScmp assembly; many orientation violations (red mates) indicate misassemblies
Francisella tularensis holarctica OSU18 Hawkeye: original AMOScmp assembly; 1st misassembled region 16,336-21,562 (5 KB)
Francisella tularensis holarctica OSU18 Hawkeye: original AMOScmp assembly; 2nd misassembled region 167,086-184,936 (17 KB)
Francisella tularensis holarctica OSU18
Final Assembly Steps:
  1. The complete genome sequence was downloaded from NCBI: RefSeq NC_008369.1
  2. The reads were downloaded from TA and re-formatted using the AMOS package (tarchive2ca) => .frg & .afg files
  3. Only the two Sanger libraries were considered: 68K reads should provide enough coverage to assemble the whole genome
  4. The reads were retrimmed using veraTrim (-T 10 -M 100 -E 500)
  5. The WGA assembler (runCA-OBT.pl) was used to assemble the reads
  6. The library sizes were updated using the WGA estimates
       BFTBP (mean, stdev): original (2000, 1000); WGA (2690, 643)
       BFTDP (mean, stdev): original (2000, 1000); WGA (3675, 1225)
  7. The WGA assembly was aligned to the reference using nucmer; two rearrangements and multiple SNPs were noticed
  8. The reads were assembled using AMOScmp; 2 misoriented read piles were noticed
  9. The assembly was aligned to itself; two 950 bp inverted repeats were identified as flanking the problem regions; the region coordinates are: 16336-21562 (5 KB), 167086-184936 (17 KB)
  10. The 2 regions were flipped; the new reference was called NC_008369.2
  11. Several small contig read clear ranges (step 8) were extended to their OBT trimming points
  12. AMOScmp was rerun using more relaxed parameters: nucmer MINCLUSTER=30, casm-layout MAXTRIM=50
Conclusions:
  We reduced the number of AMOScmp contigs from 22 to 8.
  We increased the number of assembled bases from 1,882,156 to 1,889,817.

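Step 10 above (flipping the two regions to build a corrected reference) amounts to reverse-complementing each region in place. The following is a minimal Python sketch under stated assumptions: the coordinates are taken to be 1-based and inclusive, the helper names `reverse_complement` and `flip_regions` are hypothetical, and the demonstration runs on a toy sequence, not on NC_008369.1. It is not the script that was actually used.

```python
# Complement table for unambiguous bases plus N (other IUPAC codes are left as-is).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_regions(seq, regions):
    """Reverse-complement each (start, end) region of seq in place.

    Regions are assumed 1-based, inclusive, and non-overlapping, so the
    sequence length never changes and the order of the flips does not matter.
    """
    flipped = seq
    for start, end in regions:
        inner = reverse_complement(flipped[start - 1:end])
        flipped = flipped[:start - 1] + inner + flipped[end:]
    return flipped


# Region coordinates from step 9; their lengths match the "5 KB" and "17 KB" labels.
REGIONS = [(16336, 21562), (167086, 184936)]
region_lengths = [end - start + 1 for start, end in REGIONS]
print(region_lengths)

# Toy demonstration (hypothetical sequence): flip positions 5-7.
toy = flip_regions("AAAACCGTTT", [(5, 7)])
print(toy)
```

Flipping the same regions a second time restores the original sequence, which is a convenient sanity check before rerunning the comparative assembly.
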
Francisella tularensis holarctica OSU18 New AMOScmp assembly: the 2 piles of orientation violations (red mates) are gone
Francisella tularensis tularensis FSC033
Reference: NZ_AAYE00000000

Name                 Length   %GC    Note
NZ_AAYE01000001.1    101124   33.65
NZ_AAYE01000002.1     46675   32.87
NZ_AAYE01000003.1      1600   34.25
NZ_AAYE01000004.1    295522   31.87
NZ_AAYE01000005.1    650364   31.73
NZ_AAYE01000006.1      2400   37.29
NZ_AAYE01000007.1    132212   32.96
NZ_AAYE01000008.1     23680   31.04
NZ_AAYE01000009.1       201   45.27  high GC%: 16S-23S rRNA (megablast)
NZ_AAYE01000010.1       571   46.06  high GC%
NZ_AAYE01000011.1     61231   32.30
NZ_AAYE01000012.1    249955   31.91
NZ_AAYE01000013.1    137017   32.24
NZ_AAYE01000014.1     91009   32.87
NZ_AAYE01000015.1     50644   33.42
Total               1844205

1,892,819 bp in SCHU S4 (complete) => ~48,614 bp of gaps in the FSC033 draft assembly

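The gap estimate above is the difference between the complete SCHU S4 genome length and the summed FSC033 draft contig lengths. A short Python sketch of that arithmetic follows; the variable names are illustrative, and treating the difference as "gaps" assumes the two strains have roughly the same genome size, as the slide does.

```python
# Contig lengths of the FSC033 draft (NZ_AAYE01000001.1 .. NZ_AAYE01000015.1),
# copied from the table above.
fsc033_contig_lengths = [
    101124, 46675, 1600, 295522, 650364,
    2400, 132212, 23680, 201, 571,
    61231, 249955, 137017, 91009, 50644,
]

# Length of the complete SCHU S4 genome, as given above.
schu_s4_length = 1892819

draft_total = sum(fsc033_contig_lengths)
estimated_gap_bases = schu_s4_length - draft_total

print(draft_total)          # 1844205
print(estimated_gap_bases)  # 48614
```
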
Francisella tularensis tularensis FSC033 Mummerplot: SchuS4(Type A, complete) vs FSC033(Type A,draft)
Francisella tularensis tularensis FSC033 AMOScmp assembly, Sanger+454 reads, FSC033 Broad draft assembly used as reference. 3rd largest scaffold: the 30 KB repeat looks collapsed; it appears as a 28,659 bp surrogate in WGA
Francisella tularensis tularensis FSC033 AMOScmp assembly, Sanger+454 reads, SchuS4 complete genome used as reference
Francisella tularensis holarctica KO97 AMOScmp assembly, 454 unpaired reads, OSU18 complete genome used as reference