Harry Presman Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads
Overview • Motivation • Assembly • Results • Advantages/Limitations
Motivation • Next-gen sequencers produce short read-lengths • Useful for polymorphism discovery • Difficult to assemble whole genomes • Current assembly algorithms produce highly fragmented results
Sequencing P. aeruginosa (PAb1) • Common source of hospital-acquired infections • Chosen because two comparator genomes, PAO1 and PA14, are available • 8,627,900 shotgun reads (Solexa)
Assembly • Step 1: AMOScmp • Comparative assembler • Uses MUMmer, an alignment system based on suffix trees • Referenced in “Comparative Genome Assembly” • Using PA14 as reference: 2053 contigs • Using PAO1 as reference: 2797 contigs
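To illustrate the idea of placing a short read on a reference through a suffix index, here is a minimal sketch. It uses a naive suffix array rather than the suffix tree that MUMmer builds, and it only finds exact matches; the function names are hypothetical and nothing here reflects the actual AMOScmp or MUMmer code.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction: fine for a toy example, far too slow for a genome.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every position where pattern occurs exactly in text."""
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


reference = "ACGTACGTTACG"  # toy reference, not real PA14 sequence
index = build_suffix_array(reference)
print(find_occurrences(reference, index, "ACG"))
```

The index is built once per reference and then queried for each read, which is what makes this family of methods practical for millions of short reads.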
Assembly • Step 2: Multiple sequence alignment • Align PAO1 and PA14 assemblies • Use Minimus (AMOS component for small data sets) to fill gaps with contigs • Re-map reads using AMOScmp to clean assembly • Closed 203 gaps
Assembly • Step 3: Gene-boosted assembly • University of Maryland annotation pipeline (based on BLAST and Glimmer) • Protein-coding genes used to fill gaps • Identify genes at contig edges and gaps • Extract amino acid sequences • tblastn identifies potential filler reads • ABBA assembles reads into gaps • Closed 185 gaps
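The read-recruitment step can be sketched as follows: translate each read in all six frames and keep it if a stretch of the translation matches the protein that spans the gap. This is a toy stand-in for tblastn, which scores inexact local alignments; here only exact k-residue matches count, and the function names and the `k=5` threshold are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Yield the three forward and three reverse-strand translations."""
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            yield translate(strand[offset:])


def read_matches_protein(read, protein, k=5):
    """True if any k consecutive translated residues occur in the protein."""
    for peptide in six_frame_translations(read):
        for i in range(len(peptide) - k + 1):
            if peptide[i:i + k] in protein:
                return True
    return False


protein = "MKTAYIAKQR"          # hypothetical gap-spanning protein fragment
read = "ACTGCTTATATTGCTAAA"     # encodes TAYIAK on the forward strand
print(read_matches_protein(read, protein))
```

Reads recruited this way would then be handed to an assembler such as ABBA, which is not modeled here.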
Aside • Tested gene-boosted analysis alone • PAb1 assembled using PA14 proteins • 96% of PAb1 proteins assembled using only this method • Lacks global genome structure information
Assembly • Step 4: Clean up • SSAKE (“Short Sequence Assembly by K-mer search and 3′ read Extension”) • Edena (“Exact DE Novo Assembler”) • Velvet • Closed 46 gaps
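The "3′ read extension" idea behind SSAKE can be sketched as a greedy loop: start from a seed read and repeatedly append the unused read whose prefix has the longest overlap with the current 3′ end of the contig. This is a simplified illustration only; it ignores reverse strands, sequencing errors, and the consensus rules a real assembler applies, and the names and the `min_overlap` default are assumptions.

```python
def overlap_length(contig, read):
    """Length of the longest suffix of contig that equals a prefix of read."""
    for n in range(min(len(contig), len(read)), 0, -1):
        if contig.endswith(read[:n]):
            return n
    return 0


def greedy_extend(seed, reads, min_overlap=3):
    """Extend seed in the 3' direction using the best-overlapping reads."""
    pool = list(reads)
    if seed in pool:
        pool.remove(seed)  # the seed itself is already in the contig

    contig = seed
    while pool:
        best = max(pool, key=lambda r: overlap_length(contig, r))
        n = overlap_length(contig, best)
        if n < min_overlap:
            break  # no remaining read overlaps well enough
        contig += best[n:]
        pool.remove(best)
    return contig


reads = ["ACGTAC", "GTACGG", "CGGTTA"]  # toy reads
print(greedy_extend("ACGTAC", reads))
```

Greedy extension stops at the first position where no read overlaps by at least `min_overlap` bases, which is one reason short-read de novo assemblies tend to be fragmented.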
Results • 76 contigs containing 6,290,005 bp • 94% of bases in a single scaffold • 5602 protein-coding genes identified • Error rate per read = 1.04% • No errors where coverage exceeds 20X • Slight bias toward high gene coverage regions
Results • SNP analysis • Aligned PA14 and PAb1 • 5,537,508/5,568,550 bp agreed • 1157/5,568,550 possible sequence errors • 187/1104 indels in error • Accuracy of assembly: > 99.97%
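The percentages implied by the SNP-analysis counts can be checked with a few lines of arithmetic. The sketch below assumes that the accuracy figure is one minus the fraction of possible sequence errors among the aligned bases; the slide does not state the formula, so this is one reading that happens to reproduce the "> 99.97%" claim.

```python
aligned_bp = 5_568_550      # bases aligned between PA14 and PAb1
agreeing_bp = 5_537_508     # bases that agree
possible_errors = 1_157     # possible sequence errors
indels_in_error = 187
total_indels = 1_104

# Share of aligned bases on which the two genomes agree.
identity_pct = 100 * agreeing_bp / aligned_bp

# Assumed definition: accuracy = 1 - (possible errors / aligned bases).
accuracy_pct = 100 * (1 - possible_errors / aligned_bp)

# Share of indels attributed to assembly error.
indel_error_pct = 100 * indels_in_error / total_indels

print(f"identity:       {identity_pct:.2f}%")
print(f"accuracy:       {accuracy_pct:.3f}%")
print(f"indels in error: {indel_error_pct:.1f}%")
```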
Advantages/Limitations • Requires related genomes and protein sequences • GenBank contains > 650 microbial genomes • Genome size should not matter • High speed and low cost • ¼ of a single Solexa sequencing run in this case
Thank You Questions?