Microbial Genome Assembly and Finishing
Alla Lapidus, Ph.D.
Microbial Genomics, DOE Joint Genome Institute, Walnut Creek, CA
A typical Microbial project
Sequencing -> Auto-assembly -> Gap closure (FINISHING) -> Annotation -> Public release
Evolution of Microbial Drafts
Sanger only
• 4x of 3kb plasmids + 4x of 8kb plasmids + 1x of fosmids
• ~ $50k for 5 Mb genome draft
Hybrid Sanger/pyrosequence/Illumina
• 4x 8kb Sanger + 15x coverage 454 shotgun + 20x Illumina (quality improvement)
• ~ $35k for 5 Mb genome draft
454 + Solexa (current)
• 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Illumina shotgun (quality improvement; gaps)
• ~ $10k per 5 Mb genome
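The trend across the three draft strategies is easier to see as arithmetic. The sketch below only multiplies the coverages quoted on the slide by the 5 Mb genome size and divides the quoted cost by the result; the names `STRATEGIES` and `summarize` are illustrative, not part of any JGI tool.

```python
GENOME_MB = 5  # genome size used on the slide, in megabases

# (strategy, per-library fold coverages from the slide, approximate draft cost in USD)
STRATEGIES = [
    ("Sanger only", [4, 4, 1], 50_000),
    ("Hybrid Sanger/454/Illumina", [4, 15, 20], 35_000),
    ("454 + Solexa", [20, 4, 50], 10_000),
]

def summarize(name, coverages, cost_usd, genome_mb=GENOME_MB):
    """Return total coverage, sequenced megabases and cost per sequenced megabase."""
    total_cov = sum(coverages)
    sequenced_mb = total_cov * genome_mb
    return {
        "strategy": name,
        "total_coverage": total_cov,
        "sequenced_mb": sequenced_mb,
        "usd_per_sequenced_mb": cost_usd / sequenced_mb,
    }

summaries = [summarize(*s) for s in STRATEGIES]
for s in summaries:
    print(f'{s["strategy"]}: {s["total_coverage"]}x, '
          f'{s["sequenced_mb"]} Mb sequenced, '
          f'${s["usd_per_sequenced_mb"]:.0f} per sequenced Mb')
```

Each step sequences far more bases for less money, which is why the later strategies can afford extra coverage purely for quality improvement.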
Assembly (assembler)
• Sanger reads only (phrap, PGA, Arachne, etc.)
• Hybrid Sanger/pyrosequence/Solexa (no special assemblers; use PGA and Arachne)
• 454/Solexa (Newbler, PCAP?) - 454 reads only: shotgun reads + PE reads
[Diagram: paired reads from 3kb, 8kb and 40kb inserts; in the hybrid case a 454 contig is represented by 454 shreds placed alongside the Sanger reads]
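The "454 shreds" in the hybrid approach are overlapping pseudo-reads cut from a 454 contig so that a Sanger assembler can take them as input. A minimal sketch of such shredding follows; the shred length and overlap are assumptions chosen for illustration, not the settings of the JGI pipeline.

```python
def shred(contig, shred_len=1000, overlap=500):
    """Cut a contig into overlapping pseudo-reads ("shreds").

    Returns a list of (start, sequence) tuples that together cover the
    whole contig. shred_len and overlap are illustrative defaults.
    """
    if shred_len <= overlap:
        raise ValueError("shred_len must be larger than overlap")
    if not contig:
        return []
    step = shred_len - overlap
    shreds = []
    start = 0
    while True:
        shreds.append((start, contig[start:start + shred_len]))
        if start + shred_len >= len(contig):
            break  # the last shred reached the end of the contig
        start += step
    return shreds

contig = "ACGT" * 575  # a made-up 2300 bp contig
for start, seq in shred(contig):
    print(start, len(seq))
```

The overlap means every base of the contig sits in at least one shred and the shreds can be rejoined by the assembler.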
Use of Illumina data: Polisher
• Align Solexa reads
• Identify errors
• Automatically suggest corrections for manual curation
• Automatically suggest and implement corrections
[Diagram: reads aligned to the consensus at positions x1, x2, x3; Polisher outputs a list of discrepancies: x1 - G, x2 - T, x3 - A, etc.]
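The Polisher steps above can be sketched as a per-column majority vote over aligned short reads. This is not the Polisher implementation; it is a simplified illustration that assumes ungapped alignments given as `(offset, read)` pairs and handles substitutions only, with hypothetical thresholds `min_depth` and `min_fraction`.

```python
from collections import Counter

def suggest_corrections(consensus, aligned_reads, min_depth=3, min_fraction=0.8):
    """List (position, consensus_base, suggested_base, depth) discrepancies."""
    columns = [Counter() for _ in consensus]
    for offset, read in aligned_reads:
        for i, base in enumerate(read):
            pos = offset + i
            if 0 <= pos < len(consensus) and base in "ACGT":
                columns[pos][base] += 1  # ignore N and other ambiguity codes
    suggestions = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust a correction
        base, n = counts.most_common(1)[0]
        if base != consensus[pos] and n / depth >= min_fraction:
            suggestions.append((pos, consensus[pos], base, depth))
    return suggestions

def apply_corrections(consensus, suggestions):
    """Implement the suggested substitutions on the consensus."""
    seq = list(consensus)
    for pos, _old, new, _depth in suggestions:
        seq[pos] = new
    return "".join(seq)

consensus = "ACGTACGTAC"
reads = [(0, "ACGAACG"), (2, "GAACGTAC"), (1, "CGAACGT")]
disc = suggest_corrections(consensus, reads)
print(disc)
print(apply_corrections(consensus, disc))
```

Keeping suggestion and application as separate functions mirrors the two modes on the slide: suggestions can be reviewed by a curator or applied automatically.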
Errors corrected with Solexa (Polisher)
Finished consensus: 454 contig + Sanger reads
Frame shift detected in this area (454 contig)
CCTCTTTGATGGAAATGATA**TCTTCGAGCATCGCCTC**GGGTTTTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCAAA <- Solexa Consensus
Solexa reads:
CCTCTTTGATGGAAATAATA**TATTCGAGCATC
TTAGTGGAAATGATA**TCTTCGAGCATCGCCTC
CGAGCNTCGCCTC**GGGCTTTCCCT
CGAGCATCGCCTC**GGGTTCTCCATACACAGA
GCATCGCCTC**GGGTTTTCAATACAGAGAACCT
CAGCGCCTC**GGGTTTTCCATACAGAGAACCTT
ATCGCCTC**GGGTTTTCCAGACAGAGAACCTTT
GGTTC**GGGTTTTCCATACAGAGAACCTTTGAT
GTTTTCCATACAGAGAACATTTGATGATGAAC
GTTGTCCATACAGAGAACTTTTGATGATGAAC
TATANCATACAGAGAACCTTTGATGATGAACC
ATTTCCAGACAGAGAACCNTTGATGATGAACC
CAAACAGAGAACCTTTGAGGATGAACCGGTTG
ACAGGGAACCTTAGATGATGAACCGGTTGAAG
ACAGAGAACCTTAGATGATGAACCGGTTGAAG
ACCGTTGATGATGAACCGGTTGAAGATCTGCG
GATGGTGAACGGGTTGAAGATCTGCGGGTCAA
GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC
GGTGGAAGATCTGCGGGTAAAACCAGTCCTCT
GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG
TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC
GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT
TCTGCGGGTCAAACCAGTACTCTGCCTCGTTC
Draft assembly - what we get
• Assembly: set of contigs (10, 16, 21)
• Ordered sets of contigs (scaffolds) (10, 21, 16)
• Gap closure between scaffolded contigs: clone walk (Sanger lib) or PCR - sequence (PCR product between primers pri1 and pri2)
• New technologies: no clones to walk off even if you can scaffold contigs
Why do we have gaps
What are gaps (Sanger)? - Genome areas not covered by random shotgun
• Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly.
• Assembly of the shotgun reads may produce misassembled regions due to repetitive sequences.
• A biased base content (this can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence):
  ~ AT-rich DNA clones poorly in bacteria (cloning bias; promoter-like structures) => uncaptured gaps
  ~ GC-rich DNA is difficult to PCR and to sequence and often requires the use of special chemistry => captured gaps
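How many gaps random shotgun alone should leave can be estimated with the classical Lander-Waterman model, in which the expected uncovered fraction of the genome at fold coverage c is e^(-c) and the expected number of gaps is roughly N·e^(-c) for N reads. The sketch below applies it to a 5 Mb genome at 9x; the 700 bp read length is an assumption, and the model ignores the minimum overlap needed to join reads.

```python
import math

def expected_uncovered_fraction(coverage):
    """Fraction of the genome expected to be hit by no read at all."""
    return math.exp(-coverage)

def expected_gaps(genome_len, read_len, coverage):
    """Expected number of gaps under purely random read placement."""
    n_reads = coverage * genome_len / read_len
    return n_reads * math.exp(-coverage)

genome_len = 5_000_000   # 5 Mb genome
read_len = 700           # assumed average Sanger read length
coverage = 9             # 4x + 4x + 1x from the Sanger-only strategy

print(f"uncovered bases: {expected_uncovered_fraction(coverage) * genome_len:.0f}")
print(f"expected gaps:   {expected_gaps(genome_len, read_len, coverage):.1f}")
```

Under these assumptions random sampling accounts for only a handful of gaps, so a draft with dozens of gaps points to the biased causes listed above (cloning bias, hard-to-sequence regions, repeats) rather than to insufficient coverage.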
454 (pyrosequence) and low GC genome
Thermotoga lettingae TMO (JGI ID 4002278)
Draft assembly:
- 55 total contigs; 41 contigs >2kb
- 38% GC - biased Sanger libraries
Draft assembly + 454:
- 2 total contigs; 1 contig >2kb
- 454 - no cloning
<166bp> - average length of gaps
High GC stops (Sanger and Hybrid)
Small hairpins (inverted repeat sequences) in the DNA reanneal either during sequencing or during electrophoresis, resulting in failed sequencing reactions or unreadable electrophoresis results. (This can be alleviated by adding modifiers to the reaction, sequencing smaller clones, and running gels at higher temperatures in the presence of stronger denaturants.)
454 and High GC project
Xylanimonas cellulosilytica DSM 15894 (3.8 Mb; 72.1% GC)
[Figure: two assemblies compared: "PGA assembly - 9x of 8kb + 454" and "PGA assembly - 9x of 8kb"]
Genome closure issues
• Resolve repeats and mis-assemblies
  - Repeats within or in vicinity of other repeats
  - Large repetitive regions
  - Complex repetitive regions (tandems)
• Fill in gaps
  - DNA region lethal to E. coli (Sanger libraries)
  - Hairpins, GC rich, hard stops or other 2° structure/physical premature termination
  - Hard to PCR (new technologies)
• Other issues
  - Homopolymeric tracts and other polymorphisms (SNPs, VNTRs, indels)
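Homopolymeric tracts are easy to flag automatically so that they can be checked against other data during closure. The following sketch lists every run of a single base at or above a chosen length; the `min_len` threshold is an assumption, not a JGI standard.

```python
import re

def homopolymer_runs(seq, min_len=5):
    """Return (start, base, length) for each homopolymer run >= min_len."""
    # ([ACGT]) captures one base; \1{k,} requires at least k more copies of it
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]

example = "ACG" + "T" * 6 + "AC" + "G" * 5 + "A"
print(homopolymer_runs(example))
```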
What is Finishing?
The process of taking a rough draft assembly composed of shotgun sequencing reads and identifying and resolving misassemblies, sequence gaps and regions of low quality to produce a highly accurate finished DNA sequence.
Final quality:
• Final error rate should be less than 1 per 50 kb
• No gaps, no misassembled areas, no characters other than ACGT
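The final-quality criteria translate directly into numbers and a simple check. In the sketch below, `phred` converts the error-rate target into a Phred-scaled quality value, and `non_acgt_positions` reports any character a finished sequence should not contain; both function names are illustrative.

```python
import math

def phred(error_rate):
    """Phred-scaled quality for a given per-base error probability."""
    return -10 * math.log10(error_rate)

def non_acgt_positions(seq):
    """Positions of characters other than A, C, G, T (e.g. N or pads)."""
    return [i for i, base in enumerate(seq.upper()) if base not in "ACGT"]

target_rate = 1 / 50_000                      # 1 error per 50 kb
print(f"required consensus quality: Q{phred(target_rate):.0f}")
print(f"errors allowed in a 5 Mb genome: {5_000_000 * target_rate:.0f}")
print(non_acgt_positions("ACGTNACG*"))        # made-up sequence with N and a pad
```

One error per 50 kb corresponds to about Q47 averaged over the consensus, or at most 100 errors across a 5 Mb genome.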
JGI Microbial Finishing Currently: >250 individual microbes
Metagenomic assembly
The whole-genome shotgun sequencing approach has been used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolated microbes.
• The size of a metagenomic sequencing project is typically very large
• Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members
• Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies
• Chimeric contigs are produced by co-assembly of sequencing reads originating from different species
• Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly
• No assemblers have been developed for metagenomic data sets
QC: Annotation of poor quality sequence
To avoid this:
• make sure you use high quality sequence
• choose proper assembler
Recommendations for metagenomic assembly (Sanger)
• Use a trimmer (Lucy, etc.) to treat reads PRIOR to assembly
• None of the existing assemblers is designed for metagenomic data, but assemblers like PGA work better with paired-read information and produce better assemblies
• We are not using phrap for metagenomic projects
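To illustrate what trimming reads prior to assembly involves, here is a minimal sliding-window quality trimmer. It is not Lucy's algorithm; it is a simplified sketch that clips each end of a read until a window of bases reaches a mean quality threshold, and the `window` and `min_mean_q` values are assumptions.

```python
def trim_read(seq, quals, window=10, min_mean_q=20):
    """Clip low-quality ends; return (trimmed_seq, trimmed_quals)."""
    n = len(seq)

    def good(i):
        # True if the window starting at i has acceptable mean quality
        return sum(quals[i:i + window]) / window >= min_mean_q

    start = 0
    while start + window <= n and not good(start):
        start += 1
    if start + window > n:
        return "", []  # no acceptable window anywhere: discard the read
    end = n
    while end - window > start and not good(end - window):
        end -= 1
    return seq[start:end], quals[start:end]

# Made-up read: poor quality at both ends, good quality in the middle
seq = "ACGTACGTAC" * 3
quals = [5] * 5 + [30] * 20 + [5] * 5
trimmed_seq, trimmed_quals = trim_read(seq, quals)
print(len(seq), "->", len(trimmed_seq))
```

Because the test is on a window mean, a few low-quality bases can survive at each edge; stricter trimmers add a per-base cutoff on top of the window check.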
Finishing approach for metagenomes
Example: Candidatus Accumulibacter phosphatis (CAP)
• Binning: which DNA fragment is derived from which phylotype? (BLAST; GC%; read depth)
• CAP reads ~ 45% + Non-CAP reads
• Lucy/PGA
• Complete genome of Candidatus Accumulibacter phosphatis
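Two of the binning signals named above, GC% and read depth, are simple to compute per contig. The sketch below separates target contigs from the rest with fixed thresholds; the thresholds and the example contigs are hypothetical, and the BLAST-based evidence used in practice is not modelled here.

```python
def gc_percent(seq):
    """GC content as a percentage of the unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt

def bin_contigs(contigs, gc_range, min_depth):
    """Split contigs into (target, other) by GC% window and read depth.

    contigs maps a contig name to (sequence, mean read depth).
    """
    lo, hi = gc_range
    target, other = [], []
    for name, (seq, depth) in contigs.items():
        if lo <= gc_percent(seq) <= hi and depth >= min_depth:
            target.append(name)
        else:
            other.append(name)
    return target, other

# Made-up contigs: (sequence, mean read depth)
contigs = {
    "c1": ("GCGCGCATGC", 25.0),  # high GC, deep coverage
    "c2": ("ATATATGCAT", 30.0),  # low GC
    "c3": ("GCGGCCATGC", 3.0),   # high GC but shallow coverage
}
print(bin_contigs(contigs, gc_range=(60, 90), min_depth=10))
```

Reads belonging to the target bin can then be trimmed and reassembled on their own, which is the Lucy/PGA step on the slide.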
Metagenomic finishing: projects
Completed projects:
• Candidatus Korarchaeum cryptofilum OPF8 - the first of this apparently ancient hyperthermophilic phyletic group to be sequenced
• Desulforudis audaxviator - isolated from old water in fissures of a South African gold mine at a depth of 3000 meters. Finished with Sanger and 454
• Candidatus Accumulibacter phosphatis Type IIA (CAP) - from EBPR sludge community, US
In progress:
• Candidatus Endomicrobium trichonymphae - an intracellular symbiont of a flagellate protist, itself part of the hindgut community of a termite host. It is of interest in the pursuit of the efficient breakdown of cellulose and lignin necessary in the hoped-for conversion of bulk plant materials to CO2-neutral fuel