
Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics


Presentation Transcript


  1. Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics DOE Joint Genome Institute, Walnut Creek, CA

  2. A typical microbial project (workflow diagram; labels): Sequencing; Base calling; Quality screening; Vector screening; Auto-assembly; Assembly; Contigs; Gap closure; FINISHING; Annotation; Public release

  3. Processing microbial projects (sequencing)
     • Sanger only (yesterday): 4x coverage in 3 kb + 4x in 8 kb + fosmids to 1x if possible. Total ~$50k for a 5 Mb genome draft.
     • Hybrid Sanger/pyrosequence/Solexa (today): 4x coverage 8 kb Sanger + 20x coverage 454 shotgun + 20x Solexa (quality improvement). Total ~$35k for a 5 Mb genome draft.
     • 454 + Solexa (tomorrow, starting this week): 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Solexa shotgun (quality improvement; gaps). Total ~$10k per 5 Mb genome draft.
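The coverage and cost figures on this slide reduce to simple arithmetic. The sketch below is only an illustration: the average read lengths are hypothetical placeholders (the slide does not give them), and the function names are made up for this example.

```python
import math

GENOME_SIZE_BP = 5_000_000  # the 5 Mb draft genome quoted on the slide

# Hypothetical average read lengths in bp; NOT stated in the slides.
ASSUMED_READ_LENGTH_BP = {"sanger": 700, "454": 250, "solexa": 36}


def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Number of reads required to reach a given average coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def cost_per_mb(total_cost_usd, genome_size_bp):
    """Draft cost normalised to one megabase of genome."""
    return total_cost_usd / (genome_size_bp / 1_000_000)


# Draft costs as quoted on the slide for the three strategies.
for label, cost in [("Sanger only", 50_000),
                    ("Hybrid", 35_000),
                    ("454 + Solexa", 10_000)]:
    print(f"{label}: ${cost_per_mb(cost, GENOME_SIZE_BP):,.0f} per Mb")

# Reads needed for the 20x 454 shotgun component, under the assumed read length.
print(reads_needed(20, GENOME_SIZE_BP, ASSUMED_READ_LENGTH_BP["454"]))
```

Changing the assumed read lengths changes the read counts proportionally; the per-Mb costs depend only on the totals quoted on the slide.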

  4. Assembly (assembler)
     • Sanger reads only (phrap, PGA, Arachne, etc.)
     • Hybrid Sanger/pyrosequence/Solexa (no special assemblers; use PGA and Arachne)
     • 454/Solexa (Newbler, PCAP): 454 reads only
     Diagram labels: 454 contig; 454 shreds; shotgun reads; PE reads; insert sizes 3 kb, 8 kb and 40 kb
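The "454 shreds" label refers to cutting a 454 contig into overlapping pseudo-reads so that an assembler built for Sanger data can take it as input. A minimal sketch of that idea follows; the shred length and overlap are assumed values, not parameters taken from the slides, and `shred_contig` is a hypothetical helper name.

```python
def shred_contig(contig, shred_len=1000, overlap=200):
    """Cut a contig into overlapping pieces ("shreds").

    Returns a list of (start_offset, sequence) tuples. The last shred
    may be shorter than shred_len. Default sizes are assumptions.
    """
    if overlap >= shred_len:
        raise ValueError("overlap must be smaller than shred_len")
    if not contig:
        return []
    step = shred_len - overlap
    shreds = []
    start = 0
    while True:
        shreds.append((start, contig[start:start + shred_len]))
        if start + shred_len >= len(contig):
            break  # this shred already reaches the end of the contig
        start += step
    return shreds


# Tiny example with toy sizes so the result is easy to inspect.
for offset, piece in shred_contig("ACGTACGTAC", shred_len=4, overlap=2):
    print(offset, piece)
```

Real pipelines also attach quality values to each shred; that step is left out here.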

  5. Role of Solexa data: "The Polisher"
     • Align Solexa reads
     • Identify errors
     • Automatically suggest corrections for manual curation
     • Automatically suggest and implement corrections
     Diagram labels: "List Disc": x1 – G, x2 – T, x3 – A, etc.
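The polishing steps listed on this slide (align, identify errors, suggest corrections) can be sketched as a per-column majority vote. This is an illustrative simplification, not the JGI tool: it assumes reads are already aligned without gaps, handles substitutions only (no indels or frame shifts), and the thresholds `min_depth` and `min_fraction` are invented for the example.

```python
from collections import Counter


def suggest_corrections(consensus, aligned_reads, min_depth=3, min_fraction=0.8):
    """Suggest single-base corrections from gap-free aligned short reads.

    aligned_reads: iterable of (offset, sequence) pairs, where offset is the
    0-based consensus position of the first base of the read.
    Returns a list of (position, consensus_base, suggested_base, depth).
    """
    columns = [Counter() for _ in consensus]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            # Ignore bases outside the consensus and ambiguous calls such as N.
            if 0 <= pos < len(consensus) and base in "ACGT":
                columns[pos][base] += 1

    suggestions = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence to suggest anything
        base, n = counts.most_common(1)[0]
        if base != consensus[pos] and n / depth >= min_fraction:
            suggestions.append((pos, consensus[pos], base, depth))
    return suggestions


reads = [(0, "ACGATCG"), (1, "CGATCGA"), (2, "GATCGA"), (0, "ACGAT")]
print(suggest_corrections("ACGTTCGA", reads))
```

The returned list corresponds to the "suggest corrections for manual curation" step; applying the suggestions automatically would be a separate decision.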

  6. Errors corrected by Solexa
     Frame shift detected (454 contig)
     Figure labels: Finished consensus; 454 contig; Sanger reads
     Aligned sequences (column offsets not preserved):
     CCTCTTTGATGGAAATGATA**TCTTCGAGCATCGCCTC**GGGTTTTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCAAA
     CCTCTTTGATGGAAATAATA**TATTCGAGCATC
     TTAGTGGAAATGATA**TCTTCGAGCATCGCCTC
     CGAGCNTCGCCTC**GGGCTTTCCCT
     CGAGCATCGCCTC**GGGTTCTCCATACACAGA
     GCATCGCCTC**GGGTTTTCAATACAGAGAACCT
     CAGCGCCTC**GGGTTTTCCATACAGAGAACCTT
     ATCGCCTC**GGGTTTTCCAGACAGAGAACCTTT
     GGTTC**GGGTTTTCCATACAGAGAACCTTTGAT
     GTTTTCCATACAGAGAACATTTGATGATGAAC
     GTTGTCCATACAGAGAACTTTTGATGATGAAC
     TATANCATACAGAGAACCTTTGATGATGAACC
     ATTTCCAGACAGAGAACCNTTGATGATGAACC
     CAAACAGAGAACCTTTGAGGATGAACCGGTTG
     ACAGGGAACCTTAGATGATGAACCGGTTGAAG
     ACAGAGAACCTTAGATGATGAACCGGTTGAAG
     ACCGTTGATGATGAACCGGTTGAAGATCTGCG
     GATGGTGAACGGGTTGAAGATCTGCGGGTCAA
     GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC
     GGTGGAAGATCTGCGGGTAAAACCAGTCCTCT
     GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG
     TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC
     GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT
     TCTGCGGGTCAAACCAGTACTCTGCCTCGTTC

  7. What we get
     • Assembly: unordered set of contigs (10, 16, 21)
     • Ordered sets of contigs (scaffolds): 10, 21, 16
     • Clone walk (Sanger lib)
     • PCR product (primers pri1, pri2); PCR, then sequence
     • New technologies: no clones to walk off

  8. Why do we have gaps?
     What are gaps (Sanger)? Genome areas not covered by random shotgun.
     • Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly.
     • Assembly of the shotgun reads may produce misassembled regions due to repetitive sequences.
     • A biased base content can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence:
       ~ AT-rich DNA clones poorly in bacteria (cloning bias; promoter-like structures) => uncaptured gaps
       ~ GC-rich DNA is difficult to PCR and to sequence and often requires the use of special chemistry => captured gaps
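The first cause of gaps, incomplete random coverage, can be estimated with the standard Lander-Waterman (Poisson) model. This is a sketch under idealised assumptions that the slides do not state: reads land uniformly at random, and the minimum overlap needed to join two reads is ignored. Cloning bias and GC effects, the other causes listed above, make real gap counts higher than this estimate.

```python
import math


def coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage c = N * L / G."""
    return n_reads * read_length_bp / genome_size_bp


def fraction_uncovered(c):
    """Expected fraction of the genome hit by no read: exp(-c)."""
    return math.exp(-c)


def expected_gaps(n_reads, c):
    """Expected number of gaps between contigs: N * exp(-c)."""
    return n_reads * math.exp(-c)


# Hypothetical example: 700 bp reads on a 5 Mb genome at 8x coverage.
genome = 5_000_000
read_len = 700  # assumed read length, not taken from the slides
n = round(8 * genome / read_len)
c = coverage(n, read_len, genome)
print(f"coverage {c:.2f}x, uncovered bases ~{fraction_uncovered(c) * genome:.0f}, "
      f"gaps ~{expected_gaps(n, c):.1f}")
```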

  9. Low GC project and 454: Thermotoga lettingae TMO (JGI ID 4002278)
     • Draft assembly: 55 total contigs; 41 contigs >2 kb; 38% GC; biased Sanger libraries
     • Draft assembly + 454: 2 total contigs; 1 contig >2 kb; 454 involves no cloning
     • 6,810 bases covered by 454 only, out of 2,170,737 bp; average gap length 166 bp

  10. High GC stops (Sanger and Hybrid)
      Small hairpins (inverted repeat sequences) in the DNA can re-anneal either during sequencing or during electrophoresis, resulting in failed sequencing reactions or unreadable electrophoresis results. (This can be mitigated by adding modifiers to the reaction, sequencing smaller clones, and running gels at higher temperatures in the presence of stronger denaturants.)
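A hairpin of the kind described here is an inverted repeat: a short stem followed, after a small loop, by its own reverse complement. The sketch below scans a sequence for exact-match stems. The stem and loop sizes are illustrative assumptions, and it ignores mismatches and thermodynamic stability, which a real predictor would consider.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_hairpins(seq, stem=6, min_loop=3, max_loop=10):
    """Find exact inverted repeats that could fold into a hairpin.

    Returns a list of (stem_start, loop_length, stem_sequence) tuples.
    Stem and loop sizes are assumed defaults for illustration only.
    """
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the candidate right arm
            right = seq[j:j + stem]
            if len(right) < stem:
                break                    # ran past the end of the sequence
            if right == target:
                hits.append((i, loop, left))
    return hits


# Toy sequence: stem GATTAC, 4-base loop, then the reverse complement GTAATC.
print(find_hairpins("TTGATTACAAAAGTAATCTT", stem=6, min_loop=3, max_loop=8))
```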

  11. High GC project and 454: Xylanimonas cellulosilytica DSM 15894 (3.8 Mb; 72.1% GC)
      • PGA assembly: 9x of 8 kb
      • PGA assembly: 9x of 8 kb + 454

  12. What is Finishing?
      The process of taking a rough draft assembly composed of shotgun sequencing reads and identifying and resolving misassemblies, sequence gaps and regions of low quality to produce a highly accurate finished DNA sequence.
      Current standards:
      • All low-quality areas in the consensus (<Q30) should be reviewed and re-sequenced.
      • No single-clone coverage, i.e. a minimum of 2x depth everywhere.
      • Final error rate should be less than 1 per 50 kb.
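The standards above are stated in Phred terms, where a quality Q corresponds to an error probability of 10^(-Q/10). The sketch below converts between the two and flags consensus positions that fail the Q30 or 2x depth rules. The function names and the flat per-position input lists are assumptions made for this example.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred quality value: 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality value for an error probability."""
    return -10 * math.log10(p)


def positions_needing_review(quals, depths, min_q=30, min_depth=2):
    """0-based consensus positions below Q30 or with less than 2x depth."""
    return [i for i, (q, d) in enumerate(zip(quals, depths))
            if q < min_q or d < min_depth]


print(phred_to_error(30))                     # error probability at Q30
print(round(error_to_phred(1 / 50_000), 1))   # Phred equivalent of 1 error per 50 kb
print(positions_needing_review([40, 25, 50, 45], [3, 3, 1, 2]))
```

One error per 50 kb corresponds to an average consensus quality of roughly Q47, well above the Q30 per-base review threshold.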

  13. Genome closure issues
      • Resolve repeats and mis-assemblies
        - Repeats within or in vicinity of other repeats
        - Large repetitive regions
        - Complex repetitive regions (tandems)
      • Fill in gaps
        - DNA region lethal to E. coli (Sanger libraries)
        - Hairpins, GC rich, hard stops or other 2° structure/physical premature termination
        - Hard to PCR (new technologies)
      • Other issues
        - Homopolymeric tracts and other polymorphisms (SNPs, VNTRs, indels)

  14. JGI Microbial Finishing
      Currently: >250 individual microbes
      "I am all for finished genomes! It will serve us best in the long run. Unfinished ones are likely to contribute to some chaos." (Prof. Sallie W. Chisholm, MIT)

  15. Metagenomic assembly
      The whole-genome shotgun sequencing approach has been used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.
      • The size of a metagenomic sequencing project is typically very large.
      • Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members.
      • Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies.
      • Chimeric contigs are produced by co-assembly of sequencing reads originating from different species.
      • Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.
      • No assemblers have been developed for metagenomic data sets.

  16. QC: Annotation of poor quality sequence
      To avoid this: make sure you use high-quality sequence; choose a proper assembler.

  17. Recommendations for metagenomic assembly
      • Use a trimmer (Lucy, etc.) to treat reads PRIOR to assembly.
      • Do not use PHRAP for metagenomic projects.
      • None of the existing assemblers is designed for metagenomic data, but assemblers like PGA work better with paired-read information and produce better assemblies.
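To illustrate what trimming reads prior to assembly involves, the sketch below keeps the contiguous stretch of a read that maximises the sum of (Q - threshold), a generic maximum-sum-segment approach. It is not a reimplementation of Lucy, which also handles vector and other criteria; the threshold of Q20 and the function name are assumptions.

```python
def trim_by_quality(seq, quals, min_q=20):
    """Keep the contiguous region that maximises sum(Q - min_q).

    Returns (trimmed_sequence, trimmed_qualities). A read with no base
    above the threshold is trimmed to an empty sequence.
    """
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for i, q in enumerate(quals):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the segment here.
            cur_sum, cur_start = 0, i
        cur_sum += q - min_q
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return seq[best_start:best_end], quals[best_start:best_end]


read = "ACGTACGTAC"
quals = [5, 8, 30, 35, 40, 38, 36, 10, 6, 4]
print(trim_by_quality(read, quals))
```

Low-quality ends are removed while an isolated low-quality base inside a good region can survive, which is usually the desired behaviour before assembly.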

  18. Metagenomic finishing: projects
      Completed projects:
      • Candidatus Korarchaeum cryptofilum OPF8: the first of this apparently ancient hyperthermophilic phyletic group to be sequenced
      • Desulforudis audaxviator: isolated from old water in fissures of a South African gold mine at a depth of 3000 meters. Finished with Sanger and 454
      • Candidatus Accumulibacter phosphatis Type IIA (CAP): from EBPR sludge community, US
      In progress:
      • Candidatus Endomicrobium trichonymphae: an intracellular symbiont of a flagellate protist, itself part of the hindgut community of a termite host. It is of interest in the pursuit of the efficient breakdown of cellulose and lignin necessary in the hoped-for conversion of bulk plant materials to CO2-neutral fuel

  19. Metagenomic finishing: approach, Candidatus Accumulibacter phosphatis (CAP)
      • CAP reads + non-CAP reads
      • Binning: which DNA fragment derived from which phylotype? (BLAST; GC%; read depth)
      • Lucy/PGA
      • Complete genome of Candidatus Accumulibacter phosphatis ~ 45%
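Two of the binning signals named on this slide, GC% and read depth, are easy to sketch; the BLAST step is not modelled here. The thresholds and contig data below are invented for illustration and are not values from the CAP project.

```python
def gc_percent(seq):
    """GC content as a percentage of unambiguous (ACGT) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def bin_contigs(contigs, gc_range, depth_range):
    """Split contigs into (target, other) by GC% and mean read depth.

    contigs: dict mapping name -> (sequence, mean_read_depth).
    gc_range, depth_range: inclusive (low, high) tuples; hypothetical cut-offs.
    """
    target, other = [], []
    for name, (seq, depth) in contigs.items():
        in_gc = gc_range[0] <= gc_percent(seq) <= gc_range[1]
        in_depth = depth_range[0] <= depth <= depth_range[1]
        (target if in_gc and in_depth else other).append(name)
    return target, other


# Made-up contigs: only c1 matches both the GC and the depth window.
contigs = {
    "c1": ("GGCCGGCCAT", 30.0),
    "c2": ("ATATATATGC", 28.0),
    "c3": ("GGCCGGCCGG", 3.0),
}
print(bin_contigs(contigs, gc_range=(60, 90), depth_range=(20, 40)))
```

In practice the cut-offs would be chosen from the observed GC and depth distributions of the community, and ambiguous contigs checked by similarity search.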

  20. The end
