
CUGI Pilot Sequencing/Assembly Projects

Christopher Saski. Sequencing the Cacao Genome: 3 Megabases at a Time. Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome. IBM in silico assembly project: testing the assembly pipeline.


Presentation Transcript


  1. CUGI Pilot Sequencing/Assembly Projects Christopher Saski

  2. Sequencing the Cacao Genome: 3 Megabases at a Time • Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome • IBM in silico assembly project – Testing the assembly pipeline

  3. Sequencing the Cacao Genome: 3 Megabases at a Time • Combination of: • “Old School Genomics” • BAC libraries, physical mapping, and clone-by-clone sequencing • Roche 454 Titanium and FLX de novo sequencing • Key: • No eukaryotic genome has yet been accurately assembled with NGS alone • Reduce assembly complexity

  4. 3 Megabase segments (Rounsley et al., 2009)

  5. Advantages • Reduce assembly complexity • Limit number of sequencing libraries • Prioritize critical genomic regions • Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer • Flexibility – Start slow with minimal investment • Could redesign strategy to reduce sequence runs

  6. Strategy Components • Integrated Physical/Genetic framework • Pool development and sequencing: • BAC-end • Titanium 454 (paired/non-paired) • Draft sequence • Assembly and integration: • Newbler • Celera (CABOG)

  7. Cacao Integrated Physical/Genetic Framework • Represents ~29X coverage (3 BAC libraries) • Assembled into a small number of large contigs • Suggests reasonable levels of heterozygosity • Manageable amounts of repetitive sequence • 220 anchored genetic markers spanning 10 linkage groups • Marker order resembles the recombination-derived order

  8. Pool Development • Select contiguous BAC clones from the MTP • Pools will contain 25-30 clones • 20-30 kb overlap • Complete cacao MTP will require 120-150 pools • Repetitive-type regions: • BAC-end sequence and physical map data serve as a predictive tool • Modify pools accordingly
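The pool sizes above can be sanity-checked with a small sketch. It estimates the span tiled by one pool of overlapping clones. The function name `pool_span_kb` and the 130 kb mean insert size are assumptions made for illustration; the slide gives only the clone count and the overlap.

```python
def pool_span_kb(n_clones, insert_kb, overlap_kb):
    """Span (kb) tiled by n_clones that each overlap the next clone by overlap_kb."""
    if n_clones < 1:
        raise ValueError("a pool needs at least one clone")
    # Every clone after the first adds its insert minus the shared overlap.
    return n_clones * insert_kb - (n_clones - 1) * overlap_kb


ASSUMED_INSERT_KB = 130  # hypothetical mean BAC insert size, not from the talk

# The two ends of the ranges given on the slide.
small_pool = pool_span_kb(25, ASSUMED_INSERT_KB, 20)
large_pool = pool_span_kb(30, ASSUMED_INSERT_KB, 30)
print(small_pool, large_pool)
```

Under that assumed insert size, both ends of the range land close to the 3 Mbp segment size named in the talk's title.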

  9. Pool Development • Estimate contig size using the Consensus Band (CB) algorithm • Example: Cacao cp genome is 160,604 bp • Hybridization revealed a cp-containing contig, estimated to be ~160 kb based on the CB algorithm • Purified pool DNA can be produced at CUGI • Treat with ATP-dependent DNase

  10. Sequencing • 3 Levels of Sequence: • Paired BAC-end Sequence – 20 kb increments • End sequencing of pool members • 454 sequencing of BAC pools • Paired 3.5X-5.1X coverage (Roche 454/FLX) • Non-paired 17X-26X coverage (Titanium)
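To give a feel for what these coverage targets mean in reads, here is a minimal sketch using the usual relation coverage = (number of reads × mean read length) / region size. The 3 Mbp region comes from the talk; the 400 bp mean read length and the name `reads_needed` are assumptions for illustration only.

```python
import math


def reads_needed(coverage, region_bp, mean_read_len):
    """Reads required so that reads * mean_read_len / region_bp >= coverage."""
    return math.ceil(coverage * region_bp / mean_read_len)


REGION_BP = 3_000_000        # one 3 Mbp pool
ASSUMED_READ_LEN = 400       # hypothetical mean read length, not from the talk

# Non-paired coverage range quoted on the slide.
for depth in (17, 26):
    print(depth, reads_needed(depth, REGION_BP, ASSUMED_READ_LEN))
```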

  11. 454 Runs—Whole Genome • 454 Titanium non-paired – 26X coverage/pool • 4 pools per slide (up to 150 pools total) • Up to 38 slide runs • 454 FLX paired-end (3kb) – 5X coverage/pool • 16 pools per slide (up to 150 pools total) • Up to 10 slide runs total
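The slide-run counts above follow from dividing the number of pools by the pools that fit on one slide and rounding up. A short sketch of that arithmetic (the function name `slide_runs` is illustrative):

```python
import math


def slide_runs(total_pools, pools_per_slide):
    """Number of slide runs needed; a partly filled slide still counts as a run."""
    return math.ceil(total_pools / pools_per_slide)


print(slide_runs(150, 4))   # Titanium non-paired, 4 pools per slide
print(slide_runs(150, 16))  # FLX paired-end, 16 pools per slide
```

With the upper bound of 150 pools this reproduces the 38 and 10 slide runs quoted on the slide.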

  12. Assembly/Curation of 3Mbp Segment • Preprocessing • Filter reads to remove: • Paired-end reads that did not contain both ends • BAC vector • E. coli (host DNA) • Newbler Assembler (Roche) • Celera Assembler (CABOG) • Improvements in homopolymer calls and heterogeneous read length issues • Recently shown to give N50 contig sizes double those of Newbler • Human (50% repetitive) and microbes
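Since the assemblers are compared by N50 contig size, a minimal sketch of that statistic may help: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembled bases. The contig lengths below are made up for illustration.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer form of running >= total / 2
            return length


# Hypothetical contig lengths (bp) from two assemblies of the same region.
assembly_a = [100, 50, 25, 25]
assembly_b = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(assembly_a), n50(assembly_b))
```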

  13. Assembly Curation of 3Mbp Segment • Assembly at various depths (5X, 10X, 15X) • Determine optimal sequencing coverage • Utilize available data to scaffold contigs: • BAC end sequences every 20kb • Genetic marker sequences • RNA-seq clusters • Arabidopsis – Cacao synteny • Draft Sequence (2X) • Augment approach by covering regions missed by clones – assist in selecting MTP
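One way to prepare input for assembly at several depths is to randomly subsample the reads down to each target coverage. The sketch below is an assumption about how that could be done, not the pipeline used in the project; `subsample_to_depth` and its parameters are hypothetical names.

```python
import random


def subsample_to_depth(reads, region_bp, target_depth, seed=0):
    """Randomly pick reads until their total length reaches target_depth * region_bp."""
    rng = random.Random(seed)      # fixed seed keeps the subsample reproducible
    shuffled = list(reads)
    rng.shuffle(shuffled)
    target_bases = target_depth * region_bp
    picked, bases = [], 0
    for read in shuffled:
        if bases >= target_bases:
            break
        picked.append(read)
        bases += len(read)
    return picked


# Toy data: 1000 reads of 100 bp over a 1 kb region is 100X in total.
toy_reads = ["A" * 100] * 1000
for depth in (5, 10, 15):
    subset = subsample_to_depth(toy_reads, region_bp=1000, target_depth=depth)
    print(depth, len(subset))
```

Each subset would then be assembled separately and the results compared to choose a sequencing coverage.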

  14. Assembly Curation of 3Mbp Segment • Deliverable will be a pseudomolecule sequence for the 3Mbp region • Gaps will be strings of N • Assess and employ lab-based gap filling strategies • Make every attempt to close gaps
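A pseudomolecule with gaps written as strings of N can be sketched as follows. The contigs are assumed to be already ordered and oriented; the gap length is a free parameter here because the slide does not specify one, and both function names are illustrative.

```python
def build_pseudomolecule(contigs, gap_len=100):
    """Join ordered, oriented contigs with runs of N marking each gap."""
    return ("N" * gap_len).join(contigs)


def gap_intervals(contigs, gap_len=100):
    """0-based, half-open (start, end) coordinates of each N run in the pseudomolecule."""
    intervals = []
    pos = 0
    for contig in contigs[:-1]:
        pos += len(contig)
        intervals.append((pos, pos + gap_len))
        pos += gap_len
    return intervals


toy_contigs = ["ACGT", "GG", "T"]   # made-up contigs
print(build_pseudomolecule(toy_contigs, gap_len=3))
print(gap_intervals(toy_contigs, gap_len=3))
```

Recording the gap coordinates alongside the sequence makes it easier to target them later with lab-based gap filling.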

  15. Assembly Validation and Correction • In-silico virtual digest of scaffold sequence and compare to physical map restriction fragments • Draft sequence integration (DSI) via FPC • Integrate and visualize physical map, 3 Mbp segments, and draft sequence
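The virtual digest step can be illustrated with a small sketch that cuts a scaffold sequence at every occurrence of a recognition site and reports the fragment lengths. HindIII (site AAGCTT, cutting after the first A) is used as an example enzyme; the slide does not say which enzyme the physical map was built with, so treat that choice as an assumption.

```python
def virtual_digest(seq, site="AAGCTT", cut_offset=1):
    """Fragment lengths from cutting seq at every occurrence of site.

    cut_offset is the cut position within the site, counted from its first base.
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Made-up scaffold with two HindIII sites.
toy_scaffold = "CCCC" + "AAGCTT" + "GGGGG" + "AAGCTT" + "TT"
print(virtual_digest(toy_scaffold))
```

The predicted fragment sizes could then be compared, within a size tolerance, against the restriction fragments recorded for the same clones in the physical map.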

  16. Sequence/Assembly Pipeline

  17. IBM in silico Sequences • IBM will provide a set of sequences that mimic the pilot cacao sequences • Input error • Indels, homopolymer calls, nucleotide substitutions • Simulated data to test pipeline: • Physical map • Simulated BAC end sequences • Simulated pseudo-reads from pooled BACs • EST clusters • Indicate reference species for syntenic comparisons
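As a rough illustration of how errors might be introduced into simulated reads, the sketch below applies substitutions, insertions, and deletions at fixed per-base rates. It is not IBM's simulator: the function name, the rates, and the uniform error model are all assumptions, and homopolymer-length miscalls are not modeled separately.

```python
import random

BASES = "ACGT"


def add_errors(read, sub_rate, ins_rate, del_rate, rng):
    """Return a copy of read with random substitutions, insertions, and deletions."""
    out = []
    for base in read:
        r = rng.random()
        if r < del_rate:
            continue                       # deletion: drop this base
        if r < del_rate + sub_rate:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # insertion after this base
    return "".join(out)


rng = random.Random(42)                    # seeded for reproducibility
reference = "ACGTTTGACCA" * 5              # made-up reference fragment
noisy = add_errors(reference, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, rng=rng)
print(len(reference), len(noisy))
```

Running the assembly pipeline on reads with known, injected errors makes it possible to check how well the true sequence is recovered.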

  18. Pilot Project Budget • BAC-end sequencing (30K BACs), 20Kb increments • $206,605.00 • Assembly/curation/validation of cacao 3Mbp • $16,720.00 • Assembly of IBM in-silico derived sequences • $15,400.00

  19. ESTIMATED Budget – Whole Genome Assembly • Assembly, curation, validation of 130-150, 3Mbp segments • $147,620.00 • Automated structural/functional annotation • $8,800.00

  20. Acknowledgements • USDA-ARS • Mars Inc. • Dr. Alex Feltus • Stephen Ficklin • Dr. Keith Murphy • Dr. Margaret Staton
