CUGI Pilot Sequencing/Assembly Projects Christopher Saski
Sequencing the Cacao Genome: 3 Megabases at a Time • Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome • IBM in silico assembly project – Testing the assembly pipeline
Sequencing the Cacao Genome: 3 Megabases at a Time • Combination of: • “Old School Genomics” • BAC libraries, physical mapping, and clone-by-clone sequencing • Roche 454 Titanium and FLX de novo sequencing • Key: • No eukaryotic genome has yet been accurately assembled with NGS alone • Reduce assembly complexity
3 Megabase segments Rounsley et al., 2009
Advantages • Reduce assembly complexity • Limit number of sequencing libraries • Prioritize critical genomic regions • Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer • Flexibility – Start slow with minimal investment • Could redesign strategy to reduce sequence runs
Strategy Components • Integrated Physical/Genetic framework • Pool development and sequencing: • BAC-end • Titanium 454 (paired/non-paired) • Draft sequence • Assembly and integration: • Newbler • Celera (CABOG)
Cacao Integrated Physical/Genetic Framework • Represents ~29X coverage (3 BAC libraries) • Assembled into a small number of large contigs • Suggests reasonable levels of heterozygosity • Manageable amounts of repetitive sequence • 220 anchored genetic markers spanning 10 linkage groups • Marker order resembles the recombination-derived order
Pool Development • Select contiguous BAC clones from the minimum tiling path (MTP) • Pools will contain 25-30 clones • 20-30 kb overlap • Complete cacao MTP will require 120-150 pools • Repetitive-type regions: • BAC-end sequence and physical map data serve as a predictive tool • Modify pools accordingly
Pool Development • Estimate contig size using the Consensus Band (CB) algorithm • Example: Cacao cp genome is 160,604 bp • Hybridization revealed a cp-containing contig, estimated at ~160 kb by the CB algorithm • Purified pool DNA can be produced at CUGI • Treat with ATP-dependent DNase
Sequencing • 3 Levels of Sequence: • Paired BAC-end Sequence – 20 kb increments • End sequencing of pool members • 454 sequencing of BAC pools • Paired 3.5X-5.1X coverage (Roche 454/FLX) • Non-paired 17X-26X coverage (Titanium)
454 Runs—Whole Genome • 454 Titanium non-paired – 26X coverage/pool • 4 pools per slide (up to 150 pools total) • Up to 38 slide runs • 454 FLX paired-end (3kb) – 5X coverage/pool • 16 pools per slide (up to 150 pools total) • Up to 10 slide runs total
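The slide-run counts above follow from dividing the pool total by the pools per slide and rounding up. A minimal sketch of that arithmetic, using the slide's upper bound of 150 pools; the function name `slide_runs` is hypothetical:

```python
import math


def slide_runs(total_pools: int, pools_per_slide: int) -> int:
    """Slide runs needed, rounding up so a partly filled slide still counts."""
    return math.ceil(total_pools / pools_per_slide)


# Figures taken from the slide: up to 150 pools in total.
titanium_runs = slide_runs(150, 4)   # Titanium non-paired, 4 pools per slide
flx_runs = slide_runs(150, 16)       # FLX paired-end, 16 pools per slide
print(titanium_runs, flx_runs)       # 38 10
```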
Assembly/Curation of 3Mbp Segment • Preprocessing • Filter reads to remove: • Paired-end reads that did not contain both ends • BAC vector • E. coli (host DNA) • Newbler Assembler (Roche) • Celera Assembler (CABOG) • Improvements in homopolymer calls and heterogeneous read length issues • Recently shown to give N50 contig sizes double those of Newbler • Human (50% repetitive) and microbes
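N50 contig size is the metric cited above for comparing the two assemblers. As a reminder of how it is computed, here is a minimal sketch; the function name `n50` and the example lengths are illustrative, not project data:

```python
def n50(lengths):
    """Return the contig length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Illustrative contig lengths only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```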
Assembly Curation of 3Mbp Segment • Assembly at various depths (5X, 10X, 15X) • Determine optimal sequencing coverage • Utilize available data to scaffold contigs: • BAC end sequences every 20kb • Genetic marker sequences • RNA-seq clusters • Arabidopsis – Cacao synteny • Draft Sequence (2X) • Augment approach by covering regions missed by clones – assist in selecting MTP
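Assembling at 5X, 10X, and 15X implies drawing read subsets of a known size from the full pool data. A rough sketch of that step, assuming a 400 bp mean read length (an assumption, not a figure from the slides); the helper names are hypothetical:

```python
import math
import random


def reads_for_coverage(region_bp, mean_read_bp, depth):
    """Reads needed so that depth = reads * mean_read_bp / region_bp."""
    return math.ceil(depth * region_bp / mean_read_bp)


def subsample(reads, n, seed=0):
    """Draw n reads without replacement (all reads if fewer are available)."""
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))


REGION_BP = 3_000_000   # 3 Mbp segment, from the slides
MEAN_READ_BP = 400      # assumed mean read length

for depth in (5, 10, 15):
    print(depth, reads_for_coverage(REGION_BP, MEAN_READ_BP, depth))
```

Fixing the seed keeps each depth's subset reproducible, so assemblies at different depths can be compared run to run.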
Assembly Curation of 3Mbp Segment • Deliverable will be a pseudomolecule sequence for the 3Mbp region • Gaps will be strings of N • Assess and employ lab-based gap filling strategies • Make every attempt to close gaps
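A pseudomolecule of this kind is ordered, oriented contigs joined by runs of N. The sketch below shows one way to build it; the function name, the 100 bp placeholder for gaps of unknown size, and the example sequences are assumptions for illustration:

```python
def build_pseudomolecule(contigs, gap_sizes, default_gap=100):
    """Join ordered, oriented contigs with runs of N.

    gap_sizes[i] is the estimated gap between contigs[i] and contigs[i + 1];
    None means the size is unknown and default_gap Ns are used instead.
    """
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each contig pair")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        n_count = default_gap if gap is None else max(gap, 1)
        parts.append("N" * n_count)
        parts.append(contig)
    return "".join(parts)


# Toy sequences only.
print(build_pseudomolecule(["ACGT", "GG"], [3]))  # ACGTNNNGG
```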
Assembly Validation and Correction • In-silico virtual digest of scaffold sequence and compare to physical map restriction fragments • Draft sequence integration (DSI) via FPC • Integrate and visualize physical map, 3 Mbp segments, and draft sequence
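The virtual digest can be sketched as cutting the scaffold at every recognition site and comparing the predicted fragment sizes with the bands in the physical map. The slides do not name the fingerprinting enzyme; HindIII (A^AGCTT) is used here purely as an example, and both function names and the toy sequence are hypothetical:

```python
def virtual_digest(sequence, site, cut_offset):
    """Return fragment lengths after cutting at every occurrence of site.

    cut_offset is the cut position measured from the start of the site.
    """
    seq = sequence.upper()
    site = site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def matched_fraction(predicted, observed, tolerance):
    """Fraction of predicted fragments with an observed band within tolerance bp."""
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted
               if any(abs(p - o) <= tolerance for o in observed))
    return hits / len(predicted)


# Toy scaffold; HindIII cuts after the first A of AAGCTT (assumed enzyme).
fragments = virtual_digest("CCAAGCTTGGGGAAGCTTC", "AAGCTT", 1)
print(fragments)  # [3, 10, 6]
```

A low matched fraction for a scaffold would flag a region to inspect for misassembly; this simple matcher is a stand-in and does not reproduce how FPC scores fingerprints.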
IBM in silico Sequences • IBM will provide a set of sequences that mimic the pilot cacao sequences • Input error • Indels, homopolymer calls, nucleotide substitutions • Simulated data to test pipeline: • Physical map • Simulated BAC end sequences • Simulated pseudo-reads from pooled BACs • EST clusters • Indicate reference species for syntenic comparisons
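To illustrate the three error classes listed above, here is a small sketch that injects substitutions, indels, and homopolymer miscalls into a read. It is not IBM's simulator; the function name, the rates, and the error model are assumptions chosen for clarity:

```python
import random


def add_errors(read, sub_rate, indel_rate, homopolymer_rate, seed=None):
    """Return a copy of read with simulated sequencing errors.

    Homopolymer runs of 2+ bases may be over- or under-called by one base;
    each remaining base may then be substituted, deleted, or followed by
    an inserted random base.
    """
    rng = random.Random(seed)
    out = []
    i, n = 0, len(read)
    while i < n:
        base = read[i]
        j = i
        while j < n and read[j] == base:
            j += 1                          # extend over the homopolymer run
        run = j - i
        if run >= 2 and rng.random() < homopolymer_rate:
            run += rng.choice((-1, 1))      # miscall the run length by one
        for _ in range(run):
            r = rng.random()
            if r < sub_rate:
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif r < sub_rate + indel_rate:
                if rng.random() < 0.5:
                    continue                # deletion: drop this base
                out.append(base)
                out.append(rng.choice("ACGT"))  # insertion after this base
            else:
                out.append(base)
        i = j
    return "".join(out)


# Illustrative rates only.
noisy = add_errors("ACGTTTGACCA" * 5, 0.01, 0.01, 0.1, seed=42)
print(len(noisy))
```

Running the pipeline on reads with known injected errors makes it possible to check whether the assembler recovers the original sequence.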
Pilot Project Budget • BAC-end sequencing (30K BACs), 20Kb increments • $206,605.00 • Assembly/curation/validation of cacao 3Mbp • $16,720.00 • Assembly of IBM in-silico derived sequences • $15,400.00
ESTIMATED Budget – Whole Genome Assembly • Assembly, curation, validation of 130-150 3 Mbp segments • $147,620.00 • Automated structural/functional annotation • $8,800.00
Acknowledgements • USDA-ARS • Mars Inc. • Dr. Alex Feltus • Stephen Ficklin • Dr. Keith Murphy • Dr. Margaret Staton