Finishing tomato chromosomes #6 and #12 using a Next Generation whole genome shotgun approach

Roeland van Ham, CBSG, NL
René Klein Lankhorst, EUSOL
Giovanni Giuliano, ENEA, IT
Giorgio Valle, Univ. Padua, IT
Michiel van Eijk, Keygene, NL
Satoshi Tabata, Kazusa, JP

Status International project (2003-2008)

• Overall progress is slow
• Several chromosomes have large gaps
  • Efforts to identify novel seed BACs within several of these gaps have been unsuccessful
• Higher gene density in heterochromatin than expected
• Combined strategies using NGS technologies now enable de novo sequencing of large, complex genomes

1. Status of sequencing chr. 6 and 12
• Chr 6: 155 BACs sequenced (12.6 Mb non-redundant)
  • 66 seed, 89 extension BACs
  • 118 HTGS1, 37 HTGS3
  • 28 BAC contigs, 9 singletons
• Chr 12: 55 BACs sequenced (5.1 Mb non-redundant)
  • 34 seed, 31 extension BACs
  • 21 HTGS1, 11 HTGS2, 23 HTGS3
  • 14 BAC contigs, 20 singletons

2. Example: What is required to finish chr. 6?
• Sequence length vs. number of BACs:
  • 12.6 Mb: 155 BACs
  • 20.4 Mb: 250 BACs
  • 32.0 Mb: 381 BACs
• Estimated no. of gaps:
  • 26 small (< 4 BACs)
  • 13 large (4-15 BACs)
• Estimated no. of BACs to sequence: ~160 BACs, ~ +100, ~ +230
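The arithmetic behind these estimates can be checked with a short sketch. It uses only the numbers on the slide. Reading "< 4 BACs" as 1 to 3 BACs per small gap is an assumption, and so is the idea that "+100" and "+230" refer to the difference between the target BAC counts and the 155 BACs already sequenced.

```python
# Figures taken from the slide.
sequenced_bacs = 155
sequenced_mb = 12.6
targets = {20.4: 250, 32.0: 381}  # target length (Mb) -> total BACs

# Average non-redundant sequence contributed per BAC so far, in kb.
kb_per_bac = sequenced_mb * 1000 / sequenced_bacs

# Additional BACs needed to reach each target BAC count.
extra_bacs = {mb: bacs - sequenced_bacs for mb, bacs in targets.items()}

# Bounds on BACs needed to close the gaps.
# Assumption: a "small" gap takes 1-3 BACs, a "large" gap 4-15 BACs.
small_gaps, large_gaps = 26, 13
min_gap_bacs = small_gaps * 1 + large_gaps * 4
max_gap_bacs = small_gaps * 3 + large_gaps * 15

print(f"{kb_per_bac:.1f} kb of non-redundant sequence per BAC")
print(f"additional BACs per target: {extra_bacs}")
print(f"BACs to close gaps: between {min_gap_bacs} and {max_gap_bacs}")
```

Under these assumptions the differences (95 and 226 BACs) are close to the "+100" and "+230" on the slide, and the ~160 BAC figure lies within the computed gap-closing bounds.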

3. Options to finish chr. 6 and 12
• Continue classical sequencing by BAC walking
  • Time-consuming, expensive, no seed BACs anchored in large gaps
• Purify and shotgun sequence chr. 6 and 12
  • Combination of flow cytometry and chromosome amplification
  • ~ one year to develop technology
• Sequence chr. 6 and 12 by shotgun sequencing the whole genome
  • Exploit next generation sequencing, sequence 99.9% of genome

4a. The initiative
Together with our partners we will produce:
• A whole genome physical map based on 10X Genome Analyzer (Solexa) generated AFLP sequence tags on BACs
• A 20X genome coverage in 454 reads using the upcoming Titanium upgrade
  • Read length ~400, ~500 Mb per run
  • Use a combination of shotgun and paired-end runs (short and long-jump inserts, 3 and ~20 kb)
• A 30X genome coverage in SOLiD reads
  • Reads ~30 bp, paired-ends ~3 kb
• ~3 million Sanger reads from Selected BAC Mixture (SBM data, Kazusa)
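To give a feel for the scale of these targets, the sketch below converts coverage into runs and reads. The genome size is not stated on the slides; 950 Mb is an assumed value for tomato, used here only for illustration. The per-run yield and read length are the slide's approximate figures.

```python
import math

# Assumption (not on the slides): tomato genome size of roughly 950 Mb.
GENOME_MB = 950

# 454 Titanium: 20X coverage at ~500 Mb per run (slide figures).
coverage_454 = 20
mb_per_454_run = 500
runs_454 = math.ceil(coverage_454 * GENOME_MB / mb_per_454_run)

# SOLiD: 30X coverage with ~30 bp reads (slide figures).
coverage_solid = 30
solid_read_bp = 30
solid_reads = coverage_solid * GENOME_MB * 10**6 // solid_read_bp

print(f"454 runs needed for {coverage_454}X: ~{runs_454}")
print(f"SOLiD reads needed for {coverage_solid}X: ~{solid_reads:,}")
```

With a different genome size estimate the totals scale linearly, so only `GENOME_MB` needs to change.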

4b. The initiative
We will assemble this data together with all currently available data:
• 300,000 BAC ends (120,000 pairs)
• 180,000 fosmid ends (90,000 pairs)
• ~30% euchromatic sequence (66 Mb)
Anchor the contigs to the new physical map using AFLP sequence tags

5. The challenge: assemble the genome
• Use 66 Mb of available sequence to benchmark the procedure
• Approximately the strategy used for the Vitis genome:
  • Match all reads against all reads at 100% identity
  • Cluster reads and divide them into repeat and low copy (unique) clusters
  • Assemble the low copy clusters separately
  • Merge assembled clusters, lowering stringency step-wise
  • Use BAC-end, fosmid-end and SOLiD/454 paired ends to scaffold and build supercontigs
  • Anchor clusters / supercontigs to the novel physical map (KeyGene)
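The clustering step can be illustrated with a toy sketch. It is not the pipeline used for the Vitis or tomato assemblies. It assumes that two reads belong together when they share at least one exact k-mer, a simplified stand-in for the all-vs-all 100% identity match, and it ignores reverse complements. The function names, `k`, and the size threshold are hypothetical.

```python
from collections import defaultdict

def cluster_reads(reads, k=8):
    """Group reads sharing at least one exact k-mer (forward strand only)."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            if kmer in first_seen:
                root_a, root_b = find(first_seen[kmer]), find(idx)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_seen[kmer] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())

def split_by_copy_number(clusters, max_low_copy_size):
    """Clusters larger than the threshold are treated as repeats."""
    low_copy = [c for c in clusters if len(c) <= max_low_copy_size]
    repeats = [c for c in clusters if len(c) > max_low_copy_size]
    return low_copy, repeats

reads = [
    "ACGTACGTAA", "CGTACGTAAT",   # two overlapping low copy reads
    "TTTTGGGGCC",                 # an isolated read
    "GAGAGAGAGA", "GAGAGAGAGA",   # an over-represented (repeat-like) sequence
    "GAGAGAGAGA", "GAGAGAGAGA",
]
low_copy, repeats = split_by_copy_number(cluster_reads(reads), max_low_copy_size=2)
print("low copy clusters:", low_copy)
print("repeat clusters:", repeats)
```

In practice the threshold would be derived from the expected read depth, so that clusters far deeper than the sequencing coverage are set aside as repeats before the low copy clusters are assembled.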

6. Funding
• 10X Solexa BAC-based physical map, KeyGene/BSP: Secured
  • Data production Q1 & Q2 2009
• 15X SOLiD coverage, The Netherlands: Secured
  • Data production started in October 2008
• 10X 454 coverage, The Netherlands: Application (CBSG 2012)
  • Data production expected to start December 2008
• 15X SOLiD coverage, Italy: Secured
  • Data production expected to start November 2008
• 10X 454 coverage, Italy: Secured
  • Data production expected to start November 2008
• SBM data set (Kazusa): Data available

7. Data release
• The data will consist of an assembly of next gen data, with contigs anchored as far as possible to the new physical map
• All data will be released to the SOL Consortium for the purpose of finishing the Heinz 1706 genome
• Data release within the Consortium will follow the newly proposed international standards: “ENCODE Consortia data release policy” (draft 11/09/2008). In a nutshell:
  • Data will be released by the data producers as soon as possible after verification of the data
  • During a moratorium period of 9 months, users may not publish the data without the consent of the data producers
  • If consent is given, proper reference to the data producers should be made
  • After the moratorium period expires, data users may publish the data only with proper reference to the data producers

8. Time line (estimate)
• Production of SOLiD and 454 data: October 2008 – April 2009
• Production of the physical map: Jan – July 2009
• Assembly of all data sets: May 2009 – August 2009
• Release of assembly to SOL Consortium: September 2009

9. Invitation
• Other SOL members are welcome to join the “seed consortium” for Next Gen Tomato Sequencing, provided that:
  • Significant novel expertise and/or data sets are brought in (sequence coverage, assembly resources, etc.)
  • Their own funding is secured
  • The time line can be adhered to
  • The data release policy is subscribed to
