Large Plant Genome Assemblies using Phusion2

Zemin Ning
The Wellcome Trust Sanger Institute

NGS Data Assembly: Phusion2 Assembly Pipeline
[Figure: pipeline diagram. Inputs: pair end reads (170-800 bp) and mate pair reads (2k-40k). Stages shown: filtering (Unikalow, Fermi); clustering (Phusion2); contig generation (ABySS, SOAPdenovo); contig merge; consensus bases (Smalt & Gap5); scaffolding (Spinner).]

iCAS – an Illumina Clone Assembly System ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2

Data filtering using Unikalow
Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/

Assembly Method
Sequencing reads:
1. Overlap graph
2. de Bruijn graph
3. String graph
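As an illustration of the second approach listed above, the following is a minimal sketch of how a de Bruijn graph can be built from reads. It is not taken from Phusion2 or any assembler named in these slides; the function name `de_bruijn_edges`, the toy reads, and the choice of k = 3 are assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: each k-mer adds an edge from its
    (k-1)-base prefix to its (k-1)-base suffix.

    Hypothetical helper for illustration only.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two toy reads that overlap by four bases.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_edges(reads, k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Edges seen in both reads appear twice in the successor lists, which is one simple way to keep k-mer multiplicity. An overlap graph or string graph would instead use whole reads as nodes.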

Scaffold Merge: ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/
[Figure: scaffold merge, with tracks labelled Ref, Base, Sup; contig merge, with tracks labelled Ref, Base, Ctg.]

Contig Consensus using Gap5

Can we really trust Single Molecule Sequencing?
[Figure: panels labelled PacBio, Capillary, Illumina.]

Clone Assemblies vs Assemblers
5 BAC clones and 3 fosmids
Clone coverage: 99.7%; Base quality: Q39

Spinner – a scaffolding tool
ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each contig.
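The gap estimate described above can be sketched as follows. This is not Spinner's implementation. It assumes a simple geometry in which each mate pair spans the gap between two neighbouring contigs, so the gap is the expected insert size minus the bases the pair covers on each contig. The function `estimate_gap`, the link tuples, and the use of a median are all assumptions for illustration.

```python
from statistics import median


def estimate_gap(insert_size, links):
    """Estimate the gap between two contigs from mate pairs that link them.

    links: one (left_bases, right_bases) tuple per mate pair, where
    left_bases is the distance from the first read's start to the end of
    the left contig and right_bases is the distance from the start of the
    right contig to the end of the second read (hypothetical layout).
    """
    # Each pair gives its own estimate; the median damps outliers.
    estimates = [insert_size - left - right for left, right in links]
    return median(estimates)


# Three pairs from a hypothetical 3 kb mate pair library.
links = [(1200, 800), (900, 1000), (1500, 450)]
gap = estimate_gap(3000, links)
print(f"estimated gap: {gap} bp")
```

A negative result would suggest that the two contigs overlap rather than being separated by a gap.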

Spinner – walks through a loop
These techniques alone produce useful results. Further stages will resolve repeats using pairs that “jump over” repeats and graph flow concepts.

Spinner vs SSPACE
_________________________________________________________________________
                                 SSPACE                SPINNER
Genome           Genome size     N50       Average     N50       Average
_________________________________________________________________________
Assemblathon 1   119 Mb          608 Kb    86.8 Kb     11 Mb     450 Kb
Grass Carp (F)   900 Mb          2.3 Mb    14.4        5.85 Mb   17.1 Kb
Grass Carp (M)   1000 Mb         0.34 Mb   11.2 Kb     2.27 Mb   8.2 Kb
Bamboo           2.0 Gb          322 Kb    7404        488 Kb    7689
Parrot           1.23 Gb         906 Kb    4675        1.32 Mb   6969
_________________________________________________________________________
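For readers comparing the two columns above, here is a minimal sketch of how N50 and average length are usually computed from a list of scaffold lengths. N50 is taken in its usual sense: the length such that scaffolds of that length or longer hold at least half of the assembled bases. The function name and the toy lengths are assumptions for the example, not data from the slides.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy scaffold lengths in bases.
lengths = [100, 80, 60, 40, 20]
scaffold_n50 = n50(lengths)
average = sum(lengths) / len(lengths)
print(scaffold_n50, average)
```

Because N50 is weighted by length, a few very long scaffolds can raise it sharply while the average barely moves, which is why the table reports both.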

Grass Phylogeny

Bamboo Genome: Size Estimation
$G_s = (K_n - K_s)/D = 1.97 \times 10^9$
$K_n = 80.5 \times 10^9$: total number of kmer words
$K_s = 9.5 \times 10^9$: number of single-copy kmer words
$D = 36$: depth of kmer occurrence
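The arithmetic behind the estimate above can be checked with a few lines of Python. The function name `estimate_genome_size` is an assumption; the formula and the three input values are the ones given on the slide.

```python
def estimate_genome_size(total_kmers, single_copy_kmers, kmer_depth):
    """Gs = (Kn - Ks) / D, as given on the slide."""
    return (total_kmers - single_copy_kmers) / kmer_depth


# Bamboo values from the slide: Kn = 80.5e9, Ks = 9.5e9, D = 36.
bamboo_size = estimate_genome_size(80.5e9, 9.5e9, 36)
print(f"estimated genome size: {bamboo_size / 1e9:.2f} Gb")
```

The result is 71e9 / 36, or about 1.97 billion bases, which agrees with the figure quoted on the slide.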

Bamboo Genome Assembly
Solexa reads:
  Number of read pairs: 877 million
  Finished genome size: 2.0 Gb
  Read length: 2x100 bp
  Estimated read coverage: ~90X
  Insert size: 500/50-600 bp
  Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
  Number of reads clustered: 757 million
Assembly features (stats):
                               Contigs     Scaffolds
  Total number of contigs:     744,286     277,278
  Total bases of contigs:      1.86 Gb     2.05 Gb
  N50 contig size:             11,622      328,698
  Largest contig:              188,163     4,869,017
  Average contig size:         2,500       7,400
  Contig coverage on genome:   ~90%        >95%
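The read coverage quoted above follows from the read counts and the genome size. A minimal sketch of that calculation, assuming every pair contributes two full-length reads; the function name `read_coverage` is an assumption for the example.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Sequenced bases divided by genome size, assuming two reads per pair."""
    sequenced_bases = read_pairs * 2 * read_length
    return sequenced_bases / genome_size


# Bamboo values from the slide: 877 million pairs, 2x100 bp, 2.0 Gb genome.
bamboo_coverage = read_coverage(877e6, 100, 2.0e9)
print(f"read coverage: {bamboo_coverage:.1f}X")
```

This gives about 87.7X, consistent with the ~90X stated on the slide.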

Bamboo Genome Assembly QC using Finished BACs

Evolution of the Wheat Genome

Size of the Wheat Genome: 17Gb

International Wheat Genome Sequencing Consortium

Sequencing of D Genome: Libraries & Insert Sizes

Library            Insert size (bp)
WHEjyyDADDBAAPE    167
WHEjjzDADDCBAPE    199
WHEjjzDADDCCAPE    223
WHEjjzDADDCABPE    230
WHEjyyDAEDDAAPE    250
WHEjyyDAEDDABPE    250
WHEjyyDAEDDBAPE    250
WHEjyyDAEDDBBPE    250
WHEjyyDAEDDCAPE    250
WHEjyyDAEDDCBPE    250
WHEjyyDAEDDDAPE    250
WHEjjzDADDCACPE    254
WHEjyyDAEDIAAPE    500
WHEjyyDAEDIBAPE    500
WHEjyyDADDIAAPE    502
WHEjyyDADDIDAPE    510
WHEjyyDADDICAPE    527
WHEjyyDADDIBAPE    532
WHEjyyDADDIBBPE    551
WHEjyyDADDKAAPE    682
WHEjyyDADDMBAPE    706
WHEjyyDADDKCAPE    725
WHEjyyDADDMAAPE    764
WHEjyyDAADWAAPE    2000
WHEjyyDAADWBAPE    2000
WHEjyyDAADWCAPE    2000
WHEjyyDAADWDAPE    2000
WHEjyyDACDWAAPE    2002
WHEjyyDAEDWAAPE    2008
WHEjyyDACDWBBPE    2500
WHEjyyDAADLAAPE    5000
WHEjyyDAADLBAPE    5000
WHEjyyDAADLBBPE    5000
WHEjyyDAEDLAAPE    5004
WHEjjzDADLBBPE     8300
WHEjyyDAADTAAPE    10000
WHEjyyDABDTAAPE    10000
WHEjyyDADDTAAPE    10000
WHEjyyDADDTBBPE    10000
WHEjyyDAIDUAAPE    20000

D Genome: Size Estimation
$G_s = (K_n - K_s)/D = 4.2 \times 10^9$
$K_n = 59.8 \times 10^9$: total number of kmer words
$K_s = 4.3 \times 10^9$: number of single-copy kmer words
$D = 13$: depth of kmer occurrence

Wheat D Genome Assembly
Solexa reads:
  Number of read pairs: 805 million
  Estimated genome size: 4.2 Gb
  Read length: 45-95 bp
  Estimated read coverage: ~40X
  Insert size: 167-800 bp
  Mate pair data: 2k - 20k
  Number of reads clustered: 558 million
Assembly features (stats):
                               Contigs
  Total number of contigs:     3,228,623
  Total bases of contigs:      3.34 Gb
  N50 contig size:             3,084
  Largest contig:              86,064
  Average contig size:         1,035
  Contig coverage on genome:   ~80%

Grass carp (F & M)
  55,277      130,221
  0.88 Gb     0.97 Gb
  40,353      18,252
  5.89 Mb     2.27 Mb
Miscanthus
Wild rice

Acknowledgements:
• Joe Henson
• German Tischler
• Andrew Whitwham
• Chinese Academy of Agricultural Sciences
• Jizeng Jia
• Guangyue Zhao
• National Gene Research Centre, Chinese Academy of Sciences
• Han Bin
• Hengyun Lu
