Large Plant Genome Assemblies using Phusion2 Zemin Ning The Wellcome Trust Sanger Institute
Phusion2 Assembly Pipeline
NGS Data: Pair End Reads 170-800bp; Mate Pair Reads 2k-40k
Assembly:
  Filtering: Unikalow, Fermi
  Clustering: Phusion2
  Contig Generation, Contig Merge: ABySS, SOAPdenovo
  Scaffolding: Spinner
  Consensus Bases: Smalt & Gap5
iCAS – an Illumina Clone Assembly System ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2
Data filtering using Unikalow Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/
Assembly Method
Sequencing reads:
1. Overlap graph
2. de Bruijn graph
3. String graph
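To make the second of the graph types listed above concrete, here is a minimal Python sketch of building a de Bruijn graph from reads. It is only an illustration: the function name, the toy reads and the choice of k are assumptions, and it does not reflect how Phusion2, ABySS or SOAPdenovo implement their graphs (there is no error correction, reverse-complement handling or graph simplification).

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read; each k-mer is one edge
        # from its (k-1)-base prefix to its (k-1)-base suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads, used only for illustration.
toy_reads = ["ACGTAC", "GTACGA"]
graph = de_bruijn_graph(toy_reads, k=4)
for node in sorted(graph):
    print(node, "->", graph[node])
```

A node with more than one distinct successor (here ACG) marks a branch that an assembler has to resolve; repeated edges (here GTA to TAC, seen in both reads) indicate k-mer multiplicity.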
Scaffold Merge: ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/
Figure labels: Ref, Base, Sup
Contig Merge
Figure labels: Ref, Base, Ctg
Can we really trust Single Molecule Sequencing?
Figure labels: PacBio, Capillary, Illumina
Clone Assemblies vs Assemblers
5 BAC clones and 3 fosmids
Clone coverage: 99.7%; Base quality: Q39
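Assuming the base quality above is a standard Phred score (the slide does not say so explicitly), the per-base error probability follows from Q = -10 * log10(p). A small Python sketch of the conversion, with a hypothetical function name:

```python
def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


p = phred_to_error(39)
print(f"Q39 error probability: {p:.2e}")
print(f"About one error per {1 / p:,.0f} bases")
```

Under that assumption, Q39 corresponds to roughly one error in every eight thousand bases.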
Spinner – a scaffolding tool
ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
Spinner uses mate pair data to scaffold contigs. Contigs, and the pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each pair of linked contigs.
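The gap estimate described above can be sketched as follows. This is an assumed, simplified model and not Spinner's actual code: each mate pair linking contig A to contig B contributes one estimate (expected insert size minus the bases the pair covers on each contig), and the median is taken over all linking pairs. The function name and the input layout are hypothetical.

```python
from statistics import median


def estimate_gap(insert_size, pairs):
    """Estimate the gap between two linked contigs from mate pairs.

    pairs: list of (dist_a, dist_b) where
      dist_a = bases from the outer end of the read on contig A to the end of A,
      dist_b = bases from the start of contig B to the outer end of the mate.
    """
    # Whatever part of the insert is not covered by either contig is the gap.
    estimates = [insert_size - dist_a - dist_b for dist_a, dist_b in pairs]
    return median(estimates)


# Three hypothetical pairs from a 3 kb mate pair library.
gap = estimate_gap(3000, [(1200, 1500), (1000, 1750), (1400, 1250)])
print("Estimated gap:", gap, "bp")
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap. The median is used here only because it is robust to a few mis-mapped pairs.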
Spinner – walks through a loop
These techniques alone produce useful results. Further stages will resolve repeats using pairs that “jump over” repeats, together with graph flow concepts.
Spinner vs SSPACE
_______________________________________________________________________
                                SSPACE                SPINNER
                 Genome size    N50        Average    N50        Average
_______________________________________________________________________
Assemblathon 1   119 Mb         608 Kb     86.8 Kb    11 Mb      450 Kb
Grass Carp (F)   900 Mb         2.3 Mb     14.4       5.85 Mb    17.1 Kb
Grass Carp (M)   1000 Mb        0.34 Mb    11.2 Kb    2.27 Mb    8.2 Kb
Bamboo           2.0 Gb         322 Kb     7404       488 Kb     7689
Parrot           1.23 Gb        906 Kb     4675       1.32 Mb    6969
_______________________________________________________________________
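For readers unfamiliar with the two statistics in the table, here is a minimal Python sketch of how N50 and average length are usually computed from a list of scaffold lengths. The lengths are made up for illustration, and the sketch uses the common definition of N50 (the length at which the running total of the sorted lengths first reaches half of all bases); the slides do not state which variant was used.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def average_length(lengths):
    """Plain arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)


# Hypothetical scaffold lengths in bp.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print("N50:", n50(scaffolds))
print("Average:", average_length(scaffolds))
```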
Bamboo Genome: Size Estimation
Gs = (Kn - Ks)/D = 1.97 x 10^9
Kn = 80.5 x 10^9: Total number of kmer words
Ks = 9.5 x 10^9: Number of single copy kmer words
D = 36: Depth of kmer occurrence
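The estimate above is simple arithmetic and can be checked with a few lines of Python. The function and argument names are illustrative only; the numbers are the bamboo values given on the slide.

```python
def estimate_genome_size(total_kmers, single_copy_kmers, depth):
    """Gs = (Kn - Ks) / D, as given on the slide."""
    return (total_kmers - single_copy_kmers) / depth


bamboo = estimate_genome_size(
    total_kmers=80.5e9,       # Kn
    single_copy_kmers=9.5e9,  # Ks
    depth=36,                 # D
)
print(f"Bamboo genome size: {bamboo / 1e9:.2f} Gb")  # prints 1.97 Gb
```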
Bamboo Genome Assembly
Solexa reads:
Number of read pairs: 877 Million; Finished genome size: 2.0 Gb;
Read length: 2x100bp; Estimated read coverage: ~90X;
Insert size: 500/50-600 bp; Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
Number of reads clustered: 757 Million
Assembly features - stats:
                             Contigs      Scaffolds
Total number of contigs:     744,286      277,278
Total bases of contigs:      1.86 Gb      2.05 Gb
N50 contig size:             11,622       328,698
Largest contig:              188,163      4,869,017
Averaged contig size:        2,500        7,400
Contig coverage on genome:   ~90%         >95%
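As a sanity check on the read coverage quoted above, coverage can be approximated as total sequenced bases divided by genome size. The short Python sketch below uses the slide's figures; the function name is hypothetical, and the calculation ignores filtering and unequal read lengths.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Approximate fold coverage: two reads per pair, each of read_length bases."""
    return read_pairs * 2 * read_length / genome_size


coverage = read_coverage(read_pairs=877e6, read_length=100, genome_size=2.0e9)
print(f"Raw read coverage: {coverage:.1f}X")
```

This gives a little under 88X, consistent with the ~90X stated on the slide.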
Bamboo Genome Assembly QC using Finished BACs
Sequencing of D Genome: Libraries & Insert Sizes
Library            Insert size (bp)
WHEjyyDADDBAAPE    167
WHEjjzDADDCBAPE    199
WHEjjzDADDCCAPE    223
WHEjjzDADDCABPE    230
WHEjyyDAEDDAAPE    250
WHEjyyDAEDDABPE    250
WHEjyyDAEDDBAPE    250
WHEjyyDAEDDBBPE    250
WHEjyyDAEDDCAPE    250
WHEjyyDAEDDCBPE    250
WHEjyyDAEDDDAPE    250
WHEjjzDADDCACPE    254
WHEjyyDAEDIAAPE    500
WHEjyyDAEDIBAPE    500
WHEjyyDADDIAAPE    502
WHEjyyDADDIDAPE    510
WHEjyyDADDICAPE    527
WHEjyyDADDIBAPE    532
WHEjyyDADDIBBPE    551
WHEjyyDADDKAAPE    682
WHEjyyDADDMBAPE    706
WHEjyyDADDKCAPE    725
WHEjyyDADDMAAPE    764
WHEjyyDAADWAAPE    2000
WHEjyyDAADWBAPE    2000
WHEjyyDAADWCAPE    2000
WHEjyyDAADWDAPE    2000
WHEjyyDACDWAAPE    2002
WHEjyyDAEDWAAPE    2008
WHEjyyDACDWBBPE    2500
WHEjyyDAADLAAPE    5000
WHEjyyDAADLBAPE    5000
WHEjyyDAADLBBPE    5000
WHEjyyDAEDLAAPE    5004
WHEjjzDADLBBPE     8300
WHEjyyDAADTAAPE    10000
WHEjyyDABDTAAPE    10000
WHEjyyDADDTAAPE    10000
WHEjyyDADDTBBPE    10000
WHEjyyDAIDUAAPE    20000
D Genome: Size Estimation
Gs = (Kn - Ks)/D = 4.2 x 10^9
Kn = 59.8 x 10^9: Total number of kmer words
Ks = 4.3 x 10^9: Number of single copy kmer words
D = 13: Depth of kmer occurrence
Wheat D Genome Assembly
Solexa reads:
Number of read pairs: 805 Million; Estimated genome size: 4.2 Gb;
Read length: 45-95bp; Estimated read coverage: ~40X;
Insert size: 167-800 bp; Mate pair data: 2k - 20k
Number of reads clustered: 558 Million
Assembly features - stats:
                             Contigs
Total number of contigs:     3,228,623
Total bases of contigs:      3.34 Gb
N50 contig size:             3,084
Largest contig:              86,064
Averaged contig size:        1,035
Contig coverage on genome:   ~80%
Grass carp (F & M): 55,277 / 130,221; 0.88 Gb / 0.97 Gb; 40,353 / 18,252; 5.89 Mb / 2.27 Mb
Miscanthus
Wild rice
Acknowledgements:
• Joe Henson
• German Tischler
• Andrew Whitwham
• Chinese Academy of Agricultural Sciences
• Jizeng Jia
• Guangyue Zhao
• National Gene Research Centre, Chinese Academy of Sciences
• Han Bin
• Hengyun Lu