
Assembly and Annotation of a 22 Gb Conifer Genome, Loblolly Pine

Assembly and Annotation of a 22 Gb Conifer Genome, Loblolly Pine. Jill Wegrzyn, Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Nick Wheeler, Jim Yorke, Aleksey Zimin, David Neale.


Presentation Transcript


  1. Assembly and Annotation of a 22 Gb Conifer Genome, Loblolly Pine. Jill Wegrzyn, Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Nick Wheeler, Jim Yorke, Aleksey Zimin, David Neale. Univ. of California, Davis; Children’s Hospital of Oakland Research Institute; Indiana Univ.; Washington State Univ.; Univ. of Maryland; and Johns Hopkins Univ.

  2. PineRefSeq Goal: To provide the benefits of conifer reference genome sequences to the research, management and policy communities. Specific Objectives: • Provide a high-quality reference genome sequence of loblolly pine, looking toward sugar pine and then Douglas-fir. • Provide a complete transcriptome resource for gene discovery, reference building, and aids to genome assembly. • Provide annotation, data integration, and data distribution through the Dendrome and TreeGenes databases.

  3. The Large, Complex Conifer Genomes Present a Challenge • Challenges • The estimated 22 Gigabase loblolly pine genome is about 7 times larger than the roughly 3 Gigabase human genome • Conifer genomes generally possess large gene families (duplicated and divergent copies of a gene) and abundant pseudogenes. • The vast majority of the genome appears to be repetitive DNA • Approaches to Resolving Challenges • Complementary sequencing strategies that seek to reduce complexity through use of actual or functional haploid genomes and reduced size of individual assemblies.

  4. Plant Genome Size Comparisons [Figure: bar chart of 1C DNA content (Mb) for Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pinus pinaster, Pinus taeda, Picea glauca, Picea abies, Pseudotsuga menziesii, Taxodium distichum, and Pinus lambertiana] Image Credit: Modified from Daniel Peterson, Mississippi State University

  5. Existing and Planned Angiosperm Tree Genomes

  6. Existing and Planned Gymnosperm Tree Genomes Genome size: Approximate total size, not completely assembled.

  7. Elements of the Conifer Genome Sequencing Project

  8. Acquiring the DNA • Haploid: megagametophyte tissue (1N), shotgun sequenced • Diploid: needle tissue (2N), cloned into 40 Kb fosmids, pooled and sequenced Figure Credit: Nicholas Wheeler, University of California, Davis

  9. Sequencing Strategy [Figure: coverage labels 65X and 12X]

  10. Technology for De Novo Sequencing of the Conifer Genomes: Parallel and Complementary Approaches • Max. output: 95 Gigabases; max. paired-end reads: 640 million • Max. output: 300 Gigabases; max. paired-end reads: 3 billion • Footnote 1: Effectively haploid

  11. Sequencing Strategy Today

  12. Megagametophyte Whole Genome Shotgun (M-WGS) • Not enough haploid DNA in a megagametophyte to implement a complete list of WGS ingredients. • Compromise: Obtain DNA for longer insert linking libraries (> 1 kbp) from diploid needle tissue. • Prepare only short insert Illumina libraries from megagametophyte tissue.

  13. M-WGS Short Insert Libraries: Preliminary QC and Size Selection • Each DNA sample is then run on an Agilent Bioanalyzer to determine a preliminary estimate of insert size and coefficient of variation. • If within spec, selected DNA samples are converted into Illumina libraries.

  14. M-WGS Short Insert Libraries: Library QC and Titration • Libraries are subsequently QCed on the Illumina MiSeq

  15. A k-mer Genome Size Estimate • How deep to sequence the libraries? • Experimentally: hybridization • Computationally (WGS): choose substrings of the reads of length k (k-mers) • P. taeda genome size ≅ total k-mers in the P. taeda genome • Total k-mers in the P. taeda genome ≅ (total k-mers in P. taeda reads) / (expected number of times a genomically unique k-mer is observed in the reads)

  16. k-mer Genome Size Estimates • Loblolly pine (Pinus taeda) • 31-mers total: 3.736 × 10^11; expected k-mer depth: 18.11; estimated genome size: 20.63 Gb • High-copy 31-mers: 1.09% of distinct 31-mers, 33% of all 31-mers • 24-mers total: 4.092 × 10^11; expected k-mer depth: 19.79; estimated genome size: 20.68 Gb • Sugar pine (Pinus lambertiana) • 31-mers total: 2.776 × 10^11; expected k-mer depth: 8.12; estimated genome size: 34.19 Gb • High-copy 31-mers: 0.35% of distinct 31-mers, 33% of all 31-mers • 24-mers total: 3.031 × 10^11; expected k-mer depth: 8.89; estimated genome size: 33.98 Gb
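The estimate on the two slides above reduces to a single division. The sketch below reproduces the loblolly pine figures from the reported totals and depths; the function name `estimate_genome_size` is illustrative and not taken from the project's software.

```python
def estimate_genome_size(total_kmers: float, kmer_depth: float) -> float:
    """Genome size in bases: total k-mers in the reads divided by the
    expected number of times a genomically unique k-mer is observed."""
    if kmer_depth <= 0:
        raise ValueError("kmer_depth must be positive")
    return total_kmers / kmer_depth


# Loblolly pine values as reported on the slide.
loblolly_31 = estimate_genome_size(3.736e11, 18.11)
loblolly_24 = estimate_genome_size(4.092e11, 19.79)

print(f"31-mers: {loblolly_31 / 1e9:.2f} Gb")
print(f"24-mers: {loblolly_24 / 1e9:.2f} Gb")
```

The two k-mer sizes agree to within about 0.05 Gb, which is the kind of consistency check this calculation is useful for.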

  17. truly large genomes

  18. P. taeda Version 0.9 Library Statistics • Haploid short insert libraries • 10 short insert libraries, 200-640 bp • 1.4 Tbp GA2x, HiSeq, MiSeq sequence • 65-fold coverage • Diploid jumping libraries • 47 jumping libraries, 1300-5500 bp • 280 Gbp GA2x sequence • 12-fold coverage • 13 fosmid DiTag libraries

  19. Elements of the Conifer Genome Sequencing Project

  20. 65X coverage in paired ends from a single seed • 1/3 in GAIIx, 160-bp overlapping pairs • 2/3 in HiSeq, 100-bp pairs • 1.7 billion reads from “jumping” libraries from pine needles, diploid DNA

  21. Collect jumping reads from same haplotype • 1.7 billion jumping reads (4 Kbp): keep only pairs where both reads match haploid DNA • 93 million Di-Tag reads (36 Kbp) • Filter: both reads had to be covered by 52-mers from megagametophyte data
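One way to read the filter on this slide is as a k-mer coverage test: a read passes if every base lies inside at least one k-mer that also occurs in the haploid megagametophyte reads, and a pair is kept only if both mates pass. The sketch below is a minimal in-memory illustration under that assumption. The function names are hypothetical, reverse complements are ignored, and a real run at k = 52 over billions of reads would need a disk-backed or probabilistic k-mer index rather than a Python set.

```python
from typing import Iterable, List, Set, Tuple


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer seen in the haploid (megagametophyte) reads."""
    kmer_set: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_set.add(read[i:i + k])
    return kmer_set


def is_covered(read: str, haploid_kmers: Set[str], k: int) -> bool:
    """True if every base of the read lies inside a k-mer found in the set."""
    covered_upto = 0  # number of leading bases known to be covered
    for i in range(len(read) - k + 1):
        if read[i:i + k] in haploid_kmers:
            if i > covered_upto:
                return False  # base at index covered_upto is not covered
            covered_upto = i + k
    return len(read) >= k and covered_upto == len(read)


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 haploid_kmers: Set[str], k: int) -> List[Tuple[str, str]]:
    """Keep only pairs in which both mates are covered by haploid k-mers."""
    return [(a, b) for a, b in pairs
            if is_covered(a, haploid_kmers, k) and is_covered(b, haploid_kmers, k)]


# Toy example with k = 4 (the slide uses k = 52).
haploid = build_kmer_set(["ACGTACGTAC"], 4)
kept = filter_pairs([("CGTACG", "GTACGT"), ("CGTACG", "CGTAGG")], haploid, 4)
print(kept)
```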

  22. How to get all these reads into a single assembly run? 16 billion paired reads

  23. Recent Assemblers for Illumina Data • MSR-CA (Aleksey Zimin, UMD): based on Celera Assembler; 454, Illumina, and Sanger reads • Allpaths-LG • SOAPdenovo • Velvet • ABySS • Contrail • SGA

  24. Two Classes of Assembly Algorithms • Overlap-Layout-Consensus (OLC) • Used by most assemblers for previous generation (Sanger) sequencing • Celera Assembler, PCAP, Phusion, Arachne, etc. • De Bruijn Graph • Used by most assemblers for Illumina data • SOAPdenovo, Allpaths-LG, Velvet, ABySS, etc. • Our MSR-CA assembler combines the benefits of both OLC and the De Bruijn graph

  25. Combine Benefits of OLC and De Bruijn Graph • Benefits of De Bruijn Graph • Computationally faster • Drawbacks of De Bruijn Graph • Errors in the reads create spurious branches in the graph, requiring error correction • Max. size of k-mer is limited by the shortest read size • All overlaps in the graph are exact overlaps of k-1 bases • Repeats longer than k bases cannot be resolved without space-consuming side information • Benefits of OLC • Can deal with variable length reads and reads from different sequencing platforms • Overlaps can be long and thus more reliable • Overlaps do not have to be exact • Can resolve repeats of up to read size • Drawbacks of OLC • Computationally intensive: the number of overlaps grows quickly with the number of reads and coverage
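Two of the De Bruijn drawbacks listed above, exact k-1 overlaps and spurious branches from read errors, can be seen in a toy graph. The sketch below is a generic textbook construction, not the MSR-CA implementation: nodes are (k-1)-mers and each k-mer in a read contributes one edge.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a De Bruijn graph: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph: Dict[str, Set[str]]) -> List[str]:
    """Nodes with more than one successor, where a path through the graph is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


clean_read = "ACGTTGCA"
error_read = "ACGTAGCA"  # same read with a single substitution (T -> A)

print(branching_nodes(de_bruijn([clean_read], 4)))
print(branching_nodes(de_bruijn([clean_read, error_read], 4)))
```

With only the clean read the graph is a simple path; adding the read with one substitution makes node `CGT` branch, which is why De Bruijn assemblers depend on error correction before graph construction.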

  26. Super-reads GOAL: Reduce the amount of input data without losing information • Consider a read CGACTGACCAGATGACCATGACAGATACATGGT [Figure: read ends labeled "stop" and "extend", with 10-mer counts GACTGACCAG (5), CGACTGACCA (3), ATACATGGTA (10), ATACATGGTC (2)] • Typically, Illumina sequencing projects generate data with high coverage (>50x). With 100 bp reads this implies that a new read starts on average at least every other base. [Figure: read R extended to super-read S (red); the other reads extend to S as well]
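The extension idea on this slide can be sketched as follows: starting from a read, append a base only while exactly one next base is supported by the k-mer counts, and stop when there is no continuation or more than one. This is a simplified illustration under that assumption, not the MaSuRCA code. The helper names are hypothetical, reverse complements and count thresholds are ignored, and a small guard stops the loop if a k-mer repeats.

```python
from collections import Counter
from typing import Iterable, Optional


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in the reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def unique_extension(context: str, counts: Counter, to_right: bool) -> Optional[str]:
    """Return the one base that extends the (k-1)-mer context, or None if zero or several do."""
    candidates = [
        base for base in "ACGT"
        if counts[(context + base) if to_right else (base + context)] > 0
    ]
    return candidates[0] if len(candidates) == 1 else None


def super_read(read: str, counts: Counter, k: int) -> str:
    """Extend a read in both directions for as long as the extension is unique."""
    seq = read
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    for to_right in (True, False):
        while True:
            context = seq[-(k - 1):] if to_right else seq[:k - 1]
            base = unique_extension(context, counts, to_right)
            if base is None:
                break  # dead end or ambiguous branch: stop extending
            kmer = context + base if to_right else base + context
            if kmer in seen:
                break  # guard against looping around a repeat
            seen.add(kmer)
            seq = seq + base if to_right else base + seq
    return seq


genome = "ATGCGTACCTGA"  # toy sequence, not real pine data
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
print(super_read("CGTACC", count_kmers(reads, 4), 4))
```

Because every read drawn from the same unambiguous stretch extends to the same super-read, duplicates collapse, which is where the data reduction comes from.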

  27. Super-Reads Compress the Data 16 billion paired reads 150 million super-reads • 100-fold compression • 50% of sequence is in super reads > 500 bp • Super-read total: 52 Gbp

  28. MaSuRCA assembler performance • 64-core computer with 1 Terabyte of RAM • Time/memory to assemble: • QuORUM error correction: 10 days / 800 GB • Super-reads construction plus filtering: 11 days / 400 GB • Contig and scaffold construction (uses CABOG assembler): 60+ days / 450 GB • Gap filling with super-reads: 8 days / 300 GB

  29. MSR-CA Output Contigs: contiguous sequences that do not appear to be repetitive (may contain internal repeats). These end up in scaffolds. Scaffolds: ordered and oriented collections of contigs, built using mate pair data. A scaffold can consist of just one contig (a "single-contig" scaffold). Degenerate contigs: contigs that appeared to be repeats according to the coverage statistics. Only placed in scaffolds when linked to contigs via mate pairs. Most of them will end up being placed in more than one location, but many will not appear in any scaffold.

  30. P. taeda WGS V0.6 (June 2012) • Approximately 35X coverage • 7 billion reads (50 million jumping library reads) • Compressed to 377 million super-reads • Total sequence: 18,321,727,393 bp • Total contig sequence: 14,606,783,345 bp • N50 1,199 bp (9.16 Gbp is contained in contigs of 1,199 bp or longer) • Total scaffold sequence (with imputed gaps): 18,428,460,141 bp • N50 1,230 bp (9.21 Gbp is contained in scaffolds of 1,230 bp or longer) • Degenerate contig sequence: 3.8 Gb
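The N50 values above follow the usual definition: the length L such that sequences of length L or longer contain at least half of a reference total. The 9.16 Gbp quoted for contigs is half of the 18.32 Gbp total sequence rather than half of the 14.6 Gbp contig total, so the sketch below accepts an optional reference total. The function and parameter names are illustrative.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], reference_total: Optional[int] = None) -> int:
    """Smallest length L such that sequences >= L hold at least half of the total.

    reference_total defaults to the sum of the given lengths; pass a genome or
    assembly size to compute the statistic against a different total.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered) if reference_total is None else reference_total
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("sequences do not reach half of the reference total")


contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy lengths, not pine data
print(n50(contig_lengths))
print(n50(contig_lengths, reference_total=60))
```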

  31. P. taeda WGS V0.8 (January 2013) • Approximately 65X coverage • 16 billion reads (1.7 billion jumping library reads) • Compressed to 150 million super-reads • Total sequence: 22,518,572,092 bp • N50 contig: 7,083 bp • N50 scaffold: 15,885 bp

  32. P. taeda WGS V0.9 (March 2013) • Total sequence: 20.1 Gbp • Total contig sequence: 2.3 Gbp • N50 8,200 bp (11.6 million) • Total scaffold sequence (with imputed gaps): 17.8 Gbp • N50 30,700 bp (4.8 million)

  33. Ongoing Efforts • Improve MSR-CA scaffolding • Transcriptome + WGS assembly • Fosmid pool sequencing and assembly • GBS to anchor and orient scaffolds • Sugar pine genome: 35 Gigabases!

  34. Elements of the Conifer Genome Sequencing Project

  35. Sequencing Strategy Molecular approach to complexity reduction End of summer 2013

  36. Fosmid Pooling: Genome partitioning for reduced assembly complexity • The immense and complex diploid pine genome can be economically and efficiently partitioned into smaller, functionally haploid pieces using pools of fosmid clones. • Fosmids in a pool should have a combined insert size far less than the haploid genome size, to ensure haploid genome representation. • The sequence data obtained from a single fosmid pool may be up to 80X deep. • The sequence data obtained from a pool must be screened for vector and E. coli contamination. • Ideally, larger clones (BACs) are more desirable because they are more likely to span repeats.
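A back-of-envelope calculation shows why a pool stays functionally haploid. It uses figures that appear elsewhere in this deck: 40 Kb inserts, a pool of roughly 600 fosmids, and a 22 Gb genome. The overlap estimate assumes clones are placed uniformly at random, so two clones of length L overlap with probability of about 2L/G. That assumption and the function names are illustrative, not the project's own model.

```python
from math import comb


def pool_fraction_of_genome(n_fosmids: int, insert_bp: float, genome_bp: float) -> float:
    """Combined insert size of a pool as a fraction of the haploid genome."""
    return n_fosmids * insert_bp / genome_bp


def expected_overlapping_pairs(n_fosmids: int, insert_bp: float, genome_bp: float) -> float:
    """Expected number of clone pairs that overlap, assuming uniform random placement."""
    return comb(n_fosmids, 2) * (2 * insert_bp / genome_bp)


fraction = pool_fraction_of_genome(600, 40_000, 22e9)
pairs = expected_overlapping_pairs(600, 40_000, 22e9)
print(f"pool covers {fraction:.2%} of the genome")
print(f"expected overlapping clone pairs: {pairs:.2f}")
```

Under these assumptions a 600-clone pool covers about a tenth of a percent of the genome and is expected to contain fewer than one overlapping pair, so almost every position in the pool comes from a single haplotype.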

  37. Fosmid Sequence Components • Haploid fosmids with vector-tagged ends • Primary coverage from short insert libraries • Additional coverage from long insert libraries from an equimolar pool of pools • Fosmid end sequences (diTags) link ends of the assembly and count fosmids in a pool

  38. Fosmid Pools: Determining the Best Assembler for the Job • Assembly results for a relatively large pool of approximately 600 P. taeda fosmids

  39. Use Cases for Fosmid Pools • Assembler Evaluation • Repeat Library Construction • SNP Identification

  40. Genomic Sequence: Pinus taeda BACs and Fosmids • Combined sequence resource represents roughly 1% of the estimated 22 Gb genome

  41. Similarity and De Novo Repeat Identification • Tandem Repeats Finder (TRF) • Homology (Censor against Repbase) • Summary of Repbase v17.07 • Number of entries: 28,155 • Number of species represented: 715 • Number of repeat families: 280 • Angiosperm entries: 131 • Gymnosperm entries (conifer): 15 • De Novo (REPET/TEannot) • Self-alignment (all vs. all) with BLAST to find HSPs is followed by clustering with Grouper, Recon, and Piler • 3 sets of clusters are aligned with an MSA (MAP) to derive a consensus sequence • Structural search runs simultaneously (LTRharvest) to detect highly diverged LTRs • Final Blastclust to cluster potential sequences

  42. Tandem Repeats • Comparison across sequenced angiosperms and other gymnosperms (partial) • Total tandem content: 2.6% (3.31% of BACs, 2.59% of fosmids)

  43. Homology Search Results • Censor (BLAST-style) comparisons against Repbase • Partial and Full-length Interspersed Alignments (compared across species) • Full-length Alignments Only • Full-Length Sequences • 80-80-80 Rule (Wicker et al. 2007) • 80 bp in length • 80% identity • 80% coverage
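The 80-80-80 rule on this slide is a three-way threshold test. The sketch below applies it to one alignment; the field names are hypothetical, and it assumes identity and coverage are given as percentages, with coverage measured against whichever sequence the analysis treats as the reference element.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Summary of one alignment against a repeat library (illustrative fields)."""
    aligned_length_bp: int
    percent_identity: float
    percent_coverage: float


def passes_80_80_80(hit: Alignment) -> bool:
    """True if the hit is at least 80 bp long, 80% identical, and covers 80%."""
    return (hit.aligned_length_bp >= 80
            and hit.percent_identity >= 80.0
            and hit.percent_coverage >= 80.0)


hits = [
    Alignment(aligned_length_bp=4200, percent_identity=86.5, percent_coverage=92.0),
    Alignment(aligned_length_bp=4200, percent_identity=71.0, percent_coverage=92.0),
    Alignment(aligned_length_bp=60, percent_identity=95.0, percent_coverage=100.0),
]
full_length = [hit for hit in hits if passes_80_80_80(hit)]
print(len(full_length))
```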

  44. Summary of Combined Homology and De Novo Approach • 88% repetitive (partial and full-length) • 29% repetitive (full-length only, defined by 80-80-80) • 87% of the full-length content is characterized as LTR retrotransposons • Repeats are highly diverged • Only 23% identified by homology for full and partial elements • Repbase contains just 15 (+5) gymnosperm elements • 6,270 novel families discovered with no homology • 5,155 are single copy • High copy elements are either Gypsy or Copia LTRs • Nested repeats common in LTR retrotransposons

  45. Novel Repeat Elements • Diverged LTRs are annotated as 6,270 novel families • Top 400 elements only cover 12% of the combined sequence sets

  46. Novel Repeat Elements • MSA with annotations of the novel Gypsy LTR - PtAppalachian • MSA with annotations of the novel Copia LTR - PtPineywoods

  47. Elements of the Conifer Genome Sequencing Project

  48. Loblolly transcriptome from 30 unique RNA collections • Carol Loopstra (RNA) and Keithanne Mockaitis (sequencing)

  49. Progressive Transcript Profiling • Build a useful transcriptome reference early in the project: • generate long reads for ease of assembly, scaffolding of existing shorter data • integrate community data into assemblies • Early Development: seeds, young seedlings • Reproductive Development: megastrobili, microstrobili • Early Stress Signaling Responses: cold, heat, elevated UV, compression • Vegetative Organs: vegetative buds, candles, stems, needles, roots

  50. Transcriptome Assembly • Considerable variation in de novo transcriptome assemblies • Used a compare-and-compete methodology to select the final transcripts • Two Trinity versions and Velvet/Oases (6 different k-mer sizes) • First analysis: basic clustering methods with 454 and other protein evidence to determine optimal full-length proteins
