
Accurate Assembly of Maize BACs

Accurate Assembly of Maize BACs. Patrick S. Schnable, Srinivas Aluru, Iowa State University. Motivation: the maize genome is more complex than previously sequenced genomes, with many high-copy, long, highly conserved repeats.


Presentation Transcript


  1. Accurate Assembly of Maize BACs • Patrick S. Schnable • Srinivas Aluru • Iowa State University

  2. Motivation • Maize genome is more complex than previously sequenced genomes • Many high-copy, long, highly conserved repeats • Genome contains many NIPs (Nearly Identical Paralogs, low-copy genes that are expressed and >98% identical; Emrich et al., 2007) (= CNPs and CNV) • Hence, assembling this genome presents new challenges • Are existing assembly programs up to the task?

  3. Evidence of Assembly Errors • Wash U noticed examples of collapse of repeats • ISU identified examples of NIP collapse

  4. Terms • SNP: single nucleotide polymorphism between alleles of a single gene • Paramorphism (PM): a single nucleotide substitution between paralogs • Nearly Identical Paralogs (NIPs): paralogous sequences with >99% identity • [Figure: example bases A, C, A, T, G, C; inbred lines B73 and Mo17]

  5. Paramorphisms Provide Evidence of NIPs

  6. Frequency of NIPs • Conservatively, ~1% of maize genes have NIPs (Emrich et al., 2007) • Inspection of assembled BACs reveals NIP clusters • But we also detect examples of “NIP collapse” • CNPs/CNV are associated with adaptive evolution in humans (Perry et al., Nat. Genetics, 2007)

  7. BAC Assembly, Example 1 • MAGI3.1 ID: MAGI_18749 (Emrich et al., 2007) • BAC ID: CH201-140C17 • GenBank: gi|146322123|gb|AC203431.1 (152,054 bp) • Paramorphic Sites: C/T (1,175), C/T (1,293), C/T (1,359) • [Figure labels: 589 bp; coordinates 55,984 and 56,572]

  8. BAC Assembly, Example 1: Site #1 • BAC ID: CH201-140C17 (GI: 146322123; GB: AC203431.1; 152,054 bp) • MAGI_18749 • Paramorphic Site #1: C/T (1,175), 2 C vs 2 T • [Figure label: “Consensus Base” at Paramorphic Site #1] • 2/7 assembled BACs known to contain NIPs exhibit evidence of NIP collapse (conservative)
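
The "2 C vs 2 T" observation on this slide can be illustrated with a small check over one column of aligned reads. This is only a sketch: the function name `candidate_paramorphism` and the threshold `min_support=2` are assumptions chosen for illustration, not values from the presentation.

```python
from collections import Counter

def candidate_paramorphism(column, min_support=2):
    """Return the two best-supported bases in an alignment column if the
    second one is seen in at least min_support reads, else None.
    (Hypothetical helper; the threshold is an assumption.)"""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    top = counts.most_common(2)
    if len(top) == 2 and top[1][1] >= min_support:
        return tuple(sorted(base for base, _ in top))
    return None

# Column with two reads showing C and two showing T, as on the slide.
print(candidate_paramorphism("CCTT"))
# A single disagreeing read looks more like a sequencing error.
print(candidate_paramorphism("CCCT"))
```

A column that passes this check is a place where a single "consensus base" may hide two collapsed paralogs, so it would deserve a closer look rather than automatic resolution.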

  9. Traditional Assembly • Sequence alignments between reads are identified • Construct contigs • Start at a good alignment • Extend ends of contig one sequence at a time • Clone pair information is used to scaffold contigs after contig construction.
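
The greedy extension described on this slide can be sketched as follows. This is a toy under stated assumptions, not any particular assembler: overlaps are exact suffix/prefix matches, only the right end of the contig is extended, and the names `overlap`, `greedy_extend`, and `min_len` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if it is shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_extend(seed, reads, min_len=3):
    """Extend the right end of a contig one read at a time, always taking
    the read with the longest overlap. Clone pairs are never consulted."""
    contig, pool = seed, list(reads)
    while pool:
        best = max(pool, key=lambda r: overlap(contig, r, min_len))
        n = overlap(contig, best, min_len)
        if n == 0:
            break  # no remaining read overlaps the contig end
        contig += best[n:]
        pool.remove(best)
    return contig

print(greedy_extend("ACGTAC", ["TACGGA", "GGATTC"]))
```

Because each step looks only at the best local overlap, a read from another copy of a repeat can be chosen just as readily as the correct one, which is the weakness the following slides address.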

  10. Our Approach • Integrate clone pair data into the contig assembly process • Model sequence alignments and clone pairs as a graph • First, construct an alignment graph: sequence reads are nodes, and a black edge is drawn between a pair of nodes if there is a valid sequence alignment between them
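
A minimal sketch of the alignment graph, assuming that a "valid sequence alignment" can be stood in for by an exact suffix/prefix overlap of at least `min_len` bases. The real criterion is not given in the slides; `has_overlap` and `build_alignment_graph` are hypothetical names.

```python
from itertools import combinations

def has_overlap(a, b, min_len=4):
    """True if a suffix of one read exactly matches a prefix of the other.
    (Stand-in for a real alignment test; an assumption of this sketch.)"""
    longest = min(len(a), len(b))
    return any(a.endswith(b[:n]) or b.endswith(a[:n])
               for n in range(min_len, longest + 1))

def build_alignment_graph(reads, min_len=4):
    """reads: dict of read id -> sequence.
    Returns an adjacency dict whose edges are the 'black' alignment edges."""
    graph = {rid: set() for rid in reads}
    for u, v in combinations(reads, 2):
        if has_overlap(reads[u], reads[v], min_len):
            graph[u].add(v)
            graph[v].add(u)
    return graph

reads = {"r1": "ACGTACGG", "r2": "ACGGTTAA", "r3": "CCCCCCCC"}
print(build_alignment_graph(reads))
```

Keeping every read as a node, even when it has no edges, makes it straightforward to layer further edge types onto the same node set later.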

  11. Clone Pair Informed Assembly • Second, introduce two additional types of edges into the graph: clone pair edges (red) and path edges (green) • A path edge exists between two nodes if they are close together in the graph AND their clone pairs are also close together • Path edges identify assembly-relevant sequence alignments
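
The path-edge rule can be sketched as below. "Close together" is interpreted here as "within `max_hops` black edges", which is an assumption; the slides do not define the distance. The names `within`, `path_edges`, and `mate` are hypothetical.

```python
from collections import deque

def within(graph, src, dst, max_hops):
    """True if dst can be reached from src over at most max_hops black edges."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return True
        if dist < max_hops:
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return False

def path_edges(graph, mate, max_hops=2):
    """Green edges: pairs of reads that are close in the alignment graph
    AND whose clone-pair mates are also close. mate maps read -> its mate."""
    nodes, edges = sorted(graph), set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if u not in mate or v not in mate or mate[u] == v:
                continue
            if (within(graph, u, v, max_hops)
                    and within(graph, mate[u], mate[v], max_hops)):
                edges.add((u, v))
    return edges

# a1/a2 overlap and so do their mates b1/b2; x overlaps a1 (say, through a
# repeat) but its mate y lies elsewhere, so no green edge supports x.
black = {"a1": {"a2", "x"}, "a2": {"a1"}, "b1": {"b2"}, "b2": {"b1"},
         "x": {"a1"}, "y": set()}
mate = {"a1": "b1", "b1": "a1", "a2": "b2", "b2": "a2", "x": "y", "y": "x"}
print(sorted(path_edges(black, mate)))
```

In this toy input the alignment between `a1` and `x` stays in the graph as a black edge but gains no green edge, which is how a repeat-induced alignment would be marked as not assembly-relevant.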

  12. Repeat Example

  13. Our Approach • Series of graph transformations to ensure black edges (sequence alignments) represent correct genomic overlaps, and resolve entries into and exits out of repeats. • Use clone pairs to validate alignments in repeat regions if the corresponding mate pairs are anchored to unique regions and exhibit alignment. • Use paramorphisms to break spurious alignments due to NIPs. • Use clone pairs to match entries into and exits out of repeats. • Use clone pairs and validated alignments to guide contigs. • Use graph min-cuts to find correct assignment of reads to the complementary strands. • Use graph reductions and visualization for further analysis.
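
The strand-assignment step can be illustrated with a deliberately simpler stand-in for the min-cut formulation named above: propagate orientations by breadth-first search and report the alignments that contradict the assignment. Deciding which contradictory edges to discard is the part a min-cut would handle and is not attempted here. The edge format `(u, v, same)` and the name `assign_strands` are assumptions of this sketch.

```python
from collections import deque

def assign_strands(edges):
    """edges: list of (u, v, same); same=True means the alignment joins the
    two reads in the same orientation, False means reverse complement.
    Returns (strand per read, list of edges that conflict with it)."""
    adj = {}
    for u, v, same in edges:
        adj.setdefault(u, []).append((v, same))
        adj.setdefault(v, []).append((u, same))
    strand, conflicts = {}, []
    for start in sorted(adj):
        if start in strand:
            continue
        strand[start] = "+"  # arbitrary choice for each connected component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, same in adj[u]:
                flipped = "-" if strand[u] == "+" else "+"
                want = strand[u] if same else flipped
                if v not in strand:
                    strand[v] = want
                    queue.append(v)
                elif strand[v] != want and (v, u, same) not in conflicts:
                    conflicts.append((u, v, same))
    return strand, conflicts

# r2-r3 claims opposite strands, but r1 links both in the same orientation.
print(assign_strands([("r1", "r2", True), ("r2", "r3", False),
                      ("r1", "r3", True)]))
```

When the conflict list is empty the propagation alone gives a consistent assignment; a non-empty list signals alignments that need the more careful treatment described in the slide.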

  14. Example: Use Paramorphisms to Break Spurious Alignments • Three reads carry A at the paramorphic site: GTCT A CAG • Four reads carry C at the same site: GTCT C CAG
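
The example on this slide can be reproduced with a small pruning step. It assumes the reads are already placed in a common coordinate system so that the paramorphic site has the same index in every read, which is a simplification; `break_spurious_edges` and the read ids are hypothetical.

```python
def break_spurious_edges(graph, reads, site):
    """Drop alignment edges that join reads carrying different bases at the
    paramorphic site (index into each read; reads assumed pre-aligned)."""
    pruned = {u: set() for u in graph}
    for u, neighbours in graph.items():
        for v in neighbours:
            if reads[u][site] == reads[v][site]:
                pruned[u].add(v)
    return pruned

# Seven reads from the slide: three with A and four with C at index 4.
reads = {f"a{i}": "GTCTACAG" for i in range(1, 4)}
reads.update({f"c{i}": "GTCTCCAG" for i in range(1, 5)})
# Before pruning, every read aligns to every other read.
full = {u: {v for v in reads if v != u} for u in reads}
pruned = break_spurious_edges(full, reads, site=4)
print(sorted(pruned["a1"]), sorted(pruned["c1"]))
```

After pruning, the reads fall into two groups of three and four, one per paralog, instead of collapsing into a single contig.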

  15. Three Random “Stage 3” BACs • Shotgun sequences extracted from GenBank and trimmed

  16. 273D22 • Annotate paths by walking through the graph. • Make use of three levels of pointers: • Black edges: show what steps are available • Green edges: indicate the best path • Red edges: indicate our final destination

  17. 273D22: Incorrect Contigging • Contig 1 is a small contig in the finished BAC that contains sequences that should be attached to the end of Contig 0. • [Figure labels: Contig 0, Contig 1]

  18. 273D22: Missing Scaffold

  19. 306N19: Mis-assembly • [Figure labels: Contig 0, Contig 3, Contig 4, Contig 5]

  20. 306N19: Complex Repeat

  21. D396H10: Missed Scaffolding • [Figure labels: Contig 5, Contig 6, Contig 8]

  22. D396H10: Missed Scaffolding • [Figure labels: Contig 2, Contig 3, Contig 7]

  23. Identifying Assembly Errors ???

  24. 273D22: Weak Link not Corroborated by Clone Pairs • [Figure label: Contig 3]

  25. Conclusions & Future Directions • Discovered misassembled regions in all three randomly chosen BACs • Conclusions supported by multiple lines of evidence (clone pair + overlap) • Mis-assemblies (e.g., repeat-induced “knots”; collapsed repeats & NIPs) and missed scaffolding • Benefits of our approach • Can provide better assemblies • Can navigate through repeats • Can correctly assemble NIPs • With further development, could output contigs and perform scaffolding in one step • Could provide refined finishing advice • Could include a community-accessible visualization of assembled BAC contigs and supporting data (confidence levels) • Longer term • Our assembly approach could be applied to whole genome assembly of maize and other complex genomes • Could incorporate paired next-generation sequencing data (e.g., 454, Solexa, SOLiD) • Needed research • Random collection of finished BACs (“truth”) • Develop algorithms for navigating paths through the graph • Accurately construct final contigs that contain multiple copies of repeats • Create BAC re-assembly pipeline (inform finishing efforts in future sequencing projects) • Scale approach to whole genome level
