Explore a comprehensive approach incorporating clone pair data for precise assembly of complex maize genomes, overcoming challenges like NIPs. Discover misassembly correction strategies and potential refinements for enhanced genomic analysis.
Accurate Assembly of Maize BACs • Patrick S. Schnable and Srinivas Aluru, Iowa State University
Motivation • Maize genome is more complex than previously sequenced genomes • Many high-copy, long, highly conserved repeats • Genome contains many NIPs (Nearly Identical Paralogs, low-copy genes that are expressed and >98% identical; Emrich et al., 2007) (= CNPs and CNV) • Hence, assembling this genome presents new challenges • Are existing assembly programs up to the task?
Evidence of Assembly Errors • Wash U noticed examples of repeat collapse • ISU identified examples of NIP collapse
Terms • SNP: single nucleotide polymorphism between alleles of a single gene • Paramorphism (PM): a single nucleotide substitution between paralogs • Nearly Identical Paralogs (NIPs): paralogous sequences with >98% identity [Figure: bases A, C, G, T compared between B73 and Mo17]
Frequency of NIPs • Conservatively, ~1% of maize genes have NIPs (Emrich et al., 2007) • Inspection of assembled BACs reveals NIP clusters • In addition, we also detect examples of “NIP collapse” • CNPs/CNV are associated with adaptive evolution in humans (Perry et al., Nat. Genetics, 2007)
BAC Assembly, Example 1 • MAGI3.1 ID: MAGI_18749 (Emrich et al., 2007) • BAC ID: CH201-140C17 (GenBank gi|146322123|gb|AC203431.1; 152,054 bp) • Paramorphic Sites: C/T (1,175), C/T (1,293), C/T (1,359) [Figure: 589 bp segment with BAC coordinates 55,984 and 56,572]
BAC Assembly, Example 1: Paramorphic Site #1 • BAC ID: CH201-140C17 (GI: 146322123; GB: AC203431.1; 152,054 bp) • MAGI_18749 • Paramorphic Site #1: C/T (1,175), with reads split 2 C vs. 2 T at the “consensus base” • 2/7 assembled BACs known to contain NIPs exhibit evidence of NIP collapse (conservative)
Traditional Assembly • Sequence alignments between reads are identified • Construct contigs • Start at a good alignment • Extend ends of contig one sequence at a time • Clone pair information is used to scaffold contigs only after contig construction.
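The contig-construction steps above can be illustrated with a toy greedy extender. This is only a sketch under simplifying assumptions that are not in the slides: reads are error-free strings on one strand, an "alignment" is an exact suffix/prefix overlap, and the names `overlap`, `greedy_contig`, and `min_len` are hypothetical. Real assemblers use scored alignments and handle both strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contig(reads: list[str], min_len: int = 3) -> str:
    """Build one contig: seed at the best overlap, then extend either end."""
    remaining = list(reads)
    # "Start at a good alignment": the ordered pair with the longest overlap.
    k, i, j = max(
        ((overlap(a, b, min_len), i, j)
         for i, a in enumerate(remaining)
         for j, b in enumerate(remaining) if i != j),
        default=(0, -1, -1),
    )
    if k == 0:
        return remaining[0] if remaining else ""
    contig = remaining[i] + remaining[j][k:]
    remaining = [r for n, r in enumerate(remaining) if n not in (i, j)]
    # "Extend ends of contig one sequence at a time."
    while remaining:
        right = max((overlap(contig, r, min_len), n) for n, r in enumerate(remaining))
        left = max((overlap(r, contig, min_len), n) for n, r in enumerate(remaining))
        if max(right[0], left[0]) == 0:
            break  # nothing left that overlaps either end
        if right[0] >= left[0]:
            k, n = right
            contig += remaining.pop(n)[k:]
        else:
            k, n = left
            contig = remaining.pop(n)[:-k] + contig
    return contig


print(greedy_contig(["ACGTAC", "TACGGA", "GGATTC"]))  # ACGTACGGATTC
```

Note that nothing in this loop consults clone pairs: each extension step trusts the best available overlap, which is the weakness in repeats and NIPs that the rest of the talk addresses.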
Our Approach • Integrate clone pair data into the contig assembly process • Model sequence alignments and clone pairs as a graph • First, construct an alignment graph: sequence reads are nodes, and a black edge is drawn between a pair of nodes if there is a valid sequence alignment between them
Clone Pair Informed Assembly • Second, introduce two additional types of edges into the graph: clone pair edges (red) and path edges (green) • A path edge exists between two nodes if they are close together in the graph AND their clone pair mates are also close together • Path edges identify assembly-relevant sequence alignments
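The three edge types can be sketched in a few lines. The slides do not define "close together", so this sketch assumes it means "within `max_hops` alignment (black) edges"; that threshold, the function names `hops` and `path_edges`, and the read names in the example are all hypothetical.

```python
from collections import deque
from itertools import combinations


def hops(adj, src, limit):
    """Breadth-first search: reads within `limit` alignment edges of `src`."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if seen[node] == limit:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


def path_edges(alignments, mates, max_hops=2):
    """Return green (path) edges given black edges and red (clone pair) edges.

    alignments: iterable of (read, read) pairs with a valid alignment.
    mates: symmetric dict mapping each read to its clone pair mate.
    """
    adj = {}
    for u, v in alignments:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    near = {r: hops(adj, r, max_hops) for r in mates}
    green = set()
    for u, v in combinations(sorted(mates), 2):
        if mates[u] == v:
            continue  # a read and its own mate are already joined by a red edge
        # Close in the graph AND their mates are close in the graph.
        if v in near[u] and mates[v] in near[mates[u]]:
            green.add((u, v))
    return green


# a1/a2 overlap and so do their mates b1/b2; x1 also aligns to a1
# (e.g. through a repeat), but its mate y1 lands somewhere unrelated.
alignments = [("a1", "a2"), ("b1", "b2"), ("a1", "x1")]
mates = {"a1": "b1", "b1": "a1", "a2": "b2", "b2": "a2", "x1": "y1", "y1": "x1"}
print(sorted(path_edges(alignments, mates)))  # [('a1', 'a2'), ('b1', 'b2')]
```

In this example the a1 to x1 alignment gets no path edge because the mates b1 and y1 are not near each other, which is the sense in which path edges pick out the assembly-relevant alignments.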
Our Approach • Series of graph transformations to ensure black edges (sequence alignments) represent correct genomic overlaps, and resolve entries into and exits out of repeats. • Use clone pairs to validate alignments in repeat regions if the corresponding mate pairs are anchored to unique regions and exhibit alignment. • Use paramorphisms to break spurious alignments due to NIPs. • Use clone pairs to match entries into and exits out of repeats. • Use clone pairs and validated alignments to guide contigs. • Use graph min-cuts to find correct assignment of reads to the complementary strands. • Use graph reductions and visualization for further analysis.
Example: Use Paramorphisms to Break Spurious Alignments
GTCT A CAG
GTCT A CAG
GTCT A CAG
GTCT C CAG
GTCT C CAG
GTCT C CAG
GTCT C CAG
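The example above can be turned into a small sketch that separates the reads into paralog groups. It assumes the reads are already aligned column by column and that a column counts as paramorphic when at least two different bases are each supported by `min_support` reads; that threshold and the function names are assumptions, since the slides do not say how paramorphisms are told apart from sequencing errors.

```python
from collections import Counter, defaultdict


def paramorphic_columns(reads, min_support=2):
    """Columns where two or more bases each appear in >= min_support reads."""
    cols = []
    for i in range(len(reads[0])):
        counts = Counter(r[i] for r in reads)
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            cols.append(i)
    return cols


def split_by_paramorphisms(reads, min_support=2):
    """Group read indices by their bases at the paramorphic columns.

    Alignments between reads in different groups would be treated as
    spurious (NIP-induced) and removed from the alignment graph.
    """
    cols = paramorphic_columns(reads, min_support)
    groups = defaultdict(list)
    for idx, read in enumerate(reads):
        groups[tuple(read[c] for c in cols)].append(idx)
    return cols, list(groups.values())


reads = ["GTCTACAG"] * 3 + ["GTCTCCAG"] * 4
cols, groups = split_by_paramorphisms(reads)
print(cols)    # [4]
print(groups)  # [[0, 1, 2], [3, 4, 5, 6]]
```

A read with a sequencing error at a paramorphic column would end up in its own group here; a fuller treatment would need to assign such reads to the nearest supported group.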
Three Random “Stage 3” BACs • Shotgun sequences extracted from GenBank and trimmed
273D22 • Annotate paths via walking through the graph. • Make use of three levels of pointers: • Black edges: show what steps are available • Green edges: indicate the best path • Red edges: indicate our final destination
273D22: Incorrect Contigging • Contig 1 is a small contig in the finished BAC that contains sequences that should be attached to the end of Contig 0. [Figure: Contig 0 and Contig 1]
306N19: Mis-assembly [Figure: Contigs 0, 3, 4, and 5]
D396H10: Missed Scaffolding [Figure: Contigs 5, 6, and 8]
D396H10: Missed Scaffolding [Figure: Contigs 2, 3, and 7]
273D22: Weak Link not Corroborated by Clone Pairs [Figure: Contig 3]
Conclusions & Future Directions • Discovered misassembled regions in all three randomly chosen BACs • Conclusions supported by multiple lines of evidence (clone pair + overlap) • Mis-assemblies (e.g., repeat-induced “knots”; collapsed repeats & NIPs) and missed scaffolding • Benefits of our approach • Can provide better assemblies • Can navigate through repeats • Can correctly assemble NIPs • With development could output contigs and perform scaffolding in one step • Could provide refined finishing advice • Could include a community-accessible visualization of assembled BAC contigs and supporting data (confidence levels) • Longer term • Our assembly approach could be applied to whole genome assembly of maize and other complex genomes • Could incorporate paired next-generation sequencing data (e.g., 454, Solexa, SOLiD) • Needed research • Random collection of finished BACs (“truth”) • Develop algorithms for navigating paths through the graph • Accurately construct final contigs that contain multiple copies of repeats • Create BAC re-assembly pipeline (inform finishing efforts in future sequencing projects) • Scale approach to whole genome level