Improving the Accuracy of Genome Assemblies
July 17th, 2012
Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz2 and Pavel Pevzner1
1. University of California, San Diego
2. Wayne State University, Michigan
* Contributed equally to this work
≈ $ thousands, ≈ several weeks, ≈ two people
≈ $ billions, ≈ several years, ≈ hundreds of people
Draft Genome from HTS
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis
HTS assemblies (contigs) still contain an abundance of errors:
• 20-30 substitution errors per 100 kbp with SOAPdenovo.
• 5-20 substitution errors per 100 kbp with Velvet.
• Small (<50 bp) INDEL errors.
• Misassemblies, large INDELs, etc.
Errors in the assembled contigs will profoundly affect any downstream analysis.
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → SEQuel → Refined Contigs → Analysis, Analysis, Analysis
De Bruijn Graph for Fragment Assembly
De Bruijn Graph (Pevzner, Tang, Waterman 2001)
[Figure, five animation frames: the 3-mers GCC, CCA, CCT, CAT, CTA, CTT, ATT, TAT, TTA and TTT drawn as graph nodes, with repeated copies of the same 3-mer progressively merged.]
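The construction on these slides can be sketched in a few lines. This is a minimal illustration, assuming the convention the figures appear to use (k-mers as nodes, an edge between two k-mers that are consecutive in a read); the function name `de_bruijn` and the choice of example reads are assumptions made for this sketch, not part of the talk.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph as an adjacency map.

    Nodes are k-mers. An edge u -> v is added whenever v follows u
    (shifted by one base) somewhere in a read. Identical k-mers from
    different reads collapse into a single node automatically, because
    the k-mer string itself is the dictionary key.
    """
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + k + 1])
    return dict(edges)


# Two short reads; the second copy of each read adds no new nodes or edges.
reads = ["GCCTAGGAC", "CACTTGGCA", "GCCTAGGAC", "CACTTGGCA"]
graph = de_bruijn(reads, k=3)

for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

Because nodes are keyed by sequence alone, two occurrences of the same k-mer from different genomic locations are always glued together.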
[Figure: reads GCCTAGGAC and CACTTGGCA (three copies each) sampled from the genome ...GCCTAGGAC...CACTTGGCA..., and the de Bruijn graph of their 3-mers: GCC, CCT, CTA, TAG, AGG, GGA, GAC and CAC, ACT, CTT, TTG, TGG, GGC, GCA.]
Sequencing errors cause bulges in the de Bruijn graph
[Figure, three frames. Genome: ...GCCTAGGAC...CACTTGGCA... Reads: GCCTTGGAC (one copy, differing from the genome at one base), GCCTAGGAC (two copies), CACTTGGCA (three copies). 3-mer multiplicities: GCC 3, CCT 3, CTA 2, TAG 2, AGG 2, GGA 3, GAC 3, CAC 3, ACT 3, CTT 4, TTG 4, TGG 4, GGC 3, GCA 3; two connecting edges carry multiplicity 1. In the final frame the CTA, TAG, AGG branch is gone and the resulting sequences are ......GCCTTGGAC...... and ......CACTTGGCA......]
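The multiplicities in this example can be reproduced by counting 3-mers over the six reads shown on the slide. The sketch below also compares mean coverage of the two branches of the bulge; picking the branch with higher mean coverage is a toy rule assumed here for illustration, not the bulge-removal procedure of any particular assembler.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Reads from the slide: one read with a substitution, two correct copies,
# and three reads from a second locus that shares the 3-mers CTT, TTG, TGG.
reads = ["GCCTTGGAC"] + ["GCCTAGGAC"] * 2 + ["CACTTGGCA"] * 3
counts = kmer_counts(reads, k=3)

# The two branches of the bulge between CCT and GGA.
genome_branch = ["CTA", "TAG", "AGG"]  # matches the genome
error_branch = ["CTT", "TTG", "TGG"]   # created by the erroneous read


def mean_coverage(path):
    """Average multiplicity of the k-mers along a path."""
    return sum(counts[kmer] for kmer in path) / len(path)


print("genome branch:", mean_coverage(genome_branch))
print("error branch: ", mean_coverage(error_branch))
```

The erroneous branch has the higher coverage because its 3-mers also occur at the second locus, so a purely coverage-based choice keeps the wrong sequence, as in the final frame of the slide.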
The SEQuel Algorithm
Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely.
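The definition above amounts to a simple filter over read-pairs. In the sketch below, the `ReadPair` record and its hit-count fields are hypothetical, introduced only for illustration; a real pipeline would derive the number of alignment locations from the aligner's output.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical record: how many places each read of a pair aligned to."""
    name: str
    hits_first: int
    hits_second: int


def is_permissively_aligned(pair):
    """True if at least one read of the pair aligned to exactly one location."""
    return pair.hits_first == 1 or pair.hits_second == 1


pairs = [
    ReadPair("p1", 1, 1),  # both reads unique
    ReadPair("p2", 1, 5),  # one unique, mate in a repeat
    ReadPair("p3", 0, 1),  # one unique, mate unaligned
    ReadPair("p4", 3, 2),  # both reads ambiguous
    ReadPair("p5", 0, 0),  # neither read aligned
]

kept = [pair.name for pair in pairs if is_permissively_aligned(pair)]
print(kept)
```

Compared with requiring both reads to align uniquely, this criterion also keeps pairs such as p2 and p3.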
Positional De Bruijn Graph
Positional k-mer: a pair (k-mer, position), e.g. (GCCA, 111).
[Figure, three frames: the positional 3-mers (GCC,111), (CCA,112), (CCT,112), (CAT,113), (CTT,113), (ATT,114), (TTT,114), (TTA,115), (GCC,975), (CCT,976), (CTA,977), (TAT,978) and (ATT,979) drawn as graph nodes, with repeated copies progressively merged; the final frame shows only the labels 4, 4, 4, 4.]
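A positional de Bruijn graph can be sketched by keying each node on the pair (k-mer, position) instead of the k-mer alone. The three aligned reads below are reconstructed from the positional 3-mers listed on the slide and are an assumption of this example. The sketch glues nodes only when positions match exactly, which is a simplification assumed here.

```python
from collections import defaultdict


def positional_de_bruijn(aligned_reads, k):
    """Build a graph whose nodes are (k-mer, position) pairs.

    aligned_reads: iterable of (sequence, start) where start is the
    position on the contig at which the read aligned.
    """
    edges = defaultdict(set)
    for seq, start in aligned_reads:
        for i in range(len(seq) - k):
            src = (seq[i:i + k], start + i)
            dst = (seq[i + 1:i + k + 1], start + i + 1)
            edges[src].add(dst)
    return dict(edges)


aligned_reads = [
    ("GCCATTA", 111),
    ("GCCTTTA", 111),  # disagrees with the read above at one base
    ("GCCTATT", 975),  # shares GCC and CCT, but far away on the contig
]
graph = positional_de_bruijn(aligned_reads, k=3)

for node in sorted(graph, key=lambda n: (n[1], n[0])):
    print(node, "->", sorted(graph[node]))
```

Here (GCC,111) and (GCC,975) remain separate nodes, so the branch at position 111 is not entangled with the unrelated sequence at position 975, which an ordinary de Bruijn graph would merge through the shared 3-mers GCC and CCT.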
The SEQuel Algorithm
partial contig #1: GCCATTA
partial contig #2: GCCTATT
Original contig: GTATTCCGAGGACCACTGGATTATGA
The SEQuel Algorithm
Original contig:
GTATTCCGAGGACCACTGGATTATGA
Partial contigs aligned to the original contig:
GTATTCCGAGGACCAC---TGGATTATGA
GCGGGCCGAGGA   CAAATGGATTACGA
After applying the first partial contig:
GCGGGCCGAGGACCAC---TGGATTATGA
After applying the second partial contig (refined contig):
GCGGGCCGAGGACCACAAATGGATTACGA
Repeat for all contigs.
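The refinement step in this example can be reproduced with a small overlay routine. The sketch assumes the alignment is already given (the gapped original contig and the offset of each partial contig in gapped coordinates); the slides do not say how that alignment is computed, and the function name `overlay` and the offsets are assumptions of this illustration.

```python
def overlay(gapped_contig, placements):
    """Overwrite an aligned (gapped) contig with partial contigs.

    placements: list of (offset, partial_contig), where offset is the
    0-based column in the gapped contig at which the partial contig starts.
    Any gap characters left uncovered are removed at the end.
    """
    refined = list(gapped_contig)
    for offset, partial in placements:
        # Same-length slice assignment: one base replaces one column.
        refined[offset:offset + len(partial)] = partial
    return "".join(refined).replace("-", "")


original = "GTATTCCGAGGACCAC---TGGATTATGA"
placements = [
    (0, "GCGGGCCGAGGA"),     # corrects substitutions at the start
    (15, "CAAATGGATTACGA"),  # fills the 3-base gap and fixes one base
]

refined = overlay(original, placements)
print(refined)
```

The printed sequence matches the refined contig shown on the slide; columns not covered by any partial contig keep the bases of the original contig.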
Results
• Standard and Single-Cell E. coli.
• 100 bp paired-end, Illumina (GAII) reads.
• Mean coverage ≈ 600x.
• Assemblies compared to reference with and without SEQuel.
Single Cell Sequencing
[Figure: two panels, "Single Cell" and "Standard".] (Chitsaz et al., 2011)
Summary
• Removed 35% to 96% of small-scale assembly errors.
• Introduced positional de Bruijn graph for contig refinement.
• Demonstrated utility in hard (single-cell) assembly.
• SEQuel can be used in combination with any assembler.
• Freely available at: http://bix.ucsd.edu/SEQuel
Acknowledgments 3P41RR024851-02S1 CCF-1115206