AutoEditor
Automated base caller error correction tool
Slides courtesy of Pawel Gajer, Ph.D.
Base-calling in the context of a single chromatogram is hard… but finding base-calling “mistakes” in a multiple alignment is easy.
AutoEditor
• Principal and secondary aims of AutoEditor
• AutoEditor as a higher-level base caller
• Tiling discrepancy types
• Base caller error types
• Resolving discrepancies of the form B…B*
• Resolving discrepancies of the form *…*B
• AutoEditor statistics
A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types. A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
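As a minimal illustration of this definition, the sketch below scans the columns of a gapped multiple alignment and reports those in which coverage is not homogeneous. It assumes the reads are equal-length, already-aligned strings with `*` as the gap symbol, and it takes the consensus to be the most frequent symbol in a column. The function name `find_tiling_discrepancies` and the majority-vote consensus are illustrative assumptions, not AutoEditor's actual implementation.

```python
from collections import Counter

GAP = "*"  # gap symbol, as used in the slides


def find_tiling_discrepancies(reads):
    """Return (column index, consensus symbol, column string) for every
    column in which the reads do not all agree.

    Assumes `reads` are equal-length gapped strings (a simplification;
    in a real tiling each read covers only part of the contig).
    """
    if len({len(r) for r in reads}) > 1:
        raise ValueError("reads must be aligned to equal length")
    discrepancies = []
    for i, column in enumerate(zip(*reads)):
        counts = Counter(column)
        if len(counts) > 1:  # coverage of this position is not homogeneous
            consensus = counts.most_common(1)[0][0]  # majority vote (assumption)
            discrepancies.append((i, consensus, "".join(column)))
    return discrepancies


# Toy tiling: one read lacks a T at column 3, another has an extra A at column 5.
reads = [
    "ACGTA*CG",
    "ACGTA*CG",
    "ACG*A*CG",
    "ACGTAACG",
]
for pos, cons, col in find_tiling_discrepancies(reads):
    print(pos, cons, col)
```

In this toy tiling, column 3 has a gap in one read under a consensus base, and column 5 has a base in one read under a consensus gap.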
AutoEditor as a higher-level base caller
[Diagram labels: single read trace data; base caller; nucleotide sequence; tiling of reads; tiling discrepancies; multiple read trace data; AutoEditor; list of corrected discrepancies]
Other applications:
• Clear range editing (read expansion)
• SNP detection
Clear range editing
[Diagram labels: single read quality values data; trimming algorithm; trimmed read; less stringently trimmed reads; assembler; tiling of reads; AutoEditor]
SNP detection
[Diagram labels: alignment data of genome 1; alignment data of genome 2; combined genomes alignment data; list of putative SNPs; AutoEditor; list of putative SNPs that pass AutoEditor error screening]
Tiling discrepancy types
• Single deletion
• Single insertion
Single insertion and single deletion are extreme cases of insertion/deletion discrepancies. Reading the grid below column by column (one row per read), the columns run from all bases to all gaps:

A A A A *
A A A * *
A A * * *
A * * * *

The above sequence of discrepancies can be represented schematically as an edge in a two-vertex graph: A ---- *
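The positions along the A/* edge can be told apart by counting gaps in a column. The sketch below is one way to do that. It assumes that a single deletion is a column with exactly one gap among bases and a single insertion is a column with exactly one base among gaps; that reading of the slides, and the function name `classify_indel_column`, are assumptions. It is only unambiguous for columns covered by at least three reads.

```python
GAP = "*"  # gap symbol, as used in the slides


def classify_indel_column(column):
    """Classify a column that contains one base type and/or gaps.

    Returns 'homogeneous', 'single deletion', 'single insertion',
    or 'insertion/deletion' for the intermediate mixtures.
    """
    bases = set(column) - {GAP}
    if len(bases) > 1:
        raise ValueError("column mixes several bases; not an indel-type column")
    n_gaps = column.count(GAP)
    n_bases = len(column) - n_gaps
    if n_gaps == 0 or n_bases == 0:
        return "homogeneous"          # an endpoint (vertex) of the edge
    if n_gaps == 1:
        return "single deletion"      # one read lacks the consensus base
    if n_bases == 1:
        return "single insertion"     # one read carries an extra base
    return "insertion/deletion"       # interior of the edge


for col in ["AAAA", "AAA*", "AA**", "A***", "****"]:
    print(col, "->", classify_indel_column(col))
```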
The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex with vertices A, C, G, T and *.
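One way to make the simplex picture concrete is to treat the fractions of A, C, G, T and * in a column as barycentric coordinates. A homogeneous column then sits on a vertex, and an insertion/deletion column sits on an edge ending at *. This interpretation and the helper names below are assumptions made for illustration; the slides give only the schematic.

```python
from collections import Counter

VERTICES = ("A", "C", "G", "T", "*")  # five vertices of a 4-dimensional simplex


def column_composition(column):
    """Return the fractions of A, C, G, T, * in a column.

    The fractions are non-negative and sum to 1, so they can be read as
    a point of the simplex spanned by VERTICES.
    """
    if not column:
        raise ValueError("empty column")
    counts = Counter(column)
    unknown = set(counts) - set(VERTICES)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return tuple(counts[v] / len(column) for v in VERTICES)


def face_dimension(column):
    """Dimension of the smallest face containing the column's point:
    0 for a vertex (homogeneous column), 1 for an edge, and so on."""
    return sum(1 for x in column_composition(column) if x > 0) - 1


print(column_composition("AAA*"), face_dimension("AAA*"))
print(column_composition("AAAA"), face_dimension("AAAA"))
```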
Re-calling individual bases
[Figure: signal curve with labeled peak parameters: (a) amplitude; (b) support; (c) minimum difference between amplitude and local minimum]
Black dots on the signal curve indicate local maxima and open circles indicate local minima.
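The figure names three peak parameters without defining them. The sketch below shows one plausible way to compute them from a sampled trace, under assumptions that are not from the slides: "support" is taken as the distance between the local minima flanking a peak (or the ends of the trace), and the "minimum difference" as the peak amplitude minus the higher of the two flanking minima. The function names are hypothetical.

```python
def local_extrema(signal):
    """Return (maxima, minima): indices of strict interior local maxima
    and local minima of a sampled signal."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] > signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] < signal[i + 1]:
            minima.append(i)
    return maxima, minima


def peak_parameters(signal, peak):
    """Return (amplitude, support, min_drop) for the local maximum at `peak`.

    support  = distance between the flanking local minima (assumed definition)
    min_drop = amplitude minus the higher flanking minimum (assumed definition)
    """
    _, minima = local_extrema(signal)
    left = max((m for m in minima if m < peak), default=0)
    right = min((m for m in minima if m > peak), default=len(signal) - 1)
    amplitude = signal[peak]
    support = right - left
    min_drop = amplitude - max(signal[left], signal[right])
    return amplitude, support, min_drop


trace = [0, 2, 9, 3, 1, 4, 12, 5, 2, 6, 3, 0]  # made-up sample values
maxima, minima = local_extrema(trace)
print("maxima:", maxima, "minima:", minima)
for p in maxima:
    print(p, peak_parameters(trace, p))
```

A re-caller could then accept a peak as a base only if all three values fall inside chosen ranges; the thresholds themselves are not given in the slides.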
Base caller error types
• Missed signal
• Signal shift
• Unresolved peaks
Resolving a single deletion discrepancy
compute the discrepancy’s read multiplicity, mult
if mult = 0 then check for a missed signal error
if |mult| > 0 then check for a signal shift error
    if it is not a signal shift error then it is an unresolved peaks error
    to resolve it, find two other reads with well-resolved peaks over the unresolved-peaks bases
A discrepancy’s read multiplicity is the number of bases to the right or left (negative sign) of the discrepancy position that are equal to the consensus base covering the discrepancy.
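The steps above can be sketched as follows. `read_multiplicity` counts the contiguous run of the consensus base next to the discrepancy position, using the right-hand run if there is one and otherwise the left-hand run with a negative sign. The slides do not say which side wins when both are present, so that tie-break is an assumption. `classify_single_deletion` only mirrors the branching. The trace-level checks (`missed_signal`, `signal_shift`, `resolved_reads`) are passed in as already-computed values and are hypothetical placeholders for analyses the slides do not spell out.

```python
def read_multiplicity(read, pos, consensus_base):
    """Signed length of the run of `consensus_base` adjacent to `pos`.

    Positive: run immediately to the right of `pos`.
    Negative: run immediately to the left (used only if no right run).
    """
    right = 0
    i = pos + 1
    while i < len(read) and read[i] == consensus_base:
        right += 1
        i += 1
    if right:
        return right
    left = 0
    i = pos - 1
    while i >= 0 and read[i] == consensus_base:
        left += 1
        i -= 1
    return -left


def classify_single_deletion(mult, missed_signal, signal_shift, resolved_reads):
    """Follow the slide's decision steps for a single deletion.

    missed_signal, signal_shift: booleans from trace checks (hypothetical)
    resolved_reads: number of other reads with well-resolved peaks over
                    the bases in question (hypothetical input)
    """
    if mult == 0:
        return "missed signal" if missed_signal else "not corrected"
    if signal_shift:
        return "signal shift"
    # Not a signal shift, so by the slide's rule it is an unresolved-peaks
    # error, resolvable only with two other well-resolved reads.
    return "unresolved peaks" if resolved_reads >= 2 else "not corrected"


print(read_multiplicity("GAA*G", 3, "A"))   # two A's to the left of the gap
print(read_multiplicity("GT*AAC", 2, "A"))  # two A's to the right of the gap
print(classify_single_deletion(-2, False, False, 2))
```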
Resolving a single insertion discrepancy
compute the discrepancy’s read multiplicity, mult
if mult = 0 then check if the signal parameters are within allowable ranges
if |mult| > 0 then check if the insertion base is part of |mult| + 1 well-resolved signal peaks
    if not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position
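A corresponding sketch for the insertion case is given below. The slides state the checks but not the outcome labels. Calling an out-of-range signal a "weak signal error", and requiring two supporting reads before calling an "unresolved peaks error", is inferred from the example captions and should be read as an assumption. The peak counts are hypothetical inputs that a trace analysis would supply.

```python
def classify_single_insertion(mult, signal_in_range, own_resolved_peaks,
                              other_reads_peaks):
    """Follow the slide's decision steps for a single insertion.

    mult:               read multiplicity of the discrepancy
    signal_in_range:    True if the inserted base's signal parameters are
                        within allowable ranges (hypothetical check)
    own_resolved_peaks: number of well-resolved peaks in this read's trace
                        over the run containing the inserted base
    other_reads_peaks:  well-resolved peak counts, one per other read,
                        between the bases flanking the discrepancy
    """
    k = abs(mult)
    if k == 0:
        # Isolated extra base: judge it by its own signal.
        return "not corrected" if signal_in_range else "weak signal error"
    if own_resolved_peaks == k + 1:
        # The read's own trace supports the extra base.
        return "not corrected"
    supporting = sum(1 for n in other_reads_peaks if n == k)
    return "unresolved peaks error" if supporting >= 2 else "not corrected"


# mult = -2: the read shows one base more than the consensus run of two,
# its own peaks are not resolved, and two other reads show exactly two peaks.
print(classify_single_insertion(-2, True, 2, [2, 2, 3]))
```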
[Figure: mult = 0, weak signal error]
[Figure: mult = -2, unresolved peaks error, with two other reads with exactly 2 signal peaks between the Gs flanking AA*]
Missed-signal (MS) and signal shift (SS) correction errors
AutoEditor version 1.1 from Nov 12, 2002
Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size (kb)  # corrections  # autoEdit errors  # errors in newer autoEdit
1         132        124            3                  0
2         64         78             4                  1
3         40         55             3                  0
4         53         45             2                  1
5         16         15             0                  0
6         22         29             1                  0
7         23         19             0                  0
8         51         48             1                  0
9         26         33             1                  0
10        15         15             0                  0
----------------------------------------------------------------------
Total:    442        461            15                 2
                                    ~3.25%             ~0.43%
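The percentages in the last row are the error counts divided by the total number of corrections. The short check below recomputes them from the per-contig rows of the table; the variable names are illustrative.

```python
# (asmbl_id, size_kb, corrections, autoedit_errors, errors_in_newer_autoedit)
rows = [
    (1, 132, 124, 3, 0),
    (2, 64, 78, 4, 1),
    (3, 40, 55, 3, 0),
    (4, 53, 45, 2, 1),
    (5, 16, 15, 0, 0),
    (6, 22, 29, 1, 0),
    (7, 23, 19, 0, 0),
    (8, 51, 48, 1, 0),
    (9, 26, 33, 1, 0),
    (10, 15, 15, 0, 0),
]

total_size = sum(r[1] for r in rows)
total_corrections = sum(r[2] for r in rows)
total_errors = sum(r[3] for r in rows)
total_errors_newer = sum(r[4] for r in rows)

# Error rate = erroneous corrections / all corrections, as a percentage.
rate = 100 * total_errors / total_corrections
rate_newer = 100 * total_errors_newer / total_corrections

print(total_size, total_corrections, total_errors, total_errors_newer)
print(f"{rate:.2f}% {rate_newer:.2f}%")
```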
AutoEditor version 1.2 correcting all single deletion errors
Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size (kb)  #disc   #corr   %corr
1         132        3390    3266    96%
2         64         2195    2142    98%
3         40         1344    1325    99%
4         53         1304    1242    95%
5         16         508     487     96%
6         22         777     757     97%
7         23         624     613     98%
8         51         1303    1232    95%
9         26         783     760     97%
10        15         437     423     97%
--------------------------------------------------------------------
Total:    442        12665   12065   95%

where
#disc is the total number of discrepancies in the given contig
#corr is the number of corrected discrepancies
%corr is the percentage of corrected discrepancies
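The %corr column is #corr divided by #disc, rounded to the nearest percent. The sketch below recomputes it per contig from the rows above. Note that the #corr rows as transcribed here sum to 12247, while the Total line states 12065; the sketch does not attempt to reconcile the two and only checks the per-contig values and the #disc total.

```python
# (asmbl_id, discrepancies, corrected) taken from the table rows
rows = [
    (1, 3390, 3266),
    (2, 2195, 2142),
    (3, 1344, 1325),
    (4, 1304, 1242),
    (5, 508, 487),
    (6, 777, 757),
    (7, 624, 613),
    (8, 1303, 1232),
    (9, 783, 760),
    (10, 437, 423),
]

# Percentage of corrected discrepancies per contig, rounded to whole percent.
percent_corrected = [round(100 * corr / disc) for _, disc, corr in rows]
total_disc = sum(disc for _, disc, _ in rows)

for (asmbl_id, disc, corr), pct in zip(rows, percent_corrected):
    print(f"{asmbl_id:>2} {disc:>5} {corr:>5} {pct}%")
print("total #disc:", total_disc)
```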
AutoEditor accuracy